The Iterative Residual Rescaling algorithm: An analysis and generalization of Latent Semantic Indexing

Professor Lillian Lee

Using vector-based representations of document collections has enabled the application of powerful dimension-reduction techniques to information retrieval, document clustering, and other text analysis tasks. One of the most prominent of these techniques is Latent Semantic Indexing (LSI). However, despite ample empirical experience with it, there is still little understanding of when LSI can -- and, just as importantly, cannot -- be expected to perform well.

This talk consists of two parts. First, after a self-contained introduction to LSI, we provide a novel formal analysis of it that links its performance in a precise way to the uniformity of the underlying topic-document distribution. Second, we present a new algorithm, Iterative Residual Rescaling (IRR), that corrects for skewed distributions by automatically adjusting to non-uniformity in the topic distribution without knowledge of the underlying topics. A series of experiments over a variety of evaluation metrics validates our analysis and demonstrates the effectiveness of IRR.
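As a rough illustration of the two ideas named above, the following sketch computes an LSI basis by truncated SVD and then a residual-rescaling variant. It is a minimal sketch under stated assumptions, not the implementation presented in the talk: the function names `lsi_basis` and `irr_basis`, the scaling exponent `q`, and the toy term-document matrix are all hypothetical, and the rescaling rule (weight each residual document vector by its length raised to the power `q`) is assumed from the general description of the approach.

```python
import numpy as np

def lsi_basis(D, k):
    """Return the top-k left singular vectors of D (terms x documents)."""
    U, _, _ = np.linalg.svd(D, full_matrices=False)
    return U[:, :k]

def irr_basis(D, k, q):
    """Residual-rescaling sketch (assumed form, not the authors' code).

    At each step, every residual document vector is scaled by its own
    length raised to the power q, the dominant direction of the scaled
    residuals is taken as the next basis vector, and that direction is
    projected out of the unscaled residuals. With q = 0 no rescaling
    happens and the result spans the same subspace as lsi_basis.
    """
    R = D.astype(float).copy()
    basis = []
    for _ in range(k):
        norms = np.linalg.norm(R, axis=0)      # length of each residual column
        Rs = R * norms**q                      # rescale columns (assumed rule)
        U, _, _ = np.linalg.svd(Rs, full_matrices=False)
        b = U[:, 0]                            # dominant direction
        basis.append(b)
        R = R - np.outer(b, b @ R)             # remove that direction
    return np.column_stack(basis)

# Toy collection: three documents on one topic, a single document on another.
D = np.array([[2.0, 2.0, 2.0, 0.0],
              [1.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0, 1.0]])

B_lsi = lsi_basis(D, 2)
B_irr = irr_basis(D, 2, q=2)
```

Setting `q = 0` gives a quick sanity check, since the sketch should then recover the same subspace as plain LSI. Larger values of `q` give more weight to documents that the current basis represents poorly, which is the intuition behind correcting for a skewed topic distribution.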

Joint work with Rie Kubota Ando.


