# Latent Semantic Indexing

*Introduction to Information Retrieval*, Ch. 18

## Today's topic: Latent Semantic Indexing

- Term-document matrices are very large.
- But the number of topics that people talk about is small (in some sense):
  - clothes, movies, politics, …
- Can we represent the term-document space by a lower-dimensional latent space?

## Linear Algebra Background

### Eigenvalues & Eigenvectors (Sec. 18.1)

- Eigenvectors (for a square $m \times m$ matrix $S$): a non-zero vector $v$ with
  $$Sv = \lambda v$$
  is a (right) eigenvector, and the scalar $\lambda$ is the corresponding eigenvalue.
- How many eigenvalues are there at most?
  - $Sv = \lambda v \iff (S - \lambda I)v = 0$, which only has a non-zero solution if $\det(S - \lambda I) = 0$.
  - This is an $m$th-order equation in $\lambda$, which can have at most $m$ distinct solutions (the roots of the characteristic polynomial); they can be complex even though $S$ is real.

### Matrix-vector multiplication (Sec. 18.1)

- The matrix
  $$S = \begin{pmatrix} 30 & 0 & 0 \\ 0 & 20 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$
  has eigenvalues 30, 20, 1 with corresponding eigenvectors $v_1 = (1,0,0)^T$, $v_2 = (0,1,0)^T$, $v_3 = (0,0,1)^T$.
- On each eigenvector, $S$ acts as a multiple of the identity matrix, but as a different multiple on each.
- Any vector (say $x = (2,4,6)^T$) can be viewed as a combination of the eigenvectors: $x = 2v_1 + 4v_2 + 6v_3$.
- Thus a matrix-vector multiplication such as $Sx$ ($S$, $x$ as above) can be rewritten in terms of the eigenvalues/eigenvectors:
  $$Sx = S(2v_1 + 4v_2 + 6v_3) = 2 \cdot 30\, v_1 + 4 \cdot 20\, v_2 + 6 \cdot 1\, v_3 = 60 v_1 + 80 v_2 + 6 v_3$$
- Even though $x$ is an arbitrary vector, the action of $S$ on $x$ is determined by the eigenvalues/eigenvectors.
- Suggestion: the effect of "small" eigenvalues is small.
  - If we ignored the smallest eigenvalue (1), then instead of $(60, 80, 6)^T$ we would get $(60, 80, 0)^T$.
  - These vectors are similar (in cosine similarity, etc.).

### Eigenvalues & Eigenvectors of symmetric matrices (Sec. 18.1)

- For symmetric matrices, eigenvectors for distinct eigenvalues are orthogonal.
- All eigenvalues of a real symmetric matrix are real.
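A minimal numpy sketch of this example, assuming $S$ is the diagonal matrix with entries 30, 20, 1 (which matches the stated eigenvalues and the expansion $x = 2v_1 + 4v_2 + 6v_3$):

```python
import numpy as np

# Assumed form of the slides' example matrix: diagonal with the stated
# eigenvalues 30, 20, 1, so the standard basis vectors are eigenvectors.
S = np.diag([30.0, 20.0, 1.0])

x = np.array([2.0, 4.0, 6.0])            # x = 2*v1 + 4*v2 + 6*v3

# S acts on x through its eigenvalues: Sx = 2*30*v1 + 4*20*v2 + 6*1*v3
Sx = S @ x
print(Sx)                                 # [60. 80.  6.]

# Ignoring the smallest eigenvalue (1) barely changes the result:
approx = np.diag([30.0, 20.0, 0.0]) @ x   # [60. 80.  0.]

# The two vectors are nearly identical in cosine similarity:
cos = (approx @ Sx) / (np.linalg.norm(approx) * np.linalg.norm(Sx))
print(round(cos, 4))                      # 0.9982
```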
- All eigenvalues of a positive semidefinite matrix are non-negative.

### Example (Sec. 18.1)

- Let
  $$S = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$$
  Real, symmetric.
- Then $\det(S - \lambda I) = (2-\lambda)^2 - 1 = 0$.
- The eigenvalues are 1 and 3 (non-negative, real). Plug in these values and solve for the eigenvectors.
- The eigenvectors $(1,-1)^T$ and $(1,1)^T$ are orthogonal (and real).

### Eigen/diagonal Decomposition (Sec. 18.1)

- Let $S$ be a square matrix with $m$ linearly independent eigenvectors (a "non-defective" matrix).
- Theorem: there exists an eigen decomposition
  $$S = U \Lambda U^{-1}$$
  where $\Lambda$ is diagonal (cf. the matrix diagonalization theorem).
- The diagonal $\Lambda$ is unique (up to ordering) for distinct eigenvalues.
- The columns of $U$ are the eigenvectors of $S$.
- The diagonal elements of $\Lambda$ are the eigenvalues of $S$.

### Diagonal decomposition: why/how (Sec. 18.1)

- Let $U$ have the eigenvectors as columns: $U = (v_1 \;\cdots\; v_m)$.
- Then $SU = (Sv_1 \;\cdots\; Sv_m) = (\lambda_1 v_1 \;\cdots\; \lambda_m v_m) = U\Lambda$.
- Thus $SU = U\Lambda$, or $U^{-1}SU = \Lambda$.
- And $S = U \Lambda U^{-1}$.

### Diagonal decomposition: example (Sec. 18.1)

- Recall $S = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$ with eigenvalues 1 and 3.
- The eigenvectors $(1,-1)^T$ and $(1,1)^T$ form $U = \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix}$.
- Inverting, we have $U^{-1} = \frac{1}{2}\begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}$ (so that $UU^{-1} = I$).
- Then
  $$S = U \Lambda U^{-1} = \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 0 & 3 \end{pmatrix}\frac{1}{2}\begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}$$

### Example continued (Sec. 18.1)

- Let's divide $U$ (and multiply $U^{-1}$) by $\sqrt{2}$.
- Then $S = Q \Lambda Q^T$, where $Q^{-1} = Q^T$. Why? Stay tuned …

### Symmetric Eigen Decomposition (Sec. 18.1)

- If $S$ is a symmetric matrix:
- Theorem: there exists a (unique) eigen decomposition $S = Q \Lambda Q^T$,
- where $Q$ is orthogonal: $Q^{-1} = Q^T$.
- The columns of $Q$ are normalized eigenvectors.
- The columns are orthogonal.
- (Everything is real.)

### Exercise (Sec. 18.1)

- Examine the symmetric eigen decomposition, if any, for each of the following matrices: …

## Time out

- "I came to this class to learn about text retrieval and mining, not to have my linear algebra past dredged up again …"
- But if you want to dredge, Strang's *Applied Mathematics* is a good place to start.
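The symmetric eigen decomposition $S = Q\Lambda Q^T$ can be checked numerically; `numpy.linalg.eigh` returns exactly this form (real eigenvalues in ascending order, orthonormal eigenvector columns):

```python
import numpy as np

# The 2x2 symmetric example from the slides: eigenvalues 1 and 3,
# orthogonal eigenvectors proportional to (1,-1) and (1,1).
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])

lam, Q = np.linalg.eigh(S)        # symmetric-matrix routine
print(lam)                        # [1. 3.]

# Q is orthogonal, so Q^{-1} = Q^T ...
print(np.allclose(Q.T @ Q, np.eye(2)))   # True

# ... and S = Q diag(lam) Q^T reassembles the original matrix:
S_rebuilt = Q @ np.diag(lam) @ Q.T
print(np.allclose(S_rebuilt, S))         # True
```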
## What do these matrices have to do with text?

- Recall $M \times N$ term-document matrices …
- But everything so far needs square matrices, so …

## Similarity → Clustering

- We can compute the similarity between two document vector representations $x_i$ and $x_j$ by $x_i^T x_j$.
- Let $X = (x_1 \;\cdots\; x_N)$.
- Then $XX^T$ is a matrix of similarities.
- $XX^T$ is symmetric.
- So $XX^T = Q \Lambda Q^T$.
- So we can decompose this similarity space into a set of orthonormal basis vectors (given in $Q$) scaled by the eigenvalues in $\Lambda$.
- This leads to PCA (Principal Components Analysis).

### Singular Value Decomposition (Sec. 18.2)

- For an $M \times N$ matrix $A$ of rank $r$ there exists a factorization (Singular Value Decomposition = SVD) as follows:
  $$A = U \Sigma V^T$$
  where $U$ is $M \times M$, $\Sigma$ is $M \times N$, and $V$ is $N \times N$. (Not proven here.)

### Singular Value Decomposition (Sec. 18.2)

- $AA^T = Q \Lambda Q^T$
- $AA^T = (U\Sigma V^T)(U\Sigma V^T)^T = (U\Sigma V^T)(V\Sigma^T U^T) = U\Sigma\Sigma^T U^T$
- The columns of $U$ are orthogonal eigenvectors of $AA^T$.
- The columns of $V$ are orthogonal eigenvectors of $A^T A$.
- The eigenvalues $\lambda_1 \geq \cdots \geq \lambda_r$ of $AA^T$ are also the eigenvalues of $A^T A$.
- The singular values are $\sigma_i = \sqrt{\lambda_i}$.

### SVD dimensions and sparseness (Sec. 18.2)

- (Figure: illustration of SVD dimensions and sparseness.)

### SVD example (Sec. 18.2)

- Let $A$ be the $3 \times 2$ example matrix from the slide (thus $M = 3$, $N = 2$); its SVD is worked out there.
- Typically, the singular values are arranged in decreasing order.

### Low-rank Approximation (Sec. 18.3)

- SVD can be used to compute optimal low-rank approximations.
- Approximation problem: find a matrix $A_k$ of rank $k$ such that
  $$A_k = \arg\min_{X :\, \operatorname{rank}(X) = k} \|A - X\|_F$$
  where $\|\cdot\|_F$ denotes the Frobenius norm. $A_k$ and $X$ are both $M \times N$ matrices.
- Typically, we want $k \ll r$.

### Low-rank Approximation (Sec. 18.3)

- Solution via SVD: set the smallest $r - k$ singular values to zero:
  $$A_k = U \operatorname{diag}(\sigma_1, \ldots, \sigma_k, 0, \ldots, 0)\, V^T$$
- In column notation, a sum of $k$ rank-1 matrices:
  $$A_k = \sum_{i=1}^{k} \sigma_i\, u_i v_i^T$$
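These SVD properties can be sketched in numpy; the small $3 \times 2$ matrix below is a stand-in chosen here, not necessarily the slide's exact example:

```python
import numpy as np

# A toy 3x2 "term-document" matrix (an illustrative stand-in).
A = np.array([[1.0, -1.0],
              [0.0,  1.0],
              [1.0,  0.0]])

U, sigma, Vt = np.linalg.svd(A)           # A = U @ Sigma @ Vt
print(sigma)                               # decreasing: [sqrt(3), 1]

# Singular values are square roots of the eigenvalues of A^T A:
lam = np.linalg.eigvalsh(A.T @ A)[::-1]    # eigenvalues, decreasing
print(np.allclose(sigma, np.sqrt(lam)))    # True

# Reassemble A from the factorization (Sigma padded to M x N):
Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, sigma)
print(np.allclose(U @ Sigma @ Vt, A))      # True
```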
### Reduced SVD (Sec. 18.3)

- If we retain only $k$ singular values, and set the rest to 0, then we don't need the corresponding parts of the matrices.
- Then $\Sigma$ is $k \times k$, $U$ is $M \times k$, $V^T$ is $k \times N$, and $A_k$ is $M \times N$.
- This is referred to as the reduced SVD.
- It is the convenient (space-saving) and usual form for computational applications.
- It's what Matlab gives you.

### Approximation error (Sec. 18.3)

- How good (bad) is this approximation?
- It's the best possible, as measured by the Frobenius norm of the error:
  $$\min_{X :\, \operatorname{rank}(X) = k} \|A - X\|_F = \|A - A_k\|_F = \sqrt{\sum_{i=k+1}^{r} \sigma_i^2}$$
  where the $\sigma_i$ are ordered such that $\sigma_i \geq \sigma_{i+1}$.
- This suggests why the Frobenius error drops as $k$ is increased.

### SVD low-rank approximation (Sec. 18.3)

- Whereas the term-document matrix $A$ may have $M = 50{,}000$, $N = 10$ million (and rank close to 50,000),
- we can construct an approximation $A_{100}$ with rank 100.
- Of all rank-100 matrices, it would have the lowest Frobenius error.
- Great … but why would we?
- Answer: Latent Semantic Indexing.

C. Eckart, G. Young. The approximation of a matrix by another of lower rank. Psychometrika, 1, 211-218, 1936.

## Latent Semantic Indexing via the SVD

### What it is (Sec. 18.4)

- From the term-document matrix $A$, we compute the approximation $A_k$.
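The rank-$k$ truncation and its Frobenius error can be checked numerically (a random matrix serves as a stand-in for a term-document matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 5))   # stand-in for a term-document matrix
k = 2

# Reduced SVD: keep only the k largest singular values/vectors.
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]   # rank-k approximation

# Eckart-Young: the Frobenius error equals the square root of the sum
# of the squared discarded singular values.
err = np.linalg.norm(A - A_k, "fro")
print(np.isclose(err, np.sqrt(np.sum(sigma[k:] ** 2))))   # True
```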
- There is a row for each term and a column for each doc in $A_k$.
- Thus docs live in a space of $k \ll r$ dimensions.
- These dimensions are not the original axes.
- But why?

## Vector Space Model: Pros

- Automatic selection of index terms
- Partial matching of queries and documents (dealing with the case where no document contains all search terms)
- Ranking according to similarity score (dealing with large result sets)
- Term weighting schemes (improves retrieval performance)
- Various extensions
  - Document clustering
  - Relevance feedback (modifying the query vector)
- Geometric foundation

## Problems with Lexical Semantics

- Ambiguity and association in natural language
  - Polysemy: words often have a multitude of meanings and different types of usage (more severe in very heterogeneous collections).
  - The vector space model is unable to discriminate between different meanings of the same word.
- Synonymy: different terms may have an identical or a similar meaning (weaker: words indicating the same topic).
  - No associations between words are made in the vector space representation.

## Polysemy and Context

- Document similarity on the single-word level runs into polysemy and context.
- (Figure: *saturn* in its 1st meaning, a planet, occurs with *jupiter*, *space*, *voyager*, *ring*, *planet*; in its 2nd meaning, a car, with *car*, *company*, *dodge*, *ford*. A shared term contributes to similarity if used in the 1st meaning, but not if used in the 2nd.)

## Latent Semantic Indexing (LSI) (Sec. 18.4)

- Perform a low-rank approximation of the document-term matrix (typical rank 100–300).
- General idea:
  - Map documents (and terms) to a low-dimensional representation.
  - Design the mapping such that the low-dimensional space reflects semantic associations (latent semantic space).
  - Compute document similarity based on the inner product in this latent semantic space.
### Goals of LSI (Sec. 18.4)

- LSI takes documents that are semantically similar (i.e., talk about the same topics), but are not similar in the vector space (because they use different words), and re-represents them in a reduced vector space in which they have higher similarity.
- Similar terms map to similar locations in the low-dimensional space.
- Noise reduction by dimension reduction.

### Latent Semantic Analysis (Sec. 18.4)

- (Figure: latent semantic space, illustrating example courtesy of Susan Dumais.)

### Performing the maps (Sec. 18.4)

- Each row and column of $A$ gets mapped into the $k$-dimensional LSI space by the SVD.
- Claim: this is not only the mapping with the best (Frobenius-error) approximation to $A$, but it in fact improves retrieval.
- A query $q$ is also mapped into this space, by
  $$q_k = \Sigma_k^{-1} U_k^T q$$
- The mapped query is NOT a sparse vector.

### LSA Example

- A simple example term-document matrix (binary).
- (Figures: the matrices $U$, $\Sigma$, and $V^T$ of the decomposition $C = U\Sigma V^T$, and the effect of reducing the dimension.)
- Original matrix $C$ vs. reduced $C_2 = U\Sigma_2 V^T$.

### Why the reduced-dimension matrix is better

- Similarity of $d_2$ and $d_3$ in the original space: 0.
- Similarity of $d_2$ and $d_3$ in the reduced space:
  $$0.52 \cdot 0.28 + 0.36 \cdot 0.16 + 0.72 \cdot 0.36 + 0.12 \cdot 0.20 + (-0.39) \cdot (-0.08) \approx 0.52$$
- Typically, LSA increases recall and hurts precision.
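The whole pipeline can be sketched in a few lines. The binary matrix below is in the spirit of the slides' example (the exact matrix is an assumption here): $d_2$ and $d_3$ share no terms, yet gain positive similarity in the reduced space.

```python
import numpy as np

# A small binary term-document matrix (assumed for illustration).
#               d1  d2  d3  d4  d5  d6
C = np.array([[1., 0., 1., 0., 0., 0.],   # ship
              [0., 1., 0., 0., 0., 0.],   # boat
              [1., 1., 0., 0., 0., 0.],   # ocean
              [1., 0., 0., 1., 1., 0.],   # voyage
              [0., 0., 0., 1., 0., 1.]])  # trip

k = 2
U, sigma, Vt = np.linalg.svd(C, full_matrices=False)
U_k, Sig_k = U[:, :k], np.diag(sigma[:k])
C_k = U_k @ Sig_k @ Vt[:k, :]             # rank-k approximation

# Zero similarity in the original space, positive in the reduced one:
print(C[:, 1] @ C[:, 2])                  # 0.0
print(C_k[:, 1] @ C_k[:, 2] > 0)          # True for this matrix

# Folding a query into the LSI space: q_k = Sigma_k^{-1} U_k^T q.
q = np.zeros(5); q[0] = 1.0               # query: "ship"
q_k = np.linalg.inv(Sig_k) @ U_k.T @ q    # a dense k-dimensional vector
print(q_k.shape)                          # (2,)
```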
### Empirical evidence (Sec. 18.4)

- Experiments on TREC 1/2/3 (Dumais).
- Lanczos SVD code (available on netlib), due to Berry, used in these experiments.
  - Running times of one day on tens of thousands of docs were still an obstacle to use.
- Dimensions: various values 250–350 reported. Reducing $k$ improves recall.
  - (Under 200 reported unsatisfactory.)
- Generally expect recall to improve. What about precision?

### Empirical evidence (Sec. 18.4)

- Precision at or above median TREC precision.
  - Top scorer on almost 20% of TREC topics.
- Slightly better on average than straight vector spaces.
- Effect of dimensionality:

  | Dimensions | Precision |
  |-----------:|----------:|
  | 250 | 0.367 |
  | 300 | 0.371 |
  | 346 | 0.374 |

### Failure modes (Sec. 18.4)

- Negated phrases
  - TREC topics sometimes negate certain query terms/phrases; this precludes simple automatic conversion of topics to the latent semantic space.
- Boolean queries
  - As usual, the free-text/vector-space syntax of LSI queries precludes (say) "Find any doc having to do with the following 5 companies".
- See Dumais for more.

### But why is this clustering? (Sec. 18.4)

- We've talked about docs, queries, retrieval, and precision here.
- What does this have to do with clustering?
- Intuition: dimension reduction through LSI brings together "related" axes in the vector space.

## Simplistic picture

- (Figure: documents grouped around Topic 1, Topic 2, Topic 3.)

## Some wild extrapolation

- The "dimensionality" of a corpus is the number of distinct topics represented in it.
- More mathematical wild extrapolation:
  - If $A$ has a rank-$k$ approximation of low Frobenius error, then there are no more than $k$ distinct topics in the corpus.

## LSI has many other applications

- In many settings in pattern recognition and retrieval, we have a feature-object matrix.
- For text, the terms are features and the docs are objects.
- The objects could also be opinions and users …
- This matrix may be redundant in dimensionality.
- We can work with a low-rank approximation.
- If entries are missing (e.g., users' opinions), they can be recovered if the dimensionality is low.
- A powerful general analytical technique; a close, principled analog to clustering methods.

## Resources

- IIR Chapter 18
- Scott Deerwester, Susan Dumais, George Furnas, Thomas Landauer, Richard Harshman. 1990. Indexing by latent semantic analysis. JASIS 41(6):391-407.