Large Scale Machine Learning and Other Animals: LSI vs. SVD

Thursday, April 12, 2012

LSI vs. SVD

Here is a question about LSI I got from Ashish Dhamelia from capitalnovus.com:

I am new to LSI.
I have read few papers about LSI and they all suggest using SVD.
Our data are mostly emails and few office documents as well.
We do get lots of data and we index using Lucene/Solr.

I started with SemanticVectors. It basically performs SVD on lucene index ( We can view lucene index as Term X Document matrix with element being TF-IDF score) So our input matrix is huge and sparse (lots of zeros). I do not have exact count about non-zeros element. But I will try to get more details. So given TFIDF matrix (Lucene index), Semantic vectors performs SVD and created two output matrix. Left singular matrix is term vector matrix while right singular matrix is document vector matrix.

So from term vector matrix, we can query a term and find semantically similar terms based on cosine similarity. Issue here is scalability. Semantic vector does everything in memory so it’s not scalable.

Right now we are evaluating mahout and graphlab to address this issue.

I forwarded this question to our LSI expert, Andrew Olney from University of Memphis. And this is what I got:

For either LSI/LSA

Basically you construct a term/doc matrix (each cell is frequency count of word (row) in document (col)

Normalize if you want e.g. log-entropy or tfidf

Take the SVD

LSA is primarily interested in word/word;doc/doc similarity and uses the U matrix (left singular vectors) LSI is primarily interested in IR and also uses the V matrix (right singular vectors) For LSA the edited book "Handbook of LSA" is a good reference. For LSI I suggest this paper. What is meant in the email below by 'concept search' is not clear to me, but it seems similar to a collaborative filtering approach.
Stochastic SVDs (e.g. Gorrell) have been used for that though they are approximate. See also perhaps Lingo.
And finally, maybe an online topic model would be better if the user is selecting 'concepts' or concepts are otherwise reflected in the UI. Topics seem more comprehensible to people than SVD dimensions. See Blei’s paper.

Since we have now a very efficient SVD solver in GraphLab v2, I would love to help anyone who is interested in implementing LSA/LSI on top of it. Let me know if you want to try it out!

Large Scale Machine Learning and Other Animals

Thursday, April 12, 2012

LSI vs. SVD

No comments:

Post a Comment

Labels

GraphLab Users Google Group

pagerank

google analytics

syntax