Sunday, February 26, 2012

SVD strikes again!

As you may know, SVD is a topic that constantly comes up on my blog. It is a chicken-and-egg problem: the more I work on SVD, the more requests I get for help with large-scale SVD solutions, and then I work on SVD even more.

Here are some updates from the last couple of weeks.
Vincent Bor, an independent consultant from the Dallas/Fort Worth area, sent me the following. The problem he is working on is a document-term matrix with 207,710 terms, 153,328 documents and 22,518,138 non-zero elements. He has an older SVD implementation of his own that he is comparing to GraphLab.

I do bidiagonalization, so I use the original matrix A rather than A^T A or A A^T; maybe because of this, Lanczos handles ill-conditioned matrices better (?). I have restarting, but I haven't yet seen restarting actually kick in (I do not see all the logs, only from time to time). The algorithm is based on the paper "Lanczos bidiagonalization with partial reorthogonalization" by Rasmus Munk Larsen, but I do full reorthogonalization. I use the Compressed Row Storage (CRS) format to keep the matrix in memory and for the matrix-vector multiplication. During the iterations I keep the U and V vectors on disk; only the matrix is in memory. I use parallel computation in only two places: the matrix-vector multiplication and the construction of the final singular vectors.
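For readers curious what this setup looks like in code, here is a rough sketch of my own (in Python/NumPy, not Vincent's actual implementation) of Golub-Kahan-Lanczos bidiagonalization with full reorthogonalization, working directly on the sparse matrix A rather than on A^T A:

```python
import numpy as np


def lanczos_bidiag(A, k, seed=0):
    """Golub-Kahan-Lanczos bidiagonalization with full reorthogonalization.

    Returns U (m x k), V (n x k) and the small k x k upper-bidiagonal matrix B
    such that A @ V is approximately U @ B.  This is only an illustrative
    sketch; a production solver like the one Vincent describes would keep
    U and V out of core and parallelize the sparse matrix-vector products.
    """
    m, n = A.shape
    rng = np.random.default_rng(seed)
    U = np.zeros((m, k))
    V = np.zeros((n, k))
    alpha = np.zeros(k)   # diagonal of B
    beta = np.zeros(k)    # super-diagonal of B

    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    u = np.zeros(m)
    prev_beta = 0.0
    for j in range(k):
        V[:, j] = v
        # u_j = A v_j - beta_{j-1} u_{j-1}
        u = A @ v - prev_beta * u
        u -= U[:, :j] @ (U[:, :j].T @ u)           # full reorthogonalization
        alpha[j] = np.linalg.norm(u)
        u /= alpha[j]
        U[:, j] = u
        # v_{j+1} = A^T u_j - alpha_j v_j
        v = A.T @ u - alpha[j] * v
        v -= V[:, :j + 1] @ (V[:, :j + 1].T @ v)   # full reorthogonalization
        beta[j] = np.linalg.norm(v)
        if beta[j] > 0:
            v /= beta[j]
        prev_beta = beta[j]

    B = np.diag(alpha) + np.diag(beta[:-1], 1)
    return U, V, B
```

The sketch keeps everything in memory for clarity; swapping in on-disk storage for U and V and a parallel sparse matrix-vector product is where the engineering effort Vincent describes comes in.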

It seems that the first version of our SVD solver has numerical issues with his matrix. In a few days I am going to release a fix that includes restarts, as part of GraphLab v2.

Brian Murphy, a postdoc at CMU, sent me the following:
By the way, the SVD task I have is not enormous - it's a 50M (web documents) x 40K (vocabulary words) matrix of real values, and very sparse (3.8 billion non-zero entries, or a density of about 0.2%). Until now we've been using the sparse SVD routines in Python's scipy (which use ARPACK). But with that, the largest matrix I can comfortably manage is about 10-20M documents.

Even if I could refactor the code to fit 50M documents, we would still like to be able to run this with much larger collections of web documents.

Another solution I have tried is the gensim package, which implements a couple of approximate SVD algorithms, but so far the results using the decompositions it produces are not so encouraging. The methods used in gensim are described in the paper "Fast and Faster: A Comparison of Two Streamed Matrix Decomposition Algorithms".
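For reference, the scipy/ARPACK route Brian mentions looks roughly like this (a minimal sketch; the small random matrix below just stands in for the real 50M x 40K document-term data):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

# Small random sparse matrix standing in for a document-term matrix.
A = sp.rand(10000, 4000, density=0.002, format='csr')

# Compute the k largest singular triplets with ARPACK.
k = 20
u, s, vt = svds(A, k=k)

# svds does not guarantee a particular ordering of the singular values,
# so sort them in decreasing order before using them.
order = np.argsort(s)[::-1]
u, s, vt = u[:, order], s[order], vt[order, :]
print(s[:5])
```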

Now, this sounds like a challenging size - 3.8 billion non-zeros. I am definitely going to help Brian crack his problem with GraphLab v2!

Andrew McGregor Olney, an assistant professor in the Department of Psychology at the University of Memphis, sent me the following:
You might be interested in this paper that describes my partial SVD with no reorthogonalization. It's not highly original, but it attempts to synthesize some of the previous work into a simple, practical, and accurate low-memory solution for the SVD of term-document matrices.

I've got some open-source C# code to share (it's not really pretty, but at least it's OO) and some of the original Octave scripts I cobbled together in 2005 (using B instead of A^T A), if you are interested.

Andrew further sent me a sample matrix with 4,364,328 rows, 3,333,798 columns and 513,883,084 non-zeros to try out. He only cares about the left singular vectors and the singular values.
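Connecting this to the bidiagonalization sketch above: the approximate singular values of A are the singular values of the small bidiagonal matrix B, and the approximate left singular vectors - the only output Andrew says he needs - are obtained by rotating the left Lanczos vectors with the left singular vectors of B. A hedged sketch (it reuses the illustrative lanczos_bidiag routine from earlier):

```python
import numpy as np
import scipy.sparse as sp

# Small random matrix standing in for Andrew's 4.3M x 3.3M term-doc matrix.
A = sp.rand(2000, 1500, density=0.01, format='csr')

k = 40
U, V, B = lanczos_bidiag(A, k)     # sketch defined earlier in this post

# The SVD of the small k x k bidiagonal matrix B is cheap and dense.
Ub, s, Vbt = np.linalg.svd(B)

# Approximate singular values of A, and approximate left singular vectors
# obtained by rotating the left Lanczos basis.
singular_values = s
left_vectors = U @ Ub              # m x k
print(singular_values[:5])
```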

Edin Muharemagic is working on LexisNexis' new machine learning library, and he is currently implementing SVD on top of it. You can read more about it here on my blog.

The best large-scale SVD implementation I am aware of is SLEPc. Now, if this problem is already solved, you may ask, why is everyone emailing me? Unfortunately, using SLEPc requires an understanding of MPI, C programming ability, and a non-negligible learning curve. To understand why a proper SVD implementation is not trivial, I asked Joel Tropp, who gave the following answer:

In short, the Lanczos method is very unstable because it tries to "cheat" when it computes an orthogonal basis for the Krylov subspace. This procedure only works well in exact arithmetic; it breaks very quickly in the presence of floating-point errors. Drastic remedies are required to fix this problem. See the book by Golub and Van Loan for a discussion and references. A more introductory presentation appears in Trefethen and Bau.
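To see concretely what Joel means, here is a small numerical illustration of my own (not part of his answer): run plain symmetric Lanczos with no reorthogonalization and watch how the computed Krylov basis drifts away from orthogonality as the iteration proceeds.

```python
import numpy as np


def plain_lanczos(A, q1, k):
    """Symmetric Lanczos with NO reorthogonalization (the 'cheating' version)."""
    n = A.shape[0]
    Q = np.zeros((n, k))
    q_prev = np.zeros(n)
    q = q1 / np.linalg.norm(q1)
    beta = 0.0
    for j in range(k):
        Q[:, j] = q
        w = A @ q - beta * q_prev        # three-term recurrence
        alpha = q @ w
        w -= alpha * q
        beta = np.linalg.norm(w)
        q_prev, q = q, w / beta
    return Q


# Symmetric test matrix with one dominant, well-separated eigenvalue,
# so a Ritz value converges quickly and orthogonality is lost.
n = 400
rng = np.random.default_rng(0)
eigvals = np.linspace(1.0, 100.0, n)
eigvals[-1] = 1e6
Qrand, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Qrand @ np.diag(eigvals) @ Qrand.T

for k in (5, 15, 30, 60):
    Q = plain_lanczos(A, rng.standard_normal(n), k)
    loss = np.max(np.abs(Q.T @ Q - np.eye(k)))
    print(f"k={k:3d}  max |Q'Q - I| = {loss:.2e}")
```

The printed orthogonality error grows as the iteration proceeds, which is exactly the instability that full or partial reorthogonalization (or restarting) is there to repair.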

Another possible approach to computing a large-scale SVD is column sampling, perhaps followed by a randomized SVD, as advocated by Kwok and Li. See Chiu and Demanet (arXiv) or Gittens (arXiv) for some related methods.
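For completeness, here is a minimal sketch of the randomized SVD idea (a Gaussian range finder plus a few power iterations, in the spirit of the randomized methods mentioned above); the function name and parameters are my own choices for illustration, not taken from any of the cited papers.

```python
import numpy as np
import scipy.sparse as sp


def randomized_svd(A, k, oversample=10, n_iter=2, seed=0):
    """Approximate truncated SVD via random projection.

    1. Sketch the range of A with a Gaussian test matrix.
    2. Run a few power iterations to sharpen the basis.
    3. Project A onto that basis and take an exact SVD of the small matrix.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + oversample))
    Y = A @ Omega
    Q, _ = np.linalg.qr(Y)
    for _ in range(n_iter):                 # power iterations improve accuracy
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    B = (A.T @ Q).T                         # equals Q.T @ A, (k+p) x n
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :k], s[:k], Vt[:k, :]


# Toy usage on a random sparse matrix.
A = sp.rand(5000, 2000, density=0.01, format='csr')
U, s, Vt = randomized_svd(A, k=10)
print(s)
```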

Anyway, stay tuned, since in the coming week or so I am going to release an improved SVD solver in GraphLab version 2. I hope it will be able to address all the above requests for help. And if you are working on large-scale SVD, or need such an implementation, please let me know!

2 comments:

  1. Hello Danny, did you finish your SVD implementation yet?

    It'd be extremely interesting to compare it to gensim's "online stochastic SVD" approach in particular. Sorry to hear Brian didn't find that decomposition useful -- I'd be curious to hear more!

    Replies
    1. Hi Radim,
      The implementation has been working for a long time now, on both multicore (GraphChi) and distributed (GraphLab). I will be happy to learn about any comparisons to other algorithms.
