Friday, October 21, 2011

Speeding up SVD computation on a mega matrix using GraphLab

About two weeks ago, I got a note from Tom Mitchell, head of the Machine Learning Department at Carnegie Mellon University, saying that he was looking for a large-scale solution for computing the SVD (singular value decomposition) of a matrix with 1.2 billion non-zeros. Here is his note:

Hi Danny,

I see at
that you have an implementation of SVD in GraphLab, so I thought I'd
check in with you on the following question:

We have a large, sparse, 10^7 by 10^7 array on which we'd like to run
SVD/PCA.    It is an array of corpus statistics, where each row
represents a noun phrase such as "Pittsburgh", each column represents
a text fragment such as "mayor of __", and the i,j entry in the array
gives the count of co-occurrences of this noun phrase with this text
fragment in a half billion web pages.  Most elements are zero.

To date, Abhay (cc'd) has thinned out the rows and columns to the most
frequent 10^4 noun phrases and text fragments, and we have run
Matlab's SVD on the resulting 10^4 by 10^4 array.   We'd like to scale
up to something bigger.

My question: do you think your GraphLab SVD is relevant?   In case
yes, we'd enjoy giving it a try.


Aside from the fact that Tom is my beloved dept. head, the challenge itself is quite exciting, so I started looking into it immediately.

One factor that complicates this task is that he further required running no fewer than 1000 iterations. Because the matrix has 1,200,000,000 non-zeros, and each iteration involves multiplying by A and then A', and equivalently by A' and then A (a total of four passes over the matrix), we need to perform 4,800,000,000,000 multiplications!

Furthermore, since there are 8,000,000 rows and 7,000,000 columns, storing the eigenvectors requires holding 17,000,000,000 numbers.
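A quick back-of-envelope tally of the workload described above (the stored-numbers count here only covers one vector per side per iteration, so it comes out slightly below the 17-billion figure, which also accounts for extra working vectors):

```python
# Back-of-envelope tally of the workload described in the post.
nnz = 1_200_000_000        # non-zeros in A
iters = 1_000              # required Lanczos iterations
passes = 4                 # A, A', and equivalently A', A per iteration

mults = nnz * passes * iters
print(f"{mults:,} multiplications")   # 4,800,000,000,000 multiplications

rows, cols = 8_000_000, 7_000_000
# One vector per side per iteration; extra working vectors push the
# real total up toward the 17e9 figure quoted in the post.
stored = (rows + cols) * iters
bytes_float = stored * 4              # float32, 4 bytes each
print(f"{stored:,} numbers, about {bytes_float / 1e9:.0f} GB as floats")
```

This makes it obvious why the problem does not fit in a 96GB machine without further tricks.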

As explained in my previous post, the SVD can be computed using the Lanczos iteration, with the trick that Abhay taught me. So first, I changed the code to work on both AA' and A'A in each iteration (instead of running a full Lanczos on AA' and then on A'A, which would take much longer).
I let Abhay try this solution, and the program crashed after consuming 170GB of RAM on a 96GB machine.
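The alternation between multiplications by A and A' in each step is the heart of the Lanczos approach (in the Golub-Kahan bidiagonalization form). Below is a minimal dense NumPy sketch of that iteration; it illustrates the math only, is not the GraphLab implementation, and the function name is my own:

```python
import numpy as np

def lanczos_bidiag(A, k, seed=0):
    """Golub-Kahan Lanczos bidiagonalization sketch: after k steps the
    singular values of the small (k+1) x k bidiagonal matrix B
    approximate those of A. Each step multiplies once by A' and once
    by A, which is the alternation described in the post."""
    m, n = A.shape
    U = np.zeros((m, k + 1))
    V = np.zeros((n, k))
    B = np.zeros((k + 1, k))

    u = np.random.default_rng(seed).standard_normal(m)
    U[:, 0] = u / np.linalg.norm(u)
    beta, v = 0.0, np.zeros(n)
    for j in range(k):
        r = A.T @ U[:, j] - beta * v             # one pass with A'
        r -= V[:, :j] @ (V[:, :j].T @ r)         # full reorthogonalization
        alpha = np.linalg.norm(r)
        v = r / alpha
        V[:, j] = v
        p = A @ v - alpha * U[:, j]              # one pass with A
        p -= U[:, :j + 1] @ (U[:, :j + 1].T @ p)
        beta = np.linalg.norm(p)
        U[:, j + 1] = p / beta
        B[j, j], B[j + 1, j] = alpha, beta
    return U, B, V
```

The payoff is that A itself is only ever touched through sparse matrix-vector products, while the eigenproblem that actually gets solved is the tiny bidiagonal B.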

Next, I changed all doubles to floats, cutting memory consumption in half. I further stored all rows and columns as itpp sparse vectors (instead of GraphLab edge structures) to save memory. Now, on my 8-core machine, loading the problem into memory took about 2150 seconds, and each SVD iteration took 160 seconds. Again I handed the solution to Abhay, who once more had trouble running the program. This was because I had tested it with 10 iterations, not realizing at that point that he needed to run 1000. Even with floats, allocating the data structure that holds the eigenvectors takes 72GB of RAM!

Finally, I made another change: dumping each eigenvector to disk. On each iteration, the eigenvector is saved to a swap file of size 72MB. That way the 72GB of partial results are kept on disk rather than held in memory. With a few more optimizations of the loading process, I reduced the loading time to 1600 seconds.
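The spill-to-disk trick can be sketched in a few lines with NumPy's memmap. The file name and sizes below are illustrative stand-ins, scaled down from the real 72MB-per-vector case (72MB of floats is about 18 million numbers per vector):

```python
import numpy as np, os, tempfile

n = 1_500_000        # scaled-down vector length (the real vectors hold ~18M floats)
iters = 4            # scaled down from 1,000 iterations

# Back the history of Lanczos vectors with a file instead of RAM.
path = os.path.join(tempfile.mkdtemp(), "vectors.swap")
store = np.memmap(path, dtype=np.float32, mode="w+", shape=(iters, n))

for j in range(iters):
    vec = np.full(n, float(j), dtype=np.float32)  # stand-in for a computed vector
    store[j] = vec                                # written through to the swap file
    store.flush()

# Only one vector at a time needs to live in RAM; the file holds the rest.
print(f"{store.nbytes / 1e6:.0f} MB on disk")     # 24 MB on disk
```

The design point is the same as in the post: partial results that are only needed at the very end of the computation have no business occupying RAM during the iterations.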

I also compiled the GraphLab solution on the BlackLight supercomputer, where each iteration takes 80 seconds using 16 cores. Overall, computing 1000 eigenvectors of this mega matrix can now be done in a reasonable 24 hours on a single 16-core machine!

An update: this is a note I got from Abhay: "using ncpus=24, 1000 iterations finished in 10 hours, which is pretty neat. Also, this time around I didn't see a lot of memory usage. "

Overall, we now have one more happy GraphLab user! Contact me if you have a large problem you think could be solved using GraphLab, and I will be happy to help.

Next: part 2 of this post is here.


  1. Danny, do you know? Just out of curiosity:
    what are the practical implications of such an analysis? Which problems can we solve using these 1000 singular vectors? Why 1000?

  2. Hi,
    The idea is that the original data is too big. They first reduce its dimensionality to, say, a feature vector of size 1000 for each row and column. Then they can run classification using SVM or other methods on the low-rank data, which is supposed to represent the high-dimensional data quite closely.



  3. You start from 1,200,000,000 non-zeros and end up with 17,000,000,000 floating-point numbers, so there is no saving in memory consumption. Nor do you gain in operation count when performing matrix-vector multiplication. Do you know why their classification algorithm performs better on the reduced data?

  4. Hi Agnonchik!
    This is an excellent question which actually puzzled me as well. This is the reply I got from Tom Mitchell, CMU Machine Learning Dept. head:
    "Our vectors are cooccurrence statistics, where the counts are fairly
    sparse (even though they are based on 500,000,000 web pages).
    Therefore, you can think of each row as a high-variance approximation
    to the unknown, true probability distribution of contexts for that
    row's noun phrase. One hoped-for benefit is that the lower
    dimension representation will aggregate these high-variance counts in
    a way that gives us a lower variance, and therefore more reliable,
    1000 dimensional representation of the noun phrase's meaning."

    I would love to hear your thoughts about it.

  5. Thanks for your answer, Danny!
    If you're asking me, I have no idea. The only thing I know is that such dimensionality reduction seems to be a common preprocessing step before dealing with a document-term matrix (or a "noun phrase" by "text fragment" matrix in your case). It could be some kind of denoising, as Tom explained. You can check the "ASF Mail Archives" example for a similar application scenario here.
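To make the low-rank-then-classify idea from the comments above concrete, here is a toy sketch: a synthetic count matrix is reduced with a truncated SVD, and the resulting row embeddings are fed to a simple classifier. The data is made up, and a nearest-centroid rule stands in for the SVM mentioned in the comment:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy co-occurrence-style matrix: 100 "noun phrases" x 50 "contexts",
# drawn from two latent topics (purely synthetic data).
topics = rng.random((2, 50))
labels = rng.integers(0, 2, size=100)
counts = rng.poisson(5 * topics[labels])          # noisy count data

# Rank-k truncated SVD: each row gets a k-dimensional dense embedding.
k = 2
U, S, Vt = np.linalg.svd(counts.astype(float), full_matrices=False)
features = U[:, :k] * S[:k]                       # low-rank row embeddings

# Nearest-centroid classifier on the embeddings
# (a stand-in for the SVM step mentioned in the comment).
centroids = np.stack([features[labels == c].mean(axis=0) for c in (0, 1)])
pred = np.argmin(((features[:, None, :] - centroids) ** 2).sum(-1), axis=1)
accuracy = (pred == labels).mean()
print(f"training accuracy on the toy data: {accuracy:.2f}")
```

The point of the sketch is the shape of the pipeline, not the numbers: noisy high-dimensional counts go in, a small dense representation comes out, and the classifier only ever sees the latter.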