I see at http://bickson.blogspot.com/
that you have an implementation of SVD in GraphLab, so I thought I'd
check in with you on the following question:
We have a large, sparse, 10^7 by 10^7 array on which we'd like to run
SVD/PCA. It is an array of corpus statistics, where each row
represents a noun phrase such as "Pittsburgh", each column represents
a text fragment such as "mayor of __", and the i,j entry in the array
gives the count of co-occurrences of this noun phrase with this text
fragment in a half billion web pages. Most elements are zero.
To date, Abhay (cc'd) has thinned out the rows and columns to the most
frequent 10^4 noun phrases and text fragments, and we have run
Matlab's SVD on the resulting 10^4 by 10^4 array. We'd like to scale
up to something bigger.
My question: do you think your GraphLab SVD is relevant? In case
yes, we'd enjoy giving it a try.
Aside from the fact that Tom is my beloved dept. head, the challenge itself is quite exciting, and I immediately started looking into it.
One factor that complicates this task is that he further required no fewer than 1000 iterations. The matrix has 1,200,000,000 non-zeros, and each iteration involves multiplying by A and then A', and likewise by A' and then A, for a total of 4 passes over the matrix,
so we need to compute 4,800,000,000,000 multiplications!
Furthermore, since there are 8,000,000 rows and 7,000,000 columns, storing the eigenvectors requires storing 17,000,000,000 numbers.
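For readers who want to check the arithmetic, here is the back-of-the-envelope calculation behind the multiplication count (variable names are mine, just for illustration):

```python
# Sanity-checking the multiplication count quoted above:
# 4 matrix passes per iteration, 1000 iterations, 1.2 billion non-zeros per pass.
nnz = 1_200_000_000          # non-zero entries in A
passes_per_iteration = 4     # multiply by A and A', then by A' and A
iterations = 1000
total_multiplications = nnz * passes_per_iteration * iterations
print(total_multiplications)  # 4800000000000
```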
As explained in my previous post, SVD can be computed using the Lanczos iteration, with the trick that Abhay taught me. So first, I changed the code to compute both AA' and A'A on each iteration (instead of running a full Lanczos pass on AA' and then another on A'A, which would require much more time).
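For anyone curious what this looks like in code, here is a minimal numpy sketch of Golub-Kahan Lanczos bidiagonalization, the variant that multiplies by A and by A' once each per step, so both AA' and A'A are handled in the same iteration. The function name is mine, and I omit reorthogonalization and out-of-core storage, so this is an illustration of the idea, not the actual GraphLab implementation:

```python
import numpy as np

def lanczos_bidiag(A, k):
    """Run k steps of Golub-Kahan bidiagonalization on A.

    Each step does one multiply by A and one by A.T, so the Lanczos
    recurrences for AA' and A'A advance together.  Returns the k-by-k
    upper-bidiagonal matrix B whose singular values approximate A's.
    """
    m, n = A.shape
    rng = np.random.default_rng(0)
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    u_prev = np.zeros(m)
    beta = 0.0
    alphas, betas = [], []
    for i in range(k):
        u = A @ v - beta * u_prev        # one pass: multiply by A
        alpha = np.linalg.norm(u)
        u /= alpha
        alphas.append(alpha)
        w = A.T @ u - alpha * v          # second pass: multiply by A'
        beta = np.linalg.norm(w)
        if beta < 1e-12 or i == k - 1:   # invariant subspace found, or done
            break
        betas.append(beta)
        u_prev, v = u, w / beta
    # alphas on the diagonal, betas on the superdiagonal
    return np.diag(alphas) + np.diag(betas, 1)
```

The singular values of the small bidiagonal matrix B (computed with an ordinary dense SVD) then approximate the leading singular values of the huge sparse A.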
I let Abhay try this solution, and the program crashed after consuming 170GB of RAM on a 96GB machine.
Next, I changed all usage of doubles to floats, cutting memory consumption in half. I also stored all rows and columns as itpp sparse vectors (instead of GraphLab edge structures) to save memory. Now, on my 8-core machine, it took about 2150 seconds to load the problem into memory,
and each SVD iteration took 160 seconds. Again I handed this solution to Abhay, who again had trouble running the program. This is because I had tested it with 10 iterations, and at that point I was not aware that he needed to run 1000. Even with floats, allocating the data structure to hold the eigenvectors takes 72GB of RAM!
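A quick back-of-the-envelope check on the double-to-float savings, using the 17 billion eigenvector entries counted earlier (the measured 72GB presumably includes some allocator and data-structure overhead on top of the raw payload):

```python
import numpy as np

# 17 billion eigenvector entries, stored as doubles vs. single-precision floats
n_entries = 17_000_000_000
gb_doubles = n_entries * np.dtype(np.float64).itemsize / 1e9  # 8 bytes each
gb_floats = n_entries * np.dtype(np.float32).itemsize / 1e9   # 4 bytes each
print(gb_doubles, gb_floats)  # 136.0 68.0
```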
Finally, I made one more change: dumping each eigenvector to disk. On each iteration, the eigenvector is saved to a swap file of size 72MB. That way, the 72GB of partial results is kept on disk rather than in memory. With a few more optimizations of the loading process, I
reduced the loading time to 1600 seconds.
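The spill-to-disk idea can be sketched in a few lines: after each iteration, write the new vector to its own binary swap file and only read it back when it is actually needed. The function names and file layout here are my own illustration, not the GraphLab code:

```python
import os
import numpy as np

def spill_vector(vec, iteration, outdir):
    """Write one iteration's eigenvector to its own swap file,
    so it no longer occupies RAM."""
    path = os.path.join(outdir, "vec_%04d.bin" % iteration)
    np.asarray(vec, dtype=np.float32).tofile(path)
    return path

def reload_vector(path):
    """Read a spilled vector back (e.g. when assembling results
    at the end of the run)."""
    return np.fromfile(path, dtype=np.float32)
```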
I also compiled the GraphLab solution on the Blacklight supercomputer, where each iteration takes 80 seconds using 16 cores. Overall, computing 1000 eigenvectors of this mega matrix can now be done in a reasonable time of about 24 hours on a single 16-core machine!
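The runtime estimate follows directly from the per-iteration timing (loading time and other overhead add a bit on top of the pure iteration time):

```python
# Rough total runtime at 80 seconds per iteration on 16 cores
iterations = 1000
seconds = iterations * 80
hours = seconds / 3600
print(round(hours, 1))  # 22.2  -- about a day, matching the ~24h estimate
```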
An update: this is a note I got from Abhay: "using ncpus=24, 1000 iterations finished in 10 hours, which is pretty neat. Also, this time around I didn't see a lot of memory usage. "
Overall, we now have one more happy GraphLab user! Contact me if you have a large problem you think can be solved with GraphLab, and I will be happy to help.
Next: part 2 of this post is here.