A few days ago, I wrote about our new ICML paper, which discusses how to speed up L1 regularized loss (either LASSO or logistic regression) on multicore. The paper is a joint work with Joseph Bradley, Aapo Kyrola and Carlos Guestrin.
Today, we have released a parallel Matlab mex code here. The code was written by Aapo Kyrola, CMU.
To give some motivation why a parallel code is needed, I will give an example. In their paper: Kogan, S., Levin, D., Routledge, B.R., Sagi, J.S., and Smith, N.A. Predicting risk from financial reports with regression. In Human Language Tech.-NAACL, 2009. Kogan et. al gives and example on how to predict stock behavior by analyzing diagrams of words that appear in the report. The computational problem is that the data matrix is rather large: it is a sparse matrix of size 30465 x 216842 with 0.0575% of non-zeros.
We tried to compare to other efficient l1 algorithms, but the best performance (after our own method, shotgun) was obtained by Boyd's interior point method. The following Graph shows the running time:
Shotgun, is about x65 times faster than the interior point method. (All tests ran on an AMD processor using up to 8 Opteron 8384 cores (2.69 GHz)).
To make things worse, on the full data set of bigrams, (matrix of size 30465 x 5845762 ) we did not find even a single algorithm for comparison that could cope with the data magnitude. Shotgun crunched this dataset in 2000 seconds.
The following graphs summarize the properties of shotgun vs. 4 categories of datasets:
compressed sensing (Sparco), single pixel camera, sparse compressed images, and large sparse datasets.
Whenever the tickers are above the diagonal line, Shotgun is performing faster. The algorithms for comparison used are: Shooting, L1_LS, FPC_AS, GPSR_BB, Hard_l0, SpaRSA. Except of the single pixel camera datasets, shotgun performs better on all datasets. The worser performance is explained by the theoretical P* bound, which limits the number of processors to be used efficiently in parallel. For the single pixel camera dataset the bound is 3, so using 8 cores in parallel does not gain much, since only 3 of them are effective and the others actually hurt performance.
As always, comparison between algorithms should be taken with a grain of salt, since the termination conditions are different for each algorithm. But overall, shotgun has a significant speedup over competing methods.
Some comments we got from Alex Smola, Yahoo! Research:
a) you probably want to look at the Beck & Teboulle 2008 paper
http://iew3.technion.ac.il/~becka/papers/71654.pdf
the issue is that FISTA (yes, they named their algorithm this way ...) has O(1/T^2) convergence rather than O(1/T).
in fact, FISTA could be implemented in parallel rather trivially since all it needs is a subgradient. in fact, i guess you could even use noisy subgradient estimates initially where the noise is less than the accuracy penalty (this would be a simple application of concentration inequalities to get it done - that's _not_ in their paper but imho quite obvious).
also, the actual thresholding stage parallelizes very nicely, so it shouldn't be to hard to build overall.
b) you should be able to get some additional mileage from whitening the A^\top A matrix, at least with a diagonal preconditioner. that way you end up rescaling updates in accordance to their mutual interference. this should be quite helpful in particular since A will also give you some idea of the subspace generating the gradient, so things are effectively scaling like (A^\top A)^2 rather than (A\top A) when the gradient and the interference term are evaluated.
c) a slight wrinkle in your analysis is that you only look at expected improvement. while this is perfectly good for the purpose of getting an intuition of convergence and capturing most of the behavior, it doesn't quite suffice to _prove_ convergence without additional gymnastics (not sure whether the Shalev-Schwartz proof has it).
d) equation (10) probably has a spurious 1/2 factor on the rhs, or at least i can't see how you got it. it works out in the end since (8) is probably where it came from ...
e) you probably want to look at a partitioning which has small interference since ultimately you don't really need to draw from an independent selection distribution. for instance, imagine that you had a chain. in this case if you drew every third element you'd be safe (think of the chromatic engine in graphlab).
while this is generally not possible, it has the flavor of a jacobi-like approach to it. i.e. randomize over decompositions that minimize interactions, provided that the overall distribution sill picks each coordinate uniformly. this would work well e.g. in a graph.
Thanks Alex!
Monday, June 27, 2011
Friday, June 17, 2011
KDD Cup progress update - We are in the 5th place!
I am proud to report that yesterday JustinYan, a graduate student from the Institute of Automation, Chinese Academy of Sciences, a member of the LeBuSiShu team, suggested that I will join their team.
The reason is that LeBuSiShu is using GraphLab ALS solution as one of the components in their final result. Currently LeBuSiShu is the 5th group in the leaderboard (out of 100 groups!) with RMSE = 22.2820.
Anyway, of course I was honored to accept as well as quite excited! Although when writing the GraphLab collaborative filtering code of matrix factorization I had no competition in mind, it is always nice to know that someone finds your solution useful.
Besides of JustinYun, another nice team member in the LeBuSiShu group is: Yao Wu, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academic of Sciences.
A quick update 6/23: We are actually now at the 4th place! The Chinese group is an amazing team, they simply work around the clock. I myself am running some fancy MCMC methods for slightly improving prediction, but the Chinese team is responsible of 99.9% of the progress.
The reason is that LeBuSiShu is using GraphLab ALS solution as one of the components in their final result. Currently LeBuSiShu is the 5th group in the leaderboard (out of 100 groups!) with RMSE = 22.2820.
Anyway, of course I was honored to accept as well as quite excited! Although when writing the GraphLab collaborative filtering code of matrix factorization I had no competition in mind, it is always nice to know that someone finds your solution useful.
Besides of JustinYun, another nice team member in the LeBuSiShu group is: Yao Wu, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academic of Sciences.
A quick update 6/23: We are actually now at the 4th place! The Chinese group is an amazing team, they simply work around the clock. I myself am running some fancy MCMC methods for slightly improving prediction, but the Chinese team is responsible of 99.9% of the progress.
Friday, June 10, 2011
Tutorial: Jacobi method in GraphLab
I have implemented today a parallel version of the Jacobi method, a very simple algorithm for solving
a system of linear equations in Graphlab. The whole process including debugging
took me less than an hour and a half.
Just to show that creating new Graphlab applications is easy, I give below the
resulting code, which is about 20 lines of code including debug traces..
In GraphLab you first have to define node data and edge data. In our case each node represents a row in the linear system of equations, and each edge is a non-zero entry in the linear relation
matrix.
And the edge data is:
a system of linear equations in Graphlab. The whole process including debugging
took me less than an hour and a half.
Just to show that creating new Graphlab applications is easy, I give below the
resulting code, which is about 20 lines of code including debug traces..
In GraphLab you first have to define node data and edge data. In our case each node represents a row in the linear system of equations, and each edge is a non-zero entry in the linear relation
matrix.
struct node_data{ double x_i; //value from previous iteration, placeholder for next value double b_i; //Entry i of the observtion vector b double A_ii; //Entry i of the main diagonal of matrix A }
And the edge data is:
struct edge_data{ double A_ij; //contains the A_ij value of the matrix A: i != j }
/** * Functionality: The code solves the linear system Ax = b using * The Jacobi algorithm. (A is a square matrix). * A assumed to be full column rank. Algorithm is described * http://en.wikipedia.org/wiki/Jacobi_method * Written by Danny Bickson */ /*** * JACOBI UPDATE FUNCTION * Update rule is: * x_i = (b_i - \sum_j A_ij * x_j)/A_ii */ void jacobi_update_function(gl_types::iscope &scope, gl_types::icallback &scheduler) { /* GET current vertex data */ vertex_data& vdata = scope.vertex_data(); graphlab::edge_list outedgeid = scope.out_edge_ids(); //store last round x_i for convergence detection vdata.prev_mean = vdata.cur_mean; double x_i = vdata.b_i; double A_ii = vdata.A_ii; assert(A_ii != 0); //diagonal of square matrix can not be zero // go over each edge (non zero A_ij value) for(size_t i = 0; i < outedgeid.size(); ++i) { edge_data& out_edge = scope.edge_data(outedgeid[i]); const vertex_data & other = scope.neighbor_vertex_data(scope.target(outedgeid[i])); x_i -= out_edge.A_ij * other.x_i; } x_i /= A_ii; //store computed result in node data vdata.x_i = x_i; }And here is an example for running Jacobi:
<129|0>bickson@biggerbro:~/newgraphlab/graphlabapi/debug/demoapps/gabp$ ./gabp 1 mat3x3 --scheduler="round_robin(max_iterations=25)" Loading mat3x3 Creating 6 edges... symmetry: 0 .max iterations = 25 step = 1 max_iterations = 25 INFO: asynchronous_engine.hpp(run:94): Worker 0 started. INFO: asynchronous_engine.hpp(run:94): Worker 1 started. INFO: asynchronous_engine.hpp(run:102): Worker 0 finished. INFO: asynchronous_engine.hpp(run:102): Worker 1 finished. Jacobi finished in 0.0013 Assuming the linear system is Ax=y, and the correct solution is x*,Jacobi converged to an accuracy norm(x-x*) of 1.41598e-22 Writing result to file: mat3x3.out You can read the file in Matlab using the load_c_gl.m matlab script === REPORT FOR core() === [Numeric] ncpus: 2 [Other] affinities: false compile_flags: engine: async scheduler: round_robin schedyield: true scope: edge === REPORT FOR engine() === [Numeric] num_edges: 6 num_syncs: 2 num_vertices: 3 updatecount: 75 [Timings] runtime: 0 s [Other] termination_reason: task depletion (natural) [Numeric] updatecount_vector: 75 (count: 2, min: 37, max: 38, avg: 37.5) updatecount_vector.values: 37,38,If anyone wants to experiment with the code, you should download GraphLab, the Jacobi example is found on demoapps/gabp/ directory.
Thursday, June 9, 2011
SVD (Lanczos algorithm) in GraphLab
I have now implemented the Lanczos algorithm in GraphLab, as part of
GraphLab's collaborative filtering library.
Here are some performance results using the Netflix data (100M non-zeros):
On a 16 core machine (2.6Ghz Quad-Core AMD Opteron x 2), we get a speedup of about 9. Each Iteration over Netflix data takes 10 seconds using 16 cores (iteration is composed in multiplying by the matrix A and then multiplying by A^T).
Using KDD Cup data (track1), we have 252M non-zeros, and each iteration takes 31.4 seconds on the same 16 cores machine.
To give a concrete performance comparison, I have also computed svds() in Matlab on Netflix data. For extracting 2 eigenvalues it takes 126.2 seconds in Matlab, while in GraphLab the same computation (using 16 cores) takes 21.9 seconds. So we are about x6 times faster than Matlab. (Note that there may be a potential difference in the implemented algorithm so this comparison should be taken with a grain of salt).
GraphLab's collaborative filtering library.
Here are some performance results using the Netflix data (100M non-zeros):
On a 16 core machine (2.6Ghz Quad-Core AMD Opteron x 2), we get a speedup of about 9. Each Iteration over Netflix data takes 10 seconds using 16 cores (iteration is composed in multiplying by the matrix A and then multiplying by A^T).
Using KDD Cup data (track1), we have 252M non-zeros, and each iteration takes 31.4 seconds on the same 16 cores machine.
To give a concrete performance comparison, I have also computed svds() in Matlab on Netflix data. For extracting 2 eigenvalues it takes 126.2 seconds in Matlab, while in GraphLab the same computation (using 16 cores) takes 21.9 seconds. So we are about x6 times faster than Matlab. (Note that there may be a potential difference in the implemented algorithm so this comparison should be taken with a grain of salt).
What are the most widely deployed machine learning algorithms?
One of the interesting questions is: "what are the most useful machine learning algorithms?".
I did a little survey by looking at the Mahout user mailing list and counting occurrences of keywords. The results I got are shown in the plot above.
It seems that matrix factorization (SVD) is the most widely used algorithm, and then K-means. We have just implemented SVD as a part of the GraphLab Collaborative Filtering library. Anyone who wants to beta test it is welcome!
I did a little survey by looking at the Mahout user mailing list and counting occurrences of keywords. The results I got are shown in the plot above.
It seems that matrix factorization (SVD) is the most widely used algorithm, and then K-means. We have just implemented SVD as a part of the GraphLab Collaborative Filtering library. Anyone who wants to beta test it is welcome!
GraphLab PMF on 32 bit Ubuntu
NOTE: This code is deprecated. Please take a look here for GraphChi (multicore implementation): http://bickson.blogspot.co.il/2012/12/collaborative-filtering-with-graphchi.html
Or here for GraphLab (distributed implementation): http://graphlab.org/toolkits/collaborative-filtering/
As GraphLab collaborative filtering library is growing, we are adding new algorithms and getting more and more users (around 110 unique installations originated from this blog in the last couple of months). Yesterday I had an interesting email exchange with Timmy Wilson, our man in Cleveland Ohio. Timmy is working on clustering in social networks.
Timmy tried to install Graphlab on linode, but encountered some problems. Here is what he writes:
Installing Graphlab on Ubuntu 10.04 (32bit)
Or, 'I wish i had installed 64bit Ubuntu'
I followed Danny's instructions here:
http://bickson.blogspot.com/2011/04/yahoo-kdd-cup-using-graphlab.html
At the end of this email, attached are all the commands I used.
I can't say for sure, but i assume everything would have went without
a hitch if i were running the 64bit version of Ubuntu. When trying to compile GraphLab I got the following error:
This is the output of my
gcc version:
Luckily, the graphlab guys take this stuff pretty seriously. "There
never was a compilation problem we failed to solve the same day...",
Danny told me.
Here Yucheng Low describes my problem:
> The problems you are encountering are somewhat complicated.
> I suspect you have a strange mismatch between what gcc thinks your system is
> and what your system architecture actually is.
>
> According to an email you sent before, gcc seems to think the target
> architecture is 486. I am going to assume that you are not actually running
> this on a 486 machine. The missing "__sync_add_and_fetch_8" instructions
> are available 586 and beyond. What architecture are you running on?
>
> This ITPP error looks like ITPP was build without SSE2 support possibly
> because it used the default architecture as defined by gcc. However, for
> performance reasons, the GraphLab compile flags force SSE/SSE2 through
> -mfpmath=sse -msse2
>
> Two possible solutions:
> 1: line 310 through line 328 of the root CMakeLists.txt
> To each set of compile flags add -march=pentium, delete -mfpmath=sse -msse2
>
> 2: More fundamentally, try to figure out why gcc thinks you are running on a
> 486.
First the more interesting problem:
> 2: More fundamentally, try to figure out why gcc thinks you are running on a
> 486.
I wrote the guys at http://www.linode.com/ and got a response in 15minutes:
> Ubuntu compiles their software optimized for the 486 architecture in its 32-bit
> iteration. There is not a significant performance increase when optimizing for
> 686 compared to optimizing for 486, so 486 optimization remains the default
> for 32-bit Ubuntu/Debian systems.
Mystery solved.
I opted for the first solution:
> 1: line 310 through line 328 of the root CMakeLists.txt
> To each set of compile flags add -march=pentium, delete -mfpmath=sse -msse2
Then build pmf:
Now I got the following error:
PMF compilation error:
It seems that libitpp.a location was not identified. I linked to the installed itpp libraries by adding the following line below line 210 of CMakeLists.txt:
Awesome -- it works!
Timmy has also provided step by step instructions:
Instructions
http://bickson.blogspot.com/2011/04/yahoo-kdd-cup-using-graphlab.html
Installing itpp
http://bickson.blogspot.com/search/label/itpp
Installing graphlabapi
CMakeLists.txt edits
add following line below line 210:
replace all instances of: (be careful - only for 32 bit Linux!)
with
Build PMF
Test
Or here for GraphLab (distributed implementation): http://graphlab.org/toolkits/
As GraphLab collaborative filtering library is growing, we are adding new algorithms and getting more and more users (around 110 unique installations originated from this blog in the last couple of months). Yesterday I had an interesting email exchange with Timmy Wilson, our man in Cleveland Ohio. Timmy is working on clustering in social networks.
Timmy tried to install Graphlab on linode, but encountered some problems. Here is what he writes:
Installing Graphlab on Ubuntu 10.04 (32bit)
Or, 'I wish i had installed 64bit Ubuntu'
I followed Danny's instructions here:
http://bickson.blogspot.com/
At the end of this email, attached are all the commands I used.
I can't say for sure, but i assume everything would have went without
a hitch if i were running the 64bit version of Ubuntu. When trying to compile GraphLab I got the following error:
CMakeFiles/disk_graph_test.cxxtest.dir/disk_graph_test.cxx.o: In
function `graphlab::atomic::inc()':
/home/timmyt/graphlabapi/src/graphlab/parallel/atomic.hpp:39:
undefined reference to `__sync_add_and_fetch_8'
This is the output of my
gcc version:
$ gcc -v
Using built-in specs.
Target: i486-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu
4.4.3-4ubuntu5'
--with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs
--enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
--enable-shared --enable-multiarch --enable-linker-build-id
--with-system-zlib --libexecdir=/usr/lib --without-included-gettext
--enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.4
--program-suffix=-4.4 --enable-nls --enable-clocale=gnu
--enable-libstdcxx-debug --enable-plugin --enable-objc-gc
--enable-targets=all --disable-werror --with-arch-32=i486
--with-tune=generic --enable-checking=release --build=i486-linux-gnu
--host=i486-linux-gnu --target=i486-linux-gnu
Thread model: posix
gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5)
Luckily, the graphlab guys take this stuff pretty seriously. "There
never was a compilation problem we failed to solve the same day...",
Danny told me.
Here Yucheng Low describes my problem:
> The problems you are encountering are somewhat complicated.
> I suspect you have a strange mismatch between what gcc thinks your system is
> and what your system architecture actually is.
>
> According to an email you sent before, gcc seems to think the target
> architecture is 486. I am going to assume that you are not actually running
> this on a 486 machine. The missing "__sync_add_and_fetch_8" instructions
> are available 586 and beyond. What architecture are you running on?
>
> This ITPP error looks like ITPP was build without SSE2 support possibly
> because it used the default architecture as defined by gcc. However, for
> performance reasons, the GraphLab compile flags force SSE/SSE2 through
> -mfpmath=sse -msse2
>
> Two possible solutions:
> 1: line 310 through line 328 of the root CMakeLists.txt
> To each set of compile flags add -march=pentium, delete -mfpmath=sse -msse2
>
> 2: More fundamentally, try to figure out why gcc thinks you are running on a
> 486.
First the more interesting problem:
> 2: More fundamentally, try to figure out why gcc thinks you are running on a
> 486.
I wrote the guys at http://www.linode.com/ and got a response in 15minutes:
> Ubuntu compiles their software optimized for the 486 architecture in its 32-bit
> iteration. There is not a significant performance increase when optimizing for
> 686 compared to optimizing for 486, so 486 optimization remains the default
> for 32-bit Ubuntu/Debian systems.
Mystery solved.
I opted for the first solution:
> 1: line 310 through line 328 of the root CMakeLists.txt
> To each set of compile flags add -march=pentium, delete -mfpmath=sse -msse2
Then build pmf:
$ cd ~/graphlabapi/release/demoapps/pmf
$ make
Now I got the following error:
PMF compilation error:
[ 97%] Built target graphlab
[100%] Built target itdiff
Linking CXX executable pmf
CMakeFiles/pmf.dir/pmf.o: In function `itpp::DSFMT<19937 data-blogger-escaped-10376655713290109737ull="" data-blogger-escaped-1047295u="" data-blogger-escaped-1048063u="" data-blogger-escaped-117="" data-blogger-escaped-19="" data-blogger-escaped-1ull="" data-blogger-escaped-4237361149u="" data-blogger-escaped-4291106551315987578ull="" data-blogger-escaped-4294966079u="" data-blogger-escaped-4432916062321256576ull="" data-blogger-escaped-4498102069230399ull="" data-blogger-escaped-4501400546508797ull="">::init_gen_rand(unsigned int)':
/usr/local/include/itpp/base/random_dsfmt.h:161: undefined reference
to `itpp::DSFMT<19937 data-blogger-escaped-10376655713290109737ull="" data-blogger-escaped-1047295u="" data-blogger-escaped-1048063u="" data-blogger-escaped-117="" data-blogger-escaped-19="" data-blogger-escaped-1ull="" data-blogger-escaped-4237361149u="" data-blogger-escaped-4291106551315987578ull="" data-blogger-escaped-4294966079u="" data-blogger-escaped-4432916062321256576ull="" data-blogger-escaped-4498102069230399ull="" data-blogger-escaped-4501400546508797ull="">::sse2_param_mask'
CMakeFiles/pmf.dir/pmf.o: In function `itpp::DSFMT<19937 data-blogger-escaped-10376655713290109737ull="" data-blogger-escaped-1047295u="" data-blogger-escaped-1048063u="" data-blogger-escaped-117="" data-blogger-escaped-19="" data-blogger-escaped-1ull="" data-blogger-escaped-4237361149u="" data-blogger-escaped-4291106551315987578ull="" data-blogger-escaped-4294966079u="" data-blogger-escaped-4432916062321256576ull="" data-blogger-escaped-4498102069230399ull="" data-blogger-escaped-4501400546508797ull="">::do_recursion(itpp::DSFMT<19937 data-blogger-escaped-10376655713290109737ull="" data-blogger-escaped-1047295u="" data-blogger-escaped-1048063u="" data-blogger-escaped-117="" data-blogger-escaped-19="" data-blogger-escaped-1ull="" data-blogger-escaped-4237361149u="" data-blogger-escaped-4291106551315987578ull="" data-blogger-escaped-4294966079u="" data-blogger-escaped-4432916062321256576ull="" data-blogger-escaped-4498102069230399ull="" data-blogger-escaped-4501400546508797ull="">::W128_T*, itpp::DSFMT<19937 data-blogger-escaped-10376655713290109737ull="" data-blogger-escaped-1047295u="" data-blogger-escaped-1048063u="" data-blogger-escaped-117="" data-blogger-escaped-19="" data-blogger-escaped-1ull="" data-blogger-escaped-4237361149u="" data-blogger-escaped-4291106551315987578ull="" data-blogger-escaped-4294966079u="" data-blogger-escaped-4432916062321256576ull="" data-blogger-escaped-4498102069230399ull="" data-blogger-escaped-4501400546508797ull="">::W128_T*,
itpp::DSFMT<19937 data-blogger-escaped-10376655713290109737ull="" data-blogger-escaped-1047295u="" data-blogger-escaped-1048063u="" data-blogger-escaped-117="" data-blogger-escaped-19="" data-blogger-escaped-1ull="" data-blogger-escaped-4237361149u="" data-blogger-escaped-4291106551315987578ull="" data-blogger-escaped-4294966079u="" data-blogger-escaped-4432916062321256576ull="" data-blogger-escaped-4498102069230399ull="" data-blogger-escaped-4501400546508797ull="">::W128_T*,
itpp::DSFMT<19937 data-blogger-escaped-10376655713290109737ull="" data-blogger-escaped-1047295u="" data-blogger-escaped-1048063u="" data-blogger-escaped-117="" data-blogger-escaped-19="" data-blogger-escaped-1ull="" data-blogger-escaped-4237361149u="" data-blogger-escaped-4291106551315987578ull="" data-blogger-escaped-4294966079u="" data-blogger-escaped-4432916062321256576ull="" data-blogger-escaped-4498102069230399ull="" data-blogger-escaped-4501400546508797ull="">::W128_T*)':
/usr/local/include/itpp/base/random_dsfmt.h:375: undefined reference
It seems that libitpp.a location was not identified. I linked to the installed itpp libraries by adding the following line below line 210 of CMakeLists.txt:
link_directories(/usr/local/lib/)
Awesome -- it works!
Timmy has also provided step by step instructions:
Instructions
http://bickson.blogspot.com/2011/04/yahoo-kdd-cup-using-graphlab.html
Installing itpp
http://bickson.blogspot.com/search/label/itpp
$ sudo apt-get install --yes --force-yes automake autoconf libtool* gfortran
$ sudo apt-get install --yes --force-yes liblapack-dev
$ export LDFLAGS="-L/usr/lib -lgfortran"
$ cd ~/
$ wget http://sourceforge.net/projects/itpp/files/itpp/4.2.0/itpp-4.2.tar.gz
$ tar xvzf itpp-4.2.tar.gz
$ cd itpp-4.2
$ ./autogen.sh
$ ./configure --without-fft --with-blas=/home/ubuntu/lapack-3.3.0/blas_LINUX.a --with-lapack=/home/ubuntu/lapack-3.3.0/lapack_LINUX.a CFLAGS=-fPIC CXXFLAGS=-fPIC CPPFLAGS=-fPIC
$ make
$ sudo make install
Installing graphlabapi
$ hg clone https://graphlabapi.googlecode.com/hg/ graphlabapi
$ cd graphlabapi/
$ ./configure --bootstrap
CMakeLists.txt edits
$ vim CMakeLists.txt
add following line below line 210:
link_directories(/usr/local/lib/)
replace all instances of: (be careful - only for 32 bit Linux!)
-mfpmath=sse -msse2
with
-march=pentium
Build PMF
$ cd ~/graphlabapi/release/demoapps/pmf
$ make
Test
$ ./pmf smalltest 0 --float=true --scheduler="round_robin(max_iterations=15)"
Thanks so much Timmy - we really appreciate your feedback!
Monday, June 6, 2011
Matrix factorization algorithms - which algorithms can be parallelized?
The following matrix factorization algorithms have a rather straightforward
palatalization (and are implemented in GraphLab open source collaborative filtering library):
Probabilistic matrix/tensor factorization:
A) Liang Xiong, Xi Chen, Tzu-Kuo Huang, Jeff Schneider, Jaime G. Carbonell, Temporal Collaborative Filtering with Bayesian Probabilistic Tensor Factorization. In Proceedings of SIAM Data Mining, 2010. html (source code is also available).
B) Salakhutdinov and Mnih, Bayesian Probabilistic Matrix Factorization using Markov Chain Monte Carlo. in International Conference on Machine Learning, 2008. pdf, since our code implements matrix factorization as a special case of a tensor as well.
Alternating least squares:
C) Yunhong Zhou, Dennis Wilkinson, Robert Schreiber and Rong Pan. Large-Scale Parallel Collaborative Filtering for the Netflix Prize. Proceedings of the 4th international conference on Algorithmic Aspects in Information and Management. Shanghai, China pp. 337-348, 2008. pdf
Collaberative filtering:
D) SVD++ algorithm: Koren, Yehuda. "Factorization meets the neighborhood: a multifaceted collaborative filtering model." In Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, 426434. ACM, 2008. http://portal.acm.org/citation.cfm?id=1401890.1401944
Gradient descent method:
E) SGD (sotchastic gradient descent) algorithm: Matrix Factorization Techniques for Recommender Systems Yehuda Koren, Robert Bell, Chris Volinsky In IEEE Computer, Vol. 42, No. 8. (07 August 2009), pp. 30-37.
F) Tikk, D. (2009). Scalable Collaborative Filtering Approaches for Large Recommender Systems. Journal of Machine Learning Research, 10, 623-656.
SVD:
G) For Lanczos algorithm (SVD) see: wikipedia.
I got an additional wish-list from Marinka Zitnik, a graduate student University of Ljubljana, Slovenia. She is working on implementing them as part of the GSoC Orange project. This is a very interesting list of matrix factorization algorithms. I marked in red the ones which are already implemented as part of GraphLab matrix factorization library. I am looking for volunteers who like to help us implement some more!
Non-negative matrix factorization
Alternating Least Square NMF. It is meant to be very fast compared to other approaches.
Probabilistic matrix factorization. Model which scales linearly
Bayesian factor regression model. Markov chain Monte Carlo technique.
palatalization (and are implemented in GraphLab open source collaborative filtering library):
Probabilistic matrix/tensor factorization:
A) Liang Xiong, Xi Chen, Tzu-Kuo Huang, Jeff Schneider, Jaime G. Carbonell, Temporal Collaborative Filtering with Bayesian Probabilistic Tensor Factorization. In Proceedings of SIAM Data Mining, 2010. html (source code is also available).
B) Salakhutdinov and Mnih, Bayesian Probabilistic Matrix Factorization using Markov Chain Monte Carlo. in International Conference on Machine Learning, 2008. pdf, since our code implements matrix factorization as a special case of a tensor as well.
Alternating least squares:
C) Yunhong Zhou, Dennis Wilkinson, Robert Schreiber and Rong Pan. Large-Scale Parallel Collaborative Filtering for the Netflix Prize. Proceedings of the 4th international conference on Algorithmic Aspects in Information and Management. Shanghai, China pp. 337-348, 2008. pdf
Collaberative filtering:
D) SVD++ algorithm: Koren, Yehuda. "Factorization meets the neighborhood: a multifaceted collaborative filtering model." In Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, 426434. ACM, 2008. http://portal.acm.org/citation.cfm?id=1401890.1401944
Gradient descent method:
E) SGD (sotchastic gradient descent) algorithm: Matrix Factorization Techniques for Recommender Systems Yehuda Koren, Robert Bell, Chris Volinsky In IEEE Computer, Vol. 42, No. 8. (07 August 2009), pp. 30-37.
F) Tikk, D. (2009). Scalable Collaborative Filtering Approaches for Large Recommender Systems. Journal of Machine Learning Research, 10, 623-656.
SVD:
G) For Lanczos algorithm (SVD) see: wikipedia.
I got an additional wish-list from Marinka Zitnik, a graduate student University of Ljubljana, Slovenia. She is working on implementing them as part of the GSoC Orange project. This is a very interesting list of matrix factorization algorithms. I marked in red the ones which are already implemented as part of GraphLab matrix factorization library. I am looking for volunteers who like to help us implement some more!
Non-negative matrix factorization
Standard NMF based on Kullbach Leibler divergence[1,3],simple multiplicative updates, enhanced to avoid numerical overfow [2]
[1] Kim H, Park H. Sparse non-negative matrix factorizations via alternating non-negativity-
constrained least squares for microarray data analysis. Bioinformatics (Oxford, England)
2007, 23:1495--502.
[2] Lee, D. D. and Seung, H. S. Algorithms for non-negative matrix factorization. NIPS,2000,
556--562. [Implemented in GraphLab]
[3] Brunet, J. P., Tamayo, P., Golub, T. R., and Mesirov, J. P. Metagenes and molecular
pattern discovery using matrix factorization. Proc Natl Acad Sci U S A, 2004, 101(12),
4164--4169.
Sparse non negative matrix factorization
Sparse NMF based on alternating non-negativity constrained least squares, solved by
a fast non-negativity constrained least squares. Sparseness imposed on the left, right factor.
[4] Kim, H., Park, H. Sparse non-negative matrix factorizations via alternating non-
negativity-constrained least squares for microarray data analysis. Bioinformatics (Oxford,
England), 2007, 23(12), 1495-502.
Non-smooth NMF. Uses a modifed version of Lee and Seung's multiplicative
updates for Kullbach-Leibler divergence. It is meant to give sparser results.
[5] Pascual-Montano, A., Carazo, J. M., Kochi, K., Lehmann, D., and Pascual-Marqui, R. D.
Nonsmooth nonnegative matrix factorization (nsnmf ). IEEE transactions on pattern anal-
ysis and machine intelligence, 2006 28(3), 403--415.
Local Fisher NMF. Add the Fisher constraint
to maximize the between-class scatter and minimize the within-class scatter.
[6] Wang, Y., Turk, M. Fisher non-negative matrix factorization for learning local features.
Alternating Least Square NMF. It is meant to be very fast compared to other approaches.
[4] Kim, H., Park, H. Sparse non-negative matrix factorizations via alternating non-
negativity-constrained least squares for microarray data analysis. Bioinformatics (Oxford,
England), 2007, 23(12), 1495-502.
Comment: alternating least squares is currently implemented in GraphLab.
Probabilistic matrix factorization. Model which scales linearly
with the number of observations and performs well on large, sparse, imbalanced datasets.
[7] Salakhutdinov, R., Mnih, A. Probabilistic Matrix Factorization. Learning.
[Implemented in GraphLab]
Nonnegative matrix approximation. Method for dimensionality reduction with respect on the nonnegativity of input data. Multiplicative iterative scheme.
[8] Sra, S.,Dhillon, I. S. Nonnegative Matrix Approximation : Algorithms and Applications.
Sciences-New York, 2006, 1--36.
Interval-valued NMF.
[9] Zhiyong Shen, Liang Du, Xukun Shen, Yidong Shen, Interval-valued Matrix Factorization
with Applications, Data Mining, IEEE International Conference on, pp. 1037--1042, 2010
IEEE International Conference on Data Mining, 2010.
Interval-valued PMF.
[9] Zhiyong Shen, Liang Du, Xukun Shen, Yidong Shen, Interval-valued Matrix Factorization
with Applications, Data Mining, IEEE International Conference on, pp. 1037--1042, 2010
IEEE International Conference on Data Mining, 2010.
Probabilistic Sparse MF. PSMF allows for varying levels of
sensor noise, uncertainty in the hidden prototypes used
to explain the data and uncertainty as to the prototypes selected to explain each data vector.
[10] Dueck, D., Morris, Q. D., Frey, B. J. Multi-way clustering of microarray data using
probabilistic sparse matrix factorization. Bioinformatics (Oxford, England), 21 Suppl 1,
2005, 144-51.
Bayesian decomposition. A Bayesian treatment of NMF, based on a normal likelihood
and exponential priors, Gibbs sampler to approximate the posterior density.
[11] Schmidt, M. N., Winther, O., Kai Hansen, L. Bayesian non-negative matrix factorization.
bfrm
Bayesian factor regression model. Markov chain Monte Carlo technique.
[12] Schmidt, M. N. Bayesian non-negative matrix factorization. Bayesian Statistics 7 (Ox-
ford), 2003.
Variational Bayesian decomposition Linearly constrained Bayesian decomposition..
[13] Ochs, M. F., Kossenkov A. V. NIH Public Access. Methods, Methods Enzymol., 2009,
467: 59--77.
Graphlab user feedback form
As we start to gain hundreds of Graphlab users, we would like to understand better
who are the users, what are they working on and what are the future features they would like to see in GraphLab. We would also love to learn about installation experience, bugs (we have no bugs they are only features.. :-), new algorithms you would like to see implemented.
I have created a user feedback form (given below).
We would appreciate if anyone who is using Graphlab could take 5 minutes to complete this form. All data collected is confidential and will be used only for improving our open source project. PLEASE TAKE 5 MINUTES OF YOUR TIME AND HELP THE OPEN SOURCE COMMUNITY!!
Once you are done, hit "ENTER" and your form will be recorded.
who are the users, what are they working on and what are the future features they would like to see in GraphLab. We would also love to learn about installation experience, bugs (we have no bugs they are only features.. :-), new algorithms you would like to see implemented.
I have created a user feedback form (given below).
We would appreciate if anyone who is using Graphlab could take 5 minutes to complete this form. All data collected is confidential and will be used only for improving our open source project. PLEASE TAKE 5 MINUTES OF YOUR TIME AND HELP THE OPEN SOURCE COMMUNITY!!
Once you are done, hit "ENTER" and your form will be recorded.
Saturday, June 4, 2011
Python wrapper for running GraphLab GaBP linear solver
I got from Daniel Zerbino, a postdoc in UCSC, who is working on smoothing genomic sequencing measurements with constraints, a Python script for converting data to GraphLab GaBP format. Here it is:
#!/usr/bin/env python ############################################# # Python wrapper for GaBP application in GraphLab # By Daniel Zerbino, based on Matlab code by Danny Bickson ############################################# import sys import os import tempfile import struct import numpy as np import subprocess ############################################# # Convenience binary writer/reader functions: ############################################# def writeDouble(F, x): F.write(struct.pack('<d', x)) def writeVector(F, vector): for X in vector: writeDouble(F, X) def writeInt(F, x): F.write(struct.pack('<i', x)) def readInt(F): return struct.unpack('<i', F.read(struct.calcsize('i')))[0] def readDouble(F): return struct.unpack('<d', F.read(struct.calcsize('d')))[0] def readVector(F, n): return [readDouble(F) for i in range(n)] ############################################# # Convenience MatLab like functions ############################################# def enumerate(M): for i in range(np.shape(M)[0]): for j in range(np.shape(M)[1]): # Beware of Matlab numbering!! yield i+1, j+1, M[i,j] def find(M): return filter(lambda X: X[2] != 0, enumerate(M)) ############################################# # script for exporting system of linear equations of the type Ax = y to # graphlab format. # Written by Danny Bickson, CMU # Input: fn - output file name # A - A mxn matrix (if A is square than m=n) # y - mx1 observation vector # x - nx1 known solution (optional, if not given will write a vector of zeros) # sigma - a vector m+nx1 of noise levels (optional, for non-square matrices only) # Conversion to Python by Daniel Zerbino, UCSC ############################################# def save_c_gl(fn, A, y, x=None, sigma_y=None, sigma_x=None, square=False): m, n = np.shape(A) if len(y) != m: sys.exit('y vector should be of len as the number of A rows (%i and %i resp.)' % (len(y), m)) if x is not None and len(x) != n: sys.exit('x vector length should be as the matrix A columns') if sigma_y is None and sigma_x is not None: sys.exit("sigma_y should be provided when entering sigma_x") if sigma_x is None and sigma_y is not None: sys.exit("sigma_x should be provided when entering sigma_y") if sigma_x is not None: if square: sys.exit('sigma noise level input is allowed only for non-square matrices') else: if len(sigma_x) != n: sys.exit('sigma_x length should be number of cols of A') if len(sigma_y) != m: sys.exit('sigma_y length should be number of rows of A') if square: # matrix is square, edges are non digonal entries vals = find((A - np.diag(np.diag(A)))) print "Saving a square matrix A" else: # matrix is not square, edges are non zero values vals = find(A) print "Saving a non-square matrix A" F = open(fn, 'wb') #write matrix size writeInt(F, m) writeInt(F, n) # write y (the observation), x (the solution, if known), diag(A) the # variance (if known, else default variance of 1) writeVector(F, y) if x is not None: writeVector(F, x) else: writeVector(F, (0 for x in range(n))) if square: writeVector(F, diag(A)) else: if sigma_y is not None: writeVector(F, sigma_y) writeVector(F, sigma_x) else: writeVector(F, (1 for x in range(m+n))) #write number of edges assert len(vals) > 0 writeInt(F, len(vals)) # pad with zeros for 64 bit offset writeInt(F, 0) if not square: offset = m else: offset = 0 #write all edges for val in vals: writeInt(F, val[0]) writeInt(F, val[1] + offset) writeDouble(F, val[2]) F.close() #verify written file header F = open(fn,'rb') x = readInt(F) assert x == m F.close() print 'Wrote succesfully into file: %s' % fn ######################################################## #script for reading the output of the GaBP GraphLab program into matlab # returns x = inv(A)*b as computed by GaBP # returns diag = diag(inv(A)) - an approximation to the main diagonal of # the inverse matrix of A. # Written by Danny Bickson, CMU # Conversion to Python by Daniel Zerbino, UCSC ######################################################## def load_c_gl(filename, columns): F = open(filename, 'rb') x = readVector(F, columns) diag = readVector(F, columns) F.close() os.remove(filename) return x, diag ######################################################## ## Wrapper Utility to be used from outside ######################################################## def runGaBP(convergence, A, y, sigma_y=None, x=None, sigma_x=None, square=False): file, input = tempfile.mkstemp(dir='.') save_c_gl(input, A, y, x=x, sigma_y=sigma_y, sigma_x=sigma_x, square=square) args = ['gabp', '--data', input, '--threshold', str(convergence), '--algorithm', '0', '--scheduler=round_robin', '--square'] if not square: args.append('false') else: args.append('true') print "Running " + " ".join(args) ret = subprocess.Popen(args, stdout=sys.stdout, stderr=subprocess.STDOUT).wait() if ret != 0: sys.exit("GaBP did not complete") os.remove(input) x2, diag = load_c_gl(input + ".out", len(x)) return x2, diag ######################################################### ## Unit test ######################################################### def main(): A = np.array([[0.2785, 0.9649],[0.5469, 0.1576],[0.9575, 0.9706]]) y = np.array([1.2434, 0.7045, 1.9281]) sigma_y= np.array([1e-10, 1e-10, 1e-10]) x = np.array([0, 0]) sigma_x = np.array([1, 1]) convergence = 1e-10 x2, diag = runGaBP(convergence, A, y, sigma_y=sigma_y, x=x, sigma_x=sigma_x) print 'A' print A print 'y' print y print 'Initial X' print x print 'Initial Error' print A.dot(x) - y print 'Final X' print x2 print 'Final Error' print A.dot(x2) - y print 'diag' print diag if __name__=='__main__': main()
Wednesday, June 1, 2011
On the performance of Graphlab for large scale linear solver
I just got some interesting performance numbers from Yun Teng, a master student in UCSB, who tried GraphLab GaBP implementation of a linear solver. Yun used steady state thermal problem:
This is a linear system of equations, where the matrix A has 1.2M rows x 1.2M columns and
In terms of running time, each iteration runs in about 0.2 seconds on 20 cores.
Which means we handle about 40M non-zero matrix entries a second.
This is a linear system of equations, where the matrix A has 1.2M rows x 1.2M columns and
8.4M non-zeros. Here are some performance numbers. For round-robin scheduler, running 70 iterations on a 32 core Opteron (2.4GHz) computer with 128GB of RAM with CentOS release 5.6.
We got the best speedup, around 12, using 24 cores:
For the chromatic scheduler, we got a slightly less good speedup of 8 using 20 cores.
In terms of running time, each iteration runs in about 0.2 seconds on 20 cores.
Which means we handle about 40M non-zero matrix entries a second.
Subscribe to:
Posts (Atom)