Monday, June 27, 2011

Large scale logistic regression on multicore

A few days ago, I wrote about our new ICML paper, which discusses how to speed up L1 regularized loss (either LASSO or logistic regression) on multicore. The paper is a joint work with Joseph Bradley, Aapo Kyrola and Carlos Guestrin.

Today, we have released a parallel Matlab mex code here. The code was written by Aapo Kyrola, CMU.

To give some motivation why a parallel code is needed, I will give an example. In their paper: Kogan, S., Levin, D., Routledge, B.R., Sagi, J.S., and Smith, N.A. Predicting risk from financial reports with regression. In Human Language Tech.-NAACL, 2009. Kogan et. al gives and example on how to predict stock behavior by analyzing diagrams of words that appear in the report. The computational problem is that the data matrix is rather large: it is a sparse matrix of size 30465 x 216842  with 0.0575% of non-zeros.

We tried to compare to other efficient l1 algorithms, but the best performance (after our own method, shotgun) was obtained by Boyd's interior point method. The following Graph shows the running time:

Shotgun, is about x65 times faster than the interior point method. (All tests ran on an AMD processor using up to 8 Opteron 8384 cores (2.69 GHz)).

To make things worse, on the full data set of bigrams, (matrix of size 30465 x 5845762 ) we did not find even a single algorithm for comparison that could cope with the data magnitude. Shotgun crunched this dataset in 2000 seconds.

The following graphs summarize the properties of shotgun vs. 4 categories of datasets:
compressed sensing (Sparco), single pixel camera, sparse compressed images, and large sparse datasets.

Whenever the tickers are above the diagonal line, Shotgun is performing faster. The algorithms for comparison used are: Shooting, L1_LS, FPC_AS, GPSR_BB, Hard_l0, SpaRSA. Except of the single pixel camera datasets, shotgun performs better on all datasets. The worser performance is explained by the theoretical P* bound, which limits the number of processors to be used efficiently in parallel. For the single pixel camera dataset the bound is 3, so using 8 cores in parallel does not gain much, since only 3 of them are effective and the others actually hurt performance.

As always, comparison between algorithms should be taken with a grain of salt, since the termination conditions are different for each algorithm. But overall, shotgun has a significant speedup over competing methods.

Some comments we got from Alex Smola, Yahoo! Research:
a) you probably want to look at the Beck & Teboulle 2008 paper
the issue is that FISTA (yes, they named their algorithm this way ...) has O(1/T^2) convergence rather than O(1/T).

in fact, FISTA could be implemented in parallel rather trivially since all it needs is a subgradient. in fact, i guess you could even use noisy subgradient estimates initially where the noise is less than the accuracy penalty (this would be a simple application of concentration inequalities to get it done - that's _not_ in their paper but imho quite obvious).

also, the actual thresholding stage parallelizes very nicely, so it shouldn't be to hard to build overall.

b) you should be able to get some additional mileage from whitening the A^\top A matrix, at least with a diagonal preconditioner. that way you end up rescaling updates in accordance to their mutual interference. this should be quite helpful in particular since A will also give you some idea of the subspace generating the gradient, so things are effectively scaling like (A^\top A)^2 rather than (A\top A) when the gradient and the interference term are evaluated.

c) a slight wrinkle in your analysis is that you only look at expected improvement. while this is perfectly good for the purpose of getting an intuition of convergence and capturing most of the behavior, it doesn't quite suffice to _prove_ convergence without additional gymnastics (not sure whether the Shalev-Schwartz proof has it).

d) equation (10) probably has a spurious 1/2 factor on the rhs, or at least i can't see how you got it. it works out in the end since (8) is probably where it came from ...

e) you probably want to look at a partitioning which has small interference since ultimately you don't really need to draw from an independent selection distribution. for instance, imagine that you had a chain. in this case if you drew every third element you'd be safe (think of the chromatic engine in graphlab).

while this is generally not possible, it has the flavor of a jacobi-like approach to it. i.e. randomize over decompositions that minimize interactions, provided that the overall distribution sill picks each coordinate uniformly. this would work well e.g. in a graph.

Thanks Alex!

No comments:

Post a Comment