Large Scale Machine Learning and Other Animals: logistic regression

Showing posts with label logistic regression. Show all posts

Thursday, September 8, 2011

More on shutgun (2)

One of the greatest things about writing a blog is in getting interesting feedback from the readers. (I mean if no one reads it - why bother??) Here is an email I got this morning from Steve Lianoglou, a graduate student in the Computational Systems Biology dept, Memorial Sloan-Kettering Cancer Center, Weill Medical College of Cornell University.

Hi,
..
I stumbled on the shotgun library from reading Dr. Bickson's (somehow)
recent blog post:

http://bickson.blogspot.com/2011/08/more-on-shutgun.html

I thought it could come in handy on some of the stuff I'm playing with
and wrote a wrapper for it so us R folks can use it, too ... I mean,
we can't lab the MATLAB tribe have all the fun, now, can we?

https://github.com/lianos/buckshot

Some installation and usage instructions are on the wiki:
https://github.com/lianos/buckshot/wiki

R has to be running in 64 bit for you to be able to build and install
the package successfully. It works on my OS X boxes, and our (linux)
compute server, so hopefully it can work for you.

It's a bit rough around the edges (ie. no documentation), but if
you're familiar with building models in R, you'll now how to use this:

R> library(buckshot)
R> data(arcene)
R> model <- buckshot(A.arcene, y.arcene, 'logistic', lambda=0.166)
R> preds <- predict(model, A.arcene)
R> sum(preds == y.arcene) / length(y.arcene)
[1] 1

Could get worse than 100% accuracy, I guess ...

In time, I hope to get it "easily installable" and push it out to
CRAN, but in the meantime I thought it would be of interest to the
people reading this list in its current form, and to the shotgun
authors (who I'm hoping are also reading this list), even if they
don't use R :-)

Thanks for putting shotgun out in the wild for us to tinker with!

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

Thanks a lot Steve! We really appreciate your efforts. The shotgun code has been significantly improved over the last two weeks. We are looking for more users to beta test it on real data. Write me if you are trying our code!

Sunday, August 28, 2011

More on shutgun

I got inquiries from several companies who wanted to use shotgun, our large scale sparse logistic regression / lasso solver. I thought about writing a quick tutorial that could be useful.

PAPER: shotgun paper appears in this year ICML 2011. The paper explains the approach we took to allow running our algorithm on multicore machines. It analyzes the theory, and justifies the cases where parallel execution does not hurt accuracy.

CODE: the shotgun code is found here: http://select.cs.cmu.edu/code/

TARGET: the goal of this code, is to handle large scale problems, that from the one hand fit into a multicore machine, but from the other hand, other solvers such as GLMNET, Boy'd l1 interior point methods and liblinear fail to scale.

LICENSE: The code is licensed under Apache license.

INTERFACES: We have both a C code version, as well as Matlab interface for running the code from within Matlab. Due to Patrick Harrington (OneRiot.com) request, we added support for Matrix Market input format.
Additional R interface is found here, thanks to Steve Lianoglou, Cornell graduate student.

COST FUNCTION:
We use the following cost function formulation.
For Lasso:
argmin_x sum_i [(A_i*x - y_i)^2 + lambda * |x|_1]
For sparse logistic regression:
argmin_x sum_i [-log(1 + exp(-y_i * x* A_i) ) + lambda * |x|_1]

where |x|_1 is the first norm (sum of absolute value of the vector x).

Matlab Usage:

x = shotgun_logreg(A,y,lambda)
x = shotgun_lasso(A,y,lambda)

C Usage:

./mm_lasso [ A input matrix_file] [y input vector file] [x vector output file] [algorithm] [ threshold] [ K] [max_iter] [num_threads] [lammbda]
  Program inputs are:
Matrix and vector files are mandaroty inputs
Usage: ./mm_lasso
 -m matrix A in sparse matrix market format
  -v vector y in sparse matrix market format
  -o output file name (will contain solution vector x, default is x.mtx)
  -a algorithm (1=lasso, 2=logitic regresion, 3 = find min lambda for all zero solution)
  -t convergence threshold (default 1e-5)
  -k solution path length (for lasso)
  -i  max_iter (default 100)
  -n num_threads (default 2)
  -l lammbda - positive weight constant (default 1)
  -V verbose: 1=verbose, 0=quiet (default 0)

Monday, June 27, 2011

Large scale logistic regression on multicore

A few days ago, I wrote about our new ICML paper, which discusses how to speed up L1 regularized loss (either LASSO or logistic regression) on multicore. The paper is a joint work with Joseph Bradley, Aapo Kyrola and Carlos Guestrin.

Today, we have released a parallel Matlab mex code here. The code was written by Aapo Kyrola, CMU.

To give some motivation why a parallel code is needed, I will give an example. In their paper: Kogan, S., Levin, D., Routledge, B.R., Sagi, J.S., and Smith, N.A. Predicting risk from financial reports with regression. In Human Language Tech.-NAACL, 2009. Kogan et. al gives and example on how to predict stock behavior by analyzing diagrams of words that appear in the report. The computational problem is that the data matrix is rather large: it is a sparse matrix of size 30465 x 216842 with 0.0575% of non-zeros.

We tried to compare to other efficient l1 algorithms, but the best performance (after our own method, shotgun) was obtained by Boyd's interior point method. The following Graph shows the running time:

Shotgun, is about x65 times faster than the interior point method. (All tests ran on an AMD processor using up to 8 Opteron 8384 cores (2.69 GHz)).

To make things worse, on the full data set of bigrams, (matrix of size 30465 x 5845762 ) we did not find even a single algorithm for comparison that could cope with the data magnitude. Shotgun crunched this dataset in 2000 seconds.

The following graphs summarize the properties of shotgun vs. 4 categories of datasets:
compressed sensing (Sparco), single pixel camera, sparse compressed images, and large sparse datasets.

Whenever the tickers are above the diagonal line, Shotgun is performing faster. The algorithms for comparison used are: Shooting, L1_LS, FPC_AS, GPSR_BB, Hard_l0, SpaRSA. Except of the single pixel camera datasets, shotgun performs better on all datasets. The worser performance is explained by the theoretical P* bound, which limits the number of processors to be used efficiently in parallel. For the single pixel camera dataset the bound is 3, so using 8 cores in parallel does not gain much, since only 3 of them are effective and the others actually hurt performance.

As always, comparison between algorithms should be taken with a grain of salt, since the termination conditions are different for each algorithm. But overall, shotgun has a significant speedup over competing methods.

Some comments we got from Alex Smola, Yahoo! Research:
a) you probably want to look at the Beck & Teboulle 2008 paper
http://iew3.technion.ac.il/~becka/papers/71654.pdf
the issue is that FISTA (yes, they named their algorithm this way ...) has O(1/T^2) convergence rather than O(1/T).

in fact, FISTA could be implemented in parallel rather trivially since all it needs is a subgradient. in fact, i guess you could even use noisy subgradient estimates initially where the noise is less than the accuracy penalty (this would be a simple application of concentration inequalities to get it done - that's _not_ in their paper but imho quite obvious).

also, the actual thresholding stage parallelizes very nicely, so it shouldn't be to hard to build overall.

b) you should be able to get some additional mileage from whitening the A^\top A matrix, at least with a diagonal preconditioner. that way you end up rescaling updates in accordance to their mutual interference. this should be quite helpful in particular since A will also give you some idea of the subspace generating the gradient, so things are effectively scaling like (A^\top A)^2 rather than (A\top A) when the gradient and the interference term are evaluated.

c) a slight wrinkle in your analysis is that you only look at expected improvement. while this is perfectly good for the purpose of getting an intuition of convergence and capturing most of the behavior, it doesn't quite suffice to _prove_ convergence without additional gymnastics (not sure whether the Shalev-Schwartz proof has it).

d) equation (10) probably has a spurious 1/2 factor on the rhs, or at least i can't see how you got it. it works out in the end since (8) is probably where it came from ...

e) you probably want to look at a partitioning which has small interference since ultimately you don't really need to draw from an independent selection distribution. for instance, imagine that you had a chain. in this case if you drew every third element you'd be safe (think of the chromatic engine in graphlab).

while this is generally not possible, it has the flavor of a jacobi-like approach to it. i.e. randomize over decompositions that minimize interactions, provided that the overall distribution sill picks each coordinate uniformly. this would work well e.g. in a graph.

Thanks Alex!

Friday, May 20, 2011

Large scale logistic regression

L1 regularized logistic regression is one of the widely used applied machine learning algorithms. Recently I attended an interesting talk by Rajat Raina from Facebook, who is working on both ad placement and new friends recommendation. It seems that Facebook is using logistic regression for ranking which ads to display. I also talked to Abhinav Gputa, co-Founder of RocketFuel, a successful Bay area startup that optimizes ads placement. RocketFuel is also using logistic regression for ranking ads. I further hear that Google is using it as well in Gmail ad placement.

A related problem domain to L1 regularization is compressed sensing - a great resource to learn about compressed sensing is found in Igor Caroon's blog. In the ML community, this problem is called also LASSO.

So why is logistic regression widely used? Some of the reasons are that it rather simple to compute, since it represent a convex optimization problem, that has a global unique minimum. However, when the problem size becomes large, it becomes harder and harder to efficiently compute it. Our new paper Parallel Coordinate Descent for L1-Regularized Loss Minimization (to appear on this year ICML ) gives some insight about how to compute large scale L1 regularized loss efficiently on large scale problems.

As an executive summary, I summarize below our experience in computing large scale logistic regression below. For medium size problems (problem that fit in Matlab), Kim and Boyd interior point method outperforms other algorithms, both in accuracy and speed. It is further robust and converges well on many datasets. Unfortunately, interior point methods do not scale well for large data sets, since they involve the computation of the Hessian, which is quadratic in the number of variables.

I believe that there are two possible approaches for tackling the computational problem on large datasets (i.e. problems that do not fit into a memory of one machine). One of them is our new Shotgun algorithm which works by parallelising iterative coordinate descent (the shooting algorithm). Other useful approach is parallel stochastic gradient descent. The drawback is the L1 sparsity is not obtained using the gradient descent and thus small values have to be truncated to zero.

Large Scale Machine Learning and Other Animals

Thursday, September 8, 2011

More on shutgun (2)

Sunday, August 28, 2011

More on shutgun

Monday, June 27, 2011

Large scale logistic regression on multicore

Friday, May 20, 2011

Large scale logistic regression

Labels

GraphLab Users Google Group

pagerank

google analytics

syntax