Large Scale Machine Learning and Other Animals: More on shotgun

Sunday, August 28, 2011

More on shotgun

I have received inquiries from several companies that want to use shotgun, our large-scale sparse logistic regression / Lasso solver, so I thought a quick tutorial could be useful.

PAPER: The shotgun paper appears at this year's ICML (2011). It explains the approach we took to run our algorithm on multicore machines, analyzes the theory, and characterizes the cases where parallel execution does not hurt accuracy.

CODE: the shotgun code is found here: http://select.cs.cmu.edu/code/

TARGET: The goal of this code is to handle large-scale problems that, on the one hand, fit into a single multicore machine but, on the other hand, are too large for other solvers such as GLMNET, Boyd's l1 interior point method, and liblinear.

LICENSE: The code is licensed under the Apache license.

INTERFACES: We provide both a C version and a Matlab interface for running the code from within Matlab. At the request of Patrick Harrington (OneRiot.com), we added support for the Matrix Market input format.
An additional R interface is found here, thanks to Steve Lianoglou, a Cornell graduate student.
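
For readers who have not used the Matrix Market format before, here is a minimal sketch of how a small matrix could be written in its sparse "coordinate" layout: a header line, a "rows cols nonzeros" line, then one 1-based "row col value" triple per nonzero. The helper name to_matrix_market is hypothetical, and whether mm_lasso accepts every variant of the format is an assumption that should be checked against the code.

```python
def to_matrix_market(dense_rows):
    """Return dense_rows (a list of equal-length lists) as Matrix Market coordinate text."""
    n_rows = len(dense_rows)
    n_cols = len(dense_rows[0]) if dense_rows else 0
    # Matrix Market indices are 1-based; only nonzero entries are stored.
    entries = [
        (i + 1, j + 1, value)
        for i, row in enumerate(dense_rows)
        for j, value in enumerate(row)
        if value != 0
    ]
    lines = [
        "%%MatrixMarket matrix coordinate real general",
        f"{n_rows} {n_cols} {len(entries)}",
    ]
    lines += [f"{i} {j} {value!r}" for i, j, value in entries]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # A 2x2 matrix A and a label vector y stored as a 2x1 matrix.
    print(to_matrix_market([[1.0, 0.0], [0.0, 2.5]]), end="")
    print(to_matrix_market([[1.0], [-1.0]]), end="")
```

The vector y is written here as an n-by-1 matrix, which is one common way to store a vector in this format.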

COST FUNCTION:
We use the following cost function formulation.
For Lasso:
argmin_x [ sum_i (A_i*x - y_i)^2 ] + lambda * |x|_1
For sparse logistic regression:
argmin_x [ sum_i log(1 + exp(-y_i * A_i*x)) ] + lambda * |x|_1

where A_i is the i-th row of A and |x|_1 is the L1 norm of x (the sum of the absolute values of its entries).
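
To make the two cost functions concrete, here is a small sketch that evaluates them for a given x on dense lists. It assumes the L1 penalty is applied once (not once per row), that the logistic term is the usual loss log(1 + exp(-y_i * A_i*x)) with labels y_i in {-1, +1}, and that no extra scaling constant is used; the function names are hypothetical and are not part of the shotgun code.

```python
import math


def dot(row, x):
    """Inner product of one row of A with x."""
    return sum(a * b for a, b in zip(row, x))


def l1_norm(x):
    """Sum of the absolute values of the entries of x."""
    return sum(abs(v) for v in x)


def lasso_objective(A, y, x, lam):
    """sum_i (A_i*x - y_i)^2 + lam * |x|_1"""
    squared_error = sum((dot(row, x) - yi) ** 2 for row, yi in zip(A, y))
    return squared_error + lam * l1_norm(x)


def logreg_objective(A, y, x, lam):
    """sum_i log(1 + exp(-y_i * A_i*x)) + lam * |x|_1, with labels in {-1, +1}."""
    loss = 0.0
    for row, yi in zip(A, y):
        margin = yi * dot(row, x)
        # Evaluate log(1 + exp(-margin)) without overflowing for large |margin|.
        if margin >= 0:
            loss += math.log1p(math.exp(-margin))
        else:
            loss += -margin + math.log1p(math.exp(margin))
    return loss + lam * l1_norm(x)
```

Comparing these values for two candidate solutions is a quick way to see which one a solver should prefer for a given lambda.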

Matlab Usage:

x = shotgun_logreg(A,y,lambda)
x = shotgun_lasso(A,y,lambda)
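
Shotgun is a parallel coordinate descent method, so a tiny sequential coordinate descent can serve as a reference when sanity-checking results on toy problems. The sketch below is not the shotgun implementation: it is single-threaded, works on dense lists, and assumes the Lasso objective sum_i (A_i*x - y_i)^2 + lambda * |x|_1 with no extra scaling, which is why the soft threshold is lambda/2. The function names and default values are chosen for illustration only.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(A, y, lam, max_iter=100, tol=1e-5):
    """Sequential coordinate descent for sum_i (A_i*x - y_i)^2 + lam * |x|_1."""
    n_rows, n_cols = len(A), len(A[0])
    x = [0.0] * n_cols
    residual = list(y)  # residual = y - A*x, and x starts at zero
    col_sq_norms = [sum(A[i][j] ** 2 for i in range(n_rows)) for j in range(n_cols)]
    for _ in range(max_iter):
        largest_change = 0.0
        for j in range(n_cols):
            if col_sq_norms[j] == 0.0:
                continue  # an all-zero column keeps its coefficient at zero
            # Correlation of column j with the residual that excludes x[j].
            rho = sum(A[i][j] * residual[i] for i in range(n_rows)) + col_sq_norms[j] * x[j]
            new_xj = soft_threshold(rho, lam / 2.0) / col_sq_norms[j]
            delta = new_xj - x[j]
            if delta != 0.0:
                # Keep the residual in sync with the updated coefficient.
                for i in range(n_rows):
                    residual[i] -= A[i][j] * delta
                x[j] = new_xj
                largest_change = max(largest_change, abs(delta))
        if largest_change < tol:
            break  # no coordinate moved by more than tol in this sweep
    return x
```

If a solver uses a differently scaled objective (for example a factor of 1/2 on the squared error), the same lambda will give a different solution, so the scaling should be matched before comparing outputs.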

C Usage:

./mm_lasso [A input matrix file] [y input vector file] [x vector output file] [algorithm] [threshold] [K] [max_iter] [num_threads] [lambda]
Program inputs are:
Matrix and vector files are mandatory inputs
Usage: ./mm_lasso
  -m matrix A in sparse matrix market format
  -v vector y in sparse matrix market format
  -o output file name (will contain solution vector x, default is x.mtx)
  -a algorithm (1=lasso, 2=logistic regression, 3=find min lambda for all zero solution)
  -t convergence threshold (default 1e-5)
  -k solution path length (for lasso)
  -i max_iter (default 100)
  -n num_threads (default 2)
  -l lambda - positive weight constant (default 1)
  -V verbose: 1=verbose, 0=quiet (default 0)
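
Algorithm 3 above reports the smallest lambda for which the solution is all zeros. How the program computes it is not described here, but for the Lasso objective sum_i (A_i*x - y_i)^2 + lambda * |x|_1 the textbook condition is that x = 0 is optimal exactly when lambda >= 2 * max_j |sum_i A_ij * y_i|. The sketch below computes that bound on dense lists; the function name is hypothetical, and the factor of 2 assumes the unscaled squared error (with a 1/2 factor on the squared error the bound would be halved).

```python
def min_lambda_all_zero(A, y):
    """Smallest lam for which x = 0 minimizes sum_i (A_i*x - y_i)^2 + lam * |x|_1."""
    n_cols = len(A[0])
    # Correlation of each column of A with y, i.e. the entries of A^T y.
    correlations = [
        sum(row[j] * yi for row, yi in zip(A, y))
        for j in range(n_cols)
    ]
    return 2.0 * max(abs(c) for c in correlations)


if __name__ == "__main__":
    print(min_lambda_all_zero([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.2]))
```

Values of lambda slightly below this bound give very sparse solutions, which makes it a natural starting point for a solution path.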
