Today I had the pleasure of meeting Shai Shalev-Shwartz, the author of Pegasos. I asked Shai if he could explain to me (namely, for dummies...) why Pegasos works so well. This is what I heard: Pegasos is a stochastic gradient descent method. Instead of computing the costly exact gradient, a random data point is selected and an approximate gradient is computed based solely on this data point. The solver advances in random directions; however, in **expectation**, those random directions lead to the exact global solution.

I asked Shai if he could provide me with a simple piece of Matlab code that demonstrates the essence of Pegasos. Here is the code I got from him:

```matlab
% w = pegasos(X, Y, lambda, nepochs)
%
% Solve the SVM optimization problem without kernels:
%   w = argmin lambda/2 * w'*w + 1/m * sum(max(0, 1 - Y.*(X*w)))
%
% Input:
%   X       - matrix of instances (each row is an instance)
%   Y       - column vector of labels over {+1,-1}
%   lambda  - regularization scalar
%   nepochs - how many times to go over the training set
%
% Output:
%   w - column vector of weights
%
% Written by Shai Shalev-Shwartz, HUJI
function w = pegasos(X, Y, lambda, nepochs)

  [m, d] = size(X);
  w = zeros(d, 1);
  t = 1;
  for i = 1:nepochs        % iterations over the full data
    for tau = 1:m          % pick a single data point
      if (Y(tau)*X(tau,:)*w < 1)
        % The data point's distance from the separator is too small,
        % or the data point is on the wrong side of the separator:
        % take a step along the (negative) sub-gradient.
        w = (1 - 1/t)*w + 1/(lambda*t)*Y(tau)*X(tau,:)';
      else
        % Otherwise only shrink w (the regularization part of the gradient).
        w = (1 - 1/t)*w;
      end
      t = t + 1;           % increment counter
    end
  end
```

You must agree with me that this is a very simple and elegant piece of code.
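To see it run end to end, here is a minimal usage sketch of my own (the toy data, lambda, and nepochs below are illustrative assumptions, not part of Shai's code):

```matlab
% Toy example (illustrative values): two linearly separable Gaussian clouds.
m = 200; d = 2;
Xpos = randn(m/2, d) + 2;          % positive class centered at (+2,+2)
Xneg = randn(m/2, d) - 2;          % negative class centered at (-2,-2)
X = [Xpos; Xneg];
Y = [ones(m/2,1); -ones(m/2,1)];

lambda = 0.1; nepochs = 10;        % assumed hyper-parameters
w = pegasos(X, Y, lambda, nepochs);

err = mean(sign(X*w) ~= Y);        % training error of the learned separator
fprintf('training error: %.3f\n', err);
```

Note that the learned separator passes through the origin (there is no bias term), which is fine for this symmetric toy data.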

And here are my two stupid questions:

1) Why do we make the gradient update only when the margin is below the magic number 1?

2) Why do we update w even when a gradient update is not needed?

Answers I got from Shai:

1) Either the data point is close to the separator, or it is far away from the separator but on the wrong side (i.e., y*w*x is a large negative number).
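To make this concrete, here is a tiny check of the condition y*(x*w) < 1 with made-up numbers of my own; it fires both for a point that is on the correct side but too close, and for a point that is on the wrong side:

```matlab
% Illustrative check of the update condition y*(x*w) < 1 (made-up values).
w = [1; 0];                              % current separator: sign of x(1)
x_near  = [ 0.3, 5.0];  y_near  = +1;    % correct side, but margin < 1
x_wrong = [-2.0, 1.0];  y_wrong = +1;    % wrong side of the separator
x_good  = [ 3.0, 0.5];  y_good  = +1;    % correct side, comfortable margin

y_near  * (x_near  * w)    % 0.3  < 1  -> gradient step
y_wrong * (x_wrong * w)    % -2.0 < 1  -> gradient step (large negative margin)
y_good  * (x_good  * w)    % 3.0  >= 1 -> no gradient step, only shrinking
```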

2) The actual distance from a point x to the separator is |w*x| / (||w|| * ||x||). So, to increase this number we want both |w*x| to be large and ||w|| to be small. Therefore, even when we don't make a gradient step, we always want to shrink ||w||.
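A quick way to see why this shrinking is harmless (again, my own illustrative numbers): scaling w down does not change which side of the separator any point falls on, so predictions are untouched, while ||w|| and the regularization term get smaller:

```matlab
% Shrinking w rescales the scores but preserves their signs (the predictions),
% while reducing ||w|| and hence the regularization term lambda/2*||w||^2.
w = [2; -1];  x = [1.5, 0.5];
sign(x*w), sign(x*(0.9*w))    % same prediction before and after shrinking
norm(w), norm(0.9*w)          % but the norm (the regularizer) got smaller
```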

Actually, I also apply Pegasos successfully (e.g., in my gossip-based learning framework: http://arxiv.org/abs/1109.1396). I would just like to add a short comment on answer 2. Concretely, this shrinking mechanism results from the fact that the objective function of Pegasos contains a regularization term, which makes the algorithm more elegant/accurate.

Suppose you did not have any training instances. Then you would like w to go to zero.

Are you aware of any online/incremental SVM training tool/code? A few are mentioned here: http://www.quora.com/Support-Vector-Machines/What-are-some-good-tools-for-training-online-svm

But I am not sure which would work best for millions of data points.

For linear models, VW (Vowpal Wabbit) is great; you can read about it here:

http://bickson.blogspot.co.il/2012/01/vowal-wabbit-tutorial.html

You should use hinge loss for SVM.