Sunday, April 15, 2012

More on large scale SVM

About a month ago I posted here on large scale SVM. The conclusion of my post was that linear SVM is a solved problem, mainly due to the Pegasos stochastic gradient descent algorithm.

Today I had the pleasure of meeting Shai Shalev-Shwartz, the author of Pegasos. I asked Shai if he could explain to me (namely, for dummies..) why Pegasos works so well. This is what I heard. Pegasos is a stochastic gradient descent method. Instead of computing the costly exact gradient, a random data point is selected and an approximate gradient is computed based solely on this data point. The solution thus advances in random directions; however, in expectation, those random directions lead to the exact global solution.
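To make the "correct in expectation" claim concrete, here is a small NumPy sketch (my own illustration, not Shai's code): averaging the single-point hinge-loss subgradients over the whole data set recovers exactly the full-batch subgradient, so a uniformly random single-point subgradient is an unbiased estimate of it.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # 100 instances, 3 features
Y = np.sign(X @ np.array([1.0, -2.0, 0.5]))        # labels in {+1,-1}
w = rng.normal(size=3)                             # an arbitrary current iterate
lam = 0.1

def point_subgrad(i):
    # subgradient of lam/2*||w||^2 + max(0, 1 - Y[i]*X[i]@w) at w
    g = lam * w
    if Y[i] * X[i] @ w < 1:
        g = g - Y[i] * X[i]
    return g

# full-batch subgradient: lam*w minus the mean of y*x over margin violators
violators = Y * (X @ w) < 1
full = lam * w - (Y[:, None] * X)[violators].sum(axis=0) / len(X)

# mean of the single-point subgradients
avg = np.mean([point_subgrad(i) for i in range(len(X))], axis=0)

print(np.allclose(full, avg))
```

Since the average of the per-point subgradients equals the full one, picking a point uniformly at random gives a direction whose expectation is the true descent direction.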

I asked Shai if he could provide a simple MATLAB code snippet that demonstrates the essence of Pegasos. Here is the code I got from him:

% w=pegasos(X,Y,lambda,nepochs)
% Solve the SVM optimization problem without kernels:
%  w = argmin lambda/2 * w'*w + 1/m * sum(max(0,1-Y.*(X*w)))
% Input:
%  X - matrix of instances (each row is an instance)
%  Y - column vector of labels over {+1,-1}
%  lambda - scalar
%  nepochs - how many times to go over the training set
% Output: 
%  w - column vector of weights
%  Written by Shai Shalev-Shwartz, HUJI
function w=pegasos(X,Y,lambda,nepochs)

[m,d] = size(X);
w = zeros(d,1);
t = 1;
for i=1:nepochs        % passes over the full data
    for tau=1:m        % pick a single data point
        if (Y(tau)*X(tau,:)*w < 1)   % distance of the data point from the
                                     % separator is too small, or the data
                                     % point is on the wrong side of the separator
            % take a step in the direction of the negative gradient
            w = (1-1/t)*w + 1/(lambda*t)*Y(tau)*X(tau,:)';
        else
            % no loss term for this point: only shrink w (regularization)
            w = (1-1/t)*w;
        end
        t = t+1;       % increment counter
    end
end
You must agree with me that this is a very simple and elegant piece of code.
And here are my two stupid questions:
1) Why do we take a gradient step only when the margin is below the magic number 1?
2) Why do we update w even when no gradient step is needed?
Answers I got from Shai:
1) Either the data point is close to the separator, or it is far away from the separator but on the wrong side (i.e., y*w*x is a large negative number).
2) The actual distance from a point x to the separator is |w*x| / ||w||. So, to increase this number we want both |w*x| to be large and ||w|| to be small. So, even when we don't take a gradient step, we always want to shrink ||w||.
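Putting the code and answers together, here is a direct Python/NumPy port of the snippet above (my own translation, not Shai's code), together with a run on a hypothetical linearly separable toy data set:

```python
import numpy as np

def pegasos(X, Y, lam, nepochs):
    """Pegasos without kernels: minimize lam/2*||w||^2 + mean hinge loss."""
    m, d = X.shape
    w = np.zeros(d)
    t = 1
    for _ in range(nepochs):          # passes over the full data
        for tau in range(m):          # pick a single data point
            if Y[tau] * X[tau] @ w < 1:
                # margin violated: shrink w and step toward the point
                w = (1 - 1/t) * w + (1/(lam*t)) * Y[tau] * X[tau]
            else:
                # margin satisfied: only shrink w (regularization)
                w = (1 - 1/t) * w
            t += 1
    return w

# toy data: two clusters shifted to +2 or -2, labeled by the sign of the
# feature sum, so w proportional to (1, 1) separates them perfectly
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2)) + np.where(rng.random(200) < 0.5, 2.0, -2.0)[:, None]
Y = np.sign(X[:, 0] + X[:, 1])
w = pegasos(X, Y, lam=0.1, nepochs=10)
acc = np.mean(np.sign(X @ w) == Y)
print(acc)
```

The 1/(lam*t) step size shrinks over time, so early steps are large and later ones fine-tune; after a few epochs the learned w should classify this toy set almost perfectly.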


  1. Actually I also apply Pegasos successfully (e.g. in my gossip-based learning framework). I just would like to add a short comment to answer 2. Concretely, this shrinking mechanism results from the fact that the objective function of Pegasos contains a regularization term, which makes the algorithm more elegant/accurate.

  2. Suppose you did not have any training instances. Then you would like w to go to zero.

  3. Are you aware of any online/incremental SVM training tool/code? A few are mentioned here:

    But I am not sure which of them would work best for millions of data points.

    1. For linear models, VW (Vowpal Wabbit) is great; you can read about it here:
      You should use hinge loss for SVM.