## Sunday, April 15, 2012

### More on large scale SVM

About a month ago I posted here on large scale SVM. The conclusion of my post was that linear SVM is essentially a solved problem, mainly due to the Pegasos stochastic gradient descent algorithm.

Today I had the pleasure of meeting Shai Shalev-Shwartz, the author of Pegasos. I asked Shai if he could explain to me (namely, for dummies..) why Pegasos works so well. This is what I heard. Pegasos is a stochastic gradient descent method. Instead of computing the costly exact gradient, it picks a random data point and computes an approximate gradient based solely on that point. The method advances in random directions, but in expectation those random directions lead to the exact global solution.
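To convince myself that the expectation argument makes sense, here is a small NumPy check (my own sketch, not from Shai, on made-up random data): for the SVM objective, averaging the per-point subgradients over all points gives back the full-batch subgradient, so a uniformly random point yields an unbiased estimate of the true gradient direction.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 50, 3
X = rng.standard_normal((m, d))      # toy instances, one per row
Y = np.sign(rng.standard_normal(m))  # toy labels in {+1,-1}
w = rng.standard_normal(d)
lam = 0.1

def point_subgrad(i):
    # Subgradient at w of lam*w'w + max(0, 1 - y_i * x_i'w)
    g = 2 * lam * w
    if Y[i] * (X[i] @ w) < 1:
        g = g - Y[i] * X[i]
    return g

# Full-batch subgradient of lam*w'w + (1/m) * sum of hinge losses
active = (Y * (X @ w)) < 1
full = 2 * lam * w - (Y[active, None] * X[active]).sum(axis=0) / m

# Averaging the single-point subgradients recovers the batch subgradient
avg = np.mean([point_subgrad(i) for i in range(m)], axis=0)
assert np.allclose(avg, full)
```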

I asked Shai if he could provide a simple Matlab code that demonstrates the essence of Pegasos. Here is the code I got from him:

```matlab
% w = pegasos(X,Y,lambda,nepochs)
% Solve the SVM optimization problem without kernels:
%   w = argmin lambda*w'*w + 1/m * sum(max(0,1-Y.*(X*w)))
% Input:
%   X - matrix of instances (each row is an instance)
%   Y - column vector of labels over {+1,-1}
%   lambda - scalar
%   nepochs - how many times to go over the training set
% Output:
%   w - column vector of weights
% Written by Shai Shalev-Shwartz, HUJI
function w = pegasos(X,Y,lambda,nepochs)

[m,d] = size(X);
w = zeros(d,1);
t = 1;
for i = 1:nepochs              % iterations over the full data
    for tau = 1:m              % a single data point per step
        if (Y(tau)*X(tau,:)*w < 1)
            % The data point is too close to the separator,
            % or it is on the wrong side of the separator:
            % take a step in the negative gradient direction.
            w = (1-1/t)*w + 1/(lambda*t)*Y(tau)*X(tau,:)';
        else
            w = (1-1/t)*w;     % only shrink w (regularization step)
        end
        t = t+1;               % increment counter
    end
end
```

You must agree with me that this is a very simple and elegant piece of code.
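For readers who prefer Python, here is my rough NumPy translation of the same code (a sketch of mine, keeping Shai's variable names; the toy separable data at the end is my own addition):

```python
import numpy as np

def pegasos(X, Y, lam, nepochs):
    """Linear Pegasos: w = argmin lam*w'w + (1/m)*sum(max(0, 1 - Y*(X@w)))."""
    m, d = X.shape
    w = np.zeros(d)
    t = 1
    for _ in range(nepochs):           # iterations over the full data
        for tau in range(m):           # a single data point per step
            if Y[tau] * (X[tau] @ w) < 1:
                # margin too small, or point on the wrong side:
                # take a step in the negative gradient direction
                w = (1 - 1/t) * w + (1/(lam*t)) * Y[tau] * X[tau]
            else:
                w = (1 - 1/t) * w      # only shrink w (regularization step)
            t += 1
    return w

# Sanity check on a tiny linearly separable problem
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+2, 1, (20, 2)), rng.normal(-2, 1, (20, 2))])
Y = np.hstack([np.ones(20), -np.ones(20)])
w = pegasos(X, Y, 0.1, 20)
acc = np.mean(np.sign(X @ w) == Y)     # training accuracy, should be high
```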
And here are my two stupid questions:
1) What is the magic number 1 in the condition y*x*w < 1?
2) Why do we update w even when no gradient update is needed?

And the answers I got:
1) The condition holds when either the data point is too close to the separator, or it is far from the separator but on the wrong side (i.e., y*w*x is a large negative number).
2) The actual distance from a point x to the separator is |w*x| / ||w||. To increase this number we want |w*x| to be large and ||w|| to be small. So even when no gradient step is needed, we still want to shrink ||w||.
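A small numeric check of answer 2 (my own sketch, with made-up numbers): the "do nothing" branch w = (1-1/t)*w is exactly a stochastic gradient step on the regularization term alone, written as (lambda/2)*||w||^2 as in the usual Pegasos formulation, with step size eta_t = 1/(lambda*t).

```python
import numpy as np

lam, t = 0.5, 7                      # arbitrary illustrative values
w = np.array([1.5, -2.0, 0.25])

# Gradient step on the regularizer (lam/2)*||w||^2 alone,
# with Pegasos' step size eta_t = 1/(lam*t):
eta = 1 / (lam * t)
grad_reg = lam * w                   # gradient of (lam/2)*w'w
w_gradient_step = w - eta * grad_reg

# The "no update needed" branch from the code:
w_shrink = (1 - 1/t) * w

# Both branches shrink w by exactly the same factor
assert np.allclose(w_gradient_step, w_shrink)
```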