Monday, July 18, 2011

Hearst Machine Learning Challenge - Converting inputs to SVMLight format

Now that the excitement over our 5th place in KDD CUP 2011 has died down a little, I have started looking at other interesting problems. The Hearst machine learning challenge has some interesting data: about 1M emails are given, with 273 sparse features. The task is to classify a set of validation emails, predicting whether the user opened the email and whether they clicked on the link within it. The problem is not easy because the data is highly skewed: most users ignore ad emails as spam, so the number of positive examples is rather low.

One of the classic ways of solving a classification problem is to use an SVM (support vector machine). SVMLight is a popular SVM solver implementation.

Here is a short script I wrote for converting the Hearst machine learning challenge data into SVMLight format (and also Pegasos format).

%Function for converting Hearst data to SVMLight format
%Input: number - the file number. 1-5: modeling files. 6: validation file.
%       doclick or doopen - exactly one of them should be 1 and the other 0, depending on the target.
%Written by Danny Bickson, CMU, July 2011.
%This script converts Hearst machine learning challenge data into SVMLight format,
%namely: <target> <feature>:<value> <feature>:<value> ...
% for example
%-1 3:15.4 4:18 19:32
%

function []=convert2svm(number,doclick, doopen)

assert(number>=1 && number<=6);
row_offset = [0 400000 800000 1200000 1600000 0];
rows=[400000 400000 400000 400000 185421 9956];
cols=274;

assert(~(doopen && doclick));
assert(doclick || doopen);
% map day-of-week abbreviations to the numbers 1-7 (used for field 273)
terms273 = {'Sun', 'Mon','Tue', 'Wed', 'Thu', 'Fri', 'Sat'};
ids   = num2cell(1:length(terms273));
dict273  = reshape({terms273{:};ids{:}},2,[]);
dict273  = struct(dict273{:});

if (number == 6)
  fid=fopen('validation.csv','r');
  outid=fopen('validation.txt','w');
else
  fid=fopen(['Modeling_', num2str(number), '.csv'],'r');
  if (doclick)
    outid=fopen(['svm', num2str(number), '.txt'],'w');
  else
    outid=fopen(['2svm', num2str(number), '.txt'],'w');
  end
end

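assert(fid~=-1);   % input file must have been opened successfully
assert(outid~=-1); % output file must have been opened successfully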

title=textscan(fid, '%s', 273, 'delimiter', ','); % read title
title=title{1};
title{274} = 'date';% field no. 273 is mistakenly parsed into two fields in matlab because of a ","
% go over rows
tic
for j=1:rows(number)-1
  if (mod(j,500) == 0)
      disp(['row ', num2str(j)]);
      toc
  end
  a=textscan(fid, '%s', 274,'delimiter', ',');
  a=a{1};
  for i=1:cols
      if (i == 1|| i == 2) %handle target
        if ((doclick&&i==1) || (doopen&&i==2))
           if (number == 6)
             fprintf(outid,'%d ', -1); %target is unknown, write -1 as a placeholder
           else
             fprintf(outid,'%d ', (2*strcmp(a{i},'Y'))-1);
           end
        end
      elseif (~strcmp(a{i} ,''))%if feature is non zero
          val=a{i};
          if (i == 73) % translate field of the type A01, B03, J05, etc. quickly into a number
              val = val(1)*26+val(3);
          elseif (i==273)
              val = val(2:end); %remove quotation mark
              val = dict273.(val);
          elseif (i==274) % translate date into a number
              val = datenum(a{274});
          else
              if (length(val) == 1)
                val = uint8(val);
              elseif (sum(isletter(val))==0) % string is all digits, translate to double
                val = str2double(val);
              else
                val = sum(uint8(val));%translate a string into a number, using sum of chars, can use more fancy methods here
              end
          end

          fprintf(outid, '%d:%f ', i-2, val); % remove two from field number since first two fields are targets
      end
  end
  fprintf(outid, '\n');
end

fclose(fid);
fclose(outid);
end
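The core of the conversion, skipping empty fields and writing the remaining ones as index:value pairs, can be sketched on a toy CSV in a few lines of shell. This is only an illustration under stated assumptions: the helper name csv_to_svmlight, the column names and the all-numeric features are made up for the example. It is not the Hearst data, whose string fields need the special handling done in the MATLAB script above.

```shell
# Toy illustration (not the Hearst data): two target columns followed by features.
# csv_to_svmlight is a hypothetical helper; empty fields are treated as missing.
csv_to_svmlight() {
  # $1 = which target column to use (1 or 2); CSV rows are read from stdin
  awk -F',' -v target="$1" '
    NR == 1 { next }                      # skip the header row
    {
      label = ($target == "Y") ? 1 : -1   # Y -> 1, anything else -> -1
      line = label
      for (i = 3; i <= NF; i++)
        if ($i != "")                     # write observed features only
          line = line " " (i - 2) ":" $i  # first two columns are targets
      print line
    }'
}

# Made-up sample: header plus two rows with missing features
printf 'click,open,f1,f2,f3\nN,Y,,2.5,\nY,Y,7,,1\n' | csv_to_svmlight 1
```

As in the MATLAB script, feature numbers are shifted by two because the first two columns hold the targets.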
                                                                                                                                                  
The script can actually be run in parallel on a multicore machine. To do so, execute the following in a Linux shell (optimally if you have 11 cores):
for i in `seq 1 1 6`
do
matlab -r "convert2svm($i,1,0); exit" &
matlab -r "convert2svm($i,0,1); exit" &
done
The resulting files are svm1.txt through svm5.txt (written when doclick=1, i.e. for the click target), 2svm1.txt through 2svm5.txt (written when doopen=1, i.e. for the open target), and validation.txt. Next, you can merge the files using the commands:
cat svm1.txt > total.txt
for i in `seq 2 1 5`
do
cat svm$i.txt >> total.txt
done
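Before training, it may be worth sanity-checking the merged file. The sketch below uses a hypothetical helper, check_svmlight, that counts positive and negative examples (to see how skewed the data is) and flags lines that do not look like valid classification input. It assumes labels are 1, +1 or -1, that there are no trailing "#" comments, and, as SVMLight requires, that feature numbers appear in increasing order. The sample file is made up.

```shell
# check_svmlight is a hypothetical helper: summarize an SVMLight-format file.
check_svmlight() {
  awk '
    {
      # assumption: classification labels only
      if ($1 != "1" && $1 != "+1" && $1 != "-1") { bad++; next }
      prev = 0
      ok = 1
      for (i = 2; i <= NF; i++) {
        n = split($i, kv, ":")
        # each token must be index:value, with strictly increasing indices >= 1
        if (n != 2 || kv[1] + 0 <= prev) { ok = 0; break }
        prev = kv[1] + 0
      }
      if (!ok) { bad++; next }
      if ($1 == "-1") neg++; else pos++
    }
    END { printf "positive=%d negative=%d malformed=%d\n", pos, neg, bad }
  ' "$1"
}

# Made-up sample: the third line has feature numbers out of order
sample=$(mktemp)
printf '%s\n' '-1 3:15.4 4:18 19:32' '1 1:2 7:0.5' '-1 5:1 2:3' > "$sample"
check_svmlight "$sample"
rm -f "$sample"
```

On the real data you would run check_svmlight total.txt.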

9 comments:

  1. Hey Danny, thanks for sharing this.
    I'm not familiar with MATLAB scripts.
    How do you deal with missing values?

  2. Hi S.A.S,
    SVMLight gets as input only the observed values, with the format:
    -1 3:5.2 4:-5.6 8:12
    which means that the target was -1 (the user did not click the email), the value of feature number 3 was 5.2, the value of feature number 4 was -5.6, etc.
    Note that features 1, 2, 5, 6 and 7 were missing, so they are not given as input to SVMLight.

  3. Hi,
    I have some previous experience running SVMLight on high-dimensional categorical sparse data, but the results were not very impressive. How did it turn out in your case? It would be great if you could share the results, if you don't mind.

    Thanks,
    Venki

  4. This comment has been removed by the author.

  5. Not many results yet - I have not had much time to work on it - I will update as I make more progress. One interesting thing I noticed about the two targets, click vs. open, is that the first is much easier to classify and the latter is harder - but it may be that I need to fine-tune the SVM parameters better.

  6. Hey Danny,
    Pegasos only gives the model file. How can I predict using that? Can I use svm_perf_classify, or svm_light_classify?

  7. This is not Pegasos - it is SVMLight (a different software package). You should use svm_classify. See the documentation here: http://svmlight.joachims.org/

    Best,

    DB

  8. Hey Danny
    I have to convert tweet text data to SVMLight format. How should I do it?
    Please help.

    1. Take a look here: http://bickson.blogspot.co.il/2012/09/graphchi-parsers-toolkit.html
