as spam, so the number of positive examples is rather low.
One of the classic ways of solving the classification problem is to use an SVM (support vector machine). SVMLight is a popular implementation of an SVM solver.
Here is a short script I wrote for converting the Hearst machine learning challenge data into SVMLight format (which is also the Pegasos format).
%function for converting hearst data to svm light format
%Input: number - the file number. 1-5 Model files. 6 - validation.
%       doclick or doopen - one of them should be 1 and the other zero, depending on the target.
%Written by Danny Bickson, CMU, July 2011.
%This script converts hearst machine learning challenge data into SVMlight format
%namely: ...
% for example
%-1 3:15.4 4:18 19:32
%
function []=convert2svm(number,doclick, doopen)
assert(number>=1 && number<=6);
row_offset = [0 400000 800000 1200000 1600000 0];
rows=[400000 400000 400000 400000 185421 9956];
cols=274;
assert(~(doopen && doclick));
assert(doclick || doopen);
terms273 = {'Sun', 'Mon','Tue', 'Wed', 'Thu', 'Fri', 'Sat'};
ids = num2cell(1:length(terms273));
dict273 = reshape({terms273{:};ids{:}},2,[]);
dict273 = struct(dict273{:});
if (number == 6)
    fid=fopen('validation.csv','r');
    outid=fopen('validation.txt','w');
else
    fid=fopen(['Modeling_', num2str(number), '.csv'],'r');
    if (doclick)
        outid=fopen(['svm', num2str(number), '.txt'],'w');
    else
        outid=fopen(['2svm', num2str(number), '.txt'],'w');
    end
end
assert(outid~=-1);
title=textscan(fid, '%s', 273, 'delimiter', ','); % read title
title=title{1};
title{274} = 'date';% field no. 273 is mistakenly parsed into two fields in matlab because of a ","
% go over rows
tic
for j=1:rows(number)-1
    if (mod(j,500) == 0)
        disp(['row ', num2str(j)]);
        toc
    end
    a=textscan(fid, '%s', 274,'delimiter', ',');
    a=a{1};
    for i=1:cols
        if (i == 1|| i == 2) %handle target
            if ((doclick&&i==1) || (doopen&&i==2))
                if (number == 6)
                    fprintf(outid,'%d ', -1); %target is unknown, write -1 as a placeholder
                else
                    fprintf(outid,'%d ', (2*strcmp(a{i},'Y'))-1);
                end
            end
        elseif (~strcmp(a{i} ,''))%if feature is non zero
            val=a{i};
            if (i == 73) % translate field of the type A01, B03, J05, etc. quickly into a number
                val = val(1)*26+val(3);
            elseif (i==273)
                val = val(2:end); %remove quotation mark
                val = dict273.(val);
            elseif (i==274) % translate date into a number
                val = datenum(a{274});
            else
                if (length(val) == 1)
                    val = uint8(val);
                elseif (sum(isletter(val))==0) % string is all digits, translate to double
                    val = str2double(val);
                else
                    val = sum(uint8(val));%translate a string into a number, using sum of chars, can use more fancy methods here
                end
            end
            fprintf(outid, '%d:%f ', i-2, val); % remove two from field number since first two fields are targets
        end
    end
    fprintf(outid, '\n');
end
fclose(fid);
fclose(outid);
end

The script can actually be run in parallel on a multicore machine. To run it, execute the following in a Linux shell (optimally if you have 11 cores):
for i in `seq 1 1 6`
do
  matlab -r "convert2svm($i,1,0)" &
  matlab -r "convert2svm($i,0,1)" &
done

The resulting files are svm1.txt through svm5.txt (using first target - open email), 2svm1.txt through 2svm5.txt (using second target - click email), and the validation.txt file. Next, you can merge the files using the command:
cat svm1.txt > total.txt
for i in `seq 2 1 5`
do
  cat svm$i.txt >> total.txt
done
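Once the merged file exists, it may be worth checking how unbalanced the two classes are before training. The sketch below is not part of the original script; `count_labels` is a hypothetical helper, and the tiny sample file only stands in for total.txt.

```shell
#!/bin/sh
# Hypothetical helper: count positive and negative targets in an
# SVMLight-format file (the first token on each line is the target).
count_labels() {
  awk '{ if ($1 + 0 > 0) pos++; else neg++ }
       END { printf "positive=%d negative=%d\n", pos, neg }' "$1"
}

# Small stand-in for total.txt so the sketch runs on its own.
sample=$(mktemp)
printf '%s\n' '-1 3:15.4 4:18 19:32' '1 2:7 5:1' '-1 8:12' > "$sample"
count_labels "$sample"
rm -f "$sample"
```

On the real data you would call `count_labels total.txt` instead of using the sample file.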
Hey Danny, thanks for sharing this.
I'm not familiar with MATLAB scripts.
How do you deal with missing values?
Hi S.A.S,
SVMLight takes only the observed values as input, in the format:
-1 3:5.2 4:-5.6 8:12
which means that the target was -1 (the user did not click the email), the value of feature number 3 was 5.2, the value of feature number 4 was -5.6, and so on.
Note that features 1, 2, 5, 6 and 7 were missing, so they are not given as input to SVMLight.
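To make the handling of missing values concrete, here is a minimal sketch that turns a comma-separated row into a sparse SVMLight line by skipping empty fields. The row layout (target first, then features) and the `csv_to_svmlight` name are assumptions made for illustration; this is not the conversion script from the post.

```shell
#!/bin/sh
# Hypothetical helper: read "target,f1,f2,..." rows on stdin and write
# SVMLight lines. Empty fields count as missing and are left out.
csv_to_svmlight() {
  awk -F',' '{
    line = $1                    # first column is the target (+1 or -1)
    for (i = 2; i <= NF; i++)    # column i holds feature number i-1
      if ($i != "")              # missing value: emit nothing for it
        line = line " " (i - 1) ":" $i
    print line
  }'
}

# Features 3, 4 and 8 are observed; the other fields are empty.
printf '%s\n' '-1,,,5.2,-5.6,,,,12' | csv_to_svmlight
```

This prints `-1 3:5.2 4:-5.6 8:12`, the same line as in the reply above.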
Hi,
I have some previous experience running SVMLight on high-dimensional, sparse categorical data, but the results were not very impressive. How did it turn out in your case? It would be great if you could share the results, if you don't mind.
Thanks,
Venki
Not many results yet. I have not had much time to work on it, and I will update as I make more progress. One interesting thing I noticed about the two targets, click vs. open, is that the first is much easier to classify and the latter is harder, but it may be that I need to fine-tune the SVM parameters better.
Hey Danny,
Pegasos only gives the model file. How can I predict using that? Can I use svm_perf_classify, or svm_light_classify?
This is not Pegasos, it is SVMLight (a different piece of software). You should use svm_classify. See the documentation here: http://svmlight.joachims.org/
Best,
DB
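For readers following along, a rough sketch of the prediction step: svm_classify writes one decision value per line, and the sign of that value gives the predicted class. The file names `model` and `predictions` and the helper `predictions_to_labels` are assumptions for illustration. The SVMLight commands are shown as comments because they need the SVMLight binaries installed.

```shell
#!/bin/sh
# Training and classification with SVMLight (file names are assumed):
#   svm_learn total.txt model
#   svm_classify validation.txt model predictions

# Hypothetical helper: map each decision value on stdin to a class label.
# A value of exactly zero is mapped to -1 here, which is an assumption.
predictions_to_labels() {
  awk '{ if ($1 + 0 > 0) print "+1"; else print "-1" }'
}

# Stand-in decision values so the sketch runs without SVMLight.
printf '0.73\n-1.2\n0.05\n' | predictions_to_labels
```

With a real run you would pipe the file in instead: `predictions_to_labels < predictions`.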
Hey Danny
I have to convert tweet text data to SVMLight format. How should I do it?
Please help.
Take a look here: http://bickson.blogspot.co.il/2012/09/graphchi-parsers-toolkit.html
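Independently of the toolkit linked above, the basic idea can be sketched as a bag-of-words conversion: assign each distinct word a feature number and use its count in the tweet as the value. This is only a minimal illustration and not how the linked parsers work. The input layout (label, a tab, then the text) and the name `text_to_svmlight` are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: read "label<TAB>text" lines on stdin and write
# SVMLight lines with word counts as feature values.
text_to_svmlight() {
  awk -F'\t' '{
    text = tolower($2)
    gsub(/[^a-z0-9]+/, " ", text)       # keep only letters and digits
    n = split(text, words, " ")
    for (k in cnt) delete cnt[k]         # reset counts for this line
    for (j = 1; j <= n; j++) {
      w = words[j]
      if (!(w in id)) id[w] = ++vocab_size   # new word: next feature number
      cnt[id[w]]++
    }
    line = $1
    for (f = 1; f <= vocab_size; f++)    # emit features in increasing order
      if (f in cnt) line = line " " f ":" cnt[f]
    print line
  }'
}

printf '+1\thello world hello\n-1\tworld of tweets\n' | text_to_svmlight
```

Feature numbers are assigned in order of first appearance, so the same vocabulary would have to be reused when converting test data. The sketch leaves that out.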