I got the following from Aapo Kyrola:
Regarding partitioning, I recommend reading Prof. Mahoney's (Stanford) tutorial on real-world graphs: http://cs.stanford.edu/people/mmahoney/icml10-tut/
about real-world graphs: http://cs.stanford.edu/people/mmahoney/icml10-tut/
Here is a link to the video of the tutorial: http://videolectures.net/kdd2010_mahoney_gtgm/
Basically, he argues that although social network graphs have local structure (for example, your friend network on Facebook might be nicely clustered), they lack any global structure. This makes at least the current methods of graph partitioning useless on such graphs.
Thursday, July 28, 2011
Monday, July 25, 2011
GraphLab vs. Spark
I had a very interesting meeting today with Matei Zaharia, a graduate student from Berkeley with a very impressive publication record. He is one of the authors of Spark, a parallel cloud architecture targeted at iterative and interactive algorithms, where Hadoop does not perform well.
I was impressed with the demo of Spark, using 20 Amazon EC2 machines. Spark is implemented in Scala, which allows for a convenient Java-like programming interface.
What caught my attention is that one of the algorithms implemented on top of Spark was alternating least squares. This work will be reported at the upcoming SIGCOMM:
M. Chowdhury, M. Zaharia, J. Ma, M.I. Jordan and I. Stoica, Managing Data Transfers in Computer Clusters with Orchestra.
Here are some performance numbers given by Matei:
4 cores (master + 0 workers): 272s
12 cores (master + 1 worker): 106s
20 cores (master + 2 workers): 72s
28 cores (master + 3 workers): 57s
36 cores (master + 4 workers): 50s
Next, I tested multicore GraphLab using the same data and the same factorized matrix width (D=60). One iteration of alternating least squares in GraphLab takes 106 seconds on 8 cores, using an AMD Opteron multicore machine. This is very close to the Spark result with 12 cores.
Overall conclusion (by Matei): I think the main takeaway from this is that you should absolutely use a lot of cores on one machine if you have them, as communication is much faster. When you add nodes, the communication cost will lower the overall performance-per-CPU, but you will get lower response times too.
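Since alternating least squares comes up in both benchmarks, here is a toy sketch of the algorithm in Python/numpy. This is my own illustrative version (the function name, regularization value, and dense representation are all my choices), not the Spark or GraphLab code:

```python
import numpy as np

def als(R, mask, D=3, lam=0.1, iters=10):
    """Factor R ~= U @ V.T using only the observed entries marked True in `mask`."""
    m, n = R.shape
    rng = np.random.default_rng(0)
    U = rng.normal(size=(m, D))
    V = rng.normal(size=(n, D))
    for _ in range(iters):
        # Fix V, solve a small D x D regularized least-squares problem per row of U.
        for i in range(m):
            obs = mask[i]
            A = V[obs].T @ V[obs] + lam * np.eye(D)
            b = V[obs].T @ R[i, obs]
            U[i] = np.linalg.solve(A, b)
        # Fix U, symmetric update for each row of V.
        for j in range(n):
            obs = mask[:, j]
            A = U[obs].T @ U[obs] + lam * np.eye(D)
            b = U[obs].T @ R[obs, j]
            V[j] = np.linalg.solve(A, b)
    return U, V
```

Each half-step solves an independent D x D system per row, which is exactly what makes ALS easy to parallelize across cores or machines.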
Thursday, July 21, 2011
Fighting with Amazon EC2 AMI
There is no doubt that Amazon EC2 is one of the most successful and useful cloud services. However, a few days ago I had the nerve-wracking experience of trying to ec2-register an Amazon AMI image. This task is needed when you want to save your work so you can easily load it the next time you run. The Amazon AMI tools are among the worst-designed and worst-implemented tools I have ever encountered. You need a lot of patience when dealing with them. I wrote down some of the errors I encountered.
(I thought that having a PhD and 15 years of experience working on Linux would make me immune to these kinds of errors, but I was absolutely wrong..) The reader should be warned that I did not collect these errors from the web - I simply encountered all of the errors below myself, until eventually I got so tired that from a certain point I stopped documenting everything.
Basically, what you want to do is run 3 commands. Usually it should not take more than a few minutes. However, if you manage to run those commands in less than a few hours, you are absolutely lucky.
These are the commands you need to run:
sudo -E /opt/aws/bin/ec2-bundle-vol -k [path to your x.509 private key] -c [path to your x.509 public key] -u [Amazon 12 digit user id] -d /mnt -p [bundle file name] -r x86_64
sudo -E /opt/aws/bin/ec2-upload-bundle -b [bundle file name] -m /mnt/[bundle file name].manifest.xml -s [Amazon AWS secret string] -a [Amazon AWS ID string]
sudo -E ec2-register -K [location of X.509 private key] -C [location of x.509 certificate] --name [bucket name]/[image name] --region us-east-1 [bucket name]/[image name].manifest.xml
Potential problems. If the process failed and you tried again you may get an error:
/opt/aws/amitools/ec2/lib/ec2/platform/linux/image.rb:154:in `mount_image': image already mounted (FatalError)
from /opt/aws/amitools/ec2/lib/ec2/platform/linux/image.rb:81:in `make'
from /opt/aws/amitools/ec2/lib/ec2/amitools/bundlevol.rb:151:in `bundle_vol'
from /opt/aws/amitools/ec2/lib/ec2/amitools/bundlevol.rb:193:in `main'
from /opt/aws/amitools/ec2/lib/ec2/amitools/tool_base.rb:201:in `run'
from /opt/aws/amitools/ec2/lib/ec2/amitools/bundlevol.rb:201
Solution: using the "mount" command, find the mounted image and unmount it using the "sudo umount XXX" command.
Problem:
/opt/aws/bin/ec2-bundle-vol: line 3: EC2_HOME: Neither of EC2_AMITOOL_HOME or EC2_HOME environment variables are set
Solution:
Assuming AMITOOL is installed, try to use sudo -E. If this does not work, try to set (assuming you are working in a bash shell):
export EC2_AMITOOL_HOME=/home/ubuntu/ami-tools/ec2-ami-tools-1.3-57676/
where the path should point to where the ami-tools are installed. If they are not installed, you need to install the EC2 AMI tools, and then set the environment variable using the "setenv" or "export" command.
Problem:
--user has invalid value 'AKIAJWASWE2DSWQFKILA': the user ID should consist of 12 digits (optionally hyphenated); this should not be your Access Key ID Try 'ec2-bundle-vol --help'
Solution:
You gave the wrong key; look for a numeric key ID on the Amazon AWS website, in the format 0000-0000-0000. This key is especially hard to find among all the menus.
Problem:
ERROR: the specified image file /mnt/graphlab.org.realase_v1234 already exists
Solution: remove the image file created using the command
sudo rm -fR /mnt/graphlab.org.realase_v1234
Problem:
The specified bucket is not S3 v2 safe (see S3 documentation for details)
Solution: Looks like an EC2 bug - underscores and capital letters are allowed but result in this warning. If you try to ignore this warning at this point, you will get much worse errors later. Try to avoid this warning.
Problem:
mke2fs 1.41.12 (17-May-2010) error writing /etc/mtab.tmp: No space left on device
Solution:
You tried so many times that you ran out of disk space.. You need to clean up files, or restart the image and try again.
Problem:
Neither a 'manifest' or 'block-device-mapping' have been specified; at least one is required. (-h for usage)
Solution:
You should both use the -n flag to specify a name, and give the path bucketname/imagename.manifest.xml. By the way, the bucket name is flexible - it does not have to be the image name.
Problem:
1) Client.InvalidManifest: HTTP 403 (Forbidden) response for URL http://s3.amazonaws.com:80/graphlab.org.release_v1234/graphlab.org.release_v1234.mainfest.xml: check your S3 ACLs are correct.
2) Client.InvalidManifest: HTTP 404 (Not Found) response for URL http://s3.amazonaws.com:80/graphlab.org.realase_v1234/graphlab.org.realase_v1234.mainfest.xml: check your manifest path is correct and in the correct region.
Solution:
Something in the process has gone wrong - either the bucket name is wrong or the upload failed.. You need to redo everything correctly from the beginning.
Problem:
ERROR: Parameter problem: Expecting S3 URI with just the bucket name set instead of 'graphlab_org_release_v1234'
Solution:
You need to add the s3:// prefix when using the s3cmd mb command.
Problem:
ERROR: Error talking to S3: Curl.Error(51): SSL: certificate subject name '*.s3.amazonaws.com' does not match target host name 'graphlab.org.release.v1234.s3.amazonaws.com'.
Solution: no clue what I did - I started to lose focus at this point. Probably started all over again.. :-(
Problem: Client.InvalidAMIName.Duplicate: AMI name graphlaborgreleasev1234 is already in use by AMI ami-98946ef1
Solution: this happens when you try to register a new AMI with a name you already gave to an older AMI - you need to choose a new name.
Hopefully, after all this mess, you managed to ec2-register and got a printout of the form:
IMAGE AMI-12120930
HALLELUJAH.
And I ask: why not simply add a UI option to the AWS console to register an image???
Final comment: I had a quick email exchange with James Hamilton, a VP at Amazon, and I sent him this link. I got back the following note: "Sorry you had a bad experience with EC2."
I would like to take this opportunity to clarify that my overall experience with EC2 is very good. But still, some interfaces could be improved.
Monday, July 18, 2011
Hearst Machine Learning Challenge - Converting inputs to SVMLight format
After the excitement following our 5th place in KDD CUP 2011 has settled down a little, I started looking at other interesting problems. The Hearst machine learning challenge has some interesting data. About 1M emails are given, with 273 sparse features each. The task is to classify some validation emails and decide whether the user has opened the email and whether he clicked on the link within the email. The problem is not so easy since the data is highly skewed - most users ignore ad emails as spam, so the number of positive examples is rather low.
One of the classic ways of solving the classification problem is using an SVM (support vector machine). SVMLight is a popular implementation of an SVM solver.
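To make the target format concrete, here is a minimal Python sketch of emitting one SVMLight-format line: the label (+1 or -1), followed by index:value pairs for the non-zero features only. The helper name and the 1-based indexing convention are mine; consult the SVMLight documentation for the full format:

```python
def svmlight_line(label, features):
    """features: dict mapping a 1-based feature index to its non-zero value."""
    pairs = ' '.join('%d:%g' % (i, v) for i, v in sorted(features.items()))
    return '%d %s' % (label, pairs)
```

For example, svmlight_line(-1, {3: 15.4, 4: 18, 19: 32}) returns '-1 3:15.4 4:18 19:32', matching the format example in the script's comments below.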
Here is a short script I wrote for converting Hearst machine learning challenge data into SVMLight format (and also pegasos format).
%function for converting hearst data to svm light format
%Input: number - the file number. 1-5 Model files. 6 - validation.
%       doclick or doopen - one of them should be 1 and the other zero, depending on which target.
%Written by Danny Bickson, CMU, July 2011.
%This script converts hearst machine learning challenge data into SVMlight format, namely,
%for example:
%-1 3:15.4 4:18 19:32
function []=convert2svm(number, doclick, doopen)
assert(number>=1 && number<=6);
row_offset = [0 400000 800000 1200000 1600000 0];
rows=[400000 400000 400000 400000 185421 9956];
cols=274;
assert(~(doopen && doclick));
assert(doclick || doopen);
terms273 = {'Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat'};
ids = num2cell(1:length(terms273));
dict273 = reshape({terms273{:}; ids{:}}, 2, []);
dict273 = struct(dict273{:});
if (number == 6)
    fid=fopen('validation.csv','r');
    outid=fopen('validation.txt','w');
else
    fid=fopen(['Modeling_', num2str(number), '.csv'],'r');
    if (doclick)
        outid=fopen(['svm', num2str(number), '.txt'],'w');
    else
        outid=fopen(['2svm', num2str(number), '.txt'],'w');
    end
end
assert(outid~=-1);
title=textscan(fid, '%s', 273, 'delimiter', ','); % read title
title=title{1};
title{274} = 'date'; % field no. 273 is mistakenly parsed into two fields in matlab because of a ","
% go over rows
tic
for j=1:rows(number)-1
    if (mod(j,500) == 0)
        disp(['row ', num2str(j)]);
        toc
    end
    a=textscan(fid, '%s', 274, 'delimiter', ',');
    a=a{1};
    for i=1:cols
        if (i == 1 || i == 2) % handle target
            if ((doclick && i==1) || (doopen && i==2))
                if (number == 6)
                    fprintf(outid, '%d ', -1); % target is unknown, write -1 as a placeholder
                else
                    fprintf(outid, '%d ', (2*strcmp(a{i},'Y'))-1);
                end
            end
        elseif (~strcmp(a{i}, '')) % if feature is non zero
            val=a{i};
            if (i == 73) % translate a field of the type A01, B03, J05, etc. quickly into a number
                val = val(1)*26+val(3);
            elseif (i == 273)
                val = val(2:end); % remove quotation mark
                val = dict273.(val);
            elseif (i == 274) % translate date into a number
                val = datenum(a{274});
            else
                if (length(val) == 1)
                    val = uint8(val);
                elseif (sum(isletter(val)) == 0) % string is all digits, translate to double
                    val = str2double(val);
                else
                    val = sum(uint8(val)); % translate a string into a number using the sum of its chars; fancier methods could be used here
                end
            end
            fprintf(outid, '%d:%f ', i-2, val); % subtract two from the field number since the first two fields are targets
        end
    end
    fprintf(outid, '\n');
end
fclose(fid);
fclose(outid);
end
The script can actually be run in parallel on a multicore machine. The way to run it is to execute the following in a Linux shell (optimally if you have 11 cores):
for i in `seq 1 1 6`
do
    matlab -r "convert2svm($i,1,0)" &
    matlab -r "convert2svm($i,0,1)" &
done
The resulting files are svm1.txt -> svm5.txt (using the first target - open email), 2svm1.txt -> 2svm5.txt (using the second target - click email), and the validation.txt file. Next you can merge the files using the command:
cat svm1.txt > total.txt
for i in `seq 2 1 5`
do
    cat svm$i.txt >> total.txt
done
GraphLab - Machine Learning in the Cloud
A few days ago we released a detailed technical report about the performance of distributed GraphLab on Amazon EC2 with up to 64 nodes (512 cores total): http://arxiv.org/abs/1107.0922
We compared GraphLab using three applications: matrix factorization, CoEM (a variant of personalized pagerank, a named entity recognition algorithm), and video co-segmentation.
As a reference, we compared three platforms: Hadoop, MPI (message-passing interface) and GraphLab. In a nutshell, GraphLab runs about 20x to 100x faster than Hadoop, depending on the data and application. The main reason is that we perform all computation in memory and do not provide any fault tolerance. Compared to MPI, GraphLab has similar performance. The drawback of MPI is that the code has to be rewritten for each application, while GraphLab provides building blocks for iterative computation.
The following graph shows the speedup of the 3 applications using 64 Amazon HPC machines:
The baseline for the speedup calculation is 4 machines. For matrix factorization (the line denoted Netflix) we get a speedup of 16x on 64 machines. For video co-segmentation we get a speedup of 40x on 64 EC2 nodes. When we increase the factorized matrix width, the matrix factorization problem becomes computation-heavy and we get an even better speedup of 40x on 64 nodes.
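For readers wondering how such speedup figures are computed, here is a minimal sketch. The normalization convention (the 4-machine baseline itself counts as 4x) is my assumption, and the runtimes in the example are made-up placeholders, not the measured EC2 numbers:

```python
def speedup(baseline_time, time_on_n, baseline_machines=4):
    # Speedup relative to the baseline run, normalized so the baseline
    # itself counts as `baseline_machines`x (i.e. perfect scaling would
    # give a speedup equal to the machine count).
    return baseline_machines * baseline_time / time_on_n
```

With this convention, a job that takes 100s on 4 machines and 25s on 64 machines has a speedup of 16x, well short of the ideal 64x.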
Monday, June 27, 2011
Large scale logistic regression on multicore
A few days ago, I wrote about our new ICML paper, which discusses how to speed up L1 regularized loss (either LASSO or logistic regression) on multicore. The paper is a joint work with Joseph Bradley, Aapo Kyrola and Carlos Guestrin.
Today, we have released a parallel Matlab mex code here. The code was written by Aapo Kyrola, CMU.
To give some motivation why parallel code is needed, I will give an example. In their paper: Kogan, S., Levin, D., Routledge, B.R., Sagi, J.S., and Smith, N.A. Predicting risk from financial reports with regression. In Human Language Tech.-NAACL, 2009, Kogan et al. give an example of how to predict stock behavior by analyzing bigrams of words that appear in the report. The computational problem is that the data matrix is rather large: it is a sparse matrix of size 30465 x 216842 with 0.0575% non-zeros.
We tried comparing against other efficient L1 algorithms; the best performance (after our own method, shotgun) was obtained by Boyd's interior point method. The following graph shows the running time:

Shotgun is about 65x faster than the interior point method. (All tests ran on an AMD processor using up to 8 Opteron 8384 cores (2.69 GHz).)
To make things worse, on the full data set of bigrams (a matrix of size 30465 x 5845762), we did not find even a single algorithm for comparison that could cope with the magnitude of the data. Shotgun crunched this dataset in 2000 seconds.
The following graphs summarize the properties of shotgun on 4 categories of datasets: compressed sensing (Sparco), single-pixel camera, sparse compressed images, and large sparse datasets.
Whenever the markers are above the diagonal line, shotgun is performing faster. The algorithms used for comparison are: Shooting, L1_LS, FPC_AS, GPSR_BB, Hard_l0, and SpaRSA. Except for the single-pixel camera datasets, shotgun performs better on all datasets. The worse performance there is explained by the theoretical P* bound, which limits the number of processors that can be used efficiently in parallel. For the single-pixel camera dataset the bound is 3, so using 8 cores in parallel does not gain much, since only 3 of them are effective and the others actually hurt performance.
As always, comparison between algorithms should be taken with a grain of salt, since the termination conditions are different for each algorithm. But overall, shotgun has a significant speedup over competing methods.
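To make the per-coordinate update concrete, here is a toy sequential sketch in Python/numpy of the soft-thresholding step for the Lasso objective 0.5*||Ax - b||^2 + lam*||x||_1. The real shotgun code updates P coordinates concurrently; all names here are my own, not from the released mex code:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def shotgun_lasso(A, b, lam, iters=100, rng=None):
    rng = rng or np.random.default_rng(0)
    m, n = A.shape
    x = np.zeros(n)
    r = b - A @ x                      # residual, maintained incrementally
    col_sq = (A ** 2).sum(axis=0)      # squared column norms
    for _ in range(iters):
        for j in rng.permutation(n):   # shotgun would pick P of these at once
            aj = A[:, j]
            rho = aj @ r + col_sq[j] * x[j]   # correlation with partial residual
            new_xj = soft_threshold(rho, lam) / col_sq[j]
            r += aj * (x[j] - new_xj)  # keep the residual consistent
            x[j] = new_xj
    return x
```

Running the coordinate updates in parallel introduces interference between correlated columns, which is exactly what the P* bound in the paper quantifies.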
Some comments we got from Alex Smola, Yahoo! Research:
a) you probably want to look at the Beck & Teboulle 2008 paper
http://iew3.technion.ac.il/~becka/papers/71654.pdf
the issue is that FISTA (yes, they named their algorithm this way ...) has O(1/T^2) convergence rather than O(1/T).
in fact, FISTA could be implemented in parallel rather trivially since all it needs is a subgradient. in fact, i guess you could even use noisy subgradient estimates initially where the noise is less than the accuracy penalty (this would be a simple application of concentration inequalities to get it done - that's _not_ in their paper but imho quite obvious).
also, the actual thresholding stage parallelizes very nicely, so it shouldn't be too hard to build overall.
b) you should be able to get some additional mileage from whitening the A^\top A matrix, at least with a diagonal preconditioner. that way you end up rescaling updates in accordance to their mutual interference. this should be quite helpful in particular since A will also give you some idea of the subspace generating the gradient, so things are effectively scaling like (A^\top A)^2 rather than (A^\top A) when the gradient and the interference term are evaluated.
c) a slight wrinkle in your analysis is that you only look at expected improvement. while this is perfectly good for the purpose of getting an intuition of convergence and capturing most of the behavior, it doesn't quite suffice to _prove_ convergence without additional gymnastics (not sure whether the Shalev-Schwartz proof has it).
d) equation (10) probably has a spurious 1/2 factor on the rhs, or at least i can't see how you got it. it works out in the end since (8) is probably where it came from ...
e) you probably want to look at a partitioning which has small interference since ultimately you don't really need to draw from an independent selection distribution. for instance, imagine that you had a chain. in this case if you drew every third element you'd be safe (think of the chromatic engine in graphlab).
while this is generally not possible, it has the flavor of a jacobi-like approach to it. i.e. randomize over decompositions that minimize interactions, provided that the overall distribution still picks each coordinate uniformly. this would work well e.g. in a graph.
Thanks Alex!
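For reference, the FISTA method Alex mentions in point (a) can be sketched in a few lines for the same Lasso objective. This is a generic textbook-style implementation under my own naming, not code from the Beck & Teboulle paper:

```python
import numpy as np

def fista_lasso(A, b, lam, iters=1000):
    # Accelerated proximal gradient for 0.5*||Ax - b||^2 + lam*||x||_1,
    # with O(1/T^2) convergence in the objective.
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    y, t = x.copy(), 1.0
    for _ in range(iters):
        grad = A.T @ (A @ y - b)
        z = y - grad / L                   # gradient step on the smooth part
        x_new = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox = soft-threshold
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = x_new + ((t - 1) / t_new) * (x_new - x)  # momentum extrapolation
        x, t = x_new, t_new
    return x
```

Note that, as Alex says, both the gradient and the thresholding step are embarrassingly parallel across coordinates, so the method parallelizes without the interference issues of asynchronous coordinate updates.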
Friday, June 17, 2011
KDD Cup progress update - We are in the 5th place!
I am proud to report that yesterday JustinYan, a graduate student from the Institute of Automation, Chinese Academy of Sciences, and a member of the LeBuSiShu team, suggested that I join their team.
The reason is that LeBuSiShu is using GraphLab ALS solution as one of the components in their final result. Currently LeBuSiShu is the 5th group in the leaderboard (out of 100 groups!) with RMSE = 22.2820.
Anyway, of course I was honored to accept as well as quite excited! Although when writing the GraphLab collaborative filtering code of matrix factorization I had no competition in mind, it is always nice to know that someone finds your solution useful.
Besides JustinYan, another nice team member of the LeBuSiShu group is Yao Wu, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences.
A quick update 6/23: We are actually now in 4th place! The Chinese group is an amazing team; they simply work around the clock. I myself am running some fancy MCMC methods to slightly improve the prediction, but the Chinese team is responsible for 99.9% of the progress.