Large Scale Machine Learning and Other Animals: Amazon-EC2


Tuesday, April 3, 2012

Machine learning contest from Amazon

Here is what I got from Ken Montanez of Amazon, via the Mahout mailing list:

Amazon's Information Security Organization partnered with IEEE to put on a Machine Learning Competition for their upcoming MLSP 2012 International Workshop in Spain.
Amazon provided real industry data to give competitors the opportunity to work with real world data. The competition deadline is May 14, 2012. Website: http://mlsp2012.conwiz.dk, under the 'MLSP Competition' menu option.
Have fun!
Ken
--
Ken Montanez, Software Engineering Manager, Security Platform

It seems the goal of the competition is binary classification of user access to resources, to either allow or deny access. The data is explained here. Anyway for us it is always interesting to obtain real data!

Thursday, September 8, 2011

Amazon EC2 Supports Research!

I am very pleased to announce that the GraphLab large scale machine learning project is now supported by Amazon Elastic Compute Cloud (EC2), which has allocated us computing time on their cloud. This will allow us to extend compatibility with EC2 and to scale further to larger models.

I want to take this opportunity to thank James Hamilton, VP and Distinguished Engineer at Amazon, who pulled some strings and introduced us to Kurt Messersmith, Senior Manager at Amazon Web Services, who was kind enough to approve our grant request.

By the way, if you are teaching a course about EC2 or you want to apply for research grants you can apply here: http://aws.amazon.com/education/

Thursday, July 21, 2011

Fighting with Amazon EC2 AMI

There is no doubt that Amazon EC2 is one of the most successful and useful cloud services. However, a few days ago I had the nerve-racking experience of trying to ec2-register an Amazon AMI image. This task is needed when you want to save your work so you can easily load it the next time you run. The Amazon AMI tools are among the worst designed and implemented tools I have ever encountered. You need a lot of patience when dealing with them. I wrote down some of the errors I encountered.
(I thought that having a PhD and 15 years of experience working on Linux would make me immune to these kinds of errors, but I was absolutely wrong..) The reader should be warned
that I did not collect these errors from the web; I simply encountered all of the errors below, until eventually I got so tired that I stopped documenting everything.

Basically, what you want to do is run 3 commands. Usually it should not take more than a few minutes. However, if you manage to run those commands in less than a few hours you are very lucky.
These are the commands you would like to run:

sudo -E /opt/aws/bin/ec2-bundle-vol -k [path to your x.509 private key] -c [path to your x.509 public key] -u [Amazon 12 digit user id] -d /mnt -p [bundle file name] -r x86_64
sudo -E /opt/aws/bin/ec2-upload-bundle -b [bundle file name] -m /mnt/[bundle file name].manifest.xml -s [Amazon AWS secret string] -a [Amazon AWS ID string]
sudo -E ec2-register -K [location of X.509 private key] -C [location of x.509 certificate] --name [bucket name]/[image name] --region us-east-1 [bucket name]/[image name].manifest.xml
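Several of the errors below come from names that do not match between the three commands (bucket, bundle name, manifest path) or from passing the Access Key ID where the 12 digit user id is expected. A minimal Python sketch that derives all three command lines from one set of inputs is shown here. It only builds strings and runs nothing; the function name bundle_commands and its parameters are hypothetical, and the flags are copied from the commands above.

```python
import re
import shlex


def bundle_commands(bundle_name, bucket, account_id, private_key, cert,
                    access_key_id, secret_key, region="us-east-1"):
    """Return the bundle, upload and register command lines as strings."""
    # The user id must be 12 digits (optionally hyphenated), not the Access Key ID.
    if not re.fullmatch(r"\d{4}-?\d{4}-?\d{4}", account_id):
        raise ValueError("account id must be 12 digits, optionally hyphenated")

    # Derive the manifest name once so the upload and register steps agree.
    manifest = bundle_name + ".manifest.xml"

    bundle = ["sudo", "-E", "/opt/aws/bin/ec2-bundle-vol",
              "-k", private_key, "-c", cert, "-u", account_id,
              "-d", "/mnt", "-p", bundle_name, "-r", "x86_64"]
    upload = ["sudo", "-E", "/opt/aws/bin/ec2-upload-bundle",
              "-b", bucket, "-m", "/mnt/" + manifest,
              "-s", secret_key, "-a", access_key_id]
    register = ["sudo", "-E", "ec2-register",
                "-K", private_key, "-C", cert,
                "--name", bucket + "/" + bundle_name,
                "--region", region, bucket + "/" + manifest]

    # shlex.quote protects arguments that contain spaces or shell characters.
    return [" ".join(shlex.quote(arg) for arg in cmd)
            for cmd in (bundle, upload, register)]
```

Calling bundle_commands("myimage", "mybucket", "1234-5678-9012", ...) with placeholder values returns three strings that can be reviewed before pasting them into a shell.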

Potential problems. If the process failed and you tried again you may get an error:

/opt/aws/amitools/ec2/lib/ec2/platform/linux/image.rb:154:in `mount_image': image already mounted (FatalError)
    from /opt/aws/amitools/ec2/lib/ec2/platform/linux/image.rb:81:in `make'
    from /opt/aws/amitools/ec2/lib/ec2/amitools/bundlevol.rb:151:in `bundle_vol'
    from /opt/aws/amitools/ec2/lib/ec2/amitools/bundlevol.rb:193:in `main'
    from /opt/aws/amitools/ec2/lib/ec2/amitools/tool_base.rb:201:in `run'
    from /opt/aws/amitools/ec2/lib/ec2/amitools/bundlevol.rb:201

solution: using the "mount" command, find the mounted image and unmount it using the "sudo umount XXX" command.

problem:

/opt/aws/bin/ec2-bundle-vol: line 3: EC2_HOME: Neither of EC2_AMITOOL_HOME or EC2_HOME environment variables are set

Solution:
Assuming the AMI tools are installed, try to use sudo -E. If this does not work, try to set (assuming you are working in a bash shell):

export EC2_AMITOOL_HOME=/home/ubuntu/ami-tools/ec2-ami-tools-1.3-57676/

where the path should point to where the AMI tools are installed. If they are not installed,
you need to install the EC2 AMI tools, and then set the environment variable using the "setenv" or "export" command.

Problem:

--user has invalid value 'AKIAJWASWE2DSWQFKILA': the user ID should consist of 12 digits (optionally hyphenated); this should not be your Access Key ID
Try 'ec2-bundle-vol --help'

solution:
You gave the wrong key. Look for a numeric account ID on the Amazon AWS website, of the format 0000-0000-0000. This ID is especially hard to find within all the menus.

Problem:

ERROR: the specified image file /mnt/graphlab.org.realase_v1234 already exists

Solution: remove the image file created using the command

sudo rm -fR /mnt/graphlab.org.release_v1234

Problem:

The specified bucket is not S3 v2 safe (see S3 documentation for details):

Solution: Looks like an EC2 bug - underscores and capital letters are allowed but result in this warning. If you ignore the warning at this point, you will get much worse errors later. Try to avoid this warning.
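To catch this warning before bundling, the bucket name can be checked up front. The following is a minimal Python sketch, not part of the AMI tools: the function name bucket_name_problems is hypothetical, and the rule set (3 to 63 characters, lowercase letters, digits, dots and hyphens, alphanumeric at both ends) is the conservative DNS-style naming convention for S3 buckets. Flagging dots is an extra assumption, motivated by the SSL host name error listed further down.

```python
import re

# Conservative pattern: lowercase letters, digits, dots and hyphens only,
# starting and ending with a letter or digit, 3 to 63 characters in total.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def bucket_name_problems(name):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if "_" in name:
        problems.append("contains an underscore")
    if name != name.lower():
        problems.append("contains capital letters")
    if not _BUCKET_RE.match(name):
        problems.append("not 3-63 chars of [a-z0-9.-] with alphanumeric ends")
    if "." in name:
        # Assumption: dots are legal in the name, but can break HTTPS
        # certificate matching against *.s3.amazonaws.com.
        problems.append("contains a dot (may fail SSL host name checks)")
    return problems


for candidate in ("graphlab_org_release_v1234", "graphlab-org-release-v1234"):
    print(candidate, "->", bucket_name_problems(candidate) or "OK")
```

A name made only of lowercase letters, digits and hyphens passes all four checks.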

Problem:

mke2fs 1.41.12 (17-May-2010)
error writing /etc/mtab.tmp: No space left on device

Solution:
You tried so many times that you ran out of disk space.. You need to clean up files, or restart the image and retry.

Problem:

Neither a 'manifest' or 'block-device-mapping' have been specified; at least one is required. (-h for usage)

Solution:
You need to both use the -n flag to specify a bucket name and give the path bucketname/imagename.manifest.xml. By the way, the bucket name is flexible - it does not have to be the image name.

Problem:

1) Client.InvalidManifest: HTTP 403 (Forbidden) response for URL http://s3.amazonaws.com:80/graphlab.org.release_v1234/graphlab.org.release_v1234.mainfest.xml: check your S3 ACLs are correct.
2) Client.InvalidManifest: HTTP 404 (Not Found) response for URL http://s3.amazonaws.com:80/graphlab.org.realase_v1234/graphlab.org.realase_v1234.mainfest.xml: check your manifest path is correct and in the correct region.

Solution:
Something in the process has gone wrong - either the bucket name is wrong or the upload failed.. You need to do everything correctly from the beginning.

Problem:

ERROR: Parameter problem: Expecting S3 URI with just the bucket name set instead of 'graphlab_org_release_v1234'

Solution:
You need to add the s3:// prefix to the bucket name when using the command: s3cmd mb

Problem:

ERROR: Error talking to S3: Curl.Error(51): SSL: certificate subject name '*.s3.amazonaws.com' does not match target host name 'graphlab.org.release.v1234.s3.amazonaws.com'.

Solution: no clue what I did - I started to get out of focus at this point. Probably started all over again.. :-(

Problem: Client.InvalidAMIName.Duplicate: AMI name graphlaborgreleasev1234 is already in use by AMI ami-98946ef1

Solution: this happens when you try to register a new AMI with a name you already gave to an older AMI. You need to rename it.

Hopefully, after all this mess, you managed to ec2-register.. and got a printout of
the type:
IMAGE AMI-12120930
HALLELUJAH.
And I ask: why not simply add a UI option to the AWS console to register an image???

Final comment: I had a quick email exchange with James Hamilton, VP at Amazon, and I sent him this link. I got back the following note: Sorry you had a bad experience with EC2.

I would like to take this opportunity to clarify that my overall experience with EC2 is very good. But still some interfaces could be improved.

Monday, July 18, 2011

GraphLab - Machine Learning in the Cloud

A few days ago we released a detailed technical report about the performance of distributed GraphLab on Amazon EC2 with up to 64 nodes (512 cores total): http://arxiv.org/abs/1107.0922

We compared GraphLab using three applications: matrix factorization, CoEM (a variant of personalized pagerank, a named entity recognition algorithm), and video co-segmentation.

As a reference we compared three platforms: Hadoop, MPI (Message Passing Interface) and GraphLab. In a nutshell, GraphLab runs about 20x to 100x faster than Hadoop, depending on the data and application. The main reason is that we perform all computation in memory and do not provide any fault tolerance. Compared to MPI, GraphLab has similar performance. The drawback of MPI is that the code has to be rewritten for each application, while GraphLab provides building blocks for iterative computation.

The following graph shows the speedup of the 3 applications using 64 Amazon HPC machines:

The baseline for the speedup calculation is 4 machines. For matrix factorization (line denoted as Netflix) we get a speedup of 16x on 64 machines. For video co-segmentation we get a speedup of 40x on 64 EC2 nodes.
When we increase the factorized matrix width, the problem becomes computation heavy and we get
an even better speedup of 40x on 64 nodes.

Friday, March 4, 2011

Tuning Hadoop configuration for high performance - Mahout on Amazon EC2

In this post I will share some of the insights I got when tuning Hadoop/Mahout on Amazon EC2 regular and high performance nodes. I was using two algorithms.
1) Mahout's Alternating least squares application (See MAHOUT-542) with Netflix data. (Sparse matrix with 100,000,000 non zeros). Test was done with up to 64 HPC nodes (512 cores).
2) CoEM algorithm - NLP algorithm (R. Jones, 2005) with data graph of around 200,000,000 edges.

Below are running time results for running one iteration of alternating least squares (implemented by Sebastian Schelter) on Netflix data. Runtime is in seconds.
X-axis are the participating machines - from 4 to 64 machines.

My conclusion from this experiment is that 16 HPC nodes (256 cores) are enough for computing matrix factorization/CoEM at this scale. Beyond 16 nodes there is no benefit in further parallelism.

Below I explain how I fine-tuned performance.
Preliminaries: I assume you followed the instructions in part 1 of this tutorial to set up Hadoop on EC2.

1) The hdfs-site.xml file

dfs.replication

- I set dfs.replication to 1. Replication determines the number of copies of the data that HDFS stores. When working with a relatively low number of nodes (several), higher replication slows performance.

hadoop.tmp.dir
dfs.data.dir
dfs.name.dir

You should set all those directories to point to DIFFERENT paths which have ENOUGH DISK SPACE.
Default hadoop configuration points to either /tmp or /usr/local/hadoop-0.20.2/ and in Amazon
EC2 there is a 10Gb disk space limit for the root partition. To increase available storage,
on regular nodes I set the above fields to /mnt/tmp1, /mnt/tmp2/ and /mnt/tmp3
On HPC nodes, I first mounted /dev/sdb using the command:

mkdir -p /home/data
mount -t ext3 /dev/sdb /home/data/

And then created /home/data/tmp1 /home/data/tmp2 /home/data/tmp3 and pointed the above fields to there.

dfs.block.size

The default is 64MB. For CoEM I set it to 4MB, so there are enough mappers for all cores. For the Netflix data I set it to 16MB. When the block size is too small, there are too many mappers, which overloads the system, causes many task failures, and gets some of the task trackers blacklisted. Having too few mappers does not exploit parallelism well. Unfortunately it seems that the block size should be tuned separately for each algorithm.
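To get a feeling for how the block size drives the number of mappers, here is a rough Python sketch. It assumes roughly one map task per HDFS block (the usual behavior for splittable input files), and the 2 GB input size is a made-up figure; only the three block sizes come from the text above.

```python
MB = 1024 * 1024


def estimated_map_tasks(input_bytes, block_size_bytes):
    """Roughly one map task per block: ceil(input size / block size)."""
    return -(-input_bytes // block_size_bytes)  # integer ceiling division


# Hypothetical 2 GB input file; the block sizes are the ones discussed above.
input_bytes = 2 * 1024 * MB
for block_mb in (64, 16, 4):
    tasks = estimated_map_tasks(input_bytes, block_mb * MB)
    print("%2d MB blocks -> %d map tasks" % (block_mb, tasks))
```

Comparing the estimate with the total number of cores in the cluster gives a first guess for the block size before tuning it empirically.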

2) The file core-site.xml should be configured as explained in the first part of this post.

3) The file mapred-site.xml

mapred.map.tasks

Empirically, setting it to the number of cores - 1 seemed to work best (on HPC nodes, 15). Note that this number is per machine.

mapred.reduce.tasks

Common practice says to set it to 0.95 * number of machines * (number of cores - 1).
For me that did not work well, especially with 64 machines - the reduce phase became terribly slow, with a very slow copying phase (in KB instead of MB). Finally I set it to 64 for all experiments.
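For reference, the rule of thumb above can be evaluated with a few lines of Python. This is only the arithmetic of the formula; the function name is hypothetical, and 16 cores per HPC node is inferred from the "15 cores" (cores - 1) figure mentioned above.

```python
def rule_of_thumb_reduce_tasks(machines, cores_per_machine):
    """0.95 * machines * (cores - 1), rounded down.

    Integer arithmetic is used so the result does not depend on
    floating point rounding.
    """
    return (95 * machines * (cores_per_machine - 1)) // 100


for machines in (4, 16, 64):
    print(machines, "machines ->", rule_of_thumb_reduce_tasks(machines, 16), "reduce tasks")
```

With 64 machines the formula suggests 912 reduce tasks, more than ten times the value of 64 that worked in these experiments.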

mapred.tasktracker.map.tasks.maximum, mapred.tasktracker.reduce.tasks.maximum

Set them to the values above. Note that it seems that the reduce tasks maximum is a global maximum and not a per-machine limit. So in this case 64 was a global limit of 64 reduce tasks.

mapred.task.timeout, mapred.tasktracker.expiry.interval

The default is 600000 milliseconds (10 minutes), which was too low for ALS. If the interval is too low, tasks will be killed prematurely. I set it to 7200000 (2 hours).

mapred.task.tracker.expiry.interval

Don't ask me what the difference from the previous field is - probably a bug. Anyway I set it as well.

mapred.compress.map.output, mapred.output.compress

Again I set those fields to true. It significantly reduced
the disk writes, to about 1/3 of the size.

mapred.child.java.opts

Set it to -Xmx2500m. The default in stock Hadoop 0.20 is -Xmx200m, which results in out of memory errors, Java heap errors and GC errors.
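The settings in this section can also be generated instead of edited by hand. Below is a minimal Python sketch that writes a Hadoop-style configuration element from a dictionary. The helper name build_configuration is hypothetical; the property names and values are the ones discussed above, with the heap option written in the JVM's -Xmx...m syntax.

```python
import xml.etree.ElementTree as ET


def build_configuration(properties):
    """Return a <configuration> document with one <property> per entry."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Values taken from the text above (HPC nodes, 64 reduce tasks).
MAPRED_SETTINGS = {
    "mapred.tasktracker.map.tasks.maximum": 15,
    "mapred.tasktracker.reduce.tasks.maximum": 64,
    "mapred.task.timeout": 7200000,        # milliseconds
    "mapred.compress.map.output": "true",
    "mapred.output.compress": "true",
    "mapred.child.java.opts": "-Xmx2500m",  # JVM syntax: m, not Mb
}

xml_text = build_configuration(MAPRED_SETTINGS)
print(xml_text)
```

The printed text can be saved as mapred-site.xml; keeping the values in one dictionary makes it easier to try different settings per experiment.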

4) The file hadoop-env.sh
On HPC nodes, set

JAVA_HOME=/usr/lib/jvm/jre-openjdk

On regular nodes, set

JAVA_HOME=/usr/lib/jvm/java-6-openjdk

The HADOOP_HEAPSIZE parameter controls the heap size (in MB). When it is too small you get
out of memory errors and out of heap space errors.

HADOOP_HEAPSIZE=4000

5) Avoiding string parsing as much as possible
Java string parsing is rather slow. Avoid reading text input files as much as possible, and write the data in binary format whenever possible. For the CoEM algorithm, avoiding string parsing resulted in 4x faster code, since the input files were read on each iteration.
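The idea can be illustrated with a small sketch. The original code was Java; the version below is Python, only to show the two input paths side by side. The record layout (two 32-bit integers and a double per edge) and all function names are assumptions for illustration, and no timing claim is made for this sketch.

```python
import struct

# Hypothetical fixed-size record: source id, target id, weight.
EDGE = struct.Struct("<iid")  # little-endian: int32, int32, float64


def edges_to_text(edges):
    """Text format: one 'src dst weight' line per edge."""
    return "".join("%d %d %r\n" % edge for edge in edges)


def edges_from_text(text):
    """Every field has to be split and converted from a string."""
    result = []
    for line in text.splitlines():
        src, dst, weight = line.split()
        result.append((int(src), int(dst), float(weight)))
    return result


def edges_to_binary(edges):
    """Binary format: fixed-size records, no separators."""
    return b"".join(EDGE.pack(*edge) for edge in edges)


def edges_from_binary(blob):
    """Records are decoded directly, with no string parsing."""
    return list(EDGE.iter_unpack(blob))


edges = [(1, 2, 0.5), (3, 4, 1.0)]
assert edges_from_text(edges_to_text(edges)) == edges
assert edges_from_binary(edges_to_binary(edges)) == edges
```

When the same input is re-read on every iteration, converting it to the binary form once up front means the per-iteration cost is only the fixed-size decode.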

Some tips I got from Julio Lopez, OpenCloud project @ CMU:
Block size and controlling the number of mappers. I believe someone already commented on this. In general, you want to have the block sizes relatively large in order to induce your job to perform sequential instead of random I/O. You can use the "InputFormat" to control how the work is split and how many tasks are created.

I've found that the first instinct users have is to match the number of mappers or reducers per node to the number of cores. For many Hadoop applications, this does not work. Properly setting these parameters is application dependent (modulo the available resources). In Hadoop these are framework-wide parameters. In my experience, how memory is allocated to tasks has a much larger impact on application performance. However, it is not clear how these memory parameters should be set, and there are all sorts of complex interactions among tasks.

For reference, in the cloud cluster, there are 8 cores per node, we allow 10 simultaneous tasks to execute per node and in general we see better throughput that way. As I mentioned earlier, most jobs experience contention for memory.

Interesting related projects/ papers:
1) http://www.cs.duke.edu/~shivnath/amr.html
2) Kai Ren, Julio López and Garth Gibson. Otus: Resource Attribution in Data-Intensive Clusters. MapReduce: The Second International Workshop on MapReduce and its Applications. San Jose, CA, June 2011.

Other useful tips:

When stopping and starting Hadoop you should be very careful, since Hadoop generates a zillion temp files which, if found on the next run, make a mess.

1) I always run from script

echo Y | hadoop namenode -format

This is because if the file system was already formatted, the command asks for confirmation, and the script will get stuck without getting the "Y" input.

2) Remove all /tmp/*.pid files, or else Hadoop will think some old processes are running.

3) Remove all files in the directories hadoop.tmp.dir, dfs.data.dir, dfs.name.dir,
especially VERSION files. Old VERSION files lead to namespaceID collisions.

4) Delete old logs from /usr/local/hadoop-0.20.2/logs/
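Cleanup steps 2 to 4 above can be collected into one helper so that nothing is forgotten between runs. The following Python sketch is one possible way to do it; the function name and parameters are hypothetical, and it deletes files, so the paths should be double-checked before use.

```python
import glob
import os
import shutil


def clean_hadoop_state(pid_dir, state_dirs, log_dir):
    """Remove stale *.pid files plus the contents of the state and log dirs.

    Returns the list of paths that were removed.
    """
    removed = []
    # Step 2: stale pid files make Hadoop think old processes are running.
    for pid_file in glob.glob(os.path.join(pid_dir, "*.pid")):
        os.remove(pid_file)
        removed.append(pid_file)
    # Steps 3 and 4: empty the state directories (including VERSION files)
    # and the log directory, but keep the directories themselves.
    for directory in list(state_dirs) + [log_dir]:
        if not os.path.isdir(directory):
            continue
        for entry in os.listdir(directory):
            path = os.path.join(directory, entry)
            if os.path.isdir(path) and not os.path.islink(path):
                shutil.rmtree(path)
            else:
                os.remove(path)
            removed.append(path)
    return removed
```

With the paths used in this post, a call might look like clean_hadoop_state("/tmp", ["/mnt/tmp1", "/mnt/tmp2", "/mnt/tmp3"], "/usr/local/hadoop-0.20.2/logs"), run only while Hadoop is stopped.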

Friday, February 25, 2011

Mahout on Amazon EC2 - part 5 - installing Hadoop/Mahout on high performance instance (CentOS/RedHat)

This post explains how to install the Mahout ML framework on top of Amazon EC2 (CentOS/RedHat based machine).
The notes are based on older Mahout notes: https://cwiki.apache.org/MAHOUT/mahout-on-amazon-ec2.html which are unfortunately outdated.

Note: part 1 of this post explains how to perform the same installation on top of an Ubuntu based machine.

Full procedure should take around 2-3 hours.. :-(

1) Start high performance instance from amazon aws console

CentOS AMI ID ami-7ea24a17 (x86_64)
Name:  Basic Cluster Instances HVM CentOS 5.4   
Description:  Minimal CentOS 5.4, 64-bit architecture, and HVM-based virtualization for use with Amazon EC2 Cluster Instances.

2) Login into the instance (right mouse click on running instance from AWS console)

3) Install some required stuff

sudo yum update
sudo yum upgrade
sudo yum install python-setuptools
sudo easy_install "simplejson"

4) Install boto (unfortunately I was not able to install it using easy_install directly)

wget http://boto.googlecode.com/files/boto-1.8d.tar.gz
tar xvzf boto-1.8d.tar.gz
cd boto-1.8d
sudo easy_install .

5) Install maven2 (unfortunately I was not able to install it using yum)

wget http://www.trieuvan.com/apache/maven/binaries/apache-maven-2.2.1-bin.tar.gz
tar xvzf apache-maven-2.2.1-bin.tar.gz
sudo cp -R apache-maven-2.2.1 /usr/local/
sudo ln -s /usr/local/apache-maven-2.2.1/bin/mvn /usr/local/bin/

6) Download and install Hadoop

wget http://apache.cyberuse.com//hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz   
tar vxzf hadoop-0.20.2.tar.gz  
sudo  mv hadoop-0.20.2 /usr/local/

add the following to $HADOOP_HOME/conf/hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/jre-openjdk/  
# The maximum amount of heap to use, in MB. Default is 1000  
export HADOOP_HEAPSIZE=2000

add the following to $HADOOP_HOME/conf/core-site.xml and also $HADOOP_HOME/conf/mapred-site.xml

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property> <property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

Edit the file hdfs-site.xml

<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/data/tmp/</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/data/tmp2/</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/data/tmp3/</value>
</property>
</configuration>

Note: the directory /home/data does not exist, and you will have to create it
when starting the instance using the commands:

# mkdir -p /home/data
# mount -t ext3 /dev/sdb /home/data/

The reason for this setup is that the root dir has only 10GB, while /dev/sdb
has 800GB.

set up authorized keys for localhost login w/o passwords and format your name node

# ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
# cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Checkout and build Mahout from trunk. Alternatively, you can upload a Mahout release tarball and install it as we did with the Hadoop tarball (Don't forget to update your .profile accordingly).

# svn co http://svn.apache.org/repos/asf/mahout/trunk mahout
# cd mahout
# mvn clean install
# cd ..
# sudo mv mahout /usr/local/mahout-0.4

4) Add the following to your .profile

export JAVA_HOME=/usr/lib/jvm/jre-openjdk
export HADOOP_HOME=/usr/local/hadoop-0.20.2
export HADOOP_CONF_DIR=/usr/local/hadoop-0.20.2/conf
export MAHOUT_HOME=/usr/local/mahout-0.4/
export MAHOUT_VERSION=0.4-SNAPSHOT
export MAVEN_OPTS=-Xmx1024m

Verify that the paths on .profile point to the exact version you downloaded

6) Run Hadoop, just to prove you can, and test Mahout by building the Reuters dataset on it. Finally, delete the files and shut it down.

$HADOOP_HOME/bin/hadoop namenode -format
$HADOOP_HOME/bin/start-all.sh
jps     # you should see all 5 Hadoop processes (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker)
cd $MAHOUT_HOME
./examples/bin/build-reuters.sh
$HADOOP_HOME/bin/stop-all.sh
rm -rf /tmp/*   # delete the Hadoop files

Remove the single-host stuff you added to $HADOOP_HOME/conf/core-site.xml and $HADOOP_HOME/conf/mapred-site.xml in step #6b and verify you are happy with the other conf file settings. The Hadoop startup scripts will not make any changes to them. In particular, upping the Java heap size is required for many of the Mahout jobs.

Edit $HADOOP_HOME/conf/mapred-site.xml to include the following:
<property>
   <name>mapred.child.java.opts</name>
   <value>-Xmx2000m</value>
</property>

7) Allow for Hadoop to run even if you will work on a different EC2 machine:

echo "NoHostAuthenticationForLocalhost yes" >>~/.ssh/config

8) Now bundle the image.
Using Amazon AWS console - select running instance, right mouse click and then bundle EBS image. Enter image name and description. Now the machine will reboot and the image will be created.

Thursday, February 24, 2011

The GraphLab large scale machine learning framework - part 1 - installation on Amazon EC2

GraphLab is an open source large scale parallel machine learning framework.

This post explains how to install GraphLab on Amazon EC2 and how to load GraphLab from a preinstalled EC2 image.

Loading Graphlab from one of our publicly available images:
1) Follow the directions on http://graphlab.org/download.html to check out the latest available AMI. Launch the selected image using Amazon AWS Console.

NOTE: Images are available in US-EAST region.
TIP: Don't forget to allow SSH (TCP port 22) in the default security group.
(In the AWS console go to EC2 -> Network Security -> Security Groups and verify that port 22 is open in the default security group, or in the security group you are using.)

2) After launching the AMI instance, it is always desirable to get the latest
GraphLab using the commands:

cd graphlabapi
hg pull
hg update
./configure
cd release
make -j4

3) It is also useful to run unit testing to verify the update went fine:

cd tests/
./runtests.sh

GraphLab installation instructions

NOTE: The instructions below are intended for advanced users, in case you did not find the required AMI among our public AMI images.

Installation instructions were moved to the download page.
Select the matching icon for your operating systems for detailed instructions.

Monday, February 21, 2011

Large scale matrix factorization using alternating least squares: which is better - GraphLab or Mahout?

For the last couple of weeks I have been comparing the performance of GraphLab vs. Mahout on alternating least squares using the Netflix data. As a reminder, GraphLab is the parallel machine learning system we are building at CMU.

Initial results are encouraging. Mahout Alternating least squares implementation by Sebastian Schelter was tested on Amazon EC2, using two m2.2xlarge nodes (13x2 virtual cores).

For running 10 iterations with number of features=20 and lambda=0.065, Mahout takes 39272 seconds, while the GraphLab implementation in C++ takes only 714 seconds (on a machine with 8 cores).

The running times should be taken with a grain of salt, since I was not using the exact same machine, but the magnitude of the difference will certainly hold even if I run GraphLab on EC2 (which I plan to do soon).
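To put the two running times side by side, the ratio can be computed directly. This small Python sketch uses only the two figures quoted above; since the runs were on different hardware, the result is an order of magnitude indication rather than a controlled benchmark.

```python
mahout_seconds = 39272    # Mahout ALS, 10 iterations, two m2.2xlarge nodes
graphlab_seconds = 714    # GraphLab ALS, 10 iterations, one 8-core machine

speedup = mahout_seconds / graphlab_seconds
print("GraphLab is about %.0fx faster" % speedup)
print("Mahout: %.1f hours, GraphLab: %.1f minutes"
      % (mahout_seconds / 3600, graphlab_seconds / 60))
```

The ratio comes out at roughly 55x.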

Regarding accuracy, Mahout ALS has a test RMSE of
0.9310, while GraphLab obtained a slightly better test RMSE of 0.9279.

Here is Mahout ALS final output: (of the RMSE computation)

ubuntu@ip-10-115-27-222:/mnt$ /usr/local/mahout-0.4/bin/mahout evaluateALS --probes /user/ubuntu/myout/probeSet/ --userFeatures /tmp/als/out/U/ --itemFeatures /tmp/als/out/M/ | grep RMSE
11/02/17 12:31:42 WARN driver.MahoutDriver: No evaluateALS.props found on classpath, will use command-line arguments only
11/02/17 12:31:42 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --itemFeatures=/tmp/als/out/M/, --probes=/user/ubuntu/myout/probeSet/, --startPhase=0, --tempDir=temp, --userFeatures=/tmp/als/out/U/}
RMSE: 0.9310729597725026, MAE: 0.7298745910296568
11/02/17 12:31:55 INFO driver.MahoutDriver: Program took 12437 ms

Here is the GraphLab output:

bickson@biggerbro:~/newgraphlab/graphlabapi/debug/apps/pmf$ ./PMF netflix-r 10 0 --D=20 --max_iter=10 --lambda=0.065 --ncpus=8
setting run mode 0
INFO   :pmf.cpp(main:1121): PMF starting

loading data file netflix-r
Loading netflix-r train
Creating 99072112 edges...
........................................
loading data file netflix-re
Loading netflix-re test
Creating 1408395 edges...
........
setting regularization weight to 0.065
PTF_ALS for matrix (480189, 17770, 27):99072112.  D=20
pU=0.065, pV=0.065, pT=1, muT=1, D=20
nuAlpha=1, Walpha=1, mu=0, muT=1, nu=20, beta=1, W=1, WT=1 BURN_IN=10
complete. Obj=6.83664e+08, TEST RMSE=3.7946.
INFO   :asynchronous_engine.hpp(run:56): Worker 0 started.

...

INFO   :asynchronous_engine.hpp(run:56): Worker 7 started.

Entering last iter with 1
228.524) Iter ALS 1  Obj=2.60675e+08, TRAIN RMSE=2.2904 TEST RMSE=0.9948.
Entering last iter with 2
289.594) Iter ALS 2  Obj=6.48921e+07, TRAIN RMSE=1.1400 TEST RMSE=0.9573.
Entering last iter with 3
350.487) Iter ALS 3  Obj=4.75073e+07, TRAIN RMSE=0.9754 TEST RMSE=0.9444.
Entering last iter with 4
411.551) Iter ALS 4  Obj=4.09914e+07, TRAIN RMSE=0.9063 TEST RMSE=0.9381.
Entering last iter with 5
472.615) Iter ALS 5  Obj=3.79096e+07, TRAIN RMSE=0.8718 TEST RMSE=0.9348.
Entering last iter with 6
533.039) Iter ALS 6  Obj=3.61298e+07, TRAIN RMSE=0.8513 TEST RMSE=0.9324.
Entering last iter with 7
594.177) Iter ALS 7  Obj=3.50076e+07, TRAIN RMSE=0.8382 TEST RMSE=0.9305.
Entering last iter with 8
654.41) Iter ALS 8  Obj=3.42655e+07, TRAIN RMSE=0.8294 TEST RMSE=0.9290.
Entering last iter with 9
714.095) Iter ALS 9  Obj=3.37535e+07, TRAIN RMSE=0.8234 TEST RMSE=0.9279.
INFO   :asynchronous_engine.hpp(run:66): Worker 6 finished.

...

INFO   :asynchronous_engine.hpp(run:66): Worker 2 finished.

Sunday, February 20, 2011

Installing BLAS/Lapack/ITPP on Amazon EC2/Ubuntu Linux

BLAS/Lapack are efficient matrix math libraries. The following instructions explain how to install them on Amazon EC2 (Ubuntu Maverick and Amazon Linux). IT++ (itpp) is a popular C++ wrapper for BLAS/Lapack.

DISCLAIMER: The instructions below are for 64 bit machines. For 32 bit machines follow other instructions: http://bickson.blogspot.com/2011/06/graphlab-pmf-on-32-bit-linux.html

FOR LAZY READERS:
Just use Amazon EC2 public image ami-c21eedab (Ubuntu)

INSTALLATION VIA YUM/APT-GET
Try to install itpp using the following command:

sudo yum install libitpp-dev

sudo apt-get install libitpp-dev

TIP: You may also want to install libitpp7-dbg using the yum/apt-get command.
It is not mandatory, but it helps debugging when you link against libitpp_debug.so
(instead of libitpp.so).

If the above worked then we are done. If not, you will need to follow
instructions below. Thanks to Udi Weinsberg for this tip.

FOR ADVANCED USERS:

0) Start with an Ubuntu image like ami-641eed0d, or an Amazon AMI image like:

1) Install required packages. For Ubuntu:

sudo apt-get install --yes --force-yes automake autoconf libtool* gfortran

For Amazon Linux:

sudo yum install -y automake autoconf libtool* gcc-gfortran

2) Install lapack.
Here again there are two options:

The easy way is to simply run (on Ubuntu):

sudo apt-get install --yes --force-yes liblapack-dev

On Amazon Linux:

sudo yum install -y lapack-devel blas-devel

Thanks Akshay Bhat from Cornell for this tip!
If the liblapack setup was successful, go to step 3.

If the above command DOES NOT work for you (depends on your OS and setup) you will need to install lapack manually. The procedure is explained in steps a-c below.

a) Download and prepare the code

wget http://www.netlib.org/lapack/lapack.tgz
tar xvzf lapack.tgz
cd lapack-3.3.0  # if the version number changes, use the right directory here
mv make.inc.example make.inc

b) Edit make.inc and add the -m64 -fPIC flags to the Fortran compiler options, in the following variables:
# FORTRAN, OPTS, NOOPT, LOADER

c) compile

make blaslib
make

If everything went OK, tests will run for a couple of minutes
and the files blas_LINUX.a and lapack_LINUX.a will be created in the main directory.
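Step b above can also be scripted instead of edited by hand. The following Python sketch appends the flags to the four variables named in that step. The helper name add_flags is hypothetical, and the sample make.inc lines are made up for illustration; the real file contains more settings and may differ between Lapack versions.

```python
FLAGS = "-m64 -fPIC"
VARIABLES = ("FORTRAN", "OPTS", "NOOPT", "LOADER")


def add_flags(make_inc_text, flags=FLAGS, variables=VARIABLES):
    """Append flags to 'NAME = value' lines for the given variable names."""
    out = []
    for line in make_inc_text.splitlines():
        name, sep, value = line.partition("=")
        # Only touch exact variable names, and skip lines already patched.
        if sep and name.strip() in variables and flags not in value:
            line = line.rstrip() + " " + flags
        out.append(line)
    return "\n".join(out) + "\n"


# Made-up excerpt in the style of make.inc, for illustration only.
sample = (
    "FORTRAN  = gfortran\n"
    "OPTS     = -O2\n"
    "DRVOPTS  = $(OPTS)\n"
    "NOOPT    = -O0\n"
    "LOADER   = gfortran\n"
)
print(add_flags(sample))
```

To apply it, read make.inc, pass its text through add_flags, and write the result back before running make. Running it twice is harmless because patched lines are skipped.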

3) setup LDFLAGS

export LDFLAGS="-L/usr/lib -lgfortran"

4) Download and install itpp from

wget http://sourceforge.net/projects/itpp/files/itpp/4.2.0/itpp-4.2.tar.gz
tar xvzf itpp-4.2.tar.gz
cd itpp-4.2
./autogen.sh

If you installed Lapack from yum/apt-get, you should use the following command:

./configure --without-fft --with-blas=/usr/lib64/libblas.so --with-lapack=/usr/lib64/liblapack.so --enable-debug CFLAGS=-fPIC CXXFLAGS=-fPIC CPPFLAGS=-fPIC

Where /usr/lib64/ is the place where lapack was installed.

If you installed lapack from source, use the following command

./configure --without-fft --with-blas=/home/ubuntu/lapack-3.3.0/blas_LINUX.a --with-lapack=/home/ubuntu/lapack-3.3.0/lapack_LINUX.a CFLAGS=-fPIC CXXFLAGS=-fPIC CPPFLAGS=-fPIC

make
sudo make install

Note: If you installed lapack from yum/apt-get, don't forget to add the -lblas -llapack linker flag when you compile against lapack/blas.

Verifying installation
To verify that the installation went OK, run the following commands:

itpp-config --cflags
itpp-config --libs

1) The command itpp-config should be available from shell.
2) The right installation path should appear as output.

Known issues you may encounter:
Problem:

*** Warning: Linking the shared library libitpp.la against the
*** static library /usr/lib64/libblas.a is not portable!
libtool: link: g++ -shared -nostdlib /usr/lib/gcc/x86_64-amazon-linux/4.4.4/../../../../lib64/crti.o /usr/lib/gcc/x86_64-amazon-linux/4.4.4/crtbeginS.o  -Wl,--whole-archive ../itpp/base/.libs/libbase.a ../itpp/stat/.libs/libstat.a ../itpp/comm/.libs/libcomm.a ../itpp/fixed/.libs/libfixed.a ../itpp/optim/.libs/liboptim.a ../itpp/protocol/.libs/libprotocol.a ../itpp/signal/.libs/libsignal.a ../itpp/srccode/.libs/libsrccode.a -Wl,--no-whole-archive  -L/usr/lib64/ /usr/lib64/liblapack.a /usr/lib64/libblas.a -lgfortranbegin -lgfortran -L/usr/lib/gcc/x86_64-amazon-linux/4.4.4 -L/usr/lib/gcc/x86_64-amazon-linux/4.4.4/../../../../lib64 -L/lib/../lib64 -L/usr/lib/../lib64 -L/usr/lib/gcc/x86_64-amazon-linux/4.4.4/../../.. -lstdc++ -lm -lc -lgcc_s /usr/lib/gcc/x86_64-amazon-linux/4.4.4/crtendS.o /usr/lib/gcc/x86_64-amazon-linux/4.4.4/../../../../lib64/crtn.o    -Wl,-soname -Wl,libitpp.so.7 -o .libs/libitpp.so.7.0.0
/usr/bin/ld: /usr/lib64/liblapack.a(dgees.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
/usr/lib64/liblapack.a: could not read symbols: Bad value
collect2: ld returned 1 exit status
make[2]: *** [libitpp.la] Error 1

Solution: it seems that lapack was statically compiled without the -fPIC option and thus itpp refuses to link against it. Follow step 2a to install lapack manually with the -fPIC option.

Problem:

make[1]: *** Waiting for unfinished jobs....
[ 83%] Building CXX object src/graphlab/CMakeFiles/graphlab_pic.dir/distributed2/distributed_scheduler_list.o
/usr/local/lib/libitpp.so: undefined reference to `zgesv_'
/usr/local/lib/libitpp.so: undefined reference to `dorgqr_'
/usr/local/lib/libitpp.so: undefined reference to `dswap_'
/usr/local/lib/libitpp.so: undefined reference to `dgeqp3_'
/usr/local/lib/libitpp.so: undefined reference to `dpotrf_'
/usr/local/lib/libitpp.so: undefined reference to `dgemm_'
/usr/local/lib/libitpp.so: undefined reference to `zungqr_'
/usr/local/lib/libitpp.so: undefined reference to `zscal_'
/usr/local/lib/libitpp.so: undefined reference to `dscal_'
/usr/local/lib/libitpp.so: undefined reference to `dgesv_'
/usr/local/lib/libitpp.so: undefined reference to `dgetri_'
/usr/local/lib/libitpp.so: undefined reference to `zgemm_'
/usr/local/lib/libitpp.so: undefined reference to `zposv_'
/usr/local/lib/libitpp.so: undefined reference to `zgetri_'
/usr/local/lib/libitpp.so: undefined reference to `dgeev_'
/usr/local/lib/libitpp.so: undefined reference to `zgemv_'
/usr/local/lib/libitpp.so: undefined reference to `zgeqrf_'
/usr/local/lib/libitpp.so: undefined reference to `zgerc_'
/usr/local/lib/libitpp.so: undefined reference to `zswap_'
/usr/local/lib/libitpp.so: undefined reference to `zgeev_'
/usr/local/lib/libitpp.so: undefined reference to `daxpy_'
/usr/local/lib/libitpp.so: undefined reference to `dgetrf_'
/usr/local/lib/libitpp.so: undefined reference to `zgels_'
/usr/local/lib/libitpp.so: undefined reference to `zgetrf_'
/usr/local/lib/libitpp.so: undefined reference to `dgees_'
/usr/local/lib/libitpp.so: undefined reference to `dcopy_'
/usr/local/lib/libitpp.so: undefined reference to `dger_'
/usr/local/lib/libitpp.so: undefined reference to `dgels_'
/usr/local/lib/libitpp.so: undefined reference to `dgeqrf_'
/usr/local/lib/libitpp.so: undefined reference to `zpotrf_'
/usr/local/lib/libitpp.so: undefined reference to `zgees_'
/usr/local/lib/libitpp.so: undefined reference to `dgesvd_'
/usr/local/lib/libitpp.so: undefined reference to `zgeru_'
/usr/local/lib/libitpp.so: undefined reference to `dsyev_'
/usr/local/lib/libitpp.so: undefined reference to `zaxpy_'
/usr/local/lib/libitpp.so: undefined reference to `ddot_'
/usr/local/lib/libitpp.so: undefined reference to `zgesvd_'
/usr/local/lib/libitpp.so: undefined reference to `zgeqp3_'
/usr/local/lib/libitpp.so: undefined reference to `zcopy_'
/usr/local/lib/libitpp.so: undefined reference to `dgemv_'
/usr/local/lib/libitpp.so: undefined reference to `dposv_'
/usr/local/lib/libitpp.so: undefined reference to `zheev_'
collect2: ld returned 1 exit status
make[2]: *** [tests/anytests] Error 1
make[1]: *** [tests/CMakeFiles/anytests.dir/all] Error 2

Solution: itpp was compiled using dynamic libraries, but your application did not include the -lblas and -llapack link flags.

Problem:

*** Error: You must have "autoconf" installed to compile IT++ SVN sources
*** Error: You must have "automake" installed to compile IT++ SVN sources
*** Error: You must have "libtoolize" installed to compile IT++ SVN sources

Solution:
Install the packages autoconf, automake and libtool (libtoolize is part of libtool) using yum or apt-get.

Problem:

/usr/bin/ld: /home/bickson/lapack-3.3.1/lapack_LINUX.a(dgees.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
/home/bickson/lapack-3.3.1/lapack_LINUX.a: could not read symbols: Bad value
collect2: ld returned 1 exit status

Solution:
It seems you forgot to follow section 2b.

TIP: It is worth enabling the itpp_debug library as well, since it helps a lot when debugging your code. This is done by adding the flag --enable-debug to the configure script.

Problem:

*** Error in ../../../itpp/base/algebra/ls_solve.cpp on line 271:
LAPACK library is needed to use ls_solve() function

Solution:
It seems that itpp is not installed properly: it was not linked against lapack.

Wednesday, February 9, 2011

CMU Pegasus on Hadoop

Pegasus is a peta scale graph mining library.
This post explains how to install it on Amazon EC2.

1) Start with the Amazon EC2 image you created in http://bickson.blogspot.com/2011/01/how-to-install-mahout-on-amazon-ec2.html
2) Run Hadoop on a single node as explained in http://bickson.blogspot.com/2011/01/mahout-on-amazon-ec2-part-2-testing.html
3) Log in to the EC2 machine
4) wget http://www.cs.cmu.edu/%7Epegasus/PEGASUSH-2.0.tar.gz
5) tar xvzf PEGASUSH-2.0.tar.gz
6) cd PEGASUS
7) export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/hadoop-0.20.2/bin/
8) sudo apt-get install gnuplot
9) ./pegasus.sh
PEGASUS> demo
put: Target pegasus/graphs/catstar/edge/catepillar_star.edge already exists
Graph catstar added.
rmr: cannot remove dd_node_deg: No such file or directory.
rmr: cannot remove dd_deg_count: No such file or directory.

-----===[PEGASUS: A Peta-Scale Graph Mining System]===-----

[PEGASUS] Computing degree distribution. Degree type = InOut

11/02/09 14:47:36 INFO mapred.FileInputFormat: Total input paths to process : 1
11/02/09 14:47:36 INFO mapred.JobClient: Running job: job_201102091432_0003
11/02/09 14:47:37 INFO mapred.JobClient: map 0% reduce 0%
11/02/09 14:47:45 INFO mapred.JobClient: map 18% reduce 0%
11/02/09 14:47:48 INFO mapred.JobClient: map 36% reduce 0%
11/02/09 14:47:51 INFO mapred.JobClient: map 54% reduce 0%
11/02/09 14:47:54 INFO mapred.JobClient: map 72% reduce 18%
11/02/09 14:47:57 INFO mapred.JobClient: map 90% reduce 18%
11/02/09 14:48:00 INFO mapred.JobClient: map 100% reduce 18%
11/02/09 14:48:03 INFO mapred.JobClient: map 100% reduce 24%
11/02/09 14:48:09 INFO mapred.JobClient: map 100% reduce 100%
11/02/09 14:48:11 INFO mapred.JobClient: Job complete: job_201102091432_0003
11/02/09 14:48:11 INFO mapred.JobClient: Counters: 18
11/02/09 14:48:11 INFO mapred.JobClient:   Job Counters
11/02/09 14:48:11 INFO mapred.JobClient:     Launched reduce tasks=1
11/02/09 14:48:11 INFO mapred.JobClient:     Launched map tasks=11
11/02/09 14:48:11 INFO mapred.JobClient:     Data-local map tasks=11

An image named catstar_deg_inout.eps will be created.

Tuesday, February 8, 2011

Hadoop on Amazon EC2 - Part 4 - Running on a cluster

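1) Edit the file conf/hdfs-site.xml
Set the number of replicas to the number of nodes you plan to use. In this example, 4.

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/tmp/</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/mnt/tmp2/</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/mnt/tmp3/</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>4</value>
    <description>Default block replication.
    The actual number of replications can be specified when the file is created.
    The default is used if replication is not specified in create time.
    </description>
  </property>
</configuration>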

2) Edit the file conf/slaves and list the DNS names of all of the machines you are going to use. For example:

 ec2-67-202-45-10.compute-1.amazonaws.com

 ec2-67-202-45-11.compute-1.amazonaws.com

 ec2-67-202-45-12.compute-1.amazonaws.com

 ec2-67-202-45-13.compute-1.amazonaws.com 

3) Edit the file conf/masters and enter the DNS name of the master node. For example:

 ec2-67-202-45-10.compute-1.amazonaws.com

Note that the master node can also appear in the slaves list.

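4) Edit the file conf/core-site.xml to include the master name

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://ec2-67-202-45-10.compute-1.amazonaws.com:9000</value>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>ec2-67-202-45-10.compute-1.amazonaws.com:9001</value>
  </property>

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/tmp/</value>
  </property>
</configuration>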

5) Edit the file conf/mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name> 
    <value>hdfs://ec2-67-202-45-10.compute-1.amazonaws.com:9000</value>
  </property>

  <property>
    <name>mapred.job.tracker</name>
  <value>ec2-67-202-45-10.compute-1.amazonaws.com:9001</value>
  </property>
  <property>
  <name>hadoop.tmp.dir</name>
   <value>/mnt/tmp/</value>
  </property>

  <property>
    <name>mapred.map.tasks</name>
    <value>10</value> <!-- about the number of cores -->
  </property>

  <property>
    <name>mapred.reduce.tasks</name>
    <value>10</value> <!-- about the number of cores -->
  </property>

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>12</value> <!-- slightly more than cores -->
  </property>

  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>12</value> <!-- slightly more than cores -->
  </property>
   
</configuration>
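
Editing this file by hand on every node gets error-prone once the master name or the core count changes. Here is a minimal Python sketch that generates the same properties. The function name mapred_site_xml and the "cores + 2" rule for the tasktracker maximums are my own assumptions for illustration (the file above only says "slightly more than cores"); this is not part of Hadoop.

```python
import xml.etree.ElementTree as ET


def mapred_site_xml(master, cores):
    """Return mapred-site.xml text for the given master DNS name and core count."""
    props = [
        ("fs.default.name", "hdfs://%s:9000" % master),
        ("mapred.job.tracker", "%s:9001" % master),
        ("hadoop.tmp.dir", "/mnt/tmp/"),
        # about the number of cores
        ("mapred.map.tasks", str(cores)),
        ("mapred.reduce.tasks", str(cores)),
        # slightly more than the number of cores (assumption: cores + 2)
        ("mapred.tasktracker.map.tasks.maximum", str(cores + 2)),
        ("mapred.tasktracker.reduce.tasks.maximum", str(cores + 2)),
    ]
    root = ET.Element("configuration")
    for name, value in props:
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return '<?xml version="1.0"?>\n' + ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(mapred_site_xml("ec2-67-202-45-10.compute-1.amazonaws.com", 10))
```

The output is written on a single line, which Hadoop should not care about; redirect it to conf/mapred-site.xml and copy the file to all nodes.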

6) Log in to the master node. For each of the 3 slave machines, copy the master's DSA public key to it:

ssh-copy-id -i ~/.ssh/id_dsa.pub ec2-67-202-45-11.compute-1.amazonaws.com
ssh-copy-id -i ~/.ssh/id_dsa.pub ec2-67-202-45-12.compute-1.amazonaws.com
ssh-copy-id -i ~/.ssh/id_dsa.pub ec2-67-202-45-13.compute-1.amazonaws.com

7) To start Hadoop, run on the master machine:

/usr/local/hadoop-0.20.2/bin/hadoop namenode -format
/usr/local/hadoop-0.20.2/bin/start-dfs.sh
/usr/local/hadoop-0.20.2/bin/start-mapred.sh

8) To stop Hadoop

/usr/local/hadoop-0.20.2/bin/stop-mapred.sh
/usr/local/hadoop-0.20.2/bin/stop-dfs.sh

Tuesday, February 1, 2011

Mahout on Amazon EC2 - part 3 - Debugging

Connecting to management web interface of an Hadoop node

1) Log in to the AWS management console.
Select the default security group, and add tcp ports 50010-50090 (with ip 0.0.0.0/0).

2) You can view hadoop node status (after starting Hadoop) by opening a web browser and
entering the following address:

http://ec2-50-16-155-136.compute-1.amazonaws.com:50070/

where ec2-XXX-XXXXXXXX is the node name, and 50070 is the default port of the namenode server.
50030 is the default port of the job tracker, and 50060 is the default port of the task tracker.
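
If the browser just hangs, it helps to check whether the ports are reachable at all before blaming Hadoop. A small Python sketch, using only the three default ports listed above; the names WEB_UI_PORTS, web_ui_urls and port_open are mine, chosen for illustration.

```python
import socket

# Default Hadoop 0.20 web UI ports, as listed above.
WEB_UI_PORTS = {"namenode": 50070, "jobtracker": 50030, "tasktracker": 50060}


def web_ui_urls(host):
    """Return the web UI address of each daemon on the given node."""
    return {name: "http://%s:%d/" % (host, port) for name, port in WEB_UI_PORTS.items()}


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    node = "ec2-50-16-155-136.compute-1.amazonaws.com"
    for name, url in sorted(web_ui_urls(node).items()):
        status = "open" if port_open(node, WEB_UI_PORTS[name]) else "closed or filtered"
        print(name, url, status)
```

A port reported as closed or filtered usually means either the security group rule is missing or the daemon is not running on that node.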

Common errors and their solutions:

* When starting hadoop, the following message is presented:
<32|0>bickson@biggerbro:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2$ ./bin/start-all.sh
localhost: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
localhost: @    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
localhost: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
localhost: IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
localhost: Someone could be eavesdropping on you right now (man-in-the-middle attack)!
localhost: It is also possible that the RSA host key has just been changed.
localhost: The fingerprint for the RSA key sent by the remote host is
localhost: 06:95:7b:c8:0e:85:e7:ba:aa:b1:31:6e:fc:0e:ae:4d.
localhost: Please contact your system administrator.
localhost: Add correct host key in /mnt/bigbrofs/usr6/bickson/.ssh/known_hosts to get rid of this message.
localhost: Offending key in /mnt/bigbrofs/usr6/bickson/.ssh/known_hosts:1
localhost: RSA host key for localhost has changed and you have requested strict checking.
localhost: Host key verification failed.

Solution:

echo "NoHostAuthenticationForLocalhost yes" >>~/.ssh/config

Note:

If the file ~/.ssh/config does not exist, change the command to:

echo "NoHostAuthenticationForLocalhost yes" >~/.ssh/config

* The following exception is received:

org.apache.hadoop.ipc.RemoteException: java.io.IOException:
File /user/ubuntu/temp/markedPreferences/_temporary/_attempt_local_0001_r_000000_0/part-r-00000 could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1271)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)

Solution:
1) I saw this error when the system was out of disk space. Increase the number of nodes or use a larger instance type.
2) Another cause is that the datanode has not finished booting. Wait at least 200 seconds after starting Hadoop before actually starting to run jobs.

* Job tracker fails to run with the following error:

2011-02-02 16:02:47,097 INFO org.apache.hadoop.mapred.JobTracker: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting JobTracker
STARTUP_MSG:   host = ip-10-114-75-91/10.114.75.91
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
2011-02-02 16:02:47,200 INFO org.apache.hadoop.mapred.JobTracker: Scheduler configured with (memSizeForMapSlotOnJT, memSizeForReduceSlotOnJT, limitMaxMemForMapTasks, limitMaxMemForReduceTasks) (-1, -1, -1, -1)
2011-02-02 16:02:47,220 FATAL org.apache.hadoop.mapred.JobTracker: java.lang.RuntimeException: Not a host:port pair: local
 at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:136)
 at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:123)
 at org.apache.hadoop.mapred.JobTracker.getAddress(JobTracker.java:1807)
 at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:1579)
 at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:183)
 at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:175)
 at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:3702)

Solution: edit the file /path/to/hadoop/conf/mapred-site.xml so that mapred.job.tracker is a host:port pair:

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
</property>

* When connecting to EC2 host you get the following error:

ssh -i ./graphlabkey.pem -o "StrictHostKeyChecking no" ubuntu@ec2-50-16-101-232.compute-1.amazonaws.com "/home/ubuntu/ec2-metadata -i"
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@         WARNING: UNPROTECTED PRIVATE KEY FILE!          @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions 0644 for './graphlabkey.pem' are too open.
It is recommended that your private key files are NOT accessible by others.
This private key will be ignored.

Solution:
chmod 400 graphlabkey.pem

* Exception: cannot lock storage

2011-02-03 23:47:48,623 INFO org.apache.hadoop.hdfs.server.common.Storage: Cannot lock storage /mnt. The directory is already locked.
2011-02-03 23:47:48,736 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Cannot lock storage /mnt. The directory is already locked.
at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:510)
at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:363)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:112)
at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:298)
at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:216)
at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)

Solution: find and remove the file in_use.lock from the locked storage directory (here, under /mnt).

* Exception:

tasktracker running as process XXX. Stop it first.

Solution:
1) Hadoop is already running. Stop it first using stop-all.sh (on a single machine) or stop-mapred.sh and stop-dfs.sh (on a cluster).
2) If you stopped Hadoop and you are still getting this error, check whether /tmp contains *.pid files. If so, remove them.
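
Before deleting pid files it is safer to check that the process they point to is really gone. A minimal Python sketch of that check, assuming each *.pid file contains just a process ID; the function name stale_pid_files is my own, and the sketch only lists files, it does not delete anything.

```python
import glob
import os


def stale_pid_files(directory="/tmp"):
    """Return *.pid files in `directory` whose recorded process no longer exists."""
    stale = []
    for path in sorted(glob.glob(os.path.join(directory, "*.pid"))):
        try:
            with open(path) as f:
                pid = int(f.read().strip())
        except (OSError, ValueError):
            continue  # unreadable or not a plain pid file; leave it alone
        if pid <= 0:
            continue
        try:
            os.kill(pid, 0)  # signal 0 only checks for existence, nothing is killed
        except ProcessLookupError:
            stale.append(path)  # no such process: the pid file is a leftover
        except PermissionError:
            pass  # the process exists but belongs to another user
    return stale


if __name__ == "__main__":
    for path in stale_pid_files():
        print("stale:", path)
```

Files that are not reported still belong to a live process, so stop that process first instead of removing the file.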

* Exception:

2010-08-06 12:12:06,900 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /Users/jchen/Data/Hadoop/dfs/data: namenode namespaceID = 773619367; datanode namespaceID = 2049079249
    at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:233)
    at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:148)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:298)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:216)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)

Solution: Remove all files named VERSION from all tmp directories (search thoroughly, since Hadoop has at least 3 working directories) and reformat the namenode file system.
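
Since both this fix and the in_use.lock fix above come down to hunting for a file name across several working directories, here is a small Python sketch of the search. The function name find_named_files is mine, and the directories in the usage example are the ones configured in hdfs-site.xml in these posts; adjust them to your own setup. It only lists the files, so you can review them before deleting.

```python
import os


def find_named_files(roots, names=("in_use.lock", "VERSION")):
    """Walk each root directory and return paths whose file name is in `names`."""
    matches = []
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for filename in filenames:
                if filename in names:
                    matches.append(os.path.join(dirpath, filename))
    return sorted(matches)


if __name__ == "__main__":
    # hadoop.tmp.dir, dfs.data.dir and dfs.name.dir as configured in these posts
    for path in find_named_files(["/mnt/tmp", "/mnt/tmp2", "/mnt/tmp3"]):
        print(path)
```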

Error:

bash-3.2$ ./bin/start-all.sh 
starting namenode, logging to /mnt/bigbrofs/usr6/bickson/usr7/hadoop-0.20.2/bin/../logs/hadoop-bickson-namenode-biggerbro.ml.cmu.edu.out
localhost: starting datanode, logging to /mnt/bigbrofs/usr6/bickson/usr7/hadoop-0.20.2/bin/../logs/hadoop-bickson-datanode-biggerbro.ml.cmu.edu.out
localhost: starting secondarynamenode, logging to /mnt/bigbrofs/usr6/bickson/usr7/hadoop-0.20.2/bin/../logs/hadoop-bickson-secondarynamenode-biggerbro.ml.cmu.edu.out
localhost: Exception in thread "main" java.net.BindException: Address already in use
localhost: 	at sun.nio.ch.Net.bind(Native Method)
localhost: 	at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:119)
localhost: 	at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59)
localhost: 	at org.mortbay.jetty.nio.SelectChannelConnector.open(SelectChannelConnector.java:216)
localhost: 	at org.apache.hadoop.http.HttpServer.start(HttpServer.java:425)
localhost: 	at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:165)
localhost: 	at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.<init>(SecondaryNameNode.java:115)
localhost: 	at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main(SecondaryNameNode.java:469)
starting jobtracker, logging to /mnt/bigbrofs/usr6/bickson/usr7/hadoop-0.20.2/bin/../logs/hadoop-bickson-jobtracker-biggerbro.ml.cmu.edu.out

Solution: stop every Hadoop process using ./bin/stop-all.sh, wait a few minutes and retry. If this does not help, you may need to change the port numbers in the config files.

hadoop-bickson-namenode-biggerbro.ml.cmu.edu.log:org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/name is in an inconsistent state: storage directory does not exist or is not accessible.
hadoop-bickson-namenode-biggerbro.ml.cmu.edu.log:2011-09-05 04:33:10,572 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/name is in an inconsistent state: storage directory does not exist or is not accessible.

Solution: it seems that HDFS was not formatted properly. Format it using the command
./bin/hadoop namenode -format

Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector
	at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
	at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:247)
	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:762)
	at org.apache.hadoop.io.WritableName.getClass(WritableName.java:71)
	at org.apache.hadoop.io.SequenceFile$Reader.getValueClass(SequenceFile.java:1613)
	at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1555)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
	at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.initialize(SequenceFileRecordReader.java:50)
	at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:418)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:620)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
	at org.apache.hadoop.mapred.Child.main(Child.java:170)

Solution: verify that MAHOUT_HOME is properly defined.

hadoop-bickson-namenode-biggerbro.ml.cmu.edu.log:2011-09-05 05:31:35,693 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 9000, call delete(/tmp/hadoop/mapred/system, true) from 127.0.0.1:51103: error: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /tmp/hadoop/mapred/system. Name node is in safe mode.
hadoop-bickson-namenode-biggerbro.ml.cmu.edu.log:org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /tmp/hadoop/mapred/system. Name node is in safe mode.
hadoop-bickson-namenode-biggerbro.ml.cmu.edu.log:2011-09-05 05:33:39,712 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 9000, call addBlock(/user/bickson/small_netflix_mahout_transpose/part-r-00000, DFSClient_-1810781150) from 127.0.0.1:48972: error: java.io.IOException: File /user/bickson/small_netflix_mahout_transpose/part-r-00000 could only be replicated to 0 nodes, instead of 1
hadoop-bickson-namenode-biggerbro.ml.cmu.edu.log:java.io.IOException: File /user/bickson/small_netflix_mahout_transpose/part-r-00000 could only be replicated to 0 nodes, instead of 1

Solution: This error may happen if you try to access the HDFS file system before Hadoop has finished starting up. Wait a few minutes and try again.

Tuesday, January 25, 2011

Mahout on Amazon EC2 - part 2 - Running Hadoop on a single node

This post follows part 1, which explained how to install Mahout and Hadoop on Amazon EC2.

We start by testing logistic regression.

1) Launch the Amazon AMI image you constructed using the explanation in part 1 of this post.

2) Run Hadoop using

# $HADOOP_HOME/bin/hadoop namenode -format
# $HADOOP_HOME/bin/start-all.sh
# jps     # you should see all 5 Hadoop processes (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker)

3) Run logistic regression example

cd /usr/local/mahout-0.4/
./bin/mahout org.apache.mahout.classifier.sgd.TrainLogistic --passes 100 --rate 50 --lambda 0.001 --input examples/src/main/resources/donut.csv --features 21 --output donut.model --target color --categories 2 --predictors x y xx xy yy a b c --types n n

You should see the following output:

11/01/25 14:42:45 WARN driver.MahoutDriver: No org.apache.mahout.classifier.sgd.TrainLogistic.props found on classpath, will use command-line arguments only
21
color ~ 0.353*Intercept Term + 5.450*x + -1.671*y + -4.740*xx + 0.353*xy + 0.353*yy + 5.450*a + 2.765*b + -24.161*c
      Intercept Term 0.35319
                   a 5.45000
                   b 2.76534
                   c -24.16091
                   x 5.45000
                  xx -4.73958
                  xy 0.35319
                   y -1.67092
                  yy 0.35319

    2.765337737     0.000000000    -1.670917299     0.000000000     0.000000000     0.000000000     5.449999190     0.000000000   -24.160908591    -4.739579336     0.353190637     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000

11/01/25 14:42:46 INFO driver.MahoutDriver: Program took 1016 ms

Now we run alternating matrix factorization, based on instructions by Sebastian Schelter (see https://issues.apache.org/jira/browse/MAHOUT-542).
A related GraphLab implementation is found here

0) Download the patch MAHOUT-542.patch from the above webpage.
Install it using the commands

cd /usr/local/mahout-0.4/src/
wget https://issues.apache.org/jira/secure/attachment/12469671/MAHOUT-542-5.patch
patch -p0 < MAHOUT-542-5.patch

1) Get the MovieLens 1M dataset

cd /usr/local/mahout-0.4/
wget http://www.grouplens.org/system/files/million-ml-data.tar__0.gz
tar xvzf million-ml-data.tar__0.gz

2) Convert dataset to csv format

cat ratings.dat |sed -e s/::/,/g| cut -d, -f1,2,3 > ratings.csv
cd /usr/local/hadoop-0.20.2/
./bin/hadoop fs -copyFromLocal /path/to/ratings.csv ratings.csv
./bin/hadoop fs -ls

You should see something like:

/user/ubuntu/ratings.csv
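
If you prefer to do the conversion in Python rather than with sed and cut, the following sketch should be equivalent for this file. It assumes every line of ratings.dat has "::"-separated fields with the user, item and rating in the first three positions; the function name ratings_to_csv is mine, and the row in the usage example is made up for illustration.

```python
def ratings_to_csv(lines):
    """Convert 'user::item::rating::...' lines into 'user,item,rating' lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split("::")
        # keep only the first three fields, like `cut -d, -f1,2,3`
        yield ",".join(fields[:3])


if __name__ == "__main__":
    with open("ratings.dat") as src, open("ratings.csv", "w") as dst:
        for row in ratings_to_csv(src):
            dst.write(row + "\n")
```

For example, list(ratings_to_csv(["7::42::3::1000000000"])) gives ["7,42,3"].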

3) Create a 90% training set and a 10% probe set

/usr/local/mahout-0.4$ ./bin/mahout splitDataset  --input /user/ubuntu/ratings.csv --output /user/ubuntu/myout --trainingPercentage 0.9 --probePercentage 0.1

The output should look like:

Running on hadoop, using HADOOP_HOME=/usr/local/hadoop-0.20.2/
HADOOP_CONF_DIR=/usr/local/hadoop-0.20.2/conf
11/01/27 01:09:39 WARN driver.MahoutDriver: No splitDataset.props found on classpath, will use command-line arguments only
11/01/27 01:09:39 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --input=/user/ubuntu/ratings.csv, --output=/user/ubuntu/myout, --probePercentage=0.1, --startPhase=0, --tempDir=temp, --trainingPercentage=0.9}
11/01/27 01:09:40 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
11/01/27 01:09:40 INFO input.FileInputFormat: Total input paths to process : 1
11/01/27 01:09:40 INFO mapred.JobClient: Running job: job_local_0001
11/01/27 01:09:40 INFO input.FileInputFormat: Total input paths to process : 1
11/01/27 01:09:41 INFO mapred.MapTask: io.sort.mb = 100
11/01/27 01:09:41 INFO mapred.MapTask: data buffer = 79691776/99614720
11/01/27 01:09:41 INFO mapred.MapTask: record buffer = 262144/327680
11/01/27 01:09:42 INFO mapred.JobClient:  map 0% reduce 0%
11/01/27 01:09:42 INFO mapred.MapTask: Spilling map output: record full = true
11/01/27 01:09:42 INFO mapred.MapTask: bufstart = 0; bufend = 5970616; bufvoid = 99614720
11/01/27 01:09:42 INFO mapred.MapTask: kvstart = 0; kvend = 262144; length = 327680
11/01/27 01:09:42 INFO util.NativeCodeLoader: Loaded the native-hadoop library

4) Run distributed ALS-WR to factorize the rating matrix based on the training set

bin/mahout parallelALS --input /user/ubuntu/myout/trainingSet/ --output /tmp/als/out --tempDir /tmp/als/tmp --numFeatures 20 --numIterations 10 --lambda 0.065
...
11/01/27 02:40:28 INFO mapred.JobClient:     Spilled Records=7398
11/01/27 02:40:28 INFO mapred.JobClient:     Map output bytes=691713
11/01/27 02:40:28 INFO mapred.JobClient:     Combine input records=0
11/01/27 02:40:28 INFO mapred.JobClient:     Map output records=3699
11/01/27 02:40:28 INFO mapred.JobClient:     Reduce input records=3699
11/01/27 02:40:28 INFO driver.MahoutDriver: Program took 1998612 ms

5) Measure the error of the predictions against the probe set

/usr/local/mahout-0.4$ bin/mahout evaluateALS --probes /user/ubuntu/myout/probeSet/ --userFeatures /tmp/als/out/U/ --itemFeatures /tmp/als/out/M/
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop-0.20.2/
HADOOP_CONF_DIR=/usr/local/hadoop-0.20.2/conf
11/01/27 02:42:37 WARN driver.MahoutDriver: No evaluateALS.props found on classpath, will use command-line arguments only
11/01/27 02:42:37 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --itemFeatures=/tmp/als/out/M/, --probes=/user/ubuntu/myout/probeSet/, --startPhase=0, --tempDir=temp, --userFeatures=/tmp/als/out/U/}

...

Probe [99507], rating of user [4510] towards item [2560], [1.0] estimated [1.574626183998361]
Probe [99508], rating of user [4682] towards item [171], [4.0] estimated [4.073943928686575]
Probe [99509], rating of user [3333] towards item [1215], [5.0] estimated [4.098295242062813]
Probe [99510], rating of user [4682] towards item [173], [2.0] estimated [1.9625234269143972]
RMSE: 0.8546120366924382, MAE: 0.6798083002225481
11/01/27 02:42:50 INFO driver.MahoutDriver: Program took 13127 ms
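
To make clear what the two reported numbers mean, here is a minimal Python sketch that recomputes them from the "Probe" lines: RMSE is the square root of the mean squared difference between the actual and estimated rating, and MAE is the mean absolute difference. The regular expression assumes the line format shown above, and the function name rmse_mae is my own; this is not Mahout code.

```python
import math
import re

# Matches the tail of a probe line: "[actual] estimated [predicted]"
PROBE_LINE = re.compile(r"\[([-0-9.eE]+)\] estimated \[([-0-9.eE]+)\]")


def rmse_mae(log_lines):
    """Compute (RMSE, MAE) over all probe lines found in the log."""
    errors = []
    for line in log_lines:
        m = PROBE_LINE.search(line)
        if m:
            errors.append(float(m.group(1)) - float(m.group(2)))
    if not errors:
        raise ValueError("no probe lines found")
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    return rmse, mae


if __name__ == "__main__":
    import sys
    print("RMSE: %s, MAE: %s" % rmse_mae(sys.stdin))
```

Note that running it on only the few lines quoted above will not reproduce the reported values, since those are computed over the whole probe set.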

Useful HDFS commands

* View the current state of the file system

ubuntu@domU-12-31-39-00-18-51:/usr/local/hadoop-0.20.2$ ./bin/hadoop dfsadmin -report
Configured Capacity: 10568916992 (9.84 GB)
Present Capacity: 3698495488 (3.44 GB)
DFS Remaining: 40173568 (38.31 MB)
DFS Used: 3658321920 (3.41 GB)
DFS Used%: 98.91%
Under replicated blocks: 56
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 1 (1 total, 0 dead)

Name: 127.0.0.1:50010
Decommission Status : Normal
Configured Capacity: 10568916992 (9.84 GB)
DFS Used: 3658321920 (3.41 GB)
Non DFS Used: 6870421504 (6.4 GB)
DFS Remaining: 40173568(38.31 MB)
DFS Used%: 34.61%
DFS Remaining%: 0.38%
Last contact: Tue Feb 01 21:10:15 UTC 2011
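
The two numbers worth watching in this report are the number of live datanodes and the remaining DFS space, since running out of either leads to the "could only be replicated to 0 nodes" error described in part 3. A small Python sketch that pulls them out of the report text, assuming the output format shown above; the function name parse_dfs_report and the dictionary keys are my own.

```python
import re


def parse_dfs_report(report):
    """Extract cluster-wide figures from `hadoop dfsadmin -report` output."""
    summary = {}
    # The cluster summary comes before the per-datanode sections,
    # so the first "DFS Remaining" match is the cluster-wide value.
    m = re.search(r"^DFS Remaining: (\d+)", report, re.MULTILINE)
    if m:
        summary["dfs_remaining_bytes"] = int(m.group(1))
    m = re.search(r"^Datanodes available: (\d+) \((\d+) total, (\d+) dead\)",
                  report, re.MULTILINE)
    if m:
        summary["datanodes_available"] = int(m.group(1))
        summary["datanodes_dead"] = int(m.group(3))
    return summary


if __name__ == "__main__":
    import sys
    print(parse_dfs_report(sys.stdin.read()))
```

Pipe the report into it, for example: ./bin/hadoop dfsadmin -report | python parse_report.py (the script file name is hypothetical). For the report above it returns 40173568 remaining bytes and 1 available datanode, which explains why jobs start failing: the disk is practically full.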

* Delete a directory

ubuntu@domU-12-31-39-00-18-51:/usr/local/hadoop-0.20.2$ ./bin/hadoop fs -rmr temp/markedPreferences
Deleted hdfs://localhost:9000/user/ubuntu/temp/markedPreferences

Monday, January 24, 2011

Mahout/Hadoop on Amazon EC2 - part 1 - Installation

This post explains how to install the Mahout ML framework on top of Amazon EC2 (Ubuntu based machine).
The notes are based on older Mahout notes: https://cwiki.apache.org/MAHOUT/mahout-on-amazon-ec2.html which are unfortunately outdated.

The next part of this post (part 2) explains how to run two Mahout applications:
logistic regression and alternating least squares.

Note: part 5 of this post explains how to make the same installation on top of
an EC2 high computing node (CentOS/Redhat machine). Unfortunately, several steps
are different.

Part 6 of this post explains how to fine tune performance on a large cluster.

The full procedure should take around 2-3 hours.. :-(

To confuse the users, Amazon has 5 types of IDs:
- Your email and password for getting into the AWS console
- Your AWS string name and private key string
- Your public/private key pair
- Your X.509 certificate (another private/public key pair)
- Your Amazon ID (12 digit number) which is very hard to find on their website
Make sure you have all your IDs ready. If you have not done so yet, generate the keys using the AWS console.

1) Select and launch instance ami-08f40561 from the Amazon AWS console. Alternatively, you can select any other Ubuntu based 64 bit image.
TIP: I recommend using an EBS backed image, since it makes saving your work at the end much easier.

2) Verify that Java is installed correctly, since some libs are missing in the AMI

sudo apt-get install openjdk-6-jdk
sudo apt-get install openjdk-6-jre-headless
sudo apt-get install openjdk-6-jre-lib

3) In the root home directory, run:

# sudo apt-get update
# sudo apt-get upgrade
# sudo apt-get install python-setuptools
# sudo easy_install "simplejson==2.0.9"
# sudo easy_install "boto==1.8d"
# sudo apt-get install ant
# sudo apt-get install subversion
# sudo apt-get install maven2

4) Download and install Hadoop

# wget http://apache.cyberuse.com//hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz 
# tar vxzf hadoop-0.20.2.tar.gz
# sudo  mv hadoop-0.20.2 /usr/local/

A comment: I once managed to install 0.21.0, but after the EC2 node was killed and restarted
Mahout refused to work any more. So I reverted to Hadoop 0.20.2

Add the following to $HADOOP_HOME/conf/hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk/
# The maximum amount of heap to use, in MB. Default is 1000
export HADOOP_HEAPSIZE=2000

Add the following to $HADOOP_HOME/conf/core-site.xml and also $HADOOP_HOME/conf/mapred-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/tmp/</value>
  </property>
</configuration>

Edit the file hdfs-site.xml

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/tmp/</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/mnt/tmp2/</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/mnt/tmp3/</value>
  </property>
</configuration>

Note: the directories point to /mnt because regular Amazon EC2 instances have 400GB of free space there (vs. only 10GB of free space on the root partition). You may
need to change the permissions of /mnt so that this file system is writable by Hadoop.
To do so, execute the following command:

sudo chmod 777 /mnt

Set up authorized keys for localhost login without passwords (the name node itself is formatted in step 8):

# ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
# cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
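
If ssh still asks for a password after this, one common cause is that sshd refuses key files whose permissions are too open. The sketch below tightens them. `fix_ssh_perms` is a hypothetical helper name, and I am assuming the default sshd setting that checks these permissions.

```shell
# Restrict an SSH directory (default: ~/.ssh) and its authorized_keys file
# so that only the owner can read or write them.
fix_ssh_perms() {
    ssh_dir=${1:-"$HOME/.ssh"}
    chmod 700 "$ssh_dir" || return 1
    if [ -f "$ssh_dir/authorized_keys" ]; then
        chmod 600 "$ssh_dir/authorized_keys" || return 1
    fi
}
```

Call it as `fix_ssh_perms` with no arguments. After that, `ssh localhost` should log in without a password prompt.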

5) Add the following to your .profile

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
export HADOOP_HOME=/usr/local/hadoop-0.20.2
export HADOOP_CONF_DIR=/usr/local/hadoop-0.20.2/conf
export MAHOUT_HOME=/usr/local/mahout-0.4/
export MAHOUT_VERSION=0.4-SNAPSHOT
export MAVEN_OPTS=-Xmx1024m

6) Check out and build Mahout from trunk. Verify that the paths in .profile point to the exact version you downloaded.

svn co http://svn.apache.org/repos/asf/mahout/trunk mahout
cd mahout
mvn clean install
cd ..
sudo mv mahout /usr/local/mahout-0.4

Note: I am getting a lot of questions about the mvn compilation.
a) On Windows-based machines, it seems that running inside a Linux VM makes some
of the tests fail. Try compiling with the flag -DskipTests.
b) If compilation fails, you can try downloading the compiled jars from
http://mirror.its.uidaho.edu/pub/apache//mahout/0.4/ (the compiled jars are
in the files without "src" in the filename). Just unpack the tgz and place its contents
in /usr/local/mahout-0.4/ instead of performing the compilation step above.

7) Install other required stuff (optional: in the Amazon EC2 image I am using
those libraries are preinstalled).

sudo apt-get install wget alien ruby libopenssl-ruby1.8 rsync curl

8) Run Hadoop, just to prove you can, and test Mahout by building the Reuters dataset on it. Finally, delete the files and shut it down.

$HADOOP_HOME/bin/hadoop namenode -format
$HADOOP_HOME/bin/start-all.sh
jps     # you should see all 5 Hadoop processes (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker)
cd $MAHOUT_HOME
./examples/bin/build-reuters.sh
$HADOOP_HOME/bin/stop-all.sh
rm -rf /tmp/*   # delete the Hadoop files

Then edit $HADOOP_HOME/conf/mapred-site.xml to include the following:
<property>
   <name>mapred.child.java.opts</name>
   <value>-Xmx2000m</value>
</property>
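
The jps check in step 8 is easy to get wrong by eye, so here is a small sketch that does the comparison for you. `missing_daemons` is a hypothetical helper name. It assumes jps prints one "<pid> <ClassName>" pair per line.

```shell
# Read `jps` output on stdin and print each of the five expected
# Hadoop daemons that does not appear in it, one per line.
missing_daemons() {
    jps_output=$(cat)
    for daemon in NameNode SecondaryNameNode DataNode JobTracker TaskTracker; do
        # Compare the second field exactly, so NameNode does not
        # accidentally match SecondaryNameNode.
        printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' ||
            echo "$daemon"
    done
}
```

Run it as `jps | missing_daemons` after start-all.sh. Empty output means all five processes are up.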

9) Allow Hadoop to run even when you later start the image on a different EC2 machine:

echo "NoHostAuthenticationForLocalhost yes" >>~/.ssh/config
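
Running the echo command from step 9 more than once appends the same line again each time. If you script this setup, a sketch like the following keeps the file clean. `append_once` is a hypothetical helper name, not part of any tool.

```shell
# Append a line to a file only if that exact line is not already present.
# Creates the file (and its parent directory) when missing.
append_once() {
    line=$1
    file=$2
    mkdir -p "$(dirname "$file")"
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

Usage: `append_once "NoHostAuthenticationForLocalhost yes" ~/.ssh/config`. Like the plain echo, this adds the line at the end of the file, so if the file already contains Host sections the option will apply to the last one.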

If everything went well, you may want to bundle the output into an AMI image, so next time you will not need to install everything from scratch:
10) Install Amazon AMI tools
a) Edit the file /etc/apt/sources.list
and uncomment all the lines with multiverse (note: you need to call the editor as root!)
b) update the repositories

sudo apt-get update

c) Install the AMI and API tools

sudo apt-get install ec2-ami-tools ec2-api-tools

Thanks Kevin for this fix!
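
If you prefer not to edit /etc/apt/sources.list by hand in step 10a, the uncommenting can be sketched with sed. `enable_multiverse` is a hypothetical helper name. It assumes the commented-out lines look like "# deb ... multiverse" or "# deb-src ... multiverse", and it only prints the result, so nothing is changed until you copy it back yourself.

```shell
# Print the given sources.list with every commented-out "deb" or "deb-src"
# line that mentions multiverse uncommented. The file itself is not modified.
enable_multiverse() {
    sed -E 's/^#[[:space:]]*(deb(-src)?[[:space:]].*multiverse.*)$/\1/' "$1"
}
```

For example, run `enable_multiverse /etc/apt/sources.list > /tmp/sources.list.new`, review the new file, and then copy it over the original as root.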

11) In order to save your work, you need to bundle and save the image.
There are two alternatives. If you started an EBS-backed image, you can simply use the Amazon AWS user interface: right-click on the running instance and select "save instance".
If the image is not EBS-backed, you will need to do it manually:

- Note: you need to use the private key of the X.509 certificate and not the private key of the public/private key pair!

[Each of the following commands should be entered on a single shell line.]

First you need to create a bucket named mahoutbucket using the Amazon AWS console,
under the S3 tab.

sudo ec2-bundle-vol -k /mnt/pk-<your private X.509 key>.pem -c /mnt/cert-<your public x.509 key>.pem -u <Your AWS ID (12 digit number)> -d /mnt -p mahout
sudo ec2-upload-bundle -b mahoutbucket -m /mnt/mahout.manifest.xml -a <Your AWS access key ID> -s <Your AWS secret access key>
sudo ec2-register -K /mnt/pk-<Your X.509 private key>.pem -C /mnt/cert-<Your X.509 public certificate>.pem --name mahoutbucket/  mahoutbucket/mahout.manifest.xml

If you are lucky, you will get a result of the form: IMAGE ami-XXXXXXX
where XXXXXXX is the generated image number.
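
If you script the registration, you will probably want the image ID in a variable. A minimal sketch, assuming the output line has the form "IMAGE ami-XXXXXXX" as described above (`extract_ami_id` is a hypothetical helper name):

```shell
# Read ec2-register output on stdin and print the AMI ID from the
# first line whose first field is IMAGE.
extract_ami_id() {
    awk '$1 == "IMAGE" { print $2; exit }'
}
```

Pipe the ec2-register command into `extract_ami_id` to capture the ID. If registration failed and no IMAGE line was printed, the result is empty.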

A more detailed explanation of this procedure, along with many potential pitfalls, can be found
in my blog post here.
Thanks to Kevin and Selwyn!
