These notes are based on the older Mahout notes at https://cwiki.apache.org/MAHOUT/mahout-on-amazon-ec2.html, which are unfortunately outdated.
The next part of this post (part 2) explains how to run two Mahout applications:
logistic regression and alternating least squares.
Note: part 5 of this post explains how to perform the same installation on top of
an EC2 high-compute node (a CentOS/Red Hat machine). Unfortunately, several steps differ there.
Part 6 of this post explains how to fine-tune performance on a large cluster.
The full procedure should take around 2-3 hours. :-(
To confuse the users, Amazon has 5 types of IDs:
- Your email and password for getting into the AWS console
- Your AWS string name and private key string
- Your public/private key pair
- Your X.509 certificate (another private/public key pair)
- Your Amazon ID (a 12-digit number), which is very hard to find on their website
Make sure you have all your IDs ready; if you have not generated the keys yet, do so using the AWS console.
1) Select and launch instance ami-08f40561 from the Amazon AWS console. Alternatively, you can select any other Ubuntu-based 64-bit image.
TIP: It is recommended to use an EBS-backed image, since saving your work at the end will be much easier.
2) Verify Java is installed correctly - some libs are missing in the AMI:
sudo apt-get install openjdk-6-jdk
sudo apt-get install openjdk-6-jre-headless
sudo apt-get install openjdk-6-jre-lib
3) In the root home directory, execute:
# sudo apt-get update
# sudo apt-get upgrade
# sudo apt-get install python-setuptools
# sudo easy_install "simplejson==2.0.9"
# sudo easy_install "boto==1.8d"
# sudo apt-get install ant
# sudo apt-get install subversion
# sudo apt-get install maven2
4) Get the Hadoop sources:
# wget http://apache.cyberuse.com//hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
# tar vxzf hadoop-0.20.2.tar.gz
# sudo mv hadoop-0.20.2 /usr/local/
A comment: I once managed to install 0.21.0, but after the EC2 node was killed and restarted,
Mahout refused to work any more, so I reverted to Hadoop 0.20.2.
Add the following to $HADOOP_HOME/conf/hadoop-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk/
# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=2000
Add the following to $HADOOP_HOME/conf/core-site.xml and also to $HADOOP_HOME/conf/mapred-site.xml:
<pre class="xml" name="code"><configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/tmp/</value>
  </property>
</configuration></pre>
Then edit the file $HADOOP_HOME/conf/hdfs-site.xml:
<pre class="xml" name="code"><configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/tmp/</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/mnt/tmp2/</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/mnt/tmp3/</value>
  </property>
</configuration></pre>
Note: the directories are pointed to /mnt because regular Amazon EC2 instances have around 400GB of free space there (vs. only 10GB of free space on the root partition). You may
need to change the permissions of /mnt so this file system is writable by Hadoop.
So execute the following command:
sudo chmod 777 /mnt
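The note above can be condensed into a small helper (the function name is mine, not from the Hadoop docs); the directory names mirror the hadoop.tmp.dir, dfs.data.dir, and dfs.name.dir values in the config files:

```shell
# prepare_hadoop_dirs: creates the Hadoop working directories named in
# the config above (tmp, tmp2, tmp3) and opens their permissions.
# On the EC2 node, run it as root against /mnt.
prepare_hadoop_dirs() {
    root=$1
    mkdir -p "$root/tmp" "$root/tmp2" "$root/tmp3" || return 1
    chmod 777 "$root/tmp" "$root/tmp2" "$root/tmp3"
}
```

On the EC2 node, run `sudo chmod 777 /mnt` first (as above) and then `prepare_hadoop_dirs /mnt`.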
Set up authorized keys for password-less login to localhost (the name node itself is formatted later, in step 8):
# ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
# cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
5) Add the following to your .profile:
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
export HADOOP_HOME=/usr/local/hadoop-0.20.2
export HADOOP_CONF_DIR=/usr/local/hadoop-0.20.2/conf
export MAHOUT_HOME=/usr/local/mahout-0.4/
export MAHOUT_VERSION=0.4-SNAPSHOT
export MAVEN_OPTS=-Xmx1024m
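To catch typos in those exports early, here is a small sketch of mine (not part of Hadoop or Mahout) that verifies each variable points at an existing directory after you `source ~/.profile`:

```shell
# check_env_dirs: prints one status line per path; "MISSING" flags an
# export that points at a directory that does not exist.
# Usage (after `source ~/.profile`):
#   check_env_dirs "$JAVA_HOME" "$HADOOP_HOME" "$HADOOP_CONF_DIR" "$MAHOUT_HOME"
check_env_dirs() {
    status=0
    for d in "$@"; do
        if [ -d "$d" ]; then
            echo "ok: $d"
        else
            echo "MISSING: $d"
            status=1
        fi
    done
    return $status
}
```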
6) Get and build Mahout:
svn co http://svn.apache.org/repos/asf/mahout/trunk mahout
cd mahout
mvn clean install
cd ..
sudo mv mahout /usr/local/mahout-0.4
Note: I am getting a lot of questions about the mvn compilation.
a) On Windows-based machines, it seems that running a Linux VM makes some
of the tests fail. Try to compile with the flag -DskipTests.
b) If compilation fails, you can try to download compiled jars from
http://mirror.its.uidaho.edu/pub/apache//mahout/0.4/ (the compiled jars are
in the files without "src" in the filename). Just open the tgz and place its contents
in /usr/local/mahout-0.4/ instead of performing the compilation step above.
7) Install other required tools (optional: in the Amazon EC2 image I am using,
those libraries are preinstalled):
sudo apt-get install wget alien ruby libopenssl-ruby1.8 rsync curl
8) Run Hadoop, just to prove you can, and test Mahout by building the Reuters dataset on it. Finally, delete the files and shut it down.
$HADOOP_HOME/bin/hadoop namenode -format
$HADOOP_HOME/bin/start-all.sh
jps  // you should see all 5 Hadoop processes (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker)
cd $MAHOUT_HOME
./examples/bin/build-reuters.sh
$HADOOP_HOME/bin/stop-all.sh
rm -rf /tmp/*  // delete the Hadoop files
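Instead of eyeballing the `jps` listing, you can pipe it through a tiny helper (my own sketch, not part of Hadoop) that names any missing daemon:

```shell
# check_daemons: reads a `jps` listing on stdin and reports any of the
# five expected Hadoop daemons that are not running.
# Usage: jps | check_daemons
check_daemons() {
    listing=$(cat)
    missing=0
    for daemon in NameNode SecondaryNameNode DataNode JobTracker TaskTracker; do
        # -w avoids "NameNode" matching inside "SecondaryNameNode"
        if ! printf '%s\n' "$listing" | grep -qw "$daemon"; then
            echo "missing: $daemon"
            missing=1
        fi
    done
    [ "$missing" -eq 0 ] && echo "all 5 daemons running"
    return $missing
}
```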
// edit $HADOOP_HOME/conf/mapred-site.xml to include the following:
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2000m</value>
</property>
9) Allow Hadoop to run even if you will later work from a different EC2 machine:
echo "NoHostAuthenticationForLocalhost yes" >>~/.ssh/config
If everything went well, you may want to bundle the result into an AMI image, so next time you will not need to install everything from scratch:
10) Install the Amazon AMI tools.
a) Edit the file /etc/apt/sources.list
and uncomment all the lines with multiverse (note: you need to run the editor as root!).
b) Update the repositories:
sudo apt-get update
c) Install the AMI and API tools:
sudo apt-get install ec2-ami-tools ec2-api-tools
Thanks Kevin for this fix!
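To confirm the install worked, you can check that the tools actually landed on the PATH; the `have` helper below is my own shorthand, not part of the packages:

```shell
# have: succeeds if the named command is on the PATH.
have() { command -v "$1" >/dev/null 2>&1; }

# Check each EC2 tool installed above.
for t in ec2-bundle-vol ec2-upload-bundle ec2-register; do
    have "$t" && echo "found $t" || echo "MISSING $t"
done
```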
11) In order to save your work, you need to bundle and save the image.
There are two alternatives here. If you started an EBS-backed image, you can simply use the Amazon AWS user interface: right-click on the running instance and select "save instance".
If the image is not EBS-backed, you will need to do it manually:
- Note: you need to use the private key of the X.509 certificate, and not the private key of the public/private key pair!
[Each of the following commands should span one shell line.]
First you need to create a bucket named mahoutbucket using the Amazon AWS console,
under the S3 tab. Then:
sudo ec2-bundle-vol -k /mnt/pk-<your private X.509 key>.pem -c /mnt/cert-<your public X.509 key>.pem -u <Your AWS ID (12 digit number)> -d /mnt -p mahout
sudo ec2-upload-bundle -b mahoutbucket -m /mnt/mahout.manifest.xml -a <Your AWS String> -s <Your AWS string password>
sudo ec2-register -K /mnt/pk-<Your X.509 private key>.pem -C /mnt/cert-<Your X.509 public certificate>.pem --name mahoutbucket/ mahoutbucket/mahout.manifest.xml
If you are lucky, you will get a result of the form:
IMAGE ami-XXXXXXX
where XXXXXXX is the generated image number.
More detailed explanations of this procedure, along with many potential pitfalls, can be found
in my blog post here.
Thanks to Kevin and Selwyn!