These notes are based on the older Mahout notes at https://cwiki.apache.org/MAHOUT/mahout-on-amazon-ec2.html, which are unfortunately outdated.
The next post (part 2) explains how to run two Mahout applications:
logistic regression and alternating least squares.
Note: part 5 of this post explains how to do the same installation on top of
an EC2 high-compute node (CentOS/Redhat machine). Unfortunately, several steps
are different.
Part 6 of this post explains how to fine-tune performance on a large cluster.
The full procedure should take around 2-3 hours. :-(
To confuse the users, Amazon has 5 types of IDs:
- Your email and password for getting into the AWS console
- Your AWS string name and private key string
- Your public/private key pair
- Your X.509 certificate (another private/public key pair)
- Your Amazon ID (12 digit number) which is very hard to find on their website
Make sure you have all your IDs ready; if you have not done so yet, generate the keys using the AWS console.
1) Select and launch instance ami-08f40561 from the Amazon AWS console. Alternatively you can select any other Ubuntu-based 64-bit image.
TIP: It is recommended to use an EBS-backed image, since saving your work at the end will be much easier.
2) Verify Java is installed correctly - some libraries are missing in the AMI:
sudo apt-get install openjdk-6-jdk
sudo apt-get install openjdk-6-jre-headless
sudo apt-get install openjdk-6-jre-lib
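To confirm the JDK is picked up correctly (a quick sanity check; the exact OpenJDK build string will vary with the image), you can run:
java -version
ls /usr/lib/jvm/ // the java-6-openjdk directory should be listed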
3) In the root home directory run:
# sudo apt-get update
# sudo apt-get upgrade
# sudo apt-get install python-setuptools
# sudo easy_install "simplejson==2.0.9"
# sudo easy_install "boto==1.8d"
# sudo apt-get install ant
# sudo apt-get install subversion
# sudo apt-get install maven2
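Optionally, verify the build toolchain is in place before continuing (the reported versions will vary with the image):
# ant -version
# mvn -version
# svn --version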
4) Get the Hadoop distribution:
# wget http://apache.cyberuse.com//hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
# tar vxzf hadoop-0.20.2.tar.gz
# sudo mv hadoop-0.20.2 /usr/local/
A comment: I once managed to install 0.21.0, but after the EC2 node was killed and restarted,
Mahout refused to work any more, so I reverted to Hadoop 0.20.2.
Add the following to $HADOOP_HOME/conf/hadoop-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk/
# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=2000
Add the following to $HADOOP_HOME/conf/core-site.xml and also to $HADOOP_HOME/conf/mapred-site.xml:
<pre class="xml" name="code"><configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/tmp/</value>
  </property>
</configuration></pre>
Then edit the file $HADOOP_HOME/conf/hdfs-site.xml:
<pre class="xml" name="code"><configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/tmp/</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/mnt/tmp2/</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/mnt/tmp3/</value>
  </property>
</configuration></pre>
Note: the directories are pointed at /mnt because regular Amazon EC2 instances have around 400GB of free space there (vs. only 10GB of free space on the root partition). You may
need to change the permissions of /mnt so this file system is writable by Hadoop.
So execute the following command:
sudo chmod 777 /mnt
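Depending on the image, the /mnt sub-directories referenced in the configuration above may not exist yet. Hadoop will normally create them itself once /mnt is writable, but a minimal sketch of creating them up front (names taken from the configuration above) is:
sudo mkdir -p /mnt/tmp /mnt/tmp2 /mnt/tmp3
sudo chmod 777 /mnt/tmp /mnt/tmp2 /mnt/tmp3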
Set up authorized keys for localhost login without passwords and format your namenode:
# ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
# cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
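To check that passwordless login actually works before Hadoop needs it (the first connection may ask you to confirm the host key), you can try:
# ssh localhost echo "passwordless ssh works"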
5) Add the following to your .profile:
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
export HADOOP_HOME=/usr/local/hadoop-0.20.2
export HADOOP_CONF_DIR=/usr/local/hadoop-0.20.2/conf
export MAHOUT_HOME=/usr/local/mahout-0.4/
export MAHOUT_VERSION=0.4-SNAPSHOT
export MAVEN_OPTS=-Xmx1024m
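The new variables only take effect in a fresh login shell; to pick them up in the current shell and sanity-check them, you can run:
source ~/.profile
echo $JAVA_HOME $HADOOP_HOME $MAHOUT_HOME // none of these should print blank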
6) Check out and build Mahout:
svn co http://svn.apache.org/repos/asf/mahout/trunk mahout
cd mahout
mvn clean install
cd ..
sudo mv mahout /usr/local/mahout-0.4
Note: I am getting a lot of questions about the mvn compilation.
a) On Windows-based machines, it seems that running a Linux VM makes some
of the tests fail. Try to compile with the flag -DskipTests.
b) If compilation fails, you can try to download compiled jars from
http://mirror.its.uidaho.edu/pub/apache//mahout/0.4/ (the compiled jars are
in the files without "src" in the filename). Just open the tgz and place it
in /usr/local/mahout-0.4/ instead of the compilation step above.
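For reference, the skip-tests build from note (a) would look like this (run inside the mahout checkout from step 6, before moving it to /usr/local):
cd mahout
mvn clean install -DskipTests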
7) Install other required packages (optional: in the Amazon EC2 image I am using,
these libraries are preinstalled).
sudo apt-get install wget alien ruby libopenssl-ruby1.8 rsync curl
8) Run Hadoop, just to prove you can, and test Mahout by building the Reuters dataset on it. Finally, delete the files and shut it down.
$HADOOP_HOME/bin/hadoop namenode -format
$HADOOP_HOME/bin/start-all.sh
jps // you should see all 5 Hadoop processes (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker)
cd $MAHOUT_HOME
./examples/bin/build-reuters.sh
$HADOOP_HOME/bin/stop-all.sh
rm -rf /tmp/* // delete the Hadoop files
Edit $HADOOP_HOME/conf/mapred-site.xml to include the following:
<pre class="xml" name="code">  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx2000m</value>
  </property></pre>
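Also, while the daemons are still up (i.e. before the stop-all.sh step above), a quick HDFS sanity check, assuming the hdfs://localhost:9000 configuration from step 4, is:
$HADOOP_HOME/bin/hadoop fs -ls /
$HADOOP_HOME/bin/hadoop fs -mkdir /sanity-test
$HADOOP_HOME/bin/hadoop fs -rmr /sanity-test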
9) Allow Hadoop to start even when this image is later run on a different EC2 machine:
echo "NoHostAuthenticationForLocalhost yes" >>~/.ssh/config
If everything went well, you may want to bundle the output into an AMI image, so next time you will not need to install everything from scratch:
10) Install Amazon AMI tools
a) Edit the file /etc/apt/sources.list
and uncomment all the lines with multiverse (note: you need to call the editor as root!)
b) Update the repositories:
sudo apt-get update
c) Install the AMI and API tools:
sudo apt-get install ec2-ami-tools ec2-api-tools
Thanks Kevin for this fix!
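If you intend to call the EC2 API tools directly, they also expect to find your X.509 credentials; a minimal sketch, assuming the key files were copied to /mnt, is below. Alternatively you can keep passing them explicitly with -K/-C as in step 11.
export EC2_PRIVATE_KEY=/mnt/pk-<your private X.509 key>.pem
export EC2_CERT=/mnt/cert-<your public X.509 certificate>.pem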
11) In order to save your work, you need to bundle and save the image.
Here there are two alternatives. If you started an EBS-backed image, you can simply use the Amazon AWS user interface: right mouse click on the running instance and select "save instance".
If the image is not EBS-backed, you will need to do it manually:
- Note: you need to use the private key of the X.509 certificate and not the private key of the public/private key pair!
[Each of the following commands should be entered on a single shell line.]
First you need to create a bucket named mahoutbucket using the Amazon AWS console
under S3 tab.
sudo ec2-bundle-vol -k /mnt/pk-<your private X.509 key>.pem -c /mnt/cert-<your public x.509 key>.pem -u <Your AWS ID (12 digit number)> -d /mnt -p mahout
sudo ec2-upload-bundle -b mahoutbucket -m /mnt/mahout.manifest.xml -a <Your AWS String> -s <Your AWS string password>
sudo ec2-register -K /mnt/pk-<Your X.509 private key>.pem -C /mnt/cert-<Your X.509 public certificate>.pem --name mahoutbucket/ mahoutbucket/mahout.manifest.xml
If you are lucky, you will get a result of the type: IMAGE ami-XXXXXXX
where XXXXXXX is the generated image number.
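Once the register command succeeds, you should be able to see and launch the new AMI from the command line as well (a sketch only; <your keypair> is whatever key pair name you normally launch instances with):
ec2-describe-images -o self // the new mahout AMI should appear in the list
ec2-run-instances ami-XXXXXXX -k <your keypair> -t m1.large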
More detailed explanations about this procedure, along with many potential pitfalls, can be found
in my blog post here.
Thanks to Kevin and Selwyn!
Wow....close but no cigar...your instructions do not work. You might have gotten everything to work but following verbatim...you where not very explicit. Please correct or remove post so as to avoid wasting the time for others.
Please be more specific. What step did not work for you?
Hi Danny,
Thanks for the guidelines - they've helped enormously in getting me going in the right direction.
I noticed there are a few things that could be clearer or slightly easier than given in your instructions (which may be what David above might be referring to):
1 - ec2-ami-tools and ec2-api-tools can be installed via apt (i.e. sudo apt-get install ec2-ami-tool ec2-api-tools), which means you can directly call ec2-bundle-vol, ec2-upload-bundle, and ec2-register, rather than the funky shell calls you have in step 10 of your instructions (these didn't work for me).
2 - there is a typo in line 4 of step 4) mahout is misspelled mahoot
Aside from that things work great. I'm looking forward to working through the rest of your series on Mahout and Hadoop.
Hi Selwyn!
Thanks a lot for your feedback! I have fixed the post according to your suggestions.
I will be glad to get further feedback on the other posts. The reason I wrote those instructions is that I am now in the process of comparing Mahout to our Parallel ML framework called GraphLab: http://www.graphlab.ml.cmu.edu/
GraphLab is significantly faster when implementing iterative algorithms. I will also be glad to learn what you are working on.
Best,
DB
Danny,
Thanks for the guide, very helpful. Everything worked great for me up to step 8 minus minor typos (drop the 'e' in /user/ for the mahout home env var in step 4 and in 8 there should be an 's' on the end of 'ec2-ami-tool').
In step 8 it might help people out to let them know they'll need to have multiverse enabled to install the api tools. When I reached this step I used this documentation here: https://help.ubuntu.com/community/EC2StartersGuide
For step 9 I couldn't get those commands working so I switched over to the Amazon's documentation here which got me to the same end point: http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/creating-snapshot-s3-linux.html
Again, very helpful article, I hope they update the mahout documentation from it as I wasted some time trying to figure out why that didn't work. I look forward to working through part 2 tomorrow!
-Kevin
Hi Kevin!
It is a pleasure to learn that the instructions were useful. I have fixed everything based on your comments. Would you mind sharing what you are working on?
I'm comparing how some in-house LDA stuff will perform in Hadoop versus custom distribution using JBoss clustering. Since it's highly iterative I don't think map-reduce is the "right" framework for it but it'll be interesting to see the trade-off as far as performance vs ease of setup and ability to handle larger data sets.
Danny,
Thanks for posting this. When I fire up this AMI it only gives me the option to run it on a Large instance (or greater). Same with the AMI on the Mahout wiki page. Are there any public AMIs I can run on a small instance that will be suitable for mahout?
Also, what program are you using to run the commands in those screenshots? I'm on a windows machine and have been using PuTTy for my instances so far which is a bit clumsy.
Thanks in advance!
Mike
Hi Mike!
Any Linux will do. Note that you can compile using my instructions on a large instance, and then simply zip and copy the whole mahout directory onto another machine. Since it is written in Java, you can run it on a small instance machine as well. You can also compile on a small instance, but depending on the OS there may be small variations in syntax.
About your second question - I would recommend installing cygwin, and then running the cygwin program Xwin Server.
Good luck!
Thanks for the detailed post.
Just a complementary note: if you don't want/need to fine tune the hadoop node parameters, it is probably much easier and faster to use Apache Whirr, which helps you set up hadoop / hbase / zookeeper / cassandra clusters on both EC2 and rackspacecloud using a single ~10 lines configuration file.
http://incubator.apache.org/whirr/quick-start-guide.html
Danny,
I noticed you posted updates for exactly this but running the chmod 777 command on the /mnt folder solved a lot of my problems on the EC2.
I have not yet tried to compile mahout from the start using the -DSkipTests option though this might work. I did have the same test failure issue that you mention above, though that was on an ubuntu EC2 accessed via PuTTy on a Windows machine. I'll let you know how the -DSkipTests option goes. Thanks again for posting and updating this!
Mike
My pleasure!
- Danny
Hi Danny:
Thanks for the detailed instructions. I got stuck in step 6. I think it has to do with the security keys. I get the following error:
starting namenode, logging to /usr/local/hadoop-0.20.2/bin/../logs/hadoop-root-namenode-ip-10-2-34-166.out
localhost: Permission denied (publickey).
localhost: Permission denied (publickey).
starting jobtracker, logging to /usr/local/hadoop-0.20.2/bin/../logs/hadoop-root-jobtracker-ip-10-2-34-166.out
localhost: Permission denied (publickey).
Any suggestions?
Thanks in advance
\Carlos
Hi Carlos.
This looks like an ssh problem. My step 6 instructs how to start Hadoop on a single machine. It seems that you are trying to start it on multiple machines. See my post: http://bickson.blogspot.com/2011/02/mahout-on-amazon-ec2-part-4-running-on.html Try to manually ssh from the machine you run the start_dfs.sh script into the namenode and jobtracker machines. If you fail to do it, repeat all steps involving key creation and propagation.
Good luck! DB
Hi Danny,
I am new to the ubuntu terminal.
When you say "add the following to $HADOOP_HOME/conf/hadoop-env.sh", how exactly do you go about this?
Same question for:
"add the following to $HADOOP_HOME/conf/core-site.xml and also $HADOOP_HOME/conf/mapred-site.xml"
Thanks!
Corey
Hi Corey,
You should use one of the available editors like vi or nano.
For example: nano $HADOOP_HOME/conf/core-site.xml
then edit the file and save.
Thanks for the reply! Works fine.
hello ..i got error on when i star all node using start-all.sh command ...and error is JAVA_HOME is not set . then i set through terminal export JAVA_HOME =/usr/lib/jvm/java-6-openjdk/
ReplyDeleteso plz guide me ...
thnx
Sagar
Hi
please send me the full error. check your hadoop-env.sh to verify it points to the correct java. check your .profile settings as well. send me also the output of "ls -lrt /usr/lib/jvm/java-6-openjdk/" and also of 'java -version'
- Danny
yes sir i check my both hadoop-env.sh and .profile file where JAVA_HOME is set as above...
and my final error of running command
sagar@sagar:/usr/local/hadoop-0.20.2$ bin/start-all.sh
starting namenode, logging to /usr/local/hadoop-0.20.2/bin/../logs/hadoop-sagar-namenode-sagar.out
localhost: starting datanode, logging to /usr/local/hadoop-0.20.2/bin/../logs/hadoop-sagar-datanode-sagar.out
localhost: Error: JAVA_HOME is not set.
localhost: starting secondarynamenode, logging to /usr/local/hadoop-0.20.2/bin/../logs/hadoop-sagar-secondarynamenode-sagar.out
localhost: Error: JAVA_HOME is not set.
starting jobtracker, logging to /usr/local/hadoop-0.20.2/bin/../logs/hadoop-sagar-jobtracker-sagar.out
localhost: starting tasktracker, logging to /usr/local/hadoop-0.20.2/bin/../logs/hadoop-sagar-tasktracker-sagar.out
localhost: Error: JAVA_HOME is not set.
the output of ls -lrt /usr/lib/jvm/java-6-openjdk/
sagar@sagar:/usr/local/hadoop-0.20.2$ ls -lrt /usr/lib/jvm/java-6-openjdk/
total 20
lrwxrwxrwx 1 root root 41 2011-08-17 22:43 docs -> ../../../share/doc/openjdk-6-jre-headless
drwxr-xr-x 4 root root 4096 2011-08-17 22:43 man
drwxr-xr-x 5 root root 4096 2011-08-18 22:39 jre
lrwxrwxrwx 1 root root 22 2011-08-24 18:31 THIRD_PARTY_README -> jre/THIRD_PARTY_README
lrwxrwxrwx 1 root root 22 2011-08-24 18:31 ASSEMBLY_EXCEPTION -> jre/ASSEMBLY_EXCEPTION
drwxr-xr-x 2 root root 4096 2011-08-24 18:31 bin
drwxr-xr-x 2 root root 4096 2011-08-24 18:31 lib
drwxr-xr-x 3 root root 4096 2011-08-24 18:31 include
and output of java-version
sagar@sagar:~$ java -version
java version "1.6.0_22"
OpenJDK Runtime Environment (IcedTea6 1.10.2) (6b22-1.10.2-0ubuntu1~11.04.1)
OpenJDK Server VM (build 20.0-b11, mixed mode)
now tell me wht can i do..???
thnk u..
try to "source .profile" and send me the output of "echo $JAVA_HOME"
hello sir check my source.profile
ReplyDelete# ~/.profile: executed by the command interpreter for login shells.
# This file is not read by bash(1), if ~/.bash_profile or ~/.bash_login
# exists.
# see /usr/share/doc/bash/examples/startup-files for examples.
# the files are located in the bash-doc package.
# the default umask is set in /etc/profile; for setting the umask
# for ssh logins, install and configure the libpam-umask package.
#umask 022
# if running bash
if [ -n "$BASH_VERSION" ]; then
# include .bashrc if it exists
if [ -f "$HOME/.bashrc" ]; then
. "$HOME/.bashrc"
fi
fi
# set PATH so it includes user's private bin if it exists
if [ -d "$HOME/bin" ] ; then
PATH="$HOME/bin:$PATH"
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk/
export HADOOP_HOME=/usr/local/hadoop-0.20.2
export HADOOP_CONF_DIR=/usr/local/hadoop-0.20.2/conf
export MAHOUT_HOME=/usr/local/mahout-0.4/
export MAHOUT_VERSION=0.4-SNAPSHOT
export MAVEN_OPTS=-Xmx1024m
fi
and out put of echo $JAVA_HOME is blank....
then i m set it using export command then got usr/lib/jvm/open-6-jdk/
Hi danny,
Thanks for great posting on your blog for hadoop/mahout on Linux based Operating System.
I want to know that which is best operating system for hadoop/mahout?
and how to install that on Windows xp...
--
THANKS
VIGNESH PRAJAPATI
Answer: I advise using Linux if you consider running Hadoop on a cluster. Since Hadoop is written in Java you can potentially run it on Windows as well. One option is to install Sun's virtual box: http://www.virtualbox.org/ and have Linux run on your windows system.
Best,
DB
Hi Danny, great article, I followed it completely and successfully generated a custom ami. But When I try to launch the hadoop cluster using this, I keep getting this message over and over again:
hadoop-0.20.203.0/src/contrib/ec2/bin/launch-hadoop-master: line 110: [: too many arguments
I would really appreciate your help.
Thanks,
Ehtsham
Hi!
It seems that you installed a more recent version of Hadoop? My instructions were tested with hadoop-0.20.2.
Can you retry with the version above and let me know if this works?
Best,
Danny
Hi Danny, I tried with the 0.20.2 version as well by installing it on the EC2 instance and on my computer as well and the problem persits :( I would like to know that if I am following the right procedure for launching the hadoop cluster. After completing your tutorial, I launch hadoop cluster by following the steps under the heading Getting started in the original Mahout article: "mahout on ec2" (on which you have based your tutorial") thanks, Ehtsham
Hi!
First of all, I just noticed that the syntax highlighter I was using is no longer supported in the new blogger template, so some <, > were missing in the hadoop conf file. I hope this did not confuse you; it is now fixed.
After downloading hadoop and setting the conf files (section 4) try to execute section 8 and tell me if this works.
Hi Danny, I correctly entered the configurations in the configuration files in the $HADOOP_HOME/conf directory. I tried installing with hadoop-0.20.2 as well but the problem persists. Eventually i launched a public AMI which already has hadoop on it and then installed mahout and hive( i use both extensively) on it and created an image out of it and then launched cluster from that, and it worked :),
Thanks for the article, I am following your other tutorials as well.
Ehtsham
Danny,
Thanks a lot for sharing your experiences. I followed these steps to set up a single instance and ran into the same issue Carlos described previously. When I run on the server:
sudo $HADOOP_HOME/bin/start-all.sh
I get:
namenode running as process 14028. Stop it first.
localhost: Permission denied (publickey).
localhost: Permission denied (publickey).
jobtracker running as process 12087. Stop it first.
localhost: Permission denied (publickey).
I'm thinking this occurs when hadoop talks to itself over localhost:port. Do you think I need to provide one of the keys for EC2?
Note: I'm able to run this with no problems (or password):
ssh localhost
Any ideas?
Thanks!!!!
It seems you did not kill properly the previous running instance of hadoop. You should kill it first using the stop-all.sh command.
To generate key-based authentication that can be used so that ssh doesn’t require the use of a password every time it’s invoked, I typed the following
$ ssh-keygen –t dsa –P ‘’ –f ~/.ssh/id_dsa
It gave me this
Too many arguments.
usage: ssh-keygen [options]
Options:
-A Generate non-existent host keys for all key types.
-a trials Number of trials for screening DH-GEX moduli.
-B Show bubblebabble digest of key file.
-b bits Number of bits in the key to create.
-C comment Provide new comment.
-c Change comment in private and public key files.
-D pkcs11 Download public key from pkcs11 token.
-e Export OpenSSH to foreign format key file.
-F hostname Find hostname in known hosts file.
-f filename Filename of the key file.
-G file Generate candidates for DH-GEX moduli.
-g Use generic DNS resource record format.
-H Hash names in known_hosts file.
-h Generate host certificate instead of a user certificate.
-I key_id Key identifier to include in certificate.
-i Import foreign format to OpenSSH key file.
-K checkpt Write checkpoints to this file.
-L Print the contents of a certificate.
-l Show fingerprint of key file.
-M memory Amount of memory (MB) to use for generating DH-GEX moduli.
-m key_fmt Conversion format for -e/-i (PEM|PKCS8|RFC4716).
-N phrase Provide new passphrase.
-n name,... User/host principal names to include in certificate
-O option Specify a certificate option.
-P phrase Provide old passphrase.
-p Change passphrase of private key file.
-q Quiet.
-R hostname Remove host from known_hosts file.
-r hostname Print DNS resource record.
-S start Start point (hex) for generating DH-GEX moduli.
-s ca_key Certify keys with CA key.
-T file Screen candidates for DH-GEX moduli.
-t type Specify type of key to create.
-V from:to Specify certificate validity interval.
-v Verbose.
-W gen Generator to use for generating DH-GEX moduli.
-y Read private key file and print public key.
-z serial Specify a serial number.
What should i do
You should copy and paste my command - namely use single quote ' and not back quote ` - see explanation here http://linuxreviews.org/beginner/Bash-Scripting-Introduction-HOWTO/en/x303.html
Hi Danny,
ReplyDeleteI'm stuck with step 6.. cant get rid of the fatal errors that I've encountered. The mirror under Note b) doesnt seem to be working.
Please advise.
Thank you
Try http://www.jarvana.com/jarvana/browse/org/apache/mahout/mahout-distribution/0.4/
But please note there is a newer version of mahout now so my instructions are a bit out of date.
Hi,
Thank you for your reply. I will try the link and play around with the settings,