Monday, January 24, 2011

Mahout/Hadoop on Amazon EC2 - part 1 - Installation

This post explains how to install the Mahout ML framework on top of Amazon EC2 (an Ubuntu-based machine).
The notes are based on the older Mahout notes at https://cwiki.apache.org/MAHOUT/mahout-on-amazon-ec2.html, which are unfortunately outdated.

The next post in this series (part 2) explains how to run two Mahout applications:
logistic regression and alternating least squares.

Note: part 5 of this series explains how to perform the same installation on top of
an EC2 high-compute node (a CentOS/Red Hat machine). Unfortunately, several steps
are different..

Part 6 of this series explains how to fine-tune performance on a large cluster.

The full procedure should take around 2-3 hours.. :-(

To confuse users, Amazon has 5 types of IDs:
- Your email and password for getting into the AWS console
- Your AWS string name and private key string
- Your public/private key pair
- Your X.509 certificate (another private/public key pair)
- Your Amazon ID (a 12-digit number), which is very hard to find on their website
Make sure you have all your IDs ready; if you have not done so yet, generate the keys using the AWS console.
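If you prefer the command line to the web console, the ec2-api-tools (installed later in step 10) can also generate the SSH key pair. A minimal sketch, assuming a hypothetical key pair name mahout-keypair:

ec2-add-keypair mahout-keypair    // prints a KEYPAIR summary line followed by the private key;
                                  // copy the BEGIN...END RSA PRIVATE KEY block into mahout-keypair.pem
chmod 600 mahout-keypair.pem      // after saving the key block there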

1) Select and launch instance ami-08f40561 from the Amazon AWS console. Alternatively, you can select any other Ubuntu-based 64-bit image.
TIP: It is recommended to use an EBS-backed image, since saving your work at the end will be much easier.
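The same launch can be done from the command line once the ec2-api-tools are installed. A sketch with assumed values (the key pair name and instance type are hypothetical; on Canonical images the login user is typically ubuntu):

ec2-run-instances ami-08f40561 -t m1.large -k mahout-keypair
ec2-describe-instances                          // note the instance's public DNS name
ssh -i mahout-keypair.pem ubuntu@<public DNS name>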

2) Verify Java is installed correctly - some libs are missing in the AMI:
sudo apt-get install openjdk-6-jdk
sudo apt-get install openjdk-6-jre-headless
sudo apt-get install openjdk-6-jre-lib
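
A quick check that the JDK landed where the configuration in steps 4 and 5 expects it:

java -version                      // should report OpenJDK 1.6
ls /usr/lib/jvm/java-6-openjdk/    // the directory JAVA_HOME will point to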

3) In the root home directory, run:
# sudo apt-get update
# sudo apt-get upgrade
# sudo apt-get install python-setuptools
# sudo easy_install "simplejson==2.0.9"
# sudo easy_install "boto==1.8d"
# sudo apt-get install ant
# sudo apt-get install subversion
# sudo apt-get install maven2

4) Download and unpack Hadoop:
# wget http://apache.cyberuse.com//hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz 
# tar vxzf hadoop-0.20.2.tar.gz
# sudo  mv hadoop-0.20.2 /usr/local/

A comment: I once managed to install 0.21.0, but after the EC2 node was killed and restarted,
Mahout refused to work any more, so I reverted to Hadoop 0.20.2.
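
To sanity-check the unpacked tree before touching any configuration files, Hadoop can report its own version:

# /usr/local/hadoop-0.20.2/bin/hadoop version    // should print Hadoop 0.20.2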

Add the following to $HADOOP_HOME/conf/hadoop-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk/
# The maximum amount of heap to use, in MB. Default is 1000
export HADOOP_HEAPSIZE=2000

Add the following to $HADOOP_HOME/conf/core-site.xml and also to $HADOOP_HOME/conf/mapred-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/tmp/</value>
  </property>
</configuration>
  
Edit the file $HADOOP_HOME/conf/hdfs-site.xml:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/tmp/</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/mnt/tmp2/</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/mnt/tmp3/</value>
  </property>
</configuration>
 

Note: the directories point into /mnt because regular Amazon EC2 instances have about 400GB of free space there (vs. only 10GB of free space on the root partition). You may
need to change the permissions of /mnt so this file system is writable by Hadoop.
So execute the following command:
sudo chmod 777 /mnt
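
Alternatively, if you prefer not to open up all of /mnt, a sketch that pre-creates just the directories named in the configuration files above and hands them to the user running Hadoop:

sudo mkdir -p /mnt/tmp /mnt/tmp2 /mnt/tmp3
sudo chown -R $(whoami) /mnt/tmp /mnt/tmp2 /mnt/tmp3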


Set up authorized keys for localhost login without passwords (the name node itself is formatted later, in step 8):
# ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
# cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
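
Before moving on, it is worth confirming that passwordless ssh to localhost actually works, since the Hadoop start scripts depend on it (several commenters below hit "Permission denied (publickey)" at this point). Note that ssh may refuse an authorized_keys file with loose permissions:

# chmod 600 ~/.ssh/authorized_keys
# ssh -o StrictHostKeyChecking=no localhost 'echo passwordless ssh works'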


5) Add the following to your .profile:
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
export HADOOP_HOME=/usr/local/hadoop-0.20.2
export HADOOP_CONF_DIR=/usr/local/hadoop-0.20.2/conf
export MAHOUT_HOME=/usr/local/mahout-0.4/
export MAHOUT_VERSION=0.4-SNAPSHOT
export MAVEN_OPTS=-Xmx1024m
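
To verify the variables took effect in the current shell (the same check I ask readers for in the comments below):

source ~/.profile
echo $JAVA_HOME      // expect /usr/lib/jvm/java-6-openjdk
echo $HADOOP_HOME    // expect /usr/local/hadoop-0.20.2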





  6) Check out and build Mahout from trunk. Verify that the paths in .profile point to the exact version you downloaded.

    svn co http://svn.apache.org/repos/asf/mahout/trunk mahout
    cd mahout
    mvn clean install
    cd ..
    sudo mv mahout /usr/local/mahout-0.4
    

    Note: I get a lot of questions about the mvn compilation (see the sketch after this note).
    a) On Windows-based machines, it seems that running a Linux VM makes some
    of the tests fail. Try to compile with the flag -DskipTests.
    b) If compilation fails, you can try to download compiled jars from
    http://mirror.its.uidaho.edu/pub/apache//mahout/0.4/ (the compiled jars are
    in the files without "src" in the filename). Just open the tgz and place it
    in /usr/local/mahout-0.4/ instead of running the compilation step above.
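
    A sketch of both workarounds (the distribution file name is assumed from the 0.4 release layout and may differ on the mirror):

    // (a) build without running the unit tests
    mvn clean install -DskipTests

    // (b) or install a prebuilt distribution instead of compiling
    wget http://mirror.its.uidaho.edu/pub/apache//mahout/0.4/mahout-distribution-0.4.tar.gz
    tar xzf mahout-distribution-0.4.tar.gz
    sudo mv mahout-distribution-0.4 /usr/local/mahout-0.4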


    7) Install other required tools (optional: in the Amazon EC2 image I am using,
    those libraries are preinstalled).
    sudo apt-get install wget alien ruby libopenssl-ruby1.8 rsync curl
    

    8) Run Hadoop, just to prove you can, and test Mahout by building the Reuters dataset on it. Finally, delete the files and shut it down.

    $HADOOP_HOME/bin/hadoop namenode -format
    $HADOOP_HOME/bin/start-all.sh
    jps     // you should see all 5 Hadoop processes (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker)
    cd $MAHOUT_HOME
    ./examples/bin/build-reuters.sh
    $HADOOP_HOME/bin/stop-all.sh
    rm -rf /tmp/*   // delete the Hadoop files
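
    If jps shows fewer than five processes, the logs under $HADOOP_HOME/logs usually explain why. Before running stop-all.sh you can also smoke-test HDFS directly (the /smoketest directory here is just a hypothetical name):

    $HADOOP_HOME/bin/hadoop fs -mkdir /smoketest
    $HADOOP_HOME/bin/hadoop fs -ls /                 // the new directory should be listed
    tail $HADOOP_HOME/logs/hadoop-*-namenode-*.log   // check here if a daemon is missing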







  Remove the single-host settings you added to $HADOOP_HOME/conf/core-site.xml and $HADOOP_HOME/conf/mapred-site.xml in step 4 (once you move beyond a single machine) and verify you are happy with the other conf file settings. The Hadoop startup scripts will not make any changes to them. In particular, upping the Java heap size is required for many of the Mahout jobs.
    // edit $HADOOP_HOME/conf/mapred-site.xml to include the following:
    <property>
       <name>mapred.child.java.opts</name>
       <value>-Xmx2000m</value>
    </property>


    9) Allow Hadoop to run even if you will later work from a different EC2 machine:
    echo "NoHostAuthenticationForLocalhost yes" >>~/.ssh/config
    


    If everything went well, you may want to bundle the result into an AMI image, so next time you will not need to install everything from scratch:
    10) Install the Amazon AMI tools
    a) Edit the file /etc/apt/sources.list
    and uncomment all the lines containing multiverse (note: you need to run the editor as root! - see the sed one-liner after this step)
    b) Update the repositories:
    sudo apt-get update
    c) Install the AMI and API tools:
    sudo apt-get install ec2-ami-tools ec2-api-tools
    
    Thanks Kevin for this fix!
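
    A minimal sketch of step (a) without opening an editor, assuming the stock Ubuntu sources.list layout where the multiverse lines are commented out with a leading #:

    sudo sed -i '/multiverse/ s/^# *//' /etc/apt/sources.list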

    11) In order to save your work, you need to bundle and save the image.
    There are two alternatives here. If you started an EBS-backed image, you can simply use the Amazon AWS user interface: right-click on the running instance and select "save instance".
    If the image is not EBS-backed, you will need to do it manually:

    - Note: you need to use the private key of the X.509 certificate and not the private key of the public/private key pair!

    [Each of the following commands should be typed on a single shell line.]

    First you need to create a bucket named mahoutbucket using the Amazon AWS console,
    under the S3 tab.

    sudo ec2-bundle-vol -k /mnt/pk-<your private X.509 key>.pem -c /mnt/cert-<your public X.509 certificate>.pem -u <your Amazon ID (12 digit number)> -d /mnt -p mahout
    sudo ec2-upload-bundle -b mahoutbucket -m /mnt/mahout.manifest.xml -a <your AWS access key string> -s <your AWS secret key string>
    sudo ec2-register -K /mnt/pk-<your private X.509 key>.pem -C /mnt/cert-<your public X.509 certificate>.pem --name mahoutbucket/mahout mahoutbucket/mahout.manifest.xml
    
    If you are lucky, you will get a result of the form: IMAGE   ami-XXXXXXX
    where XXXXXXX is the generated image number.
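
    For shape only, here is the same sequence with hypothetical file names, a made-up 12-digit account ID, and Amazon's documented placeholder credentials:

    sudo ec2-bundle-vol -k /mnt/pk-HKZYKTAIG2ECMXYIBH3HXV4ZBZQ55CLO.pem -c /mnt/cert-HKZYKTAIG2ECMXYIBH3HXV4ZBZQ55CLO.pem -u 123456789012 -d /mnt -p mahout
    sudo ec2-upload-bundle -b mahoutbucket -m /mnt/mahout.manifest.xml -a AKIAIOSFODNN7EXAMPLE -s wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
    sudo ec2-register -K /mnt/pk-HKZYKTAIG2ECMXYIBH3HXV4ZBZQ55CLO.pem -C /mnt/cert-HKZYKTAIG2ECMXYIBH3HXV4ZBZQ55CLO.pem --name mahoutbucket/mahout mahoutbucket/mahout.manifest.xml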

    More detailed explanations of this procedure, along with many potential pitfalls, are found
    in my blog post here.
    Thanks to Kevin and Selwyn!
  37 comments:

    1. Wow.... close but no cigar... your instructions do not work. You might have gotten everything to work, but following verbatim... you were not very explicit. Please correct or remove the post so as to avoid wasting others' time.

    2. Please be more specific. What step did not work for you?

    3. Hi Danny,

      Thanks for the guidelines - they've helped enormously in getting me going in the right direction.

      I noticed there are a few things that could be clearer or slightly easier than given in your instructions (which may be what David above might be referring to):
      1 - ec2-ami-tools and ec2-api-tools can be installed via apt (i.e. sudo apt-get install ec2-ami-tools ec2-api-tools), which means you can directly call ec2-bundle-vol, ec2-upload-bundle, and ec2-register, rather than the funky shell calls you had in step 10 of your instructions (those didn't work for me).

      2 - there is a typo in line 4 of step 4: mahout is misspelled as mahoot

      Aside from that things work great. I'm looking forward to working through the rest of your series on Mahout and Hadoop.

    4. Hi Selwyn!
      Thanks a lot for your feedback! I have fixed the post according to your suggestions.

      I will be glad to get further feedback on the other posts. The reason I wrote those instructions is that I am now in the process of comparing Mahout to our parallel ML framework called GraphLab: http://www.graphlab.ml.cmu.edu/

      GraphLab is significantly faster when implementing iterative algorithms. I will also be glad to learn what you are working on.

      Best,

      DB

    5. Danny,

      Thanks for the guide, very helpful. Everything worked great for me up to step 8 minus minor typos (drop the 'e' in /user/ for the mahout home env var in step 4 and in 8 there should be an 's' on the end of 'ec2-ami-tool').

      In step 8 it might help people out to let them know they'll need to have multiverse enabled to install the api tools. When I reached this step I used this documentation here: https://help.ubuntu.com/community/EC2StartersGuide

      For step 9 I couldn't get those commands working so I switched over to the Amazon's documentation here which got me to the same end point: http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/creating-snapshot-s3-linux.html

      Again, very helpful article, I hope they update the mahout documentation from it as I wasted some time trying to figure out why that didn't work. I look forward to working through part 2 tomorrow!

      -Kevin

    6. Hi Kevin!
      It is a pleasure to learn that the instructions were useful. I have fixed everything based on your comments. Would you mind sharing what you are working on?

    7. I'm comparing how some in-house LDA stuff will perform in Hadoop versus custom distribution using JBoss clustering. Since it's highly iterative I don't think map-reduce is the "right" framework for it but it'll be interesting to see the trade-off as far as performance vs ease of setup and ability to handle larger data sets.

    8. Danny,

      Thanks for posting this. When I fire up this AMI it only gives me the option to run it on a Large instance (or greater). Same with the AMI on the Mahout wiki page. Are there any public AMIs I can run on a small instance that will be suitable for mahout?

      Also, what program are you using to run the commands in those screenshots? I'm on a windows machine and have been using PuTTy for my instances so far which is a bit clumsy.

      Thanks in advance!

      Mike

    9. Hi Mike!
      Any Linux will do. Note that you can compile using my instructions on a large instance, and then simply zip and copy the whole mahout directory onto another machine. Since it is written in Java, you can run it on a small instance as well. You can also compile on a small instance, but depending on the OS there may be small variations in syntax.

      About your second question - I would recommend installing cygwin, and then running the cygwin program Xwin Server.

      Good luck!

    10. Thanks for the detailed post.

      Just a complement: if you don't want/need to fine-tune the hadoop node parameters, it is probably much easier and faster to use Apache Whirr, which helps you set up hadoop / hbase / zookeeper / cassandra clusters on both EC2 and rackspacecloud using a single ~10-line configuration file.

      http://incubator.apache.org/whirr/quick-start-guide.html

    11. Danny,

      I noticed you posted updates for exactly this, but running the chmod 777 command on the /mnt folder solved a lot of my problems on EC2.

      I have not yet tried to compile mahout from the start using the -DskipTests option, though this might work. I did have the same test failure issue that you mention above, though that was on an Ubuntu EC2 instance accessed via PuTTY on a Windows machine. I'll let you know how the -DskipTests option goes. Thanks again for posting and updating this!

      Mike

    12. Hi Danny:
      Thanks for the detailed instructions. I got stuck in step 6. I think it has to do with the security keys. I get the following error:

      starting namenode, logging to /usr/local/hadoop-0.20.2/bin/../logs/hadoop-root-namenode-ip-10-2-34-166.out
      localhost: Permission denied (publickey).
      localhost: Permission denied (publickey).
      starting jobtracker, logging to /usr/local/hadoop-0.20.2/bin/../logs/hadoop-root-jobtracker-ip-10-2-34-166.out
      localhost: Permission denied (publickey).

      Any suggestions?

      Thanks in advance

      \Carlos

    13. Hi Carlos.
      This looks like an ssh problem. My step 6 instructs how to start Hadoop on a single machine. It seems that you are trying to start it on multiple machines. See my post: http://bickson.blogspot.com/2011/02/mahout-on-amazon-ec2-part-4-running-on.html Try to manually ssh from the machine where you run the start-dfs.sh script into the namenode and jobtracker machines. If you fail, repeat all steps involving key creation and propagation.

      Good luck! DB

    14. Hi Danny,

      I am new to the ubuntu terminal.

      When you say "add the following to $HADOOP_HOME/conf/hadoop-env.sh", how exactly do you go about this?

      Same question for:
      "add the following to $HADOOP_HOME/conf/core-site.xml and also $HADOOP_HOME/conf/mapred-site.xml"

      Thanks!

      Corey

    15. Hi Corey,
      you should use one of the available editors, like vi or nano.
      For example: nano $HADOOP_HOME/conf/core-site.xml
      Then edit the file and save.

    16. Thanks for the reply! Works fine.

      hello.. I got an error when I start all the nodes using the start-all.sh command... and the error is JAVA_HOME is not set. Then I set it through the terminal: export JAVA_HOME=/usr/lib/jvm/java-6-openjdk/

      so please guide me...

      thanks

      Sagar

    18. Hi
      Please send me the full error. Check your hadoop-env.sh to verify it points to the correct Java. Check your .profile settings as well. Also send me the output of "ls -lrt /usr/lib/jvm/java-6-openjdk/" and of "java -version".

      - Danny

    19. Yes sir, I checked both my hadoop-env.sh and .profile files, where JAVA_HOME is set as above...

      and here is the full error from running the command:
      sagar@sagar:/usr/local/hadoop-0.20.2$ bin/start-all.sh
      starting namenode, logging to /usr/local/hadoop-0.20.2/bin/../logs/hadoop-sagar-namenode-sagar.out
      localhost: starting datanode, logging to /usr/local/hadoop-0.20.2/bin/../logs/hadoop-sagar-datanode-sagar.out
      localhost: Error: JAVA_HOME is not set.
      localhost: starting secondarynamenode, logging to /usr/local/hadoop-0.20.2/bin/../logs/hadoop-sagar-secondarynamenode-sagar.out
      localhost: Error: JAVA_HOME is not set.
      starting jobtracker, logging to /usr/local/hadoop-0.20.2/bin/../logs/hadoop-sagar-jobtracker-sagar.out
      localhost: starting tasktracker, logging to /usr/local/hadoop-0.20.2/bin/../logs/hadoop-sagar-tasktracker-sagar.out
      localhost: Error: JAVA_HOME is not set.


      the output of ls -lrt /usr/lib/jvm/java-6-openjdk/

      sagar@sagar:/usr/local/hadoop-0.20.2$ ls -lrt /usr/lib/jvm/java-6-openjdk/
      total 20
      lrwxrwxrwx 1 root root 41 2011-08-17 22:43 docs -> ../../../share/doc/openjdk-6-jre-headless
      drwxr-xr-x 4 root root 4096 2011-08-17 22:43 man
      drwxr-xr-x 5 root root 4096 2011-08-18 22:39 jre
      lrwxrwxrwx 1 root root 22 2011-08-24 18:31 THIRD_PARTY_README -> jre/THIRD_PARTY_README
      lrwxrwxrwx 1 root root 22 2011-08-24 18:31 ASSEMBLY_EXCEPTION -> jre/ASSEMBLY_EXCEPTION
      drwxr-xr-x 2 root root 4096 2011-08-24 18:31 bin
      drwxr-xr-x 2 root root 4096 2011-08-24 18:31 lib
      drwxr-xr-x 3 root root 4096 2011-08-24 18:31 include


      and the output of java -version:

      sagar@sagar:~$ java -version
      java version "1.6.0_22"
      OpenJDK Runtime Environment (IcedTea6 1.10.2) (6b22-1.10.2-0ubuntu1~11.04.1)
      OpenJDK Server VM (build 20.0-b11, mixed mode)


      now tell me, what can I do?


      thank you..

    20. try to "source .profile" and send me the output of "echo $JAVA_HOME"

    21. This comment has been removed by the author.

    22. hello sir, here is my sourced .profile:


      # ~/.profile: executed by the command interpreter for login shells.
      # This file is not read by bash(1), if ~/.bash_profile or ~/.bash_login
      # exists.
      # see /usr/share/doc/bash/examples/startup-files for examples.
      # the files are located in the bash-doc package.

      # the default umask is set in /etc/profile; for setting the umask
      # for ssh logins, install and configure the libpam-umask package.
      #umask 022

      # if running bash


      if [ -n "$BASH_VERSION" ]; then
          # include .bashrc if it exists
          if [ -f "$HOME/.bashrc" ]; then
              . "$HOME/.bashrc"
          fi
      fi

      # set PATH so it includes user's private bin if it exists
      if [ -d "$HOME/bin" ] ; then
          PATH="$HOME/bin:$PATH"
          export JAVA_HOME=/usr/lib/jvm/java-6-openjdk/
          export HADOOP_HOME=/usr/local/hadoop-0.20.2
          export HADOOP_CONF_DIR=/usr/local/hadoop-0.20.2/conf
          export MAHOUT_HOME=/usr/local/mahout-0.4/
          export MAHOUT_VERSION=0.4-SNAPSHOT
          export MAVEN_OPTS=-Xmx1024m
      fi





      and the output of echo $JAVA_HOME is blank....

      then I set it using the export command and then got usr/lib/jvm/open-6-jdk/

    23. Hi danny,

      Thanks for the great posts on your blog about hadoop/mahout on Linux-based operating systems.
      I want to know which operating system is best for hadoop/mahout,
      and how to install it on Windows XP...

      --
      THANKS

      VIGNESH PRAJAPATI

      Answer: I advise using Linux if you consider running Hadoop on a cluster. Since Hadoop is written in Java, you can potentially run it on Windows as well. One option is to install VirtualBox: http://www.virtualbox.org/ and run Linux on your Windows system.

      Best,

      DB

    24. Hi Danny, great article. I followed it completely and successfully generated a custom AMI. But when I try to launch the hadoop cluster using it, I keep getting this message over and over again:
      hadoop-0.20.203.0/src/contrib/ec2/bin/launch-hadoop-master: line 110: [: too many arguments

      I would really appreciate your help.
      Thanks,
      Ehtsham

    25. Hi!
      It seems that you installed a more recent version of Hadoop? My instructions were tested with hadoop-0.20.2.
      Can you retry with the version above and let me know if this works?

      Best,

      Danny

    26. Hi Danny, I tried with the 0.20.2 version as well, installing it on the EC2 instance and on my computer, and the problem persists :( I would like to know whether I am following the right procedure for launching the hadoop cluster. After completing your tutorial, I launch the hadoop cluster by following the steps under the heading Getting Started in the original Mahout article "mahout on ec2" (on which you have based your tutorial). Thanks, Ehtsham

    27. Hi!
      First of all, I just noticed that the syntax highlighter I was using is no longer supported in the new blogger template, so some <, > were missing in the hadoop conf files. I hope this did not confuse you; it is now fixed.
      After downloading hadoop and setting the conf files (section 4), try to execute section 8 and tell me if this works.

    28. Hi Danny, I correctly entered the configurations in the configuration files in the $HADOOP_HOME/conf directory. I tried installing with hadoop-0.20.2 as well, but the problem persists. Eventually I launched a public AMI which already had hadoop on it, installed mahout and hive (I use both extensively) on it, created an image out of it, and launched the cluster from that, and it worked :)
      Thanks for the article. I am following your other tutorials as well.
      Ehtsham

    29. Danny,
      Thanks a lot for sharing your experiences. I followed these steps to set up a single instance and ran into the same issue Carlos described previously. When I run on the server:

      sudo $HADOOP_HOME/bin/start-all.sh

      I get:

      namenode running as process 14028. Stop it first.
      localhost: Permission denied (publickey).
      localhost: Permission denied (publickey).
      jobtracker running as process 12087. Stop it first.
      localhost: Permission denied (publickey).

      I'm thinking this occurs when hadoop talks to itself over localhost:port. Do you think I need to provide one of the keys for EC2?

      Note: I'm able to run this with no problems (or password):
      ssh localhost

      Any ideas?

      Thanks!!!!

      Replies
        1. It seems you did not properly kill the previously running instance of hadoop. You should kill it first using the stop-all.sh command.

    30. To generate key-based authentication so that ssh doesn't require a password every time it's invoked, I typed the following:

      $ ssh-keygen –t dsa –P ‘’ –f ~/.ssh/id_dsa

      It gave me this
      Too many arguments.
      usage: ssh-keygen [options]
      Options:
      -A Generate non-existent host keys for all key types.
      -a trials Number of trials for screening DH-GEX moduli.
      -B Show bubblebabble digest of key file.
      -b bits Number of bits in the key to create.
      -C comment Provide new comment.
      -c Change comment in private and public key files.
      -D pkcs11 Download public key from pkcs11 token.
      -e Export OpenSSH to foreign format key file.
      -F hostname Find hostname in known hosts file.
      -f filename Filename of the key file.
      -G file Generate candidates for DH-GEX moduli.
      -g Use generic DNS resource record format.
      -H Hash names in known_hosts file.
      -h Generate host certificate instead of a user certificate.
      -I key_id Key identifier to include in certificate.
      -i Import foreign format to OpenSSH key file.
      -K checkpt Write checkpoints to this file.
      -L Print the contents of a certificate.
      -l Show fingerprint of key file.
      -M memory Amount of memory (MB) to use for generating DH-GEX moduli.
      -m key_fmt Conversion format for -e/-i (PEM|PKCS8|RFC4716).
      -N phrase Provide new passphrase.
      -n name,... User/host principal names to include in certificate
      -O option Specify a certificate option.
      -P phrase Provide old passphrase.
      -p Change passphrase of private key file.
      -q Quiet.
      -R hostname Remove host from known_hosts file.
      -r hostname Print DNS resource record.
      -S start Start point (hex) for generating DH-GEX moduli.
      -s ca_key Certify keys with CA key.
      -T file Screen candidates for DH-GEX moduli.
      -t type Specify type of key to create.
      -V from:to Specify certificate validity interval.
      -v Verbose.
      -W gen Generator to use for generating DH-GEX moduli.
      -y Read private key file and print public key.
      -z serial Specify a serial number.


      What should I do?

      Replies
        1. You should copy and paste my command - namely, use a single quote ' and not a back quote ` - see the explanation here: http://linuxreviews.org/beginner/Bash-Scripting-Introduction-HOWTO/en/x303.html

    31. Hi Danny,

      I'm stuck on step 6.. I can't get rid of the fatal errors that I've encountered. The mirror under Note b) doesn't seem to be working.

      Please advise.

      Thank you

      Replies
        1. Try http://www.jarvana.com/jarvana/browse/org/apache/mahout/mahout-distribution/0.4/
          But please note there is a newer version of mahout now, so my instructions are a bit out of date..

      2. Hi,

        Thank you for your reply. I will try the link and play around with the settings.

    32. This comment has been removed by the author.
