Friday, February 25, 2011

Mahout on Amazon EC2 - part 5 - installing Hadoop/Mahout on high performance instance (CentOS/RedHat)

This post explains how to install the Mahout ML framework on top of Amazon EC2 (a CentOS/RedHat based machine).
The notes are based on the older Mahout wiki instructions: https://cwiki.apache.org/MAHOUT/mahout-on-amazon-ec2.html which are unfortunately outdated.

Note: part 1 of this series explains how to perform the same installation on top of an Ubuntu based machine.

The full procedure should take around 2-3 hours.. :-(

1) Start a high performance instance from the Amazon AWS console
CentOS AMI ID: ami-7ea24a17 (x86_64)
Name: Basic Cluster Instances HVM CentOS 5.4
Description: Minimal CentOS 5.4, 64-bit architecture, and HVM-based virtualization for use with Amazon EC2 Cluster Instances.

2) Log in to the instance (right-click on the running instance in the AWS console)

3) Install some required stuff
sudo yum update
sudo yum upgrade
sudo yum install python-setuptools
sudo easy_install "simplejson"

4) Install boto (unfortunately I was not able to install it using easy_install directly)
wget http://boto.googlecode.com/files/boto-1.8d.tar.gz
tar xvzf boto-1.8d.tar.gz
cd boto-1.8d
sudo easy_install .
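
To confirm the install worked, a quick sanity check (it should print nothing and exit without an error):
python -c "import boto"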

5) Install maven2 (unfortunately I was not able to install it using yum)
wget http://www.trieuvan.com/apache/maven/binaries/apache-maven-2.2.1-bin.tar.gz
tar xvzf apache-maven-2.2.1-bin.tar.gz
sudo cp -R apache-maven-2.2.1 /usr/local/
sudo ln -s /usr/local/apache-maven-2.2.1/bin/mvn /usr/local/bin/
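
To verify that Maven is now on the PATH and can find a JDK:
mvn -version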

6) Download and install Hadoop
wget http://apache.cyberuse.com//hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz   
tar vxzf hadoop-0.20.2.tar.gz  
sudo  mv hadoop-0.20.2 /usr/local/

add the following to $HADOOP_HOME/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/jre-openjdk/  
# The maximum amount of heap to use, in MB. Default is 1000  
export HADOOP_HEAPSIZE=2000  
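
If you prefer to append these lines from the shell instead of editing the file by hand, here is a minimal sketch (assuming Hadoop was moved to /usr/local/hadoop-0.20.2 as above):
HADOOP_HOME=/usr/local/hadoop-0.20.2
cat >> $HADOOP_HOME/conf/hadoop-env.sh <<'EOF'
export JAVA_HOME=/usr/lib/jvm/jre-openjdk/
# The maximum amount of heap to use, in MB. Default is 1000
export HADOOP_HEAPSIZE=2000
EOF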

add the following to $HADOOP_HOME/conf/core-site.xml and also $HADOOP_HOME/conf/mapred-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
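
Since the exact same block goes into both files, one way to avoid copy/paste mistakes is to write it once and copy it. A sketch (note: this overwrites the stock files, which on a fresh Hadoop download are effectively empty templates):
cd /usr/local/hadoop-0.20.2/conf
cat > core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
cp core-site.xml mapred-site.xml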


Edit the file $HADOOP_HOME/conf/hdfs-site.xml to contain:

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/data/tmp/</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/data/tmp2/</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/data/tmp3/</value>
  </property>
</configuration>

Note: the directory /home/data does not exist, and you will have to create it
when starting the instance using the commands:
# mkdir -p /home/data
# mount -t ext3 /dev/sdb /home/data/
The reason for this setup is that the root partition has only 10GB, while /dev/sdb
has 800GB.
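
Putting the whole sequence together, and creating the directories referenced in hdfs-site.xml up front, a sketch (the ephemeral device name can differ per instance type, so check cat /proc/partitions first; if the mount fails because the device has no filesystem, you may need to run mkfs.ext3 on it first, and the final df simply confirms the large volume is mounted where Hadoop expects it):
mkdir -p /home/data
mount -t ext3 /dev/sdb /home/data
mkdir -p /home/data/tmp /home/data/tmp2 /home/data/tmp3
df -h /home/data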

Set up authorized keys for password-less login to localhost (the namenode itself will be formatted in a later step):
# ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
# cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
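
To test the password-less login (the very first connection may still ask you to accept the host key; the NoHostAuthenticationForLocalhost setting added in a later step removes that prompt):
ssh localhost 'echo password-less ssh to localhost works'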

7) Check out and build Mahout from trunk. Alternatively, you can upload a Mahout release tarball and install it as we did with the Hadoop tarball (don't forget to update your .profile accordingly).

# svn co http://svn.apache.org/repos/asf/mahout/trunk mahout
# cd mahout
# mvn clean install
# cd ..
# sudo mv mahout /usr/local/mahout-0.4
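
As a quick check that the build succeeded, the examples module should contain a Hadoop job jar (the exact file name depends on the trunk snapshot you checked out, so the path below is an assumption based on the standard Maven layout):
ls /usr/local/mahout-0.4/examples/target/*.job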
    


8) Add the following to your .profile
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
export HADOOP_HOME=/usr/local/hadoop-0.20.2
export HADOOP_CONF_DIR=/usr/local/hadoop-0.20.2/conf
export MAHOUT_HOME=/usr/local/mahout-0.4/
export MAHOUT_VERSION=0.4-SNAPSHOT
export MAVEN_OPTS=-Xmx1024m
    

Verify that the paths in .profile point to the exact versions you downloaded. In particular, the JAVA_HOME above uses the Ubuntu-style path; on this CentOS image it should match the JVM path you used in hadoop-env.sh (e.g. /usr/lib/jvm/jre-openjdk/).
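
To apply the new environment in the current shell and sanity-check it (the last command should report Hadoop 0.20.2):
source ~/.profile
echo $JAVA_HOME $HADOOP_HOME $MAHOUT_HOME
$HADOOP_HOME/bin/hadoop version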

9) Run Hadoop, just to prove you can, and test Mahout by building the Reuters dataset on it. Finally, delete the files and shut it down.

# $HADOOP_HOME/bin/hadoop namenode -format
# $HADOOP_HOME/bin/start-all.sh
# jps     // you should see all 5 Hadoop processes (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker)
# cd $MAHOUT_HOME
# ./examples/bin/build-reuters.sh
# $HADOOP_HOME/bin/stop-all.sh
# rm -rf /tmp/*   // delete the Hadoop files
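
While the cluster is still up (i.e. before the stop-all.sh line above), a couple of optional checks that it is actually doing the work:
# $HADOOP_HOME/bin/hadoop dfsadmin -report    // you should see one live datanode
# $HADOOP_HOME/bin/hadoop fs -ls /            // look for the example's working directories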


10) Remove the single-host settings you added to $HADOOP_HOME/conf/core-site.xml and $HADOOP_HOME/conf/mapred-site.xml in step 6, and verify you are happy with the other conf file settings. The Hadoop startup scripts will not make any changes to them. In particular, upping the Java heap size is required for many of the Mahout jobs.

Edit $HADOOP_HOME/conf/mapred-site.xml to include the following:
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2000m</value>
</property>

11) Allow Hadoop's startup scripts to ssh to localhost without a host-key prompt, so that Hadoop can still start when this image is later launched on a different EC2 machine:
echo "NoHostAuthenticationForLocalhost yes" >> ~/.ssh/config
    

12) Now bundle the image.
In the Amazon AWS console, select the running instance, right-click and choose to bundle an EBS image. Enter an image name and description. The machine will then reboot and the image will be created.
9 comments:

    1. Any reason you don't use Amazon's Elastic MapReduce to do the setup?

    2. I am not that expert in Elastic MapReduce and I would like to learn more, but in my experiments I needed to be able to tune the number of machines, the number of cores, the type of machines, the amount of memory, etc., as well as other low level parameters like the number of reducers and mappers, which I don't think you have access to in Elastic MapReduce.

      - Danny

    3. There are so many better tools available now to build up your Hadoop cluster with automagic configuration than doing it all from scratch.

      For instance, I use Cloudera's Manager Express 3.7 and it auto-configures the best settings after detecting the cluster's size and hardware/software properties. CDH also gives me Mahout packages today that install and run in under two minutes. It's pretty much the best no-nonsense way to go at present.

      There is also Apache Whirr, especially designed for these kinds of setups.

    4. Hi,
      There are definitely good tools out there for auto configuration, and I am not aware of all of them. In some cases, however, you need to be able to modify settings yourself, for various reasons:
      1) In terms of performance, the number of mappers and reducers, the DFS block size, and the replication factor can make a huge difference, and a setting optimized for one application may not fit another application.
      2) Other settings, like the heap size, may be too small for your application, and in that case your application will simply not work. Another example is the various timeouts, which may fire prematurely and cause jobs to be rerun, resulting in very bad performance. See my performance tips here: http://bickson.blogspot.com/2011/03/tunning-hadoop-configuration-for-high.html

      Best,
      DB

    5. Danny,

      Right, but the tools I mentioned before are generic cluster install tools. They do not lock down the configuration, but they do provide a sensible set of defaults (more so than the native Apache distribution). Cloudera Manager, for instance, auto-decides your heap sizes and still lets you tweak each one individually via a UI (with validation). This is much better than hand-modifying your XML.

    6. In that case it sounds like a very useful tool. Thanks for sharing!

    7. Very helpful post! thx!

      While reading your post, the following question came to me:

      When installing Hadoop on a single PC that has multiple cores, does Hadoop utilize all the cores by default, or do you have to set up multiple data nodes (e.g. one per core)?

    8. Hi Vassilis,
      You can definitely use multiple cores on a single data node. Read my post here: http://bickson.blogspot.com/2011/03/tunning-hadoop-configuration-for-high.html
      You will need to set the number of mappers and number of reducers to conform to the number of available cores.

      Best,

      DB

    9. What is the command in CentOS 6 to set the path in .profile?
      Is it vi .bashrc or something else?
