Large Scale Machine Learning and Other Animals: Hadoop on Amazon EC2 - Part 4

Tuesday, February 8, 2011

Hadoop on Amazon EC2 - Part 4 - Running on a cluster

1) Edit the file conf/hdfs-site.xml
Set the number of replicas to the number of nodes you plan to use (4 in this example).



 
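<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/tmp/</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/mnt/tmp2/</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/mnt/tmp3/</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>4</value>
    <description>Default block replication.
    The actual number of replications can be specified when the file is created.
    The default is used if replication is not specified in create time.
    </description>
  </property>
</configuration>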

2) Edit the file conf/slaves and list the DNS names of all of the machines you are going to use. For example:

 ec2-67-202-45-10.compute-1.amazonaws.com

 ec2-67-202-45-11.compute-1.amazonaws.com

 ec2-67-202-45-12.compute-1.amazonaws.com

 ec2-67-202-45-13.compute-1.amazonaws.com 
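Because step 1 sets dfs.replication to the number of nodes, it can help to count the hosts in the slaves file and compare the result with that value. The following is a minimal sketch, not part of the original setup; `count_slaves` is a hypothetical helper name.

```shell
# Hypothetical helper: count the hosts in a slaves file.
# A host is any line containing at least one non-whitespace character,
# so blank lines between entries are ignored.
# Usage: count_slaves SLAVES_FILE
count_slaves() {
  grep -c '[^[:space:]]' "$1"
}
```

For the four hosts listed above, `count_slaves conf/slaves` should print the same number that was used for dfs.replication in step 1.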

3) Edit the file conf/masters and enter the DNS name of the master node. For example:

 ec2-67-202-45-10.compute-1.amazonaws.com

Note that the master node can also appear in the slaves list.
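If the host list changes often, steps 2 and 3 can be scripted. This is only a sketch under stated assumptions: `write_host_files` is a hypothetical helper name, the first host argument is taken to be the master, and the master is also written to the slaves list, as the note above permits.

```shell
# Hypothetical helper: write the masters and slaves files from a host list.
# Usage: write_host_files CONF_DIR MASTER_HOST [SLAVE_HOST...]
# Assumption: the master also runs as a slave, so it is listed in both files.
write_host_files() {
  conf_dir=$1
  master=$2
  shift 2
  # The masters file holds a single line: the master's DNS name.
  printf '%s\n' "$master" > "$conf_dir/masters"
  # The slaves file holds one DNS name per line, master first.
  printf '%s\n' "$master" "$@" > "$conf_dir/slaves"
}
```

With the example hosts, the call would be `write_host_files conf` followed by the master's DNS name and then the three remaining machine names.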

4) Edit the file conf/core-site.xml to include the master name:


  
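<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://ec2-67-202-45-10.compute-1.amazonaws.com:9000</value>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>ec2-67-202-45-10.compute-1.amazonaws.com:9001</value>
  </property>

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/tmp/</value>
  </property>
</configuration>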
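Since the master's DNS name changes each time an EC2 instance is launched, the file from step 4 can be generated rather than edited by hand. The following is a minimal sketch: `core_site_xml` is a hypothetical helper name, and the ports 9000 and 9001 and the /mnt/tmp/ directory are simply the values used in this post.

```shell
# Hypothetical helper: print a core-site.xml for a given master DNS name.
# Usage: core_site_xml MASTER_HOST > conf/core-site.xml
# Ports and the temp directory are the example values from step 4.
core_site_xml() {
  master=$1
  cat <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://$master:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>$master:9001</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/tmp/</value>
  </property>
</configuration>
EOF
}
```

Redirecting the output, as in `core_site_xml ec2-67-202-45-10.compute-1.amazonaws.com > conf/core-site.xml`, would overwrite the existing file, so keep a backup if it contains other settings.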

5) Edit the file conf/mapred-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://ec2-67-202-45-10.compute-1.amazonaws.com:9000</value>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>ec2-67-202-45-10.compute-1.amazonaws.com:9001</value>
  </property>

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/tmp/</value>
  </property>

  <property>
    <name>mapred.map.tasks</name>
    <value>10</value> <!-- about the number of cores -->
  </property>

  <property>
    <name>mapred.reduce.tasks</name>
    <value>10</value> <!-- about the number of cores -->
  </property>

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>12</value> <!-- slightly more than cores -->
  </property>

  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>12</value> <!-- slightly more than cores -->
  </property>
   
</configuration>
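The comments in the file above tie the task settings to the number of cores. As a rough sketch of that rule of thumb, the hypothetical helper `suggest_slots` below prints candidate values for a given core count. It assumes that "slightly more than cores" means cores plus two, which matches the example values 10 and 12 but is not a rule stated in this post.

```shell
# Hypothetical helper: print task settings derived from a core count.
# Assumption: "slightly more than cores" is taken as cores + 2 (10 -> 12).
# Usage: suggest_slots CORES
suggest_slots() {
  cores=$1
  max=$((cores + 2))
  echo "mapred.map.tasks=$cores"
  echo "mapred.reduce.tasks=$cores"
  echo "mapred.tasktracker.map.tasks.maximum=$max"
  echo "mapred.tasktracker.reduce.tasks.maximum=$max"
}
```

On a Linux machine with GNU coreutils, `suggest_slots "$(nproc)"` would use the local core count. Treat the output as a starting point to tune, not as recommended values.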

6) Log in to the master node. For each of the 3 slave machines, copy the DSA key from the master node:

ssh-copy-id -i ~/.ssh/id_dsa.pub ec2-67-202-45-11.compute-1.amazonaws.com
ssh-copy-id -i ~/.ssh/id_dsa.pub ec2-67-202-45-12.compute-1.amazonaws.com
ssh-copy-id -i ~/.ssh/id_dsa.pub ec2-67-202-45-13.compute-1.amazonaws.com
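With more than a few slaves, the commands above can be derived from the slaves file instead of typed one by one. This is a minimal sketch, assuming the slaves file from step 2 and a hypothetical helper named `key_copy_commands`. It only prints the commands, so they can be reviewed before anything is run.

```shell
# Hypothetical helper: print one ssh-copy-id command per slave in a slaves
# file. Blank lines are skipped, and so is the master, because step 6 copies
# the key only to the other machines.
# Usage: key_copy_commands SLAVES_FILE MASTER_HOST
key_copy_commands() {
  slaves_file=$1
  master=$2
  # The "|| [ -n ... ]" part also handles a last line without a newline.
  while read -r host || [ -n "$host" ]; do
    [ -z "$host" ] && continue
    [ "$host" = "$master" ] && continue
    echo "ssh-copy-id -i ~/.ssh/id_dsa.pub $host"
  done < "$slaves_file"
}
```

Once the printed commands look right, piping them to a shell, as in `key_copy_commands conf/slaves ec2-67-202-45-10.compute-1.amazonaws.com | sh`, would execute them in order.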

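7) To start Hadoop, run the following on the master machine. Format the namenode only the first time, because formatting erases any existing HDFS metadata: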

/usr/local/hadoop-0.20.2/bin/hadoop namenode -format
/usr/local/hadoop-0.20.2/bin/start-dfs.sh
/usr/local/hadoop-0.20.2/bin/start-mapred.sh

8) To stop Hadoop:

/usr/local/hadoop-0.20.2/bin/stop-mapred.sh
/usr/local/hadoop-0.20.2/bin/stop-dfs.sh
