Wednesday, February 9, 2011

Hadoop/Mahout - setting up a development environment

This post explains how to set up a development environment for Hadoop and Mahout.

Prerequisites: you need the Mahout and Hadoop sources (see previous posts).

On a development machine
1) Download the Helios version of Eclipse, e.g. eclipse-java-helios-SR1-linux-gtk-x86_64.tar.gz,
and save it locally. Extract the archive using:
 tar xvzf *.gz

2) Install the Hadoop Map/Reduce Eclipse plugin
cd eclipse/plugins/
wget https://issues.apache.org/jira/secure/attachment/12460491/hadoop-eclipse-plugin-0.20.3-SNAPSHOT.jar

3) Follow the directions at
http://m2eclipse.sonatype.org/installing-m2eclipse.html
to install the Maven plugin in Eclipse.

4) In Eclipse: File -> Import -> Maven project -> select the Mahout root directory -> Finish.
You will see a list of all subprojects. Press OK and wait for the build to finish.
If everything went smoothly, the project should compile.

5) Select the Map/Reduce view -> Map/Reduce Locations tab -> Edit Hadoop locations.
In the General tab, add a location name (just a name to identify this configuration), the host and
port of the Map/Reduce master (default port 50030 using the EC2 configuration described in previous posts), and the DFS master (default port 50070) -> Finish.
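
As a quick sanity check of the host and port values before saving the location, you can probe the two web UIs from the development machine. This is a minimal sketch; the hostname is a placeholder for your own EC2 master node.

# replace the hostname with your EC2 master node
curl -I http://ec2-67-202-45-10.compute-1.amazonaws.com:50030/   # JobTracker web UI
curl -I http://ec2-67-202-45-10.compute-1.amazonaws.com:50070/   # NameNode web UI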

CMU Pegasus on Hadoop

Pegasus is a peta-scale graph mining library.
This post explains how to install it on Amazon EC2.

1) Start with the Amazon EC2 image you created in http://bickson.blogspot.com/2011/01/how-to-install-mahout-on-amazon-ec2.html
2) Run Hadoop on a single node as explained in http://bickson.blogspot.com/2011/01/mahout-on-amazon-ec2-part-2-testing.html
3) Log in to the EC2 machine
4) wget http://www.cs.cmu.edu/%7Epegasus/PEGASUSH-2.0.tar.gz
5) tar xvzf PEGASUSH-2.0.tar.gz
6) cd PEGASUS
7) export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/hadoop-0.20.2/bin/
8)  sudo apt-get install gnuplot
9) ./pegasus.sh
PEGASUS> demo
put: Target pegasus/graphs/catstar/edge/catepillar_star.edge already exists
Graph catstar added.
rmr: cannot remove dd_node_deg: No such file or directory.
rmr: cannot remove dd_deg_count: No such file or directory.

-----===[PEGASUS: A Peta-Scale Graph Mining System]===-----

[PEGASUS] Computing degree distribution. Degree type = InOut

11/02/09 14:47:36 INFO mapred.FileInputFormat: Total input paths to process : 1
11/02/09 14:47:36 INFO mapred.JobClient: Running job: job_201102091432_0003
11/02/09 14:47:37 INFO mapred.JobClient:  map 0% reduce 0%
11/02/09 14:47:45 INFO mapred.JobClient:  map 18% reduce 0%
11/02/09 14:47:48 INFO mapred.JobClient:  map 36% reduce 0%
11/02/09 14:47:51 INFO mapred.JobClient:  map 54% reduce 0%
11/02/09 14:47:54 INFO mapred.JobClient:  map 72% reduce 18%
11/02/09 14:47:57 INFO mapred.JobClient:  map 90% reduce 18%
11/02/09 14:48:00 INFO mapred.JobClient:  map 100% reduce 18%
11/02/09 14:48:03 INFO mapred.JobClient:  map 100% reduce 24%
11/02/09 14:48:09 INFO mapred.JobClient:  map 100% reduce 100%
11/02/09 14:48:11 INFO mapred.JobClient: Job complete: job_201102091432_0003
11/02/09 14:48:11 INFO mapred.JobClient: Counters: 18
11/02/09 14:48:11 INFO mapred.JobClient:   Job Counters
11/02/09 14:48:11 INFO mapred.JobClient:     Launched reduce tasks=1
11/02/09 14:48:11 INFO mapred.JobClient:     Launched map tasks=11
11/02/09 14:48:11 INFO mapred.JobClient:     Data-local map tasks=11


An image named catstar_deg_inout.eps will be created.
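
To view the plot locally, you can copy it off the instance; a minimal sketch, assuming the demo was run from the PEGASUS directory on the instance and using the key file from the earlier EC2 posts (hostname is a placeholder):

scp -i graphlabkey.pem ubuntu@ec2-XXX-XXXXXXXX.compute-1.amazonaws.com:PEGASUS/catstar_deg_inout.eps .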

Tuesday, February 8, 2011

Hadoop on Amazon EC2 - Part 4 - Running on a cluster

1) Edit the file conf/hdfs-site.xml
Set the number of replicas to the number of nodes you plan to use; in this example, 4.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/tmp/</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/mnt/tmp2/</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/mnt/tmp3/</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>4</value>
    <description>Default block replication.
    The actual number of replications can be specified when the file is created.
    The default is used if replication is not specified in create time.</description>
  </property>
</configuration>



2) Edit the file conf/slaves and list the DNS names of all of the machines you are going to use. For example:
 ec2-67-202-45-10.compute-1.amazonaws.com
 ec2-67-202-45-11.compute-1.amazonaws.com
 ec2-67-202-45-12.compute-1.amazonaws.com
 ec2-67-202-45-13.compute-1.amazonaws.com 

3) Edit the file conf/masters and enter the DNS name of the master node. For example:
ec2-67-202-45-10.compute-1.amazonaws.com

Note that the master node can also appear in the slaves list.

4) Edit the file conf/core-site.xml to include the master name

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://ec2-67-202-45-10.compute-1.amazonaws.com:9000</value>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>ec2-67-202-45-10.compute-1.amazonaws.com:9001</value>
  </property>

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/tmp/</value>
  </property>
</configuration>

5) Edit the file conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://ec2-67-202-45-10.compute-1.amazonaws.com:9000</value>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>ec2-67-202-45-10.compute-1.amazonaws.com:9001</value>
  </property>

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/tmp/</value>
  </property>

  <property>
    <name>mapred.map.tasks</name>
    <value>10</value> <!-- about the number of cores -->
  </property>

  <property>
    <name>mapred.reduce.tasks</name>
    <value>10</value> <!-- about the number of cores -->
  </property>

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>12</value> <!-- slightly more than the number of cores -->
  </property>

  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>12</value> <!-- slightly more than the number of cores -->
  </property>
</configuration>
6) Log in to the master node. For each of the three slave machines, copy the master's DSA public key:
ssh-copy-id -i ~/.ssh/id_dsa.pub ec2-67-202-45-11.compute-1.amazonaws.com
ssh-copy-id -i ~/.ssh/id_dsa.pub ec2-67-202-45-12.compute-1.amazonaws.com
ssh-copy-id -i ~/.ssh/id_dsa.pub ec2-67-202-45-13.compute-1.amazonaws.com
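
With more slaves this step can be scripted; a minimal sketch, assuming conf/slaves already lists all slave hostnames and that the DSA key pair exists on the master:

# copy the master's public key to every host listed in conf/slaves
for host in $(cat /usr/local/hadoop-0.20.2/conf/slaves); do
  ssh-copy-id -i ~/.ssh/id_dsa.pub $host
done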

7) To start Hadoop, run on the master machine:
/usr/local/hadoop-0.20.2/bin/hadoop namenode -format
/usr/local/hadoop-0.20.2/bin/start-dfs.sh
/usr/local/hadoop-0.20.2/bin/start-mapred.sh
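
To verify that the daemons came up, you can list the running Java processes on the master and on a slave (jps ships with the JDK):

jps
# on the master you should typically see NameNode, SecondaryNameNode and JobTracker;
# on a slave, DataNode and TaskTracker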

8) To stop Hadoop
/usr/local/hadoop-0.20.2/bin/stop-mapred.sh
/usr/local/hadoop-0.20.2/bin/stop-dfs.sh

Friday, February 4, 2011

Mahout - SVD matrix factorization - reading output

Converting Mahout's SVD Distributed Matrix Factorization Solver Output Format into CSV format

Purpose
The code below shows how to convert a matrix from Mahout's SVD output format (a SequenceFile of IntWritable keys and VectorWritable values)
into CSV format.


This code is based on code by Danny Leshem, ContextIn.

Command line arguments:
args[0] - path to svd output file
args[1] - path to output csv file

Compilation:
Copy the Java code below into a file named SVD2CSV.java.
Add both the Mahout and Hadoop jars to the project path.


import java.io.BufferedWriter;
import java.io.FileWriter;
import java.util.Iterator;

import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;


public class SVD2CSV {
   
   
    public static int Cardinality;
   
    /**
     *
     * @param args[0] - path to the SVD output file (SequenceFile)
     * @param args[1] - path to the output CSV file
     */
    public static void main(String[] args){
   
        try {
            final Configuration conf = new Configuration();
            final FileSystem fs = FileSystem.get(conf);
            final SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
            BufferedWriter br = new BufferedWriter(new FileWriter(args[1]));
            IntWritable key = new IntWritable();
            VectorWritable vec = new VectorWritable();

            while (reader.next(key, vec)) {
                //System.out.println("key " + key);
                SequentialAccessSparseVector vect = (SequentialAccessSparseVector) vec.get();
                System.out.println("key " + key + " value: " + vect);
                Iterator<Vector.Element> iter = vect.iterateNonZero();

                while (iter.hasNext()) {
                    Vector.Element element = iter.next();
                    br.write(key + "," + element.index() + "," + vect.getQuick(element.index()) + "\n");
                }
            }

            reader.close();
            br.close();

        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
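
One possible way to compile and run it from the command line, mirroring the classpath used for Convert2SVD further below; the jar locations and file names here are placeholders for your own setup:

# placeholder paths -- point these at your Mahout and Hadoop jars
CP=/path/to/mahout-core-0.5-SNAPSHOT.jar:/path/to/mahout-math-0.5-SNAPSHOT.jar:/path/to/hadoop-0.20.2-core.jar:/path/to/commons-cli-1.2.jar:/path/to/commons-logging-1.0.4.jar
javac -cp $CP SVD2CSV.java
java -cp .:$CP SVD2CSV /path/to/svd/output/file output.csv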


When parsing the output, please see here for a discussion regarding the validity of the computed results.

Further reading: Yahoo! KDD Cup 2011 - large scale matrix factorization.

Mahout - SVD matrix factorization - formatting input matrix

Converting Input Format into Mahout's SVD Distributed Matrix Factorization Solver

Purpose
The code below converts a matrix from CSV format:
<from row>,<to col>,<value>\n
into Mahout's SVD solver input format.


For example, 
The 3x3 matrix:
0    1.0 2.1
3.0  4.0 5.0
-5.0 6.2 0


Will be given as input in a csv file as:
1,0,3.0
2,0,-5.0
0,1,1.0
1,1,4.0
2,1,6.2
0,2,2.1
1,2,5.0

NOTE: I ASSUME THE MATRIX IS SORTED BY COLUMN ORDER (THE SECOND FIELD)
This code is based on code by Danny Leshem, ContextIn.

Command line arguments:
args[0] - path to csv input file
args[1] - cardinality of the matrix (number of columns)
args[2] - path to the resulting Mahout SVD input file

Method:
The code below goes over the CSV file and, for each matrix column, creates a SequentialAccessSparseVector containing all the non-zero row entries of that column.
It then appends the column vector to the output sequence file.

Compilation:
Copy the Java code below into a file named Convert2SVD.java.
Add both the Mahout and Hadoop jars to your IDE project path. Alternatively, a command line option for compilation is given below.


import java.io.BufferedReader;
import java.io.FileReader;
import java.util.StringTokenizer;

import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;

/**
 * Code for converting CSV format to Mahout's SVD format
 * @author Danny Bickson, CMU
 * Note: I ASSUME THE CSV FILE IS SORTED BY THE COLUMN (NAMELY THE SECOND FIELD).
 *
 */

public class Convert2SVD {


        public static int Cardinality;

        /**
         * 
         * @param args[0] - input csv file
         * @param args[1] - cardinality (length of vector)
         * @param args[2] - output file for svd
         */
        public static void main(String[] args){

try {
        Cardinality = Integer.parseInt(args[1]);
        final Configuration conf = new Configuration();
        final FileSystem fs = FileSystem.get(conf);
        final SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, new Path(args[2]), IntWritable.class, VectorWritable.class, CompressionType.BLOCK);

          final IntWritable key = new IntWritable();
          final VectorWritable value = new VectorWritable();

   
           String thisLine;
        
           BufferedReader br = new BufferedReader(new FileReader(args[0]));
           Vector vector = null;
           int from = -1,to  =-1;
           int last_to = -1;
           float val = 0;
           int total = 0;
           int nnz = 0;
           int e = 0;
           int max_to =0;
           int max_from = 0;

           while ((thisLine = br.readLine()) != null) { // while loop begins here
            
                 StringTokenizer st = new StringTokenizer(thisLine, ",");
                 while(st.hasMoreTokens()) {
                     from = Integer.parseInt(st.nextToken())-1; //convert from 1 based to zero based
                     to = Integer.parseInt(st.nextToken())-1; //convert from 1 based to zero based
                     val = Float.parseFloat(st.nextToken());
                     if (max_from < from) max_from = from;
                     if (max_to < to) max_to = to;
                     if (from < 0 || to < 0 || from > Cardinality || val == 0.0)
                         throw new NumberFormatException("wrong data" + from + " to: " + to + " val: " + val);
                 }
              
                 //we are working on an existing column, set non-zero rows in it
                 if (last_to != to && last_to != -1){
                     value.set(vector);
                     
                     writer.append(key, value); //write the older vector
                     e+= vector.getNumNondefaultElements();
                 }
                 //a new column is observed, open a new vector for it
                 if (last_to != to){
                     vector = new SequentialAccessSparseVector(Cardinality); 
                     key.set(to); // open a new vector
                     total++;
                 }

                 vector.set(from, val);
                 nnz++;

                 if (nnz % 1000000 == 0){
                   System.out.println("Col" + total + " nnz: " + nnz);
                 }
                 last_to = to;

          } // end while 

           value.set(vector);
           writer.append(key,value);//write last row
           e+= vector.getNumNondefaultElements();
           total++;
           
           writer.close();
           System.out.println("Wrote a total of " + total + " cols " + " nnz: " + nnz);
           if (e != nnz)
                System.err.println("Bug:missing edges! we only got" + e);
          
           System.out.println("Highest column: " + max_to + " highest row: " + max_from );
        } catch(Exception ex){
                ex.printStackTrace();
        }
    }
}


A second option to compile this file is to create a Makefile with the following in it:
all:
        javac -cp /mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/core-3.1.1.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-core-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-math-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-cli-1.2.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/hadoop-0.20.2-core.jar *.java
Note that you will have to change the locations of the jars to point to where your jars are stored.

Example of running this conversion on the Netflix data:
java -cp .:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/core-3.1.1.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-core-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-math-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-cli-1.2.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/hadoop-0.20.2-core.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-logging-1.0.4.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-logging-api-1.0.4.jar Convert2SVD ../../netflixe.csv 17770 netflixe.seq

Aug 23, 2011 1:16:06 PM org.apache.hadoop.util.NativeCodeLoader
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Aug 23, 2011 1:16:06 PM org.apache.hadoop.io.compress.CodecPool getCompressor
INFO: Got brand-new compressor
Wrote a total of 241 rows, nnz: 1000000
Wrote a total of 381 rows, nnz: 2000000
Wrote a total of 571 rows, nnz: 3000000
Wrote a total of 789 rows, nnz: 4000000
Wrote a total of 1046 rows, nnz: 5000000
Wrote a total of 1216 rows, nnz: 6000000
Wrote a total of 1441 rows, nnz: 7000000

...

NOTE: You may also want to check out GraphLab's collaborative filtering library: here. GraphLab has an SVD solver that is 100% compatible with Mahout's, with performance gains of up to 50x. I have created Java code to convert Mahout sequence files into GraphLab's format and back. Email me and I will send you the code.

Tuesday, February 1, 2011

Mahout on Amazon EC2 - part 3 - Debugging

Connecting to the management web interface of a Hadoop node


1) Log in to the AWS management console.
Select the default security group and add TCP ports 50010-50090 (with source IP 0.0.0.0/0).

2) You can view the Hadoop node status (after starting Hadoop) by opening a web browser and
entering the following address:

http://ec2-50-16-155-136.compute-1.amazonaws.com:50070/
where ec2-XXX-XXXXXXXX is the node name, 50070 is the default port of the namenode server,
50030 is the default port of the job tracker, and 50060 is the default port of the task tracker.



Common errors and their solutions:

* When starting hadoop, the following message is presented:
<32|0>bickson@biggerbro:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2$ ./bin/start-all.sh
localhost: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
localhost: @    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
localhost: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
localhost: IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
localhost: Someone could be eavesdropping on you right now (man-in-the-middle attack)!
localhost: It is also possible that the RSA host key has just been changed.
localhost: The fingerprint for the RSA key sent by the remote host is
localhost: 06:95:7b:c8:0e:85:e7:ba:aa:b1:31:6e:fc:0e:ae:4d.
localhost: Please contact your system administrator.
localhost: Add correct host key in /mnt/bigbrofs/usr6/bickson/.ssh/known_hosts to get rid of this message.
localhost: Offending key in /mnt/bigbrofs/usr6/bickson/.ssh/known_hosts:1
localhost: RSA host key for localhost has changed and you have requested strict checking.
localhost: Host key verification failed.
Solution:

echo "NoHostAuthenticationForLocalhost yes" >>~/.ssh/config
Note:

If the file ~/.ssh/config does not exist, change the command to:
echo "NoHostAuthenticationForLocalhost yes" >~/.ssh/config 

* The following exception is received:
org.apache.hadoop.ipc.RemoteException: java.io.IOException:
File /user/ubuntu/temp/markedPreferences/_temporary/_attempt_local_0001_r_000000_0/part-r-00000 could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock
(FSNamesystem.java:1271)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke
(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953) 


Solution:
1) I saw this error when the system is out of disk space. Increase the number of nodes or use a larger instance type.
2) Another cause is that the datanode did not finish booting. Wait at least 200 seconds after starting Hadoop before actually running jobs.

* Job tracker fails to run with the following error:
2011-02-02 16:02:47,097 INFO org.apache.hadoop.mapred.JobTracker: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting JobTracker
STARTUP_MSG:   host = ip-10-114-75-91/10.114.75.91
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/

hadoop/common/branches/branch-0.20 -r 911707; compiled by 

'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
2011-02-02 16:02:47,200 INFO org.apache.hadoop.mapred.JobTracker: 

Scheduler  configured with (memSizeForMapSlotOnJT, memSizeForReduceSlotOnJT, 

limitMaxMemForMapTasks, limitMaxMemForReduceTasks) (-1, -1, -1, -1)
2011-02-02 16:02:47,220 FATAL org.apache.hadoop.mapred.JobTracker:

java.lang.RuntimeException: 

Not a host:port pair: local
 at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:136)
 at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:123)
 at org.apache.hadoop.mapred.JobTracker.getAddress(JobTracker.java:1807)
 at org.apache.hadoop.mapred.JobTracker.(JobTracker.java:1579)
 at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:183)
 at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:175)
 at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:3702)


Solution: edit the file /path/to/hadoop/conf/mapred-site.xml:

  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>




* When connecting to an EC2 host you get the following error:
ssh -i ./graphlabkey.pem -o "StrictHostKeyChecking no" 
ubuntu@ec2-50-16-101-232.compute-1.amazonaws.com "/home/ubuntu/ec2-metadata -i"
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@         WARNING: UNPROTECTED PRIVATE KEY FILE!          @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions 0644 for './graphlabkey.pem' are too open.
It is recommended that your private key files are NOT accessible by others.
This private key will be ignored.
Solution:
chmod 400 graphlabkey.pem





* Exception: cannot lock storage
************************************************************/
2011-02-03 23:47:48,623 INFO org.apache.hadoop.hdfs.server.common.Storage: 

Cannot lock storage /mnt. The directory is already locked.
2011-02-03 23:47:48,736 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: 

Cannot lock storage /mnt. The directory is already locked.
at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:510)
at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:363)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:112)
at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:298)
at org.apache.hadoop.hdfs.server.datanode.DataNode.(DataNode.java:216)
at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368) 
Solution: search for and remove the file in_use.lock under the storage directory.

* Exception:
tasktracker running as process XXX. Stop it first.

Solution:
1) Hadoop is already running; stop it first using stop-all.sh (on a single machine) or stop-mapred.sh and stop-dfs.sh (on a cluster).
2) If you stopped Hadoop and are still getting this error, check whether /tmp contains *.pid files and, if so, remove them.
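
A minimal sketch of that cleanup, assuming the pid files were left in the default /tmp location:

ls /tmp/*.pid    # inspect the leftover pid files first
rm /tmp/*.pid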

2010-08-06 12:12:06,900 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /Users/jchen/Data/Hadoop/dfs/data: namenode namespaceID = 773619367; datanode namespaceID = 2049079249
    at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:233)
    at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:148)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:298)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.(DataNode.java:216)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)
Solution: Remove all files named VERSION from all of Hadoop's temporary directories (search thoroughly; Hadoop has at least three working directories) and reformat the namenode file system.
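
A sketch of that cleanup, assuming the temporary directories from the cluster configuration above (adjust the paths to your own hadoop.tmp.dir, dfs.data.dir and dfs.name.dir settings):

find /mnt/tmp /mnt/tmp2 /mnt/tmp3 -name VERSION -print -delete
/usr/local/hadoop-0.20.2/bin/hadoop namenode -format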

* Error:
bash-3.2$ ./bin/start-all.sh 
starting namenode, logging to /mnt/bigbrofs/usr6/bickson/usr7/hadoop-0.20.2/bin/../logs/hadoop-bickson-namenode-biggerbro.ml.cmu.edu.out
localhost: starting datanode, logging to /mnt/bigbrofs/usr6/bickson/usr7/hadoop-0.20.2/bin/../logs/hadoop-bickson-datanode-biggerbro.ml.cmu.edu.out
localhost: starting secondarynamenode, logging to /mnt/bigbrofs/usr6/bickson/usr7/hadoop-0.20.2/bin/../logs/hadoop-bickson-secondarynamenode-biggerbro.ml.cmu.edu.out
localhost: Exception in thread "main" java.net.BindException: Address already in use
localhost: 	at sun.nio.ch.Net.bind(Native Method)
localhost: 	at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:119)
localhost: 	at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59)
localhost: 	at org.mortbay.jetty.nio.SelectChannelConnector.open(SelectChannelConnector.java:216)
localhost: 	at org.apache.hadoop.http.HttpServer.start(HttpServer.java:425)
localhost: 	at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:165)
localhost: 	at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.(SecondaryNameNode.java:115)
localhost: 	at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main(SecondaryNameNode.java:469)
starting jobtracker, logging to /mnt/bigbrofs/usr6/bickson/usr7/hadoop-0.20.2/bin/../logs/hadoop-bickson-jobtracker-biggerbro.ml.cmu.edu.out
Solution: kill all Hadoop processes using ./bin/stop-all.sh, wait a few minutes and retry. If this does not help, you may need to change the port numbers in the config files.

hadoop-bickson-namenode-biggerbro.ml.cmu.edu.log:org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/name is in an inconsistent state: storage directory does not exist or is not accessible.
hadoop-bickson-namenode-biggerbro.ml.cmu.edu.log:2011-09-05 04:33:10,572 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/name is in an inconsistent state: storage directory does not exist or is not accessible.
Solution: it seems you did not format HDFS properly; format it using the command
./bin/hadoop namenode -format

* Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector
	at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
	at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:247)
	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:762)
	at org.apache.hadoop.io.WritableName.getClass(WritableName.java:71)
	at org.apache.hadoop.io.SequenceFile$Reader.getValueClass(SequenceFile.java:1613)
	at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1555)
	at org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1428)
	at org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1417)
	at org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1412)
	at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.initialize(SequenceFileRecordReader.java:50)
	at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:418)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:620)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
	at org.apache.hadoop.mapred.Child.main(Child.java:170)

Solution: verify that MAHOUT_HOME is properly defined.
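
For example (the path below is a placeholder; point it at your own Mahout installation):

export MAHOUT_HOME=/path/to/mahout-0.4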

hadoop-bickson-namenode-biggerbro.ml.cmu.edu.log:2011-09-05 05:31:35,693 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 9000, call delete(/tmp/hadoop/mapred/system, true) from 127.0.0.1:51103: error: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /tmp/hadoop/mapred/system. Name node is in safe mode.
hadoop-bickson-namenode-biggerbro.ml.cmu.edu.log:org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /tmp/hadoop/mapred/system. Name node is in safe mode.
hadoop-bickson-namenode-biggerbro.ml.cmu.edu.log:2011-09-05 05:33:39,712 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 9000, call addBlock(/user/bickson/small_netflix_mahout_transpose/part-r-00000, DFSClient_-1810781150) from 127.0.0.1:48972: error: java.io.IOException: File /user/bickson/small_netflix_mahout_transpose/part-r-00000 could only be replicated to 0 nodes, instead of 1
hadoop-bickson-namenode-biggerbro.ml.cmu.edu.log:java.io.IOException: File /user/bickson/small_netflix_mahout_transpose/part-r-00000 could only be replicated to 0 nodes, instead of 1
Solution: This error may happen if you try to access the HDFS file system before Hadoop has finished starting up. Wait a few minutes and try again.

Mahout on CMU OpenCloud

This post explains how to run Mahout on top of CMU OpenCloud.

1) Log in to the cloud login node:
ssh -L 8888:proxy.opencloud:8888 login.cloud.pdl.cmu.local.

2) Copy the Mahout directory tree into your home folder.

3) Run a Mahout example:
cd mahout-0.4/
export JAVA_HOME=/usr/lib/jvm/java-6-sun/
 ./examples/bin/build-reuters.sh

You should see:
sh -x ./examples/bin/build-reuters.sh
11/02/01 15:13:27 INFO driver.MahoutDriver: Program took 225915 ms
+ ./bin/mahout seqdirectory -i ./examples/bin/work/reuters-out/ -o ./examples/bin/work/reuters-out-seqdir -c UTF-8 -chunk 5
Running on hadoop, using HADOOP_HOME=/usr/local/sw/hadoop
HADOOP_CONF_DIR=/etc/hadoop/conf/global
11/02/01 15:13:38 INFO driver.MahoutDriver: Program took 10087 ms
+ ./bin/mahout seq2sparse -i ./examples/bin/work/reuters-out-seqdir/ -o ./examples/bin/work/reuters-out-seqdir-sparse
Running on hadoop, using HADOOP_HOME=/usr/local/sw/hadoop
HADOOP_CONF_DIR=/etc/hadoop/conf/global
11/02/01 15:13:40 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
11/02/01 15:13:40 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
11/02/01 15:13:40 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
11/02/01 15:13:41 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/02/01 15:13:42 INFO input.FileInputFormat: Total input paths to process : 3
11/02/01 15:13:47 INFO mapred.JobClient: Running job: job_201101170028_1733
11/02/01 15:13:48 INFO mapred.JobClient:  map 0% reduce 0%
11/02/01 15:17:49 INFO mapred.JobClient:  map 33% reduce 0%
11/02/01 15:17:55 INFO mapred.JobClient:  map 66% reduce 0%
11/02/01 15:18:01 INFO mapred.JobClient:  map 100% reduce 0%
11/02/01 15:18:08 INFO mapred.JobClient: Job complete: job_201101170028_1733
11/02/01 15:18:08 INFO mapred.JobClient: Counters: 6
11/02/01 15:18:08 INFO mapred.JobClient:   Job Counters
11/02/01 15:18:08 INFO mapred.JobClient:     Rack-local map tasks=5
11/02/01 15:18:08 INFO mapred.JobClient:     Launched map tasks=5
11/02/01 15:18:08 INFO mapred.JobClient:   FileSystemCounters
11/02/01 15:18:08 INFO mapred.JobClient:     HDFS_BYTES_READ=13537042
11/02/01 15:18:08 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=11047110
11/02/01 15:18:08 INFO mapred.JobClient:   Map-Reduce Framework
11/02/01 15:18:08 INFO mapred.JobClient:     Map input records=16115
11/02/01 15:18:08 INFO mapred.JobClient:     Spilled Records=0
11/02/01 15:18:08 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/02/01 15:18:09 INFO input.FileInputFormat: Total input paths to process : 3
11/02/01 15:18:15 INFO mapred.JobClient: Running job: job_201101170028_1736
11/02/01 15:18:16 INFO mapred.JobClient:  map 0% reduce 0%
...