We start by testing logistic regression.
1) Launch the Amazon AMI image you constructed following the explanation in part 1 of this post.
2) Run Hadoop using
# $HADOOP_HOME/bin/hadoop namenode -format
# $HADOOP_HOME/bin/start-all.sh
# jps
You should see all 5 Hadoop processes (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker).
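Before moving on, it is worth confirming that all five daemons actually came up. A minimal sketch of that check (the jps listing is hardcoded here as a sample, since PIDs vary from run to run; on a live node you would pipe the real `jps` output instead):

```shell
# Sample jps output, hardcoded for illustration only.
jps_output="1201 NameNode
1354 SecondaryNameNode
1289 DataNode
1432 JobTracker
1510 TaskTracker
1688 Jps"

# Check that each of the five expected Hadoop daemons appears in the listing.
missing=0
for proc in NameNode SecondaryNameNode DataNode JobTracker TaskTracker; do
  if ! echo "$jps_output" | grep -q "$proc"; then
    echo "missing: $proc"
    missing=1
  fi
done
[ "$missing" -eq 0 ] && echo "all 5 Hadoop daemons running"
```

If any daemon is missing, check the logs under $HADOOP_HOME/logs before continuing.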
3) Run logistic regression example
cd /usr/local/mahout-0.4/
./bin/mahout org.apache.mahout.classifier.sgd.TrainLogistic --passes 100 --rate 50 --lambda 0.001 --input examples/src/main/resources/donut.csv --features 21 --output donut.model --target color --categories 2 --predictors x y xx xy yy a b c --types n n
You should see the following output:
11/01/25 14:42:45 WARN driver.MahoutDriver: No org.apache.mahout.classifier.sgd.TrainLogistic.props found on classpath, will use command-line arguments only
21
color ~ 0.353*Intercept Term + 5.450*x + -1.671*y + -4.740*xx + 0.353*xy + 0.353*yy + 5.450*a + 2.765*b + -24.161*c
Intercept Term 0.35319
a 5.45000
b 2.76534
c -24.16091
x 5.45000
xx -4.73958
xy 0.35319
y -1.67092
yy 0.35319
2.765337737 0.000000000 -1.670917299 0.000000000 0.000000000 0.000000000 5.449999190 0.000000000 -24.160908591 -4.739579336 0.353190637 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
11/01/25 14:42:46 INFO driver.MahoutDriver: Program took 1016 ms
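To make the output above concrete: the learned weights define a linear score that is pushed through the logistic function to give a class probability. A sketch of scoring a single point by hand with those weights (the values chosen for the derived predictors a, b, c below are made up for illustration; in donut.csv they are precomputed distance features, so a real row would supply them):

```shell
# Score the point (x=0.5, y=0.5) with the coefficients printed above.
echo "0.5 0.5" | awk '{
  x = $1; y = $2
  a = 0.5; b = 0.5; c = 0.5   # hypothetical predictor values, for illustration only
  # Linear score using the learned weights from the training output.
  z = 0.35319 + 5.45*x - 1.67092*y - 4.73958*x*x
  z += 0.35319*x*y + 0.35319*y*y + 5.45*a + 2.76534*b - 24.16091*c
  # Logistic function turns the score into a probability.
  printf "p(color=1) = %.4f\n", 1 / (1 + exp(-z))
}'
```

The large negative weight on c dominates here, so points with sizeable c get a probability near zero.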
Now we run alternating least squares (ALS) matrix factorization, based on instructions by Sebastian Schelter (see https://issues.apache.org/jira/browse/MAHOUT-542).
A related GraphLab implementation can be found here.
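For reference, the ALS-WR variant used below (following the weighted-λ-regularization scheme of Zhou et al., which the MAHOUT-542 patch implements) alternates between solving for the user factors and the item factors of the rating matrix, minimizing:

```latex
% ALS-WR objective: r_{ij} are the observed ratings, u_i the user factor
% vectors, m_j the item factor vectors, and n_{u_i}, n_{m_j} the number of
% ratings by user i and of item j (the "weighted" part of the regularizer).
\[
\min_{U,M} \sum_{(i,j)\in I} \bigl( r_{ij} - u_i^{\top} m_j \bigr)^2
+ \lambda \Bigl( \sum_i n_{u_i} \lVert u_i \rVert^2
               + \sum_j n_{m_j} \lVert m_j \rVert^2 \Bigr)
\]
```

The --numFeatures, --numIterations, and --lambda flags in step 4 set the factor dimensionality, the number of alternations, and λ respectively.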
0) Download the patch MAHOUT-542-5.patch from the above webpage and install it using the commands:
cd /usr/local/mahout-0.4/src/
wget https://issues.apache.org/jira/secure/attachment/12469671/MAHOUT-542-5.patch
patch -p0 < MAHOUT-542-5.patch

1) Get the MovieLens 1M movie dataset
cd /usr/local/mahout-0.4/
wget http://www.grouplens.org/system/files/million-ml-data.tar__0.gz
tar xvzf million-ml-data.tar__0.gz

2) Convert the dataset to CSV format
cat ratings.dat | sed -e s/::/,/g | cut -d, -f1,2,3 > ratings.csv
cd /usr/local/hadoop-0.20.2/
./bin/hadoop fs -copyFromLocal /path/to/ratings.csv ratings.csv
./bin/hadoop fs -ls

You should see something like:
/user/ubuntu/ratings.csv

3) Create a 90% training set and a 10% probe set
/usr/local/mahout-0.4$ ./bin/mahout splitDataset --input /user/ubuntu/ratings.csv --output /user/ubuntu/myout --trainingPercentage 0.9 --probePercentage 0.1

The output should look like:
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop-0.20.2/ HADOOP_CONF_DIR=/usr/local/hadoop-0.20.2/conf
11/01/27 01:09:39 WARN driver.MahoutDriver: No splitDataset.props found on classpath, will use command-line arguments only
11/01/27 01:09:39 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --input=/user/ubuntu/ratings.csv, --output=/user/ubuntu/myout, --probePercentage=0.1, --startPhase=0, --tempDir=temp, --trainingPercentage=0.9}
11/01/27 01:09:40 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
11/01/27 01:09:40 INFO input.FileInputFormat: Total input paths to process : 1
11/01/27 01:09:40 INFO mapred.JobClient: Running job: job_local_0001
11/01/27 01:09:40 INFO input.FileInputFormat: Total input paths to process : 1
11/01/27 01:09:41 INFO mapred.MapTask: io.sort.mb = 100
11/01/27 01:09:41 INFO mapred.MapTask: data buffer = 79691776/99614720
11/01/27 01:09:41 INFO mapred.MapTask: record buffer = 262144/327680
11/01/27 01:09:42 INFO mapred.JobClient: map 0% reduce 0%
11/01/27 01:09:42 INFO mapred.MapTask: Spilling map output: record full = true
11/01/27 01:09:42 INFO mapred.MapTask: bufstart = 0; bufend = 5970616; bufvoid = 99614720
11/01/27 01:09:42 INFO mapred.MapTask: kvstart = 0; kvend = 262144; length = 327680
11/01/27 01:09:42 INFO util.NativeCodeLoader: Loaded the native-hadoop library

4) Run distributed ALS-WR to factorize the rating matrix based on the training set
bin/mahout parallelALS --input /user/ubuntu/myout/trainingSet/ --output /tmp/als/out --tempDir /tmp/als/tmp --numFeatures 20 --numIterations 10 --lambda 0.065
...
11/01/27 02:40:28 INFO mapred.JobClient: Spilled Records=7398
11/01/27 02:40:28 INFO mapred.JobClient: Map output bytes=691713
11/01/27 02:40:28 INFO mapred.JobClient: Combine input records=0
11/01/27 02:40:28 INFO mapred.JobClient: Map output records=3699
11/01/27 02:40:28 INFO mapred.JobClient: Reduce input records=3699
11/01/27 02:40:28 INFO driver.MahoutDriver: Program took 1998612 ms

5) Measure the error of the predictions against the probe set
/usr/local/mahout-0.4$ bin/mahout evaluateALS --probes /user/ubuntu/myout/probeSet/ --userFeatures /tmp/als/out/U/ --itemFeatures /tmp/als/out/M/
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop-0.20.2/ HADOOP_CONF_DIR=/usr/local/hadoop-0.20.2/conf
11/01/27 02:42:37 WARN driver.MahoutDriver: No evaluateALS.props found on classpath, will use command-line arguments only
11/01/27 02:42:37 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --itemFeatures=/tmp/als/out/M/, --probes=/user/ubuntu/myout/probeSet/, --startPhase=0, --tempDir=temp, --userFeatures=/tmp/als/out/U/}
...
Probe [99507], rating of user [4510] towards item [2560], [1.0] estimated [1.574626183998361]
Probe [99508], rating of user [4682] towards item [171], [4.0] estimated [4.073943928686575]
Probe [99509], rating of user [3333] towards item [1215], [5.0] estimated [4.098295242062813]
Probe [99510], rating of user [4682] towards item [173], [2.0] estimated [1.9625234269143972]
RMSE: 0.8546120366924382, MAE: 0.6798083002225481
11/01/27 02:42:50 INFO driver.MahoutDriver: Program took 13127 ms

Useful HDFS commands

* View the current state of the file system
ubuntu@domU-12-31-39-00-18-51:/usr/local/hadoop-0.20.2$ ./bin/hadoop dfsadmin -report
Configured Capacity: 10568916992 (9.84 GB)
Present Capacity: 3698495488 (3.44 GB)
DFS Remaining: 40173568 (38.31 MB)
DFS Used: 3658321920 (3.41 GB)
DFS Used%: 98.91%
Under replicated blocks: 56
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 1 (1 total, 0 dead)
Name: 127.0.0.1:50010
Decommission Status : Normal
Configured Capacity: 10568916992 (9.84 GB)
DFS Used: 3658321920 (3.41 GB)
Non DFS Used: 6870421504 (6.4 GB)
DFS Remaining: 40173568 (38.31 MB)
DFS Used%: 34.61%
DFS Remaining%: 0.38%
Last contact: Tue Feb 01 21:10:15 UTC 2011

* Delete a directory
ubuntu@domU-12-31-39-00-18-51:/usr/local/hadoop-0.20.2$ ./bin/hadoop fs -rmr temp/markedPreferences
Deleted hdfs://localhost:9000/user/ubuntu/temp/markedPreferences
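As a sanity check on step 5, the RMSE and MAE that evaluateALS reports are just the root-mean-squared and mean-absolute differences between actual and estimated ratings. A minimal sketch recomputing them over only the four probe lines shown above (the real run of course averages over the entire probe set, hence the different numbers):

```shell
# Each line: actual rating, then the estimate from the evaluateALS output above.
printf '%s\n' \
  "1.0 1.574626183998361" \
  "4.0 4.073943928686575" \
  "5.0 4.098295242062813" \
  "2.0 1.9625234269143972" |
awk '{
  d = $2 - $1                       # estimated minus actual rating
  se += d * d                       # accumulate squared error
  ae += (d < 0 ? -d : d)            # accumulate absolute error
  n++
}
END { printf "RMSE: %.4f, MAE: %.4f\n", sqrt(se/n), ae/n }'
```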