Tuesday, January 25, 2011

Mahout on Amazon EC2 - part 2 - Running Hadoop on a single node

This post follows part 1, which explained how to install Mahout and Hadoop on Amazon EC2.


We start by testing logistic regression.


1) Launch the Amazon AMI image you constructed using the instructions in part 1 of this post.

2) Run Hadoop using
# $HADOOP_HOME/bin/hadoop namenode -format
# $HADOOP_HOME/bin/start-all.sh
# jps     // you should see all 5 Hadoop processes (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker)
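
Note: these steps assume the standard pseudo-distributed (single node) Hadoop 0.20.2 configuration from part 1. If the daemons do not all come up, double-check the three files under $HADOOP_HOME/conf; the values below are the usual single-node defaults (a sketch, not a definitive setup):

<!-- conf/core-site.xml -->
<property><name>fs.default.name</name><value>hdfs://localhost:9000</value></property>
<!-- conf/hdfs-site.xml: single node, so one replica per block -->
<property><name>dfs.replication</name><value>1</value></property>
<!-- conf/mapred-site.xml -->
<property><name>mapred.job.tracker</name><value>localhost:9001</value></property>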

3) Run logistic regression example

cd /usr/local/mahout-0.4/
./bin/mahout org.apache.mahout.classifier.sgd.TrainLogistic --passes 100 --rate 50 --lambda 0.001 --input examples/src/main/resources/donut.csv --features 21 --output donut.model --target color --categories 2 --predictors x y xx xy yy a b c --types n n

You should see the following output:

11/01/25 14:42:45 WARN driver.MahoutDriver: No org.apache.mahout.classifier.sgd.TrainLogistic.props found on classpath, will use command-line arguments only
21
color ~ 0.353*Intercept Term + 5.450*x + -1.671*y + -4.740*xx + 0.353*xy + 0.353*yy + 5.450*a + 2.765*b + -24.161*c
      Intercept Term 0.35319
                   a 5.45000
                   b 2.76534
                   c -24.16091
                   x 5.45000
                  xx -4.73958
                  xy 0.35319
                   y -1.67092
                  yy 0.35319

    2.765337737     0.000000000    -1.670917299     0.000000000     0.000000000     0.000000000     5.449999190     0.000000000   -24.160908591    -4.739579336     0.353190637     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000

11/01/25 14:42:46 INFO driver.MahoutDriver: Program took 1016 ms
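
To sanity-check the trained model, you can run it back over the same data with the companion RunLogistic driver. A minimal sketch for Mahout 0.4 (verify the flags with --help if your build differs):

cd /usr/local/mahout-0.4/
./bin/mahout org.apache.mahout.classifier.sgd.RunLogistic --input examples/src/main/resources/donut.csv --model donut.model --auc --confusion

This prints the model's AUC and confusion matrix on the donut data.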

Now we run alternating least squares (ALS) matrix factorization, based on instructions by Sebastian Schelter (see https://issues.apache.org/jira/browse/MAHOUT-542).
A related GraphLab implementation can be found here.

0) Download the patch MAHOUT-542-5.patch from the above webpage and apply it using the following commands:
cd /usr/local/mahout-0.4/src/
wget https://issues.apache.org/jira/secure/attachment/12469671/MAHOUT-542-5.patch
patch -p0 < MAHOUT-542-5.patch
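
Since the patch changes Mahout's sources, rebuild the project so the new ALS drivers (splitDataset, parallelALS, evaluateALS) are picked up. A minimal sketch, assuming Maven is installed:

cd /usr/local/mahout-0.4/
mvn install -DskipTests     # rebuild Mahout with the patched ALS code
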
1) Get the MovieLens 1M dataset
cd /usr/local/mahout-0.4/
wget http://www.grouplens.org/system/files/million-ml-data.tar__0.gz
tar xvzf million-ml-data.tar__0.gz
2) Convert the dataset to CSV format
cat ratings.dat | sed -e 's/::/,/g' | cut -d, -f1,2,3 > ratings.csv
cd /usr/local/hadoop-0.20.2/
./bin/hadoop fs -copyFromLocal /path/to/ratings.csv ratings.csv
./bin/hadoop fs -ls

You should see something like:
/user/ubuntu/ratings.csv
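
As a quick sanity check of the conversion (MovieLens 1M lines have the form userID::movieID::rating::timestamp, and we keep only the first three fields):

head -2 /usr/local/mahout-0.4/ratings.dat   # e.g. 1::1193::5::978300760
head -2 /usr/local/mahout-0.4/ratings.csv   # e.g. 1,1193,5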


3) Create a 90% training set and a 10% probe set
/usr/local/mahout-0.4$ ./bin/mahout splitDataset  --input /user/ubuntu/ratings.csv --output /user/ubuntu/myout --trainingPercentage 0.9 --probePercentage 0.1
The output should look like:
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop-0.20.2/
HADOOP_CONF_DIR=/usr/local/hadoop-0.20.2/conf
11/01/27 01:09:39 WARN driver.MahoutDriver: No splitDataset.props found on classpath, will use command-line arguments only
11/01/27 01:09:39 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --input=/user/ubuntu/ratings.csv, --output=/user/ubuntu/myout, --probePercentage=0.1, --startPhase=0, --tempDir=temp, --trainingPercentage=0.9}
11/01/27 01:09:40 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
11/01/27 01:09:40 INFO input.FileInputFormat: Total input paths to process : 1
11/01/27 01:09:40 INFO mapred.JobClient: Running job: job_local_0001
11/01/27 01:09:40 INFO input.FileInputFormat: Total input paths to process : 1
11/01/27 01:09:41 INFO mapred.MapTask: io.sort.mb = 100
11/01/27 01:09:41 INFO mapred.MapTask: data buffer = 79691776/99614720
11/01/27 01:09:41 INFO mapred.MapTask: record buffer = 262144/327680
11/01/27 01:09:42 INFO mapred.JobClient:  map 0% reduce 0%
11/01/27 01:09:42 INFO mapred.MapTask: Spilling map output: record full = true
11/01/27 01:09:42 INFO mapred.MapTask: bufstart = 0; bufend = 5970616; bufvoid = 99614720
11/01/27 01:09:42 INFO mapred.MapTask: kvstart = 0; kvend = 262144; length = 327680
11/01/27 01:09:42 INFO util.NativeCodeLoader: Loaded the native-hadoop library
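
When splitDataset finishes, you can verify that it produced the two directories used in the next steps (a quick check; the actual listing will also show part files and sizes):

cd /usr/local/hadoop-0.20.2/
./bin/hadoop fs -ls /user/ubuntu/myout

You should see /user/ubuntu/myout/trainingSet and /user/ubuntu/myout/probeSet.
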
4) Run distributed ALS-WR to factorize the rating matrix based on the training set
bin/mahout parallelALS --input /user/ubuntu/myout/trainingSet/ --output /tmp/als/out --tempDir /tmp/als/tmp --numFeatures 20 --numIterations 10 --lambda 0.065
...
11/01/27 02:40:28 INFO mapred.JobClient:     Spilled Records=7398
11/01/27 02:40:28 INFO mapred.JobClient:     Map output bytes=691713
11/01/27 02:40:28 INFO mapred.JobClient:     Combine input records=0
11/01/27 02:40:28 INFO mapred.JobClient:     Map output records=3699
11/01/27 02:40:28 INFO mapred.JobClient:     Reduce input records=3699
11/01/27 02:40:28 INFO driver.MahoutDriver: Program took 1998612 ms
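
The factorization writes the user feature matrix U and the item feature matrix M under the output directory; it is worth confirming both exist before evaluating:

./bin/hadoop fs -ls /tmp/als/out

You should see /tmp/als/out/U and /tmp/als/out/M.
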
5) Measure the error of the predictions against the probe set
/usr/local/mahout-0.4$ ./bin/mahout evaluateALS --probes /user/ubuntu/myout/probeSet/ --userFeatures /tmp/als/out/U/ --itemFeatures /tmp/als/out/M/
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop-0.20.2/
HADOOP_CONF_DIR=/usr/local/hadoop-0.20.2/conf
11/01/27 02:42:37 WARN driver.MahoutDriver: No evaluateALS.props found on classpath, will use command-line arguments only
11/01/27 02:42:37 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --itemFeatures=/tmp/als/out/M/, --probes=/user/ubuntu/myout/probeSet/, --startPhase=0, --tempDir=temp, --userFeatures=/tmp/als/out/U/}

...

Probe [99507], rating of user [4510] towards item [2560], [1.0] estimated [1.574626183998361]
Probe [99508], rating of user [4682] towards item [171], [4.0] estimated [4.073943928686575]
Probe [99509], rating of user [3333] towards item [1215], [5.0] estimated [4.098295242062813]
Probe [99510], rating of user [4682] towards item [173], [2.0] estimated [1.9625234269143972]
RMSE: 0.8546120366924382, MAE: 0.6798083002225481
11/01/27 02:42:50 INFO driver.MahoutDriver: Program took 13127 ms
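
If you want to peek at the learned feature vectors themselves, Mahout ships a sequence file dumper. A sketch (the part file name below is an assumption - run fs -ls on the directory first to find the actual names, and check ./bin/mahout seqdumper --help since flag names may vary between versions):

./bin/mahout seqdumper --seqFile /tmp/als/out/U/part-00000
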
Useful HDFS commands

* View the current state of the file system
ubuntu@domU-12-31-39-00-18-51:/usr/local/hadoop-0.20.2$ ./bin/hadoop dfsadmin -report
Configured Capacity: 10568916992 (9.84 GB)
Present Capacity: 3698495488 (3.44 GB)
DFS Remaining: 40173568 (38.31 MB)
DFS Used: 3658321920 (3.41 GB)
DFS Used%: 98.91%
Under replicated blocks: 56
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 1 (1 total, 0 dead)

Name: 127.0.0.1:50010
Decommission Status : Normal
Configured Capacity: 10568916992 (9.84 GB)
DFS Used: 3658321920 (3.41 GB)
Non DFS Used: 6870421504 (6.4 GB)
DFS Remaining: 40173568(38.31 MB)
DFS Used%: 34.61%
DFS Remaining%: 0.38%
Last contact: Tue Feb 01 21:10:15 UTC 2011
* Delete a directory
ubuntu@domU-12-31-39-00-18-51:/usr/local/hadoop-0.20.2$ ./bin/hadoop fs -rmr temp/markedPreferences
Deleted hdfs://localhost:9000/user/ubuntu/temp/markedPreferences
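
Two more commands that come in handy on a single-node setup (standard Hadoop 0.20.2 tools; the paths below are placeholders):
* Copy a file or directory from HDFS back to the local file system
./bin/hadoop fs -copyToLocal /user/ubuntu/myout/probeSet /tmp/probeSet
* Check the health of the file system
./bin/hadoop fsck /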

10 comments:

  1. Hi Danny,

    I tried running logistic regression on Mahout with different data, but I receive an error: "Unexpected rate while processing". Any idea why that might be? I have used the same parameters you used for rate, lambda, and features; the others were modified based on my data.

    Thanks.

  2. Hi Anonymous!
    There are several versions of Mahout - my post relates to version 0.4 - and several versions of Hadoop as well, so there are many possible reasons for errors. I suggest you email the Mahout user mailing list with more information.
    If you would like to try out GraphLab's logistic regression, I can help you much more actively. Take a look at http://graphlab.org/gabp.html

    Best,

    DB

  3. Hello.. Can anyone tell me how to install Mahout on Ubuntu? I have been trying to install it for a month with no luck, so please tell me how. Thanks

  4. usman,
    Follow Hadoop's installation instructions and make sure you get the pseudo-distributed setup done.
    Then using Mahout should be pretty straightforward.
    Both projects' documentation should be enough.
    If not, give specifics..

  5. Hello,
    I'm trying to run a couple of logistic regression tests on Mahout, but I cannot get the results from Mahout to match R. Any suggestions? Here is the code to generate the data in R and export it for use with Mahout, along with the results from both products.

    #######
    ## R ##
    #######
    aps <- runif(500, 2, 37)    # uniform between 2 and 37
    tiss <- runif(500, 9, 36)
    sex <- rbinom(500, 1, .51)  # binomial with p(success) = .51
    a2 <- as.data.frame(cbind(aps, tiss, sex))
    a2.logr <- glm(sex ~ aps + tiss , data=a2, family=binomial("logit"))
    a2.logr
    write.csv(a2, "~/test2.csv")

    Model Calculated by R
    Call: glm(formula = sex ~ aps + tiss, family = binomial("logit"), data = a2)

    Coefficients:
    (Intercept)       aps      tiss
      -0.449966  0.020108  0.006734

    Degrees of Freedom: 499 Total (i.e. Null); 497 Residual
    Null Deviance: 692.2
    Residual Deviance: 686.9 AIC: 692.9

    ############
    ## MAHOUT ##
    ############

    mahout org.apache.mahout.classifier.sgd.TrainLogistic --input test2.csv --output test_output.csv --target sex --categories 2 --predictors aps tiss --types numeric --features 20 --passes 20

    sex ~ -0.000*Intercept Term + 0.011*aps + -0.004*tiss
    Intercept Term -0.00025
    aps 0.01142
    tiss -0.00395

  6. Hi,
    I am not sure which algorithm is specifically implemented in Mahout/R. From previous experience, excluding bugs, different papers use several different cost functions for lasso and logistic regression, so different implementations may give different results. For example, when we compared multiple algorithms in our Shotgun work on Lasso, we encountered three main cost functions:
    Cost function A: || Ax - y ||_2^2 + \lambda ||x||_1
    Cost function B: || Ax - y ||_2^2 s.t. ||x||_0 = s
    Cost function C: || Ax - y ||_2^2 s.t. ||x||_1 \le \tau

    The first is the traditional lasso formulation, the second defines the problem by the number of expected non-zeros in the sparse answer vector, and the third uses a constraint instead of an L1 penalty.
    All cost functions are potentially equivalent, but some fine-tuning is needed to get the same results from the different algorithms. I would start looking in this direction.

  7. Danny,
    Can I directly load a CSV file into HDFS and start processing it with Mahout for logistic regression, or is any preprocessing of the CSV data required? You have considered an example file and not a general dataset.
    Kindly reply.

    sushant

    Replies
    1. Here is the dataset: http://svn.apache.org/repos/asf/mahout/trunk/examples/src/main/resources/donut.csv
      You can use it as an example.

  8. Hi Danny,
    I want to use SVM in Mahout, so I applied the patch from issue https://issues.apache.org/jira/browse/MAHOUT-232/Mahout-232-0.8.patch, but I can't run SVM on Mahout and Hadoop - I am getting a class not found exception. Please help me.
    Thank you in advance.
