We start by testing logistic regression.
1) Launch the Amazon AMI image you constructed following the explanation in part 1 of this post.
2) Run Hadoop using:
# $HADOOP_HOME/bin/hadoop namenode -format
# $HADOOP_HOME/bin/start-all.sh
# jps // you should see all 5 Hadoop processes (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker)
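Before moving on, you can optionally confirm that HDFS is answering (my own sanity check, using standard Hadoop shell commands, not part of the original steps):
# $HADOOP_HOME/bin/hadoop fs -ls /
# $HADOOP_HOME/bin/hadoop fs -mkdir /tmp/smoketest // any write that succeeds means the NameNode and DataNode are talking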
3) Run the logistic regression example:
cd /usr/local/mahout-0.4/
./bin/mahout org.apache.mahout.classifier.sgd.TrainLogistic --passes 100 --rate 50 --lambda 0.001 --input examples/src/main/resources/donut.csv --features 21 --output donut.model --target color --categories 2 --predictors x y xx xy yy a b c --types n n
You should see the following output:
11/01/25 14:42:45 WARN driver.MahoutDriver: No org.apache.mahout.classifier.sgd.TrainLogistic.props found on classpath, will use command-line arguments only
21
color ~ 0.353*Intercept Term + 5.450*x + -1.671*y + -4.740*xx + 0.353*xy + 0.353*yy + 5.450*a + 2.765*b + -24.161*c
Intercept Term 0.35319
a 5.45000
b 2.76534
c -24.16091
x 5.45000
xx -4.73958
xy 0.35319
y -1.67092
yy 0.35319
2.765337737 0.000000000 -1.670917299 0.000000000 0.000000000 0.000000000 5.449999190 0.000000000 -24.160908591 -4.739579336 0.353190637 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
11/01/25 14:42:46 INFO driver.MahoutDriver: Program took 1016 ms
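To sanity-check the trained model, you can score the same CSV with Mahout's RunLogistic driver; a minimal sketch, assuming the 0.4 class name and flags below match your build:
cd /usr/local/mahout-0.4/
./bin/mahout org.apache.mahout.classifier.sgd.RunLogistic --input examples/src/main/resources/donut.csv --model donut.model --auc --confusion
This prints the AUC and a confusion matrix for the model stored in donut.model.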
Now we run alternating least squares matrix factorization, based on instructions by Sebastian Schelter (see https://issues.apache.org/jira/browse/MAHOUT-542).
A related GraphLab implementation can be found here.
0) Download the patch MAHOUT-542-5.patch from the above webpage.
Install it using the commands:
cd /usr/local/mahout-0.4/src/
wget https://issues.apache.org/jira/secure/attachment/12469671/MAHOUT-542-5.patch
patch -p0 < MAHOUT-542-5.patch
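Since the patch modifies Mahout's sources, you may need to rebuild before the new ALS jobs show up; a sketch assuming Maven is available on the image (my assumption, not part of the original instructions):
cd /usr/local/mahout-0.4/
mvn clean install -DskipTests   # rebuild so the patched job classes land in the Mahout jars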
1) Get the MovieLens 1M dataset:
cd /usr/local/mahout-0.4/
wget http://www.grouplens.org/system/files/million-ml-data.tar__0.gz
tar xvzf million-ml-data.tar__0.gz
2) Convert the dataset to CSV format:
cat ratings.dat | sed -e s/::/,/g | cut -d, -f1,2,3 > ratings.csv
cd /usr/local/hadoop-0.20.2/
./bin/hadoop fs -copyFromLocal /path/to/ratings.csv ratings.csv
./bin/hadoop fs -ls
You should see something like:
/user/ubuntu/ratings.csv
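As an aside, the sed/cut pipeline above can be written as a single awk command; an equivalent sketch:
awk -F'::' '{print $1","$2","$3}' ratings.dat > ratings.csv   # keep user, movie, rating; drop the timestamp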
3) Create a 90% training set and a 10% probe set:
/usr/local/mahout-0.4$ ./bin/mahout splitDataset --input /user/ubuntu/ratings.csv --output /user/ubuntu/myout --trainingPercentage 0.9 --probePercentage 0.1
The output should look like:
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop-0.20.2/ HADOOP_CONF_DIR=/usr/local/hadoop-0.20.2/conf
11/01/27 01:09:39 WARN driver.MahoutDriver: No splitDataset.props found on classpath, will use command-line arguments only
11/01/27 01:09:39 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --input=/user/ubuntu/ratings.csv, --output=/user/ubuntu/myout, --probePercentage=0.1, --startPhase=0, --tempDir=temp, --trainingPercentage=0.9}
11/01/27 01:09:40 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
11/01/27 01:09:40 INFO input.FileInputFormat: Total input paths to process : 1
11/01/27 01:09:40 INFO mapred.JobClient: Running job: job_local_0001
11/01/27 01:09:40 INFO input.FileInputFormat: Total input paths to process : 1
11/01/27 01:09:41 INFO mapred.MapTask: io.sort.mb = 100
11/01/27 01:09:41 INFO mapred.MapTask: data buffer = 79691776/99614720
11/01/27 01:09:41 INFO mapred.MapTask: record buffer = 262144/327680
11/01/27 01:09:42 INFO mapred.JobClient: map 0% reduce 0%
11/01/27 01:09:42 INFO mapred.MapTask: Spilling map output: record full = true
11/01/27 01:09:42 INFO mapred.MapTask: bufstart = 0; bufend = 5970616; bufvoid = 99614720
11/01/27 01:09:42 INFO mapred.MapTask: kvstart = 0; kvend = 262144; length = 327680
11/01/27 01:09:42 INFO util.NativeCodeLoader: Loaded the native-hadoop library
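Once the job finishes, you can confirm the split landed in HDFS (trainingSet/ and probeSet/ are the directory names used by the next steps):
cd /usr/local/hadoop-0.20.2/
./bin/hadoop fs -ls /user/ubuntu/myout   # should list trainingSet and probeSet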
4) Run distributed ALS-WR to factorize the rating matrix based on the training set:
bin/mahout parallelALS --input /user/ubuntu/myout/trainingSet/ --output /tmp/als/out --tempDir /tmp/als/tmp --numFeatures 20 --numIterations 10 --lambda 0.065
...
11/01/27 02:40:28 INFO mapred.JobClient: Spilled Records=7398
11/01/27 02:40:28 INFO mapred.JobClient: Map output bytes=691713
11/01/27 02:40:28 INFO mapred.JobClient: Combine input records=0
11/01/27 02:40:28 INFO mapred.JobClient: Map output records=3699
11/01/27 02:40:28 INFO mapred.JobClient: Reduce input records=3699
11/01/27 02:40:28 INFO driver.MahoutDriver: Program took 1998612 ms
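If you want to tune the factorization, the flags above are the natural knobs. A sketch that tries a few regularization values (the per-lambda output paths are my own naming):
for LAMBDA in 0.05 0.065 0.08; do
  bin/mahout parallelALS --input /user/ubuntu/myout/trainingSet/ \
    --output /tmp/als/out-$LAMBDA --tempDir /tmp/als/tmp-$LAMBDA \
    --numFeatures 20 --numIterations 10 --lambda $LAMBDA
done
You can then run evaluateALS (next step) against each output directory and keep the lambda with the lowest RMSE.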
5) Measure the error of the predictions against the probe set:
/usr/local/mahout-0.4$ bin/mahout evaluateALS --probes /user/ubuntu/myout/probeSet/ --userFeatures /tmp/als/out/U/ --itemFeatures /tmp/als/out/M/
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop-0.20.2/ HADOOP_CONF_DIR=/usr/local/hadoop-0.20.2/conf
11/01/27 02:42:37 WARN driver.MahoutDriver: No evaluateALS.props found on classpath, will use command-line arguments only
11/01/27 02:42:37 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --itemFeatures=/tmp/als/out/M/, --probes=/user/ubuntu/myout/probeSet/, --startPhase=0, --tempDir=temp, --userFeatures=/tmp/als/out/U/}
...
Probe [99507], rating of user [4510] towards item [2560], [1.0] estimated [1.574626183998361]
Probe [99508], rating of user [4682] towards item [171], [4.0] estimated [4.073943928686575]
Probe [99509], rating of user [3333] towards item [1215], [5.0] estimated [4.098295242062813]
Probe [99510], rating of user [4682] towards item [173], [2.0] estimated [1.9625234269143972]
RMSE: 0.8546120366924382, MAE: 0.6798083002225481
11/01/27 02:42:50 INFO driver.MahoutDriver: Program took 13127 ms
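Since the per-probe output is long, you can filter the log down to the summary line (a sketch):
bin/mahout evaluateALS --probes /user/ubuntu/myout/probeSet/ --userFeatures /tmp/als/out/U/ --itemFeatures /tmp/als/out/M/ 2>&1 | grep RMSE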
Useful HDFS commands
* View the current state of the file system:
ubuntu@domU-12-31-39-00-18-51:/usr/local/hadoop-0.20.2$ ./bin/hadoop dfsadmin -report
Configured Capacity: 10568916992 (9.84 GB)
Present Capacity: 3698495488 (3.44 GB)
DFS Remaining: 40173568 (38.31 MB)
DFS Used: 3658321920 (3.41 GB)
DFS Used%: 98.91%
Under replicated blocks: 56
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 1 (1 total, 0 dead)
Name: 127.0.0.1:50010
Decommission Status : Normal
Configured Capacity: 10568916992 (9.84 GB)
DFS Used: 3658321920 (3.41 GB)
Non DFS Used: 6870421504 (6.4 GB)
DFS Remaining: 40173568 (38.31 MB)
DFS Used%: 34.61%
DFS Remaining%: 0.38%
Last contact: Tue Feb 01 21:10:15 UTC 2011
* Delete a directory:
ubuntu@domU-12-31-39-00-18-51:/usr/local/hadoop-0.20.2$ ./bin/hadoop fs -rmr temp/markedPreferences
Deleted hdfs://localhost:9000/user/ubuntu/temp/markedPreferences
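* A few more standard Hadoop 0.20 shell commands that come in handy here (my additions, not from the original session):
./bin/hadoop fs -du /user/ubuntu   # sizes of files under a directory
./bin/hadoop fs -cat /user/ubuntu/ratings.csv | head   # peek at the first lines of a file
./bin/hadoop fs -copyToLocal /user/ubuntu/myout/probeSet probeSet   # copy results back to local disk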
Hi Danny,
I tried running logistic regression on Mahout with different data, but I receive an error: "Unexpected rate while processing". Any idea why that might be? I used the same parameters as you for rate, lambda, and features; the others were modified based on my data.
Thanks.
Hi Anonymous!
There are several versions of Mahout - my post relates to version 0.4 - and there are also several versions of Hadoop, so there are a lot of possible reasons for errors. I suggest you email the Mahout user mailing list with more information.
If you'd like to try out GraphLab's logistic regression, I can help you much more actively. Take a look at http://graphlab.org/gabp.html
Best,
DB
Hello. Can anyone please tell me how I can install Mahout on Ubuntu? I have been trying to install it for a month but nothing has worked. Please tell me how to install it. Thanks
usman,
Follow Hadoop's installation instructions and make sure you get the pseudo-distributed setup done.
Then using Mahout should be pretty straightforward.
Both projects' documentation should be enough.
If not, give specifics.
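For reference, the pseudo-distributed setup boils down to a couple of small config files; a minimal sketch for HDFS (the hdfs://localhost:9000 URI matches the one appearing in the post above):
cat > $HADOOP_HOME/conf/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
You would similarly set dfs.replication to 1 in conf/hdfs-site.xml and mapred.job.tracker to localhost:9001 in conf/mapred-site.xml, as in Hadoop's pseudo-distributed guide.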
Hello,
I'm trying to run a couple of logistic regression tests on Mahout, but I cannot get the results from Mahout to match R. Any suggestions? Here is the code to generate the data in R and export it for use with Mahout. You can also find the results from both products.
#######
## R ##
#######
aps<-runif(500,2,37) #uniform between 2 and 37
tiss<-runif(500,9,36)
sex<-rbinom(500,1,.51) #binomial with p(success)=.51
a2 <- as.data.frame(cbind(aps, tiss, sex))
a2.logr <- glm(sex ~ aps + tiss , data=a2, family=binomial("logit"))
a2.logr
write.csv(a2, "~/test2.csv")
Model Calculated by R
Call: glm(formula = sex ~ aps + tiss, family = binomial("logit"), data = a2)
Coefficients:
(Intercept) aps tiss
-0.449966 0.020108 0.006734
Degrees of Freedom: 499 Total (i.e. Null); 497 Residual
Null Deviance: 692.2
Residual Deviance: 686.9 AIC: 692.9
############
## MAHOUT ##
############
mahout org.apache.mahout.classifier.sgd.TrainLogistic --input test2.csv --output test_output.csv --target sex --categories 2 --predictors aps tiss --types numeric --features 20 --passes 20
sex ~ -0.000*Intercept Term + 0.011*aps + -0.004*tiss
Intercept Term -0.00025
aps 0.01142
tiss -0.00395
Hi,
I am not sure which algorithm is specifically implemented in Mahout or in R. From previous experience, excluding bugs, there are several cost functions used for lasso and logistic regression in different papers, and thus different implementations may give different results. For example in lasso, when we compared multiple algorithms in our Shotgun work, we encountered three main cost functions:
Cost function A: || Ax - y ||_2^2 + \lambda ||x||_1
Cost function B: || Ax - y ||_2^2 s.t. ||x||_0 = s
Cost function C: || Ax - y ||_2^2 s.t. ||x||_1 \le \tau
The first is the traditional lasso formulation, the second defines the problem by the number of non-zeros allowed in the sparse answer vector, and the third uses a constraint instead of an L1 penalty.
All cost functions are potentially equivalent, but some fine-tuning is needed to get the same results from the different algorithms. I would start looking in this direction.
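One concrete thing to check in your case: TrainLogistic regularizes through --lambda, while R's plain glm fit is unregularized, and --features 20 hashes the predictors into a small vector. A hedged sketch, reusing only flags that appear earlier in this post - shrink lambda and add passes and see whether the coefficients move toward R's:
mahout org.apache.mahout.classifier.sgd.TrainLogistic --input test2.csv \
  --output test_output.model --target sex --categories 2 \
  --predictors aps tiss --types numeric --features 20 \
  --passes 100 --lambda 0.0001 --rate 1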
Danny,
Can I directly load a CSV file into HDFS and start processing it with Mahout for logistic regression, or is any preprocessing of the CSV data required? You have considered an example file and not a general dataset.
Kindly reply.
sushant
Here is the dataset: http://svn.apache.org/repos/asf/mahout/trunk/examples/src/main/resources/donut.csv
You can use it as an example.
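For instance, a quick way to grab it and inspect the format (using the URL above):
wget http://svn.apache.org/repos/asf/mahout/trunk/examples/src/main/resources/donut.csv
head -3 donut.csv   # header row plus the first two data rows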
Hi Danny,
I want to use SVM in Mahout, so I applied the patch MAHOUT-232-0.8.patch from issue https://issues.apache.org/jira/browse/MAHOUT-232. But I can't run SVM on Mahout and Hadoop - I am getting a ClassNotFoundException. Please help me.
Thank you in advance.