Showing posts with label K-means. Show all posts
Showing posts with label K-means. Show all posts

Friday, September 9, 2011

GraphLab Clustering library

Recently I have been working on implementing a clustering library on top of GraphLab.
Currently we have K-means, Fuzzy K-means and LDA (Latent Dirichlet Allocation) implemented. I took some time for comparing performance of GraphLab vs. Mahout on an Amazon EC2 machine.

Here is a graph which compares performance:


Some explanation about the experiment. I took a subset of Netflix data with 3,298,163 movie ratings, 95,526 users, and 3,561 movies. The goal is to cluster user with similar movie preferences together. Both GraphLab and Mahout run on Amazon m2.xlarge instance .
This machine has 2 cores. I have used the following settings: 250 clusters, 50 clusters and 20 clusters. The algorithm runs a single iteration and then dumps the output into a text file.
For Mahout, I used Mahout's K-Means implementation. GraphLab was run using a single node, while Mahout was run using either one or two nodes. Mahout is using 7 mappers and GraphLab 7 threads.

Overall, GraphLab runs between x15 to x40 faster on this dataset.

A second experiment I did is to compare Mahout's LDA performance to GraphLab's LDA.
Here is the Graph:
For this experiment, I used m1.xlarge instance. I tested Graphlab on 4 cores, Mahout and 4 cores and Mahout on 8 cores (2 nodes). I used the same Netflix data subset, this time with 10 clusters. Graph depicts running time of a single iteration.

Finally, here are performance results of GraphLab LDA with 1, 2, 3 and 4 cores (on m1.xlarge EC2 instance):

Running time in this case is for 5 iterations.

Saturday, September 3, 2011

Understanding Mahout K-Means clustering implementation

This post helps to understand Mahout's K-Means clustering implementation.
Preliminaries: you should read first the explanation in the link above.

Installation and setup
wget http://apache.spd.co.il//mahout/0.5/mahout-distribution-0.5.zip
unzip mahout-distribution-0.5.zip
cd mahout-distribution-0.5.zip 
setenv JAVA_HOME /path.to/java1.6.0/

Running the example
From the Mahout root folder:
./examples/bin/build_reuters.sh

Explanation 
The script build_reuters.sh downloads reuters data, which is composed of news items.
<46|0>bickson@biggerbro:~/usr7/mahout-distribution-0.5/examples/bin/mahout-work/reuters-out$ ls
reut2-000.sgm-0.txt    reut2-003.sgm-175.txt  reut2-006.sgm-24.txt   reut2-009.sgm-324.txt  reut2-012.sgm-39.txt   reut2-015.sgm-474.txt  reut2-018.sgm-549.txt
reut2-000.sgm-100.txt  reut2-003.sgm-176.txt  reut2-006.sgm-250.txt  reut2-009.sgm-325.txt  reut2-012.sgm-3.txt    reut2-015.sgm-475.txt  reut2-018.sgm-54.txt
....
A typical news item looks like:
26-FEB-1987 15:01:01.79

BAHIA COCOA REVIEW

Showers continued throughout the week in the Bahia cocoa zone, alleviating the drought since early January and improving prospects for the coming temporao, although normal humidity levels have not been restored, Comissaria Smith said in its weekly review.     The dry period means the temporao will be late this year.     Arrivals for the week ended February 22 were 155,221 bags of 60 kilos making a cumulative total for the season of 5.93 mln against 5.81 at the same stage last year. Again it seems th
....

The goal of the method, is to cluster similar news items together. This is done by first counting word occurrences using TF-IDF scheme. Each news item is a sparse row in a matrix. Next, rows are clustered together using the k-means algorithm. 

What happens behind the scenes?
1) mahout seqdirectory is called, to create sequence files containing file name as key, and file content as value.
INPUT DIR: mahout-work/reuters-out/
OUTPUT DIR: mahout-work/reuters-out-seqdir/

2) mahout seq2parse is called, to create sparse vectors out of the sequence files.
INPUT DIR: mahout-work/reuters-out-seqdir/
OUTPUT DIR: mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/

Inside the output dir, a file called part-t-00000 is created. This is a sequence file which includes int (row id) as key, and a sparse vector (SequentialAccessSparseVector) as value.

3) mahout kmeams is called, for clustering the sparse vectors into cluster.
INPUT DIR: mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/
INTERMEDIATE OUTPUT DIR: mahout-work/reuters-kmeans-clusters/
OUTPUT DIR:mahout-work/reuters-kmeans/

4) Finally clusterdump converts clusters into human readable format
INPUT DIR: mahout-work/reuters-kmeans/
OUPUT : a text file.

Debugging:
Below you can find some common problems and their solutions.

Problem:
~/usr7/mahout-distribution-0.5$ ./bin/mahout kmeans -i ~/usr7/small_netflix_mahout/ -o ~/usr7/small_netflix_mahout_output/ --numClusters 10 -c ~/usr7/small_netflix_mahout/ -x 10
no HADOOP_HOME set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/mahout-examples-0.5-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Sep 4, 2011 2:10:39 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--clusters=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/, --maxIter=10, --method=mapreduce, --numClusters=10, --output=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_output/, --startPhase=0, --tempDir=temp}
Sep 4, 2011 2:10:39 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Deleting /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout
Sep 4, 2011 2:10:39 AM org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Sep 4, 2011 2:10:39 AM org.apache.hadoop.io.compress.CodecPool getCompressor
INFO: Got brand-new compressor
Sep 4, 2011 2:10:39 AM org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker <init>
WARNING: Problem opening checksum file: file:/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/part-randomSeed.  Ignoring exception: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:180)
    at java.io.DataInputStream.readFully(DataInputStream.java:152)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:134)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:283)
    at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1437)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.<init>(SequenceFileIterator.java:58)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:61)
    at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:87)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:58)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)

Answer: cluster path and input path point for the same folder. When starting run all files in cluster path are deleted, so input file is deleted as well. Change paths to point to different folders! Problem:
./bin/mahout kmeans -i ~/usr7/small_netflix_mahout/ -o ~/usr7/small_netflix_mahout_output/ --numClusters 10 -c ~/usr7/small_netflix_mahout_clusters/ -x 10
no HADOOP_HOME set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/mahout-examples-0.5-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Sep 4, 2011 2:15:11 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--clusters=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters/, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/, --maxIter=10, --method=mapreduce, --numClusters=10, --output=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_output/, --startPhase=0, --tempDir=temp}
Sep 4, 2011 2:15:12 AM org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Sep 4, 2011 2:15:12 AM org.apache.hadoop.io.compress.CodecPool getCompressor
INFO: Got brand-new compressor
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
    at java.util.ArrayList.get(ArrayList.java:322)
    at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:108)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:58)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)

Answer: Input file named part-r-00000 is missing in the input folder. Sucessful run:
124|0>bickson@biggerbro:~/usr7/mahout-distribution-0.5$ ./bin/mahout kmeans -i ~/usr7/small_netflix_mahout/ -o ~/usr7/small_netflix_mahout_output/ --numClusters 10 -c ~/usr7/small_netflix_mahout_clusters/ -x 10
no HADOOP_HOME set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/mahout-examples-0.5-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Sep 4, 2011 2:19:48 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--clusters=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters/, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/, --maxIter=10, --method=mapreduce, --numClusters=10, --output=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_output/, --startPhase=0, --tempDir=temp}
Sep 4, 2011 2:19:48 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Deleting /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters
Sep 4, 2011 2:19:48 AM org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Sep 4, 2011 2:19:48 AM org.apache.hadoop.io.compress.CodecPool getCompressor
INFO: Got brand-new compressor
Sep 4, 2011 2:19:48 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 2:19:48 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 2:19:48 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 2:19:48 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 2:19:52 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Wrote 10 vectors to /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters/part-randomSeed
Sep 4, 2011 2:19:52 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Input: /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout Clusters In: /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters/part-randomSeed Out: /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_output Distance: org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
Sep 4, 2011 2:19:52 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: convergence: 0.5 max Iterations: 10 num Reduce Tasks: org.apache.mahout.math.VectorWritable Input Vectors: {}
Sep 4, 2011 2:19:52 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: K-Means Iteration 1
Sep 4, 2011 2:19:52 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
INFO: Initializing JVM Metrics with processName=JobTracker, sessionId=
Sep 4, 2011 2:19:52 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
Sep 4, 2011 2:19:53 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0001
Sep 4, 2011 2:19:53 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
Sep 4, 2011 2:19:53 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: io.sort.mb = 100
Sep 4, 2011 2:19:53 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: data buffer = 79691776/99614720
Sep 4, 2011 2:19:53 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: record buffer = 262144/327680
Sep 4, 2011 2:19:53 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 2:19:54 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 0% reduce 0%
Sep 4, 2011 2:19:59 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
Sep 4, 2011 2:20:00 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 80% reduce 0%
Sep 4, 2011 2:20:00 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
INFO: Starting flush of map output
Sep 4, 2011 2:20:02 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
Sep 4, 2011 2:20:03 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 100% reduce 0%
Sep 4, 2011 2:20:05 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
Problem:
no HADOOP_HOME set, running locally
Exception in thread "main" java.lang.ClassFormatError: org.apache.mahout.driver.MahoutDriver (unrecognized class file version)
   at java.lang.VMClassLoader.defineClass(libgcj.so.8rh)
   at java.lang.ClassLoader.defineClass(libgcj.so.8rh)
   at java.security.SecureClassLoader.defineClass(libgcj.so.8rh)
   at java.net.URLClassLoader.findClass(libgcj.so.8rh)
   at java.lang.ClassLoader.loadClass(libgcj.so.8rh)
   at java.lang.ClassLoader.loadClass(libgcj.so.8rh)
   at gnu.java.lang.MainThread.run(libgcj.so.8rh)
ANSWER: wrong java version used - you should is 1.6.0 or higher. Problem:
../../bin/mahout: line 201: /usr/share/java-1.6.0//bin/java: No such file or directory
../../bin/mahout: line 201: exec: /usr/share/java-1.6.0//bin/java: cannot execute: No such file or directory
Answer: JAVA_HOME is pointing to the wrong place. Inside this directory a subdirectory called bin should be present, with an executable named "java" in it. Problem:
export JAVA_HOME=/afs/cs.cmu.edu/local/java/amd64_f7/jdk1.6.0_16/
cd ~/usr7/mahout-distribution-0.5/ ; ./bin/mahout clusterdump --seqFileDir ~/usr7/small_netflix_mahout_clusters/ --pointsDir ~/usr7/small_netflix_mahout/ --output small_netflix_output.txt
no HADOOP_HOME set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/mahout-examples-0.5-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Sep 4, 2011 4:22:29 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=small_netflix_output.txt, --pointsDir=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/, --seqFileDir=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters/, --startPhase=0, --tempDir=temp}
Sep 4, 2011 4:22:29 AM org.apache.hadoop.util.NativeCodeLoader 
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Sep 4, 2011 4:22:29 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 4:22:29 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 4:22:29 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 4:22:29 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Exception in thread "main" java.lang.ClassCastException: org.apache.mahout.math.VectorWritable cannot be cast to org.apache.mahout.clustering.WeightedVectorWritable
	at org.apache.mahout.utils.clustering.ClusterDumper.printClusters(ClusterDumper.java:171)
	at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:121)
	at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:86)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
make: *** [clusterdump] Error 1
Problem:
export JAVA_HOME=/afs/cs.cmu.edu/local/java/amd64_f7/jdk1.6.0_16/
cd ~/usr7/mahout-distribution-0.5/ ; ./bin/mahout kmeans -i ~/usr7/small_netflix_transpose_mahout/ -o ~/usr7/small_netflix_mahout_transpose_output/ --numClusters 10 -c ~/usr7/small_netflix_mahout_transpose_clusters/ -x 2 -ow -cl
no HADOOP_HOME set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/mahout-examples-0.5-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Sep 4, 2011 4:57:44 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--clustering=null, --clusters=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_transpose_clusters/, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_transpose_mahout/, --maxIter=2, --method=mapreduce, --numClusters=10, --output=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_transpose_output/, --overwrite=null, --startPhase=0, --tempDir=temp}
Sep 4, 2011 4:57:45 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Deleting /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_transpose_output
Sep 4, 2011 4:57:45 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Deleting /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_transpose_clusters
Exception in thread "main" java.io.FileNotFoundException: File /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_transpose_mahout does not exist.
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
	at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:69)
	at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:58)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
make: *** [kmeans_transpose] Error 1
Answer: input directory does not exist.

Problem: program clusterdump runs, with empty txt file as output.
Solution: You probably gave the intermediate cluster path of k-means instead of the output path dir. In this case, program runs and terminates without an error.

Thursday, June 9, 2011

What are the most widely deployed machine learning algorithms?

One of the interesting questions is: "what are the most useful machine learning algorithms?".
I did a little survey by looking at the Mahout user mailing list and counting occurrences of keywords. The results I got are shown in the plot above.


It seems that matrix factorization (SVD) is the most widely used algorithm, and then K-means. We have just implemented SVD as a part of the GraphLab Collaborative Filtering library. Anyone who wants to beta test it is welcome!