Large Scale Machine Learning and Other Animals: K-means

This post explains how Mahout's K-Means clustering implementation works.
Preliminaries: first read the explanation in the link above.

Installation and setup

wget http://apache.spd.co.il//mahout/0.5/mahout-distribution-0.5.zip
unzip mahout-distribution-0.5.zip
cd mahout-distribution-0.5
setenv JAVA_HOME /path.to/java1.6.0/
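
Before running anything, it can help to confirm that JAVA_HOME points to a usable Java installation. Here is a minimal Python sketch, assuming only that a valid JAVA_HOME contains an executable bin/java. The helper name java_home_problem is made up for illustration, and it does not check the Java version.

```python
import os


def java_home_problem(java_home):
    """Return a description of what is wrong with java_home, or None if it looks usable.

    Hypothetical helper: it only checks that <java_home>/bin/java exists
    and is executable, not which Java version it is.
    """
    if not java_home:
        return "JAVA_HOME is not set"
    java = os.path.join(java_home, "bin", "java")
    if not os.path.isfile(java):
        return "no such file: " + java
    if not os.access(java, os.X_OK):
        return "not executable: " + java
    return None


if __name__ == "__main__":
    problem = java_home_problem(os.environ.get("JAVA_HOME"))
    print(problem or "JAVA_HOME looks fine")
```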

Running the example
From the Mahout root folder:

./examples/bin/build_reuters.sh

Explanation
The script build_reuters.sh downloads the Reuters data set, which is composed of news items. The items are stored as text files:

<46|0>bickson@biggerbro:~/usr7/mahout-distribution-0.5/examples/bin/mahout-work/reuters-out$ ls
reut2-000.sgm-0.txt    reut2-003.sgm-175.txt  reut2-006.sgm-24.txt   reut2-009.sgm-324.txt  reut2-012.sgm-39.txt   reut2-015.sgm-474.txt  reut2-018.sgm-549.txt
reut2-000.sgm-100.txt  reut2-003.sgm-176.txt  reut2-006.sgm-250.txt  reut2-009.sgm-325.txt  reut2-012.sgm-3.txt    reut2-015.sgm-475.txt  reut2-018.sgm-54.txt
....

A typical news item looks like:
26-FEB-1987 15:01:01.79

BAHIA COCOA REVIEW

Showers continued throughout the week in the Bahia cocoa zone, alleviating the drought since early January and improving prospects for the coming temporao, although normal humidity levels have not been restored, Comissaria Smith said in its weekly review. The dry period means the temporao will be late this year. Arrivals for the week ended February 22 were 155,221 bags of 60 kilos making a cumulative total for the season of 5.93 mln against 5.81 at the same stage last year. Again it seems th....

The goal of the method is to cluster similar news items together. This is done by first counting word occurrences and weighting them with the TF-IDF scheme. Each news item becomes a sparse row in a matrix. Next, the rows are clustered together using the k-means algorithm.
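
To make the two steps concrete, here is a toy sketch in plain Python. It is not Mahout's code. It assumes whitespace tokenization, the textbook weight tf * log(N/df), squared Euclidean distance, and the first k rows as initial centroids so that the run is deterministic. The four tiny documents are made up for illustration.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Turn each document (a string) into a sparse row {word: tf-idf weight}."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    # Number of documents that contain each word.
    doc_freq = Counter(word for c in counts for word in c)
    n = len(docs)
    return [{w: tf * math.log(n / doc_freq[w]) for w, tf in c.items()}
            for c in counts]


def sq_dist(a, b):
    """Squared Euclidean distance between two sparse rows."""
    return sum((a.get(w, 0.0) - b.get(w, 0.0)) ** 2 for w in set(a) | set(b))


def mean(rows):
    """Component-wise mean of a non-empty list of sparse rows."""
    total = Counter()
    for row in rows:
        for w, x in row.items():
            total[w] += x
    return {w: x / len(rows) for w, x in total.items()}


def kmeans(rows, k, max_iter=10):
    """Plain k-means; returns the cluster index assigned to each row."""
    centroids = [dict(r) for r in rows[:k]]  # simplification: first k rows as seeds
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each row goes to its nearest centroid.
        new_assignment = [min(range(k), key=lambda j: sq_dist(r, centroids[j]))
                          for r in rows]
        if new_assignment == assignment:
            break  # converged: no row changed cluster
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [r for r, a in zip(rows, assignment) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean(members)
    return assignment


docs = [
    "cocoa harvest bahia cocoa",
    "oil prices opec oil",
    "cocoa exports bahia",
    "opec oil output",
]
print(kmeans(tfidf_rows(docs), k=2))  # [0, 1, 0, 1]
```

The two cocoa items end up in one cluster and the two oil items in the other. On real data the rows have many thousands of columns, which is why sparse vectors are used.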

What happens behind the scenes?
1) mahout seqdirectory is called to create sequence files with the file name as key and the file content as value.
INPUT DIR: mahout-work/reuters-out/
OUTPUT DIR: mahout-work/reuters-out-seqdir/

2) mahout seq2sparse is called to create sparse vectors out of the sequence files.
INPUT DIR: mahout-work/reuters-out-seqdir/
OUTPUT DIR: mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/

Inside the output dir, a file called part-t-00000 is created. This is a sequence file with an int (row id) as key and a sparse vector (SequentialAccessSparseVector) as value.

3) mahout kmeans is called to cluster the sparse vectors.
INPUT DIR: mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/
INTERMEDIATE OUTPUT DIR: mahout-work/reuters-kmeans-clusters/
OUTPUT DIR: mahout-work/reuters-kmeans/

4) Finally, clusterdump converts the clusters into a human-readable format.
INPUT DIR: mahout-work/reuters-kmeans/
OUTPUT: a text file.
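
The four steps form a chain where each stage reads what the previous one wrote. A small Python sketch of that chain follows, with the directories copied from the list above. The function broken_links is a made-up helper that only compares strings and does not run Mahout.

```python
# The four stages as (command, input dir, output), copied from the list above.
STAGES = [
    ("seqdirectory", "mahout-work/reuters-out/", "mahout-work/reuters-out-seqdir/"),
    ("seq2sparse", "mahout-work/reuters-out-seqdir/",
     "mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/"),
    ("kmeans", "mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/",
     "mahout-work/reuters-kmeans/"),
    ("clusterdump", "mahout-work/reuters-kmeans/", "a text file"),
]


def broken_links(stages):
    """Return (producer, consumer) pairs where the consumer does not read the producer's output."""
    return [(a[0], b[0]) for a, b in zip(stages, stages[1:]) if a[2] != b[1]]


print(broken_links(STAGES))  # []

# Feeding clusterdump the intermediate k-means dir instead of the output dir:
wrong = STAGES[:3] + [("clusterdump", "mahout-work/reuters-kmeans-clusters/", "a text file")]
print(broken_links(wrong))  # [('kmeans', 'clusterdump')]
```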

Debugging:
Below you can find some common problems and their solutions.

Problem:

~/usr7/mahout-distribution-0.5$ ./bin/mahout kmeans -i ~/usr7/small_netflix_mahout/ -o ~/usr7/small_netflix_mahout_output/ --numClusters 10 -c ~/usr7/small_netflix_mahout/ -x 10
no HADOOP_HOME set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/mahout-examples-0.5-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Sep 4, 2011 2:10:39 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--clusters=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/, --maxIter=10, --method=mapreduce, --numClusters=10, --output=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_output/, --startPhase=0, --tempDir=temp}
Sep 4, 2011 2:10:39 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Deleting /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout
Sep 4, 2011 2:10:39 AM org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Sep 4, 2011 2:10:39 AM org.apache.hadoop.io.compress.CodecPool getCompressor
INFO: Got brand-new compressor
Sep 4, 2011 2:10:39 AM org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker <init>
WARNING: Problem opening checksum file: file:/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/part-randomSeed.  Ignoring exception: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:180)
    at java.io.DataInputStream.readFully(DataInputStream.java:152)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:134)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:283)
    at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1437)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.<init>(SequenceFileIterator.java:58)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:61)
    at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:87)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:58)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)

Answer: the cluster path and the input path point to the same folder. At the start of a run, all files in the cluster path are deleted (note the "Deleting" line in the log), so the input file is deleted as well. Change the paths to point to different folders!

Problem:

./bin/mahout kmeans -i ~/usr7/small_netflix_mahout/ -o ~/usr7/small_netflix_mahout_output/ --numClusters 10 -c ~/usr7/small_netflix_mahout_clusters/ -x 10
no HADOOP_HOME set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/mahout-examples-0.5-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Sep 4, 2011 2:15:11 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--clusters=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters/, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/, --maxIter=10, --method=mapreduce, --numClusters=10, --output=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_output/, --startPhase=0, --tempDir=temp}
Sep 4, 2011 2:15:12 AM org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Sep 4, 2011 2:15:12 AM org.apache.hadoop.io.compress.CodecPool getCompressor
INFO: Got brand-new compressor
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
    at java.util.ArrayList.get(ArrayList.java:322)
    at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:108)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:58)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)

Answer: the input file named part-r-00000 is missing from the input folder.

Successful run:

<124|0>bickson@biggerbro:~/usr7/mahout-distribution-0.5$ ./bin/mahout kmeans -i ~/usr7/small_netflix_mahout/ -o ~/usr7/small_netflix_mahout_output/ --numClusters 10 -c ~/usr7/small_netflix_mahout_clusters/ -x 10
no HADOOP_HOME set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/mahout-examples-0.5-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Sep 4, 2011 2:19:48 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--clusters=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters/, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/, --maxIter=10, --method=mapreduce, --numClusters=10, --output=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_output/, --startPhase=0, --tempDir=temp}
Sep 4, 2011 2:19:48 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Deleting /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters
Sep 4, 2011 2:19:48 AM org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Sep 4, 2011 2:19:48 AM org.apache.hadoop.io.compress.CodecPool getCompressor
INFO: Got brand-new compressor
Sep 4, 2011 2:19:48 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 2:19:48 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 2:19:48 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 2:19:48 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 2:19:52 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Wrote 10 vectors to /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters/part-randomSeed
Sep 4, 2011 2:19:52 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Input: /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout Clusters In: /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters/part-randomSeed Out: /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_output Distance: org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
Sep 4, 2011 2:19:52 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: convergence: 0.5 max Iterations: 10 num Reduce Tasks: org.apache.mahout.math.VectorWritable Input Vectors: {}
Sep 4, 2011 2:19:52 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: K-Means Iteration 1
Sep 4, 2011 2:19:52 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
INFO: Initializing JVM Metrics with processName=JobTracker, sessionId=
Sep 4, 2011 2:19:52 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
Sep 4, 2011 2:19:53 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0001
Sep 4, 2011 2:19:53 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
Sep 4, 2011 2:19:53 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: io.sort.mb = 100
Sep 4, 2011 2:19:53 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: data buffer = 79691776/99614720
Sep 4, 2011 2:19:53 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: record buffer = 262144/327680
Sep 4, 2011 2:19:53 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 2:19:54 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 0% reduce 0%
Sep 4, 2011 2:19:59 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
Sep 4, 2011 2:20:00 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 80% reduce 0%
Sep 4, 2011 2:20:00 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
INFO: Starting flush of map output
Sep 4, 2011 2:20:02 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
Sep 4, 2011 2:20:03 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 100% reduce 0%
Sep 4, 2011 2:20:05 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:

Problem:

no HADOOP_HOME set, running locally
Exception in thread "main" java.lang.ClassFormatError: org.apache.mahout.driver.MahoutDriver (unrecognized class file version)
   at java.lang.VMClassLoader.defineClass(libgcj.so.8rh)
   at java.lang.ClassLoader.defineClass(libgcj.so.8rh)
   at java.security.SecureClassLoader.defineClass(libgcj.so.8rh)
   at java.net.URLClassLoader.findClass(libgcj.so.8rh)
   at java.lang.ClassLoader.loadClass(libgcj.so.8rh)
   at java.lang.ClassLoader.loadClass(libgcj.so.8rh)
   at gnu.java.lang.MainThread.run(libgcj.so.8rh)

Answer: the wrong Java version is used (the libgcj frames in the stack trace come from GNU gcj). You should use Java 1.6.0 or higher.

Problem:

../../bin/mahout: line 201: /usr/share/java-1.6.0//bin/java: No such file or directory
../../bin/mahout: line 201: exec: /usr/share/java-1.6.0//bin/java: cannot execute: No such file or directory

Answer: JAVA_HOME is pointing to the wrong place. Inside this directory there should be a subdirectory called bin, with an executable named "java" in it.

Problem:

export JAVA_HOME=/afs/cs.cmu.edu/local/java/amd64_f7/jdk1.6.0_16/
cd ~/usr7/mahout-distribution-0.5/ ; ./bin/mahout clusterdump --seqFileDir ~/usr7/small_netflix_mahout_clusters/ --pointsDir ~/usr7/small_netflix_mahout/ --output small_netflix_output.txt
no HADOOP_HOME set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/mahout-examples-0.5-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Sep 4, 2011 4:22:29 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=small_netflix_output.txt, --pointsDir=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/, --seqFileDir=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters/, --startPhase=0, --tempDir=temp}
Sep 4, 2011 4:22:29 AM org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Sep 4, 2011 4:22:29 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 4:22:29 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 4:22:29 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 4:22:29 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Exception in thread "main" java.lang.ClassCastException: org.apache.mahout.math.VectorWritable cannot be cast to org.apache.mahout.clustering.WeightedVectorWritable
	at org.apache.mahout.utils.clustering.ClusterDumper.printClusters(ClusterDumper.java:171)
	at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:121)
	at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:86)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
make: *** [clusterdump] Error 1

Answer: --pointsDir points to the k-means input folder, which contains plain VectorWritable vectors, while clusterdump expects clustered points (WeightedVectorWritable). These are probably the points written when kmeans is run with the -cl option, as in the next example.

Problem:

export JAVA_HOME=/afs/cs.cmu.edu/local/java/amd64_f7/jdk1.6.0_16/
cd ~/usr7/mahout-distribution-0.5/ ; ./bin/mahout kmeans -i ~/usr7/small_netflix_transpose_mahout/ -o ~/usr7/small_netflix_mahout_transpose_output/ --numClusters 10 -c ~/usr7/small_netflix_mahout_transpose_clusters/ -x 2 -ow -cl
no HADOOP_HOME set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/mahout-examples-0.5-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Sep 4, 2011 4:57:44 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--clustering=null, --clusters=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_transpose_clusters/, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_transpose_mahout/, --maxIter=2, --method=mapreduce, --numClusters=10, --output=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_transpose_output/, --overwrite=null, --startPhase=0, --tempDir=temp}
Sep 4, 2011 4:57:45 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Deleting /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_transpose_output
Sep 4, 2011 4:57:45 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Deleting /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_transpose_clusters
Exception in thread "main" java.io.FileNotFoundException: File /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_transpose_mahout does not exist.
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
	at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:69)
	at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:58)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
make: *** [kmeans_transpose] Error 1

Answer: the input directory does not exist.

Problem: clusterdump runs, but the output text file is empty.
Answer: you probably gave the intermediate cluster path of k-means instead of the output directory. In this case the program runs and terminates without an error.
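
Several of the problems above come down to paths: input and cluster folders that are the same, a missing input directory, or an input folder with no part file. A pre-flight check along these lines could catch them before Mahout is started. This is a sketch only. The helper name kmeans_path_problems is made up, and treating any file whose name starts with "part-" as valid input is an assumption.

```python
import os


def kmeans_path_problems(input_dir, clusters_dir, output_dir):
    """Return a list of path problems that would break a `mahout kmeans` run.

    Hypothetical pre-flight helper: it mirrors the failures described above
    and does not call Mahout itself.
    """
    problems = []
    # Resolve symlinks and trailing slashes before comparing folders.
    if os.path.realpath(input_dir) == os.path.realpath(clusters_dir):
        problems.append("input and clusters paths are the same folder; "
                        "the input would be deleted at startup")
    if os.path.realpath(output_dir) == os.path.realpath(input_dir):
        problems.append("input and output paths are the same folder")
    if not os.path.isdir(input_dir):
        problems.append("input directory does not exist: " + input_dir)
    # Assumption: any file named part-* (e.g. part-r-00000) counts as input.
    elif not any(name.startswith("part-") for name in os.listdir(input_dir)):
        problems.append("no part-* file (e.g. part-r-00000) in " + input_dir)
    return problems


if __name__ == "__main__":
    home = os.path.expanduser("~/usr7/")
    for problem in kmeans_path_problems(home + "small_netflix_mahout/",
                                        home + "small_netflix_mahout_clusters/",
                                        home + "small_netflix_mahout_output/"):
        print("WARNING:", problem)
```

An empty list means none of these particular path problems was found. It says nothing about whether the vectors inside the part file are valid.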

Saturday, September 3, 2011

Understanding Mahout K-Means clustering implementation