Saturday, September 3, 2011

Understanding Mahout K-Means clustering implementation

This post explains Mahout's K-Means clustering implementation.
Preliminaries: you should first read the explanation of k-means in the link above.

Installation and setup
wget http://apache.spd.co.il//mahout/0.5/mahout-distribution-0.5.zip
unzip mahout-distribution-0.5.zip
cd mahout-distribution-0.5
export JAVA_HOME=/path/to/java1.6.0/

Running the example
From the Mahout root folder:
./examples/bin/build-reuters.sh

Explanation 
The script build-reuters.sh downloads the Reuters-21578 dataset, which is composed of news items.
<46|0>bickson@biggerbro:~/usr7/mahout-distribution-0.5/examples/bin/mahout-work/reuters-out$ ls
reut2-000.sgm-0.txt    reut2-003.sgm-175.txt  reut2-006.sgm-24.txt   reut2-009.sgm-324.txt  reut2-012.sgm-39.txt   reut2-015.sgm-474.txt  reut2-018.sgm-549.txt
reut2-000.sgm-100.txt  reut2-003.sgm-176.txt  reut2-006.sgm-250.txt  reut2-009.sgm-325.txt  reut2-012.sgm-3.txt    reut2-015.sgm-475.txt  reut2-018.sgm-54.txt
....
A typical news item looks like:
26-FEB-1987 15:01:01.79

BAHIA COCOA REVIEW

Showers continued throughout the week in the Bahia cocoa zone, alleviating the drought since early January and improving prospects for the coming temporao, although normal humidity levels have not been restored, Comissaria Smith said in its weekly review.     The dry period means the temporao will be late this year.     Arrivals for the week ended February 22 were 155,221 bags of 60 kilos making a cumulative total for the season of 5.93 mln against 5.81 at the same stage last year. Again it seems th
....

The goal of the method is to cluster similar news items together. This is done by first weighting word occurrences with the TF-IDF scheme, so that each news item becomes a sparse row in a term-document matrix. The rows are then clustered using the k-means algorithm.
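As a rough worked example of the standard tf-idf weighting (Mahout's exact Lucene-derived formula differs in small details): if the word "cocoa" appears 10 times in one story, and in 100 out of 20,000 stories overall, its weight in that story's row is about 10 * ln(20000/100) ≈ 53, while a word that appears in every story gets a weight near zero.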

What happens behind the scenes?
1) mahout seqdirectory is called to create sequence files, with the file name as key and the file content as value; an approximate invocation is shown below.
INPUT DIR: mahout-work/reuters-out/
OUTPUT DIR: mahout-work/reuters-out-seqdir/
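This step corresponds roughly to the following invocation (flags as in the 0.5-era script; other versions may differ):

./bin/mahout seqdirectory \
  -i mahout-work/reuters-out \
  -o mahout-work/reuters-out-seqdir \
  -c UTF-8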

2) mahout seq2sparse is called to create sparse TF-IDF vectors out of the sequence files; an approximate invocation is shown below.
INPUT DIR: mahout-work/reuters-out-seqdir/
OUTPUT DIR: mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/
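Roughly (seq2sparse has many tuning flags for n-grams, pruning and weighting, which the script leaves at their defaults):

./bin/mahout seq2sparse \
  -i mahout-work/reuters-out-seqdir/ \
  -o mahout-work/reuters-out-seqdir-sparse-kmeans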

Inside the output dir, a file called part-r-00000 is created. This is a sequence file which holds the document id as key and a sparse vector (a VectorWritable wrapping a SequentialAccessSparseVector) as value.
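To peek inside such a file you can use Mahout's seqdumper utility (assuming the 0.5-era --seqFile/-s flag; newer versions take -i):

./bin/mahout seqdumper \
  -s mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/part-r-00000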

3) mahout kmeans is called to cluster the sparse vectors; an approximate invocation is shown below.
INPUT DIR: mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/
INTERMEDIATE OUTPUT DIR: mahout-work/reuters-kmeans-clusters/
OUTPUT DIR: mahout-work/reuters-kmeans/
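Roughly (per the 0.5-era script: k=20 clusters, at most 10 iterations, cosine distance; -ow overwrites previous output):

./bin/mahout kmeans \
  -i mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ \
  -c mahout-work/reuters-kmeans-clusters \
  -o mahout-work/reuters-kmeans \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -x 10 -k 20 -ow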

4) Finally, mahout clusterdump converts the clusters into a human-readable format; an approximate invocation is shown below.
INPUT DIR: mahout-work/reuters-kmeans/
OUTPUT: a text file.
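Roughly (the clusters-10 directory name depends on the iteration at which k-means stopped; -d points at the dictionary so term ids can be mapped back to words):

./bin/mahout clusterdump \
  -s mahout-work/reuters-kmeans/clusters-10 \
  -d mahout-work/reuters-out-seqdir-sparse-kmeans/dictionary.file-0 \
  -dt sequencefile -b 100 -n 20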

Debugging:
Below you can find some common problems and their solutions.

Problem:
~/usr7/mahout-distribution-0.5$ ./bin/mahout kmeans -i ~/usr7/small_netflix_mahout/ -o ~/usr7/small_netflix_mahout_output/ --numClusters 10 -c ~/usr7/small_netflix_mahout/ -x 10
no HADOOP_HOME set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/mahout-examples-0.5-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Sep 4, 2011 2:10:39 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--clusters=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/, --maxIter=10, --method=mapreduce, --numClusters=10, --output=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_output/, --startPhase=0, --tempDir=temp}
Sep 4, 2011 2:10:39 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Deleting /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout
Sep 4, 2011 2:10:39 AM org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Sep 4, 2011 2:10:39 AM org.apache.hadoop.io.compress.CodecPool getCompressor
INFO: Got brand-new compressor
Sep 4, 2011 2:10:39 AM org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker <init>
WARNING: Problem opening checksum file: file:/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/part-randomSeed.  Ignoring exception: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:180)
    at java.io.DataInputStream.readFully(DataInputStream.java:152)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:134)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:283)
    at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1437)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.<init>(SequenceFileIterator.java:58)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:61)
    at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:87)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:58)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)

Answer: The cluster path and the input path point to the same folder. When the run starts, all files in the cluster path are deleted, so the input files are deleted as well. Change the paths to point to different folders!

Problem:
./bin/mahout kmeans -i ~/usr7/small_netflix_mahout/ -o ~/usr7/small_netflix_mahout_output/ --numClusters 10 -c ~/usr7/small_netflix_mahout_clusters/ -x 10
no HADOOP_HOME set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/mahout-examples-0.5-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Sep 4, 2011 2:15:11 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--clusters=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters/, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/, --maxIter=10, --method=mapreduce, --numClusters=10, --output=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_output/, --startPhase=0, --tempDir=temp}
Sep 4, 2011 2:15:12 AM org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Sep 4, 2011 2:15:12 AM org.apache.hadoop.io.compress.CodecPool getCompressor
INFO: Got brand-new compressor
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
    at java.util.ArrayList.get(ArrayList.java:322)
    at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:108)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:58)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)

Answer: The input file named part-r-00000 is missing from the input folder.

A successful run:
124|0>bickson@biggerbro:~/usr7/mahout-distribution-0.5$ ./bin/mahout kmeans -i ~/usr7/small_netflix_mahout/ -o ~/usr7/small_netflix_mahout_output/ --numClusters 10 -c ~/usr7/small_netflix_mahout_clusters/ -x 10
no HADOOP_HOME set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/mahout-examples-0.5-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Sep 4, 2011 2:19:48 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--clusters=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters/, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/, --maxIter=10, --method=mapreduce, --numClusters=10, --output=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_output/, --startPhase=0, --tempDir=temp}
Sep 4, 2011 2:19:48 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Deleting /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters
Sep 4, 2011 2:19:48 AM org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Sep 4, 2011 2:19:48 AM org.apache.hadoop.io.compress.CodecPool getCompressor
INFO: Got brand-new compressor
Sep 4, 2011 2:19:48 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 2:19:48 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 2:19:48 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 2:19:48 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 2:19:52 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Wrote 10 vectors to /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters/part-randomSeed
Sep 4, 2011 2:19:52 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Input: /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout Clusters In: /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters/part-randomSeed Out: /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_output Distance: org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
Sep 4, 2011 2:19:52 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: convergence: 0.5 max Iterations: 10 num Reduce Tasks: org.apache.mahout.math.VectorWritable Input Vectors: {}
Sep 4, 2011 2:19:52 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: K-Means Iteration 1
Sep 4, 2011 2:19:52 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
INFO: Initializing JVM Metrics with processName=JobTracker, sessionId=
Sep 4, 2011 2:19:52 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
Sep 4, 2011 2:19:53 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0001
Sep 4, 2011 2:19:53 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
Sep 4, 2011 2:19:53 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: io.sort.mb = 100
Sep 4, 2011 2:19:53 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: data buffer = 79691776/99614720
Sep 4, 2011 2:19:53 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: record buffer = 262144/327680
Sep 4, 2011 2:19:53 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 2:19:54 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 0% reduce 0%
Sep 4, 2011 2:19:59 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
Sep 4, 2011 2:20:00 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 80% reduce 0%
Sep 4, 2011 2:20:00 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
INFO: Starting flush of map output
Sep 4, 2011 2:20:02 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
Sep 4, 2011 2:20:03 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 100% reduce 0%
Sep 4, 2011 2:20:05 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
Problem:
no HADOOP_HOME set, running locally
Exception in thread "main" java.lang.ClassFormatError: org.apache.mahout.driver.MahoutDriver (unrecognized class file version)
   at java.lang.VMClassLoader.defineClass(libgcj.so.8rh)
   at java.lang.ClassLoader.defineClass(libgcj.so.8rh)
   at java.security.SecureClassLoader.defineClass(libgcj.so.8rh)
   at java.net.URLClassLoader.findClass(libgcj.so.8rh)
   at java.lang.ClassLoader.loadClass(libgcj.so.8rh)
   at java.lang.ClassLoader.loadClass(libgcj.so.8rh)
   at gnu.java.lang.MainThread.run(libgcj.so.8rh)
Answer: the wrong Java version is being used - you should use 1.6.0 or higher.

Problem:
../../bin/mahout: line 201: /usr/share/java-1.6.0//bin/java: No such file or directory
../../bin/mahout: line 201: exec: /usr/share/java-1.6.0//bin/java: cannot execute: No such file or directory
Answer: JAVA_HOME is pointing to the wrong place. Inside this directory a subdirectory called bin should be present, with an executable named "java" in it; see the check below.
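A quick sanity check (a minimal sketch, assuming a bash-like shell):

ls "$JAVA_HOME/bin/java"
"$JAVA_HOME/bin/java" -version

Problem: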
export JAVA_HOME=/afs/cs.cmu.edu/local/java/amd64_f7/jdk1.6.0_16/
cd ~/usr7/mahout-distribution-0.5/ ; ./bin/mahout clusterdump --seqFileDir ~/usr7/small_netflix_mahout_clusters/ --pointsDir ~/usr7/small_netflix_mahout/ --output small_netflix_output.txt
no HADOOP_HOME set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/mahout-examples-0.5-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Sep 4, 2011 4:22:29 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=small_netflix_output.txt, --pointsDir=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/, --seqFileDir=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters/, --startPhase=0, --tempDir=temp}
Sep 4, 2011 4:22:29 AM org.apache.hadoop.util.NativeCodeLoader 
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Sep 4, 2011 4:22:29 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 4:22:29 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 4:22:29 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 4:22:29 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Exception in thread "main" java.lang.ClassCastException: org.apache.mahout.math.VectorWritable cannot be cast to org.apache.mahout.clustering.WeightedVectorWritable
	at org.apache.mahout.utils.clustering.ClusterDumper.printClusters(ClusterDumper.java:171)
	at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:121)
	at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:86)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
make: *** [clusterdump] Error 1

Answer: Most likely, --pointsDir was pointed at the raw input vectors (VectorWritable) instead of the clusteredPoints directory that kmeans writes when run with the -cl (--clustering) flag, which holds WeightedVectorWritable values - hence the ClassCastException. Re-run kmeans with -cl and point --pointsDir at the resulting clusteredPoints directory.

Problem:
export JAVA_HOME=/afs/cs.cmu.edu/local/java/amd64_f7/jdk1.6.0_16/
cd ~/usr7/mahout-distribution-0.5/ ; ./bin/mahout kmeans -i ~/usr7/small_netflix_transpose_mahout/ -o ~/usr7/small_netflix_mahout_transpose_output/ --numClusters 10 -c ~/usr7/small_netflix_mahout_transpose_clusters/ -x 2 -ow -cl
no HADOOP_HOME set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/mahout-examples-0.5-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Sep 4, 2011 4:57:44 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--clustering=null, --clusters=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_transpose_clusters/, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_transpose_mahout/, --maxIter=2, --method=mapreduce, --numClusters=10, --output=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_transpose_output/, --overwrite=null, --startPhase=0, --tempDir=temp}
Sep 4, 2011 4:57:45 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Deleting /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_transpose_output
Sep 4, 2011 4:57:45 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Deleting /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_transpose_clusters
Exception in thread "main" java.io.FileNotFoundException: File /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_transpose_mahout does not exist.
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
	at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:69)
	at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:58)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
make: *** [kmeans_transpose] Error 1
Answer: The input directory does not exist.

Problem: clusterdump runs, but outputs an empty txt file.
Answer: You probably gave the intermediate cluster path of k-means instead of the output dir (for the Reuters example above, point --seqFileDir at a clusters-* directory under mahout-work/reuters-kmeans/, not at mahout-work/reuters-kmeans-clusters/). In this case the program runs and terminates without an error.

17 comments:

  1. What does the output from clusterdump mean?
    telephone => 7.3313775062561035
    What is that number in this case? I tried to track the flow of that word and the tf & tf-idf numbers made sense. But after that it's just confusing. What does the cluster signify?

  2. As far as I know the output should be in the format
    CL-0 { n=116 c=[29.922, ...
    where 0 is the cluster number, n is the number of points assigned to it, and c is the cluster center location. Then comes the radius of the cluster, and then a printout of each point and its assigned cluster. I have no clue what you got - better ask on the Mahout user list.

    Best,

    DB

  3. Yes, I did see that output as well. I should've mentioned that. What good are the c=[] cluster center locations? They are not x and y coordinates, right? I'm not getting anything useful from either the mahout-0.5 or the mahout-0.6-snapshot reuters examples. Out of the box they just run for a few minutes and then output marginally useful information. (0.6-snapshot gives an NPE anyway, which makes it even less useful.)

  4. If you would like to try out the GraphLab clustering library, I can give you close support (see http://graphlab.org/clustering.html). Unfortunately my resources for supporting Mahout are rather limited.

  5. Hi,
    I am getting only one cluster, CL-0, when I give a single CSV document of any size. Is this correct? Can anyone help me?

    Thanks in advance

  6. I think this is correct. You need to give multiple input files; each one of them is translated to a different vector, and the vectors are clustered.

    Best,

    DB

  7. Hi,
    If you need better output,

    // 'file' points at a part file under the kmeans clusteredPoints output directory;
    // the NamedVector cast works only if the vectors were created with seq2sparse -nv (--namedVector).
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(file.getAbsolutePath()), conf);
    IntWritable key = new IntWritable();                          // cluster id
    WeightedVectorWritable value = new WeightedVectorWritable();  // clustered point
    while (reader.next(key, value)) {
        NamedVector vector = (NamedVector) value.getVector();
        String vectorName = vector.getName();
        System.out.println(vectorName + " belongs to cluster " + key.toString());
    }
    reader.close();

    This prints each document and its respective cluster.

  8. Hi,

    Thank you for the replies to both. Can I know where to paste the above code?

    Thanks in advance

  9. Is there a resolution to this problem?

    Exception in thread "main" java.lang.ClassCastException: org.apache.mahout.math.VectorWritable cannot be cast to org.apache.mahout.clustering.WeightedVectorWritable

    Replies
    1. It seems your input is VectorWritable while the program expects WeightedVectorWritable.

  10. Hi, I am very new to machine learning. Currently I am doing "blog clustering" (classifying blogs into 2 clusters, such as sports blogs and political blogs) using k-means for my course project.
    Right now I have collected a "bag of words" and computed TF-IDF on it after removing the stop words. Now I do not know how to proceed further. I would really appreciate it if you could give some pointers for my project.

    Replies
    1. Hi!
      I suggest contacting the Mahout user mailing list with your problem.

      Best,

      DB

  11. Hi, it seems I face a new problem:

    hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
    Error: Could not find or load main class org.apache.mahout.driver.MahoutDriver
    MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
    hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
    Error: Could not find or load main class org.apache.mahout.driver.MahoutDriver
    hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
    Error: Could not find or load main class org.apache.mahout.driver.MahoutDriver

    I think I set the paths in bash correctly:

    export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
    export HADOOP_HOME=/usr/local/hadoop-0.20.2
    export HADOOP_CONF_DIR=/usr/local/hadoop-0.20.2/conf
    export MAHOUT_HOME=/usr/local/mahout-0.4
    export MAHOUT_VERSION=0.4-SNAPSHOT
    export MAVEN_OPTS=-Xmx1024m

    Any idea?

    Replies
    1. Take a look here: http://stackoverflow.com/questions/11664341/what-does-this-error-when-im-trying-to-run-an-example-in-apache-mahout

      You can use the "env" command to verify HADOOP_HOME is pointing to the right place.

  12. When I run clusterdump I get the following error.
    The command is as follows:
    $MAHOUT_HOME/bin/mahout clusterdump --input $MAHOUT_HOME/examples/output/clusters-10 --pointsDir $MAHOUT_HOME/examples/output/clusteredPoints/ --output $MAHOUT_HOME/examples/output/clusteranalyze.txt
    Error:
    MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
    Warning: $HADOOP_HOME is deprecated.

    Running on hadoop, using /home/shrida/hadoop/bin/hadoop and HADOOP_CONF_DIR=/home/shrida/hadoop/conf
    MAHOUT-JOB: /home/shrida/mahout/examples/target/mahout-examples-0.7-job.jar
    Warning: $HADOOP_HOME is deprecated.

    13/05/05 20:37:42 INFO common.AbstractJob: Command line arguments: {--dictionaryType=[text], --distanceMeasure=[org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure], --endPhase=[2147483647], --input=[/home/shrida/mahout/examples/output/clusters-10], --output=[/home/shrida/mahout/examples/output/clusteranalyze.txt], --outputFormat=[TEXT], --pointsDir=[/home/shrida/mahout/examples/output/clusteredPoints/], --startPhase=[0], --tempDir=[temp]}
    13/05/05 20:37:42 INFO clustering.ClusterDumper: Wrote 0 clusters
    13/05/05 20:37:42 INFO driver.MahoutDriver: Program took 583 ms (Minutes: 0.009716666666666667)

  13. really useful! thank you :) it helped me a lot.
