Preliminaries: you should first read the explanation in the link above.
Installation and setup
wget http://apache.spd.co.il//mahout/0.5/mahout-distribution-0.5.zip
unzip mahout-distribution-0.5.zip
cd mahout-distribution-0.5
setenv JAVA_HOME /path.to/java1.6.0/
Running the example
From the Mahout root folder:
./examples/bin/build_reuters.sh
Explanation
The script build_reuters.sh downloads the Reuters dataset, which is composed of news items.
<46|0>bickson@biggerbro:~/usr7/mahout-distribution-0.5/examples/bin/mahout-work/reuters-out$ ls
reut2-000.sgm-0.txt    reut2-003.sgm-175.txt  reut2-006.sgm-24.txt   reut2-009.sgm-324.txt  reut2-012.sgm-39.txt  reut2-015.sgm-474.txt  reut2-018.sgm-549.txt
reut2-000.sgm-100.txt  reut2-003.sgm-176.txt  reut2-006.sgm-250.txt  reut2-009.sgm-325.txt  reut2-012.sgm-3.txt   reut2-015.sgm-475.txt  reut2-018.sgm-54.txt
....
A typical news item looks like:
26-FEB-1987 15:01:01.79
BAHIA COCOA REVIEW
Showers continued throughout the week in the Bahia cocoa zone, alleviating the drought since early January and improving prospects for the coming temporao, although normal humidity levels have not been restored, Comissaria Smith said in its weekly review. The dry period means the temporao will be late this year. Arrivals for the week ended February 22 were 155,221 bags of 60 kilos making a cumulative total for the season of 5.93 mln against 5.81 at the same stage last year. Again it seems th....
The goal of the method is to cluster similar news items together. This is done by first counting word occurrences using the TF-IDF scheme, so that each news item becomes a sparse row in a matrix. Next, the rows are clustered using the k-means algorithm.
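To get a feel for the weighting, here is a toy sketch (my own illustration with made-up counts; Mahout's actual seq2sparse weighting also normalizes, so the exact numbers will differ):

// Toy TF-IDF: the score of term t in document d is tf(t,d) * log(N / df(t)),
// so terms that repeat in this item but are rare across the corpus score high.
int N = 1000;  // total number of news items (made up)
int df = 10;   // number of news items containing "cocoa" (made up)
int tf = 3;    // occurrences of "cocoa" in this news item (made up)
double tfidf = tf * Math.log((double) N / df);
System.out.println("cocoa => " + tfidf);  // prints cocoa => 13.815...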
What happens behind the scenes?
1) mahout seqdirectory is called to create sequence files, with the file name as key and the file content as value.
INPUT DIR: mahout-work/reuters-out/
OUTPUT DIR: mahout-work/reuters-out-seqdir/
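To peek at what seqdirectory wrote into this output dir, here is a minimal sketch (my own example, assuming a local run; the chunk-0 file name is an assumption about how the output is chunked; both key and value are Hadoop Text at this step):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.getLocal(conf);
Path path = new Path("mahout-work/reuters-out-seqdir/chunk-0");  // assumed chunk name
SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
Text key = new Text();    // document (file) name
Text value = new Text();  // document content
while (reader.next(key, value)) {
    String text = value.toString();
    System.out.println(key + " => " + text.substring(0, Math.min(60, text.length())));
}
reader.close();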
2) mahout seq2sparse is called to create sparse vectors out of the sequence files.
INPUT DIR: mahout-work/reuters-out-seqdir/
OUTPUT DIR: mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/
Inside the output dir, a file called part-r-00000 is created. This is a sequence file with an int (row id) as key and a sparse vector (SequentialAccessSparseVector) as value.
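If you want to verify the key/value classes of this (or any) part file rather than guess, you can ask the sequence file itself; a small sketch (the path is just the one described above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.getLocal(conf);
Path path = new Path("mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/part-r-00000");
SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
System.out.println("key class:   " + reader.getKeyClassName());
System.out.println("value class: " + reader.getValueClassName());
Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
if (reader.next(key, value)) {
    System.out.println(key + " => " + value);  // print the first record only
}
reader.close();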
3) mahout kmeans is called to cluster the sparse vectors into clusters.
INPUT DIR: mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/
INTERMEDIATE OUTPUT DIR: mahout-work/reuters-kmeans-clusters/
OUTPUT DIR:mahout-work/reuters-kmeans/
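To run this step by hand, the invocation mirrors the kmeans commands in the debugging section below (the flags -i, -o, -c, -x, --numClusters, -ow and -cl all appear there; --numClusters 20 is just an illustrative choice, and -cl additionally writes the per-point cluster assignments used later by clusterdump):

./bin/mahout kmeans -i mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ \
  -c mahout-work/reuters-kmeans-clusters/ \
  -o mahout-work/reuters-kmeans/ \
  --numClusters 20 -x 10 -ow -cl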
4) Finally, clusterdump converts the clusters into a human-readable format.
INPUT DIR: mahout-work/reuters-kmeans/
OUTPUT: a text file.
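The invocation mirrors the clusterdump commands shown further down (a sketch: the clusters-10 subdirectory name is an assumption for whichever final iteration directory k-means produced, which may be lower if the run converged early; clusteredPoints assumes k-means was run with -cl; reuters-clusters.txt is a made-up output name):

./bin/mahout clusterdump --seqFileDir mahout-work/reuters-kmeans/clusters-10/ \
  --pointsDir mahout-work/reuters-kmeans/clusteredPoints/ \
  --output reuters-clusters.txt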
Debugging:
Below you can find some common problems and their solutions.
Problem:
~/usr7/mahout-distribution-0.5$ ./bin/mahout kmeans -i ~/usr7/small_netflix_mahout/ -o ~/usr7/small_netflix_mahout_output/ --numClusters 10 -c ~/usr7/small_netflix_mahout/ -x 10
no HADOOP_HOME set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/mahout-examples-0.5-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Sep 4, 2011 2:10:39 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--clusters=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/, --maxIter=10, --method=mapreduce, --numClusters=10, --output=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_output/, --startPhase=0, --tempDir=temp}
Sep 4, 2011 2:10:39 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Deleting /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout
Sep 4, 2011 2:10:39 AM org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Sep 4, 2011 2:10:39 AM org.apache.hadoop.io.compress.CodecPool getCompressor
INFO: Got brand-new compressor
Sep 4, 2011 2:10:39 AM org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker <init>
WARNING: Problem opening checksum file: file:/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/part-randomSeed. Ignoring exception:
java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:180)
    at java.io.DataInputStream.readFully(DataInputStream.java:152)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:134)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:283)
    at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1437)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.<init>(SequenceFileIterator.java:58)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:61)
    at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:87)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:58)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
Answer: the cluster path and the input path point to the same folder. When the run starts, all files in the cluster path are deleted, so the input file is deleted as well. Change the paths to point to different folders!
Problem:
./bin/mahout kmeans -i ~/usr7/small_netflix_mahout/ -o ~/usr7/small_netflix_mahout_output/ --numClusters 10 -c ~/usr7/small_netflix_mahout_clusters/ -x 10
no HADOOP_HOME set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/mahout-examples-0.5-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Sep 4, 2011 2:15:11 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--clusters=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters/, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/, --maxIter=10, --method=mapreduce, --numClusters=10, --output=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_output/, --startPhase=0, --tempDir=temp}
Sep 4, 2011 2:15:12 AM org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Sep 4, 2011 2:15:12 AM org.apache.hadoop.io.compress.CodecPool getCompressor
INFO: Got brand-new compressor
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
    at java.util.ArrayList.get(ArrayList.java:322)
    at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:108)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:58)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
Answer: an input file named part-r-00000 is missing in the input folder.
Successful run:
124|0>bickson@biggerbro:~/usr7/mahout-distribution-0.5$ ./bin/mahout kmeans -i ~/usr7/small_netflix_mahout/ -o ~/usr7/small_netflix_mahout_output/ --numClusters 10 -c ~/usr7/small_netflix_mahout_clusters/ -x 10
no HADOOP_HOME set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/mahout-examples-0.5-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Sep 4, 2011 2:19:48 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--clusters=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters/, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/, --maxIter=10, --method=mapreduce, --numClusters=10, --output=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_output/, --startPhase=0, --tempDir=temp}
Sep 4, 2011 2:19:48 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Deleting /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters
Sep 4, 2011 2:19:48 AM org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Sep 4, 2011 2:19:48 AM org.apache.hadoop.io.compress.CodecPool getCompressor
INFO: Got brand-new compressor
Sep 4, 2011 2:19:48 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 2:19:48 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 2:19:48 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 2:19:48 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 2:19:52 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Wrote 10 vectors to /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters/part-randomSeed
Sep 4, 2011 2:19:52 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Input: /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout Clusters In: /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters/part-randomSeed Out: /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_output Distance: org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
Sep 4, 2011 2:19:52 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: convergence: 0.5 max Iterations: 10 num Reduce Tasks: org.apache.mahout.math.VectorWritable Input Vectors: {}
Sep 4, 2011 2:19:52 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: K-Means Iteration 1
Sep 4, 2011 2:19:52 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
INFO: Initializing JVM Metrics with processName=JobTracker, sessionId=
Sep 4, 2011 2:19:52 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
Sep 4, 2011 2:19:53 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0001
Sep 4, 2011 2:19:53 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
Sep 4, 2011 2:19:53 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: io.sort.mb = 100
Sep 4, 2011 2:19:53 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: data buffer = 79691776/99614720
Sep 4, 2011 2:19:53 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: record buffer = 262144/327680
Sep 4, 2011 2:19:53 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 2:19:54 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: map 0% reduce 0%
Sep 4, 2011 2:19:59 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
Sep 4, 2011 2:20:00 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: map 80% reduce 0%
Sep 4, 2011 2:20:00 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
INFO: Starting flush of map output
Sep 4, 2011 2:20:02 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
Sep 4, 2011 2:20:03 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: map 100% reduce 0%
Sep 4, 2011 2:20:05 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
Problem:
no HADOOP_HOME set, running locally
Exception in thread "main" java.lang.ClassFormatError: org.apache.mahout.driver.MahoutDriver (unrecognized class file version)
    at java.lang.VMClassLoader.defineClass(libgcj.so.8rh)
    at java.lang.ClassLoader.defineClass(libgcj.so.8rh)
    at java.security.SecureClassLoader.defineClass(libgcj.so.8rh)
    at java.net.URLClassLoader.findClass(libgcj.so.8rh)
    at java.lang.ClassLoader.loadClass(libgcj.so.8rh)
    at java.lang.ClassLoader.loadClass(libgcj.so.8rh)
    at gnu.java.lang.MainThread.run(libgcj.so.8rh)
Answer: the wrong Java version was used - you should use 1.6.0 or higher.
Problem:
../../bin/mahout: line 201: /usr/share/java-1.6.0//bin/java: No such file or directory
../../bin/mahout: line 201: exec: /usr/share/java-1.6.0//bin/java: cannot execute: No such file or directory
Answer: JAVA_HOME is pointing to the wrong place. Inside this directory a subdirectory called bin should be present, with an executable named "java" in it.
Problem:
export JAVA_HOME=/afs/cs.cmu.edu/local/java/amd64_f7/jdk1.6.0_16/
cd ~/usr7/mahout-distribution-0.5/ ; ./bin/mahout clusterdump --seqFileDir ~/usr7/small_netflix_mahout_clusters/ --pointsDir ~/usr7/small_netflix_mahout/ --output small_netflix_output.txt
no HADOOP_HOME set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/mahout-examples-0.5-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Sep 4, 2011 4:22:29 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=small_netflix_output.txt, --pointsDir=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/, --seqFileDir=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters/, --startPhase=0, --tempDir=temp}
Sep 4, 2011 4:22:29 AM org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Sep 4, 2011 4:22:29 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 4:22:29 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 4:22:29 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 4:22:29 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Exception in thread "main" java.lang.ClassCastException: org.apache.mahout.math.VectorWritable cannot be cast to org.apache.mahout.clustering.WeightedVectorWritable
    at org.apache.mahout.utils.clustering.ClusterDumper.printClusters(ClusterDumper.java:171)
    at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:121)
    at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:86)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
make: *** [clusterdump] Error 1
Answer: clusterdump expects the points directory to contain WeightedVectorWritable values, but here --pointsDir points at plain VectorWritable input vectors. Point --pointsDir at the clusteredPoints directory that k-means writes when run with the -cl flag.
Problem:
export JAVA_HOME=/afs/cs.cmu.edu/local/java/amd64_f7/jdk1.6.0_16/
cd ~/usr7/mahout-distribution-0.5/ ; ./bin/mahout kmeans -i ~/usr7/small_netflix_transpose_mahout/ -o ~/usr7/small_netflix_mahout_transpose_output/ --numClusters 10 -c ~/usr7/small_netflix_mahout_transpose_clusters/ -x 2 -ow -cl
no HADOOP_HOME set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/mahout-examples-0.5-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Sep 4, 2011 4:57:44 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--clustering=null, --clusters=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_transpose_clusters/, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_transpose_mahout/, --maxIter=2, --method=mapreduce, --numClusters=10, --output=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_transpose_output/, --overwrite=null, --startPhase=0, --tempDir=temp}
Sep 4, 2011 4:57:45 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Deleting /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_transpose_output
Sep 4, 2011 4:57:45 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Deleting /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_transpose_clusters
Exception in thread "main" java.io.FileNotFoundException: File /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_transpose_mahout does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
    at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:69)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:58)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
make: *** [kmeans_transpose] Error 1
Answer: the input directory does not exist.
Problem: the clusterdump program runs, but produces an empty txt file as output.
Solution: you probably gave the intermediate cluster path of k-means instead of the output path dir. In this case the program runs and terminates without an error.
What does the output from clusterdump mean?
telephone => 7.3313775062561035
What is that number in this case? I tried to track the flow of that word and the tf & tf-idf numbers made sense. But after that it's just confusing. What does the cluster signify?
As far as I know the output should be in the format
CL-0 { n=116 c=[29.922, ...
where 0 is the cluster number, n is the number of points assigned to it, and c is the cluster center location. Then comes the radius of the cluster, and then a printout of each point and its assigned cluster. I have no clue what you got - better ask on the Mahout user list.
Best,
DB
Yes, I did see that output as well. I should've mentioned that. What good are the c=[] cluster center locations? They are not x and y coordinates, right? I'm not getting anything useful from the mahout-0.5 or mahout-0.6-snapshot Reuters examples. Out of the box they just run for a few minutes and then output marginally useful information. (0.6-snapshot gives an NPE anyway, which makes it even less useful.)
If you'd like to try out the GraphLab clustering library I can give you close support (see http://graphlab.org/clustering.html). Unfortunately, my resources for supporting Mahout are rather limited.
ReplyDeletehi
ReplyDeleteI am getting only one cluster CL-0 if i give one csv document of any size is this correct can any help me
Thanks in advance
I think this is correct. You need to give multiple input files; each of them is translated to a different vector, and the vectors are clustered.
Best,
DB
Hi,
If you need better output,
// Assumes 'file' points at a part file under the k-means clusteredPoints directory.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.mahout.clustering.WeightedVectorWritable;
import org.apache.mahout.math.NamedVector;
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(file.getAbsolutePath()), conf);
IntWritable key = new IntWritable();                          // cluster id
WeightedVectorWritable value = new WeightedVectorWritable();  // the clustered point
while (reader.next(key, value)) {
    NamedVector vector = (NamedVector) value.getVector();
    System.out.println(vector.getName() + " belongs to cluster " + key);
}
reader.close();
This prints each doc and its respective cluster.
Hi,
Thank you for the replies to both. Can I know where to paste the above code?
Thanks in advance
Is there a resolution to this problem?
Exception in thread "main" java.lang.ClassCastException: org.apache.mahout.math.VectorWritable cannot be cast to org.apache.mahout.clustering.WeightedVectorWritable
It seems your input is VectorWritable while the program expects WeightedVectorWritable.
Hi, I am very new to machine learning. Currently I am doing "blog clustering" (classifying into 2 clusters, such as sports blogs and political blogs) using k-means for my course project.
Right now I have collected a "bag of words" and computed TF-IDF for it after removing the stop words. Now I do not know how to proceed further. I would really appreciate some pointers on how to proceed with my project.
Hi!
I suggest contacting the Mahout user mailing list with your problem.
Best,
DB
Hi, it seems I face a new problem:
hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
Error: Could not find or load main class org.apache.mahout.driver.MahoutDriver
MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
Error: Could not find or load main class org.apache.mahout.driver.MahoutDriver
hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
Error: Could not find or load main class org.apache.mahout.driver.MahoutDriver
I think I set the paths in bash correctly:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop-0.20.2
export HADOOP_CONF_DIR=/usr/local/hadoop-0.20.2/conf
export MAHOUT_HOME=/usr/local/mahout-0.4
export MAHOUT_VERSION=0.4-SNAPSHOT
export MAVEN_OPTS=-Xmx1024m
Any idea?
Take a look here: http://stackoverflow.com/questions/11664341/what-does-this-error-when-im-trying-to-run-an-example-in-apache-mahout
You can use the "env" command to verify HADOOP_HOME is pointing to the right place.
When I run clusterdump I get the following error.
The command is as follows:
$MAHOUT_HOME/bin/mahout clusterdump \
  --input $MAHOUT_HOME/examples/output/clusters-10 \
  --pointsDir $MAHOUT_HOME/examples/output/clusteredPoints/ \
  --output $MAHOUT_HOME/examples/output/clusteranalyze.txt
Error:
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Warning: $HADOOP_HOME is deprecated.
Running on hadoop, using /home/shrida/hadoop/bin/hadoop and HADOOP_CONF_DIR=/home/shrida/hadoop/conf
MAHOUT-JOB: /home/shrida/mahout/examples/target/mahout-examples-0.7-job.jar
Warning: $HADOOP_HOME is deprecated.
13/05/05 20:37:42 INFO common.AbstractJob: Command line arguments: {--dictionaryType=[text], --distanceMeasure=[org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure], --endPhase=[2147483647], --input=[/home/shrida/mahout/examples/output/clusters-10], --output=[/home/shrida/mahout/examples/output/clusteranalyze.txt], --outputFormat=[TEXT], --pointsDir=[/home/shrida/mahout/examples/output/clusteredPoints/], --startPhase=[0], --tempDir=[temp]}
13/05/05 20:37:42 INFO clustering.ClusterDumper: Wrote 0 clusters
13/05/05 20:37:42 INFO driver.MahoutDriver: Program took 583 ms (Minutes: 0.009716666666666667)
clusteranalyze.txt is created but blank.
Really useful! Thank you :) it helped me a lot.