Tuesday, September 20, 2011

Spotlight: Nickolas Vasiloglou

I thought about writing down some of the exciting applied machine learning projects that are currently taking place. I start with Nickolas (Nick) Vasiloglou, PhD, a former graduate student of Alex Gray at Georgia Tech. Nick is currently an independent large-scale machine learning consultant.

Here are some of the projects Nick is involved in, in his own words:
  •  LexisNexis-ML is a machine learning toolbox combining the HPCC-LexisNexis high performance computing cluster and the PaperBoat/GraphLab library. HPCC is by far a superior alternative to Hadoop. The system uses ECL, a declarative language that allows easier expression of data problems (see http://hpccsystems.com/). Inlining of C++ code makes it even more powerful when blending sequential numerical algorithms with data manipulation is necessary. At HPCC's heart is a C++ code generator, which has the advantage of producing highly optimized binaries that outperform Java Hadoop binaries.

  • PaperBoat is a single-threaded machine learning library built on top of C++ Boost MPL (template metaprogramming). The library is built with several templated abstractions so that it can be integrated easily with other platforms; the integration can be either light or very deep. The library makes extensive use of multidimensional trees to improve scalability and speed. Here is the current list of implemented algorithms. All of them support both sparse and dense data:

    All nearest neighbors (range, k-nearest, furthest, metric, Bregman divergence), kd-trees, ball trees
    K-means (kmeans++, Peleg's algorithm, online, batch, kd-tree, sparse trees)
    QUIC-SVD
    NMF
    Kernel density estimation (kd-trees, ball trees)
    Lasso
    Regression (stepwise, VIF, nonnegative, constrained)
    SVM (SMO, plus a faster method using trees)
    Decision trees
    Orthogonal range search
    Kernel PCA
    Nonparametric regression
    Maximum Variance Unfolding

  • Mouragio is an asynchronous version of PaperBoat in which single-threaded machine learning algorithms can exchange data asynchronously. Mouragio implements a very efficient publish-subscribe model that is ideal for asynchronous bootstrapping (bagging) as well as for the racing algorithm (Moore & Maron 1997). Asynchronous iterations are an old idea from the MIT optimization lab (Bertsekas and Tsitsiklis; see link).
    Mouragio tries to use algorithms from the graph literature to automatically partition data and tasks so that the user doesn't have to deal with it. The Mouragio daemon tries to schedule each task to the node where most of the required data and computational power are available. Mouragio is partially supported by LogicBlox.

  • Datalog-LogicBlox scientific engine. LogicBlox has developed a database platform based on logic; the language used is an enhanced version of Datalog. Datalog is by far the most expressive and declarative language for manipulating data. At this point, Datalog is translated into run-time database engine transactions. The goal of this project is to translate Datalog to other scientific platforms such as GraphLab and Mouragio. Datalog is very good at expressing graphs, so it translates easily to GraphLab. Also, since the algorithms are described as sequence-independent rules, automatic parallelization is easier to do (although not always 100%).

Sunday, September 18, 2011

it++ vs. Eigen

it++ and Eigen are both popular and powerful matrix linear algebra packages for C++.

We got a lot of complaints from our users about the relative difficulty of installing it++, as well as about its restrictive GPL license. We have decided to try switching to the Eigen linear algebra library instead. Eigen requires no installation, since the code is composed entirely of header files. It is licensed under the LGPL3+ license.

Today I have created a pluggable interface that allows swapping it++ and Eigen underneath our GraphLab code. I have run some tests to verify the speed and accuracy of Eigen vs. it++.
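
To give a flavor of what such a pluggable layer can look like, here is a minimal sketch (my illustration only; the HAS_EIGEN macro and wrapper are hypothetical, not the actual GraphLab interface): a compile-time switch that maps a common set of type names and calls onto either library.

// Minimal sketch of a pluggable linear algebra layer (illustration only).
// Compile with -DHAS_EIGEN to use Eigen; otherwise it++ is used.
#ifdef HAS_EIGEN
#include <Eigen/Dense>
typedef Eigen::MatrixXd mat;   // dense double matrix
typedef Eigen::VectorXd vec;   // dense double vector

// Least squares solve via an LDL^T (Cholesky-like) factorization.
inline vec ls_solve(const mat &A, const vec &b) {
  return A.ldlt().solve(b);
}
#else
#include <itpp/itbase.h>
using itpp::mat;               // it++ dense double matrix
using itpp::vec;               // it++ dense double vector
using itpp::ls_solve;          // it++ least squares solver
#endif

Code written only in terms of mat, vec and ls_solve() can then be compiled against either backend without changes.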

And here are the results:
Framework and Algorithm | Running time (sec) | Training RMSE | Validation RMSE
it++ ls_solve_chol      | 16.8               | 0.7000        | 0.9704
it++ ls_solve           | 17.8               | 0.7000        | 0.9704
Eigen ldlt              | 18.3               | 0.6745        | 0.9495
Eigen llt               | 18.7               | 0.6745        | 0.9495
Eigen JacobiSVD         | 63.0               | 0.6745        | 0.9495

Experiment details: I have used GraphLab's alternating least squares, with a subset of the Netflix data. The dataset is described here. I let the algorithm run for 10 iterations, in release mode, on our AMD Opteron 8-core machine.

Experiment conclusions: It seems that Eigen is more accurate than it++. It runs slightly slower than it++, but the accuracy of both training and validation RMSE is better.

For those of you who are familiar with it++ and would like to try out Eigen, I made a short list of compatible function calls for both systems; a small porting example follows the table.

Operation | it++ | Eigen
double matrix | mat | MatrixXd
double vector | vec | VectorXd
Value assignment | a.set(i,j,val) | a(i,j) = val
Get row | a.get_row(i) | a.row(i)
Identity matrix | eye(size) | MatrixXd::Identity(size,size)
Matrix/vector of ones | ones(size) | MatrixXd::Ones(size,size) or VectorXd::Ones(size)
Matrix/vector of zeros | zeros(size) | MatrixXd::Zero(size,size) or VectorXd::Zero(size)
Least squares solution | x = ls_solve(A,b) | x = A.ldlt().solve(b)
Transpose | transpose(a) or a.transpose() | a.transpose()
Set diagonal | a = diag(v) | a.diagonal() = v
Sum of values | sum(a) | a.sum()
L2 norm | norm(a,2) | a.norm() (a.squaredNorm() gives the squared norm)
Inverse | inv(a) | a.inverse()
Outer product | outer_product(a,b) | a*b.transpose()
Eigenvalues of symmetric matrix | eig_sym | VectorXcd eigs = a.eigenvalues()
Subvector | v.mid(1,n) | v.segment(1,n)
Sum of squares | sum_sqr(v) | v.array().pow(2).sum()
Trace | trace(a) | a.trace()
Min value | min(a) | a.minCoeff()
Max value | max(a) | a.maxCoeff()
Random uniform | randu(size) | VectorXd::Random(size) (uniform in [-1,1])
Concatenate vectors | concat(a,b) | VectorXd ret(a.size()+b.size()); ret << a, b;
Sort vector | Sort<double> sorter; sorter.sort(0, a.size()-1, a) | std::sort(a.data(), a.data()+a.size())
Sort index | Sort<double> sorter; sorter.sort_index(0, a.size()-1, a) | N/A
Get columns | a.get_cols(cols_vec) | N/A
Random normal | randn(size) | N/A
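
For example, a tiny it++-style computation ported to Eigen looks roughly as follows (a sketch only; the comments name the it++ equivalents from the table):

#include <Eigen/Dense>
#include <iostream>
using namespace Eigen;

int main() {
  MatrixXd A = MatrixXd::Random(3, 3);     // it++: randu(3,3); note Eigen's Random is uniform in [-1,1]
  MatrixXd AtA = A.transpose() * A;        // symmetric positive semi-definite system matrix
  VectorXd b = VectorXd::Ones(3);          // it++: ones(3)
  VectorXd x = AtA.ldlt().solve(b);        // it++: x = ls_solve_chol(AtA, b)
  std::cout << "solution: " << x.transpose() << std::endl;
  std::cout << "residual: " << (AtA * x - b).norm() << std::endl;   // it++: norm(AtA*x - b, 2)
  return 0;
}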

Monday, September 12, 2011

Atomic read-modify-write operation on multicore

Recently, while working on our Shotgun large-scale sparse logistic regression code, I learned a cool programming trick from Aapo Kyrola, who in turn learned it from Prof. Guy Blelloch from CMU.

The problem arises when you have an array and you want multiple cores to add values to the same array position concurrently. This, of course, may result in undefined behavior if the needed precautions are not taken.

A nice way to solve this problem is to define the following scary assembler procedure:

bool CAS(long *ptr, long oldv, long newv) {
  unsigned char ret;
  /* Note that sete sets a 'byte' not the word */
  __asm__ __volatile__ (
                "  lock\n"
                "  cmpxchgq %2,%1\n"
                "  sete %0\n"
                : "=q" (ret), "=m" (*ptr)
                : "r" (newv), "m" (*ptr), "a" (oldv)
                : "memory");
  return ret;
}
The above procedure implements an atomic compare-and-swap on the array: only one thread at a time can succeed in updating the specific array value given in ptr.
The way to use this procedure is as follows:
void add(int idx, double fact) {
  volatile double prev;
  volatile double newval;
  do {
    // Read the current value and compute the desired new value.
    prev = arr[idx];
    newval = prev + fact;
    // Retry until the CAS succeeds, i.e. until no other thread modified
    // arr[idx] between the read above and the write below. The doubles are
    // reinterpreted bitwise as 64-bit words for cmpxchgq.
  } while (!CAS(reinterpret_cast<long *>(&arr[idx]),
                *reinterpret_cast<volatile long *>(&prev),
                *reinterpret_cast<volatile long *>(&newval)));
}
And here is some more detailed explanation from Guy Blelloch:
The CAS instruction is one of the machine instructions on the x86 processors (the first function is just calling the instruction, which has name cmpxchgq). Probably the best book that describes its various applications is Herlihy and Shavit's book titled "The Art of Multiprocessor Programming". The general use is for implementing atomic read-modify-write operations. The idea is to read the value, make some modification to it (e.g. increment it) and then write it back if the value has not changed in the meantime. The CAS(ptr, a, b)
function conditionally writes a value b into ptr if the current value equals a.
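
On compilers that support C++11, the same atomic add can be written portably with std::atomic instead of inline assembly (a sketch I am adding for comparison, not part of the original Shotgun code):

#include <atomic>

// Atomically add 'fact' to arr[idx] using a C++11 compare-exchange loop.
// 'arr' is assumed here to be an array of std::atomic<double>.
void atomic_add(std::atomic<double> *arr, int idx, double fact) {
  double prev = arr[idx].load();
  // On failure, compare_exchange_weak reloads 'prev' with the current value,
  // so the loop retries until no other thread interferes.
  while (!arr[idx].compare_exchange_weak(prev, prev + fact)) {
  }
}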

Friday, September 9, 2011

GraphLab Clustering library

Recently I have been working on implementing a clustering library on top of GraphLab.
Currently we have K-means, fuzzy K-means and LDA (Latent Dirichlet Allocation) implemented. I took some time to compare the performance of GraphLab vs. Mahout on an Amazon EC2 machine.

Here is a graph which compares performance:


Some explanation about the experiment: I took a subset of the Netflix data with 3,298,163 movie ratings, 95,526 users, and 3,561 movies. The goal is to cluster users with similar movie preferences together. Both GraphLab and Mahout ran on an Amazon m2.xlarge instance.
This machine has 2 cores. I used the following settings: 250 clusters, 50 clusters and 20 clusters. The algorithm runs a single iteration and then dumps the output into a text file.
For Mahout, I used Mahout's K-means implementation. GraphLab was run using a single node, while Mahout was run using either one or two nodes. Mahout used 7 mappers and GraphLab 7 threads.

Overall, GraphLab runs between 15x and 40x faster on this dataset.

A second experiment I did was to compare Mahout's LDA performance to GraphLab's LDA.
Here is the graph:
For this experiment, I used an m1.xlarge instance. I tested GraphLab on 4 cores, Mahout on 4 cores and Mahout on 8 cores (2 nodes). I used the same Netflix data subset, this time with 10 clusters. The graph depicts the running time of a single iteration.

Finally, here are performance results of GraphLab LDA with 1, 2, 3 and 4 cores (on an m1.xlarge EC2 instance):

Running time in this case is for 5 iterations.

Thursday, September 8, 2011

More on shotgun (2)

One of the greatest things about writing a blog is getting interesting feedback from the readers. (I mean, if no one reads it, why bother??) Here is an email I got this morning from Steve Lianoglou, a graduate student in the Computational Systems Biology department at Memorial Sloan-Kettering Cancer Center, Weill Medical College of Cornell University.

Hi,
..
I stumbled on the shotgun library from reading Dr. Bickson's (somehow)
recent blog post:

http://bickson.blogspot.com/2011/08/more-on-shutgun.html


I thought it could come in handy on some of the stuff I'm playing with and wrote a wrapper for it so us R folks can use it, too ... I mean, we can't let the MATLAB tribe have all the fun, now, can we?

https://github.com/lianos/buckshot


Some installation and usage instructions are on the wiki:

https://github.com/lianos/buckshot/wiki


R has to be running in 64 bit for you to be able to build and install the package successfully. It works on my OS X boxes, and our (linux) compute server, so hopefully it can work for you.

It's a bit rough around the edges (ie. no documentation), but if you're familiar with building models in R, you'll know how to use this:

R> library(buckshot)
R> data(arcene)
R> model <- buckshot(A.arcene, y.arcene, 'logistic', lambda=0.166)
R> preds <- predict(model, A.arcene)
R> sum(preds == y.arcene) / length(y.arcene)
[1] 1

Could get worse than 100% accuracy, I guess ...


In time, I hope to get it "easily installable" and push it out to CRAN, but in the meantime I thought it would be of interest to the people reading this list in its current form, and to the shotgun authors (who I'm hoping are also reading this list), even if they don't use R :-)

Thanks for putting shotgun out in the wild for us to tinker with!


-steve


--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact


Thanks a lot, Steve! We really appreciate your efforts. The shotgun code has been significantly improved over the last two weeks. We are looking for more users to beta-test it on real data. Write me if you are trying out our code!

Amazon EC2 Supports Research!


I am very pleased to announce that the GraphLab large-scale machine learning project is now supported by Amazon Elastic Compute Cloud (EC2), which has allocated us computing time on their cloud. This will allow us to extend our compatibility with EC2 and to scale further to larger models.

I want to take this opportunity to thank James Hamilton, VP and Distinguished Engineer at Amazon, who pulled some strings and introduced us to Kurt Messersmith, Senior Manager at Amazon Web Services, who was kind enough to approve our grant request.

By the way, if you are teaching a course about EC2 or you want to apply for research grants, you can apply here: http://aws.amazon.com/education/

Saturday, September 3, 2011

Understanding Mahout K-Means clustering implementation

This post helps in understanding Mahout's K-Means clustering implementation.
Preliminaries: you should first read the explanation in the link above.

Installation and setup
wget http://apache.spd.co.il//mahout/0.5/mahout-distribution-0.5.zip
unzip mahout-distribution-0.5.zip
cd mahout-distribution-0.5
setenv JAVA_HOME /path.to/java1.6.0/

Running the example
From the Mahout root folder:
./examples/bin/build_reuters.sh

Explanation 
The script build_reuters.sh downloads the Reuters dataset, which is composed of news items.
<46|0>bickson@biggerbro:~/usr7/mahout-distribution-0.5/examples/bin/mahout-work/reuters-out$ ls
reut2-000.sgm-0.txt    reut2-003.sgm-175.txt  reut2-006.sgm-24.txt   reut2-009.sgm-324.txt  reut2-012.sgm-39.txt   reut2-015.sgm-474.txt  reut2-018.sgm-549.txt
reut2-000.sgm-100.txt  reut2-003.sgm-176.txt  reut2-006.sgm-250.txt  reut2-009.sgm-325.txt  reut2-012.sgm-3.txt    reut2-015.sgm-475.txt  reut2-018.sgm-54.txt
....
A typical news item looks like:
26-FEB-1987 15:01:01.79

BAHIA COCOA REVIEW

Showers continued throughout the week in the Bahia cocoa zone, alleviating the drought since early January and improving prospects for the coming temporao, although normal humidity levels have not been restored, Comissaria Smith said in its weekly review.     The dry period means the temporao will be late this year.     Arrivals for the week ended February 22 were 155,221 bags of 60 kilos making a cumulative total for the season of 5.93 mln against 5.81 at the same stage last year. Again it seems th
....

The goal of the method is to cluster similar news items together. This is done by first counting word occurrences using the TF-IDF scheme, so that each news item becomes a sparse row in a matrix. Next, the rows are clustered together using the k-means algorithm; a minimal sketch of the clustering step is shown below.
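
To make the clustering step concrete, here is a minimal sketch of one k-means iteration in C++ (my illustration, not Mahout's actual implementation): assign each row to its nearest centroid, then recompute each centroid as the mean of its assigned rows.

#include <vector>
#include <limits>

// Sketch of one k-means iteration over dense rows (illustration only).
void kmeans_iteration(const std::vector<std::vector<double> > &rows,
                      std::vector<std::vector<double> > &centroids) {
  const size_t k = centroids.size(), dim = centroids[0].size();
  std::vector<std::vector<double> > sums(k, std::vector<double>(dim, 0.0));
  std::vector<int> counts(k, 0);

  for (size_t i = 0; i < rows.size(); ++i) {
    // Assignment step: find the nearest centroid (squared Euclidean distance).
    size_t best = 0;
    double best_dist = std::numeric_limits<double>::max();
    for (size_t c = 0; c < k; ++c) {
      double dist = 0.0;
      for (size_t d = 0; d < dim; ++d) {
        const double diff = rows[i][d] - centroids[c][d];
        dist += diff * diff;
      }
      if (dist < best_dist) { best_dist = dist; best = c; }
    }
    // Accumulate the assigned row for the update step.
    counts[best]++;
    for (size_t d = 0; d < dim; ++d) sums[best][d] += rows[i][d];
  }
  // Update step: each centroid becomes the mean of its assigned rows.
  for (size_t c = 0; c < k; ++c)
    if (counts[c] > 0)
      for (size_t d = 0; d < dim; ++d)
        centroids[c][d] = sums[c][d] / counts[c];
}

Mahout performs the same two steps as a map-reduce job over the sparse TF-IDF vectors, which is what the pipeline below drives.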

What happens behind the scenes?
1) mahout seqdirectory is called to create sequence files containing the file name as key and the file content as value.
INPUT DIR: mahout-work/reuters-out/
OUTPUT DIR: mahout-work/reuters-out-seqdir/

2) mahout seq2sparse is called to create sparse vectors out of the sequence files.
INPUT DIR: mahout-work/reuters-out-seqdir/
OUTPUT DIR: mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/

Inside the output dir, a file called part-r-00000 is created. This is a sequence file which includes an int (row id) as key and a sparse vector (SequentialAccessSparseVector) as value.

3) mahout kmeans is called, for clustering the sparse vectors into clusters.
INPUT DIR: mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/
INTERMEDIATE OUTPUT DIR: mahout-work/reuters-kmeans-clusters/
OUTPUT DIR:mahout-work/reuters-kmeans/

4) Finally, clusterdump converts the clusters into a human-readable format.
INPUT DIR: mahout-work/reuters-kmeans/
OUTPUT: a text file.

Debugging:
Below you can find some common problems and their solutions.

Problem:
~/usr7/mahout-distribution-0.5$ ./bin/mahout kmeans -i ~/usr7/small_netflix_mahout/ -o ~/usr7/small_netflix_mahout_output/ --numClusters 10 -c ~/usr7/small_netflix_mahout/ -x 10
no HADOOP_HOME set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/mahout-examples-0.5-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Sep 4, 2011 2:10:39 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--clusters=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/, --maxIter=10, --method=mapreduce, --numClusters=10, --output=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_output/, --startPhase=0, --tempDir=temp}
Sep 4, 2011 2:10:39 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Deleting /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout
Sep 4, 2011 2:10:39 AM org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Sep 4, 2011 2:10:39 AM org.apache.hadoop.io.compress.CodecPool getCompressor
INFO: Got brand-new compressor
Sep 4, 2011 2:10:39 AM org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker <init>
WARNING: Problem opening checksum file: file:/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/part-randomSeed.  Ignoring exception: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:180)
    at java.io.DataInputStream.readFully(DataInputStream.java:152)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:134)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:283)
    at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1437)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.<init>(SequenceFileIterator.java:58)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:61)
    at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:87)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:58)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)

Answer: The cluster path and the input path point to the same folder. When the run starts, all files in the cluster path are deleted, so the input file is deleted as well. Change the paths to point to different folders!

Problem:
./bin/mahout kmeans -i ~/usr7/small_netflix_mahout/ -o ~/usr7/small_netflix_mahout_output/ --numClusters 10 -c ~/usr7/small_netflix_mahout_clusters/ -x 10
no HADOOP_HOME set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/mahout-examples-0.5-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Sep 4, 2011 2:15:11 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--clusters=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters/, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/, --maxIter=10, --method=mapreduce, --numClusters=10, --output=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_output/, --startPhase=0, --tempDir=temp}
Sep 4, 2011 2:15:12 AM org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Sep 4, 2011 2:15:12 AM org.apache.hadoop.io.compress.CodecPool getCompressor
INFO: Got brand-new compressor
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
    at java.util.ArrayList.get(ArrayList.java:322)
    at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:108)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:58)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)

Answer: The input file named part-r-00000 is missing from the input folder.

Successful run:
124|0>bickson@biggerbro:~/usr7/mahout-distribution-0.5$ ./bin/mahout kmeans -i ~/usr7/small_netflix_mahout/ -o ~/usr7/small_netflix_mahout_output/ --numClusters 10 -c ~/usr7/small_netflix_mahout_clusters/ -x 10
no HADOOP_HOME set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/mahout-examples-0.5-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Sep 4, 2011 2:19:48 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--clusters=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters/, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/, --maxIter=10, --method=mapreduce, --numClusters=10, --output=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_output/, --startPhase=0, --tempDir=temp}
Sep 4, 2011 2:19:48 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Deleting /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters
Sep 4, 2011 2:19:48 AM org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Sep 4, 2011 2:19:48 AM org.apache.hadoop.io.compress.CodecPool getCompressor
INFO: Got brand-new compressor
Sep 4, 2011 2:19:48 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 2:19:48 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 2:19:48 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 2:19:48 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 2:19:52 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Wrote 10 vectors to /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters/part-randomSeed
Sep 4, 2011 2:19:52 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Input: /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout Clusters In: /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters/part-randomSeed Out: /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_output Distance: org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
Sep 4, 2011 2:19:52 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: convergence: 0.5 max Iterations: 10 num Reduce Tasks: org.apache.mahout.math.VectorWritable Input Vectors: {}
Sep 4, 2011 2:19:52 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: K-Means Iteration 1
Sep 4, 2011 2:19:52 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
INFO: Initializing JVM Metrics with processName=JobTracker, sessionId=
Sep 4, 2011 2:19:52 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
Sep 4, 2011 2:19:53 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0001
Sep 4, 2011 2:19:53 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
Sep 4, 2011 2:19:53 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: io.sort.mb = 100
Sep 4, 2011 2:19:53 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: data buffer = 79691776/99614720
Sep 4, 2011 2:19:53 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: record buffer = 262144/327680
Sep 4, 2011 2:19:53 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 2:19:54 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 0% reduce 0%
Sep 4, 2011 2:19:59 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
Sep 4, 2011 2:20:00 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 80% reduce 0%
Sep 4, 2011 2:20:00 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
INFO: Starting flush of map output
Sep 4, 2011 2:20:02 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
Sep 4, 2011 2:20:03 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 100% reduce 0%
Sep 4, 2011 2:20:05 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
Problem:
no HADOOP_HOME set, running locally
Exception in thread "main" java.lang.ClassFormatError: org.apache.mahout.driver.MahoutDriver (unrecognized class file version)
   at java.lang.VMClassLoader.defineClass(libgcj.so.8rh)
   at java.lang.ClassLoader.defineClass(libgcj.so.8rh)
   at java.security.SecureClassLoader.defineClass(libgcj.so.8rh)
   at java.net.URLClassLoader.findClass(libgcj.so.8rh)
   at java.lang.ClassLoader.loadClass(libgcj.so.8rh)
   at java.lang.ClassLoader.loadClass(libgcj.so.8rh)
   at gnu.java.lang.MainThread.run(libgcj.so.8rh)
Answer: The wrong Java version is being used; you should use 1.6.0 or higher.

Problem:
../../bin/mahout: line 201: /usr/share/java-1.6.0//bin/java: No such file or directory
../../bin/mahout: line 201: exec: /usr/share/java-1.6.0//bin/java: cannot execute: No such file or directory
Answer: JAVA_HOME is pointing to the wrong place. Inside this directory a subdirectory called bin should be present, with an executable named "java" in it.

Problem:
export JAVA_HOME=/afs/cs.cmu.edu/local/java/amd64_f7/jdk1.6.0_16/
cd ~/usr7/mahout-distribution-0.5/ ; ./bin/mahout clusterdump --seqFileDir ~/usr7/small_netflix_mahout_clusters/ --pointsDir ~/usr7/small_netflix_mahout/ --output small_netflix_output.txt
no HADOOP_HOME set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/mahout-examples-0.5-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Sep 4, 2011 4:22:29 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=small_netflix_output.txt, --pointsDir=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/, --seqFileDir=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters/, --startPhase=0, --tempDir=temp}
Sep 4, 2011 4:22:29 AM org.apache.hadoop.util.NativeCodeLoader 
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Sep 4, 2011 4:22:29 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 4:22:29 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 4:22:29 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 4:22:29 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Exception in thread "main" java.lang.ClassCastException: org.apache.mahout.math.VectorWritable cannot be cast to org.apache.mahout.clustering.WeightedVectorWritable
	at org.apache.mahout.utils.clustering.ClusterDumper.printClusters(ClusterDumper.java:171)
	at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:121)
	at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:86)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
make: *** [clusterdump] Error 1
Problem:
export JAVA_HOME=/afs/cs.cmu.edu/local/java/amd64_f7/jdk1.6.0_16/
cd ~/usr7/mahout-distribution-0.5/ ; ./bin/mahout kmeans -i ~/usr7/small_netflix_transpose_mahout/ -o ~/usr7/small_netflix_mahout_transpose_output/ --numClusters 10 -c ~/usr7/small_netflix_mahout_transpose_clusters/ -x 2 -ow -cl
no HADOOP_HOME set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/mahout-examples-0.5-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Sep 4, 2011 4:57:44 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--clustering=null, --clusters=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_transpose_clusters/, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_transpose_mahout/, --maxIter=2, --method=mapreduce, --numClusters=10, --output=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_transpose_output/, --overwrite=null, --startPhase=0, --tempDir=temp}
Sep 4, 2011 4:57:45 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Deleting /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_transpose_output
Sep 4, 2011 4:57:45 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Deleting /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_transpose_clusters
Exception in thread "main" java.io.FileNotFoundException: File /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_transpose_mahout does not exist.
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
	at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:69)
	at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:58)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
make: *** [kmeans_transpose] Error 1
Answer: The input directory does not exist.

Problem: The clusterdump program runs, but produces an empty text file as output.
Solution: You probably gave the intermediate cluster path of k-means instead of the output path dir. In this case, the program runs and terminates without an error.