Thursday, September 29, 2011

Spotlight: Pandora Internet Radio


A while ago I met Eric Bieschke and Tao Ye at a GeekSessions event in SF.
I was really impressed by Eric's talk presenting Pandora Internet Radio, and I am sure everyone will agree with me that it is one of the coolest companies, with great large scale machine learning
applications. Here is a quick interview I held with Tao:

Q: Can you give a short description of Pandora, to those few who don't know about this company?
A: Pandora is the leader in internet radio in the United States, offering a personalized experience for each of our listeners. We have pioneered a new form of radio that uses intrinsic  qualities of music to initially create stations and then adapts playlists in real-time based on the individual feedback of each listener.

The Music Genome Project and our playlist generating algorithms form the technology foundation that enables us to deliver personalized radio to our listeners. These proprietary technologies power our ability to predict listener music preferences and play music content suited to the tastes of
each individual listener. The extensive musicological database of the Music Genome Project has been meticulously built by a team of professional musicians and musicologists analyzing up to 480 attributes, or genes, for every song in our vast collection, to capture the fundamental musical
properties of each recording. When a listener enters a single song, artist, composer or genre to start a station ­ a process we call seeding ­our complex mathematical algorithms combine the genes cataloged by the Music Genome Project with individual and collective feedback to suggest songs and buildpersonalized playlists.

Q: What is the magnitude of datasets you are working on?
A: As of July 2011, we had over 100 million registered users, and more
than 37 million active monthly users. Since the launch of Pandora in 2005, our listeners
have created 1.9 billion stations and have given more than 11 billion thumbs.
Containing over 900,000 songs from over 90,000 artists, we believe the
Music Genome Project is the most comprehensive analysis of music in the
world.

Q: Are there unique properties of your data relative to other datasets
like the Yahoo! KDD Cup?
A: Compared to the KDD Cup 2011 dataset, our feedback dataset has binary data only
(thumbs up or thumbs down) instead of numeric ratings. In addition, all the feedback is
in context -- for a music or comedy seed. Since users can start stations
from a song, an artist or a genre, there are close to 1 million possible
"contexts" for recommendations to live in. This has both computational
implications (scale makes running complex algorithms harder) and
recommendation implications (in some cases it makes the problem easier).

Our genome data has not only track/album/artist/genre metadata, but also
'gene' analysis of each track done by human music and comedy analysts. There are up to
450 gene values per track, capturing a track's musical (or comedic)
attributes, from melody, harmony and instrumentation to rhythm, vocals and
lyrics.


Q: How does your current recommendation engine work? (maybe in general;
you probably do not want to reveal all the secret recipes here)
A: We combine crowd feedback data and genome analysis data to provide
recommendations within the context of the station seed to each user. Our algorithm
recommends songs based on metrics such as thumbs ratio, genome nearest
neighbor and song novelty. It additionally customizes stations in real
time per user, based on instant user feedback.
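
To give a flavor of how such signals might be combined, below is a purely hypothetical toy sketch in C++ (all function names, inputs and weights are invented for illustration; this is not Pandora's algorithm) that blends a smoothed thumbs ratio, genome similarity to the station seed, and a novelty term into a single song score:

#include <vector>
#include <cmath>

// cosine similarity between two gene vectors (e.g., a few hundred values per track)
double gene_similarity(const std::vector<double>& a, const std::vector<double>& b) {
    double dot = 0, na = 0, nb = 0;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb) + 1e-12);
}

// hypothetical blended score for ranking a candidate song within a station
double song_score(double thumbs_up, double thumbs_down,
                  const std::vector<double>& song_genes,
                  const std::vector<double>& seed_genes,
                  double plays_to_this_listener) {
    double thumbs_ratio = (thumbs_up + 1.0) / (thumbs_up + thumbs_down + 2.0); // smoothed crowd feedback
    double similarity   = gene_similarity(song_genes, seed_genes);             // genome nearest-neighbor proxy
    double novelty      = 1.0 / (1.0 + plays_to_this_listener);                // prefer songs the listener has not heard
    return 0.5 * thumbs_ratio + 0.4 * similarity + 0.1 * novelty;              // made-up weights
}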

Q: What are some future challenges you would like to solve? Specifically,
are you looking at online/real-time recommendations?
A: We're constantly improving the playlist algorithm. Many challenges lie
ahead.
* Pandora already provides online/real-time personalized playlists. We
compute the building blocks to assist in making those choices offline, but
every song on Pandora was chosen specifically for that listener at that
moment. It is still a challenge to build a more refined set of real-time
metrics and infer listener preference, especially with limited user input
(many listeners don't thumb at all!).
* Past competitions emphasized prediction accuracy optimization; however,
at Pandora we value music variety greatly, hence understanding the
tradeoff between prediction accuracy and music variety/diversity and
striking the right balance is very important.
* We work on context-relevant recommendations, from creating the best 4th
of July stations to ensuring new artist/song stations are good. These are
our cold start problems.
* Greater combination of different recommendation algorithms, including
content-based, expert-based and various crowd-based recommendations.

About Tao and Eric:

Tao Ye is a member of the Pandora playlist engineering team, currently
working on Pandora's playlist measurement and genome optimization. Most
recently, she spent 5 years as a research scientist at Sprint's IP and
wireless networking group, working on network monitoring and measurement
of a large scale IP backbone. Prior to joining Sprint, she held lead
engineer and engineer roles working on Java systems at Consilient and Sun
Microsystems. She received a Master's degree from UC Berkeley
in Computer Science and dual Bachelor's degrees from the State University of
New York at Stony Brook in Computer Science and Engineering Chemistry. She expects her Ph.D.
degree from the University of Melbourne in December 2011.


Eric Bieschke runs playlist engineering for Pandora. As Pandora's second
employee he built initial prototypes for Pandora's playlist algorithms and
with his team has grown them to service more than 100M users who've
thumbed 10 billion songs while listening to billions of hours of music. He
is currently working on optimally combining content based recommendations,
collective intelligence, and human machine cooperation in order to provide
the best experience for listeners.

Tuesday, September 27, 2011

K-Core/K-Shell Network Decomposition

A few days ago I met with Prof. Scott Kirkpatrick, who asked me for some help implementing the K-cores decomposition algorithm in Graphlab.

K-Core decomposition is a very simple algorithm, originating from statistical mechanics, for investigating graph properties. I found a nice paper (k-core decomposition: a tool for the visualization of large scale networks, by
José Ignacio Alvarez-Hamelin, Luca Dall'Asta, Alain Barrat and Alessandro Vespignani, NIPS 18, 2006) describing the algorithm. The algorithm proceeds as follows:
In the first round, all nodes with one edge or less are removed from the graph, together with their edges. In the second round, all nodes with two edges or less are removed from the graph and their edges deleted. Similarly, at the i-th iteration all nodes with i or fewer neighbors are removed from the graph. Note that removal of nodes with k or fewer neighbors is done recursively: if a removal (together with its links) exposes a node that now has k or fewer neighbors, it is removed in the current iteration as well.


The image above, taken from the NIPS paper, shows 3 iterations of the K-core algorithm. In the first iteration the blue nodes
are removed. In the second iteration the yellow nodes are removed. Finally we are left with the red nodes, which are the core of this sample network.

The output of the K-cores algorithm is a single number for each node: its core assignment. The algorithm has been used by Prof. Kirkpatrick for investigating Internet topology and reported here.
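
For concreteness, here is a minimal single-threaded sketch of the peeling procedure in C++ (an illustration of the algorithm only, not the GraphLab implementation); when it finishes, core[v] holds the shell assignment of vertex v:

#include <vector>

// adj is an undirected graph given as adjacency lists; returns the shell index of each node
std::vector<int> kcore_decomposition(const std::vector<std::vector<int> >& adj) {
    int n = adj.size();
    std::vector<int> degree(n), core(n, 0);
    std::vector<bool> removed(n, false);
    for (int v = 0; v < n; ++v) degree[v] = adj[v].size();

    int remaining = n;
    int k = 1;                                   // shell currently being peeled
    while (remaining > 0) {
        // collect all still-present nodes with k or fewer neighbors
        std::vector<int> frontier;
        for (int v = 0; v < n; ++v)
            if (!removed[v] && degree[v] <= k) frontier.push_back(v);
        if (frontier.empty()) { ++k; continue; } // nothing left to peel at this k
        // remove them recursively: deleting a node may expose a neighbor
        // that now also has k or fewer neighbors, so it goes in the same shell
        while (!frontier.empty()) {
            int v = frontier.back(); frontier.pop_back();
            if (removed[v]) continue;
            removed[v] = true; core[v] = k; --remaining;
            for (size_t e = 0; e < adj[v].size(); ++e) {
                int u = adj[v][e];
                if (!removed[u] && --degree[u] <= k) frontier.push_back(u);
            }
        }
        ++k;
    }
    return core;
}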

Below you can find an image from the above NIPS paper which depicts the application of the algorithm to the French Internet domain:

Anyway, now the K-core algorithm is supported as part of the Graphlab clustering library. You are welcome to try it out on your own network!

Wednesday, September 21, 2011

On the importance of different configurations of linear algebra packages

I got the following question from Kayhan, a UPenn graduate student:

I am one of the readers of your weblog. I have a question about one of your posts comparing two linear algebra libraries, "it++ vs. Eigen"; I guess you are the expert person who can answer my question.
I have an algorithm that iteratively performs matrix-matrix and matrix-vector multiplications and involves all kinds of dense and sparse matrices and vectors. I have implemented my algorithm using gmm with the ATLAS flag active, but it seems that it is still slower than MATLAB. More specifically, it seems that gmm uses one thread compared to MATLAB, which uses multiple threads when it is compiled with MCC.
I was wondering if any of the libraries you introduced in your post (it++, Eigen) are capable of multi-threading, and how they compare with the MATLAB linear algebra engine.


Regards,
Kayhan

It is always nice to get feedback from my readers! Especially the ones who call
me an expert (although next time without the "I guess", please!! :-)
There is definitely room for improving BLAS/LAPACK performance. You need to dig into the details of the library you are using.

Eigen has some nice benchmarks here:

As you can see, ATLAS has relatively inferior performance vs. Eigen and
Intel MKL (higher MFLOPS is better).
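
Regarding the multi-threading part of Kayhan's question: Eigen 3 can parallelize large dense products with OpenMP when the code is compiled with -fopenmp. Here is a minimal sketch (assuming Eigen 3.x; the thread-count calls below are simply no-ops if OpenMP is disabled):

#include <Eigen/Dense>
#include <iostream>

int main() {
    Eigen::setNbThreads(4);                                  // threads used for dense products
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(2000, 2000);
    Eigen::MatrixXd B = Eigen::MatrixXd::Random(2000, 2000);
    Eigen::MatrixXd C = A * B;                               // multi-threaded GEMM when OpenMP is enabled
    std::cout << "C(0,0) = " << C(0, 0)
              << ", threads = " << Eigen::nbThreads() << std::endl;
    return 0;
}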

Here is an example setup for Intel MKL I got from Joel Welling (PSC):

I'm guessing that the current configuration produces too many threads,
or puts those threads in the wrong places. See for example the section
'Choosing the number of threads with MKL' on
http://www.psc.edu/general/software/packages/mkl/ . It might also be
worth linking against the non-threaded version of MKL, which I think
would involve doing:

-L${MKL_PATH} -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lpthread
instead of:

-L${MKL_PATH} -lmkl_intel_lp64  -lmkl_intel_thread -lmkl_core \
-L/opt/intel/Compiler/11.1/072/lib/intel64 -liomp5 -fopenmp
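
To sanity-check which MKL configuration you actually ended up linking against, a small test program along the following lines can help (a sketch: it assumes the MKL headers are on your include path and that you compile with one of the link lines above):

#include <mkl.h>
#include <cstdio>
#include <vector>

int main() {
    int n = 1000;
    std::vector<double> A(n * n, 1.0), B(n * n, 1.0), C(n * n, 0.0);
    printf("MKL max threads: %d\n", mkl_get_max_threads());
    double t0 = dsecnd();                  // MKL's own wall-clock timer
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, &A[0], n, &B[0], n, 0.0, &C[0], n);
    printf("dgemm took %.3f seconds, C[0]=%.1f\n", dsecnd() - t0, C[0]);
    return 0;
}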

From my experience, there is a huge difference in performance between different LAPACK configurations on the same machine. For example, on the BlackLight supercomputer
I got the following timing results for alternating least squares on Netflix data.
Here is a graph comparing different implementations. I used 16 BlackLight cores. Alternating least squares is run for 10 iterations to factorize a matrix with 100,000,000 non-zeros. The width of the factor matrices was set to D=30.

As you can see, the wrong configuration resulted in 24x longer running time! (In this graph, lower is better!) Overall, if you are using an Intel platform I highly recommend using MKL.

Why don't you try out GraphLab? It is designed for iterative algorithms on sparse data. If you use it, it is much easier to efficiently utilize multiple cores.

Giraph Machine Learning Project - Setting up

The Giraph machine learning project is a relatively new large scale machine learning project in incubation under Apache. It is the only open source implementation of Google's Pregel (BSP = Bulk Synchronous Parallel) framework that I am aware of.
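
To make the Pregel/BSP model concrete, here is a conceptual single-process sketch in C++ (an illustration of the model only; Giraph itself is written in Java and this is not its API) of how vertex programs, messages and supersteps interact, using PageRank as the example. Each superstep, every vertex consumes the messages sent to it in the previous superstep, updates its value, and sends new messages; a barrier separates supersteps.

#include <vector>

struct Vertex { double value; std::vector<int> out_edges; };

// assumes the caller initializes every vertex value to 1.0/n
void pagerank_bsp(std::vector<Vertex>& g, int supersteps) {
    int n = g.size();
    std::vector<std::vector<double> > inbox(n), next_inbox(n);
    for (int step = 0; step < supersteps; ++step) {
        for (int v = 0; v < n; ++v) {
            // 1. consume messages received in the previous superstep
            double sum = 0.0;
            for (size_t m = 0; m < inbox[v].size(); ++m) sum += inbox[v][m];
            if (step > 0) g[v].value = 0.15 / n + 0.85 * sum;
            // 2. send messages along out-edges for the next superstep
            if (!g[v].out_edges.empty()) {
                double msg = g[v].value / g[v].out_edges.size();
                for (size_t e = 0; e < g[v].out_edges.size(); ++e)
                    next_inbox[g[v].out_edges[e]].push_back(msg);
            }
        }
        inbox.swap(next_inbox);                      // barrier: the superstep ends here
        for (int v = 0; v < n; ++v) next_inbox[v].clear();
    }
}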

I got the following instructions, from my colleague and friend Aapo Kyrola:

1. INSTALL HADOOP: Must be version 0.20.203 or later.
- This is simple, just download and extract.

2. Set HADOOP_HOME variable to point to the hadoop directory.

3. Set the Hadoop configuration (under HADOOP_HOME/conf) according
to what is explained here.

* NOTE: set the hdfs directory appropriately: core-site.xml, property hadoop.tmp.dir

3.5 Start Hadoop:
bin/start-all.sh

4. Install zookeeper
- just download and extract

5. Configure conf/zoo.cfg properly. (Just copy the sample config and change to sensible parameters).
- set clientPort=22181

6. Start up zookeeper:
bin/zkServer.sh start

7. Install and build Giraph as explained at the end of this website:
http://incubator.apache.org/giraph/

8. In HADOOP_HOME, run PageRank:
bin/hadoop jar ../../GraphLab/giraph/giraph/trunk/target/giraph-0.70-jar-with-dependencies.jar org.apache.giraph.benchmark.PageRankBenchmark -e 100 -s 5 -V 10000 -w 1 -v

If everything went OK you will get:
11/09/19 18:23:20 INFO mapred.JobClient:   Giraph Timers
11/09/19 18:23:20 INFO mapred.JobClient:     Total (milliseconds)=260128
11/09/19 18:23:20 INFO mapred.JobClient:     Superstep 3 (milliseconds)=54578
11/09/19 18:23:20 INFO mapred.JobClient:     Setup (milliseconds)=2771
11/09/19 18:23:20 INFO mapred.JobClient:     Shutdown (milliseconds)=92
11/09/19 18:23:20 INFO mapred.JobClient:     Vertex input superstep (milliseconds)=2386
11/09/19 18:23:20 INFO mapred.JobClient:     Superstep 0 (milliseconds)=8059
11/09/19 18:23:20 INFO mapred.JobClient:     Superstep 4 (milliseconds)=70263
11/09/19 18:23:20 INFO mapred.JobClient:     Superstep 5 (milliseconds)=1879
11/09/19 18:23:20 INFO mapred.JobClient:     Superstep 2 (milliseconds)=66531
11/09/19 18:23:20 INFO mapred.JobClient:     Superstep 1 (milliseconds)=53564
11/09/19 18:23:20 INFO mapred.JobClient:   Giraph Stats
11/09/19 18:23:20 INFO mapred.JobClient:     Aggregate edges=1000000
11/09/19 18:23:20 INFO mapred.JobClient:     Superstep=6
11/09/19 18:23:20 INFO mapred.JobClient:     Current workers=1
11/09/19 18:23:20 INFO mapred.JobClient:     Current master task partition=0
11/09/19 18:23:20 INFO mapred.JobClient:     Sent messages=0
11/09/19 18:23:20 INFO mapred.JobClient:     Aggregate finished vertices=10000
11/09/19 18:23:20 INFO mapred.JobClient:     Aggregate vertices=10000

Anyway, Aapo has a great Nordic sense of humor. This is what he sent me later:
For your convenience, I have pasted the documentation of Giraph to this email.

-- Begin --
-- End --



Additionally, a quick start document is available here:
https://github.com/aching/Giraph/wiki/Quick-Start-Guide

Tuesday, September 20, 2011

Spotlight: Nickolas Vasiloglou

I thought about writing about some of the exciting applied machine learning projects that are currently taking place. I will start with Nickolas (Nick) Vasiloglou, PhD, a former graduate student of Alex Gray at Georgia Tech. Nick is currently an independent large scale machine learning consultant.

Here are some of the projects Nick is involved in, in his own words:
  •  LexisNexis-ML is a machine learning toolbox combining the HPCC-LexisNexis high performance computing cluster and the PaperBoat/GraphLab library. HPCC is by far a superior alternative to Hadoop. The system uses ECL, a declarative language that allows easier expression of data problems (see http://hpccsystems.com/ ). Inlining of C++ code can make it even more powerful when blending of sequential numerical algorithms with data manipulation is necessary. HPCC's heart is a C++ code generator that has the advantage of generating highly optimized binaries that outperform Java Hadoop binaries.

  • PaperBoat is a single-threaded machine learning library built on top of C++ Boost MPL (template metaprogramming). The library is built with several templated abstractions so that it can be integrated easily with other platforms. The integration can be either light or very deep. The library makes extensive use of multidimensional trees for improving scalability and speed. Here is the current list of implemented algorithms. All of them support both sparse and dense data:

    All nearest neighbors (range, k, nearest, furthest, metric, Bregman divergence), kd-trees, ball trees
    K-means (k-means++, Peleg's algorithm, online, batch, kd-tree, sparse trees)
    QUIC-SVD
    NMF
    Kernel density estimation (kd-trees, ball trees)
    Lasso
    Regression (stepwise, VIF, nonnegative, constrained)
    SVM (SMO, a faster method using trees)
    Decision trees
    Orthogonal range search
    Kernel PCA
    Nonparametric regression
    Maximum Variance Unfolding

  • Mouragio is an asynchronous version of PaperBoat where single-threaded machine learning algorithms can exchange data asynchronously. Mouragio very efficiently implements a publish-subscribe model that is ideal for asynchronous bootstrapping (bagging) as well as for the racing algorithm (Moore & Maron 1997). Asynchronous iterations are an old idea from the MIT optimization lab (Bertsekas and Tsitsiklis, see link).
    Mouragio tries to utilize algorithms from the graph literature to automatically partition data and tasks so that the user doesn't have to deal with it. The Mouragio daemon tries to schedule tasks to the node where most of the required data and computational power are available. Mouragio is partially supported by LogicBlox.

  • Datalog-LogicBlox scientific engine. LogicBlox has developed a database platform based on logic. The language used is an enhanced version of Datalog. Datalog is by far the most expressive and declarative language for manipulating data. At this point Datalog translates logic into run-time database engine transactions. The goal of this project is to translate Datalog to other scientific platforms such as GraphLab and Mouragio. Datalog is very good at expressing graphs, so it can very easily be translated to GraphLab. Also, since the algorithms are described as sequence-independent rules, automatic parallelization is easier to do (although not always 100%).

Sunday, September 18, 2011

it++ vs. Eigen

it++ and Eigen are both popular and powerful matrix linear algebra packages for C++.

We got a lot of complaints from our users about the relative difficulty of installing it++, as well as about its limiting GPL license. We have decided to try and switch to the Eigen linear algebra library instead. Eigen requires no installation since the code is composed of header files. It is licensed under the LGPL3+ license.

Today I have created a pluggable interface that allows swapping it++ and Eigen underneath our GraphLab code. I have run some tests to verify the speed and accuracy of Eigen vs. it++.

And here are the results:
Framework and Algorithm   Running time (sec)   Training RMSE   Validation RMSE
it++ ls_solve_chol        16.8                 0.7000          0.9704
it++ ls_solve             17.8                 0.7000          0.9704
Eigen ldlt                18.3                 0.6745          0.9495
Eigen llt                 18.7                 0.6745          0.9495
Eigen JacobiSVD           63.0                 0.6745          0.9495

Experiment details: I have used GraphLab's alternating least squares, with a subset of the Netflix data. The dataset is described here. I let the algorithm run for 10 iterations, in release mode, on our AMD Opteron 8 core machine.

Experiment conclusions: It seems that Eigen is more accurate than it++. It runs slightly slower than it++, but the accuracy of both the training and validation RMSE is better.

To those of you who are familiar with it++ and would like to try out Eigen, I made a short
list of compatible function calls in both systems.


Operation | it++ | Eigen
double matrix | mat | MatrixXd
double vector | vec | VectorXd
Value assignment | a.set(i,j,val) | a(i,j)=val
Get row | a.get_row(i) | a.row(i)
Identity matrix | eye(size) | Identity(size)
Matrix/vector of ones | ones(size) | Ones(size)
Matrix/vector of zeros | zeros(size) | Zero(size)
Least squares solution | x=ls_solve(A,b) | x=A.ldlt().solve(b)
Transpose | transpose(a) or a.transpose() | a.transpose()
Set diagonal | a=diag(v) | a.diagonal()=v
Sum of values | sumsum(a) | a.sum()
L2 norm | a.norm(2) | a.squaredNorm()
Inverse | inv(a) | a.inverse()
Outer product | outer_product(a,b) | a*b.transpose()
Eigenvalues of symmetric matrix | eig_sym | VectorXcd eigs = T.eigenvalues()
Subvector | v.mid(1,n) | v.segment(1,n)
Sum of squares | sum_sqr(v) | v.array().pow(2).sum()
Trace | trace(a) | a.trace()
Min value | min(a) | a.minCoeff()
Max value | max(a) | a.maxCoeff()
Random uniform | randu(size) | VectorXi::Random(size)
Concatenate vectors | concat(a,b) | VectorXi ret(a.size()+b.size()); ret << a,b;
Sort vector | Sort sorter; sorter.sort(0, a.size()-1, a) | std::sort(a.data(), a.data()+a.size());
Sort index | Sort sorter; sorter.sort_index(0, a.size()-1, a) | N/A
Get columns | a.get_cols(cols_vec) | N/A
Random normal | randn(size) | N/A
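
To make the mapping concrete, here is a small self-contained Eigen snippet exercising a few of the calls from the table (a toy illustration of porting it++-style code; this is not actual GraphLab code):

#include <Eigen/Dense>
#include <iostream>

int main() {
    using namespace Eigen;
    MatrixXd A = MatrixXd::Random(100, 30);     // tall system, like one ALS step
    VectorXd b = VectorXd::Random(100);

    // least-squares solution via the normal equations, using the ldlt solver from the table
    MatrixXd AtA = A.transpose() * A;
    VectorXd x   = AtA.ldlt().solve(A.transpose() * b);

    std::cout << "residual norm: " << (A * x - b).norm() << std::endl;
    std::cout << "sum of x:      " << x.sum() << std::endl;
    std::cout << "max of x:      " << x.maxCoeff() << std::endl;
    return 0;
}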

Monday, September 12, 2011

Atomic read-modify-write operation on multicore

Recently, while working on our Shotgun large scale sparse logistic regression code, I learned a cool programming trick from Aapo Kyrola, who in turn learned it from Prof. Guy Blelloch from CMU.

The problem arises when you have an array, and you want multiple cores to add values to the same array position concurrently. This of course may result in undefined behavior if the needed precautions are not taken.

A nice way to solve this problem is the following. Define the following
scary assembler procedure:

bool CAS(long *ptr, long oldv, long newv) {
      unsigned char ret;
      /* Note that sete sets a 'byte' not the word */
      __asm__ __volatile__ (
                    "  lock\n"
                    "  cmpxchgq %2,%1\n"
                    "  sete %0\n"
                    : "=q" (ret), "=m" (*ptr)
                    : "r" (newv), "m" (*ptr), "a" (oldv)
                    : "memory");
      return ret;
    }
The above procedure implements an atomic compare-and-swap: it writes newv into *ptr only if the current value still equals oldv, so out of several competing threads only one update of the array value given in ptr can succeed at a time.
The way to use this procedure is as follows:
void add(int idx, double fact) {
    volatile double prev;
    volatile double newval;
    volatile double oldval;
    do {
        /* read the current value and compute the updated value */
        prev = arr[idx];
        oldval = prev;
        newval = prev + fact;
        /* retry if another thread modified arr[idx] in the meantime;
           the doubles are reinterpreted as 64-bit longs for the CAS */
    } while (!CAS(reinterpret_cast<long *>(&arr[idx]),
                  *reinterpret_cast<volatile long *>(&prev),
                  *reinterpret_cast<volatile long*>(&newval)));
}
And here is some more detailed explanation from Guy Blelloch:
The CAS instruction is one of the machine instructions on the x86 processors (the first function is just calling the instruction, which has name cmpxchgq). Probably the best book that describes its various applications is Herlihy and Shavit's book titled "The Art of Multiprocessor Programming". The general use is for implementing atomic read-modify-write operations. The idea is to read the value, make some modification to it (e.g. increment it) and then write it back if the value has not changed in the meantime. The CAS(ptr, a, b)
function conditionally writes a value b into ptr if the current value equals a.
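
As a side note, on reasonably recent GCC versions (4.1 and later) the same lock-free add can be written without hand-coded assembly using the __sync_bool_compare_and_swap built-in. Here is a sketch, assuming a 64-bit platform where double and long are both 8 bytes:

#include <cstring>

// usage: atomic_add(&arr[idx], fact);
void atomic_add(double* target, double fact) {
    double oldval, newval;
    long oldbits, newbits;
    do {
        oldval = *target;
        newval = oldval + fact;
        std::memcpy(&oldbits, &oldval, sizeof(long));   // reinterpret the doubles bit-wise
        std::memcpy(&newbits, &newval, sizeof(long));
        // retry until no other thread has changed *target in the meantime
    } while (!__sync_bool_compare_and_swap(
                 reinterpret_cast<long*>(target), oldbits, newbits));
}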

Friday, September 9, 2011

GraphLab Clustering library

Recently I have been working on implementing a clustering library on top of GraphLab.
Currently we have K-means, Fuzzy K-means and LDA (Latent Dirichlet Allocation) implemented. I took some time to compare the performance of GraphLab vs. Mahout on an Amazon EC2 machine.

Here is a graph which compares performance:


Some explanation about the experiment. I took a subset of the Netflix data with 3,298,163 movie ratings, 95,526 users, and 3,561 movies. The goal is to cluster users with similar movie preferences together. Both GraphLab and Mahout run on an Amazon m2.xlarge instance.
This machine has 2 cores. I have used the following settings: 250 clusters, 50 clusters and 20 clusters. The algorithm runs a single iteration and then dumps the output into a text file.
For Mahout, I used Mahout's K-Means implementation. GraphLab was run using a single node, while Mahout was run using either one or two nodes. Mahout used 7 mappers and GraphLab 7 threads.

Overall, GraphLab runs between 15x and 40x faster on this dataset.
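
For readers unfamiliar with the algorithm being benchmarked, here is a toy single-threaded sketch of one K-means iteration over sparse user rating vectors (an illustration only, not the GraphLab or Mahout implementation):

#include <vector>
#include <map>
#include <limits>

typedef std::map<int, double> SparseVec;           // movie id -> rating

// squared Euclidean distance between a sparse user vector and a dense center
double sq_dist(const SparseVec& u, const std::vector<double>& c) {
    double d = 0.0;
    for (size_t j = 0; j < c.size(); ++j) d += c[j] * c[j];
    for (SparseVec::const_iterator it = u.begin(); it != u.end(); ++it)
        d += it->second * it->second - 2.0 * it->second * c[it->first];
    return d;
}

// assign each user to its nearest center, then recompute the centers
void kmeans_iteration(const std::vector<SparseVec>& users,
                      std::vector<std::vector<double> >& centers) {
    size_t k = centers.size(), dim = centers[0].size();
    std::vector<std::vector<double> > sums(k, std::vector<double>(dim, 0.0));
    std::vector<int> counts(k, 0);
    for (size_t i = 0; i < users.size(); ++i) {
        size_t best = 0; double bestd = std::numeric_limits<double>::max();
        for (size_t c = 0; c < k; ++c) {
            double d = sq_dist(users[i], centers[c]);
            if (d < bestd) { bestd = d; best = c; }
        }
        counts[best]++;
        for (SparseVec::const_iterator it = users[i].begin(); it != users[i].end(); ++it)
            sums[best][it->first] += it->second;
    }
    for (size_t c = 0; c < k; ++c)
        if (counts[c] > 0)
            for (size_t j = 0; j < dim; ++j) centers[c][j] = sums[c][j] / counts[c];
}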

A second experiment I did was to compare Mahout's LDA performance to GraphLab's LDA.
Here is the graph:
For this experiment, I used an m1.xlarge instance. I tested GraphLab on 4 cores, Mahout on 4 cores and Mahout on 8 cores (2 nodes). I used the same Netflix data subset, this time with 10 clusters. The graph depicts the running time of a single iteration.

Finally, here are performance results of GraphLab LDA with 1, 2, 3 and 4 cores (on an m1.xlarge EC2 instance):

The running time in this case is for 5 iterations.

Thursday, September 8, 2011

More on shutgun (2)

One of the greatest things about writing a blog is getting interesting feedback from the readers. (I mean, if no one reads it - why bother??) Here is an email I got this morning from Steve Lianoglou, a graduate student in the Computational Systems Biology dept., Memorial Sloan-Kettering Cancer Center, Weill Medical College of Cornell University.

Hi,
..
I stumbled on the shotgun library from reading Dr. Bickson's (somehow)
recent blog post:

http://bickson.blogspot.com/2011/08/more-on-shutgun.html


I thought it could come in handy on some of the stuff I'm playing with
and wrote a wrapper for it so us R folks can use it, too ... I mean,
we can't let the MATLAB tribe have all the fun, now, can we?

https://github.com/lianos/buckshot


Some installation and usage instructions are on the wiki:

https://github.com/lianos/buckshot/wiki


R has to be running in 64 bit for you to be able to build and install
the package successfully. It works on my OS X boxes, and our (linux)
compute server, so hopefully it can work for you.

It's a bit rough around the edges (i.e. no documentation), but if
you're familiar with building models in R, you'll know how to use this:

R> library(buckshot)
R> data(arcene)
R> model <- buckshot(A.arcene, y.arcene, 'logistic', lambda=0.166)
R> preds <- predict(model, A.arcene)
R> sum(preds == y.arcene) / length(y.arcene)
[1] 1

Could get worse than 100% accuracy, I guess ...


In time, I hope to get it "easily installable" and push it out to
CRAN, but in the meantime I thought it would be of interest to the
people reading this list in its current form, and to the shotgun
authors (who I'm hoping are also reading this list), even if they
don't use R :-)

Thanks for putting shotgun out in the wild for us to tinker with!


-steve


--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact


Thanks a lot Steve! We really appreciate your efforts. The shotgun code has been significantly improved over the last two weeks. We are looking for more users to beta test it on real data. Write to me if you are trying out our code!

Amazon EC2 Supports Research!


I am very pleased to announce that the GraphLab large scale machine learning project is now supported by Amazon Elastic Compute Cloud (EC2), who allocated us computing time on their cloud. This will allow us to extend compatibility with EC2 and to further scale to larger models.

I want to take this opportunity to thank James Hamilton, VP and Distinguished Engineer at Amazon, who pulled some strings and introduced us to Kurt Messersmith, Senior Manager at Amazon Web Services, who was kind enough to approve our grant request.

By the way, if you are teaching a course about EC2 or you want to apply for a research grant, you can apply here: http://aws.amazon.com/education/

Saturday, September 3, 2011

Understanding Mahout K-Means clustering implementation

This post helps in understanding Mahout's K-Means clustering implementation.
Preliminaries: you should first read the explanation in the link above.

Installation and setup
wget http://apache.spd.co.il//mahout/0.5/mahout-distribution-0.5.zip
unzip mahout-distribution-0.5.zip
cd mahout-distribution-0.5
setenv JAVA_HOME /path.to/java1.6.0/

Running the example
From the Mahout root folder:
./examples/bin/build_reuters.sh

Explanation 
The script build_reuters.sh downloads the Reuters dataset, which is composed of news items.
<46|0>bickson@biggerbro:~/usr7/mahout-distribution-0.5/examples/bin/mahout-work/reuters-out$ ls
reut2-000.sgm-0.txt    reut2-003.sgm-175.txt  reut2-006.sgm-24.txt   reut2-009.sgm-324.txt  reut2-012.sgm-39.txt   reut2-015.sgm-474.txt  reut2-018.sgm-549.txt
reut2-000.sgm-100.txt  reut2-003.sgm-176.txt  reut2-006.sgm-250.txt  reut2-009.sgm-325.txt  reut2-012.sgm-3.txt    reut2-015.sgm-475.txt  reut2-018.sgm-54.txt
....
A typical news item looks like:
26-FEB-1987 15:01:01.79

BAHIA COCOA REVIEW

Showers continued throughout the week in the Bahia cocoa zone, alleviating the drought since early January and improving prospects for the coming temporao, although normal humidity levels have not been restored, Comissaria Smith said in its weekly review.     The dry period means the temporao will be late this year.     Arrivals for the week ended February 22 were 155,221 bags of 60 kilos making a cumulative total for the season of 5.93 mln against 5.81 at the same stage last year. Again it seems th
....

The goal of the method is to cluster similar news items together. This is done by first counting word occurrences using the TF-IDF scheme. Each news item becomes a sparse row in a matrix. Next, the rows are clustered together using the k-means algorithm.
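
For illustration, here is a toy C++ sketch of the TF-IDF weighting step (an illustration only, not Mahout's implementation): each news item becomes a sparse row whose word counts are weighted up when the word is rare across the corpus.

#include <vector>
#include <map>
#include <string>
#include <cmath>

typedef std::map<std::string, int> WordCounts;       // term -> count within one document

std::vector<std::map<std::string, double> >
tfidf(const std::vector<WordCounts>& docs) {
    // document frequency: in how many documents does each term appear?
    std::map<std::string, int> df;
    for (size_t d = 0; d < docs.size(); ++d)
        for (WordCounts::const_iterator it = docs[d].begin(); it != docs[d].end(); ++it)
            df[it->first]++;

    // weight each term count by log(N / document frequency)
    std::vector<std::map<std::string, double> > rows(docs.size());
    for (size_t d = 0; d < docs.size(); ++d)
        for (WordCounts::const_iterator it = docs[d].begin(); it != docs[d].end(); ++it)
            rows[d][it->first] =
                it->second * std::log(double(docs.size()) / df[it->first]);
    return rows;
}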

What happens behind the scenes?
1) mahout seqdirectory is called, to create sequence files containing file name as key, and file content as value.
INPUT DIR: mahout-work/reuters-out/
OUTPUT DIR: mahout-work/reuters-out-seqdir/

2) mahout seq2sparse is called, to create sparse vectors out of the sequence files.
INPUT DIR: mahout-work/reuters-out-seqdir/
OUTPUT DIR: mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/

Inside the output dir, a file called part-r-00000 is created. This is a sequence file which includes an int (row id) as key, and a sparse vector (SequentialAccessSparseVector) as value.

3) mahout kmeans is called, for clustering the sparse vectors into clusters.
INPUT DIR: mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/
INTERMEDIATE OUTPUT DIR: mahout-work/reuters-kmeans-clusters/
OUTPUT DIR:mahout-work/reuters-kmeans/

4) Finally clusterdump converts the clusters into a human readable format
INPUT DIR: mahout-work/reuters-kmeans/
OUTPUT: a text file.

Debugging:
Below you can find some common problems and their solutions.

Problem:
~/usr7/mahout-distribution-0.5$ ./bin/mahout kmeans -i ~/usr7/small_netflix_mahout/ -o ~/usr7/small_netflix_mahout_output/ --numClusters 10 -c ~/usr7/small_netflix_mahout/ -x 10
no HADOOP_HOME set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/mahout-examples-0.5-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Sep 4, 2011 2:10:39 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--clusters=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/, --maxIter=10, --method=mapreduce, --numClusters=10, --output=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_output/, --startPhase=0, --tempDir=temp}
Sep 4, 2011 2:10:39 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Deleting /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout
Sep 4, 2011 2:10:39 AM org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Sep 4, 2011 2:10:39 AM org.apache.hadoop.io.compress.CodecPool getCompressor
INFO: Got brand-new compressor
Sep 4, 2011 2:10:39 AM org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker <init>
WARNING: Problem opening checksum file: file:/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/part-randomSeed.  Ignoring exception: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:180)
    at java.io.DataInputStream.readFully(DataInputStream.java:152)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:134)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:283)
    at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1437)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.<init>(SequenceFileIterator.java:58)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:61)
    at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:87)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:58)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)

Answer: the cluster path and the input path point to the same folder. When the run starts, all files in the cluster path are deleted, so the input file is deleted as well. Change the paths to point to different folders!

Problem:
./bin/mahout kmeans -i ~/usr7/small_netflix_mahout/ -o ~/usr7/small_netflix_mahout_output/ --numClusters 10 -c ~/usr7/small_netflix_mahout_clusters/ -x 10
no HADOOP_HOME set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/mahout-examples-0.5-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Sep 4, 2011 2:15:11 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--clusters=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters/, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/, --maxIter=10, --method=mapreduce, --numClusters=10, --output=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_output/, --startPhase=0, --tempDir=temp}
Sep 4, 2011 2:15:12 AM org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Sep 4, 2011 2:15:12 AM org.apache.hadoop.io.compress.CodecPool getCompressor
INFO: Got brand-new compressor
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
    at java.util.ArrayList.get(ArrayList.java:322)
    at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:108)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:58)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)

Answer: the input file named part-r-00000 is missing in the input folder.

Successful run:
124|0>bickson@biggerbro:~/usr7/mahout-distribution-0.5$ ./bin/mahout kmeans -i ~/usr7/small_netflix_mahout/ -o ~/usr7/small_netflix_mahout_output/ --numClusters 10 -c ~/usr7/small_netflix_mahout_clusters/ -x 10
no HADOOP_HOME set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/mahout-examples-0.5-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Sep 4, 2011 2:19:48 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--clusters=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters/, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/, --maxIter=10, --method=mapreduce, --numClusters=10, --output=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_output/, --startPhase=0, --tempDir=temp}
Sep 4, 2011 2:19:48 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Deleting /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters
Sep 4, 2011 2:19:48 AM org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Sep 4, 2011 2:19:48 AM org.apache.hadoop.io.compress.CodecPool getCompressor
INFO: Got brand-new compressor
Sep 4, 2011 2:19:48 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 2:19:48 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 2:19:48 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 2:19:48 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 2:19:52 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Wrote 10 vectors to /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters/part-randomSeed
Sep 4, 2011 2:19:52 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Input: /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout Clusters In: /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters/part-randomSeed Out: /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_output Distance: org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
Sep 4, 2011 2:19:52 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: convergence: 0.5 max Iterations: 10 num Reduce Tasks: org.apache.mahout.math.VectorWritable Input Vectors: {}
Sep 4, 2011 2:19:52 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: K-Means Iteration 1
Sep 4, 2011 2:19:52 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
INFO: Initializing JVM Metrics with processName=JobTracker, sessionId=
Sep 4, 2011 2:19:52 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
Sep 4, 2011 2:19:53 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0001
Sep 4, 2011 2:19:53 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
Sep 4, 2011 2:19:53 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: io.sort.mb = 100
Sep 4, 2011 2:19:53 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: data buffer = 79691776/99614720
Sep 4, 2011 2:19:53 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: record buffer = 262144/327680
Sep 4, 2011 2:19:53 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 2:19:54 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 0% reduce 0%
Sep 4, 2011 2:19:59 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
Sep 4, 2011 2:20:00 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 80% reduce 0%
Sep 4, 2011 2:20:00 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
INFO: Starting flush of map output
Sep 4, 2011 2:20:02 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
Sep 4, 2011 2:20:03 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 100% reduce 0%
Sep 4, 2011 2:20:05 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
Problem:
no HADOOP_HOME set, running locally
Exception in thread "main" java.lang.ClassFormatError: org.apache.mahout.driver.MahoutDriver (unrecognized class file version)
   at java.lang.VMClassLoader.defineClass(libgcj.so.8rh)
   at java.lang.ClassLoader.defineClass(libgcj.so.8rh)
   at java.security.SecureClassLoader.defineClass(libgcj.so.8rh)
   at java.net.URLClassLoader.findClass(libgcj.so.8rh)
   at java.lang.ClassLoader.loadClass(libgcj.so.8rh)
   at java.lang.ClassLoader.loadClass(libgcj.so.8rh)
   at gnu.java.lang.MainThread.run(libgcj.so.8rh)
Answer: wrong Java version used - you should use 1.6.0 or higher.

Problem:
../../bin/mahout: line 201: /usr/share/java-1.6.0//bin/java: No such file or directory
../../bin/mahout: line 201: exec: /usr/share/java-1.6.0//bin/java: cannot execute: No such file or directory
Answer: JAVA_HOME is pointing to the wrong place. Inside this directory a subdirectory called bin should be present, with an executable named "java" in it.

Problem:
export JAVA_HOME=/afs/cs.cmu.edu/local/java/amd64_f7/jdk1.6.0_16/
cd ~/usr7/mahout-distribution-0.5/ ; ./bin/mahout clusterdump --seqFileDir ~/usr7/small_netflix_mahout_clusters/ --pointsDir ~/usr7/small_netflix_mahout/ --output small_netflix_output.txt
no HADOOP_HOME set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/mahout-examples-0.5-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Sep 4, 2011 4:22:29 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=small_netflix_output.txt, --pointsDir=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout/, --seqFileDir=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_clusters/, --startPhase=0, --tempDir=temp}
Sep 4, 2011 4:22:29 AM org.apache.hadoop.util.NativeCodeLoader 
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Sep 4, 2011 4:22:29 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 4:22:29 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 4:22:29 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Sep 4, 2011 4:22:29 AM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Exception in thread "main" java.lang.ClassCastException: org.apache.mahout.math.VectorWritable cannot be cast to org.apache.mahout.clustering.WeightedVectorWritable
	at org.apache.mahout.utils.clustering.ClusterDumper.printClusters(ClusterDumper.java:171)
	at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:121)
	at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:86)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
make: *** [clusterdump] Error 1
Problem:
export JAVA_HOME=/afs/cs.cmu.edu/local/java/amd64_f7/jdk1.6.0_16/
cd ~/usr7/mahout-distribution-0.5/ ; ./bin/mahout kmeans -i ~/usr7/small_netflix_transpose_mahout/ -o ~/usr7/small_netflix_mahout_transpose_output/ --numClusters 10 -c ~/usr7/small_netflix_mahout_transpose_clusters/ -x 2 -ow -cl
no HADOOP_HOME set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/mahout-examples-0.5-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/bigbrofs/usr7/bickson/mahout-distribution-0.5/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Sep 4, 2011 4:57:44 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--clustering=null, --clusters=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_transpose_clusters/, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_transpose_mahout/, --maxIter=2, --method=mapreduce, --numClusters=10, --output=/mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_transpose_output/, --overwrite=null, --startPhase=0, --tempDir=temp}
Sep 4, 2011 4:57:45 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Deleting /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_transpose_output
Sep 4, 2011 4:57:45 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Deleting /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_mahout_transpose_clusters
Exception in thread "main" java.io.FileNotFoundException: File /mnt/bigbrofs/usr6/bickson/usr7/small_netflix_transpose_mahout does not exist.
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
	at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:69)
	at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:58)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
make: *** [kmeans_transpose] Error 1
Answer: input directory does not exist.

Problem: the clusterdump program runs, but outputs an empty txt file.
Solution: you probably gave the intermediate cluster path of k-means instead of the output path dir. In this case, the program runs and terminates without an error.