Friday, March 30, 2012

Interesting Twitter dataset and other big datasets

I got this from Aapo Kyrola, who got it from Guy Blelloch:

An interesting paper which explores the Twitter social network is:

Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue B. Moon. What is twitter, a social network or a news media? In WWW, pages 591–600, 2010.



The Twitter graph is available for download from here. The format is very simple:
user follower\n

The graph has 41M nodes and 1.4 billion edges. What is nice about it is that you can view the profile of each node id using the Twitter web API. For example, for user 12 you can do:
http://api.twitter.com/1/users/show.xml?user_id=12
Some statistics about the graph are found here.
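For the curious, here is a minimal Python sketch for looking up a profile by numeric id. It assumes the v1 XML endpoint quoted above is still reachable (it may require authentication or have been retired since this post was written):

import urllib.request

def fetch_profile(user_id):
    # Fetch the raw XML profile for a numeric Twitter user id (v1 endpoint from above).
    url = "http://api.twitter.com/1/users/show.xml?user_id=%d" % user_id
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")

print(fetch_profile(12))   # the same example id as above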

If you would like to use it in GraphLab v2, you need to do the following:
1) Assuming the graph file name is user_follower.txt, sort the graph using:
   sort -u -n -k 1,1 -k 2,2 -T . user_follower.txt > user_follower.sorted
2) Add the following Matrix Market format header to the file (a small sketch for prepending the header follows this list):
   %%MatrixMarket matrix coordinate real general
   61578414 61578414 1468365182
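Here is a minimal Python sketch of step 2, assuming the sorted file from step 1 is named user_follower.sorted; the output name user_follower.mtx is my own choice:

# Prepend the Matrix Market header lines to the sorted edge list.
header = "%%MatrixMarket matrix coordinate real general\n"
dims = "61578414 61578414 1468365182\n"   # rows, columns, non-zeros, as quoted above

with open("user_follower.sorted") as src, open("user_follower.mtx", "w") as dst:
    dst.write(header)
    dst.write(dims)
    for line in src:    # copy the edges through unchanged
        dst.write(line)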

I am using the k-cores algorithm to reveal the structure of this graph. I will add some results soon.
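For readers unfamiliar with k-cores: the k-core is the maximal subgraph in which every node has degree at least k, and it is computed by repeatedly peeling off nodes whose degree falls below k. Here is a tiny in-memory Python sketch of that idea (my own illustration, not the GraphLab code, and clearly not meant for a 1.4 billion edge graph):

from collections import defaultdict

def k_core(edges, k):
    # Build an undirected adjacency structure, then repeatedly remove
    # nodes whose degree drops below k until no such node remains.
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    changed = True
    while changed:
        changed = False
        for node in list(adj):
            if len(adj[node]) < k:
                for nbr in adj.pop(node):
                    adj[nbr].discard(node)
                changed = True
    return set(adj)   # nodes surviving in the k-core

# Tiny usage example: a triangle plus one dangling node.
print(k_core([(1, 2), (2, 3), (3, 1), (3, 4)], k=2))   # -> {1, 2, 3}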

And here is a library of webgraphs and other big graphs I got from 
Kanat Tangwongsan.

Big Data Grants

Scott Kirkpatrick from the Hebrew University of Jerusalem sent me the following. It seems that the Obama administration has allocated a $200M NSF grant for big data analysis:
The Core Techniques and Technologies for Advancing Big Data Science & Engineering (BIGDATA) solicitation aims to advance the core scientific and technological means of managing,  analyzing, visualizing, and extracting useful information from large, diverse, distributed and heterogeneous data sets so as to: accelerate the progress of scientific discovery and innovation; lead to new fields of inquiry that would not otherwise be possible; encourage the development of new data analytic tools and algorithms; facilitate scalable, accessible, and sustainable data infrastructure; increase understanding of human and social processes and interactions; and promote economic growth and improved health and quality of life.
You can read more here.

Mike Draugelis, a strategic planning manager from Lockheed Martin, sent me another related DARPA grant:
The XDATA program seeks to develop computational techniques and software tools for analyzing large volumes of data, both semi-structured (e.g., tabular, relational, categorical, meta-data) and unstructured (e.g., text documents, message traffic).  Central challenges to be addressed include a) developing scalable algorithms for processing imperfect data in distributed data stores, and b) creating effective human-computer interaction tools for facilitating rapidly customizable visual reasoning for diverse missions.

The program envisions open source software toolkits that enable flexible software
development supporting users processing large volumes of data in timelines commensurate with mission workflows of targeted defense applications. 
The full details are here.

Thursday, March 29, 2012

Farewell Yahoo! Labs

It is very sad to see Yahoo! Labs disintegrating. In the area of machine learning, Yahoo! had AAA researchers. It seems that all the head hunters are now celebrating...

On the other hand, I can't say I am worried for those guys. I am sure they have offers from a zillion other companies.

Tuesday, March 27, 2012

Dan Brickley - previously our man in Amsterdam

I just had a quick chat with Dan, who works on big data analytics, especially in the media context (radio and TV show recommendations). Dan is a researcher on the NoTube EU project, which is ending soon.

Dan is now heading a new initiative for a project proposal to the Knight Foundation. The basic idea is getting resources to connect not-so-technical domain experts
(e.g. journalists, analysts) with the new big data platforms that are
coming along. You can read some more about his proposal here.

Anyway, we would love to help such an important project that brings big data analytics to a larger audience! Contact Dan if you would like to help promote these ideas.

Saturday, March 24, 2012

Online SVD/PCA resources

Last month I was visiting the Toyota Technological Institute at Chicago, where I was generously hosted by Tamir Hazan and Joseph Keshet. I heard some interesting stuff about large scale SVM from Joseph Keshet, which I reported on here. Additionally, I met with Raman Arora, who is working on online SVD. I asked Raman to summarize the state-of-the-art research on online SVD, and here is what I got from him:

Online PCA - the only work (that I am aware of) which comes with a guarantee is the following paper:

Warmuth, Manfred K. and Kuzmin, Dima. Randomized PCA algorithms with regret bounds that are logarithmic in the dimension. In Advances in Neural Information Processing Systems (NIPS), 2006.

Other references on which we build our work include the following classical papers:

E. Oja and J. Karhunen. On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. Journal of Mathematical Analysis and Applications. Volume 106. Pages 69-84. 1985.

Terence D. Sanger. Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks. Volume 2. Pages 459-473. 1989.

and these more recent papers:

Nicol N. Schraudolph, Simon Günter and S. V. N. Vishwanathan. Fast iterative kernel PCA. Advances in Neural Information Processing Systems. 2007.

Kwang In Kim, Matthias O. Franz, and Bernhard Schölkopf. Iterative Kernel Principal Component Analysis for Image Modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence. Volume 27, Number 9. Pages 1351-1366. 2005.

Dan Yang, Zongming Ma and Andreas Buja, A sparse SVD method for high-dimensional data, 2011.

Witten, Tibshirani and Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. 2009.

Lee, Shen, Huang and Marron, Biclustering via sparse singular value decomposition, 2010.

There is another recent paper by Dan Yang on the near optimality of a sparse SVD algorithm that she proposed. She described some of her results during a talk she gave at U-Chicago, but I couldn't find a copy of her paper online.

I am looking forward to reading Raman's paper on online SVD once it is ready.

Friday, March 23, 2012

Large scale machine learning benchmark

About a week ago Joey and I had an interesting talk with Nicholas Kolegraff. Nicholas suggested a very useful initiative: to compile an EC2 Linux distribution with several machine learning frameworks installed, so one can very easily assess their performance on a set of benchmark problems. Nicholas has a preliminary demo, where one can log into the configured Linux system via his website and run multiple benchmarks using precompiled scripts. Using his system, it will be easier for people to evaluate different ML frameworks without the inevitable learning and setup curve of using each system independently. Since we really liked his idea, we invited Nicholas to give a demo at our upcoming GraphLab Bay Area workshop.

Currently, Nicholas includes the following systems in his distribution: GraphLab, Mahout, MADlib and Vowpal Wabbit. He further asked us to help him compile some interesting datasets to test the different systems on.

Not long ago, during our Intel Labs-GraphLab collaboration, the Intel guys also asked for help in creating some kind of baseline benchmark suite for comparing different machine learning methods and implementations. Additionally, Edin Muharemagic, an architect at HPCC Systems, has asked for help in assembling several large datasets for comparing their new SVD solver, which is part of their new LexisNexis ML library.

Now that it seems many people find the same basic idea useful, we should probably make an effort in this direction. For a start, I am looking into providing some benchmarks for large scale sparse SVD problems. So why not crowdsource this effort with the help of my readers?

As a first try, I contacted Joshua Vogelstein from Johns Hopkins, who kindly offered to donate neural data. Now I am looking for some additional researchers who are willing to donate large scale data for real SVD / spectral clustering tasks in different fields. Matrices should have between 100M and 10 billion non-zeros. Please email me if you are interested in helping create a standard benchmarking suite!

[An update]
I just got a note from Andrew Olney, from the University of Memphis, saying that he will contribute a Wikipedia term occurrence matrix to the benchmark collection. I also got some Wikipedia term occurrence matrices from Jamie Callan, Brian Murphy, and Partha Talukdar of CMU. Thanks everyone!

The datasets have been added to our GraphLab datasets page here.

GraphLab in Budapest

 

This week I was visiting the Hungarian Academy of Sciences. I was invited as part of our newly formed collaboration with the Lawa FP7 EU research project. Lawa stands for "Longitudinal Analytics of Web Archive Data". As part of this collaboration, GraphLab will be used for running algorithms on web scale data collected and analyzed by the Lawa project. My host was András Benczúr, who hosted us superbly. Scott Kirkpatrick from the Hebrew University of Jerusalem is heading this collaboration and created the link between the projects.

I gave a talk about GraphLab to a crowd of about 30 researchers, with a lot of interesting questions and comments. One pleasant surprise was that several people from Gravity showed up at my talk. Gravity was one of the leading teams in the Netflix prize. Specifically, I had a nice discussion with Levente Török from the Gravity team. Levente installed GraphLab not long ago and has started to play with it.

An interesting piece of research performed as part of the Lawa project, by Peter Triantafillou from the University of Patras, Greece, concerns join range queries. Join range queries are very useful when querying large datasets. A cool Bay Area startup called Quantifind has a really impressive demo of the kinds of data you can gather using join range queries. Here is a video I got from Ari Tuchman, their co-founder:


While staying in Budapest, and "stealing" wireless internet from some coffee shops, I was still busy arranging our first GraphLab workshop. Piero P. Bonissone, chief scientist at General Electric Global Research, emailed me and kindly agreed to participate in our program committee. Gilad Shainer from Mellanox, chair of the High Performance Computing Advisory Council (a non-profit organization promoting HPC with over 300 companies and universities worldwide participating), has kindly agreed to help us organize the event.

I got the following encouraging note from John Mark Agosta, a researcher at the Toyota InfoTechnology Center USA:

I just came across your blog post on the first Graphlab workshop planned for this summer, and I'd like to join the list of participating individuals & companies.

Let me introduce myself. I have been a member of Intel Research, where I got to know some of the other folks mentioned in the program committee. After leaving Intel Research last year, I've just recently joined a small Bay Area lab that is a Toyota joint venture. I am their machine learning lead. We are planning some efforts this coming year where GraphLab could be employed.

I've heard about GraphLab at UAI and in discussions with other research folks, and I'm excited to see it progress.
I want to deeply thank Piero, Gilad and John, and all the others who are actively helping to promote and organize our workshop - without your great help it simply would not have been possible to make it happen.

Sunday, March 18, 2012

How to prepare your problem to be used in GraphLab / GraphLab parsers

In many cases your data is a collection of strings, and you would like to convert it to numerical form so it can be used in any machine learning software.

In GraphLab v2, I have added a parsers library that will hopefully make this task easier.
Let's start with an example. Suppose I have a collection of documents, and each document contains a bag of words. The input to the GraphLab parser is:

1::the boy with big hat was here
2::no one would have believe in the last years of the nineteenth century

where 1 and 2 are the numeric document ids, '::' is used as a separator, and the rest of the line contains the keywords that appear in that document.

Assuming you have this format, it is very easy to convert it for use in GraphLab: you simply use the texttokenparser application.
Preliminaries: you will need to install GraphLab v2 (see the explanation under the installation section here).

And here is an example run of the parser:

./texttokenparser --dir=./ --outdir=./ --data=document_corpus.txt --gzip=false --debug=true
WARNING:  texttokenparser.cpp(main:209): Eigen detected. (This is actually good news!)
INFO:     texttokenparser.cpp(main:211): GraphLab parsers library code by Danny Bickson, CMU
Send comments and bug reports to danny.bickson@gmail.com
Currently implemented parsers are: Call data records, document tokens 
Schedule all vertices
INFO:     sweep_scheduler.hpp(sweep_scheduler:124): Using a random ordering of the vertices.
INFO:     io.hpp(gzip_in_file:698): Opening input file: ./document_corpus.txt
INFO:     io.hpp(gzip_out_file:729): Opening output file ./document_corpus.txt.out
Read line: 1 From: 1 To: 1 string: the
Read line: 1 From: 1 To: 2 string: boy
Read line: 4 From: 1 To: 3 string: with

INFO:     texttokenparser.cpp(operator():159): Parsed line: 50000 map size is: 30219
INFO:     texttokenparser.cpp(operator():159): Parsed line: 100000 map size is: 39510
INFO:     texttokenparser.cpp(operator():159): Parsed line: 150000 map size is: 45200
INFO:     texttokenparser.cpp(operator():159): Parsed line: 200000 map size is: 50310
INFO:     texttokenparser.cpp(operator():164): Finished parsing total of 230114 lines in file document_corpus.txt
total map size: 52655
Finished in 17.0022
Total number of edges: 0
INFO:     io.hpp(save_map_to_file:813): Save map to file: ./.map map size: 52655
INFO:     io.hpp(save_map_to_file:813): Save map to file: ./.reverse.map map size: 52655

The output of the parser:
1) A text file containing consecutive integers in sparse matrix market format. In other words, each string is assigned an id, and a sparse matrix is formed where the rows are the document numbers and the non-zero columns are the strings.
NOTE: currently you will need to manually create the two header lines as explained here. The header lines specify the number of rows, columns and non-zero entries in the matrix. In the future I will automate this process.
2) A mapping from each text keyword to its matching integer.
3) A mapping from each integer to its matching string.
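To make the transformation concrete, here is a rough, purely illustrative Python sketch of the same idea. It is not the texttokenparser code (and, unlike the parser, it also writes the two header lines itself), so for real workloads use the GraphLab application described above:

# toy_tokenparser.py - illustrative only; the real work is done by texttokenparser.
# Reads lines of the form "docid::word1 word2 ...", assigns each distinct word an
# integer id, and writes a sparse matrix market file plus a word -> id mapping.

def parse_corpus(in_path, out_path, map_path):
    word_ids = {}          # word -> integer id (1-based, to match matrix market)
    entries = []           # (doc id, word id, count) triples
    max_doc = 0
    with open(in_path) as f:
        for line in f:
            if "::" not in line:
                continue   # skip malformed or empty lines
            doc_str, _, words = line.rstrip("\n").partition("::")
            doc_id = int(doc_str)
            max_doc = max(max_doc, doc_id)
            counts = {}
            for w in words.split():
                wid = word_ids.setdefault(w, len(word_ids) + 1)
                counts[wid] = counts.get(wid, 0) + 1
            entries.extend((doc_id, wid, c) for wid, c in counts.items())
    with open(out_path, "w") as out:
        out.write("%%MatrixMarket matrix coordinate real general\n")
        out.write("%d %d %d\n" % (max_doc, len(word_ids), len(entries)))
        for doc_id, wid, c in entries:
            out.write("%d %d %d\n" % (doc_id, wid, c))
    with open(map_path, "w") as m:   # word -> id mapping; invert it for id -> word
        for w, wid in word_ids.items():
            m.write("%s\t%d\n" % (w, wid))

parse_corpus("document_corpus.txt", "document_corpus.mtx", "document_corpus.map")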

Advanced options:
1) It is possible to parse multiple files in parallel (on a multicore machine) and still have the ids assigned consistently. Use the --filter= command line argument to select all files starting with a certain prefix. Do not use the --data= command line argument in that case.
2) Support for gzip input format, using the --gzip=true command line option.
3) Save the mappings into a readable text file using the --save_in_text=true command line argument.
4) Incrementally add more documents to an existing map by using the --load=true command line flag.
5) Limit the number of parsed lines using the --lines=XX command line flag (useful for debugging!).
6) Enable verbose mode using the --debug=true command line flag.

Monday, March 12, 2012

Principal components wanted

Here is quite a funny joke. A startup company called Principal Components wanted links to my blog. Now what is funny about it? The title. It says: today's tools are just too complicated and painful to be used by non-experts. And then they link to my explanation of how to debug Hadoop (using the "painful" keyword).. :-) My first reaction is that they are completely right:
1) I am an expert
2) Hadoop is too complicated..
If those tools were simple to use, no one would have read my blog. The second funny thing is that if they had linked to any of my GraphLab instructions, I would have been more upset.. But I also agree that GraphLab is still too complicated and advanced relative to the way we want it to be. Anyway, I wish those guys a lot of luck; with the explosion of ML related startups it is not going to be an easy breakthrough.

GraphLab svd v. 2 tutorial

This blog post is outdated. The new SVD instructions have moved here:
http://docs.graphlab.org/collaborative_filtering.html

Sunday, March 11, 2012

GraphLab v2 @ Big Learning Workshop

It took some time, but finally Carlos Guestrin's GraphLab version 2 talk, given at the Big Learning Workshop, is online. Enjoy!

Open Connectome Project

A couple of days ago I sent out an initial announcement about our planned GraphLab workshop and immediately I started getting a lot of interesting feedback from my blog readers.

Joshua Vogelstein, a researcher at the Dept. of Applied Mathematics & Statistics, Johns Hopkins University, just sent me a note saying he would very much like to participate in our workshop. Joshua is part of the Open Connectome project, a very interesting project in the area of neuroscience. The project's mission is to allow open access to neuro data for researchers worldwide. Here are some examples of the data they are hosting:

I have no clue what the above pictures mean (although I must admit they look pretty cool)!.. so I asked Joshua to describe in a little more detail the problems he is working on. This is what I got from him:
We have two very different kinds of data:
1) EM Connectomes - each dataset is a volumetric image of part of some animal brain, ranging in size from 1GB to 10TB. You can look at the data in 2D here.
The first project is 10TB. We also designed a RESTful interface to facilitate anybody downloading and processing the data; the instructions for using it are here.
Another thing that we have, but haven't yet provided the documentation for, is an annotation database. The idea is that anybody should be able to download some volume, annotate it, and upload it back to the server. We collect and store all the annotations, and can combine them to obtain meta-annotations. I expect that we'll release details for the annotation database in a week or so.

2) MR Connectomes - these are essentially multimodal images of human brains, including both time-varying and non-time-varying. The "multi" part of multimodal means that for each subject we have a number of different kinds of images. Our plan for what to do with this stuff is here. Currently, we are organizing the data and pre-processing it. The output of the preprocessing will be, for each subject (there are a few thousand of them), a graph with O(10,000) vertices and O(100,000) edges. Our vertices are attributed; in particular, each vertex has a 3D position as well as a whole time-series associated with it. We will implement a kind of spectral clustering on each graph (see this manuscript for the theoretical results of our algorithm).

Another interesting aspect of Joshua's work is that he is part of the Institute for Data Intensive Engineering and Science, which has a 5PB "Data-Scope".

Joshua is interested in exploring GraphLab, as a first step, for spectral clustering of brain image graphs. I promised to help him utilize our SVD and K-means solvers and try them out on some of his data. I am looking forward to meeting Joshua at our workshop. I also think it would be very interesting if he could give a quick talk describing some of the challenges he is facing and what is needed from GraphLab to help him solve them.

Friday, March 9, 2012

Some interesting projects

...that were recently brought to my attention.

I got this from Oren Dobzinski:
Cassovary - a new graph processing library from Twitter. Pankaj Gupta from Twitter is involved in this project. Since he is also a program committee member of our GraphLab workshop, it will be very interesting to learn about this new system.

Ted Willke from Intel Labs brought to my attention the following project:
Galois - executes serial C++ code in parallel on multicore machines.

Mika Illouz from Eigendog recommended the following projects:
Ciel - yet another distributed computing abstraction.
Julia - a new programming language that aims to be R/Matlab for clusters.

Definitely a lot of activity in this domain.

Wednesday, March 7, 2012

Large scale SVM (support vector machine)

Not long ago I had the pleasure of visiting the Toyota Technological Institute at Chicago.
I had an interesting meeting with Joseph Keshet, where we discussed the best way to deploy large scale kernel SVM.

According to Joseph, linear SVM is a pretty much solved problem. State-of-the-art solutions include Pegasos and SVMLight (see Joachims 2006).
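As a reminder of why the linear case is considered easy, here is a minimal Python sketch of the Pegasos update (stochastic sub-gradient descent on the regularized hinge loss; the bias term and the optional projection step are omitted):

import numpy as np

def pegasos(X, y, lam=0.01, iters=10000, seed=0):
    # Minimal Pegasos: at step t pick a random example, use step size 1/(lam*t),
    # and apply the sub-gradient of the regularized hinge loss.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, iters + 1):
        i = rng.integers(n)
        eta = 1.0 / (lam * t)
        if y[i] * X[i].dot(w) < 1.0:
            w = (1.0 - eta * lam) * w + eta * y[i] * X[i]
        else:
            w = (1.0 - eta * lam) * w
    return w

# Tiny usage example on separable synthetic data with labels in {-1, +1}.
X = np.vstack([np.random.randn(50, 2) + 2, np.random.randn(50, 2) - 2])
y = np.array([1] * 50 + [-1] * 50)
w = pegasos(X, y)
print(np.mean(np.sign(X.dot(w)) == y))   # training accuracy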

Recently, Joseph has worked on large scale SVMs on two fronts: GPU and MPI.
The GPU kernelized (not linear) solver can be found here; the paper appeared in KDD 2011.
Here is an image depicting the nice speedup they got:

The second SVM activity by Joseph is how to quickly approximate the kernel matrix using a Taylor series expansion. This work is presented in their arxiv paper. Evaluating the kernel matrix is a major overhead in SVM implementations, especially because it is dense. Previously, I implemented a large scale SVM solver on up to 1024 cores of an IBM BlueGene supercomputer. About 90% of the time was spent on evaluating the kernel matrix. I wish I had had fancy techniques like the ones Joseph proposes for quickly approximating the kernel matrix...
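To make that overhead concrete, here is a tiny Python sketch (my own illustration, not Joseph's approximation method) of evaluating a dense RBF kernel matrix; the quadratic number of entries is exactly what consumed about 90% of the time in the BlueGene runs mentioned above:

import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    # Dense RBF (Gaussian) kernel matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2).
    # For n samples this is O(n^2 * d) work and O(n^2) memory, which is why
    # kernel evaluation dominates large scale kernel SVM training.
    sq_norms = np.sum(X * X, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

# Tiny usage example: 1000 random points in 20 dimensions.
K = rbf_kernel_matrix(np.random.rand(1000, 20), gamma=0.5)
print(K.shape)   # (1000, 1000)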