Friday, March 30, 2012

Interesting Twitter dataset and other big datasets

I got this from Aapo Kyrola, who got it from Guy Blelloch:

An interesting paper which explores the Twitter social network is:

Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue B. Moon. What is twitter, a social network or a news media? In WWW, pages 591–600, 2010.



The Twitter graph is available for download here. The format is very simple:
user follower\n

The graph has 41M nodes and 1.4 billion edges. What is nice about it is that you can view the profile of each node id using the Twitter web API. For example, for user 12 you can do:
http://api.twitter.com/1/users/show.xml?user_id=12
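
If you want to script this lookup, here is a minimal Python sketch (note that the v1 XML endpoint shown above was current at the time of writing and has since been retired):

   import urllib.request

   # Fetch the Twitter profile behind a numeric node id (user 12 here).
   user_id = 12
   url = "http://api.twitter.com/1/users/show.xml?user_id=%d" % user_id
   with urllib.request.urlopen(url) as resp:
       print(resp.read().decode("utf-8"))
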
Some statistics about the graph are found here.

If you'd like to use it in GraphLab v2, you need to do the following:
1) Assuming the graph file name is user_follower.txt, sort the graph using:
   sort -u -n -k 1,1 -k 2,2 -T . user_follower.txt > user_follower.sorted
2) Add the following matrix market format header to the file (see the sketch below):
   %%MatrixMarket matrix coordinate real general
   61578414 61578414 1468365182
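
Since the edge file is huge, it is best to prepend the header with a streaming copy rather than loading the file into memory. A minimal Python sketch (file names follow the steps above):

   import shutil

   # Prepend the Matrix Market header to the sorted edge list.
   header = ("%%MatrixMarket matrix coordinate real general\n"
             "61578414 61578414 1468365182\n")
   with open("user_follower.sorted") as src, open("user_follower.mtx", "w") as dst:
       dst.write(header)
       shutil.copyfileobj(src, dst)   # stream the 1.4 billion edges after the header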

I am using the k-cores algorithm to reveal the structure of this graph. I will add some results soon.
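
For readers unfamiliar with k-cores: the k-core of a graph is the maximal subgraph in which every node has degree at least k, found by repeatedly peeling away lower-degree nodes. A minimal sketch (not GraphLab's implementation), using networkx on a toy graph:

   import networkx as nx

   G = nx.karate_club_graph()    # tiny stand-in for the Twitter graph
   core = nx.core_number(G)      # node -> largest k with the node in the k-core
   print(max(core.values()))     # the degeneracy of the graph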

And here is a library of webgraphs and other big graphs I got from Kanat Tangwongsan.

Big Data Grants

Scott Kirkpatrick from the Hebrew University of Jerusalem sent me the following. It seems that the Obama administration has allocated $200M in NSF grants for big data analysis:
The Core Techniques and Technologies for Advancing Big Data Science & Engineering (BIGDATA) solicitation aims to advance the core scientific and technological means of managing,  analyzing, visualizing, and extracting useful information from large, diverse, distributed and heterogeneous data sets so as to: accelerate the progress of scientific discovery and innovation; lead to new fields of inquiry that would not otherwise be possible; encourage the development of new data analytic tools and algorithms; facilitate scalable, accessible, and sustainable data infrastructure; increase understanding of human and social processes and interactions; and promote economic growth and improved health and quality of life.
You can read more here.

Mike Draugelis, a strategic planning manager from Lockheed Martin, sent me another related DARPA grant:
The XDATA program seeks to develop computational techniques and software tools for analyzing large volumes of data, both semi-structured (e.g., tabular, relational, categorical, meta-data) and unstructured (e.g., text documents, message traffic).  Central challenges to be addressed include a) developing scalable algorithms for processing imperfect data in distributed data stores, and b) creating effective human-computer interaction tools for facilitating rapidly customizable visual reasoning for diverse missions.

The program envisions open source software toolkits that enable flexible software development supporting users processing large volumes of data in timelines commensurate with mission workflows of targeted defense applications.
The full details are here.

Thursday, March 29, 2012

Farewell Yahoo! Labs

It is very sad to see Yahoo! Labs disintegrating. In the area of machine learning, Yahoo! had AAA researchers. It seems that all the headhunters are now celebrating...

On the other hand, I can't say I am worried about those guys. I am sure they have offers from a zillion other companies.

Tuesday, March 27, 2012

Dan Brickley - previously our man in Amsterdam

I just had a quick chat with Dan, who works on big data analytics, especially in the media context (radio and TV show recommendations). Dan is a researcher on the NoTube EU project, which is ending soon.

Dan is now heading a new initiative for a project proposal to the Knight Foundation. The basic idea is getting resources to connect not-so-technical domain experts (e.g. journalists, analysts) with the new big data platforms that are coming along. You can read some more about his proposal here.

Anyway, we would love to help such an important project that brings big data analytics to a larger audience! Contact Dan if you would like to help promote these ideas.

Saturday, March 24, 2012

Online SVD/PCA resources

Last month I was visiting the Toyota Technological Institute in Chicago, where I was generously hosted by Tamir Hazan and Joseph Keshet. I heard some interesting stuff about large scale SVM from Joseph Keshet, which I reported here. Additionally, I met with Raman Arora, who is working on online SVD. I asked Raman to summarize the state-of-the-art research on online SVD, and here is what I got from him:

Online PCA - the only work (that I am aware of) which comes with a guarantee is the following paper:

Warmuth, Manfred K. and Kuzmin, Dima. Randomized PCA algorithms with regret bounds that are logarithmic in the dimension. In Advances in Neural Information Processing Systems (NIPS), 2006.

Other references on which we build our work include the following classical papers:

E. Oja and J. Karhunen. On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. Journal of Mathematical Analysis and Applications. Volume 106. Pages 69-84. 1985.

Terence D. Sanger. Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks. Volume 2. Pages 459-473. 1989.

and these more recent papers:

Nicol N. Schraudolph, Simon Günter and S. V. N. Vishwanathan. Fast iterative kernel PCA. Advances in Neural Information Processing Systems. 2007.

Kwang In Kim, Matthias O. Franz, and Bernhard Schölkopf. Iterative Kernel Principal Component Analysis for Image Modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence. Volume 27, number 9. Pages 1351-1366. 2005.

Dan Yang, Zongming Ma and Andreas Buja, A sparse SVD method for high-dimensional data, 2011.

Witten, Tibshirani and Hastie, A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis, 2009.

Lee, Shen, Huang and Marron, Biclustering via sparse singular value decomposition, 2010.

There is another recent paper by Dan Yang on the near-optimality of the sparse SVD algorithm she proposed. She described some of her results during a talk she gave at U-Chicago, but I couldn't find a copy of her paper online.
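
To give a flavor of the classical stochastic approximation approach, here is a minimal sketch of Oja's update for tracking the top principal component, in the spirit of the Oja and Karhunen paper above (the step size and synthetic data are illustrative):

   import numpy as np

   def oja_step(w, x, lr=0.01):
       # One online update: w tracks the top eigenvector of E[x x^T].
       y = w @ x                       # project the sample onto w
       w = w + lr * y * (x - y * w)    # Hebbian term with implicit decay
       return w / np.linalg.norm(w)    # renormalize for numerical stability

   rng = np.random.default_rng(0)
   w = rng.normal(size=5)
   w /= np.linalg.norm(w)
   for _ in range(2000):
       # Synthetic stream whose covariance has its top eigenvector on axis 0.
       x = rng.normal(size=5) * np.array([3.0, 1.0, 1.0, 1.0, 1.0])
       w = oja_step(w, x)
   print(w)   # should align (up to sign) with the first coordinate axis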

I am looking forward to reading Raman's paper on online SVD once it is ready.

Friday, March 23, 2012

Large scale machine learning benchmark

About a week ago Joey and I had an interesting talk with Nicholas Kolegraff. Nicholas suggested a very useful initiative: to compile an EC2 Linux distribution with several machine learning frameworks installed, so one can very easily assess their performance on a set of benchmark problems. Nicholas has a preliminary demo, where one can log in to the configured Linux system via his website and run multiple benchmarks using precompiled scripts. Using his system, it is easier for people to evaluate different ML frameworks without the inevitable learning and setup curve of using each system independently. Since we really liked his idea, we invited Nicholas to give a demo at our upcoming GraphLab Bay Area workshop.

Currently, Nicholas includes the following systems in his distribution: GraphLab, Mahout, MADlib and Vowpal Wabbit. He also asked us to help him compile some interesting datasets to test the different systems on.

Not long ago, during our Intel Labs-GraphLab collaboration, the Intel guys also asked for help in creating some kind of baseline benchmark suite for comparing different machine learning methods and implementations. Additionally, Edin Muharemagic, an architect at HPCC Systems, has asked for help in assembling several large datasets for comparing their new SVD solver, part of their new LexisNexis ML library.

Now that it seems many people find the same basic idea useful, we should probably make an effort in this direction. For a start, I am looking into providing some benchmarks for large scale sparse SVD problems. So why not crowdsource this effort with the help of my readers?

As a first try, I contacted Joshua Vogelstein from Johns Hopkins, who kindly offered to donate neural data. Now I am looking for additional researchers who are willing to donate large scale data for real SVD / spectral clustering tasks in different fields. Matrices should range in size from 100M to 10 billion non-zeros. Please email me if you are interested in helping to create a standard benchmarking suite!
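
To make the task concrete, here is a minimal sketch of the computation such a benchmark entry would exercise: a truncated SVD of a sparse matrix stored in Matrix Market format (the file name below is hypothetical):

   import scipy.io
   from scipy.sparse.linalg import svds

   # Load a sparse Matrix Market file and compute its top-10 singular triplets.
   A = scipy.io.mmread("benchmark_matrix.mtx").tocsr()
   U, s, Vt = svds(A, k=10)
   print(s)   # the computed singular values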

[An update]
I just got a note from Andrew Olney of the University of Memphis that he will contribute a Wikipedia term occurrence matrix to the benchmark collection. I also got some Wikipedia term occurrence matrices from Jamie Callan, Brian Murphy, and Partha Talukdar of CMU. Thanks everyone!

The datasets have been added to our GraphLab datasets page here.

GraphLab in Budapest


This week I was visiting the Hungarian Academy of Sciences. I was invited as part of our newly formed collaboration with the LAWA FP7 EU research project. LAWA stands for "Longitudinal Analytics of Web Archive Data". As part of this collaboration, GraphLab will be used for running algorithms on web scale data collected and analyzed by the LAWA project. My host was András Benczúr, who hosted us superbly. Scott Kirkpatrick from the Hebrew University of Jerusalem is heading this collaboration and created the link between the projects.

I gave a talk about GraphLab to a crowd of about 30 researchers, who had a lot of interesting questions and comments. One pleasant surprise was that several people from Gravity showed up at my talk. Gravity was one of the leading teams in the Netflix Prize. Specifically, I had a nice discussion with Levente Török from the Gravity team. Levente installed GraphLab not long ago and has started to play with it.

An interesting piece of research performed as part of the LAWA project is work on join range queries by Peter Triantafillou from the University of Patras, Greece. Join range queries are very useful when querying large datasets. A cool Bay Area startup called Quantifind has a really impressive demo of the kinds of data you can gather using join range queries. Here is a video I got from Ari Tuchman, their co-founder:


While staying in Budapest and "stealing" wireless internet from coffee shops, I was still busy arranging our first GraphLab workshop. Piero P. Bonissone, chief scientist at General Electric Global Research, emailed me and kindly agreed to participate in our program committee. Gilad Shainer from Mellanox, chair of the High Performance Computing Advisory Council (a non-profit organization promoting HPC with over 300 companies and universities participating worldwide), has kindly agreed to help us organize the event.

I got the following encouraging note from John Mark Agosta, a researcher at Toyota InfoTechnology Center USA:

I just came across your blog post on the first Graphlab workshop planned for this summer, and I'd like to join the list of participating individuals & companies.

Let me introduce myself. I have been a member of Intel Research, where I got to know some of the other folks mentioned in the program committee. After leaving Intel Research last year, I've just recently joined a small Bay Area lab that is a Toyota joint venture. I am their machine learning lead. We are planning some efforts this coming year where GraphLab could be employed.

I've heard about GraphLab at UAI and in discussions with other research folks, and I'm excited to see it progress.

I want to deeply thank Piero, Gilad and John, and all the others who are actively helping to promote and organize our workshop - without your great help it would simply not be possible to make it happen.