Large Scale Machine Learning and Other Animals: March 2014

Monday, March 31, 2014

Last day for enjoying early bird discount for our 3rd GraphLab Conference!

We have just started to organize our 3rd user conference on Monday July 21, 2014 at the Nikko Hotel, SF. This is a very preliminary notice to attract companies and universities that would like to be involved. We are planning a mega event this year with around 800-900 data scientists attending, on the topic of graph analytics and large scale machine learning.

The conference is a non-profit event held by GraphLab.org to promote applications of large scale graph analytics in industry. We invite talks from all major state-of-the-art systems for graph processing, graph databases and large scale data analytics and machine learning. We are looking for sponsors who would like to contribute to the event organization.

Preliminary talks:

Reynold Xin, co-Founder of Databricks will present Machine Learning and Graph Computation on Spark
Wes McKinney, Founder & CEO of DataPad – TBA
Prof. Carlos Guestrin, Founder & CEO of GraphLab will present GraphLab
Prof. Vahab Mirrokni from the Google Research New York team – ASYMP: Fault-tolerant Graph Mining via ASYnchronous Message Passing
Prof. Joe Hellerstein, Founder & CEO of Trifacta – TBA
Tao Ye, Senior Scientist, Pandora Internet Radio – TBA
Josh Wills, Director of Data Science at Cloudera – TBA
Milind Bhandarkar, Chief Scientist at Pivotal – The Zoo Expands: Labrador 💛 Elephant thanks to Hamster
Dr. Markus Weimer, Microsoft Research – REEF: Towards a Big Data stdlib
Karthik Ramachandran and Erick Tryzelaar, Lab41: Dendrite large scale graph analytics

Preliminary demos:

	Dr. Ari Tuchman: Beyond Sentiment and Buzz: Extracting the Answers that Matter Through Predictive Correlations from Unstructured Chatter
	TBA
	TBA
	Paul Hoffman: Large Scale Machine Learning on Sparse Graphs
	Dr. Jans Aasman, CEO, Franz Inc. Drag and Drop Graph Query Generator
	Dr. Zhisong Fu, Mike Personick, and Bryan Thompson: Ultra fast graph mining on GPUs.
	TBA
	TBA
	Tristan Zajonc and Anand Patil: Sense
	Dr. David Talby: Beyond ML basics: Localized, evolving, hybrid & automated modeling at scale
	Simon Chan: An Open Source Machine Learning Server for Developers
	TBA
	Adam Fuchs, CTO Sqrrl: How To Build Secure, Massively Scalable Graphs with Sqrrl
	Dr. Steven Hillion, Alpine Data Labs: Fast classification algorithms on Hadoop
	Jacob Nelson: Grappa graph engine
	Prof. Joshua Bloom, wise.io: Machine-learning Driven Automated Insight Workflows
	Dr. Matthias Broecheler, Titan – Scalable Graph Computing in Real-time and Offline
	TBA
	Prof. Eric Xing: Petuum – a new distributed machine learning framework

	Corey Lanum, General Manager of North America, Cambridge Intelligence: How to make useful interactive graph visualizations
	Dr. Ira Cohen, HP Software: Scaling the data scientist
	Brendan Madden, Tom Sawyer Software: TBA
	Dr. Jason Riedy, Georgia Tech: STING: High-Performance Analysis for Streaming Graph Data
	Dr. Hassan Chafi, Oracle: Graph Analytics Research at Oracle Labs
	Dr. Achim Rettinger, EPPICS: Cross-lingual Cross-modal Analytics of Dynamic Graphs
	TBA
	Corinna Bahr, Continuum.io: Agile Data Exploration & Visualization with Blaze and Bokeh
	Leo Meyerovich, Graphistry: Scaling Visualization with Design and GPUs
	Nick Elprin: Domino Data Labs
	Dr. Fernando Perez, Berkeley: IPython: from interactive computing to computational narratives
	Dr. Linas Baltrunas and Dr. Dionysos Logothetis, Telefonica Research: Grafos.ml: Tools for large scale ML and graph analysis
	Ms. Raquel Pau, Sparsity Technologies: Tweeticer, Social Network Analysis with graphs using Sparksee.
	SriSatish Ambati, co-founder and CEO of 0xdata: TBA
	Jonathan Dinu, CTO Zipfian Academy: TBA
	Demian Bellumio, COO Senzari: MusicGraph
	Sutanay Choudhury, Pacific Northwest National Lab: M&Ms4Graphs: Multi-scale, Multi-dimensional Graph Analytics Tools for Cyber-Security
	Michael Zeller, CEO Zementis: Accelerate predictive analytics with massively parallel scoring
	Sébastien Heymann CEO and Jean Villedieu Co-founder, Linkurious: How can graph visualization help understand graphs faster?
	Richard Socher, Stanford: etcML project
	MongoDB: TBA
	Amit Moran, Crosswise: TBA

Tuesday, March 25, 2014

Graphs are everywhere - and now food graph!

There isn't a single day where I don't hear about a new system, or an academic project that is utilizing graphs for getting additional insights out of the data. Today I heard about an interesting study of taste from Prof. Alon Ben-Ari, Director Medical Informatics Fellowship Program, University of Washington - VA Medical Center:

Ahn, Yong-Yeol, Sebastian E. Ahnert, James P. Bagrow, and Albert-László Barabási. "Flavor network and the principles of food pairing." Scientific reports 1 (2011).

This paper analyses connections between recipe components using graphs. For each pair of ingredients that appear together in a recipe, a graph edge is created.
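
The edge construction described above can be sketched in a few lines of Python. This is only a minimal illustration of the co-occurrence idea as stated in this post, not the paper's actual pipeline; the function name `build_cooccurrence_graph` and the toy recipes are made up for the example.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_graph(recipes):
    """Map each unordered ingredient pair to the number of recipes containing both."""
    edges = Counter()
    for ingredients in recipes:
        # Deduplicate and sort so every pair has a single canonical orientation.
        unique = sorted(set(ingredients))
        for a, b in combinations(unique, 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical toy recipes, not data from the paper.
recipes = [
    ["tomato", "basil", "mozzarella"],
    ["tomato", "basil", "garlic"],
    ["garlic", "butter"],
]
graph = build_cooccurrence_graph(recipes)
for (a, b), weight in sorted(graph.items()):
    print(a, b, weight)
```

Keeping the count as an edge weight is an assumption of this sketch; it makes it easy to drop rare pairings before any further analysis.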

Monday, March 17, 2014

Pivotal backs up GraphLab as part of its HD offering

Fresh news just announced:

Pivotal HD 2.0 expands analytic use cases with integration and support of GraphLab, MADlib, and popular languages and formats such as R, Python, Java, and Parquet to create a powerful and easy to use analytical platform for data scientists and analysts in Hadoop.

...
Also new within Pivotal HD is the world's first enterprise integration of GraphLab, an advanced set of algorithms for graph analytics that enables data scientists and analysts to leverage popular algorithms for insight, i.e. page rank, collaborative filtering and computer vision.

Anyone who wants to learn more about the Pivotal and GraphLab integration should attend our 3rd GraphLab conference, where Milind Bhandarkar, Chief Scientist at Pivotal, will give a talk titled: "The Zoo Expands: Labrador ♥ Elephant thanks to Hamster"

Friday, March 14, 2014

Spotlight: SiSense

Ben Lorica, our man at O'Reilly Media, sent me a link to this interesting Israeli company. It seems they do in-memory and out-of-core computation on a single multicore machine to scale to large datasets, producing statistics which are turned into web reports. Here is their demo video:

According to their website they have customers such as eBay and NASA.

A very impressive performance is demonstrated here: 10TB of parsed data in 10 seconds on a $10,000 server.

Related blog posts: HP Software's Titan system, Alpine Data Labs.

Thursday, March 13, 2014

Spotlight: 0xdata

Just learned about the 0xdata open source project (pronounced hex-data). It has a library called H2O for predictive analytics that can work either standalone or on top of Hadoop MapReduce. H2O has interfaces to Scala and Java, and also R.

So far H2O supports generalized linear models, decision trees and K-means clustering. According to their website they have Netflix and Trulia as customers.

Anyone who is interested in learning more about 0xdata is welcome to attend our 3rd GraphLab Conference where 0xdata will give a demo of their H2O library.

Wednesday, March 12, 2014

Spotlight: Ravel Law - introducing graph analytics to law research

Pranav Singh, a data scientist working with Ravel Law on analyzing law-related datasets, reached out to me. Applying big data analytics to court decisions seems like an interesting vertical.

Recently Ravel Law started to incorporate graph data into their analysis. While some of their research is proprietary, they were kindly willing to share some published results. Pranav sent me a paper by Fowler named "Network Analysis and the Law". It shows that using a basic PageRank-style algorithm (hub / authority scores) you can get very deep insights into Supreme Court decisions and their importance. The algorithm is rather basic but the applications for the law vertical are rather new, at least to me.
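
To give a feeling for how hub and authority scores behave on a citation graph, here is a minimal sketch of the classic iterative hub/authority update. It is not the code or the data from the paper; the function name `hits`, the case labels and the iteration count are assumptions made for illustration.

```python
def hits(citations, iterations=50):
    """citations maps each case to the list of cases it cites."""
    nodes = set(citations)
    for cited in citations.values():
        nodes.update(cited)
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of a case: sum of the hub scores of the cases citing it.
        auth = {n: 0.0 for n in nodes}
        for src, cited in citations.items():
            for dst in cited:
                auth[dst] += hub[src]
        # Hub score of a case: sum of the authority scores of the cases it cites.
        hub = {n: sum(auth[d] for d in citations.get(n, [])) for n in nodes}
        # Normalize both vectors so the scores stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {n: v / a_norm for n, v in auth.items()}
        hub = {n: v / h_norm for n, v in hub.items()}
    return hub, auth


# Hypothetical citation graph: later cases C, D, E cite earlier cases A and B.
citations = {"C": ["A", "B"], "D": ["A"], "E": ["A", "B"]}
hub, auth = hits(citations)
print(max(auth, key=auth.get))  # the most-cited precedent gets the top authority score
```

In this toy graph the case cited by everyone ends up as the strongest authority, and the cases citing the most authoritative precedents become the strongest hubs.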

Sunday, March 9, 2014

University of Waterloo evaluates GraphLab vs. other BSP systems

Just got a link to a blog post which details experiments done at the University of Waterloo by a master's student, Prashant Raghav. The experiments compare Giraph, GPS, GraphLab and GraphChi on both memory footprint and runtime. If you would like to know which system performs better, you should read the blog post.

Friday, March 7, 2014

Spotlight: Demian Bellumio - MusicGraph

I recently connected with Demian Bellumio, COO of Senzari, a Miami/San Francisco based startup working on mining music data using graphs. One of their interesting projects is MusicGraph, an ambitious project which collects all available information about music, including lyrics, signal processing information about the tracks, performers, social media metrics, broadcast radio plays, user playlists, etc. The outcome is a huge graph (1B edges, 600M vertices, 7B different properties) and an open graph API people can use to traverse and query the graph. Through the API developers can easily add musical graph search to their services, generate personalized playlist recommendations, and access a large number of low-level features and data, such as acoustic/lyrical features and social stats on artists and songs.

A related project is Wahwah Networks, an embeddable web bar which lets websites offer personalized Internet radio to their audience for free, increasing their time on site and generating new ad revenue for the publishers.

MusicGraph is using Titan as their graph database infrastructure.

If you are interested in learning more about MusicGraph, you should attend their demo at our 3rd GraphLab conference.
Further reading: a Gigaom blog post about MusicGraph

Spotlight: Gilad Lotan - Betaworks

Just connected this week with Gilad Lotan, Chief Scientist at Betaworks. Gilad has a CS background and previously worked at Microsoft on social data analysis, especially Twitter and Facebook data. Gilad is now at Betaworks, a VC and incubator in NY. One of their best-known companies so far is Chartbeat, which turns website monitoring from a boring task into a game-like experience.

Betaworks now has around 11 startup companies, with an emphasis on data analytics and visual design of the output. Gilad has a blog with many case studies about how to visualize large behavioral graphs.

Here are some slick examples:

Some tips from Gilad: when visualizing large graphs, the main trick is doing the right subsampling, since you cannot draw more than about 20K nodes. Gilad mainly uses NetworkX and Gephi for visualization.
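
As a rough illustration of the subsampling tip, the sketch below keeps only the highest-degree nodes and the edges among them before drawing. This is one possible strategy and not necessarily the one Gilad uses; the function name `top_degree_subgraph` and the toy edge list are assumptions. It uses plain Python so it runs without extra packages; with NetworkX the same idea could be expressed by ranking nodes by degree and calling `G.subgraph(...)`.

```python
from collections import defaultdict


def top_degree_subgraph(edges, max_nodes):
    """Keep the max_nodes highest-degree nodes and the edges between them."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Sort by descending degree; ties are broken by node name for determinism.
    ranked = sorted(degree, key=lambda n: (-degree[n], n))
    keep = set(ranked[:max_nodes])
    return [(u, v) for u, v in edges if u in keep and v in keep]


# Hypothetical toy graph; a real use would set max_nodes near the 20K drawing limit.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("d", "e")]
sample = top_degree_subgraph(edges, max_nodes=3)
print(sample)
```

Degree-based sampling keeps the well-connected core visible but hides the periphery, so the right choice depends on what the picture is supposed to show.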

Gilad is very supportive of graph analysis and says you get a much better signal when using graphs, especially with social networks.

Saturday, March 1, 2014

Domino Data Labs: run data analytics in the cloud

I just heard about Domino from my collaborator Yao Wu, who uses it to run data analytics scripts in the cloud. Domino supports R, Matlab and Python: you upload your scripts and run them in the cloud. Domino maintains run history and allows collaboration between team members working on the same project.


Interested in learning more? Domino Founder Nick Elprin will present a demo of Domino at our 3rd GraphLab Conference.

Virginia Tech CloudCV project contributes ADMM code to GraphLab

Some additional GraphLab open source code contributions were announced today. Dhruv Batra's Virginia Tech lab contributed the algorithm recently popularized by Boyd: the alternating direction method of multipliers (ADMM). The algorithms are now part of the graphical models toolkit.

"We implemented ADMM and Bethe-ADMM for MAP inference in MRFs.

The algorithms are reported in the following papers:

Alternating Directions Dual Decomposition.
André F. T. Martins, Mário A. T. Figueiredo, Pedro M. Q. Aguiar, Noah A. Smith, Eric P. Xing.
arXiv:1212.6550.
http://arxiv.org/abs/1212.6550

Bethe-ADMM for Tree Decomposition based Parallel MAP inference
Q. Fu, H. Wang, and A. Banerjee
Conference on Uncertainty in Artificial Intelligence (UAI), 2013.
http://www-users.cs.umn.edu/~banerjee/papers/13/Bethe_ADMM.pdf "
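
For readers who have not seen ADMM before, here is a minimal sketch of its three alternating steps on a toy consensus problem: several workers each hold one number and must agree on a single value minimizing the total squared distance, which is their mean. This is only meant to show the shape of the updates. It is not the MAP inference code contributed to GraphLab, and the function name `consensus_admm`, the penalty `rho` and the iteration count are assumptions.

```python
def consensus_admm(values, rho=1.0, iterations=60):
    """Scaled-form ADMM for: minimize sum_i 0.5 * (x_i - a_i)^2 subject to x_i = z."""
    n = len(values)
    x = [0.0] * n  # local variables, one per worker
    u = [0.0] * n  # scaled dual variables
    z = 0.0        # shared consensus variable
    for _ in range(iterations):
        # Local step: each worker minimizes its own term plus the quadratic penalty.
        x = [(a + rho * (z - ui)) / (1.0 + rho) for a, ui in zip(values, u)]
        # Consensus step: average the local estimates, offset by the duals.
        z = sum(xi + ui for xi, ui in zip(x, u)) / n
        # Dual step: accumulate each worker's disagreement with the consensus.
        u = [ui + xi - z for ui, xi in zip(u, x)]
    return z


# Hypothetical toy data.
estimate = consensus_admm([1.0, 2.0, 6.0])
print(estimate)
```

The local step is independent per worker, which is what makes the method attractive for a distributed engine: only the averaging step needs communication.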

Mahout or Oryx? Hadoop based analytics front is heating up

Got this from my colleague Eric Wolfe: a Gigaom blog post which backs Oryx, an open source project by Sean Owen, a former Mahout contributor who crossed the lines and is now creating a new system.

Apache Mahout, the traditional avenue for building machine learning models in Hadoop, “has reached the end of its road,” Owen said. It’s stuck in a batch-only first-generation MapReduce era, and it requires a lot of work on users’ parts to get a working system in place.

A heated discussion was recorded a couple of months ago. For example, one of the main Mahout contributors, Sebastian Schelter does not stay idle:
..., I also cannot understand why Cloudera and you need to start a new open source project that in many ways mirrors what mahout offers. Why not contribute the algorithm implementations (the computation layer) to mahout and built the serving layer as a project on top of that? I don't see what would have prevented this, I would think it would have been warmly welcomed by this community.

It is not that this new project creates competition from which users will benefit, its exactly the opposite. To me it feels like an intentional abandonment of mahout. Instead of giving users a single project where we could have united efforts, users now have to choose between two things that in general do the same things with each of them missing some functionality. In my eyes, users lose here.

...

Its a very bad day for mahout today.

One of the reasons behind this controversy is that Mahout is backed by MapR, which is backed by EMC. On the other hand, Oryx is backed by Cloudera. MapR and Cloudera have competing Hadoop versions.

An additional interesting note in the Gigaom article concerns Spark:

Owen is spending a lot of time contributing to the Apache Spark project because he plans to rewrite Oryx to make Spark the primary processing framework instead of MapReduce. “There’s actually a lot of reasons to be interested in Spark from a machine learning point of view,” he said. “… I’d much rather put my energies there.”

He’s not alone. As we have explained, Spark is becoming a popular choice for next-generation big data applications and companies such as Cloudera and Hortonworks are embracing it as a big part of Hadoop’s future.

Collaborative filtering on top of Giraph: grafos.ml

Just learned from my collaborator Mohit Singh from Intel Labs about a new collaborative filtering package called grafos.ml from researchers at Telefonica in Spain, the same group that created the CLiMF algorithm. Grafos runs on top of Giraph and implements a few matrix factorization methods, ranking algorithms and some graph analytics algorithms. Grafos has a Python interface.
