Large Scale Machine Learning and Other Animals: Data Science Summit

Tuesday, August 18, 2015

Data Science Summit - Talk Videos Released

As you may know, I have been heavily involved in organizing the Data Science Summit, an event that brought together 1,000 data scientists and machine learning researchers this July in SF. We have recently released the videos of the talks. In this blog post I will summarize some of the highlights of our event.

The first talk you should watch, in case you missed it, is Prof. Carlos Guestrin's keynote, which summarizes what's new at Dato:

An interesting talk by Prof. Mike Franklin about what's new at the Berkeley AMPLab:

It was interesting to learn that Mesos, Tachyon and Spark have graduated to startups. What will be the next startup out of the AMPLab? Mike mentions Velox, their predictive service, which competes with prediction.io (among others). KeystoneML is a library of machine learning pipelines. MLMatrix is a library for matrix linear algebra operations. SampleClean is a project for involving humans in the data cleaning process.

A related talk by Prof. Seif Haridi (SICS) about Flink, a system geared towards stream processing:

Unlike Spark, which implements streaming with small batches, Flink is written to support continuous stream handling. He quoted very good performance for Flink compared with Storm.
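
To make the difference between the two streaming models concrete, here is a toy sketch in plain Python. It does not use the Spark or Flink APIs; the function names, the `(time, value)` event format and the one-second interval are all my own assumptions for illustration. The micro-batch version only emits a result when a batch interval closes, while the per-event version emits an updated result as soon as each event arrives.

```python
from typing import Iterable, List, Tuple

# Hypothetical event format: (arrival time in seconds, value).
Event = Tuple[float, int]


def micro_batch_sums(events: Iterable[Event], interval: float) -> List[Tuple[float, int]]:
    """Toy micro-batching: sum the values per fixed interval.

    Each result is (time the batch closes, sum of the values in that batch),
    so a value is only visible downstream once its batch has closed.
    Events are assumed to arrive in time order.
    """
    results: List[Tuple[float, int]] = []
    batch_end = None
    total = 0
    for t, value in events:
        if batch_end is None:
            # Align the first batch to a multiple of the interval.
            batch_end = (t // interval + 1) * interval
        # Close every batch that ended before this event arrived.
        while t >= batch_end:
            results.append((batch_end, total))
            total = 0
            batch_end += interval
        total += value
    if batch_end is not None:
        # Flush the last, still open batch.
        results.append((batch_end, total))
    return results


def per_event_running_sum(events: Iterable[Event]) -> List[Tuple[float, int]]:
    """Toy continuous handling: emit an updated running sum for every event."""
    results: List[Tuple[float, int]] = []
    total = 0
    for t, value in events:
        total += value
        results.append((t, total))
    return results


events = [(0.1, 1), (0.4, 2), (1.2, 3), (2.7, 4)]
print(micro_batch_sums(events, interval=1.0))
print(per_event_running_sum(events))
```

In this sketch the event that arrives at 0.1 seconds is not reflected in any output until its batch closes at 1.0 seconds, whereas the per-event version reports it immediately. Real systems add a lot on top of this (parallelism, fault tolerance, windowing), so treat it only as a picture of where the latency comes from.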

Another interesting talk, by Prof. Alex Smola, attracted a big audience. Alex has recently formed a startup around his parameter server work. Unfortunately we did not get permission to release his video yet. I am working on that.

Wes McKinney, the creator of the Python pandas library, used our conference to announce Cloudera's new Ibis project, which is a way to parallelize Python code on top of a Hadoop cluster at scale.

A related lecture by Peter Wang, CEO of Continuum, was about Dask, a different attempt to parallelize Python code. He also explores in detail their visualization library, Bokeh.

Prof. Chris Ré covered his DeepDive framework. For example, PaleoDeepDive allows mining complex information out of PDF papers (including NLP, mining tables, geographical coordinates etc.). Recently he started another exciting new startup around providing ML tools for a larger audience.

Prof. Jeff Heer from Trifacta and the University of Washington presented his recent research on how to improve visualization, done as a joint research project with Tableau. Multiple layouts and options are explored, and a recommendation engine filters the results to present the most attractive and informative ones to the user.

Prof. Dhruv Batra from Virginia Tech described their visual question answering project, a cool project which answers free-text questions about images:

In the startup session, an interesting talk by Stephen Merity from CommonCrawl:

Stephen describes the different cool things people do with CommonCrawl's collected web data. One example is Stanford's GloVe project, which provides word vectors as an alternative to word2vec. Another is "Analyzing the Web for the Price of a Sandwich", an interesting piece of work from Yelp on collecting US phone numbers from the web.

One of the most bizarre applications (in a good sense!) is from compology.us, a US company that monitors trash bins using sensors and uses GraphLab deep learning to detect the level of trash and optimize the pickup routes.

The last talk I want to highlight is the audience favorite: a talk by Amanda Casari from Concur, which shows how to run GraphLab Create on top of Spark: