Wednesday, April 16, 2014

New trends in sharing data science work

I got the following VentureBeat article from my colleague Carlos Guestrin.

It seems there is an interesting trend of allowing data scientists to share their work: Imagine if a company’s three highly valued data scientists can happily work together without duplicating each other’s efforts and can easily call up the ingredients and results of each other’s previous work. 

That day has come. As the data scientist arms race continues, data scientists might want to join forces. Crazy idea, right? Two San Francisco startups — Domino Data Lab and Sense — have emerged recently with software to let data scientists collaborate on multiple projects. In a way, it’s like code storehouse GitHub for the data science world. A Montreal startup named has been talking about the same themes, but it brings a more social twist. Another startup, Mode Analytics, is building software for data analysts to ask questions of data without duplicating previous efforts. And at least one more mature software vendor, Alpine Data Labs, has been adding features to help many colleagues in a company apply algorithms to code on one central hub. 

If you are interested in learning about trends in sharing data science work you should attend our annual GraphLab conference - we have demos from Domino Data Labs, Sense, & Alpine Data Labs!

An update (April 17) - I have now connected with Derek Steer, founder of Mode Analytics, and they will also give a demo at our GraphLab event.

Tuesday, April 15, 2014

Big data analytics front is heating up!

Following up on my previous blog post about Mahout vs. Oryx: the recent news is that Intel has made a significant investment in Cloudera, and the rumor is that Intel is going to abandon its own Hadoop release.
On the other hand, Mahout is switching to work on top of Spark. Mahout is backed by MapR, which is backed by EMC.

GraphChi-DB - new experimental graph database released!

I got the following from my collaborator Aapo Kyrola:

I have just released the source code of GraphChi-DB to GitHub!

The repository is here:

GraphChi-DB is a research project that enhances GraphChi with database functionality:
- Fast queries
- Data columns for edges and vertices
- Fast insertions of new edges and vertices.

Compared to existing single-computer graph databases, it scales to much bigger graphs and, unlike other graph databases, provides the familiar GraphChi programming model (it also provides a rudimentary edge-centric programming model more similar to GraphLab). You can read the publication (below) for a performance evaluation.

It is written in Scala (with some Java). Scala is a great language for a database because it has an interactive console (REPL), so you can query and interact with the database directly. GraphChi-DB does not support any query language; instead, it is accessed via the Scala API.

Note: the code is experimental, probably very buggy and has an awful API. Do not use it for anything important! Do not run your Bitcoin exchange with it!
GraphChi-DB also requires some expertise in Scala to be really usable. I would recommend it only to researchers and students at this point. For commercial-grade graph databases, look at Neo4j or Titan. To get started, look at the example applications (explained in the readme of the project).

The design and evaluation of GraphChi-DB can be found in the preprint:

Saturday, April 12, 2014

Interesting Graph Applications in Retail

I learned from Amit Steinberg about two additional interesting applications for graph analysis in retail.

Attribution modeling for online marketing
Interesting multichannel attribution work from UPenn: Analyzing the Customer Journey: Attribution Modeling for Online Marketing Exposures in a Multi-Channel Setting

In a nutshell, they build a Markov chain that models the user's exposure to different marketing channels, and model how the different channels contribute to conversion.
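To make the Markov chain idea concrete, here is a minimal sketch in Python. It is not the method from the paper: it assumes a first-order chain estimated from observed journeys and scores each channel with the commonly used "removal effect" (how much the overall conversion probability drops if that channel is taken out). The state names, the toy journeys, and the function names are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical state names for the start of a journey and its two outcomes.
START, CONV, NULL = "start", "conversion", "null"


def transition_probs(journeys):
    """Estimate first-order transition probabilities from (path, converted) pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for path, converted in journeys:
        states = [START, *path, CONV if converted else NULL]
        for a, b in zip(states, states[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
            for a, nxt in counts.items()}


def conversion_prob(probs, removed=None, sweeps=200):
    """Probability of reaching CONV from START.

    A removed channel is treated as a dead end: any journey that would
    pass through it is counted as not converting.
    """
    value = defaultdict(float)  # unknown states (including NULL) are worth 0
    value[CONV] = 1.0
    for _ in range(sweeps):  # repeated sweeps until the values settle
        for state, nxt in probs.items():
            if state == removed:
                continue
            value[state] = sum(p * value[b] for b, p in nxt.items() if b != removed)
    return value[START]


def removal_effects(probs, channels):
    """Relative drop in conversion probability when each channel is removed."""
    base = conversion_prob(probs)
    return {c: 1 - conversion_prob(probs, removed=c) / base for c in channels}


# Toy data (invented): each journey is a list of channel touches plus an outcome.
journeys = [
    (["search", "email"], True),
    (["search"], False),
    (["display", "search"], True),
    (["display"], False),
]
probs = transition_probs(journeys)
effects = removal_effects(probs, ["search", "email", "display"])
```

In this toy data every converting journey passes through search, so removing it eliminates all conversions and it gets the largest removal effect. Normalizing the effects so they sum to one would give each channel's share of the credit.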

Botnet detection
The second problem many retailers are facing is filtering out botnet behavior from real user behavior. One interesting paper in this domain that uses graphs is: BotGraph: Large Scale Spamming Botnet Detection. The second paper is An Analysis of Social Network-Based Sybil Defenses.

Monday, March 31, 2014

Last day to enjoy the early bird discount for our 3rd GraphLab Conference!

GraphLab Conference 2014
We have just started to organize our 3rd user conference, to be held on Monday, July 21, 2014 at the Nikko Hotel, SF. This is a very preliminary notice to attract companies and universities that would like to be involved. We are planning a mega event this year, with around 800-900 data scientists attending, on the topic of graph analytics and large scale machine learning.
The conference is a non-profit event held by to promote applications of large scale graph analytics in industry. We invite talks from all major state-of-the-art systems for graph processing, graph databases and large scale data analytics and machine learning. We are looking for sponsors who would like to contribute to the event organization.
Preliminary talks:
Preliminary demos:
Dr. Ari Tuchman: Beyond Sentiment and Buzz: Extracting the Answers that Matter Through Predictive Correlations from Unstructured Chatter
Paul Hoffman: Large Scale Machine Learning on Sparse Graphs
Dr. Jans Aasman, CEO, Franz Inc.: Drag and Drop Graph Query Generator
Dr. Zhisong Fu, Mike Personick, and Bryan Thompson: Ultra fast graph mining on GPUs.
Tristan Zajonc and Anand Patil, Sense
Dr. David Talby: Beyond ML basics: Localized, evolving, hybrid & automated modeling at scale
Simon Chan: An Open Source Machine Learning Server for Developers
Adam Fuchs, CTO Sqrrl: How To Build Secure, Massively Scalable Graphs with Sqrrl
Dr. Steven Hillion, Alpine Data Labs: Fast classification algorithms on Hadoop
Jacob Nelson: Grappa graph engine
Prof. Joshua Bloom: Machine-learning Driven Automated Insight Workflows
Dr. Matthias Broecheler: Titan – Scalable Graph Computing in Real-time and Offline
Prof. Eric Xing: Petuum – a new distributed machine learning framework
Corey Lanum, General Manager of North America, Cambridge Intelligence: How to make useful interactive graph visualizations
Dr. Ira Cohen, HP Software: Scaling the data scientist
Brendan Madden, Tom Sawyer Software: TBA
Dr. Jason Riedy, Georgia Tech: STING: High-Performance Analysis for Streaming Graph Data
Dr. Hassan Chafi, Oracle: Graph Analytics Research at Oracle Labs
Dr. Achim Rettinger, EPPICS: Cross-lingual Cross-modal Analytics of Dynamic Graphs
Corinna Bahr: Agile Data Exploration & Visualization with Blaze and Bokeh
Leo Meyerovich, Graphistry: Scaling Visualization with Design and GPUs
Nick Elprin: Domino Data Labs
Dr. Fernando Perez, Berkeley: IPython: from interactive computing to computational narratives
Dr. Linas Baltrunas and Dr. Dionysos Logothetis, Telefonica: Tools for large scale ML and graph analysis
Ms. Raquel Pau, Sparsity Technologies: Tweeticer, Social Network Analysis with graphs using Sparksee.
SriSatish Ambati, co-founder and CEO: TBA
Jonathan Dinu, CTO Zipfian Academy: TBA
Demian Bellumio, COO Senzari: MusicGraph
Sutanay Choudhury, Pacific Northwest National Lab: M&Ms4Graphs: Multi-scale, Multi-dimensional Graph Analytics Tools for Cyber-Security
Michael Zeller, CEO Zementis: Accelerate predictive analytics with massively parallel scoring
Sébastien Heymann CEO and Jean Villedieu Co-founder, Linkurious: How can graph visualization help understand graphs faster?
Richard Socher, Stanford: etcML project
MongoDB: TBA
Amit Moran, Crosswise: TBA

Tuesday, March 25, 2014

Graphs are everywhere - and now a food graph!

There isn't a single day where I don't hear about a new system or an academic project that is utilizing graphs to get additional insights out of the data. Today I heard about an interesting study of taste from Prof. Alon Ben-Ari, Director of the Medical Informatics Fellowship Program, University of Washington - VA Medical Center:

Ahn, Yong-Yeol, Sebastian E. Ahnert, James P. Bagrow, and Albert-László Barabási. "Flavor network and the principles of food pairing." Scientific reports 1 (2011). 

This paper analyses connections between recipe components using graphs. For each pair of ingredients that appear in a recipe together a graph edge is created.
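The edge construction described above can be sketched in a few lines of Python. This is only an illustration of the co-occurrence idea as summarized here, not the paper's actual pipeline. The recipes are invented, the function name is hypothetical, and counting how many recipes share each pair (an edge weight) is my own addition.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(recipes):
    """Map each unordered ingredient pair to the number of recipes containing both."""
    edges = Counter()
    for ingredients in recipes:
        # sorted(set(...)) removes duplicates and gives each pair a canonical order
        for pair in combinations(sorted(set(ingredients)), 2):
            edges[pair] += 1
    return edges


# Toy recipes (invented for illustration).
recipes = [
    ["tomato", "basil", "garlic"],
    ["tomato", "garlic", "onion"],
    ["basil", "garlic"],
]
edges = cooccurrence_edges(recipes)
```

Each key in edges is one graph edge, so in this toy data garlic ends up connected to every other ingredient, with its links to basil and tomato each backed by two recipes.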