Wednesday, March 25, 2015

A new time series anomaly detection dataset from Yahoo!


I got this from my colleague Micky Fire: Yahoo! just released a brand-new time series dataset for anomaly detection.

Tuesday, March 24, 2015

Data Science Summit - why should you care?


The Data Science Summit is a non-profit event organized by Intel, Comcast, Pandora, Dato, Cloudera and O’Reilly Media. The Summit brings together researchers and data scientists from academia as well as industry to discuss state-of-the-art data science, applied machine learning and predictive applications. The conference agenda has been co-created with Dr. Ben Lorica, Chief Scientist of O’Reilly Media, who serves as the content manager of the O’Reilly Strata Conferences.

We are expecting 1000 data scientists to attend on Monday, July 20 in SF, as this year we were able to bring together an amazing group of data science leaders. We have speakers from three major data science domains:

  • Infrastructure
  • Data engineering
  • Machine learning and predictive applications 
From the infrastructure viewpoint, Prof. Mike Franklin (UC Berkeley) is the Director of the Berkeley AMPLab and a co-founder of Databricks, the cloud service hosting Spark. Dr. Misha Bilenko is a senior researcher at Microsoft, working on Microsoft Azure ML, a machine learning cloud service. Ron Kasabian is VP of Big Data at Intel and will cover Intel's efforts in the data science domain. Prof. Alex Smola is the creator of the Parameter Server, an efficient distributed infrastructure for ML applications deployed at Google and other companies.

We call data engineering the data cleaning and transformation that needs to happen before we can apply machine learning methods. Wes McKinney is the creator of the popular pandas Python data science package, and recently sold his startup to Cloudera. Pandas has many slicing and dicing operations that help with quick data science. Prof. Jeff Heer (UW) is a co-creator of d3.js, the popular visualization software, and also a co-founder of Trifacta, a data engineering startup. Trifacta allows you to visually specify complex data transformations that are later executed on a cluster. Dr. John Mount is a co-author of the popular book "Practical Data Science with R".

D3.js visualization software
From the machine learning aspect, Prof. Carlos Guestrin (UW) is the founder and CEO of Dato, our popular big data analytics framework. Prof. Mike Jordan (Berkeley) is famous for his work on neural networks, graphical models (specifically variational inference) and Bayesian non-parametric statistics. In recent years he has been working on statistical methods for Big Data. His recent Reddit AMA appearance (in which he bashed deep learning) generated a lot of chatter. Prof. Christopher Re (Stanford) has done a lot of applied work in this domain; one of his recent projects is DeepDive, a system that uses domain-specific knowledge and user feedback to improve modeling and predictions. Prof. Robert (Rob) Tibshirani (Stanford) is famous for his work on sparse L1-regularized regression (the Lasso).

In terms of predictive applications, Dr. Tao Ye is a senior scientist at Pandora Internet Radio working on their recommendation engine. Dr. Jan Neumann is manager of recommendations at Comcast. Esteban Alvarez from VARANIDEA will share their work on health care analytics in Nigeria.

We also plan to give the stage to a few younger startups that are working on groundbreaking research. Dr. Leo Meyerovich from Graphistry will discuss GPU-aided visualization for graphs that were too big to visualize before.

There is still an opportunity to get involved! Send me a note if you would like to speak at or sponsor the event.

Please note that early bird pricing ends April 1st.

Saturday, March 21, 2015

Graphistry: large scale data visualization

I connected with Leo Meyerovich for a quick overview of Graphistry.

Who is behind Graphistry?
Graphistry spun out of UC Berkeley’s Parallel Computing lab last year. It stems from my Ph.D. on the first parallel web browser (Mozilla etc. are building new browsers around those ideas) and from Matt Torok (my RA), who built Superconductor, a GPU scripting language for big interactive data visualizations. 

What does Graphistry do?
Graphistry scales and streamlines visual analysis of big graphs.  Think answering questions about people (intelligence, sales, marketing), about things (data centers, sensors), and combinations of them (e.g., financial transactions). For example, we used it to crack a 70K+ node botnet a couple days ago. Our tool immediately revealed the accounts involved, their different roles, especially key accounts, and, after 30min of interactive analysis & googling, the credit card & passport theft operation it funneled to. Most tools can only sensibly show hundreds of nodes,  and a couple open source ones handle tens of thousands, but we’re already pushing 100X more than that.

How do you use GPU?
We're taking the last 20 years of infoviz research out of papers and into accessible tools by (a) powering them with big yet economical clusters of GPUs and (b) prioritizing interaction design. The GPU side is cool. For example, our unusual backend has JavaScript orchestrating our GPU cluster via node-opencl. Likewise, we take advantage of recent breakthroughs — including our own — in optimizing irregular graph algorithms on GPUs for multiple magnitudes more data & speed.  With all this power, we're deploying atypically smart visualizations that take advantage of computationally-intensive machine learning and physics algorithms. Likewise, we're adding interactive analysis tools on top that, till now, were impossible. I can write so much here!

What is your business model?
We currently work closely with customers on big problems (contact me if this sounds relevant). We’re actively working towards self-serve analyst tools for a couple industries, and want to share our APIs with internal dev teams and analytics providers to build tools for their more unique problems.

What is your target audience?
We currently like problems in IT (e.g., making sense of activity in big networks or many endpoints) and various security problems. We're starting to expand into problems in finance (e.g., risk, fraud) and sales/marketing (social & business networks).

Can you share some demo links?
I can’t yet share the interactive versions, but here’s a screenshot:


Are you looking for funding?
As you can probably attest, startup life is intense. We’re more interested in collaborating on good problems right now. 

Are you hiring?  
Graphistry is currently 5 Berkeley engineers — a mix of language designers, compiler builders, GPU hackers, and web devs — and that’s it! We'd especially love to talk to any frontend and data viz engineers about designing big interactive visualizations & tools that were previously impossible. Consider yourself invited to our new Oakland office for an amazing show-and-tell.

Anyone interested in watching some live Graphistry demos is welcome to join our Data Science Summit, July 20 in SF.

Wednesday, March 18, 2015

Text by the Bay

I got a note from Alexy Krabrob about an interesting text analytics conference he is organizing: Text by the Bay, April 24-25 in the Bay Area.

My readers are welcome to use the discount code TEXTDENNY, which gives $250 off until 3/31.

Lecture videos will be made available online.



Friday, March 13, 2015

Data Science Summit - Registration is Open!



The Data Science Summit is a non-profit event organized by Intel, Comcast, Pandora, Dato, Cloudera and O’Reilly Media. The Summit brings together researchers and data scientists from academia as well as industry to discuss state-of-the-art data science, applied machine learning and predictive applications. The conference agenda has been co-created with Dr. Ben Lorica, Chief Scientist of O’Reilly Media, who serves as the content manager of the O’Reilly Strata Conferences.

Confirmed speakers (preliminary list!)

Prof. Alex Smola - Google & Carnegie Mellon University
Prof. Jeff Heer - D3.js, Trifacta & University of Washington
Prof. Carlos Guestrin - Dato & University of Washington
Prof. Chris Re - Stanford University
Prof. Jure Leskovec - Pinterest & Stanford University
Prof. Mike Franklin - AMPLab UC Berkeley
Prof. Mike Jordan - UC Berkeley
Dr. Andreas Muller - Scikit-learn & NYU
Wes McKinney - pandas & Cloudera
Dr. Misha Bilenko - Microsoft Azure ML
Dr. Tao Ye - Pandora
Dr. Jan Neumann - Comcast


$150 early bird discount until April 1st.

Additionally, use the discount code DannysBlog to get an additional $50 off!