It is very exciting that after many years of hard work, we have finally released our machine learning framework as open source! The announcement was made yesterday at NIPS by Prof. Carlos Guestrin:
And here is our GitHub link: https://github.com/apple/turicreate
My friend Joseph (Yossi) Keshet has recently released work on fooling deep learning systems. His work got a lot of attention, including coverage in MIT Technology Review and New Scientist. Nice work!!
Here is my personal connection: strangely, the last time I wed a couple I was wearing their t-shirt.
Unrelated, I just learned from my colleague Brian that Cloudera has acquired Fast Forward Labs, Hilary Mason's company. I visited Hilary at her office a couple of years ago and learned they had an interesting consulting model: sharing periodic tech reports that help data scientists become more proficient. Congrats Hilary!
A very interesting podcast by Sam Charrington, who interviews Scott Stephenson from DeepGram. DeepGram uses deep learning activations to build indexes that allow searching for text in voice recordings.
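Conceptually (and this is just my illustration, not DeepGram's actual system), such an index might look like the sketch below, where embed_audio and embed_text are hypothetical stand-ins for a trained acoustic model and a matching text encoder that share one embedding space:

```python
import numpy as np

# Hypothetical stubs: in a real system these would be activations of a
# trained acoustic model and a text encoder mapped into the same space.
def embed_audio(recording: np.ndarray) -> np.ndarray:
    v = recording[:128]
    return v / (np.linalg.norm(v) + 1e-9)

def embed_text(query: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(query)) % (2**32))
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

# Index: one pooled activation vector per recording.
recordings = {name: np.random.default_rng(i).normal(size=16000)
              for i, name in enumerate(["call_01.wav", "call_02.wav", "call_03.wav"])}
index = {name: embed_audio(audio) for name, audio in recordings.items()}

# Query: embed the text, then rank recordings by cosine similarity.
q = embed_text("cancel my subscription")
ranked = sorted(index, key=lambda name: -float(index[name] @ q))
print(ranked)
```

The key point is that once activations live in a shared space, text search over audio reduces to nearest-neighbor lookup, with no full transcription needed at query time.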
DeepGram has released Kur, a high-level abstraction over deep learning frameworks that allows quickly defining network layouts. Still, the target persona remains researchers with deep learning knowledge.
A related Israeli startup is AudioBurst. They claim to use AI for indexing, but it is not clear what they actually do. Another Israeli startup is Verbit. They seem to transcribe audio automatically, with humans going over the preliminary result.
Another interesting paper, Accelerating Innovation Through Analogy Mining, just received the Best Paper award at KDD 2017. The paper is by Dafna Shahaf, who studied with me at CMU, and her student Tom Hope.
Misha Bilenko, formerly from M$, released an open source library for gradient boosting. It seems to compete with XGBoost, with the claim that it also supports categorical variables. (In GraphLab Create we had an extended XGBoost with categorical variable support.)
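To make the categorical issue concrete, here is a minimal sketch (my own toy example, not Misha's library) of the usual workaround when a boosting library has no native categorical support: one-hot encode the categorical columns before training. Native support lets the library split on category subsets directly, avoiding this blow-up of the feature space.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Made-up toy data: one categorical column, one numeric column, one label.
df = pd.DataFrame({
    "city":   ["NYC", "TLV", "SF", "TLV", "NYC", "SF"],  # categorical
    "clicks": [3, 7, 1, 9, 4, 2],                        # numeric
    "bought": [0, 1, 0, 1, 0, 0],                        # label
})

# The workaround: expand "city" into one indicator column per category.
X = pd.get_dummies(df[["city", "clicks"]], columns=["city"])
model = GradientBoostingClassifier(n_estimators=10).fit(X, df["bought"])
print(model.predict(X))
```

With high-cardinality categoricals (user IDs, URLs, etc.) the one-hot matrix becomes huge, which is exactly why native categorical support is a selling point.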
The 3rd Data Science Summit Europe is coming! This year I am not involved in the organization, so it should probably be a better event :-) Save the date: May 29, 2017 in Jerusalem. The date was picked to fall just after O'Reilly Strata London 2017, so the conference should attract many speakers and guests from abroad.
The keynote speaker is my friend Dr. Ben Lorica, chief scientist of O'Reilly Media and content organizer for the O'Reilly Strata and O'Reilly AI conferences.
Unrelated, I heard today about Grail, which raised $100M for detecting cancer in blood tests. Grail raised money from Amazon, Google, and Microsoft (Bill Gates). Looking at their careers page, they are also hiring deep learning researchers.
Another interesting company is Zebra Medical Research, which shares medical data with researchers in return for a fraction of future revenues.
I found this interesting blog post by Rachel Thomas. My favorite quote:
Using TensorFlow makes me feel like I’m not smart enough to use TensorFlow; whereas using Keras makes me feel like neural networks are easier than I realized. This is because TensorFlow’s API is verbose and confusing, and because Keras has the most thoughtfully designed, expressive API I’ve ever experienced. I was too embarrassed to publicly criticize TensorFlow after my first few frustrating interactions with it. It felt so clunky and unnatural, but surely this was my failing. However, Keras and Theano confirm my suspicions that tensors and neural networks don’t have to be so painful.
I recently stumbled upon pipeline.io - an open source production environment for serving TensorFlow deep learning models. Looking at the GitHub activity plots, I see that Chris Fregly is the main force behind it. Pipeline.io is trying to solve the major headache of scoring and maintaining ML models in production.
Here is their general architecture diagram:
Here is a talk by Chris:
Alternative related systems are seldon.io, prediction.io (sold to Salesforce), sense.io (sold to Cloudera), Domino Data Labs, and probably some others I forgot :-)
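To make the problem space concrete, here is a minimal, purely illustrative sketch of the core component all of these systems wrap with deployment, scaling, and monitoring machinery: an HTTP scoring endpoint in front of a trained model. This is not pipeline.io's code; the linear model is a stand-in for a real serialized TensorFlow or scikit-learn artifact.

```python
from flask import Flask, jsonify, request
import numpy as np

app = Flask(__name__)

# Stand-in for a trained model; in practice you would load a serialized
# model artifact here (and version it, monitor it, roll it back, etc. -
# which is exactly the machinery these platforms provide).
WEIGHTS = np.array([0.4, -0.2, 0.7])

@app.route("/score", methods=["POST"])
def score():
    # Expects JSON like {"features": [1.0, 2.0, 3.0]}
    features = np.array(request.get_json()["features"], dtype=float)
    return jsonify({"score": float(features @ WEIGHTS)})

if __name__ == "__main__":
    app.run(port=8080)
    # Try: curl -X POST localhost:8080/score \
    #        -H 'Content-Type: application/json' -d '{"features": [1,2,3]}'
```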
BTW Chris will be giving a talk at the AI by the Bay conference (March 6-8 in San Francisco). The conference looks pretty interesting.
And here is a note I got from Chris following my initial blog post:
I asked Chris which streaming applications he has in mind and this is what I got:
We've got a number of streaming-related Github issues (features) in the works:
here are some relevant projects that are in the works:
- working with the Subscriber-Growth Team @ Netflix to replace their existing multi-armed bandit, Spark-Streaming-based data pipeline to select the best model to increase signups. we're using Kafka + Kafka Streams + Spark + Cassandra (they love Cassandra!) + Jupyter/Zeppelin Notebooks in both Python/Scala.
- working with the Platform Team @ Twilio to quickly detect application logs that potentially violate Privacy Policies. this is already an issue outside the US, but quickly becoming an issue here in the US. we're using Kafka + custom Kafka Input Readers for Tensorflow + Tensorflow to train the models (batch) and score every log line (real-time).
- working with a super-large Oil & Gas company out of Houston/Oslo (stupid NDA's) to continuously train, deploy, and compare scikit-learn and Spark ML models on live data in parallel - all from a Jupyter notebook.
- working with PagerDuty to predict potential outages based on their new "Event" stream which includes code deploys, configuration changes, etc. we're using Kafka + the new Spark 2.0 Structured Streaming.
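As an aside, here is a minimal, purely illustrative sketch of the kind of multi-armed bandit Chris describes in the first bullet above: an epsilon-greedy policy that routes traffic among candidate models and shifts it toward the one producing the most signups. None of this is Netflix's or pipeline.io's actual code.

```python
import random

class ModelSelectionBandit:
    """Epsilon-greedy bandit choosing among candidate models (toy sketch)."""

    def __init__(self, model_names, epsilon=0.1):
        self.epsilon = epsilon
        self.trials = {m: 0 for m in model_names}
        self.wins = {m: 0 for m in model_names}

    def choose(self):
        # Explore a random model epsilon of the time; otherwise exploit
        # the model with the highest observed signup (conversion) rate.
        if random.random() < self.epsilon:
            return random.choice(list(self.trials))
        return max(self.trials, key=lambda m: self.wins[m] / (self.trials[m] or 1))

    def record(self, model, signed_up):
        self.trials[model] += 1
        self.wins[model] += int(signed_up)

# Simulate: model_b has the best true signup rate, so it should win traffic.
bandit = ModelSelectionBandit(["model_a", "model_b", "model_c"])
true_rates = {"model_a": 0.05, "model_b": 0.08, "model_c": 0.03}
for _ in range(1000):
    m = bandit.choose()
    bandit.record(m, signed_up=random.random() < true_rates[m])
print(bandit.trials)
```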
What are the main benefits of pipeline.io vs. other systems?
- the overall goal, as you can probably figure out, is to give data scientists the "freedom and responsibility" (hello, Netflix Culture Deck!) to iterate quickly without depending on production engineers or an ops group.
- this is a life style that i really embraced while at Netflix. with proper tooling, anyone (devs, data scientists, etc) should be able to deploy, scale, and rollback their own code or model artifacts.
- we're providing the platform for this ML/AI-focused freedom and responsibility!
- you pointed out a few of our key competitors/cooperators like seldon.io. i have a list of about 20 more that i keep an eye on each and every day. i'm in close talks with all of them.
- we're looking to partner with guys like Domino Data Labs who have a weak deployment story.
- and we're constantly sharing experience and code with seldon.io and hydrosphere.io and others.
- we're super performance-focused, as well. we have a couple efforts going on including PMML optimization, native code generation, etc.
- also super-focused on metrics and monitoring - including production-deployment dashboards targeted to data scientists.
- i feel like our main competitors are actually the cloud providers. they're the ones that keep me awake. one of our underlying themes is to reverse engineer Google and AWS's Cloud ML APIs.
Last week I attended an interesting lecture by Ran Gilad-Bachrach from MSR. Ran presented CryptoNets, which was published at ICML 2016. CryptoNets makes it possible to score trained deep learning models on encrypted data. It relies on homomorphic encryption, a well-known mechanism that allows computing sums and products over encrypted values. The main trick, then, is to limit the neural net to operations composed of sums and products only. To satisfy this constraint, CryptoNets uses the square function as the only supported non-linearity (instead of sigmoid, ReLU, etc.).
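To illustrate the idea (this is my own sketch, not the CryptoNets implementation), here is a forward pass built from nothing but additions and multiplications, so that in principle every operation could be evaluated under homomorphic encryption:

```python
import numpy as np

def square(x):
    # The square activation: the only non-linearity allowed, because x*x
    # is a product that a homomorphic encryption scheme can compute.
    return x * x

def forward(x, W1, b1, W2, b2):
    h = square(x @ W1 + b1)   # dense layer (sums + products) + square
    return h @ W2 + b2        # final linear layer producing the scores

# Toy dimensions; the weights would come from a normally-trained model.
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 784))                  # e.g. a flattened MNIST digit
W1 = rng.normal(size=(784, 100)); b1 = np.zeros(100)
W2 = rng.normal(size=(100, 10));  b2 = np.zeros(10)
print(forward(x, W1, b1, W2, b2).shape)        # (1, 10) class scores
```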
On the upside, CryptoNets reports 99% accuracy on MNIST, the toy dataset everyone uses for deep learning. On the downside, you cannot train a network this way, only score new test data. Scoring is quite slow, around 5 minutes, although you can batch up to a few thousand scoring operations together in the same batch. Because the complexity of the encrypted number representation grows with depth, the technique is also limited to a certain number of network layers.
I believe that in the coming few years additional research effort will be invested in tackling the training of neural networks on private data without revealing the data contents.
Anyone interested in reading about other primitives that may be used for similar computations is welcome to take a look at my paper: D. Bickson, D. Dolev, G. Bezman and B. Pinkas, Secure Multi-party Peer-to-Peer Numerical Computation, Proceedings of the 8th IEEE Peer-to-Peer Computing (P2P'08), Sept. 2008, Aachen, Germany, where we use both homomorphic encryption and Shamir secret sharing to perform a similar distributed computation (in terms of sums and products).
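For readers unfamiliar with the second primitive, here is a toy, illustrative implementation of Shamir secret sharing showing why shares are additively homomorphic, which is what lets parties compute sums without revealing their inputs. This is just the basic building block, not the protocol from the paper.

```python
import random

PRIME = 2_147_483_647  # a Mersenne prime; all arithmetic is mod PRIME

def share(secret, k, n):
    """Split `secret` into n shares; any k of them reconstruct it."""
    # Random polynomial of degree k-1 with the secret as constant term.
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(k - 1)]
    # Party i receives the polynomial evaluated at x = i.
    return [(i, sum(c * pow(i, e, PRIME) for e, c in enumerate(coeffs)) % PRIME)
            for i in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the secret."""
    total = 0
    for xi, yi in shares:
        num, den = 1, 1
        for xj, _ in shares:
            if xj != xi:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        # pow(den, PRIME - 2, PRIME) is the modular inverse of den.
        total = (total + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return total

# Additive homomorphism: the pointwise sum of two parties' shares is a
# valid sharing of the sum of their secrets, computed without revealing them.
a_shares = share(42, k=3, n=5)
b_shares = share(100, k=3, n=5)
sum_shares = [(x, (ya + yb) % PRIME) for (x, ya), (_, yb) in zip(a_shares, b_shares)]
print(reconstruct(sum_shares[:3]))  # prints 142
```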