Wednesday, July 29, 2015
Scala training in SF
My Israeli colleague Tomer Gabel is giving a two-day Scala training in SF on Aug 11. My blog readers are welcome to use the discount code BOLD200 to get $200 off.
A new graph partitioning algorithm at CIKM
We got the following email from Fabio, a graduate student at Sapienza University of Rome:
I'm Fabio Petroni, a Ph.D. student in Engineering in Computer Science at Sapienza University of Rome.
Together with other researchers, we recently developed HDRF, a novel stream-based graph partitioning algorithm that delivers significant improvements in partitioning quality over all existing solutions we are aware of.
In particular, HDRF provides the smallest average replication factor with close to optimal load balance. These two characteristics together allow HDRF to significantly reduce the time needed to perform computation on graphs and make it the best choice for partitioning graph data.
A paper describing the HDRF algorithm will be presented at the upcoming CIKM conference (http://www.cikm-2015.org) and is available (final submitted version) at: http://www.dis.uniroma1.it/~midlab/articoli/PQDKI15CIKM.pdf
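To give a rough feel for the idea behind HDRF (a greedy, one-pass vertex-cut partitioner that, when forced to replicate an endpoint of an edge, prefers replicating the higher-degree vertex), here is a minimal Python sketch. The scoring below is a simplified approximation of the paper's formula, not the authors' implementation, and the function name and parameters are my own illustration:

```python
from collections import defaultdict

def hdrf_partition(edges, k, lam=1.0):
    """Stream-based vertex-cut partitioning in the spirit of HDRF.

    Each edge is assigned to exactly one of k partitions as it arrives;
    a vertex touching edges in several partitions is 'replicated' there.
    Illustrative sketch only.
    """
    degree = defaultdict(int)      # partial vertex degrees seen so far
    replicas = defaultdict(set)    # vertex -> set of partitions holding it
    load = [0] * k                 # number of edges per partition
    assignment = []

    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        # normalized partial degrees: theta_u + theta_v = 1
        theta_u = degree[u] / (degree[u] + degree[v])
        theta_v = 1.0 - theta_u
        maxload, minload = max(load), min(load)

        best, best_score = None, None
        for p in range(k):
            # replication term: credit a partition that already holds an
            # endpoint, with a larger bonus for the LOWER-degree endpoint,
            # so the high-degree vertex is the one replicated elsewhere
            g = 0.0
            if p in replicas[u]:
                g += 1.0 + (1.0 - theta_u)
            if p in replicas[v]:
                g += 1.0 + (1.0 - theta_v)
            # balance term nudges edges toward lightly loaded partitions
            bal = lam * (maxload - load[p]) / (1.0 + maxload - minload)
            score = g + bal
            if best_score is None or score > best_score:
                best, best_score = p, score

        replicas[u].add(best)
        replicas[v].add(best)
        load[best] += 1
        assignment.append(best)
    return assignment, replicas

# usage: the average replica-set size is the replication factor the
# paper's quality claims are about (lower is better)
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (2, 3)]
parts, replicas = hdrf_partition(edges, k=2)
rf = sum(len(s) for s in replicas.values()) / len(replicas)
```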
We will work with Fabio to include a version of his algorithm in our latest code base, GraphLab Create.
Tuesday, July 28, 2015
Some exciting developments at Dato
You may have missed our latest Dato blog post, so I wanted to highlight two of the coolest recently released features:
It's particularly exciting to mention that GraphLab Create's integration with Numpy will effectively scale scikit-learn. Now with GraphLab Create and Dato Predictive Services, you can deploy existing scikit-learn models at scale as a RESTful predictive service by changing only a few lines of code. Very cool.
Dato Distributed now with distributed machine learning
# jobs distribution environments
# s = gl.deploy.spark_cluster.load('hdfs://…')
# h = gl.deploy.hadoop_cluster.load('hdfs://…')
e = gl.deploy.ec2_cluster.load('s3://…')

# set distribution environment to my AWS cluster
gl.set_distributed_execution_environment(e)
Dato Distributed enables GraphLab Create users to execute parallel computation of Python code tasks on EC2, Spark, or Hadoop clusters. The above shows how GraphLab Create can switch between these environments by changing one line of code. In GraphLab Create 1.5.1, Dato Distributed on Hadoop now seamlessly supports distributed execution of machine learning models including logistic regression, linear regression, SVM classifier, label propagation, and PageRank. Distributed machine learning on EC2 and Spark is in the works.
Thursday, July 16, 2015
Seven Python Tools for Data Scientists
Nice blog post by Dynelle Abeyta from Galvanize about popular Python tools for data science.
Sunday, July 5, 2015
Apache Zeppelin - yet another IPython notebook
I got this from my colleague Guy Rapoport:
Apache Zeppelin project website (incubating under Apache): https://zeppelin.incubator.apache.org/
Comparison to IPython Notebook