Monday, April 20, 2015

Time Series Support in Spark

I connected with Sandy Ryza, a Senior Data Scientist @ Cloudera has recently been working on a library for time series analysis with Spark.
Here is a quick Q&A about his project.

1) When did you start the project?

Pretty recently.  It looks like my first Github push was in the middle of February.
2) What is the goal of your project?

The goal is to provide a set of tools and abstractions that make it easy to deal with large scale time series data.  More concretely, a lot of this means basically mimicking the functionality of tried and true single-node libraries.  Especially in domains like finance, Pandas and Matlab dominate time series analysis.  The project seeks to provide an alternative that's well suited to a distributed setting.
3) Who are the contributors?

So far it's been a one-person project of myself, though I've received interest from a bunch of different people, and I'm hoping to see some external contributions soon.
4) What is the challenge to support time series data. Which mechanisms are currently missing in spark?

I think the main challenges are around finding the right abstractions for storage and analysis.  Single-node libraries can store a whole collection of time series in a single array and slice row or column-wise depending on the operation.  In a distributed setting, we need to think and make choices about how data is partitioned.  Good abstractions guide users towards patterns of access that the library can support in a performant way.
5) Are you supporting a streaming model or a bulk model or both

The initial focus is on bulk analysis.  We're targeting uses like backtesting of trading strategies or taking all the data at the end of a day and fitting risk models to it.  I think we'll definitely see uses that demand a streaming approach in the future, so I'm trying to design the interfaces with that in mind.

6) Which aggregation mechanisms are currently supported.

As the library sits on top of Spark, all of Spark's distributed operations are close at hand.  This makes it really easy to do things like train millions of autoregressive models on millions of time series in parallel and then pull those with the highest serial correlation to local memory for further analysis.  Or group and merge time series based on user-defined criteria.
7) Which algorithms are currently supported

So far I've been focusing on the standard battery of models and statistical tests that are used to model univariate time series.  They're the kinds of tools that are more likely to show up in a stats or econometrics course than a machine learning course: autoregressive models, GARCH models, and the like.
8) Which algorithms you plan to add in the near future?

There's quite a ways to go to support all the time series functionality that shows up in single-node libraries: seasoned ARIMA models, Kalman filters, sophisticated imputation strategies.  The other interesting bits come on the transformation / ETL side: joining and aligning time series sampled at different frequencies, bi-temporality 

9) What is the project license

Apache Version 2.0, which I think is the case for all Cloudera open source projects.
10) What are the programming interfaces supported. Is it Scala? Java?

The library is written in Scala, but supports Java as well.  A few people have expressed interest in Python bindings, so I'd like to support those when I find the time.

No comments:

Post a Comment