Simpler workflow tools enable the rapid deployment of models

Data science often depends on data pipelines that involve acquiring, transforming, and loading data. (If you’re fortunate, most of the data you need is already in usable form.) Data needs to be assembled and wrangled before it can be visualized and analyzed. Many companies have data engineers (adept at using workflow tools like Azkaban and Oozie) who manage¹ pipelines for data scientists and analysts.

A workflow tool for data analysts: Chronos from Airbnb
A raw bash scheduler written in Scala, Chronos is flexible, fault-tolerant², and distributed (it’s built on top of Mesos). What’s most interesting is that it makes the creation and maintenance of complex workflows more accessible: at least within Airbnb, it’s heavily used by analysts.

Job orchestration and scheduling tools contain features that data scientists would appreciate. They make it easy for users to express dependencies (start a job upon the completion of another job) and retries (particularly in cloud computing settings, jobs can fail for a variety of reasons). Chronos comes with a web UI designed to let business analysts³ define, execute, and monitor workflows: a zoomable DAG highlights failed jobs and displays stats that can be used to identify bottlenecks. Chronos lets you include asynchronous jobs – a nice feature for data science pipelines that involve long-running calculations. It also lets you easily define repeating jobs over a finite time interval, something that comes in handy for short-lived⁴ experiments (e.g., A/B tests or multi-armed bandits).
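
To make the ideas of dependencies, retries, and finite repeating schedules concrete, here is a minimal Python sketch. It is not Chronos code and does not use the Chronos API: the function names `expand_schedule` and `run_workflow`, the status labels, and the treatment of the repeat count as "number of runs" are all assumptions made for illustration. The schedule parser handles only a small subset of ISO 8601 repeating intervals (a count, a start time, and a day/hour/minute/second period).

```python
import re
from datetime import datetime, timedelta
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Illustrative subset of ISO 8601 repeating intervals: R<count>/<start>/P[nD][T[nH][nM][nS]]
_INTERVAL = re.compile(
    r"^R(?P<count>\d+)/(?P<start>[^/]+)/"
    r"P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def expand_schedule(spec):
    """Return the run times described by a finite repeating interval.

    Assumption: the count is read as the total number of runs.
    """
    m = _INTERVAL.match(spec)
    if m is None:
        raise ValueError(f"unsupported interval: {spec!r}")
    # Older Python versions do not accept a trailing "Z", so map it to an offset.
    start = datetime.fromisoformat(m["start"].replace("Z", "+00:00"))
    step = timedelta(
        days=int(m["days"] or 0),
        hours=int(m["hours"] or 0),
        minutes=int(m["minutes"] or 0),
        seconds=int(m["seconds"] or 0),
    )
    if not step:
        raise ValueError("the period must be longer than zero")
    return [start + i * step for i in range(int(m["count"]))]


def run_workflow(jobs, deps, max_retries=2):
    """Run jobs in dependency order, retrying each failed job.

    jobs: job name -> zero-argument callable
    deps: job name -> set of job names that must succeed first
    Returns job name -> "ok", "failed", or "skipped".
    """
    graph = {name: set(deps.get(name, ())) for name in jobs}
    status = {}
    for name in TopologicalSorter(graph).static_order():
        # A job is skipped when any job it depends on did not succeed.
        if any(status.get(parent) != "ok" for parent in graph[name]):
            status[name] = "skipped"
            continue
        for _attempt in range(1 + max_retries):
            try:
                jobs[name]()
                status[name] = "ok"
                break
            except Exception:
                status[name] = "failed"
    return status
```

For example, `expand_schedule("R3/2013-08-01T00:00:00Z/PT12H")` yields three run times spaced twelve hours apart, which is the shape of schedule a short-lived experiment needs. A real scheduler would also persist job state, distribute work across machines, and raise alerts on repeated failure, all of which this sketch leaves out.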

The unreasonable effectiveness of data: model selection & deployment
By enabling Airbnb analysts to take prototype workflows and easily deploy them to production, Chronos taps into a need that other⁵ tools are beginning to address. Startup Alpine Data Labs provides a GUI tool that lets business analysts define and manage multi-step analytic workflows.

The landmark paper by Banko and Brill hinted that with massive amounts of data, the choice of models becomes less important. Thus tools that let you easily deploy analytic models at scale become just as important as specific algorithms. A noteworthy project out of UW-Madison – Hazy – seeks to simplify the deployment and maintenance of analytic models.

“The next breakthrough in data analysis may not be in individual algorithms, but in the ability to rapidly combine, deploy, and maintain existing algorithms.”
Hazy: Making it Easier to Build and Maintain Big-data Analytics

(1) Data scientists may build prototypes, but repeatable pipelines tend to be the domain of data engineers.
(2) As with other workflow tools, Chronos includes alerts (for job deletions and for failure after a specified number of retries).
(3) Chronos jobs are defined via a web GUI; other tools require the creation and maintenance of “configuration” files. Chronos also comes with a simple REST API.
(4) Chronos uses ISO 8601, which makes it easy to define repeating intervals and configure jobs that repeat over a time period, after which they get deleted.
(5) Other companies include Trifacta, Ufora, and BI tools Datameer and Platfora.

