Data scientists tackle the analytic lifecycle

What happens after data scientists build analytic models? Model deployment, monitoring, and maintenance are topics that haven’t received as much attention in the past, but I’ve been hearing more about these subjects from data scientists and software developers. I remember the days when it took weeks before models I built got deployed in production. Long delays haven’t entirely disappeared, but I’m encouraged by the discussion and tools that are starting to emerge.

The problem can often be traced to the interaction between data scientists and production engineering teams: if there’s a wall separating these teams, then delays are inevitable. In contrast, having data scientists work more closely with production teams makes rapid iteration possible. Companies like LinkedIn, Google, and Twitter work to make sure data scientists know how to interface with their production environments. In many forward-thinking companies, data scientists and production teams work closely on analytic projects. Even a high-level understanding of production environments helps data scientists develop models that are feasible to deploy and maintain.

Model Deployment
Models generally have to be recoded before deployment (e.g., data scientists may favor Python, but production environments may require Java). PMML, an XML standard for representing analytic models, has made things easier. Companies that have access to in-database analytics¹ may opt to use their database engines to encode and deploy models.
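
To make the PMML idea concrete, here is a minimal sketch, assuming a plain linear regression and using only the Python standard library. The helper names (to_pmml, score_pmml) are hypothetical, and the document it produces is deliberately simplified: it omits the Header and MiningSchema elements that a valid PMML file needs, so treat it as an illustration of the handoff rather than a conforming exporter.

```python
import xml.etree.ElementTree as ET

# Namespace used by PMML 4.1 documents.
PMML_NS = "http://www.dmg.org/PMML-4_1"


def to_pmml(intercept, coefficients):
    """Serialize a linear regression as a simplified PMML-style string.

    Hypothetical helper: a real exporter would also emit Header and
    MiningSchema elements.
    """
    root = ET.Element("PMML", {"version": "4.1", "xmlns": PMML_NS})
    dictionary = ET.SubElement(root, "DataDictionary")
    for name in coefficients:
        ET.SubElement(dictionary, "DataField",
                      {"name": name, "optype": "continuous", "dataType": "double"})
    model = ET.SubElement(root, "RegressionModel", {"functionName": "regression"})
    table = ET.SubElement(model, "RegressionTable", {"intercept": repr(intercept)})
    for name, coefficient in coefficients.items():
        ET.SubElement(table, "NumericPredictor",
                      {"name": name, "coefficient": repr(coefficient)})
    return ET.tostring(root, encoding="unicode")


def score_pmml(pmml_text, row):
    """Score one record by reading coefficients back out of the XML."""
    ns = {"p": PMML_NS}
    table = ET.fromstring(pmml_text).find(".//p:RegressionTable", ns)
    total = float(table.get("intercept"))
    for predictor in table.findall("p:NumericPredictor", ns):
        total += float(predictor.get("coefficient")) * row[predictor.get("name")]
    return total


if __name__ == "__main__":
    document = to_pmml(1.0, {"age": 0.5, "visits": 2.0})
    print(score_pmml(document, {"age": 10, "visits": 3}))
```

The point of the exercise is that the scoring side never sees the training code: it only needs the XML, which is what lets a Java production system consume a model that was fit in Python.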

I’ve written about the open source tools kiji and Augustus, which consume PMML, let users encode models, and take care of model scoring in real time. In particular, the kiji project has tools for integrating model development (kiji-express) and deployment (kiji-scoring). Built on top of Cascading, Pattern is a new framework for building and scoring models on Hadoop (it can also consume PMML).

Quite often models are trained in batch² jobs, but the actual scoring is usually easy to do in “real-time” (making it possible for tools like kiji to serve as real-time recommendation engines).
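
The split between batch training and real-time scoring can be sketched as follows. This is an illustration under simple assumptions (a one-feature least-squares fit, a JSON string standing in for the stored model artifact); the names train_batch and Scorer are hypothetical and not taken from kiji or any other tool mentioned here.

```python
import json


def train_batch(pairs):
    """Batch step: fit y = intercept + slope * x by ordinary least squares.

    Touches every (x, y) pair, so it runs offline; returns a JSON artifact.
    """
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return json.dumps({"intercept": mean_y - slope * mean_x, "slope": slope})


class Scorer:
    """Serving step: load the artifact once, then score each request cheaply."""

    def __init__(self, artifact):
        params = json.loads(artifact)
        self.intercept = params["intercept"]
        self.slope = params["slope"]

    def score(self, x):
        # One multiply and one add per request, regardless of training size.
        return self.intercept + self.slope * x


if __name__ == "__main__":
    artifact = train_batch([(1, 3), (2, 5), (3, 7)])
    print(Scorer(artifact).score(10))
```

Training cost grows with the data, while the per-request cost stays constant, which is why scoring is usually the easy part to move into a real-time path.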

Model Monitoring and Maintenance
When evaluating models, it’s essential to measure the right business metrics (modelers tend to favor and obsess over quantitative/statistical measures). With the right metrics and dashboards in place, practices that are routine in IT Ops need to become more common in the analytic space. Already some companies monitor model performance closely – putting in place alerts and processes that let them quickly “fix, retrain, or replace” models that start tanking.
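
As a rough sketch of what such an alert might look like, the class below tracks a business metric over a sliding window of recent outcomes and flags the model once the average drops below a floor. The class name, window size, and threshold are all assumptions made for illustration; the right metric depends on the business.

```python
from collections import deque


class ModelMonitor:
    """Track a business metric over the most recent `window` outcomes."""

    def __init__(self, window, threshold):
        self.outcomes = deque(maxlen=window)  # oldest outcomes fall off
        self.threshold = threshold

    def record(self, outcome):
        # `outcome` is whatever the business cares about, e.g. 1.0 if a
        # recommendation led to a purchase and 0.0 otherwise.
        self.outcomes.append(float(outcome))

    def current_value(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_attention(self):
        # Only alert once the window is full, to avoid noisy early alarms.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.current_value() < self.threshold


if __name__ == "__main__":
    monitor = ModelMonitor(window=4, threshold=0.5)
    for outcome in [1, 1, 1, 1, 0, 0, 0]:
        monitor.record(outcome)
    if monitor.needs_attention():
        print("metric below threshold: fix, retrain, or replace the model")
```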

Prototypes built using historical data can fare poorly when deployed in production, so nothing beats real-world testing. Ideally the production environment allows for the deployment of multiple (competing) models³, in which case tools that let you test and compare multiple models are indispensable (via simple A/B tests or even multi-armed bandits).
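
One of the simplest bandit strategies is epsilon-greedy: most traffic goes to the model with the best observed reward, and a small fraction is spent exploring the others. The sketch below is a minimal version of that idea; the class name and the reward definition are assumptions, and production systems often prefer more careful strategies.

```python
import random


class EpsilonGreedyRouter:
    """Route requests among competing models with an epsilon-greedy policy."""

    def __init__(self, model_names, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.pulls = {name: 0 for name in model_names}
        self.reward = {name: 0.0 for name in model_names}

    def mean_reward(self, name):
        pulls = self.pulls[name]
        return self.reward[name] / pulls if pulls else 0.0

    def choose(self):
        # Try every model at least once before comparing averages.
        untried = [name for name, count in self.pulls.items() if count == 0]
        if untried:
            return untried[0]
        # Explore with probability epsilon, otherwise exploit the best model.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.pulls))
        return max(self.pulls, key=self.mean_reward)

    def update(self, name, reward):
        # `reward` is the observed business outcome for that request.
        self.pulls[name] += 1
        self.reward[name] += reward


if __name__ == "__main__":
    router = EpsilonGreedyRouter(["model_a", "model_b"], epsilon=0.1, seed=7)
    true_rates = {"model_a": 0.05, "model_b": 0.15}  # made-up conversion rates
    simulator = random.Random(11)
    for _ in range(5000):
        chosen = router.choose()
        router.update(chosen, 1.0 if simulator.random() < true_rates[chosen] else 0.0)
    print(router.pulls)
```

Compared with a fixed 50/50 A/B split, a bandit shifts traffic toward the better model while the test is still running, at the cost of a less clean statistical comparison.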

At the recent SAS Global Forum I came across the SAS Model Manager – a tool that attempts to address the analytic lifecycle. Among other things it lets you store and track versions of models. Proper versioning helps data scientists share their work, but it also can come in handy in other ways. For example, there’s a lot of metadata that you can attach to individual models (data schema, data lineage, parameters, algorithm(s), code/executable, etc.), all of which are important for troubleshooting⁴ when things go wrong⁵.
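
To illustrate how versioned metadata helps with troubleshooting, here is a small in-memory sketch. It is not how SAS Model Manager works; the ModelVersion and ModelRegistry classes and the schema_drift helper are hypothetical names, and a real registry would persist its records. The drift check compares the schema a model was trained on against the schema currently arriving in production.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelVersion:
    """Metadata attached to one stored version of a model."""
    name: str
    version: int
    algorithm: str
    parameters: dict
    schema: dict   # feature name -> type name at training time
    lineage: str   # free-text description of where the data came from


class ModelRegistry:
    """Hypothetical in-memory registry that numbers versions per model."""

    def __init__(self):
        self._versions = {}

    def register(self, name, algorithm, parameters, schema, lineage):
        history = self._versions.setdefault(name, [])
        record = ModelVersion(name, len(history) + 1, algorithm,
                              parameters, schema, lineage)
        history.append(record)
        return record

    def latest(self, name):
        return self._versions[name][-1]


def schema_drift(record, live_schema):
    """Report features the model expects that are missing or retyped."""
    missing = sorted(set(record.schema) - set(live_schema))
    changed = sorted(feature for feature in record.schema
                     if feature in live_schema
                     and live_schema[feature] != record.schema[feature])
    return {"missing": missing, "changed": changed}


if __name__ == "__main__":
    registry = ModelRegistry()
    registry.register("churn", "logistic_regression", {"C": 1.0},
                      {"age": "int", "plan": "str"}, "customer table snapshot")
    print(schema_drift(registry.latest("churn"), {"age": "float"}))
```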

Workflow Manager to tie it all together
Workflow tools provide a good framework for tying together various parts of the analytic lifecycle (SAS Model Manager is used in conjunction with SAS Workflow Studio). They make it easier to reproduce complex analytic projects and for team members to collaborate. Chronos already lets business analysts piece together complex data processing pipelines, while analytic tools like the SPSS Modeler and Alpine Data Labs do the same for machine-learning and statistical models.
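
At their core, workflow tools run steps in dependency order. The sketch below shows that idea using graphlib from the Python standard library (Python 3.9 or later); the run_pipeline function and the toy steps are assumptions for illustration and do not reflect how Chronos or SAS Workflow Studio are implemented.

```python
from graphlib import TopologicalSorter


def run_pipeline(steps, dependencies):
    """Run each step after the steps it depends on.

    `steps` maps a step name to a callable; `dependencies` maps a step name
    to the list of steps whose results it takes as arguments.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    results = {}
    for name in order:
        inputs = [results[dep] for dep in dependencies.get(name, ())]
        results[name] = steps[name](*inputs)
    return order, results


if __name__ == "__main__":
    # Toy stand-ins for the stages of the analytic lifecycle.
    steps = {
        "extract": lambda: [2.0, 4.0, 6.0],
        "train": lambda data: sum(data) / len(data),
        "evaluate": lambda model: model > 0,
        "deploy": lambda passed: "deployed" if passed else "held back",
    }
    dependencies = {
        "train": ["extract"],
        "evaluate": ["train"],
        "deploy": ["evaluate"],
    }
    order, results = run_pipeline(steps, dependencies)
    print(order, results["deploy"])
```

Because the whole pipeline is declared as data, rerunning it reproduces the same sequence of steps, which is much of what makes these tools useful for collaboration.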

With companies wanting to unlock the value of big data, there is growing interest in tools for managing the entire analytic lifecycle. I’ll close by once again citing one of my favorite quotes⁶ on this topic:

The next breakthrough in data analysis may not be in individual algorithms, but in the ability to rapidly combine, deploy, and maintain existing algorithms.
Hazy: Making it Easier to Build and Maintain Big-data Analytics

(1) Many commercial vendors offer in-database analytics. The open source library MADlib is another option.
(2) In certain situations online learning might be a requirement, in which case you have to guard against “spam” (garbage in, garbage out).
(3) A “model” could be a combination or ensemble of algorithms that reference different features and libraries. It would be nice to have an environment where you can test different combinations of algorithms, features, and libraries.
(4) Metadata is important for other things besides troubleshooting: it comes in handy for auditing purposes, or when you’re considering “reusing” an older model.
(5) A common problem is that a schema change may affect whether or not an important feature is getting picked up by a model.
(6) Courtesy of Chris Re and his students
