A new crop of data science tools for deploying, monitoring, and maintaining models
What happens after data scientists build analytic models? Model deployment, monitoring, and maintenance are topics that haven’t received as much attention in the past, but I’ve been hearing more about these subjects from data scientists and software developers. I remember the days when it took weeks before models I built got deployed in production. Long delays haven’t entirely disappeared, but I’m encouraged by the discussion and tools that are starting to emerge.
The problem can often be traced to the interaction between data scientists and production engineering teams: if there’s a wall separating these teams, then delays are inevitable. In contrast, having data scientists work more closely with production teams makes rapid iteration possible. Companies like LinkedIn, Google, and Twitter work to make sure data scientists know how to interface with their production environment. In many forward-thinking companies, data scientists and production teams work closely on analytic projects. Even a high-level understanding of production environments helps data scientists develop models that are feasible to deploy and maintain.
Models generally have to be recoded before deployment (e.g., data scientists may favor Python, but production environments may require Java). PMML, an XML standard for representing analytic models, has made things easier. Companies that have access to in-database analytics may opt to use their database engines to encode and deploy models.
The importance of data science tools that let organizations easily combine, deploy, and maintain algorithms
Data science often depends on data pipelines that involve acquiring, transforming, and loading data. (If you’re fortunate, most of the data you need is already in usable form.) Data needs to be assembled and wrangled before it can be visualized and analyzed. Many companies have data engineers (adept at using workflow tools like Azkaban and Oozie) who manage pipelines for data scientists and analysts.
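To make the acquire, transform, and load stages concrete, here is a minimal sketch in Python. The sample data, the column names (`user`, `amount`), and the three function names are all hypothetical, chosen only for illustration. A real pipeline would read from and write to actual data stores, and would usually be scheduled by a workflow tool rather than run inline.

```python
import csv
import io

# Hypothetical raw input: note the stray space and the missing amount.
RAW = "user,amount\nalice, 10\nbob,\nalice,5\n"

def acquire(text):
    """Parse raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Trim whitespace, drop rows with no amount, and convert amounts to floats."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip rows that cannot be used downstream
        cleaned.append({"user": row["user"].strip(), "amount": float(amount)})
    return cleaned

def load(rows):
    """Aggregate per-user totals; a dict stands in for the destination table."""
    totals = {}
    for row in rows:
        totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]
    return totals

totals = load(transform(acquire(RAW)))
```

Keeping each stage a separate function means a scheduler can run, retry, or replace any stage independently.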
A workflow tool for data analysts: Chronos from Airbnb
A raw bash scheduler written in Scala, Chronos is flexible, fault-tolerant, and distributed (it’s built on top of Mesos). What’s most interesting is that it makes the creation and maintenance of complex workflows more accessible: at least within Airbnb, it’s heavily used by analysts.
Job orchestration and scheduling tools contain features that data scientists would appreciate. They make it easy for users to express dependencies (start a job upon the completion of another job) and retries (particularly in cloud computing settings, jobs can fail for a variety of reasons). Chronos comes with a web UI designed to let business analysts define, execute, and monitor workflows: a zoomable DAG highlights failed jobs and displays stats that can be used to identify bottlenecks. Chronos lets you include asynchronous jobs – a nice feature for data science pipelines that involve long-running calculations. It also lets you easily define repeating jobs over a finite time interval, something that comes in handy for short-lived experiments (e.g., A/B tests or multi-armed bandits).
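The two ideas of dependencies and retries can be illustrated with a toy in-process runner. This is a minimal sketch and not Chronos's API: `run_workflow`, its parameters, and the three example jobs are hypothetical names, and the code assumes Python 3.9+ for the standard-library `graphlib` module. A real scheduler would also distribute jobs across machines, persist state, and wait between retries.

```python
from graphlib import TopologicalSorter

def run_workflow(jobs, deps, max_retries=2):
    """Run jobs in dependency order, retrying a failed job up to max_retries times.

    jobs: dict mapping job name -> zero-argument callable
    deps: dict mapping job name -> set of job names that must finish first
          (every name in deps is assumed to also be a key in jobs)
    Returns a dict mapping job name -> number of attempts it took.
    """
    graph = {name: deps.get(name, set()) for name in jobs}
    attempts = {}
    # static_order() yields each job only after all of its prerequisites.
    for name in TopologicalSorter(graph).static_order():
        for attempt in range(1, max_retries + 2):
            attempts[name] = attempt
            try:
                jobs[name]()
                break  # success: move on to the next job
            except Exception:
                if attempt == max_retries + 1:
                    raise  # retries exhausted: fail the whole workflow

    return attempts

# Hypothetical three-step workflow in which "transform" fails once, then succeeds.
log = []
state = {"failures_left": 1}

def extract():
    log.append("extract")

def transform():
    if state["failures_left"] > 0:
        state["failures_left"] -= 1
        raise RuntimeError("transient failure")
    log.append("transform")

def load():
    log.append("load")

attempts = run_workflow(
    {"load": load, "transform": transform, "extract": extract},
    {"transform": {"extract"}, "load": {"transform"}},
)
```

Here `attempts` records that `transform` needed a second try, which is the kind of per-job statistic a monitoring UI can surface to locate flaky steps.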
Bookigee's Kristen McLean says agile techniques from the software world also apply to publishing.
Bookigee founder Kristen McLean explains how lightweight development, flexible teams and other agile methods can help publishers with content development and workflows.
Goodbye clunky CMS. Hello low-cost agility.
The Bangor Daily News addressed its digital workflow issues with a creative new system built on Google Docs and WordPress. William Davis, the newspaper's online editor and the system's architect, explains how it works and why they did it.
Laura Dawson has made her slides available from the recent TOC Webcast, "Essential Tools of an XML Workflow." A complete recording of the event will be posted here soon.
Tools of Change for Publishing, in conjunction with StartWithXML, will host "Essential Tools of an XML Workflow," a free webcast with presenter Laura Dawson, on Thursday, Dec. 11 at 1 p.m. Eastern (10 a.m. Pacific). Webcast overview: This webcast is for those publishers who have made the decision to pursue digital channels for their content. What tools are out…
Below you'll find the full recording from the recent TOC Webcast, "What Publishers Need to Know about Digitization," with Liza Daly….