Evaluating machine learning systems: Kaggle's not enough

There is a tremendous amount of commercial attention on machine learning (ML) methods and applications. This includes product and content recommender systems, predictive models for churn and lead scoring, systems to assist in medical diagnosis, social network sentiment analysis, and on and on. ML often carries the burden of extracting value from big data.

But getting good results from machine learning still requires much art, persistence, and even luck. An engineer can’t yet treat ML as just another well-behaved part of the technology stack. There are many underlying reasons for this, but for the moment I want to focus on how we measure or evaluate ML systems.

Reflecting their academic roots, machine learning methods have traditionally been evaluated in terms of narrow quantitative metrics: precision, recall, RMS error, and so on. The data-science-as-competitive-sport site Kaggle has adopted these metrics for many of its competitions. They are objective and reassuringly concrete.

But scaled, production systems have very different requirements than proof-of-concept academic implementations or prize-winning models. Adopting these metrics from the research world incentivizes one-off, specialized, brittle solutions rather than the reliable, reusable, composable subsystems that form the foundation of good software engineering.

So I’d like to propose some different evaluation criteria for ML systems, with the hope that we raise our collective expectations of what they should provide and, eventually, build them differently.

Encapsulation & abstraction. An ML system should behave well as a component in a large software system. It should provide an elegant programming interface, use standard data formats, and hide as much complexity as possible from the developer.
Safety & conservatism. An ML system shouldn’t place the burden of avoiding overfitting on the user. It should be willing and able to communicate uncertainty about its results, including the possibility of “shrugging its shoulders” when the data is insufficient.
Simple and transparent controls. An ML system should expose its configuration and parameters in a clear, transparent way. The user should not have to perform heuristic searches through the parameter space, and there should be no art or mystery involved. The system should require as little tuning as possible from the user, with sensible defaults that handle the common cases.
Handling messy, real-world data. Real data has duplicates and missing values; is full of noise, errors, and surprises; and is composed of a mix of numerical, categorical, text, geospatial, and other kinds of data. ML systems should handle datasets as they are found in the wild, rather than forcing the user to perform significant cleanup and heuristic “feature engineering”.
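
To make the criteria above a little more concrete, here is a minimal sketch in Python of what a component built along these lines might look like. Everything in it is hypothetical and invented for illustration: the names ConservativeClassifier and Prediction, the min_support and min_confidence defaults, and the toy frequency-counting "model" itself. It is not drawn from any real library. The point is only the shape of the interface: two sensible defaults, tolerance for missing values, and a prediction that can "shrug its shoulders" by returning no label.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Hashable, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Prediction:
    """Result object (hypothetical): the uncertainty travels with the answer."""
    label: Optional[Hashable]  # None means the model abstains
    confidence: float          # share of matching training rows with the top label
    support: int               # number of training rows the estimate rests on


class ConservativeClassifier:
    """Toy model (an assumption for illustration): label counts per feature value."""

    def __init__(self, min_support: int = 5, min_confidence: float = 0.8):
        # Defaults are arbitrary illustrative values, not recommendations.
        self.min_support = min_support
        self.min_confidence = min_confidence
        self._counts = defaultdict(Counter)

    def fit(
        self, rows: Iterable[Tuple[Optional[Hashable], Hashable]]
    ) -> "ConservativeClassifier":
        for feature, label in rows:
            if feature is None:
                # Tolerate missing feature values instead of raising.
                continue
            self._counts[feature][label] += 1
        return self

    def predict(self, feature: Optional[Hashable]) -> Prediction:
        counts = self._counts.get(feature)
        if not counts:
            # Never seen this value (or it is missing): abstain.
            return Prediction(None, 0.0, 0)
        support = sum(counts.values())
        label, hits = counts.most_common(1)[0]
        confidence = hits / support
        if support < self.min_support or confidence < self.min_confidence:
            # Too little data or too mixed a signal: report the numbers, give no label.
            return Prediction(None, confidence, support)
        return Prediction(label, confidence, support)


# Made-up example data: (customer plan, outcome) pairs, one with a missing plan.
model = ConservativeClassifier().fit(
    [("gold", "stays")] * 9
    + [("gold", "churns")]
    + [("trial", "churns")] * 2
    + [(None, "stays")]
)
gold = model.predict("gold")    # enough consistent evidence, so a label is returned
trial = model.predict("trial")  # only two rows seen, so the model abstains
```

The design choice worth noting is that abstaining is part of the return type rather than an exception or a silent guess, so calling code has to decide what to do when the data is insufficient.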

ML methods and systems that are evaluated and perform well along these lines will help tame the complexity in smart software systems. As a result, more developers will be able to use them successfully and the resulting systems will be more resilient. ML will eventually transition from its current role as the high-maintenance prima donna of the data stack to a workhorse component.

I would love to see more effort devoted to improvements on these fronts, even if that means less emphasis on capturing incremental accuracy improvements on specific problems. These criteria represent a significant change in focus, though, and they may disturb some ML experts because they embrace a more black-box approach in which the internals of the system are less accessible. I would argue that it is exactly this sort of encapsulation – wisely performed and tastefully engineered – that is needed for machine learning to have the wide impact that, say, relational databases have achieved in the past decades.
