What it takes to build great machine learning products

Rich machine learning products come from skilled and knowledgeable teams.

Machine learning (ML) is all the rage, riding tight on the coattails of the “big data” wave. Like most technology hype, the enthusiasm far exceeds the realization of actual products. Arguably, not since Google’s tremendous innovations in the late ’90s/early 2000s has algorithmic technology led to a product that has permeated popular culture. That’s not to say there haven’t been great ML wins since, but none has been as impactful or had computational algorithms at its core. Netflix may use recommendation technology, but Netflix is still Netflix without it. There would be no Google if Page, Brin, et al., hadn’t exploited the graph structure of the web and anchor text to improve search.

So why is this? It’s not for lack of trying. How many startups have aimed to bring natural language processing (NLP) technology to the masses, only to fade into oblivion after people actually try their products? The challenge in building great products with ML lies not just in understanding basic ML theory, but in understanding the domain and problem sufficiently to operationalize intuitions into model design. Interesting problems don’t have simple off-the-shelf ML solutions. Progress in important ML application areas, like NLP, comes from insights specific to these problems, rather than from generic ML machinery. Often, specific insights into a problem and careful model design make the difference between a system that doesn’t work at all and one that people will actually use.

The goal of this essay is not to discourage people from building amazing products with ML at their cores, but to be clear about where I think the difficulty lies.

Progress in machine learning

Machine learning has come a long way over the last decade. Before I started grad school, training a large-margin classifier (e.g., SVM) was done via John Platt’s batch SMO algorithm. In that case, training time scaled poorly with the amount of training data. Writing the algorithm itself required understanding quadratic programming and was riddled with heuristics for selecting active constraints and black-art parameter tuning. Now, we know how to train a nearly performance-equivalent large-margin classifier in linear time using a (relatively) simple online algorithm (PDF). Similar strides have been made in (probabilistic) graphical models: Markov-chain Monte Carlo (MCMC) and variational methods have facilitated inference for arbitrarily complex graphical models [1]. Anecdotally, take a look at papers over the last eight years in the proceedings of the Association for Computational Linguistics (ACL), the premier natural language processing publication. A top paper from 2011 has orders of magnitude more technical ML sophistication than one from 2003.
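To give a feel for how simple such an online trainer can be, here is a minimal sketch of a stochastic subgradient update for a linear large-margin classifier, in the style of Pegasos. Whether this is the exact algorithm in the linked paper is an assumption on my part, and the function names, the regularization strength lam, and the epoch count are all illustrative choices rather than anything from the paper.

```python
import random


def train_linear_svm(examples, lam=0.01, epochs=20, seed=0):
    """Pegasos-style stochastic subgradient descent on the hinge loss.

    examples: list of (features, label) pairs, label in {-1, +1}.
    lam, epochs, and seed are illustrative defaults, not tuned values.
    """
    rng = random.Random(seed)
    dim = len(examples[0][0])
    w = [0.0] * dim
    t = 0
    for _ in range(epochs):
        order = list(range(len(examples)))
        rng.shuffle(order)
        for i in order:
            x, y = examples[i]
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            # Margin is computed with the weights from before this update.
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            # Shrink the weights (gradient of the L2 regularizer).
            w = [(1.0 - eta * lam) * wj for wj in w]
            # Only examples inside the margin contribute a hinge-loss step.
            if margin < 1.0:
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w


def predict(w, x):
    """Return +1 or -1 according to the sign of the linear score."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1
```

Each update touches one example and costs time proportional to the number of features, so a pass over the data is linear in its size. There is no quadratic program and no active-set heuristic in sight.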

On the education front, we’ve come a long way as well. As an undergrad at Stanford in the early-to-mid 2000s, I took Andrew Ng’s ML course and Daphne Koller’s probabilistic graphical model course. Both of these classes were among the best I took at Stanford and were only available to about 100 students a year. Koller’s course in particular was not only the best course I took at Stanford, but the one that taught me the most about teaching. Now, anyone can take these courses online.

As an applied ML person (specifically, in natural language processing), I have found that much of this progress has made aspects of research significantly easier. However, the core decisions I make are not which abstract ML algorithm, loss function, or objective to use, but what features and structure are relevant to solving my problem. This skill only comes with practice. So, while it’s great that a much wider audience will have an understanding of basic ML, it’s not the most difficult part of building intelligent systems.

Interesting problems are never off the shelf

The interesting problems that you’d actually want to solve are far messier than the abstractions used to describe standard ML problems. Take machine translation (MT), for example. Naively, MT looks like a statistical classification problem: You get an input foreign sentence and have to predict a target English sentence. Unfortunately, because the space of possible English is combinatorially large, you can’t treat MT as a black-box classification problem. Instead, like most interesting ML applications, MT problems have a lot of structure and part of the job of a good researcher is decomposing the problem into smaller pieces that can be learned or encoded deterministically. My claim is that progress in complex problems like MT comes mostly from how we decompose and structure the solution space, rather than ML techniques used to learn within this space.

Machine translation has improved by leaps and bounds throughout the last decade. I think this progress has largely, but not entirely, come from keen insights into the specific problem, rather than generic ML improvements. Modern statistical MT originates from an amazing paper, “The mathematics of statistical machine translation” (PDF), which introduced the noisy-channel architecture on which future MT systems would be based. At a very simplistic level, this is how the model works [2]: For each foreign word, there are potential English translations (including the null word for foreign words that have no English equivalent). Think of this as a probabilistic dictionary. These candidate translation words are then re-ordered to create a plausible English translation. There are many intricacies being glossed over: how to efficiently consider candidate English sentences and their permutations, what model is used to learn the systematic ways in which reordering occurs between languages, and the details about how to score the plausibility of the English candidate (the language model).
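As a rough sketch, in notation that is mine rather than the article’s (f for the foreign sentence, e for a candidate English sentence), the noisy-channel decomposition is usually written:

```latex
\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(e)\, P(f \mid e)
```

Here P(e) is the language model that scores how plausible the English candidate is, and P(f | e) is the translation model, which covers the probabilistic dictionary and the reordering described above. The second equality is Bayes’ rule with the constant P(f) dropped.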

The core improvement in MT came from changing this model: rather than learning translation probabilities of individual words, phrase-based systems learn models of how to translate foreign phrases to English phrases. For instance, the German word “abends” translates roughly to the English prepositional phrase “in the evening.” Before phrase-based translation (PDF), a word-based model would only get to translate to a single English word, making it unlikely to arrive at the correct English translation [3]. Phrase-based translation generally results in more accurate translations with fluid, idiomatic English output. Of course, adding phrase-based emissions introduces several additional complexities, including how to estimate phrase emissions given that we never observe phrase segmentation; no one tells us that “in the evening” is a phrase that should match up to some foreign phrase. What’s surprising here is that it isn’t general ML improvements that are making this difference, but problem-specific model design. People can implement, and have implemented, more sophisticated ML techniques for various pieces of an MT system. And these do yield improvements, but typically far smaller than good problem-specific research insights.
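The contrast between the two models can be illustrated with a toy sketch. Everything here is hypothetical: the two tables are hand-written rather than learned, there are no probabilities, no reordering, and no language model, and the greedy longest-match segmentation stands in for the search a real decoder performs over many segmentations.

```python
# Hypothetical hand-written tables; a real system learns these from
# parallel text and attaches probabilities to every entry.
WORD_TABLE = {"abends": "evening", "ich": "I", "lese": "read"}
PHRASE_TABLE = {
    ("abends",): "in the evening",
    ("ich", "lese"): "I read",
}


def translate_word_based(tokens):
    """Map each foreign word to exactly one English word."""
    return " ".join(WORD_TABLE.get(tok, tok) for tok in tokens)


def translate_phrase_based(tokens, max_len=3):
    """Greedy, monotone longest-match lookup in the phrase table.

    Falls back to the word table when no phrase matches.
    """
    out = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + n])
            if span in PHRASE_TABLE:
                out.append(PHRASE_TABLE[span])
                i += n
                break
        else:
            # No phrase starts here; translate the single word instead.
            out.append(WORD_TABLE.get(tokens[i], tokens[i]))
            i += 1
    return " ".join(out)
```

With the input ["ich", "lese", "abends"], the word-based function can only emit one English word for “abends,” while the phrase-based one can emit the full prepositional phrase. The hard part the sketch skips is exactly the one mentioned above: nobody hands you the phrase table or the segmentation.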

Franz Och, one of the authors of the original phrase-based papers, went on to Google and became the principal person behind the search company’s translation efforts. While the intellectual underpinnings of Google’s system go back to Och’s days as a research scientist at the Information Sciences Institute (and earlier as a graduate student), many of the gains beyond the insights underlying phrase-based translation (and minimum-error rate training, another of Och’s innovations) came from a massive software engineering effort to scale these ideas to the web. That effort itself yielded impressive research into large-scale language models and other areas of NLP. It’s important to note that Och, in addition to being a world-class researcher, is also, by all accounts, an incredibly impressive hacker and builder. It’s this rare combination of skills that can bring ideas all the way from a research project to where Google Translate is today.

Defining the problem

But I think there’s an even bigger barrier beyond ingenious model design and engineering skills. In the case of machine translation and speech recognition, the problem being solved is straightforward to understand and well-specified. Many of the NLP technologies that I think will revolutionize consumer products over the next decade are much vaguer. How, exactly, can we take the excellent research in structured topic models, discourse processing, or sentiment analysis and make a mass-appeal consumer product?

Consider summarization. We all know that in some way, we’ll want products that summarize and structure content. However, for computational and research reasons, you need to restrict the scope of this problem to something for which you can build a model and an algorithm, and which you can ultimately evaluate. For instance, in the summarization literature, the problem of multi-document summarization is typically formulated as selecting a subset of sentences from the document collection and ordering them. Is this the right problem to be solving? Is the best way to summarize a piece of text a handful of full-length sentences? Even if a summarization is accurate, does the Franken-sentence structure yield summaries that feel inorganic to users?
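To make the sentence-selection formulation concrete, here is a minimal sketch. The scoring rule (average word frequency across the input) is an assumption chosen for brevity, not a method from the summarization literature cited here, and the function name and parameter k are hypothetical.

```python
import re
from collections import Counter


def extractive_summary(text, k=2):
    """Pick the k highest-scoring sentences and return them in source order.

    A sentence's score is the average corpus frequency of its words,
    which is an illustrative stand-in for a learned scoring model.
    """
    sentences = [
        s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()
    ]

    def words(sentence):
        return re.findall(r"[a-z']+", sentence.lower())

    freq = Counter(w for s in sentences for w in words(s))

    def score(sentence):
        toks = words(sentence)
        return sum(freq[w] for w in toks) / len(toks) if toks else 0.0

    ranked = sorted(
        range(len(sentences)),
        key=lambda i: score(sentences[i]),
        reverse=True,
    )
    chosen = sorted(ranked[:k])  # restore original ordering
    return [sentences[i] for i in chosen]
```

Whatever the scoring rule, the output is always a handful of whole sentences lifted from the source, which is precisely the constraint the questions above are probing.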

Or, consider sentiment analysis. Do people really just want a coarse-grained thumbs-up or thumbs-down on a product or event? Or do they want a richer picture of sentiments toward individual aspects of an item (e.g., loved the food, hated the decor)? Do people care about determining sentiment attitudes of individual reviewers/utterances, or producing an accurate assessment of aggregate sentiment?

Typically, these decisions are made by a product person and are passed off to researchers and engineers to implement. The problem with this approach is that ML-core products are intimately constrained by what is technically and algorithmically feasible. In my experience, having a technical understanding of the range of related ML problems can inspire product ideas that might not occur to someone without this understanding. To draw a loose analogy, it’s like architecture. So much of the construction of a bridge is constrained by material resources and physics that it doesn’t make sense to have people without that technical background design a bridge.

The goal of all this is to say that if you want to build a rich ML product, you need to have a rich product/design/research/engineering team. That team needs to span everything from the nitty-gritty of how ML theory works to building systems, domain knowledge, higher-level product thinking, technical interaction, and graphic design, preferably with people who are world-class in one of these areas but also good in several. Small talented teams with all of these skills are better equipped to navigate the joint uncertainty with respect to product vision as well as model design. Large companies that have research and product people in entirely different buildings are ill-equipped to tackle these kinds of problems. The ML products of the future will come from startups with small founding teams that have this full context and can all fit in the proverbial garage.

[1]: Although MCMC is a much older statistical technique, its broad use in large-scale machine learning applications is relatively recent.

[2]: The model is generative, so what’s being described here is from the point-of-view of inference; the model’s generative story works in reverse.

[3]: IBM model 3 introduced the concept of fertility to allow a given word to generate multiple independent target translation words. While this could generate the required translation, the probability of the model doing so is relatively low.



  • Aria, great article! I took Andrew Ng’s online ML class last year (and Norvig and Thrun’s AI class too), and I’m currently going through Daphne’s PGM online class (and the NLP one too) and wholeheartedly agree with you on the introspection gained thanks to these classes.

    I would like to point out that one challenge that you haven’t mentioned is volume of data and number of features/dimensions. While you can get by, sometimes, in NLP, by building better language models, as the number of features grows and the volume to be processed increases, traditional technologies can quickly fall short of expectations.

    I don’t know if you have seen it, but the open source HPCC Systems platform has a good set of fully distributed and extensible Machine Learning libraries, which scale linearly with the number of computing nodes in the system. And the convenience of having an extremely abstract and high level user interface through ECL encourages iterative speculation (“what if…” type of agile development).

    Please take a look (http://hpccsystems.com/ml) and let me know what you think.



  • Barry Railton

    I took Andrew Ng’s class last semester and Thrun and Norvig’s class too.

    I see the potential for a lot of growth in this area just because of the large number of unsolved problems that exist and the revenue potential of solving them.

  • stephen

    “The goal of all this is to say that if you want to build a rich ML product, you need to have a rich product/design/research/engineering team”

    nah, dude this isnt even true. Some folks like myself and some others in data markets/finance tinker on their investment models and make slick products without “best in class teams”.

    Also the best ML products are not from startups but rather IBM and Dwave systems/google who are really trying to push “physical” AI and ML routines forward in the lab.

    also no graph based semi supervised learning shout outs? keep up the good work at prismatic.

  • Ozgur

    Yay! I am one of the lucky people who have been taking both Andrew Ng’s and Daphne Koller’s online courses. What I wonder is why you didn’t mention Deep Learning. Isn’t it the state of the art in supervised/unsupervised machine learning?

    What do you think of this?

  • not since Google … in the late ’90s/early 2000s has algorithmic technology led to a product that has permeated the popular culture.

    Very nice insight.

    Interesting problems are never off the shelf

    This may be true in research. But given your statement above, how can we say regarding internet businesses (or ML applications in in-house software) whether X or Y kind of approach is the right one? There are no stand-out examples.

    My take on Google’s use of not-very-deep mathematics to great commercial effect is that the focus wasn’t on how cool is this algorithm. The algo took input from a wide variety of humans and calculated something simple and understandable as the output. (NB: Netflix only implemented two ML techniques after the Netflix prize, neither one fancy.)

    In my view, something like a “recommend me news” app shouldn’t focus on trying to “calculate your OPTIMAL newsfeed!!!” but instead, provide options and filters in a feedback process with the user. AI can also be used to make tiny improvements to the user experience (the kind of stuff that users would mutter, “ugh! I just told it that, why doesn’t it default to guessing something smarter?”).

    What ML should not try to do, in my opinion, is calculate a master answer to the user’s question based on some static info (e.g., “list of movie ratings”). I do think solving static problems is good in basic research.

  • I want to second that it’s an excellent article with a very valid point. In my experience, not even solid knowledge of ML theory and the application domain is enough, because the developer must also be able to integrate the machine in the reality where the product will be used. For example, the computations may need to be distributed, the data may be sensitive and must be encrypted, the user interface may have to be adapted to possibly inaccurate output, and so forth. I think it’s safe to say that the field “Machine Learning meets reality” will receive a lot of attention in the next few years.

  • sip

    Despite all this science, the resulting quality of Google translation is still just miserable.