Why is building custom recommender systems hard? Does it have to be?


Photo Courtesy of Carlos Guestrin

By Carlos Guestrin

Today, it’s shocking (and honestly exciting) how much of my daily experience is determined by a recommender system.  These systems drive amazing experiences everywhere, telling me where to eat, what to listen to, what to watch, what to read, and even who I should be friends with.  Furthermore, information overload is making recommender systems indispensable, since I can’t find what I want on the web simply using keyword search tools.  Recommenders are behind the success of industry leaders like Netflix, Google, Pandora, eHarmony, Facebook, and Amazon.  It’s no surprise companies want to integrate recommender systems with their own online experiences.  However, as I talk to team after team of smart industry engineers, it has become clear that building and managing these systems is usually a bit out of reach, especially given all the other demands on the team’s time.

Since starting the GraphLab open-source project over five years ago, we’ve learned a ton from our community of software engineers, data scientists, statisticians, and product managers who are using GraphLab to build recommender systems.  Over the years, we have heard a consistent set of pain points with building and managing these data products including:

  • learning the machine learning skills needed to build large-scale recommender systems is challenging, especially since the methods are trapped in thick books and research papers disconnected from practice;
  • building, optimizing, and deploying these systems requires mastering and connecting a long list of disparate and brittle tools;
  • operationalizing these systems is difficult, especially since tuning and optimizing the solution doesn’t stop at deployment time.

Learning new skills and tools is hard and time-consuming.  Building and managing recommender systems today requires specialized expertise in analytics, applied machine learning, software engineering, and systems operations.  That makes it challenging regardless of your background or skillset.  Data scientists typically have strong machine learning and statistics expertise, but are unfamiliar with operational issues like monitoring data quality, model performance, and system performance, as well as writing production-level code.  Software engineers, on the other hand, may be familiar with the operational aspects but are new to the basics and idiosyncrasies of machine learning.  Still others transitioned to data science from a field like physics or biology and have had to learn all of this on the job.  It can be a bit scary to get started when faced with a stack of books, never-ending blog posts, tutorials, and academic papers just to understand the basics.  Even seasoned data scientists we work with are deep in certain areas and complete beginners in others.

Most of us learn to build by playing with example code.  I think we need more examples that more people can understand, tweak, and apply to their problems.  At GraphLab, we use IPython notebooks to share and teach each other, and we plan to share our example notebooks on our website.  These notebooks combine two things I love, a wiki and Python, to provide an interactive description of the analyses we have done.  And anyone can download a notebook and adapt the code to their own problem.
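To give a flavor of the kind of example code I mean, here is a minimal, notebook-style sketch of item-based collaborative filtering in plain Python.  This is not GraphLab's API; the data and function names are hypothetical, and it's an illustration of the technique rather than a production implementation:

```python
from collections import defaultdict
from math import sqrt

# Toy user -> {item: rating} data; purely illustrative.
ratings = {
    "alice": {"matrix": 5, "inception": 4, "up": 1},
    "bob":   {"matrix": 4, "inception": 5, "up": 2},
    "carol": {"up": 5, "frozen": 4, "matrix": 1},
}

def item_vectors(ratings):
    """Invert user -> item ratings into item -> {user: rating} vectors."""
    items = defaultdict(dict)
    for user, prefs in ratings.items():
        for item, r in prefs.items():
            items[item][user] = r
    return items

def cosine(a, b):
    """Cosine similarity between two item vectors, over shared users."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[u] * b[u] for u in shared)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def recommend(user, ratings, top_n=2):
    """Score unseen items by similarity-weighted ratings of seen items."""
    items = item_vectors(ratings)
    seen = ratings[user]
    scores = defaultdict(float)
    for candidate in items:
        if candidate in seen:
            continue
        for item, r in seen.items():
            scores[candidate] += cosine(items[candidate], items[item]) * r
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("alice", ratings))
```

The point isn't the algorithm's sophistication; it's that a reader can run this end to end in a notebook, swap in their own data, and start tweaking from there.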

There are too many disconnected and specialized tools.  From data stores to ETL.  Prototyping in R, then rewriting in Java.  Visualization in JavaScript.  It sometimes feels like without bash scripts gluing these different systems together, the world would fall apart!  There's been a growing movement in the data science community to consolidate pipelines into a single usable language like Python.  GraphLab, amongst others, is part of the Python movement and will soon offer our Python interface for GraphLab widely.  But today, we can't build and deploy applications like recommender systems using one set of end-to-end tools, in one language.  As a community and ecosystem, I think we can do better.

We’re all strapped for time.  Many data science teams we’ve talked to spend a lot of time on the easy but time-consuming problems.  Meanwhile, the harder proprietary work that makes their recommenders special gets delayed by all this grunt work.  Most are faced with an unappealing choice between spending a lot of time and resources to DIY, or integrating a black-box solution that is, by definition, closed and not customizable.  Why can’t we get started faster with off-the-shelf solutions that are open and customizable, and solve the easy “80% problems” out of the box?
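An example of an "80% problem" is the baseline every recommender needs before the clever proprietary part: recommending the most popular items a user hasn't seen yet.  As a hedged sketch (the data and function names here are made up for illustration), it's only a few lines, yet teams rewrite it from scratch again and again:

```python
from collections import Counter

# Toy interaction log of (user, item) pairs; purely illustrative data.
clicks = [
    ("alice", "matrix"), ("bob", "matrix"), ("carol", "matrix"),
    ("alice", "up"), ("carol", "up"),
    ("bob", "frozen"),
]

def popularity_recommend(user, clicks, top_n=2):
    """Baseline: most-clicked items this user hasn't interacted with yet."""
    counts = Counter(item for _, item in clicks)
    seen = {item for u, item in clicks if u == user}
    candidates = [item for item, _ in counts.most_common() if item not in seen]
    return candidates[:top_n]

print(popularity_recommend("bob", clicks))
```

An open, customizable off-the-shelf solution would hand you this baseline (and the data plumbing around it) so your time goes into the 20% that actually differentiates your product.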

Managing recommender systems in production is annoying.  We hear this story a lot.  “We have a recommender system that was written from scratch by some guy.  He’s since left the company, but his system is running on a Hadoop cluster.  It’s outdated, but no one wants to touch it because our company’s revenue depends on it!”  A lot of stuff is hacked together; it initially serves a purpose but later turns into serious technical debt.  It’s amazing how many companies bet money on old, brittle code.  What’s more, there’s no platform to operationalize data products and spit out things like data-quality, train/test, and model-health reports.  Once the person who built your system has left the company, chances are you’ve lost much of the knowledge of what to watch for to make sure your models are up to snuff.  We need a lifecycle approach to building and maintaining data products.

Learn to build your own recommender system easily at our Strata tutorial.  If you want to learn more about building a recommender system from scratch, consider attending my upcoming tutorial, Large-scale Machine Learning Cookbook using GraphLab, at Strata Santa Clara. In this three-hour hands-on workshop I will teach fundamental machine learning techniques and demonstrate how to use them to build a production-ready recommender system in Python using the latest version of GraphLab. You can download and learn more about GraphLab technologies at GraphLab.com.

O’Reilly Strata Conference — Strata brings together the leading minds in data science and big data — decision makers and practitioners driving the future of their businesses and technologies. Get the skills, tools, and strategies you need to make data work.

Strata in Santa Clara: February 11-13 | Santa Clara, CA
