One of the highlights of the 2012 Strata California conference was the Oxford-style debate on the proposition “In data science, domain expertise is more important than machine learning skill.” If you weren’t there, Mike Driscoll’s summary is an excellent overview (full video of the debate is available here). To make a long story short, the “cons” won; the audience was won over to the side that machine learning is more important. That’s not surprising, given that we’ve all experienced the unreasonable effectiveness of data. From the audience, Claudia Perlich pointed out that she won data mining competitions on breast cancer, movie reviews, and customer behavior without any prior knowledge. And Pete Warden (@petewarden) made the point that, when faced with the problem of finding “good” pictures on Facebook, he ran a data mining contest at Kaggle.
A good impromptu debate necessarily raises as many questions as it answers. Here’s the question that I was left with. The debate focused on whether domain expertise was necessary to ask the right questions, but a recent Guardian article, “The End of Theory,” asked a different but related question: Do we need theory (read: domain expertise) to understand the results, the output of our data analysis? The debate focused on a priori questions, but maybe the real value of domain expertise is a posteriori: after-the-fact reflection on the results and whether they make sense. Asking the right question is certainly important, but so is knowing whether you’ve gotten the right answer and knowing what that answer means. Neither problem is trivial, and in the real world, they’re often closely coupled. Often, the only way to know you’ve put garbage in is that you’ve gotten garbage out.
By the same token, data analysis frequently produces results that make too much sense, yielding data that merely reflects the biases of the organization doing the work. Bad sampling techniques, overfitting, cherry picking datasets, overly aggressive data cleaning, and other errors in data handling can all lead to results that are either too expected or too unexpected. “Stupid Data Miner Tricks” is a hilarious send-up of the problems of data mining: It shows how to “predict” the value of the S&P index over a 10-year period based on butter production in Bangladesh, cheese production in the U.S., and the world sheep population.
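It’s easy to see how this kind of “prediction” arises. The Python sketch below is a simplified stand-in, not the method used in “Stupid Data Miner Tricks”: it assumes the target and all 500 candidate “predictors” are independent random walks, and it simply keeps whichever candidate correlates best with the target over a 10-point window. The function names, the number of candidates, and the window lengths are all hypothetical choices made for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def random_walk(rng, steps):
    """Cumulative sum of Gaussian steps; unrelated to any other series."""
    level, series = 0.0, []
    for _ in range(steps):
        level += rng.gauss(0, 1)
        series.append(level)
    return series


def mine_best_predictor(seed=0, n_candidates=500, n_train=10, n_test=10):
    """Pick the candidate that best 'explains' the target in the training
    window, then report its correlation inside and outside that window."""
    rng = random.Random(seed)
    total = n_train + n_test
    target = random_walk(rng, total)  # assumed stand-in for an index
    candidates = [random_walk(rng, total) for _ in range(n_candidates)]

    # The "data mining" step: search many meaningless series for the best fit.
    best = max(
        candidates,
        key=lambda c: abs(pearson(c[:n_train], target[:n_train])),
    )

    in_sample = pearson(best[:n_train], target[:n_train])
    out_of_sample = pearson(best[n_train:], target[n_train:])
    return in_sample, out_of_sample


if __name__ == "__main__":
    in_sample, out_of_sample = mine_best_predictor()
    print(f"in-sample correlation:     {in_sample:+.2f}")
    print(f"out-of-sample correlation: {out_of_sample:+.2f}")
```

The in-sample correlation is strong because it is the best of hundreds of tries, not because the winning series means anything; nothing ties it to the target outside the window it was selected on.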
Cherry picking and overfitting have particularly bad “smells” that are often fairly obvious: The Democrats never lose a presidential election in a year when the Yankees win the World Series, for example. (Hmmm. The 2000 election was rather fishy.) Any reasonably experienced data scientist should be able to stay out of trouble, but what if you treat your data with care and it still spits out an unexpected result? Or an expected result that’s too good to be true? After the data crunching has been done, it’s the subject expert’s job to ensure that your results are good, meaningful, and well-understood.
Let’s say you’re an audio equipment seller analyzing a lot of purchase data and you find out that people buy more orange juice just before replacing their home audio system. It’s an unlikely, absurd (and completely made up) result, but stranger things have happened. I’d probably go and build an audio gear marketing campaign targeting bulk purchasers of orange juice. Sales would probably go up; data is “unreasonably effective,” even if you don’t know why. This is precisely where things get interesting, and precisely where I think subject matter expertise becomes important: after the fact. Data breeds data, and it’s naive to think that marketing audio gear to OJ addicts wouldn’t breed more datasets and more analysis. It’s naive to think the OJ data wouldn’t be used in combination with other datasets to produce second-, third-, and fourth-order results. That’s when the unreasonable effectiveness of data isn’t enough; that’s when it’s important to understand the results in ways that go beyond what data analysis alone can currently give us. We may have a useful result that we don’t understand, but is it meaningful to combine that result with other results that we may (or may not) understand?
Let’s look at a more realistic scenario. Pete Warden’s Kaggle-based algorithm for finding quality pictures works well, despite giving the surprising result that pictures with “Michigan” in the caption are significantly better than average. (As are pictures from Peru, and pictures taken of tombs.) Why Michigan? Your guess is as good as mine. For Warden’s application, building photo albums on the fly for his company Jetpac, that’s fine. But if you’re building a more complex system that plans vacations for photographers, you’d better know more than that. Why are the photographs good? Is Michigan a destination for birders? Is it a destination for people who like tombs? Is it a destination with artifacts from ancient civilizations? Or would you be better off recommending a trip to Peru?
Another realistic scenario: Target recently used purchase histories to target pregnant women with ads for baby-related products, with surprising success. I won’t rehash that story. From that starting point, you can go a lot further. Pregnancies frequently lead to new car purchases. New car purchases lead to new insurance premiums, and I expect data will show that women with babies are safer drivers. At each step, you’re compounding data with more data. It would certainly be nice to know you understood what was happening at each step of the way before offering a teenage driver a low insurance premium just because she thought a large black handbag (that happened to be appropriate for storing diapers) looked cool.
There’s a limit to the value you can derive from correct but inexplicable results. (Whatever else one may say about the Target case, it looks like they made sure they understood the results.) It takes a subject matter expert to make the leap from correct results to understood results. In an email, Pete Warden said:
“My biggest worry is that we’re making important decisions based on black-box algorithms that may have hidden and problematic biases. If we’re deciding who to give a mortgage based on machine learning, and the system consistently turns down black people, how do we even notice it, let alone fix it, unless we understand what the rules are? A real-world case is trading systems. If you have a mass of tangled and inexplicable logic driving trades, how do you assign blame when something like the Flash Crash happens?
“For decades, we’ve had computer systems we don’t understand making decisions for us, but at least when something went wrong we could go in afterward and figure out what the causes were. More and more, we’re going to be left shrugging our shoulders when someone asks us for an explanation.”
That’s why you need subject matter experts to understand your results, rather than simply accepting them at face value. It’s easy to imagine that subject matter expertise requires hiring a PhD in some arcane discipline. For many applications, though, it’s much more effective to develop your own expertise. In an email exchange, DJ Patil (@dpatil) said that people often become subject experts just by playing with the data. As an undergrad, he had to analyze a dataset about sardine populations off the coast of California. Trying to understand some anomalies led him to ask questions about coastal currents, why biologists only count sardines at certain stages in their life cycle, and more. Patil said:
“… this is what makes an awesome data scientist. They use data to have a conversation. This way they learn and bring other data elements together, create tests, challenge hypothesis, and iterate.”
By asking questions of the data, and using those questions to ask more questions, Patil became an expert in an esoteric branch of marine biology, and in the process greatly increased the value of his results.
When subject expertise really isn’t available, it’s possible to create a workaround through clever application design. One of my takeaways from Patil’s “Data Jujitsu” talk was the clever way LinkedIn “crowdsourced” subject matter expertise to their membership. Rather than sending job recommendations directly to a member, they’d send them to a friend, and ask the friend to pass along any they thought appropriate. This trick doesn’t solve problems with hidden biases, and it doesn’t give LinkedIn insight into why any given recommendation is appropriate, but it does an effective job of filtering inappropriate recommendations.
Whether you hire subject experts, grow your own, or outsource the problem through the application, data only becomes “unreasonably effective” through the conversation that takes place after the numbers have been crunched. At his Strata keynote, Avinash Kaushik (@avinash) revisited Donald Rumsfeld’s statement about known knowns, known unknowns, and unknown unknowns, and argued that the “unknown unknowns” are where the most interesting and important results lie. That’s the territory we’re entering here: data-driven results we would never have expected. We can only take our inexplicable results at face value if we’re just going to use them and put them away. Nobody uses data that way. To push through to the next, even more interesting result, we need to understand what our results mean; our second- and third-order results will only be useful when we understand the foundations on which they’re based. And that’s the real value of a subject matter expert: not just asking the right questions, but understanding the results and finding the story that the data wants to tell. Results are good, but we can’t forget that data is ultimately about insight, and insight is inextricably tied to the stories we build from the data. And those stories are going to be ever more essential as we use data to build increasingly complex systems.
- The Data Science Debate: Domain expertise vs Machine Learning (full video session from Strata CA 12)
- Building data science teams
- Picture Perfect: Bo Yang on winning the Photo Quality Prediction competition
- DJ Patil on “Data Jujitsu” (video interview)
- The data analysis path is built on curiosity, followed by action