One of the best presentations at IBM’s recent Blogger Day was given by David Ferrucci, the leader of the Watson team, the group that developed the supercomputer that recently appeared as a contestant on Jeopardy.
To many people, the Turing test is the gold standard of artificial intelligence. Put briefly, the idea is that if you can’t tell whether you’re interacting with a computer or a human, a computer has passed the test.
But it’s easy to forget how subtle this criterion is. Turing proposes changing the question from “Can machines think?” to the operational criterion, “Can we distinguish between a human and a machine?” But it’s not a trivial question: it’s not “Can a computer answer difficult questions correctly?” but rather, “Can a computer behave in ways that are indistinguishable from human behavior?” In other words, getting the “right” answer has nothing to do with the test. In fact, if you were trying to tell whether you were “talking to” a computer or a human, and got only correct answers, you would have every right to be deeply suspicious.
Alan Turing was thinking explicitly of this: in his 1950 paper, he proposes question/answer pairs like this:
Q: Please write me a sonnet on the subject of the Forth Bridge.
A: Count me out on this one. I never could write poetry.
Q: Add 34,957 to 70,764.
A: (Pause about 30 seconds and then give as answer) 105,621.
We’d never think of asking a computer the first question, though I’m sure there are sonnet-writing projects going on somewhere. And the hypothetical answer is equally surprising: it’s neither a sonnet (good or bad), nor a core dump, but a deflection. It’s human behavior, not accurate thought, that Turing is after. This is equally apparent with the second question: while it’s computational, just giving an answer (which even a computer from the early ’50s could do immediately) isn’t the point. It’s the delay that simulates human behavior.
Dave Ferrucci, IBM scientist and Watson project director
While Watson presumably doesn’t have delays programmed in, and appears only in a situation where deflecting a question (sorry, it’s Jeopardy, deflecting an answer) isn’t allowed, it’s much closer to this kind of behavior than any serious attempt at AI that I’ve seen. It’s an attempt to compete at a high level in a particular game. The game structures the interaction, eliminating some problems (like deflections) but adding others: “misleading or ambiguous answers are par for the course” (to borrow from NPR’s “What Do You Know”). Watson has to parse ambiguous sentences and decouple multiple clues embedded in one phrase to come up with a question. Time is a factor, and more than time, confidence that the answer is correct. After all, it would be easy for a computer to buzz first on every question (electronics does timing really well), but buzzing first whether or not you know the answer would be a losing strategy for a computer, as well as for a human. In fact, Watson would handle the first of Turing’s questions perfectly: if it isn’t confident of an answer, it doesn’t buzz, just like a human Jeopardy player.
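To make the point about confidence concrete, here is a minimal sketch of why buzzing on everything loses. It relies only on the Jeopardy rule that a wrong response costs the clue's value; the function names and the 0.5 threshold are my own illustrative assumptions, not a description of how Watson actually decides.

```python
def expected_gain(confidence: float, clue_value: int) -> float:
    """Expected score change from buzzing.

    A correct response adds clue_value, a wrong one subtracts it, so the
    expectation is confidence * value - (1 - confidence) * value.
    """
    return (2.0 * confidence - 1.0) * clue_value


def should_buzz(confidence: float, threshold: float = 0.5) -> bool:
    """Buzz only when confidence clears the threshold.

    The 0.5 default is the break-even point of expected_gain; it is an
    assumption for illustration, not Watson's real setting.
    """
    return confidence > threshold


def total_expected_score(confidences, clue_value, always_buzz):
    """Sum the expected gain over a set of clues for one strategy."""
    return sum(
        expected_gain(c, clue_value)
        for c in confidences
        if always_buzz or should_buzz(c)
    )


# Hypothetical confidences for five clues of equal value.
confidences = [0.9, 0.2, 0.6, 0.1, 0.3]
print(total_expected_score(confidences, 1000, always_buzz=True))
print(total_expected_score(confidences, 1000, always_buzz=False))
```

With these made-up numbers, the always-buzz strategy comes out below the selective one, because every low-confidence buzz has a negative expected value.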
Equally important, Watson is not always right. While the film clip on IBM’s site shows some spectacular wrong answers (and wrong answers that don’t really duplicate human behavior), it’s an important step forward. As Ferrucci said when I spoke to him, the ability to be wrong is part of the problem. Watson’s goal is to emulate human behavior on a high level, not to be a search engine or some sort of automated answering machine.
Some fascinating statements are at the end of Turing’s paper. He predicted computers with a gigabyte of storage by 2000 (roughly correct, assuming that Turing was talking about what we now call RAM), and thought that we’d be able to achieve thinking machines in that same time frame. We aren’t there yet, but Watson shows that we might not be that far off.
But there’s a more important question than what it means for a machine to think, and that’s whether machines can help us to ask questions about huge amounts of ambiguous data. I was at a talk a couple of weeks ago where Tony Tyson talked about the Large Synoptic Survey Telescope project, which will deliver dozens of terabytes of data per night. He said that in the past, we’d use humans to take a first look at the data and decide what was interesting. Crowdsourcing analysis of astronomical images isn’t new, but the number of images coming from the LSST is too large even for a project like GalaxyZoo. With this much data, using humans is out of the question. LSST researchers will have to use computational techniques to figure out what’s interesting.
“What is interesting in 30TB?” is an ambiguous, poorly defined question involving large amounts of data, not that different from Watson. What’s an “anomaly”? You really don’t know until you see it, just as you can’t parse a tricky Jeopardy answer until you see it. And while finding data anomalies is a much different problem from parsing misleading natural language statements, both projects are headed in the same direction: they are asking for human behavior in an ambiguous situation. (Remember, Tyson’s algorithms are replacing humans in a job humans have done well for years.) While Watson is a masterpiece of natural language processing, it’s important to remember that it’s just a learning tool that will help us to solve more interesting problems. The LSST and problems of that scale are the real prize, and Watson is the next step.
Photo credit: Courtesy of International Business Machines Corporation. Unauthorized use not permitted.