I’m getting tired of reading about whether Alpha is a Google-killer. I’ve seen Stephen Wolfram’s presentations a couple of times; he’s quite careful to say that it isn’t. There’s a fundamental difference that many people out there are just missing. Google is a search engine. Alpha looks like a search engine, but it isn’t; it’s all about curated data, and the analysis of that data.
What’s the difference? Look at one simple query: “earth circumference”. Alpha gives you one result, translated into a couple of units, along with information about the exact data source. Google gives you “about 1,190,000” results. Some of them answer the question “what is the Earth’s circumference”; some of them answer other questions, like “how did Fermi propose computing the Earth’s circumference”; some are cute, maybe even useful, to a particular audience (I’m sure there are elementary school papers and science curriculum assignments buried in there); and some are probably just plain bogus (I bet you could find pages from the Flat Earth Society in those 1.2 million).
Asking which result is “right” misses the point. Google is a search engine; it did exactly what it’s supposed to do. It isn’t making any assumptions about what you’re looking for, and will give you everything the cat dragged in. If you’re an elementary school teacher or a flat-earther, you can find the result you want somewhere in the big, messy pile. If you want accurate data from a known and reliable source, and you want to use that data in other computations, you don’t want Google’s answer; you want Alpha’s. (BTW, the Earth’s circumference is .1024 of the distance to the Moon.)
When is this important? Imagine we were asking a more politically charged question, like the correlation between childhood vaccinations and autism, or the number of civilians killed in the Six-Day War. Google will (and should) give you a wide range of answers, from every part of the spectrum. It’s up to you to figure out where the data actually came from. Alpha doesn’t yet have data about autism or Six-Day War casualties, and even when it does, no one should blindly assume that all data that’s “curated” is valid; but Wolfram does its homework, and when data like this is available, it will provide the source. Without knowing the source, you can’t even ask the question.
Collecting and curating all the world’s data is an insanely ambitious project, but that’s only the start. The bigger problem is creating a common taxonomy that makes data useful. It was trivial to ask Alpha the ratio of the Earth’s circumference to the Moon’s, because the data is stored in a way that makes it easily accessible for computation. You can ask Google for web pages that contain the same data, but before you can use the data, you’ll have to do a lot of “screen scraping” that’s much more difficult than getting the data in the first place. Again, this isn’t to say that Google or Wolfram is right or wrong; they’re just answering different questions. I’m working with a couple of authors who’ve done some brilliant work with R that collects online foreclosure data and analyzes it. Most of the code, and certainly the most difficult code, is screen-scraping and data-scrubbing, not statistics or analysis. Search results, returned as a web page, and data that’s compute-ready aren’t the same thing.
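The scraping problem is easy to see in miniature. Here is a minimal Python sketch of the contrast; the HTML snippet, field names, and source label are invented for illustration, not taken from any real page or from Alpha:

```python
import re

# What a search engine hands you: a web page. This markup is a made-up
# stand-in for a real results page.
html = '<tr><td>Earth circumference</td><td>40,075 km (approx.)</td></tr>'

# Screen-scraping: locate the figure, strip the formatting, coerce the type.
match = re.search(r'<td>([\d,]+)\s*km', html)
circumference_km = float(match.group(1).replace(',', ''))

# What a curated source hands you: data that is already compute-ready.
# The dictionary and its 'source' label are hypothetical.
curated = {'earth_circumference_km': 40075.0, 'source': 'assumed reference'}

# With structured data, further computation is one expression. The distance
# used here is a rough mean Earth-Moon distance in km; the actual ratio
# varies with the Moon's position in its orbit.
ratio = curated['earth_circumference_km'] / 384400.0
```

Even in this tiny example, most of the code is parsing and scrubbing, which is exactly the pattern in the foreclosure work mentioned above.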
Why would we care about making the world’s data accessible to computation? At O’Reilly’s FOO camps, we’ve been talking a lot about “citizen science”; Cornell’s many birding projects are one example. But citizen science is usually about creating the data: counting the birds in your back yard, and so on. That’s great, but the analysis is still done by professionals. Putting a computation engine together with a curated, structured data source takes citizen science a step further. With all the panic about Swine Flu, I’ve been thinking about data from the 1918 flu epidemic. With time-sequenced, location-specific data (how many people are sick at any given time in any given city), it would be fun to study how the flu spread. This particular data isn’t yet available in Alpha (Stephen Wolfram, take note!), but when the data becomes available, creating an animation that shows the geographical distribution of flu cases over time should be easy; you could watch the flu move from city to city (or not). If Alpha’s not up to the task, it can be done simply enough with a Mathematica/Alpha bridge. I’m not an epidemiologist, and I won’t pretend that this animation would reveal anything fundamental, but I also believe that the world is full of under-analyzed data. Citizen data analysis? This is a New Kind of Science indeed.
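As a toy illustration of what that citizen analysis could look like, here is a short Python sketch. The cities, weeks, and case counts are entirely invented (the real 1918 data isn’t in Alpha), and a real version would drive a map animation rather than build a list:

```python
# Toy model of the "flu map" idea: time-sequenced, location-specific case
# counts, stepped through week by week. All numbers are invented for
# illustration; this is NOT real 1918 epidemic data.
cases = {
    'Boston':        [5, 40, 200, 90, 20],
    'Philadelphia':  [0, 10, 150, 400, 120],
    'San Francisco': [0, 0, 5, 60, 300],
}

def peak_city(week):
    """Return the city with the most cases in the given week."""
    return max(cases, key=lambda city: cases[city][week])

# Stepping through time shows the (synthetic) outbreak moving west; with
# real data, each step would update one frame of the animation instead.
timeline = [peak_city(week) for week in range(5)]
```

The point isn’t the five lines of analysis; it’s that this analysis is only five lines once the data arrives in a structured, computable form.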