Google Admits "Data is the Intel Inside"

That least-understood principle from my original Web 2.0 manifesto, “Data is the Intel Inside,” is finally coming out of the closet. A post on the Google Operating System Blog entitled “Google is Really About Large Amounts of Data” notes that in an interview at the Web 2.0 Summit in October, Marissa Mayer, Google’s VP of Search Products and User Experience, “confessed that having access to large amounts of data is in many instances more important than creating great algorithms.”

Right now Google is really good with keywords, and that’s a limitation we think the search engine should be able to overcome with time. People should be able to ask questions, and we should understand their meaning, or they should be able to talk about things at a conceptual level. We see a lot of concept-based questions — not about what words will appear on the page but more like “what is this about?” A lot of people will turn to things like the semantic Web as a possible answer to that. But what we’re seeing actually is that with a lot of data, you ultimately see things that seem intelligent even though they’re done through brute force.

When you type in “GM” into Google, we know it’s “General Motors.” If you type in “GM foods” we answer with “genetically modified foods.” Because we’re processing so much data, we have a lot of context around things like acronyms. Suddenly, the search engine seems smart like it achieved that semantic understanding, but it hasn’t really.

(Sounds like she’s very much in my camp on the Web 2.0 vs. semantic web debate.)
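To make the “GM” example concrete, here is a toy sketch of how plain counting can look like understanding. It is only an illustration under stated assumptions, not a description of how Google does it: the snippets, the `build_counts` and `expand` functions, and the scoring rule are all hypothetical, and a real system would work from vastly more data.

```python
from collections import Counter, defaultdict

# Hypothetical snippets: (expansion, text in which "GM" appeared).
# In practice this context would be mined from a huge corpus.
SNIPPETS = [
    ("General Motors", "gm reports record truck sales this quarter"),
    ("General Motors", "gm recalls cars over faulty brakes"),
    ("General Motors", "gm stock rises after earnings"),
    ("genetically modified", "labeling rules for gm foods and crops"),
    ("genetically modified", "farmers plant gm crops such as corn"),
]


def build_counts(snippets):
    """Count how often each expansion occurs and which words surround it."""
    prior = Counter()
    context = defaultdict(Counter)
    for expansion, text in snippets:
        prior[expansion] += 1
        for word in text.lower().split():
            if word != "gm":
                context[expansion][word] += 1
    return prior, context


def expand(query, prior, context):
    """Pick the expansion whose observed context best matches the query.

    Ties (including a bare "GM" with no context words) fall back to the
    expansion seen most often overall.
    """
    words = [w for w in query.lower().split() if w != "gm"]

    def score(expansion):
        overlap = sum(context[expansion][w] for w in words)
        return (overlap, prior[expansion])

    return max(prior, key=score)


PRIOR, CONTEXT = build_counts(SNIPPETS)
print(expand("GM", PRIOR, CONTEXT))
print(expand("GM foods", PRIOR, CONTEXT))
```

Nothing here knows what a car or a crop is. The apparent smarts come entirely from the counts, which is the point of the quote: more data sharpens the counts without any change to the algorithm.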

In particular, Marissa admitted that the reason for offering free 411 service was to get phoneme data for speech recognition algorithms. You heard it first on Radar. What’s also interesting, though, is her note on why they want better speech recognition algorithms right now: to improve video search. There’s an interesting principle here, namely that the obvious applications for a technology (e.g. transcription or speech recognition interfaces) aren’t necessarily the ones that will have the biggest impact. This is a good reason for companies like Google to increase their data collection of all kinds (and their basic research into algorithms for using that data). As the applications become apparent, the data will be valuable in new ways, and the company with the most data wins.