That least-understood principle from my original Web 2.0 manifesto, “Data is the Intel Inside,” is finally coming out of the closet. A post on the Google Operating System Blog entitled Google is Really About Large Amounts of Data notes that in an interview at the Web 2.0 Summit in October, Marissa Mayer, Google’s VP of Search Products and User Experience, “confessed that having access to large amounts of data is in many instances more important than creating great algorithms.”

Right now Google is really good with keywords, and that’s a limitation we think the search engine should be able to overcome with time. People should be able to ask questions, and we should understand their meaning, or they should be able to talk about things at a conceptual level. We see a lot of concept-based questions — not about what words will appear on the page but more like “what is this about?” A lot of people will turn to things like the semantic Web as a possible answer to that. But what we’re seeing actually is that with a lot of data, you ultimately see things that seem intelligent even though they’re done through brute force.

When you type in “GM” into Google, we know it’s “General Motors.” If you type in “GM foods” we answer with “genetically modified foods.” Because we’re processing so much data, we have a lot of context around things like acronyms. Suddenly, the search engine seems smart like it achieved that semantic understanding, but it hasn’t really.

(Sounds like she’s very much in my camp on the Web 2.0 vs. semantic web debate.)
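The disambiguation Mayer describes can be illustrated with a toy sketch: score each candidate expansion of "GM" by how often the query's other words co-occur with it in a snippet corpus. The snippets below are invented for illustration and bear no relation to Google's actual approach — the point is only that context words plus counts, with no semantic model at all, get the right answer.

```python
from collections import Counter

# Invented snippets pairing each expansion of "GM" with contexts
# it tends to appear in (purely illustrative data).
SNIPPETS = {
    "general motors": [
        "gm announces new car models",
        "gm stock price falls",
    ],
    "genetically modified": [
        "gm foods and crop safety",
        "labeling gm foods debate",
    ],
}

def disambiguate(query, snippets=SNIPPETS):
    """Pick the expansion whose snippet vocabulary best overlaps the query."""
    query_words = set(query.lower().split()) - {"gm"}
    best, best_score = None, -1
    for expansion, texts in snippets.items():
        vocab = Counter(w for t in texts for w in t.split())
        score = sum(vocab[w] for w in query_words)
        if score > best_score:
            best, best_score = expansion, score
    return best

print(disambiguate("GM foods"))  # → genetically modified
print(disambiguate("GM stock"))  # → general motors
```

With enough real data behind the counts, this brute-force overlap starts to look like understanding, which is exactly Mayer's point.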

In particular, Marissa admitted that the reason for offering free 411 service was to get phoneme data for speech recognition algorithms. You heard it first on Radar. What’s also interesting, though, was her note on why they want better speech recognition algorithms right now: to improve video search. There’s an interesting principle here, namely that the obvious applications for a technology (e.g. transcription or speech recognition interfaces) aren’t necessarily the ones that will have the biggest impact. This is a great reason why companies like Google are increasing their data collection of all kinds (and their basic research into algorithms for using that data). As the applications become apparent, the data will be valuable in new ways, and the company with the most data wins.

tags:
Comments:
• There are two fields within linguistics that seem to be relevant – corpus linguistics and formulaic language.

Corpus linguistics takes mass amounts of data and parses it into chunks allowing people to study how language is actually used, as opposed to how people think it is used. From there, it kind of devolves into lexicography, semantics, semiotics, and probably some other branches as well. For what it’s worth, Google released their corpus a few years ago.

Formulaic language looks a lot at words and phrases that are often used together, and the idea is that we use lots of phrases over and over again. Some are general, but many are dependent on context. “Root” means one thing to a sysadmin and something else to a horticulturalist. If you say “root,” each one of them will know exactly what you are talking about because they have a preconceived idea of “root”. I’m digressing though.

Search engines can collect this data two ways. By spidering the web they get an idea of how language is used on websites. If they categorized the sites into blogs, government, business, etc., they would give further context to how language is used. They can also collect data from what people enter into search boxes. SEO types would probably love to know that information.

Search engines could probably turn it into a revenue stream on a subscription basis – something like a combination of Google Zeitgeist and Statistical Abstract of the US. OTOH, releasing the data could also skew potential future results, if people use the data to increase their rankings.

• In their more academic presentations on search and AI, Google has been saying for years that simple algorithms trained on large corpuses trump sophisticated algorithms trained on smaller corpuses. Coming at it from this perspective, the idea that data is fundamental has never been news.

I always took your remark, “Data is the new Intel Inside”, as the insight that unique data was increasingly becoming the core value proposition for a number of kinds of business. I’m not sure Google was ever guilty of not admitting this.

• It looks like the old battle of data vs. process is coming back.
As Bud Gibson put it in his comment:

Google has been saying for years that simple algorithms trained on large corpuses trump sophisticated algorithms trained on smaller corpuses

And what about sophisticated algorithms on large corpuses? Why is everyone trying to get into an “either-or” mode?

Sure more data is better than less data. Learning systems have more material to learn from. Feedback systems have more feedback to provide. Etc. Etc.

The same battle raged years ago, when software engineers were fighting against the database specialists: the database side was winning initially, then was somewhat defeated when SOA and its associated services became predominant (at the expense of data), and now it is smiling again because data has regained its cool factor.

The truth is that it is not an either-or. The winners will be those who can apply complex algorithms over large chunks of data.

• Romuald —

I think you over-simplify. Of course, complex algorithms over lots of data may be better than brute algorithms over the same data. But the key point I’ve been making is that NOT EVERYONE WILL HAVE ACCESS TO THE DATA. If someone amasses a sufficiently large amount of data via network effects, and no one else has access to that data, they may be able to do things that are impossible for others, or impossibly expensive. Being smart and coming up with really clever new algorithms may not be enough to break into the market as more and more applications are built on top of huge proprietary databases that would be very expensive to recreate.

Sure, OpenStreetMap is trying to rebuild the street databases built by Navteq and TeleAtlas as open source data, but it’s a big enough problem that Nokia ended up paying $8.1 billion for Navteq, and TomTom paid $1.5 billion for TeleAtlas…

There are going to be lots of these cases, far more than I think people imagine.

• Tim

An interesting post. Data as the ‘Intel Inside’ is key to much of what we currently call the ‘Semantic Web’, as I tried to show in a piece for this month’s SemanticReport – http://www.semanticreport.com/index.php?option=com_content&task=view&id=62&Itemid=1&ed=3.

It’s also behind the current drive for effective licensing regimes to ensure Open and reusable pools of Data – http://blogs.talis.com/nodalities/2007/12/talis_and_creative_commons_lau.php

Paul

• Tim, I’m curious why you think data that isn’t accessible to everyone will prevail, or be of such great importance in the future?
Looking at the web I draw the conclusion that openness will conquer any attempt to restrict access. Please clarify why you think companies will be able to restrict access to data without losing customers to competitors.

• So, a company like Google has massive amounts of data and has access to more. Going through the data (likely aided by algorithms), some patterns/trends/whatever emerge and suggest possible applications.

I think for my part it’s like the old saying, “can’t see the forest for the trees.” While I know there is a lot of data out there, it’s impossible for me to get a grasp on everything that’s out there. In *nix everything is a file. It sounds like everything is data – people, roads, voices, buildings, money, statistics, language, etc.

• As a guess, it may not be that the data is restricted (not open), but restricted (not well known). If Google finds a great source for data, it’s not like they are going to tell MS where to find it, even if it is publicly available. They will keep the information to themselves until they figure out how to exploit it. When they announce the project, it probably wouldn’t take long to figure out the source of the data. Then the other companies have to figure out if it is worth launching a competing project.

It might be data no one else has because no one else thought to look for it.

It might also be that having a variety of data allows the companies to mix and match to create new products, like having phone book data and street map data might allow a consumer to find the nearest pizza place by using their mobile phone. With even more data, they might be able to get the menu and reviews. A targeted ad might see that someone is looking for a restaurant in a certain area and suggest a Chinese place a block away, or maybe offer coupons. The data probably originates from different sources, but the company that’s able to collect the data and make it useful for consumers is likely to make some money.

As a practical matter, if maybe one or two companies collected restaurant menus, it would be in Google’s interest to buy them or make some sort of exclusive deal, denying the data to other companies. I think that’s the Navteq – TeleAtlas – OpenStreetMap idea. The data can be collected some other way, but it would pretty much be starting from scratch.

• Thanks Michael H, that’s clarifying but I wonder whether proprietary data is a sustainable advantage? Isn’t the speed at which the net recreates processed data accelerating? (thereby making data ownership less of an advantage)

That statement by Marissa is incorrect.

Google has in fact modified their algos to include acronyms and stemming in the keyword-based search results.

This can be verified by the BOLDING of various forms of a given keyword query in the organic SERPs.

That suggestion had been insisted on tenaciously over the past two years on a high-profile Google engineer’s blog.
http://www.mattcutts.com/blog/video-anatomy-of-a-search-snippet/

Sadly, one gets the feeling that Marissa may not be up to date on the technicalities.

• Search Engines Web — it isn’t clear at all to me from what you say just what part of Marissa’s statement quoted above is incorrect. Your comment seems to bear no relationship to what she said. How does including acronyms and stemming change the fact that they use context to recognize GM for general motors vs. genetically modified?

• Rikard —

There are lots of examples of how exclusive data is used for proprietary advantage already in the age of the internet. There are several ways to get data advantage:

1. Spend a lot of money collecting the data. This is what navteq and teleatlas did with streets and addresses.

2. Get the government to give you a monopoly (if only for a while) before anyone realizes how valuable the data is. This is what Verisign (Network Solutions) did with the DNS registry. They extracted a lot of money before competition was introduced into domain name registrars. West did this with certain types of legal and court data.

3. Get your data built into enough services and devices (at a reasonable enough price) that no one except activists feel the need to switch. Think Gracenote’s CDDB (originally a user-generated db, btw), now built into every CD burner and music app via web services, so ubiquitous that everyone takes for granted that song names and artists and album titles can be automatically looked up. Yes, musicbrainz is trying to recreate this as an open database, but most people don’t care.

4. Aggregate enough data that scale alone becomes a barrier to entry. There are only a few companies that are big enough to manage data at web scale these days. If you’re not there already, you’ll have a hard time catching up. I saw this when I was on the board of Nutch, which was launched with the idea of building an open source search engine. But it could never be more than a demonstration project, because building an index of the appropriate size required huge server farms.

5. Even if you’re open, being the biggest and the best has gravitational effects. Wikipedia is as open as can be, but no one has as yet managed to create a rival free online encyclopedia that matters. Network effects kick in if you’re first, and good enough, and then a virtuous cycle commences.

I could go on and on. But the point is that understanding what data is going to be valuable and why, what data is reproducible at low cost via user contribution, is the calculus of competitive advantage in the Web 2.0 era.

• …But what we’re seeing actually is that with a lot of data, you ultimately see things that seem intelligent even though they’re done through brute force.

Because we’re processing so much data, we have a lot of context around things like acronyms. Suddenly, the search engine seems smart like it achieved that semantic understanding, but it hasn’t really.

There is an implication there that strongly suggests she is not really that savvy about the technicalities (what is brute force?????).

Popular acronyms were recently programmed into the SERPs manually; it was not a result of ‘processing so much data’.

Additionally, ‘CONTEXT’ understanding really is questionable; her assessment is just too casual and ambiguous.

While the effects of personalization are noticeable, one is really hard pressed to detect any PROGRAMMED, INTELLIGENT contextual understanding in any of the organics. It all appears to be a manual, case-by-case endeavor.

It is VERY possible that a demanding, thorough interview with the head of Search Quality or Chief Engineers/Developers DIRECTLY involved in programming algos would offer a more lucid technical insight. If that feat is at all remotely possible, please allow questions beforehand to be submitted by readers.

• I really like that Marissa defines Google’s skill with keywords as a ‘limitation.’ I infer that it’s not the keywords that are the limitation, it’s the inability to dynamically generate semantic responses.

Here’s how I interpret what she’s saying: as long as nobody’s got a true semantic web figured out, we’re going to win via the appearance of a semantic web, made possible through vast amounts of data.

Could it be that a genuinely semantic approach would trump data quantity?

• Kaila – please follow the link in the main story on web 2.0 vs. semantic web and read some of the stuff I’ve written about the different approaches.

It isn’t a matter of “figuring out” the semantic web. I think it’s a matter of “discovering” the semantic web. Pagerank is a kind of semantic web discovery, as is facebook’s “social graph.” Web 2.0 is about statistically extracting implicit semantics from a large mass of data, rather than expecting people to make the effort to be careful and make the data all nice for semantics.

Language is messy. Shakespeare: “the hot blood leaps over the cold decree.” The semantic web is web 2.0 run by grammarians. Web 2.0 is the semantic web achieved via slang.
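The claim that PageRank statistically extracts implicit semantics can be made concrete: PageRank is a power iteration over the link graph, repeatedly redistributing each page’s rank along its outbound links until the scores settle. A minimal sketch over an invented three-page graph:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a {page: [outlinks]} graph."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # Dangling pages spread their rank evenly.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# Toy graph: "c" is linked by both "a" and "b", so it outranks "b".
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

No page declares its own importance; the semantics emerge statistically from how everyone else links, which is the Web 2.0 approach in miniature.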

• Tim, thanks for your reply. I realized I hadn’t really understood what “Intel Inside” meant but now things are a bit more clear.
A follow-up question. Will ownership of data become less important or more important? What if you consider that most of the text information on the web will fit on one $1,000 PC within say fifteen years?

• Rikard —

I do think that ubiquitous personal storage will help (but more important will be the return of P2P architectures to share that storage — if we don’t want to see the internet re-centralized.)

I don’t think that measuring current data volumes gives us any clue, though. Ubiquitous sensors are going to be generating data of all kinds that we would never have thought to keep or store before it was possible. What will happen when companies have access, for instance, to your entire daily location track (which they will surely do as location-based advertising becomes common)? And of course, there’s going to be more and more media content. But I think sensor data is going to be the killer…

However, I am sure that in 15 years, we’ll be well beyond this stage of the industry, or at least seeing the next horizon. The age of hardware as king of proprietary advantage, the age of software as king of proprietary advantage — each had their day. Why should the age of network-effect data be any different?

• RE: “You heard it first on Radar”:

I hope you’re not serious. Everyone should already know that Google — and tons of other people, from Wall Street to law enforcement — are interested in, developing, and acquiring all manner of technology related to pattern recognition, whether it be speech, OCR, facial recognition, click streams, other streams and relationships, credit profiles, other profiles, etc., etc. This is what it’s all about, after all. It is what ever-increasing computer power, bandwidth, software, etc. are all driving toward. Skim a Ray Kurzweil book or just look at the trends yourself and realize that they are on an exponential path rather than a linear one. That should get you where you need to be, thinking-wise. Cheers, chrisco

• PS: My comment was just a friendly poke… In one of those moods right now. Hehe. Love your blog. Cheers again. PS: Powerful forms of Artificial Intelligence will be here soon enough… problems solved… and new problems created. And so it goes.

• chrisco —

I’ve been talking about this for ten years. It isn’t news to me. But you’d be surprised how long it takes for the concept of “data is the greatest source of competitive advantage” to take over from “software apis, algorithms, and interfaces are the greatest source of competitive advantage.”

A lot of Web 2.0 startups don’t get this, even today. And certainly big software companies built in the PC era don’t get it. If they did, they’d be much less worried about open source software, and would be leveraging their software assets to acquire data assets (or acquiring data assets like NavTeq rather than leaving it to Nokia…)

• Davideo

Neophytes, as I sit and listen to Pink Floyd’s Dark Side Of The Moon, I recall the advent of ICs, stereo, cassette tapes, CDs, MP3s, DRM, BetaMax, VHS, DVDs, HDs, mainframes, IBM, PCs, MS, Linux, Usenet, CompuServe, WWW, AOL, Yahoo, Google, et al…

Innovation and paradigm shifts eclipse the previously unshakable foundations of technology over and over throughout time; Google too shall fall, weighted down by its own unrestrained mass. Now, what comes next: that is the real speculation…