Google's Rich Snippets and the Semantic Web

There’s a long-running debate between those who advocate for semantic markup and those who believe that machine learning will eventually get us to the holy grail of a Semantic Web: one in which computer programs actually understand the meaning of what they see and read. Google has, of course, been the great proof point of the power of machine learning algorithms.

Earlier this week, Google made a nod to the other side of the debate, introducing a feature that they call “Rich Snippets.” Basically, if you mark up pages with certain microformats (and soon, with RDFa), Google will take this data into account and will provide enhanced snippets in the search results. Supported microformats in the first release include those for people and for reviews.
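To make the mechanics concrete, here is a rough sketch of what review markup and its consumption might look like. The class names below follow the published hReview microformat vocabulary, but the sample HTML and the parser are illustrative only, not Google’s actual pipeline:

```python
# Sketch: how a crawler might read hReview-style microformat markup.
# The HTML sample is hypothetical; the class names ("hreview", "item",
# "fn", "rating", "description") come from the hReview vocabulary.
from html.parser import HTMLParser

SAMPLE = """
<div class="hreview">
  <span class="item"><span class="fn">Slanted Door</span></span>
  Rating: <span class="rating">4.5</span> out of 5.
  <span class="description">Great spring rolls and a view of the bay.</span>
</div>
"""

class HReviewParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fields = {}
        self._stack = []  # class attribute of each currently open tag

    def handle_starttag(self, tag, attrs):
        self._stack.append(dict(attrs).get("class", ""))

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Only keep text that sits directly inside a recognized field.
        if self._stack and self._stack[-1] in ("fn", "rating", "description"):
            key = self._stack[-1]
            self.fields[key] = self.fields.get(key, "") + data.strip()

parser = HReviewParser()
parser.feed(SAMPLE)
print(parser.fields)  # fn, rating, and description extracted as a dict
```

A crawler that understands the vocabulary can pull the rating and review text out directly, rather than guessing at them from unstructured prose.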

So, for example, consider the snippet for the Yelp review page on the Slanted Door restaurant in San Francisco:


The snippet is enhanced to show the number of reviews and the average star rating, with a snippet actually taken from one of the reviews. By contrast, the Citysearch results for the same restaurant are much less compelling:


(Yelp is one of Google’s partners in the rollout of Rich Snippets; Google hopes that others will follow their lead in using enhanced markup, enabling this feature.)

Rich snippets could be a turning point for the Semantic Web, since, for the first time, they create a powerful economic motivation for semantic markup. Google has told us that rich snippets significantly enhance click-through rates. That means that anyone who has been doing SEO is now going to have to add microformats and RDFa to their toolkit.

Historically, the biggest block to the Semantic Web has been the lack of a killer app that would drive widespread adoption. There was always a bit of a chicken-and-egg problem, in which users would need to do a lot of work to mark up the data for the benefit of others before getting much of a payoff themselves. But as Dan Bricklin remarked so insightfully in his 2000 paper on Napster, The Cornucopia of the Commons, the most powerful online dynamics are released not by appeals to volunteerism, but by self-interest:

What we see here is that increasing the value of the database by adding more information is a natural by-product of using the tool for your own benefit. No altruistic sharing motives need be present…

(Aside: @akumar, this is the answer to your question on Twitter about why, in writing up this announcement, we didn’t make more of Yahoo!’s prior support for microformats in SearchMonkey. You guys did pioneering work, but Google has the market power to actually get people to pay attention.)

What I also find interesting about the announcement is the blurring line between machine learning and semantic markup.

Machine learning isn’t just brute-force analysis of unstructured data. In fact, while Google is famous as a machine-learning company, their initial breakthrough with PageRank was based on the realization that there was hidden metadata in the link structure of the web that could be used to improve search results. It was precisely their departure from previous brute-force methods that gave them some of their initial success. Since then, they have been diligent in developing countless other algorithms based on regular features of the data, and in particular regular associations between data sets that routinely appear together – implied metadata, so to speak.
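The core insight, that link structure is itself metadata, is captured by the textbook PageRank recurrence. The sketch below is a toy power iteration, not Google’s implementation:

```python
# Toy power-iteration sketch of PageRank: a page's score is propagated
# to it from the pages that link to it, with a damping factor mixing in
# a uniform "random surfer" component. Textbook recurrence only.
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:  # dangling page: spread its rank evenly
                for p in pages:
                    new[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new[target] += share
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank(graph)
# "c" is linked to by both "a" and "b", so it ends up with the top score
```

Nothing on any page's content is consulted; the "hidden metadata" of who links to whom does all the work.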

So, for example, people are associated with addresses, with dates, with companies, with other people, with documents, with pictures and videos. Those associations may be made explicitly, via tags or true structured markup, but given a large enough data set, they can be extracted automatically. Jeff Jonas calls this process “context accumulation.” It’s the way that our own brains operate: over time, we make associations between parallel data streams, each of which informs us about the other. Semantic labeling (via language) is only one of many of those data streams. We may see someone and not remember their name; we may remember the name but not the face that goes with it. We might connect the two given the additional information that we met at such and such conference three years ago.

Google is in the business of making these associations, finding pages that are about the same thing, and they use every available handle to help them do it. Seen in this way, SEO is already a kind of semantic markup, in which self-interested humans try to add information to pages to enhance their discoverability and ranking by Google. What the Rich Snippets announcement does is tell webmasters and SEO professionals a new way to add structure to their markup.

The problem with explicit metadata like this is that it’s liable to gaming. But more dangerously, it generally only captures what we already know. By contrast, implicit metadata can surprise us, giving us new insight into the world. Consider Flickr’s maps created by geotagged photos, which show the real boundaries of where people go in cities and what they do there. Here, the metadata may be added explicitly by humans, but it is increasingly added automatically by the camera itself. (The most powerful architecture of participation is one in which data is provided by default, without the user even knowing he or she is doing it.)

Google’s Flu Trends is another great example. By mining its search database (what John Battelle calls “the database of intentions”) for searches about flu symptoms, Google is able to generate maps of likely clusters of infection. Or look at Jer Thorp’s fascinating project announced just the other day, Just Landed: Processing, Twitter, MetaCarta & Hidden Data. Jer built a simulation of the possible spread of swine flu by extracting the string “Just landed in…” from Twitter. Since Twitter profiles include a location, and the object of the phrase above is also likely to be a location, he was able to create the following visualization of travel patterns:
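The extraction step Jer describes can be sketched in a few lines. The tweets below are invented for illustration; a real pipeline would then geocode both the extracted destination and the author’s profile location:

```python
# Sketch: pull the destination out of tweets matching "Just landed in ...".
# The sample tweets are made up for illustration.
import re

PATTERN = re.compile(r"just landed in ([^.!,]+)", re.IGNORECASE)

tweets = [
    "Just landed in San Francisco. Fog as usual!",
    "Just landed in Tokyo, so jetlagged",
    "Nothing to see here",
]

# Keep the captured place name from each tweet that matches the pattern.
destinations = [m.group(1).strip() for t in tweets if (m := PATTERN.search(t))]
print(destinations)  # ['San Francisco', 'Tokyo']
```

The phrase itself acts as implicit structure: nobody tagged these tweets, yet the object of “just landed in” is reliably a place.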

Just Landed – Test Render (4 hrs) from blprnt on Vimeo.

This is where the rubber meets the road of collective intelligence. I’m a big fan of structured markup, but I remain convinced that even more important is to discover new metadata that is produced, as Wallace Stevens so memorably said, “merely in living as and where we live.”

P.S. There’s some small irony that in its first steps towards requesting explicit structured data from webmasters, Google is specifying the vocabularies that can be used for its Rich Snippets rather than mining the structured data formats that already exist on the web. It would be more “googlish” (in the machine learning sense I’ve outlined above) to recognize and use them all, rather than asking webmasters to adopt a new format developed by Google. There’s an interesting debate about this irony over on Ian Davis’ blog. I expect there to be a lot more debate in the weeks to come.

  • Microformats are ‘structured data formats that already exist on the web’. They aren’t ‘a new format defined by Google’, but are defined by looking at existing practices in marking up data online and iteratively converging on agreement, in public.

    Do read the Microformats process document to understand how this works.

  • Not sure if this might be of interest, but we use semantic tools to evaluate tweets, and then further use semantic tools to build structured data from the pages referenced by each tweet and associate it back. This allows us to get past signal/noise issues and to sort through millions of tweets to identify and index the ones that are of interest to visitors of our site.


  • I’ve wondered for a while if anyone has approached a semantic web application based on parsing prepositional phrases. Take a sentence like “Tim works at O’Reilly Media.” Break it into the prepositional phrase, “works at”, the subject, Tim, and the object, O’Reilly Media. If you could parse any of the 156 or so prepositional phrases, then line up subjects and objects, wouldn’t you be able to present some type of semantic understanding? It may not be as scientific as Wolfram or whatever it’s called, but I bet you could provide an interface that allowed for some interesting search and results.
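The triple extraction the comment above describes can be sketched with a short pattern matcher. The verb-plus-preposition list below is a hypothetical stand-in for the “156 or so” phrases mentioned, not a real parser:

```python
# Sketch: split "<subject> <verb+preposition> <object>" sentences into
# (subject, relation, object) triples using a small hand-picked phrase
# list. A tiny illustration of the idea, not a linguistic parser.
import re

PHRASES = ["works at", "lives in", "studied at", "writes for"]
PATTERN = re.compile(
    r"^(?P<subject>.+?)\s+(?P<relation>" + "|".join(PHRASES) + r")\s+(?P<object>.+?)\.?$"
)

def extract_triple(sentence):
    """Return (subject, relation, object) if a known phrase matches, else None."""
    m = PATTERN.match(sentence)
    return (m.group("subject"), m.group("relation"), m.group("object")) if m else None

print(extract_triple("Tim works at O'Reilly Media."))
# ('Tim', 'works at', "O'Reilly Media")
```

Even this crude matching yields subject–relation–object triples of the kind RDF is designed to carry, which is roughly the comment’s point.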

  • Two things I think are interesting about this:

    First, the distinction between explicit and implicit metadata is distorted by proximity to Google’s powerful gravitational field – whatever they do, no matter how small, becomes explicit through tweaks and optimizations webmasters will make to stay on good terms with the only search engine that matters.

    Second, the Semantic Web is a constantly receding goal. Like Strong AI it’s almost by definition unreachable, and small movements like this one will be dismissed by semweb people like the “damp squib” discussion you link to. The biggest block to the Semantic Web is not the lack of a killer app, it’s the rejection of incremental steps, “no that’s not what we meant” etc. The web will get more semantic, but the Semantic Web will never be allowed to happen.

  • Justin Sivey

    Excellent post. Thank you. As one who comes from a Library Science background, it’s encouraging to see the use of such “tools” as RDF outside of the traditional library environment. In fact, the use of controlled vocabularies and ontology languages such as the Web Ontology Language (OWL) is another example of how traditional information organization tools and techniques are being applied to new technological and information discovery challenges. However, all of these serve the same purpose of helping people find the right information at the right time, a concern that has occupied librarians for centuries.

    Twitter: @justinsivey

  • I agree that much of what is interesting is what is discovered through participation. But, pragmatically, the more extensive use of micro-formats might be a boon for small business websites. This structured data could become the ‘definitive reference’ for simple things like contact information and location.

  • Great thoughts, Tim. The web is open by not being structured. Discovery end-points such as search engines should keep it that way by not imposing “structure” on discovery. Keyword search is already a limiting paradigm for discovering information, since you can never discover content that you can’t define (in a keyword); now, by hiding information that’s not structured, you will never know what you would have discovered accidentally.

    Information is intended to be unstructured.

  • bowerbird

    google seems to be getting lazy. :+)

    oh, it’s so _hard_ to process your stuff, could you please
    “structure” it so our machines can understand it better?

    what a bunch of whiners! ;+)

    go back to work and show some _innovation_ again…


    p.s. google search: “irving wladawsky-berger” “uima”

  • What’s the business model for privacy advocates?

  • People have no idea the level this is done at. It makes search-engine data retention look very tame. Also, Google is reselling my info to 1000s of other companies.

  • Sean: Can you back up that claim about Google with any references?

  • Link is pointing to “Google’s Rich Snippets and the Semantic Web”

  • Joe Sulewski

    I don’t see this as an either/or argument. People seem to have this natural tendency to apply mutual exclusion to arguments when no such exclusion should exist. I see the microformats as a way of priming or validating machine analysis. Even humans need a “given” to jump start a logical thought process.

    With the knowledge that the data contains location, review, or some other basic information, perhaps the machine interpretation can make a more logical assumption about the data it is analyzing. Therefore, the two approaches (microformats vs. machine interpretation) work together in harmony, each compensating for the other’s shortcomings.

    As an example, knowing that a page contains user reviews of a product, the machine analysis can better assume the page is not a marketing brochure for that product. This allows the data to be categorized differently than if it didn’t have this known or “given” information.

  • “The web will get more semantic, but the Semantic Web will never be allowed to happen”

    And that’s perfectly OK. “The Semantic Web” is just a label. If we see incremental adoption of RDF, OWL, RDF/S, RDFa and SPARQL, then that’s just fine. We don’t need a “big bang” deployment of Semantic Web technologies for them to be useful.

  • Falafulu Fisi

    There has been integration of machine learning with the Semantic Web over recent years. The solution to the information retrieval problems of the future is not going to come from one domain only; it will come from various disciplines, exactly as Joe Sulewski said above: expect no mutual exclusion in the future.

  • Falafulu Fisi

    Here is one of a few Semantic Web journal publications that contain a section on Inductive Reasoning and Machine Learning, so there is definitely a merging of the two disciplines for various web applications.

  • Very interesting discussion!

    I agree with Joe Sulewski. I see no real dichotomy here. Human languages may be ambiguous and (sometimes overly) dependent on context, but they are structured. With markup techniques we have refined this structuring and enabled precision to be added when appropriate. Such as links.

    If we followed the argument of “no structure” all the way, we might end up saying “don’t link to anything at all; machines can find relevant articles based on (what they deem are) interesting snippets in your text”. While that may be true to a point, it’s likely to produce either a rate of information refinement on par with evolution in the primordial soup, or frightfully biased linking. If the slicing and dicing of published content is to gain speed, we should use precise data/metadata as leverage to limit speculation. This is what “semantic markup”, such as RDF, is designed for.

    As for cameras adding geodata automatically, someone did program them to do that. And that information is very much explicit. There is no reason why programs (doing machine learning) cannot assist a content creator in adding named relationships or capturing embedded resource properties (think of spelling programs).

    My point is that machine learning and related techniques may benefit us more if they are located closer to the authors of information, giving them fast feedback on interpretation and aiding in creating precise expressions. This enhancement would facilitate processing (searching, filtering, aggregating) further down the line.

    (Of course, ambiguity is sometimes needed. Just think of the social or diplomatic contexts where the art of communication is very complex and nuanced. But that has little to do with the sharing and reuse of simpler “prima facie” resource descriptions. As always, there is need for pragmatism.)

  • Despite the debate and crying wolf from the Semantic standards perfectionists, I tend to agree with Google’s siding with the practical side of things. RDFa and microformats are the right way to jump-start it. RDF/OWL was too big a jump from current practices. User acceptance and the state of practice typically win over standards, especially if standards take a long time to find their way into useful commercializations. Eventually, we’ll get to web-wide RDF utilization.

  • Thanks again for yet another quality post with a lot of valuable information. It’s not surprising to see Google lead here on standards. I’m very much interested in the latest developments.