
Google Engineering Explains Microformat Support in Searches

Running time: 18:24


Today, Google is releasing support for parsing and display of microformat data in their search results. While the initial launch will be limited to a specific set of partners (including LinkedIn, Yelp and CNET reviews), the intent is that very quickly, anyone who marks their pages up with the appropriate microformat data will be able to make their information understandable by Google. This technology would allow you to explicitly search, for example, for only printers that had an average customer review of 3 stars or higher. Initial support will include things such as:

  • Review Ratings
  • Product Prices
  • Personal Details

We talked this morning with Othar Hansson and RV Guha, two of the Google engineers responsible for the new functionality, and you can listen to them discuss it in this exclusive O’Reilly interview.

JAMES TURNER: Why don’t you guys start by introducing yourselves?

OTHAR HANSSON: Sure. I’m Othar Hansson, and I’m a tech lead on this project. And I’m in Google’s Search UI Group.

RV GUHA: My name is Guha. I’m an engineer at Google and I do stuff across the board.

JT: So can you describe briefly, to start off, exactly what it is you’re releasing today?

RVG: Okay. We are asking webmasters who have pieces of data like reviews or people profiles and, in an experimental form, things like information about organizations and products, to put the structured data representing the content of the webpage on the webpage in a machine-understandable form. Typically, what happens is this: take a website like Epinions (having created Epinions, I can talk in the context of Epinions). You would typically have a database in the back-end which has lots of information about products. People write reviews about them. And you get information such as the number of reviews, the average rating of the reviews, the price of the product, who sells it, et cetera. It’s stored in a structured database in your back-end. You then use some scripts to format it into HTML as per the site’s design. Now going from the structured data to the HTML is quite straightforward. But going from the HTML back to the structured data in a fashion which works across sites is very, very, very hard. It’s very difficult for a search engine to get back the structured data for all of the sites. Now if it were to understand that this is a review site where the product being reviewed is such and such and it has 30 reviews with an average rating of 3.2 and so on and so forth, we could do a better job of the search. In particular, we could do a better job of presenting the two or three lines of text that appear as part of the search result so that the user has a better idea of what to expect on that page. And from our experiments, it seems that giving the user a better idea of what to expect on the page increases the click-through rate on the search results. So if the webmasters do this, it’s really good for them. They get more traffic. It’s good for users because they have a better idea of what to expect on the page. And, overall, it’s good for the web.
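[Editor’s illustration] The round trip Guha describes can be sketched in a few lines of Python. This is a minimal sketch, not Google’s parser: the record fields are hypothetical, and the class names (fn, average, count) are only loosely modeled on the hReview-aggregate microformat. Rendering the record to HTML is trivial, and reading it back is feasible here only because the markup carries explicit class hints.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical back-end record; the field names are illustrative only.
record = {"name": "Acme LaserJet 100", "average": "3.2", "count": "30"}


def render(rec):
    """Structured data -> HTML: the easy direction."""
    safe = {key: escape(value) for key, value in rec.items()}
    return (
        '<div class="hreview-aggregate">'
        '<span class="fn">{name}</span> rated '
        '<span class="average">{average}</span> across '
        '<span class="count">{count}</span> reviews'
        "</div>"
    ).format(**safe)


class ClassExtractor(HTMLParser):
    """HTML -> structured data, relying on the class hints in the markup."""

    WANTED = {"fn", "average", "count"}  # assumed property names

    def __init__(self):
        super().__init__()
        self.current = None  # property whose text we are currently inside
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        hits = self.WANTED.intersection(classes)
        self.current = hits.pop() if hits else None

    def handle_endtag(self, tag):
        self.current = None

    def handle_data(self, data):
        if self.current:
            self.fields[self.current] = data.strip()


def extract(html):
    parser = ClassExtractor()
    parser.feed(html)
    return parser.fields


html = render(record)
recovered = extract(html)
```

Without the class attributes, the extractor would have to guess which number is the rating and which is the review count, and that guess would have to be redone for every site design.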

JT: So in some ways, that’s in the same way that right now for certain sites, you’ll give the internal structure of the site as part of the search result or for shopping results, you’ll give price ranges and things like this. This is just, again, enriching and providing more structured — more than just a snippet, giving more of a structured display of the information on that page?

RVG: Yes. If we have structured data, we can do lots of things. We’re starting off by improving the snippets. It’s an absolute no-brainer. It seems to be helping everybody. And, as you know us, we keep playing around with different ideas and different things. As structured data becomes more prevalent, there’s a ton of ideas, both inside Google and outside Google, on how you might improve search.

JT: Right. So, for example, you might be able to — if you were looking for reviews, you could say where the review value is greater than three stars.

RVG: Absolutely. Or take another example: there are lots of John Smiths in the world. And, my advisor’s favorite example, there are lots of McCarthys in the world. If you search for McCarthy, the pages that you get are often pages about John McCarthy, the computer scientist, mixed up with pages about McCarthy who was a politician in the ’50s and ’60s. Now if the pages about John McCarthy and the other McCarthy were somehow, using something like RDFa, to specify “I’m about that McCarthy whose homepage is at cs.stanford.edu/jmc,” then as part of the search, we could come back and say, “Which John McCarthy did you mean?” And you can say, “Oh, yeah, I meant the guy from Stanford.” And that makes life a lot easier for users.
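[Editor’s illustration] The disambiguation idea can be sketched as grouping search hits by the subject each page declares. This is a minimal sketch under stated assumptions: the hit list, the "about" key, and every example.org URL are hypothetical, and the step that would actually read the identifier out of RDFa markup is omitted. Only the cs.stanford.edu/jmc homepage comes from the interview.

```python
from collections import defaultdict

JMC = "http://cs.stanford.edu/jmc"  # homepage mentioned in the interview
OTHER = "http://example.org/people/other-mccarthy"  # hypothetical identifier

# Hypothetical search hits. "about" stands in for a subject identifier that a
# page might declare through markup such as RDFa; one page declares nothing.
hits = [
    {"url": "http://example.org/lisp-history", "about": JMC},
    {"url": "http://example.org/senate-hearings", "about": OTHER},
    {"url": "http://example.org/ai-pioneers", "about": JMC},
    {"url": "http://example.org/mccarthy-surname"},
]


def group_by_subject(hits):
    """Bucket result URLs by declared subject; None collects unmarked pages."""
    groups = defaultdict(list)
    for hit in hits:
        groups[hit.get("about")].append(hit["url"])
    return dict(groups)


groups = group_by_subject(hits)

# The distinct declared subjects are the options a "Which McCarthy did you
# mean?" prompt could offer.
choices = [subject for subject in groups if subject is not None]
```

Pages with no declared subject fall into the None bucket, which is where every result sits today when the search engine only has free text to go on.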

JT: Right. So, again, part of the idea would be that if you were doing eventually a search, you could do more like a SQL parameterized search rather than a free text search?

RVG: But the users will never see SQL.

JT: Right. But that same idea of parameter value pairs.

RVG: You get more precision with structured data queries. It’s not just SQL. There are other languages as well. But, in general, you get more precision with a structured data query. You get to slice and dice the data in many different ways. If we had a better understanding of the data that is contained in the pages, we would be able to provide these kinds of services to our users.

JT: Now some of the initial things I’ve seen about this talk a lot about microformats. What is the range of microformats that you believe that you will eventually be looking at here?

RVG: So there’s this big issue of microformats versus RDFa and things like that. When we discussed it inside the company, we went round and round and round and round. There were people who wanted to support microformats. There were people who wanted to support RDFa. There were people who thought that we should come up with our own format. In the end, we decided that we have to support multiple encodings and that the real issue, if you will, or one of the big issues, is not so much the format that is used but the vocabulary. It’s really important that everybody, as far as possible, use the same vocabulary. So Google is essentially going to be making an investment in hosting a vocabulary that many Google services will use. It’s not just Google.com search; Custom Search Engine is also going to be using this. We’re hopeful that other web services from other companies will also use this vocabulary. And we’re not going to do this all by ourselves. As it is, we are drawing from several sources. We’re drawing from microformats. We’re drawing from vCard. And there are other places that you will see. And there are other people who know more about their topics than we could possibly know. And we’ll draw on all of these things. So to come back and answer your question, we hope that the scope of this will be substantially more than the scope of all the particular data types that are covered today by microformats.

JT: If you take this to its logical conclusion, you would think that everywhere you had a location, you would use a microformat for a location so you could search on geographic information. Every place you had a person’s name, you would embed a microformat with that. It strikes me that for the website author, until there are tools available to really help them do that, that could be quite a workload.

RVG: We agree. So we’re hoping the tools will evolve. There are already tools that are evolving, that are beginning to support some of these formats like RDFa. We’re hoping there’s an 80/20 rule over here, which is that by doing a little work, you can get a lot of the bang for the buck. And I mean there are already thousands of sites which support the structure called [inaudible]. Even Whitehouse.gov supports RDFa already. And so we are expecting that, while this will be a little bit more work, it will not be too much more work. And for websites that are driven by databases, it’s a one-time change: mapping their schema into the vocabulary and just outputting it and then adding it to their –

OH: And on the side of small sites, I mean if you consider the Google Custom Search angle, imagine a small site that sells jewelry and they have 30 different jewelry designs. There’s probably no database developer associated with this company. So the easiest way for them to actually build a database is to output their HTML with the structured data in it and then let Google or other search engines do the sort of SQL-like queries that you mentioned. You know, “I want to look for necklaces that have gold and are $200 or less.” It’s actually easier to mark up the data and then outsource the slicing and dicing to search engines as opposed to having to build up all of that infrastructure in a small setting. So it’s work, yes. But, on the other hand, it potentially replaces work that’s either not getting done or is even more difficult for a small company.
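[Editor’s illustration] Hansson’s necklace query can be sketched as a filter over items that a search engine has already extracted from marked-up pages. This is a minimal sketch: the catalog, the field names, and the find helper are all hypothetical, and nothing here reflects how Google or Custom Search actually exposes such queries.

```python
# Hypothetical items, as a search engine might hold them after reading the
# structured data out of a small jewelry site's pages.
items = [
    {"name": "Rope necklace", "type": "necklace",
     "materials": {"gold"}, "price_usd": 180.0},
    {"name": "Pearl necklace", "type": "necklace",
     "materials": {"pearl", "silver"}, "price_usd": 150.0},
    {"name": "Signet ring", "type": "ring",
     "materials": {"gold"}, "price_usd": 120.0},
    {"name": "Chain necklace", "type": "necklace",
     "materials": {"gold"}, "price_usd": 450.0},
]


def find(items, item_type, material, max_price_usd):
    """Return names of items matching type, material, and a price ceiling."""
    return [
        item["name"]
        for item in items
        if item["type"] == item_type
        and material in item["materials"]
        and item["price_usd"] <= max_price_usd
    ]


# "Necklaces that have gold and are $200 or less."
matches = find(items, "necklace", "gold", 200.0)
```

The same pattern covers the earlier example of printers with an average review of three stars or higher: once the fields are explicit, the query is a simple comparison rather than a text match.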

JT: You’ve described the advantages that will be there for content producers in terms of higher click-through rates by people who see more precise and helpful data. And you can kind of conceptually understand what the advantages would be to the user, but in terms of the experience, what is it going to look like to someone doing a Google — would it leap off at them that it’s a different experience than it was?

OH: No. [Laughter] It will not be popups.

JT: Yeah. I didn’t mean leap off like that; I meant will really catch their eye.

OH: I think it’s noticeable enough. I mean one of the design challenges is to not create an arms race among the search results over which one can be flashiest, because that actually hurts the user. It makes it harder to scan the page and harder to figure out which result they want to click on. So our design is fairly subtle.

RVG: It’s not so subtle that you don’t notice it. It’s there. It doesn’t detract from the experience. And, as you know, we do lots and lots of experiments. From our experiments, it shows that it makes a substantial difference.

OH: Yeah.

JT: Now you’re rolling out with a few initial partners. And I’d like it if you could mention who they are and what the best ways that a user could see this quickly, but also, for the rest of the world, if I were to mark up my site today, how likely is it that it’s going to start to show up in this new form in Google any time soon? And, also, just to totally overload this question, a site that has this data, it may be much more either ephemeral or frequently updated data, are you going to make any effort to crawl those sites a little more aggressively?

OH: So anyway, so you should check the press materials later today for the official list of partners. But they include Yelp and CNET and LinkedIn, which are reasonably easy to come up with a query that will show data for them. And it’s reviews on CNET. It’s not all the content on CNET. How quickly will sites get in here? Well, so we’re starting this off with reviews and people, primarily. So a site that has that kind of content, it’s going to be a lot easier for them to get involved. We have a form for them to express interest. We’re then going to take a look and see — work with them, I guess, to mark up their data.

RVG: At a higher level, in fairly short order, we would like it to be that there is no set of partners. You don’t have to work with us. You don’t have to tell us. You just have to follow the basic guidelines that Google already specifies, like no cloaking. That is, you tell the crawler exactly the same data that you would give the user. If you follow these kinds of rules and you’re not spam, you should show up. However, we’re being kind of careful in rolling this out because we know that with pretty much anything we do, there are, unfortunately, elements on the web who will try to figure out an angle to exploit it. But you shouldn’t need to work with us to be a participant in the near future.

OH: Yeah. So the initial phase is a little different because of that concern, but in the end, we’re basically going to notice this on websites and then start.

RVG: Yeah. And there’s another interesting aspect, the CSE angle, which is that with a Custom Search Engine, you use Google’s infrastructure to run a search engine on your site. And there are several prominent websites like the New York Times and About.com who use Custom Search to power their homepage search. Some of these people have wanted to extend the functionality for a long time, except they want to be able to define their own vocabulary, their own X, Y and Z. So clearly, we cannot be gatekeepers of the vocabulary. We cannot expect microformats or anybody else to be gatekeepers of the vocabulary. Let me give you an example. If you go to unitedway.org’s homepage, that is, liveunited.org, they actually use a Custom Search Engine on their homepage. They have all kinds of things that they probably want to mark up there. And it’s searchable with the United Way organizations. There are about 900 United Way organizations across the world for the different areas and so on and so forth. Neither we nor microformats.org nor any one organization can be expected to come up with the vocabulary for all domains. So they should, as users of Custom Search Engine, be able to mark up their pages with a structured data vocabulary that they come up with, hopefully in collaboration with that community and maybe even in collaboration with us, and use that to augment and improve the search experience on their own site. What this means is that there’s a huge number, hundreds of thousands of sites, now using Custom Search Engine. And these guys can all go and start marking up their pages and realize the benefits of this mark-up. And then our crawler will go pick up this mark-up and see which vocabularies are getting traction. And then we can work with those communities to incorporate them back into something that is recognized and used on Google.com itself.
So what we’re trying to do is create an ecosystem where there is demand for the structured data in all steps of the way. We’re not going and telling people, “You should do this because it’s a good thing to do and it’s motherhood and apple pie.” There is demand for it. You do it; you get a bang for the buck. Not just that, and that’s not the end of it. You’ll see which ones are getting traction and then ratchet it up one level by giving recognition to those on Google.com.
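[Editor’s illustration] One simple way to picture “which vocabularies are getting traction” is to count how many distinct sites use each term. This is a minimal sketch of that idea only: the crawl observations, the site names, the property names, and the two-site threshold are all hypothetical, and the interview does not say how Google measures traction.

```python
from collections import defaultdict

# Hypothetical crawl output: (site, property name) pairs seen in page markup.
observations = [
    ("site-a.example", "chapter"),
    ("site-a.example", "region"),
    ("site-b.example", "chapter"),
    ("site-c.example", "material"),
    ("site-a.example", "chapter"),  # repeat use on one site counts once
]


def sites_per_term(observations):
    """Map each vocabulary term to the number of distinct sites using it."""
    sites = defaultdict(set)
    for site, term in observations:
        sites[term].add(site)
    return {term: len(found) for term, found in sites.items()}


def terms_with_traction(observations, min_sites=2):
    """Terms used by at least min_sites distinct sites (assumed threshold)."""
    counts = sites_per_term(observations)
    return sorted(term for term, n in counts.items() if n >= min_sites)


counts = sites_per_term(observations)
popular = terms_with_traction(observations)
```

Counting distinct sites rather than raw occurrences keeps one large site from making its private vocabulary look widely adopted.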

JT: That leads to one other question I had which is that some of the places that come to mind immediately that it would be really nice to have this kind of data, especially review data, are things, for example, like Amazon which has a very extensive review database. Or, for example, if I was searching and I got eBay results, I’d love to have the seller rating of the seller in there. Those guys obviously have their own agendas. Do you think we’re going to see that kind of data available?

RVG: I think you’re going to have different partners with different kinds of philosophies. Let me actually give you an anecdote from about ten years ago. When Eckart Walther and I were doing RSS at Netscape, initially most of the big publishers said, “There’s no way on earth we’re going to give you our headlines, because that’s the most valuable stuff. We want users to come to our own webpage and use it.” But some publishers said, “Fine, we’ll give it to you.” Then about two or three weeks after we launched, we had some very interesting partners. We had the Mormon Church provide its first RSS feed. We had small sites like Cricketforum.com, which is a cricket site, provide an RSS feed. And these people said, “Look, people don’t set us as a homepage. If we can somehow get more traffic, more attention by giving away some of our data, that would be good.” Now, of course, you have everybody providing RSS feeds and realizing that doing so actually encourages people to come to their site. So I don’t know what Amazon and eBay in particular are going to do. But I do believe that there will be some people who are more advanced in their thinking and understand what’s happened. And there are some people who will be less so.

JT: All right, guys. I think that will do it. Thank you so much for taking the time.

RVG: You’re welcome.

OH: Sure.

  • http://basiscraft.com Thomas Lord

    It is sad that Google did not lead off its RDFa support by including ccREL data. It is useful to be able to search for content whose meta-data indicates that it is licensed in specific ways. RDFa was inspired and invented largely in hopes of getting people to use ccREL.

    -t

  • http://pigsonthewing.org.uk Andy Mabbett

    Google are only partially recognising certain microformats (so no phone numbers, e-mail or postal boxes, for instance); have broken microformat code in their documentation; and encourage people not to use them (“adding the name of an unknown reviewer will dilute the effect of the Google snippet and could make the page appear less relevant” on http://google.com/support/webmasters/bin/answer.py?hl=en&answer=99170). I think they may be doing more harm than good.

  • George Orwell

    This way, Google can extract the data they want, insert it on their SERP, and cut out those pesky websites from interfering with their ad traffic.

  • http://xavierbadosa.com Xavier Badosa

    Google is moving from searching to previewing.

  • http://xavierbadosa.com Xavier Badosa

    And previewing is a form of publishing.

  • http://bexhuff.com bex

    I have to ask…

    If this is only being rolled out to a limited number of partners… and it is very limited in functionality, and Google has control of both ends of the pipe…

    Aren’t there EASIER solutions than RDFa?

    This feels like standards-for-standards sake. If Google eventually opens it up for EVERYONE — highly unlikely — then a standard makes sense.

    Otherwise, I’d guarantee that Google engineers could come up with something 100x better than RDFa.

  • bowerbird

    google seems to be getting lazy. :+)

    oh, it’s so _hard_ to process your stuff, could you please
    “structure” it so our machines can understand it better?

    oh wait — excuse me! it’s not just _hard_, it’s:
    > very, very, very hard

    what a bunch of whiners! ;+)

    go back to work and show some _innovation_ again…

    -bowerbird

    p.s. google search: “irving wladawsky-berger” “uima”

  • http://www.netultimate.com netultimate

    Excellent article – I really appreciate your knowledge about some Google microformat support in searches, I have bookmarked it for later viewing and forwarded it on.

  • konteyner

    Nice post, thanks. Keep up the good work.