
Mar 24

Tim O'Reilly

Greg Linden on Google's Deep Web Strategy

Peter Brantley writes in email: "Greg Linden speculates that Google has sworn off any effort to adopt federated search to facilitate indexing of the deep web, due to fundamental concerns with performance and reach. Instead it 'surfaces' web content to normalize it through its indexing heuristics. The result potentially has legal ramifications (see the referenced comment by Eric Goldman)."

Peter quotes some key bits from Greg's post:

"Google instead prefers a "surfacing" approach which, put simply, is making a local copy of the deep web on Google's cluster.

"Not only does this provide Google the performance and scalability necessary to use the data in their web search, but also it allows them to easily compare the data with other data sources and transform the data (e.g. to eliminate inconsistencies and duplicates, determine the reliability of a data source, simplify the schema or remap the data to an alternative schema, reindex the data to support faster queries for their application, etc.).

"If Google is abandoning federated search, it may also have implications for APIs and mashups in general. After all, many of the reasons given by the Google authors for preferring copying the data over accessing it in real-time apply to all APIs, not just OpenSearch APIs and search forms. The lack of uptime and performance guarantees, in particular, are serious problems for any large scale effort to build a real application on top of APIs.

"Lastly, as law professor Eric Goldman commented, the surfacing approach to the deep web may be the better technical solution, but it has the potential to run into legal issues. Copying entire databases may be pushing the envelope on what is allowed under current copyright law. While Google is known for pushing the envelope, another legal challenge may not be what it needs right now."
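The transformations Greg describes (copying data into a local store, eliminating inconsistencies and duplicates, remapping to a common schema) amount to a normalize-and-deduplicate pipeline. Here is a toy sketch of that idea; the names and sample records are entirely hypothetical illustrations, not Google's actual system:

```python
# Toy "surfacing" pipeline: copy records from a deep-web source into a
# local store, normalize values, remap to a common schema, and drop
# duplicates. Sample data and field names are made up for illustration.

SOURCE_RECORDS = [  # stand-in for rows pulled from a deep-web form or API
    {"Title": "Deep Web Survey ", "year": "2007", "author": "He et al."},
    {"Title": "deep web survey", "year": "2007", "author": "He et al."},
    {"Title": "Structured Data on the Web", "year": "2006", "author": "Cafarella"},
]

def normalize(record):
    """Remap a source record onto a common schema with cleaned values."""
    return {
        "title": record["Title"].strip().lower(),
        "year": int(record["year"]),
        "author": record["author"].strip(),
    }

def surface(records):
    """Normalize records and eliminate duplicates, keyed on schema fields."""
    seen, local_copy = set(), []
    for rec in records:
        norm = normalize(rec)
        key = (norm["title"], norm["year"])
        if key not in seen:
            seen.add(key)
            local_copy.append(norm)
    return local_copy

surfaced = surface(SOURCE_RECORDS)
print(len(surfaced))  # the two variant spellings collapse to one record: 2
```

The point of the sketch is that once the data is copied locally, cleanup and schema remapping are cheap batch operations, which is exactly the advantage Greg attributes to surfacing over querying sources live.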

We should be clear that this is speculation on Greg's part, but it is based on technical papers from Google, which Greg referenced in his earlier post, "Google and the deep web."

BTW, Greg's posts should be of great interest to those who are following the Metaweb/Semantic Web discussion on this blog as well as those interested in Google's interoperability strategy.

John   [03.25.07 04:19 AM]

It just sounds like massive-scale caching to me.

If it's illegal for Google, then wouldn't it be illegal for anyone who points a browser at a public web page?

Seth   [03.26.07 05:31 AM]

Not necessarily. Since Google's cache is (at least somewhat) publicly accessible, it could be considered public redistribution of an author's works without any compensation...

On the other side of things, however, Google could be considered a large reference work, making the reproduction "fair use". Getting in the way of this is the fact that Google operates for profit.
