Peter Brantley writes in email: “greg linden speculates that google has sworn off any effort to adopt federated search to facilitate indexing of the deep web due to fundamental concerns with performance and reach. instead it “surfaces” web content to normalize it through its indexing heuristics. the result potentially has legal ramifications (see referenced comment by eric goldman).”
Peter quotes some key bits from Greg’s post:
“Google instead prefers a “surfacing” approach which, put simply, is making a local copy of the deep web on Google’s cluster.
“Not only does this provide Google the performance and scalability necessary to use the data in their web search, but also it allows them to easily compare the data with other data sources and transform the data (e.g. to eliminate inconsistencies and duplicates, determine the reliability of a data source, simplify the schema or remap the data to an alternative schema, reindex the data to support faster queries for their application, etc.).
“If Google is abandoning federated search, it may also have implications for APIs and mashups in general. After all, many of the reasons given by the Google authors for preferring copying the data over accessing it in real-time apply to all APIs, not just OpenSearch APIs and search forms. The lack of uptime and performance guarantees, in particular, are serious problems for any large scale effort to build a real application on top of APIs.
“Lastly, as law professor Eric Goldman commented, the surfacing approach to the deep web may be the better technical solution, but it does have the potential of running into legal issues. Copying entire databases may be pushing the envelope on what is allowed under current copyright law. While Google is known for pushing the envelope, yet another legal challenge may not be what they need right now.”
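To make the distinction in Greg's excerpt concrete, here is a minimal Python sketch of the two approaches. It is an illustration only, not Google's implementation: `Source`, `federated_search`, `surface`, and `local_search` are hypothetical names, the "deep web" sources are in-memory lists, and an empty query stands in for probing a search form.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """Hypothetical deep-web database reachable only through a search form."""
    name: str
    records: list
    available: bool = True

    def search(self, term):
        if not self.available:
            raise ConnectionError(f"{self.name} is down")
        return [r for r in self.records if term in r["title"].lower()]


def federated_search(sources, term):
    """Query every source at request time; results depend on each source's uptime."""
    hits = []
    for src in sources:
        try:
            hits.extend(src.search(term))
        except ConnectionError:
            continue  # an unavailable source drops out of the results
    return hits


def surface(sources):
    """Copy records into a local index ahead of time, normalizing and de-duplicating."""
    index = {}
    for src in sources:
        for rec in src.search(""):  # assumption: an empty query returns everything
            key = rec["title"].strip().lower()  # normalize so duplicates collapse
            entry = index.setdefault(key, {"title": rec["title"].strip(), "sources": []})
            entry["sources"].append(src.name)
    return index


def local_search(index, term):
    """Answer queries from the local copy; no remote source is contacted."""
    return [entry for key, entry in index.items() if term in key]


a = Source("a", [{"title": "Deep Web Survey"}, {"title": "Query Forms"}])
b = Source("b", [{"title": "deep web survey "}, {"title": "Deep Crawling"}])

index = surface([a, b])   # copy made while both sources are reachable
b.available = False       # later, one source goes down

live = federated_search([a, b], "deep")
local = local_search(index, "deep")
```

In this toy setup the federated query loses everything held by the unavailable source, while the local index still answers and has already merged the duplicate record. It also shows why the approach raises the legal question Goldman mentions: the local index is a copy of the sources' data.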
We should be clear that this is speculation on Greg’s part. But it is based on some technical papers from Google, which Greg referenced in his earlier post, “Google and the deep web.”
BTW, Greg’s posts should be of great interest to those who are following the Metaweb/Semantic Web discussion on this blog as well as those interested in Google’s interoperability strategy.