Books Working with the Web

Almost a year ago, Tim O’Reilly wrote, “Search engines should be switchboards, not repositories” in his blog post, “Book Search should work like Web Search.” The premise was that search engines should not duplicate the digital book repositories of publishers or other service providers, but should instead direct traffic to them. As Tim said, “Don’t fight the internet.”

In New York this past week, one of the many publishers that I spoke with was HarperCollins. Unique among the trade publishers, HarperCollins has decided to own the browsing and user experience for their digital books. Not only do they maintain an active and growing repository (based on LibreDigital), but they also want Google to redirect users who encounter Harper books through the Google Book Search discovery interface over to HarperCollins’ repository. Google has maintained, as the purveyor of Book Search, that they have sole latitude in determining when a destination “user experience” is good enough, in terms of response time and functionality, to make this switch to an external site.

That is not how the web works. Despite all their claims for openness, within the Book Search product, Google is creating a walled garden. Ultimately, if HarperCollins generates a poor user experience, then that is Harper’s problem, not Google’s.

Google has developed a fledgling specification, called “BookMap,” which aids the harvesting of repositories containing digital books. One of the intents of BookMap is to permit the harvesting and indexing of whatever material the publisher deems appropriate for exposure (on their terms) to a search discovery interface, with the question of where the user experience should reside treated as a separate consideration. As far as I know, BookMap has still not been released as a published specification, although it is in use.

There are good reasons for publishers to control the user experience. Ultimately, it is their content and their property, and they have the right to determine how readers experience it. Music producers gave up service delivery, and they have turned into backend providers of content to services that actually provide the user experience; in short, they are no longer truly publishers.

Google’s response to this may be that they are delivering the user to content, and since that is done through their interface, a publisher that is unhappy with that service does not need to provide its content for delivery via Google. I do not believe that is a fair response; it is not how the web works. If a web site is not happy with how Google provides a discovery experience for its content, then it is free to prohibit harvesting through robots.txt; but Google should not exclude a site from harvesting because Google is unhappy with the service delivered by that web site.
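For readers unfamiliar with the mechanism, here is a minimal sketch of how a site can opt out of harvesting through robots.txt, checked with Python's standard `urllib.robotparser` module. The robots.txt contents, the `publisher.example` URLs, the `ExampleBookBot` crawler name, and the `may_harvest` helper are all hypothetical, invented for illustration; they do not describe any real publisher's or search engine's configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an illustrative publisher site. The paths and
# the "ExampleBookBot" user-agent name are assumptions, not real values.
ROBOTS_TXT = """\
User-agent: ExampleBookBot
Disallow: /books/

User-agent: *
Allow: /
"""


def may_harvest(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    book_url = "https://publisher.example/books/novum-organum"
    for agent in ("ExampleBookBot", "OtherBot"):
        print(agent, may_harvest(agent, book_url))
```

The point of the sketch is the direction of control: the site that owns the content states who may harvest it, and a well-behaved crawler checks before fetching.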

If we compare the services that Google and the Open Content Alliance (OCA) provide for a public domain text, it is difficult to argue against the proposition that the user experience of OCA’s OpenLibrary is superior. Let’s take, for example, a copy of Bacon’s Novum Organum; OCA’s copy is from the University of California, Berkeley library, and Google’s is from Stanford University’s library.

Both OCA and Google permit a download of this public domain work as a PDF, and both provide a pleasant online browsing experience. Both render on screen a raw text version based on the OCR derived from the page images. Nonetheless, even setting aside questions of image quality and OCR fidelity, OpenLibrary provides several services that Google does not. It offers access in multiple formats: DjVu, FlipBook, black-and-white and color PDF, and plain text. It provides text-to-speech capability through the FlipBook presentation. OCA also displays critical metadata, including known rights information, on each book’s profile page. Finally, OCA permits access to its content in bulk, to the extent its own contracts permit, for purposes of research and education, such as text mining analyses.

I have discussed the online book viewing user experience with Brewster Kahle, and he agrees with Tim O’Reilly: book search should work like web search. A search engine should serve as a switchboard, and not as the sole delivery platform for the content. In other words, a search engine must be an open delivery platform, and not a walled garden. For Google to wear the mantle of open protocols for the social web, but to discard them for books, is hypocrisy.

OCA permits and encourages Google and others to harvest the metadata and full text of its books through current crawling procedures, as well as nascent protocols (such as BookMap), to facilitate discovery through all search platforms. Some of OCA’s contributors have expressly reserved the right to keep Google from re-hosting materials in the Google Book Search application platform, even as those materials remain fully available to the public at OCA; for these works, the browsing reader must use the OCA site.

OCA encourages Google to redirect users back to the OpenLibrary or (whenever possible) other alternative book library interfaces once they have selected an OCA title for browsing. OCA has never mandated the use of any particular book-viewing program; does not surrender control of the user experience to Google; and offers the distinct possibility of delivering a better browsing and library platform than what Google provides through Google Book Search.

In short, HarperCollins and OCA see a world where there will be many libraries and publishers, as well as many search engines.