Book Search Should Work Like Web Search

Peter Brantley pointed me to a good write-up on the Booksquare blog about the release of Microsoft’s book search. The most important point in the review, Peter and I both agree, was the following:

“So here’s our problem: there is no benefit if the works are
exclusive to one search provider over another. You, dear consumer,
do not know that Microsoft has Book A while Google has Book
B whereas Yahoo! has Book C and some other search engine has
Books P and Q.”

Now, maybe eventually, Google, and Microsoft, and Amazon, and the Open Content Alliance (OCA), and everyone else scanning books will come to parity, with all books included in all search engines, just as all web search engines with independent spiders converge on a roughly complete search index for the web. But scanning books is slower and more costly than spidering web pages, and in the meantime (and likely for a long time to come), the situation outlined above is likely to prevail.

There’s a further wrinkle when it comes to rare books. I was talking recently with Brewster Kahle of the OCA, and he remarked, “You only get to do this once.” He has asked to scan various library collections and been told, “We’re already working with Google.” (I talked further about this issue with David Rumsey of the American Antiquarian Society and Mike Keller of the Stanford University Library, and they disagreed. They said that the current scans aren’t actually good enough for a lot of scholarly work, and that eventually all the really important rare works would need to be rescanned. But they agreed that for now at least, the situation Brewster was referring to does create some content silos.)

Having various book search engines competing to build a proprietary online book repository seems silly to me. It also doesn’t seem to be working. (For example, a quick scan of Amazon’s bestseller list shows that only 5 of the top 25 books have “search inside” enabled.)

Book search is a big problem, and it could be solved much faster if the various vendors involved would cooperate rather than compete. Web search demonstrates that there are other grounds for competition than getting a lock on some exclusive body of content. (One might suggest that the race ought to be to figure out, before anyone else does, how to do effective relevance matching for advertising on book search.)

A related issue was also brought out in the Booksquare blog: “…scanning is indeed how Microsoft is getting published works into its database. Even if your work is already in electronic format.”

As every reader of this blog ought to know [key posts], I’m a big fan of the Google library project, which is cutting the Gordian knot of orphaned works for which publishers no longer know the ownership. Scanning makes sense for these books. But it doesn’t make sense for books that are already available in some kind of electronic format. The most advanced publishers already have their books in an XML repository, but even the most backward have at least PDFs that could be searched.

Three things ought to happen to speed up the development of the book search ecosystem:

  1. Book search engines ought to search publishers’ content repositories, rather than trying to create their own repository for works that are already in electronic format. Search engines should be switchboards, not repositories.
  2. Publishers need to stop pretending that “opt in” will capture more than a tiny fraction of the available works. (I estimated that only 4% of books ever published are being commercially exploited.)
  3. Book search engines that are scanning out of print works in order to create a search index ought to open their archives to their competitors’ crawlers, so readers can enjoy a single integrated book search experience. (Don’t fight the internet!)
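
To make the “switchboard, not repository” idea concrete, here is a minimal sketch of what a single integrated search might look like. Everything in it is hypothetical: the repository names, the `make_repository` and `switchboard_search` helpers, and the toy in-memory book texts are illustrations only, not any vendor’s actual API. A real system would be querying publishers’ own repositories and other engines’ scan archives over the network.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Hit:
    title: str
    source: str   # which repository holds the text
    snippet: str


# A "repository" here is just anything that can answer a query with hits.
Repository = Callable[[str], Iterable[Hit]]


def make_repository(name: str, books: dict[str, str]) -> Repository:
    """Build a toy in-memory repository from {title: full_text} (illustrative only)."""
    def search(query: str) -> list[Hit]:
        q = query.lower()
        hits = []
        for title, text in books.items():
            pos = text.lower().find(q)
            if pos != -1:
                # Keep a little context around the match as the snippet.
                snippet = text[max(0, pos - 20):pos + len(q) + 20]
                hits.append(Hit(title, name, snippet))
        return hits
    return search


def switchboard_search(query: str, repositories: list[Repository]) -> list[Hit]:
    """Fan the query out to every repository and merge the results.

    The search engine stores no book text itself; it only routes the query.
    """
    merged = []
    seen = set()
    for repo in repositories:
        for hit in repo(query):
            key = hit.title.lower()
            if key not in seen:  # same title held in two places: keep the first
                seen.add(key)
                merged.append(hit)
    return merged


# Hypothetical holdings: one publisher's electronic files, one library scan archive.
publisher = make_repository(
    "publisher-xml",
    {"Book A": "A chapter on cutting the Gordian knot."},
)
scanner = make_repository(
    "library-scans",
    {
        "Book B": "An orphaned work about the Gordian knot.",
        "Book C": "Nothing relevant here.",
    },
)

results = switchboard_search("gordian knot", [publisher, scanner])
for hit in results:
    print(f"{hit.title} ({hit.source}): ...{hit.snippet}...")
```

The reader asks one question and gets answers from both the publisher’s files and the scanned archive, without needing to know which provider holds which book.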