Mon, Dec 11, 2006

Tim O'Reilly

Book Search Should Work Like Web Search

Peter Brantley pointed me to a good write-up on the Booksquare blog about the release of Microsoft's book search. The most important point in the review, Peter and I both agree, was the following:

"So here’s our problem: there is no benefit if the works are exclusive to one search provider over another. You, dear consumer, do not know that Microsoft has Book A while Google has Book B whereas Yahoo! has Book C and some other search engine has Books P and Q.

Now, maybe eventually Google, Microsoft, Amazon, the Open Content Alliance (OCA), and everyone else scanning books will come to parity, with all books included in all search engines, just as all web search engines with independent spiders converge on a roughly complete search index for the web. But scanning books is slower and more costly than spidering web pages, and in the meantime (and probably for a long time to come), the situation outlined above is likely to prevail.

There's a further wrinkle when it comes to rare books. I was talking recently with Brewster Kahle of the OCA, and he remarked, "You only get to do this once." He has asked to scan various library collections and been told, "We're already working with Google." (I talked further about this issue with David Rumsey of the American Antiquarian Society and Mike Keller of the Stanford University Library, and they disagreed. They said that the current scans aren't actually good enough for a lot of scholarly work, and that eventually all the really important rare works would need to be rescanned. But they agreed that for now at least, the situation Brewster was referring to does create some content silos.)

Having various book search engines competing to build proprietary online book repositories seems silly to me. It also doesn't seem to be working. (For example, a quick scan of Amazon's bestseller list shows only 5 of the top 25 books with "search inside" enabled.)

Book search is a big problem, and it could be solved much faster if the various vendors involved would cooperate rather than compete. Web search demonstrates that there are other grounds for competition than getting a lock on some exclusive body of content. (One might suggest that the race ought to be to become the first company that figures out how to do effective relevance matching for advertising on book search.)

A related issue was also brought out in the Booksquare blog: "...scanning is indeed how Microsoft is getting published works into its database. Even if your work is already in electronic format."

As every reader of this blog ought to know [key posts], I'm a big fan of the Google library project, which is cutting the Gordian knot of orphaned works for which publishers no longer know the ownership. Scanning makes sense for these books. But it doesn't make sense for books that are already available in some kind of electronic format. The most advanced publishers already have their books in an XML repository, but even the most backward have at least PDFs that could be searched.

Three things ought to happen to speed up the development of the book search ecosystem:

  1. Book search engines ought to search publishers' content repositories, rather than trying to create their own repository for works that are already in electronic format. Search engines should be switchboards, not repositories. (A rough sketch of this idea follows the list.)
  2. Publishers need to stop pretending that "opt in" will capture more than a tiny fraction of the available works. (I estimated that only 4% of books ever published are being commercially exploited.)
  3. Book search engines that are scanning out of print works in order to create a search index ought to open their archives to their competitors' crawlers, so readers can enjoy a single integrated book search experience. (Don't fight the internet!)
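
To make point 1 concrete, here is a minimal sketch of the switchboard idea: the search engine indexes a publisher-hosted catalog feed and keeps only metadata plus pointers back to the publisher's copy of each book. The feed URL, JSON field names, and functions are hypothetical illustrations, not any vendor's actual API.

```python
# A minimal sketch of "switchboard, not repository": index a publisher's
# catalog feed, keep only terms and pointers, leave the full text with the
# publisher. Feed URL and field names are hypothetical.
import json
import urllib.request

def fetch_publisher_feed(url):
    """Fetch a publisher's catalog feed (assumed to be JSON) over HTTP."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def build_switchboard_index(feed):
    """Build a term -> [(isbn, title, full_text_url)] map; full text stays with the publisher."""
    index = {}
    for book in feed["books"]:
        terms = set(book["title"].lower().split()) | set(book.get("keywords", []))
        for term in terms:
            index.setdefault(term, []).append(
                (book["isbn"], book["title"], book["full_text_url"])
            )
    return index

if __name__ == "__main__":
    # Inline sample standing in for a real publisher feed.
    sample_feed = {"books": [{
        "isbn": "978-0-000000-00-0",
        "title": "Cutting the Gordian Knot",
        "keywords": ["orphaned works", "book search"],
        "full_text_url": "https://books.example-publisher.com/gordian-knot",
    }]}
    index = build_switchboard_index(sample_feed)
    # A query returns pointers to the publisher's repository, not a hosted duplicate.
    print(index.get("gordian", []))
```

The design point is simply that the search engine's asset is the index and the relevance logic, not a private copy of the books.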


Comments: 27

  Anonymous [12.11.06 05:34 AM]

How does this apply to O'Reilly's own books? Are they not in the Safari silo, or are you already allowing search engines to crawl your data store?

  Eric [12.11.06 10:01 AM]

Hmmm... but that seems to be the strategy of search engines, doesn't it? They're looking for competitive leverage through exclusive content. Primarily that's been with personal data: If I want to search my Gmail account, I can't do it through Yahoo. But it's always been a staple of the search engine wars:

- Does Google let anyone else index their groups, that is, 20+ years of usenet archives? To the best of my knowledge, they're the sole owners of that data and you can't get there from Yahoo.

- Does Yahoo let anyone else index Flickr? At a glance, it doesn't look like I can use Google Image Search for that.

I agree with you about Book Search - it would make more sense to cooperate. But realistically, that's not how search engines or businesses work. Any data a search engine can monopolize gives them an advantage, so it's not shocking to see book search move in this direction.

  Tim O'Reilly [12.11.06 05:31 PM]

Eric --

No, Google doesn't let anyone else index their groups, and you can't search gmail through Yahoo! (But I believe that Google CAN spider Flickr and simply doesn't choose to, unless I'm mistaken.) These are all areas, like the one I posted about, where Google is not following Eric's dictum "don't fight the internet."

FWIW, I still believe that search engines can get lock in, but it's not based on exclusive *content*. To the extent they move in that direction, they are no longer search engines, and I predict that they will lose.

Case in point: gmail, google images, and google video all have failed to become market leaders.

  adamsj [12.11.06 05:35 PM]

Tim,


You're a tease when you say

FWIW, I still believe that search engines can get lock in, but it's not based on exclusive *content*

and don't tell us how they can get it.


Like Ross Perot, I'm all ears.

  bowerbird [12.11.06 11:42 PM]

sorry, tim, you're missing the big issue,
which is that _all_ of the projects are
falling down seriously on quality-control.

let's hope brewster is wrong when he says,
"you only get to do this once," because
everything is gonna have to be rescanned.

-bowerbird

  Jeff Ubois [12.12.06 12:12 AM]

Bowerbird: agreed that quality matters, but unfortunately, it seems we can gauge the probability of re-scanning for better quality by looking at what happened with microfilm, i.e. we got one shot, and we are still living with the consequences.

  Michael Jensen [12.12.06 04:18 AM]

If "a book" was simply a big-ol'-file bag-o-words, it'd be easy enough to share, search, and process. But the biggest challenge in book search is that processing term relevance and document significance is much more difficult with huge book files. The content has far fewer links than web pages do, are (often, but not always) comprised of chapters, contain occasional self-referentiality, and represent wildly diverse content types: from dictionaries to historical analyses to stream-of-consciousness to young-adult novel.

And if I read it right, some digitizers are retaining page-by-page book chunks, others whole-book files, and still others store pages as database records; the historical metadata (the finding aids libraries built to try to make sense of the diversity) is generally not easily available to integrate (each big digitizer having a different arrangement with OCLC), and is separated from the content itself. Economically, these huge projects have to be done mechanically and algorithmically, which means that hand-entry of metadata (or tight quality control) is contraindicated.
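
To make that mismatch concrete, here is a rough sketch of normalizing those three storage shapes into one structure, with the catalog metadata attached separately when it can be found at all. The classes and field names are purely illustrative assumptions, not any digitizer's actual schema.

```python
# A rough sketch (illustrative field names only, not any digitizer's schema)
# of coercing the three storage shapes mentioned above -- page-by-page chunks,
# whole-book files, and pages stored as database rows -- into one structure.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CatalogRecord:
    """Finding-aid / OCLC-style metadata, which arrives apart from the text."""
    oclc_number: str
    title: str
    subjects: List[str] = field(default_factory=list)

@dataclass
class BookText:
    source_shape: str                        # "page-chunked", "whole-file", or "database-rows"
    pages: List[str]                         # normalized to a page list regardless of source shape
    catalog: Optional[CatalogRecord] = None  # often missing, or joined in a later pass

def normalize(source_shape: str, payload) -> BookText:
    """Coerce one of the three source shapes into a simple page list."""
    if source_shape == "whole-file":
        # Crude split on form feeds; a real pipeline would need layout information.
        return BookText(source_shape, payload.split("\f"))
    if source_shape == "database-rows":
        return BookText(source_shape, [row["text"] for row in payload])
    return BookText(source_shape, list(payload))  # already page-chunked
```

The only point of the sketch is that each digitizer's output needs its own adapter before any shared index can be built over it.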

In the old days of print, when paper was the only medium, one size (bound book) fit all, and we built intellectual tools to get around the physical limits: multiple book-specific indexes, the Dewey Decimal system, card catalogs with subject subcategories, etc., all of which evolved over centuries.

I think these, not "fighting the Internet," are the primary drivers encouraging silos: fighting the last 400 years of complexity, struggling with limits of standards-free digitization, and the difficulties of acquiring and retaining metadata.

I'd love to see a wiki-like opening up of book resources, to gather the "wisdom of crowds" to compensate for the cost of centralized attention.

  Brewster Kahle [12.12.06 06:42 AM]

"Book Search Should Work Like Web Search" is crisp and correct. The benefits outweigh the sorry state of lawsuits and licensing bonanzas that are following from a different course. All would win from adopting this, but it is not the current course.

The web was a radical step away from the closed worlds of AOL and Lexis/Nexis, whose closed models wanted to control both the search and the repository. But the Web worked as a balance: many put up valuable works, and then others could build services such as search on top of them without having to get permission. Many prospered.

Google has led the way in mass book scanning but violated this premise of the web. Their books are not indexable, en masse, by others, nor are the books scanned by others indexed well in Google. This is causing their commercial competitors to follow with similarly restricted systems, and it is leading to direct permission-based deals to do indexing.

Do we really want to go back to such a world?
Many don't think so.

In a straw poll at a workshop Tim organized around book publishing on the web, I found the publishers, search engines, universities, and archives all in favor of Tim's premise "Book Search Should Work Like Web Search" -- except, so far, for one organization: Google.

Given that Google has made a fortune on pre-emptive indexing of others' works, I would think they would see the value in seeing this model move forward. I think of it as the Golden Rule -- "Do unto others as you would have them do unto you." But so far others cannot index their materials.

Bringing Google back into the web-oriented world takes a decision at the top of the organization, but I hope they change course, because we have seen the permission-based / licensing-heavy movie before. It tends towards lock-out, monopolies, and holiday bonuses for lawyers.

I strongly support the idea that Book Search Should Work Like Web Search.

  Tim O'Reilly [12.12.06 08:06 AM]

bowerbird --

Don't confuse the existence of two separate problems with the idea that one obviates the other.

Quality is definitely an issue. But so is the fact that various vendors are not taking the easy route of indexing the searchable book repositories that already exist, and sharing results to get to a better solution.

  Tim O'Reilly [12.12.06 08:10 AM]

Ross -- I mean John --

Does Google own the web pages it searches? Isn't Google in fact MORE successful in the fields like web search where they don't host the content than ones like email or social networking where they do?

How can that be? Ask yourself: what does Google own?

They own their own rich and complex metadata, which allows them to:

- deliver the most relevant search results
- deliver the most relevant advertising

They own their advertiser network, and the critical mass that comes from creating a virtuous circle of page views and revenue from those page views. (Much like eBay, there's a system of increasing returns, where the critical mass of buyers and sellers reaches a level where there is a bias against switching.)

They own their adsense network and a whole host of Web 2.0 add-on companies whose business model is fueled by Google adsense.

I could go on and on...

  Tim O'Reilly [12.12.06 08:12 AM]

Michael --

No question that book search is harder than web search. All the more reason why it needs to be open. Just as web search struggled for many years till there was a breakthrough by Google, book search may also struggle. And in situations like this, it's better to have lots of people working on a common problem than people working separately with an eye to advantage before anyone knows what that advantage will be.

  adamsj [12.12.06 01:33 PM]

Tim,


I get what you're saying--in fact, I'm thinking about AdSense a lot right now--but I'm not sure that constitutes lock-in. I still use other search engines. I don't feel particularly locked in by Google, and I don't anticipate being locked in by them so long as they are not a monopoly (a monopoly doesn't need lock-in).


(Okay, there is one search I repeatedly make using the I Feel Lucky button, 'cause I can't remember the five-part URL. There, though, what makes that search easy for me is that I make it on every machine I use, and autofill saves me the typing after I've gotten the first two letters in.


(And why don't I bookmark it? Frankly, just typing two letters is easier on me, if not on Google.)

  bowerbird [12.12.06 02:52 PM]

tim said:
> bowerbird -- Don't confuse
> the existence of two separate problems
> with the idea that one obviates the other.

but tim, don't confuse two problems that are
"separate" as being "equal" in importance...

the "sharing" problem will likely solve itself.
but the "quality" problem is much thornier...

umichigan is putting up the o.c.r. from
its google scans, for the public-domain
books anyway, so the other search engines
will be able to scrape that text with ease.

what you will find, though, if you look at it
(for even as little as a minute or two) is that
the quality is so inferior it's almost worthless.

end-of-line hyphenates have lost their hyphens,
em-dashes have been dropped, quotemarks are
missing in action, and (lest we forget) there
doesn't seem to have been _any_ routines applied
that, you know, _correct_ the o.c.r. mistakes...
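
a toy sketch of one such missing routine (the word list is just a stand-in for a real lexicon, and this is not any project's actual code):

```python
# toy sketch only: when o.c.r. drops the end-of-line hyphen, a word ends up
# split across two lines ("infor" / "mation"). a dictionary check can rejoin
# the obvious cases. KNOWN_WORDS is a stand-in for a real lexicon.
KNOWN_WORDS = {"information", "publishing", "searchable"}

def rejoin_split_words(lines):
    """merge a line-final fragment with the next line's first word when the join is a known word."""
    lines = list(lines)  # work on a copy
    out = []
    for i, line in enumerate(lines):
        words = line.split()
        if words and i + 1 < len(lines):
            next_words = lines[i + 1].split()
            if next_words and (words[-1] + next_words[0]) in KNOWN_WORDS:
                words[-1] += next_words[0]
                lines[i + 1] = " ".join(next_words[1:])
        out.append(" ".join(words))
    return [line for line in out if line]

print(rejoin_split_words(["the publish", "ing industry has infor", "mation to index"]))
# -> ['the publishing', 'industry has information', 'to index']
```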

and text in microsoft .pdf's is _just_as_bad_.

so even if every project were to share its text,
would good is it? it's garbage-in-garbage-out...

when i can look at a page and see a word there,
but search engines don't find it because it was
recognized incorrectly, what should i think?
how long will people _believe_ in that product?

google _might_ lock up its copyrighted scans,
but as they are the only entity willing to
pay the cost of the legal challenges, that's
a bounty they _deserve_, in my opinion...

-bowerbird

  John Mark Ockerbloom [12.13.06 07:45 AM]

One thing that's thankfully different this time from the microfilm era is that, while there were only a few entities able to produce microfilms of library holdings, the bar for book scanners is much lower. Lots of people own scanners, and can image and digitize books if they know how and are so inclined.

They won't necessarily do as good a job on any given book as a trained professional will, and they're not the folks to do the books in rare books and special collections. But they can probably do at least as good a job on any given book as what we're seeing out of the mass-digitizers (where quality is often questionable).

They could also potentially review the output of those mass-digitization projects, and certify (and grab a copy of?) the ones that are in decent shape (all the pages there, in order, and fully legible, for instance.)

So there are potentially ways that the big Internet audience could improve on what the mass digitizers are doing by themselves, if those bodies don't themselves improve their production and distribution processes.

  Adam Hodgkin [12.14.06 03:55 AM]

Have commented on this proposal -- especially the first of the three points for the Book Search Ecosystem at

http://exacteditions.blogspot.com/2006/12/google-book-search-and-other-book.html

The first of Tim's points seems potentially a very good way of defusing the haze of litigation that surrounds Google on Book Search. If the publishers did maintain 'mostly open to search engines' repositories, Google would not be tempted to set a precedent for scanning copyrighted material for limited purposes.

  Tim O'Reilly [12.14.06 08:20 AM]

Adam --

While I appreciate your support on the main issue (book search should work like web search), you're wrong that publishers putting up their own repositories for Google to search would obviate the need for Google to have an opt-out policy for their library scanning project.

As I've written at length previously, publishers just don't know what they own, and there are numbers that show that only about 4% of the books that are known to libraries have their rights and ownership known to publishers.

I continue to believe that Google's approach is the only way to cut the Gordian knot of orphaned works.

  Adam Hodgkin [12.15.06 03:17 AM]

The publishers could reasonably expect Google to comply with an explicit opt-in solution if they could show that they were maintaining efficient repositories with good web interfaces. The onus would then be on Google to establish its catalog and check the publisher repositories before ingesting a title from a library. Google is currently as sloppy about maintaining a catalog as most publishers are about archiving their texts.

There is a very big orphaned-rights problem, and it's worse than you paint it. The truth is publishers probably don't even own what they appear to think they own. Most scholarly books include 'permission-granted' reuse of quotations from other books, and many books include 'permission-granted' examples of illustrations and diagrams. These permissions were sought and granted for use in a published book. Only in the last ten years have publishers sought to achieve complete clearance for the distribution of these components through the web, and they have not been meticulous about these issues. This is part of the mess of the orphaned-copyrights jungle. But if Google's practice is going to snow-plough a way through these complexities, it will in effect allow publishers also to have a clearer field of operation in establishing the rights that they may own.

I am not a lawyer, but it seems very unlikely that Google can get a clear win on the copyright issues in which it is now snared, unless it gets substantial support from many types of rights holder. A messy win is not really better for Google than a lost case... The publishers, the different types of copyright holders, and Google are going to be much better served if some compromise is built around win/win best practice.

  Putcha V. Narasimham [12.16.06 06:31 AM]

A good search engine should be able to search any content in electronic form. Instead of indexing terms, the content should be metatagged for meaning-related search and fetch. For this, ontologies (one per subject) may have to be created collectively and updated regularly.

That done (it may take time, but it would be a one-time effort), articles and books will turn out to be different versions of master knowledge represented in machine-processable form in ontologies.

Over time, growth of knowledge will be far less than the instances of having to use such knowledge. Progressively, humans would spend less time and effort in knowledge acquisition and mastery because they will be augmented with more economical, faster and effective knowledge appliances.

  adamsj [12.16.06 11:57 AM]

Tim,


When you say

I continue to believe that Google's approach is the only way to cut the Gordian knot of orphaned works

you skip over my preferred solution: legislation and government action. There's no reason that providing searchable softcopies can't be required for copyright registration, and no reason the Library of Congress couldn't do what Google is doing.


That wouldn't solve the problem completely, but it'd be a darned good start. In particular, going forward, I don't see how publisher-provided softcopies could fail to be of higher quality than corrected OCR copies.

  Tim O'Reilly [12.17.06 07:37 PM]

John,

I fail to see how your proposal would be a solution at all. It would be nice to require searchable softcopies as part of copyright registration, but it's not likely that any such regulation would look backwards in time, and that's where the problem lies. A few million titles are known to their publishers; 30+ million are in libraries. That's why the Gordian knot must be cut. Orphaned works are those that NO ONE knows the rights for any more.

And even for those where a publisher has a contract moldering in a lost filing cabinet, it's likely that it makes no provision for online rights. So, in the absence of Google's opt-out approach, NONE of these works will ever see the light of day. This is why I'm so puzzled that publishers don't grab on to the lifeline Google has offered them.

If Google is right and scanning in order to create an index is fair use (just as it has been deemed in web search, where pages are also scanned and copied in order to make an index), then eventually, valuable works will be discovered.

Adam -- I completely understand how bad the rights situation is, although you've highlighted one more of the many issues. However, I think the reason that Google doesn't want to check publisher lists before ingesting library titles is that to do so would give legal precedent to the publisher position. It makes practical sense -- and if the publishers weren't suing Google, I imagine they'd be trying to do this -- but given that both sides are trying to establish precedent (Google that creating an index is fair use and no permission is needed, and publishers that it is not), we have the current sloppy situation...

  Bill Seitz [12.18.06 05:18 AM]

It would be helpful to ask some of the companies why they don't get the electronic versions from publishers.

I'm guessing they've found the publishers to (a) be slow-moving, and (b) to be asking for money.

  Bill Seitz [12.18.06 05:25 AM]

Isn't the best-seller list a misleading sample, since it tends to be recently-published books, which plays against any lag-time in scanning?

  Tim O'Reilly [12.18.06 08:35 AM]

Bill --

Many publishers are indeed slow off the mark, but I don't think they are asking for money. There are two issues: first, Google, rightly, is looking for a decent user experience and so has some standards for what will be made available; and second, there is this whole jockeying around whether the business model is search engine or repository.

I'm not sure what bestseller list you're referring to. The comparison that yielded the 4% number was between the catalog of the Copyright Clearance Center and that of the OCLC.

  bowerbird [12.18.06 09:22 AM]

bill said:
> It would be helpful to ask some of the companies
> why they don't get the elecronic versions
> from publishers.
>
> I'm guessing they've found the publishers to
> (a) be slow-moving, and
> (b) to be asking for money.

the digital files that send a book to press are
so widely varying in their nature and complexity
that it makes no sense to try to cope with them.

that's assuming publishers had those files, and
_amazon_ with their "search inside the book"
program found -- when they asked initially --
that _many_ publishers of even very recent books
couldn't actually pony up their digital files.
(yes, it sounds very ridiculous, but it's true.)

scanning gives you what was really on the page.
which, honestly, is what you truly want to index.

-bowerbird

  Bill Seitz [12.18.06 11:51 AM]

the bestseller list i'm referring to:

"For example, a quick scan of Amazon's bestseller list shows only 5 out of the top 25 books "search inside" enabled."

  Petra [01.09.08 06:10 AM]

"For example, a quick scan of Amazon's bestseller list shows only 5 out of the top 25 books "search inside" enabled."

I guess it's easier for customers to choose one book when there are only 5 in the search results. But they should make it possible to do an advanced search with more than 5 results.

  Hochzeitsfotograf Berlin [01.21.09 03:41 PM]

The guide doesn’t evaluate, criticize, or take a stand toward the Settlement, but it is a thoughtful and careful guide.
