Sat, Mar 18, 2006

Tim O'Reilly

What Do You Do With a Million Books?

Gregory Crane, writing in D-Lib Magazine, asks the provocative question, "What can you do with a million books?" While many of his arguments are targeted at librarians and scholars rather than a tech audience, he frames the issues beautifully for anyone thinking about the future of the book:

"Vast collections based on image books - raw digital pictures of books with searchable but uncorrected text from OCR - could arguably retard our long-term progress, reinforcing the hegemony of structures that evolved to minimize the challenges of a world where paper was the only medium of distribution and where humans alone could read."

His point is that we need to get beyond page images if we want to reach the true potential of the digital library. (This was also very much Cliff Lynch's point at the Reading 2.0 summit. What gets interesting is when we have digital books as a body of work to compute against, not just to read.) Crane goes on:

"Already the books in a digital library are beginning to read one another and to confer among themselves before creating a new synthetic document for review by their human readers.... Figure 1 shows a simple illustration of recombinant documents. The reader has retrieved a summary of information about a canonical query - a Latin poem by the poet Catullus. Although the materials primarily derive from print sources, this view builds on three fundamental components of a truly digital library. First, they have much finer and more meaningful granularity than page breaks: larger documents have been broken down into smaller units aligned to an established authority (in this case, the traditional numbering and lineation of the poems of Catullus, such that "Catullus 1" means the same thing in many information sources). Second, the digital library automatically learns as it grows larger: automated systems scan new documents for references to Catullus; language models update themselves to provide better contextual clues to disambiguate phenomena such as morphology. Third, the documents can learn from their users, both implicitly (by examining patterns of use to determine important questions and sources of information) and explicitly. All three features are present in a rather primitive state in this figure, but the ability to decompose information into smaller, reusable chunks, to learn autonomously from a changing environment, and to accept explicit structured feedback from many human users in real time are fundamental characteristics that separate digital from print."

Very specifically, Google needs to make sure that at least the public domain books read into Google Book Search are made available in text form, not just as images. And beyond that, we need to turn hackers loose on the digital book, such that they can do interesting new things with the text, not just present it in ways that mimic the print book.

Flickr is a good example of how open access to a large database of content can lead to innovative new interfaces and access methods, from the tag cloud to the Flickr Color Picker. It would be wonderful to see this kind of innovation in the book space.

Note to self and team: In Safari, we have all our books as XML text, not just as page images, and we even have a Safari API [tutorial here], but we haven't done a good job ourselves of opening the text up for innovation. The public API is mostly about metadata, not about the data itself. The private API makes much more possible, and we need to get that out in the hands of our customers....

P.S. Just gotta love this line: "Already the books in a digital library are beginning to read one another and to confer among themselves..." Endlessly provocative.



Comments: 5

  Ewout ter Haar [03.18.06 02:17 PM]

P.S. Just gotta love this line: "Already the books in a digital library are beginning to read one another and to confer among themselves..." Endlessly provocative.

That is based on a line from Umberto Eco (The Name of the Rose), isn't it? Books whispering among themselves and all that. It also reminds me of Daniel Dennett's "A scholar is just a library's way of making another library". Maybe soon we won't need scholars anymore...

  Lorcan Dempsey [03.18.06 07:21 PM]

See Cliff Lynch on the future of the book at http://www.firstmonday.org/issues/issue6_6/lynch/ where he says:

Sometime in the 1980s I heard this statement about digital books:
"Here's a "view from the future," looking back at our "present," from Professor Marvin Minsky of MIT: "Can you imagine that they used to have libraries where the books didn't talk to each other?"" [4]

  Richard Dyce [03.20.06 05:48 AM]

Of course, on a positive note, a million books will make quite a dent in the carbon balance. (Provided you don't burn them.)

  Marco Gillies [03.22.07 06:59 AM]

Small technical point: the link to the Flickr colour picker is broken (3 ts in http). Should be:

http://krazydad.com/colrpickr/

marco
