
Mon, Aug 13, 2007

Peter Brantley

Alas, poor book, Gentleman

The scholar Paul Duguid analyzes the worth of Google Book Search (GBS) in a recent issue of First Monday, one of the first peer-reviewed journals to appear online. His article, "Inheritance and loss? A brief survey of Google Books," takes as its subject an early serialized novel, The Life and Opinions of Tristram Shandy, Gentleman, by Laurence Sterne. The novel's first two volumes, of an eventual nine, appeared in 1759; thus the text is decidedly in the public domain.

Dr. Duguid's acerbic consideration of the various versions of Tristram Shandy in GBS is rather humorous, but ultimately an indictment of Google's peculiar enactment of the grander vision of a repository of digitized, networked books. His analysis centers on two versions of Tristram Shandy, one contributed by the Library of Harvard University, and the other by Stanford. At the close of his analysis, after noting the peculiar typographic demands of a set of pages in Tristram Shandy that echo Shakespeare's "Alas, poor Yorick!" and their ill-served treatment in GBS, he offers this commiseration with the book's author:

Alas Poor Sterne! Evidently he had some premonition of what might become of his work. Like the Harvard edition, which ignored Sterne's black page, the Stanford work not only ignores Sterne's divisions, but introduces new ones of its own. Its chapter 2 has no bearing on Sterne's chapter 2 in either Volume I or any subsequent volume of the original text. This would matter little were it not that Sterne continuously refers back and forth to preceding or future pages, chapters, and books. Indeed, he even opens his second volume with an alert to his readers (and, perhaps, editors) that "I have begun a new book." That phrase is no doubt buried mystifyingly somewhere in the first volume of the Stanford edition which is, in turn, buried mystifyingly somewhere in the Google Books Library Project.

It is worth reminding the reader that the volumes of Tristram Shandy discussed by Duguid are public domain (PD) works harvested from the library partners in Google Book Search, and were thus undoubtedly subjected to high-volume, non-destructive scanning. Inevitably, at the scale pursued by Google, errors will be introduced in both the scanning and the OCR. I suspect these are early scans (although I have no way of knowing for sure), and I suspect that Google's efforts have improved with practice (although that is a statement of faith).

Let us also note that in contrast to PD books, in-print, in-copyright works submitted by publishers arrive either in a born-digital representation ready for ingest into GBS, or they are submitted with the aim of being scanned destructively (their spines removed and the sheets fed automatically). This type of scanning, used by all high-volume digitization operations when works are not unique, introduces far fewer errors. Tellingly, the books most in contention between publishers and Google -- i.e., in-copyright, out-of-print books -- are almost certain to be obtained via libraries, and are thus subject to the same processing infidelity as afflicts Tristram Shandy. Whether their degraded condition, as represented online, will be valued commensurately in court -- or in settlement -- remains to be seen.

However, Duguid's analysis of Google Book Search goes far deeper than a consideration of the cosmetic defects of the books' electronic skin. Rather, he recognizes that faults lurk so visibly because Google is throwing away information that is fundamentally characteristic of books -- metadata that describe and even determine what books are, as simple and seemingly trivial as volume numbers, or artifacts of type design, editing, and artistic production. Books are not, in other words, mere bags of words, but vehicles in which ride a wide assortment of other passengers -- metadata, artistic expression, whimsy, and error. Books are born and produced in an organizationally and informationally rich social and economic context, and the willful discarding of that context carries with it a loss whose surface manifestation may be amusing, but whose deeper ramifications are profoundly disturbing. Duguid concludes:

The Google Books Project is no doubt an important, in many ways invaluable, project. It is also, on the brief evidence given here, a highly problematic one. Relying on the power of its search tools, Google has ignored elemental metadata, such as volume numbers. The quality of its scanning (and so we may presume its searching) is at times completely inadequate [ref]. The editions offered (by search or by sale) are, at best, regrettable. Curiously, this suggests to me that it may be Google's technicians, and not librarians, who are the great romanticisers of the book. Google Books takes books as a storehouse of wisdom to be opened up with new tools. They fail to see what librarians know: books can be obtuse, obdurate, even obnoxious things. As a group, they don't submit equally to a standard shelf, a standard scanner, or a standard ontology. Nor are their constraints overcome by scraping the text and developing search algorithms. Such strategies can undoubtedly be helpful, but in trying to do away with fairly simple constraints (like volumes), these strategies underestimate how a book's rigidities are often simultaneously resources deeply implicated in the ways in which authors and publishers sought to create the content, meaning, and significance that Google now seeks to liberate. Even with some of the best search and scanning technology in the world behind you, it is unwise to ignore the bookish character of books. More generally, transferring any complex communicative artifacts between generations of technology is always likely to be more problematic than automatic.

Ultimately, whether or not Google Book Search is a useful tool will hinge in no small part on the ability of its engineers to provoke among themselves a more thorough, and less alchemic, appreciation for the materials they are attempting to transmute from paper to gold.
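To put the "bags of words" point in concrete terms, here is a toy Python illustration; the record below is a hypothetical stand-in, not real GBS data, and the point is only that structural metadata vanishes when a text is scraped for search:

```python
# A toy illustration of the "bag of words" reduction. The book record is
# a hypothetical stand-in, not real GBS data.

from collections import Counter

book = {
    "title": "The Life and Opinions of Tristram Shandy, Gentleman",
    "volume": 2,
    "year": 1759,
    "chapters": {1: "I have begun a new book.", 2: "placeholder chapter text"},
}

# Reduce to a searchable bag of words: the volume number, the chapter
# divisions, and every typographic gesture (black page included) fall away.
bag = Counter(
    word
    for text in book["chapters"].values()
    for word in text.lower().split()
)

print(bag.most_common(5))
```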




9 Comments

Dori said:

in-print, in-copyright works submitted by publishers arrive either in a born-digital representation ready for ingest into GBS, or they are submitted with the aim of being scanned destructively (their spines removed and the sheets fed automatically). This type of scanning, used by all high-volume digitization operations when works are not unique, introduces far fewer errors.

Far fewer errors, maybe, but still a non-trivial number. Here's a list of some examples; while it's true that it's a personal search, that also means I can guarantee that every book on the list has problems.

I originally wrote about this problem on my blog in May 2006. Things have changed a little since the numbers I came up with then, but there are still some real issues (unless, of course, you really do want a book on HTNL...).

A while back Tim did a piece on using SVN in Congress (well, in government at least), and I think a repository like SVN would be a far better place to put books like this, with a rich version heritage.

Many of Shakespeare's works come in different Folio versions which differ substantially, sometimes due to transcription errors, but other times due to amendments made during productions as Shakespeare experimented with the dialog.

Our literary heritage is like gold, and there is a great desire to back-fill something like Google Book Search with it, but I don't know if OCR is the way to go.

Interestingly, writing started out much more like a source tree with branches, moved to a flat, single version (with modern publishing), and now seems to be moving back to a more fluid, versioned form with the advent of things like SVN and, more recently, Google Apps.
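As a rough illustration of that branching idea, here is a minimal Python sketch -- not a real SVN workflow, and the editions and notes are simplified for the example -- that models a work's editions as a version graph, the way a source tree models branches and merges:

```python
# A toy version graph for editions of a work. Not a real SVN workflow;
# the editions and notes below are simplified illustrations.

from dataclasses import dataclass, field

@dataclass
class Edition:
    label: str
    parents: list = field(default_factory=list)  # editions this one derives from
    notes: str = ""

# Branching history, loosely in the spirit of Shakespeare's quartos and folios:
q2 = Edition("Hamlet Q2 (1604)", notes="long text, close to the author's papers")
f1 = Edition("First Folio (1623)", parents=[q2],
             notes="playhouse amendments folded in")
modern = Edition("Modern conflated edition", parents=[q2, f1],
                 notes="editors merge branches, much like a version-control merge")

def ancestry(edition: Edition, depth: int = 0) -> None:
    """Print an edition and everything it derives from, indented by generation."""
    suffix = f" -- {edition.notes}" if edition.notes else ""
    print("  " * depth + edition.label + suffix)
    for parent in edition.parents:
        ancestry(parent, depth + 1)

ancestry(modern)
```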

I remember the initial hue and cry over Google's decision to introduce Google Book Search. As an outsider (which I am not in this case), one could have wondered what the fuss was all about. If you are a publisher and you do not like this, just go ahead and opt out of the program!

But, the real problem lay in the fact that if your competitors were participating, you did not have the guts to not participate. Crudely put, but that was it.

I think that Google's book search technology is very promising. Yes, it needs to be "more thorough, and less alchemic." But I have not seen any solid evidence of matters being otherwise. In fact, if one of the existing tech giants had to take this up, I am glad that Google did.

Jerome McDonough said:

However, Duguid's analysis of Google Book Search goes far deeper than a consideration of the cosmetic defects of the books' electronic skin. Rather, he recognizes that faults lurk so visibly because Google is throwing away information that is fundamentally characteristic of books -- metadata that describe and even determine what books are, as simple and seemingly trivial as volume numbers, or artifacts of type design, editing, and artistic production.

If I can critique your wording (not your actual point), what we find here is that structural metadata is neither 'simple' nor 'trivial.' It's complex, vital, and apparently too difficult for even an agency with Google's resources to cope with.

And so, the painful question: if Google can't figure out a way to produce useful structural metadata, is there really much of a future for the METS, MPEG-21s and XFDUs of the world?
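For readers who haven't met these standards, here is a deliberately minimal Python sketch of the kind of structural metadata at issue: a METS-flavored structMap that simply records which volume is which. It is an illustration only, not a schema-valid METS document:

```python
# A toy, schema-incomplete sketch of a METS-style structMap recording the
# volume structure of a multi-volume work. Illustration only; a real METS
# document needs a header, file section, and more.

import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)

mets = ET.Element(f"{{{METS_NS}}}mets")
struct_map = ET.SubElement(mets, f"{{{METS_NS}}}structMap", TYPE="physical")
work = ET.SubElement(struct_map, f"{{{METS_NS}}}div",
                     TYPE="multivolume_work",
                     LABEL="Tristram Shandy, Gentleman")

# Nine volumes, each with its own chapter numbering -- exactly the
# structure Duguid found flattened away in Google Book Search.
for vol in range(1, 10):
    ET.SubElement(work, f"{{{METS_NS}}}div",
                  TYPE="volume", ORDER=str(vol), LABEL=f"Volume {vol}")

print(ET.tostring(mets, encoding="unicode"))
```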

Adam Hodgkin said:

Some follow-up here:
http://exacteditions.blogspot.com/2007/08/book-searching-is-not-same-as-book.html

It is arguable that the requirements of preservation and of scholarly interpretation of our cultural inheritance are so complex, and so various, that Google would do better to revert to the role of search engine and infrastructure provider, leaving it to libraries and to publishers to produce digital resources which match specifications that Google can automate and accommodate. Google Book Search has put a rocket up the scanning plans of us all. The idea that Google should actually do all the scanning seems absurd... It's also very annoying that the Google citations only work reliably in the USA (and Canada). Most of the citations Duguid provides are disabled for European users (presumably because of copyright differences between the USA and Europe).

Well, I believe the argument here comes down to what is logistically possible. Of course one would want to have all the information in the original book, including metadata, flawless scanning, and even enhancements to the original book such as hyperlinks to other references. But, as every economist knows, we live in a world with limited resources. And so, I think that rather than creating a laundry list of everything missing under the sun, we should instead move one step ahead and point out which next steps would carry the most value. As an example, if hiring librarians to do the scanning themselves is not an option, should they have closer oversight during production or in quality control? Or should the public have an avenue to point out flaws during a certain period after a book is released on the Web? I am not suggesting these are actual solutions, but the point, again, is to move one step further and go from an initial list of problems to a ranking of problems in terms of cost-benefit.

Tristram Shandy is a fascinating example, but hardly representative of the millions of books being scanned. Similarly with the many editions of Shakespeare.

The pleasures of Sterne derive precisely from the fact that he played with the limits of the technology available to him. Likewise, the challenges of interpreting Shakespeare are created, in part, by the vagaries of print transmission of his texts.

Printers, publishers, librarians, bibliographers, and scholars have spent centuries developing the standards that we use to distinguish editions, versions, printings, multiple copies, etc. Is it reasonable to expect that Google will have solved these problems already, some two and a half years into their program?

For me, the value of the Book Search program is that it demonstrates very broadly and quite quickly the challenges of digitizing texts and developing search tools for them. Obviously there are imperfections, and perhaps some future Sergey Brin and Larry Page will adapt print-world structures for organizing these texts, or create new ones.

For me, the act of "transferring... artifacts between generations of technology" suggests the possibility that new forms of text and interpretation may emerge, along with the new technology.

bowerbird said:

it's a good thing duguid had already obtained the
label of "scholar", because this piece would not
have earned it for him...

-bowerbird

Re: Jerome McDonough's comment: the irony here is that Google is supplied with MARC records from Michigan (and possibly other library partners) but uses only the most basic fields.
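As a concrete example, here is a minimal Python sketch, assuming the pymarc library and a hypothetical records.mrc file of partner MARC records; it pulls a few of the richer fields -- edition statement (250), physical description (300), series/volume statement (490) -- that sit right there in the records:

```python
# A minimal sketch, assuming the pymarc library and a hypothetical
# "records.mrc" file of MARC records from a library partner.

from pymarc import MARCReader

with open("records.mrc", "rb") as fh:
    for record in MARCReader(fh):
        if record is None:           # skip records pymarc could not parse
            continue
        title_field = record["245"]  # title statement
        print(title_field.value() if title_field else "(no title)")
        for label, tag in (("edition", "250"),
                           ("physical", "300"),
                           ("series/volume", "490")):
            fld = record[tag]
            if fld is not None:
                print(f"  {label}: {fld.value()}")
```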

Re: Adam Hodgkin's comment: Last I heard, Google and U-M use GeoIP to allow full-text access to works published before 1923 in the US only; in Canada, Europe, and elsewhere, only to works published before the mid-19th century. (I'm unsure of the exact cut-off date.) This crude approach is used rather than attempting to track the status of copyright law in various jurisdictions outside the US.
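That policy reduces to a few lines of logic. Here is a minimal Python sketch of it; the US cutoff is as reported above, while the non-US cutoff year is an explicit assumption, since the exact mid-19th-century date is unclear even in this thread:

```python
# A minimal sketch of the crude regional gating described above. The US
# cutoff (pre-1923) is as reported; NON_US_CUTOFF is an ASSUMED placeholder
# for "the mid-19th century," since the exact date is uncertain.

US_CUTOFF = 1923
NON_US_CUTOFF = 1865  # assumption, not a confirmed figure

def full_view_allowed(publication_year: int, country_code: str) -> bool:
    """Decide full-text visibility from publication year and GeoIP country."""
    if country_code == "US":
        return publication_year < US_CUTOFF
    # One blanket rule for everywhere else, rather than tracking each
    # jurisdiction's actual copyright law.
    return publication_year < NON_US_CUTOFF

# Tristram Shandy, Vol. I (1759): full view everywhere.
assert full_view_allowed(1759, "US") and full_view_allowed(1759, "DE")
# A 1910 novel: full view in the US, blocked for a European reader.
assert full_view_allowed(1910, "US") and not full_view_allowed(1910, "FR")
```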
