Alas, poor book, Gentleman

The scholar Paul Duguid analyzes the worth of Google Book Search (GBS) in a recent issue of First Monday, one of the first peer-reviewed journals to appear online. His article, “Inheritance and loss? A brief survey of Google Books,” takes as its subject an early serialized novel, The Life and Opinions of Tristram Shandy, Gentleman, by Laurence Sterne. The novel’s first two volumes, of an eventual nine, appeared in 1759; the text is thus decidedly in the public domain.

Dr. Duguid’s acerbic consideration of the various versions of Tristram Shandy in GBS is rather humorous, but ultimately an indictment of Google’s peculiar enactment of the grander vision of a repository of digitized, networked books. His analysis focuses primarily on two versions of Tristram Shandy, one contributed by the Library of Harvard University and the other by Stanford. At the conclusion of his analysis, after noting the peculiar digital demands of a set of pages in Tristram Shandy that commemorate Shakespeare’s “Alas, poor Yorick!” and their poor treatment in GBS, he offers this commiseration with the book’s author:

Alas Poor Sterne! Evidently he had some premonition of what might become of his work. Like the Harvard edition, which ignored Sterne’s black page, the Stanford work not only ignores Sterne’s divisions, but introduces new ones of its own. Its chapter 2 has no bearing on Sterne’s chapter 2 in either Volume I or any subsequent volume of the original text. This would matter little were it not that Sterne continuously refers back and forth to preceding or future pages, chapters, and books. Indeed, he even opens his second volume with an alert to his readers (and, perhaps, editors) that “I have begun a new book.” That phrase is no doubt buried mystifyingly somewhere in the first volume of the Stanford edition which is, in turn, buried mystifying somewhere in Google Books Library Project.

It is worth reminding the reader that the volumes of Tristram Shandy discussed by Duguid are public domain (PD) works harvested from the library partners in Google Book Search, and were thus undoubtedly subjected to high-volume, non-destructive scanning. Inevitably, at the scale pursued by Google, errors will be introduced in both the scanning and the OCR. I suspect these are early scans (although I have no way of knowing for sure), and I suspect that Google’s efforts have improved with practice (although that is a statement of faith).

Let us also note that, in contrast to PD books, in-print, in-copyright works submitted by publishers either arrive in a born-digital representation ready for ingest into GBS, or are submitted to be scanned destructively (their spines removed and the sheets fed automatically). This type of scanning, used by all high-volume digitization operations when works are not unique, introduces far fewer errors. Tellingly, the books most at contention between publishers and Google (that is, in-copyright, out-of-print books) are almost certain to be obtained via libraries, and are thus subject to the same processing infidelity that afflicts Tristram Shandy. Whether their condition as represented online will be valued commensurately in court, or in settlement, we will have to wait and see.

However, Duguid’s analysis of Google Book Search goes far deeper than a consideration of the cosmetic defects of the books’ electronic skin. Rather, he recognizes that faults lurk so visibly because Google is throwing away information that is fundamentally characteristic of books: metadata that describes and even determines what books are, whether as simple and seemingly trivial as volume numbers, or artifacts of type design, editing, and artistic production. Books are not, in other words, mere bags of words, but vehicles in which ride a wide assortment of other passengers: metadata, artistic expression, whimsy, and error. Books are born and produced in a rich organizational, social, and economic context, and the willful discarding of that context carries with it a loss whose surface manifestation may be amusing, but whose deeper ramifications are profoundly disturbing. Duguid concludes:

The Google Books Project is no doubt an important, in many ways invaluable, project. It is also, on the brief evidence given here, a highly problematic one. Relying on the power of its search tools, Google has ignored elemental metadata, such as volume numbers. The quality of its scanning (and so we may presume its searching) is at times completely inadequate [ref]. The editions offered (by search or by sale) are, at best, regrettable. Curiously, this suggests to me that it may be Google’s technicians, and not librarians, who are the great romanticisers of the book. Google Books takes books as a storehouse of wisdom to be opened up with new tools. They fail to see what librarians know: books can be obtuse, obdurate, even obnoxious things. As a group, they don’t submit equally to a standard shelf, a standard scanner, or a standard ontology. Nor are their constraints overcome by scraping the text and developing search algorithms. Such strategies can undoubtedly be helpful, but in trying to do away with fairly simple constraints (like volumes), these strategies underestimate how a book’s rigidities are often simultaneously resources deeply implicated in the ways in which authors and publishers sought to create the content, meaning, and significance that Google now seeks to liberate. Even with some of the best search and scanning technology in the world behind you, it is unwise to ignore the bookish character of books. More generally, transferring any complex communicative artifacts between generations of technology is always likely to be more problematic than automatic.

Ultimately, whether or not Google Book Search is a useful tool will hinge in no small part on the ability of its engineers to cultivate among themselves a more thorough, and less alchemical, appreciation for the materials they are attempting to transmute from paper to gold.