The Google Exchange

Paul Duguid’s recent article in First Monday on the worth and merits of Google’s Book digitization drew a range of comment and criticism.

The most insightful and valuable subsequent exchange to my knowledge was that occurring between Paul himself and Patrick Leary, the author of the well-known article, “Googling the Victorians” [pdf]. Patrick is also the editor of SHARP-L, a listserv dedicated to the history of books, authorship and publishing (SHARP stands for the Society for the History of Authorship, Reading, and Publishing). The catalyst for the conversation was on-list, with the remainder of it off; both Paul and Patrick have granted me permission to reproduce it here. The messages are only very lightly edited for misspellings.

This is a forthright exchange between two brilliant, deeply penetrating scholars, and it well encapsulates some of the most significant conundrums raised by the Google Book Search project. As someone who has written and thought about this quite extensively, I think this is a seminal exchange.

It may prove useful to first scan Paul Duguid’s First Monday article, or my reference to it.


Patrick writes: [posted on the SHARP-L list]

I find Duguid’s article perversely wrongheaded. Judging the potential of Google Books, or any mass digitization project, as a tool for the study of book history by singling out one public domain literary text — Tristram Shandy — out of millions and critiquing its “quality” as an object of study is just silly. (Typical of the kind of silliness that runs throughout the essay is his contention that the Harvard bookplate on the inside cover of one 1893 edition marks it out as promising “an inheritance of quality.”) One might as well judge the usefulness of a national census by picking out the entry for one famous person and noting all the ways in which a full-scale biography would be much more informative. Having noted two older editions’ manifold inadequacies, along with their badly scanned pages, the lack of metadata, etc., Duguid reaches the momentous conclusion that Google Books, like Project Gutenberg, is not a good place to go to gain access to a reliable modern scholarly edition of Tristram Shandy. No kidding. I think we knew that already.

Exercises in bibliographical fastidiousness of this kind are cheap, easy, and utterly pointless where such projects are concerned. For all of its many and often infuriating faults (a serious critique of which would have made a useful article), Google Books is a tool for extensive research across a populous universe of corrupt texts, not a tool for intensive study of one typographically complex literary classic. Book historians will not be using Google Books and its ilk to analyze Sterne’s uses of typography in Tristram Shandy. But they will be using it to find, say, references to the novel inside a wide range of memoirs, biographies, works of criticism, and periodicals, to many of which they would not otherwise have access, and in many of which they might never otherwise have thought to look.

Paul writes:

Patrick:

Your message was forwarded to me and I hope you’ll excuse a brief reply. I’m sorry to have aroused the ire of someone whose work I admire.

I understand your distaste and take your comments to heart, but I would like to clarify a couple of things I did and didn’t say.

I was not, after all, “judging the potential of Google Books … as a tool for the study of book history.” To do that would, as you say, be “silly”. Now I may be silly, but in this case, “tool[s] for the study of book history” are your interest, and you do it very well. In this particular instance, they are not mine. Indeed, one of my central points is that scholars can use offerings like Google Books or Project Gutenberg (PG) with relative ease. We know how to read between the lines, into the gutter, and across blank pages. But both Google Books and PG tend to be offered piously to the ordinary reader, in some cases explicitly disdaining the scholar. My argument is that these are not good tools for the ordinary reader and the scholar is pretty well served already. To attempt your sort of grand metaphor, this kind of humbug reminds me of the Bush tax cuts: everything for the rich (the Matthew effect), but the poor are told it’s all in their interest. They should wait for Google to trickle down.

Consequently, I also didn’t intend to argue that “Google Books, like Project Gutenberg, is not a good place to go to gain access to a reliable modern scholarly edition of Tristram Shandy. No kidding. I think we knew that already”. That is your own interpretation–no doubt a flaw in my writing–but it was not my argument. Again, I was acting in the name of the ordinary reader, not the scholar. While I understand why my work might be an object of sarcasm, I trust the ordinary reader is not — on SHARP-L of all places. And I do think the Harvard bookplate resonates with ordinary readers–though clearly not with yourself. The books that Google found in Harvard or Oxford (particularly bad nineteenth century texts) could easily be found elsewhere. I find it hard to believe that Google went after these names for anything but the prestige they lent to Google and its collection in the eye of the general reader.

Of course, I confess that to single out a single book is indeed tendentious, but as I tried to say, without catalogue, index, metadata or any other way of getting a grasp on Google Books as a whole, there are few other ways to grasp this particular beast. I acknowledge the methodological flaw now, as I did in the article.

However problematic the method is, I don’t think I merely found the equivalent to an error in a single census entry. (I did, after all, capture at least 2 entries, but I don’t wish to provoke you again with such a silly reply.) Yet the method did reveal some surprises. On the one hand, it showed how sloppy the scanning is in this high profile project (a little more on that below). On the other, to my surprise, it showed that Google doesn’t even distinguish volumes of a particular work. That struck me as remarkable, and seems as true for the _Decline and Fall_, or any other multi-volume work in the Google project. Now you might respond “No kidding. I think we knew that already”, but to me and to everyone I have spoken to so far, such a low level flaw was new, surprising, and indicative of profound problems with the Google texts.

To cross over to the scholarly side, I’m a little hesitant to accept your enthusiasm for Google as a research tool. Without any reliable dating, and no easy search restrictions, it is incomparably weaker, it seems to me, than many of the robust text collections like Early English Books Online (EEBO), etc. Unless Google is willing to come up to their level of quality, I am hesitant to generalize from any searches done across its collection. But here I defer to you and your skills on this.

While my work may be “cheap” (and “easy” and even “pointless”; to these I must confess, though “fastidiousness” has never been one of my personal sins), one of my conclusions is that Google is sucking money and effort out of libraries, but is producing a library for the people that, apart from being hedged around with curious legal restrictions, is, in the quality of its work, remarkably cheap. With no evident control over scans and little willingness to repair errors when they are pointed out, it is not so cheap as to deserve a free pass. Furthermore, because it is indeed in potential a marvelous scholarly tool (well, a reasonable one might be fairer; its flaws at the moment make any broad generalizations about the “populous universe of corrupt texts” subject to the corruptions Google introduces itself), I think it deserves scrutiny. As you say, “that would have made a useful article”. I hope you or someone with similar enthusiasm and skills will take it on. Mine was a brief and relatively light-hearted attempt to raise some questions, particularly around the piety that attaches itself to the Google project but goes under the scholarly radar, because our self-interest too often trumps the general interest. I’m sorry it evidently lacked a “momentous conclusion” and also that, on a personal level, it so clearly annoyed you.

Best wishes and continuing admiration for your work on SHARP.

Patrick writes:

Dear Paul,

Thanks for your note. Let me apologize at once for the offensive crankiness of my post to SHARP-L this morning. There’s no excuse for my having been so damned irritable over what is simply a disagreement about the pros and cons of this tool. I had just stayed up half the night to put the finishing touches on a difficult project on which Google Books had been a real help with literally dozens of queries, and I’m afraid that reading your condemnation of GB at that particular moment was very unwise. I let it get under my skin — which was, to recall a [phrase] I now regret, just plain silly.

Now on to the substance of that disagreement, and I’ll try to emulate you in keeping this brief. I’m puzzled by your purported championing of the needs of the “ordinary reader,” whose need to do a close reading of Tristram Shandy off of a computer monitor seems to me distantly hypothetical at best. Surely it is that reader who is already well served by inexpensive, reliable printed texts of the novel. By contrast, that same reader will have had little or no opportunity even to look at, let alone to thoroughly explore, tens of thousands of lesser-known books and periodicals, many of which were until now entirely unavailable except to faculty, staff, and students affiliated with major universities and their research libraries.

Mass digitization is all about trade-offs. All mass digitizing programs compromise textual accuracy and bibliographical meta-data so that they can afford to include many more texts at a reasonable cost in money and time. All texts in mass digitization collections are corrupt to some degree. Everything else being equal, the more limited the number of texts included in a digital collection, the more care can be lavished on each text. Assessing the balance of value involved in this trade-off, I think, is one of the main places where we part company. You conclude, on the basis of your inspection of these two volumes, that the corruption of texts like Tristram Shandy makes Google Books a “highly problematic” way of getting at the meanings of the books it includes. By contrast, while acknowledging how unfortunate are some of the problems you mention, I believe that the sheer scale of the project and the power of its search function together far outweigh these “problematic” elements.

What neither of us knows for sure is exactly what this trade-off actually entailed in the case of Google Books. Any project this big must obviously be highly automated; it simply can’t afford to spend much time on any individual text. What degree of quality control, had it been imposed, would have made the project unacceptably slow and costly, and therefore simply undoable, is something we don’t yet know. You mention that EEBO is a much more robust and useful tool for the texts it includes, and of course you’re quite right. EEBO integrates generations of specialist scholarly labour and was able to use material that had already been microfilmed; it has made ProQuest millions of dollars of profit in subscription fees. Google Books hasn’t cost anybody a cent (except, of course, Google), and it will be incomparably larger and more varied. Its texts are corrupt as hell. But there are a lot of them.

And here again we come back to the uses to which Google Books may be put. Your article avoids addressing this issue directly, but its method implies that GB’s usefulness is to be judged by the meanings that can be gleaned from reading one text at a time. But GB isn’t about one book, it’s about searching simultaneously across many thousands, and this is what makes it so very different from Project Gutenberg. I wholeheartedly agree with you about Gutenberg, which I have long thought represents a monumental waste of time and effort, occasioned by its open hostility to any scholarly judgment about the selection, organization, and presentation of texts; with about the same amount of effort, they could have done something worthwhile, but they chose not to. But what makes many of these same problems largely irrelevant to GB is the power of the search. Although it is certainly offensive to find badly scanned pages or crude editions of a literary classic, for searching across many books this sort of thing simply doesn’t matter very much. Missing the famous black page in a particular edition of Tristram Shandy may be a shame, but mangling a few pages in the middle of a rambling Victorian political memoir, or in an odd volume of the Publishers’ Circular, while regrettable, isn’t going to have much impact on the kinds of queries that Google Books is uniquely able to address.

You may be interested to know, if you haven’t already heard, that Perry Willett and his colleagues at Michigan have been working to implement a catalogue interface called [MBooks, in] Mirlyn that provides meta-data for Google Books volumes contributed from their collections. I suspect that similar efforts are under way elsewhere, though Michigan is well placed to make the most of them.

I’ve been obliged to set various issues aside–like the “inheritance” notion of yours, and the matter of text selection–for the sake of brevity, but find that I’ve already rambled on longer than I intended. Thanks for so patiently addressing my splenetic comments of this morning; the issues involved in this debate are engrossing ones, and I look forward to continuing to explore them, and your views, in the future.

Best wishes,

Patrick

Paul writes:

Patrick:

Thanks for yours. I think we simply disagree, but there’s nothing wrong with that. Google’s undoubtedly a useful tool, but I’m less sanguine about its usefulness than you are. The argument is, I suppose, about at what point quantity trumps quality (my version of the trade-offs you mention). As I teach a course on the quality of information, I tend to resist quantitative arguments, perhaps because economists always win those.

But, a friend of many Googlers from my days at PARC, I’m particularly sensitive to Google hype. Google is pushing this out as a library, a public library, for people to find books, not for scholars to do research. Again, you put words in my mouth when you talk about the ordinary reader who needs “to do a close reading of Tristram Shandy off a monitor”. The ordinary reader today is besieged with stories that most books can be found online. I didn’t think it close reading to suggest that they ought to be able to identify the first word or the first volume. Neither strikes me as a case for close reading.

By ranking returns like Google searches, Google also implies that the first book on the list is the best book in response to your search, which is why it’s surprising when it turns out to be an unmarked volume 2. (I’m also less sanguine than you about “the power of the search” in Google.)

So their popular and populist claim is, to my mind, humbug. Moreover, it’s damaging humbug, squeezing other efforts, pushing quantity over quality, and limiting what libraries that cooperate may do with this stuff. So my goal was not to “avoid addressing the issue” of what uses GB may be put to, but to address one particular issue. Probably not an issue for the scholars of SHARP, but one, I felt, for the populist audience of First Monday, who tend to believe in online, open, accessible libraries at your fingertips.

Best wishes,

Paul

