Tim O'Reilly

Quality of Book Digitization

Juliet Sutherland of Distributed Proofreaders (which does all the quality checking and correction for Project Gutenberg) wrote a fascinating post on a publishing futures mailing list about the quality of the digitization in the various projects that are digitizing public domain (and other) books, such as the Google Book Search Library project. I asked Juliet if I could post her comments here as a guest blog. Over to Juliet:

Something I've wondered about and haven't seen discussed anywhere is the need (or lack thereof) for quality, mostly in the sense of completeness, in the mass digitization programs. The volunteers at DP report many missing pages in the Google books, particularly near illustrations, and note that none of the Google books make color scans of illustrations available. For searching to find a book, these things probably make little difference. But for using a book as either a research resource or just to read for fun, they matter a lot. I am, of course, referring only to PD books. Quality on newer books may well be much better.
Our experience is that finding and fixing missing pages after the fact takes a lot of effort. We've experienced this with our own volunteers who scan books and with the various scanning projects associated with the Internet Archive (although they have improved dramatically, and their current projects have far fewer problems with missing pages, bad scans, poor cropping, etc. than they used to), as well as with Google. For us, finding missing or bad pages involves figuring out which libraries have physical copies of the exact edition used, then either finding one of our volunteers who has access to one of those libraries or who can request a copy through Interlibrary Loan, actually getting and scanning the needed pages, and then integrating them back into our workflow. This process should be somewhat simpler when the necessary copy of the book is known to be available. On the other hand, that kind of error handling in a large scanning operation can add a huge amount of overhead in retrieving and reprocessing.

Our experience with various (book page) image archives suggests that those archives associated with libraries are usually reasonably good. Books from Gallica and Canadiana are rarely missing pages, although some of the older scans are of quite low quality and are thus useless for our purposes. The various archives associated with the University of Michigan and with Cornell (MoA and others) also seem to be reasonably complete, with only occasional problems with missing/bad pages (although the lack of color plates in some of the agricultural books is a real shame). The same is true of Kentuckiana (from the University of Kentucky). The Internet Archive's Canadian Libraries project and, more recently, their American Libraries collection are also quite complete, with the added advantage that they provide color versions of the illustrations.

For those who are interested, a complete list of the book image archives that we (DP) know about can be found in our wiki under Sources for Scan Harvesting. You don't need a DP account to read pages in our wiki, although you must have one to edit. We may or may not have actually used images from any one of these archives, and there is very little discussion there about things like quality issues. Nonetheless, it's quite a list, and constantly growing.

If I were in charge of setting up a new, large, book image archive associated with a major library, one of the things that I'd be quite concerned about is the completeness and usability of the images provided. Yet I see little, if any, discussion about this in the context of current mass digitization efforts. What prompted this whole message was reading the University of Michigan announcement, and then picturing the disappointment down the road when missing illustrations and missing text are discovered. Google has proven to be very responsive, so I am sure that the completeness/quality issues can be solved. But not if they are never mentioned.

I'm with Juliet. This is an important issue that needs to get on the radar. Google and others would do well to talk to DP, as they have more experience in this space than anyone else. They are also a fascinating early instance of the approach to harnessing redundant arrays of humans for large tasks that Amazon has popularized with the Mechanical Turk, and that Google is now exploring with the Google Image Labeler.

Comments: 9

  Henrik [09.06.06 06:40 AM]

The obvious solution is to present the scan along with the OCR output. Allow the users to submit corrections in a Wikipedia-like way, and to vote for their correctness in a Digg-like way.

  Tim O'Reilly [09.06.06 02:43 PM]

Henrik -- reviewing the OCR and the scan together is exactly what Distributed Proofreaders does.

  anjan bacchu [09.06.06 05:11 PM]


Is Google going to involve volunteers in this? Currently, they involve volunteers for translation and other stuff.

  Winfried Helge Pelz [09.08.06 03:00 AM]

To get an impression of the quality of the Google project's work, download the PDF file or leaf through the first 12 pages of

Eugenio Cappelletti Vocabolario Milanese-Italiano-Francese published 1848 at

I have sampled around 100 other books, all rife with the same types of defects, although not at such extremely high rates.
The Google people are about to ruin the reputation of the highly esteemed American academic institutions (U Michigan Ann Arbor, Stanford, Harvard, UC).

  bill landis [09.12.06 11:11 AM]

JSTOR, which has been committed to completeness in its digitization of scholarly periodicals, presents an interesting model for the various current mass digitization projects involving monographic publications. They've been dealing with assessing completeness and finding replacement pages for about a decade now.

  Bill Tozier [09.16.06 10:57 AM]

JSTOR, like other private archiving and publishing companies, is not free.

Distributed Proofreaders produces our works for free, for free redistribution.

  Dr Klaus Graf [11.06.08 01:22 PM]

makes available books only with scans in a Wikimedia wiki. You don't have to register to proofread the pages.

  Jim Campbell [11.07.08 08:05 AM]

Google does have the advantage of redundancy: for many titles they have multiple scans made from different copies at different times. I don't usually follow up when I "provide feedback" on errors (I tried at first, but there are a lot of errors and correction comes slowly), but in a couple of cases I've seen that a bad scan was simply replaced by a better one.

The overall quality has definitely improved in the last year or so.

