Quality of Book Digitization

Juliet Sutherland of Distributed Proofreaders (which does all the quality checking and correction for Project Gutenberg) wrote a fascinating post on a publishing-futures mailing list about the quality of digitization in the various projects that are digitizing public domain (and other) books, such as the Google Book Search Library project. I asked Juliet if I could post her comments here as a guest blog. Over to Juliet:

Something I’ve wondered about and haven’t seen discussed anywhere is the need (or lack thereof) for quality, mostly in the sense of completeness, in the mass digitization programs. The volunteers at DP report many missing pages in the Google books, particularly near illustrations, and also report that none of the Google books make color scans of any illustrations available. For the purpose of finding a book through search, these things probably make little difference. But for using a book either as a research resource or just to read for fun, they matter a lot. I am, of course, referring only to PD books. Quality on newer books may well be much better.

Our experience is that finding and fixing missing pages after the fact takes a lot of effort. We’ve seen this with our own volunteers who scan books, with the various scanning projects associated with the Internet Archive (although they have improved dramatically, and their current projects have far fewer problems with missing pages, bad scans, poor cropping, and the like than they used to), and with Google. For us, finding missing or bad pages involves figuring out which libraries have physical copies of the exact edition used, then either finding one of our volunteers who has access to one of those libraries or who can request a copy through Interlibrary Loan, actually getting and scanning the needed pages, and then integrating them back into our workflow. This process should be somewhat simpler when the necessary copy of the book is known to be available. On the other hand, that kind of error handling in a large scanning operation can add a huge amount of overhead in retrieving and reprocessing.

Our experience with various (book page) image archives suggests that those archives associated with libraries are usually reasonably good. Books from Gallica and Canadiana are rarely missing pages, although some of the older scans are of quite low quality and are thus useless for our purposes. The various archives associated with the University of Michigan and with Cornell (MoA and others) also seem to be reasonably complete, with only occasional problems with missing or bad pages (although the lack of color plates in some of the agricultural books is a real shame). The same is true of Kentuckiana (from the University of Kentucky). The Internet Archive’s Canadian Libraries project and, more recently, their American Libraries collection are also quite complete, with the added advantage that they provide color versions of the illustrations.

For those who are interested, a complete list of the book image archives that we (DP) know about can be found in our wiki under Sources for Scan Harvesting. You don’t need a DP account to read pages in our wiki, although you must have one to edit. We may or may not have actually used images from any one of these archives, and there is very little discussion there about things like quality issues. Nonetheless, it’s quite a list, and it is constantly growing.

If I were in charge of setting up a new, large book image archive associated with a major library, one of the things I’d be quite concerned about is the completeness and usability of the images provided. Yet I see little, if any, discussion of this in the context of current mass digitization efforts. What prompted this whole message was reading the University of Michigan announcement, and then picturing the disappointment down the road when missing illustrations and missing text are discovered. Google has proven to be very responsive, so I am sure that the completeness/quality issues can be solved. But not if they are never mentioned.

I’m with Juliet. This is an important issue that needs to get on the radar. Google and others would do well to talk to DP, as they have more experience in this space than anyone else. They are also a fascinating early instance of the approach of harnessing redundant arrays of humans for large tasks that Amazon has popularized with the Mechanical Turk, and that Google is now exploring with the Google Image Labeler.