Thu, Aug 23, 2007

Peter Brantley

The Google Exchange

Paul Duguid's recent article in First Monday on the worth and merits of Google's book digitization effort drew a range of comment and criticism.

The most insightful and valuable subsequent exchange, to my knowledge, was the one between Paul himself and Patrick Leary, the author of the well-known article "Googling the Victorians" [pdf]. Patrick is also the editor of SHARP-L, a listserv dedicated to the history of books, authorship, and publishing (SHARP = the Society for the History of Authorship, Reading, and Publishing). The catalyst for the conversation was on-list; the remainder of it took place off-list. Both Paul and Patrick have granted me permission to reproduce it here. The messages are only very lightly edited for misspellings.

This is a forthright exchange between two brilliant, deeply penetrating scholars, and it well encapsulates some of the most significant conundrums raised by the Google Book Search project. As someone who has written and thought about this quite extensively, I think this is a seminal exchange.

It may prove useful to first scan Paul Duguid's First Monday article, or my reference to it.


Patrick writes: [posted on the SHARP-L list]

I find Duguid's article perversely wrongheaded. Judging the potential of Google Books, or any mass digitization project, as a tool for the study of book history by singling out one public domain literary text -- Tristram Shandy -- out of millions and critiquing its "quality" as an object of study is just silly. (Typical of the kind of silliness that runs throughout the essay is his contention that the Harvard bookplate on the inside cover of one 1893 edition marks it out as promising "an inheritance of quality.") One might as well judge the usefulness of a national census by picking out the entry for one famous person and noting all the ways in which a full-scale biography would be much more informative. Having noted two older editions' manifold inadequacies, along with their badly scanned pages, the lack of metadata, etc., Duguid reaches the momentous conclusion that Google Books, like Project Gutenberg, is not a good place to go to gain access to a reliable modern scholarly edition of Tristram Shandy. No kidding. I think we knew that already.

Exercises in bibliographical fastidiousness of this kind are cheap, easy, and utterly pointless where such projects are concerned. For all of its many and often infuriating faults (a serious critique of which would have made a useful article), Google Books is a tool for extensive research across a populous universe of corrupt texts, not a tool for intensive study of one typographically complex literary classic. Book historians will not be using Google Books and its ilk to analyze Sterne's uses of typography in Tristram Shandy. But they will be using it to find, say, references to the novel inside a wide range of memoirs, biographies, works of criticism, and periodicals, to many of which they would not otherwise have access, and in many of which they might never otherwise have thought to look.


Paul writes:

Patrick:

Your message was forwarded to me and I hope you'll excuse a brief reply. I'm sorry to have aroused the ire of someone whose work I admire.

I understand your distaste and take your comments to heart, but I would like to clarify a couple of things I did and didn't say.

I was not, after all, "judging the potential of Google Books ... as a tool for the study of book history." To do that would, as you say, be "silly". Now I may be silly, but in this case, "tool[s] for the study of book history" are your interest, and you do it very well. In this particular instance, they are not mine. Indeed, one of my central points is that scholars can use offerings like Google Books or Project Gutenberg (PG) with relative ease. We know how to read between the lines, into the gutter, and across blank pages. But both Google Books and PG tend to be offered piously to the ordinary reader, in some cases explicitly disdaining the scholar. My argument is that these are not good tools for the ordinary reader and the scholar is pretty well served already. To attempt your sort of grand metaphor, this kind of humbug reminds me of the Bush tax cuts: everything for the rich (the Matthew effect), but the poor are told it's all in their interest. They should wait for Google to trickle down.

Consequently, I also didn't intend to argue that "Google Books, like Project Gutenberg, is not a good place to go to gain access to a reliable modern scholarly edition of Tristram Shandy. No kidding. I think we knew that already". That is your own interpretation--no doubt a flaw in my writing--but it was not my argument. Again, I was acting in the name of the ordinary reader, not the scholar. While I understand why my work might be an object of sarcasm, I trust the ordinary reader is not -- on SHARP-L of all places. And I do think the Harvard bookplate resonates with ordinary readers--though clearly not with yourself. The books that Google found in Harvard or Oxford (particularly bad nineteenth century texts) could easily be found elsewhere. I find it hard to believe that Google went after these names for anything but the prestige they lent to Google and its collection in the eye of the general reader.

Of course, I confess that to single out a single book is indeed tendentious, but as I tried to say, without catalogue, index, metadata or any other way of getting a grasp on Google Books as a whole, there are few other ways to grasp this particular beast. I acknowledge the methodological flaw now, as I did in the article.

However problematic the method is, I don't think I merely found the equivalent of an error in a single census entry. (I did, after all, capture at least 2 entries, but I don't wish to provoke you again with such a silly reply.) Yet the method did reveal some surprises. On the one hand, it showed how sloppy the scanning is in this high-profile project -- a little more on that below. On the other, to my surprise, it showed that Google doesn't even distinguish volumes of a particular work. That struck me as remarkable, and seems as true for the _Decline and Fall_, or any other multi-volume work in the Google project. Now you might respond "No kidding. I think we knew that already", but to me and to everyone I have spoken to so far, such a low level flaw was new, surprising, and indicative of profound problems with the Google texts.

To cross over to the scholarly side, I'm a little hesitant to accept your enthusiasm for Google as a research tool. Without any reliable dating, and no easy search restrictions, it is incomparably weaker, it seems to me, than many of the robust text collections like Early English Books Online (EEBO), etc. Unless Google is willing to come up to their level of quality, I am hesitant to generalize from any searches done across its collection. But here I defer to you and your skills on this.

While my work may be "cheap" (and "easy" and even "pointless" -- to these I must confess; "fastidiousness", though, has never been one of my personal sins), one of my conclusions is that Google is sucking money and effort out of libraries, but is producing a library for the people that, apart from being hedged around with curious legal restrictions, is, in the quality of its work, remarkably cheap. With no evident control over scans and little willingness to repair errors when they are pointed out, it is not so cheap as to deserve a free pass. Furthermore, because it is indeed in potential a marvelous scholarly tool -- well, a reasonable one might be fairer; its flaws at the moment make any broad generalizations about the "populous universe of corrupt texts" subject to the corruptions Google introduces itself -- I think it deserves scrutiny. As you say, "that would have made a useful article". I hope you or someone with similar enthusiasm and skills will take it on. Mine was a brief and relatively light-hearted attempt to raise some questions, particularly around the piety that attaches itself to the Google project but goes under the scholarly radar, because our self-interest too often trumps the general interest. I'm sorry it evidently lacked a "momentous conclusion" and also that, on a personal level, it so clearly annoyed you.

Best wishes and continuing admiration for your work on SHARP.


Patrick writes:

Dear Paul,

Thanks for your note. Let me apologize at once for the offensive crankiness of my post to SHARP-L this morning. There's no excuse for my having been so damned irritable over what is simply a disagreement about the pros and cons of this tool. I had just stayed up half the night to put the finishing touches on a difficult project on which Google Books had been a real help with literally dozens of queries, and I'm afraid that reading your condemnation of GB at that particular moment was very unwise. I let it get under my skin -- which was, to recall a [phrase] I now regret, just plain silly.

Now on to the substance of that disagreement, and I'll try to emulate you in keeping this brief. I'm puzzled by your purported championing of the needs of the "ordinary reader," whose need to do a close reading of Tristram Shandy off of a computer monitor seems to me distantly hypothetical at best. Surely it is that reader who is already well served by inexpensive, reliable printed texts of the novel. By contrast, that same reader will have had little or no opportunity even to look at, let alone to thoroughly explore, tens of thousands of lesser-known books and periodicals, many of which were until now entirely unavailable except to faculty, staff, and students affiliated with major universities and their research libraries.

Mass digitization is all about trade-offs. All mass digitizing programs compromise textual accuracy and bibliographical meta-data so that they can afford to include many more texts at a reasonable cost in money and time. All texts in mass digitization collections are corrupt to some degree. Everything else being equal, the more limited the number of texts included in a digital collection, the more care can be lavished on each text. Assessing the balance of value involved in this trade-off, I think, is one of the main places where we part company. You conclude, on the basis of your inspection of these two volumes, that the corruption of texts like Tristram Shandy makes Google Books a "highly problematic" way of getting at the meanings of the books it includes. By contrast, while acknowledging how unfortunate are some of the problems you mention, I believe that the sheer scale of the project and the power of its search function together far outweigh these "problematic" elements.

What neither of us knows for sure is exactly what this trade-off actually entailed in the case of Google Books. Any project this big must obviously be highly automated; it simply can't afford to spend much time on any individual text. What degree of quality control, had it been imposed, would have made the project unacceptably slow and costly, and therefore simply undoable, is something we don't yet know. You mention that EEBO is a much more robust and useful tool for the texts it includes, and of course you're quite right. EEBO integrates generations of specialist scholarly labour and was able to use material that had already been microfilmed; it has made ProQuest millions of dollars of profit in subscription fees. Google Books hasn't cost anybody a cent (except, of course, Google), and it will be incomparably larger and more varied. Its texts are corrupt as hell. But there are a lot of them.

And here again we come back to the uses to which Google Books may be put. Your article avoids addressing this issue directly, but its method implies that GB's usefulness is to be judged by the meanings that can be gleaned from reading one text at a time. But GB isn't about one book, it's about searching simultaneously across many thousands, and this is what makes it so very different from Project Gutenberg. I wholeheartedly agree with you about Gutenberg, which I have long thought represents a monumental waste of time and effort, occasioned by its open hostility to any scholarly judgment about the selection, organization, and presentation of texts; with about the same amount of effort, they could have done something worthwhile, but they chose not to. But what makes many of these same problems largely irrelevant to GB is the power of the search. Although it is certainly offensive to find badly scanned pages or crude editions of a literary classic, for searching across many books this sort of thing simply doesn't matter very much. Missing the famous black page in a particular edition of Tristram Shandy may be a shame, but mangling a few pages in the middle of a rambling Victorian political memoir, or in an odd volume of the Publisher's Circular, while regrettable, isn't going to have much impact on the kinds of queries that Google Books is uniquely able to address.

You may be interested to know, if you haven't already heard, that Perry Willett and his colleagues at Michigan have been working to implement a catalogue interface called [MBooks, in] Mirlyn that provides meta-data for Google Books volumes contributed from their collections. I suspect that similar efforts are under way elsewhere, though Michigan is well placed to make the most of them.

I've been obliged to set various issues aside--like the "inheritance" notion of yours, and the matter of text selection--for the sake of brevity, but find that I've already rambled on longer than I intended. Thanks for so patiently addressing my splenetic comments of this morning; the issues involved in this debate are engrossing ones, and I look forward to continuing to explore them, and your views, in the future.

Best wishes,

Patrick


Paul writes:

Patrick:

Thanks for yours. I think we simply disagree, but there's nothing wrong with that. Google's undoubtedly a useful tool, but I'm less sanguine about its usefulness than you are. The argument is, I suppose, about at what point quantity trumps quality (my version of the trade-offs you mention). As I teach a course on the quality of information, I tend to resist quantitative arguments -- perhaps because economists always win those.

But, as a friend of many Googlers from my days at PARC, I'm particularly sensitive to Google hype. Google is pushing this out as a library, a public library, for people to find books, not for scholars to do research. Again, you put words in my mouth when you talk about the ordinary reader who needs "to do a close reading of Tristram Shandy off a monitor". The ordinary reader today is besieged with stories that most books can be found online. I didn't think it close reading to suggest that they ought to be able to identify the first word or the first volume. Neither strikes me as a case for close reading.

By ranking book returns as it does Web searches, Google also implies that the first book on the list is the best book in response to your search, which is why it's surprising when it turns out to be an unmarked volume 2. (I'm also less sanguine than you about "the power of the search" in Google.)

So their popular and populist claim is, to my mind, humbug. Moreover, it's damaging humbug, squeezing other efforts, pushing quantity over quality, and limiting what the libraries that cooperate may do with this stuff. So my goal was not to "avoid addressing the issue" of what uses GB may be put to, but to address one particular issue. Probably not an issue for the scholars of SHARP, but one, I felt, for the populist audience of First Monday, who tend to believe in online, open, accessible libraries at your fingertips.

Best wishes,

Paul



Comments: 41

  Kevin Kelly [08.23.07 04:14 PM]

Peter,

Wonderful exchange. Is Google Books more like a library (Paul), or more like a search catalog (Patrick)? I find Patrick closer to the mark, in that Google Books is a hybrid between a catalog and a library, or what I think of as a networked universal text. Its value lies in the collective, even if the individual units are imperfect -- and they are very imperfect. The new text is searchable, actionable, and animated in a way that libraries, books, and even catalogs in the past never were.

The only thing that might waylay this innovation is if Paul's assertion that we'll have only one chance at digitizing books were true. A new meme is circulating, which darkly hints that once Google scans the seven libraries, those books will never be scanned again. This prospect is so unlikely on so many levels one hardly knows where to begin to argue with it.

With the costs of scanning a book dropping so fast, it is trivial to re-scan problematic ones (discovered by careful readers). And usually the whole book does not need to be re-scanned, just certain pages. It is not hard to imagine readers (a la Wikipedia) re-scanning needed pages. There might even be search-and-rescue teams who seek out damaged pages and fix them. If there is any true value in having the books perfect, there will be those who make it so, if they are free to do so. I can't imagine any cultural or technological impediment to re-scanning imperfect pages, nor can I recall any historical precedent of something like this that was done only once and never again.

On the contrary, it is easier to imagine many MORE people scanning the same books again *because* Google proved that they had more value scanned than most had believed before. Look what happened to maps.

The beauty of the scale of the Google Books project is that this massive new tool continues to improve as more texts are entered, and its rapidly expanding value reduces the tiny nuisances of individual problematic books. Even if you are adding books only 80% perfect, their networked value is increasing almost exponentially, so their sum of good vastly overwhelms the sum of their faults.

The idea that excellence can come from a bunch of imperfect parts interlinked is very counterintuitive and is the reason why the virtues of the Wikipedia, Flickr, and even the web itself are so hard to appreciate at first. It seems impossible, or at best utopian. But in the end their value is proved by users, who keep using it, even though it should not work in theory.

I predict the same for Google Books. It is built upon a million books imperfectly, maybe horribly, scanned. Alone each one is untrustworthy. But interlinked, hypertexted, and connected into one vast text, these fused imperfect texts will be so useful that we won't be able to live without it.

  Peter Brantley [08.23.07 05:42 PM]

Kevin -

One of the errors in Patrick's post is the claim that the scanning has come without cost to the libraries that are participating. From my exposure to the operations of these libraries, many of which are members of my organization, the Digital Library Federation, this is decidedly NOT the case. Rather, there are significant costs - not insurmountable, but not trivial.

Costs come primarily in two areas: logistics and preservation. Logistics encompasses the people required to pull items off shelves, ascertain their physical fitness for scanning, update the catalog records so the items appear checked out, and place them into book boxes to go off to Happy Scanning Stations. The process is reversed when books return. At thousands of books daily (in some cases), this is not insignificant overhead; most libraries have had to add staff to perform these functions.

The other cost is the technical infrastructure to host the copies of the works that Google is permitting the libraries to obtain; most of the participating libraries see it as their custodial responsibility to obtain a copy of this material, even if they do not build their own customized services from their contributed GBS content, as Michigan has done, with MBooks.

At any rate, I am unconvinced that these materials will be scanned again, at least by the current contributing institutions. It may be that low-grade distributed error checking by participating academics and other readers will eventually, directly (through their own contributions) or indirectly (when Google rescans material to which it can obtain access), correct the errors prevalent in the material.

Observing the institutional costs for organizations that have uncertain leeway in re-securing the support necessary to construct their own industrial scanning infrastructures and content processing systems leaves me nervous.

  Peter Brantley [08.23.07 06:17 PM]

Kevin -

Sorry, one more point, and an important one. An issue that is easily glossed over here is that Google is scanning a significant number of orphan works that are at least potentially still in copyright (although many are cloaked public domain works in actuality). Google is in litigation (in part) over the issues raised by this activity.

These orphan works are obtained from libraries contributing 1923+ materials, and are afaik indistinguishable from PD works in their scanning; in other words, they are as likely to contain errors as any other library contributed material.

Independent of the issue of scanning verifiably public domain works, then, is the issue of scanning orphans. If orphan rights legislation is not clarified, then we are reliant upon litigation or settlement to determine the feasibility of this digitization.

Litigation has unknown consequences, but one possible path towards settlement was laid out in a New Yorker article called "Google's Moon Shot" by Jeffrey Toobin. As I observed in a past blog entry written elsewhere, "Monetizing Libraries":

A settlement between Google and publishers would create a barrier to entry in part because the current litigation would not be resolved through court decision; any new entrant would be faced with the unresolved legal issues and required to re-enter the settlement process on their own terms. That, beyond the costs of mass digitization itself, is likely to deter almost any other actor in the market.

Such concern, if valid, introduces an incalculable cost, and one that might induce pause in our appreciation for this new and impressive artifact.

  Kevin Kelly [08.23.07 06:45 PM]

Peter,

Yes, I could imagine that perhaps the libraries that have already pulled books to scan them will not do it again, but they don't have to. There are a lot of other libraries in the world, often holding a subset of the same books, who would like to play a role in the emerging universal library, and what better way than to do something better than the "big" boys.

This is ignoring the problem of rare books that might be held by only one library, and for that we hope that utmost care is taken the first time. But I don't think that is what we are talking about here.

Is there any data on the question? How well do the sorting filters now work to prevent books from being scanned in duplicate among the cooperating Google libraries? How often do duplicates appear in Google Books? Have the cooperating libraries announced their intention of NOT re-scanning problematic volumes?

What happened in microfilm? What percentage of volumes were re-filmed because of lack of quality? Or re-filmed for other reasons?

Are there books that were once captured on microfiche that are now being scanned by Google? If so, that might suggest that the idea that books are 'done' once is bogus.

That's my assumption until more evidence is offered for the speculation that "books are scanned only once."

  Paul Duguid [08.23.07 06:50 PM]

Kevin:

You say you can't "recall any historical precedent of something like this that was done only once and never again." As I suggested in the original article (note 2), some people have suggested that this is more or less what happened with microfilm. We got stuck for a remarkably long time with a retrograde technology poorly implemented because the start-up (and opportunity) costs and barriers to entry, once the original microfilms and microforms were in place, made doing it all again prohibitive. As Peter's responses suggest, there are various ways, direct and indirect, intentional and unintentional, for barriers to entry to rise around Google Books, should it remain uninterested in calls for openness.

  Kevin Kelly [08.23.07 07:05 PM]

Responding to Peter's point about a possible Google settlement, as suggested by Jeffrey Toobin in his New Yorker article.

I acknowledge that if Google and the publishers settle their suit out of court -- which probably means Google will share revenue for search results with publishers -- this MIGHT discourage other players like Google from financing a massive book digitization project. If that reluctance did materialize, it would tilt the field toward a monopoly in digital book scanning.

While this could have many adverse effects, I don't think that the quality of the scans would be one of them. One of the advantages the old AT&T monopoly bestowed on us was a standard of quality; that in fact was the bargain: we'll make phone calls great if you let us keep a monopoly.

The issue in the case of a Google monopoly would become, as it was in AT&T days, not quality and usefulness, but accessibility. Can anyone else, including the public and competitors, get to play in this wonderful and useful library?

This is a separate fear from the concern about quality stated in First Monday, but it may in fact underlie that unreasonable fear.

  Kevin Kelly [08.23.07 07:12 PM]

Paul writes:

"once the original microfilms and microforms were in place, made doing it all again prohibitive."

But if we are now scanning books that were once microfilmed, we are doing it again.

The next time we scan books may not be with the same technology we have today. But I am willing to bet (see our Long Bets site) that every book that Google is scanning today will be 're-scanned' in some fashion in this century.

Anyone want to take me up on it?

  Peter Brantley [08.23.07 08:23 PM]

Kevin -

I am not sure how much coordination there has been by Google on the digitization of books across libraries; OCLC is slowly creating records for digital books that will eventually appear in their WorldCat database. I do know that for public domain works, the book that ultimately appears at GBS is often a composite of several library scans (e.g., there are instances where you can see "Harvard" on the title page, and "Michigan" as a running gutter on subsequent pages). I assume this is done for orphan materials as well. It is not up to the libraries participating in GBS to conduct coordination among their collections; selection is a prerogative largely reserved for Google, by the contracts that I have seen. (Libraries can opt titles out, presumably on the basis of book condition, but they usually have volume commitments.) It is notable that Google has over time increasingly sought libraries with more focused collections, or has specifically targeted subsets of broader library inventories.

As far as your optimism that libraries may eventually mount their own efforts, I suspect this will happen, but to a much reduced extent, for public domain materials only, and without significant coordination. The consequence may well be a highly fragmented corpus of varying technical standards, lacking the presentation of a single search interface.

The one public domain effort of significance that has been attempted in response to Google Book Search, the Open Content Alliance, is largely reliant on external funding from Yahoo and Microsoft to support library operations; Yahoo made only an early modest investment that it has not renewed, and Microsoft's engagements have increasingly inserted more restrictive covenants on re-utilization and ownership by the libraries. Neither Yahoo nor Microsoft has dared venture beyond PD material, except for Microsoft's publisher program, which is approximately equivalent to the publisher-oriented offering from Google. These programs solicit in-print, in-copyright material directly from publishers that is then destructively scanned at a higher quality (books are de-spined, and the pages fed automatically into high-volume scanners); alternatively, publisher books are sourced directly in digital form.

I remain frankly more pessimistic about the possibility of a service approaching a worthwhile portion of the utility of Google's. Another important point here is the sheer inter-connectedness of diverse services that Google can make available, e.g., through integration of DigitalGlobe's mapping, user-generated content, scholarly literature, and additional sources of information. In the near to medium term (20 years, say), I suspect libraries will be quite incapable of replicating this functionality.

I am continually reminded of some of the underlying theses in Vernor Vinge's latest novel, Rainbows End, which features digitization of the UCSD Geisel Library as a central theme, but that's less here than there. As it were.

  Paul Duguid [08.23.07 09:04 PM]

Kevin:

I suppose a good deal depends on what we mean by "doing it again". If we mean, as you seemed to be saying, that someone will come along and do with the same technology what Google is doing now, then I stand by my case. We suffered 40 years of bad film because the "first mover" inhibited later entrants from coming along and doing the job better. If we mean, someone might invent an alternative technology to lift the text from books, then I'd certainly be a fool to bet against it. But we might be unwise to rely on that to get us out of the hole Google seems to me to be putting us in now. Punting on the possible is an easy way to avoid today's problems, particularly if we're not prepared to acknowledge that there are problems.

Your initial argument about how to improve the quality of Google, by everyone contributing a scanned page, goes back to my earlier paper for First Monday, for which the one under discussion was a coda. You assume those laws of quality, that if everyone contributes, things will inevitably get better and the good will put Gresham's law into reverse and drive out the bad.

In both arguments, whether we are prepared to bet on them or not, the assumption seems to be that quality will just sort itself out without either diagnosis or a plan for a cure. My advice would be, it may, but don't bet on it.

  bowerbird [08.24.07 01:08 AM]

if this is "a forthright exchange between
two brilliant, deeply penetrating scholars",
then the world is in a whole lot of trouble.

and if it "well encapsulates some of the
most significant conundrums raised by
the Google Book Search project", then
i guess i'm a monkey's uncle this week.

the complaints lodged by leary are fairly
obvious, many having been acknowledged
by paul duguid in his original article, with
the rest putting words in duguid's mouth,
which duguid was (rightly) quick to disavow.

the problem with google's scanning effort
is easy to document, and obvious to those
of us who have been following it all along,
namely that their quality-control is awful...

* pages are often badly-scanned or missing.
* insufficient care is taken to identify books.
* the o.c.r. results are sometimes _wretched_.
* google is being too tight-lipped about it all.

and really, that defines the problem-space.

(some people might also include the fact that
google is overly cautious in deciding whether
a text is in the public domain, but considering
all of the legal action pending against them,
the reticence can be seen as understandable.)

duguid calls these findings "surprising", but
they're well-known to everyone who looked.

perhaps even more importantly, however,
-- and contrary to what duguid states --
the trend since the beginning of the effort
has been toward improvement, and google
has demonstrated a willingness to go back
and repair the flaws, which is very admirable.

(would've been cheaper for 'em to do it right
from the outset, but they have deep pockets.)

remember, the philosophy of this company
says that "great just isn't good enough" and
"never settle for the best". much of their work
in this scanning project is crap, at least so far,
but we have to concede they aren't _done_ yet.
indeed, we're only in the top of the 3rd inning.

i might be overly optimistic, but i believe that
google will surprise people who think they are
"trading" quality for quantity. their researchers
know a _lot_ about text, and have _111_tons_
of text to work with, so i think they're going to
_stun_ their critics by whipping the books into
a high-quality state. after that, i expect to see
some _brilliant_ work come out of their labs...

for instance, i find google has several copies of
tristram shandy already scanned, specifically:
> 0G44AAAAIAA
> yI65WOrcbgwC
> pgIlAAAAMAAJ
> D4YLAAAAIAAJ
> mRAfAAAAMAAJ
and most of those returned fairly good o.c.r.
(including that first word of that first chapter),
which means that google can cross-check text
against their various copies to correct errors...

in such a respect, quantity _produces_ quality,
trumping the traditional "trade" between them.
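
(to make that cross-checking concrete, here's a minimal sketch in python --
my own illustration only, not anything google has published -- that aligns
several o.c.r. readings of the same page word-by-word and keeps the majority
reading at each position. the sample readings are hypothetical, standing in
for scans of that famous first sentence.)

# minimal majority-vote cross-check across several o.c.r. readings.
# align each reading against the first one with difflib, then let the
# copies outvote each other wherever they disagree word-for-word.
from collections import Counter
from difflib import SequenceMatcher

def consensus(transcriptions):
    reference = transcriptions[0]
    votes = [Counter({word: 1}) for word in reference]
    for other in transcriptions[1:]:
        matcher = SequenceMatcher(None, reference, other, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # matching words, and same-length substitutions, count as
            # votes against the corresponding reference positions
            if tag == "equal" or (tag == "replace" and i2 - i1 == j2 - j1):
                for k in range(i2 - i1):
                    votes[i1 + k][other[j1 + k]] += 1
    return [counter.most_common(1)[0][0] for counter in votes]

# hypothetical o.c.r. readings from three different scans
copies = [
    "I wifh either my father or my mother".split(),
    "I wish either my father or my mother".split(),
    "I wish eithcr my father or my mother".split(),
]
print(" ".join(consensus(copies)))
# prints: I wish either my father or my mother

(each individual reading is flawed, but the majority reading across the
copies recovers the correct text -- quantity producing quality, as above.)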

i believe we're going to see a lot of situations
where the sheer mass of their project creates
opportunities that simply couldn't exist before.
(they've already hinted at automatic translation.)

and kelly is right. if it ends up that we have to
do this job ourself, as public digitizers, we will,
a page at a time. after all, this is _our_ library.

***

oh, and by the way, since people mentioned it,
the google o.c.r. text being hosted by michigan
is _abominable_. it's downright embarrassing.
they lost the blank lines separating paragraphs;
they lost quotation marks (single and double);
they lost dashes (en-dashes and em-dashes
and the end-of-line hyphens). simply awful.
it's a big disgrace to the name of digitization.

-bowerbird

  Alain Pierrot [08.24.07 01:44 AM]

Patrick:

"All mass digitizing programs compromise textual accuracy and bibliographical meta-data"

Why should mass digitizing programs compromise bibliographical meta-data?

I agree with you that any new release of a text compromises textual accuracy -- or creates a new version of the text, significantly different from the "original" source. More readers should be made aware of this; too many people, about to invest time, effort, even money to access and/or give access to books, simply ignore that fact, or, worse, believe GBS is immune to this artifact.

Google's warning about the beta status of its GBS functionalities is perhaps too light; however, it is there, and Google should be praised for that.

But bibliographic metadata, available from the library, can be much more accurately inherited in the mass digitization process. Shouldn't libraries contributing to GBS ask for an explicit, quality-controlled retrieval of their catalog bibliographic data? This information could be displayed as a feature attached to the book, leaving the innovative ways of finding the reference provided by Google as its own field. At the least, it would make clear to the user who is responsible for what.

Peter:

"Costs come primarily in two areas: logistics and preservation"

About digital preservation, I have just been pointed to a very thorough analysis of the relevant issues, worth reading, "Requirements for Digital Preservation Systems":

The authors' claims about the necessity of transparency, auditing, and economy may help in this discussion.

  Alain Pierrot [08.24.07 01:53 AM]

Oops, the link to "Requirements for Digital Preservation Systems" is missing, sorry!

http://www.dlib.org/dlib/november05/rosenthal/11rosenthal.html

  Paul Duguid [08.24.07 06:50 AM]

Well, Bowerbird, I make no claims for brilliance. Indeed, slow-footed, I find myself increasingly caught in a fast-moving squeeze play:

Google is a powerhouse technology company, with a masterful approach, and brilliant technicians (a status that allows their talk of the information "librarians so lovingly organize" to drip with condescension) from whom the Western canon will emerge as long as no one criticizes in public. (Of course, privately we've all known about its problems all along.) This is a truly private, corporate, quasi-philanthropic effort. Business at its best.

Oh no it's not, it's an open source project in which anyone with a scanner, whatever their skills, can contribute. It's merely project Gutenberg with pdfs and everything will fall neatly into place. And because everyone is making public the problems, it will be fixed in no time.

Google Books is for readers to find books. It's like the library down your street. It's the new Carnegie's gift to everyman (and woman).

Oh no it's not, it's a special preserve for scholars to play in; a sand box for the nineteenth-century researchers who've always resented not having an equivalent Early English Books on Line or 18th century STC texts. Moreover, if we wait long enough and place a long bet (every futurist's warrant for earnestness), everyman may not get much use out of this sandbox, but his or her home-scanned contribution will, while we hold our breath, outperform the editorial decisions of the Florida edition of Tristram Shandy through millions of uncoordinated incremental iterations. (At which point, those brilliant technicians and everyman start to sound more and more like the typing monkeys against whom they are racing to reproduce the Western canon by serendipity.)

  bowerbird [08.24.07 09:27 AM]

paul said:
> Well, Bowerbird, I make no claims for brilliance.
> Indeed, slow footed, I find myself increasingly
> caught in a fast-moving squeeze play:

i happen to like your sense of humor... :+)

and i have proclaimed, in many places in cyberspace,
that i found your earlier piece on "first monday" to be
a brilliant piece of work revealing important insights,
which -- in my opinion -- were not sufficiently heeded.

your sequel wasn't as good. but that's not unusual...


> Google is a powerhouse technology company,
> with a masterful approach, and brilliant technicians
> (a status that allows their talk of the information
> "librarians so lovingly organize" to drip with
> condescension) from whom the Western canon
> will emerge as long as no one criticizes in public.
> (Of course, privately we've all known about
> its problems all along.) This is a truly private,
> corporate, quasi-philanthropic effort. Business
> at its best.

why do i feel i'm in the middle of an elaborate setup? :+)


> Oh no it's not, it's an open source project in which
> anyone with a scanner, whatever their skills, can
> contribute. It's merely project Gutenberg with pdfs
> and everything will fall neatly into place. And because
> everyone is making public the problems, it will be
> fixed in no time.

oh! because i am! ;+)


> Google Books is for readers to find books.
> It's like the library down your street.
> It's the new Carnegie's gift to everyman (and woman).

ok, i'll let it play out...


> Oh no it's not, it's a special preserve for scholars
> to play in; a sand box for the nineteenth-century
> researchers who've always resented not having
> an equivalent Early English Books on Line or
> 18th century STC texts. Moreover, if we wait
> long enough and place a long bet
> (every futurist's warrant for earnestness),
> everyman may not get much use out of this sandbox,
> but his or her home-scanned contribution will,
> while we hold our breath, outperform the
> editorial decisions of the Florida edition of
> Tristram Shandy through millions of uncoordinated
> incremental iterations. (At which point, those brilliant
> technicians and everyman start to sound more and more
> like the typing monkeys against whom they are racing
> to reproduce the Western canon by serendipity.)

so, let's see if i can get to your bottom line.

you're absolutely correct -- 100% -- that there is a mass
confusion about "what it all means". part of the reason is
-- as i said above -- google has been far too tight-lipped.

another part of the reason is that many people have been
reluctant to believe the few things that google _has_ said.

another part is because many people -- people who are
somewhat confused -- have projected onto the project
their own hopes and dreams, because it's one of the few
games in town that has been big enough to _do_that_...

yet another part -- a big part -- is because the greedy
capitalists who have been in charge of creating and
sustaining and recording our cultural heritage up to now
see themselves losing that grip, and are trying to regain it
by putting out a lot of fear, doubt, and uncertainty, not to
mention a bunch of bald-faced lies and lots and lots of spin.

and a final part of the confusion about "what it all means"
is because we just plain don't know yet how it will play out.
and we're not likely to know for a number of years, if then.

so, to my thinking, it's perfectly natural and understandable
that it's little more than a ball of confusion at this point...

and your article is bouncing that ball of confusion.

you claim very small aims -- that you just want to point out
the contradictions of the project, that the scanning is sloppy,
that google needs to keep much better metadata, etc. --
but those bullets wouldn't hurt anybody much at all if it
weren't for the fact that the project is fraught with attention.

their work so far is -- much too often -- crap, and bad crap.

but like i said, we're only in the top of the 3rd inning right now.
and hey, in the top of the 3rd inning of that now-famous game
from the other day, baltimore led the rangers by a score of 1-0.
texas came back to score 5 in the 4th, and 25 in the later innings,
to win by a whopping score of 30-3. now i am not saying that
google will come back from its crap in such a spectacular fashion,
but i _am_ saying that it's too early to say who will win this game.

google has already shown a willingness to re-do bad pages.
they respond -- slowly, but they _do_ respond -- to reports
of bad pages, and those pages are _eventually_ done again.

and though it's hard to track these things, because they are
so stingy with information about the progression of the work,
my sense in looking at hundreds of various books across time
is that their quality is steadily increasing. (and yes sir, that is
not very "scientific". but then, neither is looking at one book.)

none of this is surprising. _every_ scanning project _starts_
with shoddy work -- because it takes a tremendous amount
of attention to apply the constant focus needed to do it right,
and it's hard to maintain it, especially with such a simple task.
and getting 99% of the pages right won't cut it; we need 100%.

but i trust google to "get it right" in the long run.

it's not a _blind_ trust, however. not in the slightest, no sir.
i believe we need to "liberate" all of the public-domain work.

(and if it were up to me, i'd liberate every book in the world,
but for now, let's pretend i'm talking about public-domain.)

so i've built some models about how i think things should be.

you recommended one scan-set of tristram shandy, from o.c.a.,
so i grabbed it and have hosted it on my site using one model:
> http://z-m-l.com/go/trist/tristp001.html

this particular set-up is for what i call "continuous proofreading",
where _highly-refined_ text from a book is offered to the public
alongside of the page-scans so they can actively compare the two
if they have any doubt about the accuracy of the o.c.r. transcription.

(the text that i used was from the o.c.a. edition you recommended,
but as you'll see, it's many miles away from being "highly-refined",
most notably because o.c.a. somehow lost all the em-dashes, and
sterne used a _lot_ of em-dashes. the o.c.r. from o.c.a. isn't _quite_
as bad as the o.c.r. over at umichigan, but it's still fatally flawed.
this is one of the problems you _should_ have been highlighting.)

anyway, the idea is that "continuous proofreading" makes the public
aware that this text _belongs_ to the public, that they're _responsible_
for removing the errors in it, so as to bring it to a state of perfection.

so if google doesn't do the job completely right, _we_ will fix it up.
and we will host it independently, so we don't have to rely on google.

again, this particular text will need a lot of work done on it before it
can be considered "highly-refined" enough for "continuous proofing"
-- my conception of the correct definition is 1 error every 10 pages --
but if you want to see some sample texts that _do_ qualify, view these:
> http://z-m-l.com/go/mabie/mabiep001.html
> http://z-m-l.com/go/myant/myantp001.html
> http://z-m-l.com/go/tolbk/tolbkp001.html
> http://z-m-l.com/go/sgfhb/sgfhbp001.html
> http://z-m-l.com/go/ahmmw/ahmmwp001.html

-bowerbird

  Gary Charbonneau [08.24.07 10:18 AM]

I work in a large academic library that will, at some point (we don’t know just when yet), be making its books available to Google for digitization, and I’ve read the exchange between Patrick and Paul with considerable interest. I have to say that I’m mostly on Patrick’s side here.

Yes, it’s true that even a cursory look at Google Books often reveals quality control issues, some of them serious. I’ve seen all the problems that Paul identified, and more. Nevertheless, it seems to me that a job worth doing is worth doing badly, if the alternative is that it doesn’t get done at all, or gets done over an excruciatingly long period of time. Clearly Google has opted for a quick and dirty approach to digitization, at least to start with. As Patrick points out, the result is a large mass of books being made available for the public to search (and depending on copyright issues, to read), in a remarkably short period of time. That being the case, I’m not even willing to say that Google is doing the job badly, because to say that I’d have to say that “the job” is to do better at digitizing some copy, or even all copies, of Tristram Shandy, than it is to digitize and index several million books. As Josef Stalin once said of the Red Army, “Quantity has a quality all its own.” It seems to me that Google’s approach meets a real need, and it is a misplaced criticism to find excessive fault with it for failing to meet all needs.

Jeffrey Toobin estimated that Google’s digitization project might cost the company eight hundred million dollars by the time it is done. I don’t know if that is true, and I’m sure Google’s not saying. Supposing, for the sake of argument, that it is true, I think one could probably safely say that someone who is willing to spend well over a billion dollars to do the same thing could produce better quality in the same amount of time. But unless someone steps up and makes that kind of money available - and I’m not holding my breath - I’m not willing to look this particular gift horse in the mouth. Sure, it will cost my library quite a bit of money (indirectly, in terms of staff time and effort) to make its books available to Google for digitization, but it would cost us a whole lot more money to get the books digitized if we had to assume all the costs ourselves. And quite frankly I’m a bit put off by Paul’s argument that, if Google badly digitizes a book from, say, Harvard, this somehow damages the reputation of Harvard. It may damage Google’s reputation, but I fail to see how or why it damages Harvard’s.

There is one point on which I do thoroughly agree with Paul, and that is that Google does a horrible job of presenting multi-volume works. In fact, it does a mediocre job of handling “known item searching” in general. Library catalogs are much better at that sort of thing than Google Books currently is (or at least they should be), and fixing this problem is something that I would urge Google to work on as a matter of considerable priority.

One point that needs to be emphasized is that the physical books won't just disappear from the face of the earth merely because Google has digitized them. They will be returned to the libraries from whence they came, to be made available for further use. Google's "about this book" pages have a handy link out to OCLC's "WorldCat" to make it easy to find out where physical copies are located so one can consult them on-site or obtain them via interlibrary loan. Even if the quality of Google's digitized version of a particular title is too poor to satisfy the needs of a particular reader, he or she may nevertheless learn enough about the book to make a more informed judgment about whether either of those options is worth pursuing. I understand that there is some anecdotal evidence from the original Google libraries that books are more likely to be used in their physical manifestation after digitization than they were before.

  Adam Hodgkin [08.24.07 10:05 PM]

The commentary and discussion in this exchange is very helpful. There appears to be a general consensus that Google's approach is excessively closed -- and I am not confident that they have the scholarly expertise and bibliographic commitment to match their collective technical brilliance. Even if the planners of the Google project were justified in ignoring existing metadata, catalogues, and volume or serial numbers when they started their scheme, it is nutty to continue to ignore these issues as the project matures. After all, bibliographies, indexes and catalogues are often books. They need to be scanned and treated appropriately.

Indeed there has been very little public review or consideration of the extent to which Google's basic textual approach is acceptable (no XML text, limited structure, JPEG + full-text database, very limited internal linking, etc.) -- I happen to think that the choices are mostly quite good, but the methods and the limitations which flow from the JPEG approach have been little commented upon. So Google's project is admirable in many ways, but if it remains under Google's sole control and ownership it will surely lead to a disappointing result. Getting things right and using the expertise of scholars will require that Google develop a more open and responsive approach.
It is surely very naive and/or hugely optimistic to suppose that 'collective editing' or the 'wisdom of scanner-equipped crowds' will lead to the reliable establishment of good texts and a useful scholarly apparatus -- which in the end is what properly curated literature needs.

There needs to be more pressure on Google to adopt an open and collective approach which includes advice and direction from its library partners. Clearly the libraries felt at the beginning that they could negotiate with Google and be quite stipulative about the content which is to be included (Harvard was quite selective, Oxford limited its participation to out-of-copyright works, as did the NY Public Library), but these major libraries appear to have been extremely docile and subservient in the execution of the collaboration. Is it not time that the libraries that collaborate in the Google project formed a users' group and began to articulate some of the standards and goals that will matter to the bibliographic success of this enterprise?

Thanks to Paul Duguid and First Monday for kicking off discussion of these important issues. But there should be a more appropriate forum or method for canvassing opinion and reviewing the aims of large-scale digitisation. Where are the learned societies? Why are the libraries and the librarians so quiet? Where is the academic leadership?

  Adam Hodgkin [08.24.07 10:28 PM]

Small point re Kevin Kelly's:
"But if we are now scanning books that were once microfilmed, we are doing it again."
I believe that in most cases what happens is that the microfilm gets scanned and OCR'd, so the scanning and the laborious and potentially damaging work of manipulating the originals doesn't usually get done twice. There is fairly widespread concern in the library community that the Google process is not really adequate for preservation, and this does pose a problem of what one might call 'scanning blight'. Once done, even though poorly, there is relatively little encouragement to do it again.

  Patrick Leary [08.25.07 06:47 AM]

Books serve different purposes in different formats, editions, and media, and they’re not interchangeable. Even if Google Books had chosen an edition of Tristram Shandy that Paul fully approved of, and had done a meticulous job of scanning every page, and had had a bibliographer create the perfect entry, it would still not be equivalent to the “same” volume from a library. And Paul could still claim, quite rightly, that therefore Google Books cannot and should not replace libraries.

To which I would only add “…for all purposes.” For some purposes, it can, it will, and it should. Like a lot of readers, I have a hardback copy of a favorite old book that has a superior typeface and nice wide margins and feels good in the hand and is perfect for armchair reading, as well as a cheap little paperback that I can scribble in. In some cases I might also have a fully annotated edition for reference, and maybe even an unblemished but still cheap paperback copy to loan out to friends. Now I can also have a copy (for free, no less) that I can instantly search through, or print out pages from, or clip passages from that I can insert into an email or into a paper I’m writing. Each of these copies is flawed - the reading copy isn’t scholarly enough to cite, the paperback is a pain to read, the digital copy has some badly scanned pages, etc. - but none of them is equivalent to or interchangeable with the others. That hypothetical creature whom Paul calls, with such exquisite condescension, “the untutored reader,” knows perfectly well that a copy of a book grabbed off the Internet serves certain uses, and ones from a library or a bookstore serve certain others, and that they’re not all the same.

Google Books as a whole serves certain uses, too, that are congruent with but not equivalent to those served by the cooperating libraries, and it serves those uses for anyone anywhere who can get online. Paul thinks this makes it “a special preserve for researchers to play in, a sandbox for the 19th-century scholars” (etc., etc.). “Special preserve”? Researchers have already had their special preserves for many years - they’re called major university research libraries. How many people around the world have had access to books from these libraries? Mighty few. Only people with longstanding university affiliations have the luxury of imagining that it’s easy to identify, find, and retrieve books on a wide range of specialist subjects when you’re not a student, staff member, or faculty member of a university, and don’t even live near a major library. It’s laughable to pose as the champion of the little guy when the real and intended effect of your sneering is to discourage people from recommending or using the one resource that is truly beginning to make many of the riches of this hitherto walled-off preserve available to everyone to use. The job may have been done all too hastily and sloppily, but what has been accomplished, and accomplished on a fantastically wide scale, is not trivial. Out here in the real world, where nobody reads Google press releases or cares whether it lives up to its hype, we’re just trying to get our work done and find stuff we need, and Google Books is helping us do that.

  Kevin Kelly [08.25.07 09:55 AM]

Patrick says,

"Out here in the real world, where nobody reads Google press releases or cares whether it lives up to its hype, we’re just trying to get our work done and find stuff we need, and Google Books is helping us do that."

Amen to that. Google Book Search's greatest effect will not be upon the over-booked (this small crowd reading this), but the vast unbooked of the world with no access to well-groomed libraries, or big box book stores, or even Amazon. This is their library, and whatever flaws it has (which we should of course try to correct) are dwarfed by its virtues (which we should of course herald and praise). It would be a crime against the unbooked if we derailed this gift.

  bowerbird [08.25.07 10:07 AM]

adam, i disagree with you, on several fronts, and
i would post the message that i wrote, except that
it seems so... disagreeable. and too many people
misinterpret such comments as "impolite", although
to my mind, it's simply having an honest discussion.

but patrick and kevin, i agree! strongly! thanks!

-bowerbird

  Mike Shatzkin [08.25.07 11:18 AM]

I am fascinated by the conversation. I am adding some salt and pepper here.

It is a bit amusing to consider the notion that books won't be scanned twice when so many books are being scanned now by Google, again by Amazon for SITB, and, because each of them keeps the file from the publisher, again by Microsoft for their new program. And, you know what, they might be scanned AGAIN for Barnes & Noble in the next six months!

This is reality today.

Scanning any particular thing is not expensive. With contributed labor it can be virtually free. It would seem likely that anything rights cleared that, say, more than a hundred people a year want to see will get adequately scanned over the medium term, say, the next ten years.

Google's principal objective, the one that goes to the core of how they MAKE all their money so that it justifies INVESTING all this money, was never to deliver good scans; it was to make non-digital content searchable. From the perspective of their initial motivation, everything beyond that is gravy, or negotiated out of them by the parties that make achieving the core objective possible.

And, in this way, they aren't much different than anybody else. Just a lot bigger.

  Juliet Sutherland [08.25.07 12:30 PM]

From the outside looking in, there appears to be little or no coordination within GBS to prevent rescanning the same book from different libraries. For those of us who are trying to put together corrected etext versions, this works out well, since we can often piece together a complete set of scans from the duplicates or near duplicates that Google provides. Some careful searching across other large and small archives of text scans (OCA, LoC, and others) allows us to piece together even more material.

On another subject, something that hasn't been mentioned in this discussion yet, though well known, is the very poor quality of scans of illustrations in GBS. This is understandable given GBS's focus on unlocking access to "text", but it does eliminate an entire treasure trove of material that is part of our heritage: not exactly written, perhaps, but printed. To my mind, this is yet another argument why libraries must not consider GBS as providing archival-quality material.

I find myself agreeing most with those who point out that GBS is providing a useful service that wouldn't otherwise exist but which is quite different from what many people think of when they hear "book digitization project".

  Paul Duguid [08.25.07 02:04 PM]

Perhaps Patrick had another long night. In which case, I offer my sympathies. Or perhaps this time the insults are in earnest. In which case, I'll keep my sympathies for myself. Either way, the squeeze is still on. Patrick, who came in as the champion of scholarship and close readings, has swung over to the side of the general reader, feeling he can do it without condescension where I can't--perhaps because he's "Out [t]here in the real world" and I presumably am not. If he can, so much the better.


The scholar will not be forgotten, however. Two Berkeley students have posted a splendid YouTube video singing the praises of Google for scholars. It's very good on what Google's good at (though I wish we had the name of that obscure roadbuilder). But noticeably it expresses no hesitation about Google Books. Now, if anyone were to speak like that about their university library, or any other academic resource, they would be denounced for their naivety and lack of scholarly distance. (Or as I have been, by another nineteenth-century scholar, as a shill for the publishing trade.) I too am an enthusiast for GB. I believe I called it an "invaluable" project. But as with any other scholarly resource or any other public library, we shouldn't avoid its limitations. To many what I saw were not problems, were not serious problems, were problems that would go away even if never mentioned, or were such idiosyncratic and minute problems that it was undignified to comment on them. I thought they should be discussed.


Why? Well, I don't think Google hype is limited to their press releases. Given the company's remarkable success, hype is pervasive, even beyond its control, and Google's ability to get things right is widely accepted (look not only at the newspaper commentary on GB, but also at the comments in this exchange). Google has earned that admiration, so understanding that GB might mislead or confuse, and how, probably does take a bit of--no apologies for the word--tutoring. Indeed, I suspect that we spend more time teaching our students (and ourselves) to read than anything else. One paradoxical outcome, however, is that we tend to forget how much we have learned, and still see reading complex texts as a more-or-less transparent act. Thus it's easy to assume that anyone--from the scholar to the ordinary reader--who picks up a book picks it up with the same skills most readers of this exchange have. Similarly, I find it quite hard to get sophisticated students (and myself) to think of the skills and skepticism our fingertips bring to the keyboard even before we start typing into a Google or Wikipedia search box. The assumption that these skills are available to all strikes me as mistaken. (It will no doubt strike others as elitism or condescension.)


So, as a lot of words have been put in my mouth, I'm going to go back to the search I undertook. (Those who have real work in the real world to get back to should switch off here, if they haven't already.) As I should have made clearer in the article, I picked Tristram Shandy because, in an earlier article about open source methods, I had used that book to explore Project Gutenberg. It was an unkind choice then, because PG would inevitably have difficulty reproducing aspects of Sterne's book in ascii. It was not an unreasonable choice, as a moderately popular film on the book had been released when I began the piece and raised some interest in the text. The PG edition had serious problems (and for pointing these out I was not thanked by people from PG & the Distributed Proofreaders). In response to my comments, however, I was regularly told that all would be cured with Google Books. I too assumed it would. But it seemed wise to check.


In the earlier piece, I had stressed the ordinary reader, because PG does (as a defence against fastidious academics). It seemed reasonable to take the same approach to GB. I share with everyone in this discussion the notion that GB is many things. Among the many, it is offered as a place for people to find and download books. I tried.


To the search. The remarkable power of Google page ranking is widely respected--so widely that it would be very difficult even for Google to prevent people transferring that respect from the standard Google search to GB. While we might have some idea of how the page rank algorithm works, we have no good idea how Google ranks its returns in a book search. The results for TS lead with two editions from the late-nineteenth and early twentieth century. Old books carry a certain aura of authority with them. (I suspect, though you should get your condescension meters out again, that general readers probably don't know R.C. Bald's comments about most errors in TS being introduced in the nineteenth century.) Indeed, the first hit was quite possibly the same malformed edition that PG had used to base its text on. Importing errors added in the nineteenth century and subtracted in the twentieth century, this work is a splendid example of the corrupt texts that interest Patrick, but a challenge for an ordinary reader. Starting with that edition, the PG text, however good its ascii, would always be in the hole. That GB would offer such an edition at the top of its list struck me as absurd and misleading and a sign that its vaunted technological powers were curiously prey to mistakes parallel to those found in PG. Yet, as advertisers know full well, people impute a lot of authority to the top of a Google search.


So there was a problem with the Google ranking. Scholars, particularly those using Google books as a corpus, can see through or get around this. It wasn't obvious to me that ordinary readers would. It would be a pleasant outcome of these exchanges if Google would be shamed into releasing details on its ranking system so that all of us could better understand it.


But it wasn't only the ranking. The scanning, starting with the first page, was horrible. It wasn't merely 5%. Nor was it merely gutters or margins. Page after page was distorted. The text, a bad one to begin with, was unreadable in detail and so, unreadable in the large. Moreover, for those who see GB as primarily a corpus, much of it was unsearchable.


So there was a problem with the Google scanning. If this exchange shames Google into improving that, it would be a further worthwhile outcome.


But it wasn't only the scanning. I looked, as an alternative, to the second edition offered. This was a different edition, but Google's ranking plunged you, without warning, into the third volume (which was also completely missing the first page). But there was no way to identify the volume.


So there was a problem with the metadata, at a remarkably low level. If this exchange shames Google into improving metadata, it would be advantageous.


There were, then, three kinds of problem that could be identified very quickly, but that I felt would not be transparent to someone without a certain understanding of how books work. And for all the singularity of my search, it wasn't hard to find those problems replicated elsewhere on a similar scale. And, while I was trying to consider the challenge for the ordinary reader, it was clear that some of those problems, of course, affect scholarship as well. It takes some skill to get robust results, rather than just hits, out of GB.


It was a tiny sample, and subject to all the limitations of such sampling. (If Google would allow better sampling, we might get more robust results.) None of this denounces GB as worthless. Let me repeat: I find it invaluable. But it does show limits to the project, many of which could be overcome with some effort, but there was no sign that Google was putting that sort of effort in. If collectively, by poking holes in Google and by poking holes in me, we get a better Google Books, I will be as grateful as anyone. If people, as Patrick hopes, will produce better criticism, more scholarly criticism, better informed criticism, I will be the first to defer to them. And I agree that we need to understand (and criticize) from multiple perspectives. If we simply believe that Google will sort these things out on its own, that the Wisdom of Crowds will do all the work, or that, because Google has financial instincts and interests, all these problems will disappear and we can just go on "trying to get our work done," I suspect they won't.

  bowerbird [08.25.07 06:33 PM]

paul said:
> To many what I saw were not problems, were not serious problems,
> were problems that would go away even if never mentioned, or
> were such idiosyncratic and minute problems that it was
> undignified to comment on them.
> I thought they should be discussed.

but paul, the problems _have_ been "discussed". lots of times,
in lots of places. by lots of people. to anyone who has followed
the project, your analysis didn't really add any new information...

(your earlier article, in direct contrast, _did_ give new insights.
and i was one of very few supporters of p.g. and d.p. who said
that both organizations _should_have_ paid more attention to it,
just so you know that there were some of us who saw its value.)

but who knows? perhaps the higher profile of "first monday" will
indeed have some impact in nudging google into better practices.
perhaps your position will have influence that us peons don't have.

since google is so tight-lipped, it's hard to know whether any of the
previous discussions impacted it or not. i would guess they hadn't,
that google has simply learned from its own experience, and doesn't
much listen to anyone else. indeed, it's amazing to me google has
learned _so_little_ up to now, even if only by their own experience.

by this time, i would have expected -- _at_the_very_least_ -- google
would have learned that it is cost-efficient to do the quality-control
on the scan-set _before_you_let_the_book_leave_the_scanning_bay_.
but i don't see the great preponderance of books-with-zero-problems
that one would expect to see, if google _had_ already learned that lesson.
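
just to make concrete what i mean by quality-control checks in the scanning bay, here is a minimal sketch in python. it is purely illustrative -- not anything google or the libraries actually run -- and the page-record fields are my own assumptions:

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PageScan:
    sequence: int                    # physical scan order in the book
    printed_number: Optional[int]    # page number read from the image, if any
    ocr_text: str                    # raw o.c.r. output for the page

def qc_report(pages: List[PageScan], min_chars: int = 40) -> List[str]:
    """List problems that should send the book back for rescanning."""
    problems = []
    seen = set()
    # a gap in printed page numbers usually means a skipped or unreadable page
    for prev, cur in zip(pages, pages[1:]):
        if prev.printed_number is not None and cur.printed_number is not None:
            if cur.printed_number != prev.printed_number + 1:
                problems.append(
                    f"gap: page {prev.printed_number} is followed by {cur.printed_number}")
    for page in pages:
        if page.printed_number is not None:
            if page.printed_number in seen:
                problems.append(f"duplicate printed page number {page.printed_number}")
            seen.add(page.printed_number)
        # a nearly empty o.c.r. result on a body page suggests a blank or botched scan
        if len(page.ocr_text.strip()) < min_chars:
            problems.append(f"suspiciously little text on scan #{page.sequence}")
    return problems

if __name__ == "__main__":
    book = [
        PageScan(1, 9, "Material and Method. " * 5),
        PageScan(2, 10, ""),              # blank or failed scan
        PageScan(3, 12, "text " * 20),    # page 11 never scanned
    ]
    for issue in qc_report(book):
        print(issue)

run a check like that while the book is still on the cradle, and the gaps, duplicates, and blank scans get caught before the book ever leaves the bay.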


> While we might have some idea of how the page rank algorithm works,
> we have no good idea how Google ranks its returns in a book search.

even google has "no good idea" of that yet. they just switched it up again
last week. and i expect that they will change it many times in the future...
so i'd say this is one of those "too early to be judging that yet" type things.


> Indeed, the first hit was quite possibly the same malformed edition
> that PG had used to base its text on

if -- after google goes through those many iterations of its search function --
this "malformed edition" remains at the top of the list, then you should yelp.

(this doesn't necessarily mean that i am in agreement with you that the
"malformed editions" of any particular books should be downgraded,
since the fact that they _are_ malformed is another topic of interest,
not just to the scholar but to the "average reader" as well. however,
that discussion is one that is not particularly timely at this early stage.)


> That GB would offer such an edition at the top of its list struck me as
> absurd and misleading and a sign that its vaunted technological powers
> were curiously prey to mistakes parallel to those found in PG. Yet,
> as advertisers know full well, people impute a lot of authority
> to the top of a Google search.

well, that _is_ a good counter to my "it doesn't matter quite yet" stance.

but tell me, does this "malformed edition" still come out on top _today_?

and -- pray tell -- why was this "malformed edition" even _present_
in the highly-esteemed libraries of stanford, harvard, u.c., and oxford?
if readers could stumble across it there, with no apparent "warning"
as to its "malformed" status, is it any worse that google points to it?


> So there was a problem with the Google ranking. Scholars, particularly those
> using Google books as a corpus, can see through or get around this.
> It wasn't obvious to me that ordinary readers would. It would be
> a pleasant outcome of these exchanges if Google would be shamed into
> releasing details on its ranking system so that all of us could better understand it.

we don't have enough power of any kind to "shame" google to do anything,
-- anything at all -- let alone to give away the key to their entire kingdom.

so keep yourself within the realm of possibility.

if you _really_ want to have google point to the _best_editions_ of each book,
then perhaps you and the other academics should make a list of those editions.
should be pretty easy for you to all come to some agreement, yes? ;+)


> But it wasn't only the ranking. The scanning, starting with the first page,
> was horrible. It wasn't merely 5%. Nor was it merely gutters or margins.
> Page after page was distorted. The text, a bad one to begin with, was
> unreadable in detail and so, unreadable in the large. Moreover, for those
> who see GB as primarily a corpus, much of it was unsearchable.

that's the real problem there. but, as i said before, it's all been said before.


> So there was a problem with the Google scanning. If this exchange shames
> Google into improving that, it would be a further worthwhile outcome.

i've thought the same thing about many of my vocal complaints.
but -- just so you know -- it seems to me they fall on deaf ears.


> But it wasn't only the scanning. I looked, as an alternative,
> to the second edition offered. This was a different edition,
> but Google's ranking plunged you, without warning, into the
> third volume (which was also completely missing the first page).
> But there was no way to identify the volume.

well, i assume that that information was on the missing first page.
it'd be a strange book that didn't have _some_ kind of information
about its volume number (if it has one) printed in it _somewhere_.

but yeah, as to the "metadata" thing, google sucks at that, too.
i can't help but think, though, that they have plans to incorporate
the full bibliographic cataloging information from their members.


> So there was a problem with the metadata, at a remarkably low level.
> If this exchange shames Google into improving metadata,
> it would be advantageous.

deaf, i tell you.


> There were, then, three kinds of problem that could be identified
> very quickly, but that I felt would not be transparent to someone
> without a certain understanding of how books work. And for all the
> singularity of my search, it wasn't hard to find those problems
> replicated elsewhere on a similar scale.

they can be replicated quite easily. which is why it was a bit perplexing
to see that you hadn't made your argument more robust in that manner.

but even if you had, the problems are so easy to replicate that they're obvious.
so then people would have accused you of kicking a dead horse repeatedly...


> It takes some skill to get robust results, rather than just hits, out of GB.

i don't think google cares much about "robust results".

and since their efforts are so far away from being even close to finished,
i think it would be premature for them to care about that at this point...


> It was a tiny sample, and subject to all the limitations of such sampling.
> (If Google would allow better sampling, we might get more robust results.)

so you think you can judge their success overall? mighty big of you...

do you think _google_ thinks that you can judge their success overall?

do you think they're gonna make it _easy_ for you to judge their success?

i believe they don't care what we think, because we ain't paying the bills.

and -- know what? -- i can't find any way to fault them for thinking that.


> But it does show limits to the project, many of which could be
> overcome with some effort, but there was no sign that Google
> was putting that sort of effort in.

i can't see that you have been assessing their performance across time,
so how can you measure that? according to what i have seen, they _are_
improving their performance. not as much as i would like, or even expect,
but they are getting better. and they _do_ go back and re-do bad pages,
sometimes even whole books, so i believe they have a solid commitment
to do the job right. their company philosophy tells me that they will,
and if they don't -- given enough time -- i will call them on it again...

moreover, i _know_ that _i_ have a solid commitment to fix their errors.
so if they don't do the job right, i'll make corrections on follow-through.
i come from the punk school that practices a do-it-yourself philosophy.

and i am optimistic enough to think that there are many people like me,
who consider the objective of creating _the_cyberspace_library_ seriously,
seriously enough that we will contribute our time and energy to make it,
in the same way that many people have contributed to creating wikipedia.

you may scoff at our "home scanners", but i believe you will see the light.
because i know that people _love_ books. love them broadly and deeply.
love them enough that they'll adopt a book and become "bookpeople"...

in this regard, you said nothing about my mounting of "tristram shandy".
i invite you to furnish me better o.c.r. results and some proofing energy,
maybe even some scholarly commentary, in order to bring this old book
-- which seems to mean quite a bit to you -- into our cyberspace library.

and -- as the final note -- this is where your analysis has gone wrong.

you seem to believe that _google_ is "creating the world's online library".
no sir. google is just doing some scanning for us. _we_ are creating it...

-bowerbird

  Anne Karle-Zenith [09.04.07 06:04 AM]

One of the big reasons we here at Michigan are not as concerned about Google Book Search image quality is that we know that what you are seeing now on GBS and in MBooks is not what you will see ultimately. Google’s scanning process produces a rich, high-quality master image from which they derive the versions you’re seeing online, and the derivation process is always being improved.

We have worked closely with Google on quality issues from the beginning of the project. We have our own quality review process at Michigan and have been providing data to Google on a monthly basis. Over time we have seen vast improvements in the quality of the images returned to us, with rates for most error types that we look for having decreased to a point where they are no longer of significant concern. Again, this improvement is a reflection of improvements in the engineering process, which will ultimately be applied to older scanned images. It is worth noting that the quality you are seeing online includes materials that were scanned and processed in the early stages of the project, before the improvements were put in place. So going forward the quality of scanning and image processing will be better, and in addition all the images you are seeing now are currently undergoing reprocessing, so older materials will also benefit from the most recent improvements.

  Juliet Sutherland [09.04.07 10:03 AM]

Having higher quality/resolution scans will certainly help with illustrations, but I don't see how reprocessing existing scans will fix problems with missing pages or bad metadata.

I'm glad to hear that Google is improving their quality. That's something that's very hard to tell from the outside since there is no way (that I know of) to determine when any particular book was scanned or processed.

What types of errors do you look for? What is your standard for "significant concern"?

  bowerbird [09.04.07 02:36 PM]

tell us, anne karle-zenith, if those high-quality master images will
yield better o.c.r. than the doo-doo you've been posting thus far.

here's an image:
> http://mdp.lib.umich.edu/cgi/m/mdp/pt?seq=16&size=50&id=39015016881628

and here's the o.c.r. page:
> http://mdp.lib.umich.edu/cgi/m/mdp/pt?seq=16;view=text;id=39015016881628

and here's the actual o.c.r. output:
> Material and Method.

> work illustration and confirmation
> of its truth. There are many things
> in his plays which are more intel
> ligible and significant to us than they
> were to the men who heard their
> musical cadence on the rude Eliza
> bethan stage, because the ripening of
> experience has given the prophetic
> thought an historical demonstration;
> and there are truths in these plays
> which will be read with clearer eyes
> by the men of the next century than
> they are now rea4 by us.
> It is this prophetic quality in the
> books of power which silently moves
> them forward with the inaudible ad
> vance of the successive files in the
> ranks of the generations, and which
> makes them contemporary with each
> generation. For while the medi@eval
> frame-work upon which Dante con
> structed the Divine Comedy be..
>
> I0

the first thing to notice is that the end-line hyphenates were lost.
(see "intel-ligible" between the third and fourth lines on the page.)

of even greater concern is that the em-dashes are lost completely.
(and they cannot be re-introduced like most end-line hyphenates.)

but by far the _worst_ loss of data here involves the quote-marks.
you'll see the double-quotes around "divine comedy" are missing.

further, notice that there is no indication of a new paragraph with
the line that begins with "it is this prophetic quality". this loss of
paragraphing can sometimes also be very difficult to re-introduce.

also notice some obvious typos that could have been corrected,
including "rea4" for "read" and "i0" for the "10" pagenumber...

as is patently clear, these o.c.r. problems are _not_ related to
inferior images, but rather to extremely careless handling of the
o.c.r. options, causing a _significant_ loss of data. what are you
planning to do in order to correct this very shoddy workmanship?
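
to make the fix concrete, here is a minimal sketch in python of the kind of post-processing i mean. it is purely illustrative -- not google's pipeline -- it assumes the end-line hyphen actually survives in the o.c.r. output (the whole point above is that google's output drops it, which is what makes the loss so hard to undo later), and the confusion tables only cover the pairs visible in the sample:

from typing import List

# confusion pairs visible in the sample: "rea4" -> "read", "I0" -> "10"
DIGIT_TO_LETTER = {"0": "o", "1": "l", "4": "d", "5": "s"}
LETTER_TO_DIGIT = {"o": "0", "O": "0", "l": "1", "i": "1", "I": "1"}

def rejoin_hyphenation(lines: List[str]) -> List[str]:
    """Merge words split across line breaks, e.g. 'intel-' + 'ligible'."""
    out, carry = [], ""
    for line in lines:
        line = carry + line.strip()
        carry = ""
        if line.endswith("-"):
            carry = line[:-1]       # hold the fragment for the next line
        else:
            out.append(line)
    if carry:
        out.append(carry)
    return out

def fix_mixed_token(token: str) -> str:
    """Repair tokens that mix digits into a word or letters into a number."""
    letters = sum(c.isalpha() for c in token)
    digits = sum(c.isdigit() for c in token)
    if letters and digits:
        if letters > digits:        # mostly a word: map stray digits to letters
            return "".join(DIGIT_TO_LETTER.get(c, c) for c in token)
        return "".join(LETTER_TO_DIGIT.get(c, c) for c in token)
    return token

def clean(text: str) -> str:
    lines = rejoin_hyphenation(text.splitlines())
    return "\n".join(" ".join(fix_mixed_token(t) for t in line.split())
                     for line in lines)

if __name__ == "__main__":
    sample = "which are more intel-\nligible and significant\nrea4 by us.\nI0"
    print(clean(sample))
    # -> which are more intelligible and significant
    #    read by us.
    #    10

the em-dashes, quote-marks, and paragraph breaks are a different story: once they are gone from the o.c.r. output there is nothing mechanical left to recover them from, which is exactly why the o.c.r. options have to be set correctly in the first place.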

-bowerbird

  Anne Karle-Zenith [09.05.07 06:32 AM]

This is in response to Juliet's post - Google's process of automatically assembling the bits has resulted in pages that appear to be missing. Google implemented a new page numbering algorithm prior to the start of the reprocessing, and we did an audit on a small sample of reprocessed material and *did* see vast improvement.

As for errors, we monitor things like legibility of text, page alignment, etc.

  bowerbird [09.13.07 08:09 AM]

this thread illustrates the biggest problem with
the scanning projects -- their nonresponsiveness
to the _actual_ problems with their output.

i have described a number of these problems --
including missing em-dashes, quotemarks, and so on
-- and the bureaucrats have fallen totally silent.

where is the sustained discussion that would
produce solid results? are we doomed to these
false starts that go nowhere? it's too sad...

-bowerbird

  Alain Pierrot [09.14.07 02:44 AM]

I have just been made aware of a very interesting white paper,
Preservation in the Age of Large-Scale Digitization

By Oya Y. Rieger

which answers a lot of the questions raised in this discussion.

http://www.clir.org/activities/details/lsdi.pdf

  bowerbird [10.05.07 01:21 PM]

alain said:
> which is answering a lot of
> questions raised in this discussion.

do you really think so?

because i don't...

from the first two paragraphs of the preface,
"quality/quantity" is presented as a tradeoff.
i think not. we need quality _and_ quantity.

it's entirely possible to do high-quality work
even when you're scanning millions of books.
all you must do is install quality-control checks.

and indeed, it's massively stupid to do anything
_but_ a high-quality job. what would we think
about a person who went all through a library
ripping one or two pages out of each book?
we would call them a vandal, and punish them.
but does it make any more sense to supposedly
digitize a book, but mess up a couple of pages,
with that error-rate on _millions_ of books?

the author also draws a distinction between
_access_ and _preservation_. but if materials
are to be _preserved_, why wouldn't we make
them _accessible_? are we gonna bury them in
a utah mountain? the distinction buys nothing.
surely we need both preservation _and_ access.

likewise, she differentiates "digital surrogates"
from "digital reformatting". but if we cannot
serve both of these purposes, we are failing...

i could go on and on -- like the author does,
with her 54 pages -- but is it really necessary?

all in all, this paper strikes me as _academic_,
in the least appealing sense of the word, sadly.
a lot of distinctions are drawn, but found lacking
in any substance when examined more closely...

let alone the question of the impotence of any
scholars trying to tell google what it should do.
the fact of the matter is that it's not that hard
for anyone to figure out what _needs_ to be done.

digitization does _not_ have to be hard. at all.
there are _no_ difficult questions here. _none_.
there is simply some unbelievably bad execution.
(which i feel that google _will_ rectify ultimately,
since, for them, "great just isn't good enough";
so surely they cannot settle for clearly inferior.)

the answer is simple: scan every page nicely,
and then make every single scan _available_,
including a relatively clean o.c.r. of its text,
and let interested volunteers correct that text.

distribute very widely to ensure against loss,
and update to new media whenever they appear.
that's all we must do. don't make it difficult.

-bowerbird

p.s. honestly, i had substantial problems with
this white-paper, but i just sat on them initially.
(just like i sat on this comment i'm now posting.)

basically, i believed the white-paper is clueless,
but figured it would be "impolite" to say that...

on visiting the web-site of c.l.i.r. -- sponsors of
this white-paper -- however, i saw that they had
invited public comment, deadline being today...

i also read there that they've received a grant of
$2.19 million -- yep, over 2 million bucks -- to:
> assess the utility to scholars of
> several large-scale digitization projects.
> ...
> CLIR will ask scholars from historical and literary
> areas of study to summarize key methodological
> considerations in conducting research in their disciplines.
> Scholars will then assess each mass digitization project
> under scrutiny, and each will submit a report. The reports
> will be synthesized and recommendations drawn from them.
> The summary will serve as the basis of a larger meeting of scholars
> in November 2007 to discuss the findings and recommendations
> and to determine next steps. Chief among these will be a strategy
> for working with individual and corporate database developers to
> improve the utility of these databases to scholars. CLIR will issue
> a public report early in 2008.

wow.

to me, this means that their cluelessness becomes
a serious matter. and this white-paper is clueless.

so later today, i'll post a comment telling you why.

  Alain Pierrot [10.06.07 02:21 AM]

bowerbird said:
>this white-paper is clueless

the same said:
>i have described a number of these problems --
>including missing em-dashes, quotemarks, and so on
>-- and the bureaucrats have fallen totally silent.

>where is the sustained discussion that would
>produce solid results? are we doomed to these
>false starts that go nowhere? it's too sad...

my view is that publishing a white paper, a call
for comments with defined goals and schedule might
well _look like_ a good basis for a "sustained
discussion that would produce solid results".

this makes c.l.i.r. a better place to try and help
move book digitization in a better way.

i agree with *some* of your points in your post
here and at
http://z-m-l.com/oyayr/oya-feedback.txt

i strongly disagree with others.

dismissing oya's paper *and* the c.l.i.r. initiative
as "clueless", and discouraging further examination
of what is going to happen there, is dangerous:
some people with money and decisive power are
about to make weighty decisions there.

which places (please note the plural) do you think
would be adequate to develop a few arguments?
which places do you think
would be useful to convince _actors_?

_p.s._: z.m.l.-like encoding above...

  bowerbird [10.06.07 03:56 PM]

first, alain, thanks for pointing out the alternate location:
> z-m-l.com/oyayr/oya-feedback.txt
(that's a link, but you'll need to add the http thingee so it works.)

i did indeed post it here, but it hasn't cleared moderation yet.
i suppose that might be because it had some links in it, but
it might also be because it's rather frank (ok, it's kinda harsh),
so maybe the powers-that-be would rather not post it here...

fine, i don't care. there are lots of soapboxes in cyberspace,
including ones that i pay for myself, so i can speak my mind.

if people want to read what i wrote, they can read it over there.
if they want to continue the thread here -- or anywhere else --
they can. and if nobody cares about these things, so be it...

***

alain said:
> my view is that publishing a white paper,
> a call for comments with defined goals and
> schedule might well _look like_ a good basis for a
> "sustained discussion that would produce solid results".

on the one hand, alain, i might be able to agree with that.

on the other hand, this particular white-paper did not
impress me with its grasp on the significant issues...

nor does the particular plan of action laid out by c.l.i.r.
appear to me to be pointing us in the correct direction,
especially if they cannot frame the issues in the right way.


> dismissing oya's paper *and* c.l.i.r. initiative
> as "clueless"

i was not so much "dismissing" oya's paper _itself_
-- i.e., its content -- so much as the form it took.

yes, the content was relatively uninspiring to me.
as i said, she drew a number of distinctions, but
none of them seemed to be of much importance.
they were mostly either/or, when we need _both_.
so i think she missed the ball, for the most part...

however, most of her recommendations were fine.

how can anyone argue with those suggestions?
at the same time, as i said, they ain't inspiring.

nonetheless, the _format_ of her paper was indeed
"clueless", in the sense that it showed no awareness
of the dimensions that are important in digitization.
as my feedback argued, it showed a paper mentality.
(and if you have counterarguments to that position,
anything at all showing cluefullness, please share it.)

i would certainly hope that it would be the case that
one of the people who has a position of authority in
the scanning effort would be _acutely_aware_ of the
salient dimensions, and that this _deep_recognition_
would manifest itself in every action of that person...

but that was obviously not the case with _this_ paper.

for scholars, repurposing text is extremely important;
yet she's using the format that's most difficult to remix.
and every single aspect of it revealed a paper mentality.

i don't mean to be harsh, but i can see it no other way...


> and discouraging further examination of
> what is going to happen there is dangerous:

well, i don't mean to "discourage further examination",
even if i could. i sincerely doubt i'll have _any_ effect.
people tend to ignore totally a review that's devastating.


> some people with money and decisive power
> are about to take heavy decisions there.

and isn't that always the way it is? i mean, really...

i am quite ready to wash my hands of the whole thing.
i've been dreaming of a cyberlibrary for over 25 years,
and now that it's really happening, the people in charge
are screwing it up, big-time. they are bellyaching about
how "difficult" it is -- it's not, it's as simple as scanning --
and how "expensive" it is -- it'll save money long-run --
yet they're _botching_ the job and _wasting_ big money...

they might as well be the president.


> which places (please note the plural) do you think
> would be adequate to develop a few arguments?

any wiki-slash-listserve would be fine.

but as i said above, this is _not_ a difficult matter, at all.

it should be perfectly clear to _everyone_ what's needed.
you don't need a million dollars to figure out the answer,
let alone two point one nine million dollars, thank you...

i gave the answer above:
> the answer is simple: scan every page nicely,
> and then make every single scan _available_,
> including a relatively clean o.c.r. of its text,
> and let interested volunteers correct that text.
>
> distribute very widely to ensure against loss,
> and update to new media whenever they appear.
> that's all we must do. don't make it difficult.

i also gave it over on the if:book blog a while back,
regarding the specific arena of the academic literature:
> futureofthebook.org/blog/archives/2007/03/aaup_on_open_access_business_a.html#c99776
(that's a link, but you'll need to add the http thingee so it works.)

to summarize what i said there:
> 1. let any author post their own work online,
> without fear of retribution from their publisher
> 2. scholarly work that was funded, in whole or in part,
> by public money can be placed online by any taxpayer
> 3. for deceased authors, who cannot put their own work online,
> any person is allowed to post their material if a publisher has not

put those 3 rules in place, and you'll be absolutely stunned
how rapidly our scholarly record will get replicated digitally.
researchers will do all the work for their own _convenience_.

criminey sakes, volunteers have created an _encyclopedia_
out of _thin_air_ over at wikipedia, in a matter of 7 years...

compared to that, creating an online mirror of something
-- like the academic literature -- which _already_exists_
should be a piece of cake. sure, someone in authority will
need to set up the infrastructure, the framework, but then
after that, it's just a matter of dropping the pages in place.

this is _not_ like putting a man into space.
it's like running to first base. dead simple.


> which places do you think would be
> useful to convince _actors_?

he who pays the piper calls the tune, baby...

so i'd ask google the question. be prepared to
hear absolutely nothing in response, though...

it's not even clear that google will give us digital text,
as opposed to just scans. and where the page-scans
are "owned" by some publisher of academic journals,
it's not even clear that we will get those darn scans!
so the academic community will have to be beggars...
that's how you end up when you give away ownership.

-bowerbird

  Jeff Ubois [10.07.07 05:11 AM]

There is a question of faith underlying part of this dispute: will the books be rescanned?

Believing so has freed some otherwise very conservative institutions to try new things. It has also led to a willingness to accept poor quality, and maybe some disadvantageous deals with commercial vendors.

Based on the microfilm experience, mass rescanning anytime soon seems unlikely. The "oh, it will all be rescanned anyway" argument seems especially bad when it is coming from commercial partners.

The libraries that do not insist on getting full scans back from commercial partners, or that accept perpetual restrictions on the use of those scans, will probably need either to rescan their holdings or to accept that their institutions will lose relevance.

-- Jeff (who wishes the mighty Bowerbird had a bigger budget and more influence)

  Gary Frost [10.09.07 12:36 AM]

What I find interesting in this long scroll is the lack of attention to the print collections themselves as the mastering and back-up commodity. Text authentication, completeness, scale, image quality, and color cannot be exactly confirmed by the on-screen surrogate, so the mastering role is as important as the delivery role; both are needed for access. My own grip on this is expressed by the "leaf master" concept: print collections held almost exclusively as sources for screen presentation. Library trends toward wider and wider sequestering of print for optimal storage and security converge well with the leaf master scenario.

  Kathlin Smith [10.09.07 11:13 AM]

Glad to read that the white paper is generating discussion, as intended. Bowerbird, on your extensive criticism of format:

# i was not so much "dismissing" oya's paper
# _itself_-- i.e., its content -- so much as the
# form it took.

I'm sorry your judgment of the paper has been so influenced by its form. The pdf version was mounted for comment on substance. The final paper will be available in three forms: html, pdf, and print. We do this for all reports, as readers' preferences vary.

Also, you wrote

# i also read there that they've received a grant
# of $2.19 million -- yep, over 2 million bucks --
# to:
# assess the utility to scholars of several
# large-scale digitization projects.

You misread. CLIR was awarded $2.19 million for operational support. The project to assess utility to scholars of large-scale digitization projects was a separate grant of $39,800.

  bowerbird [10.10.07 12:16 PM]

katlin said:
> I'm sorry your judgment of the paper
> has been so influenced by its form.

the medium sent a message.
a very _strong_ message.
stronger even than the
"content" it was carrying.

i believe i explained why.

especially since the message is
on the very topic of the medium.
so the form is indeed relevant.

besides, it is perhaps telling
that of the three forms the
content will eventually take
-- .pdf, .html, and print --
not a one of them will be as
remix-friendly as my version.

so i believe i made my point.

plain-text, especially when it
is edited to a structure lending
itself to typographical rigor
and quality format-conversions,
is an extremely powerful force.
considering its small footprint,
its benefit-to-cost ratio is huge.

and _that_, in a nutshell, is
a major factor that all the big
scanning projects are missing...

-bowerbird

  bowerbird [10.10.07 12:19 PM]

kathlin-

sorry for misspelling your name above!

-bowerbird

  bowerbird [11.06.07 04:05 PM]

since it's now a month later, i might as well
announce that a .pdf that i created is now up:
> http://z-m-l.com/oyayr/oyayr.pdf

this .pdf was auto-generated from the .zml file:
> http://z-m-l.com/oyayr/oyayr.zml

you will discover the .pdf has _extensive_ links
-- with ugly lines around 'em, so you'd notice --
both internal and external ones, and all of those
links were auto-generated, per .zml philosophy...

-bowerbird
