Four short links: 14 May 2009

Open Source Ebook Reader, Libraries and Ebooks, Life Lessons, and Government Licenses

  1. Open Library Book Reader — the page-turning book reader software that the Internet Archive uses is open source. One of the reasons library scanning programs are ineffective is that they try to build new viewing software for each scan-a-bundle-of-books project they get funding for.
  2. Should Libraries Have eBooks? — blog post from an electronic publisher made nervous by the potential for libraries to lend unlimited “copies” of an electronic work simultaneously. He suggests turning libraries into bookstores, compensating publishers for each loan (interestingly, some of the first circulating libraries were established by publishers and booksellers precisely to have a rental trade). I’m wary of the effort to profit from every use of a work, though. I’d rather see libraries limit simultaneous access to in-copyright materials if there’s no negotiated license opening access to more. Unlike the author, I don’t see this as a situation that justifies DRM, whose poison extends past the term of copyright. (via Paul Reynolds)
  3. Lessons Learned from Previous Employment (Adam Shand) — great summary of what he learned in the different jobs he’s had over the years. Sample:
    • More than any other single thing, being successful at something means not giving up.
    • Everything takes longer than you expect. Lots longer.
    • In a volunteer based non-profit people don’t have the shared goal of making money. Instead every single person has their own personal agenda to pursue.
    • Unfortunately “dreaming big” is more fun and less work than “doing big”.

  4. Flickr Creates New License for White House Photos (Wired) — photos from the White House photographer were originally CC-licensed (yay, a step forward) but when it was pointed out that, as government-produced information, those photos couldn’t be copyrighted, the White House relicensed them as “United States Government Work”. Flickr had to add the category, which differs from “No Known Copyright”, and it’s something that all sharing sites will need to consider if they are going to offer their service to the Government.
  • Perry Willett

    >One of the reasons library scanning programs are
    >ineffective is that they try to build new viewing
    >software for each scan-a-bundle-of-books project
    >they get funding for.

    Nat, there’s a lot to unpack there, but I wonder if you can point to examples of this. Also, I’d be interested in hearing the other reasons you have in mind about why library scanning programs are ineffective.

  • bowerbird

    nat, i’d be interested in such a conversation too… :+)


  • @Bowerbird @Perry — from my (admittedly small) exposure, it looks like many large libraries have annual budgets and treat scanning on a project-by-project basis. That is, there’s a budget to scan all the X documents in time for the X anniversary next year. Then they build a custom web site for the X Anniversary and include a custom reader using whatever technology seems most apt to the documents: page-turning, images, etc.

    As a result, the emphasis isn’t on building a sustainable long-term scanning program that digitizes all a library’s holdings because of the wasted time in building and justifying the new projects, and building the new reader interface. Also, because the scanning is viewed primarily as a one-off project and not as part of an effort to build a digital library, there’s no crowdsourced metadata/OCR error correction/etc.

    Now it may well be that I’ve only seen libraries in the early stages of digitization and everyone passes through this project stage before they realize they need to systematically scan and augment their holdings.

    Anyway, that’s what I’ve seen. I’d love to know what you’ve seen.

  • Under the heading “Dunedin Montessori (1991-1993)”, Adam’s got the ‘lesson’ “having a job you can do stoned isn’t all it’s cracked up to be.” Hope the guy wasn’t in charge of other people’s children.

    Also, as I’m someone currently organizing a volunteer-based nonprofit, the lesson “in a volunteer based non-profit people don’t have the shared goal of making money. Instead every single person has their own personal agenda to pursue” is a bit disheartening. Hopefully we can turn that to our advantage instead of letting it build acrimony.

  • @Ken – I don’t think that people having their own personal agendas should be viewed as disheartening. Rather, the intended lesson was that if you approach volunteer staff with the same attitude as you approach paid staff, you will quickly discover that you are in for some surprises. People’s motivations for being part of a non-profit vary *vastly*; as such, it requires more work and understanding on the part of the manager/coordinator.


    PS. I was a night janitor at the school. :-)

  • Perry Willett

    @Nat: I think this is the standard development path that a lot of digital libraries have followed. Typically, they first get one-time funding (grant, donation, end-of-year budget funds) for a project. They hire one or more staff members, digitize some materials and develop an access system, or implement an existing one.

    Then, more funds are found for additional work. They are now working with external partners, and they have additional requirements not part of the old system, so they develop or modify a new (to them) system. They don’t have staff to migrate the first collection, so they keep it running under the old system. This goes on for perhaps a couple more iterations, but eventually they build a permanent staff for their digital library and a budget line for digitization, and are faced with a set of legacy collections and systems. Migrating those collections is not nearly as interesting and exciting as working on new ones, so it becomes a lower priority until they realize how much time they’re spending on maintaining multiple access systems.

    I don’t think any library starts with an annual budget for digitizing, but instead they build up to it. They may try to go from 0 to 60 in one step, but most digital libraries of any size and scope have gotten there by building infrastructure and experience incrementally. This means some missteps along the way, but I don’t know that this makes them “ineffective.” We’re at an early enough stage that it’s still part of the process.

  • Re: Should Libraries Have eBooks?

    The same argument could be used for movie rentals, especially the digital delivery on-demand model Netflix uses. Obviously the movie industry isn’t going broke, so whatever license fee is used, it is workable, and I get to watch movies for almost nothing.

    I once worked out that a single copy of a song could theoretically be shared on a sequential basis for next to nothing, all perfectly legal using a quasi buy/sell model of a single CD or song’s content.

    Again, this all boils down to the length of copyright. If it were short, like the patents it used to resemble, then we could have most books in the public domain after 20 years, once the author and publisher have recouped their commercial interests. The longer you make the copyright period, the more informational content looks like tangible property and the more the barriers will go up to capture all the gains.

  • Re: Open Library Book Reader

    I like the functionality. It’s a bit crude, but nothing that isn’t fixable. I’d like to know much more about this approach – where are the page images and extracted text stored? What are the possibilities for extending the functionality? Is it likely to become a standard, or is this just an interesting project?

  • bowerbird

    fixating on the web-reader is jumping into the middle.

    i’d say that we need to begin by making a solid
    list of the purposes we believe need to be served.
    my experience says there are easily a half-dozen.

    then we need to brainstorm ways to serve them…

    and yes, the focus should not be small collections.

    indeed, i’d hike up that logic to the highest levels,
    and argue that there shouldn’t be any duplication
    of our functionality across _any_ institutions at all.

    there’s no reason for two libraries to mount all of
    the same books, let alone for two hundred to do it.

    institutions should collaborate on a _joint_ system
    that would mount the entirety of the public domain,
    sharing the costs and the benefits among them all
    — and taxpayers who financed libraries all along…

    (copies could be stored at several different schools,
    but that would just be redundancy as our safeguard.)

    of course, institutional collaboration usually produces
    a huge mess with a suffocating blanket of bureaucracy,
    and the systems libraries created so far, on their own,
    are a shining example of that, so that’s a big problem…

    and this example from the internet archive is no better.

    for instance, let’s look at the questions that alex raised:
    where are the page images and the extracted text stored?

    you’d think that would be fairly obvious, wouldn’t you?

    i mean, after all, these are public-domain books, right?

    so one of the “purposes” that should be on our “list” is
    to enable quick, easy, transparent access to this content.
    at least that would be one of the top things on _my_ list.

    but yet, like far too many other library systems out there,
    this information is _not_ readily apparent to the public,
    even someone who is technically skilled like alex tolley.

    i would say, at a bare minimum, that the u.r.l. of each
    and every scan of a book should be easy to figure out,
    which requires us to use a sensible naming convention.

    moreover, the text for each and every page should also
    be easily accessible, and the _concatenated_ text from
    _all_ the pages in the book should be readily available…

    and, by the way, isn’t it kind of stupid to be serving up
    the _scan_ of a page, instead of its actual digital text?
    scans are wasteful of bandwidth, and are inconvenient
    in the sense that their text can’t be searched or copied.

    oh yeah, in case it wasn’t obvious, one of the purposes
    that should be on our list up above is to _correct_ the
    _o.c.r._ for these scansets, since it is often very crappy.
    you’d be amazed at the huge improvements that can be
    produced by just a little smidgen of clever programming.

    i’ve done a lot of work in this arena, so i could go on…
    but that’s enough for now. let’s hear from more people.
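To make the OCR-correction point above concrete, here is a minimal sketch of dictionary-guarded substitution: a candidate fix is applied to a token only when the original token is unknown but the corrected form is a known word, which is one cheap way a "smidgen of clever programming" can lift OCR quality. The confusion pairs and the tiny wordlists are illustrative assumptions, not tables from any real scanning program.

```python
# Dictionary-guarded OCR cleanup: only substitute when the result
# turns an unknown token into a known word.
CONFUSIONS = [("rn", "m"), ("1", "l"), ("0", "o"), ("vv", "w")]

def correct_token(token, wordlist):
    low = token.lower()
    if low in wordlist:              # already a known word: leave it alone
        return token
    for bad, good in CONFUSIONS:
        candidate = low.replace(bad, good)
        if candidate != low and candidate in wordlist:
            return candidate         # the fix rescued an unknown token
    return token                     # no safe fix found

def correct_text(text, wordlist):
    return " ".join(correct_token(t, wordlist) for t in text.split())
```

The guard matters: a blind "rn" → "m" substitution would mangle words like "barn", but here a token already in the wordlist is never touched.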


  • That article on “Should Libraries Have eBooks?” seems to completely ignore the fact that lots of libraries already have them. I have cards at multiple libraries that let me access books from my home. For example, O’Reilly Safari is a great resource that doesn’t require me to go to the library for access. Sure, I can’t use it on a kindle or a Sony e-book reader, but those aren’t the devices of now, and there is nothing preventing O’Reilly from using the same business model to provide the books to those devices.

    Sometimes my library runs out of slots on Safari, so I can’t get a book. Or the library pays for a certain part of the O’Reilly collection, so I have to go to a different library to access a book online. And sometimes none of my libraries have access to the book, so I can either ask them to pay for it or go buy it myself. All-in-all, it works well.

    Also, the author of the article claims that libraries should finance the corporate sector’s digitization costs. I’m sorry, but if you’re a publisher and you don’t have a simple way for the books you’re putting on the market today to be digital, you’re doing it wrong.

  • @ bowerbird: You raise some very good points which I would like followed up by Nat.

    However I would disagree with this comment: “isn’t it kind of stupid to be serving up the _scan_ of a page, instead of its actual digital text?”

    In some cases I agree, e.g. plain-text books like novels. But the bird book is an excellent example of the use of images. Many books would benefit from showing the full original content exactly as it was published. The extraction of the text by OCR and other technologies, which allows search, is very nice, and it could still be selected if necessary, perhaps with a note where images are missing and a hyperlink to retrieve the original page image.

  • bowerbird

    alex said:
    > the bird book is an excellent example of the use of images.
    > many books would benefit from showing
    > the full original content exactly how it was published.

    ok, alex, first of all, i firmly believe that end-users should be
    able to call up the page-images if that’s what they want to do.
    remember, i urged that their u.r.l. be transparently obvious?

    more importantly, i strongly stress that _images_ in the book
    be presented even when we serve the digital text of the book.
    i’m sure that you’ve noticed that the web is fully capable of
    having pages these days containing _both_ text and pictures. ;+)

    the point is that it’s silly to serve page-images, and _only_ that,
    as our sole methodology when building an online reader-app,
    because that’s wasteful of bandwidth, and _also_ sub-optimal.

    and, just to go one step further, sometimes it is desirable to
    put _both_ digital text _and_ the page-scan, on the same page.
    specifically, we’ll want to do this so people can do _proofing_.
    (or, more generally, so they can _verify_ that our digital text
    was accurately transcribed from the scan, when they doubt it.)
    this is one of the “purposes” our online-viewer should serve.
    now, if you’ve ever done this type of task, you will know that
    it is facilitated greatly if the text’s linebreaks match the scan,
    so that’s a condition that our online-viewer needs to meet…
    that’s why you need to know what you want to _accomplish_
    before you build your online-viewer, so it will serve your needs.

    and, while i’m still here, i’ll add one more point, which is that
    — in my experience — an online-viewer like this should be
    a tool of last resort. downloading one-scan-at-a-time is silly,
    and particularly so when you don’t even bother to _save_ it
    on the end-user’s machine, meaning that if they want to see it
    again in the future, they will have to download it yet again…
    not only does it waste bandwidth, it fails to empower people.

    i think it’s much more intelligent to build a program that actively
    downloads and saves the digital text to the end-user’s machine.
    the average book is small, consisting of a few hundred kilobytes.
    a typical book might have no images in it, or just a frontispiece,
    or that and a few pictures inside, so they download quickly too.

    then, in the rare case where an end-user needs a particular scan
    of a page, our _offline_ viewer-application could download that.
    (but of course, the person can also download all of the scans
    in a one-click batch operation, if that’s what they want to do.)
    and, following the logic above, any downloaded scan is saved,
    on the user’s machine, so they don’t have to download it again.

    i’ve built working prototypes of all of the pieces in my system,
    so i know that my system works, and i know it works well, and
    i know that it’s far superior to the complex and bloated systems
    which are being built, over and over, by the library technocrats.

    further, most pieces in my system are already mounted publicly,
    so you can look at them and come to the very same conclusion.
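The download-once, save-locally behaviour described above can be sketched in a few lines. The fetch function is injected as a plain callable so the sketch stays self-contained; the cache layout and the remote URL scheme are invented for illustration, not taken from any real viewer.

```python
import os

class CachingViewer:
    """Fetches a page (text or scan) at most once; later requests
    are served from a local cache directory."""

    def __init__(self, cache_dir, fetch):
        self.cache_dir = cache_dir
        self.fetch = fetch            # callable: url -> bytes
        os.makedirs(cache_dir, exist_ok=True)

    def get(self, book_id, page, kind="txt"):
        # A predictable local name mirrors a predictable remote name.
        name = f"{book_id}-p{page:04d}.{kind}"
        path = os.path.join(self.cache_dir, name)
        if os.path.exists(path):      # cache hit: no network traffic
            with open(path, "rb") as f:
                return f.read()
        data = self.fetch(f"https://example.org/books/{name}")  # hypothetical URL
        with open(path, "wb") as f:   # save it so it's never fetched again
            f.write(data)
        return data
```

The point of the injected fetcher is that the caching logic is independent of how the bytes arrive, which also makes the behaviour easy to verify without a server.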


  • @ bowerbird: “further, most pieces in my system are already mounted publicly,
    so you can look at them and come to the very same conclusion.”

    URL?

  • On further reflection, the “skin” is of less importance than source.

    That bowerbird and I can disagree on page formats based on intangibles like composition, fidelity to the original book, available bandwidth (today and tomorrow) suggests that there is a wide range of possible reader functionality possible. Anyone could build one given access to the data.

    Looking at Google Books, perusing the html shows that the data is far more transparent to the outside world than with the Open Library Book Reader. Aside from the controversial copyright issues, I wonder if the better solution is for Google to open up its data/API so the market can build readers based on their content. If we could agree on a common data format and API, then even copyright and private library collections could be accessed by all readers.

  • bowerbird

    alex said:
    > URL?

    first let’s make that list of purposes we want to serve.
    viewing my list first might truncate your imagination.

    > there is a wide range of possible reader functionality possible.


    > Anyone could build one given access to the data.

    and we’re likely to disagree on the effectiveness of each one
    if we don’t have a solid idea of the functionality that we need.

    so we’ll end up with a dozen online-viewers, all of ’em inferior.

    > Looking at Google Books, perusing the html shows that
    > the data is far more transparent to the outside world
    > than with the Open Library Book Reader.

    well, yes and no.

    you can scrape the “data” out of google’s .html, certainly.
    but then you’ve got the “data” for one page in one book…

    scraping and concatenating for an entire book is a chore;
    and doing it for the millions of books google has scanned?

    internet archive, on the other hand, offers other ways to
    obtain the “data” for a book, which have nothing to do with
    their book-reader, so they cover that base a different way.
    (they don’t cover it particularly _well_, but they do cover it.)

    > Aside from the controversial copyright issues,
    > I wonder if the better solution is for Google to
    > open up its data/API so the market can
    > build readers based on their content.

    even ignoring the fact that the opening clause manages to
    “pretend” the 8000-pound elephant in the room isn’t there,
    there’s not one good reason to think google would _want_
    “the market” to build readers based on google’s content…
    google wants to keep that competitive advantage to itself.

    > If we could agree on a common data format and API

    a lot of programmers like to think in terms of an “a.p.i.”,
    but that’s the wrong approach. we need to design the
    system so that all of the data is exposed _without_ an
    a.p.i., by virtue of a clean and simple design that
    surfaces the content and can be understood by a 4th-grader.


    p.s. alex, you do good analysis when you look at stuff.
    so go over and look at the “mirlyn” system at umichigan.
    look for “the jungle” by upton sinclair, for a comparison.

    p.p.s. it also appears that you might be interested in
    programming your own online-reader. if so, make the
    list of purposes i discussed above, so we can discuss it,
    and then i’ll tell you how you can access my materials
    as content for your online-reader. we’ll work together.
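bowerbird’s “no a.p.i. needed” position amounts to a naming convention simple enough to write down by hand: from a book id and a page number alone, anyone can construct the address of a scan, a page’s text, or the whole concatenated text. A toy illustration — the base URL, the zero-padding width, and the file layout are all invented for this sketch:

```python
BASE = "https://example.org/library"   # hypothetical host

def scan_url(book_id, page):
    """Page image: <base>/<book-id>/p0007.png, page zero-padded to 4 digits."""
    return f"{BASE}/{book_id}/p{page:04d}.png"

def page_text_url(book_id, page):
    """Per-page OCR text sits beside the scan, same name, .txt extension."""
    return f"{BASE}/{book_id}/p{page:04d}.txt"

def full_text_url(book_id):
    """Concatenated text of the entire book."""
    return f"{BASE}/{book_id}/full.txt"
```

Because the scheme is deterministic, no discovery call is required: the convention itself is the interface.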

  • bowerbird – let me get back to you off this thread about putting together a functional spec/wishlist for an eBook system.

    You can reach me at:

  • bowerbird

    as always, bowerbird at aol dot com.


  • bowerbird

    this is one big problem with the blogosphere:
    it has a severely truncated attention-span that
    fails to show anything close to the dedication
    and tenacity that’s needed to solve a problem.

    oh shiny! look shiny!

    and twitter only makes it worse…


  • bowerbird

    is it just me, or is the openlibrary reader broken?


  • bowerbird

    over and above the a.d.d. of the blogosphere,
    the bigger problem here is that the big entities
    that are mounting viewers for online libraries
    apparently feel absolutely no need to engage
    in dialog with the public about their efforts…

    perry from umichigan hasn’t posted back yet,
    and nobody from openlibrary has visited here.

    it’s very difficult to convince ourselves that
    these people want to hear feedback from us.


  • @bowerbird: this is just a blog–I certainly don’t monitor the Internet for every occurrence of my name so that I can respond wherever I’m mentioned. If you want to engage in conversation with Perry or with the book reader folks, send them email. Radar doesn’t have a “subscribe to this thread” option — perhaps it should, to enable conversations like this to continue.

  • bowerbird

    nat, it was an observation, not an accusation.

    besides, alex and i will keep this thread going.

    (and the backchannel e-mail he sent me shows
    he’s done some good thinking on these matters,
    so you might just want to bookmark this thread.)