The search for a minimum viable record

Open Library's George Oates on the pursuit of concise categorization.

At first blush, bibliographic data seems like it would be a fairly straightforward thing: author, title, publisher, publication date. But that’s really just the beginning of the sorts of data tracked in library catalogs. There’s also a variety of metadata standards and information classification systems that need to be addressed.

The Open Library has run into these complexities and challenges as it seeks to create “one web page for every book ever published.”

George Oates, Open Library lead, recently gave a presentation in which she surveyed audience members, asking them to list the five fields they thought necessary to adequately describe a book. In other words, what constitutes a “minimum viable record”? Akin to the idea of the “minimum viable product” for getting a web project coded and deployed quickly, the minimum viable record (MVR) could be a way to facilitate an easier exchange of information between library catalogs and information systems.

In the interview below, Oates explains the issues and opportunities attached to categorization and MVRs.

What are some of the challenges that libraries and archives face when compiling and comparing records?

George Oates: I think the challenges for compilation and comparison of records rest in different styles, and the innate human need to collect, organize, and describe the things around us. As Barbara Tillett noted in a 2004 paper: “Once you have a collection of over say 2,000 items, a human being can no longer remember every item and needs a system to help find things.”

I was struck by an article I saw on a site called Apartment Therapy, about “10 Tiny Gardens,” where the author surveyed extremely different decorations and outputs within remarkable constraints. That same concept can be dropped into cataloging, where even in the old days, when librarians described books within the boundaries of a physical index card, great variation still occurred. Trying to describe a book on a 3×5 card is oddly reductionist.

It’s precisely this practice that’s produced this “diabolical rationality” of library metadata that Karen Coyle describes [slide No. 38]. We’re not designed to be rational like this, all marching to the same descriptive drum, even though these mythical levels of control and uniformity are still claimed. It seems to be a human imperative to stretch ontological boundaries and strive for greater levels of detail.

Some specific categorization challenges are found in the way people’s names are cataloged. There’s the very simple difference between “Lastname, Firstname” and “Firstname Lastname” or the myriad “disambiguators” that can help tell two authors with the same name apart — like a middle initial, a birthdate, title, common name, etc.
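As a rough illustration of the name-ordering problem, the sketch below folds “Lastname, Firstname” and “Firstname Lastname” into a single matching key. The function `name_key` and the sample names are hypothetical, not Open Library code, and the sketch deliberately ignores disambiguators such as birthdates and titles.

```python
import re


def name_key(raw: str) -> str:
    """Return a crude matching key for a personal name (illustrative only)."""
    raw = raw.strip()
    if "," in raw:
        # "Lastname, Firstname" -> "Firstname Lastname"
        last, first = raw.split(",", 1)
        raw = f"{first.strip()} {last.strip()}"
    # Treat periods as spaces, collapse runs of whitespace, lowercase.
    raw = raw.replace(".", " ")
    return re.sub(r"\s+", " ", raw).strip().lower()


# Both orderings of the same made-up name produce the same key.
print(name_key("Doe, Jane Q.") == name_key("Jane Q. Doe"))
```

A key like this only helps with ordering and punctuation. Telling apart two different authors who share a name still needs the extra disambiguating data described above.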

There are also challenges attached to the normal evolution of language, and a particular classification’s ability to keep up. An example is the recent introduction of the word “cooking” as an official Library of Congress Subject Heading. “Cooking” supersedes “Cookery,” so now you have to make sure all the records in your catalog that previously referred to “Cookery” now know about this newfangled “Cooking” word. This process is something of an ouroboros, although it’s certainly made easier now that mass updates are possible with software.
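A mass update of this kind can be sketched in a few lines. The record layout (a dict with a `subjects` list), the `SUPERSEDED` table, and the sample catalog are assumptions made for illustration. A real catalog would also have to handle subdivided headings, which this sketch does not attempt.

```python
# Old heading -> new heading. Only the pair named in the text is included.
SUPERSEDED = {"Cookery": "Cooking"}


def update_subjects(records, superseded=SUPERSEDED):
    """Rewrite superseded headings in place; return how many records changed."""
    changed = 0
    for record in records:
        old = record.get("subjects", [])
        new = [superseded.get(subject, subject) for subject in old]
        if new != old:
            record["subjects"] = new
            changed += 1
    return changed


# Hypothetical two-record catalog.
catalog = [
    {"title": "Example Cookbook", "subjects": ["Cookery", "Baking"]},
    {"title": "Example Atlas", "subjects": ["Maps"]},
]
print(update_subjects(catalog), catalog[0]["subjects"])
```

Running the update a second time changes nothing, which is what makes this kind of pass safe to repeat whenever a heading list is revised.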

A useful contrast to all this is the way tagging on Flickr was never controlled (even though several Flickr members crusaded for various patterns). Now, even from this chaos, order emerges. On Flickr it’s now possible to find photos of red graffiti on walls in Brooklyn, all through tags. Using metadata “native” to a digital photograph, like the date it was taken, and various camera details, you can focus even deeper, to find photos taken with a Nikon in the winter of 2008. Even though that’s awesome, I’m sure it rankles professionals since Flickr also has a bunch of photos that have no tags at all.

In a blog post, you wrote about “a metastasized level of complexity.” How does that connect to our need for minimum viable records?

George Oates: What I’m trying to get at is a sense that cataloging is a bit like case law: special cataloging rules apply in even the most specific of situations. Just take a quick glance at some of the documentation on cataloging rules for a sense of that. It’s incredible. As a librarian friend of mine once said, “Some catalogers like it hard.”

At Open Library, we’re trying to ingest catalogs from all over the place, but we’re constantly tripped up by fields we don’t recognize, or things in fields that probably shouldn’t be there. Trying to write an importing program that’s filled with special treatments and exceptions doesn’t seem practical since it would need constant tweaking to keep up with new styles or standards.

The desire to simplify this sort of thing isn’t new. The Dublin Core (DC) initiative came out of a meeting hosted by OCLC in 1995. There are now 15 base DC fields that can describe pretty much anything, and DC is widely used as an approachable standard for all sorts of exchanges of data today. All in all, it’s really successful.

Interestingly, after 16 years, DC now has an incorporated organization, loads of contributors, and documentation that seems much more complex than “just use these 15 fields for everything.” As every good archivist would tell you, it’s better to archive something than nothing, and to get as much information as you can from your source. The temptation for us is to keep trying to handle any kind of metadata at all times, which is super hard.

How do you see computers and electronic formats helping with minimum viable records?

George Oates: MVR might be an opportunity to create a simpler exchange of records. One computer says “Let me send you my MVR for an initial match.” If the receiving computer can interpret it, then the systems can talk and ask each other for more.
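The handshake described here might look something like the sketch below. It assumes, purely for illustration, an MVR made of the four fields named at the top of this article (title, author, publisher, publication date). The `MVR` class, the `receive` function, and the reply strings are all hypothetical, since no actual MVR format is defined in the interview.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class MVR:
    # Assumed field set; the interview does not specify one.
    title: str
    author: str
    publisher: str
    publish_date: str


def _norm(value: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(value.lower().split())


def receive(incoming: dict, local: MVR) -> str:
    """Decide how the receiving system answers an incoming MVR."""
    mine = asdict(local)
    if set(incoming) != set(mine):
        # Unrecognized or missing fields: the record cannot be interpreted.
        return "cannot interpret"
    if all(_norm(str(incoming[key])) == _norm(mine[key]) for key in mine):
        return "match: ask for more"
    return "no match"


local = MVR("An Example Title", "Jane Doe", "Example Press", "1999")
incoming = {"title": "an  example title", "author": "JANE DOE",
            "publisher": "Example Press", "publish_date": "1999"}
print(receive(incoming, local))
```

The point of keeping the field set tiny and fixed is visible in the first check: the receiver either understands the whole record or says so immediately, with no special cases to maintain.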

The tricky part about digital humanities is that its lifeblood is in the details. For example, this section from the Tillett paper I mentioned earlier looked at the relationship between precision and recall:

Studies … looked at precision and recall, demonstrating that the two are inversely related — greater recall means poorer precision and greater precision means poorer recall — high recall being the ability to retrieve everything that relates to a search request from the database searched, while precision is retrieving only those relevant to a user.

It’s a huge step to sacrifice detail (hence, precision) in favor of recall. But, perhaps that’s the step we need, as long as recall can elicit precision, if asked. Certainly in the case of computers, the less fiddly the special cases, the more straightforward it is to make a match.
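To make the trade-off concrete, the sketch below computes both measures using their standard definitions: precision is the share of retrieved items that are relevant, and recall is the share of relevant items that were retrieved. The identifiers and the two sample searches are invented for illustration.

```python
def precision_recall(retrieved: set, relevant: set):
    """Return (precision, recall) for one search result set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four books are truly relevant to the (hypothetical) query.
relevant = {"b1", "b2", "b3", "b4"}

# A narrow search returns only relevant books but misses half of them.
narrow = {"b1", "b2"}

# A broad search finds every relevant book but drags in four others.
broad = {"b1", "b2", "b3", "b4", "x1", "x2", "x3", "x4"}

print(precision_recall(narrow, relevant))  # (1.0, 0.5)
print(precision_recall(broad, relevant))   # (0.5, 1.0)
```

A sparse record pushes a system toward the broad case: more candidates come back, and a follow-up request for detail is what restores precision.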

Photos: Catalogue by Helga’s Lobster Stew, on Flickr; profile photo by Derek Powazek

This interview was edited and condensed.

Related:

  • The Library of the Commons: Rise of the Infodex
  • Rethinking museums and libraries as living structures
  • The quiet rise of machine learning

    • http://jakoblog.de Jakob

      The missing question here is: how does MVR relate to the 15 basic Dublin Core fields? Simply saying that DC evolved into something more complex is not convincing. How should MVR be different?