Checking Copyright

At the DLF Fall Forum today in Philadelphia, Mimi Calter [pdf] presented a paper on an examination of the Copyright Registration database, which Carl Malamud and I have been active in “liberating.”

Stanford has been working on creating a full-text database of copyright renewal records for books published between 1923 and 1963; renewals were required after 28 years, so the relevant examination period for this paper was 1950-1992. The project tracks books only (no serials, music, etc.), and is supported by the Mellon Foundation.

Data for 1950-1977 was compiled from the published Catalog of Copyright Entries and from Project Gutenberg transcripts. Edits to internal Copyright Office records could not be incorporated in the database. Records from 1978 onward were harvested from the online database and, per the work of Joel Hardi at Public Resource, are now available at http://rss.resource.org. The Stanford datafile, which was indexed with Lucene, is available by request.

This project is important because it is a critical input to an online copyright analysis system; since renewals were required during this period for copyright extension, there are potentially large numbers of works which have fallen into the public domain. The analysis is particularly valuable because of the lack of database records prior to 1978.

From the selected period, 545 records were examined manually, and about 100 records were searched online for comparison.

Startlingly, over 30 percent of the searched items had been renewed; this was higher than many people anticipated.

Although gross, crippling errors were relatively few, there were many inconsistencies: internal CO formats changed from one year to another; fields were sometimes concatenated or left unlabeled; unique identifiers were often missing; registration numbers and dates were often omitted.

This important work points out two critical things for me. The first, and in some ways the most critical, is to figure out how to merge the Copyright Renewal database with a major bibliographic database, such as that of the Library of Congress, or a major university catalog, such as the Univ. of California’s Melvyl. This would both enrich the Copyright database and augment the ability of book catalogs to provide authoritative information on copyright status. As my friend Karen Coyle said in comments on my blog post, “Making a Brouhaha in the Blogosphere,” “If we ever do get MARC records connected to these, we need to upgrade the copyright database with decent bibliographic data.” (I have heard that Brewster Kahle and the OpenLibrary are working on this problem.)

This data merge is tremendously complicated by the lack of unique identifiers in the Copyright database, requiring a multi-stage or fuzzy merge. A merge based on something like Bowker’s Books in Print would be unsuccessful as ISBNs were only assigned prospectively from 1967 onwards.
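
To make the idea of a multi-stage merge concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the method used by Stanford or anyone else mentioned here: the record fields (id, title, author, year), the sample records, and the 0.85 similarity threshold are all invented for the example. Stage one looks for an exact match on a normalized title and author within the same year; stage two falls back to approximate string matching using difflib from the standard library.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return " ".join(text.split())


def match_key(record):
    """Build a comparison key from the (hypothetical) title and author fields."""
    return normalize(record["title"]) + "|" + normalize(record["author"])


def merge(renewals, catalog, threshold=0.85):
    """Return (renewal_id, catalog_id, stage) tuples for matched records."""
    # Index catalog entries by normalized key and year for the exact stage.
    exact = {}
    for entry in catalog:
        exact.setdefault((match_key(entry), entry["year"]), entry)

    matches = []
    for ren in renewals:
        key = match_key(ren)

        # Stage 1: exact match on normalized title/author within the same year.
        hit = exact.get((key, ren["year"]))
        if hit is not None:
            matches.append((ren["id"], hit["id"], "exact"))
            continue

        # Stage 2: fuzzy match, restricted to catalog entries from the same year.
        best, best_score = None, 0.0
        for entry in catalog:
            if entry["year"] != ren["year"]:
                continue
            score = SequenceMatcher(None, key, match_key(entry)).ratio()
            if score > best_score:
                best, best_score = entry, score
        if best is not None and best_score >= threshold:
            matches.append((ren["id"], best["id"], "fuzzy"))
    return matches


# Invented sample records, for illustration only.
renewals = [
    {"id": "ren-1", "title": "A History of Imaginary Rivers.",
     "author": "Doe, Jane", "year": 1950},
    {"id": "ren-2", "title": "Collected Esays on Nothing",
     "author": "Roe, Richard", "year": 1950},
    {"id": "ren-3", "title": "An Unlisted Title",
     "author": "Poe, Peter", "year": 1950},
]
catalog = [
    {"id": "cat-1", "title": "A history of imaginary rivers",
     "author": "Doe, Jane", "year": 1950},
    {"id": "cat-2", "title": "Collected essays on nothing",
     "author": "Roe, Richard", "year": 1950},
]

for match in merge(renewals, catalog):
    print(match)
```

In this sketch the first renewal matches exactly once case and punctuation are normalized, the second (with a misspelled title) is caught only by the fuzzy stage, and the third is left unmatched. A real merge would likely need better blocking than the year alone, and human review of the fuzzy matches.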

The second requisite, I feel, is an ongoing service provided by the Copyright Office on behalf of the public.

Here is the text of a letter (mildly edited) that I wrote to Deanna Marcum, the Associate Librarian for Library Services, and one of the DLF Board members, requesting a conversation about these kinds of services:

The desire is not to get the registration records on a one time basis, but the ability to continuously obtain the records, such that [they] are offered as a service on the network by the CO. Users could elect the whole db, or update, or new records; a feed would be available even through such a simple means as RSS on a daily basis. [N.B.: Public Resource is doing this as a fallback until the CO initiates such a service directly.]

There may be ways of designing a service such that the library community as a whole could enrich these records with bibliographic data, or correct simple mistakes that impede their use and lessen their value. In fact, Karen Coyle related that the first record she looked at had a spelling error in the title. There may be a way to work with the registration records; certainly, day-lighting them more openly will enrich immeasurably their value [for] all of us, permitting the construction of services that recognize and honor the rights of these materials far more faithfully.

At any rate, I [have] only begun to think through such [mutually beneficial] community based solutions, and should have tried to focus more on them earlier, but they exist for the Library to exploit to everyone’s advantage. I think there would be tremendous enthusiasm among research libraries to hold a conversation with the CO and the Library [of Congress] on how such services might be constructed, and I am happy to aid such dialogue in whatever way might be useful, such as facilitating such a meeting among our members.

I am still looking forward to engaging the Copyright Office.
