Checking Copyright

Wed, Nov 7, 2007

At the DLF Fall Forum today in Philadelphia, Mimi Calter [pdf] presented a paper on an examination of the Copyright Registration database, which Carl Malamud and I have been active in "liberating."

Stanford has been working on creating a full-text database of copyright renewal records for books published between 1923 and 1963; renewals are required after 28 years, so the relevant examination period for this paper was 1950-1992. The project tracks books only (no serials, music, etc.), and is supported by the Mellon Foundation.

Data for 1950-1977 was compiled from the published Catalog of Copyright Entries and from Project Gutenberg transcripts. Edits to internal Copyright Office records could not be incorporated in the database. Records from 1978 onward were harvested from the online database and, through the work of Joel Hardi at Public Resource, are now available at http://rss.resource.org. The Stanford datafile, which was indexed with Lucene, is available by request.

This project is important because it is a critical input to an online copyright analysis system; since renewals were required during this period for copyright extension, there are potentially large numbers of works which have fallen into the public domain. The analysis is particularly valuable because of the lack of database records prior to 1978.

From the selected period, 545 records were examined manually, and about 100 records were searched online for comparison.

Startlingly, over 30 percent of the searched items had been renewed; this was higher than many people anticipated.

Although gross, crippling errors were relatively rare, there were many inconsistencies: internal CO formats changed from one year to another; fields were sometimes concatenated or left unlabeled; unique identifiers were often missing; registration numbers and dates were often omitted.
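To make the kind of inconsistency described above concrete, here is a minimal sketch of an audit pass that reports which fields a record lacks. The field names and the sample record are my own assumptions for illustration; they are not the layout of the Stanford datafile or of the Copyright Office records.

```python
# Field names are hypothetical, chosen only for this illustration.
REQUIRED_FIELDS = (
    "title",
    "author",
    "registration_number",
    "registration_date",
    "renewal_id",
)


def audit(record):
    """Return the required fields that are absent or blank in a record."""
    return [f for f in REQUIRED_FIELDS if not str(record.get(f) or "").strip()]


def summarize(records):
    """Count how often each required field is missing across all records."""
    counts = {f: 0 for f in REQUIRED_FIELDS}
    for record in records:
        for field in audit(record):
            counts[field] += 1
    return counts


# A made-up record with a blank author, no date, and no renewal identifier.
sample = {
    "title": "A History of River Towns",
    "author": "",
    "registration_number": "A12345",
    "registration_date": None,
}
print(audit(sample))
```

A tally like the one `summarize` produces would show which omissions are most common before any attempt is made to merge the records with other data.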

This important work points out two critical things for me. The first, and in some ways the most critical, is to figure out how to merge the Copyright Renewal database with a major bibliographic database, such as the Library of Congress's, or a major university catalog, such as the Univ. of California's Melvyl. This would both enrich the Copyright database and augment the ability of book catalogs to provide authoritative information on copyright status. As my friend Karen Coyle said in comments on my blog post, "Making a Brouhaha in the Blogosphere," "If we ever do get MARC records connected to these, we need to upgrade the copyright database with decent bibliographic data." (I have heard that Brewster Kahle and the OpenLibrary are working on this problem.)

This data merge is tremendously complicated by the lack of unique identifiers in the Copyright database, requiring a multi-stage or fuzzy merge. A merge based on something like Bowker's Books in Print would be unsuccessful as ISBNs were only assigned prospectively from 1967 onwards.
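As a rough illustration of what a multi-stage merge might look like, the sketch below tries an exact match on a normalized title-and-author key first and falls back to a fuzzy comparison only when that fails. Everything here is an assumption on my part: the field names, the sample records, and the 0.85 threshold are invented, and the similarity measure is Python's standard `difflib.SequenceMatcher`, not anything Stanford, OCLC, or the OpenLibrary is known to use.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def match_key(record):
    """Build a comparison key from title and author (hypothetical fields)."""
    return normalize(record["title"]) + " | " + normalize(record["author"])


def merge(renewals, catalog, threshold=0.85):
    """Return (renewal, catalog_record_or_None, stage) for each renewal."""
    by_key = {match_key(c): c for c in catalog}
    results = []
    for renewal in renewals:
        key = match_key(renewal)
        # Stage 1: exact match on the normalized key.
        if key in by_key:
            results.append((renewal, by_key[key], "exact"))
            continue
        # Stage 2: fuzzy match; keep the best-scoring catalog candidate.
        best, best_score = None, 0.0
        for cand_key, cand in by_key.items():
            score = SequenceMatcher(None, key, cand_key).ratio()
            if score > best_score:
                best, best_score = cand, score
        if best_score >= threshold:
            results.append((renewal, best, "fuzzy"))
        else:
            results.append((renewal, None, "unmatched"))
    return results


# Made-up records, for illustration only.
catalog = [
    {"title": "A History of River Towns", "author": "Smith, Jane"},
    {"title": "Field Guide to Ferns", "author": "Doe, John"},
]
renewals = [
    {"title": "A HISTORY OF RIVER TOWNS.", "author": "Smith, Jane"},
    {"title": "A Histroy of River Town", "author": "Smith, Jane"},
    {"title": "An Unlisted Work", "author": "Nobody, A."},
]
for renewal, hit, stage in merge(renewals, catalog):
    print(renewal["title"], "->", stage)
```

The brute-force comparison in the second stage would not scale to a full catalog; a real merge would need some blocking step (by year or author surname, say) to narrow the candidates, and the unmatched remainder is where human review would come in.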

The second requisite, I feel, is an ongoing service provided by the Copyright Office on behalf of the public.

Here is the text of a letter (mildly edited) that I wrote to Deanna Marcum, the Associate Librarian for Library Services, and one of the DLF Board members, requesting a conversation about these kinds of services:

The desire is not to get the registration records on a one-time basis, but the ability to continuously obtain the records, such that [they] are offered as a service on the network by the CO. Users could elect the whole db, or update, or new records; a feed would be available even through such a simple means as RSS on a daily basis. [N.B.: Public Resource is doing this as a fallback until the CO initiates such a service directly].

There may be ways of designing a service such that the library community as a whole could enrich these records with bibliographic data, or correct simple mistakes that impede their use and lessen their value. In fact, Karen Coyle related that the first record she looked at had a spelling error in the title. There may be a way to work with the registration records; certainly, day-lighting them more openly will enrich immeasurably their value [for] all of us, permitting the construction of services that recognize and honor the rights of these materials far more faithfully.

At any rate, I [have] only begun to think through such [mutually beneficial] community based solutions, and should have tried to focus more on them earlier, but they exist for the Library to exploit to everyone's advantage. I think there would be tremendous enthusiasm among research libraries to hold a conversation with the CO and the Library [of Congress] on how such services might be constructed, and I am happy to aid such dialogue in whatever way might be useful, such as facilitating such a meeting among our members.

I am still looking forward to engaging the Copyright Office.


Comments: 4

Peter Brantley [11.07.07 11:08 AM]

Rick Prelinger, president of the board for the Internet Archive, sez:

I don't know whether I told you, but we are starting to scan the Catalog of Copyright Entries from its inception in the late 19c until 1978. This will include registration and renewal listings for all classes of works, not simply books. Prelinger Library has a near-complete run and we will be working with other libraries to fill holes. It's being funded by Yahoo money, so there will be neither restrictions nor watermarks. We'll then need to work hard, and I'm sure this will be collaborative, to bring this into a database. Since (c) records are full of valuable bib information, their integration into other databases will be critical.

Bill Carney [11.07.07 01:53 PM]

Just a quick note here to say that OCLC Research is working on matching the copyright renewal data to WorldCat as we speak. We’re also discussing a search and update layer so that interested parties could help link bibliographic records with unmatched copyright renewal records. Of course, as others have pointed out, matching this data isn’t a trivial undertaking given the lack of identifiers, so the initial match rate may be quite low. This is all part of an OCLC project to develop a registry of copyright evidence, which would enable libraries and other organizations to share what they learn as they conduct copyright investigations for orphan works. The idea is similar in nature to the copy cataloging model where libraries share the workload across the cooperative. More details on the research as available. If you have questions, feel free to contact me directly.

Bill Carney, OCLC

Personanondata [11.08.07 10:11 AM]

On ISBNs: some titles will have had ISBNs assigned to later editions, but to your larger point, there is nothing to stop some entity (say LC/CO) from acquiring an ISBN prefix and assigning ISBNs to those pre-1970ish titles that don't have them. In fact, this would be a preferable route IMHO, since it would maintain consistency of identifier. No one need mention the "publisher" prefix issue, because we all know about it, and for this one implementation, which would benefit all users, we could make an exception. This action wouldn't be immediately remediative, since by nature there are many incompatible databases of titles reflecting works published prior to 1968, so rationalizing all these databases to the now ISBN-inclusive pre-1968 Copyright database would be a not insignificant task. Assuming this rationalization were needed, its main objective would be to make sure my "The Good Soldier" was the same as the one with an ISBN in the new CO database.

Also, I am not sure why you think 30% is startlingly high. In the pre-1968 period there were not nearly as many titles published annually (in fact, if I recall correctly, BIP print only went to two volumes in the very late '60s), so publishers probably had time to be more diligent.

Well done on opening this up BTW

john [11.09.07 12:37 PM]

Much ado about dated books. One day Google will wake up and realize that their ancient books that they spent millions of dollars scanning are worthless.
