Nov 4

Tim O'Reilly

Oops - Only 4% of Titles Are Being Commercially Exploited

In a recent post, I made the assertion that 10-20% of titles published were still in print and being commercially exploited, with another 20% clearly in the public domain, leaving approximately 60% in what I called "the twilight zone" -- with no clear rights. Farhad Manjoo of Salon, who is writing a follow-up story, emailed me for confirmation of those numbers, and in so doing, made me realize an error in my calculations. I had taken the number supplied by the OCLC, of 10.5 million unique titles in the five libraries cooperating with the Google Print Library Project, and applied to that the recent report by Nielsen Bookscan that 1.2 million unique titles sold at least one copy in 2004. That gave me the estimate of 12% I used in that prior post, which I generously expanded to 10-20% by assuming that books that didn't sell even one copy might still be considered "active" by some publishers.

However, in answering Farhad's question, I realized that the correct number to work with is not the 10.5 million unique titles in the five libraries working with Google, but the 32 million unique titles in the entire OCLC "WorldCat", which represents their best estimate of the number of titles held in all US libraries, and is a reasonable proxy for the number of books published. That math leads to the following revision of the picture I published earlier:
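For readers who want to check the arithmetic, here is a minimal sketch in Python. The three figures are the ones quoted in this post; the variable and function names are purely illustrative, and the result is only as good as the assumption that WorldCat is a fair proxy for all titles published.

```python
# Figures quoted in the post; all names here are illustrative.
TITLES_SOLD_2004 = 1_200_000        # Nielsen BookScan: titles that sold at least one copy in 2004
GOOGLE_LIBRARY_TITLES = 10_500_000  # OCLC: unique titles in the five Google Print libraries
WORLDCAT_TITLES = 32_000_000        # OCLC WorldCat: unique titles held in US libraries


def share_in_print(titles_sold: int, total_titles: int) -> float:
    """Return the percentage of titles that sold at least one copy."""
    return 100.0 * titles_sold / total_titles


# The earlier post divided by the five-library count; the revision divides by WorldCat.
original_estimate = share_in_print(TITLES_SOLD_2004, GOOGLE_LIBRARY_TITLES)
revised_estimate = share_in_print(TITLES_SOLD_2004, WORLDCAT_TITLES)

print(f"Against the five libraries: {original_estimate:.1f}%")
print(f"Against WorldCat: {revised_estimate:.2f}%")
```

The first figure is the source of the roughly 12% estimate in the earlier post; the second rounds to the 4% in the title of this one.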


Or, expressed as a pie chart:


Put another way, if the AAP and Authors Guild were to prevail, and opt-in were required, the AAP is asking us to believe that publishers are willing to unearth the contracts for more than 25 million books, track down the authors (many of those books surely don't grant electronic rights to the publishers, since those rights weren't even conceived at the time many of those contracts were signed), and get their permission to opt them in, and this despite the fact that those 25 million books didn't sell even one copy in 2004. Try to be serious. There is no economic incentive for publishers to opt in books in what I've called "the twilight zone." This approach will make the creation of a comprehensive search engine for books virtually impossible.

These numbers are corroborated by a conversation I had today with the Copyright Clearance Center. The AAP has asserted that the CCC can help track down the rights to all the orphaned works in an opt-in scenario. Yet the CCC has rights records for only 1.7 million titles, including journals, and many foreign language works. So again, there's not much traction on the titles in the twilight zone.

A bit more math: in Microsoft's announcement that they were joining the Open Content Alliance, they estimated that their $5 million contribution would be enough to digitize approximately 150,000 books. By that measure, it will cost Google over $300 million to digitize the 10.5 million unique titles in the collections of the five participating libraries. Of that, assuming these numbers hold, more than $200 million will be spent to digitize works that the publishers have determined have no current economic value. Are the publishers going to make this investment to digitize works they no longer exploit in print? Not very likely.
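The cost estimate can be reproduced with a short sketch, assuming that Microsoft's implied per-book cost would also apply to Google's scanning (an assumption, since the two projects may differ in method and scale). The names below are illustrative only.

```python
# Figures quoted in the post; names are illustrative, not from any official source.
MICROSOFT_CONTRIBUTION_USD = 5_000_000  # Microsoft's Open Content Alliance contribution
BOOKS_COVERED = 150_000                 # books Microsoft estimated that sum would digitize
GOOGLE_LIBRARY_TITLES = 10_500_000      # unique titles in the five participating libraries

# Assumption: Google's per-book cost matches the cost implied by Microsoft's estimate.
cost_per_book = MICROSOFT_CONTRIBUTION_USD / BOOKS_COVERED
total_cost = cost_per_book * GOOGLE_LIBRARY_TITLES

print(f"Implied cost per book: ${cost_per_book:,.2f}")
print(f"Implied cost for all titles: ${total_cost / 1e6:,.0f} million")
```

At roughly $33 per book, the total comes to about $350 million, which is consistent with the "over $300 million" figure above.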

In short, Google is offering the publishers and authors a $200 million gift horse. Based on our experience with Safari, I'm confident that search helps people to find and use books that aren't available in print. A preliminary study comparing Safari usage to BookScan data on books sold showed us that 23% of all Safari usage came from books that represented only 6% of print sales. (I'll be blogging this data in more detail soon.) When people can search books, they will discover forgotten gems -- to which authors and publishers can then assert their rights, perhaps bringing them back into print, laying claim to a share of the advertising revenue, or enabling click-through to an electronic copy. But for that dream to happen, someone has to make the investment to create the search index.

And as to the publishers' claim that Google's intended use isn't fair, I'll point out that it's exactly the same fair use exception that Google and other search engines use to create an index of web pages, which are also copyrighted material that, by the publishers' interpretation of the law, should have required opt in by web publishers. Making intermediate electronic copies in order to create derived works that are themselves fair use is one of those technological changes that requires us to rethink the narrow interpretation of copyright law that old-line companies would have us hold onto.

In short, I believe that the AAP's position is intellectually dishonest. They are pretending that opt-in is a real solution to the orphaned works problem, when by the numbers, it clearly is not. And at the same time they are resorting to scare tactics, calling the project "a license to steal", when in fact the proper analogue is to what Google already does on the web -- and we all know how much value that has created. As Cory Doctorow likes to point out, Google should no more be negotiating with publishers over this extension of fair use than Sony should have negotiated with the movie studios before introducing the VCR. New technology always brings challenges, but it also brings opportunities. If we allow old line industries to suppress new opportunities in the interest of protecting their entrenched businesses (and I should note that the five large publishers who dominate the AAP control more than 50% of all books sold), we will all be the poorer for it.

Comments: 21

  Owen [11.04.05 03:45 PM]

Very interesting. And I completely agree about the opt-in issue and that Google Print can benefit publishers. But there is a very big IF in there. It hinges around whether or not it is worth indexing all those books. I know that the underlying philosophy is that it is all knowledge and it can't hurt to make it searchable etc. But this brings up the information overload issue and the reason why search engines still struggle to gain further traction than they already have. A strong case can be made that much of this information is NOT worth indexing. I don't want 5000 mediocre or inaccurate results for my search on home curing of alligator hides or whatever. I want one excellent result. There is a reason that all those books are lost in neverland. The vast majority of them aren't worth reading or indexing - something better was deemed to be the best book and is still in print and selling.

I'm not saying that the publishing and author communities are right here - they aren't - but I am saying that Google's goal in this case may not be one worth pursuing.

  Tim O'Reilly [11.04.05 04:32 PM]

Owen, finding the needle in the haystack is always a challenge in search, but let me offer three thoughts:

1. As I mentioned in the piece, our evidence from Safari is that people do find value in works that aren't available in print. The long tail is fatter online than it is when physical goods are involved. But even more striking, the long tail has bumps in it, as readers find and discover value in older works. (As promised, I'll be publishing some of that data shortly.)

2. As an avid collector of old books, I can assure you that many of my favorites are no longer in print. Many wonderful books are no longer available just because of the limited capacity of the sales channels, and the economics of keeping books in print. It would be wonderful to be able to search their text for a passage I'm trying to remember, or want to link to and pass on.

3. A thought from a conversation today with Rick Prelinger of OCA. There are many works that are not of general interest that suddenly have value in new contexts when they become searchable. The example he gave was genealogy. When you're trying to trace your ancestors, you want to look at all kinds of old documents that are not conventionally "useful." One of the benefits of the search paradigm is that the user gets to pursue specialized interests.

Another example was written up in the New York Times just today: how the study of old cookbooks and even menus yields a treasure trove of data for historians.

  Mike Perry [11.04.05 04:33 PM]

I agree with Owen. Large numbers do not tell the whole story. Most out-of-print works could stay unavailable forever with no one being the loser. And unless some clever ranking scheme is developed, their existence in a Google Print database will simply create enormous clutter. It will be like looking for a book in a library without a classification system. "Virtual books" would appear alongside one another for no more reason than that both use some catch phrase.

I ran into that problem when I tried to research an obscure term that happened to be used both for a modern technology and a late-nineteenth century health food. When I began, I didn't know what the latter use was, so I had to wade through hundreds of recent tech uses to find what may have been the only place on the Internet where the other use was found. Too much trash can hide a gem.

Even more important, arguments of convenience and necessity miss the point that even if an every-word index is fair use, presenting users with no more than a few sentences of the text is neither convenient nor useful. Not even Google is claiming the right to show the entire text to anyone, and that's what users need.

We simply can't graft the Internet and all the new print technologies onto a copyright law that was laid down when 'computer' meant an IBM 370 mainframe. And Congress, in their great folly, compounded the problem in the late 1990s when they extended copyright, so it now often exceeds the lifetimes of both an author and his children. Unfortunately, on death intellectual property isn't treated as carefully in law as real property. Determining ownership is often impossible.

What we need isn't a courtroom battle whose rulings will apply only to a specific set of circumstances and which will take years to conclude. We need the 'horsetrading' of legislation, a situation where authors and their heirs gain something, perhaps an easy-to-get stream of modest royalties, while the reading public gains easy access to out-of-print books, with something Google cannot legally provide, an ability to download and print out one copy of the entire text for perhaps 1-2 cents per page. And Congress could settle by dictate all the messy issues that are likely to arise when there's no way to say where a copyright has gone.

That's what we'll end up having to do in the end. It'd be far wiser for everyone to begin to work in that direction today. With sufficient prodding, Congress could settle this entire issue before the first inconclusive rulings have cleared the appeals courts.

--Mike Perry, Inkling Books, Seattle

  Rich Gibson [11.04.05 05:52 PM]

I am confused by the argument that these out of print works should be unavailable, that they will just clutter the search space, that they have been superseded for a reason.

We are not competent to judge which works are valuable, which books should be preserved and made searchable and which should be thrown to the wolves of the dark night.

It takes a single generation of inattention to lose fundamental skills and ways of life. In theory books allow for the potential to (painfully) bridge the gap and regain those skills.

We don't know what the future needs of us, we don't even know what we need of the past.

  Hadley Stern [11.04.05 07:33 PM]

The part of this issue that concerns me is that Google is merely using this as a premise to make money. Google's primary goal as a public company has to be to become more and more profitable. Making all of the world's books available within a websearch is simply an excuse to push more (very profitable) adwords at us. This is why it is against fair use. Because Google is profiting from displaying other people's words.

Yes the notion of being able to search all books is seductive, but really, all this is is more content to put adwords against.

  Tim O'Reilly [11.04.05 07:50 PM]

So, Hadley, why is this different than, say, the New York Review of Books or the New Yorker running ads alongside their book reviews? Whether or not someone is making money is not actually the test of fair use, despite what some would have you believe.

I agree that this could be a huge moneymaker for Google -- but as I noted above, they will need to make a huge investment to get that benefit. It's much like any other business. You make money if you do things that serve your customers, and thus allow you to monetize the value that you've provided.

And of course what interests me is that, based on what Google has already proven with Adsense for Content, this business model also provides attractive revenue streams for content providers, not just for Google. In a declining publishing market, this is something that every author and publisher ought to be rooting for!

  Pierre Sandboge [11.05.05 12:20 AM]

According to this:

there will be no ads. I presume that ads will be displayed on surrounding pages, but I don't see why authors should have a cut of those ad revenues.

As for most books not being worth indexing, I think quality is just one of many factors that determine the success of any given book. High quality does not necessarily lead to commercial success. Also Google provides (mostly) high quality search results on the web, despite the prevalence of mediocrity. There is nothing to suggest they won't eventually succeed in doing so with Google Print as well.

  Tim O'Reilly [11.05.05 11:22 AM]

One more fact. Again, according to Bookscan (in a presentation at The BookStandard Summit 2005 conference), only about 7% of the 1.2 million titles that sold at least one copy last year sold more than 1000 copies. That's about 84,000 titles, or about a quarter of one percent of all the titles that have ever been published.
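As a quick check of that figure, here is a minimal Python sketch. It assumes the 32 million WorldCat titles cited in the post stand in for "all the titles that have ever been published"; the names are illustrative.

```python
# Figures quoted in the post and this comment; names are illustrative.
TITLES_SOLD_2004 = 1_200_000     # BookScan: titles that sold at least one copy
SHARE_OVER_1000_COPIES_PCT = 7   # BookScan: percent of those selling more than 1000 copies
WORLDCAT_TITLES = 32_000_000     # assumption: WorldCat count as a proxy for all titles published

# Integer arithmetic keeps the title count exact.
titles_over_1000 = TITLES_SOLD_2004 * SHARE_OVER_1000_COPIES_PCT // 100
share_of_all_titles = 100.0 * titles_over_1000 / WORLDCAT_TITLES

print(f"Titles selling more than 1000 copies: {titles_over_1000:,}")
print(f"Share of all titles: {share_of_all_titles:.4f}%")
```

That works out to 84,000 titles, or a little over a quarter of one percent of the WorldCat total.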

Most books published really are doomed to obscurity. A search engine for books will most definitely help readers to find books that are currently "thrown away" by their publishers. I don't think it's an accident that the AAP, dominated by the five big publishers who control 50% of all book sales by their current frontlist marketing power, are against a technology that would help to level the playing field for smaller publishers.

  Brock [11.05.05 07:08 PM]

Although it will multiply their copyright headaches by an order of magnitude, I expect Google will one day team up with Print on Demand publishers. Then, once scanned into Google's index, no book will ever be out of print.

  Anthony R. Thompson [11.05.05 08:06 PM]

There is an HTML error in your post near the "Safari" link.

Tim: Thanks a bunch. Fixed now. I can't believe I missed that. It cut out a whole (and important) paragraph.

  J.T. Wenting [11.06.05 10:57 PM]

Still trying to convince the world that the wholesale copying and redistribution of books by Google is not breaking the law, Tim?

It's up to Google to decide the status of a book before even attempting to distribute it, not the publisher.
What you're effectively saying (even if it's not what you intend to say) is that it's an impossible task for authors and publishers to check on the status of copyright of their works that Google decides to give away to the world, impossible even to decide whether it's really their work at all.
That should set off all kinds of alarm bells telling that it's (as it is) Google's task to determine that status and take appropriate action to get permission to give it away if it turns out a work is protected.
After all, they're the ones who want to use that work for purposes not originally intended...

  Tim O'Reilly [11.07.05 09:13 AM]

Jeroen --

Once again (I think I've said it each time you've posted this misleading comment on one of my postings):

1. Google is NOT showing the entire work, EXCEPT when it's clearly in the public domain.

2. Google is showing SELECTED PAGES if the book has been OPTED IN by the publisher.

3. Google is showing ONLY SNIPPETS (those little one liners you see on the first page of web search results as well) if the copyright status of the book is unclear (most of them.)

The repeated claims by the AG and the AAP that Google is giving away their copyrighted work are a smokescreen. It is true that Google has to make a complete copy (i.e. scan the books) in order to create their index, but my contention (and Google's, though I don't speak for them) is that this is fair use, and a natural extension of fair use at that, as current technology requires the making of transient copies in all kinds of contexts that were not originally intended in copyright law. And as I pointed out in the posting, this is the very same interpretation of fair use that allows search engines like Google to make copies of other people's web pages in order to create an index for them. And yes, there were people who protested when that process started too!

Meanwhile, the one area where Google may be on shaky ground is that they are also giving a scanned copy back to the libraries to do with what they will. Understanding whether or not this is a problem depends a lot on how the agreements between the libraries and Google are structured. Are the libraries making the copies (with Google as a contractor) and giving a copy to Google for their index, or is Google making the copies and giving one to the libraries?

I can say, based on private conversations with a couple of members of the AAP board, that they are actually NOT concerned that Google is giving away their property to the public via Google Print. (Except for Pat Schroeder of the AAP, who is taking this public position purely as a PR move, to mislead folks like yourself.) They are concerned about the copy that the libraries have, because big publishers have always distrusted libraries and their mission. And they are concerned about the precedents being set: depending on how this comes out, who else might come out of the woodwork with a fair use argument.

What bothers me about all this are the disingenuous arguments from the AG/AAP side:

1. If this program were opt-in, we know who owns the rights, and publishers and authors could make the decision.

This is a serious misrepresentation, which I've tried to out with the numbers in this post. The CCC was cited as a mechanism, when they have records on only 1.7 million out of more than 30 million titles.

The titles that the publishers know about are already in Google's first bucket, and many are opted in, although even there, the publishers have opted in many titles to which authors only own the rights, again demonstrating the practical infeasibility of the publisher position.

(I will say that Google's unwillingness to pre-screen titles in the library program on the basis of publisher-supplied lists is a potential problem for them, but because I believe that publishers don't actually know who owns many of the rights anyway, I think Google's still making the right call on this. It will be much easier to sort out after the fact, and Google is making it very clear that if anyone wants out, they can get out. Again, this is analogous to how we handled search on the web.)

2. Google is giving away our content. This is an appeal to people who don't take the time to do their homework, and just believe what they want to hear, when it matches their prejudices. The AAP and AG are working hard to portray Google Print's Library Project as the Napster of books when they know damn well that that's a completely misleading analogy. I despise people who lie to try to win a PR war, and I'm getting increasingly disgusted with the AAP over their tactics.

3. Google is so rich and we're so poor. First off, this is an argument that is offensive in its irrelevance, but it's also untrue. Google's on a roll, to be sure, but the top five publishers who dominate the AAP are collectively ten or more times Google's size, and collectively have profits in excess of Google's. They might say, "but we're big conglomerates who don't make all of our money from publishing," but by that measure, Google Print is a startup with almost zero revenue...

There are real issues at stake here, and we ought to be debating those issues, not a PR war designed to frame backroom negotiations that will turn out very differently depending on whether the public (and the courts) buy the misrepresentations.

I ask myself four questions:

1. Is this a good thing for authors and publishers? All of my experience says yes, although it is definitely fraught with peril as well as opportunity, as is any new medium that changes the playing field.

2. Is this fair use, even though it was not a contemplated use when the copyright law was framed? Again, I come up with a yes. If it isn't fair use, then neither is Google's web search engine, or anyone else's.

3. Does Google propose to share their revenue model with authors and publishers fairly? Yes. If this works and they make money, we make money, as far as I know at Google's usual content-provider-friendly splits. (See Adsense for Content.)

4. Are there other issues? Sure. There are issues of precedent, on both sides. (Publishers are worried that if the precedent is set, there will be other less friendly parties attempting the same argument. But so too is Google worried, as a more restrictive view of fair use could have their web search business bitten to death by ducks as they are harassed by opportunistic lawsuits.)

There is the issue of the library copies. But if those are the issue, come out and say that front and center. But you'll notice that the issue of the libraries having the free copies (which looms large in private conversation) is curiously absent from the public debate, since it's a far less flattering attack for the publishers to make.

In short, the tactics of the AAP and AG have turned a legitimate dispute into one in which one side is being pretty dishonest, if you ask me, and that side isn't Google. These guys do want to do the right thing, and I find it disheartening that the AAP is using such sleazy tactics as in Pat Schroeder's Washington Times Op Ed (also linked to above in the original article) to mislead people about what the real issues are.

  Dan Bednarek [11.07.05 10:23 AM]

Tim, I take exception with your description of fair use in this context. Web search engines are indexing content that the author and copyright holder has already decided to provide free of charge (on the web), so the indexing of that content, while technically a possible copyright violation, has no negative financial consequence for the original author. Google plans to scan and index copyrighted works that are not otherwise provided to the public free of charge, which I think is a material difference. But I do agree that the program would likely drive users to published works they would then have to purchase, so I guess I'm just splitting hairs with one of your points of argument.

  Owen [11.07.05 10:24 AM]


First, thanks for this forum discussing issues like this. This is a more substantive and worthwhile discussion than I have seen elsewhere. I am very much in a grey area on all this. I am a very small publisher that could in theory benefit greatly from Google's plan - or at least my authors could since all the material is in fact already available online - apart from the three sites that are already down and out forever. BUT I still have issues that I cannot quite resolve that are around two things. One is the ability to find, have, hold a printed copy of the work. I remain to be convinced that we will really replace the printed book. Maybe if and when e-paper gets to be as thin as real paper.

Second, I also do NOT believe that Google and Amazon and anyone else has the publisher's interests at heart. I make less money off a sale via Amazon than any other avenue for a sale. I also strongly believe that in out-of-print, low volume content, the copyright owner's share of return should go up not down - and I think it could if those involved in digital models of preserving content really wanted to do so.

I have opted in to Google Print but have not heard anything at all about when and if they will digitize the book - and frankly, if they just came to me I could GIVE them a digital version saving all that expensive scanning!

Having said all that, I do really believe that the underlying premise of Google's idea here will come about and is a good idea - but the devil is in the details as always.

I agree with your summary of the issues as well - and legally I strongly suspect that it is number 2 - the fair use point - that is the real underlying issue for Google and for everyone else. While Search seems reasonable in terms of helping people to find, the question becomes how much of an excerpt is OK? And then we turn to the recently raised points about printing a page or two for micropayment fees - say 5 cents a page - that's already cheaper than my book is period. I don't think Google/Amazon can be the ones setting that price - and they can't be claiming very much of the revenue either or they'll see their permission dry up faster than a raindrop in the Sahara.

Finally, I completely agree about the AAP and AG in this case and their attacks have obscured the real issues. They should be thinking about the whole new book production and distribution chain - that's where their big issues lie.

For the record, I started a small book publishing company as a sideline at the start of this year. I have published one title so far. It has sold poorly despite being original, unique and a very very good book. I market primarily directly and can make money off far smaller sales than big publishers but we have not reached that target yet. I pay 20% royalties to authors because I believe in good content - and that is for single time first serial rights in book form only - they can resell the content many other ways.

  Sid Steward [11.07.05 03:11 PM]

From the DMCA protects search engine page caching, indexing, etc.:

"When folks talk about Google Library and fair use, Google's current practice of indexing web pages is commonly offered as a sort of precedent. 'Google can index web pages, so it should also be able to index books.' Or even: 'if Google can't index books, then it can't index the web.' However, the way search engines index the web is specifically protected by the DMCA."

From: copyright law lets libraries distribute Twilight Zone material:

"[Section 108] gives new life to works in Tim's Twilight Zone. It sounds more permissive than fair use, too."

  Ian Barker [11.08.05 07:07 AM]

Fascinating discussion.

An earlier argument against digitization and expanding the searchable content universe was based upon the idea that it's possible to have too much of something; that 1000 more books on a given topic won't add enough incremental value (measured by education or entertainment value, and revenue) to warrant the scanning cost.

Even though I think Tim addressed this point very well, there is another key impact of the massive technological change we're experiencing in digital content access: technology has shifted our definition of "book". Clearly, books are no longer static pieces of information. Publishers must shift toward defining the book more as a sort of digital content object, part of which goes to print (what we now think of as the traditional printed work) and other parts that exist only digitally. Yes, there are examples of this to be found (O'Reilly, for one) but it's really under-developed.

Further, what of derivative works? Seems to me repackaging will come to be one of the major new methods of driving interest in new and backlist titles. (Amazon's recent announcement that it will begin selling content by the page comes to mind.) Imagine being able to read through organized excerpts derived from multiple same-topic works, with the chance, at every step, to purchase that out-of-print title! It seems sensible to me, though, it must be confessed, I'm expressing this argument in part from professional self-interest.

Regardless, once rights-holders realize the scope of the revenue possibilities there may quickly be much more support for the more open policies demanded by digitization technology and recommended by the OCA.

  Mike Perry [11.09.05 09:06 AM]

Raymond T. Nimmer, the Leonard Childs Professor of Law at the University of Houston Law Center and co-director of the Houston Intellectual Property and Information Law Institute, has a blog and at:

he explains why Arriba Soft's website is too different from what Google is doing to provide much help.

He also seems skeptical of Google's chance of winning in court, noting: "Indeed, in Tasini, the Supreme Court held that it was infringement for the publisher of magazines to reproduce those magazines in digital, online form. This trampled on a market that copyright law gave the copyright owner the right to control - online publication. Google does not even have that much of a relationship to the copyrighted works."

Here's his summary:

But these are legal points. The further question is whether, as a matter of policy, Google should have the right to do what it wants? I think not. On the one hand, this large company desires to make a massive number of copies of other persons' property for its own profit. On the other hand, the authors and publishers that own the property rights have been given exclusive rights to copy or distribute copies of their works as part of a statutory scheme that intends to provide authors with incentive to create new works. The incentive lies in their ability to control how the work is distributed and, even, when or if it is distributed. This is exactly the right that Google plans to take away.

And what about the libraries that assist Google? They have special statutory rights under the Copyright Act but none of those rights authorize the deal they are making with Google. It will be interesting to see if the Author's Guild brings a claim of inducement under Grokster. But that will be for another day.

I'd add that, with Google's chance of winning in court less than impressive, the only workable solution is a legislative compromise that brings copyright law into the 21st century, while respecting the interests of all parties. Hang tough and duke it out in court isn't going to work.

--Mike Perry, Inkling Books, Seattle

  Jeroen Hellingman [11.10.05 11:54 AM]

One aspect I never recognize in the discussions on copyright is the level of competition that public domain books, and also print-on-demand services would give to established publishers. I believe that the elimination of competition from the public domain has been even more a driving force behind repeated copyright extensions than the rent-seeking for a few long-lived classics. Google's move would open up an enormous backlist of more recent works in the twilight zone, for which the rights probably lie with individual authors, who, after seeing increased attention and demand for their works, may decide to sell "print-on-demand" rights to Google for a small price, happy to see their works revived, and getting some pocket money as well. Although the books may be somewhat older, they are by far not as outdated as public domain works from the early twenties. And yes, that would undercut the current publisher's prices, but not in any way against the spirit and letter of copyright. For most authors of twilight-zone works, even 1 cent per page would be a great deal.

  Michael S. Hart [09.06.06 07:27 AM]

I noticed in your article about how many books in
print vs. under copyright that you mentioned the
holdings of The Open Content Alliance and Amazon,
but neither Project Gutenberg nor World eBook Fair.

Project Gutenberg has over 100,000 eBook files for
free download at 5 sites, and The World eBook Fair
had 1/3 million in July and may have 2/3 million
next month for International Book Fair Month.

This appears to be more than we can download from
either of the sources you mentioned.

I can send more details on request.

Try for a teaser.



  Carl B. Adams [08.24.07 04:24 AM]

I was told by a publisher that a law prevents the sale of out-of-print books. In a search for an answer to this obvious question, you were found. I had inquired about promoting one of their books. They said it is out of print, and tried to sell me another edition. It seems like a stupid lie. I happen to have been involved with copyright law for a measurable number of years. I hope you can comment.

  George London [01.23.09 02:52 PM]

I was googling for a reasonably definitive list of published books (something like a complete ISBN list) and found this page. What a fascinating debate!!

I knew nothing of the Google Book Search, and know that this is an old (in web terms, an ANCIENT!) discussion but as a musician with an interest in the online music debate, I guess I'm not surprised that the issue that at first hit the music (and then the movie, and then the TV industries) should be hitting publishing too.

I'm off to find out what's happened since November 4th 2005!
