Books Working with the Web

Mon

Nov 5
2007

Almost a year ago, Tim O'Reilly wrote, "Search engines should be switchboards, not repositories" in his blog post, "Book Search should work like Web Search." The premise was that search engines should not duplicate the digital book repositories of publishers or other service providers, but should instead direct traffic to them. As Tim said, "Don't fight the internet."

In New York this last week, one of the many publishers that I spoke with was HarperCollins. Unique among the trade publishers, HarperCollins has decided to own the browsing and user experience for their digital books; not only do they maintain an active and growing repository (based on LibreDigital), but they also want Google to redirect users who encounter Harper books through the Google Book Search discovery interface over to HarperCollins' repository. Google has maintained, as the purveyor of Book Search, that they have sole latitude in determining when a destination "user experience" is good enough in terms of response time and functionality to make this switch to an external site.

That is not how the web works. Despite all their claims for openness, within the Book Search product, Google is creating a walled garden. Ultimately, if HarperCollins generates a poor user experience, then that is Harper's problem, not Google's.

Google has developed a fledgling specification, called "BookMap," which aids the harvesting of repositories containing digital books. One of the intents of BookMap is to permit the harvesting and indexing of whatever material the publisher deems appropriate for exposure (on their terms) to a search discovery interface, with the question of where the user experience should be hosted treated as a separate consideration. As far as I know, BookMap has still not been released as a published specification, although it is in use.

There are good reasons for publishers to control the user experience. Ultimately, it is their content, and their property, for which they have the right to determine the functionality of the experience. Music producers gave up service delivery, and they have turned into backend providers of content to services that actually provide the user experience; in short, they are no longer truly publishers.

Google's response to this may be that they are delivering the user to content, and since that is done through their interface, a publisher who is unhappy with that service need not provide their content for delivery via Google. I do not believe that is a fair response; it is not how the web works. If a web site is not happy with how Google provides a discovery experience for its content, then it is free to prohibit harvesting through robots.txt; but Google should not exclude a site from harvesting because Google is unhappy with the service delivered by the web site.
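To make the robots.txt point concrete, here is a minimal sketch using Python's standard urllib.robotparser. The crawler name ExampleBookBot and the publisher.example.com address are made up for illustration and do not correspond to any real crawler or site; the rules shown are just one way a site might shut a single harvester out of its book pages.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is kept out of /books/,
# every other crawler may harvest everything.
ROBOTS_TXT = """\
User-agent: ExampleBookBot
Disallow: /books/

User-agent: *
Disallow:
"""


def may_harvest(robots_txt, user_agent, url):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://publisher.example.com/books/novum-organum"
    for agent in ("ExampleBookBot", "OtherBot"):
        print(agent, may_harvest(ROBOTS_TXT, agent, page))
```

The decision sits entirely with the site being harvested, which is the asymmetry argued for above.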

If we compare the services that Google and the Open Content Alliance provide around a public domain text, it is difficult to argue with the proposition that the user experience of OCA's OpenLibrary is superior. Let's take, for example, a copy of Bacon's Novum Organum; OCA's copy is from the University of California Berkeley library, and Google's is from Stanford University's library.

Both OCA and Google permit a download of this public domain work in PDF, and both provide a pleasant online browsing experience. Both render on-screen a raw text version based on the OCR derived from the page image. Nonetheless, even setting aside the question of image quality and OCR fidelity, OpenLibrary provides several services that Google does not. It offers access in multiple formats: DjVu, FlipBook, black-and-white and color PDF, and plain text. It provides text-to-speech capability through the FlipBook presentation. OCA also displays critical metadata, including known rights information, on each book's profile page. Finally, OCA permits access to its content in bulk, to the extent its own contracts permit, for purposes of research and education, such as text mining analyses.

I have discussed the online book viewing user experience with Brewster Kahle, and he agrees with Tim O'Reilly: book search should work like web search. A search engine should serve as a switchboard, and not as the sole delivery platform for the content. In other words, a search engine must be an open delivery platform, and not a closed garden. For Google to wear the mantle of open protocols for the social web, but to discard them for books, is hypocrisy.

OCA welcomes, and encourages, Google and others to harvest the metadata and full text of its books through current crawling procedures, as well as nascent protocols (such as BookMap), to facilitate discovery through all search platforms. Some of OCA's contributors have expressly reserved the right to keep Google from re-hosting materials in the Google Book Search application platform, even as they remain fully available to the public at OCA; for these works, the browsing reader must use the OCA site.

OCA encourages Google to redirect users back to the OpenLibrary or (whenever possible) other alternative book library interfaces once they have selected an OCA title for browsing. OCA has never mandated the use of any particular book-viewing program; does not surrender control of the user experience to Google; and offers the distinct possibility of delivering a better browsing and library platform than what Google provides through Google Book Search.

In short, HarperCollins and OCA see a world where there will be many libraries and publishers, as well as many search engines.

tags: publishing

Comments: 11

Karl Fogel [11.05.07 07:29 PM]

Google has a point, though. Consider their own cached copies of web pages, for example. Many people try the cache first, because it's often faster than the original site, and because it highlights the search terms: in other words, a better user experience.

The important thing is that they clearly label it as a cache, and tell the user that it might be out-of-date. That concern doesn't apply as much to books, which (at least for now) tend not to change much after publication.

And really, why should HarperCollins (or whoever) be considered the canonical source for a text that they happen to publish in treeware form? Is that what we've decided publishing means? Do all authors agree? What about authors who also have their own web sites?

"Walled garden" doesn't quite capture what's going on here. The fact is, the web is made of lots of exact copies of things. The question of which copy to point to can be very complicated, all the more so in the presence of different metadata, presentation styles, server speeds, potential for outdatedness, endorsing parties, etc. It might be more accurate to say that Google is trying to compete with HarperCollins as a publisher of HarperCollins' own books :-).

I completely agree with you that Google shouldn't force the user to stay in Google's UI, and if Google would make the text of the book itself available via simple APIs, you can bet that lots of browser plugins would pop up to provide better viewing experiences than either Google or HarperCollins can provide. And why can't Google do that? Because the publishers won't allow it.

So if there are walled gardens here, Google isn't the only one building them, and doesn't merit special criticism for wanting to do the same thing that HarperCollins wants to do. I wish they'd both open up the data, and in the end it's not Google's fault that that isn't permitted.

Adam Hodgkin [11.06.07 12:54 AM]

Some very good points here. But I am not quite sure that you have nailed the issues down in a way that Google will recognise and accept. Also there is something wrong with this statement, you say of publishers (contra Google):
"Ultimately, it is their content, and their property, for which they have the right to determine the functionality of the experience....."
By and large publishers do not own the 'property' or the copyrights. In most cases authors do. In many of the biggest properties, publishers don't even control the key rights, which are managed by agents on behalf of their authors. But I think you are right: in due course Google will find out that being THE publisher for all digital literature is a crazy goal. They will bend their plans to become the repository for all out-of-copyright texts and for searching all other publications. They will want to be THE switchboard. For them, that is ultimately the comfortable position and one which will increase their chances of remaining the Search Engine for most of us.

Ron Murray [11.06.07 12:48 PM]

A reasonable enough argument regarding where Google book searchers "should" send the reader. But this blog entry does not give us a sense of how many books are available from any of the parties involved.

How many books and book delivery systems exist out there for Google to direct the reader to? If the great proportion of digitized books existed outside of Google, the argument is stronger.

If comparatively fewer books are out there - irrespective of their nice user interface and other features - compared to what Google is generating, then we should expect to see more results from Google's own stores.

I would also like to hear not only about relative numbers of searchable digitized books, but also about rates of conversion. That will be the real indicator of which delivery model (pass-through or Google-resident) will come to dominate the digitized library landscape.

bowerbird [11.06.07 03:57 PM]

how come i can go to my library
and read a book _in_full_, but i
can't read it online in full, from
_either_ google _or_ its publisher?
set those books free, please, or
we will rip them from your hands.

we entrusted our public culture
to publishers so it'd be _spread_,
not as a form of corporate welfare.

so spread it, or we'll take it back.
and those are your _only_ choices...

-bowerbird

Tim O'Reilly [11.07.07 12:25 AM]

bowerbird --

Who is "we"? The fact is, the biggest worry we get from our authors, even to an online subscription service like Safari, is that they are worried that it will cannibalize their sales, and that it enables piracy. The authors who have asked for their books to be free get contracts that make them free -- see openbooks.oreilly.com. We also put books there after they are no longer selling in print.

But you'd be surprised how many traditional book authors are quite resistant to having their work up for free online. They are, for the most part, more conservative than their publishers.

Free culture is alive and well on the net. But anyone who says that some author is obligated to make his or her content available for free -- whether it's Richard Stallman saying that about software, or you saying that about books -- is missing the point.

If free is really a better model, people will adopt it because they want to, because they see the potential. Anyone who wants to rip it from their hands is going too far.

I believe that free as a strategic choice is very, very often the right one. But it is the right of the author (or the publisher, if the author has granted them those rights) to make, or fail to make, that choice.

Returning to free software, it has always seemed to me that the Apache and BSD model is far freer than the GPL. The Apache, BSD, and X licenses said: here, this is a gift. Do with it what you will.

The fact that software under Apache and related licenses remains free is a testament to the power of the model, rather than to legal coercion.

I'm not saying that GPL is bad -- it's a powerful model in its own right. But it's actually a strong intellectual property license, as strong and as coercive in its goals as the licenses it seeks to replace.

So let's discover where making content free works, and where it doesn't. Because anyone who knows anything about history realizes that free is widespread in print too. This is not new to the internet! In fact, I got the idea for making GNN, the first web site supported by advertising (back in early 1993) from all my subscriptions to free, ad-supported computer industry trade papers.

Free has always co-existed with paid. There are types of content that will never become free, while there are other types that will migrate from paid to free, while others migrate from free to paid.

Meanwhile, back to your contention that libraries are "free." Yes, they are free to the reader. But they buy their books before lending them out. And you, as a citizen, pay for those books through your taxes.

They are actually a great illustration of a balancing act -- like copyright itself -- between the right of the creator to ask for recompense for his or her work, and the benefit to society from free redistribution.

bowerbird [11.07.07 02:09 AM]

tim said:
> Who is "we"?

that amorphous "mob" that rode napster-cat.
(didn't you get the wordplay on the term "rip"?)

but tim, don't worry, i think your books are safe.
"java for dummies" isn't our "cultural heritage"...
(yeah, i realize that series is from someone else.)

if you need me to repeat the point, however, i can.

we -- as a _society_ -- "contracted out" to the
private sphere the job of propagating our culture,
under the rubric of capitalism. all fine and good.

but we were not guaranteeing them a cash cow...

and we ain't stupid. we see that the job of now
spreading our culture is one heckuva lot easier,
thanks to the net (which _we_ paid to develop).

if business thinks they can now hold our heritage
for ransom, they've got another thought coming...

they can't say "we're not gonna use that tool",
because they are not dictating this situation...
we hired 'em, and we can fire 'em if we decide.

and if you look at the way some publishers act,
seems they _do_ think they can milk the public.

the textbook publishers are in the middle of a
_tremendous_ grab at the taxpayer's purse...

the academic journal publishers are gouging to
unbelievable degrees, and just gettin' warmed up.

even mainstream publishers are on the verge of
pricing themselves out of their market entirely...

when these segments can no longer do the job
that we hired them to do at a reasonable price,
we can fire them. we should fire them. we will.

> Meanwhile, back to your contention that
> libraries are "free."

my "contention"?

> Yes, they are free to the reader.

right. free to the reader. who did you think
i was talking about when i said i could read it
for free? or that you could read it for free?

> But they buy their books before lending them out.

if you think you can lead this discussion down
the path where we treat physical goods and
digital goods in the exact same manner, then
let me disabuse you of that notion right now.

in the age of hard-copy, we let publishers charge
for each copy because there was an obvious cost
that was incurred on each copy, so that was _fair_.

if _we_ had made the copies, it would've cost us
the same amount as it cost them, so it was _fair_.

but we're not stupid. we know that it _doesn't_
cost as much to make additional _digital_ copies.
we know it costs virtually _nothing_ to make them.

so charging us for each additional digital copy
as though it was an additional physical copy is
_unfair_. and we're not gonna let ourselves be
cheated in that fashion. because we ain't stupid.

> And you, as a citizen, pay for those books
> through your taxes.

look, what _we_ (that amorphous mob) care about
is the care and feeding of our cultural heritage...
if you want to set up a cozy little sweetheart deal
where the corporations get a paycheck so that we
can get free access to our cultural heritage, fine...

heck, halliburton is raiding the treasury, might as
well let the publishers do it as well. but listen up.
we're not stupid, and we care about _fairness_, so
don't make your deal unfair, or we won't go along.
this is not carte blanche to take whatever you can.
and don't put any restrictions on our usage either.
we want full and free access, in full-on remix mode.

we care about fair. so come back with something fair.
because if you don't, we'll just declare eminent domain
on all our cultural heritage, make a one-time payment
(that the corporations will think is too low, but _tough_),
and free ourselves from these stupid bonds forever...

there's a reason libraries are _free_. it's because we
have learned -- through experience -- that knowledge
is more valuable when it's shared and openly available.

so we're not gonna let a bunch of greedy businessmen
lock it up and charge us ransom to access it. never!

stop acting like the greedy businessmen are in charge.
they're not. we can displace them any time we want...
and we're getting impatient with their silly obstacles.
we don't care if "free" works as a business tactic or not,
so don't even try to change the subject to that silliness,
because we don't care a bit about business _or_ tactics.

we care about knowledge and culture and our heritage
and moving our society into the 21st century...

-bowerbird

Tim O'Reilly [11.07.07 03:42 AM]

bowerbird --

I totally agree that fairness is key. But I think that you're overstating the case for free. I've been as quick to excoriate greed and foolishness as anyone, and to urge publishers to explore the benefits of free.

But many types of content will not be produced for free. Saying that anyone who wants to be paid (cf recent Radiohead experiment) is greedy is just silly.

And I made no assertion that digital goods are the same as physical goods.

In your comments, please be careful when using the word "you." The way you're writing makes it sound like you're accusing me personally of various crimes against culture.

Jerome McDonough [11.07.07 08:24 AM]

I am shocked, *shocked* to find that Google is trying to exploit a monopoly position.

But I'm not sure of your final sentence: "In short, HarperCollins and OCA see a world where there will be many libraries and publishers, as well as many search engines." If HarperCollins is trying to have traffic redirected from Google straight to their site, rather than, say, a library that has licensed HarperCollins electronic content and will make it available for free to a user, then are they really operating with libraries' best interests in mind? One way that web search and book search *are* different is the multiple copy problem, and libraries' existence may depend on search engines returning links to libraries that make the content available for free along with publishers that make it available for a price. I'm sure HarperCollins sees a place for many publishers, but where exactly do libraries fit into their worldview in a world of electronic book search?

Peter Brantley [11.07.07 09:07 AM]

Jerry -

I think that is a critical point: switching has to happen back to libraries, once digital book licensing programs are in place. This would replicate the success of OpenURL resolvers, which are widely deployed for journal literature, whereby institutional subscribers are directed back to their local copy. It would seem theoretically easy for publishers and software solutions providers to do the minimal work necessary to develop this service consistently.

bowerbird [11.07.07 04:15 PM]

tim said:
> In your comments, please be careful
> when using the word "you."
> The way you're writing makes it sound like
> you're accusing me personally of
> various crimes against culture.

tim, when reading my comments, please be careful
how you interpret it when i use the word "you"...
i have absolutely no reason to accuse _you_
-- and i mean _you_personally_, tim o'reilly --
of anything, let alone "crimes against culture".

and if i ever do have any reason for a _personal_
accusation, i will make that crystal-clear...
just like i did in that last paragraph. at least
i _hope_ that last paragraph was crystal-clear...

but i don't anticipate a personal accusation, as
i see you as a nice guy, or i wouldn't be here
conversing with you. you are nice, tim, right?
weren't you the "badge of civility" proponent?

-bowerbird

p.s. i also never said that "anyone who wants to
be paid is greedy". watch your interpretations!
and _no_, i didn't mean just you, tim o'reilly,
personally, by that, but _everyone_ reading here.

herbert van de sompel [11.08.07 07:17 AM]

I would like to pick up on the OpenURL thread initiated by Jerome.

I obviously agree with Jerry and Peter that providing OpenURLs from digitized book repositories to other environments is something we want. We have successfully achieved something similar for journal article repositories, and these book repositories are the logical next step.

But these kinds of OpenURLs that link out from book repositories and into e.g. library environments are just one way in which OpenURL traffic could flow. During the talk on the OCA book scanning project at this week's DLF Forum, I was kind of stunned that the idea of linking _into_ those book repositories using OpenURLs had not yet been explored. I trust we will all agree that these digitized resources (books, chapters, pages) need persistent URIs if we want to be able to do some serious stuff with them. And it looks as if all these massive digitization projects will somehow meet this requirement. As a result, once a resource is known, one will be able to persistently link to it.

However, there is an important use case in which the digitized resource is not (yet) known, but is expected to exist in one or other book repository. Say a book metadata record has been discovered in e.g. Worldcat, and one would like to point at a digitized version of the book that is likely to exist in one of those massive repositories. In that case, an approach would be to generate an HTTP URL that carries some book metadata (available from the Worldcat record), including elements such as isbn, title, author, etc. The HTTP URL would be pointed at e.g. an OCA resolver that would read the metadata from the HTTP URL and redirect to the digitized book resource that corresponds with that metadata. We actually have an ANSI standard that describes how to create those HTTP URLs. We call them OpenURLs and there is even a profile of them that deals with books. The metadata elements of that profile are listed at http://www.openurl.info/registry/docs/mtx/info:ofi/fmt:kev:mtx:book .

To conclude, it would be totally great if we could motivate all these book repositories to support inbound book OpenURLs. That way, every system that has some decent book metadata could point in a consistent manner into those repositories in an attempt to dynamically locate a digitized book that corresponds with the metadata.
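As a rough illustration of the inbound OpenURL idea described above, the sketch below builds such a URL in the key/encoded-value (KEV) style of Z39.88-2004 and then reads the metadata back out, which is the first thing a resolver would have to do. The resolver address resolver.example.org and both function names are hypothetical; no book repository is known to expose this endpoint. Only three elements of the book profile are used (isbn, btitle, au), and the lookup-and-redirect step is left out because it depends entirely on the repository.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical resolver endpoint, used only for illustration.
RESOLVER_BASE = "https://resolver.example.org/openurl"


def build_book_openurl(isbn=None, title=None, author=None, base=RESOLVER_BASE):
    """Build an inbound book OpenURL (KEV style) from whatever metadata is known."""
    params = [
        ("url_ver", "Z39.88-2004"),                    # OpenURL version
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:book"),  # book metadata profile
        ("rft.genre", "book"),
    ]
    if isbn:
        params.append(("rft.isbn", isbn))
    if title:
        params.append(("rft.btitle", title))
    if author:
        params.append(("rft.au", author))
    return base + "?" + urlencode(params)


def read_book_openurl(url):
    """Recover the referent metadata (the rft.* keys) from an OpenURL."""
    query = parse_qs(urlsplit(url).query)
    return {
        key[len("rft."):]: values[0]
        for key, values in query.items()
        if key.startswith("rft.")
    }


if __name__ == "__main__":
    url = build_book_openurl(title="Novum Organum", author="Francis Bacon")
    print(url)
    print(read_book_openurl(url))
```

A catalog system would only need the first function; a repository wanting to accept such links would start from something like the second and then match the recovered metadata against its own holdings.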
