Fri, Dec 7, 2007

Andy Oram

An editor critiques the publishing industry's Automated Content Access Protocol

The Automated Content Access Protocol (ACAP) is a new venture by an international consortium of publishers: a proposed technical solution to the tug of war between publishers and intermediaries such as search engines and news aggregation sites. This article goes into some detail about ACAP and offers both a technical and a philosophical context for judging its impact and chances of success.

I exchanged email with Mark Bide of Rightscom Limited, Project Coordinator of ACAP, who provided explanations of the project and a defense in response to my critique. I'll incorporate parts of his insights here--of course with his permission!

What the Automated Content Access Protocol does

Clashes between publishers and intermediary Internet sites have entered the news routinely over the past decade. Publishers recognize the waste and risk involved in their current legal activities, notably:

  • Suing sites that offer large chunks of books or news articles without licensing them, a concern related to the Association of American Publishers' lawsuit against Google for providing a new channel to long-forgotten information and potential sales. (So far as I know, O'Reilly Media was the only publisher that publicly backed Google in this controversy.) A major part of ACAP is devoted to specifying a "snippet," "extract," or "thumbnail" that can be displayed when a site is found, and nailing down the requirement that search engines display only what the publisher wants displayed.
  • Suing news crawlers that display articles in frames next to advertisements that generate revenue for the news crawlers--in place of the original advertisements that the publisher put up to generate revenue for itself. One field in the protocol is provided to explicitly forbid this practice.

The designers of ACAP posit a cooperation between publishers and search sites, whereby publishers put up their specifications for display and search sites honor these specifications. The ACAP specifications (listed mostly in Part 1 of a technical document on the site I pointed to) are loaded with a range of features that publishers think would improve their business model, such as requirements that search engines accompany links with credits or licensing conditions for articles.

One's judgment of ACAP could be influenced by whether one sees the current problems in publishing as social or technical. Bide says the problems are technical ones, because publishers can't convey their intentions along with their content to search engines and aggregators. I see difficult social issues here, such as author integrity versus innovation in derivative uses, and copyright infringement versus fair use. I see ACAP as a classic stratagem of applying technical fixes to social problems.

The whole bag is presented as an extension of the traditional robots.txt file, which is like describing the Pacific Ocean as an extension of the Bering Strait. Unlike the simple yes/no decisions and directory listings of robots.txt, ACAP overflows with conditions that, as we will see, multiply rapidly into a world of calculations.
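
For comparison, here is the kind of thing a traditional robots.txt contains, followed by a rough sketch of the sort of conditional directives ACAP layers on top of it. The ACAP-style lines are illustrative only; I am not quoting the actual directive names from the specification.

    # Traditional robots.txt: plain yes/no decisions per directory
    User-agent: *
    Disallow: /private/
    Allow: /archive/

    # ACAP-style additions (directive names invented for illustration)
    ACAP-allow-index: /news/
    ACAP-allow-present-snippet: /news/ max-length=200
    ACAP-disallow-present-withinframe: /news/

The first block answers a single question--may you crawl this?--while the second starts attaching conditions to what happens after the crawl.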

Officially backed by the World Association of Newspapers, the International Publishers Association, and the European Publishers Council, ACAP seems tailored mostly to news sites, but appeals to other publishers as well.

What is the goal?

In my opinion, ACAP is a platform for a new business partnership between publishers and search engines. Considering how much work the search engines will have to do to re-instrument many parts of their code for gathering and displaying information, I don't believe they'll do it without a cut of the take. Thus, an important feature of robots.txt that ACAP employs is the User-agent line that can provide rules for a particular crawler.
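
The per-crawler mechanism itself is plain robots.txt; what would be new is the business arrangement riding on it. Here is a sketch of how a publisher might favor a licensed partner (the crawler names are hypothetical):

    # Hypothetical crawler names; the User-agent mechanism is standard robots.txt
    User-agent: LicensedNewsBot
    Disallow:

    User-agent: *
    Disallow: /

An empty Disallow line grants full access, so the licensed crawler sees everything while every other crawler is shut out.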

If successful, this initiative could turn into nothing less than a new information source that's separate from and parallel to the existing Internet, although using its underlying lines and protocols. I would not be surprised if publishers started encrypting content so it wouldn't even be found by search engines that fail to obtain a license.

I summarize the attitude of the publishing consortium that thought up ACAP as "We don't like the Internet, but we'd better compete with it." I acknowledge that the publishers will feel this characterization is insulting, but I intend it to highlight the difference between the free flow for which the Internet is known and the hedges built by ACAP.

On the other hand, the success of ACAP depends on search engines integrating ACAP-controlled searches with general search results. The best hope publishers have is to see their content promoted in general-purpose searches, cheek by jowl with free content from all around the Internet. Segregating content, even if just in searches, would put publishers on the same road as CompuServe.

Currently, many web users link to publisher content or republish it without using crawlers, but publishers presumably don't need a special protocol to deal with any economic impact of occasional deep links or copying. Still, ACAP may also have applications outside of search engines, according to Bide. It could be the basis for agreements with many trading partners.

Technical demands of ACAP

Lauren Weinstein presciently demonstrates that publishers are likely to turn ACAP from a voluntary cooperation into a legal weapon, and suggests that it shifts the regulatory burden for copyright infringement (as well as any other policy defined at the whim of the publishers) from the publishers to the search engines. I would add that a non-trivial technical burden is laid on search engines too.

Bide assured me that the designers of ACAP consulted with several search engine companies (most of whom do not want to be listed publicly) and have run tests establishing that the technology is feasible. I'm sure search engines can handle the calculations required to observe ACAP rules for publishers who use it, given that the search engines already routinely generate indexes from billions of documents, each containing thousands of words.

Bide writes, "There is some overhead on deciding at display time whether or not and how a particular item can be displayed (as snippet, thumbnail, etc), but the proportion of web pages about which search engines will have to make this kind of decision is tiny, since they will only be associated with high-value content from commercial publishers, who represent only a small proportion of the content of a search engine's index."

So here I simply ask how much coding and validation is required to conform to ACAP, and what the incentive is for publishers and search engines to do so.

First, the search engine must compile a policy that could be a Cartesian product of a huge number of coordinates, such as the following (a rough sketch of such a policy record appears after the list):

  • Whether to index the actual page found, or another source specified by the publisher as a proxy for that page, or just to display some fixed text or thumbnail provided by the publisher
  • When to take down the content or recrawl the site
  • Whether conversions are permitted, such as from PDF to HTML
  • Whether translations to another language are permitted
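
To make the combinatorial point concrete, here is a minimal Python sketch of what a compiled per-resource policy record might hold. The field names are my invention, not ACAP's:

    from dataclasses import dataclass
    from datetime import datetime
    from typing import Optional

    @dataclass
    class CompiledPolicy:
        """One compiled policy; field names are illustrative, not taken from ACAP."""
        index_target: str               # "page", "publisher-proxy", or "fixed-snippet"
        fixed_snippet: Optional[str]    # text or thumbnail supplied by the publisher
        take_down_after: Optional[datetime]
        recrawl_after: Optional[datetime]
        allow_format_conversion: bool   # e.g., PDF to HTML
        allow_translation: bool

Even this stripped-down record, with a handful of modes, dates, and booleans, yields dozens of combinations that the indexing and display code must be prepared to handle.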

Although publishers will probably define only one or two policies for their whole site, the protocol allows policies to be set for each resource (news page, picture, etc.) and therefore the search engine must store the compiled policies and re-evaluate them when it encounters each resource. To display content in conformance with publishers' wishes, the search engine must store some attributes for the entire time it maintains information on the page.

Seasoned computer programmers and designers by now can recognize the hoary old computing problem of exponential complexity--the trap of trying to apply a new tool to every problem that is currently begging for a solution. Compounding the complexity of policies is some complexity in identifying the files to which policies apply. ACAP uses the same format for filenames as robots.txt does, but some rarely-used extensions of that format interact with ACAP to increase complexity. Search engines decide which resources to apply a policy to by checking a filename such as:


/news/*/image*/

The asterisks here can refer to any number of characters, including the slashes that separate directory names. So at whatever level in the hierarchy the image*/ subdirectories appear, the search engine has to double back and figure out whether it's part of /news/. The calculation involved here shouldn't be as bad as the notorious wildcard checks that can make a badly designed regular expression or SQL query take practically forever. For a directory pathname, there are ways to optimize the check--but it still must be performed on every resource. And if there are potentially competing directory specifications (such as /news/*.jpg) the search engine must use built-in rules to decide which specification applies, a check that I believe must be done at run-time.
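
Here is a minimal sketch of the run-time check involved, assuming a tie-breaking rule of "longest literal prefix wins"; ACAP defines its own precedence rules, so treat this only as an illustration of the kind of work required:

    import re

    def pattern_to_regex(pattern):
        # In this path syntax, "*" may span any characters, including "/".
        return re.compile("^" + ".*".join(map(re.escape, pattern.split("*"))))

    def applicable_policy(path, policies):
        matches = [(pat, pol) for pat, pol in policies.items()
                   if pattern_to_regex(pat).match(path)]
        if not matches:
            return None
        # Assumed tie-breaker: the pattern with the longest literal prefix wins.
        matches.sort(key=lambda m: len(m[0].split("*")[0]), reverse=True)
        return matches[0][1]

    policies = {"/news/*/image*/": "thumbnail-only", "/news/*.jpg": "no-index"}
    print(applicable_policy("/news/2007/12/images/photo1/", policies))  # thumbnail-only

Compiling the patterns once helps, but the lookup still has to run against every resource the crawler touches.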

The ACAP committee continues to pile on new demands. Part 2 of their specification adds the full power of their protocol to META tags in HTML files. This means each HTML file could potentially have its own set of policies, and the content of the file must be read to determine whether it does. Once again, publishers are not likely to use ACAP that way, but the provision of that feature would require the search engine to be prepared to compile a policy and store it for each HTML file. As Bide points out, Yahoo! already provides a "nocontent" class that can be placed on any element in an HTML file to keep crawlers from indexing that element. I maintain that such extensions don't require search engines to juggle a large set of policies for each document, as ACAP does.
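
HTML can already carry a simple per-page directive through the standard robots META tag; Part 2 of ACAP extends the same idea to the full policy vocabulary (I won't guess at ACAP's actual META syntax here):

    <!-- Standard robots META tag: one simple per-page directive -->
    <meta name="robots" content="noindex, nosnippet">

Even in this simple form, the crawler has to fetch and parse the page body before it knows whether a policy applies; with ACAP's richer vocabulary, that parse could yield a full policy record to compile and store for each file.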

Search engines may be spared some of the complexity (and the consequent risk of error) of ACAP implementation if the project provides a library to parse the specifications, but each search engine still must hook the operations into its particular algorithms and functions--and these hooks extend to nearly everything it does with content.
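
Here is a minimal sketch, with every class and method name invented for illustration, of how far such hooks would spread through an engine's pipeline. It assumes a policy store that performs the kind of pattern matching sketched earlier:

    class AcapAwareIndexer:
        """Hypothetical hook points; names are invented for illustration.
        policy_store.lookup() is assumed to match URL patterns and return a
        policy record (or None) for a given URL."""

        def __init__(self, policy_store):
            self.policies = policy_store

        def on_fetch(self, url, body):
            policy = self.policies.lookup(url)   # run-time check on every resource
            self.store(url, body, policy)        # policy must live as long as the index entry

        def on_display(self, url):
            policy = self.policies.lookup(url)
            if policy is not None and policy.index_target == "fixed-snippet":
                return policy.fixed_snippet      # publisher-supplied blurb, verbatim
            return self.extract_snippet(url)     # the engine's normal behavior

        def store(self, url, body, policy): ...
        def extract_snippet(self, url): ...

Caching, format conversion, translation, framing, and take-down dates would all need similar checks at their own points in the pipeline.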

ACAP as a collaboration

The concept of collaboration between search engines and targeted sites is not new. I spoke enthusiastically about such an effort back in December 2003, going so far as to call it a harbinger of search's next generation. My interest at that time lay in dynamic content: the huge number of facts stored in databases and currently unavailable to search engines because they normally look only at static content.

For example, you can search for your Federal Express package number on the Federal Express web site, but only because Federal Express submits your search to its own database. The content doesn't exist in any static form available to Google. But Google can use the Federal Express site to do the search and act as middleman. As described in a BusinessWeek article:

...Google is providing this new shipment tracking service even though it doesn't have a partnership with FedEx. Rather, Google engineers have reprogrammed it to query FedEx directly with the information a user enters and provide the hyperlink direct to the customer's information.

This is Web 2.0 long before Tim O'Reilly coined the term: two sites mashed-up for the convenience of the user. Such combined searches are flexible, open the door to new applications, and provide a wealth of data that was previously hidden. Search engines have also implemented dynamic information retrieval in other areas, such as airline flights and patents. Searches for addresses turn up names of institutions located at those addresses, while searches for institutions turn up addresses, maps, etc.
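
For a rough idea of what the query side of such a mashup involves, here is a sketch in which both the tracking-number pattern and the deep-link URL are assumptions for illustration, not FedEx's or Google's actual rules:

    import re

    # Assumption: treat any standalone 12- or 15-digit number as a candidate tracking number.
    TRACKING_RE = re.compile(r"\b(\d{12}|\d{15})\b")

    def deep_link_for(query):
        """If the query looks like a tracking number, return a deep link into
        the carrier's own tracker (URL format is illustrative)."""
        m = TRACKING_RE.search(query)
        if m:
            return "http://www.fedex.com/Tracking?tracknumbers=" + m.group(1)
        return None

    print(deep_link_for("where is 123456789012"))

The value comes from recognizing the structure of the user's query and delegating to the publisher's own database, not from policing what the engine is allowed to show.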

Contrast such innovation with ACAP. It imposes rules in the place of flexibility, closes off possibilities rather than opens them up, and makes data less valuable. When a publisher can require that a specific blurb be presented instead of the relevant content found by a search engine, not only does that prevent innovation in searches; it tempts the publisher to broadcast self-serving promotions that ill serve the users trying to make informed judgments about the content they're searching for.

ACAP is also the opposite of productive collaboration in a technical sense: far from contributing its own resources (not to mention its own intelligence regarding the structure and details of its content) to a search, the publisher is putting an additional burden of calculation on the search engine.

Finally come the problems of standardization that I described three years ago in the article From P2P to Web Services: Addressing and Coordination. Standards reflect current needs and go only so far as the committee's imagination allows. They must be renegotiated by all parties whenever a new need arises. New uses may be held up until the specification is amended. And the addition of new use cases exacerbates the complexity from which ACAP already suffers.

Bide offers a general perspective on our conversation:

"I guess at its heart we have a disagreement here about whether the owners of content (authors and publishers) have any right to decide whether and how their content should be used -- and then have a mechanism to make those decisions transparent so that others may know their intentions. At the moment, in so far as they wish to exercise that right with respect to content that is openly available on the Internet, they can express their policies only in multi-thousand word sets of terms and conditions on their websites. Alternatively, they can simply keep their content as far away from the Internet as possible. You may believe that publishers are commercially unwise to seek to exercise any control over the use made of their content -- a point of view but one with which honest men may disagree. The work undertaken in the ACAP pilot project is a first step towards solving that challenge."

This looks to me like a narrow, negative defense of the project: one based on publishers' perceptions of problems in a new medium. I take the blame here for pushing so hard with my criticisms that I made Bide focus on the issue of content rights, rather than a more inspiring justification promoting the hope of creating new business opportunities. Let's see how many search engines actually implement ACAP, and whether the financial rewards make it worthwhile for both search engines and publishers. I still place my bet on external sources of innovation and collaboration with user communities--are those fostered by ACAP?

December 9: James Grimmelmann of New York Law School's Institute for Information Law and Policy (who is a programmer as well as law professor) wrote a critique of ACAP as a specification that will interest people who have to deal with interpreting requirements, and a follow-up.



Comments: 5

  bowerbird [12.07.07 09:08 AM]

gosh, let's hope the search engine companies are
smart enough to kick this to the curb right away.

and then let's see how these "publishers" like
being invisible in cyberspace...

-bowerbird

  Pete [12.08.07 02:26 PM]

How on earth do they expect this to fly? Why would a search engine or aggregator bother to take a peek at the extended robots.txt if this is what they would find?

To me this looks like publishers want the whole cake. They want search engines to crawl their sites in a heavily controlled way to drive visitors and then they want to limit what the search engine can display to the user?!

As has been mentioned elsewhere, there are laws that provide cover for illegal use of copyrighted material. Or is this a sign that copyright law is broken? The Pirate party is probably right.

  Andy Wong [12.09.07 05:10 PM]

"On the Internet, nobody knows you're a dog."

But ACAP wants to know whether you are human or search engine.

The idea of ACAP is that you as a human can read the pages along with the ads, while you as a search engine will be discriminated against.

While I think that Google (and its peers) should form better partnerships and a win-win strategy with news publishers, I think ACAP from those publishers is just an idealistic and outdated illusion that doesn't consider how the collective intelligence of search engines can benefit them.

  Terry Steichen [12.10.07 03:27 AM]

Hi Andy,

Been years since our last contact. Can't even remember the details; I reviewed a book you were writing and I was debating Clay Shirky about, as I recall, charging for online content.


Anyway, to the point: I agree with you on ACAP. The way I see it, it is fundamentally wrong to require a second party (search engines) to enforce the rules of a first party (publishers). To do so places legal liability on the wrong party, and can easily (as you correctly point out) create potentially significant burdens on the second party, including burdens that might have nothing to do with the first party. Moreover, once such a relationship is accepted, it will inevitably be expanded to include other parties (consumers, aggregators, etc.).


Best,


Terry

  Karl Fogel [12.12.07 11:09 PM]

Sign me on for bowerbird's and Pete's comments :-).

It's fascinating just how thoroughly the ACAP project leaders Don't Get It. They have this idea that they can just write up an arbitrarily complex set of rules for how information -- "their" information, as they probably think of it, neatly confusing the possessive and associative senses of "their" -- should be allowed to spread around the Internet. The issue isn't so much that their specification file will be hard to implement, it's that things, well, just don't work that way on the Internet.

Bide at least seizes the point with both hands, when he writes: "I guess at its heart we have a disagreement here about whether the owners of content (authors and publishers) have any right to decide whether and how their content should be used..."

Yes, that's exactly what the disagreement is about -- except that one might respin it as "producers" rather than "owners" of content. The very meaning of ownership gets rather shaky when information resides on a huge distributed network that functions, essentially, as a global copying machine.

You were tremendously polite in your response, Andy, as always. But Bide and the ACAP initiative are working against the grain of the network, so to speak, and just like you can't fight City Hall, you can't fight the Internet. The data will flow, whether they want it to or not.
