Mon 10.30.06

Tim O'Reilly

The Persistence of (Bad) Online Data

One of the looming problems in our data-rich yet decentralized future is the difficulty of fixing bad data, once it is released onto the net. We've heard stories of Facebook profiles that turned up in embarrassing circumstances, identity theft, and so on, but I want to focus on a slightly different problem, one that potentially has a technical solution. And that is the question of who or what is the authoritative source of data on a particular topic.

We face this problem all the time with book metadata in our publishing business. Retailers demand notification of upcoming books as much as six months before the book is published. As you can imagine, titles change, page counts change, prices change -- sometimes books are cancelled -- and corralling the old data becomes a game of whack-a-mole. You'd think that the publisher would be an authoritative source of correct data, but this turns out not to be the case, as some wholesalers and retailers have difficulty updating their records, or worse, retailers sometimes overwrite newer, correct data with older bad data from one of the wholesalers who also supply them.

A particularly trying case of persistent bad data was brought to my attention a few months ago by Lou Rosenfeld, co-author of Information Architecture for the World Wide Web. We'd been in discussions with Lou to publish a new book on search analytics. The discussions had progressed as far as an editor issuing a contract for discussion, before Lou decided to launch his own publishing company.

And that's where the problem started. Issuing a contract at O'Reilly results in various data being added to our own internal tracking databases. When Lou decided to publish his own book, we flagged the book as "cancelled" in our own database. But unfortunately, our system sends out automated notices of cancelled titles via an ONIX feed -- even cancelled titles that have never been announced. That would have been all well and good, except that someone else's automated system noticed a new book in the cancelled-book feed and added the book to their database -- without the cancelled flag!

(Aside: That's a classic bug cascade. As my friend Andrew Singer once noted, debugging is discovering what you actually told your computer to do, rather than what you thought you told it. Unfortunately, you sometimes don't understand what you told your computer to do until some particular corner case emerges.)

From there, it went from bad to worse. Lou and his co-author Rich Wiggins first let us know about the problem in May, and our folks went to work trying to clear the bad data out of the channel. We'd get the data cleared out of some accounts, and then it would reappear as they repopulated their databases from wholesalers who hadn't yet made the fix. By July, Lou and Rich were getting (politely) irate. Lou wrote:

I know and appreciate that you've tried to rectify the problem, but now that this data [is] out there, it's propagating, and the problem is getting worse. It's caused me some frustration, and my co-author is downright angry.
 

In a sense, this is a form of identity theft, however inadvertent, and will likely have a very damaging impact on our ability to sell the book. As many potential readers know that I'm also an O'Reilly author, they may already assume that O'Reilly is our book's publisher. The incorrect data now in many book catalogs will likely confirm this wrong assumption, and will generally confuse the marketplace.

Unfortunately, despite all our efforts, the bad data remains up at bookstores that ought to be able to fix this, including Powells, Amazon Canada, and Kinokuniya. A Google Search still turns up 56 results for an ISBN that should never have propagated beyond O'Reilly's internal database.

It's frustrating for Lou and Rich -- although here's hoping we can turn lemons into lemonade by giving them some publicity for the book, Search Analytics for Your Site: Conversations With Your Customers, which they expect to publish in January. In addition to this piece, we're going to end up creating a page for the book on oreilly.com which says that we didn't publish it, and pointing readers to Rosenfeld Media instead. Now that's not all bad -- I'm happy to be giving Lou some link love, and hope we can send some sales his way -- but it's a real shame that it's not possible simply to remove the bad data. As Lou said in one email, "the whole story seems to be such a strong illustration of the downsides of connected and linked databases (and therefore very much a Web 2.0 lesson)."

And the lesson seems to be that you can't ever take anything back. What you have to do is to spread correct information, and hope that it bubbles up more strongly than the incorrect information it's competing with.

But I did hint at the possibility of a technical solution. And I want to propose it here, in lazyweb fashion. It seems to me that a critical part of our future distributed database infrastructure is the development of metadata about data integrity and authority. Looking forward, we should expect a world in which automated data feeds replace manual processes. But in that world, we'll also expect that some feeds are more authoritative than others, and that especially for unique data items (such as ISBNs), the owner of that data item would be given precedence over others reporting on the status of that item.
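To make the idea concrete, here's a minimal sketch of what an authority-aware merge might look like. The record format, field names, and authority scores are hypothetical -- not part of ONIX or any existing standard -- but the rule is the one I'm proposing: when two sources disagree about an ISBN, the record from the party that owns the ISBN wins, no matter which feed arrived last.

    from dataclasses import dataclass

    @dataclass
    class TitleRecord:
        isbn: str
        status: str       # e.g. "forthcoming" or "cancelled"
        source: str       # who sent this record
        authority: int    # higher = more authoritative; the publisher of record is highest

    def merge(current: TitleRecord, incoming: TitleRecord) -> TitleRecord:
        """Keep whichever record comes from the more authoritative source."""
        if incoming.authority >= current.authority:
            return incoming
        return current    # a wholesaler's stale copy cannot overwrite the publisher's data

    # Example: the publisher cancels a title; a wholesaler later re-announces it.
    publisher = TitleRecord("9780596000000", "cancelled", "publisher-of-record", authority=10)
    wholesaler = TitleRecord("9780596000000", "forthcoming", "wholesaler-feed", authority=1)
    print(merge(publisher, wholesaler).status)   # "cancelled" -- the owner's status sticks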




Comments: 20

Mike Glasspool   [10.30.06 06:45 AM]

It almost seems as though we need to 'sign' all data and have a CRL-type system in which all data is passed with its signature and it can be revoked by the author. This becomes a problem if the original data is copied and pasted.

An interesting problem to solve and it will require a lot of co-operative effort to keep databases correct.
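A rough sketch of what Mike's sign-and-revoke idea might look like. This is illustrative only: a real system would use public-key signatures rather than the shared secret used here, and the field names are made up.

    import hmac, hashlib, json

    SECRET = b"publisher-signing-key"        # stand-in for a real signing key

    def sign(record: dict) -> str:
        payload = json.dumps(record, sort_keys=True).encode()
        return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()

    revoked = set()                          # the "CRL": signatures the author has withdrawn

    record = {"isbn": "9780596000000", "status": "forthcoming"}
    sig = sign(record)
    revoked.add(sig)                         # later, the author takes the record back

    def is_trustworthy(record: dict, sig: str) -> bool:
        return hmac.compare_digest(sign(record), sig) and sig not in revoked

    print(is_trustworthy(record, sig))       # False -- downstream copies should be discarded

The obvious gap, as Bob Aman notes in a later comment, is that this only helps if every copier in the chain actually checks the signature.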

David Megginson   [10.30.06 08:50 AM]

Ironically, the biggest problem here is that your title/ISBN information is free (as in beer). When there's a strong commercial relationship, like in the news industry, it's much easier to deal with embargos, kills, etc., because the rules are well-defined through enforceable contracts among a relatively small group.

Most news stories are online for only a few days (usually with the latest corrections), while Atom- or RSS-syndicated blog postings live forever (often in a wide range of different versions). Apparently, the same is true of ISBN/title announcements.

curtis snow   [10.30.06 10:00 AM]

Sounds like an instance for dating the data.

Example: a tag that says this data expires on "some date" & is not valid beyond "some date". For data to be valid it needs to be refreshed periodically ("keep-alive"), which is not difficult when automated (tho of course there are other "issues" that would come from this)
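A small sketch of the expiry idea (the field names are hypothetical): every record carries a valid-until date, and consumers drop anything that nobody has refreshed.

    from datetime import date

    record = {"isbn": "9780596000000", "status": "forthcoming", "valid_until": date(2006, 12, 31)}

    def is_current(rec, today=None):
        today = today or date.today()
        return rec["valid_until"] >= today

    # A consumer that honours the expiry simply ignores stale records.
    if not is_current(record, today=date(2007, 2, 1)):
        print("record expired -- drop it unless the publisher re-announces the title")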

Bob Aman   [10.30.06 10:14 AM]

I was about to mention the same thing Mike said, but honestly, the cryptographic solution — while technically it's probably the best fit for the problem — rarely makes for a realistic solution simply because it increases complexity so significantly.

Total   [10.30.06 10:24 AM]

"them some publicity for the book, Search Analytics for Your Site: Conversations With Your Customers, "

So this is published by O'Reilly, right?

Tim O'Reilly   [10.30.06 10:38 AM]

Total -- your comment is either meant to be funny, or suggests that you didn't read the posting closely. I'll assume the former. But just to be sure: The whole point of the posting is that the book is being published by Lou's new company, Rosenfeld Media, NOT by O'Reilly.

Total   [10.30.06 10:39 AM]

"your comment is either meant to be funny"

You wound me, sir. I rather thought that it _was_ funny.

Tim O'Reilly   [10.30.06 10:40 AM]

Mike -- re "signing" all data -- I'm not sure that's necessary. All we really need is some kind of "authority" metadata standard that associates certain types of data with the most authoritative provider. So for example, an ISBN is "owned" by the publisher who registers it. Therefore, their data about the status of that ISBN ought to be more authoritative than (and thus override) data from less authoritative sources.

Tim O'Reilly   [10.30.06 10:42 AM]

David -- I have to disagree. Much of the bad ISBN data is propagating not through public web pages (though a bit of that is starting to happen) but through the pages of companies that are linked in a web of commercial B2B relationships. The issue is that a feed from a distributor like Ingram or Baker & Taylor may be given priority at a retailer over a feed directly from the publisher.

Tim O'Reilly   [10.30.06 11:58 AM]

Total, I'm laughing! I just wanted to make sure that no one missed the point.

Somehow reminds me of the early scene in Romeo and Juliet: "would the law be on my side if I bite my thumb at you?"

David Megginson   [10.30.06 12:01 PM]

Thanks for the reply, Tim. My point was more along the lines that you don't (and shouldn't be able to) enforce usage rights over the data, the way that news services can enforce usage rights over their syndicated material. When someone buys the right to display or print an article, there is usually information attached that traces it all the way back to its source copyright holder, along with restrictions on how and when that information can be used.

I don't think that's practical for product information, because the information exists only to help you sell the product. As the information provider, you want the widest possible dissemination; at the same time, the retailers and wholesalers want the most complete information possible, so they scavenge it from wherever they can find it.

That said, some kind of advisory metadata probably is a good idea, along the lines of cache and expire headers over HTTP. The trouble is that people will still mess up and ignore or misinterpret it, as they ignored the advisory 'kill' in your initial message.

David   [10.30.06 02:40 PM]

This isn't really an online problem - it's a problem common to IT systems everywhere. It's why your bank sends you two copies of each promotion with your name spelt differently, or why your HR department and payroll department have two different addresses for you.


It's a computerised version of Chinese whispers.


People have been trying to solve this problem for years. Data warehouses, data cleansing, metadata etc etc etc. None of it has really worked.


I really think that the problem is inherent in the transmission of information, and the likely remedies are going to come from that domain - error correction, verification, retransmission etc.


(Stateful communication? Wash my mouth out!)

JimL   [10.30.06 03:39 PM]

Tim

Very interesting story - funny, but I can also see how it would be maddening for Lou.

It seems like this issue of bad information is even more evident in the world of blogging. Increasingly, because of syndication and the popularity of aggregators like Techmeme (which IS cool), bad information propagates very rapidly in the blogosphere.

An example of this was the recent incorrect meme that Apple was trademarking "podcasting". This rumor was reported as fact by dozens of blogs. Most people checking a site like Techmeme and seeing dozens of posts supporting the meme would think it was true.

Compounding this effect are the facts that scandal and snarkiness propagate through the blogosphere more readily than thoughtful discussion, and that search engines tend to bubble up rumors with lots of links.

Because of these effects, the quality of news and information that percolates to the top of the blogosphere seems to be shifting towards entertaining misinformation.

Your suggestion about adding a layer of meta information about the quality of the data seems like it could apply to the blogosphere.

Currently, the link largely serves this meta information role. Links, though, represent what we think is interesting to talk about, not what is right or valid information.

Maybe we need a way to convey richer meta information with links - so that we could link to things that are stupid but entertaining without them becoming "facts".

rhandir   [10.31.06 06:14 AM]

I think David Megginson hit it above:
"the retailers and wholesalers want the most complete information possible, so they scavenge it from wherever they can find it."

I've noticed in other publishing genres that the publisher is the least authoritative source on when the book will actually arrive at a point of purchase. (E.g. Dark Horse has been saying that "Colors" would ship in August 2005. Still not out, after pushing the date back each quarter. See Sporadic Sequential for details.)

It seems that there are two problems:
1. Wholesalers/bookstores don't trust anyone's data, so they built their systems to silently fail closed - any data that indicates a book exists makes it exist. This makes sense: who wants to miss stocking something that might be the next Harry Potter?

2. Readers, reviewers, journalists and retail-end-bookstore-management don't have any way to discount false positives from the above system, since the publicly available data from the publisher is frequently fantasy.

Possible solution for #1:
Publish two RSS feeds: books coming from O'Reilly, and books no longer coming from O'Reilly. Any ISBN appearing in both can be subtracted by the end user's database. (That shouldn't require human intervention!)

Possible solution for #2:
Present your end users with the same data you present your supply chain. Openness wins. If you are supplying crap data to your supply chain, chances are your most vocal fans will let you know much faster than your supply chain will.
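A minimal sketch of rhandir's solution #1 (the feed contents here are hypothetical): the consumer pulls both feeds and subtracts the withdrawn ISBNs, with no human in the loop.

    # ISBNs harvested from the "books coming" feed
    announced = {"9780596000000", "9780596000017", "9780596000024"}

    # ISBNs harvested from the "books no longer coming" feed
    withdrawn = {"9780596000017"}

    still_coming = announced - withdrawn
    print(sorted(still_coming))   # the withdrawn title drops out of the consumer's catalog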

Graeme Williams   [11.01.06 12:52 PM]

Well, it may be that more complicated solutions are better, but what about hop count? You could set the value to whatever you wanted in the O'Reilly database, but anyone who copied the data would have to subtract one from the hop count when they took the copy, as well as enforcing the same rule on any downstream copiers.

You could start off low -- perhaps just 1 when a contract is issued -- and increase it as a book gets closer to real.
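In code, the hop-count rule might look like this (a sketch; the field name is made up): each copier decrements the count, and a record that reaches zero may not be passed any further downstream.

    def copy_record(record):
        if record["hops_remaining"] <= 0:
            raise ValueError("record may not be redistributed further")
        downstream = dict(record)
        downstream["hops_remaining"] -= 1
        return downstream

    fresh = {"isbn": "9780596000000", "status": "contract issued", "hops_remaining": 1}
    first_copy = copy_record(fresh)     # fine: one wholesaler may take a copy
    # copy_record(first_copy)           # would raise -- the data stops propagating here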

fauigerziger   [11.02.06 12:49 AM]

The trouble with metadata solutions is that there's no way to make metadata stick to the content. ISBNs end up in text documents or in database fields where the data model doesn't support attaching metadata at the field level.

I can see only two solutions. The first is a rather stringent process which requires a close contractual relationship between all copiers of data. This is not always possible and it has its obvious drawbacks in flexibility and complexity. The other is some kind of third-party trust brokering infrastructure. I know this is vague and can hardly be called a solution. But it's something that a body like the W3C could be tasked with. I vaguely recall that Tim Berners-Lee's Semantic Web idea has the trust thing at the very top of the stack.

Roman Bischoff   [11.02.06 02:00 PM]

Yet another approach would be a combination of push & pull styles:

Notifications (feeds) are just to notify :)
The responsibility for a correct, up-to-date copy of the data could be transferred to the data consumer.

If somebody wants to act upon the received notification data, she should fetch an up-to-date copy first, which means a "pull" on the master system for this data item (metadata). In a RESTful architecture you would maybe just need the URI of each data item's master system to do that.

In this case this would mean a request to your "internal tracking databases" (directly or indirectly via ONIX). Your system could respond with HTTP 301 Moved Permanently and provide the URL to Lou Rosenfeld's System.

But I guess in the end it's mostly a question of incentives. What is the motivation for all the data sinks to clean up their data as long as it's not a real problem for them?
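A rough sketch of the pull-on-notify pattern Roman describes (the endpoint URL is hypothetical, and the 301 handling is just one way the handoff could be expressed):

    import requests

    def fetch_master_record(isbn):
        # On a notification, re-fetch the record from the master system rather than
        # trusting the pushed copy.
        url = f"https://catalog.example.com/titles/{isbn}"   # hypothetical master endpoint
        resp = requests.get(url, allow_redirects=False, timeout=10)
        if resp.status_code == 301:
            # The original publisher points at the record's new owner.
            return requests.get(resp.headers["Location"], timeout=10).json()
        resp.raise_for_status()
        return resp.json()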

Avi Rappoport   [11.08.06 01:01 PM]

JimL pointed out the similarity to rumors and urban legends, such as the incorrect claim that Apple would trademark the term "podcasting". Snopes.com has been the only real solution to the urban legend problem -- I have now trained my mom and my mother-in-law to check Snopes first, and reduced the nonsense in my email to zero.

Some kind of clearinghouse that all the bookstores could check might be an incomplete and yet good-enough solution.

Al   [04.18.07 08:17 PM]

Bill, you mentioned the "Copycat" nature of the tragedy in Va. As long as you and other news stations cover and support this event, there WILL be other "copycats". You continue to give it press.....you get more and more copycats! Love ya man but move on to another subject!

Tim O'Reilly   [04.18.07 10:16 PM]

Al -- you've got the wrong O'Reilly.

