Oct 3

Nikolaj Nyholm

The data mining dilemma

A couple of days ago I pointed to Justin Hall's call for more intense datamining to enhance the web experience, and just yesterday Marc pointed to the Netflix recommendation algorithm contest.
The Netflix contest is interesting for several reasons, but maybe mostly because it points directly at the dilemma between data usefulness and data privacy. The utility of good film recommendations should be obvious. On the other hand, so that contestants can train and evaluate their algorithms, Netflix is publishing anonymized film ratings from its database.

Considering the recent AOL search data publishing uproar this seems an odd thing to do. And if you're not yet convinced that this is at least touching on the privacy issue, here is Dan Frankowski's Google Tech Talk on how to identify individuals based on - you guessed it - anonymized movie ratings.
This isn't just idle speculation but an actual project Frankowski and coworkers have carried out to infer the identity of anonymized movie raters by merging the anonymous movie ratings with non-anonymous movie discussion board posts.
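
To make the idea of merging the two data sets concrete, here is a toy sketch in Python. It is not the method from Frankowski's talk, only an illustration of the general linking idea under simple assumptions: each anonymous rater is represented by the set of films they rated, each forum user by the set of films they mentioned publicly, and a forum user is linked to an anonymous ID when the overlap is large enough and strictly better than any other candidate. The function name, the `min_overlap` threshold and all of the data are hypothetical.

```python
def link_identities(anon_ratings, public_mentions, min_overlap=2):
    """Link public usernames to anonymous rater IDs by film overlap.

    anon_ratings:    {anon_id: set of film titles rated}      (hypothetical data)
    public_mentions: {username: set of film titles mentioned} (hypothetical data)
    Returns {username: anon_id} for users with a unique best match.
    """
    matches = {}
    for user, mentioned in public_mentions.items():
        # Score every anonymous rater by how many films they share with this user.
        scores = sorted(
            ((len(mentioned & rated), anon_id) for anon_id, rated in anon_ratings.items()),
            reverse=True,
        )
        if not scores:
            continue
        best_score, best_id = scores[0]
        runner_up = scores[1][0] if len(scores) > 1 else 0
        # Accept only a match that is strong enough and not tied with another rater.
        if best_score >= min_overlap and best_score > runner_up:
            matches[user] = best_id
    return matches


# Made-up example data: two anonymous raters, two forum users.
anon_ratings = {
    "u1041": {"Brazil", "Primer", "Solaris", "Heat"},
    "u2207": {"Heat", "Titanic", "Shrek"},
}
forum_mentions = {
    "filmfan_dk": {"Brazil", "Primer", "Solaris"},
    "casual_viewer": {"Heat"},
}

links = link_identities(anon_ratings, forum_mentions)
print(links)
```

In this made-up data the user who discusses three fairly obscure films is linked to a single anonymous rater, while the user who only mentions one popular film is not linked to anyone. That is the intuition behind the privacy concern: the more unusual your taste, the easier you are to pick out.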

Bonus link: Another Tech Talk on how to publish minable data without privacy risks.

Comments: 4

  Doug Karr [10.03.06 10:01 AM]

There are 2 schools of thought on this. Do you utilize your customer data to act on their behavior and provide them with what they are looking for? Or do you knowingly collect data and worry about privacy issues?

By now, people are accustomed to providing large amounts of data - especially via the web. It angers me when I get an email from a Pharmacy for geriatric supplies when they have all of the data in the world to recognize that I am 38 years old.

If you are NOT utilizing your clients' data to enhance their experience with your company, you are doing your clients a disservice. It's as though you are not listening to them but they keep trying to tell you what they want.

Now... on the other side is public disclosure of data that a client entrusted you with (i.e. AOL). That's a betrayal of trust, especially when the expectation is that of privacy.

IMHO, you're mixing two different issues.


  Claus [10.03.06 12:36 PM]

The problem with keeping the issues separate is that it is not very efficient to keep on doing so: Data locked up in a silo is less useful
1. Because it is only handled by a few people. The odds of these people realizing the full potential of the data are minimal
2. Because correlations between different data sets aren't utilized

These factors are bringing the issues together.

  Nikolaj [10.03.06 01:09 PM]


I don't disagree with using customer data to better serve your customers, the textbook example obviously being Amazon.

However, the real issue is what to do about opening up this data in a way that makes it purposeful. As Marc noted in his original post (linked above), the fun stuff is going to happen by mixing the data (and thus search) with other sources. And as Brady Forrest seconded on the Radar email back-channel "I can't wait for that data to get out in the open and be used for a greasemonkey script."

It's odd that they want to risk disclosing the data, albeit anonymized, but opened up it is even more valuable in serving customers than when it is just mined internally.

I'm not sure what the answer is but this is not the last time the question will surface.


  marble2 [10.04.06 12:05 AM]

We live in tricky times. Trying to work backwards from someone's movie recommendations to figure out who they are seems a lot less effective than the rampant pretexting available to the bad guys and highlighted during the current HP debacle. Does anyone really care? Maybe if you rented really out of character movies and had strange things to say about them, but I'm not sure what major societal harm emerges from the crowdsourcing that Netflix is attempting to do to try to get a bunch of engineers on the cheap.

We all crave privacy but trade privacy for benefit every time we disclose our information to anyone. If I meet someone while traveling and they ask me where I'm from, I trade not making them uncomfortable by disclosing who I am and where I'm from. Really, I don't know them and haven't qualified them as a trusted party. I could lie, but that makes me uncomfortable. From those deliciously tiny little pieces of information they can go after my identity.

By using my credit card at a restaurant I trade some convenience for the possibility my identity will get stolen. This has a much greater cost than fraudulent use of my card. Overseas banks don't even have the $50 safety net, so by not carrying cash I could have my bank account emptied by using it for convenience while sacrificing my privacy.

Privacy online is different. In the folksonomic, semantic web, every day you contribute something, and in turn you gain immediate rewards. First you get some authority and a loud voice if people actually read what you have to say. Down the road if you are one of the lucky few you might even make some money. If you approach it as a hoarder, you'd never share in the first place.

You can protect your identity and still use the web semantically and socially. Screen names, anyone?

Marketers strike while the iron is hot. If I'm planning a trip around the world and get hit with the perfect offer, I smile. I haven't received the geriatric offer in my thirties, but if I did I would be bummed. So really, outside of identity theft, good timely marketing is heaven and bad marketing is hell, and the latter is usually the result of bad data mining.

To conclude, open the floodgates. Amazon, Yahoo, Netscape, Netflix, and all the fantastic web2.0 services gaining momentum every day. The power is in the people. The web through decentralized checks and balances will sort itself out like it always has.

[sorry for the long comment Nikolaj, the trackbacks are incommunicado]
