The data mining dilemma

A couple of days ago I pointed to Justin Hall’s call for more intense data mining to enhance the web experience, and just yesterday Marc pointed to the Netflix recommendation algorithm contest.
The Netflix contest is interesting for several reasons, but maybe mostly because it points directly at the dilemma between data usefulness and data privacy. The utility of good film recommendations should be obvious. On the other hand, so that contestants can train and evaluate their algorithms, Netflix is publishing anonymized film ratings from its database.

Considering the recent uproar over AOL’s publication of search data, this seems an odd thing to do. And if you’re not yet convinced that this at least touches on the privacy issue, here is Dan Frankowski’s Google Tech Talk on how to identify individuals based on – you guessed it – anonymized movie ratings.
This isn’t just idle speculation: Frankowski and coworkers actually carried out a project that infers the identity of anonymized movie raters by merging the anonymous ratings with non-anonymous posts on a movie discussion board.
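
To make the idea of merging the two data sets a bit more concrete, here is a toy sketch in Python. It is not the method from the talk, just one plausible way such a linkage could work: for each named forum user, pick the anonymous rater who shares the most movies, counting rarely rated movies more heavily. The function name `link_identities`, the rarity weighting and all of the sample data are my own assumptions, made up for illustration.

```python
from collections import Counter
from math import log


def link_identities(anon_ratings, forum_mentions):
    """For each named forum user, guess the most similar anonymous rater.

    anon_ratings:   anonymous user id -> set of movies that user rated
    forum_mentions: forum user name   -> set of movies that user mentioned
    (Both structures are hypothetical, chosen for this illustration.)
    """
    # Count how many anonymous users rated each movie.
    popularity = Counter(m for movies in anon_ratings.values() for m in movies)
    n_users = len(anon_ratings)

    def score(mentioned, rated):
        # Shared movies count more when few people rated them
        # (assumed weighting: log of inverse popularity).
        return sum(log(n_users / popularity[m]) for m in mentioned & rated)

    guesses = {}
    for name, mentioned in forum_mentions.items():
        # Pick the anonymous id with the highest overlap score.
        guesses[name] = max(
            anon_ratings, key=lambda uid: score(mentioned, anon_ratings[uid])
        )
    return guesses


# Made-up sample data.
anon_ratings = {
    "u1": {"Star Wars", "Titanic", "Primer"},
    "u2": {"Star Wars", "Titanic"},
    "u3": {"Star Wars", "Eraserhead"},
}
forum_mentions = {
    "alice": {"Star Wars", "Primer"},
    "bob": {"Eraserhead"},
}

print(link_identities(anon_ratings, forum_mentions))
# → {'alice': 'u1', 'bob': 'u3'}
```

Note how the blockbuster everybody has rated contributes nothing to the score, while a single obscure title is enough to single out one rater. That is the uncomfortable part: the less mainstream your taste, the easier you are to find.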

Bonus link: Another Tech Talk on how to publish minable data without privacy risks.