"Hardcore Data Science" entries
Understanding information cascades, viral content, and significant relationships.
I rarely work with social network data, but I’m familiar with the standard problems confronting data scientists who work in this area. These include questions pertaining to network structure, viral content, and the dynamics of information cascades.
At last year’s Strata + Hadoop World NYC, Cornell professor and Nevanlinna Prize winner Jon Kleinberg walked the audience through a series of examples from social network analysis, looking at the content of shared photos and text, as well as the structures of the networks. It was a truly memorable presentation from one of the foremost experts in network analysis. Each of the problems he discussed would be of interest to marketing professionals, and the analytic techniques he described were accessible to many data scientists. What struck me is that while these topics are easy to describe, framing the right question requires quite a bit of experience with the underlying data.
Predicting whether an information cascade will double in size
Can you predict whether a piece of information (say, a photo) will be shared only a few times or hundreds (if not thousands) of times? Large cascades are very rare, which makes predicting eventual size difficult. You either default to a pathological answer (after all, most pieces of information are shared only once), or you create a balanced data set (composed of an equal number of small and large cascades) and end up solving an artificial task.
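To make the "pathological answer" concrete, here is a minimal sketch using made-up cascade sizes (the data, the threshold of 100 shares, and the function name are all illustrative assumptions, not figures from the talk). It scores a classifier that predicts "small" for every cascade.

```python
def always_small_baseline(final_sizes, large_threshold):
    """Score a classifier that predicts 'small' for every cascade.

    final_sizes: eventual share counts, one per cascade (synthetic here).
    large_threshold: hypothetical cutoff for calling a cascade 'large'.
    """
    n_large = sum(size >= large_threshold for size in final_sizes)
    # Every small cascade is classified correctly, every large one is missed.
    accuracy = (len(final_sizes) - n_large) / len(final_sizes)
    recall_large = 0.0  # no cascade is ever predicted to be large
    return accuracy, recall_large


# Synthetic, heavily skewed sizes: most items are shared once, two go big.
toy_sizes = [1] * 95 + [2, 3, 5, 150, 1200]
accuracy, recall_large = always_small_baseline(toy_sizes, large_threshold=100)
print(f"accuracy={accuracy:.2f}, recall on large cascades={recall_large:.2f}")
```

On data this skewed, the do-nothing baseline looks accurate while finding none of the cascades anyone cares about, which is why raw eventual size makes a poor prediction target.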
Thinking of a social network as an information transport layer, Kleinberg and his colleagues instead set out to track the evolution of cascades. In the process, they framed an interesting, naturally balanced prediction problem: given a cascade of size k, predict whether it will reach size 2k. (It turns out that 2k is roughly the median final size of cascades that reach size k, so about half of them double.)
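The doubling task can be sketched as a labeling step. The snippet below is only an illustration on synthetic sizes; `doubling_task` and the toy data are assumptions of mine, not code or data from Kleinberg's group. It keeps the cascades that reached size k and labels each by whether it went on to reach 2k.

```python
from statistics import median


def doubling_task(final_sizes, k):
    """Build labels for: 'given a cascade of size k, will it reach 2k?'

    Only cascades that actually reached size k are eligible; in a real
    setting, features would be computed from what is observable at size k.
    """
    eligible = [size for size in final_sizes if size >= k]
    labels = [size >= 2 * k for size in eligible]
    return eligible, labels


# Synthetic final sizes, chosen so the conditional median sits at 2k.
toy_final_sizes = [1, 1, 2, 3, 5, 6, 8, 10, 12, 20, 40]
k = 5
eligible, labels = doubling_task(toy_final_sizes, k)
positive_rate = sum(labels) / len(labels)

print("median size among cascades that reached k:", median(eligible))
print("fraction that doubled:", round(positive_rate, 2))
```

When the median size of cascades that reach k is close to 2k, roughly half of the labels are positive, so the classes are balanced without any artificial resampling.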