Exploring Twitter Influence with Jaccard Similarity and Python

What Do Tim O’Reilly, Lady Gaga, and Marissa Mayer All Have In Common?

Let’s examine the followers of some popular Twitter users by asking the (Freakonomics-inspired) question, What do Tim O’Reilly, Lady Gaga, and Marissa Mayer all have in common? Although it may initially seem like an obnoxious question to ask, some of the answers may intrigue you once you begin to take a closer look at the data. (Although dashingly good looks might be one thing that they all have in common, we’ll let the data do the talking and stick with Twitter followers as the basis of computing similarity.)

Which two of these three accomplished entrepreneurs are most alike? It all depends on the features that you’re comparing!

Goals

The initial idea behind this entire series on Twitter influence is that it would be an interesting and educational experiment in data science to put Tim O’Reilly’s ~1.7 million followers under the microscope and explore the correlation between popularity (based upon number of followers) and Twitter influence.

In order to draw some meaningful comparisons, however, we’ll need to consider at least one other account. Marissa Mayer seems like a fine selection for comparison since her Twitter account is similar to Tim’s in some ways yet different in others. For example, she’s also a “tech celebrity” and business executive. However, her particular expertise is not quite the same, and she only has about one-fourth as many followers. (Or so it would initially appear…)

Just to make this interesting, let’s further mix things up a bit by introducing a wildcard. Lady Gaga seems as good a choice as any to introduce a bit of unexpected fun into the situation. She is one of the ten most popular Twitter users based upon number of followers, an accomplished entrepreneur, and surely draws interest from a broad cross-section of the population.

The introduction of a third account also provides the opportunity to draw some additional comparisons, so let’s compute the Jaccard index for the various combinations of these three accounts and see what turns up. The Jaccard index measures similarity between sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets, or, more plainly, the amount of overlap between the sets divided by the total size of the combined set. This is a simple way to measure and compare the overlap in followers.
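To make the definition concrete, here is a minimal sketch of the Jaccard index in Python. It is not the code from the notebook; the function name `jaccard_index`, the account labels, and the follower IDs are all made up for illustration, and returning 0.0 for two empty sets is simply a convention chosen here.

```python
from itertools import combinations


def jaccard_index(a, b):
    """Return |A intersect B| / |A union B| for two collections of hashable items."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: treat two empty sets as having zero similarity.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical follower ID sets (not real Twitter data).
followers = {
    "account_a": {1, 2, 3, 4, 5, 6},
    "account_b": {4, 5, 6, 7},
    "account_c": {6, 7, 8, 9, 10, 11, 12, 13},
}

# Compute the index for every pair of accounts.
pairwise = {
    (x, y): jaccard_index(followers[x], followers[y])
    for x, y in combinations(followers, 2)
}

for (x, y), score in pairwise.items():
    print(f"{x} vs {y}: {score:.3f}")
```

With real accounts, each set would hold the follower IDs gathered for that account, and the pair with the highest score would be the two accounts whose audiences overlap the most relative to their combined size.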

Results

The full results (example code, notes, and the results from executing each cell) are available as an IPython Notebook, and you are encouraged to review it in depth. For convenience, a summary of the key results that you’ll see computed in the notebook follows:

We can now gather from Twitter’s IPO that it’s fundamentally postured as an advertising company, but its real value isn’t in advertising. Twitter’s most fundamental value rests squarely within data analytics. Just because Twitter could make a lot of money in advertising doesn’t mean that advertising is where it should concentrate the majority of its efforts or where its most fundamental value proposition lies.

More specifically, Twitter’s most fundamental value is in the overall collective intelligence of its user base when interpreted as an interest graph. Think of an interest graph as a mapping of people to their interests. In other words, if you follow an account on Twitter, what you’re really saying is that you’re interested in that account. Even though there’s a lot to be gleaned from all of the little 140-character quips associated with a particular account, there’s a good bit you can tell about a person solely by examining the accounts that the person follows.
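As a rough illustration of the idea, an interest graph can be sketched as a plain mapping from each person to the set of accounts that person follows. Everything in this snippet is hypothetical: the user names, the account labels, and the helper `shared_interests` are invented for the example and do not come from Twitter’s API.

```python
from collections import defaultdict

# Hypothetical "follows" edges: (follower, followed account).
follows = [
    ("alice", "tech_publisher"),
    ("alice", "pop_star"),
    ("bob", "tech_publisher"),
    ("bob", "search_exec"),
    ("carol", "pop_star"),
]

# Build the interest graph: person -> set of accounts they follow.
interest_graph = defaultdict(set)
for follower, account in follows:
    interest_graph[follower].add(account)


def shared_interests(u, v):
    """Return the accounts that both u and v follow."""
    return interest_graph.get(u, set()) & interest_graph.get(v, set())


print(shared_interests("alice", "bob"))
```

Reading the mapping in the other direction (account to followers) gives the follower sets that a similarity measure such as the Jaccard index operates on.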

Why Is Twitter All the Rage?

I’m presenting a short webcast entitled Why Twitter Is All the Rage: A Data Miner’s Perspective that is loosely adapted from material that appears early in Mining the Social Web (2nd Ed), and I wanted to share the content that inspired the topic. The remainder of this post is a slightly abridged reproduction of a section that appears early in Chapter 1. If you enjoy it, you can download all of Chapter 1 as a free PDF to learn more about mining Twitter data.