Exploring Twitter Influence with Jaccard Similarity and Python
What Do Tim O’Reilly, Lady Gaga, and Marissa Mayer All Have In Common?
Let’s examine the followers of some popular Twitter users by asking the (Freakonomics-inspired) question, What do Tim O’Reilly, Lady Gaga, and Marissa Mayer all have in common? Although it may initially seem like an obnoxious question to ask, some of the answers may intrigue you once you begin to take a closer look at the data. (Although dashingly good looks might be one thing that they all have in common, we’ll let the data do the talking and stick with Twitter followers as the basis of computing similarity.)
The initial idea behind this entire series on Twitter influence is that it would be an interesting and educational experiment in data science to put Tim O’Reilly’s ~1.7 million followers under the microscope and explore the correlation between popularity (based upon number of followers) and Twitter influence.
In order to draw some meaningful comparisons, however, we’ll need to consider at least one other account. Marissa Mayer seems like a fine selection for comparison since her Twitter account is similar to Tim’s in some ways yet different in others. For example, she’s also a “tech celebrity” and business executive. However, her particular expertise is not quite the same, and she only has about one-fourth as many followers. (Or so it would initially appear…)
Just to make this interesting, let’s further mix things up a bit by introducing a wildcard. Lady Gaga seems as good a choice as any to introduce a bit of unexpected fun into the situation. She is one of the ten most popular Twitter users based upon number of followers, an accomplished entrepreneur, and surely draws interest from a broad cross-section of the population. The introduction of a third account also provides the opportunity to draw some additional comparisons, so let’s compute the Jaccard index for the various combinations of these three accounts and see what turns up. The Jaccard index measures similarity between sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets, or, more plainly, the amount of overlap between the sets divided by the total size of the combined set. This is a simple way to measure and compare the overlap in followers.
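As a minimal sketch of the definition above (not the notebook's actual code), the Jaccard index of two follower sets can be computed with Python's built-in set operations. The function name `jaccard_index` and the tiny follower ID sets are made up for illustration; real follower sets would contain millions of IDs.

```python
def jaccard_index(a, b):
    """Return |a ∩ b| / |a ∪ b| for two collections of follower IDs."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty sets have no overlap to measure; report 0.0 by convention.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical follower IDs, purely for illustration.
followers_a = {1, 2, 3, 4}
followers_b = {3, 4, 5, 6}

# Two shared followers out of six distinct followers overall: 2 / 6.
print(jaccard_index(followers_a, followers_b))
```

A value of 0 means the two accounts share no followers, and a value of 1 means their follower sets are identical.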
The full results (example code, notes, and the results from executing each cell) are available as an IPython Notebook, and you are encouraged to review it in depth. For convenience, a summary of the key results that you’ll see computed in the notebook follows:
Twitter’s Most Fundamental Value
Twitter could be so much better than an advertising company
We can now gather from Twitter’s IPO that it’s fundamentally positioned as an advertising company, but its real value isn’t in advertising. Twitter’s most fundamental value rests squarely within data analytics. Just because Twitter could make a lot of money in advertising doesn’t mean that advertising is where it should concentrate the majority of its efforts or where its most fundamental value proposition lies.
More specifically, Twitter’s most fundamental value is in the overall collective intelligence of its user base when interpreted as an interest graph. Think of an interest graph as a mapping of people to their interests. In other words, if you follow an account on Twitter, what you’re really saying is that you’re interested in that account. Even though there’s lots to be gleaned from all of the little 140-character quips associated with a particular account, there’s a good bit you can tell about a person solely by examining the accounts that the person follows.
Writing Paranoid Code
Computing Twitter Influence, Part 2
In the previous post of this series, we aspired to compute the influence of a Twitter account and explored some variables relevant to arriving at a base metric. This post continues the conversation by presenting some sample code for making “reliable” requests to Twitter’s API to facilitate the data collection process.
Given a Twitter screen name, it’s (theoretically) quite simple to get all of the account profiles that follow the screen name. Perhaps the most economical route is to use the GET /followers/ids API to request all of the follower IDs in batches of 5,000 per response, followed by the GET /users/lookup API to retrieve full account profiles for those IDs in batches of 100 per response. Thus, if an account has X followers, you’d need to anticipate making ceiling(X/5000) API calls to GET /followers/ids and ceiling(X/100) API calls to GET /users/lookup. Although most Twitter accounts don’t have enough followers for the total number of requests to either API resource to cause rate-limiting problems, you can rest assured that the most popular accounts will trigger rate-limit enforcement, which manifests as an HTTP error, as is typical of RESTful APIs.
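To get a feel for the request counts involved, the arithmetic above can be sketched in a few lines of Python. The batch sizes come from the paragraph above; the function and constant names are hypothetical, and the follower count is the approximate ~1.7 million figure cited earlier in this series for Tim O’Reilly.

```python
FOLLOWER_IDS_PER_CALL = 5000  # batch size for GET /followers/ids, per the text
PROFILES_PER_CALL = 100       # batch size for GET /users/lookup, per the text


def ceil_div(n, d):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-n // d)


def estimate_api_calls(follower_count):
    """Return (ids_calls, lookup_calls) needed to fetch every follower profile."""
    ids_calls = ceil_div(follower_count, FOLLOWER_IDS_PER_CALL)
    lookup_calls = ceil_div(follower_count, PROFILES_PER_CALL)
    return ids_calls, lookup_calls


# Approximate follower count for illustration only.
ids_calls, lookup_calls = estimate_api_calls(1_700_000)
print(ids_calls, lookup_calls)  # 340 17000
```

The profile lookups dominate by a factor of 50, so that is where rate limits are most likely to be felt when harvesting a popular account.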