Matthew Russell

Exploring Twitter Influence with Jaccard Similarity and Python

What Do Tim O’Reilly, Lady Gaga, and Marissa Mayer All Have In Common?

by Matthew Russell | November 22, 2013

Let’s examine the followers of some popular Twitter users by asking the (Freakonomics-inspired) question, What do Tim O’Reilly, Lady Gaga, and Marissa Mayer all have in common? Although it may initially seem like an obnoxious question to ask, some of the answers may intrigue you once you begin to take a closer look at the data. (Although dashingly good looks might be one thing that they all have in common, we’ll let the data do the talking and stick with Twitter followers as the basis of computing similarity.)

Which two of these three accomplished entrepreneurs are most alike? It all depends on the features that you’re comparing!

Goals

The initial idea behind this entire series on Twitter influence is that it would be an interesting and educational experiment in data science to put Tim O’Reilly‘s ~1.7 million followers under the microscope and explore the correlation between popularity (based upon number of followers) and Twitter influence.

In order to draw some meaningful comparisons, however, we’ll need to consider at least one other account. Marissa Mayer seems like a fine selection for comparison since her Twitter account is similar yet different to Tim’s account. For example, she’s also a “tech celebrity” and business executive. However, her particular expertise is not quite the same, and she only has about one-fourth as many followers. (Or so it would initially appear…)

Just to make this interesting, let’s further mix things up a bit by introducing a wildcard. Lady Gaga seems as good a choice as any to introduce a bit of unexpected fun into the situation. She is one of the ten most popular Twitter users based upon number of followers, an accomplished entrepreneur, and surely draws interest from a broad cross-section of the population. The introduction of a third account also provides the opportunity to draw some additional comparisons, so let’s compute the Jaccard index for the various combinations of these three accounts and see what turns up. The Jaccard index measures similarity between sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets, or, more plainly, the amount of overlap between the sets divided by the total size of the combined set. This is a simple way to measure and compare the overlap in followers.

Results

The full results (example code, notes, and the results from executing each cell) are available as an IPython Notebook, and you are encouraged to review it in depth. For convenience, a summary of the key results that you’ll see computed in the notebook follow:

Read more…

Twitter’s Most Fundamental Value

Twitter could be so much better than an advertising company

by Matthew Russell | November 12, 2013

We can now gather from Twitter’s IPO that it’s fundamentally postured as an advertising company, but its real value isn’t in advertising. Twitter’s most fundamental value rests squarely within data analytics. However, just because Twitter could make a lot of money in advertising doesn’t mean that advertising is where it should concentrate the majority of your efforts or where its most fundamental value proposition lies.

More specifically, Twitter’s most fundamental value is in the overall collective intelligence of its user base when interpreted as an interest graph. Think of an interest graph as a mapping of people to their interests. In other words, if you follow an account on Twitter, what you’re really saying is that you’re interested in that account. Even though there’s lots to be gleaned in all of the little 140 character quips associated with a particular account, there’s a good bit you can tell about a person by solely examining the accounts that the person follows.

Read more…

Writing Paranoid Code

Computing Twitter Influence, Part 2

by Matthew Russell | October 10, 2013

In the previous post of this series, we aspired to compute the influence of a Twitter account and explored some relevant variables to arriving at a base metric. This post continues the conversation by presenting some sample code for making “reliable” requests to Twitter’s API to facilitate the data collection process.

Given a Twitter screen name, it’s (theoretically) quite simple to get all of the account profiles that follow the screen name. Perhaps the most economical route is to use the GET /followers/ids API to request all of the follower IDs in batches of 5,000 per response, followed by the GET /users/lookup API to retrieve full account profiles for up to Y of those IDs in batches of 100 per response. Thus, if an account has X followers, you’d need to anticipate making ceiling(X/5000) API calls to GET /followers/ids and ceiling(X/100) API calls toGET /users/lookup. Although most Twitter accounts may not have enough followers that the total number of requests to each API resource presents rate-limiting problems, you can rest assured that the most popular accounts will trigger rate-limiting enforcements that manifest as an HTTP error in RESTful APIs.

Read more…

Mining One Million Tweets About #Syria

Surprising social media stats

by Matthew Russell | September 11, 2013

I’ve been filtering Twitter’s firehose for tweets about “#Syria” for about the past week in order to accumulate a sizable volume of data about an important current event. As of Friday, I noticed that the tally has surpassed one million tweets, so it seemed to be a good time to apply some techniques from Mining the Social Web and explore the data.

While some of the findings from a preliminary analysis confirm common intuition, others are a bit surprising. The remainder of this post explores the tweets with a cursory analysis addressing the “Who?, What?, Where?, and When?” of what’s in the data.

Read more…

Matthew Russell

Exploring Twitter Influence with Jaccard Similarity and Python

What Do Tim O’Reilly, Lady Gaga, and Marissa Mayer All Have In Common?

Goals

Results

Twitter’s Most Fundamental Value

Twitter could be so much better than an advertising company

Investigating the Twitter Interest Graph

Why Is Twitter All the Rage?

Writing Paranoid Code

Computing Twitter Influence, Part 2

Computing Twitter Influence, Part 1: Arriving at a Base Metric

The subtle variables affecting a base metric

Mining One Million Tweets About #Syria

Surprising social media stats