Writing Paranoid Code

Computing Twitter Influence, Part 2

In the previous post of this series, we aspired to compute the influence of a Twitter account and explored some variables relevant to arriving at a base metric. This post continues the conversation by presenting some sample code for making “reliable” requests to Twitter’s API to facilitate the data collection process.

Given a Twitter screen name, it’s (theoretically) quite simple to get all of the account profiles that follow the screen name. Perhaps the most economical route is to use the GET /followers/ids API to request all of the follower IDs in batches of 5,000 per response, followed by the GET /users/lookup API to retrieve full account profiles for those IDs in batches of 100 per response. Thus, if an account has X followers, you’d need to anticipate making ceiling(X/5000) API calls to GET /followers/ids and ceiling(X/100) API calls to GET /users/lookup. Most Twitter accounts don’t have enough followers for the total number of requests to either API resource to cause rate-limiting problems, but you can rest assured that the most popular accounts will trigger rate-limit enforcement, which RESTful APIs surface as an HTTP error.
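The arithmetic above can be sketched in a few lines of Python. The function and constant names here are illustrative, not part of any Twitter library; the batch sizes are the ones quoted in the text.

```python
FOLLOWER_IDS_PER_CALL = 5000  # batch size for GET /followers/ids (from the text)
PROFILES_PER_CALL = 100       # batch size for GET /users/lookup (from the text)


def ceil_div(numerator, denominator):
    """Integer ceiling division, avoiding floating point."""
    return (numerator + denominator - 1) // denominator


def estimate_api_calls(num_followers):
    """Return (ids_calls, lookup_calls) needed to fetch every follower profile."""
    ids_calls = ceil_div(num_followers, FOLLOWER_IDS_PER_CALL)
    lookup_calls = ceil_div(num_followers, PROFILES_PER_CALL)
    return ids_calls, lookup_calls


# An account with 12,345 followers needs 3 calls for IDs and 124 for profiles.
print(estimate_api_calls(12345))  # (3, 124)
```

Note how quickly the second number grows: an account with a million followers needs 200 calls for the IDs but 10,000 calls for the profiles, so the profile lookups are where rate limits will bite first.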

Although it seems more satisfying to have all of the data you could ever want, you should ask yourself whether you really need every follower profile for an account of interest, or whether a sufficiently large random sample will do. Be advised, however, that in order to truly collect a random sample of followers for an account, you must sample from the full population of all follower IDs as opposed to just taking the first N follower IDs. The reason is that Twitter’s API docs state that IDs are currently returned with “the most recent following first” but that the order may change with little to no notice. Either way, there’s no expectation or guarantee of randomness. We’ll revisit this topic in the next post, in which we begin harvesting profiles.
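As a minimal sketch of the sampling point, assume you have already collected the complete list of follower IDs. The function name and the optional seed parameter are illustrative choices; the sampling itself uses Python’s standard random module.

```python
import random


def sample_follower_ids(all_follower_ids, sample_size, seed=None):
    """Draw a uniform random sample (without replacement) from the full ID list.

    Taking all_follower_ids[:sample_size] instead would favor whatever order
    the API happens to return, such as the most recent followers.
    """
    ids = list(all_follower_ids)
    if sample_size >= len(ids):
        return ids  # fewer IDs than requested: return them all
    rng = random.Random(seed)  # a seed makes the sample reproducible
    return rng.sample(ids, sample_size)


# Stand-in data: pretend these are 100,000 follower IDs already harvested.
follower_ids = list(range(100000))
sample = sample_follower_ids(follower_ids, 1000, seed=42)
print(len(sample))
```

The cost of doing this properly is that you still pay for every GET /followers/ids call; the savings come from looking up profiles only for the sampled IDs.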

Write Paranoid Code

Only a few things are guaranteed in life: taxes, death, and that you will encounter inconvenient HTTP error codes when trying to acquire remote data. You can never simply assume that code that makes network requests won’t hit “unexpected” errors, because making calls to a remote web server inherently introduces the possibility of failure.

In order to successfully harvest non-trivial amounts of remote data, you must employ robust code that expects errors to happen as a normal occurrence, as opposed to being an exceptional case that “probably won’t happen.” Write code that expects a mysterious kind of network error to crop up somewhere deep in the guts of the underlying HTTP library that you are using, be prepared for service disruptions such as Twitter’s “fail whale,” and by all means, ensure that your code accounts for rate-limiting and all other well-documented HTTP error codes that the API documentation provides.
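One generic way to treat errors as routine is to wrap each request in a retry loop that waits longer after every failure. The sketch below is not the book’s example and does not use any Twitter library: call_with_retries, its parameters, and the choice of which exception types to retry are all assumptions made for illustration. The request itself is passed in as a plain callable.

```python
import time


def call_with_retries(request_fn, retry_on=(OSError,), max_retries=5,
                      base_delay=1.0, max_delay=900.0, sleep=time.sleep):
    """Call request_fn(), retrying with exponential backoff on failure.

    retry_on   -- tuple of exception types treated as transient (assumption:
                  the caller decides which errors are worth retrying)
    max_delay  -- cap on any single wait, in seconds (illustrative default)
    sleep      -- injectable so the loop can be exercised without waiting
    """
    delay = base_delay
    for attempt in range(max_retries + 1):
        try:
            return request_fn()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: let the caller see the error
            sleep(min(delay, max_delay))
            delay *= 2  # back off: 1s, 2s, 4s, ...


# Usage with a stand-in request that fails twice before succeeding.
attempts = {"count": 0}


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise OSError("simulated network hiccup")
    return {"ids": [1, 2, 3]}


print(call_with_retries(flaky_request, sleep=lambda seconds: None))
```

A production version would likely distinguish between error types, for example waiting out a rate-limit window rather than backing off exponentially, and giving up immediately on errors that a retry cannot fix.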

Finally, ensure that you don’t lose data if your code fails despite your best efforts. Persist the data returned from each request so that your code doesn’t run for an extended duration only to fail and leave you with nothing at all to show for it. With the data persisted, you can recover by restarting from the point of failure as opposed to starting from scratch. For what it’s worth, I’ve found that consistently writing code that behaves this way is a little easier said than done, but like anything else, it gets easier with a little bit of practice.
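A minimal sketch of that idea follows, assuming a cursor-style paging scheme in which a cursor of -1 requests the first page and a returned cursor of 0 means there are no more pages. The function names, the file layout (one JSON object per line), and the fetch_page callable are all hypothetical.

```python
import json
import os


def harvest_with_checkpoints(fetch_page, path, start_cursor=-1):
    """Fetch pages until the cursor is 0, appending each page to path.

    fetch_page(cursor) must return (items, next_cursor). Each page is written
    to disk before the next request, so a crash loses at most one page.
    """
    cursor = start_cursor
    if os.path.exists(path):
        # Resume: the last complete line records the next cursor to request.
        with open(path) as f:
            for line in f:
                try:
                    cursor = json.loads(line)["next_cursor"]
                except ValueError:
                    break  # truncated final line from an earlier crash
    while cursor != 0:
        items, next_cursor = fetch_page(cursor)
        with open(path, "a") as f:
            f.write(json.dumps({"items": items, "next_cursor": next_cursor}))
            f.write("\n")
        cursor = next_cursor


def load_items(path):
    """Read back every item saved so far, skipping any truncated line."""
    items = []
    with open(path) as f:
        for line in f:
            try:
                items.extend(json.loads(line)["items"])
            except ValueError:
                break
    return items
```

If the process dies partway through, running harvest_with_checkpoints again with the same path picks up at the saved cursor instead of re-requesting pages that are already on disk.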

Making Paranoid Twitter API Requests

Example 9-16 [viewable IPython Notebook link from Mining the Social Web’s GitHub repository] presents a pattern for making paranoid Twitter API requests and is reproduced below. It accounts for the HTTP errors in Twitter’s API documentation as well as a couple of other errors (such as urllib2’s infamous BadStatusLine exception) that sometimes appear, seemingly without rhyme or reason. Take a moment to study the code to see how it works.

In the next post, we’ll continue the conversation by using make_twitter_request to acquire account profiles so that the data science/mining can begin. Stay tuned!

Note: If you’re interested in learning more about the tools and techniques presented here, you’ll want to check out Matthew’s tutorial, Mining Social Web APIs with IPython Notebook, at our Strata Conference + Hadoop World in New York on October 28-30, 2013. This originally appeared on miningthesocialweb.com.
