Allen Downey

One step closer to a two-hour marathon

In September, Wilson Kipsang ran the Berlin Marathon in 2:03:23, shaving 15 seconds off the world record.  That means it’s time to check in on the world record progression and update my article from two years ago. The following is a revised version of that article, including the new data point.

Abstract: I propose a model that explains why world records in running speed improve linearly, and why they should continue to do so as long as the population of potential runners grows exponentially.  Based on recent marathon world records, I extrapolate that we will break the two-hour barrier in 2043.

Let me start with the punchline:

[Figure: world_record_predict_a, linear extrapolation of the marathon world record progression]
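A rough sketch of the kind of extrapolation involved might look like the following. The two records listed are placeholders rather than the full progression the article fits, so the crossing year it prints is only illustrative and will not match the 2043 figure exactly.

    import numpy as np

    MARATHON_KM = 42.195

    def to_hours(t):
        """Convert a time string like '2:03:23' to hours."""
        h, m, s = (int(x) for x in t.split(":"))
        return h + m / 60 + s / 3600

    # Placeholder records (year, time); the article fits the full progression.
    records = [(2011, "2:03:38"), (2013, "2:03:23")]

    years = np.array([y for y, _ in records], dtype=float)
    speeds = np.array([MARATHON_KM / to_hours(t) for _, t in records])  # km/h

    # Fit average speed as a linear function of year, then solve for the year
    # when the implied finish time drops below two hours.
    slope, intercept = np.polyfit(years, speeds, 1)
    target_speed = MARATHON_KM / 2.0
    crossing_year = (target_speed - intercept) / slope
    print("Projected sub-two-hour year: %.0f" % crossing_year)

Fitting speed rather than finish time is what makes the progression linear in the model described above.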

Read more…

Software engineering practices for graduate students

Recently I was talking with an Olin student who will start graduate school in the fall, and I suggested a few things I wish I had done in grad school. And then I thought I should write them down. So here is my list of Software Engineering Practices All Graduate Students Should Adopt:

Version Control

Every keystroke you type should be under version control from the time you initiate a project until you retire it. Here are the reasons:

  1. Everything you do will be backed up. But instead of being organized by date (which is what most backup systems do), your backups are organized by revision. So, for example, if you break something, you can roll back to an earlier working revision.
  2. When you are collaborating with other people, you can share repositories. Version control systems are well designed for managing this kind of collaboration. If you are emailing documents back and forth, you are doing it wrong.
  3. At various stages of the project, you can save a tagged copy of the repo. For example, when you submit a paper for publication, make a tagged copy. You can keep working on the trunk, and when you get reviewer comments (or a question 5 years later) you have something to refer back to.

I use Subversion (SVN) primarily, so I keep many of my projects on Google Code (if they are open source) or on my own SVN server. But these days it seems like all the cool kids are using Git and keeping their repositories on GitHub.

Either way, find a version control system you like, learn how to use it, and find someplace to host your repository.
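As a concrete example of point 3, here is a small Python sketch of tagging a snapshot with Git; the tag name and message are made up, and it assumes the script runs inside an existing repository with a remote named origin.

    import subprocess

    def tag_snapshot(tag_name, message):
        """Create an annotated Git tag marking a milestone, e.g. a paper submission."""
        subprocess.run(["git", "tag", "-a", tag_name, "-m", message], check=True)
        # Push the tag so collaborators (and future you) can find it on the server.
        subprocess.run(["git", "push", "origin", tag_name], check=True)

    # Hypothetical usage at submission time:
    tag_snapshot("paper-submission-2013-10", "Snapshot of the repo at paper submission")

With Subversion the equivalent is an svn copy from the trunk to a tags directory.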
Read more…

The Red Line problem

One of the chapters of Think Bayes is based on a class project two of my students worked on last semester. It presents “The Red Line Problem,” which is the problem of predicting the time until the next train arrives, based on the number of passengers on the platform.

Here’s the introduction:

In Boston, the Red Line is a subway that runs between Cambridge and Boston. When I was working in Cambridge I took the Red Line from Kendall Square to South Station and caught the commuter rail to Needham. During rush hour Red Line trains run every 7–8 minutes, on average.

When I arrived at the station, I could estimate the time until the next train based on the number of passengers on the platform. If there were only a few people, I inferred that I just missed a train and expected to wait about 7 minutes. If there were more passengers, I expected the train to arrive sooner. But if there were a large number of passengers, I suspected that trains were not running on schedule, so I would go back to the street level and get a taxi.

While I was waiting for trains, I thought about how Bayesian estimation could help predict my wait time and decide when I should give up and take a taxi. This chapter presents the analysis I came up with.

Sadly, this problem has been overtaken by history: the Red Line now provides real-time estimates for the arrival of the next train. But I think the analysis is interesting, and still applies to subway systems that don’t provide estimates.
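To give a flavor of the Bayesian update, here is a simplified sketch, not the chapter's actual model: it assumes passengers arrive at a known constant rate and uses a uniform prior over the time since the last train, where the chapter works with the observed distribution of gaps between trains.

    import numpy as np
    from scipy.stats import poisson

    # Assumed setup (illustrative numbers, not from the chapter).
    arrival_rate = 2.0                    # passengers per minute
    k = 10                                # passengers counted on the platform
    elapsed = np.linspace(0.01, 15, 300)  # possible minutes since the last train

    # Prior over elapsed time: uniform here; the chapter uses the gap distribution.
    prior = np.ones_like(elapsed)
    prior /= prior.sum()

    # Likelihood: probability of seeing k passengers if the last train left
    # `elapsed` minutes ago, with Poisson passenger arrivals.
    likelihood = poisson.pmf(k, arrival_rate * elapsed)

    # Posterior over the time since the last train.
    posterior = prior * likelihood
    posterior /= posterior.sum()
    print("Expected minutes since last train: %.1f" % np.dot(elapsed, posterior))

From a posterior like this you can go on to predict the remaining wait and decide when taking a taxi is the better bet, which is the question the chapter addresses.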

Read more…

Are your data normal? Hint: no.

One of the frequently asked questions over at the statistics subreddit (reddit.com/r/statistics) is how to test whether a dataset is drawn from a particular distribution, most often the normal distribution.

There are standard tests for this sort of thing, many with double-barreled names like Anderson-Darling, Kolmogorov-Smirnov, Shapiro-Wilk, Ryan-Joiner, etc.

But these tests are almost never what you really want. When people ask these questions, what they really want to know (most of the time) is whether a particular distribution is a good model for a dataset. And that’s not a statistical test; it is a modeling decision.

All statistical analysis is based on models, and all models are based on simplifications. Models are only useful if they are simpler than the real world, which means you have to decide which aspects of the real world to include in the model, and which things you can leave out.

For example, the normal distribution is a good model for many physical quantities. The distribution of human height is approximately normal (see this previous blog post). But human heights are not normally distributed. For one thing, human heights are bounded within a narrow range, and the normal distribution goes to infinity in both directions. But even ignoring the non-physical tails (which have very low probability anyway), the distribution of human heights deviates in systematic ways from a normal distribution.
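One way to see why the standard tests answer the wrong question is to run one on a large sample that deviates from normality only slightly. The sketch below uses made-up numbers (a synthetic, mildly skewed mixture), not real height data.

    import numpy as np
    from scipy.stats import normaltest

    rng = np.random.default_rng(17)

    # Synthetic sample: mostly one normal component with a small admixture of
    # another, so the result is close to normal but not exactly normal.
    n = 100_000
    n_main = int(0.95 * n)
    sample = np.concatenate([rng.normal(178, 7.6, n_main),
                             rng.normal(190, 9.0, n - n_main)])

    # With this much data, a goodness-of-fit test detects even a small
    # systematic deviation and rejects normality decisively.
    stat, p = normaltest(sample)
    print("p-value: %.2g" % p)   # far below any conventional threshold

The test tells you the data are not exactly normal, which you already knew; it does not tell you whether the normal model is good enough for the question you are asking.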

Read more…