The Red Line problem

One of the chapters of Think Bayes is based on a class project two of my students worked on last semester. It presents “The Red Line Problem,” which is the problem of predicting the time until the next train arrives, based on the number of passengers on the platform.

Here’s the introduction:

In Boston, the Red Line is a subway that runs between Cambridge and Boston. When I was working in Cambridge I took the Red Line from Kendall Square to South Station and caught the commuter rail to Needham. During rush hour Red Line trains run every 7–8 minutes, on average.

When I arrived at the station, I could estimate the time until the next train based on the number of passengers on the platform. If there were only a few people, I inferred that I had just missed a train and expected to wait about 7 minutes. If there were more passengers, I expected the train to arrive sooner. But if there were a large number of passengers, I suspected that trains were not running on schedule, so I would go back to the street level and get a taxi.

While I was waiting for trains, I thought about how Bayesian estimation could help predict my wait time and decide when I should give up and take a taxi. This chapter presents the analysis I came up with.

Sadly, this problem has been overtaken by history: the Red Line now provides real-time estimates for the arrival of the next train. But I think the analysis is interesting, and it still applies to subway systems that don’t provide estimates.

One interesting tidbit:

As it turns out, the average time between trains, as seen by a random passenger, is substantially higher than the true average.

Why? Because a passenger is more likely to arrive during a large interval than a small one. Consider a simple example: suppose that the time between trains is either 5 minutes or 10 minutes with equal probability. In that case the average time between trains is 7.5 minutes.

But in fact a passenger is twice as likely to arrive during a 10-minute gap as during a 5-minute gap. If we surveyed arriving passengers, we would find that 2/3 of them arrived during a 10-minute gap, and only 1/3 during a 5-minute gap. So the average time between trains, as seen by an arriving passenger, is 2/3 × 10 + 1/3 × 5 = 8.33 minutes.
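The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the two-gap example above, not the code from the book; the variable names are my own.

```python
from fractions import Fraction

# Gap lengths in minutes and their probabilities, as in the example above.
gaps = {5: Fraction(1, 2), 10: Fraction(1, 2)}

# Average gap from the point of view of the train schedule.
true_mean = sum(gap * p for gap, p in gaps.items())

# A passenger lands in a gap with probability proportional to gap * p.
weights = {gap: gap * p for gap, p in gaps.items()}
total = sum(weights.values())
seen = {gap: w / total for gap, w in weights.items()}

# Average gap as experienced by an arriving passenger.
seen_mean = sum(gap * p for gap, p in seen.items())

print(float(true_mean))            # 7.5
print(seen[5], seen[10])           # 1/3 2/3
print(round(float(seen_mean), 2))  # 8.33
```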

This kind of observer bias appears in many contexts. Students think that classes are bigger than they are, because more of them are in the big classes. Airline passengers think that planes are fuller than they are, because more of them are on full flights.

In each case, values from the actual distribution are oversampled in proportion to their value. In the Red Line example, a gap that is twice as big is twice as likely to be observed.
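The same reweighting works for any discrete distribution. Here is a minimal sketch, assuming the distribution is stored as a plain dictionary that maps values to probabilities. The helper names `bias_pmf` and `pmf_mean` are hypothetical, and the class sizes are made up for illustration rather than taken from real data.

```python
def bias_pmf(pmf):
    """Reweight each value in proportion to its size, then renormalize."""
    weighted = {x: x * p for x, p in pmf.items()}
    total = sum(weighted.values())
    return {x: w / total for x, w in weighted.items()}


def pmf_mean(pmf):
    """Mean of a distribution given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())


# Illustrative numbers only: half the classes have 10 students, half have 50.
class_sizes = {10: 0.5, 50: 0.5}
seen_by_students = bias_pmf(class_sizes)

print(pmf_mean(class_sizes))                 # average class size: 30.0
print(round(pmf_mean(seen_by_students), 2))  # average as seen by students: 43.33
```

In this made-up case the average class has 30 students, but a randomly chosen student sits in a class of about 43 on average, because five out of six students are in the big classes.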

The data for the Red Line are close to this example. The actual average time between trains is 7.6 minutes (based on 45 trains that arrived at Kendall Square between 4pm and 6pm so far this week). The average gap as seen by random passengers is 8.3 minutes.

The MBTA provides a web interface for the location of trains along the Red Line. I collected data for 5 workdays and computed the time between trains during the afternoon rush hour. This figure shows the distribution of gaps between trains (z) and the biased distribution as seen by passengers (zb):

[Figure: distribution of gaps between trains (z) and the biased distribution as seen by passengers (zb)]

The distribution of z shows that the most common gap between trains is 7 minutes, but gaps are sometimes as long as 15 minutes. Because passengers are more likely to arrive during the long gaps, the distribution of zb is shifted to the right. At the high end, passengers would over-report the number of 15-minute gaps by a factor of about two, because a 15-minute gap is roughly twice the 7.6-minute average.

You can read about the rest of the analysis in Chapter 8 of Think Bayes, but here’s what the results look like:

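[Figure: probability that the time until the next train exceeds 15 minutes, as a function of the number of passengers on the platform]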

The x-axis is the number of passengers you see on the platform. The y-axis is the probability that the time until the next train exceeds 15 minutes (which means you will not get to South Station in time). If you carry this graph in your pocket, you can use it to decide when to go upstairs and catch a taxi.

Editor’s Note: This post originally appeared in the Probably Overthinking It blog. It has been edited.
