Decoding Climate Change with Perl, gnuplot and Google Earth

Back in August The New York Times reported that the word ‘statistics’ had replaced the word ‘plastics’ in the famous career guidance given in the film The Graduate. And more recently the same paper reported that data and its analysis are the future of science.

And it’s not just in business and ivory towers that statistical
analysis of masses of data is becoming important: just understanding
the wealth of percentages, risk factors and charts that confront us
all requires a form of ‘data literacy’. As a recent blog post by Jon
Udell points out, even understanding your electricity bill requires everyday analytical thinking.

For those of us who can code (in whatever language), raw data presents
a rich target for analysis, and there’s plain fun to be had crunching
the numbers to understand it better. A case in point is the recent
release of climate change data, which is accessible to anyone with
some basic mathematical and computing skills.

If you don’t have the coding skills you can still follow along and
analyze the output of the program using basic statistical skills like
calculating the mean and standard deviation of a set of data.
Although climate science itself is very complex, much is within the
grasp of someone with basic statistics knowledge.

On December 8, 2009 the UK Met Office released historical land surface
temperature records spanning 150 years of measurements from around the
globe. The data was a subset of the measurements used to create the
CRUTEM3 and HadCRUT3 datasets.

HadCRUT3 is a dataset showing the combined land and marine
temperature anomalies on a 5° by 5° grid-box basis (CRUTEM3 just has
the land temperatures). The globe is divided into grid boxes of 5° of
longitude by 5° of latitude and an average temperature for that box is
calculated from all the temperatures measured at meteorological
stations falling within that square of the Earth’s surface.
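
To make the gridding concrete, here’s a minimal sketch (not the exact code from my script) of mapping a station’s latitude and longitude to a 5° by 5° grid box. The row/column numbering is just one plausible convention, and the longitude sign convention is an assumption worth checking against the station files.

#!/usr/bin/perl
use strict;
use warnings;

# Map a station's latitude/longitude to a 5-degree by 5-degree grid box.
# Rows run 0..35 from the North Pole southwards, columns 0..71 eastwards
# from 180W; this indexing is my own convention for the sketches below.
sub grid_box {
    my ( $lat, $long ) = @_;
    my $row = int( ( 90 - $lat ) / 5 );      # 0..35, north to south
    my $col = int( ( $long + 180 ) / 5 );    # 0..71, west to east
    $row = 35 if $row > 35;                  # clamp the pole/date-line edges
    $col = 71 if $col > 71;
    return ( $row, $col );
}

# Example: San Francisco International Airport at 37.6N, 122.4W.
# Note the station file below prints the longitude without a sign, so the
# east/west convention needs checking; here I assume east-positive.
my ( $row, $col ) = grid_box( 37.6, -122.4 );
print "grid box: row $row, column $col\n";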

Instead of reporting actual temperatures, the CRUTEM3 and HadCRUT3
datasets report temperature anomalies. The anomaly is the measured
average temperature for a grid box minus its 1961 to 1990 average
temperature. Using this anomaly value means that
trends can be clearly seen without having to compare areas with very
different normal temperatures. For example, the north of Scotland and
the south of Spain can be compared just by looking at these deviations
from normal temperatures.

If you’ve seen a chart of global warming you’ve likely seen a graph of
temperature anomalies. Here’s one released by the UK Met Office along
with their historical records:

Although the gridded data in CRUTEM3 and HadCRUT3 had been available
before (just drop in at http://hadobs.org/), the underlying data
released by the UK Met Office was new and I couldn’t resist poking
into it myself. They had promised to release software to analyze it,
but why wait? Using a variety of free (as in speech) and free (as in
beer) software I was able to quickly reproduce the UK Met Office’s
results and produce some visualizations that haven’t been seen before. There was even a surprise lurking in the data that required confirmation from the UK Met Office.

The first step was reading the data. Each meteorological station’s
record is stored in its own text file that contains monthly average
temperatures, plus metadata about the station. For example, here’s
the beginning of file 724940 (which is a measuring station in San
Francisco, CA). The number 724940 is assigned by the World Meteorological Organization and a
quick look in its records says that this station is at San Francisco
International Airport.
Number= 724940
Name= SAN FRANCISCO, CA
Country= UNITED STATES
Lat=   37.6
Long=  122.4
Height= 6
Start year= 1871
End year= 2009
First Good year= 1871
Source ID= 10
Source file= Jones+Anders
Jones data to= 2001
Normals source= Data
Normals source start year= 1961
Normals source end year= 1990
Normals=   9.3  11.2  11.9  13.1  14.5  16.2  17.1  17.6  18.0  16.1   12.6   9.6
Standard deviations source= Data
Standard deviations source start year= 1941
Standard deviations source end year= 1990
Standard deviations=   1.3   1.2   1.1   1.2   1.1   1.3   1.4   1.4   1.4   1.0   1.0   1.4
Obs:
1871 -99.0 -99.0  10.7  11.8  12.4  13.8  13.7  14.4  15.8  16.6  12.7  11.6
1872  11.2  12.3  12.3  11.9  13.3  15.1  14.4  15.4  15.2  14.8  13.4  11.2
1873  12.2   9.9  12.1  12.2  12.7  13.7  14.2  14.9  14.2  14.9  13.9   9.9
1874   9.2  10.3  10.1  12.4  13.8  14.8  14.0  14.6  15.8  14.9  13.3  10.1
1875   9.4  10.5  10.6  12.3  13.3  14.3  14.1  14.1  14.2  15.7  13.7  10.6
1876   9.2  11.3  11.4  12.3  12.9  15.7  14.5  14.6  15.5  14.8  14.0  11.4
1877  12.2  12.8  13.4  12.0  12.6  16.2  15.2  14.7  16.1  14.3  13.4  11.2

There are observations starting in 1871 (although the two -99.0 values
indicate that there are no values available for January and February
of 1871) with one column per month. In the full data file there are
values for every month until October 2009.

The Normals values give the monthly average temperatures for San
Francisco, CA, calculated from the observations between 1961 and 1990.
The latitude and longitude (to one decimal place) of the station are
given.

A small amount of Perl later (which you can download here)
and I had read in all 1,729 files and created a single Perl hash that
contained the metadata and all the observations.
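
In outline, the parsing is straightforward. Here’s a hedged sketch of reading a single station file into a hash, assuming the format shown above; the full script also keeps the remaining header fields and loops over all 1,729 files.

# Read one station file (saved here under its WMO number) into a hash.
sub read_station {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my %station;
    my $in_obs = 0;
    while ( my $line = <$fh> ) {
        chomp $line;
        if ($in_obs) {
            # Observation rows: year followed by twelve monthly values,
            # with -99.0 marking a missing month
            my ( $year, @months ) = split ' ', $line;
            $station{obs}{$year} =
                [ map { $_ == -99.0 ? undef : $_ } @months ];
        }
        elsif ( $line =~ /^Obs:/ ) {
            $in_obs = 1;
        }
        elsif ( $line =~ /^Normals=\s*(.*)/ ) {
            $station{normals} = [ split ' ', $1 ];
        }
        elsif ( $line =~ /^([\w ]+?)=\s*(.*)/ ) {
            $station{ lc $1 } = $2;    # e.g. $station{lat}, $station{name}
        }
    }
    close $fh;
    return \%station;
}

my $sf = read_station('724940');
printf "%s: lat %s, long %s\n", $sf->{name}, $sf->{lat}, $sf->{long};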

Along the way the script checked for inconsistencies in the data (such
as verifying that the Normals values really were the average of the
1961 to 1990 observations) and threw up a small number of errors.
Almost all the differences
were in temperature data for Australasia and I wrote to the UK
Met Office asking for their help in understanding why the Normals
don’t look ‘normal’. More on that later.
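
The check itself is simple arithmetic. A sketch, reusing the station hash from the parsing sketch above; the tolerance is an arbitrary figure chosen for illustration.

# Recompute each monthly normal from the 1961-1990 observations and compare
# it with the Normals line in the file; 0.05 degrees is an arbitrary
# tolerance for this sketch.
sub check_normals {
    my ($station) = @_;
    for my $month ( 0 .. 11 ) {
        my ( $sum, $count ) = ( 0, 0 );
        for my $year ( 1961 .. 1990 ) {
            my $t = $station->{obs}{$year}[$month];
            next unless defined $t;
            $sum += $t;
            $count++;
        }
        next unless $count;
        my $computed = $sum / $count;
        my $stated   = $station->{normals}[$month];
        if ( abs( $computed - $stated ) > 0.05 ) {
            printf "%s: month %d normal is %.1f but observations average %.1f\n",
                $station->{name}, $month + 1, $stated, $computed;
        }
    }
}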

With the data verified the script goes on to reproduce gridded data
for the entire globe using the same technique as the UK Met Office.
The full details of how their datasets are created are documented in
the paper Uncertainty estimates in regional and global observed
temperature changes: a new dataset from 1850. It’s quite technical,
but worth sitting down and reading to understand how the data is
interpreted.

For gridding it’s simply a question of calculating the average
temperature from all the stations in the grid square and subtracting
the ‘normal’ temperature for that square from it to obtain the
anomaly. This is done on a month-by-month basis since January 1850.
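
A simplified sketch of that gridding step for a single month, reusing grid_box() and the station hashes from the earlier sketches (the real procedure, with its error estimates, is what the paper describes):

# For one month, average the station temperatures in each grid box, do the
# same with their normals, and take the difference to get the box anomaly.
sub grid_anomalies {
    my ( $stations, $year, $month ) = @_;    # $month runs 0..11
    my ( %temp_sum, %norm_sum, %count, %anomaly );
    for my $s (@$stations) {
        my $t = $s->{obs}{$year}[$month];
        next unless defined $t;
        my $key = join ',', grid_box( $s->{lat}, $s->{long} );
        $temp_sum{$key} += $t;
        $norm_sum{$key} += $s->{normals}[$month];
        $count{$key}++;
    }
    for my $key ( keys %count ) {
        $anomaly{$key} = ( $temp_sum{$key} - $norm_sum{$key} ) / $count{$key};
    }
    return \%anomaly;    # "row,col" => anomaly for that month
}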

Temperature trends are calculated separately for the northern and
southern hemispheres and then combined to get the global trend. For
each hemisphere the anomalies from its grid boxes are averaged to
obtain a single hemispheric anomaly for that month.

The average is actually a weighted average based on the latitude of
the grid box. At latitudes far from the equator the grid boxes cover
a much smaller area and so temperature values need to be scaled. The
weight is the cosine of the latitude of the central point of the grid
box.
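
Here’s a sketch of that weighted average for one month’s grid-box anomalies, using the core Math::Trig module for the degrees-to-radians conversion:

use Math::Trig qw(deg2rad);

# Average one month's grid-box anomalies for a single hemisphere, weighting
# each box by the cosine of the latitude of its centre (proportional to the
# box's surface area).
sub hemisphere_average {
    my ($anomalies) = @_;    # "row,col" => anomaly, boxes from one hemisphere
    my ( $weighted_sum, $weight_sum ) = ( 0, 0 );
    for my $key ( keys %$anomalies ) {
        my ($row) = split /,/, $key;
        my $centre_lat = 90 - 5 * $row - 2.5;    # centre of the 5-degree band
        my $weight     = cos( deg2rad($centre_lat) );
        $weighted_sum += $weight * $anomalies->{$key};
        $weight_sum   += $weight;
    }
    return $weight_sum ? $weighted_sum / $weight_sum : undef;
}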

My Perl program outputs .dat files compatible with gnuplot. Here are the northern
and southern hemisphere trends produced by the program. You can
clearly see the period of global warming starting in the mid-1970s.
[Graphs: northern and southern hemisphere temperature anomalies]
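
The .dat files themselves are nothing fancy. A sketch of writing one, assuming a hash of monthly hemispheric anomalies keyed by year (the hypothetical %northern below), gives gnuplot a decimal year in column 1 and the anomaly in column 2:

# Write a gnuplot-friendly .dat file that can be plotted with something like
#   plot 'northern.dat' using 1:2 with lines
my %northern;    # year => [12 monthly anomalies]; filled in using the code above
open my $out, '>', 'northern.dat' or die "Cannot write northern.dat: $!";
for my $year ( sort { $a <=> $b } keys %northern ) {
    for my $month ( 0 .. 11 ) {
        my $anomaly = $northern{$year}[$month];
        next unless defined $anomaly;
        printf {$out} "%.3f  %.3f\n", $year + $month / 12, $anomaly;
    }
}
close $out;
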
The global temperature anomaly is just the average of the northern and
southern temperature anomalies. My program outputs another gnuplot
file which can be used to produce this graph:
[Graph: global temperature anomaly]
But to really show off the data I decided to use Google Earth. The UK
Met Office had released a picture (see here)
of the October 2009 global temperature anomaly, so I decided to start
there and try to reproduce it. It wasn’t much work to get the Perl
script dumping out KML for
import to Google Earth. Once in Google Earth the same pattern of a
colder than average North America and warmer than average Europe was
clearly visible.
[Image: October 2009 temperature anomalies shown in Google Earth]
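
Generating the KML is mostly string assembly. A hedged sketch of one grid box as a coloured polygon; the colour mapping here is a made-up placeholder rather than the palette used in the actual file:

# Emit one grid box as a coloured, semi-transparent polygon. The colour
# mapping below is a placeholder (red for warm, blue for cold).
sub colour_for {
    my ($anomaly) = @_;
    return $anomaly >= 0 ? '7f0000ff' : '7fff0000';    # KML colours are aabbggrr
}

sub kml_for_box {
    my ( $row, $col, $anomaly ) = @_;
    my $north  = 90 - 5 * $row;
    my $south  = $north - 5;
    my $west   = 5 * $col - 180;
    my $east   = $west + 5;
    my $colour = colour_for($anomaly);
    # A closed ring of lon,lat,alt triples around the box
    my $ring = join ' ',
        map { join ',', @$_, 0 }
        [ $west, $south ], [ $east, $south ], [ $east, $north ],
        [ $west, $north ], [ $west, $south ];
    return <<"KML";
<Placemark>
  <Style><PolyStyle><color>$colour</color></PolyStyle></Style>
  <Polygon><outerBoundaryIs><LinearRing>
    <coordinates>$ring</coordinates>
  </LinearRing></outerBoundaryIs></Polygon>
</Placemark>
KML
}
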
But I wasn’t satisfied with just October 2009 and decided to use
Google Earth’s animation capability to show the evolution of October
temperatures since 1850. Any element of a Google Earth KML file can
have a time-stamp or time span associated with it and Google Earth
will automatically create an animation.
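
A sketch of the time element, assuming a single KML file containing the October snapshot for every year; each Placemark gets a TimeSpan covering that October:

# Each Placemark needs a TimeSpan (or TimeStamp) child for Google Earth to
# build an animation and show the time slider; this fragment covers the
# October of a given year and gets spliced into kml_for_box() above.
sub timespan_for_october {
    my ($year) = @_;
    return <<"KML";
  <TimeSpan>
    <begin>$year-10-01</begin>
    <end>$year-10-31</end>
  </TimeSpan>
KML
}
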
The best way to experience this is to either grab my code and create
the 28MB animation file yourself, or simply watch this YouTube
version:

You can grab the code and files from my blog at
http://www.jgc.org/blog/. If you do something interesting with it, or find problems, please let me know.

Epilogue

It turns out that the errors my program was reporting weren’t caused by bugs of my own (as I first suspected) but by genuine errors in the data. On December 18 I received a reply from the UK Met Office to my email confirming that the errors are real:

First off, thank you for bringing this to our attention. We have undertaken further investigation upon the full dataset and confirmed this. The error affects <1% of the network and is primarily in Oceania. It arises because normals were calculated outside of the update cycle and the normals for these stations were not updated when extra data were added in the normals period as CRUTEM3 was being finalised for publication.

We intend to add this information to our online Q and A for the data and we would like to credit you with pointing out the error. Would you be happy to be mentioned in this way?

Amazing what you can do with perseverance and a bit of Perl.