Decoding Climate Change with Perl, gnuplot and Google Earth

Back in August The New York Times reported that the word ‘statistics’ had replaced the word ‘plastics’ in the famous career guidance given in the film The Graduate. And more recently the same paper reported that data and its analysis are the future of science.

And it’s not just in business and ivory towers that statistical
analysis of masses of data is becoming important: just understanding
the wealth of percentages, risk factors and charts that confront us
all requires a form of ‘data literacy’. As a recent blog post by John
Udell points out, even understanding your electricity bill requires everyday analytical thinking.

For those of us who can code (whatever the language) raw data presents
a rich target for analysis, and some plain fun crunching numbers to
understand them better. A case in point is the recent release of
climate change data which is accessible to anyone with some basic
mathematical and computing skills.

If you don’t have the coding skills you can still follow along and
analyze the output of the program using basic statistical skills like
calculating the mean and standard deviation of a set of data.
Although climate science itself is very complex, much is within the
grasp of someone with basic statistics knowledge.

On December 8, 2009 the UK Met
released historical
land surface temperature records
spanning 150 years of
measurements from around the globe. The data was a subset of
measurements used to create the CRUTEM3 and

HadCRUT3 is set of data showing the combined land and marine
temperature anomalies on a 5° by 5° grid-box basis (CRUTEM3 just has
the land temperatures). The globe is divided into grid boxes of 5° of
longitude by 5° of latitude and an average temperature for that box is
calculated from all the temperatures measured at meteorological
stations falling within that square of the Earth’s surface.

Instead of reporting actual temperatures, the CRUTEM3 and HadCRUT3
datasets report temperature anomalies. The anomaly is the difference
between the 1961 to 1990 average temperature for that grid box and the
measured average temperature. Using this anomaly value means that
trends can be clearly seen without having to compare areas with very
different normal temperatures. For example, the north of Scotland and
the south of Spain can be compared just by looking at these deviations
from normal temperatures.

If you’ve seen a chart of global warming you’ve likely seen a graph of
temperature anomalies. Here’s one released by the UK Met Office along
with their historical records:

Although the gridded data in CRUTEM3 and HadCRUT3 had been available
before (just drop in at, the underlying data
released by the UK Met Office was new and I couldn’t resist poking
into it myself. They had promised to release software to analyze it,
but why wait? Using a variety of free (as in speech) and free (as in
beer) software I was able to quickly reproduce the UK Met Office’s
results and produce some visualizations that haven’t been seen before. There was even a surprise lurking in the data that required confirmation from the UK Met Office.

The first step was reading the data. Each meteorological station is
stored in a single text file that contains monthly average
temperatures, plus metadata about the station. For example, here’s
the beginning of file 724940 (which is a measuring station in San
Francisco, CA). The number 724940 is assigned by the World Meteorological Organization and a
quick look in its records says that this station is at San Francisco
International Airport.

There are observations starting in 1871 (although the two -99.0 values
indicate that there are no values available for January and February
of 1871) with one column per month. In the full data file there are
values for every month until October 2009.

The Normals values give the monthly average temperature for San
Francisco, CA based on calculating the averages between 1961 and 1990.
The latitude and longitude (to one decimal place) of the station are

A small amount of Perl later (which you can download here)
and I had read in all 1,729 files and created a single Perl hash that
contained the metadata and all the observations.

Along the way the script checked for inconsistencies in the data (such
as verifying that the Normals values really were the average of the
1961 to 1990 observations) and threw up a small number of errors.
Almost all the differences
were in temperature data for Australasia and I wrote to the UK
Met Office asking for their help in understanding why the Normals
don’t look ‘normal’. More on that later.

With the data verified the script goes on to reproduce gridded data
for the entire globe using the same technique as the UK Met Office.
The full details of how their datasets are created is documented in
the paper Uncertainty
estimates in regional and global observed
temperature changes: a new dataset from 1850
. It’s quite
technical, but worth sitting down and reading to understand how the
data is interpreted.

For gridding it’s simply a question of calculating the average
temperature from all the stations in the grid square and subtracting
that from the ‘normal’; temperature for that square to obtain the
anomaly. This is done on a month-by-month basis since January 1850.

Temperature trends are calculated separately for the northern and
southern hemispheres and then combined to get the global trend. For
the hemispheres the average temperature is calculated by averaging the
temperatures given by each grid box to obtain an average hemisphere
temperature for that month.

The average is actually a weighted average based on the latitude of
the grid box. At latitudes far from the equator the grid boxes cover
a much smaller area and so temperature values need to be scaled. The
weight is the cosine of the latitude of the central point of the grid

My Perl program outputs .dat files compatible with gnuplot. Here are the northern
and southern hemisphere trends produced by the program. You can
clearly see the period of global warming starting in the mid-1970s.
The global temperature anomaly is just the average of the northern and
southern temperature anomalies. My program outputs another gnuplot
file which can be used to produce this graph:
But to really show off the data I decided to use Google Earth. The UK
Met Office had released a picture (see here)
of the October 2009 global temperature anomaly, so I decided to start
there and try to reproduce it. It wasn’t much work to get the Perl
script dumping out KML for
import to Google Earth. Once in Google Earth the same pattern of a
colder than average North America and warmer than average Europe was
clearly visible.
But I wasn’t satisfied with just October 2009 and decided to use
Google Earth’s animation capability to show the evolution of October
temperatures since 1850. Any element of a Google Earth KML file can
have a time-stamp or time span associated with it and Google Earth
will automatically create an animation.
The best way to experience this is to either grab my code and create
the 28Mb animation file yourself, or simply watch this YouTube

You can grab the code and files from my blog at If you do something interesting with it, or find problems, please let me know.


It turns out that the errors my program was reporting weren’t caused by my own bugs (as I first suspected) but because of genuine errors in the data. On December 18 I received a reply from the UK Met Office to my email regarding the errors confirming that they are genuine:

First off, thank you for bringing this to our attention. We have undertaken further investigation upon the full dataset and confirmed this. The error affects <1% of the network and is primarily in Oceania. It arises because normals were calculated outside of the update cycle and the normals for these stations were not updated when extra data were added in the normals period as CRUTEM3 was being finalised for publication.

We intend to add this information to our online Q and A for the data and we would like to credit you with pointing out the error. Would you be happy to be mentioned in this way?

Amazing what you can do with perseverance and a bit of Perl.

Get the O’Reilly Data Newsletter

Stay informed. Receive weekly insight from industry insiders.

Get the O’Reilly Web Ops and Performance Newsletter

Weekly insight from industry insiders. Plus exclusive content and offers.

Get the O’Reilly Programming Newsletter

Weekly insight from industry insiders. Plus exclusive content and offers.

Get the O’Reilly Hardware Newsletter

Get weekly insight and knowledge on how to design, prototype, manufacture, and market great connected devices.

Get Four Short Links in Your Inbox

Sign up to receive Nat’s eclectic collection of curated links every weekday.

Get the O’Reilly Design Newsletter

Stay informed. Receive weekly insight from industry insiders.

Get the O’Reilly Web Platform Newsletter

Stay informed. Receive weekly insight from industry insiders—plus exclusive content and offers.

  • Josh Knauer

    It’s clear to me that as the amount of data we have access to increases (as a result of the open data movement), we will need better tools for telling these types of data stories. If the alpha geeks are doing it using scripts and kludging together existing tools, it’s only a matter of time until we start seeing more polished tools created for accomplishing the same sort of tasks.

    The emergence of data storytelling is a trend to watch, as it is a realization of the goal of freeing up all of this data in the first place.

    Josh Knauer, CEO, Rhiza Labs

  • Bruno


    Great analysis. Can you post the final KML/KMZ files? I tried looking in your blog but without much luck!


  • Jim H

    Isn’t using the mean to represent “normal” a bit misleading? Variability is normal and is better represented by something which considers the normal distribution/range of the data instead of an “average.”

  • Frank Ch. Eigler

    “HadCRUT3 is set of data showing the combined land and marine temperature anomalies on a 5° by 5° grid-box basis”

    It is important to consider that HADCRUT3 only purports to show this stuff. The various manipulation algorithms that have been applied to the real raw data have been controversial and/or simply lost. Hadley’s web site specifically says:

    “The Met Office do not hold information as to adjustments that were applied and so cannot advise as to which stations are underlying data only and which contain adjustments.”

  • Stephen S

    It would be nice to see a longer time line. 150 years of partial data, in the scope of humans 4,000 –> 200,000 years on the planet seems a bit ‘sparse’. Having said that, I like the approach and the work done to date.

  • Simon Cast

    Interesting reproduction of the graphs. From my analysis background removing anomalies using STD is probably false. It makes the graph more pretty but the cause of the anomalies are unknown and therefore should be left in.

    Going back to the graphs you produced, the structure is quite interesting and worth some comments. Firstly, the below normal in the early part of the graph (1850s to 1900s) shows what seems to be natural event over-compensating for the additional CO2. Then it rises up to what looks like a peak in the middle to late Depression era before beginning to fall and then stabilise until the early 1960s before beginning to rise again.

    There is a smaller dip in the early 1980s before the more general upward trend. Interestingly, this points to the fact that CO2 may have a half-life of about 10 to 15 years in the atmosphere. As the Depression reduce economic activity, this lower CO2 released showed 10 years later as a lower atmospheric CO2. The hypothesis is generally supported, with late 1940s and 1950s recovery showing up about 10 years later. The 1980s dip being the product of the oil shock and the late 1990s flatline the product of the late 1980s productivity boost (energy efficiency) from computerisation kicking in.

    This hypothesis does assume that temperatures are reflective of carbon concentrations. At 10 to 15 year cycles this rise and falls could be the result of solar cycle as well.

    Assuming this hypothesis is valid, 2 interesting conclusions arise. 1 is that the rise in temperature correlates with the general industrialisation of the globe from the late 1950s onward and not the earlier European & North American industrialisation of the 1850s to 1920s with the obvious bump showing in the very late 1990s corresponding to rapid industrialisation of Asia. 2 as there seems to be a 10 year lag, efforts begun now will start to have effect in 2020 with continual rise in CO2 until various saving measures have time to flow through system and be reflected in the data but as there has been a general move to efficiency in the developed world (small albeit) since about 2001 this should start showing in the data in 2011 (along with the effect of the 2000 to 2003) economic slow down.

  • Don

    150 years is about as long as we have been keeping data ? If you believe in going back 1million years could temperatures reverse the other way ?

    Sometimes climb over 150 years and sometimes lower over 150 years . Rather sometimes climb over 1000 years or sometimes lower over 1000 years ? That is if you think going back 1 million years of climate change ?


  • Mike Gale

    This looks like excellent work.

    I think it’s a great idea to run your own analysis of the data. I just wish I’d got as far as you have!

    I’ve had a preliminary look at the underlying data and been disturbed by what I’ve found.

    1. The US ground stations have been giving inaccurate data for several decades. The officials didn’t do anything about quantifying the error. Some citizens did. See for some details. Conclusion in a nutshell, most ground based temperature measurements in the US are unreliable. Rest of world I’m unsure but I’d be suspicious.

    2. Some data has been systematically altered so there seems to be anthropogenic data which shows warming while the unaltered data shows cooling. See I’ve analysed some of the original data here myself and can confirm that some of the NZ data which went into CRU HadCru… is, in my view, suspect.

    Maybe you might like to check out some of this data yourself. The NZ data is at the CliFlo site of NIWA. It’s free to sign up and get fairly large data sets. (There’s a bit of a story about the Auckland data. If faced with the same issues I would simply have omitted the Auckland, Albert Park, data altogether and looked for something else.)

  • Phil

    In response to Mike Gale, it’s worth looking at these as a counterpoint to the propaganda from the denialist group “The New Zealand Climate Change Coalition”:


  • Richard Holle

    To see what can be done with 22,800 DAILY data records and grafting at 0.1 degree resolution, plotted for three cycles to forecast a fourth cycle (current weather) take a look at;

    NO data was excluded, all stations were used as often as they intermittently appeared in the data base TD3200 Cooperative Summary of the day.

    Ends up being better than the NWS or NOAA forecasts, and beats the crap out of the models for long range use. Maps posted to web site for next 4 years.

  • Julien

    I’m glad to see more eyeballs on this data :-)
    This is only land measurements, right? Was it corrected for the installation problems (

  • ligne

    Julien: contains a lot of abstract talk about how their photos prove the stations are mis-sited. but they don’t seem to have tried to quantify how these affect the measurements taken at these stations, or tried to analyse how this affects the overall trend across all stations.

    NOAA on the other hand *have* run the numbers: (a less technical summary is here: ).

    turns out there’s no meaningful difference between the best stations and the whole lot. not that you’d know it from the site, of course.

  • ligne

    oh dear, it stripped the links. because everything inside pointy-brackets is HTML, right?

    NOAA on the other hand *have* run the numbers: . (a less technical summary is here: ).

  • Mike Gale

    Thanks for those links Phil. (It might be useful to have your full name!)

    The science I’ve worked with takes recorded data as accurate. If it’s not then you fix it at source.

    A lot of the weather data I’ve seen recently takes a different approach.

    That rings my alarm bells. At the least I expect to see a plot of original data compared to “adjusted data”.

    I’m suspicious of anthropogenic data.

    I’d love to see a plot of the HadCRU data before the adjustments, alongside the adjusted data.

    I’d also like to see less written from a blatantly political viewpoint. Calling people supporters and deniers is, I hope, plain wrong. I hope that there are a lot of scientists in the mix. People who are after fact based reasoning instead of the fact free reasoning that emerges when political thought is involved. Life on this planet might be at stake. We’re making a mess of the place and need to think clearly about fixing it. (Including acknowledging areas where we don’t have a clue.)

    Wouldn’t it be great if these blogs emailed contributors automatically (should they so desire)!

  • ligne


    Fixing the data sources would be nice, but unfortunately just isn’t possible when dealing with historical data. For instance, if a station is moved, or its equipment (thermometer, stephenson screen, etc) is modified, there’s no way to go back in time to update previous measurements. it’s still important to correct for these, though.

    The procedures used to correct the data are well documented. Unfortunately i’ve not got links for the CRU papers to hand, and their website is currently a mess while they rebuild after the hack. But for comparison, GISTEMP’s (which independently processes more or less the same data and gets essentially the same results) are detailed here: (the papers are all freely available — just follow the links!).

    I can’t remember seeing a direct raw-vs-adjusted graph, but this piece looks at the trends introduced by these adjustements, and finds that almost as many being adjusted downwards as upwards:

    It’s also worth noting that the warming trend from the surface-based measurements of GISS and CRU correlates very well with the satellite-based warming reported by RSS and UAH.

    If you want a play with the data to see for yourself, a thorough collection of links to datasets is here:
    I can particularly recommend woodfortrees (linked from the “visualisation” section) as a nice way of comparing the different temperature data-sets.

    Finally, i agree that immediately calling someone a “denier” is unhelpful. But i’ve yet to come across a better term for those who knowingly mis-represent and distort the research, make baseless accusations of fraud towards scientists who report data they dislike, or trot out long-debunked arguments over and over again, as NZCSC clearly have.

    full ack about the notification emails though. that would be handy :-)