Google Correlate: Your data, Google's computing power

Upload state- or time-based data and Google Correlate reveals search trends.

Google Correlate is awesome. As I noted in Search Notes last week, Google Correlate is a new tool in Google Labs that lets you upload state- or time-based data to see what search trends most correlate with that information.

Correlation doesn’t necessarily imply causation, and as you use Google Correlate, you’ll find that the relationship (if any) between terms varies widely based on the topic, time, and space.

For instance, there’s a strong state-based correlation between searches for me and searches for Vulcan Capital. But the two searches have nothing to do with each other. As you see below, the correlation is that the two searches have similar state-based interest.

Picture 476.png

For both searches, the most volume is in Washington state (where we’re both located). And both show high activity in New York.

State-based data

For a recent talk I gave in Germany, I downloaded state-by-state income data from the U.S. Census Bureau and ran it through Google Correlate. I found that high income was highly correlated with searches for [lohan breasts] and low income was highly correlated with searches for [police shootouts]. I leave the interpretation up to you.

Picture 443.png

Picture 445.png

By default, the closest correlations are with the highest numbers, so to get correlations with low income, I multiplied all of the numbers by negative one.
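The negation trick works because multiplying one series by negative one flips the sign of its correlation with every other series without changing the strength. Here is a minimal Python sketch of that, assuming Pearson correlation is the measure (the post does not say exactly what Google Correlate computes). The `income` and `query_share` numbers are made up for illustration and are not Census or Google data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values for five states (illustrative only, not real data).
income = [42.0, 55.0, 61.0, 48.0, 70.0]
query_share = [0.9, 0.6, 0.5, 0.8, 0.2]  # search interest is higher where income is lower

# Multiply every income figure by negative one, as described above.
negated_income = [-value for value in income]

r_original = pearson(income, query_share)        # negative: query tracks low income
r_negated = pearson(negated_income, query_share)  # same magnitude, opposite sign
```

A query that sat at the bottom of the list for the original upload moves to the top for the negated one, which is why uploading the negated numbers surfaces the low-income correlations.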

Clay Johnson looked at correlations based on state obesity rates from the CDC. By looking at negative correlations (in other words, what search queries are most closely correlated with states with the lowest obesity rates), we see that the most closely related search is [yoga mat bags]. (Another highly correlated term is [nutrition school].)

Picture 478.png

Maybe there’s something to that “working out helps you lose weight” idea I’ve heard people mention. Then again, another highly correlated term is [itunes movie rentals], so maybe I should try the “sitting on my couch, watching movies work out plan” just to explore all of my options.

To look at this data more seriously, we can see with search data alone that the wealthy seem to be healthier (at least based on obesity data) than the poor. In states with low obesity rates, searches are for optional material goods, such as Bose headphones, digital cameras, and red wine, and for travel to places like Africa, Jordan, and China. In states with high obesity rates, searches are for jobs and free items.

With this hypothesis, we can look at other data (access to nutritious food, time and space to exercise, health education) to determine further links.

Time-based data

Time-based data works in a similar way. Google Correlate looks for matching patterns in trends over time. Again, that the trends are similar doesn’t mean they’re related. But this data can be an interesting starting point for additional investigation.
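One reason similar trends need not be related: any two series that both drift in the same direction over the period will correlate strongly, whatever is driving them. The Python sketch below illustrates this with two synthetic series, both invented for this example. It also compares their week-to-week changes (first differences), which is a common sanity check and not something Google Correlate is described as doing.

```python
from math import cos, sin, sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def first_differences(series):
    """Change from each point to the next; removes a steady upward or downward drift."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

# Two synthetic weekly series (hypothetical). Each has a steady upward
# drift plus its own wiggle; the wiggles have nothing to do with each other.
weeks = range(60)
series_a = [t + sin(t) for t in weeks]
series_b = [2 * t + 3 * cos(1.7 * t) for t in weeks]

# Correlation of the raw levels is dominated by the shared drift.
r_levels = pearson(series_a, series_b)

# Correlation of the week-to-week changes ignores the drift.
r_changes = pearson(first_differences(series_a), first_differences(series_b))
```

Here `r_levels` comes out close to 1 while `r_changes` is much closer to 0, so the apparent match between the two series is almost entirely the shared drift.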

One of the economic indicators from the U.S. Census Bureau is housing inventory. I looked at the number of months’ supply of homes at the current sales rate between 2003 and today. I have no idea how to interpret data like this (the general idea is that you, as an expert in some field, would upload data that you understand). But my non-expert conclusion here is that as housing inventory increases (which implies no one’s buying), we are looking to spiff up our existing homes with cheap stuff, so we turn to Craigslist.

Picture 481.png

Picture 482.png

Picture 483.png

Of course, it could also be the case that the height of popularity of Craigslist just happened to coincide with the months when the most homes were on the market, and both are coincidentally declining at the same rate.

Search-based data

You can also simply enter a search term, and Google will analyze the state or time-based patterns of that term and chart other queries that most closely match those patterns. Google describes this as a kind of Google Trends in reverse.

Google Insights for Search already shows you state distribution and volume trends for terms, and Correlate takes this one step further by listing all of the other terms with a similar regional distribution or volume trend.
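Conceptually, this amounts to scoring every other query's series against the target and listing the best matches. The Python sketch below shows the idea on a toy scale. It is an illustration under assumptions, not Google's implementation: the `rank_by_correlation` helper, the query names, and all of the numbers are hypothetical, and Pearson correlation is assumed as the similarity measure.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither series is constant (a constant series has zero variance).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def rank_by_correlation(target, candidates, top_n=3):
    """Return up to top_n (name, r) pairs, highest correlation first."""
    scored = [(name, pearson(target, series)) for name, series in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical interest levels across five regions (or five weeks).
target = [1.0, 2.0, 4.0, 3.0, 5.0]
candidates = {
    "query a": [2.0, 4.0, 8.0, 6.0, 10.0],  # same shape as the target, just scaled
    "query b": [5.0, 4.0, 2.0, 3.0, 1.0],   # mirror image of the target
    "query c": [3.0, 1.0, 4.0, 1.0, 5.0],   # loosely similar shape
}

ranked = rank_by_correlation(target, candidates)
```

Because correlation ignores scale, "query a" ranks first even though its raw numbers are twice the target's, and the mirrored "query b" lands at the bottom.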

For instance, regional distribution for [vegan restaurants] searches is strongly correlated to the regional distribution for searches for [mac store locations].

Picture 484.png

What does the time-trend of search volume for [vegan restaurants] correlate with? Flights from LAX.

Picture 485.png

Time-based data related to a search term can be a fascinating look at how trends spark interest in particular topics. For instance, as the Atkins Diet lost popularity, so too did interest in the carbohydrate content of food.

Picture 486.png

Interest in maple syrup seems to follow interest in the cleanse diet (of which maple syrup is a key component).

Picture 488.png

Drawing-based data

Don’t have any interesting data to upload? Aren’t sure what topic you’re most interested in? Then just draw a graph!

Maybe you want to know what had no search volume at all in 2004, spiked in 2005, and then disappeared again. Easy. Just draw it on a graph.

Picture 489.png

Apparently the popular movies of the time were “Phantom of the Opera,” “Darkness,” and “Meet the Fockers.” And we all were worried about our Celebrex prescriptions.

Picture 490.png

Picture 491.png

(Note: the accuracy of this data likely depends on the quality of your drawing skills.)


  • http://oswco.com Tom Brander

    Nice write up.
    Most curious: where did you get months of housing supply by state? I do a bunch of Re analysis and am currently looking for that data.
    Thanks

  • Alex Tolley

    “Of course, it could also be the case that the height of popularity of Craigslist just happened to coincide with the months when the most homes were on the market, and both are coincidentally declining at the same rate.”

    Or perhaps the inventory increase due to foreclosures is resulting in people selling their stuff on Craigslist?

    This illustrates in microcosm the social problem of “big data”. Spurious correlations lead to silly ideas about connections. It was bad enough that we had incessant spurious epidemiological correlations between some factor and health, usually resulting in some media scare story. Now anyone can do the same thing and publish the results on a blog. This is just ramping up the infonoise pollution of the memesphere.

  • http://www.ninebyblue.com Vanessa Fox

    Alex, indeed, that’s always the case with data analysis. Half is the math and the other half is understanding how to interpret the data. Google Correlate only helps with the math part.

  • Alex Tolley

    @Vanessa

    I agree that the math has always been ahead of understanding. That is why we end up with stories about correlation assuming causation. What has been much less understood, even by scientists, is that when extensive data mining is done, significance bounds need to be tightened.

    Google Correlate promises to hugely expand the data mining model. The suggested correlated series may simply be the result of randomness and thus no connection can be extracted. That will not stop the human need to find patterns in data and communicate those “patterns”.

    When idiots can correlate, idiots will correlate. Worse, idiots will replicate false correlations. We will drown in false information noise. It will require a huge amount of effort to contain this pollution.

  • Titilayomi F

    Hey Vanessa,
    That’s a piece, I must say!

    Can you please tell me how an organization can take advantage of this? Do you think there is a marketing opportunity therein? What I am thinking is how a company can match the results of finding perhaps a flagship product to its name. Do you think that is possible with this?

    Thanks.