Google's New Correlation Mining Tool: It Works!

You may have heard of Google Trends. It’s a cool tool which will show you the ups-and-downs of the public’s interest in a particular topic—at least as revealed in how often we search for it. And you may have even heard of the first really important use of this tool: Google Flu Trends, which uses search data to try to predict flu activity. Now Google has released an amazing way to reverse engineer the process: Google Correlate. Just feed in your favorite weekly time series (or cross-state comparisons), and it will tell you which search terms are most closely correlated with your data.

So I tried it out.  And it works! Amazingly well.

I fed in the weekly numbers on initial unemployment claims—one of the most important weekly economic time series we have.  The search term that is most closely correlated? Crikey, it’s “filing for unemployment.”  Indeed, the correlation is an astounding 0.91.

 

Given the latest Google Trends on “filing for unemployment,” I’ll forecast that initial unemployment claims will tick down in the next couple of weeks.

With an eye to earning a quick trading fortune, I also uploaded data on weekly returns on the S&P 500. But Google failed to find anything significantly correlated. Score one for the random walk hypothesis.

Interested in more?  Here’s Google Correlate; here’s a comic introduction; and here’s their white paper. Also, here’s Hyunyoung Choi and Hal Varian’s research on “Predicting the Present with Google Trends,” which shows that retail, auto, home and travel sales can also be nowcast with search data (blog summary here); they’ve also previously shown the value of Google Trends in predicting initial unemployment claims.  And here’s Albert Saiz and Uri Simonsohn on “Downloading Wisdom from Online Crowds.”

(Hat tip: Bo Cowgill)


Jareau

Very cool!

scott cunningham

I wish it could handle a panel. I think it's either cross section or time series though.

Matt

It works! I knew Kate Middleton was behind it...

http://correlate.googlelabs.com/search?e=royal%20wedding&e=playstation%20network%20hack&t=weekly#default,20

DC

"facebook" and "tapeworm in humans" are 0.8721 correlated.

Shane

I'm a huge fan of the Trends services, and Correlate looks like lots of fun too. Though I'd like to see them show results for the number of actual searches instead of just the proportion of all searches. As it is, it's hard to spot longer-term trends.

For example the search term "terrorism" seems to have declined strongly since 2004:
http://www.google.com/insights/search/#q=terrorism&cmpt=q

However we don't know if this is because fewer people are searching for terrorism, or if more people are searching for other terms. The arrival of new demographic groups (older people, say, or poorer people) to the internet could distort relative results.
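To make the ambiguity concrete, here is a minimal sketch in Python. The counts are invented purely for illustration (they are not real Google data), and the names `term_searches`, `total_searches`, `share` and `relative_change` are hypothetical. It shows how a term's share of all searches can fall sharply even when the number of people searching for it has not changed at all.

```python
# Hypothetical counts, invented purely for illustration (not real Google data).
term_searches = {2004: 1_000_000, 2011: 1_000_000}       # searches for one term
total_searches = {2004: 100_000_000, 2011: 400_000_000}  # all searches that year


def share(year):
    """Fraction of all searches in `year` that were for the term."""
    return term_searches[year] / total_searches[year]


def relative_change(old, new):
    """Change from `old` to `new`, expressed as a fraction of `old`."""
    return (new - old) / old


absolute_change = relative_change(term_searches[2004], term_searches[2011])
share_change = relative_change(share(2004), share(2011))

print(f"absolute searches changed by {absolute_change:+.0%}")
print(f"share of all searches changed by {share_change:+.0%}")
```

In this made-up case the absolute number of searches is flat, yet the share drops by three quarters simply because total search volume quadrupled. A chart of shares alone cannot distinguish that from a genuine loss of interest.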

Anyway, Trends is still fascinating. I used it twice to predict the results of elections on which I placed small bets - successfully! But it needs to be used very carefully. Fox News ran a story last year arguing that Pakistani people are unusually likely to search for pornographic terms. Trends really does not give us enough information to make that claim. I've explored the subject here, should anyone be interested:
http://shaneleavy.blogspot.com/2010/08/just-how-kinky-is-pakistan.html


Matt

There would be similar problems with using absolute numbers as well, though. The first one that springs to mind is what happens if Google's market share changes drastically.

Shane

Absolutely Matt, that makes sense. I presume, though, that between both kinds of information the user would be better informed than simply using the one.

Robbie

Have you checked out what is correlated with Superfreakonomics? What do Windows 7 Clean Install and Jeff Dunham show have to do with Superfreakonomics?

http://correlate.googlelabs.com/search?e=Superfreakonomics&t=weekly#

Kevin

Very nice. Can this be used to handicap the Republican presidential primary candidates in Vegas?

Jason

As far as predicting the stock market, this paper might be interesting to some:

http://arxiv.org/PS_cache/arxiv/pdf/1010/1010.3003v1.pdf

They adapted another tool that Google created, but doesn't make available to the public, which is called the GPOMS: the Google Profile of Mood States. Here is the last sentence of the abstract:

"We find an accuracy of 87.6% in predicting the daily up and down changes in the closing values of the DJIA and a reduction of the Mean Average Percentage Error by more than 6%."

The reduction of the error refers to the inclusion of the GPOMS after some standard variables in a forecast.

The most important predictor turned out to be a calm mood, as measured by the GPOMS.

Now if only GPOMS were publicly available, one could do some interesting tests with Google Trends.

The paper relies on regression, but also on something called a self-organizing fuzzy neural network. It takes some reading to get a handle on, but my understanding is that a SOFNN is a merging of artificial intelligence and fuzzy logic. Fuzzy logic is derived from the fuzzy set, which is a set whose elements can be partly inside the set. I guess it's math's concept of "sort of." Apparently, the SOFNN is helpful when inputs are linguistic.


James

As a converse point, and a great example of "correlation does not equal causation," there's a .83 correlation between electricity prices in Ireland and searches for 'stanford webmail', and a .82 correlation between the prices and 'copper theft' (which might have some weak link due to the commodities boom and bust of 2008-9).

More generally, it seems those correlations are calculated from absolute levels, not from the returns (daily % change), and correlating levels is known to give spuriously high correlations.
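The levels-versus-returns point can be illustrated with a minimal sketch in Python. Everything here is an assumption made for illustration: the two series are invented (an upward trend plus a small, unrelated wiggle each), and `pearson` and `pct_changes` are hypothetical helper names. It is not a description of how Google Correlate itself computes anything.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def pct_changes(series):
    """Period-over-period percent changes ("returns") of a series."""
    return [(b - a) / a for a, b in zip(series, series[1:])]


# Two invented series: each is an upward trend plus its own short-term wiggle.
# The wiggle patterns are chosen so that they are unrelated to each other.
WIGGLE_A = [1, -1, -1, 1]
WIGGLE_B = [1, 1, -1, -1]
series_a = [100 + 2 * t + 3 * WIGGLE_A[t % 4] for t in range(52)]
series_b = [50 + t + 2 * WIGGLE_B[t % 4] for t in range(52)]

level_corr = pearson(series_a, series_b)
change_corr = pearson(pct_changes(series_a), pct_changes(series_b))

print(f"correlation of levels:          {level_corr:.2f}")
print(f"correlation of percent changes: {change_corr:.2f}")
```

The shared upward trend dominates the level correlation, which comes out high, while the correlation of percent changes stays close to zero because the short-term movements have nothing to do with each other. Any two series that merely trend in the same direction will behave this way.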

Anoush

We are in a hugely fascinating moment with regard to real-time data indeed! I work at UN Global Pulse, an innovations initiative in the UN, which is looking precisely at this potential: are there signals in new data which can serve as early indicators of stress in a society/community?

http://www.unglobalpulse.org/blog/digital-smoke-signals

In a world of increasing global crises and shocks, we need better real-time data to understand when populations are vulnerable - and be able to respond with more agility. Information seeking/online search behavior is a very important type of new data that we are exploring at Global Pulse. We are delving into research projects/experiments to explore what indicators could be most telling. Our approach is collaborative and we welcome any comments/insights (Twitter: @UNGlobalPulse) as we ideate!

Renato P. dos Santos

I thought you would like to know that I cited this blog post in the Web Search Database I am building from Google Correlate: http://www.searchcorrelations.com/initial-unemployment-claims.html

You may find it interesting how the most highly correlated search term has changed since then, and how your forecast "that initial unemployment claims will tick down" in the next couple of weeks after the end of May 2011 seems confirmed by an updated graph.