Augmenting Unstructured Data

OSCON 2013 Speaker Series

Our world is filled with unstructured data. By some estimates, it’s as high as 80% of all data.

Unstructured data is data that isn’t in a specific format. It isn’t separated by a delimiter that you could split on and get all of the individual pieces of data. Most often, this data comes directly from humans. Human generated data isn’t the best kind to place in a relational database and run queries on. You’d need to run some algorithms on unstructured data to gain any insight.

Play-by-Play

Advanced NFL Stats was kind enough to open up their play-by-play data for NFL games from 2002 to 2012. This dataset consisted of both structured and unstructured data. Here is a sample line showing a single play in the dataset:

20121119_CHI@SF,3,17,48,SF,CHI,3,2,76,20,0,(2:48) C.Kaepernick pass short right to M.Crabtree to SF 25 for 1 yard (C.Tillman). Caught at SF 25. 0-yds YAC,0,3,0,27,7 ,2012

This line is comma separated and has various queryable elements. For example, we could query on the teams playing and the scores. We couldn’t query on the unstructured portion, specifically:

(2:48) C.Kaepernick pass short right to M.Crabtree to SF 25 for 1 yard (C.Tillman). Caught at SF 25. 0-yds YAC

There are many more insights that can be gleaned from this portion. For example, we can see that San Francisco’s quarterback, Colin Kaepernick, passed the ball to the wide receiver, Michael Crabtree. He was tackled by Charles Tillman. Michael Crabtree caught the ball at the 25 yard line, gained 1 yard. After catching the ball, he didn’t make any forward progress or yards after catch (“0-yds YAC“).

Writing a synopsis like that is easy for me as a human. I’ve spent my life working with language and trying to comprehend the meaning. It’s not as easy for a computer. We have to use various methods to extract the information from unstructured data like this. These can vary dramatically in complexity; some use simple string lookups or contains and others use natural language processing.

Parsing would be easy if all of the plays looked like the one above, but they don’t. Each play has a little bit different formatting and varies in the amount of data. The general format is different for each type of play: run, pass, punt, etc. Each type of play needs to be treated a little differently. Also, each person writing the play description will be a little different from the others.

Augmenting Data

As I showed in the synopsis of the play, one can get some interesting insight out of the structured and unstructured data. However, we’re still limited in what we can do and the kinds of queries we can write.

I had a conversation with a Canadian about American Football and the effects of the weather. I was talking about how the NFL now favors domed locations to take weather out as a variable for the Superbowl or NFL championship. With the play-by-play dataset, I couldn’t write a query to tell me the effects of weather one way or another. The data simply doesn’t exist in the dataset.

The physical location and date of the game is captured in the play-by-play. We can see that in the first portion “20121119_CHI@SF“. Here, Chicago is playing at San Francisco on 2012-11-19. There are other datasets that could enable us to augment our play-by-play. These datasets would give us the physical location of the stadium where the game was played (the team’s name doesn’t mean that their stadium is in that city). The dataset for stadiums is a relatively small one. This dataset also shows if the stadium is domed and the ambient temperature is a non-issue.

Based on the stadium location, we need to figure out the weather there. Using the NCDC site, I was able to find the weather station closest to the stadium where the games are played. The NCDC dataset has the weather station and weather for each day. These weather observations include: minimum temperature, maximum temperature, and precipitation.

By adding these two datasets in the mix, we could get the day’s weather. This could allow us to write queries that account for the ambient weather in a game.

Conclusion

A single dataset by itself can give us some interesting insights. It’s when we start combining datasets and processing our unstructured data that we get even more value. We can start answering questions that weren’t possible before.

What does this marriage of play-by-play data, stadium data, and weather produce? You’ll have to come to my OSCON 2013 talk to find out. For now, you can check out some of the insights I gained by just looking at the play-by-play data.

See you then!

NOTE: If you are interested in attending OSCON to check out Jesse’s talk or the many other cool sessions, click over to the OSCON website where you can use the discount code OS13PROG to get 20% off your registration fee.

tags: , , , , ,