Strata Gems: Use Wikipedia as training data

The online encyclopedia is a great resource for data scientists

We’ll be publishing a new Strata Gem each day all the way through to December 24. Yesterday’s Gem: Try MongoDB without installing anything.

One of the most exciting areas of analytics is natural language processing, including sentiment analysis. Given natural language text, can we use a computer to discover what’s being said? Applications range from user interfaces through marketing to espionage.

The hard part of the problem is teaching a computer what words mean, and working out from context which meaning of a word applies. The word “apple” could refer to the fruit, the computer company, or the Beatles’ record label. It could also mean a bank of the same name, a rock band, New York City, or the singer Fiona Apple. The list goes on.

One answer is to use a classifier, which can differentiate between the different contexts in which a word is used in order to determine its sense. Most anti-spam filtering solutions use a classifier. Classifiers must be trained to be effective though, as anybody who has used anti-spam systems will tell you.

It’s relatively easy to differentiate between spam and non-spam email, but how do you go about breaking down the English language to find training data for each word sense?

Fortunately, there’s a large open data source that has put a lot of effort into the disambiguation of terms such as “apple”: Wikipedia. Data scientists often use information from Wikipedia to help identify real-world entities in their work, and its use for disambiguation has been described in several reports, including this 2007 paper from Rada Mihalcea, Using Wikipedia for Automatic Word Sense Disambiguation (PDF).


The key concept is that in the Wikipedia article for the Apple computer company, the word “apple” is used to mean the company, so you can use that article to train natural language classifiers for that sense of the word. The Wikipedia article for apple the fruit offers a similar corpus for the fruity context, and so on. The Wikipedia URL for a particular concept is an unambiguous tag that you can then use to identify word sense.

Fortunately, you don’t need to be a deep researcher to start using Wikipedia in this way. A recent blog post from Jim Plush shows how to use Wikipedia and Python to disambiguate words from Twitter posts. With a relatively brief Python script and training data culled from Wikipedia, Plush was able to distinguish between apple the fruit and Apple the company in the text of Twitter posts mentioning “apple”.
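To make the idea concrete, here is a minimal sketch of a word-sense classifier in plain Python. It is not Plush's script. It is a small naive Bayes classifier with add-one smoothing, written only to illustrate the approach. The two training snippets are hypothetical stand-ins for the text of the corresponding Wikipedia articles, and the function names (tokenize, train, classify) are invented for this example. The labels are the Wikipedia URLs for each sense, which serve as the unambiguous tags.

```python
import math
import re
from collections import Counter

COMPANY = "https://en.wikipedia.org/wiki/Apple_Inc."
FRUIT = "https://en.wikipedia.org/wiki/Apple"

# Hypothetical stand-ins for article text. In practice you would
# use the full text of each Wikipedia article as the corpus.
TRAINING_TEXT = {
    COMPANY: (
        "Apple is a technology company that designs computers, phones "
        "and software. The company sells the Mac, the iPhone and the iPad."
    ),
    FRUIT: (
        "An apple is an edible fruit produced by an apple tree. "
        "Apple trees are cultivated in orchards and the fruit is eaten raw, "
        "baked in pies or pressed into juice and cider."
    ),
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(labelled_text):
    """Count word frequencies per label and collect the shared vocabulary."""
    counts = {label: Counter(tokenize(text)) for label, text in labelled_text.items()}
    vocabulary = set()
    for word_counts in counts.values():
        vocabulary.update(word_counts)
    return counts, vocabulary


def classify(text, counts, vocabulary):
    """Return the label whose corpus makes the text most likely.

    Uses equal priors and add-one (Laplace) smoothing. Words that
    appear in neither corpus are ignored.
    """
    scores = {}
    for label, word_counts in counts.items():
        total = sum(word_counts.values())
        score = 0.0
        for word in tokenize(text):
            if word in vocabulary:
                score += math.log((word_counts[word] + 1) / (total + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


counts, vocabulary = train(TRAINING_TEXT)
print(classify("I just bought a new apple iphone", counts, vocabulary))
print(classify("Pressed apple juice and cider from fresh fruit", counts, vocabulary))
```

With these toy snippets, the first message is tagged with the company URL and the second with the fruit URL. A corpus this small is fragile, so real use would need the full articles and probably a more careful model.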

For more information, check out the Python Natural Language Toolkit web site. Also, the Strata panel Online Sentiment, Machine Learning, and Prediction will dive into real world uses of sentiment analysis and machine learning.
