Topic modeling for the newbie

Learning the fundamentals of natural language processing.

Get “Data Science from Scratch” at 50% off with code DATA50. Editor’s note: This is an excerpt from our recent book Data Science from Scratch, by Joel Grus. It provides a survey of topics from statistics and probability to databases, from machine learning to MapReduce, giving the reader a foundation for understanding, and examples and ideas for learning more.

When we built our Data Scientists You Should Know recommender in Chapter 1, we simply looked for exact matches in people’s stated interests.

A more sophisticated approach to understanding our users’ interests might try to identify the topics that underlie those interests. A technique called Latent Dirichlet Analysis (LDA) is commonly used to identify common topics in a set of documents. We’ll apply it to documents that consist of each user’s interests.

LDA has some similarities to the Naive Bayes Classifier we built in Chapter 13, in that it assumes a probabilistic model for documents. We’ll gloss over the hairier mathematical details, but for our purposes the model assumes that:

  • There is some fixed number K of topics.
  • There is a random variable that assigns each topic an associated probability distribution over words. You should think of this distribution as the probability of seeing word w given topic k.
  • There is another random variable that assigns each document a probability distribution over topics. You should think of this distribution as the mixture of topics in document d.
  • Each word in a document was generated by first randomly picking a topic (from the document’s distribution of topics) and then randomly picking a word (from the topic’s distribution of words).

In particular, we have a collection of documents, each of which is a list of words. And we have a corresponding collection of document_topics that assigns a topic (here a number between 0 and K – 1) to each word in each document.

So that the fifth word in the fourth document is:

and the topic from which that word was chosen is:

This very explicitly defines each document’s distribution over topics, and it implicitly defines each topic’s distribution over words.

We can estimate the likelihood that topic 1 produces a certain word by comparing how many times topic 1 produces that word with how many times topic 1 produces any word. (Similarly, when we built a spam filter in Chapter 13, we compared how many times each word appeared in spams with the total number of words appearing in spams.)

Although these topics are just numbers, we can give them descriptive names by looking at the words on which they put the heaviest weight. We just have to somehow generate the document_topics. This is where Gibbs sampling comes into play.

We start by assigning every word in every document a topic completely at random. Now we go through each document one word at a time. For that word and document, we construct weights for each topic that depend on the (current) distribution of topics in that document and the (current) distribution of words for that topic. We then use those weights to sample a new topic for that word. If we iterate this process many times, we will end up with a joint sample from the topic-word distribution and the document-topic distribution.

To start with, we’ll need a function to randomly choose an index based on an arbitrary set of weights:

For instance, if you give it weights [1, 1, 3], then one-fifth of the time it will return 0, one-fifth of the time it will return 1, and three-fifths of the time it will return 2.

Our documents are our users’ interests, which look like:

And we’ll try to find K = 4 topics.

In order to calculate the sampling weights, we’ll need to keep track of several counts. Let’s first create the data structures for them.

How many times each topic is assigned to each document:

How many times each word is assigned to each topic:

The total number of words assigned to each topic:

The total number of words contained in each document:

The number of distinct words:

And the number of documents:

For example, once we populate these, we can find the number of words in documents[3] associated with topic 1 as:

And we can find the number of times nlp is associated with topic 2 as:

Now we’re ready to define our conditional probability functions. As in Chapter 13, each has a smoothing term that ensures every topic has a nonzero chance of being chosen in any document and that every word has a nonzero chance of being chosen for any topic:

We’ll use these to create the weights for updating topics:

There are solid mathematical reasons why topic_weight is defined the way it is, but their details would lead us too far afield. Hopefully it makes at least intuitive sense that—given a word and its document—the likelihood of any topic choice depends on both how likely that topic is for the document and how likely that word is for the topic.

This is all the machinery we need. We start by assigning every word to a random topic, and populating our counters appropriately:

Our goal is to get a joint sample of the topics-words distribution and the documents-topics distribution. We do this using a form of Gibbs sampling that uses the conditional probabilities defined previously:

What are the topics? They’re just numbers 0, 1, 2, and 3. If we want names for them we have to do that ourselves. Let’s look at the five most heavily weighted words for each (Table 20-1):

Table 20-1: Most common words per topic

Topic 0 Topic 1 Topic 2 Topic 3
Java R HBase regression
Big Data statistics Postgres libsvm
Hadoop Python MongoDB scikit-learn
deep learning probability Cassandra machine learning
artificial intelligence pandas NoSQL neural networks

Based on these I’d probably assign topic names:

at which point we can see how the model assigns topics to each user’s interests:

which gives:

and so on. Given the “ands” we needed in some of our topic names, it’s possible we should use more topics, although most likely we don’t have enough data to successfully learn them.

tags: , , , , ,