# "natural language processing" entries

## Topic modeling for the newbie

### Learning the fundamentals of natural language processing.

Editor’s note: This is an excerpt from our recent book Data Science from Scratch, by Joel Grus. It provides a survey of topics from statistics and probability to databases, from machine learning to MapReduce, giving the reader a foundation for understanding, and examples and ideas for learning more.

When we built our Data Scientists You Should Know recommender in Chapter 1, we simply looked for exact matches in people’s stated interests.

A more sophisticated approach to understanding our users’ interests might try to identify the topics that underlie those interests. A technique called Latent Dirichlet Allocation (LDA) is commonly used to identify common topics in a set of documents. We’ll apply it to documents that consist of each user’s interests.

LDA has some similarities to the Naive Bayes Classifier we built in Chapter 13, in that it assumes a probabilistic model for documents. We’ll gloss over the hairier mathematical details, but for our purposes the model assumes that:

• There is some fixed number K of topics.
• There is a random variable that assigns each topic an associated probability distribution over words. You should think of this distribution as the probability of seeing word w given topic k.
• There is another random variable that assigns each document a probability distribution over topics. You should think of this distribution as the mixture of topics in document d.
• Each word in a document was generated by first randomly picking a topic (from the document’s distribution of topics) and then randomly picking a word (from the topic’s distribution of words).
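
The generative story in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code from the book: the function name `generate_document`, the toy vocabulary, and the hand-picked weights are all hypothetical, and real LDA infers these distributions from data rather than being given them.

```python
import random

def generate_document(topic_weights, topic_word_weights, vocabulary, length, rng):
    """Generate `length` words following the model's assumptions.

    topic_weights[k]         -- weight of topic k in this document
    topic_word_weights[k][w] -- weight of vocabulary[w] under topic k
    Returns (words, topics), where topics[i] is the topic that produced words[i].
    """
    num_topics = len(topic_weights)
    words, topics = [], []
    for _ in range(length):
        # Step 1: pick a topic from the document's distribution over topics.
        k = rng.choices(range(num_topics), weights=topic_weights)[0]
        # Step 2: pick a word from that topic's distribution over words.
        w = rng.choices(range(len(vocabulary)), weights=topic_word_weights[k])[0]
        topics.append(k)
        words.append(vocabulary[w])
    return words, topics

# Hypothetical toy data: K = 2 topics over a four-word vocabulary.
vocabulary = ["Python", "pandas", "regression", "probability"]
topic_word_weights = [
    [1, 1, 0, 0],  # topic 0 only ever produces "Python" or "pandas"
    [0, 0, 1, 1],  # topic 1 only ever produces "regression" or "probability"
]

rng = random.Random(0)  # seeded so repeated runs give the same sample
words, topics = generate_document([3, 1], topic_word_weights, vocabulary, 8, rng)
print(list(zip(words, topics)))
```

The `topics` list returned here plays the same role as one entry of document_topics: it records, for each word position, which topic (a number between 0 and K – 1) is assigned to that word.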

In particular, we have a collection of documents, each of which is a list of words. And we have a corresponding collection of document_topics that assigns a topic (here a number between 0 and K – 1) to each word in each document.

## Four short links: 29 October 2014

### Tweet Parsing, Focus and Money, Challenging Open Data Beliefs, and Exploring ISP Data

1. TweetNLP — CMU open source natural language parsing tools for making sense of Tweets.
2. Interview with Google X Life Science’s Head (Medium) — I will have been here two years this March. In nineteen months we have been able to hire more than a hundred scientists to work on this. We’ve been able to build customized labs and get the equipment to make nanoparticles and decorate them and functionalize them. We’ve been able to strike up collaborations with MIT and Stanford and Duke. We’ve been able to initiate protocols and partnerships with companies like Novartis. We’ve been able to initiate trials like the baseline trial. This would be a good decade somewhere else. The power of focus and money.
3. Schooloscope Open Data Post-Mortem — The case of Schooloscope and the wider question of public access to school data challenges the belief that sunlight is the best disinfectant, that government transparency would always lead to better government, better results. It challenges the sentiments that see data as value-neutral and its representation as devoid of politics. In fact, access to school data exposes a sharp contrast between the private interest of the family (best education for my child) and the public interest of the government (best education for all citizens).
4. M-Lab Observatory — explorable data on the data experience (RTT, upload speed, etc) across different ISPs in different geographies over time.

## Four short links: 30 July 2014

### Offline First, Winograd Schemata, Jailbreaking Nest for Privacy, and Decentralised Web Cache

1. Offline First is the New Mobile First — Luke Wroblewski’s notes from John Allsopp’s talk at Breaking Development in Nashville. Offline technologies don’t just give us sites that work offline, they improve performance, and security by minimizing the need for cookies, http, and file uploads. It also opens up new possibilities for better user experiences.
2. Winograd Schemas as Alternative to Turing Test (IEEE) — specially constructed sentences that are surface ambiguous and require deeper knowledge of the world to disambiguate, e.g. “Jim comforted Kevin because he was so upset. Who was upset?”. Our WS [Winograd schemas] challenge does not allow a subject to hide behind a smokescreen of verbal tricks, playfulness, or canned responses. Assuming a subject is willing to take a WS test at all, much will be learned quite unambiguously about the subject in a few minutes. (that last from the paper on the subject)
3. Reclaiming Your Nest (Forbes) — Like so many connected devices, Nest devices regularly report back to the Nest mothership with usage data. Over a month-long period, the researchers’ device sent 32 MB worth of information to Nest, including temperature data, at-rest settings, and self-entered information about the home, such as how big it is and the year it was built. “The Nest doesn’t give us an option to turn that off or on. They say they’re not going to use that data or share it with Google, but why don’t they give the option to turn it off?” says Jin. Jailbreak your Nest (technique to be discussed at Black Hat), and install less chatty software. Loose Lips Sink Thermostats.
4. SyncNet — decentralised browser: don’t just pull pages from the source, but also fetch from distributed cache (implemented with BitTorrent Sync).

### My day one experience

While there was no skydiving this year to show off Google’s new wearable, Glass, there were plenty of attendees wearing it proudly, including me. This year, however, hardware didn’t take center stage. The focus was on new tools and upgrades to existing products and platforms.

Android developers were thrilled to see new APIs and tools. The biggest cheers, at least in my section, were for Android Studio, built on IntelliJ, which from what I can tell is way better than Eclipse but notably not open source. The Developer Console got a substantial update with integrated translation services, user metrics, and revenue graphs, but what really made a big splash was the support for beta testing and staged rollouts. These, along with new location and gaming APIs, rounded out the new offerings for the Android development crowd.

## Unstructured data is worth the effort when you've got the right tools

### Alyona Medelyan and Anna Divoli on the opportunities in chaotic data.

Alyona Medelyan and Anna Divoli are inventing tools to help companies contend with vast quantities of fuzzy data. They discuss their work and what lies ahead for big data in this interview.

## "We need tools that can help people have their ideas faster"

### Aditi Muralidharan on improving discovery and building intuition into search.

Ph.D. student Aditi Muralidharan aims to make life easier for researchers and scientists with WordSeer, a text analysis tool that examines and visualizes language use patterns.