"natural language processing" entries

Four short links: 23 September, 2015

Sentence Generator, Deep Neural Networks Explainer, Sports Analytics, and System Hell

  1. Skip Thought Vectors — research (with code) that produces surrounding sentences, given a sentence.
  2. A Beginner’s Guide to Deep Neural Networks (Google) — Googlers’ 20% project to explain things to people tackles machine learning.
  3. Data Analytics in Sports — O’Reilly research report (free). When it comes to processing stats, competing companies Opta and ProZone use a combination of recording technology and human analysts who tag “events” within the game (much like Vantage Sports). Opta calculates that it tags between 1,600 and 2,000 events per football game — all delivered live.
  4. On Go, Portability, and System Interfaces — No point mentioning Perl’s Configure.sh, I thought. The poor bastard will invent it soon enough.

ResourceMiner: Toppling the Tower of Babel in the lab

An open source project aims to crowdsource a common language for experimental design.


Contributing author: Tim Gardner

Editor’s note: This post originally appeared on PLOS Tech; it is republished here with permission.

From Gutenberg’s invention of the printing press to the Internet of today, technology has enabled faster communication, and faster communication has accelerated technology development. Today, we can zip photos from a mountaintop in Switzerland back home to San Francisco with hardly a thought, but that wasn’t so trivial just a decade ago. It’s not just selfies that are being sent; it’s also product designs, manufacturing instructions, and research plans — all of it enabled by invisible technical standards (e.g., TCP/IP) and language standards (e.g., English) that allow machines and people to communicate.

But in the laboratory sciences (life, chemical, material, and other disciplines), communication remains inhibited by practices more akin to the oral traditions of a blacksmith shop than the modern Internet. In a typical academic lab, the reference description of an experiment is the long-form narrative in the “Materials and Methods” section of a paper or a book. Similarly, industry researchers depend on basic text documents in the form of Standard Operating Procedures. In both cases, essential details of the materials and protocol for an experiment are typically written somewhere in a long-forgotten, hard-to-interpret lab notebook (paper or electronic). More typically, details are simply left to the experimenter to remember and to the “lab culture” to retain.

At the dawn of science, when a handful of researchers were working on fundamental questions, this may have been good enough. But nowadays this archaic method of protocol record keeping and sharing is so lacking that half of all biomedical studies are estimated to be irreproducible, wasting $28 billion each year of U.S. government funding. With more than $400 billion invested each year in biological and chemical research globally, the full cost of irreproducible research to the public and private sector worldwide could be staggeringly large. Read more…


Topic modeling for the newbie

Learning the fundamentals of natural language processing.

Editor’s note: This is an excerpt from our recent book Data Science from Scratch, by Joel Grus. It provides a survey of topics from statistics and probability to databases, from machine learning to MapReduce, giving the reader a foundation for understanding, and examples and ideas for learning more.

When we built our Data Scientists You Should Know recommender in Chapter 1, we simply looked for exact matches in people’s stated interests.

A more sophisticated approach to understanding our users’ interests might try to identify the topics that underlie those interests. A technique called Latent Dirichlet Allocation (LDA) is commonly used to identify common topics in a set of documents. We’ll apply it to documents that consist of each user’s interests.

LDA has some similarities to the Naive Bayes Classifier we built in Chapter 13, in that it assumes a probabilistic model for documents. We’ll gloss over the hairier mathematical details, but for our purposes the model assumes that:

  • There is some fixed number K of topics.
  • There is a random variable that assigns each topic an associated probability distribution over words. You should think of this distribution as the probability of seeing word w given topic k.
  • There is another random variable that assigns each document a probability distribution over topics. You should think of this distribution as the mixture of topics in document d.
  • Each word in a document was generated by first randomly picking a topic (from the document’s distribution of topics) and then randomly picking a word (from the topic’s distribution of words).
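The generative story in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the book's code: the function name `generate_document`, the toy vocabulary, and the weights are all made up, and the two distributions are simply supplied by hand rather than learned.

```python
import random

def generate_document(topic_mixture, topic_word_weights, vocabulary, length, rng):
    """Generate `length` words by the two-step process described above.

    topic_mixture[k]         -- weight of topic k in this document
    topic_word_weights[k][i] -- weight of vocabulary[i] under topic k
    (All names here are hypothetical, chosen for illustration.)
    """
    topics = range(len(topic_mixture))
    words = []
    for _ in range(length):
        # Step 1: pick a topic from the document's distribution over topics.
        topic = rng.choices(topics, weights=topic_mixture)[0]
        # Step 2: pick a word from that topic's distribution over words.
        word = rng.choices(vocabulary, weights=topic_word_weights[topic])[0]
        words.append(word)
    return words

# Toy setup with K = 2 topics (made-up numbers).
vocabulary = ["Python", "statistics", "Hadoop", "HBase"]
topic_word_weights = [
    [0.5, 0.5, 0.0, 0.0],  # topic 0 only produces "Python" and "statistics"
    [0.0, 0.0, 0.5, 0.5],  # topic 1 only produces "Hadoop" and "HBase"
]
rng = random.Random(0)
print(generate_document([0.8, 0.2], topic_word_weights, vocabulary, 6, rng))
```

LDA itself works in the other direction: given only the words, it tries to infer distributions like `topic_mixture` and `topic_word_weights` that could plausibly have generated them.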

In particular, we have a collection of documents, each of which is a list of words. And we have a corresponding collection of document_topics that assigns a topic (here a number between 0 and K – 1) to each word in each document. Read more…
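As a rough sketch of the shapes involved, the two collections might look like the following. The sample interests and the value of `K` are invented for illustration, and starting from uniformly random topic assignments is an assumption here, a common starting point rather than something this excerpt specifies.

```python
import random

K = 4  # assumed number of topics

# Each document is a list of words (here, one user's stated interests).
documents = [
    ["Hadoop", "Big Data", "HBase", "Java"],
    ["Python", "statistics", "regression"],
    ["machine learning", "Python", "Big Data"],
]

rng = random.Random(0)

# document_topics mirrors the shape of documents: one topic number
# (between 0 and K - 1) for each word in each document.
document_topics = [
    [rng.randrange(K) for _ in document]
    for document in documents
]

for words, topics in zip(documents, document_topics):
    print(list(zip(words, topics)))
```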

Four short links: 29 October 2014

Tweet Parsing, Focus and Money, Challenging Open Data Beliefs, and Exploring ISP Data

  1. TweetNLP — CMU open source natural language parsing tools for making sense of Tweets.
  2. Interview with Google X Life Science’s Head (Medium) — I will have been here two years this March. In nineteen months we have been able to hire more than a hundred scientists to work on this. We’ve been able to build customized labs and get the equipment to make nanoparticles and decorate them and functionalize them. We’ve been able to strike up collaborations with MIT and Stanford and Duke. We’ve been able to initiate protocols and partnerships with companies like Novartis. We’ve been able to initiate trials like the baseline trial. This would be a good decade somewhere else. The power of focus and money.
  3. Schooloscope Open Data Post-Mortem — The case of Schooloscope and the wider question of public access to school data challenges the belief that sunlight is the best disinfectant, that government transparency would always lead to better government, better results. It challenges the sentiments that see data as value-neutral and its representation as devoid of politics. In fact, access to school data exposes a sharp contrast between the private interest of the family (best education for my child) and the public interest of the government (best education for all citizens).
  4. M-Lab Observatory — explorable data on the data experience (RTT, upload speed, etc) across different ISPs in different geographies over time.
Four short links: 30 July 2014

Offline First, Winograd Schemata, Jailbreaking Nest for Privacy, and Decentralised Web Cache

  1. Offline First is the New Mobile First — Luke Wroblewski’s notes from John Allsopp’s talk about “Breaking Development” in Nashville. Offline technologies don’t just give us sites that work offline, they improve performance, and security by minimizing the need for cookies, http, and file uploads. It also opens up new possibilities for better user experiences.
  2. Winograd Schemas as Alternative to Turing Test (IEEE) — specially constructed sentences that are surface ambiguous and require deeper knowledge of the world to disambiguate, e.g. “Jim comforted Kevin because he was so upset. Who was upset?”. Our WS [Winograd schemas] challenge does not allow a subject to hide behind a smokescreen of verbal tricks, playfulness, or canned responses. Assuming a subject is willing to take a WS test at all, much will be learned quite unambiguously about the subject in a few minutes. (that last from the paper on the subject)
  3. Reclaiming Your Nest (Forbes) — Like so many connected devices, Nest devices regularly report back to the Nest mothership with usage data. Over a month-long period, the researchers’ device sent 32 MB worth of information to Nest, including temperature data, at-rest settings, and self-entered information about the home, such as how big it is and the year it was built. “The Nest doesn’t give us an option to turn that off or on. They say they’re not going to use that data or share it with Google, but why don’t they give the option to turn it off?” says Jin. Jailbreak your Nest (technique to be discussed at Black Hat), and install less chatty software. Loose Lips Sink Thermostats.
  4. SyncNet — decentralised browser: don’t just pull pages from the source, but also fetch from distributed cache (implemented with BitTorrent Sync).

Google I/O 2013: Android Studio, Google Play Music: All Access, and New Advances in Search

My day one experience

While there was no skydiving this year to show off Google’s new wearable Glass, there were plenty of attendees wearing it proudly, including me. This year, however, hardware didn’t take center stage. The focus was on new tools and upgrades to existing products and platforms.

Android developers were thrilled to see new APIs and tools. The biggest cheers, at least in my section, were for Android Studio, built on IntelliJ, which from what I can tell is way better than Eclipse but notably not open source. The Developer Console got a substantial update with integrated translation services, user metrics, and revenue graphs, but what really made a big splash was the beta testing and staged rollout facilitation. These, along with new location and gaming APIs, rounded out the new offerings for the Android development crowd.

Read more…


Unstructured data is worth the effort when you’ve got the right tools

Alyona Medelyan and Anna Divoli on the opportunities in chaotic data.

Alyona Medelyan and Anna Divoli are inventing tools to help companies contend with vast quantities of fuzzy data. They discuss their work and what lies ahead for big data in this interview.


"We need tools that can help people have their ideas faster"

Aditi Muralidharan on improving discovery and building intuition into search.

Ph.D. student Aditi Muralidharan aims to make life easier for researchers and scientists with WordSeer, a text analysis tool that examines and visualizes language use patterns.
