Topic modeling for the newbie

Learning the fundamentals of natural language processing.

Get “Data Science from Scratch” at 50% off with code DATA50. Editor’s note: This is an excerpt from our recent book Data Science from Scratch, by Joel Grus. It provides a survey of topics from statistics and probability to databases, from machine learning to MapReduce, giving the reader a foundation for understanding, and examples and ideas for learning more.

When we built our Data Scientists You Should Know recommender in Chapter 1, we simply looked for exact matches in people’s stated interests.

A more sophisticated approach to understanding our users’ interests might try to identify the topics that underlie those interests. A technique called Latent Dirichlet Analysis (LDA) is commonly used to identify common topics in a set of documents. We’ll apply it to documents that consist of each user’s interests.

LDA has some similarities to the Naive Bayes Classifier we built in Chapter 13, in that it assumes a probabilistic model for documents. We’ll gloss over the hairier mathematical details, but for our purposes the model assumes that:

  • There is some fixed number K of topics.
  • There is a random variable that assigns each topic an associated probability distribution over words. You should think of this distribution as the probability of seeing word w given topic k.
  • There is another random variable that assigns each document a probability distribution over topics. You should think of this distribution as the mixture of topics in document d.
  • Each word in a document was generated by first randomly picking a topic (from the document’s distribution of topics) and then randomly picking a word (from the topic’s distribution of words).

In particular, we have a collection of documents, each of which is a list of words. And we have a corresponding collection of document_topics that assigns a topic (here a number between 0 and K – 1) to each word in each document. Read more…

Comment: 1

Startups suggest big data is moving to the clouds

A look at the winners from a showcase of some of the most innovative big data startups.

At Strata + Hadoop World in London last week, we hosted a showcase of some of the most innovative big data startups. Our judges narrowed the field to 10 finalists, from whom they — and attendees — picked three winners and an audience choice.

Underscoring many of these companies was the move from software to services. As industries mature, we see a move from custom consulting to software and, ultimately, to utilities — something Simon Wardley underscored in his Data Driven Business Day talk, and which was reinforced by the announcement of tools like Google’s Bigtable service offering.

This trend was front and center at the showcase:

  • Winner Modgen, for example, generates recommendations and predictions, offering machine learning as a cloud-based service.
  • While second-place Brytlyt offers their high-performance database as an on-premise product, their horizontally scaled-out architecture really shines when the infrastructure is elastic and cloud based.
  • Finally, third-place OpenSensors’ real-time IoT message platform scales to millions of messages a second, letting anyone spin up a network of connected devices.

Ultimately, big data gives clouds something to do. Distributed sensors need a widely available, connected repository into which to report; databases need to grow and shrink with demand; and predictive models can be tuned better when they learn from many data sets. Read more…

Comment: 1

The tensor renaissance in data science

The O'Reilly Data Show Podcast: Anima Anandkumar on tensor decomposition techniques for machine learning.

glory_nosha_flickr

After sitting in on UC Irvine Professor Anima Anandkumar’s Strata + Hadoop World 2015 in San Jose presentationI wrote a post urging the data community to build tensor decomposition libraries for data science. The feedback I’ve gotten from readers has been extremely positive. During the latest episode of the O’Reilly Data Show Podcast, I sat down with Anandkumar to talk about tensor decomposition, machine learning, and the data science program at UC Irvine.

Modeling higher-order relationships

The natural question is: why use tensors when (large) matrices can already be challenging to work with? Proponents are quick to point out that tensors can model more complex relationships. Anandkumar explains:

Tensors are higher order generalizations of matrices. While matrices are two-dimensional arrays consisting of rows and columns, tensors are now multi-dimensional arrays. … For instance, you can picture tensors as a three-dimensional cube. In fact, I have here on my desk a Rubik’s Cube, and sometimes I use it to get a better understanding when I think about tensors.  … One of the biggest use of tensors is for representing higher order relationships. … If you want to only represent pair-wise relationships, say co-occurrence of every pair of words in a set of documents, then a matrix suffices. On the other hand, if you want to learn the probability of a range of triplets of words, then we need a tensor to record such relationships. These kinds of higher order relationships are not only important for text, but also, say, for social network analysis. You want to learn not only about who is immediate friends with whom, but, say, who is friends of friends of friends of someone, and so on. Tensors, as a whole, can represent much richer data structures than matrices.

Read more…

Comment: 1

The unwelcome guest: Why VMs aren’t the solution for next-gen applications

Scale-out applications need scaled-in virtualization.

scale_in_esterno_Mia_Felicita_Bertelli_FlickrData center operating systems are emerging as a first-class category of distributed system software. Hadoop, for example, is evolving from a MapReduce framework into YARN, a generic platform for scale-out applications.

To enable a rich ecosystem of diverse applications to coexist on these platforms, providing adequate isolation is crucial. The isolation mechanism must enforce resource limits, decouple software dependencies among applications and the host, provide security and privacy, confine failures, etc. Containers offer a simple and elegant solution to the problem. However, a question that comes up frequently is: Why not virtual machines (VMs)? After all, these systems face a number of the same challenges that have been solved by virtualization for traditional enterprise applications.

All problems in computer science can be solved by another level of indirection, except of course for the problem of too many indirections” — David Wheeler

Read more…

Comment: 1

On the evolution of machine learning

From linear models to neural networks: an interview with Reza Zadeh.

Get notified when our free report, “Future of Machine Intelligence: Perspectives from Leading Practitioners,” is available for download. The following interview is one of many that will be included in the report.

As part of our ongoing series of interviews surveying the frontiers of machine intelligence, I recently interviewed Reza Zadeh. Reza is a Consulting Professor in the Institute for Computational and Mathematical Engineering at Stanford University and a Technical Advisor to Databricks. His work focuses on Machine Learning Theory and Applications, Distributed Computing, and Discrete Applied Mathematics.

Key Takeaways

  • Neural networks have made a comeback and are playing a growing role in new approaches to machine learning.
  • The greatest successes are being achieved via a supervised approach leveraging established algorithms.
  • Spark is an especially well-suited environment for distributed machine learning.

David Beyer: Tell us a bit about your work at Stanford

Reza Zadeh: At Stanford, I designed and teach distributed algorithms and optimization (CME 323) as well as a course called discrete mathematics and algorithms (CME 305). In the discrete mathematics course, I teach algorithms from a completely theoretical perspective, meaning that it is not tied to any programming language or framework, and we fill up whiteboards with many theorems and their proofs. Read more…

Comment: 1

More tools for managing and reproducing complex data projects

A survey of the landscape shows the types of tools remain the same, but interfaces continue to improve.

The_right_fit_Ian_D_Keating_Flickr

As data projects become complex and as data teams grow in size, individuals and organizations need tools to efficiently manage data projects. A while back, I wrote a post on common options, and I closed that piece by asking:

Are there completely different ways of thinking about reproducibility, lineage, sharing, and collaboration in the data science and engineering context?

At the time, I listed categories that seemed to capture much of what I was seeing in practice: (proprietary) workbooks aimed at business analysts, sophisticated IDEs, notebooks (for mixing text, code, and graphics), and workflow tools. At a high level, these tools aspire to enable data teams to do the following:

  • Reproduce their work — so they can rerun and/or audit when needed
  • Collaborate
  • Facilitate storytelling — because in many cases, it’s important to explain to others how results were derived
  • Operationalize successful and well-tested pipelines — particularly when deploying to production is a long-term objective

As I survey the landscape, the types of tools remain the same, but interfaces continue to improve, and domain specific languages (DSLs) are starting to appear in the context of data projects. One interesting trend is that popular user interface models are being adapted to different sets of data professionals (e.g. workflow tools for business users). Read more…

Comment: 1