The O'Reilly Data Show Podcast: Anima Anandkumar on tensor decomposition techniques for machine learning.
After sitting in on UC Irvine Professor Anima Anandkumar’s Strata + Hadoop World 2015 in San Jose presentation, I wrote a post urging the data community to build tensor decomposition libraries for data science. The feedback I’ve gotten from readers has been extremely positive. During the latest episode of the O’Reilly Data Show Podcast, I sat down with Anandkumar to talk about tensor decomposition, machine learning, and the data science program at UC Irvine.
Modeling higher-order relationships
The natural question is: why use tensors when (large) matrices can already be challenging to work with? Proponents are quick to point out that tensors can model more complex relationships. Anandkumar explains:
Tensors are higher order generalizations of matrices. While matrices are two-dimensional arrays consisting of rows and columns, tensors are now multi-dimensional arrays. … For instance, you can picture tensors as a three-dimensional cube. In fact, I have here on my desk a Rubik’s Cube, and sometimes I use it to get a better understanding when I think about tensors. … One of the biggest use of tensors is for representing higher order relationships. … If you want to only represent pair-wise relationships, say co-occurrence of every pair of words in a set of documents, then a matrix suffices. On the other hand, if you want to learn the probability of a range of triplets of words, then we need a tensor to record such relationships. These kinds of higher order relationships are not only important for text, but also, say, for social network analysis. You want to learn not only about who is immediate friends with whom, but, say, who is friends of friends of friends of someone, and so on. Tensors, as a whole, can represent much richer data structures than matrices.
A survey of the landscape shows the types of tools remain the same, but interfaces continue to improve.
As data projects become complex and as data teams grow in size, individuals and organizations need tools to efficiently manage data projects. A while back, I wrote a post on common options, and I closed that piece by asking:
Are there completely different ways of thinking about reproducibility, lineage, sharing, and collaboration in the data science and engineering context?
At the time, I listed categories that seemed to capture much of what I was seeing in practice: (proprietary) workbooks aimed at business analysts, sophisticated IDEs, notebooks (for mixing text, code, and graphics), and workflow tools. At a high level, these tools aspire to enable data teams to do the following:
- Reproduce their work — so they can rerun and/or audit when needed
- Facilitate storytelling — because in many cases, it’s important to explain to others how results were derived
- Operationalize successful and well-tested pipelines — particularly when deploying to production is a long-term objective
As I survey the landscape, the types of tools remain the same, but interfaces continue to improve, and domain specific languages (DSLs) are starting to appear in the context of data projects. One interesting trend is that popular user interface models are being adapted to different sets of data professionals (e.g. workflow tools for business users). Read more…
The O'Reilly Data Show Podcast: Michael Stack on HBase past, present, and future.
Subscribe to the O’Reilly Data Show to explore the opportunities and techniques driving big data and data science.
At least once a year, I sit down with Michael Stack, engineer at Cloudera, to get an update on Apache HBase and the annual user conference, HBasecon. Stack has a great perspective, as he has been part of HBase since its inception. As former project leader, he remains a key contributor and evangelist, and one of the organizers of HBasecon.
In the beginning: Search and Bigtable
During the latest episode of the O’Reilly Data Show Podcast, I decided to broaden our conversation to include the beginnings of the very popular Apache HBase project. Stack reminded me that in the early days much of the big data community in the SF Bay Area was centered around search technologies, such as HBase. In particular, HBase was inspired by work out of Google (Bigtable), and the early engineers had ties to projects out of the Internet Archive:
At the time, I was working at the Internet Archive, and I was working on crawlers and search. The Bigtable paper looked really interesting to us because the archive, as you know, we used to host — or still do — the Wayback Machine. The Wayback Machine is a picture of the Web that goes back to 1998, and you could look at the Web at any particular time. What pages looked liked at a particular time. Bigtable was very interesting at the Internet Archive because it had this time dimension.
A group had started up to talk about the possibility of implementing a Bigtable clone. It was centered at a place called Powerset, a startup that was in San Francisco back then. That was about doing a search, so I went and talked to them. They said, ‘Come on over and we’ll make a space for doing a Bigtable clone.’ They had a very intricate search pipeline, and it was based on early Amazon AWS, and every time they started up their pipeline, they’d get a phone call from Amazon saying, ‘Please stop whatever it is you’re doing.’ … The first engineer would be a fellow called Jim Kellerman. The actual first 30 classes came from Mike Cafarella. He was instrumental in getting the first versions of Hadoop going. He was hanging around Apache Nutch at the time. … Doug [Cutting] used to work at the Internet archive, and the first actual versions of Hadoop were run on racks at the Internet archive. Doug was working on fulltext search. Then he moved on to go to Yahoo, to work on Hadoop full time.
The O'Reilly Data Show Podcast: Mikio Braun on stream processing, academic research, and training.
Mikio Braun is a machine learning researcher who also enjoys software engineering. We first met when he co-founded a real-time analytics company called streamdrill. Since then, I’ve always had great conversations with him on many topics in the data space. He gave one of the best-attended sessions at Strata + Hadoop World in Barcelona last year on some of his work at streamdrill.
I recently sat down with Braun for the latest episode of the O’Reilly Data Show Podcast, and we talked about machine learning, stream processing and analytics, his recent foray into data science training, and academia versus industry (his interests are a bit on the “applied” side, but he enjoys both).
Things are moving fast in the stream processing world.
Register for Strata + Hadoop World, London. Editor’s note: Ben Lorica is an advisor to Databricks and Graphistry. Many of the technologies discussed in this post will be covered in trainings, tutorials, and sessions at Strata + Hadoop World in London this coming May.
There’s renewed interest in stream processing and analytics. I write this based on some data points (attendance in webcasts and conference sessions; a recent meetup), and many conversations with technologists, startup founders, and investors. Certainly, applications are driving this recent resurgence. I’ve written previously about systems that come from IT operations as well as how the rise of cheap sensors are producing stream mining solutions from wearables (mostly health-related apps) and the IoT (consumer, industrial, and municipal settings). In this post, I’ll provide a short update on some of the systems that are being built to handle large amounts of event data.
Apache projects (Kafka, Storm, Spark Streaming, Flume) continue to be popular components in stream processing stacks (I’m not yet hearing much about Samza). Over the past year, many more engineers started deploying Kafka alongside one of the two leading distributed stream processing frameworks (Storm or Spark Streaming). Among the major Hadoop vendors, Hortonworks has been promoting Storm, Cloudera supports Spark Streaming, and MapR supports both. Kafka is a high-throughput distributed pub/sub system that provides a layer of indirection between “producers” that write to it and “consumers” that take data out of it. A new startup (Confluent) founded by the creators of Kafka should further accelerate the development of this already very popular system. Apache Flume is used to collect, aggregate, and move large amounts of streaming data, and is frequently used with Kafka (Flafka or Flume + Kafka). Spark Streaming continues to be one of the more popular components within the Spark ecosystem, and its creators have been adding features at a rapid pace (most recently Kafka integration, a Python API, and zero data loss). Read more…
The O'Reilly Data Show Podcast: Erich Nachbar on testing and deploying open source, distributed computing components.
When I first hear of a new open source project that might help me solve a problem, the first thing I do is ask around to see if any of my friends have tested it. Sometimes, however, the early descriptions sound so promising that I just jump right in and try it myself — and in a few cases, I transition immediately (this was certainly the case for Spark).
I recently had a conversation with Erich Nachbar, founder and CTO of Virtual Power Systems, and one of the earliest adopters of Spark. In the early days of Spark, Nachbar was CTO of Quantifind, a startup often cited by the creators of Spark as one of the first “production deployments.” On the latest episode of the O’Reilly Data Show Podcast, we talk about the ease with which Nachbar integrates new open source components into existing infrastructure, his contributions to Mesos, and his new “software-defined power distribution” startup.
Ecosystem of open source big data technologies
When evaluating a new software component, nothing beats testing it against workloads that mimic your own. Nachbar has had the luxury of working in organizations where introducing new components isn’t subject to multiple levels of decision-making. But, as he notes, everything starts with testing things for yourself:
“I have sort of my mini test suite…If it’s a data store, I would just essentially hook it up to something that’s readily available, some feed like a Twitter fire hose, and then just let it be bombarded with data, and by now, it’s my simple benchmark to know what is acceptable and what isn’t for the machine…I think if more people, instead of reading papers and paying people to tell them how good or bad things are, would actually set aside a day and try it, I think they would learn a lot more about the system than just reading about it and theorizing about the system. Read more…
Tensor methods for machine learning are fast, accurate, and scalable, but we'll need well-developed libraries.
Data scientists frequently find themselves dealing with high-dimensional feature spaces. As an example, text mining usually involves vocabularies comprised of 10,000+ different words. Many analytic problems involve linear algebra, particularly 2D matrix factorization techniques, for which several open source implementations are available. Anyone working on implementing machine learning algorithms ends up needing a good library for matrix analysis and operations.
But why stop at 2D representations? In a recent Strata + Hadoop World San Jose presentation, UC Irvine professor Anima Anandkumar described how techniques developed for higher-dimensional arrays can be applied to machine learning. Tensors are generalizations of matrices that let you look beyond pairwise relationships to higher-dimensional models (a matrix is a second-order tensor). For instance, one can examine patterns between any three (or more) dimensions in data sets. In a text mining application, this leads to models that incorporate the co-occurrence of three or more words, and in social networks, you can use tensors to encode arbitrary degrees of influence (e.g., “friend of friend of friend” of a user).
Being able to capture higher-order relationships proves to be quite useful. In her talk, Anandkumar described applications to latent variable models — including text mining (topic models), information science (social network analysis), recommender systems, and deep neural networks. A natural entry point for applications is to look at generalizations of matrix (2D) techniques to higher-dimensional arrays. Read more…
The O'Reilly Data Show Podcast: Angie Ma on building a finishing school for science and engineering doctorates.
Editor’s note: The ASI will offer a two-day intensive course, Practical Machine Learning, at Strata + Hadoop World in London in May.
Back when I was considering leaving academia, the popular exit route was financial engineering. Many science and engineering Ph.D.s ended up in big Wall Street banks; I chose to be the lead quant at a small hedge fund — it was a natural choice for many of us. Financial engineering was topically close to my academic interests, and working with traders meant access to resources and interesting problems.
Today, there are many more options for people with science and engineering doctorates. A few organizations take science and engineering Ph.D.s, and over the course of 8-12 weeks, prepare them to join the ranks of industrial data scientists and data engineers.
I recently sat down with Angie Ma, co-founder and president of ASI, a London startup that runs a carefully structured “finishing school” for science and engineering doctorates. We talked about how Angie and her co-founders (all ex-physicists) arrived at the concept of the ASI, the structure of their training programs, and the data and startup scene in the UK. [Full disclosure: I’m an advisor to the ASI.] Read more…
The O'Reilly Data Show Podcast: David Blei, co-creator of one of the most popular tools in text mining and machine learning.
I don’t remember when I first came across topic models, but I do remember being an early proponent of them in industry. I came to appreciate how useful they were for exploring and navigating large amounts of unstructured text, and was able to use them, with some success, in consulting projects. When an MCMC algorithm came out, I even cooked up a Java program that I came to rely on (up until Mallet came along).
I recently sat down with David Blei, co-author of the seminal paper on topic models, and who remains one of the leading researchers in the field. We talked about the origins of topic models, their applications, improvements to the underlying algorithms, and his new role in training data scientists at Columbia University.
Generating features for other machine learning tasks
Blei frequently interacts with companies that use ideas from his group’s research projects. He noted that people in industry frequently use topic models for “feature generation.” The added bonus is that topic models produce features that are easy to explain and interpret:
“You might analyze a bunch of New York Times articles for example, and there’ll be an article about sports and business, and you get a representation of that article that says this is an article and it’s about sports and business. Of course, the ideas of sports and business were also discovered by the algorithm, but that representation, it turns out, is also useful for prediction. My understanding when I speak to people at different startup companies and other more established companies is that a lot of technology companies are using topic modeling to generate this representation of documents in terms of the discovered topics, and then using that representation in other algorithms for things like classification or other things.”