Let’s build open source tensor libraries for data science

Tensor methods for machine learning are fast, accurate, and scalable, but we'll need well-developed libraries.

Rubik's cube collection. Image by Gerwin Sturm on Flickr.

Data scientists frequently find themselves dealing with high-dimensional feature spaces. As an example, text mining usually involves vocabularies of 10,000+ different words. Many analytic problems involve linear algebra, particularly 2D matrix factorization techniques, for which several open source implementations are available. Anyone working on implementing machine learning algorithms ends up needing a good library for matrix analysis and operations.
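To make the 2D case concrete before moving on, here is a minimal sketch of the kind of factorization those libraries provide: a truncated SVD of a toy term-document matrix in NumPy. The matrix and vocabulary are made up for illustration; a real text-mining matrix would have thousands of rows.

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents.
# Real text-mining matrices have 10,000+ rows, but the factorization is the same.
X = np.array([
    [2, 0, 1, 0],   # "tensor"
    [1, 1, 0, 0],   # "matrix"
    [0, 0, 3, 1],   # "network"
    [0, 2, 1, 2],   # "graph"
], dtype=float)

# Truncated SVD: keep the top k singular vectors as latent "topics".
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("rank-2 reconstruction error:", np.linalg.norm(X - X_approx))
```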

But why stop at 2D representations? In a recent Strata + Hadoop World San Jose presentation, UC Irvine professor Anima Anandkumar described how techniques developed for higher-dimensional arrays can be applied to machine learning. Tensors are generalizations of matrices that let you look beyond pairwise relationships to higher-dimensional models (a matrix is a second-order tensor). For instance, one can examine patterns between any three (or more) dimensions in data sets. In a text mining application, this leads to models that incorporate the co-occurrence of three or more words, and in social networks, you can use tensors to encode arbitrary degrees of influence (e.g., “friend of friend of friend” of a user).
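As an entirely toy illustration of what such a third-order object looks like, the NumPy sketch below builds a word co-occurrence tensor from a three-document corpus. The corpus, vocabulary, and counting scheme are illustrative assumptions, not anyone's production pipeline.

```python
import numpy as np
from itertools import combinations, permutations

# Toy corpus; a real text-mining vocabulary would have 10,000+ words.
docs = [
    ["tensor", "matrix", "decomposition"],
    ["matrix", "decomposition", "spark"],
    ["tensor", "decomposition", "spark"],
]
vocab = sorted({w for doc in docs for w in doc})
index = {w: i for i, w in enumerate(vocab)}
n = len(vocab)

# Third-order co-occurrence tensor: T[i, j, k] counts how often
# words i, j, and k all appear in the same document.
T = np.zeros((n, n, n))
for doc in docs:
    for triple in combinations(sorted(set(doc)), 3):
        idx = [index[w] for w in triple]
        for p in permutations(idx):   # symmetrize over word order
            T[p] += 1

print(T.shape)  # (4, 4, 4)
print(T[index["tensor"], index["matrix"], index["decomposition"]])  # 1.0
```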

Being able to capture higher-order relationships proves to be quite useful. In her talk, Anandkumar described applications to latent variable models — including text mining (topic models), information science (social network analysis), recommender systems, and deep neural networks. A natural entry point for applications is to look at generalizations of matrix (2D) techniques to higher-dimensional arrays. For example, the image that follows is an attempt to illustrate one form of eigen decomposition:


Spectral decomposition of tensors. Source: Anima Anandkumar, used with permission.
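To make the idea of a spectral decomposition of a tensor slightly more tangible, here is a minimal NumPy sketch of a tensor power iteration recovering the components of a synthetic, symmetric third-order tensor with orthonormal factors. It illustrates the general technique under idealized assumptions; it is not the code Anandkumar's group has released.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic symmetric 3rd-order tensor T = sum_r w_r * a_r (x) a_r (x) a_r,
# with orthonormal components a_r: the idealized setting in which the
# tensor power method provably recovers the components.
d, k = 8, 3
A, _ = np.linalg.qr(rng.standard_normal((d, k)))   # orthonormal columns
w = np.array([3.0, 2.0, 1.0])
T = np.einsum('r,ir,jr,kr->ijk', w, A, A, A)

def power_iteration(T, n_iter=200):
    """Return one (eigenvalue, eigenvector) pair of a symmetric 3rd-order tensor."""
    v = rng.standard_normal(T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = np.einsum('ijk,j,k->i', T, v, v)        # the map v -> T(I, v, v)
        v /= np.linalg.norm(v)
    lam = np.einsum('ijk,i,j,k->', T, v, v, v)      # eigenvalue T(v, v, v)
    return lam, v

# Recover the components one at a time, deflating T after each one.
for _ in range(k):
    lam, v = power_iteration(T)
    print(round(lam, 3))   # recovers the weights 1.0, 2.0, 3.0 (in some order)
    T = T - lam * np.einsum('i,j,k->ijk', v, v, v)
```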

Tensor methods are accurate and embarrassingly parallel

Latent variable models and deep neural networks can be estimated using other methods, including maximum likelihood and local search techniques (gradient descent, variational inference, EM). So, why use tensors at all? Unlike variational inference and EM, tensor methods converge to global, not merely local, optima under reasonable conditions. In her talk, Anandkumar described some recent examples — topic models and social network analysis — where tensor methods proved to be faster and more accurate than other methods.


Error rates and recovery ratios from recent community detection experiments (running time measured in seconds). Source: Anima Anandkumar, used with permission.

Scalability is another important reason why tensors are generating interest. Tensor decomposition algorithms have been parallelized using GPUs, and more recently using Apache REEF (a distributed framework originally developed by Microsoft). To summarize, early results are promising (in terms of speed and accuracy), and implementations in distributed systems lead to algorithms that scale to extremely large data sets.


General framework. Source: Anima Anandkumar, used with permission.

Hierarchical decomposition models

Their ability to model multi-way relationships makes tensor methods particularly useful for uncovering hierarchical structures in high-dimensional data sets. In a recent paper, Anandkumar and her collaborators automatically found patterns and “…concepts reflecting co-occurrences of particular diagnoses in patients in outpatient and intensive care settings.”

Why aren’t tensors more popular?

If they’re faster, more accurate, and embarrassingly parallel, why haven’t tensor methods become more common? It comes down to libraries. Just as matrix libraries are needed to implement many machine learning algorithms, open source libraries for tensor analysis need to become more common. While it’s true that tensor computations are more demanding than matrix algorithms, recent improvements in parallel and distributed computing systems have made tensor techniques feasible.

There are some early libraries for tensor analysis in MATLAB, Python, TH++ from Facebook, and many others from the scientific computing community. For applications to machine learning, software tools that include tensor decomposition methods are essential. As a first step, Anandkumar and her UC Irvine colleagues have released code for tensor methods for topic modeling and social network modeling that run on single servers.

But for data scientists to embrace these techniques, we’ll need well-developed libraries accessible from the languages (Python, R, Java, Scala) and frameworks (Apache Spark) we’re already familiar with. (Coincidentally, Spark developers recently introduced distributed matrices.)

It’s fun to see a tool that I first encountered in math and physics courses having an impact in machine learning. But the primary reason I’m writing this post is to get readers excited enough to build open source tensor (decomposition) libraries. Once these basic libraries are in place, tensor-based algorithms become easier to implement. Anandkumar and her collaborators are in the early stages of porting some of their code to Apache Spark, and I’m hoping other groups will jump into the fray.

To view Anima Anandkumar’s talk at Strata + Hadoop World in San Jose, Tensor Methods for Large-scale Unsupervised Learning: Applications to Topic and Community Modeling, sign up for a free trial of Safari Books Online.

Cropped image on article and category pages by Gerwin Sturm on Flickr, used under a Creative Commons license.



  • Kaz

    FLAME project, U of Texas.

    • Thanks, Kaz, for letting us know about this. I, however, do not see much about tensors there.

  • Igor

    Welcome to the Advanced Matrix (and Tensor) Factorization Jungle: https://sites.google.com/site/igorcarron2/matrixfactorizations

    ;-)

    Igor.

  • Igor

    Ben,

    You ask “Why aren’t tensors more popular?” Besides the dimensionality issue, some properties of matrices do not translate well to tensors (rank, for example; see http://arxiv.org/pdf/math/0607647.pdf).

    Please note other types of decompositions, such as Tensor Trains (http://spring.inm.ras.ru/osel/?p=154).

    Cheers,

    Igor.

    • Dear Igor, Indeed, there are many types of decompositions of tensors; we are certainly not the only ones working on them. The paper you point out states that there can be ill-posed tensors, which is true. In fact, there is another paper which states that most tensor problems are NP-hard (http://arxiv.org/abs/0911.1393). However, the tensors we study, which are relevant for machine learning, turn out to be “easy cases”: what we establish is that they are tractable under some very reasonable non-degeneracy conditions. I talk about some of these points in the podcast that Ben will post in due course. Stay tuned!

      • Robert John Freeman

        I started looking at crude kinds of vector product decompositions for natural language processing some time ago.

        What I concluded was that for natural language processing, cognitive computing, and other problems, tensors without global decompositions are in fact the norm. So they are not normally the “easy cases.” But, on the bright side, I think this will turn out to be a feature, not a bug: if we reject global decompositions we can find a greater number of local decompositions, and the result is more information-carrying capacity (an explanation for the apparent rampant ambiguity of words in natural language). The capacity for contradictory local decompositions may in fact prove to be the great power of vector/tensor representations.

  • I’d like to throw out our lib nd4j (http://nd4j.org/): N-dimensional representations that run first class on the JVM but can talk to CUDA, native, and OpenCL, all by swapping out the jar file for what you want to use.

    • Indeed, nd4j has array representations, but not tensor decompositions and other operations that are critical for machine learning applications.

      • Right. I’m thinking that, with respect to Apache Spark, the array representations on the JVM are a good basis for implementing such algorithms. The DSL is the first step to getting these algorithms working at scale.

  • Mike Anderson

    I’d like to point out that the library Vectorz offers pretty mature tensor support on the JVM (as well as all the usual vector / matrix stuff). It’s now the go-to library for numerics in the Clojure world.

    https://github.com/mikera/vectorz

    • Thanks, Mike, for pointing this out. But again, I do not see any decomposition methods for tensors there.

      • Mike Anderson

        Ah yes, you are correct: the decompositions are only the standard 2D ones at the moment. But it’s open source – so these may come soon!

  • https://github.com/dmlc/mshadow: this library is about tensor operations, and it’s very fast. It also supports GPU computing.

  • Oleg

    Hi Ben, did you mean “Let’s build the libraries from scratch”? Or did you have something in mind, some projects to start from? Thanks

    • Ben Lorica

      I have no specific project in mind, but have been working on convincing the Apache Spark (MLlib) folks :-)