What is probabilistic programming?

Probabilistic languages can free developers from the complexities of high-performance probabilistic inference.

Probabilistic programming languages are in the spotlight. This is due to the announcement of a new DARPA program to support their fundamental research. But what is probabilistic programming? What can we expect from this research? Will this effort pay off? How long will it take?

A probabilistic programming language is a high-level language that makes it easy for a developer to define probability models and then “solve” these models automatically. These languages incorporate random events as primitives and their runtime environment handles inference. Building a model then becomes a matter of ordinary programming, with a clean separation between modeling and inference. This can vastly reduce the time and effort associated with implementing new models and understanding data. Just as high-level programming languages transformed developer productivity by abstracting away the details of the processor and memory architecture, probabilistic languages promise to free the developer from the complexities of high-performance probabilistic inference.
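To make the separation between modeling and inference concrete, here is a minimal Python sketch. It is not how any particular probabilistic language works internally: the names `lawn_model`, `flip`, and `infer`, and the prior probabilities, are made up for illustration. The model is ordinary code written against a random primitive, and a generic engine that knows nothing about the model computes the posterior by brute-force enumeration.

```python
from fractions import Fraction


# --- Model: ordinary code written against a random primitive, `flip` ---
def lawn_model(flip):
    rain = flip(Fraction(1, 5))        # assumed prior: it rains on 20% of days
    sprinkler = flip(Fraction(1, 10))  # assumed prior: sprinkler runs on 10% of days
    return {"rain": rain, "wet_grass": rain or sprinkler}


# --- Inference: generic, knows nothing about lawns ---
class _NeedMoreChoices(Exception):
    """Raised when the model asks for a flip the current prefix does not cover."""


def infer(model, observed, query):
    """Exact posterior of world[query], given that observed(world) is true.

    Enumerates every sequence of flip outcomes by re-running the model.
    Only feasible for tiny discrete models.
    """
    totals = {}
    pending = [()]  # prefixes of flip outcomes still to explore
    while pending:
        prefix = pending.pop()
        state = {"used": 0, "weight": Fraction(1)}

        def flip(p):
            if state["used"] == len(prefix):
                raise _NeedMoreChoices
            value = prefix[state["used"]]
            state["used"] += 1
            state["weight"] *= p if value else 1 - p
            return value

        try:
            world = model(flip)
        except _NeedMoreChoices:
            # The prefix was too short: branch on the next flip and retry.
            pending.extend([prefix + (True,), prefix + (False,)])
            continue
        if observed(world):
            key = world[query]
            totals[key] = totals.get(key, Fraction(0)) + state["weight"]
    evidence = sum(totals.values())
    return {value: weight / evidence for value, weight in totals.items()}


# Observe wet grass, ask about rain.
posterior = infer(lawn_model, lambda world: world["wet_grass"], "rain")
print(posterior[True])  # 5/7
```

Swapping in a different model requires no change to `infer`, which is the point of the separation. Enumeration blows up exponentially with the number of random choices, so practical engines would need something smarter.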

What does it mean to perform inference automatically? Let’s compare a probabilistic program to a classical simulation such as a climate model. A simulation is a computer program that takes some initial conditions such as historical temperatures, estimates of energy input from the sun, and so on, as an input. Then it uses the programmer’s assumptions about the interactions between these variables that are captured in equations and code to produce forecasts about the climate in the future. Simulations are characterized by the fact that they only run in one direction: forward, from causes to hypothesized effects.

A probabilistic program turns this around. Given a universe of possible interactions between different elements of the climate system and a collection of observed data, we could automatically learn which interactions are most effective in explaining the observations — even if these interactions are quite complex. How does this work? In a nutshell, the probabilistic language’s runtime environment runs the program both forward and backward. It runs forward from causes to effects (data) and backward from the data to the causes. Clever implementations will trade off between these directions to efficiently home in on the most likely explanations for the observations.
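One simple way to picture reasoning "backward" from data to causes is likelihood weighting: guess many candidate causes, then weight each guess by how probable the observed data would be under it. The Python sketch below does this for a toy linear trend with noise. The function names, the uniform prior, and the noise level are all assumptions for illustration, and real inference engines use far more sophisticated strategies.

```python
import math
import random


def simulate(rate, years, rng, noise=0.1):
    """Forward direction: from a cause (a yearly trend) to effects (noisy readings)."""
    return [rate * t + rng.gauss(0.0, noise) for t in range(years)]


def log_likelihood(rate, observations, noise=0.1):
    """Log-probability of the observations under a hypothesized rate.

    Gaussian noise is assumed; the normalizing constant is dropped because
    it is the same for every candidate rate.
    """
    return sum(
        -0.5 * ((y - rate * t) / noise) ** 2 for t, y in enumerate(observations)
    )


def infer_rate(observations, rng, n_samples=20000):
    """Backward direction: from effects to a posterior-mean estimate of the cause."""
    # Assumed prior: the rate is uniform on [-0.1, 0.1].
    candidates = [rng.uniform(-0.1, 0.1) for _ in range(n_samples)]
    log_weights = [log_likelihood(r, observations) for r in candidates]
    top = max(log_weights)  # subtract the maximum for numerical stability
    weights = [math.exp(lw - top) for lw in log_weights]
    return sum(w * r for w, r in zip(weights, candidates)) / sum(weights)


rng = random.Random(0)
true_rate = 0.03  # hypothetical "true" cause used to generate toy data
data = simulate(true_rate, years=30, rng=rng)
estimate = infer_rate(data, rng)
print(f"true rate {true_rate}, estimated rate {estimate:.4f}")
```

Here `simulate` plays the role of the classical simulation, while `infer_rate` recovers the hidden cause from the data alone.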

Better climate models are but one potential application of probabilistic programming. Other potential applications include: shorter and more humane clinical trials with fewer unneeded side effects and more accurate outcomes; machine perception that transcends the capabilities of the now-ubiquitous quadcopters and even Google’s self-driving cars; and “nervous systems” that fuse data from massively distributed and noisy sensor networks to better understand both the natural world and artificial environments.

Of course, any technology this general carries a lot of uncertainty around its development path and eventual impact. So much depends on complex interactions with other technology threads and, ultimately, social factors and regulation. With all possible humility, here is one sample from the predictive distribution, conditioned on what we know so far:

  • Phase I — Probabilistic programming will transform the practice of data science by unifying anecdotal reasoning with more reliable statistical approaches. If data science is first and foremost about telling stories, then probabilistic programming is in many ways the perfect tool. Practitioners will be able to leverage the persuasive power of narrative, while staying on firm quantitative ground.
  • Phase II — Practitioners will really start to push the boundaries of modeling in fundamental ways in order to address many applications that don’t fit well into the current machine learning, text mining, or graph analysis paradigms. Many real-world datasets are a mixture of tabular, relational, textual, geospatial, audiovisual, and other data types. Probabilistic programs can weave all of these pieces together in natural ways. Current solutions that claim to integrate heterogeneous data typically do so by beating it all into a similar form, losing much of the underlying structure along the way.
  • Phase III — Probabilistic programming will push well into territory that is universally recognized as artificial intelligence. As we’re often reminded, intelligent systems are very application-specific. Good chess algorithms are unlike Google’s self-driving car, which is totally different from IBM’s Watson. But probabilistic programs can be layered and modularized, with subsystems that specialize in particular problem domains embedded in a shared fabric that recognizes the current context and brings appropriate modeling subsystems to bear.

What will it take to make all this real? The conceptual underpinnings of probabilistic programming languages are well in hand, thanks to trailblazing work by research groups at MIT, UMass Amherst, Microsoft Research, Harvard, and elsewhere. The core challenge at this point is developing performant inference engines that can efficiently solve the very wide range of models that these languages can express. We’ll also need new debugging, optimization, and visualization tools to help developers get the most from these systems.

This story will take years to play out in full, but I expect we’ll see real progress over the next three to four years. I’m excited.

Want to learn more? BUGS is a probabilistic programming language originally developed by statisticians more than 20 years ago. While it has a number of limitations around expressivity and dataset size, it’s a great way to get your feet wet. Also check out Rob Zinkov’s tutorial post, which includes examples of several models. Church is the most ambitious probabilistic programming language. Don’t miss the tutorials, though it may not be the most accessible or practical option until the inference engine and toolset mature. For that reason, factorie might be a better bet in the short term, especially if you like Scala, as might Microsoft Research’s infer.net, with its C# and F# bindings. The proceedings from a recent academic workshop provide a great snapshot of the field as of late 2012. Finally, this video from a long-defunct startup that I co-founded takes one stab at explaining many of the concepts underlying probabilistic programming, referred to there under the more general term probabilistic computing.


  • @TWiecki points out that Stan ( http://mc-stan.org/ ) is another good place to start exploring probabilistic programming – and it has R bindings ( https://code.google.com/p/stan/wiki/RStanGettingStarted ).

  • Linas Vepstas

    Here in the land of opencog ( http://opencog.org ) we’ve been doing probabilistic programming for a while. Just last month, I wrote a blog entry providing a simple introduction to some foundational concepts: http://blog.opencog.org/2013/03/24/why-hypergraphs/

  • dangleebits

    I’m trying to understand this concept at its highest level, and had to mention that the video does an excellent job.

  • Super old. They were building hardware doing this in the 60s. Some of it deployed in missiles.

  • Very nice post Beau,

    What’s your opinion on introducing a new domain specific language (with its own new syntax, development, compiler and runtime environment) as done in BUGS, JAGS and Stan vs embedding probabilistic programming as a library (optionally with code generation and just in time compiler) of an existing generic programming language as done in FACTORIE with Scala or PyMC3 with Python?

    The latter seems to make it possible to benefit from existing development tools (editor / IDE support for the syntax, documentation tools, debuggers, easy integration with other third party libraries) and also from larger existing developer mind-share, as there is a higher cognitive cost in learning yet another syntax and development environment than in learning a new library (“cultural scalability”).

    • Thomas Wiecki

      I’m obviously biased towards PyMC’s approach, but for me it is about a library vs. a DSL. Tools like PyMC can be very easily extended (adding new probability distributions, new samplers, etc.) and give more control in general. However, they sometimes come at the cost of expressiveness (for reference see a recent discussion on syntax in PyMC3 here: https://github.com/pymc-devs/pymc/issues/189).

      Having also worked with rjags and winbugs, I must say that there is a pretty high cost associated with making two languages work together (R and JAGS, Matlab and winbugs). Getting the data in and traces out requires some black magic (coda, blackbox), and debugging is nearly impossible.

      Your last point about standing on the shoulders of giants definitely rings true. Stan had to hand-code things like the autodiff to compute the gradient. PyMC3 piggy-backs on Theano for all computations and only creates the computing graph.

    • I think that at this early stage, it’s important for research-oriented languages like Church to have the freedom to use exactly the syntax and semantics that are needed. In fact, so much of the elegance and power of Church’s approach stems from the fact that they’ve identified just the right generalization of functional purity that is needed to (elegantly) represent the universe of stochastic processes. (I’m paraphrasing and probably botching, but see http://danroy.org/papers/RoyManGooTen-ICMLNPB-2008.pdf)

      That said, if you just want to get things done by adding some inference to an existing system, the library-based solutions that you mention may well be the best bet.

      My perspective is that this is really early days – none of the existing systems offer all of 1) a clean separation between modeling and inference, 2) high-performance generic inference without developer intervention, and 3) sufficient expressivity to capture all models of interest. The conjunction of these is a high bar, but that’s where I’d like to see things progress.

      What will this look like in 5-10 years when it’s shaken out? Don’t ask me (though I have some guesses and aesthetic preferences) – for now, researchers should explore all of these options, and there will be plenty of time to evaluate and consolidate down the line.

      • zv

        I really do see Church as this ideal of how expressive a probabilistic programming system can be. On the other hand, I think we need to be open to exploring tradeoffs so we can have high performance systems now which we can make more expressive over time. Systems that can be useful now will attract more people than systems that will be useful eventually.

  • What about the Bayesian logics out there?
    – Alchemy http://alchemy.cs.washington.edu/
    – Prism http://sato-www.cs.titech.ac.jp/prism/
    – ProbLog http://dtai.cs.kuleuven.be/problog/
    – BLOG http://people.csail.mit.edu/milch/blog/index.html
    They fit your “tell a story” part as they are declarative/logic programming languages.

    Also there are several Bayesian programming API/libraries like ProBT for C++ http://probayes.com/index.php/en/products/sdk/probt .

  • I didn’t understand it very well: to me this is just evolutionary computing (e.g. genetic algorithms, evolutionary strategies, etc.) used for machine learning (yes, they use probabilistic methods too; machine learning is about getting the rule to explain causes). I can’t see the need for another brand name to relaunch an old concept.

    Besides that, it’s a very interesting topic :)

  • Sean Stromsten

    The Church language implementations are not mature, but the Church-based tutorials Beau links to are a great way–maybe the best way–to get an idea of the aim and scope of the field. That is, if you can get past the conviction that Aristotle/Da Vinci/Tesla did all of this before.

  • nebakhet

    This is pedantry as it doesn’t affect the point made in the article in the slightest, but climate models don’t take historical temperatures as an input to make forecasts. They do take an initial state, but that can be rough.