Can Data Flow Help Us Escape the von Neumann Machine?

Untangling code with flow-based programming

About a year ago I was struck by George Dyson's plea in his Strata London keynote:

That’s why we live in this world where we follow this one particular [von Neumann] architecture and all the alternatives were squashed… Turing gave us this very powerful one-dimensional model, von Neumann made it into this two-dimensional address matrix, and why are we still stuck in that world? We’re fully capable of moving on to the next generation… that becomes fully three-dimensional. Why stay in this von Neumann matrix?

Dyson suggested a more biologically inspired, template-based approach, but I wasn't sure at the time that we were as far from three dimensions as Dyson thought. Distributed computing with separate memory spaces can already offer an additional dimension, though most of us are not normally great at using it. (I suspect Dyson would disagree with my interpretation.)

Companies that specialize in scaling horizontally—Google, Facebook, and many others—already seem to have multiple dimensions running more or less smoothly. While we tend to think of that work as only applying to specialized cases involving many thousands of simultaneous users, that extra dimension can help make computing more efficient at practically any scale above a single processor core.

Unfortunately, we’ve trained ourselves very well to the von Neumann model—a flow of instructions through a processor working on a shared address space. There are many variations in pipelines, protections for memory, and so on, but we’ve centered our programming models on creating processes that communicate with each other. The program is the center of our computing universe because it must handle all of these manipulations directly.

Last night, however, as I was exploring J. Paul Morrison’s Flow-Based Programming, I ran into this:

Today we see that the problem is actually built right into the fundamental design principles of our basic computing engine….

The von Neumann machine is perfectly adapted to the kind of mathematical or algorithmic needs for which it was developed: tide tables, ballistics calculations, etc., but business applications are rather different in nature….

Business programming works with data and concentrates on how this data is transformed, combined, and separated…. Broadly speaking, whereas the conventional approaches to programming (referred to as “control flow”) start with process and view data as secondary, business applications are usually designed starting with data and viewing processes as secondary—processes are just the way data is created, manipulated, and destroyed. We often call this approach “data flow.” (21)

Of course, Morrison is cheering on the data flow approach, so he talks about the tangles flow-based programming can avoid:

In any factory, many processes are going on at the same time, and synchronization is only necessary at the level of an individual work item. In conventional [control flow] programming, we have to know exactly when events take place, otherwise things are not going to work right. This is largely because of the way the storage of today’s computers works—if data is not processed in exactly the right sequence, we will get wrong results, and may not even be aware that it has happened! There is no flexibility or adaptability.

In our [data flow] factory image, on the other hand, we don’t really care if one machine runs before or after another, as long as processes are applied to a given work item in the right order. For instance, a bottle must be filled before it is capped, but this does not mean that all the bottles must be filled before any of them can be capped….

In programming, it means that code steps have to be forced into a single sequence which is extremely difficult for humans to visualize correctly, because of a mistaken belief that the machine requires it. It doesn’t! (24)
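
Morrison's bottling-line analogy maps fairly directly onto channel-style concurrency. Here's a minimal sketch in Go (my own illustration, not Morrison's notation or any flow-based programming library; the bottle type and stage names are invented for the example): each stage runs as its own process, and the only ordering the wiring enforces is per work item, so a bottle is filled before it is capped, but bottles don't wait for each other.

```go
// A rough sketch of the bottling-line idea: independent stages connected
// by channels, with ordering guaranteed only per bottle, not globally.
package main

import "fmt"

type bottle struct {
	id     int
	filled bool
	capped bool
}

// fill receives empty bottles, fills them, and passes them downstream.
func fill(in <-chan bottle, out chan<- bottle) {
	for b := range in {
		b.filled = true
		out <- b
	}
	close(out)
}

// capStage receives filled bottles and caps them.
func capStage(in <-chan bottle, out chan<- bottle) {
	for b := range in {
		b.capped = true
		out <- b
	}
	close(out)
}

func main() {
	empties := make(chan bottle)
	filled := make(chan bottle)
	done := make(chan bottle)

	go fill(empties, filled)
	go capStage(filled, done)

	// Feed a few bottles into the line, then signal that no more are coming.
	go func() {
		for i := 1; i <= 5; i++ {
			empties <- bottle{id: i}
		}
		close(empties)
	}()

	// Bottle 1 may be capped while bottle 3 is still being filled; the
	// per-bottle order (fill, then cap) is all the wiring enforces.
	for b := range done {
		fmt.Printf("bottle %d: filled=%v capped=%v\n", b.id, b.filled, b.capped)
	}
}
```

Nothing here forces all the filling to finish before any capping starts; the stages overlap on their own, which is exactly the flexibility Morrison says the single-sequence habit throws away.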

To step back a bit, the von Neumann machine works very, very well for some kinds of tasks, but doing a series of data processing steps in a single memory space piles on risk as the complexity of the task increases. Steps to defend against that risk, notably regimentation of tasks and encapsulation of data, can ease some of it, but they change the process in ways that may not be efficient. Morrison goes on to discuss the challenges of memory garbage collection in a world of long-lived processes that create and discard handles to data. He contrasts them with processes that operate on a particular chunk of information and then explicitly close up shop by disposing of their “information packets” at the end.
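
To make that contrast concrete, here is a loose analogy in Go (again my own invention, not Morrison's design; the packet type and pool are hypothetical stand-ins): each call takes ownership of one "information packet," works on it, and explicitly hands it back when finished, so nothing long-lived accumulates references for a garbage collector to chase later.

```go
// A loose illustration of explicit packet disposal: a stage owns its
// packet only while working on it, then returns it to a pool when done.
package main

import (
	"fmt"
	"sync"
)

// packet is a stand-in for Morrison's "information packet" (IP).
type packet struct {
	payload string
}

var pool = sync.Pool{New: func() any { return new(packet) }}

// process takes ownership of one packet, uses it, and disposes of it;
// it keeps no reference afterward, so nothing piles up between items.
func process(raw string) {
	p := pool.Get().(*packet)
	p.payload = raw
	fmt.Println("handled:", p.payload)
	p.payload = "" // clear before returning to the pool
	pool.Put(p)    // explicit disposal, in the spirit of dropping an IP
}

func main() {
	for _, s := range []string{"order-1", "order-2", "order-3"} {
		process(s)
	}
}
```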

I may be too optimistic. Between my time in Erlang, where processes that handle messages and then start over with a clean slate are normal, and my time in markup, which offers tools to define message formats independently of control flows, I see Morrison as presenting a much-needed untangling.

Whether that untangling is a complete answer to Dyson's question, I'm not certain. I suspect, however, that the flow-based programming approach answers a lot of questions about distributed and parallel processing, even before we get to its potential for designing code through graphical interfaces.
