Google's Folding@Home on the "Multi-Core Crisis"

There’s been a fascinating exchange about programming for multi-core computers on Dave Farber’s Interesting People List. It started when Andrew Donoho wrote in a thread about the need for US colleges to retool their programming classes:

It is very clear that the software industry is going to hit a programming wall some time in the next 6 years (4 Moore’s Law generations). Microprocessors are going to progress from 4 to 64 processor cores. Most algorithms, other than the embarrassingly parallel ones, will hit this wall. This wall is not going to be surmounted by ‘code monkeys’. To truly exploit this coming surfeit of processors, we will need both architectural and software invention. This problem will need ‘classically trained’ disciplined thinkers. While I was not trained as a computer scientist (experimental physics was my field), the problem is going to need folks of the caliber that originally defined our core architectures. (The next Johnny Von Neumann, I’m looking for you.) In other words, we are entering an era of unprecedented architectural research and invention opportunity.

Adam Beberg replied:

Sorry Dave, I just can’t let the multi-core “crisis” comment stand
without a reply. That false meme needs to die.

I work on Folding@home, which is a 250K-node distributed system, where
each node is now days a dual-core at minimum, an 8-core Cell on avg, or
at maximum an on chip cluster of SMP elements with 1000’s of threads in
a NUMA setup of vector processors in the new GPUs. Nominally we have a
1M+core system operating as a coordinated whole.

The idea that 64 cores is a problem is patently absurd. Complex on many
levels? Sure. CS not available by the 80’s? No – the 64-node Caltech
Cosmic Cube was in 1981. Wall in sight for multi-core? Nope maybe at 1B
cores, I’ll keep you posted. GPU compilers ready for mainstream? Give
them another year. Courses to learn this stuff? Already out there.

Andrew then replied:

With all due respect to your excellent work on Folding@Home, your algorithms fall into the “embarrassingly parallel” category.

Adam didn’t agree at all:

While on the surface Folding@home appears in the job queue category of
SETI/BOINC projects, it’s not. Under the hood it’s a rather crazy
feedback process with Markov models and each new work unit depending on
the previous ones. If anything this is where the fun CS problems are,
and where the active research is. 10 years ago noone even thought you
could distribute these problems and had to use supercomputers.

The argument continued into all kinds of fabulous detail about the kinds of problems that are or are not amenable to parallel computing. In summing up, Andrew wrote:

the industrial world, which I am from, is hitting the wall of many core programming. Many of our economically important algorithms (value > $1 Billion) do not scale above 16-20 cores.

Adam replied:

Yes we will solve them, but we have to change our algorithms from what
most people are used to and this will take time. The same methods we use
for distributed folding also seem to translate to a wide variety of
other domains, so I see no hard walls on the horizon. I really do hope
we find a wall soon so I can climb it and I’m crossing my fingers for a
surprise at a billion.

The current generation of programmers is learning in a world of
multi-core, and from what I have seen they have zero if any trouble
dealing with it. Once they get some experience, we’ll wonder why this
was ever considered hard.

  • I tend to agree more with Donoho that the problem is currently hard and that most software and tools do not take good advantage of massively parallel architectures. This is particularly true for Enterprise software. I’ve written a brief note about Enterprise software and grid computing on my blog:

  • OK, my comment is intended to be a pot boiler so lots of smart people will educate me. So about ten years ago, I sold a 20 processor database server called a SparcServer 20, mostly running Oracle. I never saw a customer get any benefits past twelve processor in the real world. It was diminishing returns past that. If we are talking real world, we have to be talking about boxes at one location right for the most part? Folding/SETI are pretty unique computer scientist applications. I know TPC benchmarks so extra return past twelve processors, but they weren’t very real world back then. When Intel ships quad core desktops and 8 core, won’t this still be a big programming challenge? Is there a compiler that will “8 way” your code after a recompile?

  • Geoweb, you’re pointing to exactly the problem that Andrew Donoho brings up. Most software today isn’t designed to take advantage of very many cpu’s. Now that cpu’s are becoming multi-core, it only makes that problem worse. If every cpu has 2 cores, your 12 cpu example shrinks to 6. So you are right, this is a big programming challenge.

    It can be overcome, and as I mention in my blog article (link in the first comment post), it is something Enterprise ISV’s need to be thinking about.

    It’s worth noting that the advantages are even more acute for multi-tenant SaaS vendors. Multi-tenancy greatly increases the volumes a single instance has to process. SaaS greatly increases the desirability of scaling at low cost due to the economics involved.

    The challenge has not been cracked in any very generic way–it takes some very clever programming to take advantage of these machine architectures. I absolutely agree that more tools are needed!



  • I disagree with Andrew. In a few years time the tools will exist to empower any code monkey to deal with x number of cores

  • Probably so, Aaron–but who’s going to build them?

  • Adam, the CPU vendors are going to build these tools. Think about operating systems, compilers, frameworks.

    Well not operating systems maybe. Only Sun seems to be in a unique position with Solaris 10 combined with their crazy multi-core CPUs.

    Btw, it is a myth that a smart compiler will suddenly make your code X times faster. Where X is the number of cores. Sure, compilers can do some cool optimizations accross core boundaries now (for example the just released Intel C/C++/Fortran compiler v10 that can do multi-core optimizations) but to REALLY use all those cores the application programmer will need to come up with a proper design that scales on multiple cores.

    A good design is not hard in many cases. It just requires a different way of thinking. SMP used to be special, now everybody has a dual core laptop.

    Just think how much the next generation will have when their FIRST computer will be 16 core :-)

  • Stefan is exacly right–it takes a different kind of thinking, and you can’t just automagically have the compiler deal with it for you.

    In fact, one of the biggest problems with dev tools today is they’re entirely too procedural. The more procedural they are, the harder it is to take advantage of many cpu’s in a scalable way. Languages like Java and C++ are painfully procedural. Ironically, one of the oldest languages available, SQL, can be fairly declarative, and hence one sees products like Oracle for the grid because the SQL sees no difference. The language is designed to work on sets of data, not single elements at a time.

    Programming these new hardware architectures is going to involve working with the old languages “by hand” until a new language takes off. The window for the new language will be opening wider and wider the more pressure there is to take advantage of many cores.

    It’s about time–we’ve had essentially the same languages for a long long time. They’re all object oriented retreads of C with slight upgrades here and there.

  • random8r

    meh. erlang it.

  • dbt

    Procedural is easier to parallelize than object oriented with shared state.

  • Scot

    Maybe I’m being a bit dense, but what applications are out there that could use “large value of X”-way processors that aren’t already “embarrassingly parallel”? My email client is probably not designed to make use of 64 cores, but then it does fine on a single core. Maybe my browser could make use of it, with each tab running on a separate core, but is that really going to require a PhD in Physics to implement?

  • Scot, do I need to remind you of Bill Gates’ famous comment (which he’s since denied) that no one should need more than 64K of RAM? Even if he didn’t say that, here’s a great document I turned up searching for that Gates quote with a newsgroup debate a decade ago in which people were arguing that no personal computer would ever have a gigabyte of RAM.

    Don’t imagine today’s computer applications. Imagine tomorrow’s. That’s what Google and others are doing.

    Here are some possible areas that will require far more complex programming than we use today:

    * Machine vision for autonomous robots
    * Real time language translation
    * Intelligent personal assistants
    * Smart car autopilot
    * Google Adwords on steroids, providing real time personalized advertising not just on web search but on radio, newspapers, television, perhaps even on ambient display surfaces, integrating data from hundreds of new types of “sensors” feeding in real-time data

    Do you think that computer applications are always going to be limited to things you type to on a screen? The personal computer era is ending if not over. We’re in the “64K” stage of the next revolution.

  • Also appropriate to this discussion, even the number of “embarrassingly parallel” applications is going to explode beyond the obvious. Consider this one. Collective intelligence of all kinds is massively parallel, and harnessing collective intelligence is the essence of Web 2.0. In this TED video, Blaise Aguera y Arcas shows how Photosynth stitches together models of the world from millions of separate Flickr images. What other applications can we rethink and redo as we get beyond today’s technology…?

  • hopia

    Well, I’ve been an embedded programmer for 12 years and I’ve been working with designs that are multi-threaded and that simultaneously execute code running from different real-time hardware components. When I first started–fresh from college, then–it took about a few months to get used to all of the concurrent stuff that was going on in my systems. I think it’s just a matter of thinking about your problems differently and coming up with the most effective solution. I don’t think it’s a systemic problem with current programmers in general. Just like any programming skill, it can be learned and the learning curve is not as steep as some of the industry people make it sound.

    Admittedly, the old embedded multi-threading problems I worked with are slightly different from the current issues brought upon by true multi-core chips. And I think the biggest difference is scalability. In my projects, old embedded multi-threading designs usually have a fixed number of specialized “cores” in the system. For example, you have one or two for the image decompressor, one for the general processing, one for the USB controller, etc. However in multi-core environments, the best designs should be able to accommodate a variable number of processors. Writing code for a fixed core environment is different from writing code for an environment you know will use a variable number processors. And I think just being aware of this fact before planning your design is the key difference. But it’s just a matter of putting your designs in a scalable multi-core framework. It’s just like working on any other software design problem. Sure, the architecture will be different, and definitely I won’t assign the junior programmers in my team to work on the designs by themselves, but experienced embedded programmers won’t find it horrendously difficult.

  • geoweb

    The Folding@home application heavily uses hidden markov models, basically heavily simulating complex probabilities to determine hidden values – ie original signals. Speech recognition, handwriting recognition, gesture recognition (Minority Report style computing), synthesizing multiple rss feeds all could make use of HMMs. One can envision dedicating a processor to speech, one to gestures, even one to skin temperature. If your computer senses you are tired, it keeps the answers short. If there is stress in your voice, it gives answers in a calming voice. Remember Hal in 2001? You just have to imagine processors and memory are really cheap. Still need those tools though.

  • Jeff Keasler

    There is alot of software out there that is already bandwidth limited on one core. This means that the data bus is not able to shuttle the information from memory into the processor at a rate fast enough to work on it. As more cores have to share the same data bus, that situation only gets worse. People have already done an extrordinary amount of work to reuse data in cache, so the argument that recoding software for the new era will get around the bandwidth problem is bogus for that category of software.

  • Nes Liber

    You can use 4 core CPUs today: give the operating system, virus scanner, email program and web browser each one core, Voila!
    Your current software will continue to work. It will not get faster automatically by buying the latest hardware but it can be optimized over time if enough interest exists. But frankly the bottleneck today is the hard drive and the network not the CPU. The driving force for faster CPUs at the enterprise are servers (file, web, print, app, database) and at the consumer level computer games. I am confident that developers for those types of applications will adapt.

  • Imanpreet


    Sure, you could run OS, Virus Scanner, Email program and web browser on each of the cores, but that is only true for the multicore architectures akin to the one that are there in market like Intel Dual core or Quad core.

    By Multicore processors which are coming up now, like Cell, Tilera, Stream it is not in the architecture to run multiple Operating systems on each of the cores, rather like in case of Cell there would be a central core, PPE, which would be the only one that is capable of running an Operating System.

  • blah blah

    Parallelism really just depends on what you’re doing with the app. In the case of your Oracle DB, a lot of operations are linear. A call for data must be made, and finished, before the next step in the query can be ran. Granted, a lot of SQL db’s have optimizers that look at one query, and can piece-meal it out to pull various records from this table + this table while another thread matches that table + that table, then they come together to put this table results + that table results together to give the end-user the final result. And, obviously, the more cores you have, the more end-users can be tapping data, meanwhile a core or two can grind away on data optimizations (shuffling data around in the db or storage for better performance).

    In looking at things as simple (and complex) as gaming, look at a typical RPG. Each character in a town has an AI script firing off. If the game has been programmed to allow those scripts to run multi-threaded, then you can farm off the work to multiple cores. The problem arises when you have sequences of events. Just because actor X processes before actor Y, actor X must still have a flag that lets him know he needs ot wait for actor Y to finish some activity before continuing with his own. So, it’s not just as simple as “each actor gets their own core” as some folks have argued in gaming forums. Parts of code can multi-thread / core while others need to know when to wait.

    However, overall, if there is a problem then it will sort itself out. Humans are creative creatures. Someone will find ways to tackle the problem, then others will find ways to optimize it, then others will find ways to make tools that the masses can use without needing a degree in rocket science to implement.

    Multi-core desktops / servers may be new, but cluster computing is not. CS students have been working on this type of stuff for years. It’s just these days we have the privilage of having a computer “cluster” in the palm of our hand or in our desktop.