Supercomputing on the cheap with Parallella

Parallella topview

Packing impressive supercomputing power into a credit-card-sized board running Ubuntu, Adapteva’s $99 ARM-based Parallella system includes the unique Epiphany numerical accelerator, which promises to unleash industrial-strength parallel processing on the desktop at a rock-bottom price. The Massachusetts-based startup recently ran a successfully funded Kickstarter campaign and gained widespread attention, only to run into a few roadblocks along the way. Now, with those setbacks behind it, Adapteva is slated to deliver its first units in mid-December 2013, with volume shipping in the following months.

What makes the Parallella board so exciting is that it breaks new ground: imagine an Open Source Hardware board, powered by just a few Watts of juice, delivering 90 GFLOPS of number crunching. Combine this with the possibility of clustering multiple boards, and suddenly the picture of an exceedingly affordable desktop supercomputer emerges.

This review looks in-depth at a pre-release prototype board (so-called Generation Zero, a development run of 50 units), giving you a pretty complete overview of what the finished board will look like.

The Hardware

The board is structured like a proper supercomputer, with a host side powered by a 667MHz Zynq 7020 ARM A9 System-on-Chip manufactured by Xilinx. This is an interesting chip that pairs a dual-core ARMv7 CPU with a full-fledged programmable logic facility equivalent to an Artix-7 FPGA. The Adapteva team uses some of the FPGA gates to communicate with the Epiphany chip, but I am told there is plenty of room left to add custom FPGA designs, if you have the necessary hardware design skills. This is the same architecture that powers the considerably more expensive Zedboard FPGA development kit now becoming popular in hardware design graduate courses. The number-crunching side is powered by a 600MHz, 16-core Adapteva Epiphany-III numerical accelerator, which is replaced by a 64-core version in more expensive board configurations. One gigabyte of RAM gives the SoC its working memory.
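To put the clock speeds and core counts in perspective, here is a back-of-envelope sketch of theoretical peak throughput. It rests on one assumption that the article does not state: that each Epiphany core can retire one fused multiply-add per cycle, counted as two floating-point operations. The function names are made up for illustration, and real workloads will land well below any peak figure.

```python
# Back-of-envelope peak throughput for the Epiphany accelerator.
# ASSUMPTION (not stated in the article): each core retires one fused
# multiply-add per cycle, counted as 2 floating-point operations.
FLOPS_PER_CYCLE_PER_CORE = 2


def peak_gflops(cores, clock_mhz, flops_per_cycle=FLOPS_PER_CYCLE_PER_CORE):
    """Theoretical peak in GFLOPS for a given core count and clock."""
    return cores * clock_mhz * 1e6 * flops_per_cycle / 1e9


def implied_clock_mhz(gflops, cores, flops_per_cycle=FLOPS_PER_CYCLE_PER_CORE):
    """Clock (in MHz) a chip would need to reach a quoted GFLOPS figure."""
    return gflops * 1e9 / (cores * flops_per_cycle) / 1e6


if __name__ == "__main__":
    # 16-core part at the 600MHz clock of the prototype board.
    print("16 cores @ 600 MHz: %.1f GFLOPS peak" % peak_gflops(16, 600))
    # Clock that the quoted 90 GFLOPS would imply for the 64-core part.
    print("90 GFLOPS on 64 cores implies about %.0f MHz"
          % implied_clock_mhz(90, 64))
```

Under that assumption, the 16-core prototype tops out a little under 20 GFLOPS, and the 90 GFLOPS headline figure belongs to the 64-core configuration running at a somewhat higher clock.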

Taking the tour clockwise from the Ethernet connector on the top-left, we have Gigabit Ethernet, a three-pin header (which may not be in the final production design), the reset button, a micro-USB host port, a micro-HDMI port, a microSD slot (32 GB max), the power connector, and a second micro-USB host port. The connectors are not really the interesting part, but if you have an eye for detail, you may have spotted the 1.7mm barrel-plug power jack, like the ones used on some legacy phone chargers and on the ODROID U2 board. This is an often-flaky connector that degrades with use, and a matching power supply is harder to source than one with the common 2.1mm plug found on Arduino and most other boards. For that reason, the team has indicated that shipping boards will be manufactured with the more common 2.1mm connector.

Parallella underside

On the underside, four connectors provide (clockwise, from the microSD slot) GPIO access, PEC North, PEC South, and power expansion options. The North/South connectors are meant to eventually deliver interconnect routing to multiple boards, bypassing the bandwidth limitations potentially imposed by Gigabit Ethernet.

The board itself is the size of a credit card, and includes the mounting holes that so many other boards seem to have forgone. In this prototype, these are M3, just a wee bit smaller than the more common (at least in the US) 6-32 standoffs.

Growing Pains

I am going to detail my hardware bring-up experience with the Gen-0, as it will highlight some of the ground that the Parallella team has covered in testing the hardware, and entertain the embedded Linux developers among you who, like me, find this eerily familiar. Bear in mind, this is not the experience end-users will encounter when finished hardware starts shipping!

Fan, Mk1

The Gen–0 board needs a fan (more on that later), so the first step was a tape-down with a fan strategically placed. This resolved the micro-barrel connector’s unreliable connection, and the persistent desire of the fan to “walk” my desk. With the three LEDs flashing green, we have a good power connection.

We got powerrrr, Captain

The micro-HDMI connector is not functional in the Gen-0, due to a wiring issue already resolved on newer boards, so we will boot the board blind and intercept its DHCP request off the network. Skipping Wireshark, we leverage the fact that a DHCP request is a broadcast, and use

dhcpdump -i eth0

while sitting on the same switch to intercept the request packet.

DHCP Intercepted
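For readers curious about what dhcpdump is decoding, here is a minimal sketch of the same idea in Python: given the UDP payload of a DHCP message, pull out the client MAC address and the options. The field offsets follow the standard BOOTP/DHCP layout (client hardware address at byte 28, magic cookie at byte 236). The packet in demo_packet() is hand-built with made-up values, not a capture from the board.

```python
DHCP_MAGIC = b"\x63\x82\x53\x63"   # magic cookie that precedes the options
OPT_PAD, OPT_END = 0, 255
OPT_HOSTNAME, OPT_REQUESTED_IP, OPT_MESSAGE_TYPE = 12, 50, 53


def parse_dhcp(payload):
    """Return (client_mac, options) from the UDP payload of a DHCP message."""
    data = bytearray(payload)
    if len(data) < 240 or data[236:240] != bytearray(DHCP_MAGIC):
        raise ValueError("not a DHCP message")
    hlen = min(data[2], 16)                       # hardware address length
    mac = ":".join("%02x" % b for b in data[28:28 + hlen])
    options = {}
    i = 240
    while i < len(data):
        code = data[i]
        if code == OPT_END:
            break
        if code == OPT_PAD:                       # single-byte padding
            i += 1
            continue
        if i + 1 >= len(data):                    # truncated option header
            break
        length = data[i + 1]
        options[code] = bytes(data[i + 2:i + 2 + length])
        i += 2 + length
    return mac, options


def demo_packet():
    """Hand-built DHCPREQUEST with made-up values, for illustration only."""
    header = bytearray(236)
    header[0:4] = b"\x01\x01\x06\x00"             # request, Ethernet, hlen=6
    header[28:34] = b"\x02\x00\x00\xaa\xbb\xcc"   # hypothetical MAC address
    options = (b"\x35\x01\x03"                    # option 53: DHCPREQUEST
               + b"\x0c\x06linaro"                # option 12: made-up hostname
               + b"\x32\x04\xc0\xa8\x01\x2a"      # option 50: requested IP
               + b"\xff")                         # end of options
    return bytes(header) + DHCP_MAGIC + options


if __name__ == "__main__":
    mac, opts = parse_dhcp(demo_packet())
    ip = ".".join(str(b) for b in bytearray(opts[OPT_REQUESTED_IP]))
    print("client %s asks for %s" % (mac, ip))
```

To use this on live traffic you would feed it payloads received on UDP port 67, which normally requires root privileges. For a one-off board discovery, dhcpdump remains the simpler tool.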

All these finicky details pertain to the first 50 Gen–0 boards, which are literally prototypes. The 6,000+ Kickstarter backers of the production run will see nothing of this.

Beyond this point, the board is remarkably stable, even in this early prototype form—I mounted it on my most silent Antec test stand, and clocked an impressive 17 days of uninterrupted uptime with it.

development setup

Heat and Power

I contributed a bit of thermal testing for the Gen-0 to the Parallella community, which should be an interesting point of comparison when the final boards are released. The aim is to have the system run fanless in the production design, and this looks quite possible, as the infrared thermal-camera images show that the greatest source of heat is the power ICs, not the Epiphany or Zynq chips. The team has confirmed that the Gen-0 has a less-than-optimal power IC configuration, as well as some heatsink issues that are being addressed before mass production.

heat and power

It will be interesting to compare the progress with the Gen–1 when it ships—the team has demonstrated a fanless Gen–1 configuration in operation just a few weeks ago, so this seems to be well underway.

Similarly, the aim is to power the production board off USB, at under 5W. When this is accomplished, the 2+64 core version will deliver a remarkable 90 GFLOPS within that budget. The Gen-0 prototype hovers between 7 and 9.2 Watts in my stress testing, placing it outside of what USB can power, but if you look carefully at the power stand in the YouTube video I linked above, you can see the Gen-1 board prototype is powering up not only fanless, but drawing less than 5W. The team had already demonstrated during the summer 756 cores running as a single cluster, with 42 boards in a power envelope under 500W. That was an accomplishment in its own right if compared with other number-crunching hardware, but the 5W-per-board milestone will make the board not only easier to use, but also easier to power with many highly original power form factors sized for USB current, from LiPo batteries to solar panels… and all the way to lemon batteries if you are feeling creative.
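The figures in this section are easy to cross-check with a little arithmetic. The sketch below only uses numbers quoted above; the constant and function names are my own, and the cluster wattage is an upper bound ("under 500W"), so the per-board result is a ceiling rather than a measurement.

```python
# Figures quoted in the article; nothing here is measured by this script.
CLUSTER_BOARDS = 42
CLUSTER_CORES = 756
CLUSTER_WATTS_MAX = 500.0      # "under 500W", so an upper bound
GEN0_WATTS = (7.0, 9.2)        # Gen-0 range seen in stress testing
TARGET_WATTS = 5.0             # production goal per board
TARGET_GFLOPS = 90.0           # quoted for the 2+64 core board


def cores_per_board(total_cores, boards):
    """Cores per board; raises if the totals do not divide evenly."""
    per_board, leftover = divmod(total_cores, boards)
    if leftover:
        raise ValueError("core count is not a multiple of board count")
    return per_board


def watts_per_board(cluster_watts, boards):
    """Average power per board for a cluster-wide power figure."""
    return float(cluster_watts) / boards


def gflops_per_watt(gflops, watts):
    """Energy efficiency for a quoted throughput and power draw."""
    return float(gflops) / watts


if __name__ == "__main__":
    print("cores per board: %d"
          % cores_per_board(CLUSTER_CORES, CLUSTER_BOARDS))
    print("cluster ceiling: %.1f W per board"
          % watts_per_board(CLUSTER_WATTS_MAX, CLUSTER_BOARDS))
    print("target efficiency: %.0f GFLOPS/W"
          % gflops_per_watt(TARGET_GFLOPS, TARGET_WATTS))
    low, high = GEN0_WATTS
    print("Gen-0 over the 5W target by %.1f to %.1f W"
          % (low - TARGET_WATTS, high - TARGET_WATTS))
```

The 756-core figure works out to 18 cores per board, which matches the 16 Epiphany cores plus the two ARM cores on each board.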

The Software

Linux parallella-01 3.6.0-xilinx-dirty #18 SMP PREEMPT Thu May 9 10:35:46 EDT 2013 armv7l armv7l armv7l GNU/Linux

Following uname’s greeting, we look at /proc/cpuinfo for a bit more detail:

Processor : ARMv7 Processor rev 0 (v7l)
processor : 0
BogoMIPS : 418.02

processor : 1
BogoMIPS : 418.02

Features : swp half thumb fastmult vfp edsp neon vfpv3 tls
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x3
CPU part : 0xc09
CPU revision : 0

Hardware : Xilinx Zynq Platform
Revision : 0000
Serial : 0000000000000000

Out of the box, this build of the board came with the Linaro 12.10 Ubuntu derivative, so it is extremely standard as Linux ARM distributions go. The kernel is 3.6.0 with additional patches, and the system runs the SSH daemon out of the box. The process tree shows a few more background tasks, and the default login is a comfortably predictable user: linaro, password: linaro. Note that Avahi is installed, giving us another way to discover the board in headless mode. The Linaro image needs some minor tailoring that one would expect to be addressed in the final release, or quickly contributed by appropriately enthusiastic community members: things like setting a default hostname, and removing unnecessary daemons like CUPS or modem-manager from the startup set.

init-+-NetworkManager-+-dhclient
     |                |-dnsmasq
     |                `-2*[{NetworkManager}]
     |-avahi-daemon---avahi-daemon
     |-console-kit-dae---64*[{console-kit-dae}]
     |-cron
     |-cupsd
     |-dbus-daemon
     |-5*[getty]
     |-landscape-clien-+-landscape-broke---2*[{landscape-broke}]
     |                 |-landscape-manag
     |                 `-landscape-monit---landscape-packa---{landscape-packa}
     |-2*[login---bash]
     |-modem-manager
     |-ntpd
     |-polkitd---{polkitd}
     |-rsyslogd---3*[{rsyslogd}]
     |-sshd---sshd---sshd---bash---pstree
     |-udevd---2*[udevd]
     |-upstart-socket-
     |-upstart-udev-br
     `-whoopsie---{whoopsie}

apt-get has access to the full Linaro 12.10 repositories, so the choice of software is pretty much infinite. I tried a few simple program installs, including my full performance testing toolchain (stress, vmtouch, etc.), then jumped directly to a complete app by installing the Landscape client for ARM (full disclosure: this is $DAYJOB related) to give the stack a more thorough shakedown, and I was pleased to see that everything worked as advertised.

Landscape managing the Parallella

The board comes loaded with GCC 4.6, Perl 5.14.2, and Python 2.7, and at these performance levels, cross-compiling is not really necessary. Man pages are also installed by default, which helps when hacking directly on the device. Of course, the really interesting bit is programming the numerical accelerator, binaries for which are created with a special fork of GCC called e-gcc. The device comes with a few examples, stored under the default user’s home directory. One example, entitled matmul-16, compares the performance of the ARM CPU (2 cores) with that of the Epiphany accelerator (16 cores). In my testing, I have seen the 16-core matrix multiplication outpace the general-purpose CPU by more than 10X (about 10.6X in the run below) for this “embarrassingly parallel” type of problem.

Epiphany  -  time:          166.8 msec      (@ 600 MHz)
Host      -  time:          1761.6 msec     (@ 667 MHz)
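The speedup can be read straight off those two lines. As a small sketch, the snippet below parses the timing output and computes the ratio; the regular expression is written against the exact format shown above and is an assumption about what matmul-16 prints in general.

```python
import re

# The two timing lines exactly as printed by the example run above.
SAMPLE = """\
Epiphany  -  time:          166.8 msec      (@ 600 MHz)
Host      -  time:          1761.6 msec     (@ 667 MHz)
"""

# name, elapsed milliseconds, clock in MHz
LINE = re.compile(
    r"^(\w+)\s+-\s+time:\s+([\d.]+)\s+msec\s+\(@\s+(\d+)\s+MHz\)", re.M)


def parse_timings(text):
    """Map each side ("Epiphany", "Host") to (milliseconds, clock_mhz)."""
    return {name: (float(ms), int(mhz))
            for name, ms, mhz in LINE.findall(text)}


def speedup(timings, fast="Epiphany", slow="Host"):
    """Wall-clock ratio of the slow side to the fast side."""
    return timings[slow][0] / timings[fast][0]


if __name__ == "__main__":
    timings = parse_timings(SAMPLE)
    print("wall-clock speedup: %.2fx" % speedup(timings))
```

Note that this is a wall-clock comparison. The accelerator also runs at a slightly lower clock than the host, so the gap measured in clock cycles is a little wider still.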

Each example has the same structure, including the following files:

COPYING     README      build.sh        device/     host/       run.sh

COPYING informs us of the GPL v3 status of the code, while README describes the example in generic terms. The interesting bits are in the build script, which shows how both the host side of the program and the accelerated numerical kernel (“device side” in Parallella parlance) are compiled. In the simplest example provided by Adapteva, the Hello World program, these steps generate the binary executables and put them in the right format for use:

# Build HOST side application
gcc src/hello_world.c -o Debug/hello_world.elf -I ${EINCS} -L ${ELIBS} -le-hal

# Build DEVICE side program
e-gcc -T ${ELDF} src/e_hello_world.c -o Debug/e_hello_world.elf -le-lib

# Convert ebinary to SREC file
e-objcopy --srec-forceS3 --output-target srec Debug/e_hello_world.elf Debug/e_hello_world.srec

The host-side ELF executable is then paired with the device-side numerical kernel (the SREC file) at execution time:

sudo -E LD_LIBRARY_PATH=${ELIBS} EPIPHANY_HDF=${EHDF} ./hello_world.elf
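To see the whole flow in one place, here is a hedged sketch that assembles the same build and run commands as argument lists. It is a dry run by default and only prints the commands; the function names are hypothetical, and the EINCS, ELIBS, ELDF, and EHDF values are taken from the environment exactly as in the scripts above, with no assumption about where the Epiphany SDK actually lives.

```python
import os


def build_steps(eincs, elibs, eldf, name="hello_world"):
    """Return the three build commands from build.sh as argv lists."""
    host_elf = "Debug/%s.elf" % name
    dev_elf = "Debug/e_%s.elf" % name
    dev_srec = "Debug/e_%s.srec" % name
    return [
        # Host side: plain gcc, linked against the Epiphany host library.
        ["gcc", "src/%s.c" % name, "-o", host_elf,
         "-I", eincs, "-L", elibs, "-le-hal"],
        # Device side: e-gcc with the linker description file.
        ["e-gcc", "-T", eldf, "src/e_%s.c" % name, "-o", dev_elf, "-le-lib"],
        # Convert the device ELF to SREC so the host program can load it.
        ["e-objcopy", "--srec-forceS3", "--output-target", "srec",
         dev_elf, dev_srec],
    ]


def run_step(elibs, ehdf, name="hello_world"):
    """Return the run command from run.sh as an argv list."""
    return ["sudo", "-E", "LD_LIBRARY_PATH=%s" % elibs,
            "EPIPHANY_HDF=%s" % ehdf, "./%s.elf" % name]


if __name__ == "__main__":
    # Unset variables are shown as shell-style placeholders.
    env = dict((k, os.environ.get(k, "${%s}" % k))
               for k in ("EINCS", "ELIBS", "ELDF", "EHDF"))
    for cmd in build_steps(env["EINCS"], env["ELIBS"], env["ELDF"]):
        print(" ".join(cmd))
    print(" ".join(run_step(env["ELIBS"], env["EHDF"])))
```

On a board with the toolchain installed, each list could be passed to subprocess.check_call to execute it for real. Keeping the commands as data makes it easy to reuse the same steps for the other examples by changing the name argument.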

Conclusion

With the entire Ubuntu software archive available in Linaro’s package repository, any F/OSS package is but one apt-get call away, opening up possibilities for the board as a general-purpose automation/embedded system. Many loyal Ubuntu users have asked for an ARMv6 port to boot the Raspberry Pi, and the Parallella could stand alongside the BeagleBone as a valid alternative to the Raspberry Pi for developers in that audience. The key difference in this comparison is that, despite being that much more powerful, the Parallella does not have an on-board GPU to accelerate HDMI video output. While much less brawny a number cruncher, the Raspberry Pi remains better suited to multimedia applications where its on-board GPU can offload rendering tasks.

Rev 3

Just last week, the Parallella team unveiled the board’s Gen-1.1 design, and I got to borrow the new circuit board for a quick inspection. Alongside some fixes we already detailed, the addition of more vias into the ground plane to draw heat away from the chips stands out. Additionally, there are a few changes in the position of headers, which now include (top to bottom) a serial connection, a jumper to disable the Epiphany, and a power header. The serial connection brings invaluable help when debugging, and the jumper disabling the Epiphany promises to enable lower-power operation for those using the board for automation. Finally, the power header is there to enable convenient assembly of very large numbers of boards without requiring as many individual power sources.

I can go on indefinitely where a new piece of hardware is concerned, especially one as original as this! You are welcome to ask questions in the comments or on Twitter (reach me at @0xf2), and I will do my best to answer. Once I get my hands on my final release board from the Kickstarter, expect a follow-on tutorial on Parallella programming to hit these screens. In the meantime, the key take-away is that this board promises to blow open the doors to low-power, on-demand supercomputing, putting on the table a real, low-cost alternative to current number crunching powerhouses costing thousands of dollars and requiring hundreds of watts to power. Watching a small Open Hardware startup’s device tread on the turf of industry behemoths like Intel and NVIDIA will be no less interesting, and their Kickstarter page is promising exciting announcements in the coming month.
