Handling Data at a New Particle Accelerator

Unlocking Scientific Data with Python

Most people working on complex software systems have had That Moment, when you throw up your hands and say “If only we could start from scratch!” Generally, it’s not possible. But every now and then, the chance comes along to build a really exciting project from the ground up.

In 2011, I had the chance to participate in just such a project: the acquisition, archiving and database systems which power a brand-new hypervelocity dust accelerator at the University of Colorado.

Origins

In 2004, NASA decided to go back to the Moon.

You’d be forgiven for not noticing; in the end, Bush’s Vision for Space Exploration disappeared in a cloud of politics, expensive hardware, and unrealistic budgeting. But we got a lot of science out in the meantime. For example, we got the Lunar Reconnaissance Orbiter, equipped with cameras capable of imaging the lunar surface at 20 inches per pixel, sending back unprecedented high-resolution photographs. And the groundwork was laid for another spacecraft called LADEE, launched last month and orbiting the Moon right now, that will take the first high-quality measurements of the thin lunar atmosphere and dust environment.

We also have a facility in Colorado.

What’s a Dust Accelerator?

The space between planets is not empty. Among other things, it’s full of tiny little grains of rock, affectionately referred to as “dust”, whizzing around at speeds that are hard to imagine. They’re everywhere and collide with everything: the Earth’s atmosphere, the Moon, asteroids, and innocent spacecraft that just happen to be in the wrong place at the wrong time. About 20 tons of the stuff hits the Earth every day, most of which arrives at a mind-bending 40,000-or-so miles per hour and immediately burns up.

Given that we’re sending satellites and spacecraft into this shooting gallery, not to mention the extremely cool physics that happens when projectiles like this slam into the lunar surface, NASA felt that it was important to have a facility here on Earth able to answer questions about the impact process.

The result is the hypervelocity dust accelerator at the University of Colorado, a machine that literally shoots dust grains at speeds up to 220,000 mph (100 km/s). In collaboration with an extraordinarily experienced sister facility in Heidelberg, Germany, we do a wide range of experiments, and even calibrate sensors which look for dust in space. One of them, called LDEX, is orbiting the Moon right this moment.

Data Requirements and Challenges

The thing about this machine is, it generates a lot of data.

A lot of data.

So much, in fact, that the machine and software layers are engineered to throw away any data that is bad, or even just inconvenient.

The first requirement is that the machine fire dust grains with very well constrained speeds and masses. For example, the speed of a dust grain encountered by a spacecraft at the Moon is about 2 km/s. Particles substantially slower or faster are not useful when calibrating a sensor that looks for such grains. So we shouldn’t even provide grains that don’t meet pre-selected criteria in speed and mass.

Users of the machine also want to know exactly what hit them and when: that means a timestamp, with millisecond accuracy, along with highly precise estimates for the speed and mass of each grain. So the second requirement is that each and every grain we fire should be well characterized in speed, mass, and event time.

The last requirement is that scientists actually be able to get this information out of the database in a useful way. The data must be searchable and exportable in standard formats in order for people to analyze it.

Step One: Active Downselection

What we get from the machine itself is nothing like the well-constrained, well-characterized distribution desired. The raw “beam” of dust particles consists of hundreds or even thousands of particles per second, which are all over the map in speed and mass. We use detectors on the beamline to determine the speed and mass of each and every grain, compare it to the desired speed and mass programmed in by the experimenter, and only admit those particles which meet the criteria.

The process can’t be done reliably with analog circuits, and normal software simply can’t cope with the high rates involved. So the first component of the stack is a Field Programmable Gate Array (FPGA) running a custom program, which throws away every particle that doesn’t fit in the range requested.

In a typical experiment, about 1 particle in every 10,000 is kept.
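The selection logic itself is simple, even though the real thing runs in FPGA hardware. What follows is only a software illustration of the idea, not the accelerator's actual program: the `Particle` and `SelectionWindow` names, the units, and the sample values are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Particle:
    """One beamline measurement (hypothetical fields and units)."""
    speed_km_s: float
    mass_kg: float


@dataclass(frozen=True)
class SelectionWindow:
    """Speed and mass range requested by the experimenter (hypothetical)."""
    min_speed_km_s: float
    max_speed_km_s: float
    min_mass_kg: float
    max_mass_kg: float

    def admits(self, p: Particle) -> bool:
        # A particle is admitted only if BOTH speed and mass fall in range.
        return (
            self.min_speed_km_s <= p.speed_km_s <= self.max_speed_km_s
            and self.min_mass_kg <= p.mass_kg <= self.max_mass_kg
        )


def downselect(particles, window):
    """Return only the particles that fall inside the requested window."""
    return [p for p in particles if window.admits(p)]


# Illustrative values only: a window around 2 km/s, and a tiny made-up "beam".
window = SelectionWindow(1.5, 2.5, 1e-16, 1e-14)
beam = [
    Particle(0.4, 5e-13),   # too slow, too heavy
    Particle(2.1, 3e-15),   # inside the window
    Particle(35.0, 2e-18),  # too fast, too light
    Particle(1.9, 8e-15),   # inside the window
]
admitted = downselect(beam, window)
print(f"admitted {len(admitted)} of {len(beam)} particles")
```

In the real beam the ratio is far harsher than in this toy list, and the decision has to be made for every particle in flight, which is why it lives in hardware rather than in a Python loop.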

Step Two: Analysis and Recording

Here’s where we start to see traditional software layers. Particles that make it through the FPGA downselection process are admitted to the target chamber. Each admitted particle event is stored as a row in a MySQL database along with some very crude estimates of its speed and mass. Raw signal waveforms from the beamline detectors are stored on disk in the standard HDF5 binary format.

By themselves, the detector waveforms are not scientifically useful; they have to be automatically analyzed to produce speed and mass estimates. However, analyzing them in a blocking fashion would slow the system down to the point where it would affect the experiments. So the final component is a batch process that runs asynchronously, computing high-quality estimates of particle speed and mass and updating the appropriate rows of the database.

About half of the events in this step are also thrown away by the analyzer; they’re false positives from the FPGA selection, real particles lost on their way to the end of the beamline, etc. If you’re keeping track, we’re now down to 1 out of every 20,000 particles.

Each event is also associated with a particular user and experiment in MySQL, which makes it easy to search for events corresponding to a particular experimental run, and to support the multiple internal and external clients who use the machine.
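To make the record-then-refine pattern concrete, here is a minimal sketch. It uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs anywhere; the table layout, column names, `status` values, and experiment names are assumptions made for the example, not the facility's real schema. It also marks rejected events rather than deleting them, which is a choice made for this sketch only.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL event database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE events (
        id           INTEGER PRIMARY KEY,
        experiment   TEXT    NOT NULL,
        timestamp_ms INTEGER NOT NULL,
        speed_km_s   REAL,
        mass_kg      REAL,
        status       TEXT    NOT NULL DEFAULT 'crude'
    )
    """
)


def record_event(conn, experiment, timestamp_ms, crude_speed, crude_mass):
    """Store an admitted particle right away, with crude estimates only."""
    cur = conn.execute(
        "INSERT INTO events (experiment, timestamp_ms, speed_km_s, mass_kg) "
        "VALUES (?, ?, ?, ?)",
        (experiment, timestamp_ms, crude_speed, crude_mass),
    )
    conn.commit()
    return cur.lastrowid


def refine_pending(conn, analyze):
    """Batch pass: refine or reject every event still marked 'crude'.

    `analyze` is a hypothetical callable that takes an event id and returns
    (speed, mass), or None when the event turns out to be a false positive.
    """
    pending = conn.execute(
        "SELECT id FROM events WHERE status = 'crude'"
    ).fetchall()
    for (event_id,) in pending:
        result = analyze(event_id)
        if result is None:
            conn.execute(
                "UPDATE events SET status = 'rejected' WHERE id = ?",
                (event_id,),
            )
        else:
            speed, mass = result
            conn.execute(
                "UPDATE events SET speed_km_s = ?, mass_kg = ?, "
                "status = 'refined' WHERE id = ?",
                (speed, mass, event_id),
            )
    conn.commit()
    return len(pending)


# Made-up events from two hypothetical experiments.
a = record_event(conn, "ldex-cal", 1000, 2.0, 1e-15)
b = record_event(conn, "ldex-cal", 1012, 1.8, 2e-15)
c = record_event(conn, "erosion", 1500, 10.0, 1e-17)

# Fake analyzer output: event b is treated as a false positive.
fake_results = {a: (2.04, 1.1e-15), b: None, c: (9.7, 1.2e-17)}
processed = refine_pending(conn, fake_results.get)

# Search for the surviving events of a single experimental run.
rows = conn.execute(
    "SELECT id, speed_km_s, status FROM events "
    "WHERE experiment = ? AND status = 'refined' ORDER BY timestamp_ms",
    ("ldex-cal",),
).fetchall()
print(processed, rows)
```

The point of splitting the work this way is that `record_event` stays cheap enough to keep up with the experiment, while the expensive analysis catches up in the background and quietly improves the rows already on the screen.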

Step Three: Data Visualization and Export

A scatter plot of admitted particles and their speed/mass values is available through a LabVIEW program at the accelerator console. Since the data is already in MySQL/HDF5, a web-based system written in Python is currently under development to allow facility users to remotely search and access their data. (By the way, using Python and HDF5 together is just about the best combination on the planet!)

Because data is streamed into MySQL and continuously updated while the experiment is happening, users can see each particle in their experiment appear live at the console. Such active feedback is critical to performing precise, high-confidence experiments. This is in contrast to the typical process of analyzing data after an experiment has concluded, when the opportunity to collect more data points or refine the process has already passed.

Like other labs, we’ve standardized on HDF5 both as an archival tool for waveforms and also a data export format. Scientists who wish to take advantage of the database system can retrieve all beamline and experimental waveforms packaged in a structured, explorable file together with speed, mass, timestamp and accelerator metadata.

We’re Open for Business

This facility is open to everyone.

Let that sink in for a moment. In addition to our own research efforts, the dust accelerator facility is available to our colleagues in research and industry. Applications include study of erosion processes, particle implantation, materials science, instrument calibration and testing, and a wide array of basic scientific processes. Visit the accelerator home page for more information.
