Data hand tools

A data task illustrates the importance of simple and flexible tools.

The flowering of data science has both driven, and been driven by, an explosion of powerful tools. R provides a great platform for doing statistical analysis, Hadoop provides a framework for orchestrating large clusters to solve problems in parallel, and many NoSQL databases exist for storing huge amounts of unstructured data. The heavy machinery for serious number crunching includes perennials such as Mathematica, Matlab, and Octave, most of which have been extended for use with large clusters and other big iron.

But these tools haven’t negated the value of much simpler tools; in fact, they’re an essential part of a data scientist’s toolkit. Hilary Mason and Chris Wiggins wrote that “Sed, awk, grep are enough for most small tasks,” and there’s a layer of tools below sed, awk, and grep that are equally useful. Hilary has pointed out the value of exploring data sets with simple tools before proceeding to a more in-depth analysis. The advent of cloud computing, Amazon’s EC2 in particular, also places a premium on fluency with simple command-line tools. In conversation, Mike Driscoll of Metamarkets pointed out the value of basic tools like grep to filter your data before processing it or moving it somewhere else. Tools like grep were designed to do one thing and do it well. Because they’re so simple, they’re also extremely flexible, and can easily be used to build up powerful processing pipelines using nothing but the command line. So while we have an extraordinary wealth of power tools at our disposal, we’ll be the poorer if we forget the basics.

With that in mind, here’s a very simple, and not contrived, task that I needed to accomplish. I’m a ham radio operator. I spent time recently in a contest that involved making contacts with lots of stations all over the world, but particularly in Russia. Russian stations all sent their two-letter oblast abbreviation (equivalent to a US state). I needed to figure out how many oblasts I contacted, along with counting oblasts on particular ham bands. Yes, I have software to do that; and no, it wasn’t working (bad data file, since fixed). So let’s look at how to do this with the simplest of tools.

(Note: Some of the spacing in the associated data was edited to fit on the page. If you copy and paste the data, a few commands that rely on counting spaces won’t work.)

Log entries look like this:

 QSO: 14000 CW 2011-03-19 1229 W1JQ       599 0001  UV5U       599 0041
 QSO: 14000 CW 2011-03-19 1232 W1JQ       599 0002  SO2O       599 0043
 QSO: 21000 CW 2011-03-19 1235 W1JQ       599 0003  RG3K       599 VR
 QSO: 21000 CW 2011-03-19 1235 W1JQ       599 0004  UD3D       599 MO
 ...

Most of the fields are arcane stuff that we won’t need for these exercises. The Russian entries have a two-letter oblast abbreviation at the end; rows that end with a number are contacts with stations outside of Russia. We’ll also use the second field, which identifies a ham radio band (21000 kHz, 14000 kHz, 7000 kHz, 3500 kHz, etc.). So first, let’s strip everything but the Russians with grep and a regular expression:

 $ grep '599 [A-Z][A-Z]' rudx-log.txt | head -2
 QSO: 21000 CW 2011-03-19 1235 W1JQ       599 0003  RG3K       599 VR
 QSO: 21000 CW 2011-03-19 1235 W1JQ       599 0004  UD3D       599 MO

grep may be the most useful tool in the Unix toolchest. Here, I’m just searching for lines that have 599 (which occurs everywhere) followed by a space, followed by two uppercase letters. To deal with mixed case (not necessary here), use grep -i. You can use character classes like [[:upper:]] rather than specifying the range A-Z, but why bother? Regular expressions can become very complex, but simple will often do the job, and be less error-prone.
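
If you’re curious, those variations look like this; on this file they should find exactly the same lines:

 $ grep '599 [[:upper:]][[:upper:]]' rudx-log.txt | head -2   # POSIX character classes
 $ grep -i '599 [a-z][a-z]' rudx-log.txt | head -2            # case-insensitive version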

If you’re familiar with grep, you may be asking why I didn’t use $ to match the end of line, and forget about the 599 noise. Good question. There is some whitespace at the end of the line; we’d have to match that, too. Because this file was created on a Windows machine, instead of just a newline at the end of each line, it has a return and a newline. The $ that grep uses to match the end-of-line only matches a Unix newline. So I did the easiest thing that would work reliably.
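
If you really wanted to anchor the pattern at the end of the line, you could make it swallow the trailing junk instead; something like this should do it, since the [:space:] class matches blanks and the carriage return alike:

 $ grep '599 [A-Z][A-Z][[:space:]]*$' rudx-log.txt | head -2   # [[:space:]] also matches the \r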

The simple head utility is a jewel. If you leave head off of the previous command, you’ll get a long listing scrolling down your screen. That’s rarely useful, especially when you’re building a chain of commands. head gives you the first few lines of output: 10 lines by default, but you can specify the number of lines you want. -2 says “just two lines,” which is enough for us to see that this script is doing what we want.

Next, we need to cut out the junk we don’t want. The easy way to do this is to use colrm (remove columns). That takes two arguments: the first and last column to remove. Column numbering starts with one, so in this case we can use colrm 1 72.

 $ grep '599 [A-Z][A-Z]' rudx-log.txt | colrm 1 72 | head -2
  VR
  MO
 ...

How did I know we wanted column 72? Just a little experimentation; command lines are cheap, especially with command history editing. I should actually use 73, but that additional space won’t hurt, nor will the additional whitespace at the end of each line. Yes, there are better ways to select columns; we’ll see them shortly. Next, we need to sort and find the unique abbreviations. I’m going to use two commands here: sort (which does what you’d expect), and uniq (to remove duplicates).

 $ grep '599 [A-Z][A-Z]' rudx-log.txt | colrm 1 72 | sort | uniq | head -2
  AD
  AL

Sort has a -u option that suppresses duplicates, but for some reason I prefer to keep sort and uniq separate. sort can also be made case-insensitive (-f), can select particular fields (meaning we could eliminate the colrm command, too), can do numeric sorts in addition to lexical sorts, and lots of other things. Personally, I prefer building up long Unix pipes one command at a time to hunting for the right options.
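
If you do want to collapse the two, the -u version looks like this, and should give the same list:

 $ grep '599 [A-Z][A-Z]' rudx-log.txt | colrm 1 72 | sort -u | head -2
  AD
  AL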

Finally, I said I wanted to count the number of oblasts. One of the most useful Unix utilities is a little program called wc: “word count.” That’s what it does. Its output is three numbers: the number of lines, the number of words, and the number of characters it has seen. For many small data projects, that’s really all you need.

 $ grep '599 [A-Z][A-Z]' rudx-log.txt | colrm 1 72 | sort | uniq | wc
       38      38     342

So, 38 unique oblasts. You can say wc -l if you only want to count the lines; sometimes that’s useful. Notice that we no longer need to end the pipeline with head; we want wc to see all the data.
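
Here that’s just the same pipeline with -l on the end, which should print the 38 by itself:

 $ grep '599 [A-Z][A-Z]' rudx-log.txt | colrm 1 72 | sort | uniq | wc -l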

But I said I also wanted to know the number of oblasts on each ham band. That’s the first number (like 21000) in each log entry. So we’re throwing out too much data. We could fix that by adjusting colrm, but I promised a better way to pull out individual columns of data. We’ll use awk in a very simple way:

 $ grep '599 [A-Z][A-Z]' rudx-log.txt | awk '{print $2 " " $11}' |
     sort | uniq
 14000 AD
 14000 AL
 14000 AN
 ...

awk is a very powerful tool; it’s a complete programming language that can do almost any kind of text manipulation. We could do everything we’ve seen so far as an awk program. But rather than use it as a power tool, I’m just using it to pull out the second and eleventh fields from my input. The single quotes are needed around the awk program, to prevent the Unix shell from getting confused. Within awk’s print command, we need to explicitly include the space, otherwise it will run the fields together.

The cut utility is another alternative to colrm and awk. It’s designed for cutting selected portions out of each line of a file. cut isn’t a full programming language, but it can make more complex transformations than simply deleting a range of columns. However, although it’s a simple tool at heart, it can get tricky; I usually find that, when colrm runs out of steam, it’s best to jump all the way to awk.
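
For what it’s worth, here’s cut standing in for colrm; -c keeps a range of character positions rather than removing one, so cut -c 73- should produce the same two-letter column:

 $ grep '599 [A-Z][A-Z]' rudx-log.txt | cut -c 73- | head -2
  VR
  MO

(Its -d/-f field mode is less happy here, because cut treats every one of those padding spaces as a separate field delimiter — exactly the sort of trickiness I mean.)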

We’re still a little short of our goal: how do we count the number of oblasts on each band? At this point, I use a really cheesy solution: another grep, followed by wc:

 $ grep '599 [A-Z][A-Z]' rudx-log.txt | awk '{print $2 " " $11}' |
     sort | uniq | grep 21000 | wc
       20      40     180
 $ grep '599 [A-Z][A-Z]' rudx-log.txt | awk '{print $2 " " $11}' |
     sort | uniq | grep 14000 | wc
       26      52     234
 ...

OK, 20 oblasts on the 21 MHz band, 26 on the 14 MHz band. And at this point, there are two questions you really should be asking. First, why not put grep 21000 first, and save the awk invocation? That’s just how the script developed. You could put the grep first, though you’d still need to strip extra gunk from the file. Second: What if there are gigabytes of data? You have to run this command for each band, and for some other project, you might need to run it dozens or hundreds of times. That’s a valid objection. To solve this problem, you need a more complex awk script (which has associative arrays in which you can save data), or you need a programming language such as perl, python, or ruby. At the same time, we’ve gotten fairly far with our data exploration, using only the simplest of tools.
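
Just to show what that looks like, here’s a sketch of the one-pass awk version; it remembers each band/oblast pair it has seen in an associative array and counts the new ones per band. You should get the 26 and 20 above for 14000 and 21000, plus whatever the other bands turn up:

 $ grep '599 [A-Z][A-Z]' rudx-log.txt |
     awk '{ if (!seen[$2 " " $11]++) count[$2]++ }   # count each band/oblast pair once
          END { for (band in count) print band, count[band] }'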

Now let’s up the ante. Let’s say that there are a number of directories with lots of files in them, including these rudx-log.txt files. Let’s say that these directories are organized by year (2001, 2002, etc.). And let’s say we want to count oblasts across all the years for which we have records. How do we do that?

Here’s where we need find. My first approach is to take the filename (rudx-log.txt) out of the grep command, and replace it with a find command that looks for every file named rudx-log.txt in subdirectories of the current directory:

 $ grep '599 [A-Z][A-Z]' `find . -name rudx-log.txt -print` |
     awk '{print $2 " " $11}' | sort | uniq | grep 14000 | wc
       48      96     432

OK, so 48 oblasts on the 14 MHz band, lifetime. I thought I had done better than that. What’s happening, though? That find command is simply saying “look at the current directory and its subdirectories, find files with the given name, and print their names.” The backquotes tell the Unix shell to use the output of find as arguments to grep. So we’re just giving grep a long list of files, instead of just one. Note the -print option: if it’s not there, find happily does nothing.

We’re almost done, but there are a couple of bits of hair you should worry about. First, if you invoke grep with more than one file on the command line, each line of output begins with the name of the file in which it found a match:

 ...
 ./2008/rudx-log.txt:QSO: 14000 CW 2008-03-15 1526 W1JQ      599 0054  UA6YW         599 AD
 ./2009/rudx-log.txt:QSO: 14000 CW 2009-03-21 1225 W1JQ      599 0015  RG3K          599 VR
 ...

We’re lucky. grep just sticks the filename at the beginning of the line without adding spaces, and we’re using awk to print selected whitespace-separated fields. So the number of any field didn’t change. If we were using colrm, we’d have to fiddle with things to find the right columns. If the filenames had different lengths (reasonably likely, though not possible here), we couldn’t use colrm at all. Fortunately, you can suppress the filename by using grep -h.

The second piece of hair is less common, but potentially more troublesome. If you look at the last command, what we’re doing is giving the grep command a really long list of filenames. How long is long? Can that list get too long? The answers are “we don’t know,” and “maybe.” In the nasty old days, things broke when the command line got longer than a few thousand characters. These days, who knows what’s too long … But we’re doing “big data,” so it’s easy to imagine that list of filenames expanding to hundreds of thousands, even millions of characters. More than that, our single Unix pipeline doesn’t parallelize very well; and if we really have big data, we want to parallelize it.

The answer to this problem is another old Unix utility, xargs. Xargs dates back to the time when it was fairly easy to come up with file lists that were too long. Its job is to break up a long list of arguments into groups and spawn as many separate commands as needed, running them in parallel if you ask it to (-P). We’d use it like this:

 $ find . -name rudx-log.txt -print | xargs grep '599 [A-Z][A-Z]' |
     awk '{print $2 " " $11}' | grep 14000 | sort | uniq | wc
       48      96     432

This command is actually a nice little map-reduce implementation: the xargs command maps grep across the input files (and, with -P, across all the cores on your machine), and the output is reduced (combined) by the awk/sort/uniq chain. xargs has lots of command line options, so if you want to be confused, read the man page.
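
The one most relevant here is -P, which sets how many commands xargs may run at once; -n caps how many filenames each one gets. Something like this (the 10 and 4 are arbitrary choices) should give the same 48/96/432 as before:

 $ find . -name rudx-log.txt -print |
     xargs -n 10 -P 4 grep '599 [A-Z][A-Z]' |
     awk '{print $2 " " $11}' | grep 14000 | sort | uniq | wc   # 4 parallel greps, 10 files apiece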

Another approach is to use find’s -exec option to invoke arbitrary commands. It’s somewhat more flexible than xargs, though in my opinion, find -exec has the sort of overly flexible but confusing syntax that’s surprisingly likely to lead to disaster. (It’s worth noting that the examples for -exec almost always involve automating bulk file deletion. Excuse me, but that’s a recipe for heartache. Take this from the guy who once deleted the business plan, then found that the backups hadn’t been done for about 6 months.) There’s an excellent tutorial for both xargs and find -exec at Softpanorama. I particularly like this tutorial because it emphasizes testing to make sure that your command won’t run amok and do bad things (like deleting the business plan).
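
For completeness, here’s a sketch of the -exec flavor; the {} + ending batches filenames onto each grep much as xargs does (the older {} \; form runs one grep per file), and it should give the same numbers:

 $ find . -name rudx-log.txt -exec grep '599 [A-Z][A-Z]' {} + |
     awk '{print $2 " " $11}' | grep 14000 | sort | uniq | wc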

That’s not all. Back in the dark ages, I wrote a shell script that did a recursive grep through all the subdirectories of the current directory. That’s a good shell programming exercise which I’ll leave to the reader. More to the point, I’ve noticed that there’s now a -R option to grep that makes it recursive. Clever little buggers …
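
With GNU grep, anyway, the whole find apparatus collapses into something like this; --include restricts the recursion to the files we care about:

 $ grep -R --include='rudx-log.txt' '599 [A-Z][A-Z]' . |
     awk '{print $2 " " $11}' | grep 14000 | sort | uniq | wc   # assumes GNU grep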

Before closing, I’d like to touch on a couple of tools that are a bit more exotic, but which should be in your arsenal in case things go wrong. od -c gives a raw dump of every character in your file (-c says to dump characters, rather than octal or hexadecimal). It’s useful if you think your data is corrupted (it happens), or if it has something in it that you didn’t expect (it happens a LOT). od will show you what’s happening; once you know what the problem is, you can fix it. To fix it, you may want to use sed. sed is a cranky old thing: more than a hand tool, but not quite a power tool; sort of an antique treadle-operated drill press. It’s great for editing files on the fly, and doing batch edits. For example, you might use it if NUL characters were scattered through the data.
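
For instance, to strip the Windows carriage returns that spoiled the $ trick earlier, something like this works with GNU sed, which understands \r (with other seds you may need to type a literal control-M instead):

 $ od -c rudx-log.txt | head -2                       # eyeball the raw bytes first
 $ sed 's/\r$//' rudx-log.txt > rudx-log-unix.txt     # strip the trailing CR from each line

After that, anchoring patterns with $ behaves the way you’d expect.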

Finally, a tool I just learned about (thanks, @dataspora): the pipe viewer, pv. It isn’t a standard Unix utility. It comes with some versions of Linux, but the chances are that you’ll have to install it yourself. If you’re a Mac user, it’s in MacPorts. pv tells you what’s happening inside the pipes as the command progresses. Just insert it into a pipe like this:

 $ find . -name rudx-log.txt -print | xargs grep '599 [A-Z][A-Z]' |
     awk '{print $2 " " $11}' | pv | grep 14000 | sort | uniq | wc
 3.41kB 0:00:00 [  20kB/s]
       48      96     432

The pipeline runs normally, but you’ll get some additional output that shows the command’s progress. If something’s malfunctioning or running too slowly, you’ll find out. pv is particularly good when you have huge amounts of data, and you can’t tell whether something has ground to a halt, or you just need to go out for coffee while the command runs to completion.

Whenever you need to work with data, don’t overlook the Unix “hand tools.” Sure, everything I’ve done here could be done with Excel or some other fancy tool like R or Mathematica. Those tools are all great, but if your data lives in the cloud, using them is possible, but painful. Yes, we have remote desktops, but remote desktops across the Internet, even with modern high-speed networking, are far from comfortable. Your problem may be too large to use the hand tools for final analysis, but they’re great for initial explorations. Once you get used to working on the Unix command line, you’ll find that it’s often faster than the alternatives. And the more you use these tools, the more fluent you’ll become.

Oh yeah, that broken data file that would have made this exercise superfluous? Someone emailed it to me after I wrote these scripts. The scripting took less than 10 minutes, start to finish. And, frankly, it was more fun.
