Strata Gems: The timeless utility of sed and awk

A little command line knowledge goes a long way

We’re publishing a new Strata Gem each day all the way through to December 24. Yesterday’s Gem: Where to find data.

Edison famously said that genius is 1% inspiration and 99% perspiration. Much the same can be said for data analysis. The business of obtaining, cleaning and loading the data often takes the lion’s share of the effort.

Now over 30 years old, the UNIX command line utilities sed and awk are useful tools for cleaning up and manipulating data. In their Taxonomy of Data Science, Hilary Mason and Chris Wiggins note that when cleaning data, “Sed, awk, grep are enough for most small tasks, and using either Perl or Python should be good enough for the rest.” A little aptitude with command line tools can go a long way.

sed is a stream editor: it reads its input one line at a time and applies your editing commands to each line as it goes. You can think of sed as a way to batch up a series of search and replace operations that you might otherwise perform by hand in a text editor. For instance, this command will replace all instances of “foo” with “bar” within a file:
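A minimal sketch of such a substitution follows. The file name sample.txt and its contents are assumptions made up for illustration; any text file would do.

```shell
# sample.txt is a hypothetical file, created here only for illustration
printf 'foo one\nfoo and foo again\n' > sample.txt

# s/foo/bar/ substitutes "bar" for "foo"; the trailing g replaces every
# match on a line rather than just the first one.
# The result goes to standard output; sample.txt itself is not modified.
sed 's/foo/bar/g' sample.txt
```

Redirect the output to a new file to keep the result. Note that the pattern matches anywhere in a line, so a word such as “food” would also be changed.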

Anybody who has used regular expressions within a text editor or programming language will find sed easy to grasp. Awk takes a little more getting used to. It is record-oriented: it treats each line of input as a record made up of fields, which makes it the right choice when your data contains delimited fields that you want to manipulate.

Consider this list of names, which we’ll imagine lives in the file presidents.txt.
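As a stand-in for that list, the sketch below creates presidents.txt with five sample names. The names, and the layout of one first name and one last name per line separated by a space, are assumptions chosen for illustration.

```shell
# Sample data, assumed for illustration: one "First Last" name per line
cat > presidents.txt <<'EOF'
George Washington
John Adams
Thomas Jefferson
James Madison
James Monroe
EOF
```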

To extract just the first names, we can use the following command:
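One way to write that command is sketched here. It assumes presidents.txt holds one space-separated “First Last” name per line; the sample names are illustrative and are written out only if the file does not already exist.

```shell
# Create the assumed sample file if it is not already present
[ -f presidents.txt ] || printf '%s\n' 'George Washington' 'John Adams' \
  'Thomas Jefferson' 'James Madison' 'James Monroe' > presidents.txt

# awk splits each line into fields on whitespace by default;
# $1 is the first field, so this prints the first name on every line.
awk '{ print $1 }' presidents.txt
```

Changing $1 to $2 would print the last names instead, and the -F option sets a different field separator, such as a comma for CSV-style data.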

Or, to just find those records with “James” as the first name:
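A sketch of that filter, under the same assumptions about presidents.txt (illustrative sample names, one space-separated “First Last” name per line):

```shell
# Create the assumed sample file if it is not already present
[ -f presidents.txt ] || printf '%s\n' 'George Washington' 'John Adams' \
  'Thomas Jefferson' 'James Madison' 'James Monroe' > presidents.txt

# A pattern with no action prints every record the pattern matches.
# Here the pattern is an exact string comparison on the first field.
awk '$1 == "James"' presidents.txt
```

With the sample data this prints the Madison and Monroe lines. A regular expression match, written as $1 ~ /James/, would work too, though it would also match longer first names that contain “James”.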

Awk can do a lot more, and features programming concepts such as variables, conditionals and loops. But just a basic grasp of how to match and extract fields will get you far.

For more information, attend the Strata Data Bootcamp, where Hilary Mason is an instructor, or read sed & awk.
