Prepare distribution patches with gawk

Exploring the power and sophistication of awk.

I maintain GNU Awk. As part of making releases, I have to create a patch script to convert the file tree of the previous release into the current one. Part of that job is generating rm commands for any files that no longer exist in the new release. This is fairly straightforward using tools like find, sort, and comm.

However, for the 4.1.2 release, I also changed the permissions (mode) on some files. I wanted the patch script to include chmod commands that update these files’ permission settings as well. This is a little harder, so I decided to write an awk script to generate them for me.

Let’s take a look at some of the sophistication and control you can achieve using awk, such as recursion, the use of arrays of arrays, and extension functions for using operating system facilities.

This script, comptrees.awk, uses the fts() extension function to do the heavy lifting. This function walks file trees, building up a representation of those trees using gawk’s arrays of arrays.

The script then uses an awk function to compare the two trees’ arrays. We start with a #! header and some descriptive comments:

#! /usr/local/bin/gawk -f

# comptrees.awk --- compare two file trees and print commands to synchronize them
#
# Arnold Robbins
# arnold@skeeve.com
# April, 2015

The next statement loads the filefuncs extension, which includes the fts() function:

@load "filefuncs"

The program is run from a BEGIN rule. The first thing to do is check the number of arguments and print an error message if that count is incorrect:

BEGIN {
    # argument count checking
    if (ARGC != 3) {
        print "usage: comptrees dir1 dir2" > "/dev/stderr"
        exit 1
    }

The next step is to remove the program name from ARGV, leaving just the two file names in the array. This lets us pass ARGV directly to fts().

    # remove program name
    delete ARGV[0]
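
To see what ARGV holds before and after the delete, a small standalone sketch can help. The file name showargs.awk and the two operands are made up for illustration; they are not part of comptrees.awk.

```awk
# showargs.awk --- hypothetical demo, run as: gawk -f showargs.awk old new

BEGIN {
    # ARGV[0] is the program name; the operands start at ARGV[1]
    for (i = 0; i < ARGC; i++)
        printf("ARGV[%d] = %s\n", i, ARGV[i])

    delete ARGV[0]

    # ARGC is not changed by the delete; only the element is gone
    printf("ARGC = %d, program name still present: %s\n",
        ARGC, (0 in ARGV) ? "yes" : "no")
}
```

Because the program consists only of a BEGIN rule, awk never tries to open the operands as input files.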

The fts() function walks the trees. The first argument is an array whose element values are the paths to walk. The second is one or more flag values ORed together; in this case symbolic links are not followed. The final argument holds the results as an array of arrays.

    # walk the trees
    fts(ARGV, FTS_PHYSICAL, results)

The top-level indices in the results array are the final components of the full paths. Thus, a simple basename() function strips out the leading path components to get at each subarray. We pass the full names and subarrays into the comparison function, which does the work, and then we’re done:

    # compare them
    compare(ARGV[1], results[basename(ARGV[1])],
        ARGV[2], results[basename(ARGV[2])])
}
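
The fts() call can also be tried on its own. The following is a minimal sketch, assuming a gawk built with the filefuncs extension; walking the current directory and adding the FTS_XDEV flag are choices made for this illustration, not something comptrees.awk does.

```awk
@load "filefuncs"

BEGIN {
    # assumption: walk the current directory; any readable path works
    paths[1] = "."

    # or() combines flag values; FTS_XDEV keeps the walk on one filesystem
    flags = or(FTS_PHYSICAL, FTS_XDEV)

    # fts() returns 0 on success and -1 on failure
    if (fts(paths, flags, results) != 0) {
        print "fts() failed" > "/dev/stderr"
        exit 1
    }

    # one top-level element per path that was walked
    for (name in results)
        printf("top-level index: %s\n", name)
}
```

Checking the return value is optional in a throwaway script, but it makes a mistyped directory name show up immediately rather than as an empty comparison.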

The basename() function returns the final component of the input pathname, using gawk’s gensub() function to do the work:

# basename --- strip out all but the last part of a filename

function basename(path)
{
    return gensub(".*/", "", "g", path)
}
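
A few sample calls show how this behaves, including one edge case: a name given with a trailing slash leaves nothing after the last slash. The basename_trim() variant below is a hypothetical helper added for this sketch, not part of the original script.

```awk
# basename --- as in the script: strip everything up to the last slash
function basename(path)
{
    return gensub(".*/", "", "g", path)
}

# basename_trim --- hypothetical variant that first drops trailing slashes
function basename_trim(path)
{
    if (path ~ /^\/+$/)        # the root directory: nothing to strip
        return "/"
    sub(/\/+$/, "", path)      # path is a local copy, so the caller is unaffected
    return basename(path)
}

BEGIN {
    print basename("/usr/src/gawk-4.1.2")      # gawk-4.1.2
    print basename("gawk-4.1.1")               # gawk-4.1.1
    print "[" basename("gawk-4.1.2/") "]"      # []
    print basename_trim("gawk-4.1.2/")         # gawk-4.1.2
}
```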

The arrays created by fts() are a little bit complicated. See the filefuncs.3am man page in the gawk distribution and the documentation for the details. Basically, directories are represented by arrays where each file is a subarray. Files are arrays with a few special elements, including one named “stat” which is an array with file information such as owner and permissions.

The compare() function has to carefully walk the two arrays representing the trees. The header lists the parameters and the single local variable:

# compare --- compare two trees

function compare(oldname, oldtree, newname, newtree,    i)
{

The function loops over all the elements in oldtree, skipping any of the special ones:

    # loop over all elements in the array
    for (i in oldtree) {
        # skip special elements filled in by fts()
        if (i == "." || i == "stat" || i == "type" ||
            i == "path" || i == "error")
            continue

If an element is itself a directory, compare the directories recursively:

        if ("." in oldtree[i]) { # directory
            # recurse
            compare(oldname "/" i, oldtree[i],
                newname "/" i, newtree[i])
        }
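
The same recursive pattern can be tried in isolation. The sketch below builds a tiny tree by hand, laid out the way the text describes; it is a stand-in for illustration, not real fts() output, and count_files() is a hypothetical helper.

```awk
# count_files --- hypothetical helper: count plain files in a tree
function count_files(tree,    i, n)
{
    n = 0
    for (i in tree) {
        # skip the special elements, as compare() does
        if (i == "." || i == "stat" || i == "path" || i == "error")
            continue
        if ("." in tree[i])                 # subdirectory: recurse
            n += count_files(tree[i])
        else if (tree[i]["stat"]["type"] == "file")
            n++
    }
    return n
}

BEGIN {
    # hand-built stand-in: one directory holding a file and a subdirectory
    tree["."]["path"] = "demo"
    tree["README"]["path"] = "demo/README"
    tree["README"]["stat"]["type"] = "file"
    tree["README"]["stat"]["mode"] = 0100644
    tree["doc"]["."]["path"] = "demo/doc"
    tree["doc"]["gawk.1"]["path"] = "demo/doc/gawk.1"
    tree["doc"]["gawk.1"]["stat"]["type"] = "file"
    tree["doc"]["gawk.1"]["stat"]["mode"] = 0100644

    print count_files(tree)      # 2
}
```

The extra parameters after the wide gap in the function header (i and n) are never passed by callers; as in compare(), they act as local variables, which is what keeps each level of the recursion from clobbering its caller's loop index.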

Next, if the element is not in the new tree, it was removed, so print an rm command:

        else if (! (i in newtree)) {
            # removed file
            printf("rm -fr %s/%s\n", oldname, i)
        }

Finally, if an element is a file and the permissions are different between the old and new trees, print a chmod command. The permission value is ANDed with 0777 to get just the file permissions, since the mode value also contains bits indicating the file type:

        else if (oldtree[i]["stat"]["type"] == "file") {
            if (oldtree[i]["stat"]["mode"] != newtree[i]["stat"]["mode"]) {
                # file permissions change
                printf("chmod %o %s/%s\n",
                    and(newtree[i]["stat"]["mode"], 0777),
                    newname, i)
            }
        }
    }
}
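
The masking step is easy to check by hand. In this short sketch the mode value and the file name are assumed samples, chosen only to show what and() and the %o format produce.

```awk
BEGIN {
    mode = 0100755           # assumed sample: regular file, rwxr-xr-x

    printf("full mode:   %o\n", mode)               # 100755
    printf("permissions: %o\n", and(mode, 0777))    # 755

    # the masked value drops straight into a chmod command
    printf("chmod %o %s\n", and(mode, 0777), "configure")
}
```

The leading digits of the full mode are the file-type bits; masking with 0777 removes them so that chmod sees only the permission bits.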

That’s it! 63 lines of awk that will save me a lot of time as I prepare future gawk releases. I think this script nicely demonstrates the power of the fts() extension function and gawk’s arrays of arrays.


Editor’s note: If your work involves a significant amount of data extraction, reporting, and data-reformatting jobs, you’ll definitely want to check out Arnold Robbins’ Effective awk Programming, 4th Edition.
