"data processing" entries
Exploring the power and sophistication of awk.
I maintain GNU Awk. As part of making releases, I have to create a patch script to convert the file tree of the previous release into the current one. This means writing rm commands to delete any files that have been removed. This is fairly straightforward using tools like
However, for the 4.1.2 release, I also changed the permissions (mode) on some files. I want to create chmod commands to update these files’ permission settings as well. This is a little harder, so I decided to write an awk script that will do this for me.
Let’s take a look at some of the sophistication and control you can achieve using awk, such as recursion, the use of arrays of arrays, and extension functions for using operating system facilities.
The script, comptrees.awk, uses the fts() extension function to do the heavy lifting. This function walks file trees, building up a representation of those trees using gawk’s arrays of arrays.
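The task described above, comparing two release trees and emitting rm and chmod commands, can be sketched in a few lines. This is not the comptrees.awk script and it does not use gawk's fts(). It is a Python illustration under stated assumptions: os.walk() stands in for the tree walk, a plain dictionary stands in for gawk's arrays of arrays, only regular file entries are compared, and the names scan_tree and patch_commands are invented for this sketch.

```python
import os
import shlex
import stat


def scan_tree(root):
    """Map each file's path (relative to root) to its permission bits."""
    modes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            # S_IMODE keeps only the permission bits, e.g. 0o644.
            modes[rel] = stat.S_IMODE(os.lstat(full).st_mode)
    return modes


def patch_commands(old_root, new_root):
    """Return shell commands that update old_root to match new_root.

    Hypothetical helper: emits "rm -f" for files missing from the new
    tree and "chmod" for files whose mode differs. Files that exist
    only in the new tree are ignored here.
    """
    old = scan_tree(old_root)
    new = scan_tree(new_root)
    commands = []
    for rel in sorted(old):
        quoted = shlex.quote(rel)
        if rel not in new:
            commands.append(f"rm -f {quoted}")
        elif old[rel] != new[rel]:
            commands.append(f"chmod {new[rel]:04o} {quoted}")
    return commands
```

Calling patch_commands("gawk-old", "gawk-new") with two unpacked release directories (the directory names are placeholders) returns the command lines as a list of strings, which could then be written into a patch script.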
The Lambda Architecture has its merits, but alternatives are worth exploring.
Nathan Marz wrote a popular blog post describing an idea he called the Lambda Architecture (“How to beat the CAP theorem”). The Lambda Architecture is an approach to building stream processing applications on top of MapReduce and Storm or similar systems. This has proven to be a surprisingly popular idea, with a dedicated website and an upcoming book. Since I’ve been involved in building out the real-time data processing infrastructure at LinkedIn using Kafka and Samza, I often get asked about the Lambda Architecture. I thought I would describe my thoughts and experiences.
What is a Lambda Architecture and how do I become one?
The Lambda Architecture looks something like this:
Wearables can help bridge the gap between batch and real-time communications.
I drown in e-mail, which is a common affliction. With meetings during the day, I need to defer e-mail to breaks between meetings or until the evening, which prevents it from being a real-time communications medium.
Everybody builds a communication “bubble” around themselves, sometimes by design and sometimes by necessity. Robert Reich’s memoir Locked in the Cabinet describes the process of staffing his office and, ultimately, building that bubble. He resists, but eventually succumbs to the necessity of filtering communications when managing such a large organization.
One of the reasons I’m fascinated by wearable technology is that it is one way of bridging the gap between batch and real-time communications. Wearable technology has smaller screens, and many early products use low-power screen technology that lacks the ability to display vibrant colors. Some may view these qualities as drawbacks, but in return, it is possible to display critical information in an easily viewable, and immediate, way.
DataSift lands funding, popping the hood on Google Plus, data products for education
In the latest Strata Week: DataSift's access to the Twitter firehose proves compelling for investors, the inner workings of Google Plus are revealed, and contestants crank out apps for education.