Building pipelines to facilitate data analysis

A new operator from the magrittr package makes it easier to use R for data analysis.

Construction of the Cedar River Pipeline, 1900

In every data analysis, you have to string together many tools. You need tools for data wrangling, visualisation, and modelling to understand what’s going on in your data. To use these tools effectively, you need to be able to easily flow from one tool to the next, focusing on asking and answering questions of the data, not struggling to jam the output from one function into the format needed for the next. Wouldn’t it be nice if the world worked this way! I spend a lot of my time thinking about this problem, and how to make the process of data analysis as fast, effective, and expressive as possible. Today, I want to show you a new technique that I’m particularly excited about.

R, at its heart, is a functional programming language: you do data analysis in R by composing functions. The problem is that deeply nested function composition makes for hard-to-read code. For example, here’s some R code that wrangles flight delay data from New York City in 2013. What does it do?
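The post’s own example appears in the full article; as a hedged stand-in (assuming the dplyr and nycflights13 packages, which supply the 2013 NYC flight data; this is an illustration, not the post’s actual code), the contrast looks like this:

```r
library(dplyr)
library(nycflights13)  # flights: NYC departures, 2013

# Nested composition: you have to read it inside-out
arrange(
  summarise(
    group_by(filter(flights, !is.na(arr_delay)), dest),
    mean_delay = mean(arr_delay)
  ),
  desc(mean_delay)
)

# The same analysis with magrittr's %>% pipe: read top to bottom
flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(dest) %>%
  summarise(mean_delay = mean(arr_delay)) %>%
  arrange(desc(mean_delay))
```

Each step’s output flows into the next function’s first argument, so the sequence reads as a pipeline rather than an onion of parentheses.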


Ten years of OpenStreetMap

OSM is moving out of its awkward adolescence and into its mature, young adult phase.

Next to GPS, the most significant development in the Open Geo Data movement is OpenStreetMap (OSM), a community-driven mapping project whose goal is to create the most detailed, correct, and current open map of the world. This week, OSM celebrates its 10th birthday, which provides a convenient excuse to highlight why its achievements to date are amazing, unusual, and promising in equal parts.

When Steve Coast began the project in 2004, map data sources were few and largely controlled by a small collection of private and governmental players. That scarcity kept map data expensive and highly restricted, and only the largest navigation companies could afford to use it. Steve changed the rules by creating a wiki-like resource covering the entire globe, which everyone could use without hindrance.


Smarter buildings through data tracking

Buildings are ready to be smart — we just need to collect and monitor the data.

Buildings, like people, can benefit from lessons built up over time. Just as Amazon.com recommends books based on purchasing patterns or doctors recommend behavior change based on what they’ve learned by tracking thousands of people, a service such as Clockworks from KGS Buildings can figure out that a boiler is about to fail based on patterns built up through decades of data.

Screen shot from KGS Clockworks analytics tool

I had the chance to be enlightened about intelligent buildings through a conversation with Nicholas Gayeski, cofounder of KGS Buildings, and Mark Pacelle, an engineer with experience in building controls who has written for O’Reilly about the Internet of Things.


Scaling up data frames

New frameworks for interactive business analysis and advanced analytics fuel the rise in tabular data objects.

The Prison House of Art

Long before the advent of “big data,” analysts were building models using tools like R (and its forerunners, S and S-PLUS). Productivity hinged on tools that made data wrangling, data inspection, and data modeling convenient. Among R users, this meant proficiency with data frames — tabular objects whose columns can mix numeric and categorical data. A data.frame is the data structure consumed by most R analytic libraries.
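As a quick sketch of what that mixing means in practice (hypothetical toy data, not from the post):

```r
# A data.frame is tabular: each column has one type, and types can differ
flights_df <- data.frame(
  carrier   = factor(c("AA", "UA", "AA")),  # categorical column
  dep_delay = c(4, -2, 11)                  # numeric column
)

str(flights_df)              # inspect the column types
mean(flights_df$dep_delay)   # numeric columns support arithmetic
table(flights_df$carrier)    # categorical columns support counting
```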

But not all data scientists use R, nor is R suitable for all data problems. I’ve been watching with interest the growing number of alternative data structures for business analysis and advanced analytics. These new tools are designed to handle much larger data sets and are frequently optimized for specific problems. And they all use idioms familiar to data scientists: either SQL-like expressions or syntax similar to that of R’s data.frame or pandas’ DataFrame.
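To make those two idioms concrete, here is a small hedged sketch in R; it assumes the sqldf package, which runs SQL queries against in-memory data frames:

```r
library(sqldf)  # assumed installed: lets you query data.frames with SQL

flights_df <- data.frame(carrier   = c("AA", "UA", "AA"),
                         dep_delay = c(4, -2, 11))

# The data.frame idiom
aggregate(dep_delay ~ carrier, data = flights_df, FUN = mean)

# The same aggregation as a SQL-like expression
sqldf("SELECT carrier, AVG(dep_delay) AS mean_delay
       FROM flights_df GROUP BY carrier")
```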



Health games platforms mature in preparation for mainstream adoption

Business models and sustainability will drive success in the health games space.

SPARX, a behavioral therapy game for youths, combines a fantasy setting with skills for life.

For the past several years, researchers have strived to create compelling games that improve behavior, reduce stress, or teach healthy responses to difficult life situations. Such health games tend to arise in research settings because their effectiveness must be demonstrated clinically. I have covered such efforts in my postings from the Games for Health conference in 2012 and 2013.

These efforts have borne fruit, and clinical trials have shown the value of many such games. Ben Sawyer, who founded the Games for Health conference more than 10 years ago, is watching all the pieces fall into place for the widespread adoption of games. Business plans, platforms, and the general environment for the acceptance of games (and other health-related apps) are coming together.



Why local state is a fundamental primitive in stream processing

What do you get if you cross a distributed database with a stream processing system?

Texting While Farming, by Ian Sane

One of the concepts that has proven the hardest to explain to people when I talk about Samza is the idea of fault-tolerant local state for stream processing. I think people are so used to the idea of keeping all their data in remote databases that any departure from that seems unusual.

So, I wanted to give a little bit more motivation as to why we think local state is a fundamental primitive in stream processing.

What is state and why do you need it?

An easy way to understand state in stream processing is to think about the kinds of operations you might do in SQL. Imagine running SQL queries against a real-time stream of data. If your SQL query contains only filtering and single-row transformations (a simple select and where clause, say), then it is stateless. That is, you can process a single row at a time without needing to remember anything in between rows. However, if your query involves aggregating many rows (a group by) or joining together data from multiple streams, then it must maintain some state in between rows. If you are grouping data by some field and counting, then the state you maintain would be the counts that have accumulated so far in the window you are processing. If you are joining two streams, the state would be the rows in each stream waiting to find a match in the other stream.
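Samza itself is a JVM system, but the distinction is language-agnostic. Here is a minimal sketch in R (hypothetical row fields, not Samza’s actual API) of a stateless filter next to a stateful grouped count:

```r
# Stateless: a filter looks at one row at a time; nothing is remembered
filter_row <- function(row) {
  if (row$delay_minutes > 60) row else NULL
}

# Stateful: a grouped count must remember accumulated totals between rows
counts <- new.env()  # local state: the counts seen so far in the window
count_row <- function(row) {
  key  <- row$dest
  prev <- if (exists(key, envir = counts)) get(key, envir = counts) else 0
  assign(key, prev + 1, envir = counts)  # update the accumulated count
  get(key, envir = counts)
}

# Simulate a stream arriving one row at a time
stream <- list(list(dest = "SFO", delay_minutes = 75),
               list(dest = "JFK", delay_minutes = 10),
               list(dest = "SFO", delay_minutes = 20))
for (row in stream) print(count_row(row))   # prints 1, 1, 2
```

A stream join needs the same kind of local state: each incoming row is buffered until a matching row arrives on the other stream.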

