A new operator from the magrittr package makes it easier to use R for data analysis.
In every data analysis, you have to string together many tools. You need tools for data wrangling, visualisation, and modelling to understand what’s going on in your data. To use these tools effectively, you need to be able to easily flow from one tool to the next, focusing on asking and answering questions of the data, not struggling to jam the output from one function into the format needed for the next. Wouldn’t it be nice if the world worked this way! I spend a lot of my time thinking about this problem, and how to make the process of data analysis as fast, effective, and expressive as possible. Today, I want to show you a new technique that I’m particularly excited about.
R, at its heart, is a functional programming language: you do data analysis in R by composing functions. However, heavy reliance on nested function composition makes for hard-to-read code. For example, here’s some R code that wrangles flight delay data from New York City in 2013. What does it do? Read more…
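The readability problem a pipe operator solves can be sketched outside R as well. The following Python sketch is a hypothetical analogue (the article's actual example uses R and magrittr); the `pipe` helper and the toy wrangling functions are illustrative inventions, not part of any library discussed:

```python
from functools import reduce

def pipe(value, *funcs):
    """Thread a value through a sequence of functions, left to right."""
    return reduce(lambda acc, f: f(acc), funcs, value)

# Toy wrangling steps (hypothetical stand-ins for real data-frame verbs).
delays = [12, -3, 45, 7, -1]

def drop_early(xs):
    # Keep only flights that were actually delayed.
    return [x for x in xs if x > 0]

def top_two(xs):
    # The two largest delays.
    return sorted(xs, reverse=True)[:2]

def mean(xs):
    return sum(xs) / len(xs)

# Nested composition reads inside-out: the first step performed
# (drop_early) is the innermost call, the last (mean) the outermost.
nested = mean(top_two(drop_early(delays)))

# A pipeline reads left to right, in the order the steps happen,
# which is the readability win a pipe operator provides.
piped = pipe(delays, drop_early, top_two, mean)

assert nested == piped == 28.5
```

The two expressions compute the same result; the difference is purely in how the reader's eye has to traverse the code.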
OSM is moving out of its awkward adolescence and into its mature, young adult phase.
Next to GPS, the most significant development in the Open Geo Data movement is OpenStreetMap (OSM), a community-driven mapping project whose goal is to create the most detailed, correct, and current open map of the world. This week, OSM celebrates its 10th birthday, which provides a convenient excuse to highlight why its achievements to date are amazing, unusual, and promising in equal parts.
When Steve Coast began the project in 2004, map data sources were few and largely controlled by a small collection of private and governmental players. The scarcity of map data ensured that it remained expensive and highly restricted; effectively, only the largest navigation companies could use it. Steve changed the rules by creating a wiki-like resource covering the entire globe, which everyone could use without hindrance. Read more…
Buildings are ready to be smart — we just need to collect and monitor the data.
Buildings, like people, can benefit from lessons built up over time. Just as Amazon.com recommends books based on purchasing patterns or doctors recommend behavior change based on what they’ve learned by tracking thousands of people, a service such as Clockworks from KGS Buildings can figure out that a boiler is about to fail based on patterns built up through decades of data.
I had the chance to be enlightened about intelligent buildings through a conversation with Nicholas Gayeski, cofounder of KGS Buildings, and Mark Pacelle, an engineer with experience in building controls who has written for O’Reilly about the Internet of Things. Read more…
New frameworks for interactive business analysis and advanced analytics fuel the rise in tabular data objects.
Long before the advent of “big data,” analysts were building models using tools like R (and its forerunners S/S-PLUS). Productivity hinged on tools that made data wrangling, data inspection, and data modeling convenient. Among R users, this meant proficiency with data frames — objects used to store data matrices that can hold both numeric and categorical data. A data.frame is the data structure consumed by most R analytic libraries.
But not all data scientists use R, nor is R suitable for all data problems. I’ve been watching with interest the growing number of alternative data structures for business analysis and advanced analytics. These new tools are designed to handle much larger data sets and are frequently optimized for specific problems. And they all use idioms that are familiar to data scientists — either SQL-like expressions, or syntax similar to that used for R data frames.
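To make the idea concrete, here is a minimal sketch of what a data frame is conceptually: a table whose columns can mix numeric and categorical data, supporting grouped aggregation. The data and the hand-rolled group-by below are hypothetical illustrations, not taken from the article or any particular library:

```python
# A data frame is, conceptually, a columnar table that can mix
# numeric and categorical columns (hypothetical toy data).
df = {
    "carrier": ["AA", "UA", "AA", "DL"],  # categorical column
    "delay":   [12, 3, 45, 7],            # numeric column
}

# The kind of grouped aggregation analysts write against a data.frame
# (mean delay per carrier), done by hand here to show the mechanics.
groups = {}
for carrier, delay in zip(df["carrier"], df["delay"]):
    groups.setdefault(carrier, []).append(delay)

means = {c: sum(v) / len(v) for c, v in groups.items()}
assert means == {"AA": 28.5, "UA": 3.0, "DL": 7.0}
```

The alternative tools the article alludes to differ mainly in scale and optimization, but they all expose this same tabular, column-typed abstraction.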
Business models and sustainability will drive success in the health games space.
These efforts have borne fruit, and clinical trials have shown the value of many such games. Ben Sawyer, who founded the Games for Health conference more than 10 years ago, is watching all the pieces fall into place for the widespread adoption of games. Business plans, platforms, and the general environment for the acceptance of games (and other health-related apps) are coming together.
What do you get if you cross a distributed database with a stream processing system?
One of the concepts that has proven the hardest to explain to people when I talk about Samza is the idea of fault-tolerant local state for stream processing. I think people are so used to the idea of keeping all their data in remote databases that any departure from that seems unusual.
So, I wanted to give a little bit more motivation as to why we think local state is a fundamental primitive in stream processing.
What is state and why do you need it?
An easy way to understand state in stream processing is to think about the kinds of operations you might do in SQL. Imagine running SQL queries against a real-time stream of data. If your SQL query contains only filtering and single-row transformations (a simple where clause, say), then it is stateless: you can process a single row at a time without needing to remember anything in between rows. However, if your query involves aggregating many rows (a group by) or joining together data from multiple streams, then it must maintain some state in between rows. If you are grouping data by some field and counting, the state you maintain would be the counts that have accumulated so far in the window you are processing. If you are joining two streams, the state would be the rows in each stream waiting to find a match in the other stream.
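The stateless/stateful distinction above can be sketched in a few lines of Python (a hypothetical illustration of the concept, not Samza's actual API; the event data is invented):

```python
from collections import defaultdict

# A toy event stream (hypothetical data).
events = [
    {"user": "alice", "action": "click"},
    {"user": "bob",   "action": "view"},
    {"user": "alice", "action": "click"},
]

# Stateless: a filter (like a SQL WHERE clause) inspects one row at a
# time and remembers nothing between rows.
clicks = (e for e in events if e["action"] == "click")

# Stateful: a GROUP BY ... COUNT must carry running counts between
# rows. This accumulated dictionary is exactly the kind of state a
# stream processor has to keep somewhere -- remotely or locally.
counts = defaultdict(int)
for e in clicks:
    counts[e["user"]] += 1

assert dict(counts) == {"alice": 2}
```

In this sketch the state is tiny, but over a high-volume stream it is touched on every row, which is why keeping it local rather than in a remote database matters so much for throughput.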