Entities move data from tables to knowledge

Cleaning and combining fields can turn messy data into actionable insight.

We often talk in business and computing about moving from “raw data” to “knowledge,” hoping to take useful actions based on the data our organization has collected over time. Before you can view trends in your data or do other analytics, you need tools for cleaning data and for combining multiple data sources into meaningful collections of information, known as entities. An entity may be a customer, a product, a point of sale, an incident being investigated by the police, or anything else around which you want to build meaningful context.
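
To make that concrete, here is a minimal sketch of the kind of cleaning-and-combining step involved. It uses pandas and two hypothetical source tables (a CRM export and an orders feed) with made-up column names: the join fields are normalized, then both sources are merged into a single customer entity.

```python
import pandas as pd

# Hypothetical source tables with made-up column names, for illustration only.
crm = pd.DataFrame({
    "customer_id": [101, 102],
    "name": ["  Alice Smith ", "Bob Jones"],
    "email": ["ALICE@EXAMPLE.COM", "bob@example.com"],
})
orders = pd.DataFrame({
    "email": ["alice@example.com", "bob@example.com", "bob@example.com"],
    "order_total": [40.0, 25.0, 30.0],
})

# Clean the fields we will join on: trim whitespace, normalize case.
crm["name"] = crm["name"].str.strip()
crm["email"] = crm["email"].str.strip().str.lower()
orders["email"] = orders["email"].str.strip().str.lower()

# Combine both sources into one "customer" entity keyed by email,
# aggregating order history so each row carries meaningful context.
order_summary = orders.groupby("email", as_index=False)["order_total"].sum()
customers = crm.merge(order_summary, on="email", how="left")

print(customers)
```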

In this post, we’ll explore some of the complexities in real-life data that create headaches — and how analytical software can help users prepare data for sophisticated queries and visualizations. Read more…

Telling your data’s story

How storytelling can enhance the effectiveness of your visualizations.

Editor’s note: this post is part of our investigation into Big Data Design and Social Science. Michael Freeman covers the use of storytelling frameworks in visualizations in his new tutorial video “Using Storytelling to Effectively Communicate Data.”

Visualizing complex relationships in big data often requires involved graphical displays that can be intimidating to users. As the volume and complexity of data collection and storage scale exponentially, creating clear, communicative, and approachable visual representations of that data is an increasing challenge. As a data visualization specialist, I frightened one of my first sets of collaborators when I suggested using this display:

[Image: the data visualization in question]

What I had failed to communicate was that we would use a story structure to introduce audiences to the complex layout (you can see how I did it here).

This image captures three emerging limitations in big data visualization:

  1. Unclear visual encodings: People don’t know what each visual symbol represents
  2. Too much data: The volume of information displayed is overwhelming
  3. Too many variables: Simultaneous encodings of color, position, size, and other channels preclude fully understanding each dimension

Read more…

Redefining power distribution using big data

The O'Reilly Data Show Podcast: Erich Nachbar on testing and deploying open source, distributed computing components.

When I first hear of a new open source project that might help me solve a problem, the first thing I do is ask around to see if any of my friends have tested it. Sometimes, however, the early descriptions sound so promising that I just jump right in and try it myself — and in a few cases, I transition to it immediately (this was certainly the case for Spark).

I recently had a conversation with Erich Nachbar, founder and CTO of Virtual Power Systems, and one of the earliest adopters of Spark. In the early days of Spark, Nachbar was CTO of Quantifind, a startup often cited by the creators of Spark as one of the first “production deployments.” On the latest episode of the O’Reilly Data Show Podcast, we talk about the ease with which Nachbar integrates new open source components into existing infrastructure, his contributions to Mesos, and his new “software-defined power distribution” startup.

Ecosystem of open source big data technologies

When evaluating a new software component, nothing beats testing it against workloads that mimic your own. Nachbar has had the luxury of working in organizations where introducing new components isn’t subject to multiple levels of decision-making. But, as he notes, everything starts with testing things for yourself:

“I have sort of my mini test suite…If it’s a data store, I would just essentially hook it up to something that’s readily available, some feed like a Twitter fire hose, and then just let it be bombarded with data, and by now, it’s my simple benchmark to know what is acceptable and what isn’t for the machine…I think if more people, instead of reading papers and paying people to tell them how good or bad things are, would actually set aside a day and try it, I think they would learn a lot more about the system than just reading about it and theorizing about the system.” Read more…
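
As a rough illustration of that test-it-yourself approach, here is a minimal benchmark harness in Python. The feed and the write() sink are stand-ins rather than any particular data store or the actual Twitter firehose; the point is simply to measure sustained throughput against a workload you control.

```python
import json
import random
import time

def synthetic_feed(n_messages):
    """Stand-in for a real feed (e.g., a sampled Twitter firehose)."""
    for i in range(n_messages):
        yield json.dumps({"id": i, "value": random.random()})

def benchmark(write, n_messages=100_000):
    """Push messages through a write() callable and report throughput."""
    start = time.perf_counter()
    for message in synthetic_feed(n_messages):
        write(message)
    elapsed = time.perf_counter() - start
    print(f"{n_messages} messages in {elapsed:.2f}s "
          f"({n_messages / elapsed:,.0f} msg/s)")

# Example: benchmark a no-op sink; swap in your data store's client here.
benchmark(lambda message: None)
```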

Data APIs, design, and visual storytelling

One example of how using a data API can lead to better visualizations.

Over the past five years, international agencies such as the World Bank, OECD, and UNESCO have created portals to make their data available for everyone to explore. Many non-profits are also visualizing masses of data in the hope that it will give policymakers, funders, and the general public a better understanding of the issues they are trying to solve.

Data visualization plays a key role in telling the stories behind the data. For most audiences, data sets are hard to use and interpret — the average user needs a technical guide just to navigate the complicated hierarchies of categories, let alone interpret the information. But data visualizations trigger interest and insight because they are immediate, clear, and tangible.

At FFunction, we visualize a lot of data. Most of the time our clients send us Excel spreadsheets or CSV files, so we were pleasantly surprised when we started to work with the UNESCO Institute for Statistics on two fascinating education-related projects — Out-of-School Children and Left Behind — and realized that they had been working on a data API. As we began to work through the data ourselves, we uncovered several reasons why using an API helps immeasurably with data visualization. Read more…
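
To give a flavor of what that looks like in practice, here is a minimal sketch of the pattern. The endpoint and field names below are hypothetical, not the UIS API's actual interface, but the shape of the work is the same: request JSON over HTTP, then reshape it directly into chart-ready records instead of hand-cleaning a spreadsheet export for every revision.

```python
import requests

# Hypothetical endpoint and field names, for illustration only;
# consult the data provider's API documentation for the real ones.
API_URL = "https://api.example.org/v1/out-of-school-rate"

response = requests.get(
    API_URL,
    params={"region": "world", "years": "2000:2013"},
    timeout=30,
)
response.raise_for_status()
raw = response.json()

# Reshape the API records into the minimal structure a chart needs.
chart_data = [
    {"year": row["year"], "rate": row["value"]}
    for row in raw.get("records", [])
    if row.get("value") is not None
]
print(chart_data[:3])
```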

What the IoT can learn from the health care industry

Federated authentication and authorization could provide security solutions for the Internet of Things.

Adrian Gropper co-authored this post.

After a short period of excitement and rosy prospects in the movement we’ve come to call the Internet of Things (IoT), designers are coming to realize that it will survive or implode around the twin issues of security and user control: a few electrical failures could scare people away for decades, while a nagging sense that someone is exploiting our data without our consent could sour our enthusiasm. Early indicators already point to a heightened level of scrutiny — Senator Ed Markey’s office, for example, recently put the automobile industry under the microscope for computer and network security.

In this context, what can the IoT draw from well-established technologies in federated trust? Federated trust, in technologies as diverse as Kerberos and SAML, has allowed large groups of users to collaborate securely, never having to share passwords with people they don’t trust. OpenID was probably the first truly mass-market application of federated trust.

OpenID and OAuth, which have proven their value on the Web, have an equally vital role in the exchange of data in health care. This task — often cast as the interoperability of electronic health records — can reasonably be described as the primary challenge facing the health care industry today, at least in the IT space. Reformers across the health care industry (and even Congress) have pressured the federal government to make data exchange the top priority, and the Office of the National Coordinator for Health Information Technology has declared it the centerpiece of upcoming regulations. Read more…
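
For readers unfamiliar with the mechanics, here is a minimal sketch of an OAuth 2.0 client-credentials exchange in Python. The endpoints, client identifiers, and scope names are hypothetical, but the pattern is the heart of federated authorization: obtain a short-lived token from an authorization server, then present it as a bearer credential to the data holder, so passwords are never shared across organizational boundaries.

```python
import requests

# Hypothetical authorization server, client, and scope names, for illustration only.
TOKEN_URL = "https://auth.example-health.org/oauth2/token"
CLIENT_ID = "records-portal"
CLIENT_SECRET = "not-a-real-secret"

# Client-credentials grant: the portal authenticates itself and receives a
# short-lived access token instead of ever handling a user's password.
response = requests.post(
    TOKEN_URL,
    data={"grant_type": "client_credentials", "scope": "patient-records.read"},
    auth=(CLIENT_ID, CLIENT_SECRET),
    timeout=30,
)
response.raise_for_status()
access_token = response.json()["access_token"]

# The token is then presented as a bearer credential to the data holder's API.
records = requests.get(
    "https://api.example-health.org/records",
    headers={"Authorization": f"Bearer {access_token}"},
    timeout=30,
)
```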

Let’s build open source tensor libraries for data science

Tensor methods for machine learning are fast, accurate, and scalable, but we'll need well-developed libraries.

Data scientists frequently find themselves dealing with high-dimensional feature spaces. As an example, text mining usually involves vocabularies of 10,000+ distinct words. Many analytic problems involve linear algebra, particularly 2D matrix factorization techniques, for which several open source implementations are available. Anyone working on implementing machine learning algorithms ends up needing a good library for matrix analysis and operations.

But why stop at 2D representations? In a recent Strata + Hadoop World San Jose presentation, UC Irvine professor Anima Anandkumar described how techniques developed for higher-dimensional arrays can be applied to machine learning. Tensors are generalizations of matrices that let you look beyond pairwise relationships to higher-dimensional models (a matrix is a second-order tensor). For instance, one can examine patterns between any three (or more) dimensions in data sets. In a text mining application, this leads to models that incorporate the co-occurrence of three or more words, and in social networks, you can use tensors to encode arbitrary degrees of influence (e.g., “friend of friend of friend” of a user).
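
As a toy illustration of the jump from second-order to third-order statistics, the following NumPy sketch (vocabulary and documents made up) builds both a pairwise co-occurrence matrix and a triple co-occurrence tensor from the same tiny corpus, and factorizes the 2D case with an ordinary SVD.

```python
import itertools
import numpy as np

# Toy corpus and vocabulary, for illustration only.
docs = [["data", "tensor", "model"],
        ["data", "model", "graph"],
        ["tensor", "graph", "data"]]
vocab = {"data": 0, "tensor": 1, "model": 2, "graph": 3}
V = len(vocab)

# Second-order statistics: pairwise co-occurrence within a document (a matrix).
pairs = np.zeros((V, V))
# Third-order statistics: co-occurrence of word triples (a third-order tensor).
triples = np.zeros((V, V, V))

for doc in docs:
    ids = [vocab[w] for w in doc]
    for i, j in itertools.permutations(ids, 2):
        pairs[i, j] += 1
    for i, j, k in itertools.permutations(ids, 3):
        triples[i, j, k] += 1

# The 2D case factorizes with an ordinary SVD; the 3D case calls for tensor methods.
U, s, Vt = np.linalg.svd(pairs)
print(pairs.shape, triples.shape, s.round(2))
```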

Being able to capture higher-order relationships proves to be quite useful. In her talk, Anandkumar described applications to latent variable models — including text mining (topic models), information science (social network analysis), recommender systems, and deep neural networks. A natural entry point for applications is to look at generalizations of matrix (2D) techniques to higher-dimensional arrays. Read more…
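
One concrete version of that entry point is matricization, or mode-n unfolding: flattening a higher-order tensor into an ordinary matrix so that familiar 2D techniques still apply. A small NumPy sketch, with made-up array shapes:

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: rows index the chosen mode, columns index the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

# A made-up third-order tensor, e.g. users x items x time.
T = np.random.rand(4, 5, 6)

for mode in range(T.ndim):
    M = unfold(T, mode)
    # Each unfolding is a plain matrix, so standard 2D factorizations apply.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    print(mode, M.shape, s.shape)
```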
