FEATURED STORY

Why data preparation frameworks rely on human-in-the-loop systems

The O'Reilly Data Show Podcast: Ihab Ilyas on building data wrangling and data enrichment tools in academia and industry.

As I’ve written in previous posts, data preparation and data enrichment are exciting areas for entrepreneurs, investors, and researchers. Startups like Trifacta, Tamr, Paxata, Alteryx, and CrowdFlower continue to innovate and attract enterprise customers. I’ve also noticed that companies that don’t specialize in these areas are increasingly eager to highlight data preparation capabilities in their products and services.

During a recent episode of the O’Reilly Data Show Podcast, I spoke with Ihab Ilyas, professor at the University of Waterloo and co-founder of Tamr. We discussed how he started working on data cleaning tools, academic database research, and training computer science students for positions in industry.

Academic database research in data preparation

Given the importance of data integrity, it’s no surprise that the database research community has long been interested in data preparation and data wrangling. Ilyas explained how his work in probabilistic databases led to research projects in data cleaning:

In the database theory community, these problems of handling, dealing with data inconsistency, and consistent query answering have been a celebrated area of research. However, it has been also difficult to communicate these results to industry. And database practitioners, if you like, they were more into the well-structured data and assuming a lot of good properties around this data, [and they were also] more interested in indexing this data, storing it, moving it from one place to another. And now, dealing with this large amount of diverse heterogeneous data with tons of errors, siloed across all business units in the same enterprise became a necessity. You cannot really avoid that anymore. And that triggered a new line of research for pragmatic ways of doing data cleaning and integration. … The acquisition layer in that stack has to deal with large sets of formats and sources. And you will hear about things like adapters and source adapters. And it became a market on its own, how to get access and tap into these sources, because these are kind of the long tail of data.

The way I came into this subject was also funny because we were talking about the subject called probabilistic databases and how to deal with data uncertainty. And that morphed into trying to find data sets that have uncertainty. And then we were shocked by how dirty the data is and how data cleaning is a task that’s worth looking at.

Read more…

Four short links: 2 July 2015

Mathematical Thinking, Turing on Imitation Game, Retro Gaming in Javascript, and Effective Retros

  1. How Not to be Wrong: The Power of Mathematical Thinking (Amazon) — Ellenberg chases mathematical threads through a vast range of time and space, from the everyday to the cosmic, encountering, among other things, baseball, Reaganomics, daring lottery schemes, Voltaire, the replicability crisis in psychology, Italian Renaissance painting, artificial languages, the development of non-Euclidean geometry, the coming obesity apocalypse, Antonin Scalia’s views on crime and punishment, the psychology of slime molds, what Facebook can and can’t figure out about you, and the existence of God. (via Pam Fox)
  2. What Turing Himself Said About the Imitation Game (IEEE) — fascinating history. The second myth is that Turing predicted a machine would pass his test around the beginning of this century. What he actually said on the radio in 1952 was that it would be “at least 100 years” before a machine would stand any chance with (as Newman put it) “no questions barred.”
  3. Impossible Mission in Javascript — an homage to the original, and beautiful to see. I appear to have lost all my skills in playing it in the intervening 32 years.
  4. Running Effective Retrospectives — Each change to the team’s workflow is treated as a scientific experiment, whereby a hypothesis is formed, data collected, and expectations compared with actual results.

BioBuilder: Rethinking the biological sciences as engineering disciplines

Moving biology out of the lab will enable new startups, new business models, and entirely new economies.

Buy “BioBuilder: Synthetic Biology in the Lab,” by Natalie Kuldell, PhD, Rachel Bernstein, Karen Ingram, and Kathryn M. Hart.

What needs to happen for the revolution in biology and the life sciences to succeed? What are the preconditions?

I’ve compared the biorevolution to the computing revolution several times. One of the most important changes was that computers moved out of the lab, out of the machine room, out of that sacred space with raised floors, special air conditioning, and exotic fire extinguishers, into the home. Computers stopped being things that were cared for by an army of priests in white lab coats (and that broke several times a day), and started being things that people used. Somewhere along the line, software developers stopped being people with special training and advanced degrees; children, students, non-professionals — all sorts of people — started writing code. And enjoying it.

Biology is now in a similar place. But to take the next step, we have to look more carefully at what’s needed for biology to come out of the lab. Read more…

“Internet of Things” is a temporary term

The O'Reilly Radar Podcast: Pilgrim Beart on the scale, challenges, and opportunities of the IoT.

Subscribe to the O’Reilly Radar Podcast to track the technologies and people that will shape our world in the years to come.

In this week’s Radar Podcast, O’Reilly’s Mary Treseler chatted with Pilgrim Beart about co-founding his company, AlertMe, and about why the scale of the Internet of Things creates as many challenges as it does opportunities. He also talked about the “gnarly problems” emerging from consumer wants and behaviors.

Read more…

Signals from the O’Reilly 2015 Solid Conference

Insight and analysis on the Internet of Things and the new hardware movement.

Practitioners, entrepreneurs, academics, and analysts came together in San Francisco this week to discuss the Internet of Things and the new hardware movement at the O’Reilly 2015 Solid Conference. Below we’ve assembled notable keynotes and interviews from the event.

Lock in, lock out: DRM in the real world

Author and activist Cory Doctorow uses his Solid keynote to passionately explain how computers are already entwined in our lives and our bodies, which means laws that support lock-in are much more than inconveniences. Doctorow also discusses Apollo 1201, a project from the Electronic Frontier Foundation that aims to eradicate digital rights management (DRM).

Read more…

The future of car making: Small teams using fewer materials

How we make cars is a bigger environmental issue than how we fuel them.

Around two billion cars have been built over the last 115 years; twice that number will be built over the next 35-40 years. The environmental and health impacts will be enormous. Some think the solution is electric cars or other low- or zero-emission vehicles. The truth is, if you look at the emissions of a car over its total life, you quickly discover that tailpipe emissions are just the tip of the iceberg.

An 85 kWh electric SUV may not have a tailpipe, but it has an enormous impact on our environment and health. A far greater share of a car’s total emissions comes from the materials and energy required to make it (mining, processing, manufacturing, and disposal), not from its operation. As leading environmental economist and vice chair of the National Academy of Sciences Maureen Cropper notes, “Whether we are talking about a conventional gasoline-powered automobile, an electric vehicle, or a hybrid, most of the damages are actually coming from stages other than just the driving of the vehicle.” If business continues as usual, we could triple the total global pollution generated by automobiles, as we go from two billion to six billion vehicles manufactured.

The conclusion from this is straightforward: how we make our cars is actually a bigger environmental issue than how we fuel our cars. We need to dematerialize — dramatically reduce the material and energy required to build cars — and we need to do it now. Read more…
