Lessons from next-generation data wrangling tools

Drawing inspiration from recent advances in data preparation.


One of the trends we’re following is the rise of applications that combine big data, algorithms, and efficient user interfaces. As I noted in an earlier post, our interest stems from consumer apps as well as from tools that democratize data analysis. It’s no surprise that one of the areas where “cognitive augmentation” is playing out is data preparation and curation. Data scientists continue to spend a lot of their time on data wrangling, and the increasing number of (public and internal) data sources paves the way for tools that can increase productivity in this critical area.

At Strata + Hadoop World in New York, two presentations from academic spinoff start-ups — Mike Stonebraker of Tamr, and Joe Hellerstein and Sean Kandel of Trifacta — focused on data preparation and curation. While data wrangling is just one component of a data science pipeline, and productivity tools for data science are admittedly still in their early days, some of the lessons these companies have learned extend beyond data preparation.

Scalability ~ data variety and size

Not only are enterprises faced with many data stores and spreadsheets, but data scientists also have many more (public and internal) data sources they want to incorporate. In the absence of a global data model, integrating these data silos and data sources requires tools for consolidating schemas.

Random samples are great for working through the initial phases, particularly while you’re still familiarizing yourself with a new data set. Trifacta lets users work with samples while they’re developing data wrangling “scripts” that can be used on full data sets.
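
To make that pattern concrete, here is a minimal sketch of the sample-then-scale idea in pandas rather than in Trifacta itself; the file name, column names, and cleaning steps are all hypothetical:

```python
# Hypothetical sketch: develop wrangling logic on a sample, then rerun it on the full data.
import pandas as pd

def wrangle(df: pd.DataFrame) -> pd.DataFrame:
    """Cleaning steps worked out interactively against a small sample."""
    df = df.rename(columns=str.lower)                             # normalize header names
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")   # coerce bad values to NaN
    return df.dropna(subset=["amount"])                           # drop rows that failed to parse

# Develop and sanity-check the script on a random sample of the file...
sample = pd.read_csv("transactions.csv", nrows=100_000).sample(frac=0.1, random_state=0)
print(wrangle(sample).describe())

# ...then apply the same script, unchanged, to the full data set.
full = wrangle(pd.read_csv("transactions.csv"))
```

The point is that the wrangling logic becomes a reusable artifact: once it behaves correctly on the sample, the identical steps are replayed over the complete data.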
Read more…


Security principles of bitcoin

The core principle in bitcoin is decentralization, and it has important implications for security.

Editor’s note: this is an excerpt from Chapter 10 of our recently released book Mastering Bitcoin, by Andreas Antonopoulos. You can read the full chapter here. Antonopoulos will be speaking at our upcoming event Bitcoin & the Blockchain, January 27, 2015, in San Francisco. Find out more about the event and reserve your spot here.

Securing bitcoin is challenging because bitcoin is not an abstract reference to value, like a balance in a bank account. Bitcoin is very much like digital cash or gold. You’ve probably heard the expression “Possession is nine tenths of the law.” Well, in bitcoin, possession is ten tenths of the law. Possession of the keys to unlock the bitcoin is equivalent to possession of cash or a chunk of precious metal. You can lose it, misplace it, have it stolen, or accidentally give the wrong amount to someone. In every one of those cases, end users would have no recourse, just as if they dropped cash on a public sidewalk.

However, bitcoin has capabilities that cash, gold, and bank accounts do not. A bitcoin wallet, containing your keys, can be backed up like any file. It can be stored in multiple copies, even printed on paper for hardcopy backup. You can’t “back up” cash, gold, or bank accounts. Bitcoin is different enough from anything that has come before that we need to think about bitcoin security in a novel way, too.

Security principles

The core principle in bitcoin is decentralization, and it has important implications for security. A centralized model, such as a traditional bank or payment network, depends on access control and vetting to keep bad actors out of the system. By comparison, a decentralized system like bitcoin pushes the responsibility and control to the end users. Because the security of the network is based on proof of work, not access control, the network can be open and no encryption is required for bitcoin traffic. Read more…


The promise and problems of big data

A look at the social and moral implications of living in a deeply connected, analyzed, and informed world.

Editor’s note: this is an excerpt from our new report Data: Emerging Trends and Technologies, by Alistair Croll. You can download the free report here.

We’ll now look at both the light and the shadows of this new dawn, the social and moral implications of living in a deeply connected, analyzed, and informed world. This is both the promise and the peril of big data in an age of widespread sensors, fast networks, and distributed computing.

Solving the big problems

The planet’s systems are under strain from a burgeoning population. Scientists warn of rising tides, droughts, ocean acidity, and accelerating extinction. Medication-resistant diseases, outbreaks fueled by globalization, and myriad other semi-apocalyptic Horsemen ride across the horizon.

Can data fix these problems? Can we extend agriculture with data? Find new cures? Track the spread of disease? Understand weather and marine patterns? General Electric’s Bill Ruh says that while the company will continue to innovate in materials sciences, the place where it will see real gains is in analytics.

It’s often been said that there’s nothing new about big data. The “iron triangle” of Volume, Velocity, and Variety that Doug Laney coined in 2001 has been a constraint on all data since the first database. Basically, you could have any two you want fairly affordably. Consider:

  • A coin-sorting machine sorts a large volume of coins rapidly, but assumes a small variety of coins. It wouldn’t work well if there were hundreds of coin types.
  • A public library, organized by the Dewey Decimal System, has a wide variety of books and topics, and a large volume of those books — but stacking and retrieving the books happens at a slow velocity.

What’s new about big data is that the cost of getting all three Vs has become so cheap it’s almost not worth billing for. A Google search happens with great alacrity, combs the sum of online knowledge, and retrieves a huge variety of content types. Read more…


The computing of distrust

A look at what lies ahead in the disenchanted age of postmodern computing.


Sometime last summer, I ran into the phrase “postmodern computing.” I don’t remember where, but it struck me as a powerful way to understand an important shift in the industry. What is different in the industry? How are 2014 and 2015 different from 2004 and 2005?

If we’re going to understand what “postmodern computing” means, we first have to understand “modern” computing. And to do that, we also have to understand modernism and postmodernism. After all, “modern” and “postmodern” only have meaning relative to each other; they’re both about a particular historical arc, not a single moment in time.

Some years back, I was given a history of St. Barbara’s Greek Orthodox Church in New Haven, carefully annotated wherever a member of my family had played a part. One story that stood out from early in the 20th century was AHEPA: the American Hellenic Educational Progressive Association. The mere existence of that organization in the 1920s says more about modernism than any number of literary analyses. In AHEPA, and in many other similar societies crossing many churches and many ethnic groups, people were betting on the future. The future is going to be better than the present. We were poor dirt farmers in the Old Country; now we’re here, and we’re going to build a better future for ourselves and our children. Read more…


Introduction to the blockchain

The blockchain is like layers in a geological formation — the deeper you go, the more stability you gain.

Editor’s note: this is an excerpt from Chapter 7 of our recently released book Mastering Bitcoin, by Andreas Antonopoulos. You can read the full chapter here. Antonopoulos will be speaking at our upcoming event Bitcoin & the Blockchain, January 27, 2015, in San Francisco. Find out more about the event and reserve your spot here.

The blockchain data structure is an ordered back-linked list of blocks of transactions. The blockchain can be stored as a flat file, or in a simple database. The bitcoin core client stores the blockchain metadata using Google’s LevelDB database. Blocks are linked “back,” each referring to the previous block in the chain. The blockchain is often visualized as a vertical stack, with blocks layered on top of each other and the first block serving as the foundation of the stack. The visualization of blocks stacked on top of each other results in the use of terms like “height” to refer to the distance from the first block, and “top” or “tip” to refer to the most recently added block.

Each block within the blockchain is identified by a hash, generated using the SHA256 cryptographic hash algorithm on the header of the block. Each block also references a previous block, known as the parent block, through the “previous block hash” field in the block header. In other words, each block contains the hash of its parent inside its own header. The sequence of hashes linking each block to its parent creates a chain going back all the way to the first block ever created, known as the genesis block. Read more…
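
A toy Python sketch can make the linking concrete. Real bitcoin hashes the 80-byte serialized block header (version, parent hash, merkle root, timestamp, difficulty target, nonce); here the header is reduced to a plain string purely for illustration:

```python
# Simplified illustration of hash-linked blocks; not the real header serialization.
import hashlib

def block_hash(header: str) -> str:
    # Bitcoin applies SHA256 twice to the block header.
    return hashlib.sha256(hashlib.sha256(header.encode()).digest()).hexdigest()

genesis_header = "prev:" + "0" * 64 + "|data:genesis"
genesis_hash = block_hash(genesis_header)

# The child block embeds its parent's hash in its own header,
# which is what chains every block back to the genesis block.
block1_header = "prev:" + genesis_hash + "|data:block 1 transactions"
block1_hash = block_hash(block1_header)

print("height 0 (genesis):", genesis_hash)
print("height 1 (tip):    ", block1_hash)
```

Because each header contains its parent’s hash, altering any block changes its hash and invalidates the “previous block hash” reference in every block above it, which is why blocks buried deeper in the chain are considered more settled.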


Apache Spark’s journey from academia to industry

In this O'Reilly Data Show Podcast: Ion Stoica talks about the rise of Apache Spark and Apache Mesos.

Three projects from UC Berkeley’s AMPLab have been keenly adopted by industry: Apache Mesos, Apache Spark, and Tachyon. As an early user, it’s been fun to watch Spark go from an academic lab to the most active open source project in big data. In my recent travels, I’ve met Spark users from companies of all sizes and from many industries. I’ve also spoken with companies that came of age before Spark was available or mature enough, and many are replacing homegrown tools with Spark. (Full disclosure: I’m an advisor to Databricks, a start-up commercializing Apache Spark.)


A few months ago, I spoke with UC Berkeley Professor and Databricks CEO Ion Stoica about the early days of Spark and the Berkeley Data Analytics Stack. Ion noted that by the time his students began work on Spark and Mesos, his experience at his other start-up, Conviva, had already informed some of the design choices:

“Actually, this story started back in 2009, and it started with a different project, Mesos. So, this was a class project in a class I taught in the spring of 2009. And that was to build a cluster management system, to be able to support multiple cluster computing frameworks like Hadoop, at that time, MPI and others. To share the same cluster as the data in the cluster. Pretty soon after that, we thought about what to build on top of Mesos, and that was Spark. Initially, we wanted to demonstrate that it was actually easier to build a new framework from scratch on top of Mesos, and of course we wanted it to be also special. So, we targeted workloads for which Hadoop at that time was not good enough. Hadoop was targeting batch computation. So, we targeted interactive queries and iterative computation, like machine learning.” Read more…
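
As a rough illustration of what “iterative computation” means in Spark, here is a minimal PySpark sketch; the file path and the toy gradient-descent update are hypothetical. The key point is that the data set is cached in memory once and reused on every pass, instead of being reread from disk on each iteration as a batch MapReduce job would require:

```python
# Hypothetical sketch of an iterative computation over a cached Spark RDD.
from pyspark import SparkContext

sc = SparkContext("local[*]", "iterative-sketch")

# One numeric value per line; cache() keeps the parsed data in memory across passes.
values = sc.textFile("values.txt").map(float).cache()
n = values.count()

# Toy iterative job: gradient descent toward the mean of the data.
estimate, lr = 0.0, 0.1
for _ in range(20):
    grad = values.map(lambda v: estimate - v).sum() / n  # one full pass, served from cache
    estimate -= lr * grad

print("estimate after 20 passes:", estimate)
sc.stop()
```

Interactive queries benefit from the same property: once the working set is cached, each follow-up question touches memory rather than disk.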
