"Big Data" entries

Lessons from next-generation data wrangling tools

Drawing inspiration from recent advances in data preparation.


One of the trends we’re following is the rise of applications that combine big data, algorithms, and efficient user interfaces. As I noted in an earlier post, our interest stems from both consumer apps and tools that democratize data analysis. It’s no surprise that one of the areas where “cognitive augmentation” is playing out is data preparation and curation. Data scientists continue to spend a lot of their time on data wrangling, and the increasing number of (public and internal) data sources paves the way for tools that can increase productivity in this critical area.

At Strata + Hadoop World in New York, two presentations from academic spinoff start-ups — Mike Stonebraker of Tamr, and Joe Hellerstein and Sean Kandel of Trifacta — focused on data preparation and curation. While data wrangling is just one component of a data science pipeline, and granted we’re still in the early days of productivity tools for data science, some of the lessons these companies have learned extend beyond data preparation.

Scalability ~ data variety and size

Not only are enterprises faced with many data stores and spreadsheets; data scientists also have many more (public and internal) data sources they want to incorporate. In the absence of a global data model, integrating these data silos and sources requires tools for consolidating schemas.

Random samples are great for working through the initial phases, particularly while you’re still familiarizing yourself with a new data set. Trifacta lets users work with samples while they’re developing data wrangling “scripts” that can be used on full data sets.
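
The same pattern is easy to reproduce outside Trifacta. Here is a rough pandas sketch of sample-first wrangling (the file name, column names, and cleaning steps are all hypothetical):

    import pandas as pd

    # Develop the wrangling "script" interactively against a small random sample.
    sample = pd.read_csv("transactions.csv", nrows=100_000).sample(frac=0.01, random_state=42)

    def wrangle(df):
        """Cleaning steps worked out on the sample, then reused on the full data."""
        df = df.dropna(subset=["customer_id"])
        df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
        df["date"] = pd.to_datetime(df["date"], errors="coerce")
        return df

    wrangle(sample)  # iterate here until the output looks right

    # Once the script is stable, stream the full data set through it in chunks.
    chunks = pd.read_csv("transactions.csv", chunksize=500_000)
    full = pd.concat(wrangle(chunk) for chunk in chunks)
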
Read more…

Comments: 2

Top keynotes at Strata Conference and Strata + Hadoop World 2014

From data privacy to real-world problem solving, O’Reilly’s data editors highlight the best of the best talks from 2014.

2014 was a year of tremendous growth in the field of data, as it was for Strata and Strata + Hadoop World, O’Reilly’s and Cloudera’s series of data conferences. At Strata, keynotes, individual sessions, and tracks like Hardcore Data Science, Hadoop and Beyond, Data-Driven Business Day, and Design & Interfaces, among others, explore the cutting-edge aspects of how to gather, store, wrangle, analyze, visualize, and make decisions with the vast amounts of data we have on hand today. Looking back on the past year of Strata, the O’Reilly data editors chose our top keynotes from Strata Santa Clara, Strata Barcelona, and Strata + Hadoop World NYC.

It was tough to winnow the list down from an exceptional set of keynotes. Visit the O’Reilly YouTube channel for a larger set of 2014 keynotes, or Safari for videos of the keynotes and many of the conference sessions.

Best of the best

  • Julia Angwin reframes the issue of data privacy as justice, due process, and human rights (and her account of trying to buy better privacy goods and services is both instructive and funny).

Read more…

Comment: 1

The promise and problems of big data

A look at the social and moral implications of living in a deeply connected, analyzed, and informed world.

Editor’s note: this is an excerpt from our new report Data: Emerging Trends and Technologies, by Alistair Croll. You can download the free report here.

We’ll now look at both the light and the shadows of this new dawn, the social and moral implications of living in a deeply connected, analyzed, and informed world. This is both the promise and the peril of big data in an age of widespread sensors, fast networks, and distributed computing.

Solving the big problems

The planet’s systems are under strain from a burgeoning population. Scientists warn of rising tides, droughts, ocean acidity, and accelerating extinction. Medication-resistant diseases, outbreaks fueled by globalization, and myriad other semi-apocalyptic Horsemen ride across the horizon.

Can data fix these problems? Can we extend agriculture with data? Find new cures? Track the spread of disease? Understand weather and marine patterns? General Electric’s Bill Ruh says that while the company will continue to innovate in materials sciences, the place where it will see real gains is in analytics.

It’s often been said that there’s nothing new about big data. The “iron triangle” of Volume, Velocity, and Variety that Doug Laney coined in 2001 has been a constraint on all data since the first database. Basically, you could have any two you want fairly affordably. Consider:

  • A coin-sorting machine sorts a large volume of coins rapidly, but assumes a small variety of coins. It wouldn’t work well if there were hundreds of coin types.
  • A public library, organized by the Dewey Decimal System, has a wide variety of books and topics, and a large volume of those books — but stacking and retrieving the books happens at a slow velocity.

What’s new about big data is that the cost of getting all three Vs has become so cheap it’s almost not worth billing for. A Google search happens with great alacrity, combs the sum of online knowledge, and retrieves a huge variety of content types. Read more…

Comment

Four short links: 6 January 2015

IoT Protocols, Predictive Limits, Machine Learning and Security, and 3D-Printing Electronics

  1. Exploring the Protocols of the Internet of Things (Sparkfun) — Arduino and Arduino-like IoT “things” especially, with their limited flash and SRAM, can benefit from specially crafted IoT protocols.
  2. Complexity Salon: Ebola (willowbl00) — These notes were taken at the 2014.Dec.18 New England Complex Systems Institute Salon focused on Ebola. […] Why don’t we engage in risks in a more serious way? Everyone thinks their prior experience indicates what will happen in the future. Look at past Ebola! It died down before going far, surely it won’t be bad in the future.
  3. Machine Learning Methods for Computer Security (PDF) — papers on topics such as adversarial machine learning, attacking pattern recognition systems, data privacy and machine learning, machine learning in forensics, and deceiving authorship detection.
  4. voxel8 — Using Voxel8’s 3D printer, you can co-print matrix materials such as thermoplastics and highly conductive silver inks, enabling customized electronic devices like quadcopters, electromagnets, and fully functional 3D electromechanical assemblies.
Comment

Four short links: 1 January 2015

Wearables Killer App, Open Government Data, Gender From Name, and DVCS for Geodata

  1. Killer App for Wearables (Fortune) — While many corporations are still waiting to see what the “killer app” for wearables is, Disney invented one. The company launched the RFID-enabled MagicBands just over a year ago. Since then, they’ve given out more than 9 million of them. Disney says 75% of MagicBand users engage with the “experience”—a website called MyMagic+—before their visit to the park. Online, they can connect their wristband to a credit card, book fast passes (which let you reserve up to three rides without having to wait in line), and even order food ahead of time. […] Already, Disney says, MagicBands have led to increased spending at the park.
  2. USA Govt Depts Progress on Open Data Policy (labs.data.gov) — nice dashboard, but who will be watching it and what squeeze will they apply?
  3. globalnamedata — We have collected birth record data from the United States and the United Kingdom across a number of years for all births in the two countries and are releasing the collected and cleaned up data here. We have also generated a simple gender classifier based on incidence of gender by name.
  4. geogig — an open source tool that draws inspiration from Git, but adapts its core concepts to handle distributed versioning of geospatial data.
Comment: 1

Apache Spark’s journey from academia to industry

In this O'Reilly Data Show Podcast: Ion Stoica talks about the rise of Apache Spark and Apache Mesos.

Three projects from UC Berkeley’s AMPLab have been keenly adopted by industry: Apache Mesos, Apache Spark, and Tachyon. As an early user, it’s been fun to watch Spark go from an academic lab to the most active open source project in big data. In my recent travels, I’ve met Spark users from companies of all sizes and from many industries. I’ve also spoken with companies that came of age before Spark was available or mature enough, and many are replacing homegrown tools with Spark. (Full disclosure: I’m an advisor to Databricks, a start-up commercializing Apache Spark.)

Subscribe to the O’Reilly Data Show Podcast

iTunes, SoundCloud, RSS

A few months ago, I spoke with UC Berkeley Professor and Databricks CEO Ion Stoica about the early days of Spark and the Berkeley Data Analytics Stack. Ion noted that by the time his students began work on Spark and Mesos, his experience at his other start-up Conviva had already informed some of the design choices:

“Actually, this story started back in 2009, and it started with a different project, Mesos. So, this was a class project in a class I taught in the spring of 2009. And that was to build a cluster management system, to be able to support multiple cluster computing frameworks like Hadoop, at that time, MPI and others. To share the same cluster as the data in the cluster. Pretty soon after that, we thought about what to build on top of Mesos, and that was Spark. Initially, we wanted to demonstrate that it was actually easier to build a new framework from scratch on top of Mesos, and of course we wanted it to be also special. So, we targeted workloads for which Hadoop at that time was not good enough. Hadoop was targeting batch computation. So, we targeted interactive queries and iterative computation, like machine learning.”
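
Caching is what makes those workloads tractable: interactive and iterative jobs scan the same working set repeatedly, so holding it in cluster memory beats rereading it from disk on every pass, as a chain of Hadoop MapReduce jobs would. A minimal PySpark sketch of the idea (the input path and the toy least-squares update are placeholders, not anything from Databricks):

    from pyspark import SparkContext

    sc = SparkContext(appName="iterative-sketch")

    # Parse (x, y) points once and keep them in cluster memory; a chain of
    # MapReduce jobs would reread the input from disk on every iteration.
    points = (sc.textFile("hdfs:///data/points.txt")
                .map(lambda line: tuple(float(v) for v in line.split(",")))
                .cache())
    n = points.count()

    w = 0.0
    for step in range(10):  # each pass rescans the cached RDD, not the file
        gradient = points.map(lambda p: (w * p[0] - p[1]) * p[0]).sum()
        w -= 0.1 * gradient / n

    print("fitted slope:", w)

Read more…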

Comment

New opportunities in the maturing marketplace of big data components

The evolving marketplace is making new data applications and interactions possible.

Editor’s note: this is an excerpt from our new report Data: Emerging Trends and Technologies, by Alistair Croll. Download the free report here.

Here’s a look at some options in the evolving, maturing marketplace of big data components that are making the new applications and interactions we’ve been looking at possible.

Graph theory

Long a staple of social network analysis, graph theory is finding more and more homes in research and business. Machine learning systems can scale up fast with tools like Parameter Server, and the TitanDB project gives developers a robust set of graph tools to work with.

Are graphs poised to take their place alongside relational database management systems (RDBMS), object storage, and other fundamental data building blocks? What are the new applications for such tools?
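
For a sense of what graph tools make easy, here is a small sketch using the open source NetworkX library (a stand-in for a graph database like TitanDB; the toy social graph is invented):

    import networkx as nx

    # A toy social graph: edges are mutual "knows" relationships.
    g = nx.Graph()
    g.add_edges_from([("ana", "bob"), ("bob", "carla"),
                      ("carla", "ana"), ("carla", "dev"), ("dev", "eli")])

    # Queries that are natural in a graph engine but awkward in an RDBMS:
    print(nx.degree_centrality(g))            # who is most connected?
    print(nx.shortest_path(g, "ana", "eli"))  # fewest hops between two people
    print(list(nx.connected_components(g)))   # natural communities
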

Inside the black box of algorithms: whither regulation?

It’s possible for a machine to create an algorithm no human can understand. Evolutionary approaches to algorithmic optimization can result in inscrutable, yet demonstrably better, computational solutions.
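
A toy illustration: the evolutionary loop below steadily improves its candidate solutions, yet the winning genome arrives with no human-readable rationale attached (the bit-string objective stands in for any opaque fitness measure):

    import random

    random.seed(0)
    TARGET = [random.randint(0, 1) for _ in range(40)]  # stand-in objective

    def fitness(bits):
        return sum(b == t for b, t in zip(bits, TARGET))

    def mutate(bits, rate=0.05):
        return [1 - b if random.random() < rate else b for b in bits]

    # Selection plus mutation, generation after generation. The final genome
    # scores well, but the process leaves no explanation of why it works.
    population = [[random.randint(0, 1) for _ in range(40)] for _ in range(50)]
    for generation in range(200):
        population.sort(key=fitness, reverse=True)
        parents = population[:10]
        population = [mutate(random.choice(parents)) for _ in range(50)]

    best = max(population, key=fitness)
    print(fitness(best), "of", len(TARGET))
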

If you’re a regulated bank, you need to share your algorithms with regulators. But if you’re a private trader, you’re under no such constraints. And having to explain your algorithms limits how you can generate them.

As more and more of our lives are governed by code that decides what’s best for us, replacing laws, actuarial tables, personal trainers, and personal shoppers, oversight means opening up the black box of algorithms so they can be regulated.

Years ago, Orbitz was shown to be charging web visitors who owned Apple devices more than those visiting via other platforms, such as the PC. But that’s not the whole story: Orbitz’s machine learning algorithms, which optimized revenue per customer, learned that the visitor’s browser was a predictor of their willingness to pay more.
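
No engineer wrote a rule saying “charge Apple users more”; a model tuned for revenue finds such correlations on its own. A contrived sketch of how a platform feature surfaces inside a learned model (the training data is invented):

    from sklearn.tree import DecisionTreeRegressor, export_text

    # Invented training data: [is_mac, is_mobile, prior_bookings] -> spend ($)
    X = [[1, 0, 2], [1, 1, 0], [0, 0, 1], [0, 0, 0], [1, 0, 5], [0, 1, 1]]
    y = [310, 280, 190, 150, 400, 170]

    model = DecisionTreeRegressor(max_depth=2).fit(X, y)

    # The learned splits expose the pattern: platform drives predicted spend.
    print(export_text(model, feature_names=["is_mac", "is_mobile", "prior_bookings"]))

Read more…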

Comment

Computational power and cognitive augmentation

A look at a few ways humans mesh with the rest of our data systems.

Editor’s note: this is an excerpt from our new report Data: Emerging Trends and Technologies, by Alistair Croll. Download the free report here.

Here’s a look at a few of the ways that humans — still the ultimate data processors — mesh with the rest of our data systems: how computational power can best produce true cognitive augmentation.

Deciding better

Over the past decade, we fitted roughly a quarter of our species with sensors. We instrumented our businesses, from the smallest market to the biggest factory. We began to consume that data, slowly at first. Then, as we were able to connect data sets to one another, the applications snowballed. Now that both the front office and the back office are plugged into everything, business cares. A lot.

While early adopters focused on sales, marketing, and online activity, today data gathering and analysis are ubiquitous. Governments, activists, mining giants, local businesses, transportation, and virtually every other industry live by data. If an organization isn’t harnessing the data exhaust it produces, it’ll soon be eclipsed by more analytical, introspective competitors that learn and adapt faster.

Whether we’re talking about a single human made more productive by a smartphone-turned-prosthetic-brain, or a global organization gaining the ability to make more informed decisions more quickly, ultimately, Strata + Hadoop World has become about deciding better.

What does it take to make better decisions? How will we balance machine optimization with human inspiration, sometimes making the best of the current game and other times changing the rules? Will machines that make recommendations about the future based on the past reduce risk, raise barriers to innovation, or make us vulnerable to improbable Black Swans because they mistakenly conclude that tomorrow is like yesterday, only more so? Read more…

Comments: 2

Four short links: 19 December 2014

Statistical Causality, Clustering Bitcoin, Hardware Security, and A Language for Scripts

  1. Distinguishing Cause and Effect using Observational Data — research paper evaluating effectiveness of the “additive noise” test, a nifty statistical trick to identify causal relationships from observational data. (via Slashdot)
  2. Clustering Bitcoin Accounts Using Heuristics (O’Reilly Radar) — In theory, a user can go by many different pseudonyms. If that user is careful and keeps the activity of those different pseudonyms separate, completely distinct from one another, then they can really maintain a level of, maybe not anonymity, but again, cryptographically it’s called pseudo-anonymity. […] It turns out in reality, though, the way most users and services are using bitcoin, was really not following any of the guidelines that you would need to follow in order to achieve this notion of pseudo-anonymity. So, basically, what we were able to do is develop certain heuristics for clustering together different public keys, or different pseudonyms.
  3. A Primer on Hardware Security: Models, Methods, and Metrics (PDF) — Camouflaging: This is a layout-level technique to hamper image-processing-based extraction of gate-level netlist. In one embodiment of camouflaging, the layouts of standard cells are designed to look alike, resulting in incorrect extraction of the netlist. The layout of nand cell and the layout of nor cell look different and hence their functionality can be extracted. However, the layout of a camouflaged nand cell and the layout of camouflaged nor cell can be made to look identical and hence an attacker cannot unambiguously extract their functionality.
  4. Prompter: A Domain-Specific Language for Versu (PDF) — literally a scripting language (you write theatrical-style scripts, characters, dialogues, and events) for an inference engine that lets you talk to characters and have a different story play out each time.
Comment

Clustering bitcoin accounts using heuristics

In this O'Reilly Data Show Podcast: Sarah Meiklejohn on analytic applications for blockchain and cryptocurrency technology.

Editor’s note: we’ll explore present and future applications of cryptocurrency and blockchain technologies at our upcoming Radar Summit: Bitcoin & the Blockchain on Jan. 27, 2015, in San Francisco.

A few data scientists are starting to play around with cryptocurrency data, and as bitcoin and related technologies start gaining traction, I expect more to wade in. As the space matures, there will be many interesting applications based on analytics over the transaction data produced by these technologies. The blockchain — the distributed ledger that contains all bitcoin transactions — is publicly available, and the underlying data set is of modest size. Data scientists can work with this data once it’s loaded into familiar data structures, but producing insights requires some domain knowledge and expertise.
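
As a sketch of what “familiar data structures” can mean here, suppose the ledger has been exported to a flat table of transaction outputs (the file and column names below are hypothetical); ordinary pandas operations then carry the analysis:

    import pandas as pd

    # Hypothetical export of the ledger: one row per transaction output,
    # with columns txid, timestamp, address, btc.
    tx = pd.read_csv("bitcoin_tx_outputs.csv", parse_dates=["timestamp"])

    # Once the ledger is tabular, ordinary analytics apply:
    daily_volume = tx.resample("D", on="timestamp")["btc"].sum()
    top_receivers = tx.groupby("address")["btc"].sum().nlargest(10)
    print(daily_volume.tail())
    print(top_receivers)
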

Subscribe to the O’Reilly Data Show Podcast

iTunes, SoundCloud, RSS

I recently spoke with Sarah Meiklejohn, a lecturer at UCL and an expert on computer security and cryptocurrencies. She was part of an academic research team that studied pseudo-anonymity (“pseudonymity”) in bitcoin. In particular, they used transaction data to compare the “potential” anonymity available to users with the “actual” anonymity they achieved. A bitcoin user can use many different public keys, but careful research led to a few heuristics that allowed the team to cluster addresses belonging to the same user:

“In theory, a user can go by many different pseudonyms. If that user is careful and keeps the activity of those different pseudonyms separate, completely distinct from one another, then they can really maintain a level of, maybe not anonymity, but again, cryptographically it’s called pseudo-anonymity. So, if they are a legitimate businessman on the one hand, they can use a certain set of pseudonyms for that activity, and then if they are dealing drugs on Silk Road, they might use a completely different set of pseudonyms for that, and you wouldn’t be able to tell that that’s the same user.
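
The best known of those heuristics assumes that addresses spent together as inputs to a single transaction belong to the same user, which turns clustering into a union-find pass over the ledger. A minimal sketch with invented transactions:

    from collections import defaultdict

    # Multi-input heuristic: addresses co-spent as inputs to one transaction
    # are assumed to share an owner. Cluster them with union-find.
    parent = {}

    def find(addr):
        parent.setdefault(addr, addr)
        while parent[addr] != addr:
            parent[addr] = parent[parent[addr]]  # path halving
            addr = parent[addr]
        return addr

    def union(a, b):
        parent[find(a)] = find(b)

    transactions = [  # invented: each entry is one transaction's input addresses
        ["1Alice", "1AliceChange"],
        ["1AliceChange", "1AliceSavings"],
        ["1Bob", "1BobTips"],
    ]
    for inputs in transactions:
        for addr in inputs[1:]:
            union(inputs[0], addr)

    clusters = defaultdict(set)
    for addr in parent:
        clusters[find(addr)].add(addr)
    print(list(clusters.values()))  # two clusters: Alice's addresses and Bob's
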

Read more…

Comment: 1