# "Big Data" entries

## The original big data industry

### Oil and gas exploration has long been at the forefront of data collection and analysis.

Download our new free report, “Oil, Gas, and Data: High-Performance Data Tools in the Production of Industrial Power,” looking at the role of data, machine learning, and predictive analytics in oil and gas exploration.

Petroleum extraction is an industry marked by price volatility and high capital exposure in new ventures. Big data is reducing risk, not just to capital, but to workers and the environment as well, as Dan Cowles explores in the new free report Oil, Gas, and Data.

At the Global Petroleum Show in Calgary, exhibiting alongside massive drill heads, chemical analysts, and the latest in valves and pipes are companies with a decidedly more virtual product: data. IBM’s Aspera, Abacus Datagraphics, Fujitsu, and Oracle’s Front Porch Digital are pitching data intake, analysis, and storage services to the oil industry, and industry stalwarts such as Halliburton, Lockheed Martin, and BP have been developing these capacities in-house.

The primary benefits of big data occur at the upstream end of petroleum production: exploration, discovery, and drilling. Better analysis of seismic and other geological data allows for drilling in more productive locations, and continual monitoring of equipment results in more uptime and better safety for both workers and the environment. These marginal gains can be enough to keep an entire region competitive: the trio of cheap sensors, fast networks, and distributed computation that we’ve so often seen in other industries is the difference-maker keeping the North Sea oilfields productive in a sub-$100/barrel market. Read more…

## Four short links: 3 July 2015

### Storage Interference, Open Source SSL, Pub-Sub Reverse-Proxy, and Web Components Checklist

1. The Storage Tipping Point: the performance optimization technologies of the last decade – log structured file systems, coalesced writes, out-of-place updates and, soon, byte-addressable NVRAM – are conflicting with similar-but-different techniques used in SSDs and arrays. The software we use is written for dumb storage; we’re getting smart storage; but smart+smart = fragmentation, write amplification, and over-consumption.
2. s2n — Amazon’s open source SSL implementation.
3. pushpin: a reverse proxy server that makes it easy to implement WebSocket, HTTP streaming, and HTTP long-polling services. It communicates with backend web applications using regular, short-lived HTTP requests (GRIP protocol). This allows backend applications to be written in any language and use any webserver.
4. The Gold Standard Checklist for Web Components: This is a working draft of a checklist to define a “gold standard” for web components that aspire to be as predictable, flexible, reliable, and useful as the standard HTML elements.

## Why data preparation frameworks rely on human-in-the-loop systems

### The O'Reilly Data Show Podcast: Ihab Ilyas on building data wrangling and data enrichment tools in academia and industry.

As I’ve written in previous posts, data preparation and data enrichment are exciting areas for entrepreneurs, investors, and researchers. Startups like Trifacta, Tamr, Paxata, Alteryx, and CrowdFlower continue to innovate and attract enterprise customers. I’ve also noticed that companies — that don’t specialize in these areas — are increasingly eager to highlight data preparation capabilities in their products and services.

During a recent episode of the O’Reilly Data Show Podcast, I spoke with Ihab Ilyas, professor at the University of Waterloo and co-founder of Tamr. We discussed how he started working on data cleaning tools, academic database research, and training computer science students for positions in industry.

## Academic database research in data preparation

Given the importance of data integrity, it’s no surprise that the database research community has long been interested in data preparation and data wrangling. Ilyas explained how his work in probabilistic databases led to research projects in data cleaning:

In the database theory community, these problems of handling, dealing with data inconsistency, and consistent query answering have been a celebrated area of research. However, it has been also difficult to communicate these results to industry. And database practitioners, if you like, they were more into the well-structured data and assuming a lot of good properties around this data, [and they were also] more interested in indexing this data, storing it, moving it from one place to another. And now, dealing with this large amount of diverse heterogeneous data with tons of errors, sidled across all business units in the same enterprise became a necessity. You cannot really avoid that anymore. And that triggered a new line of research for pragmatic ways of doing data cleaning and integration.

… The acquisition layer in that stack has to deal with large sets of formats and sources. And you will hear about things like adapters and source adapters. And it became a market on its own, how to get access and tap into these sources, because these are kind of the long tail of data.

The way I came into this subject was also funny because we were talking about the subject called probabilistic databases and how to deal with data uncertainty. And that morphed into trying to find data sets that have uncertainty. And then we were shocked by how dirty the data is and how data cleaning is a task that’s worth looking at.

## Four short links: 1 July 2015

### Recovering from Debacle, Open IRS Data, Time Series Requirements, and Error Messages

1. Google Dev Apologizes After Photos App Tags Black People as Gorillas (Ars Technica) — this is how you recover from an unequivocally horrendous mistake.
2. IRS Finally Agrees to Release Non-Profit Records (BoingBoing) — Today, the IRS released a statement saying they’re going to do what we’ve been hoping for, saying they are going to release e-file data and this is a “priority for the IRS.” Only took $217,000 in billable lawyer hours (pro bono, thank goodness) to get there.
3. Time Series Database Requirements — classic paper, laying out why time-series databases are so damn weird. Their access patterns are so unique because of the way data is over-gathered and pushed ASAP to the store. It’s mostly recent, mostly never useful, and mostly needed in order. (via Thoughts on Time-Series Databases)
4. Compiler Errors for Humans — it’s so important, and generally underbaked in languages. A decade or more ago, I was appalled by Python’s errors after Perl’s very useful messages. Today, appreciating Go’s generally handy errors. How a system handles the operational failures that will inevitably occur is part and parcel of its UX.

## Graphs in the world: Modeling systems as networks

### See, extract, and create value with networks.

Get notified when our free report, “Mapping Big Data: A Data Driven Market Report,” is available for download.

Networks of all kinds drive the modern world. You can build a network from nearly any kind of data set, which is probably why network structures characterize some aspects of most phenomena. And yet, many people can’t see the networks underlying different systems. In this post, we’re going to survey a series of networks that model different systems in order to understand different ways networks help us understand the world around us.

We’ll explore how to see, extract, and create value with networks. We’ll look at four examples where I used networks to model different phenomena, starting with startup ecosystems and ending in network-driven marketing.

## Networks and markets

Commerce is one person or company selling to another, which is inherently a network phenomenon. Analyzing networks in markets can help us understand how market economies operate.

### Strength of weak ties

Mark Granovetter famously researched job hunting and discovered the Strength of Weak Ties. Read more…

## Four short links: 24 June 2015

### Big Data Architecture, Leaving the UK, GPU-powered Queries, and Gongkai in the West

1. 100 Big Data Architecture Papers (Anil Madan) — you’ll either find them fascinating essential reading … or a stellar cure for insomnia.
2. Software Companies Leaving UK Because of Government’s Surveillance Plans (Ars Technica) — to Amsterdam, to NYC, and to TBD.
3. MapD: Massive Throughput Database Queries with LLVM and GPUs (nvidia) — The most powerful GPU currently available is the NVIDIA Tesla K80 Accelerator, with up to 8.74 teraflops of compute performance and nearly 500 GB/sec of memory bandwidth. By supporting up to eight of these cards per server, we see orders-of-magnitude better performance on standard data analytics tasks, enabling a user to visually filter and aggregate billions of rows in tens of milliseconds, all without indexing.
4. Why It’s Often Easier to Innovate in China than the US (Bunnie Huang) — We did some research into the legal frameworks and challenges around absorbing gongkai IP into the Western ecosystem, and we believe we’ve found a path to repatriate some of the IP from gongkai into proper open source.

## Four short links: 22 June 2015

### Power Analysis, Data at Scale, Open Source Fail, and Closing the Virtuous Loop

1. Power Analysis of a Typical Psychology Experiment (Tom Stafford) — What this means is that if you don’t have a large effect, studies with between groups analysis and an n of less than 60 aren’t worth running. Even if you are studying a real phenomenon you aren’t using a statistical lens with enough sensitivity to be able to tell. You’ll get to the end and won’t know if the phenomenon you are looking for isn’t real or if you just got unlucky with who you tested.
2. The Future of Data at Scale: Data curation, on the other hand, is “the 800-pound gorilla in the corner,” says Stonebraker. “You can solve your volume problem with money. You can solve your velocity problem with money. Curation is just plain hard.” The traditional solution of extract, transform, and load (ETL) works for 10, 20, or 30 data sources, he says, but it doesn’t work for 500. To curate data at scale, you need automation and a human domain expert.
3. Why Are We Still Explaining? (Stephen Walli) — Within 24 hours we received our first righteous patch. A simple 15-line change that provided a 10% boost in Just-in-Time compiler performance. And we politely thanked the contributor and explained we weren’t accepting changes yet. Another 24 hours and we received the first solid bug fix. It was golden. It included additional tests for the test suite to prove it was fixed. And we politely thanked the contributor and explained we weren’t accepting changes yet. And that was the last thing that was ever contributed.
4. Blood Donors in Sweden Get a Text Message When Their Blood Helps Someone (Independent) — great idea to close the feedback loop. If you want to get more virtuous behaviour, make it a relationship and not a transaction. And if a warm feeling is all you have to offer in return, then offer it!
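
The power-analysis point in the first link can be made concrete with a rough calculation. The sketch below is not Stafford's own analysis: it uses a normal approximation for a two-sided, between-groups comparison, and it assumes a "medium" effect of Cohen's d = 0.5, a significance level of 0.05, and that n means participants per group. The function name `approx_power` is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test (normal approximation).

    d is the assumed standardized effect size (Cohen's d) and
    n_per_group is the assumed number of participants in each group.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected size of the test statistic when the true effect is d.
    shift = d * sqrt(n_per_group / 2)
    # Probability that the statistic falls beyond either critical value.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Assumed sample sizes, chosen only to show the trend.
for n in (20, 40, 60, 100):
    print(n, round(approx_power(0.5, n), 2))
```

Under these assumptions, power approaches the conventional 0.8 target at around 60 participants per group, and it is far lower for small samples. A smaller assumed effect would need many more participants.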

## The future of data at scale

### The O'Reilly Radar Podcast: Turing Award winner Michael Stonebraker on the future of data science.

Subscribe to the O’Reilly Radar Podcast to track the technologies and people that will shape our world in the years to come.

In March 2015, database pioneer Michael Stonebraker was awarded the 2014 ACM Turing Award “for fundamental contributions to the concepts and practices underlying modern database systems.” In this week’s Radar Podcast, O’Reilly’s Mike Hendrickson sits down with Stonebraker to talk about winning the award, the future of data science, and the importance — and difficulty — of data curation.

## One size does not fit all

Stonebraker notes that since about 2000, everyone has realized they need a database system, across markets and across industries. “Now, it’s everybody who’s got a big data problem,” he says. “The business data processing solution simply doesn’t fit all of these other marketplaces.” Stonebraker talks about the future of data science — and data scientists — and the tools and skill sets that are going to be required:

It’s all going to move to data science as soon as enough data scientists get trained by our universities to do this stuff. It’s fairly clear to me that you’re probably not going to retread a business analyst to be a data scientist because you’ve got to know statistics, you’ve got to know machine learning. You’ve got to know what regression means, what Naïve Bayes means, what k-Nearest Neighbors means. It’s all statistics.

All of that stuff turns out to be defined on arrays. It’s not defined on tables. The tools of future data scientists are going to be array-based tools. Those may live on top of relational database systems. They may live on top of an array database system, or perhaps something else. It’s completely open.
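
As a small illustration of what "defined on arrays" can look like, here is a sketch of k-Nearest Neighbors written directly against a two-dimensional array (a list of rows) with no table or query layer involved. The data, labels, and the function name `knn_predict` are hypothetical, and plain Python lists stand in for whatever array tool a data scientist would actually use.

```python
from collections import Counter
from math import dist


def knn_predict(points, labels, query, k=3):
    """Classify `query` by majority vote among its k nearest rows of `points`."""
    # Rank row indices by Euclidean distance from the query vector.
    nearest = sorted(range(len(points)), key=lambda i: dist(points[i], query))[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: each row is one observation, each column one feature.
points = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8], [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, [1.0, 1.0]))
print(knn_predict(points, labels, [5.0, 5.0]))
```

The whole computation is a distance between a vector and each row of a matrix, which is why the operation maps more naturally onto arrays than onto relational tables.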

## Real-time, not batch-time, analytics with Hadoop

### How big data, fast data, and real-time analytics work together in the real world.

Attend the VoltDB webcast on June 24, 2015 with John Hugg to learn more on how to build a fast data front-end to Hadoop.

Today, we often hear the phrase “The 3 Vs” in relation to big data: Volume, Variety and Velocity. With the interest and popularity of big data frameworks such as Hadoop, the focus has mostly centered on volume and data at rest. Common requirements here would be data ingestion, batch processing, and distributed queries. These are well understood. Increasingly, however, there is a need to manage and process data as it arrives, in real time. There may be great value in the immediacy of that data and the ability to act upon it very quickly. This is velocity and data in motion, also known as “fast data.” Fast data has become increasingly important within the past few years due to the growth in endpoints that now stream data in real time.

Big data + fast data is a powerful combination. However, adding real-time analytics to this mix provides the business value. Let’s look at a real example, originally described by Scott Jarr of VoltDB.

Consider a company that builds systems to manage physical assets in precious metal mines. Inside a mine, there are sensors on miners as well as shovels and other assets. For a lost shovel, minutes or hours of reporting latency may be acceptable. However, a sensor on a miner indicating a stopped heart should require immediate attention. The system should, therefore, be able to receive very fast data. Read more…
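
The routing decision in the mining example can be sketched in a few lines. This is an illustrative toy and not VoltDB's design: the event names, the `ingest` function, and the in-memory queue are all assumptions, and a real system would replace the lists with a streaming store and an alerting service.

```python
from collections import deque

# Hypothetical event kinds that cannot wait for a batch job.
URGENT_KINDS = {"miner_heart_stopped"}

batch_queue = deque()  # events that tolerate minutes or hours of latency
alerts = []            # stand-in for paging a rescue team immediately


def ingest(event):
    """Route one sensor event: act now if urgent, otherwise defer to batch."""
    if event["kind"] in URGENT_KINDS:
        alerts.append(event)
        return "fast"
    batch_queue.append(event)
    return "batch"


print(ingest({"kind": "shovel_location", "id": "s3"}))
print(ingest({"kind": "miner_heart_stopped", "id": "m7"}))
```

The point of the split is that both kinds of event arrive on the same stream, but only the fast path needs a decision at arrival time; everything else can still flow into the batch store for later analysis.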

## Building self-service tools to monitor high-volume time-series data

### The O'Reilly Data Show Podcast: Phil Liu on the evolution of metric monitoring tools and cloud computing.

One of the main sources of real-time data processing tools is IT operations. In fact, a previous post I wrote on the re-emergence of real-time was to a large extent prompted by my discussions with engineers and entrepreneurs building monitoring tools for IT operations. In many ways, data centers are perfect laboratories in that they are controlled environments managed by teams willing to instrument devices and software, and monitor fine-grained metrics.

During a recent episode of the O’Reilly Data Show Podcast, I caught up with Phil Liu, co-founder and CTO of SignalFx, an SF Bay Area startup focused on building self-service monitoring tools for time series. We discussed hiring and building teams in the age of cloud computing, building tools for monitoring large numbers of time series, and lessons he’s learned from managing teams at leading technology companies.

## Evolution of monitoring tools

Having worked at LoudCloud, Opsware, and Facebook, Liu has seen firsthand the evolution of real-time monitoring tools and platforms. Liu described how he has watched the number of metrics grow to volumes that require large compute clusters:

One of the first services I worked on at LoudCloud was a service called MyLoudCloud. Essentially that was a monitoring portal for all LoudCloud customers. At the time, [the way] we thought about monitoring was still in a per-instance-oriented monitoring system. [Later], I was one of the first engineers on the operational side of Facebook and eventually became part of the infrastructure team at Facebook. When I joined, Facebook basically was using a collection of open source software for monitoring and configuration, so these are things that everybody knows — Nagios, Ganglia. It started out basically using just per-instance instant monitoring techniques, basically the same techniques that we used back at LoudCloud, but interestingly and very quickly as Facebook grew, this per-instance-oriented monitoring no longer worked because we went from tens or thousands of servers to hundreds of thousands of servers, from tens of services to hundreds and thousands of services internally.