In this episode of the O'Reilly Data Show Podcast, Jay Kreps talks about data integration, event data, and the Internet of Things.
At the heart of big data platforms are robust data flows that connect diverse data sources. Over the past few years, a new set of (mostly open source) software components has become critical to tackling data integration problems at scale. By now, many people have heard of tools like Hadoop, Spark, and NoSQL databases, but there are a number of lesser-known components that are “hidden” beneath the surface.
In my conversations with data engineers tasked with building data platforms, one tool stands out: Apache Kafka, a distributed messaging system that originated from LinkedIn. It’s used to synchronize data between systems and has emerged as an important component in real-time analytics.
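To make the synchronization pattern concrete, here is a minimal sketch using the third-party kafka-python client: one system publishes events to a topic and a downstream system replays them. The broker address, topic name, and payload are assumptions for illustration only.

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer: one system publishes change events to a Kafka topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("user-events", b'{"user": 42, "action": "login"}')
producer.flush()

# Consumer: a downstream system replays the same stream of events,
# e.g., to keep a search index or warehouse in sync.
consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)  # process, or load into another datastore
    break
```

Because Kafka retains the event log, many independent consumers can read the same stream at their own pace, which is what makes it useful as a synchronization layer between systems.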
In my travels over the past year, I’ve met engineers across many industries who use Apache Kafka in production. A few months ago, I sat down with O’Reilly author and Radar contributor Jay Kreps, a highly regarded data engineer and former technical lead for Online Data Infrastructure at LinkedIn, and most recently CEO/co-founder of Confluent. Read more…
Introducing Bitcoin & the Blockchain: An O’Reilly Radar Summit
When the creators of bitcoin solved the “double spend” problem in a decentralized manner, they introduced techniques that have implications far beyond digital currency. Our newly announced one-day event — Bitcoin & the Blockchain: An O’Reilly Radar Summit — is in line with our tradition of highlighting applications of developments in computer science. Financial services have long relied on centralized solutions, so in many ways, products from this sector have become canonical examples of the developments we plan to cover over the next few months. But many problems that require an intermediary are being reexamined with techniques developed for bitcoin.
How do you get multiple parties in a transaction to trust each other without an intermediary? In the case of a digital currency like bitcoin, decentralization means reaching consensus over an insecure network. As Mastering Bitcoin author Andreas Antonopoulos noted in an earlier post, several innovations lie at the heart of what makes bitcoin disruptive:
“Bitcoin is a combination of several innovations, arranged in a novel way: a peer-to-peer network, a proof-of-work algorithm, a distributed timestamped accounting ledger, and an elliptic-curve cryptography and key infrastructure. Each of these parts is novel on its own, but the combination and specific arrangement was revolutionary for its time and is beginning to show up in more innovations outside bitcoin itself.”
Rajiv Maheswaran talks about the tools and techniques required to analyze new kinds of sports data.
Many data scientists are comfortable working with structured operational data and unstructured text. Newer techniques like deep learning have opened up data types like images, video, and audio.
Other common data sources are garnering attention. With the rise of mobile phones equipped with GPS, I’m meeting many more data scientists at start-ups and large companies who specialize in spatio-temporal pattern recognition. Analyzing “moving dots” requires specialized tools and techniques.
A few months ago, I sat down with Rajiv Maheswaran, founder and CEO of Second Spectrum, a company that applies analytics to sports tracking data. Maheswaran talked about this new kind of data and the challenge of finding patterns:
“It’s interesting because it’s a new type of data problem. Everybody knows that big data machine learning has done a lot of stuff in structured data, in photos, in translation for language, but moving dots is a very new kind of data where you haven’t figured out the right feature set to be able to find patterns from. There’s no language of moving dots, at least not that computers understand. People understand it very well, but there’s no computational language of moving dots that are interacting. We wanted to build that up, mostly because data about moving dots is very, very new. It’s only in the last five years, between phones and GPS and new tracking technologies, that moving data has actually emerged.”
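As a small illustration of why “moving dots” need their own feature set, the sketch below derives speed and heading from hypothetical (t, x, y) tracking samples; real player-tracking data and the features built on it would be far richer.

```python
import numpy as np

# Hypothetical (t, x, y) samples for one "moving dot", ~25 Hz.
track = np.array([
    [0.00, 0.00, 0.00],
    [0.04, 0.10, 0.05],
    [0.08, 0.25, 0.12],
    [0.12, 0.45, 0.20],
])

dt = np.diff(track[:, 0])                     # time between samples
step = np.diff(track[:, 1:3], axis=0)         # (dx, dy) per sample
speed = np.linalg.norm(step, axis=1) / dt     # one simple derived feature
heading = np.arctan2(step[:, 1], step[:, 0])  # direction of motion
print(speed, heading)
```

Features like these are only the starting point; the hard part Maheswaran describes is finding representations for many interacting trajectories at once.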
A new partnership between O’Reilly and Databricks offers certification and training in Apache Spark.
Editor’s note: full disclosure — Ben is an advisor to Databricks.
I am pleased to announce a joint program between O’Reilly and Databricks to certify Spark developers. O’Reilly has long been interested in certification, and with this inaugural program, we believe we have the right combination — an ascendant framework and a partnership with the team behind the technology. The founding team of Databricks comprises members of the UC Berkeley AMPLab team that created Spark.
The certification exam will be offered at Strata events, through Databricks’ Spark Summits, and at training workshops run by Databricks and its partner companies. A variety of O’Reilly resources will accompany the certification program, including books, training days, and videos targeted at developers and companies interested in the Apache Spark ecosystem. Read more…
New frameworks for interactive business analysis and advanced analytics fuel the rise in tabular data objects.
Long before the advent of “big data,” analysts were building models using tools like R (and its forerunners S/S-PLUS). Productivity hinged on tools that made data wrangling, data inspection, and data modeling convenient. Among R users, this meant proficiency with data frames — objects used to store data matrices that can hold both numeric and categorical data. A data.frame is the data structure consumed by most R analytic libraries.
But not all data scientists use R, nor is R suitable for all data problems. I’ve been watching with interest the growing number of alternative data structures for business analysis and advanced analytics. These new tools are designed to handle much larger data sets and are frequently optimized for specific problems. And they all use idioms that are familiar to data scientists — either SQL-like expressions or syntax similar to that used for R data frames.
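For readers unfamiliar with the idiom, here is a minimal pandas sketch — one of the alternative tabular objects in question — mixing numeric and categorical columns and running a SQL-like aggregation; the data are made up.

```python
import pandas as pd

# A small frame mixing numeric and categorical data,
# analogous to an R data.frame.
df = pd.DataFrame({
    "region": pd.Categorical(["west", "east", "west", "east"]),
    "sales": [120.0, 95.5, 130.2, 88.0],
})

# SQL-like expression: SELECT region, AVG(sales) ... GROUP BY region
print(df.groupby("region", observed=True)["sales"].mean())
```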
Business users are becoming more comfortable with graph analytics.
The rise of sensors and connected devices will lead to applications that draw from network/graph data management and analytics. As the number of devices surpasses the number of people — Cisco estimates 50 billion connected devices by 2020 — one can imagine applications that depend on data stored in graphs with many more nodes and edges than the ones currently maintained by social media companies.
This means that researchers and companies will need to produce real-time tools and techniques that scale to much larger graphs (measured in terms of nodes and edges). I previously listed tools for tapping into graph data, and I continue to track improvements in accessibility, scalability, and performance. For example, at the just-concluded Spark Summit, it was apparent that GraphX remains a high-priority project within the Spark ecosystem.
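As a single-machine sketch of the kind of analytics involved (GraphX runs similar operators over distributed graphs), here is a toy device graph analyzed with networkx; the node names and edges are invented.

```python
import networkx as nx

# Toy graph: nodes are connected devices, edges are observed communications.
G = nx.Graph()
G.add_edges_from([
    ("sensor-1", "gateway"), ("sensor-2", "gateway"),
    ("gateway", "server"), ("sensor-1", "sensor-2"),
])

# Centrality scores identify structurally important devices;
# at IoT scale the same computation needs a distributed engine.
print(nx.pagerank(G))
```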
Researchers and startups are building tools that enable feature discovery.
Why do data scientists spend so much time on data wrangling and data preparation? In many cases, it’s because they want access to the best variables with which to build their models. These variables are known as features in machine-learning parlance. For many data applications, feature engineering and feature selection are just as important as (if not more important than) the choice of algorithm:
Good features allow a simple model to beat a complex model.
(to paraphrase Alon Halevy, Peter Norvig, and Fernando Pereira)
The terminology can be a bit confusing, but to put things in context one can simplify the data science pipeline to highlight the importance of features:
[Figure: a simplified data science pipeline, highlighting feature engineering, i.e., the creation of new features]
A simple example to keep in mind is text mining. One starts with raw text (documents), and extracted features could be individual words or phrases. In this setting, a feature could indicate the frequency of a specific word or phrase. Features are then used to classify and cluster documents, or to extract topics associated with the raw text. The process usually involves the creation of new features (feature engineering) and identifying the most essential ones (feature selection).
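Continuing the text-mining example, here is a small scikit-learn sketch that engineers n-gram count features from raw documents and then selects the terms most associated with the labels; the corpus and labels are toy data for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["the cat sat", "dogs chase cats", "stocks fell sharply"]  # toy corpus
labels = [0, 0, 1]                                                # toy classes

# Feature engineering: turn raw text into word/phrase (n-gram) counts.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

# Feature selection: keep the k terms most informative about the labels.
selector = SelectKBest(chi2, k=3)
X_selected = selector.fit_transform(X, labels)
print(X_selected.shape)
```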
Many more companies want to highlight how they're using Apache Spark in production.
One of the trends we’re following closely at Strata is the emergence of vertical applications. As components for creating large-scale data infrastructures enter their early stages of maturation, companies are focusing on solving data problems in specific industries rather than building tools from scratch. Virtually all of these components are open source and have contributors across many companies. Organizations are also sharing best practices for building big data applications, through blog posts, white papers, and presentations at conferences like Strata.
These trends are particularly apparent in a set of technologies that originated from UC Berkeley’s AMPLab: the number of companies that are using (or plan to use) Spark in production has exploded over the last year. The surge in popularity of the Apache Spark ecosystem stems from the maturation of its individual open source components and the growing community of users. The tight integration of high-performance tools that address different problems and workloads, coupled with a simple programming interface (in Python, Java, Scala), makes Spark one of the most popular projects in big data.
[Figure: charts showing the amount of active development in Spark]
For the second year in a row, I’ve had the privilege of serving on the program committee for the Spark Summit. I’d like to highlight a few areas where Apache Spark is making inroads. I’ll focus on proposals from companies building applications on top of Spark.
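For context on that “simple programming interface,” here is the canonical word-count example in PySpark; the local master and input path are assumptions for illustration.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount")

# The classic example: a few chained transformations express the whole job.
counts = (
    sc.textFile("data.txt")                 # hypothetical input file
      .flatMap(lambda line: line.split())   # one record per word
      .map(lambda word: (word, 1))
      .reduceByKey(lambda a, b: a + b)      # sum counts per word
)
print(counts.take(5))
sc.stop()
```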
Casting a critical eye on the exciting developments in the world of AI.
True AI has been “just around the corner” for 60 years, so why should O’Reilly start covering AI in a big way now? As computing power catches up to scientific and engineering ambitions, and as our ability to learn directly from sensory signals — i.e., big data — increases, intelligent systems are having a real and widespread impact. Every Internet user benefits from these systems today — they sort our email, plan our journeys, answer our questions, and protect us from fraudsters. And, with the Internet of Things, these systems have already started to keep our houses and offices comfortable and well-lit, our data centers running more efficiently, and our industrial processes humming; they have even begun to drive our cars. Read more…