The O’Reilly Data Show podcast: Todd Lipcon on hybrid and specialized tools in distributed systems.
Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.
In recent months, I’ve been hearing about hybrid systems designed to handle different data management needs. At Strata + Hadoop World NYC last week, Cloudera’s Todd Lipcon unveiled an open source storage layer — Kudu — that’s good at both table scans (analytics) and random access (updates and inserts).
While specialized systems will continue to serve companies, there will be situations where the complexity of maintaining multiple systems — to eke out extra performance — will be harder to justify.
During the latest episode of the O’Reilly Data Show Podcast, I sat down with Lipcon to discuss his new project a few weeks before it was released. Here are a few snippets from our conversation:
HDFS and HBase
[Hadoop is] more like a file store. It allows you to upload files onto an arbitrarily sized cluster with 20-plus petabytes, in single clusters. The thing is, you can upload the files but you can’t edit them in place. To make any change, you have to basically put in a new file. What HBase does in distinction is that it has more of a tabular data model, where you can update and insert individual row-by-row data, and then randomly access that data [in] milliseconds. The distinction here is that HDFS is pretty good for large scans where you’re putting in a large data set, maybe doing a full parse over the data set to train a machine learning model or compute an aggregate. If any of that data changes on a frequent basis or if you want to stream the data in or randomly access individual customer records, you’re kind of out of luck on HDFS. Read more…
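The contrast Lipcon draws can be sketched with a toy, in-memory model. This is only an illustration under stated assumptions: the classes `WriteOnceFileStore` and `RowTable`, the paths, and the sample records are hypothetical and are not the HDFS or HBase APIs.

```python
# Toy, in-memory illustration only. These classes are hypothetical and are
# NOT the HDFS or HBase client APIs.


class WriteOnceFileStore:
    """File-store model: whole files are uploaded and never edited in place."""

    def __init__(self):
        self._files = {}

    def upload(self, path, records):
        # An existing file cannot be modified; a change means a new file.
        if path in self._files:
            raise ValueError(f"{path} already exists; upload a new file instead")
        self._files[path] = tuple(records)  # immutable snapshot of the file

    def scan(self, path):
        # Full pass over every record, the access pattern suited to
        # computing an aggregate or training a model.
        return iter(self._files[path])


class RowTable:
    """Tabular model: individual rows are inserted, updated, and read by key."""

    def __init__(self):
        self._rows = {}

    def put(self, key, row):
        # Insert a new row or overwrite an existing one.
        self._rows[key] = dict(row)

    def get(self, key):
        # Random access to a single row by its key.
        return self._rows[key]


files = WriteOnceFileStore()
files.upload("/customers/v1", [{"id": 1, "spend": 10}, {"id": 2, "spend": 5}])
total_from_files = sum(r["spend"] for r in files.scan("/customers/v1"))

# Changing one customer's record means writing a whole new file.
files.upload("/customers/v2", [{"id": 1, "spend": 12}, {"id": 2, "spend": 5}])

table = RowTable()
table.put(1, {"id": 1, "spend": 10})
table.put(2, {"id": 2, "spend": 5})
table.put(1, {"id": 1, "spend": 12})  # update a single row in place
customer_one = table.get(1)
```

In this sketch the file store is good at one thing (a full scan) and the row table at another (keyed reads and updates), which is the gap a hybrid storage layer aims to close.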
A new crop of interesting solutions for the complexity of operating multiple systems in a distributed computing setting.
The 2004 holiday shopping season marked the start of Amazon’s investigation into alternative database technologies that led to the creation of Dynamo, a key-value storage system that went on to inspire several NoSQL projects.
A new group of startups began shifting away from the general-purpose systems favored by companies just a few years earlier. In recent years, we’ve seen a diverse set of DBMS technologies that specialize in handling particular workloads and data models such as OLTP, OLAP, search, RDF, XML, scientific applications, etc. The success and popularity of such systems reinforced the belief that in order to scale and “go fast,” specialized systems are preferable.
In distributed computing, the complexity of maintaining and operating multiple specialized systems has recently led to systems that bridge multiple workloads and data models. Aside from multi-model databases, a growing number of storage and compute engines are adept at handling different workloads and problems. At this week’s Strata + Hadoop World conference in NYC, I had a chance to interact with the creators of some of these new solutions.
OLTP (transactions) and OLAP (analytics)
One of the key announcements at Strata + Hadoop World this week was Project Kudu — an open source storage engine that’s good at both table scans (analytics) and random access (updates and inserts). Its creators are quick to point out that they aren’t out to beat specialized OLTP and OLAP systems. Rather, they’re shooting to build a system that’s “70-80% of the way there on both axes.” The project is very young and lacks enterprise features, but judging from the reaction at the conference, it’s something the big data community will be watching. Leading technology research firms have created a category for systems with related capabilities: HTAP (Gartner) and Trans-analytics (Forrester).
Integration of the data supply chain is key to a reliable and fast big data analytics deployment.
Watch our free webcast “Accelerating Advanced Analytics with Spark” to learn about the architecture, applications, and best practices of Apache Spark.
Apache Hadoop is a mature development framework that, coupled with its large ecosystem and the support and contributions of key players such as Cloudera, Hortonworks, and Yahoo, provides organizations with many tools to manage data of varying sizes.
In the past, Hadoop’s batch-oriented nature using MapReduce was sufficient to meet the processing needs of many organizations. However, increasing demands for faster processing of data have emerged. These demands have been driven by recent developments in streaming technologies, the Internet of Things (IoT) and real-time analytics, to name just a few. These new demands have required new processing models. One significant new technology today that is being used to meet these demands and is gaining considerable interest and widespread support is Apache Spark. Spark’s speed and versatility make it a key part of today’s big-data processing stack in industries from energy to finance. Read more…
Dinner conversation turns into a career retrospective. Food for thought for leaders and leaders-to-be.
Toss Bhudvanbhen co-authored this post.
Over a recent dinner, my conversation with Toss Bhudvanbhen meandered into discussion of how much our jobs had changed since we entered the workforce. We started during the Dot-Com era. Technology was a relatively young field then (frankly, it still is) so there wasn’t a well-trodden career path. We just went with the flow.
Over time our titles changed from “software developer,” to “senior developer,” to “application architect,” and so on, until one day we realized that we were writing less code but sending more e-mails. Attending fewer code reviews but more meetings. Less worried about how to implement a solution, but more concerned with defining the problem and why it needed to be solved. We had somehow taken on leadership roles.
We’ve stuck with it. Toss now works as a principal consultant at Pariveda Solutions and my consulting work focuses on strategic matters around data and technology.
The thing is, we were never formally trained as management. We just learned along the way. What helped was that we’d worked with some amazing leaders, people who set great examples for us and recognized our ability to understand the bigger picture.
Best practices for data preparation — what you need to know before data analysis can begin.
Download “Data Preparation in the Big Data Era,” a new free report to help you manage the challenges of data cleaning and preparation.
Data is growing at an exponential rate worldwide, with huge business opportunities and challenges for every industry. In 2016, global Internet traffic will reach 90 exabytes per month, according to a recent Cisco report. The ability to manage and analyze an unprecedented amount of data will be the key to success for every industry.
To exploit the benefits of a big data strategy, a key question is how to translate all of that data into useful knowledge. To meet this challenge, a company first needs to have a clear picture of their strategic knowledge assets, such as their area of expertise, core competencies, and intellectual property.
Having a clear picture of the business model and the relationships with distributors, suppliers, and customers is extremely useful in order to design a tactical and strategic decision-making process. The true potential value of big data is only gained when placed in a business context, where data analysis drives better decisions — otherwise, it’s just data.
In a new O’Reilly report, Data Preparation in the Big Data Era, we provide a step-by-step guide to managing the challenges of data cleaning and preparation, critical steps before effective data analysis can begin. We explore the common problems of data preparation and the different steps involved, including data cleaning, combination, and transformation. You’ll also learn about new products that deal with the problem of data variety at scale, including Tamr’s solution, which curates data at scale using a combination of machine learning and expert feedback. Read more…
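As a rough illustration of the three steps named above (cleaning, combination, and transformation), here is a minimal sketch in plain Python. It is not taken from the report: the sample records, field names, and the helper functions `clean`, `combine`, and `transform` are all hypothetical, and a real pipeline would face far messier data.

```python
# Hypothetical sample records; field names are illustrative only.
raw_customers = [
    {"id": "1", "name": "  Ada Lovelace ", "country": "uk"},
    {"id": "2", "name": "Alan Turing", "country": "UK"},
    {"id": "2", "name": "Alan Turing", "country": "UK"},  # duplicate row
    {"id": "", "name": "Unknown", "country": "us"},  # missing id
]
raw_orders = [
    {"customer_id": "1", "amount": "19.99"},
    {"customer_id": "2", "amount": "5.00"},
    {"customer_id": "2", "amount": "7.50"},
]


def clean(customers):
    """Cleaning: drop rows without an id, drop duplicates, normalize fields."""
    seen = set()
    cleaned = []
    for rec in customers:
        if not rec["id"] or rec["id"] in seen:
            continue
        seen.add(rec["id"])
        cleaned.append({
            "id": int(rec["id"]),
            "name": rec["name"].strip(),
            "country": rec["country"].upper(),
        })
    return cleaned


def combine(customers, orders):
    """Combination: attach each customer's order amounts to their record."""
    amounts_by_customer = {}
    for order in orders:
        key = int(order["customer_id"])
        amounts_by_customer.setdefault(key, []).append(float(order["amount"]))
    return [dict(c, amounts=amounts_by_customer.get(c["id"], [])) for c in customers]


def transform(joined):
    """Transformation: derive an analysis-ready total per customer."""
    return [
        {
            "id": c["id"],
            "name": c["name"],
            "country": c["country"],
            "total_spend": round(sum(c["amounts"]), 2),
        }
        for c in joined
    ]


prepared = transform(combine(clean(raw_customers), raw_orders))
```

Keeping each step as a separate function makes it easier to inspect intermediate results, which matters because most preparation errors only become visible between steps.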
The O’Reilly Radar Podcast: Paco Nathan and Jesse Anderson on the evolution of the data training landscape.
Subscribe to the O’Reilly Radar Podcast to track the technologies and people that will shape our world in the years to come.
Their discussion focuses on the training landscape in the big data ecosystem, their teaching techniques and the particular content they choose, and some expected future trends.
Here are a few snippets from their chat:
Training vs PowerPoint slides
Anderson: “Often, when you have a startup and somebody says, ‘Well, we need some training,’ what will usually happen is one of the software developers will say, ‘OK, I’ve done some training in the past and I’ll put together some PowerPoints.’ The differences between a training thing and doing some PowerPoints, like at a meetup, is that a training actually has to have hands-on exercises. It has to have artifacts that you use right there in class. You actually need to think through, these are concepts, these are things that the person will need to be successful in that project. It really takes a lot of time and it takes some serious expertise and some experience in how to do that.”
Nathan: “Early on, you would get some committer to go out and do a meetup, maybe talk about an extension to an API or whatever they were working on directly. If there was a client firm that came up and needed training, then they’d peel off somebody. As it evolved, that really didn’t work. That kind of model doesn’t scale. The other thing too is, you really do need people who understand instructional design, who really understand how to manage a classroom. Especially when it gets to any size, it’s not just an afterthought for an engineer to handle.” Read more…