Data analysis and commentary: O'Reilly Radar

Graph databases are powering mission-critical applications

The O’Reilly Data Show Podcast: Emil Eifrem on popular applications of graph technologies, cloud computing, and company culture.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

While most people associate graphs with social media analysis, there are a wide range of applications — including recommendations, fraud detection, I.T. operations, and security — that are routinely framed using graphs. This wide variety of use cases has led to rise to many interesting tools for storing, managing, visualizing, and analyzing massive graphs. The important thing to note is that graph databases are not limited to reporting and analytics, but are also being used to power mission critical applications.

In this episode of the O’Reilly Data Show, I sat down with Emil Eifrem, CEO and co-founder of Neo Technology. We talked about the early days of NoSQL, applications of graph databases, cloud computing, and company culture in the U.S. and Sweden.

Graph and NoSQL databases

The relational database had been an accelerator, and here it’s really slowing us down. What we ended up concluding was that the problem was this mismatch between the shape of the data and the abstractions that were exposed by our infrastructure. At that point, we said, okay, what if we had a database that just exposed these amazing network-oriented data structures or graph-oriented data structures, but other than that, had all the properties of a relational database. Wouldn’t that be great? … Ultimately, we said the famous last words: ‘Hey, let’s just build it ourselves. How hard can it be?’ It turns out it’s 15 years later!

…

2007 is when both the Dynamo paper had been published and the BigTable paper had been published out of Amazon and Google, respectively. That’s when, in early adopter circuits, the discourse started to change … maybe the era of the one-size-fits-all database is over. Maybe our job isn’t to take all of our data and shove it through a relational database. Maybe there are some other tools and technologies and abstractions out there that make better sense for some data. That was in ’07. I really think it was as if lightning struck in the community. … . [Dynamo and BigTable were announced] and the next day, 12 open source projects, implementing it, and then the next day, 24 new ones. It was just crazy back then.

Read more…

Jai Ranganathan on architecting big data applications in the cloud

The O’Reilly Data Show podcast: The Hadoop ecosystem, the recent surge in interest in all things real time, and developments in hardware.

by Ben Lorica | @bigdata | +Ben Lorica | November 19, 2015

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

Given the quick pace of innovation in the data ecosystem, we like to take a step back from the details of individual components, architecture, and applications, in order to take a wider view of the landscape of big data. This allows us to evaluate the progress of technology and infrastructure along the way, shifting our attention from the details of individual components like Spark and Kafka, to larger trends.

Some of the larger trends we’ve been exploring include the capabilities of distributed machine learning and the tradeoffs and design decisions involved in cloud architecture and stream processing.

In this episode of the O’Reilly Data Show, I sat down with Jai Ranganathan, senior director of product management at Cloudera. We talked about the trends in the Hadoop ecosystem, cloud computing, the recent surge in interest in all things real time, and hardware trends:

Large-scale machine learning

This sounds a bit like this should already exist in really good form right now, but one of the things that I’m really interested in is expanding the set of capabilities for distributed machine learning. While there are systems out there today that do do this, I think relative to what you can experience from a singular environment learning scikit-learn or R, the set of things you can do in a distributed fashion is limited. … It’s not easy to distribute various algorithms and model-building techniques. I think there is still a lot of work for us to do to improve that experience. … And I do want to have good open source options like MLlib. MLlib may be the right answer. I would be perfectly happy if that’s the final answer, but we do need systems just to provide the kind of depth that you typically are used to in the singular environment. That’s just a matter of time and investment because these are non-trivial problems, but they are things that people are working on.

Read more…

Building systems for massive scale data applications

The O’Reilly Data Show podcast: Tyler Akidau on the evolution of systems for bounded and unbounded data processing.

by Ben Lorica | @bigdata | +Ben Lorica | November 5, 2015

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

Many of the open source systems and projects we’ve come to love — including Hadoop and HBase — were inspired by systems used internally within Google. These systems were described in papers and implemented by people who needed frameworks that could comfortably scale to massive data sets.

Google engineers and scientists continue to publish interesting papers, and these days some of the big data systems they describe in publications are available on their cloud platform.

In this episode of the O’Reilly Data Show, I sat down with Tyler Akidau one of the lead engineers in Google’s streaming and Dataflow technologies. He recently wrote an extremely popular article that provided a framework for how to think about bounded and unbounded data processing (a follow-up article is due out soon). We talked about the evolution of stream processing, the challenges of building systems that scale to massive data sets, and the recent surge in interest in all things real time:

On the need for MillWheel: A new stream processing engine

At the time [that MillWheel was built], there was, as far as I know, literally nothing externally that could handle the scale that we needed to handle. A lot of the existing streaming systems didn’t focus on out-of-order processing, which was a big deal for us internally. Also we really wanted to hit a strong focus on consistency — being able to get absolutely correct answers. … All three of these things were lacking in at least some area in [the systems we examined].

Read more…

Turning big data into actionable insights

The O’Reilly Data Show podcast: Evangelos Simoudis on data mining, investing in data startups, and corporate innovation.

by Ben Lorica | @bigdata | +Ben Lorica | October 22, 2015

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

Can developments in data science and big data infrastructure drive corporate innovation? To be fair, many companies are still in the early stages of incorporating these ideas and tools into their organizations.

Evangelos Simoudis has spent many years interacting with entrepreneurs and executives at major global corporations. Most recently, he’s been advising companies interested in developing long-term strategies pertaining to big data, data science, cloud computing, and innovation. He began his career as a data mining researcher and practitioner, and is counted among the pioneers who helped data mining technologies get adopted in industry.

In this episode of the O’Reilly Data Show, I sat down with Simoudis and we talked about his thoughts on investing, data applications and products, and corporate innovation:

Open source software companies

I very much appreciate open source. I encourage my portfolio companies to use open source components as appropriate, but I’ve never seen the business model as being one that is particularly easy to really build the companies around them. Everybody points to Red Hat, and that may be the exception, but I have not seen companies that have, on the one hand, remained true to the open source principles and become big and successful companies that do not require constant investment. … The revenue streams never prove to be sufficient for building big companies. I think the companies that get started from open source in order to become big and successful … [are] ones that, at some point, decided to become far more proprietary in their model and in the services that they deliver. Or they become pure professional services companies as opposed to support services companies. Then they reach the necessary levels of success.

Read more…

Get started with cloud-based data science

Learn how to deploy machine learning solutions using Azure ML.

by Stephen Elston | October 22, 2015

Download the free, updated report “Data Science in the Cloud with Microsoft Azure Machine Learning and R: 2015 Update.“

Cloud-based machine learning platforms, like Microsoft’s Azure Machine Learning (Azure ML), provide a simplified path to create and deploy analytic solutions. Azure ML is a fully managed and secure machine learning platform that resides within the Microsoft Cortana Analytics Suite.

Azure ML workflows (known as “experiments”) are constructed using a combination of drag-and-drop modules, SQL, R, and Python scripts. The wide range of built modules support the typical steps in a machine learning workflow, from data ingestion and data munging to model construction and cross validation.

Once your Azure ML experiment is ready, there are several options to deploy it. Azure ML experiments can access large-scale data stored in Azure Blob storage, Azure SQL and Hive, to name a few options. Similarly, your experiment can write results back to multiple scalable Azure storage options.

Read more…

We need open and vendor-neutral metadata services

Comprehensive metadata collection and analysis can pave the way for many interesting applications.

by Ben Lorica | @bigdata | +Ben Lorica | October 12, 2015

As I spoke with friends leading up to Strata + Hadoop World NYC 2015, one topic continued to come up: metadata. It’s a topic that data engineers and data management researchers have long thought about because it has significant effects on the systems they maintain and the services they offer. I’ve also been having more and more conversations about applications made possible by metadata collection and analysis.

At the recent Strata + Hadoop World, U.C. Berkeley professor and Trifacta co-founder Joe Hellerstein outlined the reasons why the broader data industry should rally to develop open and vendor-neutral metadata services. He made the case that improvements in metadata collection and sharing can lead to interesting applications and capabilities within the industry.

Below are some of the reasons why Hellerstein believes the data industry should start focusing more on metadata:

Improved data analysis: metadata-on-use

You will never know your data better than when you are wrangling and analyzing it. — Joe Hellerstein

A few years ago, I observed that context-switching — due to using multiple frameworks — created a lag in productivity. Today’s tools have improved to the point that someone using a single framework like Apache Spark can get many of their data tasks done without having to employ other programming environments. But outside of tracking in detail the actions and choices analysts make, as well as the rationales behind them, today’s tools still do a poor job of capturing how people interact and work with data.

Read more…

Graph databases are powering mission-critical applications

The O’Reilly Data Show Podcast: Emil Eifrem on popular applications of graph technologies, cloud computing, and company culture.

Graph and NoSQL databases

Jai Ranganathan on architecting big data applications in the cloud

The O’Reilly Data Show podcast: The Hadoop ecosystem, the recent surge in interest in all things real time, and developments in hardware.

Large-scale machine learning

Building systems for massive scale data applications

The O’Reilly Data Show podcast: Tyler Akidau on the evolution of systems for bounded and unbounded data processing.

On the need for MillWheel: A new stream processing engine

Turning big data into actionable insights

The O’Reilly Data Show podcast: Evangelos Simoudis on data mining, investing in data startups, and corporate innovation.

Open source software companies

Get started with cloud-based data science

Learn how to deploy machine learning solutions using Azure ML.

We need open and vendor-neutral metadata services

Comprehensive metadata collection and analysis can pave the way for many interesting applications.

Improved data analysis: metadata-on-use

Get the O’Reilly Data Newsletter