"data analysis" entries

Kristian Hammond on truly democratizing data and the value of AI in the enterprise

The O'Reilly Radar Podcast: Narrative Science's foray into proprietary business data and bridging the data gap.

Subscribe to the O’Reilly Radar Podcast to track the technologies and people that will shape our world in the years to come.


In this week’s episode, O’Reilly’s Mac Slocum chats with Kristian Hammond, Narrative Science’s chief scientist. Hammond talks about Natural Language Generation, Narrative Science’s shift into the world of business data, and evolving beyond the dashboard.

Here are a few highlights:

We’re not telling people what the data are; we’re telling people what has happened in the world through a view of that data. I don’t care what the numbers are; I care about who are my best salespeople, where are my logistical bottlenecks. Quill can do that analysis and then tell you — not make you fight with it, but just tell you — and tell you in a way that is understandable and includes an explanation about why it believes this to be the case. Our focus is entirely, a little bit in media, but almost entirely in proprietary business data, and in particular we really focus on financial services right now.

You can’t make good on that promise [of what big data was supposed to do] unless you communicate it in the right way. People don’t understand charts; they don’t understand graphs; they don’t understand lines on a page. They just don’t. We can’t be angry at them for being human. Instead we should actually have the machine do what it needs to do in order to fill that gap between what it knows and what people need to know.

Read more…

Four short links: 29 October 2015


Cloud Passports, Better Python Notebooks, Slippery Telcos, and Python Data Journalism

  1. Australia Floating the Idea of Cloud Passports: “Under a cloud passport, a traveller’s identity and biometrics data would be stored in a cloud, so passengers would no longer need to carry their passports and risk having them lost or stolen.” That sound you hear is Taylor Swift on Security, quoting “Wildest Dreams” into her vodka and Tang: “I can see the end as it begins.” This article is also notable for this line: “The idea of cloud passports is the result of a hipster-style-hackathon.”
  2. Jupyter — Python notebooks that allow you to create and share documents that contain live code, equations, visualizations, and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning, and much more.
  3. Telcos $24B Business In Your Data: “Under the radar, Verizon, Sprint, Telefonica, and other carriers have partnered with firms including SAP, IBM, HP, and AirSage to manage, package, and sell various levels of data to marketers and other clients. It’s all part of a push by the world’s largest phone operators to counteract diminishing subscriber growth through new business ventures that tap into the data that showers from consumers’ mobile Web surfing, text messaging, and phone calls.” Even if you do pay for it, you’re still the product.
  4. Introducing Agate — a Python data analysis library designed to be usable by non-data-scientists, so it leads to readable and predictable code. Target market: data journalists.

Translating data into knowledge

Best practices for data preparation — what you need to know before data analysis can begin.

Download “Data Preparation in the Big Data Era,” a new free report to help you manage the challenges of data cleaning and preparation.

Data is growing at an exponential rate worldwide, with huge business opportunities and challenges for every industry. In 2016, global Internet traffic will reach 90 exabytes per month, according to a recent Cisco report. The ability to manage and analyze an unprecedented amount of data will be the key to success for every industry.

To exploit the benefits of a big data strategy, a key question is how to translate all of that data into useful knowledge. To meet this challenge, a company first needs to have a clear picture of its strategic knowledge assets, such as its area of expertise, core competencies, and intellectual property.

Having a clear picture of the business model and the relationships with distributors, suppliers, and customers is extremely useful in order to design a tactical and strategic decision-making process. The true potential value of big data is only gained when placed in a business context, where data analysis drives better decisions — otherwise, it’s just data.

In a new O’Reilly report, “Data Preparation in the Big Data Era,” we provide a step-by-step guide to managing the challenges of data cleaning and preparation, which are critical steps before effective data analysis can begin. We explore the common problems of data preparation and the different steps involved, including data cleaning, combination, and transformation. You’ll also learn about new products that deal with the problem of data variety at scale, including Tamr’s solution, which curates data at scale using a combination of machine learning and expert feedback. Read more…


A “bottom-up” approach to data unification

How machine learning plus expert sourcing can unify customer data at scale.


Watch the free webcast Integrating Customer Data at Scale to learn how Toyota Motor Europe was able to unify its customer data at scale.

Enterprises that are capable of gaining a unified view of their customer data can achieve added business enhancements and user opportunities. Capturing customer data, however, can be a difficult task, as most systems rely on traditional “top-down” approaches to standardizing data. In a recent O’Reilly webcast, Integrating Customer Data at Scale, Tamr field engineer Alan Wagner hosts a Q&A session with Matt Stevens, the general manager at Toyota Motor Europe, to demonstrate how a leading enterprise uses a third-generation system like Tamr to simplify the process of unifying customer data.

In the webcast, Stevens explains how Toyota Motor Europe has gained a 360-degree view of their customers through the Tamr Data Unification Platform, which takes a machine learning and expert-sourcing “human guided workflow” approach to data unification. Wagner provides a demo of the Tamr platform, applied within a Salesforce application, to demonstrate the ability to capture and unify customer data. Read more…


Signals from Strata + Hadoop World 2015 in London

Key insights from Strata + Hadoop World 2015 in London.

People from across the data world came together this week for Strata + Hadoop World 2015 in London. Below we’ve assembled notable keynotes, interviews, and insights from the event.

Shazam already knows the next big hit

“With relative accuracy, we can predict 33 days out what song will go to No. 1 on the Billboard charts in the U.S.,” says Cait O’Riordan, VP of product for music and platforms at Shazam. O’Riordan walks through the data points and trendlines — including the “shape of a pop song” — that give Shazam hints about hits.

Read more…

Four short links: 1 April 2015


Tuning Fanout, Moore's Law, 3D Everything, and Social Graph Analysis

  1. Facebook’s Mystery Machine: “The goal of this paper is very similar to that of Google Dapper[…]. Both work [to] try to figure out bottlenecks in performance in high fanout large-scale Internet services. Both work us[ing] similar methods, however this work (the mystery machine) tries to accomplish the task relying on less instrumentation than Google Dapper. The novelty of the mystery machine work is that it tries to infer the component call graph implicitly via mining the logs, where as Google Dapper instrumented each call in a meticulous manner and explicitly obtained the entire call graph.”
  2. The Multiple Lives of Moore’s Law: “A shrinking transistor not only allowed more components to be crammed onto an integrated circuit but also made those transistors faster and less power hungry. This single factor has been responsible for much of the staying power of Moore’s Law, and it’s lasted through two very different incarnations. In the early days, a phase I call Moore’s Law 1.0, progress came by “scaling up”—adding more components to a chip. At first, the goal was simply to gobble up the discrete components of existing applications and put them in one reliable and inexpensive package. As a result, chips got bigger and more complex. The microprocessor, which emerged in the early 1970s, exemplifies this phase. But over the last few decades, progress in the semiconductor industry became dominated by Moore’s Law 2.0. This era is all about “scaling down,” driving down the size and cost of transistors even if the number of transistors per chip does not go up.”
  3. BoXZY Rapid-Change FabLab: Mill, Laser Engraver, 3D Printer (Kickstarter) — project that promises you the ability to swap out heads to get different behaviour from the “move something in 3 dimensions” infrastructure in the box.
  4. SociaLite (GitHub) — a distributed query language for graph analysis and data mining. (via Ben Lorica)

Bridging the gap in big data silos

The O'Reilly Radar Podcast: John Carnahan on holistic data analysis, engagement channels, and data science as an art form.


In this Radar Podcast episode, I sit down with John Carnahan, executive vice president of data science at Ticketmaster. At our recent Strata + Hadoop World Conference in San Jose, CA, Carnahan presented a session on using data science and machine learning to improve ticket sales and marketing at Ticketmaster.

I took the opportunity to chat with Carnahan about Ticketmaster’s evolving approach to data analysis, the avenues of user engagement they’re investigating, and how his genetics background is informing his work in the big data space.

When Carnahan took the job at Ticketmaster about three years ago, his strategy focused on small, concrete tasks aimed at solving distinct nagging problems: how do you address large numbers of tickets not sold at an event, how do you engage and market those undersold events to fans, and how do you stem abuse of ticket sales. This strategy has evolved, Carnahan explained, to a more holistic approach aimed at bridging the data silos within the company:

“We still want those concrete things, but we want to build a bed of data science assets that’s built on top of a company that’s been around almost 40 years and has a lot of data assets. How do we build the platform that will leverage those things into the future, beyond just those small niche products that we really want to build. We’re trying to bridge the gap between a lot of those products, too. Rather than think of each of those things as a vertical or a silo that’s trying to accomplish something, it’s how do you use something that you’ve built over here, over there to make that better?”

Read more…


Building pipelines to facilitate data analysis

A new operator from the magrittr package makes it easier to use R for data analysis.


In every data analysis, you have to string together many tools. You need tools for data wrangling, visualisation, and modelling to understand what’s going on in your data. To use these tools effectively, you need to be able to easily flow from one tool to the next, focusing on asking and answering questions of the data, not struggling to jam the output from one function into the format needed for the next. Wouldn’t it be nice if the world worked this way! I spend a lot of my time thinking about this problem, and how to make the process of data analysis as fast, effective, and expressive as possible. Today, I want to show you a new technique that I’m particularly excited about.

R, at its heart, is a functional programming language: you do data analysis in R by composing functions. However, the problem with function composition is that a lot of it makes for hard-to-read code. For example, here’s some R code that wrangles flight delay data from New York City in 2013. What does it do? Read more…
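The readability problem with nested function calls is not specific to R. As a rough sketch of the idea, and not the article's R code or the magrittr operator itself, the Python below runs the same three steps twice: once as nested calls that read inside-out, and once through a small `pipe` helper that reads in execution order. The `pipe` helper, the step functions, and the flight records are all hypothetical and made up for illustration.

```python
from functools import reduce
from statistics import mean


def pipe(value, *steps):
    """Pass value through each step in order, left to right (hypothetical helper)."""
    return reduce(lambda acc, step: step(acc), steps, value)


# Made-up records for illustration: (destination, arrival delay in minutes).
flights = [("ORD", 12), ("ORD", 30), ("LAX", -5), ("LAX", 15), ("SFO", 45)]


def drop_early(rows):
    """Keep only flights that arrived on time or late."""
    return [(dest, delay) for dest, delay in rows if delay >= 0]


def group_by_dest(rows):
    """Collect delays into a dict keyed by destination."""
    groups = {}
    for dest, delay in rows:
        groups.setdefault(dest, []).append(delay)
    return groups


def mean_delay(groups):
    """Average the delays for each destination."""
    return {dest: mean(delays) for dest, delays in groups.items()}


# Nested composition: the first step to run is the innermost call.
nested = mean_delay(group_by_dest(drop_early(flights)))

# Pipeline: the steps appear in the order they run.
piped = pipe(flights, drop_early, group_by_dest, mean_delay)

print(piped)
```

Both forms compute the same result. The pipeline version only changes the order in which a reader meets the steps, which is the readability gain the article attributes to piping.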


New scalable solutions for data analysis with R

Addressing in-memory limitations and scalability issues of R.

The R programming language is the most popular statistical software in use today by data scientists, according to the 2013 Rexer Analytics Data Miner survey. One of the main drawbacks of vanilla R is the inability to scale and handle extremely large datasets because by default, R programs are executed in a single thread, and the data being used must be stored completely in RAM. These barriers present a problem for data analysis on massive datasets. For example, the R installation and administration manual suggests using data structures no larger than 10-20% of a computer’s available RAM. Moreover, high-level languages such as R or Matlab incur significant memory overhead because they use temporary copies instead of referencing existing objects.

One potential solution to this issue could come from Teradata’s upcoming product, Teradata Aster R, which runs on the Teradata Aster Discovery Platform. It aims to facilitate the distribution of data analysis over a cluster of machines and to overcome one-node memory limitations in R applications. Read more…

Four short links: 24 March 2014


Google Flu, Embeddable JS, Data Analysis, and Belief in the Browser

  1. The Parable of Google Flu (PDF) — We explore two issues that contributed to [Google Flu Trends]’s mistakes—big data hubris and algorithm dynamics—and offer lessons for moving forward in the big data age. Overtrained and underfed?
  2. Duktape — a lightweight embeddable JavaScript engine. Because an app without an API is like a lightbulb without an IP address: retro but not cool.
  3. Principles of Good Data Analysis (Greg Reda) — Once you’ve settled on your approach and data sources, you need to make sure you understand how the data was generated or captured, especially if you are using your own company’s data. Treble so if you are using data you snaffled off the net, riddled with collection bias and untold omissions. (via Stijn Debrouwere)
  4. Deep Belief Networks in JavaScript — just object recognition in the browser. The code relies on GPU shaders to perform calculations on over 60 million neural connections in real time. From the ever-more-awesome Pete Warden.