- Australia Floating the Idea of Cloud Passports: Under a cloud passport, a traveller’s identity and biometrics data would be stored in a cloud, so passengers would no longer need to carry their passports and risk having them lost or stolen. That sound you hear is Taylor Swift on Security, quoting “Wildest Dreams” into her vodka and Tang: “I can see the end as it begins.” This article is also notable for the line “The idea of cloud passports is the result of a hipster-style-hackathon.”
- Jupyter: Python notebooks that allow you to create and share documents that contain live code, equations, visualizations, and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning, and much more.
- Telcos $24B Business In Your Data: Under the radar, Verizon, Sprint, Telefonica, and other carriers have partnered with firms including SAP, IBM, HP, and AirSage to manage, package, and sell various levels of data to marketers and other clients. It’s all part of a push by the world’s largest phone operators to counteract diminishing subscriber growth through new business ventures that tap into the data that showers from consumers’ mobile Web surfing, text messaging, and phone calls. Even if you do pay for it, you’re still the product.
- Introducing Agate: a Python data analysis library designed to be usable by non-data-scientists, which leads to readable and predictable code. Target market: data journalists.
"data analysis" entries
The O'Reilly Radar Podcast: Narrative Science's foray into proprietary business data and bridging the data gap.
Subscribe to the O’Reilly Radar Podcast to track the technologies and people that will shape our world in the years to come.
In this week’s episode, O’Reilly’s Mac Slocum chats with Kristian Hammond, Narrative Science’s chief scientist. Hammond talks about Natural Language Generation, Narrative Science’s shift into the world of business data, and evolving beyond the dashboard.
Here are a few highlights:
We’re not telling people what the data are; we’re telling people what has happened in the world through a view of that data. I don’t care what the numbers are; I care about who are my best salespeople, where are my logistical bottlenecks. Quill can do that analysis and then tell you — not make you fight with it, but just tell you — and tell you in a way that is understandable and includes an explanation about why it believes this to be the case. Our focus is entirely, a little bit in media, but almost entirely in proprietary business data, and in particular we really focus on financial services right now.
You can’t make good on that promise [of what big data was supposed to do] unless you communicate it in the right way. People don’t understand charts; they don’t understand graphs; they don’t understand lines on a page. They just don’t. We can’t be angry at them for being human. Instead we should actually have the machine do what it needs to do in order to fill that gap between what it knows and what people need to know.
Best practices for data preparation — what you need to know before data analysis can begin.
Download “Data Preparation in the Big Data Era,” a new free report to help you manage the challenges of data cleaning and preparation.
Data is growing at an exponential rate worldwide, with huge business opportunities and challenges for every industry. In 2016, global Internet traffic will reach 90 exabytes per month, according to a recent Cisco report. The ability to manage and analyze an unprecedented amount of data will be the key to success for every industry.
To exploit the benefits of a big data strategy, a key question is how to translate all of that data into useful knowledge. To meet this challenge, a company first needs to have a clear picture of their strategic knowledge assets, such as their area of expertise, core competencies, and intellectual property.
Having a clear picture of the business model and the relationships with distributors, suppliers, and customers is extremely useful in order to design a tactical and strategic decision-making process. The true potential value of big data is only gained when placed in a business context, where data analysis drives better decisions — otherwise, it’s just data.
In a new O’Reilly report, Data Preparation in the Big Data Era, we provide a step-by-step guide to managing the challenges of data cleaning and preparation, which are critical steps before effective data analysis can begin. We explore the common problems of data preparation and the different steps involved, including data cleaning, combination, and transformation. You’ll also learn about new products that deal with the problem of data variety at scale, including Tamr’s solution, which curates data at scale using a combination of machine learning and expert feedback. Read more…
How machine learning plus expert sourcing can unify customer data at scale.
Watch the free webcast Integrating Customer Data at Scale to learn how Toyota Motor Europe was able to unify its customer data at scale.
Enterprises that are capable of gaining a unified view of their customer data can achieve added business enhancements and user opportunities. Capturing customer data, however, can be a difficult task, as most systems rely on traditional “top-down” approaches to standardizing data. In a recent O’Reilly webcast, Integrating Customer Data at Scale, Tamr field engineer Alan Wagner hosts a Q&A session with Matt Stevens, the general manager at Toyota Motor Europe, to demonstrate how a leading enterprise uses a third-generation system like Tamr to simplify the process of unifying customer data.
In the webcast, Stevens explains how Toyota Motor Europe has gained a 360-degree view of their customers through the Tamr Data Unification Platform, which takes a machine learning and expert-sourcing “human guided workflow” approach to data unification. Wagner provides a demo of the Tamr platform, applied within a Salesforce application, to demonstrate the ability to capture and unify customer data. Read more…
Key insights from Strata + Hadoop World 2015 in London.
People from across the data world came together this week for Strata + Hadoop World 2015 in London. Below we’ve assembled notable keynotes, interviews, and insights from the event.
Shazam already knows the next big hit
“With relative accuracy, we can predict 33 days out what song will go to No. 1 on the Billboard charts in the U.S.,” says Cait O’Riordan, VP of product for music and platforms at Shazam. O’Riordan walks through the data points and trendlines — including the “shape of a pop song” — that give Shazam hints about hits.
The O'Reilly Radar Podcast: John Carnahan on holistic data analysis, engagement channels, and data science as an art form.
In this Radar Podcast episode, I sit down with John Carnahan, executive vice president of data science at Ticketmaster. At our recent Strata + Hadoop World Conference in San Jose, CA, Carnahan presented a session on using data science and machine learning to improve ticket sales and marketing at Ticketmaster.
I took the opportunity to chat with Carnahan about Ticketmaster’s evolving approach to data analysis, the avenues of user engagement they’re investigating, and how his genetics background is informing his work in the big data space.
When Carnahan took the job at Ticketmaster about three years ago, his strategy focused on small, concrete tasks aimed at solving distinct nagging problems: how do you address large numbers of tickets not sold at an event, how do you engage and market those undersold events to fans, and how do you stem abuse of ticket sales. This strategy has evolved, Carnahan explained, to a more holistic approach aimed at bridging the data silos within the company:
“We still want those concrete things, but we want to build a bed of data science assets that’s built on top of a company that’s been around almost 40 years and has a lot of data assets. How do we build the platform that will leverage those things into the future, beyond just those small niche products that we really want to build. We’re trying to bridge the gap between a lot of those products, too. Rather than think of each of those things as a vertical or a silo that’s trying to accomplish something, it’s how do you use something that you’ve built over here, over there to make that better?”
A new operator from the magrittr package makes it easier to use R for data analysis.
In every data analysis, you have to string together many tools. You need tools for data wrangling, visualisation, and modelling to understand what’s going on in your data. To use these tools effectively, you need to be able to easily flow from one tool to the next, focusing on asking and answering questions of the data, not struggling to jam the output from one function into the format needed for the next. Wouldn’t it be nice if the world worked this way! I spend a lot of my time thinking about this problem, and how to make the process of data analysis as fast, effective, and expressive as possible. Today, I want to show you a new technique that I’m particularly excited about.
R, at its heart, is a functional programming language: you do data analysis in R by composing functions. However, the problem with function composition is that a lot of it makes for hard-to-read code. For example, here’s some R code that wrangles flight delay data from New York City in 2013. What does it do? Read more…
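The flight-delay code itself is not reproduced in this excerpt. To give a rough feel for the readability problem, here is a minimal Python sketch of the same idea: the nested form has to be read inside out, while a left-to-right pipeline reads in the order the steps run. This is an analogue only, not R and not the magrittr operator; the `pipe` helper, the step functions, and the sample records are all hypothetical.

```python
from functools import reduce

# Hypothetical sample records: (carrier, departure delay in minutes).
flights = [("AA", 12), ("AA", -3), ("UA", 45), ("UA", 5), ("DL", 0)]


def drop_early(rows):
    """Keep only flights that departed late (delay > 0)."""
    return [row for row in rows if row[1] > 0]


def group_by_carrier(rows):
    """Collect the delays for each carrier into a list."""
    groups = {}
    for carrier, delay in rows:
        groups.setdefault(carrier, []).append(delay)
    return groups


def mean_delay(groups):
    """Average the delays within each carrier group."""
    return {carrier: sum(d) / len(d) for carrier, d in groups.items()}


def pipe(value, *steps):
    """Pass value through each step in order, left to right."""
    return reduce(lambda acc, step: step(acc), steps, value)


# Nested composition: the first step to run is the innermost call.
nested = mean_delay(group_by_carrier(drop_early(flights)))

# Pipeline style: steps are listed in the order they execute.
piped = pipe(flights, drop_early, group_by_carrier, mean_delay)
```

Both forms compute the same result; the difference is only in how easily a reader can follow the sequence of steps.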
Addressing in-memory limitations and scalability issues of R.
The R programming language is the most popular statistical software in use today by data scientists, according to the 2013 Rexer Analytics Data Miner survey. One of the main drawbacks of vanilla R is its inability to scale to extremely large datasets: by default, R programs execute in a single thread, and the data being used must fit entirely in RAM. These barriers present a problem for data analysis on massive datasets. For example, the R installation and administration manual suggests using data structures no larger than 10-20% of a computer’s available RAM. Moreover, high-level languages such as R or MATLAB incur significant memory overhead because they use temporary copies instead of referencing existing objects.
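To make the 10-20% guideline concrete, the arithmetic can be sketched in a few lines of Python. This is a back-of-the-envelope illustration, not a measurement of R's actual memory use: the function names, the 16 GiB machine, and the table dimensions are hypothetical, and the estimate assumes a purely numeric table stored as 8-byte doubles with no overhead or temporary copies.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def memory_budget(ram_bytes, low=0.10, high=0.20):
    """Return the (low, high) byte budget under the 10-20% guideline."""
    return ram_bytes * low, ram_bytes * high


def estimate_numeric_bytes(rows, cols, bytes_per_value=8):
    """Rough size of an all-numeric table of 8-byte doubles."""
    return rows * cols * bytes_per_value


def fits_in_budget(data_bytes, ram_bytes, fraction=0.20):
    """True if the data stays within the chosen fraction of RAM."""
    return data_bytes <= ram_bytes * fraction


# Hypothetical machine with 16 GiB of RAM.
ram = 16 * GIB
low, high = memory_budget(ram)

# Hypothetical table: 100 million rows by 10 numeric columns.
table_bytes = estimate_numeric_bytes(100_000_000, 10)

print(f"Budget: {low / GIB:.1f} to {high / GIB:.1f} GiB")
print(f"Table:  {table_bytes / GIB:.2f} GiB")
print("Fits within 20% of RAM:", fits_in_budget(table_bytes, ram))
```

Under these assumptions the table needs 8 billion bytes, well above the upper budget of 3.2 GiB, even though it is smaller than the machine's total RAM. Temporary copies made during analysis would push the real requirement higher still.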
One potential solution to this issue could come from Teradata’s upcoming product, Teradata Aster R, which runs on the Teradata Aster Discovery Platform. It aims to facilitate the distribution of data analysis over a cluster of machines and to overcome one-node memory limitations in R applications. Read more…