- Facebook’s Mystery Machine — The goal of this paper is very similar to that of Google Dapper[…]. Both work [to] try to figure out bottlenecks in performance in high fanout large-scale Internet services. Both work us[ing] similar methods, however this work (the mystery machine) tries to accomplish the task relying on less instrumentation than Google Dapper. The novelty of the mystery machine work is that it tries to infer the component call graph implicitly via mining the logs, where as Google Dapper instrumented each call in a meticulous manner and explicitly obtained the entire call graph.
- The Multiple Lives of Moore’s Law — A shrinking transistor not only allowed more components to be crammed onto an integrated circuit but also made those transistors faster and less power hungry. This single factor has been responsible for much of the staying power of Moore’s Law, and it’s lasted through two very different incarnations. In the early days, a phase I call Moore’s Law 1.0, progress came by “scaling up”—adding more components to a chip. At first, the goal was simply to gobble up the discrete components of existing applications and put them in one reliable and inexpensive package. As a result, chips got bigger and more complex. The microprocessor, which emerged in the early 1970s, exemplifies this phase. But over the last few decades, progress in the semiconductor industry became dominated by Moore’s Law 2.0. This era is all about “scaling down,” driving down the size and cost of transistors even if the number of transistors per chip does not go up.
- BoXZY Rapid-Change FabLab: Mill, Laser Engraver, 3D Printer (Kickstarter) — project that promises you the ability to swap out heads to get different behaviour from the “move something in 3 dimensions” infrastructure in the box.
- SociaLite (Github) — a distributed query language for graph analysis and data mining. (via Ben Lorica)
"data analysis" entries
Best practices for data preparation — what you need to know before data analysis can begin.
Download “Data Preparation in the Big Data Era,” a new free report to help you manage the challenges of data cleaning and preparation.
Data is growing at an exponential rate worldwide, with huge business opportunities and challenges for every industry. In 2016, global Internet traffic will reach 90 exabytes per month, according to a recent Cisco report. The ability to manage and analyze an unprecedented amount of data will be the key to success for every industry.
To exploit the benefits of a big data strategy, a key question is how to translate all of that data into useful knowledge. To meet this challenge, a company first needs to have a clear picture of their strategic knowledge assets, such as their area of expertise, core competencies, and intellectual property.
Having a clear picture of the business model and the relationships with distributors, suppliers, and customers is extremely useful in order to design a tactical and strategic decision-making process. The true potential value of big data is only gained when placed in a business context, where data analysis drives better decisions — otherwise, it’s just data.
In a new O’Reilly report Data Preparation in the Big Data Era, we provide a step-by-step guide to manage the challenges of data cleaning and preparation — critical steps before effective data analysis can begin. We explore the common problems of data preparation and the different steps involved, including data cleaning, combination, and transformation. You’ll also learn about new products that deal with problem of data variety at scale, including Tamr’s solution, which curates data at scale using a combination of machine learning and expert feedback. Read more…
How machine learning plus expert sourcing can unify customer data at scale.
Watch the free webcast Integrating Customer Data at Scale to learn how Toyota Motor Europe was able to unify its customer data at scale.
Enterprises that are capable of gaining a unified view of their customer data can achieve added business enhancements and user opportunities. Capturing customer data, however, can be a difficult task, as most systems rely on traditional “top-down” approaches to standardizing data. In a recent O’Reilly webcast, Integrating Customer Data at Scale, Tamr field engineer Alan Wagner hosts a Q&A session with Matt Stevens, the general manager at Toyota Motor Europe, to demonstrate how a leading enterprise uses a third-generation system like Tamr to simplify the process of unifying customer data.
In the webcast, Stevens explains how Toyota Motor Europe has gained a 360-degree view of their customers through the Tamr Data Unification Platform, which takes a machine learning and expert-sourcing “human guided workflow” approach to data unification. Wagner provides a demo of the Tamr platform, applied within a Salesforce application, to demonstrate the ability to capture and unify customer data. Read more…
Key insights from Strata + Hadoop World 2015 in London.
People from across the data world came together this week for Strata + Hadoop World 2015 in London. Below we’ve assembled notable keynotes, interviews, and insights from the event.
Shazam already knows the next big hit
“With relative accuracy, we can predict 33 days out what song will go to No. 1 on the Billboard charts in the U.S.,” says Cait O’Riordan, VP of product for music and platforms at Shazam. O’Riordan walks through the data points and trendlines — including the “shape of a pop song” — that give Shazam hints about hits.
The O'Reilly Radar Podcast: John Carnahan on holistic data analysis, engagement channels, and data science as an art form.
In this Radar Podcast episode, I sit down with John Carnahan, executive vice president of data science at Ticketmaster. At our recent Strata + Hadoop World Conference in San Jose, CA, Carnahan presented a session on using data science and machine learning to improve ticket sales and marketing at Ticketmaster.
I took the opportunity to chat with Carnahan about Ticketmaster’s evolving approach to data analysis, the avenues of user engagement they’re investigating, and how his genetics background is informing his work in the big data space.
When Carnahan took the job at Ticketmaster about three years ago, his strategy focused on small, concrete tasks aimed at solving distinct nagging problems: how do you address large numbers of tickets not sold at an event, how do you engage and market those undersold events to fans, and how do you stem abuse of ticket sales. This strategy has evolved, Carnahan explained, to a more holistic approach aimed at bridging the data silos within the company:
“We still want those concrete things, but we want to build a bed of data science assets that’s built on top of a company that’s been around almost 40 years and has a lot of data assets. How do we build the platform that will leverage those things into the future, beyond just those small niche products that we really want to build. We’re trying to bridge the gap between a lot of those products, too. Rather than think of each of those things as a vertical or a silo that’s trying to accomplish something, it’s how do you use something that you’ve built over here, over there to make that better?”
A new operator from the magrittr package makes it easier to use R for data analysis.
In every data analysis, you have to string together many tools. You need tools for data wrangling, visualisation, and modelling to understand what’s going on in your data. To use these tools effectively, you need to be able to easily flow from one tool to the next, focusing on asking and answering questions of the data, not struggling to jam the output from one function into the format needed for the next. Wouldn’t it be nice if the world worked this way! I spend a lot of my time thinking about this problem, and how to make the process of data analysis as fast, effective, and expressive as possible. Today, I want to show you a new technique that I’m particularly excited about.
R, at its heart, is a functional programming language: you do data analysis in R by composing functions. However, the problem with function composition is that a lot of it makes for hard-to-read code. For example, here’s some R code that wrangles flight delay data from New York City in 2013. What does it do? Read more…
Addressing in-memory limitations and scalability issues of R.
The R programming language is the most popular statistical software in use today by data scientists, according to the 2013 Rexer Analytics Data Miner survey. One of the main drawbacks of vanilla R is the inability to scale and handle extremely large datasets because by default, R programs are executed in a single thread, and the data being used must be stored completely in RAM. These barriers present a problem for data analysis on massive datasets. For example, the R installation and administration manual suggests using data structures no larger than 10-20% of a computer’s available RAM. Moreover, high-level languages such as R or Matlab incur significant memory overhead because they use temporary copies instead of referencing existing objects.
One potential forthcoming solution to this issue could come from Teradata’s upcoming product, Teradata Aster R, which runs on the Teradata Aster Discovery Platform. It aims to facilitate the distribution of data analysis over a cluster of machines and to overcome one-node memory limitations in R applications. Read more…
Edge contributors say it's time to retire the search for one-size-fits-all answers.
The 2014 Edge Annual Question (EAQ) is out. This year, the question posed to the contributors is: What scientific idea is ready for retirement?
As usual with the EAQ, it provokes thought and promotes discussion. I have only read through a fraction of the responses so far, but I think it is important to highlight a few Edge contributors who answered with a common, and in my opinion a very important and timely, theme. The responses that initially caught my attention came from Laurence Smith (UCLA), Gavin Schmidt (NASA), Guilio Boccaletti (The Nature Conservancy) and Danny Hillis (Applied Minds). If I were to have been asked this question, my contribution for idea retirement would likely align most closely with these four responses: Smith and Boccaletti want to see same idea disappear — stationarity; Schmidt’s response focused on the abolition of simple answers; and Hillis wants to do away with cause-and-effect.