- Facebook’s Mystery Machine — The goal of this paper is very similar to that of Google Dapper[…]. Both work [to] try to figure out bottlenecks in performance in high fanout large-scale Internet services. Both work us[ing] similar methods, however this work (the mystery machine) tries to accomplish the task relying on less instrumentation than Google Dapper. The novelty of the mystery machine work is that it tries to infer the component call graph implicitly via mining the logs, where as Google Dapper instrumented each call in a meticulous manner and explicitly obtained the entire call graph.
- The Multiple Lives of Moore’s Law — A shrinking transistor not only allowed more components to be crammed onto an integrated circuit but also made those transistors faster and less power hungry. This single factor has been responsible for much of the staying power of Moore’s Law, and it’s lasted through two very different incarnations. In the early days, a phase I call Moore’s Law 1.0, progress came by “scaling up”—adding more components to a chip. At first, the goal was simply to gobble up the discrete components of existing applications and put them in one reliable and inexpensive package. As a result, chips got bigger and more complex. The microprocessor, which emerged in the early 1970s, exemplifies this phase. But over the last few decades, progress in the semiconductor industry became dominated by Moore’s Law 2.0. This era is all about “scaling down,” driving down the size and cost of transistors even if the number of transistors per chip does not go up.
- BoXZY Rapid-Change FabLab: Mill, Laser Engraver, 3D Printer (Kickstarter) — project that promises you the ability to swap out heads to get different behaviour from the “move something in 3 dimensions” infrastructure in the box.
- SociaLite (Github) — a distributed query language for graph analysis and data mining. (via Ben Lorica)
"data analysis" entries
Key insights from Strata + Hadoop World 2015 in London.
People from across the data world came together this week for Strata + Hadoop World 2015 in London. Below we’ve assembled notable keynotes, interviews, and insights from the event.
Shazam already knows the next big hit
“With relative accuracy, we can predict 33 days out what song will go to No. 1 on the Billboard charts in the U.S.,” says Cait O’Riordan, VP of product for music and platforms at Shazam. O’Riordan walks through the data points and trendlines — including the “shape of a pop song” — that give Shazam hints about hits.
The O'Reilly Radar Podcast: John Carnahan on holistic data analysis, engagement channels, and data science as an art form.
In this Radar Podcast episode, I sit down with John Carnahan, executive vice president of data science at Ticketmaster. At our recent Strata + Hadoop World Conference in San Jose, CA, Carnahan presented a session on using data science and machine learning to improve ticket sales and marketing at Ticketmaster.
I took the opportunity to chat with Carnahan about Ticketmaster’s evolving approach to data analysis, the avenues of user engagement they’re investigating, and how his genetics background is informing his work in the big data space.
When Carnahan took the job at Ticketmaster about three years ago, his strategy focused on small, concrete tasks aimed at solving distinct nagging problems: how do you address large numbers of tickets not sold at an event, how do you engage and market those undersold events to fans, and how do you stem abuse of ticket sales. This strategy has evolved, Carnahan explained, to a more holistic approach aimed at bridging the data silos within the company:
“We still want those concrete things, but we want to build a bed of data science assets that’s built on top of a company that’s been around almost 40 years and has a lot of data assets. How do we build the platform that will leverage those things into the future, beyond just those small niche products that we really want to build. We’re trying to bridge the gap between a lot of those products, too. Rather than think of each of those things as a vertical or a silo that’s trying to accomplish something, it’s how do you use something that you’ve built over here, over there to make that better?”
A new operator from the magrittr package makes it easier to use R for data analysis.
In every data analysis, you have to string together many tools. You need tools for data wrangling, visualisation, and modelling to understand what’s going on in your data. To use these tools effectively, you need to be able to easily flow from one tool to the next, focusing on asking and answering questions of the data, not struggling to jam the output from one function into the format needed for the next. Wouldn’t it be nice if the world worked this way! I spend a lot of my time thinking about this problem, and how to make the process of data analysis as fast, effective, and expressive as possible. Today, I want to show you a new technique that I’m particularly excited about.
R, at its heart, is a functional programming language: you do data analysis in R by composing functions. However, the problem with function composition is that a lot of it makes for hard-to-read code. For example, here’s some R code that wrangles flight delay data from New York City in 2013. What does it do? Read more…
Addressing in-memory limitations and scalability issues of R.
The R programming language is the most popular statistical software in use today by data scientists, according to the 2013 Rexer Analytics Data Miner survey. One of the main drawbacks of vanilla R is the inability to scale and handle extremely large datasets because by default, R programs are executed in a single thread, and the data being used must be stored completely in RAM. These barriers present a problem for data analysis on massive datasets. For example, the R installation and administration manual suggests using data structures no larger than 10-20% of a computer’s available RAM. Moreover, high-level languages such as R or Matlab incur significant memory overhead because they use temporary copies instead of referencing existing objects.
One potential forthcoming solution to this issue could come from Teradata’s upcoming product, Teradata Aster R, which runs on the Teradata Aster Discovery Platform. It aims to facilitate the distribution of data analysis over a cluster of machines and to overcome one-node memory limitations in R applications. Read more…
Edge contributors say it's time to retire the search for one-size-fits-all answers.
The 2014 Edge Annual Question (EAQ) is out. This year, the question posed to the contributors is: What scientific idea is ready for retirement?
As usual with the EAQ, it provokes thought and promotes discussion. I have only read through a fraction of the responses so far, but I think it is important to highlight a few Edge contributors who answered with a common, and in my opinion a very important and timely, theme. The responses that initially caught my attention came from Laurence Smith (UCLA), Gavin Schmidt (NASA), Guilio Boccaletti (The Nature Conservancy) and Danny Hillis (Applied Minds). If I were to have been asked this question, my contribution for idea retirement would likely align most closely with these four responses: Smith and Boccaletti want to see same idea disappear — stationarity; Schmidt’s response focused on the abolition of simple answers; and Hillis wants to do away with cause-and-effect.
Big Data and analytics are the foundation of personalized medicine
Despite considerable progress in prevention and treatment, cancer remains the second leading cause of death in the United States. Even with the $50 billion pharmaceutical companies spend on research and development every year, any given cancer drug is ineffective in 75% of the patients receiving it. Typically, oncologists start patients on the cheapest likely chemotherapy (or the one their formulary suggests first) and in the 75% likelihood of non-response, iterate with increasingly expensive drugs until they find one that works, or until the patient dies. This process is inefficient and expensive, and subjects patients to unnecessary side effects, as well as causing them to lose precious time in their fight against a progressive disease. The vision is to enable oncologists to prescribe the right chemical the first time–one that will kill the target cancer cells with the least collateral damage to the patient.
How data can improve cancer treatment
Big data is enabling a new understanding of the molecular biology of cancer. The focus has changed over the last 20 years from the location of the tumor in the body (e.g., breast, colon or blood), to the effect of the individual’s genetics, especially the genetics of that individual’s cancer cells, on her response to treatment and sensitivity to side effects. For example, researchers have to date identified four distinct cell genotypes of breast cancer; identifying the cancer genotype allows the oncologist to prescribe the most effective available drug first.
Herceptin, the first drug developed to target a particular cancer genotype (HER2), rapidly demonstrated both the promise and the limitations of this approach. (Among the limitations, HER2 is only one of four known and many unknown breast cancer genotypes, and treatment selects for populations of resistant cancer cells, so the cancer can return in a more virulent form.)
Increasingly available data spurs organizations to make analysis easier
Genomics is making headlines in both academia and the celebrity world. With intense media coverage of Angelina Jolie’s recent double mastectomy after genetic tests revealed that she was predisposed to breast cancer, genetic testing and genomics have been propelled to the front of many more minds.
In this new data field, companies are approaching the collection, analysis, and turning of data into usable information from a variety of angles.