Key insights from Strata + Hadoop World 2015 in London.
People from across the data world came together this week for Strata + Hadoop World 2015 in London. Below we’ve assembled notable keynotes, interviews, and insights from the event.
Shazam already knows the next big hit
“With relative accuracy, we can predict 33 days out what song will go to No. 1 on the Billboard charts in the U.S.,” says Cait O’Riordan, VP of product for music and platforms at Shazam. O’Riordan walks through the data points and trendlines — including the “shape of a pop song” — that give Shazam hints about hits.
Mark Burgess chats about Promise Theory, and Geoffrey Moore discusses a modern approach to his Crossing the Chasm theory.
As systems become increasingly distributed and complex, it’s more important than ever to find ways to accurately describe and analyze those systems, and to formalize intent behind processes, workflows, and collaboration.
A new operator from the magrittr package makes it easier to use R for data analysis.
In every data analysis, you have to string together many tools. You need tools for data wrangling, visualisation, and modelling to understand what’s going on in your data. To use these tools effectively, you need to be able to easily flow from one tool to the next, focusing on asking and answering questions of the data, not struggling to jam the output from one function into the format needed for the next. Wouldn’t it be nice if the world worked this way! I spend a lot of my time thinking about this problem, and how to make the process of data analysis as fast, effective, and expressive as possible. Today, I want to show you a new technique that I’m particularly excited about.
R, at its heart, is a functional programming language: you do data analysis in R by composing functions. However, the problem with function composition is that a lot of it makes for hard-to-read code. For example, here’s some R code that wrangles flight delay data from New York City in 2013. What does it do? Read more…
Data tools are less important than the way you frame your questions.
Max Shron and Jake Porway spoke with me at Strata a few weeks ago about frameworks for making reasoned arguments with data. Max’s recent O’Reilly book, Thinking with Data, outlines the crucial process of developing good questions and creating a plan to answer them. Jake’s nonprofit, DataKind, connects data scientists with worthy causes where they can apply their skills.
A few of the things we talked about:
- The importance of publishing negative scientific results
- Give Directly, an organization that facilitates donations directly to households in Kenya and Uganda. Give Directly was able to model income using satellite data to distinguish thatched roofs from metal roofs.
- Moritz Stefaner calling for a “macroscope”
- Project Cybersyn, Salvador Allende’s plan for encompassing the entire Chilean economy in a single real-time computer system
- Seeing Like a State: How Certain Schemes to Improve the Human Condition Have Failed by James C. Scott
After we recorded this podcast episode at Strata Santa Clara, Max presided over a webcast on his book that’s archived here.
Jim Stogdill, Jon Bruner and Jenn Webb discuss James Burke, ninja homes, IoT standards and robots.
What happens if emerging technology and automation result in a world of abundance, where anyone at anytime can produce anything they need and there’s no need for jobs? In his recent Strata keynote, James Burke warned that society is not prepared for scarcity (and the value it brings) to be a thing of the past — an eventuality Burke predicts will occur in the next 40 years or so. This topic kicks off a discussion between Jim Stogdill, Jon Bruner and myself that we recorded while at Strata.
Link fodder from our chat includes:
- James Burke.
- Roundtable with James Burke, which Alistair Croll aptly described as the “best coffee break ever.”
- Kurt Vonnegut’s Player Piano.
- Norbert Wiener’s The Human Use Of Human Beings: Cybernetics And Society
- Ninja homes.
- Stewart Brand’s How Buildings Learn: What Happens After They’re Built.
- Why Solid, why now?
- Solid: Local in Boston — and upcoming in San Francisco.
- MarkForged carbon fiber 3D printer.
- Rethink Robotics.
- Ryan Cunningham’s Strata session
If you liked this article, you might be interested in a new report, “Building a Solid World,” that explores the key trends and developments that are accelerating the growth of a software-enhanced, networked physical world. (Download the free report.)
Today, it’s shocking (and honestly exciting) how much of my daily experience is determined by a recommender system. These systems drive amazing experiences everywhere, telling me where to eat, what to listen to, what to watch, what to read, and even who I should be friends with. Furthermore, information overload is making recommender systems indispensable, since I can’t find what I want on the web simply using keyword search tools. Recommenders are behind the success of industry leaders like Netflix, Google, Pandora, eHarmony, Facebook, and Amazon. It’s no surprise companies want to integrate recommender systems with their own online experiences. However, as I talk to team after team of smart industry engineers, it has become clear that building and managing these systems is usually a bit out of reach, especially given all the other demands on the team’s time.
Submit your suggestions for videos that make us think about how data, visualizations, and technology are changing us
Each year at Strata, we warm up the crowd in the main keynote sessions with short videos that will make people think. These videos demonstrate the ways that data, technology, and visualization are changing us. Some are funny; some are clever; some are downright disturbing.
For Strata New York + Hadoop World in October, we’re hoping you’ll join in and suggest some videos for us. If you’ve got something you feel captures the zeitgeist of technology at the fringes, then complete this form, and we’ll check it out. We’ll choose some of them as we kick off the event this fall.
An interview with Kristina Chodorow, author of MongoDB: The Definitive Guide, Second Edition
We launched the second edition of Kristina Chodorow’s book, MongoDB: The Definitive Guide at a recent MongoDB conference in San Francisco. Everyone worked hard to make this happen. I filmed a little behind the scenes video with my phone in order to share it with everyone that worked on the book. After I filmed it, I decided to post the video as well as an interview with Kristina. Both the video and interview provide snippets of what it is like to work on the second edition of the MongoDB: The Definitive Guide.
What inspired you to become a software engineer?
Kristina Chodorow: In college, I took a computer science class because it would count towards my math major. I was programming a tic-tac-toe game and thought, “Why can’t I just program it to try to win?” and then I realized I could figure out the actual logic of “trying to win.” I thought that was the coolest thing ever. I took a couple more programming classes, joined the programming team, and started doing CS research. By the time I graduated, I knew I was going to be a programmer.
How did you land at 10gen?
I was thrilled to receive an invitation to a new meetup: the NYC Data Skeptics Meetup. If you’re in the New York area, and you’re interested in seeing data used honestly, stop by!
That announcement pushed me to write another post about data skepticism. The past few days, I’ve seen a resurgence of the slogan that correlation is as good as causation, if you have enough data. And I’m worried. (And I’m not vain enough to think it’s a response to my first post about skepticism; it’s more likely an effect of Cukier’s book.) There’s a fundamental difference between correlation and causation. Correlation is a two-headed arrow: you can’t tell in which direction it flows. Causation is a single-headed arrow: A causes B, not vice versa, at least in a universe that’s subject to entropy.
Let’s do some thought experiments–unfortunately, totally devoid of data. But I don’t think we need data to get to the core of the problem. Think of the classic false correlation (when teaching logic, also used as an example of a false syllogism): there’s a strong correlation between people who eat pickles and people who die. Well, yeah. We laugh. But let’s take this a step further: correlation is a double headed arrow. So not only does this poor logic imply that we can reduce the death rate by preventing people from eating pickles, it also implies that we can harm the chemical companies that produce vinegar by preventing people from dying. Read more…
In a conversation with Q Ethan McCallum (who should be credited as co-author), we wondered how to evaluate data science groups. If you’re looking at an organization’s data science group from the outside, possibly as a potential employee, what can you use to evaluate it? It’s not a simple problem under the best of conditions: you’re not an insider, so you don’t know the full story of how many projects it has tried, whether they have succeeded or failed, relations between the data group, management, and other departments, and all the other stuff you’d like to know but will never be told.
Our starting point was remote: Q told me about Tyler Brulé’s travel writing for Financial Times (behind a paywall, unfortunately), in which he says that a club sandwich is a good proxy for hotel quality: you go into the restaurant and order a club sandwich. A club sandwich isn’t hard to make: there’s no secret recipe or technique that’s going to make Hotel A’s sandwich significantly better than B’s. But it’s easy to cut corners on ingredients and preparation. And if a hotel is cutting corners on their club sandwiches, they’re probably cutting corners in other places.
This reminded me of when my daughter was in first grade, and we looked (briefly) at private schools. All the schools talked the same talk. But if you looked at classes, it was pretty clear that the quality of the music program was a proxy for the quality of the school. After all, it’s easy to shortchange music, and both hard and expensive to do it right. Oddly enough, using the music program as a proxy for evaluating school quality has continued to work through middle school and (public) high school. It’s the first thing to cut when the budget gets tight; and if a school has a good music program with excellent teachers, they’re probably not shortchanging the kids elsewhere.
How does this connect to data science? What are the proxies that allow you to evaluate a data science program from the “outside,” on the information that you might be able to cull from company blogs, a job interview, or even a job posting? We came up with a few ideas:
- Are the data scientists simply human search engines, or do they have real projects that allow them to explore and be curious? If they have management support for learning what can be learned from the organization’s data, and if management listens to what they discover, they’re accomplishing something significant. If they’re just playing Q&A with the company data, finding answers to specific questions without providing any insight, they’re not really a data science group.
- Do the data scientists live in a silo, or are they connected with the rest of the company? In Building Data Science Teams, DJ Patil wrote about the value of seating data scientists with designers, marketers, with the entire product group so that they don’t do their work in isolation, and can bring their insights to bear on all aspects of the company.
- When the data scientists do a study, is the outcome predetermined by management? Is it OK to say “we don’t have an answer” or to come up with a solution that management doesn’t like? Granted, you aren’t likely to be able to answer this question without insider information.
- What do job postings look like? Does the company have a mission and know what it’s looking for, or are they asking for someone with a huge collection of skills, hoping that they will come in useful? That’s a sign of data science cargo culting.
- Does management know what their tools are for, or have they just installed Hadoop because it’s what the management magazines tell them to do? Can managers talk intelligently to data scientists?
- What sort of documentation does the group produce for its projects? Like a club sandwich, it’s easy to shortchange documentation.
- Is the business built around the data? Or is the data science team an add-on to an existing company? A data science group can be integrated into an older company, but you have to ask a lot more questions; you have to worry a lot more about silos and management relations than you do in a company that is built around data from the start.
Coming up with these questions was an interesting thought experiment; we don’t know whether it holds water, but we suspect it does. Any ideas and opinions?