ENTRIES TAGGED "strata"

Podcast: thinking with data

Data tools are less important than the way you frame your questions.

Max Shron and Jake Porway spoke with me at Strata a few weeks ago about frameworks for making reasoned arguments with data. Max’s recent O’Reilly book, Thinking with Data, outlines the crucial process of developing good questions and creating a plan to answer them. Jake’s nonprofit, DataKind, connects data scientists with worthy causes where they can apply their skills.

A few of the things we talked about:

  • The importance of publishing negative scientific results
  • GiveDirectly, an organization that facilitates donations directly to households in Kenya and Uganda. GiveDirectly was able to model household income by using satellite data to distinguish thatched roofs from metal roofs.
  • Moritz Stefaner calling for a “macroscope”
  • Project Cybersyn, Salvador Allende’s plan for encompassing the entire Chilean economy in a single real-time computer system
  • Seeing Like a State: How Certain Schemes to Improve the Human Condition Have Failed by James C. Scott

After we recorded this podcast episode at Strata Santa Clara, Max presided over a webcast on his book that’s archived here.

Podcast: automation and an abundance-oriented economy

Jim Stogdill, Jon Bruner and Jenn Webb discuss James Burke, ninja homes, IoT standards and robots.

What happens if emerging technology and automation result in a world of abundance, where anyone at any time can produce anything they need and there’s no need for jobs? In his recent Strata keynote, James Burke warned that society is not prepared for scarcity (and the value it brings) to be a thing of the past, an eventuality that Burke predicts will occur in the next 40 years or so. This topic kicks off a discussion among Jim Stogdill, Jon Bruner, and me that we recorded while at Strata.

Link fodder from our chat includes:

Subscribe to the O’Reilly Radar Podcast through iTunes, SoundCloud, or directly through our podcast’s RSS feed.


If you liked this article, you might be interested in a new report, “Building a Solid World,” that explores the key trends and developments that are accelerating the growth of a software-enhanced, networked physical world. (Download the free report.)

Why is building custom recommender systems hard? Does it have to be?

Photo Courtesy of Carlos Guestrin

By Carlos Guestrin

Today, it’s shocking (and honestly exciting) how much of my daily experience is determined by a recommender system.  These systems drive amazing experiences everywhere, telling me where to eat, what to listen to, what to watch, what to read, and even who I should be friends with.  Furthermore, information overload is making recommender systems indispensable, since I can’t find what I want on the web simply using keyword search tools.  Recommenders are behind the success of industry leaders like Netflix, Google, Pandora, eHarmony, Facebook, and Amazon.  It’s no surprise companies want to integrate recommender systems with their own online experiences.  However, as I talk to team after team of smart industry engineers, it has become clear that building and managing these systems is usually a bit out of reach, especially given all the other demands on the team’s time.

Read more…

Make us think: a call for Strata keynote videos

Submit your suggestions for videos that make us think about how data, visualizations, and technology are changing us

Each year at Strata, we warm up the crowd in the main keynote sessions with short videos that will make people think. These videos demonstrate the ways that data, technology, and visualization are changing us. Some are funny; some are clever; some are downright disturbing.

For Strata New York + Hadoop World in October, we’re hoping you’ll join in and suggest some videos for us. If you’ve got something you feel captures the zeitgeist of technology at the fringes, then complete this form, and we’ll check it out. We’ll choose some of them as we kick off the event this fall.

Read more…

Making things happen: from being a software engineer to writing a book

An interview with Kristina Chodorow, author of MongoDB: The Definitive Guide, Second Edition

We launched the second edition of Kristina Chodorow’s book, MongoDB: The Definitive Guide, at a recent MongoDB conference in San Francisco. Everyone worked hard to make this happen. I filmed a little behind-the-scenes video with my phone in order to share it with everyone who worked on the book. After I filmed it, I decided to post the video as well as an interview with Kristina. Both the video and the interview provide snippets of what it is like to work on the second edition of MongoDB: The Definitive Guide.

What inspired you to become a software engineer?

Kristina Chodorow: In college, I took a computer science class because it would count towards my math major. I was programming a tic-tac-toe game and thought, “Why can’t I just program it to try to win?” and then I realized I could figure out the actual logic of “trying to win.”  I thought that was the coolest thing ever. I took a couple more programming classes, joined the programming team, and started doing CS research. By the time I graduated, I knew I was going to be a programmer.

How did you land at 10gen?

Kristina Chodorow

Kristina Chodorow: After college I started a Ph.D. at Columbia and, although it was a great program, I really didn’t want to go to graduate school and left after a semester.  I moved to Seattle to be with a guy and unsurprisingly that didn’t work out. After a plane ride of shame back to the East Coast, I put my resume up on Dice.com.  A really excellent recruiter, Craig Collins, set me up with a bunch of interviews and I accepted an offer from 10gen. When I joined, 10gen was working on a full cloud stack (similar to Google App Engine).  I worked on a JavaScript compiler for about a year before we decided to focus on the scalable storage layer: MongoDB.

Read more…

Another Serving of Data Skepticism

I was thrilled to receive an invitation to a new meetup: the NYC Data Skeptics Meetup. If you’re in the New York area, and you’re interested in seeing data used honestly, stop by!

That announcement pushed me to write another post about data skepticism. The past few days, I’ve seen a resurgence of the slogan that correlation is as good as causation, if you have enough data. And I’m worried. (And I’m not vain enough to think it’s a response to my first post about skepticism; it’s more likely an effect of Cukier’s book.) There’s a fundamental difference between correlation and causation. Correlation is a two-headed arrow: you can’t tell in which direction it flows. Causation is a single-headed arrow: A causes B, not vice versa, at least in a universe that’s subject to entropy.
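The two-headed nature of correlation can be checked numerically: swapping the two variables leaves the coefficient unchanged, so the number alone cannot say which way any causal arrow points. A minimal sketch, assuming made-up data (the values and the helper name `pearson` are illustrative, not taken from the post):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: y is generated *from* x, so the causal arrow runs x -> y.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2 * v + 1 for v in x]

r_xy = pearson(x, y)  # correlation "from x to y"
r_yx = pearson(y, x)  # correlation "from y to x"

# The two values are identical: the coefficient carries no direction.
print(r_xy, r_yx)
```

Even though the data here was built so that x determines y, nothing in the coefficient records that direction.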

Let’s do some thought experiments, unfortunately totally devoid of data. But I don’t think we need data to get to the core of the problem. Think of the classic false correlation (when teaching logic, also used as an example of a false syllogism): there’s a strong correlation between people who eat pickles and people who die. Well, yeah. We laugh. But let’s take this a step further: correlation is a double-headed arrow. So not only does this poor logic imply that we can reduce the death rate by preventing people from eating pickles, it also implies that we can harm the chemical companies that produce vinegar by preventing people from dying.

Read more…

Comments: 3

Leading Indicators

In a conversation with Q Ethan McCallum (who should be credited as co-author), we wondered how to evaluate data science groups. If you’re looking at an organization’s data science group from the outside, possibly as a potential employee, what can you use to evaluate it? It’s not a simple problem under the best of conditions: you’re not an insider, so you don’t know the full story of how many projects it has tried, whether they have succeeded or failed, relations between the data group, management, and other departments, and all the other stuff you’d like to know but will never be told.

Our starting point was remote: Q told me about Tyler Brulé’s travel writing for the Financial Times (behind a paywall, unfortunately), in which he says that a club sandwich is a good proxy for hotel quality: you go into the restaurant and order a club sandwich. A club sandwich isn’t hard to make: there’s no secret recipe or technique that’s going to make Hotel A’s sandwich significantly better than B’s. But it’s easy to cut corners on ingredients and preparation. And if a hotel is cutting corners on its club sandwiches, it’s probably cutting corners in other places.

This reminded me of when my daughter was in first grade, and we looked (briefly) at private schools. All the schools talked the same talk. But if you looked at classes, it was pretty clear that the quality of the music program was a proxy for the quality of the school. After all, it’s easy to shortchange music, and both hard and expensive to do it right. Oddly enough, using the music program as a proxy for evaluating school quality has continued to work through middle school and (public) high school. Music is the first thing to be cut when the budget gets tight; and if a school has a good music program with excellent teachers, it’s probably not shortchanging the kids elsewhere.

How does this connect to data science? What are the proxies that allow you to evaluate a data science program from the “outside,” based on the information that you might be able to cull from company blogs, a job interview, or even a job posting? We came up with a few ideas:

  • Are the data scientists simply human search engines, or do they have real projects that allow them to explore and be curious? If they have management support for learning what can be learned from the organization’s data, and if management listens to what they discover, they’re accomplishing something significant. If they’re just playing Q&A with the company data, finding answers to specific questions without providing any insight, they’re not really a data science group.
  • Do the data scientists live in a silo, or are they connected with the rest of the company? In Building Data Science Teams, DJ Patil wrote about the value of seating data scientists with designers, marketers, and the entire product group so that they don’t do their work in isolation, and can bring their insights to bear on all aspects of the company.
  • When the data scientists do a study, is the outcome predetermined by management? Is it OK to say “we don’t have an answer” or to come up with a solution that management doesn’t like? Granted, you aren’t likely to be able to answer this question without insider information.
  • What do job postings look like? Does the company have a mission and know what it’s looking for, or are they asking for someone with a huge collection of skills, hoping that they will come in useful? That’s a sign of data science cargo culting.
  • Does management know what their tools are for, or have they just installed Hadoop because it’s what the management magazines tell them to do? Can managers talk intelligently to data scientists?
  • What sort of documentation does the group produce for its projects? Like a club sandwich, it’s easy to shortchange documentation.
  • Is the business built around the data? Or is the data science team an add-on to an existing company? A data science group can be integrated into an older company, but you have to ask a lot more questions; you have to worry a lot more about silos and management relations than you do in a company that is built around data from the start.

Coming up with these questions was an interesting thought experiment; we don’t know whether they hold water, but we suspect they do. Any ideas and opinions?

Comment: 1

On the importance of imagination in data science

Strata Community Profile on Amy Heineike, Director of Mathematics at Quid

Amy Heineike

According to Amy Heineike, the Director of Mathematics at Quid, there’s nothing like having a fresh dataset in R and knowing how to use it. “You can add a few lines of code and discover all kinds of interesting information,” Heineike says. “One question leads to another, you get into a flow, and you can have an amazing exploration.”

Heineike started working with data several years ago at a consultancy in London, where “playing around” with data shed light on the impact of social networks on government policies. Part of her job was figuring out what types of data to use in order to find solutions to crucial problems, from public transportation to obesity. Her day-to-day work at Quid entails working with new data sets, prototyping analytics, and collaborating with an engineering team to improve data analysis and bring products into production.

Read more…

Pursuing data science as a second profession

Featured Strata Community Profile on Yogi Saxena

Yogi Saxena is not one to back down from a challenge. The distance runner ran his first marathon just two years ago in order to win a bet. Next month, he competes in another grueling marathon, his third. And if that were not enough, a friend’s Facebook post inspired him to train for a sprint triathlon. “I taught myself to swim when I was young,” Saxena says, revealing that his drive to learn new skills started early. “And if it wasn’t for the swim part, I’d have done an Olympic-distance triathlon instead.”

Saxena’s love of mastering new challenges is likely responsible for his decision to pursue data science as a second profession, after having a successful career as an electrical engineer. Currently at Boeing, he is responsible for developing a tool that would help visualize feeds from various classified and non-classified sources.

He is profiled here as part of the Strata community profiles.

Read more…

Tips and Tricks for Debugging Distributed Systems

Preview of upcoming session at the Strata Conference

By Philip Zeyliger

I’m talking on Wednesday at Strata about Tips and Tricks for Debugging Distributed Systems. You should come check it out.

As a preview, let’s talk about two pretty pictures.

Network Visualization

[Diagram: network visualization of the cluster’s processes and their TCP connections]

I’m running some typical distributed systems (HDFS, MapReduce, Impala, HBase, ZooKeeper) on a small, seven-node cluster. The diagram above shows individual processes and the TCP connections they’ve established with each other. Some processes are “masters” and they end up talking to many other processes.
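One way to reason about a picture like this is as a graph: processes are nodes, established TCP connections are edges, and “masters” show up as the nodes with the most edges. A minimal sketch, assuming a hypothetical hand-written edge list (the process names and connections below are invented for illustration and are not taken from the cluster described here):

```python
from collections import Counter

# Hypothetical (process, process) pairs, one per established TCP connection.
connections = [
    ("namenode", "datanode-1"),
    ("namenode", "datanode-2"),
    ("namenode", "datanode-3"),
    ("hbase-master", "regionserver-1"),
    ("hbase-master", "regionserver-2"),
    ("regionserver-1", "datanode-1"),
    ("regionserver-2", "datanode-2"),
]

def connection_counts(edges):
    """Count how many connections each process participates in."""
    counts = Counter()
    for a, b in edges:
        counts[a] += 1
        counts[b] += 1
    return counts

counts = connection_counts(connections)

# List processes from most to fewest connections; master-like
# processes tend to appear near the top.
for process, n in counts.most_common():
    print(f"{process}: {n}")
```

In this invented example the process with the most connections is the one playing a master role, which matches the pattern described for the diagram.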

Read more…
