In a conversation with Q Ethan McCallum (who should be credited as co-author), we wondered how to evaluate data science groups. If you’re looking at an organization’s data science group from the outside, possibly as a potential employee, what can you use to evaluate it? It’s not a simple problem under the best of conditions: you’re not an insider, so you don’t know the full story of how many projects it has tried, whether they have succeeded or failed, relations between the data group, management, and other departments, and all the other stuff you’d like to know but will never be told.
Our starting point was remote: Q told me about Tyler Brulé’s travel writing for Financial Times (behind a paywall, unfortunately), in which he says that a club sandwich is a good proxy for hotel quality: you go into the restaurant and order a club sandwich. A club sandwich isn’t hard to make: there’s no secret recipe or technique that’s going to make Hotel A’s sandwich significantly better than B’s. But it’s easy to cut corners on ingredients and preparation. And if a hotel is cutting corners on their club sandwiches, they’re probably cutting corners in other places.
This reminded me of when my daughter was in first grade, and we looked (briefly) at private schools. All the schools talked the same talk. But if you looked at classes, it was pretty clear that the quality of the music program was a proxy for the quality of the school. After all, it’s easy to shortchange music, and both hard and expensive to do it right. Oddly enough, using the music program as a proxy for evaluating school quality has continued to work through middle school and (public) high school. It’s the first thing to cut when the budget gets tight; and if a school has a good music program with excellent teachers, they’re probably not shortchanging the kids elsewhere.
How does this connect to data science? What are the proxies that allow you to evaluate a data science program from the “outside,” on the information that you might be able to cull from company blogs, a job interview, or even a job posting? We came up with a few ideas:
- Are the data scientists simply human search engines, or do they have real projects that allow them to explore and be curious? If they have management support for learning what can be learned from the organization’s data, and if management listens to what they discover, they’re accomplishing something significant. If they’re just playing Q&A with the company data, finding answers to specific questions without providing any insight, they’re not really a data science group.
- Do the data scientists live in a silo, or are they connected with the rest of the company? In Building Data Science Teams, DJ Patil wrote about the value of seating data scientists with designers, marketers, with the entire product group so that they don’t do their work in isolation, and can bring their insights to bear on all aspects of the company.
- When the data scientists do a study, is the outcome predetermined by management? Is it OK to say “we don’t have an answer” or to come up with a solution that management doesn’t like? Granted, you aren’t likely to be able to answer this question without insider information.
- What do job postings look like? Does the company have a mission and know what it’s looking for, or are they asking for someone with a huge collection of skills, hoping that they will come in useful? That’s a sign of data science cargo culting.
- Does management know what their tools are for, or have they just installed Hadoop because it’s what the management magazines tell them to do? Can managers talk intelligently to data scientists?
- What sort of documentation does the group produce for its projects? Like a club sandwich, it’s easy to shortchange documentation.
- Is the business built around the data? Or is the data science team an add-on to an existing company? A data science group can be integrated into an older company, but you have to ask a lot more questions; you have to worry a lot more about silos and management relations than you do in a company that is built around data from the start.
Coming up with these questions was an interesting thought experiment; we don’t know whether it holds water, but we suspect it does. Any ideas and opinions?
Harvard Medical School conference lays out uses for a health data platform
This week has been teaming with health care conferences, particularly in Boston, and was declared by President Obama to be National Health IT Week as well. I chose to spend my time at the second ITdotHealth conference, where I enjoyed many intense conversations with some of the leaders in the health care field, along with news about the SMART Platform at the center of the conference, the excitement of a Clayton Christiansen talk, and the general panache of hanging out at the Harvard Medical School.
SMART, funded by the Office of the National Coordinator in Health and Human Services, is an attempt to slice through the Babel of EHR formats that prevent useful applications from being developed for patient data. Imagine if something like the wealth of mash-ups built on Google Maps (crime sites, disaster markers, restaurant locations) existed for your own health data. This is what SMART hopes to do. They can already showcase some working apps, such as overviews of patient data for doctors, and a real-life implementation of the heart disease user interface proposed by David McCandless in WIRED magazine.
Why we all need to understand and use big data.
Where does all the data in “big data” come from? And why isn’t big data just a concern for companies such as Facebook and Google? The answer is that the web companies are the forerunners. Driven by social, mobile, and cloud technology, there is an important transition taking place, leading us all to the data-enabled world that those companies inhabit today.
From exoskeleton to nervous system
Until a few years ago, the main function of computer systems in society, and business in particular, was as a digital support system. Applications digitized existing real-world processes, such as word-processing, payroll and inventory. These systems had interfaces back out to the real world through stores, people, telephone, shipping and so on. The now-quaint phrase “paperless office” alludes to this transfer of pre-existing paper processes into the computer. These computer systems formed a digital exoskeleton, supporting a business in the real world.
The arrival of the Internet and web has added a new dimension, bringing in an era of entirely digital business. Customer interaction, payments and often product delivery can exist entirely within computer systems. Data doesn’t just stay inside the exoskeleton any more, but is a key element in the operation. We’re in an era where business and society are acquiring a digital nervous system.
An interview with Shahid Shah
Health care costs rise as doctors try batches of treatments that don’t work in search of one that does. Meanwhile, drug companies spend billions on developing each drug and increasingly end up with nothing to show for their pains. This is the alarming state of medical science today. Shahid Shah, device developer and system integrator, sees a different paradigm emerging. In this interview at the Open Source convention, Shah talks about how technologies and new ways of working can open up medical research.
Castlight Health presents their vision of health care consumerism at Strata Rx
The stress of falling seriously ill often drags along the frustration of having no idea what the treatment will cost. We’ve all experienced the maddening stream of seemingly endless hospital bills, and testimony by E-patient Dave DeBronkart and others show just how absurd U.S. payment systems are.
Castlight casts its work in the framework of a service to employers and consumers. But make no mistake about it: they are a data-rich research operation, and their consumers become empowered patients (e-patients) who can make better choices.
As Arjun Kulothungun, John Zedlewski, and Eugenia Bisignani wrote to me, “Patients become empowered when actionable information is made available to them. In health care, like any other industry, people want high quality services at competitive prices. But in health care, quality and cost are often impossible for an average consumer to determine. We are proud to do the heavy lifting to bring this information to our users.”
Following are more questions and answers from the speakers: Read more…
Further reading and discussion on the civil rights implications of big data.
A few weeks ago, I wrote a post about big data and civil rights, which seems to have hit a nerve. It was posted on Solve for Interesting and here on Radar, and then folks like Boing Boing picked it up.
I haven’t had this kind of response to a post before (well, I’ve had responses, such as the comments to this piece for GigaOm five years ago, but they haven’t been nearly as thoughtful).
Some of the best posts have really added to the conversation. Here’s a list of those I suggest for further reading and discussion:
Nobody notices offers they don’t get
On Oxford’s Practical Ethics blog, Anders Sandberg argues that transparency and reciprocal knowledge about how data is being used will be essential. Anders captured the core of my concerns in a single paragraph, saying what I wanted to far better than I could:
… nobody notices offers they do not get. And if these absent opportunities start following certain social patterns (for example not offering them to certain races, genders or sexual preferences) they can have a deep civil rights effect
To me, this is a key issue, and it responds eloquently to some of the comments on the original post. Harry Chamberlain commented:
However, what would you say to the criticism that you are seeing lions in the darkness? In other words, the risk of abuse certainly exists, but until we see a clear case of big data enabling and fueling discrimination, how do we know there is a real threat worth fighting?