Why we all need to understand and use big data.
Where does all the data in “big data” come from? And why isn’t big data just a concern for companies such as Facebook and Google? The answer is that the web companies are the forerunners. Driven by social, mobile, and cloud technology, there is an important transition taking place, leading us all to the data-enabled world that those companies inhabit today.
From exoskeleton to nervous system
Until a few years ago, the main function of computer systems in society, and business in particular, was as a digital support system. Applications digitized existing real-world processes, such as word-processing, payroll and inventory. These systems had interfaces back out to the real world through stores, people, telephone, shipping and so on. The now-quaint phrase “paperless office” alludes to this transfer of pre-existing paper processes into the computer. These computer systems formed a digital exoskeleton, supporting a business in the real world.
The arrival of the Internet and web has added a new dimension, bringing in an era of entirely digital business. Customer interaction, payments and often product delivery can exist entirely within computer systems. Data doesn’t just stay inside the exoskeleton any more, but is a key element in the operation. We’re in an era where business and society are acquiring a digital nervous system.
An interview with Shahid Shah
Health care costs rise as doctors try batches of treatments that don’t work in search of one that does. Meanwhile, drug companies spend billions on developing each drug and increasingly end up with nothing to show for their pains. This is the alarming state of medical science today. Shahid Shah, device developer and system integrator, sees a different paradigm emerging. In this interview at the Open Source convention, Shah talks about how technologies and new ways of working can open up medical research.
Castlight Health presents their vision of health care consumerism at Strata Rx
The stress of falling seriously ill often drags along the frustration of having no idea what the treatment will cost. We’ve all experienced the maddening stream of seemingly endless hospital bills, and testimony by E-patient Dave DeBronkart and others show just how absurd U.S. payment systems are.
Castlight casts its work in the framework of a service to employers and consumers. But make no mistake about it: they are a data-rich research operation, and their consumers become empowered patients (e-patients) who can make better choices.
As Arjun Kulothungun, John Zedlewski, and Eugenia Bisignani wrote to me, “Patients become empowered when actionable information is made available to them. In health care, like any other industry, people want high quality services at competitive prices. But in health care, quality and cost are often impossible for an average consumer to determine. We are proud to do the heavy lifting to bring this information to our users.”
Following are more questions and answers from the speakers: Read more…
Further reading and discussion on the civil rights implications of big data.
A few weeks ago, I wrote a post about big data and civil rights, which seems to have hit a nerve. It was posted on Solve for Interesting and here on Radar, and then folks like Boing Boing picked it up.
I haven’t had this kind of response to a post before (well, I’ve had responses, such as the comments to this piece for GigaOm five years ago, but they haven’t been nearly as thoughtful).
Some of the best posts have really added to the conversation. Here’s a list of those I suggest for further reading and discussion:
Nobody notices offers they don’t get
On Oxford’s Practical Ethics blog, Anders Sandberg argues that transparency and reciprocal knowledge about how data is being used will be essential. Anders captured the core of my concerns in a single paragraph, saying what I wanted to far better than I could:
… nobody notices offers they do not get. And if these absent opportunities start following certain social patterns (for example not offering them to certain races, genders or sexual preferences) they can have a deep civil rights effect
To me, this is a key issue, and it responds eloquently to some of the comments on the original post. Harry Chamberlain commented:
However, what would you say to the criticism that you are seeing lions in the darkness? In other words, the risk of abuse certainly exists, but until we see a clear case of big data enabling and fueling discrimination, how do we know there is a real threat worth fighting?
In the age of big data, Deven McGraw emphasizes trust, education and transparency in assuring health privacy.
Society is now faced with how to balance the privacy of the individual patient with the immense social good that could come through great health data sharing. Making health data more open and fluid holds both the potential to be hugely beneficial for patients and enormously harmful. As my colleague Alistair Croll put it this summer, big data may well be a civil rights issue that much of the world doesn’t know about yet.
This will likely be a tension that persists throughout my lifetime as technology spreads around the world. While big data breaches are likely to make headlines, more subtle uses of health data have the potential to enable employers, insurers or governments to discriminate — or worse. Figuring out shopping habits can also allow a company to determine a teenager was pregnant before her father did. People simply don’t realize how much about their lives can be intuited through analysis of their data exhaust.
To unlock the potential of health data for the public good, informed consent must mean something. Patients must be given the information and context for how and why their health data will be used in clear, transparent ways. To do otherwise is to duck the responsibility that comes with the immense power of big data.
In search of an informed opinion on all of these issues, I called up Deven McGraw (@HealthPrivacy), the director of the Health Privacy Project at the Center for Democracy and Technology (CDT). Our interview, lightly edited for content and clarity, follows. Read more…
Spark is becoming a key part of a big data toolkit.
A large portion of this week’s Amp Camp at UC Berkeley, is devoted to an introduction to Spark – an open source, in-memory, cluster computing framework. After playing with Spark over the last month, I’ve come to consider it a key part of my big data toolkit. Here’s why:
Hadoop integration: Spark can work with files stored in HDFS, an important feature given the amount of investment in the Hadoop Ecosystem. Getting Spark to work with MapR is straightforward.
The Spark interactive Shell: Spark is written in Scala, and has it’s own version of the Scala interpreter. I find this extremely convenient for testing short snippets of code.
The Spark Analytic Suite:
(Figure courtesy of Matei Zaharia)
Spark comes with tools for interactive query analysis (Shark), large-scale graph processing and analysis (Bagel), and real-time analysis (Spark Streaming). Rather than having to mix and match a set of tools (e.g., Hive, Hadoop, Mahout, S4/Storm), you only have to learn one programming paradigm. For SQL enthusiasts, the added bonus is that Shark tends to run faster than Hive. If you want to run Spark in the cloud, there are a set of EC2 scripts available.