"data" entries

What’s Up With Big Data Ethics?

Insights from a business executive and law professor

by Jonathan H. King & Neil M. Richards

Photo provided courtesy of Jonathan H. King.

Photo provided courtesy of Jonathan King

If you develop software or manage databases, you’re probably at the point now where the phrase “Big Data” makes you roll your eyes. Yes, it’s hyped quite a lot these days. But, overexposed or not, the Big Data revolution raises a bunch of ethical issues related to privacy, confidentiality, transparency and identity. Who owns all that data that you’re analyzing? Are there limits to what kinds of inferences you can make, or what decisions can be made about people based on those inferences? Perhaps you’ve wondered about this yourself.

We’re obsessed by these questions. We’re a business executive and a law professor who’ve written about this question a lot, but our audience is usually lawyers. But because engineers are the ones who confront these questions on a daily basis, we think it’s essential to talk about these issues in the context of software development.

Photo provided courtesy of Neil M. Richards.

Photo provided courtesy of Neil M. Richards.

While there’s nothing particularly new about the analytics conducted in big data, the scale and ease with which it can all be done today changes the ethical framework of data analysis. Developers today can tap into remarkably varied and far-flung data sources. Just a few years ago, this kind of access would have been hard to imagine. The problem is that our ability to reveal patterns and new knowledge from previously unexamined troves of data is moving faster than our current legal and ethical guidelines can manage. We can now do things that were impossible a few years ago, and we’ve driven off the existing ethical and legal maps. If we fail to preserve the values we care about in our new digital society, then our big data capabilities risk abandoning these values for the sake of innovation and expediency.

Read more…

Comment

Can data provide the trust we need in health care?

Collecting actionable data is a challenge for today's data tools

One of the problems dragging down the US health care system is that nobody trusts one another. Most of us, as individuals, place faith in our personal health care providers, which may or may not be warranted. But on a larger scale we’re all suspicious of each other:

  • Doctors don’t trust patients, who aren’t forthcoming with all the bad habits they indulge in and often fail to follow the most basic instructions, such as to take their medications.
  • The payers–which include insurers, many government agencies, and increasingly the whole patient population as our deductibles and other out-of-pocket expenses ascend–don’t trust the doctors, who waste an estimated 20% or more of all health expenditures, including some thirty or more billion dollars of fraud each year.
  • The public distrusts the pharmaceutical companies (although we still follow their advice on advertisements and ask our doctors for the latest pill) and is starting to distrust clinical researchers as we hear about conflicts of interest and difficulties replicating results.
  • Nobody trusts the federal government, which pursues two (contradictory) goals of lowering health care costs and stimulating employment.

Yet everyone has beneficent goals and good ideas for improving health care. Doctors want to feel effective, patients want to stay well (even if that desire doesn’t always translate into action), the Department of Health and Human Services champions very lofty goals for data exchange and quality improvement, clinical researchers put their work above family and comfort, and even private insurance companies are trying moving to “fee for value” programs that ensure coordinated patient care.

Read more…

Comments: 2

Podcast: thinking with data

Data tools are less important than the way you frame your questions.

Max Shron and Jake Porway spoke with me at Strata a few weeks ago about frameworks for making reasoned arguments with data. Max’s recent O’Reilly book, Thinking with Data, outlines the crucial process of developing good questions and creating a plan to answer them. Jake’s nonprofit, DataKind, connects data scientists with worthy causes where they can apply their skills.

A few of the things we talked about:

  • The importance of publishing negative scientific results
  • Give Directly, an organization that facilitates donations directly to households in Kenya and Uganda. Give Directly was able to model income using satellite data to distinguish thatched roofs from metal roofs.
  • Moritz Stefaner calling for a “macroscope”
  • Project Cybersyn, Salvador Allende’s plan for encompassing the entire Chilean economy in a single real-time computer system
  • Seeing Like a State: How Certain Schemes to Improve the Human Condition Have Failed by James C. Scott

After we recorded this podcast episode at Strata Santa Clara, Max presided over a webcast on his book that’s archived here.

Comment

An Invitation to Practical Machine Learning

PracticalMachineLearning_covDoes it make sense for me to have a car? If so, which one is the best choice for my needs: a gasoline, hybrid, or electric?  And should I buy or lease?

In order to make an effective decision, I need to understand key issues about the design, performance, and cost of cars, regardless of whether or not I actually know how to build one myself. The same is true for people deciding if machine learning is a good choice for their business goals or project.  Will the payoff be worth the effort?  What machine learning approach is most likely to produce valuable results for your particular situation? What size team with what expertise is necessary to be able to develop, deploy, and maintain your machine learning system?

Given the complex and previously esoteric nature of machine learning as a field – the sometimes daunting array of learning algorithms and the math needed to understand and employ them – many people feel the topic is one best left only to the few.

Read more…

Comments: 2
Four short links: 3 March 2014

Four short links: 3 March 2014

Vanishing Money, Car Hackery, Data Literacy Course, and Cheaper CI

  1. The Programming Error That Cost Mt Gox 2609 Bitcoins — in the unforgiving world of crypto-currency, it’s easy to miscode and vanish your money.
  2. Ford Invites Open-Source Community to Tinker AwayOne example: Nelson has re-tasked the motor from a Microsoft Xbox 360 game controller to create an OpenXC shift knob that vibrates to signal gear shifts in a standard-transmission Mustang. The 3D-printed prototype shift knob uses Ford’s OpenXC research platform to link devices to the car via Bluetooth, and shares vehicle data from the on-board diagnostics port. Nelson has tested his prototype in a Ford Mustang Shelby GT500 that vibrates at the optimal time to shift.
  3. Making Sense of Data — Google online course on data literacy.
  4. Cost-Efficient Continuous Integration at Mozilla — CI on a big project can imply hundreds if not thousands of VMs on Amazon spinning up to handle compiles and tests. This blog post talks about Mozilla’s efforts to reduce its CI-induced spend without reducing the effectiveness of its CI practices.
Comment

Healthcare Lessons from the Data Sages at Strata

Other industries can show health care the way

This article was written with Ellen M. Martin.

Most healthcare clinicians don’t often think about donating or sharing data. Yet, after hearing Stephen Friend of Sage Bionetworks talk about involving citizens and patients in the field of genetic research at StrataRx 2012, I was curious to learn more.

McKinsey points out the 300 billion dollars in potential savings from using open data in healthcare, while a recent IBM Institute of Business Value study showed the need for corporate data collaboration.

Also, during my own research for Big Data in Healthcare: Hype and Hope, the resounding request from all the participants I interviewed was to “find more data streams to analyze.”

Read more…

Comment
Four short links: 10 February 2014

Four short links: 10 February 2014

Sterling Zings, Android Swings, Data Blings, and Visualized Things.

  1. Bruce Sterling at transmediale 2014 (YouTube) — “if it works, it’s already obsolete.” Sterling does a great job of capturing the current time: spies in your Internet, lost trust with the BigCos, the impermanence of status quo, the need to create. (via BoingBoing)
  2. No-one Should Fork Android (Ars Technica) — this article is bang on. Google Mobile Services (the Play functionality) is closed-source, what makes Android more than a bare-metal OS, and is where G is focusing its development. Google’s Android team treats openness like a bug and routes around it.
  3. Data Pipelines (Hakkalabs) — interesting overview of the data pipelines of Stripe, Tapad, Etsy, and Square.
  4. Visualising Salesforce Data in Minecraft — would almost make me look forward to using Salesforce. Almost.
Comment: 1

Finding the Meaning in “Meaningful Use”

The 30,000-foot view and the nitty gritty details of working with electronic health data

Ever wonder what the heck “meaningful use” really means? By now, you’ve probably heard it come up in discussions of healthcare data. You might even know that it specifically pertains to electronic health records (EHRs). But what is it really about, and why should you care?

If you’ve ever had to carry a large folder of paper between specialists, or fill out the same medical history form in different offices over and over—with whatever details you happen to remember off the top of your head that day—then you already have some idea of why EHRs are a desirable thing. The idea is that EHRs will lead to better care—and better research data—through more complete and accurate record-keeping, and will eventually become part of health information exchanges (HIEs) with features like trend analysis and push-notifications. However, the mere installation of EHR software isn’t enough; we need not just cursory use but meaningful use of EHRs, and we need to ensure that the software being used meets certain standards of efficiency and security.

Read more…

Comment

Why is building custom recommender systems hard? Does it have to be?

guenstrin

Photo Courtesy of Carlos Guestrin

By Carlos Guestrin

Today, it’s shocking (and honestly exciting) how much of my daily experience is determined by a recommender system.  These systems drive amazing experiences everywhere, telling me where to eat, what to listen to, what to watch, what to read, and even who I should be friends with.  Furthermore, information overload is making recommender systems indispensable, since I can’t find what I want on the web simply using keyword search tools.  Recommenders are behind the success of industry leaders like Netflix, Google, Pandora, eHarmony, Facebook, and Amazon.  It’s no surprise companies want to integrate recommender systems with their own online experiences.  However, as I talk to team after team of smart industry engineers, it has become clear that building and managing these systems is usually a bit out of reach, especially given all the other demands on the team’s time.

Read more…

Comment

Learning Apache Mesos

In the summer of 2012, Accel Partners hosted an invitation-only Big Data conference at Stanford. Ping Li stood near the exit with a checkbook, ready to invest $1MM in pitches for real-time analytics on clusters. However, real-time means many different things. For MetaScale working on the Sears turnaround, real-time means shrinking a 6 hour window on a mainframe to 6 minutes on Hadoop. For a hedge fund, real-time means compiling Python to run on GPUs where milliseconds matter, or running on FPGA hardware for microsecond response.

With much emphasis on Hadoop circa 2012, one might think that no other clusters existed. Nothing could be further from the truth: Memcached, Ruby on Rails, Cassandra, Anaconda, Redis, Node.js, etc. – all in large-scale production use for mission critical apps, much closer to revenue than the batch jobs. Google emphasizes a related point in their Omega paper: scheduling batch jobs is not difficult, while scheduling services on a cluster is a hard problem, and that translates to lots of money.

Read more…

Comment