"strata" entries

Approaching ethics and big data

What to do when facing the stoic expressions that pop up during ethics discussions.

The other day I clicked on a message posted to the O’Reilly editors’ email list and the message text filled up almost the entire monitor screen. I must admit that I thought “Am I going to require another caffeine hit to read through this?”

I decided to take a chance, not take another break just then, and read the lengthy note. I didn’t need that caffeine hit after all. Apparently, neither did half a dozen other editors.

The note was about ethics.

In a previous life, I worked in the competitive intelligence field. I remember participating in a friendly confab at an industry event and then someone mentioned the word “e-t-h-i-c-s”. It was rather fascinating to see how that word elicited stoic faces. No one wanted to be the first person to say anything on that topic. Now that I work at ORM, mention the word “ethics!” and folks are not shy about saying exactly what they think. Not. At. All.

During the discussion, Ethics of Big Data by Kord Davis came up. While I was not the editor on this book, I did read it when I was in New York. It made my list of recommended books for people looking to jump into the world of big data. Why? Because I remembered the stoic poker faces from my previous life in competitive intelligence.

Four data themes to watch from Strata + Hadoop World 2012

In-memory data storage, SQL, data preparation and asking the right questions all emerged as key trends at Strata + Hadoop World.

At our successful Strata + Hadoop World conference (including successfully avoiding Sandy), a few themes emerged that resonated with my interests and experience as a hands-on data analyst and as a researcher who tracks technology adoption trends. Keep in mind that these themes reflect my personal biases. Others will have a different take on their own key takeaways from the conference.

1. In-memory data storage for faster queries and visualization

Interactive or real-time query for large datasets is seen as a key to analyst productivity (real-time as in query times fast enough to keep the user in the flow of analysis, from sub-second to less than a few minutes). The existing large-scale data management schemes aren’t fast enough and reduce analytical effectiveness when users can’t explore the data by quickly iterating through various query schemes. We see companies with large data stores building out their own in-memory tools, e.g., Dremel at Google, Druid at Metamarkets, and Sting at Netflix, alongside new tools such as Cloudera’s Impala (announced at the conference), Spark from UC Berkeley’s AMPLab, SAP HANA, and Platfora.

We saw this coming a few years ago when analysts we pay attention to started building their own in-memory data store sandboxes, often in key/value data management tools like Redis, when trying to make sense of new, large-scale data stores. I know from my own work that there’s no better way to explore a new or unstructured data set than to be able to quickly run off a series of iterative queries, each informed by the last.
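To make the idea of an in-memory sandbox concrete, here is a minimal sketch in Python. It uses a plain dictionary as a stand-in for a key/value store such as Redis, and the sample events, field names, and the `build_index` helper are all hypothetical. It only illustrates the pattern of running one query and letting its answer shape the next.

```python
from collections import Counter, defaultdict

# Hypothetical sample events; a real sandbox would load a slice of a much larger store.
EVENTS = [
    {"user": "a", "country": "US", "action": "view"},
    {"user": "b", "country": "US", "action": "click"},
    {"user": "c", "country": "SG", "action": "view"},
    {"user": "d", "country": "US", "action": "view"},
    {"user": "e", "country": "DE", "action": "click"},
]


def build_index(rows, field):
    """Group rows in memory by one field, key/value style (illustrative helper)."""
    index = defaultdict(list)
    for row in rows:
        index[row[field]].append(row)
    return index


# Query 1: which country produced the most events?
by_country = build_index(EVENTS, "country")
top_country = max(by_country, key=lambda key: len(by_country[key]))

# Query 2, informed by query 1: which actions dominate in that country?
actions = Counter(row["action"] for row in by_country[top_country])

print(top_country, dict(actions))
```

Because everything lives in memory, each follow-up question costs a line or two of code and returns immediately, which is what keeps an analyst in the flow of exploration.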

Discovering genetic associations using large data

David Heckerman's research uses big datasets to tackle essential health questions.

David Heckerman from Microsoft Research presents a summary of his work in the session “Discovering Genetic Associations on Large Data.” This was part of the Strata Rx Online Conference: Personalized Medicine, a preview of O’Reilly’s conference Strata Rx, highlighting the use of data in medical research and delivery.

Heckerman’s research attempts to answer essential questions such as “What is your propensity for getting a particular disease?” and “How are you likely to react to a particular drug?”

Key points from Heckerman’s presentation include: Read more…

Solving the Wanamaker problem for health care

Data science and technology give us the tools to revolutionize health care. Now we have to put them to use.

By Tim O’Reilly, Julie Steele, Mike Loukides and Colin Hill

“The best minds of my generation are thinking about how to make people click ads.” — Jeff Hammerbacher, early Facebook employee

“Work on stuff that matters.” — Tim O’Reilly

In the early days of the 20th century, department store magnate John Wanamaker famously said, “I know that half of my advertising doesn’t work. The problem is that I don’t know which half.”

The consumer Internet revolution was fueled by a search for the answer to Wanamaker’s question. Google AdWords and the pay-per-click model began the transformation of a business in which advertisers paid for ad impressions into one in which they pay for results. “Cost per thousand impressions” (CPM) was outperformed by “cost per click” (CPC), and a new industry was born. It’s important to understand why CPC outperformed CPM, though. Superficially, it’s because Google was able to track when a user clicked on a link, and was therefore able to bill based on success. But billing based on success doesn’t fundamentally change anything unless you can also change the success rate, and that’s what Google was able to do. By using data to understand each user’s behavior, Google was able to place advertisements that an individual was likely to click. They knew “which half” of their advertising was more likely to be effective, and didn’t bother with the rest.
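The point about success rates can be checked with a little arithmetic. The sketch below uses made-up numbers (a $5 CPM, a $0.50 CPC, and click-through rates of 1% and 2%); none of them come from Google or from this article, and the function name is purely illustrative.

```python
def revenue_per_thousand(cpc: float, ctr: float) -> float:
    """Revenue from 1,000 impressions when the advertiser is billed per click."""
    clicks = 1000 * ctr
    return clicks * cpc


# Assumed, illustrative numbers (not taken from the article).
FLAT_CPM = 5.00         # advertiser pays $5 per 1,000 impressions
CPC = 0.50              # advertiser pays $0.50 per click
UNTARGETED_CTR = 0.01   # 1% of impressions get clicked
TARGETED_CTR = 0.02     # 2% once data is used to place better ads

untargeted = revenue_per_thousand(CPC, UNTARGETED_CTR)
targeted = revenue_per_thousand(CPC, TARGETED_CTR)

print(f"CPM billing:         ${FLAT_CPM:.2f} per 1,000 impressions")
print(f"CPC billing, 1% CTR: ${untargeted:.2f} per 1,000 impressions")
print(f"CPC billing, 2% CTR: ${targeted:.2f} per 1,000 impressions")
```

Under these assumptions, switching the billing unit from impressions to clicks leaves revenue unchanged as long as the click-through rate stays the same. The gain appears only when targeting raises the click-through rate.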

Since then, data and predictive analytics have driven ever deeper insight into user behavior such that companies like Google, Facebook, Twitter, and LinkedIn are fundamentally data companies. And data isn’t just transforming the consumer Internet. It is transforming finance, design, and manufacturing — and perhaps most importantly, health care.

How is data science transforming health care? There are many ways in which health care is changing, and needs to change. We’re focusing on one particular issue: the problem Wanamaker described when talking about his advertising. How do you make sure you’re spending money effectively? Is it possible to know what will work in advance?

Four short links: 4 October 2011

Singaporean Incubator, Oracle NoSQL, Should Facebook have a Browser?, and GitHub has Competition

  1. jfdi.asia — Singaporean version of TechStars, with 100-day program (“the bootcamp”) Jan-Apr 2012. Startups from anywhere in the world can apply, and will want to because Singapore is the gateway to Asia. They’ll also have mentors from around the world.
  2. Oracle NoSQLdb — Oracle want to sell you a distributed key-value store. It’s called “Oracle NoSQL” (as opposed to PostgreSQL, which is SQL No-Oracle). (via Edd Dumbill)
  3. Facebook Browser — interesting thoughts about why the browser might be a good play for Facebook. I’m not so sure: browsers don’t lend themselves to small teams, and search advertising doesn’t feel like a good fit with Facebook’s existing work. Still, it makes me grumpy to see browsers become weapons again.
  4. Bitbucket — a competitor to GitHub, from the folks behind the widely-respected Jira and Confluence tools. I’m a little puzzled, to be honest: GitHub doesn’t seem to have weak spots (the way, for example, that SourceForge did).

There's no such thing as big data

Even if you have petabytes of data, you still need to know how to ask the right questions to apply it.

Today's big companies are losing to small upstarts simply because those firms ask better questions. To compete, large enterprises need to learn how to harvest the data they have on customers, markets, competitors, and products.

Four short links: 7 February 2011

Printed Toys, Magazines in JS, git push web, Clean Beats More

  1. UK Internet Entrepreneurs (Guardian) — two things stood out for me. (1) A startup focused on 3D-printing better dolls for boys and girls. (2) It seems easier for the government to start something new and impose its own vision than it is to understand and integrate with what already exists.
  2. TreeSaver.js — MIT/GPLv2-licensed JavaScript framework for creating magazine-style layouts using standards-compliant HTML and CSS.
  3. Using git to Manage a Web Site — This page describes how I set things up so that I can make changes live by running just “git push web”.
  4. Strata Data Conference Recap — Clean data > More Data > Fancy Math: this is the order that makes data easier and better to work with. Clean data will be easier to work with and provide the best results. If your data isn’t clean, it is better to have more data than to resort to fancy math. Higher-order statistical processing, while workable as a last resort, takes longer to develop and involves difficult algorithms that are harder to maintain. So the best place to focus is on starting with clean data.