ENTRIES TAGGED "strata"

Public health case study: Tracking zombies and vampires using social media

Preview of Strata Santa Clara 2013 Session

Towards the end of 2012, a battle that pitted state against state, father against son, wife against Bunco group, and dog against cat finally reached a truce, spawned by the treaty we all sign every four years known as the presidential election. Now that the death match between red states and blue states has faded from our televisions and Twitter feeds, we can focus on the real issues of the day.

For longer than Romney’s bid for the White House, there has been a war going on in America: an undeath match of sorts between Zombies and Vampires. Like a flu pandemic sweeping the nation, the undead have been infiltrating every aspect of our lives. What was traditionally only a mild outbreak in October has turned into a year-round epidemic that our society cannot seem to shake.

Read more…


Privacy in the Online Ecosystem: Obligations and Best Practices Are Evolving

Preview of upcoming session at Strata Santa Clara

At the end of 2012, the Federal Trade Commission (“FTC”) hosted the public workshop, “The Big Picture – Comprehensive Online Data Collection,” which focused on privacy concerns relating to the comprehensive collection of consumer online data by Internet service providers (“ISPs”), operating systems, browsers, search engines, and social media. During the workshop, panelists debated the impact of service providers’ ability to collect data about computer and device users across unaffiliated websites, including when some entities have no direct relationship with such users.

As one example of the issues raised by the panelists, Professor Neil Richards, from the Washington University in St. Louis School of Law, stated that, despite its benefits, comprehensive data collection infringes on the concept of “intellectual privacy,” which is predicated on consumers’ ability to freely search, interact, and express themselves online. Professor Richards also stated that comprehensive data collection is creating a transformational power shift in which businesses can effectively persuade consumers based on their knowledge of consumer preferences. Yet, according to Professor Richards, few consumers actually understand “the basis of the bargain,” or the extent to which their information is being collected.

Read more…


Building recommendation platforms with Hadoop

Preview of upcoming session at the Strata Conference

Recommendations are making their way into more and more products, and larger datasets are significantly improving their quality. Hadoop is increasingly used to build out these recommendation platforms. Examples include product recommendations, merchant recommendations, content recommendations, social recommendations, query recommendations, and display and search ads.

With the number of options available to users ever increasing, customer attention spans are shrinking fast. Customers have come to expect to see their best choices right in front of them at any given moment. In such a scenario, recommendations power more and more product features and drive user interaction, so companies are looking for ways to precisely target customers at the right time. This is where big data comes into the picture. Succeeding with data to build new markets, or to change existing ones, is the game being played in many high-stakes scenarios. Some companies have found a way to build a big data recommendation and machine learning platform that gives them an edge in bringing better products to market ever faster. There is therefore a strong case for treating recommendations and machine learning on big data as a platform within a company, rather than as a black box that magically produces the right results. Such a platform can also support features like fraud detection, spam detection, and content enrichment and serving, which makes it viable in the long run. It is not just about recommendations. A toy sketch of the kind of computation involved appears below.
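To make that concrete, here is a minimal sketch of one computation such a platform might run: an item-to-item co-occurrence recommender in plain Python. The dataset and names are invented for illustration; on Hadoop, the same “map” and “reduce” steps would be distributed across many machines rather than run in a single process.

    from collections import defaultdict
    from itertools import combinations

    # Toy interaction data (invented): user -> items they engaged with.
    # A production platform would read billions of such records from HDFS.
    interactions = {
        "alice": ["tent", "stove", "lantern"],
        "bob":   ["tent", "lantern"],
        "carol": ["stove", "lantern", "mug"],
    }

    # "Map" step: count every pair of items that appear together for a user.
    pair_counts = defaultdict(int)
    for items in interactions.values():
        for a, b in combinations(sorted(set(items)), 2):
            pair_counts[(a, b)] += 1

    # "Reduce" step: for each item, collect co-occurring items with counts.
    related = defaultdict(list)
    for (a, b), count in pair_counts.items():
        related[a].append((count, b))
        related[b].append((count, a))

    def recommend(item, n=3):
        """Return up to n items most often seen alongside `item`."""
        return [other for _, other in sorted(related[item], reverse=True)[:n]]

    print(recommend("tent"))  # -> ['lantern', 'stove']

The same pattern scales from this toy example to the product, merchant, and content recommendations mentioned above: the map step emits pairs from each user’s history, and the reduce step aggregates them into per-item rankings.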

Read more…


The 0th Law of Data Mining

Preview of The Laws of Data Mining Session at Strata Santa Clara 2013

Many years ago I was taught about the three laws of thermodynamics. When that didn’t stick, I was taught a quick way to remember them, originally attributed to C.P. Snow:

  • 1st Law: you can’t win
  • 2nd Law: you can’t draw
  • 3rd Law: you can’t get out of the game

These laws (well, the real ones) were firmly established by the mid-19th century. Yet it wasn’t until the 1930s that the value of the 0th law was identified.

At Strata I’m going to be talking about the 9 Laws of Data Mining – a set of principles identified by Tom Khabaza and very closely related to the CRISP-DM data mining methodology.

They may possibly, just possibly, not be as important as the laws of thermodynamics, but at Strata they will be supported by an equally important 0th Law.

Read more…


Just the basics: refreshingly void of any semblance of big data

Strata Santa Clara session preview on core data science skills

The McKinsey Global Institute forecasts a shortage of over 140,000 data scientists in the U.S. by 2018. I forecast a shortage of 140,000 people to explain to their respective hiring managers that “make it Hadoop” is not an appropriate articulation of what these people can or should do. If big data is the new bubble, then here’s to the prolonged “correct data” recession that hopefully follows.

Correct data? Such skills used to go by unsexy names like statistics or scientific experimentation, but we now prefer to spice up the job titles (and salaries!) a bit and brand ourselves as data scientists, data storytellers, data prophets, or—if my next promotion comes through—Lord High Chancellor of Data, appointed by the Sovereign on the advice of the Prime Minister to oversee Her Majesty’s Terabytes. Modesty, it sometimes feels, is low on the burgeoning list of big data skills.

Read more…


Design matters more than math

Design compels. Math is proof. Both sides will defend their domains at Strata's next Great Debate.

At Strata Santa Clara later this month, we’re reprising what has become a tradition: Great Debates. These Oxford-style debates pit two teams against one another to argue a hot topic in the fields of big data, ubiquitous computing, and emerging interfaces.

[Photo caption: What matters more? Our teams for the Great Debate.]

Part of the fun is the scoring: attendees vote on whether they agree with the proposition before the debate begins; after both sides have said their piece, the audience votes again. Whoever moves the needle wins.

This year’s proposition — that design matters more than math — is sure to inspire some vigorous discussion. The argument for math is pretty strong. Math is proof. Given enough data — and today, we have plenty — we can know. “The right information in the right place just changes your life,” said Stewart Brand. Properly harnessed, the power of data analysis and modeling can fix cities, predict epidemics, and revitalize education. Abused, it can invade our lives, undermine economies, and steal elections. Surely the algorithms of big data matter!

But your life won’t change by itself. Bruce Mau defines design as “the human capacity to plan and produce desired outcomes.” Math informs; design compels. Without design, math can’t do its thing. Poorly designed experiments collect the wrong data. And if the data can’t be understood and acted upon, it may as well not have been crunched in the first place.

This is the question we’ll be putting to our debaters: Which matters more? A well-designed collection of flawed information — or an opaque, hard-to-parse, but unerringly accurate model? From mobile handsets to social policy, we need both good math and good design. Which is more critical?

Read more…


Approaching ethics and big data

What to do when facing the stoic expressions that pop up during ethics discussions.

The other day I clicked on a message posted to the O’Reilly editors’ email list, and the message text filled almost the entire screen. I must admit my first thought was, “Am I going to require another caffeine hit to read through this?”

I decided to take a chance, not take another break just then, and read the lengthy note. I didn’t need that caffeine hit after all. Apparently, neither did half a dozen other editors.

The note was about ethics.

In a previous life, I worked in the competitive intelligence field. I remember participating in a friendly confab at an industry event when someone mentioned the word “e-t-h-i-c-s.” It was rather fascinating to see how that word elicited stoic faces. No one wanted to be the first person to say anything on the topic. Now, working at ORM, mention the word “ethics” and folks are not shy about saying exactly what they think. Not. At. All.

During the discussion, Ethics of Big Data by Kord Davis came up. While I was not the editor on this book, I did read it when I was in New York. It made my list of recommended books for people looking to jump into the world of big data. Why? Because I remembered the stoic poker faces from my previous life in competitive intelligence.

Read more…


Four data themes to watch from Strata + Hadoop World 2012

In-memory data storage, SQL, data preparation and asking the right questions all emerged as key trends at Strata + Hadoop World.

At our successful Strata + Hadoop World conference (where we also managed to avoid Sandy), a few themes emerged that resonated with my interests and experience as a hands-on data analyst and as a researcher who tracks technology adoption trends. Keep in mind that these themes reflect my personal biases; others will have a different take on their own key takeaways from the conference.

1. In-memory data storage for faster queries and visualization

Interactive or real-time querying of large datasets is seen as a key to analyst productivity (real-time meaning query times fast enough to keep the user in the flow of analysis, from sub-second to a few minutes). Existing large-scale data management schemes aren’t fast enough, and they reduce analytical effectiveness when users can’t explore the data by quickly iterating through variations of a query. We see companies with large data stores building their own in-memory tools, e.g., Dremel at Google, Druid at Metamarkets, and Sting at Netflix, alongside new tools like Cloudera’s Impala (announced at the conference), UC Berkeley AMPLab’s Spark, SAP HANA, and Platfora.

We saw this coming a few years ago, when analysts we pay attention to started building their own in-memory sandboxes, often in key/value stores like Redis, to make sense of new, large-scale datasets. I know from my own work that there’s no better way to explore a new or unstructured dataset than to quickly run off a series of iterative queries, each informed by the last. A minimal sketch of that workflow follows.
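As one illustration of that sandbox workflow (a sketch, not the author’s actual setup), here is a minimal example using the redis-py client. It assumes a Redis server running locally, and the event data and key name are invented:

    import redis  # redis-py client; assumes a Redis server on localhost

    r = redis.Redis(decode_responses=True)

    # Load a slice of (invented) event counts into a sorted set, so
    # iterative "top N" queries come straight back from memory.
    events = [("search", 120), ("click", 45), ("purchase", 7), ("signup", 12)]
    for event_type, count in events:
        r.zincrby("events:by_type", count, event_type)

    # Each exploratory query is fast enough to keep you in the flow.
    print(r.zrevrange("events:by_type", 0, 2, withscores=True))
    # -> [('search', 120.0), ('click', 45.0), ('signup', 12.0)]

Read more…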


Discovering genetic associations using large data

David Heckerman's research uses big datasets to tackle essential health questions.

David Heckerman from Microsoft Research presents a summary of his work in the session “Discovering Genetic Associations on Large Data.” This was part of the Strata Rx Online Conference: Personalized Medicine, a preview of O’Reilly’s conference Strata Rx, highlighting the use of data in medical research and delivery.

Heckerman’s research attempts to answer essential questions such as “What is your propensity for getting a particular disease?” and “How are you likely to react to a particular drug?”

Key points from Heckerman’s presentation include:

Read more…


Solving the Wanamaker problem for health care

Data science and technology give us the tools to revolutionize health care. Now we have to put them to use.

By Tim O’Reilly, Julie Steele, Mike Loukides and Colin Hill

“The best minds of my generation are thinking about how to make people click ads.” — Jeff Hammerbacher, early Facebook employee

“Work on stuff that matters.” — Tim O’Reilly

[Photo: doctors in an operating room with data]

In the early days of the 20th century, department store magnate John Wanamaker famously said, “I know that half of my advertising doesn’t work. The problem is that I don’t know which half.”

The consumer Internet revolution was fueled by a search for the answer to Wanamaker’s question. Google AdWords and the pay-per-click model began the transformation of a business in which advertisers paid for ad impressions into one in which they pay for results. “Cost per thousand impressions” (CPM) was outperformed by “cost per click” (CPC), and a new industry was born. It’s important to understand why CPC outperformed CPM, though. Superficially, it’s because Google was able to track when a user clicked on a link, and was therefore able to bill based on success. But billing based on success doesn’t fundamentally change anything unless you can also change the success rate, and that’s what Google was able to do. By using data to understand each user’s behavior, Google was able to place advertisements that an individual was likely to click. They knew “which half” of their advertising was more likely to be effective, and didn’t bother with the rest.
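To see why the success rate is what matters, consider a toy calculation (all numbers invented): under CPM, revenue from a thousand impressions is fixed, while under CPC it scales with the click-through rate, which is exactly what better ad placement can raise.

    # Invented numbers: revenue from 1,000 ad impressions.
    impressions = 1000
    cpm_rate = 10.00   # $10 per 1,000 impressions, paid regardless of clicks
    cpc_rate = 0.50    # $0.50 per click, paid only on success

    def cpc_revenue(ctr):
        """Revenue from the same 1,000 impressions at a click-through rate."""
        return impressions * ctr * cpc_rate

    print(cpm_rate)           # CPM: $10.00, whether or not anyone clicks
    print(cpc_revenue(0.01))  # CPC at 1% CTR: $5.00 -- worse than CPM
    print(cpc_revenue(0.04))  # CPC at 4% CTR: $20.00 -- better targeting pays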

Since then, data and predictive analytics have driven ever deeper insight into user behavior such that companies like Google, Facebook, Twitter, and LinkedIn are fundamentally data companies. And data isn’t just transforming the consumer Internet. It is transforming finance, design, and manufacturing — and perhaps most importantly, health care.

How is data science transforming health care? There are many ways in which health care is changing, and needs to change. We’re focusing on one particular issue: the problem Wanamaker described when talking about his advertising. How do you make sure you’re spending money effectively? Is it possible to know what will work in advance?

Read more…
