We are simply not good at playing with others when it comes to data
Russia’s railway gauge is different from Western Europe’s. At the border of the former Soviet states, the Russian gauge of 1.524m meets the European & American ‘Standard’ gauge of 1.435m. The reasons for this literal disconnect arise from discussions between the Tsar and his War Minister. When asked the most effective way to prevent Russia’s own rail lines being used against them in times of invasion, the Minister suggested a different gauge to prevent supply trains rolling through the border. The artifact of this decision remains visible today at all rail crossings between Poland and Belarus or Slovakia and Ukraine. The rail cars are jacked up at the border, new wheels inserted underneath, and the car lowered again. It is about a 2-4 hour time burn for each crossing.
Per head, per crossing, over 170 years, is a heck of a lot of resource wasted. But to change it would entail changing the rail stock of the entire country and realigning about 225,000 km (140,000 mi) of track.
Talk about technical debt.
Data suffers from a similar disconnect. It really wasn’t until the advent of XML 15 years ago that we had an agreed (but not entirely satisfactory) mechanism for storing arbitrary data structures outside the application layer. This is as much a commentary on our technical priorities as it is a social indictment. We are simply not good at playing with others when it comes to data.
A Call for Industry-Standard Benchmarks for Big Data Platforms at Strata SC 2013
Big data systems are characterized by their flexibility in processing diverse data genres, such as transaction logs, connection graphs, and natural language text, with algorithms characterized by multiple communication patterns, e.g. scatter-gather, broadcast, multicast, pipelines, and bulk-synchronous. A single benchmark that characterizes a single workload could not be representative of such a multitude of use-cases. However, our systematic study of several use-cases of current big data platforms indicates that most workloads are composed of a common set of stages, which capture the variety of data genres and algorithms commonly used to implement most data-intensive end-to-end workloads. Our upcoming session at Strata SC discusses the BigData Top 100 List, a new community-based initiative for benchmarking big data systems.
Tips for interacting with analytics colleagues
To quote Pride and Prejudice, businesses have for many years “labored under the misapprehension” that their analytics talent was made up of misanthropes with neither the will nor the ability to communicate or work with others on strategic or creative business problems. These employees were meant to be kept in the basement out of sight, fed bad pizza, and pumped for spreadsheets to be interpreted in the sunny offices aboveground.
This perception is changing in industry as the big data phenomenon has elevated data science to a C-level priority. Suddenly folks once stereotyped by characters like Milton in Office Space are now “sexy.” The truth is there have always been well-rounded, articulate, friendly analytics professionals (they may just like Battlestar more than you), and now that analytics is an essential business function, personalities of all types are being attracted to practice the discipline.
Preview of Strata Santa Clara 2013 Session
The 2013 Strata Conference in Santa Clara, CA will be my fifth Strata conference. As always, I’m excited to join so many leaders in the data and data viz communities, and I’m honored that I’ll be speaking there.
I will be presenting my tutorial “Communicating Data Clearly” at 9AM on Tuesday, February 26. This talk will cover methods and principles of creating effective graphs, to ensure they are clear and accurate and make the data easier to understand. It will also emphasize how to avoid common graphical mistakes. To give you a preview of a few of the topics I will be covering, as well as to provide some information to those who cannot attend, I will now link to some of the blog posts I’ve written for Forbes. I was invited to blog for Forbes at a New York Strata Conference in 2011, so my relationships with Forbes and Strata are intertwined.
Preview of upcoming session at Strata Santa Clara
At the end of 2012, the Federal Trade Commission (“FTC”) hosted the public workshop, “The Big Picture – Comprehensive Online Data Collection,” which focused on privacy concerns relating to the comprehensive collection of consumer online data by Internet service providers (“ISPs”), operating systems, browsers, search engines, and social media. During the workshop, panelists debated the impact of service providers’ ability to collect data about computer and device users across unaffiliated websites, including when some entities have no direct relationship with such users.
As one example of the issues raised by the panelists, Professor Neil Richards, from the Washington University in St. Louis School of Law, stated that, despite its benefits, comprehensive data collection infringes on the concept of “intellectual privacy,” which is predicated on consumers’ ability to freely search, interact, and express themselves online. Professor Richards also stated that comprehensive data collection is creating a transformational power shift in which businesses can effectively persuade consumers based on their knowledge of consumer preferences. Yet, according to Professor Richards, few consumers actually understand “the basis of the bargain,” or the extent to which their information is being collected.
Preview of upcoming session at the Strata Conference
Recommendations are making their way into more and more products. Using larger datasets is significantly improving the recommendations. Hadoop is increasingly being used to build out recommendation platforms. Examples of recommendations include product recommendations, merchant recommendations, content recommendations, social recommendations, query recommendations, and display and search ads.
With the number of options available to users ever increasing, the attention span of customers is shrinking at a very fast pace. At any given moment, customers are getting used to seeing their best choices right in front of them. In such a scenario, we see recommendations powering more and more product features and driving user interaction. Hence companies are looking for more ways to target customers precisely at the right time. This brings big data into the picture. Succeeding with data and building new markets, or changing existing markets, is the game being played in many high-stakes scenarios. Some companies have found a way to build their big data recommendation/machine learning platform, giving them the edge in bringing better and better products to market ever faster. Hence, there is a strong case for looking at recommendations/machine learning on big data as a platform in a company, rather than as a black box that magically produces the right results. The platform allows us to build various other features, such as fraud detection, spam detection, and content enrichment and serving, making it viable in the long run. It is not just about recommendations.
Preview of The Laws of Data Mining Session at Strata Santa Clara 2013
Many years ago I was taught about the three laws of thermodynamics. When that didn’t stick, I was taught a quick way to remember them, originally identified by C.P. Snow:
- 1st Law: you can’t win
- 2nd Law: you can’t draw
- 3rd Law: you can’t get out of the game
These laws (well the real ones) were firmly established by the mid 19th century. Yet, it wasn’t until the 1930s that the value of the 0th law was identified.
They may possibly, just possibly, not be as important as the laws of thermodynamics, but at Strata they will be supported by an equally important 0th Law.
Strata Santa Clara session preview on core data science skills
The McKinsey Global Institute forecasts a shortage of over 140,000 data scientists in the U.S. by 2018. I forecast a shortage of 140,000 people to explain to their respective hiring managers that “make it Hadoop” is not an appropriate articulation of what these people can or should do. If big data is the new bubble, then here’s to the prolonged correct data recession that hopefully follows.
Correct data? Such skills used to be called unsexy names like statistics or scientific experiments, but we now prefer to spice up the job titles (and salaries!) a bit and brand ourselves as data scientists, data storytellers, data prophets, or—if my next promotion comes through—Lord High Chancellor of Data, appointed by the Sovereign on the advice of the Prime Minister to oversee Her Majesty’s Terabytes. Modesty, it sometimes feels, is low on the burgeoning list of big data skills.
Design compels. Math is proof. Both sides will defend their domains at Strata's next Great Debate.
At Strata Santa Clara later this month, we’re reprising what has become a tradition: Great Debates. These Oxford-style debates pit two teams against one another to argue a hot topic in the fields of big data, ubiquitous computing, and emerging interfaces.
Part of the fun is the scoring: attendees vote on whether they agree with the proposal before the debaters speak, and after both sides have said their piece, the audience votes again. Whoever moves the needle wins.
This year’s proposition — that design matters more than math — is sure to inspire some vigorous discussion. The argument for math is pretty strong. Math is proof. Given enough data — and today, we have plenty — we can know. “The right information in the right place just changes your life,” said Stewart Brand. Properly harnessed, the power of data analysis and modeling can fix cities, predict epidemics, and revitalize education. Abused, it can invade our lives, undermine economies, and steal elections. Surely the algorithms of big data matter!
But your life won’t change by itself. Bruce Mau defines design as “the human capacity to plan and produce desired outcomes.” Math informs; design compels. Without design, math can’t do its thing. Poorly designed experiments collect the wrong data. And if the data can’t be understood and acted upon, it may as well not have been crunched in the first place.
This is the question we’ll be putting to our debaters: Which matters more? A well-designed collection of flawed information, or an opaque, hard-to-parse, but unerringly accurate model? From mobile handsets to social policy, we need both good math and good design. Which is more critical?