Making sense of the hype-cycle scuffle.
The big data world is a confusing place. We’re no longer in a market dominated mostly by relational databases, and the alternatives have multiplied in a baby boom of diversity.
These child prodigies of the data scene show great promise but spend a lot of time knocking each other around in the schoolyard. Their egos can sometimes be too big to accept that everybody has their place, and eyeball-seeking media certainly doesn’t help.
POPULAR KID: Look at me! Big data is the hotness!
HADOOP: My data’s bigger than yours!
SCIPY: Size isn’t everything, Hadoop! The bigger they come, the harder they fall. And aren’t you named after a toy elephant?
R: Backward sentences mine be, but great power contains large brain.
SQL: Oh, so you all want to be friends again now, eh?!
POPULAR KID: Yeah, what SQL said! Nobody really needs big data; it’s all about small data, dummy.
The fact is that we’re fumbling toward the adolescence of big data tools, and we’re at an early stage of understanding how data can be used to create value and increase the quality of service people receive from government, business and health care. Big data is trumpeted in mainstream media, but many businesses are better advised to take baby steps with small data.
Being both liberal and safe in programming is hard
Recent discoveries of security vulnerabilities in Rails and MongoDB led me to think about how people get to write software.
In engineering, you don’t get to build a structure people can walk into without years of study. In software, we often write what the heck we want and go back to clean up the mess later. It works, but the consequences start to get pretty monumental when you consider the network effects of open source.
You might think it’s a consequence of the tools we use—running fast and loose with scripting languages. I’m not convinced. Unusually among computer science courses, my alma mater taught us programming 101 with Ada. Ada is a language that more or less requires a retinal scan before you can use the compiler. It was a royal pain to get Ada to do anything you wanted: the philosophical inverse of Perl or Ruby. We certainly came up the “hard way.”
I’m not sure that the hard way was any better: a language that protects you from yourself doesn’t teach you much about the problems you can create.
But perhaps we are in need of an inversion of philosophy. Where Internet programming is concerned, everyone is quick to quote Postel’s law: “Be conservative in what you do, be liberal in what you accept from others.”
The fact is that being liberal in what you accept is really hard. You basically have two options: look carefully for only the information you need, which I think is the spirit of Postel’s law, or implement something powerful that will take care of many use cases. The latter strategy, though seemingly quicker and more future-proof, is what often leads to bugs and security holes, as unintended applications of powerful parsers manifest themselves.
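As a rough illustration of the first option, here is a minimal sketch in Python. The `parse_signup` function and its field names are hypothetical, not drawn from Rails or MongoDB; the point is that the code accepts any well-formed input, but then extracts and validates only the fields it actually needs, rather than handing the parser's full output to the rest of the system:

```python
import json

def parse_signup(payload: str) -> dict:
    """Be liberal in what you accept, conservative in what you believe:
    parse the whole request, but trust only the fields we asked for."""
    data = json.loads(payload)  # accept any well-formed JSON
    name = data.get("name")
    age = data.get("age")
    if not isinstance(name, str) or not isinstance(age, int):
        raise ValueError("signup needs a string 'name' and an integer 'age'")
    # Unknown keys (e.g. "admin": true) are dropped, not mass-assigned
    # onto an internal object -- the Rails-style pitfall.
    return {"name": name, "age": age}

print(parse_signup('{"name": "Ada", "age": 36, "admin": true}'))
# → {'name': 'Ada', 'age': 36}
```

The alternative, feeding untrusted input to a maximally powerful parser (for instance, a YAML loader that can instantiate arbitrary objects), is precisely the kind of shortcut that turns into a security hole.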
My conclusion is this: use whatever language makes sense, but be systematically paranoid. Be liberal in what you accept, but conservative about what you believe.
It's not about IT buying, but about making data work for you. Learn more in the Big Data in Enterprise IT program at Strata California.
In a world where technology and business are ever more intertwined, IT leaders aspire to key roles in their organizations. Sadly, industry conferences can lag behind, assuming IT is all about making the right buying decisions.
Not so at Strata.
Our approach is to take a view of data for business that centers on the problems you need to solve. The excitement around big data isn’t really about large volumes of data; it’s about the smart use of data. It’s about using data to make your products better, help you be significantly more efficient, and create new products and businesses.
Getting the most from big data and data science involves a lot more than a software choice. The business aims come first, along with a good understanding of the problems you want to solve. Then you need to understand the capabilities of the technology and where data science can best be applied. Finally, you need to know how to run successful data projects, and how to hire and manage data teams.
Working with analytics and BI expert Mark Madsen, I’ve compiled a day-long program at Strata called Big Data in Enterprise IT that will take you through big data strategy, the issues of managing data, and how data science can be used effectively in your organization.
We can change the future, and we must.
I sat last night at Aaron Swartz’s memorial in San Francisco, among the very people who built the Internet, the web, the culture of young entrepreneurialism and Web 2.0 startups. Among the pioneers of Creative Commons, Electronic Frontier Foundation, open source software and those fighting to keep the public domain public.
Aaron was one of them.
It was a family reunion, under dreadful circumstances nobody would have wished for.
In his life Aaron had worked and learned among the thoughtful leaders who built the web we benefit from today. He worked with the W3C, when the web was still “1.0,” and then in the social web and the hotbed of innovation and startup culture at Y Combinator.
Aaron’s passion for providing access to knowledge drove the most recent years of his life, from the campaign against SOPA to the liberation of public court records from PACER. And of course the downloading of journal articles, leading to the events that have brought his death so much into the public eye. Yet as Carl Malamud passionately insisted last night, Aaron was not a lone actor, but part of a peaceful army of reformers.
Diversity and manageability are big data watchwords for the next 12 months.
Here are some of the key big data themes I expect to dominate 2013, and of course will be covering in Strata.
Emergence of a big data architecture
The coming year will mark the graduation of many big data pilot projects, as they are put into production. With that comes an understanding of the practical architectures that work. These architectures will identify:
- best of breed tools for different purposes, for instance, Storm for streaming data acquisition
- appropriate roles for relational databases, Hadoop, NoSQL stores and in-memory databases
- how to combine existing data warehouses and analytical databases with Hadoop
Of course, these architectures will be in constant evolution as big data tooling matures and experience is gained.
In parallel, I expect to see increasing understanding of where big data responsibility sits within a company’s org chart. Big data is fundamentally a business problem, and some of the biggest challenges in taking advantage of it lie in the changes required to cross organizational silos and reform decision making.
One to watch: it’s hard to move data, so look for a starring architectural role for HDFS for the foreseeable future.
Unraveling what programming will need for the next 10 years.
Programming is changing. The PC era is coming to an end, and software developers now work with an explosion of devices, job functions, and problems that need different approaches from those of the single-machine era. In our age of exploding data, the ability to do some kind of programming is increasingly important to every job, and programming is no longer the sole preserve of an engineering priesthood.

Over the course of the next few months, I’m looking to chart the ways in which programming is evolving, and the factors that are affecting it. This article captures a few of those forces, and I welcome comment and collaboration on how you think things are changing.
Where am I headed with this line of inquiry? The goal is to describe the essential skills programmers will need for the coming decade, the places they should focus their learning, and the difference between short-term trends and long-term shifts.
Web services combine to give us our data, and help us use it.
The web service IFTTT (If This Then That) accesses popular web applications via their APIs, and lets users create new actions based on changes: for instance, “upload photos to Flickr when I add them to my Dropbox folder,” or “send me an email when frost is forecast.”
I had been tempted to classify IFTTT as merely an interesting toy for playing with social media. Granted, it’s nice that I can archive all my tweets into an Evernote note, but so what? However, IFTTT’s growing feature set is showing it to be more than a bauble. The service is becoming an empowering tool that gives users more control over their own data, which was previously accessible only to programmers.
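The trigger-action pattern IFTTT popularized can be sketched in a few lines. This is a toy illustration only: the channel functions below are stand-ins I've invented, not real IFTTT, Dropbox, or Flickr APIs, and a real recipe would poll web services over HTTP rather than stub data:

```python
def new_dropbox_photos(seen):
    """Pretend 'this' channel: return items we haven't acted on yet.
    A real implementation would query a web API for recent changes."""
    current = {"beach.jpg", "sunset.jpg"}  # stub data
    return current - seen

def upload_to_flickr(item):
    """Pretend 'that' channel: in reality, a web API call."""
    return f"uploaded {item}"

def run_recipe(seen):
    """One polling pass: fire the action for each new trigger item."""
    results = []
    for item in sorted(new_dropbox_photos(seen)):
        results.append(upload_to_flickr(item))
        seen.add(item)  # remember it so the action fires only once
    return results
```

Run repeatedly, the recipe acts only on changes: the first pass uploads both stub photos, and subsequent passes do nothing until new items appear.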
Why we all need to understand and use big data.
Where does all the data in “big data” come from? And why isn’t big data just a concern for companies such as Facebook and Google? The answer is that the web companies are the forerunners. Driven by social, mobile, and cloud technology, there is an important transition taking place, leading us all to the data-enabled world that those companies inhabit today.
From exoskeleton to nervous system
Until a few years ago, the main function of computer systems in society, and business in particular, was as a digital support system. Applications digitized existing real-world processes, such as word-processing, payroll and inventory. These systems had interfaces back out to the real world through stores, people, telephone, shipping and so on. The now-quaint phrase “paperless office” alludes to this transfer of pre-existing paper processes into the computer. These computer systems formed a digital exoskeleton, supporting a business in the real world.
The arrival of the Internet and web has added a new dimension, bringing in an era of entirely digital business. Customer interaction, payments and often product delivery can exist entirely within computer systems. Data doesn’t just stay inside the exoskeleton any more, but is a key element in the operation. We’re in an era where business and society are acquiring a digital nervous system.
The essential principles of conference development.
I’ve chaired computer industry conferences for ten years now: first for IDEAlliance (XML Europe, XTech), and more recently with O’Reilly Media (OSCON, Strata). Over the years I have tried to balance three factors as I select talks: proposal quality, important new work, and practical value of the knowledge to the attendees.
As the competition for speaking slots at both Strata and OSCON reaches intense levels, I wanted to articulate these factors, and the principles I use when compiling conference programs.
How the program is made
My guiding principle in putting a program together is value to the attendees. They’re why we do this. By putting out quality content and speakers, we attract thinking, interested attendees. In turn, our sponsors get a much better quality of conversation and customer contact through their presence at the event.
Here’s the process in a nutshell: proposals are invited through a public call for participation, and then reviewers, drawn from the industry’s community of experts, grade and comment on each proposal. My co-chairs and I use this feedback, along with editorial judgement, to compile the final schedule. For keynotes, and for a small number of breakout sessions, we augment the review process by inviting talks we think are important for the program.