In a recent blog post, Edd Dumbill, VP of strategy at Silicon Valley Data Science, wrote about the phrase “data lake.” Likening it to a dream, he described a data lake as “a place with data-centered architecture, where silos are minimized, and processing happens with little friction in a scalable, distributed environment…Data itself is no longer restrained by initial schema decisions, and can be exploited more freely by the enterprise.” He explained that he called it a “dream” because “we’ve a way to go to make the vision come true” — but noted he’s optimistic the dream can be realized.
In this Radar Podcast episode, O’Reilly’s Mac Slocum sits down with Dumbill to talk about the data lake, the opportunities the model presents, and the driving forces behind the concept. Edd argues that the primary opportunity lies in organizational agility:
“The data lake is a model that’s part of a larger platform, really. Where you have the data stored in a repository where it can easily be accessed and create new applications from it. We’re also sitting atop a foundation of the cloud infrastructure of devops. … The higher level, we’re looking at developing in an agile sense. We’re not looking at three year projects anymore. … It’s not just the technology, it’s this combination working in an agile sense and also this idea that having data isn’t just the end of the story; it’s not just a big hole where you put things into.
“The old way was you build an application A to do thing A, then application B to do thing B, and they all had assumptions about the data and threw different things away. Now if you can espouse the model that, famously, Amazon has done — they don’t build any functionality without exposing it as a service. Well, think about that in a data sense as well. Don’t create any data without making it available in a controlled and useful way back to an organization. I think, done right, this is a power house for agility, a power house for more invention, and really creating value from data rather than just seeing it as a cost incentive.”
Edd also talks about Hadoop, which he says has become “a distributed operating system for data processing,” and likens Hadoop’s trajectory to the evolution of Linux:
“Hadoop came along, people using it as cheaper storage, cheaper data warehousing. Linux came along, people are saying, ‘Wow, that’s a free operating system; I don’t need to pay.’ But it made new things possible. First when something is that free, you can create it and use it so many more times and you free yourself of a scaling capacity. Google was built on commodity Linux hardware. But then also it can go places. Right now, we’re owning cell phones with Linux and so on. So we’re starting to find innovation, with processing going all kinds of places for all different kinds of applications.”
Also in this podcast…
In the second segment, O’Reilly’s big data guru Ben Lorica chats with Rajiv Maheswaran, CEO of Second Spectrum. They tackle the problem of spatial-temporal pattern recognition, which Maheswaran calls “the science of moving dots.” This segment is a crossover feature from our new O’Reilly Data Show Podcast, which is available on iTunes, SoundCloud, and through our RSS feed.