Yahoo!'s bet on Hadoop

One of the most important announcements at Oscon last week was Yahoo!’s commitment to support Hadoop. We’ve been writing about Hadoop on radar for a while, so it’s probably not news to you that we think Hadoop is important.

Yahoo’s involvement wasn’t actually news either, because Yahoo! had hired Doug Cutting, the creator of hadoop, back in January. But Doug’s talk at Oscon was kind of a coming out party for Hadoop, and Yahoo! wanted to make clear just how important they think the project is. In fact, I even had a call from David Filo to make sure I knew that the support is coming from the top.

Jeremy Zawodny’s post about hadoop on the Yahoo! developer network does a great job of explaining why Yahoo! considers hadoop important:

For the last several years, every company involved in building large web-scale systems has faced some of the same fundamental challenges. While nearly everyone agrees that the “divide-and-conquer using lots of cheap hardware” approach to breaking down large problems is the only way to scale, doing so is not easy.

The underlying infrastructure has always been a challenge. You have to buy, power, install, and manage a lot of servers. Even if you use somebody else’s commodity hardware, you still have to develop the software that’ll do the divide-and-conquer work to keep them all busy.

It’s hard work. And it needs to be commoditized, just like the hardware has been…

To build the necessary software infrastructure, we could have gone off to develop our own technology, treating it as a competitive advantage, and charged ahead. But we’ve taken a slightly different approach. Realizing that a growing number of companies and organizations are likely to need similar capabilities, we got behind the work of Doug Cutting (creator of the open source Nutch and Lucene projects) and asked him to join Yahoo to help deploy and continue working on the [then new] open source Hadoop project.

Let me unpack the two parts of this news: hadoop as an important open source project, and Yahoo!’s involvement. On the first front, I’ve been arguing for some time that free and open source developers need to pay more attention to Web 2.0. Web 2.0 software-as-a-service applications built on top of the LAMP stack now generate several orders of magnitude more revenue than any companies seeking to directly monetize open source. And most of the software used by those Web 2.0 companies above the commodity platform layer is proprietary. Not only that, Web 2.0 is siphoning developers and buzz away from open source.

But there are open source projects that are tackling important Web 2.0 problems “up the stack.” Brad Fitzpatrick’s LiveJournal scaling tools memcached, perlbal, and mogileFS come to mind, as well as OpenID. Hadoop is another critical piece of Web 2.0 infrastructure now being duplicated in open source. (I’m sure there are others, and we’d love to hear from you about them in the comments.)

OK — but why is Yahoo!’s involvement so important? First, it indicates a kind of competitive tipping point in Web 2.0, where a large company that is a strong #2 in a space (search) realizes that open source is a great competitive weapon against their dominant competitor. It’s very much the same reason why IBM got behind Eclipse, as a way of getting competitive advantage against Sun in the Java market. (If you thought they were doing it out of the goodness of their hearts rather than clear-sighted business logic, think again.) If Yahoo! is realizing that open source is an important part of their competitive strategy, you can be sure that other big Web 2.0 companies will follow. In particular, expect support of open source projects that implement software that Google treats as proprietary. (See the long discussion thread on my post about Microsoft’s submission of their shared source licenses to OSI for my arguments as to why “being on the right side of history” will ultimately drive Microsoft to open source.)

Supporting Hadoop and other Apache projects not only gets Yahoo! deeply involved in open source software projects they can use, it also helps give them renewed “geek cred.” And of course, attracting great people is a huge part of success in the computer industry (and, for that matter, in any other).

Second, and perhaps equally important, Yahoo! gives hadoop an opportunity to be tested out at scale. Some years ago, I was on the board of Doug’s open source search engine effort, Nutch. Where the project foundered was in not having a large enough data set to really prove out the algorithms. Having more than a couple of hundred million pages in the index was too expensive for a non-profit open source project to manage. One of the important truths of Web 2.0 is that it ain’t the personal computer era any more, Eben Moglen’s arguments to the contrary notwithstanding. A lot of really important software can’t even be exercised properly without very large networks of machines, very large data sets, and heavy performance demands. Yahoo! provides all of these. This means that Hadoop will work for the big boys, and not just for toy projects. And as Jeremy pointed out in his post (linked and quoted above), today’s big boy may be everyday folks a few years from now, as the size and scale of Web 2.0 applications continue to increase.

BTW, in followup conversations with Doug, he pointed out that web search is not actually the killer app for hadoop, despite the fact that it is in part an implementation of the MapReduce technique made famous by Google. After all, Yahoo! has been doing web search for years without this kind of general purpose scaling platform. “Where Hadoop really shines,” says Doug, “is in data exploration.” Many problems, including tuning ad systems, personalization, learning what users need — and for that matter, corporate or government data mining — involve finding signal in a lot of noise. Doug pointed me to an interesting article on Amazon Web Services Developer Connection: Running Hadoop MapReduce on Amazon EC2 and Amazon S3. Doug said in email:

It provides an example of using Hadoop to mine one’s [logfile] data.

Another trivial application for log data that’s very valuable is reconstructing and analyzing user sessions. Say you’ve got logs for months or years from hundreds of servers and you want to look at individual user sessions, e.g., how often do users visit, how long are their sessions, how do they move around the site, how often do they re-visit the same places, etc. This is a single MapReduce operation over all the logs, blasting through, sorting and collating all your logs at the transfer rate of all the drives in your cluster. You don’t have to re-structure your database to measure something new. It’s really as easy as ‘grep | sort | uniq’.
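
To make the session example in Doug’s email concrete, here is a minimal sketch of the map, sort, and reduce steps run in a single Python process. It is not Hadoop’s API: the log format (“timestamp user_id url”), the 30-minute session gap, and the names map_log_line, reduce_user, and run_job are all assumptions made for illustration. On a real cluster the framework would do the sorting and grouping across machines.

```python
from itertools import groupby
from operator import itemgetter

# Assumption: a gap of more than 30 minutes between hits starts a new session.
SESSION_GAP = 30 * 60  # seconds


def map_log_line(line):
    """Map step: turn one log line into a (user_id, (timestamp, url)) pair.

    Assumes a hypothetical whitespace-separated format: "timestamp user_id url".
    """
    timestamp, user_id, url = line.split()
    yield user_id, (int(timestamp), url)


def reduce_user(user_id, hits):
    """Reduce step: split one user's hits into time-ordered sessions."""
    sessions = []
    for timestamp, url in sorted(hits):
        last_seen = sessions[-1][-1][0] if sessions else None
        if last_seen is not None and timestamp - last_seen <= SESSION_GAP:
            sessions[-1].append((timestamp, url))   # continue current session
        else:
            sessions.append([(timestamp, url)])     # start a new session
    return user_id, sessions


def run_job(lines):
    """Run map, then sort and group by key (the 'shuffle'), then reduce."""
    pairs = [pair for line in lines for pair in map_log_line(line)]
    pairs.sort(key=itemgetter(0))  # sort by user_id so groupby sees each user once
    return dict(
        reduce_user(user_id, [value for _, value in group])
        for user_id, group in groupby(pairs, key=itemgetter(0))
    )
```

Calling run_job on a list of log lines returns a dictionary mapping each user to a list of sessions, from which visit frequency or session length can be counted. The shape mirrors ‘grep | sort | uniq’: the map step filters and extracts, the sort brings each user’s records together, and the reduce step collapses them.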

Also, here are the slides from my talk: http://wiki.apache.org/lucene-hadoop/HadoopPresentations

Update: In response to a comment, I updated this article to clarify what I was talking about regarding open source and the competitive landscape.

  • Bret

    It really amazes me that I got to the end of the article before reading the name “Google.” Hadoop started as, and is almost entirely, an implementation of technologies described in papers from Google on MapReduce, GFS, etc. Did someone from Yahoo pay you to not mention that or something? ;)

    Seriously, I expect more from this blog. Not mentioning important (obvious? ironic?) points like that is something I expect from other blogs, but not this one.

  • genium

    linux 2.0 + web 2.0 = linternux
    http://media-tech.blogspot.com/2006/03/linternux-linux-20-avec-le-web-20.html

    Anyway, i don’t like the word “service”; it’s not really RESTful. We should talk about “agent-resource”, or better, “software-robot”…

  • http://tim.oreilly.com Tim O'Reilly

    Bret –

    Given that we’ve written about hadoop before, and that all the prominently linked hadoop pages make mention of the Google connection, it just didn’t occur to me that it needed to be front and center. I guess I’m just too close to the subject.

    Meanwhile, did you miss one of the main points of the article, namely that I thought it was significant that Yahoo! was getting behind Hadoop as a way of competing with Google?

    I thought that was obvious, but clearly it could be made more so. I’ll add a parenthetical statement to make it REALLY explicit.

  • http://datastrategy.wordpress.com/ Chuck Lam

    I just had a post about Hadoop on my blog “Data Strategy” a few days ago where I mentioned Hadoop’s increasing momentum as extensions are being built on top of it. Of note is Pig from Yahoo Research, which builds a relational algebra framework on top of Hadoop so it behaves more like a SQL engine. I also mentioned a paper from Stanford researchers that discusses how to decompose popular machine learning algorithms to run on the MapReduce framework.

  • http://www.mattcutts.com/blog/ Matt Cutts

    In addition to Hadoop, don’t forget to mention that Overture chipped in on Nutch. From an article about four years ago at http://news.zdnet.com/2100-3513_22-5064913.html

    “Nutch itself has been operating secretly for roughly a year, gathering support from developers and funding from one of the biggest commercial players in search: Overture Services.”

    So this isn’t the first example of a company pursuing this approach.

  • http://www.sriramkrishnan.com Sriram Krishnan

    Tim – this (and Dare’s post about Web 2.0 lock-in of data) made me think about open source in a web 2.0 world (where more and more interesting stuff happens on the cloud rather than on your computer).

    See http://www.sriramkrishnan.com/blog/2007/08/open-source-and-scratching-itches-in.html

  • http://code.google.com/edu/content/parallel.html Chris DiBona

    Since we’re talking about Hadoop and such, I’d love to point out that we have created some pretty bitchin courseware for people to learn and use hadoop. See the url linked to my name, or go to http://code.google.com/edu/content/parallel.html

    Hadoop is pretty awesome.

  • http://schestowitz.com Roy Schestowitz

    > Free and open source developers
    > need to pay more attention to Web 2.0

    If only I knew what Web 2.0 truly is and what makes it possible. In my humble opinion, too many startups out there build upon the free work of others and give absolutely nothing in return. Giving back could in fact bring benefits to **them**, but they fail to understand that collaboration goes both ways. Giving away is not giving away when you choose the right licences.

  • http://michaelbernstein.com Michael R. Bernstein

    It seems that what we’re talking about here is the recapitulation of the mainframe era (and its eventual transition to minicomputers and the PC era). So, at some point it makes sense to start talking about a ‘personal cluster’ instead of a ‘personal computer’.

    Virtual Machines will undoubtedly play into this, as they eventually provide the individual developer the capability to simulate an entire cluster of low-cost servers on a single reasonably powerful multi-core desktop box.

    Tim said “A lot of really important software can’t even be exercised properly without very large networks of machines, very large data sets, and heavy performance demands.” This is true, but I wonder what the lower bound is for the size and complexity of a system that is useful for development and testing purposes. 10 virtual machines is probably too low. What about 100? 1000?

    Whatever the lower bound of utility in this context is, I wonder when we can expect a commodity desktop workstation that is capable of bootstrapping and running a virtual cluster at that lower bound of complexity to cost around the same as a 1990s PC. Because it was only when the basic tools became financially accessible to a broader audience that things really started taking off.

  • http://tim.oreilly.com Tim O'Reilly

    Chris –

    Good point about Google also supporting Hadoop. One of the big differences in competition in the Web 2.0 era is that software alone is no longer necessarily the key differentiator for a company. Google isn’t a software-only company in the way that Microsoft was. That makes a company much more resilient in the face of “open source as a competitive strategy.” Google can and does release lots of its own software as OSS — and while Google didn’t release MapReduce source, it certainly published the very detailed paper on which Doug built.

    It would be really interesting to me to figure out what kinds of things will be kept proprietary and which not.

    I argue that a lot of network effects databases tend to have a kind of natural lock-in (the ebay effect) that is independent of a pure software advantage, but there are still areas where software confers advantage. What are they?

  • http://michaelbernstein.com Michael R. Bernstein

    there are still areas where software confers advantage. What are they?

    I think a lot of areas are various kinds of efficiency/scalability that are ‘boring’ in some sense, yet can be critical:

    Write software that produces results that are just as good with less data (get the network effect to kick in sooner).

    Write software that produces better results with the same data (a qualitative advantage, but access to the same data is key. Note that this is what Google Pagerank did).

    Write software that crunches the same data using less CPU cycles / Memory / Disk accesses (lower costs, become cash-flow-positive sooner, eventually have higher margins).

    All of these can be sources of competitive advantage, especially in the early days of a startup if it makes serving smaller markets more viable and attractive.

  • http://www.adscriptor.com Jean-Marie Le Ray

    Link broken in your post : http://wiki.apache.org/lucene-hadoop/HadoopPresentations
    Jean-Marie

  • Patrick Mansted

    Why is O’Reilly refusing to release Moglen’s presentation to the media or to the public? Is this the only keynote that is being withheld by O’Reilly? You appear to provide most (all?) other keynotes:
    http://conferences.oreillynet.com/pub/w/58/presentations.html

    Moglen’s presentation isn’t even mentioned on that list. What gives?

  • Tomasz Gorski

    I read the San Francisco Chronicle article which outlined your efforts in finding a solution to Yahoo’s declining market share and fortunes. The article detailed how you are leading the effort at discovering a new product or idea that will counter Google’s ever-increasing domination. From reading past articles about Yahoo, it appears that CEO Jerry Yang is also in pursuit of new “killer apps” to take back the lead from Google. You are both looking in the wrong direction.

    The salvation of Yahoo lies not in looking forward – but in looking backwards. This is a time for “backwards thinking”. Yahoo already has the solution to its problems. In fact it has everything it needs, to not only successfully battle Google, but to once again dominate. A killer app won’t save you – you need to beat Google at its own game – and Yahoo and only Yahoo can. The solution lies in Yahoo’s boring, neglected, and nearly abandoned, directory.

  • Patrick Mansted

    I suppose I should have given more context:

    The Register contacted the OSCON audio staff to obtain a recording of the session. “No problem,” they said, “It will just take a couple of minutes, but you need to get O’Reilly’s permission first.” O’Reilly corporate refused to release the audio, saying it would cause a slippery slope. (We’re still trying to understand that one.) They, however, did add that Moglen appeared to be “off his meds.”

    http://www.theregister.co.uk/2007/07/26/oreilly_moglen_oscon/

  • http://www.hackszine.com Brian Jepson

    Hi Patrick,

    Although Eben’s talk was listed as a keynote, it was part of the Executive Briefing. We haven’t posted any of the video from the Executive Briefing, just the keynotes from the main conference. I’ll look into whether we have video of the Executive Briefings and see if it’s possible to get them imported, edited, and posted online, although it may take a while because it’s a full day of video.

    - Brian

  • http://tim.oreilly.com Tim O'Reilly

    Patrick –

    Eben’s conversation with me wasn’t a keynote at Oscon. It was part of the Radar Executive Briefing, a separate event on the same day as the Oscon tutorials. The videos that are up are from the main show keynotes. The other presentations that are on that list from which you claim Eben’s presentation is suspiciously missing are just that: presentation files. Eben didn’t make a presentation, ergo no presentation files on the list. Even for the session that he did separately for Oscon, he didn’t show any slides, just spoke from his notes.

    As to why we didn’t give the video from the Radar session to the Register — why would we give video to a competitor in the online news space? Especially one that is traditionally hostile to O’Reilly and never misses an opportunity to spin news to put us in a bad light? If it goes up, it will go up on the O’Reilly site. If you take the Register as a legitimate news source, rather than link-baiters, you probably also watch the “other” O’Reilly’s “no-spin zone” without realizing just how much spin there is there too…

    I have no objection to putting up the video if we have it. (See comment from Brian Jepson, who’s been putting up the keynote videos.) However, I was really disappointed that Eben was unwilling to talk about the issue that he had agreed to come discuss, and instead used the stage for name-calling. I believe that there are real issues here that I’d much rather have aired, rather than the FSF’s long-standing grudges against O’Reilly.

    More to the point, if you watch the video, you will see I’m doing my best not to respond to fairly direct insults (for which Eben has since apologized). Various people in the audience got quite angry on my behalf, even though I tried to stay cool. I didn’t see the reason to rush out a video that would likely inflame things further. Unlike the Register, which likes to “bite the hand” that feeds it, I prefer constructive engagement. That’s also why I didn’t write a response to Eben’s attacks. What good does escalation do when the comments are ad hominem rather than substantial?

    I’ve reached out to Eben, and I yet hope to have the conversation with him that I hoped to have on stage. If we have that conversation and are able to record it, I’ll be sure to release that video!

    Meanwhile, I’ve checked with the folks working on the video, and will see where this is in the pipeline.

  • Patrick Mansted

    Eben’s conversation with me wasn’t a keynote at Oscon.

    As Brian Jepson mentioned, I was misled by
    http://conferences.oreillynet.com/pub/w/58/speakers.html
    where it is listed as a keynote.

    As to why we didn’t give the video from the Radar session to the Register

    They only asked for audio, which presumably they were going to comment on and not release (or you could have specified those terms).

    — why would we give video to a competitor in the online news space?
    Especially one that is traditionally hostile to O’Reilly and never misses an opportunity to spin news to put us in a bad light?

    By that reasoning, why would you let them attend and report OSCON at all? By their account they merely (foolishly) chose not to attend that particular presentation, so they would have been able to give a first hand report on it without an audio recording. But this is an irrelevant digression.

    If it goes up, it will go up on the O’Reilly site.

    I don’t think anyone was objecting to this.

    If you take the Register as a legitimate news source, rather than link-baiters, you probably also watch the “other” O’Reilly’s “no-spin zone” without realizing just how much spin there is there too…

    Oh the irony. No matter how I feel about the Register, they currently
    have the best account of the exchange. Of course if there was video I
    wouldn’t need to trust the Register or O’Reilly corp for that
    matter, provided it was unedited; deflating any ‘spin’. You can
    imagine that some would believe your depiction of Moglen making
    completely substanceless accusations — that he did nothing but
    call you names — to be somewhat suspect.

    If nothing else, the Register’s article has diminished the chance of
    the event being buried, ‘move along nothing to see here’, or under
    some guise of paternalism. That wouldn’t be transparent, wouldn’t be open and
    wouldn’t facilitate social involvement.

    Meanwhile, I’ve checked with the folks working on the video, and will
    see where this is in the pipeline.

    To be clear, you do intend to release the video? Because some of your
    response seemed to be an apology^ for not releasing the video in the
    future.

    ^ in the Socratic sense.

    P.S. I think it would be polite if you had a footer (or other
    annotation) that mentioned when you have edited your own posts and why.

  • http://tim.oreilly.com Tim O'Reilly

    Patrick –

    Let me start with your Ps. I do make notes if I ever make a change to a post. But I assume you’re referring to the change I made to the comment above. When I first posted it, I thought that Brian Jepson did indeed have video from the Executive Briefing, and just hadn’t gotten to it, because I saw the Phil Torrone session listed on the video page. But he corrected me immediately after I posted, that that was a separate keynote session in the main conference. It seemed easier to correct that mistake by editing the comment rather than to create a set of cascading corrections.

    As to the Register having the best account of the event, surely you jest. There’s a good, factual blow-by-blow on Linux.com: http://www.linux.com/feature/118201 There are a few things I’d correct there (e.g. I countered that “If Google and Web 2.0 are thermal noise, then Linux is also thermal noise.” Linux.com left out the conditional).

    To be clear, what Brian said. He doesn’t know if video is available. If it is, we’ll put it up. If not, we’ll put up the audio, which I know is available.

    And you exaggerate when you claim that I said that Eben made “completely substanceless” accusations. I said that he made ad hominem attacks (e.g. that open source and web 2.0 were really just O’Reilly self-promotional money-grubbing terms that wasted the opportunities that the FSF had provided to deal with freedom) rather than being willing to talk about whether software-as-a-service by any name you like to call it raised issues for the GPL, and why he’d resolved those issues in the way that he did.

    He did eventually talk about that in response to a questioner from the audience, but he refused to talk with me about it.

  • Patrick Mansted

    Let me start with your Ps. I do make notes if I ever make a change to
    a post. But I assume you’re referring to the change I made to the
    comment above.

    To be clear, I wasn’t suggesting that you did anything underhanded in
    your change to the comment. Nonetheless I think it is bad form for
    various reasons, most of which are probably the same reasons why you
    annotate changes to posts. For instance, editing a comment would be
    unfair for succeeding commenters as it misrepresents the conversation
    – even if the changes are deemed insignificant as such judgements are
    subjective. Perhaps surprisingly, this can happen even if the
    respondent’s comment hasn’t yet appeared: The commenter may be in the
    process of responding to the original version that they previously
    read which, without obvious indications to the contrary, they believe
    to be the same as the ‘current’ version. To a less important extent
    unannotated edits become perplexing to those who notice the change
    and wonder why they occurred. Finally it keeps people’s sanity in
    check: “Am I going crazy? I could have sworn I just read ‘foo’ and now
    it says ‘bar’”. Many web forums and commenting systems thus mark when
    an entry has been edited.

    To be clear, what Brian said. He doesn’t know if video is available. If it is, we’ll put it up. If not, we’ll put up the audio, which I know is available.

    Thank you. I hope the video is indeed available as body language can
    convey a lot, particularly in an interaction between two people. If
    you also have Moglen’s other talk (even just audio) that would be great as
    well.

    And you exaggerate when you claim that I said that Eben made “completely substanceless” accusations. I said that he made ad hominem attacks.

    I have difficulty interpreting the following any other way:

    That’s also why I didn’t write a response to Eben’s attacks. What good
    does escalation do when the comments are ad hominem rather than
    substantial?

    There is also this statement:

    I believe that there are real issues here that I’d much rather have aired, rather than the FSF’s long-standing grudges against O’Reilly.

    where it is strongly implied that Moglen didn’t raise any ‘real issues’.

    In any event, is it your contention now that there was indeed
    substance to Moglen’s remarks, only that they were tarnished by ‘name
    calling’?

  • Patrick Mansted

    Argh! Sorry about the formatting. Your preview doesn’t render the
    same way as the actual post does. In particular ‘preview’ seems to ignore new lines while the ‘real deal’ converts them to ‘br’ tags.

  • Li

    The link to “Hadoop” in the first sentence of your post is broken.

  • Miles

    I think Yahoo realized that open source is a great competitive weapon long ago. The creator of PHP works there. They’ve had some MySQL heavyweights. Their servers have been FreeBSD and Apache based forever. There aren’t many projects that *really* get a company geek cred regardless of the amount of involvement. The biggest/only exception I can think of is that Google gets lots of credit for working on Firefox.

  • http://www.urbanmvp.com J.O. Urban

    I don’t think hadoop.org is the actual domain for the project. The correct link is http://lucene.apache.org/hadoop/ for the Hadoop project.

    Anyways, for large corporations in the technology field, supporting or being closely tied with an open source project is essential even if it is nothing more than a public relations exercise. No one wants to get the Microsoft “Evil Greedy” corporation label.

  • http://curvesetter.com C.O. Der

    There is a vmware appliance for anyone who wants to fool around with hadoop:

    http://code.google.com/edu/tools/hadoopvm/index.html