Martin Kleppmann

Martin is a software engineer and entrepreneur, specializing in the data infrastructure of Internet companies. His last start-up, Rapportive, was acquired by LinkedIn in 2012. He is a committer on Apache Samza and author of the O'Reilly book Designing Data-Intensive Applications.

Wouldn’t it be fun to build your own Google?

Exploring open web crawl data — what if you had your own copy of the entire web, and you could do with it whatever you want?


For the last few millennia, libraries have been the custodians of human knowledge. By collecting books, and making them findable and accessible, they have done an incredible service to humanity. Our modern society, culture, science, and technology are all founded upon ideas that were transmitted through books and libraries.

Then the web came along, and allowed us to also publish all the stuff that wasn’t good enough to put in books, and do it all much faster and cheaper. Although the average quality of material you find on the web is quite poor, there are some pockets of excellence, and in aggregate, the sum of all web content is probably even more amazing than all libraries put together.

Google (and a few brave contenders like Bing, Baidu, DuckDuckGo and Blekko) have kindly indexed it all for us, acting as the web’s librarians. Without search engines, it would be terribly difficult to actually find anything, so hats off to them. However, what comes next, after search engines? It seems unlikely that search engines are the last thing we’re going to do with the web.