## Exploring open web crawl data — what if you had your own copy of the entire web, and you could do with it whatever you want?

For the last few millennia, libraries have been the custodians of human knowledge. By collecting books, and making them findable and accessible, they have done an incredible service to humanity. Our modern society, culture, science, and technology are all founded upon ideas that were transmitted through books and libraries.

Then the web came along, and allowed us to also publish all the stuff that wasn’t good enough to put in books, and do it all much faster and cheaper. Although the average quality of material you find on the web is quite poor, there are some pockets of excellence, and in aggregate, the sum of all web content is probably even more amazing than all libraries put together.

Google (and a few brave contenders like Bing, Baidu, DuckDuckGo and Blekko) have kindly indexed it all for us, acting as the web’s librarians. Without search engines, it would be terribly difficult to actually find anything, so hats off to them. However, what comes next, after search engines? It seems unlikely that search engines are the last thing we’re going to do with the web.

A small number of organizations, including Google, have crawled the web, processed, and indexed it, and generated a huge amount of value from it. However, there’s a problem: those indexes, the result of crawling the web, are hidden away inside Google’s data centers. We’re allowed to make individual search queries, but we don’t have bulk access to the data.

Imagine you had your own copy of the entire web, and you could do with it whatever you want. (Yes, it would be very expensive, but we’ll get to that later.) You could do automated analyses and surface the results to users. For example, you could collate the “best” articles (by some definition) written on many different subjects, no matter where on the web they are published. You could then create a tool which, whenever a user is reading something about one of those subjects, suggests further reading: perhaps deeper background information, or a contrasting viewpoint, or an argument on why the thing you’re reading is full of shit.

(I’ll gloss over the problem of actually implementing those analyses. The signal-to-noise ratio on the web is terrible, so it’s difficult to determine algorithmically whether a particular piece of content is any good. Nevertheless, search engines are able to give us useful search results because they spend a huge amount of effort on spam filtering and other measures to improve the quality of results. Any product that uses web crawl data will have to decide what is noise, and get rid of it. However, you can’t even start solving the signal-to-noise problem until you have the raw data. Having crawled the web is step one.)

Unfortunately, at the moment, only Google and a small number of other companies that have crawled the web have the resources to perform such analyses and build such products. Much as I believe Google try their best to be neutral, a pluralistic society requires a diversity of voices, not a filter bubble controlled by one organization. Surely there are people outside of Google who want to work on this kind of thing. Many a start-up could be founded on the basis of doing useful things with data extracted from a web crawl.

The idea of collating several related, useful pieces of content on one subject was recently suggested by Justin Wohlstadter (indeed it was a discussion with Justin that inspired me to write this article). His start-up, Wayfinder, aims to create such cross-references between URLs, by way of human curation. However, it relies on users actively submitting links to Wayfinder’s service.

I argued to Justin that I shouldn’t need to submit anything to a centralized database. By writing a blog post (such as this one) that references some things on the web, I am implicitly creating a connection between the URLs that appear in this blog post. By linking to those URLs, I am implicitly suggesting that they might be worth reading. (Of course, this is an old idea.) The web is already, in a sense, a huge distributed database.

By analogy, citations are very important in scientific publishing. Every scientific paper uses references to acknowledge its sources, to cite prior work, and to make the reader aware of related work. In the opposite direction, the number of times a paper is cited by other authors is a metric of how important the work is, and citation counts have even become a metric for researchers’ careers.

Google Scholar and bibliography databases maintain an index of citations, so you can find later work that builds upon (or invalidates!) a particular paper. As a researcher, following those forward and backward citations is an important way of learning about the state of the art.

Similarly, if you want to analyze the web, you need to be able to traverse the link graph and see which pages link to each other. For a given URL, you need to be able to see which pages it links to (outgoing links) and which pages link to it (incoming links).

You can easily write a program that fetches the HTML for a given URL, and parses out all the links — in other words, you can easily find all the outgoing links for an URL. Unfortunately, finding all the incoming links is very difficult. You need to download every page on the entire web, extract all of their links, and then collate all the web pages that reference the same URL. You need a copy of the entire web.
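To make the easy half concrete, here is a minimal sketch of outgoing-link extraction using only Python's standard library; the HTML snippet and URLs are made up for illustration, and a production crawler would use a more forgiving parser:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag encountered in a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL
                    self.links.append(urljoin(self.base_url, value))

# Hypothetical page content and URL, purely for demonstration
html = '<p>See <a href="/about">about</a> and <a href="http://example.org/">example</a>.</p>'
parser = LinkExtractor("http://example.com/index.html")
parser.feed(html)
print(parser.links)  # the outgoing links of this one page
```

Running this over one page is trivial; the point of the article is that you would have to run it over every page on the web to answer the reverse question.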

Incoming and outgoing links for a URL. Finding the outgoing ones is easy (just parse the page); finding the incoming ones is hard.
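The collation step itself is just an inversion of the outgoing-link map. A toy sketch, with hypothetical URLs; at web scale this would be a distributed job (the MapReduce "shuffle" does exactly this grouping) rather than an in-memory dictionary:

```python
from collections import defaultdict

# Outgoing links per page, as produced by parsing each crawled page
# (hypothetical example data).
outgoing = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": ["http://a.example/"],
}

# Invert the graph: for each link target, record which pages point at it.
incoming = defaultdict(list)
for page, links in outgoing.items():
    for target in links:
        incoming[target].append(page)

print(sorted(incoming["http://c.example/"]))  # pages linking to c.example
```

The inversion is cheap; obtaining the `outgoing` map for the whole web is the expensive part.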

## Publicly available crawl data

An interesting move in this direction is CommonCrawl, a nonprofit. Every couple of months they send a crawler out into the web, download a whole bunch of web pages (about 2.8 billion pages, at latest count), and store the result as a publicly available data set in S3. The data is in WARC format, and you can do whatever you want with it. (If you want to play with the data, I wrote an implementation of WARC for use in Hadoop.)
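To give a feel for what the data looks like, here is a toy reader for uncompressed WARC records, using only the Python standard library. The record below is made up, and real tooling (such as the warcio library) also handles gzip compression, WARC versions, and many edge cases this sketch ignores:

```python
import io

def read_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if line.strip() != b"WARC/1.0":
            continue  # skip blank separator lines between records
        headers = {}
        while True:
            line = stream.readline().strip()
            if not line:
                break  # blank line ends the header block
            name, _, value = line.partition(b":")
            headers[name.strip().decode()] = value.strip().decode()
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload

# A hypothetical single-record WARC file, purely for illustration.
raw = (b"WARC/1.0\r\n"
       b"WARC-Type: response\r\n"
       b"WARC-Target-URI: http://example.com/\r\n"
       b"Content-Length: 20\r\n"
       b"\r\n"
       b"<html>hello</html>\r\n"
       b"\r\n\r\n")

for headers, payload in read_warc_records(io.BytesIO(raw)):
    print(headers["WARC-Target-URI"], len(payload))
```

Each record is essentially an HTTP-style header block plus a raw payload, which is why the format maps so naturally onto record-at-a-time processing in Hadoop.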

As an experiment, I wrote a simple MapReduce job that processed the entire CommonCrawl data set. It cost me about $100 in EC2 instance time to process all 2.8 billion pages (a bit of optimization would probably bring that down). Crunching through such quantities of data isn’t free, but it’s surprisingly affordable.

The CommonCrawl data set is about 35 TB in size (unparsed, compressed HTML). That’s a lot, but Google says they crawl 60 trillion distinct pages, and the index is reported as being over 100 PB, so it’s safe to assume that the CommonCrawl data set represents only a small fraction of the web.

Relative sizes of CommonCrawl, Internet Archive, and Google (sources: 1, 2, 3).

CommonCrawl is a good start. But what would it take to create a publicly available crawl of the entire web? Is it just a matter of getting some generous donations to finance CommonCrawl? Even then, as long as all the data is on S3, you have to either use EC2 to process it or pay Amazon for the bandwidth to download it. A long-term solution would have to be less AWS-centric. I don’t know for sure, but my gut instinct is that a full web crawl would best be undertaken as a decentralized effort, with many organizations donating some bandwidth, storage, and computing resources toward a shared goal. (Perhaps this is what Faroo and YaCy are doing, but I’m not familiar with the details of their systems.)

## An architectural sketch

Here are some rough ideas on how a decentralized web crawl project could look. The participants in the crawl can communicate peer-to-peer, using something like BitTorrent. A distributed hash table can be used to assign a portion of the URL space to each participant. That means each URL is assigned to one or more participants, and that participant is in charge of fetching the URL, storing the response, and parsing any links that appear in the page. Every URL that is found in a link is sent to the crawl participant to whom that URL is assigned.
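The partitioning idea can be sketched in a few lines. The participant names here are hypothetical, and simple modulo hashing is used only to illustrate the principle; a real deployment would use consistent hashing or a proper DHT so that a participant joining or leaving reshuffles as few URLs as possible:

```python
import hashlib

# Hypothetical crawl participants, agreed upon by all peers.
participants = ["node-a", "node-b", "node-c"]

def assigned_participant(url: str) -> str:
    """Map a URL to a participant by hashing it into the URL space.

    Every peer computes the same deterministic assignment, so all
    peers agree on who is responsible for fetching a given URL
    without any central coordinator.
    """
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(participants)
    return participants[index]

for url in ["http://example.com/", "http://example.org/page"]:
    print(url, "->", assigned_participant(url))
```

When a crawler extracts a link, it hashes the URL, looks up the responsible participant, and forwards the URL to that peer, which is the message-passing step described above.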
The recipient can ignore that message if it has already fetched that URL recently. The system will need to ensure it is well-behaved as a whole (obey robots.txt, stay within rate limits, de-duplicate URLs that return the same content, etc.). This will require some coordination between crawl participants. However, even if the crawl were done by a single organization, it would have to be distributed across multiple nodes, probably using asynchronous message passing for loose coordination. The same principles apply if the crawl nodes are distributed across several participants; it just means the message-passing happens across the Internet rather than within one organization’s data center.

A hypothetical architecture for distributed web crawling. Each participant crawls and stores a portion of the web.

There remain many questions. What if your crawler downloads some content that is illegal in your country? How do you keep crawlers honest (ensuring they don’t manipulate the crawl results to their own advantage)? How is the load balanced across participants with different amounts of resources to offer? Is it necessary to enforce some kind of reciprocity (you can only use crawl data if you also contribute data), or to have a payment model (bitcoin?) to create an incentive for people to run crawlers? How can index creation be distributed across participants?

(As an aside, I think Samza’s model of stream computation would be a good starting point for implementing a scalable distributed crawler. I’d love to see someone implement a proof of concept.)

## Motivations for contributing

Why would different organizations, many of them probably competitors, collaborate on creating a public domain crawl data set? Well, there is precedent for this, namely in open source software. Simplifying for the sake of brevity, there are a few reasons why this model works well:

• Cost: Creating and maintaining a large software project (e.g. Hadoop, the Linux kernel, a database system) is very expensive, and only a small number of very large companies can afford to run a project of that size by themselves. As a mid-size company, you have to either buy an off-the-shelf product from a vendor or collaborate with other organizations in creating an open solution.

• Competitive advantage: With infrastructure software (databases, operating systems) there is little competitive advantage in keeping a project proprietary, because competitive differentiation happens at the higher levels (closer to the user interface). On the other hand, by making it open, everybody benefits from better software infrastructure. This makes open source a very attractive option for infrastructure-level software.

• Public relations: Companies want to be seen as doing good, and contributing to open source is seen as such. Many engineers also want to work on open source, perhaps for idealistic reasons, or because it makes their skills and accomplishments publicly visible and recognized, including to prospective future employers.

I would argue that the same arguments apply to the creation of an open data set, not only to the creation of open source software. If we believe that there is enough value in having publicly accessible crawl data, it looks like it could be done.

## Perhaps we can make it happen

What I’ve described is pie in the sky right now (although CommonCrawl is totally real). Collaboratively created data sets such as Wikipedia and OpenStreetMap are an amazing resource and accomplishment. At first, people thought the creators of these projects were crazy, but the projects turned out to work very well. We can safely say they have made a positive impact on the world, by summarizing a certain subset of human knowledge and making it freely accessible to all.
I don’t know if freely available web crawl data would be similarly valuable, because it’s hard to imagine all the possible applications, which only arise when you actually have the data and start exploring it. However, there must be interesting things you can do if you have access to the collective outpourings of humanity. How about augmenting human intellect, which we’ve talked about for so long? Can we use this data to create fairer societies, better mutual understanding, better education, and other such good things?

• Charlie Hull

This would be a great resource, I agree, but it’s not quite as simple as that, of course. Back in 1999 my Flax co-founder and I worked on Webtop, which was a half-billion-page web search engine with probabilistic capabilities. Crawling is *hard*: there are many, many nasty places for your crawler to end up, some created specifically to trap crawlers in endless painful loops of synthetic domains, spam sites, and content you wouldn’t want your kids to see (in fact you might want to un-see it yourself). There’s thus a lot of human-powered crawler care and feeding to be done, and I’m not sure how much of this would fit the open-source model. CommonCrawl is pretty fantastic though!

• Lindsay

What about adding this at the server level? Each website gets an inbound.txt showing sites that a) link to it and b) send traffic?

• Govind

Or Google could just give all developers unlimited API access to their index. Anything useful developed could be sold on a “search marketplace”.

• Joe

Have you tried Zillabyte?
They provide API access to their web crawls and the option to make your code public.

• Hey Martin,

Great article! We’re excited there’s so much interest in the web as a dataset. I’m part of the team at Common Crawl and thought I’d clarify some points in the article. The most important is that you can download all the data that Common Crawl provides completely for free, without the need to pay S3 transfer fees or to process it only in an EC2 cluster. You don’t even need to have an Amazon account! Our crawl archive blog posts give full details for downloading[1]. The main challenge then is storing it, as the full dataset is really quite large, but a number of universities have pulled down a significant portion onto their local clusters.

Also, we’re performing the crawl once a month now. The monthly crawl archives are between 35 and 70 terabytes compressed. As such, we’ve actually crawled and stored over a quarter petabyte compressed, or 1.3 petabytes uncompressed, so far in 2014. (The archives go back to 2008.)

Comparing directly against the Internet Archive datasets is a bit like comparing apples to oranges. They store images and other types of binary content as well, whilst Common Crawl aims primarily for HTML, which compresses better. Also, the numbers used for the Internet Archive were for all of the crawls they’ve done, whereas in our case the numbers were for a single month’s crawl.

We’re excited to see you use a crawl archive for experiments! I can confirm that optimizations will help you lower that EC2 figure. We can process a fairly intensive MR job over a standard crawl archive in an afternoon for about $30. Big data on a small budget is a top priority for us!

• The problem with algorithmic/scraper search methods is that they only work with existing data. For example, most Google searches give a list of websites on one side, and some data scraped from Wikipedia on the other. There is not much meaning there. That’s because Google’s algorithm cannot combine the results into something original, because that would require human creativity. As such, I see the rise of different kinds of search based on what humans create, rather than what computers can scrape. I wrote a (longish) blog post on this problem: http://newslines.org/blog/googles-black-hole/

• >> We’re allowed to make individual search queries, but we don’t have bulk access to the data.

Good point. From my experience most search engine API queries are used for some kind of data mining or analytics, not only simple search. Unfortunately index queries are inherently limited, because the index is pre-calculated and optimized for a specific use case (not only because it is hidden in data centers). There is demand to remove this limitation, but this comes at the price of huge time, computing and storing capacity requirements:

raw crawl data (unlimited analysis options, but slow with huge computing and
storing capacity requirement) vs. index queries (limited set of query
parameters, but very fast without computing and storing capacity requirements for API users)

>> However, you can’t even start solving the signal-to-noise problem until you
have the raw data. Having crawled the web is step one.

You may decide which links to follow already during crawling, based on the reputation of already-crawled pages/sites. This allows you to sort out spam early and dramatically save crawling time and resources (think of a signal-to-noise ratio of 1%).

>> Google says they crawl 60 trillion distinct pages

There is probably a huge difference between the number of crawled and indexed pages due to the signal-to-noise ratio. There has been research based on Zipf’s Law that suggests an index size below 100 billion pages. This might be important for realistic resource planning.

>> A distributed hash table can be used to assign a portion of the URL space to a participant.

It might be better to do the partitioning based on the hash of the domain part of the URL. This requires less communication (an entire domain is crawled autonomously by a single peer) and makes it easier to implement the crawler’s politeness policy.

• Mark Watson

Thanks for writing the article, an interesting read! I would like to see more developers and companies using the Common Crawl data (most of which is also on Microsoft Azure, and I think copied on some academic sites). Your cost estimate of \$100 for scanning all of the data on AWS might be a little high, BTW.

Personally, I would like to see more resources and effort put behind Common Crawl to crawl more frequently than every month and obviously collect as much of the web as is possible.

• Sumeet Singh

“The system will need to ensure it is well-behaved as a whole (obey robots.txt, stay within rate limits, de-duplicate URLs that return the same content, etc.).”

“How do you keep crawlers honest (ensuring they don’t manipulate the crawl results to their own advantage)?”

Those are some really big issues and you seem to just gloss over them. Ensuring polite behavior and quality would be nearly impossible with the system you propose. A centralized system like CommonCrawl makes much more sense. If your only objection to the CommonCrawl approach is that it is AWS-centric, why not just work to get CommonCrawl on other clouds?