Stuffing Six Million Pages Down Google's Throat

I got two fascinating emails from Jason Hunter over the weekend, both concerning MarkMail, the open source mailing list search engine created by Ryan Grimm and Jason over at MarkLogic. I thought I’d share them, with Jason’s permission.

The first was a fairly prosaic announcement:

In the last few weeks we loaded the PHP and PEAR mailing lists, a sum total of about 700,000 new messages. Contained within the new load is the php-general list, now statistically our largest list at 266,000 messages, passing by the old king tomcat-users with its 225,000 messages. Third place now goes to the main MySQL list.

It’s great to know that you can now search these lists with the amazing MarkMail tools, which I wrote about recently on Radar. But what really caught my attention was Jason’s next message, a bit of backchannel conversation that illuminates just how poorly the big search engines index small sites with large collections of data:

I thought you might find this interesting. One of the challenges we face with MarkMail is how to get Google to crawl all 6m (and eventually more) pages we have on our site. You often hear people talking about ways to increase their small site’s PageRank, but you don’t find many people talking about the challenge of stuffing Google full of (good) content.

Observations so far:

  • Google is way ahead of the other search engines. Based on doing site:markmail.org queries: Google has indexed 760k pages, Yahoo 19k, and MSN 4k. It’s not just the search algorithm that matters, it’s also the crawl algorithm, and we have a clear winner here.
  • Google started crawling right from the get-go. We peaked at 960k pages indexed about a week ago. The number goes up and down but the 1m line seems a tough one to crack.
  • It is definitely possible to crack 1m. My old email archive system on Servlets.com stands at 1.3m pages indexed. Gmane has 2.5m indexed.
  • This may be an impossible challenge. GMane has 2.5m messages indexed by Google but more than 50m messages archived.
  • We theorize that Google judges a site on various factors and crawls accordingly. Based on the fast crawl speed in our earliest days, we think “momentum” is a factor besides just raw PageRank.
  • We also theorize that faster page response times might help keep the crawl rate up. There are only 86,400 seconds in a day, so having 250k pages crawled per day means the Googlebot needs to hit your site roughly three times a second. We expect the bot slows down if a site seems sluggish.

Attached is a picture of the pages crawled each day for the last 50 days (we don’t have stats for the first few weeks). It’s widely variable. Notice the happy increase this last week. That might be due to your blog entry giving us some extra whuffie. Or maybe it’s due to the 30x speed optimization we made in response to your post to better handle the traffic. Or maybe it’s just random. I wish I knew. [links added by me]

[Figure: google-spidering.png, pages crawled per day over the last 50 days]

Jason’s last point seems particularly insightful. There are only so many seconds in a day, and the larger the number of pages on a site, the harder a crawler has to hit the site to index them all while keeping reasonably up-to-date copies. Small sites with lots of pages thus present an impedance mismatch for crawling. Obviously, with Google showing 58.4 million pages in response to “site:myspace.com” and 73.3 million in response to “site:flickr.com”, a high-performance site justifies a high-performance crawl, so the question is how Google decides how deep to go. (Interestingly, “site:facebook.com” shows only 906K pages, suggesting that a huge proportion of Facebook’s pages are still private.)
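
To make that arithmetic concrete, here’s a quick back-of-the-envelope sketch (Python, with illustrative page counts rather than actual crawl data) of the sustained request rate a crawler needs to keep an archive of a given size fresh:

    # Back-of-the-envelope crawl-rate arithmetic (illustrative numbers only).
    SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

    def required_requests_per_second(pages, refresh_days=1):
        """Sustained fetches per second needed to re-crawl `pages` every `refresh_days`."""
        return pages / (refresh_days * SECONDS_PER_DAY)

    for pages in (250_000, 1_000_000, 6_000_000):
        for refresh_days in (1, 30):
            rate = required_requests_per_second(pages, refresh_days)
            print(f"{pages:>9,} pages, refreshed every {refresh_days:>2} days: "
                  f"{rate:5.1f} requests/sec")

Jason’s 250,000 pages a day works out to roughly three fetches a second, and even a leisurely 30-day refresh of a 6-million-page archive implies more than two requests per second, sustained, which is why the incremental approaches discussed below matter.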

It does seem to me, though, that since most of the pages in mail archives are old pages that aren’t changing, smarter algorithms that recognize the nature of these sites (versus, say, dynamic collections of potentially updated pages like Flickr or MySpace) might figure out how to do a purely incremental crawl rather than re-crawling the same pages over again. After all, archive sites like Gmane and MarkMail have realized that they don’t need to keep re-crawling archived mail messages. Why can’t the big guys figure this out too, especially when someone else has done the work of collecting all the data?

Still, I’m reminded of a comment by Ben Bernanke reported in today’s New York Times profile, The Education of Ben Bernanke, that he is “a believer in the laws of mathematics.” I’ve become increasingly fascinated by the underlying math of Web 2.0 since reading Jeremy Liew’s post about the economics of online advertising last year. Limits are, of course, made to be broken. But it’s worth thinking about absolute (and temporary) limits to the growth of Web 2.0. What constraints do we take for granted? What constraints are invisible to us? Your thoughts welcome.

  • http://jasoncartwright.com Jason

    He should create some sitemaps (http://www.sitemaps.org), then submit using Google Webmaster Tools (http://www.google.com/webmasters/).

  • http://simoncast.blogspot.com Simon

    There is something about creating an index file that the large players can suck in rather than re-crawling what has already been indexed.

    There is also some sort of bundling that could be done so that the site preps the most recent update so that the crawler only needs to grab one file.

  • http://www.oshineye.com ade

    There is actually an established means for doing what Jason wants. Create a sitemap and submit it: https://www.google.com/webmasters/tools/docs/en/protocol.html

    This Creative Commons licensed technology is supported by Google, Yahoo and Microsoft. See: http://www.sitemaps.org/ for more information.

  • http://inevitablecorp.com Steve Mallett

    Don’t confuse crawled with indexed.

    Google may have crawled a given site with +1M pages, but simply chosen to ‘index’ only a handful as worthy of inclusion in the search index.

  • http://muellerware.org Patrick Mueller

    “It does seem to me, though, that since most of the pages in mail archives are old pages that aren’t changing, smarter algorithms that recognize the nature of these sites (versus, say, dynamic collections of potentially updated pages like Flickr or MySpace) might figure out how to do a purely incremental crawl rather than re-crawling the same pages over again.”

    Good news! The problem is already solved! It’s called cache validation in HTTP (ETag and/or Last-Modified).

    There is one thing that could be added to HTTP that might, in theory, help even more: the ability to mark a resource as immutable, which you could do for each and every mailing list or newsgroup post. But that’s just an additional optimization, assuming that sites serving up static content are already doing the right thing, which might not be the case.
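
    As a concrete illustration of that cache validation, here is a minimal sketch (a hypothetical Python client with made-up validator values, not anyone’s production crawler) of the conditional GET a crawler could issue for an already-fetched archive page; if the page is unchanged, the server answers with a small 304 instead of a full download:

    # Hypothetical re-crawl of an already-fetched archive page using HTTP validators.
    import http.client

    conn = http.client.HTTPConnection("markmail.org")
    conn.request("GET", "/message/wn76bs257navowou", headers={
        # Validators remembered from the previous crawl (values are invented).
        "If-Modified-Since": "Mon, 27 Feb 1995 00:00:00 GMT",
        "If-None-Match": '"abc123"',
    })
    resp = conn.getresponse()

    if resp.status == 304:
        # Not Modified: the cached copy is still good, nothing to re-index.
        print("unchanged, skip re-indexing")
    else:
        # Changed (or never seen before): fetch the body and re-index it.
        print("re-index", len(resp.read()), "bytes")
    conn.close()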

  • http://inevitablecorp.com Steve Mallett

    Also, site:mail-archive.com shows me 21M+ indexed pages. There’s a very good chance that anything mail-related is going to be given priority by Google there.

  • http://blog.persistent.info Mihai Parparita

    This problem (lots of high-quality pages, not changing very often) is what sitemaps are supposed to help with:

    https://www.google.com/webmasters/tools/docs/en/about.html

  • http://markmail.org Jason Hunter

    To give search engines an assist in finding all our messages, one thing we do at MarkMail is use sitemap files. The sitemap file format is pretty rudimentary but does include a lastmod element. If a search engine wants to trust our timestamp it won’t even have to bother with the conditional get requests.

    We timestamp each message with the date of its authorship, after which it won’t change (modulo our output formatting improvements). For example:

    <url>
    <loc>http://markmail.org/message/wn76bs257navowou</loc>
    <lastmod>1995-02-27</lastmod>
    </url>
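
    Since the sitemap protocol caps each file at 50,000 URLs, an archive with millions of messages has to shard its URLs across many sitemap files and tie them together with a sitemap index. A hypothetical Python sketch of that sharding (illustrative only, not MarkMail’s production code) might look like this:

    # Hypothetical sharding of message URLs into 50,000-URL sitemap files
    # plus a sitemap index, per the sitemaps.org protocol limits.
    from xml.sax.saxutils import escape

    URLS_PER_FILE = 50_000  # protocol maximum per sitemap file

    def write_sitemaps(messages, base="http://markmail.org"):
        """messages: iterable of (message_id, lastmod) pairs, e.g. ("wn76bs257navowou", "1995-02-27")."""
        messages = list(messages)
        shards = [messages[i:i + URLS_PER_FILE]
                  for i in range(0, len(messages), URLS_PER_FILE)]
        for n, shard in enumerate(shards):
            with open("sitemap-%04d.xml" % n, "w") as f:
                f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
                f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
                for msg_id, lastmod in shard:
                    f.write("  <url>\n")
                    f.write("    <loc>%s/message/%s</loc>\n" % (base, escape(msg_id)))
                    f.write("    <lastmod>%s</lastmod>\n" % lastmod)
                    f.write("  </url>\n")
                f.write("</urlset>\n")
        # One index file points the crawler at every shard.
        with open("sitemap-index.xml", "w") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
            for n in range(len(shards)):
                f.write("  <sitemap><loc>%s/sitemap-%04d.xml</loc></sitemap>\n" % (base, n))
            f.write("</sitemapindex>\n")

    Six million messages works out to roughly 120 sitemap files, all discoverable through the single index file a crawler can be pointed at (for example from robots.txt).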
    
  • http://dasht-exp-1a.com Thomas Lord

    “it’s worth thinking about absolute (and temporary) limits to the growth of Web 2.0. What constraints do we take for granted? What constraints are invisible to us? Your thoughts welcome.”

    1. It would be a mistake to conflate “Web 2.0” with “advertising-supported media business”.

    2. I hypothesize the existence of some numbers and trends about users and ads:

    2a. Maximum audience sizes (number of people and length of exposures) will peak and contract in not very many years. The age of mass spectacle is drawing to a close. This includes Google search.

    2b. The number of ad exposures to which each user is subjected (from any source) will continue to grow until well past the point of diminishing returns for advertisers. This will diminish the value of all ads as users become more jaded and resentful.

    2c. Consumer assistance (e.g., like Geico’s “show the prices of our competitors” and also like “Consumer Reports”) will trend towards greater importance.

    3. We aren’t really going back to the days of cloistered developers around locked-up mainframes. The novelty of centralized Web 2.0 properties will wear off and, at the same time, needing to find markets for the new commodity clusters, users will be led in the direction of (and find much to like about) “personal cloud computing”. (Disclaimer / bias: http://basiscraft.com/notes/personal.html )

    4. Backlash against treating users as free labor in a system of immaterial production (of social maps for advertisers and others) will open the door to “next big thing” applications that actually may cost users some money: but that will pay for themselves in household and career efficiencies without sacrificing things like privacy.

    5. The core idea of Web 2.0, that distributed effort can produce databases of surprising value, will become more formalized and studied. It will leave the field of entrepreneurial possibility and become part of the field of systematic, repeatable technique. However, very large examples like Wikipedia or GNU/Linux will be unreproducible, for the most part because of increased competition for audience/participants.

    -t

  • Hashim Warren

    I have used Google Base to get my pages indexed

  • http://paulbeard.org/wordpress paul

    This has been solved, of course. InfoSeek used a sitelist file, listing the updated pages, sizes, and modtimes, and Google has it’s sitemap. I assume this has in use?

  • http://paulbeard.org/wordpress paul

    s/This has/This is/g

  • http://www.darkcoding.net Graham King

    MarkMail haven’t created any content – what they have is a different interface on content that is already indexed by Google. Could it be that search engines notice the duplication?

  • Yuval R

    What about rotating a ‘featured category’ or ‘featured timeframe’ element on the homepage, allowing crawlers to access different parts of the DB?

  • http://asbjorn.ulsberg.no/ Asbjørn Ulsberg

    I see Last-Modified and ETag are already mentioned. Good. They help a lot. Another thing that might help is setting the “Expires” header. If you’re sure the resource won’t change within a year, set it to one year from “now” (whenever that may be).
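
    For example, here is a minimal sketch (a hypothetical Python helper, not tied to any particular framework) of response headers for an archive page that will never change, combining validators for conditional GETs with an Expires date a year out:

    # Hypothetical headers for an immutable archive page: validators for
    # conditional GETs plus an Expires date one year in the future.
    from datetime import datetime, timedelta, timezone

    HTTP_DATE = "%a, %d %b %Y %H:%M:%S GMT"

    def archive_page_headers(message_date, etag):
        now = datetime.now(timezone.utc)
        return {
            "Last-Modified": message_date.strftime(HTTP_DATE),
            "ETag": etag,
            "Expires": (now + timedelta(days=365)).strftime(HTTP_DATE),
        }

    print(archive_page_headers(datetime(1995, 2, 27, tzinfo=timezone.utc), '"wn76bs257navowou"'))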

  • http://markmail.blogspot.com Ryan Grimm

    I’ll answer a few questions that have been posed in some recent comments:

    @paul “Google has it’s sitemap. I assume this is in use?”

    Yep, we are providing sitemap files for the spiders and have pointed them to it via the robots.txt file.

    @Graham “Could it be that search engines notice the duplication?”

    That’s an interesting thought, but a quick test clearly shows that Google doesn’t remove duplicate content.

    @Yuval “…allowing crawlers to access different parts of the DB?”

    We actually have a link at the bottom of every page called “Browse” that is intended to give crawlers access to every message in the database, and based on some log data, the Googlebot has found it.

    @Asbjørn “Another thing that might help is setting the “Expires” header.”

    We actually set the Expires header for every html page on the site. Our initial hope was that the search engines would use the lastmod date in our sitemap file to determine what needs to be crawled again. But setting the Expires header further into the future for some pages is a good idea as well. We’ll roll this change out shortly.

    Jason went into a little more depth on these topics as well on the MarkMail Blog.

  • Neal

    For those searching for it, the specific Bernanke quote from the New York Times article was:

    “believes in the laws of arithmetic”
    (i.e. s/mathematics/arithmetic/)

  • http://stevesouders.com Steve Souders

    Another example where fast-loading pages are important. It helps us remember to always keep performance in mind.

  • http://inevitablecorp.com Steve Mallett

    “Another example where fast-loading pages are important. It helps us remember to always keep performance in mind.”

    Squid (squid-cache.org) is the poor man’s private rack full of servers.

  • http://www.openxtra.co.uk/blog/ Jack @ Tech Teapot

    I would take Google commands like link: and site: with a pinch of salt. They don’t always return a true picture.

    Google Webmaster Console does tend to show a more accurate picture, though it is available only to the domain owner, not to anybody outside.

  • http://tauschen.blogg.de Jacob

    Interesting, I’m inclined to agree with Graham about it possibly being a duplicate content issue.

  • http://www.archicentral.com Mark

    site:markmail.org returns 885,000 results for me. They should give it some more time.

  • http://www.ccil.org/~cowan John Cowan

    As is well known, the counts returned by Google are a *very* rough approximation, and should not be used to assume anything about a site’s presence in the index. In addition, it’s true that dupes are discarded, though not perfectly, or Google searches would be drowning in splog pages. Disclaimer: I work for Google, but not on search, and I only know ordinary things about it. If I knew more, I couldn’t tell you.