Strata Week: How much of the web is archived?

Cataloging the web's attic, improving healthcare data collection, Twitter buys BackType

Here are a few of the data stories that caught my attention this week.

How much of the web is archived?

Researchers at Old Dominion University are trying to ascertain how much of the web has actually been archived and preserved in various databases. Scott Ainsworth, Ahmed Alsum, Hany SalahEldeen, Michele Weigle, and Michael Nelson have published a paper (PDF) with their analysis of the current state of archiving.

The researchers studied sample URIs from DMOZ, Delicious, Bit.ly, and search engine indexes in order to measure the number of archived copies available in various public web archives. According to their findings, between 35% and 90% of URIs have at least one archived copy. That’s a huge range, and it depends heavily on where the URIs come from: DMOZ URIs, for example, are archived at a far higher rate than Bit.ly links. That’s hardly surprising, of course, as DMOZ is a primary source for the Internet Archive’s efforts.
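The measure described above, the share of sampled URIs with at least one archived copy, can be sketched in a few lines of Python. This is only an illustration, not the researchers' code: the function name `archived_share`, the sample labels, and the copy counts are all made up for the example and are not figures from the paper.

```python
from typing import Dict, List


def archived_share(copy_counts: List[int]) -> float:
    """Return the fraction of sampled URIs with at least one archived copy."""
    if not copy_counts:
        raise ValueError("need at least one sampled URI")
    archived = sum(1 for n in copy_counts if n >= 1)
    return archived / len(copy_counts)


# Hypothetical data: each list holds the number of archived copies found
# for each URI in a sample. These numbers are invented for illustration.
samples: Dict[str, List[int]] = {
    "sample_a": [3, 0, 12, 1, 7],
    "sample_b": [0, 0, 2, 0, 1],
}

for name, counts in samples.items():
    print(f"{name}: {archived_share(counts):.0%} of URIs have at least one copy")
```

Computing the share separately for each source is what makes a spread like the one the researchers report visible; a single pooled number would hide it.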

More troubling, perhaps, is the poor representation of Bit.ly links in the archives. The researchers say that the reason for this isn’t entirely clear. Still, the question is worth asking: what are the implications as Bit.ly and other URL shorteners become more widely used?

In an article in The Chronicle, Alexis Rossi from the Internet Archive points out how this project helps to shed light on the web archiving process. “People are coming to the realization that if nobody saves the Internet, their work will just be gone.”


Better healthcare data collection

The Department of Health and Human Services announced last week that it would work to improve its collection of healthcare data, specifically the collection and reporting of race, ethnicity, sex, primary language, and disability status. The department also announced that it plans to collect health data about lesbian, gay, bisexual, and transgender populations.

Feministing says this is a big deal — and indeed, it does mark an important move to help uncover some of the disparities in healthcare that many in the LGBT community know exist:

There is a lack of data on LGBT folks, who we do know face disparities in health and access to health services. Without federal health data, it’s practically impossible to direct federal government resources to focus on health inequalities. Including sexual orientation in data collection will go a long way towards showing what LGB folks face. This data will make it possible to name and quantify real world problems, and to then direct government resources towards addressing them.

The Department of Health and Human Services positions the move for better data collection as part of its efforts to help address healthcare inequality. According to HHS Secretary Kathleen Sebelius: “The first step is to make sure we are asking the right questions. Sound data collection takes careful planning to ensure that accurate and actionable data is being recorded.”

Twitter acquires BackType for better data analytics

Analytics company BackType announced this week that it had been acquired by Twitter. BackType offers its customers the ability to track their social media impact across the web. Using BackType, you could get an RSS feed of all the comments posted across the blogosphere that were signed with a certain website’s URL, a powerful way to track people’s commentary and participation online.

With the acquisition, BackType says that it will bring its analytics platform to help develop “tools for Twitter’s publisher partners.” No doubt, these sorts of analytics are a key piece of Twitter’s value proposition. It’s becoming increasingly clear that the company is opting to bring these tools in-house, rather than relying on third-party vendors to supply the analytics through which people can gauge participation and interest in various pieces of content.

Got data news?

Feel free to email me.


  • Ian Crew

    The question I’d have is how much of the web is WORTH archiving. For example, I just had a discussion with my boss about a service we ran for a few years and recently shut down. My argument was that it’s old, no longer functioning, and potentially confusing to folks if they stumble across it in the future, so we should just remove it from the web. He felt that keeping it around as a record of past work was worth it.

    (The website is still up, so you can see how that conversation went!)

    I guess that this really comes down to the age-old curation question–what’s really worth keeping. It’s tough!

  • Justin

    What if you are a web designer by trade? You are contracted to design a page for a company. You may want to go through the archives to see which designs they’ve had in the past, so you know which designs they liked and which they decided to change.

    Further, with the archives (particularly collections like the Hurricane Katrina and September 11th collections), we can capture and replay culturally important events.