Strata Week: The mortality rate of URLs

Parsing link rot, visualizing Wikipedia edits, and deconstructing autocorrect

Here are some of the data stories that caught my eye this week.

Pinboard examines link rot

Pinboard’s founder Maciej Ceglowski has analyzed URLs bookmarked on the site in order to examine “link rot — the depressing phenomenon in which perfectly healthy URLs stop working just a few years after appearing online.” Ceglowski took a random sample of 300 URLs from every year between 1997 and 2011 in order to ascertain if the decay of URLs was linear or if, like plutonium, they tend to have a half life.

Almost half of the links from 1997 are dead, Ceglowski found. Roughly a quarter of the links from 2002 to 2006 are dead. And even 6% of links bookmarked in 2011 no longer resolve. The full results of his analysis are here.
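To make the half-life question concrete, here is a minimal sketch of how one could test it. It assumes pure exponential decay, which is the hypothesis under examination rather than something Ceglowski's data confirms. The function name and the two sample inputs are illustrative: the ages and survival fractions are rounded readings of the figures quoted above, not values from his raw data.

```python
import math

def implied_half_life(age_years, fraction_alive):
    """Half-life in years implied by one observation, ASSUMING exponential decay."""
    if not 0 < fraction_alive < 1:
        raise ValueError("fraction_alive must be strictly between 0 and 1")
    # fraction_alive = 0.5 ** (age / half_life)
    # => half_life = age * ln(0.5) / ln(fraction_alive)
    return age_years * math.log(0.5) / math.log(fraction_alive)

# Illustrative, rounded inputs (assumptions, not the published data set):
# links about 14 years old with about half alive, and links about 7 years
# old with about three quarters alive.
observations = [(14, 0.50), (7, 0.75)]
for age, alive in observations:
    half_life = implied_half_life(age, alive)
    print(f"age {age} years, {alive:.0%} alive: implied half-life {half_life:.1f} years")
```

If links really decayed with a single half-life, every cohort would imply about the same number. Widely different values across cohorts would point toward some other decay curve.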

Pinboard proportion of working links chart. Click here for full analysis.

Ceglowski does note that there are some problems assessing the mortality of links: some dead links actually redirect and dead domains often end up full of ads. He asks some interesting questions about his methodology — Is there a simple programmatic way to detect parked domains? What is the attrition rate for shortened links?
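The redirect problem can be illustrated with a small classifier. This is a sketch under stated assumptions, not Ceglowski's method: the function name, the three labels, and the rule "a different final host counts as a redirect" are all hypothetical choices, and the sketch does not attempt to detect parked domains, which remains the open question he raises.

```python
from urllib.parse import urlsplit

def classify_link(original_url, final_url, status_code):
    """Label one link check as 'alive', 'redirected', or 'dead'.

    original_url: the bookmarked URL
    final_url:    the URL reached after following any redirects
    status_code:  the final HTTP status, or None if the request failed
    """
    # No response at all, or a client/server error, counts as dead.
    if status_code is None or status_code >= 400:
        return "dead"
    # A successful response from a different host may be a legitimate move
    # or a parked domain full of ads; this sketch cannot tell them apart.
    if urlsplit(original_url).hostname != urlsplit(final_url).hostname:
        return "redirected"
    return "alive"

print(classify_link("http://example.com/a", "http://example.com/a", 200))
print(classify_link("http://example.com/a", "http://ads.example.net/", 200))
print(classify_link("http://example.com/a", "http://example.com/a", 404))
```

Keeping "redirected" as its own bucket avoids silently counting ad-filled landing pages as healthy links.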

He’s posted the raw data for others to analyze.

OSCON Data 2011, being held July 25-27 in Portland, Ore., is a gathering for developers who are hands-on, doing the systems work and evolving architectures and tools to manage data. (This event is co-located with OSCON.)

Save 20% on registration with the code OS11RAD

Where in the world is Wikipedia edited?

According to Wikipedia, there have been some 463 million edits to the site, or roughly 19 edits per page. Wikimedia’s data analyst Erik Zachte has unveiled a new visualization that shows exactly where in the world these edits are occurring on any given day for the various language editions of Wikipedia.

The visualization is interactive, and keyboard shortcuts let you navigate between different views and event markers. You can zoom into a particular area (with the + key), for example, or filter the edits by language (with the space bar). There are three types of visualizations available with this new tool: an animation of edits, a bubble map, and a heat map, all highlighting the 400,000+ edits that occur in a given day.

The tool reveals some interesting trends. Not surprisingly, the different language versions are more or less active depending on the time zone. It also demonstrates that most edits to the Chinese-language Wikipedia come from outside mainland China.

Zachte has written a blog post explaining how he created the visualization tool using HTML5 and JavaScript. He also addresses some of the measures he took to guard the privacy of Wikipedia authors, including adjusting the timestamps and rounding the latitude and longitude to a half degree.
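Rounding coordinates to a half degree is simple to sketch. The function below is a hypothetical illustration of the idea, not Zachte's actual code (his tool is written in JavaScript), and the sample coordinates are arbitrary.

```python
def round_to_half_degree(value):
    """Snap a latitude or longitude to the nearest 0.5 degree."""
    # Doubling, rounding to an integer, then halving yields multiples of 0.5.
    return round(value * 2) / 2

# Arbitrary sample point, used only to show the effect of the rounding.
lat, lon = 45.52, -122.68
print(round_to_half_degree(lat), round_to_half_degree(lon))
```

At this resolution every edit within the same half-degree cell maps to the same point, so an individual editor's position is blurred into a fairly large area.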

Analyzing iPhone autocorrect errors

Although meant to be a helpful feature, the iPhone autocorrect has generated plenty of laughs with its spelling and word suggestions.

Following this tweet by Andrew Parker:

My iPhone auto-corrected “Harvard” to “Garbage”. Well played Apple engineers.

Brendan O’Connor decided to take a closer look at these autocorrection errors:

I was wondering how this would happen, and then noticed that each character pair has 0 to 2 distance on the QWERTY keyboard. Perhaps their model is eager to allow QWERTY-local character substitutions.
>>> zip('harvard','garbage')
[('h', 'g'), ('a', 'a'), ('r', 'r'), ('v', 'b'), ('a', 'a'), ('r', 'g'), ('d', 'e')]

O’Connor wonders if it’s a problem with the corpus of the iOS language model or if that language model is under-penalizing the edit distance. Commenters on the post contend the problem is that the iOS language model is generic and not personalized. Moreover, the model doesn’t actually account for the last word typed, so it tends to make non-grammatical suggestions.
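O’Connor's keyboard-distance observation can be checked with a short sketch. The distance measure here is an assumption: it treats the three letter rows as a plain grid, ignores the physical stagger of the keys, and counts row steps plus column steps. O’Connor does not say which measure he used, and nothing here reflects how the iOS model actually works.

```python
# Letter rows of a QWERTY keyboard, treated as an unstaggered grid (an assumption).
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {ch: (r, c) for r, row in enumerate(ROWS) for c, ch in enumerate(row)}

def key_distance(a, b):
    """Row steps plus column steps between two letter keys."""
    (r1, c1), (r2, c2) = KEY_POS[a.lower()], KEY_POS[b.lower()]
    return abs(r1 - r2) + abs(c1 - c2)

def pair_distances(typed, suggested):
    """Per-position key distance for two words, compared letter by letter."""
    return [key_distance(a, b) for a, b in zip(typed, suggested)]

print(pair_distances("harvard", "garbage"))
```

Under this grid measure no letter pair in "harvard" and "garbage" is more than 2 apart, which is consistent with the 0 to 2 range O’Connor describes.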

Got data news?

Feel free to email me.

