Personal data stores and pub/sub networks

Social streams may eclipse RSS, but the blogosphere's roots run deeper.

The elmcity project joins five streams of data about public calendar events. Four of them are well-known services: Facebook, EventBrite, Upcoming, and Eventful. They all work the same way. You sign up for a service and post your events there, and other people go there to find out about your events. What they find when they go there are copies of your event data. If you want to promote an event in more than one place, you have to push a copy to each place. If you change the time or day of an event, you have to revisit all those places and push new copies to each.

The fifth stream works differently. It’s a loosely-coupled network of publishers and subscribers. To join it you post events once to your own website, blog, or online calendar, in a way that yields two complementary outputs. For people, you offer HTML files that can be read and printed. For mechanized web services like elmcity, you offer iCalendar feeds that can be aggregated and syndicated. If you want to promote an event in more than one place, you ask other services to subscribe to your feed. If you change the time or day of the event, every subscriber sees the change.
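To make the publish-once pattern concrete, here is a minimal sketch in Python. It is not how elmcity itself is implemented. The event, its UID, the PRODID string, and the helper names `to_ical` and `parse_events` are all hypothetical, and the parser ignores iCalendar details such as line folding and property parameters.

```python
from datetime import datetime, timezone

def to_ical(events, prodid="-//example//hypothetical publisher//EN"):
    """Render {uid: (summary, start)} as a minimal iCalendar feed."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    lines = ["BEGIN:VCALENDAR", "VERSION:2.0", f"PRODID:{prodid}"]
    for uid, (summary, start) in events.items():
        lines += [
            "BEGIN:VEVENT",
            f"UID:{uid}",
            f"DTSTAMP:{stamp}",
            f"DTSTART:{start:%Y%m%dT%H%M%S}",
            f"SUMMARY:{summary}",
            "END:VEVENT",
        ]
    lines.append("END:VCALENDAR")
    return "\r\n".join(lines) + "\r\n"

def parse_events(ical_text):
    """Return {uid: (summary, dtstart)}; ignores line folding and parameters."""
    events, current = {}, None
    for line in ical_text.splitlines():
        if line == "BEGIN:VEVENT":
            current = {}
        elif line == "END:VEVENT" and current is not None:
            events[current["UID"]] = (current["SUMMARY"], current["DTSTART"])
            current = None
        elif current is not None and ":" in line:
            name, value = line.split(":", 1)
            current[name] = value
    return events

# The publisher's single canonical copy (hypothetical event).
calendar = {"concert-1@example.org": ("Fall concert", datetime(2010, 10, 1, 19, 0))}

# Two subscribers read the same feed.
hub_a = parse_events(to_ical(calendar))
hub_b = parse_events(to_ical(calendar))

# The publisher changes the day and time once, at the source...
calendar["concert-1@example.org"] = ("Fall concert", datetime(2010, 10, 2, 20, 0))

# ...and each subscriber picks up the change on its next fetch.
hub_a = parse_events(to_ical(calendar))
hub_b = parse_events(to_ical(calendar))
print(hub_a["concert-1@example.org"])
```

Neither hub holds a copy that the publisher has to go back and correct. Each one simply re-reads the feed.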

The first and best example of a decentralized pub/sub network is the blogosphere. My original blogging tool, Radio UserLand, embodied the pub/sub pattern. It made everything you wrote automatically available in two ways: as HTML for people to read, and as RSS for machines to process. What’s more, Radio UserLand didn’t just produce RSS feeds that other services could read and aggregate. It was itself an aggregator that pointed the way toward what became a vibrant ecosystem of applications — and services — that knew how to merge RSS streams. In that network the feeds we published flowed freely, and appeared in many contexts. But they always remained tethered to original sources that we stamped with our identities, hosted wherever we liked, and controlled ourselves. Every RSS feed that was published, no matter where it was published, contributed to a global pool of RSS feeds. Any aggregator could create a view of the blogosphere by merging a set of feeds, chosen from the global pool, based on subject, author, place, time, or combinations of these selectors.
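The idea of an aggregator creating a view by merging feeds chosen from a global pool can be sketched in a few lines of Python. This is an illustration only, not Radio UserLand's actual design. The `Entry` fields, the `merge_view` function, and the sample feeds are assumptions, and a real aggregator would first fetch and parse RSS.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Entry:
    published: datetime
    author: str
    subject: str
    title: str
    source: str  # URL of the feed the entry stays tethered to

def merge_view(feeds, *, author=None, subject=None, since=None):
    """Merge entries from several feeds, keep those matching every
    selector that was given, and return them newest first."""
    selected = (
        entry
        for feed in feeds
        for entry in feed
        if (author is None or entry.author == author)
        and (subject is None or entry.subject == subject)
        and (since is None or entry.published >= since)
    )
    return sorted(selected, key=lambda entry: entry.published, reverse=True)

# Two hypothetical feeds from the global pool.
feed_a = [
    Entry(datetime(2010, 9, 1), "alice", "calendars",
          "Publishing iCalendar", "http://a.example/rss"),
]
feed_b = [
    Entry(datetime(2010, 9, 3), "bob", "calendars",
          "Curating events", "http://b.example/rss"),
    Entry(datetime(2010, 9, 2), "bob", "cooking",
          "Bread", "http://b.example/rss"),
]

# Different views drawn from the same pool.
calendars_view = merge_view([feed_a, feed_b], subject="calendars")
bob_view = merge_view([feed_a, feed_b], author="bob")
```

Each entry keeps its `source`, so whatever context it appears in, it still points back to the feed its author controls.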

Now social streams have largely eclipsed RSS readers, and the feed reading service I’ve used for years, Bloglines, will soon go dark. Dave Winer thinks the RSS ecosystem could be rebooted, and argues for centralized subscription handling on the next turn of the crank. Of course, definitions tend to blur when we talk about centralized versus decentralized services. Consider FriendFeed. It’s centralized in the sense that a single provider offers the service. But it can be used to create many RSS hubs that merge many streams for many purposes. In “The power of informal contracts” I showed how an instance of FriendFeed merges a particular set of RSS feeds to create a news service just for elmcity curators. The elmcity service itself has the same kind of dual nature. A single provider offers the service. But many curators can use it to spin up many event hubs, each tuned to a location or topic.

The early blogosphere proved that we could create and share many views drawn from the same pool of feeds. That’s one of the bedrock principles that I hope we’ll remember and carry forward to other pub/sub networks. Another principle is that we ought to control and syndicate our data. Radio UserLand, for example, was happy to host your blog, just as Twitter and Facebook are now happy to host your online social presence. But unlike Twitter and Facebook, Radio UserLand was just as happy to let you push your data to another host. To play in the syndication network your feed just had to exist — it didn’t matter where — and be known to one or more hubs.

This notion of a cloud-based personal data store is only now starting to come into focus. When I was groping for a term to describe it back in 2007 I came up with “hosted lifebits.” More recently the Internet Identity Workshop gang have settled on “personal data store,” as described by Kaliya Hamlin and Phil Windley. The acronym is variously PDS or PDX, where X, as Kaliya says, stands for “store, service, locker, bank, broker, vault, etc.” Phil elaborates:

The term itself is a problem. When you say “store” or “locker” people assume that this is a place to put things (not surprisingly). While there will certainly be data stored in the PDS, that really misses its primary purposes: acting as a broker for all the data you’ve got stored all over the place, and managing the metadata about that data. That is, it is a single place, but a place of indirection not storage. The PDS is the place where services that need access to your data will come for permission, metadata, and location.

The elmcity service aligns with that vision. If we require the calendar data for a city, town, or neighborhood to live in a single place of storage, we’ll never agree to use the same place. Thus the elmcity service merges streams from Facebook, EventBrite, Upcoming, and Eventful. But those streams are fed by people who put copies of their events into them, one event at a time, once per stream. What if we managed our public calendar data canonically, in personal (or organizational) data stores fed from our own preferred calendar applications? These data stores would in turn feed downstream hubs like Facebook, EventBrite, Upcoming, and Eventful, all of which could (although they currently don’t) receive and transmit such feeds. Other hubs, based on instances of the elmcity service or a similar system, would enable curators to create particular geographic or topical views.
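Phil’s description of the PDS as a place of indirection, where services come for permission, metadata, and location, might be sketched like this in Python. It is a toy model under my own assumptions, not a design from the Internet Identity Workshop. The class names, the `public-calendar` kind, the service names, and the URL are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    location: str                              # where the data actually lives
    media_type: str                            # metadata about the data
    allowed: set = field(default_factory=set)  # services granted access

class PersonalDataStore:
    """Brokers access to data that is stored elsewhere."""

    def __init__(self):
        self._records = {}

    def register(self, kind, location, media_type):
        self._records[kind] = Record(location, media_type)

    def grant(self, kind, service):
        self._records[kind].allowed.add(service)

    def resolve(self, kind, service):
        """Return (location, media_type) if the service has permission."""
        record = self._records[kind]
        if service not in record.allowed:
            raise PermissionError(f"{service} may not read {kind}")
        return record.location, record.media_type

pds = PersonalDataStore()
# The calendar itself stays in the owner's preferred calendar application.
pds.register("public-calendar", "https://calendar.example.org/me.ics", "text/calendar")
pds.grant("public-calendar", "elmcity-hub")

location, media_type = pds.resolve("public-calendar", "elmcity-hub")
```

The store holds no events at all. A hub that asks and is permitted gets a pointer to the canonical feed, and a hub that has not been granted access gets a `PermissionError`.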

I’ve identified a handful of common calendar applications that can publish calendar data at URLs accessible to any such hub, in a format (iCalendar) that enables automated processing. The short list includes Google Calendar, Outlook, Apple iCal, and Windows Live Calendar. But there are many others. Here’s the full list of producers as captured so far by the elmcity service:

feed producer                                                     # of feeds
-//Google Inc//Google Calendar 70.9054//EN                        151
-//Meetup Inc//RemoteApi//EN                                      14
unknown                                                           14
iCalendar-Ruby                                                    6
e-vanced event management system                                  6
-//DDay.iCal//NONSGML                                             5
-// Limited Event Feeds//NONSGML//EN                              4
-//                                                               3
-//CollegeNET Inc//NONSGML R25//EN                                3
-//Drupal iCal API//EN                                            3
-//Microsoft Corporation//Windows Live Calendar//EN               3
-//Trumba Corporation//Trumba Calendar Services 0.11.6830//EN     2
-//herald-dispatch/calendar//NONSGML v1.0//EN                     1
-//WebCalendar-v1.1.2                                             1
Zvents Ical                                                       1
Coldfusion8                                                       1
-//Intand Corporation//Tandem for Schools//EN                     1
-//strange bird labs//Drupal iCal API//EN                         1
-//SchoolCenter/NONSGML Calendar v9.0//EN                         1
-//blogTO//NONSGML Toronto Events V1.0//EN                        1
-//Events at Stanford//iCal4j 1.0//EN                             1
-//University of California\, Berkeley//UCB Events Calendar//EN   1
-//EVDB//                                                         1
-//mySportSite Inc.//mySportSite//EN                              1
Mobile Geographics Tides 3988 2010                                1
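A tally like the one above can be produced by reading the PRODID property that iCalendar feeds use to identify the software that produced them. Here is a rough Python sketch, not the elmcity service's actual code. The function names and sample feeds are hypothetical, and counting a feed with no PRODID as "unknown" is my assumption about how that row arises.

```python
from collections import Counter

def producer_of(ical_text):
    """Return the PRODID value of an iCalendar feed, or 'unknown' if absent.
    Does not handle folded (continued) lines."""
    for line in ical_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.split(";")[0].upper() == "PRODID":
            return value.strip() or "unknown"
    return "unknown"

def tally_producers(feeds):
    """Count feeds per producer, most common first."""
    return Counter(producer_of(text) for text in feeds).most_common()

# Three hypothetical feeds, reduced to their headers.
sample_feeds = [
    "BEGIN:VCALENDAR\r\nPRODID:-//Google Inc//Google Calendar 70.9054//EN\r\nEND:VCALENDAR\r\n",
    "BEGIN:VCALENDAR\r\nPRODID:-//Google Inc//Google Calendar 70.9054//EN\r\nEND:VCALENDAR\r\n",
    "BEGIN:VCALENDAR\r\nVERSION:2.0\r\nEND:VCALENDAR\r\n",
]

for producer, count in tally_producers(sample_feeds):
    print(producer, count)
```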

Google Calendar dominates overwhelmingly, but the long tail hints at the variety of event sources that could feed into a calendar-oriented pub/sub network. How much of the total event flow comes by way of this assortment of iCalendar sources, as compared to centralized sources? Here’s the breakdown:

[Figure: contributions of the five event streams to each elmcity hub]

It’s roughly half Eventful, a third Upcoming, and a fifth iCalendar. There’s negligible flow from EventBrite, which focuses on big events. The same goes for Facebook, where the focus, though it’s evolving, remains on group rather than world visibility.

In a companion piece at O’Reilly Answers I show how I made this visualization. It’s a nice example of another kind of pub/sub network, in this case one that’s enabled by the OData protocol. For our purposes here, I just want to draw attention to the varying contributions made by the five streams to each of the hubs. The Eventful stream is strong almost everywhere. The Upcoming and iCalendar tributaries are only strong in some places. But where the iCalendar stream does flow powerfully, there’s a curator who has mined one or more rich veins of data from a school system, or a city government, or a newspaper. Today the vast majority of these organizations think of the calendar information they push as text for people to read. Few realize it is also data for networks to syndicate. When that mindset changes, a river of data will be unleashed.

