How will the elmcity service scale? Like the web!

During a recent talk at Harvard’s Berkman Center, Scott MacLeod asked (via the IRC backchannel): “How does the elmcity service scale?” He wondered, in particular, whether the service could support an online university like the World University and School that might produce an unlimited number of class schedules.

My short answer was that the elmcity service scales like the web. But what does that really mean? I promised Scott that I’d spell it out here.

We’ll start with an analogy. As I mentioned in “The power of informal contracts,” the elmcity project envisions a web of calendar feeds that’s analogous to the blogosphere’s web of RSS and Atom feeds. We take for granted that the blogosphere scales like the web. A blog feed is just a special kind of web page. Anybody can create a blog and publish its feed at some URL. Why not calendars too? We haven’t thought about them in the same way, but the ICS (iCalendar) files that our calendar programs export are the moral equivalents of the RSS and Atom feeds that our blog publishing tools export. Anybody can create a calendar and publish its feed at some URL.

These webs — of HTML pages, of blog feeds, of calendar feeds — are notionally webs of peers. We can all publish, and we can all read, without relying on a central authority or privileged hub. There are, to be sure, powerful centralized services. My blog, for example, is one of millions hosted at wordpress.com, aggregated by Bloglines and Google Reader, and indexed by Google and Bing. But these services, while convenient, are optional. So long as we can publish our blogs somewhere online, advertise their URLs, and get the DNS to resolve their domain names, we can have a working blogosphere. The necessary and sufficient condition is that we can all publish resources (e.g., pages and feeds), and that we can all access those resources.

For the calendarsphere that I envision, a service like elmcity is likewise optional. Let’s suppose that the World University and School succeeds wildly. At any given moment there are tens of thousands of courses on offer, each with its own course page and also with its own calendar. Instructors publish course pages using any web publishing tool, and also publish calendars using any calendar publishing tool — Google Calendar, or Outlook, or Apple iCal, or another calendar program. Students pick schedules of courses, bookmark the course pages, and load the course calendars into any of these same calendar programs. The calendar software merges the separate course calendars and combines them with the students’ personal calendars. These calendar programs are thus aggregators of calendar feeds in the same way that feedreaders like NetNewsWire or Google Reader are aggregators of blog feeds.
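To make the aggregator idea concrete, here is a minimal Python sketch of what a calendar program does when it merges several course calendars into one timeline. Everything in it is illustrative: the two sample feeds, the example.org UIDs, and the function names parse_events and merge_feeds are made up for this post. It also assumes simple, unfolded ICS text with UTC timestamps in a single format; a real aggregator would use a full iCalendar parser that handles line folding, time zones, and recurrence.

```python
# Hypothetical sample feeds, standing in for calendars published at two URLs.
COURSE_A = """BEGIN:VCALENDAR
VERSION:2.0
BEGIN:VEVENT
UID:algebra-1@example.org
DTSTART:20100607T140000Z
SUMMARY:Algebra lecture
END:VEVENT
BEGIN:VEVENT
UID:algebra-2@example.org
DTSTART:20100609T140000Z
SUMMARY:Algebra problem session
END:VEVENT
END:VCALENDAR"""

COURSE_B = """BEGIN:VCALENDAR
VERSION:2.0
BEGIN:VEVENT
UID:history-1@example.org
DTSTART:20100608T160000Z
SUMMARY:History seminar
END:VEVENT
END:VCALENDAR"""


def parse_events(ics_text):
    """Return one dict of properties per VEVENT in a simple ICS string."""
    events, current = [], None
    for raw in ics_text.splitlines():
        line = raw.strip()
        if line == "BEGIN:VEVENT":
            current = {}
        elif line == "END:VEVENT" and current is not None:
            events.append(current)
            current = None
        elif current is not None and ":" in line:
            name, value = line.split(":", 1)
            # Drop parameters such as ;TZID=... and keep the bare property name.
            current[name.split(";", 1)[0].upper()] = value
    return events


def merge_feeds(feeds):
    """Merge several ICS strings into one list of events ordered by start."""
    merged = {}
    for text in feeds:
        for event in parse_events(text):
            # Prefer the UID; fall back to (start, summary) when it is missing.
            key = event.get("UID") or (event.get("DTSTART"), event.get("SUMMARY"))
            merged.setdefault(key, event)
    # Assumption: all DTSTART values share one UTC format, so strings sort by time.
    return sorted(merged.values(), key=lambda event: event.get("DTSTART", ""))


if __name__ == "__main__":
    for event in merge_feeds([COURSE_A, COURSE_B]):
        print(event["DTSTART"], event["SUMMARY"])
```

Keying on UID means that a course calendar loaded twice contributes each event only once, which is roughly what we expect from any feed aggregator.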

Given a baseline web of peers, it’s useful to be able to merge our individual views of them into pooled spaces. NetNewsWire is a personal feedreader, but Google Reader is social. In the pool created by Google Reader, data finds data and people find people. The elmcity service aims to create that same kind of effect in the realm of public calendar events. When we pool our separate calendars, we publicize the events that we are promoting, we discover events that others are promoting, and we see all our public events on common timelines.

What constrains our ability to scale out pools of calendars? Let’s continue the analogy to the blogosphere. Google Reader constitutes one pooled space for blog feeds, Bloglines another. Because the data aggregated by these services conforms to open standards (i.e., RSS and Atom), other services can create blog pools too. Likewise in the calendarsphere, Google Calendar is one way to pool calendars, the elmcity service is another, Calagator is a third. Others can play too.

How can we scale these providers of calendar pools? Along one axis, each provider needs to be able to grow its computing power. Google Calendar scales on this axis by using Google’s cloud platform. The elmcity service uses Azure, the Microsoft cloud platform. Note that elmcity, unlike Google Calendar, is an open source service. That means you could run your own instance of it, using your own Azure account, but you’d still be relying on the Azure compute fabric.

Calagator, based on Ruby on Rails, could be deployed either to a conventional hosting environment or to a cloud platform. It would thus scale, along the compute axis, as either environment allows. The elmcity service could be used in this way too. The service is written for Azure, but the core aggregation engine is independent of Azure and could be deployed to a conventional hosting environment.

For feed aggregators, another axis of scale is the number of feeds that can be processed. When that number grows, the time required to connect to many feeds and ingest their contents becomes a constraint. The elmcity service currently supports 50 calendar hubs. Thrice daily, each hub pulls data from Eventful, Upcoming, Eventbrite, Facebook, and a list of iCalendar feeds. So far a single Azure worker role can easily do all this work. I’ll dial up the number of workers if needed, but first I want to squeeze as much parallelism as I can out of each worker. To that end, I recently upgraded to the 4.0 version of the .NET Framework in order to exploit its dramatically simplified parallel processing. In this week’s companion article I show how the elmcity service uses that new capability to optimize the time required to gather feeds from many sources.
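The elmcity service does this work in .NET, and the companion article covers the real code. For readers who want the general shape of the idea, here is a rough Python analogue that uses a thread pool from the standard library. It is a sketch under assumptions, not the service's implementation: gather_feeds and fetch_url are hypothetical names, and the worker count and timeout are arbitrary. Most of the time spent gathering feeds goes to waiting on the network, so overlapping the requests lets one worker get through a long list much faster than fetching each feed in turn.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch_url(url, timeout=10):
    """Download one feed and return its text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def gather_feeds(urls, fetch=fetch_url, max_workers=8):
    """Fetch many feeds concurrently.

    Returns (results, errors): results maps each URL to its content, and
    errors maps each failed URL to the exception raised while fetching it.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # One slow or broken feed should not sink the whole run.
                errors[url] = exc
    return results, errors
```

The fetch parameter is there so that the function can be exercised without a network, for example by passing a stub that returns canned ICS text. Collecting errors separately reflects a design choice: a hub that aggregates dozens of feeds should publish what it could gather rather than fail because one source was unreachable.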

Pub/sub networks can also scale by coalescing feeds. Consider a calendar hub operated, for some city, by the online arm of that city’s newspaper. One model is flat: the newspaper runs a hub whose registry lists all the calendar feeds in town. Another model is hierarchical. In that model, there’s a hub for arts and culture, a hub for sports and recreation, a hub for city government, and so on. Each hub gathers events from many feeds, and publishes the merged result on its own website for its own constituency. If the newspaper wants to include all those feeds, it can list them individually in its own registry. But why aggregate the arts, sports, or government feeds more than once? The newspaper’s uber-hub can, instead, reuse the merged outputs of those topical hubs, adding them to its own set of curated feeds. Each underlying feed is then fetched and parsed once, by the hub that curates it, which cuts the time and computation required to propagate events throughout the network.
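The difference between the flat and hierarchical models can be shown with a toy model in Python. This is only an illustration: the Hub class, the feed names, and the events are invented, and feeds are in-memory lists rather than ICS files fetched from URLs. In a real network, an upstream hub's merged output would itself be a published feed that the newspaper's hub subscribes to.

```python
class Hub:
    """Toy calendar hub: merges its own feeds plus the output of upstream hubs."""

    def __init__(self, name, feeds=None, hubs=None):
        self.name = name
        self.feeds = feeds or {}   # feed name -> list of (start, title) tuples
        self.hubs = hubs or []     # upstream hubs whose merged output is reused
        self.sources_read = 0      # sources this hub read on its last publish

    def publish(self):
        """Return this hub's merged, de-duplicated, time-ordered event list."""
        events = set()
        self.sources_read = 0
        for feed_events in self.feeds.values():
            self.sources_read += 1
            events.update(feed_events)
        for hub in self.hubs:
            # One read of an already merged feed, however many feeds are behind it.
            self.sources_read += 1
            events.update(hub.publish())
        return sorted(events)


arts = Hub("arts", feeds={
    "museum": [("2010-06-05T10:00", "Gallery opening")],
    "theater": [("2010-06-04T19:30", "Opening night")],
})
sports = Hub("sports", feeds={
    "league": [("2010-06-06T13:00", "Softball final")],
    "parks": [("2010-06-05T08:00", "Trail run")],
})
newsroom = {"newsroom": [("2010-06-07T18:00", "Candidate forum")]}

# Flat model: the newspaper lists every feed in town in its own registry.
flat = Hub("paper-flat", feeds={**arts.feeds, **sports.feeds, **newsroom})
# Hierarchical model: the newspaper reuses the topical hubs' merged outputs.
tiered = Hub("paper-tiered", feeds=newsroom, hubs=[arts, sports])

assert flat.publish() == tiered.publish()
print(flat.sources_read, tiered.sources_read)
```

Both versions of the newspaper's hub publish the same events, but the flat one reads five sources while the hierarchical one reads three. The gap widens as the topical hubs curate more feeds.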

None of these mechanisms will matter, though, until a vibrant ecosystem of calendar feeds requires them. That’s the ultimate constraint. Scaling the calendarsphere isn’t a problem yet, but it would be a good problem to have. First, though, we’ve got to light up a whole bunch of feeds.

The calendarsphere will be another collection of small pieces loosely joined.