Developing intuitions about data

Why we must consider the different properties and purposes of computer files.

In The laws of information chemistry I mentioned that my local high school uses a PDF file to publish the school's calendar of events. Let's look at some different ways to represent the calendar entries for Oct 6, 2010. First I'll divide these representations into two major categories: "What People See," and "What Computers See." Then I'll discuss how the various formats serve various purposes.

Category 1: What People See

Here's a piece of the PDF file for the week of Oct 4, 2010. And here's how the same entries might look in Google Calendar (or in any other calendar program).

Category 2: What Computers See

The PDF file describes fonts and layout in a highly structured way. But the calendar's data -- dates, times, descriptions -- only lives in free-form text. Computers use it to enable people to read or print that text.

10/6
-Junior class NECAP testing info. Meeting block 4 (aud.)

Fig. 2a: How the data in the PDF file looks to a computer

When your browser renders the calendar, it sees a mixture of HTML and JavaScript. Computers use that mixture to enable people to read, print, and also interact with the text.

<TR class="lv-row lv-newdate lv-firstevent lv-alt">
<TH class=lv-datecell rowSpan=5><A class=lv-datelink href="javascript:void(Vaa('20101006'))">Wed Oct 6</A></TH>
<TD class="lv-eventcell lv-status"> </TD>
<TD class="lv-eventcell lv-time"><SPAN class=lv-event-time onmousedown="Waa(event,'listview','YzFmYT...b2tAZw','20101006');return false;">All day</SPAN></TD>
<TD class="lv-eventcell lv-titlecell">
<DIV id=listviewzYzFmYT...b2tAZw20101006 class=lv-zippy onmousedown="Waa(event,'listview','YzFmYT...b2tAZw','20101006');return false;"></DIV>
<DIV class=lv-event-title-line><A style="COLOR: #1f753c" class=lv-event-title onmousedown="Waa(event,'listview','YzFmYT...b2tAZw','20101006');return false;" href="javascript:void(0)">-Junior class NECAP testing info. Meeting block 4 <SPAN dir=ltr>(aud.)</SPAN></A> </DIV>

Fig. 2b: How the HTML looks to a computer

A calendar application or service that knows how to use a standard format called iCalendar will receive a structured representation of the data. It relies on that structure to identify, recombine, and exchange the dates, times, and descriptions.
BEGIN:VCALENDAR
PRODID:-//Google Inc//Google Calendar 70.9054//EN
VERSION:2.0
BEGIN:VEVENT
DTSTART:20101006T113000Z
DTEND:20101006T190000Z
DTSTAMP:20101005T172506Z
UID:bccvmn5aooodokincjbgl8crc0@google.com
CREATED:20101005T161914Z
DESCRIPTION:
LOCATION:
SUMMARY:-Rain date for AP Env. Sci. trip to Monadnock 7:30 am-3 pm (Davenso
 n/Sintros)
END:VEVENT

Fig. 2c: How the iCalendar feed looks

If a proposed format called xCalendar is approved as a standard, and is widely adopted by calendar applications and services, then calendar applications or services might also use that format to identify, recombine, and exchange dates, times, and descriptions.

<icalendar xmlns="urn:ietf:params:xml:ns:icalendar-2.0">
  <vcalendar>
    <properties>
      <prodid> <text>-//Google Inc//Google Calendar 70.9054//EN</text> </prodid>
      <version> <text>2.0</text> </version>
    </properties>
    <components>
      <vevent>
        <properties>
          <dtstamp>20101005T172506Z</dtstamp>
          <dtstart>20101006T113000Z</dtstart>
          <dtend>20101006T190000Z</dtend>
          <uid> <text>bccvmn5aooodokincjbgl8crc0@google.com</text> </uid>
          <summary> <text>-Rain date for AP Env. Sci. trip to Monadnock 7:30 am-3 pm (Davenson/SintrosEvent #2</text> </summary>
        </properties>
      </vevent>
    </components>
  </vcalendar>
</icalendar>

Fig. 2d: How an xCalendar feed might look

Note that Fig. 2c (iCalendar) and Fig. 2d (xCalendar) look very different. The iCalendar format uses lines of plain text to represent name:value pairs. The xCalendar format uses a package of nested XML elements to represent the same data. Technical experts can, and do, endlessly debate the pros and cons of these different approaches. But for our purposes here, the key observations are:
It's also worth noting that this simple name:value technique, which has been the Internet calendar standard for over a decade, is broadly useful. Curators of elmcity calendar hubs, for example, follow a convention for representing name:value pairs as tags, attached to Delicious bookmarks, that have the form name=value. A similar convention enables any calendar event, made by any calendar program, to specify the URL for the event and the categories that it belongs to. In this week's companion article on answers.oreilly.com I show how to extract these name:value pairs from free text.

A taxonomy of representations and purposes

Let's chart these representations and arrange them according to purpose.
To most people, all four items in the What Computers See column are roughly equivalent. They're understood to be computer files of one sort or another. But when computers use these files on our behalf, they use them in very different ways. The first two uses enable people to read, print, and interact online. The latter two enable computers to exchange data without loss of fidelity, so that other people can read, print, and interact online.

The laws of information chemistry say that if we want to exchange data, we must provide it in a format that's useful for that purpose. In this example the PDF and HTML formats aren't; the iCalendar and xCalendar formats are.

To most people it's not obvious why that's so. Our brains are such powerful pattern recognizers, and we know so much about the world in which the patterns occur, that we can look at Fig. 2a and see that the text clearly implies a structure involving dates, times, titles, and descriptions. Computers can't do that so easily or so well.

Computers are, of course, getting smarter all the time. Google Calendar's Quick Add feature is a perfect example. I used it to create the example shown in Fig. 1b, and it did a great job of parsing out the times and titles of the events. But that was only possible because I inserted the events, one at a time, into a container that Google Calendar understood to represent Wed Oct 6. It wouldn't be able to import the free-form text that was the original source for the PDF file. No other calendar program could either.

The surprising difficulty of structured information

It's counter-intuitive that computers don't recognize structure easily or reliably. But so are many other things. For example:
You can live a long time without ever developing an intuition that the final amount is less. And you may be profoundly harmed because you lack that intuition. If you have it, you most likely didn't acquire it all by yourself. Either somebody taught it to you, or nobody did.

Although our sample PDF file contains no structured representation of the events that it exists to convey, it does contain some other structured data:
From this we learn that the calendar originates in Microsoft Word. Why Word instead of a calendar program? Available cloud-based applications include Google Calendar and Hotmail Calendar. On the Mac desktop where the document originated, there's Apple iCal. If one of these alternatives were even considered, a number of valid concerns would arise:
And if alternatives were considered, we could discuss those concerns:
Since alternatives are almost never considered, though, the ensuing discussion almost never happens. Why not? Key intuitions are missing. Some kinds of computer files have different properties than others, and thus serve different purposes. Structured representation of data is one such property. If we are trying to put data onto the web, and if we want others to have the use of that data, and if we hope it will flow reliably through networks to all the places where it's needed, then we ought to consider how the files we choose to publish do, or don't, respect that property.

Nobody is born knowing this stuff. We need to learn it. Schools aren't the only source of instruction. But they ought to teach core principles that govern the emerging web of people, data, and services. And they ought to cultivate intuitions about when, why, and how to apply those principles.
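
To make the property of structured representation concrete, here is a minimal Python sketch of how a program might read the name:value pairs in a feed like the one in Fig. 2c. It is an illustration under stated assumptions, not a complete iCalendar parser: the helper names (unfold, parse_events, parse_utc) are hypothetical, the sample is trimmed from Fig. 2c with an END:VCALENDAR line added to close it, and the sketch ignores escaping rules and quoted parameter values that a real parser must handle.

```python
from datetime import datetime, timezone

# Sample adapted from Fig. 2c (END:VCALENDAR added for completeness).
# The SUMMARY is folded: in iCalendar, a line that begins with a space
# or tab continues the previous line.
FEED = """BEGIN:VCALENDAR
PRODID:-//Google Inc//Google Calendar 70.9054//EN
VERSION:2.0
BEGIN:VEVENT
DTSTART:20101006T113000Z
DTEND:20101006T190000Z
UID:bccvmn5aooodokincjbgl8crc0@google.com
SUMMARY:-Rain date for AP Env. Sci. trip to Monadnock 7:30 am-3 pm (Davenso
 n/Sintros)
END:VEVENT
END:VCALENDAR
"""


def unfold(text):
    """Join folded continuation lines back onto the line they continue."""
    lines = []
    for raw in text.splitlines():
        if raw[:1] in (" ", "\t") and lines:
            lines[-1] += raw[1:]  # drop the single leading whitespace char
        elif raw:
            lines.append(raw)
    return lines


def parse_events(text):
    """Return one dict of name -> value for each VEVENT (hypothetical helper)."""
    events, current = [], None
    for line in unfold(text):
        # Simplification: split at the first colon; quoted parameter
        # values that contain colons are not handled here.
        name, _, value = line.partition(":")
        name = name.split(";")[0].upper()  # ignore parameters like ;TZID=...
        if name == "BEGIN" and value == "VEVENT":
            current = {}
        elif name == "END" and value == "VEVENT":
            events.append(current)
            current = None
        elif current is not None:
            current[name] = value
    return events


def parse_utc(stamp):
    """Convert a UTC stamp such as 20101006T113000Z to an aware datetime."""
    return datetime.strptime(stamp, "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)


events = parse_events(FEED)
for event in events:
    print(parse_utc(event["DTSTART"]).isoformat(), event["SUMMARY"])
```

A few dozen lines recover the start time and the title exactly, because the feed labels each one. Nothing comparable works on the free-form text in Fig. 2a, where a program would have to guess which characters are a date, which are a time, and which are a description.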
Comments: 7
Elisabeth Robson [ 7 October 2010 02:18 PM]
Jon -
Terrific article - thank you! We just started an online course on this topic to teach some of these skills that you outline here. You did a great job describing the issues.
Elisabeth
Jamie Thomson [ 8 October 2010 05:27 AM]
Jon,
Your constant door-beating of this issue is admirable but I'm starting to wonder whether or not these efficiencies will ever permeate from the microcosm of techies that understand the benefits today. I have been waiting years for the goodness of iCalendar to hit the tipping point but somehow it never seems any closer. Do you think we'll ever get there (where "there" is iCalendar being used by your local high school, church, drama club etc...)?
-Jamie Thomson
Jon Udell [ 8 October 2010 05:45 AM]
Not yet my local high school, or the two colleges. But my local city government is on board and now things like the dates for hazardous waste recycling, which used to be buried in a PDF, are on the city's site in a Google Calendar widget, and also flowing through the hub for Keene.
I hope it is becoming clear, though, that this is not just about iCalendar. It's about a set of principles that need to be part of everyone's basic mental toolkit. Jeannette Wing calls this computational thinking. I think we need to broaden and destigmatize that notion. This stuff can't remain locked up in the geek ghetto; it's too important.
I'm using iCalendar and the events domain as an extended case study, because everyone feels the pain. But from that case study I'm trying to tease out a set of principles that can guide an educational curriculum.
Jamie Thomson [ 8 October 2010 06:00 AM]
Hi Jon,
Yep, understood that a shift in people's thinking is what we're all after here and I most enjoyed your Interview with Innovators chat with Jeannette on that very subject.
I am canvassing my local authority to adopt similar principles in one of their current endeavours (http://jamiekt.wordpress.com/2010/09/08/an-open-letter-to-surrey-county-council-regarding-the-charlton-lane-eco-park/). No dice as yet.
Keep up the great work (and please bring back Interview with Innovators :))
-Jamie Thomson
Jeff [ 8 October 2010 07:45 AM]
Nice article, Jon.
This will likely be solved generationally, if you ask me. "Make a Word document!" isn't something most people under 25 even think of as a solution to documenting calendar-like information.
Jon Udell [ 8 October 2010 10:49 AM]
Thought experiment: You are an under-25, you play in a band, you want to tell the world about your schedule. Word? No, I agree, that won't be your choice. But it's instructive to consider what likely will be: MySpace.
If this problem were being solved generationally, it would not have been necessary to write an HTML screenscraper to turn MySpace band pages into iCalendar feeds:
http://github.com/judell/elmcity/blob/master/fusecal/ElmcityLib/myspace.py
Why doesn't MySpace offer iCalendar feeds? Because the generation we like to think of as being digitally native is not absorbing some core underlying principles.
Why not? Those principles have yet to be formulated and taught.
Frank [11 October 2010 06:17 AM]
Wow! Way too much over-thinking! This article is the epitome of verbosity where none is required. You've taken a simple topic and shown that you now know basic computing. You might as well have told us that paper can be used for writing or drawing, and then over-analyzed it with enough extrapolation of nuances that you've contorted it into a self-made science, and thus unduly impressed some elementary kids. This is not information chemistry, this is not computational thinking, this is child's play. By the way, PDFs are just fine to write and store information in. Any dime-a-dozen developer can extract the data from them, search against its contents, store it in databases, and further employ the information in any one of a hundred other ways.