
Questions (and Answers!) About the Federal Register

When the White House retweets Cory Doctorow, you know something unusual has happened. As many of you saw, the Office of the Federal Register announced that source code for the Federal Register is now available in bulk, for free, and has been converted to XML. Ed Felten’s shop at Princeton created a site called fedthread.org to see what you can do with the data, and Public.Resource.Org helped the Government Printing Office test early stages of the XML work.

All in all, a nice piece of public-private cooperation and an important step towards open sourcing America’s operating system, and I figured that was the end of that. So, imagine my surprise when I got a call from the White House saying they were making Raymond Mosley, Director of the Office of the Federal Register (OFR), and Michael L. Wash, Chief Information Officer of the Government Printing Office (GPO), available just in case there were any technical questions from the net.

I gathered questions from a variety of sources, including online discussion groups and Twitter, and have been going back and forth over email with both Ray and Mike. Hope this is useful (it certainly has been fun to do)!

Question: Ray, you announced XML for the Federal Register. Are you expecting to do the same for the other Official Journals? What other documents can we be expecting to show up in XML?

Ray Mosley: Yes, we are expecting to do the same for our other Federal Register journals. Our priority is to complete the 1994–1999 Federal Registers, which predate the SGML FR. Next comes the Daily Compilation of Presidential Documents, then the Code of Federal Regulations, and so on until all of our publications are available in XML.

Question: Mike, you went from SGML to XML to convert the Federal Register.  That’s a rather hairy task. Can you tell us how you did that and are there any problems you encountered? Is it all XML now, or should application developers be aware of some domain-specific processing they’ll need to do?

Mike Wash: Currently the Federal Register is composed in SGML. This has been the case since 2000. Converting or transforming this SGML to XML is non-trivial, as you indicated, but having documents in SGML makes the transformation easier than for those that are in locator codes.

The Federal Register composition is structured in a way to support printing the publication. As such, there are a number of formatting functions in the composition code that, from a pure XML perspective, are not needed. However, formatting is at times critical to properly interpreting the publication. Therefore, in our transformation process, we could not eliminate all the formatting elements in the source SGML. The result is an XML rendition that still contains some formatting elements. Developers can access the documentation for our XML via the resources link on the bulk download page of GPO’s Federal Digital System (FDsys).
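To give a flavor of what working with the bulk XML might look like, here is a minimal Python sketch that walks one issue’s file and pulls out rule headings. The element names (RULE, SUBJECT) and the file name are assumptions for illustration only; consult the schema documentation linked from the FDsys bulk download page for the real tag set.

```python
# Minimal sketch: list the subject heading of each final rule in one
# Federal Register XML issue. The element names (RULE, SUBJECT) and the
# file name are assumed for illustration -- check the XML documentation
# on the FDsys bulk download page before relying on them.
import xml.etree.ElementTree as ET

def list_subjects(path):
    """Yield the SUBJECT text of each RULE element in an issue file."""
    tree = ET.parse(path)
    for rule in tree.getroot().iter("RULE"):
        subject = rule.find(".//SUBJECT")
        if subject is not None and subject.text:
            yield subject.text.strip()
```

A developer would call `list_subjects("FR-2009-10-07.xml")` (or whatever the downloaded issue is named) and get back an iterator of headings, with any leftover print-formatting elements simply ignored.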

Question: Ray, how many people read the Federal Register on a regular basis? What is the breakdown between print and on-line readers? And do you have any idea what this will do to the total size of your readership: will more people be reading it now, do you think?

Ray Mosley: The printed Federal Register is distributed to about 6,000 recipients every day. Most of those are Federal agencies and libraries. We do not have a head count as to how many readers use the printed FR each day. We suspect it to be a large number considering the likely usage in libraries. As for online, during FY 2009, which ended September 30, 2009, over 200 million Federal Register documents were downloaded by the public. We do not have a count as to how many individual users that equates to, but it is obviously a large number. For the months of July and August 2009, over 30 million documents were downloaded each month, suggesting that the projected number for 2010 will exceed 300 million documents downloaded. So, we expect the XML FR to expand this usage as developers create new and easy-to-use ways to keep current with the FR.

Question: Mike, does this XML release change at all how the data is distributed, or should developers still count on downloading a “ZIP” file every day at the same time?

Mike Wash: This release adds another data distribution method. Now you can download a single issue in XML, or a full month or a full year. Signed PDFs remain an option as well as text. We plan to add data feeds in the near future (e.g., RSS) which will add another nice feature to the bulk download capability.
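For developers scripting those downloads, the pattern is simple enough to sketch in a few lines of Python. Note that the URL layout below is a made-up placeholder for illustration; the real links for single issues, months, and years are listed on FDsys’s bulk download pages.

```python
# Sketch of fetching one day's XML issue. The URL pattern is an assumed
# placeholder for illustration; use the actual links published on GPO's
# FDsys bulk download page.
import urllib.request

def issue_url(date_str):
    """Build an (assumed) bulk-data URL for one issue, e.g. '2009-10-07'."""
    return "https://www.gpo.gov/fdsys/bulkdata/FR/FR-%s.xml" % date_str

def fetch_issue(date_str, dest):
    """Download one issue's XML to a local file and return its path."""
    urllib.request.urlretrieve(issue_url(date_str), dest)
    return dest
```

A nightly cron job calling `fetch_issue` with the current date would keep a local mirror current until the planned RSS feeds arrive.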

Question: Ray, does it change at all how agencies submit their information to you?

Ray Mosley: No, this does not change how agencies will submit material to us. That will remain unchanged for the foreseeable future.

Question: Mike, is it correct that authoring is still in SGML and then a conversion gets done?  Two questions: do you expect the authoring to be done in native XML in the future?  And, when do you think the full back archive will be converted?

Mike Wash: In the future, we intend to develop tools that allow the Federal Register, and other publications, to be composed in XML and to utilize style sheets for presentation and print formatting. This updated composition process will eliminate the need to transform the Federal Register from another encoding scheme to XML, and allow us to offer a purer XML file for the bulk data. Our development process for the XML composition tool is just getting underway, and therefore we do not have a delivery date yet. This is referred to as the Composition System Replacement project.

Regarding the back archive of the Federal Register, there are two parts to this. First, there are the versions of the Federal Register currently available on FDsys, which date back to 1994. The Federal Register issues dated January 3, 1994 through January 18, 2000 are not coded in SGML. We are now in the process of transforming these to XML, and they should be available later this year. Versions prior to January 3, 1994 are not in an electronic format at this time and need to be converted to digital form. That is a larger effort that is still in the planning phase.

Question: Ray, are there particular applications for the Federal Register that you’ve always dreamed of seeing? (E.g., do you have any hints for application developers out there looking for something cool to do with this new data)?

Ray Mosley: I would love to see developers create ways to advance/embrace plain language principles in FR documents, or ways that would help the public better understand the purpose and language of a particular document. Similarly, applications that make it easy for the public to participate in rulemakings and in the rulemaking activities of agencies would be most sought after.

Question: Ray, a lot of people talk about authenticity as something that happens at the final point of information dissemination, like the FDsys system. But authenticity goes back to the root of the content. Can you talk a little bit about what you folks do in the Office of the Federal Register to make sure you’re only publishing the real thing? What’s to prevent me from creating a fake office or submitting on behalf of somebody else or otherwise hacking the system?

Ray Mosley: We have a number of safeguards to ensure that impostors do not publish faked documents.  About 40 per cent of all documents submitted are all-electronic, digitally signed originals.  We require digital signatures to have medium level assurance, and be issued in compliance with the Federal Bridge Certificate Authority requirements.  For signed ink-on-paper original documents, we have other controls, which we won’t discuss in great detail for obvious security reasons.  One of the biggest factors is human.  We have experienced editors and a legal staff who could recognize fraudulent documents submitted by anyone foolish enough to risk a felony conviction.  We have a system of agency Liaison Officers who vouch for their agencies’ documents.  We have almost daily contact and personal relationships with those liaisons and many other agency program staff and general counsel. Major regulatory documents are often sent for pre-submission review, so we know what is in the pipeline.  “Start-up” agencies’ documents do not get past the front door until our legal staff has checked out their legal authority and bona fides.      

We also maintain a “chain of custody” throughout the editorial process.  When we edit documents, we maintain an electronic record of every change and annotate those changes with notes to record the authorization of the agency.  We share the GPO production network with our statutory partners, which largely eliminates errors in transferring files.  We feed files to GPO all day, and exchange production information all day.  GPO does not independently alter any Federal Register material.  Their production staff can and does call to consult with senior OFR staff at any time of day or night.  

Question: There’s a lot of concern about authenticity, particularly from groups like law librarians. Mike, can you talk about digital signatures and other things you have in place to make sure you’re looking at the real deal when you see an official journal? What happens when copies of this stuff get made … is there any way to tell that you’re not looking at a Bogus Register?

Mike Wash: The XML is not digitally signed. The Office of the Federal Register is working with Data.gov to enhance the language on Data.gov to clearly indicate that the XML is not signed.  New language is being added to the Federal Register pages on Data.gov that will read as follows:

The current XML data set is not yet an official format of the Federal Register. Only the PDF and Text versions have legal status as parts of the official online format of the Federal Register. The XML-structured files are derived from SGML-tagged data and printing codes, which may produce anomalies in display. In addition, the XML data does not yet include image files. Users who require a higher level of assurance may wish to consult the official version of the Federal Register on FDsys.gov. The FDsys data set includes digitally signed Federal Register PDF files, which may be relied upon as evidence in a court of law. [See: http://www.fdsys.gov/fdsys/browse/collection.action?collectionCode=FR ]

Our XML user guide explains that we may digitally sign XML files in the future, but for now we are still concentrating on enhancing the display and content of XML files.  We require complete assurance that the XML product is a true rendition of the FR official legal record before proceeding with digital signatures.  As the official publisher, data integrity is paramount.  For us, the equation is: digital signature = authentic official edition.

Question: Ray, as publisher of one of the more complex daily journals in the world, I was wondering if your office has an editorial function?   Do you ever send text back and ask people to rewrite it? And, related to that, do you do all the SGML/XML markup on the text or is this something the agencies do and then submit?  If the latter, do you think it would be possible to get them to change the way they do markup, adding features like internal citations or indented lists to make the Register look better?

Ray Mosley: As publisher, we do indeed have an editorial function.  Ensuring authenticity and accuracy of FR publications is our number one concern, along with ensuring distribution to customers.  We have the authority, recognized in law, to return documents to agencies if they fail to meet our publications requirements.  We accept some changes while documents are in our possession, but we return documents that have major legal or formatting issues.  

We don’t re-draft regulations for style, per se.  We did adopt a regulation in 1978 requiring agencies to write the FR “Summary” in simple terms using understandable language.  It was one of the first plain language requirements in government.  We don’t have the resources to do more extensive plain language quality control, but if a passage in regulatory text violates codification standards, we will have the agency re-draft it.  We have a number of other substantive requirements.  For example, we won’t allow a document to mix elements of a proposed rule with final rule requirements. We require clarity and certainty as to the final agency action, including clear distinctions between interpretive, guidance documents and final rules with the force of law.   

On tagging, we do a limited amount of SGML tagging in our office.  GPO does most of the tagging on their end.  A few agencies have staff who can create SGML in their submissions, but the quality of that markup varies greatly.  We do not envision agencies tagging to create paragraph indentations or internal references.  In our view, the legal requirements of the Federal Register Act require format and content tagging and internal and external cross references to be imposed across the board, in a uniform manner.  

We have an evolving list of priorities, including some that will be enabled through XML.  Our Managing Editor has already requested that GPO developers look into creating indented text for the CFR and FR based on automated semantic analysis of the structured data.     

Thanks very much to Raymond Mosley and Michael Wash for taking the time to answer these questions! In addition to their web sites, you can find OFR on their Facebook page and you can follow @USGPO on Twitter.

  • http://www.gottahavacuppamocha.com Michael H

    This is pretty interesting. Though I don’t think it is source code as much as putting out the information in another format (though I may have to read more about it).

    I did go and download the PDF of today’s version of the Federal Register (October 7, 2009) and noticed the FTC will be holding workshops and discussions on the effect of the internet on journalism. One of the footnotes seemed pertinent to the Federal Register itself: “The point here is that since individuals do not calculate the full benefit to society of their learning about politics, they will express less than optimal levels of interest in public affairs coverage and generate less than desirable demands for news about government.”

    People may only be vaguely aware of what the Federal Register does and the information it delivers, so they do not think to look for it. They may rely on specific outlets of information (bloggers, newspapers, networks, etc.) to tell them what is important for them to know. I can see bloggers or other news outlets setting up bots to download and parse the Federal Register, looking for key terms or rules from specific agencies. Information that gets flagged by the bot can then be sent to an editor for further review.

    A few examples might be NOTAMs (Notices to Airmen) from the FAA being parsed and put into an RSS feed on an aviation website, or the editor of a business journal getting SEC notices sent to her email so she can figure out if they are useful for her readers.

    A lot of the parsing could be done with tools that are readily available for corpus linguistics.

  • http://www.xponentsoftware.com Bill Conniff

    Some of the XML files are quite large. I wonder if you and your readers might be interested in the XMLMax Virtual XML Editor because of its unlimited file size capability and speed. I just downloaded a 207 MB XML EPA file and XMLMax loaded it in a treeview in just fifteen seconds.

  • http://vtd-xml.sf.net anon_anon

    You should also look into vtd-xml for splitting large XML files (several hundred megabytes to multi-gigabyte documents):

    http://vtd-xml.sf.net