We were part of a group of journalists and bloggers invited to hear presentations from 10 different research groups within various parts of Xerox, PARC, and Fuji-Xerox. The format was similar to a science fair or a poster session at an academic conference, with small groups moving around to hear presentations from the different projects. While other research labs use a large auditorium and parade different researchers in, I thought the smaller, science fair format made for better interactions between the visitors and the researchers.
We saw early prototypes created by the researchers themselves, so the user interfaces were far from polished. Here are some of the highlights from our visit:
Seamless Document Viewer
A J2ME application designed to help solve the problem of viewing documents on small screens (cell phones and other mobile devices), this app automatically segments a document into blocks and displays the keyphrase for each block. The keyphrases are intended to help users navigate to sections of interest quickly. The cell phone demo we saw used a fairly intuitive touchscreen interface that included an interesting way to pan and zoom in and out of sections of a document. Because documents viewed through the application need to be processed and analyzed in advance, it is better suited for viewing PDFs and other static documents than frequently updated web pages.
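To make the "segment into blocks, show one keyphrase per block" idea concrete, here is a minimal Python sketch. It is not the researchers' method, which was not described in detail: it assumes blocks are separated by blank lines and picks the most frequent non-stopword as the keyphrase. The function names and the small stopword list are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not from the demo).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for",
             "on", "with", "that", "this", "it", "as", "are", "be"}

def segment(text):
    """Split text into blocks at blank lines (assumed block boundary)."""
    return [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]

def keyphrase(block):
    """Return the most frequent non-stopword in the block, or '' if none."""
    words = [w for w in re.findall(r"[a-z]+", block.lower())
             if w not in STOPWORDS]
    if not words:
        return ""
    return Counter(words).most_common(1)[0][0]

def outline(text):
    """Pair each block with its keyphrase, in document order."""
    return [(keyphrase(b), b) for b in segment(text)]
```

A small-screen viewer could list only the keyphrases from `outline(...)` and expand a block when the user taps its keyphrase.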
Categorizing documents automatically is an old topic in information science. Most tools rely only on the text portion of documents and use a combination of Natural Language Processing and Machine Learning. I was looking forward to this presentation because we use text-only automatic classifiers to help organize some of our data sources.
Hybrid categorization uses both the text and images contained in documents. It isn’t clear how scalable their hybrid categorizer is; the results we saw were based on small numbers of documents. Precision, the fraction of documents assigned to a category that actually belong to it, is one measure of a categorizer’s accuracy, and judging from the results of an academic competition, Xerox’s hybrid (text + images) approach may hold some promise.
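For readers unfamiliar with the metric, precision for a single category can be computed as in the sketch below. The labels and the `precision` helper are made up for illustration and are not taken from the Xerox results.

```python
def precision(predicted, actual, category):
    """Fraction of documents assigned to `category` that truly belong to it."""
    assigned = [i for i, label in enumerate(predicted) if label == category]
    if not assigned:
        return 0.0  # nothing was assigned to this category
    correct = sum(1 for i in assigned if actual[i] == category)
    return correct / len(assigned)

# Hypothetical labels for four documents.
predicted = ["invoice", "invoice", "memo", "invoice"]
actual = ["invoice", "memo", "memo", "invoice"]
score = precision(predicted, actual, "invoice")  # 2 of 3 assignments are correct
```

Note that precision says nothing about the documents a categorizer missed; that is what recall measures.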
“Reusable paper” refers to paper coated with special materials and a custom printer that shoots UV light onto it. The resulting printed document is designed to fade within 24 hours and the paper can be reused and fed into the printer multiple (10+) times. The printer can even erase the printing on the specially-coated papers, and print an entirely new document on the same sheets of paper. We raised the possibility that a sheet of paper that has nominally erased itself can be reverse engineered to reveal sensitive content: think security agencies or dumpster-diving identity thieves. Surprisingly, the researchers had not seriously investigated the possibility of “recovering erased documents”.
The cost of the specially-coated paper is projected to be only 2-3 times the cost of normal paper, while the accompanying printer will cost about the same as a laser printer. Since the paper can be reused multiple (10+) times, the obvious environmental benefits also lead to savings: at 10 uses per sheet, the paper cost per print works out to roughly 20-30% of that of normal paper. Further savings come from the design of the printer itself: since the printing is done with light (a UV LED bar), the printer does not use ink or toner.
Redaction is the process of removing sensitive information from documents. Popular examples include government/intelligence documents released to the public and medical records. Text redaction is normally a tedious manual process that requires staff possessing significant domain expertise. As an example, privacy rules governing medical records in the U.S. require redaction of terms associated with HIV/AIDS, mental health and drug/alcohol problems. In the demo we saw, the software tool examined a corpus of documents, automatically came up with terms/phrases associated with the listed illnesses, and redacted them from every document in the corpus.
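The final step, masking a list of sensitive terms across a corpus, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the term list is supplied by hand here, whereas the demo generated it automatically, and the `redact` and `redact_corpus` helpers and the `[REDACTED]` mask are hypothetical.

```python
import re

def redact(text, terms, mask="[REDACTED]"):
    """Replace whole-word, case-insensitive matches of each term with `mask`."""
    if not terms:
        return text
    # Try longer terms first so multi-word phrases win over their substrings.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(mask, text)

def redact_corpus(documents, terms):
    """Apply the same term list to every document in the corpus."""
    return [redact(doc, terms) for doc in documents]
```

Simple term matching like this misses paraphrases and context-dependent mentions, which is presumably why generating a good term list is the hard part.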