Here are a few of the data stories that caught my attention this week:
Sarah Palin’s Inbox
Last Friday, in response to a years-old public records request, the state of Alaska finally released some 24,000 pages of emails sent by former governor Sarah Palin. And “pages” really is the operative word here. Palin’s emails were all printed out — about 250 pounds of paper all told — at a printing cost of $725 per set. At least initially, the documents were only available to those who picked them up in Juneau — or to those willing to pay the high cost of having the six boxes mailed elsewhere.
Various organizations worked quickly to digitize the documents, but the task was so daunting that many news agencies, including The New York Times, called for crowdsourcing the review of the emails.
The project echoes a similar one undertaken by the Sunlight Foundation last year, when the group built a searchable interface for then-Supreme Court nominee Elena Kagan’s emails.
As the Sunlight Foundation notes:
Like Elena’s Inbox, Sarah’s Inbox faced staggering issues of data quality because government officials continue to release digital files as hideous printouts requiring a laborious and error-ridden optical character recognition (OCR) pass over. You will notice that many of the emails are garbled, incomplete or contain odd characters — please keep in mind that we did the best with what we had and are not responsible for the content. Due to the programmatic nature of the tools used to build this site, we recommend checking any research effort against the source files.
Legal limits on location data
Roughly two months after the iOS location story broke here on Radar, U.S. lawmakers have taken steps to limit how both the government and private companies can use location data.
Two bills were introduced this week — one in the House and one in the Senate. The latter was proposed by Senators Al Franken and Richard Blumenthal and would require companies to obtain users’ consent before sharing information about the location of a mobile device. The other bill, proposed by Representative Jason Chaffetz and Senator Ron Wyden, would require law enforcement agencies to obtain a warrant in order to track someone’s location via their mobile phone.
The proposals are part of a larger effort to update digital privacy laws, as legislators seem to grow increasingly concerned about consumer protections and data security.
LexisNexis open sources its Hadoop alternative
Research company LexisNexis announced this week that it will open source its big data processing tools. LexisNexis is positioning its High Performance Computing Cluster (HPCC) Systems as an alternative to Hadoop, boasting that it can “process, analyze, and find links and associations in high volumes of complex data significantly faster and more accurately than current technology systems.”
LexisNexis has a long history of working with big datasets, and it began developing HPCC Systems internally in its Risk Solutions unit a decade ago. Risk Solutions CEO James Peck says the company has opted to open source HPCC in order to leverage the “innovation of the open source community to further the development of the platform for the benefit of our customers and the community.”
HPCC Systems comprises a data-centric programming language and two processing platforms: the Thor Data Refinery Cluster and the Roxie Rapid Data Delivery Cluster.
We’ve been watching the Hadoop competition heat up over the last few months, and the entry by LexisNexis makes the development of big data technologies and the big data market even more interesting.
Got data news?
Feel free to email me.