Strata Week: Infochimps makes a platform play

Infochimps opens up a data platform, de-anonymization via writing style, data for the public good.

Here are a few of the data stories that caught my attention this week.

Infochimps makes its big data expertise available in a platform

The big data marketplace Infochimps announced this week that it will begin offering the platform that it’s built for itself to other companies, as both a platform-as-a-service and an on-premises solution. “The technical needs for Infochimps are pretty substantial,” says CEO Joe Kelly, and the company now plans to help others get up to speed with implementing a big data infrastructure.

Infochimps has offered datasets for download or via API for a number of years (see my May 2011 interview with the company), but the startup is now making the transition to offer its infrastructure to others. Likening its big data marketplace to an “iTunes for data,” Infochimps says it’s clear that we still need a lot more “iPods” in production before most companies are able to handle the big data deluge.

Infochimps will now offer its in-house expertise to others. That includes a number of tools that one might expect: AWS, Hadoop, and Pig. But it also includes Ironfan, Infochimps’ management tool built on top of Chef.

Infochimps isn’t abandoning the big data marketplace piece of its business. However, its move to support companies with their big data efforts is an indication that there’s still quite a bit of work to do before everyone’s ready to “do stuff” with the big data we’re accumulating.


How do you anonymize online publications?

A fascinating piece of research is set to appear at IEEE S&P on the subject of Internet-scale authorship identification based on “stylometry,” which is an analysis of writing style. The paper was co-authored by Arvind Narayanan, Hristo Paskov, Neil Gong, John Bethencourt, Emil Stefanov, Richard Shin and Dawn Song. They’ve been able to correctly identify writers 20% of the time by looking at what they’ve published online before. It’s a finding with serious implications for online anonymity and free speech, the team notes.

“The good news for authors who would like to protect themselves against de-anonymization is it appears that manually changing one’s style is enough to throw off these attacks,” says Narayanan.
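To give a rough feel for what “stylometry” means in practice, here is a toy sketch in Python. It is not the researchers’ method: it assumes a tiny, hand-picked list of function words as the only style feature and uses cosine similarity to pick the closest known author. The word list, the function names and the sample texts are all illustrative assumptions.

```python
import math
import re
from collections import Counter

# Illustrative function-word list (an assumption for this sketch;
# real stylometric systems rely on far richer feature sets).
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in",
                  "that", "it", "is", "but", "however", "which"]


def style_vector(text):
    """Return the relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def guess_author(anonymous_text, known_texts):
    """Pick the known author whose style vector is closest to the anonymous text."""
    target = style_vector(anonymous_text)
    return max(
        known_texts,
        key=lambda author: cosine(target, style_vector(known_texts[author])),
    )


# Hypothetical sample writing from two known authors.
known = {
    "alice": "However, the plan is sound, but the cost is high. "
             "However, it is the right call.",
    "bob": "Which of the roads to take? A map that shows a path "
           "in a wood which leads to a town.",
}

print(guess_author("However, the idea is good, but it is late.", known))
```

A writer who deliberately shifts these habits, say by dropping a favorite “however,” would move their vector away from their earlier work, which is the intuition behind Narayanan’s comment about manually changing one’s style.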

Open data for the public good

O’Reilly Media has just published a report on “Data for the Public Good.” In the report, Alex Howard makes the argument for a systemic approach to thinking about open data and the public sector, examining the case for a “public good” around public data as well as around governmental, journalistic, healthcare, and crisis situations (to name but a few scenarios and applications).

Howard notes that the success of recent open data initiatives “won’t depend on any single chief information officer, chief executive or brilliant developer. Data for the public good will be driven by a distributed community of media, nonprofits, academics and civic advocates focused on better outcomes, more informed communities and the new news, in whatever form it is delivered.” Although many municipalities have made the case for open data initiatives, there’s more to the puzzle, Howard argues, including recognizing the importance of personal data and making the case for “hybridized public-private data.”

The “Data for the Public Good” report is available for free as a PDF, ePUB, or MOBI download.

Got data news?

Feel free to email me.


  • Gwen Jenkins

    De-anonymization: I wouldn’t bet on protecting myself by changing my style. Writers have long used pseudonyms to write for audiences with differing expectations; “identify the author” is one of the games literary critics have always played.
    The most skilled critics use subtle grammar clues, not writing style, to make the match. Given the neglect of grammar studies in modern education, it’s unlikely that writers will be sufficiently conscious of their own quirks to fool a well-programmed computer.