Strata Week: Keeping it clean

Great data tools for R and Clojure, identifying shady Twitter memes, distributed data in Zambia, and cleaning mashed-up datasets

This edition of Strata Week is all about making things easy and tidy. If you’re eager to learn more tips and tricks for doing so, come to Santa Clara in February: check out the list of Strata conference speakers and register today.

Languages made easy: R and Clojure

Love “fruitful and fun” data mining with Orange? Wish you had an interface like that for R? Wish no more. Anup Parikh and Kyle Covington have created Red-R to extend the Orange interface.

The goal of this project is to provide access to the massive library of packages in R (and even non-R packages) without any programming expertise. The Red-R framework uses concepts of data-flow programming to make data the center of attention while hiding all the programming complexity.

Similar to Orange, Red-R uses a series of widgets to modify and display data. The beauty of Red-R is that it allows programming novices to leverage R’s power and to interact with their data in an analytical way. Such tools are no substitute for actual statistical modeling, of course, but they are a great first step in piquing interest and providing a visual conversation-starter.


Red-R is still in its infancy, but as with all such projects, testing and bug reports are welcome. Check out the forums to get involved.

If R is not your thing, perhaps you’ve jumped on the Clojure bandwagon (I wouldn’t blame you: Clojure is one exciting new language). If that’s the case, check out Webmine, a library for mining HTML written by Bradford Cross, Matt Revelle, and Aria Haghighi.


Facts are stubborn things

A team at the Indiana University Center for Complex Networks and Systems Research has built the Truthy system to examine and classify memes on Twitter in an attempt to identify instances of astroturfing, smear campaigns, and other “social pollution.”

Truthy looks at streaming Twitter data via the public Twitter API, filters it to extract politically minded tweets, and then pulls out “memes” like #hashtags, @replies, phrases, and URLs. Memes that constitute a high volume of tweets, as well as memes that have experienced a significant fluctuation in volume, are flagged and entered into a database for further investigation.
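To make the extract-and-flag step concrete, here is a minimal sketch in Python. It is not Truthy's code: the regular expressions, the function names, and the `min_share` and `min_ratio` thresholds are all assumptions chosen for illustration, and phrase extraction is left out entirely.

```python
import re
from collections import Counter

# Assumed patterns for three of the meme types mentioned above.
URL_RE = re.compile(r"https?://\S+")
TAG_RE = re.compile(r"[#@]\w+")


def extract_memes(tweet):
    """Return the set of hashtags, @replies, and URLs found in one tweet."""
    urls = URL_RE.findall(tweet)
    # Remove URLs first so a '#' or '@' inside a link is not counted as a tag.
    text = URL_RE.sub(" ", tweet)
    tags = [tag.lower() for tag in TAG_RE.findall(text)]
    return set(urls) | set(tags)


def count_memes(tweets):
    """Count how many tweets in a time window mention each meme."""
    counts = Counter()
    for tweet in tweets:
        counts.update(extract_memes(tweet))
    return counts


def flag_memes(prev_counts, curr_counts, total_tweets,
               min_share=0.5, min_ratio=3.0):
    """Flag memes with a high share of tweets or a sharp change in volume.

    min_share and min_ratio are hypothetical thresholds, not Truthy's.
    """
    flagged = set()
    for meme in set(prev_counts) | set(curr_counts):
        before = prev_counts.get(meme, 0)
        now = curr_counts.get(meme, 0)
        share = now / total_tweets if total_tweets else 0.0
        # Add-one smoothing avoids dividing by zero for unseen memes.
        ratio = (now + 1) / (before + 1)
        if share >= min_share or ratio >= min_ratio or ratio <= 1 / min_ratio:
            flagged.add(meme)
    return flagged
```

Calling `count_memes` on two consecutive windows of tweets and passing both results to `flag_memes` yields the memes worth a closer look. A real system would work on a live stream and use a more careful definition of "significant" than a fixed ratio.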

The Truthy system then visualizes a timeline, map, and diffusion network for each meme, and applies sentiment analysis in order to better study and understand “social epidemics.” It also relies on crowdsourcing to train its algorithms: visitors to the project’s website are asked to click the “Truthy” button on a meme’s detail page when they suspect the meme contains misinformation masquerading as fact.

Check out the gallery for some fascinating network visuals and the stories behind them.


A clean bill of health

Kudos to Dimagi and CIDRZ for a creative solution to a serious problem. In order to provide standard interventions to reduce maternal and infant mortality rates in rural Zambia for the BHOMA (Better Health Outcomes through Mentoring and Assessments) project, they needed a distributed system for capturing and relaying health data.

As in many other places in Africa, reliable internet access is hard to find in rural Zambian communities. But cell phones are nearly ubiquitous, which makes them the best devices for relaying patient information from clinics to field workers and back again.

Enter Apache’s CouchDB, which saved the day with its continuous replication. A lightweight server in each clinic now replicates filtered data to a national CouchDB database via a modem connection, and two-way replication allows data collected on phones to propagate back to each clinic.
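For readers who have not used CouchDB replication, the sketch below builds the kind of request involved. CouchDB starts a replication when a JSON document with `source`, `target`, and optionally `continuous` and `filter` fields is POSTed to its `/_replicate` endpoint. Everything else here is an assumption made for illustration: the host names, the `patients` database, and the `sync/by_clinic` filter are hypothetical, and the case study does not say how the project actually configured its servers.

```python
import json
from urllib.request import Request


def replication_request(server, source, target,
                        filter_name=None, continuous=True):
    """Build (but do not send) a POST to a CouchDB server's /_replicate."""
    body = {"source": source, "target": target, "continuous": continuous}
    if filter_name:
        # Names a filter function stored in a design document,
        # written as "designdoc/filtername".
        body["filter"] = filter_name
    return Request(
        url=server.rstrip("/") + "/_replicate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical addresses for one clinic server and the national database.
CLINIC = "http://localhost:5984"
NATIONAL_DB = "http://national.example.org:5984/patients"

# Clinic to national: push only the documents the filter lets through.
push = replication_request(CLINIC, "patients", NATIONAL_DB,
                           filter_name="sync/by_clinic")

# National to clinic: pull updates back down, completing the two-way sync.
pull = replication_request(CLINIC, NATIONAL_DB, "patients")
```

Passing either request to `urllib.request.urlopen` would send it to a running CouchDB server. With `continuous` set to true, CouchDB keeps the replication running and forwards new changes as they arrive, which suits a link that comes and goes.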

Read more details of the case study here.

Refinement rather than fashion

You may recall that among a spate of Google acquisitions over the summer was Metaweb, the company responsible for Freebase. Now, a nifty open source tool formerly called Freebase Gridworks has been renamed Google Refine, and version 2.0 was released just last week.

Refine is a powerful tool for cleaning up data. It allows you to easily sort and transform inconsistent cells to correct typos and merge variants; filter, then remove or change certain rows; apply custom text transformations; examine numerical columns via histograms; and perform many more complex operations to make data more consistent and useful.
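Merging variants is easier to picture with a small example. The Python sketch below uses a simple key-collision idea: reduce each cell to a normalized "fingerprint" and treat cells that share one as spellings of the same value. It is a simplified illustration rather than Refine's own implementation, and the function names are made up for this example.

```python
import string
from collections import Counter, defaultdict

# Translation table that deletes ASCII punctuation.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def fingerprint(value):
    """Normalize a cell: trim, lowercase, drop punctuation, sort unique words."""
    cleaned = value.strip().lower().translate(_STRIP_PUNCT)
    return " ".join(sorted(set(cleaned.split())))


def merge_variants(values):
    """Replace each cell with the most common spelling among its variants."""
    spellings = defaultdict(Counter)
    for value in values:
        spellings[fingerprint(value)][value] += 1
    canonical = {
        key: counter.most_common(1)[0][0]
        for key, counter in spellings.items()
    }
    return [canonical[fingerprint(value)] for value in values]


cities = ["New York", "new york ", "York, New", "New York", "Boston"]
merged = merge_variants(cities)
```

Here the first four cells share the fingerprint `new york`, so all of them become `New York`, the spelling that appears most often. A fingerprint like this catches differences in case, spacing, punctuation, and word order, but not genuine typos such as "New Yrok", which need a fuzzier comparison.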

Refine really shines when it is used to combine or transform data from multiple sources, so it’s no surprise that it has been popular for open government and data journalism tasks.

Also notable is the fact that Refine is a downloadable desktop app, not a web service. This means you don’t have to upload your data anywhere in order to use it. Best of all, Google Refine keeps a running changelog that lets you review and revert changes to your data — so go ahead: play around. A great set of video tutorials on Google’s blog can help you do just that.
