Strata Week: A step toward personal data control

Here are a few of the data stories that caught my attention this week.

Your data in your locker

Earlier this month, John Battelle wrote a blog post wishing for a service that would counter the way our personal data is scattered across so many applications and devices. He was looking for a tool that would pull together the data from these various places into something that “queries all my various social actions and curates them into one publicly addressable instance independent of any larger platform like AOL, Facebook, Apple, or Google … I’m pretty sure this is what Singly and the Locker Project will make theoretically possible.”

Battelle and Singly’s Jason Cavnar discussed the Locker Project in more detail in another post on Battelle’s blog this week.

As Cavnar argued:

Data doesn’t do us justice. This is about LIFE. Our lives. Or as our colleague Lindsay (@lschutte) says — ‘your story.’ Not data. Data is just a manifestation of the actual life we are leading. Our data (story) should be ours to own, remember, re-use, discover with and share.

If that sounds appealing, then there’s good news ahead. Singly 1.0 begins its roll-out to developers this week, as ReadWriteWeb’s Marshall Kirkpatrick reports. Developers will be able to build apps that “search, sort and visualize contacts, links and photos that have been published by their own accounts on various social networks but also by all the accounts they are subscribed to there.” For now, the apps will live on and deploy from GitHub. There are also several restrictions on using other people’s apps. For example, you can only use them to visualize your own data.

Even with these limitations, Singly is a first step in a much-anticipated and hugely important move toward personal data control.

Bad graphics and good data journalism

Last week, New York Times senior software architect Jacob Harris issued a challenge to the growing number of data journalists. Want to visualize your work? Avoid word clouds.

Word clouds are, he argued, much like tag clouds before them: “the mullets of the Internet.” That is, taking a particular dataset and merely visualizing the frequency of words in it with a tool like Wordle is simply “filler visualization.” (And Harris said it’s also personally painful to the NYT data science team.)

Harris pointed to numerous problems with using word clouds as the sole form of textual analysis. Chief among them is that they rely only on word frequency, which doesn’t necessarily tell you much:

For starters, word clouds support only the crudest sorts of textual analysis, much like figuring out a protein by getting a count only of its amino acids. This can be wildly misleading; I created a word cloud of Tea Party feelings about Obama, and the two largest words were implausibly “like” and “policy,” mainly because the important word “don’t” was automatically excluded. (Fair enough: Such stopwords would otherwise dominate the word clouds.) A phrase or thematic analysis would reach more accurate conclusions. When looking at the word cloud of the War Logs, does the equal sizing of the words “car” and “blast” indicate a large number of reports about car bombs or just many reports about cars or explosions? How do I compare the relative frequency of lesser-used words? Also, doesn’t focusing on the occurrence of specific words instead of concepts or themes miss the fact that different reports about truck bombs might use the words “truck,” “vehicle,” or even “bongo” (since the Kia Bongo is very popular in Iraq)?
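Harris's stopword example is easy to reproduce. What follows is a minimal sketch, not his actual analysis: the sample sentence, the stopword list, and the function names are all invented for illustration. It compares single-word counts after stopword removal with simple two-word phrase (bigram) counts.

```python
import re
from collections import Counter

# Hypothetical stopword list, invented for this illustration.
STOPWORDS = {"i", "the", "his", "a", "of", "don't"}


def tokenize(text):
    # Lowercase the text and keep apostrophes so "don't" stays one token.
    return re.findall(r"[a-z']+", text.lower())


def word_counts(text, stopwords=STOPWORDS):
    # What a word cloud typically sizes its words by: frequency minus stopwords.
    return Counter(t for t in tokenize(text) if t not in stopwords)


def bigram_counts(text):
    # Count adjacent word pairs. This naive version ignores sentence boundaries.
    tokens = tokenize(text)
    return Counter(zip(tokens, tokens[1:]))


# Hypothetical sample text, not real survey data.
sample = "I don't like his policy. I don't like the policy."

print(word_counts(sample).most_common(2))
print(bigram_counts(sample).most_common(2))
```

In this toy input, the single-word counts surface only "like" and "policy," while the bigram counts keep "don't" attached to "like," which is the distinction a frequency-only cloud loses.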

The Guardian’s Simon Rogers responded to Harris. Rogers acknowledged there are plenty of poor visualizations out there, but he added an important point:

Calling for better graphics is also like calling for more sunshine and free chocolate — who’s going to disagree with that? What they do is ignore why people produce their own graphics. We often use free tools because they are quick and tell the story simply. But, when we have the time, nothing beats having a good designer create something beautiful — and the Guardian graphics team produces lovely visualisation for the Datablog all the time — such as this one. What is the alternative online for those who don’t have access to a team of trained designers?

That last question is crucial, particularly as not everyone has access to the designers or software needed to do much more with their data than create simple visualizations such as word clouds. Rogers said that it’s probably fine to have a lot of less-than-useful graphics, because, if nothing else, it “shows that data analysis is part of all our lives now, not just the preserve of a few trained experts handing out pearls of wisdom.”

Mary Meeker examines the global growth of mobile data

Among the most-anticipated speakers at Web 2.0 Summit this week was Mary Meeker. The former Morgan Stanley analyst, now a partner at Kleiner Perkins, gave her annual “Internet Trends” presentation, which is always chock-full of data.

Meeker noted that 81% of users of the top 10 global Internet properties come from outside the U.S. Furthermore, in the last three years alone, China has added more Internet users than there are in all of the United States (246 million new Chinese users online versus 244 million total U.S. users online). Although companies like Apple, Amazon, and Google continue to dominate, Meeker pointed out that some of the largest and fastest-growing Internet companies are also based outside the U.S., including Chinese companies like Baidu and Tencent and Russian companies like Mail.ru. And beyond just market value, she pointed to global innovations, such as Sweden’s Spotify and Israel’s Waze.

The growth in Internet usage continues to be in mobile. Meeker highlighted the global scale and spread of mobile growth, noting that it’s in countries like Turkey, India, Brazil and China where we are seeing the largest year-over-year expansion in mobile subscribers.

Suggesting that it may be time to reevaluate Maslow’s hierarchy of needs, Meeker posited that Internet access is rapidly becoming a crucial need that sits at the top of a new hierarchy.

Apache Cassandra reaches 1.0

The Apache Software Foundation announced this week the release of Cassandra v1.0.

Cassandra, originally developed by Facebook to power its Inbox Search, was open sourced in 2008. Although it’s been a top-level Apache project for more than a year now, the 1.0 release marks Cassandra’s maturity and readiness for more widespread implementation. The technology has been adopted beyond Facebook by companies like Cisco, Cloudkick, Digg, Reddit, Twitter and Walmart Labs.

Of course, Cassandra is just one of many non-relational databases on the market, with the most recent addition coming from Oracle. But Jonathan Ellis, the vice president of the Apache Cassandra project, explained to PCWorld why Cassandra remains competitive:

[Its] architecture is suited for multi-data center environments, because it does not rely on a leader node to coordinate activities of the database. Data can be written to a local node, thereby eliminating the additional network communications needed to coordinate with a sometimes geographically distant master node. Also, because Cassandra is a column-based storage engine, it can store richer data sets than the typical key-value storage engine.
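The last point in that quote, about column-based versus key-value storage, can be sketched with plain Python dictionaries. This is a deliberately simplified illustration, not Cassandra's API or its storage engine; the row key, column names, and helper functions are all hypothetical.

```python
# Key-value model: each key maps to one opaque value. To change a single
# field, the application must read, parse, and rewrite the whole value.
kv_store = {"user:42": '{"name": "Ada", "city": "London"}'}

# Column-oriented model (simplified): each row key maps to a set of named
# columns that can be read or written individually.
column_store = {"user:42": {"name": "Ada", "city": "London"}}


def set_column(store, row_key, column, value):
    # Add or overwrite one column without touching the row's other columns.
    store.setdefault(row_key, {})[column] = value


def get_column(store, row_key, column):
    # Return one column's value, or None if the row or column is absent.
    return store.get(row_key, {}).get(column)


set_column(column_store, "user:42", "email", "ada@example.com")
print(get_column(column_store, "user:42", "email"))
```

The sketch says nothing about distribution, replication, or on-disk layout. It only shows why per-row named columns can hold more structure than a single value per key.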

Got data news?

Feel free to email me.
