Here are the data stories that caught my attention this week.
Face recognition and Facebook
Face recognition technology isn’t really a new Facebook feature, but until now it’s only been available for U.S. users. The switch was flipped this week, and face recognition was made available to international users, prompting an outcry about privacy and an EU probe into the matter. The concerns involve using face recognition technology to tag users in photos without their consent.
As TechCrunch’s Jason Kincaid points out, however, people can be tagged in photos without their consent with or without face recognition technology:
To reiterate: the EU may conclude that Facebook users should be able to pre-approve their tags, and I don’t necessarily think that would be a bad thing (I’m sick of tag spam, for one). But conflating this with the spookiness of facial recognition seems like a mistake — we should save that outcry for when companies really do start doing creepy things with the technology.
Screenshot of Facebook’s “Suggest Tags” menu (user photos were edited out of this image).
Tim O’Reilly wrote here on Radar that, in fact, Facebook’s strategy for rolling out face recognition technology may be just the ticket:
Face recognition is here to stay. My question is whether to pretend that it doesn’t exist, and leave its use to government agencies, repressive regimes, marketing data mining firms, insurance companies, and other monolithic entities, or whether to come to grips with it as a society by making it commonplace and useful, figuring out the downsides, and regulating those downsides.
Analyzing hacked passwords
Much of the uproar around recent hacks and security breaches has focused on the weaknesses of corporate systems themselves, as well as the impact stolen data might have on customers. But software architect Troy Hunt has turned his attention to a different matter, analyzing the passwords that were stolen.
Hunt has examined the roughly 37,000 passwords that were made available via BitTorrent, just a small portion of the million or so that LulzSec claimed to have taken in its latest breach of Sony Pictures. Hunt looked at the passwords in terms of length, randomness, uniqueness, and variety of character types, which are generally accepted as the indicators of password entropy. In other words, the better a password scores on each of these measures, the stronger it is.
And no surprise, he found that most passwords aren’t particularly strong.
Ninety-three percent of accounts had passwords between six and 10 characters in length, and 50% were fewer than eight characters long. Length is only one indicator of strength, and Hunt found that less than 4% of the passwords he analyzed had three or more character types (as in, capital letters, lower case letters, numbers, and so on). Half the passwords had only one character type, and of those, 90% were all lower case letters. Furthermore, less than 1% of passwords contained a non-alphanumeric character. There were a fair number of identical passwords, with “password,” “123456,” and “abc123” among the most common, and 20% of the passwords in this particular batch were repeats.
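To make those measures concrete, here is a minimal Python sketch of how this kind of tally could be computed. It is not Hunt’s actual code: the function names, the four character classes, and the definition of a “repeat” (any password that appears more than once in the batch) are my assumptions, and the sample list is made up for illustration.

```python
from collections import Counter
import string

def char_types(password):
    """Return the set of character classes found in a password.

    The four classes here are an assumption for illustration; anything
    that is not an ASCII letter or digit is counted as "other".
    """
    types = set()
    for ch in password:
        if ch in string.ascii_lowercase:
            types.add("lower")
        elif ch in string.ascii_uppercase:
            types.add("upper")
        elif ch in string.digits:
            types.add("digit")
        else:
            types.add("other")
    return types

def summarize(passwords):
    """Compute the share of passwords matching each weakness measure."""
    total = len(passwords)
    counts = Counter(passwords)
    return {
        "shorter_than_8": sum(len(p) < 8 for p in passwords) / total,
        "one_char_type": sum(len(char_types(p)) == 1 for p in passwords) / total,
        "three_or_more_types": sum(len(char_types(p)) >= 3 for p in passwords) / total,
        "has_non_alphanumeric": sum("other" in char_types(p) for p in passwords) / total,
        # Assumed definition: a password is a "repeat" if it occurs more than once.
        "repeats": sum(c for c in counts.values() if c > 1) / total,
        "most_common": counts.most_common(3),
    }

# Made-up sample, not data from the actual dump.
sample = ["password", "123456", "abc123", "password", "Tr0ub4dor&3", "letmein"]
summary = summarize(sample)
print(summary)
```

On a real dump, the same function would simply be fed the full list of passwords, one entry per account.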
Just as problematic as these weak passwords, of course, is the repetition of passwords across multiple databases. Although only 88 email addresses in this batch taken from Sony Pictures can be found in a similar data-dump from the stolen Gawker email addresses, two-thirds of those people used the same password to register on both sites.
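A cross-check like this one boils down to matching email addresses between two dumps and comparing the passwords attached to them. The sketch below assumes each dump has already been loaded into a dictionary mapping email address to plaintext password; the `reuse_rate` name and the sample entries are hypothetical and are not taken from Hunt’s analysis.

```python
def reuse_rate(dump_a, dump_b):
    """Return (shared account count, share of those reusing a password).

    Both arguments are assumed to map email address -> plaintext password.
    """
    shared = dump_a.keys() & dump_b.keys()  # addresses present in both dumps
    if not shared:
        return 0, 0.0
    same = sum(dump_a[email] == dump_b[email] for email in shared)
    return len(shared), same / len(shared)

# Made-up entries for illustration only.
site_one = {
    "a@example.com": "password",
    "b@example.com": "abc123",
    "c@example.com": "letmein",
}
site_two = {
    "a@example.com": "password",
    "b@example.com": "qwerty",
    "c@example.com": "letmein",
    "d@example.com": "123456",
}

shared_count, rate = reuse_rate(site_one, site_two)
print(shared_count, rate)
```

In practice the addresses would probably need normalizing first (lowercasing, trimming whitespace) so that the same account is not counted as two different ones.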
“Based on the finding above,” writes Hunt, “there’s a statistically good chance that the majority of them will work with other websites. How many Gmail or eBay or Facebook accounts are we holding the keys to here? And of course ‘we’ is a bit misleading because anyone can grab these off the net right now. Scary stuff.”
While the recent exploits demonstrate some of the ongoing problems around system security, Hunt’s work highlights that there are a fair number of Internet users who are still not protecting themselves.
Archival data helps game developers recreate 1940s Los Angeles
The new video game L.A. Noire was released last month to great reviews, with many praising the accuracy of the game’s 1940s Los Angeles setting.
Nathan Masters explains how the game’s developers contacted archivists at a number of different collections in order to piece together the data about the city. Detailed WPA maps were found at the Huntington Library. U.S. Geological Survey data and photos came from the UCLA Department of Geography and the Spence Air Photo Collection. Images of cityscapes from the era came from the Dick Whittington and Los Angeles Examiner photography collections at USC. Numerous other libraries were consulted as well.
The Atlantic’s Alexis Madrigal makes the wonderful suggestion that the game’s makers, Rockstar Games, release the model for others to study and remix.
Got data news?
Feel free to email me.