Strata Week: We give up more data than we realize, but CA residents soon may have access to all of it

Alessandro Acquisti's data research, the CA Right to Know Act of 2013, big data signal issues, and big data battles fraud and theft.

A look at personal data research and new government legislation

In a post at the New York Times this week, Somini Sengupta took an in-depth look at the work of Alessandro Acquisti, a behavioral economist at Carnegie Mellon University in Pittsburgh. Acquisti studies the choices we make when deciding what and how much data we’re willing to share, and the factors that often lead us to give up more data than we realize. Sengupta reports:

“Our browsing habits, search terms, e-mail communication — even our offering of our ZIP codes at the supermarket checkout — reveal bits of information that can be assembled by data companies, usually for the purpose of knowing what sorts of products we’re most likely to buy. The online advertising industry insists that the data is scrambled to make it impossible to identify individuals.

“Mr. Acquisti offers a sobering counterpoint. In 2011, he took snapshots with a webcam of nearly 100 students on campus. Within minutes, he had identified about one-third of them using facial recognition software. In addition, for about a fourth of the subjects whom he could identify, he found out enough about them on Facebook to guess at least a portion of their Social Security numbers.”

Sengupta also looks at Acquisti’s work that shows how distractions such as incoming email, Twitter notifications and text messages can interfere with and affect our decisions about sharing personal data — often making us less vigilant — and how his work is influencing government policy and regulation. You can read her full in-depth profile at The New York Times.

In related news, a new bill making its way through the California legislature would be the first in the U.S. to give consumers the right to find out which companies hold their data and to request a copy of that data. Cyrus Farivar reports at Ars Technica that the bill, the “Right to Know Act of 2013,” is a result of lobbying by the Electronic Frontier Foundation and the American Civil Liberties Union of Northern California.

EFF activism director Rainey Reitman describes the bill in a post at the EFF:

“The new proposal brings California’s outdated transparency law into the digital age, making it possible for California consumers to request an accounting of all the ways their personal information is being trafficked—including with online advertisers, data brokers, and third-party apps. So while current law provides information about data exchanged for direct marketing, the Right to Know Act would update existing transparency law to ensure that users could track the flow of their data from online interactions. It also updates the definitions in the law in important ways, including adding location data—a sensitive data set not adequately protected by current law.”

Reitman points out that consumers would have a right to know not only what data companies are sharing, but also what personal data those companies are storing.

Farivar reports that under the new bill, if a company fails to comply with a consumer’s request for their data, a civil suit can be filed to force compliance.

Big data signal issues may best be addressed with social science methodologies

Looking at the rise in big data hype in a post at Harvard Business Review this week, Kate Crawford warns against what she calls “data fundamentalism” — “the notion that correlation always indicates causation, and that massive data sets and predictive analytics always reflect objective truth.” Crawford emphasizes that the numbers can’t speak for themselves, that meaning from data is drawn through human interpretation. She says that “biases in both the collection and analysis stages present considerable risks” and need to carry as much weight in big data interpretation as the numbers themselves.

Crawford illustrates her point with an example of a study (PDF) of Twitter and Foursquare data gathered during Hurricane Sandy. She notes that while the data showed some expected findings such as an upsurge in grocery shopping and a few less expected ones such as a rise in nightlife the day after the hurricane hit, the data didn’t tell the entire story. “The greatest number of tweets about Sandy came from Manhattan,” she writes. “This makes sense given the city’s high level of smartphone ownership and Twitter use, but it creates the illusion that Manhattan was the hub of the disaster.”
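Crawford's Manhattan example can be illustrated with a minimal Python sketch. Every number and name below is hypothetical and invented for illustration; none comes from the study she cites. The sketch assumes each affected Twitter user posts exactly once, and shows how raw tweet counts can rank a less affected, high-adoption area above a harder-hit, low-adoption one until the counts are adjusted for adoption rates.

```python
# Hypothetical regions and numbers, invented for illustration only.
# "affected" = people affected by the event; "twitter_pct" = percent who use Twitter.
REGIONS = {
    "high_adoption": {"affected": 10_000, "twitter_pct": 30},
    "low_adoption": {"affected": 40_000, "twitter_pct": 5},
}


def raw_tweeters(regions):
    """Observed signal: affected people who tweet (assumes one tweet each)."""
    return {
        name: r["affected"] * r["twitter_pct"] // 100
        for name, r in regions.items()
    }


def adjusted_estimate(raw, regions):
    """Scale raw counts by the adoption rate to estimate people affected."""
    return {
        name: raw[name] * 100 / regions[name]["twitter_pct"]
        for name in raw
    }


raw = raw_tweeters(REGIONS)
adjusted = adjusted_estimate(raw, REGIONS)

for name in REGIONS:
    print(name, "raw tweets:", raw[name], "adjusted estimate:", round(adjusted[name]))
```

In practice the adoption rates themselves would have to be estimated, which is exactly the kind of question about where data comes from that Crawford says analysts need to ask.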

The solution to addressing big data’s signal issues and weaknesses in big data science may lie in the social sciences, Crawford suggests. “In the near term, data scientists should take a page from social scientists, who have a long history of asking where the data they’re working with comes from, what methods were used to gather and analyze it, and what cognitive biases they might bring to its interpretation,” she argues. “Social science methodologies may make the challenge of understanding big data more complex,” she says, “but they also bring context-awareness to our research to address serious signal problems.” You can read Crawford’s full piece at Harvard Business Review.

Big data vs fraud and theft

Companies are putting big data to work not only in efforts to secure more customers and to increase revenues, but also to crack down on fraud and theft. Frank Konkel reports at FCW this week on how the United States Postal Service (USPS) is making use of big data. He writes:

“The United States Postal Service is at the cutting edge of supercomputing technologies and the big data revolution, operating one of the most powerful non-classified supercomputing databases on the planet to process and detect fraud on over 528 million mail pieces every day.”

USPS program manager Scot Atkins told Konkel that 6,100 pieces of mail are processed each second, and within 50 to 100 milliseconds, the data from each piece of mail (such as weight, size, and carrier route data) is compared to a database of 400 billion records to detect problems such as insufficient, duplicate or fraudulent postage. If an issue arises, Atkins noted, the USPS is able to address it in near real-time.
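As a rough consistency check, the per-second rate Atkins cites can be compared with the daily volume Konkel quotes. The short Python sketch below assumes mail is processed at a constant rate around the clock, which is a simplifying assumption and not something the report states; the function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Rate quoted by Atkins in Konkel's report.
PIECES_PER_SECOND = 6_100


def pieces_per_day(rate_per_second):
    """Daily volume implied by a constant per-second rate (an assumption)."""
    return rate_per_second * SECONDS_PER_DAY


daily = pieces_per_day(PIECES_PER_SECOND)
print(f"{daily:,} pieces per day")  # 527,040,000 pieces per day
```

The result, about 527 million pieces per day, is in line with the "over 528 million mail pieces every day" figure in the quoted passage, given that 6,100 is presumably a rounded rate.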

Atkins said the USPS faces a significant number of fraud attempts each year and that the supercomputing and revenue protection program has been successful in cutting fraud. “We’re in the last mile by the time mail gets to the post office,” Atkins told Konkel, “and if we don’t intercept fraudulent packages at that point, chances are we won’t get the revenue.” Konkel notes that the USPS doesn’t publicly aggregate the amount of revenue saved through fraud detection efforts, but “[w]ith annual revenue of $65 billion,” Konkel writes, “the math says big data could save USPS millions per year.” You can read his full report at FCW.

On a more controversial level, retailers are putting large databases to work to help detect and track employee theft, Stephanie Clifford and Jessica Silver-Greenberg report at the New York Times. The problem, they say, is that the information repositories the retailers have helped amass, such as First Advantage Corporation’s Esteem database, “often contain scant details about suspected thefts and routinely do not involve criminal charges.” Even so, the information is often enough to derail a job candidate’s application.

Though the retail theft databases are legal, Clifford and Silver-Greenberg report, they are beginning to come under scrutiny by labor lawyers and federal regulators as being so sweeping that innocent employees could be lumped in with the guilty. “The lawyers say workers are often coerced into confessing,” Clifford and Silver-Greenberg report, “sometimes when they have done nothing wrong, without understanding that they will be branded as thieves.”

Clifford and Silver-Greenberg cover several real-world examples of people unjustly harmed by the Esteem database, quoting one lawyer who called the database a “secret blacklist”: “The employees don’t know about it until they have already been hurt.” They also report that the Federal Trade Commission is investigating Esteem and other retail theft databases to determine whether they comply with the Fair Credit Reporting Act. You can read their full piece at the New York Times.

Tip us off

News tips and suggestions are always welcome, so please send them along.


