- ‘Scrapers’ Dig Deep for Data on Web (WSJ) — our users’ data comprise a valuable resource to mine and sell, but so do their kidneys. The data world faces serious issues with informed consent, control, and exploitation. It’s not just a shiny new business model; it can also leave people feeling very violated. Again, if you’re not paying for it then you’re the product and not the customer. The majority of humanity is not conscious of the difference between “user” and “customer”. (via Mike Brown on Twitter)
- Journalism in the Age of Data (Video) — Stanford video, with annotations and links, on the challenge of using dataviz as a storytelling medium. (via Ben Goldacre on Twitter)
- webshell (GitHub) — open source (Apache-licensed) console utility, requiring node.js, for debugging and understanding HTTP connections. (via Chris Shiflett on Twitter, who prefers it to yesterday’s htty)
- Amazon to Launch Kindle Singles (press release) — shorter-form works (think: novellas) as a format to expand publishing market rather than shrink it. Damn near every business book ever written should have been this size instead of 300 pages of tedium.
IA Ventures success, MathJax display engine, statistical literacy, and making big data more human
IA Ventures raises a huge first-time fund; MathJax provides an open source mathematical display engine; Kevin Drum shares 10 statistics pitfalls; and Paul Bradshaw explains how to bring big data down to a human scale.
Data geekery, visualization and journalism
From deep-diving startup founders to national newspapers, there's a rich vein of wisdom and information in blogs about data. Here are five to get your reading list started.
Data Privacy, Journalism and Dataviz, Web Shell, and Kindle Singles
Teaching Design Thinking, Client-Side Graphics, Removing Logos, and Tweeting the Revolution
- Design Thinking in Schools — materials to help teach design thinking in schools and education. My favourite: Design MadLibs (though until they can include “fart” in the list of acceptable words, it won’t be as interesting to my kids as the original MadLibs). (via Justine Sanderson)
- Unlogo — a web service that eliminates logos and other corporate signage from videos. Very clever use of computer vision technology: “if we have all these demos of CV that put logos on blank sheets of paper and otherwise inject them into our lives, why not use the same technology to remove logos from the world around us?” There’s a nifty demo replacing logos with the head of the relevant corporation’s CEO. (via Phil Lindsay)
- Gibbets, Dismemberment, and Dickens (Julie Starr) — evocative and well-written Dickens account of witnessing a guillotining. If the next revolution is tweeted, it’ll be a sad day for journalism, literature, and history. Do read this, it’s not revolting.
Data viz for journalism, student career paths, multi-dimensional data, and the future.
Get cozy for fall by watching some videos about visualization. First, check out Geoffrey McGhee's documentary about data viz in journalism. Then get a sneak preview of LinkedIn's Career Explorer tool. Catch up on Julia Grace's Web 2.0 Expo keynote, and finally, take a look at the future of user interfaces through touchable holograms.
The intersection -- and accompanying questions -- of data science and journalism.
There's nothing wrong with taking a strong position, assuming the underlying data and facts are accurate. But it's important for the audience to recognize it as advocacy, not as strict science, even when it comes wrapped in a really cool visualization.
- Transparency is Not Enough (danah boyd) — we need people to not just have access to the data, but have access to the context surrounding the data. A very thoughtful talk from Gov 2.0 Expo about meaningful data release.
- Feed6 — the latest from Rohit Khare is a sort of a “hot or not” for pictures posted to Twitter. Slightly addictive, while somewhat purposeless. Remarkable for how banal the “most popular” pictures are, it reminds me of the way Digg, Reddit, and other such sites trend towards the uninteresting and dissatisfying. Flickr’s interestingness still remains one of the high points of user-curated notability. (via rabble on Twitter)
- Potential Policy Recommendations to Support the Reinvention of Journalism (PDF) — FTC staff discussion document that floats a number of policy proposals around journalism: additional IP rights to defend against aggregators like Google News; protection of “hot news” facts; statutory limits to “fair use”; antitrust exemptions for cartel paywalls; and more. Jeff Jarvis hates it, but Alexander Howard found something to love in the proposal that the government “maximize the easy accessibility of government information” to help journalists find and investigate stories more easily. (via Jose Antonio Vargas)
European Economic Crisis, Scaling Guardian API, Cheerful Pessimism, and Science Mapping
- Lending Merry-Go-Round — these guys have been Australia’s sharpest satire for years, filling the role of the Daily Show. Here they ask some strong questions about the state of Europe’s economies … (via jdub on Twitter)
- What’s Powering the Guardian’s Content API — Scala and Solr/Lucene on EC2 is the short answer. The long answer reveals the details of their setup, including some of the indexing tricks that mean Solr can index all their content in just an hour. (via Simon Willison)
- What I Learned About Engineering from the Panama Canal (Pete Warden) — I consider myself a cheerful pessimist. I’ve been through enough that I know how steep the odds of success are, but I’ve made a choice that even a hopeless fight in a good cause is worthwhile. What a lovely attitude!
- Mapping the Evolution of Scientific Fields (PLoS ONE) — clever use of data. We build an idea network consisting of American Physical Society Physics and Astronomy Classification Scheme (PACS) numbers as nodes representing scientific concepts. Two PACS numbers are linked if there exist publications that reference them simultaneously. We locate scientific fields using a community finding algorithm, and describe the time evolution of these fields over the course of 1985-2006. The communities we identify map to known scientific fields, and their age depends on their size and activity. We expect our approach to quantifying the evolution of ideas to be relevant for making predictions about the future of science and thus help to guide its development.
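
The network construction described in the PLoS ONE abstract can be sketched in a few lines of Python. This is a toy illustration, not the paper's method: the publication records and shortened PACS-style codes are made up, and plain connected components stand in for the community finding algorithm, which the abstract does not name.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample data: each publication lists the PACS-style codes it references.
publications = [
    {"year": 1990, "pacs": ["03.65", "03.67"]},
    {"year": 1995, "pacs": ["03.67", "42.50"]},
    {"year": 2001, "pacs": ["98.80", "95.35"]},
    {"year": 2004, "pacs": ["98.80", "95.36", "95.35"]},
]


def build_cooccurrence_graph(pubs):
    """Link two codes whenever a single publication references both."""
    graph = defaultdict(set)
    for pub in pubs:
        codes = sorted(set(pub["pacs"]))
        for a, b in combinations(codes, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Group mutually reachable nodes (a crude stand-in for community finding)."""
    seen = set()
    components = []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(graph[node] - seen)
        components.append(sorted(group))
    return components


graph = build_cooccurrence_graph(publications)
communities = connected_components(graph)
print(communities)
```

On real data a single connected component would likely swallow most codes, which is why a proper community detection method would be needed in place of this stand-in.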
Personalised Healthcare, Academic Link Shorteners, Journalism Futures, Security
- Genome Scan Gives Man Insight Into Future Health Risks — the first completely mapped genome of a healthy person aimed at predicting future health risks. The scan was conducted by a team of Stanford researchers and cost about $50,000. The researchers say they can now predict [his] risk for dozens of diseases and how he might respond to a number of widely used medicines. Personalized medicine takes a step closer, and all powered by massive computational power.
- Long Handle on Shorted Digital Object — digital object identifiers, and their relationship to shortener services like bit.ly (in which O’Reilly is an investor). The Handle System is relatively inexpensive, but the costs are now higher than the large scale URL shorteners. According to public tax returns, the DOI Foundation pays CNRI about $500,000 per year to run the DOI resolution system. That works out to about 0.7 cents per thousand resolutions. Compare this to Bit.ly, which has attracted $3.5 million of investment and has resolved about 20 billion shortened links- for a cost of about 0.2 cents per thousand. It remains to be seen whether bit.ly will find a sustainable business model; competing directly with DOI is not an impossibility.
- We Are In The Information Business — A well-architected news website leads to content that will keep on providing value, rather than leaving stories to wither away when their immediate news value has faded. Structured content is the stuff that makes a website malleable, rather than cementing you into certain ways of doing things. Structured content is like a big undo button that allows you to reverse decisions and change how your website looks and behaves. Since none of us can predict the future, the freedom to change course as often as we please and not having to worry about escalating legacy costs, well, that’s pretty close to heaven.
- Sacramento Credit Union FAQ — The answers to your Security Questions are case sensitive and cannot contain special characters like an apostrophe, or the words “insert,” “delete,” “drop,” “update,” “null,” or “select.” (via Simon Willison)
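
The restrictions quoted from the credit union FAQ read like a keyword blacklist meant to fend off SQL injection, though that is a guess about their system and not something the FAQ states. A minimal sketch of the usual alternative, parameterized queries, using Python's built-in `sqlite3` module; the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE security_answers (user_id INTEGER, answer TEXT)")

# An answer containing an apostrophe and several of the "forbidden" words.
answer = "O'Brien; DROP TABLE security_answers; --"

# The ? placeholders pass the values as data, so they are never parsed as SQL.
conn.execute(
    "INSERT INTO security_answers (user_id, answer) VALUES (?, ?)",
    (1, answer),
)

stored = conn.execute(
    "SELECT answer FROM security_answers WHERE user_id = ?", (1,)
).fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM security_answers").fetchone()[0]
print(stored, row_count)
```

With placeholders, the answer is stored exactly as typed and the table survives, so there is no need to ban apostrophes or words like "drop" from user input.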
Find the Pretty, Win the Prize, Manage the Data, and Model the Temple
- 0to255 — simple cute colour-generator. (via Hacker News)
- ProPublica Wins Pulitzer Prize (NYTimes) — important landmark in the rise of online journalism. The award is a landmark for ProPublica, founded in 2007, and the other digital news outlets that have sprouted around the country. Over the last few years, the Pulitzer Prize board has relaxed the eligibility rules, allowing news sites to submit work published only online; this year there were many such submissions.
- Big Data Workshop — unconference at the Computer History Museum in Mountain View. (via jchris on Twitter)
- 3D Machu Picchu, Created With LIDAR — viewable in Google Earth, took over 1,200 hours of work. (via skry on Twitter)