ENTRIES TAGGED "data mining"
Data Privacy, Journalism and Dataviz, Web Shell, and Kindle Singles
- ‘Scrapers’ Dig Deep for Data on Web (WSJ) — our users’ data comprise a valuable resource to mine and sell, but so do their kidneys. The data world faces serious issues with informed consent, control, and exploitation: it’s not just a shiny new business model; it can also leave people feeling very violated. Again, if you’re not paying for it then you’re the product and not the customer. The majority of humanity is not conscious of the difference between “user” and “customer”. (via Mike Brown on Twitter)
- Journalism in the Age of Data (Video) — Stanford video, with annotations and links, on the challenge of using dataviz as a storytelling medium. (via Ben Goldacre on Twitter)
- webshell (GitHub) — open source (Apache-licensed) console utility, requiring node.js, for debugging and understanding HTTP connections. (via Chris Shiflett on Twitter, who prefers it to yesterday’s htty)
- Amazon to Launch Kindle Singles (press release) — shorter-form works (think: novellas) as a format to expand publishing market rather than shrink it. Damn near every business book ever written should have been this size instead of 300 pages of tedium.
European Economic Crisis, Scaling Guardian API, Cheerful Pessimism, and Science Mapping
- Lending Merry-Go-Round — these guys have been Australia’s sharpest satire for years, filling the role of the Daily Show. Here they ask some strong questions about the state of Europe’s economies … (via jdub on Twitter)
- What’s Powering the Guardian’s Content API — Scala and Solr/Lucene on EC2 is the short answer. The long answer reveals the details of their setup, including some of the indexing tricks that mean Solr can index all their content in just an hour. (via Simon Willison)
- What I Learned About Engineering from the Panama Canal (Pete Warden) — I consider myself a cheerful pessimist. I’ve been through enough that I know how steep the odds of success are, but I’ve made a choice that even a hopeless fight in a good cause is worthwhile. What a lovely attitude!
- Mapping the Evolution of Scientific Fields (PLoS ONE) — clever use of data. We build an idea network consisting of American Physical Society Physics and Astronomy Classification Scheme (PACS) numbers as nodes representing scientific concepts. Two PACS numbers are linked if there exist publications that reference them simultaneously. We locate scientific fields using a community finding algorithm, and describe the time evolution of these fields over the course of 1985-2006. The communities we identify map to known scientific fields, and their age depends on their size and activity. We expect our approach to quantifying the evolution of ideas to be relevant for making predictions about the future of science and thus help to guide its development.
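The idea network in the last link is easy to try on toy data. What follows is a minimal sketch, not the paper's method: the classification codes are made up, and plain connected components stand in for the community finding algorithm, which the abstract does not name.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(papers):
    """Link two codes whenever a single paper lists both of them."""
    graph = defaultdict(set)
    for codes in papers:
        for a, b in combinations(sorted(set(codes)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Group nodes into connected components.

    This is a crude stand-in for community finding: it only separates
    codes that never co-occur, directly or through a chain of papers.
    """
    seen = set()
    components = []
    for start in sorted(graph):
        if start in seen:
            continue
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        components.append(component)
    return components


# Hypothetical toy data: each inner list holds the codes on one paper.
papers = [
    ["A1", "A2"],
    ["A2", "A3"],
    ["B1", "B2"],
]
graph = build_cooccurrence_graph(papers)
fields = connected_components(graph)
print(fields)
```

On real data almost everything ends up in one giant component, which is why a proper community finding algorithm is needed; the graph-building step stays the same.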
New Take on Ubicomp, Language Insight, Sexy Viz, and iPad Usability
- People are Walking Architecture — presentation by Matt Jones of BERG, taking a new lens to this AR/ubicomp/whatever-it-is-today world. “[Mobile phones are] a whole toy box full of playful, inventive strategies for exploring cities ….”
- Lexicalist — insight into geographic and age distribution of language use, based on Twitter data. (via Language Log)
- Advanced Visualization Techniques — nice overview of some non-standard visualization techniques. Short shameful confession: I love polar dendrograms with a passion. These techniques are to visualizers as algorithms and data structures to programmers: each is used in specific circumstances and compromises some things to gain in others. (via Flowing Data)
- iPad Usability Report (Nielsen-Norman Group) — 93-page report based on user studies. The iPad etched-screen aesthetic does look good. No visual distractions or nerdy buttons. The penalty for this beauty is the re-emergence of a usability problem we haven’t seen since the mid-1990s: Users don’t know where they can click. For the last 15 years of Web usability research, the main problems have been that users don’t know where to go or which option to choose — not that they don’t even know which options exist. With iPad UIs, we’re back to this square one. (via Andrew Savikas)
Open Facebook, Internet Stats, Handling Interviews, and Textual Relationships
- Don’t Simply Build a More Open Facebook, Build a Better One — Most people don’t care so much about whether technology is “open” or “closed” so long as it works. (Case in point: iPhone.) Rather than starting your plans by picking which “open” standards you’ll use, start by designing a better social networking service and then determine how “open” specs will help you build that service. (via David Recordon)
- Internet Stats from Google — very nice categorized factoids about internet use, technology, trends, etc. 64% of C-level executives conduct six or more searches per day to locate business information.
- Qualitative Methods for IS Research — summary of qualitative methods (interviews, documents, observation data) as applied to IS. Written for academics, so you have to choke back passive voice vomit (sorry, “passive voice vomit must be choked back”) but it’s got lots of useful information on approaches and tools. (via johnny723 on Twitter)
- Social Signaling and Language Use — turns out the stopwords like “to”, “be”, and “on” are the ones that indicate manager-subordinate relationships. In so many fields I see again and again that you keep data at each stage of transformation, because transforming for one use prevents others. (via terrycojones on Twitter)
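The point about keeping data at each stage of transformation can be shown in a few lines. This is a rough sketch under stated assumptions: the stopword list is a tiny made-up one and `process` is a hypothetical helper, not anything from the linked research. A pipeline that kept only the `content` stage would have thrown away exactly the words that study relies on.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"to", "be", "on", "the", "a", "of"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def process(text):
    """Return every stage of the pipeline, not just the final one."""
    tokens = tokenize(text)
    content = [t for t in tokens if t not in STOPWORDS]
    function_counts = Counter(t for t in tokens if t in STOPWORDS)
    return {
        "raw": text,                         # untouched input
        "tokens": tokens,                    # all words, stopwords included
        "content": content,                  # what a search index would keep
        "function_counts": function_counts,  # what the stopword analysis needs
    }


stages = process("Please send the report to me on Friday")
print(stages["content"])
print(stages["function_counts"])
```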
Gmail CRM, Django Best Practices, Stats-Think, and WoW Number Crunching
- Rapportive — a simple social CRM built into Gmail. They replace the ads in Gmail with photos, bio, and info from social media sites. (via ReadWrite Web)
- Best Practices in Web Development with Django and Python — great set of recommendations. (via Jon Udell’s article on checklists)
- Think Like a Statistician Without The Math (Flowing Data) — Finally, and this is the most important thing I’ve learned, always ask why. When you see a blip in a graph, you should wonder why it’s there. If you find some correlation, you should think about whether or not it makes any sense. If it does make sense, then cool, but if not, dig deeper. Numbers are great, but you have to remember that when humans are involved, errors are always a possibility. This is basically how to be a scientist: know the big picture, study the details to find deviations, and always ask “why”.
- WoW Armory Data Mining — a blog devoted to data mining on the info from the WoW Armory, which has a lot of data taken from the servers. It’s baseball statistics for World of Warcraft. Fascinating! (via Chris Lewis)
Visualising Tweeted Data, Voting Licenses, Space-Time Mining, and Processing for the iPhone
- Visualising Time Series Data in Tweets — builds sparklines from Twitter Data tweets.
- GPL Inadequate for Open Source Voting Software — the GPL prohibits “additional restrictions”, but the US Government has requirements for its voting software that fall into that category. An interesting read. The solution will be a new open source license (sigh) but one that meets their specific and real needs. (via Glyn Moody)
- SaTScan — free software that analyzes spatial, temporal and space-time data using the spatial, temporal, or space-time scan statistics. It is designed for any of the following interrelated purposes: Perform geographical surveillance of disease, to detect spatial or space-time disease clusters, and to see if they are statistically significant; Test whether a disease is randomly distributed over space, over time or over space and time; Evaluate the statistical significance of disease cluster alarms; Perform repeated time-periodic disease surveillance for early detection of disease outbreaks. (via ancodezambia on Delicious)
- iProcessing — a Processing.js port to iPhone plus application framework library that lets you write iPhone apps in Processing. (via cityofsound on Delicious)
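For the sparklines link at the top of this list, here is a minimal sketch of the rendering step, using Unicode block characters. The comma-separated input format and both function names are assumptions for illustration; the linked project's actual tweet format is not reproduced here.

```python
BARS = "▁▂▃▄▅▆▇█"


def parse_series(text):
    """Pull numbers out of a comma-separated string (hypothetical format)."""
    return [float(part) for part in text.split(",") if part.strip()]


def sparkline(values):
    """Map each value onto one of eight bar heights, scaled to the data range."""
    if not values:
        return ""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A flat series has no range to scale against; draw the lowest bar.
        return BARS[0] * len(values)
    scale = len(BARS) - 1
    return "".join(BARS[round((v - lo) / span * scale)] for v in values)


line = sparkline(parse_series("1, 5, 8, 3"))
print(line)
```

Scaling to the minimum and maximum of the series makes small series readable, at the cost of hiding absolute magnitude, which is the usual sparkline trade-off.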
Open Source Government Tools, Insider Journalism, Open Clip Art, Mining Facebook Profiles
- OSOR.eu — The OSOR is a platform where public administrations can exchange information and experiences and collaborate in developing free and open source software. The platform has managed to bring together more than 2000 such open source software applications in just sixteen months after its launch. (via EUPractice and vikram_nz on Twitter)
- Inside Glitch — writeup of behind-the-scenes during the development of the game Glitch, the new project from Stewart Butterfield, Cal Henderson, Eric Costello, and Serguei Mourachov. The historical details themselves are banal, but what’s interesting is how the reporter got access: “I’ll let you determine when the piece runs (but not editorial control over what goes in it), and in return I get to meet regularly with you and you tell me all.” It’s analogous to the Newsweek tell-alls that come out after the election. (via Waxy)
- Open Clip Art — archive of public domain-contributed clip art. (via Mark Osbourne)
- How To Split Up The US — clique analysis from 210 million public Facebook profiles. Some of these clusters are intuitive, like the old south, but there’s some surprises too, like Missouri, Louisiana and Arkansas having closer ties to Texas than Georgia. To make sense of the patterns I’m seeing, I’ve marked and labeled the clusters, and added some notes about the properties they have in common.