ENTRIES TAGGED "data mining"
Google Price Index, The High Cost of Freemium, Literate Programming, Results Clustering
- Google Creates New Inflation Measure (The Guardian) — The Google Price Index will be based on the cost of goods sold online and could use real-time search data to forecast official figures. Clever use of unique data, but can the GPI findings be reproduced by another agency? I do like the idea of moving national statistical measures into real-time.
- How To Break The Trust of Your Customers In Just One Day — some horrifying revelations about how freemium worked for Chargify and their customers: Over the past year, we discovered that the customer that never paid had the highest support load. [...] Everyone’s always talking about freemium, but very few people actually use it, and we discovered this in looking at our customers for the past year. The reality was that less than 0.4% of customers had any sizeable number of free customers on their accounts. (via Hacker News)
- Annotated Backbone.js — very readable literate programming. (via Simon Willison)
- Carrot2 — open source results clustering engine.
Data Privacy, Journalism and Dataviz, Web Shell, and Kindle Singles
- ‘Scrapers’ Dig Deep for Data on Web (WSJ) — our users’ data comprise a valuable resource to mine and sell, but so do their kidneys. The data world faces serious issues with informed consent, control, and exploitation–it’s not just a shiny new business model, it can also leave people feeling very violated. Again, if you’re not paying for it then you’re the product and not the customer. The majority of humanity is not conscious of the difference between “user” and “customer”. (via Mike Brown on Twitter)
- Journalism in the Age of Data (Video) — Stanford video, with annotations and links, on the challenge of using dataviz as a storytelling medium. (via Ben Goldacre on Twitter)
- webshell (GitHub) — open source (Apache-licensed) console utility, requiring node.js, for debugging and understanding HTTP connections. (via Chris Shiflett on Twitter, who prefers it to yesterday’s htty)
- Amazon to Launch Kindle Singles (press release) — shorter-form works (think: novellas) as a format to expand publishing market rather than shrink it. Damn near every business book ever written should have been this size instead of 300 pages of tedium.
European Economic Crisis, Scaling Guardian API, Cheerful Pessimism, and Science Mapping
- Lending Merry-Go-Round — these guys have been Australia’s sharpest satire for years, filling the role of the Daily Show. Here they ask some strong questions about the state of Europe’s economies … (via jdub on Twitter)
- What’s Powering the Guardian’s Content API — Scala and Solr/Lucene on EC2 is the short answer. The long answer reveals the details of their setup, including some of the indexing tricks that mean Solr can index all their content in just an hour. (via Simon Willison)
- What I Learned About Engineering from the Panama Canal (Pete Warden) — I consider myself a cheerful pessimist. I’ve been through enough that I know how steep the odds of success are, but I’ve made a choice that even a hopeless fight in a good cause is worthwhile. What a lovely attitude!
- Mapping the Evolution of Scientific Fields (PLoS ONE) — clever use of data. We build an idea network consisting of American Physical Society Physics and Astronomy Classification Scheme (PACS) numbers as nodes representing scientific concepts. Two PACS numbers are linked if there exist publications that reference them simultaneously. We locate scientific fields using a community finding algorithm, and describe the time evolution of these fields over the course of 1985-2006. The communities we identify map to known scientific fields, and their age depends on their size and activity. We expect our approach to quantifying the evolution of ideas to be relevant for making predictions about the future of science and thus help to guide its development.
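The idea network described in the last link can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the papers and their codes below are made up, the function names are hypothetical, and connected components stand in for the community finding algorithm, which the excerpt does not name.

```python
from collections import defaultdict
from itertools import combinations


def build_idea_network(papers):
    """Link two codes whenever one paper lists both; weight = number of such papers."""
    weights = defaultdict(int)
    for codes in papers:
        for a, b in combinations(sorted(set(codes)), 2):
            weights[(a, b)] += 1
    return dict(weights)


def connected_groups(weights):
    """Group codes reachable from one another (a crude stand-in for community finding)."""
    neighbours = defaultdict(set)
    for a, b in weights:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, groups = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Hypothetical papers, each tagged with illustrative PACS-style codes.
papers = [
    ["03.65", "03.67"],
    ["03.67", "42.50"],
    ["98.80", "95.35"],
    ["03.65", "03.67", "42.50"],
]
weights = build_idea_network(papers)
groups = connected_groups(weights)
print(groups)
```

A real analysis would swap the grouping step for a proper community detection method and rebuild the network per year to follow how the groups change over time.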
New Take on Ubicomp, Language Insight, Sexy Viz, and iPad Usability
- People are Walking Architecture — presentation by Matt Jones of BERG, taking a new lens to this AR/ubicomp/whatever-it-is-today world. “[Mobile phones are] a whole toy box full of playful, inventive strategies for exploring cities ….”
- Lexicalist — insight into geographic and age distribution of language use, based on Twitter data. (via Language Log)
- Advanced Visualization Techniques — nice overview of some non-standard visualization techniques. Short shameful confession: I love polar dendrograms with a passion. These techniques are to visualizers as algorithms and data structures to programmers: each is used in specific circumstances and compromises some things to gain in others. (via Flowing Data)
- iPad Usability Report (Nielsen-Norman Group) — 93-page report based on user studies. The iPad etched-screen aesthetic does look good. No visual distractions or nerdy buttons. The penalty for this beauty is the re-emergence of a usability problem we haven’t seen since the mid-1990s: Users don’t know where they can click. For the last 15 years of Web usability research, the main problems have been that users don’t know where to go or which option to choose — not that they don’t even know which options exist. With iPad UIs, we’re back to this square one. (via Andrew Savikas)
Open Facebook, Internet Stats, Handling Interviews, and Textual Relationships
- Don’t Simply Build a More Open Facebook, Build a Better One — Most people don’t care so much about whether technology is “open” or “closed” so long as it works. (Case in point: iPhone.) Rather than starting your plans by picking which “open” standards you’ll use, start by designing a better social networking service and then determine how “open” specs will help you build that service. (via David Recordon)
- Internet Stats from Google — very nice categorized factoids about internet use, technology, trends, etc. 64% of C-level executives conduct six or more searches per day to locate business information.
- Qualitative Methods for IS Research — summary of qualitative methods (interviews, documents, observation data) as applied to IS. Written for academics, so you have to choke back passive voice vomit (sorry, “passive voice vomit must be choked back”) but it’s got lots of useful information on approaches and tools. (via johnny723 on Twitter)
- Social Signaling and Language Use — turns out the stopwords like “to”, “be”, and “on” are the ones that indicate manager-subordinate relationships. In so many fields I see again and again that you keep data at each stage of transformation, because transforming for one use prevents others. (via terrycojones on Twitter)
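The point about keeping data at each stage of transformation can be shown with a small Python sketch. The stopword list, the `process` function, and the sample sentence are all assumptions made for illustration; none of them come from the linked study.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real study would use a fuller one.
STOPWORDS = {"to", "be", "on", "the", "a", "of", "and", "i", "you", "we"}


def process(text):
    """Return every stage of the pipeline instead of only the final one."""
    tokens = re.findall(r"[a-z']+", text.lower())
    content = [t for t in tokens if t not in STOPWORDS]
    stop_counts = Counter(t for t in tokens if t in STOPWORDS)
    return {
        "raw": text,                 # original text, untouched
        "tokens": tokens,            # lowercased words, stopwords included
        "content": content,          # what a typical search index would keep
        "stop_counts": stop_counts,  # what the usual cleanup throws away
    }


stages = process("We need to be on the call, and you need to be ready.")
print(stages["content"])
print(stages["stop_counts"].most_common(3))
```

If only `content` had been stored, the counts of "to", "be", and "on" would be gone, and with them the signal the linked post is about.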
GMail CRM, Django Best Practices, Stats-Think, and WoW Number Crunching
- Rapportive — a simple social CRM built into Gmail. They replace the ads in Gmail with photos, bio, and info from social media sites. (via ReadWrite Web)
- Best Practices in Web Development with Django and Python — great set of recommendations. (via Jon Udell’s article on checklists)
- Think Like a Statistician Without The Math (Flowing Data) — Finally, and this is the most important thing I’ve learned, always ask why. When you see a blip in a graph, you should wonder why it’s there. If you find some correlation, you should think about whether or not it makes any sense. If it does make sense, then cool, but if not, dig deeper. Numbers are great, but you have to remember that when humans are involved, errors are always a possibility. This is basically how to be a scientist: know the big picture, study the details to find deviations, and always ask “why”.
- WoW Armory Data Mining — a blog devoted to data mining on the info from the WoW Armory, which has a lot of data taken from the servers. It’s baseball statistics for World of Warcraft. Fascinating! (via Chris Lewis)
Visualising Tweeted Data, Voting Licenses, Space-Time Mining, and Processing for the iPhone
- Visualising Time Series Data in Tweets — builds sparklines from Twitter Data tweets.
- GPL Inadequate for Open Source Voting Software — the GPL prohibits “additional restrictions”, but the US Government has requirements for its voting software that fall into that category. An interesting read. The solution will be a new open source license (sigh) but one that meets their specific and real needs. (via Glyn Moody)
- SaTScan — free software that analyzes spatial, temporal and space-time data using the spatial, temporal, or space-time scan statistics. It is designed for any of the following interrelated purposes: Perform geographical surveillance of disease, to detect spatial or space-time disease clusters, and to see if they are statistically significant; Test whether a disease is randomly distributed over space, over time or over space and time; Evaluate the statistical significance of disease cluster alarms; Perform repeated time-periodic disease surveillance for early detection of disease outbreaks. (via ancodezambia on Delicious)
- iProcessing — a Processing.js port to iPhone plus application framework library that lets you write iPhone apps in Processing. (via cityofsound on Delicious)
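For the sparklines mentioned in the first link of this group, here is a minimal Python sketch of turning a numeric series into a text sparkline. It is not the linked tool's code; the `sparkline` function and the sample readings are hypothetical.

```python
BARS = "▁▂▃▄▅▆▇█"


def sparkline(values):
    """Map each value to one of eight block characters, scaled between min and max."""
    if not values:
        return ""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A flat series has no shape to show; draw the lowest bar throughout.
        return BARS[0] * len(values)
    top = len(BARS) - 1
    return "".join(BARS[round((v - lo) / span * top)] for v in values)


# Hypothetical readings, e.g. numbers parsed out of a run of tweets.
readings = [0, 1, 2, 3, 4, 5, 6, 7]
print(sparkline(readings))
```

Scaling against the series' own minimum and maximum keeps the shape visible at any magnitude, at the cost of hiding absolute values.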