- R Studio — AGPLv3-licensed IDE for R. It brings your R console, source code, plots, help, history, and workspace browser into one cohesive package. We’ve added some neat productivity features like a searchable endless command history, function/symbol completion, data import dialog with preview, one-click Sweave compile, and more. Source on github. Built as a web-app on Google AppEngine, from Joe Cheng who did Windows Live Writer at Microsoft. (via DeWitt Clinton)
- Adventures in Participatory Audience — Nina Simon helped thirteen students produce three projects to encourage participation in museum audiences: Xavier, Stringing Connections, and Dirty Laundry. My favourite was Dirty Laundry, where people shared secrets connected to works of art. Nina’s description of what she learned has some nuggets: friendly faces welcoming people in gets better response than a card with instructions, and I am still flummoxed as to what would make someone admit to an affair or bad parenting in a sterile art gallery, or the devastating one that read, “I avoid the important, difficult conversations with those I love the most.” Audience participation in the real world has lessons on what works for those who would build social software.
- Why Generic Machine Learning Fails — Returns for increasing data size come from two sources: (1) the importance of tails and (2) the cost of model innovation. When tails are important, or when model innovation is difficult relative to cost of data capture, then more data is the answer. [...] Machine learning is not undifferentiated heavy lifting, it’s not commoditizable like EC2, and closer to design than coding. The Netflix prize is a good example: the last 10% reduction in RMSE wasn’t due to more powerful generic algorithms, but rather due to some very clever thinking about the structure of the problem; observations like “people who rate a whole slew of movies at one time tend to be rating movies they saw a long time ago” from BellKor.
- Anatomy of a Crushing — Maciej Ceglowski describes how pinboard.in survived the flood of Delicious émigrées. It took several rounds of rewrites to get the simple tag cloud script right, and this made me very skittish about touching any other parts of the code over the next few days, even when the fixes were easy and obvious. The part of my brain that knew what to do no longer seemed to be connected directly to my hands.
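For anyone who hasn't met the RMSE figure quoted in the Netflix prize item above, here is a minimal sketch of how root-mean-square error is computed and what a relative reduction looks like. The ratings and both sets of predictions are invented for illustration; they are not Netflix data, and `rmse` is just a helper name chosen here.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between two equal-length lists of ratings."""
    if len(predicted) != len(actual):
        raise ValueError("lists must have the same length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical ratings and predictions, made up for this example.
actual = [4, 3, 5, 2, 4]
baseline = [3.5, 3.5, 4.0, 3.0, 3.5]   # a cruder model
improved = [3.8, 3.2, 4.5, 2.5, 3.9]   # a model that tracks the ratings more closely

baseline_error = rmse(baseline, actual)
improved_error = rmse(improved, actual)
reduction = 1 - improved_error / baseline_error

print(f"baseline RMSE {baseline_error:.3f}, improved RMSE {improved_error:.3f}")
print(f"relative reduction {reduction:.1%}")
```

Because errors are squared before averaging, a few badly wrong predictions dominate the score, which is one reason structural insights about when and how people rate can matter more than a more powerful generic algorithm.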
ENTRIES TAGGED "scale"
Bot graders pass muster, Instagram's small team handles scale, assessing UK open data efforts.
In this week's data news, a look at the performance of automated essay-grading software, scaling Instagram, and an audit of the UK government's open data initiative.
A new look at Yahoo's traffic, the challenge of scaling Tumblr, and a host of visualization guidelines.
In this week's data news: Yahoo visualizes its front page traffic and demographics, why Tumblr is tougher to scale than Twitter, and a look at what you need to consider as you build visualizations.
Data Sets, Data-driven Policy, Task Queues, and 8-Bit Browser
- DSPL: DataSet Publishing Language (Google Code) — a representation language for the data and metadata of datasets. Datasets described in this format can be processed by Google and visualized in the Google Public Data Explorer. XML metadata on CSV, geo-enabled, with linkable data. (via Michal Migurski on Delicious)
- Why is Evidence So Hard for Politicians — Ben Goldacre nails how politicians go about “evidence-based policy making”: So the Minister has cherry picked only the good findings, from only one report, while ignoring the peer-reviewed literature. Most crucially, he cherry-picks findings he likes whilst explicitly claiming that he is fairly citing the totality of the evidence from a thorough analysis. I can produce good evidence that I have a magical two-headed coin, if I simply disregard all the throws where it comes out tails.
- Celery: Distributed Task Queue — asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well. MIT-style licensed, written in Python, RabbitMQ is the recommended message broker. (via Joshua Schachter on Delicious)
- pixelfari — Safari hacked to look like it’s running on an 8-bit computer. This sense of playfulness with the medium is something I love about the best coders. They think “ha, wouldn’t it be funny if …” and then can make it happen.
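To make the "asynchronous task queue" idea in the Celery item concrete, here is a rough sketch using only Python's standard library. It is not Celery's API: there is no message broker and no distribution across machines, and `run_tasks` and `add` are hypothetical names chosen for this example. It only shows the basic shape, where a producer puts jobs on a queue and workers pull them off and run them.

```python
import queue
import threading

def run_tasks(tasks, num_workers=2):
    """Run (function, args) pairs on worker threads; return results in task order."""
    pending = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = pending.get()
            if item is None:            # sentinel: no more work for this worker
                pending.task_done()
                break
            index, func, args = item
            value = func(*args)
            with lock:
                results[index] = value
            pending.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for index, (func, args) in enumerate(tasks):
        pending.put((index, func, args))
    for _ in threads:
        pending.put(None)               # one sentinel per worker
    pending.join()                      # wait until every queued item is processed
    for t in threads:
        t.join()
    return [results[i] for i in range(len(tasks))]

def add(x, y):
    return x + y

print(run_tasks([(add, (1, 2)), (add, (10, 20)), (pow, (2, 8))]))
```

The sketch leaves out error handling and retries: a task that raises would kill its worker and leave the queue waiting. A real task queue adds those, plus persistence and scheduling.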
Amazon as Vendor, Distributed Tasks, Evolutionary Photofitting, and Basic Physics
- The Rise of Amazon Web Services — Stephen O’Grady points out that Amazon has become an enterprise sales company but we don’t treat it as such because we think of it as a retail company that’s dabbling in technology. I think of Amazon as an automation company: they automate and optimize everything, and a data center is just a warehouse for MIPS. (via Matt Asay)
- Celery Project — a distributed task queue. (via joshua on Delicious)
- Memory Upgrade (The Economist) — a photofit system that uses evolutionary algorithms to generate the suspects’ faces, and does clever things like animated distortions to call out features the witness might recall. Technology going beyond automated sketch artists.
- The Particle Adventure: The Fundamentals of Matter and Force — basic physics in easy-to-understand language with illustrations, all in bite-size pieces (and 1998-era web design). I’m pondering what one of these would be like for computers, and whether “how do these actually work?” has the same romance as “how does the world really work?”.
Thumb Drives and the Cloud, FCC APIs, Mining on GFS, Check Your Prose with Scribe
- CloudUSB — a USB key containing your operating environment and your data + a protected folder so nobody can access your data, even if you lose the key + a backup program which keeps a copy of your data on an online disk, with double password protection. (via ferrouswheel on Twitter)
- FCC APIs — for spectrum licenses, consumer broadband tests, census block search, and more. (via rjweeks70 on Twitter)
- Sibyl: A system for large scale machine learning (PDF) — paper from Google researchers on how to build machine learning on top of a system designed for batch processing. (via Greg Linden)
- The Surprisingness of What We Say About Ourselves (BERG London) — I made a chart of word-by-word surprisingness: given the statement so far, could Scribe predict what would come next?
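One way to put a number on word-by-word "surprisingness" is surprisal: the negative log of the probability a model gives the next word, given what came before. The sketch below uses a tiny bigram model with add-one smoothing. That is an assumption made for illustration, not a description of how Scribe works, and the training sentences are invented.

```python
import math
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count how often each word follows each previous word."""
    follows = defaultdict(Counter)
    vocab = set()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()   # <s> marks the sentence start
        vocab.update(words)
        for prev, word in zip(words, words[1:]):
            follows[prev][word] += 1
    return follows, vocab

def surprisal(sentence, follows, vocab):
    """Return (word, bits) pairs; more bits means the word was less predictable."""
    words = ["<s>"] + sentence.lower().split()
    scored = []
    for prev, word in zip(words, words[1:]):
        counts = follows[prev]
        # Add-one smoothing so unseen words get a finite score.
        prob = (counts[word] + 1) / (sum(counts.values()) + len(vocab) + 1)
        scored.append((word, -math.log2(prob)))
    return scored

# Hypothetical training statements, invented for this example.
follows, vocab = train_bigrams(["i am a writer", "i am a coder", "i am a parent"])
for word, bits in surprisal("i am a plumber", follows, vocab):
    print(f"{word:10s} {bits:.2f} bits")
```

Words the model has seen in that position score low, and the unseen final word scores highest, which is the kind of spike a surprisingness chart would show.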
Non-Profits, UK Legislation, Mobile Web Variation, and Scaling
- How to Raise Funds for Non-Profits (Joi Ito) — One organization sent a message to all of their donors during the Haiti crisis asking them to give to an NGO that they had vetted. They didn’t ask for any money for themselves. This had a hugely positive effect and the donors’ trust in the group increased. Wallets aren’t zero sum.
- legislation.gov.uk — very elegant legislation system for the UK. Check out the annual analysis, for example. (via rchards on Twitter)
- The Great WebKit Comparison Table — So far I’ve tested 14 different mobile WebKits, and they are all slightly different. You can find the details below. (via Andrew Savikas)
- Node and Scaling in the Small vs Scaling in the Large (al3x) — In a system of no significant scale, basically anything works. The power of today’s hardware is such that, for example, you can build a web application that supports thousands of users using one of the slowest available programming languages, brutally inefficient datastore access and storage patterns, zero caching, no sensible distribution of work, no attention to locality, etc. etc. Basically, you can apply every available anti-pattern and still come out the other end with a workable system, simply because the hardware can move faster than your bad decision-making.
- Tondo Interactive Table to Analyze Medical Errors (MedGadget) — use of a multitouch table to help clinical staff identify and track medical errors. (via IVLINE on Twitter)
- Steve Huffman Lessons Learned While at Reddit (SlideShare) — uptime and scale. It’s interesting that most everyone reinvents tuples as a way to scale databases, hence the popularity of NoSQL systems.
- Hernando de Soto: Shadow Economies — de Soto is an economist, and this ends up talking about the need for transparency and open data. As long as you don’t know who owns the greatest amount of your assets, there’s no info as to who owns what, who is related to what, you have a shadow economy. We live in one, and it has as a characteristic a permanent credit crunch. We know more about it than you do. Credit crunch is where you don’t know who you’d be lending to, so you don’t lend. It’s permanent, we live with it, and now you’re going to have to learn to live with it too, because until you know who is solvent how can you give anybody credit? You’re flying blind. (via Jon Udell)
Fair Use Economy, Deconstituted Appliances, 3D Vision, Redis for Fun and Profit
- Fair Use in the US Economy (PDF) — prepared by IT lobby in the US, it’s the counterpart to Big ©’s fictitious billions of dollars of losses due to file sharing. Take each with a grain of salt, but this is interesting because it talks about the industries and businesses that the fair use laws make possible.
- Disassembled Household Appliances — neat photos of the pieces in common equipment like waffle irons, sandwich makers, can openers, etc. (via evilmadscientist)
- GelSight — gel block on a sheet of glass, lit from below with lights and then scanned with cameras, lets you easily capture 3D qualities of the objects pressed into it. Very cool demo: you can see fingerprints, pulse, and even make out designs on a $100 bill.
- Redis Tutorial (Simon Willison) — Redis is a very fast collection of useful behaviours wrapped around a distributed key-value store. You get locks, IDs, counters, sets, lists, queues, replication, and more.
Government Dashboard, Science Code Errors, Scaling Online Games, Information Theory
- Track DC — informative drill-down report from Washington DC government about the different departments. (via Sunlight Labs blog)
- Errors in Scientific Software — a 1994 study of scientific software that found inconsistent interfaces (1 in 7 for Fortran, 1 in 37 for C) and poor use of arithmetic such that significant figures declined from 6sf in the data to 1sf in the result. (via “If you’re going to do good science, release the computer code too” in the Guardian)
- How Farmville Scales — 75M players/month (28M/day), 1/4 of disk activity is writes, 50% higher load spikes, 3G/s of traffic goes between Farmville and Facebook at peak, LAMP stack, nagios+munin+puppet. (via Hacker News)
- Mathematical Philology — when two manuscripts of the same text differ, which is correct? This PLoSONE paper looked at all such discrepancies in Lucretius’s De Rerum Natura and found that the traditional principle of choosing the more difficult reading (on the grounds that errors are from humans unconsciously simplifying) has a strong information theory justification for it. Interesting to see this less than a week after an MIT Technology Review article on quantum teleportation remarked, There is a growing sense that the properties of the universe are best described not by the laws that govern matter but by the laws that govern information.
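The scientific software item above mentions arithmetic that throws away significant figures. The study's specific causes aren't described here, so the following is only one well-known mechanism, chosen as an assumption for illustration: subtracting two nearly equal floating-point numbers. Both function names are hypothetical. Mathematically the two functions compute the same quantity, 1 - cos(x), but the naive form loses essentially all its precision for small x.

```python
import math

def one_minus_cos_naive(x):
    """Direct formula: subtracts two nearly equal numbers when x is small."""
    return 1.0 - math.cos(x)

def one_minus_cos_stable(x):
    """Same quantity via the identity 1 - cos(x) = 2 * sin(x/2)**2, no subtraction."""
    return 2.0 * math.sin(x / 2.0) ** 2

x = 1e-8
reference = x * x / 2.0   # leading Taylor term; accurate to many digits at this x

naive = one_minus_cos_naive(x)
stable = one_minus_cos_stable(x)

print(f"reference {reference:.6e}")
print(f"naive     {naive:.6e}  relative error {abs(naive - reference) / reference:.1e}")
print(f"stable    {stable:.6e}  relative error {abs(stable - reference) / reference:.1e}")
```

The inputs are good to about sixteen digits in both cases. Only the order of operations differs, which is why this kind of error survives code review so easily.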