- InfluxDB — open-source, distributed, time series, events, and metrics database with no external dependencies.
- Omega (PDF) — flexible, scalable schedulers for large compute clusters. From Google Research.
- Amazon Mines Its Data Trove To Bet on TV’s Next Hit (WSJ) — Amazon produced about 20 pages of data detailing, among other things, how much a pilot was viewed, how many users gave it a 5-star rating and how many shared it with friends.
ENTRIES TAGGED "scale"
Time Series Database, Cluster Schedulers, Structural Search-and-Replace, and TV Data
Amen Break, MySQL Scale, Spooky Source, and Graph Analytics Engine
- The Amen Break (YouTube) — fascinating 20m history of the amen break, a handful of bars of drum solo from a forgotten 1969 song which became the origin of a huge amount of popular music from rap to jungle and commercials, and the contested materials at the heart of sample-based music. Remix it and weep. (via Beta Knowledge)
- The MySQL Ecosystem at Scale (PDF) — nice summary of how MySQL is used by massive users, and where the sweet spots have been found.
- Lab41 (GitHub) — open sourced code from a spook hacklab in Silicon Valley.
- Faunus — open sourced Hadoop-based graph analytics engine for analyzing graphs represented across a multi-machine compute cluster. A breadth-first version of the graph traversal language Gremlin operates on graphs stored in the distributed graph database Titan, in any Rexster-fronted graph database, or in HDFS via various text and binary formats.
Audio Visualization, 3D Printed Toys, Data Center Computing, and Downloading Not Yet Beaten
- github realtime activity — audio triggered by github activity, built with choir.io.
- Makies Hit Shelves at Selfridges — 3d printing business gaining mainstream distribution. Win!
- The Datacenter as Computer — we must treat the datacenter itself as one massive warehouse-scale computer (WSC). We describe the architecture of WSCs, the main factors influencing their design, operation, and cost structure, and the characteristics of their software base. We hope it will be useful to architects and programmers of today’s WSCs, as well as those of future many-core platforms which may one day implement the equivalent of today’s WSCs on a single board. (via Mike Loukides)
- Illegal Downloads Not Erased By Simultaneous Release — Data gathered by TorrentFreak throughout the day reveals that most early downloaders, a massive 16.1%, come from Australia. Down Under the show aired on the pay TV network Foxtel, but it appears that many Aussies prefer to download a copy instead. The same is true for the United States and Canada, with 16% and 9.6% of the total downloads respectively, despite the legal offerings. Unclear whether this represents greater or less downloading than would have happened without simultaneous release.
DEFCON Doco, Global-Scale Networks, Media Goblin, and TCP/IP Legos
- DEFCON Documentary — free download, I’m looking forward to watching it on the flight back to NZ.
- Global-Scale Systems — botnets as example of the scale of networks and systems we’ll have to build but don’t have experience in.
- MediaGoblin — GNU project to build a decentralized alternative to Flickr, YouTube, SoundCloud, etc.
- Teaching TCP/IP Headers with Legos — genius. (via BoingBoing)
Bot graders pass muster, Instagram's small team handles scale, assessing UK open data efforts.
In this week's data news, a look at the performance of automated essay-grading software, scaling Instagram, and an audit of the UK government's open data initiative.
A new look at Yahoo's traffic, the challenge of scaling Tumblr, and a host of visualization guidelines.
In this week's data news: Yahoo visualizes its front page traffic and demographics, why Tumblr is tougher to scale than Twitter, and a look at what you need to consider as you build visualizations.
Data Sets, Data-driven Policy, Task Queues, and 8-Bit Browser
- DSPL: DataSet Publishing Language (Google Code) — a representation language for the data and metadata of datasets. Datasets described in this format can be processed by Google and visualized in the Google Public Data Explorer. XML metadata on CSV, geo-enabled, with linkable data. (via Michal Migurski on Delicious)
- Why is Evidence So Hard for Politicians — Ben Goldacre nails how politicians go about “evidence-based policy making”: So the Minister has cherry picked only the good findings, from only one report, while ignoring the peer-reviewed literature. Most crucially, he cherry-picks findings he likes whilst explicitly claiming that he is fairly citing the totality of the evidence from a thorough analysis. I can produce good evidence that I have a magical two-headed coin, if I simply disregard all the throws where it comes out tails.
- Celery: Distributed Task Queue — asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well. MIT-style licensed, written in Python, RabbitMQ is the recommended message broker. (via Joshua Schachter on Delicious)
- pixelfari — Safari hacked to look like it’s running on an 8-bit computer. This sense of playfulness with the medium is something I love about the best coders. They think “ha, wouldn’t it be funny if …” and then can make it happen.
Amazon as Vendor, Distributed Tasks, Evolutionary Photofitting, and Basic Physics
- The Rise of Amazon Web Services — Stephen O’Grady points out that Amazon has become an enterprise sales company but we don’t treat it as such because we think of it as a retail company that’s dabbling in technology. I think of Amazon as an automation company: they automate and optimize everything, and a data center is just a warehouse for MIPS. (via Matt Asay)
- Celery Project — a distributed task queue. (via joshua on Delicious)
- Memory Upgrade (The Economist) — a photofit system that uses evolutionary algorithms to generate the suspects’ faces, and does clever things like animated distortions to call out features the witness might recall. Technology going beyond automated sketch artists.
- The Particle Adventure: The Fundamentals of Matter and Force — basic physics in easy-to-understand language with illustrations, all in bite-size pieces (and 1998-era web design). I’m pondering what one of these would be like for computers, and whether “how do these actually work?” has the same romance as “how does the world really work?”.
Thumb Drives and the Cloud, FCC APIs, Mining on GFS, Check Your Prose with Scribe
- CloudUSB — a USB key containing your operating environment and your data + a protected folder so nobody can access your data, even if you lose the key + a backup program which keeps a copy of your data on an online disk, with double password protection. (via ferrouswheel on Twitter)
- FCC APIs — for spectrum licenses, consumer broadband tests, census block search, and more. (via rjweeks70 on Twitter)
- Sibyl: A system for large scale machine learning (PDF) — paper from Google researchers on how to build machine learning on top of a system designed for batch processing. (via Greg Linden)
- The Surprisingness of What We Say About Ourselves (BERG London) — I made a chart of word-by-word surprisingness: given the statement so far, could Scribe predict what would come next?
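The "word-by-word surprisingness" idea in the last link can be illustrated with a toy model. This is only a sketch: it swaps in a tiny add-one-smoothed bigram model for Scribe, whose internals are not described here, and the corpus, function names, and `<s>` start marker are all made up for illustration. Surprisingness is measured as the number of bits, -log2 P(word | previous word).

```python
import math
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count how often each word follows each preceding word."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()  # "<s>" marks sentence start
        for prev, word in zip(words, words[1:]):
            follows[prev][word] += 1
    return follows

def surprisal(follows, sentence, vocab_size):
    """Return (word, bits) pairs using add-one smoothed bigram probabilities."""
    words = ["<s>"] + sentence.lower().split()
    result = []
    for prev, word in zip(words, words[1:]):
        counts = follows.get(prev, Counter())
        p = (counts[word] + 1) / (sum(counts.values()) + vocab_size)
        result.append((word, -math.log2(p)))
    return result

# Hypothetical stand-in corpus; a real predictor is trained on far more text.
corpus = [
    "i am a designer",
    "i am a writer",
    "i am a designer and writer",
]
model = train_bigrams(corpus)
vocab = {w for s in corpus for w in s.lower().split()}
scores = surprisal(model, "i am a robot", len(vocab))

for word, bits in scores:
    print(f"{word:>8}  {bits:.2f} bits")
```

In this toy run the unseen word "robot" scores the most bits, which is the kind of spike a word-by-word chart would show wherever a statement departs from what the predictor expects.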