- Nest Protect Teardown (Sparkfun) — initial teardown of another piece of domestic industrial Internet.
- Logs — The distributed log can be seen as the data structure which models the problem of consensus. The author isn’t kidding when he calls it “real-time data’s unifying abstraction”.
- Mining the Web to Predict Future Events (PDF) — Mining 22 years of news stories to predict future events. (via Ben Lorica)
- Nanocubes — a fast datastructure for in-memory data cubes developed at the Information Visualization department at AT&T Labs – Research. Nanocubes can be used to explore datasets with billions of elements at interactive rates in a web browser, and in some cases it uses sufficiently little memory that you can run a nanocube in a modern-day laptop. (via Ben Lorica)
Downloading Kindle Highlights, Balanced Photos, Long Form, and Crap Regulation
- bookcision — bookmarklet to download your Kindle highlights. (via Nelson Minar)
- Algorithm for a Perfectly Balanced Photo Gallery — remember this when it comes time to lay out your 2013 “Happy Holidays!” card.
- Long Stories (Fast Company Labs) — Our strategy was to still produce feature stories as discrete articles, but then to tie them back to the stub article with lots of prominent links, again taking advantage of the storyline and context we had built up there, making our feature stories sharper and less full of catch-up material.
- Massachusetts Software Tax (Fast Company Labs) — breakdown of why this crappily-written law is bad news for online companies. Laws are the IEDs of the Internet: it’s easy to make massively value-destroying regulation and hard to get it fixed.
Algorithmic Optimisation, 3D Scanners, Corporate Open Source, and Data Dives
- Unhappy Truckers and Other Algorithmic Problems — Even the insides of vans are subjected to a kind of routing algorithm; the next time you get a package, look for a three-letter code, like “RDL.” That means “rear door left,” and it is so the driver has to take as few steps as possible to locate the package. (via Sam Minnee)
- Fuel3D: A Sub-$1000 3D Scanner (Kickstarter) — a point-and-shoot 3D imaging system that captures extremely high resolution mesh and color information of objects. Fuel3D is the world’s first 3D scanner to combine pre-calibrated stereo cameras with photometric imaging to capture and process files in seconds.
- Corporate Open Source Anti-Patterns (YouTube) — Bryan Cantrill’s talk, slides here. (via Daniel Bachhuber)
- Hacking for Humanity (The Economist) — Getting PhDs and data specialists to donate their skills to charities is the idea behind the event’s organizer, DataKind UK, an offshoot of the American nonprofit group.
Distributed Browser-Based Computation, Streaming Regex, Preventing SQL Injections, and SVM for Faster Deep Learning
- WeevilScout — browser app that turns your browser into a worker for distributed computation tasks. See the poster (PDF). (via Ben Lorica)
- sregex (Github) — A non-backtracking regex engine library for large data streams. See also slide notes from a YAPC::NA talk. (via Ivan Ristic)
- Bobby Tables — a guide to preventing SQL injections. (via Andy Lester)
- Deep Learning Using Support Vector Machines (Arxiv) — we are proposing to train all layers of the deep networks by backpropagating gradients through the top level SVM, learning features of all layers. Our experiments show that simply replacing softmax with linear SVMs gives significant gains on datasets MNIST, CIFAR-10, and the ICML 2013 Representation Learning Workshop’s face expression recognition challenge. (via Olivier Grisel)
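The core advice in the Bobby Tables guide listed above is to pass user input to the database as a bound parameter instead of splicing it into the SQL string. Here is a minimal sketch using Python's standard `sqlite3` module; the `students` table, the `find_student` helper, and the sample names are made up for illustration and are not taken from the guide.

```python
import sqlite3


def find_student(conn, name):
    """Look up students by name using a bound parameter (hypothetical helper)."""
    # The ? placeholder makes the driver treat `name` as data, never as SQL text.
    cur = conn.execute("SELECT id, name FROM students WHERE name = ?", (name,))
    return cur.fetchall()


# Throwaway in-memory database with an illustrative schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO students (name) VALUES (?)", [("Alice",), ("Bob",)])

# Input that would break a query built by string concatenation.
hostile = "Robert'); DROP TABLE students;--"
rows = find_student(conn, hostile)

# The hostile string is only compared against the name column.
remaining = conn.execute("SELECT COUNT(*) FROM students").fetchone()[0]
```

With the placeholder in place, the hostile input simply matches no rows and the table is left untouched. Other database drivers use different placeholder styles, so check the documentation for the one you use.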
The importance of data science tools that let organizations easily combine, deploy, and maintain algorithms
Data science often depends on data pipelines that involve acquiring, transforming, and loading data. (If you’re fortunate, most of the data you need is already in usable form.) Data needs to be assembled and wrangled before it can be visualized and analyzed. Many companies have data engineers (adept at using workflow tools like Azkaban and Oozie) who manage pipelines for data scientists and analysts.
A workflow tool for data analysts: Chronos from airbnb
A raw bash scheduler written in Scala, Chronos is flexible, fault-tolerant, and distributed (it’s built on top of Mesos). What’s most interesting is that it makes the creation and maintenance of complex workflows more accessible: at least within airbnb, it’s heavily used by analysts.
Job orchestration and scheduling tools contain features that data scientists would appreciate. They make it easy for users to express dependencies (start a job upon the completion of another job), and retries (particularly in cloud computing settings, jobs can fail for a variety of reasons). Chronos comes with a web UI designed to let business analysts define, execute, and monitor workflows: a zoomable DAG highlights failed jobs and displays stats that can be used to identify bottlenecks. Chronos lets you include asynchronous jobs – a nice feature for data science pipelines that involve long-running calculations. It also lets you easily define repeating jobs over a finite time interval, something that comes in handy for short-lived experiments (e.g. A/B tests or multi-armed bandits).
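To make the two ideas above concrete (dependencies between jobs, and retries when a job fails), here is a minimal sketch in plain Python. It is not the Chronos API: `run_workflow`, the job names, and the `max_retries` parameter are hypothetical, and the sketch assumes every job named in `deps` also appears in `jobs`. It runs everything in one process, with none of the distribution, scheduling, or fault tolerance a real tool provides.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(jobs, deps, max_retries=2):
    """Run jobs in dependency order, retrying each failed job.

    jobs: dict mapping a job name to a zero-argument callable (hypothetical).
    deps: dict mapping a job name to the set of jobs it must wait for.
    Returns a dict of job name -> "ok", "failed", or "skipped".
    """
    status = {}
    # Include every job in the graph, even those with no dependencies.
    graph = {name: deps.get(name, set()) for name in jobs}
    for name in TopologicalSorter(graph).static_order():
        # Skip a job if any job it depends on did not succeed.
        if any(status.get(parent) != "ok" for parent in graph[name]):
            status[name] = "skipped"
            continue
        # One initial attempt plus up to max_retries further attempts.
        for _attempt in range(1 + max_retries):
            try:
                jobs[name]()
                status[name] = "ok"
                break
            except Exception:
                status[name] = "failed"
    return status


# Usage: a three-step pipeline whose middle step fails once, then succeeds.
calls = {"load": 0}


def flaky_load():
    calls["load"] += 1
    if calls["load"] < 2:
        raise RuntimeError("transient failure")


result = run_workflow(
    jobs={"extract": lambda: None, "load": flaky_load, "report": lambda: None},
    deps={"load": {"extract"}, "report": {"load"}},
)
```

In this run the `load` job fails on its first attempt, succeeds on the retry, and `report` starts only after `load` has completed. A production scheduler would add delays between retries, persistence, and time-based triggers on top of this ordering logic.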
Comparing Algorithms, Programming & Visual Arts, Data Brokers, and Your Brain on Ebooks
- mlcomp — a free website for objectively comparing machine learning programs across various datasets for multiple problem domains.
- Printing Code: Programming and the Visual Arts (Vimeo) — Rune Madsen’s talk from Heroku’s Waza. (via Andrew Odewahn)
- What Data Brokers Know About You (ProPublica) — excellent run-down on the compilers of big data about us. Where are they getting all this info? The stores where you shop sell it to them.
- Subjective Impressions Do Not Mirror Online Reading Effort: Concurrent EEG-Eyetracking Evidence from the Reading of Books and Digital Media (PLOS ONE) — Comprehension accuracy did not differ across the three media for either group and EEG and eye fixations were the same. Yet readers stated they preferred paper. That preference, the authors conclude, isn’t because it’s less readable. From this perspective, the subjective ratings of our participants (and those in previous studies) may be viewed as attitudes within a period of cultural change.
Comms 101, RoboTurking, Geek Tourism, and Implementing Papers
- How to Redesign Your App Without Pissing Everybody Off (Anil Dash) — the basic straightforward stuff that gets your users on-side. Anil’s making a career out of being an adult.
- Clockwork Raven (Twitter) — open source project to send data analysis tasks to Mechanical Turkers.
- Updates from the Tour in China (Bunnie Huang) — my dream geek tourism trip: going around Chinese factories and bazaars with MIT geeks.
- How to Implement an Algorithm from a Scientific Paper — I have implemented many complex algorithms from books and scientific publications, and this article sums up what I have learned while searching, reading, coding and debugging. (via Siah)
Invisible Data Economy, Hacked Value, Open Algorithms Textbook, and Mobile Testing
- Beyond Goods and Services: The Unmeasured Rise of the Data-Driven Economy — excellent points about data as neither good nor service, and how data use goes unmeasured by economists and thus doesn’t influence policy. According to statistics from the Bureau of Economic Analysis, real consumption of ‘internet access’ has been falling since the second quarter of 2011. In other words, according to official U.S. government figures, consumer access to the Internet—including mobile—has been a drag on economic growth for the past year and a half. (via Mike Loukides)
- How Crooks Turn Even Crappy Hacked PCs Into Money (Brian Krebs) — show to your corporate IT overlords, or your parents, to explain why you want them to get rid of the Windows XP machines. (via BoingBoing)
- Open Data Structures — an open content textbook (Java and C++ editions; CC-BY licensed) on data structures. (via Hacker News)
- Mobiforge — test what gets sent back to mobile browsers. This site sends the HTTP headers that a mobile browser would. cf yesterday’s Responsivator. (via Ronan Cremin)
News App, Data Wrangler, Responsive Previews, and Accountable Algorithms
- cir.ca — news app for iPhone, which lets you track updates and further news on a given story. (via Andy Baio)
- DataWrangler (Stanford) — an interactive tool for data cleaning and transformation. Spend less time formatting and more time analyzing your data. From the Stanford Visualization Group.
- Responsivator — see how websites look at different screen sizes.
- Accountable Algorithms (Ed Felten) — When we talk about making an algorithmic public process open, we mean two separate things. First, we want transparency: the public knows what the algorithm is. Second, we want the execution of the algorithm to be accountable: the public can check to make sure that the algorithm was executed correctly in a particular case. Transparency is addressed by traditional open government principles; but accountability is different.