- Suro (GitHub) — Netflix data pipeline service for large volumes of event data. (via Ben Lorica)
- NIPS Workshop on Data Driven Education — lots of research papers around machine learning, MOOC data, etc.
- Proofist — crowdsourced proofreading game.
- 3D-Printed Shoes (YouTube) — LeWeb talk from the founder of the company, Continuum Fashion. (via Brady Forrest)
ENTRIES TAGGED "data"
Lessons from the design community for developing data-driven applications
When someone says “that is a nice infographic” or “check out this sweet dashboard,” many people infer that the work is “well-designed.” Creating accessible (or for the cynical, “pretty”) content is only part of what makes good design powerful. The design process is geared toward solving specific problems. This process has been formalized in many ways (e.g., IDEO’s Human Centered Design, Marc Hassenzahl’s User Experience Design, or Braden Kowitz’s Story-Centered Design), but the basic idea is that you have to explore the breadth of the possible before you can isolate truly innovative ideas. We, at Datascope Analytics, argue that the same is true of designing effective data science tools, dashboards, engines, etc.: in order to design effective dashboards, you must know what is possible.
Zombie Drones, Algebra Through Code, Data Toolkit, and Crowdsourcing Antibiotic Discovery
- Skyjack — drone that takes over other drones. Welcome to the Malware of Things.
- Bootstrap World — a curricular module for students ages 12-16, which teaches algebraic and geometric concepts through computer programming. (via Esther Wojcicki)
- Harvest — open source BSD-licensed toolkit for building web applications for integrating, discovering, and reporting data. Designed for biomedical data first. (via Mozilla Science Lab)
- Project ILIAD — crowdsourced antibiotic discovery.
Data Tool, Arduino-like Board, Learn to Code via Videogames, and Creative Commons 4.0 Out
- OpenRefine — (edited: 7 Dec 2013) Google bought Freebase’s GridWorks and turned it into the excellent Refine tool for working with data sets, which has now been picked up and developed by the open source community.
- Intel’s Arduino-Compatible Board — launched at MakerFaire Rome. (via Wired UK)
- Game Maven — learn to code by writing casual videogames. (via Greg Linden)
- CC 4.0 Out — The 4.0 licenses are extremely well-suited for use by governments and publishers of public sector information and other data, especially for those in the European Union. This is due to the expansion in license scope, which now covers sui generis database rights that exist there and in a handful of other countries.
We must go beyond hype for incentives to provide data to researchers
The FDA order stopping 23andMe from offering its genetic test kit strikes right into the heart of the major issue in health care reform: the tension between individual care and collective benefit. Health is not an individual matter. As I will show, we need each other. And beyond narrow regulatory questions, the 23andMe issue opens up the whole goal of information sharing and the funding of health care reform.
Unlocking Scientific Data with Python
Most people working on complex software systems have had That Moment, when you throw up your hands and say “If only we could start from scratch!” Generally, it’s not possible. But every now and then, the chance comes along to build a really exciting project from the ground up.
In 2011, I had the chance to participate in just such a project: the acquisition, archiving and database systems which power a brand-new hypervelocity dust accelerator at the University of Colorado.
Warrant Canary, Polluted Statistics, Dollars for Deathbots, and Protocol Madness
- Apple Transparency Report (PDF) — contains a warrant canary, the statement “Apple has never received an order under Section 215 of the USA Patriot Act. We would expect to challenge an order if served on us,” which will of course be removed if one of the secret orders is received. Bravo, Apple, for implementing a clever hack to route around excessive secrecy. (via Boing Boing)
- You’re Probably Polluting Your Statistics More Than You Think — it is insanely easy to find phantom correlations in random data without obviously being foolish. Anyone who thinks it’s possible to draw truthful conclusions from data analysis without really learning statistics needs to read this. (via Stijn Debrouwere)
- CyPhy Funded (Quartz) — the second act of iRobot co-founder Helen Greiner, maker of the famed Roomba robot vacuum cleaner. She terrified ETech long ago—the audience were expecting Roomba cuteness and got a keynote about military deathbots. It would appear she’s still in the deathbot niche, not so much with the cute. Remember this when you build your OpenCV-powered recoil-resistant load-bearing-hoverbot and think it’ll only ever be used for the intended purpose of launching fertiliser pellets into third world hemp farms.
- User-Agent String History — a light-hearted illustration of why the formal semantic value of free-text fields is driven to zero in the face of actual use.
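The phantom-correlation point in the statistics item above can be illustrated with a small sketch. It is not taken from the linked article: the function names, the choice of 40 series of 10 points each, and the fixed seed are all assumptions made for illustration. The idea is simply to generate unrelated random series, compare every pair, and report the strongest correlation found.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_phantom_correlation(n_series=40, n_points=10, seed=0):
    """Return (i, j, r) for the most strongly correlated pair of noise series.

    Hypothetical helper: every series is independent Gaussian noise, so any
    correlation it reports is spurious by construction.
    """
    rng = random.Random(seed)
    series = [[rng.gauss(0, 1) for _ in range(n_points)] for _ in range(n_series)]
    # Compare every pair of series and keep the one with the largest |r|.
    i, j = max(
        itertools.combinations(range(n_series), 2),
        key=lambda pair: abs(pearson(series[pair[0]], series[pair[1]])),
    )
    return i, j, pearson(series[i], series[j])


if __name__ == "__main__":
    i, j, r = strongest_phantom_correlation()
    print(f"series {i} vs series {j}: r = {r:+.2f} (both are pure noise)")
```

With 40 series there are 780 pairs to compare, so a few of them will look strongly related by chance alone, especially when each series is short. Searching many comparisons and reporting only the best one is one common way such phantom correlations arise.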
Time Series Database, Cluster Schedulers, Structural Search-and-Replace, and TV Data
- Influx DB — open-source, distributed, time series, events, and metrics database with no external dependencies.
- Omega (PDF) — flexible, scalable schedulers for large compute clusters. From Google Research.
- Amazon Mines Its Data Trove To Bet on TV’s Next Hit (WSJ) — Amazon produced about 20 pages of data detailing, among other things, how much a pilot was viewed, how many users gave it a 5-star rating and how many shared it with friends.
The Internot of Things, Explainy Learning, Medical Microcontroller Board, and Coder Sutra
- A Cyber Attack Against Israel Shut Down a Road — The hackers targeted the Tunnels’ camera system which put the roadway into an immediate lockdown mode, shutting it down for twenty minutes. The next day the attackers managed to break in for even longer during the heavy morning rush hour, shutting the entire system for eight hours. Because all that is digital melts into code, and code is an unsolved problem.
- Random Decision Forests (PDF) — “Due to the nature of the algorithm, most Random Decision Forest implementations provide an extraordinary amount of information about the final state of the classifier and how it derived from the training data.” (via Greg Borenstein)
- BITalino — 149 Euro microcontroller board full of physiological sensors: muscles, skin conductivity, light, acceleration, and heartbeat. A platform for healthcare hardware hacking?
- How to Be a Programmer — a braindump from a guru.