50 Years of Data Science (PDF) — Because all of science itself will soon become data that can be mined, the imminent revolution in Data Science is not about mere “scaling up,” but instead the emergence of scientific studies of data analysis science-wide.
badwolf — a temporal graph store from Google.
Why Biomedical Superstars are Signing on with Google (Nature) — “To go all the way from foundational first principles to execution of vision was the initial draw, and that’s what has continued to keep me here.” Research to retail, at Google scale.
VR Basics — intro to terminology and hardware in the next gen of hardware, in case you’re late to the goldrush^w exciting field.
The Internot of Things, Explainy Learning, Medical Microcontroller Board, and Coder Sutra
A Cyber Attack Against Israel Shut Down a Road — The hackers targeted the Tunnels’ camera system which put the roadway into an immediate lockdown mode, shutting it down for twenty minutes. The next day the attackers managed to break in for even longer during the heavy morning rush hour, shutting the entire system for eight hours. Because all that is digital melts into code, and code is an unsolved problem.
Random Decision Forests (PDF) — “Due to the nature of the algorithm, most Random Decision Forest implementations provide an extraordinary amount of information about the final state of the classifier and how it derived from the training data.” (via Greg Borenstein)
BITalino — 149 Euro microcontroller board full of physiological sensors: muscles, skin conductivity, light, acceleration, and heartbeat. A platform for healthcare hardware hacking?
How to Be a Programmer — a braindump from a guru.
The Internet of Americas, Pharma Pricey, Who's Watching, and Data Mining Course
Bradley Manning and the Two Americas (Quinn Norton) — The first America built the Internet, but the second America moved onto it. And they both think they own the place now. The best explanation you’ll find for wtf is going on.
Staggering Cost of Inventing New Drugs (Forbes) — $5B to develop a new drug; and subject to an inverse-Moore’s law: A 2012 article in Nature Reviews Drug Discovery says the number of drugs invented per billion dollars of R&D invested has been cut in half every nine years for half a century.
Who’s Watching You — (Tim Bray) threat modelling. Everyone should know this.
Data Mining with Weka — learn data mining with the popular open source Weka platform.
Spatial Verbs, Open Source Malaria, Surviving Management, and Paper-like UAV
Operative Design — A catalogue of spatial verbs. (via Adafruit)
Open Source Malaria — open science drug discovery.
Surviving Being (Senior) Tech Management (Kellan Elliott-McCrea) — Perspective is the thin line between a challenging but manageable problem, and chittering balled up in the corner.
Disposable UAVs Inspired by Paper Planes (DIY Drones) — The first design, modeled after a paper plane, is created from a cellulose sheet that has electronic circuits ink-jet printed directly onto its body. Once the circuits have been laid on the plane’s frame, the craft is exposed to a UV curing process, turning the plane’s body into a flexible circuit board. These circuits are then connected to the plane’s “avionics system”, two elevons attached to the rear of the craft, which give the UAV the ability to steer itself to its destination.
Fit2Cure taps the public's visual skills to match compounds to targets
In the inspiring tradition of Foldit, the game for determining protein shapes, Fit2Cure crowdsources the problem of finding drugs that can cure the many under-researched diseases of developing countries. Fit2Cure appeals to the player’s visual–even physical–sense of the world, and requires much less background knowledge than Foldit.
There are about 7,000 rare diseases, fewer than 5% of which have cures. The number of people currently engaged in making drug discoveries is by no means adequate to study all these diseases. A recent gift to Harvard shows the importance that medical researchers attach to filling the gap. As an alternative approach, abstracting the drug discovery process into a game could empower thousands, if not millions, of people to contribute to this process and make discoveries in diseases that get little attention from scientists or pharmaceutical companies.
The biological concept behind Fit2Cure is that medicines have specific shapes that fit into the proteins of the victim’s biological structures like jig-saw puzzle pieces (but more rounded). Many cures require finding a drug that has the same jig-saw shape and can fit into the target protein molecule, thus preventing it from functioning normally.
How the field of genetics is using data within research and to evaluate researchers
Editor’s note: Earlier this week, Part 1 of this article described Sage Bionetworks, a recent Congress they held, and their way of promoting data sharing through a challenge.
Data sharing is not an unfamiliar practice in genetics. Plenty of cell lines and other data stores are publicly available from such places as the TCGA data set from the National Cancer Institute, Gene Expression Omnibus (GEO), and ArrayExpress (all of which can be accessed through Synapse). So to some extent the current revolution in sharing lies not in the data itself but in critical related areas.
First, many of the data sets are weakened by metadata problems. A Sage programmer told me that the famous TCGA set is enormous but poorly curated. For instance, different data sets in TCGA may refer to the same drug by different names, generic versus brand name. Provenance–a clear description of how the data was collected and prepared for use–is also weak in TCGA.
In contrast, GEO records tend to contain good provenance information (see an example), but only as free-form text, which presents the same barriers to searching and aggregation as free-form text in medical records. Synapse is developing a structured format for presenting provenance based on the W3C’s PROV standard. One researcher told me this was the most promising contribution of Synapse toward the shared use of genetic information.
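To make the contrast with free-form text concrete, here is a minimal Python sketch of what a structured provenance record buys you. It is not Synapse's actual format, which the article does not describe. The field names only borrow relation names from PROV (wasGeneratedBy, wasAssociatedWith, used), and every data set, activity, and lab name below is hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical record; field names loosely follow PROV relation names."""
    entity: str               # the data set being described
    was_generated_by: str     # the activity that produced the data set
    was_associated_with: str  # the agent (lab or person) responsible
    used: tuple = ()          # inputs the activity consumed


def generated_by(records, activity):
    """Return the entities produced by a given activity (exact match)."""
    return [r.entity for r in records if r.was_generated_by == activity]


# Made-up records for illustration only.
records = [
    ProvenanceRecord("expression_set_A", "microarray_hybridization", "lab_1", ("sample_batch_1",)),
    ProvenanceRecord("expression_set_B", "rna_sequencing", "lab_2", ("sample_batch_2",)),
    ProvenanceRecord("expression_set_C", "microarray_hybridization", "lab_2", ("sample_batch_3",)),
]

# Searching: an exact field comparison instead of guessing at free-text phrasing.
matches = generated_by(records, "microarray_hybridization")

# Aggregation: count data sets per responsible agent.
per_agent = Counter(r.was_associated_with for r in records)
```

With free-form descriptions, the same search would need text matching that tolerates every way a lab might phrase its method. With fixed fields, both the search and the per-agent count are one-line operations.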
Observations from Sage Congress and collaboration through its challenge
The glowing reports we read of biotech advances almost cause one’s brain to ache. They leave us thinking that medical researchers must command the latest in all technological tools. But the engines of genetic and pharmaceutical innovation are stuttering for lack of one key fuel: data. Here they are left with the equivalent of trying to build skyscrapers with lathes and screwdrivers.
Sage Congress, held this past week in San Francisco, investigated the multiple facets of data in these fields: gene sequences, models for finding pathways, patient behavior and symptoms (known as phenotypic data), and code to process all these inputs. A survey of efforts by the organizers, Sage Bionetworks, and other innovations in genetic data handling can show how genetics resembles and differs from other disciplines.
An intense lesson in code sharing
At last year’s Congress, Sage announced a challenge, together with the DREAM project, intended to galvanize researchers in genetics while showing off the growing capabilities of Sage’s Synapse platform. Synapse ties together a number of data sets in genetics and provides tools for researchers to upload new data, while searching other researchers’ data sets. Its challenge highlighted the industry’s need for better data sharing, and some ways to get there.
In which the question of whether research subjects have any rights to their data is pondered.
The GET (Genomes, Environments and Traits) conference is a confluence of parties interested in the advances being made in human genomes, the measurement of how the environment impacts individuals, and how the two come together to produce traits. Sponsored by the organizers of the Personal Genome Project (PGP) at Harvard, it is a two-day event whose topics range from the appropriate amount of access that patients should have to their genetics data to the ways that Hollywood can be convinced to portray genomics more accurately.
It also is a yearly meeting place for the participants in the Personal Genome Project (one of whom is your humble narrator), people who have agreed to participate in an “open consent” research model. Among other things, this means that PGP participants agree to let their cell lines be used for any purposes (research or commercial). They also acknowledge ahead of time that because their genomes and phenotypic traits are being released publicly, there is a high likelihood that interested parties may be able to identify them from their data. The long term goal of the PGP is to enroll 100,000 participants and perform whole genome sequencing of their DNA; they currently have nearly 2,300 enrolled participants and have sequenced around 165 genomes.