- How a Math Genius Hacked OkCupid to Find True Love (Wired) — if he doesn’t end up working for OkCupid, or productising this as a new service, something is wrong with the world.
- Humin: The App That Uses Context to Enable Better Human Connections (WaPo) — Humin is part of a growing trend of apps and services attempting to use context and anticipation to better serve users. The precogs are coming. I knew it.
- Spoiled Onions — analysis identifying bad actors in the Tor network. Since September 2013, we discovered several malicious or misconfigured exit relays [...]. These exit relays engaged in various attacks such as SSH and HTTPS MitM, HTML injection, and SSL stripping. We also found exit relays which were unintentionally interfering with network traffic because they were subject to DNS censorship.
- My Mind (GitHub) — a web application for creating and managing mind maps. It is free to use, and you can fork its source code. It is distributed under the terms of the MIT license.
ENTRIES TAGGED "machine learning"
Mating Math, Precogs Are Coming, Tor Bad Guys, and Mind Maps
An interview with Ash Damle of Lumiata on the role of data in healthcare.
Vinod Khosla has stirred up some controversy in the healthcare community over the last several years by suggesting that computers might be able to provide better care than doctors. This includes remarks he made at Strata Rx in 2012, such as: “We need to move from the practice of medicine to the science of medicine. And the science of medicine is way too complex for human beings to do.”
So when I saw the news that Khosla Ventures has just invested $4M in Series A funding in Lumiata (formerly MEDgle), a company that specializes in healthcare data analytics, I was very curious to hear more about the company’s vision. Ash Damle is the CEO of Lumiata. We recently spoke by phone to discuss how data can improve access to care and help level the playing field of care quality.
Tell me a little about Lumiata: what it is and what it does.
Ash Damle: We’re bringing together the best of medical science and graph analytics to provide the best prescriptive analysis to those providing care. We data-mine all the publicly available data sources, such as journals, de-identified records, etc. We analyze the data to make sure we’re learning the right things and, most importantly, what the relationships are among the data. We have fundamentally delved into looking at that whole graph, the way Google does to provide you with relevant search results. We curate those relationships to make sure they’re sensible, and take into account behavioral and social factors.
Software in 2014, Making Systems That Don't Suck, Cognition Troubles, and Usable Security Hacks
- Software in 2014 (Tim Bray) — a good state of the world, much of which I agree with. Client-side: Things are bad. You have to build everything three times: Web, iOS, Android. We’re talent-starved, this is egregious waste, and it’s really hurting us.
- Making Systems That Don’t Suck (Dominus) — every software engineer should have to read this. Every one.
- IBM Struggles to Turn Watson Into Big Business (WSJ) — cognition services are harder to onboard than they seemed. It smells suspiciously like expert systems from the 1980s, but with more complex analytics on the inside. Analytic skill isn’t the problem for these applications, though; it’s the pain of getting domain knowledge into the system in the first place. This is where Google’s web crawl and massive store of structured general knowledge are going to be a key accelerant.
- Reading This May Harm Your Computer (SSRN) — Internet users face large numbers of security warnings, which they mostly ignore. To improve risk communication, warnings must be fewer but better. We report an experiment on whether compliance can be increased by using some of the social-psychological techniques the scammers themselves use, namely appeal to authority, social compliance, concrete threats and vague threats. We also investigated whether users turned off browser malware warnings (or would have, had they known how).
Artificial Labour, Flexible Circuits, Vanishing Business Sexts, and Thermal Imaging
- Artificial Labour and Ubiquitous Interactive Machine Learning (Greg Borenstein) — in which design fiction, actual machine learning, legal discovery, and comics meet. One of the major themes to emerge in the 2H2K project is something we’ve taken to calling “artificial labor”. While we’re skeptical of the claims of artificial intelligence, we do imagine ever-more sophisticated forms of automation transforming the landscape of work and economics. Or, as John puts it, robots are Marxist.
- Clear Flexible Circuit on a Contact Lens (Smithsonian) — ends up about 1/60th the thickness of a human hair, and just as flexible.
- Confide (GigaOm) — Enterprise SnapChat. A Sarbanes-Oxley Litigation Printer. It’s the Internet of Undiscoverable Things. Looking forward to Enterprise Omegle.
- FLIR One — thermal imaging in phone form factor, another sensor for your panopticon. (via DIY Drones)
Pattern Recognition, MicroSD Vulnerability, Security Talks, and IoT List
- tooldiag — a collection of methods for statistical pattern recognition. Implemented in C.
- Hacking MicroSD Cards (Bunnie Huang) — In my explorations of the electronics markets in China, I’ve seen shop keepers burning firmware on cards that “expand” the capacity of the card — in other words, they load a firmware that reports the capacity of a card is much larger than the actual available storage. The fact that this is possible at the point of sale means that most likely, the update mechanism is not secured. MicroSD cards come with embedded microcontrollers whose firmware can be exploited.
- 30c3 — recordings from the 30th Chaos Communication Congress.
- IoT Companies, Products, Devices, and Software by Sector (Mike Nicholls) — astonishing amount of work in the space, especially given this list is inevitably incomplete.
It's an extensive, well-documented, and accessible curated library of machine-learning models
I use a variety of tools for advanced analytics; most recently I’ve been using Spark (and MLlib), R, scikit-learn, and GraphLab. When I need to get something done quickly, I’ve been turning to scikit-learn for my first-pass analysis. For access to high-quality, easy-to-use implementations of popular algorithms, scikit-learn is a great place to start. So much so that I often encourage new and seasoned data scientists to try it whenever they’re faced with analytics projects that have short deadlines.
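That first-pass workflow really is only a few lines. A minimal sketch, using scikit-learn’s bundled iris dataset (the choice of a random forest here is illustrative, not a recommendation from the interview):

```python
# A quick first-pass analysis with scikit-learn: load data, hold out a
# test set, fit an off-the-shelf model, and check held-out accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("held-out accuracy: %.2f" % clf.score(X_test, y_test))
```

Swapping in a different estimator means changing one line, which is much of what makes the library suited to short deadlines.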
I recently spent a few hours with one of scikit-learn’s core contributors, Olivier Grisel. We had a free-flowing discussion where we talked about machine learning, data science, programming languages, big data, Paris, and … scikit-learn! Along the way, I was reminded of why I’ve come to use (and admire) the scikit-learn project.
Commitment to documentation and usability
One of the reasons I started using scikit-learn was its nice documentation (which I hold up as an example for other communities and projects to emulate). Contributions to scikit-learn are required to include narrative examples along with sample scripts that run on small data sets. Beyond good documentation, other core tenets guide the community’s overall commitment to quality and usability: the global API is safeguarded, all public APIs are well documented, and, when appropriate, contributors are encouraged to expand the coverage of unit tests.
Models are chosen and implemented by a dedicated team of experts
scikit-learn’s stable of contributors includes experts in machine-learning and software development. A few of them (including Olivier) are able to devote a portion of their professional working hours to the project.
Covers most machine-learning tasks
Scan the list of things available in scikit-learn and you quickly realize that it includes tools for many of the standard machine-learning tasks (clustering, classification, regression, and so on). And since scikit-learn is developed by a large community of developers and machine-learning experts, promising new techniques tend to be included in fairly short order.
As a curated library, users don’t have to choose from among multiple competing implementations of the same algorithm (a problem that R users often face). To assist users who struggle to choose among different models, Andreas Muller created a simple flowchart.
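Part of what makes model selection painless is that every scikit-learn estimator exposes the same fit/predict/score interface, so comparing candidates is a short loop rather than a rewrite. A small sketch (the particular models compared here are my choice for illustration):

```python
# scikit-learn estimators share a common API, so candidate models can be
# compared by cross-validated score in a single loop.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "linear SVM": SVC(kernel="linear"),
    "k-nearest neighbors": KNeighborsClassifier(),
}
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
for name, score in scores.items():
    print("%-22s %.3f" % (name, score))
```

The flowchart narrows the candidate list; the uniform API makes actually trying the survivors cheap.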
Dinosaur Tries to Suckle, Dashboard Design, Massive Visualizations, Massive Machine Learning
- Behind the Scenes of a Dashboard Design — the design decisions that go into displaying complex info.
- Superconductor — a web framework for creating data visualizations that scale to real-time interactions with up to 1,000,000 data points. It compiles to WebCL, WebGL, and web workers. (via Ben Lorica)
- BIDMach: Large-scale Learning with Zero Memory Allocation (PDF) — GPU-accelerated machine learning. In this paper we describe a caching approach that allows code with complex matrix (graph) expressions at massive scale, i.e. multi-terabyte data, with zero memory allocation after the initial setup. (via Siah)
Inside the Nest Protect, Log Structures, Predictions, and In-Memory Data Cubes
- Nest Protect Teardown (Sparkfun) — initial teardown of another piece of domestic industrial Internet.
- Logs — The distributed log can be seen as the data structure which models the problem of consensus. He’s not kidding when he calls it “real-time data’s unifying abstraction”.
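The core idea — replicas reach the same state by replaying the same ordered records — fits in a few lines. A toy sketch (the `Log` and `Replica` classes here are my own illustration, not code from the post):

```python
# Toy illustration of the log abstraction: replicas that apply the same
# append-only log of operations, in order, converge to the same state.
class Log:
    def __init__(self):
        self.records = []               # ordered, append-only

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1    # offset of the new record

class Replica:
    def __init__(self):
        self.state = {}
        self.offset = 0                 # next log offset to apply

    def catch_up(self, log):
        for key, value in log.records[self.offset:]:
            self.state[key] = value
        self.offset = len(log.records)

log = Log()
log.append(("x", 1))
log.append(("y", 2))
log.append(("x", 3))                    # later writes win

a, b = Replica(), Replica()
a.catch_up(log)
b.catch_up(log)
# Both replicas converge because they consumed the same ordered records.
```

Agreeing on the contents and order of the log is exactly the consensus problem; everything downstream is deterministic replay.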
- Mining the Web to Predict Future Events (PDF) — Mining 22 years of news stories to predict future events. (via Ben Lorica)
- Nanocubes — a fast data structure for in-memory data cubes, developed at the Information Visualization department at AT&T Labs – Research. Nanocubes can be used to explore datasets with billions of elements at interactive rates in a web browser, and in some cases it uses sufficiently little memory that you can run a nanocube on a modern laptop. (via Ben Lorica)
AI Book, Science Superstars, Engineering Ethics, and Crowdsourced Science
- Society of Mind — Marvin Minsky’s book, now Creative Commons-licensed.
- Collaboration, Stars, and the Changing Organization of Science: Evidence from Evolutionary Biology — The concentration of research output is declining at the department level but increasing at the individual level. [...] We speculate that this may be due to changing patterns of collaboration, perhaps caused by the rising burden of knowledge and the falling cost of communication, both of which increase the returns to collaboration. Indeed, we report evidence that the propensity to collaborate is rising over time. (via Sciblogs)
- As Engineers, We Must Consider the Ethical Implications of our Work (The Guardian) — applies to coders and designers as well.
- Eyewire — a game to crowdsource the mapping of 3D structure of neurons.