"open source" entries

Four short links: 2 November 2015

Anti-Caching, Tyranny of Ratings, Distributed Deep Learning, and Sorting Rated Things

  1. Anti-Caching (PDF) — paper outlining a clever reframing of the database strategy of keeping frequently accessed things in-memory, namely pushing to disk the things that won’t be accessed … aka, “anti-caching.”
  2. The Rating Game (Verge) — Until companies release ratings data, we can’t know for certain whether this is true, but a study of Airbnb users found that black hosts get less money for similar listings than white hosts, and another study found that white taxi drivers get higher tips than black ones. There’s no reason such biases wouldn’t carry over to ratings.
  3. Singa — Apache distributed deep learning platform turns 1.0.
  4. Scoring Items That Were Voted On or Rated — a Bayesian system to turn a set of ratings or up/down votes into a single score, such that you can sort a list from “best” to “worst.”
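
The idea in item 4, turning up/down votes into a single sortable score, can be sketched as follows. This is a minimal illustration and not the linked article's own formula: it assumes a Beta prior over each item's "true" approval rate and ranks by the posterior mean. The function name, the prior parameters, and the sample data are all hypothetical.

```python
def bayesian_score(upvotes, downvotes, prior_up=1.0, prior_down=1.0):
    """Posterior mean approval rate under an assumed Beta(prior_up, prior_down) prior.

    The prior acts like a handful of imaginary votes, so an item with very
    few real votes is pulled toward the prior mean instead of jumping
    straight to 0% or 100%.
    """
    return (upvotes + prior_up) / (upvotes + downvotes + prior_up + prior_down)


# Hypothetical items: name -> (upvotes, downvotes)
items = {"a": (1, 0), "b": (90, 10), "c": (5, 5)}

# Sort from "best" to "worst" by the smoothed score.
ranked = sorted(items, key=lambda name: bayesian_score(*items[name]), reverse=True)
```

With a raw average, item "a" (one vote, 100% positive) would rank first; with the prior in place, the well-supported item "b" outranks it. A stronger prior (larger `prior_up` and `prior_down`) demands more votes before an item can move far from the middle.
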
Four short links: 29 October 2015

Cloud Passports, Better Python Notebooks, Slippery Telcos, and Python Data Journalism

  1. Australia Floating the Idea of Cloud Passports: Under a cloud passport, a traveller’s identity and biometrics data would be stored in a cloud, so passengers would no longer need to carry their passports and risk having them lost or stolen. That sound you hear is Taylor Swift on Security, quoting “Wildest Dreams” into her vodka and Tang: “I can see the end as it begins.” This article is also notable for the line: “The idea of cloud passports is the result of a hipster-style-hackathon.”
  2. Jupyter — Python Notebooks that allow you to create and share documents that contain live code, equations, visualizations, and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning, and much more.
  3. Telcos $24B Business In Your Data: Under the radar, Verizon, Sprint, Telefonica, and other carriers have partnered with firms including SAP, IBM, HP, and AirSage to manage, package, and sell various levels of data to marketers and other clients. It’s all part of a push by the world’s largest phone operators to counteract diminishing subscriber growth through new business ventures that tap into the data that showers from consumers’ mobile Web surfing, text messaging, and phone calls. Even if you do pay for it, you’re still the product.
  4. Introducing Agate — a Python data analysis library designed to be usable by non-data-scientists, which leads to readable and predictable code. Target market: data journalists.
Four short links: 26 October 2015

Dataflow Computers, Data Set Explorer, Design Brief, and Coping with Uncertainty

  1. Dataflow Computers: Their History and Future (PDF) — entry from 2008 Wiley Encyclopedia of Computer Science and Engineering.
  2. Mirador — open source tool for visual exploration of complex data sets. It enables users to discover correlation patterns and derive new hypotheses from the data.
  3. How 23AndMe Got Regulatory Approval Back (Fast Company) — In order to meet FDA requirements, the design team had to prove that the reports provided on the website would be comprehensible to any American consumer, regardless of their background or education level. And you thought YOUR design brief was hard.
  4. Getting Comfortable with Uncertainty (The Atlantic) — We have this natural distaste for things that are unfamiliar to us, things that are ambiguous. It goes up from situational stressors, on an individual level and a group level. And we’re stuck with it simply because we have to be ambiguity-reducers.

Open source lessons for synthetic biology

What bio can learn from the open source work of Tesla, Google, and Red Hat.


When building a biotech start-up, there is a certain inevitability to every conversation you will have. Whether you are talking to investors, accelerators, academics, friends, or baristas, the first two questions will be: “What do you want to do?” and “Have you got a patent yet?”

Almost everything revolves around getting IP protection in place, and patent lawyer meetings are usually the first sign that your spin-off is on the way. But what if there was a way to avoid the patent dance, relying instead on implementation? It seems somewhat utopian, but there is a precedent in the technology world: open source.

What is open source? Essentially, it is any software whose source code (the underlying program) is available for anyone to modify and distribute. This means that, unlike typical proprietary development processes, it lends itself to collaborative development among large groups, often spread across large distances. From humble beginnings, the open source movement has developed to the point of providing operating systems (e.g., Linux), Internet browsers (Firefox), 3D modelling software (Blender), monetary alternatives (Bitcoin), and even home automation integration (openHAB).

Money, money, money…

The obvious question is then, “OK, but how do they make money?” The answer lies not in attempting to profit from the software code itself, but rather from its implementation and from the applications that are built on top of it. On the implementation side, take Red Hat Inc., a multinational software company in the S&P 500 with a market cap of $14.2 billion, which produces the extremely popular Red Hat Enterprise Linux distribution. Although the distribution is open source and freely available, Red Hat makes its money by selling a thoroughly bug-tested operating system and then contracting to provide support for 10 years. Thus, businesses are not buying the code; they are buying a rapid response to any problems.

Four short links: 22 October 2015

Predicting activity, systems replacement fail, Khan React style, and an interoperability system for the Web

  1. Predicting Daily Activities from Egocentric Images Using Deep Learning: Our technique achieves an overall accuracy of 83.07% in predicting a person’s activity [from images taken by a camera worn all day by a person] across the 19 activity classes.
  2. Trying to Replace Multiple Systems with One Can Lead to None (IEEE) — check out that final graph, it’s a doozy. It’s a graph of x against time, from various “this project is great, it will replace x systems with 1” claims about a single project. Software projects should come with giant warning labels: “most fail, you are about to set your money on fire. Are you sure? [Y/N/Abort/Restart]”
  3. Khan React Style Guide — in case you’re dipping your toes into the cool kids’ pool.
  4. ballista: An interoperability system for the modern Web. Like intents.

Resolving transactional access and analytic performance trade-offs

The O’Reilly Data Show podcast: Todd Lipcon on hybrid and specialized tools in distributed systems.

In recent months, I’ve been hearing about hybrid systems designed to handle different data management needs. At Strata + Hadoop World NYC last week, Cloudera’s Todd Lipcon unveiled Kudu, an open source storage layer that’s good at both table scans (analytics) and random access (updates and inserts).

While specialized systems will continue to serve companies, there will be situations where the complexity of maintaining multiple systems — to eke out extra performance — will be harder to justify.

During the latest episode of the O’Reilly Data Show Podcast, I sat down with Lipcon to discuss his new project a few weeks before it was released. Here are a few snippets from our conversation:

HDFS and HBase

[Hadoop is] more like a file store. It allows you to upload files onto an arbitrarily sized cluster with 20-plus petabytes, in single clusters. The thing is, you can upload the files but you can’t edit them in place. To make any change, you have to basically put in a new file. What HBase does in distinction is that it has more of a tabular data model, where you can update and insert individual row-by-row data, and then randomly access that data [in] milliseconds. The distinction here is that HDFS is pretty good for large scans where you’re putting in a large data set, maybe doing a full parse over the data set to train a machine learning model or compute an aggregate. If any of that data changes on a frequent basis or if you want to stream the data in or randomly access individual customer records, you’re kind of out of luck on HDFS.

Four short links: 2 October 2015

Automatic Environments, Majority Illusion, Bogus Licensing, and Orchestrating People and Machines

  1. Announcing Otto — new Hashicorp tool that automatically builds development environments without any configuration; it can detect your project type and has built-in knowledge of industry-standard tools to set up a development environment that is ready to go. When you’re ready to deploy, Otto builds and manages an infrastructure, sets up servers, builds, and deploys the application.
  2. The Majority Illusion in Social Networks (arxiv) — if connectors do something, it’s perceived as more popular than if the same number of “unpopular” people in the social graph do it. (via MIT TR)
  3. Scientist Says Researcher in Immigrant-Friendly Countries Can’t Use His Software — software to build phylogenetic trees, but the author’s a loon. It’s another sign that it’s unwise to do science with non-free software.
  4. Orchestra: an open source system to orchestrate teams of experts and machines on complex projects.
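
The majority-illusion effect summarized in item 2 can be illustrated with a toy graph. This is a sketch under stated assumptions, not the paper's model or data: the network (two hubs each linked to eight leaves), the helper names, and the "more than half of my neighbours" threshold are all made up for illustration.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from (a, b) edge pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def share_seeing_majority(graph, active):
    """Fraction of nodes for whom more than half of their neighbours are active."""
    seeing = sum(
        1
        for neighbours in graph.values()
        if neighbours and sum(n in active for n in neighbours) > len(neighbours) / 2
    )
    return seeing / len(graph)


# Hypothetical network: hubs 0 and 1 are each connected to leaves 2..9.
hubs = [0, 1]
leaves = range(2, 10)
graph = build_graph((hub, leaf) for hub in hubs for leaf in leaves)

# The same number of active people (two out of ten) in both scenarios.
hub_case = share_seeing_majority(graph, active={0, 1})   # the two connectors act
leaf_case = share_seeing_majority(graph, active={2, 3})  # two poorly connected nodes act
```

When the two hubs are the active ones, every leaf sees all of its neighbours doing the thing, so 80% of the network perceives a majority even though only 20% is active. When two leaves are active instead, nobody perceives a majority.
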
Four short links: 25 September 2015

Predicting Policing, Assaulting Advertising, Compliance Ratings, and $9 Computer

  1. Police Program Aims to Pinpoint Those Most Likely to Commit Crimes (NYT) — John S. Hollywood, a senior operations researcher at the RAND Corporation, said that in the limited number of studies undertaken to measure the efficacy of predictive policing, the improvement in forecasting crimes had been only 5% or 10% better than regular policing methods.
  2. Apple’s Assault on Advertising and Google (Calacanis) — Google wants to be proud of their legacy, and tricking people into clicking ads and selling our profiles to advertisers is an awesome business – but a horrible legacy for Larry and Sergey. Read beside the Bloomberg piece on click fraud and the future isn’t too rosy for advertising. If the ad bubble bursts, how much of the Web will it take with it?
  3. China Is Building The Mother Of All Reputation Systems To Monitor Citizen Behavior: The document talks about the “construction of credibility” — the ability to give and take away credits — across more than 30 areas of life, from energy saving to advertising.
  4. $9 Computer Hardware (Makezine) — open hardware project, with open source software. The board’s spec is a 1GHz R8 ARM processor with 512MB of RAM, 4GB of NAND storage, and Wi-Fi and Bluetooth built in.
Four short links: 21 September 2015

2-D Single-Stroke Recognizer, Autonomous Vehicle Permits, s3concurrent, and Surviving the Music Industry

  1. $1 Unistroke Recognizer: a 2-D single-stroke recognizer designed for rapid prototyping of gesture-based user interfaces. In machine learning terms, $1 is an instance-based nearest-neighbor classifier with a Euclidean scoring function — i.e., a geometric template matcher.
  2. Apple Talking to California Officials about Self-Driving Car (Guardian) — California DMV’s main responsibility for autonomous vehicles at present is administering an autonomous vehicle tester program for experimental self-driving cars on California’s roads. So far, 10 companies have been issued permits for about 80 autonomous vehicles and more than 300 test drivers. The most recent, Honda and BMW, received their permits last week.
  3. s3concurrent — sync local file structure with s3, in parallel. (via Winston Chen)
  4. Amanda Palmer on Music Industry Survival Techniques (O’Reilly Radar) — I’ve always approached every Internet platform and every Internet tool with the suspicion that it may not last, and that actually what’s very important is […] the art and the relationships I’m building.
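
The "instance-based nearest-neighbor classifier with a Euclidean scoring function" described in item 1 can be sketched roughly as below. This is a simplified illustration, not the $1 recognizer itself: it resamples, centers, and uniformly scales each stroke, then picks the template with the smallest average point-to-point distance, and it deliberately omits the rotation-alignment step. All function names and the two templates are hypothetical.

```python
import math


def resample(points, n=32):
    """Return n points spaced (approximately) evenly along the stroke."""
    total = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    interval = total / (n - 1)
    pts = list(points)
    out = [pts[0]]
    travelled = 0.0
    i = 1
    while i < len(pts):
        d = math.dist(pts[i - 1], pts[i])
        if d > 0 and travelled + d >= interval:
            # Interpolate a new point exactly one interval along the path.
            t = (interval - travelled) / d
            q = (pts[i - 1][0] + t * (pts[i][0] - pts[i - 1][0]),
                 pts[i - 1][1] + t * (pts[i][1] - pts[i - 1][1]))
            out.append(q)
            pts.insert(i, q)  # q becomes the start of the next segment
            travelled = 0.0
        else:
            travelled += d
        i += 1
    while len(out) < n:  # floating-point rounding can leave us one point short
        out.append(pts[-1])
    return out[:n]


def normalize(points, n=32):
    """Resample, move the centroid to the origin, and scale to unit size."""
    pts = resample(points, n)
    cx = sum(x for x, _ in pts) / n
    cy = sum(y for _, y in pts) / n
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    # Uniform scaling by the longer bounding-box side (an assumption of this sketch).
    size = max(max(xs) - min(xs), max(ys) - min(ys)) or 1.0
    return [((x - cx) / size, (y - cy) / size) for x, y in pts]


def distance(a, b):
    """Average Euclidean distance between corresponding points."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def recognize(stroke, templates):
    """Name of the template nearest to the stroke (nearest-neighbour match)."""
    candidate = normalize(stroke)
    return min(templates, key=lambda name: distance(candidate, normalize(templates[name])))


# Hypothetical templates, each drawn left to right.
templates = {
    "line": [(0, 0), (10, 0)],
    "vee": [(0, 0), (5, 10), (10, 0)],
}
guess = recognize([(0, 0), (4, 9), (10, 1)], templates)
```

Because matching is point-by-point, a stroke drawn in the opposite direction would not match its template in this sketch; storing one template per drawing direction is one simple workaround.
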

Apache Drill: Tracking its history as an open source community

A strong, open user community needs to be fostered to reveal its potential.


A strong user community is essential to releasing the full potential of an open source project, and this influence is particularly important now for the newly developed Apache Drill project. Drill is a highly scalable SQL query engine for interactive access to a wide range of big data sources and formats. Some of the ways users have an impact are an expected part of the development process: by trying the software and reporting their experiences and use cases, users in the Drill community provide valuable feedback to developers as well as raise awareness with a larger audience of what this big data tool has to offer.

This advantage was especially important with early versions of the software; users have helped the development of Drill from its early days by reporting bugs and praising features that they like. And now, as Drill is reaching maturity and refinement, users will likely also provide additional innovations: experimenting with Drill in their own projects, they may find new ways to use it that had not occurred to the developers.

Drill’s flexibility and extensibility lend themselves to innovation, but there’s also a natural tendency for this type of change because the big data and Hadoop landscape is also evolving quickly. In the case of Drill, we’re seeing the “unexpectedness benefit” of openness: the community gets out ahead of the leadership in use cases and technological change.

The first big Apache Drill design meeting in September 2012 in San Jose set the tone of openness and inclusion. This was an open meeting, organized by Drill co-founder Tomer Shiran and Drill mentor Ted Dunning, and sponsored by MapR Technologies through the Bay Area Apache Drill User Group. More than 60 people attended in person, and Webex connected a larger, international audience. I recall that in addition to speaker-led presentations and discussion, long strips of paper were mounted around the room for participants to write on during breaks in order to provide ideas or offer specific ways they might want to be involved. Practical steps like this surfaced good ideas immediately, and signaled openness for future ones.
