- Anti-Caching (PDF) — paper outlining a clever reframing of the database strategy of keeping frequently accessed things in-memory, namely pushing to disk the things that won’t be accessed … aka, “anti-caching.”
- The Rating Game (Verge) — Until companies release ratings data, we can’t know for certain whether this is true, but a study of Airbnb users found that black hosts get less money for similar listings than white hosts, and another study found that white taxi drivers get higher tips than black ones. There’s no reason such biases wouldn’t carry over to ratings.
- Singa — Apache distributed deep learning platform turns 1.0.
- Scoring Items That Were Voted On or Rated — a Bayesian system to turn a set of ratings or up/down votes into a single score, such that you can sort a list from “best” to “worst.”
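To make the last item concrete, here is a minimal sketch of one simple Bayesian score for up/down votes: the posterior mean of an item's upvote rate under a Beta prior. This is an illustration under stated assumptions and not necessarily the method in the linked article. The names `Item`, `bayesian_score`, and `rank`, and the uniform Beta(1, 1) prior, are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """Hypothetical container for an item and its vote counts."""
    name: str
    upvotes: int
    downvotes: int


def bayesian_score(upvotes: int, downvotes: int,
                   prior_up: float = 1.0, prior_down: float = 1.0) -> float:
    """Posterior mean of the upvote rate under a Beta(prior_up, prior_down) prior.

    The default Beta(1, 1) prior is an assumption; it acts like one
    imaginary upvote and one imaginary downvote added to every item.
    """
    if upvotes < 0 or downvotes < 0:
        raise ValueError("vote counts must be non-negative")
    return (upvotes + prior_up) / (upvotes + downvotes + prior_up + prior_down)


def rank(items):
    """Sort items from "best" to "worst" by their Bayesian score."""
    return sorted(
        items,
        key=lambda item: bayesian_score(item.upvotes, item.downvotes),
        reverse=True,
    )


if __name__ == "__main__":
    sample = [Item("a", 1, 0), Item("b", 90, 10), Item("c", 0, 0)]
    for item in rank(sample):
        print(item.name, round(bayesian_score(item.upvotes, item.downvotes), 3))
```

The prior keeps an item with a single upvote from outranking one with 90 upvotes and 10 downvotes, which a raw upvote fraction would allow.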
"open source" entries
What bio can learn from the open source work of Tesla, Google, and Red Hat.
When building a biotech start-up, there is a certain inevitability to every conversation you will have. For investors, accelerators, academics, friends, baristas, the first two questions will be: “what do you want to do?” and “have you got a patent yet?”
Almost everything revolves around getting IP protection in place, and patent lawyer meetings are usually the first sign that your spin-off is on the way. But what if there were a way to avoid the patent dance, relying instead on implementation? It seems somewhat utopian, but there is a precedent in the technology world: open source.
What is open source? Essentially, it is any software whose source code (the underlying program) is available for anyone to modify, distribute, etc. This means that, unlike typical proprietary development processes, it lends itself to collaborative development between larger groups, often spread out across large distances. From humble beginnings, the open source movement has developed to the point of providing operating systems (e.g. Linux), Internet browsers (Firefox), 3D modelling software (Blender), monetary alternatives (Bitcoin), and even systems for integrating home automation (openHAB).
Money, money, money…
The obvious question is then, “OK, but how do they make money?” The answer lies not in profiting from the software code itself, but in profiting from its implementation and from the applications built on top of it. For the implementation side, take Red Hat Inc., a multinational software company in the S&P 500 with a market cap of $14.2 billion, which produces the extremely popular Red Hat Enterprise Linux distribution. Although the distribution is open source and freely available, Red Hat makes its money by selling a thoroughly bug-tested operating system and then contracting to provide support for 10 years. Thus, businesses are not buying the code; they are buying a rapid response to any problems.
The O’Reilly Data Show podcast: Todd Lipcon on hybrid and specialized tools in distributed systems.
In recent months, I’ve been hearing about hybrid systems designed to handle different data management needs. At Strata + Hadoop World NYC last week, Cloudera’s Todd Lipcon unveiled an open source storage layer — Kudu — that’s good at both table scans (analytics) and random access (updates and inserts).
While specialized systems will continue to serve companies, there will be situations where the complexity of maintaining multiple systems — to eke out extra performance — will be harder to justify.
During the latest episode of the O’Reilly Data Show Podcast, I sat down with Lipcon to discuss his new project a few weeks before it was released. Here are a few snippets from our conversation:
HDFS and HBase
[Hadoop is] more like a file store. It allows you to upload files onto an arbitrarily sized cluster with 20-plus petabytes, in single clusters. The thing is, you can upload the files but you can’t edit them in place. To make any change, you have to basically put in a new file. What HBase does in distinction is that it has more of a tabular data model, where you can update and insert individual row-by-row data, and then randomly access that data [in] milliseconds. The distinction here is that HDFS is pretty good for large scans where you’re putting in a large data set, maybe doing a full parse over the data set to train a machine learning model or compute an aggregate. If any of that data changes on a frequent basis or if you want to stream the data in or randomly access individual customer records, you’re kind of out of luck on HDFS.
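The contrast Lipcon draws can be sketched with two toy in-memory classes. This is only an illustration of the two access patterns, not the HDFS or HBase API; the names `WriteOnceFileStore` and `RowTable`, and their methods, are hypothetical.

```python
class WriteOnceFileStore:
    """Toy stand-in for a file store whose files cannot be edited in place."""

    def __init__(self):
        self._files = {}

    def put(self, path, records):
        # Store an immutable snapshot; a later put replaces the whole file.
        self._files[path] = tuple(records)

    def scan(self, path):
        # Full sequential pass over one file: the access pattern this model favors.
        return iter(self._files[path])

    def update_record(self, path, key, new_value):
        # No in-place edit: changing one record means writing a new file.
        rewritten = [
            (k, new_value if k == key else v) for k, v in self._files[path]
        ]
        self.put(path, rewritten)
        return len(rewritten)  # number of records written for a one-record change


class RowTable:
    """Toy stand-in for a tabular store with per-row updates and lookups."""

    def __init__(self):
        self._rows = {}

    def upsert(self, key, value):
        # Insert or update exactly one row.
        self._rows[key] = value
        return 1  # number of records written

    def get(self, key):
        # Random access to a single row by key.
        return self._rows[key]


if __name__ == "__main__":
    store = WriteOnceFileStore()
    store.put("customers", [(i, "old") for i in range(1000)])
    print("file store records written:", store.update_record("customers", 42, "new"))

    table = RowTable()
    for i in range(1000):
        table.upsert(i, "old")
    print("row table records written:", table.upsert(42, "new"))
```

In the toy file store, changing one customer record rewrites every record in the file, while the row table touches a single row. Real systems differ in many ways, but the cost asymmetry is the point of the quote.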
A strong, open user community needs to be fostered to reveal its potential.
A strong user community is essential to releasing the full potential of an open source project, and this influence is particularly important now for the newly developed Apache Drill project. Drill is a highly scalable SQL query engine for interactive access to a wide range of big data sources and formats. Some of the ways users have an impact are an expected part of the development process: by trying the software and reporting their experiences and use cases, users in the Drill community provide valuable feedback to developers as well as raise awareness with a larger audience of what this big data tool has to offer.
This advantage was especially important with early versions of the software; users have helped Drill’s development from its early days by reporting bugs and praising features that they like. And now, as Drill is reaching maturity and refinement, users will likely also provide additional innovations: experimenting with Drill in their own projects, they may find new ways to use it that had not occurred to the developers.
Drill’s flexibility and extensibility lend themselves to innovation, but there’s also a natural tendency for this type of change because the big data and Hadoop landscape is also evolving quickly. In the case of Drill, we’re seeing the “unexpectedness benefit” of openness: the community gets out ahead of the leadership in use cases and technological change.
The first big Apache Drill design meeting in September 2012 in San Jose set the tone of openness and inclusion. This was an open meeting, organized by Drill co-founder Tomer Shiran and Drill mentor Ted Dunning, and sponsored by MapR Technologies through the Bay Area Apache Drill User Group. More than 60 people attended in person, and Webex connected a larger, international audience. I recall that in addition to speaker-led presentations and discussion, long strips of paper were mounted around the room for participants to write on during breaks in order to provide ideas or offer specific ways they might want to be involved. Practical steps like this surfaced good ideas immediately, and signaled openness for future ones.