"open source" entries

Resolving transactional access and analytic performance trade-offs

The O’Reilly Data Show podcast: Todd Lipcon on hybrid and specialized tools in distributed systems.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

In recent months, I’ve been hearing about hybrid systems designed to handle different data management needs. At Strata + Hadoop World NYC last week, Cloudera’s Todd Lipcon unveiled an open source storage layer — Kudu — that’s good at both table scans (analytics) and random access (updates and inserts).

While specialized systems will continue to serve companies, there will be situations where the complexity of maintaining multiple systems — to eke out extra performance — will be harder to justify.

During the latest episode of the O’Reilly Data Show Podcast, I sat down with Lipcon to discuss his new project a few weeks before it was released. Here are a few snippets from our conversation:

HDFS and HBase

[Hadoop is] more like a file store. It allows you to upload files onto an arbitrarily sized cluster with 20-plus petabytes, in single clusters. The thing is, you can upload the files but you can’t edit them in place. To make any change, you have to basically put in a new file. What HBase does in distinction is that it has more of a tabular data model, where you can update and insert individual row-by-row data, and then randomly access that data [in] milliseconds. The distinction here is that HDFS is pretty good for large scans where you’re putting in a large data set, maybe doing a full parse over the data set to train a machine learning model or compute an aggregate. If any of that data changes on a frequent basis or if you want to stream the data in or randomly access individual customer records, you’re kind of out of luck on HDFS.

Four short links: 2 October 2015

Automatic Environments, Majority Illusion, Bogus Licensing, and Orchestrating People and Machines

  1. Announcing Otto — new HashiCorp tool that automatically builds development environments without any configuration; it can detect your project type and has built-in knowledge of industry-standard tools to set up a development environment that is ready to go. When you’re ready to deploy, Otto builds and manages an infrastructure, sets up servers, builds, and deploys the application.
  2. The Majority Illusion in Social Networks (arxiv) — if connectors do something, it’s perceived as more popular than if the same number of “unpopular” people in the social graph do it. (via MIT TR)
  3. Scientist Says Researchers in Immigrant-Friendly Countries Can’t Use His Software — software to build phylogenetic trees, but the author’s a loon. It’s another sign that it’s unwise to do science with non-free software.
  4. Orchestra — an open source system to orchestrate teams of experts and machines on complex projects.
Four short links: 25 September 2015

Predicting Policing, Assaulting Advertising, Compliance Ratings, and $9 Computer

  1. Police Program Aims to Pinpoint Those Most Likely to Commit Crimes (NYT) — John S. Hollywood, a senior operations researcher at the RAND Corporation, said that in the limited number of studies undertaken to measure the efficacy of predictive policing, the improvement in forecasting crimes had been only 5% or 10% better than regular policing methods.
  2. Apple’s Assault on Advertising and Google (Calacanis) — Google wants to be proud of their legacy, and tricking people into clicking ads and selling our profiles to advertisers is an awesome business – but a horrible legacy for Larry and Sergey. Read beside the Bloomberg piece on click fraud and the future isn’t too rosy for advertising. If the ad bubble bursts, how much of the Web will it take with it?
  3. China Is Building The Mother Of All Reputation Systems To Monitor Citizen Behavior — The document talks about the “construction of credibility” — the ability to give and take away credits — across more than 30 areas of life, from energy saving to advertising.
  4. $9 Computer Hardware (Makezine) — open hardware project, with open source software. The board’s spec is a 1GHz R8 ARM processor with 512MB of RAM, 4GB of NAND storage, and Wi-Fi and Bluetooth built in.
Four short links: 21 September 2015

2-D Single-Stroke Recognizer, Autonomous Vehicle Permits, s3concurrent, and Surviving the Music Industry

  1. $1 Unistroke Recognizer — a 2-D single-stroke recognizer designed for rapid prototyping of gesture-based user interfaces. In machine learning terms, $1 is an instance-based nearest-neighbor classifier with a Euclidean scoring function — i.e., a geometric template matcher.
  2. Apple Talking to California Officials about Self-Driving Car (Guardian) — California DMV’s main responsibility for autonomous vehicles at present is administering an autonomous vehicle tester program for experimental self-driving cars on California’s roads. So far, 10 companies have been issued permits for about 80 autonomous vehicles and more than 300 test drivers. The most recent, Honda and BMW, received their permits last week.
  3. s3concurrent — sync local file structure with s3, in parallel. (via Winston Chen)
  4. Amanda Palmer on Music Industry Survival Techniques (O’Reilly Radar) — I’ve always approached every Internet platform and every Internet tool with the suspicion that it may not last, and that actually what’s very important is […] the art and the relationships I’m building.
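
The "geometric template matcher" idea from the $1 Unistroke Recognizer item above can be illustrated with a deliberately simplified Python sketch. This is not the $1 algorithm itself: it skips the rotation search, scales uniformly rather than to a square, and every function name, the point count of 32, and the toy templates are assumptions made for illustration. It only shows the general shape of the approach: resample each stroke to a fixed number of points, normalize position and size, then pick the stored template with the smallest mean Euclidean distance.

```python
import math

def path_length(points):
    """Total length of the polyline through the given (x, y) points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))

def resample(points, n=32):
    """Resample a stroke to n points spaced evenly along its path."""
    interval = path_length(points) / (n - 1)
    pts = list(points)
    out = [pts[0]]
    travelled = 0.0
    i = 1
    while i < len(pts):
        d = math.dist(pts[i - 1], pts[i])
        if d > 0 and travelled + d >= interval:
            t = (interval - travelled) / d
            q = (pts[i - 1][0] + t * (pts[i][0] - pts[i - 1][0]),
                 pts[i - 1][1] + t * (pts[i][1] - pts[i - 1][1]))
            out.append(q)
            pts.insert(i, q)  # q becomes the start of the next segment
            travelled = 0.0
        else:
            travelled += d
        i += 1
    while len(out) < n:  # floating-point rounding can leave us short
        out.append(pts[-1])
    return out[:n]

def normalize(points, n=32):
    """Resample, move the centroid to the origin, and scale to unit size."""
    pts = resample(points, n)
    cx = sum(x for x, _ in pts) / n
    cy = sum(y for _, y in pts) / n
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    # Assumption: uniform scaling by the larger bounding-box side.
    size = max(max(xs) - min(xs), max(ys) - min(ys)) or 1.0
    return [((x - cx) / size, (y - cy) / size) for x, y in pts]

def score(a, b):
    """Mean Euclidean distance between corresponding points (lower is closer)."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

def recognize(stroke, templates):
    """Return the name of the template nearest to the stroke."""
    candidate = normalize(stroke)
    return min(templates,
               key=lambda name: score(candidate, normalize(templates[name])))

# Hypothetical templates: one example stroke per gesture name.
TEMPLATES = {
    "horizontal": [(0, 0), (10, 0)],
    "vertical": [(0, 0), (0, 10)],
}
```

Calling `recognize([(5, 5), (20, 6), (40, 5)], TEMPLATES)` matches the slightly wobbly left-to-right stroke against both templates and selects `"horizontal"`. Because points are compared in order, stroke direction matters in this sketch.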

Apache Drill: Tracking its history as an open source community

A strong, open user community needs to be fostered to reveal its potential.


A strong user community is essential to releasing the full potential of an open source project, and this influence is particularly important now for the newly developed Apache Drill project. Drill is a highly scalable SQL query engine for interactive access to a wide range of big data sources and formats. Some of the ways users have an impact are an expected part of the development process: by trying the software and reporting their experiences and use cases, users in the Drill community provide valuable feedback to developers as well as raise awareness with a larger audience of what this big data tool has to offer.

This advantage was especially important with early versions of the software; users have helped development of Drill from early days by reporting bugs and praising features that they like. And now, as Drill is reaching maturity and refinement, users likely will also provide additional innovations: experimenting with Drill in their own projects, they may find new ways to use it that had not occurred to the developers.

Drill’s flexibility and extensibility lend themselves to innovation, but there’s also a natural tendency for this type of change because the big data and Hadoop landscapes are also evolving quickly. In the case of Drill, we’re seeing the “unexpectedness benefit” of openness: the community gets out ahead of the leadership in use cases and technological change.

The first big Apache Drill design meeting in September 2012 in San Jose set the tone of openness and inclusion. This was an open meeting, organized by Drill co-founder Tomer Shiran and Drill mentor Ted Dunning, and sponsored by MapR Technologies through the Bay Area Apache Drill User Group. More than 60 people attended in person, and Webex connected a larger, international audience. I recall that in addition to speaker-led presentations and discussion, long strips of paper were mounted around the room for participants to write on during breaks in order to provide ideas or offer specific ways they might want to be involved. Practical steps like this surfaced good ideas immediately, and signaled openness for future ones.

Four short links: 10 September 2015

Decentralised Software, Slow Chemistry, Spectrum Maps, and RF Interference

  1. Popcorn Time — interview with the creator. All the elements we used already existed and had done so for a long time. But nobody had put them together in an interface that talked to the user in a nice way, said Abad. Very Anonymous approach to software: Who are you going to sue? The first? The second? The third? I did the design. Was it illegal? I didn’t link the various parts together. There is no comprehensive overview of who did what. For we don’t have any business. We don’t have any headquarters or a general manager.
  2. Slow Chemistry (Nature) — “lazy man’s chemistry”: let a mix of solid reactants sit around undisturbed while they spontaneously transform themselves. More properly called slow chemistry, or even just ageing, the approach requires few, if any, hazardous solvents and uses minimal energy. If planned properly, it also consumes all the reagents in the mix, so that there is no waste and no need for chemical-intensive purification.
  3. Mapping the Spectrum in the Mission — SDR scanner to make a map of spectrum activity.
  4. Electronic Noise is Drowning Out the Internet of Things (IEEE Spectrum) — (paraphrasing) increases deployment costs, decreases battery life, creates interference, ruins policies of spectrum allocation, is expensive to trace, and almost impossible to stop.
Four short links: 7 September 2015

Nanoscale Motors, Language of Betrayal, Messaging, and Handing Off Culture

  1. Nanoscale Motors (Nature) — “We’ve made 50 or 60 different motors,” says Ben Feringa, a chemist at the University of Groningen in the Netherlands. “I’m less interested in making another motor than actually using it.” An interesting summary of the progress made in nanoscale engineering.
  2. Linguistics Signs of Betrayal — as found by studying Diplomacy players. Betrayers suddenly become more positive, possibly attempting to hide their duplicity. Betrayers suddenly become less polite, after having kept up a façade of politeness, during which the victims were significantly less polite. A reversal of imbalance occurs right before the betrayal. Victims plan more. Making a lot of plans can put pressure on the relationship and hasten betrayal, and, at the same time, if the betrayer’s mind is made up, there is no point for him to plan.
  3. NATS — open source (MIT-licensed) messaging system that shares the best name in the world.
  4. Building a Culture and Handing it Off (Kellan Elliott-McCrea) — Successfully building a culture ensures when you leave you can hand your work off to people you trust and they will run the thing without you and make it better than you could have imagined.
Four short links: 3 September 2015

Lock Patterns, Peer-to-Peer Markets, Community Products, and Speech Recognition

  1. The Surprising Predictability of Android Lock Patterns (Ars Technica) — people use the same type of strategy for remembering a pattern as a password.
  2. Peer to Peer Markets (PDF) — We discuss elements of market design that make this possible, including search and matching algorithms, pricing, and reputation systems. We then develop a simple model of how these markets enable entry by small or flexible suppliers, and the resulting impact on existing firms. Finally, we consider the regulation of peer-to-peer markets, and the economic arguments for different approaches to licensing and certification, data, and employment regulation.
  3. 16 Product Things I learned at Imgur — You can A/B test individuals, but it’s nearly impossible to A/B test communities because they work based on a mutually reinforcing self-conception. Use a combination of intuition (which comes from experience), talking to other community managers and 1:1 contact with a sample of your community. But you’ll still be wrong a lot.
  4. kaldi — a toolkit for speech recognition written in C++ and licensed under the Apache License v2.0.
Four short links: 1 September 2015

People Detection, Ratings Patterns, Inspection Bias, and Cloud Filesystem

  1. End-to-End People Detection in Crowded Scenes — research paper and code. When parsing the title, bind “end-to-end” to “scenes” not “people”.
  2. Statistical Patterns in Movie Ratings (PLOS ONE) — We find that the distribution of votes presents scale-free behavior over several orders of magnitude, with an exponent very close to 3/2, with exponential cutoff. It is remarkable that this pattern emerges independently of movie attributes such as average rating, age and genre, with the exception of a few genres and of high-budget films.
  3. The Inspection Bias is Everywhere — In 1991, Scott Feld presented the “friendship paradox”: the observation that most people have fewer friends than their friends have. He studied real-life friends, but the same effect appears in online networks: if you choose a random Facebook user, and then choose one of their friends at random, the chance is about 80% that the friend has more friends. The friendship paradox is a form of the inspection paradox. When you choose a random user, every user is equally likely. But when you choose one of their friends, you are more likely to choose someone with a lot of friends. Specifically, someone with x friends is overrepresented by a factor of x.
  4. s3ql — a file system that stores all its data online using storage services like Google Storage, Amazon S3, or OpenStack. S3QL effectively provides a hard disk of dynamic, infinite capacity that can be accessed from any computer with internet access running Linux, FreeBSD or OS-X. (GPLv3)
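
The "overrepresented by a factor of x" claim in the inspection bias item above can be checked on a toy network. The following Python sketch is only an illustration under stated assumptions: the function names and the five-person star network are hypothetical, people with no friends are simply absent from the list, and "a random friend" is modelled as a random end of a random friendship, which is the sampling scheme under which a person with x friends shows up exactly x times.

```python
from collections import Counter

def degrees(friendships):
    """Count how many friends each person has in an undirected friendship list."""
    count = Counter()
    for a, b in friendships:
        count[a] += 1
        count[b] += 1
    return count

def mean_friends_of_random_user(friendships):
    """Average friend count when every user is equally likely to be picked."""
    deg = degrees(friendships)
    return sum(deg.values()) / len(deg)

def mean_friends_of_random_friend(friendships):
    """Average friend count when picking a random end of a random friendship.

    A person with x friends appears at x friendship ends, so each friend
    count x is weighted by x: the result is sum(x * x) / sum(x).
    """
    deg = degrees(friendships)
    return sum(d * d for d in deg.values()) / sum(deg.values())

# Hypothetical toy network: one hub ("ann") who is friends with four others.
STAR = [("ann", "bob"), ("ann", "cat"), ("ann", "dan"), ("ann", "eve")]
```

On `STAR`, a random user has 8 / 5 = 1.6 friends on average, while a random friend has (16 + 1 + 1 + 1 + 1) / 8 = 2.5, because the hub is counted once per friendship.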
Four short links: 25 August 2015

Microservices Anti-Patterns, Reverse Engineering Course, Graph Language, and Automation Research

  1. Seven Microservices Anti-Patterns — One common mistake people made with SOA was misunderstanding how to achieve the reusability of services. Teams mostly focused on technical cohesion rather than functional regarding reusability. For example, several services functioned as a data access layer (ORM) to expose tables as services; they thought it would be highly reusable. This created an artificial physical layer managed by a horizontal team, which caused delivery dependency. Any service created should be highly autonomous – meaning independent of each other.
  2. CSCI 4974 / 6974 Hardware Reverse Engineering — RPI CS course in reverse engineering.
  3. The Gremlin Graph Traversal Language (Slideshare) — preso on a language for navigating graph data structures, which is part of the Apache TinkerPop (“Open Source Graph Computing”) suite.
  4. Why Are There Still So Many Jobs? The History and Future of Workplace Automation (PDF) — paper about the history of technology and labour. The issue is not that middle-class workers are doomed by automation and technology, but instead that human capital investment must be at the heart of any long-term strategy for producing skills that are complemented by rather than substituted for by technological change. Found via Scott Santens’s comprehensive rebuttal.