March 2013 Archives

Data Science tools: Are you “all in” or do you “mix and match”?

It helps to reduce context-switching during long data science workflows.

An integrated data stack boosts productivity
As I noted in my previous post, Python programmers willing to go “all in” have Python tools to cover most of data science. Lest I be accused of oversimplification, a Python programmer still needs to commit to learning a non-trivial set of tools. I suspect that once they invest the time to learn the Python data stack, they tend to stick with it unless they absolutely have to use something else. But being able to stick with the same programming language and environment is a definite productivity boost. It requires less “setup time” in order to explore data using different techniques (visualization, statistics, machine learning).

Multiple tools and languages can impede reproducibility and flow
On the other end of the spectrum are data scientists who mix and match tools, and use packages and frameworks from several languages. Depending on the task, data scientists can avail themselves of tools that are scalable, performant, require less code, and contain a lot of features. On the other hand, this approach requires a lot more context-switching, and extra effort is needed to annotate long workflows. Failure to document things properly makes it tough to reproduce analysis projects, and impedes knowledge transfer within a team of data scientists. Frequent context-switching also makes it more difficult to be in a state of flow, as one has to think about implementation/package details instead of exploring data. It can be harder to discover interesting stories with your data if you’re constantly having to think about what you’re doing. (It’s still possible, you just have to concentrate a bit harder.)

Read more…

Four short links: 29 March 2013

Titan Improved, Security Tweeps, Probabilistic Programming, and 3D-Printable Optics

  1. Titan 0.3 Out — graph database now has full-text, geo, and numeric-range index backends.
  2. Mozilla Security Community Do a Reddit AMA — if you wanted a list of sharp web security people to follow on Twitter, you could do a lot worse than this.
  3. Probabilistic Programming and Bayesian Methods for Hackers (Github) — An introduction to Bayesian methods + probabilistic programming in data analysis with a computation/understanding-first, mathematics-second point of view. All in pure Python. See also Why Probabilistic Programming Matters and Trends to Watch: Logic and Probabilistic Programming. (via Mike Loukides and Renee DiRestra)
  4. Open Source 3D-Printable Optics Equipment (PLOSone) — This study demonstrates an open-source optical library, which significantly reduces the costs associated with much optical equipment, while also enabling relatively easily adapted customizable designs. The cost reductions in general are over 97%, with some components representing only 1% of the current commercial investment for optical products of similar function. The results of this study make it clear that this method of scientific hardware development enables a much broader audience to participate in optical experimentation both as research and teaching platforms than previous proprietary methods.

Strata Week: Our phones are giving us away

Anonymized phone data isn't as anonymous as we thought, a CFPB API, and NYC's "geek squad of civic-minded number-crunchers."

Mobile phone mobility traces ID users with only four data points

A study published this week in Scientific Reports, Unique in the Crowd: The privacy bounds of human mobility, shows that the location data in mobile phones poses an anonymity risk. Jason Palmer reported at the BBC that researchers at MIT and the Catholic University of Louvain reviewed data from 15 months’ worth of phone records for 1.5 million people and were able to identify “mobility traces,” or “evident paths of each mobile phone,” using only four locations and times to positively identify a particular user. Yves-Alexandre de Montjoye, the study’s lead author, told Palmer that “[t]he way we move and the behaviour is so unique that four points are enough to identify 95% of people.”
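To make the idea of identifying a user from a handful of points concrete, here is a toy sketch of that kind of uniqueness check. It is not the study's method or data: the `TRACES` dictionary, the antenna names, and the helper functions are all hypothetical, and a trace is simplified to a set of (antenna, hour) pairs.

```python
from itertools import combinations

# Hypothetical toy traces: each user maps to a set of (antenna, hour) points.
TRACES = {
    "user_a": {("north", 8), ("center", 9), ("south", 18), ("east", 20)},
    "user_b": {("north", 8), ("center", 9), ("west", 18), ("east", 20)},
    "user_c": {("north", 8), ("center", 12), ("south", 18), ("west", 21)},
}


def matching_users(points, traces):
    """Return the users whose trace contains every point in `points`."""
    wanted = set(points)
    return sorted(user for user, trace in traces.items() if wanted <= trace)


def unicity(traces, k):
    """Fraction of k-point samples (taken from each user's own trace)
    that match exactly one user in the whole dataset."""
    total = 0
    unique = 0
    for trace in traces.values():
        for points in combinations(sorted(trace), k):
            total += 1
            if len(matching_users(points, traces)) == 1:
                unique += 1
    return unique / total


if __name__ == "__main__":
    for k in (1, 2, 3, 4):
        print(k, unicity(TRACES, k))
```

Even in this tiny example, a single point rarely singles anyone out, while all four points always do, which is the general shape of the effect the researchers describe.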

Read more…

Large-Scale Data Collection and Real-Time Analytics Using Redis

Insights from a Strata Santa Clara 2013 session

By C. Aaron Cois

Strata Santa Clara 2013 is a wrap, and I had a great time speaking and interacting with all of the amazing attendees. I’d like to recap the talk that Tim Palko and I gave, entitled “Large-Scale Data Collection and Real-Time Analytics Using Redis”, and maybe even answer a few questions we were asked following our time on stage.

Our talk centered on a system we designed to collect environmental sensor data from remote sensors located in various places across the country and provide real-time visualization, monitoring, and event detection. Our primary challenge for the initial phase of development proved to be scaling the system to collect data from thousands of nodes, each of which sent sensor readings roughly once per second, while maintaining the ability to query the data in real time for event detection. While each data record was only ~300 bytes, our expected maximum sensor load indicated a collection rate of about 27 million records, or 8GB, per hour. However, our primary issue was not data size, but data rate. A large number of inserts had to happen each second, and we were unable to buffer inserts into batches or transactions without incurring a delay in the real-time data stream.
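As a quick sanity check on the figures above, the hourly totals can be converted into a per-second insert rate and an average record size. This is only back-of-the-envelope arithmetic, assuming decimal gigabytes and a perfectly even load; the variable names are illustrative.

```python
# Figures quoted in the post.
RECORDS_PER_HOUR = 27_000_000
BYTES_PER_HOUR = 8 * 10**9  # 8GB, assuming decimal gigabytes

SECONDS_PER_HOUR = 3600

# Sustained insert rate the system has to absorb (assumes an even load).
inserts_per_second = RECORDS_PER_HOUR / SECONDS_PER_HOUR

# Average record size implied by the two hourly totals.
bytes_per_record = BYTES_PER_HOUR / RECORDS_PER_HOUR

if __name__ == "__main__":
    print(f"{inserts_per_second:.0f} inserts per second")
    print(f"{bytes_per_record:.0f} bytes per record")
```

Under those assumptions the totals work out to about 7,500 inserts per second and a little under 300 bytes per record, which is why the rate, not the volume, was the hard part.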

Read more…

Commerce Weekly: Reimagining the stages of retail

Remembering the basics in retail, desperate attempts to battle showrooming, and thrifty retail endeavors.

The basics remain key in our radically changing retail environment

This week, PandoDaily’s Sarah Lacy addressed the issue of whether or not brick-and-mortar retail is dead and argued that it’s more “dying as we know it” than dead-dead. Lacy pointed to several ecommerce 2.0 startups — online retailers expanding into brick-and-mortar — who are creating twists, or “tweaks,” in the traditional retail model, eschewing the traditional retail playbook.

Tweaks Lacy highlighted include stores such as Warby Parker and Bonobos employing the showroom model, a kind of reverse-engineered try-before-you-buy in which customers try items in the store and then order online; piggybacking on existing retail chains to secure customers and expand reach; and opening pop-up stores or experimenting with physical mobile retail, such as Warby Parker’s experiment driving glasses around the country on a refashioned bus.

Barbara E. Kahn at Harvard Business Review says the strength of these new companies as well as the successful old guard retail chains that remain is their ability to understand how the stages of retail fit into the new and changing retail environment.

Read more…

How crowdfunding and the JOBS Act will shape open source companies

New regulations could mark the end of proprietary finance.

Currently, anyone can crowdfund products, projects, causes, and sometimes debt. Current U.S. Securities and Exchange Commission (SEC) regulations make crowdfunding companies (i.e. selling stocks rather than products on crowdfund platforms) illegal. The only way to sell stocks to the public at large under the current law is through the heavily regulated Initial Public Offering (IPO) process.

The JOBS Act will soon change these rules. This will mean that platforms like Kickstarter will be able to sell shares in companies, assuming those companies follow certain strict rules. This change in finance law will enable open source companies to access capital and dominate the technology industry. This is the dawn of crowdfunded finance, and with it comes the dawn of open source technology everywhere.

The JOBS Act is already law, and it required the SEC to create specific rules by specific deadlines. The SEC is working on the rulemaking, but it has made it clear that given the complexity of this new finance structure, meeting the deadlines is not achievable. No one is happy with the delay but the rules should be done in late 2013 or early 2014.

When those rules are addressed, thousands of open source companies will use this financial instrument to create new types of enterprise open source software, hardware, and bioware. These companies will be comfortably funded by their open source communities. Unlike traditional venture-capital-backed companies, these new companies will narrowly focus on getting the technology right and putting their communities first. Eventually, I think these companies will make most proprietary software companies obsolete.

Read more…

Four short links: 28 March 2013

Chinese Lessons, White House Embraces Makers, DC Codes Freed, and Malware Numbers

  1. What American Startups Can Learn From the Cutthroat Chinese Software Industry — It follows that the idea of “viral” or “organic” growth doesn’t exist in China. “User acquisition is all about media buys. Platform-to-platform in China is war, and it is fought viciously and bitterly. If you have a Gmail account and send an email to, for example, NetEase163.com, which is the local web dominant player, it will most likely go to spam or junk folders regardless of your settings. Just to get an email to go through to your inbox, the company sending the email needs to have a special partnership.” This entire article is a horror show.
  2. White House Hangout Maker Movement (Whitehouse) — During the Hangout, Tom Kalil will discuss the elements of an “all hands on deck” effort to promote Making, with participants including: Dale Dougherty, Founder and Publisher of MAKE; Tara Tiger Brown, Los Angeles Makerspace; Super Awesome Sylvia, Super Awesome Maker Show; Saul Griffith, Co-Founder, Otherlab; Venkatesh Prasad, Ford.
  3. Municipal Codes of DC Freed (BoingBoing) — more good work by Carl Malamud. He’s specifically providing data for apps.
  4. The Modern Malware Review (PDF) — 90% of fully undetected malware was delivered via web-browsing; It took antivirus vendors 4 times as long to detect malware from web-based applications as opposed to email (20 days for web, 5 days for email); FTP was observed to be exceptionally high-risk.

The coming of the industrial internet

Our new research report outlines our vision for the coming-together of software and big machines.

The big machines that define modern life — cars, airplanes, furnaces, and so forth — have become exquisitely efficient, safe, and responsive over the last century through constant mechanical refinement. But mechanical refinement has its limits, and there are enormous improvements to be wrung out of the way that big machines are operated: an efficient furnace is still wasteful if it heats a building that no one is using; a safe car is still dangerous in the hands of a bad driver.

It is this challenge that the industrial internet promises to address by layering smart software on top of machines. The last few years have seen enormous advances in software and computing that can handle gushing streams of data and build nuanced models of complex systems. These have been used effectively in advertising and web commerce, where data is easy to gather and control is easy to exert, and marketers have rejoiced.

Thanks to widespread sensors, pervasive networks, and standardized interfaces, similar software can interact with the physical world — harvesting data, analyzing it in context, and making adjustments in real-time. The same data-driven approach that gives us dynamic pricing on Amazon and customized recommendations on Foursquare has already started to make wind turbines more efficient and thermostats more responsive. It may soon obviate humans as drivers and help blast furnaces anticipate changes in electricity prices.

Read more…

Visualization of the Week: MOOC completion rates

Educational researcher Katy Jordan created an interactive visualization using completion and enrollment data from recent MOOCs.

Massive open online courses, or MOOCs, offered through platforms such as Coursera, EdX and Udacity, are arguably helping to fill higher education needs around the world. Educational researcher Katy Jordan noted in a post, however, that “although thousands enroll for courses, a very small proportion actually complete the course.” To take a closer look, she pulled together an interactive visualization to show enrollment numbers and completion rates from recent MOOCs.

Read more…

Let’s do this the hard way

Being both liberal and safe in programming is hard

Recent discoveries of security vulnerabilities in Rails and MongoDB led me to think about how people get to write software.

In engineering, you don’t get to build a structure people can walk into without years of study. In software, we often write what the heck we want and go back to clean up the mess later. It works, but the consequences start to get pretty monumental when you consider the network effects of open source.

You might think it’s a consequence of the tools we use—running fast and loose with scripting languages. I’m not convinced. Unusually among computer science courses, my alma mater taught us programming 101 with Ada. Ada is a language that more or less requires a retinal scan before you can use the compiler. It was a royal pain to get Ada to do anything you wanted: the philosophical inverse of Perl or Ruby. We certainly came up the “hard way.”

I’m not sure that the hard way was any better: a language that protects you from yourself doesn’t teach you much about the problems you can create.

But perhaps we are in need of an inversion of philosophy. Where Internet programming is concerned, everyone is quick to quote Postel’s law: “Be conservative in what you do, be liberal in what you accept from others.”

The fact of it is that being liberal in what you accept is really hard. You basically have two options: look carefully for only the information you need, which I think is the spirit of Postel’s law, or implement something powerful that will take care of many use cases. This latter strategy, though seemingly quicker and more future-proof, is what often leads to bugs and security holes, as unintended applications of powerful parsers manifest themselves.

My conclusion is this: use whatever language makes sense, but be systematically paranoid. Be liberal in what you accept, but conservative about what you believe.
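The first option described above, looking carefully for only the information you need, might be sketched like this. It is a minimal illustration rather than a recipe: the `EXPECTED_FIELDS` schema and the `extract_needed` helper are hypothetical names, and JSON stands in for whatever input format you actually accept.

```python
import json

# Hypothetical schema: the only fields this code needs, with expected types.
EXPECTED_FIELDS = {"name": str, "age": int}


def extract_needed(raw):
    """Parse JSON text and return only the expected fields, type-checked.

    Unknown keys are silently ignored (liberal in what we accept);
    missing or wrongly typed values are rejected (conservative in what
    we believe).
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    result = {}
    for field, expected_type in EXPECTED_FIELDS.items():
        value = data.get(field)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise ValueError(f"bad or missing field: {field}")
        result[field] = value
    return result


if __name__ == "__main__":
    # The extra "admin" key is accepted on input but never believed.
    print(extract_needed('{"name": "ada", "age": 36, "admin": true}'))
```

The point of the design is that nothing in the input can add behavior: extra keys fall away, and anything that does not look exactly like what was asked for is refused rather than interpreted.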