"open data" entries

We need open models, not just open data

If you really want to understand the effect data is having, you need the models.

Opening_Up_Sonny_Abesamis_Flickr

Writing my post about AI and summoning the demon led me to re-read a number of articles on Cathy O’Neil’s excellent mathbabe blog. I highlighted a point Cathy has made consistently: if you’re not careful, modelling has a nasty way of enshrining prejudice with a veneer of “science” and “math.”

Cathy has consistently made another point that’s a corollary of her argument about enshrining prejudice. At O’Reilly, we talk a lot about open data. But it’s not just the data that has to be open: it’s also the models. (There are too many must-read articles on Cathy’s blog to link to; you’ll have to find the rest on your own.)

You can have all the crime data you want, all the real estate data you want, all the student performance data you want, all the medical data you want, but if you don’t know what models are being used to generate results, you don’t have much. Read more…

Comments: 5
Four short links: 4 November 2014

Four short links: 4 November 2014

3D Shares, Autonomous Golf Carts, Competitive Solar, and Interesting Data Problems

  1. Cooper-Hewitt Shows How to Share 3D Scan Data Right (Makezine) — important as we move to a web of physical models, maps, and designs.
  2. Singapore Tests Autonomous Golfcarts (Robohub) — a reminder that the future may not necessarily look like someone used the clone tool to paint Silicon Valley over the world.
  3. Solar Hits Parity in 10 States, 47 by 2016 (Bloomberg) — The reason solar-power generation will increasingly dominate: it’s a technology, not a fuel. As such, efficiency increases and prices fall as time goes on. The price of Earth’s limited fossil fuels tends to go the other direction.
  4. Facebook’s Top Open Data Problems (Facebook Research) — even if you’re not interested in Facebook’s Very First World Problems, this is full of factoids like Facebook’s social graph store TAO, for example, provides access to tens of petabytes of data, but answers most queries by checking a single page in a single machine. (via Greg Linden)
Comment
Four short links: 29 October 2014

Four short links: 29 October 2014

Tweet Parsing, Focus and Money, Challenging Open Data Beliefs, and Exploring ISP Data

  1. TweetNLP — CMU open source natural language parsing tools for making sense of Tweets.
  2. Interview with Google X Life Science’s Head (Medium) — I will have been here two years this March. In nineteen months we have been able to hire more than a hundred scientists to work on this. We’ve been able to build customized labs and get the equipment to make nanoparticles and decorate them and functionalize them. We’ve been able to strike up collaborations with MIT and Stanford and Duke. We’ve been able to initiate protocols and partnerships with companies like Novartis. We’ve been able to initiate trials like the baseline trial. This would be a good decade somewhere else. The power of focus and money.
  3. Schooloscope Open Data Post-MortemThe case of Schooloscope and the wider question of public access to school data challenges the belief that sunlight is the best disinfectant, that government transparency would always lead to better government, better results. It challenges the sentiments that see data as value-neutral and its representation as devoid of politics. In fact, access to school data exposes a sharp contrast between the private interest of the family (best education for my child) and the public interest of the government (best education for all citizens).
  4. M-Lab Observatory — explorable data on the data experience (RTT, upload speed, etc) across different ISPs in different geographies over time.
Comment
Four short links: 24 October 2014

Four short links: 24 October 2014

Parallel Algorithm, Open Source Bio, 3D Printed Peptides, and Open London Data

  1. PaGMOParallel Global Multiobjective Optimizer […] a generalization of the island model paradigm working for global and local optimization algorithms. Its main parallelization approach makes use of multiple threads, but MPI is also implemented and can be mixed in with multithreading. PaGMO can be used to solve in a parallel fashion, global optimization tasks.
  2. Avoiding the Tragedy of the Anticommons — Many people talk about “open source biology.” Mike Loukides pulls apart open source and biology to see what the relationship might be. I’m still chewing on what devops for bio would be. Modern software systems throw off gigabytes of data, and we have built tools to monitor those systems, archive their data, and automate much of the analysis. There are free and commercial packages for logging and monitoring, and it continues to be a very active area of software development, as anyone who’s attended O’Reilly’s Velocity conference knows.
  3. peppytides (Makezine) — 3d-printed super accurate, scaled 3D-model of a polypeptide chain that can be folded into all the basic protein structures, like α-helices, β-sheets, and β-turns. (via Lenore Edman)
  4. London Data Store — dashboard and open data catalogue for City of London’s data release efforts.
Comment: 1

Open data for open lands

Recreation.gov should be a platform, not a silo.

President Obama’s well-publicized national open data policy (pdf) makes it clear that government data is a valuable public resource for which the government should be making efforts to maximize access and use. This policy was based on lessons from previous government open data success stories, such as weather data and GPS, which form the basis for countless commercial services that we take for granted today and that deliver enormous value to society. (You can see an impressive list of companies reliant on open government data via GovLab’s Open Data 500 project.)

Based on this open data policy, I’ve been encouraging entrepreneurs to invest their time and ingenuity to explore entrepreneurial opportunities based on government data. I’ve even invested (through O’Reilly AlphaTech Ventures) in one such start-up, Hipcamp, which provides user-friendly interfaces to making reservations at national and state parks.

A better system is sorely needed. The current reservation system is clunky and difficult to use. Hipcamp changes all that, making it a breeze to reserve camping spots. Read more…

Comment
Four short links: 21 August 2014

Four short links: 21 August 2014

Open Data Glue, Smithsonian Crowdsourcing, MIT Family Creativity, and Hardware Owie

  1. Datan open source project that provides a streaming interface between every file format and data storage backend. See the Wired piece on it.
  2. Smithsonian Crowdsourcing Transcription (Smithsonian) — 49 volunteers transcribed 200 pages of correspondence between the Monuments Men in a week. Soon it’ll be mathematics test questions: “if 49 people transcribe 200 pages in 7 days, how many weeks will it take …”
  3. MIT Guide to Family CompSci SessionsThis guide is for educators, community center staff, and volunteers interested in engaging their young people and their families to become designers and inventors in their community.
  4. What to Do When You Screw up 2,000 Orders (SparkFun) — even hardware companies need to do retrospectives.
Comment
Four short links: 3 June 2014

Four short links: 3 June 2014

Machine Learning Mistakes, Recommendation Bandits, Droplet Robots, and Plain English

  1. Machine Learning Done Wrong[M]ost practitioners pick the modeling algorithm they are most familiar with rather than pick the one which best suits the data. In this post, I would like to share some common mistakes (the don’t-s).
  2. Bandits for RecommendationsA common problem for internet-based companies is: which piece of content should we display? Google has this problem (which ad to show), Facebook has this problem (which friend’s post to show), and RichRelevance has this problem (which product recommendation to show). Many of the promising solutions come from the study of the multi-armed bandit problem.
  3. Dropletsthe Droplet is almost spherical, can self-right after being poured out of a bucket, and has the hardware capabilities to organize into complex shapes with its neighbors due to accurate range and bearing. Droplets are available open-source and use cheap vibration motors and a 3D printed shell. (via Robohub)
  4. Apple’s App Store Approval Guidelines — some of the plainest English I’ve seen, especially the Introduction. I can only aspire to that clarity. If your App looks like it was cobbled together in a few days, or you’re trying to get your first practice App into the store to impress your friends, please brace yourself for rejection. We have lots of serious developers who don’t want their quality Apps to be surrounded by amateur hour.
Comment
Four short links: 30 May 2014

Four short links: 30 May 2014

Video Transparency, Software Traffic, Distributed Database, and Open Source Sustainability

  1. Video Quality Report — transparency is a great way to indirectly exert leverage.
  2. Control Your Traffic Flows with Software — using BGP to balance traffic. Will be interesting to see how the more extreme traffic managers deploy SDN in the data center.
  3. Cockroacha distributed key/value datastore which supports ACID transactional semantics and versioned values as first-class features. The primary design goal is global consistency and survivability, hence the name. Cockroach aims to tolerate disk, machine, rack, and even datacenter failures with minimal latency disruption and no manual intervention. Cockroach nodes are symmetric; a design goal is one binary with minimal configuration and no required auxiliary services.
  4. Linux Foundation Providing for Core Infrastructure Projects — press release, but interested in how they’re tackling sustainability—they’re taking on identifying worthies (glad I’m not the one who says “you’re not worthy” to a project) and being the non-profit conduit for the dosh. Interesting: implies they think the reason companies weren’t supporting necessary open source projects was some combination of being unsure who to support (projects you use, surely?) and how to get them money (ask?). (Sustainability of open source projects is a pet interest of mine)
Comment
Four short links: 29 May 2014

Four short links: 29 May 2014

Modern Software Development, Internet Trends, Software Ethics, and Open Government Data

  1. Beyond the Stack (Mike Loukides) — tools and processes to support software developers who are as massively distributed as the code they build.
  2. Mary Meeker’s Internet Trends 2014 (PDF) — the changes on slide 34 are interesting: usage moving away from G+/Facebook-style omniblather creepware and towards phonebook-based chat apps.
  3. Introduction to Software Engineering Ethics (PDF) — amazing set of provocative questions and scenarios for software engineers about the decisions they made and consequences of their actions. From a course in ethics from SCU.
  4. Open Government Data Online: Impenetrable (Guardian) — Too much knowledge gets trapped in multi-page pdf files that are slow to download (especially in low-bandwidth areas), costly to print, and unavailable for computer analysis until someone manually or automatically extracts the raw data.
Comment
Four short links: 3 April 2014

Four short links: 3 April 2014

Github for Data, Open Laptop, Crowdsourced Analysis, and Open Source Scraping

  1. dat — github-like tool for data, still v. early. It’s overdue. (via Nelson Minar)
  2. Novena Open Laptop — Bunnie Huang’s laptop goes on sale.
  3. Crowd Forecasting (NPR) — How is it possible that a group of average citizens doing Google searches in their suburban town homes can outpredict members of the United States intelligence community with access to classified information?
  4. Portia — open source visual web scraping tool.
Comment: 1