ENTRIES TAGGED "open data"
Machine Learning Mistakes, Recommendation Bandits, Droplet Robots, and Plain English
- Machine Learning Done Wrong — [M]ost practitioners pick the modeling algorithm they are most familiar with rather than pick the one which best suits the data. In this post, I would like to share some common mistakes (the don’t-s).
- Bandits for Recommendations — A common problem for internet-based companies is: which piece of content should we display? Google has this problem (which ad to show), Facebook has this problem (which friend’s post to show), and RichRelevance has this problem (which product recommendation to show). Many of the promising solutions come from the study of the multi-armed bandit problem.
- Droplets — the Droplet is almost spherical, can self-right after being poured out of a bucket, and has the hardware capabilities to organize into complex shapes with its neighbors due to accurate range and bearing. Droplets are available open-source and use cheap vibration motors and a 3D printed shell. (via Robohub)
- Apple’s App Store Approval Guidelines — some of the plainest English I’ve seen, especially the Introduction. I can only aspire to that clarity. If your App looks like it was cobbled together in a few days, or you’re trying to get your first practice App into the store to impress your friends, please brace yourself for rejection. We have lots of serious developers who don’t want their quality Apps to be surrounded by amateur hour.
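For readers new to the multi-armed bandit framing mentioned in the Bandits for Recommendations item, here is a minimal sketch of one well-known strategy, epsilon-greedy. It is not taken from the linked post: the function name `epsilon_greedy`, the click-through rates, and the parameter defaults are all hypothetical, chosen only to illustrate the "which piece of content should we display?" trade-off between exploring and exploiting.

```python
import random


def epsilon_greedy(true_rates, rounds=5000, epsilon=0.1, seed=0):
    """Simulate epsilon-greedy content selection.

    true_rates: hypothetical click probability of each piece of content
    (unknown to the algorithm, used only to simulate user clicks).
    Returns (pulls, rewards): how often each item was shown and clicked.
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    pulls = [0] * n_arms
    rewards = [0.0] * n_arms

    def estimate(arm):
        # Items never shown get an infinite estimate so each is tried once.
        return rewards[arm] / pulls[arm] if pulls[arm] else float("inf")

    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)           # explore: show a random item
        else:
            arm = max(range(n_arms), key=estimate)  # exploit: best estimate so far
        clicked = rng.random() < true_rates[arm]  # simulated user response
        pulls[arm] += 1
        rewards[arm] += 1.0 if clicked else 0.0
    return pulls, rewards


if __name__ == "__main__":
    # Assumed click-through rates for three candidate items.
    pulls, rewards = epsilon_greedy([0.05, 0.10, 0.60])
    print("times shown:", pulls)
    print("clicks:", rewards)
```

Production recommenders typically use more refined strategies (for example, ones that shrink exploration over time), so treat this only as the simplest instance of the idea.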
Video Transparency, Software Traffic, Distributed Database, and Open Source Sustainability
- Video Quality Report — transparency is a great way to indirectly exert leverage.
- Control Your Traffic Flows with Software — using BGP to balance traffic. Will be interesting to see how the more extreme traffic managers deploy SDN in the data center.
- Cockroach — a distributed key/value datastore which supports ACID transactional semantics and versioned values as first-class features. The primary design goal is global consistency and survivability, hence the name. Cockroach aims to tolerate disk, machine, rack, and even datacenter failures with minimal latency disruption and no manual intervention. Cockroach nodes are symmetric; a design goal is one binary with minimal configuration and no required auxiliary services.
- Linux Foundation Providing for Core Infrastructure Projects — press release, but interested in how they’re tackling sustainability—they’re taking on identifying worthies (glad I’m not the one who says “you’re not worthy” to a project) and being the non-profit conduit for the dosh. Interesting: implies they think the reason companies weren’t supporting necessary open source projects was some combination of being unsure who to support (projects you use, surely?) and how to get them money (ask?). (Sustainability of open source projects is a pet interest of mine)
Modern Software Development, Internet Trends, Software Ethics, and Open Government Data
- Beyond the Stack (Mike Loukides) — tools and processes to support software developers who are as massively distributed as the code they build.
- Mary Meeker’s Internet Trends 2014 (PDF) — the changes on slide 34 are interesting: usage moving away from G+/Facebook-style omniblather creepware and towards phonebook-based chat apps.
- Introduction to Software Engineering Ethics (PDF) — amazing set of provocative questions and scenarios for software engineers about the decisions they made and consequences of their actions. From a course in ethics from SCU.
- Open Government Data Online: Impenetrable (Guardian) — Too much knowledge gets trapped in multi-page pdf files that are slow to download (especially in low-bandwidth areas), costly to print, and unavailable for computer analysis until someone manually or automatically extracts the raw data.
Github for Data, Open Laptop, Crowdsourced Analysis, and Open Source Scraping
- dat — github-like tool for data, still v. early. It’s overdue. (via Nelson Minar)
- Novena Open Laptop — Bunnie Huang’s laptop goes on sale.
- Crowd Forecasting (NPR) — How is it possible that a group of average citizens doing Google searches in their suburban town homes can outpredict members of the United States intelligence community with access to classified information?
- Portia — open source visual web scraping tool.
Unimaginative Vehicular Connectivity, Data Journalism, VR and Gender, and Open Data Justice
- Connected for a Purpose (Jim Stogdill) — At a recent conference, an executive at a major auto manufacturer described his company’s efforts to digitize their line-up like this: “We’re basically wrapping a two-ton car around an iPad.” Eloquent critique of the Internet of Shallow Things.
- Why Nate Silver Can’t Explain It All — Data extrapolation is a very impressive trick when performed with skill and grace, like ice sculpting or analytical philosophy, but it doesn’t come equipped with the humility we should demand from our writers. Would be a shame for Nate Silver to become Malcolm Gladwell: nice stories but they don’t really hold up.
- Gender and VR (danah boyd) — Although there was variability across the board, biological men were significantly more likely to prioritize motion parallax. Biological women relied more heavily on shape-from-shading. In other words, men are more likely to use the cues that 3D virtual reality systems relied on. Great article, especially notable for the observation that there are more sex hormones on the retina than anywhere else in the body except for the gonads.
- Even The Innocent Should Worry About Sex Offender Apps (Quartz) — And when data becomes compressed by third parties, when it gets flattened out into one single data stream, your present and your past collide with potentially huge ramifications for your future. When it comes to personal data—of any kind—we not only need to consider what it will be used for but how that data will be represented, and what such representation might mean for us and others. Data policies are like justice systems: either you suffer a few innocent people being wrongly condemned (bad uses of open data), or your system permits some wrongdoers to escape (mould grows in the dark).
Understanding Image Processing, Sharing Data, Fixing Bad Science, and Delightful Dashboard
- 2D Image Post-Processing Techniques and Algorithms (DIY Drones) — understanding how automated image matching and processing tools work means you can also get a better understanding how to shoot your images and what to prevent to get good matches.
- Scientists Need to Learn to Share — despite science’s reputation for rigor, sloppiness is a substantial problem in some fields. You’re much more likely to check your work and follow best data-handling practices when you know someone is going to run your code and parse your data.
- METRICS — Meta-Research Innovation Center at Stanford. John Ioannidis has a posse: connecting researchers into weak science, running conferences, creating a “journal watch”, and engaging policy makers. (says The Economist)
- Grafana — elegant dashboard for graphite (the realtime data graphing engine).
An exploration of themes in Joel Gurin's book Open Data Now.
As governments and businesses — and increasingly, all of us who are Internet-connected — release data out in the open, we come closer to resolving the tiresomely famous and perplexing quote from Stewart Brand: “Information wants to be free. Information also wants to be expensive.” Open data brings home to us how much free information is available and how productive it is in its free state, but one subterranean thread I found in Joel Gurin’s book Open Data Now highlights an important point: information is very expensive.
In this article, I’ll explore a few themes that piqued my interest in Gurin’s book: the value of open data, the expense it entails, the questions of how much we can use and trust it, and the role the general public and the private sector play in bringing us data’s benefits. This is not meant to be a summary or a review of Gurin’s book; it is an exploration of themes that interest me, inspired by my reading of Gurin.
Open, trustworthy, and useful
“Open data” occupies hierarchies of usefulness. One way of describing its usefulness is the structure of its presentation, as Gurin and others such as Tim Berners-Lee have pointed out. Much data is still fairly unstructured, like the reviews and social media status postings that people generate by the millions and that are funneled into eager consumption by marketing analysts. Some data is more structured, existing as tables. And finally, a tiny fragment can be reached through the RESTful APIs supported by libraries in every modern programming language.
Wolfram Language, Historic Innovation, SF Culture Wars, and Privacy's Death
- Wolfram Language — a broad attempt to integrate types, operations, and databases along with deployment, parallelism, and real-time I/O. The demo video is impressive, not just in execution but in ambition. Healthy skepticism still necessary.
- Maury, Innovation, and Change (Cory Ondrejka) — amazing historical story of open data, analysis, visualisation, and change. In the mid-1800’s, over the course of 15 years, a disabled Lieutenant changed the US Navy and the world. He did it by finding space to maneuver (as a trouble maker exiled to the Navy Depot), demonstrating value with his early publications, and creating a massive network effect by establishing the Naval Observatory as the clearing house for Navigational data. 150 years before Web 2.0, he built a valuable service around common APIs and aggregated data by distributing it freely to the people who needed it.
- Commuter Shuttle and 21-Hayes EB Bus Stop Observations (Vimeo) — timelapse of 6:15AM to 9:15AM at an SF bus stop. Worth watching if you’re outside SF and wondering what they’re talking about when the locals rage against SF becoming a bedroom community for Valley workers.
- A Day of Speaking Truth to Power (Quinn Norton) — It was a room that had written off privacy as an archaic structure. I tried to push back, not only by pointing out this was the opening days of networked life, and so custom hadn’t caught up yet, but also by recommending danah boyd’s new book It’s Complicated repeatedly. To claim “people trade privacy for free email therefore privacy is dead” is like 1800s sweatshop owners claiming “people trade long hours in unpleasant conditions for miserable pay therefore human rights are dead”. Reports of privacy’s death are greatly exaggerated.
Minecraft+Pi+Python, Science Torrents, Web App Performance Measurement, and Streaming Data
- Programming Minecraft Pi with Python — an early draft, but shows promise for kids. (via Raspberry Pi)
- Terasaur — BitTorrent for mad-large files, making it easy for datasets to be saved and exchanged.
- Bucky — Open-source tool to measure the performance of your web app directly from your users’ browsers. Nifty graph.
- Zoe Keating’s Streaming Payouts — actual data on a real musician’s distribution and revenues through various channels. Hint: streaming is tragicomically low-paying. (via Andy Baio)
Open Web Ranking, Quantified Self Gadgets, Armband Input, and Bitcoin Exchanges Threatened
- The Common Crawl WWW Ranking — open data, open methodology, behind an open ranking of the top sites on the web. Preprint paper available. (via Slashdot)
- Felton’s Sensors (Quartz) — inside the gadgets Nicholas Felton uses to quantify himself.
- Myo Armband (IEEE Spectrum) — armband input device with eight EMG (electromyography) muscle activity sensors along with a nine-axis inertial measurement unit (that’s three axes each for accelerometer, gyro, and magnetometer), meaning that you get forearm gesture sensing along with relative motion sensing (as opposed to absolute position). The EMG sensors pick up on the electrical potential generated by muscle cells, and with the Myo on your forearm, the sensors can read all of the muscles that control your fingers, letting them spy on finger position as well as grip strength.
- Bitcoin Exchanges Under Massive and Concerted Attack — he who lives by the network dies by the network. A DDoS attack is taking Bitcoin’s transaction malleability problem and applying it to many transactions in the network, simultaneously. “So as transactions are being created, malformed/parallel transactions are also being created so as to create a fog of confusion over the entire network, which then affects almost every single implementation out there,” he added. Antonopoulos went on to say that Blockchain.info’s implementation is not affected, but some exchanges have been affected – their internal accounting systems are gradually going out of sync with the network.