ENTRIES TAGGED "analytics"
Library Box, Data-Driven Racial Profiling, Internet of Washing Machines, and Nokia's IoT R&D
- LibraryBox 2.0 — fork of PirateBox for the TP-Link MR3020, customized for educational, library, and other needs. Wi-Fi hotspot with free and anonymous file sharing. v2 adds mesh networking and more. (via BoingBoing)
- Chicago PD’s Using Big Data to Justify Racial Profiling (Cory Doctorow) — The CPD refuses to share the names of the people on its secret watchlist, nor will it disclose the algorithm that put it there. [...] Asserting that you’re doing science but you can’t explain how you’re doing it is a nonsense on its face. Spot on.
- Cloudwash (BERG) — very good mockup of how and why your washing machine might be connected to the net and bound to your mobile phone. No face on it, though. They’re losing their touch.
- What’s Left of Nokia to Bet on Internet of Things (MIT Technology Review) — With the devices division gone, the Advanced Technologies business will cut licensing deals and perform advanced R&D with partners, with around 600 people around the globe, mainly in Silicon Valley and Finland. Hopefully will not devolve into being a patent troll. [...] “We are now talking about the idea of a programmable world. [...] If you believe in such a vision, as I do, then a lot of our technological assets will help in the future evolution of this world: global connectivity, our expertise in radio connectivity, materials, imaging and sensing technologies.”
Real Time Exploratory Analytics, Algorithmic Agendas, Disassembly Engine, and Future of Employment
- Druid — open source clustered data store (not key-value store) for real-time exploratory analytics on large datasets.
- It’s Time to Engineer Some Filter Failure (Jon Udell) — Our filters have become so successful that we fail to notice: We don’t control them, They have agendas, and They distort our connections to people and ideas. That idea that algorithms have agendas is worth emphasising. Reality doesn’t have an agenda, but the deployer of a similarity metric has decided what features to look for, what metric they’re optimising, and what to do with the similarity data. These are all choices with an agenda.
- Capstone — open source multi-architecture disassembly engine.
- The Future of Employment (PDF) — We note that this prediction implies a truncation in the current trend towards labour market polarization, with growing employment in high and low-wage occupations, accompanied by a hollowing-out of middle-income jobs. Rather than reducing the demand for middle-income occupations, which has been the pattern over the past decades, our model predicts that computerisation will mainly substitute for low-skill and low-wage jobs in the near future. By contrast, high-skill and high-wage occupations are the least susceptible to computer capital. (via The Atlantic)
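To make the point in the Jon Udell entry concrete, here is a minimal sketch of how the deployer's choice of features changes which item counts as "most similar". Everything in it is hypothetical: the item names, the four feature columns (politics, sports, outrage, clickthrough), and the numbers are invented for illustration, and cosine similarity is just one metric a deployer might pick.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_similar(target, candidates, features):
    """Return the candidate name closest to target, using only the chosen feature columns."""
    def pick(vector):
        return [vector[i] for i in features]
    return max(candidates, key=lambda name: cosine(pick(target), pick(candidates[name])))

# Hypothetical feature columns: [politics, sports, outrage, clickthrough]
target = [1.0, 0.0, 0.1, 0.2]
candidates = {
    "calm_politics": [1.0, 0.0, 0.1, 0.1],
    "angry_sports": [0.0, 1.0, 0.45, 0.9],
}

print(most_similar(target, candidates, features=[0, 1]))  # topic columns only
print(most_similar(target, candidates, features=[2, 3]))  # engagement columns only
```

The data never changes between the two calls. Only the feature selection does, and that selection is the agenda.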
Data Pipeline, Data Driven Education, Crowdsourced Proofreading, and 3D Printed Shoes
- Suro (Github) — Netflix data pipeline service for large volumes of event data. (via Ben Lorica)
- NIPS Workshop on Data Driven Education — lots of research papers around machine learning, MOOC data, etc.
- Proofist — crowdsourced proofreading game.
- 3D-Printed Shoes (YouTube) — LeWeb talk from the founder of the company (Continuum Fashion). (via Brady Forrest)
- SAMOA — Yahoo!’s distributed streaming machine learning (ML) framework that contains a programming abstraction for distributed streaming ML algorithms. (via Introducing SAMOA)
- madlib — an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.
- Data Portraits: Connecting People of Opposing Views — Yahoo! Labs research to break the filter bubble. Connect people who disagree on issue X (e.g., abortion) but who agree on issue Y (e.g., Latin American interventionism), and present the differences and similarities visually (they used wordclouds). Our results suggest that organic visualisation may revert the negative effects of providing potentially sensitive content. (via MIT Technology Review)
- Disguise Detection — using Raspberry Pi, Arduino, and Python.
Open Source BigTable, Robots Lost, Changing the World, Secrecy Binge
- Accumulo — NSA’s BigTable implementation, released as an Apache project.
- How the Robots Lost (Business Week) — the decline of high-frequency trading profits (basically, markets worked and imbalances in speed and knowledge have been corrected). Notable for the regulators getting access to the technology that the traders had: Last fall the SEC said it would pay Tradeworx, a high-frequency trading firm, $2.5 million to use its data collection system as the basic platform for a new surveillance operation. Code-named Midas (Market Information Data Analytics System), it scours the market for data from all 13 public exchanges. Midas went live in February. The SEC can now detect anomalous situations in the market, such as a trader spamming an exchange with thousands of fake orders, before they show up on blogs like Nanex and ZeroHedge. If Midas sees something odd, Berman’s team can look at trading data on a deeper level, millisecond by millisecond.
- PRISM: Surprised? (Danny O’Brien) — I really don’t agree with the people who think “We don’t have the collective will”, as though there’s some magical way things got done in the past when everyone was in accord and surprised all the time. It’s always hard work to change the world. Endless, dull hard work. Ten years later, when you’ve freed the slaves or beat the Nazis everyone is like “WHY CAN’T IT BE AS EASY TO CHANGE THIS AS THAT WAS, BACK IN THE GOOD OLD DAYS. I GUESS WE’RE ALL JUST SHEEPLE THESE DAYS.”
- What We Don’t Know About Spying on Citizens is Scarier Than What We Do Know (Bruce Schneier) — The U.S. government is on a secrecy binge. It overclassifies more information than ever. And we learn, again and again, that our government regularly classifies things not because they need to be secret, but because their release would be embarrassing. Open source BigTable implementation: free. Data gathering operation around it: $20M/year. Irony in having the extent of authoritarian Big Brother government secrecy questioned just as a whistleblower’s military trial is held “off the record”: priceless.
Machine Learning Demos, iOS Debugging, Industrial Internet, and Deanonymity
- MLDemos — an open-source visualization tool for machine learning algorithms created to help studying and understanding how several algorithms function and how their parameters affect and modify the results in problems of classification, regression, clustering, dimensionality reduction, dynamical systems and reward maximization. (via Mark Alen)
- kiln (GitHub) — open source extensible on-device debugging framework for iOS apps.
- Industrial Internet — the O’Reilly report on the industrial Internet of things is out. Prasad suggests an illustration: for every car with a rain sensor today, there are more than 10 that don’t have one. Instead of an optical sensor that turns on windshield wipers when it sees water, imagine the human in the car as a sensor — probably somewhat more discerning than the optical sensor in knowing what wiper setting is appropriate. A car could broadcast its wiper setting, along with its location, to the cloud. “Now you’ve got what you might call a rain API — two machines talking, mediated by a human being,” says Prasad. It could alert other cars to the presence of rain, perhaps switching on headlights automatically or changing the assumptions that nearby cars make about road traction.
- Unique in the Crowd: The Privacy Bounds of Human Mobility (PDF, Nature) — We study fifteen months of human mobility data for one and a half million individuals and find that human mobility traces are highly unique. In fact, in a dataset where the location of an individual is specified hourly, and with a spatial resolution equal to that given by the carrier’s antennas, four spatio-temporal points are enough to uniquely identify 95% of the individuals. We coarsen the data spatially and temporally to find a formula for the uniqueness of human mobility traces given their resolution and the available outside information. This formula shows that the uniqueness of mobility traces decays approximately as the 1/10 power of their resolution. Hence, even coarse datasets provide little anonymity. These findings represent fundamental constraints to an individual’s privacy and have important implications for the design of frameworks and institutions dedicated to protect the privacy of individuals. As Edd observed, “You are a unique snowflake, after all.” (via Alasdair Allan)
- A Quantitative Literary History of 2,958 Nineteenth-Century British Novels: The Semantic Cohort Method (PDF) — This project was simultaneously an experiment in developing quantitative and computational methods for tracing changes in literary language. We wanted to see how far quantifiable features such as word usage could be pushed toward the investigation of literary history. Could we leverage quantitative methods in ways that respect the nuance and complexity we value in the humanities? To this end, we present a second set of results, the techniques and methodological lessons gained in the course of designing and running this project. Even litcrit is becoming a data game.
- Easy6502 — get started writing 6502 assembly language. Fun way to get started with low-level coding.
- How Analytics Really Work at a Small Startup (Pete Warden) — The key for us is that we’re using the information we get primarily for decision-making (should we build out feature X?) rather than optimization (how can we improve feature X?). Nice rundown of the tools and systems he uses, with a plug for KISSmetrics.
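The uniqueness measure in the "Unique in the Crowd" entry above can be roughed out in a few lines. This is a sketch under stated assumptions, not the paper's code: a trace is modelled as a set of (hour, antenna) points, and the function name `unicity`, its parameters, and the toy data are all made up for illustration. It draws p random points from each user's trace and counts how often only that one trace contains them.

```python
import random

def unicity(traces, p, trials=50, seed=0):
    """Estimate the fraction of p-point samples that match exactly one trace.

    traces: dict mapping a user id to an iterable of (hour, antenna) points.
    """
    rng = random.Random(seed)
    trace_sets = {user: set(points) for user, points in traces.items()}
    unique = total = 0
    for user, points in trace_sets.items():
        if len(points) < p:
            continue  # not enough points to draw a sample from this user
        ordered = sorted(points)  # fixed order so the seed gives repeatable draws
        for _ in range(trials):
            sample = set(rng.sample(ordered, p))
            # Count every trace (including the user's own) that contains the sample.
            matches = sum(1 for other in trace_sets.values() if sample <= other)
            unique += (matches == 1)
            total += 1
    return unique / total if total else 0.0

# Toy data, invented for illustration only.
traces = {
    "a": [(8, "north"), (9, "centre"), (18, "north")],
    "b": [(8, "north"), (9, "centre"), (18, "south")],
    "c": [(8, "east"), (9, "centre"), (18, "south")],
}

for p in (1, 2, 3):
    print(p, unicity(traces, p))
```

Coarsening the data, as the paper describes, would amount to mapping hours and antennas into larger buckets before building the sets and then rerunning the same count.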
Malware Industrial Complex, Indies Needed, TV Analytics, and HTTP Benchmarking
- Welcome to the Malware-Industrial Complex (MIT) — brilliant phrase, sound analysis.
- Stupid Stupid xBox — The hardcore/soft-tv transition and any lead they feel they have is simply not defensible by licensing other industries’ generic video or music content because those industries will gladly sell and license the same content to all other players. A single custom studio of 150 employees also can not generate enough content to defensibly satisfy 76M+ customers. Only with quality primary software content from thousands of independent developers can you defend the brand and the product. Only by making the user experience simple, quick, and seamless can you defend the brand and the product. Never seen a better put statement of why an ecosystem of indies is essential.
- Data Feedback Loops for TV (Salon) — Netflix’s data indicated that the same subscribers who loved the original BBC production also gobbled down movies starring Kevin Spacey or directed by David Fincher. Therefore, concluded Netflix executives, a remake of the BBC drama with Spacey and Fincher attached was a no-brainer, to the point that the company committed $100 million for two 13-episode seasons.
- wrk — a modern HTTP benchmarking tool capable of generating significant load when run on a single multi-core CPU. It combines a multithreaded design with scalable event notification systems such as epoll and kqueue.
Handmade Hardware, Tab Silencer, Surprise and Models, and Sciencey GIFs
- Your USB Sticks Are Made With Chopsticks (Bunnie Huang) — behind-the-scenes on how USB sticks are made.
- mutetab — find and kill the Chrome tab making all the damn noise! (via Nelson Minar)
- Visualization, Modeling, and Surprises (John D Cook) — paraphrases Hadley Wickham: Visualization can surprise you, but it doesn’t scale well. Modelling scales well, but it can’t surprise you.
- Head Like an Orange — science animated GIFs, assembled from nature documentaries. (via Ed Yong)
Free Books, Analytics Goofs, Book Boilerplate, and Learn CS with the Raspberry Pi
- Free Book Sifter — lists all the free books on Amazon, has RSS feeds and newsletters. (via BoingBoing)
- Whom the Gods Would Destroy, They First Give Realtime Analytics — a few key reasons why truly real-time analytics can open the door to a new type of (realtime!) bad decision making. [U]ser demographics could be different day over day. Or very likely, you could see a major difference in user behavior immediately upon releasing a change, only to watch it evaporate as users learn to use new functionality. Given all of these concerns, the conservative and reasonable stance is to only consider tests that last a few days or more.
- Web Book Boilerplate (Github) — uses plain old markdown and generates a well structured HTML version of your written words. Since it’s sitting on top of Pandoc and Grunt, you can easily make your books available for every platform. MIT-style license.
- Raspberry Pi Education Manual (PDF) — from Scratch to Python and HCI all via the Raspberry Pi. Intended as an informative series of lessons for teachers and students learning to code with the Raspberry Pi as their first device.
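A small illustration of the warning in the realtime analytics entry above: if a change produces a burst of new behaviour that fades as users get used to it, the number you see on day one overstates what a test run over several days would report. The model below is purely an assumption made for illustration (an exponentially decaying lift with a hypothetical 5% starting value and a one-day half-life), not something taken from the linked post.

```python
def daily_lift(day, initial_lift=0.05, half_life_days=1.0):
    """Expected lift on a given day, assuming the novelty effect halves every half_life_days."""
    return initial_lift * 0.5 ** (day / half_life_days)

def average_lift(days, initial_lift=0.05, half_life_days=1.0):
    """Mean expected lift over a test that runs for the given number of whole days."""
    return sum(daily_lift(d, initial_lift, half_life_days) for d in range(days)) / days

for days in (1, 3, 7):
    print(days, round(average_lift(days), 4))
```

Under these assumptions the average shrinks steadily as the window grows, which is the argument for only trusting tests that last a few days or more.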