"machine learning" entries

Four short links: 24 September 2015

Machine Music Learning, Cyber War, Backing Out Ads, and COBOL OF THE 2020s

  1. The Hit Charade (MIT TR) — Spotify’s deep-learning system still has to be trained using millions of example songs, and it would be perplexed by a bold new style of music. What’s more, such algorithms cannot arrange songs in a creative way. Nor can they distinguish between a truly original piece and yet another me-too imitation of a popular sound. Johnson acknowledges this limitation, and he says human expertise will remain a key part of Spotify’s algorithms for the foreseeable future.
  2. The Future of War is the Distant Past (John Birmingham) — the Naval Academy is hedging against the future by creating cybersecurity midshipmen, and by requiring every midshipman to learn how to do celestial navigation.
  3. What Happens Next Will Amaze You (Maciej Ceglowski) — the next in Maciej’s amazing series of keynotes, where he’s building a convincing case for fixing the Web.
  4. Go Will Dominate the Next Decade (Ian Eyberg) — COBOL OF THE 2020s. There, I saved you the trouble.
Four short links: 22 September 2015

Ant Algorithms, Git Commit, NASA's Deep Learning, and Built-In Empathy

  1. Ant Algorithms for Discrete Optimization (Adrian Colyer) — Stigmergy is the generic term for the stimulation of workers by the performance they have achieved – for example, termite nest-building works in a similar way. Stigmergy is a form of indirect communication “mediated by physical modifications of environmental states which are only locally accessible to the communicating agents.”
  2. How to Write a Git Commit Message (Chris Beams) — A diff will tell you what changed, but only the commit message can properly tell you why.
  3. Deep Belief Networks at the Heart of NASA Image Classification — The two new labeled satellite data sets were ultimately put to the test with a modified deep-belief-network-driven approach. The results show classification accuracy of 97.95%, outperforming unmodified deep belief networks, convolutional neural networks, and stacked de-noising auto-encoders by around 11%.
  4. The Consequences of An Insightful Algorithm (Carina C. Zona) — We design software for humans. Balancing human needs and business specs can be tough. It’s crucial that we learn how to build in systematic empathy. (via Rowan Crawford)
Four short links: 21 September 2015

2-D Single-Stroke Recognizer, Autonomous Vehicle Permits, s3concurrent, and Surviving the Music Industry

  1. $1 Unistroke Recognizer — a 2-D single-stroke recognizer designed for rapid prototyping of gesture-based user interfaces. In machine learning terms, $1 is an instance-based nearest-neighbor classifier with a Euclidean scoring function — i.e., a geometric template matcher.
  2. Apple Talking to California Officials about Self-Driving Car (Guardian) — California DMV’s main responsibility for autonomous vehicles at present is administering an autonomous vehicle tester program for experimental self-driving cars on California’s roads. So far, 10 companies have been issued permits for about 80 autonomous vehicles and more than 300 test drivers. The most recent, Honda and BMW, received their permits last week.
  3. s3concurrent — sync local file structure with s3, in parallel. (via Winston Chen)
  4. Amanda Palmer on Music Industry Survival Techniques (O’Reilly Radar) — I’ve always approached every Internet platform and every Internet tool with the suspicion that it may not last, and that actually what’s very important is […] the art and the relationships I’m building.
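
The nearest-neighbor template matching behind link 1 is easy to sketch. This is a minimal illustration of the idea, not the $1 algorithm's full pipeline (the real recognizer also resamples, rotates, scales, and translates strokes before scoring); the template coordinates are made up:

```python
import math

def stroke_distance(points_a, points_b):
    """Average Euclidean distance between corresponding points.

    Assumes both strokes were already resampled to the same number of points.
    """
    return sum(math.dist(a, b) for a, b in zip(points_a, points_b)) / len(points_a)

def recognize(candidate, templates):
    """Return the name of the stored template closest to the candidate stroke."""
    return min(templates, key=lambda name: stroke_distance(candidate, templates[name]))

# Hypothetical resampled templates for two gestures.
templates = {
    "line": [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)],
    "caret": [(0.0, 0.0), (0.5, 1.0), (1.0, 0.0)],
}

print(recognize([(0.0, 0.1), (0.5, 0.9), (1.0, 0.0)], templates))  # prints "caret"
```

The instance-based flavor is visible here: there is no training step, just a lookup against stored examples with a geometric score.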

Build better machine learning models

A beginner's guide to evaluating your machine learning models.


Everything today is being quantified, measured, and tracked — everything is generating data, and data is powerful. Businesses are using data in a variety of ways to improve customer satisfaction. For instance, data scientists are building machine learning models to generate intelligent recommendations to users so that they spend more time on a site. Analysts can use churn analysis to predict which customers are the best targets for the next promotional campaign. The possibilities are endless.

However, there are challenges in the machine learning pipeline. Typically, you build a machine learning model on top of your data. You collect more data. You build another model. But how do you know when to stop?

When is your smart model smart enough?

Evaluation is a key step when building intelligent business applications with machine learning. It is not a one-time task, but must be integrated with the whole pipeline of developing and productionizing machine learning-enabled applications.

In a new free O’Reilly report Evaluating Machine Learning Models: A Beginner’s Guide to Key Concepts and Pitfalls, we cut through the technical jargon of machine learning, and elucidate, in simple language, the processes of evaluating machine learning models. Read more…
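
The core evaluation step the report covers — hold some data out, never train on it, and score the model against it — can be sketched in a few lines. This is a generic stdlib illustration under toy assumptions (a one-feature threshold "model"), not an excerpt from the report:

```python
import random

def train_test_split(data, test_fraction=0.2, seed=0):
    """Shuffle labeled examples and split them into train and test sets."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(model, examples):
    """Fraction of held-out examples the model labels correctly."""
    return sum(1 for x, y in examples if model(x) == y) / len(examples)

# Toy data: the label is 1 when the feature exceeds 5.
data = [(x, int(x > 5)) for x in range(10)]
train, test = train_test_split(data)

# A hypothetical "model" fit on the training set only: threshold at the
# midpoint between the two class means.
mean1 = sum(x for x, y in train if y == 1) / max(1, sum(1 for _, y in train if y == 1))
mean0 = sum(x for x, y in train if y == 0) / max(1, sum(1 for _, y in train if y == 0))
threshold = (mean0 + mean1) / 2
model = lambda x: int(x > threshold)

print(accuracy(model, test))  # held-out accuracy: the number evaluation tracks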

Four short links: 16 September 2015

Data Pipelines, Amazon Culture, Real-time NFL Data, and Deep Learning for Chess

  1. Three Best Practices for Building Successful Data Pipelines (Michael Li) — three key areas that are often overlooked in data pipelines, and those are making your analysis: reproducible, consistent, and productionizable.
  2. Amazon’s Culture Controversy Decoded (Rita J King) — very interesting culture map analysis of the reports of Amazon’s culture, and context for how companies make choices about what to be. (via Mike Loukides)
  3. How Will Real-Time Tracking Change the NFL? (New Yorker) — At the moment, the NFL is being tightfisted with the data. Commentators will have access during games, as will the betting and analytics firm Sportradar. Users of the league’s Xbox One app, which provides an interactive way of browsing video clips, fantasy-football statistics, and other metrics, will be able to explore a feature called Next Gen Replay, which allows them to track each player’s speed and trajectory, combining moving lines on a virtual field with live footage from the real one. But, for now, coaches are shut out; once a player exits the locker room on game day, the dynamic point cloud that is generated by his movement through space is a corporately owned data set, as outlined in the league’s 2011 collective-bargaining agreement. Which should tell you all you need to know about the NFL’s role in promoting sporting excellence.
  4. Giraffe: Using Deep Reinforcement Learning to Play Chess (Matthew Lai) — Giraffe, a chess engine that uses self-play to discover all its domain-specific knowledge, with minimal hand-crafted knowledge given by the programmer. See also the code. (via GitXiv)
Four short links: 14 September 2015

Robotics Boom, Apple in Communities, Picture Research, and Programming Enlightenment

  1. Uber Would Like to Buy Your Robotics Department (NY Times) — ‘‘If you’re well versed in the area of robotics right now and you’re not working on self-driving cars, you’re either an idiot or you have more of a passion for something else,’’ says Jerry Pratt, head of a robotics team in Pensacola that worked on a humanoid robot that beat Carnegie Mellon’s CHIMP in this year’s contest. ‘‘It’s a multibillion- if not trillion-dollar industry.’’
  2. What the Heck is Angela Ahrendts Doing at Apple? (Fortune) — Apple has always intended for each of them to be a community center; now Cook and Ahrendts want them to be the community center. That means expanding from serving existing and potential customers to, say, creating opportunities for underserved minorities and women. “In my mind,” Ahrendts says, store leaders “are the mayors of their community.”
  3. Imitation vs. Innovation: Product Similarity Network in the Motion Picture Industry (PDF) — machine learning used to build a model of movies released in the last few decades. We find that big-budget movies benefit more from imitation, but small-budget movies favor novelty. This leads to interesting market dynamics that cannot be produced by a model without learning.
  4. Enlightened Imagination for Citizens (Bret Victor) — It should be painfully obvious that learning how to program a computer has no direct connection to any high form of enlightenment. Amen!
Four short links: 1 September 2015

People Detection, Ratings Patterns, Inspection Bias, and Cloud Filesystem

  1. End-to-End People Detection in Crowded Scenes — research paper and code. When parsing the title, bind “end-to-end” to “scenes” not “people”.
  2. Statistical Patterns in Movie Ratings (PLOSone) — We find that the distribution of votes presents scale-free behavior over several orders of magnitude, with an exponent very close to 3/2, with exponential cutoff. It is remarkable that this pattern emerges independently of movie attributes such as average rating, age and genre, with the exception of a few genres and of high-budget films.
  3. The Inspection Bias is Everywhere — In 1991, Scott Feld presented the “friendship paradox”: the observation that most people have fewer friends than their friends have. He studied real-life friends, but the same effect appears in online networks: if you choose a random Facebook user, and then choose one of their friends at random, the chance is about 80% that the friend has more friends. The friendship paradox is a form of the inspection paradox. When you choose a random user, every user is equally likely. But when you choose one of their friends, you are more likely to choose someone with a lot of friends. Specifically, someone with x friends is overrepresented by a factor of x.
  4. s3ql — a file system that stores all its data online using storage services like Google Storage, Amazon S3, or OpenStack. S3QL effectively provides a hard disk of dynamic, infinite capacity that can be accessed from any computer with internet access running Linux, FreeBSD, or OS X. (GPLv3)
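
The inspection-paradox claim in link 3 — someone with x friends is overrepresented by a factor of x when you sample friends — checks out in a toy simulation (the degree distribution here is made up: a few highly connected hubs among many sparsely connected people):

```python
import random

random.seed(0)

# Toy network: 90 people with 2 friends each, 10 hubs with 50 friends each.
friend_counts = [2] * 90 + [50] * 10

avg_person = sum(friend_counts) / len(friend_counts)

# Sampling "a friend of a random person" picks someone with probability
# proportional to their friend count, so weight each person by that count.
sampled = random.choices(friend_counts, weights=friend_counts, k=100_000)
avg_friend = sum(sampled) / len(sampled)

print(avg_person)  # mean friend count of a random person (6.8 here)
print(avg_friend)  # mean friend count of a randomly chosen friend (much larger)
```

The exact expectation of the weighted sample is sum(x²)/sum(x), which for this distribution is about 37.3 versus 6.8 — the hubs dominate once sampling is size-biased.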

Bridging the divide: Business users and machine learning experts

The O'Reilly Data Show Podcast: Alice Zheng on feature representations, model evaluation, and machine learning models.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

As tools for advanced analytics become more accessible, data scientists’ roles will evolve. Most media stories emphasize a need for expertise in algorithms and quantitative techniques (machine learning, statistics, probability), and yet the reality is that expertise in advanced algorithms is just one aspect of industrial data science.

During the latest episode of the O’Reilly Data Show podcast, I sat down with Alice Zheng, one of Strata + Hadoop World’s most popular speakers. She has a gift for explaining complex topics to a broad audience, through presentations and in writing. We talked about her background, techniques for evaluating machine learning models, how much math data scientists need to know, and the art of interacting with business users.

Making machine learning accessible

People who work at getting analytics adopted and deployed learn early on the importance of working with domain/business experts. As excited as I am about the growing number of tools that open up analytics to business users, the interplay between data experts (data scientists, data engineers) and domain experts remains important. In fact, human-in-the-loop systems are being used in many critical data pipelines. Zheng recounts her experience working with business analysts:

It’s not enough to tell someone, “This is done by boosted decision trees, and that’s the best classification algorithm, so just trust me, it works.” As a builder of these applications, you need to understand what the algorithm is doing in order to make it better. As a user who ultimately consumes the results, it can be really frustrating to not understand how they were produced. When we worked with analysts in Windows or in Bing, we were analyzing computer system logs. That’s very difficult for a human being to understand. We definitely had to work with the experts who understood the semantics of the logs in order to make progress. They had to understand what the machine learning algorithms were doing in order to provide useful feedback. Read more…

Four short links: 27 August 2015

Chrome as APT, Nature's Mimicry, Information Extraction, and Better 3D Printing

  1. The Advanced Persistent Threat You Have: Google Chrome (PDF) — argues that if you can’t detect and classify Google Chrome’s self-updating behavior, you’re not in a position to know when you’re hit by malware that also downloads and executes code from the net that updates executables and system files.
  2. Things Mimicking Other Things — nifty visual catalog/graph of camouflage and imitation in nature.
  3. MITIE — permissively-licensed (Boost) tools for named entity extraction and binary relation detection as well as tools for training custom extractors and relation detectors.
  4. MultiFab Prints 10 Materials At Once — and uses computer vision to self-calibrate and self-correct, as well as letting users embed objects (e.g., circuit boards) in the print. Developed by CSAIL researchers from low-cost, off-the-shelf components that cost a total of $7,000.

Data-driven neuroscience

The O'Reilly Radar Podcast: Bradley Voytek on data's role in neuroscience, the brain scanner, and zombie brains in STEM.


Subscribe to the O’Reilly Radar Podcast to track the technologies and people that will shape our world in the years to come.

In this week’s Radar Podcast, O’Reilly’s Mac Slocum chats with Bradley Voytek, an assistant professor of cognitive science and neuroscience at UC San Diego. Voytek talks about using data-driven approaches in his neuroscience work, the brain scanner project, and applying cognitive neuroscience to the zombie brain.

Here are a few snippets from their chat:

In the neurosciences, we’ve got something like three million peer reviewed publications to go through. When I was working on my Ph.D., I was very interested, in particular, in two brain regions. I wanted to know how these two brain regions connect, what are the inputs to them and where do they output to. In my naivety as a Ph.D. student, I had assumed there would be some sort of nice 3D visualization, where I could click on a brain region and see all of its inputs and outputs. Such a thing did not exist — still doesn’t, really. So instead, I ended up spending three or four months of my Ph.D. combing through papers written in the 1970s … and I kept thinking to myself, this is ridiculous, and this just stewed in the back of my mind for a really long time.

Sitting at home [with my wife], I said, I think I’ve figured out how to address this problem I’m working on, which is basically very simple text mining. Let’s just scrape the text of these three million papers, or at least the titles and abstracts, and see what words co-occur frequently together. It was very rudimentary text mining, with the idea that if words co-occur frequently … this might give us an index of how related things are, and she challenged me to a code-off.
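
The "very rudimentary text mining" Voytek describes — counting which terms appear together across abstracts — can be sketched like this. The abstracts and term list are toy stand-ins, not his actual corpus or vocabulary:

```python
from collections import Counter
from itertools import combinations

# Toy stand-ins for paper titles/abstracts.
abstracts = [
    "prefrontal cortex connects to basal ganglia",
    "basal ganglia lesions alter prefrontal activity",
    "hippocampus supports memory consolidation",
]

# Hypothetical vocabulary of terms of interest (e.g., brain regions).
terms = {"prefrontal", "basal", "ganglia", "hippocampus", "memory"}

# Count how often each pair of terms appears in the same abstract.
cooccur = Counter()
for text in abstracts:
    present = sorted(terms & set(text.split()))
    for a, b in combinations(present, 2):
        cooccur[(a, b)] += 1

print(cooccur.most_common(2))  # frequent pairs suggest related regions
```

High co-occurrence counts serve as the crude "index of how related things are" he mentions; a real pipeline would normalize for how common each term is on its own.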

Read more…
