If you’re a programmer who reads the Internet, you’ll have heard of deep belief networks. Google loves them, Facebook just hired one of the pioneers to lead a new group, and they win Kaggle competitions. I’ve been using deep belief networks extensively for image recognition at Jetpac across hundreds of millions of Instagram photos, and the reports are true: the results really are great.
If you’re like me, you probably want to see it for yourself, to understand by experimenting. There are several great open-source solutions out there, but they’re largely aimed at academics and back-end engineers, with a lot of dependencies and a steep learning curve. The current state of software makes it look like the technology is destined to remain in data centers running on high-end machines.
Why does this matter so much? We’ve seen how many interesting applications become possible when cell phones can understand the world through GPS, accelerometers, gyroscopes and compasses, and with deep belief networks the camera becomes another rich source of information. Imagine Raspberry Pis with cameras that could be attached to a lamp post and left to count how many cars, dogs or people go past. Think about leaving a wearable camera running continuously, so the device can tell whether you’re sitting at your desk, out jogging, driving, chatting to friends over a drink, at a restaurant, or going for a hike, with all the processing happening locally so you don’t have to worry about privacy. How about a pet door that lets in cats, but not raccoons or skunks? An unattended wildlife camera that only takes photos when its motion sensor is triggered by a bird, and not by a branch moving in the wind?
It’s traditionally been almost impossible to extract meaning from images, but deep belief networks are able to crack them open and say something significant about what’s in them. They bridge the gap between the real world and CPUs in a way we’ve never been able to do before.
I’m wary of AI promises after having been burned too many times by fads that didn’t deliver, so I encourage you to play with the demo yourself to get a realistic idea of what it’s good for. The most useful way I’ve found to approach the results is to treat them as the output of a noisy sensor measuring all kinds of semantically meaningful properties of the scene. Most applications have a lot of context that defines which objects they’re likely to see and which ones they care about, and that context lets those more general properties be reliably transformed into specific categories. For example, an American pet door is unlikely to encounter a lioness, so if that’s what the network reports, it’s safe to treat it as a cat (unless you’re very unlucky!). In Jetpac’s case, we know that not many Instagram users are taking photos of toilet seats, so when the network reports one, the odds are strongly in favor of it being a plate of food instead. Combining what you already know about the problem with the output of the algorithm can often be enough to build a useful solution. We’ve even automated the approach by taking the raw category information and running it through another layer of machine learning for our domain-specific classifications.
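To make the pet door and Instagram examples above a bit more concrete, here is a minimal sketch of one way to combine a network’s raw category scores with what you already know about your application. It simply multiplies each score by a prior for how likely that category is in your setting and renormalizes. Everything in it is an assumption for illustration: the function name `reweight_with_context`, the labels, and all of the numbers are made up, and this is not Jetpac’s actual pipeline.

```python
def reweight_with_context(scores, priors, floor=1e-6):
    """Rescale raw classifier scores by application-specific priors.

    scores: dict mapping label -> raw score from the network (hypothetical values).
    priors: dict mapping label -> how likely that label is in this application.
    floor:  small prior used for labels the application has no opinion about.
    Returns a dict of reweighted scores that sum to 1.
    """
    weighted = {
        label: score * priors.get(label, floor)
        for label, score in scores.items()
    }
    total = sum(weighted.values())
    if total == 0:
        # Nothing survived the reweighting, so fall back to the raw scores.
        return dict(scores)
    return {label: value / total for label, value in weighted.items()}


# Hypothetical noisy output for a photo of a dinner plate.
raw_scores = {"toilet seat": 0.55, "plate": 0.40, "frisbee": 0.05}

# Hypothetical priors: food photos are common in this setting, toilet seats are not.
instagram_priors = {"toilet seat": 0.01, "plate": 0.90, "frisbee": 0.09}

adjusted = reweight_with_context(raw_scores, instagram_priors)
best_label = max(adjusted, key=adjusted.get)
print(best_label)  # prints "plate"
```

The raw scores narrowly favor the wrong answer, but the context is strong enough to flip the decision. Hand-set priors like these are the crude version of the idea; training a second classifier on the raw category scores, as described above, lets the data choose the weights instead.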
Think about the results as noisy sensor readings rather than human-accurate judgments, and you’ll start to see how powerful even imperfect results can be, as long as they still contain strong signals. We’re entering a new world where computers can see, even if it’s only through a glass darkly.