## Object Recognition

Now you have 100 million images, you need to do something useful with them. Photos contain a massive amount of information, and OpenCV has a lot of the building blocks you’ll need to extract some of the parts you care about. Think about the properties you’re interested in — maybe you want to exclude pictures containing people if you’re trying to create slideshows about places — and then look around at the tools that the library offers; for this example you could search for faces, and identify photos that are faceless as less likely to contain people. Anything you could do on a single image, you can now do in parallel on millions of them.

Before you go ahead and do a distributed run, though, make sure it works. I like to write a standalone command-line version of my image processing step, so I can debug it easily and iterate fast on the algorithm. OpenCV has a lot of convenience functions that make it easy to load images from disk, or even capture them from your webcam, display them, and save them out at the end. Their Java support is quite new, and I hit a few hiccups, but overall, it works very well. This made it possible to write a wrapper that is handed the images from HIPI’s decoding engine, does some processing, and then writes the results out in a text format, one line per image, all within a natively-Java Hadoop job.

Once you’re happy with the way the algorithm is working on your test data, and you’ve wrapped it in a HIPI wrapper, you’re ready to apply it to all the images you’ve downloaded. Just spin up a Hadoop job with your jar file, pointing at the HDFS folders containing the output of the downloader. In our experience, we were able to process around two million 700×700-sized JPEG images on each server per day, so you can use that as a rule of thumb for the size/speed tradeoffs you want to make when you choose how many machines to put in your cluster. Surprisingly, the actual processing we did within our algorithm didn’t affect the speed very much, apparently the object recognition and image-processing work ran fast enough that it was swamped by the time spent on IO.

I hope I’ve left you excited and ready to tackle your own large-scale image processing challenges. Whether you’re a data person who’s interested in image processing or a computer-vision geek who wants to widen your horizons, there’s a lot of new ground to be explored, and it’s surprisingly easy once you put together the pieces.

tags:

### Get the O’Reilly Data Newsletter

Stay informed. Receive weekly insight from industry insiders.

Since local IO is a limit – have you looked at SSD based instance? I am pretty optimisting about SSD backed instances on Rackspace. More at http://developer.rackspace.com/blog/welcome-to-performance-cloud-servers-have-some-benchmarks.html

• mike

Using spot instances would make this a lot cheaper.

• http://micahwalter.com Micah Walter

Pete, great article. I have about 30,000 images stored on S3 which I wish to process against each other, with a simple script I wrote in Python. This seems like a good candidate for hadoop/mapreduce and maybe Hipi, but not sure if thats really the way to go. Basically, each image needs to be compared to each other image via the script, and then the output of that script is just a new row in a database.

Thoughts?

Micah