Google & IBM giving students a distributed systems lab using Hadoop

Google & IBM have partnered to give university students hands-on experience developing software for large-scale distributed systems. The initiative focuses on parallel processing of large data sets using Hadoop, an open source implementation of Google’s MapReduce. (See Tim’s earlier post about Yahoo & Hadoop.)
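
For readers new to the model, here is a rough sketch of the MapReduce idea as a word count in plain Python. It does not use Hadoop's API and runs in a single process; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for illustration. On a real cluster the map and reduce calls would be spread across many machines, with the framework doing the grouping in between.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Group emitted values by key, standing in for the framework's shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)

def word_count(documents):
    """Run map, shuffle and reduce in sequence (hypothetical single-process driver)."""
    pairs = []
    for doc in documents:
        pairs.extend(map_phase(doc))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

counts = word_count(["the quick brown fox", "the lazy dog", "The fox"])
print(counts)
```

Because each map call looks at only one document and each reduce call at only one key, the work can be split across machines without the pieces needing to coordinate, which is what makes the approach suit very large data sets.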

“The goal of this initiative is to improve computer science students’ knowledge of highly parallel computing practices to better address the emerging paradigm of large-scale distributed computing. IBM and Google are teaming up to provide hardware, software and services to augment university curricula and expand research horizons. With their combined resources, the companies hope to lower the financial and logistical barriers for the academic community to explore this emerging model of computing.”

The project currently includes the University of Washington, Carnegie Mellon University, MIT, Stanford, UC Berkeley and the University of Maryland. Students in participating classes will have access to a dedicated cluster of “several hundred computers” running Linux under Xen virtualization. The project is expected to expand to thousands of processors and eventually be open to researchers and students at other institutions.

As part of this effort, Google and the University of Washington have released a Creative Commons-licensed curriculum to help teach distributed systems concepts and techniques. IBM is also providing Hadoop plug-ins for Eclipse.

Note: You can also build similar systems using Hadoop with Amazon EC2. Tom White recently posted an excellent guide, and Powerset has been using this combination in production for quite some time.
