Strata Week: Simplifying MapReduce through Java

Here are a few of the data stories that caught my attention this week:

Crunch looks to make MapReduce easier

Despite the growing popularity of MapReduce and other data technologies, there’s still a steep learning curve associated with these tools. Some have even wondered if they’re worth introducing to programming students.

All of this makes the introduction of Crunch particularly good news. Crunch is a new Java library from Cloudera that’s aimed at simplifying the writing, testing, and running of MapReduce pipelines. In other words, developers won’t need to write a lot of custom code or libraries, which, as Cloudera data scientist Josh Willis points out, “is a serious drain on developer productivity.”

He adds that:

Crunch shares a core philosophical belief with Google’s FlumeJava: novelty is the enemy of adoption. For developers, learning a Java library requires much less up-front investment than learning a new programming language. Crunch provides full access to the power of Java for writing functions, managing pipeline execution, and dynamically constructing new pipelines, obviating the need to switch back and forth between a data flow language and a real programming language.

The Crunch library has been released under the Apache license, and the code is available for download.
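To make the "pipeline in plain Java" idea concrete, here is a minimal sketch of a word count written as two ordinary Java methods, one playing the map step and one the reduce step. It is an illustration only: it runs in memory, it does not use Crunch or Hadoop, and the class and method names (WordCountSketch, mapPhase, reducePhase, countWords) are hypothetical rather than taken from Crunch's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; none of these names come from Crunch.
public class WordCountSketch {

    // "Map" step: split each input line into lowercase words.
    static List<String> mapPhase(List<String> lines) {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return words;
    }

    // "Reduce" step: group identical words and sum their occurrences.
    static Map<String, Long> reducePhase(List<String> words) {
        Map<String, Long> counts = new TreeMap<>();
        for (String word : words) {
            counts.merge(word, 1L, Long::sum);
        }
        return counts;
    }

    // The "pipeline" is ordinary method composition.
    static Map<String, Long> countWords(List<String> lines) {
        return reducePhase(mapPhase(lines));
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "crunch makes mapreduce easier",
                "mapreduce in java");
        System.out.println(countWords(lines));
    }
}
```

Because each step is just a method, it can be unit tested and recombined like any other Java code, which is the kind of convenience Willis describes. A real Crunch pipeline would run the equivalent steps as MapReduce jobs on a cluster.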


Querying the web with Datafiniti

Datafiniti launched this week into public beta, calling itself the “first search engine for data.” That might just sound like a nifty startup slogan, but when you look at what Datafiniti queries and how it works, the engine begins to look profoundly ambitious and important.

Datafiniti lets its users run a search query (or make an API call) against the web. That’s the goal, at least. As it stands, Datafiniti supports queries about location, products, news, real estate, and social identity. Even so, that’s a substantial number of datasets, all drawn from information that’s publicly available on the web.

Although Datafiniti demands you enter SQL parameters, it tries to make the process of doing so fairly easy, with a guide that pops up beneath the search box to help you phrase things properly. That interface is just one of the indications that Datafiniti is making a move to help democratize big data search.

The company grew out of a previous startup named 80Legs. As Shion Deysarker, founder of Datafiniti, told me, it became clear that the web-crawling services provided by 80Legs were really just being used to answer specific questions: What’s the average listing price for a home in Houston? How many times has a brand name been mentioned on Twitter or Facebook over the last few months? And so on.

Deysarker frames Datafiniti in terms of data access, arguing that until now a few providers have controlled the data. The startup wants to help developers and companies overcome both access and expense issues associated with gathering, processing, curating and accessing datasets. It plans to offer both subscription-based and unit-based pricing.

Keep tabs on the Large Hadron Collider from your smartphone

New apps don’t often make it into my data news roundup, but it’s hard to ignore this one: LHSee is an Android app from the University of Oxford that delivers data directly from the ATLAS experiment at CERN. The app lets you see data from collisions at the Large Hadron Collider.

The ATLAS experiment describes itself as an effort to learn about “the basic forces that have shaped our Universe since the beginning of time and that will determine its fate. Among the possible unknowns are the origin of mass, extra dimensions of space, unification of fundamental forces, and evidence for dark matter candidates in the Universe.”

The LHSee app provides detailed information about how CERN and the Large Hadron Collider work. It also offers a “Hunt the Higgs Boson” game as well as opportunities to watch 3-D collisions streamed live from CERN. The app is available for free through the Android Market.

Got data news?

Feel free to email me.
