Strata Week: The challenge of real-time analytics

Blue is the color, getting help with email overload.

The call for proposals for O’Reilly Strata ends on Sept. 28. We’re keen to hear your stories about the business and practice of data, analytics and visualization. Submit a proposal now.

When MapReduce is too slow

This week the Register reported on Google’s move away from a MapReduce architecture for compiling their search index. Pioneered by Google, MapReduce is a way to distribute calculations among many processors. MapReduce led the field in big data analytics frameworks, and is now popularly used in the form of the open-source Hadoop project, spearheaded by Yahoo!

Does Google think MapReduce is dead? Not quite. The problem is that MapReduce is a batch processing architecture. Google was recomputing their entire search index and replacing it wholesale every few hours. By contrast, content is being updated on the web in real time. With a MapReduce-centric architecture, Google could never be truly up-to-date.

Caffeine, Google’s new indexing system, supports incremental indexing and avoids the refresh rate problem of MapReduce. Carrie Grimes of Google explained the benefits, writing in a Google blog post:

With Caffeine, we analyze the web in small portions and update our search index on a continuous basis, globally. As we find new pages, or new information on existing pages, we can add these straight to the index.
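To make the batch-versus-incremental distinction concrete, here is a toy sketch in Python. It is not how Caffeine or Google's earlier pipeline works; the function names `build_index` and `update_index` and the tiny word-to-URL inverted index are assumptions made purely for illustration. The point is only that a batch rebuild touches every page, while an incremental update touches just the entries affected by one changed page.

```python
def build_index(pages):
    """Batch approach: rebuild the whole index from every page.

    `pages` maps a URL to its text. This is a hypothetical toy
    structure, not a real search-engine format.
    """
    index = {}
    for url, text in pages.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index


def update_index(index, url, old_text, new_text):
    """Incremental approach: adjust only the entries for one changed page."""
    old_words = set(old_text.lower().split())
    new_words = set(new_text.lower().split())

    # Words that disappeared from the page no longer point to it.
    for word in old_words - new_words:
        urls = index.get(word)
        if urls is not None:
            urls.discard(url)
            if not urls:
                del index[word]

    # Words that are new on the page start pointing to it.
    for word in new_words - old_words:
        index.setdefault(word, set()).add(url)
    return index
```

In this sketch the incremental path does work proportional to the size of the change, while the batch path does work proportional to the size of the whole collection, which is the refresh-rate problem in miniature.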

Google isn’t the only company that wants real-time big data processing. Facebook, deeply invested in Hadoop, is working to get its latencies down to a matter of seconds rather than minutes. Real-time analytics is a priority for many of the companies I’ve spoken to in researching the Strata program. Whether it’s MapReduce-based or not, we will see the emergence of more real-time big data technologies over the next 12 months.

  • Want to get a taste of using MapReduce without deploying any infrastructure? Check out mincemeat.py, a simple self-contained Python MapReduce implementation.
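If you would rather see the shape of the idea before installing anything, the sketch below runs the classic word-count example in a single Python process. It does not use mincemeat.py or Hadoop, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `map_reduce`) are made up for this illustration. A real framework would spread the map and reduce calls across many machines; here they simply run one after another.

```python
from collections import defaultdict
from itertools import chain


def map_phase(doc_id, text):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in text.lower().split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Combine all values emitted for one key into a single result."""
    return word, sum(counts)


def map_reduce(documents):
    """Run map, shuffle and reduce in sequence over a dict of documents."""
    mapped = chain.from_iterable(
        map_phase(doc_id, text) for doc_id, text in documents.items()
    )
    grouped = shuffle(mapped)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    docs = {"a": "blue red blue", "b": "red"}
    print(map_reduce(docs))
```

Note that nothing is returned until every document has been mapped and every key reduced. That all-or-nothing pass is what makes the model a batch architecture.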

Feeling blue

The folks at COLOURlovers noticed that a lot of people favored the color blue for their Twitter theme. Was this just because Twitter itself was blue, or does blue have a stronger hold on our preferences? To investigate, COLOURlovers decided to research the top 100 online brands.

Blue, the color of Twitter, Facebook, PayPal, and AT&T, does indeed dominate online brands. But it’s not alone. There’s a strong showing for red from companies such as CNN, ESPN, Comcast, CNET, BBC and YouTube. Red seems to be a strong indicator for media organizations.

Excerpt from COLOURlovers visualization.

Is there any hope for variety, or are we doomed to a red-blue future? COLOURlovers suspect that once category leaders establish a certain color, newcomers are likely to repeat it.

Once a rocketship of a web startup takes flight, there are a number of Jr. Internet astronauts hoping to emulate their success … and are inspired by their brands. And so blue and red will probably continue to dominate, but we can have hope for the GoWallas, DailyBooths and other more adventurous brands out there.

Personal email analytics

Email is one of the richest, most useful and most infuriating sources of data in our lives. For years we’ve been wanting tools to help make sense of the flow of people and information that it brings. In 1991, for example, Jamie Zawinski created the Insidious Big Brother Database (BBDB), with the aim of making both email and people more manageable.

BBDB can automatically keep track of what other topics the sender has corresponded with you about; when you last read a message from the sender; what mailing lists their messages came through; and any other details or automatic annotations you care to add. It also does a good job of noting when someone’s email address has changed.

More recently, it seems that innovation has been slower to come to email clients. However, the opening up of Gmail’s API has brought some interesting new tools, based on machine learning and analytics.

For basic exploration of your email flows, try Graph Your Inbox. This is a Chrome browser extension that will chart queries over your Gmail data, essentially a Google Trends for your email. Below is a graph comparing the volumes of email I sent and received.

Graph Your Inbox results for outbound vs incoming email

With tools such as Graph Your Inbox we can retrospectively mine our own email and discover the ebb and flow of people and projects in our lives. Can machine learning help us in a proactive way? Whether you are conscious of it or not, machine learning techniques help us daily in the fight against spam. But what about separating the signal from the noise among our non-spam communications?

Google recently introduced Priority Inbox, in an attempt to help users decide which emails are important. Small voting buttons and dividers in the interface enable you to train Priority Inbox. But some of this seems a bit redundant — we already passively prioritize by how quickly we read and reply to messages from different people. Why can’t the computer learn?

SaneBox is a web application that takes a more low-key approach. It will label mail according to whether it needs immediate attention, can be postponed for later, or is a bulk mailing. I’ve been using it for some months, and it admittedly takes a little time to learn to trust. The results, however, are impressive. Simply removing non-urgent mail from view lowers stress levels considerably.

Send us news

Email us news, tips and interesting tidbits at strataweek@oreilly.com.
