The growing popularity of Big Data management tools (Hadoop; MPP, real-time SQL, NoSQL databases; and others (1)) means many more companies can handle large amounts of data. But how do companies analyze and mine their vast amounts of data? The cutting-edge (social) web companies employ teams of data scientists (2) who comb through data using different Hadoop interfaces and use custom analysis and visualization tools. Other companies integrate their MPP databases with familiar Business Intelligence tools. For companies that already have large amounts of data in Hadoop, there’s room for even simpler tools that would allow business users to directly interact with Big Data.
A startup aims to expose Big Data to analysts charged with producing most routine reports. Datameer (3) has an interesting workflow model that enables spreadsheet users to quickly perform analytics with data in Hadoop. The Datameer Analytics Solution (DAS) assumes data sits in Hadoop (4), and from there a business analyst can rapidly load, transform, analyze, and visualize data.
Datameer’s workflow uses the familiar spreadsheet interface as a data processing pipeline. Random samples are pulled into worksheets where spreadsheet functions let analysts customize transformations, aggregations, and joins (5). Once their analytic models are created, results are computed via Hadoop’s distributed processing technology (computations are initiated through a simple GUI). DAS contains over a hundred standard spreadsheet functions, NLP tools (tokenization, ngrams) for unstructured data, and basic charting tools.
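To make the "design on a sample, then run on everything" idea concrete, here is a minimal Python sketch. It is not Datameer's code or API: the function names, the toy data set, and the choice of bigram counting as the analytic model are all assumptions made for illustration, and a plain in-memory list stands in for data that would really live in Hadoop.

```python
import random
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it on whitespace."""
    return text.lower().split()


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bigram_counts(records):
    """Illustrative 'analytic model': count bigrams across text records."""
    counts = Counter()
    for record in records:
        counts.update(ngrams(tokenize(record), 2))
    return counts


# Hypothetical stand-in for a large data set stored in Hadoop.
full_data = ["big data tools", "Big Data analysis", "data analysis tools"] * 1000

# Step 1: build and check the model against a small random sample.
rng = random.Random(0)  # fixed seed so the sample is reproducible
sample = rng.sample(full_data, 10)
preview = bigram_counts(sample)

# Step 2: once the model looks right, run the same function on all the data.
# In a real deployment this step would be a distributed job, not a local call.
result = bigram_counts(full_data)
```

The point of the pattern is that the analyst iterates on `preview`, which is cheap to recompute, and only pays for the full computation once the logic is settled.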
What’s intriguing about DAS is that it opens up Big Data analysis to a broad set of business users. Based on the private demo we saw last week, we think Datameer is off to a good start. While still in beta, DAS has been deployed by many customers, and feedback from users has resulted in an intuitive and extremely useful analytic tool. With DAS, spreadsheet users will be able to perform Big Data analysis without assistance from their colleagues in IT.
The buzz over Big Data has so far centered largely on (new) data management tools (6). More recently, we’re hearing from companies eager to tackle the next step: Big Data analysis ranging from routine reports to complex quantitative models. On one end, machine-learning algorithms and statistics are starting to appear as in-database analytic functions. At the other end, companies besides Datameer will develop Big Data analysis tools for average users (i.e., users who won’t learn BI tools, SQL, Pig, Hive, and the like). If money isn’t an issue, IBM’s ambitious (and still immature) BigSheets project goes a step further than Datameer. It aims to provide data scientists with a single tool that can handle data acquisition (web crawlers), data management (Hadoop), text mining, and visualization (Many Eyes).
(1) Splunk is a tool that does both Big Data management and analytics.
(2) In fact, “data scientist” is a title that’s increasingly used in companies like Yahoo!, Facebook, LinkedIn, Twitter, the NY Times, …
(3) Datameer is a San Mateo startup, with some engineers in Germany. The company name is based on the German word for ocean.
(4) DAS can actually handle data from a variety of other sources, but for now, data from other sources gets pipelined to Hadoop in (near) real-time.
(5) Spreadsheet users should quickly be able to merge data sources with DAS: joins are done between worksheets and are intuitive. DAS is a single tool that can handle data manipulation, analysis, and visualization, thus reducing the need to switch back and forth between multiple tools.
(6) Along with the cool new data management tools, there are occasional stories of amazing custom analytics produced by data scientists.