The growing popularity of Big Data management tools (Hadoop; MPP, real-time SQL, NoSQL databases; and others) means many more companies can handle large amounts of data. But how do companies analyze and mine their vast amounts of data? For companies that already have large amounts of data in Hadoop, there's room for even simpler tools that would allow business users to directly interact with Big Data.
Some organizations create their own real-time analysis tools, while others turn to specialized solutions. In a previous post, I highlighted SQL-based real-time analytic tools that can handle large amounts of data. I noted that other big data management systems such as MPP databases and MapReduce/Hadoop were too batch-oriented to deliver analysis in near real-time. At least for MapReduce/Hadoop systems things may have changed slightly. A group of researchers from UC Berkeley and Yahoo recently modified MapReduce to allow for pipelining between operators.
The growing need to manage and make sense of Big Data, has led to a surge in demand for analytic databases, which many companies are attempting to fill. As an alternative to current shared-nothing analytic databases, HadoopDB is a hybrid that combines parallel databases with scalable and fault-tolerant Hadoop/MapReduce systems.
Our belief that proficiency in managing and analyzing large amounts of data distinguishes market leading companies, led to a recent report designed to help users understand the different large-scale data management techniques. Our report on Big Data Technologies was the result of interviews with over thirty experts, including research scientists, (open-source) hackers, vendors, data analysts, and entrepreneurs. I recently sat down with my co-author, Roger Magoulas (Director of Research at O’Reilly), who agreed talk about our report and Big Data in general.