Outliers and coexistence are the new normal for big data

Analysis of complete data sets and integration of new tools are leading to revenue growth and new business models.

Letting data speak for itself through analysis of entire data sets is eclipsing modeling from subsets. All too often, what were once disregarded as “outliers” on the far edges of a data model turned out to be the telltale signs of a micro-trend that became a major event. To enable this kind of advanced analytics and to integrate it in real time with operational processes, companies and public sector organizations are evolving their enterprise architectures to incorporate new tools and approaches.

Whether you prefer “big,” “very large,” “extremely large,” “extreme,” “total,” or another adjective for the “X” in the “X Data” umbrella term, what’s important is accelerated growth in three dimensions: volume, complexity and speed.

Big data is not without its limitations. Many organizations need to revisit business processes, solve data silo challenges, and invest in visualization and collaboration tools to make big data understandable and actionable across an extended organization.

“Sampling is dead”

When huge data volumes can be processed and analyzed in their entirety at scale, “sampling is dead,” says Abhishek Mehta, former Bank of America (BofA) managing director and Tresata co-founder, and a speaker at last year’s Hadoop World. Potential applications include default-risk analysis of every loan in a bank’s portfolio and analysis of granular data for targeted advertising.
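
To make the contrast concrete, here is a minimal, hypothetical Python sketch of computing default rates over an entire loan portfolio rather than a sample; the file layout and field names are invented for illustration and bear no relation to any bank’s actual systems.

```python
# Hypothetical illustration: estimating default rates per loan segment over
# the *entire* portfolio rather than a sample. Field names and the data
# source are invented for this example.
import csv
import random
from collections import defaultdict

def full_population_default_rates(rows):
    """Aggregate every loan record; no sampling step."""
    totals = defaultdict(lambda: [0, 0])  # segment -> [defaults, loans]
    for row in rows:
        seg = row["segment"]
        totals[seg][0] += int(row["defaulted"])
        totals[seg][1] += 1
    return {seg: d / n for seg, (d, n) in totals.items()}

def sampled_default_rates(rows, rate=0.01, seed=42):
    """Same calculation on a 1% sample; rare 'outlier' segments may be missed."""
    random.seed(seed)
    sample = [r for r in rows if random.random() < rate]
    return full_population_default_rates(sample)

if __name__ == "__main__":
    # Expected columns: loan_id, segment, defaulted (0 or 1).
    with open("loans.csv", newline="") as f:
        rows = list(csv.DictReader(f))
    print("full population:", full_population_default_rates(rows))
    print("1% sample:      ", sampled_default_rates(rows))
```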

The BofA corporate investments group adopted a SAS high-performance risk management solution, together with an IBM BladeCenter grid and XIV storage, to power credit-risk modeling, scoring and loss forecasting. As explained in a recent call with the SAS high-performance computing team, this new enterprise risk management system reduced calculation times at BofA for forecasting the probability of loan defaults from 96 hours to four hours. In addition to speeding up loan processing and hedging decisions, Bank of America can aggregate bottom-up data from individual loans for what may be a more accurate picture of total risk than was previously possible by testing models on subsets of the data.

nPario holds an exclusive license from Yahoo for columnar-storage technology that handles over eight petabytes of data for advertising and promotion within Yahoo’s internal infrastructure, per a February 2011 discussion with nPario President and CEO Bassel Y. Ojjeh. nPario has essentially forked the code, so that Yahoo can continue its internal use while nPario goes to market with a commercial offering for external customers. The nPario technology enables analysis at the granular level, not just of aggregated or sampled data. In addition to supporting a range of other data sources, nPario offers full integration with Adobe Omniture, including APIs that can pull data from Omniture (although Omniture charges a fee for this download).

Electronic Arts uses nPario for an “insights suite” that details how gamers engage with advertising. The nPario-powered EA analytics suite tracks clicks, impressions, demographic profiles, social media buzz and other data across EA’s online, console game, mobile and social channels. The result is a more precise understanding of consumer intent, and an ability to micro-target ads, than was previously possible with sampled data or with data limited to online or shrink-wrapped channels rather than the complete range of EA’s customer engagement.

Multiple big data technologies coexist in many enterprise architectures

In many cases, organizations will use a mix-and-match combination of relational database management systems (RDBMS), Hadoop/MapReduce, R, columnar databases such as HP Vertica or ParAccel, and document-oriented databases. There is also growing adoption this year, beyond the financial services industry and government, of complex event processing (CEP) and related real-time or near-real-time technologies to take action on web, IT, sensor and other streaming data.

At the same time that cost-effective, fast tools to analyze huge data sets are making data sampling a thing of the past, coexistence is quickly becoming the new normal for big data infrastructure and service architectures. For many enterprises and public sector organizations, the focus is “the right tool for the job” to manage structured, unstructured and semi-relational data from disparate sources. While infrastructure coexistence is hardly new — one could argue that it’s as old as the technology industry itself — what is becoming significantly more commonplace, and hence a “new normal,” is the integration of Hadoop/MapReduce, CEP, “NoSQL” and other database and data streaming variants as extensions of existing relational-based enterprise data warehouses (EDWs). A few examples:

  • AOL Advertising integrated two data management systems: one optimized for high-throughput data analysis (the “analytics” system), the other for low-latency random access (the “transactional” system). After evaluating alternatives, AOL Advertising combined the Cloudera Distribution for Apache Hadoop (CDH) with Membase (now Couchbase). This pairs Hadoop’s strength in handling large, complex data volumes with Membase’s sub-millisecond latency for making optimized, real-time ad placement decisions (see the sketch after this list).
  • At LinkedIn, large-scale data computations spanning more than 100 billion relationships a day, combined with low-latency site serving, are powered by Hadoop for massive batch workloads, Project Voldemort as a NoSQL key/value storage engine, and the Azkaban open-source workflow system. The team also developed Kafka, a real-time, persistent messaging system for log aggregation and activity processing.
  • The Walt Disney Co. Technology Shared Services Group extended its existing data warehouse architecture with a Hadoop cluster to provide an integration mashup for diverse departmental data, most of which is stored separately by Disney’s many business units and subsidiaries. Since the cluster went into production for internal business units last October, this data can be analyzed for patterns across different but connected customer activities, such as attendance at a theme park, purchases from Disney stores, and viewership of Disney’s cable television programming. (Disney case study summarized from the PricewaterhouseCoopers Technology Forecast, Big Data issue, 2010.)
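
The AOL pairing above illustrates a common coexistence pattern: compute in batch, then publish the results to a low-latency store for serving. Below is a minimal sketch of that handoff, assuming a Hadoop-style tab-delimited output file and a memcached-protocol store (which Membase/Couchbase speak), accessed here through the pymemcache library; the host, file path and key naming are illustrative.

```python
# Sketch of the batch-to-serving handoff: a nightly Hadoop job writes
# "ad_id <TAB> score" lines; this loader publishes them to a memcached-
# protocol store so an ad server can look scores up with sub-millisecond
# latency. Host, port, file path and key naming are illustrative assumptions.
from pymemcache.client.base import Client

def load_scores(path="part-00000", host="membase.internal", port=11211):
    cache = Client((host, port))
    with open(path) as f:
        for line in f:
            ad_id, score = line.rstrip("\n").split("\t")
            # The ad server reads "score:<ad_id>" at placement time.
            cache.set(f"score:{ad_id}", score.encode())

if __name__ == "__main__":
    load_scores()
```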

Centralization and coexistence at eBay

Even companies whose enterprise architecture aligns more closely with the enterprise data warehouse (EDW) vision associated with Bill Inmon than with the federated model popularized by Ralph Kimball are migrating toward greater coexistence to empower business growth. eBay offers an instructive example.

“A data mart can’t be cheap enough to justify its existence,” says Oliver Ratzesberger, eBay’s senior director of architecture and operations. eBay has migrated to a coexistence architecture featuring Teradata as the core EDW, a Teradata offshoot named Singularity for behavioral analysis of semi-relational clickstream data, and Hadoop for image processing and deep data mining. All three store multiple petabytes of data.

Named after Ray Kurzweil’s thought-provoking book “The Singularity Is Near,” the Singularity system at eBay is running in production for managing and analyzing semi-relational data, using the same Teradata SQL interfaces that are already widely understood and liked by many eBay staff. eBay’s Hadoop instances still require separate management tools and, to date, offer fewer workload management capabilities than eBay gets from its Teradata architecture.

With this tripartite architecture, there are no static pages on eBay’s consumer online marketplace. Every page is dynamic, and many, if not yet all, ads are individualized. These technical innovations are helping to power eBay’s corporate resurgence, as highlighted in “How eBay Developed a Culture of Experimentation,” a March 2011 Harvard Business Review interview with eBay CEO John Donahoe.

Coexistence at Bank of America

Bank of America operates a Teradata data warehouse architecture with Hadoop, R and columnar extensions, along with IBM Cognos business intelligence, InfoSphere Foundation Tools and InfoSphere DataStage; Tableau reporting; an SAP global ERP reporting system; and Cisco telepresence for internal collaboration, among other technologies and systems.

R specialist Revolution Analytics cites a Bank of America customer reference in which Mike King, a quantitative analyst at the bank, describes how he uses R to write programs for capital adequacy modeling, decision systems design and predictive analytics:

R allows you to take otherwise overwhelmingly complex data and view it in such a way that, all of a sudden, the choice becomes more intuitive because you can picture what it looks like. Once you have that visual image of the data in your mind, it’s easier to pick the most appropriate quantitative techniques.

While Revolution Analytics is sponsoring a “SAS to R Challenge” that invites SAS customers to consider converting to R, coexistence between enterprise-grade software such as SAS and emerging tools such as R is a more common outcome than replacing or cutting back current or future SAS licenses, as shown by Bank of America’s investment in the SAS risk management offering described above.

For its part, SAS indicates that SAS/IML Studio (formerly known as SAS Stat Studio) provides one existing capability to interface with the R language. According to Radhika Kulkarni, vice president of advanced analytics at SAS, in a discussion about SAS-R integration on the SAS website: “We are busy working on an R interface that can be surfaced in the SAS server or via other SAS clients. In the future, users will be able to interface with R through the IML procedure.”

To quote Bob Rodriguez, senior director of statistical development at SAS, from that website discussion: “R is a leading language for developing new statistical methods. Our new PhD developers learned R in their graduate programs and are quite versed in it.” The SAS article added that: “Both R and SAS are here to stay, and finding ways to make them work better with each other is in the best interests of our customers.”

Recent evolutions in big data vendors

As 10gen CEO and co-founder Dwight Merriman and new President Max Schireson described in a March 8 call: “There have been periodic rebellions against the RDBMS.” Intuit’s small business division uses 10gen’s document-oriented MongoDB for real-time tracking of website user engagement and user activities. CouchOne, backer of the document-oriented CouchDB, merged with key-value store and memcached specialist Membase to form Couchbase; its customers include AOL and social gaming leader Zynga.
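
For readers unfamiliar with the document model, here is a generic sketch of event tracking with MongoDB using the pymongo driver; the collection and field names are invented for illustration and are not Intuit’s actual schema.

```python
# Generic sketch of document-oriented event tracking with MongoDB via the
# pymongo driver (invented schema): each user action is stored as a
# self-describing document and queried without a fixed relational schema.
from datetime import datetime, timezone
from pymongo import MongoClient

events = MongoClient("mongodb://localhost:27017")["tracking"]["events"]

events.insert_one({
    "user_id": "u-123",
    "action": "clicked_signup",
    "page": "/pricing",
    "ts": datetime.now(timezone.utc),
    # Arbitrary extra attributes can ride along without schema changes.
    "experiment": {"name": "pricing_layout", "variant": "B"},
})

print("signup clicks so far:", events.count_documents({"action": "clicked_signup"}))
```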

Customers had asked DataStax (previously named Riptano) for a roadmap for integrated Cassandra and Hadoop management, per an O’Reilly Strata conference discussion with DataStax CEO and co-founder Matt Pfeil and VP of products Ben Werther. In March 2011, DataStax announced Brisk, an integrated Hadoop, Hive and Cassandra platform, to support high-volume, high-velocity websites and complex event processing, among other applications that require real-time or near-real-time processing. According to Werther in a March 29 email: “Cassandra is at the core of Brisk and eliminates the need for HBase because it natively provides low-latency access and everything you’d get in HBase without the complexity.”

Originating at Facebook and with commercial backing from DataStax, Cassandra is in use at Cisco, Facebook, Ooyala, Rackspace/Cloudkick, SimpleGeo, Twitter and other organizations with large, active data sets. It is essentially a BigTable data model running on an Amazon Dynamo-like infrastructure. DataStax’s largest Cassandra production cluster has more than 700 nodes. Cloudkick, acquired by Rackspace, offers a good account of the selection process that led it to Cassandra in “4 Months with Cassandra, a Love Story.”
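
Here is a brief sketch of what that data model looks like in practice, using the present-day DataStax Python driver and CQL (the Thrift-based clients of 2011 looked quite different); the keyspace, table and column names are invented for illustration.

```python
# Cassandra's model in brief: rows are grouped by a partition key that is
# hashed across a Dynamo-style ring of nodes, and columns within a partition
# are sorted by a clustering key (the BigTable-like part). All names invented.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])  # contact point is illustrative
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.page_views (
        user_id text,          -- partition key: placed on the ring by hash
        viewed_at timestamp,   -- clustering key: sorted within the partition
        url text,
        PRIMARY KEY (user_id, viewed_at)
    )
""")
session.execute(
    "INSERT INTO demo.page_views (user_id, viewed_at, url) "
    "VALUES (%s, toTimestamp(now()), %s)",
    ("user-42", "http://example.com/item/123"),
)
for row in session.execute(
    "SELECT viewed_at, url FROM demo.page_views WHERE user_id = %s", ("user-42",)
):
    print(row.viewed_at, row.url)
cluster.shutdown()
```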

While EMC/Greenplum and Teradata/Aster Data started with PostgreSQL and moved forward from there, EnterpriseDB has continued to incorporate PostgreSQL updates. EnterpriseDB CEO Ed Boyajian and VP Karen Tegan Padir explained in a call last month that while much of the initial PostgreSQL work was aimed at building databases for sophisticated users, EnterpriseDB has done more to improve manageability and ease of use, including a one-click installer for PostgreSQL similar to Red Hat’s installer for Linux. EnterpriseDB envisions becoming for PostgreSQL what Cloudera has become for Hadoop: an integrated solution provider aimed at commercial, enterprise and public-sector accounts.

MicroStrategy is one of Cloudera’s key partners for visualization and collaboration, and Informatica is quickly becoming a strong partner for ETL. To speed up what can be slow transfers over ODBC, Cloudera is building an optimized version of Sqoop. Flume agents support CEP applications, but that is not yet a big use case for Hadoop, per a February call with Dr. Amr Awadallah, co-founder and VP of engineering, and marketing VP John Kreisa.

The following are additional examples of big data integration and coexistence efforts based on phone and in-person discussions with vendor executives in February and March 2011:

  • Adobe acquired data management platform vendor Demdex to integrate with Omniture in the Adobe Online Marketing Suite. Demdex helps advertisers shift dollars and focus from buying content-driven placements to buying specific audiences.
  • Appistry extended its CloudIQ Storage with a Hadoop edition and a partnership with Accenture on a Cloud MapReduce offering for private clouds. This joint offering runs MapReduce jobs on top of the Appistry CloudIQ Platform for behind-the-firewall corporate applications.
  • Together with its siblings Cassandra and Project Voldemort, Riak is an Amazon.com Dynamo-inspired database that Comcast, Mozilla and others use to prototype, test and deploy applications, with commercial support and services from Basho Technologies.
  • At CloudScale, CEO Bill McColl and his team offer a platform to help developers create applications designed for real-time distributed architectures.
  • Clustrix’s clustered database system looks like a MySQL database “on the wire,” but without MySQL code, combining key-value storage with relational database functionality and focusing on online transaction processing (OLTP) applications (see the sketch after this list).
  • Concurrent supports Cascading, an open source abstraction for MapReduce that allows applications to integrate with Hadoop through a Java API.
  • Within an enterprise, and extending to its SaaS or social media data, Coveo offers integrated search tools for finding information quickly. For example, a Coveo user can search Microsoft SharePoint files or pull up data from Salesforce.com, all from within her Outlook email client.
  • Germany-based Exasol added a bulk-loader and increased integration capabilities for SAP clients.
  • Yale’s Daniel Abadi and several of his colleagues unveiled Hadapt, which commercializes a project that began in the Yale computer science department, to run large, ad-hoc SQL queries at high velocity on both structured and unstructured data in Hadoop.
  • IBM Netezza has partnered with R specialist Revolution Analytics to add built-in R capabilities to the IBM Netezza TwinFin Data Warehouse Appliance. While Revolution Analytics has challenged SAS, it sees more of a partner model with IBM Netezza and IBM SPSS. This may in part reflect the career of Revolution Analytics President and CEO Norman Nie; prior to his current role, he co-invented SPSS.
  • MapR aims to speed up Hadoop/MapReduce through a proprietary replacement for HDFS that can integrate with the rest of the Apache Hadoop ecosystem. (For a backgrounder on that ecosystem, refer to “Meet the Big Data Equivalent of the LAMP Stack.”)
  • MarkLogic offers a purpose-built database using an XML data model for unstructured information for Simon & Schuster, Pearson Education, Boeing, the U.S. Federal Aviation Administration and other customers.
  • Microsoft Dryad offers a programming model to write parallel and distributed programs to scale from a small cluster to a large data center.
  • Pentaho offers an open source BI suite integrating capabilities for ETL, reporting, OLAP analysis, dashboards and data mining.
  • With its SpringSource and Wavemaker acquisitions, VMware is offering and expanding a suite of tools for developers to program applications that take advantage of virtualized cloud delivery environments. VMware’s cloud application strategy is to empower developers to run modern applications that share information with underlying infrastructure to maximize performance, quality of service and infrastructure utilization. This extends VMware’s virtualization business farther up into the software development lifecycle and provides incremental revenue for VMware while VMware positions itself for desktop virtualization to take off.
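
To make the Clustrix point above concrete: because the system speaks the MySQL wire protocol, a stock MySQL client library connects to it unchanged. A minimal sketch using the PyMySQL driver follows; the host, credentials and schema are placeholders, not a real Clustrix endpoint.

```python
# Because Clustrix presents itself as MySQL "on the wire," a stock MySQL
# client library connects to it unchanged -- no Clustrix-specific driver.
# Host, credentials and schema below are placeholders for illustration.
import pymysql

conn = pymysql.connect(
    host="clustrix.example.internal",  # a Clustrix node, addressed like any MySQL server
    user="app",
    password="secret",
    database="shop",
)
try:
    with conn.cursor() as cur:
        # Ordinary SQL; the cluster distributes storage and query execution
        # behind the familiar relational interface.
        cur.execute("SELECT id, status FROM orders WHERE customer_id = %s", (42,))
        for order_id, status in cur.fetchall():
            print(order_id, status)
finally:
    conn.close()
```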

Data in the cloud

Cloud computing and big data technologies overlap. As Judith Hurwitz of Hurwitz & Associates explained in a February 22 call: “Amazon has definitely blazed the trail as the pioneer for compute services.” Amazon found it had extra capacity and started renting it out, with little or no service level guarantees, and then from 2006 on invested in dedicated infrastructure to serve external customers through Amazon Web Services (AWS). In general, AWS has competed on pricing and self-service provisioning, which suits start-ups and enterprise departmental needs well, but without many of the more stringent service level agreements (SLAs) sought by corporate IT departments.

Based on BigTable and other Google technologies, Fusion Tables is a service for managing large collections of tabular data in the cloud, as explained in a conversation this month with Dr. Alon Halevy, head of the Structured Data Group at Google Research. You can upload tables of up to 100MB and share them with collaborators, or make them public. You can apply filters and aggregation to your data, visualize it on maps and other charts, merge data from multiple tables, and export it to the web or CSV files. You can access Fusion Tables via a web user interface or an API, and Google offers examples to help you get started.
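
The sketch below shows the general shape of a Fusion Tables query through its SQL-like HTTP API; the endpoint, parameter names, table ID and key are assumptions for illustration rather than documentation.

```python
# Illustrative only: Fusion Tables exposed a SQL-like query API over HTTP.
# The endpoint, parameter names, table ID and API key below are assumptions
# for this sketch, not documentation.
import requests

API_KEY = "YOUR_API_KEY"   # placeholder
TABLE_ID = "1abcDEF"       # placeholder table ID

resp = requests.get(
    "https://www.googleapis.com/fusiontables/v1/query",
    params={
        # Filter and aggregate rows much as you would in a relational table.
        "sql": f"SELECT Region, SUM(Sales) FROM {TABLE_ID} GROUP BY Region",
        "key": API_KEY,
    },
)
resp.raise_for_status()
print(resp.json())  # columns and rows come back as JSON
```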

As Hurwitz discussed, the data-in-the-cloud market is starting to bifurcate. Private clouds are advancing the enterprise shared-services model with workload management, self-provisioning and other automation. IBM, Unisys, Microsoft Azure, HP, NaviSite (Time Warner) and others have begun offering enterprise-grade services. While data in Amazon is fairly portable — most services link with Amazon — many APIs and tools are still specific to one environment or carry important dependencies; Microsoft Azure, for example, basically assumes a .NET infrastructure.

At the 1000 Genomes Project, medical researchers are benefiting from a cloud architecture for genomics research, including the ability to download a public dataset through Amazon Web Services. For medical researchers on limited budgets, using cloud capacity for analytics can save investment dollars, although Amazon pricing can be deceptive: CPU hours can add up to quite a lot of money over time. To speed data transfers from the cloud, the project participants are using Aspera and its fasp protocol.
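
As an illustration of how accessible that public data is, here is a minimal sketch using the present-day boto3 SDK (which postdates this article) to browse and download from the project’s public S3 bucket anonymously; the object key shown is a placeholder.

```python
# Sketch: anonymous access to the 1000 Genomes public S3 bucket using boto3
# (a present-day SDK; tooling in 2011 differed). The object key below is a
# placeholder -- list the bucket to find real paths.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client(
    "s3", region_name="us-east-1", config=Config(signature_version=UNSIGNED)
)

# List a few objects to see what is available.
for obj in s3.list_objects_v2(Bucket="1000genomes", MaxKeys=5).get("Contents", []):
    print(obj["Key"], obj["Size"])

# Download one object to local disk (key is a placeholder).
s3.download_file("1000genomes", "README.analysis_history", "README.analysis_history")
```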

The University of Washington, the Monterey Bay Aquarium Research Institute and Microsoft have collaborated on Project Trident to provide a scientific workflow workbench for oceanography. Trident, implemented with Windows Workflow Foundation, .NET, Silverlight and other Microsoft technologies, allows scientists to explore and visualize oceanographic data in real time. They can use Trident to compose, run and catalog oceanography experiments from any web browser.

Pervasive DataCloud adds a data services layer to Amazon Web Services for integration and transformation capabilities. An enterprise with multiple CRM systems can synchronize application data from Oracle/Siebel, Salesforce.com and Force.com partner applications within a Pervasive DataCloud2 process, then use the feeds from that process to power executive dashboards or business analytics. Likewise, an enterprise with Salesforce.com data can use DataCloud2 to sync with an on-premise relational database, or sync data between Salesforce.com and Intuit QuickBooks accounting software.
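
Below is a generic sketch of that synchronization pattern, written directly against the Salesforce API with the simple_salesforce library and a local SQLite table rather than Pervasive’s product; credentials, object and field names are placeholders.

```python
# Generic illustration of the sync pattern described above (not Pervasive's
# product): pull Salesforce records and mirror them into a local relational
# table. Credentials, object and field names are placeholders.
import sqlite3
from simple_salesforce import Salesforce

sf = Salesforce(username="user@example.com", password="...", security_token="...")
accounts = sf.query("SELECT Id, Name, AnnualRevenue FROM Account")["records"]

db = sqlite3.connect("onprem.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS account "
    "(id TEXT PRIMARY KEY, name TEXT, annual_revenue REAL)"
)
db.executemany(
    "INSERT OR REPLACE INTO account (id, name, annual_revenue) VALUES (?, ?, ?)",
    [(a["Id"], a["Name"], a["AnnualRevenue"]) for a in accounts],
)
db.commit()
db.close()
```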

Big data jobs

All of this activity is welcome news for software engineers and other technical staff whose jobs may have been affected by overseas outsourcing. The monthly Hadoop user group meetups at the Yahoo campus now draw hundreds of attendees and even some job offers: many big data mega-vendors and startups are hiring. For example, while Yahoo ended its own distribution of Hadoop, it has interesting work underway with its Cloud Data Platform and Services, including job openings there.

Cloudera counts 85 employees and continues to hire. Cloudera’s Hadoop training courses are consistently sold out, including big demand from public sector organizations; the venture capital arm of the CIA, In-Q-Tel, became a Cloudera investor last month.

Recognizing big data’s limits

To temper enthusiasm just a bit, 2011 is also a good time for a reality check to put big data into perspective. To benefit from big data, many enterprises and public sector organizations need to revisit business processes, solve data silo challenges, and invest in visualization and collaboration tools to help make big data understandable and actionable across an extended organization. Visualization tools are helpful, but only in combination with collaboration tools that enable discussion of data sources, context and implications, and in some cases correction of misleading data, as Paul Miller discusses in an article on GigaOM Pro (subscription or free trial required).

Many leaders in managing and benefiting from big data are finding it beneficial to hire and develop staff with “T-shaped” skills that combine deep technical experience (the T’s vertical line) and wide business skills (the T’s horizontal line). For example, the phrase a “new normal” in this article’s title refers to periodic phases in ongoing technology and marketplace trends, not to isolation of statistical error in repeated measured data (normalization from a mathematics perspective) or organization of data to minimize redundancy (normalization from a RDBMS perspective). Staff with T-shaped skills can “talk these different languages” to collaborate productively with colleagues, partners and customers who work in business, technology, statistics and other roles.

Big data applications such as risk management software will not by themselves prevent the next sub-prime mortgage meltdown or the previous generation’s savings and loan industry crisis. Decision-makers at financial institutions will need to make the right risk decisions, and regulatory oversight such as the new Basel rules for minimum capital requirements may play an important role too. And big data raises a number of important concerns for data privacy and ownership of data.

For more on big data technology and business trends, including a longer discussion on big data opportunities and limitations, take a look at my recently published “Putting Big Data to Work: Opportunities for Enterprises” report on GigaOM Pro.
