Outliers and coexistence are the new normal for big data
Analysis of complete data sets and integration of new tools are leading to revenue growth and new business models.
Letting data speak for itself through analysis of entire data sets is eclipsing modeling from subsets. All too often in the past, data points disregarded as "outliers" on the far edges of a data model turned out to be the telltale signs of a micro-trend that became a major event. To enable these advanced analytics and integrate them in real time with operational processes, companies and public sector organizations are evolving their enterprise architectures to incorporate new tools and approaches.
Whether you prefer "big," "very large," "extremely large," "extreme," "total," or another adjective for the "X" in the "X Data" umbrella term, what's important is accelerated growth in three dimensions: volume, complexity and speed.
Big data is not without its limitations. Many organizations need to revisit business processes, solve data silo challenges, and invest in visualization and collaboration tools to make big data understandable and actionable across an extended organization.
"Sampling is dead"
When huge data volumes can be processed and analyzed in full and at scale, "sampling is dead," says Abhishek Mehta, former Bank of America (BofA) managing director and Tresata co-founder, and a speaker at last year's Hadoop World. Potential applications include risk default analysis of every loan in a bank's portfolio and analysis of granular data for targeted advertising.
The BofA corporate investments group adopted a SAS high-performance risk management solution together with IBM BladeCenter grid and XIV storage to power credit-risk modeling, scoring and loss forecasting. As explained in a recent call with the SAS high-performance computing team, this new enterprise risk management system reduced calculation times at BofA for forecasting the probability of loan defaults from 96 hours to four hours.
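To make the "sampling is dead" argument concrete, here is a minimal Python sketch of why scoring every loan can differ sharply from extrapolating from a sample. Everything in it is a hypothetical illustration: the portfolio, the default probabilities, the exposures and the function names are assumptions made for this example, and it does not represent how the SAS system or Bank of America actually models risk.

```python
import random

def expected_loss(loans):
    """Bottom-up expected loss: sum of (default probability x exposure) over every loan."""
    return sum(p_default * exposure for p_default, exposure in loans)

def sampled_estimate(loans, sample_size, seed=0):
    """Estimate portfolio loss by scoring a random sample and scaling it up."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    sample = rng.sample(loans, sample_size)
    return expected_loss(sample) * len(loans) / sample_size

# Hypothetical portfolio: 9,990 ordinary loans plus 10 large, risky "outlier" loans.
ordinary = [(0.01, 100_000.0)] * 9_990
outliers = [(0.50, 5_000_000.0)] * 10
portfolio = ordinary + outliers

full = expected_loss(portfolio)
estimate = sampled_estimate(portfolio, sample_size=100)

print(f"Full portfolio expected loss: {full:,.0f}")
print(f"Estimate from a 1% sample:    {estimate:,.0f}")
```

In this made-up portfolio the ten outlier loans account for most of the expected loss, so a 1% sample either misses them entirely and understates the risk, or happens to include one and overstates it badly. Scoring every loan avoids that problem, at the cost of processing the full data set.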
In addition to speeding up loan processing and hedging decisions, Bank of America can aggregate bottom-up data from individual loans for what may be a more accurate picture of total risk than was previously possible by testing models on just subsets of data.
nPario holds an exclusive license from Yahoo for columnar-storage technology that, within Yahoo's internal infrastructure, handles more than eight petabytes of data for advertising and promotion, per a February 2011 discussion with nPario President and CEO Bassel Y. Ojjeh. nPario has essentially forked the code, so that Yahoo can continue its internal use while nPario goes to market with a commercial offering for external customers. The nPario technology enables analysis at the granular level, not just on aggregated or sampled data. In addition to supporting a range of other data sources, nPario offers full integration with Adobe Omniture, including APIs that can pull data from Omniture (although Omniture charges a fee for this download).
Electronic Arts uses nPario for an "insights suite" that details how gamers engage with advertising. The nPario-powered EA analytics suite tracks clicks, impressions, demographic profiles, social media buzz and other data across EA's online, console game, mobile and social channels. The result is a much more precise understanding of consumer intent, and a greater ability to micro-target ads, than was previously possible with sampled data or with data limited to just online or shrink-wrapped channels rather than the complete range of EA's customer engagement.
Multiple big data technologies coexist in many enterprise architectures
At the same time that cost-effective, fast tools to analyze huge data sets are making data sampling a thing of the past, coexistence is quickly becoming the new normal for big data infrastructure and service architectures. For many enterprises and public sector organizations, the focus is "the right tool for the job" to manage structured, unstructured and semi-relational data from disparate sources.
While infrastructure coexistence is hardly new -- one could argue that it's as old as the technology industry itself -- what is becoming significantly more commonplace, and hence a "new normal," is the integration of Hadoop/MapReduce, CEP, "NoSQL," and other database and data streaming variants as extensions of existing relational-based enterprise data warehouses (EDWs). A few examples:
Centralization and coexistence at eBay
Even companies whose enterprise architecture more closely aligns with the enterprise data warehouse (EDW) vision associated with Bill Inmon than with the federated model popularized by Ralph Kimball are finding themselves migrating their architectures toward greater coexistence to empower business growth. eBay offers an instructive example.
"A data mart can't be cheap enough to justify its existence," says Oliver Ratzesberger, eBay's senior director of architecture and operations. eBay has migrated to a coexistence architecture featuring Teradata as the core EDW, a Teradata offshoot named Singularity for behavioral analysis and clickstream semi-relational data, and Hadoop for image processing and deep data mining. All three store multiple petabytes of data.
Named after Ray Kurzweil's thought-provoking book "The Singularity is Near," the Singularity system at eBay is running in production for managing and analyzing semi-relational data, using the same Teradata SQL user interfaces that are already widely understood and liked by many eBay staff. eBay's Hadoop instances still require separate management tools and, to date, still come with fewer capabilities for workload management than eBay receives with its Teradata architecture.
With this tripartite architecture, there are no static pages on eBay's consumer online marketplace. Every page is dynamic, and many if not yet all ads are individualized. These technical innovations are helping to empower eBay's corporate resurgence, as highlighted in the March 2011 Harvard Business Review interview with eBay CEO John Donahoe, "How eBay Developed a Culture of Experimentation."
Coexistence at Bank of America
Bank of America operates a Teradata data warehouse architecture with Hadoop, R and columnar extensions, along with IBM Cognos business intelligence, InfoSphere Foundation Tools and InfoSphere DataStage; Tableau reporting; an SAP global ERP reporting system; and Cisco telepresence for internal collaboration, among other technologies and systems.
R specialist Revolution Analytics cites a Bank of America reference. In it, Mike King, a quantitative analyst at Bank of America, describes how he uses R to write programs for capital adequacy modeling, decision systems design and predictive analytics.
While Revolution Analytics is sponsoring a SAS to R Challenge for SAS customers to consider converting to R, coexistence between enterprise-grade software such as SAS and emerging tools such as R is a more common outcome than a replacement or cutback in the number of current or future SAS licenses, as shown by Bank of America's recent investment in the SAS risk management offering described above.
For its part, SAS indicates that SAS/IML Studio (formerly known as SAS Stat Studio) provides one existing capability to interface with the R language. According to Radhika Kulkarni, vice president of advanced analytics at SAS, in a discussion about SAS-R integration on the SAS website: "We are busy working on an R interface that can be surfaced in the SAS server or via other SAS clients. In the future, users will be able to interface with R through the IML procedure."
To quote Bob Rodriguez, senior director of statistical development at SAS, from that website discussion: "R is a leading language for developing new statistical methods. Our new PhD developers learned R in their graduate programs and are quite versed in it." The SAS article added that: "Both R and SAS are here to stay, and finding ways to make them work better with each other is in the best interests of our customers."
Recent evolutions in big data vendors
As 10gen CEO and co-founder Dwight Merriman and new President Max Schireson described in a call March 8: "There have been periodic rebellions against the RDBMS." Intuit's small business division uses document-oriented MongoDB from 10gen for real-time tracking of website user engagement and user activities.
Document-oriented CouchDB supporter CouchOne merged with key-value store and memcached specialist Membase to form Couchbase; their customers include AOL and social gaming leader Zynga.
Customers had asked DataStax (previously named Riptano) for a roadmap for integrated Cassandra and Hadoop management, per an O'Reilly Strata conference discussion with DataStax CEO and co-founder Matt Pfeil and products VP Ben Werther. In March 2011, DataStax announced Brisk, an integrated Hadoop, Hive and Cassandra platform, to support high-volume, high-velocity websites and complex event processing, among other applications that require real-time or near-real-time processing. According to Werther in a March 29 email: "Cassandra is at the core of Brisk and eliminates the need for HBase because it natively provides low-latency access and everything you'd get in HBase without the complexity."
Originating at Facebook and with commercial backing from DataStax, Cassandra is in use at Cisco, Facebook, Ooyala, Rackspace/Cloudkick, SimpleGeo, Twitter and other organizations that have large, active data sets. It's basically a BigTable data model running on an Amazon Dynamo-like infrastructure. DataStax's largest Cassandra production cluster has more than 700 nodes. Cloudkick, acquired by Rackspace, offers a good discussion of the selection process that led to its use of Cassandra: "4 months with Cassandra, a love story."
While EMC/Greenplum and Teradata/Aster Data started with PostgreSQL and moved forward from there, EnterpriseDB has continued to incorporate PostgreSQL updates. EnterpriseDB CEO Ed Boyajian and VP Karen Tegan Padir explained in a call last month that while much of the initial PostgreSQL work was to build databases for sophisticated users, EnterpriseDB has done more to improve manageability and ease of use, including a 1-click installer for PostgreSQL similar to Red Hat's installer for Linux. EnterpriseDB envisions becoming for PostgreSQL what Cloudera has become for Hadoop: an integrated solution provider aimed at commercial, enterprise and public-sector accounts.
MicroStrategy is one of Cloudera's key partners for visualization and collaboration, and Informatica is quickly becoming a strong partner for ETL. To speed up what can be slow transfers in ODBC, Cloudera is building an optimized version of Sqoop. Flume agents support CEP applications, but it's not a big use case yet for Hadoop, per a call in February with Dr. Amr Awadallah, co-founder and VP of engineering, and marketing VP John Kreisa. The following are additional examples of big data integration and coexistence efforts based on phone and in-person discussions with vendor executives in February and March 2011:
With its SpringSource and Wavemaker acquisitions, VMware is offering and expanding a suite of tools for developers to program applications that take advantage of virtualized cloud delivery environments. VMware's cloud application strategy is to empower developers to run modern applications that share information with underlying infrastructure to maximize performance, quality of service and infrastructure utilization. This extends VMware's virtualization business further up into the software development lifecycle and provides incremental revenue for VMware while the company positions itself for desktop virtualization to take off.
Data in the cloud
Based on Bigtable and other Google technologies, Fusion Tables is a service for managing large collections of tabular data in the cloud, as explained in a conversation this month with Dr. Alon Halevy, head of the Structured Data Group at Google Research. You can upload tables of up to 100MB and share them with collaborators, or make them public. You can apply filters and aggregation to your data, visualize it on maps and other charts, merge data from multiple tables, and export it to the web or CSV files. You can access Fusion Tables via a web user interface or API, and Google offers examples to help you get started.
As Judith Hurwitz discussed, the data in the cloud market is starting to bifurcate. Private clouds are advancing the enterprise shared services model with workload management, self-provisioning and other automation of shared services. IBM, Unisys, Microsoft Azure, HP, NaviSite (Time Warner) and others have begun offering enterprise-grade services. While data in Amazon is pretty portable -- most services link with Amazon -- many APIs and tools are still specific to one environment, or reflect important dependencies; for example, Microsoft Azure basically assumes a .NET infrastructure.
At the 1000 Genomes Project, medical researchers are benefiting from a cloud architecture to access data for genomics research, including the ability to download a public dataset through Amazon Web Services. For medical researchers on limited budgets, using cloud capacity for analytics can save investment dollars. However, Amazon pricing can be deceptive, as CPU hours can add up to quite a lot of money over time. To speed data transfers from the cloud, the project participants are using Aspera and its fasp protocol.
The University of Washington, Monterey Bay Aquarium Research Institute and Microsoft have collaborated on Project Trident to provide a scientific workflow workbench for oceanography.
Trident, implemented with Windows Workflow Foundation, .NET, Silverlight and other Microsoft technologies, allows scientists to explore and visualize oceanographic data in real time. They can use Trident to compose, run and catalog oceanography experiments from any web browser.
Pervasive DataCloud adds a data services layer to Amazon Web Services for integration and transformation capabilities. An enterprise with multiple CRM systems can synchronize application data from Oracle/Siebel, Salesforce.com and Force.com partner applications within a Pervasive DataCloud2 process. It can then use the feeds from that DataCloud process to power executive dashboards or business analytics. Likewise, an enterprise with Salesforce.com data can use DataCloud2 to sync with an on-premise relational database, or to sync data between Salesforce.com and Intuit QuickBooks accounting software.
Big data jobs
All of this activity is welcome news for software engineers and other technical staff whose jobs may have been affected by overseas outsourcing. The monthly Hadoop user group meetups at the Yahoo campus now feature hundreds of attendees and even some job offers: many big data mega vendors and startups are hiring. For example, while Yahoo ended its own distribution of Hadoop, it has some interesting work underway with its Cloud Data Platform and Services, including job openings there.
Cloudera counts 85 employees and continues to hire. Cloudera's Hadoop training courses are consistently sold out, with strong demand from public sector organizations; In-Q-Tel, the venture capital arm of the CIA, became a Cloudera investor last month.
Recognizing big data's limits
To temper enthusiasm just a bit, 2011 is also a good time for a reality check to put big data into perspective.
To benefit from big data, many enterprises and public sector organizations need to revisit business processes, solve data silo challenges, and invest in visualization and collaboration tools to help make big data understandable and actionable across an extended organization. Visualization tools are helpful, but only in combination with collaboration tools that enable discussion of data sources, context and implications, and in some cases correction of misleading data, as Paul Miller discusses in an article on GigaOM Pro (subscription or free trial required).
Many leaders in managing and benefiting from big data are finding it beneficial to hire and develop staff with "T-shaped" skills that combine deep technical experience (the T's vertical line) and wide business skills (the T's horizontal line). For example, the phrase "new normal" in this article's title refers to periodic phases in ongoing technology and marketplace trends, not to isolation of statistical error in repeated measured data (normalization from a mathematics perspective) or organization of data to minimize redundancy (normalization from an RDBMS perspective). Staff with T-shaped skills can "talk these different languages" to collaborate productively with colleagues, partners and customers who work in business, technology, statistics and other roles.
Big data applications such as risk management software will not by themselves prevent the next sub-prime mortgage meltdown or a repeat of the previous generation's savings and loan industry crisis. Decision-makers at financial institutions will need to make the right risk decisions, and regulatory oversight such as the new Basel rules for minimum capital requirements may play an important role too.
Big data also raises a number of important concerns about data privacy and ownership of data.
For more on big data technology and business trends, including a longer discussion on big data opportunities and limitations, take a look at my recently published report on GigaOM Pro, "Putting Big Data to Work: Opportunities for Enterprises."