Big data and cloud technology go hand-in-hand. Big data needs clusters of servers for processing, which clouds can readily provide. So goes the marketing message, but what does that look like in reality? Both “cloud” and “big data” have broad definitions, obscured by considerable hype. This article breaks down the landscape as simply as possible, highlighting what’s practical, and what’s to come.
IaaS and private clouds
What is often called “cloud” amounts to virtualized servers: computing resource that presents itself as a regular server, rentable per consumption. This is generally called infrastructure as a service (IaaS), and is offered by platforms such as Rackspace Cloud or Amazon EC2. You buy time on these services, and install and configure your own software, such as a Hadoop cluster or NoSQL database. Most of the solutions I described in my Big Data Market Survey can be deployed on IaaS services.
Using IaaS clouds doesn’t mean you must handle all deployment manually: good news for the clusters of machines big data requires. You can use orchestration frameworks, which handle the management of resources, and automated infrastructure tools, which handle server installation and configuration. RightScale offers a commercial multi-cloud management platform that mitigates some of the problems of managing servers in the cloud.
Frameworks such as OpenStack and Eucalyptus aim to present a uniform interface to both private data centers and the public cloud. Attracting strong cross-industry support, OpenStack currently addresses computing resource (akin to Amazon's EC2) and storage (akin to Amazon's S3).
The race is on to make private clouds and IaaS services more usable: over the next two years using clouds should become much more straightforward as vendors adopt the nascent standards. There’ll be a uniform interface, whether you’re using public or private cloud facilities, or a hybrid of the two.
Particular to big data, several configuration tools already target Hadoop explicitly: among them Dell’s Crowbar, which aims to make deploying and configuring clusters simple, and Apache Whirr, which is specialized for running Hadoop services and other clustered data processing systems.
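To give a flavor of how lightweight this can be, a Whirr cluster definition is just a short properties file. The following sketch uses property names from Whirr's Hadoop recipe; the cluster name, node counts and hardware size are illustrative choices, and your credentials would come from your own environment:

```properties
# hadoop.properties -- a minimal Whirr recipe for a small Hadoop cluster on EC2
whirr.cluster-name=demo-hadoop
# One master (namenode + jobtracker) and three workers (datanode + tasktracker)
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,3 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credentials=${env:AWS_SECRET_ACCESS_KEY}
whirr.hardware-id=m1.large
```

Launching is then a single command, `whirr launch-cluster --config hadoop.properties`, with Whirr handling instance provisioning and Hadoop configuration.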
Today, using IaaS gives you a broad choice of cloud supplier, the option of using a private cloud, and complete control: but you’ll be responsible for deploying, managing and maintaining your clusters.
Using IaaS only brings you so far with big data applications: it handles the creation of computing and storage resources, but doesn't address anything at a higher level. The setup of Hadoop and Hive or a similar solution is down to you.
Beyond IaaS, several cloud services provide application layer support for big data work. Sometimes referred to as managed solutions, or platform as a service (PaaS), these services remove the need to configure or scale things such as databases or MapReduce, reducing your workload and maintenance burden. Additionally, PaaS providers can realize great efficiencies by hosting at the application level, and pass those savings on to the customer.
The general PaaS market is burgeoning, with major players including VMware (Cloud Foundry) and Salesforce (Heroku, force.com). As big data and machine learning requirements percolate through the industry, these players are likely to add their own big-data-specific services. For the purposes of this article, though, I will be sticking to the vendors who already have implemented big data solutions.
Today’s primary providers of such big data platform services are Amazon, Google and Microsoft. You can see their offerings summarized in the table toward the end of this article. Both Amazon Web Services and Microsoft’s Azure blur the lines between infrastructure as a service and platform: you can mix and match. By contrast, Google’s philosophy is to skip the notion of a server altogether, and focus only on the concept of the application. Among these, only Amazon can lay claim to extensive experience with their product.
Amazon Web Services
Amazon has significant experience in hosting big data processing. Use of Amazon EC2 for Hadoop was a popular and natural move for many early adopters of big data, thanks to Amazon's expandable supply of compute power. Building on this, Amazon launched Elastic MapReduce in 2009, providing a hosted, scalable Hadoop service.
Applications on Amazon’s platform can pick from the best of both the IaaS and PaaS worlds. General purpose EC2 servers host applications that can then access the appropriate special purpose managed solutions provided by Amazon.
As well as Elastic MapReduce, Amazon offers several other services relevant to big data, such as the Simple Queue Service for coordinating distributed computing, and a hosted relational database service. At the specialist end of big data, Amazon's High Performance Computing solutions are tuned for low-latency cluster computing, of the sort required by scientific and engineering applications.
Elastic MapReduce
Elastic MapReduce (EMR) can be programmed in the usual Hadoop ways, through Pig, Hive or other programming languages, and uses Amazon's S3 storage service to get data in and out.
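One of those "usual Hadoop ways" is a streaming job, where the mapper and reducer are ordinary scripts that read and write lines of text. A minimal word-count pair, sketched here in Python and simulated locally (the driver at the end stands in for Hadoop's shuffle-and-sort; on EMR the two functions would run as separate streaming scripts over S3 data):

```python
from itertools import groupby

def mapper(lines):
    # Emit (word, 1) pairs, as a streaming mapper would write to stdout
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    # Hadoop delivers mapper output sorted by key; sum the counts per word
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(n for _, n in group)

# Locally simulate the map -> shuffle/sort -> reduce pipeline
sample = ["big data in the cloud", "big data needs clusters"]
counts = dict(reducer(mapper(sample)))
```

The same mapper and reducer logic scales from this two-line sample to terabytes, which is precisely the appeal of handing the orchestration to a managed service.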
Access to Elastic MapReduce is through Amazon's SDKs and tools, or with GUI analytical and IDE products such as those offered by Karmasphere. In conjunction with these tools, EMR represents a strong option for experimental and analytical work. Amazon's pricing makes EMR a much more attractive option than configuring EC2 instances yourself to run Hadoop.
When integrating Hadoop with applications generating structured data, using S3 as the main data source can be unwieldy. This is because, similar to Hadoop's HDFS, S3 works at the level of storing blobs of opaque data. Hadoop's answer to this is HBase, a NoSQL database that integrates with the rest of the Hadoop stack. Unfortunately, Amazon does not currently offer HBase with Elastic MapReduce.
Instead of HBase, Amazon provides DynamoDB, its own managed, scalable NoSQL database. As this is a managed solution, it represents a better choice than running your own database on top of EC2, in terms of both performance and economy.
DynamoDB data can be exported to and imported from S3, providing interoperability with EMR.
Google

Google’s cloud platform stands out as distinct from its competitors. Rather than offering virtualization, it provides an application container with defined APIs and services. Developers do not need to concern themselves with the concept of machines: applications execute in the cloud, getting access to as much processing power as they need, within defined resource usage limits.
To use Google’s platform, you must work within the constraints of its APIs. However, if that fits, you can reap the benefits of the security, tuning and performance improvements inherent to the way Google develops all its services.
AppEngine, Google’s cloud application hosting service, offers a MapReduce facility for parallel computation over data, but this is more of a feature for use as part of complex applications rather than for analytical purposes. Instead, BigQuery and the Prediction API form the core of Google’s big data offering, respectively offering analysis and machine learning facilities. Both these services are available exclusively via REST APIs, consistent with Google’s vision for web-based computing.
BigQuery is an analytical database, suitable for interactive analysis over datasets of the order of 1TB. It works best on a small number of tables with a large number of rows. BigQuery offers a familiar SQL interface to its data. In that, it is comparable to Apache Hive, but the typical performance is faster, making BigQuery a good choice for exploratory data analysis.
Getting data into BigQuery is a matter of directly uploading it, or importing it from Google’s Cloud Storage system. This is the aspect of BigQuery with the biggest room for improvement. Whereas Amazon’s S3 lets you mail in disks for import, Google doesn’t currently have this facility. Streaming data into BigQuery isn’t viable either, so regular imports are required for constantly updating data. Finally, as BigQuery only accepts data formatted as comma-separated value (CSV) files, you will need to use external methods to clean up the data beforehand.
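Because imports must be CSV, any cleanup has to happen before upload. As a hedged sketch (the record fields here are invented for illustration), flattening records into import-ready CSV can be done with Python's standard library, which takes care of quoting embedded commas that would otherwise corrupt the import:

```python
import csv
import io

# Hypothetical source records; BigQuery import needs flat CSV rows
records = [
    {"user": "alice", "event": "login", "year": 2012},
    {"user": "bob", "event": "purchase, gift", "year": 2012},
]

buf = io.StringIO()
writer = csv.writer(buf)
for rec in records:
    # csv.writer quotes fields containing commas, keeping the row structure intact
    writer.writerow([rec["user"], rec["event"], rec["year"]])

csv_data = buf.getvalue()  # ready to upload or stage in Cloud Storage
```

For constantly updating data, a scheduled job producing batches of such files is the practical workaround until streaming import arrives.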
Rather than provide end-user interfaces itself, Google wants an ecosystem to grow around BigQuery, with vendors incorporating it into their products, in the same way Elastic MapReduce has acquired tool integration. Currently in a beta test to which anybody can apply, BigQuery is expected to be publicly available during 2012.
Many uses of machine learning are well defined, such as classification, sentiment analysis, or recommendation generation. To meet these needs, Google offers its Prediction API product.
Applications using the Prediction API work by creating and training a model hosted within Google’s system. Once trained, this model can be used to make predictions, such as spam detection. Google is working on allowing these models to be shared, optionally with a fee. This will let you take advantage of previously trained models, which in many cases will save you the time and expertise needed for training.
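Since the Prediction API is exposed purely over REST, the train-then-predict workflow amounts to posting JSON documents. The sketch below only constructs the request bodies; the field names (`storageDataLocation`, `csvInstance`) and the model name are assumptions based on the REST interface, and no request is actually sent:

```python
import json

# Training request: point the service at labeled CSV data staged in
# cloud storage (bucket path and model id are hypothetical)
train_request = {
    "id": "spam-detector",
    "storageDataLocation": "mybucket/training.csv",  # rows of label,features
}

# Prediction request: one unlabeled instance in the same column order
predict_request = {
    "input": {"csvInstance": ["win a free prize now"]},
}

train_body = json.dumps(train_request)
predict_body = json.dumps(predict_request)
```

The response to a prediction call carries the predicted label (for example, spam or not-spam), which the application consumes like any other REST result.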
Though promising, Google’s offerings are in their early days. Further integration between its services is required, as well as time for ecosystem development to make their tools more approachable.
Microsoft

I have written in some detail about Microsoft’s big data strategy in Microsoft’s plan for Hadoop and big data. By offering its data platforms on Windows Azure in addition to Windows Server, Microsoft’s aim is to make either on-premise or cloud-based deployments equally viable with its technology. Azure parallels Amazon’s web service offerings in many ways, offering a mix of IaaS services with managed applications such as SQL Server.
Hadoop is the central pillar of Microsoft’s big data approach, surrounded by the ecosystem of its own database and business intelligence tools. For organizations already invested in the Microsoft platform, Azure will represent the smoothest route for integrating big data into the operation. Azure itself is pragmatic about language choice, supporting technologies such as Java, PHP and Node.js in addition to Microsoft’s own.
As with Google’s BigQuery, Microsoft’s Hadoop solution is currently in closed beta test, and is expected to be generally available sometime in the middle of 2012.
Big data cloud platforms compared
The following table summarizes the data storage and analysis capabilities of Amazon, Google and Microsoft’s cloud platforms. Intentionally excluded are IaaS solutions without dedicated big data offerings.
| | Amazon Web Services | Google Cloud Services | Windows Azure |
| --- | --- | --- | --- |
| Big data storage | S3 | Cloud Storage | HDFS on Azure |
| Working storage | Elastic Block Store | AppEngine (Datastore, Blobstore) | Blob, table, queues |
| Relational database | Relational Database Service (MySQL or Oracle) | Cloud SQL (MySQL compatible) | SQL Azure |
| MapReduce service | Elastic MapReduce (Hadoop) | AppEngine (limited capacity) | Hadoop on Azure2 |
| Big data analytics | Elastic MapReduce (Hadoop interface3) | BigQuery2 (TB-scale, SQL interface) | Hadoop on Azure (Hadoop interface3) |
| Machine learning | Via Hadoop + Mahout on EMR or EC2 | Prediction API | Mahout with Hadoop |
| Streaming processing | Nothing prepackaged: use custom solution on EC2 | Prospective Search API4 | StreamInsight2 (“Project Austin”) |
| Data import | Network, physically ship drives | Network | Network |
| Data sources | Public Data Sets | A few sample datasets | Windows Azure Marketplace |
| Availability | Public production | Some services in private beta | Some services in private beta |
Cloud-based big data services offer considerable advantages in removing the overhead of configuring and tuning your own clusters, and in ensuring you pay only for what you use. The biggest issue is always going to be data locality, as it is slow and expensive to ship data. The most effective big data cloud solutions will be the ones where the data is also collected in the cloud. This is an incentive to investigate EC2, Azure or AppEngine as a primary application platform, and an indicator that PaaS competitors such as Cloud Foundry and Heroku will have to address big data as a priority.
It is early days yet for big data in the cloud, with only Amazon offering battle-tested solutions at this point. Cloud services themselves are at an early stage, and we will see both increasing standardization and innovation over the next two years.
However, the twin advantages of not having to worry about infrastructure and economies of scale mean it is well worth investigating cloud services for your big data needs, especially for an experimental or green-field project. Looking to the future, there’s no doubt that big data analytical capability will form an essential component of utility computing solutions.
1 In public beta.
2 In controlled beta test.
3 Hive and Pig compatible.
4 Experimental status.