How an enterprise begins its big data journey

As the amount of data continues to double in size every two years, organizations are struggling more than ever before to manage, ingest, store, process, transform, and analyze massive data sets. It has become clear that getting started on the road to using data successfully can be a difficult task, especially with a growing number of new data sources, demands for fresher data, and the need for increased processing capacity. In order to advance operational efficiencies and drive business growth, however, organizations must address and overcome these challenges.

In recent years, many organizations have invested heavily in enterprise data warehouses (EDWs) to serve as the central data system for reporting, extract/transform/load (ETL) processes, and data ingestion from diverse databases and other sources both inside and outside the enterprise. Yet as the volume, velocity, and variety of data continue to increase, already expensive and cumbersome EDWs are becoming overloaded. Furthermore, traditional ETL tools are unable to handle all the data being generated, creating bottlenecks in the EDW that result in major processing burdens.

As a result of this overload, organizations are now turning to open source tools like Hadoop as a cost-effective way to offload data warehouse processing functions from the EDW. While Hadoop can help organizations lower costs and increase efficiency when used as a complement to the data warehouse, most businesses still lack the skill sets required to deploy it.

Where to begin?

Organizations with overburdened EDWs need solutions that can offload the heavy lifting of ETL processing from the data warehouse to an alternative environment capable of managing today’s data sets. The first question is always, “How can this be done in a simple, cost-effective manner that doesn’t require specialized skill sets?”
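For readers less familiar with the workload in question, the following is a minimal sketch of what an ETL job does, written in plain Python. Everything in it is an assumption made for illustration: the sample records, the `orders` table, and the function names are hypothetical, and an in-memory SQLite database stands in for the data warehouse. It is not drawn from any product discussed in this article.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system (illustrative data only).
RAW_CSV = """order_id,region,amount
1, East ,100.0
2,west,250.5
3,EAST,not_a_number
4,North,10
"""


def extract(text):
    """Extract: parse the raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize region names and drop rows with a bad amount."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip records whose amount cannot be parsed
        region = row["region"].strip().lower()
        cleaned.append((int(row["order_id"]), region, amount))
    return cleaned


def load(conn, rows):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(text):
    """Run all three steps against an in-memory stand-in for the warehouse."""
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract(text)))
    return conn


if __name__ == "__main__":
    connection = run_etl(RAW_CSV)
    query = "SELECT order_id, region, amount FROM orders ORDER BY order_id"
    for record in connection.execute(query):
        print(record)
```

At this scale the transform step is trivial. The bottleneck described above appears when the same parsing and cleansing logic has to run over very large inputs inside the warehouse itself, which is the work an offload approach moves elsewhere.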

Let’s start with Hadoop. As previously mentioned, many organizations deploy Hadoop to offload their data warehouse processing functions. After all, Hadoop is a cost-effective, highly scalable platform that can store large volumes of structured, semi-structured, and unstructured data. Hadoop can also help accelerate the ETL process while significantly reducing costs compared with running ETL jobs in a traditional data warehouse. However, while the benefits of Hadoop are appealing, the complexity of the platform continues to hinder adoption at many organizations. It has been our goal to find a better solution.

Using tools to offload ETL workloads

One option for solving this problem comes from a combined effort between Dell, Intel, Cloudera, and Syncsort. Together they have developed a pre-configured offloading solution that enables businesses to capitalize on the technical and cost advantages offered by Hadoop. It is an ETL offload solution that delivers a use-case-driven Hadoop Reference Architecture to augment the traditional EDW. Customers can offload ETL workloads to Hadoop, which increases performance and optimizes EDW utilization by freeing up cycles for analysis in the EDW.

The new solution combines the Hadoop distribution from Cloudera with a framework and tool set for ETL offload from Syncsort. These technologies are powered by Dell networking components and Dell PowerEdge R series servers with Intel Xeon processors.

The technology behind the ETL offload solution simplifies data processing by providing an architecture to help users optimize an existing data warehouse. So, how does the technology behind all of this actually work?

The ETL offload solution provides the Hadoop environment through Cloudera Enterprise software. The Cloudera Distribution of Hadoop (CDH) delivers the core elements of Hadoop, such as scalable storage and distributed computing. Together with the software from Syncsort, it allows users to reduce Hadoop deployment to weeks, develop Hadoop ETL jobs in a matter of hours, and become fully productive in days. Additionally, CDH provides security, high availability, and integration with the large set of ecosystem tools.

Syncsort DMX-h software is a key component of the solution, or reference architecture (RA). Designed from the ground up to run efficiently in Hadoop, Syncsort DMX-h removes barriers to mainstream Hadoop adoption by delivering an end-to-end approach for shifting heavy ETL workloads into Hadoop, and it provides the connectivity required to build an enterprise data hub. For even tighter integration and accessibility, DMX-h has monitoring capabilities integrated directly into Cloudera Manager.

With Syncsort DMX-h, organizations no longer need MapReduce skills or mountains of hand-written code to take advantage of Hadoop. This is made possible through intelligent execution, which allows users to design data transformations graphically and focus on business rules rather than underlying platforms or execution frameworks. Furthermore, users no longer have to make application changes to deploy the same data flows on or off Hadoop, on premises, or in the cloud. This future-proofing concept provides a consistent user experience during the process of collecting, blending, transforming, and distributing data.
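To give a sense of the hand-written code that a graphical tool aims to spare users, here is a minimal sketch of the map, shuffle, and reduce pattern for one simple aggregation (total sales per region). It is an illustration in plain Python only: it does not use the Hadoop or DMX-h APIs, and the record layout, sample data, and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical input: one "region,amount" sales record per line.
RECORDS = [
    "east,100.0",
    "west,250.5",
    "east,40.0",
    "north,10.0",
    "west,9.5",
]


def map_phase(lines):
    """Map: emit one (region, amount) pair per input record."""
    for line in lines:
        region, amount = line.strip().split(",")
        yield region, float(amount)


def shuffle_phase(pairs):
    """Shuffle: group every emitted value under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped):
    """Reduce: collapse each key's values into a single total."""
    return {key: sum(values) for key, values in sorted(grouped.items())}


def total_sales_by_region(lines):
    """Chain the three phases, as a MapReduce framework would."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    for region, total in total_sales_by_region(RECORDS).items():
        print(f"{region}\t{total:.2f}")
```

Even this toy version needs three functions for what a single SQL `GROUP BY` expresses. In a real cluster the framework distributes the map and reduce steps across nodes and performs the shuffle itself, but the developer still has to express each transformation in this style.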

Additionally, Syncsort has developed SILQ, a tool that facilitates understanding, documenting, and converting massive amounts of SQL code to Hadoop. SILQ takes an SQL script as an input and provides a detailed flow chart of the entire data stream, mitigating the need for specialized skills and greatly accelerating the process, thereby removing another roadblock to offloading the data warehouse into Hadoop.

On the hardware side, Dell PowerEdge R730 servers are used for infrastructure nodes, and Dell PowerEdge R730xd servers are used for data nodes.

The path forward

Offloading massive data sets from an EDW can seem like a major barrier to organizations looking for more effective ways to manage their ever-increasing data sets. Fortunately, businesses can now capitalize on ETL offload opportunities with the correct software and hardware required to shift expensive workloads and associated data from overloaded enterprise data warehouses to Hadoop.

By picking the right tools, organizations can make better use of existing EDW investments by reducing the costs and resource requirements for ETL.

This post is part of a collaboration between O’Reilly, Dell, and Intel. See our statement of editorial independence.

Cropped public domain image on article and category pages via Wikimedia Commons.

An ETL offload solution addresses the challenges of data overload, rising costs, and the skills gap.