How an enterprise begins its big data journey

An ETL offload solution addresses the challenges of data overload, rising costs, and the skills gap.


As the amount of data continues to double in size every two years, organizations are struggling more than ever before to manage, ingest, store, process, transform, and analyze massive data sets. It has become clear that getting started on the road to using data successfully can be a difficult task, especially with a growing number of new data sources, demands for fresher data, and the need for increased processing capacity. In order to advance operational efficiencies and drive business growth, however, organizations must address and overcome these challenges.

In recent years, many organizations have heavily invested in the development of enterprise data warehouses (EDW) to serve as the central data system for reporting, extract/transform/load (ETL) processes, and ways to take in data (data ingestion) from diverse databases and other sources both inside and outside the enterprise. Yet, as the volume, velocity, and variety of data continues to increase, already expensive and cumbersome EDWs are becoming overloaded with data. Furthermore, traditional ETL tools are unable to handle all the data being generated, creating bottlenecks in the EDW that result in major processing burdens.

As a result of this overload, organizations are now turning to open source tools like Hadoop as cost-effective solutions to offloading data warehouse processing functions from the EDW. While Hadoop can help organizations lower costs and increase efficiency by being used as a complement to data warehouse activities, most businesses still lack the skill sets required to deploy Hadoop. Read more…

Comment: 1

Understanding neural function and virtual reality

The O'Reilly Data Show Podcast: Poppy Crum explains that what matters is efficiency in identifying and emphasizing relevant data.


Like many data scientists, I’m excited about advances in large-scale machine learning, particularly recent success stories in computer vision and speech recognition. But I’m also cognizant of the fact that press coverage tends to inflate what current systems can do, and their similarities to how the brain works.

During the latest episode of the O’Reilly Data Show Podcast, I had a chance to speak with Poppy Crum, a neuroscientist who gave a well-received keynote at Strata + Hadoop World in San Jose. She leads a research group at Dolby Labs and teaches a popular course at Stanford on Neuroplasticity in Musical Gaming. I wanted to get her take on AI and virtual reality systems, and hear about her experience building a team of researchers from diverse disciplines.

Understanding neural function

While it can sometimes be nice to mimic nature, in the case of the brain, machine learning researchers recognize that understanding and identifying the essential neural processes is much more critical. A related example cited by machine learning researchers is flight: wing flapping and feathers aren’t critical, but an understanding of physics and aerodynamics is essential.

Crum and other neuroscience researchers express the same sentiment. She points out that a more meaningful goal should be to “extract and integrate relevant neural processing strategies when applicable, but also identify where there may be opportunities to be more efficient.”

The goal in technology shouldn’t be to build algorithms that mimic neural function. Rather, it’s to understand neural function. … The brain is basically, in many cases, a Rube Goldberg machine. We’ve got this limited set of evolutionary building blocks that we are able to use to get to a sort of very complex end state. We need to be able to extract when that’s relevant and integrate relevant neural processing strategies when it’s applicable. We also want to be able to identify that there are opportunities to be more efficient and more relevant. I think of it as table manners. You have to know all the rules before you can break them. That’s the big difference between being really cool or being a complete heathen. The same thing kind of exists in this area. How we get to the end state, we may be able to compromise, but we absolutely need to be thinking about what matters in neural function for perception. From my world, where we can’t compromise is on the output. I really feel like we need a lot more work in this area. Read more…

Comment: 1

How real-time analytics integrates with our connected world

The O'Reilly Podcast: Scott Jarr on how real-time analytics applications can unlock value and automate decision-making.


In this special-edition O’Reilly Podcast, O’Reilly’s Ben Lorica and VoltDB’s co-founder Scott Jarr discuss how VoltDB’s hybrid transaction, analytic system allows for real-time analytics and personalization of data across various industries.

Scaling transaction processing without losing the relational database

MIT’s Mike Stonebraker (VoltDB’s co-founder) wanted to scale traditional OLTP (online transaction processing) without losing performance. The project evolved and eventually commercialized as VoltDB around the time NoSQL systems introduced a paradigm shift to non-relational databases. Jarr describes how Stonebraker’s approach didn’t assume a relational database was a core issue:

To give you an old story, but it’s a good story, they took a traditional style OLTP database and they ran it in memory. What they found was that it was doing less than 10% of its effective workload in processing transactions. The rest was dealing with overhead in various forms. He said, ‘Without getting rid of any of the things that we know [are] involved in the database world — consistency, SQL, ACID transactions, relational structures, high-level query languages — let’s keep all that, but let’s see if we can make this thing go faster.’

When those [NoSQL] systems were coming out, and they were coming out very strong, it was around the same time we were coming out with VoltDB. People were asking questions, ‘Well you’re consistent and they’re not.’ Or, ‘You’re relational and they’re not.’ I think that really lost the true meaning of what the differences were … [let’s] not get mired in the details … let’s look at the workloads that people are trying to accomplish.

Read more…

Comment: 1

How trains are becoming data driven

Railways are at the intersection of Internet and industry.


Trains and public transport are, for many of us, a vital part of our daily lives. Large cities are particularly dependent on an efficient public transport system, and if disruption occurs, it usually affects many passengers while spreading across the transport network. But our requirements as passengers are growing and maturing. Safety is paramount, but we also care about timeliness, comfort, Internet access, and other amenities. With strong competition for regional and long-distance trains, providing an attractive service has become critical for many rail operators today.

The railway industry is an old industry. For the last 150 years, this industry was built around mechanical systems maintained throughout a lifetime of 30 years, mostly through reactive or preventive maintenance. But this is not enough anymore to deliver the type of service we all want and expect to experience.

Deriving insight from the data of trains

Over the last few years, the rail industry has been transforming itself, embracing IT, digitalization, big data, and the related changes in business models. This change is driven both by the railway operating companies demanding higher vehicle and infrastructure availability, and, increasingly, wanting to transition their operational risk to suppliers. In parallel, the thought leaders among maintenance providers have embraced the technology opportunities to radically improve their offerings and help their customers deliver better value. Read more…

Comment: 1

Big data, interactive access: How Apache Drill makes it easy

True SQL queries? Yes. Parquet and other complex data structures? Yes. Drill 1.1 is full of surprises.


Register for the free webcast “Easy, real-time access to data with Apache Drill,” which will be held Thursday, July 30, 2015, at 10 a.m. PT. This panel discussion will explore the major role SQL-on-Hadoop technologies play in organizations.

Big data techniques are becoming mainstream in an increasing number of businesses, but how do people get self-service, interactive access to their big data? And how do they do this without having to train their SQL-literate employees to be advanced developers?

One solution is to take advantage of the rapidly maturing open source, open community software tool known as Apache Drill. Drill is not the first SQL-on-Hadoop tool. It is, however, a new and very sophisticated highly scalable SQL query engine that has been built from the ground up to be appropriate for use even in production settings. Drill extends query capabilities to a variety of new data sources and formats without the requirement for IT intervention that might be expected from a SQL query engine. In short, Drill allows self-exploration of data by providing flexibility along with performance.

As capabilities in the big data world have progressed, our understanding of what is needed for high-performance, enterprise-grade architectures have also increased. A need for a SQL solution for the Hadoop and NoSQL space was recognized fairly early, and it’s not surprising that to meet an urgent need, some of the first tools approached the problem with SQL-like syntax and made compromises that led to limitations in the data sources and formats they could handle well. Read more…

Comment: 1

Big data, small cluster

Finding new ways to shrink disk space for storing partitionable data.


Register for the free webcast, “Extending Cassandra with Doradus OLAP for High Performance Analytics,” which will be held July 29 at 9 a.m. PT.

Engineers at Dell were developing customer apps when they found that the query response times their customers were demanding — something on the order of seconds (in other words, the need to scan millions of objects/second) — required a new type of query engine. This led them on a four-year journey to create Doradus, one of Dell Software Group’s first open-source projects.

Doradus is a server framework that runs on top of Cassandra. To build Doradus, the team borrowed from several well-accepted paradigms. They used traditional OLAP techniques to allow data to be arranged into static, multidimensional cubes. They leveraged the vertical orientation and efficient compression of columnar databases. And, from the NoSQL world, they employed sharding. The result: a storage and query engine called Doradus OLAP that stores data up to 1M objects/second/node, providing nearly real-time data warehousing. This architecture also allows for extreme compression of the data, sometimes producing up to a 99% reduction in space usage.

This extremely dense storage means that data that once took multiple nodes can now be stored on a single node, allowing for fast queries without the expense of a large cluster. Because Doradus is built on top of Cassandra, the option to scale out is still there. This allows for sharding and replication, and also takes advantage of Cassandra’s failover features. Read more…

Comment: 1