"data storage" entries

Fast data calls for new ways to manage its flow

Examples of three-tier data-processing architectures.

Like CPU caches, which tend to be arranged in multiple levels, modern organizations direct their data into different stores on the principle that a small amount is needed for real-time decisions and the rest for long-range business decisions. This article looks at options for data storage, focusing on one that’s particularly appropriate for the “fast data” scenario described in a recent O’Reilly report.

Many organizations deal with data on at least three levels:

  1. They need data at their fingertips, rather like a reference book you leave on your desk. Organizations use such data for things like determining which ad to display on a web page, what kind of deal to offer a visitor to their website, or which email message to suppress as spam. They store such data in memory, often in key/value stores that allow fast lookups (see the sketch after this list). Flash is a second layer (slower than memory, but much cheaper), as I described in a recent article. John Piekos, vice president of engineering at VoltDB, which makes an in-memory database, says this type of data storage is used in situations where delays of just 20 or 30 milliseconds mean lost business.
  2. For business intelligence, these organizations use a traditional relational database or a more modern “big data” tool such as Hadoop or Spark. Although the use of a relational database for background processing is generally called online analytic processing (OLAP), it is nowhere near as “online” as the first tier, where data must be ready within milliseconds for real-time decisions.
  3. Some data is archived with no immediate use in mind. It can be compressed and perhaps even stored on magnetic tape.
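
As a minimal sketch of that first tier, consider the following Python snippet. A plain dict stands in for an in-memory key/value store; the profiles, key names, and 30-millisecond budget are invented for illustration, not drawn from any particular product.

```python
import time

# First-tier pattern: an in-memory key/value lookup used to make a
# per-request decision within a tight latency budget.

user_profiles = {
    "user:1001": {"segment": "frequent_buyer", "spam_score": 0.02},
    "user:1002": {"segment": "new_visitor", "spam_score": 0.91},
}

def choose_ad(user_id: str, budget_ms: float = 30.0) -> str:
    """Pick an ad using only data already held in memory."""
    start = time.perf_counter()
    profile = user_profiles.get(user_id)  # O(1) in-memory lookup
    ad = ("discount_banner"
          if profile and profile["segment"] == "frequent_buyer"
          else "generic_banner")
    elapsed_ms = (time.perf_counter() - start) * 1000
    # If the lookup ever blew the budget, serve a safe default
    # rather than holding up the page.
    return ad if elapsed_ms <= budget_ms else "generic_banner"

print(choose_ad("user:1001"))  # discount_banner
```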

For the new fast data tier, where performance is critical, techniques such as materialized views further improve responsiveness. According to Piekos, materialized views bypass a certain amount of database processing to cut milliseconds off queries.
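
Piekos’ point can be illustrated outside any particular database. The sketch below uses plain Python standing in for SQL, with invented table, view, and column names: the aggregate is maintained at write time, so a read becomes a single lookup instead of a scan over the base table.

```python
from collections import defaultdict

orders = []                              # the base "table"
revenue_by_region = defaultdict(float)   # the "materialized view"

def insert_order(region: str, amount: float) -> None:
    orders.append((region, amount))
    revenue_by_region[region] += amount  # view updated on every write

def revenue_without_view(region: str) -> float:
    # Without the view, each query scans every row.
    return sum(amt for reg, amt in orders if reg == region)

def revenue_with_view(region: str) -> float:
    # With the view, each query is an O(1) lookup.
    return revenue_by_region[region]

insert_order("emea", 120.0)
insert_order("emea", 80.0)
assert revenue_without_view("emea") == revenue_with_view("emea") == 200.0
```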

Four data themes to watch from Strata + Hadoop World 2012

In-memory data storage, SQL, data preparation and asking the right questions all emerged as key trends at Strata + Hadoop World.

At our successful Strata + Hadoop World conference (including successfully avoiding Sandy), a few themes emerged that resonated with my interests and experience as a hands-on data analyst and as a researcher who tracks technology adoption trends. Keep in mind that these themes reflect my personal biases. Others will have a different take on their own key takeaways from the conference.

1. In-memory data storage for faster queries and visualization

Interactive or real-time query over large datasets is seen as a key to analyst productivity (real-time meaning query times fast enough to keep the user in the flow of analysis, from sub-second to a few minutes). Existing large-scale data management systems aren’t fast enough: analytical effectiveness suffers when users can’t explore the data by quickly iterating through queries. We see companies with large data stores building out their own in-memory tools, e.g., Dremel at Google, Druid at Metamarkets, and Sting at Netflix, alongside new tools such as Cloudera’s Impala, announced at the conference, UC Berkeley AMPLab’s Spark, SAP HANA, and Platfora.

We saw this coming a few years ago, when analysts we pay attention to started building their own in-memory sandboxes, often in key/value data management tools like Redis, to make sense of new, large-scale data stores. I know from my own work that there’s no better way to explore a new or unstructured dataset than to fire off a series of iterative queries, each informed by the last.
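
Here is a minimal sketch of that sandbox workflow, assuming the redis-py client and a Redis server on localhost; the event records and key names are invented for illustration. Load a slice of a larger dataset into Redis, then iterate on cheap queries, each informed by the last.

```python
import redis  # assumes the redis-py client and a local Redis server

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

events = [
    {"user": "alice", "page": "/pricing"},
    {"user": "bob", "page": "/pricing"},
    {"user": "alice", "page": "/docs"},
]

# First pass: count page views in a sorted set.
for e in events:
    r.zincrby("views_by_page", 1, e["page"])

# Query 1: which pages are hottest?
print(r.zrevrange("views_by_page", 0, 4, withscores=True))

# The answer suggests a follow-up question, so add another index and
# query again, without rebuilding any pipeline.
for e in events:
    r.sadd(f"visitors:{e['page']}", e["user"])

# Query 2: how many distinct users hit the top page?
print(r.scard("visitors:/pricing"))
```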

BuzzData: Come for the data, stay for the community

A Canadian startup aspires to be the GitHub of datasets.

BuzzData looks to tap the gravitational pull of data, then keep people around through conversation and collaboration.

Strata Week: What happens when 200,000 hard drives work together?

IBM is building a massive 120-petabyte array and Infochimps releases a unified geo schema.

IBM takes data storage to a whole new level (120 petabytes, to be exact), Infochimps' new API tries to make life easier for geo developers, and the "Internet of people" keeps an eye on Hurricane Irene.

Real-time data needs to power the business side, not just tech

Theo Schlossnagle on the state of real-time data analysis and where it needs to go.

Real-time data analysis has come a long way, but Theo Schlossnagle, principal and CEO of OmniTI, says some technology improvements are actually causing a data analysis devolution.

The truth about data: Once it’s out there, it’s hard to control

Jeff Jonas on data ownership, security concerns, and privacy trade-offs.

In a recent interview, Jeff Jonas, IBM distinguished engineer and chief scientist at IBM Entity Analytics, discussed the willingness of consumers to give away their data and the issues around data replication.

Data integration services combine storage and analysis tools

Companies are looking to help business clients store and analyze data.

IBM Netezza and Revolution R Enterprise announced a new partnership, which, together with recent moves by Microsoft and HP, signals a growing realization that integrating data storage and analysis provides a better client experience.