Much as CPUs arrange their caches in multiple levels, modern organizations direct their data into different data stores, on the principle that a small amount is needed for real-time decisions and the rest for long-range business decisions. This article looks at options for data storage, focusing on one that’s particularly appropriate for the “fast data” scenario described in a recent O’Reilly report.
Many organizations deal with data on at least three levels:
- They need data at their fingertips, rather like a reference book you leave on your desk. Organizations use such data for things like determining which ad to display on a web page, what kind of deal to offer a visitor to their website, or what email message to suppress as spam. They store such data in memory, often in key/value stores that allow fast lookups. Flash is a second layer (slower than memory, but much cheaper), as I described in a recent article. John Piekos, vice president of engineering at VoltDB, which makes an in-memory database, says that this type of data storage is used in situations where delays of just 20 or 30 milliseconds mean lost business.
- For business intelligence, these organizations use a traditional relational database or a more modern “big data” tool such as Hadoop or Spark. Although the use of a relational database for background processing is generally called online analytical processing (OLAP), it is nowhere near as online as the first level, where data is used within milliseconds for real-time decisions.
- Some data is archived with no immediate use in mind. It can be compressed and perhaps even stored on magnetic tape.
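The three levels can be pictured with a small sketch. It is an illustration only, not any vendor's design: the `TieredStore` class, its capacity limit, and the sample keys are all hypothetical, plain dictionaries stand in for the in-memory store and the analytic database, and a list of compressed blobs stands in for the archive.

```python
import json
import zlib


class TieredStore:
    """Hypothetical three-level store: hot memory, warehouse, compressed archive."""

    def __init__(self, hot_capacity):
        self.hot_capacity = hot_capacity
        self.hot = {}        # level 1: in-memory key/value lookups (dicts keep insertion order)
        self.warehouse = {}  # level 2: stand-in for a relational or "big data" store
        self.archive = []    # level 3: compressed blobs with no immediate use

    def put(self, key, value):
        self.hot.pop(key, None)          # re-inserting makes the key the newest entry
        self.hot[key] = value
        while len(self.hot) > self.hot_capacity:
            oldest = next(iter(self.hot))            # oldest entry in the hot level
            self.warehouse[oldest] = self.hot.pop(oldest)

    def get(self, key):
        """Check the fast level first, then fall back to the slower one."""
        if key in self.hot:
            return self.hot[key], "hot"
        if key in self.warehouse:
            return self.warehouse[key], "warehouse"
        return None, "miss"

    def archive_warehouse(self):
        """Compress everything in the warehouse into one archived blob."""
        if not self.warehouse:
            return 0
        # Values must be JSON-serializable for this toy archive format.
        blob = zlib.compress(json.dumps(self.warehouse).encode("utf-8"))
        self.archive.append(blob)
        self.warehouse.clear()
        return len(blob)


store = TieredStore(hot_capacity=2)
store.put("visitor:1", {"offer": "10% off"})
store.put("visitor:2", {"offer": "free shipping"})
store.put("visitor:3", {"offer": "none"})   # pushes visitor:1 down a level

recent = store.get("visitor:3")     # served from the hot level
older = store.get("visitor:1")      # served from the warehouse
store.archive_warehouse()
after_archive = store.get("visitor:1")   # no longer available for lookups
restored = json.loads(zlib.decompress(store.archive[0]))
print(recent, older, after_archive, restored)
```

In a real deployment each level would be a separate system with its own latency and cost; the sketch only shows how data moves downward as it ages.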
For the new fast data tier, where performance is critical, techniques such as materialized views further improve responsiveness. According to Piekos, materialized views bypass a certain amount of database processing to cut milliseconds off queries. A materialized view can be compared to a column in a spreadsheet that is calculated from other columns and is updated whenever the spreadsheet is updated. In a database, an SQL query defines the materialized view. As rows are inserted, deleted, or modified in the underlying table on which the view is based, the view's stored results are automatically updated. Naturally, users must decide in advance which computations are crucial to them and define the queries accordingly. The payoff is an effectively instant, up-to-date answer suitable for real-time decision-making.
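The spreadsheet analogy can be made concrete with a short sketch. This is not how VoltDB or any other database implements materialized views; it is a hypothetical in-memory illustration, assuming a view roughly equivalent to `SELECT channel, COUNT(*), SUM(seconds) FROM events GROUP BY channel`. The class, table, and column names are invented for the example.

```python
from collections import defaultdict


class ChannelStatsView:
    """Toy stand-in for a materialized view over a table of viewing events.

    The aggregates are adjusted on every write, so a read never scans the rows.
    """

    def __init__(self):
        self.rows = {}                  # underlying "table": row id -> (channel, seconds)
        self.count = defaultdict(int)   # view column: COUNT(*) per channel
        self.total = defaultdict(int)   # view column: SUM(seconds) per channel

    def insert(self, row_id, channel, seconds):
        if row_id in self.rows:
            raise KeyError(f"duplicate row id {row_id}")
        self.rows[row_id] = (channel, seconds)
        self.count[channel] += 1
        self.total[channel] += seconds

    def delete(self, row_id):
        channel, seconds = self.rows.pop(row_id)
        self.count[channel] -= 1
        self.total[channel] -= seconds
        if self.count[channel] == 0:    # drop groups that no longer have rows
            del self.count[channel]
            del self.total[channel]

    def update(self, row_id, channel, seconds):
        # Modeled as delete plus insert so both affected groups stay correct.
        self.delete(row_id)
        self.insert(row_id, channel, seconds)

    def read(self, channel):
        """Return (count, total seconds) from the precomputed results."""
        return self.count.get(channel, 0), self.total.get(channel, 0)


view = ChannelStatsView()
view.insert(1, "news", 30)
view.insert(2, "news", 45)
view.insert(3, "sports", 120)
view.update(2, "sports", 60)    # row 2 moves from news to sports
view.delete(1)                  # the last news row disappears
print(view.read("news"), view.read("sports"))   # (0, 0) (2, 180)
```

The trade-off matches the one described above: each write does a little extra work so that the query chosen in advance can be answered without recomputation.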
Some examples of this three-tier architecture cited by Piekos are:
- Ericsson, the well-known telecom company, puts out a set-top box for television viewers. These boxes collect a huge amount of information about channel changes and data transfers, potentially from hundreds of millions of viewers. Therefore, one of the challenges is just writing to the database at a rate that supports the volume of data they receive. They store the data in the cloud, where they count such things as latency, response time, and error rates. Data is then pushed to a slower, historical data store.
- Sakura, one of the largest Japanese ISPs, uses their database for protection against distributed denial-of-service (DDoS) attacks. This requires quick recognition of a spike in incoming network traffic, plus the ability to distinguish anomalies from regular network use. Every IP packet transmitted through Sakura is logged to a VoltDB database, but only for two or three hours, after which the traffic is discarded to make room for more. The result is much more subtle than traditional blacklists, which punish innocent network users who happen to share a block of IP addresses with the attacker.
- Flytxt, which analyzes telecom messages for communication service providers, extracts intelligence from four billion events per day, streaming from more than 200 million mobile subscribers. With this data, operators can make quick decisions, such as whether the customer’s balance covers the call, whether the customer is a minor whose call should be blocked through parental controls, and so forth. This requires complex SQL queries, which a materialized view can answer in the short time required.
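The Sakura example combines two ideas: keeping packet records only for a short retention window, and spotting a spike from a particular source inside that window. The sketch below is an assumption-laden illustration, not Sakura's or VoltDB's actual logic. The `TrafficWindow` class, the 10-second window, and the threshold of 3 packets are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import Counter, deque


class TrafficWindow:
    """Keeps only recent packet records and flags sources that spike."""

    def __init__(self, retention_seconds, spike_threshold):
        self.retention = retention_seconds
        self.threshold = spike_threshold
        self.packets = deque()        # (timestamp, source_ip), oldest first
        self.per_source = Counter()   # packets per source inside the window

    def _expire(self, now):
        # Discard records older than the retention window to make room for more.
        while self.packets and self.packets[0][0] <= now - self.retention:
            _, src = self.packets.popleft()
            self.per_source[src] -= 1
            if self.per_source[src] == 0:
                del self.per_source[src]

    def record(self, timestamp, source_ip):
        """Log one packet; return True if this source now exceeds the threshold."""
        self._expire(timestamp)
        self.packets.append((timestamp, source_ip))
        self.per_source[source_ip] += 1
        return self.per_source[source_ip] > self.threshold

    def suspects(self, now):
        """Sources whose packet count in the current window exceeds the threshold."""
        self._expire(now)
        return sorted(ip for ip, n in self.per_source.items() if n > self.threshold)


window = TrafficWindow(retention_seconds=10, spike_threshold=3)
for t in range(5):
    window.record(t, "203.0.113.7")      # five packets from one address
window.record(4, "203.0.113.8")          # a neighbor in the same block, one packet

during = window.suspects(now=5)          # only the noisy address is flagged
later = window.suspects(now=20)          # everything has aged out of the window
print(during, later)                     # ['203.0.113.7'] []
```

Because counts are kept per address rather than per block, the quiet neighbor is never flagged, which is the contrast with block-level blacklists drawn above.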
The choices for data storage these days are nearly overwhelming. There are currently no clear winners; each option has value in particular situations. Therefore, we should not be surprised that users are taking advantage of two or more solutions at once, each for what it does best.
This post is part of a collaboration between O’Reilly and VoltDB exploring fast and big data. See our statement of editorial independence.