Andy Oram

Andy Oram is an editor at O'Reilly Media. An employee of the company since 1992, Andy currently specializes in open source technologies and software engineering. His work for O'Reilly includes the first books on Linux ever released by a U.S. publisher, the 2001 title Peer-to-Peer, and the 2007 best-seller Beautiful Code.

The security infusion

Building access policies into data stores.

Hadoop jobs face the same security demands as other programming tasks: corporate and regulatory requirements create complex rules about who may access which fields in a data set; sensitive fields must be protected from internal users as well as external threats; and multiple applications running on the same data must treat different users with different access rights. The modern world of virtualization and containers adds security at the software level, but tears away the hardware protection formerly offered by network segments, firewalls, and DMZs.

Furthermore, security involves more than saying yes or no to a user running a Hadoop job. There are rules for archiving or backing up data on the one hand, and expiring or deleting it on the other. Audit logs are a must, both to track down possible breaches and to conform to regulation.

Best practices for managing data in these complex, sensitive environments implement the well-known principle of security by design. According to this principle, you can’t design a database or application in a totally open manner and then layer security on top if you expect it to be robust. Instead, security must be infused throughout the system and built in from the start. Defense in depth is a related principle that urges the use of many layers of security, so that an intruder breaking through one layer may be frustrated by the next. Read more…
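
To make the field-level rules described above concrete, here is a minimal sketch of access control applied inside the data layer, before results ever reach an application. The roles, field names, and policy table are hypothetical examples, not taken from the post.

```python
# Minimal sketch of field-level access control applied to query results.
# The roles, fields, and policy below are hypothetical examples.

POLICY = {
    "analyst": {"customer_id", "region", "purchase_total"},
    "support": {"customer_id", "email"},
    "auditor": {"customer_id", "region", "purchase_total", "email", "ssn"},
}

def filter_fields(rows, role):
    """Return only the fields this role is allowed to see."""
    allowed = POLICY.get(role, set())
    return [{k: v for k, v in row.items() if k in allowed} for row in rows]

rows = [{"customer_id": 17, "region": "NE", "purchase_total": 120.0,
         "email": "a@example.com", "ssn": "000-00-0000"}]
print(filter_fields(rows, "analyst"))   # ssn and email are stripped
```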

Squaring big data with database queries

Integrating open source tools into a data warehouse has its advantages.

Although next-gen big data tools such as Hadoop, Spark, and MongoDB are finding more and more uses, most organizations need to maintain data in traditional relational stores as well. Deriving the benefits of both key/value stores and relational databases takes a lot of juggling. Three basic strategies are currently in use.

  • Double up on your data storage. Log everything in your fast key/value repository and duplicate part of it (or perform some reductions and store the results) in your relational data warehouse.
  • Store data primarily in a relational data warehouse, and use extract, transform, and load (ETL) tools to make it available for analytics. These tools run a fine-toothed comb through data to perform string manipulation, remove outlier values, etc. and produce a data set in the format required by data processing tools.
  • Put each type of data into the repository best suited to it (relational, Hadoop, etc.), but run queries between the repositories and return results from one repository to another for post-processing.

The appeal of the first is its simplicity: it runs well-understood systems in parallel. The second gives business users the familiarity of a relational database to work with. This article focuses on the third solution, which has advantages over the others: it avoids the redundancy of the first solution and is much easier to design and maintain than the second. I’ll describe how it is accomplished by Teradata, through its appliances and cloud solutions, but the building blocks are standard, open source tools such as Hive and HCatalog, so this strategy can be implemented by anyone. Read more…
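
As a rough sketch of the third strategy, the code below queries Hadoop-resident data through Hive (using the PyHive client, and assuming a reachable HiveServer2 instance) and post-processes the result against a local relational store. The host, table, and column names are invented, and this is a simplification rather than Teradata's actual mechanism.

```python
# Sketch: query Hadoop data through Hive, then post-process the result
# in a relational store. Host, tables, and columns are hypothetical.
import sqlite3
from pyhive import hive          # assumes a reachable HiveServer2 instance

hive_conn = hive.connect(host="hive.example.com", port=10000)
cur = hive_conn.cursor()
cur.execute("SELECT customer_id, SUM(amount) FROM web_events GROUP BY customer_id")
totals = cur.fetchall()          # [(customer_id, total), ...]

warehouse = sqlite3.connect("warehouse.db")
warehouse.execute("CREATE TABLE IF NOT EXISTS event_totals (customer_id INTEGER, total REAL)")
warehouse.executemany("INSERT INTO event_totals VALUES (?, ?)", totals)

# Join the Hadoop-derived totals with a customers table assumed to already
# exist in the warehouse, and do the final post-processing there.
for row in warehouse.execute(
        "SELECT c.name, t.total FROM customers c JOIN event_totals t USING (customer_id)"):
    print(row)
```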

Entities move data from tables to knowledge

Cleaning and combining fields can turn messy data into actionable insight.

We often talk in business and computing about moving from “raw data” to “knowledge,” hoping to take useful actions based on the data our organization has collected over time. Before you can view trends in your data or do other analytics, you need tools for data cleaning and for combining multiple data sources into meaningful collections of information, known as entities. An entity may be a customer, a product, a point of sale, an incident being investigated by the police, or anything else around which you want to build meaningful context.

In this post, we’ll explore some of the complexities in real-life data that create headaches — and how analytical software can help users prepare data for sophisticated queries and visualizations. Read more…
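
As a minimal illustration of what combining multiple data sources into an entity can look like, the sketch below merges profile rows and transaction rows, keyed by differently named identifier columns, into a single customer entity. All field names and values are hypothetical.

```python
# Sketch: combine rows from two hypothetical sources into one "customer" entity.
from collections import defaultdict

crm_rows = [{"customer_id": 7, "name": "Dana Li", "email": "dana@example.com"}]
sales_rows = [{"cust": 7, "order_id": "A-100", "amount": 59.90},
              {"cust": 7, "order_id": "A-101", "amount": 12.50}]

entities = defaultdict(lambda: {"orders": []})
for row in crm_rows:                      # profile fields from the CRM
    entities[row["customer_id"]].update(name=row["name"], email=row["email"])
for row in sales_rows:                    # transactions keyed by a differently named column
    entities[row["cust"]]["orders"].append((row["order_id"], row["amount"]))

print(dict(entities)[7])    # one entity carrying context from both sources
```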

What the IoT can learn from the health care industry

Federated authentication and authorization could provide security solutions for the Internet of Things.

Adrian Gropper co-authored this post.

After a short period of excitement and rosy prospects in the movement we’ve come to call the Internet of Things (IoT), designers are coming to realize that it will survive or implode around the twin issues of security and user control: a few electrical failures could scare people away for decades, while a nagging sense that someone is exploiting our data without our consent could sour our enthusiasm. Early indicators already point to a heightened level of scrutiny — Senator Ed Markey’s office, for example, recently put the automobile industry under the microscope for computer and network security.

In this context, what can the IoT draw from well-established technologies for federated trust? Federated trust, in technologies as diverse as Kerberos and SAML, has allowed large groups of users to collaborate securely without ever having to share passwords with people they don’t trust. OpenID was probably the first truly mass-market application of federated trust.

OpenID and OAuth, which have proven their value on the Web, have an equally vital role in the exchange of data in health care. This task — often cast as the interoperability of electronic health records — can reasonably be described as the primary challenge facing the health care industry today, at least in the IT space. Reformers across the health care industry (and even Congress) have pressured the federal government to make data exchange the top priority, and the Office of the National Coordinator for Health Information Technology has declared it the centerpiece of upcoming regulations. Read more…
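
For readers unfamiliar with the mechanics, here is a bare-bones sketch of federated authorization in the OAuth 2.0 style: a client exchanges its credentials for a short-lived token and presents the token, never a password, to the data holder. The endpoints, credentials, and scope are placeholders, not a real health care API.

```python
# Sketch of an OAuth 2.0 client-credentials token request; the endpoint,
# client ID, secret, and scope are placeholders, not a real service.
import requests

TOKEN_URL = "https://auth.example.org/oauth2/token"

resp = requests.post(
    TOKEN_URL,
    data={"grant_type": "client_credentials", "scope": "patient/observations.read"},
    auth=("my-client-id", "my-client-secret"),   # HTTP Basic client authentication
    timeout=10,
)
resp.raise_for_status()
access_token = resp.json()["access_token"]

# The token, not a password, is what accompanies later API calls.
records = requests.get(
    "https://ehr.example.org/api/observations",
    headers={"Authorization": f"Bearer {access_token}"},
    timeout=10,
)
```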

Marshal your data with entity resolution

Analytics can make combining or comparing data faster and less painful.

Entity resolution refers to processes that businesses and other organizations carry out all the time in order to produce full reports on people, organizations, or events. Entity resolution can be used, for instance, to:

  • Combine your customer data with a list purchased from a data broker. Identical data may appear in columns with different names, such as “last” and “surname.” Connecting columns from different databases is a common extract, transform, and load (ETL) task.
  • Extract values from one database and match them against one or more columns in another. For instance, if you get a party list, you might want to find your clients among the attendees. A police detective might want to extract the names of people involved in a crime report and see whether any suspects are among them.
  • Find a match in dirty data, such as a person whose name is spelled differently in different rows.

Dirty, inconsistent, or unstructured data is the chief challenge in entity resolution. Jenn Reed, director of product management for Novetta Entity Analytics, points out that it’s easy for two numbers to get switched, such as a person’s driver’s license and Social Security numbers. Over time, sophisticated rules have been created to compare data, and confirming that a match is correct often requires comparing several fields. (For instance, health information exchanges use up to 17 different types of data to make sure the Marcia Marquez who just got admitted to the ER is the same Marcia Marquez who visited her doctor last week.) Read more…
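
The sketch below shows the flavor of such rules: it aligns columns with different names and requires agreement on several fields, with some tolerance for spelling variation, before declaring a match. The similarity threshold and field names are illustrative only, not Novetta's algorithm.

```python
# Sketch: match records whose columns have different names and whose values
# are spelled inconsistently. This is a toy similarity rule, not Novetta's.
from difflib import SequenceMatcher

customers = [{"first": "Marcia", "surname": "Marquez", "dob": "1980-04-02"}]
attendees = [{"given": "Marcia", "last": "Marques", "birth": "1980-04-02"}]

def similar(a, b, threshold=0.85):
    """Fuzzy string comparison; tolerates small spelling differences."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

for c in customers:
    for a in attendees:
        # Require agreement on several fields before declaring a match.
        if (similar(c["first"], a["given"])
                and similar(c["surname"], a["last"])
                and c["dob"] == a["birth"]):
            print("Probable match:", c, a)
```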

How browsers get to know you in milliseconds

Behind the scenes of a real-time ad auction on the web.

A small technological marvel occurs on almost every visit to a web page. In the fraction of a second that elapses between the user’s click and the display of the page, an ad auction takes place in which hundreds of bidders gather whatever information they can get on the user, determine which ads are likely to be of interest, place bids, and transmit the winning ad to be placed in the page.

How can all that happen in approximately 100 milliseconds? Let’s explore the timeline and find out what goes on behind the scenes in a modern ad auction. Most of the information I have comes from two companies that handle different stages of the auction: the ad exchange AppNexus and the demand side platform Yashi. Both store critical data in an Aerospike database running on flash to achieve sub-second speeds.

Read more…
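
As a toy model of the time budget involved, the sketch below collects bids concurrently, discards any that miss a roughly 100-millisecond deadline, and picks the highest price. Real exchanges such as AppNexus are far more elaborate; the bidders, prices, and latencies here are simulated.

```python
# Sketch: gather bids concurrently under a hard time budget and pick a winner.
# A real exchange is far more elaborate; bidders and prices here are invented.
import asyncio
import random

async def request_bid(bidder: str) -> tuple[float, str]:
    await asyncio.sleep(random.uniform(0.01, 0.15))   # simulated network latency
    return random.uniform(0.10, 2.00), bidder         # (price in dollars, bidder)

async def run_auction(bidders, budget=0.100):          # ~100 ms budget
    tasks = [asyncio.create_task(request_bid(b)) for b in bidders]
    done, pending = await asyncio.wait(tasks, timeout=budget)
    for task in pending:                               # late bids are discarded
        task.cancel()
    bids = [t.result() for t in done]
    return max(bids) if bids else None

winner = asyncio.run(run_auction(["bidder-a", "bidder-b", "bidder-c"]))
print("Winning bid:", winner)
```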

Fast data calls for new ways to manage its flow

Examples of multi-layer, three-tier data-processing architectures.

Much as CPU caches tend to be arranged in multiple levels, modern organizations direct their data into different data stores, under the principle that a small amount is needed for real-time decisions and the rest for long-range business decisions. This article looks at options for data storage, focusing on one that’s particularly appropriate for the “fast data” scenario described in a recent O’Reilly report.

Many organizations deal with data on at least three levels:

  1. They need data at their fingertips, rather like a reference book you leave on your desk. Organizations use such data for things like determining which ad to display on a web page, what kind of deal to offer a visitor to their website, or what email message to suppress as spam. They store such data in memory, often in key/value stores that allow fast lookups. Flash is a second layer (slower than memory, but much cheaper), as I described in a recent article. John Piekos, vice president of engineering at VoltDB, which makes an in-memory database, says that this type of data storage is used in situations where delays of just 20 or 30 milliseconds mean lost business.
  2. For business intelligence, these organizations use a traditional relational database or a more modern “big data” tool such as Hadoop or Spark. Although the use of a relational database for background processing is generally called online analytic processing (OLAP), it is nowhere near as “online” as the first tier, where data is used within milliseconds for real-time decisions.
  3. Some data is archived with no immediate use in mind. It can be compressed and perhaps even stored on magnetic tape.

For the new fast data tier, where performance is critical, techniques such as materialized views further improve responsiveness. According to Piekos, materialized views bypass a certain amount of database processing to cut milliseconds off of queries. Read more…
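
As an illustration of the idea (shown here in PostgreSQL syntax rather than VoltDB's own mechanism), the sketch below precomputes an aggregate as a materialized view so the latency-sensitive query reads stored results instead of scanning raw events. The connection string and table names are hypothetical.

```python
# Sketch: precompute an aggregate as a materialized view so the hot query path
# reads stored results instead of scanning raw events. PostgreSQL syntax,
# issued through psycopg2; the connection string and tables are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=fastdata user=app")
cur = conn.cursor()

cur.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS clicks_per_minute AS
    SELECT date_trunc('minute', ts) AS minute, COUNT(*) AS clicks
    FROM click_events
    GROUP BY 1
""")
conn.commit()

# The latency-sensitive query now touches the small, precomputed view.
cur.execute("SELECT clicks FROM clicks_per_minute ORDER BY minute DESC LIMIT 1")
print(cur.fetchone())

# Periodically bring the view up to date, outside the hot path.
cur.execute("REFRESH MATERIALIZED VIEW clicks_per_minute")
conn.commit()
```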

The big data sweet spot: policy that balances benefits and risks

Deciding what data to collect is hard when consequences are unpredictable.

A big reason why discussions of “big data” get complicated — and policy-makers resort to vague hand-waving, as in the well-known White House executive office report — is that its ripple effects travel fast and far. Your purchase, when recorded by a data broker, affects not only the ads and deals they offer you in the future, but also the ones they offer innumerable people around the country who share some demographic with you.

Policy-making might be simple if data collectors or governments could say, “We’ll collect certain kinds of data for certain purposes and no others” — but the impacts of data collection are rarely predictable. And if one did restrict big data that way, its value would be seriously reduced.

Follow my steps: big data privacy vs collection

Data collection will explode as we learn how to connect you to different places you’ve been by the way you walk or to recognize certain kinds of diseases by your breath.

When such data exhaust is being collected, you can’t evade consequences by paying cash and otherwise living off the grid. In fact, trying to do so may disadvantage you even more: people who lack access to sophisticated technologies leave fewer tracks and therefore may not be served by corporations or governments. Read more…

Commodity data analytics for health care

Predixion service could signal a trend for smaller health facilities.

Analytics are expensive and labor intensive; we need them to be routine and ubiquitous. I complained earlier this year that analytics are hard for health care providers to muster because there’s a shortage of analysts and because every data-driven decision takes huge expertise.

Currently, only major health care institutions such as Geisinger, the Mayo Clinic, and Kaiser Permanente incorporate analytics into day-to-day decisions. Research facilities employ analytics teams for clinical research, but perhaps not so much for day-to-day operations. Large health care providers can afford departments of analysts, but most facilities — including those forming accountable care organizations — cannot.

Imagine that you are running a large hospital and lie awake nights worrying about the Medicare penalty for readmitting patients within 30 days of their discharge. Now imagine you have access to analytics that can identify about 40 measures that combine to predict a readmission, and a convenient interface is available to tell clinicians in a simple way which patients are most at risk of readmission. Better still, the interface suggests specific interventions to reduce readmission risk: giving the patient a 30-day supply of medication, arranging transportation to rehab appointments, etc. Read more…
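
A heavily simplified sketch of what such a readmission-risk model might look like appears below: it trains a logistic regression on a few invented measures and flags high-risk patients. It is a generic stand-in, not a description of Predixion's product or its actual features.

```python
# Sketch: a generic readmission-risk model over a handful of made-up measures.
# This is not a description of Predixion's product.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: age, prior admissions, length of stay (days), lives alone (0/1)
X = np.array([[64, 0, 3, 0], [81, 2, 7, 1], [47, 0, 2, 0],
              [75, 3, 9, 1], [68, 1, 4, 0], [88, 4, 11, 1]])
y = np.array([0, 1, 0, 1, 0, 1])          # 1 = readmitted within 30 days

model = LogisticRegression().fit(X, y)

new_patient = np.array([[79, 2, 8, 1]])
risk = model.predict_proba(new_patient)[0, 1]
print(f"Estimated 30-day readmission risk: {risk:.0%}")
if risk > 0.5:
    print("Consider interventions: 30-day medication supply, transport to rehab.")
```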

Fast data fuels real-time streaming applications

A new report describes an imminent shift in real-time applications and the data architecture they require.

The era is here: we’re starting to see computers making decisions that people used to make, through a combination of historical and real-time data. These streams of data come together in applications that answer questions like:

  • What news items or ads is this website visitor likely to be interested in?
  • Is current network traffic part of a Distributed Denial of Service attack?
  • Should our banking site offer a visitor a special deal on a mortgage, based on her credit history?
  • What promotion will entice this gamer to stay on our site longer?
  • Is a particular part of the assembly line overheating, and does it need to be shut down?

Such decisions require the real-time collection of data from the particular user or device, along with others in the environment, and often need to be made on a per-person or per-event basis. For instance, leaderboarding (determining the top candidate among a group of users, based on some criteria) requires a database that tracks all the relevant users. Such a database nowadays often resides in memory. Read more…
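
As a small illustration of leaderboarding, the sketch below keeps running scores in memory and returns the top candidates on demand; a production system would use an in-memory data store rather than a plain Python dict, but the shape of the problem is the same.

```python
# Sketch: an in-memory leaderboard. A production system would keep this in an
# in-memory data store; a plain dict and heapq stand in for one here.
import heapq
from collections import defaultdict

scores = defaultdict(int)

def record_event(user: str, points: int) -> None:
    """Update a user's running score as events stream in."""
    scores[user] += points

def top_candidates(n: int = 3):
    """Return the n highest-scoring users."""
    return heapq.nlargest(n, scores.items(), key=lambda kv: kv[1])

for user, pts in [("ann", 40), ("bo", 75), ("cal", 40), ("ann", 50), ("bo", 5)]:
    record_event(user, pts)

print(top_candidates())    # [('ann', 90), ('bo', 80), ('cal', 40)]
```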