Hadoop jobs face the same security demands as other programming tasks. Corporate and regulatory requirements create complex rules about who may access which fields in a data set; sensitive fields must be protected from internal users as well as external threats; and multiple applications running on the same data must treat users with different access rights differently. The modern world of virtualization and containers adds security at the software level, but strips away the hardware protection formerly offered by network segments, firewalls, and DMZs.
Furthermore, security involves more than saying yes or no to a user running a Hadoop job. There are rules for archiving or backing up data on the one hand, and expiring or deleting it on the other. Audit logs are a must, both to track down possible breaches and to conform to regulations.
Best practices for managing data in these complex, sensitive environments implement the well-known principle of security by design. According to this principle, you can’t design a database or application in a totally open manner and then layer security on top if you expect it to be robust. Instead, security must be infused throughout the system and built in from the start. Defense in depth is a related principle that urges the use of many layers of security, so that an intruder breaking through one layer may be frustrated by the next.
In this article, I’ll describe how security by design can work in a Hadoop environment. I interviewed the staff of PHEMI for the article and will refer to their product PHEMI Central to illustrate many of the concepts. But the principles are general ones with long-standing roots in computer security.
The core of a security-by-design approach is a policy enforcement engine that intervenes and checks access rights before any data enters or leaves the data store. The use of such an engine makes it easier for an organization to guarantee consistent and robust restrictions on its data, while simplifying application development by taking policy enforcement off the shoulders of the developers.
Combining metadata into policies
Security lies at the intersection of two sets of criteria: the traits that make data sensitive and the traits of the people who may access it.
Sometimes you can simply label a column sensitive because it contains private data (an address, a salary, a social security number). So, column names in databases, tags in XML, and keys in JSON represent the first level of metadata on which you can filter access. But you might want to take several other criteria into account, particularly when you add data retention and archiving to the mix. Thus, you can add any metadata you can extract during data ingestion, such as filenames, timestamps, and network addresses. Your users may also add other keywords or tags to the system.
Each user, group, or department to which you grant access must be associated with some combination of metadata. For instance, a billing department might get access to a customer’s address field and to billing data that’s less than one year old.
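To make this concrete, here is a minimal sketch of such a metadata-based check. The policy structure, field names, and one-year cutoff are entirely hypothetical, not PHEMI's actual format; the point is only that a role maps to a combination of metadata (allowed fields plus a maximum age) that each data item is tested against:

```python
from datetime import datetime, timedelta

# Hypothetical policies: each role maps to the metadata a data item
# must carry (allowed field tags plus a maximum age) to be accessible.
POLICIES = {
    "billing": {"fields": {"address", "invoice_amount"}, "max_age_days": 365},
}

def is_accessible(role, item):
    """Return True if the role's policy covers this item's metadata."""
    policy = POLICIES.get(role)
    if policy is None:
        return False
    if item["field"] not in policy["fields"]:
        return False
    age = datetime.utcnow() - item["ingested_at"]
    return age <= timedelta(days=policy["max_age_days"])

recent_invoice = {"field": "invoice_amount",
                  "ingested_at": datetime.utcnow() - timedelta(days=30)}
old_invoice = {"field": "invoice_amount",
               "ingested_at": datetime.utcnow() - timedelta(days=400)}

print(is_accessible("billing", recent_invoice))  # True:  allowed field, < 1 year old
print(is_accessible("billing", old_invoice))     # False: past the retention window
```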
Storing policies with the data
Additional security is provided by storing policies right with the raw data instead of leaving the policies in a separate database that might become detached from the system or out of sync with changing data. It’s worth noting, in this regard, that several tools in the Hadoop family — Ranger, Falcon, and Knox — can check data against ACLs and enforce security, but they represent the older model of security as an afterthought. PHEMI Central exemplifies the newer security-by-design approach.
PHEMI Central stores a reference to each policy with the data in an Accumulo index. A policy can apply to a row, a column, a field in XML or JSON, or even a particular cell. Multiple policy references can be attached without a performance problem, so different users can have different access rights to the same data item. Performance hits are minimized through Accumulo's caching and through locality groups, which store related column families together on disk so that data commonly accessed together can be read together. An administrator can also set up commonly used filters and aggregations such as min, max, and average in advance, which gives a performance boost to users who need them.
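Accumulo's cell-level security works through visibility labels: boolean expressions over authorization tokens, stored alongside each key and checked on every read. The toy evaluator below handles a simplified form of that expression language, an OR of AND clauses; Accumulo's real parser also supports parentheses and quoted labels, so treat this as a sketch of the idea, not the actual syntax:

```python
def visible(expression, authorizations):
    """Evaluate a simplified Accumulo-style visibility expression.

    Supports an OR ('|') of AND ('&') clauses over label tokens.
    An empty expression means the cell is unlabeled and visible to all.
    """
    if not expression:
        return True
    return any(
        all(label in authorizations for label in clause.split("&"))
        for clause in expression.split("|")
    )

print(visible("billing&current", {"billing", "current"}))  # True
print(visible("billing&current", {"billing"}))             # False: missing a label
print(visible("admin|auditor", {"auditor"}))               # True: one clause suffices
```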
The policy enforcement engine
So far, we have treated data passively and talked only about its structure. Now we can turn to the active element of security: the software that stands between the user’s query or job request and the data.
The policy enforcement engine retrieves the policy for each requested column, cell, or other data item and determines whether the user should be granted access. If access is granted, the data is sent to the user application. If access is denied, the effect on the user is just as if no such data existed at all. However, a sophisticated policy enforcement engine can also offer different types or levels of access. Suppose, for instance, that privacy rules prohibit researchers from seeing a client’s birthdate, but that it’s permissible to mask the birthdate and present the researcher with the year of birth. A policy enforcement engine can do this transformation. In other words, different users get different amounts of information based on access rights.
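A rough sketch of that masking step follows. The access levels and the transformation are invented for illustration, not any real engine's API; they show the three outcomes described above: full access, a coarsened view, and silent denial:

```python
from datetime import date

def apply_policy(birthdate, access_level):
    """Return the view of a birthdate appropriate to the access level.

    'full' sees the value, 'masked' sees only the year of birth,
    and any other level sees nothing, as if the data did not exist.
    """
    if access_level == "full":
        return birthdate.isoformat()
    if access_level == "masked":
        return str(birthdate.year)
    return None

dob = date(1975, 4, 12)
print(apply_policy(dob, "full"))    # 1975-04-12
print(apply_policy(dob, "masked"))  # 1975
print(apply_policy(dob, "none"))    # None
```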
Note that many organizations duplicate data in order to give certain users quick access. For instance, they may copy data needed by analysts out of the Hadoop environment into a data mart dedicated to those analysts. This requires extra servers and disk space, and risks giving analysts outdated information. It undermines some of the reasons organizations moved to Hadoop in the first place.
In contrast, a system like PHEMI Central can provide each user with a view suited to his or her needs, without moving any data. The process is similar to views in relational databases, but more flexible.
Take as an example medical patient data, which is highly regulated, treated with great concern by the patients, and prized by the health care industry. A patient and the physicians treating the patient may have access to all data, including personal information, diagnosis, etc. A researcher with whom the data has been shared for research purposes must have access only to specific data items (e.g., blood glucose level) or the outcome of analyses performed on the data. A policy enforcement engine can offer these different views in a secure manner without making copies. Instead, the content is filtered based on access policies in force at the time of query.
Fraud detection is another common use case for filtering. For example, a financial institution has access to personal financial information for individuals. Certain patterns indicate fraud, such as access to a particular account from two countries at the same time. The institution could create a view containing only coarse-grained geographic information — such as state and country, along with date of access — and share that with an application run by a business partner to check for fraud.
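A sketch of how such a coarse-grained view might be derived and consumed. The record fields and the fraud rule are made up for illustration; note that the partner's check works on the filtered view alone, never seeing account numbers or amounts:

```python
# Hypothetical access records held by the financial institution.
accesses = [
    {"account": "12345", "city": "Lyon", "country": "FR",
     "date": "2016-03-01", "amount": 250.0},
    {"account": "12345", "city": "Austin", "country": "US",
     "date": "2016-03-01", "amount": 90.0},
]

def partner_view(records):
    """Keep only coarse geography and the access date for sharing."""
    return [{"country": r["country"], "date": r["date"]} for r in records]

def looks_fraudulent(records):
    """Flag activity seen in two countries on the same date."""
    seen = {}
    for r in records:
        seen.setdefault(r["date"], set()).add(r["country"])
    return any(len(countries) > 1 for countries in seen.values())

view = partner_view(accesses)
print(view[0])                 # {'country': 'FR', 'date': '2016-03-01'}
print(looks_fraudulent(view))  # True: FR and US on the same date
```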
Benefits of centralizing policy
In organizations without policy engines, each application developer has to build policy checks into the application. These ad hoc implementations are easy to get wrong, and they consume precious developer time that should be focused on the business needs of the organization.
A policy enforcement engine can enforce flexible and sophisticated rules. For instance, HIPAA's privacy rules guard against the use or disclosure of an individual's identifying health information. These rules provide extensive guidelines on how individual data items must be de-identified for privacy purposes and can come into play, for example, when sharing patient data for research. By capturing these rules as metadata associated with each data item, the policy engine can enforce them at query time.
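One hedged illustration of the idea: de-identification transforms attached to field names as metadata and applied at query time. The transform rules here are invented for brevity; HIPAA's actual Safe Harbor method enumerates eighteen identifier categories with specific requirements:

```python
# Hypothetical de-identification transforms keyed by field name.
# A transform returning None means the field is suppressed entirely.
DEIDENTIFY = {
    "name":      lambda v: None,          # suppress direct identifiers
    "zip_code":  lambda v: v[:3] + "00",  # generalize to a coarser area
    "birthdate": lambda v: v[:4],         # keep only the year
}

def deidentify_record(record):
    """Apply each field's transform; fields without one pass through."""
    out = {}
    for field, value in record.items():
        transform = DEIDENTIFY.get(field)
        cleaned = transform(value) if transform else value
        if cleaned is not None:
            out[field] = cleaned
    return out

patient = {"name": "Ana Lopez", "zip_code": "94110",
           "birthdate": "1975-04-12", "glucose": "5.4"}
print(deidentify_record(patient))
# {'zip_code': '94100', 'birthdate': '1975', 'glucose': '5.4'}
```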
Another benefit of this type of system, as mentioned earlier, is that data can be massaged before being presented to the user. Thus, different users or applications see different views, but the underlying data is kept in a single place with no need to copy and store altered versions.
At the same time, the engine can enforce retention policies and automatically track data provenance as data enters the system. The engine logs all accesses to meet regulatory requirements and to provide an audit trail when things go wrong.
Security by design is strongest when the metadata used for access is built right into the system. Applications, databases, and the policy enforcement engine can work together seamlessly to give users all the data they need while upholding organizational and regulatory requirements.
This post is a collaboration between O’Reilly and PHEMI. See our statement of editorial independence.