"data security" entries
Building access policies into data stores.
Hadoop jobs reflect the same security demands as other programming tasks. Corporate and regulatory requirements create complex rules concerning who has access to different fields in data sets; sensitive fields must be protected from internal users as well as external threats; and multiple applications run on the same data and must grant different users different access rights. The modern world of virtualization and containers adds security at the software level, but tears away the hardware protection formerly offered by network segments, firewalls, and DMZs.
Furthermore, security involves more than saying yes or no to a user running a Hadoop job. There are rules for archiving or backing up data on the one hand, and expiring or deleting it on the other. Audit logs are a must, both to track down possible breaches and to conform to regulation.
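To make those requirements concrete, here is a minimal sketch of field-level access control with an audit trail. It is a toy model, not Hadoop's or any product's API: the `FIELD_POLICY` table, the role names, and the `AuditLog` and `read_record` helpers are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical policy: which roles may read which fields.
# Fields missing from the table are denied to everyone.
FIELD_POLICY = {
    "name": {"analyst", "clinician"},
    "zip_code": {"analyst", "clinician"},
    "ssn": {"clinician"},
}


@dataclass
class AuditLog:
    """In-memory audit trail; a real system would write to durable storage."""
    entries: list = field(default_factory=list)

    def record(self, user, role, field_name, allowed):
        self.entries.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "role": role,
            "field": field_name,
            "allowed": allowed,
        })


def read_record(record, user, role, audit):
    """Return only the fields this role may see, logging every decision."""
    visible = {}
    for field_name, value in record.items():
        # Deny by default: an unknown field has an empty set of allowed roles.
        allowed = role in FIELD_POLICY.get(field_name, set())
        audit.record(user, role, field_name, allowed)
        if allowed:
            visible[field_name] = value
    return visible
```

Calling `read_record(row, "jdoe", "analyst", audit)` on a row containing `name`, `zip_code`, and `ssn` would return only the first two fields, while the audit log records both the grants and the refusal. Denying unknown fields by default is the design choice that matters here: a newly added column stays hidden until someone writes a rule for it.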
Best practices for managing data in these complex, sensitive environments implement the well-known principle of security by design. According to this principle, you can’t design a database or application in a totally open manner and then layer security on top if you expect it to be robust. Instead, security must be infused throughout the system and built in from the start. Defense in depth is a related principle that urges the use of many layers of security, so that an intruder breaking through one layer may be frustrated by the next. Read more…
Practical tips for centralizing security data.
But let’s be realistic. You probably have numerous repositories for your security data. Your Security Information and Event Management (SIEM) solution doesn’t scale to the volumes of data that you would really like to collect. This, in turn, makes it hard to use all of your data for any kind of analytics. It’s likely that your tools have to operate on multiple, disconnected data stores that have very different capabilities for data access and analysis. Even worse, during an incident, how many different consoles do you have to touch before you get the complete picture of what has happened? I would guess probably at least four (I would have said 42, but that seemed a bit excessive).
When you talk to your peers about this problem, do they tell you to implement Hadoop to deal with the huge data volumes? But what does that really mean? Is Hadoop really the solution? After all, Hadoop is a pretty complex ecosystem of tools that requires skilled and expensive people to implement and maintain. Read more…
A field guide to the Apache Hadoop projects, subprojects, and related technologies.
IT managers, developers, data analysts, and system architects are encountering the largest and most disruptive change in data analysis since the ascendency of the relational database in the early 1980s: the challenge to process, organize, and take full advantage of big data. With 73% of organizations making big data investments in 2014 and 2015, this transition is occurring at a historic pace, requiring new ways of thinking to go along with new tools and techniques.
Hadoop is the cornerstone of this change to a landscape of systems and skills we’ve traditionally possessed. In the nine short years since the project revolutionized data science at Yahoo!, an entire ecosystem of technologies has sprung up around it. While the power of this ecosystem is plain to see, it can be a challenge to navigate your way through the complex and rapidly evolving collection of projects and products.
A couple years ago, my coworker Marshall Presser and I started our journey into the world of Hadoop. Like many folks, we found the company we worked for was making a major investment in the Hadoop ecosystem, and we had to find a way to adapt. We started in all of the typical places — blog posts, trade publications, Wikipedia articles, and project documentation. Quickly, we learned that many of these sources are often highly biased, either too shallow or too deep, and just plain inconsistent. Read more…
When the death of trust meets the birth of BYOD
Dr. Andrew Litt, Chief Medical Officer at Dell, made a thoughtful blog post last week about the trade-offs inherent in designing for both the security and accessibility of medical data, especially in an era of BYOD (bring your own device) and the IoT (internet of things). As we begin to see more internet-enabled diagnostic and monitoring devices, Litt writes, “The Internet of Things (no matter what you think of the moniker), is related to BYOD in that it could, depending on how hospitals set up their systems, introduce a vast array of new access points to the network. … a very scary thought when you consider the sensitivity of the data that is being transmitted.”
As he went on to describe possible security solutions (e.g., store all data in central servers rather than on local devices), I was reminded of a post my colleague Simon St.Laurent wrote last fall about “security after the death of trust.” In the wake of some high-profile security breaches, including news of NSA activities, St.Laurent says, we have a handful of options when it comes to data security—and you’re not going to like any of them.
Data stores are rolling out easy-to-use analysis tools
Originated by the NSA, Apache Accumulo is a BigTable-inspired data store known for being highly scalable and for its interesting security model. Federal agencies and defense contractors have deployed Accumulo on clusters of a thousand or more servers. It also uses “cell-level” security to control access to values stored in individual cells.
What Accumulo was lacking were easy-to-use, standard analytic engines that allow users to interact with data. The release of Sqrrl Enterprise this past week fills that gap. Sqrrl Enterprise provides an initial set of analytic engines for the Accumulo ecosystem. It includes support for interactive SQL, full-text search, and queries over graph data. Each of these engines takes into account security labels placed on data: since every data object ingested into Sqrrl has a security label, query and analytic results incorporate those access levels. Analysts interact with data as they normally would. For example, Sqrrl’s indexing technology accounts for security labels, and search queries are written in standard Lucene syntax. Reminiscent of the Phoenix project for HBase, SQL queries in Sqrrl are converted into optimized Accumulo iterators.
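The idea of cell-level security labels can be sketched in a few lines. What follows is a toy model, not the Accumulo or Sqrrl API: the `tokenize`, `is_visible`, and `scan` functions are hypothetical, and the label grammar is a simplification (terms joined by `&` and `|` with parentheses, `&` binding tighter than `|`, and an empty label treated as readable by anyone, all of which are assumptions of this sketch).

```python
import re

# A token is a term such as "hr" or one of the operators & | ( )
TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")


def tokenize(label):
    tokens, pos = [], 0
    label = label.strip()
    while pos < len(label):
        match = TOKEN.match(label, pos)
        if not match:
            raise ValueError(f"bad label expression: {label!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def is_visible(label, authorizations):
    """Evaluate a label expression against the reader's set of authorizations."""
    tokens = tokenize(label)
    if not tokens:
        return True  # assumption: an unlabeled cell is readable by anyone
    pos = 0

    def parse_or():
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            rhs = parse_and()  # parse first so short-circuiting never skips tokens
            result = result or rhs
        return result

    def parse_and():
        nonlocal pos
        result = parse_term()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            rhs = parse_term()
            result = result and rhs
        return result

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of label")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            result = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing ')'")
            pos += 1
            return result
        if tok in ("&", "|", ")"):
            raise ValueError(f"unexpected {tok!r}")
        return tok in authorizations

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("trailing tokens in label")
    return result


def scan(cells, authorizations):
    """Yield-style filter: keep (key, value) for cells whose label the reader satisfies."""
    return [(key, value) for key, label, value in cells
            if is_visible(label, authorizations)]
```

Given cells stored as `(key, label, value)` tuples, `scan(cells, {"hr", "audit"})` returns only the cells whose labels that reader satisfies, so a query layer built on top never sees the rest. That is the property the excerpt describes: filtering happens at the storage level, and analysts write ordinary queries.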
The battle to open source OFA code; a student hacker uncovers security flaw, gets expelled; and ethics and taxes for user data collection.
A cloudy future for Obama’s election code
A battle is brewing between politicians and the dream team of programmers that helped Obama win the nerdiest election ever. Ben Popper reports at The Verge that the programmers who worked on the Obama for America (OFA) 2012 campaign want to open source the code behind the campaign’s website, its donation collection and email systems, and its mobile app. Yet “[t]hree months after the election, the data and software is still tightly controlled by the president and his campaign staff, with the fate of the code still largely undecided,” Popper writes.
OFA’s director of front-engineering Daniel Ryan told Popper that he believes the Democratic National Committee (DNC) will “mothball” the tech and argues that it should be open because it was built on top of open source code and, therefore, should go back to the public. Popper also notes that if the DNC keeps the code on ice until the 2016 election, it will be useless. “But if our work was open and people were forking it and improving it all the time,” Ryan told Popper, “then it keeps up with changes as we go.” Ryan also points out that keeping the code closed would not only stifle development for the next election, but also hinder opportunities for other progressive organizations to build on the code in the next four years.
Popper reports that a DNC official responded to a request for comment, stating that “OFA is still working out the future of their tech and data infrastructure so any speculation at this time is premature and uninformed.” You can read Popper’s in-depth report at The Verge.
IDC forecast underestimates big data growth, EU report sounds an alarm over FISA Amendments Act, and big data's growing role in daily life.
Here are a few stories from the data space that caught my attention this week.
Big data needs a bigger forecast
The International Data Corporation (IDC) released a forecast this week, projecting “the worldwide big data technology and services market will grow at a 31.7% compound annual growth rate (CAGR) — about seven times the rate of the overall information and communication technology (ICT) market — with revenues reaching $23.8 billion in 2016.”
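For readers who want to check the arithmetic behind a CAGR figure, here is a small sketch of the compound-growth formula, end = start × (1 + rate)^years. The 31.7% rate, the $23.8 billion 2016 figure, and the "about seven times" ratio come from the quote above; the function names are hypothetical, and the forecast horizon is not stated in the quote, so the horizons in the loop are assumptions rather than IDC's numbers.

```python
def project(start_value, rate, years):
    """Value after compounding `rate` annually for `years` years."""
    return start_value * (1 + rate) ** years


def implied_start(end_value, rate, years):
    """Starting value that would compound to `end_value` after `years` years."""
    return end_value / (1 + rate) ** years


BIG_DATA_CAGR = 0.317   # from the IDC quote
REVENUE_2016 = 23.8     # billions of dollars, from the IDC quote

# "About seven times" the overall ICT rate implies an ICT CAGR near this value.
ict_rate = BIG_DATA_CAGR / 7

if __name__ == "__main__":
    print(f"Implied ICT growth rate: {ict_rate:.1%}")
    # The horizon is an assumption: try a few to see how sensitive the base is.
    for years in (3, 4, 5):
        base = implied_start(REVENUE_2016, BIG_DATA_CAGR, years)
        print(f"{years}-year horizon -> implied starting market of ${base:.1f}B")
```

The point of the loop is that a compound rate this high makes the implied starting market very sensitive to the forecast window, which is worth keeping in mind when comparing forecasts that use different base years.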
According to the press release, findings from IDC’s research also forecasted specific segment growth, including 21.1% CAGR for services and 53.4% for storage. GigaOm’s Derrick Harris says IDC’s research “only tells part of the story” and that the market will actually be much bigger. For instance, Harris notes that the report doesn’t include analytics software, a critical component of the big data market that the IDC predicts will hit $51 billion by 2016. And what of the outliers? Harris writes:
“… where does one include the rash of Software-as-a-Service applications targeting fields from marketing to publishing? They’re all about big data at their core, but the companies selling them certainly don’t fit into the mold of ‘big data’ vendors.”
Harris highlights potential problems the IDC might have in maintaining its report segments (servers, storage, networking, software and services) as more and more cloud providers host big data applications and startups offer cloud-based big data services. Calculating these revenues will be no easy feat, he writes. You can read Harris’ piece in full at GigaOm.
Platfora, Continuuity secure funding; the Internet of Things gets connected; and personal big data needs a national awareness campaign.
Here are a few stories from the data space that caught my attention this week.
Two Hadoop BI startups secure funding
There were a couple notable pieces of investment news this week. Platfora, a startup looking to democratize Hadoop as a business intelligence (BI) tool for everyday business users, announced this week that it has raised $20 million in series B funding, bringing its total funding to $25.7 million, according to a report by Derrick Harris at GigaOm.
Harris notes that investors seem to get the technology — CEO Ben Werther told Harris that in this funding round, discussions moved to signed term sheets in just three weeks. Harris writes that the smooth investment experience “probably has something to do with the consensus the company has seen among venture capitalists, who project Hadoop will take about 20 percent of a $30 billion legacy BI market and are looking for the startups with the vision to win that business.”
Platfora faces plenty of well-funded legacy BI competitors, but Werther told Christina Farr at Venture Beat that Platfora’s edge is speed: “People can visualize and ask questions about data within hours. There is no six-month cycle time to make Hadoop amazing.”
In other investment news, Continuuity announced it has secured $10 million in series A funding to further develop AppFabric, its cloud-based platform-as-a-service tool designed to host Hadoop-based BI applications. Alex Wilhelm reports at The Next Web that Continuuity is looking to make AppFabric “the de facto location where developers can move their big data tools from idea to product, without worrying about building their own backend, or fretting about element integration.”
Hadoop and security, surprising results from a consumer data survey, and disconcerting data retention legislation.
In the latest Strata Week: Will big data offer us more security insights? Or will large data stores become targets for security threats? Plus: A very old map gets a digital upgrade.