Securing Big Data (Part 2 of 2)

The race for dominance in big data security is on. Big data vendors understand that end users would like to get back control similar to what they had in large data warehouses to meet their customers' expectations.

By Raghuveeran Sowmyanarayanan, Vice President at Accenture

[Editor's note: Part 1 of this article is available here.]

You'll hear people say that when it comes to security, there is no silver bullet. I disagree. There certainly are security threats, but not every one is harmful. Enterprises can use encryption, malware detection, whitelists, signature-based detection, regular scanning, and security software patching to protect their data. It is difficult to guarantee that data is 100 percent secure, but even if enterprises can't completely eliminate problems, they can reduce the level of risk.

Let's look at one way to do that with a Hadoop-based approach to big data security.

Hadoop is increasingly being used for big data analytics, and is an emerging key priority for CxOs. Hadoop security incorporates strong access controls. Open source software contributors and vendors are working to force-fit security into the Hadoop framework because data in Hadoop is not inherently protected.

We need to protect data from any source and in any format before it enters Hadoop. Once data is in the Hadoop ecosystem, security can be implemented in multiple places, such as within the infrastructure, within clusters, and as part of access control.

At the infrastructure level, Hadoop can be protected at the network layer by a firewall that allows access only to the Hadoop NameNode; clients are prohibited from communicating directly with DataNodes. Data-at-rest encryption protects sensitive data and, at a minimum, keeps the disks out of audit scope. This encryption ensures that no readable residual data remains when data is removed or copied and when the disks are decommissioned. POSIX-style permissions in secure HDFS can also be used to implement controls across the Hadoop stack. Pre-shared certificates can prevent rogue nodes from being added to the cluster.
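To make the POSIX-style permission model concrete, here is a minimal Python sketch of an owner/group/other mode check. It illustrates the general model only and is not HDFS code; the names FileStatus and can_access, and the sample users and groups, are hypothetical.

```python
from dataclasses import dataclass

# Permission bits, as in POSIX: read = 4, write = 2, execute = 1.
READ, WRITE, EXECUTE = 4, 2, 1


@dataclass(frozen=True)
class FileStatus:
    """Hypothetical stand-in for the metadata a file system keeps per path."""
    owner: str
    group: str
    mode: int  # e.g. 0o640 means rw- for owner, r-- for group, --- for others


def can_access(status: FileStatus, user: str, groups: set, wanted: int) -> bool:
    """Return True if `user` holds every permission bit in `wanted`.

    Exactly one class applies: owner first, then group, then other.
    """
    if user == status.owner:
        bits = (status.mode >> 6) & 0o7
    elif status.group in groups:
        bits = (status.mode >> 3) & 0o7
    else:
        bits = status.mode & 0o7
    return (bits & wanted) == wanted
```

With a file owned by user "etl" and group "analysts" at mode 0o640, the owner can read and write, a member of "analysts" can read but not write, and everyone else is denied.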

Hadoop cluster management security involves token delegation and the management of token lifetime and renewal. Some of the controls provided within Hadoop distributions include:

  • Starting and authenticating services: For specific OS users, enterprises generate passwords specific to the service when the service is installed

  • Service-to-service authentication and authorization: Special authentication is implemented for Hadoop services that do not act on behalf of users, such as services used for cluster coordination and monitoring

  • User-to-service authentication: Authentication of users to Hadoop components and services

  • Executing queries with the user's identity: Through delegation tokens
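The idea of delegation tokens with a managed lifetime and renewal can be sketched as follows. This is an illustrative model under stated assumptions, not Hadoop's implementation: the TokenService and DelegationToken names, the one-day renewal interval, and the seven-day maximum lifetime are all hypothetical choices made for the example.

```python
import hashlib
import hmac
from dataclasses import dataclass

RENEW_INTERVAL = 24 * 3600     # assumed: a token lapses unless renewed within a day
MAX_LIFETIME = 7 * 24 * 3600   # assumed: renewal cannot extend a token past seven days


@dataclass(frozen=True)
class DelegationToken:
    owner: str       # user whose identity the job or query runs under
    renewer: str     # service that is allowed to renew the token
    issued_at: int   # issue time in seconds
    signature: str   # HMAC over the fields above, computed with the service secret


class TokenService:
    """Hypothetical issuer that keeps each token's expiry on the server side."""

    def __init__(self, secret: bytes):
        self._secret = secret
        self._expiry = {}  # signature -> current expiry time

    def _sign(self, owner: str, renewer: str, issued_at: int) -> str:
        message = f"{owner}|{renewer}|{issued_at}".encode()
        return hmac.new(self._secret, message, hashlib.sha256).hexdigest()

    def issue(self, owner: str, renewer: str, now: int) -> DelegationToken:
        token = DelegationToken(owner, renewer, now, self._sign(owner, renewer, now))
        self._expiry[token.signature] = now + RENEW_INTERVAL
        return token

    def is_valid(self, token: DelegationToken, now: int) -> bool:
        expected = self._sign(token.owner, token.renewer, token.issued_at)
        if not hmac.compare_digest(expected, token.signature):
            return False  # fields were altered or the token was never issued here
        expiry = self._expiry.get(token.signature)
        return expiry is not None and now < expiry

    def renew(self, token: DelegationToken, requester: str, now: int) -> int:
        if requester != token.renewer:
            raise PermissionError("only the designated renewer may renew this token")
        if not self.is_valid(token, now):
            raise ValueError("token is expired or not recognized")
        hard_limit = token.issued_at + MAX_LIFETIME
        self._expiry[token.signature] = min(now + RENEW_INTERVAL, hard_limit)
        return self._expiry[token.signature]
```

Keeping the expiry on the service side, rather than inside the token, means a client cannot extend its own token by editing a field; it can only ask the designated renewer to do so, and only up to the hard limit.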

Some Hadoop access control and auditing features mimic the state-of-the-art role-based access control (RBAC) features of traditional RDBMSs. These include controls that are part of open source software (OSS) Hadoop, controls and add-ons that are commonly packaged together with a Hadoop distribution, and data-centric controls provided by proprietary Hadoop security add-ons.

Data security controls outside Hadoop can be applied to:

  • Data inbound to Hadoop: Sensitive data, such as personally identifiable information (PII), is not to be stored in Hadoop.

  • Data retrieved from Hadoop: Data moved from Hadoop into a data warehouse inherits all of the warehouse's controls, which can enforce standard SQL security.

  • Data discovery: Identifying whether sensitive data is present in Hadoop and where it is located, then triggering appropriate data protection measures such as data masking, tokenization, or encryption. Data discovery is much more challenging for unstructured data than for structured data because it is harder to identify the location, count, and classification of sensitive data.
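As a rough illustration of data discovery followed by masking, the sketch below scans free-text records for two kinds of sensitive values and replaces them with placeholders. The two regular expressions (email addresses and U.S. Social Security numbers) and the function names are assumptions chosen for the example; real discovery tools use far richer classifiers.

```python
import re

# Assumed, deliberately simple patterns; production rules would be much stricter.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def discover(records):
    """Return a list of (record index, kind, matched text) findings."""
    findings = []
    for index, text in enumerate(records):
        for kind, pattern in PATTERNS.items():
            for match in pattern.finditer(text):
                findings.append((index, kind, match.group()))
    return findings


def mask(text):
    """Replace every sensitive match with a placeholder naming its kind."""
    for kind, pattern in PATTERNS.items():
        text = pattern.sub(f"<{kind}>", text)
    return text


records = [
    "Contact jane.doe@example.com about order 17",
    "SSN 123-45-6789 on file",
    "nothing sensitive here",
]
findings = discover(records)            # where the sensitive values are, and of what kind
masked = [mask(r) for r in records]     # the protection step triggered by discovery
```

The findings list gives the location, count, and classification that the discovery step needs; the masking step is one possible protection measure and could be swapped for tokenization or encryption.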

Hadoop Security's Soft Spots

Security policies are usually configured in clear text files, which are readable and editable by root, sudoers, and privileged application users. Security configuration files are not self-contained, so checking their validity is challenging. Clear text data may need to be transferred from one data node to another: in Hadoop 2.0, the scheduler may not be able to locate resources (data nodes) next to the data, so data may need to be read over the network. SQL-like facilities such as Hive are vulnerable to SQL injection attacks.
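The SQL injection risk can be shown with a small, self-contained sketch. Python's built-in sqlite3 module stands in for a SQL engine here only because it ships with Python; it is not Hive, and the table, data, and function names are hypothetical. The point is the difference between pasting user input into the query text and binding it as a parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Ann", "east"), ("Bob", "west"), ("Cy", "west")],
)


def find_unsafe(region):
    # Vulnerable: the user's input becomes part of the SQL text itself.
    query = f"SELECT name FROM customers WHERE region = '{region}' ORDER BY name"
    return [row[0] for row in conn.execute(query)]


def find_safe(region):
    # Safer: the input is bound as a value and is never parsed as SQL.
    query = "SELECT name FROM customers WHERE region = ? ORDER BY name"
    return [row[0] for row in conn.execute(query, (region,))]


attack = "east' OR '1'='1"
leaked = find_unsafe(attack)   # the injected OR clause matches every row
blocked = find_safe(attack)    # no region has this literal name, so nothing matches
```

In the unsafe version the attacker's quote characters close the string literal early and append a condition that is always true, so every customer is returned; in the bound version the same input is treated as an ordinary, nonexistent region name.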

At a high level, these soft spots can be secured through traditional safeguards. File integrity monitoring (FIM) can be used to detect unauthorized changes to policy and configuration files. Encrypting network traffic to and from Hadoop protects against network traffic sniffing. However, it will not protect data at rest or guard against user mishandling of data. Database audit and protection (DAP) tools (such as Fortinet and GreenSQL) can be used to mitigate SQL injection attacks where applicable. DAP tools have been developed to cover implementation of data security policies, data discovery, access privilege management, audit, vulnerability assessment, and data protection.
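The core of file integrity monitoring is simple enough to sketch: record a cryptographic hash of each policy or configuration file, then compare later hashes against that baseline. The following is a minimal illustration using only the Python standard library, not a substitute for an FIM product; the function names and the sample file names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint(path):
    """Return the SHA-256 hex digest of a file's content."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def snapshot(paths):
    """Record a baseline mapping of path -> content hash."""
    return {str(p): fingerprint(p) for p in paths}


def changed_files(baseline, paths):
    """Return paths whose content differs from the baseline or that are missing."""
    changed = []
    for p in map(str, paths):
        current = fingerprint(p) if Path(p).exists() else None
        if baseline.get(p) != current:
            changed.append(p)
    return sorted(changed)


def demo():
    """Usage example: tamper with one of two files and report what changed."""
    with tempfile.TemporaryDirectory() as folder:
        policy = Path(folder) / "policy.xml"
        site = Path(folder) / "site.xml"
        policy.write_text("<allow user='etl'/>")
        site.write_text("<property/>")
        baseline = snapshot([policy, site])
        policy.write_text("<allow user='*'/>")  # simulated unauthorized edit
        return [Path(p).name for p in changed_files(baseline, [policy, site])]
```

For the check to be meaningful, the baseline itself has to be stored somewhere the privileged users who can edit the configuration files cannot also rewrite it.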

Focused approach: This is one of the standard approaches for implementing state-of-the-art data security within the scope of one application. This approach typically limits enterprises to one of the core components of the Hadoop system, such as HBase or Hive. Currently HBase security seems to be ahead, but DBAs would like to see SQL standards-based features.

Adaptive approach: This is another standard approach that deploys well-understood controls (OSS Hadoop security controls and proprietary controls) independently on different levels of infrastructure.

Data in Hadoop is many-to-many. It can come from a variety of sources and can be accessed and transformed by different users and services. This many-to-many paradigm makes it virtually impossible to trace what really happened to the data or to enforce the path of your data. Data management tools such as Apache Falcon and Cloudera Navigator implement an adaptive approach by collecting the metadata, visualizing it, and enforcing its path and lifetime through Hadoop.
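To show why collected metadata helps with the many-to-many problem, here is a small sketch of a lineage log that records which inputs each data set was derived from and then traces everything upstream of a given data set. It is a toy model and does not reflect the APIs of Apache Falcon or Cloudera Navigator; the LineageLog class and the data set names are hypothetical.

```python
from collections import defaultdict


class LineageLog:
    """Hypothetical record of which data sets were derived from which others."""

    def __init__(self):
        self._parents = defaultdict(set)  # data set -> data sets it was derived from
        self._events = []                 # audit trail: who did what, in order

    def record(self, output, inputs, actor, operation):
        self._events.append((output, tuple(inputs), actor, operation))
        self._parents[output].update(inputs)

    def upstream(self, dataset):
        """Return every data set that `dataset` was directly or indirectly derived from."""
        seen, stack = set(), [dataset]
        while stack:
            for parent in self._parents.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


log = LineageLog()
log.record("clean_orders", ["raw_orders"], actor="etl", operation="dedupe")
log.record("report", ["clean_orders", "customers"], actor="alice", operation="join")
sources = log.upstream("report")
```

With such a record in place, a policy such as "nothing derived from raw_orders may leave the cluster" becomes a lookup rather than guesswork.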


Organizationally, we will see continued efforts to remove silos between risk, compliance, and information security departments; a continuing move toward closer collaboration among these departments; and requirements for combined detection capabilities. From an operational perspective, taking a life cycle approach to securing data, developing an enterprise data policy, and joining up investigative capabilities to develop a single intelligence platform across the enterprise will be increasingly important. This will be combined with the deployment of integrated case management for all forms of financial crimes across all financial institutions.

Raghuveeran Sowmyanarayanan is a vice president at Accenture and is responsible for designing solution architecture for RFPs/opportunities.
