Big Data Hadoop Security - A Comprehensive Guide

Big data has become an integral part of most large organizations because of its ability to derive critical business insights from the wide variety of data available to them. Technically, big data infrastructure is primarily based on Hadoop and its ecosystem. One rising concern across big data and Hadoop deployments is the lack of a clear, consistent security architecture.

Big Data and Hadoop Security Risk

A typical Hadoop cluster stores and operates on large amounts of business-critical data. Any data leak or security lapse in this infrastructure can seriously disrupt the business.

Beyond outright data breaches, there is also the wider issue of access control and data privacy.

Traditionally, Hadoop has not had a well-defined security architecture and has lagged behind the database and data warehouse community. Many deployments run unsecured, so anybody with access to the same network as the Hadoop cluster can easily read data off HDFS (Hadoop Distributed File System).
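To make this risk concrete, here is a minimal Python sketch of how a client on the same network could address a file through the WebHDFS REST interface when the cluster uses simple (non-Kerberos) authentication. In that mode Hadoop accepts whatever user name the client supplies in the `user.name` query parameter. The host name, path, and user below are illustrative assumptions, and port 9870 is assumed as the NameNode HTTP port (the Hadoop 3 default). The code only builds the URL and does not contact any cluster.

```python
from urllib.parse import quote, urlencode


def webhdfs_open_url(namenode_host, path, user, port=9870):
    """Build a WebHDFS OPEN URL for `path`, claiming to be `user`.

    With simple authentication the server does not verify `user`;
    it takes the `user.name` parameter at face value.
    """
    query = urlencode({"op": "OPEN", "user.name": user})
    return f"http://{namenode_host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical host, file, and user name, for illustration only.
url = webhdfs_open_url("namenode.example.internal", "/data/sales/2020.csv", "hdfs")
print(url)
```

Nothing in this request proves that the caller really is the named user, which is why network reachability alone is enough to read data from an unsecured cluster.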

Big Data Hadoop Security - Key Technologies

Kerberos Support and Secure Hadoop

The core of Hadoop security lies in enabling Kerberos support for both the YARN cluster manager and HDFS. Kerberos acts as the core authentication mechanism, and authorization for actions and data access builds on the identities it verifies. Hadoop's default mode is non-secure and does not have Kerberos enabled.
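As a rough illustration, the following Python sketch inspects the contents of a `core-site.xml` file and reports whether Kerberos authentication and service-level authorization are switched on. The property names `hadoop.security.authentication` and `hadoop.security.authorization` are Hadoop's own. The sample XML and the helper functions `read_properties` and `security_summary` are hypothetical and exist only for this example. A real secure setup needs many more settings (principals, keytabs, and so on), so treat this as a quick sanity check rather than a full audit.

```python
import xml.etree.ElementTree as ET

# Hypothetical core-site.xml content for a Kerberos-enabled cluster.
SAMPLE_CORE_SITE = """<configuration>
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
  </property>
</configuration>"""


def read_properties(xml_text):
    """Return the name/value pairs of a Hadoop *-site.xml document as a dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value", default="")
        if name:
            props[name.strip()] = value.strip()
    return props


def security_summary(props):
    """Report which of the two basic security switches are enabled.

    Missing properties fall back to Hadoop's defaults:
    'simple' authentication and authorization 'false'.
    """
    auth = props.get("hadoop.security.authentication", "simple")
    authz = props.get("hadoop.security.authorization", "false")
    return {
        "kerberos_authentication": auth.lower() == "kerberos",
        "service_authorization": authz.lower() == "true",
    }


print(security_summary(read_properties(SAMPLE_CORE_SITE)))
print(security_summary(read_properties("<configuration/>")))
```

The second call shows what a cluster left at its defaults looks like: both switches are off.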

Details on securing Hadoop with Kerberos are available at https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SecureMode.html

Please note that securing YARN (cluster manager) and HDFS (filesystem) is just the first step toward a secure Hadoop deployment. A typical Hadoop infrastructure relies on a large number of other services such as Hive, HBase, MongoDB, and Kafka, and one needs to look into the security of each of these.

Secure Hadoop in the Cloud

If your Hadoop infrastructure is in the cloud, there is another possibility. For example, many people launch Hadoop clusters without Kerberos, i.e., in non-secure mode. To take care of security in such a deployment, they rely on the cloud provider's support for a virtual private network (VPN).

A VPN ensures that computers inside it can connect to each other, while outside computers cannot connect to the nodes inside it. So in a cloud environment one can run non-secure Hadoop inside a VPN.
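Network isolation only helps if every cluster node actually sits inside the private address range. A minimal Python sketch of such a check follows. The node addresses and the `10.0.0.0/16` range are made-up values, and `nodes_outside_network` is a hypothetical helper, not part of any Hadoop or cloud provider tooling.

```python
import ipaddress


def nodes_outside_network(node_ips, network_cidr):
    """Return the node addresses that do not belong to the private network."""
    network = ipaddress.ip_network(network_cidr)
    return [ip for ip in node_ips if ipaddress.ip_address(ip) not in network]


# Hypothetical cluster node addresses; the last one is a public address.
nodes = ["10.0.1.10", "10.0.1.11", "54.12.34.56"]
exposed = nodes_outside_network(nodes, "10.0.0.0/16")
print(exposed)  # ['54.12.34.56']
```

Any address reported here would be a node reachable outside the private network, and therefore a node on which non-secure Hadoop should not be running.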

Please note that this solves only half of the problem. The bigger and more difficult issue of access control among the company's own employees still needs to be addressed.
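One building block for this kind of internal access control is HDFS's POSIX-style permission model, in which each file has an owner, a group, and mode bits. The Python sketch below mimics the read check under that model: owner bits apply if the user is the owner, otherwise group bits if the user belongs to the file's group, otherwise the "other" bits. The `FileStatus` class, the `can_read` function, and the users and groups shown are hypothetical, and HDFS ACLs and superuser handling are left out.

```python
from dataclasses import dataclass


@dataclass
class FileStatus:
    owner: str
    group: str
    mode: int  # permission bits, e.g. 0o640


def can_read(user, user_groups, status):
    """POSIX-style read check: exactly one permission class is consulted."""
    if user == status.owner:
        return bool(status.mode & 0o400)  # owner read bit
    if status.group in user_groups:
        return bool(status.mode & 0o040)  # group read bit
    return bool(status.mode & 0o004)      # other read bit


# Hypothetical file readable by its owner and by the 'finance' group only.
payroll = FileStatus(owner="alice", group="finance", mode=0o640)

print(can_read("alice", ["hr"], payroll))       # True
print(can_read("bob", ["finance"], payroll))    # True
print(can_read("carol", ["marketing"], payroll))  # False
```

Such checks are only meaningful when user identities are authenticated. In a non-secure cluster a client can simply claim to be `alice`, which is why network isolation alone does not solve internal access control.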