Big Data on Cloud - Overview of Top 5 Technologies

Cloud and big data are a powerful combination. Over the last few years, big data technologies have seen a tremendous increase in importance and adoption, driven by the rising need to analyze large amounts of data.

Big data is now used across industries and domains because of its ability to derive useful business insights that were difficult to achieve earlier. At the same time, the cloud has emerged as an important way to streamline a company's computing resources, allowing it to increase and decrease its computing footprint seamlessly.

This article presents the top technologies that make it possible to run big data workloads smoothly in a cloud environment.

Big Data on Cloud - Top 5 Technologies

Running big data technologies like Hadoop and Spark on the cloud is quite different from running them on bare metal. People who have tried to run Hadoop, HDFS, and Spark as-is on a cloud platform (such as AWS) have often seen sub-optimal performance.

Here is a set of cloud big data technologies that make it worthwhile to run Hadoop/Spark clusters on the cloud.

Hadoop and Spark On Demand

Traditionally in a big data infrastructure, a cluster of nodes remains up 24/7. In a cloud setting, this is not the optimal use of resources. Instead, we can launch a Hadoop cluster, run our jobs for, say, a few hours, and then release the nodes to save money. Running on demand lets one run a much bigger Hadoop cluster for the same money - for example, 5 instances running for 24 hours cost the same 120 instance-hours as 60 instances running for 2 hours.
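
As an illustration, here is a minimal sketch of launching such an on-demand cluster with Amazon EMR via boto3 (EMR itself is covered later in this article). The release label, instance types, counts, bucket, and job script below are placeholder assumptions; the key idea is setting KeepJobFlowAliveWhenNoSteps to False, so the cluster terminates as soon as its steps finish and you pay only for the hours actually used.

```python
# Sketch: launch a Hadoop/Spark cluster on demand that shuts itself down
# when its steps finish. Names, sizes and paths are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="on-demand-spark-cluster",
    ReleaseLabel="emr-6.9.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 10,
        # Terminate the cluster as soon as all steps are done,
        # so we only pay for the hours we actually use.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[
        {
            "Name": "run-spark-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-bucket/jobs/etl_job.py"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

print("Launched cluster:", response["JobFlowId"])
```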

AWS S3 (and similar cloud storage)

Many of us coming from traditional big data infrastructure are used to the concept of HDFS as the distributed file system and persistent store for our data. But on cloud infrastructure, we don't have a consistent set of nodes and hard disks, so the best approach is to use cloud storage such as AWS S3 for persistent storage.

In that case, whenever a Hadoop cluster is launched, data is copied from AWS S3 to HDFS, and once the big data processing is over, the results are copied back to AWS S3.
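
A minimal sketch of this copy-in / copy-out pattern, expressed as EMR steps added with boto3, is shown below. The cluster id, bucket, and paths are placeholders; s3-dist-cp (EMR's distributed copy tool, a plain hadoop distcp with an s3a:// URL works similarly) performs the actual S3-to-HDFS transfer.

```python
# Sketch: stage input data from S3 into HDFS, and ship results back to S3,
# as steps on an already running EMR cluster. All ids/paths are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")
cluster_id = "j-XXXXXXXXXXXXX"  # placeholder id of a running cluster

emr.add_job_flow_steps(
    JobFlowId=cluster_id,
    Steps=[
        {   # 1. Copy input data from S3 into HDFS when the cluster comes up.
            "Name": "copy-input-to-hdfs",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["s3-dist-cp",
                         "--src", "s3://my-bucket/input/",
                         "--dest", "hdfs:///data/input/"],
            },
        },
        # ... the actual processing steps run here against hdfs:///data/ ...
        {   # 2. Copy the results back to S3 before the cluster is released.
            "Name": "copy-results-to-s3",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["s3-dist-cp",
                         "--src", "hdfs:///data/output/",
                         "--dest", "s3://my-bucket/output/"],
            },
        },
    ],
)
```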

AWS Spot Instances

AWS spot instances may seem a surprising addition to this list, but they are the secret sauce used by many of the companies running big data workloads on Amazon AWS.

AWS spot instances are a bid-based mechanism that lets one use Amazon's spare capacity, typically at a fraction of the on-demand price - sometimes as little as one-tenth.
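
One common way to combine this with the earlier ideas is to keep the master node on on-demand capacity while running the worker nodes on spot capacity. Here is a minimal sketch of that setup with EMR instance groups via boto3; the instance types, counts, and bid price are placeholder assumptions, not recommendations.

```python
# Sketch: an EMR cluster whose master runs on-demand while the core
# (worker) nodes bid for cheaper spot capacity. Values are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="spot-backed-hadoop-cluster",
    ReleaseLabel="emr-6.9.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {   # Keep the master node on stable on-demand capacity.
                "Name": "master",
                "InstanceRole": "MASTER",
                "Market": "ON_DEMAND",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            },
            {   # Run the workers on spot capacity at a fraction of the price.
                "Name": "core-spot",
                "InstanceRole": "CORE",
                "Market": "SPOT",
                "BidPrice": "0.05",  # maximum price per instance-hour (USD)
                "InstanceType": "m5.xlarge",
                "InstanceCount": 20,
            },
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

print("Launched cluster:", response["JobFlowId"])
```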

Managed Services - EMR, Qubole, etc.

One can build on top of the above technologies in-house, but over time many managed services have evolved that take care of these concerns for you in a seamless manner.

Services like Qubole and EMR charge based on your cluster size and provide optimized tooling to manage Hadoop and Spark clusters. These services have built-in support for utilizing spot instances and optimized ways of using AWS S3 for persistent storage.

More recently, newer data platforms like Incluin provide an easy-to-use layer above different cloud providers to launch tools like Spark and TensorFlow, while also helping analysts track and manage their experiments, models, and data lineage.