Saturday, July 13, 2019

Hadoop Vs Traditional Data Processing Solutions


RDBMSs are designed to handle structured data; they are not designed to handle huge volumes of data of different kinds. The complexity and cost of scaling a Hadoop cluster are much lower than for an RDBMS. Beyond that, the degree of scaling and parallelism that Hadoop achieves is very hard for an RDBMS to match. Several limitations of RDBMSs restrict their use for Big Data.

Why Hadoop? Why can’t we stick with traditional data processing solutions and RDBMS?

This article answers that question in detail. It focuses on:
•  What restricts a traditional RDBMS from being used to process Big Data.
•  How a new Big Data solution can overcome the problems of traditional DBs, and what capabilities it adds.

In this article we discuss why a traditional solution can’t be chosen over Hadoop for Big Data processing and how traditional solutions differ from Big Data solutions.
Without further ado, we start with a comparative analysis of traditional DBs and Hadoop.

1. Better throughput in processing huge data volumes

Hadoop processes data faster when the volume of data is huge. There are some striking differences in the way an RDBMS and Hadoop treat data.

Suppose there is a table of 1 terabyte and you run a “select *” on it. How differently will it be processed in Hadoop and in an RDBMS?

In an RDBMS, all the data blocks of the table required for the processing first move to the application server, and only then is the logic applied to the data. In Hadoop, the data resides on the same node where the processing happens. It is not the data that travels to the processing server; instead, the job code is sent to the node where the processing needs to happen, which saves the time spent on data movement.
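The idea of sending the code to the data can be sketched in a few lines of plain Python. This is only a toy model, not Hadoop's API: the `cluster` dictionary, the node names and the `count_by_dept` job are all hypothetical, and each "node" is simply a list of rows held in memory.

```python
from collections import Counter

# Hypothetical cluster: each node already holds one partition of the table.
cluster = {
    "node1": ["alice,HR", "bob,IT"],
    "node2": ["carol,IT", "dave,IT"],
    "node3": ["erin,HR"],
}

def count_by_dept(rows):
    """The 'job code': a small piece of logic that runs where the rows live."""
    return Counter(row.split(",")[1] for row in rows)

def run_on_cluster(cluster, job):
    """Send the job to every node and merge only the small partial results."""
    total = Counter()
    for node, local_rows in cluster.items():
        partial = job(local_rows)  # runs "on" the node; the rows never leave it
        total.update(partial)      # only this summary would cross the network
    return total

result = run_on_cluster(cluster, count_by_dept)
print(dict(result))
```

What moves between nodes here is the function and a handful of counters, not the rows themselves, which is the point the paragraph above makes.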

Suppose we have a table of 10 GB. A "select * from emp" will cause 10 GB of data to move to the DB server. In Hadoop, it is the code, which may be 100 KB or less, that moves to all the nodes; the data is already distributed among the nodes. This is the most interesting and most important reason for Hadoop's better throughput. Moving terabytes or petabytes of data across a network can by itself take hours. The typical architecture of an RDBMS solution used in an enterprise is shown below:

Basic Diagram of 3 tier DB architecture

An RDBMS deployment has an application server and storage servers. The client connects over high-speed network lines, and the data moves over the network from storage to the application server.
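To put rough numbers on the 10 GB example above, the following sketch compares the time needed to move the table with the time needed to move the job code. The 1 Gbit/s link speed is an assumption chosen for illustration, and the calculation ignores protocol overhead, disk speed and contention.

```python
def transfer_seconds(size_bytes, link_bits_per_second):
    """Ideal transfer time: payload size in bits divided by link speed."""
    return size_bytes * 8 / link_bits_per_second

GB = 10**9
KB = 10**3
LINK = 10**9  # assumed 1 Gbit/s network link (not a figure from the article)

table_time = transfer_seconds(10 * GB, LINK)  # moving the 10 GB table
job_time = transfer_seconds(100 * KB, LINK)   # moving 100 KB of job code

print(f"table: {table_time:.1f} s, job code: {job_time * 1000:.1f} ms")
```

Under these assumptions the table takes on the order of a minute to move, while the job code takes under a millisecond per node. The gap widens in proportion as the table grows.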

Another difference in the way they treat data is that an RDBMS follows the ACID rules while Hadoop follows the BASE rules, so Hadoop relies on eventual consistency in contrast to the two-phase commit of an RDBMS.

2. Scalability

Scalability is one of the most important features of Hadoop. It is the ability to expand the number of servers used for data storage, and every server added also brings additional compute power. In the simplest case, you point the new node's core-site.xml at the name node (the fs.defaultFS property) and start the data node service, which then registers itself with the name node as a new member of the cluster. Traditional DBs are also scalable, but the problem lies in the way they scale. Vertical scaling makes an RDBMS very costly for processing large batches of data.

Traditional DBs work best with small tables of 1-10 million rows. But once a table grows to 500 million rows, or the data grows to petabytes, it becomes difficult to process.

They don’t scale very well. Grid solutions or sharding can help with this problem, but a growing amount of data in an RDBMS brings many limitations that hamper performance.

One limitation that huge data volumes bring to an RDBMS is the following:
An RDBMS can be scaled, but scaling does not increase performance linearly. Application servers, processors and storage can all be added, yet the performance gain from each addition keeps shrinking. The diagram shows the scaling vs performance graph.

Performance Vs Scalability for Hadoop and RDBMS

The graph shows scalability vs performance for Hadoop and RDBMS. The red line shows Hadoop and the blue one RDBMS. For Hadoop the graph is almost linear, but for RDBMS the improvement in performance is nonlinear as more servers are added. There are several reasons for this. An RDBMS follows the ACID rules while Hadoop follows the BASE rules, so Hadoop focuses on eventual consistency rather than two-phase commit. The predefined schema of an RDBMS also hampers linear performance scalability.
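One simple way to see why coordination work flattens the curve is Amdahl's law, used here as an illustrative stand-in and not as the model behind the graph above. The serial fractions below are hypothetical: a small value stands for loosely coordinated work, a larger one for work that must be coordinated across servers (for example, commit protocols).

```python
def speedup(servers, serial_fraction):
    """Amdahl's law: speedup when part of the work cannot be parallelised."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / servers)

# Hypothetical serial fractions, chosen only to show the shape of the curves.
LOOSELY_COORDINATED = 0.01
HEAVILY_COORDINATED = 0.10

for n in (1, 4, 16, 64):
    near_linear = speedup(n, LOOSELY_COORDINATED)
    flattening = speedup(n, HEAVILY_COORDINATED)
    print(f"{n:>3} servers: {near_linear:6.2f}x vs {flattening:6.2f}x")
```

With a small serial fraction the speedup stays close to the number of servers, while a larger serial fraction makes each added server contribute less than the one before it.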

3. Cost

Scaling up is very costly. If an RDBMS is used to handle “big data”, it will eventually turn out to be very expensive. When required, relational databases tend to scale up vertically, which means extra horsepower is added to the existing system to enable faster operations on the dataset.
In contrast, Hadoop and NoSQL databases such as HBase, MongoDB and Couchbase scale horizontally by adding extra nodes (commodity servers) to the resource pool, so that the load can be distributed easily.
Storing large amounts of data in a relational database is expensive compared with storing it in HDFS, whereas the cost of storing data in Hadoop grows linearly with the volume of data and there is no ultimate limit.
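The linear growth of storage cost in Hadoop can be sketched as a small capacity calculation. The replication factor of 3 is the HDFS default; the usable capacity per node is a hypothetical value, and cost is expressed simply as the number of commodity nodes required.

```python
import math

def nodes_needed(raw_tb, replication=3, usable_tb_per_node=10):
    """Commodity nodes required to hold raw_tb of data in HDFS.

    replication=3 is the HDFS default; usable_tb_per_node is a
    hypothetical figure used only for illustration.
    """
    return math.ceil(raw_tb * replication / usable_tb_per_node)

for volume_tb in (100, 200, 400):
    print(f"{volume_tb} TB of data -> {nodes_needed(volume_tb)} nodes")
```

Doubling the data doubles the number of nodes, so if each node costs roughly the same, the storage bill grows in a straight line with the data volume.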

4. Flexibility in storage and processing

You can read and write any type of file and use any kind of processing mechanism. Either treat the data as a table and process it through SQL, or treat it as a file and use a processing engine such as Spark or MapReduce to run all kinds of analysis on it, be it predictive analysis, sentiment analysis, regression, or real-time processing using the lambda approach.
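The "table or file" choice can be illustrated with a small local sketch. It does not run on Hadoop: the in-memory `sqlite3` database merely stands in for a SQL engine, the plain Python loop stands in for a file-oriented engine, and the sample data and function names are hypothetical.

```python
import csv
import io
import sqlite3
from collections import Counter

# Hypothetical raw file content, stored once and interpreted in two ways.
RAW = "id,dept,comment\n1,HR,great team\n2,IT,slow laptop\n3,IT,great tools\n"

def as_table(raw):
    """View 1: impose a schema on the raw text and query it with SQL."""
    rows = list(csv.DictReader(io.StringIO(raw)))
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE emp (id INTEGER, dept TEXT, comment TEXT)")
    db.executemany("INSERT INTO emp VALUES (:id, :dept, :comment)", rows)
    query = "SELECT dept, COUNT(*) FROM emp GROUP BY dept ORDER BY dept"
    return db.execute(query).fetchall()

def as_file(raw):
    """View 2: treat the same text as lines and count words in the comments."""
    words = Counter()
    for line in raw.splitlines()[1:]:    # skip the header line
        comment = line.split(",", 2)[2]  # third field of each line
        words.update(comment.split())
    return words

print(as_table(RAW))
print(as_file(RAW).most_common(1))
```

The same stored bytes serve both views, and neither view prevents the other from being used later.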

“RDBMS is a black box. Once you put your data in an RDBMS, you can access it only through SQL queries. Hadoop gives a sense of openness.” It provides tremendous flexibility to store and process data the way you want.

It is also difficult to implement some kinds of use cases, such as finding the shortest path between two points, sentiment analysis or predictive analysis, using SQL on top of a relational database.
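As an example of logic that is awkward in plain SQL but short in a general-purpose language, here is a minimal breadth-first search for the shortest path in an unweighted graph. The graph and node names are hypothetical, and a real job on a large graph would distribute this work across nodes.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search: returns a path with the fewest hops, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # goal is not reachable from start

# Hypothetical directed graph: each key lists the points reachable from it.
graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": ["F"],
    "E": ["F"],
    "F": [],
}

print(shortest_path(graph, "A", "F"))
```

The loop-and-queue structure is what makes this hard to express as a single declarative query.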

There are also datatype and field length limitations in traditional relational database systems. For example, Netezza, a well-known solution for handling big data, has a limit of 1600 columns per table, whereas Hadoop has no such limitation.

