An RDBMS is designed to handle structured data; it is not designed to handle huge volumes of data of many different kinds. The complexity and cost involved in scaling a Hadoop cluster are far lower than for an RDBMS, and the scale and parallelism Hadoop can achieve are nearly impossible for an RDBMS to match. There are many limitations in an RDBMS that restrict its use for Big Data.
Why Hadoop? Why can't we stick with traditional data processing solutions and RDBMS? This article answers that question in detail, focusing on:
- What restricts a traditional RDBMS from being used to process Big Data.
- How the new Big Data solutions overcome the problems of traditional databases, and what capabilities they add.
We will discuss exactly why traditional solutions cannot be chosen over Hadoop for Big Data processing and how they differ from Big Data solutions. Without further ado, we will start with a comparative analysis of traditional databases and Hadoop.
1. Better throughput in processing huge data volumes
Hadoop processes data faster when it comes to huge volumes, and there are striking differences in the way an RDBMS and Hadoop treat that data.
Suppose there is a 1-terabyte table and you run a "select *" on it. How differently will it be processed in Hadoop and in an RDBMS?
In an RDBMS, all the data blocks of the table required for processing first move to the application server, and only then is the logic applied to the data. In Hadoop, the data resides on the very nodes where processing happens: it is not the data that travels to a processing server, but the job code that is sent to the nodes holding the data, saving the time otherwise spent on data movement, as the sketch below illustrates.
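To make the code-to-data model concrete, here is a minimal PySpark sketch (the HDFS path, schema, and filter condition are hypothetical). Spark serializes the small filter task and ships it to executors running on the DataNodes that hold the file's blocks, so only the matching rows cross the network:

    from pyspark.sql import SparkSession

    # Start a Spark session; on a Hadoop cluster this runs against YARN/HDFS.
    spark = SparkSession.builder.appName("locality-demo").getOrCreate()

    # Read a (hypothetical) large file stored in HDFS. The data stays put:
    # Spark schedules tasks on the nodes that already hold the blocks.
    emp = spark.read.csv("hdfs:///data/emp.csv", header=True, inferSchema=True)

    # Only the kilobytes of serialized task code move to the DataNodes;
    # each node scans its local blocks and returns just the matching rows.
    high_earners = emp.filter(emp.salary > 100000).collect()

    spark.stop()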
Suppose we have a 10 GB table. A "select * from emp" will move all 10 GB of data to the database server. In Hadoop it is the code, perhaps 100 KB or less, that moves to the nodes; the data is already distributed among them. This is the most important reason for Hadoop's better throughput: moving terabytes or petabytes of data across the network can by itself take hours.

A typical architecture of an enterprise RDBMS solution is shown below:
Basic diagram of a three-tier database architecture
An RDBMS deployment has an application server and storage servers; the client is connected through high-speed network lines, and the data moves from the storage servers to the application server.
Another difference lies in how they treat data: an RDBMS follows the ACID properties, while Hadoop follows the BASE model, settling for eventual consistency in contrast to the two-phase commit of an RDBMS.
2. Scalability
Scalability is one of the most important features of Hadoop: you can keep expanding the number of servers used for data storage, and each added server brings more compute power with it. Adding a node is simple; in essence, you point the new machine's core-site.xml at the NameNode (and typically list the machine in the cluster's workers file), then start its DataNode daemon, as sketched below. Traditional databases are also scalable, but the problem lies in how they scale: vertical scaling makes an RDBMS very costly for processing large batches of data.
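A minimal sketch of that configuration, assuming Hadoop 3.x and a NameNode reachable at the hypothetical hostname namenode.example.com (exact steps vary by version and distribution):

    <!-- core-site.xml on the new DataNode: point it at the existing NameNode -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://namenode.example.com:9000</value>
      </property>
    </configuration>

Then start the DataNode daemon on the new machine and confirm that the NameNode sees it:

    # On the new node (Hadoop 3.x syntax):
    hdfs --daemon start datanode

    # From any node: the report should now list the new member.
    hdfs dfsadmin -report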
Traditional databases work best with small tables of 1-10 million rows. When you grow to 500 million rows, or to petabytes of data, processing becomes difficult.
They simply don't scale very well. Grid solutions or sharding can help, but the growing amount of data in an RDBMS brings limitations that hamper performance. Among the limitations that huge data volumes impose on an RDBMS:
Scaling can be done in an RDBMS, but it does not increase performance linearly: application servers, processors, and storage can all be added, yet performance does not grow in proportion. The diagram below shows the scaling-versus-performance graph.
Performance vs. scalability for Hadoop and RDBMS
The graph shows scalability versus performance for Hadoop and an RDBMS: the red line is Hadoop, the blue line the RDBMS. For Hadoop the curve is almost linear, but for an RDBMS the performance improvement is sublinear as more servers are added. There are several reasons for this. An RDBMS follows the ACID properties while Hadoop follows the BASE model, favouring eventual consistency over two-phase commit; the predefined schema of an RDBMS also hampers linear performance scalability.
3. Cost
Scaling up is very costly. If an RDBMS is used to handle big data, it eventually turns out to be very expensive. When more capacity is required, relational databases tend to scale vertically, adding extra horsepower to the existing system to enable faster operations on the dataset. By contrast, Hadoop and NoSQL databases such as HBase, MongoDB, and Couchbase scale horizontally, adding extra nodes (commodity servers) to the resource pool so the load can be distributed easily.
Storing large amounts of data in a relational database is expensive compared to storing it in HDFS, where cost grows only linearly with the volume of data and there is no hard upper limit.
4. Hadoop provides flexibility in storage and processing
You can read and write any type of file and use any kind of processing mechanism: treat the data as a table and process it through SQL, or treat it as a file and use any processing engine, such as Spark or MapReduce, to run all kinds of analysis on it, be it predictive, sentiment, regression, or real-time processing using a lambda architecture.
"An RDBMS is a black box: once you put your data in it, you can access it only through SQL queries. Hadoop gives a sense of openness; it provides tremendous flexibility to store and process data the way you want."
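As a small sketch of that flexibility (the file path and column names are made up), the same file in HDFS can be queried as a table through SQL or processed directly as a raw dataset:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("flexibility-demo").getOrCreate()

    # The same raw file, stored once in HDFS.
    reviews = spark.read.json("hdfs:///data/reviews.json")

    # Option 1: treat it as a table and query it with plain SQL.
    reviews.createOrReplaceTempView("reviews")
    avg_rating = spark.sql(
        "SELECT product, AVG(rating) AS avg_rating FROM reviews GROUP BY product")
    avg_rating.show()

    # Option 2: treat it as a file of records and run arbitrary code over it,
    # e.g. a naive word count over review text as a stand-in for sentiment work.
    words = (reviews.rdd
             .flatMap(lambda row: (row.text or "").lower().split())
             .map(lambda w: (w, 1))
             .reduceByKey(lambda a, b: a + b))
    print(words.take(10))

    spark.stop()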
It is also difficult to implement certain use cases, such as the shortest path between two points, sentiment analysis, or predictive analysis, using SQL on top of a relational database; a sketch of one such case follows.
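For instance, here is a minimal sketch of a shortest-path computation using the optional GraphFrames package on top of Spark (the four-vertex graph is made-up sample data); expressing the same iterative traversal in plain SQL would be awkward at best:

    from pyspark.sql import SparkSession
    from graphframes import GraphFrame  # optional add-on package

    spark = SparkSession.builder.appName("shortest-path-demo").getOrCreate()

    # Tiny sample graph: vertices need an "id"; edges need "src" and "dst".
    vertices = spark.createDataFrame([("a",), ("b",), ("c",), ("d",)], ["id"])
    edges = spark.createDataFrame(
        [("a", "b"), ("b", "c"), ("a", "d"), ("d", "c")], ["src", "dst"])

    g = GraphFrame(vertices, edges)

    # Hop counts from every vertex to the landmark vertex "c".
    g.shortestPaths(landmarks=["c"]).show()

    spark.stop()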
There are also datatype and field-length limitations in traditional relational database systems. For example, Netezza, a well-known solution for handling big data, has a limit of 1,600 columns per table; Hadoop imposes no such limitation.