Saturday, March 23, 2019

What is an RDD and Why Does Spark Need It?


RDD (Resilient Distributed Dataset) is the core of Apache Spark. It is the fundamental data structure on top of which all the Spark components reside. In layman's terms, it can be understood as a distributed collection of records that resides in memory. In a Spark cluster, multiple nodes work together on a job, and each node works on some portion of the data. If you are using HDFS as storage, the data will be saved in a distributed manner. For computation, all the distributed chunks of a dataset, which primarily reside on the HDFS hard disks, come into the RAM of each node for a short while; this distributed data at that point is collectively known as an RDD.

This is similar to the class and object concept: while writing code you define a class in a text file, but an object of that class is actually materialized only when the code is executed, at which point it occupies some heap memory in the execution engine.

When is an RDD Materialized?

The above process, where RDDs get materialized, happens only when an Action is called on an RDD. You can keep deriving one RDD from another through Transformations, but Spark won't materialize the RDDs (i.e. data won't be fetched into RAM); only a graph is created, which keeps all the information about each RDD's dependencies and the transformation operation to be applied to create the new RDD. This graph is called the DAG (Directed Acyclic Graph).

Spark keeps adding the transformation and resulting RDD information into the DAG until it finds an action call on a subsequent RDD.

Image: Apache Spark DAG

Once an action is found on an RDD, the DAG is submitted to the DAG Scheduler, which further divides the job into multiple stages and executes the DAG to populate the data into the RDDs and apply the defined transformations.
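The lazy building of the lineage graph described above can be sketched in a few lines of plain Python. This is a toy illustration, not real Spark code: the class name `ToyRDD` and its methods are invented for the sketch, but the idea mirrors Spark, where `map` and `filter` only record a node in the lineage and `collect` (an action) replays the whole chain.

```python
# Toy sketch (not real Spark code): transformations only record a step in a
# lineage graph; the chain is executed only when an action is called.
class ToyRDD:
    def __init__(self, source, parent=None, transform=None):
        self.source = source        # base data (only set on the root RDD)
        self.parent = parent        # upstream RDD in the lineage graph
        self.transform = transform  # function to apply to the parent's output

    def map(self, fn):
        # Transformation: returns a new RDD node, computes nothing yet.
        return ToyRDD(None, parent=self,
                      transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(None, parent=self,
                      transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        # Action: walk the lineage back to the root, then replay each transform.
        if self.parent is None:
            return list(self.source)
        return self.transform(self.parent.collect())

base = ToyRDD([1, 2, 3, 4, 5])
pipeline = base.map(lambda x: x * 10).filter(lambda x: x > 20)  # nothing computed yet
print(pipeline.collect())  # the action triggers the whole chain -> [30, 40, 50]
```

Until `collect()` is called, only the chain of `ToyRDD` objects exists; no data has been transformed, which is exactly the behavior the DAG gives Spark.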

Why is an RDD materialized only for a fraction of a second?

Since there can be multiple jobs running in Spark at the same time, it is not efficient to keep a materialized RDD in memory forever. So Spark uses a concept called lazy evaluation, which means that until an action is called on an RDD, it won't get materialized. Once an action is called, all the RDDs are materialized one by one, as per the transformations defined in the DAG. Once they are materialized and the action is complete, the RDDs are flushed from memory.

The next time you call an action on the same RDD, the DAG will be executed again. If there is an RDD that is referenced multiple times in the DAG, or that would otherwise be computed multiple times, you can Cache or Persist that RDD; this avoids re-computing it.
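The effect of caching can be sketched with a toy stand-in (again, not real Spark code; in PySpark you would simply call `rdd.cache()` or `rdd.persist()`). The counter shows that after `cache()`, repeated actions reuse the stored result instead of replaying the lineage:

```python
# Toy sketch (not real Spark code) of cache()/persist(): the first action
# computes and stores the result; later actions reuse it instead of
# replaying the lineage.
compute_count = {"n": 0}

class CachedRDD:
    def __init__(self, compute_fn):
        self.compute_fn = compute_fn  # stands in for replaying the lineage
        self.cached = None
        self.use_cache = False

    def cache(self):
        self.use_cache = True
        return self

    def collect(self):
        if self.use_cache and self.cached is not None:
            return self.cached          # served from cache, no re-computation
        compute_count["n"] += 1
        result = self.compute_fn()
        if self.use_cache:
            self.cached = result
        return result

rdd = CachedRDD(lambda: [x * x for x in range(5)]).cache()
rdd.collect()              # computed once
rdd.collect()              # served from cache
print(compute_count["n"])  # -> 1
```

Without the `cache()` call, every `collect()` would bump the counter, which is exactly the repeated DAG execution described above.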

What is the difference between the Spark and MapReduce ways of processing data?

Though the underlying concepts of mapping and reducing data are the same in Spark, there are multiple differences in the way MapReduce (MR) and Spark handle the data. The key difference that makes Spark faster is that it doesn't store the intermediate result of a stage on the hard disk; rather, it keeps it in memory.


Image: MapReduce vs Spark Way of Processing Data

Unlike MR, Spark keeps the intermediate result in memory, where it acts as the input for the next step. However, if at any point the available memory in the cluster is less than the memory required to hold the intermediate result, the spill-over data will be written to disk. So RDD data can reside in both RAM and on the hard disk; a portion of the data moves to the hard disk only if there is a spill-over.
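The contrast can be made concrete with a small self-contained sketch (the two-stage pipeline and the use of a temporary JSON file are invented for illustration): the MapReduce-style path writes each stage's output to disk and reads it back, while the Spark-style path simply feeds the in-memory intermediate list into the next stage.

```python
import json
import os
import tempfile

data = list(range(1, 6))

# MapReduce-style: each stage writes its output to disk,
# and the next stage reads it back in.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump([x * 2 for x in data], f)        # stage 1 output -> disk
    stage1_path = f.name
with open(stage1_path) as f:
    mr_result = [x + 1 for x in json.load(f)]  # stage 2 reads from disk
os.remove(stage1_path)

# Spark-style: the intermediate list stays in memory and feeds the next stage.
spark_result = [x + 1 for x in [x * 2 for x in data]]

print(mr_result == spark_result)  # -> True: same answer, but Spark skips the disk round trip
```

Both paths compute the same answer; the difference is only the disk round trip between stages, which is what the paragraph above identifies as Spark's key speed advantage.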

Now, as per the definition, RDD stands for Resilient Distributed Dataset. Each term has a meaning, explained below:

Resilient: As per the dictionary, resilient means "able to recover quickly". RDDs are resilient because they can recover if any of their partitions is lost: a lost partition can be recomputed based on the lineage graph, i.e. the DAG.

Distributed: As mentioned at the very beginning, the data resides in the memory of multiple nodes in a distributed manner when the RDD is materialized.

Dataset: An RDD is a collection of distributed data records.
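The "resilient" property can be illustrated with another toy sketch (the partition dictionaries and `build_partitions` helper are invented for the example): each partition remembers its slice of the source and the transformation applied, so a lost partition can be recomputed from that lineage alone, without redoing the rest of the job.

```python
# Toy sketch (not real Spark code): each partition carries enough lineage
# information (its input slice and its transform) to be rebuilt if lost.
source = list(range(10))
transform = lambda x: x * x

def build_partitions(data, fn, num_partitions=2):
    size = (len(data) + num_partitions - 1) // num_partitions
    slices = [data[i:i + size] for i in range(0, len(data), size)]
    return [{"slice": s, "fn": fn, "result": [fn(x) for x in s]}
            for s in slices]

parts = build_partitions(source, transform)

parts[1]["result"] = None                  # simulate losing one partition
if parts[1]["result"] is None:             # recovery: replay lineage for it only
    parts[1]["result"] = [parts[1]["fn"](x) for x in parts[1]["slice"]]

print(parts[1]["result"])  # -> [25, 36, 49, 64, 81]
```

Note that only the lost partition is recomputed; the other partition's result is untouched, which is what makes lineage-based recovery cheap.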

Features of an RDD

An RDD has the following key features:

1.   Fault Tolerance: If an RDD is lost, or if any of its partitions is lost, it can easily be recovered by recomputation based on the DAG.

2.   Immutable: Once an RDD is created it cannot be modified. This makes RDDs consistent and safe to access across multiple nodes. If you need to modify one, you have to create a new RDD from the existing one.

3.   Lazy Evaluation: An RDD is materialized only when an action is called; otherwise Spark keeps adding the transformations and the resulting RDD information into a lineage graph called the DAG.

4.   In-Memory Processing: Data is processed in memory, i.e. the intermediate result of each stage is stored in memory until there is a memory shortage and a spill-over happens. In case of a spill-over, data is written to disk.

5.   Partitioned: Data is logically partitioned inside an RDD to achieve parallel processing. You can also re-partition an RDD based on your performance requirements.

6.   Location Stickiness: While materializing an RDD, the DAG Scheduler places each RDD partition on the node closest to the data. This means that in most cases a node works on the portion of the data that is already present on it, which reduces data movement over the network and data shuffling.

7.   Persistence: You can persist an RDD. If the same RDD is used multiple times, then to avoid re-computation you can save it in the cache or on the hard disk.
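The partitioning feature above can be sketched with Python's standard thread pool (a toy stand-in, not how Spark schedules tasks internally): the data is split into logical partitions, and a pool of workers processes the partitions in parallel, one task per partition, much as Spark runs one task per RDD partition.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy sketch (not real Spark code): logically partition the data, then let a
# pool of workers process the partitions in parallel, one task per partition.
data = list(range(1, 13))
num_partitions = 3
size = len(data) // num_partitions
partitions = [data[i:i + size] for i in range(0, len(data), size)]

def process_partition(part):
    return sum(x * x for x in part)  # per-partition work

with ThreadPoolExecutor(max_workers=num_partitions) as pool:
    partial_sums = list(pool.map(process_partition, partitions))

print(sum(partial_sums))  # combine the per-partition results
```

Each partition is processed independently, which is what makes both parallelism and per-partition recovery possible.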

How to create an RDD?

There are three ways to create an RDD:

1.   Load a dataset that is present in a file, a table, or any other external storage.

2.   Parallelize a collection: You can pass a list or collection to parallelize() and get back an RDD.

3.   Transform an existing RDD into a new one.

All the above three ways to create an RDD will be explained in detail in next post.
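As a quick preview, the three routes can be mimicked in a self-contained sketch (not real Spark code; in PySpark the corresponding calls are `sc.textFile(path)`, `sc.parallelize(collection)`, and a transformation such as `rdd.map(fn)` on an existing RDD):

```python
import os
import tempfile

# Self-contained stand-in for the three RDD creation routes.

# 1. Load a dataset from external storage (here, a temporary text file),
#    analogous to sc.textFile(path).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("spark\nrdd\ndag\n")
    path = f.name
with open(path) as f:
    lines_rdd = [line.strip() for line in f]
os.remove(path)

# 2. Parallelize an in-memory collection,
#    analogous to sc.parallelize(range(1, 6)).
numbers_rdd = list(range(1, 6))

# 3. Derive a new RDD from an existing one via a transformation,
#    analogous to lines_rdd.map(str.upper).
upper_rdd = [w.upper() for w in lines_rdd]

print(upper_rdd)  # -> ['SPARK', 'RDD', 'DAG']
```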

To learn more about Spark, click here. Let us know your views or feedback on Facebook or Twitter.
