Sunday, March 17, 2019

Introduction to Apache Spark


Apache Spark is a cluster computing system. It is a lightning-fast, in-memory* parallel processing engine. Although it builds on the ideas of Hadoop MapReduce, it can also work independently, without Hadoop (HDFS and MapReduce).
It supports multiple languages, including Java, Python, Scala, and R, and provides high-level APIs in these languages for processing data. It also supports SQL queries.

Why not Hadoop Map-Reduce? What was the need for Apache Spark?

The following are the reasons why the industry needed a framework like Spark:

1.   Hadoop was designed for batch processing:

MapReduce is designed for batch processing, not real-time processing.

2.   MapReduce is slow:

I/O operations are the most expensive operations in data processing. Writing data to disk and reading it back from disk is very costly in terms of time and resources. When a MapReduce job is submitted, it is divided into multiple stages in which multiple mappers and reducers run. The output of every mapper is written to disk, and the same data is then read back from disk by the subsequent mapper or reducer.

For a big job there can be hundreds of mappers. This repeated writing of data to disk by the mappers, and reading of it from disk by the subsequent mapper or reducer, makes Hadoop MapReduce very slow.
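The cost described above can be sketched in plain Python. This is only an illustration, not Hadoop or Spark code: the function names and the JSON temp file are assumptions made for the example. Both versions compute the same word count, but the first one writes the map output to disk and reads it back before reducing, while the second hands the intermediate data over in memory.

```python
import json
import os
import tempfile
from collections import Counter


def map_stage(lines):
    """Emit (word, 1) pairs, as a word-count mapper would."""
    return [(word, 1) for line in lines for word in line.split()]


def reduce_stage(pairs):
    """Sum the counts for each word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count_via_disk(lines):
    """MapReduce-style: the map output is written to disk, then read back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "map_output.json")
        with open(path, "w") as f:
            json.dump(map_stage(lines), f)   # stage boundary: write to disk
        with open(path) as f:
            pairs = json.load(f)             # next stage: read from disk
    return reduce_stage(pairs)


def word_count_in_memory(lines):
    """In-memory style: the intermediate pairs never touch the disk."""
    return reduce_stage(map_stage(lines))


lines = ["spark is fast", "hadoop is batch", "spark is in memory"]
assert word_count_via_disk(lines) == word_count_in_memory(lines)
print(word_count_in_memory(lines)["spark"])  # 2
```

With one small stage the difference is negligible; the point is that a job with many chained stages repeats the write-and-read step at every boundary.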

3.   No single framework for all the different kinds of data processing and analysis:

Depending on the kind of requirement (batch, real-time, or graph processing), you need a different tool and skill set, because the tools available for each are completely different from one another. For example:

a.   Hadoop is designed for batch processing, not real-time processing.
b.   Hive and Impala are for running SQL queries against the data. They were not designed for complex analytical jobs like sentiment analysis or regression analysis.
c.    Apache Storm is for stream processing.

Developers need to be aware of all these different technologies, depending on their use case. There was no single framework or tool that could offer batch processing, real-time stream processing, complex analytics, and graph processing by leveraging just one technology or skill set.

Benefits of Apache Spark and how is it different from other existing processing engines?

The benefits of Apache Spark, and how it differs from other big data solutions, are listed below:

1.   Spark is very fast:

Spark can run some workloads up to 100 times faster than MapReduce jobs, particularly when the data fits in memory. It is lightning fast because of its design and the way it handles and processes data. The key features that make Spark lightning fast are:
a.   In-memory parallel processing*
b.   Lazy Evaluation
c.    Resilient Distributed Dataset (RDD)
d.   DAG (directed acyclic graph) based execution
e.   Catalyst Optimizer
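Lazy evaluation, from the list above, can be illustrated with a small plain-Python sketch. `LazyDataset` is a hypothetical class written for this example and is not part of the Spark API. It only mimics the idea: transformations such as `map` and `filter` are recorded as a plan, and nothing runs until an action (`collect` here) asks for the result. The plan in this sketch is a simple linear chain, whereas Spark builds a full DAG.

```python
class LazyDataset:
    """Hypothetical stand-in for an RDD: records transformations, runs them on demand."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)  # the recorded plan (a linear chain here)

    def map(self, fn):
        # Transformation: only extend the plan, do no work yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: only extend the plan, do no work yet.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # Action: chain the recorded steps as iterators and run them once.
        rows = iter(self._source)
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)


calls = []  # records every element that double() actually processes


def double(x):
    calls.append(x)
    return x * 2


plan = LazyDataset(range(1, 6)).map(double).filter(lambda x: x > 4)
assert calls == []       # nothing has been computed so far

result = plan.collect()  # the action triggers the whole pipeline
print(result)            # [6, 8, 10]
```

Because the steps are chained as iterators, each element flows through the whole pipeline without an intermediate list being built between `map` and `filter`, which is the same reason deferring work lets an engine plan and combine steps before running them.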

2.   Spark is a single framework that can be used for many types of processing:

Spark can be used for batch, real-time, graph, and image processing. It supports multiple languages, including Java, Python, Scala, and R. Developers need to know only one of these four languages in order to work with Spark. Spark also supports SQL.

3.   Reliable solution for enterprises:

Spark has a very fast distributed computation engine at its core. Multiple higher-level components, such as Spark SQL, MLlib (machine learning), and GraphX, are built on top of that core and are tightly integrated with one another. Since all the high-level components are built on top of Spark Core, any improvement in Spark Core benefits all of them.

4.   Cost effective and scalable:

Support for so many high-level components, such as SQL and machine learning, under a single umbrella reduces the problem of procuring and maintaining multiple software products, hardware, and skill sets. Moreover, Spark is free and open source.
