Apache Spark is a cluster computing system and a lightning-fast, in-memory parallel processing engine. Although it grew out of the Hadoop MapReduce model, it can also run independently of Hadoop (HDFS and MapReduce).
It supports multiple languages such as Java, Python, Scala and R, and provides high-level APIs in these languages to process data. It also supports SQL queries.
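As a quick illustration, here is a minimal PySpark sketch (the data and column names are made up for this example) showing the high-level DataFrame API and a SQL query over the same data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("intro-example").getOrCreate()

# High-level DataFrame API
df = spark.createDataFrame(
    [("alice", 34), ("bob", 29)], ["name", "age"]
)
df.filter(df.age > 30).show()

# The same data can also be queried with plain SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```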
Why not Hadoop MapReduce? What was the need for Apache Spark?
The industry needed a framework like Spark for the following reasons:
1. Hadoop was designed for batch processing:
MapReduce is designed for batch processing, not real-time processing.
2. MapReduce is slow:
I/O is the most expensive part of data processing. Writing data to disc and reading it back is very costly in terms of time and resources. When a MapReduce job is submitted, it is divided into multiple stages in which multiple mappers and reducers run. Every mapper's output is written to disc, and the same data is read back from disc by the subsequent mapper or reducer. A big job can involve hundreds of mappers, and this constant writing to and reading from disc makes Hadoop very slow.
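Spark, by contrast, keeps intermediate results in memory. The hypothetical PySpark sketch below (with made-up input lines) chains transformations without any intermediate writes to disc, and caches a result in memory so that two actions can reuse it:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-example").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["spark is fast", "hadoop writes to disc", "spark caches"])

# Chained transformations; nothing is written to disc between these steps.
words = lines.flatMap(lambda line: line.split(" "))
spark_words = words.filter(lambda w: w == "spark").cache()  # keep in memory

# Both actions below reuse the in-memory result of `spark_words`
# instead of recomputing it from scratch.
print(spark_words.count())
print(spark_words.collect())

spark.stop()
```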
3. No single framework for all the different kinds of data processing and analysis:
Depending on the requirement, batch, real-time or graph processing, you needed a different tool and skill set, because the available tools were completely different from one another. For example:
a. Hadoop is designed for batch processing, not real-time processing.
b. Hive and Impala are for running SQL queries against the data; they were not designed for complex analytical jobs like sentiment analysis or regression analysis.
c. Apache Storm is for stream processing.
Developers had to be aware of all these different technologies depending on their use case. There was no single framework or tool that offered batch processing, real-time stream processing, complex analytics and graph processing by itself, leveraging just one technology or skill.
Benefits of Apache Spark, and how is it different from other existing processing engines?
The benefits of Apache Spark, and how it differs from other big data solutions, are listed below:
1. Spark is very fast:
Spark can be 100+ times faster than MapReduce jobs. It is lightning fast because of its design and the way it handles and processes data. The key features that make Spark lightning fast are listed below, with a short example after the list:
a. In-memory parallel processing
b. Lazy evaluation
c. Resilient Distributed Datasets (RDDs)
d. DAG (directed acyclic graph) execution
e. Catalyst optimizer
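The short PySpark sketch below (data generated in-process, no real dataset assumed) illustrates two of these features: transformations are lazy and only build a plan, and explain() prints the physical plan produced by the Catalyst optimizer. Nothing actually executes until the count() action runs:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("lazy-example").getOrCreate()

df = spark.range(1_000_000)                   # no job runs yet
doubled = df.withColumn("x2", col("id") * 2)  # still no job
filtered = doubled.filter(col("x2") > 10)     # still no job

filtered.explain()       # show the optimized plan Catalyst produced
print(filtered.count())  # only now is the DAG actually executed

spark.stop()
```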
2. Spark is a single framework which can be used for any type of processing:
Spark can be used for batch, real-time, graph and image processing. It supports multiple languages, Java, Python, Scala and R, and developers need to know only one of these four languages to work with Spark. Spark also supports SQL.
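The sketch below shows batch and stream processing written against the same DataFrame API in one framework. The socket host and port are illustrative placeholders, not real endpoints (you could feed the stream locally with `nc -lk 9999`):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unified-example").getOrCreate()

# Batch processing with the DataFrame API (in-line data for illustration)
batch_df = spark.createDataFrame(
    [("click", 1), ("view", 2), ("click", 3)], ["event_type", "id"]
)
batch_df.groupBy("event_type").count().show()

# Stream processing with the *same* API over an unbounded source
stream_df = (spark.readStream
             .format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())
query = (stream_df.groupBy("value").count()
         .writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()  # runs until the stream is stopped
```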
3. Reliable solution for enterprise:
All high-level components are tightly integrated in Spark. It has a very fast distributed computation engine at its core, and on top of that core, multiple higher-level components such as Spark SQL, MLlib (for machine learning and AI workloads), GraphX etc. are built. All the components are highly integrated. Since all the high-level components are built on top of Spark Core, any improvement in Spark Core benefits all of them.
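As a small illustration of this integration, in the sketch below a DataFrame (with made-up columns and data) feeds directly into an MLlib linear regression, with both stages sharing one SparkSession and no hand-off between separate tools:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("integration-example").getOrCreate()

df = spark.createDataFrame(
    [(1.0, 2.0, 5.0), (2.0, 3.0, 8.0), (3.0, 4.0, 11.0)],
    ["f1", "f2", "label"],
)

# The DataFrame flows straight into the MLlib pipeline; no export/import
# between tools is needed.
features = VectorAssembler(inputCols=["f1", "f2"],
                           outputCol="features").transform(df)
model = LinearRegression(featuresCol="features", labelCol="label").fit(features)
print(model.coefficients)

spark.stop()
```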
4. Cost effective and scalable:
Support for so many high-level components, SQL, machine learning and more, under a single umbrella reduces the problem of procuring and maintaining multiple pieces of software, hardware and skill sets. Moreover, Spark is free and open source.