Wednesday, March 20, 2019

Scala or Python for Apache Spark?


If you are a beginner in Spark, you must be confused about whether to start with PySpark or Scala Spark. I was in the same situation a couple of years back. Coming from an Oracle PL/SQL and ETL background, I was quite confused in those days about choosing Scala or Python for Spark; both were new languages for me. I was not aware of the market trends and requirements for Apache Spark at that time, so I spent a considerable amount of time asking people and searching Google for the best language for Apache Spark.

This article is based on my experience and on the feedback I received from people in the industry in India and the US. Here are my few cents, which may help you decide whether to choose Python or Scala for Apache Spark.

Language used in Industry for data analysis: -

No language can beat Python in areas like data science, machine learning and deep learning; its presence there is unparalleled. It is a very popular language among data scientists, and it has tons of libraries and open community support, which makes developers' lives much easier.

Google Brain released TensorFlow, an excellent open-source software library for machine learning and neural networks. Python is TensorFlow's primary language, which has further pushed the demand for Python developers in the fields of ML and AI. Python is also used in data mining, scientific research, machine learning, software applications, cross-platform development, business application development and RAD (rapid application development).

Learning Python may broaden the scope of your career.

Performance of PySpark :-

When it comes to the performance of PySpark versus Scala Spark, there are certain areas where Scala Spark performs considerably better. When processing data through the RDD API, PySpark is slower than Scala Spark, because each record has to be serialized between the JVM and a Python worker process. For the same reason, you will also find PySpark slower when you define a complex UDF of your own.

Spark's DataFrame API (stabilized in the 2.0.x releases) provides much better performance because of key enhancements like the Catalyst optimizer. With DataFrames, the performance of Spark is the same whether you use Python or Scala: DataFrame operations in both languages compile down to the same optimized JVM execution plans, so the performance difference is negligible as long as you stick to the built-in operations.

In an actual ETL project you will mostly work with DataFrames, not RDDs. When DataFrames are used, the language choice hardly matters, because PySpark and Scala Spark give the same performance; but yes, sometimes you may still need to work with RDDs.

Trend in Industry: -

Hadoop and Spark deliver a powerful, reliable and cheaper data processing solution. Most of the industry giants are implementing Hadoop and Spark, or re-platforming their existing ETL projects onto them. In almost all the ETL projects I came across, I found lots of Shell, Perl or other scripts running the jobs.

In the last few years the industry has seen a swing towards Python. Python has a gentle learning curve, which is why ETL developers prefer it; for a PL/SQL developer, Python will certainly be the natural choice. It has a wide presence and a rich community, which eases the development process and makes it an excellent alternative to Perl and shell scripts. It supports both functional and object-oriented styles, which makes code easier to write and more robust.
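As a taste of why Python replaces shell and Perl glue so well, here is a minimal sketch of a typical ETL filter step (the file contents and column names are invented for illustration; a real job would read an actual file instead of an in-memory string):

```python
import csv
import io

# Stands in for an input file such as a hypothetical sales.csv.
raw = io.StringIO("id,amount\n1,10\n2,-5\n3,30\n")

reader = csv.DictReader(raw)
# Keep only the rows with a positive amount -- the kind of filter
# often written as a one-line awk or Perl script.
kept = [row for row in reader if int(row["amount"]) > 0]
total = sum(int(row["amount"]) for row in kept)
```

Unlike an awk one-liner, the rows are structured records with named fields, which makes the step easy to extend, test and hand over to another developer.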

Python has seen tremendous growth in the last couple of years and is expected to keep growing at an accelerating rate in the future.

Python is a multipurpose language and it’s easy to learn: - 

Python can open you up to more career options. Python is used in everything, be it app development, web development or data analysis. Python is also very easy to learn; compared with other programming languages, it has a gentle learning curve.

It’s really easy!

Mastering Apache Spark? Is that what you really want? :-

Since Spark itself is written in Scala, knowing Scala lets you understand and even modify Spark's internal code. The Spark community is still growing, so when you are stuck at some point you may have difficulty finding a solution on the internet; sometimes you will have to understand the Spark internals yourself and modify them if required.

If you come across a bug in Spark, you can only fix it yourself if you know Scala. For example, there was an issue where DataFrameWriter.saveAsTable with the Hive format could not create a partitioned table. You can dig into the Spark code and try to fix such an issue, provided you know Scala.

Conclusion: -

1.   If you are a beginner, go for Python. It is easy and has a gentle learning curve. Once you know the language, you can focus entirely on Spark's features and APIs, and Python will make you a good Spark developer in much less time. Once you become good at Spark, you will be in a better position to decide whether to continue your career with PySpark or with Scala. Either way, it won't be difficult to switch to Scala, as you will find similarities in the way code is written in Python and Scala.

2.   In the last three years there has been a trend of migrating ETL projects to Python, so it is a good time to choose Python.

3.   Companies working on data science (with Spark) or on biotech software will prefer you if you know PySpark.

4.   PySpark is really easy.

