Python Now a First-Class Language on Spark, Databricks Says
The Apache Spark community has made major improvements to Python support, turning it into a first-class language rather than a clunky add-on. That is a significant development, given that Python is the world’s most popular programming language.
In the past, Python users were often dismayed by the language’s poor integration with Apache Spark, including a tendency for PySpark jobs to be buggy. However, the folks at Databricks, who lead the development of Apache Spark, took those complaints to heart and pledged to do something about Python’s poor integration and performance with Spark.
The work commenced in 2020 with Project Zen, which set out to provide a more soothing and copacetic experience for Python coders writing Spark jobs. Project Zen has already delivered better integration between Python and Spark. Over the years, various Zen-based features have shipped, including redesigned pandas UDFs and better error reporting in Spark 3.0, and a more Pythonic, user-friendly PySpark in Spark 3.1.
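To make the pandas UDF point concrete, here is a minimal sketch of the Spark 3.0-style pandas UDF, which infers the UDF’s behavior from Python type hints. The SparkSession setup, the column names, and the temperature-conversion function are illustrative assumptions, not examples taken from Databricks.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("zen-pandas-udf").getOrCreate()

# Spark 3.0+ pandas UDF: the Series -> Series type hints tell Spark how to
# apply the function, replacing the older explicit PandasUDFType argument.
@pandas_udf("double")
def fahrenheit_to_celsius(temp_f: pd.Series) -> pd.Series:
    return (temp_f - 32) * 5.0 / 9.0

df = spark.createDataFrame([(32.0,), (98.6,), (212.0,)], ["temp_f"])
df.select(fahrenheit_to_celsius("temp_f").alias("temp_c")).show()
```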
Writing Spark jobs in Scala is the native way of writing it. So that’s the way that Spark is most likely to understand your job, and it’s not going to be as buggy. - Zach Wilson, Airbnb engineer
The work continued through Spark 3.4 and into Spark 4.0, which was released to public preview on June 3. According to Reynold Xin, co-founder and Chief Architect at Databricks, all the investments in Zen are paying off.
Python on Spark is no longer the buggy experience it once was. In fact, Xin says so much improvement has been made that, in some respects, Python has overtaken Scala in capabilities. Several Python features have no Scala equivalent at all, including the ability to define a UDF and use it to connect to arbitrary data sources.
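Xin’s example of a Python capability with no Scala equivalent most likely refers to the Python Data Source API added in Spark 4.0, which lets users define a custom data source entirely in Python and then read from it with the standard DataFrame reader. Below is a minimal sketch under that assumption; the class names, the “greeting” format name, and the hard-coded rows are all illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.datasource import DataSource, DataSourceReader

class GreetingReader(DataSourceReader):
    def read(self, partition):
        # Yield rows that match the schema declared by the data source.
        yield (1, "hello")
        yield (2, "world")

class GreetingDataSource(DataSource):
    @classmethod
    def name(cls):
        return "greeting"  # short name used with spark.read.format(...)

    def schema(self):
        return "id int, word string"

    def reader(self, schema):
        return GreetingReader()

spark = SparkSession.builder.appName("py-datasource-demo").getOrCreate()

# Register the Python-defined source, then read it like any other format.
spark.dataSource.register(GreetingDataSource)
spark.read.format("greeting").load().show()
```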
This slide summarizes a lot of the key important features for PySpark in Spark 3 and Spark 4. And if you look at them, it really tells you Python is no longer just a bolt-on onto Spark, but rather a first-class language. - Reynold Xin
The enhancements undoubtedly will help the PySpark community get more work done. Python was already the most popular language in Spark before the latest batch of improvements. So it’s interesting to note the level of usage that Python-developed jobs are getting on the Databricks platform, which is one of the biggest big data systems on the planet.
According to Xin, an average of 5.5 billion Python-on-Spark-3.3 queries run on Databricks every single day. The comp-sci PhD says that workload, from a single Spark language on a single version of Spark, exceeds the query volume of every other data warehousing platform on the planet.
In conclusion, Python is now a first-class language on Spark, and this development is expected to have a significant impact on the PySpark community. With the continued improvements, Python developers can now write Spark jobs with confidence, knowing that they have a robust and reliable platform to work with.