Last active
March 22, 2019 10:56
-
-
Save andrearota/f77b6a293421a3f26dd5d2fb0a04046e to your computer and use it in GitHub Desktop.
Apache Spark issues with optimizer when using PySpark UDF and complex types
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
from pyspark.sql import SparkSession, functions as f | |
from pyspark.sql.types import IntegerType | |
import sys | |
if __name__ == '__main__': | |
spark = SparkSession.builder.getOrCreate() | |
print('Spark version:', spark.version) | |
print('Python version:', sys.version) | |
foo_udf = f.udf(lambda x: 1, IntegerType()) | |
df = spark.createDataFrame([['bar']]) \ | |
.withColumn('result', foo_udf(f.col('_1'))) \ | |
.withColumn('a', f.col('result')) \ | |
.withColumn('b', f.col('result')) | |
df.explain() | |
df.show() |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Spark version: 2.3.1 | |
Python version: 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)] | |
== Physical Plan == | |
*(1) Project [_1#0, pythonUDF2#16 AS result#2, pythonUDF2#16 AS a#5, pythonUDF2#16 AS b#9] | |
+- BatchEvalPython [<lambda>(_1#0), <lambda>(_1#0), <lambda>(_1#0)], [_1#0, pythonUDF0#14, pythonUDF1#15, pythonUDF2#16] | |
+- Scan ExistingRDD[_1#0] | |
+---+------+---+---+ | |
| _1|result| a| b| | |
+---+------+---+---+ | |
|bar| 1| 1| 1| | |
+---+------+---+---+ |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment