Last active
July 10, 2021 18:01
Save the schema of a Spark DataFrame so that it can be reused when reading JSON files.
from pyspark.sql.types import StructType

# `spark` (SparkSession) and `sc` (SparkContext) are assumed to be already available, as in a notebook session

# read a part of the whole data lake just to extract the schema
part = spark.read.json("s3a://path/to/json/part")

# create a temporary RDD of the schema's fields in order to store the schema as a binary file
temp_rdd = sc.parallelize(part.schema)
temp_rdd.coalesce(1).saveAsPickleFile("s3a://path/to/destination_schema.pickle")

# from now on, the schema is saved
# and can be reused to speed up reading JSON files
schema_rdd = sc.pickleFile("s3a://path/to/destination_schema.pickle")
reading_schema = StructType(schema_rdd.collect())

# passing an explicit schema skips schema inference, so this is quicker than a bare spark.read.json()
your_data_set = spark.read.json("s3a://path/to/entire_data_lake", reading_schema)
@federicobaiocco unfortunately I haven't, sorry. I ran the commands on an AWS EMR cluster using an Apache Zeppelin notebook.
Hey @federicobaiocco, if you add this configuration line it should work in Glue:
from pyspark import SparkContext
sc = SparkContext.getOrCreate()  # or however you obtain your SparkContext
sc._jsc.hadoopConfiguration().set("mapred.output.committer.class", "org.apache.hadoop.mapred.FileOutputCommitter")
Can we store the schema in JSON or text format instead of .pickle, and read it back from that file later? I want to edit the schema file, and a pickle file is neither readable nor editable.
Have you tried it in a Glue job? I am getting an error:
An error occurred while calling o75.saveAsObjectFile. java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.mapred.DirectOutputCommitter not found