Tests that after Spark overwrites partitioned Parquet files on S3/MinIO,
the date-level partition prefixes remain visible in delimited ListObjectsV2
so that Spark's partition discovery still works correctly.
- Generates two sample CSV files (`batch1.csv`, `batch2.csv`) with the same schema and the same date partitions, but different values.
- Writes `batch1` as partitioned Parquet (`append` mode), creating:
  - `s3a://your-bucket/data/transactions/date=2024-01-01/part-00000.parquet`
  - `s3a://your-bucket/data/transactions/date=2024-01-02/part-00000.parquet`
- Writes `batch2` to the same path with `overwrite` mode; Spark deletes the old `.parquet` files and writes new ones under the same date prefixes.
- Reads the Parquet back and asserts the partition folders are still discoverable. If the object store's delimited listing is broken, Spark silently returns an empty dataset.
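The listing behavior the test depends on can be modeled in a few lines of plain Python. This is a simplified sketch of how a delimited `ListObjectsV2` call derives `CommonPrefixes` from the object keys that currently exist, not a real S3 call — the point is that the `date=` prefixes are derived from the live keys, so they must reappear after an overwrite replaces the objects:

```python
def common_prefixes(keys, prefix, delimiter="/"):
    """Mimic how a delimited listing derives CommonPrefixes: for every key
    under `prefix`, truncate at the first delimiter past the prefix."""
    found = set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            found.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
    return sorted(found)

# Before the overwrite: two part files under two date prefixes.
before = [
    "data/transactions/date=2024-01-01/part-00000.parquet",
    "data/transactions/date=2024-01-02/part-00000.parquet",
]
# After the overwrite: old objects deleted, new part files written.
after = [
    "data/transactions/date=2024-01-01/part-00000-new.parquet",
    "data/transactions/date=2024-01-02/part-00000-new.parquet",
]

# Both listings surface the same two date= common prefixes.
assert common_prefixes(before, "data/transactions/") == \
       common_prefixes(after, "data/transactions/")
```

Spark's partition discovery walks exactly these prefixes; a store that fails to re-surface them after the delete-and-rewrite is what this test is designed to catch.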
| Requirement | Version |
|---|---|
| Python | 3.8+ |
| PySpark | 3.3+ |
```shell
pip install pyspark
```

The S3A jars are resolved automatically via `--packages`; no manual jar downloads are needed. `hadoop-aws` pulls in `aws-java-sdk-bundle` as a transitive dependency from Maven Central.
Edit the SparkSession block at the top of the script to point at your
MinIO instance:
```python
spark = SparkSession.builder \
    .appName("parquet-overwrite-test") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000") \
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin") \
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .getOrCreate()
```

Also update `BASE_PATH` to a bucket that already exists on your MinIO:
```python
BASE_PATH = "s3a://your-bucket/data/transactions"
```

Then run:
```shell
spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.4 spark_parquet_overwrite_test.py
```

(Python scripts are launched with `spark-submit`; the `pyspark` command only starts an interactive shell and will refuse a script argument.)

To run against AWS S3 instead of MinIO, replace the endpoint and key settings with the default credentials provider chain:

```python
spark = SparkSession.builder \
    .appName("parquet-overwrite-test") \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.DefaultAWSCredentialsProviderChain") \
    .getOrCreate()
```

Credentials are picked up from `~/.aws/credentials` or the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`.
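With the default provider chain, the credentials can also be injected from the driver script itself before the SparkSession is created, since the JVM that Spark launches inherits the Python process's environment. The key values below are placeholders, not real credentials:

```python
import os

# Placeholder credentials -- substitute real values, or omit this entirely
# and rely on ~/.aws/credentials. The DefaultAWSCredentialsProviderChain
# reads these environment variables from the JVM, which inherits them
# from this Python driver process.
os.environ["AWS_ACCESS_KEY_ID"] = "your-access-key"
os.environ["AWS_SECRET_ACCESS_KEY"] = "your-secret-key"
```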
```shell
spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.4 spark_parquet_overwrite_test.py
```

After the overwrite, both reads should return data, not empty results:
```
+---+-----+------+----------+
| id| name|amount|      date|
+---+-----+------+----------+
|  1|Alice|   110|2024-01-01|
|  2|  Bob|   220|2024-01-01|
...

+----------+
|      date|
+----------+
|2024-01-01|
|2024-01-02|
+----------+
```
Empty output from the partition listing means the object store is not returning the `date=XXXX/` common prefixes correctly after the overwrite.
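If the read does come back empty, the listing can be inspected directly, independent of Spark. A small helper along these lines (a hypothetical name, written against the boto3-style `list_objects_v2` interface) reports which prefixes the store actually surfaces:

```python
def list_partition_prefixes(s3_client, bucket, prefix):
    """Collect the CommonPrefixes a delimited ListObjectsV2 listing returns
    under `prefix`, following continuation tokens across pages."""
    if not prefix.endswith("/"):
        prefix += "/"
    prefixes, token = [], None
    while True:
        kwargs = {"Bucket": bucket, "Prefix": prefix, "Delimiter": "/"}
        if token:
            kwargs["ContinuationToken"] = token
        resp = s3_client.list_objects_v2(**kwargs)
        prefixes.extend(p["Prefix"] for p in resp.get("CommonPrefixes", []))
        if not resp.get("IsTruncated"):
            return prefixes
        token = resp["NextContinuationToken"]
```

Called with a boto3 client pointed at the MinIO endpoint (e.g. `boto3.client("s3", endpoint_url="http://localhost:9000", aws_access_key_id="minioadmin", aws_secret_access_key="minioadmin")`), a healthy store should still list both `date=2024-01-01/` and `date=2024-01-02/` after the overwrite.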