Spark is NOT:
- just a SQL engine
- just a query engine
Spark IS:
A general-purpose distributed data processing engine capable of ETL, analytics, and machine learning.
Spark SQL
↓
Catalyst Optimizer
↓
Physical Execution Plan
↓
Parquet / HDFS
Spark uses the Hive Metastore for:
- Table schemas
- Partition info
- Storage locations
Spark does not store data in the metastore.
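A minimal sketch of wiring a Spark session to an existing metastore (the app name is illustrative; the metastore connection itself comes from hive-site.xml):

from pyspark.sql import SparkSession

# Enable Hive support so Spark can read table schemas, partition info,
# and storage locations from the Hive Metastore.
spark = (
    SparkSession.builder
    .appName("metastore-demo")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("SHOW TABLES").show()  # metadata served by the metastore; data stays in files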
spark.table("orders")

Steps:
- Load table metadata
- Discover partitions
- Build logical plan
- Optimize via Catalyst
- Generate physical plan
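You can watch every stage of that planning with explain(); a minimal sketch, assuming an orders table is registered in the metastore:

# Prints the parsed, analyzed, optimized (Catalyst), and physical plans
spark.table("orders").explain(extended=True)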
Catalyst performs:
- Predicate pushdown
- Column pruning
- Constant folding
- Join reordering
All before execution.
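A sketch of these rules firing (column names are illustrative):

from pyspark.sql import functions as F

query = (
    spark.table("orders")
         .where(F.col("amount") > F.lit(100) + F.lit(400))  # constant folding: 100 + 400 becomes 500
         .select("order_id", "amount")                      # column pruning: only two columns survive
)

# The optimized logical plan shows the folded literal and the filter
# pushed down toward the scan
query.explain(True)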
Predicate:
WHERE country = 'US'

Result:
- Only country=US directories scanned
- Others skipped entirely
This happens before execution starts.
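A sketch of partition pruning in action (the path and layout are hypothetical):

# Table layout on disk: /warehouse/orders/country=US/..., /warehouse/orders/country=CA/...
df = spark.read.parquet("/warehouse/orders")

us_orders = df.where("country = 'US'")

# The scan node's PartitionFilters entry confirms only country=US
# directories will be listed and read
us_orders.explain()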
Spark reads Parquet as:
- Row groups
- Column chunks
- Pages
Only required columns are loaded into memory.
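You can inspect that physical layout directly; a sketch using pyarrow (assumed installed) against one output file:

import pyarrow.parquet as pq

pf = pq.ParquetFile("part-00000.snappy.parquet")  # hypothetical file name
meta = pf.metadata

print(meta.num_row_groups)                  # row groups in the file
rg = meta.row_group(0)
print(rg.num_columns)                       # column chunks in this row group
print(rg.column(0).total_compressed_size)   # bytes one column chunk occupies on disk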
Spark pushes:
- Numeric filters
- String equality
- Range filters
Filters applied:
- At row group level
- At page level
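Pushdown is on by default; a sketch that makes the knob explicit and shows where to verify it (path is hypothetical):

# Explicitly enable Parquet filter pushdown (true is already the default)
spark.conf.set("spark.sql.parquet.filterPushdown", "true")

scan = spark.read.parquet("/warehouse/orders").where("amount > 500")

# Look for PushedFilters: [IsNotNull(amount), GreaterThan(amount,500)] in the scan node
scan.explain()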
Spark:
- Decompresses data
- Converts rows to the compact Tungsten binary format
- Operates in memory
This allows:
- Complex transformations
- Joins
- Aggregations
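A sketch of that kind of in-memory work (table and column names are illustrative):

from pyspark.sql import functions as F

orders = spark.table("orders")
customers = spark.table("customers")

revenue_by_country = (
    orders.join(customers, "customer_id")         # join over decompressed in-memory rows
          .groupBy("country")
          .agg(F.sum("amount").alias("revenue"))  # aggregation in Tungsten's binary format
)

revenue_by_country.show()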
spark.sql("INSERT INTO orders SELECT ...")Spark:
- Executes transformations
- Buffers rows
- Forms row groups
- Writes column chunks
- Writes Parquet footer
Produces files such as:
part-00000.snappy.parquet
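A sketch of the equivalent DataFrame write path (source table and output path are hypothetical):

result = spark.table("staging_orders")  # hypothetical source

(
    result.write
          .mode("append")
          .option("compression", "snappy")  # yields part-*.snappy.parquet files
          .parquet("/warehouse/orders")
)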
Compared with classic Hive writers, Spark's Parquet writes are:
- Faster
- More configurable
- Better at handling small files
But:
- Still immutable
- Still file-based
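Because the files are immutable, file count has to be managed at write time; a sketch using repartition (the target count is illustrative):

df = spark.table("orders")

# Eight tasks write eight Parquet files instead of one file per original partition
df.repartition(8).write.mode("overwrite").parquet("/warehouse/orders_compacted")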
Native Spark + Parquet:
- ❌ No UPDATE
- ❌ No DELETE
Row-level UPDATE and DELETE become possible with table formats:
- Iceberg
- Delta Lake
- Hudi
Spark is the primary engine for these formats.
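A sketch of row-level changes through Delta Lake, assuming the delta-spark package is installed and the table was written in Delta format (path is hypothetical):

from delta.tables import DeltaTable

orders = DeltaTable.forPath(spark, "/warehouse/orders_delta")

# Row-level operations that native Spark + Parquet cannot do
orders.delete("country = 'XX'")
orders.update(
    condition="order_id = 42",
    set={"amount": "amount * 1.1"},
)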
Spark is ideal for:
- ETL pipelines
- Large transformations
- Machine learning
- Writing Parquet
Less ideal for:
- Low-latency BI
Spark is the workhorse that transforms data and writes Parquet, while also being able to query it efficiently.
Hive Metastore → Spark Planner → Catalyst → Parquet