Spark is NOT:
- just a SQL engine
- just a query engine
Spark IS:
A general-purpose distributed data processing engine capable of ETL, analytics, and machine learning.
Spark SQL
↓
Catalyst Optimizer
↓
Physical Execution Plan
↓
Parquet / HDFS
Spark uses the Hive Metastore for:
- Table schemas
- Partition info
- Storage locations
Spark does not store data in the metastore.
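A minimal sketch of wiring a Spark session to an existing metastore (the app name is illustrative; the metastore connection itself comes from hive-site.xml):

from pyspark.sql import SparkSession

# Enable Hive support so Spark can read table schemas, partition info,
# and storage locations from the Hive Metastore.
spark = (
    SparkSession.builder
    .appName("metastore-demo")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("SHOW TABLES").show()  # metadata served by the metastore; data stays in files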
spark.table("orders")

Steps:
- Load table metadata
- Discover partitions
- Build logical plan
- Optimize via Catalyst
- Generate physical plan
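You can watch every stage of that planning with explain(); a minimal sketch, assuming an orders table is registered in the metastore:

# Prints the parsed, analyzed, optimized (Catalyst), and physical plans
spark.table("orders").explain(extended=True)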
Catalyst performs:
- Predicate pushdown
- Column pruning
- Constant folding
- Join reordering
All before execution.
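A sketch of these rules firing (column names are illustrative):

from pyspark.sql import functions as F

query = (
    spark.table("orders")
         .where(F.col("amount") > F.lit(100) + F.lit(400))  # constant folding: 100 + 400 becomes 500
         .select("order_id", "amount")                      # column pruning: only two columns survive
)

# The optimized logical plan shows the folded literal and the filter
# pushed down toward the scan
query.explain(True)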
Predicate:
WHERE country = 'US'

Result:
- Only country=US directories scanned
- Others skipped entirely
This happens before execution starts.
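A sketch of partition pruning in action (the path and layout are hypothetical):

# Table layout on disk: /warehouse/orders/country=US/..., /warehouse/orders/country=CA/...
df = spark.read.parquet("/warehouse/orders")

us_orders = df.where("country = 'US'")

# The scan node's PartitionFilters entry confirms only country=US
# directories will be listed and read
us_orders.explain()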
Spark reads Parquet as:
- Row groups
- Column chunks
- Pages
Only required columns are loaded into memory.
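You can inspect that physical layout directly; a sketch using pyarrow (assumed installed) against one output file:

import pyarrow.parquet as pq

pf = pq.ParquetFile("part-00000.snappy.parquet")  # hypothetical file name
meta = pf.metadata

print(meta.num_row_groups)                  # row groups in the file
rg = meta.row_group(0)
print(rg.num_columns)                       # column chunks in this row group
print(rg.column(0).total_compressed_size)   # bytes one column chunk occupies on disk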
Spark pushes:
- Numeric filters
- String equality
- Range filters
Filters applied:
- At row group level
- At page level
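Pushdown is on by default; a sketch that makes the knob explicit and shows where to verify it (path is hypothetical):

# Explicitly enable Parquet filter pushdown (true is already the default)
spark.conf.set("spark.sql.parquet.filterPushdown", "true")

scan = spark.read.parquet("/warehouse/orders").where("amount > 500")

# Look for PushedFilters: [IsNotNull(amount), GreaterThan(amount,500)] in the scan node
scan.explain()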
Spark:
- Decompresses data
- Converts rows to the compact Tungsten binary format
- Operates in memory
This allows:
- Complex transformations
- Joins
- Aggregations
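A sketch of that kind of in-memory work (table and column names are illustrative):

from pyspark.sql import functions as F

orders = spark.table("orders")
customers = spark.table("customers")

revenue_by_country = (
    orders.join(customers, "customer_id")         # join over decompressed in-memory rows
          .groupBy("country")
          .agg(F.sum("amount").alias("revenue"))  # aggregation in Tungsten's binary format
)

revenue_by_country.show()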
spark.sql("INSERT INTO orders SELECT ...")Spark:
- Executes transformations
- Buffers rows
- Forms row groups
- Writes column chunks
- Writes Parquet footer
Produces files such as:
part-00000.snappy.parquet
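A sketch of the equivalent DataFrame write path (source table and output path are hypothetical):

result = spark.table("staging_orders")  # hypothetical source

(
    result.write
          .mode("append")
          .option("compression", "snappy")  # yields part-*.snappy.parquet files
          .parquet("/warehouse/orders")
)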
Compared with classic Hive writers, Spark's Parquet writes are:
- Faster
- More configurable
- Better at handling small files
But:
- Still immutable
- Still file-based
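Because the files are immutable, file count has to be managed at write time; a sketch using repartition (the target count is illustrative):

df = spark.table("orders")

# Eight tasks write eight Parquet files instead of one file per original partition
df.repartition(8).write.mode("overwrite").parquet("/warehouse/orders_compacted")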
Native Spark + Parquet:
- ❌ No UPDATE
- ❌ No DELETE
Row-level UPDATE and DELETE become possible with table formats:
- Iceberg
- Delta Lake
- Hudi
Spark is the primary engine for these formats.
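A sketch of row-level changes through Delta Lake, assuming the delta-spark package is installed and the table was written in Delta format (path is hypothetical):

from delta.tables import DeltaTable

orders = DeltaTable.forPath(spark, "/warehouse/orders_delta")

# Row-level operations that native Spark + Parquet cannot do
orders.delete("country = 'XX'")
orders.update(
    condition="order_id = 42",
    set={"amount": "amount * 1.1"},
)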
Spark is ideal for:
- ETL pipelines
- Large transformations
- Machine learning
- Writing Parquet
Less ideal for:
- Low-latency BI
Spark is the workhorse that transforms data and writes Parquet, while also being able to query it efficiently.
Hive Metastore → Spark Planner → Catalyst → Parquet