Below is a systems-level explanation of how Apache Spark works with Hive tables and Parquet files. This document focuses on data writing, query execution, and optimization, and is intentionally kept separate from the companion notes on Hive and Impala.

1. What Spark Is (And Is Not)

Spark is NOT:

  • just a SQL engine
  • just a query engine

Spark IS:

A general-purpose distributed data processing engine capable of ETL, analytics, and machine learning.

Spark SQL
   ↓
Catalyst Optimizer
   ↓
Physical Execution Plan
   ↓
Parquet / HDFS

2. Spark and Hive Metastore

Spark uses the Hive Metastore for:

  • Table schemas
  • Partition info
  • Storage locations

Spark does not store data in the metastore.
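
A minimal sketch of connecting Spark to an existing Hive Metastore (the metastore URI below is an illustrative placeholder):

```python
from pyspark.sql import SparkSession

# Enable Hive support so Spark resolves table schemas, partitions, and
# storage locations through the Hive Metastore (URI is a placeholder).
spark = (
    SparkSession.builder
    .appName("hive-metastore-example")
    .config("hive.metastore.uris", "thrift://metastore-host:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# Metadata comes from the metastore; the data itself stays in Parquet files on HDFS/S3.
spark.sql("SHOW TABLES IN default").show()
```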


3. Reading a Hive Parquet Table in Spark

spark.table("orders")

Steps:

  1. Load table metadata
  2. Discover partitions
  3. Build logical plan
  4. Optimize via Catalyst
  5. Generate physical plan
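
A small sketch for inspecting these steps on the orders table: the table call only builds a logical plan, and explain shows the Catalyst output before any data is scanned.

```python
# Resolve the table through the metastore and build a DataFrame.
# Nothing is read yet because Spark evaluates lazily.
orders = spark.table("orders")

# Print the parsed, optimized, and physical plans produced by Catalyst.
orders.explain(extended=True)
```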

4. Catalyst Optimizer (Key Difference)

Catalyst performs:

  • Predicate pushdown
  • Column pruning
  • Constant folding
  • Join reordering

All before execution.
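
A minimal sketch showing these optimizations in the physical plan (the column names are assumptions for illustration):

```python
# Catalyst prunes unused columns and pushes the filter toward the Parquet scan.
df = (
    spark.table("orders")
         .filter("amount > 100")
         .select("order_id", "amount")
)

# The FileScan node in the plan shows PushedFilters and a pruned ReadSchema.
df.explain()
```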


5. Partition Pruning

Predicate:

WHERE country = 'US'

Result:

  • Only country=US directories scanned
  • Others skipped entirely

This happens before execution starts.
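
A sketch assuming orders is partitioned by country; the partition filter is resolved against metastore partition metadata, so only matching directories are listed and scanned:

```python
# Only country=US partition directories are scanned; all others are skipped
# during planning, before any executor task starts.
us_orders = spark.sql("SELECT * FROM orders WHERE country = 'US'")
us_orders.explain()   # look for PartitionFilters in the FileScan node
```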


6. Parquet Read Path

Spark reads Parquet as:

  • Row groups
  • Column chunks
  • Pages

Only required columns are loaded into memory.
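
A minimal sketch of column pruning: selecting two columns means only those column chunks are decoded, and the rest of each row group is never read (column names are illustrative).

```python
# Only the 'order_id' and 'amount' column chunks are read from each row group;
# the other columns stored in the Parquet files are skipped entirely.
spark.table("orders").select("order_id", "amount").show(5)
```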


7. Predicate Pushdown into Parquet

Spark pushes:

  • Numeric filters
  • String equality
  • Range filters

Filters are applied:

  • At row group level
  • At page level
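
Parquet filter pushdown is controlled by a Spark setting (enabled by default); a minimal sketch with an assumed numeric column:

```python
# Ensure Parquet filter pushdown is enabled (it is on by default).
spark.conf.set("spark.sql.parquet.filterPushdown", "true")

# The range filter is evaluated against row-group and page statistics inside
# the Parquet reader, so non-matching row groups are skipped without decoding.
df = spark.table("orders").filter("amount >= 100 AND amount < 500")
df.explain()   # PushedFilters will list the range predicates on 'amount'
```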

8. In-Memory Processing (Why Spark Is Flexible)

Spark:

  • Decompresses data
  • Converts to Tungsten format
  • Operates in memory

This allows:

  • Complex transformations
  • Joins
  • Aggregations
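
A sketch of a pipeline that runs on decompressed, in-memory data in Spark's internal format; the table and column names are assumptions:

```python
from pyspark.sql import functions as F

# Joins and aggregations operate on Spark's internal in-memory representation,
# not on the on-disk Parquet encoding.
orders = spark.table("orders")
customers = spark.table("customers")

revenue_per_customer = (
    orders.join(customers, "customer_id")
          .groupBy("customer_id")
          .agg(F.sum("amount").alias("total_revenue"))
)
revenue_per_customer.show()
```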

9. Writing Parquet Files with Spark

spark.sql("INSERT INTO orders SELECT ...")

Spark:

  1. Executes transformations
  2. Buffers rows
  3. Forms row groups
  4. Writes column chunks
  5. Writes Parquet footer

Produces:

part-00000.snappy.parquet
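
Besides SQL inserts, the DataFrame writer produces the same Parquet layout; a minimal sketch (the output path and compression choice are illustrative):

```python
# Example result to persist (the source table is just illustrative).
result_df = spark.table("orders").filter("amount > 0")

# Write Snappy-compressed Parquet; each task writes its own
# part-*.snappy.parquet file with row groups, column chunks, and a footer.
(
    result_df.write
        .mode("append")
        .option("compression", "snappy")
        .parquet("hdfs:///warehouse/orders_out")
)
```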

10. Spark vs Hive Writes

Compared with Hive, Spark writes are:

  • Faster
  • More configurable
  • Better small-file handling

But:

  • Still immutable
  • Still file-based
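
Small-file handling mostly comes down to controlling the number of output tasks; a hedged sketch using repartition (the target file count and path are assumptions):

```python
# Reduce the number of output files by repartitioning before the write;
# once written, the Parquet files are still immutable.
(
    spark.table("orders")
         .repartition(8)               # roughly 8 output files instead of one per task
         .write
         .mode("overwrite")
         .parquet("hdfs:///warehouse/orders_compacted")
)
```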

11. Updates and Deletes in Spark

Native Spark + Parquet:

  • ❌ No UPDATE
  • ❌ No DELETE

With table formats:

  • Iceberg
  • Delta Lake
  • Hudi

Spark is the primary engine for these formats.
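
With a table format such as Delta Lake, row-level operations become available through Spark SQL; a sketch assuming a table named orders_delta created with USING DELTA and the Delta Lake package configured on the cluster:

```python
# These statements fail on a plain Parquet table but work on a Delta table,
# because the table format tracks file-level changes in a transaction log.
spark.sql("UPDATE orders_delta SET status = 'shipped' WHERE order_id = 42")
spark.sql("DELETE FROM orders_delta WHERE status = 'cancelled'")
```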


12. When to Use Spark

Spark is ideal for:

  • ETL pipelines
  • Large transformations
  • Machine learning
  • Writing Parquet

Less ideal for:

  • Low-latency BI

13. Mental Model

Spark is the workhorse that transforms data and writes Parquet, while also being able to query it efficiently.

Hive Metastore → Spark Planner → Catalyst → Parquet