Below is a step-by-step, end-to-end walk-through of how a Parquet file is formed, stored, and queried in a Hadoop ecosystem, starting from a raw dataset and ending at query execution.

1. Example Dataset (Logical View)

Assume this dataset comes from an ingestion job:

order_id  country  product  amount
1         US       Book     120
2         IN       Pen      20
3         US       Book     300
4         IN       Pencil   10
5         US       Book     150

Schema:

order_id INT
country STRING
product STRING
amount INT
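
For reference in the later steps, here is a minimal sketch that builds this dataset as an in-memory table with pyarrow (used purely for illustration throughout these sketches; Spark, Hive, or Impala hold the equivalent rows internally):

import pyarrow as pa

# The logical view the Parquet writer starts from: five rows, four columns.
orders = pa.table({
    "order_id": pa.array([1, 2, 3, 4, 5], type=pa.int32()),
    "country":  pa.array(["US", "IN", "US", "IN", "US"]),
    "product":  pa.array(["Book", "Pen", "Book", "Pencil", "Book"]),
    "amount":   pa.array([120, 20, 300, 10, 150], type=pa.int32()),
})
print(orders.schema)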

2. Insert Operation (What “INSERT INTO” Really Means)

Example:

INSERT INTO TABLE orders
SELECT * FROM staging_orders;

What actually happens

There is no row-by-row insert into Parquet.

Instead:

  1. The execution engine (Spark / Hive / Impala) runs a distributed write job
  2. Each task produces one or more Parquet files
  3. The files are immutable and written exactly once
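
As a rough illustration, the same insert expressed in PySpark (assuming both tables are already registered in a Hive metastore); the resulting job is exactly the distributed write described above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# "INSERT INTO orders SELECT * FROM staging_orders" as a DataFrame write:
# each task writes one or more immutable part-*.parquet files.
spark.table("staging_orders").write.insertInto("orders")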

3. In-Memory Row Buffering

Execution engine:

  • Reads rows
  • Buffers them in memory
  • Groups them into row groups

Typical row group size:

128 MB (default)

Our example is tiny, so assume:

Row Group 1 contains all 5 rows
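
The row group size is a writer setting. In pyarrow it is a row count (row_group_size); parquet-mr and Spark use a byte-based parquet.block.size that defaults to 128 MB. A sketch, reusing the orders table built in step 1:

import pyarrow.parquet as pq

# Cap the number of rows per row group; our 5 rows fit easily,
# so the file ends up with a single row group.
pq.write_table(orders, "orders.parquet", row_group_size=1_000_000)

print(pq.ParquetFile("orders.parquet").metadata.num_row_groups)  # -> 1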

4. Row Group Formation (Horizontal Split)

Row Group = horizontal partition

Row Group 1:
(1, US, Book, 120)
(2, IN, Pen, 20)
(3, US, Book, 300)
(4, IN, Pencil, 10)
(5, US, Book, 150)

Row groups are independent units for:

  • Parallelism
  • Skipping data
  • Compression

5. Column Chunk Formation (Vertical Split)

Inside each row group, data is split by column.

Row Group 1
 ├── order_id column chunk
 ├── country column chunk
 ├── product column chunk
 └── amount column chunk

Actual stored values

order_id → [1, 2, 3, 4, 5]
country  → [US, IN, US, IN, US]
product  → [Book, Pen, Book, Pencil, Book]
amount   → [120, 20, 300, 10, 150]

Each column chunk is written contiguously on disk.
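
The column chunks and their positions can be read straight from the footer metadata; a sketch against the orders.parquet file written in step 3:

import pyarrow.parquet as pq

rg = pq.ParquetFile("orders.parquet").metadata.row_group(0)

# One column chunk per column, each stored contiguously in the file.
for i in range(rg.num_columns):
    col = rg.column(i)
    print(col.path_in_schema, col.file_offset, col.total_compressed_size)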


6. Page Creation (Smallest Physical Unit)

Column chunks are further split into pages (default ~1MB).

Example: country column chunk

Page 1:
US, IN, US, IN, US

Each page contains:

  • Encoded values
  • Optional dictionary
  • Compression
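
The page size is another writer knob; in pyarrow it is data_page_size, an approximate upper bound in bytes. A sketch, again reusing the orders table from step 1:

import pyarrow.parquet as pq

# Keep each data page at roughly <= 1 MB; our tiny columns become one small page each.
pq.write_table(orders, "orders.parquet", data_page_size=1024 * 1024)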

7. Encoding (How Values Become Bytes)

Dictionary Encoding (Strings)

country:

Dictionary:
0 → US
1 → IN

Data:
[0, 1, 0, 1, 0]

product:

Dictionary:
0 → Book
1 → Pen
2 → Pencil

Data:
[0, 1, 0, 2, 0]

Integer Encoding

amount:

[120, 20, 300, 10, 150]
→ Bit-packed / RLE (or delta / plain, depending on the writer)

Encoding dramatically reduces size.
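
The dictionary step itself is simple enough to sketch in plain Python (illustrative only; real writers do this per column chunk in native code):

def dictionary_encode(values):
    """Replace each value with an index into a deduplicated dictionary."""
    dictionary, indices = [], []
    positions = {}                      # value -> index in the dictionary
    for v in values:
        if v not in positions:
            positions[v] = len(dictionary)
            dictionary.append(v)
        indices.append(positions[v])
    return dictionary, indices

print(dictionary_encode(["US", "IN", "US", "IN", "US"]))
# (['US', 'IN'], [0, 1, 0, 1, 0])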


8. Compression (After Encoding)

Encoded pages are compressed:

Common codecs:

  • Snappy (default)
  • GZIP
  • ZSTD

Result:

Encoded + compressed byte stream
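
The codec is chosen per write; a sketch with pyarrow (ZSTD availability depends on how pyarrow was built):

import pyarrow.parquet as pq

# Snappy is the common default; ZSTD trades more CPU for a better ratio.
pq.write_table(orders, "orders_snappy.parquet", compression="snappy")
pq.write_table(orders, "orders_zstd.parquet", compression="zstd")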

9. Metadata Generation (Critical for Query Speed)

Page-level metadata

  • Number of values
  • Encoding type
  • Compressed size

Column chunk metadata

  • min / max values
  • null count
  • value count

Example:

amount:
  min = 10
  max = 300

Row group metadata

  • Total rows
  • Total size
  • Column offsets
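
All of these statistics are readable without touching a single data page; a sketch against orders.parquet:

import pyarrow.parquet as pq

rg = pq.ParquetFile("orders.parquet").metadata.row_group(0)
print("rows:", rg.num_rows, "bytes:", rg.total_byte_size)

stats = rg.column(3).statistics     # the amount column chunk
print(stats.min, stats.max, stats.null_count, stats.num_values)
# 10 300 0 5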

10. File Footer (Written Last)

Parquet writes metadata at the end of the file.

Footer contains:

  • Full schema
  • Row group locations
  • Column chunk offsets
  • Statistics

Why is the footer at the end? → The writer doesn’t know the final offsets until all the data has been written.
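
Reading only the footer is cheap, and pq.read_metadata does exactly that; a sketch:

import pyarrow.parquet as pq

# Reads just the footer: schema, row group locations, column chunk offsets,
# and statistics. No data pages are touched.
footer = pq.read_metadata("orders.parquet")
print(footer.num_rows, footer.num_row_groups)
print(footer.schema)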


11. Final Physical File Layout

[Magic Bytes]
[Row Group 1]
  [order_id column chunk]
  [country column chunk]
  [product column chunk]
  [amount column chunk]
[Footer Metadata]
[Magic Bytes]

This entire file is then split into HDFS blocks.
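
The magic bytes are the 4-byte ASCII string PAR1 at both ends of the file; a quick check:

# A Parquet file begins and ends with the magic value b"PAR1".
with open("orders.parquet", "rb") as f:
    data = f.read()
print(data[:4], data[-4:])   # b'PAR1' b'PAR1'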


12. Query Execution (SELECT)

Example:

SELECT order_id, amount
FROM orders
WHERE country = 'US'
AND amount > 100;

13. How Query Engine Reads Parquet

Step 1: Read Footer

  • Schema
  • Row group metadata

Step 2: Row Group Pruning

Check statistics:

country.min = IN
country.max = US
amount.min = 10
amount.max = 300

Row group cannot be skipped → read it
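
Conceptually the pruning decision is just a comparison between the predicate and the footer statistics; a sketch of the amount > 100 check (query engines do this automatically):

import pyarrow.parquet as pq

meta = pq.ParquetFile("orders.parquet").metadata
for i in range(meta.num_row_groups):
    stats = meta.row_group(i).column(3).statistics   # the amount column
    # If every value in the row group is <= 100, the whole group can be skipped.
    if stats is not None and stats.max <= 100:
        print(f"row group {i}: skip")
    else:
        print(f"row group {i}: must read")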


14. Column Pruning

Only the required columns are read:

order_id
amount
country   (needed only to evaluate the filter)

The product column chunk is never read.
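
With pyarrow, column pruning is just the columns argument; a sketch:

import pyarrow.parquet as pq

# Only the listed column chunks are read from disk; product is never touched.
table = pq.read_table("orders.parquet",
                      columns=["order_id", "amount", "country"])
print(table.column_names)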


15. Predicate Pushdown

country = 'US':

  • Dictionary scan
  • Page-level filtering

amount > 100:

  • Page skipped if max <= 100
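
Predicate pushdown maps to the filters argument, which pyarrow evaluates against the statistics so that row groups and pages can be skipped where possible; a sketch of the full query:

import pyarrow.parquet as pq

# SELECT order_id, amount FROM orders WHERE country = 'US' AND amount > 100
result = pq.read_table(
    "orders.parquet",
    columns=["order_id", "amount"],
    filters=[("country", "=", "US"), ("amount", ">", 100)],
)
print(result.to_pydict())
# {'order_id': [1, 3, 5], 'amount': [120, 300, 150]}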

16. Page Decoding & Row Reconstruction

Engine:

  1. Decompress pages
  2. Decode values
  3. Apply filters
  4. Reconstruct rows

Result:

(1, 120)
(3, 300)
(5, 150)

17. Insert / Update / Delete Reality

Insert

✔ Append new Parquet files

Update / Delete

❌ No in-place modification

Instead:

  • Rewrite files
  • Partition overwrite
  • Use Iceberg / Delta / Hudi
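
For example, a plain-Parquet “delete” is really a filter-and-rewrite into new immutable files; a rough PySpark sketch with illustrative paths (table formats like Iceberg / Delta / Hudi manage this rewrite for you):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Keep everything except the rows to "delete", then write a fresh set of files.
kept = spark.read.parquet("/orders").where("country != 'IN'")
kept.write.mode("overwrite").parquet("/orders_rewritten")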

18. Multiple Files & Parallelism

Real system:

/orders/
  part-00000.snappy.parquet
  part-00001.snappy.parquet
  part-00002.snappy.parquet

Each file:

  • Independently readable
  • Independently skippable
  • Parallelizable
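
Readers treat the whole directory as one logical dataset; a sketch with pyarrow (the path is illustrative):

import pyarrow.parquet as pq

# Every part-*.parquet file under the directory is an independent unit;
# readers can open, skip, and scan them in parallel.
orders = pq.read_table("/orders")
print(orders.num_rows)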

19. Mental Model (One Sentence)

Parquet stores columns inside row groups, encoded and compressed, with rich metadata at the end so query engines can avoid reading most of the data.
