Assume this dataset comes from an ingestion job:
| order_id | country | product | amount |
|---|---|---|---|
| 1 | US | Book | 120 |
| 2 | IN | Pen | 20 |
| 3 | US | Book | 300 |
| 4 | IN | Pencil | 10 |
| 5 | US | Book | 150 |
Schema:
order_id INT
country STRING
product STRING
amount INT
Example:
INSERT INTO TABLE orders
SELECT * FROM staging_orders;
There is no row-by-row insert into Parquet.
Instead:
- Execution engine (Spark / Hive / Impala) runs a distributed job
- Each task produces one or more Parquet files
- Files are immutable and written once
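As an illustration, here is a minimal single-process sketch of what one task effectively does when it writes its slice of rows, using pyarrow (the file name `orders.parquet` is just an assumption used by the later sketches):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build the example table in memory (what a task has buffered).
orders = pa.table({
    "order_id": pa.array([1, 2, 3, 4, 5], type=pa.int32()),
    "country":  ["US", "IN", "US", "IN", "US"],
    "product":  ["Book", "Pen", "Book", "Pencil", "Book"],
    "amount":   pa.array([120, 20, 300, 10, 150], type=pa.int32()),
})

# One call writes one complete, immutable Parquet file.
pq.write_table(orders, "orders.parquet", compression="snappy")
```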
Execution engine:
- Reads rows
- Buffers them in memory
- Groups them into row groups
Typical row group size:
128 MB (default)
Our example is tiny, so assume:
Row Group 1 contains all 5 rows
Row Group = horizontal partition
Row Group 1:
(1, US, Book, 120)
(2, IN, Pen, 20)
(3, US, Book, 300)
(4, IN, Pencil, 10)
(5, US, Book, 150)
Row groups are independent units for:
- Parallelism
- Skipping data
- Compression
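A sketch of how the writer's row-group size can be controlled (pyarrow's `row_group_size` counts rows, not bytes; `orders.parquet` is the file from the sketch above):

```python
import pyarrow.parquet as pq

orders = pq.read_table("orders.parquet")

# Force tiny row groups of 2 rows each: the 5 rows land in 3 row groups
# instead of the single row group the default (~128 MB worth) would produce.
pq.write_table(orders, "orders_small_rg.parquet", row_group_size=2)

pf = pq.ParquetFile("orders_small_rg.parquet")
print(pf.metadata.num_row_groups)          # 3
print(pf.metadata.row_group(0).num_rows)   # 2
```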
Inside each row group, data is split by column.
Row Group 1
├── order_id column chunk
├── country column chunk
├── product column chunk
└── amount column chunk
order_id → [1, 2, 3, 4, 5]
country → [US, IN, US, IN, US]
product → [Book, Pen, Book, Pencil, Book]
amount → [120, 20, 300, 10, 150]
Each column chunk is written contiguously on disk.
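This per-row-group, per-column layout is visible in the file metadata; a sketch, again assuming the `orders.parquet` file written earlier:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("orders.parquet")
rg = pf.metadata.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)                     # one column chunk
    print(col.path_in_schema,              # order_id, country, product, amount
          col.file_offset,                 # where the chunk starts in the file
          col.total_compressed_size)       # contiguous bytes on disk
```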
Column chunks are further split into pages (default ~1MB).
Example: country column chunk
Page 1:
US, IN, US, IN, US
Each page contains:
- Encoded values
- A small page header
Dictionary-encoded columns store their dictionary in a separate dictionary page at the start of the column chunk, and every page is compressed before it is written.
country:
Dictionary:
0 → US
1 → IN
Data:
[0, 1, 0, 1, 0]
product:
Dictionary:
0 → Book
1 → Pen
2 → Pencil
Data:
[0, 1, 0, 2, 0]
amount:
[120, 20, 300, 10, 150]
→ Plain, delta-encoded, or dictionary-encoded (with RLE/bit-packed indices), whichever the writer finds smallest
Encoding dramatically reduces size.
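The chosen encodings can be inspected per column chunk; a sketch (the exact encoding names vary by writer):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("orders.parquet")
country = pf.metadata.row_group(0).column(1)   # 'country' column chunk
print(country.encodings)                # e.g. ('RLE_DICTIONARY', 'PLAIN', 'RLE')
print(country.dictionary_page_offset)   # set when a dictionary page was written
```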
Encoded pages are compressed:
Common codecs:
- Snappy (default)
- GZIP
- ZSTD
Result:
Encoded + compressed byte stream
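The codec is a write-time choice; a sketch comparing codecs on the same table (sizes will vary with the data and library version):

```python
import os
import pyarrow.parquet as pq

orders = pq.read_table("orders.parquet")

for codec in ("snappy", "gzip", "zstd"):
    path = f"orders.{codec}.parquet"
    pq.write_table(orders, path, compression=codec)
    print(codec, os.path.getsize(path), "bytes")
```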
Each page header and column chunk also records metadata:
- Number of values
- Encoding type
- Compressed size
- min / max values
- null count
- value count
Example:
amount:
min = 10
max = 300
Each row group's metadata records:
- Total rows
- Total size
- Column chunk offsets
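Both levels of metadata are readable without touching the data pages; a sketch:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("orders.parquet")
rg = pf.metadata.row_group(0)
print(rg.num_rows, rg.total_byte_size)    # row-group totals

amount = rg.column(3)                     # 'amount' column chunk
stats = amount.statistics
print(stats.min, stats.max,               # 10, 300
      stats.null_count)                   # 0
```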
Parquet writes metadata at the end of the file.
Footer contains:
- Full schema
- Row group locations
- Column chunk offsets
- Statistics
Why is the footer at the end? → The writer doesn't know the offsets until the data has been written.
The resulting file layout:
[Magic Bytes]
[Row Group 1]
[order_id column chunk]
[country column chunk]
[product column chunk]
[amount column chunk]
[Footer Metadata]
[Magic Bytes]
This entire file is then split into HDFS blocks.
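Because everything a reader needs lives in the footer, one small read at the end of the file yields the schema and all offsets; a sketch with the `orders.parquet` file from the earlier sketches:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("orders.parquet")
meta = pf.metadata                  # parsed from the footer only
print(meta.num_rows, meta.num_row_groups, meta.serialized_size)
print(pf.schema_arrow)              # full schema, no data pages read yet
```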
Now run a query:
SELECT order_id, amount
FROM orders
WHERE country = 'US'
AND amount > 100;
The engine first reads the footer to get:
- Schema
- Row group metadata
Check statistics:
country.min = IN
country.max = US
amount.min = 10
amount.max = 300
'US' falls inside [IN, US] and 300 > 100, so the row group cannot be skipped → read it
Only the required columns are read:
order_id
amount
country (needed only to evaluate the filter)
❌ product column never read
country = 'US':
- Dictionary scan
- Page-level filtering
amount > 100:
- Page skipped if max <= 100
The engine then:
- Decompresses pages
- Decodes values
- Applies the filters
- Reconstructs rows
Result:
(1, 120)
(3, 300)
(5, 150)
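The same selective read can be reproduced outside a SQL engine; a sketch with pyarrow's reader, which applies the column projection and pushes the predicates down to row-group and page statistics:

```python
import pyarrow.parquet as pq

result = pq.read_table(
    "orders.parquet",
    columns=["order_id", "amount"],                          # projection
    filters=[("country", "=", "US"), ("amount", ">", 100)],  # pushdown
)
print(result.to_pylist())
# [{'order_id': 1, 'amount': 120}, {'order_id': 3, 'amount': 300},
#  {'order_id': 5, 'amount': 150}]
```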
Parquet files are immutable, so updates and deletes work differently:
✔ Append new Parquet files
❌ No in-place modification
Instead:
- Rewrite files
- Partition overwrite
- Use Iceberg / Delta / Hudi
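A minimal sketch of the "rewrite, don't modify" pattern with plain pyarrow (table formats like Iceberg/Delta/Hudi automate this bookkeeping):

```python
import pyarrow.compute as pc
import pyarrow.parquet as pq

# "Delete" order 2 by writing a new file without it; the old file is
# replaced or retired, never patched in place.
orders = pq.read_table("orders.parquet")
kept = orders.filter(pc.not_equal(orders["order_id"], 2))
pq.write_table(kept, "orders.v2.parquet")
```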
In a real system, the table is a directory of part files:
/orders/
part-00000.snappy.parquet
part-00001.snappy.parquet
part-00002.snappy.parquet
Each file:
- Independently readable
- Independently skippable
- Parallelizable
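A sketch of reading such a directory as one logical table (the /orders/ path is the example layout above):

```python
import pyarrow.dataset as ds

# Each part file's footer is consulted independently, so files can be
# skipped via statistics and scanned in parallel.
orders = ds.dataset("/orders/", format="parquet")
table = orders.to_table(
    columns=["order_id", "amount"],
    filter=(ds.field("country") == "US") & (ds.field("amount") > 100),
)
```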
Parquet stores columns inside row groups, encoded and compressed, with rich metadata at the end so query engines can avoid reading most of the data.