Below is a step-by-step, systems-level explanation of how a Hive table fits into the Parquet process. It is anchored to the same dataset and Parquet lifecycle, and clarifies what Hive actually does versus what it does not do.

1. First Principle: What a Hive Table Really Is

A Hive table is NOT:

  • a storage engine
  • a file format
  • a database that owns data

A Hive table IS:

Metadata that maps a logical table definition to files on HDFS

Hive sits between SQL and files.

Hive SQL
   ↓
Metastore (schema + location)
   ↓
HDFS files (Parquet)

2. Hive Metastore (The Control Plane)

When you create a Hive table, Hive stores metadata in the Hive Metastore (usually backed by MySQL/Postgres).

Metadata includes:

  • Table name
  • Columns & types
  • File format (Parquet)
  • HDFS location
  • Partition columns
  • SerDe info

No data is stored in the metastore.
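
You can inspect exactly what the metastore holds for a table with a standard Hive command (shown here against the orders table defined in the next section):

-- prints metastore-held metadata: columns, location, SerDe,
-- InputFormat/OutputFormat, table type, and partition keys
DESCRIBE FORMATTED orders;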


3. Creating a Hive Table on Parquet

Example

CREATE TABLE orders (
  order_id INT,
  country STRING,
  product STRING,
  amount INT
)
STORED AS PARQUET
LOCATION '/warehouse/orders';

What happens internally

  1. Hive writes metadata to Metastore

  2. Creates directory:

    /warehouse/orders/
    
  3. No files are created yet

👉 Table exists even if the directory is empty
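
A quick sanity check, assuming the orders table above:

-- the table is immediately queryable; with an empty directory this returns 0
SELECT COUNT(*) FROM orders;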


4. Managed vs External Tables (Critical Distinction)

Managed Table

CREATE TABLE orders_managed (...)
STORED AS PARQUET;
  • Hive owns the data
  • DROP TABLE deletes files

External Table

CREATE EXTERNAL TABLE orders_ext (...)
STORED AS PARQUET
LOCATION '/warehouse/orders';
  • Hive only owns metadata
  • DROP TABLE keeps files

In both cases ➡ Parquet behavior is identical
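
Because the distinction is only a metastore flag, a managed table can be flipped to external without touching the data; a sketch, using syntax supported in recent Hive versions:

-- files under the table's LOCATION are left untouched
ALTER TABLE orders_managed SET TBLPROPERTIES ('EXTERNAL'='TRUE');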


5. Insert into Hive Table (How Parquet Gets Written)

INSERT INTO TABLE orders
SELECT * FROM staging_orders;

Step-by-step

  1. Hive parses SQL

  2. Hive planner builds execution plan

  3. Execution engine runs (selectable per session; see the snippet after this list):

    • MapReduce (old)
    • Tez
    • Spark (most common)
  4. Tasks write new Parquet files

  5. Files are written to:

    /warehouse/orders/part-00000.parquet
    

🚫 No row-level insert ✅ File-level append only
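
The engine in step 3 is a session-level choice:

-- pick the execution engine for this session; valid values: mr, tez, spark
SET hive.execution.engine=tez;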


6. How Hive Uses Parquet Internally

Hive does not implement Parquet itself.

It uses:

  • Parquet SerDe (serialization/deserialization)
  • Parquet InputFormat / OutputFormat

Responsibilities split like this:

Component        | Responsibility
-----------------|-------------------------------
Hive             | SQL, schema, planning
Execution Engine | Row processing
Parquet Writer   | Encoding, row groups, metadata
HDFS             | Byte storage
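
In fact, STORED AS PARQUET is shorthand for naming the SerDe and input/output format classes explicitly (the orders_explicit table name is just for illustration):

CREATE TABLE orders_explicit (
  order_id INT
)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';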

7. Schema Enforcement (Hive vs Parquet)

Schema lives in TWO places

  1. Hive Metastore
  2. Parquet file footer

What happens on read

  • The Hive schema is authoritative
  • The Parquet file schema is matched against it

Compatible changes:

  • Column reordering
  • Column addition
  • Missing columns → NULL

Incompatible changes:

  • Type mismatch (STRING vs INT)
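
A compatible change in practice (the discount column is a hypothetical addition):

-- metadata-only change: existing Parquet files are not rewritten
ALTER TABLE orders ADD COLUMNS (discount INT);

-- rows read from older files return NULL for the new column
SELECT order_id, discount FROM orders;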

8. Partitioned Hive Tables (Very Important)

CREATE TABLE orders (
  order_id INT,
  product STRING,
  amount INT
)
PARTITIONED BY (country STRING)
STORED AS PARQUET;

Physical layout

/warehouse/orders/
  country=US/
    part-00000.parquet
  country=IN/
    part-00001.parquet

What Hive stores

Metastore:

Partition: country=US → location
Partition: country=IN → location
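
The metastore's view of those partitions can be listed directly:

SHOW PARTITIONS orders;
-- country=IN
-- country=US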

9. Insert into Partitioned Table

INSERT INTO TABLE orders PARTITION (country)
SELECT order_id, product, amount, country
FROM staging_orders;

Result

  • Hive creates the partition directories
  • Parquet files are written per partition
  • The country column is NOT stored in the Parquet files
  • Its value is inferred from the directory name

This layout is what enables partition pruning at the directory level.
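
One practical note: the dynamic-partition INSERT above typically needs these session settings first (defaults vary by Hive version):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;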


10. Query Execution with Hive

SELECT order_id, amount
FROM orders
WHERE country = 'US'
AND amount > 100;

Step-by-step

  1. Hive reads metastore

  2. Finds partition:

    country=US
    
  3. Skips other partitions entirely

  4. Reads Parquet footers

  5. Applies:

    • Column pruning
    • Predicate pushdown
  6. Reads minimal data
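
You can confirm the pruning and pushdown decisions from the plan itself:

-- the plan shows which partitions are scanned and which predicates
-- reach the Parquet reader
EXPLAIN
SELECT order_id, amount
FROM orders
WHERE country = 'US'
AND amount > 100;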


11. What Hive Pushes Down to Parquet

Hive can push down:

  • Column selection
  • Filters (>, <, =, BETWEEN)
  • IS NULL checks

Hive cannot push down:

  • Complex UDF logic
  • Regex-heavy predicates
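
Pushdown is governed by session settings; a minimal sketch (both properties exist in stock Hive, though defaults differ across versions):

-- logical predicate pushdown in the planner
SET hive.optimize.ppd=true;
-- push filters into the storage-format reader
SET hive.optimize.index.filter=true;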

12. How Hive Differs from Spark & Impala Here

Aspect     | Hive             | Spark          | Impala
-----------|------------------|----------------|---------------
Metadata   | Hive Metastore   | Hive Metastore | Hive Metastore
Execution  | MR / Tez / Spark | Spark          | Native C++
Read speed | Medium           | Fast           | Fastest
Writes     | Batch            | Batch          | Limited
ACID       | Via ORC          | Via formats    | Via formats

Hive is:

Schema authority + SQL compiler


13. Updates & Deletes in Hive

Classic Hive + Parquet:

  • ❌ No UPDATE
  • ❌ No DELETE

Workarounds:

  • INSERT OVERWRITE
  • Partition rewrite

Modern Hive:

  • ACID tables require ORC
  • Parquet + ACID requires Iceberg / Hudi / Delta
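
The classic workaround in SQL (a sketch; the order_id filter is an arbitrary example of rows being dropped during the rewrite):

-- rewrite one partition wholesale, removing unwanted rows on the way
INSERT OVERWRITE TABLE orders PARTITION (country='US')
SELECT order_id, product, amount
FROM orders
WHERE country = 'US'
  AND order_id <> 42;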

14. End-to-End Mental Model

Hive defines what the data means, Parquet defines how the data is stored, and HDFS defines where the bytes live.

Hive SQL
   ↓
Metastore (schema, partitions)
   ↓
Execution Engine
   ↓
Parquet Writer / Reader
   ↓
HDFS blocks