Below is a step-by-step, systems-level explanation of how a Hive table fits into the Parquet process. It is anchored to the same dataset and Parquet lifecycle, and clarifies what Hive actually does versus what it does not do.

1. First Principle: What a Hive Table Really Is

A Hive table is NOT:

  • a storage engine
  • a file format
  • a database that owns data

A Hive table IS:

Metadata that maps a logical table definition to files on HDFS

Hive sits between SQL and files.

Hive SQL
   ↓
Metastore (schema + location)
   ↓
HDFS files (Parquet)

2. Hive Metastore (The Control Plane)

When you create a Hive table, Hive stores metadata in the Hive Metastore (usually backed by MySQL/Postgres).

Metadata includes:

  • Table name
  • Columns & types
  • File format (Parquet)
  • HDFS location
  • Partition columns
  • SerDe info

No data is stored in the metastore.
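
You can inspect exactly what the metastore holds for a table with a standard Hive command (shown here against the orders table defined in the next section):

-- prints metastore-held metadata: columns, location, SerDe,
-- InputFormat/OutputFormat, table type, and partition keys
DESCRIBE FORMATTED orders;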


3. Creating a Hive Table on Parquet

Example

CREATE TABLE orders (
  order_id INT,
  country STRING,
  product STRING,
  amount INT
)
STORED AS PARQUET
LOCATION '/warehouse/orders';

What happens internally

  1. Hive writes metadata to Metastore

  2. Creates directory:

    /warehouse/orders/
    
  3. No files are created yet

👉 Table exists even if the directory is empty
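
A quick sanity check, assuming the orders table above:

-- the table is immediately queryable; with an empty directory this returns 0
SELECT COUNT(*) FROM orders;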


4. Managed vs External Tables (Critical Distinction)

Managed Table

CREATE TABLE orders_managed (...)
STORED AS PARQUET;
  • Hive owns the data
  • DROP TABLE deletes files

External Table

CREATE EXTERNAL TABLE orders_ext (...)
STORED AS PARQUET
LOCATION '/warehouse/orders';
  • Hive only owns metadata
  • DROP TABLE keeps files

In both cases ➡ Parquet behavior is identical
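
Because the distinction is only a metastore flag, a managed table can be flipped to external without touching the data; a sketch, using syntax supported in recent Hive versions:

-- files under the table's LOCATION are left untouched
ALTER TABLE orders_managed SET TBLPROPERTIES ('EXTERNAL'='TRUE');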


5. Insert into Hive Table (How Parquet Gets Written)

INSERT INTO TABLE orders
SELECT * FROM staging_orders;

Step-by-step

  1. Hive parses SQL

  2. Hive planner builds execution plan

  3. Execution engine runs (selectable per session; see the snippet after this list):

    • MapReduce (old)
    • Tez
    • Spark (most common)
  4. Tasks write new Parquet files

  5. Files are written to:

    /warehouse/orders/part-00000.parquet
    

🚫 No row-level insert ✅ File-level append only
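
The engine in step 3 is a session-level choice:

-- pick the execution engine for this session; valid values: mr, tez, spark
SET hive.execution.engine=tez;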


6. How Hive Uses Parquet Internally

Hive does not implement Parquet itself.

It uses:

  • Parquet SerDe (serialization/deserialization)
  • Parquet InputFormat / OutputFormat

Responsibilities split like this:

Component        | Responsibility
-----------------|-------------------------------
Hive             | SQL, schema, planning
Execution Engine | Row processing
Parquet Writer   | Encoding, row groups, metadata
HDFS             | Byte storage
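
In fact, STORED AS PARQUET is shorthand for naming the SerDe and input/output format classes explicitly (the orders_explicit table name is just for illustration):

CREATE TABLE orders_explicit (
  order_id INT
)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';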

7. Schema Enforcement (Hive vs Parquet)

Schema lives in TWO places

  1. Hive Metastore
  2. Parquet file footer

What happens on read

  • The Hive schema is authoritative
  • The Parquet file schema is matched against it

Compatible changes:

  • Column reordering
  • Column addition
  • Missing columns → NULL

Incompatible changes:

  • Type mismatch (STRING vs INT)
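
A compatible change in practice (the discount column is a hypothetical addition):

-- metadata-only change: existing Parquet files are not rewritten
ALTER TABLE orders ADD COLUMNS (discount INT);

-- rows read from older files return NULL for the new column
SELECT order_id, discount FROM orders;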

8. Partitioned Hive Tables (Very Important)

CREATE TABLE orders (
  order_id INT,
  product STRING,
  amount INT
)
PARTITIONED BY (country STRING)
STORED AS PARQUET;

Physical layout

/warehouse/orders/
  country=US/
    part-00000.parquet
  country=IN/
    part-00001.parquet

What Hive stores

Metastore:

Partition: country=US → location
Partition: country=IN → location
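
The metastore's view of those partitions can be listed directly:

SHOW PARTITIONS orders;
-- country=IN
-- country=US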

9. Insert into Partitioned Table

INSERT INTO TABLE orders PARTITION (country)
SELECT order_id, product, amount, country
FROM staging_orders;

Result

  • Hive creates the partition directories
  • Parquet files are written per partition
  • The country column is NOT stored in the Parquet files
  • Its value is inferred from the directory name

This layout is what enables partition pruning at the directory level.
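
One practical note: the dynamic-partition INSERT above typically needs these session settings first (defaults vary by Hive version):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;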


10. Query Execution with Hive

SELECT order_id, amount
FROM orders
WHERE country = 'US'
AND amount > 100;

Step-by-step

  1. Hive reads metastore

  2. Finds partition:

    country=US
    
  3. Skips other partitions entirely

  4. Reads Parquet footers

  5. Applies:

    • Column pruning
    • Predicate pushdown
  6. Reads minimal data
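
You can confirm the pruning and pushdown decisions from the plan itself:

-- the plan shows which partitions are scanned and which predicates
-- reach the Parquet reader
EXPLAIN
SELECT order_id, amount
FROM orders
WHERE country = 'US'
AND amount > 100;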


11. What Hive Pushes Down to Parquet

Hive can push down:

  • Column selection
  • Filters (>, <, =, BETWEEN)
  • IS NULL checks

Hive cannot push down:

  • Complex UDF logic
  • Regex-heavy predicates
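
Pushdown is governed by session settings; a minimal sketch (both properties exist in stock Hive, though defaults differ across versions):

-- logical predicate pushdown in the planner
SET hive.optimize.ppd=true;
-- push filters into the storage-format reader
SET hive.optimize.index.filter=true;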

12. How Hive Differs from Spark & Impala Here

Aspect     | Hive             | Spark          | Impala
-----------|------------------|----------------|---------------
Metadata   | Hive Metastore   | Hive Metastore | Hive Metastore
Execution  | MR / Tez / Spark | Spark          | Native C++
Read speed | Medium           | Fast           | Fastest
Writes     | Batch            | Batch          | Limited
ACID       | Via ORC          | Via formats    | Via formats

Hive is:

Schema authority + SQL compiler


13. Updates & Deletes in Hive

Classic Hive + Parquet:

  • ❌ No UPDATE
  • ❌ No DELETE

Workarounds:

  • INSERT OVERWRITE
  • Partition rewrite

Modern Hive:

  • ACID tables require ORC
  • Parquet + ACID requires Iceberg / Hudi / Delta
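
The classic workaround in SQL (a sketch; the order_id filter is an arbitrary example of rows being dropped during the rewrite):

-- rewrite one partition wholesale, removing unwanted rows on the way
INSERT OVERWRITE TABLE orders PARTITION (country='US')
SELECT order_id, product, amount
FROM orders
WHERE country = 'US'
  AND order_id <> 42;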

14. End-to-End Mental Model

Hive defines what the data means, Parquet defines how the data is stored, and HDFS defines where the bytes live.

Hive SQL
   ↓
Metastore (schema, partitions)
   ↓
Execution Engine
   ↓
Parquet Writer / Reader
   ↓
HDFS blocks