A Hive table is NOT:
- a storage engine
- a file format
- a database that owns data
A Hive table IS:
Metadata that maps a logical table definition to files on HDFS
Hive sits between SQL and files.
```
Hive SQL
   ↓
Metastore (schema + location)
   ↓
HDFS files (Parquet)
```
When you create a Hive table, Hive stores metadata in the Hive Metastore (usually backed by MySQL/Postgres).
Metadata includes:
- Table name
- Columns & types
- File format (Parquet)
- HDFS location
- Partition columns
- SerDe info
No data is stored in the metastore.
```sql
CREATE TABLE orders (
  order_id INT,
  country STRING,
  product STRING,
  amount INT
)
STORED AS PARQUET
LOCATION '/warehouse/orders';
```
What happens:
- Hive writes metadata to the Metastore
- Hive creates the directory `/warehouse/orders/`
- No data files are created yet

📌 The table exists even if the directory is empty.
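You can verify what the metastore recorded at any point; `DESCRIBE FORMATTED` prints the table's definition straight from the metastore:

```sql
-- Location, SerDe Library, and InputFormat/OutputFormat in this
-- output all come from the metastore, not from any data file.
DESCRIBE FORMATTED orders;
```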
Managed table:
```sql
CREATE TABLE orders_managed (...)
STORED AS PARQUET;
```
- Hive owns the data
- DROP TABLE deletes the files

External table:
```sql
CREATE EXTERNAL TABLE orders_ext (...)
STORED AS PARQUET
LOCATION '/warehouse/orders';
```
- Hive only owns the metadata
- DROP TABLE keeps the files

In both cases: ✅ Parquet behavior is identical.
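If you are unsure which kind of table you are looking at, the metastore will tell you:

```sql
-- Table Type in the output reads MANAGED_TABLE or EXTERNAL_TABLE;
-- that flag alone decides whether DROP TABLE deletes the files.
DESCRIBE FORMATTED orders_ext;
```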
```sql
INSERT INTO TABLE orders
SELECT * FROM staging_orders;
```
1. Hive parses the SQL
2. The Hive planner builds an execution plan
3. The execution engine runs it:
   - MapReduce (old)
   - Tez
   - Spark (most common)
4. Tasks write new Parquet files
5. Files land at `/warehouse/orders/part-00000.parquet`

🚫 No row-level insert → file-level append only
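To see the appended part files, the Hive CLI (and typically Beeline) can run HDFS commands inline:

```sql
-- Each INSERT adds new immutable part files; nothing is modified in place.
dfs -ls /warehouse/orders/;
```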
Hive does not implement Parquet itself.
It uses:
- Parquet SerDe (serialization/deserialization)
- Parquet InputFormat / OutputFormat
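`STORED AS PARQUET` is shorthand for exactly these pieces. Spelled out explicitly with the Parquet classes Hive ships (`orders_explicit` is just an illustrative table name):

```sql
CREATE TABLE orders_explicit (
  order_id INT
)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
```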
Responsibilities split like this:
| Component | Responsibility |
|---|---|
| Hive | SQL, schema, planning |
| Execution Engine | Row processing |
| Parquet Writer | Encoding, row groups, metadata |
| HDFS | Byte storage |
The schema lives in two places:
- the Hive Metastore
- the Parquet file footer

When they disagree:
- the Hive schema is authoritative
- the Parquet schema is matched against it at read time
Compatible changes:
- Column reordering
- Column addition
- Missing columns → read as NULL

Incompatible changes:
- Type mismatches (e.g., STRING vs INT)
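Column addition, for instance, only touches the Hive schema; the existing Parquet files never change (the `discount` column here is a hypothetical addition):

```sql
-- Existing Parquet files lack the new column, so rows read from
-- them return NULL for discount; newly written files will contain it.
ALTER TABLE orders ADD COLUMNS (discount INT);
SELECT order_id, discount FROM orders;
```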
```sql
CREATE TABLE orders (
  order_id INT,
  product STRING,
  amount INT
)
PARTITIONED BY (country STRING)
STORED AS PARQUET;
```
Directory layout:
```
/warehouse/orders/
  country=US/
    part-00000.parquet
  country=IN/
    part-00001.parquet
```
Metastore:
```
Partition: country=US → location
Partition: country=IN → location
```
```sql
-- Dynamic partitioning must be enabled for an all-dynamic insert:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT INTO TABLE orders PARTITION (country)
SELECT order_id, product, amount, country
FROM staging_orders;
```
- Hive creates the partition directories
- Parquet files are written per partition
- The country column is NOT stored in the Parquet files
- Its value is inferred from the directory name

This is what enables partition pruning at the directory level.
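The metastore now tracks one entry per directory, which you can list:

```sql
-- Each returned row (e.g. country=US) maps to a directory
-- under /warehouse/orders/.
SHOW PARTITIONS orders;
```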
```sql
SELECT order_id, amount
FROM orders
WHERE country = 'US'
  AND amount > 100;
```
1. Hive reads the metastore
2. Finds the partition country=US
3. Skips all other partitions entirely
4. Reads the Parquet footers
5. Applies:
   - Column pruning
   - Predicate pushdown
6. Reads minimal data
Hive can push down:
- Column selection
- Filters (>, <, =, BETWEEN)
- IS NULL checks
Hive cannot push down:
- Complex UDF logic
- Regex-heavy predicates
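`EXPLAIN` is the quickest way to check what was actually pushed: in the plan, the partition filter should vanish into partition pruning, and the remaining predicate should sit at the table scan rather than in a later stage:

```sql
EXPLAIN
SELECT order_id, amount
FROM orders
WHERE country = 'US'
  AND amount > 100;
```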
| Aspect | Hive | Spark | Impala |
|---|---|---|---|
| Metadata | Hive Metastore | Hive Metastore | Hive Metastore |
| Execution | MR / Tez / Spark | Spark | Native C++ |
| Read speed | Medium | Fast | Fastest |
| Writes | Batch | Batch | Limited |
| ACID | Via ORC | Via table formats | Via table formats |
Hive is:
Schema authority + SQL compiler
Classic Hive + Parquet:
- ❌ No UPDATE
- ❌ No DELETE

Workarounds (see the sketch below):
- INSERT OVERWRITE
- Partition rewrite
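A minimal sketch of the partition-rewrite pattern, "deleting" rows by rewriting the whole partition without them (the `amount > 0` keep-condition is illustrative):

```sql
-- Parquet files are immutable, so the only way to drop rows is to
-- replace the partition's files wholesale with the rows to keep.
INSERT OVERWRITE TABLE orders PARTITION (country = 'US')
SELECT order_id, product, amount
FROM orders
WHERE country = 'US'
  AND amount > 0;
```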
Modern Hive:
- ACID tables require ORC
- Parquet + ACID requires Iceberg / Hudi / Delta
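For reference, a transactional table in modern Hive looks like this, and it must be ORC-backed:

```sql
-- ACID in Hive itself requires ORC; with Parquet you get ACID
-- only through a table format such as Iceberg, Hudi, or Delta.
CREATE TABLE orders_acid (
  order_id INT,
  amount INT
)
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
```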
Hive defines what the data means, Parquet defines how the data is stored, and HDFS defines where the bytes live.
```
Hive SQL
   ↓
Metastore (schema, partitions)
   ↓
Execution Engine
   ↓
Parquet Writer / Reader
   ↓
HDFS blocks
```