Impala is NOT:
- a storage engine
- a file format
- a batch processing system
Impala IS:
A distributed, MPP (Massively Parallel Processing) SQL query engine optimized for low-latency analytics on HDFS data.
Impala SQL
↓
Hive Metastore
↓
Native C++ Execution Engine
↓
Parquet / HDFS
Impala does not maintain its own catalog.
It relies entirely on:
- Hive Metastore for schema
- HDFS for file locations
When a table is created in Hive:
CREATE TABLE orders (order_id BIGINT, amount DOUBLE)
PARTITIONED BY (country STRING)
STORED AS PARQUET;
Impala can query it immediately after a metadata refresh:
INVALIDATE METADATA orders;  -- table created outside Impala
-- or
REFRESH orders;              -- new files added to a known table

Impala sees a Hive table as:
- Logical schema (columns, types)
- Set of directories (partitions)
- Set of Parquet files
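That three-part view can be sketched as a small data model. This is a minimal Python sketch with hypothetical names (HiveTable, the file names), not Impala's actual catalog structures:

```python
from dataclasses import dataclass, field

# Hypothetical model of how Impala views a Hive table:
# a logical schema plus partition directories holding Parquet files.
@dataclass
class HiveTable:
    name: str
    schema: dict                                     # column name -> type
    partitions: dict = field(default_factory=dict)   # partition dir -> Parquet files

orders = HiveTable(
    name="orders",
    schema={"order_id": "BIGINT", "amount": "DOUBLE", "country": "STRING"},
    partitions={
        "/orders/country=US/": ["part-0001.parquet"],
        "/orders/country=IN/": ["part-0002.parquet"],
    },
)
```

The schema comes from the Hive Metastore; the partition directories and file lists come from HDFS.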
Impala is almost purely a reader; it writes data only through limited INSERT ... SELECT statements.
Example:
SELECT order_id, amount
FROM orders
WHERE country = 'US'
AND amount > 100;

Planning steps:
- Read table metadata
- Identify partitions
- Build distributed execution plan
- Assign fragments to daemons
Each fragment runs in parallel.
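The planning steps above can be sketched as a toy assignment of scan fragments to daemons. This is a simplified illustration (round-robin assignment; real Impala schedules by HDFS block locality, which is not modeled here):

```python
# Toy planner: one scan fragment per partition, assigned round-robin
# to the available daemons. Each daemon's fragments run in parallel.
def plan_query(partitions, daemons):
    plan = {d: [] for d in daemons}
    for i, part in enumerate(partitions):
        daemon = daemons[i % len(daemons)]
        plan[daemon].append(("SCAN", part))
    return plan

plan = plan_query(["country=US", "country=IN"], ["daemon-1", "daemon-2"])
```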
Given layout:
/orders/
country=US/
country=IN/
Predicate:
WHERE country = 'US'

Result:
- Only the country=US directory is scanned
- Zero I/O for other partitions
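The pruning decision is made from directory names alone, before any file is opened. A minimal sketch of that logic, assuming the hypothetical helper name prune_partitions:

```python
# Partition pruning: match the partition key encoded in each directory
# name against the predicate; non-matching directories cost zero I/O.
def prune_partitions(directories, key, value):
    wanted = f"{key}={value}"
    return [d for d in directories if wanted in d]

dirs = ["/orders/country=US/", "/orders/country=IN/"]
scanned = prune_partitions(dirs, "country", "US")
```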
Impala reads:
- Parquet footer first
- Row group metadata
- Column statistics
This enables:
- Row group pruning
- Column pruning
Footer reads are tiny and cached.
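The effect of footer caching can be sketched with memoization. The footer contents and file name here are hypothetical; the point is that the physical read happens once, and later queries reuse the cached statistics:

```python
from functools import lru_cache

# Toy "storage": file -> footer with per-row-group column statistics.
FOOTERS = {
    "part-0001.parquet": {"row_groups": [{"amount": (5.0, 500.0)}]},
}
READS = []  # record of physical footer reads

@lru_cache(maxsize=None)
def read_footer(path):
    READS.append(path)          # one tiny physical read
    return FOOTERS[path]

stats = read_footer("part-0001.parquet")
stats = read_footer("part-0001.parquet")   # served from cache, no re-read
```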
Only required columns are read:
order_id
amount
country
Unreferenced columns are:
- ❌ Not read
- ❌ Not decompressed
- ❌ Not decoded
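Column pruning follows directly from the columnar layout: each column lives in its own chunk, so a scan touches only the referenced ones. A minimal sketch over a toy in-memory columnar table (the extra notes column is hypothetical):

```python
# Toy columnar layout: one list per column.
COLUMNS = {
    "order_id": [1, 2, 3],
    "amount":   [50.0, 150.0, 300.0],
    "country":  ["US", "US", "IN"],
    "notes":    ["a", "b", "c"],   # unreferenced: never read
}

# Column pruning: materialize only the columns the query references.
def scan(columns, needed):
    return {c: columns[c] for c in needed}

result = scan(COLUMNS, ["order_id", "amount", "country"])
```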
Impala pushes predicates into Parquet:
- =, <, >, BETWEEN
- IS NULL
Row groups skipped using:
- min / max statistics
- dictionary filters
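Min/max skipping reduces to a range check: for a predicate like amount > 100, a row group whose maximum never exceeds 100 cannot contain a match and is skipped without decoding a single value. A sketch of that check:

```python
# A row group with stats (min, max) can be skipped for "amount > threshold"
# when its max does not exceed the threshold.
def skip_row_group(stats_min, stats_max, threshold):
    return stats_max <= threshold

groups = [(5.0, 90.0), (20.0, 400.0), (150.0, 900.0)]   # (min, max) per group
scanned = [g for g in groups if not skip_row_group(g[0], g[1], 100)]
```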
Key characteristics:
- Native C++
- Vectorized execution
- No JVM
- Long-running daemons
Each Impala daemon:
- Scans local HDFS blocks
- Executes fragments in memory
Coordinator Node
↓
Fragment Executors
↓
Local Parquet Scans
↓
Aggregation / Filters
↓
Result Stream
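The pipeline above can be sketched as batch-at-a-time fragments feeding a coordinator merge. This is a toy illustration of the shape of the dataflow (row dicts instead of real vectorized batches, a single SUM aggregate), not Impala's executor:

```python
# Each fragment scans its batch, applies the filters, and produces a
# partial aggregate; the coordinator merges the partial results.
def run_fragment(batch):
    filtered = [r for r in batch if r["country"] == "US" and r["amount"] > 100]
    return sum(r["amount"] for r in filtered)   # partial SUM(amount)

batches = [
    [{"country": "US", "amount": 150.0}, {"country": "IN", "amount": 999.0}],
    [{"country": "US", "amount": 50.0},  {"country": "US", "amount": 300.0}],
]
total = sum(run_fragment(b) for b in batches)   # coordinator merge
```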
No MapReduce. No Spark.
Impala supports:
INSERT INTO orders SELECT ...;

But:
- Writes are batch
- No UPDATE / DELETE
- No ACID
Writes still produce Parquet files.
Impala is ideal for:
- Interactive BI queries
- Dashboards
- Ad-hoc analytics
Not ideal for:
- ETL
- Streaming
- Complex transformations
Impala is a fast reader of Hive-managed Parquet data, optimized to skip as much data as possible.
Hive Metastore → Impala Planner → Native Executors → Parquet