Below is a systems-level explanation of how Impala works with Hive tables and Parquet files. This document focuses on query execution, metadata usage, and read-path optimizations, and is intentionally kept separate from the Hive and Spark write-ups.

1. What Impala Is (And Is Not)

Impala is NOT:

  • a storage engine
  • a file format
  • a batch processing system

Impala IS:

A distributed, MPP (Massively Parallel Processing) SQL query engine optimized for low-latency analytics on HDFS data.

Impala SQL
   ↓
Hive Metastore
   ↓
Native C++ Execution Engine
   ↓
Parquet / HDFS

2. Metadata: Impala and Hive Metastore

Impala does not maintain its own source-of-truth catalog (its catalog service only caches metadata).

It relies entirely on:

  • Hive Metastore for schema
  • HDFS for file locations

When a table is created in Hive:

CREATE TABLE orders (
  order_id BIGINT,
  amount   DOUBLE
)
PARTITIONED BY (country STRING)
STORED AS PARQUET;

Impala can query it as soon as its cached metadata is refreshed:

-- Discards and reloads all cached metadata for the table;
-- required after the table is created or altered outside Impala
INVALIDATE METADATA orders;

-- Reloads only file and block metadata;
-- sufficient after new data files land in an existing table
REFRESH orders;

3. How Impala Sees a Table

Impala sees a Hive table as:

  • Logical schema (columns, types)
  • Set of directories (partitions)
  • Set of Parquet files

Impala itself writes data only through batch INSERT statements (see section 11); otherwise it treats the table as read-only.
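
You can inspect this view directly; both statements below are standard Impala SQL:

-- Logical schema (columns, types, partition columns)
DESCRIBE orders;

-- Partitions and the Parquet files behind them
SHOW FILES IN orders;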


4. Query Planning in Impala

Example:

SELECT order_id, amount
FROM orders
WHERE country = 'US'
AND amount > 100;

Planning steps:

  1. Read table metadata
  2. Identify partitions
  3. Build distributed execution plan
  4. Assign fragments to daemons

Each fragment runs in parallel.
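
To see the plan Impala builds, prefix the query with EXPLAIN; the output shows the scan fragments and how they are distributed across daemons:

EXPLAIN
SELECT order_id, amount
FROM orders
WHERE country = 'US'
AND amount > 100;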


5. Partition Pruning (Directory-Level)

Given layout:

/orders/
  country=US/
  country=IN/

Predicate:

WHERE country = 'US'

Result:

  • Only country=US directory scanned
  • Zero IO for other partitions
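
With country declared as a partition column (as in the CREATE TABLE in section 2), the pruning shows up directly in the plan; the scan node reports something like partitions=1/2:

EXPLAIN SELECT order_id, amount FROM orders WHERE country = 'US';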

6. Parquet Footer Reading

Impala reads:

  • Parquet footer first
  • Row group metadata
  • Column statistics

This enables:

  • Row group pruning
  • Column pruning

Footer reads are tiny and cached.


7. Columnar Read Path

Only required columns are read:

order_id
amount
country

Unreferenced columns are:

  • ❌ Not read
  • ❌ Not decompressed
  • ❌ Not decoded
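
A minimal illustration; the second query forces Impala to read every column:

-- Reads only the order_id and amount column chunks
SELECT order_id, amount FROM orders;

-- Reads, decompresses, and decodes all columns; costly on wide tables
SELECT * FROM orders;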


8. Predicate Pushdown

Impala pushes predicates into Parquet:

  • =, <, >, BETWEEN
  • IS NULL

Row groups skipped using:

  • min / max statistics
  • dictionary filters
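
In recent Impala releases (2.9 and later) both behaviors are controlled by query options; they default to enabled, so these SET statements are illustrative:

-- Row group skipping via min / max statistics
SET PARQUET_READ_STATISTICS=true;

-- Row group skipping via dictionary filters
SET PARQUET_DICTIONARY_FILTERING=true;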

9. Execution Engine (Why Impala Is Fast)

Key characteristics:

  • Native C++
  • Vectorized execution
  • No JVM
  • Long-running daemons

Each Impala daemon:

  • Scans local HDFS blocks
  • Executes fragments in memory

10. Data Flow During Query

Coordinator Node
   ↓
Fragment Executors
   ↓
Local Parquet Scans
   ↓
Aggregation / Filters
   ↓
Result Stream

No MapReduce. No Spark.
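
After running a query in impala-shell, you can see this per-fragment breakdown with the shell's summary command (a shell command, not SQL):

SELECT count(*) FROM orders WHERE country = 'US';
-- then, still inside impala-shell:
SUMMARY;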


11. Inserts and Writes in Impala

Impala supports:

INSERT INTO orders SELECT ...;

But:

  • Writes are batch-oriented
  • No UPDATE / DELETE (on HDFS-backed Parquet tables)
  • No ACID transactions

Writes still produce Parquet files.
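
A fuller sketch of a dynamic partition insert; staging_orders is a hypothetical source table, and country values route rows into the matching country= directories:

-- Each INSERT writes new Parquet files into the target partitions
INSERT INTO orders PARTITION (country)
SELECT order_id, amount, country
FROM staging_orders;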


12. When to Use Impala

Impala is ideal for:

  • Interactive BI queries
  • Dashboards
  • Ad-hoc analytics

Not ideal for:

  • ETL
  • Streaming
  • Complex transformations

13. Mental Model

Impala is a fast reader of Hive-managed Parquet data, optimized to skip as much data as possible.

Hive Metastore → Impala Planner → Native Executors → Parquet