Below is a systems-level explanation of how Impala works with Hive tables and Parquet files. This document focuses on query execution, metadata usage, and read-path optimizations, and is intentionally kept separate from the Hive and Spark write-ups.

1. What Impala Is (And Is Not)

Impala is NOT:

  • a storage engine
  • a file format
  • a batch processing system

Impala IS:

A distributed, MPP (Massively Parallel Processing) SQL query engine optimized for low-latency analytics on HDFS data.

Impala SQL
   ↓
Hive Metastore
   ↓
Native C++ Execution Engine
   ↓
Parquet / HDFS

2. Metadata: Impala and Hive Metastore

Impala does not maintain its own source-of-truth catalog (its catalog service only caches metadata).

It relies entirely on:

  • Hive Metastore for schema
  • HDFS for file locations

When a table is created in Hive:

CREATE TABLE orders (
  order_id BIGINT,
  amount   DOUBLE
)
PARTITIONED BY (country STRING)
STORED AS PARQUET;

Impala can query it as soon as its cached metadata is refreshed:

-- Discards and reloads all cached metadata for the table;
-- required after the table is created or altered outside Impala
INVALIDATE METADATA orders;

-- Reloads only file and block metadata;
-- sufficient after new data files land in an existing table
REFRESH orders;

3. How Impala Sees a Table

Impala sees a Hive table as:

  • Logical schema (columns, types)
  • Set of directories (partitions)
  • Set of Parquet files

Impala itself writes data only through batch INSERT statements (see section 11); otherwise it treats the table as read-only.
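
You can inspect this view directly; both statements below are standard Impala SQL:

-- Logical schema (columns, types, partition columns)
DESCRIBE orders;

-- Partitions and the Parquet files behind them
SHOW FILES IN orders;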


4. Query Planning in Impala

Example:

SELECT order_id, amount
FROM orders
WHERE country = 'US'
AND amount > 100;

Planning steps:

  1. Read table metadata
  2. Identify partitions
  3. Build distributed execution plan
  4. Assign fragments to daemons

Each fragment runs in parallel.
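
To see the plan Impala builds, prefix the query with EXPLAIN; the output shows the scan fragments and how they are distributed across daemons:

EXPLAIN
SELECT order_id, amount
FROM orders
WHERE country = 'US'
AND amount > 100;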


5. Partition Pruning (Directory-Level)

Given layout:

/orders/
  country=US/
  country=IN/

Predicate:

WHERE country = 'US'

Result:

  • Only country=US directory scanned
  • Zero IO for other partitions
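
With country declared as a partition column (as in the CREATE TABLE in section 2), the pruning shows up directly in the plan; the scan node reports something like partitions=1/2:

EXPLAIN SELECT order_id, amount FROM orders WHERE country = 'US';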

6. Parquet Footer Reading

Impala reads:

  • Parquet footer first
  • Row group metadata
  • Column statistics

This enables:

  • Row group pruning
  • Column pruning

Footer reads are tiny and cached.


7. Columnar Read Path

Only required columns are read:

order_id
amount
country

Unreferenced columns are:

  • ❌ Not read
  • ❌ Not decompressed
  • ❌ Not decoded
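
A minimal illustration; the second query forces Impala to read every column:

-- Reads only the order_id and amount column chunks
SELECT order_id, amount FROM orders;

-- Reads, decompresses, and decodes all columns; costly on wide tables
SELECT * FROM orders;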


8. Predicate Pushdown

Impala pushes predicates into Parquet:

  • =, <, >, BETWEEN
  • IS NULL

Row groups skipped using:

  • min / max statistics
  • dictionary filters
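
In recent Impala releases (2.9 and later) both behaviors are controlled by query options; they default to enabled, so these SET statements are illustrative:

-- Row group skipping via min / max statistics
SET PARQUET_READ_STATISTICS=true;

-- Row group skipping via dictionary filters
SET PARQUET_DICTIONARY_FILTERING=true;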

9. Execution Engine (Why Impala Is Fast)

Key characteristics:

  • Native C++
  • Vectorized execution
  • No JVM
  • Long-running daemons

Each Impala daemon:

  • Scans local HDFS blocks
  • Executes fragments in memory

10. Data Flow During Query

Coordinator Node
   ↓
Fragment Executors
   ↓
Local Parquet Scans
   ↓
Aggregation / Filters
   ↓
Result Stream

No MapReduce. No Spark.
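
After running a query in impala-shell, you can see this per-fragment breakdown with the shell's summary command (a shell command, not SQL):

SELECT count(*) FROM orders WHERE country = 'US';
-- then, still inside impala-shell:
SUMMARY;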


11. Inserts and Writes in Impala

Impala supports:

INSERT INTO orders SELECT ...;

But:

  • Writes are batch-oriented
  • No UPDATE / DELETE (on HDFS-backed Parquet tables)
  • No ACID transactions

Writes still produce Parquet files.
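
A fuller sketch of a dynamic partition insert; staging_orders is a hypothetical source table, and country values route rows into the matching country= directories:

-- Each INSERT writes new Parquet files into the target partitions
INSERT INTO orders PARTITION (country)
SELECT order_id, amount, country
FROM staging_orders;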


12. When to Use Impala

Impala is ideal for:

  • Interactive BI queries
  • Dashboards
  • Ad-hoc analytics

Not ideal for:

  • ETL
  • Streaming
  • Complex transformations

13. Mental Model

Impala is a fast reader of Hive-managed Parquet data, optimized to skip as much data as possible.

Hive Metastore → Impala Planner → Native Executors → Parquet