@sany2k8
sany2k8 / parquet_with_spark_and_hive.md
Last active February 4, 2026 13:39
Below is a systems-level explanation of how Apache Spark works with Hive tables and Parquet files. This document focuses on data writing, query execution, and optimization, and is intentionally separate from Hive and Impala.

1. What Spark Is (And Is Not)

Spark is NOT:

  • just a SQL engine
  • just a query engine

Spark IS:

A general-purpose distributed data processing engine capable of ETL, analytics, and machine learning.

sany2k8 / how_Impala_works_with_Hive_tables_and_Parquet_files.md
Created February 4, 2026 13:19
Below is a systems-level explanation of how Impala works with Hive tables and Parquet files. This document focuses on query execution, metadata usage, and read-path optimizations, and is intentionally separate from Hive and Spark.

1. What Impala Is (And Is Not)

Impala is NOT:

  • a storage engine
  • a file format
  • a batch processing system

Impala IS:

sany2k8 / how_hive_works_with_parquet_file.md
Last active February 4, 2026 13:38
Below is a step-by-step, systems-level explanation of how a Hive table fits into the Parquet process. It is anchored to the same dataset and Parquet lifecycle, and clarifies what Hive actually does versus what it does not do.

1. First Principle: What a Hive Table Really Is

A Hive table is NOT:

  • a storage engine
  • a file format
  • a database that owns data

A Hive table IS:

sany2k8 / how_parquet_file_formed_with_data_and_queried.md
Last active February 4, 2026 13:37
Below is a step-by-step, end-to-end walk-through of how a Parquet file is formed, stored, and queried in a Hadoop ecosystem, starting from a raw dataset and ending at query execution.

1. Example Dataset (Logical View)

Assume this dataset comes from an ingestion job:

| order_id | country | product | amount |
|----------|---------|---------|--------|
| 1        | US      | Book    | 120    |
| 2        | IN      | Pen     | 20     |
| 3        | US      | Book    | 300    |
| 4        | IN      | Pencil  | 10     |
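
As a rough illustration of what "forming" a Parquet file involves, the sketch below pivots the example rows into columns, dictionary-encodes a repetitive column, and computes min/max statistics for a numeric one. It is plain Python, not a real Parquet writer: `to_columns` and `dictionary_encode` are hypothetical helper names, and row groups, pages, and compression are deliberately left out.

```python
# Rows exactly as they appear in the example dataset.
rows = [
    {"order_id": 1, "country": "US", "product": "Book", "amount": 120},
    {"order_id": 2, "country": "IN", "product": "Pen", "amount": 20},
    {"order_id": 3, "country": "US", "product": "Book", "amount": 300},
    {"order_id": 4, "country": "IN", "product": "Pencil", "amount": 10},
]


def to_columns(records):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def dictionary_encode(values):
    """Return (distinct values in first-seen order, index of each value)."""
    dictionary = []
    index_of = {}
    indices = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


columns = to_columns(rows)

# Repetitive string columns shrink to a small dictionary plus integer indices.
country_dict, country_idx = dictionary_encode(columns["country"])

# Per-column min/max statistics are what let a reader skip data that
# cannot match a filter such as amount > 500.
amount_min, amount_max = min(columns["amount"]), max(columns["amount"])
```

With only four rows the savings are negligible; the point is that values of the same column end up stored together, which is what makes encoding and statistics-based skipping possible.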
sany2k8 / Big Data Ecosystem Overview.md
Last active November 7, 2025 08:45
A comprehensive Markdown file that documents the end-to-end Big Data ecosystem workflow, including Hue, Hive, Impala, HDFS, Spark (PySpark), HBase, Iceberg, and file formats (Parquet, ORC).

🧭 Big Data Ecosystem Overview

This document explains how the main components of the Hadoop-based Big Data ecosystem connect and work together: Hue, Hive, Impala, HDFS, Spark (PySpark), HBase, Iceberg, Parquet, ORC, Oozie, and Teradata.


🧱 Core Components and Their Roles

Component Type Purpose

Typos & Corrections

| Original Text (on page) | Issue | Suggested Correction |
|-------------------------|-------|----------------------|
| “List sroted pods” | Typo | “List sorted pods” |
| “List pods using a different output” | Wording unclear | Could be: “List pods with different output formats” |
| “View all cotainers logs…” | Typo | “View all containers logs…” |
| “locahost-port” | Typo | “localhost-port” |
| “hosts-port” | Wording | Better: “host-port” |
**“
flowchart TD
%% =============================
%% Global Layout Tweaks
%% =============================
%% Make arrows thicker and more visible
linkStyle default stroke-width:2px,stroke:#555,opacity:0.9;
sany2k8 / search_types.sql
Last active August 25, 2025 17:19
All the working queries
-- CREATE EXTENSION IF NOT EXISTS postgis;
-- CREATE EXTENSION IF NOT EXISTS vector;  -- pgvector is installed under the extension name "vector"
------------------------------ Exercise 1 ------------------------------
-- Table setup
CREATE TABLE products (
id SERIAL PRIMARY KEY,
sku VARCHAR(20) UNIQUE,
name VARCHAR(200),
category VARCHAR(50),
Search Type Speed Accuracy Flexibility Storage Best For
Exact Match ⭐⭐⭐⭐⭐ ⭐⭐⭐⭐⭐ ⭐⭐⭐⭐⭐ IDs, codes, filters
Pattern Match ⭐⭐⭐⭐ ⭐⭐⭐ ⭐⭐⭐ ⭐⭐⭐⭐ Autocomplete, prefixes
Full-Text ⭐⭐⭐ ⭐⭐⭐⭐ ⭐⭐⭐⭐ ⭐⭐⭐ Documents, articles
Vector / Semantic ⭐⭐ ⭐⭐⭐⭐⭐ ⭐⭐⭐⭐⭐ ⭐⭐ Recommendations, concepts
Fuzzy ⭐⭐ ⭐⭐⭐ ⭐⭐⭐⭐ ⭐⭐⭐ Typos, data cleaning
Scenario Best Choice Alternative / Avoid
User login/auth Exact Match All others
Product SKU lookup Exact Match All others
Autocomplete Pattern Match (prefix) Fuzzy, Vector
Blog search Full-Text Vector + Full-Text, Pattern
Recommendation Vector + Full-Text Pattern
Exact Data with typos Fuzzy Pattern, Exact
Multi-language content Vector + Full-Text Pattern
Real-time search Exact / Pattern Full-Text, Vector
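
To make the difference between three of these search types concrete, here is a small sketch in plain Python rather than PostgreSQL. The product names and the helper functions `exact_match`, `prefix_match`, and `fuzzy_match` are made up for illustration, and the standard library's `difflib` similarity stands in for a database fuzzy-matching extension; it is not what the SQL exercises use.

```python
import difflib

# Hypothetical product names, used only for this illustration.
names = ["notebook", "notepad", "pencil", "pen"]


def exact_match(query, items):
    """Exact match: only items identical to the query."""
    return [item for item in items if item == query]


def prefix_match(query, items):
    """Pattern match (prefix): items that start with the query."""
    return [item for item in items if item.startswith(query)]


def fuzzy_match(query, items, cutoff=0.7):
    """Fuzzy match: items whose similarity ratio to the query meets the cutoff."""
    return difflib.get_close_matches(query, items, n=3, cutoff=cutoff)


print(exact_match("pen", names))      # identical strings only
print(prefix_match("note", names))    # autocomplete-style lookup
print(fuzzy_match("pencel", names))   # tolerates the typo in "pencel"
```

The misspelled query finds nothing under exact matching but is recovered by the fuzzy search, which matches the "Data with typos" row of the scenario table.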