pgsyncd: Schema Evolution Handling - Design Proposal

Schema Evolution Handling for pgsyncd

Author: Engineering Team
Date: 2026-02-04
Status: Proposal
Type: Feature Enhancement


Executive Summary

This proposal outlines how pgsyncd should handle schema evolution when source and target databases diverge. The guiding principle is: stop syncing when schema changes could cause data loss, continue when changes are safe.

Current behavior: pgsyncd assumes schemas match. When they don't, behavior is undefined and can lead to silent data loss or sync failures.

Proposed behavior: Validate schemas continuously, stop sync on divergence that risks data loss, continue with warnings when safe.


Problem Statement

Current State

pgsyncd currently has no schema validation. It assumes:

  • Source and target schemas are identical
  • Column names and types match
  • No columns are added or removed during sync

Real-world problems:

  1. New column added to source

    • Source: users(id, name, email, phone)
    • Target: users(id, name, email)
    • Current behavior: Undefined (likely INSERT fails or phone data is lost)
  2. Column type changed

    • Source: products.price NUMERIC(10,2)
    • Target: products.price INTEGER
    • Current behavior: Undefined (likely silent data corruption: $19.99 → $20)
  3. Column removed from source

    • Source: users(id, name, email) (removed legacy_id)
    • Target: users(id, name, email, legacy_id)
    • Current behavior: Undefined

User Impact

  • Silent data loss: Columns skipped without notice
  • Data corruption: Lossy type casts without warning
  • Production incidents: Sync fails unexpectedly when schemas diverge
  • Difficult debugging: No visibility into schema mismatches

Design Philosophy

Core Principles

  1. Never silently lose data - If we can't sync a column safely, stop and alert
  2. Fail loudly, not silently - Schema problems should be impossible to miss
  3. Provide escape hatches - Allow overrides for emergency situations
  4. Resume-able - Schema issues shouldn't corrupt sync state
  5. Clear path to resolution - Error messages show exactly how to fix

Decision Framework

For each schema divergence scenario:

Can we sync without data loss?
  → YES: Continue with warnings
  → NO: Stop and require explicit decision
  → MAYBE: Stop and provide options
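
A rough sketch of this framework in code might look like the following. The SchemaDiff fields and the SyncDecision enum are illustrative names for this proposal, not existing pgsyncd types; the same SchemaDiff shape is reused by later sketches in this document.

#include <stdbool.h>

/* Sketch only: hypothetical types for the decision framework above. */
typedef struct SchemaDiff {
    bool has_new_columns;                  /* source-only columns (Scenario 1) */
    bool has_removed_columns;              /* columns removed from source since setup (Scenario 5) */
    bool has_type_mismatches;              /* any type difference (Scenarios 3/4) */
    bool has_unsafe_casts;                 /* lossy type difference (Scenario 3) */
    bool has_data_loss_risk;               /* summary flag set by the comparison */
    bool target_not_null_without_default;  /* target-only column that breaks INSERTs (Scenario 2) */
} SchemaDiff;

typedef enum SyncDecision {
    SYNC_CONTINUE,        /* YES: safe, proceed (possibly with warnings) */
    SYNC_STOP_DATA_LOSS,  /* NO: stop, data loss or corruption risk */
    SYNC_STOP_AMBIGUOUS   /* MAYBE: stop and present options to the operator */
} SyncDecision;

static SyncDecision
decide_sync_action(const SchemaDiff *diff)
{
    /* Source-only columns, lossy casts, or un-insertable target columns
     * all risk losing or corrupting data → stop. */
    if (diff->has_new_columns || diff->has_unsafe_casts ||
        diff->target_not_null_without_default || diff->has_data_loss_risk)
        return SYNC_STOP_DATA_LOSS;

    /* Columns removed from source have several valid resolutions → ask. */
    if (diff->has_removed_columns)
        return SYNC_STOP_AMBIGUOUS;

    /* Safe widenings and nullable/defaulted target extras → continue. */
    return SYNC_CONTINUE;
}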

Proposed Behavior by Scenario

Scenario 1: Column Added to Source 🛑

Situation: Source has users.phone_number, target doesn't

Risk: DATA LOSS - source is collecting data we can't store

Behavior: STOP SYNC

Rationale:

  • If we skip the column, we lose data that exists in source
  • When target is updated later, we'd need a backfill
  • Better to stop immediately and force schema alignment

Error Message:

[ERROR] Schema mismatch detected on public.users
  Source has new column: 'phone_number' (type: varchar)
  Target does not have this column

  Sync STOPPED to prevent data loss.

  Actions:
    1. Add column to target (RECOMMENDED):
       ALTER TABLE target.users ADD COLUMN phone_number varchar;

    2. Or explicitly skip this column (NOT RECOMMENDED - data loss):
       pgsyncd sync --skip-columns=users.phone_number

    3. Or allow all source-only columns (DANGEROUS):
       pgsyncd sync --allow-schema-drift=source_superset

Status output:

{
  "table": "public.users",
  "status": "schema_mismatch",
  "error": "Source has new column 'phone_number' not in target",
  "sync_stopped_at": "2026-02-04T20:15:00Z",
  "resolution": "Add column to target or use --skip-columns"
}

Scenario 2: Column Added to Target ✅/🛑

Situation: Target has users.created_by_system, source doesn't

Risk: Depends on target column constraints

Behavior:

  • ✅ CONTINUE if target column is nullable or has DEFAULT
  • 🛑 STOP if target column is NOT NULL with no DEFAULT

Rationale:

  • No source data being lost
  • Target column for audit/metadata is a common pattern
  • Safe if target handles missing values
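
A minimal check for this rule might look like the sketch below, using the same column shape introduced later in the Implementation Design section (the helper name is hypothetical):

#include <stdbool.h>
#include <stddef.h>

/* Sketch only: mirrors the SchemaColumn struct from Implementation Design. */
typedef struct SchemaColumn {
    char *name;
    char *dataType;
    bool notNull;
    char *defaultExpr;   /* NULL when the column has no DEFAULT */
    int position;
} SchemaColumn;

/* Hypothetical helper: is a target-only column safe to leave unsynced? */
static bool
target_extra_column_is_safe(const SchemaColumn *col)
{
    if (!col->notNull)
        return true;                 /* nullable → synced rows get NULL */
    if (col->defaultExpr != NULL)
        return true;                 /* NOT NULL with DEFAULT → INSERT still works */
    return false;                    /* NOT NULL, no DEFAULT → stop sync */
}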

Safe case (continue):

-- Target column nullable → INSERT will use NULL
ALTER TABLE users ADD COLUMN created_by_system TEXT;

-- Target column has DEFAULT → INSERT will use default
ALTER TABLE users ADD COLUMN created_by_system TEXT DEFAULT 'pgsyncd';

Warning message:

[WARN] Schema difference on public.users
  Target has extra column: 'created_by_system' (nullable)
  This column will be NULL for all synced rows
  This is OK for audit/metadata columns

Unsafe case (stop):

-- NOT NULL with no DEFAULT → INSERT will fail
ALTER TABLE users ADD COLUMN created_by_system TEXT NOT NULL;

Error message:

[ERROR] Schema mismatch on public.users
  Target has new column 'created_by_system' (NOT NULL, no DEFAULT)
  Source does not have this column

  Cannot insert rows - column requires a value.

  Fix:
    ALTER TABLE target.users ALTER COLUMN created_by_system DROP NOT NULL;
    -- OR --
    ALTER TABLE target.users ALTER COLUMN created_by_system SET DEFAULT 'pgsyncd';

Scenario 3: Type Mismatch (Lossy) 🛑

Situation: Source has price NUMERIC(10,2), target has price INTEGER

Risk: DATA CORRUPTION - casting loses precision

Behavior: STOP SYNC

Rationale:

  • Casting $19.99 → $20 silently corrupts data
  • Silent corruption worse than stopping
  • Operator must explicitly allow precision loss

Error message:

[ERROR] Type mismatch on public.products.price
  Source: numeric(10,2)
  Target: integer

  Casting would lose precision (19.99 → 20)

  Fix:
    1. Match types (RECOMMENDED):
       ALTER TABLE target.products ALTER COLUMN price TYPE numeric(10,2);

    2. Or explicitly allow cast (DATA LOSS):
       pgsyncd sync --allow-type-cast=products.price:integer

    3. Or allow all casts (DANGEROUS):
       pgsyncd sync --allow-schema-drift=type_casts --force

Scenario 4: Type Mismatch (Safe Widening) ✅

Situation: Source has id INTEGER, target has id BIGINT

Risk: NONE - safe widening cast

Behavior: CONTINUE

Rationale:

  • Safe casts: int→bigint, varchar(10)→varchar(20), date→timestamp
  • No data loss, no precision loss
  • Common pattern for capacity planning

Safe casts list:

  • smallint → integer → bigint
  • integer → numeric
  • varchar(N) → varchar(M) where M > N
  • varchar(N) → text
  • date → timestamp
  • timestamp → timestamptz
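
One way to encode this list is a small lookup table. The sketch below covers exactly the pairs above; length and precision checks (varchar(N) → varchar(M) with M > N) would be handled separately from the stored lengths, and the function name is illustrative:

#include <stdbool.h>
#include <string.h>

/* Sketch only: whitelist of widening casts that never lose data. */
static bool
is_safe_widening_cast(const char *source_type, const char *target_type)
{
    static const struct { const char *from; const char *to; } safe[] = {
        { "smallint",  "integer"     },
        { "smallint",  "bigint"      },
        { "integer",   "bigint"      },
        { "integer",   "numeric"     },
        { "varchar",   "text"        },
        { "date",      "timestamp"   },
        { "timestamp", "timestamptz" },
    };

    if (strcmp(source_type, target_type) == 0)
        return true;                            /* identical types */

    for (size_t i = 0; i < sizeof(safe) / sizeof(safe[0]); i++)
        if (strcmp(source_type, safe[i].from) == 0 &&
            strcmp(target_type, safe[i].to) == 0)
            return true;

    return false;                               /* anything else needs an override */
}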

Info message:

[INFO] Type difference on public.users.id
  Source: integer
  Target: bigint
  Safe widening cast - continuing

Scenario 5: Column Removed from Source 🛑

Situation: Source removed users.legacy_id, target still has it

Risk: AMBIGUOUS - what should happen to target column?

Behavior: STOP SYNC

Rationale:

  • Multiple valid approaches:
    • Keep target column (preserve historical data)
    • Set to NULL (mark as removed)
    • Remove from target (match source)
  • Operator must choose explicitly

Error message:

[ERROR] Schema mismatch on public.users
  Target has column 'legacy_id' that no longer exists in source

  Decide what to do with target column:
    1. Remove from target:
       ALTER TABLE target.users DROP COLUMN legacy_id;

    2. Keep existing values, stop updating:
       pgsyncd sync --preserve-columns=users.legacy_id

    3. Set to NULL on all synced rows:
       pgsyncd sync --null-removed-columns=users.legacy_id

Scenario 6: Column Order Different ✅

Situation: Source has (id, name, email), target has (id, email, name)

Risk: NONE - if we use explicit column names

Behavior: CONTINUE (always use explicit column names)

Rationale:

  • Never rely on column order
  • Always generate INSERT INTO table (col1, col2) VALUES ($1, $2)
  • This should "just work"

Implementation:

-- Always use explicit column names:
INSERT INTO target.users (id, name, email) VALUES (1, 'Alice', 'alice@example.com')

-- Never use positional:
INSERT INTO target.users VALUES (1, 'Alice', 'alice@example.com')  -- ❌ WRONG

Implementation Design

Schema Fingerprinting

Each table tracks a schema fingerprint containing:

  • Column names (in order)
  • Column types (with precision/scale)
  • NOT NULL constraints
  • DEFAULT expressions
  • Primary key definition

typedef struct SchemaColumn {
    char *name;
    char *dataType;
    bool notNull;
    char *defaultExpr;
    int position;
} SchemaColumn;

typedef struct SchemaFingerprint {
    char *tableName;
    int columnCount;
    SchemaColumn *columns;  // Array of column definitions
    char *primaryKey;
    char *fingerprintHash;  // SHA256 of canonical schema
} SchemaFingerprint;
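
A sketch of how fingerprintHash could be derived: serialize the columns into a canonical string in ordinal order and hash it. OpenSSL's SHA256 is used here purely for illustration and is an assumed dependency, not an existing pgsyncd one.

#include <stdio.h>
#include <string.h>
#include <openssl/sha.h>   /* assumption: OpenSSL available for hashing */

/* Sketch only: any stable, canonical serialization works; this one writes
 * "name:type:notnull:default;" per column and hex-encodes the SHA256. */
static void
compute_fingerprint_hash(const SchemaColumn *columns, int columnCount,
                         char *hex_out, size_t hex_len)
{
    char canon[8192] = "";

    for (int i = 0; i < columnCount; i++) {
        char part[512];
        snprintf(part, sizeof(part), "%s:%s:%d:%s;",
                 columns[i].name,
                 columns[i].dataType,
                 columns[i].notNull ? 1 : 0,
                 columns[i].defaultExpr ? columns[i].defaultExpr : "");
        strncat(canon, part, sizeof(canon) - strlen(canon) - 1);
    }

    unsigned char digest[SHA256_DIGEST_LENGTH];
    SHA256((const unsigned char *) canon, strlen(canon), digest);

    for (int i = 0; i < SHA256_DIGEST_LENGTH && (size_t) (2 * i + 2) < hex_len; i++)
        snprintf(hex_out + 2 * i, 3, "%02x", digest[i]);
}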

Setup Phase

pgsyncd setup --source <uri> --target <uri>

# For each table:
1. Query source schema: SELECT * FROM information_schema.columns WHERE table_name = ...
2. Query target schema: SELECT * FROM information_schema.columns WHERE table_name = ...
3. Compare schemas:
   - Identify common columns
   - Identify source-only columns (data loss risk!)
   - Identify target-only columns (check NOT NULL)
   - Identify type mismatches (check if safe cast)
4. Validate compatibility:
   - If source has extra columns → FAIL
   - If target has NOT NULL column without DEFAULT → FAIL
   - If lossy type cast required → FAIL
   - If only warnings → PROCEED
5. Store schema fingerprint in catalog:
   - Common columns to sync
   - Safe casts to apply
   - Warnings issued
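
Step 3 above could be implemented roughly as follows. compute_schema_diff is the same hypothetical helper the unit tests exercise later in this proposal; it reuses the SchemaFingerprint/SchemaColumn structs above, the SchemaDiff sketch from the Decision Framework section, and the is_safe_widening_cast sketch from Scenario 4.

#include <stdbool.h>
#include <string.h>

/* Sketch only: classify differences between two fingerprints. */
static const SchemaColumn *
find_column(const SchemaFingerprint *fp, const char *name)
{
    for (int i = 0; i < fp->columnCount; i++)
        if (strcmp(fp->columns[i].name, name) == 0)
            return &fp->columns[i];
    return NULL;
}

static SchemaDiff
compute_schema_diff(const SchemaFingerprint *source,
                    const SchemaFingerprint *target)
{
    SchemaDiff diff = { 0 };

    /* Source-only columns and type mismatches (Scenarios 1, 3, 4). */
    for (int i = 0; i < source->columnCount; i++) {
        const SchemaColumn *src = &source->columns[i];
        const SchemaColumn *tgt = find_column(target, src->name);

        if (tgt == NULL) {
            diff.has_new_columns = true;
            diff.has_data_loss_risk = true;
        } else if (strcmp(src->dataType, tgt->dataType) != 0) {
            diff.has_type_mismatches = true;
            if (!is_safe_widening_cast(src->dataType, tgt->dataType)) {
                diff.has_unsafe_casts = true;
                diff.has_data_loss_risk = true;
            }
        }
    }

    /* Target-only columns (Scenario 2).  Columns removed from the source
     * (Scenario 5) are detected separately, by diffing the stored source
     * fingerprint against the current one during sync. */
    for (int i = 0; i < target->columnCount; i++) {
        const SchemaColumn *tgt = &target->columns[i];
        if (find_column(source, tgt->name) == NULL &&
            tgt->notNull && tgt->defaultExpr == NULL)
            diff.target_not_null_without_default = true;   /* INSERTs would fail */
    }

    return diff;
}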

Catalog schema:

CREATE TABLE schema_fingerprints (
    table_oid INTEGER PRIMARY KEY,
    table_name TEXT NOT NULL,
    source_fingerprint TEXT NOT NULL,  -- JSON schema from source
    target_fingerprint TEXT NOT NULL,  -- JSON schema from target
    common_columns TEXT[],             -- Columns safe to sync
    compatibility_status TEXT,         -- 'compatible', 'warning', 'error'
    validation_warnings TEXT[],
    last_validated TIMESTAMP,
    created_at TIMESTAMP DEFAULT NOW()
);
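
Step 5 of the setup phase could then upsert one row per table into this catalog. A minimal libpq sketch, assuming the fingerprints have already been serialized to JSON and the column list to a Postgres array literal (the function name and catalog connection handling are illustrative):

#include <stdbool.h>
#include <stdio.h>
#include <libpq-fe.h>

/* Sketch only: upsert one schema_fingerprints row for a table. */
static bool
store_schema_fingerprint(PGconn *catalog,
                         const char *table_oid,       /* e.g. "16384" */
                         const char *table_name,      /* e.g. "public.users" */
                         const char *source_json,
                         const char *target_json,
                         const char *common_columns,  /* e.g. "{id,name,email}" */
                         const char *status)          /* compatible | warning | error */
{
    const char *sql =
        "INSERT INTO schema_fingerprints "
        "  (table_oid, table_name, source_fingerprint, target_fingerprint, "
        "   common_columns, compatibility_status, last_validated) "
        "VALUES ($1::int, $2, $3, $4, $5::text[], $6, now()) "
        "ON CONFLICT (table_oid) DO UPDATE SET "
        "  source_fingerprint = EXCLUDED.source_fingerprint, "
        "  target_fingerprint = EXCLUDED.target_fingerprint, "
        "  common_columns = EXCLUDED.common_columns, "
        "  compatibility_status = EXCLUDED.compatibility_status, "
        "  last_validated = EXCLUDED.last_validated";

    const char *params[6] = { table_oid, table_name, source_json,
                              target_json, common_columns, status };

    PGresult *res = PQexecParams(catalog, sql, 6, NULL, params, NULL, NULL, 0);
    bool ok = (PQresultStatus(res) == PGRES_COMMAND_OK);
    if (!ok)
        fprintf(stderr, "store_schema_fingerprint: %s", PQerrorMessage(catalog));
    PQclear(res);
    return ok;
}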

Sync Phase

// Before syncing each table, validate schema hasn't changed
bool validate_schema_before_sync(SyncdTable *table) {
    // Get current schema from both databases
    SchemaFingerprint *source_current = get_current_source_schema(table);
    SchemaFingerprint *target_current = get_current_target_schema(table);

    // Get stored fingerprints from setup
    SchemaFingerprint *source_stored = table->sourceSchemaFingerprint;
    SchemaFingerprint *target_stored = table->targetSchemaFingerprint;

    // Check if schemas changed
    if (!fingerprints_match(source_current, source_stored)) {
        log_error("Source schema changed for %s.%s since setup",
                 table->nspname, table->relname);

        SchemaDiff diff = compute_diff(source_stored, source_current);

        if (diff.has_new_columns) {
            log_error("  New columns in source: %s", join(diff.new_columns));
            log_error("  Sync STOPPED to prevent data loss");
            log_error("  Run: pgsyncd schema diff --table %s", table->relname);
            return false;
        }

        if (diff.has_removed_columns) {
            log_error("  Removed columns from source: %s", join(diff.removed_columns));
            log_error("  Sync STOPPED - ambiguous what to do with target columns");
            return false;
        }

        if (diff.has_type_changes) {
            log_error("  Type changes: %s", format_type_changes(diff.type_changes));
            log_error("  Sync STOPPED to prevent data corruption");
            return false;
        }
    }

    if (!fingerprints_match(target_current, target_stored)) {
        log_warn("Target schema changed for %s.%s since setup",
                table->nspname, table->relname);

        // Re-validate if still compatible
        if (!validate_target_changes_safe(table, target_current)) {
            return false;
        }
    }

    return true;
}
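
The fingerprints_match() helper used above is not defined elsewhere in this proposal; a sketch that compares the canonical hashes, with a coarse structural fallback when a hash is missing:

#include <stdbool.h>
#include <string.h>

/* Sketch only: fingerprints match when their canonical hashes are equal;
 * detailed column-level differences are left to the diff computation. */
static bool
fingerprints_match(const SchemaFingerprint *a, const SchemaFingerprint *b)
{
    if (a == NULL || b == NULL)
        return false;
    if (a->fingerprintHash != NULL && b->fingerprintHash != NULL)
        return strcmp(a->fingerprintHash, b->fingerprintHash) == 0;

    /* No hash stored: fall back to column count as a coarse check. */
    return a->columnCount == b->columnCount;
}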

SQL Generation

// Use only common columns when generating SQL
void build_insert_sql(SyncdTable *table, char *sql_buf, size_t len) {
    // Get list of columns safe to sync (from schema fingerprint)
    char **columns = table->common_columns;
    int column_count = table->common_column_count;

    // Build column list: "id, name, email, created_at"
    char column_list[1024];
    join_columns(columns, column_count, ", ", column_list, sizeof(column_list));

    // Build placeholder list: "$1, $2, $3, $4"
    char placeholders[1024];
    build_placeholders(column_count, placeholders, sizeof(placeholders));

    // Generate INSERT with explicit column names
    snprintf(sql_buf, len,
             "INSERT INTO %s.%s (%s) VALUES (%s) ON CONFLICT (%s) DO UPDATE SET %s",
             table->target_nspname,
             table->target_relname,
             column_list,
             placeholders,
             table->primary_key,
             build_update_set_clause(columns, column_count));
}
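
The build_update_set_clause() helper referenced above is not shown; a sketch that emits "col = EXCLUDED.col" pairs for the common columns (a real implementation would size the buffer dynamically, quote identifiers, and skip the primary key column):

#include <stdio.h>
#include <string.h>

/* Sketch only: builds "name = EXCLUDED.name, email = EXCLUDED.email, ..." */
static const char *
build_update_set_clause(char **columns, int column_count)
{
    static char clause[4096];
    clause[0] = '\0';

    for (int i = 0; i < column_count; i++) {
        char pair[256];
        snprintf(pair, sizeof(pair), "%s%s = EXCLUDED.%s",
                 i > 0 ? ", " : "", columns[i], columns[i]);
        strncat(clause, pair, sizeof(clause) - strlen(clause) - 1);
    }
    return clause;
}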

CLI Commands

Schema Validation

# Validate schemas without starting sync
pgsyncd schema validate \
  --source "postgresql://source/db" \
  --target "postgresql://target/db"

# Output:
╔════════════════════════════════════════════════════════════════╗
║ Schema Validation Report                                      ║
╠════════════════════════════════════════════════════════════════╣
║ public.users                                          ❌ ERROR  ║
║   15 source columns, 14 target columns                        ║
║   ⚠  Source has 'phone' (not in target) - WILL STOP SYNC     ║
║                                                                ║
║ public.orders                                         ✓ OK     ║
║   All 12 columns match perfectly                              ║
║                                                                ║
║ public.products                                       ❌ ERROR  ║
║   Type mismatch: price (numeric vs integer)                   ║
║   Action: Fix target schema or use --allow-type-cast          ║
╚════════════════════════════════════════════════════════════════╝

Summary:
  ✓ 1 table fully compatible
  ❌ 2 tables INCOMPATIBLE

Recommendation: Fix issues before running sync

Schema Diff

# Show detailed schema differences
pgsyncd schema diff \
  --source "postgresql://source/db" \
  --target "postgresql://target/db" \
  --table users

# Output:
Table: public.users

Columns in common: 13
  ✓ id (integer) - types match
  ✓ name (varchar) - types match
  ✓ email (varchar) - types match
  ...

Source only: 2 columns
  + phone              varchar(20)           ⚠ Data loss risk
  + ssn                varchar(11)           ⚠ Data loss risk

Target only: 1 column
  + created_by_system  text DEFAULT 'pgsyncd'  ✓ Safe (has default)

Type mismatches: 0

Sync status: INCOMPATIBLE
Reason: Source has columns not in target (data loss risk)

Fix:
  ALTER TABLE target.users ADD COLUMN phone varchar(20);
  ALTER TABLE target.users ADD COLUMN ssn varchar(11);

Schema Sync (Generate Migration)

# Generate SQL to align target with source
pgsyncd schema sync \
  --source "postgresql://source/db" \
  --target "postgresql://target/db" \
  --dry-run

# Output:
[DRY RUN] Would execute on target:

-- Table: public.users
ALTER TABLE public.users ADD COLUMN phone varchar(20);
ALTER TABLE public.users ADD COLUMN ssn varchar(11);

-- Table: public.products
ALTER TABLE public.products ALTER COLUMN price TYPE numeric(10,2);

-- Apply with:
pgsyncd schema sync --source ... --target ... --apply
# Actually apply migrations
pgsyncd schema sync \
  --source "postgresql://source/db" \
  --target "postgresql://target/db" \
  --apply

# Output:
[INFO] Aligning target schema with source...
[INFO] Adding column target.users.phone
[INFO] Adding column target.users.ssn
[INFO] Altering column target.products.price
[SUCCESS] Target schema now matches source
[SUCCESS] Run 'pgsyncd setup' to refresh schema fingerprints

Configuration Options

CLI Flags (Override Safety Checks)

# Allow specific schema drift types
--allow-schema-drift=<type>
  Types:
    - source_superset: Allow source to have extra columns (SKIP them - data loss!)
    - target_superset: Allow target to have extra columns (use NULL/DEFAULT)
    - type_casts: Allow lossy type casts (precision loss!)
    - all: Allow any drift (DANGEROUS - requires --force)

# Granular column control
--skip-columns=<table.column>,<table.column>
  Example: --skip-columns=users.ssn,users.phone

--preserve-columns=<table.column>,<table.column>
  Keep target columns that were removed from source

--null-removed-columns=<table.column>,<table.column>
  Set target columns to NULL when removed from source

# Type cast control
--allow-type-cast=<table.column>:<target_type>
  Example: --allow-type-cast=products.price:integer

--allow-all-type-casts
  Allow all type casts, even lossy ones (DANGEROUS)

# Safety override
--force
  Required with dangerous flags like --allow-schema-drift=all
  Prevents accidental data loss

Configuration File

# pgsyncd.yaml

schema:
  # Validation strictness
  validation: strict              # strict | permissive | disabled

  # Validation frequency
  validation_interval: 60s        # How often to re-check schemas during sync

  # Column handling
  source_added_columns: stop      # stop | skip (data loss!)
  target_added_columns: allow     # allow | stop
  source_removed_columns: stop    # stop | preserve | null

  # Type handling
  type_mismatches: stop           # stop | cast_if_safe | cast_always
  safe_casts_allowed: true        # Allow int→bigint, varchar(10)→varchar(20), etc

  # Per-table overrides
  table_overrides:
    - table: public.audit_logs
      skip_columns: [user_ip, user_agent]  # Don't sync PII to reporting DB

    - table: public.products
      allow_type_cast:
        price: integer  # Explicitly allow lossy cast for this table

Metrics & Observability

Prometheus Metrics

# Schema validation status per table
pgsyncd_schema_status{namespace, deployment, table, schema}
  1 = schemas match
  0 = schema mismatch blocking sync

# Schema mismatches by type
pgsyncd_schema_mismatches_total{namespace, deployment, table, type}
  type: source_added_column, target_added_column, type_mismatch,
        source_removed_column, safe_cast, unsafe_cast

# Tables blocked by schema issues
pgsyncd_tables_blocked_schema_total{namespace, deployment, reason}
  reason: data_loss_risk, type_mismatch, target_not_null

# Rows skipped due to schema issues
pgsyncd_rows_skipped_schema_total{namespace, deployment, table, reason}
  reason: column_mismatch, cast_error, constraint_violation

# Schema validations performed
pgsyncd_schema_validations_total{namespace, deployment, result}
  result: pass, fail_safe, fail_unsafe, warning
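
These metrics could be exposed without any client library by rendering the Prometheus text exposition format directly; a sketch for the per-table status gauge (struct and function names are illustrative):

#include <stdio.h>

/* Sketch only: render pgsyncd_schema_status into a buffer served from a
 * /metrics endpoint.  "ns" holds the namespace label value. */
typedef struct TableSchemaStatus {
    const char *ns;
    const char *deployment;
    const char *table;
    const char *schema;
    int         schemas_match;   /* 1 = schemas match, 0 = mismatch blocking sync */
} TableSchemaStatus;

static int
render_schema_status_metrics(const TableSchemaStatus *tables, int count,
                             char *buf, size_t buflen)
{
    int written = snprintf(buf, buflen,
        "# HELP pgsyncd_schema_status 1 = schemas match, 0 = mismatch blocking sync\n"
        "# TYPE pgsyncd_schema_status gauge\n");

    for (int i = 0; i < count && written >= 0 && (size_t) written < buflen; i++)
        written += snprintf(buf + written, buflen - written,
            "pgsyncd_schema_status{namespace=\"%s\",deployment=\"%s\","
            "table=\"%s\",schema=\"%s\"} %d\n",
            tables[i].ns, tables[i].deployment,
            tables[i].table, tables[i].schema, tables[i].schemas_match);

    return written;
}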

Grafana Dashboard Panels

Panel: Schema Health Status

# Table showing schema status for all tables
pgsyncd_schema_status{namespace="$namespace", deployment="$deployment"}

Panel: Schema Mismatch Alerts

# Count of tables with schema issues
count(pgsyncd_schema_status{namespace="$namespace", deployment="$deployment"} == 0)

Alert: Schema Blocking Sync

- alert: PgsyncSchemaBlocked
  expr: pgsyncd_schema_status == 0
  for: 5m
  annotations:
    summary: "pgsyncd sync blocked by schema mismatch on {{ $labels.table }}"
    description: |
      Table {{ $labels.table }} has schema mismatch blocking sync.

      Check: pgsyncd schema diff --table {{ $labels.table }}

      Dashboard: {{ $grafana_url }}/d/pgsyncd
  labels:
    severity: critical

Migration Path

Phase 1: Detection Only (Safe Rollout)

Week 1-2: Add validation, don't block

# Validate schemas but only warn, don't stop sync
pgsyncd sync --schema-validation=warn-only

# Logs schema mismatches but continues syncing
[WARN] Schema mismatch on public.users (source has 'phone', target doesn't)
[WARN] This will become a blocking error in version X.Y
[WARN] Continuing sync for now (--schema-validation=warn-only)

Metrics collected:

  • How many tables have schema mismatches in production?
  • What types of mismatches are most common?
  • Are any mismatches causing data loss?

Phase 2: Blocking Mode (After Validation)

Week 3+: Enable blocking

# Default behavior: stop on schema mismatch
pgsyncd sync  # --schema-validation=strict (default)

# Output:
[ERROR] Schema mismatch on public.users
[ERROR] Source has new column 'phone' not in target
[ERROR] Sync STOPPED to prevent data loss

Phase 3: Automatic Schema Sync (Future)

Future enhancement:

# Automatically apply schema changes to target
pgsyncd sync --auto-migrate-schema

# Or in config:
schema:
  auto_migrate: true
  auto_migrate_types: [add_column, safe_type_change]
  require_approval: true  # Require operator confirmation

Error Messages (User Experience)

Example 1: Source Added Column

╔════════════════════════════════════════════════════════════════╗
║ SYNC STOPPED: Schema Mismatch                                 ║
╠════════════════════════════════════════════════════════════════╣
║ Table: public.users                                            ║
║                                                                ║
║ Problem:                                                       ║
║   Source database has new column 'phone_number' (varchar)     ║
║   Target database does not have this column                   ║
║                                                                ║
║ Impact:                                                        ║
║   If sync continues, phone_number data will be LOST           ║
║   Users inserted/updated since schema change: ~5,000          ║
║                                                                ║
║ Fix (RECOMMENDED):                                            ║
║   Add column to target database:                              ║
║                                                                ║
║   ALTER TABLE target.public.users                             ║
║   ADD COLUMN phone_number varchar;                            ║
║                                                                ║
║ Alternative (NOT RECOMMENDED - DATA LOSS):                    ║
║   Skip this column:                                           ║
║                                                                ║
║   pgsyncd sync --skip-columns=users.phone_number              ║
║                                                                ║
║ More info:                                                    ║
║   pgsyncd schema diff --table users                           ║
╚════════════════════════════════════════════════════════════════╝

Example 2: Type Mismatch

╔════════════════════════════════════════════════════════════════╗
║ SYNC STOPPED: Type Mismatch                                   ║
╠════════════════════════════════════════════════════════════════╣
║ Table: public.products                                         ║
║ Column: price                                                  ║
║                                                                ║
║ Problem:                                                       ║
║   Source: numeric(10,2) - e.g., $19.99                        ║
║   Target: integer       - e.g., $20                           ║
║                                                                ║
║ Impact:                                                        ║
║   Casting would lose cents: $19.99 → $20                      ║
║   This is DATA CORRUPTION                                     ║
║                                                                ║
║ Fix (RECOMMENDED):                                            ║
║   Match target type to source:                                ║
║                                                                ║
║   ALTER TABLE target.public.products                          ║
║   ALTER COLUMN price TYPE numeric(10,2);                      ║
║                                                                ║
║ Alternative (DATA LOSS):                                      ║
║   Explicitly allow precision loss:                            ║
║                                                                ║
║   pgsyncd sync --allow-type-cast=products.price:integer       ║
║                                                                ║
╚════════════════════════════════════════════════════════════════╝

Testing Strategy

Unit Tests

// tests/syncd/test_schema_validation.c

#include <assert.h>
#include <string.h>

void test_detect_source_added_column() {
    // Setup: Create source with extra column
    SchemaFingerprint *source = create_schema(
        "public", "users",
        (SchemaColumn[]){
            {"id", "integer", false, NULL},
            {"name", "varchar", false, NULL},
            {"phone", "varchar", false, NULL}  // Extra column
        }, 3
    );

    SchemaFingerprint *target = create_schema(
        "public", "users",
        (SchemaColumn[]){
            {"id", "integer", false, NULL},
            {"name", "varchar", false, NULL}
        }, 2
    );

    // Test
    SchemaDiff diff = compute_schema_diff(source, target);

    // Assert
    assert(diff.has_data_loss_risk == true);
    assert(diff.source_added_count == 1);
    assert(strcmp(diff.source_added[0], "phone") == 0);
}

void test_allow_safe_type_widening() {
    SchemaFingerprint *source = create_schema_with_type("id", "integer");
    SchemaFingerprint *target = create_schema_with_type("id", "bigint");

    SchemaDiff diff = compute_schema_diff(source, target);

    assert(diff.has_type_mismatches == true);
    assert(diff.has_unsafe_casts == false);  // int→bigint is safe
    assert(diff.has_data_loss_risk == false);
}

void test_detect_lossy_type_cast() {
    SchemaFingerprint *source = create_schema_with_type("price", "numeric");
    SchemaFingerprint *target = create_schema_with_type("price", "integer");

    SchemaDiff diff = compute_schema_diff(source, target);

    assert(diff.has_unsafe_casts == true);
    assert(diff.has_data_loss_risk == true);
}
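
The create_schema() and create_schema_with_type() fixtures used above are not defined in this proposal; one possible shape (allocation checks and cleanup omitted for brevity):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sketch only: build an in-memory SchemaFingerprint for unit tests.
 * fingerprintHash is left NULL because these tests compare structures. */
static SchemaFingerprint *
create_schema(const char *nspname, const char *relname,
              const SchemaColumn *columns, int count)
{
    SchemaFingerprint *fp = calloc(1, sizeof(SchemaFingerprint));
    char qualified[256];

    snprintf(qualified, sizeof(qualified), "%s.%s", nspname, relname);
    fp->tableName = strdup(qualified);
    fp->columnCount = count;
    fp->columns = calloc(count, sizeof(SchemaColumn));
    memcpy(fp->columns, columns, count * sizeof(SchemaColumn));
    return fp;
}

/* Convenience wrapper for the single-column type-mismatch tests. */
static SchemaFingerprint *
create_schema_with_type(const char *column_name, const char *data_type)
{
    SchemaColumn col = { (char *) column_name, (char *) data_type, false, NULL, 1 };
    return create_schema("public", "t", &col, 1);
}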

Integration Tests

#!/usr/bin/env bash
# tests/syncd/test_schema_evolution.sh

test_source_added_column_stops_sync() {
    # Setup: Create source with extra column
    psql $SOURCE -c "ALTER TABLE users ADD COLUMN phone varchar"

    # Test: Sync should fail
    if pgsyncd sync --source $SOURCE --target $TARGET --dir $WORKDIR > sync.log 2>&1; then
        error "Sync should have failed on schema mismatch"
        return 1
    fi

    # Verify error message
    if ! grep -q "Source has new column 'phone'" sync.log; then
        error "Expected schema mismatch error"
        return 1
    fi

    # Fix schema
    psql $TARGET -c "ALTER TABLE users ADD COLUMN phone varchar"

    # Sync should now succeed
    pgsyncd sync --source $SOURCE --target $TARGET --dir $WORKDIR
}

test_target_added_nullable_column_continues() {
    # Setup: Add nullable column to target
    psql $TARGET -c "ALTER TABLE users ADD COLUMN metadata text"

    # Test: Sync should succeed with warning
    pgsyncd sync --source $SOURCE --target $TARGET --dir $WORKDIR 2>&1 | tee sync.log

    # Verify warning logged
    grep -q "Target has extra column 'metadata'" sync.log

    # Verify data synced
    assert_row_count_matches
}

test_escape_hatch_skip_columns() {
    # Setup: Source has extra column
    psql $SOURCE -c "ALTER TABLE users ADD COLUMN ssn varchar"

    # Test: Skip column explicitly
    pgsyncd sync \
        --source $SOURCE \
        --target $TARGET \
        --dir $WORKDIR \
        --skip-columns=users.ssn

    # Verify sync succeeded
    assert_row_count_matches

    # Verify SSN data not in target
    assert_column_not_exists "users" "ssn" $TARGET
}

Open Questions

  1. Performance impact of schema validation?

    • How often to re-validate during long-running sync?
    • Cache fingerprints? For how long?
    • Query information_schema on every loop iteration?
  2. Backward compatibility?

    • Should this be opt-in initially? (--schema-validation=strict)
    • Or opt-out for existing deployments? (--schema-validation=disabled)
  3. Schema sync automation?

    • Should pgsyncd automatically apply safe schema changes to target?
    • Or always require manual approval?
    • How to handle production safety?
  4. Partial table sync?

    • If 10 tables are syncing and 1 has schema issue, should:
      • Stop all 10 tables?
      • Continue 9, stop 1?
      • Current proposal: Continue healthy tables, stop problematic one
  5. Schema change detection latency?

    • How quickly should we detect schema changes?
    • Re-validate every loop iteration (slow)?
    • Re-validate every N minutes?
    • Re-validate on error only?

Success Metrics

Before Implementation (Baseline)

  • Schema-related incidents per month: ~5-10
  • Time to detect schema issue: 2-24 hours (manual discovery)
  • Data loss incidents: ~1-2 per quarter
  • Mean time to recovery: 2-4 hours (find issue, fix schema, backfill)

After Implementation (Target)

  • Schema-related incidents per month: 0-1 (only if validation disabled)
  • Time to detect schema issue: < 1 minute (automatic detection)
  • Data loss incidents: 0 (sync stops before data loss)
  • Mean time to recovery: < 30 minutes (clear error message, fix schema, resume)

Monitoring

# Schema mismatches detected over the past week
increase(pgsyncd_schema_mismatches_total[7d])

# Time to detect
pgsyncd_schema_validation_duration_seconds

# Tables blocked
pgsyncd_tables_blocked_schema_total

Alternatives Considered

Alternative 1: Always Allow Schema Drift (Permissive)

Approach: Sync whatever columns exist in both, ignore differences

Pros:

  • Never blocks sync
  • Simpler implementation

Cons:

  • Silent data loss (worst case scenario)
  • No visibility into problems
  • Difficult to debug later

Rejected: Violates "never silently lose data" principle

Alternative 2: Manual Schema Locking

Approach: Require explicit schema version in catalog, fail on any change

Pros:

  • Maximum safety
  • Explicit schema contracts

Cons:

  • Too strict for normal operations
  • Every schema change requires manual intervention
  • Poor developer experience

Rejected: Too rigid for real-world use

Alternative 3: Column Allowlists

Approach: Only sync columns explicitly listed in config

Pros:

  • Explicit about what to sync
  • Schema changes don't break sync

Cons:

  • High maintenance (list every column)
  • Error-prone (forget to add new columns)
  • Doesn't solve type mismatch problem

Rejected: Too much manual work, error-prone


Conclusion

This proposal provides a fail-safe approach to schema evolution:

  • Stops sync when data loss risk exists
  • Continues when changes are safe
  • Clear error messages with resolution steps
  • Escape hatches for emergencies
  • Observable via metrics

Key insight: When schemas diverge, the question isn't "can we technically continue?" The question is "will continuing cause data loss or corruption?" If yes → STOP. If no → CONTINUE.

This approach never silently loses data, makes problems visible immediately, and provides clear paths to resolution.


Appendix: SQL Queries

Query Source Schema

SELECT
    column_name,
    data_type,
    character_maximum_length,
    numeric_precision,
    numeric_scale,
    is_nullable,
    column_default,
    ordinal_position
FROM information_schema.columns
WHERE table_schema = $1
  AND table_name = $2
ORDER BY ordinal_position;

Query Target Schema

-- Same as source query

Detect Schema Differences

-- Columns in source but not target (data loss risk)
SELECT s.column_name, s.data_type
FROM source_schema s
LEFT JOIN target_schema t ON s.column_name = t.column_name
WHERE t.column_name IS NULL;

-- Columns in target but not source
SELECT t.column_name, t.data_type, t.is_nullable, t.column_default
FROM target_schema t
LEFT JOIN source_schema s ON t.column_name = s.column_name
WHERE s.column_name IS NULL;

-- Type mismatches
SELECT
    s.column_name,
    s.data_type AS source_type,
    t.data_type AS target_type
FROM source_schema s
JOIN target_schema t ON s.column_name = t.column_name
-- IS DISTINCT FROM treats NULL lengths/precisions (e.g. text) as comparable
WHERE s.data_type IS DISTINCT FROM t.data_type
   OR s.character_maximum_length IS DISTINCT FROM t.character_maximum_length
   OR s.numeric_precision IS DISTINCT FROM t.numeric_precision;