Scalable Nomad Architecture

Scalable Nomad Architecture Evolution (Multi-Client / Global Ready)

Initial Setup

  • Nomad Server: 1
  • Nomad Clients: 3
  • Use Case: Service discovery + load balancing with automatic failover
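
A quick sanity check of this topology, as a sketch: it assumes the Nomad CLI is pointed at the server (e.g., via NOMAD_ADDR).

# Confirm 1 server and 3 ready clients
nomad server members
nomad node status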

Before Failover

         ┌────────────┐
         │   Client   │
         └─────┬──────┘
               │
    ┌──────────▼──────────┐
    │ External LB (CNAME) │
    └──────────┬──────────┘
               │
  ┌────────────┼────────────┐
  │            │            │
┌─▼─┐        ┌─▼─┐        ┌─▼─┐
│C1 │        │C2 │        │C3 │
│T+C│        │T+C│        │T+C│
│N  │        │N  │        │N  │
│API│        │API│        │API│
└───┘        └───┘        └───┘
  • T = Traefik, C = Consul, N = Nomad
  • go-api.nomad running on 3 clients
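
Assuming go-api.nomad defines the job with count = 3 and a Consul service block (the file contents are not shown here), bringing this state up is:

# Submit the job and confirm 3 running allocations
nomad job run go-api.nomad
nomad job status go-api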

After Failover (e.g., C1 Disk Full)

C1 dies or gets drained
    ▼
API task is rescheduled to C4

         ┌────────────┐
         │   Client   │
         └─────┬──────┘
               │
    ┌──────────▼──────────┐
    │ External LB (CNAME) │
    └──────────┬──────────┘
               │
  ┌────────────┼────────────┐
  │            │            │
┌─▼─┐        ┌─▼─┐        ┌─▼─┐
│C2 │        │C3 │        │C4 │
│T+C│        │T+C│        │T+C│
│N  │        │N  │        │N  │
│API│        │API│        │API│ (migrated)
└───┘        └───┘        └───┘
  • Consul updates service address
  • Traefik reloads config dynamically
  • No downtime

Scalable Global Version

  • Nomad Server: 3+ (HA)
  • Nomad Clients: 10+
  • Regions: US-East, EU-West
  • External DNS LB (e.g., Route53, Cloudflare)
  • Traefik + Consul Connect (optional) for a secure service mesh

✅ Nomad Failover Test: Disk Overflow Recovery

This document outlines a test scenario for verifying Nomad's allocation migration in the event of disk overflow, using the following architecture:

                         ┌────────────┐
                         │   Client   │
                         │   Request  │
                         └─────┬──────┘
                               │
                ┌─────────────▼─────────────┐
                │  External Load Balancer   │
                │   (CNAME → Traefik IPs)   │
                └─────────────┬─────────────┘
                              │
           ┌──────────────────┼────────────────────┐
           │                  │                    │
     ┌─────▼─────┐      ┌─────▼─────┐        ┌─────▼─────┐
     │ Client 1  │      │ Client 2  │        │ Client 3  │
     │ Nomad     │      │ Nomad     │        │ Nomad     │
     │ Traefik   │      │ Traefik   │        │ Traefik   │
     │ Consul    │      │ Consul    │        │ Consul    │
     │           │      │           │        │           │
     │ go-api    │      │ go-api    │        │ go-api    │
      │ (running) │      │ (running) │        │ (running) │
     └─────┬─────┘      └───────────┘        └───────────┘
           │
           │  ⚠️ Node 1 Disk full (e.g. 50GB+)
           │
           ▼
     ┌──────────────────────────────────────┐
     │ 🚨 Nomad triggers allocation restart │
      │     based on resource constraints    │
     └──────────────────────────────────────┘
           │
           ▼
     ┌─────▼─────┐
     │ Client 4  │
     │ Nomad     │
     │ Traefik   │
     │ Consul    │
     │           │
     │ go-api    │  ◀◀◀─── Nomad migration
     │ (migrated)│
     └───────────┘

          🔁 Consul re-registration → Traefik routing updated automatically → user calls continue uninterrupted

🎯 Test Objectives

  • Simulate disk overflow on Client 1
  • Verify Nomad's automatic rescheduling of the go-api allocation to Client 4
  • Validate Consul service re-registration
  • Confirm Traefik automatically reroutes traffic to the new node

🧪 Test Steps

  1. Initial State Check

    • All clients (1–3) running go-api via Nomad
    • Traefik routing verified with a curl loop to confirm round-robin (see the command sketch after this list)
  2. Simulate Disk Overflow

    • Use fallocate or a large file copy to fill /opt/nomad/data on Client 1
    • Confirm disk exceeds threshold (e.g., > 50GB)
  3. Nomad Auto-Reallocation

    • Watch nomad alloc status or logs
    • Confirm go-api allocation is evicted from Client 1 and scheduled on Client 4
  4. Consul Service Re-registration

    • Check Consul UI or use dig to resolve updated go-api.service.consul
  5. Validate Routing

    • From a client shell: curl http://traefik.service.consul:8000
    • Confirm the response contains Hostname: Client 4
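
The commands behind these steps, as a sketch: it assumes /opt/nomad/data sits on the mount being filled and that Traefik's entrypoint listens on port 8000, both taken from the setup above.

# 1. Round-robin check: hostnames should rotate across Clients 1-3
for i in $(seq 1 10); do curl -s http://traefik.service.consul:8000; echo; done

# 2. Fill the Nomad data directory on Client 1, then confirm usage
fallocate -l 51G /opt/nomad/data/dummyfile
df -h /opt/nomad/data

# 3. Watch the allocation get rescheduled
nomad job status go-api

# 4. Confirm Consul re-registration via its DNS interface
dig @127.0.0.1 -p 8600 go-api.service.consul

# 5. Confirm routing now answers from Client 4
curl http://traefik.service.consul:8000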

✅ Expected Behavior

  • Nomad reacts to disk usage constraints
  • Allocation is restarted on a healthy node
  • Consul re-registers the service automatically
  • Traefik reflects the change and resumes service routing

🔧 Notes

  • This setup assumes:
    • All nodes are part of the same Nomad cluster
    • Consul agent and Traefik run on all clients
    • External Load Balancer (e.g., CNAME → Traefik IP) routes to any healthy node

Nomad Failover Test Flow (Consul + Traefik)

Cluster Setup

  • Nomad Server: 1
  • Nomad Clients: 4 (Client1, Client2, Client3, Client4)
  • Each client runs:
    • Nomad Client Agent
    • Consul Agent
    • Traefik instance

Job: go-api

  • 3 allocations running on 3 different Nomad clients
  • Nomad automatically handles placement and health checks
  • Consul auto-registers the service
  • Traefik routes traffic based on Consul catalog
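
To see what Traefik consumes, the Consul catalog can be queried directly, assuming the local Consul agent listens on its default HTTP port 8500:

# List healthy go-api instances, i.e. the backends Traefik will route to
curl -s 'http://127.0.0.1:8500/v1/health/service/go-api?passing'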

Failover Simulation Flow

Step 1: Observe existing allocations

nomad job status go-api

Step 2: Simulate disk overuse on Client1

# Fill the Nomad data directory with a dummy file to simulate disk pressure
fallocate -l 51G /opt/nomad/data/dummyfile
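
To confirm the disk actually crossed the threshold (assuming /opt/nomad/data is the affected mount):

# Usage should now exceed the alert threshold (e.g., > 50GB)
df -h /opt/nomad/data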

Step 3: Scale up job to add a 4th instance

nomad job scale go-api 4

Step 4: Verify allocation migration

nomad alloc status <alloc-id>

Step 5: Drain the overloaded node

nomad node drain -enable -yes <node-id>
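
The drain can be verified before moving on (the node ID comes from nomad node status):

# The drained node should report Drain = true with no running allocations
nomad node status <node-id>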

Step 6: Verify Consul + Traefik update routing

dig @127.0.0.1 -p 8600 go-api.service.consul
curl http://traefik.service.consul:8000/go-api
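
Optionally, if Traefik's API/dashboard is enabled (an assumption; it is off by default), its view of the backends can be inspected directly:

# Requires api.insecure=true on Traefik; lists services and their server URLs
curl -s http://127.0.0.1:8080/api/http/services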

Outcome

  • New allocation lands on Client4
  • Client1 is gracefully drained and removed
  • Traefik dynamically picks up the new service location via Consul