@rami-ruhayel-vgw
Last active March 27, 2026 06:52
Geo Event Instrumentation Addendum: Unified Alert Source, Snowflake Integration Path, and CloudFront Geo Root Cause


Status: Loki instrumentation (events, dashboard, alerts, simulator) is in draft PRs awaiting review. Snowflake integration and CloudFront geo fix are proposed but not yet implemented.

Companion addendum: KPI Addendum (Loki vs Snowflake, data pipeline, first-time/repeat segmentation)

Builds on: Technical Approach which covers sink abstraction, event payloads, dashboard panels, and simulator.

Deep-dive: Data Pipeline Investigation (full lineage traces, VGW data platform inventory, root cause analysis)

Roadmap: Rollout Observability Roadmap (phased ticket breakdown)


Unified Alert Definitions

Alert rules are defined once in monitoring/alerts/*.yml and consumed by both Terraform (production Grafana Cloud) and the local dev stack.

The problem solved

Previously, alert rules existed in two places:

  • monitoring/grafana-geo-location-alerts.tf (696 lines of inline HCL for production)
  • monitoring/dev/provisioning/alerting/rules.yml (hand-maintained YAML for local Grafana)

These drifted. The local YAML had different condition structures, different ref IDs, and was missing the threshold data blocks that the production TF had.

The solution

A single canonical YAML (monitoring/alerts/geo-location.yml) using #terraform_environment as a placeholder:

# monitoring/alerts/geo-location.yml
apiVersion: 1
groups:
  - orgId: 1
    name: Geo Location
    folder: Global Poker
    interval: 1m
    rules:
      - uid: geo-verification-failure-rate-spike
        title: Verification Failure Rate Spike
        condition: D
        for: 5m
        data:
          - refId: A
            datasourceUid: grafanacloud-logs
            model:
              expr: >-
                sum(count_over_time({aws_log_group="/#terraform_environment/ecs/gameclient"}
                |= `[geo-analytics] geo.verification.completed`
                |= `"outcome":"failed"` [5m]))

Terraform reads this via yamldecode(replace(file(...), "#terraform_environment", var.environment)) and creates grafana_rule_group resources with dynamic blocks. Contact point routing is derived from the severity label:

contact_point = (
  var.environment == "pok-prod-public" &&
  rule.value.labels.severity == "critical"
) ? "pok - critical" : "pok - warning"
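
A minimal sketch of what the generic dynamic-block logic might look like (the local name, folder resource, and attribute set here are illustrative, not the actual 55-line file):

```hcl
# monitoring/grafana-geo-location-alerts.tf (sketch; names are illustrative)
locals {
  alert_groups = yamldecode(
    replace(
      file("${path.module}/alerts/geo-location.yml"),
      "#terraform_environment",
      var.environment
    )
  ).groups
}

resource "grafana_rule_group" "geo" {
  for_each         = { for g in local.alert_groups : g.name => g }
  name             = each.value.name
  folder_uid       = grafana_folder.global_poker.uid # assumed folder resource
  interval_seconds = 60

  dynamic "rule" {
    for_each = each.value.rules
    content {
      name      = rule.value.title
      condition = rule.value.condition
      for       = rule.value.for
      # data blocks, labels, and the severity-based contact_point expression
      # shown above follow the same dynamic pattern
    }
  }
}
```

Because the YAML is the single source of truth, adding a rule never touches this file.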

Local dev replaces the placeholder via an init-grafana busybox container in docker-compose:

init-grafana:
  image: busybox
  command: >-
    sh -c '
      for f in /alerts/*.yml; do
        sed "s/#terraform_environment/local/g" "$f" > "/out-alerts/$(basename $f)"
      done &&
      for f in /dashboards/*.json; do
        sed "s/#terraform_environment/local/g" "$f" > "/out-dashboards/$(basename $f)"
      done
    '

The same init-grafana service also processes dashboard JSONs. The $environment dropdown variable was removed from the dashboard (each Grafana instance is environment-specific) and replaced with a hidden constant whose value is #terraform_environment.

Adding a new alert rule

Edit monitoring/alerts/geo-location.yml. Both production and local dev pick up the change automatically. The TF file is 55 lines of generic dynamic-block logic, not per-rule HCL.
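
For example, a new rule is appended to the rules list in the same shape as the existing ones (the uid, log line, and expression below are hypothetical, for illustration only):

```yaml
# appended under rules: in monitoring/alerts/geo-location.yml (hypothetical rule)
      - uid: geo-sdk-init-failure-rate
        title: SDK Init Failure Rate
        condition: D
        for: 5m
        labels:
          severity: warning   # drives contact-point routing in Terraform
        data:
          - refId: A
            datasourceUid: grafanacloud-logs
            model:
              expr: >-
                sum(count_over_time({aws_log_group="/#terraform_environment/ecs/gameclient"}
                |= `[geo-analytics] geo.sdk.init.failed` [5m]))
```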

Six alert rules

| Rule | Condition | Severity |
| --- | --- | --- |
| Verification Failure Rate Spike | >20% failures over 5min | critical |
| Elevated Session Sign-Outs | >10 sign-outs in 1hr | warning |
| Lobby Load Rate Drop | <70% lobby rate over 15min | critical |
| Friction Impact Exceeded | >15pp delta between geo and control over 15min | warning |
| No Geo Events Ingested | Zero events for 15min | critical |
| Verification Latency Spike | P95 >10s over 5min | warning |

Dashboard Environment Variable

The $environment template variable was changed from a query dropdown to a hidden constant:

{
  "type": "constant",
  "name": "environment",
  "hide": 2,
  "query": "#terraform_environment"
}

Terraform's existing replace(file(...), "#terraform_environment", var.environment) substitutes the value at deploy time. Queries continue using $environment unchanged. The dropdown is gone from the UI.


Docker Compose Profiles

Simulators use Docker Compose profiles so you're not forced to start everything:

docker compose up -d                              # Core only (Loki + Grafana)
docker compose --profile geo up -d                # Core + geo simulator
docker compose --profile geo --profile kyc up -d  # Multiple simulators
docker compose --profile geo down                 # Stop core and profiled services

New simulators add profiles: [their-domain] to their docker-compose service definition. Dashboards and alerts are auto-provisioned from monitoring/dashboards/ and monitoring/alerts/ (no docker-compose changes needed).
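
A sketch of what a new simulator's service definition might look like (the service name, image, and dependency are illustrative):

```yaml
# docker-compose.yml (sketch; service and image names are illustrative)
services:
  kyc-simulator:
    image: kyc-simulator:local
    profiles: [kyc]        # only started with --profile kyc
    depends_on:
      - loki
```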


Verification Latency (durationMs)

The simulator generates realistic durationMs values using outcome-specific distributions:

| Outcome | Distribution | P50 | P95 |
| --- | --- | --- | --- |
| success | log-normal(0.9, 0.4) | ~2.5s | ~5s |
| restricted | log-normal(0.9, 0.4) | ~2.5s | ~5s |
| failed/permission_denied | log-normal(0.0, 0.3) | ~1s | ~2s |
| failed/timeout | uniform(60s, 120s) | ~90s | ~117s |
| failed/verification_failed | log-normal(1.1, 0.5) | ~3s | ~7s |
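
The sampling above can be sketched as follows. Taking the uniforms as explicit parameters (via Box-Muller) makes it deterministic for testing; the function names are illustrative, not the simulator's actual code:

```typescript
// Sketch of outcome-specific durationMs sampling (names are illustrative).
// A log-normal(mu, sigma) sample is exp(mu + sigma * z) with z ~ N(0, 1).
function logNormalMs(mu: number, sigma: number, u1: number, u2: number): number {
  // Box-Muller transform: two uniforms in (0, 1) -> one standard normal
  const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
  return Math.exp(mu + sigma * z) * 1000; // seconds -> milliseconds
}

function durationMsFor(
  outcome: string,
  u1: number = Math.random(),
  u2: number = Math.random()
): number {
  switch (outcome) {
    case "success":
    case "restricted":
      return logNormalMs(0.9, 0.4, u1, u2); // P50 ~2.5s, P95 ~5s
    case "failed/permission_denied":
      return logNormalMs(0.0, 0.3, u1, u2); // P50 ~1s, P95 ~2s
    case "failed/timeout":
      return (60 + u1 * 60) * 1000;         // uniform(60s, 120s)
    default:
      return logNormalMs(1.1, 0.5, u1, u2); // failed/verification_failed
  }
}
```

Note the P50 column is simply exp(mu): exp(0.9) ≈ 2.46s, exp(0.0) = 1s, exp(1.1) ≈ 3s.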

LogQL uses regexp extraction (not | json) because log lines have a timestamp prefix before the JSON:

quantile_over_time(0.95,
  {aws_log_group="/$environment/ecs/gameclient"}
  |= `[geo-analytics] geo.verification.completed`
  | regexp `"durationMs":(?P<durationMs>\d+)`
  | unwrap durationMs [$__interval]
) by ()
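
To see why `| json` cannot be used here, consider a line shaped like the simulator's output (the timestamp prefix below is an assumed example): the whole line is not valid JSON, but a regexp capture pulls out durationMs directly.

```typescript
// A log line with a non-JSON prefix (example shape, not a real log).
const line =
  '2026-03-27T06:52:00Z [geo-analytics] geo.verification.completed ' +
  '{"outcome":"failed","durationMs":4312}';

// What `| json` effectively attempts: parsing the whole line fails.
let parsed: unknown = null;
try {
  parsed = JSON.parse(line);
} catch {
  // expected: the timestamp prefix makes the line invalid JSON
}

// The regexp stage's equivalent: extract the numeric capture group.
const m = line.match(/"durationMs":(\d+)/);
const durationMs = m ? Number(m[1]) : NaN;
```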

Snowflake Integration Path

Current state

Geo events exist only in Loki. The poker domain has a mature path to Snowflake (event stores -> ECS projectors -> S3 DLZ -> Snowpipe -> Snowflake) but geo events don't use it.

The IDENTITY_LOGIN table in Snowflake has 5 geo columns, all hardcoded NULL. The schema was designed for geo data that never arrived.

Root cause: CloudFront geo gap

CloudFront adds viewer_country and viewer_country_region headers. These flow through to the JWT (used as jwtGeo in Loki events). But the Auth0 post-audit-login-event action filters them out of the cookie object sent to pok-user, because viewer_country is not in the cookie allow-list (line 118 of post-audit-login-event.ts). The LOGGED_IN event stores raw IP but no resolved geo.

Auth0 also provides event.request.geoip in every action, but no action reads it.
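
The filtering described above can be sketched as follows (the allow-list contents and helper name are assumptions for illustration, not the actual Auth0 action code):

```typescript
// Sketch of the cookie allow-list behaviour (illustrative, not the real action).
// viewer_country is absent from the list, so it never reaches pok-user.
const COOKIE_ALLOW_LIST = ["session_id", "device_id"]; // assumed entries

function filterCookies(cookies: Record<string, string>): Record<string, string> {
  return Object.fromEntries(
    Object.entries(cookies).filter(([name]) => COOKIE_ALLOW_LIST.includes(name))
  );
}
```

With this filter in place, a cookie object containing viewer_country loses it before the audit call, which is why LOGGED_IN ends up with raw IP but no resolved geo.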

Integration plan

Phase 1: Fix CloudFront gap (pre-canary; the allow-list edit itself is one line)

Add viewer_country and viewer_country_region to the cookie allow-list in pok-auth0/auth0/src/actions/post-audit-login-event.ts line 118. Add corresponding fields to AuditUserLoginRequest and LoggedInEvent in pok-user. Update the IDENTITY_LOGIN Snowflake task to extract these instead of hardcoding NULL.

Result: IP-based geo in Snowflake for ALL users. Establishes baseline before GeoComply.

Phase 2: GeoComply verification events (before scaling beyond 5%)

Add GEO_VERIFICATION_COMPLETED event type to the pok-user event store:

Fields: outcome, subdivision, country, providerName ("GeoComply"),
        providerVersion, sourceType ("sdk"), durationMs, restrictionType

New pok-user endpoint: POST /audit/geo-verification/:id. Game client calls it after verification (parallel to the existing LogManager emitGeoEvent call). Events flow through the existing user-eventstore-snowflake-pm projector -> S3 DLZ -> Snowpipe -> Snowflake automatically.

Phase 3: Multi-source geo (at 100% rollout)

Each login has up to two geo readings:

  • CloudFront/Auth0 (IP-based, always available, approximate)
  • GeoComply SDK (device-level, canary group then all users, precise, legally required)

The IDENTITY_LOGIN table's GEO_LOCATION_SOURCE_TYPE and GEO_LOCATION_PROVIDER_NAME columns distinguish between sources.

Ownership

| Component | Owner | Change |
| --- | --- | --- |
| pok-auth0 | Our team | Allow-list fix, optionally read event.request.geoip |
| pok-user | Our team | New fields on LOGGED_IN, new GEO_VERIFICATION_COMPLETED event + endpoint |
| gp-game-client | Our team | Call pok-user endpoint after verification |
| pok-infra | Our team | Already done (Loki dashboards + alerts) |
| pok-snowflake | Data Engineering | Update IDENTITY_LOGIN task, new CLEANSED table for geo verification |