Status: Loki instrumentation (events, dashboard, alerts, simulator) is in draft PRs awaiting review. Snowflake integration and CloudFront geo fix are proposed but not yet implemented.
Companion addendum: KPI Addendum (Loki vs Snowflake, data pipeline, first-time/repeat segmentation)
Builds on: Technical Approach, which covers the sink abstraction, event payloads, dashboard panels, and simulator.
Deep-dive: Data Pipeline Investigation (full lineage traces, VGW data platform inventory, root cause analysis)
Roadmap: Rollout Observability Roadmap (phased ticket breakdown)
Alert rules are defined once in monitoring/alerts/*.yml and consumed by both Terraform (production Grafana Cloud) and the local dev stack.
Previously, alert rules existed in two places:
- monitoring/grafana-geo-location-alerts.tf (696 lines of inline HCL for production)
- monitoring/dev/provisioning/alerting/rules.yml (hand-maintained YAML for local Grafana)
These drifted. The local YAML had different condition structures, different ref IDs, and was missing the threshold data blocks that the production TF had.
A single canonical YAML (monitoring/alerts/geo-location.yml) using #terraform_environment as a placeholder:
```yaml
# monitoring/alerts/geo-location.yml
apiVersion: 1
groups:
  - orgId: 1
    name: Geo Location
    folder: Global Poker
    interval: 1m
    rules:
      - uid: geo-verification-failure-rate-spike
        title: Verification Failure Rate Spike
        condition: D
        for: 5m
        data:
          - refId: A
            datasourceUid: grafanacloud-logs
            model:
              expr: >-
                sum(count_over_time({aws_log_group="/#terraform_environment/ecs/gameclient"}
                |= `[geo-analytics] geo.verification.completed`
                |= `"outcome":"failed"` [5m]))
```

Terraform reads this via yamldecode(replace(file(...), "#terraform_environment", var.environment)) and creates grafana_rule_group resources with dynamic blocks. Contact point routing is derived from the severity label:
```hcl
contact_point = (
  var.environment == "pok-prod-public" &&
  rule.value.labels.severity == "critical"
) ? "pok - critical" : "pok - warning"
```

Local dev replaces the placeholder via an init-grafana busybox container in docker-compose:
```yaml
init-grafana:
  image: busybox
  command: >-
    sh -c '
      for f in /alerts/*.yml; do
        sed "s/#terraform_environment/local/g" "$f" > "/out-alerts/$(basename $f)"
      done &&
      for f in /dashboards/*.json; do
        sed "s/#terraform_environment/local/g" "$f" > "/out-dashboards/$(basename $f)"
      done
    '
```

The same init-grafana service also processes dashboard JSONs. The $environment dropdown variable was removed from the dashboard (each Grafana instance is environment-specific) and replaced with a hidden constant whose value is #terraform_environment.
Edit monitoring/alerts/geo-location.yml. Both production and local dev pick up the change automatically. The TF file is 55 lines of generic dynamic-block logic, not per-rule HCL.
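The generic dynamic-block logic can be sketched roughly as follows. This is a hedged sketch, not the actual TF file: the local/resource names and the exact grafana_rule_group attribute set are assumptions.

```hcl
# Sketch: load the canonical YAML, substitute the environment placeholder,
# and fan the parsed rule groups out into grafana_rule_group resources.
locals {
  alert_groups = yamldecode(
    replace(
      file("${path.module}/alerts/geo-location.yml"),
      "#terraform_environment",
      var.environment
    )
  ).groups
}

resource "grafana_rule_group" "geo" {
  for_each         = { for g in local.alert_groups : g.name => g }
  name             = each.value.name
  org_id           = each.value.orgId
  interval_seconds = 60

  dynamic "rule" {
    for_each = each.value.rules
    content {
      name      = rule.value.title
      condition = rule.value.condition
      # data blocks, labels, and the severity-based contact-point routing
      # shown above are built the same way with nested dynamic blocks.
    }
  }
}
```

Because the resource iterates over whatever the YAML contains, adding a seventh alert rule is a YAML-only change.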
| Rule | Condition | Severity |
|---|---|---|
| Verification Failure Rate Spike | >20% failures over 5min | critical |
| Elevated Session Sign-Outs | >10 sign-outs in 1hr | warning |
| Lobby Load Rate Drop | <70% lobby rate over 15min | critical |
| Friction Impact Exceeded | >15pp delta between geo and control over 15min | warning |
| No Geo Events Ingested | Zero events for 15min | critical |
| Verification Latency Spike | P95 >10s over 5min | warning |
The $environment template variable was changed from a query dropdown to a hidden constant:
```json
{
  "type": "constant",
  "name": "environment",
  "hide": 2,
  "query": "#terraform_environment"
}
```

Terraform's existing replace(file(...), "#terraform_environment", var.environment) substitutes the value at deploy time. Queries continue using $environment unchanged. The dropdown is gone from the UI.
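The substitution Terraform performs on the dashboard JSON can be mimicked in a few lines of Python to sanity-check what a provisioned variable ends up as (a sketch; the fragment mirrors the constant above, and the environment value is illustrative):

```python
import json

# Dashboard fragment with the placeholder, as stored in the repo.
template = json.dumps({
    "type": "constant",
    "name": "environment",
    "hide": 2,
    "query": "#terraform_environment",
})

# Equivalent of Terraform's replace(file(...), "#terraform_environment", var.environment).
rendered = template.replace("#terraform_environment", "pok-prod-public")

variable = json.loads(rendered)
print(variable["query"])  # pok-prod-public
```

The local-dev sed in init-grafana performs the same textual substitution with "local" as the value.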
Simulators use Docker Compose profiles so you're not forced to start everything:
```sh
docker compose up -d                               # Core only (Loki + Grafana)
docker compose --profile geo up -d                 # Core + geo simulator
docker compose --profile geo --profile kyc up -d   # Multiple simulators
docker compose --profile geo down                  # Stop, including profiled services
```

New simulators add profiles: [their-domain] to their docker-compose service definition. Dashboards and alerts are auto-provisioned from monitoring/dashboards/ and monitoring/alerts/ (no docker-compose changes needed).
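A new simulator's compose entry then looks something like this (a sketch; the service and image names are hypothetical):

```yaml
kyc-simulator:
  image: kyc-simulator:local   # hypothetical image name
  profiles: [kyc]              # only started with --profile kyc
  depends_on:
    - loki
```

Services without a profiles key (Loki, Grafana) always start, which is what keeps the core stack one command.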
The simulator generates realistic durationMs values using outcome-specific distributions:
| Outcome | Distribution | P50 | P95 |
|---|---|---|---|
| success | log-normal(0.9, 0.4) | ~2.5s | ~5s |
| restricted | log-normal(0.9, 0.4) | ~2.5s | ~5s |
| failed/permission_denied | log-normal(0.0, 0.3) | ~1s | ~2s |
| failed/timeout | uniform(60s, 120s) | ~90s | ~117s |
| failed/verification_failed | log-normal(1.1, 0.5) | ~3s | ~7s |
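The outcome-specific sampling can be sketched in Python. This is a sketch of the distributions in the table above, not the simulator's actual code; the function name is hypothetical, and log-normal parameters are (mu, sigma) of the underlying normal, in seconds.

```python
import random
from typing import Optional

def sample_duration_ms(outcome: str, reason: Optional[str] = None) -> int:
    """Draw a durationMs using the outcome-specific distributions above."""
    if outcome in ("success", "restricted"):
        seconds = random.lognormvariate(0.9, 0.4)   # P50 ~2.5s, P95 ~5s
    elif outcome == "failed" and reason == "timeout":
        seconds = random.uniform(60, 120)           # P50 ~90s, P95 ~117s
    elif outcome == "failed" and reason == "verification_failed":
        seconds = random.lognormvariate(1.1, 0.5)   # P50 ~3s, P95 ~7s
    else:  # failed/permission_denied and other fast failures
        seconds = random.lognormvariate(0.0, 0.3)   # P50 ~1s, P95 ~2s
    return int(seconds * 1000)
```

For a log-normal(mu, sigma), the median is e^mu, so success draws center on e^0.9 ≈ 2.46 s, matching the P50 column.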
LogQL uses regexp extraction (not | json) because log lines have a timestamp prefix before the JSON:
```logql
quantile_over_time(0.95,
  {aws_log_group="/$environment/ecs/gameclient"}
    |= `[geo-analytics] geo.verification.completed`
    | regexp `"durationMs":(?P<durationMs>\d+)`
    | unwrap durationMs [$__interval]
) by ()
```
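The reason regexp works where | json would not: the log line as a whole is not valid JSON because of the timestamp prefix, but the named capture still pulls the field out. The same extraction can be reproduced in Python (a sketch; the sample line is illustrative):

```python
import re

# Illustrative log line: timestamp prefix, then the event marker, then JSON.
line = ('2025-01-15T10:32:01.123Z [geo-analytics] geo.verification.completed '
        '{"outcome":"success","durationMs":2431}')

# Mirrors the LogQL stage: | regexp `"durationMs":(?P<durationMs>\d+)`
match = re.search(r'"durationMs":(?P<durationMs>\d+)', line)
duration_ms = int(match.group("durationMs"))
print(duration_ms)  # 2431
```

Feeding the full line to a JSON parser would raise immediately on the timestamp, which is exactly why the json pipeline stage fails on these lines.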
Geo events exist only in Loki. The poker domain has a mature path to Snowflake (event stores -> ECS projectors -> S3 DLZ -> Snowpipe -> Snowflake) but geo events don't use it.
The IDENTITY_LOGIN table in Snowflake has 5 geo columns, all hardcoded NULL. The schema was designed for geo data that never arrived.
CloudFront adds viewer_country and viewer_country_region headers. These flow through to the JWT (surfaced as jwtGeo in Loki events), but the Auth0 post-audit-login-event action filters them out of the cookie object sent to pok-user because viewer_country is not in the cookie allow-list (line 118 of post-audit-login-event.ts). The LOGGED_IN event stores the raw IP but no resolved geo.
Auth0 also provides event.request.geoip in every action, but no action reads it.
Phase 1: Fix CloudFront gap (pre-canary, one-line change)
Add viewer_country and viewer_country_region to the cookie allow-list in pok-auth0/auth0/src/actions/post-audit-login-event.ts line 118. Add corresponding fields to AuditUserLoginRequest and LoggedInEvent in pok-user. Update the IDENTITY_LOGIN Snowflake task to extract these instead of hardcoding NULL.
Result: IP-based geo in Snowflake for ALL users. Establishes baseline before GeoComply.
Phase 2: GeoComply verification events (before scaling beyond 5%)
Add GEO_VERIFICATION_COMPLETED event type to the pok-user event store:
```
Fields: outcome, subdivision, country, providerName ("GeoComply"),
        providerVersion, sourceType ("sdk"), durationMs, restrictionType
```
New pok-user endpoint: POST /audit/geo-verification/:id. Game client calls it after verification (parallel to the existing LogManager emitGeoEvent call). Events flow through the existing user-eventstore-snowflake-pm projector -> S3 DLZ -> Snowpipe -> Snowflake automatically.
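The body the game client would POST can be sketched as follows. Field names come from the event definition above; the exact wire format, and the providerVersion and restrictionType values, are assumptions for illustration.

```python
import json

# Hypothetical payload for POST /audit/geo-verification/:id, built from the
# GEO_VERIFICATION_COMPLETED fields listed above. Values are illustrative.
payload = {
    "outcome": "restricted",
    "subdivision": "US-WA",
    "country": "US",
    "providerName": "GeoComply",
    "providerVersion": "3.1.0",          # illustrative version string
    "sourceType": "sdk",
    "durationMs": 2431,
    "restrictionType": "state_blocked",  # illustrative restriction value
}
body = json.dumps(payload)
```

Once accepted by pok-user, the event rides the existing projector -> S3 DLZ -> Snowpipe path with no new pipeline work.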
Phase 3: Multi-source geo (at 100% rollout)
Each login has up to two geo readings:
- CloudFront/Auth0 (IP-based, always available, approximate)
- GeoComply SDK (device-level, canary group then all users, precise, legally required)
The IDENTITY_LOGIN table's GEO_LOCATION_SOURCE_TYPE and GEO_LOCATION_PROVIDER_NAME columns distinguish between sources.
| Component | Owner | Change |
|---|---|---|
| pok-auth0 | Our team | Allow-list fix, optionally read event.request.geoip |
| pok-user | Our team | New fields on LOGGED_IN, new GEO_VERIFICATION_COMPLETED event + endpoint |
| gp-game-client | Our team | Call pok-user endpoint after verification |
| pok-infra | Our team | Already done (Loki dashboards + alerts) |
| pok-snowflake | Data Engineering | Update IDENTITY_LOGIN task, new CLEANSED table for geo verification |