@rami-ruhayel-vgw
Last active March 27, 2026 06:52
GeoComply KPI Addendum: Data Pipeline Architecture, Snowflake Integration, and Rollout Observability Plan

GeoComply KPI Addendum: Data Pipeline Architecture + Rollout Observability

Status: Proposed. Findings from data pipeline investigation during V1 instrumentation work.

Companion addendum: Technical Approach Addendum (unified alerts, Snowflake integration path, CloudFront geo root cause)

Builds on: KPI Instrumentation Analysis which covers event model, state machines, funnel, and KPI definitions.

Deep-dive: Data Pipeline Investigation (full lineage traces, VGW data platform inventory, root cause analysis)

Roadmap: Rollout Observability Roadmap (phased ticket breakdown)


Two Systems, Two Purposes

Geo events need to exist in two systems. They serve different audiences asking different questions at different timescales.

Loki (operational monitoring): "Is the geo flow healthy right now?"

| Question | Who asks | Timescale |
| --- | --- | --- |
| Is the verification failure rate spiking? | On-call engineer | Minutes |
| Is GeoComply SDK latency degrading? | On-call engineer | Minutes |
| Are users being signed out mid-session? | On-call engineer | Hours |
| Is the canary group reaching the lobby? | Feature owner | Hours |
| Is geo causing more friction than control? | Feature owner | Hours |

Rate/count/quantile queries over sliding time windows. Triggers alerts. Aggregate rates are sufficient.
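To make "rate and quantile over a sliding window" concrete, here is a minimal sketch of the arithmetic behind those panels. The `GeoEvent` shape and its field names are assumptions for illustration, not the real event schema, and the nearest-rank quantile used here may differ from the method Loki applies internally.

```typescript
// Hypothetical event shape; field names are illustrative, not the real Loki schema.
interface GeoEvent {
  timestampMs: number;
  outcome: "success" | "failure";
  latencyMs: number;
}

// Keep only events inside the sliding window (nowMs - windowMs, nowMs].
function inWindow(events: GeoEvent[], nowMs: number, windowMs: number): GeoEvent[] {
  return events.filter((e) => e.timestampMs > nowMs - windowMs && e.timestampMs <= nowMs);
}

// Share of events that failed; returns 0 for an empty window.
function failureRate(events: GeoEvent[]): number {
  if (events.length === 0) return 0;
  return events.filter((e) => e.outcome === "failure").length / events.length;
}

// Nearest-rank quantile for q in (0, 1]; returns NaN for an empty input.
function quantile(values: number[], q: number): number {
  if (values.length === 0) return NaN;
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil(q * sorted.length));
  return sorted[rank - 1];
}
```

Both functions need only the events in the window, not who produced them, which is why aggregate rates are enough for this system.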

Status: DONE. 9 events instrumented, 29-panel dashboard, 6 alert rules, P50/P95 latency tracking.

Snowflake (product analytics): "What is geo doing to our players over time?"

| Question | Who asks | Timescale | Why Loki can't answer it |
| --- | --- | --- | --- |
| D7 retention for players whose first session included geo? | Product manager | Weeks | Requires joining first event with session 7 days later |
| How many unique players failed verification this week? | Product manager | Days | Loki counts events, not distinct players |
| Did a specific player complete all funnel steps? | Support engineer | Per-session | Requires joining events by sessionId |
| What % of new signups never reach the lobby? | Growth team | Weeks | Requires correlating registration with absence of lobby events |
| Does geo friction affect high-value players differently? | Analytics team | Months | Requires joining geo events with player value segment |
| Geo completion rate by US state? | Compliance/legal | Months | No GROUP BY with distinct counts in LogQL |
| Are canary group players depositing less than control? | Product manager | Weeks | Requires joining geo assignment with transaction data |

SQL queries with joins, window functions, distinct counts, cross-domain data. Per-player precision required.

Status: NOT YET BUILT. Data pipeline investigation complete. Integration path identified.

The fundamental difference

  • Loki answers: "What's happening?" (aggregate, real-time, operational)
  • Snowflake answers: "What happened, to whom, and what was the impact?" (per-player, historical, analytical)

The same event (e.g., a verification failure) needs to exist in both. In Loki it increments a counter on a dashboard. In Snowflake it's a row tied to a specific player that can be joined with their registration date, deposit history, and future sessions to determine whether that failure caused them to churn.
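The gap between counting events and counting players can be sketched in a few lines. The `VerificationFailure` shape is hypothetical; the point is only that the same rows give different answers depending on whether player identity is available.

```typescript
// Hypothetical row shape for a verification failure event.
interface VerificationFailure {
  playerId: string;
  sessionId: string;
}

// What an aggregate counter sees: one increment per event.
function countEvents(events: VerificationFailure[]): number {
  return events.length;
}

// What a per-player store can answer: how many distinct players were affected.
function countDistinctPlayers(events: VerificationFailure[]): number {
  const players = new Set<string>();
  events.forEach((e) => players.add(e.playerId));
  return players.size;
}
```

Four failures from two players is a very different product signal from four failures from four players, even though the operational counter reads the same.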


First-Time vs Repeat Player Segmentation

The problem

Every metric today is a blended average. A 15% verification failure rate could mean 40% of new players fail (catastrophic) and 5% of returning players fail (normal). When the canary scales from 5% to 100%, 95% of active players hit geo for the first time. Without segmentation, dashboards become unreadable.
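The arithmetic behind the blended-average problem can be sketched as follows. The 40% and 5% figures are the illustrative numbers from the paragraph above, not measurements, and the function names are made up for this example.

```typescript
// Blended failure rate for a population split into new and returning players.
function blendedRate(newShare: number, newRate: number, returningRate: number): number {
  return newShare * newRate + (1 - newShare) * returningRate;
}

// Invert the formula: which share of new players produces a given blended rate?
// Assumes newRate !== returningRate.
function impliedNewShare(blended: number, newRate: number, returningRate: number): number {
  return (blended - returningRate) / (newRate - returningRate);
}

// With the illustrative rates above, a 15% blended rate implies that
// roughly 2/7 (about 29%) of the verifying population is new.
const share = impliedNewShare(0.15, 0.4, 0.05);
```

A dashboard showing only the blended 15% cannot distinguish this mix from one where every player fails at 15%, which is the case for segmenting.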

Separation of concerns

Two distinct problems were being conflated:

Geo-specific (belongs in geo instrumentation, queryable in Loki):

| Dimension | Type | Description |
| --- | --- | --- |
| isFirstGeoVerification | boolean | True when the player has never completed geo verification before |
| geoVerifyCount | integer | Number of successful verifications so far |
| verificationSequence | integer | Nth attempt within this session (fresh vs retry) |

Source: Server-side flag in pok-user (we own it).
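One way the three geo-specific dimensions could be derived is sketched below. It assumes pok-user can supply a count of prior successful verifications and that the client knows how many attempts it has already made in the current session; both inputs and the function name are assumptions, not an existing API.

```typescript
// The three geo-specific dimensions attached to each event.
interface GeoDimensions {
  isFirstGeoVerification: boolean;
  geoVerifyCount: number;
  verificationSequence: number;
}

// priorSuccessfulVerifications: assumed to come from the server-side flag in pok-user.
// priorAttemptsThisSession: assumed to be tracked by the client for the current session.
function buildGeoDimensions(
  priorSuccessfulVerifications: number,
  priorAttemptsThisSession: number
): GeoDimensions {
  return {
    // First-time means no successful verification has ever been recorded.
    isFirstGeoVerification: priorSuccessfulVerifications === 0,
    geoVerifyCount: priorSuccessfulVerifications,
    // The current attempt is 1 for a fresh attempt, 2 or more for a retry.
    verificationSequence: priorAttemptsThisSession + 1,
  };
}
```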

Platform-wide (belongs in LogManager or platform analytics, queryable in Snowflake):

| Dimension | Why it's not geo-specific |
| --- | --- |
| playerTenureDays | Useful on every event in the system |
| isNewAccount | Same |
| daysSinceLastVisit | Session-level attribute |
| D1/D7/D30 retention | Cross-session, cross-day correlation |
| Unique player counts | Requires COUNT(DISTINCT) |
| Cohort analysis | Requires grouping by first-event date |

These require SQL (Snowflake), not LogQL (Loki).

Behavioural science context

  • First-time players have lower friction tolerance (~50-80 units) vs returning players (~150-300 units). A verification failure that barely registers for a returning player is budget-destroying for a new one.
  • Habituation to the geo step completes around visit 8-15. Track geoVerifyCount to verify empirically.
  • The canary-to-100% transition creates a specific cohort: loyal players encountering new friction. Loss aversion predicts they'll react more negatively than brand-new players who never knew a frictionless flow.

Current Data Pipeline Architecture

Path A: Application logs (Loki) - OPERATIONAL

Browser -> LogManager -> /log endpoint -> stdout -> CloudWatch -> Firehose -> Loki

Geo analytics events travel this path. Dashboards and alerts query Loki.

Path B: Domain event stores (Snowflake) - ANALYTICAL

PostgreSQL Event Stores -> ECS Projectors -> S3 DLZ Buckets -> Snowpipe -> Snowflake

Five projector services run this pattern today:

| Projector | Source | S3 DLZ | Events |
| --- | --- | --- | --- |
| user-eventstore-snowflake-pm | aurora-pg-user | customer-dlz/user-eventstore/ | Registration, login, identity |
| game-eventstore-snowflake-pm | aurora-pg-game | game-dlz/v2_casino/ | Casino/slots |
| cdd-eventstore-snowflake | aurora-pg-cdd | customer-dlz/cdd-eventstore/ | KYC/AML |
| store-eventstore-snowflake-pm | aurora-pg-store | transaction-dlz/v2_store/ | Purchases |
| player-account-snowflake-pm | player-account DB | player-account-dlz/player-items/ | Wallet, items |

Additionally, 7 Kafka topics feed Snowflake via a Kafka Connect connector (connect-pok-events-{env}).

Path C: Geo analytics events -> Snowflake - GAP

Geo events exist only in Loki. They do not reach Snowflake. The IDENTITY_LOGIN table in Snowflake has 5 geo columns (GEO_LOCATION_COUNTRY_CODE, GEO_LOCATION_SUBDIVISION_CODE, GEO_LOCATION_PROVIDER_NAME, GEO_LOCATION_PROVIDER_VERSION, GEO_LOCATION_SOURCE_TYPE) but all are hardcoded NULL. The schema was designed anticipating geo data that never arrived.


The CloudFront Geo Gap

Three geo sources exist at login time. None reach Snowflake:

| Source | Available where | Why it's not in Snowflake |
| --- | --- | --- |
| CloudFront headers (viewer_country, viewer_country_region) | Rendered into page globals, sent to Auth0 as query params, persisted to user_metadata.cookies, embedded in JWT | Filtered out by post-audit-login-event.ts cookie allow-list |
| Auth0 geoip (event.request.geoip) | Available in all Auth0 actions | Never read by any action |
| Raw IP (event.request.ip) | Sent in audit POST body | Stored but not geo-resolved |

The CloudFront geo data is captured, flows through Auth0, and is embedded in the JWT (where it surfaces as jwtGeo in Loki events), but a cookie allow-list filters it out before it reaches the user event store. A one-line change to that allow-list would let IP-based geo data flow to Snowflake for all users.
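The filtering behaviour can be sketched as below. This is not the code in post-audit-login-event.ts: the allow-list contents, the `utm_source` key, and the assumption that the cookie keys match the CloudFront header names are all hypothetical, chosen only to show why the geo values disappear and what extending the list would change.

```typescript
// Hypothetical allow-lists; the real keys are not shown in this document.
const CURRENT_ALLOW_LIST = new Set<string>(["utm_source"]);
const EXTENDED_ALLOW_LIST = new Set<string>([
  "utm_source",
  "viewer_country", // assumed cookie key, mirrors the CloudFront header name
  "viewer_country_region", // assumed cookie key, mirrors the CloudFront header name
]);

// Keep only the cookie entries whose key is on the allow-list.
function filterCookies(
  cookies: Record<string, string>,
  allowList: Set<string>
): Record<string, string> {
  const kept: Record<string, string> = {};
  Object.keys(cookies).forEach((key) => {
    if (allowList.has(key)) {
      kept[key] = cookies[key];
    }
  });
  return kept;
}
```

With the first list, a cookie map containing `viewer_country` comes out without it; with the extended list, the value passes through to whatever consumes the filtered result.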


What's Already in Snowflake (Relevant to Geo)

| Table | Key Fields | Use |
| --- | --- | --- |
| CLEANSED.USER_EVENTSTORE_LOGGED_IN | email, platform, time, IP, authId, userAgent | Login events |
| CURATED.CUSTOMER_ATTRIBUTES_OUTPUT | registrationDate, lastLogInDate, valueSegmentTier | Player tenure |
| CURATED.ACCOUNT_ACTIVITY_SUMMARY | firstLoginDate, lastLoginDate, firstPlayDate | Retention fields |
| CURATED.IDENTITY_LOGIN | geo columns (all NULL) | Placeholder for geo data |

Retention baselines can be established from existing data before the canary scales.
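As an illustration of what a retention baseline measures, the sketch below computes day-N retention from per-player login timestamps. In practice this would be a Snowflake query over the tables above; the `PlayerLogins` shape is hypothetical, and the definition used here (a login falling in the 24-hour window that starts N days after the first login) is one common convention rather than an agreed definition for this project.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical per-player login history, timestamps in epoch milliseconds.
interface PlayerLogins {
  playerId: string;
  loginTimesMs: number[];
}

// True if the player logged in during [first + day, first + day + 1) days.
function isRetainedOnDay(player: PlayerLogins, day: number): boolean {
  if (player.loginTimesMs.length === 0) return false;
  const first = player.loginTimesMs.reduce((a, b) => Math.min(a, b));
  const start = first + day * DAY_MS;
  const end = start + DAY_MS;
  return player.loginTimesMs.some((t) => t >= start && t < end);
}

// Share of players with at least one login who are retained on the given day.
function retentionRate(players: PlayerLogins[], day: number): number {
  const cohort = players.filter((p) => p.loginTimesMs.length > 0);
  if (cohort.length === 0) return 0;
  return cohort.filter((p) => isRetainedOnDay(p, day)).length / cohort.length;
}
```

Running the same calculation on pre-canary data gives the control figure against which geo cohorts can later be compared.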


Roadmap

Pre-canary

  1. Merge game client instrumentation PRs (Loki path)
  2. Merge dashboard + alerts PR (Grafana)
  3. Fix CloudFront geo allow-list in pok-auth0 (one-line change, establishes Snowflake baseline)
  4. Add geo fields to LOGGED_IN event in pok-user

Canary at 5%

  1. Validate dashboards with real traffic, tune alert thresholds
  2. Add isFirstGeoVerification server-side flag
  3. Segmented Loki dashboard panels (first-time vs repeat)

Before scaling beyond 5%

  1. New GEO_VERIFICATION_COMPLETED event type in pok-user event store
  2. Game client calls pok-user endpoint after verification (parallel to LogManager)
  3. Snowflake ingestion via existing pipeline (projector -> S3 -> Snowpipe)
  4. Retention baseline queries in Snowflake
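Steps 1 and 2 above can be sketched as an event shape plus a fan-out that sends the same event to two destinations. Every field name in `GeoVerificationCompleted` is hypothetical, since the event schema is not defined yet, and the sinks are synchronous stand-ins for what would be network calls to LogManager and pok-user.

```typescript
// Hypothetical payload; the real GEO_VERIFICATION_COMPLETED schema is not defined here.
interface GeoVerificationCompleted {
  type: "GEO_VERIFICATION_COMPLETED";
  playerId: string;
  sessionId: string;
  succeeded: boolean;
  occurredAt: string; // ISO 8601 timestamp
}

// A sink is any destination for the event (e.g. a LogManager or pok-user client).
type Sink = (event: GeoVerificationCompleted) => void;

// Deliver the event to every sink; a failure in one sink must not block the others.
// Returns the number of sinks that accepted the event.
function fanOut(event: GeoVerificationCompleted, sinks: Sink[]): number {
  let delivered = 0;
  sinks.forEach((sink) => {
    try {
      sink(event);
      delivered += 1;
    } catch (_err) {
      // Swallowed in this sketch; a real client would likely record the failure.
    }
  });
  return delivered;
}
```

Isolating the sinks keeps the two paths independent, so an outage on the operational path does not drop the analytical record, and vice versa.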

Before 100% rollout

  1. geoRolloutPhase dimension ("canary_5", "ga")
  2. Pre-populate first-geo flags for canary-period verifiers
  3. First-time user funnel dashboard with separate alert thresholds
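Pre-populating the first-geo flags amounts to replaying canary-period verifications into a per-player count. The sketch below assumes those verifications can be exported as simple records; the `CanaryVerification` shape and function names are illustrative only.

```typescript
// Hypothetical record of a verification attempt made during the canary period.
interface CanaryVerification {
  playerId: string;
  succeeded: boolean;
}

// Count successful verifications per player; failed attempts do not count.
function backfillGeoVerifyCounts(events: CanaryVerification[]): Map<string, number> {
  const counts = new Map<string, number>();
  events.forEach((e) => {
    if (!e.succeeded) return;
    counts.set(e.playerId, (counts.get(e.playerId) || 0) + 1);
  });
  return counts;
}

// A player with no recorded success is still a first-time verifier.
function isFirstGeoVerification(counts: Map<string, number>, playerId: string): boolean {
  return (counts.get(playerId) || 0) === 0;
}
```

Without this backfill, players who verified during the canary would be misclassified as first-time at general availability and would inflate the first-time cohort.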

Ownership

All components are owned by our team except Snowflake task updates:

| Component | Owner |
| --- | --- |
| gp-game-client, pok-user, pok-auth0, pok-infra | Our team |
| pok-snowflake (IDENTITY_LOGIN task, new tables) | Data Engineering |