@qudongfang
Last active February 14, 2025 10:57
flux change flowchart

happy flow

All issues were found during the Checks&Tests. No bad changes were rolled out.

flowchart TD
    A[stage 0/PR] --> B[github actions/localstack];
    B -- Yes --> C[stage 1 rollout];
    B -- No  --> D[block PR merge];
    C --> E[rollout to 1 low traffic dev cluster of each provider, parallel:max];
    E --> F[stage 1 Checks&Tests];
    F -- Yes --> G[put to the latest stable release of stage 1/stack];
    F -- No --> H[abort current rollout];
    G -->I[stage 2 rollout];
    H --> AA[alert the change owner, PR owner & approver];
    H --> AB[revert all clusters of current stage to last stable release];
    I --> J[rollout to all dev clusters, parallel:n/TBD];
    J --> K[stage 2 Checks&Tests];
    K -- Yes --> L[put to the latest stable release of stage 2/stack];
    K -- No --> H;
    L --> M[stage 3 rollout];
    M --> N[rollout to 1 low traffic prod cluster of each provider, e.g., internal/mgmt, parallel:1];
    N --> O[stage 3 Checks&Tests];
    O -- Yes --> P[put to the latest stable release of stage 3/stack];
    O -- No --> H;
    P --> Q[stage 4 rollout]
    Q --> R[rollout to all prod clusters, parallel: n/TBD];
    R --> S[stage 4 Checks&Tests];
    S -- Yes --> T[put to the latest stable release of stage 4/stack];
    S -- No --> H;
    T --> U[stage 5 rollout];
    U --> V[disagg/etc ...];
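
The promotion logic in the flowchart above can be sketched in a few lines of Python. This is only an illustration under assumptions: `Stage`, `RolloutResult`, `run_rollout`, and the `checks_pass` callback are hypothetical names, the actual deployment step is left out, and "latest stable release" is modelled as a plain dictionary keyed by stage name.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Stage:
    name: str            # e.g. "stage 1"
    clusters: List[str]  # clusters targeted by this stage


@dataclass
class RolloutResult:
    promoted: List[str] = field(default_factory=list)  # stages that passed
    failed_stage: Optional[str] = None
    reverted_clusters: List[str] = field(default_factory=list)
    alerted: List[str] = field(default_factory=list)


def run_rollout(
    change: str,
    stages: List[Stage],
    stable: Dict[str, str],
    checks_pass: Callable[[Stage, str], bool],
) -> RolloutResult:
    """Walk the stages in order; promote on success, abort on first failure."""
    result = RolloutResult()
    for stage in stages:
        # Deploying `change` to stage.clusters is out of scope for this sketch.
        if checks_pass(stage, change):
            # Checks&Tests passed: record the change as this stage's stable release.
            stable[stage.name] = change
            result.promoted.append(stage.name)
            continue
        # Checks&Tests failed: abort, alert, and revert only the current stage.
        result.failed_stage = stage.name
        result.alerted = ["change owner", "PR owner", "approver"]
        result.reverted_clusters = list(stage.clusters)
        break
    return result


stages = [
    Stage("stage 1", ["dev-a"]),
    Stage("stage 2", ["dev-a", "dev-b"]),
    Stage("stage 3", ["prod-mgmt"]),
]
stable = {s.name: "v1" for s in stages}
# Hypothetical check that fails at stage 3.
result = run_rollout("v2", stages, stable, lambda stage, change: stage.name != "stage 3")
```

Because the stable release is tracked per stage, a failure at stage 3 leaves stages 1 and 2 on the new change while the stage 3 clusters fall back to their own last stable release.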
    
    

emergency cases

Bad changes were not found by Checks&Tests.

dev only

No rollback! WE CAN REVERT THE BUGGY COMMIT as a new change. We may also need to revert the changes to other services that rely on it.
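
To make the difference between a rollback and a revert-as-new-change concrete, here is a minimal sketch. It assumes a linear history of changes with increasing integer IDs; the `Change` type and both helper functions are hypothetical and only model the idea, they are not part of any real tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Change:
    id: int
    description: str
    reverts: Optional[int] = None  # ID of the change this one undoes, if any


def revert_as_new_change(history: List[Change], buggy_id: int) -> List[Change]:
    """Append a revert change; earlier history is never rewritten or removed."""
    if not any(c.id == buggy_id for c in history):
        raise ValueError(f"change {buggy_id} is not in the history")
    next_id = max(c.id for c in history) + 1
    revert = Change(next_id, f"revert change {buggy_id}", reverts=buggy_id)
    return history + [revert]


def effective_changes(history: List[Change]) -> List[int]:
    """IDs of the changes still in effect once reverts are taken into account."""
    reverted = {c.reverts for c in history if c.reverts is not None}
    return [c.id for c in history if c.reverts is None and c.id not in reverted]


history = [Change(1, "add service"), Change(2, "buggy tweak"), Change(3, "bump image")]
new_history = revert_as_new_change(history, 2)
```

The history only ever grows, so every cluster keeps moving forward to a newer version and changes made after the buggy one stay in place.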

flowchart TD
    A[issue confirmed in dev only] --> B[freeze prod rollout automation];
    B --> C[pause all existing prod rollouts];
    C --> D[investigate and isolate the buggy change];
    D --> E[resume rollouts before the buggy change in prod];
    E --> Z[terminate/skip all rollouts after the buggy change in prod];
    D --> F[update dev Checks&Tests to include the buggy case];
    D --> G[merge in the fix];
    G --> H[fast-forward all dev clusters to the change/version with the fix];
    H --Yes--> Y[mark changes with the issue but without the fix as bad changes, block them everywhere];
    H --Yes--> I[fast-forward all prod clusters to the change/version with the fix]
    H --No-->D;
    I --Yes--> J[unfreeze prod rollout automation];
    I --No-->D;
    J --> K[done];
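
The "resume before / terminate after" split and the "mark as bad" step can be expressed as simple range checks. The sketch below assumes changes carry monotonically increasing sequence numbers, which the flowchart does not state; `partition_prod_rollouts` and `bad_changes` are hypothetical names, and treating the buggy change itself as terminated is also an assumption.

```python
from typing import Iterable, List, Set, Tuple


def partition_prod_rollouts(pending: Iterable[int], buggy: int) -> Tuple[List[int], List[int]]:
    """Split paused prod rollouts into (resume, terminate) around the buggy change."""
    pending = list(pending)  # allow one-shot iterators
    resume = sorted(c for c in pending if c < buggy)      # predate the bug: safe to resume
    terminate = sorted(c for c in pending if c >= buggy)  # contain the bug: skip
    return resume, terminate


def bad_changes(all_changes: Iterable[int], buggy: int, fix: int) -> Set[int]:
    """Changes that include the issue but not the fix: buggy <= change < fix."""
    return {c for c in all_changes if buggy <= c < fix}


resume, terminate = partition_prod_rollouts([9, 3, 7, 5], buggy=7)
blocked = bad_changes(range(1, 11), buggy=7, fix=10)
```

With the example values, rollouts 3 and 5 resume, rollouts 7 and 9 are terminated, and changes 7 to 9 are blocked everywhere.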

dev & prod

We don't know for sure which version to roll back to. WE CAN REVERT THE BUGGY COMMIT as a new change. We may also need to revert the changes to other services that rely on it.

flowchart TD
    A[issue confirmed in dev & prod] --> B[freeze the world for investigation];
    B --> C[terminate all existing dev & prod rollouts, no reverts];
    C --> D[investigate and isolate the buggy change];
    D --> E[update dev & prod Checks&Tests to include the buggy case];
    E --> F[merge in the fix];
    F --> H[fast-forward all dev clusters to the change/version with the fix];
    F --> Y[mark changes with the issue but without the fix as bad changes, block them everywhere];
    H --Yes-->I[fast-forward all prod clusters to the change/version with the fix];
    H --No-->C;
    I --Yes-->J[unfreeze the world];
    I --No-->C;
    J --> K[done];
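
The fix loop in this flowchart (dev first, then prod, back to investigation on any failure) could look roughly like the sketch below. It assumes candidate fixes are tried one after another and that `dev_ok` and `prod_ok` stand in for the dev and prod Checks&Tests; all names here are hypothetical.

```python
from typing import Callable, List, Optional, Tuple


def fast_forward_with_fix(
    candidate_fixes: List[str],
    dev_ok: Callable[[str], bool],
    prod_ok: Callable[[str], bool],
) -> Tuple[Optional[str], List[str]]:
    """Return the first fix that passes dev and then prod, plus a log of the steps taken."""
    log: List[str] = []
    for fix in candidate_fixes:
        log.append(f"dev:{fix}")  # fast-forward all dev clusters
        if not dev_ok(fix):
            continue              # dev failed: back to investigation
        log.append(f"prod:{fix}")  # fast-forward all prod clusters
        if prod_ok(fix):
            log.append("unfreeze")  # only unfreeze once prod is healthy
            return fix, log
    return None, log  # nothing worked: the freeze stays in place


fix, log = fast_forward_with_fix(
    ["fix-1", "fix-2", "fix-3"],
    dev_ok=lambda f: f != "fix-1",
    prod_ok=lambda f: f == "fix-3",
)
```

Prod is never touched by a fix that has not passed in dev, and the freeze is lifted only after prod passes.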

prod only

If we cannot reproduce the issue in dev, either it has already been fixed in dev or it is environment-specific.

We may not know for sure which version to roll back to! Even once we have isolated the problematic change, we cannot roll back, because it is not easy to roll back all the other services that depend on flux changes.

WE CAN REVERT THE BUGGY COMMIT as a new change. We may also need to revert the changes to other services that rely on it.

flowchart TD
    A[issue confirmed in prod] --no use to freeze rollouts if it's everywhere--> B[investigate and isolate the buggy change];
    B --> C[update dev & prod Checks&Tests to include the buggy case];
    B --> D[merge in the fix];
    D --> E[fast-forward all dev clusters to the change/version with the fix];
    D --> F[mark changes with the issue but without the fix as bad changes, block them everywhere];
    E --Yes-->G[fast-forward all prod clusters to the change/version with the fix];
    E --No-->D;
    G --Yes-->H[done];
    G --No-->D;
