Topology and Affinity notes

oc label nodes master-0 node=node0 zone=zoneA --overwrite
oc label nodes master-1 node=node1 zone=zoneB --overwrite
oc label nodes master-2 node=node2 zone=zoneC --overwrite
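
The node labels can be quickly verified (assuming the oc client is configured against the cluster) with:

oc get nodes -L node,zone
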
+--------+      +--------+      +--------+
|        |      |        |      |        |
| ZONE A |      | ZONE B |      | ZONE C |
|        |      |        |      |        |
+--------+      +--------+      +--------+
 |_ api-ext-0    |_ api-ext-1    |_ api-ext-2
 |_ api-int-0    |_ api-int-1    |_ api-int-2

---
apiVersion: topology.openstack.org/v1beta1
kind: Topology
metadata:
  name: glance-default-spread-pods
  namespace: openstack
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        service: glance
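
The labelSelector above targets every Pod carrying the service=glance label (both the internal and the external API Pods in this environment). As a quick check, the matching Pods and their labels can be listed with:

oc get pods -l service=glance --show-labels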

In this case, we can observe that the topologySpreadConstraints is compatible with the preferred anti-affinity rules applied by default: the system is stable and the scheduler is able to place Pods in each zone.

The topologySpreadConstraints applies to every Pod that matches the specified labelSelector, regardless of whether those Pods are replicas of the same workload or belong to different workloads. A key point is that in Kubernetes there is no real concept of "replicas of the same Pod" at the scheduler level: what we commonly call "replicas" are individual Pods that share the same labels and are created by a controller (a Deployment, a StatefulSet, etc.). Each Pod is scheduled independently, even if it was created as part of the same set of replicas.

In our example, the topologySpreadConstraints applies to all 6 Pods because they all match the labelSelector service: glance. The scheduler tries to spread them across the nodes according to the constraint, treating them as a single group of 6 Pods, not as two separate groups of 3 replicas each.

When we define a topologySpreadConstraints, maxSkew plays an important role. For each topology domain (a zone, in our example), the Kubernetes scheduler computes the skew as follows:

skew = podsInZone - minPodsAcrossZones

Where:

  • podsInZone: the number of matching Pods in a specific zone
  • minPodsAcrossZones: the global minimum, i.e. the number of matching Pods in the least populated eligible zone

For example, with 7 pods and 3 zones:

If distribution is [3,2,2], the maximum skew is 3 - 2 = 1
If distribution is [4,2,1], the maximum skew is 4 - 1 = 3

The maxSkew parameter represents the maximum allowed difference between any zone and the global minimum.

If we set maxSkew: 1:

- [3,2,2] would be allowed (skew 1 <= 1)
- [4,2,1] would not be allowed (skew 3 > 1)

In summary:

  • Scheduler tries to minimize skew while respecting other constraints
  • Higher maxSkew allows more uneven distribution
  • Lower maxSkew enforces more balanced distribution
  • maxSkew: 1 is a common choice to reach a reasonable balance
  • whenUnsatisfiable: DoNotSchedule prevents exceeding maxSkew
  • whenUnsatisfiable: ScheduleAnyway allows exceeding if necessary
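
To illustrate the last two bullets, the following is a sketch of a softer variant of the first Topology CR (glance-default-spread-pods-soft is just an illustrative name): with whenUnsatisfiable: ScheduleAnyway the scheduler still tries to minimize the skew, but it is allowed to exceed maxSkew rather than leave Pods Pending.

apiVersion: topology.openstack.org/v1beta1
kind: Topology
metadata:
  name: glance-default-spread-pods-soft
  namespace: openstack
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    # soft constraint: prefer an even spread, but never block scheduling
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        service: glance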

To spread different types of Pods independently, we need to use different labels and a different topology spread constraint for each type. For example, we can select only a subset of Pods (e.g. a specific GlanceAPI, called czone) and spread the resulting Pods across zoneA, zoneB and zoneC. To select only the czone glanceAPI we rely on matchExpressions, which fits well in a context where we do not necessarily propagate the same label keys from the top-level CR to the resulting Pods. The following example selects the glance-czone-edge-api Pods and spreads them across the existing Kubernetes nodes.

apiVersion: topology.openstack.org/v1beta1
kind: Topology
metadata:
  name: glance-czone-spread-pods
  namespace: openstack
spec:
  topologySpreadConstraints:
  - labelSelector:
      matchExpressions:
      - key: glanceAPI
        operator: In
        values:
        - czone
    maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule

To achieve the above, a glanceAPI called czone has been created, and a topologyRef pointing to glance-czone-spread-pods has been applied to it.
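
As a reference, the following is a minimal sketch of how the Topology above can be referenced from the Glance CR. It assumes the glance-operator exposes a topologyRef field under each entry of spec.glanceAPIs (the exact field layout may vary across operator versions):

apiVersion: glance.openstack.org/v1beta1
kind: Glance
metadata:
  name: glance
  namespace: openstack
spec:
  # other Glance spec fields omitted for brevity
  glanceAPIs:
    czone:
      # assumed field: reference to the Topology CR defined above
      topologyRef:
        name: glance-czone-spread-pods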

+--------+      +--------+      +--------+
|        |      |        |      |        |
| ZONE A |      | ZONE B |      | ZONE C |
|        |      |        |      |        |
+--------+      +--------+      +--------+
 |_ api-ext-0    |_ api-ext-1    |_ api-ext-2
 |_ api-int-0    |_ api-int-1    |_ api-int-2
 |_ czone-edge-0 |_ czone-edge-1 |_ czone-edge-2
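
The spreading can be verified with a label query (this assumes the glanceAPI=czone label is propagated to the resulting Pods, as the matchExpressions above implies):

oc get pods -l glanceAPI=czone -o wide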

However, a quite common scenario is the exact opposite model, where Pods are scheduled in a specific zone. In our example, the idea is to take all the Pods that belong to glanceAPI: glance-czone-edge and schedule all of them in zoneC (which corresponds to master-2 in a three-node environment). To achieve this goal, we create the following Topology CR:

apiVersion: topology.openstack.org/v1beta1
kind: Topology
metadata:
  name: glance-czone-node-affinity
  namespace: openstack
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: zone
                operator: In
                values:
                  - zoneC

The above establishes a nodeAffinity to make sure the czone glanceAPI Pods are scheduled in zoneC. In other words, we require czone Pods to land on a node that carries the zone=zoneC label, and this requirement is stronger than the preferred anti-affinity rules applied by default to the StatefulSet. Note that in this case we do not need any topologySpreadConstraints, because we are not really interested in the Pod distribution within the zone: we are trying to achieve isolation between AZs. A sketch that combines both mechanisms can be found at the end of these notes.

+--------+      +--------+      +--------+
|        |      |        |      |        |
| ZONE A |      | ZONE B |      | ZONE C |
|        |      |        |      |        |
+--------+      +--------+      +--------+
 |_ api-ext-0    |_ api-ext-1    |_ api-ext-2
 |_ api-int-0    |_ api-int-1    |_ api-int-2
                                 |_ czone-edge-0
                                 |_ czone-edge-1
                                 |_ czone-edge-2

The picture above can be verified with the following command:

Every 2.0s: oc get pods -l service=glance -o wide

NAME                            READY   STATUS    RESTARTS   AGE     IP             NODE       NOMINATED NODE   READINESS GATES
glance-czone-edge-api-0         3/3     Running   0          3m13s   10.128.1.172   master-2   <none>           <none>
glance-czone-edge-api-1         3/3     Running   0          3m25s   10.128.1.171   master-2   <none>           <none>
glance-czone-edge-api-2         3/3     Running   0          3m37s   10.128.1.170   master-2   <none>           <none>

glance-default-external-api-0   3/3     Running   0          71m     10.129.0.72    master-0   <none>           <none>
glance-default-external-api-1   3/3     Running   0          72m     10.128.1.152   master-2   <none>           <none>
glance-default-external-api-2   3/3     Running   0          72m     10.130.0.208   master-1   <none>           <none>

glance-default-internal-api-0   3/3     Running   0          72m     10.128.1.153   master-2   <none>           <none>
glance-default-internal-api-1   3/3     Running   0          72m     10.129.0.70    master-0   <none>           <none>
glance-default-internal-api-2   3/3     Running   0          72m     10.130.0.207   master-1   <none>           <none>

We can use the same approach to apply nodeAffinity to bzone and azone glanceAPIs and observe Pods being scheduled on the nodes that belong to zoneB and zoneA.

We create the following CRs:

apiVersion: topology.openstack.org/v1beta1
kind: Topology
metadata:
  name: glance-bzone-node-affinity
  namespace: openstack
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: zone
                operator: In
                values:
                  - zoneB

and:

apiVersion: topology.openstack.org/v1beta1
kind: Topology
metadata:
  name: glance-azone-node-affinity
  namespace: openstack
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: zone
                operator: In
                values:
                  - zoneA

$ oc get topology

NAME
glance-azone-node-affinity
glance-bzone-node-affinity
glance-czone-node-affinity
glance-czone-spread-pods
glance-default-spread-pods

The resulting Pod distribution is the following:

NAME                            READY   STATUS    RESTARTS   AGE     IP             NODE       NOMINATED NODE   READINESS GATES
glance-azone-edge-api-0         3/3     Running   0          39s     10.129.0.161   master-0   <none>           <none>
glance-azone-edge-api-1         3/3     Running   0          51s     10.129.0.160   master-0   <none>           <none>
glance-azone-edge-api-2         3/3     Running   0          64s     10.129.0.159   master-0   <none>           <none>

glance-bzone-edge-api-0         3/3     Running   0          15m     10.130.0.239   master-1   <none>           <none>
glance-bzone-edge-api-1         3/3     Running   0          15m     10.130.0.238   master-1   <none>           <none>
glance-bzone-edge-api-2         3/3     Running   0          15m     10.130.0.237   master-1   <none>           <none>

glance-czone-edge-api-0         3/3     Running   0          124m    10.128.1.172   master-2   <none>           <none>
glance-czone-edge-api-1         3/3     Running   0          124m    10.128.1.171   master-2   <none>           <none>
glance-czone-edge-api-2         3/3     Running   0          124m    10.128.1.170   master-2   <none>           <none>

glance-default-external-api-0   3/3     Running   0          3h12m   10.129.0.72    master-0   <none>           <none>
glance-default-external-api-1   3/3     Running   0          3h13m   10.128.1.152   master-2   <none>           <none>
glance-default-external-api-2   3/3     Running   0          3h13m   10.130.0.208   master-1   <none>           <none>
glance-default-internal-api-0   3/3     Running   0          3h13m   10.128.1.153   master-2   <none>           <none>
glance-default-internal-api-1   3/3     Running   0          3h13m   10.129.0.70    master-0   <none>           <none>
glance-default-internal-api-2   3/3     Running   0          3h14m   10.130.0.207   master-1   <none>           <none>

+--------+         +--------+          +--------+
|        |         |        |          |        |
| ZONE A |         | ZONE B |          | ZONE C |
|  (m0)  |         |  (m1)  |          |  (m2)  |
|        |         |        |          |        |
+--------+         +--------+          +--------+
 |_ api-ext-0       |_ api-ext-1        |_ api-ext-2
 |_ api-int-0       |_ api-int-1        |_ api-int-2
 |                  |                   |
 |_ azone-edge-0    |_ bzone-edge-0     |_ czone-edge-0
 |_ azone-edge-1    |_ bzone-edge-1     |_ czone-edge-1
 |_ azone-edge-2    |_ bzone-edge-2     |_ czone-edge-2
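
A final note: the two mechanisms are not mutually exclusive. If, for instance, a zone contained more than one node and we wanted to both pin the czone Pods to zoneC and spread them across the nodes of that zone, a single Topology could, in principle, carry both a nodeAffinity and a topologySpreadConstraints. The following is only a sketch (the name is illustrative, and it assumes the controller honors both fields at the same time); the hostname-based topologyKey is the standard kubernetes.io/hostname node label:

apiVersion: topology.openstack.org/v1beta1
kind: Topology
metadata:
  name: glance-czone-pin-and-spread
  namespace: openstack
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: zone
                operator: In
                values:
                  - zoneC
  topologySpreadConstraints:
  - maxSkew: 1
    # spread across individual nodes within the selected zone
    topologyKey: kubernetes.io/hostname
    # soft, since a single-node zone could otherwise block scheduling
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchExpressions:
      - key: glanceAPI
        operator: In
        values:
        - czone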
