oc label nodes master-0 node=node0 zone=zoneA --overwrite
oc label nodes master-1 node=node1 zone=zoneB --overwrite
oc label nodes master-2 node=node2 zone=zoneC --overwrite
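A quick way to double check that the labels have been applied is to list the nodes with the corresponding label columns (-L simply adds one output column per label key):

$ oc get nodes -L node -L zone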
+--------+          +--------+          +--------+
|        |          |        |          |        |
| ZONE A |          | ZONE B |          | ZONE C |
|        |          |        |          |        |
+--------+          +--------+          +--------+
 |_ api-ext-0        |_ api-ext-1        |_ api-ext-2
 |_ api-int-0        |_ api-int-1        |_ api-int-2
---
apiVersion: topology.openstack.org/v1beta1
kind: Topology
metadata:
  name: glance-default-spread-pods
  namespace: openstack
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        service: glance
In this case, we can observe how the topologySpreadConstraints is compatible with the preferredAntiAffinityRules: the system is stable and the scheduler is able to schedule Pods in each zone.
The topologySpreadConstraints is applied to Pods that match the specified labelSelector, regardless of whether they are replicas of the same Pod or different Pods. A key point to understand is that in Kubernetes there is no real concept of "replicas of the same Pod" at the scheduler level: what we commonly call "replicas" are actually individual Pods that share the same labels and are created by a controller (such as a Deployment or a StatefulSet). Each Pod is scheduled independently, even if it was created as part of the same set of replicas.
The topologySpreadConstraints therefore applies to all 6 Pods, because they all match the labelSelector service: glance.
The scheduler tries to spread these Pods across the nodes according to the constraint, treating them as a single group of 6 Pods that need to be spread, not as two separate groups of 3 replicas each.
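We can verify that all 6 Pods carry the same service=glance label (the full label set depends on the operator version, but the selector key is the one used above) by listing the Pod labels:

$ oc get pods -l service=glance --show-labels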
When we define a topologySpreadConstraints, maxSkew plays an important role.
The Kubernetes scheduler evaluates the constraint by comparing, for each eligible zone, the number of matching Pods with the least populated zone:

skew = podsInZone - minPodsInAnyZone

Where:
- podsInZone: the number of matching Pods in a specific zone
- minPodsInAnyZone: the minimum number of matching Pods across the eligible zones

For example, with 7 Pods and 3 zones:

- If the distribution is [3,2,2], the maximum skew is 3 - 2 = 1
- If the distribution is [4,2,1], the maximum skew is 4 - 1 = 3

The maxSkew parameter represents the maximum allowed difference between the most and the least populated zone.
If we set maxSkew: 1:

- [3,2,2] would be allowed (skew 1 <= 1)
- [4,2,1] would not be allowed (skew 3 > 1)
In summary:
- Scheduler tries to minimize skew while respecting other constraints
- Higher maxSkew allows more uneven distribution
- Lower maxSkew enforces more balanced distribution
- maxSkew: 1 is a common choice to reach a reasonable balance
- whenUnsatisfiable: DoNotSchedule prevents exceeding maxSkew
- whenUnsatisfiable: ScheduleAnyway allows exceeding if necessary
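For completeness, a softer variant of the first Topology CR differs only in the whenUnsatisfiable field. The sketch below is illustrative (the CR name is made up) and keeps everything else unchanged:

apiVersion: topology.openstack.org/v1beta1
kind: Topology
metadata:
  name: glance-default-spread-pods-soft   # illustrative name, not part of the deployed example
  namespace: openstack
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: ScheduleAnyway   # best effort: the scheduler may exceed maxSkew if needed
    labelSelector:
      matchLabels:
        service: glance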
To spread different types of Pods independently, we would need to use different
labels and different topology spread constraints for each type.
For example, we can select only a subset of Pods (e.g. a specific glanceAPI, called czone), and spread the resulting Pods across zoneA, zoneB and zoneC.
To select only the czone glanceAPI, we rely on matchExpressions, which fits well in a context where we do not necessarily propagate the same label keys from the top-level CR to the resulting Pods.
The following example selects the glance-czone-edge-api Pods and spreads them across the existing Kubernetes nodes.
apiVersion: topology.openstack.org/v1beta1
kind: Topology
metadata:
  name: glance-czone-spread-pods
  namespace: openstack
spec:
  topologySpreadConstraints:
  - labelSelector:
      matchExpressions:
      - key: glanceAPI
        operator: In
        values:
        - czone
    maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
To achieve the above, a glanceAPI called czone has been created, and a topologyRef called glance-czone-spread-pods has been applied.
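As a reference, the snippet below is a minimal sketch of how such a topologyRef could be attached to the czone glanceAPI; the exact field layout depends on the Glance operator schema and version, so treat the structure (the glanceAPIs map and the topologyRef key) as an assumption rather than a verified API:

# Hedged sketch: the field layout is an assumption, check the Glance CRD of your operator version.
apiVersion: glance.openstack.org/v1beta1
kind: Glance
metadata:
  name: glance
  namespace: openstack
spec:
  glanceAPIs:
    czone:
      topologyRef:
        name: glance-czone-spread-pods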
+--------+          +--------+          +--------+
|        |          |        |          |        |
| ZONE A |          | ZONE B |          | ZONE C |
|        |          |        |          |        |
+--------+          +--------+          +--------+
 |_ api-ext-0        |_ api-ext-1        |_ api-ext-2
 |_ api-int-0        |_ api-int-1        |_ api-int-2
 |_ czone-edge-0     |_ czone-edge-1     |_ czone-edge-2
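Assuming the glanceAPI: czone label used in the matchExpressions above is propagated to the resulting Pods, the distribution can be quickly verified with:

$ oc get pods -l glanceAPI=czone -o wide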
However, a quite common scenario is the exact opposite model, where Pods are scheduled in a particular, specific zone.
In our example, the idea is to take all the Pods that belong to the glance-czone-edge glanceAPI and schedule all of them in zoneC (which corresponds to master-2 in a three-node environment).
To achieve this goal, we create the following topology CR:
apiVersion: topology.openstack.org/v1beta1
kind: Topology
metadata:
  name: glance-czone-node-affinity
  namespace: openstack
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: zone
            operator: In
            values:
            - zoneC
The above establishes a nodeAffinity to make sure we schedule the czone glanceAPI Pods in zoneC.
In other words, we require the czone Pods to be scheduled on a node that carries the zoneC label, and this condition is stronger than the preferredAntiAffinityRules applied by default to the StatefulSet.
Note that in this case we do not need any topologySpreadConstraints, because we are not really interested in the Pods distribution: we are trying to achieve isolation between AZs.
+--------+          +--------+          +--------+
|        |          |        |          |        |
| ZONE A |          | ZONE B |          | ZONE C |
|        |          |        |          |        |
+--------+          +--------+          +--------+
 |_ api-ext-0        |_ api-ext-1        |_ api-ext-2
 |_ api-int-0        |_ api-int-1        |_ api-int-2
                                         |_ czone-edge-0
                                         |_ czone-edge-1
                                         |_ czone-edge-2
The picture above can be checked with the following:
Every 2.0s: oc get pods -l service=glance -o wide
NAME                            READY   STATUS    RESTARTS   AGE     IP             NODE       NOMINATED NODE   READINESS GATES
glance-czone-edge-api-0         3/3     Running   0          3m13s   10.128.1.172   master-2   <none>           <none>
glance-czone-edge-api-1         3/3     Running   0          3m25s   10.128.1.171   master-2   <none>           <none>
glance-czone-edge-api-2         3/3     Running   0          3m37s   10.128.1.170   master-2   <none>           <none>
glance-default-external-api-0   3/3     Running   0          71m     10.129.0.72    master-0   <none>           <none>
glance-default-external-api-1   3/3     Running   0          72m     10.128.1.152   master-2   <none>           <none>
glance-default-external-api-2   3/3     Running   0          72m     10.130.0.208   master-1   <none>           <none>
glance-default-internal-api-0   3/3     Running   0          72m     10.128.1.153   master-2   <none>           <none>
glance-default-internal-api-1   3/3     Running   0          72m     10.129.0.70    master-0   <none>           <none>
glance-default-internal-api-2   3/3     Running   0          72m     10.130.0.207   master-1   <none>           <none>
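To confirm that the nodeAffinity defined in the Topology CR has been rendered into the Pod spec, we can dump the affinity section of one of the czone Pods (raw jsonpath output; if the operator has propagated the Topology, it should contain the zone/zoneC requirement):

$ oc get pod glance-czone-edge-api-0 -o jsonpath='{.spec.affinity.nodeAffinity}'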
We can use the same approach to apply nodeAffinity to the bzone and azone glanceAPIs and observe Pods being scheduled on the nodes that belong to zoneB and zoneA.
We create the following CRs:
apiVersion: topology.openstack.org/v1beta1
kind: Topology
metadata:
  name: glance-bzone-node-affinity
  namespace: openstack
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: zone
            operator: In
            values:
            - zoneB
and:
apiVersion: topology.openstack.org/v1beta1
kind: Topology
metadata:
  name: glance-azone-node-affinity
  namespace: openstack
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: zone
            operator: In
            values:
            - zoneA
$ oc get topology
NAME
glance-azone-node-affinity
glance-bzone-node-affinity
glance-czone-node-affinity
glance-czone-spread-pods
glance-default-spread-pods
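And, checking the Pods placement again with the same command used before:

$ oc get pods -l service=glance -o wide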
NAME                            READY   STATUS    RESTARTS   AGE     IP             NODE       NOMINATED NODE   READINESS GATES
glance-azone-edge-api-0         3/3     Running   0          39s     10.129.0.161   master-0   <none>           <none>
glance-azone-edge-api-1         3/3     Running   0          51s     10.129.0.160   master-0   <none>           <none>
glance-azone-edge-api-2         3/3     Running   0          64s     10.129.0.159   master-0   <none>           <none>
glance-bzone-edge-api-0         3/3     Running   0          15m     10.130.0.239   master-1   <none>           <none>
glance-bzone-edge-api-1         3/3     Running   0          15m     10.130.0.238   master-1   <none>           <none>
glance-bzone-edge-api-2         3/3     Running   0          15m     10.130.0.237   master-1   <none>           <none>
glance-czone-edge-api-0         3/3     Running   0          124m    10.128.1.172   master-2   <none>           <none>
glance-czone-edge-api-1         3/3     Running   0          124m    10.128.1.171   master-2   <none>           <none>
glance-czone-edge-api-2         3/3     Running   0          124m    10.128.1.170   master-2   <none>           <none>
glance-default-external-api-0   3/3     Running   0          3h12m   10.129.0.72    master-0   <none>           <none>
glance-default-external-api-1   3/3     Running   0          3h13m   10.128.1.152   master-2   <none>           <none>
glance-default-external-api-2   3/3     Running   0          3h13m   10.130.0.208   master-1   <none>           <none>
glance-default-internal-api-0   3/3     Running   0          3h13m   10.128.1.153   master-2   <none>           <none>
glance-default-internal-api-1   3/3     Running   0          3h13m   10.129.0.70    master-0   <none>           <none>
glance-default-internal-api-2   3/3     Running   0          3h14m   10.130.0.207   master-1   <none>           <none>
+--------+          +--------+          +--------+
|        |          |        |          |        |
| ZONE A |          | ZONE B |          | ZONE C |
|  (m0)  |          |  (m1)  |          |  (m2)  |
|        |          |        |          |        |
+--------+          +--------+          +--------+
 |_ api-ext-0        |_ api-ext-1        |_ api-ext-2
 |_ api-int-0        |_ api-int-1        |_ api-int-2
 |                   |                   |
 |_ azone-edge-0     |_ bzone-edge-0     |_ czone-edge-0
 |_ azone-edge-1     |_ bzone-edge-1     |_ czone-edge-1
 |_ azone-edge-2     |_ bzone-edge-2     |_ czone-edge-2