Last active
February 7, 2024 09:08
Rancher 2.x custom cluster YAML quicker node failure detection (k8s 1.13)
services:
  kubelet:
    extra_args:
      node-status-update-frequency: 4s
  kube-api:
    extra_args:
      default-not-ready-toleration-seconds: 30
      default-unreachable-toleration-seconds: 30
  kube-controller:
    extra_args:
      node-monitor-period: 2s
      node-monitor-grace-period: 16s
      pod-eviction-timeout: 30s
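
The two kube-api arguments above set the cluster-wide default for how long a pod tolerates the not-ready and unreachable node taints before it is evicted. The same window can also be set per workload through tolerations in the pod spec, which leaves the API server defaults alone. A minimal sketch, assuming a hypothetical Deployment called example-app with a placeholder image; only the tolerations block is the point here:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app            # hypothetical name, not from the gist
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      tolerations:
        # Evict this pod 30s after its node is tainted not-ready
        - key: node.kubernetes.io/not-ready
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 30
        # Evict this pod 30s after its node is tainted unreachable
        - key: node.kubernetes.io/unreachable
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 30
      containers:
        - name: app
          image: nginx         # placeholder image
```

Pods that declare these tolerations explicitly keep their own values; the API server defaults only apply to pods that do not set them.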
@patan32 You probably want to check rancher/rancher#43918. Depending on which versions you are using, this could be intended behavior (old or new) or a new bug.
Here is a Rancher RKE2 example:
spec:
  rkeConfig:
    machineGlobalConfig:
      kube-apiserver-arg:
        - '--default-not-ready-toleration-seconds=30'
        - '--default-unreachable-toleration-seconds=30'
      kube-controller-manager-arg:
        - '--node-monitor-period=2s'
        - '--node-monitor-grace-period=16s'
        - '--pod-eviction-timeout=30s'
    machineSelectorConfig:
      - config:
          kubelet-arg:
            - '--node-status-update-frequency=4s'
            - '--max-pods=200'
I tried this on my Rancher RKE2-based cluster and cannot recommend it: it crashed my master nodes, or at least they refused to apply the settings. The master nodes were stuck on "waiting for kube-controller". The failed nodes reported:
journalctl -xeu rke2-server.service
Feb 06 16:22:33 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:33+01:00" level=info msg="Reconciling ETCDSnapshotFile resources"
Feb 06 16:22:33 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:33+01:00" level=info msg="Tunnel server egress proxy mode: agent"
Feb 06 16:22:33 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:33+01:00" level=info msg="Starting managed etcd node metadata controller"
Feb 06 16:22:33 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:33+01:00" level=info msg="Reconciliation of ETCDSnapshotFile resources complete"
Feb 06 16:22:33 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:33+01:00" level=info msg="Starting k3s.cattle.io/v1, Kind=Addon controller"
Feb 06 16:22:33 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:33+01:00" level=info msg="Creating deploy event broadcaster"
Feb 06 16:22:33 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:33+01:00" level=info msg="Starting /v1, Kind=Node controller"
Feb 06 16:22:33 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:33+01:00" level=info msg="Cluster dns configmap already exists"
Feb 06 16:22:33 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:33+01:00" level=info msg="Labels and annotations have been set successfully on node: rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb"
Feb 06 16:22:33 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:33+01:00" level=info msg="Starting /v1, Kind=Secret controller"
Feb 06 16:22:33 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:33+01:00" level=info msg="Updating TLS secret for kube-system/rke2-serving (count: 16): map[listener.cattle.io/cn-10.11.55.170:10.11.55.170 listener.cattle.io/cn->
Feb 06 16:22:36 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:36+01:00" level=info msg="Running kube-proxy --cluster-cidr=10.42.0.0/16 --conntrack-max-per-core=0 --conntrack-tcp-timeout-close-wait=0s --conntrack-tcp-timeout->
Feb 06 16:25:52 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: 2024/02/06 16:25:52 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
Feb 06 16:28:52 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: 2024/02/06 16:28:52 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
Feb 06 16:32:12 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: 2024/02/06 16:32:12 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
EDIT: found the problem: pod-eviction-timeout was deprecated in 1.25 (kubernetes/website#39681).
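
Following the EDIT note above, here is a sketch of the same RKE2 cluster spec with the pod-eviction-timeout argument dropped. It is untested and assumes the remaining flags are still accepted by the Kubernetes version in use; eviction timing is then governed by the two toleration-seconds arguments.

```yaml
spec:
  rkeConfig:
    machineGlobalConfig:
      kube-apiserver-arg:
        # Default eviction delay for pods on not-ready / unreachable nodes
        - '--default-not-ready-toleration-seconds=30'
        - '--default-unreachable-toleration-seconds=30'
      kube-controller-manager-arg:
        # How often the node controller checks node status
        - '--node-monitor-period=2s'
        # How long without a status update before a node is marked unhealthy
        - '--node-monitor-grace-period=16s'
        # '--pod-eviction-timeout=30s' left out, per the EDIT note above
    machineSelectorConfig:
      - config:
          kubelet-arg:
            # How often the kubelet reports node status
            - '--node-status-update-frequency=4s'
            # Carried over from the earlier example; unrelated to failure detection
            - '--max-pods=200'
```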
Hello,
I am wondering how I can apply this to my RKE2 cluster. When I go to the cluster in Rancher, I can't see an Edit YAML button. Any help is appreciated.