Problem Definition

Many services run on an EKS node group of the Spot capacity type, and a large number of pods can end up scheduled onto the same couple of Spot instances. There is a risk that AWS reclaims these instances, and when that happens the following scenario is likely:

Pods that reside on the reclaimed instances don't have enough time to react to the instance termination, which causes hiccups, latency, or minor downtime until they have been rescheduled onto other nodes.

A couple of measures have been applied to counter this:

From the Kubernetes templates standpoint:

  • Apply a PDB (PodDisruptionBudget) to every service that needs one. This ensures the availability of the service never falls below our pre-defined threshold (e.g. deployment A has 10 pods across 3 instances; when an instance is reclaimed, its pods will not be evicted until enough pods are ready/running on the other available instances). See the sketch after this list.
  • Apply rules to spread the pods across different instances (Assign Pods to Nodes) using either NodeSelector, Affinity/Anti-Affinity, or Taints and Tolerations. This mitigates large numbers of pods being scheduled onto the same instance.
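
As a concrete illustration, here is a minimal sketch of both measures for a hypothetical Deployment labeled app: service-a (all names, images, and numbers are illustrative, not taken from our actual manifests):

```yaml
# PodDisruptionBudget: keep at least 6 of the 10 replicas available
# during voluntary disruptions such as a node drain.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: service-a-pdb
spec:
  minAvailable: 6
  selector:
    matchLabels:
      app: service-a
---
# Pod anti-affinity on the Deployment: prefer spreading replicas across
# different nodes, so a single reclaimed instance hits fewer pods.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-a
spec:
  replicas: 10
  selector:
    matchLabels:
      app: service-a
  template:
    metadata:
      labels:
        app: service-a
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: service-a
                topologyKey: kubernetes.io/hostname
      containers:
        - name: service-a
          image: service-a:latest   # placeholder image
```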

From the AWS standpoint:

  • Enable Capacity Rebalancing on EKS managed node groups. This option allows EKS to start the graceful shutdown sequence and move workloads from the instances being reclaimed to other available instances (see the sketch below).
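
For reference, this is roughly what a Spot managed node group looks like in an eksctl ClusterConfig; as far as I understand, EKS handles Capacity Rebalancing automatically for managed node groups of the Spot capacity type, so there is no separate switch to flip here (cluster name, region, instance types, and sizes are placeholders):

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster        # placeholder
  region: ap-southeast-1  # placeholder
managedNodeGroups:
  - name: spot-workers
    # Multiple instance types improve the odds of finding spare Spot capacity.
    instanceTypes: ["m5.large", "m5a.large", "m4.large"]
    spot: true            # Spot capacity type
    minSize: 3
    maxSize: 10
    desiredCapacity: 3
```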

Full flow of the rebalance recommendation handling

Have a look at this Diagram:

EC2 Spot Instance Rebalance Recommendations Flow

Step-by-step explanation:

  1. When a Spot node receives a rebalance recommendation, Amazon EKS automatically attempts to launch a new replacement Spot node and waits until it successfully joins the cluster.
  2. When a replacement Spot node is bootstrapped and in the Ready state on Kubernetes, Amazon EKS cordons and drains the Spot node that received the rebalance recommendation. Cordoning the Spot node ensures that the service controller doesn't send any new requests to this Spot node. It also removes it from its list of healthy, active Spot nodes.
  3. Depending on the Deployment: if it has a PDB (PodDisruptionBudget) attached, the PDB will prevent the pods from being evicted and hold the node drain process until the budget is satisfied. (In the case of a Termination Notice you only have a two-minute window to reschedule the pods onto other nodes; if the process takes more than 2 minutes, your node will be terminated and pods that do not finish their work will be killed/crashed/brutally beheaded.)
  4. The scheduler will look for nodes on which to place the drained pods; if it can't find resources for a pod, it marks the pod's PodScheduled condition as false with reason Unschedulable.
  5. Cluster Autoscaler periodically checks for unschedulable pods (every 10 seconds by default); when it sees pods in that list, it tries to find a new place to run them and starts scaling up the node group (the flag controlling this interval is shown in the snippet after this list).
  6. If all goes well, the node is spun up and the pods are scheduled there.
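
The 10-second default mentioned in step 5 corresponds to Cluster Autoscaler's --scan-interval flag. If you need to verify or tune it, it lives in the autoscaler Deployment's container args; a trimmed sketch, not our full manifest (image tag and extra flags are illustrative):

```yaml
# Excerpt from the cluster-autoscaler Deployment spec
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.0  # version is illustrative
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --scan-interval=10s   # how often unschedulable pods are checked (default 10s)
      - --balance-similar-node-groups
```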

Note: in Step 2, while waiting for the newly created node to reach the Ready state, if a Spot two-minute interruption notice arrives before the replacement Spot node is Ready, Amazon EKS starts draining the Spot node that received the rebalance recommendation right away.

Full flow of Interruption Notice handling

EC2 Spot Instance Interruption Notices Flow

  1. When a Spot node receives an Interruption Notice, there is a 2-minute window before the node is terminated. The affected node is cordoned and drained, and its pods receive SIGTERM.
  2. Depending on the Deployment: if it has a PDB (PodDisruptionBudget) attached, the PDB will prevent the pods from being evicted and hold the node drain process until the budget is satisfied.
  3. The scheduler will look for nodes on which to place the drained pods; if it can't find resources for a pod, it marks the pod's PodScheduled condition as false with reason Unschedulable.
  4. Cluster Autoscaler periodically checks for unschedulable pods (every 10 seconds by default); when it sees pods in that list, it tries to find a new place to run them and starts scaling up the node group.
  5. If all goes well, the node is spun up and the pods are scheduled there.

Responsibility

Who does what?

DEVOPS

  • Make sure all Deployments that need HA (High Availability) have their PDBs set up
  • Monitor the rebalance recommendation rate and see how effective it is in terms of pre-provisioning. Is it too sensitive, creating a cycle of spin-up <=> move pods <=> termination that can cause service fatigue or hiccups?

DEVELOPERS

  • Handle SIGTERM on the application side when pods receive it. Your code should listen for this signal and start shutting down cleanly at that point. This may include stopping any long-lived connections (like a database connection or WebSocket stream), saving the current state, or anything like that. (A minimal demo is sketched in the termination lifecycle notes below.)

References & Gotchas

The Kubernetes termination lifecycle (For Dev)

Kubernetes does a lot more than monitor your application for crashes. It can create more copies of your application to run on multiple machines, update your application, and even run multiple versions of your application at the same time!

This means there are many reasons why Kubernetes might terminate a perfectly healthy container. If you update your deployment with a rolling update, Kubernetes slowly terminates old pods while spinning up new ones.

If you drain a node, Kubernetes terminates all pods on that node. If a node runs out of resources, Kubernetes terminates pods to free those resources.

It’s important that your application handle termination gracefully so that there is minimal impact on the service and the time-to-recovery is as fast as possible!

In practice, this means your application needs to handle the SIGTERM message and begin shutting down when it receives it. This means saving all data that needs to be saved, closing down network connections, finishing any work that is left, and other similar tasks.

Once Kubernetes has decided to terminate your pod, a series of events takes place. Let’s look at each step of the Kubernetes termination lifecycle.

1 - Pod is set to the “Terminating” State and removed from the endpoints list of all Services

At this point, the pod stops getting new traffic. Containers running in the pod will not be affected.

2 - preStop Hook is executed

The preStop Hook is a special command or http request that is sent to the containers in the pod.

If your application doesn’t gracefully shut down when receiving a SIGTERM you can use this hook to trigger a graceful shutdown. Most programs gracefully shut down when receiving a SIGTERM, but if you are using third-party code or are managing a system you don’t have control over, the preStop hook is a great way to trigger a graceful shutdown without modifying the application.
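
A minimal sketch of an exec preStop hook, using nginx as an example of third-party software you may not want to modify (the image tag is illustrative; the short sleep is a common trick to give endpoint removal time to propagate before shutdown starts):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
    - name: nginx
      image: nginx:1.25   # version is illustrative
      lifecycle:
        preStop:
          exec:
            # Pause briefly so load balancers stop sending new traffic,
            # then ask nginx to finish in-flight requests and exit gracefully.
            command: ["sh", "-c", "sleep 5 && nginx -s quit"]
```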

3 - SIGTERM signal is sent to the pod

At this point, Kubernetes will send a SIGTERM signal to the containers in the pod. This signal lets the containers know that they are going to be shut down soon.

Your code should listen for this event and start shutting down cleanly at this point. This may include stopping any long-lived connections (like a database connection or WebSocket stream), saving the current state, or anything like that.

Even if you are using the preStop hook, it is important that you test what happens to your application if you send it a SIGTERM signal, so you are not surprised in production!
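
If you want to see this behaviour end to end without touching application code, a throwaway pod like the one below traps SIGTERM in a shell loop; this is purely a demo, and in a real service the equivalent handler lives in your application code:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sigterm-demo
spec:
  containers:
    - name: demo
      image: busybox:1.36   # version is illustrative
      command: ["sh", "-c"]
      args:
        - |
          # Trap SIGTERM and exit cleanly instead of waiting to be SIGKILLed.
          trap 'echo "SIGTERM received, shutting down cleanly"; exit 0' TERM
          echo "running"
          while true; do sleep 1; done
```

Running kubectl delete pod sigterm-demo and watching the logs shows the handler firing well before the grace period expires.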

4 - Kubernetes waits for a grace period

At this point, Kubernetes waits for a specified time called the termination grace period. By default, this is 30 seconds. It’s important to note that this happens in parallel to the preStop hook and the SIGTERM signal. Kubernetes does not wait for the preStop hook to finish.

If your app finishes shutting down and exits before the terminationGracePeriod is done, Kubernetes moves to the next step immediately.

If your pod usually takes longer than 30 seconds to shut down, make sure you increase the grace period. You can do that by setting the terminationGracePeriodSeconds option in the Pod YAML, for example to change it to 60 seconds as in the sketch below.
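
A minimal sketch of that setting on a Pod spec (name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: slow-shutdown-app
spec:
  terminationGracePeriodSeconds: 60  # default is 30; raise it if shutdown takes longer
  containers:
    - name: app
      image: my-app:latest   # placeholder image
```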

5 - SIGKILL signal is sent to pod, and the pod is removed

If the containers are still running after the grace period, they are sent the SIGKILL signal and forcibly removed. At this point, all Kubernetes objects are cleaned up as well.
