Kubernetes Cluster Autoscaler
In this post I will write about the Kubernetes Cluster Autoscaler.
I will talk through some of the key concepts and why it plays an important role in our clusters.
What does the Cluster Autoscaler do?
The Kubernetes Cluster Autoscaler (CA) sits at the infrastructure level. It primarily has 2 responsibilities:
- To expand the cluster when Kubernetes (K8s) requires additional node capacity to meet the demands of the deployments. This happens when pods are marked by K8s as being unschedulable within the current constraints of the cluster.
- To contract the cluster by removing nodes when the demands of the deployments can still be met by a smaller pool of overall node capacity.
Scheduling pods
To understand the process of cluster autoscaling it is important to understand the role played by the K8s Scheduler (by default kube-scheduler).
When we make a request for a pod to be deployed to our cluster, the PodCondition of the pod is set to Pending.
The scheduler performs a series of calculations for the nodes available in the cluster to determine if it is feasible
for any of the existing nodes to take our shiny new pod.
In this scenario, our K8s Scheduler has determined that it cannot place our pod within the constraints of our cluster as things stand. In doing so, the reason on the pod's PodCondition will be set to unschedulable.
Note that the unschedulable flag on the PodCondition is not an exact key, value pair. For more detail on this, see the K8s documentation on pod conditions.
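To make that concrete, here is a rough sketch of what the status of an unschedulable pod can look like. The message is hypothetical, but the condition itself is the standard PodScheduled condition with its status set to False and its reason set to Unschedulable:

```yaml
# Illustrative excerpt of a pending pod's status (kubectl get pod <name> -o yaml).
status:
  phase: Pending
  conditions:
  - type: PodScheduled
    status: "False"
    reason: Unschedulable
    message: "0/3 nodes are available: 3 Insufficient cpu."   # example message
```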
Adding nodes to the cluster
The Cluster Autoscaler will check for any pods which have been marked as unschedulable by the K8s scheduler. By default, this check happens every 10s, although this is configurable via the --scan-interval flag.
Now the CA has determined that our current cluster cannot meet our demands.
So it will send a scale out request to our cloud provider for a new node.
Note that for brevity, I will continue to refer to that party as the cloud provider.
In reality this capacity could also come from within our organisation, for example if we were running our own hardware.
At this point we are entirely at the mercy of the provider to respond to our request with a new node.
This can happen quickly or take a few minutes. Or even longer!
Our CA expects nodes to be registered by default within 15m.
As with practically everything else in K8s, this is also configurable, via the --max-node-provision-time flag.
If a new node is not registered within this time, it will no longer be considered by our CA.
The CA would request a different node type in this scenario.
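As a rough sketch of where these knobs live, the flags mentioned so far would typically sit in the args of the cluster-autoscaler Deployment. The image tag and cloud provider below are placeholders, and the values are simply the defaults:

```yaml
# Hypothetical excerpt of a cluster-autoscaler Deployment.
containers:
- name: cluster-autoscaler
  image: registry.k8s.io/autoscaling/cluster-autoscaler:<version>   # placeholder tag
  command:
  - ./cluster-autoscaler
  - --cloud-provider=<provider>        # e.g. aws, gce, azure
  - --scan-interval=10s                # how often to check for unschedulable pods (default)
  - --max-node-provision-time=15m      # how long to wait for a new node to register (default)
```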
Scheduling to the new node
Once the node has been provisioned in our cluster, the PodCondition of our requested pod can be altered to reflect the fact that it can now be considered by the K8s Scheduler again.
The Scheduler will once again perform its calculations to find a node which is suitable to take our pod.
Once it has found a node capable of taking our pod, it will then set the pod's PodScheduled condition to True.
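Continuing the hypothetical pod from earlier, once the Scheduler has bound it to a node the same condition flips, and the chosen node appears in the pod spec (node-a is a placeholder name):

```yaml
# Illustrative excerpt of the pod after scheduling.
spec:
  nodeName: node-a          # hypothetical node chosen by the Scheduler
status:
  conditions:
  - type: PodScheduled
    status: "True"
```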
Over-provisioned nodes
If we have a scenario in which our cluster is over-provisioned, then no pods will be marked as unschedulable when our CA performs its assessment of the cluster. Instead, it will determine that our nodes are being under-utilized.
This is a problem as we are wasting resources, but our CA has the ability to re-arrange our deployments in a more efficient manner.
Here our CA has determined that we can take the pods currently in the cluster and deploy them to Node A.
At this point, the CA will also take into consideration a number of other factors:
- How much interruption each pod can tolerate, as set out by its PodDisruptionBudget.
- The execution of any pod lifecycle hooks, namely the PreStop container lifecycle hook.
- Whether any of the pods in question are critical to our application. This can be set via the priorityClassName field on the pod spec.
- Whether any of the pods have the safe-to-evict=false annotation set on them.
- Whether any of the nodes in question have their scale-down-disabled annotation set to true, which, as you might have guessed, will prevent that node from being included in any scale down action (see the sketch after this list for how these settings look).
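As a sketch of where those settings live: safe-to-evict is a pod annotation, scale-down-disabled is a node annotation, pod priority is the priorityClassName field on the pod spec, and the disruption tolerance is a PodDisruptionBudget. The annotation keys below are the real ones; every name, image and value is otherwise hypothetical:

```yaml
# Pod that the CA should never evict during a scale down.
apiVersion: v1
kind: Pod
metadata:
  name: important-worker                    # hypothetical
  labels:
    app: important-worker
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  priorityClassName: high-priority          # hypothetical PriorityClass
  containers:
  - name: app
    image: example.org/app:latest           # placeholder image
---
# Node that should never be considered for a scale down.
apiVersion: v1
kind: Node
metadata:
  name: node-b                              # hypothetical
  annotations:
    cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"
---
# PodDisruptionBudget limiting how many of these pods can be voluntarily evicted at once.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: important-worker-pdb                # hypothetical
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: important-worker
```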
Assessment of used node capacity
If the sum of the CPU and memory requests of all pods on Node B totals less than 50% of the node's allocatable capacity, then the CA will consider Node B to be a candidate for scaling down. This 50% figure is a default setting, which can be tweaked via the --scale-down-utilization-threshold flag.
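As a hypothetical worked example of that calculation, suppose Node B has 4 CPU and 16Gi of memory allocatable, and the pods sat on it request a total of 1.5 CPU and 6Gi:

```yaml
# CPU utilization:    1.5 / 4 = 37.5%
# Memory utilization: 6 / 16  = 37.5%
# Both are below the default 50% threshold, so Node B becomes a scale-down candidate.
# The threshold itself is another cluster-autoscaler flag:
- --scale-down-utilization-threshold=0.5
```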
The actual resources available to us on a node can be largely divided into the following groups (the exact proportions of these groups can vary from one cloud provider to another):
- The operating system and system daemons, such as SSH & systemd.
- The Kubernetes components, such as the kubelet, the container runtime and the node problem detector.
- The allocatable slice. This is the portion we care most about: the part of the node available to accommodate our pods (a sketch of how this looks on the Node object follows this list).
- The dreaded eviction threshold. Consider this a contingency to prevent a fatal over-consumption of resources on the node. Without this, we would run the risk of system Out of Memory (OOM) errors. When this happens, a node can go offline temporarily, which is clearly not ideal!
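The allocatable slice is visible on the Node object itself; the gap between capacity and allocatable is what has been reserved for the system, the Kubernetes components and the eviction threshold. A rough sketch, with made-up numbers:

```yaml
# Illustrative excerpt of a node's status (kubectl get node node-b -o yaml).
status:
  capacity:
    cpu: "4"
    memory: 16Gi
    pods: "110"
  allocatable:
    cpu: 3920m       # capacity minus reservations and the eviction threshold
    memory: 14Gi
    pods: "110"
```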
For our purposes we will assume that the pod currently on Node B is a viable candidate to be rehoused on Node A.
We will also assume that it is acceptable for Node B to be scaled down by our CA.
Removing nodes
Our pod, which is currently sat on Node B, is evicted from that node.
As part of the calculations performed by K8s, the designated location for the pod is remembered.
In our case, the Scheduler has decided that it wants to place this pod on Node A.
Note that it may well be the case that the pod ends up on a different node than what was originally decided!
The CA will now consider Node B to be unneeded.
At this point a timer will begin; this defaults to a 10m period, also configurable via the --scale-down-unneeded-time flag.
If we request more pods to be deployed to our cluster within that 10m period, then our additional pods might just end up on Node B instead.
Remember that it can be a bit of a chore to provision a new node, as we're entirely dependent on the provider.
So you can consider this --scale-down-unneeded-time to be more of an "are you sure you want to do this?" kind of buffer.
Most importantly, the buffer offered by --scale-down-unneeded-time is in place to prevent thrashing.
Thrashing is the phenomenon whereby controller(s) perform opposing actions on a resource, resulting in fluctuating behaviour…
Yikes, that was horrendously complicated, so let's unpack it.
If the CA wanted to scale a node in and then immediately afterwards wanted to scale back out, that would result in the availability of nodes fluctuating rapidly.
This behaviour is also known as flapping.
In this case, the CA would be the controller, the opposing actions would be scaling in & out, and the resource would be the node.
Assuming that the --scale-down-unneeded-time period has passed, Node B will be marked as Tainted.
This Tainted flag acts as a warning to the Scheduler not to assign pods to this node, as it is no longer suitable within the constraints of our cluster. This will prevent any new pods from being deployed to Node B in the meantime.
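For reference, this shows up as a taint on the node spec. In current Cluster Autoscaler releases the taint key is ToBeDeletedByClusterAutoscaler with a NoSchedule effect; the value below is a placeholder for the timestamp the CA records:

```yaml
# Illustrative excerpt of the tainted node.
spec:
  taints:
  - key: ToBeDeletedByClusterAutoscaler
    value: "1700000000"       # placeholder timestamp
    effect: NoSchedule
```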
The node will then be safely drained and terminated, removing it from the cluster.
Our Scheduler can now set the pod to be placed on Node A.
The CA will not repeat its assessment of the cluster until the time period set out by --scale-down-delay-after-delete has passed.
By default, this is set to the value of the --scan-interval flag (which itself defaults to 10s).
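Pulling the scale-down timing knobs from this section together, they might appear in the autoscaler's args like so (the values shown are the defaults discussed above):

```yaml
- --scale-down-unneeded-time=10m         # how long a node must be unneeded before it is removed
- --scale-down-delay-after-delete=10s    # wait after a node deletion before the next assessment
```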