Kubernetes Cluster Autoscaler
In this post I will write about the Kubernetes Cluster Autoscaler.
I will talk through some of the key concepts and why it plays an important role in our clusters.
What does the Cluster Autoscaler do?
The Kubernetes Cluster Autoscaler (CA) sits at the infrastructure level. It primarily has 2 responsibilities:
- To expand the cluster when Kubernetes (K8s) requires additional node capacity to meet the demands of the deployments. This happens when pods are marked by K8s as being unschedulable within the current constraints of the cluster.
- To contract the cluster by removing nodes when the demands of the deployments can still be met by a smaller pool of overall node capacity.
Scheduling pods
To understand the process of cluster autoscaling it is important to understand the role played by the K8s Scheduler (by default kube-scheduler).
When we make a request for a pod to be deployed to our cluster, the PodCondition of the pod is set to Pending.
The scheduler performs a series of calculations for the nodes available in the cluster to determine if it is feasible
for any of the existing nodes to take our shiny new pod.
In this scenario, our K8s Scheduler has determined that it cannot place our pod within the constraints of our cluster as things stand. In doing so, the reason on the pod's PodCondition will be set to unschedulable.
Note that the unschedulable flag on the PodCondition is not an exact key, value pair. For more detail on this, see the K8s documentation on pod conditions.
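To make that concrete, here is a rough sketch of what the status of an unschedulable pod can look like. The message is hypothetical, but the condition itself is the standard PodScheduled condition with its status set to False and its reason set to Unschedulable:

```yaml
# Illustrative excerpt of a pending pod's status (kubectl get pod <name> -o yaml).
status:
  phase: Pending
  conditions:
  - type: PodScheduled
    status: "False"
    reason: Unschedulable
    message: "0/3 nodes are available: 3 Insufficient cpu."   # example message
```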
Adding nodes to the cluster
The Cluster Autoscaler will check for any pods which have been marked as unschedulable by the K8s scheduler. By default, this check happens every 10s, although this is configurable via the --scan-interval flag.
Now the CA has determined that our current cluster cannot meet our demands.
So it will send a scale out request to our cloud provider for a new node.
Note that for brevity, I will continue to refer to that party as the cloud provider.
In reality this capacity could also come from within our organisation, for example if we were running our own hardware.
At this point we are entirely at the mercy of the provider to respond to our request with a new node.
This can happen quickly or take a few minutes. Or even longer!
Our CA expects nodes to be registered by default within 15m.
As with practically everything else in K8s, this is also configurable, via the --max-node-provision-time flag.
If a new node is not registered within this time, it will no longer be considered by our CA.
The CA would request a different node type in this scenario.
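As a rough sketch of where these knobs live, the flags mentioned so far would typically sit in the args of the cluster-autoscaler Deployment. The image tag and cloud provider below are placeholders, and the values are simply the defaults:

```yaml
# Hypothetical excerpt of a cluster-autoscaler Deployment.
containers:
- name: cluster-autoscaler
  image: registry.k8s.io/autoscaling/cluster-autoscaler:<version>   # placeholder tag
  command:
  - ./cluster-autoscaler
  - --cloud-provider=<provider>        # e.g. aws, gce, azure
  - --scan-interval=10s                # how often to check for unschedulable pods (default)
  - --max-node-provision-time=15m      # how long to wait for a new node to register (default)
```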
Scheduling to the new node
Once the node has been provisioned in our cluster, the PodCondition of our requested pod can be altered to reflect the fact that it can now be considered by the K8s Scheduler again.
The Scheduler will once again perform its calculations to find a node which is suitable to take our pod.
Once it has found a node capable of taking our pod, it will then set the pod's PodScheduled condition to True.
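Continuing the hypothetical pod from earlier, once the Scheduler has bound it to a node the same condition flips, and the chosen node appears in the pod spec (node-a is a placeholder name):

```yaml
# Illustrative excerpt of the pod after scheduling.
spec:
  nodeName: node-a          # hypothetical node chosen by the Scheduler
status:
  conditions:
  - type: PodScheduled
    status: "True"
```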
Over-provisioned nodes
If we have a scenario in which our cluster is over-provisioned, then no pods will be marked as unschedulable when our CA performs its assessment of the cluster. Instead, it will determine that our nodes are being under-utilized.
This is a problem as we are wasting resources, but our CA has the ability to re-arrange our deployments in a more efficient manner.
Here our CA has determined that we can take the pods currently in the cluster and deploy them to Node A.
At this point, the CA will also take into consideration a number of other factors:
- How much interruption each pod can tolerate, as set out by its PodDisruptionBudget.
- The execution of any pod lifecycle hooks, namely the PreStop container lifecycle hook.
- Whether any of the pods in question are critical to our application. This can be set via the priorityClassName field on the pod spec.
- Whether any of the pods have the safe-to-evict=false annotation set on them.
- Whether any of the nodes in question have their scale-down-disabled annotation set to true, which, as you might have guessed, will prevent that node from being included in any scale down action (see the sketch after this list for how these settings look).
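As a sketch of where those settings live: safe-to-evict is a pod annotation, scale-down-disabled is a node annotation, pod priority is the priorityClassName field on the pod spec, and the disruption tolerance is a PodDisruptionBudget. The annotation keys below are the real ones; every name, image and value is otherwise hypothetical:

```yaml
# Pod that the CA should never evict during a scale down.
apiVersion: v1
kind: Pod
metadata:
  name: important-worker                    # hypothetical
  labels:
    app: important-worker
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  priorityClassName: high-priority          # hypothetical PriorityClass
  containers:
  - name: app
    image: example.org/app:latest           # placeholder image
---
# Node that should never be considered for a scale down.
apiVersion: v1
kind: Node
metadata:
  name: node-b                              # hypothetical
  annotations:
    cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"
---
# PodDisruptionBudget limiting how many of these pods can be voluntarily evicted at once.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: important-worker-pdb                # hypothetical
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: important-worker
```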
Assessment of used node capacity
If the sum of the CPU and memory requests of all pods on Node B totals less than 50% of the node's allocatable capacity, then the CA will consider Node B to be a candidate for scaling down. This 50% figure is a default setting, which can be tweaked via the --scale-down-utilization-threshold flag.
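As a hypothetical worked example of that calculation, suppose Node B has 4 CPU and 16Gi of memory allocatable, and the pods sat on it request a total of 1.5 CPU and 6Gi:

```yaml
# CPU utilization:    1.5 / 4 = 37.5%
# Memory utilization: 6 / 16  = 37.5%
# Both are below the default 50% threshold, so Node B becomes a scale-down candidate.
# The threshold itself is another cluster-autoscaler flag:
- --scale-down-utilization-threshold=0.5
```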
The actual resources available to us on a node can be largely divided into the following groups (the exact proportions of these groups can vary from one cloud provider to another):
- The operating system and system daemons, such as SSH & systemd.
- The Kubernetes components, such as the kubelet, the container runtime and the node problem detector.
- The allocatable slice. This is the portion we care most about: the part of the node available to accommodate our pods (a sketch of how this looks on the Node object follows this list).
- The dreaded eviction threshold. Consider this a contingency to prevent a fatal over-consumption of resources on the node. Without this, we would run the risk of system Out of Memory (OOM) errors. When this happens, a node can go offline temporarily, which is clearly not ideal!
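The allocatable slice is visible on the Node object itself; the gap between capacity and allocatable is what has been reserved for the system, the Kubernetes components and the eviction threshold. A rough sketch, with made-up numbers:

```yaml
# Illustrative excerpt of a node's status (kubectl get node node-b -o yaml).
status:
  capacity:
    cpu: "4"
    memory: 16Gi
    pods: "110"
  allocatable:
    cpu: 3920m       # capacity minus reservations and the eviction threshold
    memory: 14Gi
    pods: "110"
```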
For our purposes we will assume that the pod currently on Node B is a viable candidate to be rehoused on Node A.
We will also assume that it is acceptable for Node B to be scaled down by our CA.
Removing nodes
Our pod, which is currently sat on Node B, is evicted from that node.
As part of the calculations performed by K8s, the designated location for the pod is remembered.
In our case, the Scheduler has decided that it wants to place this pod on Node A.
Note that it may well be the case that the pod ends up on a different node than what was originally decided!
The CA will now consider Node B to be unneeded.
At this point a timer will begin; this defaults to a 10m period, also configurable via the --scale-down-unneeded-time flag.
If we request more pods to be deployed to our cluster within that 10m period, then our additional pods might just end up on Node B instead.
Remember that it can be a bit of a chore to provision a new node, as we're entirely dependent on the provider.
So you can consider this --scale-down-unneeded-time to be more of an "are you sure you want to do this?" kind of buffer.
Most importantly, the buffer offered by --scale-down-unneeded-time is in place to prevent thrashing.
Thrashing is the phenomenon whereby controller(s) perform opposing actions on a resource, resulting in fluctuating behaviour…
Yikes, that was horrendously complicated, so let's unpack it.
If the CA wanted to scale a node in and then immediately afterwards wanted to scale back out, that would result in the availability of nodes fluctuating rapidly.
This behaviour is also known as flapping.
In this case, the CA would be the controller, the opposing actions would be scaling in & out, and the resource would be the node.
Assuming that the --scale-down-unneeded-time period has passed, Node B will be marked as Tainted.
This Tainted flag acts as a warning to the Scheduler not to assign pods to this node, as it is no longer suitable within the constraints of our cluster. This will prevent any new pods from being deployed to Node B in the meantime.
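For reference, this shows up as a taint on the node spec. In current Cluster Autoscaler releases the taint key is ToBeDeletedByClusterAutoscaler with a NoSchedule effect; the value below is a placeholder for the timestamp the CA records:

```yaml
# Illustrative excerpt of the tainted node.
spec:
  taints:
  - key: ToBeDeletedByClusterAutoscaler
    value: "1700000000"       # placeholder timestamp
    effect: NoSchedule
```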
The node will then be safely drained and terminated, removing it from the cluster.
Our Scheduler can now set the pod to be placed on Node A.
The CA will not repeat its assessment of the cluster until the time period set out by --scale-down-delay-after-delete has passed.
By default, this is set to the value of the --scan-interval flag (which itself defaults to 10s).
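Pulling the scale-down timing knobs from this section together, they might appear in the autoscaler's args like so (the values shown are the defaults discussed above):

```yaml
- --scale-down-unneeded-time=10m         # how long a node must be unneeded before it is removed
- --scale-down-delay-after-delete=10s    # wait after a node deletion before the next assessment
```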