Afaan Ashiq

Kubernetes Horizontal Pod Autoscaler

December 2, 2022

In this post I will talk about the Kubernetes Horizontal Pod Autoscaler.
I will go over the concepts behind it and why we might want to use it.



What is the Horizontal Pod Autoscaler?

The Horizontal Pod Autoscaler (HPA) is a resource which targets a specific workload, like a Deployment or a StatefulSet. Its responsibility is to scale the target resource (scaleTargetRef) out in response to an increase in demand, and to scale it back in when the load drops.

The HPA can increase and decrease the desired replica count of a workload. In other words, the HPA can ask the workload to change the number of pods which are deployed.

One thing to note is that K8s implements the HPA as a control loop, as opposed to a continuous process. By default this check is performed every 15s, although the interval is configurable via the --horizontal-pod-autoscaler-sync-period flag on the kube-controller-manager.


What is the heuristic used by the HPA?

The HPA keeps track of a given metric for the target resource. This is how it decides whether to scale out or in. Fundamentally, the HPA applies the following calculation when making this decision:

desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]

This desiredReplicas value is used to drive the number of the workload replicas required. See the K8s docs for further detail on this algorithm.
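
To make the arithmetic concrete with purely illustrative numbers: if we currently have 3 replicas, the current metric value is 200m of CPU and the desired metric value is 100m, then desiredReplicas = ceil[3 * ( 200 / 100 )] = 6, and the HPA will drive the workload out to 6 replicas.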

Out of the box, K8s gives us the ability to monitor the CPU or memory as the basis of the desiredMetricValue for the HPA.

[Diagram: the HPA controller scraping metrics from the metrics API]

The HPA controller gets these scraped metrics from the resource metrics API (for CPU or memory). Alternatively, we can point the HPA controller at a custom or external metrics API, opening up the possibility of using metric types which are not supported out of the box by K8s. This is useful in situations where CPU or memory are not suitable triggers to base our scaling on. The first example that might come to mind is configuring the horizontal scaling of a worker/consumer component; in that case we might be more inclined to monitor the number of messages on the queue being subscribed to.
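
As a minimal sketch of what that could look like, assuming a metrics adapter (for example prometheus-adapter) is installed and serving a queue-length metric through the external metrics API (the metric and workload names here are purely illustrative):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: queue-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker              # illustrative worker Deployment
  minReplicas: 1
  maxReplicas: 20
  metrics:
  - type: External
    external:
      metric:
        name: queue_messages_ready  # illustrative metric served by the metrics adapter
      target:
        type: AverageValue
        averageValue: "30"          # aim for roughly 30 queued messages per replica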


Scaling policies

Scaling policies can be added to an HPA separately for the scaleUp and scaleDown actions, under the behavior field of the spec. Note that when an HPA tracks multiple metrics, the scaling algorithm is applied to each of the individual metrics and the largest result (desiredReplicas) is selected and passed to the controller for the scaling operation.

Similarly, when multiple policies are specified for an action, the policy which allows the greatest amount of change is the one selected and applied by default.
In this example from the K8s docs:

behavior:
  scaleDown:
    policies:
    - type: Pods
      value: 4
      periodSeconds: 60
    - type: Percent
      value: 10
      periodSeconds: 60

Here the 1st policy allows a maximum of 4 replicas to be scaled down in 1 minute, whereas the 2nd policy allows at most 10% of the current replicas to be scaled down in a 1-minute window.

The actual result of this can introduce quite a lot of complexity, due to the various permutations in which the policies can kick in, but it can be useful if you need that level of granularity.
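
If the largest change is not the behaviour we want, the selectPolicy field can be used to change how the winning policy is chosen. A minimal sketch, reusing the policies above, which instead selects the policy allowing the smallest change:

behavior:
  scaleDown:
    selectPolicy: Min    # choose the policy permitting the smallest change (the default is Max)
    policies:
    - type: Pods
      value: 4
      periodSeconds: 60
    - type: Percent
      value: 10
      periodSeconds: 60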


Thrashing

In my post about the Kubernetes Cluster Autoscaler, I spoke briefly about the phenomenon of thrashing/flapping. When we apply our algorithm we need to ensure that we do not allow opposing scaling actions to be applied to our workload in quick succession. This would result in the HPA seemingly continuously scaling our workload in and out.

This can also happen when the resource metric is changing at a high rate, for example if the metric spikes and dips around the trigger value over a short period of time. The stabilizationWindowSeconds setting is used to tell an HPA to check against the desired states computed within the specified window when deciding whether to perform a scaling action.
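
For example, to make scale-down decisions consider the desired states computed over the previous five minutes (which is also the K8s default for scaling down), we could set:

behavior:
  scaleDown:
    stabilizationWindowSeconds: 300  # scale down no lower than the highest desired state computed in this window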


Demo | Setting up the deployment

So, we’ve got enough theory to back up our understanding of the HPA. Now let’s take a look at what this looks like in action. Please note that the following exercise is purely for demonstration purposes. In practice, we would not want to deploy our HPA in this imperative manner.

~ via v14.21.1 on ☁️  (eu-west-1)
❯ kubectl create namespace hpa-testing
namespace/hpa-testing created

With this command I will create a namespace for our demo. A namespace is an isolated environment which we can use to separate our resources from others within the scope of the same cluster.

~ via v14.21.1 on ☁️  (eu-west-1)
❯ helm install --set 'resources.limits.cpu=200m' \
  --set 'resources.limits.memory=200Mi' \
  --set 'resources.requests.cpu=200m' \
  --set 'resources.requests.memory=200Mi' \
  testing-app bitnami/apache -n hpa-testing

NAME: testing-app
LAST DEPLOYED: Fri Dec 2 20:06:43 2022
NAMESPACE: hpa-testing
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
CHART NAME: apache
CHART VERSION: 9.2.7
APP VERSION: 2.4.54

** Please be patient while the chart is being deployed **

1. Get the Apache URL by running:

** Please ensure an external IP is associated to the testing-app-apache service before proceeding **
** Watch the status using: kubectl get svc --namespace hpa-testing -w testing-app-apache **

  export SERVICE_IP=$(kubectl get svc --namespace hpa-testing testing-app-apache --template "{{ range (index .status.loadBalancer.ingress 0) }}{{ . }}{{ end }}")
  echo URL            : http://$SERVICE_IP/


WARNING: You did not provide a custom web application. Apache will be deployed with a default page. Check the README section "Deploying your custom web application" in https://github.com/bitnami/charts/blob/main/bitnami/apache/README.md#deploying-a-custom-web-application.

Here I have installed a Helm chart for an Apache server into the hpa-testing namespace which we created previously. For the purposes of this demonstration, we will need our Apache application to be reachable at a URL.
So, following the steps described in the chart notes above:

export SERVICE_IP=$(kubectl get svc --namespace hpa-testing testing-app-apache --template "{{ range (index .status.loadBalancer.ingress 0) }}{{ . }}{{ end }}")
echo URL            : http://$SERVICE_IP/

We will need to make a note of that URL value as we will need it soon.

~ via v14.21.1 on ☁️  (eu-west-1)
❯ kubectl get pods -n hpa-testing
NAME                                  READY   STATUS    RESTARTS   AGE
testing-app-apache-87fddb6df-j75bt   0/1     Running   0          41s

The only thing to really note here is that we now have an application running.
It’s not important that it’s an Apache application; what is important is that we will use it as our target workload, to be scaled in response to load.

~ via v14.21.1 on ☁️  (eu-west-1)
❯ kubectl get deployments -n hpa-testing
NAME                 READY   UP-TO-DATE   AVAILABLE   AGE
testing-app-apache   0/1     1            0           45s

Checking our Deployment resource, we can see that the name has been set to testing-app-apache. We’ll use this name to set the target for our HPA.

For our HPA to consume these metrics, we will need to ensure that we have a metrics-server running within our cluster.

Running the following command should give an output that looks something like the following:

~ via v14.21.1 on ☁️  (eu-west-1)
❯ kubectl top nodes
NAME                                           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-172-22-106-126.eu-west-1.compute.internal   204m         5%     3852Mi          25%
ip-172-22-114-103.eu-west-1.compute.internal   165m         4%     3556Mi          23%
ip-172-22-42-85.eu-west-1.compute.internal     311m         7%     3007Mi          20%
ip-172-22-63-218.eu-west-1.compute.internal    285m         7%     3179Mi          21%
ip-172-22-73-69.eu-west-1.compute.internal     194m         4%     3466Mi          23%
ip-172-22-80-208.eu-west-1.compute.internal    243m         6%     4647Mi          31%
ip-172-22-92-218.eu-west-1.compute.internal    296m         7%     4162Mi          28%

If we get a result back, then we can proceed with relative confidence knowing that we have a metrics-server running.
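
If the command errors instead, the cluster most likely does not have a metrics-server installed; the official manifests from the kubernetes-sigs/metrics-server project can be applied to the cluster before proceeding.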


Demo | Attaching an HPA to our deployment

Now, let’s get our HPA running and pointed at our application:

~ via v14.21.1 on ☁️  default
❯ kubectl autoscale deployment testing-app-apache --memory-percent=50 --min=1 --max=10 -n hpa-testing
horizontalpodautoscaler.autoscaling/testing-app-apache autoscaled

Here we’ve imperatively created an HPA with the following conditions:

  • When the average memory utilisation across all the pods goes above or below 50%, trigger a scaling action.
  • We want at least 1 replica running at all times.
  • No matter how high our load gets, we only allow a maximum of 10 replicas running.

The first point is important to note here. Suppose we have scaled out to 2 pods in the following scenario:

| Replica | Memory utilisation (%) |
| ------- | ---------------------- |
| Pod A   | 90                     |
| Pod B   | 30                     |

Then our average memory utilisation is 60%, which exceeds our trigger of 50%. Applying the algorithm from earlier gives desiredReplicas = ceil[2 * ( 60 / 50 )] = 3, so in this scenario our HPA would kick into gear and increase our number of replicas to 3.
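
For reference, the declarative equivalent of the imperative kubectl autoscale command above would be a manifest along these lines (a minimal sketch, assuming the autoscaling/v2 API is available on the cluster):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: testing-app-apache
  namespace: hpa-testing
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: testing-app-apache
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 50    # trigger scaling when average memory utilisation crosses 50%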


Demo | Triggering the HPA to scale out

In order to trigger the HPA to do its thing, we will need to create artificial load against our application.

kubectl run -i --tty load-generator --rm --image=busybox:1.28 --restart=Never -- /bin/sh -c "while sleep 0.0000001; do wget -q -O- http://$SERVICE_IP/; done"

Here we can see that as the load on our application increases, resource utilisation also increases.

The HPA control loop comes back around and calculates desiredReplicas to be greater than the current replica count, thereby triggering a scale-out action.
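
If you want to observe this happening, kubectl get hpa -n hpa-testing -w will show the observed metric value and the replica count updating as each iteration of the control loop reacts to the load.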


Demo | Scale in

When we exit from the load-generator command, we terminate the artificial load that we had previously created.

After this point, the average memory utilisation across the pods will drop below our trigger value. The HPA will then intervene by driving the desired state towards a reduced number of replicas, subject to the scale-down stabilisation window discussed earlier.


Summary

As mentioned previously, the imperative approach we took in deploying our components for this demo is not ideal for a number of reasons, and it is highly unlikely we would do this in production.

In reality, we would adopt a declarative methodology, whereby we describe the resources we want K8s to deploy for us in the form of YAML manifests.
We would want to keep these under source control. And if we were also following a GitOps approach, we would want to ensure that the manifests under version control always represent the current state of our deployments. Which, as you might have guessed, is not what this demo shows!

