Mastering the Art of the Horizontal Pod Autoscaler: Dynamic Workload Adjustment in Action

Imagine running a busy web service that suddenly gets a traffic spike from a viral post. Without automatic scaling, your pods either crash under load or sit idle wasting resources. The Horizontal Pod Autoscaler (HPA) solves this by adjusting the number of pod replicas based on metrics like CPU, memory, or custom signals. In this guide, we'll walk through how HPA works, how to set it up, and what to watch out for when deploying it in real workloads.

Why You Need HPA and What Goes Wrong Without It

Static pod counts are a gamble. If you set replicas too low, your application becomes slow or fails during peak hours. Too high, and you burn money on unused compute. Without autoscaling, teams often over-provision to handle rare spikes, wasting up to 40% of cluster capacity according to industry surveys. Alternatively, they under-provision and suffer outages when demand surges.

HPA removes this guesswork by continuously monitoring resource usage and adjusting replicas automatically. For example, a typical e-commerce site might see traffic double during a flash sale. With HPA, the deployment scales from 5 to 20 pods within minutes, then scales back down after the event. Manual scaling would require an engineer watching dashboards and running kubectl commands, which is error-prone and slow.

The core problem HPA addresses is the mismatch between static capacity and dynamic demand. Workloads fluctuate due to time of day, marketing campaigns, or unpredictable events. Without dynamic adjustment, you either pay for idle capacity or risk poor user experience. HPA brings efficiency and reliability, but it requires careful configuration to avoid thrashing or scaling too slowly.

Common Symptoms of Missing Autoscaling

Teams often notice these signs: latency spikes during peak hours, frequent OOMKilled pods, or underutilized nodes with low average CPU usage. If you see these, HPA is likely the missing piece. However, not all workloads benefit equally. Batch jobs or stateful applications with persistent storage may need different scaling strategies.

Prerequisites and Context: What You Need Before Configuring HPA

Before diving into HPA configuration, ensure your environment meets these requirements. First, you need a Kubernetes cluster running version 1.23 or later, the release in which the autoscaling/v2 API used in this guide became stable; older versions have more limited autoscaling features. Second, the Metrics Server must be installed and collecting resource metrics. Without it, HPA cannot read CPU or memory usage. You can verify with kubectl top nodes and kubectl top pods.

Third, your application should be designed to scale horizontally. Stateless services with shared-nothing architecture work best. If your app relies on local state or sticky sessions, scaling may break functionality. Fourth, define resource requests and limits for your containers. HPA uses requests to calculate target utilization, so missing requests will cause inaccurate scaling.

Finally, understand the metric types HPA supports: resource metrics (CPU, memory), custom metrics (e.g., requests per second from Prometheus), and external metrics (e.g., queue length from cloud services). We'll focus on CPU-based autoscaling as the most common starting point.

Install Metrics Server

Most managed Kubernetes clusters (EKS, GKE, AKS) have Metrics Server pre-installed. For self-managed clusters, deploy it using the official YAML manifest from the Kubernetes metrics-server repository. After installation, wait a minute for data to populate, then run kubectl top pods to confirm.

Set Resource Requests and Limits

Each container in your deployment should specify resources.requests.cpu and resources.requests.memory. For example, a request of 250m CPU means the pod is guaranteed at least a quarter of a core. HPA will target a percentage of this request value. Without requests, HPA cannot compute utilization and will fail to scale.

Core Workflow: Step-by-Step HPA Configuration

Let's walk through creating an HPA for a sample deployment named web-app with CPU target of 50%. We'll use kubectl commands, but you can also define HPA as a YAML manifest for version control.

Step 1: Create a deployment with resource requests. Example YAML snippet:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
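  replicas: 2
  selector:
    matchLabels:
      app: web-app   # required: must match the pod template labels below
  template:
    metadata:
      labels:
        app: web-app
    spec: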
      containers:
      - name: app
        image: nginx:latest
        resources:
          requests:
            cpu: 200m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi

Step 2: Apply the deployment: kubectl apply -f deployment.yaml.

Step 3: Create the HPA using kubectl autoscale: kubectl autoscale deployment web-app --cpu-percent=50 --min=2 --max=10. This sets target CPU utilization to 50%, minimum 2 replicas, maximum 10.

Step 4: Verify HPA status: kubectl get hpa web-app. You'll see current CPU utilization, target, and number of replicas. Initially, the current utilization may show <unknown> until metrics are collected.

Step 5: Generate load to test scaling. For example, run kubectl run -it --rm load-generator --image=busybox -- /bin/sh -c "while true; do wget -q -O- http://web-app-service; done". This assumes a Service named web-app-service exposes the deployment. After a minute, run kubectl get hpa web-app -w in another terminal to watch replicas increase.
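The load generator in Step 5 can only reach the pods if something answers at http://web-app-service. A minimal sketch of such a Service follows. The app: web-app label is an assumption: the selector has to match whatever labels your deployment's pod template actually carries.

```yaml
# Hypothetical Service so the load generator can reach the pods
# at http://web-app-service from inside the cluster.
apiVersion: v1
kind: Service
metadata:
  name: web-app-service
spec:
  selector:
    app: web-app      # assumed pod label; must match the deployment's template labels
  ports:
  - port: 80          # port the load generator requests
    targetPort: 80    # nginx listens on port 80 by default
```

Apply it with kubectl apply -f service.yaml before starting the load generator.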

YAML Manifest Approach

For production, define HPA in YAML:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50

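Apply with kubectl apply -f hpa.yaml. This version uses the autoscaling/v2 API, which supports multiple metrics and custom metrics. If you already created an HPA with kubectl autoscale in Step 3, delete it first (kubectl delete hpa web-app), because two HPAs targeting the same deployment will work against each other.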

Tools, Setup, and Environment Realities

HPA works out of the box with Metrics Server, but production environments often need more sophisticated monitoring. Prometheus and custom metrics adapters allow scaling based on application-level metrics like HTTP request rate or queue depth. For example, you can scale based on the number of pending messages in a RabbitMQ queue.

To use custom metrics, install the Prometheus Adapter or the Microsoft Azure Adapter. These expose custom metrics to the Kubernetes API, which HPA can consume. The setup involves deploying the adapter, configuring metric collection rules, and referencing the metric name in HPA YAML.
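As an illustration of what a metric collection rule can look like, here is a sketch of a Prometheus Adapter rule that turns a request counter into a per-second rate. It assumes your application exports a counter named http_requests_total with namespace and pod labels; that metric name is hypothetical, so substitute whatever your service actually exposes.

```yaml
# Fragment of the Prometheus Adapter configuration (the "rules" section).
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'   # assumed counter name
  resources:
    overrides:
      namespace: {resource: "namespace"}   # map the Prometheus label to the Kubernetes namespace
      pod: {resource: "pod"}               # map the Prometheus label to the Kubernetes pod
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"                  # exposed to HPA as http_requests_per_second
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

The 2-minute rate window is a judgment call: a shorter window reacts faster but is noisier.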

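Another reality: by default, HPA does not scale down to zero, because minReplicas must be at least 1 unless the alpha HPAScaleToZero feature gate is enabled. If you need zero replicas during idle periods, consider using Kubernetes Event-driven Autoscaling (KEDA), which can scale deployments to zero and scale up based on events. KEDA acts as an extension to HPA: it creates and manages an HPA for you and adds its own event triggers.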
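For a sense of what this looks like in practice, here is a sketch of a KEDA ScaledObject that scales web-app between 0 and 10 replicas based on RabbitMQ queue length. The queue name orders, the threshold of 20 messages, and the RABBITMQ_HOST environment variable are all assumptions for illustration, and the trigger fields should be checked against the scaler documentation for your KEDA version.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: web-app-scaledobject
spec:
  scaleTargetRef:
    name: web-app             # the Deployment to scale
  minReplicaCount: 0          # allow scale to zero when the queue is empty
  maxReplicaCount: 10
  triggers:
  - type: rabbitmq
    metadata:
      protocol: amqp
      queueName: orders               # hypothetical queue name
      mode: QueueLength
      value: "20"                     # assumed target: about 20 pending messages per replica
      hostFromEnv: RABBITMQ_HOST      # assumed env var on the target pods holding the connection string
```

Because KEDA manages its own HPA for the target, avoid defining a separate HPA for the same deployment.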

Cluster Autoscaler Integration

HPA works alongside the Cluster Autoscaler, which adds or removes nodes when pods cannot be scheduled due to resource constraints. When HPA requests more replicas, the Cluster Autoscaler may provision new nodes automatically. This combination ensures both pod-level and node-level scaling, but requires proper configuration to avoid conflicts. For instance, set the Cluster Autoscaler's scale-down delay to prevent premature node removal.
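On a self-managed Cluster Autoscaler, the scale-down delay is set through command-line flags. The fragment below is a sketch with illustrative values, not a recommendation; managed offerings often expose these settings through their own configuration rather than raw flags.

```yaml
# Fragment of the cluster-autoscaler container spec (not a complete manifest).
command:
- ./cluster-autoscaler
- --scale-down-delay-after-add=10m   # wait after adding a node before considering scale-down
- --scale-down-unneeded-time=10m     # how long a node must be unneeded before it is removed
```

Keeping these delays at least as long as the HPA scale-down window reduces the chance of removing nodes that the next scale-up would need.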

Managed Kubernetes Differences

On EKS, HPA can consume CloudWatch metrics through a CloudWatch metrics adapter. GKE supports custom and external metrics from Cloud Monitoring (formerly Stackdriver). AKS integrates with Azure Monitor. While the core API is consistent, each cloud provider has its own adapters and limitations, so check their documentation for specifics.

Variations for Different Constraints

Not all workloads fit the standard CPU-based HPA. Let's explore common variations.

Memory-Based Autoscaling

Memory-bound applications, like in-memory caches or databases, benefit from memory-based HPA. Configure it as a Resource metric with name: memory and a Utilization target (target.averageUtilization). However, memory usage often does not scale linearly with load, and memory pressure can cause OOM kills before scaling kicks in. Set target utilization conservatively (e.g., 60%) to leave headroom.
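A sketch of a memory-based HPA for the web-app deployment, using the 60% target suggested above. It reuses the name web-app-hpa on the assumption that it replaces the CPU-based HPA rather than running next to it.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 60   # percent of resources.requests.memory, averaged across pods
```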

Custom Metrics with Prometheus

For web services, scaling based on requests per second (RPS) is more responsive than CPU. Example: target 1000 RPS per pod. This requires Prometheus scraping request metrics and the Prometheus Adapter exposing them. The HPA YAML references the custom metric name, e.g., http_requests_per_second. This approach reduces latency spikes because scaling reacts to traffic directly rather than CPU lag.
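A sketch of an HPA that targets 1000 RPS per pod. It assumes a metrics adapter already exposes a per-pod metric named http_requests_per_second through the custom metrics API; the metric name comes from the text above, but whether it exists in your cluster depends on your adapter rules.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods                          # per-pod custom metric, averaged across pods
    pods:
      metric:
        name: http_requests_per_second  # must match the name exposed by the adapter
      target:
        type: AverageValue
        averageValue: "1000"            # target 1000 requests per second per pod
```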

Multiple Metrics

The autoscaling/v2 API supports multiple metrics in a single definition. For instance, you can scale on CPU and memory simultaneously, or on CPU and a custom metric. HPA evaluates each metric and chooses the largest desired replica count. This prevents one metric from causing under-scaling while another is high. The trade-off is that the highest recommendation always wins: if CPU says scale down but memory says scale up, the larger count is used, which may lead to over-provisioning.
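A sketch of a single HPA with two metrics, CPU and memory. The target percentages are illustrative assumptions; HPA computes a desired replica count for each entry and applies the larger one.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 60
```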

Behavior and Stabilization

HPA uses stabilization windows to prevent rapid flapping. By default, the scale-down stabilization window is 300 seconds (5 minutes), while scale-up has no stabilization window (0 seconds), so replicas are added as soon as the metrics call for them. You can adjust both using the behavior field in autoscaling/v2. For example, to damp scale-ups driven by noisy metrics, set scaleUp.stabilizationWindowSeconds: 60. To avoid thrashing on scale-down, increase scaleDown.stabilizationWindowSeconds to 600 (10 minutes).
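A sketch of the behavior field using the 60-second scale-up window and the 10-minute scale-down window mentioned above. The policies entries are illustrative assumptions that cap how quickly replicas change; tune them to your workload.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # look back 60s before adding replicas
      policies:
      - type: Percent
        value: 100                      # at most double the replica count...
        periodSeconds: 60               # ...per 60-second period
    scaleDown:
      stabilizationWindowSeconds: 600   # look back 10 minutes before removing replicas
      policies:
      - type: Pods
        value: 2                        # remove at most 2 pods...
        periodSeconds: 60               # ...per 60-second period
```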

Pitfalls, Debugging, and What to Check When It Fails

HPA can fail silently. Here are common issues and how to diagnose them.

Metrics Not Available

If kubectl get hpa shows <unknown> for current utilization, Metrics Server may not be running or pods lack resource requests. Check Metrics Server logs: kubectl logs -n kube-system deployment/metrics-server. Also verify that pods have resources.requests set, because without them HPA cannot calculate utilization.

Scaling Too Slowly or Too Fast

Slow scaling often results from long stabilization windows or low metric resolution. Increase the frequency of metric collection by reducing Metrics Server's --metric-resolution flag (default 60s, can go to 15s). Fast scaling (thrashing) occurs when stabilization windows are too short or target utilization is too aggressive. Adjust the behavior field to add a longer cooldown.
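On a self-managed Metrics Server, the resolution is a container argument on its Deployment. The fragment below is a sketch that shows only this one flag; keep whatever other arguments your installation already sets, and note that managed clusters may not let you change it.

```yaml
# Fragment of the metrics-server Deployment's container spec (not a complete manifest).
containers:
- name: metrics-server
  args:
  - --metric-resolution=15s   # collect metrics every 15 seconds
```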

Target Utilization Miscalculation

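If scaling does not match what your dashboards show (for example, the target is 50%, pods look 80% busy, and nothing scales), check what the percentage is measured against. HPA computes utilization relative to the container's CPU request, not its limit or the node's capacity. For example, if a pod requests 100m CPU but actually uses 500m, utilization is 500% of the request, causing HPA to scale aggressively. Conversely, if requests are too high, utilization stays low, so HPA may scale down unnecessarily or fail to scale up under real load. Use kubectl top pods to compare actual usage with requests.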
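The arithmetic behind that example can be written out using the replica formula from the Kubernetes HPA documentation. The starting count of 2 replicas and the 50% target are assumptions taken from the walkthrough earlier in this guide.

```latex
\[
\text{desiredReplicas} =
  \left\lceil \text{currentReplicas} \times
  \frac{\text{currentUtilization}}{\text{targetUtilization}} \right\rceil
\]
\[
\text{currentUtilization} = \frac{500\text{m}}{100\text{m}} = 500\%,
\qquad
\text{desiredReplicas} = \left\lceil 2 \times \frac{500}{50} \right\rceil = 20
\]
```

With maxReplicas set to 10, HPA would stop at 10 replicas, so an unrealistically low request sends the deployment straight to its ceiling.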

HPA Not Respecting Min/Max

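HPA itself does not set a replica count outside minReplicas and maxReplicas. If you see the deployment running below the minimum or above the maximum, look for something else changing replicas: a manual kubectl scale, a manifest with a hard-coded spec.replicas being re-applied, or multiple HPAs targeting the same resource. Also ensure the HPA YAML uses the correct API version: autoscaling/v1 supports only a CPU target and cannot express multiple metrics.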

Debugging Commands

Use kubectl describe hpa to see events and conditions. Look for FailedGetResourceMetric or FailedComputeMetricsReplicas errors. For custom metrics, check the adapter logs. A useful pattern is to run kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 to list available custom metrics.

Finally, test your HPA in a staging environment with synthetic load before deploying to production. Tools like hey or locust can generate controlled traffic. Monitor scaling behavior over several minutes to ensure stability.
