{"id":538,"date":"2025-01-27T09:50:29","date_gmt":"2025-01-27T09:50:29","guid":{"rendered":"https:\/\/blog.ngocha.biz\/?p=538"},"modified":"2025-01-27T09:50:29","modified_gmt":"2025-01-27T09:50:29","slug":"cluster-autoscaler","status":"publish","type":"post","link":"https:\/\/blog.ngocha.biz\/?p=538","title":{"rendered":"Cluster AutoScaler Setup on AWS EKS: A Comprehensive Guide"},"content":{"rendered":"<p>In this hands-on guide, you will implement node autoscaling in an EKS cluster using Cluster AutoScaler.<\/p><p>By default, node autoscaling is not enabled when you deploy an EKS cluster.<\/p><p>This means that even if you enable the Horizontal Pod Autoscaler (HPA), new pods can get stuck in the Pending state once node resources are exhausted. For pods to scale further, the underlying nodes must scale as well.<\/p><p>This is where Cluster AutoScaler comes into play. It automatically adjusts the number of nodes in your cluster to meet resource demands.<\/p><p>By the end of this guide, you will:<\/p><ol><li>Understand what Cluster AutoScaler is.<\/li><li>Learn how it works behind the scenes with AWS.<\/li><li>Set up Cluster AutoScaler on an EKS cluster, hands-on.<\/li><li>Test scale-up and scale-down scenarios.<\/li><\/ol><h2 id=\"what-is-a-cluster-autoscaler\">What is Cluster AutoScaler?<\/h2><p><a href=\"https:\/\/github.com\/kubernetes\/autoscaler\/blob\/master\/cluster-autoscaler\/README.md?ref=devopscube.com\" rel=\"noreferrer noopener\">Cluster AutoScaler<\/a> is a tool that automatically scales the nodes of a <a href=\"https:\/\/devopscube.com\/backup-etcd-restore-kubernetes\/\" rel=\"noreferrer\">Kubernetes cluster<\/a> based on workload demand. 
It is maintained by the Kubernetes community.<\/p><p>It supports most major cloud providers and managed Kubernetes services, such as EKS, AKS, and GKE.<\/p><p>Once deployed, Cluster AutoScaler watches the API server for <strong>unschedulable Pods<\/strong> and adds nodes to the cluster so that those pods have resources to run on.<\/p><p>It also scales nodes down when the cluster has more capacity than it needs.<\/p><p>Cloud-based Kubernetes offerings typically organize worker nodes into <strong>node groups<\/strong>.<\/p><div class=\"kg-card kg-callout-card kg-callout-card-blue\"><div class=\"kg-callout-emoji\">\ud83d\udca1<\/div><div class=\"kg-callout-text\">A <b><strong style=\"white-space: pre-wrap;\">Node Group<\/strong><\/b> is a set of worker nodes within a Kubernetes cluster that share the same configuration, such as instance type, networking, and scaling policies.<\/div><\/div><p>If more than one node group could fit the pending pods, Cluster AutoScaler uses the configured <strong><code>expander strategy<\/code><\/strong> to decide which one to scale.<\/p><div class=\"kg-card kg-callout-card kg-callout-card-blue\"><div class=\"kg-callout-emoji\">\ud83d\udca1<\/div><div class=\"kg-callout-text\">The expander strategy determines which node group Cluster AutoScaler scales up when additional resources are needed and multiple node groups are eligible for expansion.<\/div><\/div><p>The main expander strategies are:<\/p>\n<!--kg-card-begin: html-->\n<ol class=\"wp-block-list is-style-cnvs-list-styled\">\n<li><strong>least-waste<\/strong> &#8211; Selects the node group that would leave the least idle CPU and memory after scaling up.<\/li>\n\n\n<li><strong>random<\/strong> &#8211; Selects an eligible node group at random. This is the default when no expander is specified, and it works well when any node group is acceptable.<\/li>\n\n\n<li><strong>most-pods<\/strong> &#8211; Selects the node group that can schedule the most pending pods.<\/li>\n\n\n<li><strong>least-nodes<\/strong> &#8211; Selects the node group that needs the fewest new nodes to schedule the pending pods.<\/li>\n\n\n<li><strong>price<\/strong> &#8211; Selects the cheapest node group; see <a href=\"https:\/\/github.com\/kubernetes\/autoscaler\/blob\/master\/cluster-autoscaler\/proposals\/pricing.md?ref=devopscube.com\" data-type=\"link\" data-id=\"https:\/\/github.com\/kubernetes\/autoscaler\/blob\/master\/cluster-autoscaler\/proposals\/pricing.md\">here<\/a> for more details. At the time of writing, this expander is only implemented for a few providers, such as GCE and GKE, not AWS.<\/li>\n\n\n<li><strong>priority<\/strong> &#8211; Selects the node group with the highest priority assigned by the user in a <a href=\"https:\/\/github.com\/kubernetes\/autoscaler\/blob\/master\/cluster-autoscaler\/expander\/priority\/readme.md?ref=devopscube.com\" data-type=\"link\" data-id=\"https:\/\/github.com\/kubernetes\/autoscaler\/blob\/master\/cluster-autoscaler\/expander\/priority\/readme.md\">configuration file<\/a>.<\/li>\n<\/ol>\n<!--kg-card-end: html-->\n<div class=\"kg-card kg-callout-card kg-callout-card-blue\"><div class=\"kg-callout-emoji\">\ud83d\udca1<\/div><div class=\"kg-callout-text\">You can set the expander strategy using the <code spellcheck=\"false\" style=\"white-space: pre-wrap;\">--expander<\/code> flag, which we will explore in detail in the hands-on section.<\/div><\/div><p>You can deploy Cluster AutoScaler in two modes:<\/p><ol><li><strong>Auto-Discovery method (Recommended)<\/strong> &#8211; Automatically discovers every node group ASG that has the required tags and scales it when needed.<\/li><li><strong>Manual method<\/strong> &#8211; You have to 
specify each node group ASG&#8217;s minimum size, maximum size, and name.<\/li><\/ol><h2 id=\"how-does-kubernetes-cluster-autoscaler-work\">How does Kubernetes Cluster AutoScaler work?<\/h2><p>The following diagram shows the Cluster AutoScaler workflow on AWS EKS.<\/p><figure class=\"kg-card kg-image-card kg-card-hascaption\"><img decoding=\"async\" src=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2025\/03\/cluster-autoscaler-1-1-1.jpg\" class=\"kg-image\" alt=\"workflow diagram of how does Kubernetes Cluster AutoScaler work\" loading=\"lazy\" width=\"1920\" height=\"1080\" srcset=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w600\/2025\/03\/cluster-autoscaler-1-1-1.jpg 600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w1000\/2025\/03\/cluster-autoscaler-1-1-1.jpg 1000w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w1600\/2025\/03\/cluster-autoscaler-1-1-1.jpg 1600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2025\/03\/cluster-autoscaler-1-1-1.jpg 1920w\" sizes=\"auto, (min-width: 720px) 720px\"><\/figure><p>Here is how Cluster AutoScaler works:<\/p><ol><li>A manifest is applied to create a deployment on the cluster.<\/li><li>The Scheduler, which watches the API server for new pods, assigns the deployment&#8217;s pods to nodes.<\/li><li>Pods are scheduled on nodes until resources are exhausted. 
Any remaining Pods that <strong>cannot be scheduled<\/strong> due to insufficient resources stay in the <strong>Pending state<\/strong>.<\/li><li>The Scheduler marks these Pods as <strong>unschedulable<\/strong> in their status and records the reason (e.g., insufficient CPU or memory).<\/li><li>Cluster AutoScaler, which checks the API server periodically (every 10 seconds by default), notices that the pods are pending because of insufficient resources.<\/li><li>It works out which node groups could fit the pending pods and selects the most suitable one based on the configured <strong>expander strategy<\/strong>.<\/li><li>It then finds the ASG behind that node group and calls the AWS Auto Scaling API to increase the ASG&#8217;s desired capacity.<\/li><li>Once the new instances join the cluster as nodes, the Scheduler places the pending pods on them.<\/li><li>If workloads decrease (e.g., a job finishes or a deployment is scaled down), some nodes may no longer be needed.<\/li><li>A node becomes a candidate for removal when the CPU and memory requested by its pods stay below a utilization threshold (50% by default) for a set time (10 minutes by default), and its pods can be moved to other nodes. 
Cluster AutoScaler then drains the node and asks the Auto Scaling Group (ASG) to terminate the instance.<\/li><\/ol><h2 id=\"setup-prerequisites\"><strong>Setup Prerequisites<\/strong><\/h2><p>The prerequisites for this setup are:<\/p><ol><li><a href=\"https:\/\/devopscube.com\/create-aws-eks-cluster-eksctl\/\" rel=\"noreferrer noopener\">EKS Cluster<\/a><\/li><li><a href=\"https:\/\/devopscube.com\/install-configure-aws-cli-linux\/\" rel=\"noreferrer noopener\">AWS CLI<\/a><\/li><li><a href=\"https:\/\/devopscube.com\/kubectl-set-context\/\" rel=\"noreferrer\">Kubectl<\/a><\/li><li>eksctl<\/li><li>Permission to create an <a href=\"https:\/\/devopscube.com\/aws-iam-role-instance-profile\/\" rel=\"noreferrer\">IAM Role<\/a> and policy<\/li><li>The EKS Pod Identity Agent add-on installed on the cluster (Step 4 shows how to check for it and install it)<\/li><\/ol><h2 id=\"setup-cluster-autoscaler-on-eks-cluster\">Set Up Cluster AutoScaler on an EKS Cluster<\/h2><p>Let&#8217;s set up Cluster AutoScaler on the EKS cluster using the <code>auto-discovery<\/code> method.<\/p><p>For auto-discovery to work, the node group ASGs must have the following tags:<\/p><ol><li><code>k8s.io\/cluster-autoscaler\/enabled<\/code><\/li><li><code>k8s.io\/cluster-autoscaler\/&lt;cluster-name&gt;<\/code><\/li><\/ol><p>Cluster AutoScaler uses these tags to find the ASGs automatically.<\/p><div class=\"kg-card kg-callout-card kg-callout-card-blue\"><div class=\"kg-callout-emoji\">\ud83d\udca1<\/div><div class=\"kg-callout-text\">These tags are applied to the ASG by default when you create the node group using <a href=\"https:\/\/devopscube.com\/create-aws-eks-cluster-eksctl\/\" rel=\"noreferrer noopener\">eksctl<\/a>.<\/div><\/div><p>They might be missing if you create node groups with other tools, such as Terraform or the AWS CLI, so make sure each node group&#8217;s ASG has them.<\/p><p>To verify the tags, first list the names of all ASGs in your current AWS account and region:<\/p><pre><code>aws autoscaling 
describe-auto-scaling-groups --query \"AutoScalingGroups[*].AutoScalingGroupName\" --output table<\/code><\/pre><p>Then, check the tags assigned to a specific ASG:<\/p><pre><code>aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names &lt;asg-name&gt; --query \"AutoScalingGroups[*].Tags\" --output table<\/code><\/pre><p>Replace <code>&lt;asg-name&gt;<\/code> with the ASG you want to inspect; a node group&#8217;s ASG name contains the node group name.<\/p><p>For example, if your node group is named <code>ng-spot<\/code>, its ASG name will look like <code>eks-ng-spot-62ca5663-d8f9-a974-10c3-e0ca52223c7c<\/code>.<\/p><p>Now, follow the steps below to set up Cluster AutoScaler on the EKS cluster.<\/p><h3 id=\"step-1-create-an-iam-policy\">Step 1: Create an IAM Policy<\/h3><p>Start by creating an IAM policy that grants Cluster AutoScaler the permissions it needs to inspect and scale the node group ASGs.<\/p><p>First, run the following command to create a JSON file with the required permissions.<\/p><pre><code>cat &lt;&lt;EoF &gt; ca-policy.json\n{\n    \"Version\": \"2012-10-17\",\n    \"Statement\": [\n        {\n            \"Action\": [\n                \"autoscaling:DescribeAutoScalingGroups\",\n                \"autoscaling:DescribeAutoScalingInstances\",\n                \"autoscaling:DescribeLaunchConfigurations\",\n                \"autoscaling:DescribeScalingActivities\",\n                \"autoscaling:DescribeTags\",\n                \"autoscaling:SetDesiredCapacity\",\n                \"autoscaling:TerminateInstanceInAutoScalingGroup\",\n                \"ec2:DescribeImages\",\n                \"ec2:DescribeInstanceTypes\",\n                \"ec2:DescribeLaunchTemplateVersions\",\n                \"ec2:GetInstanceTypesFromInstanceRequirements\",\n                \"eks:DescribeNodegroup\"\n            ],\n            \"Resource\": \"*\",\n            \"Effect\": \"Allow\"\n        }\n    ]\n}\nEoF<\/code><\/pre><p>Then, create the IAM policy from the permissions listed in <code>ca-policy.json<\/code>.<\/p><pre><code>aws iam create-policy \\\n  --policy-name ca-policy \\\n  --policy-document 
file:\/\/ca-policy.json<\/code><\/pre><p>Next, save the policy&#8217;s <code>ARN<\/code> in an environment variable for use in the next step.<\/p><pre><code>export POLICY_ARN=$(aws iam list-policies --scope Local --query \"Policies[?PolicyName=='ca-policy'].Arn\" --output text)<\/code><\/pre><p>Check that the <a href=\"https:\/\/devopscube.com\/aws-arn-guide\/\" rel=\"noreferrer\">ARN<\/a> was saved:<\/p><pre><code>echo $POLICY_ARN<\/code><\/pre><p>If it prints the ARN, move on to the next step.<\/p><h3 id=\"step-2-create-an-iam-role\">Step 2: Create an IAM Role<\/h3><p>Once the policy is created, create an IAM role and attach the policy to it.<\/p><p>Start by creating a JSON file with the role&#8217;s trust policy, which allows EKS Pod Identity (<code>pods.eks.amazonaws.com<\/code>) to assume the role.<\/p><pre><code>cat &lt;&lt;EoF &gt; trust-policy.json\n{\n    \"Version\": \"2012-10-17\",\n    \"Statement\": [\n      {\n        \"Effect\": \"Allow\",\n        \"Principal\": {\n          \"Service\": \"pods.eks.amazonaws.com\"\n        },\n        \"Action\": [\n          \"sts:AssumeRole\",\n          \"sts:TagSession\"\n        ]\n      }\n    ]\n}\nEoF<\/code><\/pre><p>Then, create the IAM role using the trust policy in <code>trust-policy.json<\/code>.<\/p><pre><code>aws iam create-role \\\n    --role-name ca-role \\\n    --assume-role-policy-document file:\/\/trust-policy.json<\/code><\/pre><p>Now, attach the policy to the role.<\/p><pre><code>aws iam attach-role-policy \\\n    --role-name ca-role \\\n    --policy-arn $POLICY_ARN<\/code><\/pre><p>Once the role is created and the policy is attached, save the role&#8217;s ARN in a variable.<\/p><pre><code>export ROLE_ARN=$(aws iam get-role --role-name ca-role --query \"Role.Arn\" --output text)<\/code><\/pre><p>Check that the ARN was saved:<\/p><pre><code>echo 
$ROLE_ARN<\/code><\/pre><p>If it shows the ARN, move on to the next step.<\/p><h3 id=\"step-3-download-and-modify-cluster-autoscaler-yaml\">Step 3: Download and Modify Cluster AutoScaler YAML<\/h3><p>Now, download the Cluster AutoScaler deployment <a href=\"https:\/\/devopscube.com\/create-kubernetes-yaml\/\" rel=\"noreferrer\">YAML<\/a> and modify it.<\/p><p>Run the following command to download the YAML file.<\/p><pre><code>wget https:\/\/raw.githubusercontent.com\/kubernetes\/autoscaler\/master\/cluster-autoscaler\/cloudprovider\/aws\/examples\/cluster-autoscaler-autodiscover.yaml<\/code><\/pre><p>Modify the following in the manifest file:<\/p><ol><li>In the deployment part, change the <a href=\"https:\/\/github.com\/kubernetes\/autoscaler\/tree\/master\/cluster-autoscaler?ref=devopscube.com#releases:~:text=Vultr-,Releases,-We%20recommend%20using\" rel=\"noreferrer noopener\">container image version<\/a> to the same version as your EKS cluster version. For example, if your cluster version is 1.30.8, specify the container version as v1.30.0.<\/li><li>Specify your cluster name in the command section <code>--node-group-auto-discovery=asg:tag=k8s.io\/cluster-autoscaler\/enabled,k8s.io\/cluster-autoscaler\/<strong>&lt;YOUR CLUSTER NAME&gt;<\/strong><\/code>.<\/li><\/ol><p>The modified deployment part will look like this:<\/p><pre><code>apiVersion: apps\/v1\nkind: Deployment\nmetadata:\n  name: cluster-autoscaler\n  namespace: kube-system\n  labels:\n    app: cluster-autoscaler\nspec:\n  replicas: 1\n  selector:\n    matchLabels:\n      app: cluster-autoscaler\n  template:\n    metadata:\n      labels:\n        app: cluster-autoscaler\n      annotations:\n        prometheus.io\/scrape: 'true'\n        prometheus.io\/port: '8085'\n    spec:\n      priorityClassName: system-cluster-critical\n      securityContext:\n        runAsNonRoot: true\n        runAsUser: 65534\n        fsGroup: 65534\n        seccompProfile:\n          type: RuntimeDefault\n      
serviceAccountName: cluster-autoscaler\n      containers:\n        - image: registry.k8s.io\/autoscaling\/cluster-autoscaler:v1.30.0\n          name: cluster-autoscaler\n          resources:\n            limits:\n              cpu: 100m\n              memory: 600Mi\n            requests:\n              cpu: 100m\n              memory: 600Mi\n          command:\n            - .\/cluster-autoscaler\n            - --v=4\n            - --stderrthreshold=info\n            - --cloud-provider=aws\n            - --skip-nodes-with-local-storage=false\n            - --expander=least-waste\n            - --node-group-auto-discovery=asg:tag=k8s.io\/cluster-autoscaler\/enabled,k8s.io\/cluster-autoscaler\/eks-spot-cluster\n          volumeMounts:\n            - name: ssl-certs\n              mountPath: \/etc\/ssl\/certs\/ca-certificates.crt # \/etc\/ssl\/certs\/ca-bundle.crt for Amazon Linux Worker Nodes\n              readOnly: true\n          imagePullPolicy: \"Always\"\n          securityContext:\n            allowPrivilegeEscalation: false\n            capabilities:\n              drop:\n                - ALL\n            readOnlyRootFilesystem: true\n      volumes:\n        - name: ssl-certs\n          hostPath:\n            path: \"\/etc\/ssl\/certs\/ca-bundle.crt\"<\/code><\/pre><p>Here, the image tag matches the cluster&#8217;s Kubernetes version, and the auto-discovery flag uses the cluster name <code>eks-spot-cluster<\/code>.<\/p><p>You can also set the <code>--expander<\/code> flag to another strategy, such as <code>random<\/code>, <code>most-pods<\/code>, or <code>priority<\/code>, depending on your requirements.<\/p><blockquote>To run Cluster AutoScaler in manual mode instead, remove this flag from the manifest:<br><br><strong><code>--node-group-auto-discovery=asg:tag=k8s.io\/cluster-autoscaler\/enabled,k8s.io\/cluster-autoscaler\/eks-spot-cluster<\/code><\/strong><br><br>Then add a <code>--nodes<\/code> flag, such as <strong><code>--nodes=1:4:eks-ng-spot-16ca48b9-1524-ecf0-3c0d-572a204ffa86<\/code><\/strong>, to specify 
each node group ASG manually.<br><br>The flag format is <strong><code>--nodes=&lt;min&gt;:&lt;max&gt;:&lt;ASG name&gt;<\/code><\/strong>, where min and max are the minimum and maximum number of nodes Cluster AutoScaler may scale that ASG to. Repeat the flag for each node group.<\/blockquote><p>The following flags have default values and can also be customized:<\/p>\n<!--kg-card-begin: html-->\n<ol class=\"wp-block-list is-style-cnvs-list-styled\">\n<li><strong>--scale-down-delay-after-add<\/strong> &#8211; How long after a scale-up before scale-down evaluation resumes. The default is 10 minutes.<\/li>\n\n\n<li><strong>--scale-down-delay-after-delete<\/strong> &#8211; How long after a node is deleted before scale-down evaluation resumes, which spaces out consecutive removals. It defaults to the scan interval (10 seconds).<\/li>\n\n\n<li><strong>--scale-down-unneeded-time<\/strong> &#8211; How long a node must stay unneeded (underutilized) before it is eligible for removal. The default is 10 minutes.<\/li>\n\n\n<li><strong>--scan-interval<\/strong> &#8211; How often Cluster AutoScaler re-evaluates the cluster for scale-up and scale-down. The default is 10 seconds.<\/li>\n<\/ol>\n<!--kg-card-end: html-->\n<p>After making these changes, run the following command to deploy Cluster AutoScaler together with its supporting resources (service account, RBAC roles, and bindings).<\/p><pre><code>kubectl apply -f cluster-autoscaler-autodiscover.yaml<\/code><\/pre><p>Once the deployment is up, add the <code>safe-to-evict<\/code> annotation to its pod template. Cluster AutoScaler reads this annotation from pods, so it must go on the pod template rather than on the Deployment object itself.<\/p><pre><code>kubectl -n kube-system patch deployment cluster-autoscaler -p '{\"spec\":{\"template\":{\"metadata\":{\"annotations\":{\"cluster-autoscaler.kubernetes.io\/safe-to-evict\":\"false\"}}}}}'<\/code><\/pre><p>This annotation will 
tell Cluster AutoScaler not to evict its own pod when it scales down the node that the pod is running on.<\/p><h3 id=\"step-4-assign-the-iam-role-to-the-service-account\">Step 4: Assign the IAM Role to the Service Account<\/h3><p>Next, use <code>Pod Identity<\/code> to associate the IAM role with Cluster AutoScaler&#8217;s service account so that it has permission to scale the ASGs.<\/p><p>Before assigning the role, check whether the EKS Pod Identity Agent add-on is installed on your cluster:<\/p><pre><code>aws eks list-addons --cluster-name &lt;CLUSTER NAME&gt;<\/code><\/pre><p>Replace <code>&lt;CLUSTER NAME&gt;<\/code> with your cluster name.<\/p><p>If the add-on is installed, <code>eks-pod-identity-agent<\/code> appears in the output, as shown below.<\/p><figure class=\"kg-card kg-image-card\"><img decoding=\"async\" src=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2025\/03\/image-31-7.png\" class=\"kg-image\" alt=\"verifying if pod identity is enabled on the cluster\" loading=\"lazy\" width=\"473\" height=\"432\"><\/figure><p>If it is not listed, the Pod Identity Agent is not installed on your cluster. 
Run the following command to install it.<\/p><pre><code>aws eks create-addon --cluster-name &lt;CLUSTER NAME&gt; --addon-name eks-pod-identity-agent<\/code><\/pre><p>Once the add-on is active, run the following command to associate the IAM role with Cluster AutoScaler&#8217;s service account using Pod Identity.<\/p><pre><code>eksctl create podidentityassociation \\\n    --cluster &lt;CLUSTER NAME&gt; \\\n    --namespace kube-system \\\n    --service-account-name cluster-autoscaler \\\n    --role-arn $ROLE_ARN<\/code><\/pre><p>Here, <code>cluster-autoscaler<\/code> is the service account created by the Cluster AutoScaler manifest.<\/p><p>Then, restart the deployment so that new Cluster AutoScaler pods pick up the role&#8217;s credentials.<\/p><pre><code>kubectl rollout restart deploy cluster-autoscaler -n kube-system<\/code><\/pre><h2 id=\"testing-cluster-autoscaler\">Testing Cluster AutoScaler<\/h2><p>The Cluster AutoScaler setup is ready. Let&#8217;s check that it works.<\/p><p>Create a <code>deploy.yaml<\/code> file with the following content:<\/p><pre><code>apiVersion: apps\/v1\nkind: Deployment\nmetadata:\n  name: nginx-app\nspec:\n  replicas: 4\n  selector:\n    matchLabels:\n      app: nginx-app\n  template:\n    metadata:\n      labels:\n        app: nginx-app\n    spec:\n      containers:\n      - name: app\n        image: nginx\n        resources:\n          requests:\n            memory: \"1Gi\"\n            cpu: \"500m\"<\/code><\/pre><p>This manifest creates a deployment with <code>4<\/code> replicas, each requesting <code>1Gi of memory and 500m of CPU<\/code>.<\/p><p>My cluster currently has <code>1<\/code> node of type t3.medium, which has <code>2 vCPUs and 4 GiB of memory<\/code>. 
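<\/p><p>A node&#8217;s allocatable capacity, the CPU and memory actually available to pods, is lower than the instance size because the kubelet and system daemons reserve part of it, and system pods already running on the node use some of the rest. As a quick check, you can print each node&#8217;s allocatable values with a command like the one below; the column names <code>NAME<\/code>, <code>CPU<\/code>, and <code>MEMORY<\/code> are arbitrary labels chosen for this example.<\/p><pre><code>kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory<\/code><\/pre><p>The 4 replicas request <code>2 CPUs and 4Gi of memory<\/code> in total (4 &times; 500m and 4 &times; 1Gi), which is more than a single t3.medium can offer, so some replicas stay Pending. 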
Adjust the requests to your node type so that their total exceeds what a single node can provide.<\/p><p>Apply the manifest using the following command.<\/p><pre><code>kubectl apply -f deploy.yaml<\/code><\/pre><p>List the pods using the following command.<\/p><pre><code>kubectl get po<\/code><\/pre><p>You can see that two pods remain in the Pending state because of insufficient resources.<\/p><figure class=\"kg-card kg-image-card\"><img decoding=\"async\" src=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2025\/03\/image-37-4.png\" class=\"kg-image\" alt=\"listing the pods\" loading=\"lazy\" width=\"1704\" height=\"570\" srcset=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w600\/2025\/03\/image-37-4.png 600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w1000\/2025\/03\/image-37-4.png 1000w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w1600\/2025\/03\/image-37-4.png 1600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2025\/03\/image-37-4.png 1704w\" sizes=\"auto, (min-width: 720px) 720px\"><\/figure><p>Because the total resource requests exceed the node&#8217;s allocatable capacity, Cluster AutoScaler detects the pending pods and triggers a scale-up.<\/p><figure class=\"kg-card kg-image-card\"><img decoding=\"async\" src=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2025\/03\/image-35-7.png\" class=\"kg-image\" alt=\"listing the nodes\" loading=\"lazy\" width=\"2000\" height=\"412\" srcset=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w600\/2025\/03\/image-35-7.png 600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w1000\/2025\/03\/image-35-7.png 1000w, 
https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w1600\/2025\/03\/image-35-7.png 1600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2025\/03\/image-35-7.png 2186w\" sizes=\"auto, (min-width: 720px) 720px\"><\/figure><p>The scale-up is triggered, and a new node is created.<\/p><p>In this test, the scale-up was triggered within 10 to 30 seconds, and the new node was up and running within about a minute; actual timings depend on the instance type and launch configuration.<\/p><figure class=\"kg-card kg-image-card\"><img decoding=\"async\" src=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2025\/03\/image-36-5.png\" class=\"kg-image\" alt=\"listing the pods and nodes\" loading=\"lazy\" width=\"2000\" height=\"833\" srcset=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w600\/2025\/03\/image-36-5.png 600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w1000\/2025\/03\/image-36-5.png 1000w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w1600\/2025\/03\/image-36-5.png 1600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2025\/03\/image-36-5.png 2234w\" sizes=\"auto, (min-width: 720px) 720px\"><\/figure><p>A new node has been added to meet the resource requests, and all the pods are now running.<\/p><p>Now, delete the deployment using the following command to observe the scale-down process.<\/p><pre><code>kubectl delete -f deploy.yaml<\/code><\/pre><p>The unneeded node is terminated after about 10 minutes, the default value of <code>--scale-down-unneeded-time<\/code>.<\/p><figure class=\"kg-card kg-image-card\"><img decoding=\"async\" src=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2025\/03\/image-38-4.png\" class=\"kg-image\" alt=\"listing nodes\" 
loading=\"lazy\" width=\"2000\" height=\"350\" srcset=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w600\/2025\/03\/image-38-4.png 600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w1000\/2025\/03\/image-38-4.png 1000w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w1600\/2025\/03\/image-38-4.png 1600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w2400\/2025\/03\/image-38-4.png 2400w\" sizes=\"auto, (min-width: 720px) 720px\"><\/figure><h2 id=\"common-issues-and-troubleshooting\">Common Issues and Troubleshooting<\/h2><p>Below are some common issues with Cluster AutoScaler and how to troubleshoot them.<\/p><h3 id=\"check-logs\">Check Logs<\/h3><p>Always start troubleshooting by checking the Cluster AutoScaler logs.<\/p><p>Run the following command to get the logs.<\/p><pre><code>kubectl logs deployment\/cluster-autoscaler -n kube-system<\/code><\/pre><h3 id=\"cluster-autoscaler-does-not-detect-node-group-nodes\">Cluster AutoScaler does not detect Node Group Nodes<\/h3><p>Suppose you have multiple node groups and Cluster AutoScaler is running, but it does not detect the nodes of some node groups.<\/p><p>Possible causes include:<\/p><ol><li>Cluster AutoScaler doesn&#8217;t have the required IAM permissions.<\/li><li>The node groups&#8217; ASGs are missing the required tags, or the cluster name in the tag does not match.<\/li><li>Node groups are detected automatically only in auto-discovery mode. 
In manual mode, you must specify each node group with the <code>--nodes<\/code> flag.<\/li><\/ol><h3 id=\"pod-stuck-in-pending-state\">Pod Stuck in Pending State<\/h3><p>If a pod has been stuck in the Pending state for more than 10 minutes and no nodes are added even though Cluster AutoScaler is running, possible causes include:<\/p><ol><li>Cluster AutoScaler doesn&#8217;t have the required permissions to trigger scaling.<\/li><li>The node group has reached its maximum size.<\/li><li>The pod&#8217;s scheduling constraints, such as node selectors, affinity rules, or missing tolerations for node taints, cannot be satisfied by any node group.<\/li><\/ol><h3 id=\"nodes-not-scaling-down\">Nodes not Scaling Down<\/h3><p>If your nodes are underutilized but are not being scaled down, possible causes include:<\/p><ol><li>The node group is already at its minimum size.<\/li><li>A node runs pods that cannot be evicted, for example pods protected by a restrictive PodDisruptionBudget, pods not managed by a controller, or pods annotated with <code>cluster-autoscaler.kubernetes.io\/safe-to-evict=\"false\"<\/code>.<\/li><\/ol><h3 id=\"cluster-autoscaler-pod-gets-evicted\">Cluster AutoScaler Pod gets Evicted<\/h3><p>If Cluster AutoScaler evicts its own pod while scaling down the node that pod runs on, add the <code>cluster-autoscaler.kubernetes.io\/safe-to-evict=\"false\"<\/code> annotation to the pod template of the Cluster AutoScaler deployment.<\/p><p>Run the following command to add the annotation.<\/p><pre><code>kubectl -n kube-system patch deployment cluster-autoscaler -p '{\"spec\":{\"template\":{\"metadata\":{\"annotations\":{\"cluster-autoscaler.kubernetes.io\/safe-to-evict\":\"false\"}}}}}'<\/code><\/pre><p>Changing the pod template triggers a new rollout automatically. You can follow it with the following command.<\/p><pre><code>kubectl rollout status deploy cluster-autoscaler -n kube-system<\/code><\/pre><h2 id=\"best-practises\">Best Practices<\/h2><p>Below are some best practices for Cluster AutoScaler:<\/p><ol><li>Always set resource requests on your pods. Cluster AutoScaler bases its scaling decisions on requests, not on actual usage or limits.<\/li><li>Use taints and tolerations, together with node selectors or affinity, to run specific pods on dedicated node groups.<\/li><li>Use the scale-down flags to tune scale-down timing for your workload 
(e.g., <code>--scale-down-unneeded-time=2m<\/code>).<\/li><li>Combine HPA with Cluster AutoScaler: HPA adds pods as load grows, and Cluster AutoScaler adds nodes when those pods cannot be scheduled.<\/li><\/ol><h2 id=\"conclusion\">Conclusion<\/h2><p>In this guide, you learned what Kubernetes Cluster AutoScaler does, how it works, and how to set it up on an Amazon EKS cluster.<\/p><p>You also tested the setup and explored customization options, best practices, and troubleshooting for common issues.<\/p><p>For more advanced scaling strategies, especially for workloads that need different instance types and smarter scaling decisions, consider exploring Karpenter on EKS.<\/p>\n<hr><p><strong>Source:<\/strong> <a href=\"https:\/\/devopscube.com\/cluster-autoscaler\/\" target=\"_blank\" rel=\"noopener noreferrer\">Cluster AutoScaler Setup on AWS EKS: A Comprehensive Guide \u2014 DevOpsCube<\/a><\/p>","protected":false},"excerpt":{"rendered":"<p>Source: https:\/\/devopscube.com\/cluster-autoscaler\/<\/p>\n","protected":false},"author":1,"featured_media":539,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-538","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-devops"],"_links":{"self":[{"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=\/wp\/v2\/posts\/538","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=538"}],"version-history":[{"count":0,"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=\/wp\/v2\/posts\/538\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.ng
ocha.biz\/index.php?rest_route=\/wp\/v2\/media\/539"}],"wp:attachment":[{"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=538"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=538"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=538"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}