
K8S Cluster autoscaler crashlooping on EKS

Posted on February 23, 2023  (Last modified on July 2, 2024) • 2 min read • 326 words
K8s   Infrastructure  

On this page
  • Fixing the image version
  • Fixing the permissions issue
  • Error examples
    • Example of “no error” crashloop
    • Example of “error” crashloop

I just fixed a badly crashlooping cluster autoscaler on EKS. Two things I found noteworthy:

  • On weird error messages (failed to list *v1.CSIStorageCapacity), it might be a version mismatch: the autoscaler's minor version must match the k8s version of the cluster (e.g. k8s v1.21 requires cluster autoscaler 1.21.x).
  • On no error messages at all (see below for an example), it might be a permissions issue with the AWS APIs.

Fixing the image version  

I’m using the autoscaler helm chart, so this only needs the image tag set in the chart values:

image:
  # IMPORTANT: this value _MUST_ match the k8s version: x.y.(patch)
  tag: v1.21.1

Note: You do not have to use the repo mentioned in the original comment.
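The version rule is easy to check mechanically before deploying. Here is a minimal Python sketch, assuming the cluster version is available as a string such as the server `gitVersion` reported by `kubectl version`. The function names are made up for illustration and are not part of the helm chart or the autoscaler.

```python
import re
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings like 'v1.21.1' or 'v1.21.14-eks-fb459a0'."""
    match = re.match(r"v?(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def autoscaler_matches_cluster(image_tag: str, cluster_version: str) -> bool:
    """True if the autoscaler image tag has the same major.minor as the cluster.

    The patch level is ignored on purpose: k8s v1.21 works with any
    cluster autoscaler 1.21.x.
    """
    return major_minor(image_tag) == major_minor(cluster_version)


if __name__ == "__main__":
    # Hypothetical example values, replace with your own.
    print(autoscaler_matches_cluster("v1.21.1", "v1.21.14-eks-fb459a0"))
    print(autoscaler_matches_cluster("v1.23.0", "v1.21.14-eks-fb459a0"))
```

A check like this could run in CI against the helm values file, so a cluster upgrade that forgets the autoscaler tag fails early instead of crashlooping.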

Fixing the permissions issue  

In my case, I referenced the wrong ServiceAccount in my AWS IRSA role trust policy. So for “future me”, check this:

  • does the autoscaler pod have IRSA credentials assigned from its ServiceAccount?
    • to add the reference using the helm chart, see below
  • does the ServiceAccount reference the correct AWS IAM role? (typos, wrong role, …)
  • does the trust policy of that IAM role reference the designated ServiceAccount correctly? (wrong namespace, typos, …)

Also, I explicitly name the ServiceAccount created by the helm chart, so I don’t have to guess what I need to reference from AWS:

rbac:
  serviceAccount:
    annotations:
      # this adds the IRSA reference to the ServiceAccount created
      "eks.amazonaws.com/role-arn": "arn:aws:iam::123456789012:role/whatever-your-role-name-is"

    # this explicitly controls the name of the ServiceAccount, so we can reference it safely from AWS
    name: cluster-autoscaler
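With the ServiceAccount name pinned, the AWS side can be checked against it. An IRSA trust policy normally carries a condition whose key ends in `:sub` and whose value is `system:serviceaccount:<namespace>:<name>`. The Python sketch below tests a trust policy document for that subject. It is only an illustration under assumptions: the OIDC provider ID in the example is a placeholder, the namespace `kube-system` is taken from the pod name in the log output further down, and only exact `StringEquals` matches are handled (wildcard `StringLike` conditions are not).

```python
def expected_subject(namespace: str, name: str) -> str:
    """Subject string that IRSA presents for a given ServiceAccount."""
    return f"system:serviceaccount:{namespace}:{name}"


def trust_policy_allows(policy: dict, namespace: str, name: str) -> bool:
    """True if an Allow statement has a StringEquals ':sub' condition for the ServiceAccount."""
    wanted = expected_subject(namespace, name)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be given as an object
        statements = [statements]
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        equals = statement.get("Condition", {}).get("StringEquals", {})
        for key, value in equals.items():
            if not key.endswith(":sub"):
                continue
            values = value if isinstance(value, list) else [value]
            if wanted in values:
                return True
    return False


# Hypothetical trust policy; the OIDC provider ID is a placeholder.
EXAMPLE_OIDC = "oidc.eks.eu-central-1.amazonaws.com/id/EXAMPLE"
EXAMPLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::123456789012:oidc-provider/{EXAMPLE_OIDC}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    f"{EXAMPLE_OIDC}:sub": "system:serviceaccount:kube-system:cluster-autoscaler"
                }
            },
        }
    ],
}

if __name__ == "__main__":
    print(trust_policy_allows(EXAMPLE_POLICY, "kube-system", "cluster-autoscaler"))
```

A wrong namespace or a typo in the ServiceAccount name makes the function return False, which is exactly the mistake described above.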

Error examples  

Example of “no error” crashloop  

    /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:71 +0xbe

goroutine 297 [sync.Cond.Wait]:
runtime.goparkunlock(...)
    /usr/local/go/src/runtime/proc.go:310
    [...]
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch.NewStreamWatcher
    /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:71 +0xbe

goroutine 299 [sync.Cond.Wait]:
runtime.goparkunlock(...)
    /usr/local/go/src/runtime/proc.go:310
    [...]
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch.NewStreamWatcher
    /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:71 +0xbe

goroutine 366 [IO wait]:
internal/poll.runtime_pollWait(0x7f2e1059ebb0, 0x72, 0xffffffffffffffff)
    /usr/local/go/src/runtime/netpoll.go:203 +0x55
    [...]
created by net/http.(*Transport).dialConn
    /usr/local/go/src/net/http/transport.go:1706 +0xc56

goroutine 367 [select]:
net/http.(*persistConn).writeLoop(0xc0017e8480)
    /usr/local/go/src/net/http/transport.go:2336 +0x11c
created by net/http.(*Transport).dialConn
    /usr/local/go/src/net/http/transport.go:1707 +0xc7b
Stream closed EOF for kube-system/cluster-autoscaler-aws-cluster-autoscaler-69f85b8f6d-hbpvv (aws-cluster-autoscaler)

Example of “error” crashloop  

Failed to watch *v1.CSIStorageCapacity: failed to list *v1.CSIStorageCapacity: the server could not find the requested resource
(c) Axel Bock | Powered by Hinode.