K8S Cluster autoscaler crashlooping on EKS

2023-02-23 1 minute read #k8s #infrastructure

I just fixed a badly crashlooping cluster autoscaler on EKS. Two things I found noteworthy:

On weird error messages (failed to list *v1.CSIStorageCapacity), it might be a version mismatch: the autoscaler's minor version must match the k8s version of the cluster (e.g. k8s v1.21 requires cluster autoscaler 1.21.x).
On no error messages at all (see below for an example), it might be a permissions issue with the AWS APIs.

Fixing the image version

I’m using the autoscaler helm chart, so this just needs one value override:

image:
  # IMPORTANT: this value _MUST_ match the k8s version: x.y.(patch)
  tag: v1.21.1

Note: You do not have to use the repo mentioned in the original comment.
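
To double-check the match before rolling out, a small script can compare the two versions. This is only a minimal sketch and not part of the chart: the function names are made up for illustration, and it assumes you pass in the server's gitVersion string (for example as reported by `kubectl version`) and the image tag from your values file. The version strings in the usage lines are examples of the format only.

```python
import re
from typing import Tuple


def minor_version(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings like 'v1.21.1' or 'v1.21.14-eks-abc'."""
    match = re.match(r"v?(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def autoscaler_matches_cluster(cluster_version: str, autoscaler_tag: str) -> bool:
    """True if both versions share the same major.minor; the patch level is ignored."""
    return minor_version(cluster_version) == minor_version(autoscaler_tag)


if __name__ == "__main__":
    # Hypothetical inputs: an EKS-style gitVersion and the chart's image tag.
    print(autoscaler_matches_cluster("v1.21.14-eks-ba74326", "v1.21.1"))
    print(autoscaler_matches_cluster("v1.22.3", "v1.21.1"))
```

Only major and minor are compared, because the rule above is about 1.21.x matching k8s v1.21, not about the patch level.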

Fixing the permissions issue

In my case, I referenced the wrong ServiceAccount in my AWS IRSA role trust policy. So for “future me”, check this:

Does the autoscaler pod have IRSA credentials assigned from its ServiceAccount?
- to add the reference using the helm chart, see below
Does the ServiceAccount reference the correct AWS IAM role? (typos, wrong role, …)
Does the trust policy of that role reference the designated ServiceAccount correctly? (wrong namespace, typos, …)

Also, I explicitly name the ServiceAccount created by the helm chart, so I don’t have to guess what I need to reference from AWS:

rbac:
  serviceAccount:
    annotations:
      # this adds the IRSA reference to the ServiceAccount created
      "eks.amazonaws.com/role-arn": "arn:aws:iam::123456789012:role/whatever-your-role-name-is"

    # this explicitly controls the name of the ServiceAccount, so we can reference it safely from AWS
    name: cluster-autoscaler
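
The last check in the list above can also be scripted. What follows is a minimal sketch, assuming the trust policy has the usual IRSA shape: a StringEquals condition on a key ending in `:sub` with the value `system:serviceaccount:<namespace>:<name>`. The function name is made up, the OIDC provider ID and region in the example policy are placeholders, and StringLike or wildcard conditions are deliberately not handled.

```python
import json

# Hypothetical trust policy; provider ID, region and account ID are placeholders.
EXAMPLE_POLICY = """
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.eu-west-1.amazonaws.com/id/EXAMPLE"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.eu-west-1.amazonaws.com/id/EXAMPLE:sub": "system:serviceaccount:kube-system:cluster-autoscaler"
        }
      }
    }
  ]
}
"""


def trust_policy_allows(policy_json: str, namespace: str, service_account: str) -> bool:
    """Check whether any Allow statement names exactly this ServiceAccount as subject."""
    expected = f"system:serviceaccount:{namespace}:{service_account}"
    statements = json.loads(policy_json).get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        string_equals = statement.get("Condition", {}).get("StringEquals", {})
        for key, value in string_equals.items():
            if not key.endswith(":sub"):
                continue
            values = value if isinstance(value, list) else [value]
            if expected in values:
                return True
    return False


if __name__ == "__main__":
    print(trust_policy_allows(EXAMPLE_POLICY, "kube-system", "cluster-autoscaler"))
    print(trust_policy_allows(EXAMPLE_POLICY, "default", "cluster-autoscaler"))
```

The namespace matters as much as the name: the same ServiceAccount name in a different namespace is a different subject, which is exactly the kind of mismatch that bit me.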

Error examples

Example of “no error” crashloop

    /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:71 +0xbe

goroutine 297 [sync.Cond.Wait]:
runtime.goparkunlock(...)
    /usr/local/go/src/runtime/proc.go:310
    [...]
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch.NewStreamWatcher
    /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:71 +0xbe

goroutine 299 [sync.Cond.Wait]:
runtime.goparkunlock(...)
    /usr/local/go/src/runtime/proc.go:310
    [...]
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch.NewStreamWatcher
    /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:71 +0xbe

goroutine 366 [IO wait]:
internal/poll.runtime_pollWait(0x7f2e1059ebb0, 0x72, 0xffffffffffffffff)
    /usr/local/go/src/runtime/netpoll.go:203 +0x55
    [...]
created by net/http.(*Transport).dialConn
    /usr/local/go/src/net/http/transport.go:1706 +0xc56

goroutine 367 [select]:
net/http.(*persistConn).writeLoop(0xc0017e8480)
    /usr/local/go/src/net/http/transport.go:2336 +0x11c
created by net/http.(*Transport).dialConn
    /usr/local/go/src/net/http/transport.go:1707 +0xc7b
Stream closed EOF for kube-system/cluster-autoscaler-aws-cluster-autoscaler-69f85b8f6d-hbpvv (aws-cluster-autoscaler)

Example of “error” crashloop

Failed to watch *v1.CSIStorageCapacity: failed to list *v1.CSIStorageCapacity: the server could not find the requested resource
