Fly, Penguin!

I blog so I don't forget.

K8S Cluster autoscaler crashlooping on EKS

I just fixed a badly crashlooping cluster autoscaler on EKS. Two things I found noteworthy:

  • On weird error messages (failed to list *v1.CSIStorageCapacity), it might be a version mismatch: the autoscaler's minor version must match the k8s minor version of the cluster (e.g. k8s v1.21 requires cluster autoscaler 1.21.x).
  • On no error messages at all (see below for an example), it might be a permissions issue with the AWS APIs.

Fixing the image version

I’m using the autoscaler Helm chart, so this only needs the image tag set in the chart values:

  # IMPORTANT: this value _MUST_ match the k8s version: x.y.(patch)
  tag: v1.21.1

Note: You do not have to use the repo mentioned in the original comment.
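
To make the "versions must match" rule less of an eyeball exercise, here is a minimal Python sketch that compares the major.minor part of two version strings. The function names and the sample cluster version string (v1.21.5-eks-bc4871b) are made up for illustration; feed it whatever your cluster and your chart values actually report.

```python
import re


def minor_version(version: str) -> tuple:
    """Extract (major, minor) from strings like 'v1.21.5-eks-bc4871b' or '1.21'."""
    match = re.match(r"v?(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def autoscaler_matches_cluster(cluster_version: str, autoscaler_tag: str) -> bool:
    """True if the autoscaler image tag has the same major.minor as the cluster."""
    return minor_version(cluster_version) == minor_version(autoscaler_tag)


# Hypothetical cluster version, compared against the tag from the values above.
print(autoscaler_matches_cluster("v1.21.5-eks-bc4871b", "v1.21.1"))
# The same check with a cluster that has already moved to another minor version.
print(autoscaler_matches_cluster("v1.22.0", "v1.21.1"))
```

The patch level is deliberately ignored, since only the k8s minor version has to line up with the autoscaler release line.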

Fixing the permissions issue

In my case, I referenced the wrong ServiceAccount in my AWS IRSA role trust policy. So for “future me”, check this:

  • does the autoscaler pod have IRSA credentials assigned from its ServiceAccount?
    • to add the reference using the helm chart, see below
  • does the ServiceAccount reference the correct AWS IRSA role? (typos, wrong role, …)
  • does the trust policy of the AWS IRSA role reference the designated ServiceAccount correctly? (wrong namespace, typos, …)
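
The last check is the one that bit me, so here is a rough Python sketch for it. It assumes the usual IRSA trust policy shape, where a StringEquals condition on a key ending in ":sub" holds system:serviceaccount:<namespace>:<name>. The function name, the OIDC provider ID and the region in the sample policy are placeholders, and StringLike conditions with wildcards are not handled.

```python
def trust_policy_allows(policy: dict, namespace: str, name: str) -> bool:
    """True if an Allow statement's ':sub' condition names exactly this ServiceAccount."""
    expected = f"system:serviceaccount:{namespace}:{name}"
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM allows a single statement object
        statements = [statements]
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        string_equals = statement.get("Condition", {}).get("StringEquals", {})
        for key, value in string_equals.items():
            if not key.endswith(":sub"):
                continue
            values = [value] if isinstance(value, str) else value
            if expected in values:
                return True
    return False


# Sample trust policy; the provider URL and ID are placeholders.
provider = "oidc.eks.eu-west-1.amazonaws.com/id/EXAMPLE"
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::123456789012:oidc-provider/{provider}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    f"{provider}:sub": "system:serviceaccount:kube-system:cluster-autoscaler"
                }
            },
        }
    ],
}

print(trust_policy_allows(policy, "kube-system", "cluster-autoscaler"))
print(trust_policy_allows(policy, "default", "cluster-autoscaler"))
```

The policy document itself can be loaded from wherever you keep it (Terraform output, the AWS console, a JSON file); the sketch only looks at the parsed dictionary.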

Also, I explicitly name the ServiceAccount created by the Helm chart, so I don’t have to guess what I need to reference from AWS:

      # this adds the IRSA reference to the ServiceAccount created
      "": "arn:aws:iam::123456789012:role/whatever-your-role-name-is"

    # this explicitly controls the name of the ServiceAccount, so we can reference it safely from AWS
    name: cluster-autoscaler
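
To verify the result from the cluster side, a small Python sketch like the following can inspect a ServiceAccount manifest (for example the parsed output of kubectl get serviceaccount cluster-autoscaler -n kube-system -o json). On EKS the IRSA annotation key is eks.amazonaws.com/role-arn; the function name and the ARN pattern are my own assumptions, and the kube-system namespace is simply the one from my setup.

```python
import re

# Annotation key that EKS uses to bind an IAM role to a ServiceAccount.
IRSA_ANNOTATION = "eks.amazonaws.com/role-arn"
# Loose shape check for an IAM role ARN (an assumption, not an official grammar).
ROLE_ARN_PATTERN = re.compile(r"^arn:aws[a-z-]*:iam::\d{12}:role/.+$")


def irsa_role_arn(service_account: dict):
    """Return the annotated role ARN of a ServiceAccount manifest, or None."""
    annotations = service_account.get("metadata", {}).get("annotations") or {}
    arn = annotations.get(IRSA_ANNOTATION)
    if arn is None or not ROLE_ARN_PATTERN.match(arn):
        return None
    return arn


# Manifest as it might look with the values above applied.
service_account = {
    "apiVersion": "v1",
    "kind": "ServiceAccount",
    "metadata": {
        "name": "cluster-autoscaler",
        "namespace": "kube-system",
        "annotations": {
            IRSA_ANNOTATION: "arn:aws:iam::123456789012:role/whatever-your-role-name-is"
        },
    },
}

print(irsa_role_arn(service_account))
```

A None result means the annotation is missing or does not look like a role ARN, which is the first thing to rule out before digging into the trust policy.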

Error examples

Example of “no error” crashloop

    /gopath/src/ +0xbe

goroutine 297 [sync.Cond.Wait]:
created by
    /gopath/src/ +0xbe

goroutine 299 [sync.Cond.Wait]:
created by
    /gopath/src/ +0xbe

goroutine 366 [IO wait]:
internal/poll.runtime_pollWait(0x7f2e1059ebb0, 0x72, 0xffffffffffffffff)
    /usr/local/go/src/runtime/netpoll.go:203 +0x55
created by net/http.(*Transport).dialConn
    /usr/local/go/src/net/http/transport.go:1706 +0xc56

goroutine 367 [select]:
    /usr/local/go/src/net/http/transport.go:2336 +0x11c
created by net/http.(*Transport).dialConn
    /usr/local/go/src/net/http/transport.go:1707 +0xc7b
Stream closed EOF for kube-system/cluster-autoscaler-aws-cluster-autoscaler-69f85b8f6d-hbpvv (aws-cluster-autoscaler)

Example of “error” crashloop

Failed to watch *v1.CSIStorageCapacity: failed to list *v1.CSIStorageCapacity: the server could not find the requested resource