GitLab runners, EKS, IRSA and S3 caching

The problem

You do not want to set AWS_* credentials all over your GitLab pipelines.
You have distributed GitLab runners hosted on Kubernetes, and you want to use S3 as a build cache.
You want to access AWS resources from your build chain.

The solution

There’s an official AWS solution: “IRSA”, which stands for “IAM roles for service accounts”. (There is at least one other, non-AWS solution, kube2iam, but that is not the subject of this post.)

IRSA assigns an AWS IAM role to a K8S serviceAccount, which in turn can be specified for a running pod. There are a couple of pages documenting this, but they all cover either too much or too little. So here I try to write what I wanted to have.

Also, I could not find anyone documenting how to use this for configuring S3 caching on a K8S GitLab runner. That is not easy, because the documentation could honestly be better.

Advantages of this solution

This works with k8s runners just as well as with fargate runners (the way I do them).
You are independent of AWS_* credentials in your pipeline.
You can simply change the IAM role in your pipeline job to get different permissions; you don’t need to create a new user, create new credentials, set them as CI variables, etc.

What you need

You need (out of scope):

An AWS EKS cluster
An OIDC identity provider for your cluster (create it using the console / eksctl, or using terraform)

For every role you want to use in the cluster, you need …

… an IAM role referencing the k8s namespace & service account name
… a K8S service account referencing the role ARN
This is a reciprocal reference: each side references the other, so both have to stay in sync!

That’s what we will do in here.

The IAM role

You basically configure the role like any other one, but with a special trust policy. That policy references the OIDC provider, the k8s namespace and the service account in that namespace.

# terraform code - die hard terraform addict here, sorry.

locals {
  oidc_arn = "arn:aws:iam::123456789012:oidc-provider/oidc.eks.eu-central-1.amazonaws.com/id/deadbeefaffe1234deadbeefaffe1234"
  oidc_url = "https://oidc.eks.eu-central-1.amazonaws.com/id/deadbeefaffe1234deadbeefaffe1234"
  k8s_namespace = "gitlab"
  k8s_service_account_name = "gitlab-runner"
}

# eks / pod / role stuff: https://is.gd/2skBH7
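resource "aws_iam_role" "k8s_irsa_example_role" {
  name                = "k8s-irsa-example-role"
  assume_role_policy  = data.aws_iam_policy_document.trust_k8s_irsa_example_role.json
  managed_policy_arns = [aws_iam_policy.perms_k8s_irsa_example_role.arn]
}

# the managed policy that carries the PERMISSIONS policy document below
# (the policy name is just an example)
resource "aws_iam_policy" "perms_k8s_irsa_example_role" {
  name   = "k8s-irsa-example-role-perms"
  policy = data.aws_iam_policy_document.perms_k8s_irsa_example_role.json
}

# this is the PERMISSIONS policy
data "aws_iam_policy_document" "perms_k8s_irsa_example_role" {
  statement {
    actions = ["s3:*"]
    resources = [
      "arn:aws:s3:::my-super-duper-gitlab-runner-cache",
      "arn:aws:s3:::my-super-duper-gitlab-runner-cache/*",
    ]
  }
}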

# this is the TRUST policy
# the TRUST policy references k8s namespace and service account
data "aws_iam_policy_document" "trust_k8s_irsa_example_role" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]
    principals {
      type        = "Federated"
      identifiers = [local.oidc_arn]
    }
    condition {
      test     = "StringEquals"
      variable = "${replace(local.oidc_url, "https://", "")}:sub"
      values   = ["system:serviceaccount:${local.k8s_namespace}:${local.k8s_service_account_name}"]
    }
  }
}

For completeness the trust policy as JSON:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.eu-central-1.amazonaws.com/id/deadbeefaffe1234deadbeefaffe1234"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.eu-central-1.amazonaws.com/id/deadbeefaffe1234deadbeefaffe1234:sub": "system:serviceaccount:gitlab:gitlab-runner"
        }
      }
    }
  ]
}
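
If you are not a terraform addict, the same trust policy can be generated with a few lines of code. The following is only a minimal sketch in Python, assuming the example values from the locals above; the function name build_trust_policy is made up for this post and is not part of any AWS tooling.

```python
import json


def build_trust_policy(oidc_arn, oidc_url, namespace, service_account):
    """Build an IRSA trust policy for exactly one k8s serviceAccount."""
    # The condition key is the OIDC issuer WITHOUT the scheme, plus ":sub"
    # (same idea as the replace() call in the terraform code).
    issuer = oidc_url.replace("https://", "", 1)
    subject = f"system:serviceaccount:{namespace}:{service_account}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Federated": oidc_arn},
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {"StringEquals": {f"{issuer}:sub": subject}},
            }
        ],
    }


if __name__ == "__main__":
    # Example values from this post; replace them with your own.
    issuer_path = "oidc.eks.eu-central-1.amazonaws.com/id/deadbeefaffe1234deadbeefaffe1234"
    policy = build_trust_policy(
        oidc_arn=f"arn:aws:iam::123456789012:oidc-provider/{issuer_path}",
        oidc_url=f"https://{issuer_path}",
        namespace="gitlab",
        service_account="gitlab-runner",
    )
    print(json.dumps(policy, indent=2))
```

Run as a script, this prints a policy document with the same content as the JSON above, which you could paste into the console or feed to whatever tooling you use.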

This is, in fact, all you need from the AWS side of things.

Some notes:

Be absolutely aware that this trust policy references the service account’s name and namespace in k8s. In turn, the service account will reference the ARN of this role. So if you change a name on either side, you always have to adjust the other side!
For GitLab, you might want to have two of those roles (you’ll see why later).

General serviceAccount info

This is what any serviceAccount connected to an IAM role looks like:

apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    # this is the magic
    # it references the role ARN, which contains the role's name
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/k8s-irsa-example-role

  # the name referenced in the AWS IAM trust policy
  # ("system:serviceaccount:<namespace>:<SERVICEACCOUNTNAME>")
  name: gitlab-runner

  # ("system:serviceaccount:<NAMESPACE>:<serviceaccountname>")
  # if you change one, you change both - they reference each other BY NAME
  namespace: gitlab
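
Because the two sides reference each other by name, it can help to check both directions mechanically. Here is a rough sketch in Python, assuming you have already loaded the trust policy and the serviceAccount manifest into plain dicts (e.g. from JSON). The function name check_irsa_references is hypothetical, and the sketch only looks at StringEquals conditions.

```python
def check_irsa_references(role_arn, trust_policy, service_account):
    """Return a list of problems; an empty list means both references match."""
    problems = []
    metadata = service_account.get("metadata", {})
    annotations = metadata.get("annotations") or {}

    # Direction 1: the serviceAccount annotation must carry the role's ARN.
    annotated_arn = annotations.get("eks.amazonaws.com/role-arn")
    if annotated_arn != role_arn:
        problems.append(f"annotation is {annotated_arn!r}, expected {role_arn!r}")

    # Direction 2: the trust policy must name this namespace and serviceAccount.
    namespace = metadata.get("namespace")
    name = metadata.get("name")
    expected_sub = f"system:serviceaccount:{namespace}:{name}"
    allowed_subs = []
    for statement in trust_policy.get("Statement", []):
        # Only StringEquals conditions are inspected in this sketch.
        string_equals = statement.get("Condition", {}).get("StringEquals", {})
        for key, value in string_equals.items():
            if key.endswith(":sub"):
                allowed_subs.extend(value if isinstance(value, list) else [value])
    if expected_sub not in allowed_subs:
        problems.append(f"trust policy allows {allowed_subs}, expected {expected_sub!r}")
    return problems


if __name__ == "__main__":
    # Example values from this post.
    role_arn = "arn:aws:iam::123456789012:role/k8s-irsa-example-role"
    issuer = "oidc.eks.eu-central-1.amazonaws.com/id/deadbeefaffe1234deadbeefaffe1234"
    trust_policy = {
        "Statement": [
            {
                "Condition": {
                    "StringEquals": {
                        f"{issuer}:sub": "system:serviceaccount:gitlab:gitlab-runner"
                    }
                }
            }
        ]
    }
    service_account = {
        "metadata": {
            "name": "gitlab-runner",
            "namespace": "gitlab",
            "annotations": {"eks.amazonaws.com/role-arn": role_arn},
        }
    }
    print(check_irsa_references(role_arn, trust_policy, service_account) or "references match")
```

Rename the serviceAccount or the role on one side only, and the function reports which direction no longer matches.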

If you want to test this right now, feel free:

# $ kubectl apply -n gitlab -f THIS_FILE.yml
# $ kubectl exec -ti -n gitlab quick-debug-pod -- /bin/bash
# and then install awscli & ca-certificates to try accessing your s3 bucket :)
apiVersion: v1
kind: Pod
metadata:
  name: quick-debug-pod
  namespace: gitlab
spec:
  serviceAccountName: gitlab-runner
  containers:
    - name: shell
      image: "ubuntu:latest"
      args: [sleep, infinity]
      resources:
        limits:
          memory: "128Mi"
          cpu: "500m"
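
Before installing awscli, you can check whether the role association reached the pod at all. On EKS, pods using an annotated serviceAccount get the environment variables AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE injected. The following is a small sketch in Python, assuming python3 is available in the pod (the plain ubuntu image does not ship it; `env | grep AWS_` tells you the same). The helper name missing_irsa_vars is made up for this example.

```python
import os

# Variables that EKS adds to pods whose serviceAccount carries the
# eks.amazonaws.com/role-arn annotation.
IRSA_VARS = ("AWS_ROLE_ARN", "AWS_WEB_IDENTITY_TOKEN_FILE")


def missing_irsa_vars(env):
    """Return the IRSA variables that are absent or empty in the given environment."""
    return [name for name in IRSA_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_irsa_vars(os.environ)
    if missing:
        print("IRSA not active, missing:", ", ".join(missing))
    else:
        print("IRSA active, role:", os.environ["AWS_ROLE_ARN"])
```

If the variables are missing, check the annotation on the serviceAccount and recreate the pod; the variables are only added when a pod is created.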

General info about GitLab configuration

REMARK: If you only want to enable your jobs to access AWS resources, you’re in luck. You can skip straight to “Addendum - enabling gitlab JOBS” and you’re done.

If you want to use the S3 cache, well, … more work.

Let’s start like this: WHAT YOU MUST KNOW, SUPER IMPORTANT:

The main gitlab runner instance will create a pre-signed S3 URL for the gitlab job, and pass it to the k8s pod running the actual job, so the pod running the gitlab job will not need those cache access rights.
That means the S3 permissions from above must be attached to the runner itself, and not to the pods created by the runner that run the actual pipelines.

Also, going forward I assume you deploy the gitlab runner using the GitLab runner helm chart.

Annotate runner pod, short variant

If you …

have only one GitLab runner,
and you know the serviceAccount name created by the helm chart,

… you’re almost done.

You just need to replace the serviceAccount name (gitlab-runner) in the IAM trust policy above with the actual serviceAccount name on your system, and change one thing in the helm chart:

## https://is.gd/Qu1gGv
rbac:
  create: true
  serviceAccountAnnotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/k8s-irsa-example-role

See “Configure GitLab S3 cache” below for how to configure the cache; this is just the role association.

Annotate runner pod, LONG variant

In my case it was not so easy:

First, I have several runners deployed, and each of them uses a different serviceAccount, so a single role is not going to cut it.
Second, I like to have more control over naming, I don’t want to “guess” which name the helm chart gives my serviceAccount, and I don’t want to be vulnerable to name changes.

So I had to do a bit more work.

To work in K8S, the GitLab runner needs a …

serviceAccount,
role,
roleBinding.

We will create those three resources manually now, so we have control over naming and they can be re-used by several gitlab runners (e.g. fargate & cluster).

In K8S objects, that looks like this:

# the role
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gitlab-runner-common
  namespace: gitlab
rules:
  - apiGroups:
      - ""
    resources:
      - "*"
    verbs:
      - "*"
---
# the service account
apiVersion: v1
automountServiceAccountToken: true
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/k8s-irsa-gitlab-runner-common
  name: gitlab-runner-common
  namespace: gitlab
---
# the role binding
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: gitlab-runner-common
  namespace: gitlab
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: gitlab-runner-common
subjects:
  - kind: ServiceAccount
    name: gitlab-runner-common
    namespace: gitlab

(Btw, I’m using the serviceaccount, role & rolebinding charts here to simplify things - I only want to deal with helm and not with “plain” k8s objects.)

Adjust your helm chart

Let’s recap: We now have …

An AWS IAM role
- referencing a namespace & serviceAccount in k8s by name
- with permissions to access S3
A k8s role bound to a serviceAccount in a certain namespace via a rolebinding

To make the GitLab runner use them, we need to …

disable RBAC creation and reference “our” serviceAccount, and
configure the cache.

Here are the relevant values.yaml file parts:

## https://is.gd/Qu1gGv
rbac:
  # this is basically documented NOWHERE, except here:
  #   https://gitlab.com/gitlab-org/gitlab-runner/-/issues/25972#note_477243643
  create: false
  serviceAccountName: gitlab-runner-common

Configure GitLab S3 cache

Finally, still in the helm chart’s values file, we do this:

runners:
  name: "whatever-runner"

  config: |
    [[runners]]
      # ...
      [runners.cache]
        Type = "s3"
        Path = "runners-all"
        Shared = true
        [runners.cache.s3]
          ServerAddress = "s3.amazonaws.com"
          BucketName = "my-super-duper-gitlab-runner-cache"
          BucketLocation = "eu-central-1"
          Insecure = false
      # ...

Yup, that’s actually it.

We should be done.

Enjoy :)

Addendum - enabling gitlab JOBS

Now, remember that nothing from above will give the actual pipeline jobs any additional AWS permissions.

If you want to access AWS resources out of your jobs, just do this:

Create another role, with another association to a serviceAccount,
and configure this serviceAccount to be used by the job pods.

How? Simple. Modify the values file:

runners:
  config: |
    [[runners]]
      # ...
      [runners.kubernetes]
        # ...
        service_account = "gitlab-jobs"

Bonus - a test pipeline

# .gitlab-ci.yml
stages:
  - test

variables:
  CACHE_DIR: hey-ho-cache

.all: &all
  before_script:
    - mkdir -p "$CACHE_DIR"
    - NOWDATE=$(date +%Y%m%d_%H%M%S)
    - CACHEFILE="${CACHE_DIR}/heyho_${NOWDATE}"
    - set -x

.cache: &cache
  paths: [hey-ho-cache]

runner-test-create-cache:
  <<: *all
  image: alpine:edge
  stage: test
  script:
    - echo "waahwaahboogah $NOWDATE" > $CACHEFILE
  cache:
    <<: *cache
  tags:
    - fargate-small

runner-test-check-cache:
  <<: *all
  image: alpine:edge
  stage: test
  script:
    - cat $CACHE_DIR/*
  cache:
    <<: *cache
  needs:
    - runner-test-create-cache
  tags:
    - cluster
    - kubernetes

Troubleshooting

It doesn’t work

It does.

If it does not, you have a naming error in your references, or you’re using a wrong serviceAccount for your pods, or you assigned the permission to the wrong entity (jobs instead of runner, or vice versa).

First, check your annotations and that they reference existing things.
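
One way to see which identity a pod really presents to AWS is to decode the web identity token that is mounted into it. Its sub claim must match the value in the trust policy character by character. Below is a minimal sketch in Python, assuming it runs inside a pod where AWS_WEB_IDENTITY_TOKEN_FILE is set; the function name token_claims is made up for this example, and the token signature is not verified.

```python
import base64
import json
import os


def token_claims(jwt):
    """Decode the (unverified) payload of a JWT such as the serviceAccount token."""
    payload = jwt.split(".")[1]
    # base64url payloads come without padding; add it back before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


if __name__ == "__main__":
    token_file = os.environ.get("AWS_WEB_IDENTITY_TOKEN_FILE")
    if token_file:
        with open(token_file) as handle:
            claims = token_claims(handle.read().strip())
        # Compare these two with the trust policy of your IAM role.
        print("sub:", claims.get("sub"))
        print("iss:", claims.get("iss"))
    else:
        print("AWS_WEB_IDENTITY_TOKEN_FILE is not set; is the serviceAccount annotated?")
```

If sub shows a different namespace or serviceAccount name than the one in your trust policy, you have found your naming error.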

It still doesn’t work

It does. Your references are wrong.

I checked.

Check again.

Really.

Believe me.

But …

NO. REALLY.

And if not, I can’t help you, because this works.

(If I have an error here, the principle should be clear - nevertheless I would appreciate a hint if you find one)

