Every time we restarted Atlantis, the tool we use to plan and apply Terraform changes, we'd be stuck waiting 30 minutes for it to come back up. No plans, no applies, no infrastructure changes for any repository managed by Atlantis. With roughly 100 restarts a month for credential rotations and onboarding, that added up to over 50 hours of blocked engineering time each month, and paged the on-call engineer every time.
This was ultimately caused by a safe default in Kubernetes that had silently become a bottleneck as the persistent volume used by Atlantis grew to millions of files. Here's how we tracked it down and fixed it with a one-line change.
Mysteriously slow restarts
We manage dozens of Terraform projects with GitLab merge requests (MRs) using Atlantis, which handles planning and applying. It enforces locking to ensure that only one MR can modify a project at a time.
It runs on Kubernetes as a singleton StatefulSet and relies on a Kubernetes PersistentVolume (PV) to keep track of repository state on disk. Whenever a Terraform project needs to be onboarded or offboarded, or credentials used by Terraform are updated, we have to restart Atlantis to pick up those changes, a process that could take 30 minutes.
The slow restart became painfully apparent when we recently ran out of inodes on the persistent storage used by Atlantis, forcing us to restart it to resize the volume. Inodes are consumed by every file and directory entry on disk, and the number available to a filesystem is determined by parameters passed when it is created. The Ceph persistent storage implementation provided by our Kubernetes platform doesn't expose a way to pass flags to mkfs, so we're at the mercy of default values: growing the filesystem is the only way to gain more inodes, and picking up the resized PV requires a pod restart.
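Inode exhaustion is easy to miss because the usual `df -h` only reports bytes. A quick sketch of how to check for it (`/atlantis-data` is a hypothetical mount path, not the real one from our deployment; the fallback to `/` just lets the command run anywhere):

```shell
# Check inode usage rather than byte usage: a volume can look nearly
# empty in `df -h` while IUse% sits at 100% and file creation fails.
# ATLANTIS_DATA is a hypothetical variable for your PV's mount point.
df -i "${ATLANTIS_DATA:-/}"
```

If `IUse%` is near 100% while `df -h` still shows free space, you are out of inodes, not bytes.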
We talked about extending the alert window, but that would just mask the problem and delay our response to real issues. Instead, we decided to investigate exactly why restarts were taking so long.
Whenever we were asked to do a rolling restart of Atlantis to pick up a change to the secrets it uses, we'd run kubectl rollout restart statefulset atlantis, which would gracefully terminate the current Atlantis pod before spinning up a new one. The new pod would appear almost immediately, but it would show:
$ kubectl get pod atlantis-0
NAME         READY   STATUS     RESTARTS   AGE
atlantis-0   0/1     Init:0/1   0          30m
…so what gives? Naturally, the first thing to check would be the events for that pod. It's waiting around for an init container to run, so maybe the pod events would illuminate why?
$ kubectl events --for=pod/atlantis-0
LAST SEEN   TYPE     REASON      OBJECT           MESSAGE
30m         Normal   Killing     Pod/atlantis-0   Stopping container atlantis-server
30m         Normal   Scheduled   Pod/atlantis-0   Successfully assigned atlantis/atlantis-0 to 36com1167.cfops.internet
22s         Normal   Pulling     Pod/atlantis-0   Pulling image "oci.example.com/git-sync/master:v4.1.0"
22s         Normal   Pulled      Pod/atlantis-0   Successfully pulled image "oci.example.com/git-sync/master:v4.1.0" in 632ms (632ms including waiting). Image size: 58518579 bytes.

That looks almost normal… but what's taking so long between scheduling the pod and actually starting to pull the image for the init container? Unfortunately, that was all the data we had to go on from Kubernetes itself. But surely there had to be something more that could tell us why it was taking so long to actually start running the pod.
In Kubernetes, a component called kubelet runs on every node and is responsible for coordinating pod creation, mounting persistent volumes, and many other things. From my time on our Kubernetes team, I knew that kubelet runs as a systemd service, so its logs should be available to us in Kibana. Since the pod had been scheduled, we knew which host we were interested in, and log messages from kubelet include the relevant object, so we could filter for atlantis to narrow the logs down to anything interesting.
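If you don't have centralized logging, the same filtering can be done straight from the node's systemd journal. A sketch, assuming the conventional unit name `kubelet` (adjust for your distribution):

```shell
# kubelet logs its volume-mount and pod-sync activity with the pod's
# namespace/name embedded, so a plain grep on the pod name is enough
# to isolate one pod's lifecycle from everything else on the node.
journalctl -u kubelet --since "1 hour ago" --no-pager | grep atlantis
```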
We were able to see the Atlantis PV being mounted shortly after the pod was scheduled. We also saw all the secret volumes mount without issue. However, there was still a big unexplained gap in the logs. We saw:
[operation_generator.go:664] "MountVolume.MountDevice succeeded for volume "pvc-94b75052-8d70-4c67-993a-9238613f3b99" (UniqueName: "kubernetes.io/csi/rook-ceph-nvme.rbd.csi.ceph.com^0001-000e-rook-ceph-nvme-0000000000000002-a6163184-670f-422b-a135-a1246dba4695") pod "atlantis-0" (UID: "83089f13-2d9b-46ed-a4d3-cba885f9f48a") device mount path "/state/var/lib/kubelet/plugins/kubernetes.io/csi/rook-ceph-nvme.rbd.csi.ceph.com/d42dcb508f87fa241a49c4f589c03d80de2f720a87e36932aedc4c07840e2dfc/globalmount"" pod="atlantis/atlantis-0"
[pod_workers.go:1298] "Error syncing pod, skipping" err="unmounted volumes=[atlantis-storage], unattached volumes=[], failed to process volumes=[]: context deadline exceeded" pod="atlantis/atlantis-0" podUID="83089f13-2d9b-46ed-a4d3-cba885f9f48a"
[util.go:30] "No sandbox for pod can be found. Need to start a new one" pod="atlantis/atlantis-0"

The last two messages looped several times until eventually we saw the pod actually start up properly.
So kubelet thinks the pod is otherwise ready to go, but it's not starting it, and something is timing out.
The lowest-level logs we had for the pod didn't show us what was going on. What else did we have to look at? Well, the last message before the hang is the PV being mounted onto the node. Ordinarily, if the PV has trouble mounting (e.g. because it's still stuck mounted on another node), that bubbles up as an event. But something was still going on here, and the only thing left to drill down on was the PV itself. So I plugged that into Kibana, since the PV name is unique enough to make a good search term… and immediately something jumped out:
[volume_linux.go:49] Setting volume ownership for /state/var/lib/kubelet/pods/83089f13-2d9b-46ed-a4d3-cba885f9f48a/volumes/kubernetes.io~csi/pvc-94b75052-8d70-4c67-993a-9238613f3b99/mount and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see …

Remember how I said at the beginning that we'd just run out of inodes? In other words, we have a lot of files on this PV. When the PV is mounted, kubelet runs the equivalent of chgrp -R to recursively change the group on every file and folder across the filesystem. No wonder it was taking so long: that's a ton of entries to traverse, even on fast flash storage!
The pod's spec.securityContext included fsGroup: 1, which ensures that processes running under GID 1 can access files on the volume. Atlantis runs as a non-root user, so without this setting it wouldn't have permission to read or write to the PV. The way Kubernetes enforces this is by recursively updating ownership on the entire PV every time it is mounted.
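To get a feel for why this walk is expensive, here's a minimal simulation of what kubelet's fsGroup handling amounts to. The file count is scaled way down (our PV had millions of entries, which is where the 30 minutes came from), and the temp directory is purely illustrative:

```shell
# Simulate kubelet's fsGroup behavior: visit every entry on the
# volume and change its group. kubelet also adjusts mode bits, so
# the real walk does strictly more work per entry than this.
dir=$(mktemp -d)
seq 1 1000 | sed "s|^|$dir/f|" | xargs touch   # create 1,000 files
time chgrp -R "$(id -g)" "$dir"                # the recursive walk
rm -rf "$dir"
```

The cost is linear in the number of entries, independent of how much data they hold, which is exactly why an inode-heavy volume hurts so much here.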
Fixing this was heroically… boring. Since version 1.20, Kubernetes has supported an additional field on pod.spec.securityContext called fsGroupChangePolicy. The field defaults to Always, which results in exactly the behavior we saw here. It has one other option, OnRootMismatch, which only changes permissions if the root directory of the PV doesn't already have the right ones. If you don't know exactly how files are created on your PV, don't set fsGroupChangePolicy: OnRootMismatch. We checked to make sure that nothing should be changing the group on anything in the PV, and then set the field:
spec:
  template:
    spec:
      securityContext:
        fsGroupChangePolicy: OnRootMismatch

Now it takes about 30 seconds to restart Atlantis, down from the 30 minutes it took when we started.
Default Kubernetes settings are sensible for small volumes, but they can become bottlenecks as data grows. For us, this one-line change to fsGroupChangePolicy reclaimed nearly 50 hours of blocked engineering time per month. That was time our teams had been spending waiting for infrastructure changes to go through, and time our on-call engineers had been spending responding to false alarms. That's roughly 600 hours a year returned to productive work, from a fix that took longer to diagnose than to deploy.
Safe defaults in Kubernetes are designed for small, simple workloads, but as you scale they can slowly become bottlenecks. If you're running workloads with large persistent volumes, it's worth checking whether recursive permission changes like this are silently eating your restart time. Audit your securityContext settings, especially fsGroup and fsGroupChangePolicy. OnRootMismatch has been available since v1.20.
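One way to run that audit for a single workload (the statefulset and namespace names here are the ones from this post; substitute your own):

```shell
# Print the pod-level securityContext for a workload. If the output
# shows fsGroup but no fsGroupChangePolicy, volume mounts fall back
# to the default Always policy and pay the recursive-chown cost.
kubectl get statefulset atlantis -n atlantis \
  -o jsonpath='{.spec.template.spec.securityContext}'
```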
Not every fix is heroic or complex, and it's usually worth asking "why does the system behave this way?"
If debugging infrastructure problems at scale sounds interesting, we're hiring. Come join us on the Cloudflare team or our Discord to talk shop.



