Who's killing my pods?
In the past few days, I've been trying to deploy a Kubernetes cluster using kubeadm on Debian. After kubeadm configured the control plane, something was killing the pods seemingly at random after a while. This post follows the story of how I tried to track down and fix this issue.
TL;DR: Always make sure the container runtime and the kubelet are configured with the same cgroup driver.
The setup
Instead of using Docker and going down the container-inside-a-container rabbit hole, I wanted an actual virtual machine, to mirror a production cluster as closely as possible. I had some experience with libvirt and QEMU/KVM, but wasn't in the mood to configure my cluster in XML.
I could have used Vagrant with the libvirt plugin to manage these XML files for me, but decided to go with terraform-provider-libvirt instead, mainly as an excuse to mess around with Terraform. You can see the result of the experiments in this repo.
The problem and the (nonideal) solution
The first challenge I had to tackle was that I couldn't use containerd from the official Debian repository, because it was too old.
At first, the cluster initialized correctly, and I could see the pods. Then, after a few minutes, something started restarting the pods, until it became impossible to query the apiserver for any information at all. At first, I thought an OOM kill was the culprit, but the virtual machine wasn't memory-constrained, and no logs in Kubernetes pointed in that direction. I looked at the logs and assumed the OS was terminating the etcd process for some reason, wreaking havoc on the apiserver and, consequently, everything that depends on it.
Luckily, searching on the Internet, I found the first clue:
> What helped me was to add systemd.unified_cgroup_hierarchy=0 to my GRUB_CMDLINE_LINUX_DEFAULT in the grub file (/etc/default/grub for Debian)
I added this option, and it worked! It tells systemd to fall back to cgroups v1. One thing the original poster found weird is that they were using the latest containerd version, which should support cgroups v2 out of the box. But hold that thought for a moment.
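For reference, here's roughly what that workaround looks like on Debian (my paraphrase of the thread's suggestion; it needs a reboot to take effect):

```
# In /etc/default/grub, append the option to the existing default command line, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet systemd.unified_cgroup_hierarchy=0"
# Then regenerate the grub config and reboot:
> update-grub
> reboot
```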
Difference between cgroups v1 and v2
cgroups (control groups) is a Linux kernel mechanism to manage the resources a process can access. Together with namespaces, they are the technologies that containers use under the hood.
Besides constraining resources, cgroups also exposes usage data about the processes it manages. As an example, cadvisor relies heavily on it to report pod stats. So, whenever you're reading some beautiful Grafana panel about your pod's memory or CPU, this data is probably coming from cgroups.
The cgroups API, like a lot of Linux abstractions, uses files to interface with userspace. You can limit the amount of memory a process uses, its CPU time, or the number of children it may spawn, for example. The different resource types are called controllers or subsystems.
```
# Notice how there is the cpu after cgroup
> mkdir /sys/fs/cgroup/cpu/custom
# Adding the shell pid to the cgroup
> echo $$ > /sys/fs/cgroup/cpu/custom/cgroup.procs
# From 100000 quota
> echo 10000 > /sys/fs/cgroup/cpu/custom/cpu.cfs_quota_us
# Only uses 10% of the cpu
> stress -c 1
```
A significant difference is that in v2, cgroups no longer has a separate directory per controller. Instead, all controllers live inside the same directory; it's called the unified hierarchy for this reason.
```
# No cpu folder. All controllers live in the same place
> mkdir /sys/fs/cgroup/custom
# Add the current shell session
> echo $$ > /sys/fs/cgroup/custom/cgroup.procs
# From 100000 quota
> echo 10000 > /sys/fs/cgroup/custom/cpu.max
# Same as before: it only uses 10% of the cpu
> stress -c 1
```
One last important detail is that cgroups are hierarchical, so child cgroups inherit, or further restrict, their behaviour based on the parent's values.
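A minimal sketch of that inheritance on cgroups v2 (assuming a root shell and the cpu controller enabled at the root, as in the examples above; the parent/parent_child names are just for the example): the child declares no limit of its own, but the parent's cap still applies.

```
# Parent capped at 10% of a CPU; the child asks for "max"
> mkdir -p /sys/fs/cgroup/parent/parent_child
> echo +cpu > /sys/fs/cgroup/parent/cgroup.subtree_control
> echo "10000 100000" > /sys/fs/cgroup/parent/cpu.max
> echo "max 100000" > /sys/fs/cgroup/parent/parent_child/cpu.max
> echo $$ > /sys/fs/cgroup/parent/parent_child/cgroup.procs
# Still throttled to ~10%: the parent's limit wins
> stress -c 1
```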
If you have an hour to spare, do watch this talk by Michael Kerrisk (of "The Linux Programming Interface" fame) introducing cgroups v2.
What's that systemd config doing under the hood?
Coming back to the original issue and the fix with systemd.unified_cgroup_hierarchy. If you're curious, the code in systemd related to the option given in that thread is surprisingly easy to follow. This option determines the return values of the cg_is_unified_wanted and cg_is_legacy_wanted functions. Based on both results, systemd mounts the entries of the mount_table variable with the options for cgroups v1 or v2. By specifying 0, cg_is_legacy_wanted returns true and systemd mounts cgroups with v1 options.
In the end, systemd will perform roughly the following commands on boot time depending on the version:
```
# When cg_is_legacy_wanted is true, it means the user wants v1
> mount -t tmpfs tmpfs /sys/fs/cgroup
> mkdir /sys/fs/cgroup/{pids,memory}
> mount -t cgroup -o pids cgroup /sys/fs/cgroup/pids
> mount -t cgroup -o memory cgroup /sys/fs/cgroup/memory
# Rest of the mounts for the other controllers

# If cg_is_unified_wanted is true, it means the user wants v2
> mount -t cgroup2 cgroup2 /sys/fs/cgroup
```
Obviously, systemd doesn't use shell commands but mounts everything via sys/mount.h. The result is the same, though.
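If you want to check which hierarchy your machine ended up with after playing with this option, the filesystem type of /sys/fs/cgroup gives it away:

```
# cgroup2fs means the unified v2 hierarchy; a v1 setup reports tmpfs here instead
> stat -fc %T /sys/fs/cgroup
cgroup2fs
```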
The better solution
The cluster was up and running, but using the nonideal cgroups v1.
Looking through a GitHub issue on containerd related to cgroups v2, I found someone recommending the SystemdCgroup option.
```toml
# Content of file /etc/containerd/config.toml
version = 2

[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            SystemdCgroup = true
```
And voilà. After restarting containerd, nothing was killing the pods anymore. I don't need to fiddle with kernel arguments, and I can use the more recent cgroups v2. I'm happy with that.
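For completeness, the other half of the TL;DR is the kubelet side. A quick sketch of how to check both halves, assuming a kubeadm setup that writes the kubelet configuration to /var/lib/kubelet/config.yaml:

```
# Both sides must agree on the driver
> grep SystemdCgroup /etc/containerd/config.toml
            SystemdCgroup = true
> grep cgroupDriver /var/lib/kubelet/config.yaml
cgroupDriver: systemd
```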
What's this config doing under the hood?
kubelet doesn't create the container processes or set up the cgroups for them. It communicates with containerd via the CRI (Container Runtime Interface) specification. containerd, in turn, delegates the heavy lifting of creating the container cgroups to runc via the systemd-cgroup option. runc, based on this option, sends a message to systemd through D-Bus.
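You can get a feel for that D-Bus conversation with systemd-run, which creates a transient unit in the same general way (a rough approximation, not the literal call runc makes; the unit name here is made up):

```
# Ask systemd to put a process in a new scope under the kubepods slice
> systemd-run --scope --slice=kubepods-besteffort.slice \
      --unit=my-fake-container.scope -p Delegate=yes sleep 300 &
> systemctl status my-fake-container.scope
```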
Okay, but who's killing your pods?
What's a good detective story without revealing the real author of the crimes? We need to trace where the pod termination happens to get some closure.
Raising the kubelet log verbosity and looking at the order of the events, I realised that it's indeed the kubelet that's the culprit: it's telling containerd to kill the pod.
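In case you want to do the same, this is roughly how I'd bump the verbosity on a kubeadm install (assuming the kubeadm systemd drop-in sources KUBELET_EXTRA_ARGS from /etc/default/kubelet, as it does with the Debian packages):

```
# klog verbosity 5 is enough to see the status manager messages quoted below
> echo 'KUBELET_EXTRA_ARGS="--v=5"' >> /etc/default/kubelet
> systemctl restart kubelet
> journalctl -u kubelet -f
```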
Where?
It all starts in the Start method (ha!). Something somewhere sends a value into the channel to reconcile the actual state of the pod with the desired state. I left out the code that sends this value because it's irrelevant to our investigation.
```go
// in Start on status_manager.go
go wait.Forever(func() {
	for {
		select {
		case syncRequest := <-m.podStatusChannel:
			klog.V(5).InfoS("Status Manager: syncing pod with status from podStatusChannel",
				"podUID", syncRequest.podUID,
				"statusVersion", syncRequest.status.version,
				"status", syncRequest.status.status)
			m.syncPod(syncRequest.podUID, syncRequest.status)
		}
	}
}, 0)
```
```go
// in syncPod on kubelet.go
// pcm is a podContainerManagerImpl struct
if !pcm.Exists(pod) && !firstSync {
	p := kubecontainer.ConvertPodStatusToRunningPod(kl.getRuntime().Type(), podStatus)
	if err := kl.killPod(pod, p, nil); err == nil {
		podKilled = true
	} else {
		klog.ErrorS(err, "KillPod failed", "pod", klog.KObj(pod), "podStatus", podStatus)
	}
}
```
Here, it checks if the pod's cgroup still exists. If not, it calls killPod, telling containerd to kill the pod.
Notice how there is no log here stating the exact reason why the pod was killed. This makes things challenging to troubleshoot.
Here is the part where the check happens:
```go
// in cgroup_manager_linux.go
// pcm will call this method
func (m *cgroupManagerImpl) Exists(name CgroupName) bool {
	return m.Validate(name) == nil
}

func (m *cgroupManagerImpl) Validate(name CgroupName) error {
	if libcontainercgroups.IsCgroup2UnifiedMode() {
		cgroupPath := m.buildCgroupUnifiedPath(name)
		neededControllers := getSupportedUnifiedControllers()
		enabledControllers, err := readUnifiedControllers(cgroupPath)
		if err != nil {
			return fmt.Errorf("could not read controllers for cgroup %q: %w", name, err)
		}
		difference := neededControllers.Difference(enabledControllers)
		if difference.Len() > 0 {
			return fmt.Errorf("cgroup %q has some missing controllers: %v", name, strings.Join(difference.List(), ", "))
		}
		return nil // valid V2 cgroup
	}
	// Rest of cgroups v1 logic
}
```
kubelet parsed the cgroup of this pod as kubepods-burstable-<pod_id>.slice inside kubepods-burstable.slice. I grepped the PID of the container in /sys/fs/cgroup and found that it was in the cgroup kubepods-besteffort-<pod_id>.slice:cri-containerd:<container_id> inside system.slice. The container cgroup was not related at all to the pod cgroup.
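A sketch of how to check that yourself, assuming crictl is pointed at containerd's socket (the ids are placeholders, and the last line is illustrative output):

```
# Find the container, get its PID, then ask the kernel where it lives
> crictl ps
> crictl inspect <container_id> | grep '"pid"'
> cat /proc/<pid>/cgroup
0::/system.slice/kubepods-besteffort-pod<pod_id>.slice:cri-containerd:<container_id>
```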
With the broken version, the BestEffort cgroups have the following layout (same with Burstable):
```
/sys/fs/cgroup/
├── kubepods.slice
│   ├── kubepods-besteffort.slice
│   │   └── kubepods-besteffort-pod<pod_id>.slice
├── system.slice
│   ├── kubepods-besteffort-pod<pod_id>.slice:cri-containerd:<container_id>
```
With the systemd option working correctly:
```
/sys/fs/cgroup/kubepods.slice/
├── kubepods-besteffort.slice
│   └── kubepods-besteffort-pod<pod_id>.slice
│       └── cri-containerd-<container_id>.scope
```
One thing that I thought was strange is that the pod cgroup does exist in the v2 format. Adding a log with the error there surprised me with the following message:

```
cgroup ["kubepods" "besteffort" "pod7149273f-1369-42ff-ae1f-79b1529bba7b"] has some missing controllers: cpuset
```
Okay. kubelet identifies the cgroup as missing because it's missing a controller. Who's removing the cpuset controller then?
Kubernetes QoS
Let's digress a bit. Kubernetes assigns each pod to one of three Quality of Service (QoS) classes. The class affects CPU scheduling and decides which pods die first in case of memory pressure. The three classes are:
- Guaranteed: Pods that are strict about their CPU and memory limits and requests
- Burstable: Pods that are less strict but still define at least one limit or request in one of its containers
- BestEffort: Pods that don't specify any limit or requests
kubelet uses the cpu.weight file to allocate CPU time to the processes based on their QoS class. This calculation happens every minute, and in the end, kubelet sends a D-Bus message to systemd setting the CPUWeight property on the QoS cgroup. It assigns the minimum weight of one to the BestEffort cgroup and calculates the Burstable weight based on the existing requests of active pods. As a good citizen, Kubernetes rewards pods that declare their requests with more CPU time. So, always specify limits and requests in your pod definitions.
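You can see both sides of this: the class Kubernetes assigned and the weight kubelet ended up writing (slice paths as created with the systemd driver; <pod_name> is a placeholder):

```
> kubectl get pod <pod_name> -o jsonpath='{.status.qosClass}'
BestEffort
# The BestEffort QoS cgroup gets the minimum weight mentioned above
> cat /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/cpu.weight
1
```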
What does this have to do with the previous error?
I noticed that the pod died a couple of seconds after kubelet sent this CPUWeight request every minute. Looking at the files of the QoS cgroup, I could see that cgroup.subtree_control was temporarily missing the cpuset controller.
cgroups uses this file to control which controllers the children can access.
```
> cd /sys/fs/cgroup
> mkdir -p custom_parent/custom_child
> ls -l custom_parent/custom_child/cpu* | wc -l
2
> cat custom_parent/custom_child/cgroup.controllers
# Nothing is returned
> echo +cpu > custom_parent/cgroup.subtree_control
> cat custom_parent/custom_child/cgroup.controllers
cpu
> ls -l custom_parent/custom_child/cpu* | wc -l
5
```
The cpu files, like cpu.weight or cpu.max, only become accessible after adding the cpu controller to the parent's cgroup.subtree_control.
This is a demo of what's happening every minute: I called systemd directly, so it's more deterministic, but the same operations happen under the hood. Notice how, after a couple of seconds, kubelet adds the cpuset controller back to return to "normality".
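A rough way to reproduce the same dance by hand, assuming the kubepods slices created by kubelet (the weight value is arbitrary):

```
# Poke the QoS cgroup through systemd, like kubelet does every minute
> systemctl set-property --runtime kubepods-besteffort.slice CPUWeight=5
# In another terminal, watch cpuset drop out of the list and come back
> watch -n 0.5 cat /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/cgroup.subtree_control
```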
This happens because runc also tries to create the cgroup via the file API after sending the D-Bus messages to systemd. I'm not sure if that's just to guarantee that the cgroup is created correctly.
By the way, it's kubelet that uses runc as a library to create the QoS cgroups, not containerd.
WTF systemd?
So, one crucial detail is that something removes the controller only when going through the systemd API; it's still there when writing to cpu.weight directly. So it's probably not the kernel messing with the controller.
Issuing a strace -p 1, I found out that systemd was the process removing the cpuset from the cgroup.subtree_control file of the QoS cgroup.
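If you want to catch PID 1 in the act yourself, something like this works (assuming strace is installed on the host; -y resolves file descriptors to paths):

```
> strace -f -y -p 1 -e trace=write 2>&1 | grep subtree_control
```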
systemd doesn't remove the controller in the working setup because runc adds the cgroup via systemd with the Delegate option. This option, according to the docs, does something relevant for us:
> systemd won't fiddle with your sub-tree of the cgroup tree anymore. It won't change attributes of any cgroups below it, nor will it create or remove any cgroups thereunder, nor migrate processes across the boundaries of that sub-tree as it deems useful anymore.
So, with Delegate set on a child of the pod and QoS cgroups, runc says: "Fuck off, systemd. I know what I'm doing. This cgroup and all its parents belong to me". systemd replies: "Okay. Carry on. I will leave you alone". When there is no delegated cgroup, systemd says: "Aha. All of these cgroups belong to me now. I will do whatever I want with them!".
The tricky question nagging me is: "Why is systemd removing this controller in the first place?".
What systemd is doing here is, unsurprisingly, hard to follow.
The cpuset controller was added to cgroups v2 relatively late (in v1, it lives as a separate hierarchy of its own). Perhaps systemd isn't taking it into account when it resets the controllers of non-delegated cgroups. I'm probably not seeing the forest for the trees, so I leave it as an exercise for the reader to find out =P.
In the end, this "bug" might be a feature; otherwise, kubelet wouldn't restart the pod, and I would think the control plane was healthy.
Wrapping up
A short summary of the broken version events:
1. containerd creates the container cgroup outside of kubepods-<qos>-<pod_id>.slice
2. kubelet sends a D-Bus message to systemd to change the CPUWeight property of the QoS cgroup
3. systemd writes this value to the cpu.weight file
4. systemd removes the cpuset controller for whatever reason
5. kubelet tries to sync the pod and realizes that a controller is missing
6. kubelet kills the pod because it thinks the cgroup is "gone"
7. kubelet "syncs" the QoS cgroup again and adds the cpuset back via the file API
8. the pod is up again
9. go back to 1: the control plane is broken
The version with SystemdCgroup in the containerd config:
1. containerd creates the container cgroup inside the pod cgroup with the Delegate option
2. kubelet sends a D-Bus message and systemd writes the value to the cpu.weight file
3. systemd doesn't mess with the parents of the delegated container cgroup
4. kubelet doesn't kill the pod because the cpuset controller is still there
5. the control plane is healthy
In my opinion, it should be kubelet's responsibility not to let the container runtime use a different cgroup driver than its own. For instance, kubelet used to return an error when docker's cgroup driver didn't match. Kubelet has since removed the docker integration, and it only supports runtimes implementing the CRI now. Apparently, the CRI specification doesn't provide an agnostic way to identify the cgroup driver of the container runtime. I'm probably still not seeing the whole picture, nor the best way to keep people from shooting themselves in the foot.
Conclusion
It was fun to troubleshoot all of this. I made some wrong assumptions (as usual), and I couldn't imagine that I would need to go that deep to find out what was going on.
I'm surprised at how much terraform-provider-libvirt helped me. Investing some time in setting up a declarative approach paid dividends: it was useful for running multiple hosts, trying out new OSes, and having multiple machines running simultaneously with different cgroup versions.
Let's see what the future holds now that I can bootstrap my own cluster in an isolated environment =).