Exploring Linux Control Groups
Introduction
Linux control groups (cgroups) are a kernel mechanism for limiting the system resources, such as CPU, memory, and I/O, used by a set of processes. Control groups, along with Linux namespaces, are the two basic building blocks for container resource isolation and management, enabling OS-level virtualization such as Linux Containers (LXC).
The cgroups mechanism partitions groups of processes and their children into hierarchical groups with controlled resource limits.
A few technical terms related to cgroups and how they group processes together -
- A cgroup associates a set of tasks (threads) with parameters for one or more subsystems.
- A subsystem is a module which uses the grouping facility of cgroups in a particular way. Typically, it’s a resource controller that sets per-cgroup resource limits.
- A hierarchy is a set of cgroups arranged in a tree. Each task can be associated with only a single cgroup in a hierarchy, and with a set of subsystems. Each hierarchy has one or more subsystems attached to it, and an instance of the cgroup virtual filesystem associated with it.
There can be multiple active hierarchies of task cgroups at any time. Each hierarchy is a partition of all the tasks in a system.
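One way to see this partitioning is /proc/<pid>/cgroup, which contains one line per hierarchy the task belongs to, in the form hierarchy-ID:subsystems:cgroup-path. A quick look for the current shell (the exact hierarchies and paths will differ per system) -
$ cat /proc/self/cgroup          # one line per hierarchy: <ID>:<subsystems>:<cgroup-path>
$ cat /proc/self/cgroup | wc -l  # number of hierarchies this task is a member of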
For more details on how it is implemented, see the Kernel cgroups documentation.
Now let’s see it in action!
Experiments
The following tests have been conducted on Ubuntu 18.04 -
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.5 LTS
Release: 18.04
Codename: bionic
$ uname -r
5.4.0-72-generic
Ubuntu 18.04 uses cgroup-v1 by default; if /sys/fs/cgroup/cgroup.controllers is present, then cgroup-v2 is in use (a quick check for this is shown after the listing below). To list all the cgroup subsystems (controllers) and the hierarchies they are attached to:
$ cat /proc/cgroups
#subsys_name hierarchy num_cgroups enabled
cpuset 8 1 1
cpu 2 88 1
cpuacct 2 88 1
blkio 9 88 1
memory 3 93 1
devices 7 88 1
freezer 10 1 1
net_cls 4 1 1
perf_event 11 1 1
net_prio 4 1 1
hugetlb 12 1 1
pids 6 94 1
rdma 5 1 1
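And the quick version check mentioned above is simply a test for the cgroup-v2 control file -
$ [ -f /sys/fs/cgroup/cgroup.controllers ] && echo "cgroup v2" || echo "cgroup v1"
cgroup v1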
The cgroup virtual filesystem is mounted at /sys/fs/cgroup. Check out the cgroups man page for details on each subsystem.
$ ls /sys/fs/cgroup
blkio cpu,cpuacct cpuset devices freezer hugetlb memory net_cls,net_prio perf_event pids rdma systemd unified cpu cpuacct net_cls net_prio
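Most of these entries are separate mounts of the cgroup filesystem, one per hierarchy (a few, like cpu and cpuacct, are symlinks into a combined hierarchy); mount shows them, e.g. for the memory hierarchy (the exact mount options may vary) -
$ mount -t cgroup | grep memory
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)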
For example, to check the current memory usage (see this for more details) -
$ cat /sys/fs/cgroup/memory/memory.usage_in_bytes
5955682304
This is about 5.54 GiB, and is very close to what free -m says (total - available).
Note: all memory values read will be a multiple of the kernel’s page size (4096 bytes, or 4 KiB), which is the smallest allocatable unit of memory.
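The page size itself can be checked with getconf -
$ getconf PAGE_SIZE
4096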
Inside a container, it gives -
$ docker run -it ubuntu:20.04 bash
$ cat /sys/fs/cgroup/memory/memory.usage_in_bytes
7639040
Around 7.2 MiB, far less!
By default, containers are allowed to use as much memory as the kernel scheduler allows. This can be seen from the host machine -
$ cat /sys/fs/cgroup/memory/memory.limit_in_bytes
9223372036854771712
$ cat /sys/fs/cgroup/memory/docker/cc29550dc71dcd17768aadf04d08f33f4371ae8d065cb37470b9664ed81836d2/memory.limit_in_bytes
9223372036854771712
Docker creates a new cgroup for each container, named after the container ID (as listed by docker container ps). Various resource constraints can be applied to a container; see the Docker Resource Constraint docs and Docker Run options.
Applying the --memory limit of 20 MB -
$ docker run -it --memory 20m ubuntu:20.04 bash
$ cat /sys/fs/cgroup/memory/docker/014735a67658dfd8307bb032905cf0a12b731d60a7977df16df99ec89935b03a/memory.limit_in_bytes
20971520
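Instead of copying the container ID by hand, it can be captured from docker run -d, which prints the full ID (a small sketch, assuming the default cgroupfs layout shown above; the container name mem-demo is just for illustration) -
$ CID=$(docker run -d --name mem-demo --memory 20m ubuntu:20.04 sleep 600)
$ cat /sys/fs/cgroup/memory/docker/$CID/memory.limit_in_bytes
20971520
$ docker rm -f mem-demo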
Memory Limits
The following steps are from this article, with some modifications so that they work on my system.
Install some packages -
sudo apt-get install libcgroup1 cgroup-tools
Test Script -
$ cat test.sh
#!/bin/sh
while true; do
echo "hello world"
sleep 60
done
The manual approach is used below. For the utilities provided by the libcgroup package, persistent cgroups, etc., check out the article.
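For reference, the equivalent setup with those tools would look roughly like this (a sketch; the rest of this section does it manually through the filesystem) -
$ sudo cgcreate -g memory:foo                        # create the cgroup
$ sudo cgset -r memory.limit_in_bytes=50000000 foo   # set the limit
$ sudo cgexec -g memory:foo sh test.sh               # run a command inside it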
To create a cgroup called foo under the memory subsystem, create a directory under /sys/fs/cgroup/memory -
sudo mkdir /sys/fs/cgroup/memory/foo
To set a limit for the foo cgroup, we’ll have to write to the file memory.limit_in_bytes. Set the limit to 50 MB:
$ echo 50000000 | sudo tee /sys/fs/cgroup/memory/foo/memory.limit_in_bytes
50000000
Verify that the value was written:
$ cat /sys/fs/cgroup/memory/foo/memory.limit_in_bytes
49999872
Remember that the value written will be rounded down to a multiple of the page size, 4096 bytes in this case.
Start the process in background:
$ sh test.sh &
[1] 5709
hello world
Using the PID, move the process to the foo cgroup under the memory subsystem:
$ echo 5709 | sudo tee /sys/fs/cgroup/memory/foo/cgroup.procs
5709
Using the same PID, verify that the process is running within the desired cgroup:
$ ps -o cgroup 5709
CGROUP
10:memory:/foo,8:blkio:/user.slice,6:cpu,cpuacct:/user.slice,5:devices:/user.slice,4:pids:/user.slice/user-1000.slice/user@1000.service,1:name=systemd:/user.slice/user-1000.slice/user@1000.service/gnome-t
The memory used by the process can be seen with -
$ cat /sys/fs/cgroup/memory/foo/memory.usage_in_bytes
495616
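A couple of other files in the memory subsystem are handy while experimenting -
$ cat /sys/fs/cgroup/memory/foo/memory.max_usage_in_bytes   # peak usage recorded for the cgroup
$ cat /sys/fs/cgroup/memory/foo/memory.stat                 # detailed per-cgroup memory statistics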
Now let’s see what happens when the process exceeds the memory limit. Kill the original process, and then delete the foo cgroup. I’ve used some libcgroup tools below -
sudo cgdelete memory:foo
Again create the foo cgroup -
sudo cgcreate -g memory:foo
Set the limit to 5000 bytes, which is much lower than what the process normally uses -
$ echo 5000 | sudo tee /sys/fs/cgroup/memory/foo/memory.limit_in_bytes
5000
$ cat /sys/fs/cgroup/memory/foo/memory.limit_in_bytes
4096
The limit is rounded down to a multiple of 4096, lower than the desired value. Start the script, move it to the cgroup, and then check the system logs -
$ sh test.sh &
[1] 6339
hello world
$ echo 6339 | sudo tee /sys/fs/cgroup/memory/foo/cgroup.procs
6339
$ tail /var/log/syslog
May 15 12:25:53 rajat-G5-5587 kernel: [17565.744476] pglazyfreed 0
May 15 12:25:53 rajat-G5-5587 kernel: [17565.744476] thp_fault_alloc 0
May 15 12:25:53 rajat-G5-5587 kernel: [17565.744476] thp_collapse_alloc 0
May 15 12:25:53 rajat-G5-5587 kernel: [17565.744476] Tasks state (memory values in pages):
May 15 12:25:53 rajat-G5-5587 kernel: [17565.744476] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
May 15 12:25:53 rajat-G5-5587 kernel: [17565.744478] [ 6339] 1000 6339 1158 223 53248 0 0 sh
May 15 12:25:53 rajat-G5-5587 kernel: [17565.744479] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/foo,task_memcg=/foo,task=sh,pid=6339,uid=1000
May 15 12:25:53 rajat-G5-5587 kernel: [17565.744483] Memory cgroup out of memory: Killed process 6339 (sh) total-vm:4632kB, anon-rss:68kB, file-rss:824kB, shmem-rss:0kB, UID:1000 pgtables:52kB oom_score_adj:0
May 15 12:25:53 rajat-G5-5587 kernel: [17565.744588] oom_reaper: reaped process 6339 (sh), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Notice that the Out-of-Memory killer (oom-killer) killed the process as soon as it hit the 4 KiB limit. Verify that the process is no longer running -
$ ps -o cgroup 6339
CGROUP
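The memory controller’s counters reflect the kill as well: memory.failcnt counts how many times the limit was hit, and memory.oom_control reports the cgroup’s OOM state (including an oom_kill count on recent kernels) -
$ cat /sys/fs/cgroup/memory/foo/memory.failcnt
$ cat /sys/fs/cgroup/memory/foo/memory.oom_control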
PID limits
By default, there is no limit on the number of processes that can be created. To add such a limit, set it in the pids.max file inside the pids subsystem. This file is not available in the root cgroup; see the pids subsystem docs for more details.
A Docker container, by default, also has no limit on the number of processes -
$ cat /sys/fs/cgroup/pids/docker/bb33d6ba5fb72615615d593c144b1d6c85c684e63241ccb57df6bc737c5bd0bb/pids.max
max
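Docker exposes this through the --pids-limit option of docker run, which fills in the container’s pids.max (a quick sketch; the container name pids-demo is just for illustration) -
$ docker run -d --name pids-demo --pids-limit 50 ubuntu:20.04 sleep 600
$ cat /sys/fs/cgroup/pids/docker/$(docker inspect --format '{{.Id}}' pids-demo)/pids.max
50
$ docker rm -f pids-demo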
Here’s an example of limiting the number of processes which can be created:
$ sudo cgcreate -g pids:foo
$ cat /sys/fs/cgroup/pids/foo/pids.max
max
$ echo 20 | sudo tee /sys/fs/cgroup/pids/foo/pids.max
20
$ cat /sys/fs/cgroup/pids/foo/pids.max
20
We’ve set the limit to 20 processes. Now for the test script containing the famous fork bomb (side note: here’s a next-level one) -
$ cat test.sh
#!/bin/bash
# Fork bomb
:(){ :|:& };:
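The one-liner is easier to read when the function gets a normal name; this version is exactly equivalent (don’t run it outside a limited cgroup!) -
#!/bin/bash
# Define a function that pipes itself into itself in the background,
# then call it once; each invocation spawns two more copies.
bomb() {
    bomb | bomb &
}
bomb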
Run the script under the foo cgroup using the cgexec tool -
sudo cgexec -g pids:foo ~/test.sh
You’ll get lots of fork: retry: Resource temporarily unavailable messages, but no crash.
$ tail /var/log/syslog
...
May 15 13:17:48 rajat-G5-5587 kernel: [ 1367.341566] cgroup: fork rejected by pids controller in /foo
...
There’s still a catch here! Deleting the cgroup directly will cause problems: when a cgroup is deleted, all of its tasks move to the parent group (here the user group), which has no limit on PIDs, so the fork bomb will continue.
First, kill all the running processes under the cgroup, and then delete the cgroup-
sudo kill -9 $(< /sys/fs/cgroup/pids/foo/tasks)
sudo cgdelete pids:foo
Note that killing the processes directly worked here, but to reliably kill all the processes, you’ll have to freeze them (using the freezer subsystem), send SIGKILL, and then unfreeze them.
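A minimal sketch of that freezer-based cleanup, assuming the runaway tasks are still in the pids:foo cgroup and using a freezer cgroup that is also named foo (in practice you would repeat the migration step until pids:foo is empty, since the bomb keeps forking while tasks are being moved) -
$ sudo cgcreate -g freezer:foo
$ for pid in $(cat /sys/fs/cgroup/pids/foo/cgroup.procs); do echo $pid | sudo tee /sys/fs/cgroup/freezer/foo/cgroup.procs > /dev/null; done
$ echo FROZEN | sudo tee /sys/fs/cgroup/freezer/foo/freezer.state
$ sudo kill -9 $(cat /sys/fs/cgroup/freezer/foo/cgroup.procs)
$ echo THAWED | sudo tee /sys/fs/cgroup/freezer/foo/freezer.state
$ sudo cgdelete freezer:foo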
This is intended to be the first in a series of blog posts written while trying to implement some of the concepts from the paper Houdini’s Escape: Breaking the Resource Rein of Linux Control Groups.