Starting with kernel 2.2, Linux divides the privileges traditionally associated with the superuser into distinct units, known as capabilities, which can be independently enabled and disabled. This narrows the set of privileges a process holds and reduces the risk of exploitation.
This story started with removing the SYS_ADMIN and SYS_RESOURCE Linux capabilities from the K8s container that hosts DB2. Why did we remove them? At the time, DB2 had to run as root (to tune kernel parameters), so we wanted to minimize the privileges of the root user by removing some risky Linux capabilities from the pod/container.
A container is really just a process running on the system, isolated using cgroups and namespaces in the kernel. This means capabilities can be assigned to the container in the same way as to any other process; the container runtime handles this when it creates the container.
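In Kubernetes, the capabilities handed to the runtime are declared under securityContext.capabilities of each container. Here is a minimal sketch that writes such a manifest from the shell. The pod name, container name, image and file path are placeholders, not the actual DB2 deployment, and the add/drop lists only mirror the capabilities discussed in this post.

```shell
# Write a minimal Pod manifest to a temporary location.
# All names and the image below are hypothetical placeholders.
manifest="${TMPDIR:-/tmp}/db2-pod.yaml"

cat > "$manifest" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: db2-example
spec:
  containers:
  - name: db2
    image: example.com/db2:latest
    securityContext:
      capabilities:
        # Kubernetes uses the capability name without the CAP_ prefix.
        drop: ["SYS_ADMIN", "SYS_RESOURCE"]
        add: ["SYS_NICE", "IPC_OWNER"]
EOF

echo "manifest written to $manifest"
# Apply it with: kubectl apply -f "$manifest"
```

The drop list removes capabilities from what the runtime would otherwise grant, and the add list grants extra ones on top of the runtime defaults.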
How to add/remove Linux capabilities in K8s
Secure Your Containers with this One Weird Trick: The way I describe it is that most people think of root as being all powerful. This isn’t the whole picture, the root user with all capabilities is all powerful.
Linux capabilities in Kubernetes
For the purpose of performing permission checks, traditional UNIX implementations distinguish two categories of processes: privileged processes (whose effective user ID is 0, referred to as superuser or root), and unprivileged processes (whose effective UID is nonzero).
Basic Capability Thing
Header File
Linux capabilities are defined in a header file with the unsurprising name capability.h, located at /usr/include/linux/capability.h. The definitions are fairly self-explanatory and well commented.
Capability Number
To see the highest capability number for your kernel, read it from the /proc file system:
    # cat /proc/sys/kernel/cap_last_cap
Current Capabilities
To see the current capabilities, run capsh --print, for example as the normal user dsadm:
    $ capsh --print
For the normal user the Current: = line is empty, but not when you run as the root user:
    # capsh --print
To see the capabilities of a particular process, run cat /proc/<PID>/status | grep -i cap:
    # cat /proc/1/status | grep -i cap
These are the capability bitmaps. The meaning of each is:
- CapInh = Inheritable capabilities
- CapPrm = Permitted capabilities
- CapEff = Effective capabilities
- CapBnd = Bounding set
- CapAmb = Ambient capabilities set
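To pull a single one of these fields out of a status file in a script, a small helper along these lines may be handy. The function name cap_field is my own, not part of any tool.

```shell
# cap_field FIELD STATUS_FILE: print the hex mask of one Cap* field,
# e.g. CapEff, from a /proc/<PID>/status style file.
cap_field() {
    awk -v key="$1:" '$1 == key { print $2 }' "$2"
}

# Show the effective set of the current shell, if /proc is available.
if [ -r "/proc/$$/status" ]; then
    echo "CapEff of PID $$: $(cap_field CapEff "/proc/$$/status")"
fi
```

The same helper works for any of the five fields by changing the first argument.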
The CapBnd bounding set defines the upper limit of available capabilities. While a process runs, no capabilities can be added to this set. Only capabilities in the bounding set can be added to the inheritable set, which is done with the capset() system call. If a capability is dropped from the bounding set, that process and its children can no longer gain access to it.
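Since each capability is one bit in the mask, checking whether a capability is still in the bounding set is a shift and a mask. Below is a rough sketch. The helper name cap_in_mask is made up for illustration, and the bit numbers 21 (CAP_SYS_ADMIN) and 24 (CAP_SYS_RESOURCE) are the values defined in linux/capability.h.

```shell
# cap_in_mask HEXMASK BIT: succeed if bit number BIT is set in HEXMASK.
cap_in_mask() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Check the two capabilities from this story against the shell's bounding set.
if [ -r "/proc/$$/status" ]; then
    bnd=$(awk '/^CapBnd:/ { print $2 }' "/proc/$$/status")
    for pair in CAP_SYS_ADMIN:21 CAP_SYS_RESOURCE:24; do
        name=${pair%%:*}
        bit=${pair##*:}
        if cap_in_mask "$bnd" "$bit"; then
            echo "$name is in the bounding set"
        else
            echo "$name has been dropped from the bounding set"
        fi
    done
fi
```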
Using the capsh utility, we can decode a bitmap into capability names:
    # capsh --decode=00000000a884a5fb
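If capsh is not installed in the container, the decoding can be approximated in plain shell. This is only a sketch: the name table is copied by hand from linux/capability.h (bit 0 first), the function name decode_caps is my own, and the output format is simplified compared to what capsh prints.

```shell
# Capability names in bit order (bit 0 first), as in linux/capability.h.
CAP_NAMES="chown dac_override dac_read_search fowner fsetid kill setgid setuid \
setpcap linux_immutable net_bind_service net_broadcast net_admin net_raw \
ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct \
sys_admin sys_boot sys_nice sys_resource sys_time sys_tty_config mknod lease \
audit_write audit_control setfcap mac_override mac_admin syslog wake_alarm \
block_suspend audit_read perfmon bpf checkpoint_restore"

# decode_caps HEXMASK: print a comma-separated list of the capabilities
# whose bit is set in HEXMASK.
decode_caps() {
    mask=$((0x$1))
    bit=0
    out=""
    for name in $CAP_NAMES; do
        if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
            out="${out:+$out,}cap_$name"
        fi
        bit=$((bit + 1))
    done
    printf '%s\n' "$out"
}

decode_caps 00000000a884a5fb
```

For the mask above this prints cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_ipc_owner,cap_sys_chroot,cap_sys_nice,cap_mknod,cap_audit_write,cap_setfcap. Note that neither cap_sys_admin nor cap_sys_resource is in the list.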
Another easy way is to use the getpcaps utility:
    # getpcaps 1965
It is also interesting to see the capabilities of a set of related processes:
    # getpcaps $(pgrep db2)
Limit Capability
You can test what happens when a particular capability is dropped by using the capsh utility. This is a way to see which capabilities a particular program needs in order to function correctly. The capsh command can run a particular process and restrict the set of available capabilities.
    capsh --print -- -c "/bin/ping -c 1 localhost"
After dropping cap_net_raw, ping is not permitted:
    capsh --drop=cap_net_raw --print -- -c "/bin/ping -c 1 localhost"
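The same experiment can be repeated for several capabilities in a loop to narrow down what a program needs. Here is a sketch under the assumption that capsh is installed and the script runs as root. The function name probe_caps is hypothetical, and a failing command is only a hint that the dropped capability is required, not proof.

```shell
# probe_caps COMMAND CAP...: run COMMAND once per listed capability with
# that capability dropped, and report whether the command still succeeds.
probe_caps() {
    cmd=$1
    shift
    for cap in "$@"; do
        if capsh --drop="$cap" -- -c "$cmd" >/dev/null 2>&1; then
            echo "$cap: not needed"
        else
            echo "$cap: needed (command failed without it)"
        fi
    done
}

# Example, to be run as root:
# probe_caps "/bin/ping -c 1 localhost" cap_net_raw cap_net_admin cap_sys_admin
```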
Capability Meet
Capabilities I have run into so far:
- CAP_SYS_ADMIN Without it, I cannot run the hostname command in a Docker container in K8s.
- CAP_SYS_RESOURCE This is for adjusting DB2 kernel parameters.
These 3 are required by DB2:
- CAP_SETFCAP Set arbitrary capabilities on a file. (This is actually granted by default in an unprivileged Docker container.)
- CAP_SYS_NICE
- CAP_IPC_OWNER Bypass permission checks for operations on System V IPC objects.
My Questions
- Is a capability granted to a user or to a process? First ensure the environment (container) has enough Linux capabilities. Then use a command to grant the process certain capabilities; this needs root privilege, but afterwards you can run the process as an ordinary user.
- Does a privileged process bypass all kernel permission checks? Does that mean Linux capabilities only matter for unprivileged users or processes? I think there is a global or default capability set in the system that determines what any process on the system is allowed to do. Then you can fine-tune it for unprivileged processes.
- If we have both root and a normal user in a Docker container, are capabilities applied to root, to the normal user, or to both? After testing and comparing capsh --print output for different users in the xmeta container, I think capabilities are applied to all users in a K8s environment.
Later I will post a blog about <<Capability in Docker>>.