Linux Capability

Starting with kernel 2.2, Linux divides the privileges traditionally associated with the superuser into distinct units, known as capabilities, which can be independently enabled and disabled. This reduces the set of privileges a process holds and so decreases the risk of exploitation.

This story started when we removed the SYS_ADMIN and SYS_RESOURCE Linux capabilities from the K8s container that hosts DB2. Why did we remove them? At the time, DB2 had to run as root (to tune kernel parameters), so we wanted to minimize the root user's privileges by removing some risky Linux capabilities from the pod/container.

A container is really just a process running on the system, separated using cgroups and namespaces in the kernel. This means that capabilities can be assigned to the container in just the same way as with any other process, and this is handled by the container runtime when it creates the container.
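
To make the removal concrete, here is a minimal sketch of a pod manifest that drops the two capabilities, wrapped in a shell function so it can be piped to kubectl. The pod name, container name and image are hypothetical placeholders, not the ones we actually used. Note that Kubernetes writes capability names without the CAP_ prefix.

```shell
# Print a pod manifest that drops SYS_ADMIN and SYS_RESOURCE.
# The pod name, container name and image are hypothetical placeholders.
pod_manifest() {
    cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: db2-demo                      # hypothetical name
spec:
  containers:
  - name: db2                         # hypothetical name
    image: example.com/db2:latest     # hypothetical image
    securityContext:
      capabilities:
        drop: ["SYS_ADMIN", "SYS_RESOURCE"]
EOF
}

# Review the manifest first, then hand it to kubectl:
#   pod_manifest | kubectl apply -f -
pod_manifest
```

The same securityContext.capabilities block also accepts an add list, so the container runtime can be told about both directions in one place.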

How to add/remove Linux capabilities in K8s

Secure Your Containers with this One Weird Trick: The way I describe it is that most people think of root as being all powerful. This isn’t the whole picture, the root user with all capabilities is all powerful.

Linux capabilities in Kubernetes

For the purpose of performing permission checks, traditional UNIX implementations distinguish two categories of processes: privileged processes (whose effective user ID is 0, referred to as superuser or root), and unprivileged processes (whose effective UID is nonzero).

Capability Basics

Header File

Linux capabilities are defined in a header file with the unsurprising name capability.h, at /usr/include/linux/capability.h. The definitions are pretty self-explanatory and well commented.

Capability Number

To see the highest capability number for your kernel, use the data from the /proc file system.

# cat /proc/sys/kernel/cap_last_cap
36
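
Capability numbers start at 0, so the number of capabilities the kernel knows is one more than cap_last_cap. A small sketch of that arithmetic follows; the optional file argument is my own addition so the function can be tried against a copy of the file.

```shell
# Print how many capabilities the kernel knows about.
# Capability numbers start at 0, so the count is cap_last_cap + 1.
# The optional argument (an assumption, handy for testing) overrides the path.
cap_count() {
    file=${1:-/proc/sys/kernel/cap_last_cap}
    last=$(cat "$file") || return 1
    echo $(( last + 1 ))
}

# Usage:
#   cap_count
```

With the value 36 shown above, cap_count prints 37.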

Current Capabilities

To see the current capabilities, run capsh --print. For example, as the normal user dsadm:

$ capsh --print

Current: =
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,
cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,
cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,
cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,
cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,
cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,
35,36
Securebits: 00/0x0/1'b0
secure-noroot: no (unlocked)
secure-no-suid-fixup: no (unlocked)
secure-keep-caps: no (unlocked)
uid=1002(dsadm)
gid=1002(dsadm)
groups=1002(dsadm)

Notice that Current: = is empty. Running the same command as the root user gives:

# capsh --print

Current: = cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,
cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,
cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,
cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,
cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,
cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,
35,36+ep
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,
cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,
cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,
cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,
cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,
cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,
35,36
Securebits: 00/0x0/1'b0
secure-noroot: no (unlocked)
secure-no-suid-fixup: no (unlocked)
secure-keep-caps: no (unlocked)
uid=0(root)
gid=0(root)
groups=0(root)

To see the capabilities for a particular process, run cat /proc/<PID>/status | grep -i cap:

# cat /proc/1/status | grep -i cap

CapInh: 00000000a884a5fb
CapPrm: 00000000a884a5fb
CapEff: 00000000a884a5fb
CapBnd: 00000000a884a5fb
CapAmb: 0000000000000000

Each value is a bitmap of capabilities. The meaning of each field is:

  • CapInh = Inheritable capabilities
  • CapPrm = Permitted capabilities
  • CapEff = Effective capabilities
  • CapBnd = Bounding set
  • CapAmb = Ambient capability set

CapBnd defines the upper limit of the capabilities available to a process. While a process runs, no capabilities can be added to this set. Only capabilities in the bounding set can be added to the inheritable set, which is done with the capset() system call. If a capability is dropped from the bounding set, that process and its children can no longer gain it.
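
To check whether one specific capability is still in a set such as CapBnd, you can pull the hex mask out of the status file and test a single bit. This is a rough sketch; the function names are mine, and the bit numbers come from capability.h (for example CAP_NET_RAW is 13 and CAP_SYS_ADMIN is 21).

```shell
# Print the hex mask of one capability set (e.g. CapBnd) from a status file.
cap_mask() {            # usage: cap_mask FIELD [STATUS_FILE]
    awk -v field="$1:" '$1 == field { print $2 }' "${2:-/proc/self/status}"
}

# Succeed if capability number BIT is set in the hex mask MASK.
mask_has_cap() {        # usage: mask_has_cap MASK BIT
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Usage:
#   if mask_has_cap "$(cap_mask CapBnd /proc/1/status)" 21; then
#       echo "CAP_SYS_ADMIN is in the bounding set"
#   else
#       echo "CAP_SYS_ADMIN has been dropped"
#   fi
```

For the mask 00000000a884a5fb shown above, bit 13 is set and bit 21 is not.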

Using the capsh utility, we can decode the bitmap into capability names:

# capsh --decode=00000000a884a5fb

0x00000000a884a5fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,
cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_ipc_owner,cap_sys_chroot,
cap_sys_nice,cap_mknod,cap_audit_write,cap_setfcap
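
If capsh is not installed (common in slim container images), the same decoding can be sketched in plain shell. The name table below follows the numbering in capability.h, where the position in the list is the bit number. The function name decode_caps is my own; bits beyond the end of the table are simply ignored.

```shell
# Capability names in bit order (bit 0 first), as numbered in capability.h.
CAP_NAMES="cap_chown cap_dac_override cap_dac_read_search cap_fowner
cap_fsetid cap_kill cap_setgid cap_setuid cap_setpcap cap_linux_immutable
cap_net_bind_service cap_net_broadcast cap_net_admin cap_net_raw cap_ipc_lock
cap_ipc_owner cap_sys_module cap_sys_rawio cap_sys_chroot cap_sys_ptrace
cap_sys_pacct cap_sys_admin cap_sys_boot cap_sys_nice cap_sys_resource
cap_sys_time cap_sys_tty_config cap_mknod cap_lease cap_audit_write
cap_audit_control cap_setfcap cap_mac_override cap_mac_admin cap_syslog
cap_wake_alarm cap_block_suspend cap_audit_read cap_perfmon cap_bpf
cap_checkpoint_restore"

# Decode a hex capability mask into a comma-separated list of names.
decode_caps() {         # usage: decode_caps HEXMASK
    mask=$(( 0x$1 ))
    bit=0
    result=""
    for name in $CAP_NAMES; do
        # Append the name when the bit at this position is set.
        if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
            result="${result:+$result,}$name"
        fi
        bit=$(( bit + 1 ))
    done
    printf '%s\n' "$result"
}

# Usage:
#   decode_caps 00000000a884a5fb
```

For 00000000a884a5fb this prints the same sixteen names as the capsh output above.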

Another easy way is to use the getpcaps utility:

# getpcaps 1965

Capabilities for `1965': = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_ipc_owner,
cap_sys_chroot,cap_sys_nice,cap_mknod,cap_audit_write,cap_setfcap+eip

It is also interesting to see the capabilities of a set of related processes:

# getpcaps $(pgrep db2)

Capabilities for `1965': = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_ipc_owner,
cap_sys_chroot,cap_sys_nice,cap_mknod,cap_audit_write,cap_setfcap+eip
Capabilities for `2151': = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_ipc_owner,
cap_sys_chroot,cap_sys_nice,cap_mknod,cap_audit_write,cap_setfcap+i
Capabilities for `2245': = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_ipc_owner,
cap_sys_chroot,cap_sys_nice,cap_mknod,cap_audit_write,cap_setfcap+eip
Capabilities for `2246': = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_ipc_owner,
cap_sys_chroot,cap_sys_nice,cap_mknod,cap_audit_write,cap_setfcap+eip
Capabilities for `2247': = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_ipc_owner,
cap_sys_chroot,cap_sys_nice,cap_mknod,cap_audit_write,cap_setfcap+eip
Capabilities for `2249': = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_ipc_owner,
cap_sys_chroot,cap_sys_nice,cap_mknod,cap_audit_write,cap_setfcap+i
Capabilities for `2614': = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_ipc_owner,
cap_sys_chroot,cap_sys_nice,cap_mknod,cap_audit_write,cap_setfcap+i
Capabilities for `4213': = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_ipc_owner,
cap_sys_chroot,cap_sys_nice,cap_mknod,cap_audit_write,cap_setfcap+i
Capabilities for `4238': = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_ipc_owner,
cap_sys_chroot,cap_sys_nice,cap_mknod,cap_audit_write,cap_setfcap+i

Limit Capability

You can test what happens when a particular capability is dropped by using the capsh utility. This is a way to see what capabilities a particular program may need to function correctly. The capsh command can run a particular process and restrict the set of available capabilities.

capsh --print -- -c "/bin/ping -c 1 localhost"

After dropping cap_net_raw, ping is no longer permitted:

capsh --drop=cap_net_raw --print -- -c "/bin/ping -c 1 localhost"
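
The same experiment can be repeated for several capabilities in a loop. The helper below is a rough sketch (the name probe_caps is mine): it runs a command once per capability, each time with only that capability dropped, and reports whether the command still succeeds. It needs root and the capsh utility, and a failure only tells you that the command did not succeed, not why.

```shell
# Run COMMAND once per capability, each time with only that capability
# dropped, and report whether the command still succeeds.
# Needs root and the capsh utility.
probe_caps() {          # usage: probe_caps "COMMAND" CAP...
    cmd=$1
    shift
    for cap in "$@"; do
        if capsh "--drop=$cap" -- -c "$cmd" >/dev/null 2>&1; then
            echo "$cap: works without it"
        else
            echo "$cap: fails without it"
        fi
    done
}

# Usage:
#   probe_caps "/bin/ping -c 1 localhost" cap_net_raw cap_net_admin
```

It is worth running the command once with nothing dropped first, so that a failure unrelated to capabilities is not misread.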

Capabilities I Have Met

Here are the capabilities I have run into so far:

  • CAP_SYS_ADMIN: without it, I cannot run the hostname command in a Docker container in K8s.
  • CAP_SYS_RESOURCE: used to adjust kernel parameters for DB2.

These three are required for DB2:

  • CAP_SETFCAP: set arbitrary capabilities on a file (this is actually granted by default in an unprivileged Docker container).
  • CAP_SYS_NICE
  • CAP_IPC_OWNER: bypass permission checks for operations on System V IPC objects.

My Questions

  1. Is a capability granted to a user or to a process? First ensure the environment (container) has enough Linux caps. Then use a command to grant the process certain caps. This step needs root privilege, but afterwards you can run the process as an ordinary user.

  2. Does a privileged process bypass all kernel permission checks? Does that mean Linux capabilities matter only for unprivileged users or processes? I think there is a global or default capability set in the system that determines what every process on the system is allowed to do. Then you can fine-tune it for unprivileged processes.

  3. If we have both root and a normal user in a Docker container, are capabilities applied to root, to the normal user, or to both? After testing and comparing capsh --print output as different users in the xmeta container, I think capabilities are applied to all users in a K8s environment.

In a later post I talk about <<Capability in Docker>>.

Resources

  • Linux Programmer’s Manual
  • Linux capabilities 101
