Docker Runtime Capabilities

In my earlier post, <<Linux Capability>>, I covered the basics of Linux capabilities. This post focuses on capabilities in Docker containers.

The docker run command has several flags that control runtime privileges and capabilities:

--cap-add: Add Linux capabilities
--cap-drop: Drop Linux capabilities
--privileged=false: Give extended privileges to this container
--device=[]: Allows you to run devices inside the container without the --privileged flag.
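
As a rough illustration of how these flags combine, the sketch below assembles a docker run argument list in Python without executing it. The helper name build_docker_run and its parameters are hypothetical; only the flags come from the list above. The value ALL is the keyword Docker accepts for --cap-add and --cap-drop to mean every capability.

```python
import shlex


def build_docker_run(image, command=(), cap_add=(), cap_drop=(),
                     devices=(), privileged=False):
    """Assemble (but do not execute) a `docker run` argument list.

    Hypothetical helper for illustration; only the flags are Docker's.
    """
    argv = ["docker", "run", "--rm"]
    if privileged:
        argv.append("--privileged")
    for cap in cap_drop:
        argv += ["--cap-drop", cap]
    for cap in cap_add:
        argv += ["--cap-add", cap]
    for dev in devices:
        argv += ["--device", dev]
    argv.append(image)
    argv += list(command)
    return argv


# Drop every capability, then add back only the one the workload needs.
argv = build_docker_run("busybox", ["sh"],
                        cap_add=["NET_BIND_SERVICE"], cap_drop=["ALL"])
print(shlex.join(argv))
```

Dropping everything and adding back a short allow-list is usually a safer starting point than --privileged, which grants the full set.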

By default, Docker containers are unprivileged and cannot, for example, run a Docker daemon inside a Docker container. This is because a container is not allowed to access any devices (/dev) on the host by default, whereas a “privileged” container is given access to all devices on the host.

The --privileged flag gives all capabilities to the container, and it also lifts all the limitations enforced by the device cgroup controller. In other words, the container can then do almost everything that the host can do. This flag exists to allow special use-cases, like running Docker within Docker.

To verify this, run a busybox container with and without --privileged. First, with the flag enabled:

docker run --rm -it --privileged busybox sh

Then check the capabilities of the init process (busybox doesn’t ship getpcaps):

# cat /proc/1/status | grep -i cap

CapInh: 0000001fffffffff
CapPrm: 0000001fffffffff
CapEff: 0000001fffffffff
CapBnd: 0000001fffffffff
CapAmb: 0000000000000000
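
The five Cap* lines are hexadecimal bitmasks, one per capability set. A minimal sketch of pulling them out of the status text follows; the function name parse_cap_sets is hypothetical, and the sample text is the output shown above rather than a live read of /proc/1/status.

```python
# Sample copied from the privileged container's output above.
SAMPLE_STATUS = """\
CapInh: 0000001fffffffff
CapPrm: 0000001fffffffff
CapEff: 0000001fffffffff
CapBnd: 0000001fffffffff
CapAmb: 0000000000000000
"""


def parse_cap_sets(status_text):
    """Return {set name: integer bitmask} for the Cap* lines of a status file."""
    sets = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.startswith("Cap"):
            # The value is hex without a 0x prefix; strip tabs and spaces first.
            sets[key] = int(value.strip(), 16)
    return sets


caps = parse_cap_sets(SAMPLE_STATUS)
for name, mask in caps.items():
    print(f"{name}: {mask:#x}")
```

On a real system the same function could be fed the contents of /proc/<pid>/status, since lines that do not start with Cap are ignored.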

Then decode the mask on another machine that has capsh. The full set of capabilities is present:

# capsh --decode=0000001fffffffff

0x0000001fffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,
cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,
cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,
cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,
cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,
cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,
35,36
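
If capsh is not at hand, the mask can also be decoded by hand: bit N of the mask corresponds to capability number N. The sketch below uses the capability numbering from capabilities(7) and the kernel headers; the function name decode_caps is hypothetical. The bare 35,36 in the capsh output above are capabilities that this capsh build did not know by name; in the kernel's numbering they are cap_wake_alarm and cap_block_suspend.

```python
# Capability names indexed by capability number (bit position in the mask).
CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid", "cap_setpcap",
    "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner",
    "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace",
    "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice",
    "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod",
    "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap",
    "cap_mac_override", "cap_mac_admin", "cap_syslog", "cap_wake_alarm",
    "cap_block_suspend", "cap_audit_read", "cap_perfmon", "cap_bpf",
    "cap_checkpoint_restore",
]


def decode_caps(mask):
    """List the capability names whose bits are set in an integer mask."""
    names = []
    bit = 0
    while mask >> bit:
        if (mask >> bit) & 1:
            # Fall back to the bit number for capabilities newer than the table.
            names.append(CAP_NAMES[bit] if bit < len(CAP_NAMES) else str(bit))
        bit += 1
    return names


print(",".join(decode_caps(0x0000001FFFFFFFFF)))
```

The mask 0000001fffffffff has bits 0 through 36 set, so the function returns 37 names, matching the capsh listing.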

Without --privileged, only the default capabilities are present:

# capsh --decode=00000000a80425fb

0x00000000a80425fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,
cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,
cap_audit_write,cap_setfcap

Docker keeps a default set of capabilities. The following list shows the Linux capabilities that are granted by default and can be dropped.

SETPCAP: Modify process capabilities.
MKNOD: Create special files using mknod(2).
AUDIT_WRITE: Write records to kernel auditing log.
CHOWN: Make arbitrary changes to file UIDs and GIDs (see chown(2)).
NET_RAW: Use RAW and PACKET sockets.
DAC_OVERRIDE: Bypass file read, write, and execute permission checks.
FOWNER: Bypass permission checks on operations that normally require the file system UID of the process to match the UID of the file.
FSETID: Don’t clear set-user-ID and set-group-ID permission bits when a file is modified.
KILL: Bypass permission checks for sending signals.
SETGID: Make arbitrary manipulations of process GIDs and supplementary GID list.
SETUID: Make arbitrary manipulations of process UIDs.
NET_BIND_SERVICE: Bind a socket to internet domain privileged ports (port numbers less than 1024).
SYS_CHROOT: Use chroot(2), change root directory.
SETFCAP: Set file capabilities.
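
As a cross-check, the fourteen capabilities in this list can be turned back into a bitmask and compared with the value read from an unprivileged container. The sketch below assumes the capability numbers from capabilities(7); the names DEFAULT_CAP_BITS and mask_for are hypothetical. The second result is only the mask one would expect after --cap-drop NET_RAW if nothing else alters the set.

```python
# Bit number of each default capability, per the kernel's capability numbering.
DEFAULT_CAP_BITS = {
    "CHOWN": 0, "DAC_OVERRIDE": 1, "FOWNER": 3, "FSETID": 4, "KILL": 5,
    "SETGID": 6, "SETUID": 7, "SETPCAP": 8, "NET_BIND_SERVICE": 10,
    "NET_RAW": 13, "SYS_CHROOT": 18, "MKNOD": 27, "AUDIT_WRITE": 29,
    "SETFCAP": 31,
}


def mask_for(names):
    """OR together the bits of the named capabilities."""
    mask = 0
    for name in names:
        mask |= 1 << DEFAULT_CAP_BITS[name]
    return mask


default_mask = mask_for(DEFAULT_CAP_BITS)
print(f"{default_mask:016x}")  # 00000000a80425fb

# Expected mask if NET_RAW alone were dropped from the defaults.
without_net_raw = mask_for(n for n in DEFAULT_CAP_BITS if n != "NET_RAW")
print(f"{without_net_raw:016x}")  # 00000000a80405fb
```

The first value matches the mask decoded earlier for the container started without --privileged.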

Further reference information is available in the capabilities(7) Linux man page.

Resources

Docker run reference
Docker security