The top command run in a container within a pod shows host-machine overview metrics but container-level process metrics. The reason is that containers inside a pod partially share /proc with the host system, including the paths that hold memory and CPU information. top reads /proc/stat (host machine) and /proc/<pid>/stat (container process), and these files are not namespace-aware.

P.S.: lxcfs, a FUSE filesystem, can provide a container-native /proc, making the container behave more like a VM.
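A quick way to see the mismatch from inside a container is to compare the host-wide numbers in /proc with the container's own cgroup limit. A minimal sketch, assuming a cgroup v1 memory controller mounted at the usual path:

```
# /proc/meminfo is not namespaced, so this reports the host's total memory
grep MemTotal /proc/meminfo

# the cgroup file reports the container's actual memory limit (cgroup v1 path; an assumption)
cat /sys/fs/cgroup/memory/memory.limit_in_bytes
```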
The two methods below collect data from different sources, and they report different metrics. For Kubernetes OOMKilled events, kubectl top is the more accurate way to predict and track memory usage.
Kubectl Top
The K8s OOMKiller uses container_memory_working_set_bytes (from cAdvisor metrics, also visible in Prometheus if deployed) as the baseline to decide whether to kill the pod or not. It is an estimate of how much memory cannot be evicted; kubectl top uses this metric as well.
After metrics-server is installed:
```
# show all containers resource usage inside a pod
kubectl top pod <pod-name> --containers
```
In the Prometheus expression browser, you can get the same value as kubectl top:
```
# value in MiB (label names may differ slightly across cAdvisor/kubelet versions)
container_memory_working_set_bytes{pod="<pod-name>", container="<container-name>"} / 1024 / 1024
```
Check where the Prometheus alerts get their container/pod memory data from, and pay attention to which metric the data source actually uses; the same applies when using Grafana.
Docker Stats
The memory figures shown by docker stats are collected from /sys/fs/cgroup/memory, with some calculations applied; see the explanation below.

On the host machine, display the container stats (CPU and memory usage):
```
# similar to top
docker stats
```
Actually, the docker CLI fetches data from the Docker API, for instance v1.41 (run docker version to see the supported API version); you can get the stats data with a curl command:
```
curl --unix-socket /var/run/docker.sock "http:/v1.41/containers/<container id>/stats?stream=false" | jq
```
1 | "memory_stats": { |
From this docker stats description:
On Linux, the Docker CLI reports memory usage by subtracting cache usage from the total memory usage. The API does not perform such a calculation but rather provides the total memory usage and the amount from the cache so that clients can use the data as needed. The cache usage is defined as the value of total_inactive_file field in the memory.stat file on cgroup v1 hosts.
On Docker 19.03 and older, the cache usage was defined as the value of cache field. On cgroup v2 hosts, the cache usage is defined as the value of inactive_file field.
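On a cgroup v2 host, that value can be read from the unified hierarchy. A minimal sketch, assuming you are inside the container (the paths are assumptions based on the unified cgroup mount):

```
# cgroup v2: total usage and the inactive file cache that the CLI subtracts from it
cat /sys/fs/cgroup/memory.current
grep -w inactive_file /sys/fs/cgroup/memory.stat
```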
memory_stats.usage is from /sys/fs/cgroup/memory/memory.usage_in_bytes.
memory_stats.stats.inactive_file is from /sys/fs/cgroup/memory/memory.stat.
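Those two values can also be read directly from inside the container; a small sketch (paths assume cgroup v1):

```
# total usage and the inactive file cache feeding the calculation below
cat /sys/fs/cgroup/memory/memory.usage_in_bytes
grep -wE '(total_)?inactive_file' /sys/fs/cgroup/memory/memory.stat
```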
So here it is:
```
80388096 - 17829888 = 62558208 bytes => 59.66 MiB
```
This perfectly matches the MEM USAGE column of docker stats.
The dockershim is deprecated in K8s! If the containerd runtime is used instead, you can explore resource usage by checking the cgroup hierarchy on the host machine, or by going into the container and checking /sys/fs/cgroup/cpu.
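For example, on a host that uses the systemd cgroup driver with cgroup v1, the pods' memory cgroup files live under the kubepods slice. A rough sketch (the exact layout depends on the cgroup driver and runtime, so treat the path as an assumption):

```
# list per-container memory usage files under the kubepods hierarchy (systemd driver, cgroup v1)
find /sys/fs/cgroup/memory/kubepods.slice -name memory.usage_in_bytes | head
```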
To calculate the container memory usage the way docker stats does, from inside the pod and without installing any third-party tool:

```
# memory in MiB: used
```
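A minimal sketch of that calculation on a cgroup v1 host, mirroring the usage-minus-inactive-file arithmetic above (paths and the awk field name are assumptions based on cgroup v1):

```
# memory used in MiB = (memory.usage_in_bytes - inactive file cache) / 1024 / 1024
usage=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
cache=$(awk '/^total_inactive_file/ {print $2}' /sys/fs/cgroup/memory/memory.stat)
echo $(( (usage - cache) / 1024 / 1024 ))
```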
To calculate the container CPU usage the way docker stats does, from inside the pod and without installing any third-party tool:

```
# cpu, cpuacct dirs are softlinks
```
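A minimal sketch on a cgroup v1 host: sample the cgroup's cumulative CPU time over one second and convert it to a per-core percentage, which is close to what docker stats reports (docker's exact formula normalizes against host CPU time, so this is a simplification):

```
# cpuacct.usage is the cgroup's cumulative CPU time in nanoseconds
start=$(cat /sys/fs/cgroup/cpuacct/cpuacct.usage)
sleep 1
end=$(cat /sys/fs/cgroup/cpuacct/cpuacct.usage)
# percent of one core used during the 1-second window
awk -v s="$start" -v e="$end" 'BEGIN { printf "%.2f%%\n", (e - s) / 10000000 }'
```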
Readings
- How much is too much? The Linux OOMKiller and “used” memory. We can see from this experiment that container_memory_usage_bytes does account for some filesystem pages that are being cached. We can also see that OOMKiller is tracking container_memory_working_set_bytes. This makes sense, as shared filesystem cache pages can be evicted from memory at any time. There’s no point in killing the process just for using disk I/O.
- Kubernetes top vs Linux top. kubectl top shows metrics for a given pod. That information is based on reports from cAdvisor, which collects real pod resource usage.
- cAdvisor: container advisor. cAdvisor (Container Advisor, a Go project) provides container users an understanding of the resource usage and performance characteristics of their running containers.