A quick reference summarizing jq use cases and caveats from daily work:

Need to read through some common use cases:

yq is another open-source tool that relies on jq to query YAML files; it has most common operations available and is syntactically similar to jq.

  1. Quote the expression. For example:
# -r: raw string
k get pod banzai-vault-0 -o json | jq -r '.spec.containers[].name'
  2. Filter out null values. For example, to get the configMap name from a deployment: the volumes list may have multiple subitems and only one of them is a config map, so we need to filter out null:
kubectl get deploy nwiaas -o json | jq -r '.spec.template.spec.volumes[].configMap.name | select(. != null)'

This is a simple case to introduce the select function.
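As a possible alternative (the sample JSON below is made up for illustration), jq's // operator combined with empty can drop nulls without an explicit select:

```shell
# // falls through to the right-hand side when the left is null/false;
# `empty` produces no output, so null names are silently dropped
echo '[{"name":"cfg"},{}]' | jq -r '.[] | .name // empty'
```

This prints only cfg.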

  3. Replace field values using the |= operator, see here.

For example, I want to create one job from cronjob but with different args:

# --dry-run=client: print without sending object
kubectl create job \
--from=cronjob/<cronjob name> \
<job name> \
--dry-run=client \
--output json > original.json

# use jq docker
docker run -it --rm \
-v $(pwd)/original.json:/original.json \
--entrypoint=/bin/bash \
stedolan/jq \
-c "cat /original.json |jq -r '.spec.template.spec.containers[].args |= [\"python\", \"make.py\", \"do_something\"]' | cat" \
| kubectl apply -f -

Here I append cat at the end to strip the color escape sequences; otherwise kubectl apply will fail.
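If piping through cat still leaves escape sequences behind, a sed filter can strip ANSI color codes explicitly (this assumes GNU sed, which understands the \x1b escape):

```shell
# remove ANSI SGR sequences such as ESC[31m ... ESC[0m from a stream
printf '\033[31mred\033[0m plain\n' | sed 's/\x1b\[[0-9;]*m//g'
```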

  4. Loop and extract specified values. For example, the Elasticsearch API returns a list of map objects where each map has the same set of fields; I want to extract a few of the fields and output them in a specific format.
# [.index,.shard,.prirep,.node]: build an array from the selected fields
curl -s "http://localhost:9200/_cat/shards/.ds-*?h=index,shard,prirep,state,node&format=json" | \
jq -r '.[] | [.index,.shard,.prirep,.node] | @csv' | sed 's/\"//g'

It outputs a CSV-style string, for example:

"xxx","yyy","zzz","qqq"

Or format as a table:

# [.index,.shard,.prirep,.node]: generate an array
# join(":|:"): join array elements
# column -s : -t: use `:` as separator and make table
curl -s "http://localhost:9200/_cat/shards/.ds-*?h=index,shard,prirep,state,node&format=json" | \
jq -r '.[] | [.index, .shard, .prirep, .node] | join(":|:")' | column -s : -t | sort

The output is like:

.ds-k8s-nodes-filebeat-2022.04.28-000020 | 0 | p | 172.16.0.172
.ds-k8s-nodes-filebeat-2022.04.28-000020 | 1 | p | 172.16.0.164
.ds-k8s-nodes-filebeat-2022.04.28-000020 | 2 | p | 172.16.0.247

Command line application ingredients:

  1. arguments, options (flags)
  2. stdin, stdout, stderr
  3. meaningful exit codes and logging
  4. signal handling
  5. output colors: rich, colorama modules

Non-click

If not using click, how do you split subcommands into dedicated files? Here is a good example using the argparse and importlib modules.

Click

Introduction video: https://youtu.be/kNke39OZ2k0

My bootstrap click subcommand lazy-loading framework is in this github repo, so I can focus on the functional part.

Colorful output: the logging module is preferred for general purposes, but if the application is not complex, click's built-in color output is sufficient.

A cheatsheet for installing/upgrading/rolling back helm charts.

Repo Chart

Note that the chart release name can be different from the chart name.

Add and sync helm repo:

# add helm repo if needed
helm repo add <repo name> <repo URL> \
--username xx \
--password xx

# sync helm repo
# you need to run this if any chart version updated
helm repo update

# list helm repos
helm repo list

# search
# -l: list all versions
# --version: regexp to filter version
helm search repo <chart name> [-l] [--version ^1.0.0]
# CHART VERSION is for chart upgrade or install
# show latest VERSION here
NAME   CHART VERSION   APP VERSION   DESCRIPTION
xxx    2.1.0           3.3.2         xxxxxx

View helm installed chart:

# list installed charts
helm list [-n <namespace>]

# check chart history
# xxx-1.0.1: xxx is chart name, 1.0.1 is chart version
helm history <release name> [-n <namespace>]
REVISION UPDATED STATUS CHART APP VERSION DESCRIPTION
1 Mon Oct 11 19:41:41 2021 superseded xxx-1.0.1 1.8.3 Install complete
2 Mon Oct 18 17:39:37 2021 superseded xxx-1.0.2 1.8.3 Upgrade complete
3 Mon Oct 18 17:41:29 2021 superseded xxx-1.0.1 1.8.3 Rollback to 1
4 Mon Oct 18 17:44:21 2021 superseded xxx-1.0.1 1.8.3 Upgrade complete
5 Mon Oct 18 17:55:28 2021 deployed xxx-1.0.2-66 1.8.3 Upgrade complete

# check chart installed status/output
helm status <release name> [-n <namespace>]

Install or Upgrade chart:

# get current chart values
# edit if necessary
helm get values <release name> [-n <namespace>] > values.yaml

# upgrade with specified version and values.yaml
# -f: specify values if necessary; if omitted, the existing values (as shown by helm get values) are reused
# where to get version: helm search
helm install/upgrade <release name> [repo/chart] [-n <namespace>] --version <version> [-f values.yaml]
# helm upgrade example gcloud-helm/example --version 1.0.1 -f values.yaml
# if you run helm history, it will be displayed as xxx-1.0.3 in the CHART column
# 1.0.3 is the --version value
# note that if the version format is 1.0.3.19, you have to convert it to 1.0.3-19 in the command

# see upgrade result
helm history <release name> [-n <namespace>]
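The 1.0.3.19 to 1.0.3-19 conversion mentioned above can be scripted; a minimal sketch (the version string is illustrative):

```shell
# turn the last dot into a dash so helm accepts the version string
ver="1.0.3.19"
helm_ver=$(printf '%s\n' "$ver" | sed 's/\.\([0-9][0-9]*\)$/-\1/')
echo "$helm_ver"   # 1.0.3-19
```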

Rollback chart:

# REVISION is from helm history, see above
# if REVISION is omitted, rolls back to the previous release
helm rollback <release name> [-n <namespace>] [REVISION]
# note that rollback will also rollback the values

# see rollback result
helm history <release name> [-n <namespace>]

Uninstall chart:

# --keep-history if necessary
# in case need to rollback
helm uninstall <release name> [--keep-history] [-n <namespace>]

Download chart package to local:

# helm-repo is repo name from "helm repo list"
# example is chart name from that repo
# --version specify version of the chart
helm pull helm-repo/example --version 1.0.1

Local Chart

For development purposes, we can install directly from a local chart folder.

You need to create Chart.yaml, which contains the chart version, appVersion, etc.

# without k8s cluster
# rendering template and check
# --debug: verbose output
helm template [--values <path to yaml file>] \
[--debug] \
<path to chart folder> \
| less

# in k8s cluster
# real helm install but without commit
# can generate a release name as [release]
helm install [release name] <path to chart folder> \
[-n <namespace>] \
--dry-run \
--debug 2>&1 \
| less

# real install
helm install [release name] <path to chart folder>

# uninstall
helm uninstall <release name> [-n <namespace>]

An application container generally has only one main service process (it can spawn child processes). An init process (PID 1) is needed to handle child reaping and signals. In other words, if your service process forks but does not reap its children, you need an init process; otherwise zombie processes will accumulate.

Especially when using third-party apps in containers, you don't know whether they spawn child processes or reap them, so it is best to use an init process; see this article and what is the advantage of tini.

Exploration

Here is an article about running multiple services in docker; the options could be:

  • using --init: this runs the docker-init process, which is backed by tini.
# ps aux can see docker-init
docker run --init -itd nginx
  • using a wrapper script, for example an entrypoint script.
  • running the main process along with temporary processes, with job control set in the wrapper script.
  • installing a dedicated init process and configuring it, for example supervisord, tini, dumb-init, etc.

For catching signals and reaping child processes, if you are not using tini or another dedicated init process, you need to write the code yourself.
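If you do write it yourself, the core of such a wrapper looks roughly like this (a sketch, with sleep standing in for the real service process):

```shell
#!/bin/sh
# forward TERM/INT to the child, then reap it with wait to avoid zombies
child=0
forward() { [ "$child" -ne 0 ] && kill -TERM "$child" 2>/dev/null; }
trap forward TERM INT

sleep 30 &      # stand-in for the real service process
child=$!
wait "$child"   # interrupted by the trap, then reaps the terminated child
```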

init Process

Take a look at the init processes commonly used in containers:

For tini, the usage steps are:

# in alpine docker
RUN apk add --no-cache tini
# tini is now available at /sbin/tini
ENTRYPOINT ["/sbin/tini", "--"]
# or
ENTRYPOINT ["/sbin/tini", "--", "/docker-entrypoint.sh"]

How tini Proxies Signals

I previously read an article about how Linux delivers signals to a container's init process. It explains why kill 1 inside a container sometimes fails, and when the kernel will deliver a signal to the init process and when it will not. The article only covers part of the source code, namely init process (SIGNAL_UNKILLABLE) + non-default signal handler + current namespace; see the second if condition: https://github.com/torvalds/linux/blob/a76c3d035872bf390d2fd92d8e5badc5ee28b17d/kernel/signal.c#L79-L99

static bool sig_task_ignored(struct task_struct *t, int sig, bool force)
{
    void __user *handler;

    handler = sig_handler(t, sig);

    /* SIGKILL and SIGSTOP may not be sent to the global init */
    if (unlikely(is_global_init(t) && sig_kernel_only(sig)))
        return true;

    /***** see this condition *****/
    if (unlikely(t->signal->flags & SIGNAL_UNKILLABLE) &&
        handler == SIG_DFL && !(force && sig_kernel_only(sig)))
        return true;
    /*******/

    /* Only allow kernel generated signals to this kthread */
    if (unlikely((t->flags & PF_KTHREAD) &&
                 (handler == SIG_KTHREAD_KERNEL) && !force))
        return true;

    return sig_handler_ignored(handler, sig);
}

The emphasis is on the sigcgt bitmask; this is consistent with what docker documents here: https://docs.docker.com/engine/reference/run/#foreground

A process running as PID 1 inside a container is treated specially by Linux: it ignores any signal with the default action. As a result, the process will not terminate on SIGINT or SIGTERM unless it is coded to do so.

In other words, if the user registers a SIGTERM handler in the init process (the sigcgt bit is set to 1), then handler == SIG_DFL is false, so the init process can receive the signal.

But the problem is: when I inspected the tini init process signal bitmask, sigcgt is 0 for all fields, so the kernel would not even deliver the signal. So how come tini forwards signals if no signal would be delivered at all? I have opened a question regarding this.
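You can reproduce that observation by reading the signal masks from /proc (shown here for the current shell; substitute tini's pid). For tini you would see SigCgt all zero while SigBlk covers the signals it waits for:

```shell
# SigBlk = blocked signals, SigCgt = caught signals; both are hex bitmasks
grep -E 'SigBlk|SigCgt' /proc/self/status
```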

From the author's comment, I learned that the way tini catches signals is by blocking all signals that should be forwarded to the child, and then waiting for them via sigtimedwait. Taking a closer look at the caller of sig_task_ignored: https://github.com/torvalds/linux/blob/a76c3d035872bf390d2fd92d8e5badc5ee28b17d/kernel/signal.c#L101-L120

static bool sig_ignored(struct task_struct *t, int sig, bool force)
{
    /*
     * Blocked signals are never ignored, since the
     * signal handler may change by the time it is
     * unblocked.
     */
    if (sigismember(&t->blocked, sig) || sigismember(&t->real_blocked, sig))
        return false;

    /*
     * Tracers may want to know about even ignored signal unless it
     * is SIGKILL which can't be reported anyway but can be ignored
     * by SIGNAL_UNKILLABLE task.
     */
    if (t->ptrace && sig != SIGKILL)
        return false;

    return sig_task_ignored(t, sig, force);
}

You will see that blocked signals are never ignored! So tini will always receive the signals from the kernel.

The sigtimedwait in tini's main loop simply blocks execution waiting for signals to arrive: https://github.com/krallin/tini/blob/378bbbc8909a960e89de220b1a4e50781233a740/src/tini.c#L501-L514

int wait_and_forward_signal(sigset_t const* const parent_sigset_ptr, pid_t const child_pid) {
    siginfo_t sig;

    if (sigtimedwait(parent_sigset_ptr, &sig, &ts) == -1) {
        switch (errno) {
            case EAGAIN:
                break;
            case EINTR:
                break;
            default:
                PRINT_FATAL("Unexpected error in sigtimedwait: '%s'", strerror(errno));
                return 1;
        }
    } else {

See signal(7), section "Synchronously accepting a signal", for an explanation of sigtimedwait.

You can also use sudo strace -p <pid> to observe how tini forwards signals:

# tini is init process
# bash is child
docker run --name test -itd tini_image:latest bash

# find out pid in host namespace
ps -ef | grep bash

sudo strace -p <tini pid>
sudo strace -p <child bash pid>

docker stop test

Benchmark

First install the stress and stress-ng packages; this benchmarks with multiple processes.

# stress cpu with 1 process
stress --cpu 1 --timeout 600
# stress cpu with 8 processes
stress -c 8 --timeout 600

# stress io
# stress -i 1 --timeout 600 does not work well
# because VM sync buffer is small
stress-ng --io 1 --hdd 1 --timeout 600

Install sysbench to benchmark with multiple threads.

# run a 5-minute benchmark with 10 threads to simulate thread context-switching pressure
sysbench --threads=10 --max-time=300 threads run

Send TCP/IP packets, for network and firewall checks.

yum install hping3 -y

# -S: TCP SYN
# -p: target port
# -i: interval, u100: 100 microsecond
hping3 -S -p 80 -i u100 192.168.0.30

Another useful benchmark tool is iperf3 to measure various network performance in a server-client mode. Search iperf3 Command for more details.

Analysis

Need to install perf.

# similar to top, real time cpu usage display
# Object: [.] userspace, [k] kernel
perf top
# -g: enables call-graph (stack chain/backtrace) recording.
perf top -g -p <pid>

# record profiling and inspect later
# -g: enables call-graph (stack chain/backtrace) recording.
perf record -g
perf report
# record can also help find short live process

CPU

# boot time, load average (runnable/running + uninterruptible IO), users
uptime
w

# -c: show command line
# -b: batch mode
# -n: iteration
top -c -b -n 1 | head -n 1

# -d: highlight the successive difference
watch -d "uptime"

# overall system metrics
# focus on in, cs, r, b, check man for description
# check whether the r count far exceeds the number of CPUs
# too many in (interrupts) is also a problem
# us/sy show whether the CPU is occupied by user space or the kernel
vmstat -w -S m 2

# cpu core number
lscpu
## press 1 to see cpus list
top

# check all cpus metrics
# determine whether high cpu usage is due to iowait or computation
mpstat -P ALL 1

# check which process cause cpu utilization high
# -u: cpu status: usr, sys, guest, total
pidstat -u 1

# short live process check
perf top
execsnoop

A versatile tool for generating system resource statistics:

# combination of cpu, disk, net, system
# when CPU iowait high, can use it to compare
# iowait vs disk read/write vs network recv/send
dstat

Context switch check:

# process context switch metrics
# -w: context switch
# cswch: voluntary, happens when system resources are insufficient
# nvcswch: non-voluntary, happens when many processes contend for the CPU
# -t: thread
pidstat -t -w 1 -p <pid number>

Interrupts check:

# hard interrupts
# RES: rescheduling interrupts
watch -d cat /proc/interrupts

# accumulated soft interrupts
# a big change *rate* is usually from:
# RCU: kernel read-copy update lock
# NET_TX
# NET_RX
# TIMER
# SCHED
watch -d cat /proc/softirqs
# soft interrupt kernel thread
# [ksoftirqd/<CPU id>]
ps aux | grep ksoftirq

# if softirq NET_RX/NET_TX is too high
# -n DEV: statistics of network device
# PPS: rxpck/s txpck/s
# BPS: rxkB/s txkB/s
sar -n DEV 1

Memory

To check process memory usage, use top (VIRT, RES, SHR) and ps (VSZ, RSS):
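For a quick look at those ps columns for a single process (here the current shell):

```shell
# VSZ: virtual memory size (KiB), RSS: resident set size (KiB)
ps -o pid,vsz,rss,comm -p $$
```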

# adjust oom score [-17,15]; the higher, the more likely to be OOM-killed
echo -16 > /proc/$(pidof <process name>)/oom_adj

Check OOM killed process:

dmesg |grep -iE "kill|oom|out of memory"
# check memory
# -h: readable
# -w: wide display
free -hw

# buffer is from /proc/meminfo Buffers
cat /proc/meminfo | grep -E "Buffers"
# cache is from /proc/meminfo Cached + SReclaimable
cat /proc/meminfo | grep -E "SReclaimable|Cached"
# understand what is buffer and cache, man proc
# --Buffers:
# Relatively temporary storage for raw disk blocks that shouldn't get tremendously large (20MB or so)
# --Cached:
# In-memory cache for files read from the disk (the page cache). Doesn't include SwapCached
# --Slab:
# In-kernel data structures cache.
# --SReclaimable:
# Part of Slab, that might be reclaimed, such as caches.

# -w: wide display
# -S: unit m(mb)
# 2: profile interval
vmstat -w -Sm 2

Check cache hits (need to install from BCC):

# system overall cache hit
cachestat
# process level cache hit
cachetop

Check memory leak:

# or valgrind
# in bcc-tools with cachestat and cachetop
memleak -a -p $(pidof app_name)

If swap is enabled, we can adjust the swappiness:

# [0, 100]; the higher, the more aggressively anonymous pages are swapped
# reclaims anonymous pages from the heap
echo 90 > /proc/sys/vm/swappiness

As opposed to swappiness, the other reclamation path targets file-backed pages from the cache/buffer.

Release caches; use carefully in production:

# sync: flush dirty pages to disk first
# 1: page cache; 2: dentries and inodes; 3: both; use carefully
sync; echo <1 or 2 or 3> > /proc/sys/vm/drop_caches

Check kernel slab details:

# man slabinfo
# pay attention to dentry and inode_cache
cat /proc/slabinfo | grep -E '^#|dentry|inode'

# real time kernel slab usage
# -s c: sort by cache size
slabtop -s c

I/O

  1. Check top for overall iowait performance
  2. Check iostat/sar for device performance
  3. Check pidstat for outstanding I/O process/thread
  4. Check strace for system read/write calls
  5. Check lsof for files a process/thread is operating on

System proc files related to I/O:

  • /proc/slabinfo
  • /proc/meminfo
  • /proc/diskstats
  • /proc/pid/io

Check disk space usage and inode usage

# -T: file system type
df -hT
df -hT /dev/sda2

# -i: inode
df -ih

# directory storage size
# -s: summary
# -c: total
du -sc * | sort -nr

Check overall device statistics:

# -d: device report
# -x: extended fields
iostat -dx 1
# pay attention to
# %util: disk utilization
# r/s,w/s: IOPS
# rKB/s, wKB/s: throughput
# r_await,w_await: delay

# -d: disk report
# -p: pretty print
# tps != IOPS
sar -dp 1

Check process I/O status

# -d: io status
# -p: pid
# -t: thread
pidstat -d [-t] -p <pid number> 1

# simple top-like I/O monitor
# -b: batch mode
# -n: iteration number
# -o: only show actually doing I/O processes/threads
# -P: only show process
iotop -b -n 1 -o [-P]

Check system calls on I/O to locate files:

# -f: threads
# -T: execution time
# -tt: system timestamp
# any read/write operations?
strace [-f] [-T] [-tt] -p <pid>

# check files opened by process
lsof -p <pid>

Also search and check <<Linux Check Disk Space>> for lsof usage, <<Linux Storage System>> for managing disk storage, and <<Linux Make Big Files>> for creating big files for testing.

Other useful BCC tools:

# trace file read/write
filetop
# trace kernel open system call
opensnoop

Network

System-level errors (kernel log):

# not only used for network but for general purpose
# -e: show local timestamp
dmesg -e | tail

sar network related commands:

# -n: statistics of network device
# DEV: network devices statistic
sar -n DEV 1

# see man sar for details
sar -n UDP 1
# ETCP: statistics about TCPv4 network errors
sar -n ETCP 1
# EDEV: statistics on failures (errors) from the network devices
sar -n EDEV 1

Network stack statistics:

# show tcp/udp listening sockets in numeric form
# -p: PID and name of the program to which each socket belongs.
netstat -tunlp

# check tcp connection status
# LISTEN/ESTAB/TIME-WAIT, etc
# -a: display all sockets
# -n: do not resolve service names
# -t: tcp sockets
ss -ant | awk 'NR>1 {++s[$1]} END {for(k in s) print k,s[k]}'

# check interface statistics
ip -s -s link

Network sniffing; see another blog <<Logstash UDP Input Data Lost>> for more tcpdump usage. It is a last resort and expensive; check whether mirrored traffic is available in production.

# -i: interface
# -nn: no resolution
# tcp port 80 and src 192.168.1.4: filter to reduce kernel overhead
tcpdump -i eth0 -nn tcp port 80 and src 192.168.1.4 -w log.pcap

For example, tcpdump to a pcap file and analyze it with wireshark later, using rotated or timed files to control the file size. Don't force-kill the tcpdump process, because that corrupts the pcap file.

UDP statistics

# -s: summary statistics for each protocol
# -u: UDP statistics
# for example 'receive buffer errors' usually indicates UDP packet dropping
watch -n1 -d netstat -su

For example, if receive buffer errors increases frequently, it usually means UDP packets are being dropped and you need to increase the socket receive buffer size or the app-level buffer/queue size.
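The kernel-side receive buffer ceiling lives under net.core; a sketch of inspecting and raising it (the size below is illustrative, and writing requires root):

```shell
# current receive-buffer limits in bytes
cat /proc/sys/net/core/rmem_default /proc/sys/net/core/rmem_max
# raise the maximum so applications can request larger buffers via SO_RCVBUF:
# sysctl -w net.core.rmem_max=26214400
```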

Simulate packet loss for inbound (iptables) and outbound (tc-netem) traffic; check this post for details.

There are two recent optimizations for the current pipeline:

  1. Split the deployer base image out, and install general common packages at that stage.
  2. Switch the base image flavor from Alpine to Debian.

Item 1 is easy to understand: every CI/CD pipeline run rebuilds the deployer image, and installing a large number of packages is time-consuming. If most of that work is already done in the base image, much less remains.

As for item 2, I had not considered it before: how do you choose a suitable base image, and why switch? Let's start with the different types of Python images (different applications differ; only Python is discussed here).

Python Image Variants

There are 4 types:

  • python:[version]
  • python:[version]-slim
  • python:[version]-alpine
  • python:[version]-windowsservercore

The first two are both Debian-based; they differ only in the number of packages installed. slim installs only the minimal packages needed to run Python. For Debian release code names, such as buster and stretch, see this reference.

Check the Dockerfile of the Python image you use to learn which packages are already installed and avoid duplication, e.g. the slim Dockerfile.

Why switch the Python image from Alpine to Debian? This article sums it up: Using Alpine can make Python Docker builds 50× slower. I compared the Python images we use, and the Debian-based image indeed installs pip requirements.txt much faster, though this may change as the Alpine base evolves.
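A minimal sketch of what the Debian slim choice looks like in a Dockerfile (file names are illustrative); on slim, most pip packages install as prebuilt manylinux wheels instead of compiling against musl as on Alpine:

```dockerfile
FROM python:3.9-slim
COPY requirements.txt .
# wheels install directly; no gcc/musl build step needed for most packages
RUN pip install --no-cache-dir -r requirements.txt
```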

Other content on that site is also well worth reading:

Other good articles:

Dockerfile Best Practice

Dockerfile Best Practices.

# --no-cache: do not rely on build cache
# -f: specify dockerfile
# context: build context location, usually .(current dir)
docker build --no-cache -t helloapp:v2 -f dockerfiles/Dockerfile <context path>

Make sure you do not include unnecessary files in your build context; they result in a larger image. Or use .dockerignore to exclude files from the build context.
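.dockerignore is just a list of patterns relative to the context root; a small illustrative example:

```
# example .dockerignore (entries are illustrative)
.git
*.log
node_modules/
tmp/
```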

Piping the Dockerfile in means no files are sent to the build context:

# COPY cannot be used this way
# -: read the Dockerfile from stdin
echo -e 'FROM busybox\nRUN echo "hello world"' | docker build -
# here document
docker build -<<EOF
FROM busybox
RUN echo "hello world"
EOF

Omitting the build context can be useful in situations where your Dockerfile does not require files to be copied into the image, and improves the build-speed, as no files are sent to the daemon.

Multi-stage builds allow you to drastically reduce the size of your final image, without struggling to reduce the number of intermediate layers and files. For example, the Elasticsearch curator Dockerfile also adopts this workflow:

  • Install tools you need to build your application
  • Install or update library dependencies
  • Generate your application
# the simplest base image
FROM scratch

To reduce complexity, dependencies, file sizes, and build times, avoid installing extra or unnecessary packages just because they might be “nice to have.” For example, you don’t need to include a text editor in a database image.

Only the instructions RUN, COPY, ADD create layers. Other instructions create temporary intermediate images, and do not increase the size of the build.

Sort multi-line arguments; for example, on debian:

# Always combine RUN apt-get update with apt-get install in the same RUN
# otherwise the cached apt-get update layer will be reused on rebuild unless --no-cache is given
RUN apt-get update && apt-get install -y \
bzr \
cvs \
git \
mercurial \
subversion \
&& rm -rf /var/lib/apt/lists/*
# clean up the apt cache by removing /var/lib/apt/lists it reduces the image size

Dockerfile Clause

LABEL can be used to filter images with the -f option of the docker images command.

Using pipe:

# Docker executes these commands using the /bin/sh -c interpreter
RUN set -o pipefail && wget -O - https://some.site | wc -l > /number

# or explicitly specify shell to support -o pipefail
RUN ["/bin/bash", "-c", "set -o pipefail && wget -O - https://some.site | wc -l > /number"]

CMD should rarely be used in the manner of CMD ["param", "param"] in conjunction with ENTRYPOINT, unless you and your expected users are already quite familiar with how ENTRYPOINT works. CMD should almost always be used in the form of CMD ["executable", "param1", "param2"…].

Using ENTRYPOINT with a docker-entrypoint.sh helper script is also common:

COPY ./docker-entrypoint.sh /
ENTRYPOINT ["/docker-entrypoint.sh"]
# CMD is replaced by any command appended to docker run
# docker run -it --rm image_name:tag <param1> <param2> ...
CMD ["--help"]

Each ENV line creates a new intermediate layer, just like RUN commands. This means that even if you unset the environment variable in a future layer, it still persists in this layer and its value can be dumped. To prevent this, and really unset the environment variable, use a RUN command with shell commands, to set, use, and unset the variable all in a single layer. You can separate your commands with ; or &&:

# syntax=docker/dockerfile:1
FROM alpine
RUN export ADMIN_USER="mark" \
&& echo $ADMIN_USER > ./mark \
&& unset ADMIN_USER
CMD sh

Although ADD and COPY are functionally similar, generally speaking, COPY is preferred. If multiple files need to be copied, COPY each one at the step where it is used rather than all in one go; this keeps cache invalidation fine-grained.
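For example (a Python-app sketch; file names are illustrative), copying the dependency manifest before the rest of the source keeps the expensive install layer cached across source-only edits:

```dockerfile
# dependency manifest first: source edits won't invalidate this layer
COPY requirements.txt /app/
RUN pip install --no-cache-dir -r /app/requirements.txt
# now the rest of the source
COPY . /app/
```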

Because image size matters, using ADD to fetch packages from remote URLs is strongly discouraged; you should use curl or wget instead. That way you can delete the files you no longer need after they’ve been extracted and you don’t have to add another layer in your image.

You are strongly encouraged to use VOLUME for any mutable and/or user-serviceable parts of your image. (I rarely use it.)

Avoid installing or using sudo as it has unpredictable TTY and signal-forwarding behavior that can cause problems. If you absolutely need functionality similar to sudo, such as initializing the daemon as root but running it as non-root, consider using gosu.

Lastly, to reduce layers and complexity, avoid switching USER back and forth frequently.

For clarity and reliability, you should always use absolute paths for your WORKDIR.

Think of the ONBUILD command as an instruction the parent Dockerfile gives to the child Dockerfile.

Docker Compose Demo

Github repo: https://github.com/chengdol/InfraTree/tree/master/docker-monitoring

The steps are in README.

Grafana

Grafana document: https://grafana.com/docs/grafana/latest/

Grafana docker image: https://grafana.com/docs/grafana/latest/installation/docker/

  • Running in container
  • Connecting to Prometheus
  • Visualizing query results
  • Packaging the dashboard

Grafana supports querying time-series databases like prometheus and influxdb, and also supports the Elasticsearch logging & analytics database.

## pull image alpine based
docker pull grafana/grafana:7.0.0
docker run --detach --name=grafana --publish-all grafana/grafana:7.0.0

The default login is admin/admin. After login, go to Data Sources, select prometheus and specify the URL, then import data into a dashboard.

Readings

  • Grafana vs. Kibana: The Key Differences to Know The key difference between the two visualization tools stems from their purpose. Grafana is designed for analyzing and visualizing metrics such as system CPU, memory, disk and I/O utilization. Grafana does not allow full-text data querying. Kibana, on the other hand, runs on top of Elasticsearch and is used primarily for analyzing log messages.

Running top in a container within a pod shows the host machine's overall metrics plus container-level process metrics. The reason is that containers inside a pod partially share /proc with the host system, including the paths for memory and CPU information. top uses /proc/stat (host machine) and /proc/<pid>/stat (container process), which are not namespace-aware.

P.S: lxcfs, a FUSE filesystem, can create a container-native /proc, making the container behave more like a VM.

The two methods below collect data from different sources, and they also refer to different metrics.

For k8s OOMKiller events, using kubectl top to predict and track is more accurate.

Kubectl Top

The k8s OOMKiller uses container_memory_working_set_bytes (from cAdvisor metrics; also visible in prometheus if deployed) as the baseline to decide whether to kill a pod. It is an estimate of how much memory cannot be evicted; kubectl top uses this metric as well.

After metrics-server is installed:

# show all containers resource usage inside a pod
kubectl top pod <pod name> --containers
# show pod resource usage
kubectl top pod
# show node resource usage
kubectl top node

In prometheus expression browser, you can get the same value as kubectl top:

# value in MiB
# pod, container are label names; adapt to your case
container_memory_working_set_bytes{pod=~"<pod name>",container=~"<container name>"} / 1024 / 1024

Look at where the Prom alerts get container/pod memory data, i.e. which metric the data source comes from; the same applies when using grafana.

Docker Stats

docker stats collects its memory display data from /sys/fs/cgroup/memory with some calculations; see the explanation below.

On the host machine, display container stats (CPU and memory usage):

# similar to top
docker stats --no-stream <container id>

Actually the docker CLI fetches data from the Docker API, for instance v1.41 (run docker version to see the supported API version); you can get the stats data using curl:

curl --unix-socket /var/run/docker.sock "http:/v1.41/containers/<container id>/stats?stream=false" | jq
"memory_stats": {
"usage": 80388096,
"max_usage": 148967424,
"stats": {
"active_anon": 9678848,
"active_file": 13766656,
"cache": 31219712,
"inactive_file": 17829888
...
"total_inactive_file": 17829888,

From this docker stats description: On Linux, the Docker CLI reports memory usage by subtracting cache usage from the total memory usage. The API does not perform such a calculation but rather provides the total memory usage and the amount from the cache so that clients can use the data as needed. The cache usage is defined as the value of total_inactive_file field in the memory.stat file on cgroup v1 hosts.

On Docker 19.03 and older, the cache usage was defined as the value of cache field. On cgroup v2 hosts, the cache usage is defined as the value of inactive_file field.

memory_stats.usage is from /sys/fs/cgroup/memory/memory.usage_in_bytes. memory_stats.stats.inactive_file is from /sys/fs/cgroup/memory/memory.stat.

So here it is:

80388096 - 17829888 = 62558208 bytes => 59.66 MiB

This perfectly matches the docker stats value in the MEM USAGE column.

The dockershim is deprecated in k8s! If the containerd runtime is used instead, to explore metrics you can check cgroups on the host machine, or go into the container and check /sys/fs/cgroup/cpu.

To calculate container memory usage the way docker stats does, inside the pod and without installing third-party tools:

# memory in Mib: used
cd /sys/fs/cgroup/memory
cat memory.usage_in_bytes | numfmt --to=iec

# memory in Mib: used - inactive(cache)
cd /sys/fs/cgroup/memory
used=$(cat memory.usage_in_bytes)
inactive=$(grep -w inactive_file memory.stat | awk '{print $2}')
# numfmt: readable format
echo $(($used-$inactive)) | numfmt --to=iec

To calculate container cpu usage the way docker stats does, inside the pod and without installing third-party tools:

# cpu, cpuacct dir are softlinks
cd /sys/fs/cgroup/cpu,cpuacct
# cpuacct.stat:
# Reports the total CPU time in nanoseconds
# spent in user and system mode by all tasks in the cgroup.
utime_start=$(cat cpuacct.stat| grep user | awk '{print $2}')
stime_start=$(cat cpuacct.stat| grep system | awk '{print $2}')
sleep 1
utime_end=$(cat cpuacct.stat| grep user | awk '{print $2}')
stime_end=$(cat cpuacct.stat| grep system | awk '{print $2}')
# getconf CLK_TCK aka sysconf(_SC_CLK_TCK) returns USER_HZ
# aka CLOCKS_PER_SEC which seems to be always
# 100 independent of the kernel configuration.
HZ=$(getconf CLK_TCK)
# get container cpu usage
# on top of user/system cpu time
echo $(( (utime_end+stime_end-utime_start-stime_start)*100/HZ/1 )) "%"
# if the outcome is 200%, it means 2 CPUs' worth of usage, and so on

Readings

  • How much is too much? The Linux OOMKiller and “used” memory: we can see from this experiment that container_memory_usage_bytes does account for some filesystem pages that are being cached. We can also see that the OOMKiller tracks container_memory_working_set_bytes. This makes sense, as shared filesystem cache pages can be evicted from memory at any time; there is no point in killing the process just for using disk I/O.

  • Kubernetes top vs Linux top: kubectl top shows metrics for a given pod. That information is based on reports from cAdvisor, which collects real pod resource usage.

  • cAdvisor: container advisor: cAdvisor (Container Advisor, a go project) provides container users an understanding of the resource usage and performance characteristics of their running containers.

//TODO: [ ] see reading section [ ] series https://blog.freshtracks.io/a-deep-dive-into-kubernetes-metrics-b190cc97f0f6 [ ] prom config with service discovery, for example consul

Docker Compose Demo

Github repo: https://github.com/chengdol/InfraTree/tree/master/docker-monitoring

The steps are in README.

Prometheus

Open-source monitoring and alerting system: https://prometheus.io/ Prometheus collects and stores its metrics as time-series data, i.e. metrics information is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels.

Architecture

Learning targets:

  • Know how to set up prometheus cluster for testing purpose
  • Know how to configure prometheus/alertmanager/grafana
  • Know how to export different kind of metrics
  • Know how to write PromQL
  • Know how to integrate with Grafana

Metric type

Explanation (counter, gauge, histogram, summary): https://www.youtube.com/watch?v=nJMRmhbY5hY

Counter: request count, tasks completed, error count, etc. Query how fast the value is increasing; rate() only applies to counters as they are monotonically increasing.

Gauge: memory usage, queue size, kafka lag, etc. A value that can go up and down; for example, use avg_over_time() on the gauge type.

Histogram: duration of http requests, response size, etc. Used to later calculate averages and percentiles when an approximation is acceptable. You can use the default buckets or customize your own. Bucket values are cumulative: each observation is added to every bucket whose upper bound is greater than or equal to it.
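
The cumulative bucketing can be illustrated with a toy shell sketch (the observations and bucket bounds below are made up; in practice the client library maintains the buckets):

```shell
# four hypothetical request durations in seconds
observations="0.05 0.2 0.4 1.2"
# each observation counts toward every bucket whose upper bound (le) >= value
for le in 0.1 0.5 1; do
  count=0
  for v in $observations; do
    # shell can't compare floats, so delegate the comparison to awk
    if [ "$(awk -v v="$v" -v le="$le" 'BEGIN { print (v <= le) }')" = 1 ]; then
      count=$((count + 1))
    fi
  done
  echo "le=$le count=$count"
done
echo "le=+Inf count=4"   # the +Inf bucket always equals the total count
```

Note how the counts never decrease as le grows: 1, 3, 3, 4.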

Summary: duration of http requests, response size, etc. More complex than a histogram; use it when you have no idea of the value range and therefore cannot define histogram buckets.

PromQL

First, understand metric type in prometheus https://www.youtube.com/watch?v=nJMRmhbY5hY

Helpful promQL visualizing tool, cheat sheet: https://promlabs.com/promql-cheat-sheet

How to know the labels of a specific metric? In the prometheus expression browser, run the metric name and check the console output; it will contain all labels of that metric.

Alert Expr

Excluding Time Slot from Alert Expr

This is helpful when we know a given time window is a no-op period. Prometheus now supports time-based muting:

# excluding specific time slot
some_metrics_vector and ON() absent(day_of_week() == 0 AND hour() >=3 < 4 AND minute() >= 10 < 50)

For the explanation, see here. Note that the logical operators are case-insensitive: and or AND, either is fine.

You can verify it in the prometheus expression browser first, then write it into the alert expr.

Tips: when debugging an alert, to see the instance or job label value in the description, for example:

annotations:
description: '{{ $labels.instance }} is not responding for 10 minutes.'

Just run the alert expression manually in the expression browser and modify it to see the output labels.

Capture the Counter First Event

There are cases where we want to capture the first 0 (non-existence) -> 1 counter event and fire an alert. This can be captured by unless + offset; after that first event, we can use increase to catch further increments:

# ((0 -> 1 case capture) or (1 -> 1+ case capture))
((_metric_counter_ unless _metric_counter_ offset 15m) or (increase(_metric_counter_[15m]))) > 0

Query Example

Here I list some examples to explain and practice common PromQL. Some of them are from Grafana dashboards so they have embedded variables, but the syntax and usage are the same in the prometheus expression browser and Grafana.

Understand instant and range vector and how rate and irate works: https://www.metricfire.com/blog/understanding-the-prometheus-rate-function/

rate (average rate!) and irate (instant rate, last 2 data points only) calculate the per-second rate at which a value is increasing over a period of time; they automatically adjust for counter resets. If you want to use any other aggregation (such as sum) together with rate, you must apply rate first, otherwise the counter resets will not be caught and you will get weird results.

irate(spike) should only be used when graphing volatile, fast-moving counters.

Use rate(trend) for alerts and slow-moving counters, as brief changes in the rate can reset the FOR clause and graphs consisting entirely of rare spikes are hard to read.

Also remember: rate first, then aggregation, never the other way around: https://www.robustperception.io/rate-then-sum-never-sum-then-rate
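
As a minimal sketch of the correct ordering (the metric name is illustrative, not from this document):

```
# rate first, then sum: counter resets in each underlying series are
# corrected before aggregation
sum by (job) (rate(http_requests_total[5m]))
```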

For group_left (many-to-one!) and group_right (one-to-many!), see the example here.
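
A minimal group_left sketch, assuming hypothetical metrics http_requests_total (many series per instance) and node_build_info (one series per instance, carrying a version label):

```
# many-to-one: every http_requests_total series for an instance matches the
# single node_build_info series for that instance; group_left(version)
# copies the version label onto each result series
rate(http_requests_total[5m])
  * on (instance) group_left(version)
node_build_info
```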

One query example for system load average dashboard:

# ${interval}, ${load}, ${service}, $env:
# these variables are defined by the dashboard config variables

# explanation of label_replace:
# in the vector returned by avg_over_time(), check whether the instance
# label matches the regex $env-(.+); if it matches, $1 is the value captured
# by the first (.+), and label_replace returns a new vector with an extra
# label instance_group=$1; if there is no match, the original vector is
# returned unchanged
avg(
label_replace(avg_over_time(node_load${load}{instance=~"^.+-${service:regex}-[0-9]+$"}[${interval}]),
"instance_group",
"$1",
"instance",
"$env-(.+)")
) by (instance_group) > 6
# then average the new vector, group it by the instance_group label, and
# check whether the group-level average LA > 6

The above variables are of the Query type: https://grafana.com/docs/grafana/latest/variables/variable-types/add-query-variable/

Grafana

As the key visual component of the monitoring system, Grafana is covered in the separate post <<Grafana Quick Start>>.

AlertManager

How to connect alertmanager to prometheus: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#alertmanager_config

How to config alertmanager itself: https://prometheus.io/docs/alerting/latest/configuration/

Config example to start: https://github.com/prometheus/alertmanager/blob/main/doc/examples/simple.yml Tool to generate routing tree: https://prometheus.io/docs/alerting/latest/configuration/

Alertmanager repo and docker: https://github.com/prometheus/alertmanager

The example alertmanager start command:

/bin/alertmanager --config.file=/etc/config/alertmanager.yml --storage.path=/data --web.route-prefix=/ --web.external-url=https://xxx.xxx/alertmanager

Run identical Prometheus servers on two or more separate machines. Identical alerts will be deduplicated by the Alertmanager.

For high availability of the Alertmanager, you can run multiple instances in a Mesh cluster and configure the Prometheus servers to send notifications to each of them.

To silence one alert, use New Silence and in the matcher use alertname as the key and the alert name as the value (you can add more key-value pairs to filter further). To silence multiple alerts, use a regex. Preview silence shows you how many currently active alerts are affected, or you can just silence it so no new alert will come.

Integrated with K8s

https://www.youtube.com/watch?v=bErGEHf6GCc&list=PLpbcUe4chE7-HuslXKj1MB10ncorfzEGa https://www.youtube.com/watch?v=CmPdyvgmw-A https://www.youtube.com/watch?v=h4Sl21AKiDg

https://www.youtube.com/watch?v=5o37CGlNLr8 https://www.youtube.com/watch?v=LQpmeb7idt8

Readings

Run the following commands on any of the Kafka cluster nodes.

Topic

# list topics created
./bin/kafka-topics.sh \
--bootstrap-server localhost:9092 \
--list

# display:
# number of partitions of this topic
# replica factor
# overridden configs
# in-sync replicas
./bin/kafka-topics.sh \
--bootstrap-server localhost:9092 \
--topic <topic name> \
--describe
# the output for example:
# "Isr" is a status, it shows which replica is in-sync, the below means all replicas are good
# Configs field shows the override settings of default
Topic: apple PartitionCount: 3 ReplicationFactor: 3 Configs: cleanup.policy=delete,segment.bytes=536870912,retention.ms=172800000,retention.bytes=2000000000
Topic: apple Partition: 0 Leader: 28 Replicas: 28,29,27 Isr: 28,29,27
Topic: apple Partition: 1 Leader: 29 Replicas: 29,27,28 Isr: 29,27,28
Topic: apple Partition: 2 Leader: 27 Replicas: 27,28,29 Isr: 27,28,29

# only show overridden config
./bin/kafka-topics.sh \
--bootstrap-server localhost:9092 \
--topics-with-overrides \
--topic <topic name> \
--describe
# or using kafka-config
./bin/kafka-configs.sh \
--zookeeper <zookeeper>:2181 \
--entity-type topics \
--entity-name <topic name> \
--describe

# add or update topic config
./bin/kafka-configs.sh \
--zookeeper <zookeeper>:2181 \
--entity-type topics \
--entity-name <topic name> \
--alter \
--add-config 'max.message.bytes=16777216'

# remove topic overrides
./bin/kafka-configs.sh \
--zookeeper <zookeeper>:2181 \
--entity-type topics \
--entity-name <topic name> \
--alter \
--delete-config 'max.message.bytes'

# list all consumer groups of a topic
# no direct command

Consumer Group

# list consumer groups
./bin/kafka-consumer-groups.sh \
--bootstrap-server localhost:9092 \
--list

# check partition/offset/lag messages in each consumer group/topic
# also see topics consumed by the group

# Consumer lag indicates the lag between Kafka producers and consumers. If the rate of
# production of data far exceeds the rate at which it is getting consumed, consumer
# groups will exhibit lag.
# From column name:
# LAG = LOG-END-OFFSET - CURRENT-OFFSET
# CLIENT-ID: xxx-0-0: means consumer 0 and its worker thread 0
./bin/kafka-consumer-groups.sh \
--bootstrap-server localhost:9092 \
--group <consumer group name> \
--describe

# read one message from the topic at this consumer group's current offset
# note that one topic can be consumed by different consumer groups,
# each with its own separate consumer offset
./bin/kafka-console-consumer.sh \
--bootstrap-server localhost:9092 \
--topic <topic name> \
--group <consumer group name> \
--max-messages 1
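
As a sketch of the LAG arithmetic, the describe output can be post-processed with awk (the sample rows below are fabricated; the column positions follow the header):

```shell
# LAG = LOG-END-OFFSET - CURRENT-OFFSET, summed across partitions
cat <<'EOF' | awk 'NR > 1 { lag += $5 - $4 } END { print "total lag: " lag }'
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG
demo  apple 0         100            150            50
demo  apple 1         200            210            10
EOF
# → total lag: 60
```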

Partition

# increase partition number
# the partition number can only be increased, never decreased
./bin/kafka-topics.sh \
--bootstrap-server localhost:9092 \
--topic <topic name> \
--partitions <new partition number> \
--alter

Delete Messages

If there are bad messages that get the consumer stuck, we can delete them from the specified partition:

./bin/kafka-delete-records.sh \
--bootstrap-server localhost:9092 \
--offset-json-file deletion.json
{
"partitions": [
{
"topic": "<topic name>",
// partition number, such as 0
"partition": 0,
// offset, delete all message from the beginning of partition till this
// offset(excluded).
// The offset specified is one higher than the problematic offset reported
// in the log
"offset": 149615102
}
],
// check ./bin/kafka-delete-records.sh --help to see the version
"version": 1
}

Note that if all messages need to be deleted from the topic, specify an offset of -1 in the JSON.
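
Also note that the // comments in the JSON above are explanatory only; the actual deletion.json must be plain JSON without comments. For example, to purge everything from partition 0 of a topic:

```
{
  "partitions": [
    { "topic": "<topic name>", "partition": 0, "offset": -1 }
  ],
  "version": 1
}
```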
