Occasionally a VM instance reboots in an unplanned way and triggers an alert. Usually this is due to a hardware or software issue on the physical machine hosting the VM that causes the VM to crash.

Running last reboot shows the reboot records, and who -b displays the last boot time.

For CentOS/RHEL systems you'll find the logs at /var/log/messages, while for Ubuntu/Debian systems they are logged at /var/log/syslog.

# exclude irrelevant entries
# then grep around for likely keywords
sudo grep -iv ': starting\|kernel: .*: Power Button\|watching system buttons\|Stopped Cleaning Up\|Started Crash recovery kernel' /var/log/messages | grep -iw 'recover[a-z]*\|power[a-z]*\|shut[a-z ]*down\|rsyslogd\|ups'

From the command output above, skim the /var/log/messages file around the timestamp. Example output:

May  3 23:31:52 xxxxx systemd: Started Update UTMP about System Boot/Shutdown.
May 3 23:31:55 xxxxx rsyslogd: [origin software="rsyslogd" swVersion="8.24.0-52.el7" x-pid="949" x-info="http://www.rsyslog.com"] start
May 3 23:31:55 xxxxx systemd: Started Google Compute Engine Shutdown Scripts.
May 3 23:37:43 xxxxx audispd: node=xxxxx type=EXECVE msg=audit(1620085063.649:1984): argc=3 a0="last" a1="-x" a2="shutdown"

I see the reboot was caused by the shutdown scripts. Further checking the VM instance log or the StackDriver log (on a public cloud platform, checking the platform's logging system is more convenient than SSHing in to inspect VM logs) yields the error:

compute.instances.hostError

Then the cause was clear. But what exactly happens during a host error?

This kind of operation is useful when the initiator is offline or the changes are straightforward, so there is no need to wait for the initiator to get involved.

For a direct PR or MR without a fork, the steps are (a full command sketch follows the list):

  1. checkout target branch of the PR
  2. perform changes on this branch
  3. git add --all
  4. git commit --amend
  5. git push -f
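A minimal sketch of the sequence above, assuming the PR's source branch is named fix-typo on the origin remote (branch and remote names are hypothetical):

# check out the PR's source branch
git fetch origin fix-typo
git checkout fix-typo

# make the changes, then fold them into the existing commit
git add --all
git commit --amend --no-edit

# force-push the amended commit back to the PR branch
git push -f origin fix-typo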

For fork scenario, see Adding Commits to Someone Else’s Pull Request.

It is tremendously helpful for understanding, troubleshooting, and resolving system performance problems; this book is truly an on-call saver. I am very interested in this area, and I feel lucky to have encountered it now rather than a few years later.

I also have a blog post summarizing system performance tuning: <<Linux Performance Tuning>>.

[x] Read the whole book within 6 months and master its key points through hands-on experiments. [x] 09/18/2021 revisit

yum versionlock restricts a package to a fixed version against yum update/upgrade.

The plugin stores a package list in /etc/yum/pluginconf.d/versionlock.list, which you can edit directly. Yum will normally attempt to update all packages, but the plugin will exclude the packages listed in the versionlock.list file.

# install version lock
yum install yum-plugin-versionlock

# list lock
yum versionlock list [package name]

# delete lock
yum versionlock delete 0:elasticsearch*
# clear all version lock
yum versionlock clear

# add lock
yum versionlock add elasticsearch-7.10.2

When I was working at IBM, I applied for a dedicated cluster for Ansible learning. After I left, I decided to use Vagrant to create a local cluster for the same purpose.

NOTE: I have also created a Docker-based Ansible testing environment, please see here

Please check the Vagrant Ansible testing cluster repo. Follow the README to set up and play with Ansible. The problems I had at the time of creating the repo:

  1. how to establish the SSH connection to the Vagrant VM.
  2. the sed insert behaves subtly differently on macOS.

Ansible Install

Ansible Install guide

  • Control node requirements: Starting with ansible-core 2.11, the project will only be packaged for Python 3.8 and newer.

If you are using Ansible to manage machines in a cloud, consider using a machine inside that cloud as your control node. In most cases Ansible will perform better from a machine on the cloud than from a machine on the open Internet.

Managed node requirements: Although you do not need a daemon on your managed nodes, you do need a way for Ansible to communicate with them. For most managed nodes, Ansible makes a connection over SSH and transfers modules using SFTP. For any machine or device that can run Python, you also need Python 2 (version 2.6 or later) or Python 3 (version 3.5 or later).

If installing on Linux using yum (I use pip install in a virtualenv in the demo; see the repo README):

# search ansible package
# ansible.noarch
# ansible-python3.noarch
# SSH-based configuration management, deployment, and task execution system
yum search ansible
# python2
sudo yum install -y -q ansible
# python3
sudo yum install -y -q ansible-python3

Ansible Inventory

How to build your inventory: the inventory file mainly involves SSH connection settings. Target hosts can be addressed by group.
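A minimal sketch of such an inventory (host names, IPs, user, and key path are all assumptions for illustration):

# write a minimal INI inventory (all values hypothetical)
cat > vagrant_ansible_inventory.ini <<'EOF'
[workers]
worker1 ansible_host=192.168.56.11
worker2 ansible_host=192.168.56.12

[workers:vars]
ansible_user=vagrant
ansible_ssh_private_key_file=~/.vagrant.d/insecure_private_key
EOF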

Ansible Config

Ansible Configuration Settings for ansible.cfg file.

Ansible Yaml Format

YAML syntax, especially the difference between > and | for multi-line strings:

Spanning multiple lines using a | will include the newlines and any trailing spaces. Using a > will fold newlines to spaces; In either case the indentation will be ignored.

include_newlines: |
  exactly as you see
  will appear these three
  lines of poetry

fold_newlines: >
  this is really a
  single line of text
  despite appearances
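To see the difference concretely, a quick check with PyYAML (assuming python3 with PyYAML is available, and the snippet above saved as multiline.yml):

# compare how '|' and '>' parse
python3 - <<'EOF'
import yaml
doc = yaml.safe_load(open("multiline.yml"))
print(repr(doc["include_newlines"]))   # '|' keeps the newlines
print(repr(doc["fold_newlines"]))      # '>' folds newlines into spaces
EOF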

Ansible Run

Ad-hoc command example:

# -v: verbose, display output
# can specify single machine
ansible -v -i vagrant_ansible_inventory.ini worker1 -m ping
# all
ansible -v -i vagrant_ansible_inventory.ini all -m shell -a 'echo $(whoami)'

Playbook roles: check the role directory structure and how to use roles.

# -e|--extra-vars: pass extra variables
# -b: become
# -v: verbose
ansible-playbook [-b] -v -i vagrant_ansible_inventory.ini setup.yml \
-e '{"version":"1.10.5","other_variable":"foo"}' # json format

ansible-playbook [-b] -v -i vagrant_ansible_inventory.ini setup.yml \
-e "foo=23" \
-e "bar=hello"

Reference: Reuse existing Persistent Volume

Background: A prod env upgrade failed, and the rollback led to historical monitoring data being lost. The root cause was that the old PVC was removed accidentally, so the corresponding PV got released; thankfully the PV reclaim policy is Retain, so the data on disk was still preserved.

The difference from cases I encountered before is that a storage class is used here, and the PV definition has a claimRef that does the binding.

Question: How to access the historical data on the released PV? Obviously it needs to be re-bound.

Solution: The PV is actually provisioned dynamically by the custom storage class gce-regional-ssd. In its definition, it is reserved for a specific PVC via the claimRef field:

claimRef:
  apiVersion: v1
  kind: PersistentVolumeClaim
  name: monitoring-alertmanager
  namespace: monitoring
  resourceVersion: "21865"
  uid: 4036d03f-fe2f-4d6f-bae8-dd67f33ad423

Since the PVC monitoring-alertmanager is already bound to another PV, to make this one available, kubectl edit the PV to remove the uid and resourceVersion, modify the name, then save and quit:

claimRef:
  apiVersion: v1
  kind: PersistentVolumeClaim
  name: monitoring-alertmanager-old
  namespace: monitoring

Now the PV is available only to a PVC named monitoring-alertmanager-old. Alternatively, if you set claimRef to empty, the PV will be open to all PVCs.
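The same claimRef surgery can be scripted with kubectl patch instead of interactive editing (the PV name here is hypothetical; a JSON merge patch with null deletes a field):

# drop the stale binding metadata and retarget the claim name
kubectl patch pv pvc-4036d03f-fe2f-4d6f-bae8-dd67f33ad423 --type merge -p '
spec:
  claimRef:
    name: monitoring-alertmanager-old
    resourceVersion: null
    uid: null
'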

Then creating that PVC to consume the PV (binding):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  labels:
    app: monitoring
    component: alertmanager
  name: monitoring-alertmanager-old
  namespace: monitoring
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: gce-regional-ssd
  volumeMode: Filesystem

Then mount the PVC to your target resource to access the data.

Actually, if you describe the PV, you can find the specific Persistent Disk in GCE (under Compute Engine). To be safe, snapshot that persistent disk before doing anything else:

Source:
    Type:      GCEPersistentDisk (a Persistent Disk resource in Google Compute Engine)
    PDName:    gke-us-east4-f3e3fedd--pvc-4036d03f-fe2f-4d6f-bae8-dd67f33ad423
    FSType:    ext4
    Partition: 0
    ReadOnly:  false
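A sketch of that snapshot step with gcloud (the zone and snapshot name are assumptions; the disk name comes from the PDName field above):

# back up the backing persistent disk before touching the PV
gcloud compute disks snapshot \
  gke-us-east4-f3e3fedd--pvc-4036d03f-fe2f-4d6f-bae8-dd67f33ad423 \
  --zone us-east4-a \
  --snapshot-names alertmanager-pv-backup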

I was on-call this past week, so here I organize the content of LinkedIn Learning: Linux Performance. It focuses on four areas: CPU, Memory, Disk and FileSystem IO, and Network. In fact, many alerts or incidents are derived from performance issues; of course we also have insight checks for the services themselves. I recommend reading <<How Linux Works>>, which covers all of this.

In addition, the Geek Time (极客时间) course <<Linux性能优化实战>> (Linux Performance Optimization in Practice) is excellent; I benefited a great deal. I also read the book <<Systems Performance>>.

From Geek Time (极客时间)

Performance guru Brendan Gregg's personal website

For a more comprehensive cheat sheet, please switch to <<On-Call System Performance>>.

CPU

Understand the meaning of the uptime command (check man). Load average is not CPU usage. Load average is the average number of processes in the runnable state (R) and uninterruptible state (D) per unit of time, i.e., the average number of active processes (this is a useful mental model; in the source code it is actually an exponentially decaying average of the active process count). It has no direct relationship with CPU usage. (Sometimes compare the system uptime too; some incidents turn out to be caused by a system reboot.)

# check CPU number
lscpu
# check load average
uptime
w
# or using top with 1 to show each cpu
# and load average
top -b -n 1 | head

About the uninterruptible state: when a process is reading or writing data to disk, to guarantee data consistency it cannot be interrupted by other processes or by interrupts before the disk replies; at that moment the process is in the uninterruptible state. If the process were interrupted then, disk data could end up inconsistent with process data. The uninterruptible state is really a protection mechanism the system applies to processes and hardware devices.

Combining top and ps to look at a process's status code, such as SLsl, is very informative; you can learn about the process's composition, e.g., whether it is multi-threaded.

man ps
# PROCESS STATE CODES
D uninterruptible sleep (usually IO)
R running or runnable (on run queue)
S interruptible sleep (waiting for an event to complete)
T stopped by job control signal
t stopped by debugger during the tracing
W paging (not valid since the 2.6.xx kernel)
X dead (should never be seen)
Z defunct ("zombie") process, terminated but not reaped by its parent
I idle
# For BSD formats and when the stat keyword is used, additional characters may be displayed:
< high-priority (not nice to other users)
N low-priority (nice to other users)
L has pages locked into memory (for real-time and custom IO)
s is a session leader (a session is one or more process groups sharing the same controlling terminal)
l is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)
+ is in the foreground process group (a process group is a set of related processes; e.g., every child process is a member of its parent's group)

For example, what does a load average of 2 mean? It depends on whether the CPU count is greater than, equal to, or less than 2. Check lscpu or grep 'model name' /proc/cpuinfo | wc -l to see the logical CPU count.
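A quick sketch to put the load average next to the CPU count:

# 1-minute load average divided by logical CPU count
# a ratio around 0.7 or above is worth investigating
echo "scale=2; $(cut -d' ' -f1 /proc/loadavg) / $(nproc)" | bc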

The averages over three different time windows give us data for analyzing the load trend, providing a fuller, more three-dimensional picture of the current load. When the load average rises above 70% of the CPU count, you should start analyzing and troubleshooting. Once the load is too high, processes may respond slowly and normal service functionality suffers. The most recommended approach is to monitor the system load average (prometheus + grafana), then judge the trend from historical data. When you see an obvious upward trend, say the load has doubled, then dig in and investigate.

When you find the load is high, you can use tools such as iostat, mpstat, and pidstat to help trace where the load comes from.

# stress cpu with 1 process
stress --cpu 1 --timeout 600
# stress io
# stress -i 1 --timeout 600 does not work well
# because VM sync buffer is small
stress-ng --io 1 --hdd 1 --timeout 600
# stress cpu with 8 processes
stress -c 8 --timeout 600

# check uptime updates
# -d: highlight the successive difference
watch -d "uptime"

# check all cpus status
# determine whether high CPU usage is due to iowait or computing
mpstat -P ALL 1

# check which process cause cpu usage high
# -u: cpu status
pidstat -u 5 2
# -d: io status
# -p: pid
pidstat -d -p 12345 5 2

When viewing pidstat output there may be no %wait column; you need to upgrade to version 11.5.5 (CentOS 8), see the sysstat git repo. Or, if installation is not possible on a prod VM, simply drop the pre-compiled binaries onto the system and run them.

Understand CPU context switching: it divides into process context switches, thread context switches, and interrupt context switches.

Note that a system call does not touch user-space resources such as virtual memory, and it does not switch processes. This differs from what we usually call a process context switch, which means switching from one process to another; during a system call the same process keeps running. So a system call is usually called a privilege mode switch rather than a context switch. That said, CPU context switching within a system call is still unavoidable.

The biggest difference between threads and processes is that the thread is the basic unit of scheduling, while the process is the basic unit of resource ownership. Put plainly, what the kernel scheduler actually schedules are threads; the process merely provides threads with resources such as virtual memory and global variables.

Unlike process context, an interrupt context switch does not involve a process's user space. So even if an interrupt preempts a process in user mode, there is no need to save and restore that process's virtual memory, global variables, and other user-space resources. The interrupt context only includes the state needed to execute the kernel's interrupt service routine: CPU registers, kernel stack, hardware interrupt parameters, and so on.

vmstat is a common system performance analysis tool, mainly used to analyze memory usage, and also often used to analyze CPU context switch and interrupt counts. sysbench is a multi-threaded benchmark tool, generally used to evaluate database load under different system parameters.

# benchmark tool
sysbench --threads=10 --max-time=300 threads run

# overall system performance
# focus on in, cs, r, b; check man for descriptions
# note whether the r count far exceeds the CPU count
# too many interrupts (in) is also a problem
# us/sy show whether the CPU is mostly in user or kernel space
# -w: wide display
# -S: unit m(mb)
# 2: profile interval
vmstat -w -S m 2

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff   cache   si  so   bi   bo    in    cs us sy id wa st
 2  0      0 230548     24 4068628    0   0    0   33 16436 29165 25  2 73  0  0
 1  0      0 226888     24 4069028    0   0    0   43 15443 27179 39  2 59  0  0
 3  0      0 225588     24 4070544    0   0    0  523 20873 38865 36  2 61  0  0

# check per-process performance
# only show at least one column non-zero item
# -w: show context switch
# -u: show cpu stat
pidstat -w -u 2
# -t: thread level, helpful!
# -p: pid
# check both process and its thread context switch status
pidstat -wt -p 14233 2

# analyze interrupt, good point to check proc file
# focus on the most frequently changing entries, then look up what behavior they indicate
watch -d 'cat /proc/interrupts'
# RES: rescheduling interrupts

If the context switch count is fairly stable, anything from a few hundred up to ten thousand should be considered normal. But when context switches exceed ten thousand per second, or the count jumps by an order of magnitude, a performance problem has very likely appeared.

Comparing the two stress tools: stress is multi-process based, forking multiple processes and causing process context switches, so CPU %us overhead is high; sysbench is multi-threaded, creating many threads inside a single process that switch as kernel threads, so CPU %sy kernel overhead is high.

A line of thinking: log into the server and see how the system load looks. If it is high, there are three cases: high CPU usage, high I/O usage, or both. High CPU usage may truly mean heavy computation, or it may mean actual processing is light but too many processes switch context frequently, or threads within a process switch context frequently. High I/O usage means I/O requests are heavy, possibly file I/O or network I/O.

Remember this: when you face an inexplicable CPU usage problem, first check whether short-lived applications are the culprit. They run so briefly that they are hard to spot in tools like top or ps, which show system summaries and process snapshots; you need event-recording tools such as execsnoop or perf top to assist the diagnosis.

When iowait rises, processes are likely stuck in the uninterruptible state for long periods because they cannot get a response from hardware. In ps or top output they show state D, i.e., uninterruptible. Processes in state D push the load average up; processes in state I do not.

Normally the uninterruptible state ends within a very short time, so short-lived uninterruptible processes can generally be ignored. But if the system or hardware has a fault, a process may stay in state D for a long time, and the system may even fill with uninterruptible processes; then you must check whether the system has I/O or other performance problems. As for zombie processes: after a process creates a child, it should call wait() or waitpid() to wait for the child to exit and reclaim its resources; the child sends SIGCHLD to its parent when it exits, so the parent can also register a SIGCHLD handler to reclaim resources asynchronously. If the parent never handles the child's termination, the child stays a zombie forever. Large numbers of zombies exhaust the PID space and prevent new processes from being created, so this situation must be avoided.

# versatile tool for generating system resource statistics
dstat

strace is the most common tool for tracing a process's system calls, but zombie processes have already exited, so their system calls can no longer be analyzed:

# -f: trace threads generated
strace -p <pid> [-f]

strace hanging at futex, FUTEX_WAIT: The process is multi-threaded, you are tracing the original parent thread, and it’s doing nothing but waiting for some other threads to finish.

If tools like top and pidstat can no longer give more information, we should turn to dynamic tracing tools based on event recording.

# wait for 15 seconds
# ctrl+c to quit
perf record -g
# check
perf report

High iowait does not necessarily mean an I/O performance bottleneck. When only I/O-type processes are running, iowait can be high even though disk reads and writes are nowhere near their limit. When iowait rises, first use tools like dstat and pidstat to confirm whether it is a disk I/O problem, then find which processes are responsible.

Besides iowait, soft interrupts (softirq) driving CPU usage up is another of the most common performance problems.

To solve the problems of interrupt handlers running too long and interrupts being lost, Linux splits interrupt handling into two phases, the top half and the bottom half: the top half handles the interrupt quickly, running with interrupts disabled, mainly doing hardware-related or time-sensitive work; the bottom half defers the work the top half did not finish and usually runs as kernel threads.

The top half directly handles the hardware request, i.e., what we call the hard interrupt, characterized by fast execution; the bottom half is triggered by the kernel, i.e., the soft interrupt, characterized by deferred execution.

The top half interrupts the task the CPU is executing and immediately runs the interrupt handler. The bottom half runs as kernel threads, one per CPU, named "ksoftirqd/<CPU number>"; for example, the softirq kernel thread for CPU 0 is ksoftirqd/0.

I often heard colleagues say that masses of small network packets cause performance problems and never quite understood why; from today's lesson, masses of small packets cause frequent hard and soft interrupts. Softirq problems are most common on high-traffic networks.

Note again that CPU usage and load average are not directly related, as mentioned at the start; they are defined differently.

# test tool
# -S: set the TCP SYN flag; -p: destination port 80
# -i u100: send a network frame every 100 microseconds
hping3 -S -p 80 -i u100 192.168.0.30

# check cpu %si and %hi
# load average may be low
top

# check softirq frequencies and category
# usually NET_RX ratio change a lot
watch -d cat /proc/softirqs

# check network rx/tx packet vs KBS per second
# -n DEV: report network rx/tx, one sample per second
# can calculate bytes/packet
sar -n DEV 1

# -i eth0: capture only on the eth0 NIC
# -n: do not resolve protocol names or host names
# tcp port 80: capture only TCP frames on port 80
tcpdump -i eth0 -n tcp port 80

In tcpdump output, watch the Flag [x] type; e.g., Flag [S] means a SYN packet. You can find the source IP and isolate it via the firewall.

[ ] cassandra perf analysis find cause?
[x] why cassandra Sleep with high CPU usage? -> multiple threads are running
[x] why process is S but some threads are R; the parent is doing nothing but waiting for child threads

Hello Teacher Ni, I saw an explanation of the iowait metric online that is very vivid, but I am not sure whether it is accurate; please help verify, thanks. Link: http://linuxperf.com/?p=33

sar -w, or sar -w 1, can also intuitively show the number of threads or processes created per second. Brendan Gregg is truly a master of this field, having contributed many technical ideas and practices. His <<Systems Performance>> can be read alongside this course for deeper understanding.

Memory

Understanding concepts:

Here I looked up what tmpfs is and what it is used for:

  • what is tmpfs
  • tmpfs 详解 (tmpfs explained): It is intended to appear as a mounted file system, but data is stored in volatile memory instead of a persistent storage device. A similar construction is a RAM disk, which appears as a virtual disk drive and hosts a disk file system. tmpfs is meant only for ephemeral files
# resize mounted tmpfs size
# only temporary
mount -o remount,size=300M tmpfs /dev/shm
# -T: see file system type
df -hT

# create new tmpfs mounted
mkdir /data
# -t: type
# -o: options
# second tmpfs: device name
mount -t tmpfs -o size=100M tmpfs /data
umount /data

tmpfs has fairly broad uses; on Linux you can place some programs' temporary files in tmpfs, exploiting the fact that tmpfs is faster than disk to improve system performance.

You need to understand the memory-related fields in free, top, and ps, for example: buffers, cached, shared.

/proc is a special filesystem provided by the Linux kernel and serves as the interface between users and the kernel. For example, users can query the kernel's runtime status and configuration options from /proc, query process status and statistics, and also modify kernel configuration through /proc.

# generate files
dd if=/dev/urandom of=/tmp/file bs=1M count=500
# check bi/bo and mem
vmstat 1 20

Reads and writes of regular files go through the filesystem, which handles interaction with the disk; reads and writes of a disk or partition directly bypass the filesystem, so-called "raw I/O". The two paths use different caches, which is exactly the Cache vs. Buffer distinction discussed in the text.

My analysis steps: use top or ps to find the processes occupying lots of memory, then use cat /proc/[pid]/status and pmap -x <pid> to examine a process's memory usage and how it changes over time.

To see how effective the cache really is, use the cache hit rate: the percentage of data requests served directly from the cache out of all data requests.

# read file
dd if=/tmp/file of=/dev/null bs=1M count=500

yum install -y bcc-tools
# need to manually export
export PATH=$PATH:/usr/share/bcc/tools
# check overall cache hit rate
cachestat 1 3
# check process cache hit rate
# similar to top command mechanism
# 3: 3 seconds update
cachetop 3

Investigate file cache size by pcstat

# install go first
# use go to fetch and install pcstat, see github page
pcstat /bin/ls

Note also that if we use dd as a filesystem performance testing tool, the cache will severely distort the results. So before testing, check how much of the file is cached, and clear the cache first:

echo 3 > /proc/sys/vm/drop_caches

Note, though, that Buffers and Cache are managed by the operating system; applications cannot directly control their content or lifecycle. So in application development you generally use dedicated caching components to improve performance further: for example, the program can explicitly allocate heap or stack memory to store data it wants cached, or use an external caching service such as Redis to optimize data access.

dd also supports direct I/O via the oflag and iflag options, so dd can bypass cache and buffer for testing. Note that direct I/O and raw I/O are different: direct I/O skips the buffer, while raw I/O skips the filesystem (which still has a buffer).

Also note that even a first read can have cache hits, because the system reads ahead, loading part of the file into memory. Direct I/O skips the buffer, yet the cache hit rate is 100%; from the cache hit count (each hit is one 4K page) you can compute the amount hit during the interval: HITS * 4K, then divide by the interval to get the rate in K/s.
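A worked example under that 4 KB page assumption: if cachetop reports 1024 hits over a 1-second interval, the data served from cache is 1024 * 4 KB = 4096 KB, roughly 4 MB/s.

# hypothetical cachetop reading: 1024 hits in a 1-second interval
HITS=1024; INTERVAL=1
echo "$(( HITS * 4 / INTERVAL )) KB/s"   # => 4096 KB/s, about 4 MB/s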

For the dd bs (block size) option value: 4K is fine; for large-storage hard drives, 1M is good to go. The bs value depends on your RAM size; it is the amount of data processed per operation.

Memory leaks: understand memory allocation and reclamation, and which virtual memory segments are prone to leaks: the heap and the memory-mapped segment (including dynamic libraries and shared memory, where shared memory is dynamically allocated and managed by the program). If a program allocates there and forgets to free, it causes leak problems similar to heap leaks.

# -r: mem statistic
# -S: swap statistic
# 3: refresh rate second
sar -r -S 3
# in bcc-tools with cachestat and cachetop
memleak -a -p $(pidof app_name)

In memleak output, addr is the allocated address, followed by which stack requested the allocation, listing the relevant calls. Another common memory debugging tool: valgrind.

The difference between file-backed pages and anonymous pages, and how each is reclaimed.

Understand the swap mechanism; watermark simply means threshold. In fact, not only Hadoop: the vast majority of Java applications, including ES, recommend disabling swap. This relates to JVM GC, which traverses all used heap memory during a collection; if part of that memory has been swapped out, the traversal incurs disk I/O.

Swapping, memory limits, and cgroups: enabling swapping makes the cgroup memory limit ineffective.

The smem --sort swap command displays processes sorted by swap usage.

Find records of processes killed by the OOM killer:

dmesg | grep -E "kill|oom|out of memory"

[ ] cgroup mem vs proc mem data? https://www.cnblogs.com/muahao/p/9593869.html

I/O

In Linux everything is a file. Not just regular files and directories: block devices, sockets, pipes, and so on are all managed through the unified filesystem. For ease of management, the Linux filesystem allocates two data structures for every file: the index node (inode) and the directory entry (dentry).

The index node, inode for short, records a file's metadata, such as the inode number, file size, access permissions, modification date, and the location of the data. Inodes correspond one-to-one with files and, like file contents, are persisted to disk. So remember: inodes also consume disk space (df -i <dir>).

The directory entry, dentry for short, records a file's name, a pointer to its inode, and its relationships with other directory entries. Multiple linked directory entries form the filesystem's directory structure. Unlike inodes, however, directory entries are an in-memory data structure maintained by the kernel, so they are usually called the dentry cache.

In fact, the smallest unit of disk read/write is the sector, which is only 512 B; reading and writing at such a small granularity would be very inefficient. So the filesystem groups consecutive sectors into logical blocks and manages data with the logical block as the smallest unit. A common logical block size is 4 KB, i.e., 8 consecutive sectors.

Directory entries, inodes, logical blocks, and the superblock make up the four basic elements of a Linux filesystem. To support many different filesystems, however, the Linux kernel inserts an abstraction layer between user processes and filesystems: the Virtual File System, VFS.

VFS defines a set of data structures and standard interfaces that all filesystems support. User processes and other kernel subsystems only need to interact with the unified interface VFS provides, without caring about the implementation details of the underlying filesystems.

Differences in how files are read and written lead to many I/O categories. The most common distinctions are buffered vs. unbuffered I/O, direct vs. non-direct I/O, blocking vs. non-blocking I/O, and synchronous vs. asynchronous I/O.

# check dentry and inode cache
cat /proc/slabinfo | grep -E '^#|dentry|inode'
# display the kernel slab cache in real time
slabtop

To measure disk I/O performance we need five common metrics: utilization, saturation, IOPS, throughput, and response time. In scenarios with heavy random reads/writes, such as databases and masses of small files, IOPS better reflects overall system performance; in scenarios with heavy sequential reads/writes, such as multimedia, throughput better reflects overall performance.

# -d: Display the device utilization report
# -x: Display extended statistics
iostat -d -x 2

# process io statistic
pidstat -d 1

# sort by io
iotop

# trace system calls
strace -p <pid>
# -f: show threads system calls
strace -f -p <pid>
# -T: show the duration of each system call
# -tt: show the trace timestamps
strace -f -T -tt -p <pid>

# -t: show threads
# -a: show command
pstree -t -a -p <pid>

# check open files
# note use pid not tid(please use its parent pid)
lsof -p <pid>

You can use iostat to get the disk's I/O situation, and pidstat, iotop, etc. to observe processes' I/O. Utilization measures I/O from the time dimension, but disks can also serve writes in parallel, so even at 100% utilization a disk may still be able to accept new I/O (not saturated).

An interesting idea: generally, applications on production systems should support adjusting the log level dynamically. Reading on in the source, this program can do exactly that: send it SIGUSR1 and the log level switches to INFO; send SIGUSR2 and it switches to WARNING.

When troubleshooting application problems, we may need to temporarily enable the application's debug logging in the live environment, and afterwards sometimes forget to switch it back. Not raising the log level back to warning can cause a series of performance problems with CPU usage, disk I/O, and so on; in severe cases it can even affect other applications running on the same server.

# File reads and writes by filename and process. Top for files
filetop

# Trace open() syscalls. Uses Linux eBPF/bcc
opensnoop

MySQL's MyISAM engine relies mainly on the system cache to speed up disk I/O. But if other applications are running on the system at the same time, the MyISAM engine can hardly make full use of the system cache: the cache may be occupied by other applications or even purged. So do not build your application's performance optimization entirely on the system cache. It is better to allocate memory inside the application and build a cache you fully control, or use a third-party caching application such as Memcached or Redis.

Learned some Redis features and configuration.

To evaluate optimization effects objectively and reasonably, we should first benchmark the disk and filesystem to obtain the limit performance of the filesystem or disk I/O. fio is a filesystem and disk I/O benchmarking tool; fio is a tool that will spawn a number of threads or processes doing a particular type of I/O action as specified by the user. The typical use of fio is to write a job file matching the I/O load one wants to simulate:

Note that testing writes against a disk path destroys the filesystem on that disk, so make sure to back up your data beforehand.

# random read
fio -name=randread -direct=1 -iodepth=64 -rw=randread -ioengine=libaio -bs=4k -size=1G -numjobs=1 -runtime=1000 -group_reporting -filename=/dev/sdb

# random write
fio -name=randwrite -direct=1 -iodepth=64 -rw=randwrite -ioengine=libaio -bs=4k -size=1G -numjobs=1 -runtime=1000 -group_reporting -filename=/dev/sdb

# sequential read
fio -name=read -direct=1 -iodepth=64 -rw=read -ioengine=libaio -bs=4k -size=1G -numjobs=1 -runtime=1000 -group_reporting -filename=/dev/sdb

# sequential write
fio -name=write -direct=1 -iodepth=64 -rw=write -ioengine=libaio -bs=4k -size=1G -numjobs=1 -runtime=1000 -group_reporting -filename=/dev/sdb

fio supports I/O replay. With blktrace, mentioned earlier, plus fio, you can benchmark an application's I/O pattern: first use blktrace to record the disk device's I/O accesses, then use fio to replay the blktrace log. The blktrace + fio combination yields a benchmark report of the application's I/O pattern.

It also touched on several aspects of disk optimization, fairly deep, which I don't fully understand. In daily work there is rarely a chance to tune disk performance parameters at the system level; what we can do is reduce disk writes and schedule disk work off-peak, e.g., compress and back up logs in the early morning or during business troughs to reduce the impact on normal business.

[ ] If threads share the same PID, how can they be identified?
[ ] Makefile launches docker container
[ ] nsenter: related to docker exec, and connected to the docker login issue I misunderstood earlier; take a look at the readme and study this tool carefully

Network

Networking is essentially a form of inter-process communication, especially cross-system IPC, which must go over the network. With the spread of high concurrency, distributed systems, cloud computing, and microservices, network performance has become more and more important.

The maximum transmission unit (MTU) configured on a network interface bounds the maximum IP packet size. On the most common Ethernet, the default MTU is 1500 (also the Linux default).

# check virtual NI
ls -l /sys/class/net

In practice we usually measure network performance with bandwidth, throughput, latency, PPS (Packets Per Second), and similar metrics. Beyond these, availability (whether the network can communicate normally), concurrent connection count (number of TCP connections), packet loss rate, and retransmission rate are also common performance metrics.

For TCP or web services, concurrent connection count and requests per second (QPS, Queries per Second) are used more often.

Check network configuration:

# -s: statistic, show RX and TX
ip -s a s
# multiple -s, show more info
ip -s -s a s

Check sockets:

# -l: show only listening sockets
# -t: show only TCP sockets
# -n: show numeric addresses and ports (not names)
# -p: show process info
netstat -nlp | head
ss -ltnp | head

# protocol stack statistics
netstat -s
ss -s

Check throughput and PPS:

# check bandwidth
ethtool eth0 | grep -i speed
# -n: network
# DEV EDEV TCP UDP ICMP, etc
sar -n DEV 1

Check latency:

ping

pktgen, the high-performance network testing tool built into the Linux kernel, requires loading a kernel module and is used to test PPS. TCP/UDP performance testing: iperf3. HTTP performance: ab, webbench; if you need to mock load, use wrk.

DNS not only makes it convenient for people to access different internet services; it also provides many applications with dynamic service discovery and Global Server Load Balance (GSLB) mechanisms, so DNS can pick the IP closest to the user.

Note that querying different name servers may return different IPs.

dig can show the steps of the recursive lookup:

# +trace: enable trace
# +nodnssec: disable DNSSEC
dig +trace +nodnssec time.geekbang.org
# check the time using the Query time field
dig time.geekbang.org

nslookup debug mode, used when lookup failed:

nslookup -debug <hostname>
time nslookup xxx

During application development we must take into account the performance problems DNS resolution may bring and master common optimization methods. The DNS problem I run into most is hijacking; nowadays the public internet enforces HTTPS, and internally we use PowerDNS (open source).

In real network performance analysis, capturing packets with tcpdump first and then analyzing them with Wireshark is also a common method.

11
# drop incoming packets from the DNS server that contain googleusercontent
iptables -I INPUT -p udp --sport 53 -m string --string googleusercontent --algo bm -j DROP

# packet catch
tcpdump -nn udp port 53 or host 35.190.27.188

# -n: skip PTR (reverse DNS) lookups
ping -n -c2 google.com

# PTR: look up the domain name for 35.190.27.188
nslookup -type=PTR 35.190.27.188 8.8.8.8

In fact, reverse-resolving domain names from IP addresses and protocol names from port numbers is the default behavior of many network tools, and it often makes these tools work slowly.

[ ] Linux ring buffer with dmesg
[ ] DMA ring
[ ] conntrack module
[ ] Diagnose NAT issue
[ ] 酷壳 (CoolShell)
[ ] tcpdump man
[ ] wireshark doc
[ ] level-triggered vs edge-triggered
[ ] thundering herd (惊群)
[ ] There is a lot I don't understand... maybe start from Go network programming? For those who find networking hard, first read Lin Peiman's two books on network analysis with Wireshark
[ ] hping3: a network penetration test tool, for security auditing, firewall testing

[ ] blog: 动态追踪技术漫谈 (an informal survey of dynamic tracing)

From LinkedIn Learning

time in the shell is a keyword, but there is also a time command; they produce different output.

yum install -y time

# use time yum installed
# also show major and minor page faults
$(which time) sleep 1
0.00user 0.00system 0:01.00elapsed 0%CPU (0avgtext+0avgdata 644maxresident)k
0inputs+0outputs (0major+205minor)pagefaults 0swaps

# use shell keyword time
time sleep 1
# user and sys is cpu time on user and system space
# user + sys could be larger than real time if your program
# uses multi-cores
real 0m1.002s
user 0m0.000s
sys 0m0.001s

If you want to compare the performance of two similar commands, put each in a loop and time the loop as a whole to see the difference, or use strace -c -o /tmp/result.out ./script and head the first several lines of the result to see which system calls cost the most.
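A minimal sketch of that loop-timing approach (the grep/awk variants are placeholders for whatever commands you are comparing):

# time 1000 runs of each variant as a whole
time ( for i in $(seq 1000); do grep -q root /etc/passwd; done )
time ( for i in $(seq 1000); do awk '/root/{exit 0}' /etc/passwd; done )

# or summarize where the time goes in system calls
strace -c -o /tmp/result.out ./script
head -n 15 /tmp/result.out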

/proc is mounted at boot time; see the mount detail with the mount | grep proc command. proc files provide kernel info; printing the contents of a proc file causes the corresponding function in the kernel to be called to produce a fresh value. We can also write to proc files to change kernel parameters. /proc is a pseudo filesystem, so with ls -l the length may be 0.

/proc files under /proc/sys represent kernel variables.

CPU

Packages for performance tools, yum install them and rpm -ql to see utilities contained:

  • sysstat: iostat, mpstat, nfsiostat-sysstat, pidstat, sar
  • procps-ng: free, pmap, ps, snice, tload, top, uptime, vmstat, watch
  • perf: sudo perf record find / -xdev -name core >/dev/null 2>&1; sudo perf report

It mainly covered sar, top, cpuinfo, and scheduling priority and nice values. (Details are recorded in other blog posts; search for the keywords.)

Throughput, important for servers, favors fewer context switches and longer time slices. Throughput and responsiveness are a trade-off: good throughput means running a single task as much as possible so that context switches are few. Responsiveness, important for interactive or control systems, requires quick context switches, shorter time slices, and fewer page faults.

You can configure the kernel to be preemptible or not; preemption means context switches.

Linux kernel has 3 choices for the kind of preemption it employs.

  • None: no preemption
  • Voluntary: the kernel frequently checks for rescheduling points (the most common choice)
  • Preempt: the scheduler preempts unless the kernel is in a critical section.

Memory

In the /proc/meminfo file, the MemAvailable value is the space available for programs without swapping (it includes reclaimable cache and buffers); important.

MemTotal:       32779460 kB
MemFree:          239788 kB
MemAvailable:    3996032 kB
Buffers:              20 kB
Cached:          5797312 kB
SwapCached:            0 kB
Active:          4435144 kB
...

The htop command is similar to top but with a colorful display. From the options you can sort by different categories, e.g., %CPU, %MEM, etc.

The translation lookaside buffer (TLB) is used for mapping virtual addresses to physical addresses. Linux supports huge pages; they can be configured with sysctl at runtime or during bootstrap (/etc/sysctl.d).
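A sketch of both options for huge pages (the page count is an arbitrary example):

# set at runtime (value is an arbitrary example)
sudo sysctl vm.nr_hugepages=128

# or persist across reboots
echo 'vm.nr_hugepages = 128' | sudo tee /etc/sysctl.d/90-hugepages.conf
sudo sysctl --system

# verify
grep -i hugepages /proc/meminfo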

Page faults: a process uses an address that is not mapped or not even RAM-resident. Page faults are minor or major; a minor fault is not a big deal, while a major fault involves disk I/O and is much slower. Linux is an on-demand paging system. (The book How Linux Works also mentions this.)

Disk

atop command is also helpful.

Later it covered how to test the performance of different filesystems: use dd to generate a large file to serve as a loop device, format it with the filesystem under test, mount it, then create a large number of files and directories and record the time, then operate on those files and directories and record the time.
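A minimal sketch of that workflow (sizes, paths, and the loop device are assumptions):

# create a 1 GB backing file and attach it to a loop device
dd if=/dev/zero of=/tmp/fs.img bs=1M count=1024
sudo losetup /dev/loop0 /tmp/fs.img

# format with the filesystem under test and mount it
sudo mkfs.ext4 /dev/loop0
sudo mkdir -p /mnt/fstest
sudo mount /dev/loop0 /mnt/fstest

# time mass file creation; repeat with xfs, btrfs, etc. and compare
time sudo bash -c 'for i in $(seq 10000); do touch /mnt/fstest/f$i; done'

# clean up
sudo umount /mnt/fstest
sudo losetup -d /dev/loop0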

Kind helps you bring up a local k8s cluster for testing and POC. It works seamlessly with kubectl and other tooling such as prometheus, operators, helmfile, etc.

Install Kind

The install is easy with Go (1.17+); see this instruction:

# At the time of writing the kind stable version is 0.18.0, it will place the
# kind binary to $GOBIN(if exist) or $GOPATH
go install sigs.k8s.io/kind@v0.18.0

Basic Workflow

Cluster Creation

To spin up local k8s cluster:

# see options and flags
kind create cluster --help

# You can have multiple types of cluster such as dev, stg, prod, etc
# create one node cluster with dev as context name
kind create cluster --name dev

# Different cluster with specific configuration
kind create cluster --name stg --config stg_config.yaml

# Check k8s context
# Note the cluster name is not the same as context name
kubectl config get-contexts

Cluster Configuration

For advanced configuration, please see this section. A simple multi-node cluster can be used to test, for example, rolling upgrades:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
# 1 control plane node and 3 workers
nodes:
# the control plane node config
- role: control-plane
# the three workers
- role: worker
- role: worker
- role: worker

Load Image

The kind k8s cluster uses the containerd runtime; you can docker exec into a node container and check with the crictl command:

# list images
crictl images
# list containers
crictl ps
# list pods
crictl pods

To load image into the kind node, for example:

1
kind load docker-image busybox [--name dev] [--nodes x,y,z]

Then in the node container you will see busybox by crictl images.

Context Switch

To manage clusters:

# View kind clusters
kind get clusters

# Get cluster kubeconfig detail
kind get kubeconfig -n dev

# Switch context from dev to stg
kubectl config use-context kind-stg

Cluster Deletion

kind delete cluster --name dev

Kind Logging

To check kind logs:

kind export logs --name dev [./some_folder]

Ingress

For how to set up ingress for kind K8s cluster, please check: https://kind.sigs.k8s.io/docs/user/ingress/

Load Balancer

For how to set up LB service type in kind K8s cluster, please check: https://kind.sigs.k8s.io/docs/user/loadbalancer/

//TODO This is the follow-up of <<Terraform Quick Start>>. github repo

Working with Existing Resources

# Configure an AWS profile with proper credentials
# for terraform use
aws configure --profile deep-dive
# Linux or MacOS
export AWS_PROFILE=deep-dive

# After terraform files are in place
# download modules and provider plugin
terraform init
terraform validate
# in collective env better to have plan file
terraform plan -out m3.tfplan
terraform apply "m3.tfplan"

You can find useful out-of-the-box modules in the Terraform public registry. For example, the GCP Kubernetes module:

module "gke" {
# specify source
source = "terraform-google-modules/kubernetes-engine/google"
project_id = "<PROJECT ID>"
name = "gke-test-1"
region = "us-central1"
zones = ["us-central1-a", "us-central1-b", "us-central1-f"]
network = "vpc-01"
subnetwork = "us-central1-01"
ip_range_pods = "us-central1-01-gke-01-pods"
ip_range_services = "us-central1-01-gke-01-services"
http_load_balancing = false
horizontal_pod_autoscaling = true
network_policy = true

node_pools = [
{
name = "default-node-pool"
machine_type = "e2-medium"
node_locations = "us-central1-b,us-central1-c"
min_count = 1
max_count = 100
local_ssd_count = 0
disk_size_gb = 100
disk_type = "pd-standard"
image_type = "COS"
auto_repair = true
auto_upgrade = true
service_account = "project-service-account@<PROJECT ID>.iam.gserviceaccount.com"
preemptible = false
initial_node_count = 80
},
]

node_pools_oauth_scopes = {
all = []

default-node-pool = [
"https://www.googleapis.com/auth/cloud-platform",
]
}

node_pools_labels = {
all = {}

default-node-pool = {
default-node-pool = true
}
}

node_pools_metadata = {
all = {}

default-node-pool = {
node-pool-metadata-custom-value = "my-node-pool"
}
}

node_pools_taints = {
all = []

default-node-pool = [
{
key = "default-node-pool"
value = true
effect = "PREFER_NO_SCHEDULE"
},
]
}

node_pools_tags = {
all = []

default-node-pool = [
"default-node-pool",
]
}
}

Then suppose someone in the team provisioned some resources on AWS without using Terraform, and we want to bring them under Terraform control:

# update terraform config file to include new added resources
edit terraform.tfvars

# identifiers come from the provider and configuration
# use the values output by the junior_admin.sh script
# xxx: the resource ID, which should be findable in the AWS dashboard
terraform import --var-file="terraform.tfvars" "module.vpc.aws_route_table.private[2]" "xxx"
terraform import --var-file="terraform.tfvars" "module.vpc.aws_route_table_association.private[2]" "xxx"
terraform import --var-file="terraform.tfvars" "module.vpc.aws_subnet.private[2]" "xxx"
terraform import --var-file="terraform.tfvars" "module.vpc.aws_route_table_association.public[2]" "xxx"
terraform import --var-file="terraform.tfvars" "module.vpc.aws_subnet.public[2]" "xxx"

# adds new resources to the state
# you will see some change items if not consistent before
terraform plan -out m3.tfplan
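After the imports, a quick sanity check that the resources are now tracked in state (the addresses mirror the import commands above):

# confirm the imported addresses are present in state
terraform state list | grep module.vpc

# inspect one imported resource in detail
terraform state show 'module.vpc.aws_subnet.private[2]'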

Managing State

I had just finished reading the first edition when the second edition came out, hmmm… -_-|||

This book doesn’t offer instructions in using specific scripting languages or tools. There are code examples from specific tools, but these are intended to illustrate concepts and approaches, rather than to provide instruction.

The beginning recounts some of the author's earlier experiences, starting from a team's VMware virtual server farm; from these experiences he gradually realized and learned the necessity of IaC. Puppet and Chef were apparently the config automation tools of that earlier era. Later he moved on to cloud and learned many new ideas from other IT Ops teams, an eye-opener: "The key idea of our new approach was that every server could be automatically rebuilt from scratch, and our configuration tooling would run continuously, not ad hoc. Every server added into our new infrastructure would fall under this approach. If automation broke on some edge case, we would either change the automation to handle it, or else fix the design of the service so it was no longer an edge case."

VMs and containers complement each other. Virtualization was one step, allowing you to add and remove VMs to scale your capacity to your load on a timescale of minutes. Containers take this to the next level, allowing you to scale your capacity up and down on a timescale of seconds.

Later a question occurred to me: if containers run on top of VMs, isn't that an extra layer of overhead? Is there any optimization? For example, GKE runs on GCE VMs; is the VM image optimized for k8s? Yes: it uses Container-Optimized OS.

Here are some articles comparing environments for running k8s: Where to Install Kubernetes? Bare-Metal vs. VMs. vs. Cloud. Running Containers on Bare Metal vs. VMs: Performance and Benefits

Part I Foundations

Chapter 1

Infrastructure as code is an approach to infrastructure automation based on practices from software development. It emphasizes consistent, repeatable routines for provisioning and changing systems and their configuration. Changes are made to definitions and then rolled out to systems through unattended processes (i.e., with no human involvement) that include thorough validation.

The phrase dynamic infrastructure to refer to the ability to create and destroy servers programmatically.

Challenges with dynamic infrastructure, the previous one can cause the next:

  • Server Sprawl: servers growing faster than the ability to control them.
  • Configuration drift: inconsistency across the servers, such as manual ad-hoc fixes, config.
  • Snowflake Server: can’t be replicated.
  • Fragile Infrastructure: snowflake server problem expands.
  • Automation Fear: lack of confidence.
  • Erosion: infrastructure decays over time, such as components upgrade, patches, disk fill up, hardware failure.

An operations team should be able to confidently and quickly rebuild any server in their infrastructure.

Principles of Infrastruction as Code to mitigate above challenges:

  • Systems can be easily reproduced.
  • Systems are disposable.
  • Systems are consistent.
  • Processes are repeatable.
  • Design is always changing.

Effective infrastructure teams have a strong scripting culture. If a task can be scripted, script it. If a task is hard to script, drill down and see if there’s a technique or tool that can help, or whether the problem the task is addressing can be handled in a different way.

General practices of infrastructure as Code:

  • Use definition files: to specify infra elements and config.
  • Self-documented systems and processes: doc may leave gaps over time.
  • Version all things.
  • Continuously test systems and processes, how? see Chapter 11.
  • Small changes rather than batches.
  • Keep services available continuously, see chapter 14.
  • Antifragility, beyond robust: When something goes wrong, the priority is not simply to fix it, but to improve the ability of the system to cope with similar incidents in the future.

Chapter 2

Dynamic Infrastructure Platform: is a system that provides computing resources, particularly servers, storage, and networking, in a way that they can be programmatically allocated and managed.

It mainly covered the requirements for building a dynamic infra platform, e.g., programmable, on-demand, self-service, and what capabilities to provide to users, e.g., compute, storage (block, object storage, networked filesystem), network, auth, etc.

A concept to clarify here: private cloud vs bare-metal cloud. I used to think they meant the same thing, but they don't: bare-metal cloud is running an OS directly on server hardware rather than in a VM. There are many reasons why running directly on hardware may be the best choice for a given application or service. Virtualization adds performance overhead, because it inserts extra software layers between the application and the hardware resources it uses. Processes on one VM can impact the performance of other VMs running on the same host. Common tools for managing bare metal: Cobbler, Foreman, etc.

An IT professional, the deeper and stronger your understanding of how the system works down the stack and into the hardware, the more proficient you’ll be at getting the most from it.

A new instance is not necessarily well-performing; virtualization carries many uncertainties: For example, the Netflix team knew that a percentage of AWS instances, when provisioned, will perform much worse than the average instance, whether because of hardware issues or simply because they happen to be sharing hardware with someone else's poorly behaving systems. So they wrote their provisioning scripts to immediately test the performance of each new instance. If it doesn't meet their standards, the script destroys the instance and tries again with a new instance.

Software and infrastructure should be architected, designed, and implemented with an understanding of the true architecture of the hardware, networking, storage, and the dynamic infrastructure platform.

Chapter 3

Infrasturcture Definition Tools: This chapter has discussed the types of tools to manage high-level infrastructure according to the principles and practices of infrastructure as code.

Chapter 4

Server Configuration Tools mainly covered provisioning tools such as Chef, Puppet, Ansible, and Salt; tools for packaging server templates, such as Packer; and tools for running commands on servers.

Many server configuration tool vendors provide their own configuration registry to manage configuration definitions, for example, Chef Server, PuppetDB, and Ansible Tower.

In many cases, new servers can be built using off-the-shelf server template images. Packaging common elements onto a template makes it faster to provision new servers. Some teams take this further by creating server templates for particular roles such as web servers and application servers. Chapter 7 discusses trade-offs and patterns around baking server elements into templates versus adding them when creating servers (this was the new project we were working on at the time).

Unikernel Server Templates: an OS image that is custom-compiled with the application it will run. The image only includes the parts of the OS kernel needed for the application, so it is small and fast. This image is run directly as a VM or container (see later in this chapter) but has a single address space.

It’s important for an infrastructure team to build up and continuously improve their skills with scripting. Learn new languages, learn better techniques, learn new libraries and frameworks

Server change management models:

  • Ad hoc change, lead to config drift, snowflake server and other evils.
  • Configuration synchronization, which may still leave config drift in the unmanaged parts
  • Immutable infra, completely replacing, requires good templates management.

Containerized services follows something similar to immutable infra, replace old container completely when apply changes. A container uses operating system features to isolate the processes, networking, and filesystem of the container, so it appears to be its own, self-contained server environment.

There is actually some dependency between the host and container. In particular, container instances use the Linux kernel of the host system, so a given image could potentially behave differently, or even fail, when run on different versions of the kernel.

A host server runs virtual machines using a hypervisor, Container instances share the operating system kernel of their host system, so they can’t run a different OS. Container has less overhead than a hardware virtual machine. Container image can be much smaller than a VM image, because it doesn’t need to include the entire OS. It can start up in seconds, as it doesn’t need to boot a kernel from scratch. And it consumes fewer system resources, because it doesn’t need to run its own kernel. So a given host can run more container processes than full VMs.

Container security, While containers isolate processes running on a host from one another, this isolation is not impossible to break. Different container implementations have different strengths and weaknesses. When using containers, a team should be sure to fully understand how the technology works, and where its vulnerabilities may lie.

Teams should ensure the provenance of each image used within the infrastructure is well known, trusted, and can be verified and traced. (At the time, Red Hat also specifically scanned and vetted the ICP4D images.)

Chapter 5

General Infrastructure Services. The purpose of this chapter isn't to list or explain these services and tools. Instead, it is intended to explain how they should work in the context of a dynamic infrastructure managed as code.

The services and tools addressed are monitoring, service discovery, distributed process management, and software deployment. (These are the main services that follow once the infra build is complete.)

Monitor: alerting, metrics and logging. Monitoring information comes in two types: state and events. State is concerned with the current situation, whereas an event records actions or changes.

  • Alerting: Tell Me When Something Is Wrong
  • Metrics: Collect and Analyze Data
  • Log Aggregation and Analysis

Service Discovery: Applications and services running in an infrastructure often need to know how to find other applications and services.

Distributed Process Management: VMs or containers. K8s, Nomad, Openshift.

Software Deployment: Many have a series of environments for testing stages, including things like operational acceptance testing (OAT), QA (for humans to carry out exploratory testing), system integration testing (SIT), user acceptance testing (UAT), staging, preproduction, and performance.

Part II Patterns

Chapter 6

Patterns for Provisioning Servers.

Provisioning is not only done for a new server. Sometimes an existing server is re-provisioned, changing its role from one to another.

Server’s lifecycle:

  1. package a server template.
  2. create a new server
  3. update a server
  4. replace a server
  5. delete a server

Zero-downtime replacement ensures that a new server is completely built and tested while the existing server is still running so it can be hot-swapped into service once ready.

Advocates of immutable servers view making a change to the configuration of a production server as bad practice, no better than modifying the source code of software directly on a production server.

  1. recover from failure, outage, maintenance
  2. resize server pool, add/remove instances
  3. reconfig hardware resources, for example, add CPU, RAM, mount new disks, etc.

Server roles: Another pattern is to have a role-inheritance hierarchy (we indeed did it this way too). The base role would have the software and configuration common to all servers, such as a monitoring agent, common user accounts, and common configuration like DNS and NTP server settings. Other roles would add more things on top of this, possibly at several levels.

It can still be useful to have servers with multiple roles even with the role inheritance pattern. For example, although production deployments may have separate web, app, and db servers, for development and some test cases, it can be pragmatic to combine these onto a single server.

Cloned servers (similar to saving a container to an image) suffer because they carry runtime data from the original server, which is not reproducible, and they accumulate changes and data.

Bootstrapping new servers:

  • push bootstrapping: Ansible, Chef, Puppet
  • pull bootstrapping: cloud-init

Smoke test every new server instance:

  • Is the server running and accessible?
  • Is the monitoring agent running?
  • Has the server appeared in DNS, monitoring, and other network services?
  • Are all of the necessary services (web, app, database, etc.) running?
  • Are required user accounts in place?
  • Are there any ports open that shouldn’t be?
  • Are any user accounts enabled that shouldn’t be?

Smoke tests could be integrated with monitoring systems. Most of the checks that would go into a smoke test would work great as routine monitoring checks, so the smoke test could just verify that the new server appears in the monitoring system, and that all of its checks are green.
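A minimal smoke-test sketch covering a few of the checks above (the host name, service list, and commands are assumptions; in practice these checks would be wired into monitoring):

#!/usr/bin/env bash
# hypothetical new-server smoke test; exits non-zero on the first failure
set -e
HOST=new-server-01

# is the server running and accessible?
ping -c 2 "$HOST" >/dev/null
ssh "$HOST" true

# has it appeared in DNS?
getent hosts "$HOST" >/dev/null

# are the necessary services running? (service names are assumptions)
for svc in sshd nginx node_exporter; do
  ssh "$HOST" "systemctl is-active --quiet $svc"
done

# list listening ports to compare against an allowlist
ssh "$HOST" "ss -lntH" | awk '{print $4}'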

Chapter 7

Patterns for Managing Server Templates deserves close attention.

These are also the two approaches we adopted, one after the other; the new generation uses the second. The two can also be combined, putting the frequently changing parts into creation-time provisioning. One end of the spectrum is minimizing what's on the template and doing most of the provisioning work when a new server is created.

Keeping templates minimal makes sense when there is a lot of variation in what may be installed on a server. For example, if people create servers by self-service, choosing from a large menu of configuration options, it makes sense to provision dynamically when the server is created. Otherwise, the library of prebuilt templates would need to be huge to include all of the variations that a user might select.

At the other end of the provisioning spectrum is putting nearly everything into the server template.

Doing all of the significant provisioning in the template, and disallowing changes to anything other than runtime data after a server is created, is the key idea of immutable servers.

Process to build a template: An alternative to booting the origin image is to mount the origin disk image in another server and apply changes to its filesystem. This tends to be much faster, but the customization process may be more complicated.

Netflix’s Aminator tool builds AWS AMIs by mounting the origin image as a disk volume. The company’s blog post on Aminator describes the process quite well. Packer offers the amazon-chroot builder to support this approach.

It could make sense to have server templates tuned for different purposes. Database server nodes could be built from one template that has been tuned for high-performance file access, while web servers may be tuned for network I/O throughput. (We didn't consider this much.)

Chapter 8

Patterns for Updating and Changing Servers deserves close attention. An effective change management process ensures that any new change is rolled out to all relevant existing servers and applied to newly created servers.

Continuous Configuration Synchronization: for example, Google's gcloud resource configuration has a central source-of-truth repo (mainly for API and role permissions); the configuration process syncs roughly every hour to eliminate config drift.

Any areas not explicitly managed by configuration definitions may be changed outside the tooling, which leaves them vulnerable to configuration drift.

Immutable Servers, the practice is normally combined with keeping the lifespan of servers short, as with the Phoenix. So servers are rebuilt as frequently as every day, leaving little opportunity for unmanaged changes. Another approach to this issue is to set those parts of a server’s filesystems that should not change at runtime as read-only.

Using the term "immutable" to describe this pattern can be misleading. "Immutable" means that a thing can't be changed, so a truly immutable server would be useless. As soon as a server boots, its runtime state changes: processes run, entries are written to logfiles, and application data is added, updated, and removed. It's more useful to think of the term "immutable" as applying to the server's configuration, rather than to the server as a whole.

Depending on the design of the configuration tool, a pull-based system may be more scalable than a push-based system. A push system needs the master to open connections to the systems it manages, which can become a bottleneck with infrastructures that scale to thousands of servers. Setting up clusters or pools of agents can help a push model scale. But a pull model can be designed to scale with fewer resources, and with less complexity.

Chapter 9

Patterns for Defining Infrastructure: This chapter will look at how to provision and configure larger groups of infrastructure elements.

Stack: A stack is a collection of infrastructure elements that are defined as a unit.

Use parameterized environment definitions, for example terraform brings up a stack with a single definition file for different environments.

It mentioned the Consul configuration registry, which stores resources of different stacks, such as runtime IP addresses, for other stacks to reference; this decouples the stacks so each can be managed independently.

# AWS, get vip_ip from consul
resource "consul_keys" "app_server" {
  key {
    name = "vip_ip"
    path = "myapp/${var.environment}/appserver/vip_ip"
  }
}

It’s better to ensure that infrastructure is provisioned and updated by running tools from centrally managed systems, such as an orchestration agent. An orchestration agent is a server that is used to execute tools for provisioning and updating infrastructure. These are often controlled by a CI or CD server, as part of a change management pipeline. 处于安全,一致性,依赖的原因,确实应该如此.

Part III Practice

Chapter 10

Software Engineering Practices for Infrastructure Assume everything you deliver will need to change as the system evolves.

The true measure of the quality of a system, and its code, is how quickly and safely changes are made to it.

This is how my team does it on gitlab-ci: Although a CI tool can be used to run tests automatically on commits made to each separate branch, the integrated changes are only tested together when the branches are merged. Some teams find that this works well for them, generally by keeping branches very short-lived. This is summarized well: commit changes to a short-lived branch and then merge to trunk, running CI both before and after the merge, on the branch and on the trunk.

This also explains CI/CD well: CI, continuous integration, addresses work done on a single codebase. CD, continuous delivery, expands the scope of this continuous integration to the entire system, with all of its components.

The idea behind CD is to ensure that all of the deployable components, systems, and infrastructure are continuously validated to ensure that they are production ready. It is used to address the problems of the "integration phase."

One misconception about CD is that it means every change committed is applied to production immediately after passing automated tests. The point of CD is not to apply every change to production immediately, but to ensure that every change is ready to go to production.

Code Quality The key to a well-engineered system is simplicity. Build only what you need, then it becomes easier to make sure what you have built is correct. Reorganize code when doing so clearly adds value.

Technical debt is a metaphor for problems in a system that have been left unfixed. It's best not to accumulate technical debt; fix it as soon as you find it.

An optional feature that is no longer used, or whose development has been stopped, is technical debt. It should be pruned ruthlessly. Even if you decide later on that you need that code, it should be in the history of the VCS. If, in the future, you want to go back and dust it off, you’ve got it in the history in version control.

Chapter 11

Testing Infrastructure Changes deserves attention. The pyramid puts tests with a broader scope toward the top, and those with a narrow scope at the bottom. The lower tiers validate smaller, individual things such as definition files and scripts. The middle tiers test some of the lower-level elements together, for example, by creating a running server. The highest tiers test working systems together, for example, a service with multiple servers and their surrounding infrastructure.

There are more tests at the lower levels of the pyramid and fewer at the top. Because the lower-level tests are smaller and more focused, they run very quickly. The higher-level tests tend to be more involved, taking longer to set up and then run, so they run slower.

In order for CI and CD to be practical, the full test suite should run every time someone commits a change. The committer should be able to see the results of the test for their individual change in a matter of minutes. Slow test suites make this difficult to do, which often leads teams to decide to run the test suite periodically: every few hours, or even nightly.

If running tests on every commit is too slow to be practical, the solution is not to run the tests less often, but instead to fix the situation so the test suite runs more quickly. This usually involves re-balancing the test suite, reducing the number of long-running tests and increasing the coverage of tests at the lower levels.

This in turn may require rearchitecting the system being tested to be more modular and loosely coupled, so that individual components can be tested more quickly.

Actually, choosing test cases is not straightforward either; select what to test from actual needs, covering the components that change often or break easily, and keep the tests up to date. Practice:

  • Test at the Lowest Level Possible
  • Only Implement the Layers You Need
  • Prune the Test Suite Often
  • Continuously Review Testing Effectiveness

Whenever there is a major issue in production or even in testing, consider running a blameless post-mortem. Google also promotes this habit internally.

Low-level testing: for checking files like Ansible playbooks and Packer JSON files, there are several steps:

  • syntax check; ansible and the others ship with their own parsers
  • static code analysis: linting. Static analysis can be used to check for common errors and bad habits which, while syntactically correct, can lead to bugs, security holes, performance issues, or just code that is difficult to understand.
  • unit testing; ansible has dedicated modules for this, as do puppet and chef.

Mid-level testing For example, starts building template via Packer and Ansible, the validation process would be to create a server instance using the new template, and then run some tests against it.

Tools to test server configuration: Serverspec. Currently we check Packer instances ourselves, for example:

describe service('login_service') do
  it { should be_running }
end

describe host('dbserver') do
  it { should be_reachable.with( :port => 5432 ) }
end

# ChefSpec examples:
describe 'install and configure web server' do
  let(:chef_run) { ChefSpec::SoloRunner.converge(nginx_configuration_recipe) }

  it 'installs nginx' do
    expect(chef_run).to install_package('nginx')
  end
end

describe 'home page is working' do
  let(:chef_run) {
    ChefSpec::SoloRunner.converge(nginx_configuration_recipe,
                                  home_page_deployment_recipe)
  }

  it 'loads correctly' do
    response = Net::HTTP.new('localhost',80).get('/')
    expect(response.body).to include('Welcome to the home page')
  end
end

Automated tests that remotely log into a server can be challenging to implement securely. These tests either need a hardcoded password, or else an SSH key or similar mechanism that authorizes unattended logins.

One approach to mitigate this is to have tests execute on the test server and push their results to a central server. This could be combined with monitoring, so that servers can self-test and trigger an alert if they fail.

Another approach is to generate one-off authentication credentials when launching a server to test.

High-level testing: The higher levels of the test suite involve testing that multiple elements of the infrastructure work correctly when integrated together.

Testing Operational Quality. This part is also important, but it should fall within QA's scope. People managing projects to develop and deploy software have a bucket of requirements they call non-functional requirements, or NFRs; these are also sometimes referred to as cross-functional requirements (CFRs). Performance, availability, and security tend to be swept into this bucket.

Operational testing can take place at multiple tiers of the testing pyramid, although the results at the top tiers are the most important.

On the relationship between testing and monitoring: testing is aimed at detecting problems when making changes, before they are applied to production systems, while monitoring is aimed at detecting problems in running systems.

In order to effectively test a component, it must be isolated from any dependencies during the test. A solution to this is to use a stub server instead of the application server. It’s important for the stub server to be simple to maintain and use. It only needs to return responses specific to the tests you write.
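
A minimal stub-server sketch in shell, assuming ncat (from the nmap package) is available; it answers every connection with the same canned response:

# loop forever, returning a fixed JSON body on port 8080
while true; do
  printf 'HTTP/1.1 200 OK\r\nContent-Type: application/json\r\nContent-Length: 16\r\n\r\n{"status": "ok"}' | ncat -l 8080
done

A component under test can then be pointed at http://localhost:8080/ and will always receive the response the test expects.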

Mocks, fakes, and stubs are all types of test doubles. A test double replaces a dependency needed by a component or service being tested, to simplify testing.

QA tester: QA stands for quality analyst or quality assurance.

story: a small piece of work (an issue type in Jira); that is probably the intended meaning here.

Chapter 12

Change Management Pipelines for Infrastructure. This chapter explains how to implement continuous delivery for infrastructure by building a change management pipeline. It covers how to design, integrate, and test a CD pipeline.

A change management pipeline could be described as the automated manifestation of your infrastructure change management process; think of it simply as a CD pipeline.

Guidelines for Designing Pipelines:

  • Ensure Consistency Across Stages, e.g., server operating system versions and configuration should be the same across environments. Make sure that the essential characteristics are the same.
  • Get Immediate Feedback for Every Change
  • Run Automated Stages Before Manual Stages
  • Get Production-Like Sooner Rather Than Later

My colleague Chris Bird described this as DevOops; the ability to automatically configure many machines at once gives us the ability to automatically break many machines at once. In other words, the power and the danger go hand in hand.

The book recaps a typical CI/CD flow here:

  1. local development stage: write code and test it on local virtualization, then commit to VCS.
  2. build stage: syntax checking, unit tests, test doubles, publishing reports, packaging and uploading code/template images, etc.

If you are not using the immutable-server model, you will need a configuration master (Chef server, Puppet master, or Ansible Tower) to configure environments. So at the end of the CI pipeline, a configuration artifact is packaged and uploaded for these config masters to use when configuring running servers; alternatively, in a masterless configuration, running servers download it automatically from a file server.

If you are using the immutable-server model, everything is already baked into the image template (for example with Packer), so neither a configuration master nor a masterless setup is needed.
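
A minimal sketch of baking an immutable image; the template name is a placeholder:

# validate the template, then bake the image; servers launched from the
# resulting image need no further configuration at boot
packer validate template.json
packer build template.json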

  3. automated test stage, refer to the test pyramid
  4. manual validation stage
  5. apply to live; any significant risk or uncertainty at this stage should be modeled and addressed in upstream stages.

Also note that not every commit goes through all the stages: commits 1/2/3 may each reach one stage and then move into the next stage together, so the earlier stages of the pipeline will run more often than the later stages. And not every change, even one that passes testing and demo, is necessarily deployed immediately to production.

Pipeline for a complex system: the fan-in pattern. The fan-in pattern is a common one, useful for building a system that is composed of multiple components. Each component starts out with its own pipeline to build and test it in isolation. Then the component pipelines are joined so that the components are tested together. A system with multiple layers of components may have multiple joins. The flow diagram looks like a fan, hence the name.

Contract tests are automated tests that check whether a provider interface behaves as consumers expect. This is a much smaller set of tests than full functional tests, purely focused on the API that the service has committed to provide to its consumers.
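
A minimal contract-test sketch in shell; the /v1/users endpoint, its URL, and the checked fields are hypothetical, and curl and jq are assumed to be available:

# fetch one resource from the provider's API
response=$(curl -fsS http://provider.example.com/v1/users/42)

# assert only the contract: the fields consumers depend on must exist,
# with the types they expect
echo "$response" | jq -e 'has("id") and has("email") and (.id | type == "number")' \
  || { echo "contract broken: users payload changed"; exit 1; }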

Chapter 13

Workflow for the Infrastructure Team. This chapter's wording is very good. An infrastructure engineer can no longer just log onto a server to make a change. Instead, they make changes to the tools and definitions, and then allow the change management pipeline to roll the changes out to the server.

A sandbox is an environment where a team member can try out changes before committing them into the pipeline. It may be run on a local workstation, using virtualization, or could be run on the virtualization platform.

Autonomic Automation Workflow. Use a local sandbox, as described above, to test changes before committing them into the pipeline.

Keeping the whole change/commit cycle short requires some habits around how to structure changes so they don't break production even when the whole task isn't finished. Feature toggles and similar techniques mentioned in Chapter 12 can help, as sketched below.
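
A minimal feature-toggle sketch; the flag name and the helper functions are hypothetical:

# the unfinished behavior ships dark until the flag is flipped
if [ "${NEW_LB_CONFIG_ENABLED:-false}" = "true" ]; then
  apply_new_lb_config      # hypothetical helper for the in-progress change
else
  apply_current_lb_config  # hypothetical helper for today's behavior
fi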

Chapter 14

Continuity with Dynamic Infrastructure. This chapter is concerned with the operational quality of production infrastructure. Many IT service providers use availability as a key performance metric or SLA (service level agreement). This is a percentage, often expressed as a number of nines: “five nines availability” means that the system is available 99.999% of the time, which allows only about 5.3 minutes of downtime per year.

Service continuity Keeping services available to end users in the face of problems and changes

A pitfall of using dynamic pools to automatically replace failed servers is that it can mask a problem. If an application has a bug that causes it to crash frequently, it may take a while for people to notice. So it is important to implement metrics and alerting on the pool’s activity. The team should be sent critical alerts when the frequency of server failures exceeds a threshold.
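
A minimal sketch of such alerting in shell; the event log path, the message format, the threshold, and the mail address are all hypothetical:

# alert when replacements in the pool exceed a threshold
THRESHOLD=5
count=$(grep -c 'server-replaced' /var/log/pool-events.log)
if [ "$count" -gt "$THRESHOLD" ]; then
  echo "CRITICAL: $count server replacements in the pool" \
    | mail -s 'pool replacement alert' oncall@example.com
fi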

Software that has been designed and implemented with the assumption that servers and other infrastructure elements are routinely added and removed is sometimes referred to as cloud native. Cloud-native software handles constantly changing and shifting infrastructure seamlessly.

The team at Heroku published a list of guidelines for applications to work well in the context of a dynamic infrastructure, called the 12-factor application.

Some characteristics of non-cloud-native software that require lift-and-shift migrations (a 12-factor-style counterexample to the last item is sketched after the list):

  • Stateful sessions
  • Storing data on the local filesystem
  • Slow-running startup routines
  • Static configuration of infrastructure parameters
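
A minimal sketch addressing the last item in 12-factor style: take infrastructure parameters from the environment instead of hardcoding them (the variable and binary names are illustrative):

# fail fast if the required parameter is missing; default the optional one
DB_HOST="${DB_HOST:?DB_HOST must be set}"
DB_PORT="${DB_PORT:-5432}"
exec /usr/local/bin/myapp --db "${DB_HOST}:${DB_PORT}"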

Zero-Downtime Changes. Many changes require taking elements of the infrastructure offline, or completely replacing them. Examples include upgrading an OS kernel, reconfiguring a network, or deploying a new version of application software. However, it's often possible to carry out these changes without interrupting service, using patterns such as the following (a blue-green sketch follows the list):

  • Blue-Green Replacement
  • Phoenix Replacement
  • Canary Replacement
  • Dark Launching
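
A minimal blue-green sketch at the load-balancer level; the nginx upstream-config layout shown is an assumption:

# repoint traffic to the green stack by swapping a symlink, then reload
# nginx only if the new configuration passes validation
ln -sfn /etc/nginx/upstreams/green.conf /etc/nginx/upstreams/active.conf
nginx -t && nginx -s reload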

Routing Traffic for Zero-Downtime Replacements. Zero-downtime change patterns involve fine-grained control to switch usage between system components.

Zero-Downtime Changes with Data. The problem comes when the new version of the component involves a change to data formats, so that it's not possible to have both versions share the same data storage without issues. An effective way to approach data for zero-downtime deployments is to decouple data format changes from software releases: for example, first release software that can read both the old and new formats, then migrate the data, then drop support for the old format.

Data continuity Keeping data available and consistent on infrastructure that isn't. There are many techniques that can be applied to this problem; a few include (a backup sketch follows the list):

  • Replicating data redundantly
  • Regenerating data
  • Delegating data persistence
  • Backing up to persistent storage
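
A minimal sketch of backing up to persistent object storage; the data directory and bucket name are placeholders:

# archive the data directory and copy it off the (disposable) server
tar czf "/tmp/appdata-$(date +%F).tar.gz" /var/lib/appdata
gsutil cp "/tmp/appdata-$(date +%F).tar.gz" gs://my-backup-bucket/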

Disaster recovery Coping well when the worst happens

Iron-age IT organizations usually optimize for mean time between failures (MTBF), whereas cloud-age organizations optimize for mean time to recover (MTTR).

Security Keeping bad actors at bay

  • Reliable Updates as a Defense
  • Provenance of Packages
  • Automated Hardening

The list of common vulnerabilities comes from CVE (Common Vulnerabilities and Exposures).

Hardening refers to configuring a system to make it more secure than it would be out of the box. Typical activities include (a brief sketch follows the list):

  • Configuring security policies (e.g., firewall rules, SSH key use, password policies, sudoers files, etc.).
  • Removing all but the most essential user accounts, services, software packages, and so on.
  • Auditing user accounts, system settings, and checking installed software against known vulnerabilities.
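
A minimal hardening sketch for a CentOS/RHEL system; this is illustrative, not a complete baseline:

# disable SSH password logins, keeping key-based access only
sed -i 's/^#\?PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
systemctl reload sshd

# tighten the firewall: drop the default dhcpv6-client service, allow only what is needed
firewall-cmd --permanent --remove-service=dhcpv6-client
firewall-cmd --permanent --add-service=https
firewall-cmd --reload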

For frameworks and scripts for hardening systems, see here. It is essential that the members of the team review and understand the changes made by externally created hardening scripts before applying them to their own infrastructure.

Chapter 15

Organizing for Infrastructure as Code This final chapter takes a look at implementing it from an organizational point of view.

The organizational principles that enable this include:

  • A continuous approach to the design, implementation, and improvement of services
  • Empowering teams to continuously deliver and improve their services
  • Ensuring high levels of quality and compliance while delivering rapidly and continuously

A kanban board is a powerful tool to make the value stream visible. This is a variation of an agile story wall, set up to mirror the value stream map for work.

A retrospective is a session that can be held regularly, or after major events like the completion of a project. Everyone involved in the process gathers together to discuss what is working well, and what is not working well, and then decide on changes that could be made to processes and systems in order to get better outcomes.

Post-mortems are typically conducted after an incident or some sort of major problem. The goal is to understand the root causes of the issue and decide on actions to reduce the chance of similar issues happening again.
