Occasionally a VM instance reboots unexpectedly and triggers an alert. Usually this is caused by a hardware or software issue on the physical machine hosting the VM, which makes the VM crash.
Running last reboot shows the reboot records, and who -b displays the time of the last boot.
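For reference, a quick sketch of those two checks:

# reboot records from the wtmp log
last reboot
# time of the last system boot
who -b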
For CentOS/RHEL systems you’ll find the logs at /var/log/messages, while for Ubuntu/Debian systems they are logged at /var/log/syslog.
# exclude irrelevant info
# then look around for possible keywords in the log
sudo grep -iv ': starting\|kernel: .*: Power Button\|watching system buttons\|Stopped Cleaning Up\|Started Crash recovery kernel' /var/log/messages | grep -iw 'recover[a-z]*\|power[a-z]*\|shut[a-z ]*down\|rsyslogd\|ups'
From the command output above, skim the /var/log/messages file around the timestamp. Example output:
May 3 23:31:52 xxxxx systemd: Started Update UTMP about System Boot/Shutdown.
May 3 23:31:55 xxxxx rsyslogd: [origin software="rsyslogd" swVersion="8.24.0-52.el7" x-pid="949" x-info="http://www.rsyslog.com"] start
May 3 23:31:55 xxxxx systemd: Started Google Compute Engine Shutdown Scripts.
May 3 23:37:43 xxxxx audispd: node=xxxxx type=EXECVE msg=audit(1620085063.649:1984): argc=3 a0="last" a1="-x" a2="shutdown"
From this I can see the reboot was caused by the shutdown scripts. Further check the VM instance log or the Stackdriver log (on a public cloud platform, checking the platform's logging system is more convenient than SSHing into the VM to read its logs), and get the error:
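On GCE, a quick way to pull the instance console log without SSHing in (the instance name and zone below are placeholders):

# dump the serial console output of the instance
gcloud compute instances get-serial-port-output my-instance --zone us-central1-a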
yum versionlock restricts a package to a fixed version so that yum update/upgrade will not change it.
The plugin stores a package list in /etc/yum/pluginconf.d/versionlock.list, which you can edit directly. Yum will normally attempt to update all packages, but the plugin will exclude the packages listed in the versionlock.list file.
# install the versionlock plugin
yum install yum-plugin-versionlock

# list locks
yum versionlock list [package name]

# delete a lock
yum versionlock delete 0:elasticsearch*
# clear all version locks
yum versionlock clear
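A quick sketch of pinning a specific package (the package name here is illustrative):

# pin the currently installed version of the package
yum versionlock add elasticsearch
# the lock now shows up in the plugin's list file
cat /etc/yum/pluginconf.d/versionlock.list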
When I was working at IBM, I applied for a dedicated cluster for Ansible learning.
After I left, I decided to use Vagrant to create a local cluster for the same
purpose.
NOTE: I have also created a Docker-based Ansible testing environment,
please see here
Please check the
Vagrant Ansible testing cluster repo and follow the README to set up and play with Ansible. The problem I had at the
time of creating the repo:
how to establish the SSH connection to the Vagrant VM.
Control node requirements:
Starting with ansible-core 2.11, the project will only be packaged for Python
3.8 and newer.
If you are using Ansible to manage machines in a cloud, consider using a machine
inside that cloud as your control node. In most cases Ansible will perform
better from a machine on the cloud than from a machine on the open Internet.
Managed node requirements:
Although you do not need a daemon on your managed nodes, you do need a way for
Ansible to communicate with them. For most managed nodes, Ansible makes a
connection over SSH and transfers modules using SFTP. For any machine or device
that can run Python, you also need Python 2 (version 2.6 or later) or Python 3
(version 3.5 or later).
To install on Linux using yum (I use pip install inside a virtualenv in the demo, see the repo README):
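A minimal sketch of both install paths (the yum path assumes the EPEL repo is available; the virtualenv path mirrors the repo README):

# yum-based install
sudo yum install -y ansible

# or: pip install inside a virtualenv, as used in the demo
python3 -m venv ~/venv-ansible
source ~/venv-ansible/bin/activate
pip install ansible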
YAML syntax,
especially the difference between > and | for multi-line strings:
Spanning multiple lines using a | will include the newlines and any trailing
spaces. Using a > will fold newlines to spaces; In either case the indentation
will be ignored.
include_newlines: |
  exactly as you see
  will appear these three
  lines of poetry

fold_newlines: >
  this is really a
  single line of text
  despite appearances
Background
A prod environment upgrade failed, and the rollback led to the loss of historical monitoring data. The root cause was that the old PVC was removed accidentally, so the corresponding PV became released; thankfully the PV reclaim policy is Retain, so the data on disk was still preserved.
Question
How to access the historical data on released PV?
Obviously the PV needs to be re-bound.
Solution
The PV was actually provisioned dynamically by the custom storage class gce-regional-ssd. In its definition, it is reserved for a specific PVC through the claimRef field:
Since the PVC monitoring-alertmanager is already bound to another PV, to make this one available again, kubectl edit the PV to remove the uid and resourceVersion from claimRef, modify the name, then save and quit:
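Equivalently to hand-editing, the stale binding info can be cleared with kubectl patch; the PV name below is a placeholder:

# remove the stale uid/resourceVersion so the Released PV can be claimed again
kubectl patch pv <pv-name> --type=json -p '[
  {"op": "remove", "path": "/spec/claimRef/uid"},
  {"op": "remove", "path": "/spec/claimRef/resourceVersion"}
]'
# the PV should move from Released to Available and then re-bind
kubectl get pv <pv-name> -w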
I was on-call this past week, so here I organize the content of the LinkedIn Learning course on Linux performance. It focuses on four areas: CPU, Memory, Disk and FileSystem IO, Network. In fact, many alerts or incidents are derived from performance issues; of course we also have insight checks for the services themselves. I recommend reading the book <<How Linux Works>>, which covers all of this.
For a more comprehensive cheat sheet, please switch to <<On-Call System Performance>>.
CPU
Understand the meaning of the uptime command (check man). Load average is not CPU usage. Load average is the average number of processes in the runnable (R) and uninterruptible (D) states per unit of time, i.e. the average number of active processes (that is a reasonable way to think of it, although in the source code it is actually an exponentially damped average of the active process count); it has no direct relationship with CPU utilization. (Sometimes it also helps to compare the system uptime; some incidents may be caused by a system reboot.)
# check CPU number
lscpu
# check load average
uptime
w
# or using top with 1 to show each cpu
# and load average
top -b -n 1 | head
Looking at a process's status code (such as SLsl) in top and ps is very useful; it tells you how the process is composed, for example whether it is multi-threaded.
man ps
# PROCESS STATE CODES
D    uninterruptible sleep (usually IO)
R    running or runnable (on run queue)
S    interruptible sleep (waiting for an event to complete)
T    stopped by job control signal
t    stopped by debugger during the tracing
W    paging (not valid since the 2.6.xx kernel)
X    dead (should never be seen)
Z    defunct ("zombie") process, terminated but not reaped by its parent
I    idle

# For BSD formats and when the stat keyword is used, additional characters may be displayed:
<    high-priority (not nice to other users)
N    low-priority (nice to other users)
L    has pages locked into memory (for real-time and custom IO)
s    is a session leader (a session is one or more process groups sharing the same controlling terminal)
l    is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)
+    is in the foreground process group (a process group is a set of related processes, e.g. each child process is a member of its parent's group)
For example, what does a load average of 2 mean? It depends on whether the number of CPUs is larger than, equal to, or less than 2. Check lscpu or grep 'model name' /proc/cpuinfo | wc -l to see the number of logical CPUs.
The averages over three different time intervals give us data for analyzing the trend of the system load, so we can understand the current load more comprehensively and from more angles. When the load average is higher than 70% of the CPU count, you should start analyzing and troubleshooting the high load. Once the load is too high, processes may respond slowly, which in turn affects the normal function of the service. The most recommended approach is to monitor the system load average (prometheus + grafana) and judge the load trend from the historical data. When you see a clear upward trend, for example the load doubles, then go and analyze and investigate.
# stress cpu with 1 process
stress --cpu 1 --timeout 600
# stress io
# stress -i 1 --timeout 600 does not work well
# because VM sync buffer is small
stress-ng --io 1 --hdd 1 --timeout 600
# stress cpu with 8 processes
stress -c 8 --timeout 600

# benchmark tool
sysbench --threads=10 --max-time=300 threads run
# overall system performance
# focus on in, cs, r, b; check man for descriptions
# check whether the r count is far beyond the CPU count
# too many in (interrupts) is also a problem
# us/sy shows whether the CPU is mainly occupied by user space or the kernel
# -w: wide display
# -S: unit m(mb)
# 2: profile interval
vmstat -w -S m 2
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff   cache   si   so    bi    bo    in    cs us sy id wa st
 2  0      0 230548     24 4068628    0    0     0    33 16436 29165 25  2 73  0  0
 1  0      0 226888     24 4069028    0    0     0    43 15443 27179 39  2 59  0  0
 3  0      0 225588     24 4070544    0    0     0   523 20873 38865 36  2 61  0  0
# check per-process performance
# only show items with at least one non-zero column
# -w: show context switches
# -u: show cpu stat
pidstat -w -u 2
# -t: thread level, helpful!
# -p: pid
# check both the process and its threads' context switch status
pidstat -wt -p 14233 2
# analyze interrupts, a good proc file to check
# focus on the most frequently changing rows, then look up what they refer to
watch -d 'cat /proc/interrupts'
# RES: rescheduling interrupts
strace hanging at futex, FUTEX_WAIT: The process is multi-threaded, you are tracing the original parent thread, and it’s doing nothing but waiting for some other threads to finish.
If tools like top and pidstat can no longer give more information, we should turn to dynamic tracing tools that are based on event recording.
# wait for about 15 seconds
# ctrl+c to quit
perf record -g
# then check the report
perf report
# check cpu %si and %hi
# load average may be low
top

# check softirq frequencies and category
# usually the NET_RX ratio changes a lot
watch -d cat /proc/softirqs
# check network rx/tx packets vs KB per second
# -n DEV: report network device statistics, output one sample per second
# can calculate bytes/packet
sar -n DEV 1

# -i eth0: only capture on the eth0 interface
# -n: do not resolve protocol names and host names
# tcp port 80: only capture TCP frames on port 80
tcpdump -i eth0 -n tcp port 80
In the tcpdump output, look at the Flags [x] field; for example, Flags [S] indicates a SYN packet. You can then find the source IP and isolate it with the firewall.
[ ] Cassandra perf analysis: find the cause?
[x] Why does the Cassandra process show Sleep with high CPU usage? -> its threads are running
[x] Why is the process S while some of its threads are R? The parent thread is doing nothing but waiting for its child threads.
tmpfs explained
It is intended to appear as a mounted file system, but data is stored in volatile memory instead of a persistent storage device. A similar construction is a RAM disk, which appears as a virtual disk drive and hosts a disk file system. tmpfs is meant only for ephemeral files
# resize a mounted tmpfs
# only temporary
mount -o remount,size=300M tmpfs /dev/shm
# -T: see file system type
df -hT

# create a new tmpfs mount
mkdir /data
# -t: type
# -o: options
# second tmpfs: device name
mount -t tmpfs -o size=100M tmpfs /data
umount /data
yum install -y bcc-tools
# need to manually export
export PATH=$PATH:/usr/share/bcc/tools
# check overall cache hit rate
cachestat 1 3
# check process cache hit rate
# similar to top command mechanism
# 3: 3 seconds update
cachetop 3
Also note that even the first read can show cache hits, because the system reads ahead and loads part of the file into memory. Direct IO skips the buffer, yet the cache hit rate still shows 100%. From the number of cache hits (each hit is one 4K page), you can calculate the amount of data hit during that interval (HITS * 4 KB), then divide by the interval to get the rate in KB/s.
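A back-of-the-envelope example with illustrative numbers:

# e.g. cachetop reports 2560 HITS over a 5-second interval, one 4 KB page per hit
# hit volume = 2560 * 4 KB = 10240 KB; rate = 10240 KB / 5 s = 2048 KB/s
echo $(( 2560 * 4 / 5 ))   # prints 2048 (KB/s)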
For the dd command's bs (block size) option, 4K is fine; for a large-capacity hard drive, 1M is good to go. The bs value depends on your RAM size; it is the amount of data processed in each operation.
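An illustrative dd run against a scratch file (not a raw device), just to compare bs settings:

# 1 GiB sequential write with 1 MiB blocks, bypassing the page cache
dd if=/dev/zero of=/tmp/dd.test bs=1M count=1024 oflag=direct
# the same volume with 4 KiB blocks for comparison
dd if=/dev/zero of=/tmp/dd.test bs=4K count=262144 oflag=direct
rm -f /tmp/dd.test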
Memory leaks: understand memory allocation and reclamation and which virtual memory segments are prone to leaks: the heap and the memory-mapped segment (including dynamic libraries and shared memory, where shared memory is dynamically allocated and managed by the program). So if a program forgets to release memory after allocating it, it causes a leak similar to a heap memory leak.
# -r: mem statistics
# -S: swap statistics
# 3: refresh rate in seconds
sar -r -S 3
# in bcc-tools, alongside cachestat and cachetop
memleak -a -p $(pidof app_name)
A very interesting approach:
Generally speaking, a production application should be able to adjust its log level dynamically. Reading the source code further, you will find this program can also adjust its log level: sending it the SIGUSR1 signal switches logging to INFO, and SIGUSR2 switches it to WARNING.
When troubleshooting application issues, we may need to temporarily enable debug logging in the production environment, and afterwards it is easy to forget to switch it back. Leaving the production log level below warning can cause a series of performance problems in CPU usage, disk I/O, and so on; in severe cases it can even affect other applications running on the same server.
# File reads and writes by filename and process. Top for files.
filetop

# Trace open() syscalls. Uses Linux eBPF/bcc.
opensnoop
MySQL's MyISAM engine relies mainly on the system cache to speed up disk I/O access. But if other applications are running on the same system, it is hard for the MyISAM engine to make full use of the system cache: the cache may be occupied by other applications or even evicted. So it is not recommended to base an application's performance optimization entirely on the system cache. It is better to allocate memory inside the application and build a cache fully under your own control, or to use a third-party caching service such as Memcached or Redis.
I also learned some Redis features and configuration.
To evaluate optimization results more objectively, we should first benchmark the disk and file system to get the maximum performance of the file system or disk I/O. fio is a file system and disk I/O benchmarking tool; fio is a tool that will spawn a number of threads or processes doing a particular type of I/O action as specified by the user. The typical use of fio is to write a job file matching the I/O load one wants to simulate:
Note that running write tests against a raw disk path will destroy the file system on that disk, so be sure to back up your data first.
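A minimal fio sketch, run here against a scratch file rather than a raw device (path, size and runtime are illustrative):

# 4K random read test using libaio, direct IO, 60 seconds
fio --name=randread --filename=/tmp/fio.test --size=1G \
    --rw=randread --bs=4k --ioengine=libaio --iodepth=64 \
    --direct=1 --runtime=60 --time_based --group_reporting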
time in the shell is a keyword, but there is also a standalone time command; they have different output.
yum install -y time

# use the yum-installed time
# it also shows major and minor page faults
$(which time) sleep 1
0.00user 0.00system 0:01.00elapsed 0%CPU (0avgtext+0avgdata 644maxresident)k
0inputs+0outputs (0major+205minor)pagefaults 0swaps

# use the shell keyword time
time sleep 1
# user and sys are the cpu time spent in user and kernel space
# user + sys could be larger than real time if your program
# uses multiple cores
real 0m1.002s
user 0m0.000s
sys  0m0.001s
If you want to compare the performance of 2 similar commands, put each in a loop and time the whole loop to see the difference, or use strace -c -o /tmp/result.out ./script and head the first several lines of the result to see which system calls cost the most.
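For example, a rough comparison of two ways to search a file (the commands are illustrative):

# time each variant over many iterations so the difference becomes measurable
time ( for i in $(seq 1 1000); do grep -q root /etc/passwd; done )
time ( for i in $(seq 1 1000); do awk '/root/ {found=1} END {exit !found}' /etc/passwd; done )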
/proc is mounted at boot time; see the mount details with the mount | grep proc command. proc files provide kernel info: reading the contents of a proc file causes the corresponding function in the kernel to be called to produce a fresh value. We can also write to proc files to change kernel parameters. /proc is a pseudo filesystem, so with ls -l the file length may show as 0.
/proc files under /proc/sys represent kernel variables.
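For example, reading and tuning such a kernel variable via /proc/sys or sysctl (the parameter and value are just examples):

# the file path and the sysctl name map to each other
cat /proc/sys/vm/swappiness
sysctl vm.swappiness
# runtime-only change; persist it under /etc/sysctl.d/ if needed
sudo sysctl -w vm.swappiness=10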
CPU
Packages for performance tools; yum install them and run rpm -ql to see the utilities they contain:
sysstat: iostat, mpstat, nfsiostat-sysstat, pidstat, sar
It mainly covered sar, top, cpuinfo, plus scheduling priority and nice value. (The details are recorded in other blog posts; search the keywords.)
Throughput, important for servers, favors fewer context switches and longer time slices. Throughput and responsiveness need to be traded off: good throughput means running a single task as much as possible so that context switches are rare. Responsiveness, important for interactive or control systems, requires quick context switches, shorter time slices and fewer page faults.
The kernel can be configured to be preemptible or not; preemption means a context switch.
Linux kernel has 3 choices for the kind of preemption it employs.
None: no preemption
Voluntary: the kernel checks frequently for opportunities to preempt (the most common choice)
Preempt: the scheduler preempts unless the kernel is in a critical section.
Memory
In the /proc/meminfo file, the MemAvailable value is the memory available to programs without swapping (it includes reclaimable cache and buffers); important.
The htop command is similar to top but with a colorful display. From the options you can sort by different categories, e.g. %CPU, %MEM, etc.
The translation lookaside buffer (TLB) is used for mapping virtual addresses to physical addresses. Linux supports huge pages, which can be configured with sysctl at runtime or during bootstrap (/etc/sysctl.d).
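A sketch of checking and enabling huge pages (the page count is an arbitrary example):

# current huge page configuration
grep -i hugepages /proc/meminfo
# reserve 128 huge pages at runtime
sudo sysctl -w vm.nr_hugepages=128
# persist across reboots
echo 'vm.nr_hugepages = 128' | sudo tee /etc/sysctl.d/80-hugepages.conf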
Page faults occur when a process uses an address that is not mapped or not resident in RAM. Page faults are minor or major: a minor fault is not a big deal, while a major fault involves disk I/O and is much slower. Linux is an on-demand paging system. (The book How Linux Works also mentions this.)
Kind helps you bring up a local k8s cluster for testing and POC. It works seamlessly
with kubectl and other tools such as prometheus, operators, helmfile, etc.
Install Kind
The install is easy with Go (1.17+), see this
instruction:
# At the time of writing the kind stable version is 0.18.0; it will place the
# kind binary in $GOBIN (if set) or $GOPATH/bin
go install sigs.k8s.io/kind@v0.18.0
Basic Workflow
Cluster Creation
To spin up local k8s cluster:
# see options and flags
kind create cluster --help

# You can have multiple types of cluster such as dev, stg, prod, etc
# create a one-node cluster with dev as the cluster name
kind create cluster --name dev

# Different cluster with specific configuration
kind create cluster --name stg --config stg_config.yaml

# Check k8s context
# Note the cluster name is not the same as the context name
kubectl config get-contexts
Cluster Configuration
For advanced configuration, please see this
section.
A simple multi-node cluster can be used to test, for example, a rolling upgrade:
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
# 1 control plane node and 3 workers
nodes:
# the control plane node config
- role: control-plane
# the three workers
- role: worker
- role: worker
- role: worker
Load Image
The kind k8s cluster uses the containerd runtime; you can docker exec into a node
container and check with the crictl command:
# list images
crictl images
# list containers
crictl ps
# list pods
crictl pods
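To actually load a locally built image into the cluster, kind has a dedicated subcommand (the image name is a placeholder; the node container name follows kind's default <cluster>-control-plane convention):

# push a local docker image into every node of the 'dev' cluster
kind load docker-image myapp:latest --name dev
# verify it landed in containerd
docker exec dev-control-plane crictl images | grep myapp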
//TODO
This is the follow-up to <<Terraform Quick Start>>.
github repo
Working with Existing Resources
# Configure an AWS profile with proper credentials
# for terraform use
aws configure --profile deep-dive
# Linux or MacOS
export AWS_PROFILE=deep-dive

# After terraform files are in place
# download modules and provider plugins
terraform init
terraform validate
# in a collective env it is better to have a plan file
terraform plan -out m3.tfplan
terraform apply "m3.tfplan"
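Since this part is about existing resources, the usual way to bring an already-created resource under Terraform management is terraform import; the resource address and ID below are placeholders:

# map an existing VPC to a resource block defined in the configuration
terraform import aws_vpc.main vpc-0123456789abcdef0
# confirm it is now tracked in state and that the plan converges
terraform state list
terraform plan -out m3.tfplan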
This book doesn’t offer instructions in using specific scripting languages or tools. There are code examples from specific tools, but these are intended to illustrate concepts and approaches, rather than to provide instruction.
The book opens with some of the author's earlier experience, starting from a team running a VMware virtual server farm; from that experience he gradually realized and learned the necessity of IaC. Puppet and Chef are apparently long-established config automation tools. He later moved on to the cloud and learned many new ideas from other IT Ops teams, an eye-opener: “The key idea of our new approach was that every server could be automatically rebuilt from scratch, and our configuration tooling would run continuously, not ad hoc. Every server added into our new infrastructure would fall under this approach. If automation broke on some edge case, we would either change the automation to handle it, or else fix the design of the service so it was no longer an edge case.”
Virtual machines and containers complement each other.
Virtualization was one step, allowing you to add and remove VMs to scale your capacity to your load on a timescale of minutes. Containers take this to the next level, allowing you to scale your capacity up and down on a timescale of seconds.
Later a question occurred to me: if containers run on top of virtual machines, isn't that an extra layer of overhead? Is there any optimization? For example, GKE runs on GCE VMs; is there any optimization of the VM image for k8s? Yes, it uses Container-Optimized OS.
Infrastructure as code is an approach to infrastructure automation based on practices from software development. It emphasizes consistent, repeatable routines for provisioning and changing systems and their configuration. Changes are made to definitions and then rolled out to systems through unattended processes (i.e. without human intervention) that include thorough validation.
The phrase dynamic infrastructure refers to the ability to create and destroy servers programmatically.
Challenges with dynamic infrastructure (each one can cause the next):
Server Sprawl: servers growing faster than the team's ability to manage them.
Configuration drift: inconsistency across the servers, such as manual ad-hoc fixes, config.
Snowflake Server: can’t be replicated.
Fragile Infrastructure: snowflake server problem expands.
Automation Fear: lack of confidence.
Erosion: infrastructure decays over time, such as components upgrade, patches, disk fill up, hardware failure.
An operations team should be able to confidently and quickly rebuild any server in their infrastructure.
Principles of Infrastructure as Code to mitigate the above challenges:
Systems can be easily reproduced.
Systems are disposable.
Systems are consistent.
Processes are repeatable.
Design is always changing.
Effective infrastructure teams have a strong scripting culture. If a task can be scripted, script it. If a task is hard to script, drill down and see if there’s a technique or tool that can help, or whether the problem the task is addressing can be handled in a different way.
General practices of infrastructure as Code:
Use definition files: to specify infra elements and config.
Self-documented systems and processes: doc may leave gaps over time.
Version all things.
Continuously test systems and processes, how? see Chapter 11.
Small changes rather than batches.
Keep services available continuously, see chapter 14.
Antifragility, beyond robust: When something goes wrong, the priority is not simply to fix it, but to improve the ability of the system to cope with similar incidents in the future.
Chapter 2
Dynamic Infrastructure Platform: is a system that provides computing resources, particularly servers, storage, and networking, in a way that they can be programmatically allocated and managed.
A concept to clarify here: private cloud vs bare-metal cloud. I used to think they meant the same thing, but they don't; a bare-metal cloud runs an OS directly on server hardware rather than in a VM. There are many reasons why running directly on hardware may be the best choice for a given application or service. Virtualization adds performance overhead, because it inserts extra software layers between the application and the hardware resources it uses. Processes on one VM can impact the performance of other VMs running on the same host. Common tools for managing bare metal: Cobbler, Foreman, etc.
As an IT professional, the deeper and stronger your understanding of how the system works down the stack and into the hardware, the more proficient you'll be at getting the most from it.
This does not mean a new instance necessarily performs well; virtualization also has many uncertain factors:
For example, the Netflix team knew that a percentage of AWS instances, when provisioned, will perform much worse than the average instance, whether because of hardware issues or simply because they happen to be sharing hardware with someone else’s poorly behaving systems. So they wrote their provisioning scripts to immediately test the performance of each new instance. If it doesn’t meet their standards, the script destroys the instance and tries again with a new instance.
Software and infrastructure should be architected, designed, and implemented with an understanding of the true architecture of the hardware, networking, storage, and the dynamic infrastructure platform.
Chapter 3
Infrastructure Definition Tools: This chapter has discussed the types of tools to manage high-level infrastructure according to the principles and practices of infrastructure as code.
Chapter 4
Server Configuration Tools
It mainly covers provisioning tools such as Chef, Puppet, Ansible and Salt; tools for packaging server templates, such as Packer; and tools for running commands on servers.
Many server configuration tool vendors provide their own configuration registry to manage configuration definitions, for example, Chef Server, PuppetDB, and Ansible Tower.
In many cases, new servers can be built using off-the-shelf server template images. Packaging common elements onto a template makes it faster to provision new servers. Some teams take this further by creating server templates for particular roles such as web servers and application servers. Chapter 7 discusses trade-offs and patterns around baking server elements into templates versus adding them when creating servers (this was the new project I was working on at the time).
Unikernel Server Templates: an OS image that is custom-compiled with the application it will run. The image only includes the parts of the OS kernel needed for the application, so it is small and fast. This image is run directly as a VM or container (see later in this chapter) but has a single address space.
It’s important for an infrastructure team to build up and continuously improve their skills with scripting. Learn new languages, learn better techniques, learn new libraries and frameworks
Server change management models:
Ad hoc change, leads to config drift, snowflake servers and other evils.
Configuration synchronization, may still leave config drift in the parts not covered by definitions.
Immutable infra, completely replacing, requires good templates management.
Containerized services follows something similar to immutable infra, replace old container completely when apply changes. A container uses operating system features to isolate the processes, networking, and filesystem of the container, so it appears to be its own, self-contained server environment.
There is actually some dependency between the host and container. In particular, container instances use the Linux kernel of the host system, so a given image could potentially behave differently, or even fail, when run on different versions of the kernel.
A host server runs virtual machines using a hypervisor, Container instances share the operating system kernel of their host system, so they can’t run a different OS. Container has less overhead than a hardware virtual machine. Container image can be much smaller than a VM image, because it doesn’t need to include the entire OS. It can start up in seconds, as it doesn’t need to boot a kernel from scratch. And it consumes fewer system resources, because it doesn’t need to run its own kernel. So a given host can run more container processes than full VMs.
Container security, While containers isolate processes running on a host from one another, this isolation is not impossible to break. Different container implementations have different strengths and weaknesses. When using containers, a team should be sure to fully understand how the technology works, and where its vulnerabilities may lie.
Teams should ensure the provenance of each image used within the infrastructure is well known, trusted, and can be verified and traced. (At the time, Red Hat also specifically scanned and verified the ICP4D images.)
Chapter 5
General Infrastructure Services.
The purpose of this chapter isn’t to list or explain these services and tools. Instead, it is intended to explain how they should work in the context of a dynamic infrastructure managed as code.
The services and tools addressed are monitoring, service discovery, distributed process management, and software deployment. (These are the main supporting services once the infrastructure itself is built.)
Monitor: alerting, metrics and logging.
Monitoring information comes in two types: state and events. State is concerned with the current situation, whereas an event records actions or changes.
Alerting: Tell Me When Something Is Wrong
Metrics: Collect and Analyze Data
Log Aggregation and Analysis
Service Discovery: Applications and services running in an infrastructure often need to know how to find other applications and services.
Distributed Process Management: VMs or containers. K8s, Nomad, Openshift.
Software Deployment: Many have a series of environments for testing stages, including things like operational acceptance testing (OAT), QA (for humans to carry out exploratory testing), system integration testing (SIT), user acceptance testing (UAT), staging, preproduction, and performance.
Part II Patterns
Chapter 6
Patterns for Provisioning Servers.
Provisioning is not only done for a new server. Sometimes an existing server is re-provisioned, changing its role from one to another.
Server’s lifecycle:
package a server template.
create a new server
update a server
replace a server
delete a server
Zero-downtime replacement ensures that a new server is completely built and tested while the existing server is still running so it can be hot-swapped into service once ready.
Advocates of immutable servers view making a change to the configuration of a production server as bad practice, no better than modifying the source code of software directly on a production server.
recover from failure, outage, maintenance
resize server pool, add/remove instances
reconfig hardware resources, for example, add CPU, RAM, mount new disks, etc.
Server roles:
Another pattern is to have a role-inheritance hierarchy (which is indeed what we did). The base role would have the software and configuration common to all servers, such as a monitoring agent, common user accounts, and common configuration like DNS and NTP server settings. Other roles would add more things on top of this, possibly at several levels.
It can still be useful to have servers with multiple roles even with the role inheritance pattern. For example, although production deployments may have separate web, app, and db servers, for development and some test cases, it can be pragmatic to combine these onto a single server.
Cloned servers (similar to saving a container to an image) suffer because they carry runtime data from the original server, which is not reproducible, and they accumulate changes and data.
Bootstrapping new servers:
push bootstrapping: Ansible, Chef, Puppet
pull bootstrapping: cloud-init
Smoke test every new server instance:
Is the server running and accessible?
Is the monitoring agent running?
Has the server appeared in DNS, monitoring, and other network services?
Are all of the necessary services (web, app, database, etc.) running?
Are required user accounts in place?
Are there any ports open that shouldn’t be?
Are any user accounts enabled that shouldn’t be?
Smoke tests could be integrated with monitoring systems. Most of the checks that would go into a smoke test would work great as routine monitoring checks, so the smoke test could just verify that the new server appears in the monitoring system, and that all of its checks are green.
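A rough shell sketch of such a smoke test (host, ports and endpoint are placeholders; in practice most of these checks would come from the monitoring system):

#!/usr/bin/env bash
set -euo pipefail
HOST=new-server.example.com
# is the server reachable at all?
ping -c 1 "$HOST" > /dev/null
# are the SSH and web ports open?
nc -z -w 3 "$HOST" 22
nc -z -w 3 "$HOST" 443
# is the application health endpoint answering?
curl -fsS "https://$HOST/healthz" > /dev/null
echo "smoke test passed for $HOST"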
Chapter 7
Patterns for Managing Server Templates; needs particular attention.
These are also the two approaches we have taken, the old one and the new one; the new generation uses the second. You can also combine the two, keeping the frequently changing parts in creation-time provisioning.
One end of the spectrum is minimizing what’s on the template and doing most of the provisioning work when a new server is created.
Keeping templates minimal makes sense when there is a lot of variation in what may be installed on a server. For example, if people create servers by self-service, choosing from a large menu of configuration options, it makes sense to provision dynamically when the server is created. Otherwise, the library of prebuilt templates would need to be huge to include all of the variations that a user might select.
At the other end of the provisioning spectrum is putting nearly everything into the server template.
Doing all of the significant provisioning in the template, and disallowing changes to anything other than runtime data after a server is created, is the key idea of immutable servers.
Process to build template
An alternative to booting the origin image is to mount the origin disk image in another server and apply changes to its filesystem. This tends to be much faster, but the customization process may be more complicated.
Netflix’s Aminator tool builds AWS AMIs by mounting the origin image as a disk volume. The company’s blog post on Aminator describes the process quite well. Packer offers the amazon-chroot builder to support this approach.
It could make sense to have server templates tuned for different purposes. Database server nodes could be built from one template that has been tuned for high-performance file access, while web servers may be tuned for network I/O throughput. (We did not go this far.)
Chapter 8
Patterns for Updating and Changing Servers; needs particular attention.
An effective change management process ensures that any new change is rolled out to all relevant existing servers and applied to newly created servers.
Continuous configuration synchronization: for example, Google gcloud resource configuration has a central source-of-truth repo (mainly for APIs and role permissions), and the configuration process syncs every hour or so to eliminate config drift.
Any areas not explicitly managed by configuration definitions may be changed outside the tooling, which leaves them vulnerable to configuration drift.
Immutable Servers, the practice is normally combined with keeping the lifespan of servers short, as with the Phoenix. So servers are rebuilt as frequently as every day, leaving little opportunity for unmanaged changes. Another approach to this issue is to set those parts of a server’s filesystems that should not change at runtime as read-only.
Using the term “immutable” to describe this pattern can be misleading. “Immutable” means that a thing can't be changed, so a truly immutable server would be useless. As soon as a server boots, its runtime state changes—processes run, entries are written to logfiles, and application data is added, updated, and removed. It's more useful to think of the term “immutable” as applying to the server's configuration, rather than to the server as a whole.
Depending on the design of the configuration tool, a pull-based system may be more scalable than a push-based system. A push system needs the master to open connections to the systems it manages, which can become a bottleneck with infrastructures that scale to thousands of servers. Setting up clusters or pools of agents can help a push model scale. But a pull model can be designed to scale with fewer resources, and with less complexity.
Chapter 9
Patterns for Defining Infrastructure
This chapter will look at how to provision and configure larger groups of infrastructure elements.
Stack: A stack is a collection of infrastructure elements that are defined as a unit.
Use parameterized environment definitions, for example terraform brings up a stack with a single definition file for different environments.
The book mentions the Consul configuration registry, which stores resources of the different stacks, e.g. runtime IP addresses, so that stacks can reference each other. This decouples the stacks, and each can then be managed independently.
# AWS, get vip_ip from consul
resource "consul_keys" "app_server" {
  key {
    name = "vip_ip"
    path = "myapp/${var.environment}/appserver/vip_ip"
  }
}
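On the publishing side, the providing stack can drop its runtime output into the same Consul path; the path and value below are illustrative:

# register the provisioned VIP so other stacks can look it up
consul kv put myapp/prod/appserver/vip_ip 10.1.2.3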
It’s better to ensure that infrastructure is provisioned and updated by running tools from centrally managed systems, such as an orchestration agent. An orchestration agent is a server that is used to execute tools for provisioning and updating infrastructure. These are often controlled by a CI or CD server, as part of a change management pipeline. For reasons of security, consistency and dependency management, this is indeed how it should be done.
Part III Practice
Chapter 10
Software Engineering Practices for Infrastructure
Assume everything you deliver will need to change as the system evolves.
The true measure of the quality of a system, and its code, is how quickly and safely changes are made to it.
This is exactly how my team does it with gitlab-ci:
Although a CI tool can be used to run tests automatically on commits made to each separate branch, the integrated changes are only tested together when the branches are merged. Some teams find that this works well for them, generally by keeping branches very short-lived.
This is summarized well: commit changes to a short-lived branch and then merge to trunk, running CI both before and after the merge, on the branch and on the trunk.
CI/CD is also explained well here:
CI: Continuous integration addresses work done on a single codebase. CD: Continuous delivery expands the scope of this continuous integration to the entire system, with all of its components.
The idea behind CD is to ensure that all of the deployable components, systems, and infrastructure are continuously validated to ensure that they are production ready. It is used to address the problems of the “integration phase.”
One misconception about CD is that it means every change committed is applied to production immediately after passing automated tests. The point of CD is not to apply every change to production immediately, but to ensure that every change is ready to go to production.
Code Quality
The key to a well-engineered system is simplicity. Build only what you need, then it becomes easier to make sure what you have built is correct. Reorganize code when doing so clearly adds value.
Technical debt is a metaphor for problems in a system that have been left unfixed. It is best not to accumulate technical debt; fix it as soon as you find it.
An optional feature that is no longer used, or whose development has been stopped, is technical debt. It should be pruned ruthlessly. Even if you decide later on that you need that code, it should be in the history of the VCS. If, in the future, you want to go back and dust it off, you’ve got it in the history in version control.
Chapter 11
Testing Infrastructure Changes; worth close attention.
The pyramid puts tests with a broader scope toward the top, and those with a narrow scope at the bottom. The lower tiers validate smaller, individual things such as definition files and scripts. The middle tiers test some of the lower-level elements together—for example, by creating a running server. The highest tiers test working systems together—for example, a service with multiple servers and their surrounding infrastructure.
There are more tests at the lower levels of the pyramid and fewer at the top. Because the lower-level tests are smaller and more focused, they run very quickly. The higher-level tests tend to be more involved, taking longer to set up and then run, so they run slower.
In order for CI and CD to be practical, the full test suite should run every time someone commits a change. The committer should be able to see the results of the test for their individual change in a matter of minutes. Slow test suites make this difficult to do, which often leads teams to decide to run the test suite periodically—every few hours, or even nightly.
If running tests on every commit is too slow to be practical, the solution is not to run the tests less often, but instead to fix the situation so the test suite runs more quickly. This usually involves re-balancing the test suite, reducing the number of long-running tests and increasing the coverage of tests at the lower levels.
This in turn may require rearchitecting the system being tested to be more modular and loosely coupled, so that individual components can be tested more quickly.
static code analysis: linting, Static analysis can be used to check for common errors and bad habits which, while syntactically correct, can lead to bugs, security holes, performance issues, or just code that is difficult to understand.
unit testing: Ansible has dedicated modules for this, as do Puppet and Chef.
Mid-level testing
For example, starts building template via Packer and Ansible, the validation process would be to create a server instance using the new template, and then run some tests against it.
Tools to test server configuration: Serverspec. Currently, for Packer-built instances we do the checking with our own scripts; for example:
describe service('login_service') do
  it { should be_running }
end

describe host('dbserver') do
  it { should be_reachable.with( :port => 5432 ) }
end

describe 'install and configure web server' do
  let(:chef_run) { ChefSpec::SoloRunner.converge(nginx_configuration_recipe) }

  it 'installs nginx' do
    expect(chef_run).to install_package('nginx')
  end
end

describe 'home page is working' do
  let(:chef_run) { ChefSpec::SoloRunner.converge(nginx_configuration_recipe, home_page_deployment_recipe) }

  it 'loads correctly' do
    response = Net::HTTP.new('localhost', 80).get('/')
    expect(response.body).to include('Welcome to the home page')
  end
end
Automated tests that remotely log into a server can be challenging to implement securely. These tests either need a hardcoded password, or else an SSH key or similar mechanism that authorizes unattended logins.
One approach to mitigate this is to have tests execute on the test server and push their results to a central server. This could be combined with monitoring, so that servers can self-test and trigger an alert if they fail.
Another approach is to generate one-off authentication credentials when launching a server to test.
High-level testing
The higher levels of the test suite involve testing that multiple elements of the infrastructure work correctly when integrated together.
Testing Operational Quality: this part is also important, but it should fall within QA's scope.
People managing projects to develop and deploy software have a bucket of requirements they call non-functional requirements, or NFRs; these are also sometimes referred to as cross-functional requirements (CFRs). Performance, availability, and security tend to be swept into this bucket.
Operational testing can take place at multiple tiers of the testing pyramid, although the results at the top tiers are the most important.
On the relationship between testing and monitoring:
Testing is aimed at detecting problems when making changes, before they are applied to production systems. Monitoring is aimed at detecting problems in running systems.
In order to effectively test a component, it must be isolated from any dependencies during the test. A solution to this is to use a stub server instead of the application server. It’s important for the stub server to be simple to maintain and use. It only needs to return responses specific to the tests you write.
Mocks, fakes, and stubs are all types of test doubles. A test double replaces a dependency needed by a component or service being tested, to simplify testing.
QA tester means: quality analyst/assurance.
story: a small piece of work (a category in Jira); that is probably what it means.
Chapter 12
Change Management Pipelines for Infrastructure
This chapter explains how to implement continuous delivery for infrastructure by building a change management pipeline. It covers how to design, integrate and test the CD pipeline.
A change management pipeline could be described as the automated manifestation of your infrastructure change management process. Just think of it as the CD pipeline.
Guidelines for Designing Pipelines:
Ensure Consistency Across Stages, e.g: server operating system versions and configuration should be the same across environments. Make sure that the essential characteristics are the same.
Get Immediate Feedback for Every Change
Run Automated Stages Before Manual Stages
Get Production-Like Sooner Rather Than Later
My colleague Chris Bird described this as DevOops; the ability to automatically configure many machines at once gives us the ability to automatically break many machines at once. In other words, the benefit and the risk go hand in hand.
Here is a recap of a CI/CD flow:
local development stage, write code and test on local virtualization, then commit to VCS.
build stage, syntax checking, unit tests, test doubles, publish reports, packaging and upload code/template image, etc.
If you are not using the immutable server model, you will need a configuration master (Chef server, Puppet master or Ansible Tower) to configure environments. So at the end of the CI pipeline, a configuration artifact is packaged and uploaded for these config masters to use when configuring running servers; or, with masterless configuration, the running servers download it automatically from a file server.
If you are using the immutable server model, everything is already baked into the image template, for example with Packer, and no configuration master or masterless setup is needed.
automated test stage, refer to test pyramid
manual validation stage
apply to live, any significant risk or uncertainty at this stage should be modeled and addressed in upstream stages.
Also note that not every commit goes through every stage on its own: commits 1/2/3 may each reach one stage and then move into the next stage together, so the earlier stages of the pipeline will run more often than the later stages. Not every change, even ones that pass testing and demo, is necessarily deployed immediately to production.
Pipeline for complex system:
fan-in pattern: The fan-in pattern is a common one, useful for building a system that is composed of multiple components. Each component starts out with its own pipeline to build and test it in isolation. Then the component pipelines are joined so that the components are tested together. A system with multiple layers of components may have multiple joins. The pipeline diagram looks like a fan, hence the name.
Contract tests are automated tests that check whether a provider interface behaves as consumers expect. This is a much smaller set of tests than full functional tests, purely focused on the API that the service has committed to provide to its consumers.
Chapter 13
Workflow for the Infrastructure Team; this chapter is nicely written.
An infrastructure engineer can no longer just log onto a server to make a change. Instead, they make changes to the tools and definitions, and then allow the change management pipeline to roll the changes out to the server.
Autonomic Automation Workflow
Using a local sandbox for testing:
A sandbox is an environment where a team member can try out changes before committing them into the pipeline. It may be run on a local workstation, using virtualization, or could be run on the virtualization platform.
Keeping the whole change/commit cycle short needs some habits around how to structure the changes so they don’t break production even when the whole task isn’t finished. Feature toggles and similar techniques mentioned in Chapter 12 can help.
Chapter 14
Continuity with Dynamic Infrastructure
This chapter is concerned with the operational quality of production infrastructure.
Many IT service providers use availability as a key performance metric or SLA(service level agreement). This is a percentage, often expressed as a number of nines: “five nines availability” means that the system is available 99.999% of the time.
Service continuity
Keeping services available to end users in the face of problems and changes
A pitfall of using dynamic pools to automatically replace failed servers is that it can mask a problem. If an application has a bug that causes it to crash frequently, it may take a while for people to notice. So it is important to implement metrics and alerting on the pool’s activity. The team should be sent critical alerts when the frequency of server failures exceeds a threshold.
Software that has been designed and implemented with the assumption that servers and other infrastructure elements are routinely added and removed is sometimes referred to as cloud native. Cloud-native software handles constantly changing and shifting infrastructure seamlessly.
The team at Heroku published a list of guidelines for applications to work well in the context of a dynamic infrastructure, called the 12-factor application.
Some characteristics of non-cloud-native software that require lift and shift migrations:
Stateful sessions
Storing data on the local filesystem
Slow-running startup routines
Static configuration of infrastructure parameters
Zero-Downtime Changes
Many changes require taking elements of the infrastructure offline, or completely replacing them. Examples include upgrading an OS kernel, reconfiguring a network, or deploying a new version of application software. However, it’s often possible to carry out these changes without interrupting service.
Routing Traffic for Zero-Downtime Replacements. Zero-downtime change patterns involve fine-grained control to switch usage between system components.
Zero-Downtime Changes with Data. The problem comes when the new version of the component involves a change to data formats so that it's not possible to have both versions share the same data storage without issues. An effective way to approach data for zero-downtime deployments is to decouple data format changes from software releases.
Data continuity
Keeping data available and consistent on infrastructure that isn’t.
There are many techniques that can be applied to this problem. A few include:
Replicating data redundantly
Regenerating data
Delegating data persistence
Backing up to persistent storage
Disaster recovery
Coping well when the worst happens
Iron-age IT organizations usually optimize for mean time between failures (MTBF), whereas cloud-age organizations optimize for mean time to recover (MTTR).
Removing all but the most essential user accounts, services, software packages, and so on.
Auditing user accounts, system settings, and checking installed software against known vulnerabilities.
Frameworks and scripts for hardening system, see here. It is essential that the members of the team review and understand the changes made by externally created hardening scripts before applying them to their own infrastructure.
Chapter 15
Organizing for Infrastructure as Code
This final chapter takes a look at implementing it from an organizational point of view.
The organizational principles that enable this include:
A continuous approach to the design, implementation, and improvement of services
Empowering teams to continuously deliver and improve their services
Ensuring high levels of quality and compliance while delivering rapidly and continuously
A kanban board is a powerful tool to make the value stream visible. This is a variation of an agile story wall, set up to mirror the value stream map for work.
A retrospective is a session that can be held regularly, or after major events like the completion of a project. Everyone involved in the process gathers together to discuss what is working well, and what is not working well, and then decide on changes that could be made to processes and systems in order to get better outcomes.
Post-mortems are typically conducted after an incident or some sort of major problem. The goal is to understand the root causes of the issue, and decide on actions to reduce the chance of similar issues happening again.