Infrastructure as Code

I had just finished reading the first edition when the second edition came out, hmmm… -_-|||

This book doesn’t offer instructions in using specific scripting languages or tools. There are code examples from specific tools, but these are intended to illustrate concepts and approaches, rather than to provide instruction.

The book opens with the author's earlier experiences, starting from a team running a VMware virtual server farm; through these experiences he gradually came to understand the necessity of IaC. Puppet and Chef, it seems, have been config automation tools for a long time. Later the team transitioned to the cloud and learned many new ideas from other IT Ops teams. An eye-opener: "The key idea of our new approach was that every server could be automatically rebuilt from scratch, and our configuration tooling would run continuously, not ad hoc. Every server added into our new infrastructure would fall under this approach. If automation broke on some edge case, we would either change the automation to handle it, or else fix the design of the service so it was no longer an edge case."

Virtual machines and containers complement each other. Virtualization was one step, allowing you to add and remove VMs to scale your capacity to your load on a timescale of minutes. Containers take this to the next level, allowing you to scale your capacity up and down on a timescale of seconds.

Later a question occurred to me: if containers run on top of virtual machines, doesn't that add another layer of overhead? Is there any optimization for this? For example, GKE runs on GCE VMs; is there any optimization of the VM image for k8s? Yes, it uses a container-optimized OS.

There are also some articles comparing Kubernetes runtime environments: "Where to Install Kubernetes? Bare-Metal vs. VMs. vs. Cloud" and "Running Containers on Bare Metal vs. VMs: Performance and Benefits".

Part I Foundations

Chapter 1

Infrastructure as code is an approach to infrastructure automation based on practices from software development. It emphasizes consistent, repeatable routines for provisioning and changing systems and their configuration. Changes are made to definitions and then rolled out to systems through unattended processes (i.e., without human involvement) that include thorough validation.

The phrase dynamic infrastructure refers to the ability to create and destroy servers programmatically.

Challenges with dynamic infrastructure, where each one can lead to the next:

  • Server Sprawl: servers growing faster than the team's ability to manage them.
  • Configuration drift: inconsistency across servers, from things like manual ad hoc fixes and one-off configuration changes.
  • Snowflake Server: a server that can't be replicated.
  • Fragile Infrastructure: the snowflake server problem expanded across the whole infrastructure.
  • Automation Fear: lack of confidence in the automation.
  • Erosion: infrastructure decays over time through component upgrades, patches, disks filling up, hardware failures, and so on.

An operations team should be able to confidently and quickly rebuild any server in their infrastructure.

Principles of Infrastructure as Code to mitigate the above challenges:

  • Systems can be easily reproduced.
  • Systems are disposable.
  • Systems are consistent.
  • Processes are repeatable.
  • Design is always changing.

Effective infrastructure teams have a strong scripting culture. If a task can be scripted, script it. If a task is hard to script, drill down and see if there’s a technique or tool that can help, or whether the problem the task is addressing can be handled in a different way.

General practices of infrastructure as code:

  • Use definition files: to specify infrastructure elements and their configuration (a sketch follows this list).
  • Self-documented systems and processes: separate documentation tends to leave gaps and go stale over time.
  • Version all things.
  • Continuously test systems and processes; how? See Chapter 11.
  • Small changes rather than big batches.
  • Keep services available continuously; see Chapter 14.
  • Antifragility, beyond robust: When something goes wrong, the priority is not simply to fix it, but to improve the ability of the system to cope with similar incidents in the future.
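
To make the definition-file idea concrete, here is a minimal sketch in Python, not tied to any real tool: desired state is declared as data, and an unattended process computes what must change to converge reality toward it. The server list and converge function are invented for illustration.

# A minimal, hypothetical sketch of the "definition file" idea: desired
# state is declared as data, and an idempotent process converges real
# infrastructure toward it. Not any real tool's API.
DESIRED_SERVERS = [
    {"name": "web-1", "role": "web", "size": "small"},
    {"name": "app-1", "role": "app", "size": "medium"},
]

def converge(existing_names, desired=DESIRED_SERVERS):
    """Compare desired state with reality and return the actions to take."""
    desired_names = {s["name"] for s in desired}
    to_create = [s for s in desired if s["name"] not in existing_names]
    to_destroy = [n for n in existing_names if n not in desired_names]
    return to_create, to_destroy

create, destroy = converge({"web-1", "stray-server"})
print("create:", create)    # app-1 is missing, so create it
print("destroy:", destroy)  # stray-server isn't in the definition, so remove it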

Chapter 2

Dynamic Infrastructure Platform: a system that provides computing resources, particularly servers, storage, and networking, in a way that lets them be programmatically allocated and managed.

This chapter mainly covers the requirements for building a dynamic infra platform, such as being programmable, on-demand, and self-service, and the capabilities it needs to provide to users, such as compute, storage (block storage, object storage, networked filesystems), networking, authentication, etc. (A sketch of the "programmable" part follows.)
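
As a sketch of what "programmable" means in practice, server allocation can go entirely through the platform's API. This example assumes AWS with the boto3 SDK; the AMI ID is a placeholder.

import boto3  # AWS SDK for Python; any cloud with an API would do

ec2 = boto3.client("ec2", region_name="us-east-1")

# Allocate a server programmatically rather than through a console.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI ID
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "role", "Value": "web"}],
    }],
)
instance_id = response["Instances"][0]["InstanceId"]
print("created", instance_id)

# ...and destroy it just as programmatically.
ec2.terminate_instances(InstanceIds=[instance_id])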

One concept worth clarifying: private cloud vs. bare-metal cloud. I used to think they meant the same thing, but they don't. A bare-metal cloud runs an OS directly on server hardware rather than in a VM. There are many reasons why running directly on hardware may be the best choice for a given application or service: virtualization adds performance overhead, because it inserts extra software layers between the application and the hardware resources it uses, and processes on one VM can impact the performance of other VMs running on the same host. Commonly used tools for managing bare metal include Cobbler and Foreman.

As an IT professional, the deeper and stronger your understanding of how the system works down the stack and into the hardware, the more proficient you'll be at getting the most from it.

A new instance is not necessarily a well-performing one; virtualization brings plenty of uncertainty: For example, the Netflix team knew that a percentage of AWS instances, when provisioned, will perform much worse than the average instance, whether because of hardware issues or simply because they happen to be sharing hardware with someone else's poorly behaving systems. So they wrote their provisioning scripts to immediately test the performance of each new instance. If it doesn't meet their standards, the script destroys the instance and tries again with a new instance.
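
A hedged sketch of that provision-and-test loop, assuming AWS and boto3; the benchmark itself is left as a placeholder since Netflix's actual checks aren't spelled out here.

import boto3  # AWS SDK for Python

ec2 = boto3.resource("ec2", region_name="us-east-1")

def passes_benchmark(instance):
    """Placeholder: run a quick CPU/disk benchmark against the new instance
    (e.g., over SSH) and compare the result to your baseline."""
    raise NotImplementedError("wire up your own benchmark here")

def provision_good_instance(ami, max_attempts=5):
    """Keep replacing instances until one meets the performance bar."""
    for _ in range(max_attempts):
        instance = ec2.create_instances(ImageId=ami, InstanceType="m5.large",
                                        MinCount=1, MaxCount=1)[0]
        instance.wait_until_running()
        if passes_benchmark(instance):
            return instance
        instance.terminate()  # underperformer: destroy it and try again
    raise RuntimeError("no instance met the performance threshold")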

Software and infrastructure should be architected, designed, and implemented with an understanding of the true architecture of the hardware, networking, storage, and the dynamic infrastructure platform.

Chapter 3

Infrastructure Definition Tools: this chapter discusses the types of tools that manage high-level infrastructure according to the principles and practices of infrastructure as code.

Chapter 4

Server Configuration Tools. This chapter mainly covers provisioning tools such as Chef, Puppet, Ansible, and Salt; tools for packaging server templates, such as Packer; and tools for running commands on servers.

Many server configuration tool vendors provide their own configuration registry to manage configuration definitions, for example, Chef Server, PuppetDB, and Ansible Tower.

In many cases, new servers can be built using off-the-shelf server template images. Packaging common elements onto a template makes it faster to provision new servers. Some teams take this further by creating server templates for particular roles such as web servers and application servers. Chapter 7 discusses trade-offs and patterns around baking server elements into templates versus adding them when creating servers (this was the new project we were working on at the time).

Unikernel Server Templates: an OS image that is custom-compiled with the application it will run. The image includes only the parts of the OS kernel needed for the application, so it is small and fast. The image runs directly as a VM or container (see later in this chapter) but has a single address space.

It's important for an infrastructure team to build up and continuously improve their skills with scripting. Learn new languages, learn better techniques, learn new libraries and frameworks.

Server change management models:

  • Ad hoc changes: lead to configuration drift, snowflake servers, and other evils.
  • Configuration synchronization: parts not covered by the definitions may still drift.
  • Immutable infrastructure: changes are applied by completely replacing servers, which requires good template management.

Containerized services follow something similar to immutable infrastructure: the old container is completely replaced when applying changes. A container uses operating system features to isolate the processes, networking, and filesystem of the container, so it appears to be its own self-contained server environment.

There is actually some dependency between the host and container. In particular, container instances use the Linux kernel of the host system, so a given image could potentially behave differently, or even fail, when run on different versions of the kernel.

A host server runs virtual machines using a hypervisor. Container instances share the operating system kernel of their host system, so they can't run a different OS. A container has less overhead than a hardware virtual machine: a container image can be much smaller than a VM image, because it doesn't need to include an entire OS; it can start up in seconds, as it doesn't need to boot a kernel from scratch; and it consumes fewer system resources, because it doesn't run its own kernel. So a given host can run more container processes than full VMs.

Container security: while containers isolate processes running on a host from one another, this isolation is not impossible to break. Different container implementations have different strengths and weaknesses. When using containers, a team should be sure to fully understand how the technology works, and where its vulnerabilities may lie.

Teams should ensure the provenance of each image used within the infrastructure is well known, trusted, and can be verified and traced. (At the time, Red Hat also specifically scanned and vetted the ICP4D images.)

Chapter 5

General Infrastructure Services. The purpose of this chapter isn't to list or explain these services and tools. Instead, it is intended to explain how they should work in the context of a dynamic infrastructure managed as code.

The services and tools addressed are monitoring, service discovery, distributed process management, and software deployment. (These are the main services that come after the infrastructure itself has been built.)

Monitoring: alerting, metrics, and logging. Monitoring information comes in two types: state and events. State is concerned with the current situation, whereas an event records actions or changes.

  • Alerting: Tell Me When Something Is Wrong
  • Metrics: Collect and Analyze Data
  • Log Aggregation and Analysis

Service Discovery: Applications and services running in an infrastructure often need to know how to find other applications and services.
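
For illustration, this is roughly how an application could look up healthy instances from a Consul registry via its standard HTTP health endpoint. It assumes a local Consul agent on the default port, and the service name myapp is a placeholder.

import requests  # third-party HTTP client (pip install requests)

def discover(service_name, consul="http://localhost:8500"):
    """Ask the local Consul agent for instances passing their health checks."""
    resp = requests.get(f"{consul}/v1/health/service/{service_name}",
                        params={"passing": "true"})
    resp.raise_for_status()
    return [(entry["Service"]["Address"], entry["Service"]["Port"])
            for entry in resp.json()]

print(discover("myapp"))  # e.g., [('10.0.1.5', 8080), ('10.0.1.6', 8080)]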

Distributed Process Management: orchestrating processes across VMs or containers, with tools such as Kubernetes, Nomad, and OpenShift.

Software Deployment: many organizations have a series of environments for testing stages, including operational acceptance testing (OAT), QA (for humans to carry out exploratory testing), system integration testing (SIT), user acceptance testing (UAT), staging, preproduction, and performance.

Part II Patterns

Chapter 6

Patterns for Provisioning Servers.

Provisioning is not done only for a new server. Sometimes an existing server is re-provisioned, changing its role from one to another.

A server's lifecycle:

  1. Package a server template.
  2. Create a new server.
  3. Update a server.
  4. Replace a server.
  5. Delete a server.

Zero-downtime replacement ensures that a new server is completely built and tested while the existing server is still running so it can be hot-swapped into service once ready.

Advocates of immutable servers view making a change to the configuration of a production server as bad practice, no better than modifying the source code of software directly on a production server.

Typical reasons to replace or rebuild a server:

  1. Recover from failure, outage, or maintenance.
  2. Resize the server pool by adding or removing instances.
  3. Reconfigure hardware resources, for example adding CPU or RAM, or mounting new disks.

Server roles: another pattern is to have a role-inheritance hierarchy (this is indeed what we did). The base role would have the software and configuration common to all servers, such as a monitoring agent, common user accounts, and common configuration like DNS and NTP server settings. Other roles would add more things on top of this, possibly at several levels.

It can still be useful to have servers with multiple roles even with the role inheritance pattern. For example, although production deployments may have separate web, app, and db servers, for development and some test cases, it can be pragmatic to combine these onto a single server.
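
A small sketch of how role inheritance can be modeled; the role names and package lists are invented for illustration.

# Hypothetical role hierarchy: each role names a parent and what it adds.
ROLES = {
    "base":    {"parent": None,   "packages": ["monitoring-agent", "ntp"]},
    "web":     {"parent": "base", "packages": ["nginx"]},
    "app":     {"parent": "base", "packages": ["openjdk"]},
    # A combined role for dev/test boxes, as described above.
    "dev-all": {"parent": "base", "packages": ["nginx", "openjdk", "postgresql"]},
}

def resolve_packages(role):
    """Walk up the inheritance chain, collecting packages from each level."""
    packages = []
    while role is not None:
        packages = ROLES[role]["packages"] + packages
        role = ROLES[role]["parent"]
    return packages

print(resolve_packages("web"))  # ['monitoring-agent', 'ntp', 'nginx']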

Cloned servers (similar to saving a container to an image) suffer because they carry runtime data from the original server, which is not reproducible and which accumulates changes and data over time.

Bootstrapping new servers:

  • push bootstrapping: Ansible, Chef, Puppet
  • pull bootstrapping: cloud-init

Smoke test every new server instance:

  • Is the server running and accessible?
  • Is the monitoring agent running?
  • Has the server appeared in DNS, monitoring, and other network services?
  • Are all of the necessary services (web, app, database, etc.) running?
  • Are required user accounts in place?
  • Are there any ports open that shouldn’t be?
  • Are any user accounts enabled that shouldn’t be?

Smoke tests could be integrated with monitoring systems. Most of the checks that would go into a smoke test would work great as routine monitoring checks, so the smoke test could just verify that the new server appears in the monitoring system, and that all of its checks are green.
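
A minimal smoke-test sketch along these lines, using only the Python standard library; the hostname, ports, and choice of checks are placeholders to adapt to your own services.

import socket

def port_open(host, port, timeout=3):
    """Basic reachability check: can we open a TCP connection?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def resolves_in_dns(host):
    try:
        socket.gethostbyname(host)
        return True
    except OSError:
        return False

def smoke_test(host):
    checks = {
        "resolves in DNS":   resolves_in_dns(host),
        "ssh reachable":     port_open(host, 22),
        "web service up":    port_open(host, 80),
        "debug port closed": not port_open(host, 8000),  # shouldn't be exposed
    }
    for name, ok in checks.items():
        print(("PASS" if ok else "FAIL") + ": " + name)
    return all(checks.values())

smoke_test("web-1.example.com")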

Chapter 7

Patterns for Managing Server Templates: a chapter worth close attention.

These are the two approaches we adopted, in that order; our new generation uses the second. The two can also be combined, putting the frequently changing parts into creation-time provisioning. One end of the spectrum is minimizing what's on the template and doing most of the provisioning work when a new server is created.

Keeping templates minimal makes sense when there is a lot of variation in what may be installed on a server. For example, if people create servers by self-service, choosing from a large menu of configuration options, it makes sense to provision dynamically when the server is created. Otherwise, the library of prebuilt templates would need to be huge to include all of the variations that a user might select.

At the other end of the provisioning spectrum is putting nearly everything into the server template.

Doing all of the significant provisioning in the template, and disallowing changes to anything other than runtime data after a server is created, is the key idea of immutable servers.

Process to build a template: an alternative to booting the origin image is to mount the origin disk image on another server and apply changes to its filesystem. This tends to be much faster, but the customization process may be more complicated.

Netflix’s Aminator tool builds AWS AMIs by mounting the origin image as a disk volume. The company’s blog post on Aminator describes the process quite well. Packer offers the amazon-chroot builder to support this approach.

It could make sense to have server templates tuned for different purposes. Database server nodes could be built from one template that has been tuned for high-performance file access, while web servers may be tuned for network I/O throughput. (We didn't go this far.)

Chapter 8

Patterns for Updating and Changing Servers: a chapter worth close attention. An effective change management process ensures that any new change is rolled out to all relevant existing servers and applied to newly created servers.

Continuous Configuration Synchronization: for example, Google's gcloud resource configuration has a central source-of-truth repo (mainly for APIs and role permissions), and the configuration process syncs every hour or so to eliminate configuration drift.

Any areas not explicitly managed by configuration definitions may be changed outside the tooling, which leaves them vulnerable to configuration drift.
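
A conceptual sketch of the synchronization loop; fetch_definitions and apply_definitions are stand-ins for what a real tool like Chef or Puppet does on its run interval, and anything the definitions don't cover is untouched, which is exactly the drift risk noted above.

import time

def fetch_definitions():
    """Stand-in: pull the latest configuration definitions from the source of truth."""
    return {"ntp_server": "ntp.example.com", "ssh_root_login": "no"}

def apply_definitions(definitions):
    """Stand-in: idempotently converge the server to match the definitions.
    Anything not declared here is left alone and can still drift."""
    for key, value in definitions.items():
        print(f"ensuring {key} = {value}")

while True:
    apply_definitions(fetch_definitions())
    time.sleep(3600)  # re-converge every hour or so to correct drift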

Immutable Servers: the practice is normally combined with keeping the lifespan of servers short, as with the phoenix server pattern, so servers are rebuilt as frequently as every day, leaving little opportunity for unmanaged changes. Another approach to this issue is to set those parts of a server's filesystem that should not change at runtime as read-only.

Using the term “immutable” to describe this pattern can be misleading. “Immutable” means that a thing can’t be changed, so a truly immutable server would be useless. As soon as a server boots, its runtime state changes—processes run, entries are written to logfiles, and application data is added, updated, and removed. It’s more useful to think of the term “immutable” as applying to the server’s configuration, rather than to the server as a whole.

Depending on the design of the configuration tool, a pull-based system may be more scalable than a push-based system. A push system needs the master to open connections to the systems it manages, which can become a bottleneck with infrastructures that scale to thousands of servers. Setting up clusters or pools of agents can help a push model scale. But a pull model can be designed to scale with fewer resources, and with less complexity.

Chapter 9

Patterns for Defining Infrastructure. This chapter looks at how to provision and configure larger groups of infrastructure elements.

Stack: A stack is a collection of infrastructure elements that are defined as a unit.

Use parameterized environment definitions; for example, Terraform can bring up the same stack for different environments from a single definition file.

The book mentions the Consul configuration registry, which stores resources belonging to different stacks, such as runtime IP addresses, so that stacks can reference one another. This decouples the stacks, letting each be managed independently, as in the Terraform snippet below.

# AWS: read the app server's VIP address from the Consul registry
resource "consul_keys" "app_server" {
  key {
    name = "vip_ip"
    path = "myapp/${var.environment}/appserver/vip_ip"
  }
}

It's better to ensure that infrastructure is provisioned and updated by running tools from centrally managed systems, such as an orchestration agent. An orchestration agent is a server that is used to execute tools for provisioning and updating infrastructure. These are often controlled by a CI or CD server, as part of a change management pipeline. For reasons of security, consistency, and dependency management, this is indeed how it should be.

Part III Practice

Chapter 10

Software Engineering Practices for Infrastructure. Assume everything you deliver will need to change as the system evolves.

The true measure of the quality of a system, and its code, is how quickly and safely changes are made to it.

This is what my team does with GitLab CI: Although a CI tool can be used to run tests automatically on commits made to each separate branch, the integrated changes are only tested together when the branches are merged. Some teams find that this works well for them, generally by keeping branches very short-lived. This is summed up well: commit changes to a short-lived branch, then merge to trunk, and run CI both before and after the merge, on the branch and on trunk.

CI/CD is also explained well here: CI, continuous integration, addresses work done on a single codebase; CD, continuous delivery, expands the scope of this continuous integration to the entire system, with all of its components.

The idea behind CD is to ensure that all of the deployable components, systems, and infrastructure are continuously validated to ensure that they are production ready. It is used to address the problems of the "integration phase."

One misconception about CD is that it means every change committed is applied to production immediately after passing automated tests. The point of CD is not to apply every change to production immediately, but to ensure that every change is ready to go to production.

Code Quality: the key to a well-engineered system is simplicity. Build only what you need; then it becomes easier to make sure what you have built is correct. Reorganize code when doing so clearly adds value.

Technical debt is a metaphor for problems in a system that have been left unfixed. It's best not to accumulate technical debt; fix issues as soon as you find them.

An optional feature that is no longer used, or whose development has been stopped, is technical debt. It should be pruned ruthlessly. Even if you decide later on that you need that code, it should be in the history of the VCS. If, in the future, you want to go back and dust it off, you’ve got it in the history in version control.

Chapter 11

Testing Infrastructure Changes: a chapter worth attention. The pyramid puts tests with a broader scope toward the top, and those with a narrow scope at the bottom. The lower tiers validate smaller, individual things such as definition files and scripts. The middle tiers test some of the lower-level elements together—for example, by creating a running server. The highest tiers test working systems together—for example, a service with multiple servers and their surrounding infrastructure.

There are more tests at the lower levels of the pyramid and fewer at the top. Because the lower-level tests are smaller and more focused, they run very quickly. The higher-level tests tend to be more involved, taking longer to set up and then run, so they run slower.

In order for CI and CD to be practical, the full test suite should run every time someone commits a change. The committer should be able to see the results of the test for their individual change in a matter of minutes. Slow test suites make this difficult to do, which often leads teams to decide to run the test suite periodically—every few hours, or even nightly.

If running tests on every commit is too slow to be practical, the solution is not to run the tests less often, but instead to fix the situation so the test suite runs more quickly. This usually involves rebalancing the test suite, reducing the number of long-running tests and increasing the coverage of tests at the lower levels.

This in turn may require rearchitecting the system being tested to be more modular and loosely coupled, so that individual components can be tested more quickly.

In fact, test cases are not easy to decide on; you have to choose what to test based on actual needs, focus on components that change often or break easily, and keep the tests up to date. Practices:

  • Test at the Lowest Level Possible
  • Only Implement the Layers You Need
  • Prune the Test Suite Often
  • Continuously Review Testing Effectiveness

Whenever there is a major issue in production, or even in testing, consider running a blameless post-mortem. Google also encourages this practice internally.

Low-level testing: for files like Ansible playbooks and Packer JSON templates, there are several steps (a runnable sketch follows the list):

  • Syntax check: Ansible and other tools ship with their own parsers.
  • Static code analysis (linting): static analysis can be used to check for common errors and bad habits that, while syntactically correct, can lead to bugs, security holes, performance issues, or just code that is difficult to understand.
  • Unit testing: Ansible has dedicated modules for this, as do Puppet and Chef.
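
As a sketch, a CI stage could chain these checks; ansible-playbook --syntax-check and ansible-lint are real commands, while the playbook path is a placeholder.

import subprocess, sys

# Run the low-level checks in order; fail the stage at the first error.
CHECKS = [
    # Syntax check using Ansible's built-in parser.
    ["ansible-playbook", "--syntax-check", "playbooks/webserver.yml"],
    # Static analysis / linting for common errors and bad habits.
    ["ansible-lint", "playbooks/webserver.yml"],
]

for cmd in CHECKS:
    print("running:", " ".join(cmd))
    result = subprocess.run(cmd)
    if result.returncode != 0:
        sys.exit(result.returncode)  # fast feedback: stop at the first failure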

Mid-level testing: for example, when building a template via Packer and Ansible, the validation process would be to create a server instance using the new template and then run some tests against it.

Tools to test server configuration: Serverspec. (For now we check our Packer-built instances ourselves.) For example:

# Serverspec: assert the state of a running server
describe service('login_service') do
  it { should be_running }
end

describe host('dbserver') do
  it { should be_reachable.with(:port => 5432) }
end

# ChefSpec: unit-test Chef recipes in memory, without a real server
describe 'install and configure web server' do
  let(:chef_run) { ChefSpec::SoloRunner.converge(nginx_configuration_recipe) }

  it 'installs nginx' do
    expect(chef_run).to install_package('nginx')
  end
end

describe 'home page is working' do
  let(:chef_run) {
    ChefSpec::SoloRunner.converge(nginx_configuration_recipe,
                                  home_page_deployment_recipe)
  }

  it 'loads correctly' do
    require 'net/http'  # needed for the HTTP request below
    response = Net::HTTP.new('localhost', 80).get('/')
    expect(response.body).to include('Welcome to the home page')
  end
end

Automated tests that remotely log into a server can be challenging to implement securely. These tests either need a hardcoded password, or an SSH key or similar mechanism that authorizes unattended logins.

One approach to mitigate this is to have tests execute on the test server and push their results to a central server. This could be combined with monitoring, so that servers can self-test and trigger an alert if they fail.

Another approach is to generate one-off authentication credentials when launching a server to test.

High-level testing: the higher levels of the test suite involve testing that multiple elements of the infrastructure work correctly when integrated together.

Testing Operational Quality: this part is also important, but it falls within QA's scope. People managing projects to develop and deploy software have a bucket of requirements they call non-functional requirements, or NFRs; these are also sometimes referred to as cross-functional requirements (CFRs). Performance, availability, and security tend to be swept into this bucket.

Operational testing can take place at multiple tiers of the testing pyramid, although the results at the top tiers are the most important.

On the relationship between testing and monitoring: testing is aimed at detecting problems when making changes, before they are applied to production systems. Monitoring is aimed at detecting problems in running systems.

In order to effectively test a component, it must be isolated from any dependencies during the test. A solution to this is to use a stub server instead of the application server. It’s important for the stub server to be simple to maintain and use. It only needs to return responses specific to the tests you write.

Mocks, fakes, and stubs are all types of test doubles. A test double replaces a dependency needed by a component or service being tested, to simplify testing.
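
A minimal stub server using only Python's standard library; it returns just the canned response the tests depend on (the /status path and payload are invented).

from http.server import BaseHTTPRequestHandler, HTTPServer

class StubAppServer(BaseHTTPRequestHandler):
    """Stands in for the real application server during infrastructure tests.
    It only knows the responses our tests depend on."""
    def do_GET(self):
        if self.path == "/status":
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(b'{"status": "ok"}')
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    # Listen where the real app server would, so the component under
    # test doesn't know the difference.
    HTTPServer(("0.0.0.0", 8080), StubAppServer).serve_forever()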

QA (as in QA tester) means quality analyst/assurance.

Story: a small piece of work (a work-item category in Jira); that is probably what it means.

Chapter 12

Change Management Pipelines for Infrastructure. This chapter explains how to implement continuous delivery for infrastructure by building a change management pipeline: how to design, assemble, and test a CD pipeline.

A change management pipeline could be described as the automated manifestation of your infrastructure change management process. In short, think of it as the CD pipeline.

Guidelines for Designing Pipelines:

  • Ensure Consistency Across Stages, e.g., server operating system versions and configuration should be the same across environments. Make sure the essential characteristics are the same.
  • Get Immediate Feedback for Every Change
  • Run Automated Stages Before Manual Stages
  • Get Production-Like Sooner Rather Than Later

My colleague Chris Bird described this as DevOops; the ability to automatically configure many machines at once gives us the ability to automatically break many machines at once. In other words, the power and the risk go hand in hand.

Here is a recap of the CI/CD flow:

  1. Local development stage: write code and test on local virtualization, then commit to the VCS.
  2. Build stage: syntax checking, unit tests, test doubles, publishing reports, packaging and uploading code/template images, etc.

If you are not using the immutable server model, you need a configuration master (Chef Server, Puppet master, or Ansible Tower) to configure environments. So at the end of the CI pipeline, a configuration artifact is packaged and uploaded for these config masters to use when configuring running servers; or, with masterless configuration, running servers download it automatically from a file server.

If you are using the immutable server model, everything is baked into the image template (for example, with Packer), and neither a configuration master nor a masterless setup is needed anymore.

  3. Automated test stage: refer to the test pyramid.
  4. Manual validation stage.
  5. Apply to live: any significant risk or uncertainty at this stage should be modeled and addressed in upstream stages.

Note also that not every commit goes through every stage on its own: commits 1/2/3 may each reach one stage and then proceed to the next stage together, so the earlier stages of the pipeline run more often than the later stages. And not every change, even one that passes testing and demo, is necessarily deployed immediately to production.

Pipeline for complex systems, the fan-in pattern: the fan-in pattern is a common one, useful for building a system that is composed of multiple components. Each component starts out with its own pipeline to build and test it in isolation. Then the component pipelines are joined so that the components are tested together. A system with multiple layers of components may have multiple joins. The flow diagram looks like a fan closing inward, hence the name.

Contract tests are automated tests that check whether a provider interface behaves as consumers expect. This is a much smaller set of tests than full functional tests, purely focused on the API that the service has committed to provide to its consumers.
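
A sketch of a contract test, written as a pytest-style function; the endpoint and fields are illustrative of what a consumer might rely on. Run it with pytest, or call it directly against a test deployment.

import requests  # third-party HTTP client (pip install requests)

def test_user_contract(base_url="http://provider.test"):
    """Check only what consumers rely on: status, content type, and the
    fields they actually read. Not a full functional test."""
    resp = requests.get(f"{base_url}/api/users/42")
    assert resp.status_code == 200
    assert resp.headers["Content-Type"].startswith("application/json")
    user = resp.json()
    # The committed interface: these keys must exist with these types.
    assert isinstance(user["id"], int)
    assert isinstance(user["email"], str)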

Chapter 13

Workflow for the Infrastructure Team: this chapter is particularly well phrased. An infrastructure engineer can no longer just log onto a server to make a change. Instead, they make changes to the tools and definitions, and then allow the change management pipeline to roll the changes out to the server.

A sandbox is an environment where a team member can try out changes before committing them into the pipeline. It may be run on a local workstation, using virtualization, or could be run on the virtualization platform.

Autonomic Automation Workflow: use a local sandbox, as described above, for testing changes before committing them into the pipeline.

Keeping the whole change/commit cycle short needs some habits around how to structure the changes so they don’t break production even when the whole task isn’t finished. Feature toggles and similar techniques mentioned in Chapter 12 can help.

Chapter 14

Continuity with Dynamic Infrastructure. This chapter is concerned with the operational quality of production infrastructure. Many IT service providers use availability as a key performance metric or SLA (service-level agreement). This is a percentage, often expressed as a number of nines: "five nines availability" means that the system is available 99.999% of the time.
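
To make the nines concrete, the allowed downtime is simply (1 - availability) multiplied by the period; a quick calculation:

# Allowed downtime per year for each availability level.
MINUTES_PER_YEAR = 365 * 24 * 60

for nines in (2, 3, 4, 5):
    availability = 1 - 10 ** -nines  # e.g., five nines -> 0.99999
    downtime_minutes = (1 - availability) * MINUTES_PER_YEAR
    print(f"{availability:.5%} available -> {downtime_minutes:,.1f} minutes down/year")

# Five nines leaves only about 5.3 minutes of downtime per year.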

Service continuity: keeping services available to end users in the face of problems and changes.

A pitfall of using dynamic pools to automatically replace failed servers is that it can mask a problem. If an application has a bug that causes it to crash frequently, it may take a while for people to notice. So it is important to implement metrics and alerting on the pool’s activity. The team should be sent critical alerts when the frequency of server failures exceeds a threshold.

Software that has been designed and implemented with the assumption that servers and other infrastructure elements are routinely added and removed is sometimes referred to as cloud native. Cloud-native software handles constantly changing and shifting infrastructure seamlessly.

The team at Heroku published a list of guidelines for applications to work well in the context of a dynamic infrastructure, called the 12-factor application.

Some characteristics of non-cloud-native software that require lift and shift migrations:

  • Stateful sessions
  • Storing data on the local filesystem
  • Slow-running startup routines
  • Static configuration of infrastructure parameters

Zero-Downtime Changes. Many changes require taking elements of the infrastructure offline, or completely replacing them. Examples include upgrading an OS kernel, reconfiguring a network, or deploying a new version of application software. However, it's often possible to carry out these changes without interrupting service.

  • Blue-Green Replacement
  • Phoenix Replacement
  • Canary Replacement
  • Dark launching

Routing Traffic for Zero-Downtime Replacements. Zero-downtime change patterns involve fine-grained control to switch usage between system components.
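
One concrete mechanism is weighted load balancer routing, shifting traffic gradually from the old (blue) pool to the new (green) one. This sketch assumes AWS ELBv2 weighted target groups via boto3; the ARNs are placeholders, and other platforms expose similar controls.

import boto3, time

elbv2 = boto3.client("elbv2", region_name="us-east-1")

LISTENER_ARN = "arn:aws:elasticloadbalancing:...:listener/app/my-lb/..."  # placeholder
BLUE_TG = "arn:aws:elasticloadbalancing:...:targetgroup/blue/..."         # placeholder
GREEN_TG = "arn:aws:elasticloadbalancing:...:targetgroup/green/..."       # placeholder

def shift_traffic(green_weight):
    """Route green_weight% of requests to the new pool, the rest to the old."""
    elbv2.modify_listener(
        ListenerArn=LISTENER_ARN,
        DefaultActions=[{
            "Type": "forward",
            "ForwardConfig": {"TargetGroups": [
                {"TargetGroupArn": BLUE_TG,  "Weight": 100 - green_weight},
                {"TargetGroupArn": GREEN_TG, "Weight": green_weight},
            ]},
        }],
    )

# Canary-style ramp: a trickle first, then the full switch.
for weight in (5, 25, 50, 100):
    shift_traffic(weight)
    time.sleep(300)  # watch monitoring between steps; roll back if checks fail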

Zero-Downtime Changes with Data. The problem comes when the new version of the component involves a change to data formats, so that it's not possible to have both versions share the same data storage without issues. An effective way to approach data for zero-downtime deployments is to decouple data format changes from software releases.

Data continuity: keeping data available and consistent on infrastructure that isn't. There are many techniques that can be applied to this problem. A few include:

  • Replicating data redundantly
  • Regenerating data
  • Delegating data persistence
  • Backing up to persistent storage

Disaster recovery: coping well when the worst happens.

Iron-age IT organizations usually optimize for mean time between failures (MTBF), whereas cloud-age organizations optimize for mean time to recover (MTTR).

Security: keeping bad actors at bay.

  • Reliable Updates as a Defense
  • Provenance of Packages
  • Automated Hardening

Common vulnerabilities list from CVE.

Hardening refers to configuring a system to make it more secure than it would be out of the box. Typical activities include (a small audit sketch follows the list):

  • Configuring security policies (e.g., firewall rules, SSH key use, password policies, sudoers files, etc.).
  • Removing all but the most essential user accounts, services, software packages, and so on.
  • Auditing user accounts, system settings, and checking installed software against known vulnerabilities.
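
For instance, an automated audit of a couple of common SSH hardening rules might look like this; the checks are illustrative, not exhaustive.

import re

def check_sshd_hardening(path="/etc/ssh/sshd_config"):
    """Audit a few common SSH hardening settings; returns the failing options."""
    rules = {
        "PermitRootLogin": "no",
        "PasswordAuthentication": "no",
    }
    text = open(path).read()
    failures = []
    for option, wanted in rules.items():
        # Find the first non-commented occurrence of the option.
        m = re.search(rf"^\s*{option}\s+(\S+)", text, re.MULTILINE)
        if not m or m.group(1).lower() != wanted:
            failures.append(option)
    return failures

print(check_sshd_hardening() or "sshd hardening checks passed")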

Frameworks and scripts for hardening systems exist (see here). It is essential that the members of the team review and understand the changes made by externally created hardening scripts before applying them to their own infrastructure.

Chapter 15

Organizing for Infrastructure as Code. This final chapter takes a look at implementing it from an organizational point of view.

The organizational principles that enable this include:

  • A continuous approach to the design, implementation, and improvement of services
  • Empowering teams to continuously deliver and improve their services
  • Ensuring high levels of quality and compliance while delivering rapidly and continuously

A kanban board is a powerful tool to make the value stream visible. This is a variation of an agile story wall, set up to mirror the value stream map for work.

A retrospective is a session that can be held regularly, or after major events like the completion of a project. Everyone involved in the process gathers together to discuss what is working well, and what is not working well, and then decide on changes that could be made to processes and systems in order to get better outcomes.

Post-mortems are typically conducted after an incident or some sort of major problem. The goal is to understand the root causes of the issue, and decide on actions to reduce the chance of similar issues happening.

0%