Lab Environment Setup
Consul is easy to install: it is just a single executable binary; drop it into /usr/local/bin:
https://www.consul.io/downloads
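For example, a quick install might look like this (the version number below is an assumption; pick the current one from the downloads page):

```
# download the consul binary (version is just an example), unzip it, and put it on the PATH
curl -LO https://releases.hashicorp.com/consul/1.9.5/consul_1.9.5_linux_amd64.zip
unzip consul_1.9.5_linux_amd64.zip
sudo mv consul /usr/local/bin/
consul version
```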
I modified the course demo a bit and built a Consul lab cluster via Vagrant: https://github.com/chengdol/InfraTree/tree/master/vagrant-consul
Glossary: https://github.com/chengdol/InfraTree/blob/master/vagrant-consul/glossary.md
Introduction
Challenges in managing services:
- Service discovery
- Failure Detection
- Multi-datacenter
- Service configuration
In a typical application architecture there is usually an API tier that adds flexibility and provides additional services that other applications can consume directly. Consul itself is distributed.
These services need to discover each other. As the internal structure gets more and more complex, with many internal load balancers for example, Consul can step in and provide internal DNS-based service discovery.
Failure detection: Consul runs a lightweight agent (in server or client mode) on every node in your environment, and the agent health-checks all services running locally.
Reactive configuration via the key/value store, reflecting changes in near real time.
Multi-datacenter aware.
Consul vs other software, see here. Especially Consul vs Istio, see here.
Consul UI online demo: https://demo.consul.io
Monitor Nodes
The example in this chapter shows a nice modeling approach: install Docker inside the Vagrant virtual machines, run the services (here the Nginx web servers and the HAProxy LB) as containers, and expose their ports on localhost (which modifies the machine's iptables). This avoids a lot of installation and configuration work on the virtual machines themselves.
Start the Consul server agent:

```
# -dev: development agent, server mode will be turned on this agent, for quick start
consul agent -dev
```
I tweaked the Vagrantfile a bit. I suspect the routing table is the problem: from the macOS host I cannot reach the virtual machines on the private network via their private IPs: https://stackoverflow.com/questions/23497855/unable-to-connect-to-vagrant-private-network-from-host
So I added a VM named ui to expose the Consul UI with port forwarding, but that still did not work; from the log, port 8500 is bound to 127.0.0.1:
```
Client Addr: [127.0.0.1] (HTTP: 8500, HTTPS: -1, gRPC: -1, DNS: 8600)
```
First, I thought of changing the client address to 172.20.20.41, since that is the private IP I set in the Vagrantfile:
```
consul agent -config-file /vagrant/ui.consul.json -advertise 172.20.20.41 -client 172.20.20.41
```
That still did not work: localhost:8500 on the host could not connect. To confirm the -client flag was applied correctly, I used netstat to check whether the port was listening on that interface. Then I realized it must be an iptables issue, with traffic on that interface not being forwarded, so changing the bind address to 0.0.0.0 (meaning "any IPv4 address at all") fixed it:
```
# /vagrant/ui.consul.json sets "ui": true
consul agent -config-file /vagrant/ui.consul.json -advertise 172.20.20.41 -client 0.0.0.0
```
Alternatively, client_addr can be defined in the config JSON:
```
{ "client_addr": "0.0.0.0" }
```
Although the web UI is exposed through the ui virtual machine, all of the information comes from the Consul servers, similar to the Kubernetes NodePort pattern. You can also access it via the HTTP API: https://www.consul.io/api-docs
```
http://localhost:8500/v1/catalog/nodes
```
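The same endpoint can be queried with curl, for example (jq is an assumption here, used only for pretty-printing):

```
# list all nodes known to the cluster via the HTTP API
curl -s http://localhost:8500/v1/catalog/nodes | jq .
```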
For DNS queries, go to the ui node; when we run the Consul agent, the DNS port is 8600:

```
# query a node address, e.g. the web1 node
dig @localhost -p 8600 web1.node.consul
```
The RPC protocol is deprecated and support was removed in Consul 0.8. Please use the HTTP API, which supports all features of the RPC protocol.
Consul Commands
Two useful commands are mentioned here; they used to be implemented over RPC, but that has changed:
```
# can specify a target endpoint with -http-addr
consul members -http-addr=172.20.20.31:8500
```
Here 172.20.20.31 is the Consul server; you must start it with -client 0.0.0.0, otherwise the port is bound to the loopback interface and cannot be reached.
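Other agent commands accept the same flag; for instance, to stream logs from that remote agent (a usage sketch, assuming the same server address):

```
# stream the remote agent's logs over the HTTP API
consul monitor -http-addr=172.20.20.31:8500
```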
Other commands:
```
# maintain node
consul maint -enable -reason "planned maintenance"
```
Note that consul exec is disabled by default:
https://www.consul.io/docs/agent/options.html#disable_remote_exec
This command is quite dangerous: it is essentially like SSHing to the nodes and running a command line. For example, if a node serves traffic from a Docker container, you could exec to the node and docker stop xxx.
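If remote exec were enabled, usage would look roughly like this (the node and container names are assumptions from this demo):

```
# run a command on a specific node (remote exec must be enabled first)
consul exec -node=web1 docker stop web
```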
BTW, gracefully exiting the Consul process will not cause a warning or error in the UI. If you force-kill it, the node will be marked as critical.
Service Discovery
One way to register a service with Consul is to use a service definition: https://www.consul.io/docs/agent/services For example, register the LB service with Consul; the benefit, as mentioned earlier, is that Consul can promptly update HAProxy's configuration based on the web/nginx status reported by the other agents, as we will see next:
Registering a service does not mean the service is healthy; it also needs a health check. For example:
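A sketch of such a service definition with an HTTP check (the service name and port follow the demo's web/nginx setup; the check URL and interval are assumptions):

```
{
  "service": {
    "name": "web",
    "port": 8080,
    "check": {
      "id": "web-http",
      "name": "nginx HTTP check",
      "http": "http://localhost:8080",
      "interval": "10s"
    }
  }
}
```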
Then launch the Consul client agent with one more service config file web.service.json for registration, for example on the web1 node:
```
consul agent -config-file /vagrant/common.json \
             -config-file /vagrant/web.service.json
```
Then check the Consul UI: the node is good but the service is unhealthy, because no nginx is running yet, so create nginx on the web1 node:
```
/vagrant/setup.web.sh
```
Then refresh the web page; everything is good.
You can dig the web service from the ui node; this is so-called internal service discovery, not facing the public. The LB can use this data to direct traffic. That is the advantage of Consul's built-in DNS: there is no extra setup, and health checking is included, which is very convenient. The public-facing LB is also registered in Consul, so if the LB goes down it is detected immediately.
```
dig @localhost -p 8600 web.service.consul SRV
```
Besides querying DNS with dig, the Consul HTTP API can also do it:
```
# services list
curl http://localhost:8500/v1/catalog/services
```
We used a service definition to register the service above, but that is only one method; you can also register via the HTTP API. There are also some tools for automatic registration: https://www.consul.io/downloads_tools
- Docker containers: Registrator
- Consul-aware apps: using the HTTP API (see the example below)
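For instance, registering through the HTTP API is a single call to the local agent (the service name and port are assumptions matching the demo):

```
# register a service with the local agent over the HTTP API
curl -X PUT http://localhost:8500/v1/agent/service/register \
     -d '{"Name": "web", "Port": 8080}'
```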
LB Dynamic Config
HAProxy: The Reliable, High Performance TCP/HTTP Load Balancer.
HAProxy config file haproxy.cfg example:
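A minimal configuration along these lines (the backend web server IPs are assumptions based on the Vagrant private network) might be:

```
global
    maxconn 2048

defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend http_front
    bind *:80
    default_backend web_servers

backend web_servers
    balance roundrobin
    server web1 172.20.20.21:8080 check
    server web2 172.20.20.22:8080 check
```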
Port 8080 is where the nginx web service comes from, and bind *:80 exposes the port used for the health check. In other words, the outside world reaches the backend web servers through port 80 on the LB, which is also why the LB's health check output in Consul is, surprisingly, "Welcome to nginx!": that page is returned by the backend.
In the demo, we run the HAProxy container in the lb machine. How do we verify it is up and running? From any machine:
```
dig @localhost -p 8600 lb.service.consul SRV
```
Now let's verify the LB is actually working:

```
# try several times; the LB will cycle through the backend servers
curl http://localhost/ip.html
```
If you shut down one web server at this point and HAProxy does not have its own health check enabled, it will keep sending requests to the dead server and users will get a 503 error. This is a problem with many LBs: they need their own health checks configured. With Consul DNS, however, each server's health check is already integrated, so Consul only returns healthy servers. So we can feed information to the LB from Consul dynamically.
Consul Template
Consul Template uses the Go template format: https://github.com/hashicorp/consul-template It is not only for configuring the LB; any application with a config file can use this tool!
Workflow:
consul-template listens for changes from Consul; as changes occur, they are pushed to the consul-template daemon (running in the lb machine). The daemon generates a new HAProxy config file from a template, and then we tell Docker to restart HAProxy (or have HAProxy reload its config).
The haproxy.ctmpl template has the same structure as haproxy.cfg, except that the backend servers are generated from Consul:
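A sketch of the key part (the global/defaults sections are the same as in haproxy.cfg above; the backend name is an assumption), iterating over the healthy web service instances with consul-template's service function:

```
# global and defaults sections same as in haproxy.cfg above
frontend http_front
    bind *:80
    default_backend web_servers

backend web_servers
    balance roundrobin{{range service "web"}}
    server {{.Node}} {{.Address}}:{{.Port}} check{{end}}
```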
This part turns on HAProxy's statistics report in the web UI. The statistics page is quite intuitive, but because of my routing issue I cannot see it here; access it at http://<Load balancer IP>/haproxy:
```
stats enable
stats uri /haproxy
```
Next, install consul-template in the lb machine and run some tests with the template file:
```
# dry run: render the template to stdout (the destination path here is only an example)
consul-template -dry -template "/vagrant/provision/haproxy.ctmpl:/etc/haproxy/haproxy.cfg"
```
Meanwhile, go to the web1 machine and run docker stop/start web; you will see real-time updates in the output of the consul-template command above.
Then create the consul-template config file lb.consul-template.hcl, which tells consul-template how to do its job.
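A sketch of that config (the paths and the restart command are assumptions based on the workflow above):

```
# lb.consul-template.hcl (sketch; paths and restart command are assumptions)
consul {
  address = "localhost:8500"
}

template {
  source      = "/vagrant/provision/haproxy.ctmpl"
  destination = "/etc/haproxy/haproxy.cfg"
  command     = "docker restart lb"
}
```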
```
consul-template -config /vagrant/provision/lb.consul-template.hcl
```
Then we can provision the daemon to run in the background in the lb machine:
```
(consul-template -config /vagrant/provision/lb.consul-template.hcl >/dev/null 2>&1)&
```
Open the Consul UI, then in a terminal go to the web1 or web2 machine and stop/start the Docker container to see the updates. Also, in the lb machine, run the command below to see that the LB still works fine; it will not route you to the unhealthy server:
```
curl http://localhost/ip.html
```
Other tools
- Envconsul: provides a convenient way to launch a subprocess with environment variables populated from HashiCorp Consul and Vault. Where consul-template generates a config file for a process, Envconsul sets environment variables for the process and kicks it off for us.
- confd: a lightweight configuration management tool.
- fabio: a fast, modern, zero-conf load balancing HTTP(S) and TCP router for deploying applications managed by Consul.
Reactive Configuration
One of the primary use cases is updating application configuration: for example, when services change, inject the changes into Consul key/value pairs and have them pushed into our application.
Note that the key/value store should not be used as a database; that is not what it is intended for! But it works almost exactly like etcd: https://etcd.io/
Go to the Consul UI to add key/value pairs: create the folder path /prod/portal/haproxy, then create a key/value pair in it:
```
maxconn 2048
```
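The same pair can also be created from the command line (assuming the local agent's HTTP API is reachable):

```
# equivalent to creating the key/value pair in the UI
consul kv put prod/portal/haproxy/maxconn 2048
```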
SSH to the ui node and read the stored key/value:
```
# list all pairs under the prefix
curl 'http://localhost:8500/v1/kv/prod/portal/haproxy?recurse'
```
The API returns JSON data (with the value base64-encoded); you can use jq to parse it.
Update the LB config template haproxy.ctmpl as:
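For example, the global section can pull maxconn from the KV store (a sketch; the rest of the template stays the same):

```
global
    maxconn {{key "prod/portal/haproxy/maxconn"}}
```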
Then make the consul-template process reload without killing it:

```
# the HUP signal makes consul-template reload
pkill -HUP consul-template
```
Then you will see that the haproxy.cfg file is regenerated!
Let's talk about why this key/value setup matters so much: sometimes you do not know the right parameter values ahead of time. In production you may want to update parameters in real time, such as maxconn in the LB here; in practice you might have to lower it because of machine CPU, memory, and so on. You could adjust it with consul maint or other means, but that would be a pain and the change would take time to converge across the infrastructure. Using the key/value store gives you truly reactive configuration!
Blocking query
https://www.consul.io/api-docs/features/blocking
A blocking query is used to wait for a potential change using long polling. Not all endpoints support blocking, but each endpoint uniquely documents its support for blocking queries in the documentation.
Endpoints that support blocking queries return an HTTP header named X-Consul-Index. This is a unique identifier representing the current state of the requested resource.
Use curl -v to inspect the response headers and check whether X-Consul-Index is present.
This feature can be used, for example, by your own app to long-poll the Consul API and wait for changes to happen, listening to Consul reactively. This saves a lot of resources compared with periodic polling. For example:
```
curl -v 'http://localhost:8500/v1/kv/prod/portal/haproxy/stats?index=<X-Consul-Index value in header>&wait=40s'
```
Whenever a change occurs, the X-Consul-Index value changes.
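A minimal watch loop along these lines (the key path reuses the earlier example; the blocking behavior follows the docs above) could be sketched as:

```
#!/usr/bin/env bash
# Long-poll the KV endpoint: each request blocks until the key changes
# or the 40s wait expires, then we pick up the new X-Consul-Index.
URL="http://localhost:8500/v1/kv/prod/portal/haproxy/maxconn"
INDEX=0
while true; do
  INDEX=$(curl -s -D - -o /dev/null "${URL}?index=${INDEX}&wait=40s" \
          | grep -i '^x-consul-index:' | awk '{print $2}' | tr -d '\r')
  echo "change (or timeout) observed, X-Consul-Index=${INDEX}"
done
```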
Health Check
Gossip pool via Serf with edge-triggered updates, peer to peer.
Serf: https://www.serfdom.io/ (in the UI, every node has a Serf health status)
If you kill and restart the Consul agent on one node, you will see log entries something like:
```
serf: EventMemberFailed ...
```
There are two gossip pools: LAN gossip and WAN gossip.
Information disseminated:
- Membership (discovery, joining) - joining the cluster entails only knowing the address of one other node (not required to be a server)
- Failure detection - affords distributed health checks, no need for centralized health checking
- Event broadcast - e.g. leader elected, custom events
System-Level Check
This is very similar to the Kubernetes liveness probe.
https://www.consul.io/docs/agent/checks.html
One of the primary roles of the agent is management of system-level and application-level health checks. A health check is considered to be application-level if it is associated with a service. If not associated with a service, the check monitors the health of the entire node.
So far we have only used service checks; here we add checks on node status as well, for example disk usage, memory usage, etc.
Update the common.json config file, which takes effect on the lb and web machines (this part of the configuration has changed in recent Consul versions):
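A sketch of such a node-level check definition (the field names follow the older script-check format mentioned above; the path and interval are assumptions):

```
{
  "check": {
    "id": "mem-util",
    "name": "Memory utilization",
    "script": "/vagrant/mem_utilization.sh",
    "interval": "15s"
  }
}
```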
Let's look at the mem_utilization.sh file:
```
AVAILABLE_RAM=`grep MemAvailable /proc/meminfo | awk '{print $2}'`
```
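Only the first line is shown above; a complete version of such a memory check (the thresholds are assumptions) might look like the following, using the exit-code convention Consul expects for script checks (0 = passing, 1 = warning, anything else = critical):

```
#!/usr/bin/env bash
# Compute memory utilization from /proc/meminfo and map it to Consul check exit codes.
AVAILABLE_RAM=`grep MemAvailable /proc/meminfo | awk '{print $2}'`
TOTAL_RAM=`grep MemTotal /proc/meminfo | awk '{print $2}'`
USED_PERCENT=$(( (TOTAL_RAM - AVAILABLE_RAM) * 100 / TOTAL_RAM ))

echo "Memory utilization: ${USED_PERCENT}%"

if [ "$USED_PERCENT" -gt 90 ]; then
  exit 2   # critical
elif [ "$USED_PERCENT" -gt 70 ]; then
  exit 1   # warning
else
  exit 0   # passing
fi
```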
The system-level health checks will be displayed in the Consul UI.
For the stress test, install the stress tool on the web1 machine (in the demo code it is already added):
```
# install (assuming an Ubuntu/Debian guest)
sudo apt-get install -y stress
```
Run a CPU stress test; you will then see in the Consul UI that the node is unhealthy and gets cycled out of the LB:
```
stress -c 1
```
Watching the Consul UI for web1, you will see the CPU check fail:
```
CPU: 100%
```
Once it recovers, the node puts itself back into the pool. This feature is very useful: it gives early warning about nodes that may run into trouble. For example, if a web server is overloaded and detected as unhealthy, it is removed from the LB, and once it recovers it is automatically added back!