Watched the Pluralsight course "Managing Ansible with Red Hat Ansible Tower"

This is a brief introduction to Tower; for details, please check the official documentation.

Need to know:

  1. Create a project to set the runtime environment (Python virtual env) and the playbook directory.
  2. Add a template associated with the project; set verbosity, concurrent jobs, prompts, etc.
  3. Launch a job from the template, optionally providing extra variables at the prompt.
  4. Check job status and logs from the job dashboard.

Steps 1, 2 and 3 could also be done by running a playbook against Tower.

Introduction

Tower is a kind of control node that also provides a central web UI, authentication and API for Ansible. The new version of Tower is called Ansible Automation Platform.

Tower installation requires a license.

Red Hat Ansible Tower official web site: https://access.redhat.com/products/ansible-tower-red-hat

I use Tower version 3.7.4: https://docs.ansible.com/ansible-tower/3.7.4/html/quickinstall/index.html

You need to apply a subscription in order to log in to the Tower web UI; get a free trial license from there: https://docs.ansible.com/ansible-tower/3.7.4/html/installandreference/updates_support.html#trial-evaluation

Tower install package download: https://releases.ansible.com/ansible-tower/setup/ For example, I am using the bundled (self-contained) installer ansible-tower-setup-bundle-3.7.4-1.tar.gz, which can be used without a network connection.

The installation may fail due to missing packages; just install them, for example:

sudo yum install -y rsync

For a Tower single-node installation, extract the tar.gz and edit the inventory file (Tower is installed through Ansible as well) to fill in the passwords:

admin_password='admin'
pg_password='admin'
rabbitmq_password='admin'

Then install by running:

sudo ./setup.sh

The playbook location is /var/lib/awx/projects; you can put playbooks, ansible.cfg and other files into a tar.gz package and place it under this path (normally you should not need to manually manage these directories).
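
For example, a minimal sketch of dropping a packaged playbook into that directory (the archive name my-playbook.tar.gz is a placeholder):

# copy the package to the Tower node, then unpack it into the projects directory
sudo tar -xzf my-playbook.tar.gz -C /var/lib/awx/projects/
# the result should be /var/lib/awx/projects/my-playbook/...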

Tower REST API:

# check api version
curl -XGET -k https://localhost/api/

There are five main components in Tower:

  • Nginx: provides the web server for the UI and API
  • PostgreSQL: internal relational database server
  • supervisord: process control system that manages the application processes: running jobs, etc.
  • rabbitmq-server: AMQP message broker supporting signalling between application components
  • memcached: local caching service

These services communicate with each other using normal network protocols:

  • Nginx: 80/tcp,443/tcp
  • PostgreSQL: 5432/tcp
  • Rabbitmq-server: beam listens on 5672/tcp, 15672/tcp, 25672/tcp

In a single-machine installation, you only need to expose 80/tcp and 443/tcp.
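
If a host firewall is running (assuming firewalld on a RHEL/CentOS host, which is not stated above), the two ports can be opened like this:

# open only the web UI/API ports on a single-node install
sudo firewall-cmd --permanent --add-service=http
sudo firewall-cmd --permanent --add-service=https
sudo firewall-cmd --reload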

There are some wrapper commands around systemctl for Tower:

ansible-tower-service status
ansible-tower-service start
ansible-tower-service stop
ansible-tower-service restart

Dashboard

For an overview of the Tower dashboard and setup: https://www.youtube.com/watch?v=ToXoDdUOzj8

  1. create a project, with SCM TYPE set to Manual, which means you will put your playbook folder under the /var/lib/awx/<any folder>/my-playbook directory. Set ANSIBLE ENVIRONMENT to a Python virtual env folder.
  2. create an inventory.
  3. create templates; set the PROJECT, PLAYBOOK path, JOB TYPE, INVENTORY, ENABLE CONCURRENT JOBS, etc.
  4. launch the template job with or without extra vars from the console or from the Tower API (a hedged API example follows below).
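
As a hedged sketch of step 4, a job template can also be launched through the REST API; the template id, credentials and extra vars below are placeholders:

# look up the job template id first
curl -k -u admin:password https://localhost/api/v2/job_templates/

# launch template 10, passing extra variables (requires "prompt on launch" for extra vars)
curl -k -u admin:password -X POST \
-H "Content-Type: application/json" \
-d '{"extra_vars": {"endpoint": "http://example.com/placeholder"}}' \
https://localhost/api/v2/job_templates/10/launch/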

Manual Quick Debug

Sometimes I would like to run a playbook from the CLI; that’s easy to do:

  1. upload the playbook to the Tower VM under /var/lib/awx/projects/my-playbook.
  2. source the Python venv; for example, the venv is placed under /var/lib/awx/venv.
  3. run the playbook from inside the my-playbook directory, otherwise you may encounter strange issues (if you check the process launched by Tower, it runs this way), for example:
source /var/lib/awx/venv/my-venv/bin/activate
cd /var/lib/awx/projects/my-playbook

# no inventory means run on localhost
ansible-playbook playbook_v1.yml \
-e @var.json \
-e "endpoint=http://example.com/xs73s93jsdfsf" \
-vvv

Search Job Log

It is useful to accurately locate a job's task-specific logs; the job log search bar supports both targeted and fuzzy search:

task:"<task name>"

Other search bars have similar syntax.

01/25/2022

The git branch command pipes its output through a less-like pager, but I want it to dump the output directly:

git config --global pager.branch false
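
Alternatively, the pager can be bypassed for a single invocation without touching the config:

# one-off: skip the pager just for this command
git --no-pager branch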

02/04/2022

Makefile tutorial

03/02/2022

"You have new mail in /var/spool/mail/root": what is it? See this post. To read it from the end:

less +G $MAIL

03/17/2022

The selection of a VM disk type should take read/write performance into account as well, not only disk size: is the usage write-heavy or read-heavy? For example, this is the gcloud disk properties form.

03/18/2022

Multiple successive forward slashes in a Linux path have no effect; they are treated the same as a single slash, see this ticket. But successive forward slashes in a GCS bucket path are not merged!
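
A quick way to convince yourself on Linux:

# both point to the same directory; the extra slashes are collapsed
ls -ld //usr///bin
ls -ld /usr/bin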

04/26/2022

Backups vs snapshots: what is the difference? A snapshot only needs to save enough information locally to undo a change, which makes it take less space than a backup. Backups are stored in different places.

Today is a frustrating day and I am scheduled for on-call. There is a network-related issue where requests from one side intermittently time out, for example:

failed to create dial connection with read/write 10s timeout: dial tcp: i/o timeout

It turns out the root cause is network packet loss. GCP has a network performance dashboard to help you monitor and spot packet loss and latency statistics, which is very helpful.

I was also educated by this blog; it is worth reading through.

The namespace/cgroup paths differ slightly between OSes.

Ubuntu

On Ubuntu, the CPU cgroup, for example, lives under:

# if you run docker container
cd /sys/fs/cgroup/cpu/docker

then you will see folders named after the Docker container IDs, for example:

cd 9a89252ea39e15c5f90cc7b1a606bc64d4acb3a50c112ab53f3e751d06ba85db

# limit: cfs_quota_us/cfs_period_us
cpu.cfs_period_us
cpu.cfs_quota_us
# request: proportional to other container CPU shares
cpu.shares

Other cgroups follow a similar path pattern; for example, the memory path is /sys/fs/cgroup/memory/docker.
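
For example, a minimal sketch of reading the limits of one container (reusing the container ID from above; the values in the comment are hypothetical):

cd /sys/fs/cgroup/cpu/docker/9a89252ea39e15c5f90cc7b1a606bc64d4acb3a50c112ab53f3e751d06ba85db
# e.g. quota 200000 / period 100000 => a limit of 2 CPUs
cat cpu.cfs_quota_us cpu.cfs_period_us

# memory limit in bytes for the same container
cat /sys/fs/cgroup/memory/docker/9a89252ea39e15c5f90cc7b1a606bc64d4acb3a50c112ab53f3e751d06ba85db/memory.limit_in_bytes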

GKE

On a Google GKE node, the CPU cgroup path looks like:

# burstable is a type of QoS (quality of service)
# you can see QoS in pod description status section
cd /sys/fs/cgroup/cpu/kubepods/burstable/podf13115bf-9e5b-4871-b855-d24430b31aa6/84b07d37a9f07c57fe9e642f2cd951f7821b5d2333c92eae945b99f5dd996491

The 84b07d37a9f07c57fe9e642f2cd951f7821b5d2333c92eae945b99f5dd996491 is the container ID inside the pod. You can see it with kubectl describe, which also shows the CPU limit/request source of truth.
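
A hedged sketch of cross-checking the cgroup values against the pod spec (the pod name and namespace are placeholders):

# find the container ID and the QoS class of the pod
kubectl describe pod <pod-name> -n <namespace> | grep -i -E 'container id|qos class'
# the declared cpu/memory requests and limits for the first container
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].resources}'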

//TODO: init process signal handling is clear, but what about the init process forwarding signals?

[ ] If an ancestor namespace sends SIGTERM to the container init process, will it receive it or not?

First, understand that a Docker container is essentially a process in a separate PID namespace.

Points to understand: an ancestor PID namespace killing a child init process: https://man7.org/linux/man-pages/man7/pid_namespaces.7.html The difference between docker kill, docker stop and docker rm: https://unix.stackexchange.com/questions/509660/do-docker-container-rm-and-docker-container-kill-effectively-achieve-the-sam

Killing the init process from inside the container, and foreground mode: https://devops.stackexchange.com/questions/5613/how-to-explicitly-kill-the-process-with-pid-1-from-inside-a-container https://docs.docker.com/engine/reference/run/#foreground We need a signal handler for the init process: https://medium.com/@gchudnov/trapping-signals-in-docker-containers-7a57fdda7d86

How to know whether a living process has signal handlers set: https://stackoverflow.com/questions/5975315/linux-how-to-see-if-a-living-process-has-signal-handlers-set/8810790

There is a demo of writing a signal proxy on your own: https://medium.com/hackernoon/my-process-became-pid-1-and-now-signals-behave-strangely-b05c52cc551c

Best practices for propagating signals on Docker: https://www.kaggle.com/residentmario/best-practices-for-propagating-signals-on-docker

Kill init Process in Container

Inside a container, PID 1 will never be killed by kill -9 1, but if PID 1 has registered handlers for other signals, it can respond to those accordingly; check the signal bitmap:

# sh as PID 1
docker run -it --entrypoint=/bin/sh busybox

# check signal bitmap PID 1
cat /proc/1/status | grep -i sigcgt

The output is:

SigCgt:	0000000000010002

So sh has two handlers registered, for signal 2 (SIGINT) and signal 17 (SIGCHLD); therefore this container will never react to kill 1 or kill -9 1, as no handler is registered for those signals.

If the init process you use has a handler registered for SIGTERM that calls exit(0), then you can terminate it with kill 1.
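
A minimal sketch to verify that, using a shell trap as the handler (the container name sig-test is a placeholder):

# PID 1 is sh with a TERM handler registered via trap
docker run -d --name sig-test busybox sh -c 'trap "exit 0" TERM; while true; do sleep 1; done'

# from inside the container, kill 1 now works because a handler is registered
docker exec sig-test kill -TERM 1

# the container should have exited
docker ps -a --filter name=sig-test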

The thing is, I ran a Docker container on a GCE VM instance; the image has the gcloud SDK installed beforehand. Without mounting the user's ~/.config folder, I found that the SDK inside the container was already set up with the service account from the host, for example:

# gcloud_test is image built with gcloud SDK
docker run -it --rm --entrypoint=/bin/bash gcloud_test:latest

Inside the container, executing:

gcloud auth list

Instead of asking you to log in, the host-associated service account was displayed.

It turns out this is related to the Metadata Server provided by Google Cloud: your VM automatically has access to the metadata server API without any additional authorization, and you can only query the metadata server programmatically from within a VM.

For example, to get the service account of the VM:

curl -H "Metadata-Flavor:Google" http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/

For more metadata items, see here.
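
For example, the default service account's email and a short-lived access token come from the same metadata path:

# service account email
curl -H "Metadata-Flavor: Google" "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email"
# short-lived OAuth2 access token used by gcloud and the client libraries
curl -H "Metadata-Flavor: Google" "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token"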

The gcloud SDK inside the container does something like this to automatically fetch the host's service account and use it; if I disable container networking at creation time, this mechanism no longer works:

# disable network
docker run -it --rm --network none --entrypoint=/bin/bash gcloud_test:latest

Also note that this is a common concept across most cloud providers, not something unique to Google.

When performing the ES upgrade from a Docker container on a Linux jumpbox, interestingly on one region's jumpbox I could not read the mounted folder and got a permission denied error. This is related to the SELinux setting on the Docker daemon.

For example, on that jumpbox:

# test is a folder in host user home I want to mount
sudo docker run \
--rm \
-v ~/test:/test \
busybox sh \
-c "ls /test"

# got access denied
ls: can't open '/test': Permission denied

First, verify that the SELinux mode is enforcing; you can check with:

getenforce
# enforcing

Then I see that SELinux support is enabled on the Docker daemon; this is why I get permission denied:

sudo docker info | grep Security -A5

Security Options:
 seccomp
  Profile: /etc/docker/seccomp.json
 # the keyword below means SELinux is enabled
 selinux
Kernel Version: 3.10.0-1062.12.1.el7.x86_64

On the other regions' jumpboxes, although SELinux is in enforcing mode, the Docker daemon does not specifically enable it, so I can still read/write the mounted folder.

Solutions:

  1. set SELinux to permissive mode and mount as usual:

sudo setenforce 0

  2. mount with the label Z, see this question.

From the Docker official documentation, Configure the selinux label:

# test is a folder in host user home I want to mount
# append Z and ro(read-only) labels
# Z: the mount is private and unshared
sudo docker run \
--rm \
-v ~/test:/test:Z,ro \
busybox sh \
-c "ls /test"

Reference

Secure your containers with SELinux

We accidentally configured all nodes with the master role and a wrong data path during an upgrade, which resulted in all shards being unassigned and the cluster status red; this led to data loss and corrupted shards.

For example, from the cluster health API, this was the cluster status after the upgrade:

{
  "cluster_name": "xxx",
  "status": "red",
  "timed_out": false,
  "number_of_nodes": 5,
  // no data nodes
  "number_of_data_nodes": 0,
  // no primary shards
  "active_primary_shards": 0,
  "active_shards": 0,
  "relocating_shards": 0,
  "initializing_shards": 0,
  // all unassigned
  "unassigned_shards": 33,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 0
}

In this case you need to take a look at the node status; it turns out we had a wrong configuration: all nodes were set to master, and the data path was wrong too:

curl "localhost:9200/_cat/nodes"

172.16.0.141 5 86 1 0.00 0.10 0.19 im - 172.16.0.141
172.16.0.140 24 86 2 0.02 0.14 0.16 im - 172.16.0.140
172.16.0.138 4 66 0 0.27 0.18 0.24 im - 172.16.0.138
172.16.0.137 4 73 0 0.00 0.07 0.12 im - 172.16.0.137
172.16.0.152 4 86 1 0.00 0.04 0.09 im * 172.16.0.152

// the data path for master and data nodes may differ, but nodes of the
// same kind should use the same path

The solution is to set the right configuration (node roles and data path) and restart the whole cluster. Usually, when the nodes rejoin, the shards move from unassigned to assigned/started; if not, the data may be lost or corrupted, so they remain unassigned.
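
To see exactly which shards are unassigned (and for which indices), the cat shards API helps:

# list shards and filter the unassigned ones
curl "localhost:9200/_cat/shards?v" | grep UNASSIGNED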

Use the allocation explain API to get details:

curl "http://localhost:9200/_cluster/allocation/explain" | jq
{
  "note": "No shard was specified in the explain API request, so this response explains a randomly chosen unassigned shard. There may be other unassigned shards in this cluster which cannot be assigned for different reasons. It may not be possible to assign this shard until one of the other shards is assigned correctly. To explain the allocation of other shards (whether assigned or unassigned) you must specify the target shard in the request to this API.",
  "index": "elastalert-status",
  "shard": 0,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "CLUSTER_RECOVERED",
    "at": "2021-12-20T23:54:16.720Z",
    "last_allocation_status": "no_valid_shard_copy"
  },
  "can_allocate": "no_valid_shard_copy",
  "allocate_explanation": "cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster"
}

How to proceed, on any node:

  1. retry reroute:

curl -XPOST "localhost:9200/_cluster/reroute?retry_failed=true"

  2. force reroute and accept data loss; see the explanation for these two commands:
// allocate a primary shard to a node that holds a stale copy
curl -XPOST "localhost:9200/_cluster/reroute" \
-H "Content-Type: application/json" \
-d \
'{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "elastalert-status",
        "shard": 0,
        "node": "172.16.0.138",
        "accept_data_loss": true
      }
    }
  ]
}'

// this allocates an empty primary shard, discarding any existing data for that shard
curl -XPOST "localhost:9200/_cluster/reroute?pretty" \
-H "Content-Type: application/json" \
-d \
'{
  "commands": [
    {
      "allocate_empty_primary": {
        "index": "elastalert-status",
        "shard": 0,
        "node": "172.16.0.138",
        "accept_data_loss": true
      }
    }
  ]
}'
  3. reindex from a backup if you have one (a hedged snapshot-restore sketch follows below)
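
For option 3, a hedged sketch, assuming a snapshot repository backup_repo with a snapshot snapshot_1 was registered beforehand (both names are placeholders):

# close or delete the broken index first, then restore it from the snapshot
curl -XPOST "localhost:9200/_snapshot/backup_repo/snapshot_1/_restore" \
-H "Content-Type: application/json" \
-d '{"indices": "elastalert-status"}'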

At that time option #2 solved the issue, but we lost all of the affected indices.

Reference

How to resolve unassigned shards in Elasticsearch, this series is good.

Cert-manager git repo; it is recommended to read through the documentation.

Install

Use the Helm chart to deploy cert-manager into the K8s cluster in the cert-manager namespace (customizing the chart if needed), then verify the deployment is healthy; see here.

Highlight Concepts

Issuers and ClusterIssuers: https://cert-manager.io/docs/concepts/issuer/

Issuers configuration, especially ACME protocol as we use tarsier CA: https://cert-manager.io/docs/configuration/

Certificate resources: https://cert-manager.io/docs/usage/certificate/

Usage Case

We use the Google tarsier CA and cert-manager to manage and renew certificates for ingresses, see secure ingress resources. Although it cannot directly work with Anthos MCI (multi-cluster ingress), the workaround is simple: manually create a Certificate associated with the target TLS secret.

After cert-manager is deployed and running correctly, add the cert-manager supported annotations to the target ingress, for example:

kind: Ingress
metadata:
  annotations:
    # cluster issuer is cluster scope and deployed by yourself
    cert-manager.io/cluster-issuer: example-cert
    cert-manager.io/duration: 2160h
    cert-manager.io/renew-before: 72h

Then cert-manager will automatically create a Certificate resource and start issuing the certificate. Note that you can manually create this resource for some scenarios: the TLS secret is used by multiple ingresses (no need to add annotations to each ingress), or Anthos MCI (which does not support cert-manager):

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: example
  namespace: default
spec:
  dnsNames:
  # this is from the ingress host name
  - '*.service.example.google'
  duration: 2160h0m0s
  issuerRef:
    group: cert-manager.io
    kind: ClusterIssuer
    name: example-cert
  renewBefore: 72h0m0s
  # this is from the ingress tls name
  secretName: example-tls

The spec.dnsNames field holds Subject Alternative Names (SAN) and can have multiple entries; the Common Name (CN) is derived from the first item. Note that CN is deprecated and its use is discouraged, see here.

So this is actually a SAN certificate, not a CN (common name) certificate (as mentioned here, the CN field is deprecated).

Describing it, you can examine the DNS names, conditions and validity of the new certificate:

Spec:
  Dns Names:
    *.service.example.google
  Duration:      2160h0m0s
  Issuer Ref:
    Group:       cert-manager.io
    Kind:        ClusterIssuer
    Name:        example-cert
  Renew Before:  72h0m0s
  Secret Name:   example-tls
Status:
  Conditions:
    Last Transition Time:  2021-10-08T19:18:51Z
    Message:               Certificate is up to date and has not expired
    Observed Generation:   1
    Reason:                Ready
    Status:                True
    Type:                  Ready
  Not After:               2022-01-06T18:18:43Z
  Not Before:              2021-10-08T18:18:44Z
  Renewal Time:            2022-01-03T18:18:43Z
  Revision:                1

The certificate and key are stored in a K8s Secret resource. Once certificate issuing is done, cert-manager manages this secret by adding specific annotations, for example:

kind: Secret
metadata:
  annotations:
    cert-manager.io/alt-names: '*.service.example.google'
    cert-manager.io/certificate-name: example
    cert-manager.io/common-name: '*.service.example.google'
    cert-manager.io/ip-sans: ""
    cert-manager.io/issuer-group: cert-manager.io
    cert-manager.io/issuer-kind: ClusterIssuer
    cert-manager.io/issuer-name: example-cert

If no secret exists, cert-manager will create one for you. Note that the old secret content is overwritten every time a new certificate is issued.

Also note that deleting the Certificate resource will not delete the associated secret.
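
A quick, hedged way to check what is currently stored in the managed secret (names taken from the example above):

# dump the issued certificate from the secret and show its SAN entries and validity period
kubectl get secret example-tls -n default -o jsonpath='{.data.tls\.crt}' \
| base64 -d | openssl x509 -noout -text | grep -E 'DNS:|Not Before|Not After'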

To decode the certificate content, use base64 and openssl. Usually multiple certificates are placed in the chain, from bottom to top (root CA, intermediate CA, then the leaf certificate); select one to decode:

echo <secret encode block> | base64 -d
# select one CERTIFICATE block to decode
openssl x509 -in certificate.crt -text -noout

Or decoding online: https://www.sslshopper.com/certificate-decoder.html

In reverse, to encode a certificate for use in a secret:

# -w 0: get rid of newlines; there may be a % character at the end of the output, drop it
cat certificate.crt | base64 -w 0
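
Alternatively, for the manual scenarios above, kubectl can build the TLS secret directly from the PEM files, assuming both the certificate and the key are available locally (file names are placeholders):

# creates a kubernetes.io/tls secret with tls.crt and tls.key populated
kubectl create secret tls example-tls --cert=certificate.crt --key=private.key -n default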

//TODO: [ ] warnings.warn, for example, from the elasticsearch package: https://docs.python.org/3/howto/logging.html#when-to-use-logging

Default handler for child loggers: when no handler is set on a child logger and you call logger.info() and so on, there is a last resort handler, logging.lastResort:

import logging

# effective level inherits from the root logger: WARNING
logger = logging.getLogger(__name__)
# where do these go? see below
logger.warning("hello")
logger.warning("world")

# <_StderrHandler <stderr> (WARNING)>
print(logging.lastResort)
# set it to None and the calls above would instead produce a
# "No handlers could be found for logger ..." complaint
logging.lastResort = None

From experience, for complex application logging:

  • inherit the logging.Logger class to create a customized logger class
  • create the log output folder if it does not exist
  • update the default logging dict config and apply it
  • set the custom logger class
  • get the logger and return it to the module

My logging framework is for complex applications; for a simple application, just use code configuration to set up the module's logger layout.

Introduction

Official document, read through carefully.

Default output: by default, no destination is set for any logging messages. They will check to see if no destination is set; and if one is not set, they will set a destination of the console (sys.stderr).

First, know when to use logging.

Logging flow

Multiple calls to getLogger() with the same name will return a reference to the same logger object (singleton).

Loggers have a concept of effective level. If a level is not explicitly set on a logger, the level of its parent is used instead as its effective level. If the parent has no explicit level set, its parent is examined, and so on. When deciding whether to process an event, the effective level of the logger is used to determine whether the event is passed to the logger’s handlers.

Child loggers propagate messages up to the handlers associated with their ancestor loggers. Because of this, it is unnecessary to define and configure handlers for all the loggers an application uses. It is sufficient to configure handlers for a top-level logger and create child loggers as needed. (You can, however, turn off propagation by setting the propagate attribute of a logger to False.) Note that propagation is not affected by the ancestor logger's level: the ancestor's handlers still receive the records emitted by the lower-level loggers.

Thread safety: it achieves this through using threading locks; there is one lock to serialize access to the module's shared data, and each handler also creates a lock to serialize access to its underlying I/O.

Cookbook highlights:

  1. logger names can be chained via logging.getLogger: apple, apple.pear, apple.pear.peach; child logs are passed up to the parent.
  2. supports threads.
  3. one logger can have multiple handlers and formatters: sometimes it will be beneficial for an application to log all messages of all severities to a text file while simultaneously logging errors or above to the console. The logger level should be <= the handler level, otherwise messages below the logger level never reach the handler.
  4. console messages may not need to contain a timestamp, but the log file does.
  5. to dynamically change the log configuration, you can use a signal handler, or a log config server that listens on a port to receive the new configuration.
  6. pass the log level to the CLI application.

Logging levels: DEBUG, INFO, WARNING, ERROR, CRITICAL, in increasing order of severity. The benefit of setting a level is that logging output can be filtered by it: records below the configured level are not recorded. For example, if the current level is set to INFO, logging.debug() will not be output.

The default level is WARNING, which means that only events of this level and above will be tracked, unless the logging package is configured to do otherwise.

Basic Usage

See the Basic Logging Tutorial. This is not good for practical use, as basicConfig configures the root logger and thus impacts other imported modules, but for simple usage it is fine.

logging.basicConfig only initializes the root logger.

import logging
import sys
import pprint

# see logging record attributes
# https://docs.python.org/3/library/logging.html#logrecord-attributes
FORMAT = "[%(threadName)s, %(asctime)s, %(levelname)s] %(message)s"

# filename: write log to a file
logging.basicConfig(filename='/opt/logfile.log',
                    level=logging.DEBUG,
                    format=FORMAT)
# by default the log will append to the file
# if you want to overwrite, set filemode='w'
logging.basicConfig(filename='/opt/logfile.log',
                    filemode='w',
                    level=logging.DEBUG,
                    format=FORMAT,
                    # custom timestamp
                    datefmt='%m/%d/%Y %H:%M:%S')

# stream: write to system stdout via stream
logging.basicConfig(stream=sys.stdout,
                    level=logging.DEBUG,
                    format=FORMAT)

# use lazy %s formatting in logging functions
logging.debug("%s", "something")
logging.info("%s", "something")
logging.warning("%s", "something")
logging.error("%s", "something")
logging.critical("%s", "something")

# pretty-print structured data (e.g. a parsed JSON/YAML document)
logging.info(pprint.pformat({"key": "value"}))

For JSON-formatted logs, there is an open source module: https://github.com/madzak/python-json-logger

NOTE: the desired logging.basicConfig() call should come first, and it only takes effect once.

Advanced Usage

For more advanced usage, please see: https://docs.python.org/3.8/howto/logging.html#advanced-logging-tutorial

Module-Level Functions

Configure a separate logger for each module.

import logging
import sys
import traceback

import elasticsearch

# for the current module
logger = logging.getLogger(__name__)
# have to configure the logger, otherwise only logging.WARNING level and above will show
logger.setLevel(logging.DEBUG)

# set a stream_handler; you can choose other handlers
stream_handler = logging.StreamHandler(sys.stdout)
formatter = logging.Formatter(
    '[%(threadName)s, %(asctime)s, %(levelname)s] %(message)s')
stream_handler.setFormatter(formatter)
# if handler logging level > logger logging level, the levels in the gap will not show
# if handler logging level <= logger logging level, show everything the logger allows
stream_handler.setLevel(logging.DEBUG)
logger.addHandler(stream_handler)

# use logger instead of logging
# use lazy % formatting in logger functions
logger.debug("%s", "something")

# reset a different level for the imported elasticsearch module
# elasticsearch.logger is found by dir(elasticsearch)
es_logger = elasticsearch.logger
# you can enrich or suppress logs from elasticsearch
es_logger.setLevel(elasticsearch.logging.INFO)
es_logger.addHandler(stream_handler)

# logging exception details
try:
    a = [1, 2]
    b = a[10]
except IndexError as e:
    logger.error(e, exc_info=True)

# if you don't know the exception type
try:
    pass
except Exception:
    logger.error("Exception is %s", traceback.format_exc())

Log Rotating

import logging
# size based
from logging.handlers import RotatingFileHandler
# time based
from logging.handlers import TimedRotatingFileHandler

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
# rotate the log file when size > 1000 bytes
# keep 2 backup logs: app.log.1, app.log.2
file_r = RotatingFileHandler('file_app.log', maxBytes=1000, backupCount=2)
# rotate every 2 days with 5 backups
time_r = TimedRotatingFileHandler('time_app.log', when='d', interval=2,
                                  backupCount=5)

logger.addHandler(file_r)
logger.addHandler(time_r)

for _ in range(1000):
    logger.info("this is %d", _)

Log Config File

For a complex logging configuration, we can write a config file; the format can be ini-style or dict-style. This way we don't need to hard-code or change the configuration in the code.

import logging
import logging.config

# ini-style config file
logging.config.fileConfig('logging.ini')

# dict-style config: pass a dict object (a minimal skeleton is shown;
# in practice it is usually parsed from a YAML/JSON file or defined in code)
LOGGING_CONFIG = {"version": 1}
logging.config.dictConfig(LOGGING_CONFIG)

# 'simplelogger' is defined in the config file
logger = logging.getLogger('simplelogger')

Signal Trap

In production we can use signals to change logging level dynamically, for example:

import logging
import signal

logger = logging.getLogger(__name__)

# Note: the handler must accept signal_num and frame; mark them as unused and del them if not needed.
def switch_logging_level(unused_signal_num, unused_frame):
    '''Toggle the logging level between DEBUG and INFO on SIGUSR1.

    For example: kill -10 <PID>
    '''
    del unused_signal_num
    del unused_frame

    if logger.isEnabledFor(logging.DEBUG):
        logger.setLevel(logging.INFO)
        logger.info('Disable logging.DEBUG level')
    else:
        logger.setLevel(logging.DEBUG)
        logger.info('Enable logging.DEBUG level')

signal.signal(signal.SIGUSR1, switch_logging_level)
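
Toggling the level at runtime is then just a matter of sending the signal to the running process (the PID and script name are placeholders):

# SIGUSR1 is signal 10 on x86 Linux
kill -USR1 <pid>
# or by process name, hypothetically
pkill -USR1 -f my_app.py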

Depending on how you initialized the logger, you may also need to update the handlers' logging levels:

import logging

logger = logging.getLogger()
logger.setLevel(level)  # 'level' determined elsewhere, e.g. in a signal handler
...
for handler in logger.handlers:
    handler.setLevel(logging.DEBUG)

For a Docker container, run the script that handles signals as the init process, or use tini to forward signals.
