Prometheus Quick Start

//TODO: [ ] see reading section [ ] series https://blog.freshtracks.io/a-deep-dive-into-kubernetes-metrics-b190cc97f0f6 [ ] prom config with service discovery, for example consul

Docker Compose Demo

GitHub repo: https://github.com/chengdol/InfraTree/tree/master/docker-monitoring

The steps are in the README.

Prometheus

Open-source monitoring and alerting system: https://prometheus.io/

Prometheus collects and stores its metrics as time-series data, i.e. metric information is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels.
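For example, a single sample in the exposition format looks like this (the metric name and label values are illustrative):

```bash
# one time series, identified by its metric name plus its label set;
# the value is recorded together with the scrape timestamp
http_requests_total{job="api-server", method="POST", handler="/messages"} 1027
```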

Architecture

Learning targets:

  • Know how to set up a Prometheus cluster for testing purposes
  • Know how to configure Prometheus/Alertmanager/Grafana
  • Know how to export different kinds of metrics
  • Know how to write PromQL
  • Know how to integrate with Grafana

Metric Types

Explanation (counter, gauge, histogram, summary): https://www.youtube.com/watch?v=nJMRmhbY5hY

Counter: request count, tasks completed, error count, etc. Query how fast the value is increasing with rate(); rate() applies only to counters, as they are monotonically increasing.
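For instance (the metric name below is illustrative):

```bash
# per-second average increase of a counter over the last 5 minutes
rate(http_requests_total[5m])
```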

Gauge: memory usage, queue size, Kafka lag, etc. A gauge can go up and down, so functions such as avg_over_time() apply to it.
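For example, assuming the standard node_exporter memory metric:

```bash
# average available memory over the last 10 minutes
avg_over_time(node_memory_MemAvailable_bytes[10m])
```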

Histogram: duration of HTTP requests, response size, etc. Used to later calculate averages and percentiles when an approximation is acceptable. You can use the default buckets or customize your own. Bucket values are cumulative: each observation is added to every bucket whose upper bound is greater than or equal to the observed value.
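The usual way to query a histogram is histogram_quantile() over the per-bucket rates (the metric name is illustrative):

```bash
# approximate 95th-percentile request duration over the last 5 minutes,
# computed from the cumulative bucket counters
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```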

Summary: duration of HTTP requests, response size, etc. More complex than a histogram; useful when you have no idea of the value range and therefore cannot choose histogram buckets. Quantiles are precomputed on the client side and exposed directly.
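Because the quantiles are precomputed, you query them directly as labeled series (metric name illustrative):

```bash
# client-side precomputed 95th percentile of request duration
http_request_duration_seconds{quantile="0.95"}
```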

PromQL

First, understand the metric types in Prometheus: https://www.youtube.com/watch?v=nJMRmhbY5hY

Helpful PromQL visualizing tool and cheat sheet: https://promlabs.com/promql-cheat-sheet

How do you know the labels of a specific metric? In the Prometheus expression browser, run the metric name and check the console output; it will contain all labels of that metric.
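You can also hit the HTTP API directly; a sketch assuming Prometheus listens on localhost:9090:

```bash
# list all series (with their full label sets) matching a metric name
curl 'http://localhost:9090/api/v1/series?match[]=up'
```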

Alert Expr

Excluding Time Slot from Alert Expr

This is helpful when we know a time window is a no-op. Prometheus expressions support time-based muting like this:

```bash
# excluding a specific time slot (here: Sunday 03:10-03:49 UTC)
some_metrics_vector and ON() absent(day_of_week() == 0 AND hour() >= 3 < 4 AND minute() >= 10 < 50)
```

For the explanation, see here. Note that PromQL logical operators are case-insensitive: and or AND, either is fine.

You can verify the expression in the Prometheus expression browser first, then write it into the alert expr.

Tips: when debugging an alert, to see the instance or job label values in the description, for example:

```yaml
annotations:
  description: '{{ $labels.instance }} is not responding for 10 minutes.'
```

Just run that alert expression manually in the expression browser and modify it to see the output labels.
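For context, a minimal complete rule file around that annotation might look like this (the alert name, expr, and severity are illustrative):

```yaml
groups:
  - name: example
    rules:
      - alert: InstanceDown
        expr: up == 0          # target scrape is failing
        for: 10m               # must hold for 10 minutes before firing
        labels:
          severity: critical
        annotations:
          description: '{{ $labels.instance }} is not responding for 10 minutes.'
```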

Capture the Counter First Event

There are cases where we want to capture the first 0 (non-existence) -> 1 counter event and fire an alert (see https://stackoverflow.com/questions/66532785/prometheus-alerting-rule-not-detecting-first-time-metric-increase). This can be captured by `unless` + `offset`; after the first event, we can use increase() to catch the rest:

```bash
# ((0 -> 1 case capture) or (1 -> 1+ case capture))
((_metric_counter_ unless _metric_counter_ offset 15m) or (increase(_metric_counter_[15m]))) > 0
```

Query Examples

Here I list some examples to explain and practice common PromQL. Some of them come from Grafana dashboards and contain embedded variables, but the syntax and usage are the same in the Prometheus expression browser and in Grafana.

Understand instant and range vector and how rate and irate works: https://www.metricfire.com/blog/understanding-the-prometheus-rate-function/

rate() (average rate) and irate() (instant rate, using only the last two data points) calculate the per-second average rate at which a value increases over a period of time; both automatically adjust for counter resets. If you want to use any other aggregation (such as sum) together with rate, you must apply rate first; otherwise counter resets will not be caught and you will get weird results.

irate() (spiky) should only be used when graphing volatile, fast-moving counters.

Use rate() (trend) for alerts and slow-moving counters: brief changes in the rate can reset the FOR clause, and graphs consisting entirely of rare spikes are hard to read.

Also remember: rate first, then aggregate, never the other way around: https://www.robustperception.io/rate-then-sum-never-sum-then-rate
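A sketch of the right order (the metric name is illustrative):

```bash
# correct: rate() handles each counter's resets individually, then sum aggregates
sum(rate(http_requests_total[5m])) by (job)

# the wrong order would be summing the raw counters first and then taking the
# rate: the sum hides individual counter resets, so rate() cannot correct for them
```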

For group_left (many-to-one!) and group_right (one-to-many!), here is the example; a minimal sketch also follows below.
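A sketch of a many-to-one join using node_exporter's info-metric pattern (metric and label names assume a standard node_exporter setup):

```bash
# node_exporter_build_info has value 1, one series per instance; each instance's
# many cpu/mode series on the left match that single info series (many-to-one),
# and group_left(version) copies its "version" label onto the result
rate(node_cpu_seconds_total[5m])
  * on(instance) group_left(version)
  node_exporter_build_info
```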

One query example from a system load average dashboard:

```bash
# ${interval}, ${load}, ${service}, $env:
# these variables are defined by the dashboard config variables

# explanation of label_replace:
# in the vector returned by avg_over_time(), check whether the "instance" label
# matches the regex $env-(.+); if it matches, $1 is the actual value captured by
# the first (.+) in the regex, and label_replace returns a new vector with an
# added label instance_group=$1; if there is no match, the original vector is
# returned unchanged
avg(
  label_replace(avg_over_time(node_load${load}{instance=~"^.+-${service:regex}-[0-9]+$"}[${interval}]),
    "instance_group",
    "$1",
    "instance",
    "$env-(.+)")
) by (instance_group) > 6
# then average the new vector, grouped by the instance_group label, and check
# whether the group-level average LA is > 6
```

The above variables are of the Query type: https://grafana.com/docs/grafana/latest/variables/variable-types/add-query-variable/

Grafana

As the key visualization component of the monitoring system, Grafana is covered in the separate post <<Grafana Quick Start>>.

Alertmanager

How to connect Alertmanager to Prometheus: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#alertmanager_config

How to configure Alertmanager itself: https://prometheus.io/docs/alerting/latest/configuration/

Config example to start: https://github.com/prometheus/alertmanager/blob/main/doc/examples/simple.yml

Tool to visualize the routing tree: https://prometheus.io/webtools/alerting/routing-tree-editor/

Alertmanager repo and Docker image: https://github.com/prometheus/alertmanager

The example Alertmanager start command:

```bash
/bin/alertmanager --config.file=/etc/config/alertmanager.yml --storage.path=/data --web.route-prefix=/ --web.external-url=https://xxx.xxx/alertmanager
```

Run identical Prometheus servers on two or more separate machines; identical alerts will be deduplicated by the Alertmanager.

For high availability of the Alertmanager itself, you can run multiple instances in a mesh cluster and configure the Prometheus servers to send notifications to each of them.
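A minimal sketch of the Prometheus side, assuming two Alertmanager instances reachable at the hostnames below:

```yaml
# prometheus.yml: point every Prometheus server at all Alertmanager instances;
# the Alertmanagers gossip among themselves and deduplicate notifications
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-0:9093
            - alertmanager-1:9093
```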

To silence one alert, use New Silence and, in the matcher, use alertname as the key and the alert's name as the value (you can add more key-value pairs to filter further). To silence multiple alerts, use a regex matcher. Preview silence shows how many currently active alerts would be affected, or you can just create the silence so no new notifications come through.
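The same can be done from the command line with amtool (a sketch; the alert name and URL are illustrative):

```bash
# silence every alert named InstanceDown for 2 hours
amtool silence add alertname=InstanceDown \
  --alertmanager.url=http://localhost:9093 \
  --duration=2h --author=oncall --comment="planned maintenance"
```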

Integration with K8s

  • https://www.youtube.com/watch?v=bErGEHf6GCc&list=PLpbcUe4chE7-HuslXKj1MB10ncorfzEGa
  • https://www.youtube.com/watch?v=CmPdyvgmw-A
  • https://www.youtube.com/watch?v=h4Sl21AKiDg
  • https://www.youtube.com/watch?v=5o37CGlNLr8
  • https://www.youtube.com/watch?v=LQpmeb7idt8

Readings
