//TODO: [ ] see reading section [ ] series https://blog.freshtracks.io/a-deep-dive-into-kubernetes-metrics-b190cc97f0f6 [ ] prom config with service discovery, for example consul
Docker Compose Demo
GitHub repo: https://github.com/chengdol/InfraTree/tree/master/docker-monitoring
The steps are in the README.
Prometheus
Open-source monitoring and alerting system:
https://prometheus.io/
Prometheus collects and stores its metrics as time-series data, i.e. metrics information is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels.
Architecture
Learning targets:
- Know how to set up a Prometheus cluster for testing purposes
- Know how to configure Prometheus/Alertmanager/Grafana
- Know how to export different kinds of metrics
- Know how to write PromQL queries
- Know how to integrate with Grafana
Metric type
Explanation (counter, gauge, histogram, summary): https://www.youtube.com/watch?v=nJMRmhbY5hY
Counter: request count, tasks completed, error count, etc. Query how fast the value is increasing with rate(); rate() only applies to counters because they are monotonically increasing.
Gauge: memory usage, queue size, Kafka lag, etc. For example, use avg_over_time() on a gauge.
Histogram: duration of HTTP requests, response size, etc. Use it to calculate averages and percentiles later, when an approximation is acceptable. You can use the default buckets or customize your own. Bucket values are cumulative: an observation is added to every bucket whose upper bound is greater than or equal to the observed value.
Summary: duration of HTTP requests, response size, etc. More complex than a histogram; use it when you have no idea of the value range and therefore cannot define histogram buckets.
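To illustrate the cumulative histogram buckets described above, here is a minimal Python sketch. It is not the Prometheus client library; the bucket bounds, the sample durations, and the observe helper are all made up for illustration.

```python
import math

# Hypothetical bucket upper bounds in seconds; the last bucket is +Inf.
bounds = [0.1, 0.5, 1.0, math.inf]
buckets = {bound: 0 for bound in bounds}


def observe(buckets, value):
    """Count the observation in every bucket whose upper bound is >= value."""
    for upper_bound in buckets:
        if value <= upper_bound:
            buckets[upper_bound] += 1


# Hypothetical request durations in seconds.
for duration in [0.05, 0.3, 0.3, 0.7, 2.0]:
    observe(buckets, duration)

for upper_bound, count in buckets.items():
    print(f"le={upper_bound}: {count}")
```

Because the buckets are cumulative, the +Inf bucket always equals the total number of observations (5 here), and the 0.5 bucket also includes the observation counted in the 0.1 bucket.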
PromQL
First, understand the metric types in Prometheus: https://www.youtube.com/watch?v=nJMRmhbY5hY
Helpful promQL visualizing tool, cheat sheet: https://promlabs.com/promql-cheat-sheet
How to know the labels of a specific metric? Run the metric name in the Prometheus query browser and look at the console output; it contains all the labels of that metric.
Alert Expr
Excluding Time Slot from Alert Expr
This is helpful when we know a time slot is no-ops. Time-based muting is now also supported natively (in Alertmanager, through mute time intervals); the expression-based approach:
# excluding specific time slot
For the explanation, please see here. Logical operators are case-insensitive, so either and or AND is fine.
You can verify the expression in the Prometheus expression browser first and then write it into the alert expr.
Tip: when debugging an alert, you can show the value of a label such as instance or job in the description, for example:
annotations:
Query Example
Here I list some examples to explain and practice common PromQL. Some of them come from Grafana dashboards, so they contain embedded variables, but the syntax and usage are the same in the Prometheus expression browser and in Grafana.
Understand instant and range vectors and how rate() and irate() work:
https://www.metricfire.com/blog/understanding-the-prometheus-rate-function/
rate() (average rate!) and irate() (instant rate, last 2 data points only) calculate the per-second rate at which a counter is increasing over a period of time; both automatically adjust for counter resets. If you want to use any other aggregation (such as sum()) together with rate(), then you must apply rate() first, otherwise the counter resets will not be caught and you will get weird results.
irate() (spike) should only be used when graphing volatile, fast-moving counters. Use rate() (trend) for alerts and slow-moving counters, as brief changes in the rate can reset the FOR clause and graphs consisting entirely of rare spikes are hard to read.
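To make the difference between rate() and irate() concrete, here is a minimal Python sketch. It is a simplification, not Prometheus's implementation: it assumes a counter restarts from 0 after a reset, and it ignores the extrapolation to the window boundaries that the real rate() performs. The sample data and the names simple_rate and simple_irate are made up for illustration.

```python
from typing import List, Tuple

# (timestamp in seconds, counter value)
Sample = Tuple[float, float]


def increase_with_resets(samples: List[Sample]) -> float:
    """Sum the increases between neighbouring samples, handling resets."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur < prev:
            # The counter went down, so assume it restarted from 0:
            # everything counted since the restart is new increase.
            total += cur
        else:
            total += cur - prev
    return total


def simple_rate(samples: List[Sample]) -> float:
    """Average per-second increase over all samples in the window."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    return increase_with_resets(samples) / elapsed


def simple_irate(samples: List[Sample]) -> float:
    """Per-second increase based on the last two samples only."""
    return simple_rate(samples[-2:])


# Hypothetical scrapes every 15s; the counter resets between t=30 and t=45.
window = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 70)]

print(simple_rate(window))   # average over the whole window
print(simple_irate(window))  # only the last two points
```

With this data the averaged rate is 130 / 60 (about 2.17 per second), while the instant rate is 60 / 15 = 4.0 per second, which shows why irate() reacts to spikes and rate() shows the trend.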
Also remember to apply rate() first and then aggregate, never the other way around:
https://www.robustperception.io/rate-then-sum-never-sum-then-rate
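The following Python sketch shows why the order matters. It is only a toy model under the assumption that a counter restarts from 0 after a reset; the two instances and their values are made up, and it compares raw increases rather than per-second rates to keep the arithmetic short.

```python
def increase(values):
    """Total increase of one counter series, treating a drop as a reset to 0."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        total += cur if cur < prev else cur - prev
    return total


# Hypothetical counters from two instances, scraped at the same times.
instance_a = [100, 110, 120, 130]
instance_b = [500, 510, 5, 15]  # instance_b restarted after the second scrape

# Rate first, then sum: the reset is visible in instance_b's own series.
rate_then_sum = increase(instance_a) + increase(instance_b)

# Sum first, then rate: the reset looks like a reset of the whole sum, so the
# full summed value after the drop is wrongly counted as new increase.
summed = [a + b for a, b in zip(instance_a, instance_b)]
sum_then_rate = increase(summed)

print(rate_then_sum)  # 55.0
print(sum_then_rate)  # 165.0
```

The real increase is 55, but summing first inflates it to 165 because the values of the healthy instance get counted again at the reset.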
For group_left (many to one!) and group_right (one to many!), here is the example.
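As a rough mental model of many-to-one matching, the Python sketch below mimics what a query of the shape many * on(instance) group_left(version) one does. It is not how Prometheus is implemented; the series, the label names (instance, mode, version), and the helper group_left_multiply are hypothetical.

```python
def group_left_multiply(many, one, on, extra_labels):
    """Join each 'many' series to the single 'one' series sharing the `on` labels."""
    index = {}
    for labels, value in one:
        key = tuple(labels[name] for name in on)
        if key in index:
            # The 'one' side must be unique per match key.
            raise ValueError("duplicate series on the 'one' side")
        index[key] = (labels, value)

    result = []
    for labels, value in many:
        key = tuple(labels[name] for name in on)
        if key not in index:
            continue  # no partner on the 'one' side, so the series is dropped
        one_labels, one_value = index[key]
        out = dict(labels)  # keep all labels of the 'many' side
        for name in extra_labels:
            if name in one_labels:
                out[name] = one_labels[name]  # copy the requested extra labels
        result.append((out, value * one_value))
    return result


# Hypothetical data: several series per instance on the 'many' side,
# exactly one info-style series (value 1) per instance on the 'one' side.
many = [
    ({"instance": "a", "mode": "user"}, 3.0),
    ({"instance": "a", "mode": "system"}, 1.0),
    ({"instance": "b", "mode": "user"}, 2.0),
]
one = [
    ({"instance": "a", "version": "1.2"}, 1.0),
    ({"instance": "b", "version": "1.3"}, 1.0),
]

joined = group_left_multiply(many, one, on=["instance"], extra_labels=["version"])
for labels, value in joined:
    print(labels, value)
```

Every series on the many side keeps its own labels and value and gains the version label from its single partner; group_right is the same idea with the two sides swapped.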
One query example for system load average dashboard:
# ${interval}, ${load}, ${service}, $env:
The variables above are of the Query type: https://grafana.com/docs/grafana/latest/variables/variable-types/add-query-variable/
Grafana
Grafana is the key visualization component of the monitoring system; please see the separate post
<<Grafana Quick Start>>
AlertManager
How to connect Alertmanager to Prometheus: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#alertmanager_config
How to configure Alertmanager itself: https://prometheus.io/docs/alerting/latest/configuration/
Config example to start: https://github.com/prometheus/alertmanager/blob/main/doc/examples/simple.yml
Tool to generate routing tree: https://prometheus.io/docs/alerting/latest/configuration/
Alertmanager repo and docker: https://github.com/prometheus/alertmanager
The example alertmanager start command:
/bin/alertmanager --config.file=/etc/config/alertmanager.yml --storage.path=/data --web.route-prefix=/ --web.external-url=https://xxx.xxx/alertmanager
Run identical Prometheus servers on two or more separate machines. Identical alerts will be deduplicated by the Alertmanager.
For high availability of the Alertmanager, you can run multiple instances in a Mesh cluster and configure the Prometheus servers to send notifications to each of them.
To silence one alert, use New Silence and add a matcher with alertname as the key and the alert's name as the value (you can add more key-value matchers to narrow the filter). To silence multiple alerts, use a regex matcher. Preview silence shows how many currently active alerts are affected, or you can just create the silence so that no new alerts will come.
Integrated with K8s
https://www.youtube.com/watch?v=bErGEHf6GCc&list=PLpbcUe4chE7-HuslXKj1MB10ncorfzEGa
https://www.youtube.com/watch?v=CmPdyvgmw-A
https://www.youtube.com/watch?v=h4Sl21AKiDg
https://www.youtube.com/watch?v=5o37CGlNLr8
https://www.youtube.com/watch?v=LQpmeb7idt8