For a recap of the Logstash basics, please have a look at Elastic Stack Quick Start.
Issue description: Over the past few months, we have been seeing logs that were indeed generated but went missing in the Elasticsearch database, which resulted in false PD alerts because the alert expression relies on those missing logs.
The data path is common and typical:
```
Data source --UDP--> Logstash --> Kafka --> Elasticsearch
```
The loss could happen in transit at any point; let's check and narrow down the scope.
At the very beginning, I suspected, without any proof, that the data was lost on the Kafka side and adjusted the ack level on the Logstash Kafka output plugin; it turned out not to be the case.
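For reference, the knob I adjusted was the acks option of the Kafka output plugin; the sketch below only illustrates the idea, with the broker address and topic as placeholders:

```
output {
  kafka {
    bootstrap_servers => "kafka-broker:9092"   # placeholder
    topic_id          => "app-logs"            # placeholder
    acks              => "all"                 # e.g. "all" instead of the default "1"; it did not fix the loss
  }
}
```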
Does the UDP Packet Reach the VM?
As UDP transport is not reliable, the packets could be lost before they even reach the Logstash VM. We can run tcpdump in the background to sniff the UDP packets from the target host and port to verify, for example:
```
# time limit the process run (here 1 hour) so the capture does not grow unbounded;
# interface, source host and port are placeholders for the real data source
timeout 3600 tcpdump -i eth0 -nn -w /tmp/trace.pcap -C 200 'udp and src host 10.0.0.1 and port 514' &
```
It is important to limit the pcap file size and roll over to a new file if necessary. For example, the pcap files generated by the above command will be:
```
# suffix number starts from 1, auto appended.
/tmp/trace.pcap
/tmp/trace.pcap1
/tmp/trace.pcap2
...
```
Each file will have a size <= 200MB, as specified by -C.
Then you can use Wireshark or tcpdump itself to read/filter the pcap file:
```
# -r: read pcap file, using the same filter syntax as during capture
tcpdump -nn -r /tmp/trace.pcap 'udp and port 514'
```
Now we know the UDP packets actually reached the VM; move on to the next question.
Other tcpdump Usages
If you want files to be created at strict time intervals:
```
# -G: rotated every 1800s duration; the file name needs a strftime pattern
tcpdump -i eth0 -w /tmp/trace-%Y-%m-%d-%H:%M:%S.pcap -G 1800
```
We can use -C (size) and -G (time) together, but then a timestamped file name is required! For example:
```
tcpdump -i eth0 -w /tmp/trace-%Y-%m-%d-%H:%M:%S.pcap -G 3 -C 2
```
A new file is created either every 3s or every 2MB, whichever comes first; notice that the size rollover stays within each time interval:
```
-rw-r--r--. 1 tcpdump 2.0M May 2 06:30 trace-2022-05-02-06:30:20.pcap
...
```
You can rotate the file by size and limit the number of files created:
```
# -W: file count limit 3, tcpdump rotates among them like a ring buffer
tcpdump -i eth0 -w /tmp/trace.pcap -C 200 -W 3
```
The result could be:
```
# rotated among these 3 files
/tmp/trace.pcap0
/tmp/trace.pcap1
/tmp/trace.pcap2
```
Or rotate by time and size together; note that the size rotation happens only within each timeslice, i.e. every 3s, whenever the size limit is exceeded!
```
# must use the timestamp file name, since -G is involved
tcpdump -i eth0 -w /tmp/trace-%Y-%m-%d-%H:%M:%S.pcap -G 3 -C 2 -W 3
```
Does the Logstash UDP Input Drop Packets?
This can be verified by adding another output plugin, file, to store the target info in a local file, for example in the Logstash config file:
```
input {}    # the existing udp input stays unchanged
filter {}   # the existing filter stays unchanged
output {
  # existing kafka output stays unchanged; add an extra file output
  # to dump the target info locally and see how far the event gets
  file {
    path => "/tmp/logstash-udp-debug.log"   # placeholder path
  }
}
```
By checking the local log file, I knew the output stage never sent the missing messages out; since the filter is fine, it must be the input {} stage where UDP dropped the data.
Why Does UDP Drop Packets?
UDP is not a reliable transport. By design, it will drop messages if it does not have space to buffer them.
There is a blog post that talks about UDP packet errors and UDP receive buffer errors, visible via netstat -us; they are the first-hand indicators of packet drops.
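A quick way to watch those counters on the VM (the commands below are generic, not specific to this setup):

```
# the Udp section of `netstat -us` includes "packet receive errors" and
# "receive buffer errors"; if they keep growing, packets are being dropped
netstat -us
# the same counters come straight from the kernel SNMP table
watch -d 'grep -w Udp /proc/net/snmp'
```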
How to Solve It?
Luckily, we can increase the queue size and the buffer size, for example:
```
input {
  udp {
    port                 => 514        # placeholder port
    queue_size           => 30000      # example value; the plugin default is 2000
    receive_buffer_bytes => 16777216   # 16MB, must not exceed net.core.rmem_max
    workers              => 4          # example value; size it to the available CPU cores
  }
}
```
The right values really depend on your traffic; you can run sar -n DEV 1 to find a reasonable estimate (see the sketch after the sysctl example below). Moreover, you need to raise the system socket receive buffer for the Logstash receive_buffer_bytes setting to take effect when it is larger than the system default, for example:
1 | echo "net.core.rmem_max=16777216" >> /etc/sysctl.conf |
Finally, increase the Logstash JVM heap size accordingly, usually half of the RAM size for both Xms and Xmx.
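In practice that means editing the heap flags in Logstash's jvm.options file; the values below are illustrative, assuming a 16GB VM:

```
# /etc/logstash/jvm.options -- heap set to about half of the VM RAM
-Xms8g
-Xmx8g
```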
Then restart Logstash and check its status with the new config:
```
systemctl restart logstash
systemctl status logstash
# tail the service log to confirm a clean start with the new config
journalctl -u logstash -f
```
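Besides systemctl, the Logstash monitoring API (port 9600 by default) is a handy sanity check that events are flowing again; the grep below is just one way to peek at it:

```
# per-pipeline event counts (in / filtered / out) since the restart
curl -s 'http://localhost:9600/_node/stats/pipelines?pretty' | grep -A 4 '"events"'
```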
Horizontal Scaling
If the traffic is heavy, vertical scaling of CPU cores or RAM is not enough and the packet drops continue. It turns out that Logstash does not scale well with increasing CPU cores; see this comment.
So in this case you have to scale horizontally.
Monitoring/Alert
It is helpful to have a Grafana dashboard that displays the drop rate relative to the inbound traffic, for example:
```
# UDP receive errors as a fraction of inbound datagrams per host
# (the label_replace arguments are illustrative, mapping instance -> host)
label_replace(rate(node_netstat_Udp_InErrors{instance=~".*-example-.*"}[5m]),
              "host", "$1", "instance", "(.*):.*")
/
label_replace(rate(node_netstat_Udp_InDatagrams{instance=~".*-example-.*"}[5m]),
              "host", "$1", "instance", "(.*):.*")
```
Here is the list of metrics exposed by node exporter.
Postscript
Other things I practiced along the way:
- Logstash config and syntax for the input, filter, and output plugins.
- Using nc as a UDP client to test the Logstash UDP input and filter (see the sketch below).
- Revisiting the Kafka config to understand possible data loss on the Kafka side.
- Revisiting PromQL.
Finally, I went through the networking part of system tuning once again to reinforce it :D
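As a reference for the nc test mentioned in the list above, something like this sends a single UDP message into the Logstash input (host and port are placeholders):

```
# fire one test message at the Logstash UDP input,
# then check the file output / filter result downstream
echo '{"message": "hello from nc"}' | nc -u -w1 logstash-host 514
```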