Kubernetes version 1.13.2
I have talked about setting up the pod network in a Kubernetes cluster using the Calico network add-on in the post <<Set Up K8s Cluster by Kubeadm>>. Recently I was involved in an issue from the performance team: they complained that the network has a bottleneck in Calico. Let's see what happened and learn something new!
Description
The performance team set up a 6-node cluster with 1 master and 5 workers. Each machine has 48 CPU cores, 128 GB memory, 2 TB+ disk and a 10000 Mb/s network.
These are the test cases:
- 10 jobs for each user (8 small jobs, 1 medium job and 1 large job)
They ran concurrent users with N compute pods and found that the bottleneck is in the Calico network:
BTW, there are still enough resources (CPU, memory and disk I/O) to support DataStage scaling up for DS concurrent users to run jobs on nodes and pods. But the network bandwidth between pods is not enough to support it.
iperf3 Command
They used iperf3, a TCP, UDP, and SCTP network throughput measurement tool, to measure memory-to-memory performance across a network in server-client mode.
To install it, for example on CentOS:
```bash
sudo yum install iperf3 -y
```
The usage is simple; here are some simple demos.
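A minimal sanity check between two machines might look like the following; 10.0.0.1 is a placeholder for the server's address, and the defaults (TCP, port 5201, 10-second run) are assumed:

```bash
## on the server node: listen on the default port 5201
iperf3 -s

## on the client node: run a 10-second TCP throughput test against the server
iperf3 -c 10.0.0.1
```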
Node to Node
For command options and flags, see the user guide.
This will transfer /large_file on the client to /large_file on the server over a 40-second interval; one side runs as server, the other as client, as sketched below.
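A minimal sketch of that run; SERVER_IP is a placeholder for the server node's address, and the exact flags used in the original test may differ:

```bash
## server: listen and write the received stream to /large_file
iperf3 -s -F /large_file

## client: send /large_file to the server for 40 seconds, reporting every second
iperf3 -c SERVER_IP -F /large_file -t 40 -i 1
```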
The UDP traffic benchmark works the same way, with one server and one client (sketched below).
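A sketch of the UDP variant; the 10G target bandwidth is an assumption chosen to match the 10000 Mb/s NICs:

```bash
## server
iperf3 -s

## client: UDP stream for 40 seconds at a 10 Gbit/s target rate
iperf3 -c SERVER_IP -u -b 10G -t 40 -i 1
```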
Pod to Pod
The same as node to node, but here we wget and build iperf3 inside the pod container and use the container's IP (run hostname -I inside the container). For example, I flood data from the is-en-conductor-0 pod to the is-engine-compute-12 pod; they reside on different host machines.
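Assuming iperf3 has been built inside both pods, the run might look like this via kubectl exec; POD_IP is a placeholder for whatever hostname -I returns in the receiving pod:

```bash
## find the receiving pod's IP (or run `hostname -I` inside it)
kubectl get pod is-engine-compute-12 -o wide

## server side, inside the receiving pod
kubectl exec -it is-engine-compute-12 -- iperf3 -s

## client side, from the pod on the other node, flooding data to the server pod
kubectl exec -it is-en-conductor-0 -- iperf3 -c POD_IP -t 40
```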
Thinking
After reproducing the tests, I was thinking that Calico is a widely used add-on and shouldn't have such an obvious bottleneck, otherwise many people would have complained and it would have been improved. Is there any improper configuration?
- Configuring IP-in-IP

By default, the manifests enable IP-in-IP encapsulation across subnets (additional overhead compared to non-IP-in-IP). If you don't need it (when? I am not very clear), disable it in the Calico manifest YAML file:

```yaml
# Enable IPIP
- name: CALICO_IPV4POOL_IPIP
  value: "off"
```

See this IP-in-IP issue. I am using Calico version 3.3; see its document about IP-in-IP.
- Which Network Interface is Used

Another question I have is which network interface is used for the node-to-node and pod-to-pod tests. There are several network interfaces on each host machine: one is for the public IP with MTU 9000, and in K8s we use the private IP interface with MTU 1500. This has an impact on the iperf3 testing. It shows that the pod-to-pod test uses MTU 1500 but node-to-node uses MTU 9000. I need to test after enlarging the MTU size to see whether that change improves network throughput; also remember to update the Calico manifest YAML, refer to this document:

```yaml
# Configure the MTU to use
veth_mtu: "9000"
```

A quick way to inspect both the IP-in-IP setting and the MTUs is sketched after this list.
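A sketch of how the current IP-in-IP mode and the MTUs could be inspected; the object names calico-node and calico-config are the defaults from the Calico manifests, and the pod name is reused from the earlier test:

```bash
## current IP-in-IP setting in the calico-node DaemonSet
kubectl -n kube-system get daemonset calico-node -o yaml | grep -A1 CALICO_IPV4POOL_IPIP

## veth MTU configured for Calico
kubectl -n kube-system get configmap calico-config -o yaml | grep veth_mtu

## MTU of each interface on a node
ip -br link show

## MTU seen inside a pod (the Calico veth shows up as eth0 in the pod)
kubectl exec -it is-en-conductor-0 -- cat /sys/class/net/eth0/mtu
```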
ethtool Command
The ethtool command can display the speed property of a network interface, for example:
```bash
# ethtool eno1
```
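The interesting fields are Speed and Duplex. A trimmed, illustrative output for a 10 Gbit/s interface might look like this; the values are examples, not measurements from the actual cluster:

```
Settings for eno1:
        Speed: 10000Mb/s
        Duplex: Full
        Link detected: yes
```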
The output depends on what the network driver can provide; you may get nothing if the virtual machine does not have much data available (for example, in an IBM SoftLayer cluster), refer to this question:
In a virtual machine the link speed or duplex mode is usually meaningless, as the network interface is most often just a virtual link to the host system, with no actual physical Ethernet layer. The speed is as high as the CPU and memory can handle (or as low as the connection rate limit is configured), cabling type does not exist, as there is no cable, etc. This virtual interface is bridged or routed to the actual physical network by the host system and only on the host system the physical port parameters can be obtained.
You can check whether a network interface is virtual or not:

```bash
ls -l /sys/class/net
```
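In that listing, physical NICs resolve to a PCI device path while virtual interfaces (such as Calico's cali* veths) live under the virtual device tree; the example paths below are illustrative:

```bash
## physical NIC, e.g.   eno1 -> ../../devices/pci0000:00/0000:00:1f.6/net/eno1
## virtual interface, e.g.   calic0ffee -> ../../devices/virtual/net/calic0ffee
ls -l /sys/class/net

## list only the virtual interfaces
ls -l /sys/devices/virtual/net
```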
Summary
- Performance test big picture
- iperf3, node-to-node, pod-to-pod tests
- ethtool
- Calico configuration: IP-in-IP, MTU
- calicoctl, which I haven't had time to learn yet (a starter command is sketched below)
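As a starting point, the following calicoctl command shows the IP pool and its IP-in-IP mode; this is a sketch assuming calicoctl v3.x is installed and configured against the cluster:

```bash
## list IP pools; the IPIPMODE column shows whether IP-in-IP is enabled
calicoctl get ippool -o wide
```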
Other Blogs
- k8s calico flannel cilium network performance test (in Chinese)
- Benchmark results of Kubernetes network plugins (CNI) over 10Gbit/s network