ArthurChiao's Blog

OVS Unknown Unicast Flooding Under Distributed L2 Gateway

Published at 2019-10-13 | Last Update 2020-05-18

TL; DR

In a distributed L2 gateway environment (e.g. Spine-Leaf), mismatched ARP/fdb aging settings may cause OVS unicast flooding. Also, the behavior of distributed L2 gateway products varies among vendors.

1 Problem Description

An internal user reported that some of their instances (Docker containers) periodically received a relatively large amount of suspicious ingress traffic, even though the instances were not serving anything, as shown below:

Fig. 1.1 Suspicious periodic ingress traffic to an instance

2 Infra & Environment Info

This section provides some basic infrastructure information to help understand the problem. For more detailed information, please refer to my previous post Ctrip Network Architecture Evolution in the Cloud Computing Era.

The data center network uses a Spine-Leaf architecture, with Leaf nodes serving as both distributed L2 and L3 gateways.

Fig. 2.1 Datacenter network topology

Inside each compute host, all instances connect to an OVS bridge, and the default route inside each container points to its own (distributed) gateway.

Fig. 2.2 Virtual network topology inside a host

Others:

  • OVS version: 2.3.1, 2.5.6
  • Linux Kernel: 4.14

3 Trouble Shooting

3.1 Confirm: Unicast Flooding

Using tcpdump, without too much effort we confirmed that this traffic was not destined for the container: neither the dst_ip nor the dst_mac of these periodic packets matched the instance's IP/MAC. So we reached our first conclusion: OVS was doing unicast flooding [1].

Unicast flooding means that OVS didn't know which port the dst_mac of a packet was behind, so it duplicated the packet and sent a copy to all interfaces carrying the same VLAN tag. E.g., in Fig 2.2, inst1's egress traffic will be duplicated to inst2, but not to inst3 and inst4.
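
As a quick sanity check (a rough sketch: the interface name eth0 and the MAC fa:16:3e:aa:bb:cc below are made-up placeholders for the capturing instance), flooding can also be observed from any other instance in the same VLAN by capturing unicast frames whose dst MAC is not the instance's own:

# Run inside a neighboring instance (e.g. inst2 in Fig 2.2): capture unicast
# frames whose destination MAC is NOT our own. A properly learning bridge
# should not deliver such frames to us, so seeing them means OVS is flooding.
$ tcpdump -n -e -i eth0 'not ether dst fa:16:3e:aa:bb:cc and not ether multicast'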

3.2 Confirm: all flooded traffic is destined for the L2 GW

The next question was: why didn't OVS know the dst_mac?

The flooded packets varied a lot: they came from different source IP addresses and went to different destination IP addresses.

But looking further into the captured packets, we found that all the flooded packets shared the same dst_mac, let's say 00:11:22:33:44:55. It took us a while to figure out that this was the distributed L2 gateway address in our Spine-Leaf network (this MAC was manually configured and managed by another team, which is why we didn't recognize it at first).
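
For example (a sketch only: eth1 and the packet count are arbitrary, and the awk field position assumes the default `tcpdump -e` output format), the dst MACs of the flooded packets can be summarized like this:

# Capture 1000 non-broadcast/multicast packets on the instance's interface and
# count their destination MACs; with -e, the 4th field is the dst MAC (plus a
# trailing comma). All the flooded packets end up under 00:11:22:33:44:55.
$ tcpdump -n -e -i eth1 -c 1000 'not ether multicast' 2>/dev/null \
    | awk '{print $4}' | tr -d ',' | sort | uniq -c | sort -rn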

3.3 Verify: OVS fdb entry went stale while container ARP was active

What OVS fdb looks like:

$ ovs-appctl fdb/show br-int
 port  VLAN  MAC                Age
    1     0  c2:dd:d2:40:7c:15    1
    2     0  04:40:a9:db:6f:df    1
    2     4  00:11:22:33:44:55    16
    2     9  00:11:22:33:44:55    6

Next, we'd like to verify our assumption: the L2 GW's entry in the OVS fdb goes stale when this problem happens.

Fortunately, the problem recurred every 20 minutes (it turned out some periodic jobs generated the traffic), so it was easy for us to capture whatever we wanted. We used the following command to check the entry's existence. In our case, the instance has VLAN tag 4, so we grep the pattern " 4 00:11:22:33:44:55".

for i in {1..1800}; do
    echo $(date) " " $(ovs-appctl fdb/show br-int | grep " 4 00:11:22:33:44:55") >> fdb.txt;
    sleep 1;
done

Normally, the output would look like this:

2     4  00:11:22:33:44:55    16
2     4  00:11:22:33:44:55    17
2     4  00:11:22:33:44:55    18
2     4  00:11:22:33:44:55    19
2     4  00:11:22:33:44:55    0
...

During the test, we found that the output disappeared for minutes, and this period exactly matched the problematic period. So the second conclusion: the fdb entry indeed went stale ahead of the ARP entry inside the container (the container was still using this ARP entry to transmit those packets).
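
To cross-check the other half of this conclusion, the container's ARP entry for the gateway can be inspected at the same moment the fdb entry is missing. A minimal sketch, assuming the container is named ctn1 and the gateway IP is 10.60.6.1 (both hypothetical here):

# While " 4 00:11:22:33:44:55" is absent from the OVS fdb, the container
# should still hold a usable ARP entry pointing to the gateway's virtual MAC.
$ docker exec ctn1 ip neigh show | grep 10.60.6.1
10.60.6.1 dev eth0 lladdr 00:11:22:33:44:55 REACHABLE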

But one question remained: the container's traffic was not interrupted during flooding, which means the packets from the gateway to the container were successfully received. So from the container's point of view it was not affected by the flooding (though if the flooded traffic were really heavy, the container might be affected, since packet drops could occur in that case).

In short: the gateway had replied to every request it received from the container, so why hadn't OVS refreshed its fdb entry for the gateway? In theory OVS should do so, as the replies were unicast packets originating from the gateway. Did we miss something?

3.4 Distributed L2 GW behavior: vendor-dependent

Could it be that the src_mac of the unicast replies from the gateway was different from the GW_MAC seen inside the container?

To verify this, I generated some really simple traffic, pinging the GW from the container, and printed the src and dst MAC addresses of each packet:

$ tcpdump -n -e -i eth1 host 10.60.6.1 and icmp
fa:16:3e:96:5e:3e > 00:11:22:33:44:55, 10.60.6.9 > 10.60.6.1: ICMP echo request, id 7123, seq 1, length 64
70:ea:1a:aa:bb:cc > fa:16:3e:96:5e:3e, 10.60.6.1 > 10.60.6.9: ICMP echo reply, id 7123, seq 1, length 64
fa:16:3e:96:5e:3e > 00:11:22:33:44:55, 10.60.6.9 > 10.60.6.1: ICMP echo request, id 7123, seq 2, length 64
70:ea:1a:aa:bb:cc > fa:16:3e:96:5e:3e, 10.60.6.1 > 10.60.6.9: ICMP echo reply, id 7123, seq 2, length 64

That was it! Why did the reply packets carry the MAC address 70:ea:1a:aa:bb:cc instead of 00:11:22:33:44:55? Who is 70:ea:1a:aa:bb:cc? We were told that this is one of the real MACs of the distributed L2 GW, while the latter is the virtual MAC. That's the problem: the GW replies with a MAC different from 00:11:22:33:44:55, so this fdb entry would never be refreshed by OVS, and the flooding continued.
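
One way to see this live (a sketch, reusing the bridge name from above): keep an eye on both MACs in the fdb while pinging the gateway from the container. The entry for the real MAC keeps being refreshed by the replies, while the entry for the virtual MAC only ages:

# The Age of 70:ea:1a:aa:bb:cc should keep resetting to 0 on every reply,
# while the Age of 00:11:22:33:44:55 keeps growing until the entry expires.
$ watch -n1 'ovs-appctl fdb/show br-int | grep -E "00:11:22:33:44:55|70:ea:1a:aa:bb:cc"'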

This was the behavior of our Cisco devices. We further checked our H3C devices and, surprisingly, found that under the same conditions the H3C replies were consistent: it always uses 00:11:22:33:44:55 for both sending and receiving. Till now, I haven't got a definitive answer on how a distributed L2 (and L3) gateway should behave.

More about physical switch ARP aging

The Cisco switch maintains an ARP timer for each IP, which defaults to 1500s (25 minutes) [4].

LEAF # show ip arp vrf test | in 2001
10.6.2.227 00:01:12 fa16.xxxx.97c7 Vlan2001
10.6.2.228 00:01:15 fa16.xxxx.97c8 Vlan2001
10.6.2.229 00:01:33 fa16.xxxx.97c9 Vlan2001

If no frames have originated from this IP/MAC for 19 minutes, the switch sends an ARP request to this IP (host) to refresh the entry:

Host $ tcpdump -en -i eth0 ether src 00:11:22:33:44:55
15:26:31.650401 00:11:22:33:44:55 > fa:16:xx:67, vlan 2001, ARP, Request who-has 10.6.2.241 (fa:16:xx:67) tell 10.6.2.1
15:27:06.023959 00:11:22:33:44:55 > fa:16:xx:c5, vlan 2001, ARP, Request who-has 10.6.2.73  (fa:16:xx:c5) tell 10.6.2.1
15:27:07.594005 00:11:22:33:44:55 > fa:16:xx:aa, vlan 2001, ARP, Request who-has 10.6.2.7   (fa:16:xx:aa) tell 10.6.2.1

3.5 Fixup

This problem was triggered by the distributed L2 GW's behavior, but it is really a configuration error inside the host: we should always make sure intermediate forwarding devices (OVS bridges in our case) have a longer aging time than both the instances' ARP aging time and the physical switch's ARP aging time.

The Linux kernel's ARP aging mechanism is really complicated: rather than one or a few parameters, it is controlled by a combination of parameters and a state machine; refer to this post [3] if you are interested. Setting the OVS fdb aging time to 1800s is safe enough for us:

$ ovs-vsctl set bridge br-int  other_config:mac-aging-time=1800
$ ovs-vsctl set bridge br-bond other_config:mac-aging-time=1800

(The above configurations survive OVS and system reboots.)
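
To double-check that the setting is really in the OVS database (exact quoting of the output may vary between OVS versions):

# Read back the per-bridge aging time we just set.
$ ovs-vsctl get bridge br-int other_config:mac-aging-time
"1800"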

After this configuration, the problem disappeared:

Fig. 3.1 Problem disappeared

4 Summary

The kernel usually keeps an ARP entry alive longer than the OVS fdb keeps its MAC entry (default 300s), so in some cases the gateway's MAC entry is still valid in the ARP table while it has already expired from the OVS fdb. The next egress packet to the gateway (with dst_mac=GW_MAC) then triggers OVS unicast flooding.
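
For reference, the main kernel-side knobs can be inspected as below (a sketch; the values in the comments are common defaults, and per-interface settings under /proc/sys/net/ipv4/neigh/<dev>/ override the defaults). The kernel has no single "ARP aging time": an entry turns STALE after roughly base_reachable_time but remains in the table and usable, which is why it typically outlives the OVS fdb entry.

# Time (ms) an entry stays REACHABLE before turning STALE (randomized around this value).
$ sysctl net.ipv4.neigh.default.base_reachable_time_ms   # typically 30000
# How often (s) stale neighbour entries are checked.
$ sysctl net.ipv4.neigh.default.gc_stale_time            # typically 60
# Delay (s) before the first probe once a STALE entry is used again.
$ sysctl net.ipv4.neigh.default.delay_first_probe_time   # typically 5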

When a correct reply packet from the gateway arrives, OVS refreshes the gateway's MAC entry in its fdb, and the unicast flooding stops (traffic returns to normal L2 forwarding, since OVS once again knows where GW_MAC is).

The real catastrophe comes when the gateway responds "incorrectly", specifically:

  1. the gateway is a distributed L2 gateway, with one virtual MAC and many real (instance) MACs (the same idea as a VIP and instance IPs in load balancers)
  2. egress traffic from the container to the gateway uses the gateway's virtual MAC
  3. reply traffic from the gateway to the container uses one of the gateway's real MACs (instance MACs)

In this case, the OVS fdb entry will not be refreshed, so OVS will unicast-flood every egress packet of the container that is destined for the gateway, until the gateway proactively advertises its virtual MAC to the container, or the container initiates a proactive ARP request to the gateway. This flooding period may persist for minutes, and all such traffic in the same VLAN (or even in the entire OVS bridge if VLANs are not used) will be copied to every instance connected to the OVS bridge.
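
As a hypothetical stop-gap (not the fix we applied; the container name ctn1, interface eth0 and gateway IP 10.60.6.1 are made up for illustration), forcing the container to re-ARP the gateway would stop an ongoing flood, since per the description above the gateway's ARP reply carries the virtual MAC and re-populates the OVS fdb:

# Send a single ARP request to the gateway from inside the container
# (requires iputils arping to be installed in the container image).
$ docker exec ctn1 arping -c 1 -I eth0 10.60.6.1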

This may cause severe problems (e.g. packet drops) if you have QoS settings on the OVS interfaces that containers are using.

References

  1. Cisco Doc: Unicast Flooding
  2. Ctrip Network Architecture Evolution in the Cloud Computing Era
  3. Analysis of ARP aging time principle implemented by Linux
  4. Cisco Doc: ip arp timeout settings