*** Note: This is part of a series of posts outlining networking topics I had to research to more thoroughly understand VXLAN. I discussed IS-IS in the previous post. ***
One of the first things I learned about network traffic was how switches make forwarding decisions. A frame sent to the broadcast address, FFFF.FFFF.FFFF, is flooded out every interface in the frame’s VLAN except the one it arrived on. Frames whose destination MAC address is not recorded in the MAC table are treated similarly and forwarded out all other interfaces participating in the original frame’s VLAN. I vaguely remember multicast being discussed but mostly for layer 3 traffic, specifically for IPv6. Taken together, these three types of traffic make up BUM: broadcast, unknown unicast and multicast. Because BUM traffic naturally uses more bandwidth than unicast traffic, efficiency is a major concern when recreating BUM traffic in a VXLAN environment.
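To make the three BUM categories concrete, here is a minimal Python sketch of how a switch might classify a frame by its destination MAC address. The function names and the sample MAC table are hypothetical and only for illustration. The one rule it relies on is standard Ethernet behavior: the least significant bit of the first octet marks a group (multicast) address.

```python
BROADCAST = "ffffffffffff"

def normalize(mac: str) -> str:
    """Strip common separators so FFFF.FFFF.FFFF and ff:ff:ff:ff:ff:ff compare equal."""
    return mac.lower().replace(".", "").replace(":", "").replace("-", "")

def classify(dst_mac: str, mac_table: set) -> str:
    """Return which kind of traffic a frame is, based on its destination MAC."""
    mac = normalize(dst_mac)
    if mac == BROADCAST:
        return "broadcast"
    # The low-order bit of the first octet is the group (multicast) bit.
    if int(mac[:2], 16) & 0x01:
        return "multicast"
    if mac not in mac_table:
        return "unknown unicast"
    return "known unicast"

# Hypothetical MAC table holding one learned address.
mac_table = {normalize("0050.56aa.0001")}

for dst in ("FFFF.FFFF.FFFF", "0100.5e01.0101", "0050.56aa.0002", "0050.56aa.0001"):
    print(dst, "->", classify(dst, mac_table))
```

Only the last frame in the loop would be forwarded out a single interface; the first three are the B, M and U of BUM.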
Before I discuss the different designs for replicating BUM traffic in a VXLAN fabric, there’s some vocabulary that I have to mention. In traditional networking, VLANs are layer 2 domains and VRFs are layer 3 domains. In VXLAN, confusingly, both are identified by VXLAN network identifiers (VNIs). L2VNIs are basically the equivalent of VLANs and L3VNIs are basically the equivalent of VRFs. Both are 24-bit values, so they range from 0 to roughly 16.7 million (16,777,215), and both are configured on individual switches. The vast majority of configuration examples that I’ve encountered assign the L2VNIs numbers smaller than the L3VNIs, but I suspect that’s a best practice more than a hard rule. For the most part, I’ll be discussing L2VNIs.
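The 16 million figure comes from the 24-bit VNI field in the VXLAN header. As a rough sketch, assuming the 8-byte header layout from RFC 7348 (8 flag bits, 24 reserved bits, the 24-bit VNI, 8 reserved bits), the field can be packed and read back like this. The function names and the VNI value 10100 are made up for the example.

```python
import struct

VNI_BITS = 24
MAX_VNI = (1 << VNI_BITS) - 1  # 16,777,215

def pack_vxlan_header(vni: int) -> bytes:
    """Build an 8-byte VXLAN header carrying the given VNI."""
    if not 0 <= vni <= MAX_VNI:
        raise ValueError(f"VNI must fit in {VNI_BITS} bits")
    flags = 0x08  # the I flag, meaning the VNI field is valid
    # First 32-bit word: flags plus 24 reserved bits.
    # Second 32-bit word: 24-bit VNI plus 8 reserved bits.
    return struct.pack("!II", flags << 24, vni << 8)

def unpack_vni(header: bytes) -> int:
    """Read the VNI back out of an 8-byte VXLAN header."""
    _, second_word = struct.unpack("!II", header)
    return second_word >> 8

header = pack_vxlan_header(10100)
print(header.hex(), unpack_vni(header))
```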
So, what are the options to replicate BUM traffic? There are two that I’ve studied and I’ll begin with the more straightforward one. It’s called ingress replication or head end replication (IR/HER). The two terms are interchangeable. The idea is simple: create a copy of the packet for every device that needs to receive the BUM traffic and forward those copies using unicast. The main benefit of this option is the simple configuration. It requires one line on all the access switches to turn on and that’s it. There are no control plane mechanisms, messages or protocols required for IR/HER. The downside, as you can guess, is its inefficiency. If there are N endpoints that need to receive the BUM traffic, the replicating switch needs to make N unicast packets, all with mostly the same information. In fact, only the destination address changes from one packet to the next. It’s a waste of bandwidth and computation.
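The replication step can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular switch implements it; the `VxlanPacket` class, the function name and all addresses are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VxlanPacket:
    outer_src: str      # IP of the replicating switch
    outer_dst: str      # IP of one remote switch
    vni: int
    inner_frame: bytes  # the original BUM frame, unchanged

def ingress_replicate(local_ip, remote_ips, vni, frame):
    """Create one unicast copy of the BUM frame per remote switch."""
    return [VxlanPacket(local_ip, remote, vni, frame) for remote in remote_ips]

# Hypothetical fabric: one local switch and three remote switches.
copies = ingress_replicate(
    "10.0.0.1", ["10.0.0.2", "10.0.0.3", "10.0.0.4"], 10100, b"arp-request"
)
for packet in copies:
    print(packet.outer_dst, packet.vni, packet.inner_frame)
```

Every copy carries the same inner frame and the same VNI; the only field that differs is `outer_dst`, which is exactly why the cost grows linearly with the number of receivers.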
The second option, more interesting and more complicated, is to use multicast. Here’s where I had to do some studying. I knew that multicast was the class D address range of 224.0.0.0-239.255.255.255, but not much beyond that. As I suspected, there was a lot to unpack.
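For reference, the class D block is 224.0.0.0/4, and Python’s standard `ipaddress` module can confirm whether a given address falls inside it. The sample addresses below are arbitrary.

```python
import ipaddress

# The IPv4 multicast (class D) block.
CLASS_D = ipaddress.ip_network("224.0.0.0/4")

print(CLASS_D[0], "-", CLASS_D[-1])  # first and last address in the block

for text in ("224.0.0.5", "239.1.1.1", "192.168.1.1"):
    address = ipaddress.ip_address(text)
    print(text, address in CLASS_D, address.is_multicast)
```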
Multicast is a one-to-many solution. It sits somewhere between broadcast and unicast in that a source is able to send traffic to a set of interested receivers within a domain (or across multiple domains, e.g. inter-AS multicast). Two protocols do most of the work in making this type of traffic possible: IGMP and PIM.
IGMP, or internet group management protocol, is used between a host device and a router. PIM, or protocol independent multicast, is the control protocol that multicast routers use to communicate with each other. There is a special type of router in a multicast environment called the rendezvous point (RP). As its name suggests, it acts as the site where multicast traffic from the source and requests from the receivers meet. When a source sends multicast traffic, that traffic is sent to the RP, and when a destination host signals its desire to join a particular multicast group, that message is sent towards the RP also. Let’s go through a high-level example of how multicast routing works.
In the figure, host C wants to receive traffic from multicast group 220.127.116.11 and sends an IGMP message towards router C. Router C is statically configured with the address of the RP, router B, and sends out a PIM join message. Specifically this is a (*,G) (star comma gee) message meaning router C has no idea what the source of the multicast group is. In this example the multicast group is 18.104.22.168 so the (*,G) is (*, 22.214.171.124.) Each unique multicast address will have a separate (*,G) entry in the multicast routing table just like each prefix has a separate entry in the unicast routing table. The RP receives the PIM join request and registers the interface through which it received the message to the 126.96.36.199 group. This is called the outgoing interface or OIF. Whenever the RP receives traffic for the 188.8.131.52 group in the future, it will forward that traffic through all the OIFs for that group.
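The RP’s bookkeeping at this stage can be modeled as a table keyed by group, where each (*,G) entry holds a set of outgoing interfaces. The Python sketch below is a simplified model under that assumption, not real PIM; the class name, interface names and the group 239.1.1.1 are placeholders chosen for the example.

```python
class RendezvousPoint:
    """Toy model of the (*,G) state an RP keeps."""

    def __init__(self):
        # group address -> set of outgoing interfaces (the OIF list)
        self.star_g = {}

    def receive_join(self, group, interface):
        """A (*,G) join arrived on `interface`: add it to that group's OIF list."""
        self.star_g.setdefault(group, set()).add(interface)

    def forward(self, group):
        """Return the interfaces that traffic for `group` is copied to."""
        return sorted(self.star_g.get(group, set()))

rp = RendezvousPoint()
rp.receive_join("239.1.1.1", "to-router-C")
print(rp.forward("239.1.1.1"))  # interfaces with interested receivers
print(rp.forward("239.2.2.2"))  # nobody joined this group
```

A second join for the same group on another interface simply adds to the set, so one incoming packet is copied once per OIF.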
Some time later, server A generates traffic destined for the 184.108.40.206 group. If this is the first time that the traffic is being generated, router A needs to send a PIM register message to the RP. The multicast payload is encapsulated in the PIM register message. Router A will continue sending these PIM register messages until it receives a PIM register-stop message from the RP indicating that router A has successfully registered as the source for that multicast group. Router B receives the PIM register, decapsulates it and records the interface through which it received the register message. This interface is now the incoming interface or IIF for that multicast group. The RP sends a PIM register-stop message towards router A and also forwards the multicast packet out its OIF towards router C.
When router C receives the multicast packet it checks whether it has any receivers that want traffic from that group and forwards the packet(s) out the appropriate interface(s). If this is the first multicast packet that it has received from this multicast group, it will do a lookup in its routing table for the source IP address of the multicast packet. If the best route to the source IP points out a different interface than the one on which the multicast packet arrived, router C sends a PIM join message out that better interface and sends a PIM prune message to the RP. The RP removes the interface towards router C from its OIF list for multicast group 220.127.116.11 and stops sending multicast traffic for group 18.104.22.168 to router C. Router A receives router C’s join message and will forward all multicast traffic on the link it shares with router C. The result is that multicast traffic will follow a more optimal route from server A to host C. This shorter path is called the shortest path tree (SPT) and the original route going through the RP is called the rendezvous point tree (RPT). The multicast route installed in the multicast routing table for router C is now an (S,G) (es comma gee) route instead of a (*,G) route because the source, server A’s IP address, is known.
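The switchover decision boils down to a unicast route lookup for the source, compared against the interface the packet arrived on. Here is a rough Python sketch of that check, assuming a plain longest-prefix-match routing table. The prefixes, interface names, source 10.1.1.10 and group 239.1.1.1 are all invented for the example, and the returned tuples stand in for real PIM messages.

```python
import ipaddress

def best_interface(routes, source):
    """Longest-prefix match of `source` against a {prefix: interface} table."""
    address = ipaddress.ip_address(source)
    matches = [
        (ipaddress.ip_network(prefix), interface)
        for prefix, interface in routes.items()
        if address in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    return max(matches, key=lambda match: match[0].prefixlen)[1]

def spt_switchover(routes, source, group, arrival_interface):
    """Return the messages a last-hop router would send after the first packet."""
    better = best_interface(routes, source)
    if better is None or better == arrival_interface:
        return []  # the path through the RP is already the best one
    return [
        ("join", (source, group), better),              # build the SPT
        ("prune", (source, group), arrival_interface),  # leave the RPT
    ]

# Hypothetical routing table on router C.
routes = {"10.0.0.0/8": "to-router-B", "10.1.1.0/24": "to-router-A"}
print(spt_switchover(routes, "10.1.1.10", "239.1.1.1", "to-router-B"))
```

If the packet had arrived on `to-router-A` in the first place, the function would return an empty list and router C would stay on the tree it already has.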
What I just described is a multicast scheme called PIM sparse mode (PIM-SM). Multicast traffic always starts out on a possibly sub-optimal route through the RP and moves to a more optimal route later. Any router having interested multicast receivers will automatically signal this shortest path tree as soon as it receives the first multicast packet. This allows for optimal routing between each receiver and the source, but in networks where all receivers can also become multicast sources, there can be an explosion of (S,G) multicast routes.
This is exactly what happens in a VXLAN fabric. Recall that this multicast discussion was originally to describe how VXLAN handles BUM traffic. The answer is multicast. If each L2VNI is assigned a multicast group, then BUM traffic can be encapsulated into a multicast packet and through multicast reach only the other hosts within the same L2VNI. That’s great but any host within the L2VNI can originate BUM traffic, which makes the access switch it is attached to a source for the multicast group. With thousands of hosts spread across many access switches and multiple L2VNIs, the access switches within the fabric will all potentially have to hold a large number of (S,G) multicast routes.
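As a small illustration of the L2VNI-to-group idea, the Python sketch below wraps a BUM frame in a single packet addressed to the group assigned to its VNI. The mapping table, VNI numbers, group addresses and class name are all hypothetical; a real fabric would define this mapping in switch configuration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UnderlayPacket:
    outer_src: str      # IP of the encapsulating access switch
    outer_dst: str      # multicast group assigned to the L2VNI
    vni: int
    inner_frame: bytes

# Hypothetical mapping of L2VNIs to multicast groups.
VNI_TO_GROUP = {10100: "239.1.1.100", 10200: "239.1.1.200"}

def encapsulate_bum(local_ip, vni, frame):
    """Wrap a BUM frame in one packet addressed to the VNI's multicast group."""
    return UnderlayPacket(local_ip, VNI_TO_GROUP[vni], vni, frame)

packet = encapsulate_bum("10.0.0.1", 10100, b"arp-request")
print(packet.outer_src, "->", packet.outer_dst, "vni", packet.vni)
```

The sending switch emits one packet no matter how many switches have joined the group, and the outer source address is what the rest of the network would record as S in an (S,G) route.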
To reduce the amount of state information that the VXLAN fabric needs to support, a variant of PIM, bidirectional PIM or PIM-BiDir, is used instead of PIM-SM. Essentially, PIM-BiDir prohibits SPTs from being signaled and therefore the only multicast routes installed are (*,G) routes. The great thing about configuring PIM is the similarity between the configuration of the different modes. To change from sparse mode PIM to bidirectional PIM, just add the bidir keyword at the end of the configuration.
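A back-of-the-envelope count shows why this matters. The Python sketch below assumes a worst case in which a router ends up holding one (*,G) entry per group plus one (S,G) entry for every source of that group under PIM-SM, and only the (*,G) entries under PIM-BiDir. The group addresses and source counts are invented for the example.

```python
def mroute_count(sources_per_group, bidir):
    """Worst-case number of multicast routes a router holds.

    sources_per_group maps a group address to how many sources send to it.
    """
    if bidir:
        # PIM-BiDir: a single (*,G) entry per group, never any (S,G).
        return len(sources_per_group)
    # PIM-SM: the (*,G) entry plus one (S,G) entry per source.
    return sum(1 + sources for sources in sources_per_group.values())

# Hypothetical fabric: 4 groups, each with 50 switches that can act as sources.
fabric = {f"239.1.1.{n}": 50 for n in range(1, 5)}
print("PIM-SM:", mroute_count(fabric, bidir=False))
print("PIM-BiDir:", mroute_count(fabric, bidir=True))
```

With sparse mode the total grows with groups times sources; with bidirectional PIM it grows only with the number of groups.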
In summary there are two ways that VXLAN transports BUM traffic: IR/HER and multicast routing. The latter requires more configuration in the underlay but is superior for efficiency and scaling; instead of sending individual unicast packets to all interested receivers, a single replicated multicast packet is forwarded out the appropriate interfaces.
The topology I used was purposefully chosen to illustrate the RPT and SPT process. I also only included one receiver and one multicast group. In reality, there can be multiple RPs that are each responsible for a range of multicast addresses. Or there can be redundant RPs servicing multiple domains that have what’s called a Multicast Source Discovery Protocol (MSDP) peering between each other. Multicast, like any networking protocol, contains many details. VXLAN only required me to research up to PIM-BiDir and I only dived more deeply when I started studying MPLS networks. But that’s an entire other series of posts.