Mukesh Chanderia

ACI Multi-Pod

Updated: Jul 20

Why Cisco ACI Multi-Pod?

  • Deployment of active-active disaster recovery solution for business continuity.

  • Data center deployed in multiple server rooms.

  • Infrastructure for a virtualization solution that supports live VM mobility across Layer 3 domains, etc.

  • Use a single administration domain.

  • Cisco ACI Multi-Pod is a single Cisco APIC cluster/single administration domain that interconnects portions of the fabric, referred to as pods.

  • Cisco ACI building blocks (tenants, VRFs, bridge domains, EPGs, and so on) are deployed and usable across all pods.

  • Each pod has its own leaf-and-spine two-tier architecture and isolated control plane protocols (IS-IS, COOP, MP-BGP).

  • Connectivity between pods is established through the spine switches in different pods and the Inter-Pod Network (IPN).








The IPN connects different Cisco ACI pods.

The IPN must support several specific functionalities to provide this connectivity, such as:

  • Multicast support (PIM Bidir with at least a /15 subnet mask), which is needed to handle Layer 2 broadcast, unknown unicast, and multicast (BUM) traffic.

  • Dynamic Host Configuration Protocol (DHCP) relay support.

  • Open Shortest Path First (OSPF) support between the spine switches and the IPN routers.

  • Increased maximum transmission unit (MTU) support to handle the Virtual Extensible LAN (VXLAN) encapsulated traffic.

  • Quality of service (QoS) considerations for consistent QoS policy deployment across pods.

  • Routed subinterface support, since the use of subinterfaces on the IPN devices is mandatory for the connections toward the spine switches (traffic originated by the spine switch interfaces is always tagged with an 802.1Q VLAN 4 value).

  • The VLAN 4 subinterface is hardcoded for zero-touch provisioning so that new spine switches in a new pod can send DHCP discovery messages in this fixed VLAN even before Cisco APIC can push any policies to them. This VLAN is local to the link between the spine switches and the IPN devices; it does not have to be extended across the IPN.

  • LLDP must be enabled on the IPN device.


The IPN represents an extension of the Cisco ACI fabric infrastructure, but it is configured and managed separately from Cisco ACI.






OSPF Support


OSPFv2 is the only routing protocol (in addition to static routing) supported for connectivity between the IPN and the spine switches.


It is used to advertise the TEP address range to other pods.


As long as the IPN devices and the spine switches are OSPF neighbors, any routing protocol can be used to carry the TEP information between the IPN devices.
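As a quick sanity check (a sketch rather than an exhaustive procedure; the VRF name fabric-mpod is taken from the IPN example later in this post), you can verify the OSPF adjacency and the learned routes on both sides:

spine1# show ip ospf neighbors vrf overlay-1
spine1# show ip route vrf overlay-1

IPN1# show ip ospf neighbors vrf fabric-mpod
IPN1# show ip route vrf fabric-mpod

On the spine, the remote pod TEP pool should appear as an OSPF route; on the IPN device, the TEP pool of each pod should be learned via OSPF from the spines.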




QoS Considerations



Since the IPN is not under Cisco APIC management and may modify the 802.1p (class of service [CoS]) priority settings, additional steps are required to guarantee QoS priority in a Cisco ACI Multi-Pod topology.


When a packet leaves the spine switch in one pod toward the IPN, the outer header of the packet carries a CoS value to preserve the prioritization of the different traffic types across pods.


However, IPN devices may not preserve the CoS setting in 802.1p frames during transport. Therefore, when the frame reaches the other pod, it may lack the CoS information that was assigned at the source in the first pod.


To preserve the 802.1p information across the Cisco ACI Multi-Pod topology, you need to configure on the APIC a differentiated services code point (DSCP) policy that preserves the QoS priority settings by mapping CoS levels to DSCP levels for the different traffic types.



You must also ensure that the IPN devices do not overwrite the DSCP markings, so that the IPN does not change the configured levels.


With a DSCP policy enabled, Cisco APIC converts the CoS level in the outer 802.1p header to a DSCP level in the outer IP header and frames leave the pod according to the configured mappings.


When they reach the second pod, the mapped DSCP level is mapped back to the original CoS level, so the QoS priority settings are preserved.


The following examples show a CoS-to-DSCP mapping that is configured on the APIC using the Tenant > infra > Policies > Protocol > DSCP class-cos translation policy for Layer 3 traffic, which modifies the default behavior.



Hence, when traffic is received on the spine nodes of a remote pod, it is reassigned to its proper class of service before being injected into the pod based on the DSCP value in the outer IP header of inter-pod VXLAN traffic.


The DSCP class-cos translation policy in the example above marks the Policy Plane Traffic (that is, communication between APIC nodes that are deployed in separate pods) as Expedited Forwarding (EF), whereas Control Plane Traffic (that is, OSPF and MP-BGP packets) is marked as CS4.


Hence, you need to configure the IPN devices to prioritize those two types of traffic to ensure that the policy and control planes remain stable, even in scenarios where a large amount of east-west user traffic crosses the pods.
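As an illustration only (the class-map names, qos-group values, and queuing details are assumptions and depend on the IPN platform and its existing QoS policy), classification on a Nexus 9000 IPN device based on the DSCP values described above (EF for the policy plane, CS4 for the control plane) could look like this sketch:

# Classify APIC policy-plane traffic (DSCP 46 / EF)
class-map type qos match-any MPOD-POLICY-PLANE
  match dscp 46
# Classify OSPF/MP-BGP control-plane traffic (DSCP 32 / CS4)
class-map type qos match-any MPOD-CONTROL-PLANE
  match dscp 32

policy-map type qos MPOD-CLASSIFY
  class MPOD-POLICY-PLANE
    set qos-group 3
  class MPOD-CONTROL-PLANE
    set qos-group 2

interface Ethernet2/7
  service-policy type qos input MPOD-CLASSIFY

The qos-groups would then be mapped to priority or guaranteed-bandwidth queues in the egress queuing policy of your IPN design.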



IPN Control Plane





BUM Traffic Between Pods


Each bridge domain in the Cisco ACI fabric is associated with a separate multicast group (GIPo) to ensure granular delivery of multi-destination frames only to the endpoints that are part of a given bridge domain.


Hence, by default each bridge domain uses a GIPo from the 225.0.0.0/15 multicast range, which is configured during the initial APIC setup.


In a Cisco ACI Multi-Pod deployment, bridge domains can be extended across pods, so the same behavior must be achieved there as well. As indicated before, the IPN must support PIM Bidir to carry the BUM frames generated by an endpoint in a bridge domain.


These frames are encapsulated by the leaf node where the endpoint is connected and can then transit the IPN to reach remote endpoints in the same bridge domain.


For this functionality, the spine switches must perform two basic functions:

  • Forward received multicast frames toward the IPN devices to ensure they can be delivered to the remote pods.

  • Send IGMP joins toward the IPN every time a new bridge domain is activated in the local pod, so that the pod can receive BUM traffic for that bridge domain originated by an endpoint connected to a remote pod.

For each bridge domain, one spine node is elected as the authoritative device to perform both functions, using the IS-IS control plane between the spine switches.




The delivery of BUM traffic between pods follows this sequence (a verification sketch follows the list):

  1. EP1, belonging to BD1, originates a BUM frame.

  2. The frame is encapsulated by the local leaf node and destined to the multicast group GIPo1 (225.1.1.128) associated to BD1. As a consequence, it is sent along one of the multi-destination trees that are assigned to BD1 and reaches all the local spine and leaf nodes where BD1 has been instantiated.

  3. Spine 1 is responsible for forwarding BUM traffic for BD1 toward the IPN devices, using the specific link connected to IPN1.

  4. The IPN device receives the traffic and performs multicast replication toward all the pods from which it received an IGMP Join for GIPo1. This process ensures that BUM traffic is sent only to pods where BD1 is active.

  5. The spine that sent the IGMP Join toward the IPN devices receives the multicast traffic and forwards it inside the local pod along one of the multi-destination trees that are associated to BD1. All leaf switches where BD1 has been instantiated receive the frame.

  6. The leaf where EP2 is connected also receives the stream, de-encapsulates the packet, and forwards it to EP2.
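To verify this behavior on the IPN devices (a sketch; the GIPo 225.1.1.128 and the VRF name fabric-mpod are taken from the examples in this post), check the IGMP membership received from the spines, the resulting multicast routing state, and the RP in use:

IPN1# show ip igmp groups vrf fabric-mpod
IPN1# show ip mroute 225.1.1.128 vrf fabric-mpod
IPN1# show ip pim rp vrf fabric-mpod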


An important design consideration should be made for the deployment of the rendezvous point (RP) in the IPN.



The role of the RP is important in a PIM Bidir deployment, as all multicast traffic in Bidir groups vectors toward the Bidir RPs, branching off as necessary as it flows upstream and/or downstream.


The implication is that all the BUM traffic exchanged across pods would be sent through the same IPN device acting as RP for the 225.0.0.0/15 default range used to assign multicast groups to each defined bridge domain.


Spine Switches and IPN Connectivity


If you have multiple spine switches in each pod, it is not mandatory to connect every spine that is deployed in a pod to the IPN devices, although a minimum of two should be connected for redundancy.



If spine switches belonging to separate pods are directly connected, there will be implications for the BUM traffic that needs to be sent to remote pods.


In this situation, the directly connected spine switches in separate pods cannot be elected as designated for a given bridge domain, which makes it impossible to forward BUM traffic across pods.


Therefore, you should always deploy at least one Layer 3 IPN device (or a pair for redundancy) between pods.


In addition, you have to always ensure that there is a physical path interconnecting all IPN devices to avoid issues with the BUM traffic across pods.





The following example shows an IPN configuration for a Cisco Nexus 9000 Series Switch operating in NX-OS mode, which interconnects two pods (its interface Ethernet2/7 is connected to Pod 1 – Spine 1 and interface Ethernet2/9 to Pod 2 – Spine 1).






feature dhcp
feature pim
 
# Enable Jumbo frames
policy-map type network-qos jumbo
  class type network-qos class-default
    mtu 9216
 
system qos
  service-policy type network-qos jumbo
 
service dhcp
ip dhcp relay
ip pim ssm range 232.0.0.0/8
 
# Create a new VRF for Multipod.
vrf context fabric-mpod
  ip pim rp-address 12.1.1.1 group-list 225.0.0.0/8 bidir
  ip pim rp-address 12.1.1.1 group-list 239.255.255.240/28 bidir
  ip pim ssm range 232.0.0.0/8
 
interface Ethernet2/7
  no switchport
  mtu 9150
  no shutdown
 
interface Ethernet2/7.4
  description pod1-spine1
  mtu 9150
  encapsulation dot1q 4
  vrf member fabric-mpod
  ip address 201.1.2.2/30
  ip router ospf a1 area 0.0.0.0
  ip pim sparse-mode
  ip dhcp relay address 10.0.0.1
  ip dhcp relay address 10.0.0.2
  ip dhcp relay address 10.0.0.3
  no shutdown
 
interface Ethernet2/9
  no switchport
  mtu 9150
  no shutdown
 
interface Ethernet2/9.4
  description to pod2-spine1
  mtu 9150
  encapsulation dot1q 4
  vrf member fabric-mpod
  ip address 203.1.2.2/30
  ip router ospf a1 area 0.0.0.0
  ip pim sparse-mode
  ip dhcp relay address 10.0.0.1
  ip dhcp relay address 10.0.0.2
  ip dhcp relay address 10.0.0.3
  no shutdown
 
interface loopback29
  vrf member fabric-mpod
  ip address 12.1.1.1/32
 
router ospf a1
  vrf fabric-mpod
    router-id 29.29.29.29

As a best practice, the Multi-Pod traffic across the IPN is isolated in a dedicated VRF instance. In addition, the spine interfaces are connected to the IPN devices through point-to-point routed subinterfaces using VLAN 4.


The use of subinterfaces on the IPN devices is only mandatory for the connections toward the Cisco ACI spine switches.


Multi-Pod Provisioning and Packet Flow Between Pods


In a Cisco ACI Multi-Pod deployment, the fabric must be provisioned before it can forward endpoint traffic.


The Cisco ACI Multi-Pod fabric applies different control and data plane functionalities for connecting endpoints deployed across different pods.


Once Cisco ACI Multi-Pod is successfully provisioned, the information about all the endpoints stored in the COOP database on the spines of each pod is exchanged via BGP EVPN through the IPN.


External route information from L3Outs is exchanged via BGP VPNv4/VPNv6. These protocols form the control plane between pods.


Once the forwarding information, such as endpoints and L3Out routes, is exchanged via the control plane, data plane traffic is forwarded across pods through the IPN with TEP and VXLAN encapsulation, just as within a single pod.


Not only unicast traffic but also flooded (BUM) traffic can be forwarded seamlessly across pods.


In the Cisco APIC user interface, you can use a wizard to add a pod to the Multi-Pod deployments, which helps you provision the necessary L3Outs on the spine switches connected to the IPN, MTU on all spine-to-IPN interfaces, OSPF configuration towards the IPN, Anycast TEP IP address, and so on. You can invoke this wizard using Fabric > Inventory > Quick Start > Add Pod and choose Add Pod from the work plane.




Initially, the first pod (also known as the 'seed' pod) and the second pod should be physically connected to the IPN devices. Before the Cisco ACI Multi-Pod provisioning process can start, you should set up the Cisco APIC and the IPN using the following steps:


  1. Configure access policies: Configure access policies for all the interfaces on the spine switches used to connect to the IPN.

  2. Define these policies as spine access policies.

  3. Use these policies to associate an Attached Entity Profile (AEP) with a Layer 3 domain that uses VLAN 4 (as a requirement) for the subinterface encapsulation.

  4. Define these subinterfaces in the same way as normal leaf access ports. The subinterfaces are used by the infra L3Out interface that you define.

  5. Define the Multi-Pod environment: For the Cisco ACI Multi-Pod setup, you should define the TEP address for the spine switches facing each other across the IPN. This IP address is used as an anycast address shared by all spine switches in a pod. You should also define the Layer 3 interfaces between the spine switches and the IPN.

  6. Configure the IPN.

  7. Configure the IPN devices with IP addresses on the interfaces facing the spine switches, and enable the OSPF routing protocol, MTU support, DHCP-relay, and PIM Bidir.

  8. The IPN devices create OSPF adjacencies with the spine switches and exchange the routes of the underlying IS-IS network that is part of VRF overlay-1.

  9. The configuration of the IPN defines the DHCP relay, which is critical because the DHCP frames forwarded across the IPN must reach the primary APIC in the first pod to get an address assignment from the TEP pool. Without DHCP relay in the IPN, zero-touch provisioning will not occur for Cisco ACI nodes deployed in the second pod.

  10. Establish the interface access policies for the second pod:

  11. If you do not establish the access policies for the second pod, then the second pod cannot complete the process of joining the fabric. You can add the device to the fabric, but it does not complete the discovery process.

  12. Thus, the spine switch in the second pod has no way to talk to the original pod, since the OSPF adjacency cannot be established due to VLAN 4 requirement, and the OSPF interface profile and the external Layer 3 definition do not exist.

  13. You can reuse the access policies of the first pod as long as the spine interfaces you are using on both pods are the same. Hence, if the spine interfaces in both pods are the same and the ports in all the switches also are the same, then the only action you need to take is to add the spine switches to the switch profile that you define.



  1. Cisco APIC node 1 pushes the infra L3Out policies to the spine switches in Pod 1. The spine L3Out policy provisions the IPN-connected interfaces on spine switches with OSPF.

  2. At this point, the IPN has learned the Pod 1 TEP prefixes via OSPF from the spines, and the Pod 1 spine switches have learned the IP prefixes of the IPN interfaces facing the new spines in Pod 2.

  3. The first spine in Pod 2 boots up and sends DHCPDISCOVER to every connected interface, including the ones toward the IPN devices.

  4. The IPN device receiving the DHCPDISCOVER has been configured to relay that message to the Cisco APIC nodes in Pod 1. This is possible because the IPN devices learned the Pod 1 TEP prefixes via OSPF from the spines.

  5. The Cisco APIC sends a DHCPOFFER, which includes the initial parameters listed in the next three items. With this information, the new spine downloads a bootstrap for the infra L3Out configuration from the APIC, configures OSPF and BGP towards the IPN, and sets itself up as a DHCP relay for the new switch nodes in the new pod, so that their DHCP discovery can be relayed to the APIC nodes in Pod 1 as well.

  6. Subinterface IP address of the new spine facing the IPN.

  7. A static route to the Cisco APIC that sent DHCPOFFER, which points to the IPN IP address that relayed the DHCP messages.

  8. Bootstrap location for the infra L3Out configuration of the new spine.

  9. All other nodes in Pod 2 come up in the same way as in a single pod. The only difference is that the DHCP discovery is relayed through the IPN.

  10. The Cisco APIC controller in Pod 2 is discovered as usual.

  11. The Cisco APIC controller in Pod 2 joins the APIC cluster.
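Once discovery completes, the Pod 2 nodes should show up in the fabric membership. A quick check from the APIC CLI (illustrative output only; node IDs, names, serial numbers, and TEP addresses below are hypothetical):

apic1# acidiag fnvread
  ID   Pod ID    Name      Serial Number   IP Address        Role    State
 101        1    leaf101   FDO12345678     10.0.72.64/32     leaf    active
1001        1    spine1    FDO12345679     10.0.72.65/32     spine   active
 201        2    leaf201   FDO12345680     10.1.72.64/32     leaf    active
2001        2    spine3    FDO12345681     10.1.72.65/32     spine   active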



Inter-Pods MP-BGP Control Plane


In a single ACI fabric, information about all the endpoints connected to the leaf nodes is stored in the COOP database, which is available in the spine nodes.


Every time an endpoint is learned as a local endpoint on a leaf node, the leaf originates a COOP control plane message to communicate the endpoint information (IPv4/IPv6 and MAC addresses) to a spine node.


The COOP protocol is also used by the spine switches to synchronize this information between them.


The COOP database information in each pod is shared via MP-BGP EVPN through IPN so that each pod knows which endpoint is learned in which pod. MP-BGP EVPN runs directly between spine switches in each pod.


The IPN devices do not participate in these BGP sessions; they simply provide the TEP reachability needed to establish the BGP sessions between the spine switches.


BGP in each pod runs in the same BGP AS. This AS number is configured via the BGP Route Reflector policy, regardless of whether Cisco ACI Multi-Pod is used.
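The fabric AS number can be checked from the APIC with a query such as the following (a sketch; the output is trimmed and the AS value shown is illustrative):

apic1# moquery -c bgpAsP
# bgp.AsP
asn : 65000
dn  : uni/fabric/bgpInstP-default/as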







The inter-pod MP-BGP EVPN control plane functionality follows this sequence:

  1. When an endpoint EP1 is learned on Leaf 1 in Pod 1, Leaf 1 sends a COOP message to one of the spine switches.

  2. The receiving spine adds the endpoint information to the COOP database and synchronizes the information to all the other local spine switches. EP1 is associated to the TEP address of Leaf 1.

  3. The endpoint information in the COOP database in Pod 1 is shared with other pods via MP-BGP EVPN.

  4. Once the spine in Pod 2 learns endpoint information via MP-BGP EVPN, it adds the information to the COOP database and synchronizes it to all the other local spine nodes.

  5. When the MP-BGP entries from Pod 1 are translated into COOP entries on the Pod 2 spines, those MP-BGP entries have the Pod 1 dataplane TEP (DP-TEP) as the next hop.

  6. EP1 is now associated to an Anycast TEP address (Proxy A) that represents Pod 1 instead of Leaf 1 TEP.

  7. This behavior provides robust control plane isolation across pods: there is no need to send new control plane updates toward Pod 2 even if EP1 moves many times across leaf nodes in Pod 1, since the entry continues to point to the Proxy A next-hop address.


Since the spine nodes in different pods are part of the same BGP Autonomous System, the peering between the spine nodes connected through the IPN can be performed in two ways:

  • Full mesh: Establishing a full mesh of MP-BGP (iBGP) sessions between the spine switches of different pods, which is the default behavior.

  • Route reflector: Defining route reflector nodes in each pod (recommended for resiliency), so that the spine nodes only peer with the remote route reflector nodes and a full mesh of MP-BGP sessions is established only between the route reflectors.

  • These route reflectors are called external route reflectors, as opposed to the internal route reflectors used between spine and leaf switches. They are configured under Tenants > infra > Policies > Protocol > Fabric Ext Connection Policies > Fabric Ext Connection Policy default.
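Whichever peering model is used, the spine-to-spine EVPN sessions can be checked from the spine CLI (the remote peer addresses depend on your deployment); the same command is also used in the troubleshooting section later in this post:

spine1# show bgp l2vpn evpn summary vrf overlay-1
spine1# show bgp sessions vrf overlay-1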




Inter-Pods VXLAN Data Plane


To establish IP connectivity between endpoints that are connected to separate pods, the first requirement is to be able to complete an Address Resolution Protocol (ARP) exchange.




With ARP flooding enabled in the bridge domain, the ARP request follows this sequence:

  1. EP1 generates an ARP request to determine EP2’s MAC address (assuming EP1 and EP2 are part of the same IP subnet).

  2. The local leaf (Leaf 1 in Pod 1) receives the packet, inspects the payload of the ARP packet and learns EP1 information (as a locally connected endpoint) and knows that the ARP request is for EP2’s IP address. Since EP2 has not been learned yet, the leaf does not find any information about EP2 in its local forwarding tables, such as the endpoint table. As a consequence, since ARP flooding is enabled, the leaf picks the FTAG associated to one of the multi-destination trees used for BUM traffic and encapsulates the packet into a multicast packet (the external destination address is the GIPo associated to the specific BD). While performing the encapsulation, the leaf also adds to the VXLAN header the pcTag information relative to the EPG that EP1 belongs to.

  3. The designated spine sends the encapsulated ARP request across the IPN, still applying the same GIPo multicast address as the destination of the VXLAN encapsulated packet. The IPN network must have built a proper state to allow for the replication of the traffic toward all the remote Pods where this specific bridge domain has been deployed. This replication is performed through multicast routing with PIM Bidir in IPN.

  4. One of the spine nodes in Pod 2 receives the packet (the specific spine that previously sent toward the IPN an IGMP Join for the multicast group associated to the bridge domain) and floods it along a local multi-destination tree. Notice also that the spine has learned EP1 information from an MP-BGP update received from the spine in Pod1.

  5. The leaf where EP2 is connected (Leaf 4 in Pod 2) receives the flooded ARP request, learns EP1 information (location and Class ID/pcTag) and forwards the packet to all the local interfaces part of the bridge domain.

  6. EP2 receives the ARP request and triggers its reply, allowing the fabric to discover it (EP2 is not a "silent host" anymore).



  1. EP2 generates a unicast ARP reply destined to EP1 MAC address.

  2. The local leaf (Leaf 4 in Pod 2) now has EP1's location information, so the frame is VXLAN encapsulated and sent to Leaf 1 in Pod 1. At the same time, the local leaf also discovers that EP2 is locally connected and informs the local spine nodes through COOP.

  3. The remote leaf node (Leaf 1 in Pod 1) receives the packet, de-encapsulates it, learns and programs in the local endpoint table EP2 location and Class ID information and forwards the packet to the interface where EP1 is connected. EP1 is hence able to receive the ARP reply.
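At this point both leaf switches should have learned the remote endpoint. A quick check on Leaf 1 in Pod 1 (a sketch; the MAC and IP addresses are hypothetical) should show EP2 as a remote endpoint reachable through a tunnel interface toward Pod 2:

leaf101# show endpoint mac 0050.56a8.b003
leaf101# show endpoint ip 192.168.10.20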


The Cisco ACI fabric is designed to handle the presence of silent hosts even without requiring the flooding of ARP requests inside the bridge domain.


When ARP flooding is disabled in the bridge domain, the leaf nodes are not allowed to flood the ARP request frame along the local multi-destination tree. To ensure that the ARP request can still be delivered to a remote endpoint so that it can be learned, a process named "ARP gleaning" has been implemented.


With ARP gleaning, if the spine does not have information on where the destination of the ARP request is connected, the fabric generates an ARP request that is originated from the pervasive gateway IP address of the bridge domain.


This ARP request is sent out of all the leaf node edge interfaces that are part of the bridge domain.

In the Cisco ACI Multi-Pod deployment, the ARP glean request is also sent to the remote pods across the IPN.


The ARP glean message is encapsulated into a multicast frame before being sent out toward the IPN. The specific multicast group 239.255.255.240 is used for sourcing ARP glean messages for all the bridge domains (instead of the specific GIPo normally used for BUM traffic in a bridge domain).
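Whether ARP flooding is enabled for a given bridge domain can be checked from the APIC (a sketch; the bridge domain name is reused from the troubleshooting example later in this post, and the output is trimmed to the relevant attributes):

apic1# moquery -c fvBD -f 'fv.BD.name=="bd-L2-2"'
name     : bd-L2-2
dn       : uni/tn-CiscoLive2020/BD-bd-L2-2
arpFlood : no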


What is the Dataplane TEP/External Proxy TEP (ETEP)?


This is an address owned by all spine switches in a pod that acts as the next hop for BGP EVPN paths.


apic1# moquery -c ipv4If -f 'ipv4.If.mode*"etep"' -x 'rsp-subtree=children'
# ipv4.If
id      : lo14
adminSt : enabled
dn      : topology/pod-1/node-1001/sys/ipv4/inst/dom-overlay-1/if-[lo14]
donorIf : unspecified
lcOwn   : local
modTs   : 2019-02-20T16:58:34.113-04:00
mode    : etep
rn      : if-[lo14]
# ipv4.Addr
addr    : 192.168.1.254/32


TSHOOT


To find out from which leaf a MAC address was learned, look up the COOP record for the MAC on a spine and then resolve the tunnel next-hop address to a node:


spine1# show coop internal info repo ep | grep -B 8 -A 35 00:50:56:A8:B0:03
------------------------------------------
**omitted
EP bd vnid : 15761417
EP mac     : 00:50:56:A8:B0:03
**omitted
Tunnel nh  : 10.0.72.67
**omitted


a-apic1# moquery -c ipv4Addr -f 'ipv4.Addr.addr=="10.0.72.67"'
Total Objects shown: 1
# ipv4.Addr
addr : 10.0.72.67/32
dn   : topology/pod-1/node-101/sys/ipv4/inst/dom-overlay-1/if-[lo0]/addr-[10.0.72.67/32]
**omitted


If there is Layer 2 unicast traffic, how does the remote pod learn about the EP?


Step 1 : Local Pod Spine installs COOP record


show coop internal info repo ep | grep -B 8 -A 35 <mac address>


a-spine1# show coop internal info repo ep | grep -B 8 -A 35 00:50:56:A8:B0:03
------------------------------------------
**omitted
EP bd vnid : 15761417
EP mac     : 00:50:56:A8:B0:03
**omitted
Tunnel nh  : 10.0.72.67
**omitted


Step 2 : Local Pod Spine Exports into BGP EVPN


show bgp l2vpn evpn <mac address> vrf overlay-1


a-spine1# show bgp l2vpn evpn 00:50:56:A8:B0:03 vrf overlay-1
Route Distinguisher: 1:16777199    (L2VNI 1)
BGP routing table entry for [2]:[0]:[15761417]:[48]:[0050.56a8.b003]:[0]:[0.0.0.0]/216, **omitted
Paths: (1 available, best #1)
Flags: (0x00010a 00000000) on xmit-list, is not in rib/evpn
Multipath: eBGP iBGP

  Advertised path-id 1
  Path type: local 0x4000008c 0x0 ref 0, path is valid, is best path
  AS-Path: NONE, path locally originated
    0.0.0.0 (metric 0) from 0.0.0.0 (192.168.1.101)
      Origin IGP, MED not set, localpref 100, weight 32768
      Received label 15761417
      Extcommunity:
          RT:5:16

  Path-id 1 advertised to peers:
    192.168.2.101        192.168.2.102


Step 3 : Remote Pod Spine Receives through EVPN


show bgp l2vpn evpn <mac address> vrf overlay-1


spine3# show bgp l2vpn evpn 00:50:56:A8:B0:03 vrf overlay-1
Route Distinguisher: 1:16777199
BGP routing table entry for [2]:[0]:[15335345]:[48]:[0050.56a8.b003]:[0]:[0.0.0.0]/216, **omitted
Paths: (2 available, best #1)
Flags: (0x000202 00000000) on xmit-list, is not in rib/evpn, is locked
Multipath: eBGP iBGP

  Advertised path-id 1
  Path type: internal 0x40000018 0x2040 ref 1, path is valid, is best path
  AS-Path: NONE, path sourced internal to AS
    192.168.1.254 (metric 3) from 192.168.1.101 (192.168.1.101)   <- 192.168.1.101 is the BGP address of Spine 1; 192.168.1.254 is the dataplane TEP/ETEP of Pod 1
      Origin IGP, MED not set, localpref 100, weight 0
      Received label 15335345
      Received path-id 1
      Extcommunity:
          RT:5:16
          ENCAP:8


Step 4 : Remote Pod Spine Imports into COOP


show coop internal info repo ep | grep -B 8 -A 35 <mac address>


Scenario : EPs cannot communicate in a Layer 2 BD


Step 1 : Does the local leaf know about the remote EP?

leaf101# show endpoint mac 8c60.4f02.88fc <- No output


Step 2 : Does BD flood or proxy unknown unicast?


apic1# moquery -c fvBD -f 'fv.BD.name=="bd-L2-2"'
name           : bd-L2-2
dn             : uni/tn-CiscoLive2020/BD-bd-L2-2
unkMacUcastAct : proxy


Step 3 : Does Local Pod Spine have the EP?


spine1# moquery -c coopEpRec -f 'coop.EpRec.mac=="8c60.4f02.88fc"'


spine1# show bgp l2vpn evpn 8c60.4f02.88fc vrf overlay-1


Step 4 : Does Remote Pod Spine have the EP? --> Yes

spine3# moquery -c coopEpRec -f 'coop.EpRec.mac=="8c60.4f02.88fc"'
# coop.EpRec
vnid : 15761417
mac  : 8C:60:4F:02:88:FC


spine3# show bgp l2vpn evpn 8c60.4f02.88fc vrf overlay-1    --> Remote Pod Spine exports it to EVPN
AS-Path: NONE, path locally originated
  0.0.0.0 (metric 0) from 0.0.0.0 (192.168.2.101)
    Origin IGP, MED not set, localpref 100, weight 32768
    Received label 15761417
    Extcommunity:
        RT:5:16


Step 5 : Is EVPN up between Pods?


spine1# show bgp l2vpn evpn summ vrf overlay-1    ---> BGP is down
Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
192.168.2.101   4 65000   57380   66362        0    0    0 00:00:21 Active
192.168.2.102   4 65000   57568   66357        0    0    0 00:00:22 Active



MULTICAST


System GIPo Usage



What does Multi-Pod use BUM traffic for?


• Unknown Unicast Flooding

• Multidestination Traffic (ARP, Multicast, BPDUs)

• Inter-pod Glean Messages

• EP Announce Messages


Spines act as multicast hosts (IGMP only) and join the fabric multicast groups (GIPos). The IPN devices receive the IGMP joins and send PIM joins toward the RP.




Only one spine in each pod joins each group


spine1# show ip igmp gipo joins
GIPo list as read from IGMP-IF group-linked list
------------------------------------------------
225.0.80.64    0.0.0.0    Join    Eth1/25.25    95    Enabled


The RPF path on all IPN devices must point to the same RP.


Phantom RP


• Bidir PIM doesn't support multiple RPs

• Phantom RP is the only means of RP redundancy

• It works by advertising varied prefix lengths for the RP subnet

• Failover is handled via the IGP

• The RP loopback must use the OSPF point-to-point network type

• The exact RP address must not exist on any device
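The following is a minimal phantom RP sketch for two IPN devices, assuming the fabric-mpod VRF, OSPF process a1, and RP address 12.1.1.1 from the earlier IPN example (all addressing is illustrative). The primary advertises the longer prefix, the backup the shorter one, and the RP address itself is not configured on any device:

# IPN1 - primary (advertises the longer /30 prefix)
interface loopback1
  vrf member fabric-mpod
  ip address 12.1.1.2/30
  ip ospf network point-to-point
  ip router ospf a1 area 0.0.0.0
  ip pim sparse-mode

# IPN2 - backup (advertises the shorter /29 prefix)
interface loopback1
  vrf member fabric-mpod
  ip address 12.1.1.3/29
  ip ospf network point-to-point
  ip router ospf a1 area 0.0.0.0
  ip pim sparse-mode

# Both IPN devices point to the phantom RP address
vrf context fabric-mpod
  ip pim rp-address 12.1.1.1 group-list 225.0.0.0/8 bidir

Since the IGP prefers the longest prefix, traffic toward 12.1.1.1 follows the /30 route to IPN1; if IPN1 fails, the /29 route via IPN2 takes over.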




Common Multicast Problems


Issue # 1: RP Address Exists on Multiple Routers



Issue # 2: RP Loopback not OSPF P2P Network


In OSPF, loopbacks are advertised as /32 host routes by default, regardless of their configured mask. This breaks the phantom RP mechanism, because the subnet covering the (unassigned) RP address is never advertised. Setting the loopback OSPF network type to point-to-point makes OSPF advertise the configured subnet instead.
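The fix and a quick verification (a sketch reusing the loopback and VRF names from the phantom RP example above):

interface loopback1
  ip ospf network point-to-point

IPN1# show ip ospf interface loopback1 vrf fabric-mpod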




Common Multi-Pod L3Out Problems


Issue 1: Asymmetric Routing with Active/Active Pods.




Issue 2: Stretched L3out VIP Failover




Please check the following:


• Ensure an SVI is used for the L3Out (there is no flooding for routed interfaces)

• Ensure the same VLAN encapsulation is used in each pod

• Ensure the IPN devices agree on the tree for the GIPo

• Ensure a GARP is sent by the external router

• Check whether the GARP is sent with CoS 6 (more on this later)



QoS







DHCP


Pod 2 Spine


POD2-spine# show ip interface ethernet 1/11.39  vrf overlay-1



Pod 2 Spine


show dhcp internal event-history traces | egrep " 12:25:31"



APIC 1 in Pod 1, which is acting as the DHCP server


/var/log/dme/log




Location of the bootstrap XML file for the spine.



  • Challenge:

    • APICs in Pod 1 use 10.0.0.1 and 10.0.0.2.

    • The APIC in Pod 2 uses 10.0.0.3, but in the IPN and in Pod 2 the route for the TEP pool 10.0.0.0/16 points toward Pod 1, so there is no reachability to 10.0.0.3.

  • Solution:

    • The leaf where the APIC is connected in Pod 2 sees it via LLDP.

    • That leaf in Pod 2 inserts a static route to 10.0.0.3 (connected in the infra VLAN).

    • The leaf in Pod 2 redistributes it into IS-IS.

    • The spine in Pod 2 redistributes 10.0.0.3 from IS-IS into OSPF.

    • The spine in Pod 1 redistributes 10.0.0.3 from OSPF into IS-IS.

    • APIC 1 and APIC 2 gain IP reachability to it using this route.

    • The cluster becomes fully fit.


bdsol-aci32-leaf5# show ip route 10.0.0.3 vrf overlay-1
IP Route Table for VRF "overlay-1"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

10.0.0.3/32, ubest/mbest: 1/0, attached
    *via 10.0.0.3, vlan2, [0/0], 3d23h, am
     via 10.0.0.3, vlan2, [1/0], 3d23h, static

bdsol-aci32-leaf5# show ipmgr internal trace | egrep 10.0.0.3/32
1083) 2016 Aug 8 12:36:47.516192 ipmgr_static_rt.c:ipmgr_process_static_rt_message:3199: Add Non-pervasive static route in vrf overlay-1 nhvrf overlay-1 10.0.0.3/32 10.0.0.3 Vlan2 0 1 0 BFD disabled
1202) 2016 Aug 8 12:36:47.404624 ipmgr_static_rt.c:ipmgr_process_objstore_hndl_ipv4_rt_nh_message:3603: Item 0: dn: sys/ipv4/inst/dom-overlay-1/rt-[10.0.0.3/32]/nh-[vlan2]-addr-[10.0.0.3/32]-vrf-[overlay-1], vrf overlay-1, prefix 0xa000003
1207) 2016 Aug 8 12:36:47.404514 ipmgr_static_rt.c:ipmgr_process_objstore_hndl_ipv4_rt_nh_message:3544: Object num 0 => DN: sys/ipv4/inst/dom-overlay-1/rt-[10.0.0.3/32]/nh-[vlan2]-addr-[10.0.0.3/32]-vrf-[overlay-1] (prop_chg_bmp = 0)


