Mukesh Chanderia

ACI Multi-Pod

Updated: Jul 20

Why Cisco ACI Multi-Pod?

  • Deployment of active-active disaster recovery solution for business continuity.

  • Data center deployed in multiple server rooms.

  • Infrastructure for a virtualization solution that supports live VM mobility across Layer 3 domains, etc.

  • Use a single administration domain.

  • Cisco ACI Multi-Pod is a single Cisco APIC cluster/single administration domain that interconnects portions of the fabric, referred to as pods.

  • Cisco ACI building blocks (tenants, VRFs, bridge domains, EPGs, and so on) are deployed and usable across all pods.

  • Each pod has its own leaf-and-spine two-tier architecture and isolated control plane protocols (IS-IS, COOP, MP-BGP).

  • Connectivity between pods is established through the spine switches in different pods and the Inter-Pod Network (IPN).








The IPN connects different Cisco ACI pods.

The IPN must support several specific functionalities to provide this connectivity, such as:

  • Multicast support (PIM Bidir with at least a /15 subnet mask), which is needed to handle Layer 2 broadcast, unknown unicast, and multicast (BUM) traffic.

  • Dynamic Host Configuration Protocol (DHCP) relay support.

  • Open Shortest Path First (OSPF) support between the spine switches and the IPN routers.

  • Increased maximum transmission unit (MTU) support to handle the Virtual Extensible LAN (VXLAN) encapsulated traffic.

  • Quality of service (QoS) considerations for consistent QoS policy deployment across pods.

  • Routed subinterface support, since the use of subinterfaces on the IPN devices is mandatory for the connections toward the spine switches (traffic originated by the spine switch interfaces is always tagged with an 802.1Q VLAN 4 value).

  • The VLAN 4 subinterface is hardcoded for zero-touch provisioning so that new spine switches in a new pod can send DHCP discovery messages in this fixed VLAN even before Cisco APIC can push any policies to them. This VLAN is local to the link between the spine switches and the IPN devices; it does not have to be extended across the IPN.

  • LLDP must be enabled on the IPN device.


The IPN represents an extension of the Cisco ACI fabric infrastructure, but it is configured and managed separately from Cisco ACI.






OSPF Support


OSPFv2 is the only routing protocol (in addition to static routing) supported for connectivity between the IPN and the spine switches.


It is used to advertise the TEP address range to other pods.


As long as the IPN devices and the spine switches are OSPF neighbors, any routing protocol can be used to carry the TEP information between the IPN devices.
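As a quick sanity check (a sketch rather than an exhaustive procedure; the VRF name fabric-mpod is taken from the IPN example later in this post), you can verify the OSPF adjacency and the learned routes on both sides:

spine1# show ip ospf neighbors vrf overlay-1
spine1# show ip route vrf overlay-1

IPN1# show ip ospf neighbors vrf fabric-mpod
IPN1# show ip route vrf fabric-mpod

On the spine, the remote pod TEP pool should appear as an OSPF route; on the IPN device, the TEP pool of each pod should be learned via OSPF from the spines.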




QoS Considerations



Since the IPN is not under Cisco APIC management and may modify the 802.1p (class of service [CoS]) priority settings, additional steps are required to guarantee QoS priority in a Cisco ACI Multi-Pod topology.


When a packet leaves the spine switch in one pod toward the IPN, the outer header of the packet carries a CoS value to preserve the prioritization of the different traffic types across pods.


However, IPN devices may not preserve the CoS setting in 802.1p frames during transport. Therefore, when the frame reaches the other pod, it may lack the CoS information that was assigned at the source in the first pod.


To preserve the 802.1p information across the Cisco ACI Multi-Pod topology, you need to configure on the APIC a differentiated services code point (DSCP) policy that preserves the QoS priority settings by mapping CoS levels to DSCP levels for the different traffic types.



You must also ensure that the IPN devices do not overwrite the DSCP markings, so that the IPN does not change the configured levels.


With a DSCP policy enabled, Cisco APIC converts the CoS level in the outer 802.1p header to a DSCP level in the outer IP header and frames leave the pod according to the configured mappings.


When they reach the second pod, the mapped DSCP level is mapped back to the original CoS level, so the QoS priority settings are preserved.


The following examples show a CoS-to-DSCP mapping that is configured on the APIC using the Tenant > infra > Policies > Protocol > DSCP class-cos translation policy for Layer 3 traffic, which modifies the default behavior.



Hence, when traffic is received on the spine nodes of a remote pod, it is reassigned to its proper class of service before being injected into the pod based on the DSCP value in the outer IP header of inter-pod VXLAN traffic.


The DSCP class-cos translation policy in the example above marks the Policy Plane Traffic (that is, communication between APIC nodes that are deployed in separate pods) as Expedited Forwarding (EF), whereas Control Plane Traffic (that is, OSPF and MP-BGP packets) is marked as CS4.


Hence, you need to configure the IPN devices to prioritize those two types of traffic to ensure that the policy and control planes remain stable, even in scenarios where a large amount of east-west user traffic crosses the pods.
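As an illustration only (the class-map names, qos-group values, and queuing details are assumptions and depend on the IPN platform and its existing QoS policy), classification on a Nexus 9000 IPN device based on the DSCP values described above (EF for the policy plane, CS4 for the control plane) could look like this sketch:

# Classify APIC policy-plane traffic (DSCP 46 / EF)
class-map type qos match-any MPOD-POLICY-PLANE
  match dscp 46
# Classify OSPF/MP-BGP control-plane traffic (DSCP 32 / CS4)
class-map type qos match-any MPOD-CONTROL-PLANE
  match dscp 32

policy-map type qos MPOD-CLASSIFY
  class MPOD-POLICY-PLANE
    set qos-group 3
  class MPOD-CONTROL-PLANE
    set qos-group 2

interface Ethernet2/7
  service-policy type qos input MPOD-CLASSIFY

The qos-groups would then be mapped to priority or guaranteed-bandwidth queues in the egress queuing policy of your IPN design.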



IPN Control Plane





BUM Traffic Between Pods


Each bridge domain in the Cisco ACI fabric is associated with a separate multicast group (GIPo) to ensure granular delivery of multi-destination frames only to the endpoints that are part of a given bridge domain.


Hence, by default each bridge domain uses a GIPo from the 225.0.0.0/15 multicast range, which is configured during the initial APIC setup.


In a Cisco ACI Multi-Pod deployment, bridge domains can be extended across pods, so the same behavior must be achieved there as well. As indicated before, the IPN must support PIM Bidir to carry the BUM frames generated by an endpoint in a bridge domain.


These frames are encapsulated by the leaf node where the endpoint is connected and can then transit the IPN to reach remote endpoints in the same bridge domain.


For this functionality, the spine switches must perform two basic functions:

  • Forward received multicast frames toward the IPN devices to ensure they can be delivered to the remote pods.

  • Send IGMP joins toward the IPN every time a new bridge domain is activated in the local pod, so that the pod can receive BUM traffic for that bridge domain originated by an endpoint connected to a remote pod.

For each bridge domain, one spine node is elected as the authoritative device to perform both functions, using the IS-IS control plane between the spine switches.




The delivery of BUM traffic between pods follows this sequence (a verification sketch follows the list):

  1. EP1, belonging to BD1, originates a BUM frame.

  2. The frame is encapsulated by the local leaf node and destined to the multicast group GIPo1 (225.1.1.128) associated to BD1. As a consequence, it is sent along one of the multi-destination trees that are assigned to BD1 and reaches all the local spine and leaf nodes where BD1 has been instantiated.

  3. Spine 1 is responsible for forwarding BUM traffic for BD1 toward the IPN devices, using the specific link connected to IPN1.

  4. The IPN device receives the traffic and performs multicast replication toward all the pods from which it received an IGMP Join for GIPo1. This process ensures that BUM traffic is sent only to pods where BD1 is active.

  5. The spine that sent the IGMP Join toward the IPN devices receives the multicast traffic and forwards it inside the local pod along one of the multi-destination trees that are associated to BD1. All leaf switches where BD1 has been instantiated receive the frame.

  6. The leaf where EP2 is connected also receives the stream, de-encapsulates the packet, and forwards it to EP2.
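To verify this behavior on the IPN devices (a sketch; the GIPo 225.1.1.128 and the VRF name fabric-mpod are taken from the examples in this post), check the IGMP membership received from the spines, the resulting multicast routing state, and the RP in use:

IPN1# show ip igmp groups vrf fabric-mpod
IPN1# show ip mroute 225.1.1.128 vrf fabric-mpod
IPN1# show ip pim rp vrf fabric-mpod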


An important design consideration should be made for the deployment of the rendezvous point (RP) in the IPN.



The role of the RP is important in a PIM Bidir deployment, as all multicast traffic in Bidir groups vectors toward the Bidir RPs, branching off as necessary as it flows upstream and/or downstream.


The implication is that all the BUM traffic exchanged across pods would be sent through the same IPN device acting as RP for the 225.0.0.0/15 default range used to assign multicast groups to each defined bridge domain.


Spine Switches and IPN Connectivity


If you have multiple spine switches in each pod, it is not mandatory to connect every spine that is deployed in a pod to the IPN devices, although a minimum of two should be connected for redundancy.



If spine switches belonging to separate pods are directly connected, there will be implications for the BUM traffic that needs to be sent to remote pods.


In this situation, the directly connected spine switches in separate pods cannot be elected as designated for a given bridge domain, which makes it impossible to forward BUM traffic across pods.


Therefore, you should always deploy at least one Layer 3 IPN device (or a pair for redundancy) between pods.


In addition, you have to always ensure that there is a physical path interconnecting all IPN devices to avoid issues with the BUM traffic across pods.





The following example shows an IPN configuration for a Cisco Nexus 9000 Series Switch operating in NX-OS mode, which interconnects two pods (its interface Ethernet2/7 is connected to Pod 1 – Spine 1 and interface Ethernet2/9 to Pod 2 – Spine 1).






feature dhcp
feature pim
 
# Enable Jumbo frames
policy-map type network-qos jumbo
  class type network-qos class-default
    mtu 9216
 
system qos
  service-policy type network-qos jumbo
 
service dhcp
ip dhcp relay
ip pim ssm range 232.0.0.0/8
 
# Create a new VRF for Multipod.
vrf context fabric-mpod
  ip pim rp-address 12.1.1.1 group-list 225.0.0.0/8 bidir
  ip pim rp-address 12.1.1.1 group-list 239.255.255.240/28 bidir
  ip pim ssm range 232.0.0.0/8
 
interface Ethernet2/7
  no switchport
  mtu 9150
  no shutdown
 
interface Ethernet2/7.4
  description pod1-spine1
  mtu 9150
  encapsulation dot1q 4
  vrf member fabric-mpod
  ip address 201.1.2.2/30
  ip router ospf a1 area 0.0.0.0
  ip pim sparse-mode
  ip dhcp relay address 10.0.0.1
  ip dhcp relay address 10.0.0.2
  ip dhcp relay address 10.0.0.3
  no shutdown
 
interface Ethernet2/9
  no switchport
  mtu 9150
  no shutdown
 
interface Ethernet2/9.4
  description to pod2-spine1
  mtu 9150
  encapsulation dot1q 4
  vrf member fabric-mpod
  ip address 203.1.2.2/30
  ip router ospf a1 area 0.0.0.0
  ip pim sparse-mode
  ip dhcp relay address 10.0.0.1
  ip dhcp relay address 10.0.0.2
  ip dhcp relay address 10.0.0.3
  no shutdown
 
interface loopback29
  vrf member fabric-mpod
  ip address 12.1.1.1/32
 
router ospf a1
  vrf fabric-mpod
    router-id 29.29.29.29

As a best practice, the Multi-Pod traffic across the IPN is isolated in a dedicated VRF instance. In addition, the spine interfaces are connected to the IPN devices through point-to-point routed subinterfaces using VLAN 4.


The use of subinterfaces on the IPN devices is only mandatory for the connections toward the Cisco ACI spine switches.


Multi-Pod Provisioning and Packet Flow Between Pods


In a Cisco ACI Multi-Pod deployment, the fabric must be provisioned before it can forward endpoint traffic.


The Cisco ACI Multi-Pod fabric applies different control and data plane functionalities for connecting endpoints deployed across different pods.


Once Cisco ACI Multi-Pod is successfully provisioned, the information about all the endpoints stored in the COOP database on the spines of each pod is exchanged via BGP EVPN through the IPN.


External route information from L3Outs is exchanged via BGP VPNv4/VPNv6. These protocols form the control plane between pods.


Once the forwarding information, such as endpoints and L3Out routes, is exchanged via the control plane, data plane traffic is forwarded across pods through the IPN with TEP and VXLAN encapsulation, just as within a single pod.


Not only unicast traffic but also flooded (BUM) traffic can be forwarded seamlessly across pods.


In the Cisco APIC user interface, you can use a wizard to add a pod to the Multi-Pod deployments, which helps you provision the necessary L3Outs on the spine switches connected to the IPN, MTU on all spine-to-IPN interfaces, OSPF configuration towards the IPN, Anycast TEP IP address, and so on. You can invoke this wizard using Fabric > Inventory > Quick Start > Add Pod and choose Add Pod from the work plane.




Initially, the first pod (also known as the 'seed' pod) and the second pod should be physically connected to the IPN devices. Before the Cisco ACI Multi-Pod provisioning process can start, you should set up the Cisco APIC and the IPN using the following steps:


  1. Configure access policies: Configure access policies for all the interfaces on the spine switches used to connect to the IPN.

  2. Define these policies as spine access policies.

  3. Use these policies to associate an Attached Entity Profile (AEP) with a Layer 3 domain that uses VLAN 4 (as a requirement) for the subinterface encapsulation.

  4. Define these subinterfaces in the same way as normal leaf access ports. The subinterfaces are used by the infra L3Out interface that you define.

  5. Define the Multi-Pod environment: For the Cisco ACI Multi-Pod setup, you should define the TEP address for the spine switches facing each other across the IPN. This IP address is used as an anycast address shared by all spine switches in a pod. You should also define the Layer 3 interfaces between the spine switches and the IPN.

  6. Configure the IPN.

  7. Configure the IPN devices with IP addresses on the interfaces facing the spine switches, and enable the OSPF routing protocol, MTU support, DHCP-relay, and PIM Bidir.

  8. The IPN devices create OSPF adjacencies with the spine switches and exchange the routes of the underlying IS-IS network that is part of VRF overlay-1.

  9. The configuration of the IPN defines the DHCP relay, which is critical because the DHCP frames forwarded across the IPN must reach the primary APIC in the first pod to get an address assignment from the TEP pool. Without DHCP relay in the IPN, zero-touch provisioning will not occur for Cisco ACI nodes deployed in the second pod.

  10. Establish the interface access policies for the second pod:

  11. If you do not establish the access policies for the second pod, then the second pod cannot complete the process of joining the fabric. You can add the device to the fabric, but it does not complete the discovery process.

  12. Thus, the spine switch in the second pod has no way to talk to the original pod, since the OSPF adjacency cannot be established due to VLAN 4 requirement, and the OSPF interface profile and the external Layer 3 definition do not exist.

  13. You can reuse the access policies of the first pod as long as the spine interfaces you are using on both pods are the same. Hence, if the spine interfaces in both pods are the same and the ports in all the switches also are the same, then the only action you need to take is to add the spine switches to the switch profile that you define.



  1. Cisco APIC node 1 pushes the infra L3Out policies to the spine switches in Pod 1. The spine L3Out policy provisions the IPN-connected interfaces on spine switches with OSPF.

  2. At this point, the IPN has learned the Pod 1 TEP prefixes via OSPF from the spines, and the Pod 1 spine switches have learned the IP prefixes of the IPN interfaces facing the new spines in Pod 2.

  3. The first spine in Pod 2 boots up and sends DHCPDISCOVER to every connected interface, including the ones toward the IPN devices.

  4. The IPN device receiving the DHCPDISCOVER has been configured to relay that message to the Cisco APIC nodes in Pod 1. This is possible because the IPN devices learned the Pod 1 TEP prefixes via OSPF from the spines.

  5. The Cisco APIC sends a DHCPOFFER, which includes the initial parameters listed in the next three items. With this information, the new spine downloads a bootstrap for the infra L3Out configuration from the APIC, configures OSPF and BGP towards the IPN, and sets itself up as a DHCP relay for the new switch nodes in the new pod, so that their DHCP discovery can be relayed to the APIC nodes in Pod 1 as well.

  6. Subinterface IP address of the new spine facing the IPN.

  7. A static route to the Cisco APIC that sent DHCPOFFER, which points to the IPN IP address that relayed the DHCP messages.

  8. Bootstrap location for the infra L3Out configuration of the new spine.

  9. All other nodes in Pod 2 come up in the same way as in a single pod. The only difference is that the DHCP discovery is relayed through the IPN.

  10. The Cisco APIC controller in Pod 2 is discovered as usual.

  11. The Cisco APIC controller in Pod 2 joins the APIC cluster.
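Once discovery completes, the Pod 2 nodes should show up in the fabric membership. A quick check from the APIC CLI (illustrative output only; node IDs, names, serial numbers, and TEP addresses below are hypothetical):

apic1# acidiag fnvread
  ID   Pod ID    Name      Serial Number   IP Address        Role    State
 101        1    leaf101   FDO12345678     10.0.72.64/32     leaf    active
1001        1    spine1    FDO12345679     10.0.72.65/32     spine   active
 201        2    leaf201   FDO12345680     10.1.72.64/32     leaf    active
2001        2    spine3    FDO12345681     10.1.72.65/32     spine   active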



Inter-Pods MP-BGP Control Plane


In a single ACI fabric, information about all the endpoints connected to the leaf nodes is stored in the COOP database, which is available in the spine nodes.


Every time an endpoint is learned as a local endpoint on a leaf node, the leaf originates a COOP control plane message to communicate the endpoint information (IPv4/IPv6 and MAC addresses) to a spine node.


The COOP protocol is also used by the spine switches to synchronize this information between them.


The COOP database information in each pod is shared via MP-BGP EVPN through IPN so that each pod knows which endpoint is learned in which pod. MP-BGP EVPN runs directly between spine switches in each pod.


The IPN devices do not participate in these BGP sessions; they simply provide the TEP reachability needed to establish the BGP sessions between the spine switches.


BGP in each pod runs in the same BGP AS. This AS number is configured via the BGP Route Reflector policy, regardless of whether Cisco ACI Multi-Pod is used.
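The fabric AS number can be checked from the APIC with a query such as the following (a sketch; the output is trimmed and the AS value shown is illustrative):

apic1# moquery -c bgpAsP
# bgp.AsP
asn : 65000
dn  : uni/fabric/bgpInstP-default/as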







The inter-pod MP-BGP EVPN control plane functionality follows this sequence:

  1. When an endpoint EP1 is learned on Leaf 1 in Pod 1, Leaf 1 sends a COOP message to one of the spine switches.

  2. The receiving spine adds the endpoint information to the COOP database and synchronizes the information to all the other local spine switches. EP1 is associated to the TEP address of Leaf 1.

  3. The endpoint information in the COOP database in Pod 1 is shared with other pods via MP-BGP EVPN.

  4. Once the spine in Pod 2 learns endpoint information via MP-BGP EVPN, it adds the information to the COOP database and synchronizes it to all the other local spine nodes.

  5. When the MP-BGP entries from Pod 1 are translated into COOP entries on the Pod 2 spines, those MP-BGP entries have the Pod 1 dataplane TEP (DP-TEP) as the next hop.

  6. EP1 is now associated to an Anycast TEP address (Proxy A) that represents Pod 1 instead of Leaf 1 TEP.

  7. This behavior provides robust control plane isolation across pods: there is no need to send new control plane updates toward Pod 2 even if EP1 moves many times across leaf nodes in Pod 1, since the entry continues to point to the Proxy A next-hop address.


Since the spine nodes in different pods are part of the same BGP Autonomous System, the peering between the spine nodes connected through the IPN can be performed in two ways:

  • Full mesh: Establishing a full mesh of MP-BGP (iBGP) sessions between the spine switches of different pods, which is the default behavior.

  • Route reflector: Defining route reflector nodes in each pod (recommended for resiliency), so that the spine nodes only peer with the remote route reflector nodes and a full mesh of MP-BGP sessions is established only between the route reflectors.

  • These route reflectors are called external route reflectors, as opposed to the internal route reflectors used between spine and leaf switches. They are configured under Tenants > infra > Policies > Protocol > Fabric Ext Connection Policies > Fabric Ext Connection Policy default.
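Whichever peering model is used, the spine-to-spine EVPN sessions can be checked from the spine CLI (the remote peer addresses depend on your deployment); the same command is also used in the troubleshooting section later in this post:

spine1# show bgp l2vpn evpn summary vrf overlay-1
spine1# show bgp sessions vrf overlay-1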




Inter-Pods VXLAN Data Plane


To establish IP connectivity between endpoints that are connected to separate pods, the first requirement is to be able to complete an Address Resolution Protocol (ARP) exchange.




With ARP flooding enabled in the bridge domain, the ARP request follows this sequence:

  1. EP1 generates an ARP request to determine EP2’s MAC address (assuming EP1 and EP2 are part of the same IP subnet).

  2. The local leaf (Leaf 1 in Pod 1) receives the packet, inspects the payload of the ARP packet and learns EP1 information (as a locally connected endpoint) and knows that the ARP request is for EP2’s IP address. Since EP2 has not been learned yet, the leaf does not find any information about EP2 in its local forwarding tables, such as the endpoint table. As a consequence, since ARP flooding is enabled, the leaf picks the FTAG associated to one of the multi-destination trees used for BUM traffic and encapsulates the packet into a multicast packet (the external destination address is the GIPo associated to the specific BD). While performing the encapsulation, the leaf also adds to the VXLAN header the pcTag information relative to the EPG that EP1 belongs to.

  3. The designated spine sends the encapsulated ARP request across the IPN, still applying the same GIPo multicast address as the destination of the VXLAN encapsulated packet. The IPN network must have built a proper state to allow for the replication of the traffic toward all the remote Pods where this specific bridge domain has been deployed. This replication is performed through multicast routing with PIM Bidir in IPN.

  4. One of the spine nodes in Pod 2 receives the packet (the specific spine that previously sent toward the IPN an IGMP Join for the multicast group associated to the bridge domain) and floods it along a local multi-destination tree. Notice also that the spine has learned EP1 information from an MP-BGP update received from the spine in Pod1.

  5. The leaf where EP2 is connected (Leaf 4 in Pod 2) receives the flooded ARP request, learns EP1 information (location and Class ID/pcTag) and forwards the packet to all the local interfaces part of the bridge domain.

  6. EP2 receives the ARP request and triggers its reply, allowing the fabric to discover it (EP2 is not a "silent host" anymore).



  1. EP2 generates a unicast ARP reply destined to EP1 MAC address.

  2. The local leaf (Leaf 4 in Pod 2) now has EP1's location information, so the frame is VXLAN encapsulated and sent to Leaf 1 in Pod 1. At the same time, the local leaf also discovers that EP2 is locally connected and informs the local spine nodes through COOP.

  3. The remote leaf node (Leaf 1 in Pod 1) receives the packet, de-encapsulates it, learns and programs in the local endpoint table EP2 location and Class ID information and forwards the packet to the interface where EP1 is connected. EP1 is hence able to receive the ARP reply.
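At this point both leaf switches should have learned the remote endpoint. A quick check on Leaf 1 in Pod 1 (a sketch; the MAC and IP addresses are hypothetical) should show EP2 as a remote endpoint reachable through a tunnel interface toward Pod 2:

leaf101# show endpoint mac 0050.56a8.b003
leaf101# show endpoint ip 192.168.10.20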


The Cisco ACI fabric is designed to handle the presence of silent hosts even without requiring the flooding of ARP requests inside the bridge domain.


When ARP flooding is disabled in the bridge domain, the leaf nodes are not allowed to flood the ARP request frame along the local multi-destination tree. To ensure that the ARP request can still be delivered to a remote endpoint so that it can be learned, a process named "ARP gleaning" has been implemented.


With ARP gleaning, if the spine does not have information on where the destination of the ARP request is connected, the fabric generates an ARP request that is originated from the pervasive gateway IP address of the bridge domain.


This ARP request is sent out of all the leaf node edge interfaces that are part of the bridge domain.

In the Cisco ACI Multi-Pod deployment, the ARP glean request is also sent to the remote pods across the IPN.


The ARP glean message is encapsulated into a multicast frame before being sent out toward the IPN. The specific multicast group 239.255.255.240 is used for sourcing ARP glean messages for all the bridge domains (instead of the specific GIPo normally used for BUM traffic in a bridge domain).
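Whether ARP flooding is enabled for a given bridge domain can be checked from the APIC (a sketch; the bridge domain name is reused from the troubleshooting example later in this post, and the output is trimmed to the relevant attributes):

apic1# moquery -c fvBD -f 'fv.BD.name=="bd-L2-2"'
name     : bd-L2-2
dn       : uni/tn-CiscoLive2020/BD-bd-L2-2
arpFlood : no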


What is the Dataplane TEP/External Proxy TEP (ETEP)?


This is an address owned by all spine switches in a pod that acts as the next hop for BGP EVPN paths.


apic1# moquery -c ipv4If -f 'ipv4.If.mode*"etep"' -x 'rsp-subtree=children'
# ipv4.If
id      : lo14
adminSt : enabled
dn      : topology/pod-1/node-1001/sys/ipv4/inst/dom-overlay-1/if-[lo14]
donorIf : unspecified
lcOwn   : local
modTs   : 2019-02-20T16:58:34.113-04:00
mode    : etep
rn      : if-[lo14]
# ipv4.Addr
addr    : 192.168.1.254/32


TSHOOT


To find out from which leaf a MAC address was learned, look up the COOP record for the MAC on a spine and then resolve the tunnel next-hop address to a node:


spine1# show coop internal info repo ep | grep -B 8 -A 35 00:50:56:A8:B0:03
------------------------------------------
**omitted
EP bd vnid : 15761417
EP mac     : 00:50:56:A8:B0:03
**omitted
Tunnel nh  : 10.0.72.67
**omitted


a-apic1# moquery -c ipv4Addr -f 'ipv4.Addr.addr=="10.0.72.67"'
Total Objects shown: 1
# ipv4.Addr
addr : 10.0.72.67/32
dn   : topology/pod-1/node-101/sys/ipv4/inst/dom-overlay-1/if-[lo0]/addr-[10.0.72.67/32]
**omitted


If there is Layer 2 unicast traffic, how does the remote pod learn about the EP?


Step 1 : Local Pod Spine installs COOP record


show coop internal info repo ep | grep -B 8 -A 35 <mac address>


a-spine1# show coop internal info repo ep | grep -B 8 -A 35 00:50:56:A8:B0:03
------------------------------------------
**omitted
EP bd vnid : 15761417
EP mac     : 00:50:56:A8:B0:03
**omitted
Tunnel nh  : 10.0.72.67
**omitted


Step 2 : Local Pod Spine Exports into BGP EVPN


show bgp l2vpn evpn <mac address> vrf overlay-1


a-spine1# show bgp l2vpn evpn 00:50:56:A8:B0:03 vrf overlay-1
Route Distinguisher: 1:16777199    (L2VNI 1)
BGP routing table entry for [2]:[0]:[15761417]:[48]:[0050.56a8.b003]:[0]:[0.0.0.0]/216, **omitted
Paths: (1 available, best #1)
Flags: (0x00010a 00000000) on xmit-list, is not in rib/evpn
Multipath: eBGP iBGP

  Advertised path-id 1
  Path type: local 0x4000008c 0x0 ref 0, path is valid, is best path
  AS-Path: NONE, path locally originated
    0.0.0.0 (metric 0) from 0.0.0.0 (192.168.1.101)
      Origin IGP, MED not set, localpref 100, weight 32768
      Received label 15761417
      Extcommunity:
          RT:5:16

  Path-id 1 advertised to peers:
    192.168.2.101        192.168.2.102


Step 3 : Remote Pod Spine Receives through EVPN


show bgp l2vpn evpn <mac address> vrf overlay-1


spine3# show bgp l2vpn evpn 00:50:56:A8:B0:03 vrf overlay-1
Route Distinguisher: 1:16777199
BGP routing table entry for [2]:[0]:[15335345]:[48]:[0050.56a8.b003]:[0]:[0.0.0.0]/216, **omitted
Paths: (2 available, best #1)
Flags: (0x000202 00000000) on xmit-list, is not in rib/evpn, is locked
Multipath: eBGP iBGP

  Advertised path-id 1
  Path type: internal 0x40000018 0x2040 ref 1, path is valid, is best path
  AS-Path: NONE, path sourced internal to AS
    192.168.1.254 (metric 3) from 192.168.1.101 (192.168.1.101)   <- 192.168.1.101 is the BGP address of Spine 1; 192.168.1.254 is the dataplane TEP/ETEP of Pod 1
      Origin IGP, MED not set, localpref 100, weight 0
      Received label 15335345
      Received path-id 1
      Extcommunity:
          RT:5:16
          ENCAP:8


Step 4 : Remote Pod Spine Imports into COOP


show coop internal info repo ep | grep -B 8 -A 35 <mac address>


Scenario : EPs cannot communicate in a Layer 2 BD


Step 1 : Does the local leaf know about the remote EP?

leaf101# show endpoint mac 8c60.4f02.88fc <- No output


Step 2 : Does BD flood or proxy unknown unicast?


apic1# moquery -c fvBD -f 'fv.BD.name=="bd-L2-2"'
name           : bd-L2-2
dn             : uni/tn-CiscoLive2020/BD-bd-L2-2
unkMacUcastAct : proxy


Step 3 : Does Local Pod Spine have the EP?


spine1# moquery -c coopEpRec -f 'coop.EpRec.mac=="8c60.4f02.88fc"'


spine1# show bgp l2vpn evpn 8c60.4f02.88fc vrf overlay-1


Step 4 : Does Remote Pod Spine have the EP? --> Yes

spine3# moquery -c coopEpRec -f 'coop.EpRec.mac=="8c60.4f02.88fc"'
# coop.EpRec
vnid : 15761417
mac  : 8C:60:4F:02:88:FC


spine3# show bgp l2vpn evpn 8c60.4f02.88fc vrf overlay-1    --> Remote Pod Spine exports it to EVPN
AS-Path: NONE, path locally originated
  0.0.0.0 (metric 0) from 0.0.0.0 (192.168.2.101)
    Origin IGP, MED not set, localpref 100, weight 32768
    Received label 15761417
    Extcommunity:
        RT:5:16


Step 5 : Is EVPN up between Pods?


spine1# show bgp l2vpn evpn summ vrf overlay-1    ---> BGP is down
Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
192.168.2.101   4 65000   57380   66362        0    0    0 00:00:21 Active
192.168.2.102   4 65000   57568   66357        0    0    0 00:00:22 Active



MULTICAST


System GIPo Usage



What does Multi-Pod use BUM traffic for?


• Unknown Unicast Flooding

• Multidestination Traffic (ARP, Multicast, BPDUs)

• Inter-pod Glean Messages

• EP Announce Messages


Spines act as multicast hosts (IGMP only) and join the fabric multicast groups (GIPos). The IPN devices receive the IGMP joins and send PIM joins toward the RP.




Only one spine in each pod joins each group


spine1# show ip igmp gipo joins
GIPo list as read from IGMP-IF group-linked list
------------------------------------------------
225.0.80.64    0.0.0.0    Join    Eth1/25.25    95    Enabled


The RPF path on all IPN devices must point to the same RP.


Phantom RP


• Bidir PIM doesn't support multiple RPs

• Phantom RP is the only means of RP redundancy

• It works by advertising varied prefix lengths for the RP subnet

• Failover is handled via the IGP

• The RP loopback must use the OSPF point-to-point network type

• The exact RP address must not exist on any device
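The following is a minimal phantom RP sketch for two IPN devices, assuming the fabric-mpod VRF, OSPF process a1, and RP address 12.1.1.1 from the earlier IPN example (all addressing is illustrative). The primary advertises the longer prefix, the backup the shorter one, and the RP address itself is not configured on any device:

# IPN1 - primary (advertises the longer /30 prefix)
interface loopback1
  vrf member fabric-mpod
  ip address 12.1.1.2/30
  ip ospf network point-to-point
  ip router ospf a1 area 0.0.0.0
  ip pim sparse-mode

# IPN2 - backup (advertises the shorter /29 prefix)
interface loopback1
  vrf member fabric-mpod
  ip address 12.1.1.3/29
  ip ospf network point-to-point
  ip router ospf a1 area 0.0.0.0
  ip pim sparse-mode

# Both IPN devices point to the phantom RP address
vrf context fabric-mpod
  ip pim rp-address 12.1.1.1 group-list 225.0.0.0/8 bidir

Since the IGP prefers the longest prefix, traffic toward 12.1.1.1 follows the /30 route to IPN1; if IPN1 fails, the /29 route via IPN2 takes over.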




Common Multicast Problems


Issue # 1: RP Address Exists on Multiple Routers



Issue # 2: RP Loopback not OSPF P2P Network


In OSPF, loopbacks are advertised as /32 host routes by default, regardless of their configured mask. This breaks the phantom RP mechanism, because the subnet covering the (unassigned) RP address is never advertised. Setting the loopback OSPF network type to point-to-point makes OSPF advertise the configured subnet instead.
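The fix and a quick verification (a sketch reusing the loopback and VRF names from the phantom RP example above):

interface loopback1
  ip ospf network point-to-point

IPN1# show ip ospf interface loopback1 vrf fabric-mpod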




Common Multi-Pod L3Out Problems


Issue 1: Asymmetric Routing with Active/Active Pods.




Issue 2: Stretched L3out VIP Failover




Please check the following:


• Ensure an SVI is used for the L3Out (there is no flooding for routed interfaces)

• Ensure the same VLAN encapsulation is used in each pod

• Ensure the IPN devices agree on the tree for the GIPo

• Ensure a GARP is sent by the external router

• Check whether the GARP is sent with CoS 6 (more on this later)



QoS







DHCP


Pod 2 Spine


POD2-spine# show ip interface ethernet 1/11.39  vrf overlay-1



Pod 2 Spine


show dhcp internal event-history traces | egrep " 12:25:31"



APIC 1 in Pod 1, which is acting as the DHCP server


/var/log/dme/log




Location of the bootstrap XML file for the spine.



  • Challenge:

    • APICs in Pod 1 use 10.0.0.1 and 10.0.0.2.

    • The APIC in Pod 2 uses 10.0.0.3, but in the IPN and in Pod 2 the route for the TEP pool 10.0.0.0/16 points toward Pod 1, so there is no reachability to 10.0.0.3.

  • Solution:

    • The leaf where the APIC is connected in Pod 2 sees it via LLDP.

    • That leaf in Pod 2 inserts a static route to 10.0.0.3 (connected in the infra VLAN).

    • The leaf in Pod 2 redistributes it into IS-IS.

    • The spine in Pod 2 redistributes 10.0.0.3 from IS-IS into OSPF.

    • The spine in Pod 1 redistributes 10.0.0.3 from OSPF into IS-IS.

    • APIC 1 and APIC 2 gain IP reachability to it using this route.

    • The cluster becomes fully fit.


bdsol-aci32-leaf5# show ip route 10.0.0.3 vrf overlay-1
IP Route Table for VRF "overlay-1"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

10.0.0.3/32, ubest/mbest: 1/0, attached
    *via 10.0.0.3, vlan2, [0/0], 3d23h, am
     via 10.0.0.3, vlan2, [1/0], 3d23h, static

bdsol-aci32-leaf5# show ipmgr internal trace | egrep 10.0.0.3/32
1083) 2016 Aug 8 12:36:47.516192 ipmgr_static_rt.c:ipmgr_process_static_rt_message:3199: Add Non-pervasive static route in vrf overlay-1 nhvrf overlay-1 10.0.0.3/32 10.0.0.3 Vlan2 0 1 0 BFD disabled
1202) 2016 Aug 8 12:36:47.404624 ipmgr_static_rt.c:ipmgr_process_objstore_hndl_ipv4_rt_nh_message:3603: Item 0: dn: sys/ipv4/inst/dom-overlay-1/rt-[10.0.0.3/32]/nh-[vlan2]-addr-[10.0.0.3/32]-vrf-[overlay-1], vrf overlay-1, prefix 0xa000003
1207) 2016 Aug 8 12:36:47.404514 ipmgr_static_rt.c:ipmgr_process_objstore_hndl_ipv4_rt_nh_message:3544: Object num 0 => DN: sys/ipv4/inst/dom-overlay-1/rt-[10.0.0.3/32]/nh-[vlan2]-addr-[10.0.0.3/32]-vrf-[overlay-1] (prop_chg_bmp = 0)


