
Common Issues in Cisco ACI

  • Writer: Mukesh Chanderia
  • Dec 9, 2025
  • 9 min read

Updated: Jan 4

LACP: Port-Channel or vPC Member Flips


When a port-channel or vPC member flips to Down or Suspended, traffic can black-hole or pin to fewer links. In Cisco ACI, these symptoms almost always trace back to LACP negotiation or a physical/link-layer issue.


The quick read


  • If a member shows s (suspended) in show port-channel summary: ACI is transmitting LACPDUs but not receiving them, or partner parameters don’t match.

  • If a member shows D (down) in a port-channel that’s SD: The bundle exists, but the member is down or mis-programmed.

  • Counters tell the truth: In show lacp counters, look for Sent >> Recv or Recv = 0 on the affected member.

  • FSM states confirm it: States like "SUSPENDED_FOR_MISCONFIGURATION" or repeated "RECEIVE_PARTNER_PDU_TIMED_OUT" mean we’re not hearing back from the peer or the peer is not compatible.

  • Fixes: Align LACP mode (Active/Passive), verify the peer is running LACP on the same channel, check cabling/optic/VLAN encaps, and consider fast timers for quicker failover.


Typical symptoms & how to read them

1) Port-channel summary


Leaf# "show port-channel summary"


  • PoX(SU) – Bundle is Up, Switched; members should show (P) (participating).

  • PoX(SD) – Bundle defined but Down; members may show (s) (suspended) or (D) (down).

  • Member flags:

    • (P) = up in port-channel (healthy)

    • (s) = suspended (LACP mismatch or no partner LACPDUs)

    • (D) = down (physical/operationally down)
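
For reference, here is what a mixed healthy/suspended bundle looks like in the summary output (an illustrative sketch with placeholder interface names; exact columns vary by release):

Leaf# show port-channel summary
Flags:  D - Down        P - Up in port-channel (members)
        s - Suspended   S - Switched
        U - Up (port-channel)
--------------------------------------------------------------------------------
Group Port-Channel  Type     Protocol  Member Ports
--------------------------------------------------------------------------------
1     Po1(SU)       Eth      LACP      Eth1/1(P)    Eth1/2(s)

Here Eth1/1 is forwarding normally while Eth1/2 has been suspended by LACP.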


2) LACP counters

Leaf# show lacp counters


  • Healthy member: Sent ~ Recv and both increment steadily.

  • Problem member: Sent increases, Recv stays low or zero → not hearing the peer (bad cable/optic, peer not LACP, wrong port-channel on peer, blocked by policy).
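
An illustrative counters snapshot (placeholder values) showing the asymmetry on a bad member:

Leaf# show lacp counters
                        LACPDUs              Markers/Resp     LACPDUs
Port                  Sent      Recv         Recv  Sent       Pkts Err
----------------------------------------------------------------------
port-channel1
Ethernet1/1          15234     15231         0     0          0
Ethernet1/2          15234         0         0     0          0

Ethernet1/1 is healthy; Ethernet1/2 is transmitting but hearing nothing back.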


3) LACP per-interface detail (FSM)


Leaf# show lacp interface ethernet 1/x detail


Watch for:

  • LACP_ST_SUSPENDED_FOR_MISCONFIGURATION – actor/partner parameters do not agree (system ID, key, or operational settings).

  • RECEIVE_PARTNER_PDU_TIMED_OUT – leaf is not receiving LACPDUs (peer isn’t sending, link path issue, or control plane blocked).
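
A paraphrased sketch of what the per-interface detail can show on a suspended member (field names and wording differ between releases):

Leaf# show lacp interface ethernet 1/2 detail
...
  State: LACP_ST_SUSPENDED_FOR_MISCONFIGURATION
  Last FSM event: RECEIVE_PARTNER_PDU_TIMED_OUT
  PDUs sent: 15234    PDUs rcvd: 0
...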


Why it happens (common root causes)


  1. Peer side isn’t running LACP

    • Peer has static channel, wrong port-channel ID, LACP disabled, or interfaces not in the same bundle.

  2. Parameter mismatch

    • System ID/key mismatch (e.g., vPC vs single-chassis aggregation on peer), hashing/group membership mismatches, min-links constraints unmet.

  3. Physical layer problems

    • Bad optic/cable, mismatched transceivers, speed/duplex/autoneg anomalies, or LOS/LOF errors (check show interface transceiver details and error counters).

  4. Wrong policy attachment in ACI

    • The LACP policy isn’t actually attached to the interface policy group the Access Port Selector is using; or the wrong interface profile/selector targets the port.



Fix it fast: a clean remediation checklist

  1. Verify the peer’s bundle

    • On the peer, confirm the same set of member ports are in a single LACP port-channel and are up and active.

    • Ensure LACP mode = active (recommended on both ends).

  2. Confirm ACI policy wiring

    • APIC: Access Policies → Interface Policy Groups → (vPC/Port-Channel)

    • Ensure the LACP Policy you expect is selected and that this policy group is tied to the correct Access Port Selector and Leaf Interface Profile targeting the right leaf/port.

  3. Physical health

    • Swap known-good cables/optics, check interface error counters, verify speed/MTU/autoneg options are consistent with the peer.

  4. Min-links logic

    • If you use min-links, confirm enough members are healthy to bring the bundle up; otherwise the port-channel will stay down by design.

  5. Enable LACP fast timers to shorten detection/failover when a member silently dies.



Enabling LACP fast timers in Cisco ACI

Goal: Use 1-second LACP “short” timeout for faster failure detection.

Where: APIC Access Policies → Policies → Interface → LACP.

Steps (GUI):

  1. Create or edit an LACP Policy:

    • Mode: active

    • Rate/Fast Select: fast

    • (Optional, recommended) Graceful Convergence = Enabled, Suspend Individual = Enabled, set Min-Links as appropriate.

  2. Apply this LACP policy to the Interface Policy Group (vPC or Port-Channel) used by your uplink bundle.

  3. Ensure that policy group is actually referenced by the Access Port Selector for the correct leaf/ports.
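
Once pushed, you can sanity-check from the leaf CLI that the short (1-second) timeout is actually in effect; an illustrative check (output wording varies by release):

Leaf# show lacp interface ethernet 1/1 | grep -i timeout
  LACP_Timeout=Short Timeout (1s)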


Interpreting logs & states


  • CLI State Evidence

    • Observation: show port-channel summary shows bundle SU but member flagged (s) suspended.

    • Meaning: LACP negotiation failed on that member (no compatible partner or no LACPDUs received).

  • Counters Evidence

    • Observation: show lacp counters on a bad member shows high Sent, near-zero Recv.

    • Meaning: We are transmitting LACPDUs but not receiving any—peer isn’t sending (misconfig or down) or path is broken.

  • FSM/Timer Evidence

    • Observation: Interface detail logs repeated RECEIVE_PARTNER_PDU_TIMED_OUT and state SUSPENDED_FOR_MISCONFIGURATION.

    • Meaning: The leaf’s LACP state machine timed out waiting for partner PDUs or detected incompatible partner parameters, so it suspended the member to protect the bundle.

  • Resolution Evidence

    • Action: Correct peer LACP mode/bundle membership (ensure active on both ends), fix cabling/optic, and apply LACP fast timers.

    • Outcome: Member transitions to PORT_MEMBER_COLLECTING_AND_DISTRIBUTING_ENABLED, counters Sent/Recv increment steadily, and the member shows (P) in the bundle.


FAQ

Q: Is enabling fast timers disruptive?

Generally no; it’s a control-plane timer change. Apply during a maintenance window for critical paths, and validate afterward.

Q: Will fast timers help if the peer isn’t running LACP?

They help detect failure faster, but the real fix is enabling/configuring LACP correctly on the peer.

Q: I see (D) (down) instead of (s) (suspended). What’s the difference?

(D) is typically physical/operational down.

(s) means ACI suspended the member due to an LACP mismatch or timeouts.


----------------------------------------------------------------------------------------------------------------------


Understanding Cisco ACI Endpoint Dampening (F4311)


Cisco ACI includes several mechanisms to maintain stability and scalability in large fabrics. One of the most important—and often misunderstood—features is COOP Endpoint Dampening, which prevents excessive endpoint advertisements from overwhelming the control plane.


What Is COOP Endpoint Dampening?


The Council of Oracle Protocol (COOP) is responsible for distributing endpoint location information learned at leaf switches to all spines in the fabric.

In cases where an endpoint moves too frequently, the COOP control plane may become overloaded with constant advertisements. To protect itself, ACI uses a dampening mechanism that:

  1. Assigns a penalty for every move (MAC or IP changing location).

  2. Suppresses the endpoint if the penalty exceeds the threshold.

  3. Freezes the endpoint temporarily (fault F4311).

  4. Allows it to reappear once the penalty decays.


COOP Dampening Is a Protection Mechanism

It prevents:

  • Loops causing rapid MAC moves

  • Misconfigured active-active uplinks

  • Flapping virtual IPs

  • Unstable L3/L2 adjacencies

This feature ensures fabric stability—but in clustered designs, it can be unintentionally triggered.


Why Clusters (Like Oracle RAC / Exadata) Trigger COOP Dampening

Clustered workloads often use:

  • Multiple IPs bound to a single NIC/MAC

  • Virtual IPs (VIPs) that float between nodes

  • Rapid, repeated ARP/GARP announcements during failover

From ACI’s perspective, this appears as constant endpoint movement.


Example of a Common Pattern


Node 1 MAC: aa:bb:cc:11:22:33


IP List:

10.10.10.4

10.10.10.5

10.10.10.8

10.10.10.100 ← Cluster VIP


Node 2 MAC: aa:bb:cc:44:55:66


IP List:

10.10.20.20

10.10.10.100 ← Same VIP during failover


When the VIP (10.10.10.100) rapidly shifts between nodes, ACI logs resemble:


Moved IP 10.10.10.100 from MAC 11:22:33 → 44:55:66

Moved IP 10.10.10.100 from MAC 44:55:66 → 11:22:33

[DAMP] penalty increased: 9500 → 15000 → 20000 → Freeze


Once the threshold is exceeded, COOP freezes the endpoint and raises fault F4311.
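
To see what the spines currently believe about the VIP, query the COOP IP database on a spine (the BD VNID can be taken from the epm output shown below; values here are placeholders):

Spine# show coop internal info ip-db key <bd-vnid> 10.10.10.100

The returned record identifies the publishing leaf (tunnel next-hop), and on releases with endpoint dampening it also reflects the penalty/freeze state.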


Leaf# show system internal epm endpoint mac <mac-address>


This reveals:

  • Number of IPs bound to a MAC

  • Last interface location

  • VLAN/BD/VNID context
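
An abridged, paraphrased example of that output (placeholder values matching the pattern above):

Leaf# show system internal epm endpoint mac aa:bb:cc:11:22:33
MAC : aabb.cc11.2233 ::: Num IPs : 4
IP# 0 : 10.10.10.4    IP# 1 : 10.10.10.5
IP# 2 : 10.10.10.8    IP# 3 : 10.10.10.100
Vlan id : 25 ::: BD vnid : 16318374 ::: VRF name : Prod:VRF1
Interface : port-channel3
Flags : [Local|vPC|IP|MAC]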


The Fix


For clustered environments, Cisco best practice recommends:


Disable IP Data-Plane Learning

(across BD subnets used by clustered workloads)

This ensures ACI learns endpoint moves only from ARP/GARP updates, not from every data-plane packet that carries the IP.
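
IP data-plane learning is configurable per VRF (and per BD subnet on newer releases). A quick way to check the current VRF-level knob from the APIC CLI — a sketch that assumes the fvCtx attribute name ipDataPlaneLearning and a VRF named VRF1:

apic# moquery -c fvCtx | egrep 'name|ipDataPlaneLearning'
name                : VRF1
ipDataPlaneLearning : enabled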


Keep ARP and GARP learning enabled

Failovers still work, but without overwhelming the control plane.


Clear frozen endpoints after changes


Leaf# clear system internal epm endpoint mac <mac>


Conclusion


ACI endpoint dampening (F4311) is a protection mechanism, not a fabric failure. In clustered database or application environments with floating IPs, this feature may be unintentionally triggered during VIP failovers.

By:

  • Disabling IP Data-Plane Learning,

  • Allowing ARP/GARP-based updates, and

  • Clearing stale endpoints,

ACI fabrics can safely support RAC/Exadata-style clustering without control-plane churn or endpoint freezes.



-----------------------------------------------------------------------------------------------------------------------------


When EVPN BGP Flaps Are Not an ACI Problem


Problem Statement

A subset of EVPN BGP neighbors on ACI spines experienced brief BGP session resets, while all other neighbors remained stable for months or years.

Key observations:

  • Sessions went Down → Up within seconds

  • Reset reason was consistently Hold Timer Expired

  • No configuration changes were made at the time

  • Only remote peers were affected

  • Similar flaps were observed historically


Understanding the Architecture Context


Local vs Remote ASNs


In Cisco ACI Multi-Site designs:

  • Local ACI spines typically share a single local AS

  • Remote sites / IPN peers often use different ASNs

  • EVPN control traffic (TCP/179) traverses external transport paths

Example

  • Local ACI AS: AS 15802

  • Remote ASNs: AS 652xx / 654xx / 655xx

This distinction is critical: if only remote-AS neighbors flap, the fabric itself is often not the root cause.


What “Hold Timer Expired” Really Means


A BGP session resets due to Hold Timer expiry when:


Keepalive messages are not received within the negotiated hold time.

Important clarifications:

  • This is not a protocol mismatch

  • This is not a route-policy error

  • This is not an EVPN bug by default

It simply means packets were delayed or dropped somewhere along the path.


Why ACI Faults Can Be Misleading


During these events, APIC may raise faults such as:

  • F0299 – BGP peer not established (Idle)

However:

  • Faults are stateful and sticky

  • They may reflect past incidents

  • They do not always clear immediately

  • They may not match the actual outage timestamp


Correct approach:

  • Use APIC Events and node-level logs

  • Correlate exact timestamps

  • Match peer IPs precisely

Events = Timeline
Faults = Health state


Evidence from Node Logs

Across multiple spines, logs showed a consistent pattern:

Neighbor X Down - holdtimer expired
Neighbor X Up

Key characteristics:

  • Down/Up within ~5–15 seconds

  • Same peers affected repeatedly

  • Similar events months apart

  • No local BGP process restarts


This indicates transient transport instability, not a configuration error.


Historical Recurrence Matters


One of the most important findings was that:

  • The same peers

  • With the same reset reason

  • Had flapped in previous months and years


Recurring behavior across long timelines rules out:

  • Software defects

  • Misconfiguration

  • One-time operational mistakes


Instead, it points to:


External transport characteristics (WAN / IPN / ISP)


How to Prove Whether ACI Dropped the Keepalives


1. Was the BGP session reset locally?

show bgp l2vpn evpn neighbors <peer> vrf overlay-1

Check:

  • Last reset reason

  • Notification sent/received

  • Keepalive counters

If reset = Hold Timer Expired, continue.
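
An abridged example of the fields that matter (placeholder peer, AS, and counters; exact wording varies by release):

Spine# show bgp l2vpn evpn neighbors 10.255.0.1 vrf overlay-1
BGP neighbor is 10.255.0.1, remote AS 65201, ebgp link
  ...
  Holdtime = 180 s, keepalive interval = 60 s
  Last reset by us 00:07:41, due to hold timer expired
  Message statistics:
                              Sent       Rcvd
  Keepalives:                52341      52297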


2. Did the ACI interface drop packets?

Identify the egress interface:

show ip route vrf overlay-1 <peer-ip>
show ip adjacency vrf overlay-1 <peer-ip>

Then check:

show interface <intf> counters errors
show queuing interface <intf>

If counters are clean → not an ACI data-plane issue.
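
Clean counters look like this (illustrative, placeholder interface):

Spine# show interface ethernet 1/33 counters errors

--------------------------------------------------------------------------------
Port          Align-Err    FCS-Err   Xmit-Err    Rcv-Err  UnderSize OutDiscards
--------------------------------------------------------------------------------
Eth1/33               0          0          0          0          0           0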


3. Did control-plane policing (CoPP) drop traffic?

show copp statistics
show system internal bgp event-history errors

If no BGP-class drops are seen → control plane is healthy.


Finally, a control-plane packet capture can confirm:

  • TCP/179 packets

  • Direction (ingress/egress)

  • Presence or absence of keepalives


Final Technical Conclusion

Based on:

  • Hold timer expiry

  • Clean ACI interfaces

  • No CoPP drops

  • Stable local peers

  • Repeated historical occurrences

  • Remote AS involvement


The root cause is:


Transient packet loss or latency in the external transport path (IPN / WAN / ISP)

This is a design and transport consideration, not a fabric defect.


-----------------------------------------------------------------------------------------------------------------------------


When a Firewall Reboot in an HA Pair Results in ACI Endpoint Dampening



Problem Scenario

  • An active firewall node in an HA pair unexpectedly rebooted.

  • The standby firewall became active within seconds.

  • Firewall uplinks were dual-homed to two ACI leaf switches using vPC.

  • The firewall cluster continued to use the same MAC address across all routed interfaces.

  • Physical links and port-channels on the ACI side remained UP.

  • Applications experienced a ~5-minute traffic disruption, followed by recovery without manual intervention.


What ACI Observed:


1. Endpoint Dampening (F4311)

coop-ep-freeze
Endpoint is in dampened freeze state

2. Bridge Domain Learning Disabled (F1197)

bd-limits-exceeded
Learning is disabled on BD because of BD move frequency

These faults occurred within minutes of the firewall failover and aligned precisely with the application outage window.

This combination is significant. Together, they indicate excessive endpoint mobility rather than physical link instability.


Why No Interfaces Flapped (and Why That Matters)

In vPC-based firewall designs:

  • Both firewalls remain physically connected to both leaf switches.

  • A firewall failover does not necessarily cause a link-down event.

  • From ACI’s perspective, nothing “obviously” failed at Layer 1 or Layer 2.

This is expected behavior and not a misconfiguration.


However, it has an important implication:


ACI cannot rely on link state to determine which firewall node is active.

Instead, ACI must infer endpoint location from MAC learning behavior.


Issue: MAC Mobility Across Leaf Switches

The firewall HA design used:

  • A shared MAC address across multiple routed interfaces

  • Multiple Bridge Domains mapped to the same firewall

  • Dual-homing to two leaf switches via vPC


After failover:

  1. Traffic from the new active firewall began arriving on Leaf A

  2. Shortly after, return traffic appeared on Leaf B

  3. The same MAC address was learned alternately on both leaves

  4. This happened across multiple Bridge Domains simultaneously


From ACI’s perspective, this looked like:


A single MAC address rapidly moved between leaf switches.


Why ACI Reacted the Way It Did

Cisco ACI includes endpoint move dampening to protect the control plane from:

  • Layer 2 loops

  • Miswired devices

  • Flapping endpoints

  • Broadcast storms

This is not primarily a security feature—it is a control-plane protection mechanism.


How it works

  • ACI tracks how frequently an endpoint moves

  • If move frequency exceeds a threshold:

    • Endpoint learning is temporarily frozen (F4311)

    • BD-level learning may be disabled (F1197)

  • This containment is per Bridge Domain and per leaf, limiting blast radius.


In this scenario, the firewall MAC exceeded the move-frequency threshold very quickly, triggering these protections.
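
Both faults can be pulled from the APIC CLI to correlate their timestamps against the failover window; a sketch using moquery property filters:

apic# moquery -c faultInst -f 'fault.Inst.code=="F4311"' | egrep 'code|created|descr'
apic# moquery -c faultInst -f 'fault.Inst.code=="F1197"' | egrep 'code|created|descr'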


Why GARP (Gratuitous ARP) Matters Here


In HA firewall designs, GARP is the preferred signal to notify the network that:


“This MAC/IP is now reachable via a different interface or node.”

In this case:

  • No explicit evidence of GARP processing was found in ACI logs

  • Instead, ACI relied on data-plane traffic to relearn the endpoint

This fallback mechanism is known as ARP Glean.
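
Whether GARPs actually reach the leaf CPU can be verified the same way as any punted control traffic, with a capture on kpm_inb during a controlled failover (illustrative):

Leaf# tcpdump -ni kpm_inb arp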


Glean vs GARP: A Subtle but Critical Difference


GARP (Proactive)

  • Sent immediately during failover

  • Updates MAC location deterministically

  • Minimal disruption


Glean (Reactive)

  • Triggered only when data traffic arrives

  • Happens independently on each leaf

  • Can result in alternating MAC learning

  • Much slower and less stable during failover


Because ACI had to rely on Glean instead of GARP, endpoint learning became reactive and inconsistent across leaf switches—fuelling the mobility storm.


