
Common Issues in Cisco ACI

  • Writer: Mukesh Chanderia
  • 1 day ago

LACP --> port-channel or vPC member flips


When a port-channel or vPC member flips to Down or Suspended, traffic can black-hole or pin to fewer links. In Cisco ACI, these symptoms almost always trace back to LACP negotiation or a physical/link-layer issue.


The quick read


  • If a member shows s (suspended) in show port-channel summary: ACI is transmitting LACPDUs but not receiving them, or partner parameters don’t match.

  • If a member shows D (down) in a port-channel that’s SD: The bundle exists, but the member is down or mis-programmed.

  • Counters tell the truth: In show lacp counters, look for Sent >> Recv or Recv = 0 on the affected member.

  • FSM states confirm it: States like "SUSPENDED_FOR_MISCONFIGURATION" or repeated "RECEIVE_PARTNER_PDU_TIMED_OUT" mean we’re not hearing back from the peer or the peer is not compatible.

  • Fixes: Align LACP mode (Active/Passive), verify the peer is running LACP on the same channel, check cabling/optic/VLAN encaps, and consider fast timers for quicker failover.


Typical symptoms & how to read them

1) Port-channel summary


Leaf# show port-channel summary


  • PoX(SU) – Bundle is Up, Switched; members should show (P) (participating).

  • PoX(SD) – Bundle defined but Down; members may show (s) (suspended) or (D) (down).

  • Member flags:

    • (P) = up in port-channel (healthy)

    • (s) = suspended (LACP mismatch or no partner LACPDUs)

    • (D) = down (physical/operationally down)
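As a rough sketch of how these flags could be checked in a script, the Python below scans saved show port-channel summary text and lists every member that is not (P). The sample text, the regular expressions, and the function name are illustrative assumptions; real output spacing and line wrapping can differ between releases, so adjust the patterns to what your leaf actually prints.

```python
import re

# Hand-written sample for illustration only; not captured from a real leaf.
SAMPLE = """\
1     Po1(SU)     Eth      LACP      Eth1/1(P)    Eth1/2(s)
2     Po2(SD)     Eth      LACP      Eth1/3(D)
"""

BUNDLE_RE = re.compile(r"\b(Po\d+)\((\w+)\)")
MEMBER_RE = re.compile(r"\b(Eth\d+/\d+(?:/\d+)?)\((\w)\)")

FLAG_MEANING = {
    "P": "up in port-channel (healthy)",
    "s": "suspended (LACP mismatch or no partner LACPDUs)",
    "D": "down (physical/operationally down)",
}


def unhealthy_members(summary_text):
    """Return (bundle, bundle_state, member, flag) for every member not flagged P."""
    findings = []
    current = None  # last bundle seen, so wrapped member lines still map to it
    for line in summary_text.splitlines():
        bundle = BUNDLE_RE.search(line)
        if bundle:
            current = (bundle.group(1), bundle.group(2))
        if current is None:
            continue
        for member, flag in MEMBER_RE.findall(line):
            if flag != "P":
                findings.append((current[0], current[1], member, flag))
    return findings


for bundle, state, member, flag in unhealthy_members(SAMPLE):
    meaning = FLAG_MEANING.get(flag, "unknown flag " + flag)
    print(f"{bundle}({state}) {member}: {meaning}")
```

Feeding it the text saved from each leaf makes it easy to spot suspended members across many bundles at once.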


2) LACP counters

Leaf# show lacp counters


  • Healthy member: Sent ~ Recv and both increment steadily.

  • Problem member: Sent increases, Recv stays low or zero → not hearing the peer (bad cable/optic, peer not LACP, wrong port-channel on peer, blocked by policy).
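The Sent vs Recv comparison is easiest to trust when you take two readings a minute or so apart and compare the deltas. Here is a minimal Python sketch of that check. The LacpSample type, the return labels, and the 0.5 ratio are assumptions chosen for illustration; note also that the two ends can legitimately send at different rates, so a low ratio is a hint to investigate, not proof of a fault.

```python
from dataclasses import dataclass


@dataclass
class LacpSample:
    """One reading of the per-member LACPDU counters (hypothetical container)."""
    sent: int
    recv: int


def diagnose(before, after, min_ratio=0.5):
    """Classify a member from two counter readings taken some time apart."""
    sent_delta = after.sent - before.sent
    recv_delta = after.recv - before.recv
    if sent_delta <= 0:
        return "no_tx"    # the leaf itself is not sending LACPDUs
    if recv_delta <= 0:
        return "no_rx"    # sending but hearing nothing back from the peer
    if recv_delta < sent_delta * min_ratio:
        return "low_rx"   # far fewer received than sent; investigate
    return "healthy"


print(diagnose(LacpSample(100, 100), LacpSample(130, 130)))
print(diagnose(LacpSample(100, 0), LacpSample(130, 0)))
```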


3) LACP per-interface detail (FSM)


Leaf# show lacp interface ethernet 1/x detail


Watch for:

  • LACP_ST_SUSPENDED_FOR_MISCONFIGURATION – actor/partner parameters do not agree (system ID, key, or operational settings).

  • RECEIVE_PARTNER_PDU_TIMED_OUT – leaf is not receiving LACPDUs (peer isn’t sending, link path issue, or control plane blocked).
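If you collect this output from many interfaces, a small lookup can turn the state strings into a first guess at the cause. The sketch below is only a substring match on saved text; the hint wording paraphrases the bullets above, and the sample line is made up for illustration.

```python
# State names as quoted in this article; hints paraphrase the text above.
STATE_HINTS = {
    "SUSPENDED_FOR_MISCONFIGURATION":
        "actor/partner parameters do not agree (system ID, key, or operational settings)",
    "RECEIVE_PARTNER_PDU_TIMED_OUT":
        "leaf is not receiving LACPDUs (peer silent, link path issue, or control plane blocked)",
}


def explain_states(detail_text):
    """Return (state, hint) pairs for every known state found in the text."""
    return [(state, hint) for state, hint in STATE_HINTS.items() if state in detail_text]


# Hypothetical excerpt; the real command output has more fields.
sample = "state: LACP_ST_SUSPENDED_FOR_MISCONFIGURATION"
for state, hint in explain_states(sample):
    print(f"{state}: {hint}")
```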


Why it happens (common root causes)


  1. Peer side isn’t running LACP

    • Peer has static channel, wrong port-channel ID, LACP disabled, or interfaces not in the same bundle.

  2. Parameter mismatch

    • System ID/key mismatch (e.g., vPC vs single-chassis aggregation on peer), members assigned to different channel groups on the peer, or min-links constraints unmet.

  3. Physical layer problems

    • Bad optic/cable, mismatched transceivers, speed/duplex/autoneg anomalies, or LOS/LOF errors (check show interface transceiver details and error counters).

  4. Wrong policy attachment in ACI

    • The LACP policy isn’t actually attached to the interface policy group the Access Port Selector is using; or the wrong interface profile/selector targets the port.
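For the policy-attachment cause, one way to audit many policy groups at once is to pull them from the APIC REST API and list which LACP policy each one references. The sketch below only parses a response that has already been fetched. It assumes the class names infraAccBndlGrp and infraRsLacpPol and the attribute tnLacpLagPolName from the APIC object model, and a query along the lines of /api/node/class/infraAccBndlGrp.json?rsp-subtree=children&rsp-subtree-class=infraRsLacpPol; verify these against your APIC version. The sample response and policy names are hand-written.

```python
import json

# Hand-written sample shaped like an APIC class query response (assumption).
SAMPLE_RESPONSE = json.dumps({
    "imdata": [
        {"infraAccBndlGrp": {
            "attributes": {"name": "vpc-to-serverA", "lagT": "node"},
            "children": [
                {"infraRsLacpPol": {"attributes": {"tnLacpLagPolName": "lacp-active"}}}
            ]}},
        {"infraAccBndlGrp": {
            "attributes": {"name": "pc-to-fw", "lagT": "link"},
            "children": [
                {"infraRsLacpPol": {"attributes": {"tnLacpLagPolName": ""}}}
            ]}},
    ]
})


def lacp_policy_by_group(response_text):
    """Map each PC/vPC policy group name to its LACP policy name (None if unset)."""
    result = {}
    for item in json.loads(response_text).get("imdata", []):
        group = item.get("infraAccBndlGrp")
        if group is None:
            continue
        policy = None
        for child in group.get("children", []):
            relation = child.get("infraRsLacpPol")
            if relation is not None:
                # Empty string is treated here as "no explicit policy selected".
                policy = relation["attributes"].get("tnLacpLagPolName") or None
        result[group["attributes"]["name"]] = policy
    return result


for name, policy in lacp_policy_by_group(SAMPLE_RESPONSE).items():
    print(f"{name}: {policy or 'no explicit LACP policy'}")
```

Any group that shows no explicit policy is worth a second look in the GUI before blaming the peer.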



Fix it fast: a clean remediation checklist

  1. Verify the peer’s bundle

    • On the peer, confirm the same set of member ports are in a single LACP port-channel and are up and active.

    • Ensure LACP mode = active (recommended on both ends).

  2. Confirm ACI policy wiring

    • APIC: Access Policies → Interface Policy Groups → (vPC/Port-Channel)

    • Ensure the LACP Policy you expect is selected and that this policy group is tied to the correct Access Port Selector and Leaf Interface Profile targeting the right leaf/port.

  3. Physical health

    • Swap known-good cables/optics, check interface error counters, verify speed/MTU/autoneg options are consistent with the peer.

  4. Min-links logic

    • If you use min-links, confirm enough members are healthy to bring the bundle up; otherwise the port-channel will stay down by design.

  5. Enable LACP fast timers to shorten detection/failover when a member silently dies.
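The min-links check in step 4 is simple arithmetic, sketched below in Python. The function name and the flag dictionary are assumptions for illustration; the flags are the member flags from show port-channel summary.

```python
def bundle_should_be_up(member_flags, min_links=1):
    """Return True when enough members are (P) to satisfy min-links.

    member_flags maps an interface name to its summary flag, e.g. {"Eth1/1": "P"}.
    """
    healthy = sum(1 for flag in member_flags.values() if flag == "P")
    return healthy >= min_links


members = {"Eth1/1": "P", "Eth1/2": "s"}
print(bundle_should_be_up(members, min_links=1))
print(bundle_should_be_up(members, min_links=2))
```

With min-links set to 2, a single suspended member in a two-link bundle is enough to keep the whole port-channel down, which is expected behaviour and not a second fault.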



Enabling LACP fast timers in Cisco ACI

Goal: Send LACPDUs every 1 second (fast rate) so a silent partner times out after about 3 seconds instead of the default 90.

Where: APIC Access Policies → Policies → Interface → LACP.

Steps (GUI):

  1. Create or edit an LACP Policy:

    • Mode: active

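    • Transmit rate: fast (in the APIC GUI this setting lives in the Port Channel Member policy; Fast Select Hot Standby Ports is a separate control flag)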

    • (Optional, recommended) Graceful Convergence = Enabled, Suspend Individual = Enabled, set Min-Links as appropriate.

  2. Apply this LACP policy to the Interface Policy Group (vPC or Port-Channel) used by your uplink bundle.

  3. Ensure that policy group is actually referenced by the Access Port Selector for the correct leaf/ports.
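To put numbers on the trade-off: under the IEEE 802.1AX timer values, the fast rate sends an LACPDU every second and expires the partner after 3 seconds, while the normal rate sends every 30 seconds and expires after 90. The Python sketch below just encodes those constants; the function names are illustrative.

```python
# IEEE 802.1AX values: periodic transmit interval and receive timeout per rate.
LACP_TIMERS = {
    "fast": {"interval_s": 1, "timeout_s": 3},
    "normal": {"interval_s": 30, "timeout_s": 90},
}


def detection_time_s(rate):
    """Worst-case seconds before a silent partner is declared timed out."""
    return LACP_TIMERS[rate]["timeout_s"]


def pdus_per_minute(rate, members):
    """LACPDUs one side sends per minute across all bundle members."""
    return members * (60 // LACP_TIMERS[rate]["interval_s"])


for rate in LACP_TIMERS:
    print(rate, detection_time_s(rate), "s timeout,",
          pdus_per_minute(rate, members=4), "PDUs/min for 4 members")
```

The extra control-plane load of the fast rate is small per bundle, but it adds up on leaves with many port-channels.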


Interpreting logs & states


  • CLI State Evidence

    • Observation: show port-channel summary shows bundle SU but member flagged (s) suspended.

    • Meaning: LACP negotiation failed on that member (no compatible partner or no LACPDUs received).

  • Counters Evidence

    • Observation: show lacp counters on a bad member shows high Sent, near-zero Recv.

    • Meaning: We are transmitting LACPDUs but not receiving any—peer isn’t sending (misconfig or down) or path is broken.

  • FSM/Timer Evidence

    • Observation: Interface detail logs repeated RECEIVE_PARTNER_PDU_TIMED_OUT and state SUSPENDED_FOR_MISCONFIGURATION.

    • Meaning: The leaf’s LACP state machine timed out waiting for partner PDUs or detected incompatible partner parameters, so it suspended the member to protect the bundle.

  • Resolution Evidence

    • Action: Correct peer LACP mode/bundle membership (ensure active on both ends), fix cabling/optic, and apply LACP fast timers.

    • Outcome: Member transitions to PORT_MEMBER_COLLECTING_AND_DISTRIBUTING_ENABLED, counters Sent/Recv increment steadily and the member shows (P) in the bundle.


FAQ (quick hits)

Q: Is enabling fast timers disruptive?

Generally no; it’s a control-plane timer change. Apply during a maintenance window for critical paths, and validate afterward.

Q: Will fast timers help if the peer isn’t running LACP?

Not by themselves. Fast timers only make the leaf notice a silent partner sooner; the real fix is enabling/configuring LACP correctly on the peer.

Q: I see (D) (down) instead of (s) (suspended). What’s the difference?

(D) is typically physical/operational down, while (s) means ACI suspended the member due to LACP mismatch or timeouts.



Understanding Cisco ACI Endpoint Dampening (F4311)


Cisco ACI includes several mechanisms to maintain stability and scalability in large fabrics. One of the most important—and often misunderstood—features is COOP Endpoint Dampening, which prevents excessive endpoint advertisements from overwhelming the control plane.


What Is COOP Endpoint Dampening?


The Council of Oracles Protocol (COOP) is responsible for distributing endpoint location information learned at leaf switches to all spines in the fabric.

In cases where an endpoint moves too frequently, the COOP control plane may become overloaded with constant advertisements. To protect itself, ACI uses a dampening mechanism that:

  1. Assigns a penalty for every move (MAC or IP changing location).

  2. Suppresses the endpoint if the penalty exceeds the threshold.

  3. Freezes the endpoint temporarily (fault F4311).

  4. Allows it to reappear once the penalty decays.
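The four steps above can be pictured with a toy model. The Python sketch below uses an exponential half-life decay, which is a common way to model dampening but is an assumption here, and every number in it (penalty per move, thresholds, half-life) is made up for illustration and is not ACI's real value.

```python
from dataclasses import dataclass


@dataclass
class Dampener:
    """Toy penalty/decay model; all parameters are hypothetical, not ACI defaults."""
    penalty_per_move: float = 1000.0
    freeze_threshold: float = 4000.0
    reuse_threshold: float = 2000.0
    half_life_s: float = 10.0
    penalty: float = 0.0
    frozen: bool = False
    last_t: float = 0.0

    def _decay(self, now):
        # Halve the penalty for every half_life_s seconds that have passed.
        elapsed = now - self.last_t
        self.penalty *= 0.5 ** (elapsed / self.half_life_s)
        self.last_t = now
        if self.frozen and self.penalty < self.reuse_threshold:
            self.frozen = False  # step 4: endpoint may reappear

    def record_move(self, now):
        """Step 1-3: add a penalty for a move and freeze if over the threshold."""
        self._decay(now)
        self.penalty += self.penalty_per_move
        if self.penalty > self.freeze_threshold:
            self.frozen = True
        return self.frozen

    def is_frozen(self, now):
        self._decay(now)
        return self.frozen


d = Dampener()
for t in range(5):  # one move per second
    print(t, d.record_move(t), round(d.penalty))
```

In this toy run the endpoint freezes on the fifth move in five seconds, then thaws only after the moves stop long enough for the penalty to decay below the reuse threshold.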


COOP Dampening Is a Protection Mechanism

It protects the control plane from churn caused by:

  • Loops causing rapid MAC moves

  • Misconfigured active-active uplinks

  • Flapping virtual IPs

  • Unstable L3/L2 adjacencies

This feature ensures fabric stability—but in clustered designs, it can be unintentionally triggered.


Why Clusters (Like Oracle RAC / Exadata) Trigger COOP Dampening

Clustered workloads often use:

  • Multiple IPs bound to a single NIC/MAC

  • Virtual IPs (VIPs) that float between nodes

  • Rapid, repeated ARP/GARP announcements during failover

From ACI’s perspective, this appears as constant endpoint movement.


Example of a Common Pattern


Node 1 MAC: aa:bb:cc:11:22:33


IP List:

10.10.10.4

10.10.10.5

10.10.10.8

10.10.10.100 ← Cluster VIP


Node 2 MAC: aa:bb:cc:44:55:66


IP List:

10.10.20.20

10.10.10.100 ← Same VIP during failover


When the VIP (10.10.10.100) rapidly shifts between nodes, ACI logs resemble:


Moved IP 10.10.10.100 from MAC 11:22:33 → 44:55:66

Moved IP 10.10.10.100 from MAC 44:55:66 → 11:22:33

[DAMP] penalty increased: 9500 → 15000 → 20000 → Freeze


Once the threshold is exceeded, COOP freezes the endpoint and raises fault F4311.
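To find which IPs are bouncing before they get frozen, you can count how often each IP changes MAC in whatever move records you have collected. The Python sketch below assumes the records are already available as (timestamp, IP, MAC) tuples; that format, the timestamps, and the max_moves limit are illustrative assumptions, while the addresses come from the example above.

```python
from collections import defaultdict

# (timestamp_s, ip, mac) observations; timestamps are made up for illustration.
EVENTS = [
    (0.0, "10.10.10.100", "aa:bb:cc:11:22:33"),
    (0.5, "10.10.10.4", "aa:bb:cc:11:22:33"),
    (1.0, "10.10.10.100", "aa:bb:cc:44:55:66"),
    (2.0, "10.10.10.100", "aa:bb:cc:11:22:33"),
    (3.0, "10.10.10.100", "aa:bb:cc:44:55:66"),
]


def count_moves(events):
    """Count how many times each IP was seen on a different MAC than before."""
    last_mac = {}
    moves = defaultdict(int)
    for _, ip, mac in sorted(events):
        if ip in last_mac and last_mac[ip] != mac:
            moves[ip] += 1
        last_mac[ip] = mac
    return dict(moves)


def flapping_ips(events, max_moves=2):
    """Return IPs that moved more than max_moves times (hypothetical limit)."""
    return sorted(ip for ip, n in count_moves(events).items() if n > max_moves)


print(count_moves(EVENTS))
print(flapping_ips(EVENTS))
```

Here only the cluster VIP is reported, while the node's fixed addresses stay quiet, which matches the pattern described above.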


To see what the leaf has learned for the affected endpoint, run:

Leaf# show system internal epm endpoint mac <mac-address>


This reveals:

  • Number of IPs bound to a MAC

  • Last interface location

  • VLAN/BD/VNID context


Fix


For clustered environments, Cisco best practice recommends:


Disable IP Data-Plane Learning

(across BD subnets used by clustered workloads)

This ensures ACI updates an endpoint's IP location only from ARP/GARP (control-plane) packets, not from every routed data packet it forwards.


Keep ARP and GARP learning enabled

Failovers still work, but without overwhelming the control plane.


Clear frozen endpoints after changes


Leaf# clear system internal epm endpoint key vlan <vlan-id> mac <mac-address>


Conclusion


ACI endpoint dampening (F4311) is a protection mechanism, not a fabric failure. In clustered database or application environments with floating IPs, this feature may be unintentionally triggered during VIP failovers.

By:

  • Disabling IP Data-Plane Learning,

  • Allowing ARP/GARP-based updates, and

  • Clearing stale endpoints,

ACI fabrics can safely support RAC/Exadata-style clustering without control-plane churn or endpoint freezes.


 
 
 
