Common Issues in Cisco ACI
- Mukesh Chanderia
- 1 day ago
LACP: port-channel or vPC member flips
When a port-channel or vPC member flips to Down or Suspended, traffic can black-hole or pin to fewer links. In Cisco ACI, these symptoms almost always trace back to LACP negotiation or a physical/link-layer issue.
The quick read
If a member shows s (suspended) in show port-channel summary: ACI is transmitting LACPDUs but not receiving them, or partner parameters don’t match.
If a member shows D (down) in a port-channel that’s SD: The bundle exists, but the member is down or mis-programmed.
Counters tell the truth: In show lacp counters, look for Sent >> Recv or Recv = 0 on the affected member.
FSM states confirm it: States like "SUSPENDED_FOR_MISCONFIGURATION" or repeated "RECEIVE_PARTNER_PDU_TIMED_OUT" mean we’re not hearing back from the peer or the peer is not compatible.
Fixes: Align LACP mode (Active/Passive), verify the peer is running LACP on the same channel, check cabling/optic/VLAN encaps, and consider fast timers for quicker failover.
Typical symptoms & how to read them
1) Port-channel summary
Leaf# show port-channel summary
PoX(SU) – Bundle is Up, Switched; members should show (P) (participating).
PoX(SD) – Bundle defined but Down; members may show (s) (suspended) or (D) (down).
Member flags:
(P) = up in port-channel (healthy)
(s) = suspended (LACP mismatch or no partner LACPDUs)
(D) = down (physical/operationally down)
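As a rough illustration of reading these flags programmatically, the sketch below pulls member flags out of one line of summary output. The sample line layout (group, bundle, type, protocol, members) is an assumption modeled on typical NX-OS-style output, and `unhealthy_members` is a hypothetical helper name, not a Cisco tool.

```python
import re

# Flag meanings as listed in the text above.
FLAG_MEANING = {
    "P": "up in port-channel (healthy)",
    "s": "suspended (LACP mismatch or no partner LACPDUs)",
    "D": "down (physical/operationally down)",
}

# Matches member tokens such as "Eth1/1(P)". The bundle token "Po1(SU)"
# is skipped because it does not start with "Eth" followed by a port number.
MEMBER_RE = re.compile(r"\b(Eth\d+(?:/\d+)+)\((\w)\)")


def unhealthy_members(summary_line: str) -> dict:
    """Return {interface: meaning} for every member not flagged (P)."""
    result = {}
    for name, flag in MEMBER_RE.findall(summary_line):
        if flag != "P":
            result[name] = FLAG_MEANING.get(flag, f"unrecognized flag {flag!r}")
    return result


# Assumed sample line; real output may be laid out differently.
sample = "1     Po1(SD)     Eth      LACP      Eth1/1(s)    Eth1/2(D)   Eth1/3(P)"
for interface, meaning in unhealthy_members(sample).items():
    print(interface, "->", meaning)
```

Anything the regex does not recognize is simply ignored, so the helper is only a first-pass filter before looking at counters and FSM state.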
2) LACP counters
Leaf# show lacp counters
Healthy member: Sent ~ Recv and both increment steadily.
Problem member: Sent increases, Recv stays low or zero → not hearing the peer (bad cable/optic, peer not LACP, wrong port-channel on peer, blocked by policy).
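The Sent vs. Recv comparison can be sketched as a small check. The counter values are assumed to be already extracted from the CLI output; `LacpCounters`, `suspect_members`, and the thresholds `min_sent` and `ratio` are illustrative choices, not Cisco defaults.

```python
from dataclasses import dataclass


@dataclass
class LacpCounters:
    interface: str
    sent: int  # LACPDUs transmitted by the leaf
    recv: int  # LACPDUs received from the peer


def suspect_members(rows, min_sent=10, ratio=5.0):
    """Return interfaces where Sent is much larger than Recv, or Recv is 0.

    min_sent and ratio are hypothetical thresholds chosen for illustration.
    """
    suspects = []
    for row in rows:
        if row.sent < min_sent:
            continue  # too few LACPDUs sent to judge either way
        if row.recv == 0 or row.sent / row.recv >= ratio:
            suspects.append(row.interface)
    return suspects


rows = [
    LacpCounters("Eth1/1", sent=500, recv=498),  # healthy: Sent ~ Recv
    LacpCounters("Eth1/2", sent=500, recv=0),    # not hearing the peer
]
print(suspect_members(rows))
```

Comparing two snapshots taken a minute apart is more reliable than a single reading, because a member that recovered recently can still show a large historical gap.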
3) LACP per-interface detail (FSM)
Leaf# show lacp interface ethernet 1/x detail
Watch for:
LACP_ST_SUSPENDED_FOR_MISCONFIGURATION – actor/partner parameters do not agree (system ID, key, or operational settings).
RECEIVE_PARTNER_PDU_TIMED_OUT – leaf is not receiving LACPDUs (peer isn’t sending, link path issue, or control plane blocked).
Why it happens (common root causes)
Peer side isn’t running LACP
Peer has static channel, wrong port-channel ID, LACP disabled, or interfaces not in the same bundle.
Parameter mismatch
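System ID/key mismatch (e.g., vPC vs single-chassis aggregation on peer), group membership mismatches, min-links constraints unmet. The load-balancing hash is locally significant and is not negotiated by LACP, so a hashing difference alone does not suspend a member.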
Physical layer problems
Bad optic/cable, mismatched transceivers, speed/duplex/autoneg anomalies, or LOS/LOF errors (check show interface transceiver details and error counters).
Wrong policy attachment in ACI
The LACP policy isn’t actually attached to the interface policy group the Access Port Selector is using; or the wrong interface profile/selector targets the port.
Fix it fast: a clean remediation checklist
Verify the peer’s bundle
On the peer, confirm the same set of member ports are in a single LACP port-channel and are up and active.
Ensure LACP mode = active (recommended on both ends).
Confirm ACI policy wiring
APIC: Access Policies → Interface Policy Groups → (vPC/Port-Channel)
Ensure the LACP Policy you expect is selected and that this policy group is tied to the correct Access Port Selector and Leaf Interface Profile targeting the right leaf/port.
Physical health
Swap known-good cables/optics, check interface error counters, verify speed/MTU/autoneg options are consistent with the peer.
Min-links logic
If you use min-links, confirm enough members are healthy to bring the bundle up; otherwise the port-channel will stay down by design.
Enable LACP fast timers to shorten detection/failover when a member silently dies.
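The min-links item in the checklist above reduces to a simple comparison. A minimal sketch, assuming member flags have already been collected into a dictionary (`bundle_can_come_up` is a hypothetical helper):

```python
def bundle_can_come_up(member_flags: dict, min_links: int = 1) -> bool:
    """True if enough members are participating, i.e. flagged (P)."""
    healthy = sum(1 for flag in member_flags.values() if flag == "P")
    return healthy >= min_links


flags = {"Eth1/1": "P", "Eth1/2": "s"}
print(bundle_can_come_up(flags, min_links=2))  # False: only one healthy member
print(bundle_can_come_up(flags, min_links=1))  # True
```

With min-links set to 2, fixing only one of two suspended members still leaves the bundle down, which is expected behavior rather than a new fault.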
Enabling LACP fast timers in Cisco ACI
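Goal: Use the LACP fast rate (LACPDUs sent every 1 second, giving a 3-second “short” timeout) instead of the slow (normal) rate (every 30 seconds, 90-second “long” timeout) for faster failure detection.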
Where: APIC Access Policies → Policies → Interface → LACP.
Steps (GUI):
Create or edit an LACP Policy:
Mode: active
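Transmit rate: fast. Depending on the APIC release, this may be set in a separate Port Channel Member policy rather than in the Port Channel (LACP) policy itself; the “Fast Select Hot Standby Ports” control is a different setting and does not change the LACPDU rate.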
(Optional, recommended) Graceful Convergence = Enabled, Suspend Individual = Enabled, set Min-Links as appropriate.
Apply this LACP policy to the Interface Policy Group (vPC or Port-Channel) used by your uplink bundle.
Ensure that policy group is actually referenced by the Access Port Selector for the correct leaf/ports.
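To see what the fast rate buys, the arithmetic below uses the standard LACP timer values from IEEE 802.1AX: a partner is declared timed out after three missed periodic intervals. The function name `detection_time` is just for illustration.

```python
# Standard LACP timer values (IEEE 802.1AX), in seconds.
FAST_PERIODIC = 1        # LACPDU interval at the fast rate
SLOW_PERIODIC = 30       # LACPDU interval at the slow (normal) rate
TIMEOUT_MULTIPLIER = 3   # partner times out after 3 missed intervals


def detection_time(rate: str) -> int:
    """Seconds until a silent partner is timed out at the given rate."""
    periodic = {"fast": FAST_PERIODIC, "slow": SLOW_PERIODIC}[rate]
    return periodic * TIMEOUT_MULTIPLIER


print(detection_time("fast"), detection_time("slow"))  # 3 90
```

A member whose peer silently stops sending LACPDUs is therefore removed after about 3 seconds at the fast rate, compared with up to 90 seconds at the slow rate.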
Interpreting logs & states
CLI State Evidence
Observation: show port-channel summary shows bundle SU but member flagged (s) suspended.
Meaning: LACP negotiation failed on that member (no compatible partner or no LACPDUs received).
Counters Evidence
Observation: show lacp counters on a bad member shows high Sent, near-zero Recv.
Meaning: The leaf is transmitting LACPDUs but not receiving any: the peer isn’t sending (misconfigured or down) or the path is broken.
FSM/Timer Evidence
Observation: Interface detail logs repeated RECEIVE_PARTNER_PDU_TIMED_OUT and state SUSPENDED_FOR_MISCONFIGURATION.
Meaning: The leaf’s LACP state machine timed out waiting for partner PDUs or detected incompatible partner parameters, so it suspended the member to protect the bundle.
Resolution Evidence
Action: Correct peer LACP mode/bundle membership (ensure active on both ends), fix cabling/optic, and apply LACP fast timers.
Outcome: Member transitions to PORT_MEMBER_COLLECTING_AND_DISTRIBUTING_ENABLED, counters Sent/Recv increment steadily and the member shows (P) in the bundle.
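The evidence chain above can be folded into one triage helper. This is a sketch that encodes only the interpretations given in this article; `diagnose` is a hypothetical function, and the state strings are matched as substrings because the exact prefix can vary between outputs.

```python
def diagnose(flag: str, sent: int, recv: int, fsm_state: str = "") -> str:
    """Map summary flag, LACPDU counters, and FSM state to a likely cause."""
    if flag == "P":
        return "healthy: member is participating in the bundle"
    if flag == "D":
        return "down: check the physical layer (cable, optic, peer port state)"
    if flag == "s":
        if "SUSPENDED_FOR_MISCONFIGURATION" in fsm_state:
            return "suspended: actor/partner parameters disagree; compare the peer LACP config"
        if (sent > 0 and recv == 0) or "PDU_TIMED_OUT" in fsm_state:
            return "suspended: no LACPDUs from the peer; check peer LACP and the link path"
        return "suspended: inspect the per-interface LACP detail on the leaf"
    return f"unrecognized flag {flag!r}"


print(diagnose("s", sent=420, recv=0))
print(diagnose("s", sent=420, recv=400,
               fsm_state="LACP_ST_SUSPENDED_FOR_MISCONFIGURATION"))
```

The misconfiguration check runs first because a member can show both states in its history, and a parameter mismatch will not clear even after LACPDUs start arriving.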
FAQ (quick hits)
Q: Is enabling fast timers disruptive?
Generally no; it’s a control-plane timer change. Apply during a maintenance window for critical paths, and validate afterward.
Q: Will fast timers help if the peer isn’t running LACP?
Not really. Fast timers only shorten failure detection between two LACP speakers; if the peer isn’t running LACP, the real fix is enabling/configuring LACP correctly on the peer.
Q: I see (D) (down) instead of (s) (suspended). What’s the difference?
(D) typically means the member is physically or operationally down.
(s) means ACI suspended the member due to an LACP mismatch or timeouts.
Understanding Cisco ACI Endpoint Dampening (F4311)
Cisco ACI includes several mechanisms to maintain stability and scalability in large fabrics. One of the most important, and most often misunderstood, is COOP Endpoint Dampening, which prevents excessive endpoint advertisements from overwhelming the control plane.
What Is COOP Endpoint Dampening?
The Council of Oracles Protocol (COOP) is responsible for distributing endpoint location information learned at leaf switches to all spines in the fabric.
In cases where an endpoint moves too frequently, the COOP control plane may become overloaded with constant advertisements. To protect itself, ACI uses a dampening mechanism that:
Assigns a penalty for every move (MAC or IP changing location).
Suppresses the endpoint if the penalty exceeds the threshold.
Freezes the endpoint temporarily (fault F4311).
Allows it to reappear once the penalty decays.
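The penalty, freeze, and decay cycle above can be modeled with a toy simulation. Every number here (penalty per move, thresholds, half-life) is hypothetical and chosen only to make the behavior visible; the separate reuse threshold is borrowed from classic route-dampening designs as an assumption. None of this reflects ACI's internal values.

```python
import math


def simulate_dampening(move_times, penalty_per_move=1000.0,
                       freeze_threshold=4000.0, reuse_threshold=2000.0,
                       half_life=10.0, end_time=60.0, step=1.0):
    """Return a list of (time, penalty, frozen) samples, one per step.

    All defaults are illustrative assumptions, not ACI parameters.
    """
    decay = math.exp(-math.log(2) * step / half_life)  # per-step decay factor
    moves = sorted(move_times)
    penalty, frozen, t, i = 0.0, False, 0.0, 0
    timeline = []
    while t <= end_time:
        penalty *= decay                      # penalty decays over time
        while i < len(moves) and moves[i] <= t:
            penalty += penalty_per_move       # each move adds a penalty
            i += 1
        if penalty >= freeze_threshold:
            frozen = True                     # endpoint is suppressed
        elif frozen and penalty < reuse_threshold:
            frozen = False                    # endpoint may reappear
        timeline.append((t, round(penalty, 1), frozen))
        t += step
    return timeline


# Five moves in five seconds, then quiet.
for t, penalty, frozen in simulate_dampening([0, 1, 2, 3, 4])[:20]:
    print(f"t={t:>4} penalty={penalty:>7} frozen={frozen}")
```

The point of the model is that a single move never comes close to the threshold, while a burst of moves accumulates faster than the decay can drain it.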
COOP Dampening Is a Protection Mechanism
It prevents:
Loops causing rapid MAC moves
Misconfigured active-active uplinks
Flapping virtual IPs
Unstable L3/L2 adjacencies
This feature ensures fabric stability—but in clustered designs, it can be unintentionally triggered.
Why Clusters (Like Oracle RAC / Exadata) Trigger COOP Dampening
Clustered workloads often use:
Multiple IPs bound to a single NIC/MAC
Virtual IPs (VIPs) that float between nodes
Rapid, repeated ARP/GARP announcements during failover
From ACI’s perspective, this appears as constant endpoint movement.
Example of a Common Pattern
Node 1 MAC: aa:bb:cc:11:22:33
IP List:
10.10.10.4
10.10.10.5
10.10.10.8
10.10.10.100 ← Cluster VIP
Node 2 MAC: aa:bb:cc:44:55:66
IP List:
10.10.20.20
10.10.10.100 ← Same VIP during failover
When the VIP (10.10.10.100) rapidly shifts between nodes, ACI logs resemble:
Moved IP 10.10.10.100 from MAC 11:22:33 → 44:55:66
Moved IP 10.10.10.100 from MAC 44:55:66 → 11:22:33
[DAMP] penalty increased: 9500 → 15000 → 20000 → Freeze
Once the threshold is exceeded, COOP freezes the endpoint and raises fault F4311.
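A quick way to spot this pattern in collected move events is to count how often each IP changes MAC within a time window. The sketch below assumes events have already been parsed into `(timestamp_seconds, ip, mac)` tuples; `flapping_ips`, the 60-second window, and the limit of 3 changes are hypothetical choices, not ACI thresholds.

```python
from collections import defaultdict


def flapping_ips(moves, window=60.0, max_moves=3):
    """Return IPs whose MAC binding changed more than max_moves times
    within any sliding window of `window` seconds."""
    last_mac = {}
    change_times = defaultdict(list)
    flapping = set()
    for ts, ip, mac in sorted(moves):
        if ip in last_mac and last_mac[ip] != mac:
            times = change_times[ip]
            times.append(ts)
            # Drop changes that fell out of the sliding window.
            while times and ts - times[0] > window:
                times.pop(0)
            if len(times) > max_moves:
                flapping.add(ip)
        last_mac[ip] = mac
    return flapping


# VIP bouncing between the two node MACs from the example every 5 seconds.
events = [
    (t * 5.0, "10.10.10.100",
     "aa:bb:cc:11:22:33" if t % 2 == 0 else "aa:bb:cc:44:55:66")
    for t in range(6)
]
events.append((1.0, "10.10.10.4", "aa:bb:cc:11:22:33"))  # stable address
print(flapping_ips(events))
```

A single failover produces one change and is not flagged; only repeated back-and-forth movement within the window is reported.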
Leaf# show system internal epm endpoint mac <mac-address>
This reveals:
Number of IPs bound to a MAC
Last interface location
VLAN/BD/VNID context
FIX
For clustered environments, Cisco best practice recommends:
Disable IP Data-Plane Learning
(across BD subnets used by clustered workloads)
This ensures ACI updates an IP endpoint's location only from ARP/GARP packets, not from every routed data packet it sees from that IP.
Keep ARP and GARP learning enabled
Failovers still work, but without overwhelming the control plane.
Clear frozen endpoints after changes
Leaf# clear system internal epm endpoint mac <mac>
Conclusion
ACI endpoint dampening (F4311) is a protection mechanism, not a fabric failure. In clustered database or application environments with floating IPs, this feature may be unintentionally triggered during VIP failovers.
By disabling IP Data-Plane Learning, allowing ARP/GARP-based updates, and clearing stale endpoints, ACI fabrics can safely support RAC/Exadata-style clustering without control-plane churn or endpoint freezes.
