Common Issues in Cisco ACI

Mukesh Chanderia
Dec 9, 2025
Updated: Jan 4

LACP: port-channel or vPC member flips

When a port-channel or vPC member flips to Down or Suspended, traffic can black-hole or pin to fewer links. In Cisco ACI, these symptoms almost always trace back to LACP negotiation or a physical/link-layer issue.

The quick read

If a member shows s (suspended) in show port-channel summary: ACI is transmitting LACPDUs but not receiving them, or partner parameters don’t match.
If a member shows D (down) in a port-channel that’s SD: The bundle exists, but the member is down or mis-programmed.
Counters tell the truth: In show lacp counters, look for Sent >> Recv or Recv = 0 on the affected member.
FSM states confirm it: States like "SUSPENDED_FOR_MISCONFIGURATION" or repeated "RECEIVE_PARTNER_PDU_TIMED_OUT" mean we’re not hearing back from the peer or the peer is not compatible.
Fixes: Align LACP mode (Active/Passive), verify the peer is running LACP on the same channel, check cabling/optic/VLAN encaps, and consider fast timers for quicker failover.

Typical symptoms & how to read them

1) Port-channel summary

Leaf# show port-channel summary

PoX(SU) – Bundle is Up, Switched; members should show (P) (participating).
PoX(SD) – Bundle defined but Down; members may show (s) (suspended) or (D) (down).
Member flags:
- (P) = up in port-channel (healthy)
- (s) = suspended (LACP mismatch or no partner LACPDUs)
- (D) = down (physical/operationally down)
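
To make the flag reading concrete, here is a minimal sketch that pulls member flags out of one summary line. The sample line layout and the helper name `unhealthy_members` are assumptions for illustration; real output varies by release, so treat the regex as a starting point.

```python
import re

# Flag meanings taken from the list above.
FLAG_MEANING = {"P": "up in port-channel", "s": "suspended", "D": "down"}

# Matches tokens such as "Eth1/1(P)"; the surrounding line layout is an assumption.
MEMBER_RE = re.compile(r"(Eth\d+(?:/\d+)+)\((\w)\)")

def unhealthy_members(summary_line):
    """Return {interface: meaning} for every member whose flag is not (P)."""
    return {
        intf: FLAG_MEANING.get(flag, "unknown flag")
        for intf, flag in MEMBER_RE.findall(summary_line)
        if flag != "P"
    }

# Hypothetical sample line, not captured from a real leaf.
sample = "3  Po3(SU)  Eth  LACP  Eth1/1(P)  Eth1/2(s)  Eth1/3(D)"
print(unhealthy_members(sample))
```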

2) LACP counters

Leaf# show lacp counters

Healthy member: Sent ~ Recv and both increment steadily.
Problem member: Sent increases, Recv stays low or zero → not hearing the peer (bad cable/optic, peer not LACP, wrong port-channel on peer, blocked by policy).
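
The Sent/Recv comparison can be sketched as a small classifier. The `ratio` threshold below is an illustrative assumption for "Sent >> Recv", not a Cisco-defined value, and the function expects counters you have already read from the CLI.

```python
def classify_lacp_counters(sent, recv, ratio=10):
    """Classify one member from its LACPDU Sent/Recv counters.

    ratio is an assumed threshold for "Sent >> Recv", chosen for illustration.
    """
    if sent == 0 and recv == 0:
        return "no LACP activity"
    if recv == 0:
        return "no LACPDUs received from peer"
    if sent / recv >= ratio:
        return "peer LACPDUs mostly missing"
    return "looks healthy"

for counters in [(1200, 1195), (1200, 30), (1200, 0)]:
    print(counters, "->", classify_lacp_counters(*counters))
```

In practice, compare two samples taken a few seconds apart: a healthy member increments on both sides, not just in total.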

3) LACP per-interface detail (FSM)

Leaf# show lacp interface ethernet 1/x detail

Watch for:

LACP_ST_SUSPENDED_FOR_MISCONFIGURATION – actor/partner parameters do not agree (system ID, key, or operational settings).
RECEIVE_PARTNER_PDU_TIMED_OUT – leaf is not receiving LACPDUs (peer isn’t sending, link path issue, or control plane blocked).

Why it happens (common root causes)

Peer side isn’t running LACP
- Peer has static channel, wrong port-channel ID, LACP disabled, or interfaces not in the same bundle.
Parameter mismatch
- System ID/key mismatch (e.g., vPC vs single-chassis aggregation on peer), hashing/group membership mismatches, min-links constraints unmet.
Physical layer problems
- Bad optic/cable, mismatched transceivers, speed/duplex/autoneg anomalies, or LOS/LOF errors (check show interface transceiver details and error counters).
Wrong policy attachment in ACI
- The LACP policy isn’t actually attached to the interface policy group the Access Port Selector is using; or the wrong interface profile/selector targets the port.

Fix it fast: a clean remediation checklist

Verify the peer’s bundle
- On the peer, confirm the same set of member ports are in a single LACP port-channel and are up and active.
- Ensure LACP mode = active (recommended on both ends).
Confirm ACI policy wiring
- APIC: Access Policies → Interface Policy Groups → (vPC/Port-Channel)
- Ensure the LACP Policy you expect is selected and that this policy group is tied to the correct Access Port Selector and Leaf Interface Profile targeting the right leaf/port.
Physical health
- Swap known-good cables/optics, check interface error counters, verify speed/MTU/autoneg options are consistent with the peer.
Min-links logic
- If you use min-links, confirm enough members are healthy to bring the bundle up; otherwise the port-channel will stay down by design.
Fast timers
- Enable LACP fast timers to shorten detection/failover when a member silently dies.
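
The min-links rule in the checklist reduces to a simple count. This is a sketch of the logic only; `bundle_should_be_up` and the flag dictionary are hypothetical names, and the flags are the ones from show port-channel summary.

```python
def bundle_should_be_up(member_flags, min_links=1):
    """Return True when enough members are bundled (P) to satisfy min-links."""
    healthy = sum(1 for flag in member_flags.values() if flag == "P")
    return healthy >= min_links

members = {"Eth1/1": "P", "Eth1/2": "s"}
print(bundle_should_be_up(members, min_links=2))  # one healthy member is not enough
print(bundle_should_be_up(members, min_links=1))
```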

Enabling LACP fast timers in Cisco ACI

Goal: Use the LACP fast rate (LACPDUs every 1 second, 3-second “short” timeout) for faster failure detection than the default 30-second rate (90-second timeout).

Where: APIC Access Policies → Policies → Interface → LACP.

Steps (GUI):

Create or edit an LACP Policy:
- Mode: active
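- Transmit rate: fast (on many releases this is set in the Port Channel Member Policy; "Fast Select Hot Standby Ports" is a separate control and does not change the LACPDU rate)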
- (Optional, recommended) Graceful Convergence = Enabled, Suspend Individual = Enabled, set Min-Links as appropriate.
Apply this LACP policy to the Interface Policy Group (vPC or Port-Channel) used by your uplink bundle.
Ensure that policy group is actually referenced by the Access Port Selector for the correct leaf/ports.
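
To see what the fast rate buys you, the arithmetic below uses the periodic intervals from IEEE 802.1AX (1 second fast, 30 seconds slow) and the rule that a partner is timed out after three missed intervals. The function name is just for illustration.

```python
# Periodic transmit intervals defined by IEEE 802.1AX, in seconds.
LACP_INTERVAL = {"fast": 1, "normal": 30}

# A partner is declared timed out after three missed LACPDU intervals.
TIMEOUT_MULTIPLIER = 3

def lacp_timeout_seconds(rate):
    """Worst-case time to notice that the partner stopped sending LACPDUs."""
    return LACP_INTERVAL[rate] * TIMEOUT_MULTIPLIER

for rate in ("normal", "fast"):
    print(f"{rate}: {lacp_timeout_seconds(rate)} s")
```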

Interpreting logs & states

CLI State Evidence
- Observation: show port-channel summary shows bundle SU but member flagged (s) suspended.
- Meaning: LACP negotiation failed on that member (no compatible partner or no LACPDUs received).
Counters Evidence
- Observation: show lacp counters on a bad member shows high Sent, near-zero Recv.
- Meaning: We are transmitting LACPDUs but not receiving any—peer isn’t sending (misconfig or down) or path is broken.
FSM/Timer Evidence
- Observation: Interface detail logs repeated RECEIVE_PARTNER_PDU_TIMED_OUT and state SUSPENDED_FOR_MISCONFIGURATION.
- Meaning: The leaf’s LACP state machine timed out waiting for partner PDUs or detected incompatible partner parameters, so it suspended the member to protect the bundle.
Resolution Evidence
- Action: Correct peer LACP mode/bundle membership (ensure active on both ends), fix cabling/optic, and apply LACP fast timers.
- Outcome: Member transitions to PORT_MEMBER_COLLECTING_AND_DISTRIBUTING_ENABLED, counters Sent/Recv increment steadily and the member shows (P) in the bundle.

FAQ

Q: Is enabling fast timers disruptive?

Generally no; it’s a control-plane timer change. Apply during a maintenance window for critical paths, and validate afterward.

Q: Will fast timers help if the peer isn’t running LACP?

They help detect failure faster, but the real fix is enabling/configuring LACP correctly on the peer.

Q: I see (D) (down) instead of (s) (suspended). What’s the difference?

(D) typically means the member is physically or operationally down. (s) means ACI suspended the member because of an LACP mismatch or timeout.

----------------------------------------------------------------------------------------------------------------------

Understanding Cisco ACI Endpoint Dampening (F4311)

Cisco ACI includes several mechanisms to maintain stability and scalability in large fabrics. One of the most important—and often misunderstood—features is COOP Endpoint Dampening, which prevents excessive endpoint advertisements from overwhelming the control plane.

What Is COOP Endpoint Dampening?

The Council of Oracles Protocol (COOP) is responsible for distributing endpoint location information learned at leaf switches to all spines in the fabric.

In cases where an endpoint moves too frequently, the COOP control plane may become overloaded with constant advertisements. To protect itself, ACI uses a dampening mechanism that:

Assigns a penalty for every move (MAC or IP changing location).
Suppresses the endpoint if the penalty exceeds the threshold.
Freezes the endpoint temporarily (fault F4311).
Allows it to reappear once the penalty decays.
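
The penalty-and-decay idea can be sketched as a toy model. Every number here (penalty per move, freeze threshold, half-life) is a hypothetical placeholder, not ACI's actual value, and the exponential decay is an assumption borrowed from classic route-flap dampening.

```python
def simulate_dampening(move_times, penalty_per_move=1000.0,
                       freeze_threshold=4000.0, half_life=10.0):
    """Return the time of the move that would freeze the endpoint, or None.

    All three parameters are illustrative assumptions, not real ACI values.
    """
    penalty = 0.0
    last_t = None
    for t in move_times:
        if last_t is not None:
            # Halve the accumulated penalty for every half_life seconds elapsed.
            penalty *= 0.5 ** ((t - last_t) / half_life)
        penalty += penalty_per_move
        last_t = t
        if penalty > freeze_threshold:
            return t
    return None

print(simulate_dampening([0, 1, 2, 3, 4, 5]))   # rapid moves accumulate penalty
print(simulate_dampening([0, 60, 120, 180]))    # slow moves decay away
```

The point of the model: it is the rate of moves, not the total count, that pushes an endpoint over the threshold.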

COOP Dampening Is a Protection Mechanism

It prevents:

Loops causing rapid MAC moves
Misconfigured active-active uplinks
Flapping virtual IPs
Unstable L3/L2 adjacencies

This feature ensures fabric stability—but in clustered designs, it can be unintentionally triggered.

Why Clusters (Like Oracle RAC / Exadata) Trigger COOP Dampening

Clustered workloads often use:

Multiple IPs bound to a single NIC/MAC
Virtual IPs (VIPs) that float between nodes
Rapid, repeated ARP/GARP announcements during failover

From ACI’s perspective, this appears as constant endpoint movement.

Example of a Common Pattern

Node 1 MAC: aa:bb:cc:11:22:33
IP List:
- 10.10.10.4
- 10.10.10.5
- 10.10.10.8
- 10.10.10.100 ← Cluster VIP

Node 2 MAC: aa:bb:cc:44:55:66
IP List:
- 10.10.20.20
- 10.10.10.100 ← Same VIP during failover

When the VIP (10.10.10.100) rapidly shifts between nodes, ACI logs resemble:

Moved IP 10.10.10.100 from MAC 11:22:33 → 44:55:66

Moved IP 10.10.10.100 from MAC 44:55:66 → 11:22:33

[DAMP] penalty increased: 9500 → 15000 → 20000 → Freeze

Once the threshold is exceeded, COOP freezes the endpoint and raises fault F4311.
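
If you have a time-ordered list of learn events, a quick way to spot the flapping IP is to count how often each IP changes MAC. The event tuples below are a hypothetical structure built from the example addresses above, not a real ACI log format.

```python
from collections import Counter

def count_ip_moves(events):
    """Count MAC changes per IP. events: (ip, mac) learn events in time order."""
    last_mac = {}
    moves = Counter()
    for ip, mac in events:
        if ip in last_mac and last_mac[ip] != mac:
            moves[ip] += 1
        last_mac[ip] = mac
    return moves

NODE1 = "aa:bb:cc:11:22:33"
NODE2 = "aa:bb:cc:44:55:66"
events = [
    ("10.10.10.100", NODE1),
    ("10.10.10.4", NODE1),
    ("10.10.10.100", NODE2),
    ("10.10.10.100", NODE1),
    ("10.10.10.100", NODE2),
]
print(count_ip_moves(events).most_common(1))
```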

To inspect the affected endpoint on the leaf:

Leaf# show system internal epm endpoint mac <mac-address>

This reveals:

Number of IPs bound to a MAC
Last interface location
VLAN/BD/VNID context

Fix

For clustered environments, Cisco best practice recommends:

Disable IP Data-Plane Learning

(across BD subnets used by clustered workloads)

This ensures ACI learns IP endpoint movement only from ARP/GARP updates, not from every routed data-plane packet.

Keep ARP and GARP learning enabled

Failovers still work, but without overwhelming the control plane.

Clear frozen endpoints after changes

Leaf# clear system internal epm endpoint mac <mac>

Conclusion

ACI endpoint dampening (F4311) is a protection mechanism, not a fabric failure. In clustered database or application environments with floating IPs, this feature may be unintentionally triggered during VIP failovers.

By:

Disabling IP Data-Plane Learning,
Allowing ARP/GARP-based updates, and
Clearing stale endpoints,

ACI fabrics can safely support RAC/Exadata-style clustering without control-plane churn or endpoint freezes.

-----------------------------------------------------------------------------------------------------------------------------

When EVPN BGP Flaps Are Not an ACI Problem

Problem Statement

A subset of EVPN BGP neighbors on ACI spines experienced brief BGP session resets, while all other neighbors remained stable for months or years.

Key observations:

Sessions went Down → Up within seconds
Reset reason was consistently Hold Timer Expired
No configuration changes were made at the time
Only remote peers were affected
Similar flaps were observed historically

Understanding the Architecture Context

Local vs Remote ASNs

In Cisco ACI Multi-Site designs:

Local ACI spines typically share a single local AS
Remote sites / IPN peers often use different ASNs
EVPN control traffic (TCP/179) traverses external transport paths

Example

Local ACI AS: AS 15802
Remote ASNs: AS 652xx / 654xx / 655xx

This distinction is critical: if only remote AS neighbors flap, the fabric itself is often not the root cause.

What “Hold Timer Expired” Really Means

A BGP session resets due to Hold Timer expiry when:

Keepalive messages are not received within the negotiated hold time.

Important clarifications:

This is not a protocol mismatch
This is not a route-policy error
This is not an EVPN bug by default

It simply means that packets were delayed or dropped somewhere along the path.
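
A minimal sketch of the hold-timer logic: the timer restarts on each received KEEPALIVE or UPDATE, and the session resets if the gap exceeds the negotiated hold time. The 180-second value is only an example; check what your peers actually negotiated.

```python
def hold_timer_expiries(message_times, hold_time=180.0):
    """Return the times at which the hold timer would have expired.

    message_times: arrival times (seconds) of KEEPALIVE/UPDATE messages.
    hold_time: negotiated hold time; 180 s is an assumed example value.
    """
    expiries = []
    for prev, cur in zip(message_times, message_times[1:]):
        if cur - prev > hold_time:
            # Nothing arrived for longer than hold_time after `prev`.
            expiries.append(prev + hold_time)
    return expiries

# Keepalives every 60 s, then a 210 s gap caused by loss on the path.
print(hold_timer_expiries([0, 60, 120, 330, 390]))
```

Note that three consecutive keepalives must go missing at a 60/180 setting, which is why this reset reason points to sustained loss or delay rather than a single dropped packet.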

Why ACI Faults Can Be Misleading

A typical fault seen in this scenario:

F0299 – BGP peer not established (Idle)

However:

Faults are stateful and sticky
They may reflect past incidents
They do not always clear immediately
They may not match the actual outage timestamp

Correct approach:

Use APIC Events and node-level logs
Correlate exact timestamps
Match peer IPs precisely

Events = Timeline
Faults = Health state
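
The correlation step can be sketched as a filter on peer IP and time window. The event dictionaries and field names (`time`, `peer`, `text`) are hypothetical; map them to whatever your APIC event export actually contains.

```python
from datetime import datetime, timedelta

def events_near(events, outage_start, peer_ip, window=timedelta(minutes=2)):
    """Return events for peer_ip that fall within `window` of outage_start."""
    return [
        e for e in events
        if e["peer"] == peer_ip and abs(e["time"] - outage_start) <= window
    ]

# Hypothetical records; field names are assumptions for illustration.
events = [
    {"time": datetime(2025, 3, 1, 10, 0, 5), "peer": "192.0.2.1", "text": "Down"},
    {"time": datetime(2025, 3, 1, 10, 0, 14), "peer": "192.0.2.1", "text": "Up"},
    {"time": datetime(2025, 1, 12, 4, 30, 0), "peer": "192.0.2.1", "text": "Down"},
    {"time": datetime(2025, 3, 1, 10, 0, 6), "peer": "192.0.2.9", "text": "Down"},
]
matches = events_near(events, datetime(2025, 3, 1, 10, 0, 0), "192.0.2.1")
print([e["text"] for e in matches])
```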

Evidence from Node Logs

Across multiple spines, logs showed a consistent pattern:

Neighbor X Down - holdtimer expired
Neighbor X Up

Key characteristics:

Down/Up within ~5–15 seconds
Same peers affected repeatedly
Similar events months apart
No local BGP process restarts

This indicates transient transport instability, not a configuration error.
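
To measure the Down/Up gap from node logs, something like the following works as a sketch. The timestamp prefix and exact line layout are assumptions; only the "Neighbor X Down/Up" wording comes from the pattern above.

```python
import re
from datetime import datetime

# Assumed layout: "<date> <time> Neighbor <peer> Down|Up ..."
LINE_RE = re.compile(r"^(\S+ \S+) Neighbor (\S+) (Down|Up)")

def flap_durations(lines):
    """Return (peer, seconds_down) for each Down followed by an Up."""
    down_at = {}
    flaps = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        peer, state = m.group(2), m.group(3)
        if state == "Down":
            down_at[peer] = ts
        elif peer in down_at:
            flaps.append((peer, (ts - down_at.pop(peer)).total_seconds()))
    return flaps

log = [
    "2025-03-01 10:00:00 Neighbor 192.0.2.1 Down - holdtimer expired",
    "2025-03-01 10:00:09 Neighbor 192.0.2.1 Up",
]
print(flap_durations(log))
```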

Historical Recurrence Matters

One of the most important findings was that:

The same peers
With the same reset reason
Had flapped in previous months and years

Recurring behavior across long timelines, with no configuration changes in between, makes the following unlikely:

Software defects
Misconfiguration
One-time operational mistakes

Instead, it points to:

External transport characteristics (WAN / IPN / ISP)

How to Prove Whether ACI Dropped the Keepalives

1. Was the BGP session reset locally?

show bgp l2vpn evpn neighbors <peer> vrf overlay-1

Check:

Last reset reason
Notification sent/received
Keepalive counters

If reset = Hold Timer Expired, continue.

2. Did the ACI interface drop packets?

Identify the egress interface:

show ip route vrf overlay-1 <peer-ip>
show ip adjacency vrf overlay-1 <peer-ip>

Then check:

show interface <intf> counters errors
show queuing interface <intf>

If counters are clean → not an ACI data-plane issue.

3. Did control-plane policing (CoPP) drop traffic?

show copp statistics
show system internal bgp event-history errors

If no BGP-class drops are seen → control plane is healthy.

If doubt remains, a packet capture on the peering path can confirm:

TCP/179 packets
Direction (ingress/egress)
Presence or absence of keepalives
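
The three checks can be combined into one rough decision function. This is a sketch of the reasoning only: the inputs are values you read off the CLI output yourself, and the function name and return strings are made up for illustration.

```python
def likely_fault_domain(reset_reason, interface_errors, copp_bgp_drops):
    """Narrow down where keepalives were lost, following checks 1 to 3."""
    if reset_reason.strip().lower() != "hold timer expired":
        return "different reset reason: investigate that first"
    if interface_errors > 0:
        return "ACI data plane (interface errors or queue drops)"
    if copp_bgp_drops > 0:
        return "ACI control plane (CoPP drops)"
    return "external transport path (IPN / WAN / ISP)"

print(likely_fault_domain("Hold Timer Expired", 0, 0))
```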

Final Technical Conclusion

Based on:

Hold timer expiry
Clean ACI interfaces
No CoPP drops
Stable local peers
Repeated historical occurrences
Remote AS involvement

The root cause is:

Transient packet loss or latency in the external transport path (IPN / WAN / ISP)

This is a design and transport consideration, not a fabric defect.

-----------------------------------------------------------------------------------------------------------------------------

When a Firewall Reboot in an HA Pair Results in ACI Endpoint Dampening

Problem Scenario

An active firewall node in an HA pair unexpectedly rebooted.
The standby firewall became active within seconds.
Firewall uplinks were dual-homed to two ACI leaf switches using vPC.
The firewall cluster continued to use the same MAC address across all routed interfaces.
Physical links and port-channels on the ACI side remained UP.
Applications experienced a ~5-minute traffic disruption, followed by recovery without manual intervention.

What ACI Observed:

1. Endpoint Dampening (F4311)

coop-ep-freeze
Endpoint is in dampened freeze state

2. Bridge Domain Learning Disabled (F1197)

bd-limits-exceeded
Learning is disabled on BD because of BD move frequency

These faults occurred within minutes of the firewall failover and aligned precisely with the application outage window.

This combination is significant. Together, they indicate excessive endpoint mobility rather than physical link instability.

Why No Interfaces Flapped (and Why That Matters)

In vPC-based firewall designs:

Both firewalls remain physically connected to both leaf switches.
A firewall failover does not necessarily cause a link-down event.
From ACI’s perspective, nothing “obviously” failed at Layer 1 or Layer 2.

This is expected behavior and not a misconfiguration.

However, it has an important implication:

ACI cannot rely on link state to determine which firewall node is active.

Instead, ACI must infer endpoint location from MAC learning behavior.

Issue: MAC Mobility Across Leaf Switches

The firewall HA design used:

A shared MAC address across multiple routed interfaces
Multiple Bridge Domains mapped to the same firewall
Dual-homing to two leaf switches via vPC

After failover:

Traffic from the new active firewall began arriving on Leaf A
Shortly after, return traffic appeared on Leaf B
The same MAC address was learned alternately on both leaves
This happened across multiple Bridge Domains simultaneously

From ACI’s perspective, this looked like:

A single MAC address rapidly moved between leaf switches.

Why ACI Reacted the Way It Did

Cisco ACI includes endpoint move dampening to protect the control plane from:

Layer 2 loops
Miswired devices
Flapping endpoints
Broadcast storms

This is not primarily a security feature—it is a control-plane protection mechanism.

How it works

ACI tracks how frequently an endpoint moves
If move frequency exceeds a threshold:
- Endpoint learning is temporarily frozen (F4311)
- BD-level learning may be disabled (F1197)
This containment is per Bridge Domain and per leaf, limiting blast radius.

In this scenario, the firewall MAC exceeded the move-frequency threshold very quickly, triggering these protections.
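
Here is a rough model of per-BD move tracking when the same MAC is learned alternately on two leaves. The threshold, window, and event tuple layout are all hypothetical and do not reflect ACI's real limits; the sketch only shows why several Bridge Domains can trip at once.

```python
from collections import defaultdict, deque

def bds_over_threshold(learn_events, max_moves=4, window=60.0):
    """Return the set of BDs where a MAC moved more than max_moves within window.

    learn_events: (time, bd, mac, leaf) tuples in time order.
    max_moves and window are illustrative assumptions, not ACI values.
    """
    last_leaf = {}
    recent_moves = defaultdict(deque)
    flagged = set()
    for t, bd, mac, leaf in learn_events:
        key = (bd, mac)
        if key in last_leaf and last_leaf[key] != leaf:
            q = recent_moves[key]
            q.append(t)
            # Drop moves that fell out of the sliding window.
            while t - q[0] > window:
                q.popleft()
            if len(q) > max_moves:
                flagged.add(bd)
        last_leaf[key] = leaf
    return flagged

FW_MAC = "aa:bb:cc:00:00:01"  # hypothetical shared firewall MAC
events = [(t, "BD1", FW_MAC, "leaf-A" if t % 2 == 0 else "leaf-B") for t in range(8)]
events.append((8, "BD2", FW_MAC, "leaf-A"))
print(bds_over_threshold(events))
```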

Why GARP (Gratuitous ARP) Matters Here

In HA firewall designs, GARP is the preferred signal to notify the network that:

“This MAC/IP is now reachable via a different interface or node.”

In this case:

No explicit evidence of GARP processing was found in ACI logs
Instead, ACI relied on data-plane traffic to relearn the endpoint

This fallback mechanism is known as ARP Glean.

Glean vs GARP: A Subtle but Critical Difference

GARP (Proactive)

Sent immediately during failover
Updates MAC location deterministically
Minimal disruption

Glean (Reactive)

Triggered only when data traffic arrives
Happens independently on each leaf
Can result in alternating MAC learning
Much slower and less stable during failover

Because ACI had to rely on Glean instead of GARP, endpoint learning became reactive and inconsistent across leaf switches—fuelling the mobility storm.
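
For reference, this is what a GARP looks like on the wire: an ARP packet sent to the broadcast address in which sender IP and target IP are the same. The sketch below only builds the bytes (it does not send anything); it shows the request form with a zeroed target MAC, which is one common variant, and the MAC/IP values are placeholders.

```python
import socket
import struct

def build_garp(mac, ip):
    """Build a gratuitous ARP request frame (Ethernet + ARP) as bytes."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    ip_bytes = socket.inet_aton(ip)
    broadcast = b"\xff" * 6
    # Ethernet header: destination, source, EtherType 0x0806 (ARP).
    eth_header = broadcast + mac_bytes + struct.pack("!H", 0x0806)
    # ARP header: Ethernet (1), IPv4 (0x0800), hlen 6, plen 4, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += mac_bytes + ip_bytes        # sender MAC, sender IP
    arp += b"\x00" * 6 + ip_bytes      # target MAC unset, target IP == sender IP
    return eth_header + arp

frame = build_garp("aa:bb:cc:44:55:66", "10.10.10.100")
print(len(frame), frame.hex())
```

When checking a firewall failover in a capture, this is the pattern to look for immediately after the standby takes over.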