ACI Faults

Mukesh Chanderia
Sep 18, 2024

Updated: Feb 9

APIC Maintains System State:

The Application Policy Infrastructure Controller (APIC) maintains a comprehensive, up-to-date run-time representation of the administrative and operational state of the ACI fabric system.
This representation is managed through a collection of managed objects (MOs).

Faults as Managed Objects:

In this model, faults are represented as mutable, stateful, and persistent managed objects (MOs).
When a specific condition occurs (e.g., a component failure or an alarm), the system creates a fault MO.
The fault MO is created as a child object to the MO primarily associated with the fault.

Fault Object Classes and Rules:

Fault conditions are defined by the fault rules of the parent object class.
An MO class can have multiple defined faults, each with a different fault code and fault rule.
- Fault Code: Uniquely identifies a fault definition.
- Fault Rule: Uniquely identifies the fault conditions.
For a given fault code, a parent MO instance can have only one fault MO.

Automatic Fault MO Management:

Fault MOs are automatically created, escalated, de-escalated, and deleted by the system as specific conditions are detected.
If the same condition is detected multiple times while the fault MO is active:
- The properties of the fault MO are updated.
- No additional instances of the fault MO are created.
Fault MOs contain an "occur" property to record how many times a fault condition occurs.
- This property is useful for detecting fault flapping.
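
The "one fault MO per fault code per parent" rule and the occur counter can be illustrated with a small model. This is a minimal sketch, not APIC code: the FaultTable class, its method names, and the DNs are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass
class FaultMO:
    parent_dn: str
    code: str
    severity: str
    occur: int = 1  # number of times the condition has been detected


class FaultTable:
    """Toy store holding at most one fault MO per (parent DN, fault code)."""

    def __init__(self):
        self._faults = {}

    def report(self, parent_dn, code, severity):
        key = (parent_dn, code)
        fault = self._faults.get(key)
        if fault is None:
            # First detection: create the fault MO under its parent.
            fault = FaultMO(parent_dn, code, severity)
            self._faults[key] = fault
        else:
            # Repeat detection: update properties, do not create a new MO.
            fault.severity = severity
            fault.occur += 1
        return fault

    def faults(self):
        return list(self._faults.values())

    def flapping(self, threshold):
        """Faults whose occurrence count has reached the given threshold."""
        return [f for f in self._faults.values() if f.occur >= threshold]


table = FaultTable()
for _ in range(3):
    table.report("topology/pod-1/node-101/sys/phys-[eth1/1]", "F0532", "minor")
table.report("topology/pod-1/node-102/sys/phys-[eth1/2]", "F0532", "minor")
```

In this sketch, three reports of F0532 against the same port leave a single fault object with occur equal to 3, which is the pattern to look for when checking for flapping.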

Triggers for Fault MO Creation:

The creation of a fault MO can be triggered by:
- Internal processes such as finite state machine (FSM) transitions.
- Detected component failures.
- Conditions specified by various fault policies.
  - Some fault policies are user-configurable.
  - For example, setting fault thresholds on statistical measurements like health scores, data traffic, or temperatures.

Persistence of Fault MOs:
- A fault MO remains in the system after the fault condition is cleared until it is deleted under one of the following circumstances:
  - When the parent MO is deleted.
  - When a cleared fault is acknowledged by the user.
  - When a cleared fault has existed longer than the retention interval.

The sections that follow cover:
- Fault Objects and Records
- Fault Severity
- Fault Types
- Fault Properties
- Fault Life Cycle

Fault Objects and Records:
- In the Cisco APIC Management Information Model Reference, the fault package contains fault-related object classes.
Fault Objects:
- Fault objects are represented by two classes:
  - fault:Inst:
    - When a fault occurs in a Managed Object (MO), a fault instance MO (fault:Inst) is created under the MO that experienced the fault condition.
  - fault:Delegate:
    - Used for internal MOs not prominently displayed in the APIC GUI.
    - To improve visibility, a fault delegate MO (fault:Delegate) is created and attached to a higher-visibility logical MO.
    - It is an identical copy of the original fault instance (fault:Inst).
    - The original MO affected by the fault is identified in the fault:Delegate:affected property.
- Example:
  - If the system encounters an issue deploying an endpoint group to a node:
    - A fault:Inst is raised on the node object affected.
    - A corresponding fault:Delegate is raised on the endpoint group object.
    - This allows users to view all faults related to the endpoint group in one place.
Fault Records:
- Purpose:
  - Record the history of state transitions for fault instance objects.
- Creation and Immutability:
  - A fault record object (fault:Record) is created for every fault state change.
  - Fault records are immutable and cannot be modified by users or the system.
  - Creation is triggered by the creation, deletion, or key property modification (e.g., severity, life cycle, acknowledgment) of a fault instance MO.
- Contents:
  - Contains a complete snapshot of the fault instance object at the time of record creation.
  - Includes properties like severity (original, highest, previous), acknowledgment status, occurrence count, and life cycle state.
  - Organized as a flat list under a single container for easy querying.
- Querying and Analysis:
  - Can be queried using time-based filters or property filters such as severity and affected Distinguished Name (DN).
  - Useful for analyzing how a fault object was created and deleted.
- Exporting Records:
  - The creation of a fault record can trigger the export of its details to an external destination via syslog.
- Retention and Purging:
  - Fault records are purged only when maximum capacity is reached and space is needed for new records.
  - A fault record may be retained long after the fault object itself has been deleted, depending on space availability.
  - Retention and purge behavior are specified in the fault record retention policy (fault:ARetP) object.
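
The idea of immutable snapshots that can be filtered by time, severity, or affected DN can be sketched as follows. This is only an illustration under stated assumptions: FaultRecord and query_records are hypothetical names, the timestamps are plain integers, and the sample values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a record cannot be changed once created
class FaultRecord:
    created: int   # epoch seconds (hypothetical representation)
    affected: str  # DN of the affected object
    code: str
    severity: str
    lc: str
    occur: int


def query_records(records, since=None, severity=None, affected_prefix=None):
    """Return the records that match every filter that was supplied."""
    result = []
    for rec in records:
        if since is not None and rec.created < since:
            continue
        if severity is not None and rec.severity != severity:
            continue
        if affected_prefix is not None and not rec.affected.startswith(affected_prefix):
            continue
        result.append(rec)
    return result


# One record per state change of the same fault instance (sample values).
PORT = "topology/pod-1/node-101/sys/phys-[eth1/1]"
history = [
    FaultRecord(1000, PORT, "F0532", "warning", "soaking", 1),
    FaultRecord(1120, PORT, "F0532", "minor", "raised", 1),
    FaultRecord(5000, PORT, "F0532", "cleared", "retaining", 1),
]

recent = query_records(history, since=1100)
```

Reading the matching records in order shows how the fault moved from soaking to raised to retaining, which is the kind of analysis the flat record list is meant to support.
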
Fault Severity Overview:
- A fault raised by the system can move through various severity levels during its life cycle.
- The severities are listed in decreasing order of seriousness.
Severity Levels:
1. Critical:
  - Description: A service-affecting condition that requires immediate corrective action.
  - Example: The managed object is out of service and its capability must be restored promptly.
2. Major:
  - Description: A service-affecting condition that requires urgent corrective action.
  - Example: Severe degradation in the managed object's capability; full functionality needs to be restored.
3. Minor:
  - Description: A non-service-affecting fault that requires action to prevent escalation to a more serious fault.
  - Example: An alarm condition detected that is not currently impacting the managed object's capacity.
4. Warning:
  - Description: A potential or impending service-affecting fault with no significant current impact.
  - Action: Further diagnosis and correction are recommended to prevent escalation.
5. Info:
  - Description: A basic notification or informational message, possibly insignificant on its own.
  - Note: Used only for events.
6. Cleared:
  - Description: Notification that the fault condition has been resolved.
  - Outcome: The fault has been cleared from the system.
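
Because the severities form an ordered scale, they can be modeled as an ordered enumeration when filtering faults in a script. The sketch below is an assumption-laden illustration: the numeric ranks are arbitrary (only the order comes from the list above), and the severities attached to the sample fault codes are invented for the example.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Numeric ranks are an assumption; only the relative order
    # (critical > major > minor > warning > info > cleared) is from the text.
    CLEARED = 0
    INFO = 1
    WARNING = 2
    MINOR = 3
    MAJOR = 4
    CRITICAL = 5


def at_least(faults, minimum):
    """Keep (code, severity) pairs whose severity is at or above `minimum`."""
    return [(code, sev) for code, sev in faults if sev >= minimum]


# Sample data: the severities here are made up for illustration.
faults = [
    ("F0532", Severity.MINOR),
    ("F1545", Severity.WARNING),
    ("F0321", Severity.CRITICAL),
]

serious = at_least(faults, Severity.MINOR)
worst = max(sev for _, sev in faults)
```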

Fault Types Overview:
- A fault raised by the system can be categorized into several types based on the nature of the issue detected.
Types of Faults:
1. Generic:
  - The system has detected a general or unspecified issue.
2. Equipment:
  - Indicates that a physical component is inoperable or has functional problems.
3. Configuration:
  - The system is unable to successfully configure a component.
4. Connectivity:
  - A connectivity issue has been detected, such as an unreachable adapter.
5. Environmental:
  - The system has detected issues related to power, thermal conditions, voltage irregularities, or loss of CMOS settings.
6. Management:
  - A serious management issue has been detected, which may include:
    - Critical services that could not be started.
    - Components with incompatible firmware versions within the instance.
7. Network:
  - A network-related issue has been detected, such as a link being down.
8. Operational:
  - The system has detected an operational problem, such as:
    - Log capacity limits being reached.
    - Failure in component discovery.

The following table summarizes the fault properties:

Property	Description
code	The fault code (e.g., F1017).
rule id	The identifier of the rule that generated the fault instance.
id	The unique identifier assigned to the fault.
cause	The probable cause category (e.g., equipment-inoperable).
type	The type of fault (e.g., connectivity or environmental).
severity	The current severity level of the fault.
created	The date and time when the fault occurred.
lastTransition	The date and time when the severity or life cycle state of the fault last changed.
descr	The description of the fault.
lc	The life cycle state of the fault (e.g., soaking).
occur	The number of times the event that raised the fault has occurred.
origSeverity	The severity assigned to the fault when it first occurred.
prevSeverity	If the severity has changed, this is the previous severity.
highestSeverity	The highest severity encountered for this issue.
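
When these properties are collected as text, they can be turned into dictionaries for further processing. The sketch below assumes a "name : value" layout with each object introduced by a "# class" line, similar to what moquery prints; the sample text, the DN, and the parse_mo_blocks helper are hypothetical.

```python
SAMPLE = """\
# fault.Inst
code             : F0532
created          : 2024-09-18T10:15:00.000+00:00
dn               : topology/pod-1/node-101/sys/phys-[eth1/1]/phys/fault-F0532
lc               : raised
occur            : 3
severity         : minor
"""


def parse_mo_blocks(text):
    """Parse 'name : value' lines into one dict per '# class' header."""
    blocks = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            current = {"class": line.lstrip("# ").strip()}
            blocks.append(current)
        elif ":" in line and current is not None:
            # Split on the first colon only, so timestamps stay intact.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return blocks


blocks = parse_mo_blocks(SAMPLE)
```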

Stateful Fault MOs:
- APIC fault Managed Objects (MOs) are stateful, transitioning through multiple states during their life cycle.
Fault State Transitions:
- Faults transition through different states over time, and their severity may change based on persistence or changes in conditions.
Impact of State Changes:
- Each state change creates a fault record.
- If external reporting is configured, the state change can generate a syslog or external report.
Single Fault MO per Parent MO:
- Only one instance of a given fault MO can exist for each parent MO.
Occurrence Tracking:
- If the same fault reoccurs while the fault MO is still active, the APIC increments the fault occurrence count instead of creating a new fault MO.

Explanation of the fault life cycle states:

Soaking:
- When a fault is first detected, the system creates a fault object and enters the "Soaking" state.
- During this time, the system waits to see if the fault persists or resolves on its own. This waiting period is called the "soaking interval."
Soaking-Clearing:
- If the fault clears during the soaking interval, the system moves to the "Soaking-Clearing" state.
- The system watches to see if the fault comes back. If it does, the system returns to the Soaking state. If not, it moves on to the Retaining state.
Raised:
- If the fault continues beyond the soaking interval, the fault enters the "Raised" state.
- The fault's severity may increase since it's now considered more serious. It stays in this state until the issue is fixed.
Raised-Clearing:
- Once the fault is fixed in the Raised state, it moves to "Raised-Clearing."
- The system checks to ensure the fault doesn’t return during a clearing interval. If it does, the fault returns to the Raised state.
Retaining:
- If the fault doesn't reoccur during the clearing interval, the fault moves to the "Retaining" state.
- The fault stays in the system for a set period (retention interval) so administrators can review it. If the fault doesn’t come back and the retention interval ends, or if the user acknowledges the fault, the system deletes the fault.

These intervals (soaking, clearing, retention) are specified by a fault life cycle profile, which defines how long the system waits in each state.
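
The state descriptions above can be summarized as a small transition table. This is a minimal sketch of the life cycle as described in this article, not the APIC implementation: the event names ("detected", "cleared", "timer", "ack") are hypothetical labels, "timer" stands for whichever interval applies to the current state, and any state/event pair not described above simply leaves the state unchanged.

```python
# (current state, event) -> next state
TRANSITIONS = {
    ("soaking", "cleared"): "soaking-clearing",
    ("soaking", "timer"): "raised",              # soaking interval expired
    ("soaking-clearing", "detected"): "soaking",
    ("soaking-clearing", "timer"): "retaining",  # clearing interval expired
    ("raised", "cleared"): "raised-clearing",
    ("raised-clearing", "detected"): "raised",
    ("raised-clearing", "timer"): "retaining",   # clearing interval expired
    ("retaining", "timer"): "deleted",           # retention interval expired
    ("retaining", "ack"): "deleted",             # user acknowledged the fault
}


def step(state, event):
    """Return the next state, or the same state if the pair is not modeled."""
    return TRANSITIONS.get((state, event), state)


def run(events, state="soaking"):
    """Apply a sequence of events, starting from the initial soaking state."""
    for event in events:
        state = step(state, event)
    return state


# A fault that persists, clears, briefly returns, clears again, and is acknowledged.
final = run(["timer", "cleared", "detected", "cleared", "timer", "ack"])
```

In this walk-through the fault goes soaking, raised, raised-clearing, raised, raised-clearing, retaining, and is finally deleted on acknowledgment.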

Note: A fault life cycle change may not occur on a switch if the subsystem that handles internal messages is overloaded. This can happen, for example, when syslog is set to "debug" mode or when you apply a configuration larger than the switch can handle.

Configuring Fault Life Cycle Intervals

You can adjust three settings related to the fault lifecycle.

Steps:

Go to Fabric > Fabric Policies > Policies > Monitoring > Common Policy > Fault Lifecycle Policy.
In the settings area, you can change these parameters:
- Clearing Interval: Set this from 0 to 3600 seconds. The default is 120 seconds.
- Retention Interval: Set this from 0 to 31,536,000 seconds (365 days). The default is 3600 seconds (one hour).
- Soaking Interval: Set this from 0 to 3600 seconds. The default is 120 seconds.
To see which nodes and policies will be affected by your changes, click Show Usage.
Finally, click Submit to save your changes.
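
If you prepare these values in a script before entering them, a quick range check can catch mistakes early. The sketch below only encodes the ranges and defaults listed above; the function and dictionary names are hypothetical and nothing here talks to an APIC.

```python
# Allowed ranges and defaults, in seconds, as listed in the steps above.
LIMITS = {
    "clearing": (0, 3600),
    "retention": (0, 31_536_000),
    "soaking": (0, 3600),
}
DEFAULTS = {"clearing": 120, "retention": 3600, "soaking": 120}


def validate_intervals(overrides):
    """Merge overrides into the defaults and check every value's range."""
    merged = {**DEFAULTS, **overrides}
    for name, value in merged.items():
        if name not in LIMITS:
            raise KeyError(f"unknown interval: {name}")
        low, high = LIMITS[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low}..{high} seconds")
    return merged


settings = validate_intervals({"retention": 7200})
```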

Viewing Faults

The APIC GUI shows fault information in different ways to help you check system health and troubleshoot specific issues.

Fault Tables

When a fault happens, a fault instance (either fault:Inst or fault:Delegate) is created under the relevant Managed Object (MO).
Each component in the APIC user interface (like a tenant or a fabric node) has a Faults tab that lists all active faults for that MO and its child MOs.
You can see details of a specific active fault by double-clicking its entry in the Faults table. To view past faults, go to the History > Faults tab under the component.

Fault Group View and List View

For components with many faults, there are two ways to view them:
- Group View: Shows one line for each fault code and how many times that fault has occurred. Double-clicking a fault code will show a detailed list of all instances for that code.
- List View: Displays one line for each individual fault instance.
The default view is Group View, but you can switch between the two by clicking the icons in the Faults tab.
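
The difference between the two views amounts to grouping fault instances by code. A minimal sketch, using made-up fault instances and hypothetical helper names:

```python
from collections import Counter

# (fault code, affected DN) pairs; sample values for illustration only.
instances = [
    ("F0532", "topology/pod-1/node-101/sys/phys-[eth1/1]"),
    ("F0532", "topology/pod-1/node-101/sys/phys-[eth1/2]"),
    ("F1545", "topology/pod-1/node-102/sys"),
]


def list_view(faults):
    """One line per individual fault instance."""
    return [f"{code}  {dn}" for code, dn in faults]


def group_view(faults):
    """One entry per fault code, with the number of instances."""
    return Counter(code for code, _ in faults)


groups = group_view(instances)
lines = list_view(instances)
```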

Examples of Group View Locations

System > Faults: Shows all faults for every node in the ACI fabric.
Fabric > Inventory > Pod number > Faults: Shows all faults for MOs in that pod.
Fabric > Inventory > Pod number > node > Faults: Shows all faults for MOs in that specific node.

Fault Counts in Dashboards

The APIC GUI has a Dashboard tab for some components (like tenants or pods) that summarizes health scores and fault counts.
It has panels showing fault counts by domain (like infra or tenant) and by type (like configuration or environmental). You can choose to hide acknowledged or delegated faults.
Each panel shows the total count of faults for different severity levels. Double-clicking a fault in the dashboard takes you to the Faults tab with filtered results.

Examples of Dashboard Locations

System > Dashboard: Shows fault counts for the entire ACI Fabric.
Tenant > name > Dashboard: Shows fault counts for all MOs under that tenant.
Fabric > Inventory > Pod number > Dashboard: Shows fault counts for all MOs in that pod.
Fabric > Inventory > Pod number > node > Dashboard: Shows fault counts for all MOs in that node.

Procedure to View Tenant Faults

Go to Tenant > name > name.
Click the Faults tab to display the faults table.
- If the component supports Group View, you'll see fault codes and their counts. Double-click a fault code to see its instances.
- If it doesn’t support Group View, you'll see all individual fault instances.
To view details of a specific fault, double-click its entry. This opens the Fault Properties window, showing general information, troubleshooting tips, and fault history.
To view fault records:
- Go back to the top-level object (e.g., Tenant > name > name).
- Click the History tab, then the Faults tab.
- Double-click a fault row to view its record.

Viewing Faults Using the NX-OS Style CLI

To see a summary of faults for a specific component, use the show faults command with the right options. Here are some common examples:

show faults – Shows all faults.
show faults controller – Shows faults for the controller.
show faults leaf – Shows faults for leaf nodes.
show faults leaf interface – Shows faults for a leaf node's interface.
show faults spine – Shows faults for spine nodes.
show faults tenant – Shows faults for a tenant.

To see past fault records for a specific component, add the history keyword. For example:

show faults history leaf 101 – Shows the fault history for leaf 101.

Handling Expected Faults

Sometimes, faults occur in the ACI fabric that are harmless at the moment. Here's how to manage these expected faults:

Understanding Expected Faults:
- Example: A fault with code F0532 is raised on a port that is currently down but linked to an endpoint group (EPG).
- Scenario: The port is not in use now but will be used in the future, so this fault can be safely ignored.
Options to Manage Expected Faults:
1. Squelch: Permanently Suppress Specific Faults
  - Purpose: Stop all notifications for a specific fault code permanently.
  - Effect:
    - Faults with the squelched code are removed from dashboards and logs.
    - They do not impact the health score.
  - How to Squelch:
    - From a fault table.
    - Within a monitoring policy.
  - Note: To unsquelch, you must manually remove the suppression. Faults that were squelched during the suppression period will not be visible.
2. Acknowledge: Temporarily Ignore Specific Faults
  - Purpose: Temporarily ignore a fault for a specific object (identified by its distinguished name or DN).
  - Benefits:
    - Marks the fault as known, allowing users to ignore related notifications.
    - Deletes the fault before the retention policy removes it automatically.
  - Behavior:
    - If the fault is in the retaining life cycle state, it is deleted immediately upon acknowledgment.
    - Otherwise, it is deleted after the retention interval expires.
    - If the fault reoccurs after being acknowledged, you need to acknowledge it again.
  - Health Score Impact:
    - By default, acknowledging a fault does not change its effect on the health score; the fault still counts.
    - You can choose to exclude acknowledged faults from the health score evaluation.
  - GUI Visibility:
    - You can opt to hide acknowledged faults from the GUI.
When to Use Each Option:
- Squelch: When you know a fault will occur regularly and want to ignore it permanently.
- Acknowledge: When you want to temporarily ignore a fault, especially if you expect the condition to resolve or reoccur sporadically.

Advantages and Disadvantages of Acknowledge vs. Squelch Methods

Granularity of Control

Acknowledge:
- Advantage (+): Provides full control; any specific fault can be acknowledged individually.
- Disadvantage (-): Each fault must be acknowledged one at a time.
Squelch:
- Advantage (+): Allows you to squelch all faults with the same fault code in one setting.
- Disadvantage (-): Not suitable if you need to monitor some faults with that fault code; squelching affects all faults with that code.

Consistency

Acknowledge:
- Advantage (+): Acknowledgment status resets automatically when a fault clears; no user action is needed after the fault resolves.
- Disadvantage (-): If the fault reappears intermittently, it must be acknowledged every time it occurs.
Squelch:
- Advantage (+): An intermittent fault needs to be squelched only once.
- Disadvantage (-): User must remember to unsquelch the fault later if needed; no automatic reset.

Visibility

Acknowledge:
- Advantage (+): Option to hide acknowledged faults.
- Disadvantage (-): Must add a filter to hide acknowledged faults from monitoring.
Squelch:
- Advantage (+): Squelched faults do not appear in monitoring without needing a filter.
- Disadvantages (-):
  - No method to notify the user about fault conditions for squelched faults on any Managed Object (MO).
  - No indication that a fault has been squelched; user must remember to unsquelch it if necessary.

Health Score Impact

Acknowledge:
- Option Available: You can choose whether an acknowledged fault affects the health score.
Squelch:
- Effect: A squelched fault does not affect the health score.
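
The health score rules for the two methods can be written as a single predicate. This is a sketch of the behavior described in this comparison, not of how the APIC computes health scores; the field names and the ignore_acked flag (standing in for the Ignore Acknowledged Faults option) are assumptions.

```python
def counts_toward_health(fault, ignore_acked=False):
    """Decide whether a fault contributes to the health score.

    fault is a dict with boolean 'squelched' and 'ack' fields
    (a hypothetical representation for this example).
    """
    if fault["squelched"]:
        return False  # squelched faults never affect the health score
    if fault["ack"] and ignore_acked:
        return False  # acknowledged faults are excluded only when opted in
    return True


acked = {"code": "F0532", "squelched": False, "ack": True}
squelched = {"code": "F1545", "squelched": True, "ack": False}
plain = {"code": "F0321", "squelched": False, "ack": False}
```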

Acknowledging Faults

What It Does:
- Immediate Deletion: Acknowledging a fault in the 'retaining' state deletes it right away instead of waiting for the default retention period of one hour.
- Marking Faults: You can acknowledge faults in other states to mark them as expected or to ignore them temporarily.
How to Acknowledge a Fault:
1. Navigate to the Affected Area:
  - Go to the relevant section in the GUI (e.g., Tenant, Fabric, or Access) where the fault is present.
2. Access the Faults Tab:
  - Click on the Faults tab in the work pane.
3. Locate the Fault:
  - Find the fault code you want to acknowledge in the Faults table and double-click it to see fault instances.
4. Acknowledge the Fault:
  - Check the Acked box next to the fault instance to acknowledge it. If the fault is in the retaining state, it is deleted immediately.
Quick Tip:
- Acknowledge All: Use the Acknowledge All checkbox in the table toolbar to acknowledge all fault instances at once.
- Unacknowledge All: Similarly, use the Un-Acknowledge All checkbox to remove acknowledgments from all fault instances.

Ignoring Acknowledged Faults

Purpose:
- Prevent acknowledged faults from affecting the overall health score of the ACI fabric.
How to Ignore Acknowledged Faults:
1. Navigate to Health Score Policies:
  - Go to Fabric > Fabric Policies > Policies > Monitoring > Common Policy > Health Score Evaluation Policies > Health Score Evaluation Policy.
2. Enable Ignoring:
  - In the work pane, check the box for Ignore Acknowledged Faults.

Hiding Acknowledged and Delegated Faults

Why Hide Faults:
- To reduce clutter and view only relevant faults by hiding acknowledged or delegated faults.
How to Hide Faults:
1. Go to Dashboard or Fault Table:
  - Navigate to any dashboard or fault table (e.g., System > Dashboard or Tenant > [name] > Dashboard).
2. Access Filter Options:
  - In the Fault Counts By Domain or Fault Counts By Type panel, or in the fault table toolbar, click the tools icon.
3. Apply Filters:
  - Hide Acknowledged Faults: Check the box for Hide Acknowledged Faults.
  - Hide Delegated Faults: Check the box for Hide Delegated Faults.
Quick Tip:
- When both a fault and its delegated version are present, hiding delegated faults gives a more accurate fault count.
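
The effect of the two check boxes on a fault count can be sketched as a simple filter. The records below are invented, and the boolean fields are a simplification for this example.

```python
def visible(faults, hide_acked=False, hide_delegated=False):
    """Return the faults that remain after applying the two hide options."""
    shown = []
    for fault in faults:
        if hide_acked and fault["ack"]:
            continue
        if hide_delegated and fault["delegated"]:
            continue
        shown.append(fault)
    return shown


# One underlying problem appears twice: once as the original fault on the
# node and once as its delegate on the endpoint group (sample data).
faults = [
    {"code": "F0467", "on": "node", "ack": False, "delegated": False},
    {"code": "F0467", "on": "epg", "ack": False, "delegated": True},
    {"code": "F0532", "on": "port", "ack": True, "delegated": False},
]

all_count = len(visible(faults))
without_delegates = len(visible(faults, hide_delegated=True))
```

With delegates hidden, the count drops from 3 to 2, so the single underlying F0467 problem is counted once rather than twice.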

Changing the Severity or Squelching a Fault

Purpose:
- Adjust the importance of a fault or stop it from appearing in reports and dashboards.
Where to Change Severity or Squelch:
- From the Faults Tab:
  - Directly within the Faults tab of a component in the APIC GUI.
- From a Monitoring Policy:
  - Through a monitoring policy in the Fabric Policies.
How to Change Severity or Squelch from the Faults Tab:
1. Navigate to Faults Tab:
  - Go to the Faults tab that shows the fault instance.
2. Choose an Action:
  - Change Severity:
    - Right-click the fault code row, select Change Severity, choose the new severity level, and click Change Severity.
  - Squelch Fault:
    - Right-click the fault code row, select Ignore Fault, and click Ignore Fault.
3. Confirm Action:
  - A dialog box will appear showing the affected monitoring policy. Confirm the action.
After Squelching:
- A squelch policy is created automatically. To unsquelch, locate and delete this policy under:
  - Tenants > common > Policies > Monitoring > default
  - Fabric > Access Policies > Policies > Monitoring > default
  - Fabric > Fabric Policies > Policies > Monitoring > default
  - Or any non-default monitoring policies you have created.
How to Change Severity or Squelch from the Monitoring Policy:
1. Determine Affected Object Class:
  - Identify the object class related to the fault (e.g., infra:WiNode for fault code F0321).
2. Navigate to Monitoring Policy:
  - Go to Tenants > common > Policies > Monitoring > default or the relevant monitoring policy location.
3. Modify Fault Severity Assignment:
  - Expand the monitoring policy, select Fault Severity Assignment Policies, click Actions > Modify Fault Severity Assignment Policies.
4. Select Object Class:
  - Choose the appropriate object class from the Monitoring Object drop-down.
  - If not listed, click the Edit (pencil) icon to add it.
5. Create Policy:
  - Click + to add a new fault severity assignment.
  - Select the fault code, set the Initial Severity and Target Severity (select squelched to suppress).
  - Optionally, add a comment and click Update to save.

Monitoring a Specific Object Class or Fault Code

Purpose:
- To monitor specific faults or object classes and send fault logs to external servers like Syslog.
How to Monitor Specific Faults:
1. Create a Monitoring Source Policy:
  - Go to Fabric > Fabric Policies > Policies > Monitoring > default > [Source Type] (e.g., Syslog).
2. Choose Source Type:
  - Select the desired source type (e.g., Syslog).
3. Select Monitoring Object:
  - Choose the object class associated with the fault.
4. Set Scope:
  - All Faults: Select all to monitor every fault in the object class.
  - Specific Fault: Select specific fault and choose the fault code.
5. Configure Source:
  - Click + to add a new monitoring source.
  - Name the source, set Min Severity, check Faults, choose the Dest Group, and click Submit.
6. Repeat as Needed:
  - Create additional monitoring sources for other fault codes or object classes as required.
Important Notes:
- For single fault codes, create a corresponding fault severity assignment policy with inherit severity.
- Ensure no conflicting monitoring sources are set up to avoid unwanted fault messages.

Summary

Acknowledging Faults:
- Deletes faults immediately or marks them to be ignored.
- Can be done individually or all at once.
Ignoring Faults:
- Excludes acknowledged faults from health score calculations.
Hiding Faults:
- Removes acknowledged or delegated faults from view in dashboards and fault tables.
Changing Severity/Squelching:
- Adjusts how faults are reported and their impact on system health.
- Can be done directly from the Faults tab or through monitoring policies.
Monitoring Specific Faults:
- Allows detailed tracking and external reporting of particular faults or object classes.

Why do suppressed/ignored faults (F1545, F1547) still show up?

Fault suppression (ignore/squelch) in Fault Severity Assignment Policies does not stop the system from detecting or logging the fault.
- The fault will still be generated internally and appear in the fault database (System → Faults).
- What suppression does: it prevents the fault from raising alarms (SNMP traps, syslog, external monitoring), and can downgrade severity (so it doesn’t escalate as major/critical).
System → Faults tab always shows raw fault objects.
- APIC is designed to display all faults for visibility, even cosmetic ones.
- The suppression policy changes their severity level or visibility in dashboards/health scores/alerts, but not the fact that they exist.
“Ignore” is not the same as “remove.”
- Ignoring a fault does not delete it; the fault remains in the fault DB until the system clears it (when the condition is gone).
- Cosmetic ones (like F1545, F1547) often persist until the software bug/condition is fixed or the object triggering it is removed.

By following these steps, you can effectively manage and control how faults are handled within your ACI fabric, ensuring that expected or non-critical faults do not overwhelm your monitoring and reporting systems.

How to confirm suppression worked

Go to Fabric → Inventory → Faults Dashboard.
- You’ll notice the suppressed fault won’t contribute to Health Score impact.

CLI check:

moquery -c faultInst -f 'fault.Inst.code=="F1545"'

The fault will still be present, but severity may show as squelched or warning depending on your assignment policy.

Important Distinction

Visible in System → Faults: Yes (always, for transparency).
Contributing to alarms/health/NMS traps: No (if squelched/ignored).

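Show only ACTIVE (not cleared) F1820 faults

moquery -c faultInst -f 'fault.Inst.code=="F1820" and fault.Inst.severity!="cleared"' | egrep "dn|lc|severity"

Show only CLEARED F1820 faults

moquery -c faultInst -f 'fault.Inst.code=="F1820" and fault.Inst.severity=="cleared"' | egrep "dn|lc|severity"

These filters test the severity property because a fault whose condition is gone is reported with a severity of cleared; the lc property shows its life cycle state (for example, retaining).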
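
The same kind of query can also be expressed as an APIC REST class query with a query-target-filter. The sketch below only builds the URL; the hostname is a placeholder, authentication is omitted, and it assumes that a cleared fault is reported with a severity of cleared.

```python
def fault_query_url(apic_host, code, cleared):
    """Build a class-level query URL for fault instances with a given code.

    cleared=True selects faults whose severity is 'cleared';
    cleared=False selects all other faults with that code.
    """
    op = "eq" if cleared else "ne"
    flt = f'and(eq(faultInst.code,"{code}"),{op}(faultInst.severity,"cleared"))'
    return f"https://{apic_host}/api/class/faultInst.json?query-target-filter={flt}"


# apic.example.com is a hypothetical hostname.
active_url = fault_query_url("apic.example.com", "F1820", cleared=False)
cleared_url = fault_query_url("apic.example.com", "F1820", cleared=True)
```

An HTTP client would normally percent-encode the quotation marks when sending the request, and the call needs an authenticated session, which is outside the scope of this sketch.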

Reference: https://www.cisco.com/c/en/us/td/docs/switches/datacenter/aci/apic/sw/all/faults/guide/b_APIC_Faults_Errors/b_IFC_Faults_Errors_chapter_01.html