The logs for the DME processes are saved in /var/log/dme/log.
ISO image installation logs
While the upgrade is running, the installer log is written to /root/insieme_installer.log. Once the installation is complete, it is moved to /firmware/logs/****/insieme_installer.log.
The data conversion logs are in /firmware/dataconv.log and /firmware/dataconv_detail.log.
On switches, the installer log is saved in /mnt/pss/installer.log.
On switches, DME logs are saved in /var/sysmgr/tmp_logs/.
To check if any upgrade processes are running:
apic1# ps -ef | egrep -i "install|atom|dataconv"
The Appliance Element (AE) process, which runs on the APIC, is responsible for triggering the upgrade on the APIC.
Check whether AE started the installer.py process on the APIC:
apic1# cd /var/log/dme/log/
apic1# zgrep "installer.py" svc_ifc_ae.bin.log*
Does an insieme_4x_installer.log file exist? If it exists, check whether the upgrade is still running or whether there were any errors.
apic1# cd /firmware/logs/<timestamp>/
apic1# tail -n 50 insieme_4x_installer.log
Does an atom_installer.log file exist? If it exists, check whether the upgrade is still running or whether there were any errors.
apic1# cd /firmware/logs/<timestamp>
apic1# tail -n 50 atom_installer.log
Did the APIC start and finish data conversion?
apic1# cd /firmware/logs/<timestamp>/
apic1# egrep "dataconversion|dataconvert" atom_installer.log
Are there any recent errors in dmesg on the APIC, such as a process running out of memory?
apic1# dmesg
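The dmesg check above can be narrowed to out-of-memory events. The following is a minimal sketch, assuming the dmesg output has been captured as text and that the kernel uses the usual Linux wording for these events ("Out of memory", "oom-killer", "oom_kill"). The function name find_oom_lines is hypothetical and is not an APIC tool.

```python
import re

# Phrases the Linux kernel commonly prints for out-of-memory events.
# Assumption: the APIC kernel uses this standard wording.
OOM_PATTERN = re.compile(r"out of memory|oom-killer|oom_kill", re.IGNORECASE)


def find_oom_lines(dmesg_text):
    """Return the dmesg lines that mention an out-of-memory event."""
    return [line for line in dmesg_text.splitlines() if OOM_PATTERN.search(line)]
```

Passing the saved output of dmesg to find_oom_lines returns only the lines worth reading first; an empty list means none of the assumed phrases were found.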
Abort an APIC Upgrade
DO NOT reload, decommission, or restart processes on any APIC until the upgrade has been stopped on all APICs.
1. Restart AE on the local APIC to stop the ongoing installer process.
APIC# ps -ef --forest | egrep "ifc_ae|installer"
APIC# acidiag restart ae
2. Roll back the atomix upgrade on the local APIC so that it is restored to its previous working state.
APIC# atomix rewind boot
3. Validate that the upgrade is marked as failed (completeNok) for the local APIC.
APIC# moquery -c maintUpgJob
5. Change the cluster version back to the original firmware version (the version the APICs are currently running).
apic1# configure
apic1(config)# firmware
apic1(config-firmware)# controller-group
apic1(config-firmware-controller)# firmware-version
6. Verify the cluster version.
apic1# acidiag avread
Stop AE and kill installer.py
acidiag stop ae
ps -ef | grep installer.py    # note the PID of the installer.py process
kill -9 <pid>                 # use the PID found by the previous command
Set the cluster version back to 4.2(4o) from the APIC GUI/CLI, and start AE on APIC2 and APIC3.
apic3# bash
apic3> cd /mit/uni/controller/ctrlrfwpol
apic3> moset version 'apic-4.2(4o)'
apic3> moconfig commit
apic2> acidiag start ae
apic3> acidiag start ae
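When looking up the installer.py PID in the procedure above, the grep command itself also appears in the ps output. The following is a minimal sketch of picking out the right PID, assuming the standard ps -ef column layout (UID PID PPID C STIME TTY TIME CMD). The function name installer_pids is hypothetical.

```python
def installer_pids(ps_output, needle="installer.py"):
    """Return PIDs of processes whose command line contains `needle`.

    Expects the text printed by `ps -ef`. Lines for the grep command
    itself are ignored.
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[1].isdigit():
            continue  # skip the header and malformed lines
        command = fields[7]
        if needle in command and not command.startswith("grep"):
            pids.append(int(fields[1]))
    return pids
```

The returned list should normally hold a single PID; more than one entry is a reason to inspect the process tree before killing anything.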
CSCvs14967: memory leak
The AE process periodically causes a restart.
Workaround:
1) acidiag restart ae
2) Reload the Cisco APIC
If the AE process is locked, the APIC firmware upgrade will not start. The AE process queries the chassis IPMI every 10 seconds.
To check the last query made by the AE process, use the following commands:
APIC# date
APIC# grep "ipmi" /var/log/dme/log/svc_ifc_ae.bin.log | tail -5
Check whether the last query falls within 10 seconds of the system time.
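The comparison between the last IPMI query and the system time can be sketched as follows. This is a minimal example under stated assumptions: the timestamp layout of svc_ifc_ae.bin.log is not shown in these notes, so the caller is assumed to have already extracted the last query time and the output of date as datetime values. The function name query_is_recent is hypothetical.

```python
from datetime import datetime


def query_is_recent(last_query, system_time, window_seconds=10):
    """Return True if last_query lies within window_seconds of system_time.

    Both arguments must be datetime objects of the same kind
    (both timezone-aware or both naive).
    """
    return abs((system_time - last_query).total_seconds()) <= window_seconds


# Example values only; real values come from the AE log and from `date`.
now = datetime.fromisoformat("2022-09-02T02:21:17+00:00")
last = datetime.fromisoformat("2022-09-02T02:21:10+00:00")
print(query_is_recent(last, now))
```

A False result suggests that AE has stopped polling IPMI and may be locked.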
Case 1: APIC1 upgrade fails or is stuck
Never attempt to manually upgrade APIC2 or APIC3.
Case 2: APIC1 is upgraded successfully, but APIC2 is still stuck at 75%
APIC1 upgrade version information is propagated to APIC2 by the svc_ifc_appliance_director process.
Go to /firmware/logs/2022-09-02T02:21:07-80:
apic1# pwd
/firmware/logs/2022-09-02T02:21:07-80
apic1# ls -l
total 2976
-rwxr-xr-x 1 root root 2985605 Sep 2 02:40 atom_installer.log
-rwxr-xr-x 1 root root 49969 Sep 2 02:23 insieme_4x_installer.log
apic1# tail insieme_4x_installer.log
Check whether the log shows the image being copied:
2022-09-02 02:23:05,228|INFO|12616|installer:156 /mgmt/support/insieme/atomix/atomix_installer.py -l /firmware/logs/2022-09-02T02:21:07-80/atom_installer.log -r -d atomixonly -f /var/run/mgmt/fwrepos/fwrepo/aci-apic-dk9.4.2.4o.bin
apic1# tail atom_installer.log
root 31422 0.0 0.0 0 0 ? I< 02:28 0:00 [kdmflush]
root 31423 0.0 0.0 0 0 ? I< 02:28 0:00 [bioset]
root 31503 0.0 0.0 0 0 ? I 02:28 0:00 [kworker/6:2]
root 31504 0.0 0.0 0 0 ? I 02:28 0:00 [kworker/2:0]
root 31795 0.0 0.0 0 0 ? I 02:28 0:00 [kworker/7:1]
root 32450 0.0 0.0 0 0 ? I< 02:29 0:00 [kdmflush]
root 32451 0.0 0.0 0 0 ? I< 02:29 0:00 [bioset]
2022-09-02 02:40:24,588|INFO|3645|atomix_installer:48 Installation completed, rebooting
2022-09-02 02:40:24,597|INFO|3645|install_utils:89 reboot -f
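The installer log entries above follow a "timestamp|LEVEL|pid|module:line message" layout, mixed with raw process-table lines. The following is a minimal sketch of parsing them to see whether the installation reported completion, assuming every real log entry follows the layout shown in the two sample entries. The function names are hypothetical.

```python
from datetime import datetime


def parse_installer_line(line):
    """Split 'timestamp|LEVEL|pid|module:lineno message' into a dict.

    Returns None for lines that do not follow this layout, such as the
    process-table lines interleaved in atom_installer.log.
    """
    parts = line.rstrip("\n").split("|", 3)
    if len(parts) != 4:
        return None
    stamp, level, pid, rest = parts
    try:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")
        pid_number = int(pid)
    except ValueError:
        return None
    location, _, message = rest.partition(" ")
    return {"time": when, "level": level, "pid": pid_number,
            "location": location, "message": message}


def installation_completed(lines):
    """Return True if any log entry reports 'Installation completed'."""
    for line in lines:
        entry = parse_installer_line(line)
        if entry and "Installation completed" in entry["message"]:
            return True
    return False
```

Calling installation_completed on the lines of atom_installer.log gives a quick yes or no before reading the file in detail.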
The Commands
show controller
avread
acidiag avread
show firmware upgrade status
ps -ef | grep installer
ps -ef | grep atom
find /data/db/ -type f -size +10M -exec ls -lh {} \;
systemctl list-units --type=service | grep failed
Physical CIMC connectivity
VIC 1455 supports 10/25-Gigabit with the following restrictions:
All ports must have the same speed.
Port-1 and port-2 form one pair, corresponding to eth2-1 on the APIC.
Port-3 and port-4 form another pair, corresponding to eth2-2 on the APIC.
Only one connection is allowed for each pair.
CSCvq66442: APIC M3/L3 does not support connecting ports 1 and 2 on the VIC towards the fabric.
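The VIC 1455 restrictions above can be expressed as a small cabling check. This is a minimal sketch, assuming the cabling is described as a mapping of VIC port number to link speed in Gbps. The function name check_vic1455_cabling and the input format are hypothetical.

```python
# Port-to-interface pairing as listed above.
PORT_TO_APIC_INTERFACE = {1: "eth2-1", 2: "eth2-1", 3: "eth2-2", 4: "eth2-2"}


def check_vic1455_cabling(connected_ports):
    """Check cabled VIC 1455 ports against the listed restrictions.

    connected_ports maps VIC port number -> speed in Gbps for each cabled
    port. Returns a list of problems; an empty list means no violation.
    """
    problems = []
    for port, speed in connected_ports.items():
        if port not in PORT_TO_APIC_INTERFACE:
            problems.append(f"port {port}: not a VIC 1455 port (expected 1-4)")
        if speed not in (10, 25):
            problems.append(f"port {port}: unsupported speed {speed}G")
    if len(set(connected_ports.values())) > 1:
        problems.append("ports run at different speeds")
    used = {}
    for port in sorted(connected_ports):
        interface = PORT_TO_APIC_INTERFACE.get(port)
        if interface is None:
            continue
        if interface in used:
            problems.append(
                f"ports {used[interface]} and {port} both map to {interface}")
        else:
            used[interface] = port
    return problems
```

For example, cabling ports 1 and 3 at 25G passes, while cabling ports 1 and 2 is reported because both belong to the eth2-1 pair.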
LLDP:
LLDP is a Layer 2, vendor-neutral neighbor discovery protocol.
LLDP capture
tcpdump -vvnni kpm_inb "ether proto 0x88cc"
LLDP config
root@apic# cat /data/lldp/lldpad.conf
Leaf Commands
show lldp nei
cat /mit/sys/lldp/inst/if-\[eth1--3\]/summary
show sys int lldp info int e1/1
show lldp nei interface eth 1/3 detail
LLDP must be disabled on the CIMC.
Cluster Mode
PERMISSIVE: APICs (with any serial number) can be auto-discovered and allowed to join the cluster.
STRICT: User approval of the serial number is required before an APIC can join the cluster.
apic1# acidiag avread
Local appliance ID=1 ADDRESS=10.0.0.1
TEP ADDRESS=10.0.0.0/16
ROUTABLE IP ADDRESS=0.0.0.0
CHASSIS_ID=50ca8c6a-fe3e-11eb-b275-2b4d0343ce76
Cluster of 3 lm(t):1(2022-01-20T20:38:51.658+12:00)
appliances (out of targeted 3 lm(t):3(2022-01-20T21:31:08.921+12:00))
FABRIC_DOMAIN name=POD05
set to version=apic-4.2(6h) lm(t):1(2022-03-05T22:09:37.663+12:00);
discoveryMode=PERMISSIVE lm(t):0(1970-01-01T12:00:00.003+12:00);
drrMode=OFF lm(t):0(1970-01-01T12:00:00.003+12:00);
kafkaMode=OFF lm(t):1(2022-03-05T21:56:15.128+12:00)
apic1# show controller
Fabric Name : POD05
Operational Size : 3
Cluster Size : 3
Time Difference : -2740812
Fabric Security Mode : PERMISSIVE
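The cluster-wide settings in the acidiag avread output above (target version, discovery mode and so on) can be pulled out with a short parser. This is a minimal sketch, assuming the output keeps the key=value layout shown in the sample. Only the first occurrence of each key is kept, on the assumption that the cluster-wide header precedes any per-appliance lines. The function name parse_avread_settings is hypothetical.

```python
import re

# Keys taken from the sample avread header above.
FIELD_PATTERN = re.compile(r"\b(version|discoveryMode|drrMode|kafkaMode)=(\S+)")


def parse_avread_settings(avread_text):
    """Return the cluster-wide settings found in `acidiag avread` output."""
    settings = {}
    for key, value in FIELD_PATTERN.findall(avread_text):
        # Keep the first occurrence: the cluster-wide header comes first.
        settings.setdefault(key, value.rstrip(";"))
    return settings
```

Comparing settings["version"] with the intended firmware version is a quick way to confirm the cluster target, and settings["discoveryMode"] shows whether the cluster is in PERMISSIVE or STRICT mode.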
Wiring issues
leaf# cat /mit/sys/lldp/inst/if-\[eth1--3\]/summary
# LLDP Interface
id : eth1/3
adminRxSt : enabled
adminSt : enabled
adminTxSt : enabled
childAction :
descr :
dn : sys/lldp/inst/if-[eth1/3]
lcOwn : local
mac : 00:3A:9C:81:5D:43
modTs : 2022-03-17T21:28:52.150+05:30
monPolDn : uni/fabric/monfab-default
name :
operRxSt : up
operTxSt : up
portDesc : topology/pod-1/paths-102/pathep-[eth1/3]
portMode : normal
portVlan : unspecified
rn : if-[eth1/3]
status :
sysDesc : topology/pod-1/node-102
wiringIssues : ctrlr-uuid-mismatch
leaf# moquery -c lldpIf -f 'lldp.If.wiringIssues!=""'
# lldp.If
id : eth1/3
adminRxSt : enabled
adminSt : enabled
adminTxSt : enabled
childAction :
descr :
dn : sys/lldp/inst/if-[eth1/3]
lcOwn : local
mac : 00:3A:9C:81:5D:43
modTs : 2022-03-17T21:28:52.150+05:30
monPolDn : uni/fabric/monfab-default
name :
operRxSt : up
operTxSt : up
portDesc : topology/pod-1/paths-102/pathep-[eth1/3]
portMode : normal
portVlan : unspecified
rn : if-[eth1/3]
status :
sysDesc : topology/pod-1/node-102
wiringIssues : ctrlr-uuid-mismatch
fabric-domain-mismatch – Adjacent node belongs to a different fabric
ctrlr-uuid-mismatch – APIC UUID mismatch (duplicate APIC ID)
wiring-mismatch – Invalid connection (Leaf to Leaf, Spine to non-leaf, Leaf fabric port to non-spine etc.)
adjacency-not-detected – No LLDP adjacency on fabric port
infra-vlan-mismatch – Infra VLAN mismatch between leaf and APIC.
pod-id-mismatch – Pod ID mismatch between APIC and Leaf
unapproved-ctrlr – The SSL handshake between APIC and connected leaf is not completed.
unapproved-serialnumber – Detected a node that is not present in FNV.
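The list of wiring issues above can be turned into a lookup that annotates the wiringIssues field of the LLDP interface output. This is a minimal sketch, assuming the "key : value" layout shown in the summary output and assuming that several issues, if present, are separated by commas. The function name explain_wiring_issues is hypothetical.

```python
# Meanings copied from the list above.
WIRING_ISSUES = {
    "fabric-domain-mismatch": "Adjacent node belongs to a different fabric",
    "ctrlr-uuid-mismatch": "APIC UUID mismatch (duplicate APIC ID)",
    "wiring-mismatch": "Invalid connection (leaf to leaf, spine to non-leaf, "
                       "leaf fabric port to non-spine, etc.)",
    "adjacency-not-detected": "No LLDP adjacency on fabric port",
    "infra-vlan-mismatch": "Infra VLAN mismatch between leaf and APIC",
    "pod-id-mismatch": "Pod ID mismatch between APIC and leaf",
    "unapproved-ctrlr": "SSL handshake between APIC and connected leaf "
                        "is not completed",
    "unapproved-serialnumber": "Detected a node that is not present in FNV",
}


def explain_wiring_issues(summary_text):
    """Return (issue, meaning) pairs for each wiringIssues value found."""
    results = []
    for line in summary_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() != "wiringIssues":
            continue
        # Assumption: multiple issues are comma-separated.
        for issue in value.split(","):
            issue = issue.strip()
            if issue:
                results.append((issue, WIRING_ISSUES.get(issue, "unknown issue")))
    return results
```

Feeding it the summary output for eth1/3 shown earlier yields the ctrlr-uuid-mismatch entry together with its meaning.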
CSCvn97719 & CSCvn97710
UUID
The UUID is a unique identifier that each APIC generates for itself. When the cluster forms, each APIC ID and its UUID are locked to each other.
The UUID is regenerated every time the APIC is cleaned up.
The fabric detects a UUID mismatch for an APIC after a clean reboot and does not allow it to join the cluster.
Solution: To disassociate an APIC ID from its UUID, the APIC must be decommissioned and then commissioned again.
Decommissioning removes the chassis ID across the fabric.
leaf1# acidiag avread | grep chassisId | cut -d ' '
Validate that the leaf has removed the old APIC3 chassis ID.
Commission APIC3 from APIC1 or APIC2.
There is no need to clean the APIC.
Note: Another way to recover from the UUID mismatch is to use the av.bin erase command.
Docker IP Conflict
Docker uses the IP range 172.17.0.0/16 by default. As a result, the APIC cannot reach this subnet, because the route to it uses the docker0 interface as its next hop.
The APIC TEP addresses should not use this subnet; doing so can cause the APIC to become diverged.
Bugs: CSCve84297, CSCvq97675
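Whether a TEP pool collides with the Docker default range can be checked before deployment. This is a minimal sketch using the Python standard library ipaddress module, assuming the TEP pool is given as a CIDR string such as the TEP ADDRESS value in the avread output. The function name overlaps_docker_default is hypothetical.

```python
import ipaddress

# Docker's default range, as stated above.
DOCKER_DEFAULT = ipaddress.ip_network("172.17.0.0/16")


def overlaps_docker_default(tep_pool):
    """Return True if the TEP pool (CIDR string) overlaps 172.17.0.0/16."""
    pool = ipaddress.ip_network(tep_pool, strict=False)
    return pool.overlaps(DOCKER_DEFAULT)


print(overlaps_docker_default("10.0.0.0/16"))    # TEP pool from the sample avread output
print(overlaps_docker_default("172.16.0.0/12"))  # a wider range that contains 172.17.0.0/16
```

Using overlaps rather than an equality test also catches larger pools that merely contain the Docker range.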
MegaSAS
CSCvn13119
Echo the MegaSAS file to fix the storage issue
SSD Wearout
apic# grep -oE "SSD Wearout Indicator is [0-9]+" /var/log/dme/log/svc_ifc_ae.bin.log | tail -1
SSD Wearout Indicator is 6 (when the value is less than 5, the SSD must be replaced through RMA)
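The wearout check above can be sketched as a small parser that takes the most recent indicator from the AE log text and compares it with the threshold of 5. This is a minimal example that reuses the message wording from the grep command. The function names are hypothetical.

```python
import re

# Same message wording as the grep pattern above.
WEAROUT_PATTERN = re.compile(r"SSD Wearout Indicator is (\d+)")


def latest_ssd_wearout(log_text):
    """Return the most recent wearout indicator in the log, or None."""
    matches = WEAROUT_PATTERN.findall(log_text)
    return int(matches[-1]) if matches else None


def ssd_needs_rma(indicator, threshold=5):
    """Return True when the indicator is below the RMA threshold."""
    return indicator < threshold
```

With the value 6 shown above, ssd_needs_rma returns False; a value of 4 or lower returns True.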
Switch : will require file 1 of 3 & simply use the tool.