Mukesh Chanderia

ACI Upgrade

On the APIC, the logs for the DME processes are saved in /var/log/dme/log.


ISO image installation Logs


While the upgrade is running, the installer log is written to /root/insieme_installer.log. Once the installation is complete, it is moved to /firmware/logs/****/insieme_installer.log.


The data conversion logs are in /firmware/dataconv.log and /firmware/dataconv_detail.log.


On switches, the installer log is saved in /mnt/pss/installer.log, and the DME logs are saved in /var/sysmgr/tmp_logs/.


To check whether any upgrade processes are running:


apic1# ps -ef | egrep -i "install|atom|dataconv"
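
The one-liner above can also match the egrep command itself. A minimal sketch of a filter that reads `ps -ef` output on standard input and drops that noise follows. The function name `filter_upgrade_procs` is hypothetical and is not an APIC tool.

```shell
# Hypothetical helper: read `ps -ef` output on stdin and print only the
# lines for upgrade-related processes, skipping grep/egrep itself.
filter_upgrade_procs() {
    grep -Ei 'install|atom|dataconv' | grep -Ev '(^|[[:space:]])e?grep[[:space:]]'
}

# Usage on the APIC (assumed): ps -ef | filter_upgrade_procs
```

An empty result suggests that no installer, atomix, or data conversion process is currently running.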


The Appliance Element (AE) process, which runs on the APIC, is responsible for triggering the APIC upgrade.


Check whether AE started the installer.py process on the APIC:


apic1# cd /var/log/dme/log/


apic1# zgrep "installer.py" svc_ifc_ae.bin.log*


Does an insieme_4x_installer.log file exist? If it does, check whether the upgrade is still running or whether there were any errors.


apic1# cd /firmware/logs/<timestamp>/


apic1# tail -n 50 insieme_4x_installer.log


Does an atom_installer.log file exist? If it does, check whether the upgrade is still running or whether there were any errors.


apic1# cd /firmware/logs/<timestamp>


apic1# tail -n 50 atom_installer.log


Did the APIC start and finish data conversion?


apic1# cd /firmware/logs/<timestamp>/


apic1# egrep "dataconversion|dataconvert" atom_installer.log


Are there any recent errors in dmesg on the APIC, such as a process running out of memory?


apic1# dmesg
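
The dmesg output is long, so it helps to filter it for out-of-memory events. The sketch below assumes the usual Linux kernel message fragments; the exact wording varies between kernel versions, and the function name `oom_events` is hypothetical.

```shell
# Hypothetical helper: read kernel log text on stdin and print lines that
# look like Linux OOM-killer events. The patterns are common kernel
# message fragments; exact wording varies between kernel versions.
oom_events() {
    grep -Ei 'out of memory|oom-killer|oom_reaper|killed process'
}

# Usage (assumed): dmesg | oom_events | tail -n 20
```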


Abort an APIC Upgrade


DO NOT reload, decommission, or restart processes on any APIC until the upgrade has been stopped on all APICs.


1. Restart AE on the local APIC to stop the ongoing installer process.


APIC# ps -ef --forest | egrep "ifc_ae|installer"


APIC# acidiag restart ae


2. Roll back the atomix upgrade on the local APIC so that it is recovered to its previous working state.


APIC# atomix rewind boot


3. Validate that the upgrade is marked as failed (completeNok) for the local APIC.


APIC# moquery -c maintUpgJob


5. Change the cluster version back to the original firmware version (the one the APICs are currently running).


apic1# configure

apic1(config)# firmware

apic1(config-firmware)# controller-group

apic1(config-firmware-controller)# firmware-version


6. Verify the cluster version.


apic1# acidiag avread


Stop AE and kill the installer.py process:


acidiag stop ae

ps -ef | grep installer.py    (to get the PID)

kill -9 <pid>    (the PID obtained from the previous command)
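
Picking the PID out of the ps output by eye is error-prone, because the grep command itself appears in the list. A small sketch that prints only the installer PIDs is shown here. The name `installer_pids` is hypothetical, and the pattern also matches atomix_installer.py because that name contains the same string.

```shell
# Hypothetical helper: read `ps -ef` output on stdin and print the PID
# (second column) of every installer.py process, ignoring grep itself.
installer_pids() {
    awk '/installer\.py/ && $0 !~ /grep/ { print $2 }'
}

# Usage (assumed): ps -ef | installer_pids
# Review the printed PIDs before passing any of them to kill -9.
```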


Set the cluster version back to 4.2(4o) from the APIC GUI or CLI, then start AE on APIC2 and APIC3.

   apic3# bash

   apic3> cd /mit/uni/controller/ctrlrfwpol

   apic3> moset version 'apic-4.2(4o)'

   apic3> moconfig commit

   apic2> acidiag start ae

   apic3> acidiag start ae



CSCvs14967: memory leak


AE process periodically causes a restart


Workaround


1) acidiag restart ae

2) Reload the Cisco APIC


If the AE process is locked, the APIC firmware upgrade will not start. The AE process queries the chassis IPMI every 10 seconds.


To check the last query made by the AE process, use the following commands:


APIC# date

APIC# grep "ipmi" /var/log/dme/log/svc_ifc_ae.bin.log | tail -5 


Check if the last query was within the 10 second window of the system time.
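
The comparison can be scripted once both timestamps are known. The sketch below assumes GNU date (the `-d` option) and takes the two timestamps as arguments; the timestamp layout inside svc_ifc_ae.bin.log is not shown in this post, so extracting it from the log line is left as a manual step. The name `within_window` is hypothetical.

```shell
# Hypothetical helper: succeed if timestamp $2 is no more than $3 seconds
# (default 10) older than timestamp $1. Requires GNU date (-d option).
# The body runs in a subshell, so its variables do not leak.
within_window() (
    now=$(date -d "$1" +%s) || exit 2
    last=$(date -d "$2" +%s) || exit 2
    age=$((now - last))
    [ "$age" -ge 0 ] && [ "$age" -le "${3:-10}" ]
)

# Usage (timestamps copied by hand from the two commands above):
# within_window "2022-09-02 02:21:10" "2022-09-02 02:21:05" && echo "AE is polling IPMI"
```

A non-zero return means the last IPMI query is older than the window (or lies in the future), which would be consistent with a locked AE process.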



Case 1: APIC1 upgrade fails or is stuck


Never attempt to manually upgrade APIC2 or APIC3.


Case 2: APIC1 is upgraded successfully, but APIC2 is still stuck at 75%


The APIC1 upgrade version information is propagated to APIC2 by the “svc_ifc_appliance_director” process.


Go to /firmware/logs/2022-09-02T02:21:07-80:

apic1# pwd

/firmware/logs/2022-09-02T02:21:07-80


apic1# ls -l

total 2976

-rwxr-xr-x 1 root root 2985605 Sep 2 02:40 atom_installer.log

-rwxr-xr-x 1 root root 49969 Sep 2 02:23 insieme_4x_installer.log


apic1# tail insieme_4x_installer.log


Check whether the log shows the image being copied:


2022-09-02 02:23:05,228|INFO|12616|installer:156 /mgmt/support/insieme/atomix/atomix_installer.py -l /firmware/logs/2022-09-02T02:21:07-80/atom_installer.log -r -d atomixonly -f /var/run/mgmt/fwrepos/fwrepo/aci-apic-dk9.4.2.4o.bin


apic1# tail atom_installer.log

root 31422 0.0 0.0 0 0 ? I< 02:28 0:00 [kdmflush]

root 31423 0.0 0.0 0 0 ? I< 02:28 0:00 [bioset]

root 31503 0.0 0.0 0 0 ? I 02:28 0:00 [kworker/6:2]

root 31504 0.0 0.0 0 0 ? I 02:28 0:00 [kworker/2:0]

root 31795 0.0 0.0 0 0 ? I 02:28 0:00 [kworker/7:1]

root 32450 0.0 0.0 0 0 ? I< 02:29 0:00 [kdmflush]

root 32451 0.0 0.0 0 0 ? I< 02:29 0:00 [bioset]


2022-09-02 02:40:24,588|INFO|3645|atomix_installer:48 Installation completed, rebooting

2022-09-02 02:40:24,597|INFO|3645|install_utils:89 reboot -f
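
The installer log lines above follow a "timestamp|LEVEL|pid|message" layout, which makes them easy to summarize. A rough sketch, assuming that layout holds for the whole file; both function names are hypothetical.

```shell
# Hypothetical helpers for logs in the "timestamp|LEVEL|pid|message"
# layout seen in the excerpts above.

# Print "LEVEL count" for each level found; lines without the
# pipe-delimited layout (such as embedded ps output) are skipped.
summarize_levels() {
    awk -F'|' 'NF >= 4 { count[$2]++ } END { for (l in count) print l, count[l] }' "$1" | sort
}

# Succeed if the log contains the final "Installation completed" message.
install_completed() {
    grep -q 'Installation completed' "$1"
}

# Usage (assumed): summarize_levels atom_installer.log
#                  install_completed atom_installer.log && echo "installer finished"
```

Any level other than INFO in the summary is a reasonable place to start reading the log in detail.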


The Commands


show controller

avread

acidiag avread

show firmware upgrade status

ps -ef | grep installer

ps -ef | grep atom

find /data/db/ -type f -size +10M -exec ls -lh {} \;

systemctl list-units --type=service | grep failed


Physical CIMC connectivity



VIC 1455 supports 10/25-Gigabit with the following restrictions:

All ports must have the same speed.

Port-1 and port-2 are one pair, corresponding to eth2-1 on the APIC.

Port-3 and port-4 are another pair, corresponding to eth2-2 on the APIC.

Only one connection is allowed for each pair.
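
The pairing rules above can be expressed as a small cabling check. This is only an illustrative sketch of the rules as stated here; the function names are hypothetical and nothing in it queries the VIC.

```shell
# Map a VIC 1455 port number to the APIC interface it belongs to,
# following the pairing described above.
vic_port_to_apic_if() {
    case "$1" in
        1|2) echo "eth2-1" ;;
        3|4) echo "eth2-2" ;;
        *)   echo "unknown VIC port: $1" >&2; return 1 ;;
    esac
}

# Succeed only if no two of the given connected ports share a pair.
check_vic_cabling() {
    seen=""
    for port in "$@"; do
        ifname=$(vic_port_to_apic_if "$port") || return 1
        case " $seen " in
            *" $ifname "*) echo "more than one link on $ifname" >&2; return 1 ;;
        esac
        seen="$seen $ifname"
    done
    return 0
}

# Usage: check_vic_cabling 1 3   (valid: one link per pair)
#        check_vic_cabling 1 2   (invalid: both links land on eth2-1)
```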


CSCvq66442 : APIC M3/L3 does not support connecting port 1 and 2 on the VIC towards the fabric



LLDP :


LLDP is a layer 2 “vendor-neutral” neighbor discovery protocol.


LLDP capture


tcpdump -vvnni kpm_inb "ether proto 0x88cc"



LLDP config


root@apic# cat /data/lldp/lldpad.conf


Leaf Commands

show lldp nei

cat /mit/sys/lldp/inst/if-\[eth1--3\]/summary

show sys int lldp info int e1/1

show lldp nei interface eth 1/3 detail


LLDP must be disabled on the CIMC.



Cluster Mode


PERMISSIVE: APICs (with any serial number) can be auto-discovered and allowed to join the cluster.


STRICT: user approval of the serial number is required before an APIC can join the cluster.


apic1# acidiag avread

Local appliance ID=1 ADDRESS=10.0.0.1

TEP ADDRESS=10.0.0.0/16

ROUTABLE IP ADDRESS=0.0.0.0

CHASSIS_ID=50ca8c6a-fe3e-11eb-b275-2b4d0343ce76

Cluster of 3 lm(t):1(2022-01-20T20:38:51.658+12:00)

appliances (out of targeted 3 lm(t):3(2022-01-20T21:31:08.921+12:00))

FABRIC_DOMAIN name=POD05

set to version=apic-4.2(6h) lm(t):1(2022-03-05T22:09:37.663+12:00);

discoveryMode=PERMISSIVE lm(t):0(1970-01-01T12:00:00.003+12:00);

drrMode=OFF lm(t):0(1970-01-01T12:00:00.003+12:00);

kafkaMode=OFF lm(t):1(2022-03-05T21:56:15.128+12:00)


apic1# show controller

Fabric Name : POD05

Operational Size : 3

Cluster Size : 3

Time Difference : -2740812

Fabric Security Mode : PERMISSIVE
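
When comparing several APICs, it can be handy to pull just the discovery mode and target version out of the avread output. A minimal sketch, assuming the `discoveryMode=` and `set to version=` fields appear as in the sample output above; the function names are hypothetical.

```shell
# Hypothetical helpers: read `acidiag avread` output on stdin.

# Print the discovery mode, e.g. PERMISSIVE.
avread_discovery_mode() {
    sed -n 's/.*discoveryMode=\([A-Za-z]*\).*/\1/p'
}

# Print the target cluster version, e.g. apic-4.2(6h).
avread_target_version() {
    sed -n 's/.*set to version=\([^ ]*\).*/\1/p'
}

# Usage (assumed): acidiag avread | avread_discovery_mode
```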


Wiring issues


leaf# cat /mit/sys/lldp/inst/if-\[eth1--3\]/summary

# LLDP Interface

id : eth1/3

adminRxSt : enabled

adminSt : enabled

adminTxSt : enabled

childAction :

descr :

dn : sys/lldp/inst/if-[eth1/3]

lcOwn : local

mac : 00:3A:9C:81:5D:43

modTs : 2022-03-17T21:28:52.150+05:30

monPolDn : uni/fabric/monfab-default

name :

operRxSt : up

operTxSt : up

portDesc : topology/pod-1/paths-102/pathep-[eth1/3]

portMode : normal

portVlan : unspecified

rn : if-[eth1/3]

status :

sysDesc : topology/pod-1/node-102

wiringIssues : ctrlr-uuid-mismatch


leaf# moquery -c lldpIf -f 'lldp.If.wiringIssues!=""'

# lldp.If

id : eth1/3

adminRxSt : enabled

adminSt : enabled

adminTxSt : enabled

childAction :

descr :

dn : sys/lldp/inst/if-[eth1/3]

lcOwn : local

mac : 00:3A:9C:81:5D:43

modTs : 2022-03-17T21:28:52.150+05:30

monPolDn : uni/fabric/monfab-default

name :

operRxSt : up

operTxSt : up

portDesc : topology/pod-1/paths-102/pathep-[eth1/3]

portMode : normal

portVlan : unspecified

rn : if-[eth1/3]

status :

sysDesc : topology/pod-1/node-102

wiringIssues : ctrlr-uuid-mismatch




fabric-domain-mismatch – Adjacent node belongs to a different fabric

ctrlr-uuid-mismatch – APIC UUID mismatch (duplicate APIC ID)

wiring-mismatch – Invalid connection (Leaf to Leaf, Spine to non-leaf, Leaf fabric port to non-spine etc.)

adjacency-not-detected – No LLDP adjacency on fabric port

infra-vlan-mismatch – Infra VLAN mismatch between leaf and APIC.

pod-id-mismatch – Pod ID mismatch between APIC and Leaf

unapproved-ctrlr – The SSL handshake between APIC and connected leaf is not completed.

unapproved-serialnumber – Detected a node that is not present in FNV.
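
The list above lends itself to a lookup helper. The sketch below extracts the wiringIssues value from the LLDP interface output shown earlier and prints the matching description; both function names are hypothetical.

```shell
# Print the wiringIssues value from LLDP interface output read on stdin.
# Lines with an empty value are skipped.
wiring_issue_of() {
    awk '$1 == "wiringIssues" && NF >= 3 { print $3 }'
}

# Translate a wiringIssues value into the description listed above.
explain_wiring_issue() {
    case "$1" in
        fabric-domain-mismatch)  echo "Adjacent node belongs to a different fabric" ;;
        ctrlr-uuid-mismatch)     echo "APIC UUID mismatch (duplicate APIC ID)" ;;
        wiring-mismatch)         echo "Invalid connection" ;;
        adjacency-not-detected)  echo "No LLDP adjacency on fabric port" ;;
        infra-vlan-mismatch)     echo "Infra VLAN mismatch between leaf and APIC" ;;
        pod-id-mismatch)         echo "Pod ID mismatch between APIC and leaf" ;;
        unapproved-ctrlr)        echo "SSL handshake between APIC and leaf not completed" ;;
        unapproved-serialnumber) echo "Node is not present in FNV" ;;
        *)                       echo "unknown wiring issue: $1" ;;
    esac
}

# Usage (assumed, on a leaf):
# explain_wiring_issue "$(cat /mit/sys/lldp/inst/if-\[eth1--3\]/summary | wiring_issue_of)"
```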


CSCvn97719 & CSCvn97710



UUID


The UUID is a unique identifier that each APIC generates for itself. When the cluster forms, each APIC ID and its UUID are locked to each other.

The UUID is regenerated every time the APIC is cleaned up.


The fabric detects a UUID mismatch for an APIC after a clean reboot and does not allow it to join the cluster.


Solution: To disassociate an APIC ID from its UUID, the APIC must be decommissioned and commissioned back.



Decommissioning removes the chassis ID across the fabric.


leaf1# acidiag avread | grep chassisId | cut -d ' '



Validate that the leaf has removed the old chassis ID of APIC3.

Commission APIC3 from APIC1 or APIC2.


There is no need to clean the APIC.


Note: Another way to recover from the UUID mismatch is to use the av.bin erase command.


Docker IP Conflict


Docker uses the IP range 172.17.0.0/16 by default. Because the route to this subnet uses the docker0 interface as its next hop, the APIC cannot reach external addresses in this subnet.

The APIC TEP address pool should not use this subnet; if it does, the APIC can become diverged.

BUG: CSCve84297, CSCvq97675
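
Whether a planned TEP pool collides with the Docker default range can be checked with plain shell arithmetic. This is a sketch under the assumption that both ranges are given in IPv4 CIDR notation; the function names are hypothetical.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    echo "$1" | {
        IFS=. read -r o1 o2 o3 o4
        echo $(( (o1 << 24) | (o2 << 16) | (o3 << 8) | o4 ))
    }
}

# Succeed if the two CIDR blocks overlap. Two blocks overlap exactly when
# their network addresses match under the shorter of the two prefixes.
cidr_overlaps() {
    len1=${1#*/}; len2=${2#*/}
    len=$len1
    if [ "$len2" -lt "$len" ]; then len=$len2; fi
    # A /0 prefix contains every address.
    if [ "$len" -eq 0 ]; then return 0; fi
    mask=$(( (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
    net1=$(( $(ip_to_int "${1%/*}") & mask ))
    net2=$(( $(ip_to_int "${2%/*}") & mask ))
    [ "$net1" -eq "$net2" ]
}

# Usage: cidr_overlaps 10.0.0.0/16 172.17.0.0/16 || echo "TEP pool is clear of docker0"
```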


MegaSAS


CSCvn13119

Echo the MegaSAS file to fix the storage issue


SSD Wearout


apic# grep -oE "SSD Wearout Indicator is [0-9]+"  /var/log/dme/log/svc_ifc_ae.bin.log | tail -1

SSD Wearout Indicator is 6 (if the value is less than 5, an SSD RMA is required)
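
The threshold check can be folded into the same pipeline. A minimal sketch that reads the AE log on standard input and applies the threshold of 5 mentioned above; the function name `ssd_wearout_status` and its output strings are hypothetical.

```shell
# Hypothetical helper: read AE log text on stdin and report on the most
# recent "SSD Wearout Indicator" value, using the threshold of 5 given above.
ssd_wearout_status() {
    value=$(sed -n 's/.*SSD Wearout Indicator is \([0-9][0-9]*\).*/\1/p' | tail -n 1)
    if [ -z "$value" ]; then
        echo "unknown: no wearout indicator found"
    elif [ "$value" -lt 5 ]; then
        echo "rma: indicator is $value"
    else
        echo "ok: indicator is $value"
    fi
}

# Usage (assumed): ssd_wearout_status < /var/log/dme/log/svc_ifc_ae.bin.log
```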


Switch: requires file 1 of 3; simply use the tool.













