Increasing CPU Cycle Reservation for Orchestrator VMs
Cisco ACI Multi-Site Orchestrator VMs require a dedicated amount of CPU cycles to function optimally. While new deployments automatically set the necessary CPU reservations, upgrading from a version prior to Release 2.1(1) requires manual adjustments to each Orchestrator VM's settings.
Why Increase CPU Reservation?
Properly configuring CPU cycle reservations can help resolve or prevent various unpredictable issues, such as:
Delayed GUI Loading: Orchestrator GUI elements may require multiple attempts to load.
Node Status Fluctuations: Nodes might intermittently switch to an "Unknown" status before reverting to "Ready" on their own.
# docker node ls
ID                          HOSTNAME  STATUS   AVAILABILITY  MANAGER STATUS  ENGINE VERSION
t8wl1zoke0vpxdl9fysqu9otb   node1     Ready    Active        Reachable       18.03.0-ce
kyriihdfhy1k1tlggan6e1ahs * node2     Unknown  Active        Reachable       18.03.0-ce
yburwactxd86dorindmx8b4y1   node3     Ready    Active        Leader          18.03.0-ce
Heartbeat Failures: Transient misses in the Orchestrator logs may indicate communication issues.
node2 dockerd: [...] level=error msg="agent: session failed" backoff=100ms
error="rpc error: code = Canceled desc = context canceled" module=node/agent [...]
node2 dockerd: [...] level=error msg="heartbeat to manager [...] failed"
error="rpc error: code = Canceled desc = context canceled" [...]
Enabling NTP for Orchestrator Nodes
Clock synchronization is crucial for Orchestrator nodes. Without it, you might encounter issues like random GUI session log-offs due to expired authentication tokens.
Procedure to Enable NTP
1) Log in directly to one of the Orchestrator VMs.
2) Navigate to the Scripts Directory:
cd /opt/cisco/msc/scripts
3) Configure NTP Settings:
Use the svm-msc-tz-ntp script to set the time zone and enable NTP.
Parameters:
-tz <time-zone>: Specify your time zone (e.g., US/Pacific).
-ne: Enable NTP.
-ns <ntp-server>: Specify your NTP server (e.g., ntp.esl.cisco.com).
Example Command:
./svm-msc-tz-ntp -tz US/Pacific -ne -ns ntp.esl.cisco.com
svm-msc-tz-ntp: Start
svm-msc-tz-ntp: Executing timedatectl set-timezone US/Pacific
svm-msc-tz-ntp: Executing sed -i 's|^server|\# server|' /etc/ntp.conf
svm-msc-tz-ntp: Executing timedatectl set-ntp true
svm-msc-tz-ntp: Sleeping 10 seconds
svm-msc-tz-ntp: Checking NTP status
svm-msc-tz-ntp: Executing ntpstat;ntpq -p
unsynchronised
polling server every 64 s
     remote           refid      st t  when poll reach   delay   offset  jitter
==============================================================================
 mtv5-ai27-dcm10 .GNSS.           1 u     -   64     1    1.581   -0.002   0.030
4) Verify NTP Configuration:
Check NTP Status:
ntpstat; ntpq -p
unsynchronised
polling server every 64 s
     remote           refid      st t  when poll reach   delay   offset  jitter
==============================================================================
*mtv5-ai27-dcm10 .GNSS.           1 u    14   64     1    3.522   -0.140   0.128
5) Confirm Date and Time:
date
Mon Jul 8 14:19:26 PDT 2019
6) Repeat for All Orchestrator Nodes:
Ensure that each Orchestrator VM undergoes the same NTP configuration process.
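If you prefer to configure every node in one pass from a management host, a minimal sketch using SSH is shown below; the node hostnames, time zone, and NTP server are examples and should be replaced with your own values:
for node in node1 node2 node3; do
  ssh root@$node "/opt/cisco/msc/scripts/svm-msc-tz-ntp -tz US/Pacific -ne -ns ntp.esl.cisco.com"
done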
Updating DNS for MSO OVA Deployments in VMware ESX
Note: This procedure is only for MSO OVA deployments in VMware ESX. It does not apply to Application Services Engine or Nexus Dashboard deployments.
Procedure to Update DNS
Access the Cluster Node:
SSH into one of the cluster nodes using the root user account.
Update DNS Configuration:
Use the nmcli command to set the DNS server IP address.
Single DNS Server:
nmcli connection modify eth0 ipv4.dns "<dns-server-ip>"
Multiple DNS Servers:
nmcli connection modify eth0 ipv4.dns "<dns-server-ip-1> <dns-server-ip-2>"
Restart the Network Interface:
Apply the DNS changes by restarting the eth0 interface.
nmcli connection down eth0 && nmcli connection up eth0
Reboot the Node:
Restart the node to ensure all changes take effect.
Repeat for Other Nodes:
Perform the same steps on the remaining two cluster nodes.
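After each node comes back up, you can confirm that the new DNS settings were applied and that name resolution works. A minimal sketch (the test hostname is only an example):
nmcli connection show eth0 | grep ipv4.dns
cat /etc/resolv.conf
getent hosts example.com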
Restarting Cluster Nodes
Restarting a Single Node That Is Temporarily Down
Restart the Affected Node:
Simply restart the node that is down.
No additional steps are needed; the cluster will automatically recover.
Restarting Two Nodes That Are Temporarily Down
Back Up MongoDB:
Before attempting recovery, back up the MongoDB database to prevent data loss.
Important: The cluster requires at least two running nodes to remain operational.
Restart the Two Affected Nodes:
Restart both nodes that are down.
No additional steps are needed; the cluster will automatically recover.
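Once the nodes come back up, you can confirm that the cluster has recovered and regained quorum, for example:
docker node ls --format '{{.Hostname}}: {{.Status}} ({{.ManagerStatus}})'
All three nodes should report Ready, with one node as Leader and the others as Reachable.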
Backing Up MongoDB for Cisco ACI Multi-Site
Recommendation: Always back up MongoDB before performing any upgrades or downgrades of the Cisco ACI Multi-Site Orchestrator.
Procedure to Back Up MongoDB
Log In to the Orchestrator VM:
Access the Cisco ACI Multi-Site Orchestrator virtual machine.
Run the Backup Script:
Execute the backup script to create a backup file.
~/msc_scripts/msc_db_backup.sh
A backup file named msc_backup_<date+%Y%m%d%H%M>.archive will be created.
Secure the Backup File:
Copy the backup file to a safe location for future use.
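To keep the backup safe even if the cluster itself is lost, copy it to an external host. A minimal sketch using scp, where the destination host and path are examples:
scp ~/msc_backup_*.archive backup-user@backup-host:/backups/mso/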
Restoring MongoDB for Cisco ACI Multi-Site
Procedure to Restore MongoDB
Log In to the Orchestrator VM:
Access the Cisco ACI Multi-Site Orchestrator virtual machine.
Transfer the Backup File:
Copy your msc_backup_<date+%Y%m%d%H%M>.archive file to the VM.
Run the Restore Script:
Execute the restore script to restore the database from the backup file.
~/msc_scripts/msc_db_restore.sh
Push the Schemas:
After restoring, push the schemas to ensure everything is up to date.
msc_push_schemas.py
Custom Certificates Troubleshooting
This section describes how to resolve common issues when using custom SSL certificates with Cisco ACI Multi-Site Orchestrator.
Unable to Load the Orchestrator GUI
If you can't access the Orchestrator GUI after installing a custom certificate, the issue might be due to incorrect certificate placement on the Orchestrator nodes. Follow these steps to recover the default certificates and reinstall the new ones.
Steps to Recover Default Certificates and Reinstall Custom Certificates
Log In to Each Orchestrator Node:
Access each node directly using SSH or your preferred method.
Navigate to the Certificates Directory:
cd /data/msc/secrets
Restore Default Certificates:
Replace the existing certificate files with the backup copies.
mv msc.key_backup msc.key
mv msc.crt_backup msc.crt
Restart the Orchestrator GUI Service:
docker service update msc_ui --force
Reinstall and Activate the New Certificates:
Follow the certificate installation procedure as outlined in previous documentation to ensure the new certificates are correctly applied.
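After updating the GUI service, you can verify which certificate the Orchestrator is actually serving, both before and after reinstalling the custom certificate. A quick sketch using standard OpenSSL commands (replace the address with the IP or hostname of an Orchestrator node):
echo | openssl s_client -connect <orchestrator-ip>:443 2>/dev/null | openssl x509 -noout -subject -issuer -dates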
Adding a New Orchestrator Node to the Cluster
When you add a new node to your Multi-Site Orchestrator cluster, ensure the key is activated to maintain cluster security and functionality.
Steps to Add a New Orchestrator Node
Log In to the Orchestrator GUI:
Access the GUI using your web browser.
Re-activate the Key:
Follow the key activation steps as described in the "Activating Custom Keyring" section to integrate the new node securely into the cluster.
Unable to Install a New Keyring After the Default Keyring Expired
If the default keyring has expired and you can't install a new one, it's likely that the custom keyring wasn't properly installed on the cluster nodes. Follow these steps to delete the old keyring and create a new one.
Steps to Create a New Keyring
Access All Cluster Nodes:
SSH into each node in the cluster.
Remove Old Keyring Files:
cd /data/msc/secrets
rm -rf msc.key msc.crt msc.key_backup msc.crt_backup
Generate a New Keyring:
Create new key and certificate files using OpenSSL.
openssl req -newkey rsa:2048 -nodes -keyout msc.key -x509 -days 365 -out msc.crt -subj '/CN=MSC'
Back Up the New Keyring Files:
cp msc.key msc.key_backup
cp msc.crt msc.crt_backup
Set Proper Permissions:
chmod 777 msc.key msc.key_backup msc.crt msc.crt_backup
Force Update the Orchestrator GUI Service:
docker service update msc_ui --force
Re-install and Activate the New Certificates:
Follow the certificate installation and activation procedure to apply the new keyring.
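Once the new keyring is in place, you can optionally confirm that the key and certificate in /data/msc/secrets match and check the new expiration date. A quick sketch using standard OpenSSL commands; the two MD5 digests should be identical:
openssl x509 -noout -modulus -in msc.crt | openssl md5
openssl rsa -noout -modulus -in msc.key | openssl md5
openssl x509 -noout -dates -in msc.crt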
Replacing a Single Node of the Cluster with a New Node
If one node (e.g., node1) goes down and you need to replace it with a new node, follow these steps:
Step 1: Identify the Down Node
On any existing node, get the ID of the node that is down by running:
# docker node ls
ID                          HOSTNAME  STATUS  AVAILABILITY  MANAGER STATUS
11624powztg5tl9nlfoubydtp * node2     Ready   Active        Leader
fsrca74nl7byt5jcv93ndebco   node3     Ready   Active        Reachable
wnfs9oc687vuusbzd3o7idllw   node1     Down    Active        Unreachable
Note the ID of the down node (node1).
Step 2: Demote the Down Node
Demote the down node by running:
docker node demote <node ID>
Replace <node ID> with the ID from Step 1.
Example:
docker node demote wnfs9oc687vuusbzd3o7idllw
You'll see a message: Manager <node ID> demoted in the swarm.
Step 3: Remove the Down Node
Remove the down node from the swarm:
docker node rm <node ID>
Example:
docker node rm wnfs9oc687vuusbzd3o7idllw
Step 4: Navigate to the 'prodha' Directory
On any existing node, change to the prodha directory:
cd /opt/cisco/msc/builds/<build_number>/prodha
Replace <build_number> with your actual build number.
Step 5: Obtain the Swarm Join Token
Get the token needed to join the swarm:
docker swarm join-token manager
This command will display a join command containing the token and IP address.
Example Output:
docker swarm join --token SWMTKN-1-... <IP_address>:2377
Note: Copy the entire join command or at least the token and IP address for later use.
Step 6: Note the Leader's IP Address
Identify the leader node by running:
docker node ls
On the leader node, get its IP address:
ifconfig
Look for the inet value under the appropriate network interface (e.g., eth0).
inet 10.23.230.152 netmask 255.255.255.0 broadcast 10.23.230.255
Note: The IP address is 10.23.230.152.
Step 7: Prepare the New Node
Set the Hostname:
hostnamectl set-hostname <new_node_name>
Replace <new_node_name> with the desired hostname (e.g., node1).
Step 8: Navigate to the 'prodha' Directory on New Node
On the new node, change to the prodha directory:
cd /opt/cisco/msc/builds/<build_number>/prodha
Step 9: Join the New Node to the Swarm
Run the join command using the token and leader IP address obtained earlier:
./msc_cfg_join.py <token> <leader_IP_address>
Replace:
<token> with the token from Step 5.
<leader_IP_address> with the IP address from Step 6.
Example:
./msc_cfg_join.py SWMTKN-1-... 10.23.230.152
Step 10: Deploy the Configuration
On any node, navigate to the prodha directory:
cd /opt/cisco/msc/builds/<build_number>/prodha
Run the deployment script:
./msc_deploy.py
Result: All services should be up, and the database replicated.
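To confirm that the replacement node has fully joined, you can verify that all three nodes are healthy and that every service reports its full replica count, for example:
docker node ls
docker service ls
All nodes should show Ready with one Leader and two Reachable managers, and the REPLICAS column should show values such as 1/1 or 3/3 for each service.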
Replacing Two Existing Nodes of the Cluster with New Nodes
If two nodes are down and you need to replace them, follow these steps:
Before You Begin
Important: Since there's a lack of quorum (only one node is up), the Multi-Site Orchestrator won't be available.
Recommendation: Back up the MongoDB database before proceeding.
Step 1: Prepare New Nodes
Bring Up Two New Nodes:
Ensure they are properly set up and connected.
Set Unique Hostnames for Each New Node:
hostnamectl set-hostname <new_node_name>
Repeat for both new nodes, using unique names.
Step 2: Remove Down Nodes from Swarm
SSH into the Only Live Node.
List All Nodes:
docker node ls
Example Output:
ID                          HOSTNAME  STATUS  AVAILABILITY  MANAGER STATUS
g3mebdulaed2n0cyywjrtum31   node2     Down    Active        Reachable
ucgd7mm2e2divnw9kvm4in7r7   node1     Ready   Active        Leader
zjt4dsodu3bff3ipn0dg5h3po * node3     Down    Active        Reachable
Remove Nodes with 'Down' Status:
docker node rm <node-id>
Example:
docker node rm g3mebdulaed2n0cyywjrtum31
docker node rm zjt4dsodu3bff3ipn0dg5h3po
Step 3: Re-initialize the Docker Swarm
Leave the Existing Swarm:
docker swarm leave --force
Navigate to the 'prodha' Directory:
cd /opt/cisco/msc/builds/<build_number>/prodha
Initialize a New Swarm:
./msc_cfg_init.py
This command will provide a new token and IP address.
Step 4: Join New Nodes to the Swarm
On Each New Node:
SSH into the Node.
Navigate to the 'prodha' Directory:
cd /opt/cisco/msc/builds/<build_number>/prodha
Join the Node to the Swarm:
./msc_cfg_join.py <token> <leader_IP_address>
Replace:
<token> with the token from Step 3.
<leader_IP_address> with the IP address of the first node (from Step 3).
Step 5: Deploy the Configuration
On Any Node:
Navigate to the 'prodha' Directory:
cd /opt/cisco/msc/builds/<build_number>/prodha
Run the Deployment Script:
./msc_deploy.py
Result: The new cluster should be operational with all services up.
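If the Orchestrator data does not come back intact after redeployment, the MongoDB backup taken before this procedure can be restored with the scripts described earlier in this document, for example:
~/msc_scripts/msc_db_restore.sh
msc_push_schemas.py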
Relocating Multi-Site Nodes to a Different Subnet
When you need to move one or more Multi-Site nodes from one subnet to another—such as spreading nodes across different data centers—you can follow this simplified procedure.
It's important to relocate one node at a time to maintain redundancy during the migration.
Scenario: Relocating node3 from Data Center 1 (subnet 10.1.1.1/24) to Data Center 2 (subnet 11.1.1.1/24).
Steps:
Demote node3 on node1:
On node1, run:
docker node demote node3
Power Down node3:
Shut down the virtual machine (VM) for node3.
Remove node3 from the Cluster:
On node1, execute:
docker node rm node3
Deploy a New VM for node3 in Data Center 2:
Install the Multi-Site VM (same version as node1 and node2).
Configure it with the new IP settings for the 11.1.1.1/24 subnet.
Set the hostname to node3.
Power Up node3 and Test Connectivity:
Start the new node3 VM.
Verify connectivity to node1 and node2:
ping [node1_IP]
ping [node2_IP]
Obtain the Swarm Join Token from node1:
On node1, get the join token:
docker swarm join-token manager
Note the provided command and token.
Join node3 to the Swarm:
On node3, use the token to join the cluster:
docker swarm join --token [token] [node1_IP]:2377
Verify Cluster Health:
On any node, check the status:
docker node ls
Ensure each node shows:
STATUS: Ready
AVAILABILITY: Active
MANAGER STATUS: One node as Leader, others as Reachable
Update the Swarm Label for node3:
On node1, update the label:
docker node update node3 --label-add msc-node=msc-node3
Check Docker Services Status:
On any node, list services:
docker service ls
Confirm that services are running (e.g., REPLICAS show 1/1 or 3/3).
Wait up to 15 minutes for synchronization if necessary.
Delete the Original node3 VM:
After confirming everything is functioning, remove the old node3 VM from Data Center 1.
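If you want to double-check that the swarm label for node3 was applied correctly, the node can be inspected directly; a quick sketch:
docker node inspect node3 --format '{{ .Spec.Labels }}'
The output should show the msc-node label set to msc-node3.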
By following these steps, you can successfully relocate a Multi-Site node to a different subnet while maintaining cluster integrity and minimizing downtime.