Analyzing BGP State Mismatch Inside a Containerlab Docker Fabric
Introduction to Containerlab and FRR
Containerlab is a network emulation platform that builds multi-node topologies from containers. In this scenario, Containerlab creates a fabric of several FRR (FRRouting) instances. Each FRR instance runs in its own Docker container, and Containerlab wires the containers together with the virtual links defined in the topology file.
Understanding BGP State Mismatch
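A BGP state mismatch occurs when two or more FRR instances disagree about the state of the session between them, for example when one side briefly reports Established while the other keeps cycling through Active and Connect. In cloned labs, a frequent cause is a stale router-id shared by two peers. The symptoms include sessions that never settle in Established, missing or duplicate routes, and log messages that point to a router-id conflict.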
BGP Session States
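A BGP session moves through the states Idle, Connect, Active, OpenSent, OpenConfirm, and Established. A session that stays in any state other than Established indicates a problem. With a router-id conflict, the session may fail during the OPEN exchange, because a peer can reject an OPEN message whose BGP Identifier it considers invalid.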
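To make these states visible at a glance, the sketch below classifies each peer from the text of show ip bgp summary. It assumes the common FRR layout in which the tenth column (State/PfxRcd) holds either a prefix count, meaning the session is Established, or a state name. The function name classify_peers is hypothetical, and only neighbors listed by IPv4 address are handled.

```shell
# Reads `show ip bgp summary` text on stdin and prints "<neighbor> <state>".
# Assumption: column 10 is State/PfxRcd (a number means Established).
classify_peers() {
    awk '$1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ && NF >= 10 {
        state = ($10 ~ /^[0-9]+$/) ? "Established" : $10
        print $1, state
    }'
}
```

A possible invocation from the Containerlab host is docker exec <container> vtysh -c "show ip bgp summary" | classify_peers, where <container> stands for the node's container name.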
Missing or Duplicate Routes
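If the BGP session is not established, routes are neither advertised nor received, so prefixes go missing from the routing tables. When two routers share a router-id, routes can also appear duplicated or be chosen inconsistently, because BGP uses the router-id to tell speakers apart, for example as a tie-breaker in best-path selection.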
Log Messages Indicating Router-ID Conflicts
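FRR logs can reveal router-id conflicts that cause BGP sessions to fail. Where FRR runs under systemd, check them with journalctl -u frr. Inside a Containerlab node, which usually has no systemd, inspect the container logs (docker logs) or the log file configured in frr.conf.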
Packet Captures Showing Unexpected Keepalive/UPDATE Exchanges
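Capturing packets with tcpdump on the interfaces that carry the BGP sessions, either inside the containers or on the Docker bridge if the sessions cross it, helps identify unexpected KEEPALIVE or UPDATE exchanges that indicate a problem with the session. Containerlab links are typically direct virtual links between containers, so a capture on the Docker bridge alone may not show the fabric's BGP traffic.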
Common Causes of Stale Router-IDs in Cloned FRR Instances
Inheritance of Router-ID from Startup Config on Container Clone
When a new container is cloned from an existing one, it inherits the startup config, including the router-id. Unless the clone is given its own router-id, it conflicts with the original.
Manual or Automated Config Cloning without Router-ID Reset
If the config is cloned manually or automatically without resetting the router-id, it can cause conflicts between the containers.
Use of Static Router-ID Statements vs. Automatic Selection
A static router-id statement causes a conflict when the same ID ends up in several containers. Automatic selection, where FRR derives the router-id from an interface address, avoids the problem as long as the nodes have distinct addresses.
Docker Image Layering Preserving /etc/frr/frr.conf
A Docker image that has /etc/frr/frr.conf baked into one of its layers carries the router-id stored in that file. Every container created from the image then starts with the same router-id.
Environment Variable Overrides Not Applied After Clone
If the per-node environment variable overrides that set the router-id are not applied after cloning, the container keeps the router-id it inherited.
Diagnostic Procedure
Verifying FRR Router-ID in Each Container
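To verify the router-id in each container, run the following commands inside it. The first prints the local BGP router identifier in its header, and the second shows the remote router ID learned from each neighbor:
vtysh -c "show ip bgp summary"
vtysh -c "show ip bgp neighbors"
Check the /etc/frr/frr.conf file for router-id statements:
grep router-id /etc/frr/frr.conf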
Inspecting Containerlab Node Definitions
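Review the topology file (for example lab.clab.yml) for the FRR node definitions and the config files mounted into them. FRR nodes are commonly declared with kind: linux and an FRR image, and the router-id lives in the mounted frr.conf (for example as router-id 1.1.1.1):
topology:
  nodes:
    node1:
      kind: linux
      image: <frr-image>
      binds:
        - node1/frr.conf:/etc/frr/frr.conf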
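Check for binds: entries that mount the same host file into more than one node, which gives every such node an identical router-id:
topology:
  nodes:
    node1:
      kind: linux
      binds:
        - /etc/frr/frr.conf:/etc/frr/frr.conf
    node2:
      kind: linux
      binds:
        - /etc/frr/frr.conf:/etc/frr/frr.conf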
Comparing Router-ID Values Across Peers
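Use a script on the Containerlab host to collect the router-id from every node so the values can be compared. Containerlab names containers clab-<lab-name>-<node-name>; LAB below is a placeholder for your lab name:
#!/bin/bash
LAB=mylab
for node in node1 node2 node3; do
  # The summary header contains "BGP router identifier <id>"
  id=$(docker exec "clab-${LAB}-${node}" vtysh -c "show ip bgp summary" | grep -i -m1 "router identifier")
  echo "$node: $id"
done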
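Once the router-ids have been collected, the comparison itself can be automated. The following is a minimal sketch that assumes the collected values have been reduced to one "node router-id" pair per line; the function name report_shared_router_ids is hypothetical.

```shell
# Reads "node router-id" pairs on stdin and prints every router-id that is
# used by more than one node, followed by the nodes that share it.
report_shared_router_ids() {
    sort -k2,2 -k1,1 | awk '
        NF < 2 { next }                      # skip blank or malformed lines
        $2 == prev_id { nodes = nodes " " $1; shared = 1; next }
        {
            if (shared) print prev_id ":" nodes
            prev_id = $2; nodes = " " $1; shared = 0
        }
        END { if (shared) print prev_id ":" nodes }
    '
}
```

Empty output means that every node reported a different router-id.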
Analyzing BGP Logs for Router-ID Mismatch Warnings
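Check the FRR logs for messages that mention the router-id or the BGP identifier. The exact wording depends on the FRR version, so search case-insensitively. For a Containerlab node without systemd, run the same search over docker logs <container> 2>&1 or over the log file configured in frr.conf:
journalctl -u frr | grep -iE "router-id|identifier"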
Packet Capture Analysis
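Use tcpdump on the interface that carries the BGP session. Inside a Containerlab node this is usually a link interface such as eth1; use docker0 only if the sessions actually cross the Docker bridge:
tcpdump -i eth1 -nn -vv -s 0 -c 100 tcp port 179
Filter for BGP on TCP port 179 and inspect the router-id, which tcpdump prints as the ID field of each OPEN message:
tcpdump -l -i eth1 -nn -vv -s 0 -c 100 tcp port 179 | grep -i -A 1 "open message"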
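The router-ids seen on the wire can be pulled out of the capture text for comparison. This sketch assumes that tcpdump -vv decodes an OPEN message with a line of the form "Version 4, my AS <asn>, Holdtime <n>s, ID <router-id>"; that layout may differ between tcpdump versions, and the function name open_ids is hypothetical.

```shell
# Reads `tcpdump -vv` text on stdin and prints "AS <asn> ID <router-id>"
# for every decoded BGP OPEN message.
# Assumption: the OPEN detail line looks like
#   "Version 4, my AS 65001, Holdtime 180s, ID 1.1.1.1"
open_ids() {
    sed -n 's/.*my AS \([0-9.]*\), Holdtime [0-9]*s, ID \([0-9.]*\).*/AS \1 ID \2/p'
}
```

Piping the capture through open_ids | sort | uniq -c shows how often each identifier appears; the same ID announced from two different AS or source addresses points to a clone that kept its router-id.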
Remediation Steps
Resetting Router-ID on Affected FRR Instances
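Remove the static router-id line and restart FRR. The pattern below deletes every line that contains router-id, including OSPF router-id statements, so review the file first. Where FRR runs under systemd:
sed -i '/router-id/d' /etc/frr/frr.conf
systemctl restart frr
In a Containerlab node without systemd, edit the node's config file and restart the container with docker restart <container> instead.
With no router-id configured, FRR selects one automatically from the node's interface addresses, so no extra statement is needed to enable automatic selection. After the restart, confirm the value that was selected:
vtysh -c "show ip bgp summary"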
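Force a new router-id with bgp router-id <new-id>, entered under router bgp in configuration mode, and then clear the sessions with clear ip bgp *. Replace 65001 with the node's local AS number:
vtysh -c "configure terminal" -c "router bgp 65001" -c "bgp router-id 2.2.2.2"
vtysh -c "clear ip bgp *"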
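When the router-id is changed from a script, it helps to validate the value before it reaches vtysh. The sketch below is one way to do that from the Containerlab host; the function names, the container name, and the AS number are assumptions for illustration.

```shell
# Succeeds only for a dotted-quad value with octets 0-255 other than 0.0.0.0.
is_valid_router_id() {
    printf '%s\n' "$1" | awk -F. '
        $0 == "0.0.0.0" || NF != 4 { exit 1 }
        { for (i = 1; i <= 4; i++) if ($i !~ /^[0-9]+$/ || $i + 0 > 255) exit 1 }
    '
}

# Usage: set_bgp_router_id <container> <local-asn> <router-id>
set_bgp_router_id() {
    is_valid_router_id "$3" || { echo "invalid router-id: $3" >&2; return 1; }
    docker exec "$1" vtysh \
        -c "configure terminal" \
        -c "router bgp $2" \
        -c "bgp router-id $3"
}
```

For example, set_bgp_router_id clab-mylab-node2 65002 2.2.2.2 would apply the new value to the running config of a node in a lab named mylab.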
Updating Containerlab Topology to Avoid Config Cloning
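Generate each node's frr.conf from a shared template that takes the router-id as a per-node variable, so that no two nodes receive the same file. A template line might look like this, with node_id filled in by whatever tool renders the file:
router-id {{ node_id }}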
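Leverage env: to pass a unique router-id value to each node. The variable only takes effect if a startup script in the image writes it into the FRR config:
topology:
  nodes:
    node1:
      kind: linux
      env:
        ROUTER_ID: 1.1.1.1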
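An environment variable such as ROUTER_ID does nothing unless something in the container consumes it. The sketch below shows one hypothetical way a startup wrapper could do so: it renders a config template in which the placeholder {{ router_id }} marks where the value belongs. The function name and the placeholder syntax are assumptions, not features of the FRR image.

```shell
# Reads a config template on stdin and writes it to stdout with every
# "{{ router_id }}" placeholder replaced by the value of $ROUTER_ID.
render_frr_conf() {
    case "${ROUTER_ID:-}" in
        ''|*[!0-9.]*) echo "ROUTER_ID is unset or malformed" >&2; return 1 ;;
    esac
    sed "s/{{ *router_id *}}/${ROUTER_ID}/g"
}
```

A wrapper script could call render_frr_conf < frr.conf.tmpl > /etc/frr/frr.conf before starting the FRR daemons; the template file name here is only an example.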
Apply binds: mounts that point to node-specific config files:
topology:
  nodes:
    node1:
      kind: linux
      binds:
        - /etc/frr/node1.conf:/etc/frr/frr.conf
Restarting BGP Sessions Cleanly
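Use clear ip bgp * soft to refresh routes without tearing the sessions down, or clear ip bgp * for a hard reset that repeats the OPEN exchange. The router-id is sent in the OPEN message, so only a hard reset makes peers learn a changed value:
vtysh -c "clear ip bgp * soft"
vtysh -c "clear ip bgp *"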
Verify session reestablishment and route convergence:
vtysh -c "show ip bgp summary"
vtysh -c "show ip route"
Validation and Verification
Confirming Unique Router-IDs Across All FRR Peers
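Repeat the collection loop from the diagnostic procedure on the Containerlab host and confirm that every node reports a different router-id (LAB is a placeholder for your lab name):
LAB=mylab
for node in node1 node2 node3; do
  echo "$node: $(docker exec "clab-${LAB}-${node}" vtysh -c "show ip bgp summary" | grep -i -m1 "router identifier")"
done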
Checking BGP State Transitions to Established
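Verify that every peer has reached Established. In the summary output, an established peer shows a prefix count in the State/PfxRcd column, while any other peer shows a state name such as Active or Idle: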
vtysh -c "show ip bgp summary"
Verifying Route Symmetry and Absence of Duplicate Advertisements
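Verify that each expected prefix appears once, with the expected next hop, on both sides of every session. The first command lists the BGP table and the second shows what was installed in the routing table:
vtysh -c "show ip bgp"
vtysh -c "show ip route"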
Monitoring for Recurring Router-ID Conflict Logs Over Time
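Monitor the FRR logs for recurring messages that mention the router-id or the BGP identifier. The exact wording depends on the FRR version; for a Containerlab node without systemd, search docker logs <container> 2>&1 instead:
journalctl -u frr | grep -iE "router-id|identifier"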
Performing Traffic Flow Tests to Ensure Proper Forwarding
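Perform traffic flow tests to ensure proper forwarding. A capture on port 179 only exercises the control plane, so send data-plane traffic as well, for example a ping from one node to an address inside a prefix learned over BGP (192.0.2.1 below is a placeholder):
ping -c 3 192.0.2.1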
Preventive Measures and Best Practices
Centralized FRR Config Management with Templating
Keep a single FRR config template under version control and render a separate file for each node from it, instead of copying one node's config to another.
Automating Router-ID Generation Based on Container Identifiers
Derive each router-id from a stable per-node identifier, such as the node's index in the topology, so that uniqueness does not depend on manual edits.
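One simple scheme maps a numeric node index onto an address range reserved for router-ids. The sketch below uses 10.255.0.0/16 purely as an example range, and the function name router_id_from_index is hypothetical.

```shell
# Prints a router-id in 10.255.0.0/16 derived from a node index (1-65534).
router_id_from_index() {
    case "$1" in
        # reject empty input, zero or leading zeros, non-digits, and 6+ digits
        ''|0*|*[!0-9]*|??????*) echo "invalid index: $1" >&2; return 1 ;;
    esac
    if [ "$1" -gt 65534 ]; then
        echo "index out of range: $1" >&2
        return 1
    fi
    echo "10.255.$(( $1 / 256 )).$(( $1 % 256 ))"
}
```

With nodes named node1, node2, and so on, the index can be taken from the name, for example router_id_from_index "${node#node}".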
Incorporating Router-ID Checks into CI/CD Pipeline for Lab Builds
Add a step to the CI/CD pipeline that fails the lab build when two node configs contain the same router-id.
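Such a pipeline check can run against the config files before any container starts. The sketch below assumes one frr.conf per node and looks only at top-level router-id and bgp router-id statements; the function name check_unique_router_ids is hypothetical.

```shell
# Usage: check_unique_router_ids FILE...
# Prints every router-id found in more than one file and returns non-zero
# if any duplicate exists.
check_unique_router_ids() {
    awk '
        $1 == "router-id"                { id = $2 }
        $1 == "bgp" && $2 == "router-id" { id = $3 }
        id != "" {
            # count each router-id at most once per file
            if (!((id, FILENAME) in seen)) {
                seen[id, FILENAME] = 1
                if (id in users) users[id] = users[id] " " FILENAME
                else users[id] = FILENAME
                count[id]++
            }
            id = ""
        }
        END {
            for (id in count)
                if (count[id] > 1) {
                    print "duplicate router-id " id ": " users[id]
                    bad = 1
                }
            exit bad + 0
        }
    ' "$@"
}
```

In a pipeline step, check_unique_router_ids configs/*/frr.conf would stop the build on the first lab definition that reuses a router-id; the configs/ directory layout is an assumption.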
Documenting Cloning Procedures to Exclude Router-ID Inheritance
Document the cloning procedure so that it includes an explicit step to remove or replace the router-id in the copied config.
Using Containerlab’s vars: Feature to Assign Unique Identifiers Per Node
Use Containerlab’s vars: feature to assign unique identifiers per node and ensure unique router-ids.
Regular Auditing of FRR Configurations in Running Labs
Periodically collect the router-ids from running labs and compare them, so that conflicts introduced by later changes are detected early.
Enabling FRR BGP Debug (debug bgp neighbor-events) Only During Troubleshooting
Turn on BGP debugging only while troubleshooting and disable it afterwards with the matching no debug command; leaving it on floods the logs and buries the messages that matter.
Troubleshooting Checklist
- Verify each FRR container’s router-id is unique
- Ensure no static router-id remains in cloned configs
- Confirm Containerlab node definitions do not duplicate config mounts
- Check FRR logs for router-id conflict messages
- Validate BGP session states after remediation
- Perform post-change traffic validation
- Update lab templates to prevent future cloning issues