
Scaling a Self-Healing Containerlab Pipeline

Overview of Containerlab Pipeline

Architecture

The Containerlab pipeline simulates and replays routing failures across hundreds of nodes, with the goal of scoring AI-generated fixes automatically. Its architecture is summarized in the diagram below:

graph LR
A[Containerlab] -->|simulates| B[Network Topology]
B -->|includes| C[Nodes]
C -->|runs| D[Routing Protocols]
D -->|includes| E[OSPF]
D -->|includes| F[EIGRP]
D -->|includes| G[BGP]
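As a concrete starting point, here is a minimal sketch of how such a topology might be generated programmatically. The lab name, node names, kind, and image are illustrative assumptions, not details from the pipeline itself; the script uses PyYAML to emit a Containerlab topology file.

# generate_topology.py - sketch: emit a Containerlab topology with
# three FRR routers in a line. Names, kind, and image are assumptions.
import yaml  # PyYAML

topology = {
    "name": "selfheal-lab",  # hypothetical lab name
    "topology": {
        "nodes": {
            f"r{i}": {"kind": "linux", "image": "frrouting/frr:latest"}
            for i in range(1, 4)
        },
        "links": [
            {"endpoints": ["r1:eth1", "r2:eth1"]},
            {"endpoints": ["r2:eth2", "r3:eth1"]},
        ],
    },
}

with open("selfheal-lab.clab.yml", "w") as f:
    yaml.safe_dump(topology, f, sort_keys=False)

The file would then be deployed with containerlab deploy -t selfheal-lab.clab.yml.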

Components

The pipeline brings together a Containerlab simulation layer, the simulated topology and its routing protocols (OSPF, EIGRP, BGP), a replay tool that injects recorded failures, an AI model that generates candidate fixes, a scoring tool that evaluates those fixes, and a monitoring and logging stack (Prometheus/Grafana and Elasticsearch/Logstash/Kibana). Each is covered below.

Scaling the Pipeline

Horizontal Scaling

To scale the pipeline horizontally, we add worker nodes and distribute the simulated nodes across multiple machines. With Docker Swarm, a worker is activated and a service scaled out like this:

docker node update --availability=active <node-name>
docker service update --replicas=<number-of-replicas> <service-name>
The resulting placement looks like this:

graph LR
A[Containerlab] -->|simulates| B[Network Topology]
B -->|includes| C[Node 1]
B -->|includes| D[Node 2]
B -->|includes| E[Node 3]
C -->|runs on| F[Worker Node 1]
D -->|runs on| G[Worker Node 2]
E -->|runs on| H[Worker Node 3]
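A minimal sketch of the distribution logic, assuming simulated nodes are spread round-robin across Swarm workers; the worker hostnames and node names are placeholders:

# distribute_nodes.py - sketch: round-robin assignment of simulated
# nodes to worker machines. Hostnames and node names are assumptions.
from itertools import cycle

workers = ["worker-1", "worker-2", "worker-3"]  # assumed hostnames
nodes = [f"r{i}" for i in range(1, 10)]         # simulated routers

assignment: dict[str, list[str]] = {}
for node, worker in zip(nodes, cycle(workers)):
    assignment.setdefault(worker, []).append(node)

for worker, assigned in assignment.items():
    # each worker would deploy its own sub-topology locally
    print(f"{worker}: {assigned}")

In practice each worker's node list would be written out as a per-worker Containerlab sub-topology, with links between workers carried over an overlay network.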

Vertical Scaling

To scale the pipeline vertically, we increase the resources available to existing worker nodes or raise the per-node limits in the Containerlab topology. At the Swarm level, a service's resource limits are adjusted with:

docker service update --limit-cpu=<cpus> --limit-memory=<bytes> <service-name>

At the Containerlab level, cpu and memory limits are set per node in the topology file itself, as the sketch below shows.
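A sketch of raising per-node limits, assuming the topology file from the earlier sketch and that the nodes accept Containerlab's per-node cpu and memory fields; the specific values are illustrative:

# resize_nodes.py - sketch: rewrite a topology file with per-node
# cpu/memory limits. File name and limit values are assumptions.
import yaml  # PyYAML

with open("selfheal-lab.clab.yml") as f:
    topo = yaml.safe_load(f)

for node in topo["topology"]["nodes"].values():
    node["cpu"] = 2         # vCPUs per simulated router (assumed)
    node["memory"] = "1Gb"  # memory cap per router (assumed)

with open("selfheal-lab.clab.yml", "w") as f:
    yaml.safe_dump(topo, f, sort_keys=False)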

Replaying Routing Failures

Failure Scenarios

The pipeline simulates three failure scenarios: link failures, node failures, and routing protocol failures.

Link failure: a link in the forwarding path goes down.

graph LR
A[Node 1] -->|link| B[Node 2]
B -->|link| C[Node 3]
C -->|link| D[Node 4]
B -->|failure| E[Link Failure]

Node failure: a node drops out of the topology entirely.

graph LR
A[Node 1] -->|link| B[Node 2]
B -->|link| C[Node 3]
C -->|link| D[Node 4]
B -->|failure| E[Node Failure]

Routing protocol failure: a protocol session or process fails while the underlying links stay up.

graph LR
A[Node 1] -->|OSPF| B[Node 2]
B -->|EIGRP| C[Node 3]
C -->|BGP| D[Node 4]
B -->|failure| E[Routing Protocol Failure]
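A sketch of how the link-failure scenario might be injected, assuming the nodes run as Docker containers named clab-<lab>-<node> (Containerlab's default naming) and that downing an interface inside the container is an acceptable stand-in for a link cut:

# inject_failure.py - sketch: bring down an interface inside a
# simulated node to emulate a link failure.
import subprocess

def fail_link(lab: str, node: str, iface: str) -> None:
    # clab-<lab>-<node> is Containerlab's default container name
    container = f"clab-{lab}-{node}"
    subprocess.run(
        ["docker", "exec", container, "ip", "link", "set", iface, "down"],
        check=True,
    )

if __name__ == "__main__":
    fail_link("selfheal-lab", "r2", "eth1")  # illustrative values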

Replay Tool

The replay tool is custom-built (in Python or Go) and replays recorded routing failures against a running topology, so real-world failure scenarios can be reproduced on demand. It wraps the Containerlab CLI and is invoked as:

containerlab replay --failure-scenario=<failure-scenario> --topology=<topology>
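The post doesn't describe the replay tool's internals, but a sketch of its core loop might look like the following, assuming scenarios are recorded as timestamped JSON events and reusing the fail_link helper from the failure-injection sketch above; the event schema is an assumption:

# replay.py - sketch: replay a recorded failure scenario against a
# running lab. The event schema is an assumption.
import json
import time

from inject_failure import fail_link  # helper from the earlier sketch

def replay(scenario_path: str, lab: str) -> None:
    with open(scenario_path) as f:
        # e.g. [{"at": 5.0, "node": "r2", "iface": "eth1"}, ...]
        events = json.load(f)
    start = time.monotonic()
    for event in sorted(events, key=lambda e: e["at"]):
        # sleep until the event's offset from scenario start
        time.sleep(max(0.0, event["at"] - (time.monotonic() - start)))
        fail_link(lab, event["node"], event["iface"])

if __name__ == "__main__":
    replay("link-failure-r2.json", "selfheal-lab")  # illustrative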

Scoring AI-Generated Fixes

AI Model

The AI model generates candidate fixes for routing failures. It is trained on historical routing failure data and implemented in TensorFlow or PyTorch.
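The post doesn't specify the model's interface, but a sketch of the contract the rest of the pipeline could code against (failure context in, candidate configuration fix out) might look like this; every name and field here is an assumption:

# fix_generator.py - sketch of the fix-generation interface only;
# the model, its inputs, and its outputs are all assumptions.
from dataclasses import dataclass

@dataclass
class FailureContext:
    node: str             # node where the failure was observed
    protocol: str         # "ospf", "eigrp", or "bgp"
    symptoms: list[str]   # e.g. log lines, withdrawn routes

@dataclass
class CandidateFix:
    node: str
    config_lines: list[str]  # CLI lines to apply on the node

def generate_fix(ctx: FailureContext) -> CandidateFix:
    # placeholder: a trained TensorFlow/PyTorch model would go here
    raise NotImplementedError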

Scoring Tool

The scoring tool evaluates how effective each AI-generated fix actually is. Like the replay tool, it is custom-built (in Python or Go), wraps the Containerlab CLI, and is invoked as:

containerlab score --fix=<fix> --topology=<topology>

The scoring metrics used by the tool include repair time and routing protocol convergence time.
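A sketch of how those two metrics might be combined into a single score; the normalization caps and equal weighting are assumptions, not part of the tool as described:

# score_fix.py - sketch: combine repair time and convergence time
# into one score in [0, 1]. Caps and weights are assumptions.
def score(repair_s: float, convergence_s: float,
          max_repair_s: float = 300.0,
          max_convergence_s: float = 60.0) -> float:
    # normalize each metric so 1.0 means an instantaneous fix
    repair = max(0.0, 1.0 - repair_s / max_repair_s)
    convergence = max(0.0, 1.0 - convergence_s / max_convergence_s)
    # equal weighting is an assumption
    return 0.5 * repair + 0.5 * convergence

print(score(repair_s=42.0, convergence_s=12.0))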

Monitoring and Logging

Monitoring Tools

The pipeline uses Prometheus for metrics collection and Grafana for dashboards. Prometheus is started with:

prometheus --config.file=prometheus.yml

graph LR
A[Prometheus] -->|metrics| B[Grafana]
B -->|dashboard| C[Monitoring Dashboard]
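One way to feed pipeline metrics into this stack is the official prometheus_client library; the metric names and port below are assumptions, and the random values stand in for real measurements:

# metrics.py - sketch: expose replay/scoring metrics for Prometheus
# to scrape. Metric names and port 8000 are assumptions.
import random
import time

from prometheus_client import Gauge, start_http_server

repair_time = Gauge("selfheal_repair_seconds",
                    "Time to apply a generated fix")
convergence_time = Gauge("selfheal_convergence_seconds",
                         "Routing protocol reconvergence time")

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes :8000/metrics
    while True:
        repair_time.set(random.uniform(10, 60))       # stand-in values
        convergence_time.set(random.uniform(1, 15))   # stand-in values
        time.sleep(15)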

Logging Tools

The pipeline uses the ELK stack (Elasticsearch, Logstash, and Kibana) for log aggregation, processing, and search:

ES_PATH_CONF=/etc/elasticsearch elasticsearch
logstash -f logstash.conf
kibana --config kibana.yml
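A sketch of structured logging that this stack can ingest, assuming Logstash (or a shipper such as Filebeat) tails a JSON-lines file; the file name and field names are assumptions:

# pipeline_log.py - sketch: write JSON-lines logs for Logstash to
# ingest. File path and field names are assumptions.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
        })

handler = logging.FileHandler("pipeline.json")  # tailed by Logstash
handler.setFormatter(JsonFormatter())
log = logging.getLogger("selfheal")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("replay started for scenario link-failure-r2")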

Limits, Blind Spots, and Operational Preconditions

The pipeline has real limits. The simulated nodes are containers, so hardware, optic, and ASIC failures are out of scope; failure coverage is restricted to the link, node, and protocol scenarios above; protocol coverage is restricted to OSPF, EIGRP, and BGP; and scoring considers only repair time and convergence time, so a fix can score well while still violating design intent. Operationally, the pipeline presumes a working Docker Swarm cluster and that the monitoring and logging stacks are running before any replay starts.

To deploy the pipeline safely, start with a small topology on a single worker, verify that metrics and logs are flowing, and only then scale out horizontally. AI-generated fixes should be scored in the lab before anything derived from them touches a production network.

