Telemetry-First Evidence Chains for Session Reset Storms

Introduction to Agent Workflow Design

Overview of Time-Series Anomalies

Time-series anomalies refer to unexpected patterns or deviations in data collected over time. In network management, time-series anomalies can indicate issues such as network congestion, device failures, or security threats. Detecting these anomalies is crucial for maintaining network reliability and performance.

Importance of Deliberate Probe Selection

Deliberate probe selection is essential in agent workflow design as it enables the workflow to focus on the most relevant and critical issues. By selecting the next CLI or API probes deliberately, the workflow can avoid unnecessary probes, reduce overhead, and improve the overall efficiency of the network management process.
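As a rough illustration of choosing probes deliberately, the sketch below ranks candidate probes by relevance per unit of cost and stops when a budget is spent. The `Probe` fields, the scores, and the `budget` parameter are all assumptions made for this example; they are not part of any particular agent framework.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Probe:
    name: str         # CLI command or API call to run
    relevance: float  # 0..1, assumed likelihood that the probe explains the anomaly
    cost: float       # assumed relative overhead on the device; must be > 0

def select_probes(candidates, budget):
    """Greedily pick probes with the best relevance per unit cost within a budget."""
    ranked = sorted(candidates, key=lambda p: p.relevance / p.cost, reverse=True)
    chosen, spent = [], 0.0
    for probe in ranked:
        if spent + probe.cost <= budget:
            chosen.append(probe)
            spent += probe.cost
    return chosen

candidates = [
    Probe("show interface", relevance=0.9, cost=1.0),
    Probe("show ip route", relevance=0.4, cost=2.0),
    Probe("show running-config", relevance=0.2, cost=5.0),
]
print([p.name for p in select_probes(candidates, budget=3.0)])
# → ['show interface', 'show ip route']
```

A greedy ranking is not optimal in general, but it is easy to audit, which matters when the choice of probe has to be explained afterwards.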

Time-Series Anomaly Detection

Anomaly Detection Algorithms

Several anomaly detection algorithms can be used in agent workflow design, including:
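One simple option is a rolling z-score, which flags a sample when it sits far from the mean of the preceding window. The sketch below is a minimal version under assumed parameters (`window`, `threshold`) and made-up sample data; it is meant to show the shape of the technique, not a tuned detector.

```python
from collections import deque
from statistics import mean, pstdev

def rolling_zscore_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate strongly from the trailing window."""
    history = deque(maxlen=window)
    flagged = []
    for i, value in enumerate(samples):
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            if sigma == 0:
                # A perfectly flat baseline: treat any change as anomalous.
                is_anomaly = value != mu
            else:
                is_anomaly = abs(value - mu) / sigma > threshold
            if is_anomaly:
                flagged.append(i)
        history.append(value)
    return flagged

# Hypothetical per-minute session reset counts.
resets_per_minute = [1, 0, 1, 2, 1, 1, 0, 14, 1, 1]
print(rolling_zscore_anomalies(resets_per_minute))
# → [7]
```

Note that the anomalous sample itself enters the window afterwards and inflates the baseline for the next few samples; a production detector would usually guard against that.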

Integration with Monitoring Systems

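To integrate anomaly detection with monitoring systems, the agent workflow can pull time-series data through the query API of a monitoring system such as Prometheus, or consume it from a streaming platform such as Kafka. Grafana typically sits on top of these sources for dashboards and alerting rather than acting as a data source itself.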
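As a small example of the collection side, the sketch below parses the JSON body that the Prometheus HTTP API returns for a range query into plain Python tuples. The metric name `session_resets_total` and the `device` label are hypothetical, and the HTTP call itself is left out so the example stays self-contained.

```python
import json

def parse_range_response(body):
    """Turn a Prometheus range-query JSON body into {labels: [(ts, value), ...]}."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    series = {}
    for result in payload["data"]["result"]:
        key = tuple(sorted(result["metric"].items()))
        # Prometheus encodes sample values as strings; convert them to float.
        series[key] = [(float(ts), float(v)) for ts, v in result["values"]]
    return series

# Hand-written sample body in the shape of a "matrix" result.
sample = json.dumps({
    "status": "success",
    "data": {"resultType": "matrix", "result": [
        {"metric": {"__name__": "session_resets_total", "device": "r1"},
         "values": [[1700000000, "3"], [1700000060, "9"]]},
    ]},
})
series = parse_range_response(sample)
print(series)
```

The resulting lists of `(timestamp, value)` pairs can be fed directly into whichever anomaly detector the workflow uses.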

CLI and API Probe Selection

Probe Selection Criteria

When selecting CLI or API probes, the agent workflow should consider the following criteria:

CLI Probe Examples

Examples of CLI probes include:

show interface
show ip route
show running-config

API Probe Examples

Examples of API probes include:

GET /api/v1/interfaces
GET /api/v1/routes
POST /api/v1/config

gNMI and gRPC Reset Handling

Understanding gNMI and gRPC Resets

gNMI (gRPC Network Management Interface) is a network management protocol built on gRPC, a general-purpose remote procedure call framework that runs over HTTP/2. Resets of gNMI sessions, or of the gRPC streams underneath them, can occur for various reasons, such as network congestion, device failures, or configuration errors.

Distinguishing between Device Bugs and Control-Plane Events

To distinguish between device bugs and control-plane events, the agent workflow can analyze the reset reason, error messages, and system logs.
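One heuristic that can complement the reset reason and the logs is correlation across devices: resets on many devices within the same short window suggest a shared control-plane event, while resets confined to one device point toward that device. The sketch below applies this idea with fixed time buckets. The window size, the threshold, and the two labels are assumptions for illustration, and the labels should be read as candidates to confirm, not verdicts.

```python
from collections import defaultdict

def classify_resets(events, window_s=60, storm_threshold=3):
    """Label each (timestamp_seconds, device) reset by how many devices share its window."""
    devices_per_bucket = defaultdict(set)
    for ts, device in events:
        devices_per_bucket[int(ts // window_s)].add(device)

    labels = []
    for ts, device in events:
        affected = devices_per_bucket[int(ts // window_s)]
        if len(affected) >= storm_threshold:
            labels.append((ts, device, "control_plane_event"))
        else:
            labels.append((ts, device, "device_bug"))
    return labels

# Hypothetical reset events: three devices reset together, then r1 resets alone.
events = [(0, "r1"), (10, "r2"), (20, "r3"), (300, "r1"), (900, "r1")]
for row in classify_resets(events):
    print(row)
```

Fixed buckets can split a storm that straddles a bucket boundary; a sliding window avoids that at the cost of slightly more bookkeeping.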

Reset Handling Strategies

The agent workflow can use the following reset handling strategies:
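One widely used strategy for reconnecting after a reset is exponential backoff with jitter, which keeps many agents from retrying in lockstep during a reset storm. The sketch below is a minimal version; `connect` is a hypothetical callable that raises `ConnectionError` on failure, and the `base`, `cap`, and `attempts` defaults are arbitrary.

```python
import random
import time

def backoff_delays(attempts, base=1.0, cap=30.0, rng=random.random):
    """Yield one 'full jitter' delay per attempt: uniform in [0, min(cap, base * 2**n))."""
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        yield rng() * ceiling

def reconnect(connect, attempts=5, sleep=time.sleep):
    """Call connect() until it succeeds or the attempts run out."""
    last_error = None
    for delay in backoff_delays(attempts):
        try:
            return connect()
        except ConnectionError as err:
            last_error = err
            sleep(delay)  # wait before the next attempt
    raise RuntimeError("gave up reconnecting") from last_error
```

Passing `sleep` and `rng` as parameters keeps the retry logic testable without real waiting.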

Agent Workflow Implementation

Workflow Design Considerations

When designing the agent workflow, consider the following factors:

Example Workflow Implementation

An example workflow implementation can be designed using a finite state machine (FSM) or a decision tree. The workflow can start with an initial state, transition to different states based on the detected anomalies and probe results, and finally reach a terminal state.
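The FSM variant can be sketched as a transition table keyed by the current state and an event. The state and event names below are invented for illustration; a real workflow would define its own.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    ANOMALY_DETECTED = auto()
    PROBING = auto()
    RESOLVED = auto()
    ESCALATED = auto()

# (current state, event) -> next state
TRANSITIONS = {
    (State.IDLE, "anomaly"): State.ANOMALY_DETECTED,
    (State.ANOMALY_DETECTED, "probe_selected"): State.PROBING,
    (State.PROBING, "probe_ok"): State.RESOLVED,
    (State.PROBING, "probe_inconclusive"): State.ANOMALY_DETECTED,
    (State.PROBING, "probe_failed"): State.ESCALATED,
}
TERMINAL = {State.RESOLVED, State.ESCALATED}

def run(events, state=State.IDLE):
    """Apply events in order and return the final state."""
    for event in events:
        if state in TERMINAL:
            break  # ignore anything that arrives after a terminal state
        try:
            state = TRANSITIONS[(state, event)]
        except KeyError:
            raise ValueError(f"event {event!r} is not valid in state {state.name}") from None
    return state

print(run(["anomaly", "probe_selected", "probe_inconclusive", "probe_selected", "probe_ok"]))
# → State.RESOLVED
```

Keeping the transitions in a table makes invalid paths fail loudly and makes the workflow easy to review.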

Code Examples for Probe Selection and Reset Handling

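import grpc
from gnmi import gnmi_pb2
from gnmi import gnmi_pb2_grpc

# Create a gNMI client (insecure channel: suitable for a lab only)
channel = grpc.insecure_channel('localhost:50051')
client = gnmi_pb2_grpc.gNMIStub(channel)

# gNMI Get takes structured paths, not CLI strings. These OpenConfig-style
# paths are illustrative; use the paths your devices actually support.
PROBE_PATHS = {
    'interface_down': ['interfaces', 'interface'],
    'route_change': ['network-instances', 'network-instance'],
}

# Define a probe selection function
def select_probe(anomaly):
    names = PROBE_PATHS.get(anomaly.type)
    if names is None:
        return None  # no probe is defined for this anomaly type
    return gnmi_pb2.Path(elem=[gnmi_pb2.PathElem(name=n) for n in names])

# Define a reset handling function
def handle_reset(reason):
    if reason == 'device_bug':
        return 'retry'
    elif reason == 'control_plane_event':
        return 'failover'
    return 'escalate'  # unknown reason: leave the decision to an operator

# Implement the agent workflow
def agent_workflow(anomalies, classify_reset):
    # classify_reset is supplied by the caller and maps a grpc.RpcError to
    # 'device_bug' or 'control_plane_event'.
    for anomaly in anomalies:
        probe = select_probe(anomaly)
        if probe is None:
            continue
        try:
            # A failed Get raises grpc.RpcError; the response has no status field.
            client.Get(gnmi_pb2.GetRequest(path=[probe]), timeout=5)
            print(f'Probe for {anomaly.type} succeeded')
        except grpc.RpcError as err:
            action = handle_reset(classify_reset(err))
            print(f'Probe for {anomaly.type} failed ({err.code()}); action: {action}')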

Troubleshooting Agent Workflow Issues

Common Issues and Solutions

Common issues in agent workflow implementation include:

Debugging Techniques for Probe Selection and Reset Handling

Debugging techniques for probe selection and reset handling include:

Logging and Monitoring for Workflow Troubleshooting

Logging and monitoring can be used to troubleshoot agent workflow issues. For example:
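One option is to emit each workflow step as a structured JSON log line, so that probe results can be filtered by device or probe name later. The sketch below uses only the standard `logging` module; the logger name and the `device` and `probe` fields are assumptions for this example.

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""
    def format(self, record):
        entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Fields passed via `extra=` become attributes on the record.
            "device": getattr(record, "device", None),
            "probe": getattr(record, "probe", None),
        }
        return json.dumps(entry, sort_keys=True)

def make_logger(stream):
    logger = logging.getLogger("agent_workflow")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated setup
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger

buffer = io.StringIO()
log = make_logger(buffer)
log.info("probe finished", extra={"device": "r1", "probe": "show interface"})
print(buffer.getvalue().strip())
```

In practice the stream would be a file or standard error rather than an in-memory buffer; the buffer is used here only to keep the example self-contained.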

Scaling the Agent Workflow

Horizontal Scaling Considerations

Horizontal scaling involves adding more agents or devices to the workflow. Considerations for horizontal scaling include:
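One consideration is how devices are divided among agents so that adding or removing an agent does not reshuffle every assignment. A possible approach, shown here as an assumption and not as something this workflow requires, is rendezvous (highest-random-weight) hashing. The agent and device names are hypothetical.

```python
import hashlib

def assign_agent(device, agents):
    """Pick the agent with the highest hash score for this device."""
    def score(agent):
        digest = hashlib.sha256(f"{agent}:{device}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(agents, key=score)

agents = ["agent-a", "agent-b", "agent-c"]
devices = [f"router-{i}" for i in range(50)]
assignment = {d: assign_agent(d, agents) for d in devices}

# Removing an agent only moves the devices that agent owned.
remaining = ["agent-a", "agent-b"]
moved = [d for d in devices
         if assignment[d] != "agent-c" and assign_agent(d, remaining) != assignment[d]]
print(len(moved))
# → 0
```

Because each device's choice depends only on its own scores, no shared state is needed between agents to agree on ownership.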

Vertical Scaling Limitations

Vertical scaling involves increasing the resources (e.g., CPU, memory, and storage) of the agents and devices. Limitations of vertical scaling include:

Distributed Workflow Implementation Examples

Distributed workflow implementation examples include:

Example Use Cases and Code Examples

Use Case: Network Device Monitoring

In this use case, the agent workflow monitors network devices for anomalies and performs probes to investigate and resolve issues.

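import grpc
from gnmi import gnmi_pb2
from gnmi import gnmi_pb2_grpc

# Create a gNMI client (insecure channel: suitable for a lab only)
channel = grpc.insecure_channel('localhost:50051')
client = gnmi_pb2_grpc.gNMIStub(channel)

# gNMI Get takes structured paths, not CLI strings. These OpenConfig-style
# paths are illustrative; use the paths your devices actually support.
PROBE_PATHS = {
    'interface_down': ['interfaces', 'interface'],
    'route_change': ['network-instances', 'network-instance'],
}

# Define a probe selection function
def select_probe(anomaly):
    names = PROBE_PATHS.get(anomaly.type)
    if names is None:
        return None  # no probe is defined for this anomaly type
    return gnmi_pb2.Path(elem=[gnmi_pb2.PathElem(name=n) for n in names])

# Implement the agent workflow
def agent_workflow(anomalies):
    for anomaly in anomalies:
        probe = select_probe(anomaly)
        if probe is None:
            continue
        try:
            # A failed Get raises grpc.RpcError; the response has no status field.
            client.Get(gnmi_pb2.GetRequest(path=[probe]), timeout=5)
            print(f'Probe for {anomaly.type} succeeded')
        except grpc.RpcError as err:
            print(f'Probe for {anomaly.type} failed: {err.code()}')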

Use Case: Server Performance Monitoring

In this use case, the agent workflow monitors server performance for anomalies and performs probes to investigate and resolve issues.

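import psutil

# Map each anomaly type to a (name, callable) probe. Callables replace the
# original eval() on strings, which is unsafe and hard to debug.
PROBES = {
    # interval=1 measures CPU over one second; without an interval the first
    # call returns a meaningless 0.0.
    'cpu_usage_high': ('cpu_percent', lambda: psutil.cpu_percent(interval=1)),
    'memory_usage_high': ('memory_percent', lambda: psutil.virtual_memory().percent),
}

# Define a probe selection function
def select_probe(anomaly):
    return PROBES.get(anomaly.type)

# Implement the agent workflow
def agent_workflow(anomalies, threshold=80):
    for anomaly in anomalies:
        probe = select_probe(anomaly)
        if probe is None:
            continue  # no probe is defined for this anomaly type
        name, run = probe
        result = run()
        if result < threshold:
            print(f'Probe {name} succeeded ({result}%)')
        else:
            print(f'Probe {name} failed ({result}%)')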

Code Examples for gNMI and gRPC Probe Implementation

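import grpc
from gnmi import gnmi_pb2
from gnmi import gnmi_pb2_grpc

# Create a gNMI client (insecure channel: suitable for a lab only)
channel = grpc.insecure_channel('localhost:50051')
client = gnmi_pb2_grpc.gNMIStub(channel)

# Define a gNMI probe function
def gnmi_probe(path):
    # path is a gnmi_pb2.Path; GetRequest.path is a repeated field, so wrap it
    request = gnmi_pb2.GetRequest(path=[path])
    response = client.Get(request, timeout=5)
    return response

# Define a gRPC probe function
def grpc_probe(stub_class, method, request):
    # stub_class is a generated stub class (e.g. gnmi_pb2_grpc.gNMIStub) and
    # method is the RPC name as a string (e.g. 'Get')
    stub = stub_class(channel)
    response = getattr(stub, method)(request, timeout=5)
    return response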

Best Practices for Agent Workflow Design

Design Principles for Scalability and Reliability

Design principles for scalability and reliability include:

Probe Selection and Reset Handling Best Practices

Best practices for probe selection and reset handling include:

Integration with Existing Monitoring and Management Systems

Best practices for integration with existing monitoring and management systems include:

Future Directions and Emerging Trends

Emerging Technologies for Anomaly Detection and Probe Selection

Emerging technologies for anomaly detection and probe selection include:

Future of gNMI and gRPC in Network Management

The future of gNMI and gRPC in network management includes:

Potential Applications of Agent Workflow Design in Other Domains

Potential applications of agent workflow design in other domains include:

