LinkState

Rollback windows on non-transactional network devices

Introduction to Staged Rollback Plans

Overview of Non-Atomic Reversion

Many network and platform systems lack a true atomic revert operation. When a configuration change is applied, the underlying state may be spread across multiple components (e.g., line cards, control planes, distributed databases) that cannot be rolled back in a single, indivisible step. Instead, reverting requires a sequence of inverse actions that may succeed partially, leave the system in an intermediate state, or fail altogether due to dependencies, resource contention, or timing windows. Recognizing that rollback is a process rather than an instantaneous switch is the foundation for designing a staged rollback plan.
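The non-atomic nature of reversion can be made explicit in tooling. The sketch below (a hypothetical `staged_revert` helper; the action names are illustrative, not tied to any real platform) runs inverse actions in order and reports exactly how far reversion got when one of them fails:

```python
def staged_revert(inverse_actions):
    """Run inverse actions in order and report how far reversion got.

    inverse_actions: list of (name, callable) pairs; each callable
    returns True on success. A failure does not flip the system back
    to the known-good state -- it leaves a named intermediate state
    that the operator must reason about.
    """
    completed = []
    for name, action in inverse_actions:
        if not action():
            return {'status': 'partial',
                    'completed': completed,
                    'failed_at': name}
        completed.append(name)
    return {'status': 'reverted', 'completed': completed}
```

Treating the result as data rather than a boolean is what lets later stages of the plan decide whether to retry, escalate, or freeze in place.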

Importance of Rollback Planning

Rollback planning mitigates the blast radius of a failed change by defining clear boundaries, verification gates, and operator intervention points before the change reaches production. Without a structured approach, operators resort to ad‑hoc manual fixes that increase mean time to recovery (MTTR) and risk cascading failures. A staged rollback plan makes those boundaries, gates, and intervention points explicit and rehearsed, rather than improvised under pressure.

Understanding Commit Boundaries

Definition and Purpose

A commit boundary marks the point in the change lifecycle after which the new configuration is considered active for traffic, but before which the system can still be returned to the known‑good state using a pre‑defined rollback procedure. It is not a guarantee of atomicity; rather, it is a contractual line: before it, the pre‑defined rollback procedure is expected to restore the known‑good state, and after it, reverting means running the staged, non‑atomic process described above.

Choosing Optimal Commit Boundaries

Considerations for Boundary Selection

  1. State Granularity – Choose a boundary that aligns with the smallest independently revertible unit (e.g., a single line‑card firmware load, a BGP peer session, a VLAN database shard). Smaller granularity reduces blast radius but may increase the number of boundaries to manage.
  2. Dependency Chain – Identify upstream/downstream dependencies that must be satisfied before the change can be safely committed. If a change requires a preceding firmware upgrade, the commit boundary should sit after that upgrade’s verification gate.
  3. Verification Latency – The boundary should occur after the longest-running verification check (e.g., traffic‑plane latency measurement, control‑plane convergence timer) to avoid committing before the system has proven stability.
  4. Operator Visibility – Ensure that telemetry, logs, and CLI show a clear “commit‑accepted” state at the boundary, enabling operators to confirm the transition without ambiguity.
  5. Rollback Cost – Estimate the effort and time required to execute the inverse actions for the unit bounded by the commit point. If the cost exceeds operational thresholds, consider moving the boundary earlier to limit rollback work.
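Considerations 3 and 5 combine into a simple selection rule: discard candidates whose rollback cost exceeds the operational threshold, then place the boundary after the longest remaining verification. A minimal sketch, where the `BoundaryCandidate` type and timings are assumptions for illustration, not part of any real platform API:

```python
from dataclasses import dataclass

@dataclass
class BoundaryCandidate:
    name: str
    verification_s: float   # longest-running verification behind this point (consideration 3)
    rollback_cost_s: float  # estimated inverse-action time for the bounded unit (consideration 5)

def select_boundary(candidates, max_rollback_s):
    """Keep candidates within the rollback budget, then commit only
    after the longest surviving verification has completed."""
    viable = [c for c in candidates if c.rollback_cost_s <= max_rollback_s]
    if not viable:
        raise ValueError('no candidate within rollback budget; move the boundary earlier')
    return max(viable, key=lambda c: c.verification_s)
```

For example, a line‑card firmware load with a 10‑minute rollback cost would be excluded under a 5‑minute budget, pushing the boundary back to a smaller unit such as a single BGP peer session.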

Example Commit Boundary Scenarios

Expiry Window Sizing Strategies

Introduction to Expiry Windows

An expiry window is the maximum time the system is allowed to remain in the post‑commit, pre‑verification state before an automatic rollback trigger fires or operator intervention becomes mandatory. It bounds the risk exposure window: if verification does not succeed within the window, the system assumes the change is unhealthy and initiates a rollback sequence.
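A minimal watchdog sketch of this behavior, assuming hypothetical `verify` and `rollback` callables supplied by the change tooling: it polls verification until it passes or the window expires, and fires the automatic rollback trigger on expiry.

```python
import time

def run_expiry_window(window_s, verify, rollback, poll_s=0.5):
    """Poll verification until it succeeds or the expiry window lapses.

    verify: returns True once the post-commit state is proven healthy.
    rollback: the automatic rollback trigger, fired on window expiry.
    """
    deadline = time.monotonic() + window_s
    while time.monotonic() < deadline:
        if verify():
            return 'committed'
        time.sleep(poll_s)
    rollback()
    return 'rolled_back'
```

Using a monotonic clock matters here: a wall-clock adjustment mid-window must not silently shrink or extend the risk exposure.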

Factors Influencing Expiry Window Size

System Resource Constraints

Operational Requirements

Calculating Optimal Expiry Window Sizes

Formulaic Approach

A baseline calculation can be expressed as:

ExpiryWindow = BaseVerificationTime + (SafetyFactor × Variability) + Buffer
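In code, with all times in seconds (the default SafetyFactor of 2.0 and the 15 s Buffer below are illustrative defaults, not prescriptions):

```python
def expiry_window(base_verification_s, variability_s,
                  safety_factor=2.0, buffer_s=15.0):
    """ExpiryWindow = BaseVerificationTime + (SafetyFactor * Variability) + Buffer."""
    return base_verification_s + safety_factor * variability_s + buffer_s
```

For a verification that takes 60 s on average with 10 s of observed variability, this yields a 95 s window: long enough to absorb normal jitter, short enough to cap exposure.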

Heuristic-Based Approach

When precise measurements are unavailable, use tiered heuristics:

  1. Short‑Lived Changes (e.g., ACL toggle, interface shutdown) – 30–60 s window.
  2. Medium‑Lived Changes (e.g., routing policy, VLAN rewrite) – 2–5 min window, scaled with the number of affected nodes.
  3. Long‑Lived Changes (e.g., firmware upgrade, OS reload) – 10–30 min window, often tied to a maintenance block and requiring manual approval to extend.

Adjust heuristics downward if the system exhibits high verification failure rates in production, and upward only after demonstrating consistent success across multiple cycles.
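The tiers translate directly into a lookup, with the medium tier scaled by node count. The specific per-node slope and cap below are assumptions chosen to stay within the stated 2–5 min range:

```python
def heuristic_window_s(change_class, affected_nodes=1):
    """Tiered expiry-window heuristics; returns a window in seconds."""
    if change_class == 'short':    # ACL toggle, interface shutdown
        return 60
    if change_class == 'medium':   # routing policy, VLAN rewrite
        # 2 min base plus 10 s per affected node, capped at 5 min (assumed scaling)
        return min(300, 120 + 10 * affected_nodes)
    if change_class == 'long':     # firmware upgrade, OS reload
        return 1800
    raise ValueError(f'unknown change class: {change_class}')
```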

Operator Intervention Points

Mandatory Intervention Triggers

Error Thresholds

Timeout Expirations

Operator Intervention Procedures

Communication Protocols

Decision-Making Frameworks

  1. Rapid Triage Checklist – Verify: (a) Is the failure isolated to a single device or subset? (b) Are there any recent unrelated alarms? (c) Can the verification metric be explained by a known transient (e.g., traffic spike)?
  2. Risk‑Based Go/No‑Go – If the failure is likely to cause service degradation exceeding the SLA, elect to force rollback (manual execution of inverse steps) and schedule a post‑mortem. If the failure appears benign and likely self‑healing, grant a limited extension (e.g., +30 s) with heightened monitoring.
  3. Documentation Gate – Before proceeding, the operator must record the decision, rationale, and any manual commands executed in the change ticket. This ensures auditability and feeds future window‑sizing heuristics.
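The go/no-go step can be sketched as a pure function over the triage answers. The inputs and the 30 s extension mirror the checklist above; the exact predicate combining them is an assumption:

```python
def go_no_go(isolated, unrelated_alarms, known_transient, sla_risk):
    """Risk-based go/no-go decision (sketch).

    Returns (action, extension_s): force rollback whenever SLA-level
    degradation is likely; grant a short monitored extension only when
    the failure is isolated, explainable as a known transient, and not
    accompanied by unrelated alarms.
    """
    if sla_risk:
        return ('force_rollback', 0)
    if known_transient and isolated and not unrelated_alarms:
        return ('extend', 30)  # +30 s with heightened monitoring
    return ('force_rollback', 0)
```

Keeping the decision as a function of recorded inputs also satisfies the documentation gate: the ticket can capture the exact answers that drove the outcome.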

Troubleshooting Rollback Failures

Common Failure Scenarios

Inconsistent State

Resource Unavailability

Diagnostic Techniques

Log Analysis

System Monitoring

Recovery Procedures

Reverting to Previous State

Manual Intervention

Code Examples for Rollback Implementation

CLI Commands for Rollback

Initiating Rollback

# Example on a Cisco IOS-XR router: roll back a BGP policy change.
# Re-applying the old inbound policy replaces NEW_POLICY, since only one
# route-policy can be attached per neighbor/address-family/direction.
configure
 router bgp 65000
  neighbor 10.0.0.5
   address-family ipv4 unicast
    route-policy OLD_POLICY in
commit
end

Monitoring Rollback Progress

# Verify the old inbound policy is applied and the session is established
show bgp neighbors 10.0.0.5 | include Policy
show bgp summary | include Established
# Check TCAM utilization after QoS map revert
show platform hardware qos active statistics | include Utilization

Sample Code Snippets

Python Example (using Netmiko)

from netmiko import ConnectHandler

device = {
    'device_type': 'cisco_xr',
    'host': '10.1.1.10',
    'username': 'admin',
    'password': '*****',
}

def rollback_qos_map(conn):
    """Revert the QoS map to the previous version and verify the result."""
    cmds = [
        'class-map match-any OLD_CLASS',
        ' match precedence 0',
        'exit',
        'policy-map OLD_POLICY',
        ' class OLD_CLASS',
        '  set qos-group 1',
        ' exit',
        'exit',
        'interface HundredGigE0/0/0/0',
        ' service-policy output OLD_POLICY',
        'exit',
    ]
    # send_config_set enters config mode itself, so the command list must
    # not include 'configure'/'end'. Stay in config mode afterwards so the
    # IOS-XR commit can be issued explicitly, then leave.
    conn.send_config_set(cmds, exit_config_mode=False)
    conn.commit()
    conn.exit_config_mode()

    # Verify the old policy is attached before declaring success
    output = conn.send_command('show policy-map interface HundredGigE0/0/0/0')
    if 'OLD_POLICY' not in output:
        raise RuntimeError('Rollback verification failed')
    print('Rollback verified')

with ConnectHandler(**device) as net_conn:
    try:
        rollback_qos_map(net_conn)
    except Exception as exc:
        print(f'Error: {exc}')
