Introduction to Taxonomy Design
Overview of Taxonomy Requirements
A large‑scale service‑provider network must maintain a single source of truth for three orthogonal concerns that drive operational automation:
1. Regional Preference – how traffic should be steered based on geography, latency, peering economics, and regulatory constraints.
2. Backup Intent – declarative statements about which prefixes, services, or state must be replicated, retained, or archived, together with retention‑policy metadata.
3. Maintenance Drains – scheduled or event‑driven reductions of forwarding capacity (e.g., graceful shutdown, traffic shift) that allow safe hardware/software upgrades without service impact.
The taxonomy must be extensible to accommodate:
- New POPs (Points of Presence) added frequently as the footprint expands.
- New transit contracts that introduce fresh AS‑paths, MED values, or local‑pref adjustments.
- Multi‑vendor policy rendering where the same logical intent must be translated into vendor‑specific configuration syntax (Juniper Junos, Cisco IOS‑XR, Nokia SR OS, etc.).
If the taxonomy collapses under any of these dimensions, operators lose the ability to automate safely, leading to manual interventions, configuration drift, and increased MTTR.
Key Considerations for Scalability and Flexibility
| Dimension | Design Pressure | Recommended Approach | Trade‑off |
|---|---|---|---|
| Cardinality | 10⁴–10⁵ POPs, 10³ transit contracts, 10² vendor families | Store taxonomy as hierarchical, version‑controlled data (e.g., Git‑backed YAML/JSON) with schema validation (JSON‑Schema or OpenAPI). | Slight overhead for schema validation; mitigated by CI linting. |
| Change Velocity | POPs added weekly; contracts renegotiated monthly; policy updates daily | Immutable commits + pull‑request (PR) gating with automated test‑rendering pipeline. | Requires disciplined GitOps; rollback is a git revert. |
| Multi‑Vendor Rendering | Same logical intent must produce syntactically correct configs for ≥3 vendors | Policy‑as‑code layer: intent → intermediate representation (IR) → vendor‑specific Jinja2/Terraform templates. | Template maintenance burden; mitigated by shared library of Jinja macros. |
| Observability | Need to detect mismatches between intent and rendered config | Export render‑diff artifacts to a time‑series DB (e.g., Prometheus) and alert on non‑zero diff. | Adds pipeline step; negligible latency (<2 s per POP). |
| Safety | Prevent accidental traffic loss during maintenance drains | Blast‑radius tags attached to each drain intent; enforcement via OPA policies that reject drains exceeding a threshold (e.g., >5 % of total egress capacity). | OPA evaluation adds ~10 ms per intent; acceptable. |
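The Safety row's blast‑radius rule is easy to make concrete. Below is a minimal sketch in Python of the check an OPA policy would enforce; the 5 % threshold, function name, and capacity figures are illustrative assumptions, and a production deployment would express this as Rego evaluated by OPA.

```python
# Sketch of the blast-radius gate from the Safety row. In production this
# would be an OPA/Rego policy; names and numbers here are illustrative.
MAX_BLAST_RADIUS_PERCENT = 5.0  # assumed policy threshold


def validate_drain(drain_capacity_mbps: float, total_egress_mbps: float) -> None:
    """Reject a drain intent whose share of egress capacity exceeds the threshold."""
    share = 100.0 * drain_capacity_mbps / total_egress_mbps
    if share > MAX_BLAST_RADIUS_PERCENT:
        raise ValueError(
            f"drain would remove {share:.1f}% of egress capacity "
            f"(limit {MAX_BLAST_RADIUS_PERCENT}%)"
        )


validate_drain(drain_capacity_mbps=40_000, total_egress_mbps=1_200_000)  # ~3.3%: passes
```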
Regional Preference Encoding
Hierarchical Structure for Regional Preferences
Regional preference is expressed as a tree where each node corresponds to a geographic aggregation point (continent → country → metro → POP). Leaf nodes hold the actual preference values (e.g., local‑pref, MED, weight).
```
Continent
└─ Country
   ├─ Metro
   │  ├─ POP-A
   │  │  ├─ preference: {local_pref: 150, med: 10}
   │  │  └─ tags: ["low_latency", "eu-gdpr"]
   │  └─ POP-B
   │     ├─ preference: {local_pref: 120, med: 20}
   │     └─ tags: ["cost_optimized"]
   └─ Rural-Area
      └─ POP-C
         ├─ preference: {local_pref: 80, med: 100}
         └─ tags: ["backup_only"]
```
Each node may inherit preferences from its parent unless overridden. Inheritance is explicit (via `inherit: true`) to avoid accidental shadowing.
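To make the inheritance rule concrete, here is a minimal sketch of how a renderer could resolve effective preferences; `resolve_preference` and the in‑memory node layout are illustrative assumptions, not part of the schema.

```python
# Sketch: resolve effective preference values by walking up the hierarchy.
# The node layout and function name are assumptions for illustration.
def resolve_preference(node: dict, nodes_by_path: dict) -> dict:
    """Merge parent preferences only when inheritance is explicitly enabled."""
    effective: dict = {}
    if node.get("inherit") and node.get("parent"):
        parent = nodes_by_path[node["parent"]]
        effective.update(resolve_preference(parent, nodes_by_path))
    effective.update(node.get("preference", {}))  # local values win over inherited
    return effective


nodes_by_path = {"metro/nyc": {"preference": {"local_pref": 150, "med": 10}}}
pop = {"parent": "metro/nyc", "inherit": True, "preference": {"local_pref": 180}}
assert resolve_preference(pop, nodes_by_path) == {"local_pref": 180, "med": 10}
```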
Attribute‑Based Encoding for Regional Variations
Beyond the hierarchy, we attach key‑value attributes that capture non‑hierarchical factors (e.g., regulatory regimes, peering type, SLA class). Attributes are stored as a flat map at each node and are consulted during policy rendering.
| Attribute | Type | Example Values | Usage |
|---|---|---|---|
| `regime` | enum | `gdpr`, `ccpa`, `none` | Influences data‑retention logic. |
| `peering_type` | enum | `transit`, `private`, `ixp` | Affects MED/local‑pref calculation. |
| `sla_class` | string | `platinum`, `gold`, `silver` | Maps to queue‑profile selection. |
| `capacity_mbps` | integer | `10000` | Used for drain‑blast‑radius checks. |
| `maintenance_window` | cron string | `"0 2 * * SAT"` | Default window for POPs lacking an explicit drain schedule. |
Attributes are typed via JSON‑Schema, enabling validation at commit time.
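As a sketch of what commit‑time validation might look like, the following uses the Python `jsonschema` and PyYAML packages to lint every regional node file in CI; the schema path and repository layout are assumptions.

```python
# Sketch of a CI lint step that validates taxonomy nodes against JSON-Schema.
# Paths are illustrative; adjust to the actual repo layout.
import json
import sys
from pathlib import Path

import yaml  # PyYAML
from jsonschema import ValidationError, validate

schema = json.loads(Path("schemas/regional-node.json").read_text())

failed = False
for node_file in Path("taxonomy/regional").rglob("*.yaml"):
    try:
        validate(instance=yaml.safe_load(node_file.read_text()), schema=schema)
    except ValidationError as exc:
        print(f"{node_file}: {exc.message}")
        failed = True

sys.exit(1 if failed else 0)
```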
Examples of Regional Preference Encoding in Practice
File layout (GitOps repo):
```
/taxonomy/regional/
├─ continent/
│  ├─ na.yaml
│  ├─ eu.yaml
│  └─ apac.yaml
├─ country/
│  ├─ us.yaml
│  ├─ de.yaml
│  └─ jp.yaml
├─ metro/
│  ├─ nyc.yaml
│  ├─ frankfurt.yaml
│  └─ tokyo.yaml
└─ pop/
   ├─ nyc01.yaml
   ├─ nyc02.yaml
   ├─ de01.yaml
   └─ jp01.yaml
```
**Sample `nyc01.yaml`:**
```yaml
# nyc01.yaml
parent: metro/nyc
inherit: true
preference:
  local_pref: 180   # higher than default to favor local exit
  med: 5
attributes:
  regime: none
  peering_type: transit
  sla_class: platinum
  capacity_mbps: 40000
  maintenance_window: "0 3 * * SUN"
tags:
  - "high_capacity"
  - "primary_exit"
```
Rendering snippet (Jinja2) for Juniper Junos:
```jinja2
{% set pref = regional_data['preference'] %}
set policy-options policy-statement REGIONAL_PREF term 1 from protocol bgp
set policy-options policy-statement REGIONAL_PREF term 1 then local-preference {{ pref.local_pref }}
set policy-options policy-statement REGIONAL_PREF term 1 then metric {{ pref.med }}
```
If `inherit: false` is set, the node ignores all parent values and relies solely on its own block, which is useful for overriding a continent‑wide default for a specific POP.
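To show how intent flows through the rendering layer, here is a minimal sketch that renders the Junos snippet above with the `jinja2` package; the inlined template string and sample data are illustrative.

```python
# Sketch: render the Junos policy snippet from resolved regional data.
from jinja2 import Template

JUNOS_TEMPLATE = Template(
    "set policy-options policy-statement REGIONAL_PREF term 1 "
    "then local-preference {{ pref.local_pref }}\n"
    "set policy-options policy-statement REGIONAL_PREF term 1 "
    "then metric {{ pref.med }}\n"
)

regional_data = {"preference": {"local_pref": 180, "med": 5}}  # e.g., resolved nyc01
print(JUNOS_TEMPLATE.render(pref=regional_data["preference"]), end="")
```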
Backup Intent and Data Management
Backup Intent Encoding and Data Retention
Backup intent is modeled as a declarative policy attached to any taxonomy node (region, POP, or service). It specifies:
- What to backup (prefix list, service identifier, config snapshot).
- Where (object‑store bucket, NFS target, or remote site).
- Retention (duration, version count, legal‑hold flag).
- Consistency (snapshot vs. streaming, checksum verification).
Schema (JSON‑Schema excerpt):
```json
{
  "$id": "https://example.com/schemas/backup-intent.json",
  "type": "object",
  "required": ["target", "retention"],
  "properties": {
    "target": { "type": "string", "format": "uri" },
    "scope": {
      "type": "object",
      "oneOf": [
        { "required": ["prefix_list"] },
        { "required": ["service_id"] },
        { "required": ["config_snapshot"] }
      ]
    },
    "retention": {
      "type": "object",
      "required": ["duration_days"],
      "properties": {
        "duration_days": { "type": "integer", "minimum": 1 },
        "max_versions": { "type": "integer", "minimum": 1 },
        "legal_hold": { "type": "boolean", "default": false }
      }
    },
    "consistency": { "type": "string", "enum": ["snapshot", "stream"], "default": "snapshot" },
    "tags": { "type": "array", "items": { "type": "string" } }
  }
}
```
**Example intent for a POP’s routing table:**
```yaml
# backup/nyc01-routing.yaml
target: s3://backup-nyc/routing/
scope:
prefix_list: ["10.0.0.0/8", "192.168.0.0/16"]
retention:
duration_days: 365
max_versions: 12
legal_hold: false
consistency: snapshot
tags: ["routing", "daily"]
Data Management Strategies for Backup and Archive
- Tiered Storage – Recent snapshots (<30 d) go to a high‑performance object store (e.g., S3 Standard); older data transitions to S3 Glacier Deep Archive via lifecycle rules.
- Immutable Writes – Enable S3 Object Lock (Governance mode) to satisfy legal‑hold requirements; OPA validates that any intent with `legal_hold: true` targets a bucket with Object Lock enabled.
- Deduplication – The backup agent computes the SHA‑256 of each chunk; identical chunks across POPs are stored once (reference‑counted).
- Verification – A nightly job reads a random 1 % sample, recomputes each hash, and compares it against stored metadata; mismatches trigger a PagerDuty alert (see the sketch after this list).
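A minimal sketch of that nightly sample‑and‑verify job follows; the chunk structure and alerting hook are illustrative assumptions (a real implementation would read the chunks back from the object store).

```python
# Sketch of the nightly 1% sample-and-verify job. The chunk dicts stand in
# for objects read back from the store; alerting is left to the caller.
import hashlib
import random


def verify_sample(chunks: list[dict], sample_ratio: float = 0.01) -> list[str]:
    """Recompute SHA-256 over a random sample; return IDs whose hashes mismatch."""
    sample = random.sample(chunks, max(1, int(len(chunks) * sample_ratio)))
    return [
        c["id"]
        for c in sample
        if hashlib.sha256(c["data"]).hexdigest() != c["stored_sha256"]
    ]  # a non-empty result would trigger the PagerDuty alert
```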
CLI Examples for Backup Intent Configuration
Assuming a custom CLI `taxonomyctl` that interacts with the GitOps repo via a lightweight API:
```bash
# Validate a new backup intent file against the schema
taxonomyctl backup validate --file backup/nyc01-routing.yaml

# Intent diff: show what would change if we merge this intent
taxonomyctl backup diff --file backup/nyc01-routing.yaml --branch main

# Apply intent (creates a PR, runs CI, auto-merges on success)
taxonomyctl backup apply --file backup/nyc01-routing.yaml --msg "Add daily routing backup for NYC01"

# List all intents affecting a given POP
taxonomyctl backup list --pop nyc01 --format json
```
Under the hood, `taxonomyctl backup apply` runs:
- `git checkout -b backup/nyc01-routing-<timestamp>`
- Copies the file to the appropriate directory (`/taxonomy/backup/`)
- Runs `jsonschema -i backup/nyc01-routing.yaml $SCHEMA`
- Triggers the CI pipeline (GitHub Actions), which renders a dry‑run of the backup‑agent config and pushes an artifact.
- If all checks pass, opens a PR; upon approval and merge, the intent becomes effective.
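A rough Python equivalent of those steps is sketched below; the branch naming, schema path, and push‑triggered CI are assumptions about how `taxonomyctl` behaves rather than its actual implementation.

```python
# Sketch of the `taxonomyctl backup apply` flow; paths and commands are illustrative.
import shutil
import subprocess
import time

intent = "backup/nyc01-routing.yaml"
branch = f"backup/nyc01-routing-{int(time.time())}"

subprocess.run(["git", "checkout", "-b", branch], check=True)
shutil.copy(intent, "taxonomy/backup/")  # place the intent in the repo
subprocess.run(["jsonschema", "-i", intent, "schemas/backup-intent.json"], check=True)
subprocess.run(["git", "add", "taxonomy/backup/"], check=True)
subprocess.run(["git", "commit", "-m", "Add daily routing backup for NYC01"], check=True)
subprocess.run(["git", "push", "-u", "origin", branch], check=True)  # CI and PR take over
```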
Maintenance Drains and Scheduling
Maintenance Window Scheduling and Resource Allocation
Each POP may declare zero or more maintenance drain intents. A drain intent contains:
- Scope – which interfaces, protocols, or services to drain (e.g., `bgp peer 10.0.0.5`, `isis level-2`).
- Schedule – cron expression or absolute timestamp with optional timezone.
- Grace Period – time to wait after issuing the drain before considering the resource unavailable (allows in‑flight flows to finish).
- Blast‑Radius Limit – maximum percentage of total egress capacity that may be drained concurrently (enforced by OPA).
- Dependencies – list of other drain intents that must complete first (forming a DAG; see the ordering sketch after the schema).
Schema excerpt:
```json
{
  "$id": "https://example.com/schemas/drain-intent.json",
  "type": "object",
  "required": ["scope", "schedule", "grace_period_seconds", "blast_radius_percent"],
  "properties": {
    "scope": { "type": "string", "pattern": "^(bgp|isis|ospf|l2vpn)\\s+.+$" },
    "schedule": { "type": "string", "format": "date-time" },
    "grace_period_seconds": { "type": "integer", "minimum": 0 },
    "blast_radius_percent": { "type": "number", "minimum": 0, "maximum": 100 },
    "dependencies": { "type": "array", "items": { "type": "string" } },
    "tags": { "type": "array", "items": { "type": "string" } }
  }
}
```
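Because dependencies form a DAG, a valid execution order falls out of a topological sort. Here is a minimal sketch using Python's standard‑library `graphlib`; the intent names are made up for illustration.

```python
# Sketch: order drain intents so dependencies always complete first.
from graphlib import TopologicalSorter

intents = {
    "drain-nyc01-bgp": {"dependencies": []},
    "drain-nyc02-bgp": {"dependencies": ["drain-nyc01-bgp"]},
    "drain-nyc01-isis": {"dependencies": ["drain-nyc01-bgp"]},
}

# Map each intent to its predecessors; static_order() yields predecessors first.
graph = {name: set(spec["dependencies"]) for name, spec in intents.items()}
print(list(TopologicalSorter(graph).static_order()))
# e.g., ['drain-nyc01-bgp', 'drain-nyc02-bgp', 'drain-nyc01-isis']
```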
Automated Maintenance Drains Using Scripts and Tools
The preferred automation stack:
- Scheduler – Apache Airflow (or Temporal) reads drain intents from the taxonomy repo, creates DAG runs per schedule.
- Executor – Ansible playbook (or Nornir) that pushes vendor‑specific drain commands via NETCONF/RESTCONF/gNMI.
- Safety Gate – Before execution, the executor queries a real‑time telemetry service (Prometheus) for current egress utilization per POP; if the projected drain would exceed the intent's `blast_radius_percent`, the run is paused and an alert is raised (a sketch of this check follows the list).
- Rollback – Each drain command is paired with an undo command (e.g., `no shutdown` vs `shutdown`). The executor records the undo in a temporary file; on failure or manual abort, it replays the undo sequence.
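As referenced in the Safety Gate item, here is a minimal sketch of the pre‑drain check against the Prometheus HTTP query API; the metric name, label, and capacity semantics are assumptions about the telemetry setup.

```python
# Sketch of the pre-drain safety gate. The metric name and label are assumed;
# any series exposing per-POP egress capacity would work the same way.
import requests

PROM_URL = "http://prometheus:9090/api/v1/query"


def drain_within_blast_radius(pop: str, drain_mbps: float, limit_percent: float) -> bool:
    """Return True when the projected drain stays under the blast-radius limit."""
    resp = requests.get(
        PROM_URL,
        params={"query": f'sum(pop_egress_capacity_mbps{{pop="{pop}"}})'},
        timeout=5,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    if not result:
        return False  # no telemetry: fail closed and pause the run
    total_capacity_mbps = float(result[0]["value"][1])
    return 100.0 * drain_mbps / total_capacity_mbps <= limit_percent
```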
Code Examples for Maintenance Drain Implementation
Airflow DAG snippet (Python):
```python
from airflow import DAG
from airflow.providers.http.operators.http import SimpleHttpOperator
from airflow.utils.dates import days_ago
import json


def load_drain_intents(**context):
    """Alternative fetch path: pull intents via the taxonomyctl CLI instead of the HTTP API."""
    import subprocess
    result = subprocess.run(
        ["taxonomyctl", "drain", "list", "--format", "json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


with DAG(
    dag_id="maintenance_drain_scheduler",
    schedule_interval=None,  # triggered externally by taxonomy updates
    start_date=days_ago(1),
    catchup=False,
) as dag:
    # Fetch the current drain intents from the taxonomy API.
    fetch_intents = SimpleHttpOperator(
        task_id="fetch_intents",
        http_conn_id="taxonomy_api",
        endpoint="/drain/intents",
        method="GET",
        response_filter=lambda r: json.loads(r.text),
    )

    # Dynamic task mapping: one drain task per fetched intent
    # (XComArg.map requires Airflow 2.4+). A pre-execution safety
    # check that validates blast radius is omitted for brevity.
    drain_tasks = SimpleHttpOperator.partial(
        task_id="execute_drain",
        http_conn_id="vendor_api",
        endpoint="/drain/execute",
        method="POST",
        headers={"Content-Type": "application/json"},
    ).expand(data=fetch_intents.output.map(json.dumps))
```
Ansible playbook for Juniper Junos drain:
```yaml
# Invoked once per drain intent; `item` is expected to be supplied by the
# caller (e.g., a wrapper play looping over intents, or --extra-vars).
- name: Execute BGP peer drain on Juniper
  hosts: "{{ target_pop }}"
  vars:
    peer: "{{ item.peer }}"
    grace: "{{ item.grace_period_seconds }}"
  tasks:
    - name: Issue peer shutdown
      junipernetworks.junos.junos_config:
        lines:
          - "set protocols bgp group {{ item.group }} neighbor {{ peer }} shutdown"
        comment: "Maintenance drain per taxonomy intent"
      register: shutdown_result

    - name: Wait for grace period
      ansible.builtin.pause:
        seconds: "{{ grace }}"

    - name: Verify peer state
      junipernetworks.junos.junos_command:
        commands:
          - "show bgp neighbor {{ peer }}"
      register: bgp_state
      # junos_command returns stdout as a list, one entry per command
      until: "'State: Idle' in bgp_state.stdout[0]"
      retries: 10
      delay: 5

    - name: Record undo command
      # (undo logic would be added here)
      ansible.builtin.debug:
        msg: "Record undo command for {{ peer }}"
```