Rebuilding the Looking Glass: Stateless, High-Performance Network Diagnostics at Scale

Why Are Some Networks Still Running CGI in 2026?

Seriously, why are some networks still running CGI scripts from 1997 on core production edge routers? Every single time I have to troubleshoot a BGP route flap at 3 AM and load up some ancient, timeout-ridden Perl looking glass, a small part of my soul dies.

Most public looking glasses are outdated CGI or raw PHP scripts. They parse user inputs insecurely, lack any form of caching, and put a heavy CPU load on core routers during diagnostics.

When public looking glasses are broken, rate-limited, or completely offline, I often find myself falling back to executing manual command sweeps across multiple NLNOG RING nodes just to verify basic path routing and reachability. NLNOG RING is an absolute lifesaver for the network operator community, but running raw scripts to query individual nodes because a decent, secure lookup tool doesn’t exist is exactly what drove me to build a better solution.

So, I decided to build Looking Glass. It is a stateless, horizontally scalable, and secure diagnostic daemon built with Go, ConnectRPC, and Svelte 5. It manages a fleet of multi-vendor core routers over concurrent, isolated SSH sessions and streams real-time diagnostic outputs through a raw API, a neat CLI (lg-cli), and a snappy web client.

Let’s look at how I built the platform and solved real-world performance and security issues at scale.

A Quick Peek at the Architecture

First off, my looking glass is entirely stateless. I designed it to have no local storage, no databases, and absolutely no painful database migrations. This means you can easily spin up multiple replicas behind Nginx, Traefik, or Cloudflare and they will scale out horizontally.

The replicas share a lightweight Redis instance for performance caching and clustering health checks.

graph TD
    ClientCLI[lg-cli Client] -->|ConnectRPC HTTP/2| Server[Looking Glass Server Daemon]
    SvelteUI[SvelteKit UI] -->|HTTP/2 API| Server
    Server -->|Read/Write Cache| Redis[Redis Memory Cache & Sync]
    Server -->|Concurrent SSH| Routers[Fleet of Core Routers <br>FRRouting, Cisco, Juniper, etc.]

Using Go’s //go:embed directive, I embed the entire SvelteKit frontend build directly into the Go server binary. This yields a single, zero-dependency, statically linked binary. You can ship it as-is or drop it into a bare-minimum scratch Docker container.

But how do I deal with runtime configuration if the frontend is baked in at compile time? I can’t just rebuild the binary every time I want to point my web UI to a different API endpoint or Sentry key.

To bridge this gap, the Go server dynamically generates and injects a configuration wrapper script at /_app/env.js on the fly. Simple, elegant, and zero rebuilds required.

YAML Specs: Describing Routers Without Spaghetti Code

Hardcoding command drivers for different router vendors (like Cisco IOS, Juniper Junos, Nokia SR OS, or FRRouting) is a recipe for messy, unmaintainable spaghetti code. The moment a vendor updates their CLI syntax, your codebase breaks and becomes a nightmare to fix.

In Looking Glass, I decoupled the command definitions entirely from the Go codebase. Each router vendor is modeled as a YAML specification containing Go text/template blocks:

name: arista_eos
ping:
    ipv4:
        - ping vrf {{.Cfg.VRF}} ip {{.IP.IP}} source {{.Cfg.Source4.IP}} repeat 5
    ipv6:
        - ping vrf {{.Cfg.VRF}} ipv6 {{.IP.IP}} source {{.Cfg.Source6.IP}} repeat 5
# ...
parsers:
    ping:
        kind: textfsm
        template: arista_eos_ping

When a diagnostic request hits the server, Go creates a temporary _tpl_data context. This includes things like parsed IPNet targets, BGP community strings, and sanitized AS-Path expressions. It then safely renders and executes the template in an isolated block.

If you need to tweak a command or add a custom parser, you dont even have to recompile. You just set a ROUTER_DIR environment variable pointing to a local folder. At startup, the server automatically scans this directory and overlays your custom YAML templates directly onto the built-in vendor profiles.

Safe and Sound: Failure-Isolating Output Parsers

If you’ve ever dealt with network tool parsers, you know they are incredibly brittle. A minor firmware upgrade changing a single space in the CLI output can easily crash a standard parser, leaving you with a blank screen T_T.

To protect myself from this, I built a failure-isolation pipeline to handle the mess:

[Raw CLI Output] ──► [Parser Interface] ────► [Result Envelope]
                                              ├── Payload (Typed Protobuf)
                                              ├── ParserKind (TextFSM / JSON)
                                              └── ParseStatus (OK / Failed)

The server exposes a parse.Parser interface implemented by four concrete engines:

- RawParser: A simple pass-through that gives you the raw terminal output.
- TextFSMParser: Integrates the gotextfsm engine (a pure Go port of Google’s TextFSM) to match output tables against regex templates.
- JSONParser: Decodes structured JSON directly for modern shells that support it (like FRRouting’s JSON output).
- BuiltinParser: Customized Go regular expressions built specifically for standard system ping and traceroute streams.

Critically, these parsers never return Go errors. Instead, parsing errors are caught internally and mapped to a pb.ParseStatus enum (like PARSE_STATUS_PARSE_FAILED) inside a unified result envelope.

If a parser chokes, the gRPC handler isolates the failure and falls back to rendering the raw terminal output. This guarantees that your network engineers still get their diagnostic logs, even if the parser template has drifted out of sync.
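Here is a minimal sketch of that envelope pattern. The type and constant names (`Result`, `StatusOK`, `StatusFailed`, `safeParse`) are illustrative assumptions rather than the project's actual parse package or Protobuf types, and the recover step is an extra guard I added for the sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ParseStatus mirrors the idea of a status enum; the names are hypothetical.
type ParseStatus int

const (
	StatusOK ParseStatus = iota
	StatusFailed
)

// Result is the envelope every parser returns instead of a Go error.
type Result struct {
	Payload any
	Kind    string
	Status  ParseStatus
	Raw     string // always kept so the caller can fall back to raw output
}

type Parser interface {
	Parse(raw string) Result
}

type JSONParser struct{}

func (JSONParser) Parse(raw string) Result {
	var payload map[string]any
	if err := json.Unmarshal([]byte(raw), &payload); err != nil {
		return Result{Kind: "json", Status: StatusFailed, Raw: raw}
	}
	return Result{Payload: payload, Kind: "json", Status: StatusOK, Raw: raw}
}

// brokenParser simulates a parser bug that panics on unexpected input.
type brokenParser struct{}

func (brokenParser) Parse(raw string) Result { panic("template drifted") }

// safeParse additionally converts a panic inside a parser into a failed status.
func safeParse(p Parser, raw string) (res Result) {
	defer func() {
		if r := recover(); r != nil {
			res = Result{Kind: "unknown", Status: StatusFailed, Raw: raw}
		}
	}()
	return p.Parse(raw)
}

func main() {
	outputs := []string{`{"loss_pct": 0}`, "5 packets transmitted, 5 received"}
	for _, raw := range outputs {
		res := safeParse(JSONParser{}, raw)
		if res.Status == StatusOK {
			fmt.Println("structured:", res.Payload)
		} else {
			fmt.Println("raw fallback:", res.Raw)
		}
	}
	fmt.Println("after panic:", safeParse(brokenParser{}, "garbled").Raw)
}
```

The handler only has to check one status field; it never needs to know which engine failed or why.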

Zero-Tolerance Input Sanitization

Network tools that execute commands on remote routers are prime targets for shell injection. I took a zero-tolerance approach to user input.

Standard target queries (IPs or hostnames) are parsed as strict IPNet CIDR structures. If they aren’t valid, the execution stops immediately. Hostnames are resolved through secure local DNS lookup buffers before any SSH command template is ever rendered.
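For the IP half of that check, a minimal sketch using only the standard library could look like this. The function name `parseTarget` is hypothetical, and hostname resolution is left out; a hostname would have to be resolved to an address before reaching this step.

```go
package main

import (
	"fmt"
	"net"
)

// parseTarget accepts a bare IP or a CIDR prefix and returns a strict
// *net.IPNet. Anything else is rejected before a template is rendered.
func parseTarget(s string) (*net.IPNet, error) {
	if _, ipnet, err := net.ParseCIDR(s); err == nil {
		return ipnet, nil
	}
	ip := net.ParseIP(s)
	if ip == nil {
		return nil, fmt.Errorf("invalid target %q", s)
	}
	if v4 := ip.To4(); v4 != nil {
		// Bare IPv4 address: treat it as a /32 host route.
		return &net.IPNet{IP: v4, Mask: net.CIDRMask(32, 32)}, nil
	}
	// Bare IPv6 address: treat it as a /128 host route.
	return &net.IPNet{IP: ip, Mask: net.CIDRMask(128, 128)}, nil
}

func main() {
	for _, in := range []string{"192.0.2.1", "2001:db8::1", "203.0.113.0/24", "192.0.2.1; reload"} {
		ipnet, err := parseTarget(in)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", ipnet)
	}
}
```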

Additionally, BGP AS-Path regular expressions are limited to 30 characters and strictly matched against a white-list: ^[0-9_]+\$?$. If the expression contains a single unexpected character, the request is instantly thrown out in the Go handler, keeping remote terminal buffers completely safe from escape sequence injections.

Avoiding Router Meltdown (With Redis Lease Locking)

To keep my web dashboard updated with router statuses, the server sweeps its catalog once a minute by spinning up no-op SSH sessions to check reachability.

But if you run multiple instances of the server horizontally for high availability, a naive background loop becomes a major headache. Each replica will run its own health check sweep simultaneously, creating a "probe multiplier storm." This can easily trigger security rate limiters or cause CPU spikes on production edge routers.

To solve this, I designed a clustered health coordination pattern using Redis lease locking:

sequenceDiagram
    participant R1 as Replica 1
    participant R2 as Replica 2
    participant Redis as Redis Cache
    participant Router as Core Router

    Note over R1, R2: Sweep Tick (Every 60s)
    R1->>Redis: SetNX lg:health:lease:router1 (70s TTL)
    Redis-->>R1: Success (Lease Granted)
    R2->>Redis: SetNX lg:health:lease:router1 (70s TTL)
    Redis-->>R2: Fail (Lease Active)

    Note over R1: Active Prober
    R1->>Router: SSH Ping Check
    R1->>Redis: Publish lg:health:state:router1 (State: UP)

    Note over R2: Fallback Reader
    R2->>Redis: Read lg:health:state:router1
    Note over R2: Adopt Peer State (No SSH)

Each router health check is gated by a Redis SetNX lock with a 70-second TTL at lg:health:lease:<router_name>. The server replica that successfully grabs the lock becomes the active prober. It executes the SSH connection, updates its local health bit, and publishes the timestamped result to lg:health:state:<router_name>.

The other replicas fail to acquire the lease, skip the SSH connection entirely, and instantly adopt the state published by the active leader.

If Redis goes down, the replicas gracefully catch the connection error and fall back to local direct probing automatically, ensuring that high availability remains intact.
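To make the control flow concrete, here is a sketch of the lease logic against a small store interface. `LeaseStore`, `memStore`, `downStore`, and `checkRouter` are hypothetical names; the in-memory store only stands in for Redis so the example runs on its own, and the published state is simplified to a plain UP/DOWN string without the timestamp. Probing directly when no state has been published yet is my assumption, not something the design above specifies.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// LeaseStore is a hypothetical abstraction over the three Redis calls the
// pattern needs; a real deployment would back it with a Redis client.
type LeaseStore interface {
	SetNX(key, value string, ttl time.Duration) (bool, error)
	Set(key, value string) error
	Get(key string) (string, error)
}

// memStore is an in-memory stand-in for Redis, used only for this demo.
type memStore struct {
	mu     sync.Mutex
	data   map[string]string
	expiry map[string]time.Time
}

func newMemStore() *memStore {
	return &memStore{data: map[string]string{}, expiry: map[string]time.Time{}}
}

func (m *memStore) SetNX(key, value string, ttl time.Duration) (bool, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if exp, ok := m.expiry[key]; ok && time.Now().Before(exp) {
		return false, nil // lease still held by another replica
	}
	m.data[key] = value
	m.expiry[key] = time.Now().Add(ttl)
	return true, nil
}

func (m *memStore) Set(key, value string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.data[key] = value
	return nil
}

func (m *memStore) Get(key string) (string, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	v, ok := m.data[key]
	if !ok {
		return "", errors.New("key not found")
	}
	return v, nil
}

// downStore simulates an unreachable Redis instance.
type downStore struct{}

func (downStore) SetNX(string, string, time.Duration) (bool, error) {
	return false, errors.New("redis down")
}
func (downStore) Set(string, string) error   { return errors.New("redis down") }
func (downStore) Get(string) (string, error) { return "", errors.New("redis down") }

// checkRouter returns the router state and whether this replica opened SSH.
func checkRouter(store LeaseStore, replica, router string, probe func() bool) (up, probed bool) {
	won, err := store.SetNX("lg:health:lease:"+router, replica, 70*time.Second)
	if err != nil {
		return probe(), true // store unreachable: fall back to a direct probe
	}
	stateKey := "lg:health:state:" + router
	if won {
		up = probe()
		state := "DOWN"
		if up {
			state = "UP"
		}
		_ = store.Set(stateKey, state) // best-effort publish for the peers
		return up, true
	}
	state, err := store.Get(stateKey)
	if err != nil {
		return probe(), true // assumption: no published state yet, so probe
	}
	return state == "UP", false
}

func main() {
	store := newMemStore()
	probe := func() bool { return true } // stands in for the no-op SSH session
	for _, replica := range []string{"replica-1", "replica-2"} {
		up, probed := checkRouter(store, replica, "router1", probe)
		fmt.Printf("%s: up=%v probed=%v\n", replica, up, probed)
	}
	up, probed := checkRouter(downStore{}, "replica-1", "router1", probe)
	fmt.Printf("redis down: up=%v probed=%v\n", up, probed)
}
```

With the 70-second TTL outliving the 60-second tick, the lease for a router is still held when the next sweep starts, which keeps a second replica from probing in the gap between ticks.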

Build-Stable Namespaces for Zero-Downtime Cache Rolling

Caching is great for protecting core routers from repeated denial-of-service queries. But caching dynamic protobuf states in a shared Redis instance introduces a nasty risk. If you roll out an update that changes internal message definitions, older cache items can corrupt the new replicas when they try to deserialize them.

I solved this by generating a build-stable version namespace using compile-time linker flags:

CGO_ENABLED=0 go build -ldflags="-X github.com/AS203038/looking-glass/pkg/utils.release=$(VERSION)"

At runtime, every single cache key written to Redis is automatically prefixed with this build tag:

lg:rpc:<build_version>:<op_name>:<router_id>:<target_hash>

During a rolling deployment, both old and new replicas run side-by-side. Since the cache key contains the exact version hash, the new containers write to and read from a fresh, separate cache namespace. Stale schema models are never read, and the update acts as an immediate, atomic global cache invalidation. No manual cache flushing required!
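A sketch of how such a key might be assembled is below. The `release` variable plays the role of the value injected through -ldflags; using a truncated SHA-256 digest as the target hash is an assumption made for illustration, as is the `cacheKey` helper name.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// release would be overwritten at build time via -ldflags "-X ...".
// "dev" is a placeholder default for local builds.
var release = "dev"

// cacheKey builds lg:rpc:<build_version>:<op_name>:<router_id>:<target_hash>.
// The hash function and its truncation are assumptions for this sketch.
func cacheKey(op, routerID, target string) string {
	sum := sha256.Sum256([]byte(target))
	return fmt.Sprintf("lg:rpc:%s:%s:%s:%s", release, op, routerID, hex.EncodeToString(sum[:8]))
}

func main() {
	oldKey := cacheKey("ping", "fra1", "192.0.2.1/32")

	// Simulate a replica built from a newer release.
	release = "v2.0.0"
	newKey := cacheKey("ping", "fra1", "192.0.2.1/32")

	fmt.Println(oldKey)
	fmt.Println(newKey)
	fmt.Println("shared namespace:", oldKey == newKey)
}
```

The same request produces different keys under different builds, so a new replica can never deserialize an entry written by an old one.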

The API-First Paradigm: ConnectRPC as an Open Diagnostic Standard

Historically, looking glass software was designed strictly for humans. If you wanted to automate reachability checks, you had to write brittle screen-scrapers to parse raw HTML tables (which would instantly break the moment a style or layout was tweaked).

I completely rejected this clunky approach. From day one, I designed Looking Glass to be an API-first gateway that exposes the ConnectRPC protocol.

Why did I choose ConnectRPC?

- Statically-Typed Contracts: The entire API is defined in standard Protobuf (lookingglass.proto). Every single input (IP parameters, target hostnames, BGP communities) and output (Ping stats, Traceroute hops, BGP paths) is strictly typed.
- Cross-Language Client Generation: Developers can compile .proto files into typed, native clients for Go, Python, TypeScript, Rust, or Java.
- ConnectRPC’s Multi-Protocol Transport: ConnectRPC is incredibly flexible. It lets the server speak standard gRPC, gRPC-Web (meaning standard browsers can run RPC calls directly without a proxy), and standard JSON-over-HTTP/1.1 or HTTP/2 simultaneously under a single port.

This decision goes way beyond my specific Go backend implementation. Because the contract is fully open, any ConnectRPC-speaking endpoint that implements the LookingGlassService contract is fully compatible.

This completely decouples the client from the server:

- Universal CLI (lg-cli): The lg-cli can target my Go daemon, but it can also target any public instance listed in the public_index.yaml registry, even if that instance’s backend is implemented in Rust or Python, as long as it adheres to the Protobuf schema.
- Federated Diagnostics: CDNs, ISPs, and SRE teams can implement cross-AS automated diagnostics. For example, during a BGP route leak or packet loss event, automated scripts can query peer looking glasses programmatically, triggering instant traceroutes to isolate path withdrawals without human intervention.
- Integration Ecosystems: It opens the door for automated monitoring systems (like Prometheus blackbox exporters) or ChatOps bots to execute lightweight, typed diagnostic lookups seamlessly.

60fps Routing Diffs in Svelte 5

The frontend web UI is built with Svelte 5, using Svelte Runes like $state, $derived.by, and $effect to handle reactive UI states across complex diagnostic layouts.

One of the coolest features I built is the side-by-side comparative Diff View. It lets operators compare routing paths, ping latencies, and BGP tables from different routers side-by-side (e.g., comparing route propagation between Stockholm and Frankfurt).

Comparing large BGP tables is a CPU-intensive task. If the string diffing ran directly on Svelte’s main rendering thread, the browser UI would stutter and drop frames, especially on mobile devices or low-powered laptops T_T.

To keep the interface running at a silky-smooth 60fps, I offloaded all diff calculations to a dedicated browser Web Worker (diff.worker.ts) using asynchronous message passing. The Svelte UI streams the raw tables to the background worker, which processes the diff and posts the highlighted differences back. The UI repaints the diffs instantly without blocking the main event loop.

Building From Scratch: Under 20MB of Pure Go

To keep my attack surface as small as possible in production, I configured a multi-stage Dockerfile that compiles everything down to a completely empty environment:

# Stage 1: Protobuf compilation with buf
# Stage 2: SvelteKit assets bundling with node:alpine & pnpm
# Stage 3: Go compilation under golang:alpine (statically linked)
# Stage 4: Execution container
FROM scratch AS final
COPY --from=go-builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/
COPY --from=go-builder /opt/looking-glass /looking-glass
USER 65532:65532
ENTRYPOINT ["/looking-glass"]

The final Docker image is built FROM scratch and is incredibly lightweight, weighing in at under 20MB! It contains no shell, no package manager, and runs as a completely unprivileged non-root user (UID 65532). Even if an attacker somehow found a vulnerability, there is nothing in the container besides the server binary and the CA certificates for them to execute.

Of course, since I release multi-arch binaries and pre-built multi-arch Docker images, operators can pull and run it instantly on AMD64 or ARM64 hardware without compiling a thing.

What’s Next on the Roadmap?

Looking Glass is proof that network operations software doesn’t have to look and feel like a legacy relic from the dot-com bubble. By combining Go’s static compilation and concurrent SSH channels with a modern ConnectRPC API and a snappy Svelte 5 UI, I can build diagnostics tooling that is secure, fast, and highly scalable.

The project is fully open source and actively developed. On my roadmap, I am working on:

- Native Prometheus metric endpoints to track SSH latency, connection pools, and cache performance.
- Default YAML templates for VyOS and Huawei VRP.

If you are running an autonomous system or a CDN, check out the repository, drop in your custom YAML specifications, and level up your diagnostic infrastructure!

Codebase & Contributing: Looking Glass on GitHub
Demo Instances:
- AS203038 Demo
- AS214503 Demo