fix: control plane loses NATS connection when NATS pods restart #2986

@migmartri

Problem

When NATS pods restart (e.g., during rolling updates, scaling events, or crashes), the control plane loses its connection to NATS and does not recover. The only workaround is to restart the control plane pods as well.

This affects audit log publishing (via JetStream) and all three NATS-backed KV caches (JWT claims, org memberships, policy eval bundles).

Root Cause

The NATS connection in app/controlplane/cmd/main.go:newNatsConnection is established with minimal options: it sets no explicit reconnect configuration and registers no error, disconnect, or reconnect callbacks.

While the NATS Go client has reconnection enabled by default, the higher-level consumers (audit log publisher, KV caches) don't handle the underlying connection being re-established:

  • Audit log publisher (app/controlplane/pkg/auditor/nats.go): Creates a JetStream context once at startup. After a reconnect, the JetStream context may become stale.
  • KV caches (pkg/cache/natskv.go): A WithReconnect() option and watchReconnect() handler already exist in the cache layer, but none of the three cache constructors in app/controlplane/cmd/wire.go pass a reconnect channel.
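
Since the audit log publisher and each of the three caches would need its own reconnect signal, one possible shape is a small fan-out helper that a single reconnect callback drives (nats.go lets you register one via `nats.ReconnectHandler`). This is a minimal sketch using only the standard library: `reconnectBroadcaster`, `Subscribe`, and `Notify` are hypothetical names, and how a subscriber channel would be handed to `WithReconnect()` is an assumption, since that option's signature is not shown in this issue.

```go
package main

import (
	"fmt"
	"sync"
)

// reconnectBroadcaster fans one reconnect event out to many subscribers.
// Hypothetical helper for illustration; not part of the existing codebase.
type reconnectBroadcaster struct {
	mu   sync.Mutex
	subs []chan struct{}
}

// Subscribe returns a channel that receives a value after a reconnect.
// The buffer of 1 coalesces bursts: at most one signal is ever pending.
func (b *reconnectBroadcaster) Subscribe() <-chan struct{} {
	ch := make(chan struct{}, 1)
	b.mu.Lock()
	b.subs = append(b.subs, ch)
	b.mu.Unlock()
	return ch
}

// Notify is meant to be called from the client's reconnect callback.
// Sends are non-blocking, so a slow subscriber cannot stall the callback.
func (b *reconnectBroadcaster) Notify() {
	b.mu.Lock()
	defer b.mu.Unlock()
	for _, ch := range b.subs {
		select {
		case ch <- struct{}{}:
		default: // a signal is already pending for this subscriber
		}
	}
}

func main() {
	b := &reconnectBroadcaster{}
	cacheCh := b.Subscribe() // e.g. one per KV cache (assumed wiring)
	auditCh := b.Subscribe() // e.g. the audit log publisher (assumed wiring)

	b.Notify()
	b.Notify() // coalesced with the first signal

	fmt.Println(len(cacheCh), len(auditCh)) // prints "1 1"
}
```

Each consumer would drain its own channel in a goroutine and refresh its state on every signal; the audit log publisher, for instance, could rebuild its JetStream context there.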

Suggested Fix

  1. Register reconnect/disconnect callbacks on the NATS connection in newNatsConnection to log events and signal dependent components.
  2. Wire the reconnect channel into the cache constructors — the infrastructure (WithReconnect, watchReconnect) already exists but is unused.
  3. Re-initialize JetStream context in the audit log publisher after a NATS reconnection event.
  4. Add connection health monitoring — expose NATS connection status in health checks or metrics so pod liveness probes can detect a permanently broken connection.
