fix: control plane loses NATS connection when NATS pods restart #2986
Closed
Description
Problem
When NATS pods restart (e.g., during rolling updates, scaling events, or crashes), the control plane loses its connection to NATS and does not recover. The only workaround is to restart the control plane pods as well.
This affects audit log publishing (via JetStream) and all three NATS-backed KV caches (JWT claims, org memberships, policy eval bundles).
Root Cause
The NATS connection in app/controlplane/cmd/main.go:newNatsConnection is established with minimal options — no explicit reconnect configuration and no error/disconnect/reconnect callbacks are registered.
While the NATS Go client has reconnection enabled by default, the higher-level consumers (audit log publisher, KV caches) don't handle the underlying connection being re-established:
- Audit log publisher (app/controlplane/pkg/auditor/nats.go): Creates a JetStream context once at startup. After a reconnect, the JetStream context may become stale.
- KV caches (pkg/cache/natskv.go): A WithReconnect() option and watchReconnect() handler already exist in the cache layer, but none of the three cache constructors in app/controlplane/cmd/wire.go pass a reconnect channel.
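To make the gap in the KV caches concrete, here is a minimal, self-contained Go sketch of an opt-in reconnect channel. It does not reproduce pkg/cache/natskv.go: the names kvCache, withReconnectSignal, and simulateReconnect are hypothetical, and the generation counter merely stands in for whatever connection-derived state a real cache would rebuild. It only illustrates that a component constructed without the channel is never told about a reconnect.

```go
package main

import (
	"fmt"
	"sync"
)

// kvCache is a hypothetical stand-in for a NATS-backed cache.
// generation counts how often its connection-derived state was rebuilt.
type kvCache struct {
	mu         sync.Mutex
	generation int
	reconnect  <-chan struct{} // nil unless the caller opts in
	refreshed  chan struct{}   // signalled after each rebuild (demo only)
}

type option func(*kvCache)

// withReconnectSignal is a hypothetical analogue of a reconnect option:
// it hands the cache a channel that fires once per reconnect.
func withReconnectSignal(ch <-chan struct{}) option {
	return func(c *kvCache) { c.reconnect = ch }
}

func newKVCache(opts ...option) *kvCache {
	c := &kvCache{refreshed: make(chan struct{}, 1)}
	for _, o := range opts {
		o(c)
	}
	// The watcher only starts when a channel was supplied.
	if c.reconnect != nil {
		go c.watch()
	}
	return c
}

// watch rebuilds state on every reconnect signal until the channel closes.
func (c *kvCache) watch() {
	for range c.reconnect {
		c.mu.Lock()
		c.generation++
		c.mu.Unlock()
		c.refreshed <- struct{}{}
	}
}

func (c *kvCache) Generation() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.generation
}

// simulateReconnect sends one reconnect event and reports how many
// rebuilds the wired and the unwired cache performed.
func simulateReconnect() (wired, unwired int) {
	signal := make(chan struct{})
	wiredCache := newKVCache(withReconnectSignal(signal))
	unwiredCache := newKVCache() // no channel passed, as in the issue

	signal <- struct{}{}   // one reconnect event
	<-wiredCache.refreshed // wait until the rebuild has finished
	close(signal)

	return wiredCache.Generation(), unwiredCache.Generation()
}

func main() {
	wired, unwired := simulateReconnect()
	fmt.Printf("wired rebuilds: %d, unwired rebuilds: %d\n", wired, unwired)
}
```

In this sketch the unwired cache never starts its watcher, so it keeps whatever state it had before the restart.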
Suggested Fix
- Register reconnect/disconnect callbacks on the NATS connection in newNatsConnection to log events and signal dependent components.
- Wire the reconnect channel into the cache constructors. The infrastructure (WithReconnect, watchReconnect) already exists but is unused.
- Re-initialize JetStream context in the audit log publisher after a NATS reconnection event.
- Add connection health monitoring: expose NATS connection status in health checks or metrics so pod liveness probes can detect a permanently broken connection.
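As a rough illustration of how the callback and health-monitoring items above could fit together, the following stdlib-only Go sketch tracks connection state and fans a reconnect event out to several subscribers (for example the three caches and the audit publisher). Everything in it (connMonitor, its methods, the 30-second grace period) is hypothetical and not taken from the codebase. In a real setup, OnDisconnect and OnReconnect would presumably be called from handlers registered with the nats.go options nats.DisconnectErrHandler and nats.ReconnectHandler.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// connMonitor is a hypothetical helper that records connection state and
// notifies dependent components when the connection comes back.
type connMonitor struct {
	mu             sync.Mutex
	connected      bool
	disconnectedAt time.Time
	subscribers    []chan struct{}
}

func newConnMonitor() *connMonitor {
	return &connMonitor{connected: true}
}

// Subscribe returns a channel that receives a value after a reconnect.
func (m *connMonitor) Subscribe() <-chan struct{} {
	ch := make(chan struct{}, 1)
	m.mu.Lock()
	m.subscribers = append(m.subscribers, ch)
	m.mu.Unlock()
	return ch
}

// OnDisconnect records when the connection was first lost.
func (m *connMonitor) OnDisconnect(now time.Time) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.connected {
		m.connected = false
		m.disconnectedAt = now
	}
}

// OnReconnect marks the connection healthy and signals every subscriber.
func (m *connMonitor) OnReconnect() {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.connected = true
	for _, ch := range m.subscribers {
		select {
		case ch <- struct{}{}:
		default: // a signal is already pending; do not block the callback
		}
	}
}

// Healthy reports false once the connection has been down longer than grace.
func (m *connMonitor) Healthy(now time.Time, grace time.Duration) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.connected || now.Sub(m.disconnectedAt) <= grace
}

// runScenario simulates a disconnect followed by a reconnect.
func runScenario() (duringGrace, afterGrace, afterReconnect bool, notified int) {
	m := newConnMonitor()
	subs := []<-chan struct{}{m.Subscribe(), m.Subscribe(), m.Subscribe(), m.Subscribe()}
	start := time.Unix(0, 0)
	grace := 30 * time.Second // assumed value, for illustration only

	m.OnDisconnect(start)
	duringGrace = m.Healthy(start.Add(10*time.Second), grace)
	afterGrace = m.Healthy(start.Add(60*time.Second), grace)

	m.OnReconnect()
	afterReconnect = m.Healthy(start.Add(61*time.Second), grace)
	for _, ch := range subs {
		select {
		case <-ch:
			notified++
		default:
		}
	}
	return duringGrace, afterGrace, afterReconnect, notified
}

func main() {
	during, after, recovered, notified := runScenario()
	fmt.Println(during, after, recovered, notified)
}
```

The subscriber channels are buffered with a non-blocking send so that a slow consumer cannot stall the connection callback; several reconnects in quick succession collapse into one pending signal. A liveness or readiness handler could return a failure status whenever Healthy reports false.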