fix: add reloadable NATS connection with automatic reconnect handling #2987

Merged
migmartri merged 3 commits into chainloop-dev:main from migmartri:fix/nats-reloadable-connection
Apr 4, 2026

@migmartri
Member

Summary

When NATS pods restart, the control plane loses its connection and does not recover until restarted. This adds a ReloadableConnection wrapper in pkg/natsconn that broadcasts reconnection events to all consumers (caches and audit publisher), allowing them to reinitialize their JetStream handles automatically.

  • New pkg/natsconn package with a ReloadableConnection type that provides Subscribe/Broadcast fan-out for reconnect events. It is decoupled from the controlplane proto config so that other repositories can import it
  • Audit log publisher now reinitializes its JetStream stream on NATS reconnection
  • All three NATS KV caches now use the existing WithReconnect plumbing that was previously unwired
  • NATS connection is properly drained on shutdown via Wire cleanup
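
The Subscribe/Broadcast fan-out described above can be sketched with the standard library alone. This is a minimal illustration, not the code from pkg/natsconn: the `reconnectHub` type and its method signatures are assumptions, and the real ReloadableConnection also wraps the NATS connection itself. With nats.go, `Broadcast` would typically be triggered from a callback registered through `nats.ReconnectHandler`.

```go
package main

import (
	"fmt"
	"sync"
)

// reconnectHub is a hypothetical stand-in for the fan-out part of
// ReloadableConnection. It holds no NATS connection.
type reconnectHub struct {
	mu   sync.Mutex
	subs []chan struct{}
}

// Subscribe returns a channel that receives a signal after each reconnect.
func (h *reconnectHub) Subscribe() <-chan struct{} {
	// A buffer of 1 coalesces bursts of reconnects into one pending signal.
	ch := make(chan struct{}, 1)
	h.mu.Lock()
	h.subs = append(h.subs, ch)
	h.mu.Unlock()
	return ch
}

// Broadcast notifies every subscriber without blocking on slow ones.
func (h *reconnectHub) Broadcast() {
	h.mu.Lock()
	defer h.mu.Unlock()
	for _, ch := range h.subs {
		select {
		case ch <- struct{}{}:
		default:
			// A signal is already pending, so the subscriber will
			// reinitialize its handles anyway.
		}
	}
}

func main() {
	hub := &reconnectHub{}
	cache := hub.Subscribe()
	audit := hub.Subscribe()

	// In a real wrapper this call would run inside the reconnect callback.
	hub.Broadcast()

	<-cache
	<-audit
	fmt.Println("both consumers notified")
}
```

The non-blocking send is a design choice in this sketch: a consumer that is still reinitializing does not stall the reconnect callback, and it still sees one pending signal afterwards.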

Fixes #2986

When NATS pods restart, the control plane loses its connection and does
not recover until restarted. This adds a ReloadableConnection wrapper
that broadcasts reconnection events to all consumers (caches and audit
publisher), allowing them to reinitialize their JetStream handles.

The pkg/natsconn package is decoupled from the controlplane proto config
so it can be imported by external consumers.

Refs: chainloop-dev#2986
Signed-off-by: Miguel Martinez Trivino <miguel@chainloop.dev>

Remove dead js field and sync.RWMutex from AuditLogPublisher since
Publish uses core NATS not JetStream. Return cleanup function from
natsconn.New so Wire drains the connection on shutdown. Remove
redundant WHAT comment.

Signed-off-by: Miguel Martinez Trivino <miguel@chainloop.dev>
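
Returning a cleanup function from the constructor follows the provider shape that Wire supports, `(value, func(), error)`: the generated injector collects these closures and hands a combined cleanup back to its caller. The sketch below shows only that shape. `fakeConn` and `newConn` are hypothetical names, and the actual natsconn.New signature may differ; a real cleanup would call `Drain` on the `*nats.Conn`.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeConn is a hypothetical stand-in for *nats.Conn, which also
// exposes a Drain() error method.
type fakeConn struct{ drained bool }

// Drain marks the connection as drained; the real call flushes and closes.
func (c *fakeConn) Drain() error {
	c.drained = true
	return nil
}

// newConn mirrors the (value, cleanup, error) provider shape.
// The URL check is illustrative only.
func newConn(url string) (*fakeConn, func(), error) {
	if url == "" {
		return nil, nil, errors.New("empty NATS URL")
	}
	c := &fakeConn{}
	cleanup := func() {
		if err := c.Drain(); err != nil {
			fmt.Println("drain failed:", err)
		}
	}
	return c, cleanup, nil
}

func main() {
	conn, cleanup, err := newConn("nats://localhost:4222")
	if err != nil {
		panic(err)
	}
	// The caller of the injector is responsible for invoking cleanup on shutdown.
	cleanup()
	fmt.Println("drained:", conn.drained)
}
```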

@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 8 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="app/controlplane/cmd/wire.go">

<violation number="1" location="app/controlplane/cmd/wire.go:153">
P2: Reconnect subscriptions use `context.Background()`, so they are never canceled and never unsubscribed. Wire these to a lifecycle-canceled context to avoid lingering subscriber/watcher goroutines.</violation>
</file>

)

-func newClaimsCache(conn *nats.Conn, logger log.Logger) (cache.Cache[*jwt.MapClaims], error) {
+func newClaimsCache(rc *natsconn.ReloadableConnection, logger log.Logger) (cache.Cache[*jwt.MapClaims], error) {
Member Author

Would it make sense to pass a context to all these initializers?

Pass the application context from main() through wireApp to all NATS
reconnect subscribers (caches and audit publisher). When the context is
cancelled on shutdown, subscriber channels are closed and watcher
goroutines exit cleanly.

Signed-off-by: Miguel Martinez Trivino <miguel@chainloop.dev>
migmartri merged commit 67c1a69 into chainloop-dev:main on Apr 4, 2026
15 checks passed

Development

Successfully merging this pull request may close these issues.

fix: control plane loses NATS connection when NATS pods restart

2 participants