Skip to content

Conversation

@jrasell
Copy link
Member

@jrasell jrasell commented Nov 28, 2025

The Nomad server agent can perform configuration reloads of certain items, notably the TLS configuration when it is sent a SIGHUP signal. If the TLS configuration has changed, the server will overwrite the existing in-memory config object and reload its TLS connections.

The RPC connection handler reads two values from the in-memory TLS configuration in order to understand how to correctly handle incoming TLS connections. This can cause a data race condition with the config reload where the config is written at the same time RPC handler reads the boolean values.

One potential fix would be to wrap the entire server config object in a mutex and have all read/writers use this. This, however, seems inefficient as only two TLS config options are concurrently read and the server makes liberal use of reading other config options that are only written at server initialization.

The change introduces two new server object atomics which hold the boolean TLS values used by the RPC connection handler. These atomics give thread safe read/write access while being tightly scoped to the servers current requirements.

Testing & Reproduction steps

The race can be seen on main by running the following example test command: go test -count 10 -race -v -race -run TestServer_Reload_TLSConnections_PlaintextToTLS ./"nomad"

race example output
==================
WARNING: DATA RACE
Write at 0x00c000e7c270 by goroutine 335:
github.com/hashicorp/nomad/nomad.(*Server).reloadTLSConnections()
    /Users/jrasell/Projects/go/nomad/nomad/server.go:696 +0x3f0
github.com/hashicorp/nomad/nomad.TestServer_Reload_TLSConnections_PlaintextToTLS()
    /Users/jrasell/Projects/go/nomad/nomad/server_test.go:246 +0x2c4
testing.tRunner()
    /Users/jrasell/.asdf/installs/golang/1.25.4/go/src/testing/testing.go:1934 +0x164
testing.(*T).Run.gowrap1()
    /Users/jrasell/.asdf/installs/golang/1.25.4/go/src/testing/testing.go:1997 +0x3c

Previous read at 0x00c000e7c270 by goroutine 418:
github.com/hashicorp/nomad/nomad.(*rpcHandler).handleConn()
    /Users/jrasell/Projects/go/nomad/nomad/rpc.go:304 +0x36c
github.com/hashicorp/nomad/nomad.(*rpcHandler).listen.gowrap2()
    /Users/jrasell/Projects/go/nomad/nomad/rpc.go:238 +0x64

Goroutine 335 (running) created at:
testing.(*T).Run()
    /Users/jrasell/.asdf/installs/golang/1.25.4/go/src/testing/testing.go:1997 +0x6e0
testing.runTests.func1()
    /Users/jrasell/.asdf/installs/golang/1.25.4/go/src/testing/testing.go:2477 +0x74
testing.tRunner()
    /Users/jrasell/.asdf/installs/golang/1.25.4/go/src/testing/testing.go:1934 +0x164
testing.runTests()
    /Users/jrasell/.asdf/installs/golang/1.25.4/go/src/testing/testing.go:2475 +0x734
testing.(*M).Run()
    /Users/jrasell/.asdf/installs/golang/1.25.4/go/src/testing/testing.go:2337 +0xaf4
main.main()
    _testmain.go:2003 +0x100

Goroutine 418 (running) created at:
github.com/hashicorp/nomad/nomad.(*rpcHandler).listen()
    /Users/jrasell/Projects/go/nomad/nomad/rpc.go:238 +0x530
github.com/hashicorp/nomad/nomad.(*Server).startRPCListener.gowrap1()
    /Users/jrasell/Projects/go/nomad/nomad/server.go:583 +0x48
==================

Some very quick benchmarks suggest the use of atomics within the RPC handler will perform as well as the current implementation:

$ go test -bench=. -benchmem
goos: darwin
goarch: arm64
pkg: github.com/jrasell/dev-mess
cpu: Apple M2 Pro
Benchmark_atomicBool-12       	601869178	         1.948 ns/op	       0 B/op	       0 allocs/op
Benchmark_nonAtomicBool-12    	591709046	         2.019 ns/op	       0 B/op	       0 allocs/op
PASS
ok  	github.com/jrasell/dev-mess	2.564s

Links

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad product documentation, which is stored in the
    web-unified-docs repo. Refer to the web-unified-docs contributor guide for docs guidelines.
    Please also consider whether the change requires notes within the upgrade
    guide
    . If you would like help with the docs, tag the nomad-docs team in this PR.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.

The Nomad server agent can perform configuration reloads of
certain items, notably the TLS configuration when it is sent a
SIGHUP signal. If the TLS configuration has changed, the server
will overwrite the existing in-memory config object and reload
its TLS connections.

The RPC connection handler reads two values from the in-memory
TLS configuration in order to understand how to correctly handle
incoming TLS connections. This can cause a data race condition
with the config reload where the config is written at the same
time RPC handler reads the boolean values.

One potential fix would be to wrap the entire server config object
in a mutex and have all read/writers use this. This, however,
seems inefficient as only two TLS config options are concurrently
read and the server makes liberal use of reading other config
options that are only written at server initialization.

The change introduces two new server object atomics which hold the
boolean TLS values used by the RPC connection handler. These
atomics give thread safe read/write access while being tightly
scoped to the servers current requirements.
@jrasell jrasell self-assigned this Nov 28, 2025
@jrasell jrasell added backport/ent/1.8.x+ent Changes are backported to 1.8.x+ent backport/ent/1.10.x+ent backport to 1.10.x+ent release line backport/1.11.x backport to 1.11.x release line labels Nov 28, 2025
@jrasell jrasell marked this pull request as ready for review November 28, 2025 14:39
@jrasell jrasell requested review from a team as code owners November 28, 2025 14:39
Copy link
Contributor

@pkazmierczak pkazmierczak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This generally LGTM, I just left some typo and clarification comments.

// schedulers, and notification bus for agents.
type Server struct {

// config holds the server configuration. No fields are not mutated after
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// config holds the server configuration. No fields are not mutated after
// config holds the server configuration. No fields are mutated after

Comment on lines +367 to +368
rpcTLSEnabled: atomic.Bool{},
rpcTLSUpgradeMode: atomic.Bool{},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't these actually be pointer values to account for unset fields?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The zero value stored is false which is right for these two configuration parameters and they are always set by the setRPCTLSAtomics functions no matter the contents of the TLS configuration or if it is nil. The concrete bool means we can perform the Load() atomic call in the RPC hot path without needing additional checks for nils, which feels safer but also easier to understand.

It is possible to turn them into pointers, but the benefit seems that it would save us the space of 2 x uint32 (8 bytes) when they are not set while requiring additional logic to ensure we correctly handle the pointer objects.

Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with the caveat that I've pointed out in my comment

// Ensure the config object is updated. This is done to keep things in sync
// only; all access to TLS enforcement should be done via the atomics which
// are updated below.
s.config.TLSConfig = newTLSConfig
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we still have a race condition here if we read from this configuration? Even if nothing else in the server reads this config object, it's going to get read during serialization for the Query Self API

Not having a RW mutex on the (*Server) GetConfig() method smells fishy to me, but it also seems like we have a lot of places currently where we read the config without going through that method. So I'm also open to fixing this definitely-is-a-bug and coming back to a more comprehensive look at how we access/update the config struct later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport/ent/1.8.x+ent Changes are backported to 1.8.x+ent backport/ent/1.10.x+ent backport to 1.10.x+ent release line backport/1.11.x backport to 1.11.x release line

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants