-
Notifications
You must be signed in to change notification settings - Fork 2k
server: Fix potential race condition in TLS reload and RPC handler #27167
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
The Nomad server agent can perform configuration reloads of certain items, notably the TLS configuration when it is sent a SIGHUP signal. If the TLS configuration has changed, the server will overwrite the existing in-memory config object and reload its TLS connections. The RPC connection handler reads two values from the in-memory TLS configuration in order to understand how to correctly handle incoming TLS connections. This can cause a data race condition with the config reload where the config is written at the same time RPC handler reads the boolean values. One potential fix would be to wrap the entire server config object in a mutex and have all read/writers use this. This, however, seems inefficient as only two TLS config options are concurrently read and the server makes liberal use of reading other config options that are only written at server initialization. The change introduces two new server object atomics which hold the boolean TLS values used by the RPC connection handler. These atomics give thread safe read/write access while being tightly scoped to the servers current requirements.
pkazmierczak
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This generally LGTM, I just left some typo and clarification comments.
| // schedulers, and notification bus for agents. | ||
| type Server struct { | ||
|
|
||
| // config holds the server configuration. No fields are not mutated after |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // config holds the server configuration. No fields are not mutated after | |
| // config holds the server configuration. No fields are mutated after |
| rpcTLSEnabled: atomic.Bool{}, | ||
| rpcTLSUpgradeMode: atomic.Bool{}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't these actually be pointer values to account for unset fields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The zero value stored is false which is right for these two configuration parameters and they are always set by the setRPCTLSAtomics functions no matter the contents of the TLS configuration or if it is nil. The concrete bool means we can perform the Load() atomic call in the RPC hot path without needing additional checks for nils, which feels safer but also easier to understand.
It is possible to turn them into pointers, but the benefit seems that it would save us the space of 2 x uint32 (8 bytes) when they are not set while requiring additional logic to ensure we correctly handle the pointer objects.
tgross
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with the caveat that I've pointed out in my comment
| // Ensure the config object is updated. This is done to keep things in sync | ||
| // only; all access to TLS enforcement should be done via the atomics which | ||
| // are updated below. | ||
| s.config.TLSConfig = newTLSConfig |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we still have a race condition here if we read from this configuration? Even if nothing else in the server reads this config object, it's going to get read during serialization for the Query Self API
Not having a RW mutex on the (*Server) GetConfig() method smells fishy to me, but it also seems like we have a lot of places currently where we read the config without going through that method. So I'm also open to fixing this definitely-is-a-bug and coming back to a more comprehensive look at how we access/update the config struct later.
The Nomad server agent can perform configuration reloads of certain items, notably the TLS configuration when it is sent a SIGHUP signal. If the TLS configuration has changed, the server will overwrite the existing in-memory config object and reload its TLS connections.
The RPC connection handler reads two values from the in-memory TLS configuration in order to understand how to correctly handle incoming TLS connections. This can cause a data race condition with the config reload where the config is written at the same time RPC handler reads the boolean values.
One potential fix would be to wrap the entire server config object in a mutex and have all read/writers use this. This, however, seems inefficient as only two TLS config options are concurrently read and the server makes liberal use of reading other config options that are only written at server initialization.
The change introduces two new server object atomics which hold the boolean TLS values used by the RPC connection handler. These atomics give thread safe read/write access while being tightly scoped to the servers current requirements.
Testing & Reproduction steps
The race can be seen on main by running the following example test command:
go test -count 10 -race -v -race -run TestServer_Reload_TLSConnections_PlaintextToTLS ./"nomad"race example output
Some very quick benchmarks suggest the use of atomics within the RPC handler will perform as well as the current implementation:
Links
Contributor Checklist
changelog entry using the
make clcommand.ensure regressions will be caught.
and job configuration, please update the Nomad product documentation, which is stored in the
web-unified-docsrepo. Refer to theweb-unified-docscontributor guide for docs guidelines.Please also consider whether the change requires notes within the upgrade
guide. If you would like help with the docs, tag the
nomad-docsteam in this PR.Reviewer Checklist
backporting document.
in the majority of situations. The main exceptions are long-lived feature branches or merges where
history should be preserved.
within the public repository.