fingerprint: Add retry and failure config to env fingerprinters. #27161
base: main
This change introduces new optional client fingerprinter configuration fields that control how the env fingerprinters perform retries and whether errors should halt agent startup.

The retry wrapper is used by the env_aws, env_azure, env_gce, and env_digitalocean fingerprinters and handles the retry and error logic on the main fingerprinter. The change is backwards compatible: running it without any of the new config options results in the same behaviour as before.

- retry_interval: Specifies the time to wait between fingerprint attempts. Defaults to 2 seconds.
- retry_attempts: Specifies the maximum number of fingerprint retries to be made. Defaults to 0 and can be set to -1 if the operator wants infinite retries.
- exit_on_failure: Determines how the agent handles failure in performing the fingerprint.

The change helps alleviate problems on cloud providers where a machine starts before the metadata service and its endpoint are available. In this situation, Nomad times out the fingerprinter quickly and marks it as skipped, thus assuming we are not running within that environment. Operators can use the new configuration options to handle these race conditions and wait for the metadata service to become available and respond.
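The retry semantics described above can be sketched as follows. This is an illustration only, not the code from this PR: `FingerprintConfig`, `runWithRetry`, `fingerprintEnv`, and `errNotReady` are hypothetical names, the Go field types are assumptions, and `exit_on_failure` is assumed to be a boolean because the description does not state its type.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotReady is a hypothetical error standing in for "metadata service not up yet".
var errNotReady = errors.New("metadata service not ready")

// FingerprintConfig is a hypothetical stand-in for the new options. Field
// names mirror the PR description; the Go types are assumptions.
type FingerprintConfig struct {
	RetryInterval time.Duration // wait between attempts (described default: 2s)
	RetryAttempts int           // max retries; 0 = no retries, -1 = unlimited
	ExitOnFailure bool          // assumed boolean: propagate the final error
}

// runWithRetry calls probe once, then retries up to cfg.RetryAttempts more
// times (without limit when it is -1). It returns the number of calls made
// and the error from the last call.
func runWithRetry(cfg FingerprintConfig, probe func() error) (int, error) {
	for calls := 1; ; calls++ {
		err := probe()
		if err == nil {
			return calls, nil
		}
		// Stop once the initial call plus all permitted retries are used up.
		if cfg.RetryAttempts >= 0 && calls > cfg.RetryAttempts {
			return calls, err
		}
		time.Sleep(cfg.RetryInterval)
	}
}

// fingerprintEnv applies the assumed exit_on_failure policy: when it is
// false, a final failure is swallowed so the fingerprinter is simply skipped.
func fingerprintEnv(cfg FingerprintConfig, probe func() error) error {
	_, err := runWithRetry(cfg, probe)
	if err != nil && !cfg.ExitOnFailure {
		return nil
	}
	return err
}

func main() {
	attempts := 0
	// Simulated probe: the metadata endpoint starts answering on the third call.
	probe := func() error {
		attempts++
		if attempts < 3 {
			return errNotReady
		}
		return nil
	}
	cfg := FingerprintConfig{RetryInterval: 10 * time.Millisecond, RetryAttempts: -1}
	calls, err := runWithRetry(cfg, probe)
	fmt.Println(calls, err)
}
```

With `RetryAttempts` left at 0 the loop makes exactly one call, which matches the backwards-compatible behaviour the description promises when no new options are set.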
pkazmierczak
left a comment
LGTM! Left some minor typo-related comments, but nothing blocking.
Co-authored-by: Piotr Kazmierczak <[email protected]>
// Fingerprint is an optional configuration block for environment fingerprinters
// that can control retry behavior and failure handling.
type Fingerprint struct {
In non-cloud environments we'll typically see something like the following to disable all the cloud fingerprinters. This uses the deprecated options syntax. Should we add an enabled flag to this struct to implement the same behavior?
client {
  options = {
    "fingerprint.denylist" = "env_aws,env_gce,env_azure,env_digitalocean"
  }
}
// Fingerprint executes the underlying fingerprinter with retry logic based
// on the client configuration and implements the Fingerprinter interface.
//
// If the fingerprinter fails after all retry attempts, the error from the last
// attempt is returned, unless the configuration indicates that failures should
// be skipped for this fingerprinter and the error is of the type that indicates
// an initial probe failure.
func (rw *RetryWrapper) Fingerprint(req *FingerprintRequest, resp *FingerprintResponse) error {
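The rule in the doc comment above (return the last error unless failures are skipped and the error marks an initial probe failure) could look roughly like the sketch below. The body of the real method is not shown in this hunk, so everything here is assumed: `errInitialProbe`, `errMalformed`, and `finalError` are hypothetical names, and using a sentinel error with `errors.Is` is only one possible way to identify the error type.

```go
package main

import (
	"errors"
	"fmt"
)

// errInitialProbe is a hypothetical sentinel meaning "the first request to the
// metadata endpoint failed", i.e. we are probably not in that environment.
var errInitialProbe = errors.New("initial probe failed")

// errMalformed is a hypothetical example of any other kind of failure.
var errMalformed = errors.New("malformed metadata response")

// finalError decides what a wrapper would return once all attempts are used.
// skipFailures stands in for "the configuration indicates that failures
// should be skipped for this fingerprinter".
func finalError(lastErr error, skipFailures bool) error {
	if lastErr == nil {
		return nil
	}
	// Only initial probe failures are eligible for skipping; any other
	// error is always surfaced to the caller.
	if skipFailures && errors.Is(lastErr, errInitialProbe) {
		return nil
	}
	return lastErr
}

func main() {
	// Wrapping with %w keeps the sentinel discoverable through errors.Is.
	wrapped := fmt.Errorf("env_aws: %w", errInitialProbe)
	fmt.Println(finalError(wrapped, true))
	fmt.Println(finalError(wrapped, false))
	fmt.Println(finalError(errMalformed, true))
}
```

Under these assumptions, an unreachable endpoint with skipping enabled lets the agent treat the fingerprinter as not applicable, while every other error is returned as the error from the last attempt.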
I'm not sure I understand how we're differentiating between failures we need to retry (and block initial fingerprinting for) and failures because that's not the environment we're in. Does this end up blocking the first fingerprint?
Links
Jira: https://hashicorp.atlassian.net/browse/NMD-1061