---
layout: docs
page_title: Configure singleton deployments
description: |-
  Declare a job that guarantees only a single instance can run at a time, with
  minimal downtime.
---

# Configure singleton deployments

A singleton deployment is one where there is at most one instance of a given
allocation running on the cluster at one time. You might need this if the
workload needs exclusive access to a remote resource like a data store. Nomad
does not support singleton deployments as a built-in feature. Your workloads
continue to run even when the Nomad client agent has crashed, so ensuring
there's at most one allocation for a given workload requires some cooperation
from the job. This document describes how to implement singleton deployments.

## Design Goals

The configuration described here meets three primary design goals:

* The design will prevent a specific process within a task from running if there
  is another instance of that task running anywhere else on the Nomad cluster.
* Nomad should be able to recover from failure of the task or the node on which
  the task is running with minimal downtime, where "recovery" means that the
  original task should be stopped and that Nomad should schedule a replacement
  task.
* Nomad should minimize false positive detection of failures to avoid
  unnecessary downtime during the cutover.

There's a tradeoff between recovery speed and false positives. The faster you
make Nomad attempt to recover from failure, the more likely it is that a
transient failure causes a replacement to be scheduled and a subsequent
downtime.

Note that it's not possible to design a perfectly zero-downtime singleton
allocation in a distributed system. This design will err on the side of
correctness: having 0 or 1 allocations running rather than the incorrect 1 or 2
allocations running.

## Overview

There are several options available for some details of the implementation, but
all of them include the following:

* You must have a distributed lock with a TTL that's refreshed from the
  allocation. The process that sets and refreshes the lock must have its
  lifecycle tied to the main task. It can be either in-process, in-task with
  supervision, or run as a sidecar. If the allocation cannot obtain the lock,
  then it must not start whatever process or operation is intended to be a
  singleton. After a configurable window without obtaining the lock, the
  allocation must fail.
* You must set the [`group.disconnect.stop_on_client_after`][] field. This
  forces a Nomad client that's disconnected from the server to stop the
  singleton allocation, which in turn releases the lock or allows its TTL to
  expire.

The values for the three timers (the lock TTL, the time it takes the allocation
to give up, and the `stop_on_client_after` duration) can be tuned to reduce the
maximum downtime the application can experience.

The Nomad [Locks API][] can support the operations needed. In pseudocode, these
operations are:

* `PUT /v1/var/:path?lock-acquire`
  * On success: start heartbeating every 1/2 TTL.
  * On conflict or failure: retry with backoff and timeout.
  * Once out of attempts, exit the process with an error code.
* To heartbeat, `PUT /v1/var/:path?lock-renew`
  * On success: continue.
  * On conflict: exit the process with an error code.
  * On failure: retry with backoff up to the TTL.
  * If the TTL expires, attempt to release the lock, then exit the process with
    an error code.

The allocation can safely use the Nomad [Task API][] socket to write to the
locks API, rather than communicating with the server directly. This reduces load
on the server and speeds up detection of failed client nodes because the
disconnected client cannot forward the Task API requests to the leader.

The [`nomad var lock`][] command implements this logic and can be used to shim
the process being locked.

### ACLs

Allocations cannot write to Variables by default. You must configure a
[workload-associated ACL policy][] that allows write access in the
[`namespace.variables`][] block. For example, the following ACL policy allows
access to write a lock on the path `nomad/jobs/example/lock` in the `prod`
namespace:

```hcl
namespace "prod" {
  variables {
    path "nomad/jobs/example/lock" {
      capabilities = ["write", "read", "list"]
    }
  }
}
```

You set this policy on the job with `nomad acl policy apply -namespace prod -job
example example-lock ./policy.hcl`.

### Using `nomad var lock`

The easiest way to implement the locking logic is to use `nomad var lock` as a
shim in your task. The jobspec below assumes there's a Nomad binary in the
container image.

```hcl
job "example" {
  group "group" {

    disconnect {
      stop_on_client_after = "1m"
    }

    task "primary" {
      driver = "docker"

      config {
        image   = "example/app:1"
        command = "nomad"
        args = [
          "var", "lock", "nomad/jobs/example/lock", # lock
          "busybox", "httpd",                       # application
          "-vv", "-f", "-p", "8001", "-h", "/local" # application args
        ]
      }

      identity {
        env = true # make NOMAD_TOKEN available to lock command
      }
    }
  }
}
```
| 138 | + |
| 139 | +If you don't want to ship a Nomad binary in the container image you can make a |
| 140 | +read-only mount of the binary from a host volume. This will only work in cases |
| 141 | +where the Nomad binary has been statically linked or you have glibc in the |
| 142 | +container image. |
| 143 | + |
| 144 | +```hcl |
| 145 | +job "example" { |
| 146 | + group "group" { |
| 147 | +
|
| 148 | + disconnect { |
| 149 | + stop_on_client_after = "1m" |
| 150 | + } |
| 151 | +
|
| 152 | + volume "binaries" { |
| 153 | + type = "host" |
| 154 | + source = "binaries" |
| 155 | + read_only = true |
| 156 | + } |
| 157 | +
|
| 158 | + task "primary" { |
| 159 | + config { |
| 160 | + driver = "docker" |
| 161 | + image = "example/app:1" |
| 162 | + command = "/opt/bin/nomad" |
| 163 | + args = [ |
| 164 | + "var", "lock", "nomad/jobs/example/lock", # lock |
| 165 | + "busybox", "httpd", # application |
| 166 | + "-vv", "-f", "-p", "8001", "-h", "/local" # application args |
| 167 | + ] |
| 168 | + } |
| 169 | +
|
| 170 | + identity { |
| 171 | + env = true # make NOMAD_TOKEN available to lock command |
| 172 | + } |
| 173 | +
|
| 174 | + volume_mount { |
| 175 | + volume = "binaries" |
| 176 | + destination = "/opt/bin" |
| 177 | + } |
| 178 | + } |
| 179 | + } |
| 180 | +} |
| 181 | +``` |
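
The `binaries` host volume referenced in this jobspec must be registered in the
client agent configuration. The following is a minimal sketch; the host path
`/opt/nomad-binaries` is a placeholder for wherever you stage the Nomad binary
on the client.

```hcl
# Fragment of the Nomad client agent configuration (sketch). It registers a
# read-only host volume exposing a directory that contains the nomad binary.
client {
  enabled = true

  host_volume "binaries" {
    path      = "/opt/nomad-binaries" # hypothetical host path
    read_only = true
  }
}
```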
| 182 | + |
| 183 | +### Sidecar Lock |
| 184 | + |
| 185 | +If cannot implement the lock logic in your application or with a shim such as |
| 186 | +`nomad var lock`, you'rll need to implement it such that the task you are |
| 187 | +locking is running as a sidecar of the locking task, which has |
| 188 | +[`task.leader=true`][] set. |
| 189 | + |
| 190 | +```hcl |
| 191 | +job "example" { |
| 192 | + group "group" { |
| 193 | +
|
| 194 | + disconnect { |
| 195 | + stop_on_client_after = "1m" |
| 196 | + } |
| 197 | +
|
| 198 | + task "lock" { |
| 199 | + leader = true |
| 200 | + config { |
| 201 | + driver = "raw_exec" |
| 202 | + command = "/opt/lock-script.sh" |
| 203 | + pid_mode = "host" |
| 204 | + } |
| 205 | +
|
| 206 | + identity { |
| 207 | + env = true # make NOMAD_TOKEN available to lock command |
| 208 | + } |
| 209 | + } |
| 210 | +
|
| 211 | + task "application" { |
| 212 | + lifecycle { |
| 213 | + hook = "poststart" |
| 214 | + sidecar = true |
| 215 | + } |
| 216 | +
|
| 217 | + config { |
| 218 | + driver = "docker" |
| 219 | + image = "example/app:1" |
| 220 | + } |
| 221 | + } |
| 222 | + } |
| 223 | +} |
| 224 | +``` |
| 225 | + |
| 226 | +The locking task has the following requirements: |
| 227 | + |
| 228 | +* The locking task must be in the same group as the task being locked. |
| 229 | +* The locking task must be able to terminate the task being locked without the |
| 230 | + Nomad client being up (i.e. they share the same PID namespace, or the locking |
| 231 | + task is privileged). |
| 232 | +* The locking task must have a way of signalling the task being locked that it |
| 233 | + is safe to start. For example, the locking task can write a sentinel file into |
| 234 | + the /alloc directory, which the locked task tries to read on startup and |
| 235 | + blocks until it exists. |
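
The following is a hedged sketch of what the locking task might look like with
the sentinel file approach, using `nomad var lock` to hold the lock. The script
body, the sentinel file name, and the choice to render the script from a
`template` block (instead of installing `/opt/lock-script.sh` on the host) are
illustrative assumptions rather than a prescribed implementation.

```hcl
task "lock" {
  driver = "raw_exec"
  leader = true

  # Render the locking script into the task directory. The $$ sequence escapes
  # HCL interpolation so the shell sees ${NOMAD_ALLOC_DIR} at runtime.
  template {
    destination = "local/lock-script.sh"
    perms       = "755"
    data        = <<-EOF
      #!/bin/sh
      set -e
      SENTINEL="$${NOMAD_ALLOC_DIR}/lock-held"

      # Remove the sentinel when this task stops, so the application task
      # cannot keep running without the lock being held.
      trap 'rm -f "$SENTINEL"' EXIT INT TERM

      # `nomad var lock` acquires the lock, heartbeats it, and runs the child
      # command; the child creates the sentinel and then waits forever.
      nomad var lock nomad/jobs/example/lock \
        sh -c "touch \"$SENTINEL\" && while true; do sleep 3600; done"
    EOF
  }

  config {
    command = "local/lock-script.sh"
  }

  identity {
    env = true # make NOMAD_TOKEN available to the nomad CLI
  }
}
```

The application task then blocks on startup until the sentinel file appears in
the shared alloc directory.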

If the third requirement cannot be met, then you'll need to split the lock
acquisition and lock heartbeat into separate tasks:

```hcl
job "example" {
  group "group" {

    disconnect {
      stop_on_client_after = "1m"
    }

    task "acquire" {
      lifecycle {
        hook    = "prestart"
        sidecar = false
      }

      driver = "raw_exec"

      config {
        command = "/opt/lock-acquire-script.sh"
      }

      identity {
        env = true # make NOMAD_TOKEN available to lock command
      }
    }

    task "heartbeat" {
      driver = "raw_exec" # runs in the host PID namespace
      leader = true

      config {
        command = "/opt/lock-heartbeat-script.sh"
      }

      identity {
        env = true # make NOMAD_TOKEN available to lock command
      }
    }

    task "application" {
      lifecycle {
        hook    = "poststart"
        sidecar = true
      }

      driver = "docker"

      config {
        image = "example/app:1"
      }
    }
  }
}
```

If the primary task is configured to [`restart`][], the task should be able to
restart within the lock TTL in order to minimize flapping on restart. This
improves availability but isn't required for correctness.
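
For example, the following is a sketch of a `restart` block sized so that all
restart attempts complete well within a hypothetical 30 second lock TTL. The
specific values are assumptions; tune them against your own TTL.

```hcl
# Hypothetical restart tuning: two quick attempts, then fail the task so Nomad
# reschedules it, keeping the retry window inside a 30s lock TTL.
restart {
  attempts = 2
  delay    = "5s"
  interval = "30s"
  mode     = "fail"
}
```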

[`group.disconnect.stop_on_client_after`]: /nomad/docs/job-specification/disconnect#stop_on_client_after
[Locks API]: /nomad/api-docs/variables/locks
[Task API]: /nomad/api-docs/task-api
[`nomad var lock`]: /nomad/commands/var/lock
[workload-associated ACL policy]: /nomad/docs/concepts/workload-identity#workload-associated-acl-policies
[`namespace.variables`]: /nomad/docs/other-specifications/acl-policy#variables
[`task.leader=true`]: /nomad/docs/job-specification/task#leader
[`restart`]: /nomad/docs/job-specification/restart