Define job restart behaviors

By default, Nomad attempts to restart a failed task locally, on the node where it is running or scheduled to run. The defaults vary by the job's scheduler type: system, service, or batch.

To customize this behavior, add a restart stanza to the task group. Nomad restarts the failed task up to attempts times within the provided interval. The mode parameter lets operators choose whether to keep attempting restarts on the same node or to fail the task so that it can be rescheduled on another node.

Setting mode to fail in the restart stanza allows rescheduling to occur, potentially moving the task to another node, and is the recommended practice.
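
As an illustration, the following job file is a minimal sketch of a restart stanza using the parameters described above. The attempts, interval, and mode values are chosen to match the CLI output shown in this section. The delay value, the exec driver, and the failing command are assumptions made only for the example and are not taken from the original job.

```hcl
# Minimal sketch of a job file. The task definition is a hypothetical
# placeholder that exists only to produce a failing task.
job "demo" {
  datacenters = ["dc1"]
  type        = "service"

  group "demo" {
    count = 3

    # Restart policy applied to the tasks in this group.
    restart {
      attempts = 3      # restart a failed task at most 3 times...
      interval = "5m"   # ...within a 5-minute interval
      delay    = "10s"  # assumed: wait about 10 seconds between restarts
      mode     = "fail" # past the limit, stop restarting and fail the task
    }

    task "demo" {
      # Assumed driver and command; any task that exits non-zero would do.
      driver = "exec"

      config {
        command = "/bin/sh"
        args    = ["-c", "exit 127"]
      }
    }
  }
}
```

With mode set to fail, the task is marked as failed once the restart limit is exceeded, which is what makes it eligible for rescheduling on another node.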

The following CLI example shows job status and allocation status for a failed task that is being restarted by Nomad. Allocations are in the pending state while restarts are attempted. The Recent Events section in the CLI shows ongoing restart attempts.

$ nomad job status demo
ID            = demo
Name          = demo
Submit Date   = 2018-04-12T14:37:18-05:00
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
demo        0       3         0        0       0         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
ce5bf1d1  8a184f31  demo        0        run      pending  27s ago  5s ago
d5dee7c8  8a184f31  demo        0        run      pending  27s ago  5s ago
ed815997  8a184f31  demo        0        run      pending  27s ago  5s ago

In the following example, the allocation ce5bf1d1 is restarted by Nomad approximately every ten seconds, with a small random jitter. It eventually reaches its limit of three attempts and transitions into a failed state, after which it becomes eligible for rescheduling.

$ nomad alloc status ce5bf1d1
ID                     = ce5bf1d1
Eval ID                = 64e45d11
Name                   = demo.demo[1]
Node ID                = a0ccdd8b
Job ID                 = demo
Job Version            = 0
Client Status          = failed
Client Description     = <none>
Desired Status         = run
Desired Description    = <none>
Created                = 56s ago
Modified               = 22s ago

Task "demo" is "dead"
Task Resources
CPU      Memory   Disk     Addresses
100 MHz  300 MiB  300 MiB

Task Events:
Started At     = 2018-04-12T22:29:08Z
Finished At    = 2018-04-12T22:29:08Z
Total Restarts = 3
Last Restart   = 2018-04-12T17:28:57-05:00

Recent Events:
Time                       Type            Description
2018-04-12T17:29:08-05:00  Not Restarting  Exceeded allowed attempts 3 in interval 5m0s and mode is "fail"
2018-04-12T17:29:08-05:00  Terminated      Exit Code: 127
2018-04-12T17:29:08-05:00  Started         Task started by client
2018-04-12T17:28:57-05:00  Restarting      Task restarting in 10.364602876s
2018-04-12T17:28:57-05:00  Terminated      Exit Code: 127
2018-04-12T17:28:57-05:00  Started         Task started by client
2018-04-12T17:28:47-05:00  Restarting      Task restarting in 10.666963769s
2018-04-12T17:28:47-05:00  Terminated      Exit Code: 127
2018-04-12T17:28:47-05:00  Started         Task started by client
2018-04-12T17:28:35-05:00  Restarting      Task restarting in 11.777324721s
