Retry Strategies

When a worker handler returns an error, zzz_jobs does not immediately discard the job. Instead, it consults the retry strategy configured on the worker to compute a backoff delay, then re-schedules the job for a future attempt. This continues until the job either succeeds or exhausts its maximum attempts.

  1. A worker handler returns an error.
  2. The store increments the job’s attempt counter and records the error message.
  3. If attempt >= max_attempts, the job transitions to discarded (dead letter).
  4. Otherwise, the retry strategy computes a scheduled_at timestamp in the future. The job transitions back to available and will not be claimed until that time arrives.
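
The decision in steps 3 and 4 can be sketched as a small function. This is an illustration only: `JobState`, `Outcome`, and `afterFailure` are hypothetical names, timestamps are assumed to be Unix seconds, and the real store logic in zzz_jobs may differ.

```zig
const std = @import("std");

// Hypothetical types for illustration; not zzz_jobs' actual definitions.
const JobState = enum { available, discarded };

const Outcome = struct {
    state: JobState,
    // Assumed to be Unix seconds; only meaningful when state == .available.
    scheduled_at: i64,
};

/// Decide what happens to a job after its handler returned an error.
/// `attempt` is the counter after it has been incremented (step 2).
fn afterFailure(attempt: i32, max_attempts: i32, now: i64, delay_seconds: i64) Outcome {
    if (attempt >= max_attempts) {
        // Step 3: attempts exhausted, the job becomes a dead letter.
        return .{ .state = .discarded, .scheduled_at = now };
    }
    // Step 4: back to available, but not claimable before scheduled_at.
    return .{ .state = .available, .scheduled_at = now + delay_seconds };
}

pub fn main() void {
    const retry = afterFailure(1, 3, 1000, 30);
    std.debug.print("{s} at {d}\n", .{ @tagName(retry.state), retry.scheduled_at });
    const dead = afterFailure(3, 3, 1000, 30);
    std.debug.print("{s}\n", .{@tagName(dead.state)});
}
```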

The RetryStrategy is a tagged union with four variants:

pub const RetryStrategy = union(enum) {
    exponential: ExponentialBackoff,
    linear: LinearBackoff,
    constant: ConstantBackoff,
    custom: *const fn (attempt: i32, base_delay: i64) i64,
};
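
As an example of the `custom` variant, the following sketch defines a function with the same signature as the function pointer in the union. The quadratic formula, the one-hour cap, and the name `quadraticBackoff` are illustrative choices, and the value zzz_jobs passes as `base_delay` is not specified here, so the example simply assumes it is a number of seconds.

```zig
const std = @import("std");

/// Quadratic backoff: base_delay * (attempt + 1)^2, capped at one hour.
/// Saturating multiplication (*|) avoids integer overflow on large inputs.
fn quadraticBackoff(attempt: i32, base_delay: i64) i64 {
    const n: i64 = @as(i64, attempt) + 1;
    return @min(base_delay *| n *| n, @as(i64, 3600));
}

// Same function-pointer type as the `custom` variant of RetryStrategy.
const CustomFn = *const fn (attempt: i32, base_delay: i64) i64;

pub fn main() void {
    const strategy: CustomFn = &quadraticBackoff;
    var attempt: i32 = 0;
    while (attempt < 4) : (attempt += 1) {
        // base_delay = 10 is an assumed value for the demonstration.
        std.debug.print("attempt {d}: {d}s\n", .{ attempt, strategy(attempt, 10) });
    }
}
```

Given the union definition, such a function would presumably be registered with `.retry_strategy = .{ .custom = &quadraticBackoff }`.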

Retry strategies are set per worker at registration time:

supervisor.registerWorker(.{
    .name = "my_worker",
    .handler = &myHandler,
    .retry_strategy = .{ .exponential = .{} }, // default
});

The exponential strategy doubles the delay with each attempt, up to a configurable maximum. This is the default strategy.

.retry_strategy = .{ .exponential = .{
    .base_seconds = 15, // initial delay
    .max_seconds = 3600, // cap at 1 hour
    .jitter = true, // add jitter to spread out retries
} }

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| base_seconds | i64 | 15 | Delay for the first retry |
| max_seconds | i64 | 3600 | Maximum delay (cap) |
| jitter | bool | true | Add deterministic jitter (up to 25% of the computed delay) to spread out retries |

Delay formula: min(base_seconds * 2^attempt, max_seconds) + jitter

Example delays (no jitter, base=15, max=3600):

| Attempt | Delay |
| --- | --- |
| 0 | 15s |
| 1 | 30s |
| 2 | 60s |
| 3 | 120s |
| 4 | 240s |
| 5 | 480s |
| 6 | 960s |
| 7 | 1920s |
| 8 | 3600s (capped) |
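
The example delays above can be reproduced with a short function. This is a sketch of the documented formula without the jitter term, not the library's actual implementation; the name `exponentialDelay` is hypothetical.

```zig
const std = @import("std");

/// Delay in seconds for the given attempt, without jitter:
/// min(base_seconds * 2^attempt, max_seconds).
fn exponentialDelay(attempt: i32, base_seconds: i64, max_seconds: i64) i64 {
    var delay: i64 = base_seconds;
    var i: i32 = 0;
    while (i < attempt) : (i += 1) {
        // Stop doubling once the cap is reached.
        if (delay >= max_seconds) return max_seconds;
        // Saturating multiply guards against overflow for very large caps.
        delay *|= 2;
    }
    return @min(delay, max_seconds);
}

pub fn main() void {
    var attempt: i32 = 0;
    while (attempt <= 8) : (attempt += 1) {
        std.debug.print("{d}: {d}s\n", .{ attempt, exponentialDelay(attempt, 15, 3600) });
    }
}
```

With jitter enabled, the documented bound suggests the actual delay can exceed these values by up to 25%.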

The maximum number of attempts is controlled by max_attempts in JobOpts. It defaults to 20. You can set it per-worker (via WorkerDef.opts) or per-job (when calling enqueue):

// Per-worker default
supervisor.registerWorker(.{
    .name = "fragile_worker",
    .handler = &fragileHandler,
    .opts = .{ .max_attempts = 3 },
    .retry_strategy = .{ .constant = .{ .delay_seconds = 10 } },
});

// Per-job override (takes precedence)
_ = try supervisor.enqueue("fragile_worker", "{}", .{
    .max_attempts = 5,
});
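
The precedence described above (per-job value, then per-worker value, then the default of 20) can be modeled as follows. The optional parameters and the name `effectiveMaxAttempts` are assumptions made for illustration; how zzz_jobs represents an unset `max_attempts` internally is not documented here.

```zig
const std = @import("std");

// Documented default for max_attempts.
const default_max_attempts: i32 = 20;

/// Hypothetical resolution order: job override, then worker default, then 20.
/// `null` stands for "not set" in this sketch.
fn effectiveMaxAttempts(worker_opt: ?i32, job_opt: ?i32) i32 {
    return job_opt orelse worker_opt orelse default_max_attempts;
}

pub fn main() void {
    // Mirrors the example above: the worker sets 3, the job overrides with 5.
    std.debug.print("{d}\n", .{effectiveMaxAttempts(3, 5)});
    std.debug.print("{d}\n", .{effectiveMaxAttempts(3, null)});
    std.debug.print("{d}\n", .{effectiveMaxAttempts(null, null)});
}
```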

When a job’s attempt reaches max_attempts, it transitions to the discarded state. Discarded jobs remain in the store with their error message preserved in the errors field. They are not automatically deleted.

To query discarded jobs:

const discarded_count = try supervisor.store.countByState("default", .discarded);

To clean up old completed jobs (this does not affect discarded jobs):

const deleted = try supervisor.store.deleteCompleted(cutoff_timestamp);
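
One way to derive `cutoff_timestamp` is to subtract a retention period from the current time. The sketch below assumes the store expects Unix seconds, which is not confirmed here, and `cutoffDaysAgo` is a hypothetical helper.

```zig
const std = @import("std");

/// Unix timestamp (seconds) that lies `days` days before `now`.
fn cutoffDaysAgo(now: i64, days: i64) i64 {
    return now - days * std.time.s_per_day;
}

pub fn main() void {
    // Keep completed jobs for 7 days (an arbitrary retention period).
    const cutoff_timestamp = cutoffDaysAgo(std.time.timestamp(), 7);
    std.debug.print("cutoff: {d}\n", .{cutoff_timestamp});
    // The result would then be passed on, for example:
    // const deleted = try supervisor.store.deleteCompleted(cutoff_timestamp);
}
```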

The supervisor emits different telemetry events depending on the retry outcome:

  • job_failed — the job failed but will be retried (attempts remain)
  • job_discarded — the job exhausted all attempts and was discarded

Both events include duration_ms and error_msg in the JobResult payload. See the Telemetry page for details on subscribing to these events.