# Retry Strategies
When a worker handler returns an error, zzz_jobs does not immediately discard the job. Instead, it consults the retry strategy configured on the worker to compute a backoff delay, then re-schedules the job for a future attempt. This continues until the job either succeeds or exhausts its maximum attempts.
## How retries work

- A worker handler returns an error.
- The store increments the job's `attempt` counter and records the error message.
- If `attempt >= max_attempts`, the job transitions to `discarded` (dead letter).
- Otherwise, the retry strategy computes a `scheduled_at` timestamp in the future. The job transitions back to `available` and will not be claimed until that time arrives.
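The lifecycle above can be sketched roughly in Zig. Note that `Job`, `onHandlerError`, and `computeDelay` here are illustrative stand-ins, not the actual zzz_jobs internals:

```zig
const JobState = enum { available, discarded };

// Simplified stand-in for a stored job record.
const Job = struct {
    attempt: i32,
    max_attempts: i32,
    scheduled_at: i64,
    state: JobState,
};

// Called when a handler returns an error: increment the attempt counter,
// then either dead-letter the job or re-schedule it for a future attempt.
fn onHandlerError(job: *Job, now: i64, base_delay: i64) void {
    job.attempt += 1;
    if (job.attempt >= job.max_attempts) {
        job.state = .discarded; // dead letter
    } else {
        job.scheduled_at = now + computeDelay(job.attempt, base_delay);
        job.state = .available; // not claimable until scheduled_at
    }
}

// Constant-backoff stand-in; real workers use their configured RetryStrategy.
fn computeDelay(attempt: i32, base_delay: i64) i64 {
    _ = attempt;
    return base_delay;
}
```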
## RetryStrategy

`RetryStrategy` is a tagged union with four variants:

```zig
pub const RetryStrategy = union(enum) {
    exponential: ExponentialBackoff,
    linear: LinearBackoff,
    constant: ConstantBackoff,
    custom: *const fn (attempt: i32, base_delay: i64) i64,
};
```

Retry strategies are set per worker at registration time:

```zig
supervisor.registerWorker(.{
    .name = "my_worker",
    .handler = &myHandler,
    .retry_strategy = .{ .exponential = .{} }, // default
});
```

## Built-in strategies

### Exponential

Doubles the delay with each attempt, up to a configurable maximum. This is the default strategy.
```zig
.retry_strategy = .{ .exponential = .{
    .base_seconds = 15,  // initial delay
    .max_seconds = 3600, // cap at 1 hour
    .jitter = true,      // add jitter
} }
```

| Field | Type | Default | Description |
|---|---|---|---|
| `base_seconds` | `i64` | 15 | Delay for the first retry |
| `max_seconds` | `i64` | 3600 | Maximum delay (cap) |
| `jitter` | `bool` | true | Add deterministic jitter (up to 25% of the computed delay) to spread out retries |
Delay formula: `min(base_seconds * 2^attempt, max_seconds) + jitter`
Example delays (no jitter, base=15, max=3600):
| Attempt | Delay |
|---|---|
| 0 | 15s |
| 1 | 30s |
| 2 | 60s |
| 3 | 120s |
| 4 | 240s |
| 5 | 480s |
| 6 | 960s |
| 7 | 1920s |
| 8 | 3600s (capped) |
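As a sketch, the capped formula above (ignoring jitter) can be written as follows. The function name is illustrative, not part of the zzz_jobs API:

```zig
// Illustrative sketch of the documented formula, jitter omitted:
// min(base_seconds * 2^attempt, max_seconds)
fn exponentialDelay(attempt: i32, base_seconds: i64, max_seconds: i64) i64 {
    // Clamp the shift amount so 2^attempt cannot overflow an i64.
    const shift: u6 = @intCast(@min(attempt, 62));
    const raw = base_seconds *| (@as(i64, 1) << shift); // saturating multiply
    return @min(raw, max_seconds);
}
```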
### Linear

Increases the delay linearly with each attempt: `delay_seconds * (attempt + 1)`.
```zig
.retry_strategy = .{ .linear = .{
    .delay_seconds = 60,
} }
```

| Field | Type | Default | Description |
|---|---|---|---|
| `delay_seconds` | `i64` | 60 | Base delay, multiplied by `attempt + 1` |
Example delays (delay_seconds=60):
| Attempt | Delay |
|---|---|
| 0 | 60s |
| 1 | 120s |
| 2 | 180s |
| 3 | 240s |
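The linear formula is straightforward to express directly (the function name is illustrative, not part of the zzz_jobs API):

```zig
// Sketch of the documented linear formula: delay_seconds * (attempt + 1).
fn linearDelay(attempt: i32, delay_seconds: i64) i64 {
    return delay_seconds * (@as(i64, attempt) + 1);
}
```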
### Constant

Uses the same fixed delay for every retry, regardless of attempt number.
```zig
.retry_strategy = .{ .constant = .{
    .delay_seconds = 30,
} }
```

| Field | Type | Default | Description |
|---|---|---|---|
| `delay_seconds` | `i64` | 30 | Fixed delay between every retry |

Every retry waits exactly `delay_seconds`, making this suitable for transient failures where you expect quick recovery.
### Custom

Provide your own function to compute the retry delay:
```zig
fn myRetryDelay(attempt: i32, base_delay: i64) i64 {
    _ = base_delay;
    // Fibonacci-style: 1, 2, 3, 5, 8, 13, ... minutes
    const fibs = [_]i64{ 60, 120, 180, 300, 480, 780, 1260 };
    const idx: usize = @intCast(@min(attempt, fibs.len - 1));
    return fibs[idx];
}
```

```zig
supervisor.registerWorker(.{
    .name = "my_worker",
    .handler = &myHandler,
    .retry_strategy = .{ .custom = &myRetryDelay },
});
```

The custom function receives the current attempt (0-based) and returns the delay in seconds to add to the current time.
## Max attempts

The maximum number of attempts is controlled by `max_attempts` in `JobOpts`. It defaults to 20. You can set it per worker (via `WorkerDef.opts`) or per job (when calling `enqueue`):

```zig
// Per-worker default
supervisor.registerWorker(.{
    .name = "fragile_worker",
    .handler = &fragileHandler,
    .opts = .{ .max_attempts = 3 },
    .retry_strategy = .{ .constant = .{ .delay_seconds = 10 } },
});

// Per-job override (takes precedence)
_ = try supervisor.enqueue("fragile_worker", "{}", .{
    .max_attempts = 5,
});
```

## Dead letter behavior
When a job’s `attempt` reaches `max_attempts`, it transitions to the `discarded` state. Discarded jobs remain in the store with their error message preserved in the `errors` field. They are not automatically deleted.
To query discarded jobs:
```zig
const discarded_count = try supervisor.store.countByState("default", .discarded);
```

To clean up old completed jobs (this does not affect discarded jobs):

```zig
const deleted = try supervisor.store.deleteCompleted(cutoff_timestamp);
```

## Telemetry integration

The supervisor emits different telemetry events depending on the retry outcome:
- `job_failed` — the job failed but will be retried (attempts remain)
- `job_discarded` — the job exhausted all attempts and was discarded
Both events include the `duration_ms` and `error_msg` fields in the `JobResult` payload. See Telemetry for details on subscribing to these events.
## Next steps

- Workers and supervisors — configuring workers and the execution lifecycle
- Unique jobs — preventing duplicate work
- Telemetry — monitoring job failures and retries