Scheduling Playbook — Designing a Distributed Job Scheduler

01 How you're actually graded

Six buckets — and judgment outweighs the diagram.

Every FAANG company runs a rubric. The dimensions are roughly the same; the weights differ by company and level. At senior+ the boxes-and-arrows are table stakes — what gets graded hardest is the quality of your decisions: the questions you asked first, the trade-offs you surfaced and defended, and the production reality you volunteered without being asked.

Dimension	Weight	What earns the signal
Requirements & scoping	10–15%	You scoped before drawing, asked enough to bound the problem, pinned the scale number, and stated assumptions out loud.
High-level architecture	20–25%	The right components, a clear data flow, and a reason every box exists. The design satisfies each functional requirement.
Technical depth / deep dives	~30%	You go three questions deep on the hard part without being rescued. This is where staff is won or lost.
Trade-offs & judgment	highest effective	Two viable options, what each costs, and a committed pick for this system. Simplicity over flash when flash isn't warranted.
Communication / driving	cross-cutting	You drive the 45 minutes; the interviewer never has to rescue you. You narrate, checkpoint, and narrow when the design sprawls.
Operational maturity	↑ in 2026	The newest weight: observability, rollout, failure modes, on-call reality — volunteered, not pried out.

The 2026 shift, in one line. Operational concerns are now a first-class graded dimension, and "it depends" without a committed answer reads as evasion rather than nuance. Name the trade-off, then pick.

02 The same answer is scored differently at each level

It's a sliding scale, not a pass/fail bar.

A solid design with reasonable trade-offs is a strong score for a mid-level candidate and a downlevel flag for staff. The questions can be identical; the depth expectation is not. As you climb, the balance tips from breadth toward depth, proactivity, and production reality.

Mid-level

Meta E4 · Google L4 · Amazon SDE-II

Stores jobs and has a process that runs them at the right time when prompted.
Uses a queue and workers but is shaky on what happens when a worker crashes.
Knows duplicates are a risk but lacks an idempotency story.
Needs guidance toward leader election and exactly-once semantics.

Senior

Meta E5 · Google L5 · Amazon SDE-III

Decouples scheduling from execution and reframes exactly-once as at-least-once + idempotency unprompted.
Uses a durable store plus a timer mechanism (sorted set / timing wheel) and a queue to workers.
Handles worker crashes with a lease/visibility timeout that re-queues, and a dedup key.
Raises leader election to avoid duplicate assignment.

Staff+

Meta E6 · Google L6 · Amazon Principal

Establishes the pipeline fast, then spends time on the two-tier time store, leader failover, and exactly-once semantics.
Experience-backed take on clock drift/NTP, precision-as-SLA, and dead-letter for poison jobs.
Treats partitioning the schedule, missed-job recovery on failover, and DAG dependencies as routine.
Frames the Two-Generals impossibility and idempotency as the application’s responsibility.

03 The lens senior engineers narrate through

Borrow AWS's Well-Architected pillars as your trade-off vocabulary.

You don't recite AWS — you anchor each decision to one of these. It signals you evaluate systems across competing concerns rather than optimizing one axis. Each pillar below is mapped to a move you can make in this exact design.

PILLAR 01

Operational Excellence

Watch fire skew and retries.

Hook: “I’d track scheduling lag — actual vs intended fire time — plus job success/failure/retry rates and queue depth, with dead-letter alerts for poison jobs.”

PILLAR 02

Security

Isolate untrusted execution.

Hook: “If jobs run arbitrary code I’d sandbox workers and rate-limit submissions per client so one tenant can’t flood the scheduler.”

PILLAR 03

Reliability

Survive worker and leader death.

Hook: “A job is leased with a TTL; if the worker dies, the lease expires and the job is re-queued. If the leader dies, consensus elects a new one that reloads state from the durable store.”

PILLAR 04

Performance Efficiency

Two-tier time storage.

Hook: “Near-future jobs sit in an in-memory timing wheel for O(1) fire; far-future jobs live in durable storage and are pulled in as their time approaches — so a billion future jobs don’t exhaust memory.”

PILLAR 05

Cost Optimization

Scale execution independently.

Hook: “Scheduling and execution are decoupled, so I scale the cheap scheduler and the expensive worker fleet separately to match actual load.”

PILLAR 06

Sustainability

Bound retries.

Hook: “Retries use backoff with a cap and a dead-letter queue, so a permanently failing job can’t spin forever and burn resources.”

How to use it without sounding like a checklist. Don't list the pillars. Weave one in when you commit: name a trade-off, name the pillar it serves, and make the call. One sentence that does all three reads as senior.

03·5 The architecture you draw on the whiteboard

Due → enqueue → execute once.

Something must happen later — reliably, maybe a billion times, ideally exactly once, surviving machines that die mid-execution. The prompt sounds trivial and hides the hardest guarantee in distributed systems. The senior move is to reframe it: you don’t get true exactly-once — you get at-least-once plus idempotency, which is effectively-once.

Submit + persist → Schedule → Execute

Due → enqueue → execute once. Jobs persist with a next-run time; a leader-elected scheduler polls for due jobs, enqueues them, and idempotent workers execute and mark them done. Say it: “you don’t get true exactly-once — at-least-once plus idempotency is effectively-once.”

How to narrate it in the room

Reframe before you’re cornered. “The hidden trap is ‘exactly-once.’ I’ll deliver at-least-once and make execution idempotent, which is effectively-once.”
Persist the schedule. “Jobs live in a store with next_run_time and status, partitioned by time bucket so I can find ‘what’s due now’ cheaply across billions of rows.”
Decouple find from run. “A leader-elected scheduler polls due jobs and pushes them to a queue; a worker pool consumes — so execution scales independently and survives worker death.”
Make it survivable. “Visibility timeouts re-deliver jobs whose worker died mid-run; the idempotency key ensures the retry doesn’t double-execute.”
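The "partitioned by time bucket" line can be pictured with a small sketch. It assumes one-minute buckets and two hypothetical helpers, `bucket_key` and `buckets_to_scan`; neither the bucket width nor the names come from the playbook.

```python
from datetime import datetime, timezone

BUCKET_SECONDS = 60  # assumed bucket width, not specified in the text


def bucket_key(next_run_at: datetime) -> int:
    """Return the epoch second at which the job's time bucket starts."""
    epoch = int(next_run_at.timestamp())
    return epoch - (epoch % BUCKET_SECONDS)


def buckets_to_scan(last_scanned: int, now: datetime) -> list[int]:
    """List bucket keys after last_scanned, up to and including the current one."""
    current = bucket_key(now)
    return list(range(last_scanned + BUCKET_SECONDS,
                      current + BUCKET_SECONDS,
                      BUCKET_SECONDS))
```

The scheduler then reads a handful of partitions per poll instead of scanning every row. The current bucket can still hold jobs due later in the same minute, so a `next_run_at <= now` filter is still needed inside it.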

04 The interview, minute by minute

Five phases. Drive every one of them.

The simulation. Framing: a distributed scheduler — billions of scheduled jobs, one-off and recurring (cron), workers assumed to crash mid-execution, and a fire-precision target you must clarify (second-level vs minute-level changes the architecture).

01 Requirements & Scoping · ~6 min · don't draw yet

Grading this window: Do you clarify timing precision (it shapes the architecture), decouple scheduling from execution, and reframe exactly-once before being cornered? That reframe is the senior tell.

Functional requirements to land

Submit jobs — one-off (run at time T) and recurring (cron).
Execute each job at its scheduled time and track status.
Retry on failure.

Non-functional requirements to land

Reliable execution — no job silently dropped.
Effectively-once execution (reframe exactly-once — see below).
Scale to billions of jobs; bounded fire skew as an explicit SLA.
Survive worker and coordinator crashes.

▲ Allow — say this

“First, what fire precision do we need? Second-level versus minute-level changes the architecture — polling frequency, locking overhead, the timer mechanism. I’ll design for the precision you need rather than over-engineer.”

▲ Allow — say this

“I won’t promise true exactly-once — it’s impossible over an unreliable network, the Two-Generals problem. I’ll deliver at-least-once execution plus idempotency, which is effectively-once. That’s the honest guarantee.”

▼ Reject — never say this

“We’ll just guarantee every job runs exactly once.” Claiming literal exactly-once for a distributed scheduler signals you haven’t hit the failure cases that make it impossible.

Scripted exchange

Interviewer

Design a distributed job scheduler.

You

Two things up front. What precision — are we firing to the second or the minute? That drives the timer design. And on guarantees: I won’t claim true exactly-once, which is impossible under network failure; I’ll do at-least-once plus idempotent execution, effectively-once. The architecture decouples the scheduling decision from execution so I can scale them independently. Let me build the job lifecycle, then the failure handling — which is what this question is really testing.

Interviewer

Second-level precision. Assume workers crash.

You

Then leases and re-queue on crash are central, and I’ll use a timing-wheel timer for second-level firing. Let me lay it out.

02 Entities, API & Estimation · ~5 min

Grading this window: A job model with schedule + status + next_run_at, and awareness of clock drift / UTC. Decoupled scheduling and execution.

Entities: Job (id, schedule or cron expression, payload, status, next_run_at), Worker. Interface:

submitJob(schedule | cron, payload) → jobId
cancelJob(jobId)
recurring: store cron + next_run_at; after a run, compute the next fire time
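A minimal sketch of that job model and interface, assuming an in-memory dictionary in place of the durable store. The `Status` values, the `JobStore` class, and the snake_case method names are illustrative choices, not part of the playbook.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Status(Enum):
    PENDING = "pending"
    CANCELLED = "cancelled"
    DONE = "done"


@dataclass
class Job:
    payload: dict
    next_run_at: datetime          # always stored in UTC
    cron: Optional[str] = None     # None means a one-off job
    status: Status = Status.PENDING
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


class JobStore:
    """In-memory stand-in for the durable store (assumption for the sketch)."""

    def __init__(self):
        self._jobs = {}

    def submit_job(self, payload, run_at, cron=None) -> str:
        # Reject naive datetimes so every stored timestamp is unambiguous UTC.
        if run_at.tzinfo is None:
            raise ValueError("run_at must be timezone-aware")
        job = Job(payload, run_at.astimezone(timezone.utc), cron)
        self._jobs[job.id] = job
        return job.id

    def cancel_job(self, job_id) -> bool:
        job = self._jobs.get(job_id)
        if job is None or job.status is not Status.PENDING:
            return False
        job.status = Status.CANCELLED
        return True
```

Cancelling only pending jobs keeps the sketch simple; what to do with a job that is already running is a separate decision to state out loud.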

Store and compare timestamps in UTC, and sync server clocks via NTP; clock drift silently fires jobs early or late. The polling interval defines the precision SLA: poll every 10s and a job can fire up to 10s late. Document that bound; don't treat it as a bug.

The estimate that matters

A billion jobs can’t all sit in memory. That single fact forces a two-tier time store: far-future jobs in durable storage, only near-term jobs in the in-memory timer — the key scaling insight to state early.
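The arithmetic behind that claim can be sketched in a few lines. Every input except the billion-job count is an assumption picked for illustration: 200 bytes per in-memory entry, jobs spread evenly over a 30-day horizon, and a 5-minute near-term window.

```python
TOTAL_JOBS = 1_000_000_000
BYTES_PER_ENTRY = 200    # assumption: id, timestamp, pointers, overhead
HORIZON_DAYS = 30        # assumption: jobs spread evenly over 30 days
WINDOW_SECONDS = 300     # assumption: the wheel holds the next 5 minutes

# Tier 0 thinking: everything in memory.
all_in_memory_gb = TOTAL_JOBS * BYTES_PER_ENTRY / 1e9

# Two-tier thinking: only the near-term window in memory.
jobs_in_window = TOTAL_JOBS * WINDOW_SECONDS / (HORIZON_DAYS * 86_400)
window_mb = jobs_in_window * BYTES_PER_ENTRY / 1e6

print(f"all jobs in memory: {all_in_memory_gb:.0f} GB")
print(f"near-term window:   {jobs_in_window:,.0f} jobs, {window_mb:.1f} MB")
```

Under these assumptions the full schedule needs about 200 GB of RAM, while the near-term window is roughly 116 thousand jobs and about 23 MB. Real load is rarely uniform (top-of-the-hour cron spikes), so treat the window figure as a floor.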

▲ Allow — say this

“Clock drift is real, so timestamps are UTC and servers sync via NTP. And the polling interval is my precision SLA — I’ll state it explicitly rather than pretend firing is instant.”

03 High-Level Design (the MVP) · ~13 min

Grading this window: Decoupled scheduler → queue → workers, with durable storage. Right components, clear lifecycle.

The lifecycle

Jobs are stored durably (a DB, the source of truth). A scheduler finds due jobs and publishes them to a queue (e.g. Kafka); a fleet of workers pulls, executes, and acks. On success, the worker updates the job's status (and computes next_run_at for recurring jobs).

Finding due jobs has three flavors: DB polling (SELECT … WHERE status = 'pending' AND next_run_at <= now() on an index: simple, but adds DB load), a Redis sorted set (ZRANGEBYSCORE by timestamp: fast, O(log N + M) for M due jobs returned), or a timing wheel (a circular buffer of time slots: efficient for many short-delay jobs, used internally by Kafka and Netty).
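The DB-polling flavor is the easiest to make concrete. This is a sketch using SQLite from the standard library as a stand-in for the real database; the table layout, the `queued` status, and the `find_due_jobs` and `claim` helpers are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE jobs (
           id          INTEGER PRIMARY KEY,
           status      TEXT    NOT NULL,
           next_run_at INTEGER NOT NULL,   -- epoch seconds, UTC
           payload     TEXT
       )"""
)
# Composite index so the due-job query is a range scan, not a table scan.
conn.execute("CREATE INDEX idx_jobs_due ON jobs (status, next_run_at)")


def find_due_jobs(conn, now, limit=100):
    """Return ids of pending jobs whose fire time has passed, oldest first."""
    rows = conn.execute(
        "SELECT id FROM jobs "
        "WHERE status = 'pending' AND next_run_at <= ? "
        "ORDER BY next_run_at LIMIT ?",
        (now, limit),
    ).fetchall()
    return [row[0] for row in rows]


def claim(conn, job_id):
    """Flip pending -> queued; only one caller can win the conditional update."""
    cur = conn.execute(
        "UPDATE jobs SET status = 'queued' WHERE id = ? AND status = 'pending'",
        (job_id,),
    )
    conn.commit()
    return cur.rowcount == 1
```

The conditional `UPDATE` is the part worth narrating: even if two pollers read the same row, only one of them changes its status, so the job is enqueued once.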

submit → durable store (job, next_run_at)
scheduler: find due jobs (poll / sorted set / timing wheel) → publish to queue
workers: pull → execute → ack → update status / compute next_run_at

The trap door the interviewer opens here. “A worker pulls a job and crashes before finishing — what happens?” The job must not be lost or silently double-run. Lease it with a visibility timeout / TTL: if the worker doesn’t ack before the lease expires, the job is re-queued for another worker. Combined with idempotent execution, the rerun is safe. Naming lease + idempotency together is the senior signal.
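The lease behaviour described above can be sketched as a small in-memory queue. The `LeaseQueue` class is hypothetical, and time is passed in explicitly so the example is deterministic; a managed queue would track the clock itself.

```python
from collections import deque


class LeaseQueue:
    """Queue where a pulled job is hidden for a while instead of being deleted."""

    def __init__(self, visibility_timeout: float):
        self.visibility_timeout = visibility_timeout
        self._ready = deque()
        self._leased = {}  # job_id -> lease expiry time

    def put(self, job_id: str) -> None:
        self._ready.append(job_id)

    def pull(self, now: float):
        """Lease the next job, first re-queuing any job whose lease ran out."""
        self._requeue_expired(now)
        if not self._ready:
            return None
        job_id = self._ready.popleft()
        self._leased[job_id] = now + self.visibility_timeout
        return job_id

    def ack(self, job_id: str, now: float) -> bool:
        """Remove the job for good; fails if the lease has already expired."""
        expiry = self._leased.get(job_id)
        if expiry is None or expiry <= now:
            return False
        del self._leased[job_id]
        return True

    def _requeue_expired(self, now: float) -> None:
        for job_id, expiry in list(self._leased.items()):
            if expiry <= now:
                del self._leased[job_id]
                self._ready.append(job_id)
```

A worker that crashes never calls `ack`, so the job reappears once the timeout passes. A slow worker whose `ack` returns False has lost the lease, which is exactly the case where idempotent execution keeps the second run harmless.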

▲ Allow — say this

“Scheduling and execution are separate services. The scheduler just decides what’s due and enqueues it; the worker fleet executes. That lets me scale the cheap decision layer and the expensive execution layer independently.”

◆ Throttle — only with a reason

Plain DB polling. Perfectly fine — say so — for modest scale or coarse precision, with an index on (status, next_run_at). Name the cost: it adds load on the DB and the poll interval caps precision. Move to a sorted set or timing wheel when precision or scale demands it.

▼ Reject — never say this

“The worker pulls the job and just runs it.” With no lease and no idempotency, a crash mid-job either loses it or causes a double-run — the exact failure this question exists to probe.

04 Deep Dives: the stress test · ~15 min · where staff is decided

Grading this window: Lead toward the two-tier time store, leader election, exactly-once semantics, and crash handling. Staff candidates volunteer these; this window carries roughly 30% of the score.

Time management at a billion jobs (two-tier)

You can’t hold every future job in memory. Use a timing wheel — a circular buffer where each slot is a time interval — for near-future jobs (O(1) insert/fire), while far-future jobs live in durable storage and are loaded into the wheel as their time approaches. This two-tier design prevents memory exhaustion while keeping near-term firing fast.
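A single-level timing wheel with a loader from the far-future tier might look like the sketch below. The `TimingWheel` class and `load_window` helper are illustrative; a plain dict stands in for durable storage, and ticks are abstract integers (one tick per second, say).

```python
class TimingWheel:
    """Circular buffer: slot i holds jobs whose fire tick satisfies tick % slots == i."""

    def __init__(self, slots: int, start_tick: int = 0):
        self.slots = slots
        self.current = start_tick
        self.buckets = [[] for _ in range(slots)]

    def add(self, job_id: str, fire_tick: int) -> bool:
        """O(1) insert. Returns False if the job lies beyond the wheel's horizon."""
        tick = max(fire_tick, self.current)  # overdue jobs fire on the next advance
        if tick >= self.current + self.slots:
            return False
        self.buckets[tick % self.slots].append(job_id)
        return True

    def advance(self) -> list:
        """Fire everything in the current slot, then move forward one tick."""
        idx = self.current % self.slots
        due, self.buckets[idx] = self.buckets[idx], []
        self.current += 1
        return due


def load_window(wheel: TimingWheel, store: dict, loaded: set) -> None:
    """Copy not-yet-loaded jobs inside the horizon from the store into the wheel.

    A real store would answer this with a range query on next_run_at
    instead of iterating over every job.
    """
    for job_id, fire_tick in store.items():
        if job_id not in loaded and wheel.add(job_id, fire_tick):
            loaded.add(job_id)
```

The store stays the source of truth: loading copies a job into memory without deleting it, so a restart only loses the wheel, which can be rebuilt from the store.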

Leader election (avoid split-brain)

If two schedulers both decide a job is due, it gets assigned twice. Elect a single leader through a consensus-backed coordinator (ZooKeeper, or etcd, which is built on Raft) to make authoritative scheduling decisions. On failover, a new leader is elected in seconds and reloads state from the durable store; to cover the brief gap, the new leader scans for recurring runs that came due during the election when it takes over. Shared scheduling state lives in a distributed DB so any leader has the latest view.
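The single-leader idea reduces to "whoever holds an unexpired lease may schedule." The sketch below simulates that in one process with a hypothetical `LeaderLease` class; it is not a consensus implementation, and in practice the lease would live in ZooKeeper or etcd rather than in local memory.

```python
class LeaderLease:
    """Single-process simulation of a leader lease with a TTL."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, node_id: str, now: float) -> bool:
        """Take or renew the lease; succeeds if it is free, expired, or already ours."""
        if self.holder in (None, node_id) or now >= self.expires_at:
            self.holder = node_id
            self.expires_at = now + self.ttl
            return True
        return False

    def is_leader(self, node_id: str, now: float) -> bool:
        return self.holder == node_id and now < self.expires_at


def scheduler_tick(lease: LeaderLease, node_id: str, now: float) -> str:
    """Only the lease holder makes scheduling decisions; everyone else stands by."""
    if lease.try_acquire(node_id, now):
        return "schedule"
    return "standby"
```

The leader renews well inside the TTL; if it dies, renewals stop and a standby acquires the lease after expiry. The TTL is the upper bound on the scheduling gap, which is why it belongs next to the precision SLA.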

Exactly-once, honestly

True exactly-once is impossible (Two-Generals). Deliver at-least-once and make execution idempotent — idempotency is the application’s responsibility, enforced with a dedup key so a re-run produces no extra effect. A distributed lock with TTL on the job (e.g. Redis SET NX EX) ensures two workers don’t run it concurrently; the TTL releases it if the holder crashes.
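The lock-plus-dedup-key combination can be sketched without a Redis server. `FakeKV` below is a hypothetical in-memory stand-in that mimics set-if-absent-with-expiry semantics; `run_once` and the `completed_runs` set are likewise illustrative names, and time is passed in explicitly to keep the example deterministic.

```python
class FakeKV:
    """In-memory stand-in for a store offering set-if-absent with expiry."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def set_nx_ex(self, key, value, ttl, now):
        held = self._data.get(key)
        if held is not None and held[1] > now:
            return False  # another worker holds an unexpired lock
        self._data[key] = (value, now + ttl)
        return True

    def release(self, key, value):
        held = self._data.get(key)
        if held is not None and held[0] == value:  # only the owner may release
            del self._data[key]


completed_runs = set()  # dedup keys; would be durable in a real system


def run_once(kv, run_id, worker_id, now, action, lock_ttl=30):
    """Run action at most once per run_id, with a TTL lock against concurrent runs."""
    lock_key = f"lock:{run_id}"
    if not kv.set_nx_ex(lock_key, worker_id, lock_ttl, now):
        return "locked"
    try:
        if run_id in completed_runs:
            return "duplicate"
        action()
        completed_runs.add(run_id)
        return "executed"
    finally:
        kv.release(lock_key, worker_id)
```

One gap remains by design: if the worker crashes after `action()` but before the dedup key is recorded, the retry runs the action again. That is why the text places idempotency with the application, for example by writing the effect and the dedup key in one transaction.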

Crash & failure handling

Worker crash: lease/visibility timeout expires → job re-queued → another worker runs it (safe because idempotent).
Poison jobs: after capped retries with backoff, route to a dead-letter queue instead of looping forever.
Dependencies: model job ordering as a DAG — a job fires when its predecessors complete.
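The poison-job rule above (capped retries with backoff, then dead-letter) can be sketched as follows. The base delay, the cap, the attempt limit, and the function names are assumptions chosen for the example; production code usually adds random jitter, which is left out here to keep the output deterministic.

```python
def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff: base, 2*base, 4*base, ... never exceeding cap."""
    return min(cap, base * 2 ** (attempt - 1))


def handle_failure(job: dict, max_attempts: int, retry_queue: list, dead_letter: list):
    """Record a failed attempt; schedule a retry or park the job in the DLQ.

    Returns the retry delay in seconds, or None if the job was dead-lettered.
    """
    job["attempts"] += 1
    if job["attempts"] >= max_attempts:
        dead_letter.append(job)  # stop retrying; a human or tool inspects it
        return None
    delay = backoff_delay(job["attempts"])
    retry_queue.append((job, delay))
    return delay
```

With `max_attempts=3`, a job that keeps failing is retried after 1s and 2s and then lands in the dead-letter list, so it cannot loop forever.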

▲ Allow — say this (staff move)

“I deliberately don’t claim exactly-once. I get effectively-once from at-least-once delivery, a per-job lease with TTL so crashes re-queue safely, and idempotent execution keyed on a job-run id. The leader election prevents double-assignment, and the durable store plus missed-job recovery covers failover gaps.”

▼ Reject — never say this

“We’ll keep all the scheduled jobs in a big in-memory priority queue.” At a billion jobs that exhausts memory and loses everything on restart — you need the durable two-tier design.

Scripted stress-test exchange

Interviewer

A worker pulls a job, starts running it, then crashes. What happens?

You

The job was leased with a TTL, not deleted. When the worker dies it stops renewing the lease; the lease expires and the job becomes visible again, so another worker picks it up. Because execution is idempotent — keyed on a job-run id — the re-run is safe even if the first worker had partially completed. That’s at-least-once delivery made effectively-once by idempotency.

Interviewer

And if the scheduler leader dies?

You

Consensus elects a new leader in a couple of seconds, and it reloads the schedule from the durable store. No job is lost because the store is the source of truth. The risk is jobs due during the election gap; I cover that by having the new leader scan for missed recurring runs when it takes over, and one-off jobs simply fire slightly late, within the precision SLA I stated. Single-leader avoids two schedulers double-assigning the same job.

05 Wrap-up: operability & recap · ~6 min

Grading this window: Prove you could run it. Volunteer observability and rollout; recap; name what you deferred.

Observability

Scheduling lag / fire skew — intended vs actual fire time, against the precision SLA.
Job success / failure / retry rates and dead-letter volume.
Queue depth and leader-failover events.
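Fire skew is simple to compute once intended and actual fire times are logged. The sketch below uses a nearest-rank percentile; the function names and the choice of p99 as the alerting percentile are assumptions for illustration.

```python
import math


def fire_skew_seconds(intended: list, actual: list) -> list:
    """Per-job skew: how late (positive) or early (negative) each job fired."""
    return [a - i for i, a in zip(intended, actual)]


def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def breaches_sla(skews: list, sla_seconds: float, pct: float = 99) -> bool:
    """True when the chosen percentile of skew exceeds the precision SLA."""
    return percentile(skews, pct) > sla_seconds
```

Alerting on a high percentile rather than the mean keeps one badly delayed batch from hiding behind thousands of on-time jobs.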

Rollout & safety

Rate-limit submissions per client so one tenant can’t flood the scheduler. Roll out scheduler changes behind the leader with fast rollback; the durable store means a bad leader can be replaced without losing jobs.
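Per-client submission limiting is often done with a token bucket; that choice, the class names, and the parameters below are assumptions, since the playbook only says to rate-limit per client. Time is passed in explicitly so the sketch stays deterministic.

```python
class TokenBucket:
    """Refills at `rate` tokens per second up to `capacity`; one token per request."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def allow(self, now: float) -> bool:
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class SubmissionLimiter:
    """One bucket per client, created on first use."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, client_id: str, now: float) -> bool:
        if client_id not in self.buckets:
            self.buckets[client_id] = TokenBucket(self.rate, self.capacity, now)
        return self.buckets[client_id].allow(now)
```

Because each client has its own bucket, a tenant that bursts past its capacity is rejected without affecting anyone else's submissions.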

▲ Allow — say this

“With more time I’d detail DAG-based dependencies, priority tiers, and partitioning the schedule across leaders for horizontal scale. I scoped them out deliberately.”

05 The follow-up gauntlet

The probes you'll get — and the answer that holds.

Interviewers probe the failure cases hardest — worker crashes, leader death, exactly-once. Reframe the guarantee, lean on leases and idempotency.

"A worker pulls a job then crashes before finishing."

The job was leased with a TTL, not removed. The crashed worker stops renewing, the lease expires, and the job becomes visible for another worker. Idempotent execution keyed on a job-run id makes the re-run safe even after a partial first attempt — at-least-once made effectively-once.

"Can you guarantee exactly-once?"

Not literally — it's impossible under network failure, the Two-Generals problem. I guarantee at-least-once delivery plus idempotent execution, which is effectively-once. Idempotency is the application's responsibility, enforced with a dedup key; a distributed lock with TTL prevents concurrent double-runs.

"How do you store a billion future jobs without exhausting memory?"

Two tiers: far-future jobs live in durable storage, and only near-term jobs are loaded into an in-memory timing wheel for O(1) firing. As time advances, the next batch of jobs is pulled from the store into the wheel. Memory holds the near horizon, not the whole future.

"Two schedulers both decide the same job is due."

That's split-brain, and it double-assigns. I elect a single leader via consensus to make authoritative scheduling decisions; followers stand by. A per-job lock with TTL is the backstop so even a transient overlap can't run a job twice.

"Cron: 'every Monday at 8 AM' — how?"

Store the cron expression with a next_run_at timestamp. When the scheduler finds it due, it enqueues the run and computes the next fire time from the expression. Timestamps are UTC and servers NTP-sync to fight clock drift; the poll interval defines the firing precision, which I state as an SLA.
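Computing the next fire time for "every Monday at 8 AM" can be sketched without a cron parser. The helper below is hypothetical and assumes 8 AM means 08:00 UTC; a schedule pinned to a local time zone would also have to handle daylight-saving shifts, which this sketch does not.

```python
from datetime import datetime, timedelta, timezone


def next_weekly_fire(after: datetime, weekday: int, hour: int, minute: int = 0) -> datetime:
    """Return the first UTC fire time strictly after `after`.

    weekday follows datetime.weekday(): Monday is 0. `after` must be
    timezone-aware so the conversion to UTC is unambiguous.
    """
    after = after.astimezone(timezone.utc)
    candidate = after.replace(hour=hour, minute=minute, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - after.weekday()) % 7)
    if candidate <= after:
        candidate += timedelta(days=7)  # this week's slot already passed
    return candidate


# After a run fires, the scheduler stores the result as the job's next_run_at.
fired_at = datetime(2026, 1, 5, 8, 0, tzinfo=timezone.utc)  # a Monday
next_run_at = next_weekly_fire(fired_at, weekday=0, hour=8)
```

Using "strictly after" matters: computing the next run from the moment the job fired must return the following Monday, not the same instant again.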

"The scheduler leader dies."

Consensus elects a new leader in seconds, and it reloads the schedule from the durable store, so nothing is lost. Jobs due during the gap are covered by the new leader scanning for missed recurring runs when it takes over, and one-off jobs fire slightly late within the precision SLA. Single-leader is what prevents double-assignment.

Handling a probe you can’t fully answer: reason from the guarantee. “I can’t recite the exact consensus protocol, but the property I need is a single authoritative leader and durable state so failover loses nothing. Here’s how I’d verify no job is dropped or double-run across an election.”

06 What gets you downleveled

The flags that quietly tank an otherwise solid loop.

A clean design with one of these undercurrents still scores below the bar at senior+. None are about getting an answer wrong — they're about how you operate.

Drawing before scoping

Jumping to architecture without bounding the problem or confirming scale. Reads as template-matching.

Hedging without committing

"It depends" with no decision behind it. Name the trade-off, then pick.

Promising exactly-once

Claiming literal exactly-once execution. The honest answer is at-least-once + idempotency = effectively-once.

All jobs in memory

An in-memory priority queue for a billion jobs — it exhausts memory and loses everything on restart. Needs the durable two-tier design.

No lease / no idempotency

Letting a worker run a job with no lease and no dedup, so a crash loses it or a retry double-runs it.

Skipping operations entirely

No observability, no rollout, no failure-mode plan. In 2026 this reads as "has never carried a pager."

Bluffing under a probe

Confident wrong answers when pushed. Far worse than an honest "here's what I'd verify."

Not driving

Waiting to be asked the next question. At staff you own the 45 minutes.

07 Your pre-loop scorecard

Self-grade before you walk in.

Run a mock and score yourself honestly against the dimensions the interviewer uses. If you can't hit "strong" on depth and operability, that's your signal on where to drill.

Dimension	Weak (downlevel)	Strong (at level)
Scoping	Promised exactly-once; ignored precision.	Clarified precision as an SLA; reframed exactly-once as at-least-once + idempotency; decoupled scheduling/execution.
Architecture	One process does everything.	Durable store → scheduler → queue → worker fleet, scaled independently.
Time management	All jobs in memory.	Two-tier: durable store for far-future, in-memory timing wheel for near-term.
Coordination	Multiple schedulers double-assign.	Single leader via consensus; failover reloads from durable store; missed-job recovery.
Exactly-once	No crash story.	Lease with TTL re-queues on crash; idempotent execution keyed on a run id; dead-letter for poison jobs.
Operability	Never mentioned it.	Fire-skew SLO, success/failure/retry/dead-letter metrics, submission rate limiting.

The 60-second recap that lands the level

Quick recap: I reframe exactly-once as at-least-once + idempotency (effectively-once) and clarify firing precision as an SLA; scheduling is decoupled from execution — a durable store feeds a scheduler that enqueues due jobs to workers; time management is two-tier, a timing wheel for near-term over a durable store for the billion far-future jobs; a single elected leader avoids split-brain double-assignment and reloads from the store on failover with missed-job recovery; worker crashes are handled by a lease with TTL that re-queues, made safe by idempotent execution, with a dead-letter for poison jobs. Headline metric: fire skew. With more time: DAG dependencies, priorities, and partitioning across leaders.

The one mental model: a scheduler is a durable two-tier timer feeding an idempotent worker fleet under a single leader — every design choice exists to survive a crash without losing or double-running a job. Say “this is an exactly-once-under-failure problem, and I’ll deliver effectively-once” in the first two minutes, and let leases plus idempotency carry the rest.

Design a Distributed Job Scheduler like a billion jobs are already due.
