Something must happen later — reliably, maybe a billion times, ideally exactly once, and it has to survive machines dying mid-execution. The prompt sounds trivial and hides the hardest guarantee in distributed systems. The senior move is to reframe it before you’re cornered: you don’t get true exactly-once — you get at-least-once plus idempotency, which is effectively-once.
Every FAANG company runs a rubric. The dimensions are roughly the same; the weights differ by company and level. At senior+ the boxes-and-arrows are table stakes — what gets graded hardest is the quality of your decisions: the questions you asked first, the trade-offs you surfaced and defended, and the production reality you volunteered without being asked.
| Dimension | Weight | What earns the signal |
|---|---|---|
| Requirements & scoping | 10–15% | You scoped before drawing, asked enough to bound the problem, pinned the scale number, and stated assumptions out loud. |
| High-level architecture | 20–25% | The right components, a clear data flow, and a reason every box exists. The design satisfies each functional requirement. |
| Technical depth / deep dives | ~30% | You go three questions deep on the hard part without being rescued. This is where staff is won or lost. |
| Trade-offs & judgment | highest effective | Two viable options, what each costs, and a committed pick for this system. Simplicity over flash when flash isn't warranted. |
| Communication / driving | cross-cutting | You drive the 45 minutes; the interviewer never has to rescue you. You narrate, checkpoint, and narrow when the design sprawls. |
| Operational maturity | ↑ in 2026 | The newest weight: observability, rollout, failure modes, on-call reality — volunteered, not pried out. |
A solid design with reasonable trade-offs is a strong score for a mid-level candidate and a downlevel flag for staff. The questions can be identical; the depth expectation is not. As you climb, the balance tips from breadth toward depth, proactivity, and production reality.
You don't recite the AWS Well-Architected pillars; you anchor each decision to one of them. That signals you evaluate systems across competing concerns rather than optimizing one axis. Each pillar below is mapped to a move you can make in this exact design.
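Operational excellence: watch fire skew and retries.
Security: isolate untrusted execution.
Reliability: survive worker and leader death.
Performance efficiency: two-tier time storage.
Cost optimization: scale execution independently.
Sustainability: bound retries.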
The simulation. Framing: a distributed scheduler — billions of scheduled jobs, one-off and recurring (cron), workers assumed to crash mid-execution, and a fire-precision target you must clarify (second-level vs minute-level changes the architecture).
“First, what fire precision do we need? Second-level versus minute-level changes the architecture — polling frequency, locking overhead, the timer mechanism. I’ll design for the precision you need rather than over-engineer.”
“I won’t promise true exactly-once — it’s impossible over an unreliable network, the Two-Generals problem. I’ll deliver at-least-once execution plus idempotency, which is effectively-once. That’s the honest guarantee.”
“We’ll just guarantee every job runs exactly once.” Claiming literal exactly-once for a distributed scheduler signals you haven’t hit the failure cases that make it impossible.
Interviewer: Design a distributed job scheduler.
Candidate: Two things up front. What precision do we need: are we firing to the second or the minute? That drives the timer design. And on guarantees: I won’t claim true exactly-once, which is impossible under network failure; I’ll do at-least-once plus idempotent execution, effectively-once. The architecture decouples the scheduling decision from execution so I can scale them independently. Let me build the job lifecycle, then the failure handling, which is what this question is really testing.
Interviewer: Second-level precision. Assume workers crash.
Candidate: Then leases and re-queue on crash are central, and I’ll use a timing-wheel timer for second-level firing. Let me lay it out.
Entities: Job (id, schedule or cron expression, payload, status, next_run_at), Worker. Interface:
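One possible shape for that interface, as a minimal sketch. The `Job` fields follow the entity list above; the class name `SchedulerAPI`, the method names (`submit`, `get`, `cancel`), and the in-memory dict standing in for the durable store are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional
import uuid


@dataclass
class Job:
    payload: dict
    next_run_at: datetime            # always stored in UTC
    cron: Optional[str] = None       # None means a one-off job
    status: str = "pending"
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


class SchedulerAPI:
    """Hypothetical client-facing interface; a dict stands in for the durable store."""

    def __init__(self) -> None:
        self._jobs: dict[str, Job] = {}

    def submit(self, payload: dict, run_at: datetime, cron: Optional[str] = None) -> str:
        # Reject naive datetimes so every stored timestamp is unambiguous.
        if run_at.tzinfo is None:
            raise ValueError("run_at must be timezone-aware")
        job = Job(payload=payload, next_run_at=run_at.astimezone(timezone.utc), cron=cron)
        self._jobs[job.id] = job
        return job.id

    def get(self, job_id: str) -> Job:
        return self._jobs[job_id]

    def cancel(self, job_id: str) -> bool:
        # Only a job that has not been picked up yet can be cancelled.
        job = self._jobs.get(job_id)
        if job is None or job.status != "pending":
            return False
        job.status = "cancelled"
        return True
```

In an interview, three or four calls like these are usually enough to anchor the rest of the design; the exact names matter less than showing that time is normalized at the boundary.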
Store and compare timestamps in UTC, and sync server clocks via NTP, because clock drift silently fires jobs early or late. The polling interval defines the precision SLA: poll every 10s and a job can fire up to 10s after its due time. Document that; don’t treat it as a bug.
A billion jobs can’t all sit in memory. That single fact forces a two-tier time store: far-future jobs in durable storage, only near-term jobs in the in-memory timer — the key scaling insight to state early.
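A quick back-of-envelope calculation makes the memory argument concrete. The per-job size and the share of jobs that are near-term are assumed figures chosen only for illustration, not numbers from the prompt.

```python
JOBS = 1_000_000_000          # from the prompt: billions of scheduled jobs
BYTES_PER_JOB = 500           # assumption: id, schedule, payload reference, status
NEAR_TERM_JOBS = 1_000_000    # assumption: jobs due within the in-memory horizon

# Holding every job in memory versus only the near-term slice.
total_gb = JOBS * BYTES_PER_JOB / 1e9
near_term_gb = NEAR_TERM_JOBS * BYTES_PER_JOB / 1e9

print(f"all jobs in memory: {total_gb:.0f} GB")
print(f"near-term only:     {near_term_gb:.1f} GB")
```

Under these assumptions the full set needs hundreds of gigabytes while the near-term slice fits comfortably on one node, which is the gap the two-tier store exploits.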
“Clock drift is real, so timestamps are UTC and servers sync via NTP. And the polling interval is my precision SLA — I’ll state it explicitly rather than pretend firing is instant.”
Jobs are stored durably in a DB, which is the source of truth. A scheduler finds due jobs and publishes them to a queue (Kafka); a fleet of workers pulls, executes, and acks. On success the worker updates the job’s status and, for recurring jobs, computes the next next_run_at.
Finding due jobs has three flavors: DB polling (SELECT … WHERE status = 'pending' AND next_run_at <= now on an index: simple, but adds DB load), a Redis sorted set (ZRANGEBYSCORE by timestamp: fast, O(log N + M) for M due jobs returned), or a timing wheel (a circular buffer of time slots: efficient for many short-delay jobs, used by Kafka and Netty internally).
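The DB-polling flavor can be sketched with Python's built-in `sqlite3` module. The table layout, the index name `idx_jobs_due`, the `queued` status, and the helper `claim_due_jobs` are assumptions for illustration; a production system would run the same query against its real store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE jobs (
        id          TEXT PRIMARY KEY,
        status      TEXT NOT NULL,
        next_run_at INTEGER NOT NULL   -- UTC epoch seconds
    )
""")
# Composite index so the due-job scan never touches completed rows.
conn.execute("CREATE INDEX idx_jobs_due ON jobs (status, next_run_at)")


def claim_due_jobs(conn, now, limit=100):
    """Select due jobs and mark them queued inside one transaction."""
    with conn:  # commits on success, rolls back on exception
        rows = conn.execute(
            "SELECT id FROM jobs "
            "WHERE status = 'pending' AND next_run_at <= ? "
            "ORDER BY next_run_at LIMIT ?",
            (now, limit),
        ).fetchall()
        ids = [row[0] for row in rows]
        conn.executemany(
            "UPDATE jobs SET status = 'queued' WHERE id = ?",
            [(job_id,) for job_id in ids],
        )
    return ids


conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?)",
    [("a", "pending", 100), ("b", "pending", 200), ("c", "done", 50)],
)
conn.commit()
```

Calling `claim_due_jobs(conn, now)` on every poll tick returns only jobs whose time has arrived, and marking them `queued` keeps the next tick from enqueuing them again. This sketch assumes a single poller, which matches the single-leader design discussed later.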
“Scheduling and execution are separate services. The scheduler just decides what’s due and enqueues it; the worker fleet executes. That lets me scale the cheap decision layer and the expensive execution layer independently.”
Plain DB polling is perfectly fine for modest scale or coarse precision, with an index on (status, next_run_at), and you should say so. Name the cost: it adds load on the DB, and the poll interval caps precision. Move to a sorted set or timing wheel when precision or scale demands it.
“The worker pulls the job and just runs it.” With no lease and no idempotency, a crash mid-job either loses it or causes a double-run — the exact failure this question exists to probe.
You can’t hold every future job in memory. Use a timing wheel — a circular buffer where each slot is a time interval — for near-future jobs (O(1) insert/fire), while far-future jobs live in durable storage and are loaded into the wheel as their time approaches. This two-tier design prevents memory exhaustion while keeping near-term firing fast.
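A minimal single-level timing wheel, with a loader that promotes jobs from the durable tier as they come within range, might look like the sketch below. The class and function names are hypothetical, one tick stands for one second, and a plain dict stands in for the durable store.

```python
from collections import defaultdict


class TimingWheel:
    """Single-level wheel: `slots` buckets, each covering one tick."""

    def __init__(self, slots, start_tick=0):
        self.slots = slots
        self.current = start_tick
        self.buckets = defaultdict(list)   # slot index -> job ids

    def horizon(self):
        # First tick that is too far away to fit in the wheel.
        return self.current + self.slots

    def add(self, job_id, fire_tick):
        """O(1) insert. Returns False when the job is beyond the horizon
        and should stay in the durable store for now."""
        if fire_tick >= self.horizon():
            return False
        fire_tick = max(fire_tick, self.current)   # overdue jobs fire on the next advance
        self.buckets[fire_tick % self.slots].append(job_id)
        return True

    def advance(self):
        """Fire everything in the current slot, then move one tick forward."""
        due = self.buckets.pop(self.current % self.slots, [])
        self.current += 1
        return due


def load_near_term(wheel, durable_store):
    """Move jobs that now fall inside the horizon from the store into the wheel.
    A real store would answer this with an indexed range query, not a full scan."""
    for job_id, fire_tick in list(durable_store.items()):
        if wheel.add(job_id, fire_tick):
            del durable_store[job_id]
```

The wheel only ever holds `slots` ticks of work, so memory is bounded by the near horizon regardless of how many far-future jobs sit in the store. The loader has to run at least once per wheel rotation or jobs will be promoted late.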
If two schedulers both decide a job is due, it gets assigned twice. Elect a single leader via consensus (ZooKeeper/etcd/Raft) to make authoritative scheduling decisions. On failover, a new leader is elected in seconds and reloads state from the durable store; to cover the brief gap, workers re-check for missed recurring jobs on startup. Shared scheduling state lives in a distributed DB so any leader has the latest view.
True exactly-once is impossible (Two-Generals). Deliver at-least-once and make execution idempotent — idempotency is the application’s responsibility, enforced with a dedup key so a re-run produces no extra effect. A distributed lock with TTL on the job (e.g. Redis SET NX EX) ensures two workers don’t run it concurrently; the TTL releases it if the holder crashes.
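The lock-plus-dedup combination can be sketched without a real Redis. `LockStore` below is an in-memory stand-in that mimics set-if-absent with expiry (the semantics of `SET key value NX EX ttl`); `run_once`, the `completed` set, and the injected clock are assumptions for illustration.

```python
import time


class LockStore:
    """In-memory stand-in for a store offering set-if-absent with a TTL."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._locks = {}   # key -> (owner, expires_at)

    def acquire(self, key, owner, ttl):
        now = self._clock()
        held = self._locks.get(key)
        if held is not None and held[1] > now:
            return False                      # someone holds a live lock
        self._locks[key] = (owner, now + ttl)  # absent or expired: take it
        return True

    def release(self, key, owner):
        # Only the current owner may release, so a slow worker cannot
        # drop a lock that has since passed to someone else.
        if self._locks.get(key, (None, 0))[0] == owner:
            del self._locks[key]


def run_once(run_id, worker, locks, completed, effect, ttl=30):
    """Run `effect` at most once per run_id. Returns True if this call ran it."""
    if run_id in completed:                   # dedup: already done
        return False
    key = f"lock:{run_id}"
    if not locks.acquire(key, worker, ttl):   # another worker is mid-run
        return False
    try:
        if run_id in completed:               # re-check after winning the lock
            return False
        effect()
        completed.add(run_id)
        return True
    finally:
        locks.release(key, worker)
```

If a worker takes the lock and dies, nobody releases it, and the TTL is what lets a second worker retry. Note that `effect()` and `completed.add()` are not atomic in this sketch; a real system would commit the dedup record in the same transaction as the side effect, or make the effect itself idempotent.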
“I deliberately don’t claim exactly-once. I get effectively-once from at-least-once delivery, a per-job lease with TTL so crashes re-queue safely, and idempotent execution keyed on a job-run id. The leader election prevents double-assignment, and the durable store plus missed-job recovery covers failover gaps.”
“We’ll keep all the scheduled jobs in a big in-memory priority queue.” At a billion jobs that exhausts memory and loses everything on restart — you need the durable two-tier design.
Interviewer: A worker pulls a job, starts running it, then crashes. What happens?
Candidate: The job was leased with a TTL, not deleted. When the worker dies it stops renewing the lease; the lease expires and the job becomes visible again, so another worker picks it up. Because execution is idempotent, keyed on a job-run id, the re-run is safe even if the first worker had partially completed. That’s at-least-once delivery made effectively-once by idempotency.
Interviewer: And if the scheduler leader dies?
Candidate: Consensus elects a new leader in a couple of seconds, and it reloads the schedule from the durable store. No job is lost because the store is the source of truth. The risk is jobs due during the election gap; I cover that by having workers re-check for missed recurring jobs on startup, and one-off jobs simply fire slightly late, within the precision SLA I stated. Single-leader avoids two schedulers double-assigning the same job.
Rate-limit submissions per client so one tenant can’t flood the scheduler. Roll out scheduler changes behind the leader with fast rollback; the durable store means a bad leader can be replaced without losing jobs.
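Per-client submission limiting is often done with a token bucket; that algorithm is a choice made here for illustration, not something the design above prescribes. The class names and the injected clock are likewise assumptions.

```python
import time


class TokenBucket:
    """Refills at `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self):
        now = self._clock()
        # Credit tokens for the time elapsed since the last call.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


class SubmissionLimiter:
    """One bucket per client, so a noisy tenant only exhausts its own budget."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self._rate, self._capacity, self._clock = rate, capacity, clock
        self._buckets = {}

    def allow(self, client_id):
        bucket = self._buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._capacity, self._clock)
            self._buckets[client_id] = bucket
        return bucket.allow()
```

The submit path would call `limiter.allow(client_id)` before writing to the durable store and reject the request when it returns `False`. Capacity sets the burst a client may send; rate sets the sustained throughput.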
“With more time I’d detail DAG-based dependencies, priority tiers, and partitioning the schedule across leaders for horizontal scale. I scoped them out deliberately.”
Interviewers probe the failure cases hardest — worker crashes, leader death, exactly-once. Reframe the guarantee, lean on leases and idempotency.
The job was leased with a TTL, not removed. The crashed worker stops renewing, the lease expires, and the job becomes visible for another worker. Idempotent execution keyed on a job-run id makes the re-run safe even after a partial first attempt — at-least-once made effectively-once.
Exactly-once is not literally achievable: it's impossible under network failure, the Two-Generals problem. I guarantee at-least-once delivery plus idempotent execution, which is effectively-once. Idempotency is the application's responsibility, enforced with a dedup key; a distributed lock with TTL prevents concurrent double-runs.
Two tiers: far-future jobs live in durable storage, and only near-term jobs are loaded into an in-memory timing wheel for O(1) firing. As time advances, the next batch of jobs is pulled from the store into the wheel. Memory holds the near horizon, not the whole future.
Two schedulers active at once is split-brain, and it double-assigns jobs. I elect a single leader via consensus to make authoritative scheduling decisions; followers stand by. A per-job lock with TTL is the backstop so even a transient overlap can't run a job twice.
Store the cron expression with a next_run_at timestamp. When the scheduler finds it due, it enqueues the run and computes the next fire time from the expression. Timestamps are UTC and servers NTP-sync to fight clock drift; the poll interval defines the firing precision, which I state as an SLA.
Consensus elects a new leader in seconds, and it reloads the schedule from the durable store, so nothing is lost. Jobs due during the gap are covered by workers re-checking missed recurring jobs on startup, and one-off jobs fire slightly late within the precision SLA. Single-leader is what prevents double-assignment.
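For the recurring-job answer above, the step of computing the next fire time can be sketched as follows. Parsing real cron expressions needs a dedicated library, so a fixed `timedelta` interval stands in for the expression here; the function name and the policy of skipping missed occurrences rather than replaying them are assumptions.

```python
from datetime import datetime, timedelta, timezone


def next_run_after(scheduled, interval, now):
    """Return the next fire time strictly after `now`, aligned to the schedule.

    Stepping from the scheduled time rather than from `now` keeps late
    executions from pushing every later run back (cumulative drift).
    """
    if now < scheduled:
        return scheduled
    elapsed_intervals = (now - scheduled) // interval   # whole intervals already passed
    return scheduled + (elapsed_intervals + 1) * interval


# Usage: a job scheduled for 00:00 UTC every 5 minutes, handled 3 seconds late.
t0 = datetime(2030, 1, 1, 0, 0, tzinfo=timezone.utc)
every_5m = timedelta(minutes=5)
following = next_run_after(t0, every_5m, t0 + timedelta(seconds=3))
```

Here `following` is 00:05 UTC rather than 00:05:03, so the schedule stays on its grid. After a long outage the same function jumps straight to the next future slot, which is one reasonable policy; a system that must replay missed runs would enqueue them explicitly.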
A clean design with one of these undercurrents still scores below the bar at senior+. None are about getting an answer wrong — they're about how you operate.
Jumping to architecture without bounding the problem or confirming scale. Reads as template-matching.
"It depends" with no decision behind it. Name the trade-off, then pick.
Claiming literal exactly-once execution. The honest answer is at-least-once + idempotency = effectively-once.
An in-memory priority queue for a billion jobs — it exhausts memory and loses everything on restart. Needs the durable two-tier design.
Letting a worker run a job with no lease and no dedup, so a crash loses it or a retry double-runs it.
No observability, no rollout, no failure-mode plan. In 2026 this reads as "has never carried a pager."
Confident wrong answers when pushed. Far worse than an honest "here's what I'd verify."
Waiting to be asked the next question. At staff you own the 45 minutes.
Run a mock and score yourself honestly against the dimensions the interviewer uses. If you can't hit "strong" on depth and operability, that's your signal on where to drill.
| Dimension | Weak (downlevel) | Strong (at level) |
|---|---|---|
| Scoping | Promised exactly-once; ignored precision. | Clarified precision as an SLA; reframed exactly-once as at-least-once + idempotency; decoupled scheduling/execution. |
| Architecture | One process does everything. | Durable store → scheduler → queue → worker fleet, scaled independently. |
| Time management | All jobs in memory. | Two-tier: durable store for far-future, in-memory timing wheel for near-term. |
| Coordination | Multiple schedulers double-assign. | Single leader via consensus; failover reloads from durable store; missed-job recovery. |
| Exactly-once | No crash story. | Lease with TTL re-queues on crash; idempotent execution keyed on a run id; dead-letter for poison jobs. |
| Operability | Never mentioned it. | Fire-skew SLO, success/failure/retry/dead-letter metrics, submission rate limiting. |