An interview is itself a rate-limited resource: 45 minutes, one token bucket of attention, and an interviewer deciding which of your moves to allow and which to throttle. This playbook scripts the whole loop — phase by phase, probe by probe — for the most classic distributed-systems question there is. It tells you what to say, what never to say, and exactly how the signal changes between mid, senior, and staff.
Every FAANG company runs a rubric. The dimensions are roughly the same; the weights differ by company and level. The single biggest mistake at senior+ is treating this as a drawing exercise. The boxes-and-arrows are table stakes. What gets graded hardest is the quality of your decisions — the questions you asked first, the trade-offs you surfaced and defended, the moments you committed instead of hedging, and the operational reality you volunteered without being asked.
| Dimension | Weight | What earns the signal |
|---|---|---|
| Requirements & scoping | 10–15% | You scoped before drawing. Asked enough to bound the problem, not so many that it reads as stalling or template-following. You pinned the scale number and stated assumptions out loud. |
| High-level architecture | 20–25% | The right components, a clear data flow, and a reason every box exists. No orphan services. The design satisfies each functional requirement you wrote down. |
| Technical depth / deep dives | ~30% | You can go three questions deep on the hard part — the race condition, the sharding scheme, the failover mode — without being rescued. This is where staff is won or lost. |
| Trade-offs & judgment | highest effective | You propose at least two viable options, name what each one costs, and commit to one for this system with a reason. Simplicity chosen over flash when flash isn't warranted. |
| Communication / driving | cross-cutting | You drive the 45 minutes; the interviewer never has to rescue you. You narrate, checkpoint, and notice when the design is sprawling and deliberately narrow it. |
| Operational maturity | ↑ in 2026 | The newest weight. Observability, rollout/canary, on-call reality, failure modes — volunteered, not pried out. Skipping it now reads as "has never been on call," a senior red flag. |
A solid high-level design with reasonable trade-offs is a strong score for a new grad — and a downlevel flag for a staff candidate. The questions can be identical; the depth expectation is not. The clearest way to think about it is the ratio of breadth (covering the whole system) to depth (going deep on the hard parts). As you climb, the balance tips toward depth, proactivity, and production reality.
You don't need to recite AWS, but the six Well-Architected pillars are a battle-tested checklist for sounding senior. When you justify a decision, anchor it to one of these. It signals that you evaluate systems the way an architect does — across competing concerns — rather than optimizing a single axis. Below, each pillar is mapped directly onto a move you can make in the rate-limiter design.
Operational excellence. Run, monitor, and continuously improve. Operations as code; observability built in.
Security. Protect data and systems; least privilege; defense in depth.
Reliability. Recover from failure, scale horizontally, design for the partition.
Performance efficiency. Use resources efficiently; keep the hot path cheap.
Cost optimization. Stop paying for what you don't need; right-size.
Sustainability. Minimize the footprint of the workload over time.
A rate limiter sits on the hot path and answers one question per request: has this key exceeded its budget? The whole design hinges on a single, atomic, shared counter — usually a token bucket in Redis updated with one Lua script — so that distributed gateway nodes never disagree about the count.
This is the simulation. For each phase you get a time budget, what the interviewer is grading in that window, the lines that earn the signal, the ones that lose it, and a scripted exchange for the rate-limiter problem specifically. The framing throughout: the rate limiter for a social platform's API — 1M requests/second across 100M daily active users, with a <10ms overhead budget per check.
Separate functional requirements (what it does) from non-functional (how well). Then nail the one number that changes every downstream decision: scale. Write them where the interviewer can see them; this is your contract for the rest of the 45 minutes.
"Before I design anything — what scale are we targeting? A startup doing thousands of requests a day and a platform doing a million a second are completely different machines. I'll assume the large-platform case unless you'd rather scope it down."
"I'll treat strong consistency as out of scope. For rate limiting, briefly over-admitting a handful of requests during a partition is fine; rejecting legitimate traffic to stay perfectly consistent is worse. So I'm optimizing for availability and low latency."
Listing ten clarifying questions in a row. One or two sharp ones that visibly change the design is senior; a checklist recited from memory is the opposite signal. Ask, then use the answer.
"Let me just start drawing the architecture." Jumping to boxes before scoping is the most common junior tell. So is silently assuming the scale instead of confirming it.
Interviewer: Design a rate limiter.
Candidate: Let me scope it first. Are we rate-limiting individual HTTP requests at the edge, or higher-level business actions? I'll assume request-level and server-side, since clients can't be trusted to self-regulate. And what scale: are we protecting a public API at, say, a million requests a second, or something smaller?
Interviewer: Million a second. Social platform.
Candidate: Good. Then my three functional requirements are: identify the client by user / IP / key, enforce configurable rules, and return 429 with retry headers. Non-functionally: under 10ms overhead, highly available, eventual consistency acceptable. Strong consistency I'm putting below the line. Sound right before I move on?
Three entities carry this design: Rules (the policies — requests per window, which clients, which endpoints), Clients (the identities being limited, with their current usage state), and Requests (carry the context that selects which rules apply). The interface is deliberately tiny:
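A minimal sketch of what that interface could look like in Python. The names `RateLimitResult`, `RateLimiter`, `is_request_allowed`, and `AllowAll` are hypothetical; only the idea of returning allow/deny plus the remaining count and reset time comes from the text.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class RateLimitResult:
    allowed: bool      # allow/deny decision for this request
    remaining: int     # requests left in the client's current budget
    reset_time: float  # Unix time (seconds) at which the budget is replenished


class RateLimiter(Protocol):
    def is_request_allowed(self, client_id: str, rule_id: str) -> RateLimitResult:
        """One synchronous check per request."""
        ...


class AllowAll:
    """Trivial stand-in implementation, only here to show the call shape."""

    def is_request_allowed(self, client_id: str, rule_id: str) -> RateLimitResult:
        return RateLimitResult(allowed=True, remaining=99, reset_time=0.0)


result = AllowAll().is_request_allowed("user:42", "api_default")
```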
That return shape isn't incidental — remaining and resetTime are exactly what populate your X-RateLimit-* response headers. Calling that out shows you're designing the contract and the client experience together.
The one estimate that drives the architecture: a single Redis instance handles roughly 100K–200K simple ops/sec, and each check is at minimum a read plus a write. So one instance realistically sustains ~50K–100K checks/sec. At 1M req/s you therefore need on the order of 10–20 shards. State that math out loud; it's the bridge into your deep dive on scaling.
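The same back-of-envelope arithmetic, written out as a small sketch. The throughput figures are the rough estimates quoted above, not measurements, and `shards_needed` is a hypothetical helper.

```python
import math

TARGET_CHECKS_PER_SEC = 1_000_000
OPS_PER_CHECK = 2  # at minimum one read plus one write per check


def shards_needed(instance_ops_per_sec: int) -> int:
    """Shards required if each Redis instance sustains the given simple ops/sec."""
    checks_per_instance = instance_ops_per_sec // OPS_PER_CHECK
    return math.ceil(TARGET_CHECKS_PER_SEC / checks_per_instance)


optimistic = shards_needed(200_000)    # 100K checks/sec per instance
conservative = shards_needed(100_000)  # 50K checks/sec per instance
print(optimistic, conservative)        # prints "10 20"
```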
"I want the limiter to expose one synchronous call that returns allow/deny plus the remaining count and reset time, because those values are what the gateway needs to write the response headers. Keeping the interface this thin means I can change the algorithm behind it without touching callers."
Spending five minutes on a full relational schema. A rate limiter's "data model" is a key and a couple of counters — over-modeling it signals you've pattern-matched to the wrong kind of problem.
Walk the three placements and commit. In-process (in each app server) is fastest but each server sees only its slice of traffic, so global limits become "off by a factor of N." A dedicated service gives global state and rich context but adds a network hop to every request and a new failure point. The API gateway / edge is the standard production answer: every request hits it first, bad traffic is turned away at the door before it touches your app servers. Commit to the gateway and name the trade-off you're accepting — limited business context, since the gateway only sees the HTTP request.
Show you know the field, then choose. The honest interview line is "it depends on the use case" — followed immediately by a committed pick, never left hanging.
| Algorithm | Bursts | Memory | Accuracy | One-line verdict |
|---|---|---|---|---|
| Fixed window | Yes, at edges | Low | Low | Simplest; the boundary lets a user double their limit across the window seam. |
| Sliding window log | No | High | Perfect | Exact, but stores every timestamp — too costly at 100M users. |
| Sliding window counter | Partial | Low | High | Two counters + weighting math; great default for an external API. |
| Token bucket | Yes | Low | High | Handles sustained rate and bursts; what Stripe and AWS use. |
| Leaky bucket | No (smooths) | Medium | Perfect rate | FIFO queue, constant outflow; good for shaping, not inbound API limits. |
For this system, commit to token bucket: it accommodates the bursty nature of real API traffic while still enforcing an average rate, it's cheap (track tokens + last_refill per client), and it's what large API providers actually run. If pressed on the burst-vs-precision trade-off, name sliding-window-counter as the alternative you'd reach for if even bursts had to be tightly bounded.
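The "two counters + weighting math" behind the sliding-window-counter alternative fits in a few lines. A minimal sketch, assuming the window is tracked as a previous and a current fixed bucket; the function names are hypothetical.

```python
def sliding_window_estimate(prev_count: int, curr_count: int,
                            window_s: float, elapsed_in_curr_s: float) -> float:
    """Estimate requests in the trailing window.

    The previous bucket is weighted by the fraction of it that still
    overlaps the trailing window; the current bucket counts in full.
    """
    overlap = 1.0 - (elapsed_in_curr_s / window_s)
    return curr_count + prev_count * overlap


def allows(prev_count: int, curr_count: int, limit: int,
           window_s: float, elapsed_in_curr_s: float) -> bool:
    """Allow the next request only if the estimate is still under the limit."""
    return sliding_window_estimate(prev_count, curr_count,
                                   window_s, elapsed_in_curr_s) < limit


# 15s into a 60s window, 75% of the previous window still counts:
estimate = sliding_window_estimate(prev_count=80, curr_count=30,
                                   window_s=60.0, elapsed_in_curr_s=15.0)
```

The estimate assumes requests in the previous window were spread evenly, which is where the "High" rather than "Perfect" accuracy comes from.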
Token buckets must be shared across all gateway instances, or you're back to the in-process problem. Use Redis as the central, sub-millisecond store. Flow: gateway reads the bucket (tokens, last_refill), computes refill from elapsed time, decrements if a token is available, writes back, and sets a TTL so idle buckets self-clean.
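The refill-and-decrement step of that flow can be sketched in plain Python. This is an in-memory illustration only: `Bucket` and `try_consume` are hypothetical names, and the Redis read, write-back, and TTL are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    tokens: float       # tokens currently available
    last_refill: float  # timestamp (seconds) of the last refill computation


def try_consume(bucket: Bucket, now: float,
                capacity: float, refill_per_s: float) -> bool:
    """Refill from elapsed time, then spend one token if one is available."""
    elapsed = max(0.0, now - bucket.last_refill)
    bucket.tokens = min(capacity, bucket.tokens + elapsed * refill_per_s)
    bucket.last_refill = now
    if bucket.tokens >= 1.0:
        bucket.tokens -= 1.0
        return True
    return False


# Capacity 2 (burst allowance), refill 1 token/sec (sustained rate).
bucket = Bucket(tokens=2.0, last_refill=0.0)
decisions = [
    try_consume(bucket, 0.0, 2.0, 1.0),  # burst: allowed
    try_consume(bucket, 0.0, 2.0, 1.0),  # burst: allowed
    try_consume(bucket, 0.0, 2.0, 1.0),  # bucket empty: rejected
    try_consume(bucket, 1.0, 2.0, 1.0),  # one token refilled: allowed
]
```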
That read-modify-write is a race: two gateways can read the same token count, both decide to allow, and both write back, so one decrement is lost. MULTI/EXEC isn't enough, because the read happened outside the transaction. The correct answer is to make the entire read-modify-write atomic with a Lua script, which Redis runs as one indivisible operation. Reaching for this before being prompted is a senior signal; reaching for it only after the interviewer points at the race is a mid signal.
Fail fast. Return 429 immediately rather than queuing excess requests: queuing burns memory, makes latency unpredictable, and invites retry storms when users assume failure. Pair the 429 with headers so well-behaved clients can back off intelligently:
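A sketch of how those response headers might be built from the limiter's result. The `X-RateLimit-*` names follow a widely used convention rather than a formal standard, `Retry-After` is the standard HTTP header, and `rate_limit_headers` is a hypothetical helper.

```python
import math


def rate_limit_headers(allowed: bool, limit: int, remaining: int,
                       reset_time: float, now: float) -> dict:
    """Build rate-limit headers; Retry-After is added only on a rejection."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(math.ceil(reset_time)),  # Unix seconds
    }
    if not allowed:
        # Seconds the client should wait before retrying (at least 1).
        headers["Retry-After"] = str(max(1, math.ceil(reset_time - now)))
    return headers


rejected = rate_limit_headers(allowed=False, limit=100, remaining=0,
                              reset_time=1_700_000_030.0, now=1_700_000_000.0)
```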
"I'll put the limiter at the API gateway. It's the bouncer at the door — abusive traffic gets a 429 before it ever reaches an app server. The cost I'm accepting is that the gateway only sees the HTTP request, so rule logic that needs deep user context has to come from the token or a sidecar lookup."
"There's a race in the read-modify-write across gateways. I'll close it by doing the whole check inside a Redis Lua script, which executes atomically — that expands the atomic boundary to cover the read, not just the write."
"Let's queue the rejected requests." Defensible only for batch/async systems where latency doesn't matter — and you must say so. For an interactive API it's the wrong call; name why before dismissing it.
"We'll just use Redis and increment a counter." Hand-waving the atomicity is the exact gap interviewers probe. "Just use Redis" only survives until the follow-up about 50 servers racing on the same key.
One Redis instance can't carry the load (you established this in estimation). Shard the rate-limit data, but shard consistently: every request for a given client must hit the same shard, or its state fragments and the limit becomes meaningless. Hash the identifier (user ID for authenticated, IP for anonymous, key for developer traffic) with consistent hashing so re-sharding doesn't reshuffle everything. In production you'd lean on Redis Cluster, which spreads keys across 16,384 hash slots automatically rather than you hand-rolling the routing. Ten shards at ~100K checks/sec each covers the target at the optimistic end of the estimate; plan closer to twenty if each shard sustains only ~50K.
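To make "every request for a given client must hit the same shard" concrete, here is a sketch of the key-to-slot mapping Redis Cluster documents: CRC16 (XMODEM variant) of the key modulo 16,384, hashing only the `{...}` hash tag when one is present. `key_hash_slot` is a hypothetical name, the key layout is an assumption, and a real client library does this routing for you.

```python
import binascii

HASH_SLOTS = 16384


def key_hash_slot(key: str) -> int:
    """Map a key to a cluster hash slot, honoring a non-empty {hash tag}."""
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end > start + 1:  # a closing brace exists and the tag is non-empty
            data = data[start + 1:end]
    # binascii.crc_hqx with an initial value of 0 is the CRC16/XMODEM variant.
    return binascii.crc_hqx(data, 0) % HASH_SLOTS


# Keys sharing a hash tag land in the same slot, so one client's state
# stays on one shard even when it is spread over several keys.
same_slot = key_hash_slot("rl:{user:42}:tokens") == key_hash_slot("rl:{user:42}:blocked")
```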
Each shard is now critical. Give every shard replicas with automatic failover (Redis Cluster does this). Then make the deliberate call interviewers want to hear reasoned, not guessed: when the limiter can't reach Redis, do you fail open (allow everything, stay available, lose protection) or fail closed (reject everything, stay protected, take the API offline)?
The senior move is to recognize there's no universal answer and to reason from this system. For a social platform, fail closed: limiter failures tend to coincide with traffic spikes, and failing open in that moment floods your databases — turning a limiter outage into a full platform outage. For a payments system you might argue the opposite. State your reasoning, commit, and acknowledge the cost.
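Whichever way you argue it, the failure mode should be an explicit setting rather than an accident of exception handling. A minimal sketch, assuming a hypothetical `LimiterUnavailable` exception raised when Redis cannot be reached:

```python
from typing import Callable


class LimiterUnavailable(Exception):
    """Raised (in this sketch) when the limiter's backing store is unreachable."""


def decide(check: Callable[[], bool], fail_open: bool) -> bool:
    """Return the limiter's decision, or the configured fallback on failure."""
    try:
        return check()
    except LimiterUnavailable:
        # fail_open=True  -> allow the request (stay available, lose protection)
        # fail_open=False -> reject the request (stay protected, lose availability)
        return fail_open


def unreachable() -> bool:
    raise LimiterUnavailable("redis shard timed out")


closed_decision = decide(unreachable, fail_open=False)  # request rejected
open_decision = decide(unreachable, fail_open=True)     # request allowed
```

A real gateway would also emit a metric and alert at the point where the fallback is taken.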
Biggest win is connection pooling: reuse persistent Redis connections instead of paying a TCP handshake on every check. A handshake costs at least one extra round trip, and the 20–50ms it can take over a long-haul link is several times the entire 10ms budget. Next is geographic distribution: co-locate gateways and Redis by region so a Tokyo user isn't crossing an ocean for a counter check, accepting cross-region eventual consistency in exchange. Mention that pipelining and local caching exist, but note you'd skip them unless asked: a stale local cache causes incorrect decisions, and the complexity usually isn't worth it once pooling and locality are handled.
A single client generating tens of thousands of req/s can overwhelm one shard. Split the response by intent. For legitimate high-volume clients, encourage client-side rate limiting, offer batching, and provision premium tiers. For abuse, auto-blocklist a client that trips limits repeatedly (store the list in a shard, check on cache miss) and lean on edge DDoS protection (Cloudflare / AWS Shield) before traffic reaches the limiter. Note up front that corporate NATs and public WiFi share IPs, so set IP limits generously and prefer authenticated-user limits.
Real systems change limits without a deploy — launch boosts, premium tiers, emergency clamp-downs. Poll-based (gateways poll a config store every ~30s) is simple and covers most cases; its cost is up-to-30s propagation lag, painful during an attack. Push-based (ZooKeeper or Redis pub/sub notifies gateways instantly) updates in seconds but adds real complexity around partial failures and fallback. Recommend poll as the default, push only when seconds-fast updates are a genuine requirement.
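The poll-based default can be sketched as a small cache that refreshes on read once the interval has elapsed. This is an illustration under assumptions: `PolledRules` and the `fetch` callable are hypothetical, and a production gateway would more likely refresh on a background timer.

```python
import time
from typing import Callable, Dict


class PolledRules:
    """Serve cached rules, re-fetching from the config store every interval_s."""

    def __init__(self, fetch: Callable[[], Dict[str, int]],
                 interval_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._interval_s = interval_s
        self._clock = clock
        self._rules = fetch()
        self._fetched_at = clock()

    def get(self) -> Dict[str, int]:
        now = self._clock()
        if now - self._fetched_at >= self._interval_s:
            try:
                self._rules = self._fetch()
            except Exception:
                pass  # config store unreachable: keep the last known-good rules
            self._fetched_at = now
        return self._rules


# Simulated config store and clock to show the propagation lag.
store = {"default": 100}
fake_now = [0.0]
rules = PolledRules(lambda: dict(store), interval_s=30.0, clock=lambda: fake_now[0])

store["default"] = 10          # emergency clamp-down written to the store
fake_now[0] = 29.0
before = rules.get()["default"]  # still the old limit inside the poll interval
fake_now[0] = 30.0
after = rules.get()["default"]   # new limit picked up at the next poll
```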
"I'll fail closed for this platform. It feels counterintuitive against an availability goal, but limiter failures and traffic spikes happen together — failing open at that moment sends the full flood downstream and collapses the backend. Brief rejected requests beat a cascading outage. For a payments API I'd argue the other way."
"Beyond picking a failure mode, the better answer is not failing — so: replicas per shard with automatic promotion, and I'd roll any limit change out as a canary to a small slice of traffic first, watching reject-rate before going fleet-wide."
Local in-gateway caching of counts. It cuts latency but risks stale decisions; only propose it with the staleness trade-off named and bounded, and say you'd skip it unless the latency budget forced your hand.
"We'll just add more Redis servers." Adding capacity without a partitioning scheme splits each client's state and breaks the limit. The follow-up — "how does a request find the right shard?" — is exactly where this answer falls apart.
Interviewer: A Redis shard dies mid-spike. What happens to the users on it?
Candidate: Two layers. First, prevention: that shard has replicas, so Cluster promotes one automatically and the outage window is short. Second, if we genuinely can't reach it, we hit our failure mode. I chose fail-closed for this platform, so those users get 429s briefly. That's deliberate: the alternative, fail-open during a spike, floods the backend and takes everyone down, not just one shard's users.
Interviewer: Isn't rejecting valid users bad?
Candidate: It is. It's a real cost, and I'd surface it on a dashboard and alert on entering fail-closed. But it's bounded and recoverable. A cascading database failure isn't. I'm trading a small, visible degradation for platform-wide reliability. If this were payments I'd weigh it differently, and if the business wanted availability over protection here, fail-open is a one-line config flip. I'd want that decision made explicitly, not by accident.
Limit changes go out as canaries — a small traffic slice first, watch reject-rate, then widen. Keep a fast kill-switch to disable enforcement (fail-open by choice) if a bad rule starts rejecting good traffic. Document the runbook: what an on-call engineer does when reject-rate spikes or a shard is unhealthy.
"Before we wrap — let me name what I'd do with more time, so it's clear I know what I deferred: multi-region active-active with the cross-region drift that implies, and a cost model per shard. I scoped those out, I didn't miss them."
Going silent and waiting for the interviewer to say "we're done." Letting the clock run out without a recap or operability story wastes the highest-leverage minutes you have left.
Interviewers rarely accept the first answer. They push to find the edge of your knowledge, watch how you handle pressure, and check whether you commit or crumble. Here are the probes that come up on this question, with the response that keeps the signal green. The meta-skill: answer the question asked, give the reasoning, then stop — don't over-talk and don't reverse a correct call just because they pushed.
Both are good. Token bucket models bursts explicitly — bucket size is the burst allowance, refill rate is the sustained rate — which matches real API traffic, and it's what Stripe and AWS run. Sliding-window-counter is the better default when you need tightly bounded bursts and simpler reasoning in a distributed setting. I picked token bucket for burst friendliness; if the requirement were "never exceed N in any 60s window, ever," I'd switch.
Classic lost-update. Both read 99, both decide allow, both write 100 — 101 requests slipped through. MULTI/EXEC doesn't fix it because the read is outside the transaction. The fix is a Lua script that does read-decide-write as one atomic Redis operation, so the second request sees the first's result. It's the "expand the atomic boundary" pattern.
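One way the atomic script could look, as a sketch rather than a production implementation. It assumes the redis-py client (whose `eval(script, numkeys, *keys_and_args)` runs a Lua script server-side), a Redis hash with fields `tokens` and `last_refill`, a refill rate greater than zero, and a gateway-supplied timestamp, so clock skew between gateways is ignored. `check_token_bucket` and the key layout are hypothetical.

```python
import time

# Read, refill, decide, and write back inside one script, so no other
# command can interleave between the read and the write.
TOKEN_BUCKET_LUA = """
local capacity = tonumber(ARGV[1])
local refill_per_s = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local state = redis.call('HMGET', KEYS[1], 'tokens', 'last_refill')
local tokens = tonumber(state[1]) or capacity   -- new bucket starts full
local last_refill = tonumber(state[2]) or now

local elapsed = math.max(0, now - last_refill)
tokens = math.min(capacity, tokens + elapsed * refill_per_s)

local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end

redis.call('HSET', KEYS[1], 'tokens', tokens, 'last_refill', now)
-- Idle buckets expire after twice the time needed to refill completely.
redis.call('EXPIRE', KEYS[1], math.ceil(capacity / refill_per_s) * 2)
return {allowed, math.floor(tokens)}
"""


def check_token_bucket(client, key, capacity, refill_per_s, now=None):
    """Run the atomic check; returns (allowed, whole tokens remaining)."""
    now = time.time() if now is None else now
    allowed, remaining = client.eval(
        TOKEN_BUCKET_LUA, 1, key, capacity, refill_per_s, now
    )
    return bool(allowed), int(remaining)
```

With redis-py this would be called as `check_token_bucket(redis.Redis(), "rl:{user:42}", 100, 10)`. Registering the script once and invoking it by hash avoids resending the source on every check.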
The rule, not the algorithm, changes. I'd encode tier in the JWT so the gateway reads it without a DB hit, and select the rule by tier at check time. Systems layer multiple rules anyway — per-user, per-IP, per-endpoint, global — and enforce the most restrictive. Premium is just a different per-user rule resolved from the token.
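A sketch of that layering: the per-user rule is resolved from the token's tier claim, and the request passes only if the most restrictive applicable rule still has headroom. The tier names, limits, and the `tier` claim are hypothetical, and the claims are assumed to come from an already-verified JWT.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Rule:
    scope: str  # "user", "ip", "endpoint", or "global"
    limit: int  # requests allowed per window


TIER_USER_LIMITS = {"free": 100, "premium": 1000}  # hypothetical tiers


def user_rule(claims: dict) -> Rule:
    """Pick the per-user rule from the tier claim, defaulting to free."""
    tier = claims.get("tier", "free")
    return Rule("user", TIER_USER_LIMITS.get(tier, TIER_USER_LIMITS["free"]))


def binding_rule(rules: List[Rule], usage: Dict[str, int]) -> Rule:
    """Return the rule with the least remaining headroom."""
    return min(rules, key=lambda r: r.limit - usage.get(r.scope, 0))


def is_allowed(rules: List[Rule], usage: Dict[str, int]) -> bool:
    """Allow only if even the most restrictive rule has headroom left."""
    rule = binding_rule(rules, usage)
    return rule.limit - usage.get(rule.scope, 0) > 0


rules = [user_rule({"sub": "user:42", "tier": "premium"}),
         Rule("ip", 5000),
         Rule("endpoint", 50)]
usage = {"user": 200, "ip": 300, "endpoint": 50}
blocked_by = binding_rule(rules, usage).scope  # the endpoint rule is exhausted
```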
Rules live in a config store, not in code. Default: gateways poll every ~30s — simple, good enough for launches and tier changes, but up to 30s lag. If we need emergency-fast clamps during an attack, push-based via ZooKeeper or Redis pub/sub notifies gateways in seconds, at the cost of handling partial-update failures. I'd start with poll and add push only if seconds matter.
Allow/reject rates per rule, 429 volume by client and endpoint, Redis health and replication lag per shard, check latency against the 10ms SLO, and a hard alert on entering any failure mode. The signal I care most about: a sudden reject-rate spike on a rule we didn't change usually means an attack or a bad client, not a config error.
Expected — NATs and public WiFi share an IP. That's why IP-based limits should be set generously and treated as a coarse safety net, with the real enforcement on authenticated user IDs wherever we have them. I'd design for shared IPs up front rather than patch hot keys after the fact.
A clean design with one of these undercurrents still scores below the bar at senior+. These are the patterns interviewers write in the "concerns" box. None are about getting an answer wrong — they're about how you operate.
Jumping to architecture without bounding the problem or confirming scale. Reads as template-matching, not problem-solving.
"It depends" with no decision behind it. Nuance is good; evasion isn't. Name the trade-off, then pick.
Stalling at the race condition or the sharding scheme until the interviewer feeds you the answer. Depth is the senior+ bar.
No observability, no rollout, no failure-mode plan. In 2026 this reads as "has never carried a pager."
Confident wrong answers when pushed. Far worse than honest "here's what I'd verify."
Trying to cover everything, going deep on nothing. Strong candidates notice sprawl and deliberately narrow.
Abandoning a sound decision the moment the interviewer raises an eyebrow. They're testing conviction, not always disagreeing.
Waiting to be asked the next question. At staff you own the 45 minutes; the interviewer should rarely need to steer.
Run a mock and score yourself honestly against the dimensions the interviewer uses. If you can't hit "strong" on depth and operability for the rate limiter — the easiest of the classic questions — that's your signal on where to drill before the loop.
| Dimension | Weak (downlevel) | Strong (at level) |
|---|---|---|
| Scoping | Started drawing; assumed scale. | Bounded it in <6 min; confirmed the 1M/s number; stated assumptions aloud. |
| Algorithm choice | Named one option, no trade-off. | Surveyed the field, committed to token bucket, named the alternative and when to switch. |
| The race condition | Needed a hint; stopped at MULTI/EXEC. | Raised it unprompted; fixed it with an atomic Lua script and explained why. |
| Failure mode | "Use replicas" and moved on. | Chose fail-closed for this platform with reasoning, named the cost, and the payments counter-case. |
| Operability | Never mentioned it. | Volunteered observability, canary rollout, kill-switch, and the on-call runbook. |
| Driving | Answered questions reactively. | Led the conversation toward the deep dives that matter; recapped trade-offs at the end. |