Staff-Level System Design Playbook — Designing a Rate Limiter

01 How you're actually graded

The rubric is six buckets — and judgment outweighs the diagram.

Every FAANG company runs a rubric. The dimensions are roughly the same; the weights differ by company and level. The single biggest mistake at senior+ is treating this as a drawing exercise. The boxes-and-arrows are table stakes. What gets graded hardest is the quality of your decisions — the questions you asked first, the trade-offs you surfaced and defended, the moments you committed instead of hedging, and the operational reality you volunteered without being asked.

Dimension	Weight	What earns the signal
Requirements & scoping	10–15%	You scoped before drawing. Asked enough to bound the problem, not so many that it reads as stalling or template-following. You pinned the scale number and stated assumptions out loud.
High-level architecture	20–25%	The right components, a clear data flow, and a reason every box exists. No orphan services. The design satisfies each functional requirement you wrote down.
Technical depth / deep dives	~30%	You can go three questions deep on the hard part — the race condition, the sharding scheme, the failover mode — without being rescued. This is where staff is won or lost.
Trade-offs & judgment	highest effective	You propose at least two viable options, name what each one costs, and commit to one for this system with a reason. Simplicity chosen over flash when flash isn't warranted.
Communication / driving	cross-cutting	You drive the 45 minutes; the interviewer never has to rescue you. You narrate, checkpoint, and notice when the design is sprawling and deliberately narrow it.
Operational maturity	↑ in 2026	The newest weight. Observability, rollout/canary, on-call reality, failure modes — volunteered, not pried out. Skipping it now reads as "has never been on call," a senior red flag.

The 2026 shift, in one line. Companies got tired of hiring engineers who designed clean systems they couldn't operate. If you last interviewed before 2022, the framework feels familiar but the graded rubric moved: operational concerns are now a first-class dimension, and "it depends" without a committed answer reads as evasion rather than nuance.

02 The same answer is scored differently at each level

It's a sliding scale, not a pass/fail bar.

A solid high-level design with reasonable trade-offs is a strong score for a new grad — and a downlevel flag for a staff candidate. The questions can be identical; the depth expectation is not. The clearest way to think about it is the ratio of breadth (covering the whole system) to depth (going deep on the hard parts). As you climb, the balance tips toward depth, proactivity, and production reality.

Mid-level

Meta E4 · Google L4 · Amazon SDE-II


Drives requirements and basic algorithm choice; explains one algorithm cleanly (Token Bucket is fine).
Places the limiter sensibly (API gateway) and names Redis as shared state.
When prompted about scale, recognizes Redis must be sharded — rough understanding is OK.
Interviewer confirms understanding of each component; not expected to spot every flaw alone.

Senior

Meta E5 · Google L5 · Amazon SDE-III


Moves quickly past basic algorithm talk to distributed-systems challenges.
Confidently discusses consistent hashing, Redis Cluster, connection pooling unprompted.
Knows operations must be atomic; reaches for a Lua script / transaction without a hint.
Proactively raises hot keys, fail-open vs fail-closed, and latency optimization. Has opinions on config management.

Staff+

Meta E6 · Google L6 · Amazon Principal


Establishes fundamentals fast, then spends most of the time on production operations and failure modes.
Strong, experience-backed opinions on technology choices; talks multi-region and cross-region consistency naturally.
Treats canary rollout, gradual config propagation, and observability as routine, not advanced topics.
Exceptional proactivity: surfaces edge cases and operational procedures with no prompting. Frames decisions in terms of organizational and cross-team impact.

Staff-specific reality at Meta. At staff level and above, both system-design rounds must pass; you can't coast on one. A staff candidate giving a senior-quality answer is the single most common reason for being downleveled. The fix is not more breadth; it's more depth and visible ownership.

03 The lens senior engineers narrate through

Borrow AWS's Well-Architected pillars as your trade-off vocabulary.

You don't need to recite AWS, but the six Well-Architected pillars are a battle-tested checklist for sounding senior. When you justify a decision, anchor it to one of these. It signals that you evaluate systems the way an architect does — across competing concerns — rather than optimizing a single axis. Below, each pillar is mapped directly onto a move you can make in the rate-limiter design.

PILLAR 01

Operational Excellence

Run, monitor, and continuously improve. Operations as code; observability built in.

Hook: "I'd emit per-rule allow/reject rates and Redis health, and alert the moment we trip into fail-open."

PILLAR 02

Security

Protect data and systems; least privilege; defense in depth.

Hook: "The limiter is a security control — it's our first line against credential-stuffing and abusive scraping, layered behind DDoS protection at the edge."

PILLAR 03

Reliability

Recover from failure, scale horizontally, design for the partition.

Hook: "Each Redis shard gets replicas with automatic failover; I'll pick a failure mode deliberately because the limiter often fails exactly when traffic spikes."

PILLAR 04

Performance Efficiency

Use resources efficiently; keep the hot path cheap.

Hook: "Sub-10ms budget per check means connection pooling and a single atomic round trip — not a read then a write."

PILLAR 05

Cost Optimization

Stop paying for what you don't need; right-size.

Hook: "Sliding-window-log gives perfect accuracy but stores every timestamp — at 100M users that memory cost isn't worth the marginal precision."

PILLAR 06

Sustainability

Minimize the footprint of the workload over time.

Hook: "TTL-expiring idle buckets keeps the working set small, which is also the cheaper and greener choice at this scale."

How to use it without sounding like a checklist. Don't list the pillars. Weave one in when you make a call: "I'll fail closed here — I'm trading availability for reliability of the platform as a whole, because failing open during a spike turns a limiter outage into a total outage." That single sentence names a pillar, a trade-off, and a committed decision.

03·5 The architecture you draw on the whiteboard

One atomic counter decides allow, throttle, or reject.

A rate limiter sits on the hot path and answers one question per request: has this key exceeded its budget? The whole design hinges on a single, atomic, shared counter — usually a token bucket in Redis updated with one Lua script — so that distributed gateway nodes never disagree about the count.

Diagram: Request → check → Allow / Reject

One counter, atomically. The gateway middleware checks a per-key bucket in Redis with one atomic Lua script (read the bucket, refill, decrement, set a TTL); under the limit it forwards upstream, over it returns 429. Say it: “the token-bucket counter must be atomic and shared, or distributed gateways will disagree.”

How to narrate it in the room

Pick the algorithm. “Token bucket: each key has tokens that refill at a fixed rate — it smooths bursts and is cheap. Sliding-window log / counter are the alternatives I’d compare.”
Centralize the counter. “Counts live in Redis so every gateway node sees the same state. The check is one atomic Lua script that reads the bucket, refills it, decrements, and sets a TTL, so concurrent gateways can't race.”
Decide and act fast. “Under the limit, forward upstream; over it, return 429 with Retry-After. The limiter is on the hot path, so the check has to stay inside the <10ms overhead budget.”
Plan for Redis trouble. “If the limiter store is down, I pick the failure mode deliberately: failing open keeps the API up but drops protection, failing closed keeps protection but rejects traffic. Local token buckets synced periodically can cut the round-trip, at the cost of stale counts.”

04 The interview, minute by minute

Five phases. Drive every one of them.

This is the simulation. For each phase you get a time budget, what the interviewer is grading in that window, the lines that earn the signal, the ones that lose it, and a scripted exchange for the rate-limiter problem specifically. The framing throughout: the rate limiter for a social platform's API — 1M requests/second across 100M daily active users, with a <10ms overhead budget per check.

01 Requirements & Scoping · ~6 min · don't draw yet

Grading this window: Can you organize the problem space, reduce ambiguity, and bound scope before designing? Asking too few questions reads as memorizing a template; asking too many reads as stalling. Five to ten focused minutes is the senior sweet spot.

Separate functional requirements (what it does) from non-functional (how well). Then nail the one number that changes every downstream decision: scale. Write them where the interviewer can see them — this is your contract for the rest of the hour.

Functional requirements to land

Identify clients by user ID, IP, or API key so different limits apply to different identities.
Limit requests against configurable rules — e.g. 100 req/min per user, with the ability to layer per-IP, per-endpoint, and global limits.
On exceeding a limit, reject with HTTP 429 plus helpful headers (remaining, reset, retry-after).

Non-functional requirements to land

Low latency: the check adds <10ms; it sits on every request's critical path.
High availability with eventual consistency tolerated — small cross-node drift in enforcement is acceptable; strong global consistency is explicitly out of scope.
Scale: 1M req/s across 100M DAU. Get this number from the interviewer; don't assume.

▲ Allow — say this

"Before I design anything — what scale are we targeting? A startup doing thousands of requests a day and a platform doing a million a second are completely different machines. I'll assume the large-platform case unless you'd rather scope it down."

▲ Allow — say this

"I'll treat strong consistency as out of scope. For rate limiting, briefly over-admitting a handful of requests during a partition is fine; rejecting legitimate traffic to stay perfectly consistent is worse. So I'm optimizing for availability and low latency."

◆ Throttle — only with a reason

Listing ten clarifying questions in a row. One or two sharp ones that visibly change the design is senior; a checklist recited from memory is the opposite signal. Ask, then use the answer.

▼ Reject — never say this

"Let me just start drawing the architecture." Jumping to boxes before scoping is the most common junior tell. So is silently assuming the scale instead of confirming it.

Scripted exchange

Interviewer

Design a rate limiter.

You

Let me scope it first. Are we rate-limiting individual HTTP requests at the edge, or higher-level business actions? I'll assume request-level, server-side — clients can't be trusted to self-regulate. And what scale: are we protecting a public API at, say, a million requests a second, or something smaller?

Interviewer

Million a second. Social platform.

You

Good. Then my three functional requirements are: identify the client by user / IP / key, enforce configurable rules, and return 429 with retry headers. Non-functionally: under 10ms overhead, highly available, eventual consistency acceptable. Strong consistency I'm putting below the line. Sound right before I move on?

02 Entities, Interface & Estimation · ~5 min

Grading this window: Do you model the domain cleanly and define a crisp boundary? At senior+, being clear about where the API boundary lives and why is a strong signal — especially if you tie it to how the system will evolve.

Three entities carry this design: Rules (the policies: requests per window, which clients, which endpoints), Clients (the identities being limited, with their current usage state), and Requests (the carriers of the context that selects which rules apply). The interface is deliberately tiny:

isRequestAllowed(clientId, ruleId) → { passes: bool, remaining: int, resetTime: ts }

That return shape isn't incidental — remaining and resetTime are exactly what populate your X-RateLimit-* response headers. Calling that out shows you're designing the contract and the client experience together.
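
A minimal sketch of that contract in Python, to make the link between the return shape and the response headers concrete. The names RateLimitResult and rate_limit_headers are illustrative rather than taken from any library, and the rule's limit and the current time are assumed to be values the gateway already has on hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateLimitResult:
    passes: bool      # True if the request may proceed
    remaining: int    # requests left in the current budget
    reset_time: int   # Unix timestamp (seconds) when the budget resets


def rate_limit_headers(result: RateLimitResult, limit: int, now: int) -> dict[str, str]:
    """Translate a check result into X-RateLimit-* response headers."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(result.remaining, 0)),
        "X-RateLimit-Reset": str(result.reset_time),
    }
    if not result.passes:
        # Retry-After is seconds from now until the reset, never negative.
        headers["Retry-After"] = str(max(result.reset_time - now, 0))
    return headers
```

Because callers only ever see RateLimitResult, the algorithm behind the check can change without touching the code that writes headers.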

Back-of-the-envelope that matters

The one estimate that drives the architecture: a single Redis instance handles roughly 100K–200K simple ops/sec, and each check is at minimum a read plus a write. So one instance realistically sustains ~50K–100K checks/sec. At 1M req/s you therefore need on the order of 10 to 20 shards, before any headroom. State that math out loud: it's the bridge into your deep dive on scaling.
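
The arithmetic is small enough to check in a few lines. This sketch uses only the figures quoted above (100K–200K ops/sec per instance, two ops per check, 1M checks/sec); the function name shards_needed is illustrative, and it deliberately ignores headroom and replicas.

```python
import math


def shards_needed(target_checks_per_sec: int, redis_ops_per_sec: int,
                  ops_per_check: int = 2) -> int:
    """Shards required if every check costs `ops_per_check` Redis operations."""
    checks_per_instance = redis_ops_per_sec // ops_per_check
    return math.ceil(target_checks_per_sec / checks_per_instance)


optimistic = shards_needed(1_000_000, 200_000)   # 100K checks/sec per instance
pessimistic = shards_needed(1_000_000, 100_000)  # 50K checks/sec per instance
print(optimistic, pessimistic)  # prints "10 20"
```

The spread between the two results is the point worth saying aloud: the shard count is a range driven by per-instance throughput, not a single number.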

▲ Allow — say this

"I want the limiter to expose one synchronous call that returns allow/deny plus the remaining count and reset time, because those values are what the gateway needs to write the response headers. Keeping the interface this thin means I can change the algorithm behind it without touching callers."

▼ Reject — never say this

Spending five minutes on a full relational schema. A rate limiter's "data model" is a key and a couple of counters — over-modeling it signals you've pattern-matched to the wrong kind of problem.

03 High-Level Design (the MVP) · ~13 min

Grading this window: The right components, a justified data flow, and one clearly-reasoned algorithm choice. You're not implementing the algorithm — you're showing you know the options and can defend a pick. Build the simplest thing that satisfies the requirements, then evolve it.

Decision 1 — Where does the limiter live?

Walk the three placements and commit. In-process (in each app server) is fastest but each server sees only its slice of traffic, so global limits become "off by a factor of N." A dedicated service gives global state and rich context but adds a network hop to every request and a new failure point. The API gateway / edge is the standard production answer: every request hits it first, bad traffic is turned away at the door before it touches your app servers. Commit to the gateway and name the trade-off you're accepting — limited business context, since the gateway only sees the HTTP request.

Decision 2 — Which algorithm?

Show you know the field, then choose. The honest interview line is "it depends on the use case" — followed immediately by a committed pick, never left hanging.

Algorithm	Bursts	Memory	Accuracy	One-line verdict
Fixed window	Yes, at edges	Low	Low	Simplest; the boundary lets a user double their limit across the window seam.
Sliding window log	No	High	Perfect	Exact, but stores every timestamp — too costly at 100M users.
Sliding window counter	Partial	Low	High	Two counters + weighting math; great default for an external API.
Token bucket	Yes	Low	High	Handles sustained rate and bursts; what Stripe and AWS use.
Leaky bucket	No (smooths)	Medium	Perfect rate	FIFO queue, constant outflow; good for shaping, not inbound API limits.
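
The "two counters + weighting math" in the sliding window counter row can be sketched as follows. This is a single-process illustration with an injected clock, not a distributed implementation, and the class and field names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class SlidingWindowCounter:
    limit: int                 # max requests per window
    window: float              # window length in seconds
    window_start: float = 0.0  # start of the current fixed window
    current: int = 0           # requests counted in the current window
    previous: int = 0          # requests counted in the window before it

    def allow(self, now: float) -> bool:
        self._roll(now)
        # Fraction of the current window that has already elapsed.
        elapsed = (now - self.window_start) / self.window
        # Weight the previous window by how much of it still overlaps
        # the sliding window that ends at `now`.
        estimated = self.current + self.previous * (1.0 - elapsed)
        if estimated >= self.limit:
            return False
        self.current += 1
        return True

    def _roll(self, now: float) -> None:
        windows_passed = int((now - self.window_start) // self.window)
        if windows_passed == 1:
            self.previous, self.current = self.current, 0
        elif windows_passed > 1:
            self.previous, self.current = 0, 0
        if windows_passed >= 1:
            self.window_start += windows_passed * self.window
```

With a limit of 100 per 60 seconds, a client that spends all 100 requests at t=59 gets only 2 more at t=61, where a fixed window would have admitted another 100 across the seam.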

For this system, commit to token bucket: it accommodates the bursty nature of real API traffic while still enforcing an average rate, it's cheap (track tokens + last_refill per client), and it's what large API providers actually run. If pressed on the burst-vs-precision trade-off, name sliding-window-counter as the alternative you'd reach for if even bursts had to be tightly bounded.
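
A minimal token bucket sketch, assuming an injected clock and in-memory state. It shows the two values the text says you track per client (tokens and last_refill); the class name, the full constructor, and the cost parameter are illustrative additions.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float      # burst allowance: max tokens the bucket holds
    refill_rate: float   # sustained rate: tokens added per second
    tokens: float        # tokens currently available
    last_refill: float   # timestamp of the last refill, in seconds

    @classmethod
    def full(cls, capacity: float, refill_rate: float, now: float) -> "TokenBucket":
        return cls(capacity, refill_rate, tokens=capacity, last_refill=now)

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Credit tokens for elapsed time, capped at capacity. A clock that
        # moves backwards is ignored rather than granting extra tokens.
        if now > self.last_refill:
            earned = (now - self.last_refill) * self.refill_rate
            self.tokens = min(self.capacity, self.tokens + earned)
            self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Capacity sets how large a burst is tolerated and refill_rate sets the long-run average, which is why the two knobs map cleanly onto "burst allowance" and "sustained rate" when you explain the choice.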

Decision 3 — Where does the state live?

Token buckets must be shared across all gateway instances, or you're back to the in-process problem. Use Redis as the central, sub-millisecond store. Flow: gateway reads the bucket (tokens, last_refill), computes refill from elapsed time, decrements if a token is available, writes back, and sets a TTL so idle buckets self-clean.

The trap door every interviewer opens here. A naive read-then-write has a race: two gateways read 99 (limit 100) at the same instant, both allow, both write 100 — the user just made 101. Wrapping the writes in MULTI/EXEC isn't enough, because the read happened outside the transaction. The correct answer is to make the entire read-modify-write atomic with a Lua script, which Redis runs as one indivisible operation. Reaching for this before being prompted is a senior signal; reaching for it only after the interviewer points at the race is a mid signal.
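
The lost update is easy to reproduce deterministically. The sketch below uses a plain counter to match the 99/100 example rather than a full token bucket, and an in-memory lock stands in for Redis's script atomicity. The Lua text is illustrative and is not executed here; the key layout and argument order are assumptions.

```python
import threading

LIMIT = 100

# Illustrative Lua for a simple per-key counter; not executed in this sketch.
# Redis runs a script as one indivisible step, so the GET cannot go stale
# before the INCR. KEYS[1] = counter key, ARGV[1] = limit, ARGV[2] = TTL (s).
CHECK_LUA = """
local count = tonumber(redis.call('GET', KEYS[1]) or '0')
if count >= tonumber(ARGV[1]) then return 0 end
count = redis.call('INCR', KEYS[1])
if count == 1 then redis.call('EXPIRE', KEYS[1], ARGV[2]) end
return 1
"""


def racy_write(store: dict, key: str, seen: int) -> bool:
    """Decide from a value read earlier, then write it back unguarded."""
    if seen >= LIMIT:
        return False
    store[key] = seen + 1  # clobbers any write that landed after our read
    return True


class AtomicCounter:
    """In-memory stand-in for the script: read, decide, write under one lock."""

    def __init__(self, start: int = 0) -> None:
        self._lock = threading.Lock()
        self.count = start

    def check(self) -> bool:
        with self._lock:
            if self.count >= LIMIT:
                return False
            self.count += 1
            return True


# Lost update: both gateways read 99 before either one writes.
store = {"user:1": 99}
read_a, read_b = store["user:1"], store["user:1"]
admitted = [racy_write(store, "user:1", read_a), racy_write(store, "user:1", read_b)]
# Both were admitted, yet the stored count is 100: request 101 slipped through.

# Atomic version: the second check observes the first one's write.
counter = AtomicCounter(start=99)
results = [counter.check(), counter.check()]  # [True, False]
```

The difference between the two halves is only where the read happens: outside the guarded step in the racy version, inside it in the atomic one.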

Decision 4 — What happens on rejection?

Fail fast. Return 429 immediately rather than queuing excess requests — queuing burns memory, makes latency unpredictable, and invites retry storms when users assume failure. Pair the 429 with headers so well-behaved clients can back off intelligently:

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1640995200
Retry-After: 60

▲ Allow — say this

"I'll put the limiter at the API gateway. It's the bouncer at the door — abusive traffic gets a 429 before it ever reaches an app server. The cost I'm accepting is that the gateway only sees the HTTP request, so rule logic that needs deep user context has to come from the token or a sidecar lookup."

▲ Allow — say this

"There's a race in the read-modify-write across gateways. I'll close it by doing the whole check inside a Redis Lua script, which executes atomically — that expands the atomic boundary to cover the read, not just the write."

◆ Throttle — only with a reason

"Let's queue the rejected requests." Defensible only for batch/async systems where latency doesn't matter — and you must say so. For an interactive API it's the wrong call; name why before dismissing it.

▼ Reject — never say this

"We'll just use Redis and increment a counter." Hand-waving the atomicity is the exact gap interviewers probe. "Just use Redis" only survives until the follow-up about 50 servers racing on the same key.

04 Deep Dives (the stress test) · ~15 min · where staff is decided

Grading this window: Can you go deep without rescue? Lead the conversation toward the deep dives that satisfy your non-functional requirements, but stay flexible — the interviewer will probe. At staff, you should be volunteering these, not waiting to be asked. This is 30%+ of the score and the difference between senior and staff.

Scaling to 1M req/s

One Redis instance can't carry the load (you established this in estimation). Shard the rate-limit data, but shard consistently: every request for a given client must hit the same shard, or its state fragments and the limit becomes meaningless. Hash the identifier (user ID for authenticated traffic, IP for anonymous, API key for developer traffic) with consistent hashing so re-sharding doesn't reshuffle everything. In production you'd lean on Redis Cluster, which maps keys onto 16,384 hash slots and routes them for you, rather than hand-rolling the routing. Ten shards at ~100K checks/sec each only just covers the target, so provision toward the upper end of the estimate to leave headroom.
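
To show why consistent hashing keeps re-sharding cheap, here is a small hash ring with virtual nodes. It is a sketch of the general technique, not of Redis Cluster's slot mechanism; the shard names, the vnode count, and the choice of SHA-256 for ring positions are all assumptions.

```python
import bisect
import hashlib


def _point(value: str) -> int:
    # Stable 64-bit ring position (hash used for spread, not security).
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, shards: list[str], vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._points: list[int] = []       # sorted ring positions
        self._owner: dict[int, str] = {}   # ring position -> shard name
        for shard in shards:
            self.add(shard)

    def add(self, shard: str) -> None:
        # Each shard owns many small arcs so load spreads evenly.
        for i in range(self._vnodes):
            p = _point(f"{shard}#{i}")
            bisect.insort(self._points, p)
            self._owner[p] = shard

    def shard_for(self, client_key: str) -> str:
        # First ring position clockwise from the key's hash, wrapping around.
        i = bisect.bisect_right(self._points, _point(client_key))
        return self._owner[self._points[i % len(self._points)]]
```

Going from 10 to 11 shards should move roughly one key in eleven, and every key that moves lands on the new shard; a plain hash(key) % N scheme would instead remap most keys and reset their counters.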

High availability & the failure-mode decision

Each shard is now critical. Give every shard replicas with automatic failover (Redis Cluster does this). Then make the deliberate call interviewers want to hear reasoned, not guessed: when the limiter can't reach Redis, do you fail open (allow everything, stay available, lose protection) or fail closed (reject everything, stay protected, take the API offline)?

The senior move is to recognize there's no universal answer and to reason from this system. For a social platform, fail closed: limiter failures tend to coincide with traffic spikes, and failing open in that moment floods your databases — turning a limiter outage into a full platform outage. For a payments system you might argue the opposite. State your reasoning, commit, and acknowledge the cost.
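
One way to make that decision explicit in code is to wrap the check so the failure mode is a named setting rather than an accident of exception handling. This is a sketch: StoreUnavailable is a hypothetical error type, and on_degraded stands in for whatever metric or alert hook the gateway has.

```python
from enum import Enum
from typing import Callable


class FailureMode(Enum):
    OPEN = "open"      # store unreachable: allow traffic, lose protection
    CLOSED = "closed"  # store unreachable: reject traffic, stay protected


class StoreUnavailable(Exception):
    """Hypothetical error raised when the limiter cannot reach its store."""


def guarded_check(check: Callable[[str], bool], client_id: str,
                  mode: FailureMode,
                  on_degraded: Callable[[FailureMode], None]) -> bool:
    """Run the normal check; fall back to the configured mode on store failure."""
    try:
        return check(client_id)
    except StoreUnavailable:
        on_degraded(mode)  # record that we are enforcing a fallback decision
        return mode is FailureMode.OPEN
```

Keeping the mode as a parameter is what makes switching between fail-open and fail-closed a configuration change rather than a code change.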

Latency under budget

Biggest win is connection pooling: reuse persistent Redis connections instead of paying a fresh TCP handshake per check, which costs at least one extra network round trip (and tens of milliseconds if the connection crosses regions). Next is geographic distribution: co-locate gateways and Redis by region so a Tokyo user isn't crossing an ocean for a counter check, accepting cross-region eventual consistency in exchange. Mention that pipelining and local caching exist, but note you'd skip them unless asked: a stale local cache causes incorrect decisions, and the complexity usually isn't worth it once pooling and locality are handled.

Hot keys (the viral / abuse case)

A single client generating tens of thousands of req/s can overwhelm one shard. Split the response by intent. For legitimate high-volume clients: encourage client-side rate limiting, offer batching, and provision premium tiers. For abuse: auto-blocklist a client that trips limits repeatedly (store the list in a shard, check on cache miss) and lean on edge DDoS protection (Cloudflare / AWS Shield) before traffic reaches the limiter. Note up front that corporate NATs and public WiFi share IPs, so set IP limits generously and prefer authenticated-user limits.

Dynamic rule configuration

Real systems change limits without a deploy — launch boosts, premium tiers, emergency clamp-downs. Poll-based (gateways poll a config store every ~30s) is simple and covers most cases; its cost is up-to-30s propagation lag, painful during an attack. Push-based (ZooKeeper or Redis pub/sub notifies gateways instantly) updates in seconds but adds real complexity around partial failures and fallback. Recommend poll as the default, push only when seconds-fast updates are a genuine requirement.
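
The poll-based default can be sketched as a local cache that reloads at most once per interval. The loader and clock are injected so the behavior is testable; PolledRules and its method names are illustrative, and the rule store is assumed to return a simple mapping of rule ID to limit.

```python
from typing import Callable, Dict


class PolledRules:
    """Caches rules locally and reloads them at most once per interval."""

    def __init__(self, load: Callable[[], Dict[str, int]],
                 clock: Callable[[], float], interval: float = 30.0) -> None:
        self._load = load
        self._clock = clock
        self._interval = interval
        self._rules = load()
        self._loaded_at = clock()

    def limit_for(self, rule_id: str) -> int:
        now = self._clock()
        if now - self._loaded_at >= self._interval:
            try:
                self._rules = self._load()
            except Exception:
                pass  # keep serving the last good rules if the store is down
            self._loaded_at = now  # next attempt happens one interval later
        return self._rules[rule_id]
```

The propagation lag is visible in the code: a change made just after a reload is not seen until the next interval elapses, which is the up-to-30s cost named above.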

▲ Allow — say this

"I'll fail closed for this platform. It feels counterintuitive against an availability goal, but limiter failures and traffic spikes happen together — failing open at that moment sends the full flood downstream and collapses the backend. Brief rejected requests beat a cascading outage. For a payments API I'd argue the other way."

▲ Allow — say this (staff move)

"Beyond picking a failure mode, the better answer is not failing — so: replicas per shard with automatic promotion, and I'd roll any limit change out as a canary to a small slice of traffic first, watching reject-rate before going fleet-wide."

◆ Throttle — only with a reason

Local in-gateway caching of counts. It cuts latency but risks stale decisions; only propose it with the staleness trade-off named and bounded, and say you'd skip it unless the latency budget forced your hand.

▼ Reject — never say this

"We'll just add more Redis servers." Adding capacity without a partitioning scheme splits each client's state and breaks the limit. The follow-up — "how does a request find the right shard?" — is exactly where this answer falls apart.

Scripted stress-test exchange

Interviewer

A Redis shard dies mid-spike. What happens to the users on it?

You

Two layers. First, prevention: that shard has replicas, so Cluster promotes one automatically — the outage window is short. Second, if we genuinely can't reach it, we hit our failure mode. I chose fail-closed for this platform, so those users get 429s briefly. That's deliberate — the alternative, fail-open during a spike, floods the backend and takes everyone down, not just one shard's users.

Interviewer

Isn't rejecting valid users bad?

You

It is — it's a real cost, and I'd surface it on a dashboard and alert on entering fail-closed. But it's bounded and recoverable. A cascading database failure isn't. I'm trading a small, visible degradation for platform-wide reliability. If this were payments I'd weigh it differently, and if the business wanted availability over protection here, fail-open is a one-line config flip — I'd want that decision made explicitly, not by accident.

05 Wrap-up (operability & the trade-off recap) · ~6 min

Grading this window: The newest-weighted dimension. You close by proving you could run this thing, not just build it. Volunteer observability and rollout; recap your key trade-offs; name what you'd do with more time. This is what separates "designed a system" from "owns systems in production."

Observability — volunteer it

Per-rule allow / reject rates and 429 volume, sliced by client type and endpoint.
Redis health: CPU, memory, op latency, replication lag, per shard.
A loud alert the instant the system enters fail-open or fail-closed mode — you never want to discover that from users.
Check-latency SLO tracking against the 10ms budget.

Rollout & on-call reality

Limit changes go out as canaries — a small traffic slice first, watch reject-rate, then widen. Keep a fast kill-switch to disable enforcement (fail-open by choice) if a bad rule starts rejecting good traffic. Document the runbook: what an on-call engineer does when reject-rate spikes or a shard is unhealthy.
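
A canary slice is usually chosen by hashing the client into a stable bucket, so the same clients stay in the canary as the percentage widens. The sketch below assumes that approach; in_canary, effective_limit, and the rule_version label are hypothetical names, and returning None to mean "enforcement disabled" is a choice made for the example.

```python
import hashlib
from typing import Optional


def in_canary(client_id: str, rule_version: str, percent: float) -> bool:
    """Deterministically place a client in one of 10,000 buckets."""
    digest = hashlib.sha256(f"{rule_version}:{client_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000
    # Clients in the lowest buckets see the new rule first.
    return bucket < percent * 100


def effective_limit(client_id: str, old_limit: int, new_limit: int,
                    rule_version: str, percent: float,
                    kill_switch: bool = False) -> Optional[int]:
    """Pick which limit applies to this client during a rollout."""
    if kill_switch:
        return None  # enforcement disabled: fail-open by choice
    if in_canary(client_id, rule_version, percent):
        return new_limit
    return old_limit
```

Because the bucket depends only on the client and the rule version, raising the percentage from 5 to 50 adds clients to the canary without removing any, which keeps the reject-rate comparison clean while you widen.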

The 60-second recap that lands the level

Quick recap: limiter at the gateway so bad traffic dies at the door; token bucket for burst tolerance at low memory cost; Redis with a Lua script for atomic, race-free checks; consistent-hashed shards with replicas for scale and availability; fail-closed for this platform, deliberately, with a config escape hatch; and full observability with canary rollout so we can operate it. If I had more time I'd dig into multi-region consistency and a per-tenant cost model.

▲ Allow — say this

"Before we wrap — let me name what I'd do with more time, so it's clear I know what I deferred: multi-region active-active with the cross-region drift that implies, and a cost model per shard. I scoped those out, I didn't miss them."

▼ Reject — never say this

Going silent and waiting for the interviewer to say "we're done." Letting the clock run out without a recap or operability story wastes the highest-leverage minutes you have left.

05 The follow-up gauntlet

The probes you'll get — and the answer that holds.

Interviewers rarely accept the first answer. They push to find the edge of your knowledge, watch how you handle pressure, and check whether you commit or crumble. Here are the probes that come up on this question, with the response that keeps the signal green. The meta-skill: answer the question asked, give the reasoning, then stop — don't over-talk and don't reverse a correct call just because they pushed.

"Why token bucket over sliding window counter?"

Both are good. Token bucket models bursts explicitly — bucket size is the burst allowance, refill rate is the sustained rate — which matches real API traffic, and it's what Stripe and AWS run. Sliding-window-counter is the better default when you need tightly bounded bursts and simpler reasoning in a distributed setting. I picked token bucket for burst friendliness; if the requirement were "never exceed N in any 60s window, ever," I'd switch.

"Two gateways read 99 at once. Walk me through it."

Classic lost-update. Both read 99, both decide allow, both write 100 — 101 requests slipped through. MULTI/EXEC doesn't fix it because the read is outside the transaction. The fix is a Lua script that does read-decide-write as one atomic Redis operation, so the second request sees the first's result. It's the "expand the atomic boundary" pattern.

"Premium users get 10× the limit. How?"

The rule, not the algorithm, changes. I'd encode tier in the JWT so the gateway reads it without a DB hit, and select the rule by tier at check time. Systems layer multiple rules anyway — per-user, per-IP, per-endpoint, global — and enforce the most restrictive. Premium is just a different per-user rule resolved from the token.
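
A sketch of that rule resolution, assuming the token's claims are already verified and parsed into a dict with sub and tier fields. The tier table, the per-IP and per-endpoint numbers, and the counter key format are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    limit: int  # allowed requests per window


# Hypothetical per-user limits keyed by the tier claim carried in the token.
PER_USER_BY_TIER = {"free": 100, "premium": 1000}


def applicable_rules(claims: dict, ip: str, endpoint: str) -> dict[str, Rule]:
    """Map each counter key to the rule that governs it for this request."""
    tier = claims.get("tier", "free")
    per_user = PER_USER_BY_TIER.get(tier, PER_USER_BY_TIER["free"])
    return {
        f"user:{claims['sub']}": Rule("per-user", per_user),
        f"ip:{ip}": Rule("per-ip", 5_000),                      # assumed value
        f"endpoint:{endpoint}": Rule("per-endpoint", 50_000),   # assumed value
    }


def is_allowed(usage: dict[str, int], rules: dict[str, Rule]) -> bool:
    # Most restrictive wins: the request passes only if every rule has room.
    return all(usage.get(key, 0) < rule.limit for key, rule in rules.items())
```

A premium user at 500 requests passes where a free user would not, yet the same premium user is still rejected if the shared IP counter is exhausted.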

"Change a limit instantly, no deploy. How?"

Rules live in a config store, not in code. Default: gateways poll every ~30s — simple, good enough for launches and tier changes, but up to 30s lag. If we need emergency-fast clamps during an attack, push-based via ZooKeeper or Redis pub/sub notifies gateways in seconds, at the cost of handling partial-update failures. I'd start with poll and add push only if seconds matter.

"How do you know it's working in prod?"

Allow/reject rates per rule, 429 volume by client and endpoint, Redis health and replication lag per shard, check latency against the 10ms SLO, and a hard alert on entering any failure mode. The signal I care most about: a sudden reject-rate spike on a rule we didn't change usually means an attack or a bad client, not a config error.

"Users behind one corporate NAT keep getting blocked."

Expected — NATs and public WiFi share an IP. That's why IP-based limits should be set generously and treated as a coarse safety net, with the real enforcement on authenticated user IDs wherever we have them. I'd design for shared IPs up front rather than patch hot keys after the fact.

How to handle a probe you can't fully answer. Don't bluff — bluffing is a fast downlevel. Say what you know, name the unknown precisely, and reason toward it: "I haven't run Cluster failover at this exact scale, but the failure modes I'd worry about are split-brain during a partition and replication lag on promotion — here's how I'd test for each." Honest reasoning under uncertainty reads as more senior than false confidence.

06 What gets you downleveled

The flags that quietly tank an otherwise solid loop.

A clean design with one of these undercurrents still scores below the bar at senior+. These are the patterns interviewers write in the "concerns" box. None are about getting an answer wrong — they're about how you operate.

Drawing before scoping

Jumping to architecture without bounding the problem or confirming scale. Reads as template-matching, not problem-solving.

Hedging without committing

"It depends" with no decision behind it. Nuance is good; evasion isn't. Name the trade-off, then pick.

Needing rescue on the hard part

Stalling at the race condition or the sharding scheme until the interviewer feeds you the answer. Depth is the senior+ bar.

Skipping operations entirely

No observability, no rollout, no failure-mode plan. In 2026 this reads as "has never carried a pager."

Bluffing under a probe

Confident wrong answers when pushed. Far worse than honest "here's what I'd verify."

Letting the design sprawl

Trying to cover everything, going deep on nothing. Strong candidates notice sprawl and deliberately narrow.

Reversing a correct call under pressure

Abandoning a sound decision the moment the interviewer raises an eyebrow. They're testing conviction, not always disagreeing.

Not driving

Waiting to be asked the next question. At staff you own the 45 minutes; the interviewer should rarely need to steer.

07 Your pre-loop scorecard

Self-grade before you walk in.

Run a mock and score yourself honestly against the dimensions the interviewer uses. If you can't hit "strong" on depth and operability for the rate limiter — the easiest of the classic questions — that's your signal on where to drill before the loop.

Dimension	Weak (downlevel)	Strong (at level)
Scoping	Started drawing; assumed scale.	Bounded it in <6 min; confirmed the 1M/s number; stated assumptions aloud.
Algorithm choice	Named one option, no trade-off.	Surveyed the field, committed to token bucket, named the alternative and when to switch.
The race condition	Needed a hint; stopped at MULTI/EXEC.	Raised it unprompted; fixed it with an atomic Lua script and explained why.
Failure mode	"Use replicas" and moved on.	Chose fail-closed for this platform with reasoning, named the cost, and the payments counter-case.
Operability	Never mentioned it.	Volunteered observability, canary rollout, kill-switch, and the on-call runbook.
Driving	Answered questions reactively.	Led the conversation toward the deep dives that matter; recapped trade-offs at the end.


The one mental model to carry in. Treat the interview itself as the system you're rate-limiting: 45 minutes is your bucket, every topic costs a token, and the interviewer is watching how you allocate. Spend tokens on requirements and depth; throttle yourself on schema and trivia; never let the clock drain into silence. Drive it, commit your calls, and prove you could operate what you drew.

