A firehose of click events must be counted, deduplicated, and aggregated in near-real-time — and these counts bill advertisers, so correctness has teeth. Think in pipelines, not request-response. The hard parts are the ones nobody mentions in the prompt: duplicate clicks, out-of-order and late events, and the trap of promising “exactly-once” across a distributed system when you should be promising something more honest.
Every FAANG company runs a rubric. The dimensions are roughly the same; the weights differ by company and level. At senior+ the boxes-and-arrows are table stakes — what gets graded hardest is the quality of your decisions: the questions you asked first, the trade-offs you surfaced and defended, and the production reality you volunteered without being asked.
| Dimension | Weight | What earns the signal |
|---|---|---|
| Requirements & scoping | 10–15% | You scoped before drawing, asked enough to bound the problem, pinned the scale number, and stated assumptions out loud. |
| High-level architecture | 20–25% | The right components, a clear data flow, and a reason every box exists. The design satisfies each functional requirement. |
| Technical depth / deep dives | ~30% | You go three questions deep on the hard part without being rescued. This is where staff is won or lost. |
| Trade-offs & judgment | Highest in practice | Two viable options, what each costs, and a committed pick for this system. Simplicity over flash when flash isn't warranted. |
| Communication / driving | cross-cutting | You drive the 45 minutes; the interviewer never has to rescue you. You narrate, checkpoint, and narrow when the design sprawls. |
| Operational maturity | Rising in 2026 | The newest weight: observability, rollout, failure modes, and on-call reality, all volunteered rather than pried out. |
A solid design with reasonable trade-offs is a strong score for a mid-level candidate and a downlevel flag for staff. The questions can be identical; the depth expectation is not. As you climb, the balance tips from breadth toward depth, proactivity, and production reality.
You don't recite the AWS Well-Architected pillars; you anchor each decision to one of them. It signals you evaluate systems across competing concerns rather than optimizing one axis. Each pillar below is mapped to a move you can make in this exact design.
Operational excellence: watch lag and drift.
Security: protect billing integrity.
Reliability: recover to the exact point.
Performance efficiency: right store for the query.
Cost optimization: hot and cold tiers.
Sustainability: bound window state.
The simulation. Framing: an ad click aggregator — millions of clicks/sec at peak, dashboards fresh within seconds to minutes, and authoritative daily billing that must be correct down to dedup. Clicks bill advertisers, so losing or double-counting is lost money or lost trust.
“There are two consumers of the same event log: a fast path for dashboards that can be approximate, and a slow path for billing that must be exact. I’ll design both and route queries by accuracy need.”
“I won’t promise global exactly-once across the whole distributed pipeline — that’s a trap. I’ll promise at-least-once ingestion plus idempotent, deduplicated sinks, which gives effective exactly-once results. That’s the honest guarantee.”
“We’ll just guarantee exactly-once everywhere.” Claiming global exactly-once across ingestion, processing, and storage signals you haven’t operated a real streaming system.
The core entity is the ClickEvent — and the most important field is a stable, unique id (an event_id / impressionId) that makes dedup possible:
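The schema itself is not reproduced in this text, so here is a minimal sketch of what such an event could look like. Every field name other than `event_id` is an illustrative assumption, and reusing the impression id as the event id is one possible choice, not a confirmed design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClickEvent:
    """One observed ad click. Field names are hypothetical."""
    event_id: str       # stable id fixed before the click is recorded; the dedup key
    ad_id: str
    campaign_id: str
    event_time_ms: int  # when the click happened (event time), epoch milliseconds


def capture_click(impression_id, ad_id, campaign_id, event_time_ms):
    """Build the event at the redirect service.

    Assumption: the impression id minted when the ad was served doubles as the
    event id, so a retry or a double-click on the same impression reuses it.
    """
    return ClickEvent(impression_id, ad_id, campaign_id, event_time_ms)
```

Because the id is fixed before the event reaches the log, every downstream stage can treat two records with the same `event_id` as the same click.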
Capturing via a server-side 302 redirect matters: the platform observes and records the click before sending the user on, so you can bill what you actually saw. The raw event log (Kafka) is the system of record — it buffers bursts, is replayable for backfills, and is the clean contract between ingestion and compute.
Millions of clicks/sec means any design that does a database write per click (incrementing a counter row) creates hot rows and write amplification. State that early — it rules out the naive counter and forces stream aggregation.
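A quick back-of-envelope makes the point concrete. The numbers below (2 million clicks/sec, 100,000 active campaigns, one-minute windows) are assumptions chosen for illustration, not figures from the prompt.

```python
def writes_per_second(clicks_per_sec, active_keys, window_sec):
    """Compare store writes/sec for a per-click counter vs. windowed aggregation."""
    per_click = clicks_per_sec               # one row update for every click
    per_window = active_keys / window_sec    # at most one upsert per key per window
    return per_click, per_window


# Assumed workload: 2M clicks/sec, 100k active campaigns, 60-second windows.
per_click, per_window = writes_per_second(2_000_000, 100_000, 60)
print(f"per-click writes/sec:  {per_click:,}")
print(f"per-window writes/sec: {per_window:,.0f}")
```

Under these assumptions the aggregated path writes about three orders of magnitude less, and no single row absorbs a viral campaign's full click rate.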
“Every event carries a stable id at capture time — that id is what lets me deduplicate retries and double-clicks downstream and bill exactly once per real click.”
Click → redirect/capture service appends the event to Kafka → a stream processor (Flink) reads the stream and aggregates in event-time windows → results land in a hot OLAP store (ClickHouse / Druid / Pinot) that serves dashboards with fast group-by queries. In parallel, raw events are dumped to a data lake (S3) in columnar format for the batch/billing path.
You could run a Kafka consumer with an in-memory running count — fine for a mid-level answer. Flink earns its place with event-time windowing (so out-of-order clicks land in the right minute), watermarks (knowing when a window is safe to close), exactly-once processing, and fault-tolerant state — all painful to build yourself.
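To see what event-time windowing and watermarks buy you, here is a plain-Python simulation of the idea. It is not the Flink API, and the 5-second out-of-orderness bound and the `TumblingCounter` name are assumptions for the sketch.

```python
from collections import defaultdict

WINDOW_MS = 60_000           # one-minute tumbling windows
MAX_OUT_OF_ORDER_MS = 5_000  # assumed bound on how far events arrive out of order


def window_start(event_time_ms):
    """Start of the one-minute window that owns this event time."""
    return event_time_ms - event_time_ms % WINDOW_MS


class TumblingCounter:
    def __init__(self):
        self.open_counts = defaultdict(int)  # (campaign_id, window_start) -> count
        self.watermark_ms = 0                # "no earlier event is still expected"
        self.late_events = 0                 # arrived after their window closed

    def on_event(self, campaign_id, event_time_ms):
        """Count one click; return windows the watermark has now closed."""
        start = window_start(event_time_ms)
        if start + WINDOW_MS <= self.watermark_ms:
            self.late_events += 1            # this sketch only counts late events
            return []
        self.open_counts[(campaign_id, start)] += 1
        # The watermark trails the largest event time seen by the disorder bound.
        self.watermark_ms = max(self.watermark_ms,
                                event_time_ms - MAX_OUT_OF_ORDER_MS)
        closed = [k for k in self.open_counts
                  if k[1] + WINDOW_MS <= self.watermark_ms]
        return sorted((c, w, self.open_counts.pop((c, w))) for c, w in closed)
```

A click stamped 00:00:50 that arrives after one stamped 00:01:02 still lands in the first minute, because the bucket comes from the event's own timestamp and the window stays open until the watermark passes its end.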
“Dashboards read a columnar OLAP store built for group-by aggregations over campaign and time. I never increment per-click rows in an OLTP database — that’s a hot-row write-amplification disaster at millions of clicks a second.”
A plain Kafka consumer with in-memory counts. Acceptable at small scale or in a mid-level framing, and you should say so, but name what you lose: event-time correctness, watermarks, exactly-once, and fault tolerance.
“We’ll increment a SQL counter per click and read it for the dashboard.” Hot rows, write amplification, and an OLTP store answering OLAP queries — three failures in one sentence.
The honest guarantee is at-least-once ingestion + idempotent, deduplicated results. Flink delivers exactly-once processing via periodic checkpoints of its state plus a two-phase commit built on Kafka transactions; on failure, the job restores the last checkpoint and replays from the offsets recorded in it. Combined with dedup by stable event id (Flink keyed state) and an idempotent upsert sink, duplicate or replayed events never inflate the count.
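The two mechanisms that make replays harmless can be sketched in a few lines. This is a plain-Python stand-in, not Flink code: the `set` plays the role of keyed state, the `dict` plays the role of the OLAP store, and all names are hypothetical.

```python
class DedupingAggregator:
    """Counts each event id at most once, however many times it is delivered."""

    def __init__(self):
        self.seen_ids = set()  # stand-in for keyed state; a real job would expire ids
        self.counts = {}       # (campaign_id, window_start) -> count

    def process(self, event_id, campaign_id, window_start):
        """Return True if the event was counted, False if it was a duplicate."""
        if event_id in self.seen_ids:
            return False
        self.seen_ids.add(event_id)
        key = (campaign_id, window_start)
        self.counts[key] = self.counts.get(key, 0) + 1
        return True


def upsert(store, campaign_id, window_start, count):
    """Idempotent sink write: set the absolute count, never add a delta."""
    store[(campaign_id, window_start)] = count
```

Writing the absolute count is what makes the sink safe to retry: replaying the same window after a restart overwrites the row with the same value instead of adding to it.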
Clicks arrive late from network retries or stream lag. Event-time windows with watermarks place a late click in its correct minute bucket; an allowed-lateness window lets a straggler update an already-emitted aggregate, after which the system emits a correction. Make the billing policy on lateness explicit — that’s a senior detail.
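An explicit lateness policy can be as small as a three-way classification. The sketch below assumes one-minute windows and a 5-minute allowed lateness; both values and the label names are illustrative, and the real numbers are a billing decision.

```python
WINDOW_MS = 60_000
ALLOWED_LATENESS_MS = 5 * 60_000  # assumption: stragglers accepted for 5 minutes


def classify(event_time_ms, watermark_ms):
    """Decide how the streaming layer treats an event given the current watermark."""
    window_end = event_time_ms - event_time_ms % WINDOW_MS + WINDOW_MS
    if watermark_ms < window_end:
        return "on_time"          # window still open: count normally
    if watermark_ms < window_end + ALLOWED_LATENESS_MS:
        return "late_correction"  # update the emitted aggregate, emit a correction
    return "too_late"             # stream drops it; batch reconciliation counts it
```

Writing the three outcomes down is the point: each one has a billing consequence, so none of them should happen by accident.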
Streaming can still drift — transient errors, bad deploys, very late data. Run a periodic batch reconciliation over the raw events archived in S3, compare against the hot-store aggregates, and correct authoritatively for billing. The stream gives speed; the batch gives truth.
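The comparison step of that reconciliation might look like the following sketch. It assumes both sides have already been reduced to `{(campaign_id, window_start): count}` dictionaries; the function name and the `tolerance` parameter are hypothetical.

```python
def reconcile(stream_counts, batch_counts, tolerance=0):
    """Return the authoritative batch value for every key where the stream drifted."""
    corrections = {}
    for key in stream_counts.keys() | batch_counts.keys():
        streamed = stream_counts.get(key, 0)
        recomputed = batch_counts.get(key, 0)
        if abs(streamed - recomputed) > tolerance:
            corrections[key] = recomputed  # the batch result wins for billing
    return corrections
```

A non-empty result is both the correction to apply and the signal to alert on; the size of the drift over time is a health metric for the streaming path.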
A viral campaign concentrates clicks on one key. Partition the stream so a single campaign’s load spreads across parallel Flink tasks, and pre-aggregate locally before the keyed reduce to avoid a single hot reducer.
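One common way to spread a hot key is to salt it and aggregate in two stages. The sketch below is plain Python with an assumed fan-out of 8; the salt is derived from the event id so that duplicates of the same click still meet on the same sub-key.

```python
import zlib
from collections import Counter

NUM_SALTS = 8  # assumed fan-out for a hot campaign


def salted_key(campaign_id, event_id):
    """Spread one campaign over NUM_SALTS sub-keys, deterministically per event."""
    salt = zlib.crc32(event_id.encode()) % NUM_SALTS
    return (campaign_id, salt)


def partial_counts(events):
    """Stage 1: count per (campaign, salt); each sub-key can run on its own task."""
    return Counter(salted_key(campaign_id, event_id)
                   for campaign_id, event_id in events)


def merge(partials):
    """Stage 2: fold the small set of partial counts back into per-campaign totals."""
    totals = Counter()
    for (campaign_id, _salt), n in partials.items():
        totals[campaign_id] += n
    return totals
```

The final reducer now sees at most `NUM_SALTS` records per campaign per window instead of every click, which is what removes the single hot task.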
“I deliberately don’t claim global exactly-once. I get effective exactly-once from at-least-once ingestion, Flink checkpoints with Kafka two-phase commit, dedup on a stable event id, and idempotent upserts — with a batch reconciliation over raw S3 events as the authoritative correction for billing.”
“Late events are rare, we’ll just drop them.” Dropping late clicks silently underbills or overbills — you need an explicit lateness policy and corrections, not a shrug.
Can you guarantee exactly-once end to end?
Not as a literal global guarantee — that’s a trap. What I guarantee is effective exactly-once results: at-least-once ingestion into Kafka, exactly-once processing in Flink via checkpoints and Kafka’s two-phase commit, dedup on a stable event id, and idempotent upserts into the store. A replayed or duplicated event can’t inflate the count because the id collapses it. That’s the language of someone who’s run streaming systems.
And a click that arrives ten minutes late?
Event-time plus watermarks put it in its correct minute window. If it’s within my allowed-lateness, I update that window and emit a correction downstream. If it’s beyond lateness, the streaming layer drops it but the batch reconciliation over the raw S3 archive still picks it up — so billing, which reads the reconciled truth, stays correct even when the live dashboard didn’t catch it.
Deploy Flink jobs with savepoints so you can stop, upgrade, and resume from exact state. Canary aggregation-logic changes and compare against the batch path before trusting them for billing.
“With more time I’d detail click-fraud filtering in-stream and the attribution joins between impressions, clicks, and conversions. I scoped them out deliberately — they’re their own pipelines.”
Interviewers push on correctness under duplicates, lateness, and failure. Refuse the exactly-once trap, name the mechanics, lean on reconciliation.
Every event carries a stable id at capture. Flink holds seen-ids in keyed state within the window and drops duplicates, and the sink is an idempotent upsert keyed on that id. A duplicate collapses to the same record — it can't inflate the count.
At small scale the plain consumer is a fine answer. Flink gives event-time windowing so out-of-order clicks land in the right minute, watermarks to know when a window is safe to close, exactly-once processing via checkpoints, and fault-tolerant state, all of which are painful to build and operate yourself.
Event-time windows with watermarks assign it to its correct minute. Within allowed-lateness I update that window and emit a correction; beyond it, the stream drops it but the batch reconciliation over raw S3 events still counts it. The lateness policy must be explicit because it's billing.
Not as a literal global guarantee — I refuse that framing. I guarantee effective exactly-once results: at-least-once ingestion, Flink checkpoints with Kafka two-phase commit, dedup on a stable id, and idempotent upserts. A replay can't double-count because the id collapses it.
That's an OLAP workload — a columnar store like ClickHouse/Druid/Pinot for recent hot data, and a warehouse/lakehouse over archived raw events for long ranges. I route the query by time range and accuracy need; I never run that on an OLTP store.
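The routing rule can be stated as a tiny function. This is a sketch under an assumed 7-day hot retention; the store labels and the `route_query` name are hypothetical.

```python
from datetime import timedelta

HOT_RETENTION = timedelta(days=7)  # assumed retention of the hot OLAP store


def route_query(oldest_data_age, needs_exact):
    """Pick a serving tier from the query's time range and accuracy need."""
    if needs_exact or oldest_data_age > HOT_RETENTION:
        return "warehouse"  # reconciled, authoritative, slower
    return "olap_hot"       # recent, fast, may be slightly approximate
```

Billing queries always take the reconciled path regardless of how recent the range is; dashboards take the hot path until the range outgrows its retention.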
A periodic batch reconciliation recomputes aggregates from the raw event archive in S3 and compares them to the hot store. Drift triggers an alert and the batch result is the authoritative correction for billing. The stream gives speed; the batch gives truth.
A clean design with one of these undercurrents still scores below the bar at senior+. None are about getting an answer wrong — they're about how you operate.
Jumping to architecture without bounding the problem or confirming scale. Reads as template-matching.
"It depends" with no decision behind it. Name the trade-off, then pick.
Claiming literal exactly-once across the whole pipeline. The honest guarantee is at-least-once + idempotent dedup = effective exactly-once.
Incrementing a row per click — hot rows and write amplification — and serving dashboards from OLTP. Multiple failures at once.
Aggregating without a stable event id, so retries and double-clicks silently inflate billing.
No observability, no rollout, no failure-mode plan. In 2026 this reads as "has never carried a pager."
Confident wrong answers when pushed. Far worse than an honest "here's what I'd verify."
Waiting to be asked the next question. At staff you own the 45 minutes.
Run a mock and score yourself honestly against the dimensions the interviewer uses. If you can't hit "strong" on depth and operability, that's your signal on where to drill.
| Dimension | Weak (downlevel) | Strong (at level) |
|---|---|---|
| Scoping | One path; promised global exactly-once. | Split fast-approximate dashboards from slow-exact billing; refused the exactly-once trap. |
| Capture | Trusted the browser to report clicks. | Server-side 302 redirect so the platform observes every billable click; stable event id. |
| Stream processing | In-memory counter, no windowing. | Kafka → Flink with event-time tumbling windows, watermarks, and checkpointed state. |
| Dedup | No stable key; double-counts. | Dedup on event id in keyed state plus idempotent upsert sink. |
| Reconciliation | Trusted the stream blindly. | Batch reconciliation over raw S3 events as the authoritative billing correction; drift alerts. |
| Operability | Never mentioned it. | Ingestion/watermark lag, dedup rate, hot-vs-batch drift, savepoint-based rollout. |