Streaming-Aggregation Playbook — Designing an Ad Click Aggregator

01 How you're actually graded

Six buckets — and judgment outweighs the diagram.

Every FAANG company runs a rubric. The dimensions are roughly the same; the weights differ by company and level. At senior+ the boxes-and-arrows are table stakes — what gets graded hardest is the quality of your decisions: the questions you asked first, the trade-offs you surfaced and defended, and the production reality you volunteered without being asked.

Dimension	Weight	What earns the signal
Requirements & scoping	10–15%	You scoped before drawing, asked enough to bound the problem, pinned the scale number, and stated assumptions out loud.
High-level architecture	20–25%	The right components, a clear data flow, and a reason every box exists. The design satisfies each functional requirement.
Technical depth / deep dives	~30%	You go three questions deep on the hard part without being rescued. This is where staff is won or lost.
Trade-offs & judgment	highest effective	Two viable options, what each costs, and a committed pick for this system. Simplicity over flash when flash isn't warranted.
Communication / driving	cross-cutting	You drive the 45 minutes; the interviewer never has to rescue you. You narrate, checkpoint, and narrow when the design sprawls.
Operational maturity	↑ in 2026	The newest weight: observability, rollout, failure modes, on-call reality — volunteered, not pried out.

The 2026 shift, in one line. Operational concerns are now a first-class graded dimension, and "it depends" without a committed answer reads as evasion rather than nuance. Name the trade-off, then pick.

02 The same answer is scored differently at each level

It's a sliding scale, not a pass/fail bar.

A solid design with reasonable trade-offs is a strong score for a mid-level candidate and a downlevel flag for staff. The questions can be identical; the depth expectation is not. As you climb, the balance tips from breadth toward depth, proactivity, and production reality.

Mid-level

Meta E4 · Google L4 · Amazon SDE-II

Sketches clicks going to a store and dashboards reading counts.
Reaches for a queue and aggregation when prompted; may keep an in-memory running count.
Recognizes duplicates are a problem but lacks a stable dedup key.
Needs guidance toward stream processing and correctness.

Senior

Meta E5 · Google L5 · Amazon SDE-III

Routes clicks through Kafka into a stream processor (Flink) with windowed, event-time aggregation, unprompted.
Deduplicates by a stable event id and uses an OLAP store for dashboards, not OLTP counters.
Knows not to promise global exactly-once; frames the guarantee as at-least-once delivery plus idempotent sinks.
Adds a reconciliation/batch path for billing correctness.

Staff+

Meta E6 · Google L6 · Amazon Principal

Establishes the pipeline fast, then spends time on exactly-once mechanics, late events, and reconciliation.
Experience-backed take on Flink checkpoints + Kafka two-phase commit and watermark tuning.
Treats hot-campaign keys, dual hot/cold storage, and drift monitoring as routine.
Frames the correctness-vs-latency trade and the cross-team seam with billing/fraud.

03 The lens senior engineers narrate through

Borrow AWS's Well-Architected pillars as your trade-off vocabulary.

You don't recite AWS — you anchor each decision to one of these. It signals you evaluate systems across competing concerns rather than optimizing one axis. Each pillar below is mapped to a move you can make in this exact design.

PILLAR 01

Operational Excellence

Watch lag and drift.

Hook: “I’d monitor ingestion and watermark lag, dedup rate, and the drift between the hot store and the batch reconciliation — drift is the early signal of a correctness bug.”

PILLAR 02

Security

Protect billing integrity.

Hook: “The click is captured server-side via a redirect we control, so the platform observes every billable event — the browser can’t bypass it — and click fraud is filtered in-stream.”

PILLAR 03

Reliability

Recover to the exact point.

Hook: “Flink checkpoints its state to durable storage; a crashed node restarts, loads the last checkpoint, and replays from that Kafka offset — no lost or double-counted clicks.”

PILLAR 04

Performance Efficiency

Right store for the query.

Hook: “Dashboards hit a columnar OLAP store built for fast group-by aggregations; I never increment per-click rows in an OLTP database.”

PILLAR 05

Cost Optimization

Hot and cold tiers.

Hook: “Recent aggregates live in a fast OLAP store; raw events archive cheaply to object storage in columnar format for batch reconciliation and history.”

PILLAR 06

Sustainability

Bound window state.

Hook: “Tumbling one-minute windows with TTL’d keyed state keep Flink’s working set small instead of holding unbounded per-event state.”

How to use it without sounding like a checklist. Don't list the pillars. Weave one in when you commit: name a trade-off, name the pillar it serves, and make the call. One sentence that does all three reads as senior.

03·5 The architecture you draw on the whiteboard

A pipeline, not a request — count once, even when events double or arrive late.

A firehose of clicks must be counted, deduplicated, and aggregated in near-real-time — and these counts bill advertisers, so correctness has teeth. Think in pipelines, not request/response. The hard parts are the ones nobody states in the prompt: duplicate clicks, out-of-order and late events, and the trap of promising “exactly-once.”

Diagram legend: stream path · batch reconcile · query

Think pipeline, not request. Clicks (each with an event_id) flow through Kafka into a windowed stream processor that dedups and aggregates; a batch job reprocesses the raw event log (archived from Kafka to S3) to correct late or dropped events. Say it: “I’ll promise effectively-once, meaning dedup plus idempotency, not a magic exactly-once.”

How to narrate it in the room

Set the frame. “This is a streaming aggregation problem, and the counts bill advertisers — so correctness has teeth. I’ll design a pipeline, not request/response.”
Kill duplicates. “Every click carries an event_id; the processor dedups on it within a window, and writes are idempotent, so a replayed event doesn’t double-count.”
Handle late and out-of-order. “Windows use watermarks with allowed lateness; anything later is fixed by a batch reconciler that reprocesses the raw event log archived from Kafka to S3.”
Be honest about guarantees. “True exactly-once across a distributed system is a trap. At-least-once delivery + idempotent aggregation = effectively-once — that’s what I’d commit to.”

04 The interview, minute by minute

Five phases. Drive every one of them.

The simulation. Framing: an ad click aggregator — millions of clicks/sec at peak, dashboards fresh within seconds to minutes, and authoritative daily billing that must be correct down to dedup. Clicks bill advertisers, so losing or double-counting is lost money or lost trust.

01 Requirements & Scoping · ~6 min · don't draw yet

Grading this window: Do you split fast-approximate (dashboards) from slow-exact (billing), and refuse to promise global exactly-once? That framing is the senior tell.

Functional requirements to land

Capture clicks reliably (server-side, so the platform observes every billable event).
Aggregate counts by campaign / ad / time window.
Serve real-time dashboards and produce authoritative billing totals.

Non-functional requirements to land

High ingest: millions of events/sec, burst-tolerant.
Dashboard freshness: seconds to minutes; billing: correct and authoritative.
Deduplication (retries, double-clicks) and tolerance for late / out-of-order events.
Effective exactly-once — not a literal global guarantee.

▲ Allow — say this

“There are two consumers of the same event log: a fast path for dashboards that can be approximate, and a slow path for billing that must be exact. I’ll design both and route queries by accuracy need.”

▲ Allow — say this

“I won’t promise global exactly-once across the whole distributed pipeline — that’s a trap. I’ll promise at-least-once ingestion plus idempotent, deduplicated sinks, which gives effective exactly-once results. That’s the honest guarantee.”

▼ Reject — never say this

“We’ll just guarantee exactly-once everywhere.” Claiming global exactly-once across ingestion, processing, and storage signals you haven’t operated a real streaming system.

02 Entities, API & Estimation · ~5 min

Grading this window: A stable dedup key in the event model and the server-side capture decision. The event log as system of record.

The core entity is the ClickEvent — and the most important field is a stable, unique id (an event_id / impressionId) that makes dedup possible:

ClickEvent { event_id, campaignId, adId, userId, ts, … }
capture: browser → GET /click/{impressionId} → 302 redirect to advertiser (server-side, so the platform records the click before redirecting)

Capturing via a server-side 302 redirect matters: the platform observes and records the click before sending the user on, so you can bill what you actually saw. The raw event log (Kafka) is the system of record — it buffers bursts, is replayable for backfills, and is the clean contract between ingestion and compute.
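A minimal sketch of the capture step, assuming a hypothetical `handle_click` handler and an in-memory list standing in for the Kafka topic. The field names follow the entity above in Python naming; `ts_ms` and the `append_to_log` callback are illustrative assumptions, not a real producer API. The only point it makes is ordering: append to the log first, redirect second.

```python
from dataclasses import dataclass, asdict
from typing import Callable


@dataclass(frozen=True)
class ClickEvent:
    event_id: str      # stable id minted before the click; the dedup key
    campaign_id: str
    ad_id: str
    user_id: str
    ts_ms: int         # event time, epoch milliseconds (assumed unit)


def handle_click(event: ClickEvent, advertiser_url: str,
                 append_to_log: Callable[[dict], None]) -> tuple[int, dict]:
    """Record the click, then redirect. Returns (status, headers)."""
    # The append must succeed before the user is sent on, so that every
    # redirect the platform issues corresponds to a recorded event.
    append_to_log(asdict(event))
    return 302, {"Location": advertiser_url}


# Usage: a plain list stands in for the Kafka topic.
log: list[dict] = []
event = ClickEvent("imp-123", "camp-1", "ad-9", "user-42", 1_700_000_000_000)
status, headers = handle_click(event, "https://advertiser.example/landing", log.append)
```

If the append raises, no redirect is returned, which is the behavior you want to be able to defend: you never bill for, or lose, a click you sent onward without recording.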

The estimate that matters

Millions of clicks/sec means any design that does a database write per click (incrementing a counter row) creates hot rows and write amplification. State that early — it rules out the naive counter and forces stream aggregation.
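The arithmetic behind that claim can be sketched in a few lines. Every number here is an assumption for illustration (one million clicks per second at peak, 100,000 active campaigns, one-minute windows); swap in whatever the interviewer gives you.

```python
CLICKS_PER_SEC = 1_000_000     # assumed peak ingest
ACTIVE_CAMPAIGNS = 100_000     # assumed number of campaigns receiving clicks
WINDOW_SEC = 60                # one-minute aggregation windows

# Naive design: one database write per click.
per_click_writes_per_sec = CLICKS_PER_SEC

# Stream aggregation: at most one row per campaign per window.
aggregated_writes_per_sec = ACTIVE_CAMPAIGNS / WINDOW_SEC

reduction = per_click_writes_per_sec / aggregated_writes_per_sec

# Upper bound on raw events per day if the peak were sustained all day.
clicks_per_day = CLICKS_PER_SEC * 86_400

print(f"{aggregated_writes_per_sec:.0f} aggregated writes/s, {reduction:.0f}x fewer")
```

Under these assumptions the store sees roughly 1,700 writes per second instead of a million, about a 600x reduction, and the hot-row problem disappears because each campaign row is written once per window rather than once per click.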

▲ Allow — say this

“Every event carries a stable id at capture time — that id is what lets me deduplicate retries and double-clicks downstream and bill exactly once per real click.”

03 High-Level Design (the MVP) · ~13 min

Grading this window: Kafka → stream processor → OLAP, with raw events archived. Right components, clear flow, correct storage choice.

The pipeline

Click → redirect/capture service appends the event to Kafka → a stream processor (Flink) reads the stream and aggregates in event-time windows → results land in a hot OLAP store (ClickHouse / Druid / Pinot) that serves dashboards with fast group-by queries. In parallel, raw events are dumped to a data lake (S3) in columnar format for the batch/billing path.

click → capture (302) → Kafka (system of record)
Kafka → Flink: event-time tumbling 1-min windows + watermarks → OLAP hot store (dashboards)
Kafka → S3 raw events → batch reconciliation (billing)

Why a stream processor, not a plain consumer

You could run a Kafka consumer with an in-memory running count — fine for a mid-level answer. Flink earns its place with event-time windowing (so out-of-order clicks land in the right minute), watermarks (knowing when a window is safe to close), exactly-once processing, and fault-tolerant state — all painful to build yourself.
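To make "event-time windows plus watermarks" concrete, here is a toy model in plain Python. It is not Flink code: `TumblingCounter` is a hypothetical class, and the five-second out-of-orderness bound is an assumed value. The watermark trails the largest timestamp seen, and a window is emitted only once the watermark passes its end.

```python
from collections import defaultdict

WINDOW_MS = 60_000
MAX_OUT_OF_ORDER_MS = 5_000   # assumed bound on how far events arrive out of order


class TumblingCounter:
    def __init__(self):
        self.open = defaultdict(int)   # (campaign_id, window_start) -> count
        self.max_ts = 0
        self.watermark = 0

    def on_event(self, campaign_id, ts_ms):
        """Count one click by event time; return any windows that just closed."""
        window_start = ts_ms - ts_ms % WINDOW_MS
        if window_start + WINDOW_MS <= self.watermark:
            return []   # window already closed; late handling is out of scope here
        self.open[(campaign_id, window_start)] += 1
        self.max_ts = max(self.max_ts, ts_ms)
        self.watermark = self.max_ts - MAX_OUT_OF_ORDER_MS
        ready = sorted(k for k in self.open if k[1] + WINDOW_MS <= self.watermark)
        return [(c, w, self.open.pop((c, w))) for c, w in ready]


counter = TumblingCounter()
emitted = []
# The third click (ts=50_000) arrives after a click from the next minute.
for campaign_id, ts_ms in [("A", 10_000), ("A", 61_000), ("A", 50_000), ("A", 66_000)]:
    emitted += counter.on_event(campaign_id, ts_ms)
print(emitted)   # [('A', 0, 2)]
```

The out-of-order click still lands in the first minute because the window stayed open until the watermark reached 61,000. A processing-time counter would have put it in the wrong minute.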

The trap door the interviewer opens here. “A user double-clicks, or the redirect retries — don’t bill twice.” Dedup on the stable event id: Flink keeps the seen-ids in keyed state within the window and drops duplicates, and the sink is an idempotent upsert keyed on that id. Reaching for a stable dedup key before being asked is the senior signal.
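The two layers of that answer can be sketched as follows, with a dict of sets standing in for Flink keyed state and a plain dict standing in for the store. `WindowDeduper` and `upsert` are hypothetical names, and the record shape is an assumption.

```python
WINDOW_MS = 60_000


class WindowDeduper:
    """Event ids already seen, per one-minute window (stand-in for keyed state)."""

    def __init__(self):
        self.seen = {}   # window_start -> set of event ids

    def is_new(self, event_id, ts_ms):
        window_start = ts_ms - ts_ms % WINDOW_MS
        ids = self.seen.setdefault(window_start, set())
        if event_id in ids:
            return False
        ids.add(event_id)
        return True

    def expire_before(self, watermark_ms):
        """Drop id sets for windows that ended at or before the watermark."""
        for w in [w for w in self.seen if w + WINDOW_MS <= watermark_ms]:
            del self.seen[w]


def upsert(store, event_id, row):
    """Idempotent write: the same id always lands on the same record."""
    store[event_id] = row


deduper, store = WindowDeduper(), {}
clicks = [("e1", "camp-1", 1_000), ("e1", "camp-1", 1_200), ("e2", "camp-1", 2_000)]
for event_id, campaign_id, ts_ms in clicks:
    if deduper.is_new(event_id, ts_ms):
        upsert(store, event_id, {"campaign_id": campaign_id, "ts_ms": ts_ms})

# A replay that bypasses the in-memory state still collapses at the sink.
upsert(store, "e2", {"campaign_id": "camp-1", "ts_ms": 2_000})
billable = len(store)
```

The in-state check is scoped to a window so state stays bounded; the sink upsert is the backstop for duplicates that arrive after that state has expired or after a restart.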

▲ Allow — say this

“Dashboards read a columnar OLAP store built for group-by aggregations over campaign and time. I never increment per-click rows in an OLTP database — that’s a hot-row write-amplification disaster at millions of clicks a second.”

◆ Throttle — only with a reason

A plain Kafka consumer with in-memory counts. Acceptable at small scale or in a mid-level framing, and say so, but name what you lose: event-time correctness, watermarks, exactly-once state, and fault tolerance.

▼ Reject — never say this

“We’ll increment a SQL counter per click and read it for the dashboard.” Hot rows, write amplification, and an OLTP store answering OLAP queries — three failures in one sentence.

04 Deep Dives: the stress test · ~15 min · where staff is decided

Grading this window: Lead toward exactly-once mechanics, late events, reconciliation, and hot keys. Staff volunteers these; 30%+ of the score.

Effective exactly-once, mechanically

The honest guarantee is at-least-once ingestion + idempotent, deduplicated results. Flink gives exactly-once state semantics through periodic checkpoints of its operator state together with the Kafka offsets it has read; for transactional sinks such as Kafka, a two-phase commit tied to those checkpoints extends the guarantee to the output. On failure, the job restarts from the last completed checkpoint and replays from the offsets stored in it. Combined with dedup by stable event id (Flink keyed state) and an idempotent upsert sink, duplicate or replayed events never inflate the count.
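The recovery property is easier to defend once you have seen it in miniature. This is a toy model, not Flink's checkpoint protocol: `CheckpointedCounter` is a hypothetical class, and a Python list stands in for a single Kafka partition. The one idea it shows is that state and read position are snapshotted together, so a restart rewinds both.

```python
import copy


class CheckpointedCounter:
    def __init__(self):
        self.offset = 0            # next log position to read
        self.counts = {}           # campaign_id -> clicks
        self._checkpoint = (0, {})

    def process(self, log, upto):
        while self.offset < upto:
            campaign_id = log[self.offset]
            self.counts[campaign_id] = self.counts.get(campaign_id, 0) + 1
            self.offset += 1

    def checkpoint(self):
        # State and offset are captured atomically, as one snapshot.
        self._checkpoint = (self.offset, copy.deepcopy(self.counts))

    def recover(self):
        # Everything done since the snapshot is discarded and replayed.
        self.offset, counts = self._checkpoint
        self.counts = copy.deepcopy(counts)


log = ["a", "b", "a", "a", "b"]
job = CheckpointedCounter()
job.process(log, 2)
job.checkpoint()
job.process(log, 4)      # progress made after the checkpoint...
job.recover()            # ...is lost in a crash and rolled back
job.process(log, len(log))
```

The final counts match a run with no crash. What the model does not cover is output already written to an external store between the checkpoint and the crash, which is why the sink still has to be idempotent or transactional.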

Late & out-of-order events

Clicks arrive late from network retries or stream lag. Event-time windows with watermarks place a late click in its correct minute bucket; an allowed-lateness window lets a straggler update an already-emitted aggregate, after which the system emits a correction. Make the billing policy on lateness explicit — that’s a senior detail.
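An explicit lateness policy fits in a few lines. This is a simplified sketch: the five-minute allowed lateness is an assumed policy value, `classify` is a hypothetical helper, and a real engine's boundary conditions may differ slightly.

```python
WINDOW_MS = 60_000
ALLOWED_LATENESS_MS = 5 * 60_000   # assumed billing policy: five minutes


def classify(ts_ms, watermark_ms):
    """Decide how the streaming layer treats a click given the current watermark."""
    window_end = ts_ms - ts_ms % WINDOW_MS + WINDOW_MS
    if watermark_ms < window_end:
        return "on_time"       # window still open: count normally
    if watermark_ms < window_end + ALLOWED_LATENESS_MS:
        return "correction"    # window already emitted: update it and re-emit
    return "to_batch"          # too late for the stream: left to reconciliation


# A click stamped at 0:30 belongs to the window ending at 1:00.
print(classify(30_000, 50_000))    # on_time
print(classify(30_000, 120_000))   # correction
print(classify(30_000, 600_000))   # to_batch
```

With this policy a click that shows up ten minutes late is not updated in the live window, so the dashboard undercounts it until the batch path catches up. That is the gap you state out loud.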

Reconciliation (the lambda safety net)

Streaming can still drift — transient errors, bad deploys, very late data. Run a periodic batch reconciliation over the raw events archived in S3, compare against the hot-store aggregates, and correct authoritatively for billing. The stream gives speed; the batch gives truth.
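A reconciliation pass can be sketched as a recompute-and-diff. The sketch assumes raw events are `(event_id, campaign_id, ts_ms)` tuples read from the archive and the hot store exposes counts keyed by `(campaign_id, window_start)`; `reconcile` and its `tolerance` parameter are hypothetical.

```python
from collections import Counter

WINDOW_MS = 60_000


def reconcile(raw_events, hot_counts, tolerance=0.0):
    """Recompute deduplicated counts from raw events; return the rows to correct."""
    seen, truth = set(), Counter()
    for event_id, campaign_id, ts_ms in raw_events:
        if event_id in seen:
            continue                      # dedup on the stable event id
        seen.add(event_id)
        truth[(campaign_id, ts_ms - ts_ms % WINDOW_MS)] += 1

    corrections = {}
    for key in truth.keys() | hot_counts.keys():
        expected, actual = truth.get(key, 0), hot_counts.get(key, 0)
        if abs(expected - actual) > tolerance * max(expected, 1):
            corrections[key] = expected   # the batch value is authoritative
    return corrections


raw = [("e1", "camp-1", 1_000), ("e1", "camp-1", 1_000),   # duplicate delivery
       ("e2", "camp-1", 2_000), ("e3", "camp-2", 61_000)]
hot = {("camp-1", 0): 3, ("camp-2", 60_000): 1}            # stream double-counted e1
corrections = reconcile(raw, hot)
print(corrections)   # {('camp-1', 0): 2}
```

The size of `corrections` relative to the total is the drift signal: a non-empty result on a window that should have settled is what pages someone.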

Hot keys

A viral campaign concentrates clicks on one key. Partition the stream so a single campaign’s load spreads across parallel Flink tasks, and pre-aggregate locally before the keyed reduce to avoid a single hot reducer.
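One common way to do this is key salting with a two-stage aggregation, sketched here in plain Python. The bucket count of 8 and the helper names are assumptions; `zlib.crc32` is used because it is stable across processes, unlike Python's built-in `hash` for strings.

```python
import zlib
from collections import Counter

SALT_BUCKETS = 8   # assumed fan-out for hot campaigns


def salted_key(campaign_id, event_id):
    """Spread one campaign across several sub-keys, chosen by the event id."""
    return (campaign_id, zlib.crc32(event_id.encode()) % SALT_BUCKETS)


def two_stage_count(events):
    # Stage 1: partial counts per (campaign, salt); these can run in parallel.
    partial = Counter(salted_key(campaign_id, event_id)
                      for event_id, campaign_id in events)
    # Stage 2: merge the handful of partials per campaign into the final count.
    final = Counter()
    for (campaign_id, _salt), n in partial.items():
        final[campaign_id] += n
    return partial, final


events = [(f"e{i}", "viral") for i in range(1_000)]
events += [(f"q{i}", "quiet") for i in range(3)]
partial, final = two_stage_count(events)
viral_partials = [k for k in partial if k[0] == "viral"]
```

The second stage receives at most `SALT_BUCKETS` records per campaign per window instead of every click, so the single reducer for the viral campaign is no longer the bottleneck.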

▲ Allow — say this (staff move)

“I deliberately don’t claim global exactly-once. I get effective exactly-once from at-least-once ingestion, Flink checkpoints with Kafka two-phase commit, dedup on a stable event id, and idempotent upserts — with a batch reconciliation over raw S3 events as the authoritative correction for billing.”

▼ Reject — never say this

“Late events are rare, we’ll just drop them.” Silently dropping late clicks underbills the advertiser and leaves the dashboard and the invoice disagreeing. You need an explicit lateness policy and corrections, not a shrug.

Scripted stress-test exchange

Interviewer

Can you guarantee exactly-once end to end?

You

Not as a literal global guarantee — that’s a trap. What I guarantee is effective exactly-once results: at-least-once ingestion into Kafka, exactly-once processing in Flink via checkpoints and Kafka’s two-phase commit, dedup on a stable event id, and idempotent upserts into the store. A replayed or duplicated event can’t inflate the count because the id collapses it. That’s the language of someone who’s run streaming systems.

Interviewer

And a click that arrives ten minutes late?

You

Event-time plus watermarks put it in its correct minute window. If it’s within my allowed-lateness, I update that window and emit a correction downstream. If it’s beyond lateness, the streaming layer drops it but the batch reconciliation over the raw S3 archive still picks it up — so billing, which reads the reconciled truth, stays correct even when the live dashboard didn’t catch it.

05 Wrap-up: operability & recap · ~6 min

Grading this window: Prove you could run it. Volunteer observability and rollout; recap; name what you deferred.

Observability

Ingestion lag and watermark lag — how far behind real time the windows are.
Dedup rate and drift between the hot store and the batch reconciliation (the correctness alarm).
Dashboard freshness and Flink checkpoint duration/failures.
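The signals above reduce to a few subtractions and one ratio. The sketch below is illustrative only: the function names and all three thresholds are assumptions, and in practice the inputs would come from your metrics system rather than function arguments.

```python
MAX_WATERMARK_LAG_MS = 120_000        # assumed: windows may trail real time by 2 min
MAX_INGESTION_LAG_EVENTS = 1_000_000  # assumed: unread backlog in the log
MAX_DRIFT_RATIO = 0.001               # assumed: 0.1% hot-vs-batch disagreement


def pipeline_health(now_ms, watermark_ms, log_end_offset, committed_offset,
                    hot_total, batch_total):
    return {
        "watermark_lag_ms": now_ms - watermark_ms,
        "ingestion_lag_events": log_end_offset - committed_offset,
        "drift_ratio": abs(hot_total - batch_total) / max(batch_total, 1),
    }


def firing_alerts(m):
    alerts = []
    if m["watermark_lag_ms"] > MAX_WATERMARK_LAG_MS:
        alerts.append("watermark_lag")
    if m["ingestion_lag_events"] > MAX_INGESTION_LAG_EVENTS:
        alerts.append("ingestion_lag")
    if m["drift_ratio"] > MAX_DRIFT_RATIO:
        alerts.append("drift")
    return alerts


metrics = pipeline_health(now_ms=1_000_000, watermark_ms=900_000,
                          log_end_offset=5_000, committed_offset=4_800,
                          hot_total=1_010, batch_total=1_000)
alerts = firing_alerts(metrics)
print(alerts)   # ['drift']
```

In this example the pipeline is keeping up but the hot store is 1% above the batch total, so only the correctness alarm fires.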

Rollout

Deploy Flink jobs with savepoints so you can stop, upgrade, and resume from exact state. Canary aggregation-logic changes and compare against the batch path before trusting them for billing.

▲ Allow — say this

“With more time I’d detail click-fraud filtering in-stream and the attribution joins between impressions, clicks, and conversions. I scoped them out deliberately — they’re their own pipelines.”

05 The follow-up gauntlet

The probes you'll get — and the answer that holds.

Interviewers push on correctness under duplicates, lateness, and failure. Refuse the exactly-once trap, name the mechanics, lean on reconciliation.

"A user double-clicks or the redirect retries — don't bill twice."

Every event carries a stable id at capture. Flink holds seen-ids in keyed state within the window and drops duplicates, and the sink is an idempotent upsert keyed on that id. A duplicate collapses to the same record — it can't inflate the count.

"Why Flink and not a Kafka consumer with an in-memory counter?"

At small scale the consumer is a fine answer. Flink gives event-time windowing so out-of-order clicks land in the right minute, watermarks to know when a window is safe to close, exactly-once state via checkpoints, and fault-tolerant state, all of which are painful to build and operate yourself.

"A click arrives ten minutes late."

Event-time windows with watermarks assign it to its correct minute. Within allowed-lateness I update that window and emit a correction; beyond it, the stream drops it but the batch reconciliation over raw S3 events still counts it. The lateness policy must be explicit because it's billing.

"Can you guarantee exactly-once end to end?"

Not as a literal global guarantee — I refuse that framing. I guarantee effective exactly-once results: at-least-once ingestion, Flink checkpoints with Kafka two-phase commit, dedup on a stable id, and idempotent upserts. A replay can't double-count because the id collapses it.

"Advertiser queries 90 days grouped by campaign and minute."

That's an OLAP workload — a columnar store like ClickHouse/Druid/Pinot for recent hot data, and a warehouse/lakehouse over archived raw events for long ranges. I route the query by time range and accuracy need; I never run that on an OLTP store.
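The routing rule is small enough to write down. This sketch assumes a 30-day retention in the hot store and three hypothetical backend names; neither comes from a specific product.

```python
HOT_RETENTION_DAYS = 30   # assumed retention of the hot OLAP store


def route_query(range_days, needs_billing_accuracy):
    """Pick a backend by time range and by how exact the answer must be."""
    if needs_billing_accuracy:
        return "reconciled_warehouse"   # batch-corrected totals only
    if range_days <= HOT_RETENTION_DAYS:
        return "olap_hot_store"         # recent, fast, possibly approximate
    return "warehouse"                  # long ranges over archived events


print(route_query(7, False))    # olap_hot_store
print(route_query(90, False))   # warehouse
print(route_query(7, True))     # reconciled_warehouse
```

The accuracy check comes first on purpose: an invoice query never reads unreconciled data, even when the range would fit in the hot store.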

"How do you know the counts are actually right?"

A periodic batch reconciliation recomputes aggregates from the raw event archive in S3 and compares them to the hot store. Drift triggers an alert and the batch result is the authoritative correction for billing. The stream gives speed; the batch gives truth.

Handling a probe you can’t fully answer: speak in guarantees, not absolutes. “I can’t recite Flink’s exact checkpoint protocol, but the property I rely on is that it recovers state and stream position to a consistent point, so a failure replays without double-counting once the sink dedups. Here’s how I’d verify that holds.”

06 What gets you downleveled

The flags that quietly tank an otherwise solid loop.

A clean design with one of these undercurrents still scores below the bar at senior+. None are about getting an answer wrong — they're about how you operate.

Drawing before scoping

Jumping to architecture without bounding the problem or confirming scale. Reads as template-matching.

Hedging without committing

"It depends" with no decision behind it. Name the trade-off, then pick.

Promising global exactly-once

Claiming literal exactly-once across the whole pipeline. The honest guarantee is at-least-once + idempotent dedup = effective exactly-once.

Per-click SQL counter

Incrementing a row per click — hot rows and write amplification — and serving dashboards from OLTP. Multiple failures at once.

No dedup key

Aggregating without a stable event id, so retries and double-clicks silently inflate billing.

Skipping operations entirely

No observability, no rollout, no failure-mode plan. In 2026 this reads as "has never carried a pager."

Bluffing under a probe

Confident wrong answers when pushed. Far worse than an honest "here's what I'd verify."

Not driving

Waiting to be asked the next question. At staff you own the 45 minutes.

07 Your pre-loop scorecard

Self-grade before you walk in.

Run a mock and score yourself honestly against the dimensions the interviewer uses. If you can't hit "strong" on depth and operability, that's your signal on where to drill.

Dimension	Weak (downlevel)	Strong (at level)
Scoping	One path; promised global exactly-once.	Split fast-approximate dashboards from slow-exact billing; refused the exactly-once trap.
Capture	Trusted the browser to report clicks.	Server-side 302 redirect so the platform observes every billable click; stable event id.
Stream processing	In-memory counter, no windowing.	Kafka → Flink with event-time tumbling windows, watermarks, and checkpointed state.
Dedup	No stable key; double-counts.	Dedup on event id in keyed state plus idempotent upsert sink.
Reconciliation	Trusted the stream blindly.	Batch reconciliation over raw S3 events as the authoritative billing correction; drift alerts.
Operability	Never mentioned it.	Ingestion/watermark lag, dedup rate, hot-vs-batch drift, savepoint-based rollout.

The 60-second recap that lands the level

Quick recap: two consumers of one Kafka event log — a fast Flink path doing event-time windowed aggregation into a columnar OLAP store for dashboards, and a slow batch path over raw S3 events for authoritative billing; capture is a server-side 302 so we observe every click; dedup is on a stable event id with idempotent upserts; effective exactly-once comes from at-least-once ingestion plus Flink checkpoints and Kafka two-phase commit, not a global claim; late events use watermarks and corrections; reconciliation is the truth for billing. Headline metrics: watermark lag and hot-vs-batch drift. With more time: fraud filtering and attribution joins.

The one mental model: a counting pipeline is a fast approximate path and a slow exact path reading the same durable log — speed from the stream, truth from the batch. Say “this is a streaming-aggregation problem and I won’t promise global exactly-once” in the first two minutes, dedup on a stable id, and let reconciliation be the authority where money is on the line.

Design an Ad Click Aggregator — every click is money you can’t lose or double-count.
