Collaborative-State Playbook — Designing Google Docs

01 How you're actually graded

Six buckets — and judgment outweighs the diagram.

Every FAANG company runs a rubric. The dimensions are roughly the same; the weights differ by company and level. At senior+ the boxes-and-arrows are table stakes — what gets graded hardest is the quality of your decisions: the questions you asked first, the trade-offs you surfaced and defended, and the production reality you volunteered without being asked.

Dimension	Weight	What earns the signal
Requirements & scoping	10–15%	You scoped before drawing, asked enough to bound the problem, pinned the scale number, and stated assumptions out loud.
High-level architecture	20–25%	The right components, a clear data flow, and a reason every box exists. The design satisfies each functional requirement.
Technical depth / deep dives	~30%	You go three questions deep on the hard part without being rescued. This is where staff is won or lost.
Trade-offs & judgment	highest effective	Two viable options, what each costs, and a committed pick for this system. Simplicity over flash when flash isn't warranted.
Communication / driving	cross-cutting	You drive the 45 minutes; the interviewer never has to rescue you. You narrate, checkpoint, and narrow when the design sprawls.
Operational maturity	↑ in 2026	The newest weight: observability, rollout, failure modes, on-call reality — volunteered, not pried out.

The 2026 shift, in one line. Operational concerns are now a first-class graded dimension, and "it depends" without a committed answer reads as evasion rather than nuance. Name the trade-off, then pick.

02 The same answer is scored differently at each level

It's a sliding scale, not a pass/fail bar.

A solid design with reasonable trade-offs is a strong score for a mid-level candidate and a downlevel flag for staff. The questions can be identical; the depth expectation is not. As you climb, the balance tips from breadth toward depth, proactivity, and production reality.

Mid-level

Meta E4 · Google L4 · Amazon SDE-II

Sketches clients talking to a server over WebSocket and edits being broadcast.
Knows concurrent edits can conflict and that OT and CRDTs exist, but is shaky on the mechanics.
Stores the document; may not separate the op log from the materialized content.
Needs guidance on ordering and convergence guarantees.

Senior

Meta E5 · Google L5 · Amazon SDE-III

Chooses WebSocket with a one-line justification (bidirectional is mandatory) and sends op deltas, not whole files.
Explains OT at a mechanical level: the server orders ops with revision numbers and transforms concurrent ops.
Persists an op log for durability/replay and snapshots periodically; handles cursor presence.
Has an opinion on OT vs CRDT and when each is right.

Staff+

Meta E6 · Google L6 · Amazon Principal

Establishes the OT pipeline fast, then spends time on WebSocket-server scaling, offline reconciliation, and the OT/CRDT hybrid.
Experience-backed take on convergence-time monitoring and transform-conflict rates.
Treats snapshot/compaction, pub-sub rebroadcast, and gradual rollout as routine.
Frames the central-ordering constraint as both a simplifier (free total order) and a scaling bottleneck.

03 The lens senior engineers narrate through

Borrow AWS's Well-Architected pillars as your trade-off vocabulary.

You don't recite AWS — you anchor each decision to one of these. It signals you evaluate systems across competing concerns rather than optimizing one axis. Each pillar below is mapped to a move you can make in this exact design.

PILLAR 01

Operational Excellence

Measure convergence, not just delivery.

Hook: “I’d log every transform and track convergence time — how long until all clients see the same state — alerting on spikes that signal a transform bug or network issue.”

PILLAR 02

Security

Per-document access control.

Hook: “Every operation is authorized against the document’s ACL at the gateway; a viewer can’t inject edit ops, and share scope is enforced server-side.”

PILLAR 03

Reliability

Never lose an accepted edit.

Hook: “An op is durably appended to the log before it’s acknowledged and broadcast, so a server crash can’t silently drop a keystroke a user already saw confirmed.”

PILLAR 04

Performance Efficiency

Send deltas, hit the latency floor.

Hook: “I broadcast small op deltas, not the document, and keep edit echo under ~100ms because anything slower breaks the illusion of co-presence.”

PILLAR 05

Cost Optimization

Snapshot to avoid replay.

Hook: “Periodic snapshots plus op-log compaction mean opening a doc loads a snapshot and a few recent ops, not a million-keystroke history.”

PILLAR 06

Sustainability

Compact the log.

Hook: “For CRDT paths I’d garbage-collect tombstones on a schedule so metadata doesn’t grow unbounded with every deleted character.”
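
A minimal sketch of that tombstone sweep, assuming the server tracks a "stable version" that every replica has acknowledged. The `Element` shape, `deleted_at`, and `stable_version` are hypothetical names for illustration; real sequence CRDTs often keep tombstones longer because other elements may reference them as anchors.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    id: str                            # unique id the CRDT assigned to this character
    char: str
    deleted_at: Optional[int] = None   # version at which it was tombstoned, if any


def visible_text(elements: List[Element]) -> str:
    """Text the user sees: every element that is not tombstoned."""
    return "".join(e.char for e in elements if e.deleted_at is None)


def gc_tombstones(elements: List[Element], stable_version: int) -> List[Element]:
    """Drop tombstones whose delete every replica has already seen (assumed)."""
    return [
        e for e in elements
        if e.deleted_at is None or e.deleted_at > stable_version
    ]


elements = [
    Element("a1", "H"),
    Element("a2", "x", deleted_at=5),
    Element("a3", "i"),
    Element("a4", "y", deleted_at=9),
]
compacted = gc_tombstones(elements, stable_version=7)
```

The delete at version 9 is kept because not every replica is assumed to have seen it yet; the visible text is unchanged either way.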

How to use it without sounding like a checklist. Don't list the pillars. Weave one in when you commit: name a trade-off, name the pillar it serves, and make the call. One sentence that does all three reads as senior.

03·5 The architecture you draw on the whiteboard

Convergence by funnelling — every edit through one authority.

The make-or-break is the conflict-resolution algorithm (Operational Transform or CRDT) — and bluffing it is obvious. Architecturally, the trick is to funnel every client’s edits through one Doc Session that owns the algorithm, so concurrent edits are transformed into a single order and every screen converges. Everything else exists to feed that.

[Diagram: client ops → transform + broadcast → snapshot]

Convergence by funnelling. Every client’s edits flow through one Doc Session that owns the OT/CRDT engine; it orders concurrent ops, appends to a log, and broadcasts the result so all screens converge. Say it: “the conflict-resolution algorithm is the make-or-break — everything else just feeds it.”

How to narrate it in the room

Anchor on the algorithm. “The crux is conflict resolution — Operational Transform or CRDT. I’d pick one and explain how concurrent edits at the same position still converge.”
Force a single authority. “All sessions for a document route to one Doc Session via consistent hashing on docId, so there’s exactly one place that orders operations.”
Persist as log + snapshots. “Ops append to an immutable log for replay and offline sync; periodic snapshots bound how much log you replay to reconstruct the document.”
Layer the rest on top. “WebSockets carry ops and cursor presence; storage and auth make the convergence durable and scoped — they don’t solve the conflict.”
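
The "consistent hashing on docId" move can be sketched as a small hash ring. This is an illustration under assumptions, not how any particular product routes sessions: the server names, the `vnodes` count, and the choice of SHA-256 are all hypothetical.

```python
import bisect
import hashlib
from typing import List


def _hash(key: str) -> int:
    """Map a string to a point on the ring (first 8 bytes of SHA-256)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, servers: List[str], vnodes: int = 64):
        # Each server owns several virtual points so load spreads more evenly.
        self._ring = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def server_for(self, doc_id: str) -> str:
        """First ring point clockwise from the doc's hash owns the Doc Session."""
        idx = bisect.bisect_right(self._points, _hash(doc_id)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["ws-1", "ws-2", "ws-3"])
smaller = HashRing(["ws-1", "ws-2"])   # same ring after ws-3 is removed

docs = [f"doc-{n}" for n in range(200)]
# Documents that were not on ws-3 should keep their owner when ws-3 leaves.
unmoved = all(
    smaller.server_for(d) == ring.server_for(d)
    for d in docs
    if ring.server_for(d) != "ws-3"
)
```

The property worth saying out loud is the last one: removing a server only relocates the documents that server owned, so most Doc Sessions stay put during a deploy or failure.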

04 The interview, minute by minute

Five phases. Drive every one of them.

The simulation. Framing: a browser-based collaborative editor — many documents, up to ~50–100 concurrent editors on a hot doc, edit echo targeting < 100ms, guaranteed convergence, durability of accepted edits, and high availability.

01 Requirements & Scoping · ~6 min · don't draw yet

Grading this window: Do you name convergence under concurrent edits — not storage — as the crux? And pin the sub-100ms latency budget that drives the transport choice?

Functional requirements to land

Create / open a document.
Multiple users edit concurrently and see each other’s changes in real time.
Presence: see other users’ cursors and selections.

Non-functional requirements to land

Low latency: edit echo < 100ms — above ~100–200ms it stops feeling simultaneous.
Convergence: all clients must reach the same final state, regardless of edit order. This is the requirement, not “fast delivery.”
Durability: an accepted edit is never lost.
Availability for reads/edits.

▲ Allow — say this

“The crux isn’t storing the document — it’s convergence. If two people edit the same position at once, fast delivery alone gives them different documents. The whole architecture is organized around resolving that correctly.”

▲ Allow — say this

“Two scoping questions: linear rich text or a spatial canvas, and do we need offline editing? Both push me toward or away from CRDTs, so I want to know before I commit to OT.”

▼ Reject — never say this

“We’ll save the document and last-write-wins on conflict.” For collaborative editing, last-write-wins silently destroys other users’ work — it’s the anti-answer.

02 Entities, API & Transport · ~5 min

Grading this window: Crisp model and the transport decision with a one-line reason. Sending op deltas (not whole files) is the signal.

Entities: Document (id, content + current revision number), Operation (type insert/delete, position, payload, clientId, baseRevision), Presence (cursor position per user). The transport is the first real decision:

// WebSocket — bidirectional is mandatory:
// the client pushes ops AND the server pushes transformed ops back
op = { type:"insert", pos:29, char:"R", rev:412, client:"A" }

Send the operation delta, never the whole file — a keystroke is a few bytes, and broadcasting the full document on every edit is both slow and unmergeable.
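
A rough way to make the size argument concrete, assuming compact JSON on the wire. The field names mirror the op above; the JSON encoding and the 50,000-character document are assumptions for illustration only.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Operation:
    type: str     # "insert" or "delete"
    pos: int      # index in the document the op targets
    char: str     # payload for an insert
    rev: int      # base revision the client had seen when it made the edit
    client: str   # id of the originating client


def wire_size(payload) -> int:
    """Bytes on the wire, assuming compact UTF-8 JSON (illustrative encoding)."""
    return len(json.dumps(payload, separators=(",", ":")).encode("utf-8"))


op = Operation(type="insert", pos=29, char="R", rev=412, client="A")
document = "x" * 50_000   # hypothetical 50k-character document

delta_bytes = wire_size(asdict(op))
full_bytes = wire_size({"content": document})
print(delta_bytes, full_bytes)
```

Under these assumptions the delta is tens of bytes while the full document is tens of kilobytes per keystroke, per connected client.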

▲ Allow — say this

“WebSocket, because collaborative editing needs the server to push to clients, not just respond — true bidirectional. And I send small op deltas tagged with the revision they were based on, so the server can transform them against anything that landed since.”

03 High-Level Design (the MVP) · ~13 min

Grading this window: The central-ordering OT pipeline with durability. Components with a clear data flow and a reason each exists.

The pipeline

Clients connect over WebSocket to a real-time gateway. Edits flow to a document/operation service that does three things: assigns a monotonically increasing revision number (a total order), transforms each incoming op against any ops that landed since its base revision (OT), and broadcasts the transformed op to all other clients on that document. Each op is durably appended to an op log before acknowledgment, so nothing accepted is ever lost.

clients <—WS—> gateway <—> op service
op service: assign rev# → OT transform → append to log → broadcast
storage: op log (durable, replayable) + periodic snapshot + metadata
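
The pipeline can be sketched in memory as follows. This is a simplified illustration, not a production design: it handles inserts only, the Python list stands in for the durable op log, and `OpService`, `Insert`, and the `broadcast` callback are hypothetical names. Here the revision number falls out of the op's position in the log.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Insert:
    pos: int
    text: str
    client: str
    base_rev: int   # revision the client had seen when it typed


def transform(op: Insert, earlier: Insert) -> Insert:
    """Shift `op` right if an already-accepted insert landed at or before it."""
    if earlier.pos <= op.pos:
        return replace(op, pos=op.pos + len(earlier.text))
    return op


class OpService:
    def __init__(self, broadcast: Callable[[int, Insert], None]):
        self.log: List[Insert] = []   # stand-in for the durable op log
        self.broadcast = broadcast

    def submit(self, op: Insert) -> int:
        # 1. Transform against everything accepted since the client's base revision.
        for accepted in self.log[op.base_rev:]:
            op = transform(op, accepted)
        # 2. Append to the log first; the revision is the op's 1-based log index.
        self.log.append(op)
        rev = len(self.log)
        # 3. Only then broadcast (a real service would also ack the sender here).
        self.broadcast(rev, op)
        return rev


sent = []
svc = OpService(lambda rev, op: sent.append((rev, op.pos, op.text)))
svc.submit(Insert(pos=0, text="Hi", client="A", base_rev=0))
svc.submit(Insert(pos=0, text="!", client="B", base_rev=0))   # concurrent with A's op
```

B's insert was written against revision 0, so it is shifted past A's two characters and the document becomes "Hi!" for everyone. The tie-break (earlier-accepted op stays left) is one possible rule; what matters is that every participant applies the same one.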

Why central ordering helps

A single server-assigned revision sequence gives every client the same order to apply ops in — that total order is what makes OT tractable. A message queue in front of the op service buffers and serializes concurrent ops and adds durability.

The trap door the interviewer opens here. “Client A inserts at position 5 while Client B deletes the two characters at positions 2–3. What does A’s op become?” That’s OT in one question: A’s “insert at 5” must be transformed to “insert at 3” because two characters vanished before it. Showing you can reason through one transform (even informally) is the whole signal.
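
Here is one way to reason through that transform in code, treating the delete as "two characters starting at position 2". The `Insert` and `Delete` types and the rule for an insert that falls inside a deleted range are assumptions of this sketch, not a reference OT implementation.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Insert:
    pos: int
    text: str


@dataclass(frozen=True)
class Delete:
    pos: int      # index of the first deleted character
    length: int   # number of characters removed


def transform_insert_against_delete(ins: Insert, dele: Delete) -> Insert:
    """Adjust an insert so it applies correctly after a concurrent delete."""
    if ins.pos <= dele.pos:
        return ins                                       # delete is at or after the insert point
    if ins.pos >= dele.pos + dele.length:
        return Insert(ins.pos - dele.length, ins.text)   # shift left by what vanished
    return Insert(dele.pos, ins.text)                    # insert point was deleted: collapse to its start


def apply(doc: str, op: Union[Insert, Delete]) -> str:
    if isinstance(op, Insert):
        return doc[:op.pos] + op.text + doc[op.pos:]
    return doc[:op.pos] + doc[op.pos + op.length:]


doc = "abcdefgh"
a = Insert(5, "X")    # Client A
b = Delete(2, 2)      # Client B removes "cd"

a_prime = transform_insert_against_delete(a, b)
server_view = apply(apply(doc, b), a_prime)   # B accepted first, then A transformed
client_a_view = apply(apply(doc, a), b)       # B needs no shift: A's insert lands after the range
```

Both orders end at the same string, which is the convergence property the interviewer is probing for.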

▲ Allow — say this

“The server is the single ordering authority: it stamps each op with the next revision number and transforms concurrent ops against each other before broadcasting, so every client converges on the same sequence. The op log makes it durable and replayable.”

◆ Throttle — only with a reason

Leading with CRDTs. They’re the right call for offline-first or a spatial/structured data model — say which. For server-mediated linear text, OT is simpler and is what Google actually ships; reach for CRDT when the constraint demands it.

▼ Reject — never say this

“The client sends the whole updated document and the server saves it.” That can’t merge concurrent edits and throws away everyone else’s changes — the exact failure OT/CRDT exist to prevent.

04 Deep Dives — the stress test · ~15 min · where staff is decided

Grading this window: OT vs CRDT depth, WebSocket-server scaling, offline reconciliation, and snapshotting. Staff volunteers these; they're roughly 30% of the score.

OT vs CRDT — the decision, framed like a senior

OT treats edits as operations transformed against context; it needs a central server to order them, is mature, and is what Google Docs uses — but the central ordering becomes a scaling constraint. CRDTs are data types that mathematically converge without a coordinator; they shine for offline and decentralized cases but carry tombstone/metadata and compaction cost. The senior answer is not “OT” or “CRDT” — it’s: OT for the real-time hot path where a central server already exists; CRDT for offline reconciliation and non-text structured data where distributed merge is genuinely required.

Scaling WebSocket servers (the subtle one)

With many WebSocket servers, all editors of one document must still see each other. Two options: route a document’s editors to the same server (consistent hashing on docId), or put a pub/sub broker between the op service and the WebSocket layer so any server can rebroadcast a document’s ops to its connected clients. Pub/sub decouples connections from document affinity and scales cleaner.
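
The pub/sub option can be sketched with an in-process broker standing in for a real one. `Broker`, `WsServer`, and the one-topic-per-docId layout are hypothetical; a list plays the role of each client's socket.

```python
from collections import defaultdict


class Broker:
    """In-process stand-in for a pub/sub system, with one topic per docId."""

    def __init__(self):
        self._subs = defaultdict(list)

    def subscribe(self, doc_id, callback):
        self._subs[doc_id].append(callback)

    def publish(self, doc_id, op):
        for callback in self._subs[doc_id]:
            callback(doc_id, op)


class WsServer:
    def __init__(self, name, broker):
        self.name = name
        self.broker = broker
        self.clients = defaultdict(list)   # doc_id -> inboxes of locally connected clients

    def connect(self, doc_id, inbox):
        if not self.clients[doc_id]:
            # First local editor of this doc: subscribe this server once.
            self.broker.subscribe(doc_id, self._on_op)
        self.clients[doc_id].append(inbox)

    def _on_op(self, doc_id, op):
        # Rebroadcast to every local client editing this document.
        for inbox in self.clients[doc_id]:
            inbox.append(op)


broker = Broker()
ws1, ws2 = WsServer("ws-1", broker), WsServer("ws-2", broker)
alice, bob, carol = [], [], []
ws1.connect("doc-7", alice)   # co-editors land on different servers
ws2.connect("doc-7", bob)
ws2.connect("doc-9", carol)   # different document, same server as bob

op = {"rev": 413, "type": "insert", "pos": 29, "char": "R"}
broker.publish("doc-7", op)   # the op service publishes once per committed op
```

The op service never needs to know which server holds which socket; each WebSocket server subscribes only to the documents its own clients have open.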

Offline reconciliation

A client editing offline buffers ops against its last-known base revision. On reconnect, those ops are transformed against everything the server accepted in the interim (OT) — or merged via CRDT if offline is a first-class requirement. This is exactly the layer where OT’s central-ordering assumption breaks and CRDT earns its keep.
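
A sketch of the OT reconnect path, restricted to inserts to keep it short. The subtle part is that each buffered op was written on top of the previous buffered op, so the server's interim ops have to be carried forward through the chain as well. `rebase` and the tie-break flag are hypothetical names for this illustration.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Insert:
    pos: int
    text: str


def apply(doc: str, op: Insert) -> str:
    return doc[:op.pos] + op.text + doc[op.pos:]


def transform(op: Insert, other: Insert, other_wins_ties: bool) -> Insert:
    """Adjust `op` so it can be applied after `other` has been applied."""
    if other.pos < op.pos or (other.pos == op.pos and other_wins_ties):
        return replace(op, pos=op.pos + len(other.text))
    return op


def rebase(buffered: List[Insert], interim: List[Insert]) -> Tuple[List[Insert], List[Insert]]:
    """Return (buffered ops for the server, interim ops for the offline client)."""
    server_ops = list(interim)
    rebased = []
    for op in buffered:
        carried = []
        for s in server_ops:
            # Transform the pair against each other; server ops win position ties.
            op, s = transform(op, s, True), transform(s, op, False)
            carried.append(s)
        rebased.append(op)
        server_ops = carried
    return rebased, server_ops


buffered = [Insert(0, "XX"), Insert(2, "Y")]   # offline: "abc" -> "XXabc" -> "XXYabc"
interim = [Insert(1, "Z")]                     # server meanwhile: "abc" -> "aZbc"
for_server, for_client = rebase(buffered, interim)

server_doc = "aZbc"
for op in for_server:
    server_doc = apply(server_doc, op)

client_doc = "XXYabc"
for op in for_client:
    client_doc = apply(client_doc, op)
```

Transforming each buffered op against the original interim list without carrying it forward would place "Y" in the wrong spot here, which is the kind of bug a longer transform chain exposes.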

Snapshotting

Don’t replay a million ops on open. Snapshot the materialized document periodically; loading = latest snapshot + the few ops since. Compact the op log behind the snapshot.
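
A small sketch of "latest snapshot + the few ops since", assuming the log holds already-transformed inserts and that log entry i produced revision i+1. The `Snapshot` type and the 1,000-op log are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Op = Tuple[int, str]   # (pos, text) insert, already transformed by the server


@dataclass(frozen=True)
class Snapshot:
    rev: int        # revision the snapshot reflects
    content: str    # materialized document at that revision


def load(snapshot: Snapshot, log: List[Op]) -> Tuple[str, int]:
    """Rebuild the document; returns (content, number of ops replayed)."""
    doc = snapshot.content
    tail = log[snapshot.rev:]   # only ops newer than the snapshot
    for pos, text in tail:
        doc = doc[:pos] + text + doc[pos:]
    return doc, len(tail)


# Hypothetical history: 1,000 single-character appends.
log = [(i, str(i % 10)) for i in range(1000)]

full_doc, full_replayed = load(Snapshot(rev=0, content=""), log)

# Snapshot taken at revision 990, so opening the doc replays only the last 10 ops.
snap = Snapshot(rev=990, content=load(Snapshot(0, ""), log[:990])[0])
fast_doc, fast_replayed = load(snap, log)
```

Once a snapshot at revision 990 is durable, log entries before it are candidates for compaction, subject to whatever history or offline-sync retention the product needs.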

▲ Allow — say this (staff move)

“I’d use OT on the hot path and reach for CRDTs only where central ordering stops being free: offline edits and structured non-text data. That split mirrors what ships: Google Docs runs OT for linear text, while Figma chose a CRDT-inspired model for its structured canvas data.”

▼ Reject — never say this

“OT and CRDT are basically the same.” They solve the same problem with opposite assumptions about central authority — conflating them tells the interviewer you’ve only read the headline.

Scripted stress-test exchange

Interviewer

Two users edit the same position in the same millisecond. What happens?

You

The server orders them by revision — say A’s op gets revision N, B’s gets N+1. B’s op is then transformed against A’s before it’s applied and broadcast, so B’s intended insertion shifts to account for A’s. Both clients end up applying the same two ops in the same effective order and converge on an identical document. It’s emphatically not last-write-wins — both edits survive.

Interviewer

Now one of them was offline for an hour.

You

That client buffered its ops against the base revision it last saw. On reconnect, I transform each buffered op against every op the server accepted in the interim and apply them in order — same OT machinery, just a longer transform chain. If offline editing were a primary product requirement rather than an edge case, I’d move that path to a CRDT so the merge needs no central replay.

05 Wrap-up — operability & recap · ~6 min

Grading this window: Prove you could run it. Volunteer convergence observability and rollout; recap; name what you deferred.

Observability — the right metric is convergence

Convergence time: how long until all clients on a doc see the same state — spikes mean a transform bug or network issue.
Transform-conflict rate and client/server divergence events.
Edit-echo p99 against the 100ms budget; WebSocket connection health.
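
One possible definition of convergence time, sketched below: the gap between the server committing a revision and the last connected client acknowledging it. The `ConvergenceTracker` class, the ack protocol, and the explicit timestamps are assumptions made so the example stays testable.

```python
from typing import Dict, Optional, Set


class ConvergenceTracker:
    """Measures commit-to-last-ack time per revision for one document."""

    def __init__(self, clients: Set[str]):
        self.clients = set(clients)
        self.commit_time: Dict[int, float] = {}
        self.pending: Dict[int, Set[str]] = {}

    def committed(self, rev: int, at: float) -> None:
        """Server durably accepted `rev` at time `at` (seconds)."""
        self.commit_time[rev] = at
        self.pending[rev] = set(self.clients)

    def acked(self, rev: int, client: str, at: float) -> Optional[float]:
        """Record an ack; returns convergence time once the last client acks."""
        waiting = self.pending[rev]
        waiting.discard(client)
        if waiting:
            return None
        del self.pending[rev]
        return at - self.commit_time.pop(rev)


tracker = ConvergenceTracker({"A", "B"})
tracker.committed(413, at=10.000)
first = tracker.acked(413, "A", at=10.020)    # still waiting on B
second = tracker.acked(413, "B", at=10.085)   # everyone has applied rev 413
```

The returned value is what you would feed into a histogram and alert on; revisions that sit in `pending` for too long are the divergence candidates.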

Rollout

Roll transform-logic changes out carefully behind a flag with a fast rollback — a subtle OT bug corrupts documents, so canary on low-traffic docs first and watch divergence before widening.
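
A sketch of what "behind a flag, canary on low-traffic docs" could look like as a gating function. The bucketing by hashed docId, the `max_editors` threshold, and the function name are hypothetical choices, not a prescribed rollout system.

```python
import hashlib


def bucket(doc_id: str) -> int:
    """Stable bucket in [0, 100) so a document stays on one code path."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def use_new_transform(doc_id: str, active_editors: int, *,
                      flag_on: bool, percent: int, max_editors: int = 3) -> bool:
    if not flag_on:
        return False                   # kill switch: instant rollback
    if active_editors > max_editors:
        return False                   # keep hot documents on the old path
    return bucket(doc_id) < percent    # widen by raising `percent`


docs = [f"doc-{n}" for n in range(500)]
at_5 = {d for d in docs if use_new_transform(d, 1, flag_on=True, percent=5)}
at_25 = {d for d in docs if use_new_transform(d, 1, flag_on=True, percent=25)}
```

Because the bucket is derived from the docId, widening from 5% to 25% only adds documents; none flip back and forth between transform implementations mid-rollout.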

▲ Allow — say this

“With more time I’d detail rich-text formatting as structured ops and the full offline/CRDT layer. I scoped them out deliberately — I didn’t miss them.”

05 The follow-up gauntlet

The probes you'll get — and the answer that holds.

Interviewers probe the algorithm hard because bluffing is obvious. Reason through one transform, commit to a model, name the hybrid.

"Two users type at the same position, same millisecond — what happens?"

The server assigns each a revision number, giving a total order, then transforms the later op against the earlier one before applying and broadcasting. Both intentions are preserved and every client converges on the same document. It is not last-write-wins — both edits land, just position-adjusted.

"OT or CRDT — which?"

OT for the real-time hot path where a central server already orders ops — simpler, mature, what Google Docs uses. CRDT for offline reconciliation and non-text structured data where you need distributed merge without a coordinator. The senior answer is the hybrid: OT hot path, CRDT for the offline/structured layers.

"How do you scale WebSocket servers but keep co-editors in sync?"

Either route all editors of a document to the same server via consistent hashing on docId, or — cleaner — put a pub/sub broker between the op service and the WebSocket layer so any server can rebroadcast a document's ops to the clients it holds. Pub/sub decouples connection placement from document affinity.

"A user edits offline for an hour, then reconnects."

Their client buffered ops against the last base revision it saw. On reconnect I transform each buffered op against every op the server accepted in the interim and apply in order — same OT machinery. If offline is a first-class requirement, I'd use a CRDT for that path so merge needs no central replay.

"How do you avoid replaying a million ops when a doc opens?"

Periodic snapshots of the materialized document plus op-log compaction. Opening loads the latest snapshot and the handful of ops since it — not the entire edit history.

"How do you know convergence is actually correct in production?"

Log every transform and track convergence time — how long until all clients on a doc agree. Alert on convergence-time spikes and on any client/server divergence event; those are the early signal of a transform bug before users report scrambled text.

Handling a probe you can’t fully answer: show the mental model, not false fluency. “I can’t write the full transform function from memory, but the invariant is that transforming op X against op Y preserves both users’ intentions and yields the same result on every client — here’s how I’d reason about the insert-vs-delete case.”

06 What gets you downleveled

The flags that quietly tank an otherwise solid loop.

A clean design with one of these undercurrents still scores below the bar at senior+. None are about getting an answer wrong — they're about how you operate.

Drawing before scoping

Jumping to architecture without bounding the problem or confirming scale. Reads as template-matching.

Hedging without committing

"It depends" with no decision behind it. Name the trade-off, then pick.

Needing rescue on the hard part

Stalling at the core deep dive until the interviewer feeds you the answer. Depth is the senior+ bar.

Last-write-wins on edits

The anti-answer for collaborative editing — it silently destroys concurrent work. Shows you don't grasp the core requirement.

Bluffing OT/CRDT

Hand-waving the algorithm names without being able to reason through a single transform. Interviewers can tell instantly.

Skipping operations entirely

No observability, no rollout, no failure-mode plan. In 2026 this reads as "has never carried a pager."

Bluffing under a probe

Confident wrong answers when pushed. Far worse than an honest "here's what I'd verify."

Not driving

Waiting to be asked the next question. At staff you own the 45 minutes.

07 Your pre-loop scorecard

Self-grade before you walk in.

Run a mock and score yourself honestly against the dimensions the interviewer uses. If you can't hit "strong" on depth and operability, that's your signal on where to drill.

Dimension	Weak (downlevel)	Strong (at level)
Scoping	Framed it as a storage problem.	Named convergence under concurrent edits as the crux; pinned the sub-100ms budget; asked text-vs-canvas and offline.
Transport	HTTP polling or vague.	WebSocket with a one-line reason; sends op deltas tagged with base revision, not whole files.
OT/CRDT depth	Named them; couldn't reason about a transform.	Walked an insert-vs-delete transform; committed to OT and framed the CRDT hybrid for offline/structured data.
Central ordering	No notion of ordering or durability.	Server-assigned revisions for total order; durable op log before ack; periodic snapshots.
WebSocket scaling	One server holds everyone.	docId affinity via consistent hashing or a pub/sub rebroadcast layer so co-editors stay in sync.
Operability	Never mentioned it.	Tracked convergence time and transform-conflict rate; flag-gated, canaried transform changes.

The 60-second recap that lands the level

Quick recap: the crux is convergence, not storage; WebSocket transport carrying small op deltas; a central op service that assigns revision numbers for a total order, transforms concurrent ops (OT), and durably logs before broadcasting; OT on the hot path with a CRDT hybrid for offline and structured data; WebSocket servers scaled via docId affinity or a pub/sub rebroadcast layer; snapshots to avoid op replay; and convergence-time as the headline SLO. With more time: rich-text structured ops and the full CRDT offline layer.

The one mental model: collaborative editing is a convergence problem wearing a text-editor costume. Operations — not document states — flow to a central ordering authority that transforms them so every client lands in the same place. Say “the crux here is conflict resolution and convergence” in the first two minutes, then prove you can reason through one transform, and the room knows you’ve seen this family before.

Design Google Docs like both cursors are already typing.
