Design Google Docs: A System Design Interview Conversation

IdealResume Team | September 19, 2025 | 14 min read

The Interview Scenario

Interviewer: "Design Google Docs - a real-time collaborative document editor. You have 45 minutes."

Candidate: "This is a fascinating problem with real-time collaboration. Let me start with clarifying questions."

---

Phase 1: Requirements Clarification (5 minutes)

Candidate: "Key questions:

  1. **Collaboration** - How many simultaneous editors?
  2. **Document types** - Just text, or rich formatting, tables, images?
  3. **Offline support** - Should users edit offline and sync later?
  4. **History** - Version history and undo/redo?
  5. **Sharing** - Permissions model for documents?"

Interviewer: "Let's focus on:

  • Real-time collaboration with 20-50 simultaneous editors
  • Rich text editing (bold, italic, headings, lists)
  • Basic offline with sync
  • Version history (last 30 days)
  • Sharing with view/edit permissions
  • 100 million documents, 10 million daily active users"

Candidate: "Perfect. Critical non-functional requirements:

  • **Real-time sync** - <100ms latency for edits to appear for others
  • **Conflict resolution** - Multiple users editing the same line simultaneously
  • **No data loss** - Every keystroke must be preserved
  • **Consistency** - All users see the same document eventually
  • **Presence** - See who else is editing and their cursor position"

Interviewer: "Exactly. This is the core challenge."

---

Phase 2: The Core Problem - Conflict Resolution (8 minutes)

Candidate: "Before diving into architecture, let me address the hardest problem: what happens when two users type at the same position simultaneously?"

Example scenario:

```

Initial document: "Hello World"

User A types "Beautiful " at position 6: "Hello Beautiful World"

User B types "Amazing " at position 6: "Hello Amazing World"

Both happen at the same time. What's the result?

```

Option 1: Last Write Wins

  • Simple but loses User A's edit
  • Unacceptable for collaborative editing

Option 2: Locking

  • Lock the line or paragraph being edited
  • Causes terrible UX - users blocked from editing

Option 3: Operational Transformation (OT)

  • Transform concurrent operations to preserve intent
  • Google Docs uses this
  • Complex but gives best UX

Option 4: CRDTs (Conflict-free Replicated Data Types)

  • Data structure that automatically merges
  • Simpler reasoning but larger data overhead
  • Figma uses a CRDT-inspired approach

Interviewer: "Which would you choose?"

Candidate: "For a Google Docs clone, I'd use Operational Transformation because:

  1. More mature for text editing
  2. Smaller wire protocol (plain operations with positions, no per-character CRDT metadata)
  3. Proven at Google's scale

Let me explain how OT works:"

```

User A's operation: Insert("Beautiful ", position=6)

User B's operation: Insert("Amazing ", position=6)

Server receives A first, then B.

Server applies A: "Hello Beautiful World"

Now B's operation needs transformation:

  • B wanted to insert at position 6
  • But A, sequenced first, already inserted 10 characters at that same position
  • Tie-break: the earlier operation keeps its place, so B shifts right: 6 + 10 = 16

Transformed B: Insert("Amazing ", position=16)

Result: "Hello Beautiful Amazing World"

Both insertions preserved!

```
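The transformation above can be sketched in a few lines of Python. This is a minimal illustration that handles inserts only; the `Insert` type, the `transform` helper, and the `applied_wins_tie` flag are names invented for this sketch, and a real OT engine also has to cover deletes and formatting operations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    text: str
    pos: int


def apply(doc: str, op: Insert) -> str:
    """Apply an insert to a plain-string document."""
    return doc[:op.pos] + op.text + doc[op.pos:]


def transform(op: Insert, applied: Insert, applied_wins_tie: bool = True) -> Insert:
    """Rewrite `op` so it can run after `applied` has already been applied.

    Both ops are assumed to have been created against the same document state.
    """
    if applied.pos < op.pos or (applied.pos == op.pos and applied_wins_tie):
        # `applied` added text at or before our position: shift right by its length.
        return Insert(op.text, op.pos + len(applied.text))
    return op


doc = "Hello World"
a = Insert("Beautiful ", 6)
b = Insert("Amazing ", 6)

server_doc = apply(doc, a)            # server sequenced A first
b_prime = transform(b, a)             # 6 + 10 = 16
result = apply(server_doc, b_prime)
print(b_prime.pos, result)            # 16 Hello Beautiful Amazing World
```

The tie-break flag matters for convergence: User B's client, which applied its own insert first, transforms A with `applied_wins_tie=False` so that A still lands to the left and both replicas end with the same text.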

---

Phase 3: High-Level Architecture (10 minutes)

Candidate: "Now let me design the system:"

![Google Docs Real-Time Collaboration Architecture](/images/blog/google-docs-architecture.svg)

Interviewer: "Walk me through the edit flow."

Candidate: "Here's what happens when a user types:

Client Side:

  1. User types 'a' → Local operation created: Insert('a', pos=5)
  2. Operation applied immediately to local state (optimistic)
  3. Operation added to pending buffer
  4. Operation sent to server via WebSocket

Server Side:

  1. Collaboration Service receives operation
  2. Check operation's base revision vs current revision
  3. If behind, transform against all operations since that revision
  4. Apply transformed operation to server state
  5. Broadcast transformed operation to all other clients
  6. Acknowledge to sender with new revision number

Other Clients:

  1. Receive broadcast operation
  2. Transform against any pending local operations
  3. Apply to local state
  4. Update cursor positions

Key insight: The server is the source of truth for operation ordering. Clients apply operations optimistically but may need to rebase."
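The server-side steps can be sketched as a small in-memory session object. This is an assumption-laden illustration, not Google's implementation: `DocumentSession` and `receive` are hypothetical names, only inserts are modeled, and durability and broadcasting are left out.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Insert:
    text: str
    pos: int


def transform(op: Insert, earlier: Insert) -> Insert:
    """Shift `op` right if `earlier` inserted at or before its position."""
    if earlier.pos <= op.pos:
        return Insert(op.text, op.pos + len(earlier.text))
    return op


@dataclass
class DocumentSession:
    text: str = ""
    log: list = field(default_factory=list)  # log[i] produced revision i + 1

    @property
    def revision(self) -> int:
        return len(self.log)

    def receive(self, op: Insert, base_revision: int):
        """Rebase `op` onto the current revision, apply it, and log it."""
        if not 0 <= base_revision <= self.revision:
            raise ValueError("unknown base revision")
        for earlier in self.log[base_revision:]:   # ops the sender had not seen
            op = transform(op, earlier)
        self.text = self.text[:op.pos] + op.text + self.text[op.pos:]
        self.log.append(op)
        # Caller broadcasts `op` to other clients and ACKs the sender
        # with the new revision number.
        return op, self.revision


session = DocumentSession("Hello World")
session.receive(Insert("Beautiful ", 6), base_revision=0)
op, rev = session.receive(Insert("Amazing ", 6), base_revision=0)
print(op.pos, rev, session.text)   # 16 2 Hello Beautiful Amazing World
```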

Interviewer: "How do you ensure clients stay in sync?"

Candidate: "Several mechanisms:

  1. **Revision numbers** - Each operation has a base revision; server tracks current revision
  2. **Acknowledgments** - Server ACKs each operation with new revision
  3. **Operation log** - Server stores all operations in order
  4. **Periodic snapshots** - Full document state saved every N operations
  5. **Resync protocol** - If client falls too far behind, fetch snapshot and replay"

```
Client state:
  • Local revision: 42
  • Server acknowledged: 40
  • Pending operations: [op41, op42]

Server broadcast arrives: op from other user, base revision 40

Client must transform:
  • New op transformed against op41, then op42
  • op41 and op42 in turn transformed against the new op, so the pending buffer stays consistent with the server's state
  • New op then applied locally
```
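A client-side sketch of this rebase, under simplifying assumptions: inserts only, hypothetical `Client` and `transform` names, and a convention that the server-sequenced operation wins same-position ties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    text: str
    pos: int


def transform(op: Insert, other: Insert, other_first: bool) -> Insert:
    """Rewrite `op` to apply after `other`; `other_first` settles ties."""
    if other.pos < op.pos or (other.pos == op.pos and other_first):
        return Insert(op.text, op.pos + len(other.text))
    return op


class Client:
    def __init__(self, text: str, revision: int):
        self.text = text
        self.revision = revision   # last server revision this client has seen
        self.pending = []          # local ops not yet acknowledged

    def local_insert(self, text: str, pos: int) -> Insert:
        op = Insert(text, pos)
        self.text = self.text[:pos] + text + self.text[pos:]  # optimistic apply
        self.pending.append(op)
        return op

    def on_remote(self, remote: Insert) -> None:
        """Handle a broadcast op the server sequenced before our pending ops."""
        rebased = []
        for local in self.pending:
            rebased.append(transform(local, remote, other_first=True))
            remote = transform(remote, local, other_first=False)
        self.pending = rebased
        self.text = self.text[:remote.pos] + remote.text + self.text[remote.pos:]
        self.revision += 1

    def on_ack(self) -> None:
        self.pending.pop(0)        # oldest pending op is now part of server history
        self.revision += 1


client = Client("Hello World", revision=40)
client.local_insert("Amazing ", 6)           # still unacknowledged
client.on_remote(Insert("Beautiful ", 6))    # server sequenced this one first
print(client.text, client.pending[0].pos, client.revision)
# Hello Beautiful Amazing World 16 41
```

The client ends up with the same text the server will produce once it sequences the pending insert, which is the convergence property the revision numbers and ACKs exist to protect.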

---

Phase 4: Deep Dive - Scaling Collaboration (10 minutes)

Interviewer: "How does this scale to millions of documents with 50 editors each?"

Candidate: "Great question. Let me address scaling challenges:"

Challenge 1: WebSocket Connection Management

```
┌─────────────────────────────────────────────┐
│              WebSocket Gateway              │
│     (Handles 100K connections per node)     │
└─────────────────┬───────────────────────────┘
    ┌─────────────┼─────────────┐
    │             │             │
┌───▼───┐     ┌───▼───┐     ┌───▼───┐
│Collab │     │Collab │     │Collab │
│Node 1 │     │Node 2 │     │Node 3 │
│Doc A-M│     │Doc N-Z│     │Doc 0-9│
└───────┘     └───────┘     └───────┘
```

  • Partition documents across Collaboration nodes
  • Each document lives on ONE node (single leader for OT)
  • WebSocket Gateway routes based on document ID
  • If node fails, reassign documents to healthy nodes

Challenge 2: Document Session Affinity

Candidate: "All editors of a document MUST connect to the same Collaboration node:"

```

Document session routing:

  1. Client connects with doc_id
  2. Gateway checks Redis: "Which node owns doc_123?"
  3. If exists: route to that node
  4. If not: consistent hash to assign node, store in Redis
  5. Node loads document state from DB if cold start

```
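The routing steps can be sketched as follows. A plain dict stands in for Redis here, with `dict.setdefault` playing the role of Redis `SET key value NX` so that two gateways racing on a cold document agree on one owner. `HashRing`, `Router`, and the `doc:<id>:owner` key format are hypothetical names chosen for this example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)


class HashRing:
    """Consistent hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._keys = [k for k, _ in self._ring]

    def node_for(self, doc_id: str) -> str:
        # First ring position clockwise from the document's hash (wraps around).
        i = bisect.bisect(self._keys, _hash(doc_id)) % len(self._ring)
        return self._ring[i][1]


class Router:
    def __init__(self, ring: HashRing, registry=None):
        self.ring = ring
        self.registry = {} if registry is None else registry  # stand-in for Redis

    def route(self, doc_id: str) -> str:
        # Existing owner wins; otherwise the ring's choice is stored atomically.
        return self.registry.setdefault(
            f"doc:{doc_id}:owner", self.ring.node_for(doc_id)
        )


router = Router(HashRing(["node-1", "node-2", "node-3"]))
owner = router.route("doc_123")
print(owner == router.route("doc_123"))   # True: every editor reaches the same node
```

Storing the assignment, rather than rehashing on every connect, keeps an active document pinned to its node even if the ring membership changes mid-session.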

Challenge 3: Memory Management

Interviewer: "What if a document is huge?"

Candidate: "Documents can be large. Strategies:

  1. **Page/chunk loading** - Load only visible sections
  2. **Hot/cold separation** - Keep active documents in memory
  3. **LRU eviction** - Evict inactive documents after 30 minutes
  4. **Snapshot on eviction** - Save state before removing from memory

For a 100-page document:

  • Load pages 1-10 on open
  • Stream additional pages as user scrolls
  • OT operations include page/section identifier"
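Strategies 2 to 4 can be combined in one small cache. The sketch below assumes hypothetical `load` and `save_snapshot` callbacks supplied by the storage layer, and an injectable clock so the idle timeout can be tested without waiting.

```python
import time
from collections import OrderedDict


class DocumentCache:
    """Keeps active documents in memory; snapshots and evicts idle ones."""

    def __init__(self, load, save_snapshot, idle_seconds=30 * 60,
                 clock=time.monotonic):
        self._load = load
        self._save = save_snapshot
        self._idle = idle_seconds
        self._clock = clock
        self._docs = OrderedDict()  # doc_id -> (last_access, state), oldest first

    def get(self, doc_id):
        if doc_id in self._docs:
            _, state = self._docs.pop(doc_id)
        else:
            state = self._load(doc_id)               # cold start: read from storage
        self._docs[doc_id] = (self._clock(), state)  # re-insert as most recent
        return state

    def evict_idle(self):
        """Snapshot and drop every document idle for at least `idle_seconds`."""
        now, evicted = self._clock(), []
        while self._docs:
            doc_id, (last_access, state) = next(iter(self._docs.items()))
            if now - last_access < self._idle:
                break                     # all remaining entries are more recent
            self._save(doc_id, state)     # snapshot before removing from memory
            del self._docs[doc_id]
            evicted.append(doc_id)
        return evicted
```

A background task on each Collaboration node would call `evict_idle()` periodically; because entries are ordered by last access, each sweep stops at the first document that is still warm.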

Interviewer: "How do you handle node failure?"

Candidate: "Critical for data safety:

  • **Operation log in Kafka/Cassandra** - Durable before ACK
  • **Periodic snapshots** - Full state saved to storage
  • **On failure detection:**
      1. Redis marks document as 'migrating'
      2. New requests wait briefly
      3. Another node claims document
      4. Loads latest snapshot + replays ops from log
      5. Clients reconnect to new node
      6. Resume collaboration

Recovery time target: <5 seconds"
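A rough sketch of the recovery path, assuming inserts only and a dict standing in for the Redis ownership record. `Snapshot`, `replay`, and `fail_over` are illustrative names; the gap check reflects the "durable before ACK" rule, since a missing revision would mean an acknowledged edit was lost.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    revision: int
    text: str


def replay(snapshot: Snapshot, ops) -> Snapshot:
    """Rebuild state from a snapshot plus (revision, pos, text) log entries."""
    text, revision = snapshot.text, snapshot.revision
    for rev, pos, inserted in sorted(ops):
        if rev <= revision:
            continue                      # already contained in the snapshot
        if rev != revision + 1:
            raise RuntimeError(f"operation log gap before revision {rev}")
        text = text[:pos] + inserted + text[pos:]
        revision = rev
    return Snapshot(revision, text)


def fail_over(doc_id, new_node, registry, snapshot, ops) -> Snapshot:
    """Move `doc_id` to `new_node`; `registry` is a dict standing in for Redis."""
    registry[doc_id] = {"state": "migrating", "node": None}  # new requests wait
    recovered = replay(snapshot, ops)
    registry[doc_id] = {"state": "active", "node": new_node}  # clients reconnect
    return recovered


registry = {}
log = [(3, 6, "Beautiful "), (4, 16, "Amazing "), (2, 5, " World")]
state = fail_over("doc_123", "node-2", registry, Snapshot(2, "Hello World"), log)
print(state.revision, state.text)   # 4 Hello Beautiful Amazing World
```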

---

Phase 5: Offline Support (5 minutes)

Interviewer: "How does offline editing work?"

Candidate: "Offline adds complexity but is achievable:

When going offline:

  1. Client keeps full document copy locally (IndexedDB)
  2. Continues generating operations locally
  3. Operations stored in local queue

When reconnecting:

  1. Send all queued operations to server
  2. Server transforms each against operations that happened while offline
  3. May result in many transformations if offline for long time
  4. Potential for significant changes requiring user attention

Conflict indicators:

```
User was offline, typed "Chapter 1: Introduction"
Meanwhile, another user deleted that entire section

On reconnect:
  • Server detects conflict
  • Options:
      a) Apply best-effort merge
      b) Show user: "Your edit conflicts with recent changes"
      c) Create a 'branch' for user to manually resolve

I'd use option (b) with highlighting
```
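One way the server could detect this kind of conflict is to notice, while transforming a queued insert, that its position falls inside a range deleted in the meantime. The sketch below handles a single queued insert against an ordered list of deletes; `Delete`, `rebase_insert`, and the clamp-to-start rule are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    text: str
    pos: int


@dataclass(frozen=True)
class Delete:
    start: int
    end: int   # exclusive


def rebase_insert(op: Insert, server_deletes):
    """Transform an offline insert over deletes applied while the user was away.

    Each delete is expressed against the document as it stood after the
    previous one. Returns (rebased_op, conflicted).
    """
    pos, conflicted = op.pos, False
    for d in server_deletes:
        if d.end <= pos:
            pos -= d.end - d.start        # text removed before us: shift left
        elif d.start < pos:
            pos, conflicted = d.start, True   # our anchor text was deleted
    return Insert(op.text, pos), conflicted


doc = "AAAA BBBB CCCC"
offline_edit = Insert("x", 7)                       # typed inside "BBBB"
rebased, conflicted = rebase_insert(offline_edit, [Delete(5, 10)])
print(rebased.pos, conflicted)                      # 5 True
```

When `conflicted` is true, the client can still apply the rebased insert but highlight it, which corresponds to option (b) above.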

Interviewer: "Is this true CRDT behavior?"

Candidate: "Not quite. Pure OT assumes relatively real-time sync. For long offline periods, I'd actually consider a hybrid approach:

  • Use OT for real-time collaboration
  • Switch to CRDT-like merge for offline sync
  • CRDT-inspired merging is proven in production; Figma's multiplayer sync is built on it"

---

Phase 6: Additional Features (5 minutes)

Interviewer: "Briefly cover presence and version history."

Candidate: "Quick coverage:

Presence (cursor positions, who's online):

```
// Redis pub/sub per document
SUBSCRIBE doc:123:presence

// Each client publishes every 100ms
{
  user_id: 456,
  cursor_position: 1234,
  selection: [1234, 1250],
  color: "#FF5733"
}

// Heartbeat for online status
// Expire after 5 seconds of no heartbeat
```
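The heartbeat expiry can be sketched independently of the transport. `PresenceTracker` below is a hypothetical in-memory helper; in the design above its state would be fed by messages arriving on the per-document pub/sub channel.

```python
import time


class PresenceTracker:
    """Tracks who is online in one document based on recent heartbeats."""

    def __init__(self, ttl_seconds=5.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._last = {}   # user_id -> (last_heartbeat, payload)

    def heartbeat(self, user_id, cursor_position, selection=None, color=None):
        payload = {
            "user_id": user_id,
            "cursor_position": cursor_position,
            "selection": selection,
            "color": color,
        }
        self._last[user_id] = (self._clock(), payload)

    def online(self):
        """Drop users whose last heartbeat is older than the TTL, return the rest."""
        now = self._clock()
        self._last = {
            user: entry for user, entry in self._last.items()
            if now - entry[0] <= self._ttl
        }
        return [payload for _, payload in self._last.values()]
```

Presence updates are lossy by design: a dropped cursor message is replaced by the next one within 100ms, so nothing here needs the durability guarantees of the operation log.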

Version History:

```
// Operation log in Cassandra
CREATE TABLE operations (
  doc_id UUID,
  revision_id INT,
  operation BLOB,
  user_id UUID,
  timestamp TIMESTAMP,
  PRIMARY KEY (doc_id, revision_id)
);

// Named versions (user-created checkpoints)
CREATE TABLE versions (
  doc_id UUID,
  version_name TEXT,
  revision_id INT,
  snapshot BLOB,
  created_at TIMESTAMP,
  PRIMARY KEY (doc_id, revision_id)  // key choice is an assumption
);

// Restore = load nearest snapshot, replay operations up to revision
// Diff = apply operations sequentially, track changes
```
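Restoring a past version can be sketched as "nearest snapshot at or before the target, then replay forward". The function below is an illustration over in-memory dicts with insert-only operations; `restore` and its argument shapes are assumptions, not a storage API.

```python
def restore(target_revision, snapshots, ops):
    """Return document text as of `target_revision`.

    snapshots: {revision: text}
    ops:       {revision: (pos, inserted_text)}, one entry per revision
    """
    base = max((r for r in snapshots if r <= target_revision), default=None)
    if base is None:
        raise LookupError("no snapshot at or before the target revision")
    text = snapshots[base]
    for rev in range(base + 1, target_revision + 1):
        pos, inserted = ops[rev]
        text = text[:pos] + inserted + text[pos:]
    return text


snapshots = {0: "", 2: "Hello World"}
ops = {
    1: (0, "Hello"),
    2: (5, " World"),
    3: (6, "Beautiful "),
    4: (16, "Amazing "),
}
print(restore(3, snapshots, ops))   # Hello Beautiful World
```

Snapshot spacing bounds the replay cost: with a snapshot every N operations, restoring any revision touches at most N log entries.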

---

Phase 7: Trade-offs Summary (2 minutes)

Candidate: "Key trade-offs:"

| Decision | Chose | Alternative | Why |
|----------|-------|-------------|-----|
| Conflict resolution | OT | CRDT | Proven for text, smaller wire format |
| Document partitioning | Single leader per doc | Multi-leader | Simpler OT, stronger consistency |
| Operation storage | Cassandra | PostgreSQL | High write throughput, time-series friendly |
| Real-time transport | WebSocket | SSE/Long-polling | Bidirectional, lower latency |
| Offline sync | OT with conflict detection | Full CRDT | Better UX for real-time case |

Interviewer: "Great job covering the complexities of real-time collaboration."

---

Key Interview Takeaways

  1. **Start with conflict resolution** - It's the core problem; don't skip it
  2. **OT vs CRDT** - Know trade-offs; OT for text, CRDT for complex structures
  3. **Single leader per document** - Simplifies consistency dramatically
  4. **Optimistic local updates** - Apply immediately, transform if needed
  5. **Presence is separate** - Use pub/sub, not the main collaboration channel
