Design Google Docs: A System Design Interview Conversation
The Interview Scenario
Interviewer: "Design Google Docs - a real-time collaborative document editor. You have 45 minutes."
Candidate: "This is a fascinating problem with real-time collaboration. Let me start with clarifying questions."
---
Phase 1: Requirements Clarification (5 minutes)
Candidate: "Key questions:
- **Collaboration** - How many simultaneous editors?
- **Document types** - Just text, or rich formatting, tables, images?
- **Offline support** - Should users edit offline and sync later?
- **History** - Version history and undo/redo?
- **Sharing** - Permissions model for documents?"
Interviewer: "Let's focus on:
- Real-time collaboration with 20-50 simultaneous editors
- Rich text editing (bold, italic, headings, lists)
- Basic offline with sync
- Version history (last 30 days)
- Sharing with view/edit permissions
- 100 million documents, 10 million daily active users"
Candidate: "Perfect. Critical non-functional requirements:
- **Real-time sync** - <100ms latency for edits to appear for others
- **Conflict resolution** - Multiple users editing the same line simultaneously
- **No data loss** - Every keystroke must be preserved
- **Consistency** - All users see the same document eventually
- **Presence** - See who else is editing and their cursor position"
Interviewer: "Exactly. This is the core challenge."
---
Phase 2: The Core Problem - Conflict Resolution (8 minutes)
Candidate: "Before diving into architecture, let me address the hardest problem: what happens when two users type at the same position simultaneously?"
Example scenario:
```
Initial document: "Hello World"
User A types "Beautiful " at position 6: "Hello Beautiful World"
User B types "Amazing " at position 6: "Hello Amazing World"
Both happen at the same time. What's the result?
```
Option 1: Last Write Wins
- Simple but loses User A's edit
- Unacceptable for collaborative editing
Option 2: Locking
- Lock the line or paragraph being edited
- Causes terrible UX - users blocked from editing
Option 3: Operational Transformation (OT)
- Transform concurrent operations to preserve intent
- Google Docs uses this
- Complex but gives best UX
Option 4: CRDTs (Conflict-free Replicated Data Types)
- Data structure that automatically merges
- Simpler reasoning but larger data overhead
- Figma uses a CRDT-inspired approach (not a strict CRDT, since its server stays the central authority)
Interviewer: "Which would you choose?"
Candidate: "For a Google Docs clone, I'd use Operational Transformation because:
- More mature for text editing
- Smaller payloads (plain operations with positions, without the per-character IDs and metadata most text CRDTs carry)
- Proven at Google's scale
Let me explain how OT works:"
```
User A's operation: Insert("Beautiful ", position=6)
User B's operation: Insert("Amazing ", position=6)
Server receives A first, then B.
Server applies A: "Hello Beautiful World"
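Now B's operation needs transformation:
- B wanted to insert at position 6
- A already inserted 10 characters at that same position
- Tie-break: the operation the server applied first (A) keeps its place, so B shifts right
- Transform B's position: 6 + 10 = 16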
Transformed B: Insert("Amazing ", position=16)
Result: "Hello Beautiful Amazing World"
Both insertions preserved!
```
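The transformation above can be sketched in a few lines of Python. This is a minimal illustration that assumes insert-only operations and a "first applied wins ties" rule; `Insert`, `apply_op`, and `transform` are hypothetical names chosen for this example, not part of any real OT library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    text: str
    pos: int


def apply_op(doc: str, op: Insert) -> str:
    """Return the document with op.text spliced in at op.pos."""
    return doc[:op.pos] + op.text + doc[op.pos:]


def transform(op: Insert, applied: Insert) -> Insert:
    """Rewrite op so it can run after `applied` has already been applied.

    Ties go to `applied`: an insert at the same position shifts right.
    """
    if applied.pos <= op.pos:
        return Insert(op.text, op.pos + len(applied.text))
    return op


doc = "Hello World"
op_a = Insert("Beautiful ", 6)
op_b = Insert("Amazing ", 6)

doc = apply_op(doc, op_a)                 # server applies A first
op_b_transformed = transform(op_b, op_a)  # B is rebased over A
doc = apply_op(doc, op_b_transformed)
print(op_b_transformed.pos, repr(doc))    # 16 'Hello Beautiful Amazing World'
```

A production transform also has to cover deletes and formatting changes, and every pair of operation types needs its own rule.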
---
Phase 3: High-Level Architecture (10 minutes)
Candidate: "Now let me design the system:"

Interviewer: "Walk me through the edit flow."
Candidate: "Here's what happens when a user types:
Client Side:
- User types 'a' → Local operation created: Insert('a', pos=5)
- Operation applied immediately to local state (optimistic)
- Operation added to pending buffer
- Operation sent to server via WebSocket
Server Side:
- Collaboration Service receives operation
- Check operation's base revision vs current revision
- If behind, transform against all operations since that revision
- Apply transformed operation to server state
- Broadcast transformed operation to all other clients
- Acknowledge to sender with new revision number
Other Clients:
- Receive broadcast operation
- Transform against any pending local operations
- Apply to local state
- Update cursor positions
Key insight: The server is the source of truth for operation ordering. Clients apply operations optimistically but may need to rebase."
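As a rough sketch of the server-side steps, the following Python class keeps an in-memory operation log and rebases each incoming operation over everything the sender has not yet seen. It assumes insert-only operations; `DocumentServer` and `submit` are illustrative names, and durability, broadcasting, and authentication are left out.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Insert:
    text: str
    pos: int


def transform(op: Insert, applied: Insert) -> Insert:
    """Shift op right if `applied` landed at or before its position."""
    if applied.pos <= op.pos:
        return Insert(op.text, op.pos + len(applied.text))
    return op


@dataclass
class DocumentServer:
    text: str = ""
    log: list[Insert] = field(default_factory=list)  # log[i] produced revision i + 1

    @property
    def revision(self) -> int:
        return len(self.log)

    def submit(self, op: Insert, base_revision: int) -> tuple[Insert, int]:
        """Transform, apply, and log one client operation.

        Returns the transformed operation (what would be broadcast)
        and the new revision number (what would be ACKed to the sender).
        """
        if not 0 <= base_revision <= self.revision:
            raise ValueError("unknown base revision")
        for applied in self.log[base_revision:]:
            op = transform(op, applied)
        self.text = self.text[:op.pos] + op.text + self.text[op.pos:]
        self.log.append(op)
        return op, self.revision


server = DocumentServer("Hello")
# Two clients both edit against revision 0; the server sees the prefix first.
broadcast_1, rev_1 = server.submit(Insert(">> ", 0), base_revision=0)
broadcast_2, rev_2 = server.submit(Insert("!", 5), base_revision=0)
print(broadcast_2, rev_2, repr(server.text))
```

The second operation arrives one revision behind, so it is shifted from position 5 to position 8 before being applied.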
Interviewer: "How do you ensure clients stay in sync?"
Candidate: "Several mechanisms:
- **Revision numbers** - Each operation has a base revision; server tracks current revision
- **Acknowledgments** - Server ACKs each operation with new revision
- **Operation log** - Server stores all operations in order
- **Periodic snapshots** - Full document state saved every N operations
- **Resync protocol** - If client falls too far behind, fetch snapshot and replay
```
Client state:
- Local revision: 42
- Server acknowledged: 40
- Pending operations: [op41, op42]
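Server broadcast arrives: op from other user, base revision 40
Client must transform:
- New op transformed against op41, then op42
- op41 and op42 are in turn rebased over the new op, since the server ordered it first
- Then the transformed op is applied locally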
```"
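The client-side rebase can be sketched as follows. This is a simplified model that assumes insert-only operations and that remote operations win position ties because the server ordered them first; `Client`, `local_insert`, and `on_remote` are hypothetical names for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Insert:
    text: str
    pos: int


def apply_op(doc: str, op: Insert) -> str:
    return doc[:op.pos] + op.text + doc[op.pos:]


def transform(op: Insert, other: Insert, other_wins_ties: bool) -> Insert:
    """Rewrite op so it can be applied after `other`."""
    if other.pos < op.pos or (other.pos == op.pos and other_wins_ties):
        return Insert(op.text, op.pos + len(other.text))
    return op


@dataclass
class Client:
    text: str
    server_revision: int
    pending: list[Insert] = field(default_factory=list)  # sent, not yet ACKed

    def local_insert(self, text: str, pos: int) -> Insert:
        """Apply optimistically and remember the op until the server ACKs it."""
        op = Insert(text, pos)
        self.text = apply_op(self.text, op)
        self.pending.append(op)
        return op

    def on_remote(self, remote: Insert) -> None:
        """Fold a broadcast op into local state, rebasing pending ops over it."""
        rebased = []
        for mine in self.pending:
            remote_next = transform(remote, mine, other_wins_ties=False)
            rebased.append(transform(mine, remote, other_wins_ties=True))
            remote = remote_next
        self.pending = rebased
        self.text = apply_op(self.text, remote)
        self.server_revision += 1


client = Client(text="abc", server_revision=40)
client.local_insert("X", 1)          # local text: "aXbc"
client.local_insert("Z", 2)          # local text: "aXZbc"
client.on_remote(Insert("Y", 1))     # another user's edit, based on "abc"
print(repr(client.text), client.pending)
```

Replaying the remote operation and then the rebased pending operations on the original "abc" yields the same text the client now shows, which is the convergence property the revision scheme relies on.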
---
Phase 4: Deep Dive - Scaling Collaboration (10 minutes)
Interviewer: "How does this scale to millions of documents with 50 editors each?"
Candidate: "Great question. Let me address scaling challenges:"
Challenge 1: WebSocket Connection Management
```
┌─────────────────────────────────────────────┐
│              WebSocket Gateway              │
│     (Handles 100K connections per node)     │
└─────────────────┬───────────────────────────┘
                  │
    ┌─────────────┼─────────────┐
    │             │             │
┌───▼───┐     ┌───▼───┐     ┌───▼───┐
│Collab │     │Collab │     │Collab │
│Node 1 │     │Node 2 │     │Node 3 │
│Doc A-M│     │Doc N-Z│     │Doc 0-9│
└───────┘     └───────┘     └───────┘
```
- Partition documents across Collaboration nodes
- Each document lives on ONE node (single leader for OT)
- WebSocket Gateway routes based on document ID
- If node fails, reassign documents to healthy nodes
Challenge 2: Document Session Affinity
Candidate: "All editors of a document MUST connect to the same Collaboration node:"
```
Document session routing:
- Client connects with doc_id
- Gateway checks Redis: "Which node owns doc_123?"
- If exists: route to that node
- If not: consistent hash to assign node, store in Redis
- Node loads document state from DB if cold start
```
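One way to sketch this routing step in Python is a consistent-hash ring plus an ownership registry. Here a plain dict stands in for Redis, and `HashRing`, `DocumentRouter`, and the node names are assumptions made for the example; in a real deployment the "assign if absent" step would need to be atomic (for instance a Redis `SET` with the `NX` option).

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit hash so every gateway computes the same ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes: list[str], vnodes: int = 64) -> None:
        # Each node appears many times on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def node_for(self, doc_id: str) -> str:
        """First ring entry clockwise from the document's hash."""
        i = bisect.bisect_right(self._keys, _hash(doc_id)) % len(self._ring)
        return self._ring[i][1]


class DocumentRouter:
    def __init__(self, ring: HashRing) -> None:
        self.ring = ring
        self.registry: dict[str, str] = {}  # stand-in for the Redis ownership map

    def route(self, doc_id: str) -> str:
        """Return the owning node, assigning one only if none is recorded."""
        return self.registry.setdefault(doc_id, self.ring.node_for(doc_id))


router = DocumentRouter(HashRing(["collab-1", "collab-2", "collab-3"]))
owner = router.route("doc_123")
print(owner, router.route("doc_123") == owner)
```

Keeping the registry authoritative means an active document stays on its node even if the ring changes; only documents without a recorded owner are placed by the hash.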
Challenge 3: Memory Management
Interviewer: "What if a document is huge?"
Candidate: "Documents can be large. Strategies:
- **Page/chunk loading** - Load only visible sections
- **Hot/cold separation** - Keep active documents in memory
- **LRU eviction** - Evict the least recently used documents, e.g. once idle for 30 minutes
- **Snapshot on eviction** - Save state before removing from memory
For a 100-page document:
- Load pages 1-10 on open
- Stream additional pages as user scrolls
- OT operations include page/section identifier"
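The eviction and snapshot-on-eviction strategies could look roughly like this in Python. It is a sketch under assumptions: `DocumentCache` and `save_snapshot` are hypothetical names, document state is an opaque value, and the clock is injectable so the idle timeout can be exercised without waiting.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class DocumentCache:
    """Keeps active documents in memory, ordered from least to most recently used."""

    def __init__(
        self,
        save_snapshot: Callable[[str, Any], None],
        idle_seconds: float = 30 * 60,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._docs: "OrderedDict[str, tuple[Any, float]]" = OrderedDict()
        self._save_snapshot = save_snapshot
        self._idle_seconds = idle_seconds
        self._clock = clock

    def put(self, doc_id: str, state: Any) -> None:
        """Store state and mark the document as just used."""
        self._docs[doc_id] = (state, self._clock())
        self._docs.move_to_end(doc_id)

    def get(self, doc_id: str) -> Optional[Any]:
        """Return cached state (refreshing its idle timer) or None if not loaded."""
        entry = self._docs.get(doc_id)
        if entry is None:
            return None
        self.put(doc_id, entry[0])
        return entry[0]

    def evict_idle(self) -> list[str]:
        """Snapshot and drop every document idle for at least idle_seconds."""
        now = self._clock()
        evicted = []
        while self._docs:
            doc_id, (state, last_used) = next(iter(self._docs.items()))
            if now - last_used < self._idle_seconds:
                break  # everything after this entry was used more recently
            self._save_snapshot(doc_id, state)  # persist before freeing memory
            del self._docs[doc_id]
            evicted.append(doc_id)
        return evicted
```

A node would call `evict_idle()` from a periodic timer; a `get` that returns `None` is the cue to reload the document from storage.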
Interviewer: "How do you handle node failure?"
Candidate: "Critical for data safety:
- **Operation log in Kafka/Cassandra** - Durable before ACK
- **Periodic snapshots** - Full state saved to storage
- **On failure detection:**
  - Redis marks document as 'migrating'
  - New requests wait briefly
  - Another node claims document
  - Loads latest snapshot + replays ops from log
  - Clients reconnect to new node
  - Resume collaboration
Recovery time target: <5 seconds"
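The "latest snapshot + replay" step can be sketched in Python as below. It assumes insert-only operations and that each logged operation carries the revision it produced; `Snapshot`, `LoggedInsert`, and `recover` are illustrative names, and the log is a plain list rather than Kafka or Cassandra.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    revision: int
    text: str


@dataclass(frozen=True)
class LoggedInsert:
    revision: int  # revision number this operation produced
    pos: int
    text: str


def recover(snapshot: Snapshot, log: list[LoggedInsert]) -> tuple[str, int]:
    """Rebuild document state from a snapshot plus the durable operation log."""
    text, revision = snapshot.text, snapshot.revision
    for op in sorted(log, key=lambda o: o.revision):
        if op.revision <= revision:
            continue  # already reflected in the snapshot
        if op.revision != revision + 1:
            raise ValueError(f"gap in log after revision {revision}")
        text = text[:op.pos] + op.text + text[op.pos:]
        revision = op.revision
    return text, revision


snapshot = Snapshot(revision=2, text="Hello")
log = [
    LoggedInsert(1, 0, "Hel"),       # older than the snapshot, skipped
    LoggedInsert(2, 3, "lo"),        # older than the snapshot, skipped
    LoggedInsert(3, 5, " World"),
    LoggedInsert(4, 5, ","),
]
print(recover(snapshot, log))        # ('Hello, World', 4)
```

Refusing to continue past a gap matters here: applying revision 5 without revision 4 would silently corrupt positions for every later operation.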
---
Phase 5: Offline Support (5 minutes)
Interviewer: "How does offline editing work?"
Candidate: "Offline adds complexity but is achievable:
When going offline:
- Client keeps full document copy locally (IndexedDB)
- Continues generating operations locally
- Operations stored in local queue
When reconnecting:
- Send all queued operations to server
- Server transforms each against operations that happened while offline
- May result in many transformations if offline for long time
- Potential for significant changes requiring user attention
Conflict indicators:
```
User was offline, typed "Chapter 1: Introduction"
Meanwhile, another user deleted that entire section
On reconnect:
- Server detects conflict
- Options:
a) Apply best-effort merge
b) Show user: "Your edit conflicts with recent changes"
c) Create a 'branch' for user to manually resolve
I'd use option (b) with highlighting
```
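Conflict detection on reconnect can be approximated by rebasing each queued insert over the changes it missed and flagging the case where its target text no longer exists. The Python sketch below assumes only inserts and deletes; `rebase_offline_insert` and the operation classes are hypothetical names, and what the UI does with the flag (such as the highlighting in option b) is out of scope.

```python
from dataclasses import dataclass, replace
from typing import Union


@dataclass(frozen=True)
class Insert:
    pos: int
    text: str


@dataclass(frozen=True)
class Delete:
    pos: int
    length: int


Change = Union[Insert, Delete]


def rebase_offline_insert(op: Insert, missed: list[Change]) -> tuple[Insert, bool]:
    """Rebase an offline insert over server changes; flag it if its context was deleted."""
    conflict = False
    for change in missed:
        if isinstance(change, Insert):
            if change.pos <= op.pos:
                op = replace(op, pos=op.pos + len(change.text))
        else:
            end = change.pos + change.length
            if op.pos >= end:
                # Deletion was entirely before the insert point: shift left.
                op = replace(op, pos=op.pos - change.length)
            elif op.pos > change.pos:
                # Insert point was inside the deleted range: clamp and flag.
                op = replace(op, pos=change.pos)
                conflict = True
    return op, conflict


# Server text was "Title\nOld section\nEnd"; another user deleted "Old section\n".
missed = [Delete(pos=6, length=12)]
print(rebase_offline_insert(Insert(10, "new "), missed))
```

The flagged insert is still given a valid position, so the server can apply a best-effort merge while the client highlights the edit for the user to review.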
Interviewer: "Is this true CRDT behavior?"
Candidate: "Not quite. Pure OT assumes relatively real-time sync. For long offline periods, I'd actually consider a hybrid approach:
- Use OT for real-time collaboration
- Switch to CRDT-like merge for offline sync
- For comparison, Figma's multiplayer sync is CRDT-inspired throughout rather than OT-based"
---
Phase 6: Additional Features (5 minutes)
Interviewer: "Briefly cover presence and version history."
Candidate: "Quick coverage:
Presence (cursor positions, who's online):
```
// Redis pub/sub per document
SUBSCRIBE doc:123:presence
// Each client publishes every 100ms
{
  "user_id": 456,
  "cursor_position": 1234,
  "selection": [1234, 1250],
  "color": "#FF5733"
}
// Heartbeat for online status
// Expire after 5 seconds of no heartbeat
```
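The heartbeat-and-expiry rule can be sketched in Python with an in-memory tracker. This is an illustration only: `PresenceTracker` is a hypothetical class, the 5-second TTL comes from the comment above, and in the Redis-backed design the same effect could come from keys with an expiry rather than a dict.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Presence:
    user_id: int
    cursor_position: int
    selection: Optional[tuple[int, int]]
    color: str
    last_seen: float


class PresenceTracker:
    """Tracks who is online in one document based on recent heartbeats."""

    def __init__(
        self,
        ttl_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._users: dict[int, Presence] = {}

    def heartbeat(
        self,
        user_id: int,
        cursor_position: int,
        selection: Optional[tuple[int, int]] = None,
        color: str = "#000000",
    ) -> None:
        """Record the latest cursor state and refresh the user's expiry."""
        self._users[user_id] = Presence(
            user_id, cursor_position, selection, color, self._clock()
        )

    def online(self) -> list[Presence]:
        """Drop users whose last heartbeat is older than the TTL, return the rest."""
        now = self._clock()
        self._users = {
            uid: p for uid, p in self._users.items() if now - p.last_seen < self._ttl
        }
        return list(self._users.values())
```

Because presence is ephemeral, losing this state on a node restart is acceptable: clients repopulate it with their next heartbeat.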
Version History:
```
// Operation log in Cassandra
CREATE TABLE operations (
  doc_id UUID,
  revision_id INT,
  operation BLOB,
  user_id UUID,
  timestamp TIMESTAMP,
  PRIMARY KEY (doc_id, revision_id)
);
// Named versions (user-created checkpoints)
// Keyed by (doc_id, revision_id) here as an assumption
CREATE TABLE versions (
  doc_id UUID,
  version_name TEXT,
  revision_id INT,
  snapshot BLOB,
  created_at TIMESTAMP,
  PRIMARY KEY (doc_id, revision_id)
);
// Restore = load the nearest snapshot at or before the target revision, then replay operations up to it
// Diff = apply operations sequentially, track changes
```"
---
Phase 7: Trade-offs Summary (2 minutes)
Candidate: "Key trade-offs:
| Decision | Chose | Alternative | Why |
|----------|-------|-------------|-----|
| Conflict resolution | OT | CRDT | Proven for text, smaller wire format |
| Document partitioning | Single leader per doc | Multi-leader | Simpler OT, stronger consistency |
| Operation storage | Cassandra | PostgreSQL | High write throughput, time-series friendly |
| Real-time transport | WebSocket | SSE/Long-polling | Bidirectional, lower latency |
| Offline sync | OT with conflict detection | Full CRDT | Better UX for real-time case |
Interviewer: "Great job covering the complexities of real-time collaboration."
---
Key Interview Takeaways
- **Start with conflict resolution** - It's the core problem; don't skip it
- **OT vs CRDT** - Know trade-offs; OT for text, CRDT for complex structures
- **Single leader per document** - Simplifies consistency dramatically
- **Optimistic local updates** - Apply immediately, transform if needed
- **Presence is separate** - Use pub/sub, not the main collaboration channel