# Design Reddit: A System Design Interview Conversation
## The Interview Scenario
Interviewer: "Design Reddit - a social news aggregation platform. You have 45 minutes."
Candidate: "Excellent, I'd like to start with some clarifying questions to understand what we're building."
---
## Phase 1: Requirements Clarification (5 minutes)
Candidate: "Let me understand the scope:
- **Core features** - Posts, comments, voting, or all three?
- **Subreddits** - Should we support communities/topics?
- **Feeds** - Home feed, popular, subreddit-specific?
- **Ranking** - How should posts be ordered? Hot, new, top?
- **Scale** - What's our target user base?"
Interviewer: "Let's include:
- Posts (text, links, images) and comments
- Subreddits (communities)
- Upvote/downvote system
- Multiple feed types (hot, new, top, home)
- 500 million monthly users, 50 million daily active
- Real-time vote updates on active posts"
Candidate: "Got it. Non-functional requirements:
- **Read-heavy** - Most users browse, fewer post
- **Near real-time** - Votes should update within seconds
- **Consistency** - Vote counts can be eventually consistent but must converge to accurate totals
- **Scalability** - Handle viral posts with millions of votes
- **Low latency** - Feed generation under 200ms"
Interviewer: "Good assumptions. Continue."
---
## Phase 2: Back-of-Envelope Calculations (5 minutes)
Candidate: "Let me calculate the scale:
Content volume:
- 50M DAU
- 1% post daily = 500K new posts/day
- 10% comment daily = 5M new comments/day
- Average post: 1KB text + 500KB media = ~500KB
Votes:
- Average user: 10 votes per session
- 50M DAU × 10 votes = **500M votes/day**
- ~5,800 votes/second average
- Peak (viral post): 50,000+ votes/second
Feed reads:
- Each user loads ~20 pages per session
- 50M × 20 = 1 billion feed loads/day
- ~11,500 feed requests/second
Storage:
- Posts: 500K × 500KB = 250 GB/day
- Growing at ~90 TB/year for media
- Metadata (votes, comments): much smaller"
Interviewer: "What's the most challenging part here?"
Candidate: "Three challenges:
- **Vote velocity** - 50K votes/second on viral posts
- **Feed generation** - Ranking millions of posts in real-time
- **Consistency** - Accurate counts despite massive write volume"
---
## Phase 3: High-Level Design (10 minutes)
Candidate: "Here's my architecture:"

Interviewer: "Explain the voting flow."
Candidate: "Voting needs special handling due to volume:
Vote Flow:
- User clicks upvote → Vote Service receives request
- **Idempotency check** - User can only vote once per post
- **Write to cache first** - Increment counter in Redis
- **Queue the event** - Push to Kafka for async processing
- **Return immediately** - User sees updated count
- **Background processing** - Consumer writes to PostgreSQL
- **Batch aggregation** - Periodic sync ensures DB consistency
Why this approach:
- Users get instant feedback
- Redis handles 50K writes/second easily
- Kafka provides durability before DB write
- Background processing handles DB bottleneck"
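A minimal sketch of that write path in Python, assuming redis-py and kafka-python clients (the hosts, topic name, and key layout are illustrative, not Reddit's actual code):
```python
import json
import redis
from kafka import KafkaProducer

# Illustrative clients; hosts, topic, and key names are assumptions.
r = redis.Redis(decode_responses=True)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode(),
)

def cast_vote(user_id: int, post_id: int, vote: int) -> int:
    """Cache-first vote write: instant feedback, durable via Kafka."""
    # Idempotency check: SADD returns 0 if the user already voted.
    if not r.sadd(f"votes:post:{post_id}", f"user:{user_id}"):
        return int(r.get(f"score:post:{post_id}") or 0)  # duplicate: no-op
    # Write to cache first so readers see the new count immediately.
    new_score = r.incrby(f"score:post:{post_id}", vote)
    # Queue the event; a background consumer batch-writes to PostgreSQL.
    producer.send("votes", {"user_id": user_id, "post_id": post_id, "vote": vote})
    return new_score  # return immediately; the DB write happens asynchronously
```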
Interviewer: "How do you prevent double voting?"
Candidate: "Multi-layer protection:
- **Redis Set** - Store (user_id, post_id) pairs for recent votes
- **Database constraint** - Unique index on (user_id, post_id)
- **Idempotent operations** - Vote endpoint is idempotent
- **Change vote** - Clicking same button again removes vote
```
// Redis for fast check
SISMEMBER votes:post:123 user:456
// If not in Redis, check DB (cache miss)
SELECT vote FROM votes WHERE post_id=123 AND user_id=456
```"
---
## Phase 4: Deep Dive - Feed Ranking (10 minutes)
Interviewer: "How does feed generation work? Especially the 'Hot' algorithm?"
Candidate: "This is Reddit's secret sauce. Let me explain:"
Hot Ranking Algorithm:
```python
from math import log10

def hot(upvotes, downvotes, post_ts, reference_ts=1134028003):
    s = upvotes - downvotes
    order = log10(max(abs(s), 1))               # logarithmic vote scaling
    sign = 1 if s > 0 else -1 if s < 0 else 0   # net-downvoted posts sink
    # Time decay factor: newer posts get a larger linear bonus
    seconds = post_ts - reference_ts
    # Final hot score
    return sign * order + seconds / 45000
```
Why this works:
- Logarithmic vote scaling prevents runaway posts
- Time decay means new posts can compete
- Reference timestamp normalizes across time
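For intuition, a quick comparison using the hot() function above (timestamps are illustrative):
```python
# A day-old post with ~1900 net votes...
old = hot(2000, 100, post_ts=1_700_000_000 - 86_400)
# ...versus a fresh post with ~140 net votes:
new = hot(150, 10, post_ts=1_700_000_000)
# Vote gap: log10(1900) - log10(140) ≈ 3.28 - 2.15 = 1.13 points.
# Decay gap: 86400 / 45000 ≈ 1.92 points in the fresh post's favor.
assert new > old  # one day of decay outweighs a ~13x vote lead
```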
Feed Generation Options:
Option 1: Compute on request
```
SELECT * FROM posts
WHERE subreddit_id IN (/* user's subscribed subreddit IDs */)
ORDER BY hot_score DESC
LIMIT 25
```
- **Pro:** Always fresh
- **Con:** Too slow at scale (computing across millions of posts)
Option 2: Pre-computed feeds (chosen)
```
┌─────────────────────────────────────────────┐
│ Feed Generation Pipeline                    │
├─────────────────────────────────────────────┤
│ 1. Score Update Worker                      │
│    - Recalculates hot scores every minute   │
│    - Only for posts < 48 hours old          │
│                                             │
│ 2. Feed Builder                             │
│    - Generates top 1000 posts per subreddit │
│    - Stores in Redis sorted sets            │
│                                             │
│ 3. Home Feed Mixer                          │
│    - Merges user's subscribed subreddits    │
│    - Applies personalization                │
└─────────────────────────────────────────────┘
```
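A sketch of the Score Update Worker feeding the per-subreddit sorted sets, assuming redis-py and the hot() function above (the key layout and post-dict shape are assumptions):
```python
import time
import redis

r = redis.Redis(decode_responses=True)

def refresh_feed(subreddit_id: int, recent_posts: list[dict]) -> None:
    """Rescore recent posts and rebuild this subreddit's hot feed."""
    now = time.time()
    scores = {
        str(p["post_id"]): hot(p["upvotes"], p["downvotes"], p["created_at"])
        for p in recent_posts
        if now - p["created_at"] < 48 * 3600   # only posts < 48 hours old
    }
    if scores:
        key = f"subreddit:{subreddit_id}:hot"
        r.zadd(key, scores)                    # sorted set keyed by hot score
        r.zremrangebyrank(key, 0, -1001)       # keep only the top 1000
```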
Interviewer: "How do you generate the personalized home feed?"
Candidate: "Home feed is more complex:
Fan-out on read:
- User requests home feed
- Get user's subscribed subreddits (cached)
- Fetch top N posts from each subreddit (Redis sorted sets)
- Merge-sort by hot score
- Apply filters (already seen, blocked users)
- Return top 25
Optimization - Tiered approach:
- **Small subreddits (<100K members):** Merge on read
- **Large subreddits (>100K):** Pre-compute into user's feed
Caching strategy:
```
// Redis sorted sets per subreddit (ranges are inclusive: 0 99 = top 100)
ZREVRANGE subreddit:programming:hot 0 99
// User's computed feed (cached 5 minutes; 0 24 = first page of 25)
LRANGE user:456:feed 0 24
```"
Interviewer: "What about real-time updates on vote counts?"
Candidate: "Two approaches:
1. Polling (simpler):
- Client polls every 30 seconds for active posts
- Batch request: 'Give me current scores for posts [1,2,3,4,5]'
- Efficient enough for most cases
2. WebSocket (for viral posts):
- When post goes viral (>1000 votes/minute)
- Subscribe clients to real-time updates
- Server pushes vote count changes
I'd start with polling and add WebSocket for hot posts later."
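The batch poll can be served from Redis in a single pipelined round trip; a sketch, assuming the per-post score counters from the voting section:
```python
import redis

r = redis.Redis(decode_responses=True)

def batch_scores(post_ids: list[int]) -> dict[int, int]:
    """'Give me current scores for posts [1,2,3,4,5]' in one round trip."""
    pipe = r.pipeline()
    for pid in post_ids:
        pipe.get(f"score:post:{pid}")   # counters maintained by the vote path
    return {pid: int(v or 0) for pid, v in zip(post_ids, pipe.execute())}
```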
---
## Phase 5: Data Model (8 minutes)
Interviewer: "Walk me through your database schema."
Candidate: "Here's the core schema:"
```sql
-- Sharded by subreddit_id
CREATE TABLE posts (
    post_id      BIGINT PRIMARY KEY,
    subreddit_id INT,
    author_id    BIGINT,
    title        VARCHAR(300),
    content      TEXT,
    url          VARCHAR(500),
    created_at   TIMESTAMP,
    upvotes      INT DEFAULT 0,
    downvotes    INT DEFAULT 0,
    hot_score    FLOAT
);
CREATE INDEX idx_posts_hot ON posts (subreddit_id, hot_score);
CREATE INDEX idx_posts_new ON posts (subreddit_id, created_at);
-- Sharded by post_id
CREATE TABLE comments (
    comment_id        BIGINT PRIMARY KEY,
    post_id           BIGINT,
    parent_comment_id BIGINT NULL,  -- for nested comments
    author_id         BIGINT,
    content           TEXT,
    created_at        TIMESTAMP,
    upvotes           INT DEFAULT 0,
    downvotes         INT DEFAULT 0
);
-- Sharded by post_id
CREATE TABLE votes (
    user_id    BIGINT,
    post_id    BIGINT,
    vote_type  SMALLINT,  -- 1 = up, -1 = down
    created_at TIMESTAMP,
    PRIMARY KEY (user_id, post_id)
);
-- Sharded by user_id
CREATE TABLE subscriptions (
    user_id      BIGINT,
    subreddit_id INT,
    PRIMARY KEY (user_id, subreddit_id)
);
```
Interviewer: "How do you handle nested comments efficiently?"
Candidate: "Comments threading is tricky. Options:
Option 1: Adjacency List (simple)
- Store parent_comment_id
- Requires recursive queries or multiple round trips
- Good for shallow threads
Option 2: Materialized Path (chosen)
```sql
comments (
comment_id BIGINT,
path VARCHAR(255), -- e.g., '1.5.23.156'
...
)
-- Get all children
SELECT * FROM comments
WHERE path LIKE '1.5.23.%'
ORDER BY path
```
- Single query gets entire thread
- Sorting by path yields tree order (segments must be zero-padded or fixed-width so '10' doesn't sort before '2')
- Good for Reddit's deep threads
Option 3: Nested Sets
- Complex updates but fast reads
- Not ideal for Reddit's write-heavy comments"
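A small helper illustrating why path segments should be fixed-width (the padding width of 8 is an arbitrary choice):
```python
def child_path(parent_path: str, child_id: int, width: int = 8) -> str:
    """Materialized path whose lexicographic order matches tree order."""
    segment = str(child_id).zfill(width)  # zero-pad: '10' must not sort before '2'
    return f"{parent_path}.{segment}" if parent_path else segment

p2 = child_path(child_path("", 1), 2)     # '00000001.00000002'
p10 = child_path(child_path("", 1), 10)   # '00000001.00000010'
assert p2 < p10  # ORDER BY path now matches numeric comment order
```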
---
## Phase 6: Trade-offs (5 minutes)
Interviewer: "Key trade-offs in your design?"
Candidate: "Several important decisions:
1. Vote counting approach:
- **Chose:** Redis + async DB writes
- **Alternative:** Direct DB writes
- **Trade-off:** Eventual consistency vs strong consistency
- **Why:** Speed matters more than exact count at any moment
2. Feed computation:
- **Chose:** Pre-computed with periodic refresh
- **Alternative:** Real-time computation
- **Trade-off:** Slight staleness vs latency
- **Why:** Can't compute across millions of posts in 200ms
3. Comment threading:
- **Chose:** Materialized path
- **Alternative:** Recursive CTEs
- **Trade-off:** Storage overhead vs query complexity
- **Why:** Reddit threads go very deep; need single-query fetch
4. Hot score storage:
- **Chose:** Denormalized on posts table
- **Alternative:** Compute on read
- **Trade-off:** Storage + update cost vs read latency
- **Why:** Read-heavy system; worth the write overhead"
Interviewer: "What breaks at 10x scale?"
Candidate: "Main concerns:
- **Feed generation** - Need more aggressive caching, possibly per-user pre-computation
- **Vote processing** - Kafka partition strategy becomes critical
- **Popular subreddits** - Hot shards; might need dedicated resources for top subreddits"
---
## Key Interview Takeaways
- **Voting at scale** - Cache first, queue, batch write to DB
- **Feed ranking** - Pre-compute scores, don't compute on request
- **Hot algorithm** - Log scale votes + linear time decay
- **Comments** - Materialized path for deep threading
- **Real-time** - Polling is fine; WebSocket for exceptional cases