Design Reddit: A System Design Interview Conversation
System Design Interview

IdealResume Team · September 24, 2025 · 12 min read

The Interview Scenario

Interviewer: "Design Reddit - a social news aggregation platform. You have 45 minutes."

Candidate: "Excellent, I'd like to start with some clarifying questions to understand what we're building."

---

Phase 1: Requirements Clarification (5 minutes)

Candidate: "Let me understand the scope:

  1. **Core features** - Posts, comments, voting, or all three?
  2. **Subreddits** - Should we support communities/topics?
  3. **Feeds** - Home feed, popular, subreddit-specific?
  4. **Ranking** - How should posts be ordered? Hot, new, top?
  5. **Scale** - What's our target user base?"

Interviewer: "Let's include:

  • Posts (text, links, images) and comments
  • Subreddits (communities)
  • Upvote/downvote system
  • Multiple feed types (hot, new, top, home)
  • 500 million monthly users, 50 million daily active
  • Real-time vote updates on active posts"

Candidate: "Got it. Non-functional requirements:

  • **Read-heavy** - Most users browse, fewer post
  • **Near real-time** - Votes should update within seconds
  • **Consistency** - Vote counts should be accurate eventually
  • **Scalability** - Handle viral posts with millions of votes
  • **Low latency** - Feed generation under 200ms"

Interviewer: "Good assumptions. Continue."

---

Phase 2: Back-of-Envelope Calculations (5 minutes)

Candidate: "Let me calculate the scale:

Content volume:

  • 50M DAU
  • 1% post daily = 500K new posts/day
  • 10% comment daily = 5M new comments/day
  • Average post: ~1KB of text; ~500KB including media, so budget ~500KB/post

Votes:

  • Average user: 10 votes per session
  • 50M DAU × 10 votes = **500M votes/day**
  • ~5,800 votes/second average
  • Peak (viral post): 50,000+ votes/second

Feed reads:

  • Each user loads ~20 pages per session
  • 50M × 20 = 1 billion feed loads/day
  • ~11,500 feed requests/second

Storage:

  • Posts: 500K × 500KB = 250 GB/day
  • Growing at ~90 TB/year for media
  • Metadata (votes, comments): much smaller"
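
The arithmetic holds up. Here's a quick sanity-check script using only the assumptions above:

```python
DAU = 50_000_000
SEC_PER_DAY = 86_400

posts_per_day = DAU * 0.01                    # 1% post daily  -> 500K
comments_per_day = DAU * 0.10                 # 10% comment    -> 5M
votes_per_sec = DAU * 10 / SEC_PER_DAY        # ~5,787  (quoted as ~5,800)
feeds_per_sec = DAU * 20 / SEC_PER_DAY        # ~11,574 (quoted as ~11,500)

media_gb_per_day = posts_per_day * 500 / 1e6      # 500KB/post -> 250 GB/day
media_tb_per_year = media_gb_per_day * 365 / 1e3  # ~91 TB/year
```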

Interviewer: "What's the most challenging part here?"

Candidate: "Three challenges:

  1. **Vote velocity** - 50K votes/second on viral posts
  2. **Feed generation** - Ranking millions of posts in real-time
  3. **Consistency** - Accurate counts despite massive write volume"

---

Phase 3: High-Level Design (10 minutes)

Candidate: "Here's my architecture:"

![Reddit High Level Architecture](/images/blog/reddit-architecture.svg)

Interviewer: "Explain the voting flow."

Candidate: "Voting needs special handling due to volume:

Vote Flow:

  1. User clicks upvote → Vote Service receives request
  2. **Idempotency check** - User can only vote once per post
  3. **Write to cache first** - Increment counter in Redis
  4. **Queue the event** - Push to Kafka for async processing
  5. **Return immediately** - User sees updated count
  6. **Background processing** - Consumer writes to PostgreSQL
  7. **Batch aggregation** - Periodic sync ensures DB consistency

Why this approach:

  • Users get instant feedback
  • Redis handles 50K writes/second easily
  • Kafka provides durability before DB write
  • Background processing handles DB bottleneck"
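
A minimal sketch of that write path, assuming redis-py and kafka-python (key names, topic, and the `cast_vote` helper are illustrative, not a real Reddit API):

```python
import json

import redis
from kafka import KafkaProducer  # kafka-python

r = redis.Redis(decode_responses=True)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode(),
)

def cast_vote(user_id: int, post_id: int, vote: int) -> int:
    """vote is +1 or -1. Returns the updated cached score."""
    # Step 2: idempotency - SADD returns 0 if this user already voted here.
    if not r.sadd(f"votes:post:{post_id}", user_id):
        raise ValueError("already voted")

    # Step 3: cache-first write; readers see the new count immediately.
    new_score = r.incrby(f"score:post:{post_id}", vote)

    # Step 4: durable event for the async consumer that updates PostgreSQL.
    producer.send("vote-events", {"user": user_id, "post": post_id, "vote": vote})

    # Step 5: return without waiting for the DB.
    return new_score
```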

Interviewer: "How do you prevent double voting?"

Candidate: "Multi-layer protection:

  1. **Redis Set** - Per-post set of recent voter IDs for a fast membership check
  2. **Database constraint** - Unique index on (user_id, post_id)
  3. **Idempotent operations** - Vote endpoint is idempotent
  4. **Change vote** - Clicking same button again removes vote

```
// Fast path: check Redis
SISMEMBER votes:post:123 user:456

// If not in Redis, check DB (cache miss)
SELECT vote FROM votes WHERE post_id=123 AND user_id=456
```"

---

Phase 4: Deep Dive - Feed Ranking (10 minutes)

Interviewer: "How does feed generation work? Especially the 'Hot' algorithm?"

Candidate: "This is Reddit's secret sauce. Let me explain:"

Hot Ranking Algorithm:

```
# Log-scaled net vote score
score = log10(max(abs(upvotes - downvotes), 1))

# Time decay factor
seconds = post_timestamp - reference_timestamp
time_factor = seconds / 45000

# Final hot score
hot_score = score + time_factor
```

Why this works:

  • Logarithmic vote scaling prevents runaway posts
  • Time decay means new posts can compete
  • Reference timestamp normalizes across time
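
A worked example of that interplay (a sketch; the reference constant is arbitrary, it only has to be fixed):

```python
from math import log10

REFERENCE = 1_134_028_003  # any fixed epoch works

def hot_score(upvotes: int, downvotes: int, created_at: int) -> float:
    return log10(max(abs(upvotes - downvotes), 1)) + (created_at - REFERENCE) / 45000

now = 1_700_000_000
fresh = hot_score(100, 0, now - 3_600)     # 1h old, 100 net votes  -> log term 2.0
stale = hot_score(1_000, 0, now - 86_400)  # 24h old, 1000 net votes -> log term 3.0
# The extra 23 hours of age cost 82_800 / 45_000 ≈ 1.84 points, more than the
# extra log10 point, so the fresh post ranks higher:
print(fresh > stale)  # True
```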

Feed Generation Options:

Option 1: Compute on request

```
SELECT * FROM posts
WHERE subreddit_id IN (user_subscriptions)
ORDER BY hot_score DESC
LIMIT 25
```

  • **Pro:** Always fresh
  • **Con:** Too slow at scale (computing across millions of posts)

Option 2: Pre-computed feeds (chosen)

```
┌─────────────────────────────────────────────┐
│ Feed Generation Pipeline                    │
├─────────────────────────────────────────────┤
│ 1. Score Update Worker                      │
│    - Recalculates hot scores every minute   │
│    - Only for posts < 48 hours old          │
│                                             │
│ 2. Feed Builder                             │
│    - Generates top 1000 posts per subreddit │
│    - Stores in Redis sorted sets            │
│                                             │
│ 3. Home Feed Mixer                          │
│    - Merges user's subscribed subreddits    │
│    - Applies personalization                │
└─────────────────────────────────────────────┘
```
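
A sketch of the Feed Builder step, assuming redis-py (the key name and `rebuild_subreddit_feed` helper are illustrative):

```python
import redis

r = redis.Redis(decode_responses=True)

def rebuild_subreddit_feed(subreddit: str, scored_posts: dict[str, float]) -> None:
    """scored_posts: post_id -> freshly recomputed hot score."""
    key = f"subreddit:{subreddit}:hot"
    pipe = r.pipeline()
    pipe.zadd(key, scored_posts)         # upsert scores into the sorted set
    pipe.zremrangebyrank(key, 0, -1001)  # trim: keep only the top 1000
    pipe.execute()
```

Since sorted-set ranks are ascending by score, `ZREMRANGEBYRANK key 0 -1001` drops everything below the 1000 highest-scored posts in one command.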

Interviewer: "How do you generate the personalized home feed?"

Candidate: "Home feed is more complex:

Fan-out on read:

  1. User requests home feed
  2. Get user's subscribed subreddits (cached)
  3. Fetch top N posts from each subreddit (Redis sorted sets)
  4. Merge-sort by hot score
  5. Apply filters (already seen, blocked users)
  6. Return top 25

Optimization - Tiered approach:

  • **Small subreddits (<100K members):** Merge on read
  • **Large subreddits (>100K):** Pre-compute into user's feed

Caching strategy:

```
// Redis sorted set per subreddit
ZREVRANGE subreddit:programming:hot 0 100

// User's computed feed (cached 5 minutes)
LRANGE user:456:feed 0 25
```"
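
The merge-on-read path fits in a few lines. A sketch assuming redis-py, with an illustrative seen-set key:

```python
import redis

r = redis.Redis(decode_responses=True)

def home_feed(user_id: int, subreddits: list[str], limit: int = 25) -> list[str]:
    # Steps 1-3: pull the top of each subscribed subreddit's precomputed ranking.
    candidates: list[tuple[float, str]] = []
    for sub in subreddits:
        top = r.zrevrange(f"subreddit:{sub}:hot", 0, 99, withscores=True)
        candidates.extend((score, post_id) for post_id, score in top)

    # Step 4: merge-sort by hot score.
    candidates.sort(reverse=True)

    # Step 5: filter posts the user has already seen.
    seen = r.smembers(f"user:{user_id}:seen")
    feed = [pid for _, pid in candidates if pid not in seen]

    # Step 6: return the top page.
    return feed[:limit]
```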

Interviewer: "What about real-time updates on vote counts?"

Candidate: "Two approaches:

1. Polling (simpler):

  • Client polls every 30 seconds for active posts
  • Batch request: 'Give me current scores for posts [1,2,3,4,5]'
  • Efficient enough for most cases

2. WebSocket (for viral posts):

  • When post goes viral (>1000 votes/minute)
  • Subscribe clients to real-time updates
  • Server pushes vote count changes

I'd start with polling and add WebSocket for hot posts later."
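
The batch score read is a single cache round trip. A sketch assuming redis-py (the `score:post:` key scheme follows the vote-flow example above):

```python
import redis

r = redis.Redis(decode_responses=True)

def current_scores(post_ids: list[int]) -> dict[int, int]:
    """One MGET for every post the client is watching."""
    values = r.mget([f"score:post:{pid}" for pid in post_ids])
    return {pid: int(v) if v is not None else 0 for pid, v in zip(post_ids, values)}

# A client polling every 30 seconds:
# current_scores([1, 2, 3, 4, 5]) -> {1: 5123, 2: 87, ...}
```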

---

Phase 5: Data Model (8 minutes)

Interviewer: "Walk me through your database schema."

Candidate: "Here's the core schema:"

```sql

-- Sharded by subreddit_id
posts (
  post_id      BIGINT PRIMARY KEY,
  subreddit_id INT,
  author_id    BIGINT,
  title        VARCHAR(300),
  content      TEXT,
  url          VARCHAR(500),
  created_at   TIMESTAMP,
  upvotes      INT DEFAULT 0,
  downvotes    INT DEFAULT 0,
  hot_score    FLOAT,
  INDEX (subreddit_id, hot_score),
  INDEX (subreddit_id, created_at)
)

-- Sharded by post_id
comments (
  comment_id        BIGINT PRIMARY KEY,
  post_id           BIGINT,
  parent_comment_id BIGINT NULL, -- for nested comments
  author_id         BIGINT,
  content           TEXT,
  created_at        TIMESTAMP,
  upvotes           INT DEFAULT 0,
  downvotes         INT DEFAULT 0
)

-- Sharded by post_id
votes (
  user_id    BIGINT,
  post_id    BIGINT,
  vote_type  TINYINT, -- 1=up, -1=down
  created_at TIMESTAMP,
  PRIMARY KEY (user_id, post_id)
)

-- Sharded by user_id
subscriptions (
  user_id      BIGINT,
  subreddit_id INT,
  PRIMARY KEY (user_id, subreddit_id)
)

```

Interviewer: "How do you handle nested comments efficiently?"

Candidate: "Comments threading is tricky. Options:

Option 1: Adjacency List (simple)

  • Store parent_comment_id
  • Requires recursive queries or multiple round trips
  • Good for shallow threads

Option 2: Materialized Path (chosen)

```sql

comments (
  comment_id BIGINT,
  path       VARCHAR(255), -- e.g., '1.5.23.156'
  ...
)

-- Get all children
SELECT * FROM comments
WHERE path LIKE '1.5.23.%'
ORDER BY path

```

  • Single query gets entire thread
  • Sorting by path gives tree order
  • Good for Reddit's deep threads

Option 3: Nested Sets

  • Complex updates but fast reads
  • Not ideal for Reddit's write-heavy comments"
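
To make the "sorting by path gives tree order" claim concrete, here's a small sketch (the row shape is hypothetical). One practical note: production schemas typically zero-pad path segments so SQL's lexicographic ORDER BY matches numeric order:

```python
def render_thread(rows: list[tuple[str, str]]) -> None:
    """rows: (path, content) pairs as returned by ORDER BY path.
    A comment's depth is just the number of dots in its path."""
    # Sort numerically by segment; zero-padded paths make this match
    # the database's lexicographic ORDER BY.
    for path, content in sorted(rows, key=lambda row: [int(s) for s in row[0].split(".")]):
        print("    " * path.count(".") + content)

render_thread([
    ("1", "top-level comment"),
    ("1.5", "first reply"),
    ("1.5.23", "nested reply"),
    ("1.6", "second reply"),
])
# top-level comment
#     first reply
#         nested reply
#     second reply
```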

---

Phase 6: Trade-offs (5 minutes)

Interviewer: "Key trade-offs in your design?"

Candidate: "Several important decisions:

1. Vote counting approach:

  • **Chose:** Redis + async DB writes
  • **Alternative:** Direct DB writes
  • **Trade-off:** Eventual consistency vs strong consistency
  • **Why:** Speed matters more than exact count at any moment

2. Feed computation:

  • **Chose:** Pre-computed with periodic refresh
  • **Alternative:** Real-time computation
  • **Trade-off:** Slight staleness vs latency
  • **Why:** Can't compute across millions of posts in 200ms

3. Comment threading:

  • **Chose:** Materialized path
  • **Alternative:** Recursive CTEs
  • **Trade-off:** Storage overhead vs query complexity
  • **Why:** Reddit threads go very deep; need single-query fetch

4. Hot score storage:

  • **Chose:** Denormalized on posts table
  • **Alternative:** Compute on read
  • **Trade-off:** Storage + update cost vs read latency
  • **Why:** Read-heavy system; worth the write overhead"

Interviewer: "What breaks at 10x scale?"

Candidate: "Main concerns:

  1. **Feed generation** - Need more aggressive caching, possibly per-user pre-computation
  2. **Vote processing** - Kafka partition strategy becomes critical
  3. **Popular subreddits** - Hot shards; might need dedicated resources for top subreddits"

---

Key Interview Takeaways

  1. **Voting at scale** - Cache first, queue, batch write to DB
  2. **Feed ranking** - Pre-compute scores, don't compute on request
  3. **Hot algorithm** - Log scale votes + linear time decay
  4. **Comments** - Materialized path for deep threading
  5. **Real-time** - Polling is fine; WebSocket for exceptional cases
