Top 5 System Design Concepts Every Engineer Must Master
Why System Design Matters for Engineers
You've crushed LeetCode. You can invert a binary tree in your sleep. But when the interviewer asks "Design Twitter" or "Design a URL shortener," you freeze.
System design interviews test what LeetCode can't:
- **Big-picture thinking:** How do components fit together?
- **Trade-off analysis:** Why this approach over that one?
- **Real-world constraints:** How do you handle millions of users?
- **Communication:** Can you explain complex systems clearly?
Here are the 5 concepts that separate senior engineers from everyone else.
---
1. Distributed Systems Fundamentals: CAP Theorem and Beyond
What It Is
The CAP Theorem states that a distributed system can only guarantee two of three properties simultaneously:
- **Consistency (C):** Every read receives the most recent write
- **Availability (A):** Every request receives a response (success or failure)
- **Partition Tolerance (P):** The system continues to operate despite network partitions
Since network partitions are inevitable in distributed systems, you're really choosing between CP (consistent but may be unavailable) or AP (available but may be inconsistent).
Deep Dive: Consistency Models
| Model | Guarantee | Example | Use Case |
|-------|-----------|---------|----------|
| Strong Consistency | Read always returns latest write | Spanner, CockroachDB | Financial systems |
| Eventual Consistency | Reads eventually return latest write | DynamoDB, Cassandra | Social feeds |
| Causal Consistency | Causally related ops seen in order | MongoDB | Collaborative editing |
| Read-Your-Writes | User sees their own writes immediately | Session affinity | User profiles |
Interview Application
Question: "Design a distributed key-value store"
Weak Answer: "I'll use a hash map replicated across servers"
Strong Answer:
"First, let me clarify the consistency requirements. For a general-purpose store, I'd offer tunable consistency:
- **For reads:** Quorum reads (R + W > N) for strong consistency, single-node reads for availability
- **For writes:** Write to W nodes before acknowledging, async replication to others
- **Conflict resolution:** Last-write-wins with vector clocks for causality tracking
For partition handling, I'd use consistent hashing to minimize data movement when nodes join/leave..."
Key Implementation Patterns
```
Quorum Formula: R + W > N
- N = total replicas
- W = write acknowledgments required
- R = read acknowledgments required
Example (N=3):
- Strong consistency: W=2, R=2 (majority on both)
- Fast reads: W=3, R=1 (write to all, read from any)
- Fast writes: W=1, R=3 (write to one, read from all)
```
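To make the trade-off concrete, here is a minimal Python sketch (illustrative only, not tied to any particular datastore) that checks whether a replication configuration satisfies the quorum condition:

```python
# Illustrative sketch: checking quorum configurations for strong consistency.
# N, W, R are hypothetical parameters, not tied to any specific datastore.

def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    """Strong consistency holds when read and write quorums overlap: R + W > N."""
    return r + w > n

configs = {
    "strong":      dict(n=3, w=2, r=2),
    "fast_reads":  dict(n=3, w=3, r=1),
    "fast_writes": dict(n=3, w=1, r=3),
    "unsafe":      dict(n=3, w=1, r=1),  # quorums may not overlap -> stale reads possible
}

for name, cfg in configs.items():
    print(f"{name}: R+W>N is {is_strongly_consistent(**cfg)}")
```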
---
2. Database Sharding and Partitioning
What It Is
Sharding splits a database into smaller, independent pieces (shards), each holding a subset of the data. Each shard runs on a separate server.
Partitioning is the general concept; sharding specifically refers to horizontal partitioning across multiple machines.
Sharding Strategies
| Strategy | How It Works | Pros | Cons |
|----------|--------------|------|------|
| Range-based | Shard by value ranges (A-M, N-Z) | Simple, range queries work | Hot spots if data skewed |
| Hash-based | Hash key to determine shard | Even distribution | Range queries hit all shards |
| Geographic | Shard by user location | Low latency, data locality | Cross-region queries slow |
| Directory-based | Lookup table maps key to shard | Flexible | Lookup table is bottleneck |
The Consistent Hashing Deep Dive
Consistent hashing solves the "resharding nightmare"—when you add/remove nodes, you don't want to move all data.
```
Traditional Hashing:
- hash(key) % N = shard number
- Problem: Change N, everything moves
Consistent Hashing:
- Hash both keys and nodes onto a ring (0 to 2^32)
- Key goes to first node clockwise from its position
- Add node: only keys between new node and predecessor move
- Virtual nodes: each physical node has multiple ring positions
```
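Here is a minimal Python sketch of such a ring with virtual nodes. The node names, hash function, and vnode count are illustrative assumptions, not a production recipe:

```python
# Minimal consistent-hashing ring (illustrative sketch; real systems add
# replication, weighting, and stronger hash functions).
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=100):
        self.ring = {}           # ring position -> physical node
        self.sorted_keys = []
        for node in nodes:
            for i in range(vnodes):  # virtual nodes smooth out the distribution
                pos = self._hash(f"{node}#{i}")
                self.ring[pos] = node
                bisect.insort(self.sorted_keys, pos)

    @staticmethod
    def _hash(key: str) -> int:
        # Map onto a 0..2^32 ring, as described above.
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

    def get_node(self, key: str) -> str:
        """A key is owned by the first node clockwise from its hash position."""
        pos = self._hash(key)
        idx = bisect.bisect_right(self.sorted_keys, pos) % len(self.sorted_keys)
        return self.ring[self.sorted_keys[idx]]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.get_node("user:12345"))
```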
Interview Application
Question: "Design a database schema for a social network with 500M users"
Weak Answer: "I'll use MySQL with user_id as primary key"
Strong Answer:
"At 500M users, we need sharding. Here's my approach:
Sharding key: user_id (most queries are user-centric)
Shard strategy: Consistent hashing with virtual nodes
- 1000 virtual nodes per physical server
- Easy to add capacity by adding servers
- Minimal data movement on rebalancing
Cross-shard queries (e.g., friend feeds):
- Fan-out on write: Precompute feeds, write to follower shards
- Or fan-out on read: Query friend shards at read time
- Trade-off: Write amplification vs. read latency
Hot spot handling:
- Celebrity users get dedicated shards
- Rate limiting on viral content
- Caching layer absorbs read spikes"
Common Pitfalls
- **Cross-shard transactions:** Avoid if possible; use sagas or eventual consistency
- **Shard key selection:** Choose based on access patterns, not data model
- **Rebalancing:** Plan for this from day one; it's painful to add later
---
3. Message Queues and Event-Driven Architecture
What It Is
Message queues decouple producers (services that generate events) from consumers (services that process them). This enables:
- **Asynchronous processing:** Don't block on slow operations
- **Load leveling:** Absorb traffic spikes
- **Reliability:** Messages persist until processed
- **Scalability:** Add consumers independently
Queue vs. Pub/Sub vs. Event Streaming
| Pattern | Delivery | Ordering | Use Case |
|---------|----------|----------|----------|
| Message Queue | Each message to one consumer | FIFO within queue | Task distribution |
| Pub/Sub | Each message to all subscribers | Per-subscription | Notifications, fanout |
| Event Streaming | Consumers read at own pace | Partition ordering | Event sourcing, analytics |
Technology Comparison
| Technology | Type | Strengths | Weaknesses |
|------------|------|-----------|------------|
| RabbitMQ | Queue/Pub-Sub | Flexible routing, mature | Lower throughput |
| Amazon SQS | Queue | Fully managed, simple | Limited features |
| Apache Kafka | Event Streaming | High throughput, replay | Operational complexity |
| Amazon Kinesis | Event Streaming | Managed Kafka-like | AWS lock-in |
| Redis Streams | Event Streaming | Fast, lightweight | Durability trade-offs |
Interview Application
Question: "Design a notification system for a social media app"
Strong Answer:
"I'll use an event-driven architecture with Kafka:
Event Flow:
- User action (like, comment, follow) → Event producer
- Event → Kafka topic (partitioned by recipient_id)
- Notification service consumes events
- Routes to appropriate channel (push, email, in-app)
Why Kafka:
- High throughput for viral content (millions of likes)
- Partitioning by recipient ensures ordering per user
- Consumer groups for parallel processing
- Replay capability for debugging/reprocessing
Handling Scale:
- Partition count = 10x expected consumer count (room to grow)
- Consumer groups per channel (push, email, SMS)
- Dead letter queue for failed deliveries
- Deduplication at consumer (idempotency keys)
Batching Optimization:
- Aggregate notifications (10 people liked your post)
- Time-window batching for non-urgent notifications
- Priority queue for urgent notifications (DMs)"
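As a rough sketch of the producer side of this flow, here is what keying events by recipient_id might look like with the kafka-python client. The topic name, broker address, and event fields are assumptions for illustration only:

```python
# Illustrative producer sketch using the kafka-python client.
# Topic name, broker address, and event fields are assumptions for this example.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_notification_event(recipient_id: str, event_type: str, actor_id: str) -> None:
    # Keying by recipient_id means all of a user's notifications land in the
    # same partition, preserving per-user ordering.
    event = {"type": event_type, "actor": actor_id, "recipient": recipient_id}
    producer.send("notification-events", key=recipient_id, value=event)

publish_notification_event("user-42", "like", "user-7")
producer.flush()
```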
Key Design Patterns
1. Competing Consumers
```
Queue → [Consumer 1, Consumer 2, Consumer 3]
Each message processed by exactly one consumer
Use for: Background jobs, task processing
```
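A minimal sketch of this pattern using only the Python standard library (a real system would use a durable broker such as RabbitMQ or SQS):

```python
# Competing consumers sketch: each task is processed by exactly one worker.
import queue
import threading

tasks = queue.Queue()

def worker(worker_id: int) -> None:
    while True:
        task = tasks.get()
        if task is None:            # sentinel: shut this worker down
            tasks.task_done()
            break
        print(f"worker {worker_id} processed {task}")
        tasks.task_done()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()

for job in ["resize-image", "send-email", "generate-report"]:
    tasks.put(job)                  # each job is picked up by exactly one worker

tasks.join()                        # wait for all real tasks to finish
for _ in threads:
    tasks.put(None)                 # one sentinel per worker
for t in threads:
    t.join()
```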
2. Fan-Out
```
Event → Topic → [Subscriber A, Subscriber B, Subscriber C]
Each subscriber gets every message
Use for: Notifications, cache invalidation
```
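A minimal in-process sketch of the same idea (a real system would use a broker's topics or SNS-style fan-out):

```python
# Fan-out sketch: every subscriber receives every published event.
from collections import defaultdict
from typing import Callable

subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

def subscribe(topic: str, handler: Callable[[dict], None]) -> None:
    subscribers[topic].append(handler)

def publish(topic: str, event: dict) -> None:
    for handler in subscribers[topic]:   # unlike a queue, every handler sees the event
        handler(event)

subscribe("post.liked", lambda e: print("push notification:", e))
subscribe("post.liked", lambda e: print("invalidate cache:", e))
publish("post.liked", {"post_id": 123, "user_id": 7})
```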
3. Event Sourcing
```
Commands → Events (immutable log) → State (derived)
Rebuild state by replaying events
Use for: Audit logs, debugging, temporal queries
```
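A tiny sketch of the idea, using a hypothetical account balance as the derived state:

```python
# Event sourcing sketch: state is derived by replaying an immutable event log.
events: list[dict] = []   # in practice this would be a durable, append-only log

def handle_command(command: dict) -> None:
    # Commands are validated, then recorded as immutable events (never mutated).
    if command["type"] == "deposit":
        events.append({"type": "deposited", "amount": command["amount"]})
    elif command["type"] == "withdraw":
        events.append({"type": "withdrew", "amount": command["amount"]})

def current_balance() -> int:
    # Rebuild state at any time by replaying the full event history.
    balance = 0
    for event in events:
        balance += event["amount"] if event["type"] == "deposited" else -event["amount"]
    return balance

handle_command({"type": "deposit", "amount": 100})
handle_command({"type": "withdraw", "amount": 30})
print(current_balance())   # 70
```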
---
4. Caching Strategies and Cache Invalidation
What It Is
Caching stores frequently accessed data in fast storage (memory) to reduce latency and database load. The hard part isn't caching—it's knowing when cached data is stale.
Caching Patterns
| Pattern | Description | Consistency | Complexity |
|---------|-------------|-------------|------------|
| Cache-Aside | App checks cache, falls back to DB | App manages | Low |
| Read-Through | Cache loads from DB on miss | Cache manages | Medium |
| Write-Through | Write to cache and DB simultaneously | Strong | Medium |
| Write-Behind | Write to cache, async write to DB | Eventual | High |
| Refresh-Ahead | Proactively refresh before expiry | Mostly fresh | High |
Cache-Aside Implementation
```
Read:
- Check cache for key
- If hit: return cached value
- If miss: query database
- Store result in cache with TTL
- Return value
Write:
- Update database
- Invalidate cache (don't update!)
Why? Avoids race conditions between concurrent writes
```
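A minimal Python sketch of cache-aside. The `cache` and `db` dictionaries stand in for Redis and a real database; their interfaces are assumptions for illustration:

```python
# Cache-aside sketch: read through the cache, invalidate (don't update) on write.
import time

db = {"product:123": {"name": "Widget", "price": 999}}    # stand-in for a database
cache: dict[str, tuple[float, dict]] = {}                  # key -> (expires_at, value)
TTL_SECONDS = 300

def get_product(key: str) -> dict:
    entry = cache.get(key)
    if entry and entry[0] > time.time():       # cache hit and not expired
        return entry[1]
    value = db[key]                            # cache miss: fall back to the database
    cache[key] = (time.time() + TTL_SECONDS, value)
    return value

def update_product(key: str, value: dict) -> None:
    db[key] = value
    cache.pop(key, None)                       # invalidate, don't update, to avoid races

print(get_product("product:123"))
```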
Cache Invalidation Strategies
| Strategy | How It Works | Trade-off |
|----------|--------------|-----------|
| TTL (Time-To-Live) | Expire after fixed time | Simple but data can be stale |
| Event-Driven | Invalidate on data change | Fresh but complex |
| Version-Based | Include version in cache key | Fresh but storage overhead |
| Write-Through | Update cache on write | Fresh but write latency |
Interview Application
Question: "Design a caching strategy for a product catalog"
Strong Answer:
"I'll implement a multi-tier caching strategy:
Tier 1: CDN (Edge Cache)
- Cache product images and static assets
- TTL: 24 hours (images rarely change)
- Invalidation: Purge API on image update
Tier 2: Application Cache (Redis)
- Cache product details (name, price, description)
- TTL: 5 minutes (balance freshness vs. load)
- Pattern: Cache-aside with event-driven invalidation
Tier 3: Local Cache (In-Process)
- Cache category lists and navigation
- TTL: 1 minute
- Reduces Redis roundtrips for hot data
Invalidation Strategy:
- Product update → Publish to invalidation topic
- All app servers subscribe and clear local cache
- Redis key deleted (not updated—avoids race conditions)
- CDN purge for images if changed
Thundering Herd Protection:
- Mutex on cache miss: only one request queries DB
- Others wait for cache population
- Alternatively: probabilistic early expiration
Cache Key Design:
```
product:{id}:v{version}
product:123:v7
```
Version bumps on update → old cache naturally expires"
Common Pitfalls
- **Cache stampede:** Many requests hit the DB simultaneously on a cache miss (see the sketch below)
- **Stale data:** Forgetting to invalidate on all update paths
- **Over-caching:** Caching data that's rarely accessed or constantly changing
- **Memory pressure:** Not setting max memory limits, eviction policies
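Here is a minimal sketch of the mutex-on-miss protection mentioned above: only one caller recomputes a missing key while the others wait for the result. The in-memory cache and names are illustrative:

```python
# Single-flight sketch: only one caller recomputes a missing cache key;
# concurrent callers wait for the result instead of stampeding the database.
import threading

cache: dict[str, str] = {}
locks: dict[str, threading.Lock] = {}
locks_guard = threading.Lock()

def get_or_compute(key: str, compute) -> str:
    if key in cache:
        return cache[key]
    with locks_guard:                          # one lock object per key
        lock = locks.setdefault(key, threading.Lock())
    with lock:
        if key in cache:                       # another thread may have filled it
            return cache[key]
        value = compute()                      # expensive DB query happens only once
        cache[key] = value
        return value

print(get_or_compute("product:123", lambda: "loaded-from-db"))
```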
---
5. Rate Limiting and API Gateway Design
What It Is
Rate limiting controls how many requests a client can make in a given time window. It protects your system from:
- **Abuse:** Malicious actors overwhelming your service
- **Mistakes:** Buggy clients sending infinite loops
- **Cost:** Expensive operations being called excessively
- **Fairness:** Heavy users not starving others
Rate Limiting Algorithms
| Algorithm | How It Works | Pros | Cons |
|-----------|--------------|------|------|
| Token Bucket | Tokens added at fixed rate, request costs tokens | Allows bursts, smooth | Slightly complex |
| Leaky Bucket | Requests queued, processed at fixed rate | Smooth output | No bursts allowed |
| Fixed Window | Count requests in time windows | Simple | Burst at window edges |
| Sliding Window Log | Track timestamp of each request | Accurate | Memory intensive |
| Sliding Window Counter | Weighted count across windows | Balanced | Approximate |
Token Bucket Deep Dive
```
Configuration:
- Bucket size (max tokens): 100
- Refill rate: 10 tokens/second
Request arrives:
- Check tokens available
- If tokens >= cost: deduct and allow
- If tokens < cost: reject (429 Too Many Requests)
Tokens refill continuously:
- After 1 second of no requests: 10 tokens added
- After 10 seconds of no requests: bucket full (100)
Result:
- Sustained rate: 10 req/sec
- Burst capacity: up to 100 requests instantly
```
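A minimal Python sketch of a token bucket matching the configuration above (capacity 100, refill 10 tokens/second), using a monotonic clock for continuous refill:

```python
# Token bucket sketch matching the configuration above.
import time

class TokenBucket:
    def __init__(self, capacity: float = 100, refill_rate: float = 10):
        self.capacity = capacity
        self.refill_rate = refill_rate       # tokens added per second
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self, cost: float = 1) -> bool:
        now = time.monotonic()
        # Refill continuously based on elapsed time, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                         # caller should respond 429 Too Many Requests

bucket = TokenBucket()
print(bucket.allow())   # True while burst capacity lasts
```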
Distributed Rate Limiting
Single-server rate limiting is easy. Distributed is hard.
| Approach | How It Works | Trade-off |
|----------|--------------|-----------|
| Centralized (Redis) | All servers check shared counter | Accurate but adds latency |
| Local + Sync | Local counters, periodic sync | Fast but approximate |
| Sticky Sessions | Route user to same server | Simple but uneven load |
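As an illustration of the centralized approach, here is a sketch using redis-py with a simple fixed-window counter rather than a token bucket. The key format and limits are assumptions, and a production limiter would typically use a Lua script so the check and increment are atomic:

```python
# Centralized rate limiting sketch using redis-py and a fixed-window counter
# (simpler than a token bucket; shown only to illustrate the shared-counter idea).
import time
import redis

r = redis.Redis(host="localhost", port=6379)

def allow_request(user_id: str, limit: int = 100, window_seconds: int = 60) -> bool:
    window = int(time.time() // window_seconds)
    key = f"ratelimit:{user_id}:{window}"      # hypothetical key format
    pipe = r.pipeline()
    pipe.incr(key)                             # count this request in the current window
    pipe.expire(key, window_seconds)           # let old windows clean themselves up
    count, _ = pipe.execute()
    return count <= limit

if not allow_request("user-42"):
    print("429 Too Many Requests")
```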
Interview Application
Question: "Design an API gateway for a microservices architecture"
Strong Answer:
"An API gateway handles cross-cutting concerns for all services:
Core Responsibilities:
- **Request Routing:** Map external URLs to internal services
- **Authentication:** Validate JWT tokens, API keys
- **Rate Limiting:** Protect services from overload
- **Load Balancing:** Distribute requests across instances
- **Caching:** Cache responses for read-heavy endpoints
- **Monitoring:** Log requests, track latencies, error rates
Rate Limiting Implementation:
```
Multi-tier limits:
- Global: 10,000 req/sec (protect infrastructure)
- Per-user: 100 req/min (fair usage)
- Per-endpoint: varies (expensive operations lower)
Algorithm: Token bucket with Redis
- Key: ratelimit:{user_id}:{endpoint}
- Lua script for atomic check-and-decrement
- TTL on keys for automatic cleanup
```
Handling Limit Exceeded:
- Return 429 with Retry-After header
- Include X-RateLimit-Remaining header on all responses
- Graduated response: warn at 80%, soft limit at 90%, hard limit at 100%
High Availability:
- Deploy gateway in multiple availability zones
- Redis cluster for rate limit state
- Fallback: allow requests if Redis unavailable (fail open)
- Circuit breaker: if backend unhealthy, fail fast"
Headers to Include
```
HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 87
X-RateLimit-Reset: 1640995200
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1640995200
```
---
System Design Interview Framework
Use this framework for any system design question:
1. Requirements Clarification (3-5 min)
- Functional requirements: What should the system do?
- Non-functional requirements: Scale, latency, availability?
- Constraints: Budget, timeline, existing infrastructure?
2. Back-of-Envelope Estimation (3-5 min)
- Users: DAU, peak concurrent
- Data: Size per record, total storage
- Bandwidth: Requests per second, data transfer
- Identify bottlenecks early (see the worked estimate after this framework)
3. High-Level Design (10-15 min)
- Draw major components
- Show data flow
- Identify APIs between components
4. Deep Dive (10-15 min)
- Pick 2-3 critical components
- Discuss trade-offs
- Address scalability, reliability
5. Wrap-Up (3-5 min)
- Summarize design
- Discuss monitoring, alerting
- Mention future improvements
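As a worked illustration of step 2, here is a quick estimate with hypothetical round numbers:

```python
# Back-of-envelope sketch with hypothetical round numbers.
dau = 100_000_000                 # daily active users (assumed)
requests_per_user_per_day = 20    # assumed
avg_rps = dau * requests_per_user_per_day / 86_400     # seconds in a day
peak_rps = avg_rps * 3            # rough peak-to-average ratio (assumed)
record_size_bytes = 1_000         # ~1 KB per record (assumed)
daily_storage_gb = dau * requests_per_user_per_day * record_size_bytes / 1e9

print(f"average: ~{avg_rps:,.0f} req/sec, peak: ~{peak_rps:,.0f} req/sec")
print(f"new data per day: ~{daily_storage_gb:,.0f} GB")
```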
---
Practice Problems by Concept
| Concept | Practice Problems |
|---------|------------------|
| Distributed Systems | Design a distributed cache, Design a key-value store |
| Sharding | Design Twitter, Design a URL shortener |
| Message Queues | Design a notification system, Design a task scheduler |
| Caching | Design a CDN, Design a news feed |
| Rate Limiting | Design an API gateway, Design a web crawler |
---
From Concepts to Career
System design skills are what separate senior engineers from mid-level ones. They're tested explicitly in interviews and implicitly in your daily work.
Master these concepts, and you'll:
- Ace system design interviews at top companies
- Architect systems that scale
- Earn the trust of your team and leadership
- Command higher compensation
---
Land the Interview First
Technical skills get you through the interview. Your resume gets you the interview.
IdealResume helps engineers showcase system design experience effectively:
- Highlight distributed systems projects
- Quantify scale and impact
- Optimize for technical recruiter keywords
Your next senior role starts with the right resume.
Ready to Build Your Perfect Resume?
Let IdealResume help you create ATS-optimized, tailored resumes that get results.
Get Started Free