How Apache Kafka Works: Distributed Streaming at Scale
Kafka's Role in Modern Architecture
Apache Kafka is the de facto standard for real-time data streaming. LinkedIn, where it was created, has reported pushing over 7 trillion messages per day through Kafka, and companies like Netflix, Uber, and Airbnb rely on it for critical data pipelines.
Core Concepts
1. Topics
- Named feeds of messages
- Similar to database tables
- Partitioned for scalability
- Replicated for durability
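Creating a topic means choosing its partition count and replication factor up front. A minimal sketch with the Java AdminClient, assuming a local broker and a hypothetical "orders" topic:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions for parallelism, replication factor 3 for durability
            NewTopic orders = new NewTopic("orders", 6, (short) 3);
            admin.createTopics(List.of(orders)).all().get(); // block until the cluster confirms
        }
    }
}
```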
2. Partitions
- Topics split into partitions
- Each partition is ordered
- Maximum consumer parallelism = number of partitions
- Messages distributed by key
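Key-based distribution is just a hash of the serialized key modulo the partition count, so records with the same key always land in the same partition. A simplified model of that mapping (Kafka's default partitioner actually uses murmur2; the plain Java hash here is for illustration only):

```java
// Simplified model of key -> partition mapping (illustration only;
// Kafka's default partitioner hashes the serialized key with murmur2).
static int partitionFor(byte[] serializedKey, int numPartitions) {
    int hash = java.util.Arrays.hashCode(serializedKey);
    return (hash & 0x7fffffff) % numPartitions; // mask the sign bit so the result is non-negative
}
```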
3. Producers
- Publish messages to topics
- Choose partition (key-based or round-robin)
- Receive write acknowledgments (acks) from the broker
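A minimal Java producer sketch tying these bullets together; the broker address, topic, key, and payload are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.ACKS_CONFIG, "all");                           // wait for in-sync replicas

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key ("user-42") -> same partition -> per-key ordering
            ProducerRecord<String, String> record =
                new ProducerRecord<>("orders", "user-42", "{\"item\":\"book\",\"qty\":1}");
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("partition=%d offset=%d%n", metadata.partition(), metadata.offset());
                }
            });
        } // close() flushes any pending batches
    }
}
```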
4. Consumers
- Subscribe to topics
- Read from partitions
- Track position (offset)
- Consumer groups for scaling
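A matching consumer sketch; the group id and topic are placeholders, and offsets are committed manually to make the position tracking explicit:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");        // members of this group share the partitions
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");         // commit offsets explicitly

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
                }
                consumer.commitSync(); // advance the group's position only after processing
            }
        }
    }
}
```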
5. Brokers
- Kafka servers
- Store and serve data
- Handle replication
- Coordinate consumers
Architecture Deep Dive
Cluster Layout:
- Multiple brokers (typically 3+)
- Each broker handles subset of partitions
- ZooKeeper (legacy) or KRaft for metadata and coordination
- No single point of failure
Partition Distribution:
- Partitions spread across brokers
- One broker = leader for partition
- Other brokers = followers (replicas)
- Automatic leader election on failure
Write Path
Producer Flow:
- Producer chooses partition (by key hash or round-robin)
- Request sent to partition leader
- Leader writes to local log
- Followers replicate
- Leader acknowledges when enough replicas confirm
Acknowledgment Modes:
- acks=0: No wait (fastest, least safe)
- acks=1: Leader acknowledges (balanced)
- acks=all: All in-sync replicas acknowledge (safest, slowest)
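On the Java producer this is a single config. An illustrative fragment (additions to the producer Properties from the earlier sketch) contrasting the three modes:

```java
// Throughput first: fire and forget, no broker confirmation.
props.put(ProducerConfig.ACKS_CONFIG, "0");

// Balanced: the partition leader confirms the write.
props.put(ProducerConfig.ACKS_CONFIG, "1");

// Durability first: all in-sync replicas confirm; pair with idempotent retries.
props.put(ProducerConfig.ACKS_CONFIG, "all");
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
```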
Log Storage
Append-Only Log:
- Messages appended sequentially
- Never modified after write
- Position = offset (monotonic)
Log Segments:
- Log split into segment files
- Active segment receives writes
- Old segments immutable
- Retention-based deletion
Index Files:
- Offset index: offset → byte position in the segment file
- Time index: timestamp → offset
- Enables efficient seeking
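As a mental model only (this is not Kafka's internal code), an offset index can be treated as a sorted array of (relative offset, file position) pairs; a lookup binary-searches for the last entry at or before the target offset, then scans forward in the segment:

```java
// Conceptual model of an offset-index lookup, not Kafka's actual implementation.
// entries[i] = {relativeOffset, bytePositionInSegmentFile}, sorted by relativeOffset.
static long positionFor(long[][] entries, long targetRelativeOffset) {
    int lo = 0, hi = entries.length - 1;
    long position = 0; // fall back to the start of the segment if no entry precedes the target
    while (lo <= hi) {
        int mid = (lo + hi) >>> 1;
        if (entries[mid][0] <= targetRelativeOffset) {
            position = entries[mid][1]; // best candidate so far
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }
    return position; // the broker then scans forward from here to the exact offset
}
```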
Read Path
Consumer Flow:
- Consumer requests messages from offset
- Broker serves from page cache or disk
- Consumer processes messages
- Consumer commits offset
Sequential Reads:
- Most reads are sequential
- OS page cache very effective
- Zero-copy optimization (sendfile)
Consumer Groups:
- Multiple consumers share workload
- Each partition assigned to one consumer
- Rebalancing on consumer changes
Replication
Leader-Follower Model:
- Each partition has one leader
- Leader handles all writes and, by default, all reads
- Followers replicate from leader
- ISR (In-Sync Replicas) tracks healthy replicas
Durability:
- Replication factor typically 3
- Tolerates up to RF-1 broker failures without losing committed data
- min.insync.replicas for write guarantees
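min.insync.replicas is a topic/broker setting; together with acks=all it defines how many replicas must confirm before a write succeeds. A sketch that sets it per topic with the Java AdminClient (topic name and broker address are placeholders):

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetMinInsyncReplicas {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker

        // With replication factor 3 and min.insync.replicas=2, acks=all writes succeed
        // as long as at least two replicas (leader included) are in sync.
        ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
        AlterConfigOp setMinIsr =
            new AlterConfigOp(new ConfigEntry("min.insync.replicas", "2"), AlterConfigOp.OpType.SET);
        Map<ConfigResource, Collection<AlterConfigOp>> change = Map.of(topic, List.of(setMinIsr));

        try (AdminClient admin = AdminClient.create(props)) {
            admin.incrementalAlterConfigs(change).all().get();
        }
    }
}
```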
Failure Handling:
- Leader fails → follower promoted
- Automatic, transparent to clients
- Unacknowledged messages may be lost (depends on acks and min.insync.replicas)
Kafka's Secret Sauce
1. Sequential I/O
- All writes are appends
- All reads are sequential
- Optimizes for disk throughput
- Sequential disk access is orders of magnitude faster than random access
2. Zero-Copy
- Data transferred directly from page cache to network
- Bypasses user-space buffers
- sendfile system call
- Reduces CPU overhead
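Java exposes the same mechanism as FileChannel.transferTo, which is what the broker builds on. A standalone illustration outside of Kafka, with a placeholder file and destination:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySend {
    public static void main(String[] args) throws IOException {
        try (FileChannel log = FileChannel.open(Path.of("segment.log"), StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9000))) {
            long position = 0;
            long remaining = log.size();
            while (remaining > 0) {
                // transferTo maps to sendfile on Linux: bytes move from the page cache
                // to the socket without being copied into user space.
                long sent = log.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```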
3. Batching
- Producers batch messages
- Consumers fetch batches
- Reduces network round trips
- Amortizes overhead
4. Compression
- Batch-level compression
- gzip, snappy, lz4, zstd
- Reduces network and storage
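Both batching and compression are producer-side settings. An illustrative fragment (additions to the producer Properties from the earlier sketch; values are starting points, not recommendations):

```java
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);   // up to 64 KB per partition batch
props.put(ProducerConfig.LINGER_MS_CONFIG, 10);           // wait up to 10 ms for a batch to fill
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4"); // compress each batch before it leaves the client
// Larger batches plus compression trade a little latency for much higher throughput.
```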
Consumer Group Mechanics
Partition Assignment:
- Partitions assigned to consumers
- Each partition is consumed by at most one consumer in the group
- More consumers than partitions = some idle
Offset Management:
- Consumers track their position
- Offsets stored in Kafka (__consumer_offsets)
- Commit periodically or on each message
- Enables replay from any position
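Because an offset is just a position in the log, a consumer can rewind and replay. A fragment using the same consumer configuration as the earlier sketch, but with assign() instead of subscribe() (partition number and offset are placeholders; it also needs java.util.Map, TopicPartition, and OffsetAndMetadata imports):

```java
// Pin a specific partition, jump back to a known offset, and re-read from there.
TopicPartition partition = new TopicPartition("orders", 0);   // placeholder partition
consumer.assign(List.of(partition));                          // bypasses group-managed assignment
consumer.seek(partition, 1_000L);                             // rewind: replay everything from offset 1000
ConsumerRecords<String, String> replayed = consumer.poll(Duration.ofSeconds(1));
consumer.commitSync(Map.of(partition, new OffsetAndMetadata(consumer.position(partition))));
```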
Rebalancing:
- Triggered on consumer join/leave
- Partitions redistributed
- Brief pause in processing
- Cooperative rebalancing minimizes disruption
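Cooperative ("incremental") rebalancing is opted into through the consumer's assignment strategy; one illustrative line added to the consumer Properties from the earlier sketch:

```java
// Incremental rebalancing revokes only the partitions that actually move,
// instead of stopping every consumer in the group during a rebalance.
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
          "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
```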
Stream Processing
Kafka Streams:
- Library for stream processing
- Stateful operations (aggregations, joins)
- Exactly-once semantics
- Scales with Kafka
Common Operations:
- Filter: select messages by criteria
- Map: transform messages
- Aggregate: group and combine
- Join: correlate streams
- Windowing: time-based grouping
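A small Kafka Streams topology exercising several of these operations; topic names, the grouping key, and the window size are placeholders:

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class OrderCounts {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-counts");      // doubles as the consumer group id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders");           // assumed keyed by user id

        orders
            .filter((userId, order) -> order != null && !order.isEmpty())    // Filter
            .mapValues(String::toUpperCase)                                  // Map (stand-in transformation)
            .groupByKey()
            .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5))) // Windowing
            .count()                                                         // Aggregate
            .toStream((windowedKey, count) -> windowedKey.key() + "@" + windowedKey.window().start())
            .to("orders-per-user-5min", Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```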
Use Cases
1. Event Sourcing
- Events as source of truth
- Replay to rebuild state
- Audit trail built-in
2. Change Data Capture (CDC)
- Database changes to Kafka
- Sync across systems
- Tools: Debezium, Maxwell
3. Log Aggregation
- Centralize logs from many sources
- Real-time processing
- Long-term storage
4. Metrics Pipeline
- Application metrics to Kafka
- Real-time dashboards
- Alerting systems
Scaling Considerations
Horizontal Scaling:
- Add partitions for more parallelism
- Add brokers for more capacity
- Add consumers for more processing
Limitations:
- Each partition adds memory and file-handle overhead on brokers and clients
- Very high partition counts slow recovery and leader election
- Key-based ordering limited to single partition
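A common back-of-the-envelope sizing heuristic (simplified, and every number below is a made-up placeholder): take the target topic throughput, divide it by the measured per-partition throughput of your producers and consumers, and keep the larger result plus headroom.

```java
// Rough heuristic, not a rule: partitions >= max(T/p, T/c)
// T = target throughput, p/c = measured per-partition producer/consumer throughput.
long targetMBps = 200;                 // placeholder target for the topic
long producerMBpsPerPartition = 20;    // placeholder measurement
long consumerMBpsPerPartition = 25;    // placeholder measurement

long partitions = Math.max(
    (long) Math.ceil((double) targetMBps / producerMBpsPerPartition),   // 10
    (long) Math.ceil((double) targetMBps / consumerMBpsPerPartition));  // 8
System.out.println(partitions);        // 10, before adding headroom for growth
```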
Operational Concerns
Monitoring:
- Under-replicated partitions
- Consumer lag
- Broker CPU/disk utilization
- Request latency
Common Issues:
- Consumer lag building up
- Rebalancing storms
- Disk space exhaustion
- Network saturation
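Consumer lag is the gap between a partition's end offset and the group's committed offset. A monitoring sketch with the Java AdminClient; the group id and broker address are placeholders:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerLag {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets for the group, per partition.
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets("order-processors")
                     .partitionsToOffsetAndMetadata().get();

            // End (latest) offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> latestSpec = new HashMap<>();
            committed.keySet().forEach(tp -> latestSpec.put(tp, OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                admin.listOffsets(latestSpec).all().get();

            committed.forEach((tp, offset) -> {
                long lag = latest.get(tp).offset() - offset.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}
```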
Interview Application
When discussing event streaming:
Key Concepts:
- Partitioning and ordering
- Replication and durability
- Consumer groups
- Exactly-once semantics
Design Questions:
- When to use Kafka?
- How to choose partition count?
- How to handle failures?
- Stream processing patterns?
Trade-offs:
- Throughput vs latency (batching)
- Durability vs performance (acks)
- Ordering vs parallelism (partitions)
Kafka's design choices (sequential I/O, replication, consumer groups) make it the backbone of modern data architecture.