civil-and-structural-engineering
Key Concepts in Distributed Systems for Technical Interviews
Table of Contents
Distributed systems are a cornerstone of modern software engineering, powering everything from search engines and social networks to cloud platforms and financial systems. For technical interviews, especially at top technology companies, a deep understanding of distributed systems is often required. This article provides an expanded guide to the key concepts, trade-offs, and design patterns you should master to excel in such interviews.
What Are Distributed Systems?
A distributed system is a collection of independent computers that appears to its users as a single coherent system. These computers, often called nodes, communicate over a network and coordinate their actions to achieve a common goal, such as serving a web application, storing large amounts of data, or processing real-time analytics. Examples include Google's search infrastructure, Amazon's DynamoDB, and Apache Kafka.
The primary goals of a distributed system are to:
- Resource sharing: Allow nodes to share hardware, data, and services.
- Scalability: Grow capacity by adding more nodes without major architectural changes.
- Fault tolerance: Continue operating even when some components fail.
- Transparency: Hide the complexity of distribution from the end user.
However, these benefits come at the cost of increased complexity. Network latency, partial failures, and the need for coordination between nodes introduce challenges that do not exist in single-machine systems. Interviewers often test your ability to navigate these trade-offs.
Core Concepts in Distributed Systems
Scalability
Scalability is the ability of a system to handle increased loads—such as more users, more data, or higher throughput—by adding resources. There are two primary types:
- Vertical scaling (scale-up): Adding more power (CPU, RAM, disk) to a single machine. Simple but has physical limits and often becomes expensive.
- Horizontal scaling (scale-out): Adding more nodes to the system. This is the backbone of cloud-native and distributed systems, allowing near-linear growth but requiring careful design to distribute work and manage consistency.
In interviews, discuss how you would design a scalable system—for example, using load balancers, stateless services, and partitioning data. Be prepared to talk about the trade-offs between vertical and horizontal scaling for specific workloads.
Fault Tolerance
Fault tolerance is the ability of a system to continue operating correctly in the presence of failures—whether hardware crashes, software bugs, or network disruptions. Key techniques include:
- Redundancy: Running multiple replicas of a service or data so that if one fails, another takes over.
- Failover mechanisms: Automatic detection of failures and rerouting of requests to healthy nodes.
- Checkpointing and state recovery: Saving system state periodically so that it can be restored quickly after a crash.
- Circuit breakers: Preventing cascading failures by stopping requests to unhealthy services.
Interviewers often ask: "How would you make this system fault-tolerant?" Be ready to describe redundancy patterns (active-passive, active-active) and how to detect failures without false positives.
Consistency
Consistency refers to the guarantee that all nodes in the system see the same data at the same time. In a distributed system, achieving strong consistency is challenging because of network delays and concurrent updates. Common consistency models include:
- Strong consistency: After a write, any subsequent read returns the new value. Typically achieved via synchronous replication or consensus protocols.
- Eventual consistency: Given enough time without updates, all replicas converge to the same value. Widely used in systems that prioritize availability (e.g., DNS, DynamoDB).
- Read-your-writes consistency: A user's own writes are immediately visible to that user, but other users may see stale data briefly.
Understanding when to choose which model is critical. For example, a banking system requires strong consistency, while a social media feed can tolerate eventual consistency.
Availability
Availability measures whether the system is up and responding to requests. It is often expressed as a percentage of uptime per year (e.g., 99.99% availability). High availability is achieved by eliminating single points of failure, using load balancing, and implementing monitoring and auto-recovery. There is a fundamental trade-off between availability and consistency, captured by the CAP theorem (discussed below).
Partition Tolerance
Partition tolerance is the ability of a system to continue functioning when the network splits into two or more subnets that cannot communicate. In a distributed system, partitions are inevitable. The CAP theorem states that you must choose between consistency and availability when a partition occurs. Systems that choose availability (AP) will continue to serve requests even if data may be stale, while systems that choose consistency (CP) will reject some requests to prevent conflicting data.
Understanding the CAP Theorem
The CAP theorem, originally posed by Eric Brewer, states that a distributed data store can only guarantee at most two of the following three properties: Consistency, Availability, and Partition Tolerance. Since networks are inherently unreliable, partition tolerance is a necessity in most real-world distributed systems. Thus the effective choice is between consistency (CP) and availability (AP).
Classic examples:
- CP systems: Google Spanner, Apache Zookeeper. They sacrifice availability during partitions to maintain strong consistency.
- AP systems: Amazon DynamoDB, Cassandra. They prioritize availability and eventual consistency.
- CA systems: Traditional single-node relational databases (which are not truly distributed) or systems that avoid partitions entirely—a near-impossible goal in practice.
In interviews, you may be asked to design a system and justify your trade-offs. For more depth, see the CAP theorem on Wikipedia.
Consensus Algorithms
Consensus algorithms allow a group of nodes to agree on a single value despite failures. They are vital for leader election, distributed locking, and achieving strong consistency. The two most well-known are Paxos and Raft.
Paxos
Paxos, introduced by Leslie Lamport, is a family of protocols for reaching consensus in a network of unreliable processors. It guarantees safety (no two nodes decide different values) and liveness (eventual agreement) under certain conditions. However, Paxos is notoriously difficult to understand and implement correctly. In interviews, a high-level understanding of its phases—prepare, promise, accept, learned—is sufficient.
Raft
Raft is designed to be more understandable than Paxos. It decomposes consensus into three subproblems: leader election, log replication, and safety. A leader is elected and then manages log replication; followers apply entries only after a majority commits them. Raft is used in etcd, Consul, and many other distributed systems.
For a visual explanation, refer to the Raft visualization.
Replication Strategies
Replication is the process of copying data across multiple nodes. It improves fault tolerance and read throughput but complicates writes. Common replication approaches include:
- Single-leader (primary-secondary): One node accepts writes and propagates changes to followers. Simple but creates a single point of write failure.
- Multi-leader: Multiple nodes accept writes and replicate to each other. More available but introduces write conflicts that must be resolved.
- Leaderless: Any node can accept writes, and read repair or hinted handoff ensures eventual consistency (e.g., Amazon Dynamo).
- Synchronous vs asynchronous: Synchronous replication ensures strong consistency but slows writes, while asynchronous replication risks data loss during leader failure.
When discussing replication in interviews, be ready to talk about the trade-offs between consistency, latency, and durability. The AWS Builder's Library offers practical case studies.
Sharding and Data Partitioning
Sharding (or horizontal partitioning) distributes a large dataset across multiple databases or storage nodes. It allows the system to scale beyond the capacity of a single node and can improve performance by localizing queries. Common sharding strategies:
- Range-based sharding: Data is split by key ranges (e.g., user IDs 1-1000 on shard A, 1001-2000 on shard B). Simple but can lead to hotspots if distribution is uneven.
- Hash-based sharding: A hash function (e.g., consistent hashing) maps each key to a shard. More balanced but resharding is expensive without consistent hashing.
- Directory-based sharding: A lookup table maps keys to shards. Flexible but introduces a single point of failure or coordination overhead.
Consistent hashing is a crucial technique for minimizing data movement when the number of shards changes. It's used in systems like Cassandra and DynamoDB.
Common Challenges in Distributed Systems
Network Partitions
A network partition (or split-brain scenario) occurs when nodes lose communication because of a network failure. During a partition, the system must choose between consistency and availability. Handling partitions gracefully—e.g., by having a quorum-based decision or a backup strategy—is a topic often explored in system design interviews.
Data Consistency and Conflict Resolution
Even without partitions, concurrent writes can lead to conflicting updates. Techniques for conflict resolution include:
- Last-write-wins (LWW): Each write uses a timestamp; the most recent one wins. Simple but can lose data.
- Vector clocks: Track causal relationships between updates to allow application-level merge.
- CRDTs (Conflict-free Replicated Data Types): Data structures that automatically converge to a consistent state without coordination.
Latency
Distributed systems are subject to network latency, which can vary widely. Techniques to mitigate latency include caching, content delivery networks (CDNs), and selecting geographically distributed datacenters. In interviews, consider latency when designing APIs and data flows.
Fault Detection
Distinguishing a crashed node from a slow or partitioned node is difficult. Common mechanisms include heartbeat protocols, timeouts, and gossip-style failure detectors. The choice of timeout values is a trade-off between false positives and detection speed.
Clock Skew and Event Ordering
Without a global clock, ordering events across nodes is tricky. Logical clocks (e.g., Lamport clocks, vector clocks) provide a way to order events in a distributed system without relying on physical time.
Distributed Transactions
Transactions that span multiple nodes require coordination, often using two-phase commit (2PC) or three-phase commit (3PC). 2PC is blocking and not fault-tolerant; 3PC improves but is more complex. Newer approaches like percolator (used in Google Spanner) or saga patterns (for microservices) offer alternatives.
Interview Preparation Tips
To prepare for distributed systems interviews, focus on the following:
- Know the fundamentals: Be able to define and explain scalability, fault tolerance, consistency, availability, and partition tolerance with clear examples.
- Understand the CAP theorem deeply: Discuss real-world databases and their trade-offs (e.g., Cassandra vs. Spanner).
- Practice whiteboarding system design: Common questions include designing a distributed key-value store, a chat system, a URL shortener, or a real-time leaderboard. Outline the replication, sharding, consistency model, and how you handle failures.
- Be ready to discuss trade-offs: Every design decision involves costs. Explain why you choose eventual consistency over strong consistency in a given scenario, or why you pick Raft over Paxos.
- Stay current: Read blogs from companies like Netflix, Uber, and Amazon. The Google Research publications on Spanner, MapReduce, and Chubby provide excellent interview fodder.
Mastering these concepts will not only help you pass technical interviews but also enable you to design and build resilient large-scale systems. Approach each topic with a focus on the underlying principles—the protocols and algorithms are tools, but the ability to reason about trade-offs is what sets strong candidates apart.