- 1. Consistency Guarantees
- 2. Linearizability
- 2.1. What Makes a System Linearizable?
- 2.2. Relying on Linearizability
- 2.3. Implementing Linearizable Systems
- 2.4. The Cost of Linearizability
- 3. Ordering Guarantees
- 3.1. Ordering and Causality
- 3.2. Sequence Number Ordering
- 3.3. Total Order Broadcast
- 4. Distributed Transactions and Consensus
- 4.1. Atomic Commit and Two-Phase Commit (2PC)
- 4.2. Distributed Transactions in Practice
- 4.3. Fault-Tolerant Consensus
- 4.4. Membership and Coordination Services
- 5. Summary
consensus: that is, getting all of the nodes to agree on something.
If two nodes both believe that they are the leader, that situation is called split brain, and it often leads to data loss. Correct implementations of consensus help avoid such problems.
Systems with stronger guarantees may have worse performance or be less fault-tolerant than systems with weaker guarantees.
Transaction isolation is primarily about avoiding race conditions due to concurrently executing transactions, whereas distributed consistency is mostly about coordinating the state of replicas in the face of delays and faults.
Linearizability，the basic idea is to make a system appear as if there were only one copy of the data, and all operations on it are atomic.
To make the system linearizable, we need to add another constraint.
Linearizability Versus Serializability
Serializability is an isolation property of transactions, where every transaction may read and write multiple objects (rows, documents, records).It guarantees that transactions behave the same as if they had executed in some serial order (each transaction running to completion before the next transaction starts).
Linearizability is a recency guarantee on reads and writes of a register (an individual object). It doesn’t group operations together into transactions, so it does not prevent problems such as write skew, unless you take additional measures.
Uniqueness constraints are common in databases.
Similar issues arise if you want to ensure that a bank account balance never goes neg‐ ative, or that you don’t sell more items than you have in stock in the warehouse, or that two people don’t concurrently book the same seat on a flight or in a theater. These constraints all require there to be a single up-to-date value (the account balance, the stock level, the seat occupancy) that all nodes agree on.
The message queue (steps 3 and 4 in Figure 9-5) might be faster than the internal replication inside the storage service.
This problem arises because there are two different communication channels between the web server and the resizer: the file storage and the message queue.
Linearizability essentially means “behave as though there is only a single copy of the data, and all operations on it are atomic”.
- Single-leader replication (potentially linearizable)
- Consensus algorithms (linearizable)
- Multi-leader replication (not linearizable)
- Leaderless replication (probably not linearizable)
Intuitively, it seems as though strict quorum reads and writes should be linearizable in a Dynamo-style model. However, when we have variable network delays, it is possible to have race conditions, as demonstrated in Figure 9-6.
It is safest to assume that a leaderless system with Dynamo-style replication does not provide linearizability.
The CAP theorem as formally defined is of very narrow scope: it only considers one consistency model (namely linearizability) and one kind of fault(network partitions). It doesn’t say anything about network delays, dead nodes, or other trade-offs. Thus, although CAP has been historically influential, it has little practical value for designing systems.
Causality imposes an ordering on events: cause comes before effect.If a system obeys the ordering imposed by causality, we say that it is causally consistent.
Causal consistency is the strongest possible consistency model that does not slow down due to network delays, and remains available in the face of network failures.
In order to maintain causality, you need to know which operation happened before which other operation.
We can use sequence numbers or timestamps to order events.
It is not sufficient when a node has just received a request from a user to create a username, and needs to decide right now whether the request should succeed or fail.
In order to be sure that no other node is in the process of concurrently creating an account with the same username and a lower timestamp, you would have to check with every other node to see what it is doing.
The problem here is that the total order of operations only emerges after you have collected all of the operations.
In order to implement something like a uniqueness constraint for usernames, it’s not sufficient to have a total ordering of operations—you also need to know when that order is finalized.
This idea of knowing when your total order is finalized is captured in the topic of total order broadcast.
Total order broadcast is usually described as a protocol for exchanging messages between nodes. Informally, it requires that two safety properties always be satisfied:
- Reliable delivery
- Totally ordered delivery
Consensus services such as ZooKeeper and etcd actually implement total order broadcast.
An important aspect of total order broadcast is that the order is fixed at the time the messages are delivered.
Total order broadcast is asynchronous: messages are guaranteed to be delivered reliably in a fixed order, but there is no guarantee about when a message will be delivered. By contrast, linearizability is a recency guarantee: a read is guaranteed to see the latest value written.
- Leader election
- Atomic commit
Two-phase commit is an algorithm for achieving atomic transaction commit across multiple nodes—i.e., to ensure that either all nodes commit or all nodes abort.
The commit/abort process in 2PC is split into two phases (hence the name).
If any of the prepare requests fail or time out, the coordinator aborts the transaction; if any of the commit or abort requests fail, the coordinator retries them indefinitely.
The only way 2PC can complete is by waiting for the coordinator to recover.
Two-phase commit is called a blocking atomic commit protocol due to the fact that 2PC can become stuck waiting for the coordinator to recover.
three-phase commit (3PC) assumes a network with bounded delay and nodes with bounded response times; in most practical systems with unbounded network delay and process pauses , it cannot guarantee atomicity.
Database-internal distributed transactions
Heterogeneous distributed transactions
XA is not a network protocol—it is merely a C API for interfacing with a transaction coordinator.
Informally, consensus means getting several nodes to agree on something.
The consensus problem is normally formalized as follows: one or more nodes may propose values, and the consensus algorithm decides on one of those values.
In this formalism, a consensus algorithm must satisfy the following properties:
- Uniform agreement
The best-known fault-tolerant consensus algorithms are Viewstamped Replication (VSR), Paxos, Raft, and Zab.
They decide on a sequence of values, which makes them total order broadcast algorithms
All of the consensus protocols discussed so far internally use a leader in some form or another, but they don’t guarantee that the leader is unique. Instead, they can make a weaker guarantee: the protocols define an epoch number and guarantee that within each epoch, the leader is unique.
We have two rounds of voting: once to choose a leader, and a second time to vote on a leader’s proposal.
This voting process looks superficially similar to two-phase commit. The biggest differences are that in 2PC the coordinator is not elected, and that fault-tolerant consensus algorithms only require votes from a majority of nodes, whereas 2PC requires a “yes” vote from every participant. Moreover, consensus algorithms define a recovery process by which nodes can get into a consistent state after a new leader is elected, ensuring that the safety properties are always met. These differences are key to the correctness and fault tolerance of a consensus algorithm.
The process by which nodes vote on proposals before they are decided is a kind of synchronous replication.
Consensus systems always require a strict majority to operate.
Most consensus algorithms assume a fixed set of nodes that participate in voting, which means that you can’t just add or remove nodes in the cluster.
Consensus systems generally rely on timeouts to detect failed nodes.
Sometimes, consensus algorithms are particularly sensitive to network problems.
ZooKeeper and etcd are designed to hold small amounts of data that can fit entirely in memory.That small amount of data is replicated across all the nodes using a fault-tolerant total order broadcast algorithm.
ZooKeeper is modeled after Google’s Chubby lock service , implementing not only total order broadcast (and hence consensus), but also an interesting set of other features that turn out to be particularly useful when building distributed systems:
- Linearizable atomic operations
- Total ordering of operations
- Failure detection
- Change notifications
Normally, the kind of data managed by ZooKeeper is quite slow-changing.
A membership service determines which nodes are currently active and live members of a cluster.
It turns out that a wide range of problems are actually reducible to consensus and are equivalent to each other.
- Linearizable compare-and-set registers
- Atomic transaction commit
- Total order broadcast
- Locks and leases
- Membership/coordination service
- Uniqueness constraint
Not every system necessarily requires consensus: for example, leaderless and multi-leader replication systems typically do not use global consensus.