- Unreliable Networks
- Unreliable Clocks
- Knowledge, Truth, and Lies
The usual way of handling this issue is a timeout: after some time you give up waiting and assume that the response is not going to arrive. However, when a timeout occurs, you still don’t know whether the remote node got your request or not.
It is possible to give hard real-time response guarantees and bounded delays in networks, but doing so is very expensive and results in lower utilization of hardware resources. Most non-safety-critical systems choose cheap and unreliable over expensive and reliable.
Say you have a database with a single leader per partition. Only the leader is allowed to accept writes. How does a node know that it is still leader (that it hasn’t been declared dead by the others), and that it may safely accept writes?
One option is for the leader to obtain a lease from the other nodes. Only one node can hold the lease at any one time—thus, when a node obtains a lease, it knows that it is the leader for some amount of time, until the lease expires. In order to remain leader, the node must periodically renew the lease before it expires. If the node fails, it stops renewing the lease, so another node can take over when it expires.
You can imagine the request-handling loop looking something like this:
Firstly, it’s relying on synchronized clocks.However, what if there is an unexpected pause in the execution of the program? For example, imagine the thread stops for 15 seconds around the line
lease.isValid() before finally continuing. In that case, it’s likely that the lease will have expired by the time the request is processed, and another node has already taken over as leader. However, there is nothing to tell this thread that it was paused for so long, so this code won’t notice that the lease has expired until the next iteration of the loop—by which time it may have already done something unsafe by processing the request.
So far in this chapter we have explored the ways in which distributed systems are different from programs running on a single computer: there is no shared memory, only message passing via an unreliable network with variable delays, and the systems may suffer from partial failures, unreliable clocks, and processing pauses.
A node cannot necessarily trust its own judgment of a situation.A distributed system cannot exclusively rely on a single node, because a node may fail at any time, potentially leaving the system stuck and unable to recover. Instead, many distributed algorithms rely on a quorum, that is, voting among the nodes.
That includes decisions about declaring nodes dead. If a quorum of nodes declares another node dead, then it must be considered dead, even if that node still very much feels alive. The individual node must abide by the quorum decision and step down.
If the client holding the lease is paused for too long, its lease expires. Another client can obtain a lease for the same file, and start writing to the file. When the paused client comes back, it believes (incorrectly) that it still has a valid lease and proceeds to also write to the file. As a result, the clients’ writes clash and corrupt the file.
We need to ensure that a node that is under a false belief of being “the chosen one” cannot disrupt the rest of the system.
Fencing tokens can detect and block a node that is inadvertently acting in error.If the node deliberately wanted to subvert the system’s guarantees, it could easily do so by sending messages with a fake fencing token.
Distributed systems problems become much harder if there is a risk that nodes may “lie” (send arbitrary faulty or corrupted responses)—for example, if a node may claim to have received a particular message when in fact it didn’t. Such behavior is known as a Byzantine fault, and the problem of reaching consensus in this untrusting environment is known as the Byzantine Generals Problem.
A system is Byzantine fault-tolerant if it continues to operate correctly even if some of the nodes are malfunctioning and not obeying the protocol, or if malicious attackers are interfering with the network.
With regard to timing assumptions, three system models are in common use:
- Synchronous model
- Partially synchronous model
- Asynchronous model
Moreover, besides timing issues, we have to consider node failures. The three most common system models for nodes are:
- Crash-stop faults
- Crash-recovery faults
- Byzantine (arbitrary) faults
For modeling real systems, the partially synchronous model with crash-recovery faults is generally the most useful model.
Liveness properties often include the word “eventually” in their definition.
Safety is often informally defined as nothing bad happens, and liveness as something good eventually happens.
An advantage of distinguishing between safety and liveness properties is that it helps us deal with difficult system models.
Safety and liveness properties and system models are very useful for reasoning about the correctness of a distributed algorithm.
Proving an algorithm correct does not mean its implementation on a real system will necessarily always behave correctly.