OP10: Eventual Consistency

Pamela Delgado

With the aim of gaining in lower latency, allowing more concurrency and doing less work (hopefully), the idea of deferring the work that may not be needed to later stages is drawing the attention of systems developers not only for programming languages but also for DBMSs and file systems. However, the decision to introduce this laziness should take into account its drawbacks and system properties. In this OP we try to produce a “recipe” to help systems designers in this respect and we focus specifically in Eventual consistency multi-master replication.
We start by defining the description of the problem, in Eventual Consistency the system will have to guarantee that, after a sufficiently long period of time T (i.e. eventually), updates committed to a master node will propagate to the other masters. We introduce our notation:
Master nodes n1,n2, ... nk
Update set for node i ui = {ui1, ui2, ..., uil}. For each node:
Provisional state ps , real state before rs, real state after rs'
Time before t, time after t0, time to reach eventual consistency T then t0 <= T
The naïve approach will usually run a total order agreement protocol, such as waiting for majority of responses from other master nodes, before applying changes to the local node.
N1. Ordered Set {u, u2, ..., ui} := agreement(nodes, Uki=1 ui)
NS1. Provisional state in time psi(t pi) = rsi, where ti < t pi < t'i
N2. Apply result of N1 in rs each node i rs'i := rsi; foreach (u in N1) apply(u)
NS2. Provisional state in time psi(t ri) = rs'iwhere tit'i <= t ri

We notice that this mechanism will ensure eventual consistency (correctness and termination will derive from the chosen algorithm proofs of these properties), even if some nodes might apply changes later than others. But there are also lazy mechanisms, consider:
L1. Apply provisionally updates for node i ps = rs; f oreach (u in ui) apply(u)
L2. When system policy requires, do N1 and N2.
The idea is to delay agreement work and only do it when a system policy requires it, one example could be when reads are performed on an item (in replica masters). Another policy could estimate a time in which reading old versioned items (in replica masters) is not considered to be harmful. In both cases we can also run N1 in batch mode (performing agreement in larger sets) to be more efficient.
The bet is that frequency(L2) << frequency(ui). The likeliness of this bet to hold depends on how well the policy fits the system, this might depend on the workload (e.g #reads vs #writes) or harmfulness of reading old version or more system probabilities (e.g. balancing updates between replicas).
Benefit of using lazy vs naive would be in terms of time that the system gains to enhance other properties like latency, bandwidth, concurrency, etc. So benefit = time(N1+N2)*(freq(updates)/freq(L2)-1)+time(L1) * freq(updates). And the cost will be to decrease the probability of having consistency between replicas during a period of time at a rate inversely proportional to benefit.