<!DOCTYPE html>
<html>
<head>
<title>Distributed systems for fun and profit</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<style type="text/css">
@font-face {
font-family: 'Droid Sans';
font-style: normal;
font-weight: normal;
src: local('Droid Sans'), local('DroidSans'), url('DroidSans.woff') format('woff');
}
</style>
<script src="assets/jquery-1.6.1.min.js"></script>
<link type="text/css" rel="stylesheet" href="assets/style.css"/>
<link type="text/css" rel="stylesheet" href="assets/assert.css"/>
<link type="text/css" rel="stylesheet" href="assets/prettify.css"/>
<script type="text/javascript">
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-26716650-2', 'mixu.net');
ga('send', 'pageview');
</script>
</head>
<body>
<div id="wrapper">
<div class="header">
<div id="brand">
<h1 style="color: white; background: #D82545; display: inline-block; padding: 6px;">Distributed systems</h1>
<p>for fun and profit</p>
</div>
<div id="navi">
<form class="search" action="http://www.google.com/search">
<input type="hidden" name="as_sitesearch" value="singlepageappbook.com">
<input type="text" name="as_q" value="" class="search_field">
<input type="submit" value="Search" class="search_btn">
</form>
</div>
</div>
<div class="clear">
<hr>
</div>
<div class="header nav">
<p><a href="replication.html">Previous Chapter</a> | <a href="index.html">Home</a> | <a href="appendix.html">Next Chapter</a></p>
</div>
<div class="clear">
<hr>
</div>
<!-- Main -->
<div id="main">
<div id="content" class="post">
<h1>5. Replication: weak consistency model protocols</h1>
<p>Now that we've taken a look at protocols that can enforce single-copy consistency under an increasingly realistic set of supported failure cases, let's turn our attention to the world of options that opens up once we let go of the requirement of single-copy consistency.</p>
<p>By and large, it is hard to come up with a single dimension that defines or characterizes the protocols that allow for replicas to diverge. Most such protocols are highly available, and the key issue is more whether or not the end users find the guarantees, abstractions and APIs useful for their purpose in spite of the fact that the replicas may diverge when node and/or network failures occur.</p>
<p>Why haven't weakly consistent systems been more popular?</p>
<p>As I stated in the introduction, I think that much of distributed programming is about dealing with the implications of two consequences of distribution:</p>
<ul class="list">
<li>that information travels at the speed of light</li>
<li>that independent things fail independently</li>
</ul>
<p>The implication that follows from the limitation on the speed at which information travels is that nodes experience the world in different, unique ways. Computation on a single node is easy, because everything happens in a predictable global total order. Computation on a distributed system is difficult, because there is no global total order.</p>
<p>For the longest while (e.g. decades of research), we've solved this problem by introducing a global total order. I've discussed the many methods for achieving strong consistency by creating order (in a fault-tolerant manner) where there is no naturally occurring total order.</p>
<p>Of course, the problem is that enforcing order is expensive. This breaks down in particular with large scale internet systems, where a system needs to remain available. A system enforcing strong consistency doesn't behave like a distributed system: it behaves like a single system, which is bad for availability during a partition.</p>
<p>Furthermore, for each operation, often a majority of the nodes must be contacted - and often not just once, but twice (as you saw in the discussion on 2PC). This is particularly painful in systems that need to be geographically distributed to provide adequate performance for a global user base.</p>
<p>So behaving like a single system by default is perhaps not desirable.</p>
<p>Perhaps what we want is a system where we can write code that doesn't use expensive coordination, and yet returns a "usable" value. Instead of having a single truth, we will allow different replicas to diverge from each other - both to keep things efficient but also to tolerate partitions - and then try to find a way to deal with the divergence in some manner.</p>
<p>Eventual consistency expresses this idea: that nodes can for some time diverge from each other, but that eventually they will agree on the value.</p>
<p>Within the set of systems providing eventual consistency, there are two types of system designs:</p>
<p><em>Eventual consistency with probabilistic guarantees</em>. This type of system can detect conflicting writes at some later point, but does not guarantee that the results are equivalent to some correct sequential execution. In other words, conflicting updates will sometimes result in overwriting a newer value with an older one and some anomalies can be expected to occur during normal operation (or during partitions).</p>
<p>In recent years, the most influential system design in this category is Amazon's Dynamo, which I will discuss as an example of a system that offers eventual consistency with probabilistic guarantees.</p>
<p><em>Eventual consistency with strong guarantees</em>. This type of system guarantees that the results converge to a common value equivalent to some correct sequential execution. In other words, such systems do not produce anomalous results; without any coordination you can build replicas of the same service, and those replicas can communicate in any pattern and receive the updates in any order, and they will eventually agree on the end result as long as they all see the same information.</p>
<p>CRDT's (convergent replicated data types) are data types that guarantee convergence to the same value in spite of network delays, partitions and message reordering. They are provably convergent, but the data types that can be implemented as CRDT's are limited.</p>
<p>The CALM (consistency as logical monotonicity) conjecture is an alternative expression of the same principle: it equates logical monotonicity with convergence. If we can conclude that something is logically monotonic, then it is also safe to run without coordination. Confluence analysis - in particular, as applied for the Bloom programming language - can be used to guide programmer decisions about when and where to use the coordination techniques from strongly consistent systems and when it is safe to execute without coordination.</p>
<h2>Reconciling different operation orders</h2>
<p>What does a system that does not enforce single-copy consistency look like? Let's try to make this more concrete by looking at a few examples.</p>
<p>Perhaps the most obvious characteristic of systems that do not enforce single-copy consistency is that they allow replicas to diverge from each other. This means that there is no strictly defined pattern of communication: replicas can be separated from each other and yet continue to be available and accept writes.</p>
<p>Let's imagine a system of three replicas, each of which is partitioned from the others. For example, the replicas might be in different datacenters and for some reason unable to communicate. Each replica remains available during the partition, accepting both reads and writes from some set of clients:</p>
<pre>[Clients]  -->  [A]
--- Partition ---
[Clients]  -->  [B]
--- Partition ---
[Clients]  -->  [C]</pre>
<p>After some time, the partitions heal and the replica servers exchange information. They have received different updates from different clients and have diverged from each other, so some sort of reconciliation needs to take place. What we would like to happen is that all of the replicas converge to the same result.</p>
<pre>[A] \
     --> [merge]
[B] /     |
          |
[C] ----[merge]---> result</pre>
<p>Another way to think about systems with weak consistency guarantees is to imagine a set of clients sending messages to two replicas in some order. Because there is no coordination protocol that enforces a single total order, the messages can get delivered in different orders at the two replicas:</p>
<pre>[Clients] --> [A] 1, 2, 3
[Clients] --> [B] 2, 3, 1</pre>
<p>This is, in essence, the reason why we need coordination protocols. For example, assume that we are trying to concatenate a string and the operations in messages 1, 2 and 3 are:</p>
<pre>1: { operation: concat('Hello ') }
2: { operation: concat('World') }
3: { operation: concat('!') }</pre>
<p>Then, without coordination, A will produce "Hello World!", and B will produce "World!Hello ".</p>
<pre>A: concat(concat(concat('', 'Hello '), 'World'), '!') = 'Hello World!'
B: concat(concat(concat('', 'World'), '!'), 'Hello ') = 'World!Hello '</pre>
<p>This is, of course, incorrect. Again, what we'd like to happen is that the replicas converge to the same result.</p>
<p>Keeping these two examples in mind, let's look at Amazon's Dynamo first to establish a baseline, and then discuss a number of novel approaches to building systems with weak consistency guarantees, such as CRDT's and the CALM theorem.</p>
<h2>Amazon's Dynamo</h2>
<p>Amazon's Dynamo system design (2007) is probably the best-known system that offers weak consistency guarantees but high availability. It is the basis for many other real world systems, including LinkedIn's Voldemort, Facebook's Cassandra and Basho's Riak.</p>
<p>Dynamo is an eventually consistent, highly available key-value store. A key value store is like a large hash table: a client can set values via <code>set(key, value)</code> and retrieve them by key using <code>get(key)</code>. A Dynamo cluster consists of N peer nodes; each node has a set of keys which it is responsible for storing.</p>
<p>Dynamo prioritizes availability over consistency; it does not guarantee single-copy consistency. Instead, replicas may diverge from each other when values are written; when a key is read, there is a read reconciliation phase that attempts to reconcile differences between replicas before returning the value back to the client.</p>
<p>For many features on Amazon, it is more important to avoid outages than it is to ensure that data is perfectly consistent, as an outage can lead to lost business and a loss of credibility. Furthermore, if the data is not particularly important, then a weakly consistent system can provide better performance and higher availability at a lower cost than a traditional RDBMS.</p>
<p>Since Dynamo is a complete system design, there are many different parts to look at beyond the core replication task. The diagram below illustrates some of the tasks; notably, how a write is routed to a node and written to multiple replicas.</p>
<pre>[ Client ]
    |
( Mapping keys to nodes )
    |
    V
[ Node A ]
    |      \
( Synchronous replication task: minimum durability )
    |        \
[ Node B]  [ Node C ]
    A
    |
( Conflict detection; asynchronous replication task:
  ensuring that partitioned / recovered nodes recover )
    |
    V
[ Node D]</pre>
<p>After looking at how a write is initially accepted, we'll look at how conflicts are detected, as well as the asynchronous replica synchronization task. This task is needed because of the high availability design, in which nodes may be temporarily unavailable (down or partitioned). The replica synchronization task ensures that nodes can catch up fairly rapidly even after a failure.</p>
<h3>Consistent hashing</h3>
<p>Whether we are reading or writing, the first thing that needs to happen is that we need to locate where the data should live on the system. This requires some type of key-to-node mapping.</p>
<p>In Dynamo, keys are mapped to nodes using a hashing technique known as <a href="https://github.com/mixu/vnodehash">consistent hashing</a> (which I will not discuss in detail). The main idea is that a key can be mapped to a set of nodes responsible for it by a simple calculation on the client. This means that a client can locate keys without having to query the system for the location of each key; this saves system resources as hashing is generally faster than performing a remote procedure call.</p>
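<p>As a rough illustration of the idea (a sketch, not Dynamo's actual implementation; the <code>Ring</code> and <code>preference_list</code> names are mine), nodes and keys are hashed onto the same ring, and a key is stored on the first N distinct nodes encountered clockwise from its hash:</p>
<pre>import hashlib
from bisect import bisect

class Ring:
    """Minimal consistent-hashing ring (illustrative sketch only)."""
    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        # place each node on the ring at the hash of its name
        self.ring = sorted((self._hash(n), n) for n in nodes)

    def _hash(self, value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def preference_list(self, key):
        """Return the first `replicas` distinct nodes clockwise from hash(key)."""
        start = bisect(self.ring, (self._hash(key),))
        nodes = []
        for i in range(len(self.ring)):
            node = self.ring[(start + i) % len(self.ring)][1]
            if node not in nodes:
                nodes.append(node)
            if len(nodes) == self.replicas:
                break
        return nodes

ring = Ring(['A', 'B', 'C', 'D', 'E'])
print(ring.preference_list('user:42'))  # the three replicas responsible for this key</pre>
<p>Because the placement is a pure function of the key and the node list, any client that knows the membership can compute it locally; adding or removing a node only remaps the keys adjacent to it on the ring.</p>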
<h3>Partial quorums</h3>
<p>Once we know where a key should be stored, we need to do some work to persist the value. This is a synchronous task; the reason why we will immediately write the value onto multiple nodes is to provide a higher level of durability (e.g. protection from the immediate failure of a node).</p>
<p>Just like Paxos or Raft, Dynamo uses quorums for replication. However, Dynamo's quorums are sloppy (partial) quorums rather than strict (majority) quorums.</p>
<p>Informally, a strict quorum system is a quorum system with the property that any two quorums (sets) in the quorum system overlap. Requiring a majority to vote for an update before accepting it guarantees that only a single history is admitted since each majority quorum must overlap in at least one node. This was the property that Paxos, for example, relied on.</p>
<p>Partial quorums do not have that property; what this means is that a majority is not required and that different subsets of the quorum may contain different versions of the same data. The user can choose the number of nodes to write to and read from:</p>
<ul class="list">
<li>the user can choose some number W-of-N nodes required for a write to succeed; and</li>
<li>the user can specify the number of nodes (R-of-N) to be contacted during a read.</li>
</ul>
<p><code>W</code> and <code>R</code> specify the number of nodes that need to be involved in a write or a read. Writing to more nodes makes writes slightly slower but increases the probability that the value is not lost; reading from more nodes increases the probability that the value read is up to date.</p>
<p>The usual recommendation is that <code>R + W > N</code>, because this means that the read and write quorums overlap in one node - making it less likely that a stale value is returned. A typical configuration is <code>N = 3</code> (e.g. a total of three replicas for each value); this means that the user can choose between:</p>
<pre> R = 1, W = 3;
R = 2, W = 2 or
R = 3, W = 1</pre>
<p>More generally, again assuming <code>R + W > N</code>:</p>
<ul class="list">
<li><code>R = 1</code>, <code>W = N</code>: fast reads, slow writes</li>
<li><code>R = N</code>, <code>W = 1</code>: fast writes, slow reads</li>
<li><code>R = N/2</code> and <code>W = N/2 + 1</code>: favorable to both</li>
</ul>
<p>N is rarely more than 3, because keeping that many copies of large amounts of data around gets expensive!</p>
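<p>As a minimal sketch of the write path (not Dynamo's actual code; <code>quorum_write</code> and the replica <code>put</code> method are illustrative names), a coordinator sends the write to all N replicas but reports success as soon as W of them have acknowledged it:</p>
<pre>import concurrent.futures as futures

def quorum_write(replicas, key, value, w):
    """Send the write to all N replicas; succeed once W of them have acked."""
    pool = futures.ThreadPoolExecutor(max_workers=len(replicas))
    pending = [pool.submit(r.put, key, value) for r in replicas]
    acks = 0
    for done in futures.as_completed(pending):
        if done.exception() is None:
            acks += 1
        if acks >= w:
            pool.shutdown(wait=False)  # don't wait for the slower replicas
            return True                # W replicas are durable; report success
    pool.shutdown(wait=False)
    return False                       # fewer than W acks: report failure</pre>
<p>A read follows the same pattern with <code>R</code> in place of <code>W</code>, plus the read reconciliation discussed below; the difference is only in how many responses the coordinator waits for before answering.</p>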
<p>As I mentioned earlier, the Dynamo paper has inspired many other similar designs. They all use the same partial quorum based replication approach, but with different defaults for N, W and R:</p>
<ul class="list">
<li>Basho's Riak (N = 3, R = 2, W = 2 default)</li>
<li>LinkedIn's Voldemort (N = 2 or 3, R = 1, W = 1 default)</li>
<li>Apache's Cassandra (N = 3, R = 1, W = 1 default)</li>
</ul>
<p>There is another detail: when sending a read or write request, are all N nodes asked to respond (Riak), or only the minimum number of nodes needed to meet the quorum (e.g. R or W; Voldemort)? The "send-to-all" approach is faster and less sensitive to latency (since it only waits for the fastest R or W nodes of N) but also less efficient, while the "send-to-minimum" approach is more sensitive to latency (since latency communicating with a single node will delay the operation) but also more efficient (fewer messages / connections overall).</p>
<p>What happens when the read and write quorums overlap, e.g. (<code>R + W > N</code>)? Specifically, it is often claimed that this results in "strong consistency".</p>
<h3>Is R + W > N the same as "strong consistency"?</h3>
<p>No.</p>
<p>It's not completely off base: a system where <code>R + W > N</code> can detect read/write conflicts, since any read quorum and any write quorum share a member. E.g. at least one node is in both quorums:</p>
<pre>   1     2       N/2+1    N/2+2      N
 [...]  [R]     [R + W]    [W]     [...]</pre>
<p>This guarantees that a previous write will be seen by a subsequent read. However, this only holds if the nodes in N never change. Hence, Dynamo doesn't qualify, because in Dynamo the cluster membership can change if nodes fail.</p>
<p>Dynamo is designed to be always writable. It has a mechanism which handles node failures by adding a different, unrelated server into the set of nodes responsible for certain keys when the original server is down. This means that the quorums are no longer guaranteed to always overlap. Even <code>R = W = N</code> would not qualify, since while the quorum sizes are equal to N, the nodes in those quorums can change during a failure. Concretely, during a partition, if a sufficient number of nodes cannot be reached, Dynamo will add new nodes to the quorum from unrelated but accessible nodes.</p>
<p>Furthermore, Dynamo doesn't handle partitions in the manner that a system enforcing a strong consistency model would: namely, writes are allowed on both sides of a partition, which means that for at least some time the system does not act as a single copy. So calling <code>R + W > N</code> "strongly consistent" is misleading; the guarantees are merely probabilistic - which is not what strong consistency refers to.</p>
<h3>Conflict detection and read repair</h3>
<p>Systems that allow replicas to diverge must have a way to eventually reconcile two different values. As briefly mentioned in the discussion of partial quorums, one way to do this is to detect conflicts at read time, and then apply some conflict resolution method. But how is this done?</p>
<p>In general, this is done by tracking the causal history of a piece of data by supplementing it with some metadata. Clients must keep the metadata information when they read data from the system, and must pass the metadata back when writing to the database.</p>
<p>We've already encountered a method for doing this: vector clocks can be used to represent the history of a value. Indeed, this is what the original Dynamo design uses for detecting conflicts.</p>
<p>However, vector clocks are not the only option. If you look at many practical system designs, you can deduce quite a bit about how they work by looking at the metadata that they track.</p>
<p><em>No metadata</em>. When a system does not track metadata, and only returns the value (e.g. via a client API), it cannot really do anything special about concurrent writes. A common rule is that the last writer wins: in other words, if two writers are writing at the same time, only the value from the slowest writer is kept around.</p>
<p><em>Timestamps</em>. Nominally, the value with the higher timestamp value wins. However, if time is not carefully synchronized, many odd things can happen where old data from a system with a faulty or fast clock overwrites newer values. Facebook's Cassandra is a Dynamo variant that uses timestamps instead of vector clocks.</p>
<p><em>Version numbers</em>. Version numbers may avoid some of the issues related to using timestamps. Note that the smallest mechanism that can accurately track causality when multiple histories are possible is vector clocks, not version numbers.</p>
<p><em>Vector clocks</em>. Using vector clocks, concurrent and out of date updates can be detected. Performing read repair then becomes possible, though in some cases (concurrent changes) we need to ask the client to pick a value. This is because if the changes are concurrent and we know nothing more about the data (as is the case with a simple key-value store), then it is better to ask than to discard data arbitrarily.</p>
<p>When reading a value, the client contacts <code>R</code> of <code>N</code> nodes and asks them for the latest value for a key. It takes all the responses and discards the values that are strictly older (using the vector clocks to detect this). If there is only one unique vector clock + value pair, it returns that. If there are multiple vector clock + value pairs that have been edited concurrently (e.g. are not comparable), then all of those values are returned.</p>
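<p>A minimal sketch of that reconciliation step, treating a vector clock as a plain mapping from node id to counter (an illustration of the idea, not Dynamo's implementation):</p>
<pre>def descends(a, b):
    """True if clock `a` dominates or equals `b`, i.e. `b`'s history is contained in `a`'s."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def reconcile(responses):
    """Given (vector_clock, value) pairs from R replicas, drop every value whose
    clock is strictly dominated by another; what remains is either a single
    winner or a set of concurrent siblings."""
    frontier = []
    for clock, value in responses:
        if any(descends(other, clock) and other != clock for other, _ in frontier):
            continue  # strictly older than something already kept
        # drop anything kept so far that this response strictly dominates
        frontier = [(c, v) for c, v in frontier
                    if not (descends(clock, c) and clock != c)]
        frontier.append((clock, value))
    return frontier

siblings = reconcile([({'A': 2, 'B': 1}, 'x'),   # kept
                      ({'A': 1, 'B': 1}, 'y'),   # dominated by the first response, dropped
                      ({'A': 1, 'B': 2}, 'z')])  # concurrent with the first, kept as a sibling</pre>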
<p>As is obvious from the above, read repair may return multiple values. This means that the client / application developer must occasionally handle these cases by picking a value based on some use-case specific criterion.</p>
<p>In addition, a key component of a practical vector clock system is that the clocks cannot be allowed to grow forever - so there needs to be a procedure for occasionally garbage collecting the clocks in a safe manner to balance fault tolerance with storage requirements.</p>
<h3>Replica synchronization: gossip and Merkle trees</h3>
<p>Given that the Dynamo system design is tolerant of node failures and network partitions, it needs a way to deal with nodes rejoining the cluster after being partitioned, or when a failed node is replaced or partially recovered.</p>
<p>Replica synchronization is used to bring nodes up to date after a failure, and for periodically synchronizing replicas with each other.</p>
<p>Gossip is a probabilistic technique for synchronizing replicas. The pattern of communication (e.g. which node contacts which node) is not determined in advance. Instead, nodes have some probability <code>p</code> of attempting to synchronize with each other. Every <code>t</code> seconds, each node picks a node to communicate with. This provides an additional mechanism beyond the synchronous task (e.g. the partial quorum writes) which brings the replicas up to date.</p>
<p>Gossip is scalable, and has no single point of failure, but can only provide probabilistic guarantees.</p>
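<p>The periodic part is simple enough to sketch (the <code>sync_with</code> call below is a placeholder for whatever anti-entropy exchange the system uses - in Dynamo's case, the Merkle tree comparison described next):</p>
<pre>import random, threading, time

def start_gossip(self_node, peers, interval_seconds):
    """Every `interval_seconds`, pick one random peer and exchange state with it."""
    def run():
        while True:
            time.sleep(interval_seconds)
            peer = random.choice(peers)
            self_node.sync_with(peer)  # bidirectional state exchange (placeholder)
    threading.Thread(target=run, daemon=True).start()</pre>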
<p>In order to make the information exchange during replica synchronization efficient, Dynamo uses a technique called Merkle trees, which I will not cover in detail. The key idea is that a data store can be hashed at multiple different levels of granularity: a hash representing the whole content, half the keys, a quarter of the keys and so on.</p>
<p>By maintaining this fairly granular hashing, nodes can compare their data store content much more efficiently than a naive technique. Once the nodes have identified which keys have different values, they exchange the necessary information to bring the replicas up to date.</p>
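<p>The essence of the technique can be sketched as a recursive comparison of range hashes, descending only into subranges whose hashes differ (a simplified illustration, not Dynamo's actual tree layout):</p>
<pre>import hashlib

def range_hash(store, keys):
    """Hash the (key, value) pairs for a sorted list of keys."""
    h = hashlib.sha1()
    for k in keys:
        h.update(('%s=%s' % (k, store.get(k))).encode())
    return h.digest()

def differing_keys(store_a, store_b, keys):
    """Narrow down which keys differ by comparing hashes of halves of the key
    range, instead of shipping every key/value pair across the network."""
    if not keys:
        return []
    if range_hash(store_a, keys) == range_hash(store_b, keys):
        return []                 # whole subrange identical: nothing to exchange
    if len(keys) == 1:
        return list(keys)         # leaf: this key differs between the replicas
    mid = len(keys) // 2
    return (differing_keys(store_a, store_b, keys[:mid]) +
            differing_keys(store_a, store_b, keys[mid:]))</pre>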
<h3>Dynamo in practice: probabilistically bounded staleness (PBS)</h3>
<p>And that pretty much covers the Dynamo system design:</p>
<ul class="list">
<li>consistent hashing to determine key placement</li>
<li>partial quorums for reading and writing</li>
<li>conflict detection and read repair via vector clocks and</li>
<li>gossip for replica synchronization</li>
</ul>
<p>How might we characterize the behavior of such a system? A fairly recent paper from Bailis et al. (2012) describes an approach called <a href="http://pbs.cs.berkeley.edu/">PBS</a> (probabilistically bounded staleness), which uses simulation and data collected from a real world system to characterize the expected behavior of such a system.</p>
<p>PBS estimates the degree of inconsistency by using information about the anti-entropy (gossip) rate, the network latency and local processing delay to estimate the expected level of consistency of reads. It has been implemented in Cassandra, where timing information is piggybacked on other messages and an estimate is calculated based on a sample of this information in a Monte Carlo simulation.</p>
<p>Based on the paper, during normal operation eventually consistent data stores are often faster and can read a consistent state within tens or hundreds of milliseconds. The table below illustrates the amount of time required for a 99.9% probability of consistent reads, given different <code>R</code> and <code>W</code> settings, based on empirical timing data from LinkedIn (SSD and 15k RPM disks) and Yammer:</p>
<p class="img-container"><img src="./images/pbs.png" alt="from the PBS paper"></p>
<p>For example, going from <code>R=1</code>, <code>W=1</code> to <code>R=2</code>, <code>W=1</code> in the Yammer case reduces the inconsistency window from 1352 ms to 202 ms - while keeping the read latencies lower (32.6 ms) than the fastest strict quorum (<code>R=3</code>, <code>W=1</code>; 219.27 ms).</p>
<p>For more details, have a look at the <a href="http://pbs.cs.berkeley.edu/">PBS website</a> and the associated paper.</p>
<h2>Disorderly programming</h2>
<p>Let's look back at the examples of the kinds of situations that we'd like to resolve. The first scenario consisted of three different servers behind partitions; after the partitions healed, we wanted the servers to converge to the same value. Amazon's Dynamo made this possible by reading from <code>R</code> out of <code>N</code> nodes and then performing read reconciliation.</p>
<p>In the second example, we considered a more specific operation: string concatenation. It turns out that there is no known technique for making string concatenation resolve to the same value without imposing an order on the operations (e.g. without expensive coordination). However, there are operations which can be applied safely in any order, where a simple register would not be able to do so. As Pat Helland wrote:</p>
<blockquote>
<p>... operation-centric work can be made commutative (with the right operations and the right semantics) where a simple READ/WRITE semantic does not lend itself to commutativity.</p>
</blockquote>
<p>For example, consider a system that implements a simple accounting system with the <code>debit</code> and <code>credit</code> operations in two different ways:</p>
<ul class="list">
<li>using a register with <code>read</code> and <code>write</code> operations, and</li>
<li>using an integer data type with native <code>debit</code> and <code>credit</code> operations</li>
</ul>
<p>The latter implementation knows more about the internals of the data type, and so it can preserve the intent of the operations in spite of the operations being reordered. Debiting or crediting can be applied in any order, and the end result is the same:</p>
<pre>100 + credit(10) + credit(20) = 130 and
100 + credit(20) + credit(10) = 130</pre>
<p>However, writing a fixed value cannot be done in any order: if writes are reordered, one of the writes will overwrite the other:</p>
<pre>100 + write(110) + write(130) = 130 but
100 + write(130) + write(110) = 110</pre>
<p>Let's take the example from the beginning of this chapter, but use a different operation. In this scenario, clients are sending messages to two nodes, which see the operations in different orders:</p>
<pre>[Clients] --> [A] 1, 2, 3
[Clients] --> [B] 2, 3, 1</pre>
<p>Instead of string concatenation, assume that we are looking to find the largest value (e.g. MAX()) for a set of integers. The messages 1, 2 and 3 are:</p>
<pre>1: { operation: max(previous, 3) }
2: { operation: max(previous, 5) }
3: { operation: max(previous, 7) }</pre>
<p>Then, without coordination, both A and B will converge to 7, e.g.:</p>
<pre>A: max(max(max(0, 3), 5), 7) = 7
B: max(max(max(0, 5), 7), 3) = 7</pre>
<p>In both cases, the two replicas see updates in a different order, but we are able to merge the results in a way that is independent of that order. The result converges to the same answer in both cases because of the merge procedure (<code>max</code>) we used.</p>
<p>It is likely not possible to write a merge procedure that works for all data types. In Dynamo, a value is a binary blob, so the best that can be done is to expose it and ask the application to handle each conflict.</p>
<p>However, if we know that the data is of a more specific type, handling these kinds of conflicts becomes possible. CRDT's are data structures designed to provide data types that will always converge, as long as they see the same set of operations (in any order).</p>
<h2>CRDTs: Convergent replicated data types</h2>
<p>CRDTs (convergent replicated datatypes) exploit knowledge regarding the commutativity and associativity of specific operations on specific datatypes.</p>
<p>In order for a set of operations to converge on the same value in an environment where replicas only communicate occasionally, the operations need to be order-independent and insensitive to (message) duplication/redelivery. Thus, their operations need to be:</p>
<ul class="list">
<li>Associative (<code>a+(b+c)=(a+b)+c</code>), so that grouping doesn't matter</li>
<li>Commutative (<code>a+b=b+a</code>), so that order of application doesn't matter</li>
<li>Idempotent (<code>a+a=a</code>), so that duplication does not matter</li>
</ul>
<p>It turns out that these structures are already known in mathematics; they are known as join or meet <a href="http://en.wikipedia.org/wiki/Semilattice">semilattices</a>.</p>
<p>A <a href="http://en.wikipedia.org/wiki/Lattice_%28order%29">lattice</a> is a partially ordered set in which any two elements have both a least upper bound (join) and a greatest lower bound (meet). A semilattice is like a lattice, but one in which any two elements only have one of the two. A join semilattice is one in which any two elements have a least upper bound (join) and a meet semilattice is one in which any two elements have a greatest lower bound (meet).</p>
<p>Any data type that can be expressed as a semilattice can be implemented as a data structure which guarantees convergence. For example, calculating the <code>max()</code> of a set of values will always return the same result regardless of the order in which the values were received, as long as all values are eventually received, because the <code>max()</code> operation is associative, commutative and idempotent.</p>
<p>For example, here are two lattices: one drawn for a set, where the merge operator is <code>union(items)</code> and one drawn for a strictly increasing integer counter, where the merge operator is <code>max(values)</code>:</p>
<pre>   { a, b, c }              7
  /     |    \            /  \
{a, b} {b,c} {a,c}       5    7
 |  \  /  |  /          / | \
{a}  {b}  {c}          3  5  7</pre>
<p>With data types that can be expressed as semilattices, you can have replicas communicate in any pattern and receive the updates in any order, and they will eventually agree on the end result as long as they all see the same information. That is a powerful property that can be guaranteed as long as the prerequisites hold.</p>
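<p>For instance, a grow-only counter is a semilattice in code: each replica only increments its own slot, and the merge is an element-wise <code>max</code>, which is associative, commutative and idempotent. A minimal sketch:</p>
<pre>class GCounter:
    """Grow-only counter: per-replica increments, merge = element-wise max."""
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}              # replica id -> increments seen from that replica

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        """Join of two states; applying merges in any order, any number of
        times, converges to the same result."""
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

a, b = GCounter('A'), GCounter('B')
a.increment(2); b.increment(3)
a.merge(b); b.merge(a)
assert a.value() == b.value() == 5    # both replicas converge</pre>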
<p>However, expressing a data type as a semilattice often requires some level of interpretation. Many data types have operations which are not in fact order-independent. For example, adding items to a set is associative, commutative and idempotent. However, if we also allow items to be removed from a set, then we need some way to resolve conflicting operations, such as <code>add(A)</code> and <code>remove(A)</code>. What does it mean to remove an element if the local replica never added it? This resolution has to be specified in a manner that is order-independent, and there are several different choices with different tradeoffs.</p>
<p>This means that several familiar data types have more specialized implementations as CRDT's which make a different tradeoff in order to resolve conflicts in an order-independent manner. Unlike a key-value store which simply deals with registers (e.g. values that are opaque blobs from the perspective of the system), someone using CRDTs must use the right data type to avoid anomalies.</p>
<p>Some examples of the different data types specified as CRDT's include:</p>
<ul class="list">
<li>Counters<ul class="list">
<li>Grow-only counter (merge = max(values); payload = single integer)</li>
<li>Positive-negative counter (consists of two grow counters, one for increments and another for decrements)</li>
</ul>
</li>
<li>Registers<ul class="list">
<li>Last-write-wins register (timestamps or version numbers; merge = max(ts); payload = blob)</li>
<li>Multi-valued register (vector clocks; merge = take both)</li>
</ul>
</li>
<li>Sets<ul class="list">
<li>Grow-only set (merge = union(items); payload = set; no removal)</li>
<li>Two-phase set (consists of two sets, one for adding, and another for removing; elements can be added once and removed once)</li>
<li>Unique set (an optimized version of the two-phase set)</li>
<li>Last write wins set (merge = max(ts); payload = set)</li>
<li>Positive-negative set (consists of one PN-counter per set item)</li>
<li>Observed-remove set</li>
</ul>
</li>
<li>Graphs and text sequences (see the paper)</li>
</ul>
<p>To ensure anomaly-free operation, you need to find the right data type for your specific application - for example, if you know that you will only remove an item once, then a two-phase set works; if you will only ever add items to a set and never remove them, then a grow-only set works.</p>
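<p>As a sketch of one of those tradeoffs, a two-phase set keeps two grow-only sets - additions and removal "tombstones" - so removal wins permanently and an element can never be re-added (a simplified illustration of the two-phase set listed above):</p>
<pre>class TwoPhaseSet:
    """Two-phase set: an add set plus a tombstone set of removals."""
    def __init__(self):
        self.added = set()
        self.removed = set()          # tombstones: once an item is here, it stays removed

    def add(self, item):
        self.added.add(item)

    def remove(self, item):
        if item in self.added:        # can only remove something we have seen added
            self.removed.add(item)

    def contents(self):
        return self.added - self.removed

    def merge(self, other):
        # union of both grow-only sets: order-independent and idempotent
        self.added |= other.added
        self.removed |= other.removed</pre>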
<p>Not all data structures have known implementations as CRDTs, but there are CRDT implementations for booleans, counters, sets, registers and graphs in the recent (2011) <a href="http://hal.inria.fr/docs/00/55/55/88/PDF/techreport.pdf">survey paper from Shapiro et al</a>.</p>
<p>Interestingly, the register implementations correspond directly with the implementations that key value stores use: a last-write-wins register uses timestamps or some equivalent and simply converges to the largest timestamp value; a multi-valued register corresponds to the Dynamo strategy of retaining, exposing and reconciling concurrent changes. For the details, I recommend that you take a look at the papers in the further reading section of this chapter.</p>
<h2>The CALM theorem</h2>
<p>The CRDT data structures were based on the recognition that data structures expressible as semilattices are convergent. But programming is about more than just evolving state, unless you are just implementing a data store.</p>
<p>Clearly, order-independence is an important property of any computation that converges: if the order in which data items are received influences the result of the computation, then there is no way to execute a computation without guaranteeing order.</p>
<p>However, there are many programming models in which the order of statements does not play a significant role. For example, in the <a href="http://en.wikipedia.org/wiki/MapReduce">MapReduce model</a>, both the Map and the Reduce tasks are specified as stateless tuple-processing tasks that need to be run on a dataset. Concrete decisions about how and in what order data is routed to the tasks is not specified explicitly, instead, the batch job scheduler is responsible for scheduling the tasks to run on the cluster.</p>
<p>Similarly, in SQL one specifies the query, but not how the query is executed. The query is simply a declarative description of the task, and it is the job of the query optimizer to figure out an efficient way to execute the query (across multiple machines, databases and tables).</p>
<p>Of course, these programming models are not as permissive as a general purpose programming language. MapReduce tasks need to be expressible as stateless tasks in an acyclic dataflow program; SQL statements can execute fairly sophisticated computations but many things are hard to express in it.</p>
<p>However, it should be clear from these two examples that there are many kinds of data processing tasks which are amenable to being expressed in a declarative language where the order of execution is not explicitly specified. Programming models which express a desired result while leaving the exact order of statements up to an optimizer to decide often have semantics that are order-independent. This means that such programs may be possible to execute without coordination, since they depend on the inputs they receive but not necessarily the specific order in which the inputs are received.</p>
<p>The key point is that such programs <em>may be</em> safe to execute without coordination. Without a clear rule that characterizes what is safe to execute without coordination, and what is not, we cannot implement a program while remaining certain that the result is correct.</p>
<p>This is what the CALM theorem is about. The CALM theorem is based on a recognition of the link between logical monotonicity and useful forms of eventual consistency (e.g. confluence / convergence). It states that logically monotonic programs are guaranteed to be eventually consistent.</p>
<p>Then, if we know that some computation is logically monotonic, then we know that it is also safe to execute without coordination.</p>
<p>To better understand this, we need to contrast monotonic logic (or monotonic computations) with <a href="http://plato.stanford.edu/entries/logic-nonmonotonic/">non-monotonic logic</a> (or non-monotonic computations).</p>
<dl>
<dt>Monotony</dt>
<dd>if sentence <code>φ</code> is a consequence of a set of premises <code>Γ</code>, then it can also be inferred from any set <code>Δ</code> of premises extending <code>Γ</code></dd>
</dl>
<p>Most standard logical frameworks are monotonic: any inferences made within a framework such as first-order logic, once deductively valid, cannot be invalidated by new information. A non-monotonic logic is a system in which that property does not hold - in other words, one in which some conclusions can be invalidated by learning new knowledge.</p>
<p>Within the artificial intelligence community, non-monotonic logics are associated with <a href="http://plato.stanford.edu/entries/reasoning-defeasible/">defeasible reasoning</a> - reasoning, in which assertions made utilizing partial information can be invalidated by new knowledge. For example, if we learn that Tweety is a bird, we'll assume that Tweety can fly; but if we later learn that Tweety is a penguin, then we'll have to revise our conclusion.</p>
<p>Monotonicity concerns the relationship between premises (or facts about the world) and conclusions (or assertions about the world). Within a monotonic logic, we know that our results are retraction-free: <a href="http://en.wikipedia.org/wiki/Monotonicity_of_entailment">monotone</a> computations do not need to be recomputed or coordinated; the answer gets more accurate over time. Once we know that Tweety is a bird (and that we're reasoning using monotonic logic), we can safely conclude that Tweety can fly and that nothing we learn can invalidate that conclusion.</p>
<p>While any computation that produces a human-facing result can be interpreted as an assertion about the world (e.g. the value of "foo" is "bar"), determining whether a computation in a von Neumann machine based programming model is monotonic is difficult, because it is not exactly clear what the relationship between facts and assertions is and whether those relationships are monotonic.</p>
<p>However, there are a number of programming models for which determining monotonicity is possible. In particular, <a href="http://en.wikipedia.org/wiki/Relational_algebra">relational algebra</a> (e.g. the theoretical underpinnings of SQL) and <a href="http://en.wikipedia.org/wiki/Datalog">Datalog</a> provide highly expressive languages that have well-understood interpretations.</p>
<p>Both basic Datalog and relational algebra (even with recursion) are known to be monotonic. More specifically, computations expressed using a certain set of basic operators are known to be monotonic (selection, projection, natural join, cross product, union and recursive Datalog without negation), and non-monotonicity is introduced by using more advanced operators (negation, set difference, division, universal quantification, aggregation).</p>
<p>This means that computations expressed using a significant number of operators (e.g. map, filter, join, union, intersection) in those systems are logically monotonic; any computations using those operators are also monotonic and thus safe to run without coordination. Expressions that make use of negation and aggregation, on the other hand, are not safe to run without coordination.</p>
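<p>A rough analogy in ordinary code (not Datalog or Bloom): a selection over a growing set of facts only ever adds to its output, so results can be emitted early and remain valid; asserting an aggregate over the same facts is only safe once we know the input is complete.</p>
<pre>facts = set()

def monotone_adults(new_facts):
    """Selection is monotonic: new facts can only add to the output, so
    results emitted earlier are never invalidated."""
    facts.update(new_facts)
    return {name for name, age in facts if age >= 18}

def adult_count():
    """Aggregation as an assertion is non-monotonic: the count is only final
    once we know we have seen all the facts, which in a distributed setting
    requires coordination."""
    return len({name for name, age in facts if age >= 18})

monotone_adults({('alice', 30)})   # {'alice'} - safe to emit now
monotone_adults({('bob', 45)})     # {'alice', 'bob'} - earlier output still valid
adult_count()                      # 2, but only correct if no facts are still missing</pre>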
<p>It is important to realize the connection between non-monotonicity and operations that are expensive to perform in a distributed system. Specifically, both <em>distributed aggregation</em> and <em>coordination protocols</em> can be considered to be a form of negation. As Joe Hellerstein <a href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-90.pdf">writes</a>:</p>
<blockquote>
<p>To establish the veracity of a negated predicate in a distributed setting, an evaluation strategy has to start "counting to 0" to determine emptiness, and wait until the distributed counting process has definitely terminated. Aggregation is the generalization of this idea.</p>
</blockquote>
<p>and:</p>
<blockquote>
<p>This idea can be seen from the other direction as well. Coordination protocols are themselves aggregations, since they entail voting: Two-Phase Commit requires unanimous votes, Paxos consensus requires majority votes, and Byzantine protocols require a 2/3 majority. Waiting requires counting.</p>
</blockquote>
<p>If we can express our computation in a manner in which it is possible to test for monotonicity, then we can perform a whole-program static analysis that detects which parts of the program are eventually consistent and safe to run without coordination (the monotonic parts) - and which parts are not (the non-monotonic ones).</p>
<p>Note that this requires a different kind of language, since these inferences are hard to make for traditional programming languages where sequence, selection and iteration are at the core. This is why the Bloom language was designed.</p>
<h2>What is non-monotonicity good for?</h2>
<p>The difference between monotonicity and non-monotonicity is interesting. For example, adding two numbers is monotonic, but calculating an aggregation over two nodes containing numbers is not. What's the difference? One of these is a computation (adding two numbers), while the other is an assertion (calculating an aggregate).</p>
<p>How does a computation differ from an assertion? Let's consider the query "is pizza a vegetable?". To answer that, we need to get at the core: when is it acceptable to infer that something is (or is not) true?</p>
<p>There are several acceptable answers, each corresponding to a different set of assumptions regarding the information that we have and the way we ought to act upon it - and we've come to accept different answers in different contexts.</p>
<p>In everyday reasoning, we make what is known as the <a href="http://en.wikipedia.org/wiki/Open_world_assumption">open-world assumption</a>: we assume that we do not know everything, and hence cannot make conclusions from a lack of knowledge. That is, any sentence may be true, false or unknown.</p>
<pre>                         OWA +                   OWA +
                         Monotonic logic         Non-monotonic logic
Can derive P(true)    |  Can assert P(true)   |  Cannot assert P(true)
Can derive P(false)   |  Can assert P(false)  |  Cannot assert P(false)
Cannot derive P(true) |  Unknown              |  Unknown
or P(false)</pre>
<p>When making the open world assumption, we can only safely assert something we can deduce from what is known. Our information about the world is assumed to be incomplete.</p>
<p>Let's first look at the case where we know our reasoning is monotonic. In this case, any (potentially incomplete) knowledge that we have cannot be invalidated by learning new knowledge. So if we can infer that a sentence is true based on some deduction, such as "things that contain two tablespoons of tomato paste are vegetables" and "pizza contains two tablespoons of tomato paste", then we can conclude that "pizza is a vegetable". The same goes for if we can deduce that a sentence is false.</p>
<p>However, if we cannot deduce anything - for example, the set of knowledge we have contains customer information and nothing about pizza or vegetables - then under the open world assumption we have to say that we cannot conclude anything.</p>
<p>With non-monotonic knowledge, anything we know right now can potentially be invalidated. Hence, we cannot safely conclude anything, even if we can deduce true or false from what we currently know.</p>
<p>However, within the database context, and within many computer science applications we prefer to make more definite conclusions. This means assuming what is known as the <a href="http://en.wikipedia.org/wiki/Closed_world_assumption">closed-world assumption</a>: that anything that cannot be shown to be true is false. This means that no explicit declaration of falsehood is needed. In other words, the database of facts that we have is assumed to be complete (minimal), so that anything not in it can be assumed to be false.</p>
<p>For example, under the CWA, if our database does not have an entry for a flight between San Francisco and Helsinki, then we can safely conclude that no such flight exists.</p>
<p>We need one more thing to be able to make definite assertions: <a href="http://en.wikipedia.org/wiki/Circumscription_%28logic%29">logical circumscription</a>. Circumscription is a formalized rule of conjecture. Domain circumscription conjectures that the known entities are all there are. We need to be able to assume that the known entities are all there are in order to reach a definite conclusion.</p>
<pre>                         CWA +                   CWA +
                         Circumscription +       Circumscription +
                         Monotonic logic         Non-monotonic logic
Can derive P(true)    |  Can assert P(true)   |  Can assert P(true)
Can derive P(false)   |  Can assert P(false)  |  Can assert P(false)
Cannot derive P(true) |  Can assert P(false)  |  Can assert P(false)
or P(false)</pre>
<p>In particular, non-monotonic inferences need this assumption. We can only make a confident assertion if we assume that we have complete information, since additional information may otherwise invalidate our assertion.</p>
<p>What does this mean in practice? First, monotonic logic can reach definite conclusions as soon as it can derive that a sentence is true (or false). Second, nonmonotonic logic requires an additional assumption: that the known entities are all there is.</p>
<p>So why are two operations that seem equivalent on the surface actually different? Why is adding two numbers monotonic, but calculating an aggregation over two nodes not? Because the aggregation does not only calculate a sum but also asserts that it has seen all of the values. And the only way to guarantee that is to coordinate across nodes and ensure that the node performing the calculation has really seen all of the values within the system.</p>
<p>Thus, in order to handle nonmonotonicity one needs to either use distributed coordination to ensure that assertions are made only after all the information is known or make assertions with the caveat that the conclusion can be invalidated later on.</p>
<p>Handling non-monotonicity is important for reasons of expressiveness. This comes down to being able to express non-monotone things; for example, it is nice to be able to say that the total of some column is X. The system must detect that this kind of computation requires a global coordination boundary to ensure that we have seen all the entities.</p>
<p>Purely monotone systems are rare. It seems that most applications operate under the closed-world assumption even when they have incomplete data, and we humans are fine with that. When a database tells you that a direct flight between San Francisco and Helsinki does not exist, you will probably treat this as "according to this database, there is no direct flight", but you do not rule out the possibility that in reality such a flight might still exist.</p>
<p>Really, this issue only becomes interesting when replicas can diverge (e.g. during a partition or due to delays during normal operation). Then there is a need for a more specific consideration: whether the answer is based on just the current node, or the totality of the system.</p>
<p>Further, since nonmonotonicity is caused by making an assertion, it seems plausible that many computations can proceed for a long time and only apply coordination at the point where some result or assertion is passed to a 3rd party system or end user. Certainly it is not necessary for every single read and write operation within a system to enforce a total order, if those reads and writes are simply a part of a long running computation.</p>
<h2>The Bloom language</h2>
<p>The <a href="http://www.bloom-lang.net/">Bloom language</a> is a language designed to make use of the CALM theorem. It is a Ruby DSL which has its formal basis in a temporal logic programming language called Dedalus.</p>
<p>In Bloom, each node has a database consisting of collections and lattices. Programs are expressed as sets of unordered statements which interact with collections (sets of facts) and lattices (CRDTs). Statements are order-independent by default, but one can also write non-monotonic functions.</p>
<p>Have a look at the <a href="http://www.bloom-lang.net/">Bloom website</a> and <a href="https://github.com/bloom-lang/bud/tree/master/docs">tutorials</a> to learn more about Bloom.</p>
<hr>
<h2>Further reading</h2>
<h4>The CALM theorem, confluence analysis and Bloom</h4>
<p><a href="http://vimeo.com/53904989">Joe Hellerstein's talk @RICON 2012</a> is a good introduction to the topic, as is <a href="http://vimeo.com/45111940">Neil Conway's talk @Basho</a>. For Bloom in particular, see <a href="http://channel9.msdn.com/Events/Lang-NEXT/Lang-NEXT-2012/Bloom-Disorderly-Programming-for-a-Distributed-World">Peter Alvaro's talk@Microsoft</a>.</p>
<ul class="list">
<li><a href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-90.pdf">The Declarative Imperative: Experiences and Conjectures in Distributed Logic</a> - Hellerstein, 2010</li>
<li><a href="http://db.cs.berkeley.edu/papers/cidr11-bloom.pdf">Consistency Analysis in Bloom: a CALM and Collected Approach</a> - Alvaro et al., 2011</li>
<li><a href="http://db.cs.berkeley.edu/papers/UCB-lattice-tr.pdf">Logic and Lattices for Distributed Programming</a> - Conway et al., 2012</li>
<li><a href="http://db.cs.berkeley.edu/papers/datalog2011-dedalus.pdf">Dedalus: Datalog in Time and Space</a> - Alvaro et al., 2011</li>
</ul>
<h4>CRDTs</h4>
<p><a href="http://research.microsoft.com/apps/video/dl.aspx?id=153540">Marc Shapiro's talk @ Microsoft</a> is a good starting point for understanding CRDT's.</p>
<ul class="list">
<li><a href="http://hal.archives-ouvertes.fr/docs/00/39/79/81/PDF/RR-6956.pdf">CRDTs: Consistency Without Concurrency Control</a> - Letitia et al., 2009</li>
<li><a href="http://hal.inria.fr/docs/00/55/55/88/PDF/techreport.pdf">A comprehensive study of Convergent and Commutative Replicated Data Types</a>, Shapiro et al., 2011</li>
<li><a href="http://arxiv.org/pdf/1210.3368v1.pdf">An Optimized conflict-free Replicated Set</a> - Bieniusa et al., 2012</li>
</ul>
<h4>Dynamo; PBS; optimistic replication</h4>
<ul class="list">
<li><a href="http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf">Dynamo: Amazon’s Highly Available Key-value Store</a> - DeCandia et al., 2007</li>
<li><a href="http://scholar.google.com/scholar?q=PNUTS:+Yahoo!'s+Hosted+Data+Serving+Platform">PNUTS: Yahoo!'s Hosted Data Serving Platform</a> - Cooper et al., 2008</li>
<li><a href="http://scholar.google.com/scholar?q=The+Bayou+Architecture%3A+Support+for+Data+Sharing+among+Mobile+Users">The Bayou Architecture: Support for Data Sharing among Mobile Users</a> - Demers et al. 1994</li>
<li><a href="http://pbs.cs.berkeley.edu/pbs-vldb2012.pdf">Probabilistically Bound Staleness for Practical Partial Quorums</a> - Bailis et al., 2012</li>
<li><a href="https://queue.acm.org/detail.cfm?id=2462076">Eventual Consistency Today: Limitations, Extensions, and Beyond</a> - Bailis & Ghodsi, 2013</li>
<li><a href="http://www.ysaito.com/survey.pdf">Optimistic replication</a> - Saito & Shapiro, 2005</li>
</ul>
</div>
</div>
<div class="clear">
<hr>
</div>
<div class="header nav">
<p><a href="replication.html">Previous Chapter</a> | <a href="index.html">Home</a> | <a href="appendix.html">Next Chapter</a></p>
</div>
<div class="clear">
<hr>
</div>
<div class="header nav">
<p>"Distributed systems: for fun and profit" by <a href="http://mixu.net/">Mikito Takada</a>.</p>
</div>
<div class="clear">
<hr>
</div>
<div class="header nav">
<div id="disqus_thread"></div>
<script type="text/javascript">
/* * * CONFIGURATION VARIABLES: EDIT BEFORE PASTING INTO YOUR WEBPAGE * * */
var disqus_shortname = 'mixuds'; // required: replace example with your forum shortname
/* * * DON'T EDIT BELOW THIS LINE * * */
(function() {
var dsq = document.createElement('script'); dsq.type = 'text/javascript'; dsq.async = true;
dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
(document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
})();
</script>
<noscript>Please enable JavaScript to view the <a href="http://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>
<a href="http://disqus.com" class="dsq-brlink">comments powered by <span class="logo-disqus">Disqus</span></a>
</div>
</div>
</div>
</div>
</body>
</html>