forked from alinush/6.824-lecture-notes
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathl15-spanner.html
768 lines (630 loc) · 22.8 KB
/
l15-spanner.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
<h1>6.824 2015 Lecture 15 Spanner</h1>
<p><strong>Note:</strong> These lecture notes were slightly modified from the ones posted on the
6.824 <a href="http://nil.csail.mit.edu/6.824/2015/schedule.html">course website</a> from
Spring 2015.</p>
<h2>Intro</h2>
<p><a href="http://research.google.com/archive/spanner.html">Spanner paper, OSDI 2012</a></p>
<ul>
<li>Shattered old assumption: cannot assume that clocks are tightly synchronized
<ul>
<li>tightly synchronized clocks are now feasible in a global scale distributed
system: GPS and atomic clocks as independent sources</li>
</ul></li>
<li><em>Data model:</em> immutable versioned data</li>
<li>built and deployed system in multiple data centers</li>
<li>Paxos helps you determine order of events. Why do we still need time?</li>
<li>used synchronized time to allow local reads without locks</li>
<li>transactions on top of replication
<ul>
<li>two-phase commit across groups of replicas</li>
</ul></li>
<li>concurrency control
<ul>
<li>strict two phase locking with timestamps</li>
</ul></li>
<li>Paxos
<ul>
<li>long-lived leader (timed leases)</li>
<li>pipelined (multiple proposals in flight)</li>
<li>out-of-order commit, in-order apply</li>
</ul></li>
</ul>
<h2>Spanner and 'research'</h2>
<ul>
<li>team is chock-full of PhDs</li>
<li>we write research papers when we feel the urge and we have something to say</li>
<li>cutting edge development, unbelievable scale, but we are not researchers</li>
</ul>
<h2>Historical context</h2>
<p><a href="http://research.google.com/archive/bigtable.html">Bigtable paper, OSDI 2006</a></p>
<ul>
<li>started development at end of 2003 (6 PhDs)</li>
<li>first customer launched on Bigtable mid 2005</li>
<li>distributed key-value store
<ul>
<li>single-row transactions</li>
<li>later added lazy replication</li>
</ul></li>
<li>value proposition
<ul>
<li>scale to large numbers</li>
<li>automatic resharding</li>
</ul></li>
<li>Bigtable was one of the progenitors of "NoSQL" or more precisely "of how do
you store a lot of data without building a database"</li>
<li>basic tenets at the time (design assumptions for Bigtable):
<ul>
<li>who needs a database? key-value store suffices</li>
<li>who needs SQL? unnecessary for most applications</li>
<li>who needs transactions? two-phase commit is too expensive</li>
</ul></li>
</ul>
<h2>Why Spanner?</h2>
<ul>
<li>found that Bigtable is too hard to use
<ul>
<li>users like the power that SQL database give them</li>
<li>engineers shouldn't have to code around
<ul>
<li>the lack of transactions</li>
<li>the bugs that manifest due to weak semantics provided by lazy replication</li>
</ul></li>
<li>programmer productivity matters </li>
</ul></li>
</ul>
<p>Megastore, started ca. 2006, built on top of Bigtable</p>
<ul>
<li>optimistic concurrency control</li>
<li>paxos-based replication
<ul>
<li>no long-lived leader (paxos "election" on every write)</li>
<li>every paxos message was written to bigtable</li>
</ul></li>
<li>broader class of transactions than bigtable</li>
<li>SQL-like schema and query languages</li>
<li>had consistent replication</li>
</ul>
<p>Dremel, data analysis at Google, started ca. 2008</p>
<ul>
<li>column-oriented storage and query engine</li>
<li>http://research.google.com/pubs/pub36632.html</li>
<li>popular because it allowed SQL</li>
</ul>
<h2>Transactions</h2>
<p><a href="http://research.google.com/pubs/pub36726.html">Percolator, general purpose transactions</a></p>
<ul>
<li>snapshot isolation: a normal transaction has one commit point (logically
when you commit, everything happened then)
<ul>
<li>TODO: lookup what this means, because I couldn't write down his explanation</li>
</ul></li>
<li>built on top of Bigtable</li>
<li>users demanded transactions, but we weren't ready to build that into bigtable</li>
</ul>
<h2>Spanner</h2>
<ul>
<li>we knew we needed
<ul>
<li>a database</li>
<li>SQL</li>
<li>consistent replication across data centers</li>
<li>general purpose transactions</li>
</ul></li>
<li>the rest was "merely engineering"</li>
</ul>
<p>TrueTime came along... (story about how they found out about a guy in NY who
was working on distributed clocks and they realized it could be useful for their
concurrency control)</p>
<h2>Globally synchronized clocks</h2>
<ul>
<li>spanner behaves like a single-machine database
<ul>
<li>consistent replication: replicas all report the same state</li>
<li>external consistency: replicas all report the same order of events</li>
</ul></li>
<li>nice semantics</li>
</ul>
<h2>Were we wrong with bigtable</h2>
<p>Yes, and no:</p>
<ul>
<li>yes for the long-term: didn't know in 2003 what they knew in 2009, didn't have
the people or the technology</li>
<li>no, because lots of people use bigtable at Google</li>
</ul>
<p>Imagine you are running a startup. What long-term issues can be postponed?</p>
<p>Startup dilemma: </p>
<ul>
<li>too much time spent on scalable storage => wasted effort => not done in time
=> fail </li>
<li>too little time spent on scalable storage => when they get popular can't scale
=> fail</li>
</ul>
<p>What do you have the skill/ability/will/vision to do?</p>
<ul>
<li>we could not have built Spanner 10 years ago: or even 5 years ago</li>
<li>someone told them they should build transactions in, but they didn't do it
because they couldn't at the time</li>
</ul>
<h2>Interesting questions</h2>
<p>Why has the Bigtable paper had arguably a bigger impact on both the research
communities and technology communities?</p>
<ul>
<li>research vs. practice</li>
</ul>
<p>Why do system-researchers insist on building scalable key-value stores (and not
databases)?</p>
<h2>Lessons</h2>
<h3>Lesson 0</h3>
<p>Timing is everything. Except luck trumps timing.</p>
<p>You can't plan timing when the world is changing: design the best you can for
the problems you have in front of you</p>
<p>TrueTime happened due to fortuitous confluence of events and people (i.e. luck).
Same with Bigtable. Spanner's initial design (before 2008) was nowhere near what
Google has now: they had anti-luck until the project was restarted in 2008.</p>
<h3>Lesson 1</h3>
<p>Build what you need, and don't overdesign. Don't underdesign either, because
you'll pay for it.</p>
<h3>Lesson 2</h3>
<p>Sometimes ignorance really is bliss. Or maybe luck.</p>
<p>If you have blinders on, you can't overreach. If we had known we needed a
distributed replicated database with external consistency in 2004, we would have
failed.</p>
<h3>Lesson 3</h3>
<p>Your userbase matters.</p>
<ul>
<li>bigtable was started when Google <code>< 2000</code> employees
<ul>
<li>limited # of products</li>
<li>not that many engineers</li>
</ul></li>
<li>spanner was started when Google was <code>10K</code> employees
<ul>
<li>more products</li>
<li>many more engineers, many more junior engineers, many more acquired companies</li>
</ul></li>
<li>productivity of your employees matters</li>
</ul>
<h3>Wrap up</h3>
<p>You can't buy luck. You can't plan for luck. But you can't ignore luck.</p>
<p>You can increase your chances to be lucky:</p>
<ul>
<li>have strong technical skills</li>
<li>work on your design sense (find opportunities to learn!)</li>
<li>build a strong network of colleagues and friends</li>
<li>learn how to work on a team</li>
<li>learn what you are good at, and what you are <em>not</em> good at
<ul>
<li>be brutally honest with yourself</li>
<li>be willing to ask for help</li>
<li>admit when you are wrong</li>
<li>people don't like working with people that constantly tell them they are wrong</li>
</ul></li>
</ul>
<h2>What Spanner lacks?</h2>
<p>Maybe disconnected access: Can we build apps that use DBs and can operate offline?</p>
<p><a href="https://www.cs.berkeley.edu/~brewer/cs262b/Coda-TOCS.pdf">Disconnected operation in Coda file system</a> work.</p>
<h1>6.824 notes</h1>
<p><a href="papers/spanner.pdf">Spanner: Google's Globally-Distributed Database</a>,
Corbett et al, OSDI 2012</p>
<p>Why this paper?</p>
<ul>
<li>modern, high performance, driven by real-world needs</li>
<li>sophisticated use of paxos</li>
<li>tackles consistency + performance (will be a big theme)</li>
<li>Lab 4 a (hugely) simplified version of Spanner</li>
</ul>
<p>What are the big ideas?</p>
<ul>
<li>shard management w/ paxos replication</li>
<li>high performance despite synchronous WAN replication</li>
<li>fast reads by <strong>asking only the nearest replica</strong></li>
<li>consistency despite sharding (this is the real focus)</li>
<li><strong>clever use of time</strong> for consistency</li>
<li>distributed transactions</li>
</ul>
<p>This is a dense paper! I've tried to boil down some of the ideas to simpler
form.</p>
<h2>Sharding</h2>
<p>Idea: sharding</p>
<ul>
<li>we've seen this before in FDS</li>
<li>the real problem is managing configuration changes</li>
<li>Spanner has a more convincing design for this than FDS</li>
</ul>
<p>Simplified sharding outline (lab 4):</p>
<ul>
<li>replica groups, paxos-replicated
<ul>
<li>paxos log in each replica group</li>
</ul></li>
<li>master, paxos-replicated
<ul>
<li>assigns shards to groups</li>
<li>numbered configurations</li>
</ul></li>
<li>if master moves a shard, groups eventually see new config
<ul>
<li><code>"start handoff Num=7"</code> op in both groups' paxos logs</li>
<li>though perhaps not at the same time</li>
</ul></li>
<li><code>dst</code> can't finish handoff until it has copies of shard data at majority
<ul>
<li>and can't wait long for possibly-dead minority</li>
<li>minority must catch up, so perhaps put shard data in paxos log (!)</li>
</ul></li>
<li><code>"end handoff Num=7"</code> op in both groups' logs</li>
</ul>
<p><strong>Q:</strong> What if a Put is concurrent w/ handoff?</p>
<ul>
<li>client sees new config, sends Put to new group before handoff starts?</li>
<li>client has stale view and sends it to old group after handoff?</li>
<li>arrives at either during handoff?</li>
</ul>
<p><strong>Q:</strong> What if a failure during handoff?
- e.g. old group thinks shard is handed off
+ but new group fails before it thinks so</p>
<p><strong>Q:</strong> Can <em>two</em> groups think they are serving a shard?</p>
<p><strong>Q:</strong> Could old group still serve shard if can't hear master?</p>
<p><strong>Idea:</strong> wide-area synchronous replication</p>
<ul>
<li><em>Goal:</em> survive single-site disasters</li>
<li><em>Goal:</em> replica near customers</li>
<li><em>Goal:</em> don't lose any updates</li>
</ul>
<p>Considered impractical until a few years ago</p>
<ul>
<li>paxos too expensive, so maybe primary/backup?</li>
<li>if primary waits for ACK from backup
<ul>
<li>50ms network will limit throughput and cause palpable delay</li>
<li>esp if app has to do multiple reads at 50ms each</li>
</ul></li>
<li>if primary does not wait, it will reply to client before durable</li>
<li>danger of split brain; can't make network reliable</li>
</ul>
<p>What's changed?</p>
<ul>
<li>other site may be only 5 ms away -- San Francisco / Los Angeles</li>
<li>faster/cheaper WAN</li>
<li>apps written to tolerate delays
<ul>
<li>may make many slow read requests</li>
<li>but issue them in parallel</li>
<li>maybe time out quickly and try elsewhere, or redundant gets</li>
</ul></li>
<li>huge # of concurrent clients lets you get hi thruput despite high delay
<ul>
<li>run their requests in parallel</li>
</ul></li>
<li>people appreciate paxos more and have streamlined variants
<ul>
<li>fewer msgs
<ul>
<li>page 9 of paxos paper: 1 round per op w/ leader + bulk preprepare</li>
<li>paper's scheme a little more involved b/c they must ensure
there's at most one leader</li>
</ul></li>
<li>read at any replica</li>
</ul></li>
</ul>
<p>Actual performance?</p>
<ul>
<li>Table 3
<ul>
<li>pretend just measuring paxos for writes, read at any replica for reads
latency
<ul>
<li>why doesn't write latency go up w/ more replicas?</li>
<li>why does std dev of latency go down w/ more replicas?</li>
<li>r/o a <em>lot</em> faster since not a paxos agreement + use closest replica
throughput</li>
<li>why does read throughput go up w/ # replicas?</li>
<li>why doesn't write throughput go up?</li>
<li>does write thruput seem to be going down?</li>
</ul></li>
<li>what can we conclude from Table 3?
<ul>
<li>is the system fast? slow?</li>
</ul></li>
<li>how fast do your paxoses run?
<ul>
<li>mine takes 10 ms per agreement</li>
<li>with purely local communication and no disk</li>
<li>Spanner paxos might wait for disk write</li>
</ul></li>
</ul></li>
<li>Figure 5
<ul>
<li><code>npaxos=5</code>, all leaders in same zone</li>
<li>why does killing a non-leader in each group have no effect?
for killing all the leaders ("leader-hard")
<ul>
<li>why flat for a few seconds?</li>
<li>what causes it to start going up?</li>
<li>why does it take 5 to 10 seconds to recover?</li>
<li>why is slope <em>higher</em> until it rejoins?</li>
</ul></li>
</ul></li>
</ul>
<p>Spanner reads from any paxos replica</p>
<ul>
<li>read does <em>not</em> involve a paxos agreement</li>
<li>just reads the data directly from replica's k/v DB</li>
<li>maybe 100x faster -- same room rather than cross-country</li>
</ul>
<p><strong>Q:</strong> Could we <em>write</em> to just one replica?</p>
<p><strong>Q:</strong> Is reading from any replica correct?</p>
<p>Example of problem:</p>
<ul>
<li>photo sharing site</li>
<li>i have photos</li>
<li>i have an ACL (access control list) saying who can see my photos</li>
<li>i take my mom out of my ACL, then upload new photo</li>
<li>really it's web front ends doing these client reads/writes</li>
</ul>
<p>Order of events:</p>
<ol>
<li>W1: I write ACL on group G1 (bare majority), then</li>
<li>W2: I add image on G2 (bare majority), then</li>
<li>mom reads image -- may get old data from lagging G2 replica</li>
<li>mom reads ACL -- may get new data from G1</li>
</ol>
<p>This system is not acting like a single server!</p>
<ul>
<li>there was not really any point at which the image was
<ul>
<li>present but the ACL hadn't been updated</li>
</ul></li>
</ul>
<p>This problem is caused by a combination of</p>
<ul>
<li>partitioning -- replica groups operate independently</li>
<li>cutting corners for performance -- read from any replica</li>
</ul>
<p>How can we fix this?</p>
<ol>
<li>Make reads see latest data
<ul>
<li>e.g. full paxos for reads expensive!</li>
</ul></li>
<li>Make reads see <em>consistent</em> data
<ul>
<li>data as it existed at <em>some</em> previous point in time</li>
<li>i.e. before #1, between #1 and #2, or after #2</li>
<li>this turns out to be much cheaper</li>
<li>spanner does this</li>
</ul></li>
</ol>
<p>Here's a super-simplification of spanner's consistency story for r/o clients</p>
<ul>
<li>"snapshot" or "lock-free" reads</li>
<li>assume for now that all the clocks agree</li>
<li>server (paxos leader) tags each write with the time at which it occurred</li>
<li>k/v DB stores <em>multiple</em> values for each key,
<ul>
<li>each with a different time</li>
</ul></li>
<li>reading client picks a time <code>t</code>
<ul>
<li>for each read
<ul>
<li>ask relevant replica to do the read at time <code>t</code></li>
</ul></li>
</ul></li>
<li>how does a replica read a key at time <code>t</code>?
<ul>
<li>return the stored value with highest time <code><= t</code></li>
</ul></li>
<li>but wait, the replica may be behind
<ul>
<li>that is, there may be a write at time <code>< t</code>, but replica hasn't seen it</li>
<li>so replica must somehow be sure it has seen all writes <code><= t</code></li>
<li>idea: has it seen <em>any</em> operation from time <code>> t</code>?
<ul>
<li>if yes, and paxos group always agrees on ops in time order,
it's enough to check/wait for an op with time <code>> t</code></li>
<li>that is what spanner does on reads (4.1.3)</li>
</ul></li>
</ul></li>
<li>what time should a reading client pick?
<ul>
<li>using current time may force lagging replicas to wait</li>
<li>so perhaps a little in the past</li>
<li>client may miss latest updates</li>
<li>but at least it will see consistent snapshot</li>
<li>in our example, won't see new image w/o also seeing ACL update</li>
</ul></li>
</ul>
<p>How does that fix our ACL/image example?</p>
<ol>
<li>W1: I write ACL, G1 assigns it time=10, then</li>
<li>W2: I add image, G2 assigns it time=15 (> 10 since clocks agree)</li>
<li>mom picks a time, for example t=14</li>
<li>mom reads ACL t=14 from lagging G1 replica
<ul>
<li>if it hasn't seen paxos agreements up through t=14, it knows to wait
so it will return G1</li>
</ul></li>
<li>mom reads image from G2 at t=14
<ul>
<li>image may have been written on that replica</li>
<li>but it will know to <em>not</em> return it since image's time is 15</li>
<li>other choices of <code>t</code> work too.</li>
</ul></li>
</ol>
<p><strong>Q:</strong> Is it reasonable to assume that different computers' clocks agree?</p>
<ul>
<li>Why might they not agree?</li>
</ul>
<p><strong>Q:</strong> What may go wrong if servers' clocks don't agree?</p>
<p>A performance problem: reading client may pick time in the future, forcing
reading replicas to wait to "catch up"</p>
<p>A correctness problem:</p>
<ul>
<li>again, the ACL/image example</li>
<li>G1 and G2 disagree about what time it is</li>
</ul>
<p>Sequence of events:</p>
<ol>
<li>W1: I write ACL on G1 -- stamped with time=15</li>
<li>W2: I add image on G2 -- stamped with time=10</li>
</ol>
<p>Now a client read at t=14 will see image but not ACL update</p>
<p><strong>Q:</strong> Why doesn't spanner just ensure that the clocks are all correct?</p>
<ul>
<li>after all, it has all those master GPS / atomic clocks</li>
</ul>
<h2>TrueTime (section 3)</h2>
<ul>
<li>there is an actual "absolute" time <code>t_abs</code>
<ul>
<li>but server clocks are typically off by some unknown amount</li>
<li>TrueTime can bound the error</li>
</ul></li>
<li>so <code>now()</code> yields an interval: [earliest,latest]
<ul>
<li>earliest and latest are ordinary scalar times
<ul>
<li>perhaps microseconds since Jan 1 1970</li>
</ul></li>
</ul></li>
<li><code>t_abs</code> is highly likely to be between earliest and latest</li>
</ul>
<p><strong>Q:</strong> How does TrueTime choose the interval?</p>
<p><strong>Q:</strong> Why are GPS time receivers able to avoid this problem?</p>
<ul>
<li>Do they actually avoid it?</li>
<li>What about the "atomic clocks"?</li>
</ul>
<p>Spanner assigns each write a scalar time</p>
<ul>
<li>might not be the actual absolute time</li>
<li>but is chosen to ensure consistency</li>
</ul>
<p>The danger:</p>
<ul>
<li>W1 at G1, G1's interval is [20,30]
<ul>
<li>is any time in that interval OK?</li>
</ul></li>
<li>then W2 at G2, G2's interval is [11,21]
<ul>
<li>is any time in that interval OK?</li>
</ul></li>
<li>if they are not careful, might get s1=25 s2=15</li>
</ul>
<p>So what we want is:</p>
<ul>
<li>if W2 starts after W1 finishes, then <code>s2 > s1</code></li>
<li>simplified <em>"external consistency invariant"</em> from 4.1.2</li>
<li>causes snapshot reads to see data consistent w/ true order of W1, W2</li>
</ul>
<p>How does spanner assign times to writes?</p>
<ul>
<li>(again, this is much simplified, see 4.1.2)</li>
<li>a write request arrives at paxos leader</li>
<li><code>s</code> will be the write's time-stamp</li>
<li>leader sets <code>s</code> to <code>TrueTime now().latest</code>
<ul>
<li>this is "Start" in 4.1.2</li>
</ul></li>
<li>then leader <em>delays</em> until <code>s < now().earliest</code>
<ul>
<li>i.e. until <code>s</code> is guaranteed to be in the past (compared to absolute time)</li>
<li>this is "commit wait" in 4.1.2</li>
</ul></li>
<li>then leader runs paxos to cause the write to happen</li>
<li>then leader replies to client</li>
</ul>
<p>Does this work for our example?</p>
<ul>
<li>W1 at G1, TrueTime says [20,30]
<ul>
<li><code>s1 = 30</code></li>
<li>commit wait until TrueTime says [31,41]</li>
<li>reply to client</li>
</ul></li>
<li>W2 at G2, TrueTime <em>must</em> now say <code>>= [21,31]</code>
<ul>
<li>(otherwise TrueTime is broken)</li>
<li>s2 = 31</li>
<li>commit wait until TrueTime says [32,43]</li>
<li>reply to client</li>
</ul></li>
<li>it does work for this example:
<ul>
<li>the client observed that W1 finished before S2 started,</li>
<li>and indeed <code>s2 > s1</code></li>
<li>even though G2's TrueTime clock was slow by the most it could be</li>
<li>so if my mom sees S2, she is guaranteed to also see W1</li>
</ul></li>
</ul>
<p>Why the "Start" rule?</p>
<ul>
<li>i.e. why choose the time at the end of the TrueTime interval?</li>
<li>previous writers waited only until their timestamps were barely <code>< t_abs</code></li>
<li>new writer must choose <code>s</code> greater than any completed write</li>
<li><code>t_abs</code> might be as high as <code>now().latest</code></li>
<li>so s = now().latest</li>
</ul>
<p>Why the "Commit Wait" rule?</p>
<ul>
<li>ensures that <code>s < t_abs</code></li>
<li>otherwise write might complete with an s in the future
<ul>
<li>and would let Start rule give too low an <code>s</code> to a subsequent write</li>
</ul></li>
</ul>
<p><strong>Q:</strong> Why commit <em>wait</em>; why not immediately write value with chosen time?</p>
<ul>
<li>indirectly forces subsequent write to have high enough s
<ul>
<li>the system has no other way to communicate minimum acceptable next s
for writes in different replica groups</li>
</ul></li>
<li>waiting forces writes that some external agent is serializing
to have monotonically increasing timestamps</li>
<li>w/o wait, our example goes back to s1=30 s2=21</li>
<li>you could imagine explicit schemes to communicate last write's TS
to the next write</li>
</ul>
<p><strong>Q:</strong> How long is the commit wait?</p>
<p>This answers today's Question: a large TrueTime uncertainty requires a long
commit wait so Spanner authors are interested in accurate low-uncertainty time</p>
<p>Let's step back</p>
<ul>
<li>why did we get into all this timestamp stuff?
<ul>
<li>our replicas were 100s or 1000s of miles apart (for locality/fault tol)</li>
<li>we wanted fast reads from a local replica (no full paxos)</li>
<li>our data was partitioned over many replica groups w/ separate clocks</li>
<li>we wanted consistency for reads:
<ul>
<li>if W1 then W2, reads don't see W2 but not W1</li>
</ul></li>
<li>it's complex but it makes sense as a</li>
<li>high-performance evolution of Lab 3 / Lab 4</li>
</ul></li>
</ul>
<p>Why is this timestamp technique interesting?</p>
<ul>
<li>we want to enforce order -- things that happened in some
order in real time are ordered the same way by the
distributed system -- "external consistency"</li>
<li>the naive approach requires a central agent, or lots of communication</li>
<li>Spanner does the synchronization implicitly via time
<ul>
<li>time can be a form of communication</li>
<li>e.g. we agree in advance to meet for dinner at 6:00pm</li>
</ul></li>
</ul>
<p>There's a lot of additional complexity in the paper</p>
<ul>
<li>transactions, two phase commit, two phase locking,
<ul>
<li>schema change, query language, &c</li>
</ul></li>
<li>some of this we'll see more of later</li>
<li>in particular, the problem of ordering events in a
distributed system will come up a lot, soon</li>
</ul>