<!DOCTYPE html>
<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>5.2 Reliable Byte Stream (TCP) — Computer Networks: A Systems Approach Version 6.1-dev documentation</title>
<link rel="shortcut icon" href="../static/bridge.ico"/>
<script type="text/javascript" src="../static/js/modernizr.min.js"></script>
<script type="text/javascript" id="documentation_options" data-url_root="../" src="../static/documentation_options.js"></script>
<script type="text/javascript" src="../static/jquery.js"></script>
<script type="text/javascript" src="../static/underscore.js"></script>
<script type="text/javascript" src="../static/doctools.js"></script>
<script type="text/javascript" src="../static/language_data.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../static/js/theme.js"></script>
<link rel="stylesheet" href="../static/css/theme.css" type="text/css" />
<link rel="stylesheet" href="../static/pygments.css" type="text/css" />
<link rel="stylesheet" href="../static/css/rtd_theme_mods.css" type="text/css" />
<link rel="index" title="Index" href="../genindex.html" />
<link rel="search" title="Search" href="../search.html" />
<link rel="next" title="5.3 Remote Procedure Call" href="rpc.html" />
<link rel="prev" title="5.1 Simple Demultiplexor (UDP)" href="udp.html" />
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search" >
<a href="../index.html" class="icon icon-home"> Computer Networks: A Systems Approach
</a>
<div class="version">
Version 6.1-dev
</div>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
</div>
<div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
<p class="caption"><span class="caption-text">Table of Contents</span></p>
<ul class="current">
<li class="toctree-l1"><a class="reference internal" href="../preface.html">Preface</a></li>
<li class="toctree-l1"><a class="reference internal" href="../foundation.html">Chapter 1: Foundation</a></li>
<li class="toctree-l1"><a class="reference internal" href="../direct.html">Chapter 2: Direct Links</a></li>
<li class="toctree-l1"><a class="reference internal" href="../internetworking.html">Chapter 3: Internetworking</a></li>
<li class="toctree-l1"><a class="reference internal" href="../scaling.html">Chapter 4: Advanced Internetworking</a></li>
<li class="toctree-l1 current"><a class="reference internal" href="../e2e.html">Chapter 5: End-to-End Protocols</a><ul class="current">
<li class="toctree-l2"><a class="reference internal" href="problem.html">Problem: Getting Processes to Communicate</a></li>
<li class="toctree-l2"><a class="reference internal" href="udp.html">5.1 Simple Demultiplexor (UDP)</a></li>
<li class="toctree-l2 current"><a class="current reference internal" href="#">5.2 Reliable Byte Stream (TCP)</a><ul>
<li class="toctree-l3"><a class="reference internal" href="#end-to-end-issues">End-to-End Issues</a></li>
<li class="toctree-l3"><a class="reference internal" href="#segment-format">Segment Format</a></li>
<li class="toctree-l3"><a class="reference internal" href="#connection-establishment-and-termination">Connection Establishment and Termination</a><ul>
<li class="toctree-l4"><a class="reference internal" href="#three-way-handshake">Three-Way Handshake</a></li>
<li class="toctree-l4"><a class="reference internal" href="#state-transition-diagram">State-Transition Diagram</a></li>
</ul>
</li>
<li class="toctree-l3"><a class="reference internal" href="#sliding-window-revisited">Sliding Window Revisited</a><ul>
<li class="toctree-l4"><a class="reference internal" href="#reliable-and-ordered-delivery">Reliable and Ordered Delivery</a></li>
<li class="toctree-l4"><a class="reference internal" href="#flow-control">Flow Control</a></li>
<li class="toctree-l4"><a class="reference internal" href="#protecting-against-wraparound">Protecting Against Wraparound</a></li>
<li class="toctree-l4"><a class="reference internal" href="#keeping-the-pipe-full">Keeping the Pipe Full</a></li>
</ul>
</li>
<li class="toctree-l3"><a class="reference internal" href="#triggering-transmission">Triggering Transmission</a><ul>
<li class="toctree-l4"><a class="reference internal" href="#silly-window-syndrome">Silly Window Syndrome</a></li>
<li class="toctree-l4"><a class="reference internal" href="#nagles-algorithm">Nagle’s Algorithm</a></li>
</ul>
</li>
<li class="toctree-l3"><a class="reference internal" href="#adaptive-retransmission">Adaptive Retransmission</a><ul>
<li class="toctree-l4"><a class="reference internal" href="#original-algorithm">Original Algorithm</a></li>
<li class="toctree-l4"><a class="reference internal" href="#karn-partridge-algorithm">Karn/Partridge Algorithm</a></li>
<li class="toctree-l4"><a class="reference internal" href="#jacobson-karels-algorithm">Jacobson/Karels Algorithm</a></li>
<li class="toctree-l4"><a class="reference internal" href="#implementation">Implementation</a></li>
</ul>
</li>
<li class="toctree-l3"><a class="reference internal" href="#record-boundaries">Record Boundaries</a></li>
<li class="toctree-l3"><a class="reference internal" href="#tcp-extensions">TCP Extensions</a></li>
<li class="toctree-l3"><a class="reference internal" href="#performance">Performance</a></li>
<li class="toctree-l3"><a class="reference internal" href="#alternative-design-choices">Alternative Design Choices</a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="rpc.html">5.3 Remote Procedure Call</a></li>
<li class="toctree-l2"><a class="reference internal" href="rtp.html">5.4 Transport for Real-Time (RTP)</a></li>
<li class="toctree-l2"><a class="reference internal" href="trend.html">Perspective: HTTP is the New Narrow Waist</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="../congestion.html">Chapter 6: Congestion Control</a></li>
<li class="toctree-l1"><a class="reference internal" href="../data.html">Chapter 7: End-to-End Data</a></li>
<li class="toctree-l1"><a class="reference internal" href="../security.html">Chapter 8: Network Security</a></li>
<li class="toctree-l1"><a class="reference internal" href="../applications.html">Chapter 9: Applications</a></li>
<li class="toctree-l1"><a class="reference internal" href="../README.html">About This Book</a></li>
</ul>
</div>
</div>
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
<nav class="wy-nav-top" aria-label="top navigation">
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../index.html">Computer Networks: A Systems Approach</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="breadcrumbs navigation">
<ul class="wy-breadcrumbs">
<li><a href="../index.html">Docs</a> »</li>
<li><a href="../e2e.html">Chapter 5: End-to-End Protocols</a> »</li>
<li>5.2 Reliable Byte Stream (TCP)</li>
<li class="wy-breadcrumbs-aside">
<a href="../_sources/e2e/tcp.rst.txt" rel="nofollow"> View page source</a>
</li>
</ul>
<hr/>
</div>
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<div class="section" id="reliable-byte-stream-tcp">
<h1>5.2 Reliable Byte Stream (TCP)<a class="headerlink" href="#reliable-byte-stream-tcp" title="Permalink to this headline">¶</a></h1>
<p>In contrast to a simple demultiplexing protocol like UDP, a more
sophisticated transport protocol is one that offers a reliable,
connection-oriented, byte-stream service. Such a service has proven
useful to a wide assortment of applications because it frees the
application from having to worry about missing or reordered data. The
Internet’s Transmission Control Protocol is probably the most widely
used protocol of this type; it is also the most carefully tuned. It is
for these two reasons that this section studies TCP in detail, although
we identify and discuss alternative design choices at the end of the
section.</p>
<p>In terms of the properties of transport protocols given in the problem
statement at the start of this chapter, TCP guarantees the reliable,
in-order delivery of a stream of bytes. It is a full-duplex protocol,
meaning that each TCP connection supports a pair of byte streams, one
flowing in each direction. It also includes a flow-control mechanism for
each of these byte streams that allows the receiver to limit how much
data the sender can transmit at a given time. Finally, like UDP, TCP
supports a demultiplexing mechanism that allows multiple application
programs on any given host to simultaneously carry on a conversation
with their peers.</p>
<p>In addition to the above features, TCP also implements a highly tuned
congestion-control mechanism. The idea of this mechanism is to throttle
how fast TCP sends data, not for the sake of keeping the sender from
over-running the receiver, but so as to keep the sender from overloading
the network. A description of TCP’s congestion-control mechanism is
postponed until the next chapter, where we discuss it in the larger
context of how network resources are fairly allocated.</p>
<p>Since many people confuse congestion control and flow control, we
restate the difference. <em>Flow control</em> involves preventing senders from
over-running the capacity of receivers. <em>Congestion control</em> involves
preventing too much data from being injected into the network, thereby
causing switches or links to become overloaded. Thus, flow control is an
end-to-end issue, while congestion control is concerned with how hosts
and networks interact.</p>
<div class="section" id="end-to-end-issues">
<h2>End-to-End Issues<a class="headerlink" href="#end-to-end-issues" title="Permalink to this headline">¶</a></h2>
<p>At the heart of TCP is the sliding window algorithm. Even though this is
the same basic algorithm as is often used at the link level, because TCP
runs over the Internet rather than a physical point-to-point link, there
are many important differences. This subsection identifies these
differences and explains how they complicate TCP. The following
subsections then describe how TCP addresses these and other
complications.</p>
<p>First, whereas the link-level sliding window algorithm presented runs
over a single physical link that always connects the same two computers,
TCP supports logical connections between processes that are running on
any two computers in the Internet. This means that TCP needs an explicit
connection establishment phase during which the two sides of the
connection agree to exchange data with each other. This difference is
analogous to having to dial up the other party, rather than having a
dedicated phone line. TCP also has an explicit connection teardown
phase. One of the things that happens during connection establishment is
that the two parties establish some shared state to enable the sliding
window algorithm to begin. Connection teardown is needed so each host
knows it is OK to free this state.</p>
<p>Second, whereas a single physical link that always connects the same two
computers has a fixed round-trip time (RTT), TCP connections are likely
to have widely different round-trip times. For example, a TCP connection
between a host in San Francisco and a host in Boston, which are
separated by several thousand kilometers, might have an RTT of 100 ms,
while a TCP connection between two hosts in the same room, only a few
meters apart, might have an RTT of only 1 ms. The same TCP protocol must
be able to support both of these connections. To make matters worse, the
TCP connection between hosts in San Francisco and Boston might have an
RTT of 100 ms at 3 a.m., but an RTT of 500 ms at 3 p.m. Variations in
the RTT are even possible during a single TCP connection that lasts only
a few minutes. What this means to the sliding window algorithm is that
the timeout mechanism that triggers retransmissions must be adaptive.
(Certainly, the timeout for a point-to-point link must be a settable
parameter, but it is not necessary to adapt this timer for a particular
pair of nodes.)</p>
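<p>To make the need for adaptivity concrete, the following sketch keeps a
running estimate of the RTT and derives a timeout from it. The smoothing
factor and the timeout multiplier used here are illustrative assumptions,
not values prescribed at this point; the algorithms TCP actually uses are
covered under Adaptive Retransmission later in this section.</p>
<div class="code python highlight-default notranslate"><div class="highlight"><pre><span></span>def update_rtt_estimate(estimated_rtt, sample_rtt, alpha=0.875):
    """Blend each new RTT sample into a running estimate (an
    exponentially weighted moving average), then derive a timeout.

    alpha and the factor of 2 are illustrative, not mandated here.
    """
    estimated_rtt = alpha * estimated_rtt + (1 - alpha) * sample_rtt
    timeout = 2 * estimated_rtt
    return estimated_rtt, timeout
</pre></div></div>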
<p>A third difference is that packets may be reordered as they cross the
Internet, but this is not possible on a point-to-point link where the
first packet put into one end of the link must be the first to appear at
the other end. Packets that are slightly out of order do not cause a
problem since the sliding window algorithm can reorder packets correctly
using the sequence number. The real issue is how far out of order
packets can get or, said another way, how late a packet can arrive at
the destination. In the worst case, a packet can be delayed in the
Internet until the IP time to live (<code class="docutils literal notranslate"><span class="pre">TTL</span></code>) field expires, at which
time the packet is discarded (and hence there is no danger of it
arriving late). Knowing that IP throws packets away after their <code class="docutils literal notranslate"><span class="pre">TTL</span></code>
expires, TCP assumes that each packet has a maximum lifetime. The exact
lifetime, known as the <em>maximum segment lifetime</em> (MSL), is an
engineering choice. The current recommended setting is 120 seconds. Keep
in mind that IP does not directly enforce this 120-second value; it is
simply a conservative estimate that TCP makes of how long a packet might
live in the Internet. The implication is significant—TCP has to be
prepared for very old packets to suddenly show up at the receiver,
potentially confusing the sliding window algorithm.</p>
<p>Fourth, the computers connected to a point-to-point link are generally
engineered to support the link. For example, if a link’s delay ×
bandwidth product is computed to be 8 KB—meaning that a window size is
selected to allow up to 8 KB of data to be unacknowledged at a given
time—then it is likely that the computers at either end of the link have
the ability to buffer up to 8 KB of data. Designing the system otherwise
would be silly. On the other hand, almost any kind of computer can be
connected to the Internet, making the amount of resources dedicated to
any one TCP connection highly variable, especially considering that any
one host can potentially support hundreds of TCP connections at the same
time. This means that TCP must include a mechanism that each side uses
to “learn” what resources (e.g., how much buffer space) the other side
is able to apply to the connection. This is the flow control issue.</p>
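<p>The arithmetic behind such a window calculation is simple enough to
sketch directly (the 1.5-Mbps and 100-ms figures below are the
illustrative numbers used elsewhere in this section, combined here only
as an example):</p>
<div class="code python highlight-default notranslate"><div class="highlight"><pre><span></span>def delay_bandwidth_product(bandwidth_bps, rtt_seconds):
    """Bytes that may be unacknowledged at a given time; dividing
    by 8 converts bits to bytes."""
    return int(bandwidth_bps * rtt_seconds / 8)

# Illustrative: a 1.5-Mbps bottleneck and a 100-ms RTT call for
# roughly 18,750 bytes of buffering to keep the pipe full.
window = delay_bandwidth_product(1.5e6, 0.100)
</pre></div></div>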
<p>Fifth, because the transmitting side of a directly connected link cannot
send any faster than the bandwidth of the link allows, and only one host
is pumping data into the link, it is not possible to unknowingly congest
the link. Said another way, the load on the link is visible in the form
of a queue of packets at the sender. In contrast, the sending side of a
TCP connection has no idea what links will be traversed to reach the
destination. For example, the sending machine might be directly
connected to a relatively fast Ethernet—and capable of sending data at a
rate of 10 Gbps—but somewhere out in the middle of the network, a
1.5-Mbps link must be traversed. And, to make matters worse, data being
generated by many different sources might be trying to traverse this
same slow link. This leads to the problem of network congestion.
Discussion of this topic is delayed until the next chapter.</p>
<p>We conclude this discussion of end-to-end issues by comparing TCP’s
approach to providing a reliable/ordered delivery service with the
approach used by virtual-circuit-based networks like the historically
important X.25 network. In TCP, the underlying IP network is assumed to
be unreliable and to deliver messages out of order; TCP uses the sliding
window algorithm on an end-to-end basis to provide reliable/ordered
delivery. In contrast, X.25 networks use the sliding window protocol
within the network, on a hop-by-hop basis. The assumption behind this
approach is that if messages are delivered reliably and in order between
each pair of nodes along the path between the source host and the
destination host, then the end-to-end service also guarantees
reliable/ordered delivery.</p>
<p>The problem with this latter approach is that a sequence of hop-by-hop
guarantees does not necessarily add up to an end-to-end guarantee.
First, if a heterogeneous link (say, an Ethernet) is added to one end of
the path, then there is no guarantee that this hop will preserve the
same service as the other hops. Second, just because the sliding window
protocol guarantees that messages are delivered correctly from node A to
node B, and then from node B to node C, it does not guarantee that
node B behaves perfectly. For example, network nodes have been known to
introduce errors into messages while transferring them from an input
buffer to an output buffer. They have also been known to accidentally
reorder messages. As a consequence of these small windows of
vulnerability, it is still necessary to provide true end-to-end checks
to guarantee reliable/ordered service, even though the lower levels of
the system also implement that functionality.</p>
<div class="admonition-key-takeaway admonition">
<p class="first admonition-title">Key Takeaway</p>
<p class="last">This discussion serves to illustrate one of the most important
principles in system design—the <em>end-to-end argument</em>. In a nutshell,
the end-to-end argument says that a function (in our example,
providing reliable/ordered delivery) should not be provided in the
lower levels of the system unless it can be completely and correctly
implemented at that level. Therefore, this rule argues in favor of
the TCP/IP approach. This rule is not absolute, however. It does
allow for functions to be incompletely provided at a low level as a
performance optimization. This is why it is perfectly consistent with
the end-to-end argument to perform error detection (e.g., CRC) on a
hop-by-hop basis; detecting and retransmitting a single corrupt
packet across one hop is preferable to having to retransmit an entire
file end-to-end.</p>
</div>
</div>
<div class="section" id="segment-format">
<h2>Segment Format<a class="headerlink" href="#segment-format" title="Permalink to this headline">¶</a></h2>
<p>TCP is a byte-oriented protocol, which means that the sender writes
bytes into a TCP connection and the receiver reads bytes out of the
TCP connection. Although “byte stream” describes the service TCP
offers to application processes, TCP does not, itself, transmit
individual bytes over the Internet. Instead, TCP on the source host
buffers enough bytes from the sending process to fill a reasonably
sized packet and then sends this packet to its peer on the destination
host. TCP on the destination host then empties the contents of the
packet into a receive buffer, and the receiving process reads from
this buffer at its leisure. This situation is illustrated in
<a class="reference internal" href="#fig-tcp-stream"><span class="std std-numref">Figure 127</span></a>, which, for simplicity, shows
data flowing in only one direction. Remember that, in general, a
single TCP connection supports byte streams flowing in both
directions.</p>
<div class="figure align-center" id="id3">
<span id="fig-tcp-stream"></span><a class="reference internal image-reference" href="../_images/f05-03-9780123850591.png"><img alt="../_images/f05-03-9780123850591.png" src="../_images/f05-03-9780123850591.png" style="width: 500px;" /></a>
<p class="caption"><span class="caption-number">Figure 127. </span><span class="caption-text">How TCP manages a byte stream.</span></p>
</div>
<p>The packets exchanged between TCP peers in <a class="reference internal" href="#fig-tcp-stream"><span class="std std-numref">Figure 127</span></a> are called <em>segments</em>, since each one carries a
segment of the byte stream. Each TCP segment contains the header
schematically depicted in <a class="reference internal" href="#fig-tcp-format"><span class="std std-numref">Figure 128</span></a>. The
relevance of most of these fields will become apparent throughout this
section. For now, we simply introduce them.</p>
<div class="figure align-center" id="id4">
<span id="fig-tcp-format"></span><a class="reference internal image-reference" href="../_images/f05-04-9780123850591.png"><img alt="../_images/f05-04-9780123850591.png" src="../_images/f05-04-9780123850591.png" style="width: 400px;" /></a>
<p class="caption"><span class="caption-number">Figure 128. </span><span class="caption-text">TCP header format.</span></p>
</div>
<p>The <code class="docutils literal notranslate"><span class="pre">SrcPort</span></code> and <code class="docutils literal notranslate"><span class="pre">DstPort</span></code> fields identify the source and
destination ports, respectively, just as in UDP. These two fields, plus
the source and destination IP addresses, combine to uniquely identify
each TCP connection. That is, TCP’s demux key is given by the 4-tuple</p>
<div class="code c highlight-default notranslate"><div class="highlight"><pre><span></span><span class="p">(</span><span class="n">SrcPort</span><span class="p">,</span> <span class="n">SrcIPAddr</span><span class="p">,</span> <span class="n">DstPort</span><span class="p">,</span> <span class="n">DstIPAddr</span><span class="p">)</span>
</pre></div>
</div>
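<p>As a sketch of how a receiving host might use this demux key, the
following models the connection table as a dictionary keyed by the
4-tuple (a deliberately simplified, hypothetical structure; a real TCP
keeps much richer per-connection state):</p>
<div class="code python highlight-default notranslate"><div class="highlight"><pre><span></span>connections = {}

def tcp_demux(src_port, src_ip, dst_port, dst_ip, segment):
    """Deliver a segment to the connection named by the 4-tuple."""
    key = (src_port, src_ip, dst_port, dst_ip)
    connections.setdefault(key, []).append(segment)
    return key
</pre></div></div>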
<p>Note that because TCP connections come and go, it is possible for a
connection between a particular pair of ports to be established, used to
send and receive data, and closed, and then at a later time for the same
pair of ports to be involved in a second connection. We sometimes refer
to this situation as two different <em>incarnations</em> of the same
connection.</p>
<p>The <code class="docutils literal notranslate"><span class="pre">Acknowledgement</span></code>, <code class="docutils literal notranslate"><span class="pre">SequenceNum</span></code>, and <code class="docutils literal notranslate"><span class="pre">AdvertisedWindow</span></code>
fields are all involved in TCP’s sliding window algorithm. Because TCP
is a byte-oriented protocol, each byte of data has a sequence number.
The <code class="docutils literal notranslate"><span class="pre">SequenceNum</span></code> field contains the sequence number for the first
byte of data carried in that segment, and the <code class="docutils literal notranslate"><span class="pre">Acknowledgement</span></code> and
<code class="docutils literal notranslate"><span class="pre">AdvertisedWindow</span></code> fields carry information about the flow of data
going in the other direction. To simplify our discussion, we ignore
the fact that data can flow in both directions, and we concentrate on
data that has a particular <code class="docutils literal notranslate"><span class="pre">SequenceNum</span></code> flowing in one direction
and <code class="docutils literal notranslate"><span class="pre">Acknowledgement</span></code> and <code class="docutils literal notranslate"><span class="pre">AdvertisedWindow</span></code> values flowing in the
opposite direction, as illustrated in <a class="reference internal" href="#fig-tcp-flow"><span class="std std-numref">Figure 129</span></a>. The use of these three fields is described more fully
later in this chapter.</p>
<div class="figure align-center" id="id5">
<span id="fig-tcp-flow"></span><a class="reference internal image-reference" href="../_images/f05-05-9780123850591.png"><img alt="../_images/f05-05-9780123850591.png" src="../_images/f05-05-9780123850591.png" style="width: 500px;" /></a>
<p class="caption"><span class="caption-number">Figure 129. </span><span class="caption-text">Simplified illustration (showing only one direction)
of the TCP process, with data flow in one direction and ACKs in
the other.</span></p>
</div>
<p>The 6-bit <code class="docutils literal notranslate"><span class="pre">Flags</span></code> field is used to relay control information between
TCP peers. The possible flags include <code class="docutils literal notranslate"><span class="pre">SYN</span></code>, <code class="docutils literal notranslate"><span class="pre">FIN</span></code>, <code class="docutils literal notranslate"><span class="pre">RESET</span></code>,
<code class="docutils literal notranslate"><span class="pre">PUSH</span></code>, <code class="docutils literal notranslate"><span class="pre">URG</span></code>, and <code class="docutils literal notranslate"><span class="pre">ACK</span></code>. The <code class="docutils literal notranslate"><span class="pre">SYN</span></code> and <code class="docutils literal notranslate"><span class="pre">FIN</span></code> flags are used
when establishing and terminating a TCP connection, respectively. Their
use is described in a later section. The <code class="docutils literal notranslate"><span class="pre">ACK</span></code> flag is set any time
the <code class="docutils literal notranslate"><span class="pre">Acknowledgement</span></code> field is valid, implying that the receiver
should pay attention to it. The <code class="docutils literal notranslate"><span class="pre">URG</span></code> flag signifies that this segment
contains urgent data. When this flag is set, the <code class="docutils literal notranslate"><span class="pre">UrgPtr</span></code> field
indicates where the nonurgent data contained in this segment begins. The
urgent data is contained at the front of the segment body, up to and
including a value of <code class="docutils literal notranslate"><span class="pre">UrgPtr</span></code> bytes into the segment. The <code class="docutils literal notranslate"><span class="pre">PUSH</span></code>
flag signifies that the sender invoked the push operation, which
indicates to the receiving side of TCP that it should notify the
receiving process of this fact. We discuss these last two features more
in a later section. Finally, the <code class="docutils literal notranslate"><span class="pre">RESET</span></code> flag signifies that the
receiver has become confused—for example, because it received a segment
it did not expect to receive—and so wants to abort the connection.</p>
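<p>The six flags occupy individual bit positions within the
<code class="docutils literal notranslate"><span class="pre">Flags</span></code> field (the standard short names
<code class="docutils literal notranslate"><span class="pre">PSH</span></code> and
<code class="docutils literal notranslate"><span class="pre">RST</span></code> correspond to
<code class="docutils literal notranslate"><span class="pre">PUSH</span></code> and
<code class="docutils literal notranslate"><span class="pre">RESET</span></code> above). A small sketch of decoding
them:</p>
<div class="code python highlight-default notranslate"><div class="highlight"><pre><span></span># Bit positions of the six flags, least significant bit first.
FIN, SYN, RST, PSH, ACK, URG = 0x01, 0x02, 0x04, 0x08, 0x10, 0x20

def has_flag(field, bit):
    """True when the power-of-two flag bit is present in field
    (field // bit is odd exactly in that case)."""
    return (field // bit) % 2 == 1

def flags_set(field):
    """Names of the flags set in a segment, in header order."""
    names = ((URG, "URG"), (ACK, "ACK"), (PSH, "PUSH"),
             (RST, "RESET"), (SYN, "SYN"), (FIN, "FIN"))
    return [name for bit, name in names if has_flag(field, bit)]

# The second message of connection establishment carries SYN and ACK:
flags_set(SYN + ACK)
</pre></div></div>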
<p>Finally, the <code class="docutils literal notranslate"><span class="pre">Checksum</span></code> field is used in exactly the same way as for
UDP—it is computed over the TCP header, the TCP data, and the
pseudoheader, which is made up of the source address, destination
address, and length fields from the IP header. The checksum is required
for TCP in both IPv4 and IPv6. Also, since the TCP header is of variable
length (options can be attached after the mandatory fields), a
<code class="docutils literal notranslate"><span class="pre">HdrLen</span></code> field is included that gives the length of the header in
32-bit words. This field is also known as the <code class="docutils literal notranslate"><span class="pre">Offset</span></code> field, since it
measures the offset from the start of the packet to the start of the
data.</p>
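<p>The following sketch shows how such a checksum computation might look.
It implements the standard Internet checksum (a ones-complement sum of
16-bit words) over an assumed pseudoheader layout of source address,
destination address, a zero byte, the protocol number (6 for TCP), and
the TCP segment length; it is an illustration, not a complete TCP
implementation.</p>
<div class="code python highlight-default notranslate"><div class="highlight"><pre><span></span>import struct
from socket import inet_aton

def internet_checksum(data):
    """Ones-complement sum of 16-bit words, in the style of the
    usual Internet checksum; pads odd-length input with a zero byte."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += data[i] * 256 + data[i + 1]
    while total > 0xFFFF:                      # fold carries back in
        total = total % 0x10000 + total // 0x10000
    return 0xFFFF - total                      # ones complement

def tcp_checksum(src_ip, dst_ip, tcp_segment):
    """Checksum over the pseudoheader plus the TCP header and data."""
    pseudoheader = struct.pack("!4s4sBBH", inet_aton(src_ip),
                               inet_aton(dst_ip), 0, 6, len(tcp_segment))
    return internet_checksum(pseudoheader + tcp_segment)
</pre></div></div>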
</div>
<div class="section" id="connection-establishment-and-termination">
<h2>Connection Establishment and Termination<a class="headerlink" href="#connection-establishment-and-termination" title="Permalink to this headline">¶</a></h2>
<p>A TCP connection begins with a client (caller) doing an active open to a
server (callee). Assuming that the server had earlier done a passive
open, the two sides engage in an exchange of messages to establish the
connection. (Recall from Chapter 1 that a party wanting to initiate a
connection performs an active open, while a party willing to accept a
connection does a passive open.<a class="footnote-reference" href="#id2" id="id1">[*]</a>) Only after this connection
establishment phase is over do the two sides begin sending data.
Likewise, as soon as a participant is done sending data, it closes one
direction of the connection, which causes TCP to initiate a round of
connection termination messages. Notice that, while connection setup is
an asymmetric activity (one side does a passive open and the other side
does an active open), connection teardown is symmetric (each side has to
close the connection independently). Therefore, it is possible for one
side to have done a close, meaning that it can no longer send data, but
for the other side to keep the other half of the bidirectional
connection open and to continue sending data.</p>
<table class="docutils footnote" frame="void" id="id2" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id1">[*]</a></td><td>To be more precise, TCP allows connection setup to be symmetric,
with both sides trying to open the connection at the same time,
but the common case is for one side to do an active open and the
other side to do a passive open.</td></tr>
</tbody>
</table>
<div class="section" id="three-way-handshake">
<h3>Three-Way Handshake<a class="headerlink" href="#three-way-handshake" title="Permalink to this headline">¶</a></h3>
<p>The algorithm used by TCP to establish and terminate a connection is
called a <em>three-way handshake</em>. We first describe the basic algorithm
and then show how it is used by TCP. The three-way handshake involves
the exchange of three messages between the client and the server, as
illustrated by the timeline given in <a class="reference internal" href="#fig-twh-timeline"><span class="std std-numref">Figure 130</span></a>.</p>
<div class="figure align-center" id="id6">
<span id="fig-twh-timeline"></span><a class="reference internal image-reference" href="../_images/f05-06-9780123850591.png"><img alt="../_images/f05-06-9780123850591.png" src="../_images/f05-06-9780123850591.png" style="width: 400px;" /></a>
<p class="caption"><span class="caption-number">Figure 130. </span><span class="caption-text">Timeline for three-way handshake algorithm.</span></p>
</div>
<p>The idea is that two parties want to agree on a set of parameters,
which, in the case of opening a TCP connection, are the starting
sequence numbers the two sides plan to use for their respective byte
streams. In general, the parameters might be any facts that each side
wants the other to know about. First, the client (the active
participant) sends a segment to the server (the passive participant)
stating the initial sequence number it plans to use (<code class="docutils literal notranslate"><span class="pre">Flags</span></code> =
<code class="docutils literal notranslate"><span class="pre">SYN</span></code>, <code class="docutils literal notranslate"><span class="pre">SequenceNum</span></code> = x). The server then responds with a single
segment that both acknowledges the client’s sequence number (<code class="docutils literal notranslate"><span class="pre">Flags</span> <span class="pre">=</span>
<span class="pre">ACK,</span> <span class="pre">Ack</span> <span class="pre">=</span> <span class="pre">x</span> <span class="pre">+</span> <span class="pre">1</span></code>) and states its own beginning sequence number
(<code class="docutils literal notranslate"><span class="pre">Flags</span> <span class="pre">=</span> <span class="pre">SYN,</span> <span class="pre">SequenceNum</span> <span class="pre">=</span> <span class="pre">y</span></code>). That is, both the <code class="docutils literal notranslate"><span class="pre">SYN</span></code> and
<code class="docutils literal notranslate"><span class="pre">ACK</span></code> bits are set in the <code class="docutils literal notranslate"><span class="pre">Flags</span></code> field of this second message.
Finally, the client responds with a third segment that acknowledges
the server’s sequence number (<code class="docutils literal notranslate"><span class="pre">Flags</span> <span class="pre">=</span> <span class="pre">ACK,</span> <span class="pre">Ack</span> <span class="pre">=</span> <span class="pre">y</span> <span class="pre">+</span> <span class="pre">1</span></code>). The
reason why each side acknowledges a sequence number that is one larger
than the one sent is that the <code class="docutils literal notranslate"><span class="pre">Acknowledgement</span></code> field actually
identifies the “next sequence number expected,” thereby implicitly
acknowledging all earlier sequence numbers. Although not shown in this
timeline, a timer is scheduled for each of the first two segments, and
if the expected response is not received the segment is retransmitted.</p>
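<p>The three-message exchange can be sketched as a short simulation. This is an illustrative sketch only, not TCP implementation code; the dictionaries standing in for segments, and the function name <code class="docutils literal notranslate"><span class="pre">three_way_handshake</span></code>, are our own invention.</p>

```python
# Sketch of the three-way handshake exchange. Real TCP carries these
# fields in the segment header; here they are modeled as dictionaries.

def three_way_handshake(x, y):
    """Return the three segments exchanged, given the client's initial
    sequence number x and the server's initial sequence number y."""
    seg1 = {"flags": {"SYN"}, "seq": x}                       # client -> server
    seg2 = {"flags": {"SYN", "ACK"}, "seq": y, "ack": x + 1}  # server -> client
    seg3 = {"flags": {"ACK"}, "ack": y + 1}                   # client -> server
    return [seg1, seg2, seg3]

segments = three_way_handshake(x=1000, y=5000)
# Each Ack names the next sequence number expected, one past the SYN
assert segments[1]["ack"] == 1001
assert segments[2]["ack"] == 5001
```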
<p>You may be asking yourself why the client and server have to exchange
starting sequence numbers with each other at connection setup time. It
would be simpler if each side simply started at some “well-known”
sequence number, such as 0. In fact, the TCP specification requires that
each side of a connection select an initial starting sequence number at
random. The reason for this is to protect against two incarnations of
the same connection reusing the same sequence numbers too soon—that is,
while there is still a chance that a segment from an earlier incarnation
of a connection might interfere with a later incarnation of the
connection.</p>
</div>
<div class="section" id="state-transition-diagram">
<h3>State-Transition Diagram<a class="headerlink" href="#state-transition-diagram" title="Permalink to this headline">¶</a></h3>
<p>TCP is complex enough that its specification includes a state-transition
diagram. A copy of this diagram is given in <a class="reference internal" href="#fig-tcp-std"><span class="std std-numref">Figure 131</span></a>.
This diagram shows only the states involved in opening a connection
(everything above ESTABLISHED) and in closing a connection (everything
below ESTABLISHED). Everything that goes on while a connection is
open—that is, the operation of the sliding window algorithm—is hidden in
the ESTABLISHED state.</p>
<div class="figure align-center" id="id7">
<span id="fig-tcp-std"></span><a class="reference internal image-reference" href="../_images/f05-07-9780123850591.png"><img alt="../_images/f05-07-9780123850591.png" src="../_images/f05-07-9780123850591.png" style="width: 600px;" /></a>
<p class="caption"><span class="caption-number">Figure 131. </span><span class="caption-text">TCP state-transition diagram.</span></p>
</div>
<p>TCP’s state-transition diagram is fairly easy to understand. Each box
denotes a state that one end of a TCP connection can find itself in. All
connections start in the CLOSED state. As the connection progresses, the
connection moves from state to state according to the arcs. Each arc is
labeled with a tag of the form <em>event/action</em>. Thus, if a connection is
in the LISTEN state and a SYN segment arrives (i.e., a segment with the
<code class="docutils literal notranslate"><span class="pre">SYN</span></code> flag set), the connection makes a transition to the SYN_RCVD
state and takes the action of replying with an <code class="docutils literal notranslate"><span class="pre">ACK+SYN</span></code> segment.</p>
<p>Notice that two kinds of events trigger a state transition: (1) a
segment arrives from the peer (e.g., the event on the arc from LISTEN
to SYN_RCVD), or (2) the local application process invokes an
operation on TCP (e.g., the <em>active open</em> event on the arc from CLOSED
to SYN_SENT). In other words, TCP’s state-transition diagram
effectively defines the <em>semantics</em> of both its peer-to-peer interface
and its service interface. The <em>syntax</em> of these two interfaces is
given by the segment format (as illustrated in <a class="reference internal" href="#fig-tcp-format"><span class="std std-numref">Figure 128</span></a>) and by some application programming interface, such
as the socket API, respectively.</p>
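<p>The <em>event/action</em> arcs lend themselves to a table-driven encoding. The following sketch covers only the connection-setup arcs discussed in the text; the table layout and event names are our own, not taken from any TCP implementation.</p>

```python
# A hypothetical encoding of a few arcs from the TCP state-transition
# diagram: (state, event) -> (next_state, action). Only the setup arcs
# traced in the text are included.

ARCS = {
    ("CLOSED",   "active_open"):  ("SYN_SENT",    "send SYN"),
    ("CLOSED",   "passive_open"): ("LISTEN",      None),
    ("LISTEN",   "recv SYN"):     ("SYN_RCVD",    "send SYN+ACK"),
    ("SYN_SENT", "recv SYN+ACK"): ("ESTABLISHED", "send ACK"),
    ("SYN_RCVD", "recv ACK"):     ("ESTABLISHED", None),
}

def step(state, event):
    """Follow one labeled arc of the diagram."""
    return ARCS[(state, event)]

# Trace the client side of the three-way handshake
state, action = step("CLOSED", "active_open")
assert (state, action) == ("SYN_SENT", "send SYN")
state, action = step(state, "recv SYN+ACK")
assert state == "ESTABLISHED" and action == "send ACK"
```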
<p>Now let’s trace the typical transitions taken through the diagram in
<a class="reference internal" href="#fig-tcp-std"><span class="std std-numref">Figure 131</span></a>. Keep in mind that at each end of the
connection, TCP makes different transitions from state to state. When
opening a connection, the server first invokes a passive open operation
on TCP, which causes TCP to move to the LISTEN state. At some later
time, the client does an active open, which causes its end of the
connection to send a SYN segment to the server and to move to the
SYN_SENT state. When the SYN segment arrives at the server, it moves to
the SYN_RCVD state and responds with a SYN+ACK segment. The arrival of
this segment causes the client to move to the ESTABLISHED state and to
send an ACK back to the server. When this ACK arrives, the server
finally moves to the ESTABLISHED state. In other words, we have just
traced the three-way handshake.</p>
<p>There are three things to notice about the connection establishment half
of the state-transition diagram. First, if the client’s ACK to the
server is lost, corresponding to the third leg of the three-way
handshake, then the connection still functions correctly. This is
because the client side is already in the ESTABLISHED state, so the
local application process can start sending data to the other end. Each
of these data segments will have the <code class="docutils literal notranslate"><span class="pre">ACK</span></code> flag set, and the correct
value in the <code class="docutils literal notranslate"><span class="pre">Acknowledgement</span></code> field, so the server will move to the
ESTABLISHED state when the first data segment arrives. This is actually
an important point about TCP—every segment reports what sequence number
the sender is expecting to see next, even if this repeats the same
sequence number contained in one or more previous segments.</p>
<p>The second thing to notice about the state-transition diagram is that
there is a funny transition out of the LISTEN state whenever the local
process invokes a <em>send</em> operation on TCP. That is, it is possible for a
passive participant to identify both ends of the connection (i.e.,
itself and the remote participant that it is willing to have connect to
it), and then for it to change its mind about waiting for the other side
and instead actively establish the connection. To the best of our
knowledge, this is a feature of TCP that no application process actually
takes advantage of.</p>
<p>The final thing to notice about the diagram is the arcs that are not
shown. Specifically, most of the states that involve sending a segment
to the other side also schedule a timeout that eventually causes the
segment to be resent if the expected response does not arrive. These
retransmissions are not depicted in the state-transition diagram. If
after several tries the expected response does not arrive, TCP gives up
and returns to the CLOSED state.</p>
<p>Turning our attention now to the process of terminating a connection,
the important thing to keep in mind is that the application process on
both sides of the connection must independently close its half of the
connection. If only one side closes the connection, then this means it
has no more data to send, but it is still available to receive data from
the other side. This complicates the state-transition diagram because it
must account for the possibility that the two sides invoke the <em>close</em>
operator at the same time, as well as the possibility that first one
side invokes close and then, at some later time, the other side invokes
close. Thus, on any one side there are three combinations of transitions
that get a connection from the ESTABLISHED state to the CLOSED state:</p>
<ul class="simple">
<li>This side closes first: ESTABLISHED <span class="math notranslate nohighlight">\(\rightarrow\)</span> FIN_WAIT_1 <span class="math notranslate nohighlight">\(\rightarrow\)</span> FIN_WAIT_2 <span class="math notranslate nohighlight">\(\rightarrow\)</span> TIME_WAIT <span class="math notranslate nohighlight">\(\rightarrow\)</span> CLOSED.</li>
<li>The other side closes first: ESTABLISHED <span class="math notranslate nohighlight">\(\rightarrow\)</span> CLOSE_WAIT <span class="math notranslate nohighlight">\(\rightarrow\)</span> LAST_ACK <span class="math notranslate nohighlight">\(\rightarrow\)</span> CLOSED.</li>
<li>Both sides close at the same time: ESTABLISHED <span class="math notranslate nohighlight">\(\rightarrow\)</span> FIN_WAIT_1 <span class="math notranslate nohighlight">\(\rightarrow\)</span> CLOSING <span class="math notranslate nohighlight">\(\rightarrow\)</span> TIME_WAIT <span class="math notranslate nohighlight">\(\rightarrow\)</span> CLOSED.</li>
</ul>
<p>There is actually a fourth, although rare, sequence of transitions that
leads to the CLOSED state; it follows the arc from FIN_WAIT_1 to
TIME_WAIT. We leave it as an exercise for you to figure out what
combination of circumstances leads to this fourth possibility.</p>
<p>The main thing to recognize about connection teardown is that a
connection in the TIME_WAIT state cannot move to the CLOSED state until
it has waited for two times the maximum amount of time an IP datagram
might live in the Internet (i.e., 120 seconds). The reason for this is
that, while the local side of the connection has sent an ACK in response
to the other side’s FIN segment, it does not know that the ACK was
successfully delivered. As a consequence, the other side might
retransmit its FIN segment, and this second FIN segment might be delayed
in the network. If the connection were allowed to move directly to the
CLOSED state, then another pair of application processes might come
along and open the same connection (i.e., use the same pair of port
numbers), and the delayed FIN segment from the earlier incarnation of
the connection would immediately initiate the termination of the later
incarnation of that connection.</p>
</div>
</div>
<div class="section" id="sliding-window-revisited">
<h2>Sliding Window Revisited<a class="headerlink" href="#sliding-window-revisited" title="Permalink to this headline">¶</a></h2>
<p>We are now ready to discuss TCP’s variant of the sliding window
algorithm, which serves several purposes: (1) it guarantees the reliable
delivery of data, (2) it ensures that data is delivered in order, and
(3) it enforces flow control between the sender and the receiver. TCP’s
use of the sliding window algorithm is the same as at the link level in
the case of the first two of these three functions. Where TCP differs
from the link-level algorithm is that it folds the flow-control function
in as well. In particular, rather than having a fixed-size sliding
window, the receiver <em>advertises</em> a window size to the sender. This is
done using the <code class="docutils literal notranslate"><span class="pre">AdvertisedWindow</span></code> field in the TCP header. The sender
is then limited to having no more than a value of <code class="docutils literal notranslate"><span class="pre">AdvertisedWindow</span></code>
bytes of unacknowledged data at any given time. The receiver selects a
suitable value for <code class="docutils literal notranslate"><span class="pre">AdvertisedWindow</span></code> based on the amount of memory
allocated to the connection for the purpose of buffering data. The idea
is to keep the sender from overrunning the receiver’s buffer. We
discuss this at greater length below.</p>
<div class="section" id="reliable-and-ordered-delivery">
<h3>Reliable and Ordered Delivery<a class="headerlink" href="#reliable-and-ordered-delivery" title="Permalink to this headline">¶</a></h3>
<p>To see how the sending and receiving sides of TCP interact with each
other to implement reliable and ordered delivery, consider the
situation illustrated in <a class="reference internal" href="#fig-tcp-fc"><span class="std std-numref">Figure 132</span></a>. TCP on the
sending side maintains a send buffer. This buffer is used to store
data that has been sent but not yet acknowledged, as well as data that
has been written by the sending application but not transmitted. On
the receiving side, TCP maintains a receive buffer. This buffer holds
data that arrives out of order, as well as data that is in the correct
order (i.e., there are no missing bytes earlier in the stream) but
that the application process has not yet had the chance to read.</p>
<div class="figure align-center" id="id8">
<span id="fig-tcp-fc"></span><a class="reference internal image-reference" href="../_images/f05-08-9780123850591.png"><img alt="../_images/f05-08-9780123850591.png" src="../_images/f05-08-9780123850591.png" style="width: 500px;" /></a>
<p class="caption"><span class="caption-number">Figure 132. </span><span class="caption-text">Relationship between TCP send buffer (a) and receive
buffer (b).</span></p>
</div>
<p>To make the following discussion simpler to follow, we initially ignore
the fact that both the buffers and the sequence numbers are of some
finite size and hence will eventually wrap around. Also, we do not
distinguish between a pointer into a buffer where a particular byte of
data is stored and the sequence number for that byte.</p>
<p>Looking first at the sending side, three pointers are maintained into
the send buffer, each with an obvious meaning: <code class="docutils literal notranslate"><span class="pre">LastByteAcked</span></code>,
<code class="docutils literal notranslate"><span class="pre">LastByteSent</span></code>, and <code class="docutils literal notranslate"><span class="pre">LastByteWritten</span></code>. Clearly,</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">LastByteAcked</span> <span class="o"><=</span> <span class="n">LastByteSent</span>
</pre></div>
</div>
<p>since the receiver cannot have acknowledged a byte that has not yet been
sent, and</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">LastByteSent</span> <span class="o"><=</span> <span class="n">LastByteWritten</span>
</pre></div>
</div>
<p>since TCP cannot send a byte that the application process has not yet
written. Also note that none of the bytes to the left of
<code class="docutils literal notranslate"><span class="pre">LastByteAcked</span></code> need to be saved in the buffer because they have
already been acknowledged, and none of the bytes to the right of
<code class="docutils literal notranslate"><span class="pre">LastByteWritten</span></code> need to be buffered because they have not yet been
generated.</p>
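<p>The two send-side inequalities can be checked as a single chained comparison. A minimal sketch; the pointer values below are invented for illustration.</p>

```python
# Send-side buffer invariant from the text: the three pointers are byte
# sequence numbers, so LastByteAcked <= LastByteSent <= LastByteWritten.

def send_side_ok(last_byte_acked, last_byte_sent, last_byte_written):
    return last_byte_acked <= last_byte_sent <= last_byte_written

assert send_side_ok(100, 150, 200)      # some data sent, some still unacked
assert not send_side_ok(160, 150, 200)  # impossible: ack for unsent data
```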
<p>A similar set of pointers (sequence numbers) are maintained on the
receiving side: <code class="docutils literal notranslate"><span class="pre">LastByteRead</span></code>, <code class="docutils literal notranslate"><span class="pre">NextByteExpected</span></code>, and
<code class="docutils literal notranslate"><span class="pre">LastByteRcvd</span></code>. The inequalities are a little less intuitive, however,
because of the problem of out-of-order delivery. The first relationship</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">LastByteRead</span> <span class="o"><</span> <span class="n">NextByteExpected</span>
</pre></div>
</div>
<p>is true because a byte cannot be read by the application until it is
received <em>and</em> all preceding bytes have also been received.
<code class="docutils literal notranslate"><span class="pre">NextByteExpected</span></code> points to the byte immediately after the latest
byte to meet this criterion. Second,</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">NextByteExpected</span> <span class="o"><=</span> <span class="n">LastByteRcvd</span> <span class="o">+</span> <span class="mi">1</span>
</pre></div>
</div>
<p>since, if data has arrived in order, <code class="docutils literal notranslate"><span class="pre">NextByteExpected</span></code> points to the
byte after <code class="docutils literal notranslate"><span class="pre">LastByteRcvd</span></code>, whereas if data has arrived out of order,
then <code class="docutils literal notranslate"><span class="pre">NextByteExpected</span></code> points to the start of the first gap in the
data, as in <a class="reference internal" href="#fig-tcp-fc"><span class="std std-numref">Figure 132</span></a>. Note that bytes to the left of
<code class="docutils literal notranslate"><span class="pre">LastByteRead</span></code> need not be buffered because they have already been
read by the local application process, and bytes to the right of
<code class="docutils literal notranslate"><span class="pre">LastByteRcvd</span></code> need not be buffered because they have not yet arrived.</p>
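<p>The receive-side relationships can be sketched the same way. Note the first comparison is strict, while the second allows equality (the in-order case); the example values are invented for illustration.</p>

```python
# Receive-side invariants from the text. NextByteExpected points just
# past the in-order prefix, so with out-of-order arrivals it may trail
# LastByteRcvd by the size of the gap.

def recv_side_ok(last_byte_read, next_byte_expected, last_byte_rcvd):
    return (last_byte_read < next_byte_expected
            and next_byte_expected <= last_byte_rcvd + 1)

assert recv_side_ok(10, 21, 20)  # in order: NextByteExpected = LastByteRcvd + 1
assert recv_side_ok(10, 15, 30)  # out of order: a gap starts at byte 15
```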
</div>
<div class="section" id="flow-control">
<h3>Flow Control<a class="headerlink" href="#flow-control" title="Permalink to this headline">¶</a></h3>
<p>Most of the above discussion is similar to that found in the standard
sliding window algorithm; the only real difference is that this time we
elaborated on the fact that the sending and receiving application
processes are filling and emptying their local buffer, respectively.
(The earlier discussion glossed over the fact that data arriving from an
upstream node was filling the send buffer and data being transmitted to
a downstream node was emptying the receive buffer.)</p>
<p>You should make sure you understand this much before proceeding because
now comes the point where the two algorithms differ more significantly.
In what follows, we reintroduce the fact that both buffers are of some
finite size, denoted <code class="docutils literal notranslate"><span class="pre">MaxSendBuffer</span></code> and <code class="docutils literal notranslate"><span class="pre">MaxRcvBuffer</span></code>, although we
don’t worry about the details of how they are implemented. In other
words, we are only interested in the number of bytes being buffered, not
in where those bytes are actually stored.</p>
<p>Recall that in a sliding window protocol, the size of the window sets
the amount of data that can be sent without waiting for acknowledgment
from the receiver. Thus, the receiver throttles the sender by
advertising a window that is no larger than the amount of data that it
can buffer. Observe that TCP on the receive side must keep</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">LastByteRcvd</span> <span class="o">-</span> <span class="n">LastByteRead</span> <span class="o"><=</span> <span class="n">MaxRcvBuffer</span>
</pre></div>
</div>
<p>to avoid overflowing its buffer. It therefore advertises a window size
of</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">AdvertisedWindow</span> <span class="o">=</span> <span class="n">MaxRcvBuffer</span> <span class="o">-</span> <span class="p">((</span><span class="n">NextByteExpected</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">-</span> <span class="n">LastByteRead</span><span class="p">)</span>
</pre></div>
</div>
<p>which represents the amount of free space remaining in its buffer. As
data arrives, the receiver acknowledges it as long as all the preceding
bytes have also arrived. In addition, <code class="docutils literal notranslate"><span class="pre">LastByteRcvd</span></code> moves to the
right (is incremented), meaning that the advertised window potentially
shrinks. Whether or not it shrinks depends on how fast the local
application process is consuming data. If the local process is reading
data just as fast as it arrives (causing <code class="docutils literal notranslate"><span class="pre">LastByteRead</span></code> to be
incremented at the same rate as <code class="docutils literal notranslate"><span class="pre">LastByteRcvd</span></code>), then the advertised
window stays open (i.e., <code class="docutils literal notranslate"><span class="pre">AdvertisedWindow</span> <span class="pre">=</span> <span class="pre">MaxRcvBuffer</span></code>). If,
however, the receiving process falls behind, perhaps because it performs
a very expensive operation on each byte of data that it reads, then the
advertised window grows smaller with every segment that arrives, until
it eventually goes to 0.</p>
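<p>The advertised-window formula above translates directly into code. A sketch with invented buffer sizes and pointer values:</p>

```python
# The receiver's advertised-window computation from the text: free space
# is the buffer size minus the bytes received in order but not yet read.

def advertised_window(max_rcv_buffer, next_byte_expected, last_byte_read):
    return max_rcv_buffer - ((next_byte_expected - 1) - last_byte_read)

# Application keeping up: the window stays fully open
assert advertised_window(4096, next_byte_expected=101, last_byte_read=100) == 4096
# Application 1000 bytes behind: the window shrinks accordingly
assert advertised_window(4096, next_byte_expected=1101, last_byte_read=100) == 3096
```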
<p>TCP on the send side must then adhere to the advertised window it gets
from the receiver. This means that at any given time, it must ensure
that</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">LastByteSent</span> <span class="o">-</span> <span class="n">LastByteAcked</span> <span class="o"><=</span> <span class="n">AdvertisedWindow</span>
</pre></div>
</div>
<p>Said another way, the sender computes an <em>effective</em> window that limits
how much data it can send:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">EffectiveWindow</span> <span class="o">=</span> <span class="n">AdvertisedWindow</span> <span class="o">-</span> <span class="p">(</span><span class="n">LastByteSent</span> <span class="o">-</span> <span class="n">LastByteAcked</span><span class="p">)</span>
</pre></div>
</div>
<p>Clearly, <code class="docutils literal notranslate"><span class="pre">EffectiveWindow</span></code> must be greater than 0 before the source
can send more data. It is possible, therefore, that a segment arrives
acknowledging x bytes, thereby allowing the sender to increment
<code class="docutils literal notranslate"><span class="pre">LastByteAcked</span></code> by x, but because the receiving process was not
reading any data, the advertised window is now x bytes smaller than the
time before. In such a situation, the sender would be able to free
buffer space, but not to send any more data.</p>
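<p>The effective-window computation, including the scenario just described, can be sketched as follows (values invented for illustration):</p>

```python
# Sender-side effective window from the text: the advertised window minus
# the data currently in flight (sent but not yet acknowledged).

def effective_window(advertised_window, last_byte_sent, last_byte_acked):
    return advertised_window - (last_byte_sent - last_byte_acked)

# A 4096-byte window with 1000 bytes in flight leaves room for 3096 more
assert effective_window(4096, last_byte_sent=2000, last_byte_acked=1000) == 3096

# The scenario from the text: an ACK for 500 bytes arrives, but the
# receiver shrank the window by 500, so no new sending capacity appears.
before = effective_window(1000, 2000, 1000)  # window 1000, 1000 in flight
after = effective_window(500, 2000, 1500)    # 500 more acked, window now 500
assert before == after == 0
```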
<p>All the while this is going on, the send side must also make sure that
the local application process does not overflow the send buffer—that is,</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">LastByteWritten</span> <span class="o">-</span> <span class="n">LastByteAcked</span> <span class="o"><=</span> <span class="n">MaxSendBuffer</span>
</pre></div>
</div>
<p>If the sending process tries to write y bytes to TCP, but</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="p">(</span><span class="n">LastByteWritten</span> <span class="o">-</span> <span class="n">LastByteAcked</span><span class="p">)</span> <span class="o">+</span> <span class="n">y</span> <span class="o">></span> <span class="n">MaxSendBuffer</span>
</pre></div>
</div>
<p>then TCP blocks the sending process and does not allow it to generate
more data.</p>
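<p>The blocking check can be sketched as a simple predicate; the buffer size and pointer values are invented for illustration.</p>

```python
# The check from the text: TCP blocks a write of y bytes if accepting it
# would push the buffered-but-unacked data past MaxSendBuffer.

def write_would_block(last_byte_written, last_byte_acked, y, max_send_buffer):
    return (last_byte_written - last_byte_acked) + y > max_send_buffer

assert not write_would_block(900, 0, 100, max_send_buffer=1024)  # just fits
assert write_would_block(900, 0, 200, max_send_buffer=1024)      # blocks
```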
<p>It is now possible to understand how a slow receiving process ultimately
stops a fast sending process. First, the receive buffer fills up, which
means the advertised window shrinks to 0. An advertised window of 0
means that the sending side cannot transmit any data, even though data
it has previously sent has been successfully acknowledged. Finally, not
being able to transmit any data means that the send buffer fills up,
which ultimately causes TCP to block the sending process. As soon as the
receiving process starts to read data again, the receive-side TCP is
able to open its window back up, which allows the send-side TCP to
transmit data out of its buffer. When this data is eventually
acknowledged, <code class="docutils literal notranslate"><span class="pre">LastByteAcked</span></code> is incremented, the buffer space holding
this acknowledged data becomes free, and the sending process is
unblocked and allowed to proceed.</p>
<p>There is only one remaining detail that must be resolved—how does the
sending side know that the advertised window is no longer 0? As
mentioned above, TCP <em>always</em> sends a segment in response to a received
data segment, and this response contains the latest values for the
<code class="docutils literal notranslate"><span class="pre">Acknowledgement</span></code> and <code class="docutils literal notranslate"><span class="pre">AdvertisedWindow</span></code> fields, even if these values
have not changed since the last time they were sent. The problem is
this. Once the receive side has advertised a window size of 0, the
sender is not permitted to send any more data, which means it has no way
to discover that the advertised window is no longer 0 at some time in
the future. TCP on the receive side does not spontaneously send nondata
segments; it only sends them in response to an arriving data segment.</p>
<p>TCP deals with this situation as follows. Whenever the other side
advertises a window size of 0, the sending side persists in sending a
segment with 1 byte of data every so often. It knows that this data will
probably not be accepted, but it tries anyway, because each of these
1-byte segments triggers a response that contains the current advertised
window. Eventually, one of these 1-byte probes triggers a response that
reports a nonzero advertised window.</p>
<p>Note that these 1-byte messages are called <em>Zero Window Probes</em> and in
practice they are sent every 5 to 60 seconds. As for what single byte of
data to send in the probe: it’s the next byte of actual data just
outside the window. (It has to be real data in case it’s accepted by the
receiver.)</p>
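<p>The probing loop can be sketched as follows. The receiver model here is hypothetical: we simply supply, in advance, the window each successive probe response would report, and we omit the 5-to-60-second timer between probes.</p>

```python
# Sketch of zero-window probing: the sender keeps sending 1-byte probe
# segments until a response advertises a nonzero window.

def probe_until_open(window_responses):
    """window_responses: the advertised window reported in response to
    each successive probe. Returns the number of probes sent."""
    probes = 0
    for window in window_responses:
        probes += 1  # send one 1-byte probe, receive one response
        if window > 0:
            break    # window has reopened; resume normal sending
    return probes

# The window stays 0 for three probes; then the application reads some
# data and the fourth response reports an open window.
assert probe_until_open([0, 0, 0, 2048]) == 4
```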
<div class="admonition-key-takeaway admonition">
<p class="first admonition-title">Key Takeaway</p>
<p class="last">Note that the reason the sending side periodically sends this probe
segment is that TCP is designed to make the receive side as simple as
possible—it simply responds to segments from the sender, and it never
initiates any activity on its own. This is an example of a
well-recognized (although not universally applied) protocol design
rule, which, for lack of a better name, we call the <em>smart sender/
dumb receiver</em> rule. Recall that we saw another example of this rule
when we discussed the use of NAKs in the sliding window algorithm.</p>
</div>
</div>
<div class="section" id="protecting-against-wraparound">
<h3>Protecting Against Wraparound<a class="headerlink" href="#protecting-against-wraparound" title="Permalink to this headline">¶</a></h3>
<p>This subsection and the next consider the size of the <code class="docutils literal notranslate"><span class="pre">SequenceNum</span></code>
and <code class="docutils literal notranslate"><span class="pre">AdvertisedWindow</span></code> fields and the implications of their sizes on
TCP’s correctness and performance. TCP’s <code class="docutils literal notranslate"><span class="pre">SequenceNum</span></code> field is
32 bits long, and its <code class="docutils literal notranslate"><span class="pre">AdvertisedWindow</span></code> field is 16 bits long,
meaning that TCP has easily satisfied the requirement of the sliding
window algorithm that the sequence number space be twice as big as the
window size: 2<sup>32</sup> >> 2 × 2<sup>16</sup>. However, this
requirement is not the interesting thing about these two fields.
Consider each field in turn.</p>
<p>The relevance of the 32-bit sequence number space is that the sequence
number used on a given connection might wrap around—a byte with
sequence number S could be sent at one time, and then at a later time
a second byte with the same sequence number S might be sent. Once
again, we assume that packets cannot survive in the Internet for
longer than the recommended MSL. Thus, we currently need to make sure
that the sequence number does not wrap around within a 120-second
period of time. Whether or not this happens depends on how fast data
can be transmitted over the Internet—that is, how fast the 32-bit
sequence number space can be consumed. (This discussion assumes that
we are trying to consume the sequence number space as fast as
possible, but of course we will be if we are doing our job of keeping
the pipe full.) <a class="reference internal" href="#tab-eqnum"><span class="std std-numref">Table 22</span></a> shows how long it takes
for the sequence number to wrap around on networks with various
bandwidths.</p>
<span id="tab-eqnum"></span><table border="1" class="colwidths-auto docutils align-center" id="id9">
<caption><span class="caption-number">Table 22. </span><span class="caption-text">Time Until 32-Bit Sequence Number Space Wraps Around.</span><a class="headerlink" href="#id9" title="Permalink to this table">¶</a></caption>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Bandwidth</th>
<th class="head">Time until Wraparound</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td>T1 (1.5 Mbps)</td>
<td>6.4 hours</td>
</tr>
<tr class="row-odd"><td>T3 (45 Mbps)</td>
<td>13 minutes</td>
</tr>
<tr class="row-even"><td>Fast Ethernet (100 Mbps)</td>
<td>6 minutes</td>
</tr>
<tr class="row-odd"><td>OC-3 (155 Mbps)</td>
<td>4 minutes</td>
</tr>
<tr class="row-even"><td>OC-48 (2.5 Gbps)</td>
<td>14 seconds</td>
</tr>
<tr class="row-odd"><td>OC-192 (10 Gbps)</td>
<td>3 seconds</td>
</tr>
<tr class="row-even"><td>10GigE (10 Gbps)</td>
<td>3 seconds</td>
</tr>
</tbody>
</table>
<p>As you can see, the 32-bit sequence number space is adequate at modest
bandwidths, but given that OC-192 links are now common in the Internet
backbone, and that most servers now come with 10-Gigabit Ethernet (10
Gbps) interfaces, we are well past the point where 32 bits is too
small. Fortunately, the IETF has worked out an extension to TCP that
effectively extends the sequence number space to protect against the
sequence number wrapping around. This and related extensions are
described in a later section.</p>
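<p>The arithmetic behind Table 22 can be sketched in a few lines. This is a minimal illustration, assuming one sequence number is consumed per byte at the full nominal link bandwidth, as in the discussion above:</p>

```python
# Sketch: time for the 32-bit sequence number space to wrap around,
# assuming the sender consumes one sequence number per byte at the
# full (nominal) link rate.

SEQ_SPACE_BYTES = 2 ** 32  # size of the SequenceNum space, in bytes

def wraparound_seconds(bandwidth_bps):
    """Time to consume the entire sequence number space at this rate."""
    return SEQ_SPACE_BYTES * 8 / bandwidth_bps

print(wraparound_seconds(1.5e6) / 3600)  # T1: roughly 6.4 hours
print(wraparound_seconds(45e6) / 60)     # T3: roughly 13 minutes
print(wraparound_seconds(10e9))          # 10GigE: roughly 3 seconds
```

<p>Comparing the last result with the 120-second MSL makes the problem concrete: at 10 Gbps the sequence space wraps more than thirty times within one MSL.</p>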
</div>
<div class="section" id="keeping-the-pipe-full">
<h3>Keeping the Pipe Full<a class="headerlink" href="#keeping-the-pipe-full" title="Permalink to this headline">¶</a></h3>
<p>The relevance of the 16-bit <code class="docutils literal notranslate"><span class="pre">AdvertisedWindow</span></code> field is that it must
be big enough to allow the sender to keep the pipe full. Clearly, the
receiver is free to not open the window as large as the
<code class="docutils literal notranslate"><span class="pre">AdvertisedWindow</span></code> field allows; we are interested in the situation in
which the receiver has enough buffer space to handle as much data as the
largest possible <code class="docutils literal notranslate"><span class="pre">AdvertisedWindow</span></code> allows.</p>
<p>In this case, it is not just the network bandwidth but the delay ×
bandwidth product that dictates how big the <code class="docutils literal notranslate"><span class="pre">AdvertisedWindow</span></code> field
needs to be—the window needs to be opened far enough to allow a full
delay × bandwidth product’s worth of data to be transmitted. Assuming an
RTT of 100 ms (a typical number for a cross-country connection in the
United States), <a class="reference internal" href="#tab-adv-win"><span class="std std-numref">Table 23</span></a> gives the delay × bandwidth
product for several network technologies.</p>
<span id="tab-adv-win"></span><table border="1" class="colwidths-auto docutils align-center" id="id10">
<caption><span class="caption-number">Table 23. </span><span class="caption-text">Required Window Size for 100-ms RTT</span><a class="headerlink" href="#id10" title="Permalink to this table">¶</a></caption>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Bandwidth</th>
<th class="head">Delay × Bandwidth Product</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td>T1 (1.5 Mbps)</td>
<td>18 KB</td>
</tr>
<tr class="row-odd"><td>T3 (45 Mbps)</td>
<td>549 KB</td>
</tr>
<tr class="row-even"><td>Fast Ethernet (100 Mbps)</td>
<td>1.2 MB</td>
</tr>
<tr class="row-odd"><td>OC-3 (155 Mbps)</td>
<td>1.8 MB</td>
</tr>
<tr class="row-even"><td>OC-48 (2.5 Gbps)</td>
<td>29.6 MB</td>
</tr>
<tr class="row-odd"><td>OC-192 (10 Gbps)</td>
<td>118.4 MB</td>
</tr>
<tr class="row-even"><td>10GigE (10 Gbps)</td>
<td>118.4 MB</td>
</tr>
</tbody>
</table>
<p>As you can see, TCP’s <code class="docutils literal notranslate"><span class="pre">AdvertisedWindow</span></code> field is in even worse shape
than its <code class="docutils literal notranslate"><span class="pre">SequenceNum</span></code> field—it is not big enough to handle even a T3
connection across the continental United States, since a 16-bit field
allows us to advertise a window of only 64 KB. The very same TCP
extension mentioned above provides a mechanism for effectively
increasing the size of the advertised window.</p>
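<p>The comparison in Table 23 can be checked directly. This sketch assumes a 100-ms RTT and the nominal link rates from the table:</p>

```python
# Sketch: delay x bandwidth product for a 100-ms RTT, compared with the
# largest window a 16-bit AdvertisedWindow field can express (64 KB).

RTT = 0.100           # seconds
MAX_WINDOW = 2 ** 16  # largest advertisable window, in bytes

def pipe_size_bytes(bandwidth_bps, rtt=RTT):
    """Bytes that must be in flight to keep a pipe of this bandwidth full."""
    return bandwidth_bps * rtt / 8

for name, bps in [("T1", 1.5e6), ("T3", 45e6), ("10GigE", 10e9)]:
    needed = pipe_size_bytes(bps)
    print(f"{name}: {needed / 1024:.0f} KB needed; "
          f"fits in 64 KB window: {needed <= MAX_WINDOW}")
```

<p>The loop confirms the claim in the text: even a T3 connection needs roughly 549 KB in flight, far more than a 16-bit field can advertise.</p>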
</div>
</div>
<div class="section" id="triggering-transmission">
<h2>Triggering Transmission<a class="headerlink" href="#triggering-transmission" title="Permalink to this headline">¶</a></h2>
<p>We next consider a surprisingly subtle issue: how TCP decides to
transmit a segment. As described earlier, TCP supports a byte-stream
abstraction; that is, application programs write bytes into the stream,
and it is up to TCP to decide that it has enough bytes to send a
segment. What factors govern this decision?</p>
<p>If we ignore the possibility of flow control—that is, we assume the
window is wide open, as would be the case when a connection first
starts—then TCP has three mechanisms to trigger the transmission of a
segment. First, TCP maintains a variable, typically called the <em>maximum
segment size</em> (<code class="docutils literal notranslate"><span class="pre">MSS</span></code>), and it sends a segment as soon as it has
collected <code class="docutils literal notranslate"><span class="pre">MSS</span></code> bytes from the sending process. <code class="docutils literal notranslate"><span class="pre">MSS</span></code> is usually set
to the size of the largest segment TCP can send without causing the
local IP to fragment. That is, <code class="docutils literal notranslate"><span class="pre">MSS</span></code> is set to the maximum
transmission unit (MTU) of the directly connected network, minus the
size of the TCP and IP headers. The second thing that triggers TCP to
transmit a segment is that the sending process has explicitly asked it
to do so. Specifically, TCP supports a <em>push</em> operation, and the sending
process invokes this operation to effectively flush the buffer of unsent
bytes. The final trigger for transmitting a segment is that a timer
fires; the resulting segment contains as many bytes as are currently
buffered for transmission. However, as we will soon see, this “timer”
isn’t exactly what you expect.</p>
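<p>The first trigger, the <code class="docutils literal notranslate"><span class="pre">MSS</span></code> computation, can be sketched as follows, assuming 20-byte IP and TCP headers (that is, neither header carries options):</p>

```python
# Sketch: deriving MSS from the local link's MTU, as described above.
# Header sizes assume no IP or TCP options.

IP_HEADER = 20   # bytes
TCP_HEADER = 20  # bytes

def mss_for_mtu(mtu):
    """Largest TCP payload that avoids fragmentation by the local IP."""
    return mtu - IP_HEADER - TCP_HEADER

print(mss_for_mtu(1500))  # Ethernet's 1500-byte MTU gives a 1460-byte MSS
```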
<div class="section" id="silly-window-syndrome">
<h3>Silly Window Syndrome<a class="headerlink" href="#silly-window-syndrome" title="Permalink to this headline">¶</a></h3>
<p>Of course, we can’t just ignore flow control, which plays an obvious
role in throttling the sender. If the sender has <code class="docutils literal notranslate"><span class="pre">MSS</span></code> bytes of data
to send and the window is open at least that much, then the sender
transmits a full segment. Suppose, however, that the sender is
accumulating bytes to send, but the window is currently closed. Now
suppose an ACK arrives that effectively opens the window enough for the
sender to transmit, say, <code class="docutils literal notranslate"><span class="pre">MSS/2</span></code> bytes. Should the sender transmit a
half-full segment or wait for the window to open to a full <code class="docutils literal notranslate"><span class="pre">MSS</span></code>? The
original specification was silent on this point, and early
implementations of TCP decided to go ahead and transmit a half-full
segment. After all, there is no telling how long it will be before the
window opens further.</p>
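<p>The aggressive strategy of those early implementations amounts to the following decision rule, sketched here with hypothetical variable names (not taken from any real TCP implementation):</p>

```python
# Sketch: transmit as soon as the window opens at all, even if less
# than a full MSS of window is available. This is the behavior that
# leads to the silly window syndrome described next.

MSS = 1460  # bytes

def naive_send_size(buffered, window):
    """Bytes an early TCP would send: any amount the window permits."""
    return min(buffered, window, MSS)

# With a full buffer but a half-open window, a half-full segment goes out:
print(naive_send_size(4000, MSS // 2))  # sends 730 bytes
```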
<p>It turns out that the strategy of aggressively taking advantage of any
available window leads to a situation now known as the <em>silly window
syndrome</em>. <a class="reference internal" href="#fig-sillywindow"><span class="std std-numref">Figure 133</span></a> helps visualize what
happens. If you think of a TCP stream as a conveyor belt with “full”
containers (data segments) going in one direction and empty containers
(ACKs) going in the reverse direction, then <code class="docutils literal notranslate"><span class="pre">MSS</span></code>-sized segments
correspond to large containers and 1-byte segments correspond to very
small containers. As long as the sender is sending <code class="docutils literal notranslate"><span class="pre">MSS</span></code>-sized
segments and the receiver ACKs at least one <code class="docutils literal notranslate"><span class="pre">MSS</span></code> of data at a time,
everything is good (<a class="reference internal" href="#fig-sillywindow"><span class="std std-numref">Figure 133(a)</span></a>). But,