One document matched: draft-ietf-tcpimpl-cong-control-00.txt
TCP Implementation Working Group W. Stevens
INTERNET DRAFT Consultant
File: draft-ietf-tcpimpl-cong-control-00.txt M. Allman
NASA Lewis/Sterling Software
V. Paxson
LBNL
August, 1998
TCP Congestion Control
Status of this Memo
This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas,
and its working groups. Note that other groups may also distribute
working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as
reference material or to cite them other than as ``work in
progress.''
To view the entire list of current Internet-Drafts, please check the
"1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
Directories on ftp.is.co.za (Africa), ftp.nordu.net (Northern
Europe), ftp.nis.garr.it (Southern Europe), munnari.oz.au (Pacific
Rim), ftp.ietf.org (US East Coast), or ftp.isi.edu (US West Coast).
Abstract
This document defines TCP's four intertwined congestion control
algorithms: slow start, congestion avoidance, fast retransmit, and
fast recovery. In addition, the document specifies how TCP should
begin transmission after a relatively long idle period, as well as
discussing various acknowledgment generation methods.
1 Introduction
This document specifies four TCP [Pos81] congestion control
algorithms: slow start, congestion avoidance, fast retransmit and
fast recovery. These algorithms were devised in [Jac88] and
[Jac90]. Their use with TCP is required by [Bra89].
This document is an update of [Ste97]. In addition to specifying
the congestion control algorithms, this document specifies what TCP
connections should do after a relatively long idle period, as well
as specifying and clarifying some of the issues pertaining to TCP
ACK generation.
Note that [Ste94] provides examples of these algorithms in action
and [WS95] provides an explanation of the source code for the BSD
implementation of these algorithms.
Expires: February, 1999 [Page 1]
draft-ietf-tcpimpl-cong-control-00.txt August 1998
This document is organized as follows. Section 2 provides various
definitions which will be used throughout the paper. Section 3
provides a specification of the congestion control algorithms.
Section 4 outlines concerns related to the congestion control
algorithms and finally, section 5 outlines security considerations.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [Bra97].
2 Definitions
This section provides the definition of several terms that will be
used throughout the remainder of this document.
SEGMENT:
A segment is ANY TCP/IP data or acknowledgment packet (or both).
MAXIMUM SEGMENT SIZE (MSS):
The MSS is the largest segment size that can be used. The size
does not include the TCP/IP headers and options.
FULL-SIZED SEGMENT:
A segment that contains the maximum number of data bytes
permitted (i.e., a segment containing MSS bytes of data).
RECEIVER WINDOW (rwnd)
The most recently advertised receiver window.
CONGESTION WINDOW (cwnd):
A TCP state variable that limits the amount of data a TCP can
send. At any given time, a TCP MUST NOT send data with a
sequence number higher than the sum of the highest acknowledged
sequence number and the minimum of cwnd and rwnd.
INITIAL WINDOW (IW):
The initial window is the size of the sender's congestion window
when a connection is established.
LOSS WINDOW (LW):
The loss window is the size of the congestion window after a TCP
sender detects loss using its retransmission timer.
RESTART WINDOW (RW):
The restart window is the size of the congestion window after a
TCP restarts transmission after an idle period.
3 Congestion Control Algorithms
This section defines the four congestion control algorithms: slow
start, congestion avoidance, fast retransmit and fast recovery,
developed in [Jac88] and [Jac90]. In some situations it may be
beneficial for a TCP sender to be more conservative than the
algorithms allow, however a TCP MUST NOT be more aggressive than the
Expires: February, 1999 [Page 2]
draft-ietf-tcpimpl-cong-control-00.txt August 1998
following algorithms allow (that is, MUST NOT send data when the
value of cwnd computed by the following algorithms would not allow
the data to be sent).
3.1 Slow Start and Congestion Avoidance
The slow start and congestion avoidance algorithms MUST be used by a
TCP sender to control the amount of outstanding data being injected
into the network. To implement these algorithms, two variables are
added to the TCP per-connection state. The congestion window (cwnd)
is a sender-side limit on the amount of data the sender can transmit
into the network before receiving an acknowledgment (ACK), while the
receiver's advertised window (rwnd) is a receiver-side limit on the
amount of outstanding data. The minimum of cwnd and rwnd governs
data transmission.
Another state variable, the slow start threshold (ssthresh), is used
to determine whether the slow start or congestion avoidance
algorithm is used to control data transmission, as discussed below.
Beginning transmission into a network with unknown conditions
requires TCP to slowly probe the network to determine the available
capacity, in order to avoid congesting the network with an
inappropriately large burst of data. The slow start algorithm is
used for this purpose at the beginning of a transfer, or after
repairing loss detected by the retransmission timer.
IW, the initial value of cwnd, MUST be less than or equal to MSS
bytes.
We note that a non-standard, experimental TCP extension allows that
a TCP MAY use a larger initial window (IW), as defined in equation 1
[AFP98]:
IW = min (4*MSS, max (2*MSS, 4380 bytes)) (1)
With this extension, a TCP sender MAY use a 2 segment initial
window, regardless of the segment size, and 3 and 4 segment initial
windows MAY be used, provided the combined size of the segments does
not exceed 4380 bytes. We do NOT allow this change as part of the
standard defined by this document. However, we include discussion
of (1) in the remainder of this document as a guideline for those
experimenting with the change, rather than conforming to the present
standards for TCP congestion control.
The initial value of ssthresh MAY be arbitrarily high (for example,
some implementations use the size of the advertised window), but it
may be reduced in response to congestion. The slow start algorithm
is used when cwnd < ssthresh, while the congestion avoidance
algorithm is used when cwnd > ssthresh. When cwnd and ssthresh are
equal the sender may use either slow start or congestion avoidance.
During slow start, a TCP increments cwnd by at most MSS bytes for
each ACK received that acknowledges new data. Slow start ends when
Expires: February, 1999 [Page 3]
draft-ietf-tcpimpl-cong-control-00.txt August 1998
cwnd exceeds ssthresh (or, optionally, when it reaches it, as noted
above); or when cwnd reaches rwnd; or when congestion is observed.
During congestion avoidance, cwnd is incremented by 1 full-sized
segment per round-trip time (RTT). Congestion avoidance continues
until cwnd reaches the receiver's advertised window or congestion is
detected. One formula commonly used to update cwnd during
congestion avoidance is given in equation 2:
cwnd += MSS*MSS/cwnd (2)
This provides an acceptable approximation to the underlying
principle of increasing cwnd by 1 full-sized segment per RTT. (Note
that for a connection in which the receiver acknowledges every data
segment, (2) proves slightly more aggressive than 1 segment per RTT,
and for a receiver acknowledging every-other packet, (2) is less
aggressive.)
Implementation Note: Since integer arithmetic is usually used in TCP
implementations, the formula given in equation 2 can fail to
increase cwnd when the congestion window is very large (larger than
MSS*MSS). If the above formula yields 0, the result SHOULD be
rounded up to 1 byte.
Implementation Note: older implementations have an additional
additive constant on the right-hand side of (2). This is incorrect
and can actually lead to diminished performance [PAD+98].
Another acceptable way to increase cwnd during congestion avoidance
is to count the number of bytes that have been acknowledged by ACKs
for new data. (A drawback of this implementation is that it
requires maintaining an additional state variable.) When the number
of bytes acknowledged reaches cwnd, then cwnd can be incremented by
up to MSS bytes. Note that during congestion avoidance, cwnd MUST
NOT be increased by more than the larger of either 1 full-sized
segment per RTT, or the value computed using equation 2.
Implementation Note: some implementations maintain cwnd in units of
bytes, while others in units of full-sized segments. The latter
will find equation (2) difficult to use, and may prefer to use the
counting approach discussed in the previous paragraph.
When a TCP sender detects segment loss using the retransmission
timer, the value of ssthresh MUST be set to no more than the value
given in equation 3:
ssthresh = max (min (cwnd, rwnd) / 2, 2*MSS) (3)
Implementation Note: an easy mistake to make is to forget the inner
min() operation and simply use cwnd, which in some implementations
may incidentally increase well beyond rwnd.
Furthermore, upon a timeout cwnd MUST be set to no more than the
loss window, LW, which equals 1 full-sized segment (regardless of
Expires: February, 1999 [Page 4]
draft-ietf-tcpimpl-cong-control-00.txt August 1998
the value of IW). Therefore, after retransmitting the dropped
segment the TCP sender uses the slow start algorithm to increase the
window from 1 full-sized segment to the new value of ssthresh, at
which point congestion avoidance again takes over in a fashion
identical to that for a connection's initial slow start.
3.3 Fast Retransmit/Fast Recovery
A TCP receiver SHOULD send an immediate duplicate ACK when an
out-of-order segment arrives. The purpose of this ACK is to inform
the sender that a segment was received out-of-order and which
sequence number is expected. From the sender's perspective,
duplicate ACKs can be caused by a number of network problems.
First, they can be caused by dropped segments. In this case, all
segments after the dropped segment will trigger duplicate ACKs.
Second, duplicate ACKs can be caused by the re-ordering of data
segments by the network (not a rare event along some network paths).
Finally, duplicate ACKs can be caused by replication of ACK or data
segments by the network.
The TCP sender SHOULD use the "fast retransmit" algorithm to detect
and repair loss, based on incoming duplicate ACKs. The fast
retransmit algorithm uses the arrival of 3 duplicate ACKs (i.e., 4
identical ACKs) as an indication that a segment has been lost.
After receiving 3 duplicate ACKs, TCP performs a retransmission of
what appears to be the missing segment, without waiting for the
retransmission timer to expire.
After the fast retransmit sends what appears to be the missing
segment, the "fast recovery" algorithm governs the transmission of
new data until a non-duplicate ACK arrives. The reason for not
performing slow start is that the receipt of the duplicate ACKs not
only tells the TCP that a segment has been lost, but also that
segments are leaving the network. In other words, since the
receiver can only generate a duplicate ACK when a segment has
arrived, that segment has left the network and is in the receiver's
buffer, so we know it is no longer consuming network resources.
Furthermore, since the ACK "clock" [Jac88] is preserved, the TCP
sender can continue to transmit new segments (although transmission
must continue using a reduced cwnd).
The fast retransmit and fast recovery algorithms are usually
implemented together as follows.
1. When the third duplicate ACK is received, set ssthresh to no
more than the value given in equation 3.
2. Retransmit the lost segment and set cwnd to ssthresh plus 3*MSS.
This artificially "inflates" the congestion window by the number
of segments (three) that have left the network and which the
receiver has buffered.
3. For each additional duplicate ACK received, increment cwnd by
MSS. This artificially inflates the congestion window in order
Expires: February, 1999 [Page 5]
draft-ietf-tcpimpl-cong-control-00.txt August 1998
to reflect the additional segment that has left the network.
4. Transmit a segment, if allowed by the new value of cwnd and the
receiver's advertised window.
5. When the next ACK arrives that acknowledges new data, set cwnd
to ssthresh (the value set in step 1). This is termed
"deflating" the window.
This ACK should be the acknowledgment elicited by the
retransmission from step 1, one RTT after the retransmission
(though it may arrive sooner in the presence of significant
out-of-order delivery of data segments at the receiver).
Additionally, this ACK should acknowledge all the intermediate
segments sent between the lost segment and the receipt of the
first duplicate ACK, if none of these were lost.
Implementing fast retransmit/fast recovery in this manner can lead
to a phenomenon which allows the fast retransmit algorithm to repair
multiple dropped segments from a single window of data [Flo94].
However, in doing so, the size of cwnd is also reduced multiple
times, which reduces performance. The following steps MAY be taken
to reduce the impact of successive fast retransmits on performance.
A. After the third duplicate ACK is received follow step 1 above,
but also record the highest sequence number transmitted
(send_high).
B. Instead of reducing cwnd to ssthresh upon receipt of the first
non-duplicate ACK received (step 5), the sender should wait
until an ACK covering send_high is received. In addition, each
duplicate ACK received should continue to artificially inflate
cwnd by 1 MSS.
C. A non-duplicate ACK that does not cover send_high, followed by 3
duplicate ACKs should not reduce ssthresh or cwnd but SHOULD
trigger a retransmission. A non-duplicate ACK that does not
cover send_high SHOULD NOT cause any adjustment in cwnd.
4 Additional Considerations
4.1 Re-starting Idle Connections
A known problem with the TCP congestion control algorithms described
above is that they allow a potentially inappropriate burst of
traffic to be transmitted after TCP has been idle for a relatively
long period of time. After an idle period, TCP cannot use the ACK
clock to strobe new segments into the network, as all the ACKs have
drained from the network. Therefore, as specified above, TCP can
potentially send a cwnd-size line-rate burst into the network after
an idle period.
[Jac88] recommends that a TCP use slow start to restart transmission
after a relatively long idle period. Slow start serves to restart
Expires: February, 1999 [Page 6]
draft-ietf-tcpimpl-cong-control-00.txt August 1998
the ACK clock, just as it does at the beginning of a transfer. This
mechanism has been widely deployed in the following manner. When
TCP has not received a segment for more than one retransmission
timeout, cwnd is reduced to the value of the restart window (RW)
before transmission begins.
For the purposes of this standard, we define RW = IW = 1 full-sized
segment.
We note that the non-standard experimental extension to TCP defined
in [AFP98] defines RW = min(IW, cwnd), with the definition of IW
adjusted per equation (1) above.
Using the last time a segment was received to determine whether or
not to decrease cwnd fails to deflate cwnd in the common case of
persistent HTTP connections [HTH98]. In this case, a WWW server
receives a request before transmitting data to the WWW browser. The
reception of the request makes the test for an idle connection fail,
and allows the TCP to begin transmission with a possibly
inappropriately large cwnd.
Therefore, a TCP SHOULD reduce cwnd to no more than RW before
beginning transmission if the TCP has not sent data in an interval
exceeding the retransmission timeout.
4.2 Acknowledgment Mechanisms
The delayed ACK algorithm specified in [Bra89] SHOULD be used by a
TCP receiver. When used, a TCP receiver MUST NOT excessively delay
acknowledgments. Specifically, an ACK MUST be generated for every
second full-sized segment. (This "MUST" is listed in [Bra89] in one
place as a SHOULD and another as a MUST; here we unambiguously state
it is a MUST.) Furthermore, an ACK SHOULD be generated for every
second segment regardless of size. Finally, an ACK MUST NOT be
delayed for more than 500 ms waiting on a second full-sized segment
to arrive. Out-of-order data segments SHOULD be acknowledged
immediately, in order to trigger the fast retransmit algorithm.
A TCP receiver MUST NOT generate more than one ACK for every
incoming segment.
TCP implementations that implement the selective acknowledgment
(SACK) option [MMFR96] are able to determine which segments have not
arrived at the receiver. Consequently, they can retransmit the lost
segments more quickly than TCPs without SACKs. This allows a TCP
sender to repair multiple losses in roughly one RTT after detecting
loss [FF96,MM96a,MM96b]. While no specific SACK-based recovery
algorithm has yet been standardized, any SACK-based algorithm should
follow the general principles embodied by the above algorithms.
First, as soon as loss is detected, ssthresh should be decreased per
equation (3). Second, in the RTT following loss detection, the
number of segments sent should be no more than half the number
transmitted in the previous RTT (i.e., before loss occurred).
Third, after the recovery period is finished, cwnd should be set to
Expires: February, 1999 [Page 7]
draft-ietf-tcpimpl-cong-control-00.txt August 1998
the reduced value of ssthresh. The SACK-based algorithms outlined
in [FF96,MM96a,MM96b] adhere to these guidelines.
5. Security Considerations
This document requires a TCP to diminish its sending rate in the
presence of retransmission timeouts and the arrival of duplicate
acknowledgments. An attacker can therefore impair the performance
of a TCP connection by either causing data packets or their
acknowledgments to be lost, or by forging excessive duplicate
acknowledgments. Causing two congestion control events back-to-back
will often cut ssthresh to its minimum value of 2*MSS, causing the
connection to immediately enter the slower-performing congestion
avoidance phase.
The Internet to a considerable degree relies on the correct
implementation of these algorithms in order to preserve network
stability and avoid congestion collapse. An attacker could cause
TCP endpoints to respond more aggressively in the face of congestion
by forging excessive duplicate acknowledgments or excessive
acknowledgments for new data. Conceivably, such an attack could
drive a portion of the network into congestion collapse.
Acknowledgments
The four algorithms that are described were developed by Van
Jacobson.
Some of the text from this document is taken from "TCP/IP
Illustrated, Volume 1: The Protocols" by W. Richard Stevens
(Addison-Wesley, 1994) and "TCP/IP Illustrated, Volume 2: The
Implementation" by Gary R. Wright and W. Richard Stevens
(Addison-Wesley, 1995). This material is used with the permission
of Addison-Wesley.
Sally Floyd devised the algorithm presented in section 3.3 for
avoiding multiple cwnd cutbacks in the presence of multiple packets
lost from the same flight. Craig Partridge and Joe Touch
contributed a number of helpful suggestions.
References
[AFP98] M. Allman, S. Floyd, C. Partridge, Increasing TCP's Initial
Window Size, Internet-Draft draft-floyd-incr-init-win-03.txt.
May, 1998. (Work in progress).
[Bra89] B. Braden, ed., "Requirements for Internet Hosts --
Communication Layers," RFC 1122, Oct. 1989.
[Bra97] S. Bradner, "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
Expires: February, 1999 [Page 8]
draft-ietf-tcpimpl-cong-control-00.txt August 1998
[FF96] Kevin Fall and Sally Floyd. Simulation-based Comparisons of
Tahoe, Reno and SACK TCP. Computer Communication Review, July
1996. ftp://ftp.ee.lbl.gov/papers/sacks.ps.Z.
[Flo94] S. Floyd, TCP and Successive Fast Retransmits. Technical
report, October 1994.
ftp://ftp.ee.lbl.gov/papers/fastretrans.ps.
[HTH98] Amy Hughes, Joe Touch, John Heidemann. Internet-Draft
draft-ietf-tcpimpl-restart-00.txt, March 1998. (Work in
progress).
[Jac88] V. Jacobson, "Congestion Avoidance and Control," Computer
Communication Review, vol. 18, no. 4, pp. 314-329, Aug. 1988.
ftp://ftp.ee.lbl.gov/papers/congavoid.ps.Z.
[Jac90] V. Jacobson, "Modified TCP Congestion Avoidance Algorithm,"
end2end-interest mailing list, April 30, 1990.
ftp://ftp.isi.edu/end2end/end2end-interest-1990.mail.
[MM96a] M. Mathis, J. Mahdavi, "Forward Acknowledgment: Refining TCP
Congestion Control," Proceedings of SIGCOMM'96, August, 1996,
Stanford, CA. Available from
http://www.psc.edu/networking/papers/papers.html
[MM96b] M. Mathis, J. Mahdavi, "TCP Rate-Halving with Bounding
Parameters" Available from
http://www.psc.edu/networking/papers/FACKnotes/current.
[MMFR96] M. Mathis, J. Mahdavi, S. Floyd, A. Romanow, "TCP Selective
Acknowledgement Options", RFC 2018, October 1996.
[PAD+98] V. Paxson, M. Allman, S. Dawson, J. Griner, I. Heavens,
K. Lahey, J. Semke, B. Volz. Internet-Draft
draft-ietf-tcpimpl-prob-04.txt, August 1998. (Work in
progress).
[Pos81] J. Postel, Transmission Control Protocol, September 1981.
RFC 793.
[Ste94] W. R. Stevens, "TCP/IP Illustrated, Volume 1: The
Protocols", Addison-Wesley, 1994.
[Ste97] W. R. Stevens, "TCP Slow Start, Congestion Avoidance, Fast
Retransmit, and Fast Recovery Algorithms", RFC 2001, January
1997.
[WS95] G. R. Wright, W. R. Stevens, "TCP/IP Illustrated, Volume 2:
The Implementation", Addison-Wesley, 1995.
Expires: February, 1999 [Page 9]
draft-ietf-tcpimpl-cong-control-00.txt August 1998
Author's Address:
W. Richard Stevens
1202 E. Paseo del Zorro
Tucson, AZ 85718
520-297-9416
rstevens@kohala.com
http://www.kohala.com/~rstevens
Mark Allman
NASA Lewis Research Center/Sterling Software
21000 Brookpark Rd. MS 54-2
Cleveland, OH 44135
216-433-6586
mallman@lerc.nasa.gov
http://gigahertz.lerc.nasa.gov/~mallman
Vern Paxson
Network Research Group
Lawrence Berkeley National Laboratory
Berkeley, CA 94720
USA
510-486-7504
vern@ee.lbl.gov
Expires: February, 1999 [Page 10]
| PAFTECH AB 2003-2026 | 2026-04-23 13:52:43 |