One document matched: draft-kuehlewind-conex-tcp-modifications-00.xml
<?xml version="1.0" encoding="US-ASCII"?>
<!-- This template is for creating an Internet Draft using xml2rfc,
which is available here: http://xml.resource.org. -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!-- One method to get references from the online citation libraries.
There has to be one entity for each item to be referenced.
An alternate method (rfc include) is described in the references. -->
<!ENTITY RFC3168 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3168.xml">
<!ENTITY RFC2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY RFC2018 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2018.xml">
<!ENTITY RFC3522 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3522.xml">
<!ENTITY RFC3708 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3708.xml">
<!ENTITY RFC4015 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.4015.xml">
<!ENTITY RFC5562 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5562.xml">
<!ENTITY RFC5681 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5681.xml">
<!ENTITY RFC5682 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5682.xml">
]>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<!-- used by XSLT processors -->
<!-- For a complete list and description of processing instructions (PIs),
please see http://xml.resource.org/authoring/README.html. -->
<!-- Below are generally applicable Processing Instructions (PIs) that most I-Ds might want to use.
(Here they are set differently than their defaults in xml2rfc v1.32) -->
<?rfc strict="yes" ?>
<!-- give errors regarding ID-nits and DTD validation -->
<!-- control the table of contents (ToC) -->
<?rfc toc="yes"?>
<!-- generate a ToC -->
<?rfc tocdepth="4"?>
<!-- the number of levels of subsections in ToC. default: 3 -->
<!-- control references -->
<?rfc symrefs="yes"?>
<!-- use symbolic references tags, i.e, [RFC2119] instead of [1] -->
<?rfc sortrefs="yes" ?>
<!-- sort the reference entries alphabetically -->
<!-- control vertical white space
(using these PIs as follows is recommended by the RFC Editor) -->
<?rfc compact="yes" ?>
<!-- do not start each main section on a new page -->
<?rfc subcompact="no" ?>
<!-- keep one blank line between list items -->
<!-- end of list of popular I-D processing instructions -->
<rfc category="exp"
docName="draft-kuehlewind-conex-tcp-modifications-00"
ipr="trust200902">
<!-- category values: std, bcp, info, exp, and historic
ipr values: trust200902, noModificationTrust200902, noDerivativesTrust200902,
or pre5378Trust200902
you can add the attributes updates="NNNN" and obsoletes="NNNN"
they will automatically be output with "(if approved)" -->
<!-- ***** FRONT MATTER ***** -->
<front>
<!-- The abbreviated title is used in the page header - it is only necessary if the
full title is longer than 39 characters -->
<title>TCP modifications for Congestion Exposure</title>
<!-- add 'role="editor"' below for the editors if appropriate -->
<!-- Another author who claims to be an editor -->
<author fullname="Mirja Kuehlewind" initials="M." role="editor"
surname="Kuehlewind">
<organization>University of Stuttgart</organization>
<address>
<postal>
<street>Pfaffenwaldring 47</street>
<code>70569</code>
<city>Stuttgart</city>
<country>Germany</country>
</postal>
<email>mirja.kuehlewind@ikr.uni-stuttgart.de</email>
</address>
</author>
<author fullname="Richard Scheffenegger" initials="R."
surname="Scheffenegger">
<organization>NetApp, Inc.</organization>
<address>
<postal>
<street>Am Euro Platz 2</street>
<code>1120</code>
<city>Vienna</city>
<region></region>
<country>Austria</country>
</postal>
<phone>+43 1 3676811 3146</phone>
<email>rs@netapp.com</email>
</address>
</author>
<date year="2011" />
<area>Transport</area>
<workgroup>Congestion Exposure (ConEx)</workgroup>
<keyword>Internet-Draft</keyword>
<keyword>I-D</keyword>
<abstract>
<t>Congestion Exposure (ConEx) is a mechanism by which senders inform the network
about the congestion encountered by previous packets on the same flow.
This document describes the necessary modifications to use ConEx with the
Transmission Control Protocol (TCP).
</t>
</abstract>
</front>
<middle>
<section title="Introduction">
<t>Congestion Exposure (ConEx) is a mechanism by which senders inform the network
about the congestion encountered by previous packets on the same flow.
This document describes the necessary modifications to use ConEx with the
Transmission Control Protocol (TCP). The ConEx signal is based on loss or
ECN marks <xref target="RFC3168"/> as a congestion indication.
</t>
<t>With standard TCP without Selective Acknowledgments (SACK)
<xref target="RFC2018"/> the actual number of losses is hard to detect,
thus we recommend to enable SACK when using ConEx.
However, we discuss both cases, with and without SACK support, later on.
</t>
<t>Explicit Congestion Notification (ECN) is defined in such a way that only a
single congestion signal is guaranteed to be delivered per Round-trip Time
(RTT). For ConEx a more accurate <!-- timely? --> feedback signal
would be beneficial. Such an extension to ECN is defined in a seperate document
<xref target="draft-kuehlewind-conex-accurate-ecn"/>, as it can also be useful
for other mechanisms, as e.g. <xref target="DCTCP"/> or whenever the congestion
control reaction should be proportional to the expirienced congestion.
</t>
<t>The current version of this draft is only a first collection of ConEx-based TCP
modification and should not be regared as feature-complete as
the specification for the abstract ConEx mechanism is still under discussion.
The next version will also go more precisely into implementation details.
</t>
<section title="Requirements Language">
<t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in <xref
target="RFC2119"/>.</t>
</section>
</section>
<section title="Sender-side Modifications">
<t>A ConEx sender MUST negotitate for both SACK and the more accurate ECN feedback
in the TCP handshake. Depending on the capability of the receiver, the following
operation modes exist:
<list style="symbols">
<t> Full-ConEx (SACK and accurate ECN feedback) </t>
<t> accECN-ConEx (no SACK but accurate ECN feedback)</t>
<t> ECN-ConEx (no SACK and no accurate ECN feedback but 'classic' ECN)</t>
<t> SACK-ECN-ConEx (SACK and 'classic' instead of accurate ECN)</t>
<t> SACK-ConEx (SACK but no ECN at all)</t>
<t> Basic-ConEx (neither SACK nor ECN) </t>
</list>
</t>
<t>The handling of the IPv6 bits is described in the next section.
</t>
</section>
<section title="Setting the ConEx IPv6 Bits"> <!-- RS: remove IPv6 - Conex signals (bits) should be defined agnostic of IP version, right?) -->
<t>ConEx is currently/will be defined as an destination option for IPv6.
The use of four bits have been defined, namely the X (ConEx-capable),
the L (loss experienced), the E (ECN experienced) and C (credit) bit.
</t>
<t>By setting the X bit a packet is marked as ConEx-capable. It is not
decided yet which or if any packets should not be ConEx capable.
(e.g. control packets as pure ACKs or retransmits). It is not defined yet which bits
(E, L, C) can be set at the same time in one (data) packet. It is assumed
that ConEx marked packets are accounted by their respective IP size, as
all the signals (Loss, ECN) are attributes of an IP packet, not a TCP segment
or merely the TCP payload. Further discussion is needed here.
</t>
<section title="ECN">
<t>A receiver can support the accurate ECN feedback scheme, the
'classic' ECN or neither. In the case ECN is not supported at
all, the transport is not ECN-capable and no ECN marks will occur,
thus the E bit will never be set. In the other cases a ConEx sender
has to maintain a gauge for the number of outstanding ConEx marks,
the congestion exposure gauge (CEG).
</t>
<t>The CEG is increased when ECN information is received from an ECN-capable
receiver supporting the 'classic' ECN scheme or the accurate ECN
feedback scheme. When the ConEx sender receives an ACK indicating one or more
segments were received with a CE mark, CEG is increased by the appropriate
number of bytes sent by the IP layer (e.g. by MTU bytes for each SMSS segment).
Whenever a packet is sent with the E bit set, this gauge is decreased by the
IP size of that packet. The two cases, depending on the receiver capability,
are discussed in the following sections.
</t>
<!--<t>TBD: Discussion to set ECN in which packets. Initially apply RFC5562 rules
([SYN,ACK] and data segments only), as security implications of ECN on
control packets ([SYN], pure [ACK], window probe, window update, ...) is an open
research question. However, running bidirectional ECN on all TCP segments
including TCP control packets, may allow for more timely and accurate ConEx
signals. Also, ConEx provides a framework to possibly address some of these
security risks.</t>-->
<section title="Accurate ECN feedback">
<t>When the accurate ECN feedback scheme is supported by the
receiver, the receiver will maintain an echo congestion counter
(ECC). The ECC will hold the number of CE marks received. A
sender that is understanding the accurate ECN feedback will be
able to reconstruct this ECC value on the sender side by
maintaining a counter ECC.r.
</t>
<t>On the arrival of every ACK, the sender calculates
the difference D between the local ECC.r counter, and the signaled
value of the receiver side ECC counter. The value
of ECC.r is increased by D, and D is assumed to be the number
of CE marked packets that arrived at the receiver since it
sent the previously received ACK.
</t>
<t>Whenever the counter ECC.r is increased, the gauge CEG has to
be increased by the amount of bytes sent on the IP layer which
were marked:</t>
<t>CEG += acked_bytes + (IP.header+TCP.header)*D</t>
</section>
<section title="Classic ECN support">
<t>A ConEx sender that communicates with a classic ECN receiver (conforming
to <xref target="RFC3168"/> or <xref target="RFC5562"/>) MAY
run in one of these modes:
<list style="symbols">
<t>Full compliance mode:<vspace blankLines="1"/>
The ConEx sender fully conforms to
all the semantics of the ECN signaling as defined by
<xref target="RFC5562"/>. In this mode, only a single
congestion indication can be signaled by the receiver
per RTT. Whenever the ECE flag toggels from "0" to "1",
the gauge CEG is increased by the SMSS plus headers.
<vspace blankLines="1"/>
Note that under severe congestion, a session adhering
to these semantics may not provide enough ConEx marks.
This may cause appropriate sanctions by an audit device in
a ConEx enabled network.</t>
<t>Simple compatibility mode:<vspace blankLines="1"/>
<!--Alternatively, a ConEx sender MAY
set the CWR flag opportunistically, to extract more than
one ECE indication per RTT. In
the most simple form, CWR can be set on a permanent basis.-->
The sender will set the CWR permanently to force the receiver to signal
only one ECE per CE mark. Unfortunately, in a high congestion situation where
all packets are CE marled over a certain periode of time, the use of delayed
ACKs, as it is usually done today, will prevent a feedback of every CE mark.
With an ACK rate of m, about m-1/m CE indications will not be signaled back by
the receiver (e.g. 50% with M=2). Thus, in this mode the ConEx sender MUST
increase CEG by a count of<vspace blankLines="1"/>
M*(SMSS+IP.header+TCP.header)<vspace blankLines="1"/>
for each received ECE signal.</t>
<t>Advanced compatibility mode:<vspace blankLines="1"/>
More sophisticated heuristics, such as a phase locked
loop, to set CWR only on those data segments, that
will actually trigger an (delayed) ACK, could extract
congestion notifications more timely. A ConEx sender MAY
choose to implement such a heuristic. In addition, further
heuristics SHOULD be implemented, to determine the value
of each ECE notification. E.g. for each consecutive ACK received
with the ECE flag set, D should be increased by one, and
CEG increased by<vspace blankLines="1"/>
CEG += min((SMSS+IP.header+TCP.header)*D, ((acked_bytes/SMSS)+1)*(IP+TCP Header) + acked_bytes)<vspace blankLines="1"/>
If an ACK is received with the ECE flag cleared, D must be
set to zero. This heuristic is conservative during more
serious congestion, and more relaxed at low congestion
levels.</t>
</list>
</t>
</section>
</section>
<section title="Loss Detection with/without SACK">
<t>For all the data segments that are determined by a ConEx sender
as lost, an identical number of IP bytes MUST be be sent
with the ConEx L bit set. Loss detection typically happens by use
of duplicate ACKs, or the firing of the retransmission timer. A
ConEx sender MUST maintain a loss exposure gauge (LEG), indicating
the number of outstanding IP bytes that must be sent with the ConEx
L bit. When a data segment is retransmitted, LEG should be
increased by the size of the IP packet containing the retransmission.
When sending subsequent segments (including TCP control segments),
the ConEx L bit is set as long as LEG is positive, and LEG is
decreased by the size of the sent IP packet with the ConEx L bit set.
</t>
<t>Any retransmission may be spurious. To accomodate that, a ConEx
sender SHOULD make use of heuristics to detect such spurious
retransmissions (e.g. F-RTO <xref target="RFC5682"/>, DSACK
<xref target="RFC3708"/>, and Eifel <xref target="RFC3522"/>,
<xref target="RFC4015"/>). When such a heuristic has determined,
that a certain number of packets were retransmitted
erroneously, the ConEx sender should subtract the size of these
IP packets from LEG. As ConEx credits have only a limited lifetime,
whenever the gauge becomes negative, it should be drained at a low
rate (e.g. 1 count per sent packet).
</t>
<t>Note that the above heuristics delays the conex signal by one
segment, and also decouples them from the retransmissions themselves, as
some control packets (e.g. pure ACKs, window probes, or window updates)
may be sent in between data segment retransmissions.
A simpler approach would be to set the ConEx signal for each
retransmitted data segment. However, it is important to remember, that
a ConEx signal and TCP segments do not natively belong together.
</t>
</section>
<section title="Credit Bits">
<t>The ConEx abstract mechanism requires that the transport SHOULD signal sufficient
credit in advance to cover any reasonably expected congestion during its
feedback delay. To be very conservative the number of credits would need to equal
the number of packets in flight, as every packet could get lost or congestion marked.
With a more moderate view, only an increase in the sending rate should cause
congestion.
</t>
<t>For TCP sender using the <xref target="RFC5681"/> congestion control algorithm, we recommand to
only send credit in Slow Start, as in Congestion Avoidance an increase of one segement per RTT
should only cause a minor amout of congestion marks. If an more aggressive congestion control is
used, a sufficient amount of credits need to be set.
</t>
<t>In TCP Slow Start the sending rate will increase exponentially and that means double every RTT.
Thus the number of credits should equal half the number of packets in flight in
every RTT. Under the assumption that all marks will not get invalid for the whole
Slow Start phase, marks of a previous RTT have to be sumed up. Thus the marking of
every fourth packet will allow sufficient credits in Slow Start.
</t>
<figure
title="Credits in Slow Start (with an intial window of 3)"
align="center" anchor="SS_credit">
<artwork align="center"><![CDATA[
RTT1 |------XC------>|
|------X------->|
|------X------->| credit=1 in_flight=3
| |
RTT2 |------X------->|
|------XC------>|
|------X------->|
|------X------->|
|------X------->|
|------XC------>| credit=3 in_flight=6
| |
RTT3 |------X------->|
|------X------->|
|------X------->|
|------XC------>|
|------X------->|
|------X------->|
|------X------->|
|------XC------>|
|------X------->|
|------X------->|
|------X------->|
|------XC------>| credit=6 in_flight=12
| . |
| : |
]]></artwork></figure>
<!--<t>TBD: Detailed discussion around 1/4th ConEx C marking during slow start; Any slow
start (session start, idle restart, RTO) or specific slow starts. Spurious RTO
interaction?
</t>-->
<t>If a ConEx sender detects an increasing number of losses even though
the sender reduced the sending rate, the sender SHOULD assume that
those losses are incorperated by an audit device and thus should send
further credits. Upto now its not clear if the credits say valid as long as the connection
is established or if an expiration of the credits need to be assumed by the sender.
</t>
<!--<t>TBD: additional loss would reduce the sending rate,
until enough credits are available to sustain a sending rate. Adding
one C bit for each loss recovery episode may be simpler?
</t>-->
<!--
3.1. On Beginning of a TCP session
A conex sender should build conex credits, by sending the 1st and 3rd
data segment with the conex C bit set. For a detailed discussion as
to the background for choosing these two data segments to build conex
credits with the network, see [I-D.briscoe-tsvwg-re-ecn-tcp].
In addition, a conex sender SHOULD maintain a gauge to account for
the number of (data) bytes [packets], which need to be sent with the
conex L bit set, and a similar gauge to account for the segments that
still have to carry the conex E bit. As multiple indications of lost
segments or ECN marks can arrive simultaneously at the sender, a
second counter should be maintained for both the conex L and E bits,
to space the sending of these conex bits more evenly. For
implementation details, see Appendix A.
3.5. On restarting idle TCP Connections
Conex signals are valid only for a limited amount of time.
Furthermore, TCP does not currently account for lost ACKs or allow
the use of ECN marking on control segments (e.g. pure ACKs, window
probes, window updates, FIN or RST segments without data). In a
common TCP connection, data will flow only unidirectional for a
certain period of time, while only TCP control packets are traversing
the return half-connection. Subsequently, the data direction may
change. If a TCP determines, that the duration between two sent data
segments becomes too large, it will reduce it's congestion window
(see Section 4.1 in [RFC5681]). This MUST be accompanied in a conex
sender by the building of new conex credits, by setting conex C bit
in the 1st and 3rd data segment sent after the restart. Furthermore,
the gauges, counters and flags maintained by the sender should be
reinitialized to zero, as any previous value will be invalid at that
point in time.
-->
</section>
<!--<section title="Credit Bits during Congestion Avoidance">
<t>TBD: When increasing the congestion window while in CA, will the
one additional segment need further credits?
</t>
</section>-->
<section title="Timeliness of the conex signals">
<t>Conex signals will anyway be evaluated with a slight time delay of about one RTT by a network
node. Therefore, it is not necessary to immediately signal ConEx
bits when they become known (e.g. L and E bits), but a sender SHOULD NOT
delay ConEx signaling excessively.
</t>
<t>Multiple ConEx bits may become available for signaling at the same
time, for example when an ACK is received by the sender, that
indicates that at least one segment has been lost, and that one or
more ECN marks were received at the same time. This may happen
during excessive congestion, where buffer queues overflow and some
packets are marked, while others have to be dropped nevertheless.
Another possibility when this may happen are lost ACKs, so that a
subsequent ACK carries summary information not previously available
to the sender.
</t>
<t>It may be preferrable to signal only one ConEx bit per segment, and to
space out the signaling of multiple bits across a (short) period of
time - or number of segments. However, that delay should not be
excessive, and ideally also shorter than the RTT of the affected TCP
session. The heuristic sketched in Appendix A uses a maximum delay
of 10 packets or 1/4 of the congestion window, whatever is smaller
to minimize delay.
</t>
<t>It is important to remember, that ConEx bits and TCP retransmissions
do not interact with each other. However, a retransmission should be
accompanied by one ConEx L bit in close proximity nevertheless. This does not mean,
that TCP retransmissions may never contain ConEx marks. In
a typical scenario using SACK, the first retransmission would not carry
a ConEx L bit, while subsequent retransmissions in the same recovery
episode, would be marked with the ConEx L bit.
Spreading the ConEx bits over a small number of segments increases
the liklihood that most devices along the path will see some
ConEx marks even during heavy congestion.
</t>
</section>
</section>
<section anchor="Acknowledgements" title="Acknowledgements">
<t></t>
</section>
<!-- Possibly a 'Contributors' section ... -->
<section anchor="IANA" title="IANA Considerations">
<t></t>
</section>
<section anchor="Security" title="Security Considerations">
<t></t>
</section>
</middle>
<!-- *****BACK MATTER ***** -->
<back>
<references title="Normative References">
<!--?rfc include="http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml"?-->
&RFC2119;
&RFC3168;
&RFC2018;
&RFC5562;
</references>
<references title="Informative References">
&RFC3522;
&RFC3708;
&RFC4015;
&RFC5681;
&RFC5682;
<?rfc include="reference.I-D.briscoe-tsvwg-re-ecn-tcp.xml"?>
<reference anchor="draft-kuehlewind-conex-accurate-ecn" >
<front>
<title>Accurate ECN Feedback in TCP</title>
<author initials="M" surname="Kuehlewind">
<organization></organization></author>
<author initials="R" surname="Scheffenegger">
<organization></organization></author>
<date month="Jun" year="2011"/>
</front>
<seriesInfo name="Internet-Draft" value="draft-kuehlewind-conex-accurate-ecn-00"/>
</reference>
<reference anchor="DCTCP">
<front>
<title>DCTCP: Efficient Packet Transport for the Commoditized Data Center</title>
<author initials="M" surname="Alizadeh">
<organization></organization></author>
<author initials="A" surname="Greenberg">
<organization></organization></author>
<author initials="D" surname="Maltz">
<organization></organization></author>
<author initials="J" surname="Padhye">
<organization></organization></author>
<author initials="P" surname="Patel">
<organization></organization></author>
<author initials="B" surname="Prabhakar">
<organization></organization></author>
<author initials="S" surname="Sengupta">
<organization></organization></author>
<author initials="M" surname="Sridharan">
<organization></organization></author>
<date month="Jan" year="2010"/>
</front>
</reference>
</references>
<section anchor="ApxA" title="Spacing conex marks evenly">
<t>Under certain circumstances, very high marking conex marking rates may
need to be signaled. However, as conex maintains running averages, it
may be beneficial to send these marks more evenly spaced, than in bursts
of consecutive segments, all with the conex bits set.</t>
<t>If only a single conex mark needs to be sent, it should be sent immediately
to maintain optimal timeliness. Any subsequent conex marks may be delayed
slightly, to disentangle retransmissions of the transport protocol from
packets carrying conex marks.</t>
<t>The following algorithm will provide such a method. When very high marking
rates are required, it will automatically set a conex mark with every sent
packet.</t>
<t>When an ACK is received, the sender determines the number of bytes which
need to be sent with a conex mark, and the next segment to be sent should
carry at least part of the conex signal:</t>
<figure><preamble>CEF .. congestion.expirienced.flag
CEG .. congestion.expirienced.gauge
CEC .. congestion.expirienced.counter
LEF .. loss.expirienced.flag
LEG .. loss.expirienced.gauge
LEC .. loss.expirienced.counter</preamble><artwork><![CDATA[
if (marks.received > 0) {
CEF = 1; # for the immediate mark
CEG += bytes.received.with.CE; # for the delayed mark
}
if (lost segment retransmitted) {
LEF = 1;
LEG += IP.size.of.retransmitted.segment;
}
]]></artwork></figure>
<t>When sending a segment, the following algorithm is run to determine
if the segment should carry a conex mark. Note that the counter is
initialized to at most 10*(MTU of the path). This spaces two consecutive received
marks at most 10 full sized data segments apart.</t>
<t>For each of the L (loss) and E (ECN) conex bits, a similar algorithm
needs to run.</t>
<figure><artwork><![CDATA[
if ((LEF == 0) and (LEG > 0)) {
if ((LEC <= 0) or (LEG >= LEC)) {
LEG -= IP.size.of.segment.to.be.sent;
LEC = min(10*(SMSS+IP.header+TCP.header), Flightsize/4);
LEF = 1;
} else {
LEC -= LEG;
}
}
if ((CEF == 0) and (CEG > 0)) {
if ((CEC <= 0) or (CEG >= CEC)) {
CEG -= IP.size.of.segment.to.be.sent;
CEC = min(10*(SMSS+IP.header+TCP.header), Flightsize/4);
CEF = 1;
} else {
CEC -= CEG;
}
}
if (LEF == 1) {
LEF = 0;
SendSegment(conex.L.bit);
} else {
if (CEF == 1) {
CEF == 0;
SendSegment(conex.E.bit);
} else
SendSegment;
}
if (CEG < 0) {
CEG += 1; # slowly reduce credit
}
if (LEG < 0) {
LEG += 1;
}
}
]]></artwork></figure>
</section>
</back>
</rfc>
| PAFTECH AB 2003-2026 | 2026-04-23 14:37:11 |