<?xml version="1.0" encoding="US-ASCII"?>
<!-- This template is for creating an Internet Draft using xml2rfc,
which is available here: http://xml.resource.org. -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!-- One method to get references from the online citation libraries.
There has to be one entity for each item to be referenced.
An alternate method (rfc include) is described in the references. -->
<!ENTITY RFC2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY RFC3168 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3168.xml">
<!ENTITY RFC3540 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3540.xml">
<!ENTITY RFC5562 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5562.xml">
<!ENTITY RFC5681 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5681.xml">
<!ENTITY RFC5690 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5690.xml">
]>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<!-- used by XSLT processors -->
<!-- For a complete list and description of processing instructions (PIs),
please see http://xml.resource.org/authoring/README.html. -->
<!-- Below are generally applicable Processing Instructions (PIs) that most I-Ds might want to use.
(Here they are set differently than their defaults in xml2rfc v1.32) -->
<?rfc strict="yes" ?>
<!-- give errors regarding ID-nits and DTD validation -->
<!-- control the table of contents (ToC) -->
<?rfc toc="yes"?>
<!-- generate a ToC -->
<?rfc tocdepth="4"?>
<!-- the number of levels of subsections in ToC. default: 3 -->
<!-- control references -->
<?rfc symrefs="yes"?>
<!-- use symbolic references tags, i.e, [RFC2119] instead of [1] -->
<?rfc sortrefs="yes" ?>
<!-- sort the reference entries alphabetically -->
<!-- control vertical white space
(using these PIs as follows is recommended by the RFC Editor) -->
<?rfc compact="yes" ?>
<!-- do not start each main section on a new page -->
<?rfc subcompact="no" ?>
<!-- keep one blank line between list items -->
<!-- end of list of popular I-D processing instructions -->
<rfc category="exp" docName="draft-kuehlewind-tcpm-accurate-ecn-01" ipr="trust200902">
<!-- updates="3186" -->
<!-- category values: std, bcp, info, exp, and historic
ipr values: trust200902, noModificationTrust200902, noDerivativesTrust200902,
or pre5378Trust200902
you can add the attributes updates="NNNN" and obsoletes="NNNN"
they will automatically be output with "(if approved)" -->
<!-- ***** FRONT MATTER ***** -->
<front>
<!-- The abbreviated title is used in the page header - it is only necessary if the
full title is longer than 39 characters -->
<title>More Accurate ECN Feedback in TCP</title>
<!-- add 'role="editor"' below for the editors if appropriate -->
<!-- Another author who claims to be an editor -->
<author fullname="Mirja Kühlewind" initials="M." role="editor"
surname="Kühlewind">
<organization>University of Stuttgart</organization>
<address>
<postal>
<street>Pfaffenwaldring 47</street>
<code>70569</code>
<city>Stuttgart</city>
<country>Germany</country>
</postal>
<email>mirja.kuehlewind@ikr.uni-stuttgart.de</email>
</address>
</author>
<author fullname="Richard Scheffenegger" initials="R."
surname="Scheffenegger">
<organization>NetApp, Inc.</organization>
<address>
<postal>
<street>Am Euro Platz 2</street>
<code>1120</code>
<city>Vienna</city>
<region></region>
<country>Austria</country>
</postal>
<phone>+43 1 3676811 3146</phone>
<email>rs@netapp.com</email>
</address>
</author>
<date year="2012" />
<area>Transport</area>
<workgroup>TCP Maintenance and Minor Extensions (tcpm)</workgroup>
<keyword>Internet-Draft</keyword>
<keyword>I-D</keyword>
<abstract>
<t>Explicit Congestion Notification (ECN) is an IP/TCP mechanism where network nodes
can mark IP packets instead of dropping them to indicate congestion to the end-points.
An ECN-capable receiver feeds this information back to the sender. ECN is specified for
TCP in such a way that only one feedback signal can be transmitted per Round-Trip Time (RTT).
Recently proposed mechanisms such as ConEx or DCTCP need more accurate ECN feedback information in the case
where more than one marking is received in one RTT.
This document specifies a different scheme for the ECN feedback in the TCP header
to provide more than one feedback signal per RTT.
</t>
</abstract>
</front>
<middle>
<section title="Introduction">
<t>Explicit Congestion Notification (ECN) <xref target="RFC3168"/> is
an IP/TCP mechanism where
network nodes can mark IP packets instead of dropping them to indicate congestion to
the end-points. An ECN-capable receiver feeds this information back to the sender.
ECN is specified for TCP in such a way that only one feedback signal can be
transmitted per Round-Trip Time (RTT).
Recently proposed mechanisms such as Congestion Exposure (ConEx) or DCTCP
<xref target="Ali10"/> need more accurate ECN feedback information
in the case where more than one marking is received in one RTT.
</t>
<t>This document specifies a different scheme for the ECN feedback in the TCP header
to provide more than one feedback signal per RTT. This modification does not obsolete
<xref target="RFC3168"/>. To avoid confusion we call the ECN specification of
<xref target="RFC3168"/> 'classic ECN' in this document. This document provides an
extension that requires additional negotiation in the TCP handshake by using the
TCP nonce sum (NS) bit, as specified in <xref target="RFC3540"/>, which is
currently not used when SYN is set. If the more accurate ECN extension has been
negotiated successfully, the meaning of ECN TCP bits and the ECN NS bit is different
from the specification in <xref target="RFC3168"/> and <xref target="RFC3540"/>. This document specifies the
additional negotiation as well as the new coding of the TCP ECN/NS bits.
</t>
<t> The proposed coding scheme maintains the given bit space as the ECN feedback
information is needed in a timely manner and as such should be reported in every
ACK. The reuse will avoid additional network load as the ACK size will not
increase. Moreover, the more accurate ECN information will replace the classic
ECN feedback if negotiated, so those bits are not needed for any other purpose.
However, the proposed scheme also requires the use of the NS bit in the TCP handshake
as well as for the more accurate ECN feedback itself. The proposed more accurate
ECN feedback extension can include the ECN-Nonce integrity mechanism as some coding
space is left open. The use of ECN-Nonce is not part of the specification in this
document but is discussed in the appendix.
</t>
<section title="Use Cases">
<t> The following scenarios briefly show where the accurate feedback
is needed or provides additional value:
<list hangIndent="8" style="hanging">
<t hangText="A Standard (RFC5681) TCP sender that supports ConEx:"><vspace/>
In this case the congestion control algorithm still ignores multiple
marks per RTT, while the ConEx mechanism uses the extra information
per RTT to re-echo more precise congestion information. </t>
<t hangText="A sender using DCTCP congestion control without ConEx:"><vspace/>
The congestion control algorithm uses the extra info per RTT to
perform its decrease depending on the number of congestion marks.</t>
<t hangText="A sender that uses DCTCP congestion control and supports ConEx:"><vspace/>
Both the congestion control algorithm and ConEx use the accurate
ECN feedback mechanism.</t>
<t hangText="A standard TCP sender (using RFC5681 congestion control algorithm) without ConEx:"><vspace/>
No accurate feedback is necessary here. The congestion control
algorithm still reacts to only one signal per RTT. But it is best
to have one generic feedback mechanism, whether it is used or not.</t>
</list>
</t>
</section>
<section title="Overview ECN and ECN Nonce in IP/TCP">
<t>ECN requires two bits in the IP header. The ECN capability of a
packet is indicated when either one of the two bits is set. An
ECN sender can set one or the other bit to indicate
an ECN-capable transport (ECT) which results in two signals,
ECT(0) and ECT(1). A network node can set both bits simultaneously
when it experiences congestion. When both bits are set the
packet is regarded as "Congestion Experienced" (CE).
</t>
<t>In the TCP header the first two bits in byte 14 are defined
for the use of ECN. The TCP mechanism for signaling the reception
of a congestion mark uses the ECN-Echo (ECE) flag in the TCP header.
To enable the TCP receiver to determine when to stop setting the
ECN-Echo flag, the CWR flag is set by the sender upon reception of
the feedback signal. This always leads to a full RTT of ACKs with
ECE set. Thus any additional CE markings arriving within this RTT
cannot be signaled back.
</t>
<t>
ECN-Nonce <xref target="RFC3540"/> is an optional addition to ECN
that is used to protect the TCP sender against accidental or
malicious concealment of marked or dropped packets. This addition
defines the last bit of byte 13 in the TCP header as the Nonce
Sum (NS) bit. With ECN-Nonce a nonce sum is maintained that counts
the occurrence of ECT(1) packets.<vspace blankLines="20" />
</t>
<figure anchor="TCPHdr" align="center" title="The (post-ECN Nonce) definition of the TCP header flags">
<artwork align="center"><![CDATA[
  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|               |           | N | C | E | U | A | P | R | S | F |
| Header Length | Reserved  | S | W | C | R | C | S | S | Y | I |
|               |           |   | R | E | G | K | H | T | N | N |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
]]></artwork>
</figure>
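<t>As an illustrative sketch (not part of the specification), the three flags
discussed above can be read from a raw TCP header as shown below. The function
name ecn_flags is hypothetical, and the code assumes that "byte 13" and
"byte 14" in the text count from one, i.e. that they are the octets at
zero-based offsets 12 and 13 of the header.</t>
<figure><artwork><![CDATA[
```python
import struct
from typing import Tuple


def ecn_flags(tcp_header: bytes) -> Tuple[int, int, int]:
    """Return (NS, CWR, ECE) from a raw TCP header (hypothetical helper).

    Assumption: tcp_header holds at least the fixed 20-octet TCP header.
    """
    # Octet at offset 12: header length (4 bits), reserved (3 bits), NS.
    # Octet at offset 13: CWR, ECE, URG, ACK, PSH, RST, SYN, FIN.
    off_res_ns, flags = struct.unpack_from("!BB", tcp_header, 12)
    ns = off_res_ns & 0x01      # last bit of "byte 13"
    cwr = (flags >> 7) & 0x01   # first bit of "byte 14"
    ece = (flags >> 6) & 0x01   # second bit of "byte 14"
    return ns, cwr, ece
```
]]></artwork></figure>
<t>For example, a header with a header length of 5 words, NS set, and
CWR, ECE and SYN set in the following octet yields (1, 1, 1).</t>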
</section>
<section title="Requirements">
<t>To be usable by mechanisms such as ConEx or DCTCP, the accurate ECN feedback
protocol has to provide fairly accurate (not necessarily perfect), timely and
protected signaling. This leads to the following
requirements:</t>
<t><list hangIndent="8" style="hanging">
<t hangText="Resilience"><vspace/>The ECN feedback signal is carried within the
TCP acknowledgment. TCP ACKs can get lost. Moreover, delayed ACKs are commonly used
with TCP, which means that in most cases only every second data packet triggers an ACK.
In a high congestion situation where most of the packets are marked with CE, an
accurate feedback mechanism must still be able to signal sufficient congestion
information. Thus the accurate ECN feedback extension has to take delayed ACKs and
ACK loss into account.</t>
<t hangText="Timely"><vspace/>The CE marking is induced by a network node on the
transmission path and echoed by the receiver in the TCP acknowledgment. Thus when
this information arrives at the sender, it is naturally already about one RTT old.
With a sufficient ACK rate a further delay of a small number of ACKs can be
tolerated, but with large delays this information will be outdated due to the high
dynamics in the network. TCP congestion control, which introduces part of these
dynamics, operates on a time scale of one RTT. Thus the congestion feedback
information should be delivered in a timely manner (within one RTT).</t>
<t hangText="Integrity"><vspace/>With ECN Nonce, a misbehaving receiver or network node
can be detected with a certain probability. As this accurate ECN feedback is
reusing the NS bit, it is encouraged to ensure integrity at least as good as
ECN Nonce. If this is not possible, alternative approaches should be provided
for how a mechanism using the accurate ECN feedback extension can re-ensure
integrity or give strong incentives for the receiver and network node to
cooperate honestly. <!--If and what kind of
enforcements a sender should do, when detecting wrong feedback information, is
out-of-scope.--></t>
<t hangText="Accuracy"><vspace/><!--In TCP usually delayed ACKs are used. Thats means in
most cases only for every second data packets an acknowledgment is sent. Moreover,
an ACK can get lost.-->Classic ECN feeds back one congestion notification per RTT, as
this is supposed to be used for TCP congestion control which reduces the sending
rate at most once per RTT. The accurate ECN feedback scheme has to ensure that,
if a congestion event occurs, at least one congestion notification is echoed and
received per RTT, as classic ECN would do. Of course, the goal of this extension is to
reconstruct the number of CE markings more accurately. However, a sender should
not assume that it gets the exact number of congestion markings in all situations.</t>
<t hangText="Complexity"><vspace/>Of course, the more accurate ECN feedback can
also be used even if only one ECN feedback signal per RTT is needed. The
implementation should be as simple as possible and only a minimum of additional
state information should be needed. A proposal fulfilling this for a more
accurate ECN feedback can then also be the standard ECN feedback mechanism.</t>
</list></t>
</section>
<section title="Design choices">
<t>The idea of this document is to use the ECE, CWR and NS bits for additional
capability negotiation during the &lt;SYN&gt; / &lt;SYN,ACK&gt; exchange, and
then for the more accurate ECN feedback itself on subsequent packets in the
flow (where SYN is not set).
</t>
<t>Alternatively, a new TCP option could be introduced, to help maintain the
accuracy, and integrity of the ECN feedback between receiver and sender.
Such an option could provide more information. For example, ECN for RTP/UDP explicitly
provides the number of ECT(0), ECT(1), CE, non-ECT marked and lost packets.
However, deploying new TCP options has its own challenges. A separate
document proposes a new TCP Option for accurate ECN feedback
<xref target="draft-kuehlewind-tcpm-accurate-ecn-option"/>. This option
could be used in addition to a more accurate ECN feedback scheme described
here or in addition to classic ECN, when available and needed.
</t>
<!--<t>Combining the idea of <xref target="eci_mode"/> and <xref target="cp_mode"/>,
further extending it to a one-octet option, would allow the signaling of two
values, each with 4 bit. The gains in worst case ACK loss, delayed ACK ratios
and maintaining ECN Nonce would scale accordingly.
</t>
<t>Alternatively, if timestamp capability negotiation is supported, a few
bits could be extracted from the timestamp value, to provide extended
signaling. However, processing TCP options (or overloaded TCP options) is
more complex than processing of header flags.
</t>-->
<t>As seen in <xref target="TCPHdr"/>, there are currently three unused flag bits
in the TCP header. The proposed scheme could be extended by one or
more bits, to add higher resiliency against ACK loss. The relative gain
would be proportionally higher resiliency against ACK loss, while the respective
drawbacks would remain identical. Thus the approach in this document is to
maintain the scope of the given number of header bits as they seem to be
already sufficient. This accurate ECN feedback scheme will only be used
instead of the classic ECN and never in parallel.
</t>
</section>
<section title="Requirements Language">
<t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in <xref
target="RFC2119">RFC 2119</xref>.</t>
<t>We use the following terminology from <xref target="RFC3168"/> and <xref target="RFC3540"/>:</t>
<t>The ECN field in the IP header:
<list hangIndent="10" style="empty"><t>
<list hangIndent="9" style="hanging">
<t hangText="CE:">the Congestion Experienced codepoint, and</t>
<t hangText="ECT(0):">the first ECN-Capable Transport codepoint, and</t>
<t hangText="ECT(1):">the second ECN-Capable Transport codepoint.</t>
</list></t>
</list></t>
<t>The ECN flags in the TCP header:
<list hangIndent="10" style="empty"><t>
<list hangIndent="9" style="hanging">
<t hangText="CWR:">the Congestion Window Reduced flag,</t>
<t hangText="ECE:">the ECN-Echo flag, and</t>
<t hangText="NS:">ECN Nonce Sum.</t>
</list></t>
</list></t>
<t> In this document, we will call the ECN feedback scheme as specified
in <xref target="RFC3168"/> the 'classic ECN'
and our new proposal the 'more accurate ECN feedback' scheme.
A 'congestion mark' is defined as an IP packet where the CE codepoint is set.
A 'congestion event' refers to one or more congestion marks belonging to the same
overload situation in the network (usually during one RTT).
</t>
</section>
</section>
<section title="Negotiation during the TCP handshake" anchor="TCPNeg">
<t> During the TCP handshake at the start of a connection, an originator
of the connection (host A) MUST
indicate a request to get more accurate ECN feedback by setting the TCP flags
NS=1, CWR=1 and ECE=1 in the initial &lt;SYN&gt;.
</t>
<t> A responding host (host B) MUST return a &lt;SYN,ACK&gt; with flags
CWR=1 and ECE=0. The responding host MUST NOT set this combination
of flags unless the preceding &lt;SYN&gt; has already requested
support for more accurate ECN feedback as above. Normally a server (B) will
reply to a client with NS=0, but if the initial &lt;SYN&gt; from client A is
marked CE, the server B SHOULD set the NS flag to 1 to indicate the congestion
immediately instead of delaying the signal to the first acknowledgment, when
the actual data transmission has already started.
<!--a server B MUST
increment its local value of
ECC. But B cannot reflect the value of ECC in the SYN ACK, because
it is still using the 3 bits to negotiate connection capabilities. -->
<!-- [RS] ECC not yet defined. Suggest to remove this discussion for later
(sec. 312 or 313). Also, need to add a paragraph stating that we encourage
the use of ECT during non-data segments (SYN, pure ACK)... -->
So, server B MAY set the alternative TCP header flags in its
&lt;SYN,ACK&gt;: NS=1, CWR=1 and ECE=0.
</t>
<t> The addition of ECN to TCP &lt;SYN,ACK&gt; packets is discussed
and specified as experimental in <xref target="RFC5562"/>. The addition
of ECN to the &lt;SYN&gt; packet is optional. The security implications
of using this option are not discussed further here.
</t>
<t> This handshake is summarized in Table 1 below, with X indicating
NS can be either 0 or 1 depending on whether congestion had been
experienced. The handshakes used for the other flavors of ECN are
also shown for comparison. To compress the width of the table, the
headings of the first four columns have been severely abbreviated, as
follows:
<list style="hanging" hangIndent="4">
<t hangText="Ac:">*Ac*curate ECN Feedback</t>
<t hangText="N:">ECN-*N*once (RFC3540)</t>
<t hangText="E:">*E*CN (RFC3168)</t>
<t hangText="I:">Not-ECN (*I*mplicit congestion notification).</t>
</list>
</t>
<texttable anchor="Tab1" align="center"
title="ECN capability negotiation between Sender (A) and Receiver (B)">
<ttcol align="left">Ac</ttcol>
<ttcol align="center">N</ttcol>
<ttcol align="center">E</ttcol>
<ttcol align="center">I</ttcol>
<ttcol align="center">&lt;SYN&gt; A->B</ttcol>
<ttcol align="center">&lt;SYN,ACK&gt; B->A</ttcol>
<ttcol align="left">Mode</ttcol>
<c/> <c/> <c/> <c/> <c>NS CWR ECE</c> <c>NS CWR ECE</c> <c/>
<c>AB</c> <c/> <c/> <c/> <c>1 1 1</c> <c>X 1 0</c> <c>accurate ECN</c>
<c>A</c> <c>B</c> <c/> <c/> <c>1 1 1</c> <c>1 0 1</c> <c>ECN Nonce</c>
<c>A</c> <c/> <c>B</c> <c/> <c>1 1 1</c> <c>0 0 1</c> <c>classic ECN</c>
<c>A</c> <c/> <c/> <c>B</c> <c>1 1 1</c> <c>0 0 0</c> <c>Not ECN</c>
<c>A</c> <c/> <c/> <c>B</c> <c>1 1 1</c> <c>X 1 1</c> <c>Not ECN (broken)</c>
</texttable>
<t> Recall that, if the &lt;SYN,ACK&gt; reflects the same flag settings as the
preceding &lt;SYN&gt; (because there is a broken TCP
implementation that behaves this way), RFC3168 specifies that the
whole connection MUST revert to Not-ECT.
</t>
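<t>The outcome of the negotiation can be sketched as a small lookup on the
flags of the received &lt;SYN,ACK&gt;. This is only an illustration of
<xref target="Tab1"/>, assuming the initial &lt;SYN&gt; carried NS=1, CWR=1 and
ECE=1. The function name negotiated_mode is hypothetical, and reporting
combinations that the table does not list as "unspecified" is an assumption
of this sketch, not a rule of this document.</t>
<figure><artwork><![CDATA[
```python
def negotiated_mode(ns: int, cwr: int, ece: int) -> str:
    """Classify the (NS, CWR, ECE) flags of a <SYN,ACK> (hypothetical helper).

    Assumption: the preceding <SYN> requested accurate ECN feedback
    with NS=1, CWR=1 and ECE=1.
    """
    if cwr == 1 and ece == 0:
        # NS is "X" here: NS=1 additionally signals a CE mark on the SYN.
        return "accurate ECN"
    if cwr == 1 and ece == 1:
        # Covers the case of a broken peer reflecting the SYN flags.
        return "Not ECN (broken)"
    if (ns, cwr, ece) == (1, 0, 1):
        return "ECN Nonce"
    if (ns, cwr, ece) == (0, 0, 1):
        return "classic ECN"
    if (ns, cwr, ece) == (0, 0, 0):
        return "Not ECN"
    # Assumption: combinations absent from the table are not interpreted.
    return "unspecified"
```
]]></artwork></figure>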
</section>
<section title="More Accurate ECN Feedback">
<t>In this section we refer to the sender as the host sending data and
to the receiver as the host that acknowledges this data. Of course
such a scenario describes only one half-connection of a TCP
connection. The proposed scheme, if negotiated, will be used for
both half-connections, as both sender and receiver need to be
capable of echoing and understanding the accurate ECN feedback scheme.
</t>
<t> This section proposes the new coding of the two ECN TCP bits
(ECE/CWR) as well as the TCP NS bit to provide a more accurate
ECN feedback. This coding MUST only be used if the more accurate
ECN feedback has been negotiated successfully in the TCP handshake.
</t>
<t> Section <xref target="comp_mode"/> describes an alternative: a
compatibility mode for the case where a sender needs more accurate ECN feedback
but has to operate with a legacy <xref target="RFC3168"/> classic
ECN receiver.
</t>
<section title="Codepoint Coding" anchor="TCPSig">
<t> The more accurate ECN feedback coding uses the ECE, CWR and NS
bits as one field to encode 8 distinct codepoints. This overloaded
use of these 3 header flags as one 3-bit more Accurate ECN (AcE)
field is shown in <xref target="AcE_ACK"/>. The actual definition
of the TCP header, including the addition of support for the ECN
Nonce, is shown for comparison in <xref target="TCPHdr"/>. This
specification does not redefine the names of these three TCP
flags; it merely overloads them with another definition once a
flow with more accurate ECN feedback is established.</t>
<figure title="Definition of the AcE field within bytes 13 and 14 of the TCP Header (when SYN=0)." align="center" anchor="AcE_ACK">
<artwork align="center"><![CDATA[
  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|               |           |           | U | A | P | R | S | F |
| Header Length | Reserved  |    AcE    | R | C | S | S | Y | I |
|               |           |           | G | K | H | T | N | N |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
]]></artwork></figure>
<t>The 8 possible codepoints are shown below. Five of them are used
to encode a "congestion indication" (CI) counter. The other three
codepoints are undefined but can be used for some kind of integrity
check (see appendix <xref target="e1-codepoints"/>). The CI counter
maintains the number of CE marks observed at the receiver (see
<xref target="receiver-impl"/>).</t>
<t>Also note that, whenever the SYN flag of a TCP segment is set
(including when the ACK flag is also set), the NS, CWR and ECE
flags (i.e. the AcE field of the &lt;SYN,ACK&gt;) MUST NOT be
interpreted as the 3-bit codepoint, which is only used in non-SYN
packets.</t>
<texttable anchor="Tab_CP" align="center" title="Codepoint assignment for accurate ECN feedback">
<ttcol align="center">AcE</ttcol>
<ttcol align="center">NS</ttcol>
<ttcol align="center">CWR</ttcol>
<ttcol align="center">ECE</ttcol>
<ttcol align="center">CI (base5)</ttcol>
<c>0</c> <c>0</c> <c>0</c> <c>0</c> <c>0</c>
<c>1</c> <c>0</c> <c>0</c> <c>1</c> <c>1</c>
<c>2</c> <c>0</c> <c>1</c> <c>0</c> <c>2</c>
<c>3</c> <c>0</c> <c>1</c> <c>1</c> <c>3</c>
<c>4</c> <c>1</c> <c>0</c> <c>0</c> <c>4</c>
<c>5</c> <c>1</c> <c>0</c> <c>1</c> <c>-</c>
<c>6</c> <c>1</c> <c>1</c> <c>0</c> <c>-</c>
<c>7</c> <c>1</c> <c>1</c> <c>1</c> <c>-</c>
</texttable>
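<t>The assignment in <xref target="Tab_CP"/> can be sketched as a pair of
helper functions. The names encode_ci and decode_ace are hypothetical and
serve only as an illustration. The sketch treats the AcE field as the
3-bit number formed by NS (most significant bit), CWR and ECE, as in the
table, and returns None for the three codepoints that do not carry a
CI value.</t>
<figure><artwork><![CDATA[
```python
from typing import Optional, Tuple


def encode_ci(ci: int) -> Tuple[int, int, int]:
    """Return (NS, CWR, ECE) for the codepoint carrying CI modulo 5."""
    ace = ci % 5  # only codepoints 0..4 encode the CI counter
    return (ace >> 2) & 1, (ace >> 1) & 1, ace & 1


def decode_ace(ns: int, cwr: int, ece: int) -> Optional[int]:
    """Return the signaled CI value (0..4), or None for codepoints 5..7."""
    ace = (ns << 2) | (cwr << 1) | ece
    return ace if ace <= 4 else None
```
]]></artwork></figure>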
<t>By default an accurate ECN receiver MUST echo one of the codepoints
encoding the CI counter value. Whenever a CE is received and thus the
value of the CI has changed, the receiver MUST echo the CI in the
next ACK. Moreover, the receiver MUST repeat the codepoint that
carries the CI counter on the immediately following ACK. Thus every
value of CI will be transmitted at least twice. Otherwise the
receiver MAY send one of the other, currently undefined, codepoints.</t>
<!--<t>For resilience against lost ACKs, a second ACK has to
transmit the previous codepoint again, whether another
congestion indication (CE) or ECT(1) mark arrives or not.</t>-->
<t>With the available number of codepoints, this requirement may
conflict with delayed ACK ratios larger than two.
A receiver MUST change the ACK'ing rate such that a sufficient
rate of feedback signals can be sent. Details on how the
change in the ACK'ing rate can be implemented are given in
section <xref target="receiver"/>.
<!--Under certain
circumstances, i.e. the sender using excessive ECT(1) marks, every packet
may be immediately get ACK'ed. The available codepoints for CI allow
the indefinite use of delayed ACKs with a ratio of two, even during
heavy network congestion. --></t>
</section>
<section title="More Accurate ECN TCP Sender">
<t> This section specifies the sender-side action describing how
to extract the number of congestion markings from the given
receiver feedback signal.
</t>
<t>When the more accurate ECN feedback scheme is supported by the sender,
the sender will maintain a congestion indication received (CI.r) counter.
This CI.r counter will hold the number of CE marks as signaled by the
receiver, and reconstructed by the sender.</t>
<t>On the arrival of every ACK, the sender calculates the difference D
between the signaled CI value of the codepoint in the ACK and the local
CI.r value modulo 5. The difference itself is taken modulo 5, so D lies
between 0 and 4. The value of CI.r is increased by D, and
D is assumed to be the number of CE marked packets that arrived at
the receiver since it sent the previously received ACK.</t>
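<t>A minimal sketch of this sender-side update is given below. The function
name sender_update is hypothetical. The sketch assumes that the ACK carries
one of the five CI codepoints and that fewer than five CE marks were
signaled since the last ACK that the sender received, because larger
differences cannot be distinguished with a modulo-5 counter.</t>
<figure><artwork><![CDATA[
```python
from typing import Tuple


def sender_update(ci_r: int, signaled_ci: int) -> Tuple[int, int]:
    """Return (new CI.r, D) after an ACK carrying signaled_ci (0..4)."""
    # Difference between the signaled value and CI.r modulo 5, itself
    # taken modulo 5 so that a wrapped counter still yields 0..4.
    d = (signaled_ci - ci_r) % 5
    return ci_r + d, d
```
]]></artwork></figure>
<t>For example, with CI.r=1 an ACK carrying CI=0 gives D=4 and CI.r=5,
whereas an ACK repeating CI=1 gives D=0 and leaves CI.r unchanged.</t>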
</section>
<section title="More Accurate ECN TCP Receiver" anchor="receiver">
<t> This section describes the receiver-side action to signal the accurate ECN
feedback back to the sender. The receiver will need to maintain a
congestion indication (CI) counter of how many CE markings have been
seen during a connection. Thus for each incoming segment with a CE
marking, the receiver will increase CI by 1.
With each ACK the receiver will calculate CI modulo 5 and set the
respective codepoint in the AcE field (see table <xref target="Tab_CP"/>).
To avoid counter wrap-arounds in a high congestion situation, the
receiver SHOULD switch from delayed ACK behavior to sending ACKs
immediately after the data packet reception if needed.
</t>
<section title="Implementation" anchor="receiver-impl">
<t> The receiver counts how many packets carry a
congestion notification. This could, in principle, be achieved by
directly increasing the CI for every incoming CE marked
segment. However, since the space for communicating the information back to the
sender in ACKs is limited, a "gauge" (CI.g) is increased instead
and CI is only updated when an ACK is sent.
</t>
<t>When sending an ACK, the CI is increased by CI.g, but by
at most 4, as a larger increase could cause an overflow in
the codepoint counter signaling. Thereafter, CI.g is reduced by
the same amount. Then the current CI value (modulo 5) is
encoded in the current ACK. To avoid losing information, it must
be ensured that an ACK is sent at least after 5 incoming, outstanding
congestion marks (i.e. when CI.g exceeds 5). Architecturally
the counters never decrease during a TCP session. However, any
overflow MUST be modulo a multiple of 5 for CI.
</t>
<t>For resilience against lost ACKs, an indicator flag (CI.i) SHOULD
be used to ensure that, whether another congestion indication arrives
or not, a second ACK transmits the previous counter value again.
Thus when a codepoint is transmitted the first time, CI.i will be set
to one. Then with the next ACK the same codepoint is transmitted again
and the CI.i is reset to zero. Only when CI.i is zero, the counter CI
can be increased. In case of heavy congestion (basically all segments
are CE marked) the CI.g might grow continuously. In this case the ACK
rate should be increased by sending an immediate ACK for an incoming data segment.
</t>
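<t>The interplay of CI, CI.g and CI.i described above can be sketched as
follows. The class and method names are hypothetical, and the sketch makes
one assumption that the text leaves open: CI.i is only set when the counter
actually changed, so an unchanged codepoint is not tracked as a "first
transmission". Switching to immediate ACKs under heavy congestion is
left out. With this reading, the sketch yields the codepoints shown in
<xref target="Tab4"/>.</t>
<figure><artwork><![CDATA[
```python
class AccEcnReceiver:
    """Illustrative receiver state: counter CI, gauge CI.g, flag CI.i."""

    def __init__(self) -> None:
        self.ci = 0    # CE marks already reported to the sender
        self.ci_g = 0  # CE marks seen but not yet added to CI
        self.ci_i = 0  # 1 while the current codepoint awaits its repeat

    def on_segment(self, ce_marked: bool) -> None:
        """Account for an incoming data segment."""
        if ce_marked:
            self.ci_g += 1

    def on_ack(self) -> int:
        """Return the CI codepoint (0..4) to place in the next ACK."""
        if self.ci_i == 1:
            # Second transmission of the same codepoint; CI is frozen.
            self.ci_i = 0
        else:
            step = min(self.ci_g, 4)  # never advance by more than 4
            if step > 0:
                self.ci += step
                self.ci_g -= step
                self.ci_i = 1         # assumption: set only on change
        return self.ci % 5
```
]]></artwork></figure>
<t>Feeding one CE mark before the first ACK and two CE marks before each
following ACK produces the codepoint sequence 1, 1, 0, 0.</t>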
<t>The following table provides an example showing a half-connection with a
TCP sender A and a TCP receiver B. The sender maintains a counter CI.r
to reconstruct the number of CE marks seen at the receiver side.
</t>
<texttable anchor="Tab4" align="center" title="Codepoint signal example">
<ttcol align="center"> </ttcol>
<ttcol align="center">Data</ttcol>
<ttcol align="right">TCP A</ttcol>
<ttcol align="right">IP</ttcol>
<ttcol align="right">TCP B</ttcol>
<ttcol align="center">Data</ttcol>
<c> </c> <c/> <c>SEQ ACK CTL</c> <c/> <c>SEQ ACK CTL</c> <c/>
<c>--</c> <c/> <c>-------------</c> <c>----------</c> <c>-------------</c> <c/>
<c>1</c> <c/> <c>0100 SYN</c> <c> ----> </c> <c> </c> <c/>
<c> </c> <c/> <c> CWR,ECE,NS</c> <c> </c> <c> </c> <c/>
<c>2</c> <c/> <c> </c> <c> &lt;---- <!--ECT0--></c> <c>0300 0101 SYN </c> <c/>
<c> </c> <c/> <c> </c> <c> </c> <c> ACK,CWR </c> <c/>
<c>3</c> <c/> <c>0101 0301 ACK</c> <c> ECT0 -CE-></c> <c> </c> <c/>
<!--<c> </c> <c/> <c>ECI=CI.0 </c> <c/> <c/> <c/>-->
<c/> <c/> <c> </c> <c/> <c>CI.c=0 CI.g=1</c> <c/>
<c>4</c> <c>100</c> <c>0101 0301 ACK</c> <c>ECT0 ----></c> <c/> <c/>
<!--<c> </c> <c/> <c>ECI=CI.0 </c> <c/> <c/> <c/>-->
<c/> <c/> <c> </c> <c/> <c>CI.c=1 CI.g=0</c> <c/>
<c>5</c> <c/> <c></c> <c>&lt;---- </c> <c>0301 0201 ACK</c> <c/>
<c> </c> <c/> <c/> <c/> <c>ECI=CI.1</c> <c/>
<c/> <c/> <c>CI.r=1</c> <c/> <c/> <c/>
<c>6</c> <c>100</c> <c>0201 0301 ACK</c> <c>ECT0 -CE-></c> <c/> <c/>
<!--<c> </c> <c/> <c>ECI=CI.0 </c> <c/> <c/> <c/>-->
<c/> <c/> <c> </c> <c/> <c>CI.c=1 CI.g=1</c> <c/>
<c>7</c> <c>100</c> <c>0301 0301 ACK</c> <c>ECT0 -CE-></c> <c/> <c/>
<!--<c> </c> <c/> <c>ECI=CI.0 </c> <c/> <c/> <c/>-->
<c/> <c/> <c> </c> <c/> <c>CI.c=1 CI.g=2</c> <c/>
<c>8</c> <c/> <c></c> <c>XX-- </c> <c>0301 0401 ACK</c> <c/>
<c> </c> <c/> <c/> <c/> <c>ECI=CI.1</c> <c/>
<c/> <c/> <c>CI.r=1</c> <c/> <c/> <c/>
<c>9</c> <c>100</c> <c>0401 0301 ACK</c> <c>ECT0 -CE-></c> <c/> <c/>
<!--<c> </c> <c/> <c>ECI=CI.0 </c> <c/> <c/> <c/>-->
<c/> <c/> <c> </c> <c/> <c>CI.c=1 CI.g=3</c> <c/>
<c>10</c> <c>100</c> <c>0501 0301 ACK</c> <c>ECT0 -CE-></c> <c/> <c/>
<!--<c> </c> <c/> <c>ECI=CI.0 </c> <c/> <c/> <c/>-->
<c/> <c/> <c> </c> <c/> <c>CI.c=5 CI.g=0</c> <c/>
<c>11</c> <c/> <c></c> <c>&lt;---- </c> <c>0301 0601 ACK</c> <c/>
<c> </c> <c/> <c/> <c/> <c>ECI=CI.0</c> <c/>
<c/> <c/> <c>CI.r=5</c> <c/> <c/> <c/>
<c>12</c> <c>100</c> <c>0601 0301 ACK</c> <c>ECT0 -CE-></c> <c/> <c/>
<!--<c> </c> <c/> <c>ECI=CI.0 </c> <c/> <c/> <c/>-->
<c/> <c/> <c> </c> <c/> <c>CI.c=5 CI.g=1</c> <c/>
<c>13</c> <c>100</c> <c>0701 0301 ACK</c> <c>ECT0 -CE-></c> <c/> <c/>
<!--<c> </c> <c/> <c>ECI=CI.0 </c> <c/> <c/> <c/>-->
<c/> <c/> <c> </c> <c/> <c>CI.c=5 CI.g=2</c> <c/>
<c>14</c> <c/> <c></c> <c>&lt;---- </c> <c>0301 0801 ACK</c> <c/>
<c> </c> <c/> <c/> <c/> <c>ECI=CI.0</c> <c/>
<c/> <c/> <c>CI.r=5</c> <c/> <c/> <c/>
<!--
| 1 | | 0100 SYN | FNE | - | R.ECC=0 | |
| | | CWR,ECE,NS | | | | |
| 2 | | R.ECC=0 | <- | FNE | 0300 0101 | |
| | | | | | SYN,ACK,CWR | |
| 3 | | 0101 0301 ACK | RECT | - | R.ECC=0 | |
| 4 | 1000 | 0101 0301 ACK | FNE | - | R.ECC=0 | |
| 5 | | R.ECC=0 | <- | FNE | 0301 1102 ACK | 1460 |
| 6 | | R.ECC=0 | <- | RECT | 1762 1102 ACK | 1460 |
| 7 | | R.ECC=0 | <- | FNE | 3222 1102 ACK | 1460 |
| 8 | | 1102 1762 ACK | RECT | - | R.ECC=0 | |
| 9 | | R.ECC=0 | <- | RECT | 4682 1102 ACK | 1460 |
| 10 | | R.ECC=0 | <- | RECT | 6142 1102 ACK | 1460 |
| 11 | | 1102 3222 ACK | RECT | - | R.ECC=0 | |
| 12 | | R.ECC=0 | <- | RECT | 7602 1102 ACK | 1460 |
| 13 | | R.ECC=1 | <*- | RECT | 9062 1102 ACK | 1460 |
| | | ... | | | | |
-->
</texttable>
</section>
</section>
<section title="Advanced Compatibility Mode" anchor="comp_mode">
<t>TBD (for a more detailed description, see draft-ietf-conex-tcp-modifications)</t>
<t>This section describes a possible mechanism to achieve more accurate ECN feedback
even when the receiver does not support the new more accurate ECN feedback
scheme, at the cost of lower reliability.
</t>
<t>During initial deployment, a large number of receivers will only support
classic ECN feedback as defined in <xref target="RFC3168"/>. Such a receiver sets the
ECE bit whenever it receives a segment with the CE codepoint set, and clears
the ECE bit only when it receives a segment with the CWR bit set. As the CE
codepoint has priority over the CWR bit (Note: the wording in this regard
is ambiguous in <xref target="RFC3168"/>, but the reference implementation of
ECN in ns2 is clear), a receiver compliant with <xref target="RFC3168"/>
will not clear the ECE bit on reception of a segment in which both
CE and CWR are set. This property allows the use of a compatibility
mode that extracts more accurate feedback from legacy <xref target="RFC3168"/>
receivers by setting the CWR bit permanently.
</t>
<t>Assuming a delayed ACK ratio of one (no delayed ACKs), a sender can permanently set the CWR
bit in the TCP header to receive more accurate feedback on the CE codepoints
as seen at the receiver. This feedback signal is, however, very brittle: any
ACK loss may cause congestion information to be lost, and
neither delayed ACKs nor ACK loss can be accounted for in a reliable
way. Therefore, a sender would need to use heuristics to determine the
current delayed ACK ratio M used by the receiver (e.g. most receivers will
use M=2), and also the recent ACK loss ratio<!--(1)-->. Acknowledgement Congestion Control
(AckCC) as defined in <xref target="RFC5690"/> cannot be used, as deployment
of this feature is only experimental.
</t>
<t>Using a phase-locked loop algorithm, the CWR bit can then be set only on
those data segments that will trigger a (delayed) ACK. Thereby, no congestion
information is lost, as long as the ACK carrying the ECE bit is seen by the
sender.
</t>
<t>Whenever the sender sees an ACK with
ECE set, this indicates that at least one, and at most M, data
segments with the CE codepoint set were seen by the receiver. The sender
SHOULD react as if M CE indications were reflected back to the sender by
the receiver, unless additional heuristics (e.g. dead time correction)
can determine a more accurate value of the "true" number of received CE marks.
</t>
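<t>The reaction rule above can be illustrated with a minimal sketch. The class name
CompatCeEstimator and its fields are hypothetical and not part of this specification;
the sketch assumes the delayed ACK ratio M has already been estimated by some heuristic
and does not model ACK loss or dead time correction. It tracks both the lower bound
(one CE mark per ACK with ECE set) and the value the sender reacts to (M marks per such ACK).</t>
<t><figure><artwork><![CDATA[
```python
from dataclasses import dataclass


@dataclass
class CompatCeEstimator:
    """Hypothetical sender-side CE accounting for the compatibility mode."""

    ack_ratio: int = 2   # assumed delayed ACK ratio M (heuristic estimate)
    ce_min: int = 0      # lower bound: at least one CE mark per ECE ACK
    ce_assumed: int = 0  # value reacted to: M CE marks per ECE ACK

    def on_ack(self, ece: bool) -> int:
        """Process one ACK and return the number of CE marks to react to."""
        if not ece:
            return 0
        self.ce_min += 1
        self.ce_assumed += self.ack_ratio
        return self.ack_ratio
```
]]></artwork></figure></t>
<t>For example, with M=2 and two ACKs carrying ECE, the sender would react to four
CE indications, while the true number of marks lies between two and four.</t>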
</section>
</section>
<section title="Acknowledgements">
<t> We want to thank Bob Briscoe and Michael Welzl for their input and discussion.
Special thanks to Bob Briscoe, who first proposed the use of the ECN bits as one field and the handshake negotiation for more accurate ECN.
</t>
</section>
<section anchor="IANA" title="IANA Considerations">
<t>This memo includes no request to IANA.</t>
<!--<t> If this memo was to progress to standards track, it would update RFC3168
and RFC3540, to add new combinations of flags in the TCP header for capability
negotiation (see <xref target="TCPNeg"/>) and a change in TCP ECN semantics
(see <xref target="TCPSig"/>).</t>-->
</section>
<section anchor="Security" title="Security Considerations">
<t>TBD</t>
<t>ACK loss</t>
<t>This scheme sends each codepoint (of the two subsets) at least two
times. In the worst case at least one, and often two or more consecutive
ACKs can be dropped without losing congestion information. Further
refinements, such as interleaving ACKs when sending codepoints belonging
to the two subsets (e.g. CI, E1), can allow the loss of any two
consecutive ACKs, without the sender losing congestion information, at
the cost of also reducing the ACK ratio.
</t>
<t>At low congestion rates, the sending of the current value of the CI
counter by default allows higher numbers of consecutive ACKs to be
lost, without impacting the accuracy of the ECN signal.
</t>
<t>ECN Nonce</t>
<t>In the proposed scheme there are three more codepoints available that
could be used for an integrity check like ECN Nonce. If ECN Nonce were
implemented as proposed in <xref target="e1-codepoints"/>, even more
information would be provided for ECN Nonce than in the original
specification.</t>
<t>A delayed ACK ratio of two can be sustained indefinitely even during
heavy congestion, but not during excessive ECT(1) marking, which is
under the control of the sender. A higher ACK ratio can be sustained when
congestion is low, but a low ACK ratio may be needed for the E1 feedback.
</t>
</section>
</middle>
<!-- *****BACK MATTER ***** -->
<back>
<references title="Normative References">
<!--?rfc include="http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml"?-->
&RFC2119;
&RFC3168;
&RFC3540;
</references>
<references title="Informative References">
<?rfc include="reference.I-D.briscoe-tsvwg-re-ecn-tcp.xml"?>
&RFC5562;
&RFC5681;
&RFC5690;
<reference anchor="Ali10">
<front>
<title>DCTCP: Efficient Packet Transport for the Commoditized Data Center</title>
<author initials="M" surname="Alizadeh">
<organization></organization></author>
<author initials="A" surname="Greenberg">
<organization></organization></author>
<author initials="D" surname="Maltz">
<organization></organization></author>
<author initials="J" surname="Padhye">
<organization></organization></author>
<author initials="P" surname="Patel">
<organization></organization></author>
<author initials="B" surname="Prabhakar">
<organization></organization></author>
<author initials="S" surname="Sengupta">
<organization></organization></author>
<author initials="M" surname="Sridharan">
<organization></organization></author>
<date month="Jan" year="2010"/>
</front>
</reference>
<reference anchor="draft-kuehlewind-tcpm-accurate-ecn-option" >
<front>
<title>Accurate ECN Feedback Option in TCP</title>
<author initials="M" surname="Kuehlewind">
<organization></organization></author>
<author initials="R" surname="Scheffenegger">
<organization></organization></author>
<date month="Jul" year="2012"/>
</front>
<seriesInfo name="Internet-Draft" value="draft-kuehlewind-tcpm-accurate-ecn-option-01"/>
</reference>
</references>
<section title="Estimating CE-marked bytes">
<t>TBD (see draft-ietf-conex-tcp-modifications-02 and the 'late ACK' variant of the 1-bit scheme in draft-kuehlewind-tcpm-accurate-ecn-00).
</t>
</section>
<section anchor="e1-codepoints" title="Use with ECN Nonce">
<t>In ECN Nonce, by comparing the number of incoming ECT(1) notifications with
the actual number of packets that were transmitted with an ECT(1) mark,
as well as with the sum of the sender's two internal counters, the sender
can probabilistically detect a receiver that sends false marks or suppresses
accurate ECN feedback, or a path that does not properly support ECN.</t>
<texttable anchor="Tab_E1" align="center" title="Codepoint assignment for accurate ECN feedback and ECN Nonce">
<ttcol align="center">ECI</ttcol>
<ttcol align="center">NS</ttcol>
<ttcol align="center">CWR</ttcol>
<ttcol align="center">ECE</ttcol>
<ttcol align="center">CI (base5)</ttcol>
<ttcol align="center">E1 (base3)</ttcol>
<c>0</c> <c>0</c> <c>0</c> <c>0</c> <c>0</c> <c>-</c>
<c>1</c> <c>0</c> <c>0</c> <c>1</c> <c>1</c> <c>-</c>
<c>2</c> <c>0</c> <c>1</c> <c>0</c> <c>2</c> <c>-</c>
<c>3</c> <c>0</c> <c>1</c> <c>1</c> <c>3</c> <c>-</c>
<c>4</c> <c>1</c> <c>0</c> <c>0</c> <c>4</c> <c>-</c>
<c>5</c> <c>1</c> <c>0</c> <c>1</c> <c>-</c> <c>0</c>
<c>6</c> <c>1</c> <c>1</c> <c>0</c> <c>-</c> <c>1</c>
<c>7</c> <c>1</c> <c>1</c> <c>1</c> <c>-</c> <c>2</c>
</texttable>
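<t>The mapping in <xref target="Tab_E1"/> can be expressed as a small decoding function.
The following is a sketch only; the function name decode_eci and the tuple it returns
are illustrative assumptions. It treats NS, CWR and ECE as one 3-bit field with NS as
the most significant bit, as laid out in the table.</t>
<t><figure><artwork><![CDATA[
```python
def decode_eci(ns: int, cwr: int, ece: int) -> tuple:
    """Map the NS, CWR and ECE bits to ("CI", value) or ("E1", value)."""
    eci = (ns << 2) | (cwr << 1) | ece  # 3-bit field, NS is the top bit
    if eci <= 4:
        return ("CI", eci)       # codepoints 0..4 carry CI (base 5)
    return ("E1", eci - 5)       # codepoints 5..7 carry E1 (base 3)
```
]]></artwork></figure></t>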
<t>If an ECT(1) mark is received, an ECT(1) counter (E1) is
incremented. The receiver has to convey that
updated information to the sender with the next possible ACK
using the three remaining codepoints as shown in
<xref target="Tab_E1"/>. Thus, on reception
of an ECT(1)-marked packet, the receiver should
signal the current value of the E1 counter (modulo 3) in the next
ACK. If a CE mark was received before sending the next ACK (e.g.
due to delayed ACKs), sending that update MUST take precedence. The receiver should
also repeat every E1 value. This repetition does not need to occur in the
immediately following ACK, as an E1 value is only transmitted when no change in the CI
has occurred. Each E1 value will therefore be sent exactly twice. The repetition of every
signal provides further resilience against lost ACKs.</t>
<t>As only a limited number of E1 codepoints exist and the receiver might not
acknowledge every single data packet immediately (delayed ACKs), a sender SHOULD NOT
mark more than 1/m of the packets with ECT(1), where m is the ACK ratio (e.g. 50%
when every second data packet triggers an ACK). This constraint will avoid a
permanent feedback of E1 only, and must be maintained also on short timescales.
A sender SHOULD send no more than 3 consecutive packets marked with ECT(1).
<!--, and never more than two consecutive packets.--></t>
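<t>One possible way for a sender to respect both limits is a simple credit scheme,
sketched below. This is an illustrative assumption, not a mechanism defined by this
document: the class name Ect1Marker, the credit counter and the choice to cap the
credit at m (which enforces the 1/m limit on short timescales as well) are all hypothetical.</t>
<t><figure><artwork><![CDATA[
```python
class Ect1Marker:
    """Hypothetical limiter for ECT(1) marking at the sender."""

    MAX_RUN = 3  # at most 3 consecutive ECT(1) packets

    def __init__(self, ack_ratio: int):
        self.m = ack_ratio  # assumed ACK ratio m (>= 1)
        self.credit = 0     # one unit earned per packet, capped at m
        self.run = 0        # current run of consecutive ECT(1) packets

    def may_mark(self) -> bool:
        """Call once per outgoing packet; True means ECT(1) is allowed."""
        self.credit = min(self.credit + 1, self.m)
        if self.credit >= self.m and self.run < self.MAX_RUN:
            self.credit -= self.m  # a mark costs m packets worth of credit
            self.run += 1
            return True
        self.run = 0
        return False
```
]]></artwork></figure></t>
<t>With m=2 this sketch permits ECT(1) on every second packet; with m=1 it permits
three marked packets in a row, followed by one unmarked packet.</t>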
<t>The same counter / gauge method as described in <xref target="receiver-impl"/> can
be used to count and return (using
a different mapping) the number of incoming packets marked ECT(1)
(called E1 in the algorithm). As few codepoints are available for
conveying the E1 counter value, an immediate ACK MUST be triggered
whenever the gauge E1.g exceeds a threshold of 3. The sender receives
the receiver's counter values and compares them with the locally
maintained counter. <!--Any increase of these counters is added to the
sender's internal counters, yielding a precise number of CE-marked
and ECT(1) marked packets.-->
<vspace blankLines="100" /></t>
<section anchor="app-codepoints" title="Pseudo Code for the Codepoint Coding">
<t>
<figure><artwork><![CDATA[
IP signals: CE, ECT(1)
TCP Fields: AcE

Counters:
   CI   Congestion Indication - counter [0..(n*5-1)]
   CI.g Congestion Indication - gauge [0.."inf"]
   CI.i Congestion Indication - indicator flag [0,1]
   E1, E1.g, E1.i - analogous counter, gauge and flag for ECT(1)

At session initialization, all these counters are initialized to zero.

When a segment (Data, ACK) is received, perform the following steps:

If (CE)                    # When a CE codepoint is received,
  CI.g++                   # Increase CI.g by 1
If (ECT(1))                # When an ECT(1) codepoint is received,
  E1.g++                   # Increase E1.g by 1
If (CI.g > 5) or           # When ACK rate is not sufficient to keep
   (E1.g > 3)              # gauges close to zero,
  Send ACK immediately     # increase ACK rate

When preparing an ACK to be sent:

If (CI.g > 0) or           # When there is an unsent change in CI
   ( (E1.i != 0) and       # this check is to in effect alternate
     (CI.i != 0) )         # sending CI and E1 codepoints
  If (CI.i == 0) and       # updates to CI allowed
     (CI.g > 0)            # update is meaningful
    CI.i = 1               # set flag to repeat CI value
    CI += min(4,CI.g)      # 4 for 5 codepoints
    CI %= 5                # using modulo the available codepoints
    CI.g -= min(4,CI.g)    # reduce the holding gauge accordingly
  Else
    CI.i--                 # just in case CI.i was set to
                           # more than 1 for resiliency
  Send ACK with AcE set to CI
Else
  If (E1.g > 0) or
     (E1.i != 0)
    If (E1.i == 0) and
       (E1.g > 0)
      E1.i = 1
      E1 += min(2, E1.g)
      E1 %= 3
      E1.g -= min(2, E1.g)
    Else
      E1.i--
    Send ACK with AcE set to E1
  Else
    Send ACK with AcE set to CI   # default action

Sender:

Counters:
   CI.r    - current value of CEs seen by receiver
   E1.s    - sum of all sent ECT(1) marked packets (up to snd.nxt)
   E1.s(t) - value of E1.s at time (in sequence space) t
   E1.r    - value signaled by receiver about received ECT(1) segments
   E1.r(t) - value of E1.r at time (in sequence space) t
   CI.r(t) - ditto

# Note: With a codepoint implementation,
# a reverse table ECI[n] -> CI.r / E1.r is needed.
# The wire protocol transports the absolute value
# of the receiver-side counter.
# Thus the (positive only) delta needs to be calculated,
# and added to the sender-side counter.

If ACK AcE in the set of CI values
  D = (AcE.CI + 5 - (CI.r mod 5)) mod 5
  CI.r += D
If ACK AcE in the set of E1 values
  D = (AcE.E1 + 3 - (E1.r mod 3)) mod 3
  E1.r += D

# Before CI.r or E1.r reach a (binary) rollover,
# they need to roll over some multiple of 5
# and 3 respectively.
CI.r = CI.r modulo 255     # 5 * 51
E1.r = E1.r modulo 255     # 3 * 85
# (an implementation may choose to use another constant,
# e.g. 3^4*5^4 (50625) for 16-bit integers,
# or 3^8*5^8 (2562890625) for 32-bit integers)

# The following test can (probabilistically) reveal
# if the receiver or path is not properly
# handling ECN (CE, E1) marks
If not E1.r(t) <= E1.s(t) <= E1.r(t) + CI.r(t)
  # -> receiver or path do not properly reflect ECN
  # (or too many ACKs got lost, which can be checked
  # also by the sender).
]]></artwork></figure></t>
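<t>The sender-side delta calculation and plausibility test in the pseudo code can be
sketched in runnable form as follows. This is an illustration under stated assumptions:
the names FeedbackDecoder and feedback_plausible are hypothetical, the input is assumed
to be already mapped from the wire codepoint to a CI or E1 value, and the counter
rollover step is omitted because Python integers do not overflow.</t>
<t><figure><artwork><![CDATA[
```python
CI_MOD = 5  # number of CI codepoints
E1_MOD = 3  # number of E1 codepoints


class FeedbackDecoder:
    """Hypothetical sender-side reconstruction of CI.r and E1.r."""

    def __init__(self):
        self.ci_r = 0  # CE marks the receiver has reported so far
        self.e1_r = 0  # ECT(1) marks the receiver has reported so far

    def on_ack(self, kind: str, value: int) -> int:
        """Apply one feedback value; return the (non-negative) delta."""
        if kind == "CI":
            delta = (value - self.ci_r) % CI_MOD
            self.ci_r += delta
        else:
            delta = (value - self.e1_r) % E1_MOD
            self.e1_r += delta
        return delta


def feedback_plausible(e1_r: int, e1_s: int, ci_r: int) -> bool:
    """Check E1.r <= E1.s <= E1.r + CI.r for one point in sequence space."""
    return e1_r <= e1_s <= e1_r + ci_r
```
]]></artwork></figure></t>
<t>A repeated codepoint yields a delta of zero, so the repetition of each value for
resilience against ACK loss does not inflate the sender-side counters.</t>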
</section>
</section>
</back>
</rfc>