<?xml version="1.0" encoding="US-ASCII"?>
<!-- This template is for creating an Internet Draft using xml2rfc,
which is available here: http://xml.resource.org. -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!-- One method to get references from the online citation libraries.
There has to be one entity for each item to be referenced.
An alternate method (rfc include) is described in the references. -->
<!ENTITY RFC2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY RFC3168 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3168.xml">
<!ENTITY RFC3540 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3540.xml">
<!ENTITY RFC5562 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5562.xml">
<!ENTITY RFC5681 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5681.xml">
<!ENTITY RFC5690 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5690.xml">
]>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<!-- used by XSLT processors -->
<!-- For a complete list and description of processing instructions (PIs),
please see http://xml.resource.org/authoring/README.html. -->
<!-- Below are generally applicable Processing Instructions (PIs) that most I-Ds might want to use.
(Here they are set differently than their defaults in xml2rfc v1.32) -->
<?rfc strict="yes" ?>
<!-- give errors regarding ID-nits and DTD validation -->
<!-- control the table of contents (ToC) -->
<?rfc toc="yes"?>
<!-- generate a ToC -->
<?rfc tocdepth="4"?>
<!-- the number of levels of subsections in ToC. default: 3 -->
<!-- control references -->
<?rfc symrefs="yes"?>
<!-- use symbolic references tags, i.e, [RFC2119] instead of [1] -->
<?rfc sortrefs="yes" ?>
<!-- sort the reference entries alphabetically -->
<!-- control vertical white space
(using these PIs as follows is recommended by the RFC Editor) -->
<?rfc compact="yes" ?>
<!-- do not start each main section on a new page -->
<?rfc subcompact="no" ?>
<!-- keep one blank line between list items -->
<!-- end of list of popular I-D processing instructions -->
<rfc category="exp" docName="draft-kuehlewind-conex-accurate-ecn-00" ipr="trust200902">
<!-- updates="3186" -->
<!-- category values: std, bcp, info, exp, and historic
ipr values: trust200902, noModificationTrust200902, noDerivativesTrust200902,
or pre5378Trust200902
you can add the attributes updates="NNNN" and obsoletes="NNNN"
they will automatically be output with "(if approved)" -->
<!-- ***** FRONT MATTER ***** -->
<front>
<!-- The abbreviated title is used in the page header - it is only necessary if the
full title is longer than 39 characters -->
<title>Accurate ECN Feedback in TCP</title>
<!-- add 'role="editor"' below for the editors if appropriate -->
<!-- Another author who claims to be an editor -->
<author fullname="Mirja Kühlewind" initials="M." role="editor"
surname="Kühlewind">
<organization>University of Stuttgart</organization>
<address>
<postal>
<street>Pfaffenwaldring 47</street>
<code>70569</code>
<city>Stuttgart</city>
<country>Germany</country>
</postal>
<email>mirja.kuehlewind@ikr.uni-stuttgart.de</email>
</address>
</author>
<author fullname="Richard Scheffenegger" initials="R."
surname="Scheffenegger">
<organization>NetApp, Inc.</organization>
<address>
<postal>
<street>Am Euro Platz 2</street>
<code>1120</code>
<city>Vienna</city>
<region></region>
<country>Austria</country>
</postal>
<phone>+43 1 3676811 3146</phone>
<email>rs@netapp.com</email>
</address>
</author>
<date year="2011" />
<area>Transport</area>
<workgroup>Congestion Exposure (ConEx)</workgroup>
<keyword>Internet-Draft</keyword>
<keyword>I-D</keyword>
<abstract>
<t>Explicit Congestion Notification (ECN) is an IP/TCP mechanism where network nodes
can mark IP packets instead of dropping them to indicate congestion to the end-points.
An ECN-capable receiver feeds this information back to the sender. ECN is specified for
TCP in such a way that only one feedback signal can be transmitted per Round-Trip Time (RTT).
Recently proposed mechanisms such as Congestion Exposure (ConEx) or DCTCP need more accurate feedback information
when more than one marking is received in one RTT.
</t>
</abstract>
</front>
<middle>
<section title="Introduction">
<t>Explicit Congestion Notification (ECN) <xref target="RFC3168"/> is
an IP/TCP mechanism where
network nodes can mark IP packets instead of dropping them to indicate congestion to
the end-points. An ECN-capable receiver feeds this information back to the sender.
ECN is specified for TCP in such a way that only one feedback signal can be
transmitted per Round-Trip Time (RTT).
Recently proposed mechanisms such as Congestion Exposure (ConEx) or DCTCP
<xref target="Ali10"/> need more accurate feedback information
when more than one marking is received in one RTT.
</t>
<t>This document discusses (and, in a later version, will specify) a different scheme for the ECN feedback in the TCP header
that provides more than one feedback signal per RTT. This modification does not obsolete
<xref target="RFC3168"/>. It provides an extension that requires
additional negotiation in the TCP handshake by using the TCP nonce sum (NS) bit
<!--, as specified in <xref target="RFC3540"/>,--> which is
currently not used when SYN is set.
</t>
<t> The current version of this document proposes different coding schemes for
discussion. All proposed codings aim to cope with the given bit space. All schemes require
the use of the NS bit at least in the TCP handshake. Depending on the coding scheme, the
accurate ECN feedback extension will or will not include the ECN-Nonce integrity mechanism.
A later version of this document will choose between the coding options, and remove the rationale
for the choice and the specs of those schemes not chosen.
If a scheme is chosen that does not include ECN-Nonce, a mechanism that requires
more accurate ECN feedback needs to provide its own method to ensure the integrity of the
congestion feedback information, or has to cope with the uncertainty of this information.
</t>
<t>
The following scenarios briefly illustrate where accurate feedback is needed or provides additional value:
<list style="letters">
<t>A standard TCP sender using the <xref target="RFC5681"/> congestion control algorithm that supports ConEx:
<vspace blankLines="0" />
In this case the congestion control algorithm still ignores multiple marks per RTT,
while the ConEx mechanism uses the extra information per RTT to re-echo more precise congestion
information. </t>
<t>A sender using DCTCP without ConEx:<vspace blankLines="0" />
The congestion control algorithm uses the extra info per RTT to perform its decrease depending on the
number of congestion marks.</t>
<t>A sender that uses DCTCP congestion control and supports ConEx:<vspace blankLines="0" />
Both the congestion control algorithm and ConEx use the accurate ECN feedback mechanism.</t>
<t>A standard TCP sender using the RFC5681 congestion control algorithm without ConEx:<vspace blankLines="0" />
No accurate feedback is necessary here. The congestion control algorithm still reacts to only one signal
per RTT. However, it is preferable to have one generic feedback mechanism, whether or not it is used.</t>
</list>
</t>
<section title="Overview ECN and ECN Nonce in TCP">
<t>ECN requires two bits in the IP header. The ECN capability of a packet is indicated
when either one of the two bits is set. An ECN sender can set one or the other bit to indicate
an ECN-capable transport (ECT),
which results in two codepoints: ECT(0) and ECT(1).
A network node can set both bits simultaneously
when it experiences congestion. When both bits are
set, the packet is regarded as "Congestion Experienced" (CE).
</t>
<t>In the TCP header two bits in byte 14 are defined for the use of ECN. The TCP mechanism
for signaling the reception of a congestion mark
uses the ECN-Echo (ECE) flag in the TCP header. To enable the TCP receiver to
determine when to stop setting the ECN-Echo flag, the CWR flag is set by the sender
upon reception of the feedback signal.
</t>
<t>
ECN-Nonce <xref target="RFC3540"/> is
an optional addition to ECN that is used to protect the TCP sender against
accidental or malicious concealment of marked or dropped packets. This addition defines the
last bit of byte 13 in the TCP header as the Nonce Sum (NS) bit. With ECN-Nonce,
a nonce sum is maintained that counts the occurrence of ECT(1) packets.
</t>
<figure anchor="TCPHdr" align="center" title="The (post-ECN Nonce) definition of the TCP header flags">
<artwork align="center"><![CDATA[
  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|               |           | N | C | E | U | A | P | R | S | F |
| Header Length | Reserved  | S | W | C | R | C | S | S | Y | I |
|               |           |   | R | E | G | K | H | T | N | N |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
]]></artwork>
</figure>
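<t>As a minimal sketch of the layout in <xref target="TCPHdr"/>, the following Python
fragment extracts the NS, CWR and ECE bits from the 16-bit word formed by bytes 13 and 14
of the TCP header. The function name and the mask constants are illustrative assumptions
introduced here; only the bit positions are taken from the figure.</t>
<figure><artwork><![CDATA[
```python
# Hypothetical helper, not part of any specification. Bit 0 in the
# figure is the most significant bit of the 16-bit word.
NS_MASK = 0x0100   # bit 7
CWR_MASK = 0x0080  # bit 8
ECE_MASK = 0x0040  # bit 9


def ecn_flags(word):
    """Return (NS, CWR, ECE) as 0/1 integers from the flags word."""
    return (
        1 if word & NS_MASK else 0,
        1 if word & CWR_MASK else 0,
        1 if word & ECE_MASK else 0,
    )
```
]]></artwork></figure>
<t>For example, a SYN with a header length of 5 words and NS, CWR and ECE all set
corresponds to the word 0x51C2 under these assumptions.</t>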
</section>
<section title="Design choices">
<t>
The idea of this document is to use the ECE, CWR and NS bits for additional capability negotiation during the SYN/SYN-ACK exchange, and then for the more accurate feedback itself on subsequent packets in the flow (with SYN=0).
</t>
<t>Alternatively, a new TCP option could be introduced to help maintain the accuracy
and integrity of the ECN feedback between receiver and sender. Such an option could provide more information. For example, ECN for RTP/UDP explicitly provides the number of ECT(0), ECT(1), CE, non-ECT marked and lost packets. However, deploying new TCP options has its own challenges.
</t>
<!--<t>Combining the idea of <xref target="eci_mode"/> and <xref target="cp_mode"/>,
further extending it to a one-octet option, would allow the signaling of two
values, each with 4 bit. The gains in worst case ACK loss, delayed ACK ratios
and maintaining ECN Nonce would scale accordingly.
</t>
<t>Alternatively, if timestamp capability negotiation is supported, a few
bits could be extracted from the timestamp value, to provide extended
signaling. However, processing TCP options (or overloaded TCP options) is
more complex than processing of header flags.
</t>-->
<t>As seen in <xref target="TCPHdr"/>, there are currently three unused flag bits
in the TCP header. Any of the schemes described below could be extended by one or
more bits to add higher resiliency against ACK loss. The relative gains
would be proportional to each of the described schemes, while the respective
drawbacks would remain identical. Thus the approach in this document is to cope with the
given number of bits, as they seem to be sufficient already and the accurate ECN feedback scheme
will only be used instead of classic ECN and never in parallel.
</t>
</section>
<section title="Requirements Language">
<t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in <xref
target="RFC2119">RFC 2119</xref>.</t>
<t>We use the following terminology from <xref target="RFC3168"/> and <xref target="RFC3540"/>:</t>
<t>The ECN field in the IP header:
<list hangIndent="10" style="empty">
<t>CE: the Congestion Experienced codepoint; and</t>
<t>ECT(0)/ECT(1): either one of the two ECN-Capable Transport codepoints.</t>
</list></t>
<t>The ECN flags in the TCP header:
<list hangIndent="10" style="empty">
<t>CWR: the Congestion Window Reduced flag;</t>
<t>ECE: the ECN-Echo flag; and</t>
<t>NS: ECN Nonce Sum.</t>
</list></t>
<t> In this document, we will call the ECN feedback scheme as specified
in <xref target="RFC3168"/> the 'classic ECN'
and our new proposal the 'accurate ECN feedback' scheme.
A 'congestion mark' is defined as an IP packet where the CE codepoint is set.
</t>
</section>
</section>
<section title="Negotiation in TCP handshake" anchor="TCPNeg">
<t> During the TCP handshake at the start of a connection, the originator
of the connection (host A) MUST
indicate a request to get more accurate ECN feedback by setting the TCP flags
NS=1, CWR=1 and ECE=1 in the initial SYN.
</t>
<t>
A responding host (host B) MUST return a SYN ACK with flags
CWR=1 and ECE=0. The responding host MUST NOT set this combination
of flags unless the preceding SYN has already requested
support for accurate ECN feedback as above. Normally a server (B) will
reply to a client with NS=0, but if the initial SYN from client A is
marked CE, server B can set the NS flag to 1 to indicate the congestion
immediately instead of delaying the signal to the first acknowledgment, when
the actual data transmission has already started.
<!--a server B MUST
increment its local value of
ECC. But B cannot reflect the value of ECC in the SYN ACK, because
it is still using the 3 bits to negotiate connection capabilities. -->
<!-- [RS] ECC not yet defined. Suggest to remove this discussion for later
(sec. 312 or 313). Also, need to add a paragraph stating that we encourage
the use of ECT during non-data segments (SYN, pure ACK)... -->
So, server B MAY set the alternative TCP header flags in its SYN
ACK: NS=1, CWR=1 and ECE=0.
</t>
<t> The addition of ECN to TCP SYN/ACK packets is discussed
and specified as experimental in <xref target="RFC5562"/>. The addition
of ECN to the SYN packet is optional. The security implications
of using this option are not discussed further here.
</t>
<t>
These handshakes are summarized in Table 1 below, with X indicating that
NS can be either 0 or 1 depending on whether congestion has been
experienced. The handshakes used for the other flavors of ECN are
also shown for comparison. To compress the width of the table, the
headings of the first four columns have been severely abbreviated, as
follows:
</t>
<t>
Ac: *Ac*curate ECN Feedback
</t>
<t>
N: ECN-*N*once (RFC3540)
</t>
<t>
E: *E*CN (RFC3168)
</t>
<t>
I: Not-ECN (*I*mplicit congestion notification).
</t>
<texttable anchor="Tab1" align="center"
title="ECN capability negotiation between Sender (A) and Receiver (B)">
<ttcol align="left">Ac</ttcol>
<ttcol align="center">N</ttcol>
<ttcol align="center">E</ttcol>
<ttcol align="center">I</ttcol>
<ttcol align="center">[SYN] A->B</ttcol>
<ttcol align="center">[SYN,ACK] B->A</ttcol>
<ttcol align="left">Mode</ttcol>
<c/> <c/> <c/> <c/> <c>NS CWR ECE</c> <c>NS CWR ECE</c> <c/>
<c>AB</c> <c/> <c/> <c/> <c>1 1 1</c> <c>X 1 0</c> <c>accurate ECN</c>
<c>A</c> <c>B</c> <c/> <c/> <c>1 1 1</c> <c>1 0 1</c> <c>ECN Nonce</c>
<c>A</c> <c/> <c>B</c> <c/> <c>1 1 1</c> <c>0 0 1</c> <c>classic ECN</c>
<c>A</c> <c/> <c/> <c>B</c> <c>1 1 1</c> <c>0 0 0</c> <c>Not ECN</c>
<c>A</c> <c/> <c/> <c>B</c> <c>1 1 1</c> <c>1 1 1</c> <c>Not ECN (broken)</c>
</texttable>
<t>
Recall that, if the SYN ACK reflects the same flag settings as the
preceding SYN (because there is a broken RFC3168 compliant
implementation that behaves this way), RFC3168 specifies that the
whole connection MUST revert to Not-ECT.
</t>
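<t>The negotiation outcomes in <xref target="Tab1"/> can be sketched as a small lookup,
here in Python. This is an illustration only: the function name is hypothetical, it assumes
the preceding SYN carried NS=1, CWR=1 and ECE=1, and treating any flag combination not listed
in the table as "Not ECN" is a conservative assumption made for this sketch.</t>
<figure><artwork><![CDATA[
```python
def negotiated_mode(syn_ack_flags):
    """Map the (NS, CWR, ECE) flags of a SYN-ACK to a mode.

    Assumes the SYN requested accurate ECN (NS=1, CWR=1, ECE=1).
    """
    ns, cwr, ece = syn_ack_flags
    if cwr == 1 and ece == 0:
        # NS may be 0 or 1 here (the X in the table).
        return "accurate ECN"
    if (ns, cwr, ece) == (1, 0, 1):
        return "ECN Nonce"
    if (ns, cwr, ece) == (0, 0, 1):
        return "classic ECN"
    # Covers (0, 0, 0) and the reflected SYN flags (1, 1, 1);
    # unlisted combinations also land here by assumption.
    return "Not ECN"
```
]]></artwork></figure>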
</section>
<section title="Accurate Feedback">
<t>In this section we refer to the sender as the one sending data and to the receiver as the one that
acknowledges this data. Of course, such a scenario describes only one half-connection
of a TCP connection. The proposed scheme, if negotiated, will be used for both
half-connections, as both sender and receiver need to be capable of echoing and understanding the
accurate ECN feedback scheme.
</t>
<section title="Coding" anchor="TCPSig">
<t>
This section proposes three different coding schemes for discussion. First, requirements are
listed against which the proposed schemes can be evaluated. A later version
of this document will choose between the coding options, and remove the rationale
for the choice and the specs of those schemes not chosen.
The next section essentially provides a fourth alternative: a compatibility mode for when a
sender needs accurate feedback but has to operate with a legacy <xref target="RFC3168"/> receiver.
</t>
<section title="Requirements">
<t>The requirements of the accurate ECN feedback protocol, for use by e.g. ConEx or DCTCP,
are to have fairly accurate (not necessarily perfect), timely
and protected signaling. This leads to the following requirements:</t>
<t><list hangIndent="8" style="hanging">
<t hangText="Resilience"><vspace/>The ECN feedback signal is implicitly carried within the
TCP acknowledgment. TCP ACKs can get lost. Moreover, delayed ACKs are usually used
with TCP. That means in most cases only every second data packet gets acknowledged.
In a high congestion situation where most of the packets are marked with CE, an
accurate feedback mechanism must still be able to signal sufficient congestion
information. Thus the accurate ECN feedback extension has to take delayed ACKs and
ACK loss into account.</t>
<t hangText="Timely"><vspace/>The CE marking is induced by a network node on the
transmission path and echoed by the receiver in the TCP acknowledgment. Thus when
this information arrives at the sender, it is naturally already about one RTT old.
With a sufficient ACK rate, a further delay of a small number of ACKs can be
tolerated, but with large delays this information will be outdated due to the high
dynamics in the network. TCP congestion control, which introduces part of these
dynamics, operates on a time scale of one RTT. Thus the congestion feedback
information should be delivered in a timely manner (within one RTT).</t>
<t hangText="Integrity"><vspace/>With ECN Nonce, a misbehaving receiver can be detected
with a certain probability. As this accurate ECN feedback might reuse the NS bit,
it is encouraged to ensure integrity at least as good as that of ECN Nonce. If this is
not possible, alternative approaches should be provided by which a mechanism using the accurate ECN
feedback extension can re-ensure integrity or give strong incentives for the
receiver and network node to cooperate honestly. <!--If and what kind of
enforcements a sender should do, when detecting wrong feedback information, is
out-of-scope.--></t>
<t hangText="Accuracy"><vspace/><!--In TCP usually delayed ACKs are used. Thats means in
most cases only for every second data packets an acknowledgment is sent. Moreover,
an ACK can get lost.-->Classic ECN feeds back one congestion notification per RTT, as
this is supposed to be used for TCP congestion control, which reduces the sending
rate at most once per RTT. The accurate ECN feedback scheme has to ensure that,
if a congestion event occurs, at least one congestion notification is echoed and
received per RTT, as classic ECN would do. Of course, the goal of this extension is to
reconstruct the number of CE markings more accurately. However, a sender should
not assume that it gets the exact number of congestion markings in a high congestion
situation.</t>
<t hangText="Complexity"><vspace/>Of course, the more accurate ECN feedback can also be
used even if only one ECN feedback signal per RTT is needed.
To enable this proposal for a more
accurate ECN feedback as the standard ECN feedback mechanism, the implementation should
be as simple as possible and a minimum of additional state information should be needed.</t>
</list></t>
</section>
<section title="One bit feedback flag" anchor="sm_mode"> <!-- [RS] "ACK state machine mode?" -->
<t>This option uses a one-bit flag, namely the ECE bit, to signal more accurate ECN
feedback. Unlike classic ECN feedback, an accurate ECN feedback receiver MUST set
the ECE bit in N subsequent ACK packets (only). An accurate ECN feedback receiver MUST
NOT wait for a CWR bit from the sender to reset the ECE bit.
N is not defined yet but is intended to be 2.
</t>
<t>Moreover, when a congestion situation occurs or stops, the receiver MUST immediately
acknowledge the data packet and MUST NOT delay the acknowledgment until a further data
packet has arrived. A congestion situation occurs when the previous data packet was CE=0
but the current one is CE=1. A congestion situation stops when the previous data
packet was CE=1 and the current one is CE=0.
</t>
<t>The following figure shows a simple state machine to describe
the receiver behavior for N=1.
</t>
<figure align="center" anchor="DCTCP_ACK" title="Two state ACK generation state machine">
<artwork align="center"><![CDATA[
                       Send immediate
                       ACK with ECE=0
             .---.     .------------.     .---.
Send 1 ACK  /     v    v            |    |     \
 for every |     .------.           .------.    | Send 1 ACK
 m packets |     | CE=0 |           | CE=1 |    | for every
with ECE=0 |     '------'           '------'    | m packets
            \     |    |            ^    ^     / with ECE=1
             '---'     '------------'     '---'
                       Send immediate
                       ACK with ECE=1
]]></artwork>
</figure>
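<t>The state machine in <xref target="DCTCP_ACK"/> can be sketched in a few lines of
Python. This is an illustration under stated assumptions, not a normative definition: the
class and method names are hypothetical, N=1 is assumed, the immediate ACK is assumed to
carry the ECE value of the new state (as labeled in the figure), and any data packets still
waiting for a delayed ACK are simply covered by that immediate ACK.</t>
<figure><artwork><![CDATA[
```python
class OneBitReceiver:
    """Two-state ACK generator (N=1); m is the delayed-ACK ratio."""

    def __init__(self, m=2):
        self.m = m
        self.state_ce = 0  # CE value of the previous data packet
        self.pending = 0   # data packets received since last ACK

    def on_data(self, ce):
        """Handle one data packet.

        Returns the ECE value of the ACK sent in response, or
        None if the ACK is delayed.
        """
        if ce != self.state_ce:
            # Congestion situation occurs or stops: ACK at once.
            self.state_ce = ce
            self.pending = 0
            return ce
        self.pending += 1
        if self.pending >= self.m:
            self.pending = 0
            return self.state_ce
        return None
```
]]></artwork></figure>
<t>In this sketch, a run of packets with alternating CE values produces one ACK per data
packet, because every arrival toggles the state.</t>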
<section title="Discussion">
<t>ACK loss</t>
<t>The simplest way to get more accurate ECN feedback, which allows more than one
signal per RTT, is to set the ECE flag only once when a congestion mark occurs
instead of setting the ECE flag in every packet until a CWR flag is received. This
solution still only allows one signal per acknowledgment, which might not be sufficient
when more than one packet is acknowledged at once (delayed ACKs). Even more
importantly, this information can be lost with the loss of only one ACK packet carrying
it. One solution would be to carry the same information in a defined
number of subsequent ACK packets. This would again reduce the number of feedback
signals that can be transmitted in one RTT but improve the integrity.
More sophisticated solutions based on ACK loss detection might be possible as well.
</t>
<t>
<!--This scheme was first proposed in <xref target="Ali10"/> for the use with DCTCP.-->
Note that the semantics of classic ECN are changed, and the CWR flag is no longer
interpreted by the receiver to reset the ECE flag.
A simple extension of this scheme could make use of the CWR flag. E.g. the receiver could
always repeat the value of the ECE flag of the predecessor ACK in the CWR flag.
However, only a single
lost ACK can be addressed that way. The loss of two consecutive ACKs may still
result in a loss of ECN information at the sender.
<!--This could allows an extension
of this scheme, to accomodate some ACK loss. However, in it's basic form, this
signaling scheme is still very vulnerable to ACK loss.-->
</t>
<t>In low congestion situations (less than one CE mark per RTT on average),
the loss of m subsequent ACKs would result in complete
loss of the congestion information. The opposite would
be true during high congestion, where the sender can incorrectly assume that all segments
were received with the CE codepoint. </t>
<t>With DCTCP <xref target="Ali10"/> it was proposed to acknowledge a data packet directly
without delay when a congestion situation occurs, as already described above.
This scheme allows a more accurate feedback
signal in a high congestion/marking situation.
However, using Delayed ACKs is important for a variety
of reasons, including reducing the load on the data sender.
<!--To use delayed ACKs (one cumulative ACK for every m
consecutively received packets), the DCTCP receiver uses
the trivial two state state-machine shown in Figure <xref target="DCTCP_ACK"/> to
determine whether to immediately send an ACK, and wether to
set the ECN-Echo bit. The states correspond
to whether the last received packet was marked with the CE
codepoint or not. Since the sender knows how many packets
each ACK covers, it can exactly reconstruct the runs of
marks seen by the receiver.-->
</t>
<!--+ reaction/ack loss recognition (sender action needed? with/without NS bit?):
delay congestion signal + redundant feedback signal in two subsequent ack or
ack loss detection (for conex)?-->
<t>As this heuristic triggers immediate ACKs whenever the received CE
bit toggles, arbitrarily large ACK ratios are supported. However, the effective
ACK ratio depends on the congestion state of the network. Thus it may collapse
to 1 (one ACK for each data segment) when every other segment is received with CE
set.
<!--An additional shortcoming
of this scheme is the possible deactivation of delayed ACKs, when every
other segment is received with CE set. The state machine will then trigger
one ACK for each received segment.-->
</t>
<!--<t>In the context of
DCTCP, with very low RTTs, and special active queue managment (AQM) rules,
any glitch will get corrected fast without much impact. However, for general
deployment, the basic form of this signaling scheme does not appear viable.
</t>-->
<t>ECN Nonce</t>
<t>As the NS bit is not otherwise used by this scheme, ECN Nonce <xref target="RFC3540"/> can
be used in a complementary way. Network paths not supporting ECN, as well as misbehaving or
malicious receivers withholding ECN information, can therefore be detected.
</t>
</section>
</section>
<section title="Three bit field with counter feedback" anchor="eci_mode"> <!-- [RS] shorten to "Echo congestion incremenet" or "Echo congestion value"? -->
<t>
The receiver maintains an unsigned integer counter which we call ECC
(echo congestion counter). This counter maintains a count of how
many times a CE marked packet has arrived during the half-connection.
Once a TCP connection is established, the three TCP header flags
(ECE, CWR and NS) <!--used for ECN-related functions in other versions of
ECN--> are used as a 3-bit field for the receiver to continuously signal to the
sender the current value of ECC, modulo 8, whenever it sends a TCP
ACK. We will call these three bits the echo congestion increment (ECI) field.
</t>
<t>This overloaded use of these three header flags as one 3-bit ECI field is
shown in <xref target="ECI_ACK"/>. The actual definition of the TCP header,
including the addition of support for the ECN Nonce, is shown for
comparison in <xref target="TCPHdr"/>. This specification does not redefine the
names of these three TCP header flags; it merely overloads them with
another definition once a flow with accurate ECN feedback is established.
</t>
<figure
title="Definition of the ECI field within bytes 13 and 14 of the TCP Header (when SYN=0)."
align="center" anchor="ECI_ACK">
<artwork align="center"><![CDATA[
  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|               |           |           | U | A | P | R | S | F |
| Header Length | Reserved  |    ECI    | R | C | S | S | Y | I |
|               |           |           | G | K | H | T | N | N |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
]]></artwork></figure>
<t>Also note that, whenever the SYN flag of a TCP segment is set
(including when the ACK flag is also set), the NS, CWR and ECE flags
(i.e. the ECI field of the SYNACK) MUST NOT be interpreted as the
3-bit ECI value, which is only set as a copy of the local ECC value
in non-SYN packets.
</t>
<t>This scheme was first proposed in <xref target="I-D.briscoe-tsvwg-re-ecn-tcp"/>
for use with re-ECN. <!--However, without the external framework to
detect and address misbehaving receivers, a sender alone can not detect
if a receiver is concealing ECN information.-->
</t>
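<t>The counter feedback described in this section can be sketched as follows, in Python.
The class and method names are hypothetical. The sender-side difference
D = (ECI + 8 - ECC mod 8) mod 8 is taken from the re-ECN description on which this scheme
is based; treating it as the sender rule for this draft is an assumption of the sketch.</t>
<figure><artwork><![CDATA[
```python
class EciReceiver:
    """Counts CE marks and exposes the counter modulo 8."""

    def __init__(self):
        self.ecc = 0  # echo congestion counter (ECC)

    def on_data(self, ce):
        if ce:
            self.ecc += 1

    def eci(self):
        # Value placed in the 3-bit ECI field of every ACK.
        return self.ecc % 8


class EciSender:
    """Tracks the sender's view of the receiver's ECC."""

    def __init__(self):
        self.ecc = 0

    def on_ack(self, eci):
        # Apparent number of new CE marks since the last ACK seen.
        d = (eci + 8 - self.ecc % 8) % 8
        self.ecc += d
        return d
```
]]></artwork></figure>
<t>With this sketch, an ACK that arrives after exactly 8 unreported CE marks yields D = 0,
which illustrates why the counter wrap has to be considered when many ACKs are lost.</t>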
<section title="Discussion">
<t>ACK loss</t>
<t><!--The ECI method was chosen for echoing congestion marking because a
re-ECN sender needs to know about every CE mark arriving at the
receiver, not just whether at least one arrives within a round trip
time (which is all the ECE/CWR mechanism supported). And, -->
As pure ACKs are not protected by TCP reliable delivery, we repeat the same
ECI value in every ACK until it changes. Even if many ACKs in a row
are lost, as soon as one gets through, the ECI field it repeats from
previous ACKs that didn't get through will update the sender on how
many CE marks arrived since the last ACK got through.</t>
<t>The sender will only lose a record of the arrival of a CE mark if all
the ACKs are lost (and all of them were pure ACKs) for a stream of
data long enough to contain 8 or more CE marks. So, if the marking
fraction was p, at least 8/p pure ACKs would have to be lost. For
example, if p was 5%, a sequence of 160 pure ACKs (without delayed ACKs)
would all have to be lost. When ACKs are delayed, this number is reduced
by a factor of m. This would still require a sequence of 80 lost pure ACKs with the usual
delayed ACK ratio of m=2. </t>
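<t>The arithmetic behind these numbers can be checked with a short Python sketch. The
function name is hypothetical; it assumes, as the text does, that one ACK covers m data
packets and that the 3-bit field wraps after 8 CE marks. Exact fractions are used to avoid
floating-point rounding.</t>
<figure><artwork><![CDATA[
```python
from fractions import Fraction
import math


def acks_to_lose(p, m=1, wrap=8):
    """Consecutive pure ACKs that must be lost before `wrap` CE
    marks can go unnoticed, for marking fraction p and one ACK
    per m data packets."""
    p = Fraction(p)
    return math.ceil(Fraction(wrap) / (p * m))
```
]]></artwork></figure>
<t>Calling acks_to_lose("0.05") gives 160, and acks_to_lose("0.05", m=2) gives 80, matching
the example above.</t>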
<t>Additionally, to protect against such extremely unlikely events, if a
re-ECN sender detects that a sequence of pure ACKs has been lost, it can
assume the ECI field wrapped as many times as possible within the
sequence. E.g., if a re-ECN sender receives an ACK with an
acknowledgement number that acknowledges L (>m) segments since the
previous ACK but with a sequence number unchanged from the previously
received ACK, it can conservatively assume that the ECI field
incremented by D' = L - ((L-D) mod 8), where D is the apparent
increase in the ECI field. For example if the ACK arriving after 9
pure ACK losses apparently increased ECI by 2, the assumed increment
of ECI would still be 2. But if ECI apparently increased by 2 after
11 pure ACK losses, ECI should be assumed to have increased by 10.</t>
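<t>The conservative estimate D' = L - ((L-D) mod 8) can be sketched in Python as follows.
The function name is hypothetical, and returning D unchanged when L does not exceed D is
a guard added for this sketch rather than something stated in the text.</t>
<figure><artwork><![CDATA[
```python
def assumed_increment(l, d, wrap=8):
    """Largest increment that is consistent with the apparent
    increase d and does not exceed l, assuming the ECI field
    wrapped as often as possible."""
    if l <= d:
        return d  # guard added for this sketch
    return l - ((l - d) % wrap)
```
]]></artwork></figure>
<t>For the two examples above, assumed_increment(9, 2) gives 2 and assumed_increment(11, 2)
gives 10.</t>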
<!--<t>A re-ECN sender MAY implement a heuristic algorithm to predict beyond
reasonable doubt that the ECI field probably did not wrap within a
sequence of lost pure ACKs. But such an algorithm is OPTIONAL. Such
an algorithm MUST NOT be used unless it is proven to work even in the
presence of correlation between high ACK loss rate on the back
channel and high CE marking rate on the forward channel.</t>
<t>Whatever assumption a re-ECN sender makes about potentially lost CE
marks, both its congestion control and its re-echoing behaviour
SHOULD be consistent with the assumption it makes.</t>-->
<!--<t>As the current value of the ECC is repeatedly signaled, in situations with a low
congestion rate this scheme has
no issue with ACK loss .
In the worst, when all ACKs in one RTT are lost but one, sender and receiver stay in sync as long as fewer than 7 segments were CE marked. As in this scenario, the ACK clock is basically disabled, TCP performance
will be impacted regardless of the accuracy of the ECN feedback signal.
</t>
<t>During phases of high congestion, where every data segment is marked
with CE, no more than two or three consecutive, delayed (m=2) ACKs can get
lost to keep in sync. The number of ACK that can get lost is thus depending on
the rate of the delayed ACKs and the actually length of the period of CE marks.
If the receiver is sending immediate ACKs, e.g. when
some reordering or loss has been detected, up to 6 consecutive ACKs
can be lost without loosing the synchronization between sender and
receiver. The latter scenario may happen if a single segment was
dropped, and subsequent segments get CE marked as the congestion on the link
persists.
</t>
<t>The highest ACK ratio, where at least a single lost ACK will never
cause the counters between sender and receiver to become unsynchronized,
is four. This again assumes a very high congestion scenario, where each
data segment is marked with CE.
</t>-->
<t>ECN Nonce</t>
<t>ECN Nonce cannot be used in parallel to this scheme. However, mechanisms
that make use of this new scheme might provide stronger incentives to declare
congestion honestly when needed.
E.g. with ConEx each congestion notification suppressed by the
receiver should lead the ConEx audit function to
discard an equivalent number of bytes such that the receiver does not gain from
suppressing feedback. This mechanism would even provide a stronger integrity mechanism
than ECN-Nonce does.
Without an external framework to discourage
the withholding of ECN information, this scheme is vulnerable to the problems
described in <xref target="RFC3540"/>.
</t>
</section>
<!--Receiver Action in RECN Mode
Every time a CE marked packet arrives at a receiver in RECN mode,
the receiver transport increments its local value of ECC and MUST
echo its value, modulo 8, to the sender in the ECI field of the
next ACK. It MUST repeat the same value of ECI in every
subsequent ACK until the next CE event, when it increments ECI
again.
The increment of the local ECC values is modulo 8 so the field
value simply wraps round back to zero when it overflows. The
least significant bit is to the right (labelled bit 9).
A receiver in RECN mode MAY delay the echo of a CE to the next
delayed-ACK, which would be necessary if ACK-withholding were
implemented.
Sender Action in RECN Mode
On the arrival of every ACK, the sender compares the ECI field
with its own ECC value, then replaces its local value with that
from the ACK. The difference D (D = (ECI + 8 - ECC mod 8) mod 8)
is assumed to be the number of CE marked packets that arrived at
the receiver since it sent the previously received ACK (but see
below for the sender's safety strategy).
As we have already emphasised, the re-ECN protocol makes no
changes and has no effect on the TCP congestion control algorithm.
So, the first increment of ECI (or detection of a drop) in a RTT
triggers the standard TCP congestion response, no more than one
congestion response per round trip, as usual. However, the sender
re-echoes every increment of ECI irrespective of RTTs.
A TCP sender also acts as the receiver for the other half-
connection. The host will maintain two ECC values S.ECC and R.ECC
as sender and receiver respectively. Every TCP header sent by a
host in RECN mode will also repeat the prevailing value of R.ECC
in its ECI field. If a sender in RECN mode has to retransmit a
packet due to a suspected loss, the re-transmitted packet MUST
carry the latest prevailing value of R.ECC when it is re-
transmitted, which will not necessarily be the one it carried
originally.-->
</section>
<section title="Codepoints with dual counter feedback" anchor="cp_mode"> <!-- [RS] "Codepoint Nonce Feedback"? -->
<t> In line with the definition in the previous section (Figure 3), the
ECE, CWR and NS bits are used as one field. Instead of carrying a single
counter, however, this field encodes 8 codepoints.
These 8 codepoints, as shown below, encode either a "congestion
indication" (CI) counter or an ECT(1) counter (E1). These counters maintain
the number of CE marks and the number of ECT(1) signals observed at the
receiver, respectively.</t>
<texttable anchor="Tab2" align="center" title="Codepoint assignment for accurate ECN feedback">
<ttcol align="center">ECI</ttcol>
<ttcol align="center">NS</ttcol>
<ttcol align="center">CWR</ttcol>
<ttcol align="center">ECE</ttcol>
<ttcol align="center">CI (base5)</ttcol>
<ttcol align="center">E1 (base3)</ttcol>
<c>0</c> <c>0</c> <c>0</c> <c>0</c> <c>0</c> <c>-</c>
<c>1</c> <c>0</c> <c>0</c> <c>1</c> <c>1</c> <c>-</c>
<c>2</c> <c>0</c> <c>1</c> <c>0</c> <c>2</c> <c>-</c>
<c>3</c> <c>0</c> <c>1</c> <c>1</c> <c>3</c> <c>-</c>
<c>4</c> <c>1</c> <c>0</c> <c>0</c> <c>4</c> <c>-</c>
<c>5</c> <c>1</c> <c>0</c> <c>1</c> <c>-</c> <c>0</c>
<c>6</c> <c>1</c> <c>1</c> <c>0</c> <c>-</c> <c>1</c>
<c>7</c> <c>1</c> <c>1</c> <c>1</c> <c>-</c> <c>2</c>
</texttable>
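<t>As an illustration only, the mapping in the table above can be sketched in
Python as follows. The names eci_to_bits, encode and decode are hypothetical
and not defined by this document. The sketch assumes that the ECI value is
read as the 3-bit number formed by NS (most significant), CWR and ECE, as the
table suggests.</t>
<figure><artwork><![CDATA[
```python
CI_CODEPOINTS = 5   # ECI values 0..4 carry the CI counter modulo 5
E1_CODEPOINTS = 3   # ECI values 5..7 carry the E1 counter modulo 3


def eci_to_bits(eci):
    """Split an ECI value (0..7) into the (NS, CWR, ECE) header bits."""
    if not 0 <= eci <= 7:
        raise ValueError("ECI must be in 0..7")
    return (eci >> 2) & 1, (eci >> 1) & 1, eci & 1


def encode(kind, counter):
    """Return the ECI codepoint for a CI or E1 counter value."""
    if kind == "CI":
        return counter % CI_CODEPOINTS
    if kind == "E1":
        return CI_CODEPOINTS + counter % E1_CODEPOINTS
    raise ValueError("kind must be 'CI' or 'E1'")


def decode(eci):
    """Return (kind, value) for an ECI codepoint (0..7)."""
    if not 0 <= eci <= 7:
        raise ValueError("ECI must be in 0..7")
    if eci < CI_CODEPOINTS:
        return "CI", eci
    return "E1", eci - CI_CODEPOINTS
```
]]></artwork></figure>
<t>For instance, under these assumptions a CI counter value of 7 maps to ECI 2,
and an E1 counter value of 4 maps to ECI 6 (NS=1, CWR=1, ECE=0).</t>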
<!--<t>TBD: Stipulate the alternate sending of the two counters, unless either
counter changes, and then to send a new value twice? This would decrease
ACK ratio under high congestion, but increase the number of consecutive
lost ACKs to 2, which can be always tolerated.</t> -->
<t>By default an accurate ECN receiver MUST echo the CI counter
(modulo 5) with the respective codepoints. Whenever a CE mark occurs and thus the value of the
CI has changed, the receiver MUST echo the CI in the next ACK.
Moreover, the receiver MUST repeat the codepoint that provides
the CI counter directly on the subsequent ACK. Thus every value of CI
will be transmitted at least twice.</t>
<t>If an ECT(1) mark is received and thus E1 increases, the receiver has to convey that
updated information to the sender as soon as possible. Thus, on reception
of an ECT(1) marked packet, the receiver MUST
signal the current value of the E1 counter (modulo 3) in the next
ACK, unless a CE mark was received that has not yet been echoed twice. The receiver MUST
also repeat every E1 value. This repetition does not need to be in the
subsequent ACK, as the E1 value will only be transmitted when no changes in the CI
have occurred. Each E1 value will be sent exactly twice. The repetition of every
signal provides further resilience against lost ACKs. </t>
<t>As only a limited number of E1 codepoints exist and the receiver might not
acknowledge every single data packet immediately (delayed ACKs), a sender SHOULD NOT
mark more than 1/m of the packets with ECT(1), where m is the ACK ratio (e.g. 50% when
every second data packet triggers an ACK). This constraint will avoid a
permanent feedback of E1 only. <!--, and never more than two
consecutive packets.--></t>
<!--<t>For resilience against lost ACKs, a second ACK has to
transmit the previous codepoint again, whether another
congestion indication (CE) or ECT(1) mark arrives or not.</t>-->
<t>With the available number of codepoints, this requirement may conflict
with delayed ACK ratios larger than two. In that case a receiver
MUST change its ACK rate such
that a sufficient rate of feedback signals can be sent. Details on how the change in the ACK rate should be implemented are given in the next subsection.
<!--Under certain
circumstances, i.e. the sender using excessive ECT(1) marks, every packet
may be immediately get ACK'ed. The available codepoints for CI allow
the indefinite use of delayed ACKs with a ratio of two, even during
heavy network congestion. --></t>
<section title="Implementation">
<t> The basic idea is for the receiver to count how many packets carry a
congestion notification. This could, in principle, be achieved by
increasing a "congestion indication" counter (CI.c) for every incoming CE marked
segment. Since the space for communicating the information back to the
sender in ACKs is limited, a "gauge" (CI.g) is increased instead of
directly increasing this counter.
</t>
<t>When sending an ACK, the content of this gauge (capped by the maximum
number that can be encoded in the ACK, e.g. 4 for CI, and 2 for E1) is
copied to the actual counter, and CI.g is reduced by the value
that was copied over and transmitted, unless CI.g was zero before. To
avoid losing information, it is ensured that an ACK is sent at least
after 5 incoming congestion marks (i.e. when CI.g exceeds 5).
</t>
<t>For resilience against lost ACKs, an indicator flag (CI.i) ensures that,
whether another congestion indication arrives or not, a second ACK
transmits the previous counter value again.
</t>
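<t>The interplay of counter, gauge and indicator flag described above can be
sketched as follows. This is a simplified, CI-only illustration and not a
normative algorithm: the class CiReceiver and its method names are
hypothetical, the E1 counter is left out, and the indicator flag is assumed
to be cleared on the first repeated ACK.</t>
<figure><artwork><![CDATA[
```python
CI_CP = 5  # number of codepoints available for the CI counter


class CiReceiver:
    """CI-only sketch of the counter / gauge / indicator method."""

    def __init__(self):
        self.c = 0  # CI.c: counter whose value is put on the wire
        self.g = 0  # CI.g: gauge of CE marks not yet moved to CI.c
        self.i = 0  # CI.i: set while the current value must be repeated

    def on_ce(self):
        """Account for one CE mark; True means 'ACK immediately'."""
        self.g += 1
        return self.g > 5

    def next_eci(self):
        """Return the CI codepoint to place in the next ACK."""
        if self.i == 0 and self.g > 0:
            # A new value may be sent: move at most CI_CP - 1 marks
            # from the gauge into the counter.
            step = min(CI_CP - 1, self.g)
            self.c = (self.c + step) % (CI_CP * CI_CP)
            self.g -= step
            self.i = 1
        elif self.i > 0:
            # Repeat the previous value once for resilience.
            self.i -= 1
        return self.c % CI_CP
```
]]></artwork></figure>
<t>With this sketch, one CE mark followed by an ACK yields codepoint 1; the
next ACK repeats codepoint 1 even if further CE marks have arrived in the
meantime, and those marks are reported with the ACK after that.</t>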
<t>The same counter / gauge method is used to count and feed back (using
a different mapping) the number of incoming packets marked ECT(1)
(called E1 in the algorithm). As fewer codepoints are available for
conveying the E1 counter value, an immediate ACK MUST be triggered
whenever the gauge E1.g exceeds a threshold of 3. The sender receives
the receiver's counter values and compares them with the locally
maintained counter. Any increase of these counters is added to the
sender's internal counters, yielding a precise number of CE-marked
and ECT(1) marked packets. Architecturally, the counters never decrease
during a TCP session. However, when a counter wraps around, it must do so
at a multiple of 5 for CI and at a multiple of 3 for E1.</t>
<t>The following table provides an example showing a half-connection with a TCP sender A and
receiver B. The sender maintains a counter CI.r to reconstruct the number of CE marks
received at the receiver side.</t>
<texttable anchor="Tab4" align="center" title="Codepoint signal example">
<ttcol align="center"> </ttcol>
<ttcol align="center">Data</ttcol>
<ttcol align="right">TCP A</ttcol>
<ttcol align="right">IP</ttcol>
<ttcol align="right">TCP B</ttcol>
<ttcol align="center">Data</ttcol>
<c> </c> <c/> <c>SEQ ACK CTL</c> <c/> <c>SEQ ACK CTL</c> <c/>
<c>--</c> <c/> <c>-------------</c> <c>----------</c> <c>-------------</c> <c/>
<c>1</c> <c/> <c>0100 SYN</c> <c> ----> </c> <c> </c> <c/>
<c> </c> <c/> <c> CWR,ECE,NS</c> <c> </c> <c> </c> <c/>
<c>2</c> <c/> <c> </c> <c> &lt;---- <!--ECT0--></c> <c>0300 0101 SYN </c> <c/>
<c> </c> <c/> <c> </c> <c> </c> <c> ACK,CWR </c> <c/>
<c>3</c> <c/> <c>0101 0301 ACK</c> <c> ECT0 -CE-></c> <c> </c> <c/>
<!--<c> </c> <c/> <c>ECI=CI.0 </c> <c/> <c/> <c/>-->
<c/> <c/> <c> </c> <c/> <c>CI.c=0 CI.g=1</c> <c/>
<c>4</c> <c>100</c> <c>0101 0301 ACK</c> <c>ECT0 ----></c> <c/> <c/>
<!--<c> </c> <c/> <c>ECI=CI.0 </c> <c/> <c/> <c/>-->
<c/> <c/> <c> </c> <c/> <c>CI.c=1 CI.g=0</c> <c/>
<c>5</c> <c/> <c></c> <c>&lt;---- </c> <c>0301 0201 ACK</c> <c/>
<c> </c> <c/> <c/> <c/> <c>ECI=CI.1</c> <c/>
<c/> <c/> <c>CI.r=1</c> <c/> <c/> <c/>
<c>6</c> <c>100</c> <c>0201 0301 ACK</c> <c>ECT0 -CE-></c> <c/> <c/>
<!--<c> </c> <c/> <c>ECI=CI.0 </c> <c/> <c/> <c/>-->
<c/> <c/> <c> </c> <c/> <c>CI.c=1 CI.g=1</c> <c/>
<c>7</c> <c>100</c> <c>0301 0301 ACK</c> <c>ECT0 -CE-></c> <c/> <c/>
<!--<c> </c> <c/> <c>ECI=CI.0 </c> <c/> <c/> <c/>-->
<c/> <c/> <c> </c> <c/> <c>CI.c=1 CI.g=2</c> <c/>
<c>8</c> <c/> <c></c> <c>XX-- </c> <c>0301 0401 ACK</c> <c/>
<c> </c> <c/> <c/> <c/> <c>ECI=CI.1</c> <c/>
<c/> <c/> <c>CI.r=1</c> <c/> <c/> <c/>
<c>9</c> <c>100</c> <c>0401 0301 ACK</c> <c>ECT0 -CE-></c> <c/> <c/>
<!--<c> </c> <c/> <c>ECI=CI.0 </c> <c/> <c/> <c/>-->
<c/> <c/> <c> </c> <c/> <c>CI.c=1 CI.g=3</c> <c/>
<c>10</c> <c>100</c> <c>0501 0301 ACK</c> <c>ECT0 -CE-></c> <c/> <c/>
<!--<c> </c> <c/> <c>ECI=CI.0 </c> <c/> <c/> <c/>-->
<c/> <c/> <c> </c> <c/> <c>CI.c=5 CI.g=0</c> <c/>
<c>11</c> <c/> <c></c> <c>&lt;---- </c> <c>0301 0601 ACK</c> <c/>
<c> </c> <c/> <c/> <c/> <c>ECI=CI.0</c> <c/>
<c/> <c/> <c>CI.r=5</c> <c/> <c/> <c/>
<c>12</c> <c>100</c> <c>0601 0301 ACK</c> <c>ECT0 -CE-></c> <c/> <c/>
<!--<c> </c> <c/> <c>ECI=CI.0 </c> <c/> <c/> <c/>-->
<c/> <c/> <c> </c> <c/> <c>CI.c=5 CI.g=1</c> <c/>
<c>13</c> <c>100</c> <c>0701 0301 ACK</c> <c>ECT0 -CE-></c> <c/> <c/>
<!--<c> </c> <c/> <c>ECI=CI.0 </c> <c/> <c/> <c/>-->
<c/> <c/> <c> </c> <c/> <c>CI.c=5 CI.g=2</c> <c/>
<c>14</c> <c/> <c></c> <c>&lt;---- </c> <c>0301 0801 ACK</c> <c/>
<c> </c> <c/> <c/> <c/> <c>ECI=CI.0</c> <c/>
<c/> <c/> <c>CI.r=5</c> <c/> <c/> <c/>
<!--
| 1 | | 0100 SYN | FNE | - | R.ECC=0 | |
| | | CWR,ECE,NS | | | | |
| 2 | | R.ECC=0 | <- | FNE | 0300 0101 | |
| | | | | | SYN,ACK,CWR | |
| 3 | | 0101 0301 ACK | RECT | - | R.ECC=0 | |
| 4 | 1000 | 0101 0301 ACK | FNE | - | R.ECC=0 | |
| 5 | | R.ECC=0 | <- | FNE | 0301 1102 ACK | 1460 |
| 6 | | R.ECC=0 | <- | RECT | 1762 1102 ACK | 1460 |
| 7 | | R.ECC=0 | <- | FNE | 3222 1102 ACK | 1460 |
| 8 | | 1102 1762 ACK | RECT | - | R.ECC=0 | |
| 9 | | R.ECC=0 | <- | RECT | 4682 1102 ACK | 1460 |
| 10 | | R.ECC=0 | <- | RECT | 6142 1102 ACK | 1460 |
| 11 | | 1102 3222 ACK | RECT | - | R.ECC=0 | |
| 12 | | R.ECC=0 | <- | RECT | 7602 1102 ACK | 1460 |
| 13 | | R.ECC=1 | <*- | RECT | 9062 1102 ACK | 1460 |
| | | ... | | | | |
-->
</texttable>
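<t>The sender-side values of CI.r in the example above can be reproduced
with the following sketch. It is an illustration only: the function
update_ci_r is a hypothetical name, only the CI codepoints are considered,
and the lost ACK of row 8 is modelled as None.</t>
<figure><artwork><![CDATA[
```python
CI_CP = 5  # number of codepoints available for the CI counter


def update_ci_r(ci_r, eci):
    """Add the (positive only) delta signalled by a CI codepoint."""
    delta = (eci - ci_r % CI_CP) % CI_CP
    return ci_r + delta


def replay(codepoints):
    """Return CI.r after each ACK; None stands for a lost ACK."""
    ci_r = 0
    history = []
    for eci in codepoints:
        if eci is not None:
            ci_r = update_ci_r(ci_r, eci)
        history.append(ci_r)
    return history


# ACKs of rows 5, 8 (lost), 11 and 14 of the example table.
EXAMPLE_HISTORY = replay([1, None, 0, 0])
```
]]></artwork></figure>
<t>Under these assumptions EXAMPLE_HISTORY holds the values 1, 1, 5 and 5,
matching the CI.r column of the table: the ACK of row 11 carries
codepoint 0, which the sender interprets as an increase by four.</t>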
</section>
<section title="Discussion">
<t>ACK loss</t>
<t>As this scheme sends each codepoint (of the two subsets) at least two
times, at least one, and up to two consecutive ACKs can be lost. Further
refinements, such as interleaving ACKs when sending codepoints belonging
to the two subsets (e.g. CI, E1), can allow the loss of any two
consecutive ACKs, without the sender losing congestion information, at
the cost of also reducing the ACK ratio.
</t>
<t>At low congestion rates, the sending of the current value of the CI
counter by default allows higher numbers of consecutive ACKs to be
lost, without impacting the accuracy of the ECN signal.
</t>
<t>ECN Nonce</t>
<t>By comparing the number of incoming ECT(1) notifications with
the actual number of packets that were transmitted with an ECT(1) mark,
as well as with the sum of the sender's two internal counters, the sender
can probabilistically detect a receiver that sends false marks or suppresses
accurate ECN feedback, or a path that does not properly support ECN.
</t>
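<t>A minimal sketch of such a comparison is given below. It follows the
test used in the sender pseudocode of the appendix, i.e.
E1.r &lt;= E1.s &lt;= E1.r + CI.r, and assumes that all three values refer to
the same point in sequence space and that no ECT(1) marked packets were
lost. The function name feedback_plausible is hypothetical.</t>
<figure><artwork><![CDATA[
```python
def feedback_plausible(e1_sent, e1_reported, ci_reported):
    """Check E1.r <= E1.s <= E1.r + CI.r for one point in time.

    e1_sent     -- packets the sender marked ECT(1) (E1.s)
    e1_reported -- ECT(1) marks reported by the receiver (E1.r)
    ci_reported -- CE marks reported by the receiver (CI.r)
    """
    return e1_reported <= e1_sent <= e1_reported + ci_reported
```
]]></artwork></figure>
<t>For example, if 10 packets were sent with ECT(1) and the receiver reports
8 ECT(1) marks and 2 CE marks, the feedback is plausible. A report of 11
ECT(1) marks, or of 5 ECT(1) marks and 2 CE marks, would fail the test.</t>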
<t>This approach maintains a balanced selection of properties found in
ECN Nonce, <xref target="eci_mode"/>, and <xref target="sm_mode"/>.
A delayed ACK ratio of two can be sustained indefinitely even during
heavy congestion, but not during excessive ECT(1) marking, which is
under the control of the sender. Higher ACK ratios can be sustained
when congestion is low, subject to the need for timely E1 feedback.
<!-- as a high ACK ratios will not cause a loss in timeliness or accuracy.
-> MK: dont understand that...?-->
</t>
</section>
<!--
This approach aims to keep one integrity mechanism similar to ECN-None.
If the codepints are taken from 3 bits, and
assigned properly, the wire protocol does not impose limits on
regular delayed ACKs (1 ack per 2 data seg, typical), even under
severe congestion where 100% CE marks are received...
The idea similar to the idea of the previous section is to signal the
"absolute value" on the wire protocol, not some deltas (or
bit-flipping).
Next idea is, to signal two counters independently from each
other (one for the CE, one for ECT1) so that the sender can check
a equation, and probabilistically determine, if the
received counters are trustable.
Compared to the very simple feedback of one counter, the limited
number of codepoints requires two major changes, to address
resiliency and accuracy:
The value on the wire should be repeated at least twice even
under worst case conditions (so that a single ACK loss is always
acceptable), and the chance between successive counter values
must not overflow.
Finally, as a fall-back mechanism to maintain these properties
under certain conditions (ie AckCC with DelACK > 2, or extreme
high ECT marking probabilities), the receive may be required to
interrupt a pending delack, and instead send out an immediate ACK.
Note that for normal implementations with delack=2 and low ECT1
marking probabilities, this will not be triggered.
In the original ECI scheme, a binary counter is maintained in the
receiver, and the lowest 3 bits mapped directly to ECI; Overflows of a
binary counter are by definition at some power of 2...
The new scheme requires (much?) more state in the receiver - a counter, a
gauge and a Boolean flag, two times (for CE and ECT(1)).
The CE counter is mapped to 5 codepoints in ECI, and the ECT1 to 3
codepoints. As there is no natural base-5 or base-3 counter, the overflows
have to be handled explicitly (ie. Modulo 5^n / 3^n) in the receiver too.
This scheme has all the required properties:
<list hangIndent="10" style="empty">
<t>Resiliency - In worst case, any one ACK can be dropped, two consecutive
dropped ACKs only impact accuracy in 50% - when CE/ECT marking rates are
very high (>>50%). At normal marking rates, there is a high redundancy in
the signal - many ACKs may be lost without the counters getting
unsynchronized.</t>
<t>Timely - CE signals can be fed back at the same rate they are received,
using delayed ACKs only, even when keeping the value constant for every 2
ACKs; (The sender has control over ECT(1), and should not send at a ECT(1)
marking rate exceeding 50%; if it does, the scheme below will disable
delayed ACKs to keep up. Alternatively, the Gauge could be allowed to fill
up and the signal returned with delay. The sender should be
able to implicitly disable delayed ACKs on purpose sometimes (ie.
End-of-stream / accurate timing information etc).</t>
<t>Integrity - The feedback of two independent signals allows the sender to
verify the plausibility of the counters reported by the receiver.
Lost (or never sent) ACKs can also be detected by the sender. As with
ECN-Nonce, a misbehaving receiver can only be detected with a certain
probability though. If and what kind of enforcements a sender should do,
would be out-of-scope (ie cwnd=RW, IW or 1; RST; logging...)</t>
<t>Accuracy - Multiple signals can be conveyed from the receiver back to
the sender. For CE signals, these signals may be delayed by 3 (data)
segments. With the algoritm below, ECT(1) signals may lag 4-5 segments
behind (until delacks are disabled; then this offset is kept).</t>
</list>-->
</section>
<section title="Short Summary of the Discussions">
<t>With the exception of the signaling scheme described in <xref target="sm_mode"/>, all
signaling schemes may fail to work if middleboxes intervene and check the semantics of
<xref target="RFC3168"/> signals.</t>
<t>The scheme described in <xref target="cp_mode"/> is the most complex to implement
especially on a receiver, with much additional state to be kept there, compared to the
other signaling schemes. With the advances in compute power, many more cycles are
available to process TCP than ever before. </t>
<t><xref target="Tab3"/> gives an overview of the relative implications of the different
proposed signaling schemes. Further discussion should be included here in the next version of this document.</t>
<texttable anchor="Tab3" title="Overview of accurate feedback schemes">
<ttcol align="center">Section</ttcol>
<ttcol align="center">Resiliency</ttcol>
<ttcol align="center">Timely</ttcol>
<ttcol align="center">Integrity</ttcol>
<ttcol align="center">Accuracy</ttcol>
<ttcol align="center">Complexity</ttcol>
<c>1-bit-flag</c>
<c> -</c> <c> +</c> <c> + <!--*)--> </c> <c> -</c> <c> +</c>
<c>3-bit-field</c>
<c> ++</c> <c> ++</c> <c>-- </c> <c>++</c> <c> -</c>
<c>Codepoints</c>
<c> +</c> <c> +</c> <c>+ </c> <c>++</c> <c>--</c>
<!--<c><xref target="comp_mode" format="counter"/></c>
<c> minusminus</c> <c> -</c> <c>- *)</c> <c>minusminus</c> <c>++</c> -->
<!--<postamble>*) could be combined with ECN-Nonce</postamble>-->
</texttable>
</section>
</section>
<section title="TCP Sender">
<t> This section will specify the sender-side action, describing how to extract the accurate number of congestion markings from the given receiver feedback signal.
</t>
</section>
<section title="TCP Receiver">
<t> This section will describe the receiver-side action to signal the accurate ECN feedback back to the sender. In any case, the receiver will need to maintain a counter of how many CE markings have been seen during a connection. Depending on the chosen coding scheme, there will be different actions to set the corresponding bits in the TCP header. In all cases it might be helpful if the receiver is able to switch from delayed ACK behavior to sending ACKs immediately after data packet reception in a high congestion situation.
</t>
</section>
<section title="Advanced Compatibility Mode" anchor="comp_mode">
<t>
This section describes a possibility to achieve more accurate feedback even when
the receiver is not capable of the new accurate ECN feedback scheme, with the drawback of
less reliability.
</t>
<t>During initial deployment, a large number of receivers will only support
<xref target="RFC3168"/> classic ECN feedback. Such a receiver will set the
ECE bit whenever it receives a segment with the CE codepoint set, and clear
the ECE bit only when it receives a segment with the CWR bit set. As the CE
codepoint has priority over the CWR bit (Note: the wording in this regard
is ambiguous in <xref target="RFC3168"/>, but the reference implementation of
ECN in ns2 is clear), a <xref target="RFC3168"/> compliant
receiver will not clear the ECE bit on the reception of a segment, where both
CE and CWR are set simultaneously. This property allows the use of a compatibility
mode, to extract more accurate feedback from legacy <xref target="RFC3168"/>
receivers by setting the CWR permanently.
</t>
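<t>The receiver behavior assumed above can be illustrated with the
following sketch. It is not taken from any implementation: the class
LegacyEcnReceiver and the helper echoes are hypothetical names, the priority
of CE over CWR is the interpretation stated above, and one ACK per data
segment is assumed.</t>
<figure><artwork><![CDATA[
```python
class LegacyEcnReceiver:
    """Sketch of classic ECN echo state, with CE taking priority."""

    def __init__(self):
        self.ece = False  # value of the ECE bit in outgoing ACKs

    def on_segment(self, ce, cwr):
        """Process one data segment and return ECE for its ACK."""
        if ce:
            self.ece = True    # CE wins, even if CWR is also set
        elif cwr:
            self.ece = False   # CWR without CE clears the echo
        return self.ece


def echoes(ce_marks, cwr_always):
    """Return the ECE bit of each ACK for a sequence of CE marks."""
    rx = LegacyEcnReceiver()
    return [rx.on_segment(ce, cwr_always) for ce in ce_marks]
```
]]></artwork></figure>
<t>With CWR set on every segment, the returned ECE sequence equals the
sequence of CE marks. Without CWR, ECE stays set after the first CE mark
and later segments can no longer be told apart.</t>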
<t>Assuming a delayed ACK ratio of one, a sender can permanently set the CWR
bit in the TCP header to receive more accurate feedback of the CE codepoints
as seen at the receiver. This feedback signal is, however, very brittle and any
ACK loss may cause congestion information to be lost.
Neither delayed ACKs nor ACK loss can be accounted for in a reliable
way. Therefore, a sender would need to use heuristics to determine the
current delayed ACK ratio m used by the receiver (e.g. most receivers will
use m=2), and also the recent ACK loss ratio (l). Acknowledgement Congestion Control
(AckCC) as defined in <xref target="RFC5690"/> cannot be used, as deployment
of this feature is only experimental.
</t>
<t>Using a phase locked loop algorithm, the CWR bit can then be set only on
those data segments, that will trigger a (delayed) ACK. Thereby, no congestion
information is lost, as long as the ACK carrying the ECE bit is seen by the
sender.
</t>
<t>Whenever the sender sees an ACK with
ECE set, this indicates that at least one, and at most m / (m - l) data
segments with the CE codepoint set were seen by the receiver. The sender
SHOULD react as if m CE indications were reflected back to the sender by
the receiver, unless additional heuristics (e.g. dead time correction)
can determine a more accurate value of the "true" number of received CE marks.
</t>
</section>
</section>
<section title="Acknowledgements">
<t> We want to thank Michael Welzl and Bob Briscoe for their input and discussion.
</t>
</section>
<section anchor="IANA" title="IANA Considerations">
<t>This memo includes no request to IANA.</t>
<!--<t> If this memo was to progress to standards track, it would update RFC3168
and RFC3540, to add new combinations of flags in the TCP header for capability
negotiation (see <xref target="TCPNeg"/>) and a change in TCP ECN semantics
(see <xref target="TCPSig"/>).</t>-->
</section>
<section anchor="Security" title="Security Considerations">
<t>For coding schemes that increase robustness for the ECN feedback, similar
considerations as in RFC3540 apply for the selection of when to send an ECT(1)
codepoint.</t>
</section>
</middle>
<!-- *****BACK MATTER ***** -->
<back>
<references title="Normative References">
<!--?rfc include="http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml"?-->
&RFC2119;
&RFC3168;
&RFC3540;
</references>
<references title="Informative References">
<?rfc include="reference.I-D.briscoe-tsvwg-re-ecn-tcp.xml"?>
&RFC5562;
&RFC5681;
&RFC5690;
<reference anchor="Ali10">
<front>
<title>DCTCP: Efficient Packet Transport for the Commoditized Data Center</title>
<author initials="M" surname="Alizadeh">
<organization></organization></author>
<author initials="A" surname="Greenberg">
<organization></organization></author>
<author initials="D" surname="Maltz">
<organization></organization></author>
<author initials="J" surname="Padhye">
<organization></organization></author>
<author initials="P" surname="Patel">
<organization></organization></author>
<author initials="B" surname="Prabhakar">
<organization></organization></author>
<author initials="S" surname="Sengupta">
<organization></organization></author>
<author initials="M" surname="Sridharan">
<organization></organization></author>
<date month="Jan" year="2010"/>
</front>
</reference>
</references>
<section anchor="app-codepoints" title="Pseudo Code for the Codepoint Coding">
<t>Receiver:</t>
<t>Input signals: CE , ECT(1)<vspace blankLines="0" />
TCP Fields: ECI (3-bit field formed by NS, CWR and ECE). CI.cm and E1.cm map into these 8 codepoints (i.e. 5 and 3 codepoints)</t>
<t>These counters get tracked by the following variables:</t>
<t>CI.c (congestion indication - counter, modulo a multiple of the available codepoints to represent CI.c in the ECI field. Range[0..n*CI.cp-1])<vspace blankLines="0" />
CI.g (congestion indication - gauge, [0.."inf"])<vspace blankLines="0" />
CI.i (congestion indication - iteration, [0,1])<vspace blankLines="0" />
These are to track CE indications.</t>
<t>E1.c, E1.g and E1.i (doing the same, but for ECT(1) signals).</t>
<t>Constants:<vspace blankLines="0" />
CI.cp (number of codepoints available to signal)<vspace blankLines="0" />
CI.cm[] (codepoint mapping for CI)<vspace blankLines="0" />
E1.cp (number of codepoints available for E1 signal)<vspace blankLines="0" />
E1.cm[0..(E1.cp-1)] (codepoint mappings for E1)</t>
<figure><artwork><![CDATA[
At session initialization, all these counters are set to 0;

When a Segment (Data, ACK) is received,
perform the following steps:

  If a CE codepoint is received,
    Increase CI.g by 1
  If a ECT(1) codepoint is received,
    Increase E1.g by 1
  If (CI.g > 5)  # When ACK rate is not sufficient to keep
  or (E1.g > 3)  # gauge close to zero, increase ACK rate
                 # works independent of delACK number (ie AckCC)
    Cancel pending delayed ACK (ACK this segment immediately)
    # this increases the ACK rate to a maximum of 1.5 data segments
    # per ACK, with delACK=2,
    # and CE mark rate exceeds 75% for a number
    # of at least 18 segments.
    # 5 codepoints would allow delack=2 indefinitely btw

When preparing an ACK to be sent:

  If (CI.g > 0) or
     ((E1.i != 0) and (CI.i != 0))  # E1.g = 0 is to skip this
                                    # if only the 2nd CI.c ACK
    # has to be sent - effectively alternating CI.c and E1.c on ACKs
    # should give slightly better resiliency against ack losses
    If CI.i == 0                    # updates to CI.c allowed
    and CI.g > 0                    # update is meaningful
      CI.i = 1                      # may be larger
                                    # if more resiliency is reqd
      CI.c += min(CI.cp-1,CI.g)     # CI.cp-1 is 3 for 4 codepoints,
                                    # 4 for 5 etc
      CI.c = CI.c modulo CI.cp*CI.cp  # using modulo the square of
                                      # available codepoints,
                                      # for convenience (debugging)
      CI.g -= min(CI.cp-1,CI.g)
    Else
      CI.i--                        # just in case CI.i was set to
                                    # more than 1 for resiliency
    Send next ACK with ECI = CI.cm[CI.c modulo CI.cp]
  Else
    If (E1.g > 0) or (E1.i != 0)
      If (E1.i == 0) and (E1.g > 0)
        E1.i = 1
        E1.c += min(E1.cp-1,E1.g)
        E1.c = E1.c modulo E1.cp*E1.cp
        E1.g -= min(E1.cp-1,E1.g)
      Else
        E1.i--
      Send next ACK with ECI = E1.cm[E1.c modulo E1.cp]
    Else
      Send next ACK with ECI = CI.cm[CI.c modulo CI.cp]  # default action
]]></artwork></figure>
<t>Sender:</t>
<t>Counters:</t>
<!--<texttable>
<ttcol align="center">Name</ttcol>
<ttcol align="center">Description</ttcol>
<c>CI.r</c><c>current value of CEs seen by receiver</c>
<c>E1.s</c><c>sum of all sent ECT(1) marked packets (up to snd.nxt)</c>
<c>E1.s(t)</c><c>value of E1.s at time (in sequence space) t</c>
<c>E1.r</c><c>value signaled by receiver about received ECT(1) segments</c>
<c>E1.r(t)</c><c>value of E1.r at time (in sequence space) t</c>
<c>CI.r(t)</c><c>ditto</c>
</texttable>-->
<t>
CI.r - current value of CEs seen by receiver<vspace blankLines="0" />
E1.s - sum of all sent ECT(1) marked packets (up to snd.nxt)<vspace blankLines="0" />
E1.s(t) - value of E1.s at time (in sequence space) t<vspace blankLines="0" />
E1.r - value signaled by receiver about received ECT(1) segments<vspace blankLines="0" />
E1.r(t) - value of E1.r at time (in sequence space) t<vspace blankLines="0" />
CI.r(t) - ditto</t>
<figure><artwork><![CDATA[
# Note: With a codepoint-implementation,
# a reverse table ECI[n] -> CI.r / E1.r is needed.
# This example is simplified with 4/4 codepoints
# instead of 5/3

If ACK with NS=0
  CI.r += (ECI + 4 - (CI.r mod CI.cp)) mod CI.cp
  # The wire protocol transports the absolute value
  # of the receiver-side counter.
  # Thus the (positive only) delta needs to be calculated,
  # and added to the sender-side counter.
If ACK with NS=1
  E1.r += (ECI + 4 - (E1.r mod E1.cp)) mod E1.cp

# Before CI.r or E1.r reach a (binary) rollover,
# they need to roll over some multiple of CI.cp
# and E1.cp respectively.
CI.r = CI.r modulo CI.cp * n_CI
E1.r = E1.r modulo E1.cp * n_E1
# (an implementation may choose to use a single constant,
# ie 3^4*5^4 for 16-bit integers,
# or 3^8*5^8 for 32-bit integers)

# The following test can (probabilistically) reveal,
# if the receiver or path is not properly
# handling ECN (CE, E1) marks
If not E1.r(t) <= E1.s(t) <= E1.r(t) + CI.r(t)
  # -> receiver lies (or too many ACKs got lost,
  # which can be checked too by the sender).
]]></artwork></figure>
</section>
</back>
</rfc>