<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!-- One method to get references from the online citation libraries.
There has to be one entity for each item to be referenced.
An alternate method (rfc include) is described in the references. -->
<!-- <!ENTITY RFC0793 SYSTEM "http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.0793.xml"> -->
<!-- <!ENTITY RFC2018 SYSTEM "http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.2018.xml"> -->
<!ENTITY RFC2119 SYSTEM "http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY RFC3168 SYSTEM "http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.3168.xml">
<!ENTITY RFC3540 SYSTEM "http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.3540.xml">
<!ENTITY RFC4987 SYSTEM "http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.4987.xml">
<!-- <!ENTITY RFC5226 SYSTEM "http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.5226.xml"> -->
<!ENTITY RFC5562 SYSTEM "http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.5562.xml">
<!ENTITY RFC5681 SYSTEM "http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.5681.xml">
<!ENTITY RFC5925 SYSTEM "http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.5925.xml">
<!ENTITY RFC6824 SYSTEM "http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.6824.xml">
<!ENTITY RFC6994 SYSTEM "http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.6994.xml">
<!ENTITY RFC7413 SYSTEM "http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.7413.xml">
<!ENTITY RFC7560 SYSTEM "http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.7560.xml">
<!ENTITY I-D.ietf-conex-abstract-mech SYSTEM "http://xml2rfc.ietf.org/public/rfc/bibxml3/reference.I-D.ietf-conex-abstract-mech.xml">
<!ENTITY I-D.kuehlewind-tcpm-ecn-fallback SYSTEM "http://xml2rfc.ietf.org/public/rfc/bibxml3/reference.I-D.kuehlewind-tcpm-ecn-fallback.xml">
<!ENTITY I-D.moncaster-tcpm-rcv-cheat SYSTEM "http://xml2rfc.ietf.org/public/rfc/bibxml3/reference.I-D.moncaster-tcpm-rcv-cheat.xml">
<!ENTITY I-D.bensley-tcpm-dctcp SYSTEM "http://xml2rfc.ietf.org/public/rfc/bibxml3/reference.I-D.bensley-tcpm-dctcp.xml">
]>
<?xml-stylesheet type='text/xsl' href='http://xml.resource.org/authoring/rfc2629.xslt' ?>
<!-- Alterations to I-D/RFC boilerplate -->
<?rfc private="" ?>
<!-- Default private="" Produce an internal memo 2.5pp shorter than an I-D or RFC -->
<?rfc rfcprocack="yes" ?>
<!-- Default rfcprocack="no" add a short sentence acknowledging xml2rfc -->
<?rfc strict="no" ?>
<!-- Default strict="no" Don't check I-D nits -->
<?rfc rfcedstyle="yes" ?>
<!-- Default rfcedstyle="yes" attempt to closely follow finer details from the latest observable RFC-Editor style -->
<!-- IETF process -->
<?rfc iprnotified="no" ?>
<!-- Default iprnotified="no" I haven't disclosed existence of IPR to IETF -->
<!-- ToC format -->
<?rfc toc="yes" ?>
<!-- Default toc="no" No Table of Contents -->
<!-- Cross referencing, footnotes, comments -->
<?rfc symrefs="yes"?>
<!-- Default symrefs="no" Don't use anchors, but use numbers for refs -->
<?rfc sortrefs="yes"?>
<!-- Default sortrefs="no" Don't sort references into order -->
<?rfc comments="yes" ?>
<!-- Default comments="no" Don't render comments -->
<?rfc inline="no" ?>
<!-- Default inline="no" if comments is "yes", then render comments inline; otherwise render them in an `Editorial Comments' section -->
<!-- Pagination control -->
<?rfc compact="yes"?>
<!-- Default compact="no" Start sections on new pages -->
<?rfc subcompact="no"?>
<!-- Default subcompact="(as compact setting)" yes/no is not quite as compact as yes/yes -->
<!-- HTML formatting control -->
<?rfc emoticonic="yes" ?>
<!-- Default emoticonic="no" Doesn't prettify HTML format -->
<rfc category="exp" docName="draft-ietf-tcpm-accurate-ecn-02"
ipr="trust200902" updates="">
<front>
<title abbrev="Accurate TCP-ECN Feedback">More Accurate ECN Feedback in
TCP</title>
<author fullname="Bob Briscoe" initials="B." surname="Briscoe">
<organization>Simula Research Laboratory</organization>
<address>
<postal>
<street/>
<city/>
<country/>
</postal>
<email>ietf@bobbriscoe.net</email>
<uri>http://bobbriscoe.net/</uri>
</address>
</author>
<author fullname="Mirja Kühlewind" initials="M."
surname="Kühlewind">
<organization>ETH Zurich</organization>
<address>
<postal>
<street/>
<city>Zurich</city>
<country>Switzerland</country>
</postal>
<email>mirja.kuehlewind@tik.ee.ethz.ch</email>
</address>
</author>
<author fullname="Richard Scheffenegger" initials="R."
surname="Scheffenegger">
<address>
<postal>
<street/>
<city>Vienna</city>
<country>Austria</country>
</postal>
<email>rscheff@gmx.at</email>
</address>
</author>
<date year="2016"/>
<area>Transport</area>
<workgroup>TCP Maintenance &amp; Minor Extensions (tcpm)</workgroup>
<keyword>Congestion Control and Management</keyword>
<keyword>Congestion Notification</keyword>
<keyword>Feedback</keyword>
<keyword>Reliable</keyword>
<keyword>Ordered</keyword>
<keyword>Protocol</keyword>
<keyword>ECN</keyword>
<abstract>
<t>Explicit Congestion Notification (ECN) is a mechanism where network
nodes can mark IP packets instead of dropping them to indicate incipient
congestion to the end-points. Receivers with an ECN-capable transport
protocol feed back this information to the sender. ECN is specified for
TCP in such a way that only one feedback signal can be transmitted per
Round-Trip Time (RTT). Recently proposed TCP mechanisms such as Congestion
Exposure (ConEx) or Data Center TCP (DCTCP) need more accurate ECN
feedback information whenever more than one marking is received in one
RTT. This document specifies an experimental scheme to provide more than
one feedback signal per RTT in the TCP header. Given TCP header space is
scarce, it overloads the three existing ECN-related flags in the TCP
header and provides additional information in a new TCP option.</t>
</abstract>
</front>
<!-- ================================================================ -->
<middle>
<!-- ================================================================ -->
<section anchor="accecn_Introduction" title="Introduction">
<t>Explicit Congestion Notification (ECN) <xref target="RFC3168"/> is a
mechanism where network nodes can mark IP packets instead of dropping
them to indicate incipient congestion to the end-points. Receivers with
an ECN-capable transport protocol feed back this information to the
sender. ECN is specified for TCP in such a way that only one feedback
signal can be transmitted per Round-Trip Time (RTT). Recently proposed
mechanisms such as Congestion Exposure (ConEx <xref
target="I-D.ietf-conex-abstract-mech"/>) or DCTCP <xref
target="I-D.bensley-tcpm-dctcp"/> need more accurate ECN feedback
information whenever more than one marking is received in one RTT. A
fuller treatment of the motivation for this specification is given in
the associated requirements document <xref target="RFC7560"/>.</t>
<t>This document specifies an experimental scheme for ECN feedback in
the TCP header to provide more than one feedback signal per RTT. It will
be called the more accurate ECN feedback scheme, or AccECN for short. If
AccECN progresses from experimental to the standards track, it is
intended to be a complete replacement for classic ECN feedback, not a
fork in the design of TCP. Thus, the applicability of AccECN is intended
to include all public and private IP networks (and even any non-IP
networks over which TCP is used today). Until the AccECN experiment
succeeds, <xref target="RFC3168"/> will remain as the standards track
specification for adding ECN to TCP. To avoid confusion, in this
document we use the term 'classic ECN' for the pre-existing ECN
specification <xref target="RFC3168"/>.</t>
<t>AccECN is solely an (experimental) change to the TCP wire protocol.
It is completely independent of how TCP might respond to congestion
feedback. This specification overloads flags and fields in the main TCP
header with new definitions, so both ends have to support the new wire
protocol before it can be used. Therefore during the TCP handshake the
two ends use the three ECN-related flags in the TCP header to negotiate
the most advanced feedback protocol that they can both support.</t>
<t>It is likely (but not required) that the AccECN protocol will be
implemented along with the following experimental additions to the
TCP-ECN protocol: ECN-capable SYN/ACK <xref target="RFC5562"/>, ECN
path-probing and fall-back <xref
target="I-D.kuehlewind-tcpm-ecn-fallback"/> and testing receiver
non-compliance <xref target="I-D.moncaster-tcpm-rcv-cheat"/>.</t>
<section title="Document Roadmap">
<t>The following introductory sections outline the goals of AccECN
(<xref target="accecn_Goals"/>) and the goal of experiments with ECN
(<xref target="accecn_Expt_Goals"/>) so that it is clear what success
would look like. Then terminology is defined (<xref
target="accecn_Terminology"/>) and a recap of existing prerequisite
technology is given (<xref target="accecn_Recap"/>).</t>
<t><xref target="accecn_Overview"/> gives an informative overview of
the AccECN protocol. Then <xref target="accecn_Spec"/> gives the
normative protocol specification. <xref
target="accecn_Interact_Variants"/> assesses the interaction of AccECN
with commonly used variants of TCP, whether standardised or not. <xref
target="accecn_Properties"/> summarises the features and properties of
AccECN.</t>
<t><xref target="accecn_IANA_Considerations"/> summarises the protocol
fields and numbers that IANA will need to assign and <xref
target="accecn_Security_Considerations"/> points to the aspects of the
protocol that will be of interest to the security community.</t>
<t><xref target="accecn_Algo_Examples"/> gives pseudocode examples for
the various algorithms that AccECN uses.</t>
<!-- <t>Three further appendices are included for use during document development {Delete this list before publication}:<list style="symbols">
<t><xref target="accecn_Alt_Designs"/>: Protocol design
alternatives that could be considered for inclusion in the main
specification;</t>
<t><xref target="accecn_Open_Issues"/>: a 'To Do' list of open
protocol design issues;</t>
<t><xref target="accecn_Doc_Changes"/>: Document change log.</t>
</list></t>-->
</section>
<section anchor="accecn_Goals" title="Goals">
<t><xref target="RFC7560"/> enumerates requirements that a candidate
feedback scheme will need to satisfy, under the headings: resilience,
timeliness, integrity, accuracy (including ordering and lack of bias),
complexity, overhead and compatibility (both backward and forward). It
recognises that a perfect scheme that fully satisfies all the
requirements is unlikely and trade-offs between requirements are
likely. <xref target="accecn_Properties"/> presents the properties of
AccECN against these requirements and discusses the trade-offs
made.</t>
<t>The requirements document recognises that a protocol as ubiquitous
as TCP needs to be able to serve as-yet-unspecified requirements.
Therefore an AccECN receiver aims to act as a generic (dumb) reflector
of congestion information so that in future new sender behaviours can
be deployed unilaterally.</t>
</section>
<section anchor="accecn_Expt_Goals" title="Experiment Goals">
<t>TCP is critical to the robust functioning of the Internet, so
any proposed modifications to TCP need to be thoroughly
tested. The present specification describes an experimental protocol
that adds more accurate ECN feedback to the TCP protocol. The
intention is to specify the protocol sufficiently so that more than
one implementation can be built in order to test its function,
robustness and interoperability (with itself and with previous versions
of ECN and TCP).</t>
<t><!-- <list style="hanging">
<t hangText="Success criteria: "> -->The experimental protocol
will be considered successful if it satisfies the requirements of
<xref target="RFC7560"/> in the consensus opinion of the IETF tcpm
working group. In short, this requires that it improves the accuracy
and timeliness of TCP's ECN feedback, as claimed in <xref
target="accecn_Properties"/>, while striking a balance between the
conflicting requirements of resilience, integrity and minimisation of
overhead. It also requires that it is not unduly complex, and that it
is compatible with prevalent equipment behaviours in the current
Internet, whether or not they comply with standards.</t>
<!-- TODO: add paragraph to say that it will be moved to standard track
if deployed, e.g. in the Linux kernel or used by DCTCP in data centers...? -->
<!-- <t hangText="Duration: ">To be credible, the experiment will need
to last at least 12 months from publication of the present
specification. At that time, a report on the experiment will be
written up. If successful, it would then be appropriate to work on
a standards track specification that adds more accurate ECN
feedback to TCP.</t>
</list></t>
ToDo: Why is this timescale point commented out? It means that, if successful after 12 months, the chairs cannot delay moving from expt to stds track. But we are still allowed to take longer to be successful.
-->
</section>
<section anchor="accecn_Terminology" title="Terminology">
<t><list style="hanging">
<t hangText="AccECN:">The more accurate ECN feedback scheme will
be called AccECN for short.</t>
<t hangText="Classic ECN:">the ECN protocol specified in <xref
target="RFC3168"/>.</t>
<t hangText="Classic ECN feedback:">the feedback aspect of the ECN
protocol specified in <xref target="RFC3168"/>, including
generation, encoding, transmission and decoding of feedback, but
not the Data Sender's subsequent response to that feedback.</t>
<t hangText="ACK:">A TCP acknowledgement, with or without a data
payload.</t>
<t hangText="Pure ACK:">A TCP acknowledgement without a data
payload.</t>
<!-- <t hangText="SupAccECN:">The Supplementary Accurate ECN field that
provides additional resilience as well as information about the
ordering of ECN markings covered by a delayed ACK.</t> -->
<t hangText="TCP client:">The TCP stack that originates a
connection.</t>
<t hangText="TCP server:">The TCP stack that responds to a
connection request.</t>
<t hangText="Data Receiver:">The endpoint of a TCP half-connection
that receives data and sends AccECN feedback.</t>
<t hangText="Data Sender:">The endpoint of a TCP half-connection
that sends data and receives AccECN feedback.</t>
<!-- <t
hangText="Outgoing AccECN Protocol Handler (or, Outgoing Protocol Handler):">The
protocol handler at the Data Receiver that marshals the AccECN
fields when sending an ACK.</t>
<t
hangText="Incoming AccECN Protocol Handler (or, Incoming Protocol Handler):">The
protocol handler at the Data Sender that reads the AccECN fields
when receiving an ACK.</t> -->
</list></t>
<t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in <xref
target="RFC2119">RFC 2119</xref>.</t>
</section>
<section anchor="accecn_Recap"
title="Recap of Existing ECN feedback in IP/TCP">
<t>ECN <xref target="RFC3168"/> uses two bits in the IP header. Once
ECN has been negotiated with the receiver at the transport layer, an
ECN sender can set two possible codepoints (ECT(0) or ECT(1)) in the
IP header to indicate an ECN-capable transport (ECT). <!-- It is
prohibited from doing so unless it has checked that the receiver will
understand ECN and be able to feed it back.--> If both ECN bits are
zero, the packet is considered to have been sent by a Not-ECN-capable
Transport (Not-ECT). When a network node experiences congestion, it
will occasionally either drop or mark a packet, with the choice
depending on the packet's ECN codepoint. If the codepoint is Not-ECT,
only drop is appropriate. If the codepoint is ECT(0) or ECT(1), the
node can mark the packet by setting both ECN bits, which is termed
'Congestion Experienced' (CE), or loosely a 'congestion mark'. <xref
target="accecn_Tab_ECN"/> summarises these codepoints.</t>
<texttable anchor="accecn_Tab_ECN"
title="The ECN Field in the IP Header">
<ttcol>IP-ECN codepoint (binary)</ttcol>
<ttcol>Codepoint name</ttcol>
<ttcol>Description</ttcol>
<c>00</c>
<c>Not-ECT</c>
<c>Not ECN-Capable Transport</c>
<c>01</c>
<c>ECT(1)</c>
<c>ECN-Capable Transport (1)</c>
<c>10</c>
<c>ECT(0)</c>
<c>ECN-Capable Transport (0)</c>
<c>11</c>
<c>CE</c>
<c>Congestion Experienced</c>
</texttable>
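<t>As a non-normative illustration, the codepoints in the table above
can be sketched in Python as follows. The names EcnCodepoint,
ecn_from_tos and mark_or_drop are hypothetical and chosen only for
this example. The sketch assumes the ECN field occupies the two
least significant bits of the former IPv4 Type of Service octet (or
IPv6 Traffic Class octet), as defined in <xref target="RFC3168"/>.</t>
<figure>
<artwork><![CDATA[
```python
from enum import IntEnum


class EcnCodepoint(IntEnum):
    """IP-ECN codepoints (2-bit field), as tabulated above."""
    NOT_ECT = 0b00  # Not ECN-Capable Transport
    ECT_1 = 0b01    # ECN-Capable Transport (1)
    ECT_0 = 0b10    # ECN-Capable Transport (0)
    CE = 0b11       # Congestion Experienced


def ecn_from_tos(tos_octet: int) -> EcnCodepoint:
    """Return the ECN codepoint held in the two low-order bits."""
    return EcnCodepoint(tos_octet & 0b11)


def mark_or_drop(codepoint: EcnCodepoint):
    """Choice open to a congested node for one packet.

    Returns the new codepoint if the packet can be marked, or None
    if drop is the only appropriate signal (Not-ECT).
    """
    if codepoint is EcnCodepoint.NOT_ECT:
        return None
    return EcnCodepoint.CE
```
]]></artwork>
</figure>
<t>For instance, an octet of 0xB9 ends in binary 01, so ecn_from_tos
would report ECT(1), and a congested node could mark such a packet
as CE rather than drop it.</t>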
<t>In the TCP header the first two bits in byte 14 are defined as
flags for the use of ECN (CWR and ECE in <xref
target="accecn_Fig_TCPHdr"/> <xref target="RFC3168"/>). A TCP client
indicates it supports ECN by setting ECE=CWR=1 in the SYN, and an
ECN-enabled server confirms ECN support by setting ECE=1 and CWR=0 in
the SYN/ACK. On reception of a CE-marked packet at the IP layer, the
Data Receiver starts to set the Echo Congestion Experienced (ECE) flag
continuously in the TCP header of ACKs, which ensures the signal is
received reliably even if ACKs are lost. The TCP sender confirms that
it has received at least one ECE signal by responding with the
congestion window reduced (CWR) flag, which allows the TCP receiver to
stop repeating the ECN-Echo flag. This always leads to a full RTT of
ACKs with ECE set. Thus any additional CE markings arriving within
this RTT cannot be fed back.</t>
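<t>The limitation just described can be illustrated with a minimal,
non-normative Python sketch of the classic ECE latch at the Data
Receiver. The class and function names are hypothetical, and the
sketch ignores everything in <xref target="RFC3168"/> other than the
ECE and CWR interaction.</t>
<figure>
<artwork><![CDATA[
```python
class ClassicEcnReceiver:
    """Sketch of the classic ECN ECE latch at a Data Receiver."""

    def __init__(self):
        self.ece_latched = False

    def on_data_packet(self, ce_marked: bool, cwr_flag: bool) -> bool:
        """Process one arriving segment; return ECE for its ACK."""
        if cwr_flag:
            # The sender confirmed it saw ECE, so stop echoing.
            self.ece_latched = False
        if ce_marked:
            # Start (or continue) echoing congestion on every ACK.
            self.ece_latched = True
        return self.ece_latched


def count_ece_episodes(ece_flags):
    """Count runs of consecutive ECE=1 ACKs (distinct signals)."""
    episodes = 0
    previous = False
    for flag in ece_flags:
        if flag and not previous:
            episodes += 1
        previous = flag
    return episodes
```
]]></artwork>
</figure>
<t>If five segments arrive within one RTT and three of them are CE
marked, the receiver in this sketch sets ECE on every ACK from the
first mark onwards. The Data Sender therefore sees a single
unbroken run of ECE and cannot tell that three marks arrived rather
than one.</t>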
<t>The ECN Nonce <xref target="RFC3540"/> is an optional experimental
addition to ECN that the TCP sender can use to protect against
accidental or malicious concealment of marked or dropped packets. The
sender can send an ECN nonce, which is a continuous pseudo-random
pattern of ECT(0) and ECT(1) codepoints in the ECN field. The receiver
is required to feed back a 1-bit nonce sum that counts the occurrence
of ECT(1) packets using the last bit of byte 13 in the TCP header,
which is defined as the Nonce Sum (NS) flag.</t>
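<t>A minimal, non-normative sketch of the 1-bit sum at the receiver
is given below. The function name nonce_sum is hypothetical, and the
sketch deliberately omits the handling of CE marks and losses that
<xref target="RFC3540"/> specifies.</t>
<figure>
<artwork><![CDATA[
```python
def nonce_sum(codepoints):
    """Return the 1-bit sum (modulo 2) of ECT(1) arrivals.

    `codepoints` is an iterable of 2-bit IP-ECN values, where
    0b01 is ECT(1) and 0b10 is ECT(0).
    """
    ns = 0
    for cp in codepoints:
        if cp == 0b01:
            ns ^= 1  # a 1-bit counter: addition modulo 2
    return ns
```
]]></artwork>
</figure>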
<?rfc needLines="8" ?>
<figure align="center" anchor="accecn_Fig_TCPHdr"
title="The (post-ECN Nonce) definition of the TCP header flags">
<artwork align="center"><![CDATA[
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| | | N | C | E | U | A | P | R | S | F |
| Header Length | Reserved | S | W | C | R | C | S | S | Y | I |
| | | | R | E | G | K | H | T | N | N |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
]]></artwork>
</figure>
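<t>As a non-normative illustration of the layout in the figure above,
the three ECN-related flags could be read from a raw TCP header as
sketched below. The function name parse_ecn_flags is hypothetical.
Note that Python indexes bytes from zero, so the "byte 13" and
"byte 14" of the text correspond to indices 12 and 13.</t>
<figure>
<artwork><![CDATA[
```python
def parse_ecn_flags(tcp_header: bytes) -> dict:
    """Extract NS, CWR and ECE from a raw TCP header.

    Indices are zero-based: the text's byte 13 is tcp_header[12]
    and its byte 14 is tcp_header[13].
    """
    if len(tcp_header) < 14:
        raise ValueError("need at least 14 bytes of TCP header")
    return {
        "NS": tcp_header[12] & 0x01,          # last bit of byte 13
        "CWR": (tcp_header[13] >> 7) & 0x01,  # first bit of byte 14
        "ECE": (tcp_header[13] >> 6) & 0x01,  # second bit of byte 14
    }
```
]]></artwork>
</figure>
<t>For example, a flags octet of 0x52 (ECE, ACK and SYN set, CWR
clear) with NS clear is what a classic ECN server would send on its
SYN/ACK.</t>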
</section>
</section>
<!-- ================================================================ -->
<section anchor="accecn_Overview"
title="AccECN Protocol Overview and Rationale">
<t>This section provides an informative overview of the AccECN protocol
that will be normatively specified in <xref target="accecn_Spec"/>.</t>
<t>Like the original TCP approach, the Data Receiver of each TCP
half-connection sends AccECN feedback to the Data Sender on TCP
acknowledgements, reusing data packets of the other half-connection
whenever possible.</t>
<!--<section title="Essential and Supplementary Parts">-->
<t>The AccECN protocol has had to be designed in two parts:<list
style="symbols">
<t>an essential part that re-uses ECN TCP header bits to feed back
the number of arriving CE marked packets. This provides more
accuracy than classic ECN feedback, but limited resilience against
ACK loss;</t>
<t>a supplementary part using a new AccECN TCP Option that provides
additional feedback on the number of bytes that arrive marked with
each of the three ECN codepoints (not just CE marks). This provides
greater resilience against ACK loss than the essential feedback, but
it is more likely to suffer from middlebox interference. <!-- <t>a supplementary part that serves three functions:<list
style="symbols">
<t>it greatly improves the resilience of AccECN feedback
information against loss of ACKs;</t>
<t>it provides information about the order in which ECN
markings in the IP header arrived at the Data Receiver;</t>
<t>it improves the timeliness of AccECN feedback when a
delayed ACK covers multiple congestion signals.</t>
</list> --></t>
</list>The two-part design was necessary, given limitations on the
space available for TCP options and given the possibility that certain
incorrectly designed middleboxes prevent TCP using any new options.</t>
<t>The essential part overloads the previous definition of the three
flags in the TCP header that had been assigned for use by ECN. This
design choice deliberately replaces the classic ECN feedback protocol,
rather than leaving classic ECN feedback intact and adding more accurate
feedback separately because:<list style="symbols">
<t>this efficiently reuses scarce TCP header space, given TCP option
space is approaching saturation;</t>
<t>a single upgrade path for the TCP protocol is preferable to a
fork in the design;</t>
<t>otherwise classic and accurate ECN feedback could give
conflicting feedback on the same segment, which could open up new
security concerns and make implementations unnecessarily
complex;</t>
<t>middleboxes are more likely to faithfully forward the TCP ECN
flags than newly defined areas of the TCP header.</t>
</list></t>
<t>AccECN is designed to work even if the supplementary part is removed
or zeroed out, as long as the essential part gets through.</t>
<section title="Capability Negotiation">
<t>AccECN is a change to the wire protocol of the main TCP header, so
it can only be used if both endpoints have been upgraded to
understand it. The TCP client signals support for AccECN on the
initial SYN of a connection and the TCP server signals whether it
supports AccECN on the SYN/ACK. The TCP flags on the SYN that the
client uses to signal AccECN support have been carefully chosen so
that a TCP server will interpret them as a request to support the most
recent variant of ECN feedback that it supports. Then the client falls
back to the same variant of ECN feedback.</t>
<t>An AccECN TCP client does not send the new AccECN Option on the SYN
as SYN option space is limited and successful negotiation using the
flags in the main header is taken as sufficient evidence that both
ends also support the AccECN Option. The TCP server sends the AccECN
Option on the SYN/ACK and the client sends it on the first ACK to test
whether the network path forwards the option correctly.</t>
</section>
<section title="Feedback Mechanism">
<t>A Data Receiver maintains four counters initialised at the start of
the half-connection. Three count the number of arriving payload bytes
marked CE, ECT(1) and ECT(0) respectively. The fourth counts the
number of packets arriving marked with a CE codepoint (including
control packets without payload if they are CE-marked).</t>
<t>The Data Sender maintains four equivalent counters for the half
connection, and the AccECN protocol is designed to ensure they will
match the values in the Data Receiver's counters, albeit after a
little delay.</t>
<t>Each ACK carries the three least significant bits (LSBs) of the
packet-based CE counter using the ECN bits in the TCP header, now
renamed the Accurate ECN (ACE) field. The LSBs of each of the three
byte counters are carried in the AccECN Option.</t>
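<t>The counters described in this section can be sketched, purely
for illustration, as follows. The class name, attribute names and
the initial values of zero are assumptions made for this example
and are not taken from the normative specification.</t>
<figure>
<artwork><![CDATA[
```python
ACE_BITS = 3  # width of the ACE field in the TCP header


class AccEcnReceiverCounters:
    """Four per-half-connection counters kept by a Data Receiver.

    Initial values of zero are an assumption of this sketch.
    """

    def __init__(self):
        self.ce_bytes = 0    # payload bytes marked CE
        self.ect1_bytes = 0  # payload bytes marked ECT(1)
        self.ect0_bytes = 0  # payload bytes marked ECT(0)
        self.ce_packets = 0  # CE-marked packets, payload or not

    def on_packet(self, ecn: int, payload_len: int) -> None:
        """Update the counters for one arriving packet.

        `ecn` is the 2-bit IP-ECN codepoint (0b11 is CE,
        0b01 is ECT(1), 0b10 is ECT(0)).
        """
        if ecn == 0b11:
            self.ce_packets += 1
            self.ce_bytes += payload_len
        elif ecn == 0b01:
            self.ect1_bytes += payload_len
        elif ecn == 0b10:
            self.ect0_bytes += payload_len

    def ace_field(self) -> int:
        """Three least significant bits of the CE packet counter."""
        return self.ce_packets & ((1 << ACE_BITS) - 1)
```
]]></artwork>
</figure>
<t>After ten CE-marked packets, for instance, the ACE field in this
sketch would carry the value 2, because only the three least
significant bits of the packet counter fit in the field.</t>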
</section>
<section title="Delayed ACKs and Resilience Against ACK Loss">
<t>With both the ACE and the AccECN Option mechanisms, the Data
Receiver continually repeats the current LSBs of each of its
respective counters. Then, even if some ACKs are lost, the Data Sender
should be able to infer how much to increment its own counters, even
if the protocol field has wrapped.</t>
<t>The 3-bit ACE field can wrap fairly frequently. Therefore, even if
it appears to have incremented by one (say), the field might have
actually cycled completely then incremented by one. The Data Receiver
is required not to delay sending an ACK to such an extent that the ACE
field would cycle. However, cycling is still a possibility at the Data
Sender because a whole sequence of ACKs carrying intervening values of
the field might all be lost or delayed in transit.</t>
<!-- "Further, if the lost ACKs included no payload they would never be retransmitted." Commented out, because even data ACks would be retransmitted with a different ACE field anyway.-->
<t>The fields in the AccECN Option are larger, but they will increment
in larger steps because they count bytes not packets. Nonetheless,
their size has been chosen such that a whole cycle of the field would
never occur between ACKs unless there had been an infeasibly long
sequence of ACK losses. Therefore, as long as the AccECN Option is
available, it can be treated as a dependable feedback channel.</t>
<t>If the AccECN Option is not available, e.g. it is being stripped by
a middlebox, the AccECN protocol will only feed back information on CE
markings (using the ACE field). Although not ideal, this will be
sufficient, because it is envisaged that neither ECT(0) nor ECT(1)
will ever indicate more severe congestion than CE, even though future
uses for ECT(0) or ECT(1) are still unclear. Because the 3-bit ACE
field is so small, when it is the only field available the Data Sender
has to interpret it conservatively assuming the worst possible
wrap.</t>
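<t>One way to picture the conservative interpretation is the
following non-normative sketch. It is not the algorithm of this
specification. It assumes that each newly acknowledged packet
carried at most one CE mark, so the increase in the CE packet count
cannot exceed the number of newly acknowledged packets. The
function and parameter names are hypothetical.</t>
<figure>
<artwork><![CDATA[
```python
ACE_MODULUS = 8  # the 3-bit ACE field wraps every 8 CE packets


def min_ce_delta(prev_ace: int, new_ace: int) -> int:
    """Smallest increase in the CE count that explains new_ace."""
    return (new_ace - prev_ace) % ACE_MODULUS


def worst_case_ce_delta(prev_ace: int, new_ace: int,
                        newly_acked_pkts: int) -> int:
    """Largest plausible increase, allowing whole wraps of ACE.

    Assumption of this sketch: at most one CE mark per newly
    acknowledged packet, so the result never exceeds
    newly_acked_pkts unless the minimal delta already does.
    """
    delta = min_ce_delta(prev_ace, new_ace)
    if newly_acked_pkts < delta:
        return delta
    # Largest value not above newly_acked_pkts that is congruent
    # to delta modulo 8.
    excess = (newly_acked_pkts - delta) % ACE_MODULUS
    return newly_acked_pkts - excess
```
]]></artwork>
</figure>
<t>For example, if the ACE field moves from 6 to 1 on an ACK that
newly acknowledges 12 packets, the minimal interpretation is 3 CE
marks, whereas the conservative interpretation in this sketch is
11, because the field could have wrapped once.</t>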
<t>Certain specified events trigger the Data Receiver to include an
AccECN Option on an ACK. The rules are designed to ensure that the
order in which different markings arrive at the receiver is
communicated to the sender (as long as there is no ACK loss).
Implementations are encouraged to send an AccECN Option more
frequently, but this is left up to the implementer.</t>
<!--As one ACK might acknowledge multiple data segments at the same time the
proposed scheme providing accumulated information does not preserve the
order at which the marking were received.This decision was taken
deliberately to reduce complexity.-->
</section>
<section title="Feedback Metrics">
<t>The CE packet counter in the ACE field and the CE byte counter in
the AccECN Option both provide feedback on received CE-marks. The CE
packet counter includes control packets that do not have payload data,
while the CE byte counter solely includes marked payload bytes. If
both are present, the byte counter in the option will provide the more
accurate information needed for modern congestion control and policing
schemes, such as DCTCP or ConEx. If the option is stripped, a simple
algorithm to estimate the number of marked bytes from the ACE field is
given in <xref target="accecn_Algo_ACE_Bytes"/>.</t>
<t>Feedback in bytes is recommended in order to protect against the
receiver using attacks similar to 'ACK-Division' to artificially
inflate the congestion window, which is why <xref target="RFC5681"/>
now recommends that TCP counts acknowledged bytes not packets.</t>
</section>
<section anchor="accecn_demb_reflector" title="Generic (Dumb) Reflector">
<t>The ACE field provides information about CE markings on both data
and control packets. According to <xref target="RFC3168"/> the Data
Sender is meant to set control packets to Not-ECT. However, mechanisms
in certain private networks (e.g. data centres) set control packets to
be ECN capable because they are precisely the packets that performance
depends on most.</t>
<t>For this reason, AccECN is designed to be a generic reflector of
whatever ECN markings it sees, whether or not they are compliant with
a current standard. Then as standards evolve, Data Senders can upgrade
unilaterally without any need for receivers to upgrade too. It is also
useful to be able to rely on generic reflection behaviour when senders
need to test for unexpected interference with markings (for instance
<xref target="I-D.kuehlewind-tcpm-ecn-fallback"/> and <xref
target="I-D.moncaster-tcpm-rcv-cheat"/>).</t>
<t>The initial SYN is the most critical control packet, so AccECN
provides feedback on whether it is CE marked, even though it is not
allowed to be ECN-capable according to RFC 3168. However, middleboxes
have been known to overwrite the ECN IP field as if it is still part
of the old Type of Service (ToS) field. If a TCP client has set the
SYN to Not-ECT, but receives CE feedback, it can detect such middlebox
interference and send Not-ECT for the rest of the connection (see
<xref target="I-D.kuehlewind-tcpm-ecn-fallback"/> for the detailed
fall-back behaviour).</t>
<t>Today, if a TCP server receives CE on a SYN, it cannot know whether
it is invalid (or valid) because only the TCP client knows whether it
originally marked the SYN as Not-ECT (or ECT). Therefore, the server's
only safe course of action is to disable ECN for the connection.
Instead, the AccECN protocol allows the server to feed back the CE
marking to the client, which then has all the information to decide
whether the connection has to fall-back from supporting ECN (or
not).</t>
<t>Providing feedback of CE marking on the SYN also supports future
scenarios in which SYNs might be ECN-enabled (without prejudging
whether they ought to be). For instance, in certain environments such
as data centres, it might be appropriate to allow ECN-capable SYNs.
Then, if feedback showed the SYN had been CE marked, the TCP client
could reduce its initial window (IW). It could also reduce IW
conservatively if feedback showed the receiver did not support ECN
(because if there had been a CE marking, the receiver would not have
understood it). Note that this text merely motivates dumb reflection
of CE on a SYN; it does not judge whether a SYN ought to be
ECN-capable.</t>
</section>
</section>
<!-- ================================================================ -->
<section anchor="accecn_Spec" title="AccECN Protocol Specification">
<section anchor="accecn_Negotiation"
title="Negotiation during the TCP handshake">
<t>During the TCP handshake at the start of a connection, to request
more accurate ECN feedback the TCP client (host A) MUST set the TCP
flags NS=1, CWR=1 and ECE=1 in the initial SYN segment.</t>
<t>If a TCP server (B) that is AccECN enabled receives a SYN with the
above three flags set, it MUST set both its half connections into
AccECN mode. Then it MUST set the flags CWR=1 and ECE=0 on its
response in the SYN/ACK segment to confirm that it supports AccECN.
The TCP server MUST NOT set this combination of flags unless the
preceding SYN requested support for AccECN as above.</t>
<t>A TCP server in AccECN mode MUST additionally set the flag NS=1 on
the SYN/ACK if the SYN was CE-marked (see <xref
target="accecn_demb_reflector"/>). If the received SYN was Not-ECT,
ECT(0) or ECT(1), it MUST clear NS (NS=0) on the SYN/ACK.</t>
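<t>The server-side rules above can be summarised in the following
non-normative sketch. The function name and the dictionary
representation of the flags are assumptions of this example. A SYN
that does not request AccECN is outside its scope, so it simply
returns None in that case.</t>
<figure>
<artwork><![CDATA[
```python
def accecn_synack_flags(syn_ns, syn_cwr, syn_ece, syn_ip_ecn):
    """Flags an AccECN-enabled server sets on the SYN/ACK.

    `syn_ip_ecn` is the 2-bit IP-ECN codepoint seen on the SYN
    (0b11 is CE). Returns a dict with NS, CWR and ECE, or None if
    the SYN did not request AccECN.
    """
    if not (syn_ns == 1 and syn_cwr == 1 and syn_ece == 1):
        return None
    ce_on_syn = syn_ip_ecn == 0b11
    # CWR=1, ECE=0 confirms AccECN; NS reflects CE on the SYN.
    return {"NS": 1 if ce_on_syn else 0, "CWR": 1, "ECE": 0}
```
]]></artwork>
</figure>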
<!--Bob: Out of scope: move to fall-back draft.-->
<!--If the sending host (A) indicated AccECN support, the receiving host (B) may set the IP ECN field of the SYN/ACK to ECT.
<t>If the SYN was ECT and the SYN/ACK indicates that a CE mark was received
(NS=1), the originating host (A) MUST react to this congestion
indication e.g. by selecting a lower initial sending window.</t>
<t>If the SYN was ECT marked, but the receiving host is not AccECN enabled
(ECE=0 and CWR=0 in SYN/ACK), the originating host (A) SHOULD conservatively
reduce its initial window as if the SYN had been CE-marked.</t> -->
<t>Once a TCP client (A) has sent the above SYN to declare that it
supports AccECN, and once it has received the above SYN/ACK segment
that confirms that the TCP server supports AccECN, the TCP client MUST
set both its half connections into AccECN mode.</t>
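<t>From the client's side, the outcome of the handshake might be
derived as in the non-normative sketch below. It covers only the two
SYN/ACK responses described so far in this document (CWR=1, ECE=0
for AccECN and, from the recap of classic ECN, ECE=1, CWR=0). As a
simplifying assumption, every other combination is treated as
Not-ECN. The function names are hypothetical.</t>
<figure>
<artwork><![CDATA[
```python
def client_feedback_mode(synack_cwr, synack_ece):
    """Mode chosen by a client that sent NS=CWR=ECE=1 on its SYN."""
    if synack_cwr == 1 and synack_ece == 0:
        return "AccECN"
    if synack_cwr == 0 and synack_ece == 1:
        return "classic ECN"
    # Simplifying assumption: anything else disables ECN feedback.
    return "Not ECN"


def syn_was_ce_marked(synack_ns):
    """In AccECN mode, NS=1 on the SYN/ACK reports CE on the SYN."""
    return synack_ns == 1
```
]]></artwork>
</figure>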
<!--Bob: Out of scope: move to fall-back draft.
<t> In AccECN mode the originating host (A) MAY set the IP ECN field to
ECT in the first ACK that finalizes the three way handshake (3WSH).
E.g. to test ECN support of the path, setting the SYN/ACK as well as
the first ACK to ECT allows each end to determine as soon as possible
whether the path passes ECT or a middlebox bleaches or overwrites the
IP ECN field.</t>
-->
<t>If after the normal TCP timeout the TCP client has not received a
SYN/ACK to acknowledge its SYN, the SYN might just have been lost,
e.g. due to congestion, or a middlebox might be blocking segments with
the AccECN flags. To expedite connection setup, the host SHOULD fall
back to NS=CWR=ECE=0 on the retransmission of the SYN. It would make
sense to also remove any other experimental fields or options on the
SYN in case a middlebox might be blocking them, although the required
behaviour will depend on the specification of the other option(s) and
any attempt to co-ordinate fall-back between different modules of the
stack. Implementers MAY use other fall-back strategies if they are
found to be more effective (e.g. attempting to retransmit a second
AccECN segment before fall-back, falling back to classic ECN feedback
rather than non-ECN, and/or caching the result of a previous attempt
to access the same host while negotiating AccECN).</t>
<t>The fall-back procedure if the TCP server receives no ACK to
acknowledge a SYN/ACK that tried to negotiate AccECN is specified in
<xref target="accecn_Mbox_Interference"/>.</t>
<t>The three flags set to 1 to indicate AccECN support on the SYN have
been carefully chosen to enable natural fall-back to prior stages in
the evolution of ECN. <xref target="accecn_Tab_Negotiation"/>
tabulates all the negotiation possibilities for ECN-related
capabilities that involve at least one AccECN-capable host. To
compress the width of the table, the headings of the first four
columns have been severely abbreviated, as follows: <list
hangIndent="4" style="hanging">
<t hangText="Ac:">More *Ac*curate ECN Feedback</t>
<t hangText="N:">ECN-*N*once <xref target="RFC3540"/></t>
<t hangText="E:">*E*CN <xref target="RFC3168"/></t>
<t hangText="I:">Not-ECN (*I*mplicit congestion notification using
packet drop).</t>
</list></t>
<!-- <?rfc needLines="22" ?> -->
<texttable align="center" anchor="accecn_Tab_Negotiation"
title="ECN capability negotiation between Originator (A) and Responder (B)">
<ttcol align="left">Ac</ttcol>
<ttcol align="center">N</ttcol>
<ttcol align="center">E</ttcol>
<ttcol align="center">I</ttcol>
<ttcol align="center">SYN A->B</ttcol>
<ttcol align="center">SYN/ACK B->A</ttcol>
<ttcol align="left">Feedback Mode</ttcol>
<c/>
<c/>
<c/>
<c/>
<c>NS CWR ECE</c>
<c>NS CWR ECE</c>
<c/>
<c>AB</c>
<c/>
<c/>
<c/>
<c>1 1 1</c>
<c>0 1 0</c>
<c>AccECN</c>
<!-- new -->
<c>AB</c>
<c/>
<c/>
<c/>
<c>1 1 1</c>
<c>1 1 0</c>
<c>AccECN (CE on SYN)</c>
<!-- new end -->
<c/>
<c/>
<c/>
<c/>
<c/>
<c/>
<c/>
<c>A</c>
<c>B</c>
<c/>
<c/>
<c>1 1 1</c>
<c>1 0 1</c>
<c>classic ECN</c>
<c>A</c>
<c/>
<c>B</c>
<c/>
<c>1 1 1</c>
<c>0 0 1</c>
<c>classic ECN</c>
<c>A</c>
<c/>
<c/>
<c>B</c>
<c>1 1 1</c>
<c>0 0 0</c>
<c>Not ECN</c>
<c/>
<c/>
<c/>
<c/>
<c/>
<c/>
<c/>
<c>B</c>
<c>A</c>
<c/>
<c/>
<c>0 1 1</c>
<c>0 0 1</c>
<c>classic ECN</c>
<c>B</c>
<c/>
<c>A</c>
<c/>
<c>0 1 1</c>
<c>0 0 1</c>
<c>classic ECN</c>
<c>B</c>
<c/>
<c/>
<c>A</c>
<c>0 0 0</c>
<c>0 0 0</c>
<c>Not ECN</c>
<c/>
<c/>
<c/>
<c/>
<c/>
<c/>
<c/>
<!-- moved -->
<c>A</c>
<c/>
<c/>
<c>B</c>
<c>1 1 1</c>
<c>1 1 1</c>
<c>Not ECN (broken)</c>
<!-- moved end -->
<c>A</c>
<c/>
<c/>
<c/>
<c>1 1 1</c>
<c>0 1 1</c>
<c>Not ECN (see Appx <xref format="counter"
target="accecn_Alt_Designs"/>)</c>
<!-- changed to Not ECN instead of RSVD -->
<c>A</c>
<c/>
<c/>
<c/>
<c>1 1 1</c>
<c>1 0 0</c>
<!-- changed to Not ECN instead of RSVD -->
<c>Not ECN (see Appx <xref format="counter"
target="accecn_Alt_Designs"/>)</c>
</texttable>
<t><xref target="accecn_Tab_Negotiation"/> is divided into blocks each
separated by an empty row.<list style="numbers">
<t>The top block shows the case already described where both
endpoints support AccECN and how the TCP server (B) indicates
congestion feedback.</t>
<t>The second block shows the cases where the TCP client (A)
supports AccECN but the TCP server (B) supports some earlier
variant of TCP feedback, indicated in its SYN/ACK. Therefore, as
soon as an AccECN-capable TCP client (A) receives the SYN/ACK
shown it MUST set both its half connections into the feedback mode
shown in the rightmost column.</t>
<t>The third block shows the cases where the TCP server (B)
supports AccECN but the TCP client (A) supports some earlier
variant of TCP feedback, indicated in its SYN. Therefore, as soon
as an AccECN-enabled TCP server (B) receives the SYN shown, it
MUST set both its half connections into the feedback mode shown in
the rightmost column.</t>
<t>The fourth block displays combinations that are either not valid
or currently unused and therefore both ends MUST fall back to Not ECN
for both half connections. This applies in particular to the first
case (marked `broken'), where all bits set in the SYN are reflected
by the receiver in the SYN/ACK, which happens quite often if the TCP
connection is proxied. {ToDo: Consider using the last two cases for
AccECN f/b of ECT(0) and ECT(1) on the SYN (<xref
target="accecn_Alt_Designs"/>)}</t>
<!-- <t>Forward Compatibility: The fourth block enumerates the
remaining combinations of AccECN-related flags that are Reserved
for future use by AccECN ('Rsvd').<list style="symbols">
<t>If an originating AccECN host (A) sends NS=1, CWR=1 and
ECE=1 in the initial SYN segment and if it receives any of
these Reserved values in a SYN/ACK response, it MUST set both
its half connections into AccECN mode.</t>
<t>To comply with the present AccECN protocol, middleboxes
MUST forward these Rsvd combinations of flags unaltered (see
also <xref target="accecn_Mbox_Operation"/>).</t>
</list></t>-->
</list></t>
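<t>As an informal illustration of the first, second and fourth blocks
of the table, the following Python sketch maps the flags on a received
SYN/ACK to the feedback mode an AccECN-capable TCP client would enter
after sending a SYN with NS=CWR=ECE=1. The function and constant names
are hypothetical and are not defined by this specification.</t>
<figure>
<artwork><![CDATA[
```python
# Feedback modes, named after the rightmost column of the table.
ACCECN = "AccECN"
CLASSIC_ECN = "classic ECN"
NOT_ECN = "Not ECN"

# (NS, CWR, ECE) on the SYN/ACK -> mode, for a client whose SYN carried
# NS=CWR=ECE=1. Illustrative lookup only; names are assumptions.
_CLIENT_MODE = {
    (0, 1, 0): ACCECN,
    (1, 1, 0): ACCECN,       # server also reports a CE mark on the SYN
    (1, 0, 1): CLASSIC_ECN,
    (0, 0, 1): CLASSIC_ECN,
    (0, 0, 0): NOT_ECN,
    (1, 1, 1): NOT_ECN,      # SYN flags reflected back ('broken')
    (0, 1, 1): NOT_ECN,      # currently unused combination
    (1, 0, 0): NOT_ECN,      # currently unused combination
}


def client_feedback_mode(ns, cwr, ece):
    """Return the mode for both half connections at the client."""
    return _CLIENT_MODE[(ns, cwr, ece)]


def syn_was_ce_marked(ns, cwr, ece):
    """True if the SYN/ACK confirms AccECN and reports CE on the SYN."""
    return (ns, cwr, ece) == (1, 1, 0)
```
]]></artwork>
</figure>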
<t><!--The table is self-explanatory in most respects, but --> The
following exceptional cases need some explanation:<list
style="hanging">
<!-- <t hangText="Not ECN (broken):"> Broken TCP server implementations exist that reflect the
flags that were originally reserved in <xref target="RFC0793"/> back to the originator <xref target="RFC3168"/>.
Therefore, if the SYN/ACK reflects the same flag settings as the preceding
SYN, an AccECN client implementation MUST revert to Not-ECT.</t>
-->
<t hangText="ECN Nonce:">An AccECN implementation, whether client
or server, sender or receiver, does not need to implement the ECN
Nonce behaviour <xref target="RFC3540"/>. AccECN is compatible
with an alternative ECN feedback integrity approach that does not
use up the ECT(1) codepoint and can be implemented solely at the
sender (see <xref target="accecn_Integrity"/>).</t>
<t hangText="Simultaneous Open:">An originating AccECN Host (A),
having sent a SYN with NS=1, CWR=1 and ECE=1, might receive
another SYN from host B. Host A MUST then enter the same feedback
mode as it would have entered had it been a responding host and
received the same SYN. Then host A MUST send the same SYN/ACK as
it would have sent had it been a responding host (see the third
block above).</t>
</list></t>
</section>
<section anchor="accecn_feedback" title="AccECN Feedback">
<t>Each Data Receiver maintains four counters, r.cep, r.ceb, r.e0b and
r.e1b. The CE packet counter (r.cep), counts the number of packets the
host receives with the CE code point in the IP ECN field, including CE
marks on control packets without data. r.ceb, r.e0b and r.e1b count
the number of TCP payload bytes in packets marked respectively with
the CE, ECT(0) and ECT(1) codepoint in their IP-ECN field. When a host
first enters AccECN mode, it initialises its counters to r.cep = 6,
r.e0b = 1 and r.ceb = r.e1b = 0 (see <xref
target="accecn_Algo_Not-ECT"/>). Non-zero initial values are used to
be distinct from cases where the fields are incorrectly zeroed (e.g.
by middleboxes).</t>
<t>A host feeds back the CE packet counter using the Accurate ECN
(ACE) field, as explained in the next section. It feeds back all
the byte counters using the AccECN TCP Option, as specified in <xref
target="accecn_option"/>. Whenever a host feeds back the value of any
counter, it MUST report the most recent value, no matter whether it is
in a pure ACK, an ACK with new payload data or a retransmission.</t>
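<t>The counter behaviour described above can be sketched informally
in Python as follows. The class and method names are hypothetical;
only the counter names, their initial values and the IP-ECN codepoints
of <xref target="RFC3168"/> are taken from the specifications. The
sketch assumes the caller has already decided that the packet belongs
to the connection.</t>
<figure>
<artwork><![CDATA[
```python
from dataclasses import dataclass

# IP-ECN codepoints as defined in RFC 3168.
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11


@dataclass
class AccEcnCounters:
    """Data Receiver counters with the initial values given above."""
    cep: int = 6   # CE packet counter (includes packets without data)
    ceb: int = 0   # payload bytes in CE-marked packets
    e0b: int = 1   # payload bytes in ECT(0)-marked packets
    e1b: int = 0   # payload bytes in ECT(1)-marked packets

    def on_packet(self, ecn, payload_len):
        """Update the counters for one arriving packet."""
        if ecn == CE:
            self.cep += 1
            self.ceb += payload_len
        elif ecn == ECT0:
            self.e0b += payload_len
        elif ecn == ECT1:
            self.e1b += payload_len
        # Not-ECT packets do not change any counter.
```
]]></artwork>
</figure>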
<section anchor="accecn_ACE" title="The ACE Field">
<t>After AccECN has been negotiated on the SYN and SYN/ACK, both
hosts overload the three TCP flags ECE, CWR and NS in the main TCP
header as one 3-bit field. Then the field is given a new name, ACE,
as shown in <xref target="accecn_Fig_ACE_ACK"/>.</t>
<!-- <?rfc needLines="9" ?> -->
<figure align="center" anchor="accecn_Fig_ACE_ACK"
title="Definition of the ACE field within bytes 13 and 14 of the TCP Header (when AccECN has been negotiated and SYN=0).">
<artwork align="center"><![CDATA[
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| | | | U | A | P | R | S | F |
| Header Length | Reserved | ACE | R | C | S | S | Y | I |
| | | | G | K | H | T | N | N |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
]]></artwork>
</figure>
<t>The original definition of these three flags in the TCP header,
including the addition of support for the ECN Nonce, is shown for
comparison in <xref target="accecn_Fig_TCPHdr"/>. This specification
does not rename these three TCP flags; it merely overloads them with
another name and definition once an AccECN connection has been
established.</t>
<t>A host MUST interpret the ECE, CWR and NS flags as the 3-bit ACE
counter on a segment with SYN=0 that it sends or receives if both of
its half-connections are set into AccECN mode having successfully
negotiated AccECN (see <xref target="accecn_Negotiation"/>). A host
MUST NOT interpret the 3 flags as a 3-bit ACE field on any segment
with SYN=1 (whether ACK is 0 or 1), or if AccECN negotiation is
incomplete or has not succeeded.</t>
<t>Both parts of each of these conditions are equally important. For
instance, even if AccECN negotiation has been successful, the ACE
field is not defined on any segments with SYN=1 (e.g. a
retransmission of an unacknowledged SYN/ACK, or when both ends send
SYN/ACKs after AccECN support has been successfully negotiated
during a simultaneous open).</t>
<t>The ACE field encodes the three least significant bits of the
r.cep counter, therefore its initial value will be 0b110 (decimal
6). This non-zero initialization allows a TCP server to use a
stateless handshake (see <xref
target="accecn_Interaction_SYN_Cookies"/>) but still detect from the
TCP client's first ACK that the client considers it has successfully
negotiated AccECN. If the SYN/ACK was CE marked, the client MUST
increase its r.cep counter before it sends its first ACK, therefore
the initial value of the ACE field will be 0b111 (decimal 7). These
values have deliberately been chosen such that they are distinct
from <xref target="RFC5562"/> behaviour, where the TCP client would
set ECE on the first ACK as feedback for a CE mark on the
SYN/ACK.</t>
<t>If the value of the ACE field on the first segment with SYN=0 in
either direction is anything other than 0b110 or 0b111, the Data
Receiver MUST disable ECN for the remainder of the half-connection
by marking all subsequent packets as Not-ECT.</t>
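<t>A minimal Python sketch of the two rules above, using hypothetical
function names: the ACE value is the three least significant bits of
r.cep, and the first ACE value seen on a segment with SYN=0 is only
acceptable if it is 0b110 or 0b111.</t>
<figure>
<artwork><![CDATA[
```python
INITIAL_CEP = 6  # initial value of r.cep when entering AccECN mode


def ace_field(cep):
    """ACE carries the three least significant bits of r.cep."""
    return cep & 0b111


def first_ace_is_valid(ace):
    """Check the ACE value on the first segment with SYN=0.

    0b110 means no CE mark was seen during the handshake; 0b111 means
    one CE mark was counted (e.g. on the SYN/ACK). Any other value
    leads the Data Receiver to disable ECN on the half-connection.
    """
    return ace in (0b110, 0b111)
```
]]></artwork>
</figure>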
</section>
<section anchor="accecn_ACE_Safety"
title="Safety against Ambiguity of the ACE Field">
<t>If too many CE-marked segments are acknowledged at once, or if a
long run of ACKs is lost, the 3-bit counter in the ACE field might
have cycled between two ACKs arriving at the Data Sender.</t>
<t>Therefore an AccECN Data Receiver SHOULD immediately send an ACK
once 'n' CE marks have arrived since the previous ACK, where 'n'
SHOULD be 2 and MUST be no greater than 6.</t>
<t>If the Data Sender has not received AccECN TCP Options to give it
more dependable information, and it detects that the ACE field could
have cycled under the prevailing conditions, it SHOULD
conservatively assume that the counter did cycle. It can detect if
the counter could have cycled by using the jump in the
acknowledgement number since the last ACK to calculate or estimate
how many segments could have been acknowledged. An example algorithm
to implement this policy is given in <xref
target="accecn_Algo_ACE_Wrap"/>. An implementer MAY develop an
alternative algorithm as long as it satisfies these
requirements.</t>
<t>If missing acknowledgement numbers arrive later (reordering) and
prove that the counter did not cycle, the Data Sender MAY attempt to
neutralise the effect of any action it took based on a conservative
assumption that it later found to be incorrect.</t>
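<t>The conservative policy can be illustrated with the following
Python sketch. It is not the example algorithm of <xref
target="accecn_Algo_ACE_Wrap"/>, only one possible reading of the
text above. It assumes the Data Sender has already estimated the
number of newly acknowledged packets (the hypothetical parameter
newly_acked_pkts) from the jump in the acknowledgement number, and it
returns the largest CE count that is consistent with both that
estimate and the change in the ACE field.</t>
<figure>
<artwork><![CDATA[
```python
ACE_MODULUS = 8  # the ACE field is a 3-bit counter


def safe_ce_delta(prev_ace, new_ace, newly_acked_pkts):
    """Conservative estimate of CE marks since the previous ACK."""
    # Smallest increase that explains the change in the ACE field.
    delta = (new_ace - prev_ace) % ACE_MODULUS
    if newly_acked_pkts < delta:
        # CE marks on packets without data can exceed the estimate of
        # acknowledged data packets; trust the ACE field itself.
        return delta
    # Largest value <= newly_acked_pkts congruent to delta modulo 8,
    # i.e. assume the counter cycled as often as it could have.
    return newly_acked_pkts - ((newly_acked_pkts - delta) % ACE_MODULUS)
```
]]></artwork>
</figure>
<t>For instance, if the ACE field moves from 6 to 7 while an estimated
10 packets are newly acknowledged, the sketch assumes 9 CE marks
rather than 1.</t>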
</section>
<section anchor="accecn_option" title="The AccECN Option">
<t>The AccECN Option is defined as shown below in <xref
target="accecn_Fig_TCPopt"/>. It consists of three 24-bit fields
that provide the 24 least significant bits of the r.e0b, r.ceb and
r.e1b counters, respectively. The initial 'E' of each field name
stands for 'Echo'.</t>
<figure align="center" anchor="accecn_Fig_TCPopt"
title="The AccECN Option">
<artwork><![CDATA[ 0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Kind = TBD1 | Length = 11 | EE0B field |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| EE0B (cont'd) | ECEB field |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| EE1B field |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+]]></artwork>
</figure>
<t>The Data Receiver MUST set the Kind field to TBD1, which is
registered in <xref target="accecn_IANA_Considerations"/> as a new
TCP option Kind called AccECN. An experimental TCP option with
Kind=254 MAY be used for initial experiments, with magic number
0xACCE.</t>
<t><xref target="accecn_Algo_Option_Coding"/> gives an example
algorithm for the Data Receiver to encode its byte counters into the
AccECN Option, and for the Data Sender to decode the AccECN Option
fields into its byte counters.</t>
<t>Note that there is no field to feed back Not-ECT bytes.
Nonetheless an algorithm for the Data Sender to calculate the number
of payload bytes received as Not-ECT is given in <xref
target="accecn_Algo_Not-ECT"/>.</t>
<t>Whenever a Data Receiver sends an AccECN Option, the rules in
<xref target="accecn_option_usage"/> expect it to always send a
full-length option. To cope with option space limitations, it can
omit unchanged fields from the tail of the option, as long as it
preserves the order of the remaining fields and includes any field
that has changed. The length field MUST indicate which fields are
present as follows:<list style="hanging">
<t hangText="Length=11:">EE0B, ECEB, EE1B</t>
<t hangText="Length=8:">EE0B, ECEB</t>
<t hangText="Length=5:">EE0B</t>
<t hangText="Length=2:">(empty)</t>
</list></t>
<t>The empty option of Length=2 is provided to allow for a case
where an AccECN Option has to be sent (e.g. on the SYN/ACK to test
the path), but there is very limited space for the option. For
initial experiments, the Length field MUST be 2 greater to
accommodate the 16-bit magic number.</t>
<t>All implementations of a Data Sender MUST be able to read in
AccECN Options of any of the above lengths. They MUST ignore an
AccECN Option of any other length.</t>
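<t>The following Python sketch illustrates one way to encode and
decode the option in its experimental form (Kind=254, magic number
0xACCE, Length 2 greater than the values listed above). The function
names are hypothetical, and network byte order for the 24-bit fields
and for the magic number is an assumption of this sketch. It is not
the example algorithm of <xref target="accecn_Algo_Option_Coding"/>;
in particular, it does not reconstruct full counters from the 24-bit
fields.</t>
<figure>
<artwork><![CDATA[
```python
EXP_KIND = 254        # experimental TCP option Kind
MAGIC = 0xACCE        # 16-bit magic number for initial experiments
FIELD_MASK = 0xFFFFFF  # each field carries 24 least significant bits


def encode_accecn_option(e0b, ceb, e1b, nfields=3):
    """Build the option with the first nfields of EE0B, ECEB, EE1B."""
    if not 0 <= nfields <= 3:
        raise ValueError("nfields must be between 0 and 3")
    fields = (e0b, ceb, e1b)[:nfields]
    body = b"".join((v & FIELD_MASK).to_bytes(3, "big") for v in fields)
    length = 2 + 2 + len(body)  # Kind + Length + magic + fields
    return bytes([EXP_KIND, length]) + MAGIC.to_bytes(2, "big") + body


def decode_accecn_option(opt):
    """Return a tuple of the fields present, or None to ignore."""
    if len(opt) < 4 or opt[0] != EXP_KIND or opt[1] != len(opt):
        return None
    if int.from_bytes(opt[2:4], "big") != MAGIC:
        return None
    body = opt[4:]
    if len(body) not in (0, 3, 6, 9):
        return None  # any other length is ignored
    return tuple(int.from_bytes(body[i:i + 3], "big")
                 for i in range(0, len(body), 3))
```
]]></artwork>
</figure>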
</section>
<section anchor="accecn_Mbox_Interference"
title="Path Traversal of the AccECN Option">
<t>An AccECN host MUST NOT include the AccECN TCP Option on the SYN.
Nonetheless, if the AccECN negotiation using the ECN flags in the
main TCP header (<xref target="accecn_Negotiation"/>) is successful,
it implicitly declares that the endpoints also support the AccECN
TCP Option.</t>
<t>If the TCP client indicated AccECN support, a TCP server that
confirms its support for AccECN (as described in <xref
target="accecn_Negotiation"/>) SHOULD also include an AccECN TCP
Option in the SYN/ACK. A TCP client that has successfully negotiated
AccECN SHOULD include an AccECN Option in the first ACK at the end
of the 3WHS. However, this first ACK is not delivered reliably, so
the TCP client SHOULD also include an AccECN Option on the first
data segment it sends (if it ever sends one). A host MAY omit the
AccECN Option in any of these three cases if it has cached knowledge
that the packet would be likely to be blocked on the path to the
other host if it included an AccECN Option.</t>
<t>If the TCP client has successfully negotiated AccECN but does not
receive an AccECN Option on the SYN/ACK, it switches into a mode
that assumes that the AccECN Option is not available for this half
connection. Similarly, if the TCP server has successfully negotiated
AccECN but does not receive an AccECN Option on the first ACK or on
the first data segment, it switches into a mode that assumes that
the AccECN Option is not available for this half connection.</t>
<t>While a host is in the mode that assumes the AccECN Option is not
available, it MUST adopt the conservative interpretation of the ACE
field discussed in <xref target="accecn_ACE_Safety"/>. However, it
cannot make any assumption about support of the AccECN Option on the
other half connection, so it MUST continue to send the AccECN Option
itself.</t>
<t>If after the normal TCP timeout the TCP server has not received
an ACK to acknowledge its SYN/ACK, the SYN/ACK might just have been
lost, e.g. due to congestion, or a middlebox might be blocking the
AccECN Option. To expedite connection setup, the host SHOULD fall
back to NS=CWR=ECE=0 and no AccECN Option on the retransmission of
the SYN/ACK. Implementers MAY use other fall-back strategies if they
are found to be more effective (e.g. retransmitting a SYN/ACK with
AccECN TCP flags but not the AccECN Option; attempting to retransmit
a second AccECN segment before fall-back (most appropriate during
high levels of congestion); or falling back to classic ECN feedback
rather than non-ECN).</t>
<t>Similarly, if the TCP client detects that the first data segment
it sent with the AccECN Option was lost,
it SHOULD fall back to no AccECN Option on the
retransmission. Again, implementers MAY use other fall-back
strategies such as attempting to retransmit a second segment with
the AccECN Option before fall-back, and/or caching the result of
previous attempts.</t>
<t>Either host MAY include the AccECN Option in a subsequent segment
to retest whether the AccECN Option can traverse the path.</t>
<t>Currently the Data Sender is not required to test whether the
arriving byte counters in the AccECN Option have been correctly
initialised. This allows different initial values to be used as an
additional signalling channel in future. If any inappropriate
zeroing of these fields is discovered during testing, this approach
will need to be reviewed.</t>
</section>
<section anchor="accecn_option_usage"
title="Usage of the AccECN TCP Option">
<t>The following rules determine when a Data Receiver in AccECN mode
sends the AccECN TCP Option, and which fields to include:<list
style="hanging">
<t hangText="Change-Triggered ACKs:">If an arriving packet
increments a different byte counter to that incremented by the
previous packet, the Data Receiver SHOULD immediately send an
ACK with an AccECN Option, without waiting for the next delayed
ACK. Certain offload hardware might not be able to support
change-triggered ACKs, but otherwise it is important to keep
exceptions to this rule to a minimum so that Data Senders can
generally rely on this behaviour;</t>
<t hangText="Continual Repetition:">Otherwise, if arriving
packets continue to increment the same byte counter, the Data
Receiver can include an AccECN Option on most or all (delayed)
ACKs, but it does not have to. If option space is limited on a
particular ACK, the Data Receiver MUST give precedence to SACK
information about loss. It SHOULD include an AccECN Option if
the r.ceb counter has incremented and it MAY include an AccECN
Option if r.e0b or r.e1b has incremented;</t>
<t hangText="Full-Length Options Preferred:">It SHOULD always
use full-length AccECN Options. It MAY use shorter AccECN
Options if space is limited, but it MUST include the counter(s)
that have incremented since the previous AccECN Option and it
MUST only truncate fields from the right-hand tail of the option
to preserve the order of the remaining fields (see <xref
target="accecn_option"/>);</t>
<t hangText="Beaconing Full-Length Options:">Nonetheless, it
MUST include a full-length AccECN TCP Option on at least three
ACKs per RTT, or on all ACKs if there are fewer than three per
RTT (see <xref target="accecn_Algo_Beacon"/> for an example
algorithm that satisfies this requirement).</t>
</list></t>
<!--Further an AccECN host MAY send the AccECN TCP Option immediately if a different counter changes than triggered by the previous received segment. 'immediate' in this case does not only mean
that the AccECN Option will be included in the next ACK, but also means that the host might send an ACK immediately after reception of the current segment and does not wait for the next delayed ACK.
Note that the ACK for the next segment could be delayed again if it carries the same ECN mark. -->
<t>The following example series of arriving marks illustrates when a
Data Receiver will emit an ACK if it is using a delayed ACK factor
of 2 segments and change-triggered ACKs: 01 -> ACK, 01, 01 ->
ACK, 10 -> ACK, 10, 01 -> ACK, 01, 11 -> ACK, 01 ->
ACK.</t>
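<t>The example series above can be reproduced with the following
Python sketch. It is a simplification under stated assumptions: every
arriving packet carries payload and an ECT or CE mark (so each arrival
increments exactly one byte counter), delayed-ACK timers are ignored,
and the function name is hypothetical.</t>
<figure>
<artwork><![CDATA[
```python
def ack_points(marks, delack_factor=2):
    """Return the indices of arriving packets that trigger an ACK."""
    acks = []
    prev = None    # mark of the previous packet, None before the first
    pending = 0    # packets received since the last ACK
    for i, mark in enumerate(marks):
        pending += 1
        # ACK on a change of mark, or once enough packets are pending.
        if mark != prev or pending >= delack_factor:
            acks.append(i)
            pending = 0
        prev = mark
    return acks


series = ["01", "01", "01", "10", "10", "01", "01", "11", "01"]
print(ack_points(series))
```
]]></artwork>
</figure>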
<t>For the avoidance of doubt, the change-triggered ACK mechanism
ignores the arrival of a control packet with no payload, because it
does not alter any byte counters. The change-triggered ACK approach
will lead to some additional ACKs but it feeds back the timing and
the order in which ECN marks are received with minimal additional
complexity.</t>
<t>Implementation note: sending an AccECN Option each time a
different counter changes and including a full-length AccECN Option
on every delayed ACK will satisfy the requirements described above
and might be the easiest implementation, as long as sufficient space
is available in each ACK (in total and in the option space).</t>
<t><xref target="accecn_Algo_ACE_Bytes"/> gives an example algorithm
to estimate the number of marked bytes from the ACE field alone, if
the AccECN Option is not available.</t>
<t>If a host has determined that segments with the AccECN Option
always seem to be discarded somewhere along the path, it is no
longer obliged to follow the above rules.</t>
</section>
</section>
<!-- <section anchor="accecn_Rcvr_Operation"
title="Accurate ECN Receiver Operation">
<t>A TCP receiver MUST only feedback ECN information arriving in a
segment that it deems is part of the flow, by using regular TCP
techniques based on sequence numbers.</t>
<t>{ToDo: It might be useful to describe receiver end of the feedback
process, including special cases, e.g. pure ACKs, retransmissions,
window probes, partial ACKs, etc. Does AccECN feed back each ECN
codepoint when a data packet is duplicated?}</t>
</section>
<section anchor="accecn_Sndr_Operation"
title="Accurate ECN Sender Operation">
<t>A TCP sender MUST only accept ECN feedback on ACKs that it deems is
part of the flow, by using regular TCP techniques based on sequence
numbers.</t>
<t>{ToDo: It might be useful to describe the sender end of the
feedback process, including special cases, e.g. pure ACKs,
retransmissions, window probes, partial ACKs, etc.}</t>
</section> -->
<!-- Comment by Mirja: not sure if the following section is needed. Of
course a proxy should comply to the spec. Just writing this down explicitly
doesn't help the problem; especially as the problem is old boxes that
never get updated...!
Bob adds: Of course it doesn't stop legacy middleboxes being wrong,
but it allows us (or an operator that buys a middlebox) to say a middlebox
does not comply with this RFC, which can be important if the contract
to maintain the box says it has to comply with updated standards -->
<section anchor="accecn_Mbox_Operation"
title="AccECN Compliance by TCP Proxies, Offload Engines and other Middleboxes">
<t>A large class of middleboxes split TCP connections. Such a
middlebox would be compliant with the AccECN protocol if the TCP
implementation on each side complied with the present AccECN
specification and each side negotiated AccECN independently of the
other side.</t>
<t>Another large class of middleboxes intervenes to some degree at the
transport layer, but attempts to be transparent (invisible) to the
end-to-end connection. A subset of this class of middleboxes attempts
to `normalise' the TCP wire protocol by checking that all values in
header fields comply with a rather narrow interpretation of the TCP
specifications. To comply with the present AccECN specification, such
a middlebox MUST NOT change the ACE field or the AccECN Option and it
MUST attempt to preserve the timing of each ACK (for example, if it
coalesced ACKs it would not be AccECN-compliant). <!-- This includes the explicitly stated requirements to forward
Reserved (Rsvd) and Currently Unused (CU) values unaltered.
An 'ideal' TCP normaliser would not have to change to accommodate AccECN, because AccECN does not directly contravene any existing TCP specifications,
even though it uses existing TCP fields in unorthodox ways.
--> A middlebox claiming to be transparent at the transport layer MUST forward
the AccECN TCP Option unaltered, whether or not the length value
matches one of those specified in <xref target="accecn_option"/>, and
whether or not the initial values of the byte-counter fields are
correct. This is because blocking apparently invalid values does not
improve security (because AccECN hosts are required to ignore invalid
values anyway), while it prevents the standardised set of values being
extended in future (because outdated normalisers would block updated
hosts from using the extended AccECN standard).</t>
<t>Hardware to offload certain TCP processing represents another large
class of middleboxes, even though it is often a function of a host's
network interface and rarely in its own 'box'. Leeway has been allowed
in the present AccECN specification in the expectation that offload
hardware could comply and still serve its function. Nonetheless, such
hardware MUST attempt to preserve the timing of each ACK (for example,
if it coalesced ACKs it would not be AccECN-compliant).</t>
</section>
</section>
<section anchor="accecn_Interact_Variants"
title="Interaction with Other TCP Variants">
<t>This section is informative, not normative.</t>
<section anchor="accecn_Interaction_SYN_Cookies"
title="Compatibility with SYN Cookies">
<t>A TCP server can use SYN Cookies (see Appendix A of <xref
target="RFC4987"/>) to protect itself from SYN flooding attacks. It
places minimal commonly used connection state in the SYN/ACK, and
deliberately does not hold any state while waiting for the subsequent
ACK (e.g. it closes the thread). Therefore it cannot record the fact
that it entered AccECN mode for both half-connections. Indeed, it
cannot even remember whether it negotiated the use of classic ECN
<xref target="RFC3168"/>.</t>
<t>Nonetheless, such a server can determine that it negotiated AccECN
as follows. If a TCP server using SYN Cookies supports AccECN and if
the first ACK it receives contains an ACE field with the value 0b110
or 0b111, it can assume that:<list style="symbols">
<t>the TCP client must have requested AccECN support on the
SYN</t>
<t>it (the server) must have confirmed that it supported
AccECN</t>
</list>Therefore the server can switch itself into AccECN mode, and
continue as if it had never forgotten that it switched itself into
AccECN mode earlier.</t>
</section>
<section anchor="accecn_Interaction_Other"
title="Compatibility with Other TCP Options and Experiments">
<t>AccECN is compatible (at least on paper) with the most commonly
used TCP options: MSS, time-stamp, window scaling, SACK and TCP-AO. It
is also compatible with the recent promising experimental TCP options
TCP Fast Open (TFO <xref target="RFC7413"/>) and Multipath TCP (MPTCP
<xref target="RFC6824"/>). AccECN is friendly to all these protocols,
because space for TCP options is particularly scarce on the SYN, where
AccECN consumes zero additional header space.</t>
<t>When option space is under pressure from other options, <xref
target="accecn_option_usage"/> provides guidance on how important it
is to send an AccECN Option and whether it needs to be a full-length
option.</t>
</section>
<section anchor="accecn_Integrity"
title="Compatibility with Feedback Integrity Mechanisms">
<t>The ECN Nonce <xref target="RFC3540"/> is an experimental IETF
specification intended to allow a sender to test whether ECN CE
markings (or losses) introduced in one network are being suppressed by
the receiver or anywhere else in the feedback loop, such as another
network or a middlebox. The ECN nonce has not been deployed as far as
can be ascertained. The nonce would now be nearly impossible to deploy
retrospectively, because to catch a misbehaving receiver it relies on
the receiver volunteering feedback information to incriminate itself.
A receiver that has been modified to misbehave can simply claim that
it does not support nonce feedback, which will seem unremarkable given
so many other hosts do not support it either.</t>
<t>With minor changes AccECN could be optimised for the possibility
that the ECT(1) codepoint might be used as a nonce. However, given the
nonce is now probably undeployable, the AccECN design has been
generalised so that it ought to be able to support other possible uses
of the ECT(1) codepoint, such as a lower severity or a more instant
congestion signal than CE.</t>
<t>Three alternative mechanisms are available to assure the integrity
of ECN and/or loss signals. AccECN is compatible with any of these
approaches:<list style="symbols">
<t>The Data Sender can test the integrity of the receiver's ECN
(or loss) feedback by occasionally setting the IP-ECN field to a
value normally only set by the network (and/or deliberately
leaving a sequence number gap). Then it can test whether the Data
Receiver's feedback faithfully reports what it expects <xref
target="I-D.moncaster-tcpm-rcv-cheat"/>. Unlike the ECN Nonce,
this approach does not waste the ECT(1) codepoint in the IP
header, it does not require standardisation and it does not rely
on misbehaving receivers volunteering to reveal feedback
information that allows them to be detected. However, setting the
CE mark by the sender might conceal actual congestion feedback
from the network and should therefore only be done sparsely.</t>
<t>Networks generate congestion signals when they are becoming
congested, so they are more likely than Data Senders to be
concerned about the integrity of the receiver's feedback of these
signals. A network can enforce a congestion response to its ECN
markings (or packet losses) using congestion exposure (ConEx)
audit <xref target="I-D.ietf-conex-abstract-mech"/>. Whether the
receiver or a downstream network is suppressing congestion
feedback or the sender is unresponsive to the feedback, or both,
ConEx audit can neutralise any advantage that any of these three
parties would otherwise gain. <vspace blankLines="1"/>ConEx is a
change to the Data Sender that is most useful when combined with
AccECN. Without AccECN, the ConEx behaviour of a Data Sender would
have to be more conservative than would be necessary if it had the
accurate feedback of AccECN.</t>
<t>The TCP authentication option (TCP-AO <xref target="RFC5925"/>)
can be used to detect any tampering with AccECN feedback between
the Data Receiver and the Data Sender (whether malicious or
accidental). The AccECN fields are immutable end-to-end, so they
are amenable to TCP-AO protection, which covers TCP options by
default. However, TCP-AO is often too brittle to use on many
end-to-end paths, where middleboxes can make verification fail in
their attempts to improve performance or security, e.g. by
resegmentation or shifting the sequence space.</t>
</list></t>
</section>
</section>
<!-- ================================================================ -->
<section anchor="accecn_Properties" title="Protocol Properties">
<t>This section is informative, not normative. It describes how well the
protocol satisfies the agreed requirements for a more accurate ECN
feedback protocol <xref target="RFC7560"/>.<list style="hanging">
<t hangText="Accuracy:">From each ACK, the Data Sender can infer the
number of new CE marked segments since the previous ACK. This
provides better accuracy on CE feedback than classic ECN. In
addition, if the AccECN Option is present (not blocked by the network
path), the numbers of bytes marked with CE, ECT(1) and ECT(0) are
provided.</t>
<!-- <t hangText="Accuracy:">The Data Receiver can feed back to the Data
Sender a list of the order of the IP-ECN markings covered by each
delayed ACK.</t> -->
<t hangText="Overhead:">The AccECN scheme is divided into two parts.
The essential part reuses the 3 flags already assigned to ECN in the
TCP header. The supplementary part adds an additional TCP option
consuming up to 11 bytes. However, no TCP option is consumed in the
SYN.</t>
<t hangText="Ordering:">The order in which marks arrive at the Data
Receiver is preserved in AccECN feedback, because the Data Receiver
is expected to send an ACK immediately whenever a different mark
arrives.</t>
<!-- <t hangText="Overhead:">Two alternative locations for the
supplementary protocol field are proposed:<list style="numbers">
<t>In the 16-bit Urgent Pointer when URG=0. This specification
reserves 15 bits of this space, but while the specification is
only experimental it refrains from using this space in the main
TCP header. If AccECN progresses to the standards track and uses
these 15b, it will require zero additional overhead, because it
will overload fields that already takes up space in every TCP
header</t>
<t>In a TCP option. This takes up 4B; the fifteen bits have to
be rounded up to 2B, plus 2B for the TCP option Kind and
Length.</t>
</list></t> -->
<t hangText="Timeliness:">While the same ECN markings are arriving
continually at the Data Receiver, it can defer ACKs as TCP does
normally, but it will immediately send an ACK as soon as a different
ECN marking arrives.</t>
<t hangText="Timeliness vs Overhead:">Change-Triggered ACKs are
intended to enable latency-sensitive uses of ECN feedback by
capturing the timing of transitions but not wasting resources while
the state of the signalling system is stable. The receiver can
control how frequently it sends the AccECN TCP Option and therefore
it can control the overhead induced by AccECN.</t>
<!-- <t hangText="Timeliness:">{ToDo: Add improved timeliness if the
Delayed ACK Control (DAC) feature is included.}</t> -->
<t hangText="Resilience:">All information is provided based on
counters. Therefore, if ACKs are lost, the counters on the first ACK
following the losses allow the Data Sender to immediately recover
the number of ECN markings that it missed.</t>
<t hangText="Resilience against Bias:">Because feedback is based on
repetition of counters, random losses do not remove any information;
they only delay it. Therefore, even though some ACKs are
change-triggered, random losses will not alter the proportions of
the different ECN markings in the feedback.</t>
<t hangText="Resilience vs Overhead:">If space is limited in some
segments (e.g. because more options are needed on some segments, such
as the SACK option after loss), the Data Receiver can send AccECN
Options less frequently or truncate fields that have not changed,
usually down to as little as 5 bytes. However, it has to send a
full-sized AccECN Option at least three times per RTT, which the
Data Sender can rely on as a regular beacon or checkpoint.</t>
<t hangText="Resilience vs Timeliness and Ordering:">Ordering
information and the timing of transitions cannot be communicated in
three cases: i) during ACK loss; ii) if something on the path strips
the AccECN Option; or iii) if the Data Receiver is unable to support
Change-Triggered ACKs.</t>
<!-- reworked end -->
<!-- <t hangText="Resilience:">Subsequent ACKs will allow it to recover
the number of other ECN markings that it missed.</t>
<t hangText="Resilience against Bias:">Undetected ACK loss is as
likely to decrease as increase congestion signals detected by the
Data Sender.</t>
<t hangText="Resilience against Bias:">However, if the supplementary
part is unavailable, the required conservative decoding of feedback
during ACK loss is more likely to increase perceived congestion
signals, which would otherwise be more likely to be
under-reported.</t>
<t hangText="Timeliness vs Overhead:">For efficiency, each delayed
ACK only includes one of the counters at a time, therefore recovery
of the count of the other signals might not be immediate if an ACK
is lost that covers more than one signal. The receiver cannot
predict which ACKs might get lost, if any. Therefore it repeats the
count of each signal roughly in proportion to how often each signal
changes.</t>
<t hangText="Ordering:">The order of arriving ECN codepoints is
communicated in a 10-bit field in the supplementary part;</t>
<t hangText="Resilience vs. Ordering:">Following an ACK loss, only a
count of the lost ECN signals is recovered, not their order of
arrival over the sequence covered by the loss.</t>
<t hangText="Ordering vs. Overhead:">The encoding is tailored for
sequences of ECN codepoints expected to be typical. It can encode
sequences of up to 15 segments but, if the pattern of arrivals
becomes too complex, the protocol forces the Data Receiver to emit
an ACK. The protocol can always encode any sequence of 3 segments in
one delayed ACK;</t>
<t hangText="Ordering, Timeliness and Resilience:">If one delayed
ACK covers changes to more than one congestion counter the
supplementary sequence information provides more timely congestion
feedback than waiting for the other congestion counters on future
ACKs, and it provides resilience against the possibility of those
future ACKs going missing;</t> -->
<!-- new -->
<t hangText="Complexity:">An AccECN implementation solely involves
simple counter increments, some modulo arithmetic to communicate the
least significant bits and allow for wrap, and some heuristics for
safety against fields cycling due to prolonged periods of ACK loss.
Each host needs to maintain eight additional counters. The hosts
have to apply some additional tests to detect tampering by
middleboxes, but in general the protocol is simple to understand,
simple to implement and requires few cycles per packet to
execute.</t>
<t hangText="Integrity:">AccECN is compatible with at least three
approaches that can assure the integrity of ECN feedback. If the
AccECN Option is stripped the resolution of the feedback is
degraded, but the integrity of this degraded feedback can still be
assured.</t>
<t hangText="Backward Compatibility:">If only one endpoint supports
the AccECN scheme, it will fall back to the most advanced ECN
feedback scheme supported by the other end.</t>
<!-- <t hangText="Backward Compatibility:">Each endpoint can detect
normalisation of the Supplementary AccECN field by middleboxes at
any time during a connection. It could then fall-back to the
essential part using only the fewer but safer bits in the TCP
header.</t> -->
<!-- new -->
<t hangText="Backward Compatibility:">If the AccECN Option is
stripped by a middlebox, AccECN still provides basic congestion
feedback in the ACE field. Further, AccECN can be used to detect
mangling of the IP ECN field; mangling of the TCP ECN flags;
blocking of ECT-marked segments; and blocking of segments carrying
the AccECN Option. It can detect these conditions during TCP's 3WHS
so that it can fall back to operation without ECN and/or operation
without the AccECN Option.</t>
<!-- new end -->
<t hangText="Forward Compatibility:">The behaviour of endpoints and
middleboxes is carefully defined for all reserved or currently
unused codepoints in the scheme, to ensure that any blocking of
anomalous values is always at least under reversible policy
control.</t>
</list></t>
</section>
<!-- ================================================================ -->
<section anchor="accecn_IANA_Considerations" title="IANA Considerations">
<t>This document defines a new TCP option for AccECN, assigned a value
of TBD1 (decimal) from the TCP option space. This value is defined
as:</t>
<texttable>
<ttcol>Kind</ttcol>
<ttcol>Length</ttcol>
<ttcol>Meaning</ttcol>
<ttcol>Reference</ttcol>
<c>TBD1</c>
<c>N</c>
<c>Accurate ECN (AccECN)</c>
<c>RFC XXXX</c>
</texttable>
<t>[TO BE REMOVED: This registration should take place at the following
location:
http://www.iana.org/assignments/tcp-parameters/tcp-parameters.xhtml#tcp-parameters-1]</t>
<t>Early implementation before the IANA allocation MUST follow <xref
target="RFC6994"/> and use experimental option 254 and magic number
0xACCE (16 bits) {ToDo register this with IANA}, then migrate to the new
option after the allocation.</t>
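<t>As a hedged illustration only, the following Python sketch shows how an
early implementation might frame and recognise the experimental option
header, assuming the RFC 6994 layout of Kind, Length and a 16-bit
experiment identifier in network byte order. The function names
build_experimental_option() and is_experimental_accecn() are hypothetical,
and the payload is treated as opaque bytes; this sketch makes no assumption
about the layout of the AccECN fields that follow the header.</t>
<figure>
<artwork><![CDATA[
```python
import struct

TCPOPT_EXP = 254        # experimental TCP option Kind (RFC 6994)
ACCECN_EXID = 0xACCE    # 16-bit magic number given in the text above


def build_experimental_option(payload):
    """Prefix an opaque payload with Kind, Length and the 16-bit ExID.

    Length counts the whole option: Kind, Length, ExID and payload.
    (Hypothetical helper, not defined by this specification.)
    """
    length = 4 + len(payload)
    header = struct.pack("!BBH", TCPOPT_EXP, length, ACCECN_EXID)
    return header + payload


def is_experimental_accecn(option):
    """True if the option bytes start with Kind 254 and ExID 0xACCE."""
    if len(option) < 4:
        return False
    kind, length, exid = struct.unpack("!BBH", option[:4])
    return (kind == TCPOPT_EXP and length == len(option)
            and exid == ACCECN_EXID)
```
]]></artwork>
</figure>
<t>Keeping the framing separate from the field encoding would make the
later migration to the allocated option Kind a local change.</t>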
</section>
<!-- ================================================================ -->
<section anchor="accecn_Security_Considerations"
title="Security Considerations">
<t>If ever the supplementary part of AccECN based on the new AccECN TCP
Option is unusable (due for example to middlebox interference) the
essential part of AccECN's congestion feedback offers only limited
resilience to long runs of ACK loss (see <xref
target="accecn_ACE_Safety"/>). These problems are unlikely to be due to
malicious intervention (because if an attacker could strip a TCP option
or discard a long run of ACKs it could wreak other arbitrary havoc).
However, it would be of concern if AccECN's resilience could be
indirectly compromised during a flooding attack. AccECN is still
considered safe though, because if the option is not present, the
AccECN Data Sender is then required to switch to more conservative
assumptions about wrap of congestion indication counters (see <xref
target="accecn_ACE_Safety"/> and <xref
target="accecn_Algo_ACE_Wrap"/>).</t>
<t><xref target="accecn_Interaction_SYN_Cookies"/> describes how a TCP
server can negotiate AccECN and use the SYN cookie method for mitigating
SYN flooding attacks.</t>
<t>There is concern that ECN markings could be altered or suppressed,
particularly because a misbehaving Data Receiver could increase its own
throughput at the expense of others. Given the experimental ECN nonce is
now probably undeployable, AccECN has been generalised for other
possible uses of the ECT(1) codepoint to avoid obsolescence of the
codepoint even if the nonce mechanism is obsoleted. AccECN is compatible
with the three other schemes known to assure the integrity of ECN
feedback (see <xref target="accecn_Integrity"/> for details). If the
AccECN Option is stripped by an incorrectly implemented middlebox, the
resolution of the feedback will be degraded, but the integrity of this
degraded information can still be assured.</t>
<!--Bob adds: I removed the following 3 sentences, which I felt were weak. I think it is better to admit there is a security concern, than try to claim it is not a problem (when it is).
If a receiver has driven a network from marking into loss, it has already probably harmed other flows and gained a large share of resources for itself.
Anyway, a receiver can regulate concealment of ECN marks to give itself more resources without driving a link into loss.-->
<!--The motivation for concealing ECN marks is generally considered to be self-interest. Causing congestion collapse would not be in the interest of a receiver,
and it has not been identified as a realistic motivation for attacks that conceal ECN marks.-->
<!--
-->
<!--"However, if congestion is persistent but no congestion notification is provided to the Data Sender, the congestion will lead to packet loss which cannot easily be concealed by a reliable TCP connection.
Therefore the absence of ECN-based packet feedback will not lead to congestion collapse. Further note that classic ECN also do not have an integrity check.
ECN Nonce was specified separately therefore a end point that wants to conceal ECN feedback can simply present to not support ECN Nonce."-->
<t>The AccECN protocol is not believed to introduce any new privacy
concerns, because it merely counts and feeds back signals at the
transport layer that had already been visible at the IP layer.</t>
</section>
<!-- ================================================================ -->
<section anchor="accecn_Acknowledgements" title="Acknowledgements">
<t>We want to thank Koen De Schepper, Praveen Balasubramanian and
Michael Welzl for their input and discussion. The idea of using the
three ECN-related TCP flags as one field for more accurate TCP-ECN
feedback was first introduced in the re-ECN protocol that was the
ancestor of ConEx.</t>
<t>Bob Briscoe was part-funded by the European Community under its
Seventh Framework Programme through the Reducing Internet Transport
Latency (RITE) project (ICT-317700) and through the Trilogy 2 project
(ICT-317756). The views expressed here are solely those of the
authors.</t>
<t>This work is partly supported by the European Commission under Horizon 2020
grant agreement no. 688421 Measurement and Architecture for a
Middleboxed Internet (MAMI), and by the Swiss State Secretariat for
Education, Research, and Innovation under contract no. 15.0268. This
support does not imply endorsement.</t>
</section>
<!-- ================================================================ -->
<section anchor="accecn_Comments_Solicited" title="Comments Solicited">
<t>Comments and questions are encouraged and very welcome. They can be
addressed to the IETF TCP maintenance and minor modifications working
group mailing list &lt;tcpm@ietf.org&gt;, and/or to the authors.</t>
</section>
</middle>
<back>
<!-- ================================================================ -->
<references title="Normative References">
<!-- &RFC0793;
-->
&RFC2119;
&RFC3168;
&RFC5681;
&RFC6994;
</references>
<references title="Informative References">
<!-- &RFC2018;
-->
&RFC3540;
&RFC4987;
<!-- &RFC5226;
-->
&RFC5562;
&RFC5925;
&RFC6824;
&RFC7560;
&RFC7413;
&I-D.ietf-conex-abstract-mech;
&I-D.kuehlewind-tcpm-ecn-fallback;
&I-D.moncaster-tcpm-rcv-cheat;
&I-D.bensley-tcpm-dctcp;
</references>
<!-- <section anchor="accecn_Algo_Examples" title="Example Algorithms">
<t>This appendix is informative, not normative. It gives examples in
pseudocode for the various algorithms used by AccECN.</t> -->
<section anchor="accecn_Algo_Examples" title="Example Algorithms">
<t>This appendix is informative, not normative. It gives example
algorithms that would satisfy the normative requirements of the AccECN
protocol. However, implementers are free to choose other ways to
implement the requirements.</t>
<section anchor="accecn_Algo_Option_Coding"
title="Example Algorithm to Encode/Decode the AccECN Option">
<t>The example algorithms below show how a Data Receiver in AccECN
mode could encode its CE byte counter r.ceb into the ECEB field within
the AccECN TCP Option, and how a Data Sender in AccECN mode could
decode the ECEB field into its byte counter s.ceb. The other counters
for bytes marked ECT(0) and ECT(1) in the AccECN Option would be
similarly encoded and decoded.</t>
<t>It is assumed that each local byte counter is an unsigned integer
wider than 24 bits (probably 32 bits), and that the following constant has
been assigned:<list style="empty">
<t>DIVOPT = 2^24</t>
</list></t>
<t>Every time a CE marked data segment arrives, the Data Receiver
increments its local value of r.ceb by the size of the TCP Data.
Whenever it sends an ACK with the AccECN Option, the value it writes
into the ECEB field is <list style="empty">
<t>ECEB = r.ceb % DIVOPT</t>
</list></t>
<t>where '%' is the modulo operator.</t>
<t>On the arrival of an AccECN Option, the Data Sender uses the TCP
acknowledgement number and any SACK options to calculate newlyAckedB,
the amount of new data that the ACK acknowledges in bytes. If
newlyAckedB is negative it means that a more up to date ACK has
already been processed, so this ACK has been superseded and the Data
Sender has to ignore the AccECN Option. Then the Data Sender
calculates the minimum difference d.ceb between the ECEB field and its
local s.ceb counter, using modulo arithmetic as follows:</t>
<figure>
<artwork><![CDATA[   if (newlyAckedB >= 0) {
       d.ceb = (ECEB + DIVOPT - (s.ceb % DIVOPT)) % DIVOPT
       s.ceb += d.ceb
   }
]]></artwork>
</figure>
<t>For example, if s.ceb is 33,554,433 and ECEB is 1461 (both
decimal), then</t>
<figure>
<artwork><![CDATA[   s.ceb % DIVOPT = 1
            d.ceb = (1461 + 2^24 - 1) % 2^24
                  = 1460
            s.ceb = 33,554,433 + 1460
                  = 33,555,893
]]></artwork>
</figure>
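<t>As a minimal sketch only, the encoding and decoding steps in this
section could be written in Python as follows. The function names
encode_eceb() and decode_eceb() are hypothetical and are not defined by
this specification; the arguments r_ceb and s_ceb stand for the counters
r.ceb and s.ceb. Returning the counter unchanged for a superseded ACK is
how this sketch chooses to express "ignore the AccECN Option".</t>
<figure>
<artwork><![CDATA[
```python
DIVOPT = 2 ** 24  # size of the number space of the 24-bit ECEB field


def encode_eceb(r_ceb):
    """Data Receiver: reduce the local CE byte counter to 24 bits."""
    return r_ceb % DIVOPT


def decode_eceb(s_ceb, eceb, newly_acked_b):
    """Data Sender: return (d_ceb, updated s_ceb) for an arriving option.

    A superseded ACK (negative newly_acked_b) leaves s_ceb unchanged.
    """
    if newly_acked_b < 0:
        return 0, s_ceb
    # Minimum non-negative difference, allowing for wrap of the field.
    d_ceb = (eceb + DIVOPT - (s_ceb % DIVOPT)) % DIVOPT
    return d_ceb, s_ceb + d_ceb
```
]]></artwork>
</figure>
<t>With s.ceb = 33,554,433 and ECEB = 1461, as in the worked example,
decode_eceb(33554433, 1461, 1460) returns (1460, 33555893); the third
argument is an arbitrary non-negative value chosen for illustration.</t>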
</section>
<section anchor="accecn_Algo_ACE_Wrap"
title="Example Algorithm for Safety Against Long Sequences of ACK Loss">
<t>The example algorithms below show how a Data Receiver in AccECN
mode could encode its CE packet counter r.cep into the ACE field, and
how the Data Sender in AccECN mode could decode the ACE field into its
s.cep counter. The Data Sender's algorithm includes code to
heuristically detect a long enough unbroken string of ACK losses that
could have concealed a cycle of the congestion counter in the ACE
field of the next ACK to arrive.</t>
<t>Two variants of the algorithm are given: i) a more conservative
variant for a Data Sender to use if it detects that the AccECN Option
is not available (see <xref target="accecn_ACE_Safety"/> and <xref
target="accecn_Mbox_Interference"/>); and ii) a less conservative
variant that is feasible when complementary information is available
from the AccECN Option.</t>
<section title="Safety Algorithm without the AccECN Option">
<t>It is assumed that each local packet counter is a sufficiently
sized unsigned integer (probably 32b) and that the following
constant has been assigned:<list style="empty">
<t>DIVACE = 2^3</t>
</list></t>
<t>Every time a CE marked packet arrives, the Data Receiver
increments its local value of r.cep by 1. It repeats the same value
of ACE in every subsequent ACK until the next CE marking arrives,
where<list>
<t>ACE = r.cep % DIVACE.</t>
</list></t>
<t>If the Data Sender received an earlier value of the counter that
had been delayed due to ACK reordering, it might incorrectly
calculate that the ACE field had wrapped. Therefore, on the arrival
of every ACK, the Data Sender uses the TCP acknowledgement number
and any SACK options to calculate newlyAckedB, the amount of new
data that the ACK acknowledges. If newlyAckedB is negative it means
that a more up to date ACK has already been processed, so this ACK
has been superseded and the Data Sender has to ignore the AccECN
Option. If newlyAckedB is zero, to break the tie the Data Sender
could use timestamps (if present) to work out newlyAckedT, the
amount of new time that the ACK acknowledges. Then the Data Sender
calculates the minimum difference d.cep between the ACE field and
its local s.cep counter, using modulo arithmetic as follows:</t>
<figure>
<artwork><![CDATA[   if ((newlyAckedB > 0) || (newlyAckedB == 0 && newlyAckedT > 0))
       d.cep = (ACE + DIVACE - (s.cep % DIVACE)) % DIVACE
]]></artwork>
</figure>
<t><xref target="accecn_ACE_Safety"/> requires the Data Sender to
assume that the ACE field did cycle if it could have cycled under
prevailing conditions. The 3-bit ACE field in an arriving ACK could
have cycled and become ambiguous to the Data Sender if a row of ACKs
goes missing that covers a stream of data long enough to contain 8
or more CE marks. We use the word `missing' rather than `lost',
because some or all the missing ACKs might arrive eventually, but
out of order. Even if some of the lost ACKs are piggy-backed on data
(i.e. not pure ACKs) retransmissions will not repair the lost AccECN
information, because AccECN requires retransmissions to carry the
latest AccECN counters, not the original ones.</t>
<t>The phrase `under prevailing conditions' allows the Data Sender
to take account of the prevailing size of data segments and the
prevailing CE marking rate just before the sequence of ACK losses.
However, we shall start with the simplest algorithm, which assumes
that all segments are full-sized and, ultra-conservatively, that
ECN marking was 100% on the forward path when all ACKs on the reverse
path started to be dropped. Specifically, if newlyAckedB is the
amount of data that an ACK acknowledges since the previous ACK, then
the Data Sender could assume that this acknowledges newlyAckedPkt
full-sized segments, where newlyAckedPkt = newlyAckedB/MSS. Then it
could assume that the ACE field incremented by</t>
<figure>
<artwork><![CDATA[   dSafer.cep = newlyAckedPkt - ((newlyAckedPkt - d.cep) % DIVACE)]]></artwork>
</figure>
<t>For example, imagine an ACK acknowledges newlyAckedPkt=9 more
full-size segments than any previous ACK, and that ACE increments by
a minimum of 2 CE marks (d.cep=2). The above formula works out that
it would still be safe to assume 2 CE marks (because 9 - ((9-2) % 8)
= 2). However, if ACE increases by a minimum of 2 but acknowledges
10 full-sized segments, then it would be necessary to assume that
there could have been 10 CE marks (because 10 - ((10-2) % 8) =
10).</t>
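<t>The two calculations above can be sketched in Python as follows. This
is illustrative only: the function names min_ace_delta() and
safer_ace_delta() are hypothetical, and the guard for the case where fewer
segments are newly acknowledged than d.cep is an assumption of this sketch
rather than a requirement of the protocol.</t>
<figure>
<artwork><![CDATA[
```python
DIVACE = 2 ** 3  # size of the number space of the 3-bit ACE field


def min_ace_delta(s_cep, ace):
    """Smallest increase of the CE packet counter consistent with ACE."""
    return (ace + DIVACE - (s_cep % DIVACE)) % DIVACE


def safer_ace_delta(newly_acked_pkt, d_cep):
    """Largest increase consistent with ACE, assuming every newly
    acknowledged full-sized segment might have been CE-marked."""
    if newly_acked_pkt <= d_cep:
        # Assumption of this sketch: never report fewer marks than d_cep.
        return d_cep
    return newly_acked_pkt - ((newly_acked_pkt - d_cep) % DIVACE)
```
]]></artwork>
</figure>
<t>For the two cases in the example, safer_ace_delta(9, 2) returns 2 and
safer_ace_delta(10, 2) returns 10.</t>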
<t>Implementers could build in more heuristics to estimate
prevailing average segment size and prevailing ECN marking. For
instance, newlyAckedPkt in the above formula could be replaced with
newlyAckedPktHeur = newlyAckedPkt*p*MSS/s, where s is the prevailing
segment size and p is the prevailing ECN marking probability.
However, ultimately, if TCP's ECN feedback becomes inaccurate it
still has loss detection to fall back on. Therefore, it would seem
safe to implement a simple algorithm, rather than a perfect one.</t>
<t>The simple algorithm for dSafer.cep above requires no monitoring
of prevailing conditions and it would still be safe if, for example,
segments were on average at least 5% of full-sized as long as ECN
marking was 5% or less. Assuming it was used, the Data Sender would
increment its packet counter as follows:<list style="empty">
<t>s.cep += dSafer.cep</t>
</list></t>
<t>If missing acknowledgement numbers arrive later (due to
reordering), <xref target="accecn_ACE_Safety"/> says "the Data
Sender MAY attempt to neutralise the effect of any action it took
based on a conservative assumption that it later found to be
incorrect". To do this, the Data Sender would have to store the
values of all the relevant variables whenever it made assumptions,
so that it could re-evaluate them later. Given this could become
complex and it is not required, we do not attempt to provide an
example of how to do this.</t>
</section>
<section title="Safety Algorithm with the AccECN Option">
<t>When the AccECN Option is available on the ACKs before and after
the possible sequence of ACK losses, if the Data Sender only needs
CE-marked bytes, it will have sufficient information in the AccECN
Option without needing to process the ACE field. However, if for
some reason it needs CE-marked packets and dSafer.cep is different
from d.cep, it can calculate the average marked segment size that
each implies to determine whether d.cep is likely to be a safe
enough estimate. Specifically, it could use the following algorithm,
where d.ceb is the amount of newly CE-marked bytes (see <xref
target="accecn_Algo_Option_Coding"/>):</t>
<figure>
<artwork><![CDATA[   SAFETY_FACTOR = 2
   if (dSafer.cep > d.cep) {
       s = d.ceb/d.cep       % treated as infinity if d.cep is 0
       if (s <= MSS) {
           sSafer = d.ceb/dSafer.cep
           if (sSafer < MSS/SAFETY_FACTOR)
               dSafer.cep = d.cep    % d.cep is a safe enough estimate
       } % else
           % No need for else; dSafer.cep is already correct,
           % because d.cep must have been too small
   }
]]></artwork>
</figure>
<t>The chart below shows when the above algorithm will consider that
d.cep can replace dSafer.cep as a safe enough estimate of the number
of CE-marked packets:</t>
<figure>
<artwork><![CDATA[         ^
   sSafer|
         |
      MSS+
         |
         |         dSafer.cep
         |                  is
    MSS/2+--------------+    safest
         |              |
         | d.cep is safe|
         |    enough    |
         +-------------------->
                       MSS   s
]]></artwork>
</figure>
<t>The following examples give the reasoning behind the algorithm,
assuming MSS=1,460 [B]:<list style="symbols">
<t>if d.cep=0, dSafer.cep=8 and d.ceb=1,460, then s=infinity and
sSafer=182.5.<vspace blankLines="0"/>Therefore even though the
average size of 8 data segments is unlikely to have been as
small as MSS/8, d.cep cannot have been correct, because it would
imply an average segment size greater than the MSS.</t>
<t>if d.cep=2, dSafer.cep=10 and d.ceb=1,460, then s=730 and
sSafer=146.<vspace blankLines="0"/>Therefore d.cep is safe
enough, because the average size of 10 data segments is unlikely
to have been as small as MSS/10.</t>
<t>if d.cep=7, dSafer.cep=15 and d.ceb=10,200, then s=1,457 and
sSafer=680.<vspace blankLines="0"/>Therefore d.cep is safe
enough, because the average data segment size is more likely to
have been just less than one MSS, rather than below MSS/2.</t>
</list></t>
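<t>As a hedged sketch, the decision logic of this section can be expressed
in Python and checked against the three examples above. The function name
choose_cep_delta() is hypothetical, and the case d.cep = 0 is handled by
treating the implied segment size as infinite, which follows the first
example.</t>
<figure>
<artwork><![CDATA[
```python
SAFETY_FACTOR = 2


def choose_cep_delta(d_cep, d_safer_cep, d_ceb, mss):
    """Return the increase in CE-marked packets the Data Sender adopts."""
    if d_safer_cep <= d_cep:
        return d_safer_cep
    # Average marked segment size implied by d_cep; infinite if d_cep is 0.
    s = d_ceb / d_cep if d_cep > 0 else float("inf")
    if s <= mss:
        # Average marked segment size implied by the safer estimate.
        s_safer = d_ceb / d_safer_cep
        if s_safer < mss / SAFETY_FACTOR:
            return d_cep  # d_cep is a safe enough estimate
    # Either d_cep implies segments larger than the MSS, so it must have
    # been too small, or the safer estimate is plausible enough to keep.
    return d_safer_cep
```
]]></artwork>
</figure>
<t>With MSS = 1460, the three examples give choose_cep_delta(0, 8, 1460,
1460) = 8, choose_cep_delta(2, 10, 1460, 1460) = 2 and choose_cep_delta(7,
15, 10200, 1460) = 7.</t>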
<t>If pure ACKs were allowed to be ECN-capable, missing ACKs would
be far less likely. However, because <xref target="RFC3168"/>
currently precludes this, the above algorithm assumes that pure ACKs
are not ECN-capable.</t>
</section>
</section>
<section anchor="accecn_Algo_ACE_Bytes"
title="Example Algorithm to Estimate Marked Bytes from Marked Packets">
<t>If the AccECN Option is not available, the Data Sender can only
decode CE-marking from the ACE field in packets. Every time an ACK
arrives, to convert this into an estimate of CE-marked bytes, it needs
an average of the segment size, s_ave. Then it can add or subtract
s_ave from the value of d.ceb as the value of d.cep increments or
decrements.</t>
<t>To calculate s_ave, it could keep a record of the byte numbers of
all the boundaries between packets in flight (including control
packets), and recalculate s_ave on every ACK. However it would be
simpler to merely maintain a counter packets_in_flight for the number
of packets in flight (including control packets), which it could
update once per RTT. Either way, it would estimate s_ave as:<list
style="empty">
<t>s_ave ~= flightsize / packets_in_flight,</t>
</list>where flightsize is the variable that TCP already maintains
for the number of bytes in flight. To avoid floating point arithmetic,
it could right-bit-shift by lg(packets_in_flight), where lg() means
log base 2.</t>
<t>An alternative would be to maintain an exponentially weighted
moving average (EWMA) of the segment size:<list style="empty">
<t>s_ave = a * s + (1-a) * s_ave,</t>
</list>where a is the decay constant for the EWMA. However, then it
is necessary to choose a good value for this constant, which ought to
depend on the number of packets in flight. Also the decay constant
needs to be a power of two to avoid floating point arithmetic.</t>
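<t>Both estimators in this section can be sketched in integer arithmetic,
as in the following Python example. It is illustrative only: the function
names are hypothetical, returning 0 when nothing is in flight is an
assumption of this sketch, and the EWMA uses a decay constant of
1 / 2**shift so that the update needs only a subtraction, a shift and an
addition.</t>
<figure>
<artwork><![CDATA[
```python
def s_ave_from_flight(flightsize, packets_in_flight):
    """Average segment size in bytes, using integer division only."""
    if packets_in_flight <= 0:
        return 0  # nothing in flight; no estimate (sketch assumption)
    return flightsize // packets_in_flight


def s_ave_ewma(s_ave, s, shift):
    """EWMA update with decay constant a = 1 / 2**shift.

    Approximates s_ave = a * s + (1 - a) * s_ave in integer arithmetic.
    """
    return s_ave + ((s - s_ave) >> shift)


def d_ceb_estimate(d_cep, s_ave):
    """Estimated newly CE-marked bytes when only ACE is available."""
    return d_cep * s_ave
```
]]></artwork>
</figure>
<t>A larger shift gives a slower-moving average, so one option would be to
choose it from the number of packets in flight.</t>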
</section>
<section anchor="accecn_Algo_Beacon"
title="Example Algorithm to Beacon AccECN Options">
<t><xref target="accecn_option_usage"/> requires a Data Receiver to
beacon a full-length AccECN Option at least 3 times per RTT. This
could be implemented by maintaining a variable to store the number of
ACKs (pure and data ACKs) since a full AccECN Option was last sent and
another for the approximate number of ACKs sent in the last round trip
time:</t>
<figure>
<artwork><![CDATA[   if (acks_since_full_last_sent > acks_in_round / BEACON_FREQ)
       send_full_AccECN_Option()]]></artwork>
</figure>
<t>For optimised integer arithmetic, BEACON_FREQ = 4 could be used,
rather than 3, so that the division could be implemented as an integer
right bit-shift by lg(BEACON_FREQ).</t>
<t>In certain operating systems, it might be too complex to maintain
acks_in_round. In others it might be possible by tagging each data
segment in the retransmit buffer with the number of ACKs sent at the
point that segment was sent. This would not work well if the Data
Receiver was not sending data itself, in which case it might be
necessary to beacon based on time instead, as follows:</t>
<figure>
<artwork><![CDATA[   if ( time_now > time_last_option_sent + (RTT / BEACON_FREQ) )
       send_full_AccECN_Option()]]></artwork>
</figure>
<t>This time-based approach does not work well when all the
ACKs are sent early in each round trip, as is the case during
slow-start. In this case few options will be sent (possibly even fewer than 3 per RTT).
However, when continuously sending data, data packets as well as ACKs will spread out
evenly over the RTT and sufficient ACKs with the AccECN option will be sent.</t>
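<t>The two beacon tests in this section can be sketched in Python as
follows, assuming BEACON_FREQ = 4 so that the division becomes a right
shift by 2. The function names are hypothetical, and the time-based
variant assumes that all three arguments are integers in the same time
unit.</t>
<figure>
<artwork><![CDATA[
```python
BEACON_FREQ = 4   # power of two, so the division is a right shift
BEACON_SHIFT = 2  # lg(BEACON_FREQ)


def beacon_due_by_acks(acks_since_full_last_sent, acks_in_round):
    """ACK-counting variant of the beacon test."""
    return acks_since_full_last_sent > (acks_in_round >> BEACON_SHIFT)


def beacon_due_by_time(time_now, time_last_option_sent, rtt):
    """Time-based variant; all arguments share one integer time unit."""
    return time_now > time_last_option_sent + (rtt >> BEACON_SHIFT)
```
]]></artwork>
</figure>
<t>Whenever either test returns true, the Data Receiver would send a
full-length AccECN Option and reset the corresponding counter or
timestamp.</t>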
</section>
<section anchor="accecn_Algo_Not-ECT"
title="Example Algorithm to Count Not-ECT Bytes">
<t>A Data Sender in AccECN mode can infer the amount of TCP payload
data arriving at the receiver marked Not-ECT from the difference
between the amount of newly ACKed data and the sum of the bytes with
the other three markings, d.ceb, d.e0b and d.e1b. Note that, because
r.e0b is initialised to 1 and the other two counters are initialised
to 0, the initial sum will be 1, which matches the initial offset of
the TCP sequence number on completion of the 3WHS.</t>
<!--ToDo: write-up pseudocode, rather than just describe it.-->
<t>For this approach to be precise, it has to be assumed that spurious
(unnecessary) retransmissions do not lead to double counting. This
assumption is currently correct, given that RFC 3168 requires that the
Data Sender marks retransmitted segments as Not-ECT. However, the
converse is not true; necessary retransmissions will result in
under-counting.</t>
<t>However, such precision is unlikely to be necessary. The only known
use of a count of Not-ECT marked bytes is to test whether equipment on
the path is clearing the ECN field (perhaps due to an out-dated
attempt to clear, or bleach, what used to be the ToS field). To detect
bleaching it will be sufficient to detect whether nearly all bytes
arrive marked as Not-ECT. Therefore there should be no need to keep
track of the details of retransmissions.</t>
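<t>A minimal Python sketch of the inference described in this section
follows. It is not a definitive implementation: the function names are
hypothetical, clamping a negative difference to zero is an assumption of
this sketch, and the 95% threshold used to decide that "nearly all" bytes
arrived as Not-ECT is an arbitrary illustrative value.</t>
<figure>
<artwork><![CDATA[
```python
def not_ect_bytes(newly_acked_b, d_ceb, d_e0b, d_e1b):
    """Bytes inferred to have arrived Not-ECT since the previous ACK."""
    d_nectb = newly_acked_b - (d_ceb + d_e0b + d_e1b)
    # Sketch assumption: the byte counters may run ahead of the newly
    # acknowledged data, so a negative difference is clamped to zero.
    return max(d_nectb, 0)


def looks_bleached(total_acked_b, total_not_ect_b, threshold_percent=95):
    """True if nearly all acknowledged bytes arrived marked Not-ECT.

    threshold_percent is a hypothetical tuning value.
    """
    if total_acked_b <= 0:
        return False
    return total_not_ect_b * 100 >= total_acked_b * threshold_percent
```
]]></artwork>
</figure>
<t>Because only a coarse test for bleaching is needed, the sketch
accumulates totals and does not track individual retransmissions.</t>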
</section>
</section>
<section anchor="accecn_Alt_Designs"
title="Alternative Design Choices (To Be Removed Before Publication)">
<t>This appendix is informative, not normative. It records alternative
designs that the authors chose not to include in the normative
specification, but which the IETF might wish to consider for
inclusion:<list style="hanging">
<t hangText="Feedback all four ECN codepoints on the SYN/ACK:">The
last two negotiation combinations in <xref
target="accecn_Tab_Negotiation"/> could also be used to indicate
AccECN support and to feed back that the arriving SYN was ECT(0) or
ECT(1). This could be used to probe the client to server path for
incorrect forwarding of the ECN field <xref
target="I-D.kuehlewind-tcpm-ecn-fallback"/>. Note, however, that it
would be unremarkable if ECN on the SYN was zeroed by security
devices, given RFC 3168 prohibited ECT on SYN because it enables DoS
attacks.</t>
<t hangText="Feedback all four ECN codepoints on the First ACK:">To
probe the server to client path for incorrect ECN forwarding, it
could be useful to have four feedback states on the first ACK from
the TCP client. This could be achieved by assigning four
combinations of the ECN flags in the main TCP header, and only
initialising the ACE field on subsequent segments.</t>
<t hangText="Empty AccECN Option:">It might be useful to allow an
empty (Length=2) AccECN Option on the SYN/ACK and first ACK. Then if
a host had to omit the option because there was insufficient space
for a larger option, it would not give the impression to the other
end that a middlebox had stripped the option.</t>
<!--
<t hangText="Single ECT mode:">If a Data Receiver has never seen one
of the ECT codepoints arrive, e.g. ECT(1), it would be more
efficient not to have to include byte counters for that ECT
codepoint in the full-length AccECN Option at all. One way to
achieve this would be to redefine the meaning of an AccECN Option
with only one ECT byte counter (Length=8B) so that it would report
the only ECT counter to have changed during the connection. As soon
as the other ECT counter incremented (if it ever did), for the
remainder of the connection the Data Receiver would have to stop
using the AccECN Option with the single ECT field. Superficially,
this would seem to be a simple enhancement, because the Data Sender
would know if it had only ever used one ECT codepoint, However, the
protocol design would be complicated by the need to check whether a
middlebox on the path might be switching over one ECT codepoint to
the other.</t>
<t hangText="Multi-Codepoints ACE field:">If the AccECN Option is
not available, the ability to check AccECN feedback integrity is
limited. If this is seen as a show-stopper, it might be necessary to
return to the more complex scheme with three sets of codepoints for
three very small counters in the ACE field, as described in
draft-kuehlewind-tcpm-accurate-ecn-03.</t>
-->
</list></t>
</section>
<section anchor="accecn_Open_Issues"
title="Open Protocol Design Issues (To Be Removed Before Publication)">
<t><list style="numbers">
<t>Currently it is specified that the receiver `SHOULD' use
Change-Triggered ACKs. It is controversial whether this ought to be
a `MUST' instead. A `SHOULD' would leave the Data Sender uncertain
whether it can rely on the timing and ordering information in ACKs.
If the sender guesses wrongly, it will probably introduce at least
1 RTT of delay before it can use this timing information. Ironically,
it will most likely want this information in order to reduce ramp-up
delay. A `MUST' could make it hard to implement AccECN in offload
hardware. However, it is not known whether AccECN would be hard to
implement in such hardware even with a `SHOULD' here. For instance,
was it hard to offload DCTCP to hardware because of change-triggered
ACKs, or was this just one of many reasons? The choice between `MUST'
and `SHOULD' here is critical. Before that choice is made, a clear
use-case for certainty of timing and ordering information is needed,
plus well-informed discussion about hardware offload
constraints.</t>
<t>There is a possible concern that a receiver could deliberately
omit the AccECN Option, pretending that it had been stripped by a
middlebox. No way to take advantage of this downgrade attack is
known yet, but it is mentioned here in case someone else can
contrive one.</t>
<t>The s.cep counter might increase even if the s.ceb counter does
not (e.g. due to a CE-marked control packet). The sender's response
to such a situation is considered out of scope, because this ought
to be dealt with in whatever future specification allows ECN-capable
control packets. However, it is possible that the situation might
arise even if the sender has not sent ECN-capable control packets,
in which case, this draft might need to give some advice on how the
sender should respond.</t>
</list></t>
</section>
<section anchor="accecn_Doc_Changes"
title="Changes in This Version (To Be Removed Before Publication)">
<t>The difference between any pair of versions can be displayed at
&lt;http://datatracker.ietf.org/doc/draft-kuehlewind-tcpm-accurate-ecn/history/&gt;<list
style="hanging">
<t hangText="From kuehlewind-05 to ietf-00:">Filename change to
reflect WG adoption.</t>
</list></t>
</section>
</back>
</rfc>