<?xml version="1.0" encoding="US-ASCII"?>
<!-- This template is for creating an Internet Draft using xml2rfc,
    which is available here: http://xml.resource.org. -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!-- One method to get references from the online citation libraries.
    There has to be one entity for each item to be referenced. 
    An alternate method (rfc include) is described in the references. -->


<!ENTITY RFC2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY RFC3168 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3168.xml">
<!ENTITY RFC3540 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3540.xml">
<!ENTITY RFC5562 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5562.xml">
<!ENTITY RFC5681 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5681.xml">
<!ENTITY RFC5690 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5690.xml">

]>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<!-- used by XSLT processors -->
<!-- For a complete list and description of processing instructions (PIs), 
    please see http://xml.resource.org/authoring/README.html. -->
<!-- Below are generally applicable Processing Instructions (PIs) that most I-Ds might want to use.
    (Here they are set differently than their defaults in xml2rfc v1.32) -->
<?rfc strict="yes" ?>
<!-- give errors regarding ID-nits and DTD validation -->
<!-- control the table of contents (ToC) -->
<?rfc toc="yes"?>
<!-- generate a ToC -->
<?rfc tocdepth="4"?>
<!-- the number of levels of subsections in ToC. default: 3 -->
<!-- control references -->
<?rfc symrefs="yes"?>
<!-- use symbolic references tags, i.e, [RFC2119] instead of [1] -->
<?rfc sortrefs="yes" ?>
<!-- sort the reference entries alphabetically -->
<!-- control vertical white space 
    (using these PIs as follows is recommended by the RFC Editor) -->
<?rfc compact="yes" ?>
<!-- do not start each main section on a new page -->
<?rfc subcompact="no" ?>
<!-- keep one blank line between list items -->
<!-- end of list of popular I-D processing instructions -->
<rfc category="exp" docName="draft-kuehlewind-conex-accurate-ecn-00" ipr="trust200902">
	<!-- updates="3168" -->
 <!-- category values: std, bcp, info, exp, and historic
    ipr values: trust200902, noModificationTrust200902, noDerivativesTrust200902,
       or pre5378Trust200902
    you can add the attributes updates="NNNN" and obsoletes="NNNN" 
    they will automatically be output with "(if approved)" -->

 <!-- ***** FRONT MATTER ***** -->

 <front>
   <!-- The abbreviated title is used in the page header - it is only necessary if the 
        full title is longer than 39 characters -->

   <title>Accurate ECN Feedback in TCP</title>

   <!-- add 'role="editor"' below for the editors if appropriate -->

   <!-- Another author who claims to be an editor -->

   <author fullname="Mirja Kühlewind" initials="M." role="editor"
	   surname="Kühlewind">
    	<organization>University of Stuttgart</organization>
    	<address>
    		<postal>
    			<street>Pfaffenwaldring 47</street>
    			<code>70569</code>
    			<city>Stuttgart</city>
    			<country>Germany</country>
    		</postal>
    		<email>mirja.kuehlewind@ikr.uni-stuttgart.de</email>
    	</address>
    </author>
    
    <author fullname="Richard Scheffenegger" initials="R."
           surname="Scheffenegger">
     <organization>NetApp, Inc.</organization>
     <address>
       <postal>
         <street>Am Euro Platz 2</street>
         <code>1120</code>
         <city>Vienna</city>
         <region></region>
         <country>Austria</country>
       </postal>
       <phone>+43 1 3676811 3146</phone>
       <email>rs@netapp.com</email>
     </address>
    </author>

   <date year="2011" />


   <area>Transport</area>

   <workgroup>Congestion Exposure (ConEx)</workgroup>

   <keyword>Internet-Draft</keyword>
   <keyword>I-D</keyword>

   <abstract>
     <t>Explicit Congestion Notification (ECN) is an IP/TCP mechanism where network nodes 
     can mark IP packets instead of dropping them to indicate congestion to the end-points. 
     An ECN-capable receiver feeds this information back to the sender. ECN is specified for
     TCP in such a way that only one feedback signal can be transmitted per Round-Trip Time (RTT).
     Recently proposed mechanisms such as ConEx or DCTCP need more accurate feedback information when
     more than one marking is received in one RTT. 
     </t>
    
   </abstract>
 </front>

 <middle>
   <section title="Introduction">
	   <t>Explicit Congestion Notification (ECN) <xref target="RFC3168"/> is 
		   an IP/TCP mechanism where
		   network nodes can mark IP packets instead of dropping them to indicate congestion to
		   the end-points. An ECN-capable receiver feeds this information back to the sender.
		   ECN is specified for TCP in such a way that only one feedback signal can be
		   transmitted per Round-Trip Time (RTT).
		   Recently proposed mechanisms such as Congestion Exposure (ConEx) or DCTCP 
		   <xref target="Ali10"/> need more accurate feedback information 
		   when more than one marking is received in one RTT. 
     </t>
     
     <t>This document discusses (and in a later version will specify) a different scheme for the ECN feedback in the TCP header 
	     to provide more than one feedback signal per RTT. This modification does not obsolete 
	     <xref target="RFC3168"/>. It provides an extension that requires 
	     additional negotiation in the TCP handshake by using the TCP nonce sum (NS) bit 
	     <!--, as specified in <xref target="RFC3540"/>,--> which is 
	     currently not used when SYN is set.
     </t>
     

      
     <t> In the current version of this document, different coding schemes are proposed for
	     discussion. All proposed codings aim to cope with the given bit space. All schemes require
	     the use of the NS bit at least in the TCP handshake. Depending on the coding scheme, the
	     accurate ECN feedback extension will or will not include the ECN-Nonce integrity mechanism. 
	     A later version of this document will choose between the coding options, and will remove the rationale 
	     for the choice and the specifications of the schemes not chosen.
	     If a scheme is chosen that does not include ECN Nonce, a mechanism that requires
	     more accurate ECN feedback needs to provide its own method to ensure the integrity of the
	     congestion feedback information, or has to cope with the uncertainty of this information.
     </t>
     
     <t>
	     The following scenarios briefly illustrate where accurate feedback is needed or provides additional value:
	     <list style="letters">
     <t>A standard TCP sender using the <xref target="RFC5681"/> congestion control algorithm that supports ConEx:
	     <vspace blankLines="0" />
	     In this case the congestion control algorithm still ignores multiple marks per RTT, 
	     while the ConEx mechanism uses the extra information per RTT to re-echo more precise congestion
	     information. </t>
     <t>A sender using DCTCP without ConEx:<vspace blankLines="0" />
	     The congestion control algorithm uses the extra info per RTT to perform its decrease depending on the
	     number of congestion marks.</t>
     <t>A sender using DCTCP congestion control that supports ConEx:<vspace blankLines="0" />
	     Both the congestion control algorithm and ConEx use the accurate ECN feedback mechanism.</t>
     <t>A standard TCP sender using the RFC5681 congestion control algorithm without ConEx:<vspace blankLines="0" />
	     No accurate feedback is necessary here. The congestion control algorithm still reacts to only one signal
	     per RTT. However, it is best to have one generic feedback mechanism, whether it is used or not.</t>
</list>
     </t>
     
     <section title="Overview ECN and ECN Nonce in TCP">
	     <t>ECN requires two bits in the IP header. The ECN capability of a packet is indicated 
	     	when either one of the two bits is set. An ECN sender can set one or the other bit to indicate 
		an ECN-capable transport (ECT),
		     which results in two signals: ECT(0) and ECT(1). 
		     A network node can set both bits simultaneously 
	     	when it experiences congestion. When both bits are
		     set, the packet is regarded as "Congestion Experienced" (CE).
	     </t>
		     
	     <t>In the TCP header two bits in byte 14 are defined for the use of ECN. The TCP mechanism
		     for signaling the reception of a congestion mark
		     uses the ECN-Echo (ECE) flag in the TCP header. To enable the TCP receiver to
		     determine when to stop setting the ECN-Echo flag, the CWR flag is set by the sender
		     upon reception of the feedback signal.
	     </t>
	     <t>
		     ECN-Nonce <xref target="RFC3540"/> is
		     an optional addition to ECN that is used to protect the TCP sender against
		     accidental or malicious concealment of marked or dropped packets. This addition defines the
		     last bit of byte 13 in the TCP header as the Nonce Sum (NS) bit. With ECN-Nonce,
		     a nonce sum is maintained that counts the occurrence of ECT(1) packets.
	     </t>
		     
	     <figure anchor="TCPHdr" align="center" title="The (post-ECN Nonce) definition of the TCP header flags">
<artwork align="center"><![CDATA[						 
  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|               |           | N | C | E | U | A | P | R | S | F |
| Header Length | Reserved  | S | W | C | R | C | S | S | Y | I |
|               |           |   | R | E | G | K | H | T | N | N |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
]]></artwork>
  	   </figure>
   </section>
   <section title="Design choices">
	<t>
		The idea of this document is to use the ECE, CWR and NS bits for additional capability negotiation during the SYN/SYN-ACK exchange, and then for the more accurate feedback itself on subsequent packets in the flow (with SYN=0).
	</t>
	<t>Alternatively, a new TCP option could be introduced to help maintain the accuracy 
		and integrity of the ECN feedback between receiver and sender. Such an option could provide more information. For example, ECN for RTP/UDP explicitly provides the number of ECT(0), ECT(1), CE, non-ECT marked and lost packets. However, deploying new TCP options has its own challenges.
	</t>
	<!--<t>Combining the idea of <xref target="eci_mode"/> and <xref target="cp_mode"/>,
		further extending it to a one-octet option, would allow the signaling of two 
		values, each with 4 bit. The gains in worst case ACK loss, delayed ACK ratios
		and maintaining ECN Nonce would scale accordingly.
	</t>
	<t>Alternatively, if timestamp capability negotiation is supported, a few
		bits could be extracted from the timestamp value, to provide extended
		signaling. However, processing TCP options (or overloaded TCP options) is
		more complex than processing of header flags.
	</t>-->
	<t>As seen in <xref target="TCPHdr"/>, there are currently three unused flag bits
		in the TCP header. Any of the schemes described below could be extended by one or
		more bits to add higher resiliency against ACK loss. The relative gains 
		would be proportional to each of the described schemes, while the respective 
		drawbacks would remain identical. Thus the approach in this document is to cope with the 
		given number of bits, as they seem to be sufficient already and the accurate ECN feedback scheme 
		will only be used instead of classic ECN and never in parallel.
	</t>
     </section>
    
     

     <section title="Requirements Language">
       <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
       "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
       document are to be interpreted as described in <xref
       target="RFC2119">RFC 2119</xref>.</t>
	<t>We use the following terminology from <xref target="RFC3168"/> and <xref target="RFC3540"/>:</t>
	<t>The ECN field in the IP header:
		<list hangIndent="10" style="empty">
			<t>CE: the Congestion Experienced codepoint; and</t>
		
			<t>ECT(0)/ECT(1): either one of the two ECN-Capable Transport codepoints.</t>
		</list></t>
	<t>The ECN flags in the TCP header:
		<list hangIndent="10" style="empty">
			<t>CWR: the Congestion Window Reduced flag;</t>
		
			<t>ECE: the ECN-Echo flag; and</t>
			
			<t>NS: ECN Nonce Sum.</t>
		</list></t>

	<t> In this document, we will call the ECN feedback scheme as specified 
		in <xref target="RFC3168"/> the 'classic ECN' 
		and our new proposal the 'accurate ECN feedback' scheme. 
		A 'congestion mark' is defined as an IP packet where the CE codepoint is set.
     </t>
	
	
     </section>
   </section>
   <section title="Negotiation in TCP handshake" anchor="TCPNeg">
	   <t>   During the TCP handshake at the start of a connection, the originator
		   of the connection (host A) MUST
		   indicate a request to get more accurate ECN feedback by setting the TCP flags 
		   NS=1, CWR=1 and ECE=1 in the initial SYN.
	   </t>
	   <t>
		   A responding host (host B) MUST return a SYN ACK with flags
		   CWR=1 and ECE=0.  The responding host MUST NOT set this combination
		   of flags unless the preceding SYN has already requested 
		   support for accurate ECN feedback as above.  Normally a server (B) will 
		   reply to a client with NS=0, but if the initial SYN from client A is
		   marked CE, server B can set the NS flag to 1 to indicate the congestion 
		   immediately, instead of delaying the signal to the first acknowledgment when 
		   the actual data transmission has already started.
		   <!--a server B MUST 
			 increment its local value of
			 ECC.  But B cannot reflect the value of ECC in the SYN ACK, because
			 it is still using the 3 bits to negotiate connection capabilities. -->
		   <!-- [RS] ECC not yet defined. Suggest to remove this discussion for later 
			(sec. 312 or 313). Also, need to add a paragraph stating that we encourage 
			the use of ECT during non-data segments (SYN, pure ACK)... -->
		   So, server B MAY set the alternative TCP header flags in its SYN
		   ACK: NS=1, CWR=1 and ECE=0. 
	   </t>
	   <t> The addition of ECN to TCP SYN/ACK packets is discussed 
		   and specified as experimental in <xref target="RFC5562"/>. The addition
		   of ECN to the SYN packet is optional. The security implications
		   of using this option are not discussed further here.
	   </t>
	   <t>
		   These handshakes are summarized in Table 1 below, with X indicating
		   NS can be either 0 or 1 depending on whether congestion had been
		   experienced.  The handshakes used for the other flavors of ECN are
		   also shown for comparison.  To compress the width of the table, the
		   headings of the first four columns have been severely abbreviated, as
		   follows:
	   </t>
	   <t>
		   Ac: *Ac*curate ECN Feedback
	   </t>
	   <t>
		   N: ECN-*N*once (RFC3540)
	   </t>
	   <t>
		   E: *E*CN (RFC3168)
	   </t>
	   <t>
		   I: Not-ECN (*I*mplicit congestion notification).
	   </t>
     

     <texttable anchor="Tab1" align="center" 
     	title="ECN capability negotiation between Sender (A) and Receiver (B)">
     	  <ttcol align="left">Ac</ttcol>
     		<ttcol align="center">N</ttcol>
     		<ttcol align="center">E</ttcol>
     		<ttcol align="center">I</ttcol>
     		<ttcol align="center">[SYN] A->B</ttcol>
     		<ttcol align="center">[SYN,ACK] B->A</ttcol>
     		<ttcol align="left">Mode</ttcol>
     		<c/> <c/> <c/> <c/> <c>NS CWR ECE</c> <c>NS CWR ECE</c> <c/>
     		<c>AB</c>  <c/>  <c/>  <c/> <c>1   1   1</c> <c>X   1   0</c> <c>accurate ECN</c>
     		<c>A</c> <c>B</c> <c/> <c/> <c>1   1   1</c> <c>1   0   1</c> <c>ECN Nonce</c>
     		<c>A</c> <c/> <c>B</c> <c/> <c>1   1   1</c> <c>0   0   1</c> <c>classic ECN</c>
     		<c>A</c> <c/> <c/> <c>B</c> <c>1   1   1</c> <c>0   0   0</c> <c>Not ECN</c>
     		<c>A</c> <c/> <c/> <c>B</c> <c>1   1   1</c> <c>1   1   1</c> <c>Not ECN (broken)</c>
     	</texttable>	
			
	<t>
     Recall that, if the SYN ACK reflects the same flag settings as the
     preceding SYN (because of a broken TCP implementation that behaves
     this way), RFC3168 specifies that the
     whole connection MUST revert to Not-ECT.
	</t>
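	<t>
     The negotiation outcomes in the table above can be sketched as a small lookup on the
     flags of the SYN ACK. This is only an illustrative sketch in Python, not part of the
     proposal: the function name negotiated_mode is hypothetical, the SYN is assumed to have
     carried NS=1, CWR=1 and ECE=1, and treating flag combinations that are not listed in
     the table as Not ECN is our assumption.
	</t>
	<figure><artwork><![CDATA[
```python
def negotiated_mode(ns, cwr, ece):
    """Map the (NS, CWR, ECE) flags of a SYN ACK to the resulting mode.

    Hypothetical helper; assumes the preceding SYN had NS=1, CWR=1, ECE=1.
    """
    if (cwr, ece) == (1, 0):
        # NS may be 0 or 1 here (the X in the table): NS=1 signals
        # that the SYN itself arrived with a CE mark.
        return "accurate ECN"
    if (ns, cwr, ece) == (1, 0, 1):
        return "ECN Nonce"
    if (ns, cwr, ece) == (0, 0, 1):
        return "classic ECN"
    # 0 0 0, the reflected 1 1 1 (broken), and any combination that is
    # not listed in the table (assumption) fall back to Not ECN.
    return "Not ECN"
```
]]></artwork></figure>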
   </section>

   <section title="Accurate Feedback">
     <t>In this section we refer to the sender as the host sending data and to the receiver as the host that
	     acknowledges this data. Of course, such a scenario describes only one half-connection 
	     of a TCP connection. The proposed scheme, if negotiated, will be used for both
	     half-connections, as both sender and receiver need to be capable of echoing and understanding the
	     accurate ECN feedback scheme.
     </t>
     
     <section title="Coding" anchor="TCPSig">
	     <t>
		     This section proposes three different coding schemes for discussion. First, requirements are 
		     listed that allow the proposed schemes to be evaluated against each other. A later version 
		     of this document will choose between the coding options, and will remove the rationale 
		     for the choice and the specifications of the schemes not chosen.
		     The next section provides what is essentially a fourth alternative: a compatibility mode for a
		     sender that needs accurate feedback but has to operate with a legacy <xref target="RFC3168"/> receiver.
	     </t>
	     <section title="Requirements">
		     <t>Mechanisms such as ConEx or DCTCP require the accurate ECN feedback protocol
			     to provide fairly accurate (not necessarily perfect), timely 
			     and protected signaling. This leads to the following requirements:</t>
		     <t><list hangIndent="8" style="hanging">
		     <t hangText="Resilience"><vspace/>The ECN feedback signal is implicitly carried within the 
			     TCP acknowledgment. TCP ACKs can get lost. Moreover, delayed ACKs are usually used 
			     with TCP. That means in most cases only every second data packet gets acknowledged. 
			     In a high congestion situation where most of the packets are marked with CE, an 
			     accurate feedback mechanism must still be able to signal sufficient congestion 
			     information. Thus the accurate ECN feedback extension has to take delayed ACKs and 
			     ACK loss into account.</t>
		     
		     <t hangText="Timely"><vspace/>The CE marking is induced by a network node on the 
			     transmission path and echoed by the receiver in the TCP acknowledgment. Thus when 
			     this information arrives at the sender, it is naturally already about one RTT old. 
			     With a sufficient ACK rate, a further delay of a small number of ACKs can be 
			     tolerated, but with large delays this information will be outdated due to the high 
			     dynamics in the network. TCP congestion control, which introduces part of these 
			     dynamics, operates on a time scale of one RTT. Thus the congestion feedback 
			     information should be delivered in a timely manner (within one RTT).</t>
		     
		     <t hangText="Integrity"><vspace/>With ECN Nonce, a misbehaving receiver can be detected 
			     with a certain probability. As this accurate ECN feedback might reuse the NS bit, 
			     it is encouraged to ensure integrity at least as good as that of ECN Nonce. If this is 
			     not possible, alternative approaches should be provided by which a mechanism using the accurate ECN 
			     feedback extension can re-ensure integrity or give strong incentives for the 
			     receiver and network node to cooperate honestly. <!--If and what kind of 
				enforcements a sender should do, when detecting wrong feedback information, is 
				out-of-scope.--></t>
		     
		     <t hangText="Accuracy"><vspace/><!--In TCP usually delayed ACKs are used. Thats means in 
			    most cases only for every second data packets an acknowledgment is sent. Moreover, 
			    an ACK can get lost.-->Classic ECN feeds back one congestion notification per RTT, as 
			    this is intended for TCP congestion control, which reduces the sending 
			     rate at most once per RTT. The accurate ECN feedback scheme has to ensure that, 
			     if a congestion event occurs, at least one congestion notification is echoed and 
			     received per RTT, as classic ECN would do. Of course, the goal of this extension is to 
			     reconstruct the number of CE markings more accurately. However, a sender should 
			     not assume that it gets the exact number of congestion markings in a high congestion 
			     situation.</t>
		     
		     <t hangText="Complexity"><vspace/>Of course, the more accurate ECN feedback can also be
			     used even if only one ECN feedback signal per RTT is needed. 
			     To enable this proposal for a more
			     accurate ECN feedback as the standard ECN feedback mechanism, the implementation should
			     be as simple as possible and a minimum of additional state information should be needed.</t>
		     </list></t>
     		</section>
	     	<section title="One bit feedback flag" anchor="sm_mode"> <!-- [RS] "ACK state machine mode?" -->
			
			<t>This option uses a one-bit flag, namely the ECE bit, to signal more accurate ECN
			feedback. Unlike classic ECN feedback, an accurate ECN feedback receiver MUST set
			the ECE bit in N subsequent ACK packets (only). An accurate ECN feedback receiver MUST
			NOT wait for a CWR bit from the sender to reset the ECE bit.
			N is not defined yet but is intended to be 2.
			</t>
			<t>Moreover, when a congestion situation occurs or stops, the receiver MUST immediately 
			acknowledge the data packet and MUST NOT delay the acknowledgment until a further data
			packet has arrived. A congestion situation occurs when the previous data packet was CE=0 
			but the current one is CE=1. A congestion situation stops when the previous data
			packet was CE=1 and the current one is CE=0. 
				
			</t>			
			<t>The following figure shows a simple state machine to describe 
				the receiver behavior for N=1.
		</t>
     <figure align="center" anchor="DCTCP_ACK" title="Two state ACK generation state machine">
<artwork align="center"><![CDATA[
                       Send immediate
                       ACK with ECE=0
             .---.     .------------.      .---.
Send 1 ACK  /     v    v             |    |     \  
 for every |     .------.           .------.     | Send 1 ACK
 m packets |     | CE=0 |           | CE=1 |     | for every 
with ECE=0 |     '------'           '------'     | m packets 
            \     |    |             ^    ^     /  with ECE=1
             '---'      '------------'     '---'
                        Send immediate
                        ACK with ECE=1
]]></artwork>
</figure>
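		<t>The two-state machine in the figure above can be sketched in a few lines of Python.
			This is only an illustration under stated assumptions: the class name
			OneBitReceiver and the method on_data are hypothetical, the sketch follows the
			figure (N=1), and we assume that an immediate ACK is cumulative and therefore
			also covers any data packet whose acknowledgment was still being delayed.
		</t>
		<figure><artwork><![CDATA[
```python
class OneBitReceiver:
    """Hypothetical sketch of the two-state ACK generator (N=1)."""

    def __init__(self, m=2):
        self.m = m          # delayed-ACK ratio: one ACK per m packets
        self.state = 0      # CE value of the previous data packet
        self.unacked = 0    # data packets not yet acknowledged

    def on_data(self, ce):
        """Process one data packet with CE value ce (0 or 1).

        Returns the ECE value of the ACK sent now, or None if the
        acknowledgment is delayed.
        """
        self.unacked += 1
        if ce != self.state:
            # Congestion situation occurs or stops: immediate ACK.
            self.state = ce
            self.unacked = 0   # assumption: the ACK is cumulative
            return ce
        if self.unacked >= self.m:
            # Regular delayed ACK, ECE reflects the current state.
            self.unacked = 0
            return self.state
        return None
```
]]></artwork></figure>
		<t>Feeding this sketch a packet sequence in which the CE value alternates with every
			packet yields one ACK per data packet, regardless of m.
		</t>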
		<section title="Discussion">
		<t>ACK loss</t>
		<t>The simplest way to get a more accurate ECN feedback, which allows more than one 
			signal per RTT, is to set the ECE flag only once when a congestion mark occurs, 
			instead of setting the ECE flag in every packet until a CWR flag is received. This 
			solution still only allows one signal per acknowledgment, which might not be sufficient
			when more than one packet is acknowledged at once (delayed ACKs). Even more
			importantly, this information can get lost with the loss of only one ACK packet carrying
			this information. One solution would be to carry the same information in a defined
			number of subsequent ACK packets. This would again reduce the number of feedback
			signals that can be transmitted in one RTT but improve the integrity. 
			More sophisticated solutions based on ACK loss detection might be possible as well.
		</t>
		
		<t>
			<!--This scheme was first proposed in <xref target="Ali10"/> for the use with DCTCP.-->
			Note that the semantics of classic ECN are changed, and the CWR flag is no longer
			interpreted by the receiver to reset the ECE flag. 
			A simple extension of this scheme could make use of the CWR flag. E.g. the receiver could 
			always repeat the value of the ECE flag of the predecessor ACK in the CWR flag.
			However, only a single 
			lost ACK can be addressed that way. The loss of two consecutive ACKs may still
			result in a loss of ECN information to the sender.
			<!--This could allows an extension 
			of this scheme, to accomodate some ACK loss. However, in it's basic form, this
			signaling scheme is still very vulnerable to ACK loss.-->
		</t>
		<t>In low congestion situations (less than one CE mark per RTT on average),
			the loss of m subsequent ACKs would result in complete
			loss of the congestion information. The opposite would 
			be true during high congestion, where the sender can incorrectly assume that all segments 
			were received with the CE codepoint. </t>
		
		<t>With DCTCP <xref target="Ali10"/> it was proposed to acknowledge a data packet directly 
			without delay when a congestion situation occurs, as already described above.
			This scheme allows a more accurate feedback
			signal in a high congestion/marking situation. 
			However, using Delayed ACKs is important for a variety
			of reasons, including reducing the load on the data sender.
			<!--To use delayed ACKs (one cumulative ACK for every m
			consecutively received packets), the DCTCP receiver uses
			the trivial two state state-machine shown in Figure <xref target="DCTCP_ACK"/> to 
			determine whether to immediately send an ACK, and wether to 
			set the ECN-Echo bit. The states correspond
			to whether the last received packet was marked with the CE
			codepoint or not. Since the sender knows how many packets 
			each ACK covers, it can exactly reconstruct the runs of
			marks seen by the receiver.-->
			</t>
			
			<!--+ reaction/ack loss recognition (sender action needed? with/without NS bit?): 
			      delay congestion signal + redundant feedback signal in two subsequent ack or 
			      ack loss detection (for conex)?-->
		
			
				
			<t>As this heuristic triggers immediate ACKs whenever the received CE
				bit toggles, arbitrarily large ACK ratios are supported. However, the effective
				ACK ratio depends on the congestion state of the network. Thus it may collapse
				to 1 (one ACK for each data segment) when every other segment is received with CE
				set.
					
				<!--An additional shortcoming
				of this scheme is the possible deactivation of delayed ACKs, when every
				other segment is received with CE set. The state machine will then trigger
				one ACK for each received segment.-->
			</t>
			<!--<t>In the context of
				DCTCP, with very low RTTs, and special active queue managment (AQM) rules,
				any glitch will get corrected fast without much impact. However, for general
				deployment, the basic form of this signaling scheme does not appear viable.
			</t>-->
			<t>ECN Nonce</t>
			<t>As the ECN Nonce bit is not used otherwise, ECN Nonce <xref target="RFC3540"/> can
				be used as a complement. Network paths not supporting ECN, as well as misbehaving or 
				malicious receivers withholding ECN information, can therefore be detected.
			</t>
			
			
		</section>
     		</section>
		<section title="Three bit field with counter feedback" anchor="eci_mode"> <!-- [RS] shorten to "Echo congestion incremenet" or "Echo congestion value"? -->
			<t>
				The receiver maintains an unsigned integer counter which we call ECC
				(echo congestion counter).  This counter maintains a count of how
				many times a CE marked packet has arrived during the half-connection.
				Once a TCP connection is established, the three TCP header flags
				(ECE, CWR and NS) <!--used for ECN-related functions in other versions of
				ECN--> are used as a 3-bit field for the receiver to continuously signal to the
				sender the current value of ECC, modulo 8, whenever it sends a TCP
				ACK.  We will call these three bits the echo congestion increment (ECI) field.
			</t>
			<t>This overloaded use of these 3 header flags as one 3-bit ECI field is
				shown in <xref target="ECI_ACK"/>.  The actual definition of the TCP header,
				including the addition of support for the ECN Nonce, is shown for
				comparison in <xref target="TCPHdr"/>.  This specification does not redefine the
				names of these three TCP header flags; it merely overloads them with
				another definition once a flow with accurate ECN feedback is established.
			</t>
				
			<figure 
			title="Definition of the ECI field within bytes 13 and 14 of the TCP Header (when SYN=0)."
			align="center" anchor="ECI_ACK">
			<artwork align="center"><![CDATA[	
  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|               |           |           | U | A | P | R | S | F |
| Header Length | Reserved  |    ECI    | R | C | S | S | Y | I |
|               |           |           | G | K | H | T | N | N |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
			]]></artwork></figure>
			<t>Also note that, whenever the SYN flag of a TCP segment is set
				(including when the ACK flag is also set), the NS, CWR and ECE flags
				(i.e. the ECI field of the SYNACK) MUST NOT be interpreted as the
				3-bit ECI value, which is only set as a copy of the local ECC value
				in non-SYN packets.
			</t>
			
			<t>This scheme was first proposed in <xref target="I-D.briscoe-tsvwg-re-ecn-tcp"/> 
				for use with re-ECN. <!--However, without the external framework to
				detect and address misbehaving receivers, a sender alone can not detect
				if a receiver is concealing ECN information.-->
			</t>
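			<t>The interplay of ECC and ECI can be sketched as follows. This Python sketch is
				illustrative only: the class names EciReceiver and EciSender and their
				methods are hypothetical, the mapping of the 3-bit value onto the individual
				NS, CWR and ECE bits is left out, and the sender here simply assumes that the
				field wrapped at most once between two received ACKs.
			</t>
			<figure><artwork><![CDATA[
```python
class EciReceiver:
    """Hypothetical receiver side: counts CE marks in ECC."""

    def __init__(self):
        self.ecc = 0                  # echo congestion counter

    def on_data(self, ce_marked):
        if ce_marked:
            self.ecc += 1

    def eci(self):
        # Value copied into the 3-bit ECI field of every non-SYN ACK.
        return self.ecc % 8


class EciSender:
    """Hypothetical sender side: reconstructs the number of CE marks."""

    def __init__(self):
        self.last_eci = 0
        self.ce_marks = 0

    def on_ack(self, eci):
        # Apparent increase D since the last ACK, modulo 8.
        delta = (eci - self.last_eci) % 8
        self.last_eci = eci
        self.ce_marks += delta
        return delta
```
]]></artwork></figure>
			<t>Because every ACK repeats the current counter value, the sender in this sketch
				recovers the marks signalled by lost ACKs as soon as one later ACK arrives,
				as long as fewer than 8 marks occurred in between.
			</t>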
			
			<section title="Discussion">
			<t>ACK loss</t>
			<t><!--The ECI method was chosen for echoing congestion marking because a
				   re-ECN sender needs to know about every CE mark arriving at the
				   receiver, not just whether at least one arrives within a round trip
				   time (which is all the ECE/CWR mechanism supported).  And, -->
				As pure ACKs are not protected by TCP reliable delivery, we repeat the same
				ECI value in every ACK until it changes.  Even if many ACKs in a row
				are lost, as soon as one gets through, the ECI field it repeats from
				previous ACKs that didn't get through will update the sender on how
				many CE marks arrived since the last ACK got through.</t>
			
			<t>The sender will only lose a record of the arrival of a CE mark if all
				the ACKs are lost (and all of them were pure ACKs) for a stream of
				data long enough to contain 8 or more CE marks.  So, if the marking
				fraction was p, at least 8/p pure ACKs would have to be lost.  For
				example, if p was 5%, a sequence of 160 pure ACKs (without delayed ACKs) 
				would all have to be lost. When ACKs are delayed, this number is reduced 
				by a factor of m. This would still require a sequence of 80 lost pure ACKs with the usual
				delay rate of m=2. </t>
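			<t>The arithmetic of this example can be reproduced with a short Python sketch.
				The function name min_lost_acks is hypothetical, and exact fractions are
				used only to avoid floating-point rounding in the illustration.
			</t>
			<figure><artwork><![CDATA[
```python
from fractions import Fraction
import math


def min_lost_acks(p, m=1):
    """Consecutive pure ACKs that must be lost before 8 CE marks
    can go unnoticed, for marking fraction p and ACK ratio m."""
    return math.ceil(Fraction(8) / Fraction(p) / m)


# Marking fraction of 5%, without and with delayed ACKs (m=2).
without_delay = min_lost_acks(Fraction(5, 100))
with_delay = min_lost_acks(Fraction(5, 100), m=2)
```
]]></artwork></figure>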
				
			<t>Additionally, to protect against such extremely unlikely events, if a
				re-ECN sender detects that a sequence of pure ACKs has been lost, it can
				assume the ECI field wrapped as many times as possible within the
				sequence. For example, if a re-ECN sender receives an ACK with an
				acknowledgement number that acknowledges L (>m) segments since the
				previous ACK but with a sequence number unchanged from the previously
				received ACK, it can conservatively assume that the ECI field
				incremented by D' = L - ((L-D) mod 8), where D is the apparent
				increase in the ECI field.  For example if the ACK arriving after 9
				pure ACK losses apparently increased ECI by 2, the assumed increment
				of ECI would still be 2.  But if ECI apparently increased by 2 after
				11 pure ACK losses, ECI should be assumed to have increased by 10.</t>
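			<t>The conservative estimate D' = L - ((L-D) mod 8) can be checked with a short
				Python sketch. The function name assumed_eci_increment is hypothetical, and
				the checks on the arguments (D is a 3-bit difference and no more marks than
				newly acknowledged segments) are our assumptions.
			</t>
			<figure><artwork><![CDATA[
```python
def assumed_eci_increment(L, D):
    """Largest increment D' <= L that is consistent with the apparent
    3-bit increase D, for L newly acknowledged segments."""
    if not 0 <= D < 8:
        raise ValueError("D must fit into the 3-bit ECI field")
    if L < D:
        raise ValueError("more CE marks than acknowledged segments")
    return L - ((L - D) % 8)


# The two examples from the text: L=9 and L=11, both with D=2.
nine = assumed_eci_increment(9, 2)
eleven = assumed_eci_increment(11, 2)
```
]]></artwork></figure>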
			
			<!--<t>A re-ECN sender MAY implement a heuristic algorithm to predict beyond
				reasonable doubt that the ECI field probably did not wrap within a
				sequence of lost pure ACKs.  But such an algorithm is OPTIONAL.  Such
				an algorithm MUST NOT be used unless it is proven to work even in the
				presence of correlation between high ACK loss rate on the back
				channel and high CE marking rate on the forward channel.</t>
			
			<t>Whatever assumption a re-ECN sender makes about potentially lost CE
				marks, both its congestion control and its re-echoing behaviour
				SHOULD be consistent with the assumption it makes.</t>-->
			<!--<t>As the current value of the ECC is repeatedly signaled, in situations with a low
				congestion rate this scheme has 
				no issue with ACK loss . 
				In the worst, when all ACKs in one RTT are lost but one, sender and receiver stay in sync as long as fewer than 7 segments were CE marked.  As in this scenario, the ACK clock is basically disabled, TCP performance 
				will be impacted regardless of the accuracy of the ECN feedback signal.
			</t>
			<t>During phases of high congestion, where every data segment is marked 
				with CE, no more than two or three consecutive, delayed (m=2) ACKs can get 
				lost to keep in sync. The number of ACK that can get lost is thus depending on 
				the rate of the delayed ACKs and the actually length of the period of CE marks. 
				If the receiver is sending immediate ACKs, e.g. when 
				some reordering or loss has been detected, up to 6 consecutive ACKs
				can be lost without loosing the synchronization between sender and 
				receiver. The latter scenario may happen if a single segment was 
				dropped, and subsequent segments get CE marked as the congestion on the link
				persists. 
			</t>
			<t>The highest ACK ratio, where at least a single lost ACK will never
				cause the counters between sender and receiver to become unsynchronized, 
				is four. This again assumes a very high congestion scenario, where each
				data segment is marked with CE.
			</t>-->
			<t>ECN Nonce</t>
			<t>ECN Nonce cannot be used in parallel with this scheme. However, mechanisms
				that make use of this new scheme might provide stronger incentives to declare 
				congestion honestly when needed. 
				E.g., with ConEx, each congestion notification suppressed by the 
				receiver should lead the ConEx audit function to 
				discard an equivalent number of bytes, such that the receiver does not gain from
				suppressing feedback. This would provide an even stronger integrity mechanism
				than ECN Nonce does.
				Without an external framework to discourage 
				the withholding of ECN information, this scheme is vulnerable to the problems
				described in <xref target="RFC3540"/>.
			</t>
			
			
			    
			
			</section>
			
				 
				
				<!--Receiver Action in RECN Mode
				
				Every time a CE marked packet arrives at a receiver in RECN mode,
				the receiver transport increments its local value of ECC and MUST
				echo its value, modulo 8, to the sender in the ECI field of the
				next ACK.  It MUST repeat the same value of ECI in every
				subsequent ACK until the next CE event, when it increments ECI
				again.
				
				The increment of the local ECC values is modulo 8 so the field
				value simply wraps round back to zero when it overflows.  The
				least significant bit is to the right (labelled bit 9).
				
				A receiver in RECN mode MAY delay the echo of a CE to the next
				delayed-ACK, which would be necessary if ACK-withholding were
				implemented.

				Sender Action in RECN Mode
				
				On the arrival of every ACK, the sender compares the ECI field
				with its own ECC value, then replaces its local value with that
				from the ACK.  The difference D (D = (ECI + 8 - ECC mod 8) mod 8)
				is assumed to be the number of CE marked packets that arrived at
				the receiver since it sent the previously received ACK (but see
				below for the sender's safety strategy).
								
				As we have already emphasised, the re-ECN protocol makes no
				changes and has no effect on the TCP congestion control algorithm.
				So, the first increment of ECI (or detection of a drop) in a RTT
				triggers the standard TCP congestion response, no more than one
				congestion response per round trip, as usual.  However, the sender
				re-echoes every increment of ECI irrespective of RTTs.
				
				A TCP sender also acts as the receiver for the other half-
				connection.  The host will maintain two ECC values S.ECC and R.ECC
				as sender and receiver respectively.  Every TCP header sent by a
				host in RECN mode will also repeat the prevailing value of R.ECC
				in its ECI field.  If a sender in RECN mode has to retransmit a
				packet due to a suspected loss, the re-transmitted packet MUST
				carry the latest prevailing value of R.ECC when it is re-
				transmitted, which will not necessarily be the one it carried
				originally.-->
				
				
			
     		</section>
		<section title="Codepoints with dual counter feedback" anchor="cp_mode"> <!-- [RS] "Codepoint Nonce Feedback"? -->
			<t>In line with the definition in the previous section (Figure 3), the 
				ECE, CWR and NS bits are used as one field, but instead of a single 3-bit
				counter they encode 8 codepoints. 
				These 8 codepoints, as shown below, encode either a "congestion 
				indication" (CI) counter or an ECT(1) counter (E1). These counters maintain 
				the number of CE marks and the number of ECT(1) signals observed at the 
				receiver, respectively.</t>
     	<texttable anchor="Tab2" align="center" title="Codepoint assignment for accurate ECN feedback">
			  <ttcol align="center">ECI</ttcol>
     		<ttcol align="center">NS</ttcol>
     		<ttcol align="center">CWR</ttcol>
     		<ttcol align="center">ECE</ttcol>
     		<ttcol align="center">CI (base5)</ttcol>
     		<ttcol align="center">E1 (base3)</ttcol>
     		<c>0</c> <c>0</c> <c>0</c> <c>0</c> <c>0</c> <c>-</c>
     		<c>1</c> <c>0</c> <c>0</c> <c>1</c> <c>1</c> <c>-</c>
     		<c>2</c> <c>0</c> <c>1</c> <c>0</c> <c>2</c> <c>-</c>
     		<c>3</c> <c>0</c> <c>1</c> <c>1</c> <c>3</c> <c>-</c>
     		<c>4</c> <c>1</c> <c>0</c> <c>0</c> <c>4</c> <c>-</c>
     		<c>5</c> <c>1</c> <c>0</c> <c>1</c> <c>-</c> <c>0</c>
     		<c>6</c> <c>1</c> <c>1</c> <c>0</c> <c>-</c> <c>1</c>
     		<c>7</c> <c>1</c> <c>1</c> <c>1</c> <c>-</c> <c>2</c>	
     	</texttable>	
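		<t>The mapping in <xref target="Tab2"/> can be expressed compactly in code. The
			following Python sketch is illustrative only; the function names decode_eci
			and encode_eci are assumptions made for this example. It treats NS as the
			most significant bit and ECE as the least significant bit of the ECI field,
			as in the table.</t>
		<figure><artwork><![CDATA[
```python
CI_CODEPOINTS = 5   # ECI values 0..4 carry the CI counter (modulo 5)
E1_CODEPOINTS = 3   # ECI values 5..7 carry the E1 counter (modulo 3)


def decode_eci(ns, cwr, ece):
    """Return (counter name, value on the wire) for the three flags."""
    eci = (ns << 2) | (cwr << 1) | ece
    if eci < CI_CODEPOINTS:
        return "CI", eci
    return "E1", eci - CI_CODEPOINTS


def encode_eci(name, counter):
    """Return the (NS, CWR, ECE) bits that signal a counter value."""
    if name == "CI":
        eci = counter % CI_CODEPOINTS
    elif name == "E1":
        eci = CI_CODEPOINTS + counter % E1_CODEPOINTS
    else:
        raise ValueError("unknown counter: " + name)
    return (eci >> 2) & 1, (eci >> 1) & 1, eci & 1


print(decode_eci(0, 1, 1))    # ECI = 3
print(decode_eci(1, 1, 0))    # ECI = 6
print(encode_eci("CI", 5))    # CI counter wrapped once
```
]]></artwork></figure>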
      <!--<t>TBD: Stipulate the alternate sending of the two counters, unless either 
      	counter changes, and then to send a new value twice? This would decrease
      	ACK ratio under high congestion, but increase the number of consecutive 
      	lost ACKs to 2, which can be always tolerated.</t> -->
		<t>By default, an accurate ECN receiver MUST echo the CI counter 
			(modulo 5) with the respective codepoints. Whenever a CE mark arrives and thus the value of
			CI has changed, the receiver MUST echo the CI in the next ACK.
			Moreover, the receiver MUST repeat the codepoint that provides 
			the CI counter directly on the subsequent ACK. Thus every value of CI 
			will be transmitted at least twice.</t>
				
				
		<t>If an ECT(1) mark is received and thus E1 increases, the receiver has to convey that
			updated information to the sender as soon as possible. Thus, on the reception 
			of an ECT(1) marked packet, the receiver MUST
			signal the current value of the E1 counter (modulo 3) in the next
			ACK, unless a CE mark was received that has not yet been echoed twice. The receiver MUST
			also repeat every E1 value. But this repetition does not need to be in the
			subsequent ACK, as the E1 value will only be transmitted when no changes in the CI
			have occurred. Each E1 value will be sent exactly twice. The repetition of every
			signal provides further resilience against lost ACKs.</t>
			
		<t>As only a limited number of E1 codepoints exist and the receiver might not
			acknowledge every single data packet immediately (delayed ACKs), a sender SHOULD NOT
			mark more than 1/m of the packets with ECT(1), where m is the ACK ratio (e.g. 50% when
			every second data packet triggers an ACK). This constraint will avoid a
			permanent feedback of E1 only. <!--, and never more than two 
			consecutive packets.--></t>
			
		 
			<!--<t>For resilience against lost ACKs, a second ACK has to
        transmit the previous codepoint again, whether another 
	congestion indication (CE) or ECT(1) mark arrives or not.</t>--> 
     	 
      		<t>Given the limited number of codepoints, this requirement may conflict with
		delayed ACK ratios larger than two. In that case, a receiver 
		MUST increase its ACK rate such 
		that feedback signals can be sent at a sufficient rate. Details on how the change in the ACK rate should be implemented are given in the next subsection.
		<!--Under certain
		circumstances, i.e. the sender using excessive ECT(1) marks, every packet
		may be immediately get ACK'ed. The available codepoints for CI allow
		the indefinite use of delayed ACKs with a ratio of two, even during 
		heavy network congestion. --></t>
		<section title="Implementation">
		<t>The basic idea is for the receiver to count how many packets carry a 
			congestion notification. This could, in principle, be achieved by 
			increasing a "congestion indication" counter (CI.c) for every incoming CE marked
			segment. Since the space for communicating the information back to the 
			sender in ACKs is limited, a "gauge" (CI.g) is increased instead of
			the counter itself.
		</t>
		<t>When sending an ACK, the content of this gauge (capped by the maximum 
			number that can be encoded in the ACK, e.g. 4 for CI, and 2 for E1) is 
			added to the actual counter, and CI.g is reduced by the value 
			that was copied over and transmitted, unless CI.g was zero before. To 
			avoid losing information, the receiver ensures that an ACK is sent
			immediately once too many congestion marks have accumulated (i.e. when
			CI.g exceeds 5).
		</t>
		<t>For resilience against lost ACKs, an indicator flag (CI.i) ensures that, 
			whether another congestion indication arrives or not, a second ACK 
			transmits the previous counter value again.
		</t>
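		<t>The interplay of gauge, counter and indicator flag described above can be
			sketched for a single signal as follows. This is a minimal Python sketch and
			not a normative implementation; the class name GaugedCounter and its method
			names are assumptions made for illustration, while the attributes c, g, i and
			cp mirror CI.c, CI.g, CI.i and CI.cp. The loop replays the CE arrivals and
			ACKs of the example in <xref target="Tab4"/>.</t>
		<figure><artwork><![CDATA[
```python
class GaugedCounter:
    """Counter (c), gauge (g) and repeat indicator (i) for one signal."""

    def __init__(self, codepoints):
        self.cp = codepoints
        self.c = 0
        self.g = 0
        self.i = 0

    def on_mark(self):
        # Marks are collected in the gauge, not in the counter.
        self.g += 1

    def on_ack(self):
        """Return the value signaled on the wire for the next ACK."""
        if self.i == 0 and self.g > 0:
            # Move at most cp-1 marks from the gauge to the counter.
            step = min(self.cp - 1, self.g)
            self.c = (self.c + step) % (self.cp * self.cp)
            self.g -= step
            self.i = 1          # this value must be repeated once
        elif self.i > 0:
            self.i -= 1         # second transmission of the same value
        return self.c % self.cp


ci = GaugedCounter(codepoints=5)
trace = []
for marks_before_ack in (1, 2, 2, 2):
    for _ in range(marks_before_ack):
        ci.on_mark()
    trace.append((ci.on_ack(), ci.c, ci.g))
print(trace)
```
]]></artwork></figure>
		<t>The resulting tuples (wire value, CI.c, CI.g) are (1, 1, 0), (1, 1, 2),
			(0, 5, 0) and (0, 5, 2), i.e. the values shown for the four ACKs in the
			example table.</t>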
		
		<t>The same counter / gauge method is used to count and feed back (using 
			a different mapping) the number of incoming packets marked ECT(1) 
			(called E1 in the algorithm). As fewer codepoints are available for
			conveying the E1 counter value, an immediate ACK MUST be triggered
			whenever the gauge E1.g exceeds a threshold of 3. The sender receives 
			the receiver's counter values and compares them with the locally 
			maintained counters. Any increase of these counters is added to the 
			sender's internal counters, yielding a precise number of CE-marked 
			and ECT(1) marked packets. Conceptually, the counters never decrease 
			during a TCP session. The values signaled on the wire, however, wrap
			modulo 5 for CI and modulo 3 for E1.</t>
		<t>The following table provides an example showing a half-connection with a TCP sender A and 
			receiver B. The sender maintains a counter CI.r to reconstruct the number of CE marks 
			received at the receiver.</t>
		

<texttable anchor="Tab4" align="center" title="Codepoint signal example">
<ttcol align="center"> </ttcol>
<ttcol align="center">Data</ttcol>
<ttcol align="right">TCP A</ttcol>
<ttcol align="right">IP</ttcol>
<ttcol align="right">TCP B</ttcol>
<ttcol align="center">Data</ttcol>
<c>  </c> <c/> <c>SEQ   ACK CTL</c>   <c/>   <c>SEQ   ACK CTL</c> <c/> 
<c>--</c> <c/> <c>-------------</c> <c>----------</c>  <c>-------------</c> <c/> 
<c>1</c>  <c/> <c>0100      SYN</c> <c> ---->     </c>  <c>              </c> <c/> 
<c> </c>  <c/> <c>   CWR,ECE,NS</c> <c>      </c>  <c>              </c> <c/> 
<c>2</c>  <c/> <c>             </c> <c> &lt;---- <!--ECT0--></c>  <c>0300 0101 SYN </c> <c/> 
<c> </c>  <c/> <c>             </c> <c>      </c>  <c>      ACK,CWR </c> <c/> 
<c>3</c>  <c/> <c>0101 0301 ACK</c> <c> ECT0 -CE-></c>  <c>   </c> <c/>
<!--<c> </c>  <c/> <c>ECI=CI.0     </c> <c/>                   <c/>               <c/>-->
<c/>      <c/> <c> </c>        <c/>                    <c>CI.c=0 CI.g=1</c>                  <c/>
<c>4</c>  <c>100</c> <c>0101 0301 ACK</c> <c>ECT0 ----></c> <c/>  <c/>
<!--<c> </c>  <c/> <c>ECI=CI.0     </c> <c/>                   <c/>               <c/>-->
<c/>      <c/> <c> </c>        <c/>                    <c>CI.c=1 CI.g=0</c>                  <c/>
<c>5</c>  <c/> <c></c> <c>&lt;----     </c> <c>0301 0201 ACK</c> <c/>
<c> </c>  <c/> <c/>          <c/>                 <c>ECI=CI.1</c> <c/>
<c/>      <c/> <c>CI.r=1</c>        <c/>                    <c/>                  <c/>
<c>6</c>  <c>100</c> <c>0201 0301 ACK</c> <c>ECT0 -CE-></c> <c/>  <c/>
<!--<c> </c>  <c/> <c>ECI=CI.0     </c> <c/>                   <c/>               <c/>-->
<c/>      <c/> <c> </c>        <c/>                    <c>CI.c=1 CI.g=1</c>                  <c/>
<c>7</c>  <c>100</c> <c>0301 0301 ACK</c> <c>ECT0 -CE-></c> <c/>  <c/>
<!--<c> </c>  <c/> <c>ECI=CI.0     </c> <c/>                   <c/>               <c/>-->
<c/>      <c/> <c> </c>        <c/>                    <c>CI.c=1 CI.g=2</c>                  <c/>
<c>8</c>  <c/> <c></c> <c>XX--     </c> <c>0301 0401 ACK</c> <c/>
<c> </c>  <c/> <c/>          <c/>                 <c>ECI=CI.1</c> <c/>
<c/>      <c/> <c>CI.r=1</c>        <c/>                    <c/>                  <c/>
<c>9</c>  <c>100</c> <c>0401 0301 ACK</c> <c>ECT0 -CE-></c> <c/>  <c/>
<!--<c> </c>  <c/> <c>ECI=CI.0     </c> <c/>                   <c/>               <c/>-->
<c/>      <c/> <c> </c>        <c/>                    <c>CI.c=1 CI.g=3</c>                  <c/>
<c>10</c> <c>100</c> <c>0501 0301 ACK</c> <c>ECT0 -CE-></c> <c/>  <c/>
<!--<c> </c>  <c/> <c>ECI=CI.0     </c> <c/>                   <c/>               <c/>-->
<c/>      <c/> <c> </c>        <c/>                    <c>CI.c=5 CI.g=0</c>                  <c/>
<c>11</c> <c/> <c></c> <c>&lt;----     </c> <c>0301 0601 ACK</c> <c/>
<c> </c>  <c/> <c/>          <c/>                 <c>ECI=CI.0</c> <c/>
<c/>      <c/> <c>CI.r=5</c>        <c/>                    <c/>                  <c/>
<c>12</c> <c>100</c> <c>0601 0301 ACK</c> <c>ECT0 -CE-></c> <c/>  <c/>
<!--<c> </c>  <c/> <c>ECI=CI.0     </c> <c/>                   <c/>               <c/>-->
<c/>      <c/> <c> </c>        <c/>                    <c>CI.c=5 CI.g=1</c>                  <c/>
<c>13</c> <c>100</c> <c>0701 0301 ACK</c> <c>ECT0 -CE-></c> <c/>  <c/>
<!--<c> </c>  <c/> <c>ECI=CI.0     </c> <c/>                   <c/>               <c/>-->
<c/>      <c/> <c> </c>        <c/>                    <c>CI.c=5 CI.g=2</c>                  <c/>
<c>14</c> <c/> <c></c> <c>&lt;----     </c> <c>0301 0801 ACK</c> <c/>
<c> </c>  <c/> <c/>          <c/>                 <c>ECI=CI.0</c> <c/>
<c/>      <c/> <c>CI.r=5</c>        <c/>                    <c/>                  <c/>


<!--
   |  1 |      | 0100      SYN  | FNE   | -   |      R.ECC=0  |      |
   |    |      |    CWR,ECE,NS  |       |       |               |      |
   |  2 |      |      R.ECC=0   | <-   | FNE   | 0300 0101     |      |
   |    |      |                |       |       |   SYN,ACK,CWR |      |
   |  3 |      | 0101 0301 ACK  | RECT  | -   |      R.ECC=0  |      |
   |  4 | 1000 | 0101 0301 ACK  | FNE   | -  |      R.ECC=0  |      |
   |  5 |      |      R.ECC=0   | <-   | FNE   | 0301 1102 ACK | 1460 |
   |  6 |      |      R.ECC=0   | <-   | RECT  | 1762 1102 ACK | 1460 |
   |  7 |      |      R.ECC=0   | <-   | FNE   | 3222 1102 ACK | 1460 |
   |  8 |      | 1102 1762 ACK  | RECT  | -   |      R.ECC=0  |      |
   |  9 |      |      R.ECC=0   | <-   | RECT  | 4682 1102 ACK | 1460 |
   | 10 |      |      R.ECC=0   | <-   | RECT  | 6142 1102 ACK | 1460 |
   | 11 |      | 1102 3222 ACK  | RECT  | -   |      R.ECC=0  |      |
   | 12 |      |      R.ECC=0   | <-   | RECT  | 7602 1102 ACK | 1460 |
   | 13 |      |      R.ECC=1   | <*-   | RECT  | 9062 1102 ACK | 1460 |
   |    |      | ...            |       |       |               |      |
   -->
   </texttable>
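		<t>The sender-side reconstruction of CI.r in the example above can be sketched
			as follows. The function name update_sender_counter is an assumption made
			for illustration. The sketch adds the non-negative difference, modulo the
			number of CI codepoints, between the value on the wire and the sender's own
			counter.</t>
		<figure><artwork><![CDATA[
```python
def update_sender_counter(ci_r, eci_value, codepoints=5):
    """Add the (non-negative) change signaled on the wire to CI.r."""
    delta = (eci_value + codepoints - ci_r % codepoints) % codepoints
    return ci_r + delta


ci_r = 0
history = []
for eci_value in (1, 0, 0):     # ACKs 5, 11 and 14; ACK 8 is lost
    ci_r = update_sender_counter(ci_r, eci_value)
    history.append(ci_r)
print(history)
```
]]></artwork></figure>
		<t>Although ACK 8 is lost, the sender arrives at CI.r=5 after ACK 11, because
			the wire value changes from 1 to 0, which corresponds to an increase of 4
			modulo 5.</t>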
			</section>
			<section title="Discussion">
			<t>ACK loss</t>
			<t>As this scheme sends each codepoint (of the two subsets) at least two
				times, at least one ACK, and in some cases two consecutive ACKs, can be
				lost without the sender losing congestion information. Further
				refinements, such as interleaving ACKs when sending codepoints belonging
				to the two subsets (e.g. CI, E1), can allow the loss of any two 
				consecutive ACKs without the sender losing congestion information, at 
				the cost of also reducing the ACK ratio.
			</t>
			<t>At low congestion rates, the sending of the current value of the CI 
				counter by default allows higher numbers of consecutive ACKs to be
        			lost, without impacting the accuracy of the ECN signal.
      			</t>
			<t>ECN Nonce</t>
			<t>By comparing the number of incoming ECT(1) notifications with 
				the actual number of packets that were transmitted with an ECT(1) mark, 
				as well as with the sum of the sender's two internal counters, the sender 
				can probabilistically detect a receiver that sends false marks or suppresses
				accurate ECN feedback, or a path that doesn't properly support ECN.
			</t>
			<t>This approach maintains a balanced selection of properties found in
				ECN Nonce, <xref target="eci_mode"/>, and <xref target="sm_mode"/>.
				A delayed ACK ratio of two can be sustained indefinitely even during 
				heavy congestion, but not during excessive ECT(1) marking, which is 
				under the control of the sender. Higher ACK ratios can be sustained
				when congestion is low, subject to the need for E1 feedback. 
				<!-- as a high ACK ratios will not cause a loss in timeliness or accuracy. 
				     -> MK: dont understand that...?-->
			</t>
			</section>
			
      
      
				<!--
				
				This approach aims to keep one integrity mechanism similar to ECN-None.
				    
				If the codepints are taken from 3 bits, and
				assigned properly, the wire protocol does not impose limits on
				regular delayed ACKs (1 ack per 2 data seg, typical), even under 
				severe congestion where 100% CE marks are received...
				
				The idea similar to the idea of the previous section is to signal the
				"absolute value" on the wire protocol, not some deltas (or
				bit-flipping).
				Next idea is, to signal two counters independently from each
				other (one for the CE, one for ECT1) so that the sender can check
				a equation, and	probabilistically determine, if the
				received counters are trustable.
				
				Compared to the very simple feedback of one counter, the limited
				number of codepoints requires two major changes, to address
				resiliency and accuracy:
				The value on the wire should be repeated at least twice even 
				under worst case conditions (so that a single ACK loss is always
				acceptable), and the chance between successive counter values 
				must not overflow.
				
				Finally, as a fall-back mechanism to maintain these properties
				under certain conditions (ie AckCC with DelACK > 2, or extreme
				high ECT marking probabilities), the receive may be required to
				interrupt a pending delack, and instead send out an immediate ACK.
				Note that for normal implementations with delack=2 and low ECT1
				marking probabilities, this will not be triggered.
				
				In the original ECI scheme, a binary counter is maintained in the
				receiver, and the lowest 3 bits mapped directly to ECI; Overflows of a
				binary counter are by definition at some power of 2...
				
				The new scheme requires (much?) more state in the receiver - a counter, a
				gauge and a Boolean flag, two times (for CE and ECT(1)).
				The CE counter is mapped to 5 codepoints in ECI, and the ECT1 to 3
				codepoints. As there is no natural base-5 or base-3 counter, the overflows
				have to be handled explicitly (ie. Modulo 5^n / 3^n) in the receiver too.
				
				This scheme has all the required properties:
			<list hangIndent="10" style="empty">
			     <t>Resiliency - In worst case, any one ACK can be dropped, two consecutive
			     dropped ACKs only impact accuracy in 50% - when CE/ECT marking rates are
			     very high (>>50%). At normal marking rates, there is a high redundancy in
			     the signal - many ACKs may be lost without the counters getting
			     unsynchronized.</t>
		     
			     <t>Timely - CE signals can be fed back at the same rate they are received,
			     using delayed ACKs only, even when keeping the value constant for every 2
			     ACKs; (The sender has control over ECT(1), and should not send at a ECT(1)
			     marking rate exceeding 50%; if it does, the scheme below will disable
			     delayed ACKs to keep up. Alternatively, the Gauge could be allowed to fill
			     up and the signal returned with delay. The sender should be
			     able to implicitly disable delayed ACKs on purpose sometimes (ie.
			     End-of-stream / accurate timing information etc).</t>
		     
			     <t>Integrity - The feedback of two independent signals allows the sender to
			     verify the plausibility of the counters reported by the receiver.
			     Lost (or never sent) ACKs can also be detected by the sender. As with
			     ECN-Nonce, a misbehaving receiver can only be detected with a certain
			     probability though. If and what kind of enforcements a sender should do,
			     would be out-of-scope (ie cwnd=RW, IW or 1; RST; logging...)</t>
		     
			     <t>Accuracy - Multiple signals can be conveyed from the receiver back to
			     the sender. For CE signals, these signals may be delayed by 3 (data)
			     segments. With the algoritm below, ECT(1) signals may lag  4-5 segments
			     behind (until delacks are disabled; then this offset is kept).</t>
			</list>-->
     		</section>

		
     		<section title="Short Summary of the Discussions">
     			<t>With the exception of the signaling scheme described in <xref target="sm_mode"/>, all 
     				signaling schemes may fail to work if middleboxes intervene and check the semantics of
     				<xref target="RFC3168"/> signals.</t>
     			<t>The scheme described in <xref target="cp_mode"/> is the most complex to implement,
     				especially on a receiver, where much more state has to be kept than for the
     				other signaling schemes. However, with the advances in compute power, many more cycles are 
     				available to process TCP than ever before.</t>
     			<t><xref target="Tab3"/> gives an overview of the relative implications of the different
     				proposed signaling schemes. Further discussion should be included here in the next version of this document.</t>
     			<texttable anchor="Tab3" title="Overview of accurate feedback schemes">
     				<ttcol align="center">Section</ttcol>
     				<ttcol align="center">Resi- liency</ttcol>
     				<ttcol align="center">Timely</ttcol>
     				<ttcol align="center">Integrity</ttcol>
     				<ttcol align="center">Accuracy</ttcol>
     				<ttcol align="center">Complexity</ttcol>
				<c>1-bit-flag</c>
				<c> -</c> <c> +</c> <c> + <!--*)-->  </c> <c> -</c> <c> +</c> 
				  <c>3-bit-field</c>
     				  <c>     ++</c> <c>     ++</c> <c>--   </c> <c>++</c> <c> -</c> 
				  <c>Codepoints</c>
     				  <c> +</c> <c> +</c> <c>+   </c> <c>++</c> <c>--</c> 
     				<!--<c><xref target="comp_mode" format="counter"/></c>
				<c>     minusminus</c> <c> -</c> <c>- *)</c> <c>minusminus</c> <c>++</c> -->
     				<!--<postamble>*) could be combined with ECN-Nonce</postamble>-->
     			</texttable>
     				
     			
     		</section>
     </section>
     
     
     <section title="TCP Sender">
       <t> This section will specify the sender-side action, describing how to extract the accurate number of congestion markings from the given receiver feedback signal.
       </t>
     </section>
     <section title="TCP Receiver">
       <t> This section will describe the receiver-side action to signal the accurate ECN feedback back to the sender. In any case, the receiver will need to maintain a counter of how many CE markings have been seen during a connection. Depending on the chosen coding scheme, there will be different actions to set the corresponding bits in the TCP header. In all cases, it might be helpful if the receiver is able to switch from a delayed ACK behavior to sending ACKs immediately after data packet reception in a high congestion situation.
       </t>
     </section>
     
     <section title="Advanced Compatibility Mode" anchor="comp_mode">
	     <t>
		     This section describes a possibility to achieve more accurate feedback even when
		     the receiver is not capable of the new accurate ECN feedback scheme, with the drawback of
		     less reliability.
	     </t>
	     <t>During initial deployment, a large number of receivers will only support
		     <xref target="RFC3168"/> classic ECN feedback. Such a receiver will set the
		     ECE bit whenever it receives a segment with the CE codepoint set, and clear
		     the ECE bit only when it receives a segment with the CWR bit set. As the CE 
		     codepoint has priority over the CWR bit (Note: the wording in this regard
		     is ambiguous in <xref target="RFC3168"/>, but the reference implementation of 
		     ECN in ns2 is clear), a <xref target="RFC3168"/> compliant
		     receiver will not clear the ECE bit on the reception of a segment, where both
		     CE and CWR are set simultaneously. This property allows the use of a compatibility
		     mode, to extract more accurate feedback from legacy <xref target="RFC3168"/>
		     receivers by setting the CWR permanently.
	     </t> 
	     <t>Assuming a delayed ACK ratio of one, a sender can permanently set the CWR 
		     bit in the TCP header to receive more accurate feedback of the CE codepoints 
		     as seen at the receiver. This feedback signal is, however, very brittle: any
		     ACK loss may cause congestion information to be lost, and
		     neither delayed ACKs nor ACK loss can be accounted for in a reliable
		     way. Therefore, a sender would need to use heuristics to determine the
		     current delayed ACK ratio m used by the receiver (e.g. most receivers will
		     use m=2), and also the recent ACK loss ratio (l). Acknowledgement Congestion Control
		     (AckCC) as defined in <xref target="RFC5690"/> cannot be used, as deployment
		     of this feature is only experimental.
	     </t>
	     <t>Using a phase locked loop algorithm, the CWR bit can then be set only on
		     those data segments, that will trigger a (delayed) ACK. Thereby, no congestion
		     information is lost, as long as the ACK carrying the ECE bit is seen by the
		     sender.
	     </t>
	     <t>Whenever the sender sees an ACK with 
		     ECE set, this indicates that at least one, and at most m / (m - l), data 
		     segments with the CE codepoint set were seen by the receiver. The sender
		     SHOULD react as if m CE indications were reflected back to the sender by 
		     the receiver, unless additional heuristics (e.g. dead time correction)
		     can determine a more accurate value of the "true" number of received CE marks.
	     </t>
      </section>

   </section>

   <section title="Acknowledgements">
	   <t> We want to thank Michael Welzl and Bob Briscoe for their input and discussion.
	   </t>
   </section>

   <section anchor="IANA" title="IANA Considerations">
     <t>This memo includes no request to IANA.</t>
     <!--<t> If this memo was to progress to standards track, it would update RFC3168 
     	and RFC3540, to add new combinations of flags in the TCP header for capability
      negotiation (see <xref target="TCPNeg"/>) and a change in TCP ECN semantics 
      (see <xref target="TCPSig"/>).</t>-->
   </section>

   <section anchor="Security" title="Security Considerations">
	   <t>For coding schemes that increase robustness for the ECN feedback, similar
	   	considerations as in RFC3540 apply for the selection of when to send an ECT(1)
	   	codepoint.</t>
   </section>
   
 </middle>

 <!--  *****BACK MATTER ***** -->

 <back>

   <references title="Normative References">
     <!--?rfc include="http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml"?-->
     &RFC2119;
     &RFC3168;
     &RFC3540;
     
   </references>
   
   <references title="Informative References">
	   <?rfc include="reference.I-D.briscoe-tsvwg-re-ecn-tcp.xml"?>

     &RFC5562;
     &RFC5681;
     &RFC5690;
	   
		<reference anchor="Ali10">
			<front>
				<title>DCTCP: Efficient Packet Transport for the Commoditized Data Center</title>
				
				<author initials="M" surname="Alizadeh">
					<organization></organization></author>
				<author initials="A" surname="Greenberg">
					<organization></organization></author>
				<author initials="D" surname="Maltz">
					<organization></organization></author>
				<author initials="J" surname="Padhye">
					<organization></organization></author>
				<author initials="P" surname="Patel">
					<organization></organization></author>
				<author initials="B" surname="Prabhakar">
					<organization></organization></author>
				<author initials="S" surname="Sengupta">
					<organization></organization></author>
				<author initials="M" surname="Sridharan">
					<organization></organization></author>
				<date month="Jan" year="2010"/>
			</front>
		</reference>
		
		
    </references>

    <section anchor="app-codepoints" title="Pseudo Code for the Codepoint Coding">
	    	    <t>Receiver:</t>
		    
		    <t>Input signals: CE, ECT(1)<vspace blankLines="0" />
		    TCP fields: ECI (3-bit field formed from NS, CWR and ECE). CI.cm and E1.cm map into these 8 codepoints (i.e. 5 and 3 codepoints, respectively).</t>
		    
		    <t>These counters get tracked by the following variables:</t>
		    
		    <t>CI.c (congestion indication - counter, modulo a multiple of the available codepoints to represent CI.c in the ECI field. Range[0..n*CI.cp-1])<vspace blankLines="0" />
		    CI.g (congestion indication - gauge, [0.."inf"])<vspace blankLines="0" />
		    CI.i (congestion indication - iteration, [0,1])<vspace blankLines="0" />
		    These are to track CE indications.</t>
		    
		    <t>E1.c, E1.g and E1.i (doing the same, but for ECT(1) signals).</t>
		    
		    <t>Constants:<vspace blankLines="0" />
		    CI.cp (number of codepoints available for CI signal)<vspace blankLines="0" />
		    CI.cm[0..(CI.cp-1)] (codepoint mappings for CI)<vspace blankLines="0" />
		    E1.cp (number of codepoints available for E1 signal)<vspace blankLines="0" />
		    E1.cm[0..(E1.cp-1)] (codepoint mappings for E1)</t>
		    

		    <figure><artwork><![CDATA[
At session initialization, all these counters are set to 0;

When a Segment (Data, ACK) is received, 
perform the following steps:

If a CE codepoint is received,
    Increase CI.g by 1
If an ECT(1) codepoint is received,
    Increase E1.g by 1
If (CI.g > 5)        # When ACK rate is not sufficient to keep
or (E1.g > 3)        # gauge close to zero, increase ACK rate
                     # works independent of delACK number (ie AckCC)
    Cancel pending delayed ACK (ACK this segment immediately)
    # this increases the ACK rate to a maximum of 1.5 data segments 
    # per ACK, with delACK=2,
    # and CE mark rate exceeds 75% for a number 
    # of at least 18 segments. 
    # 5 codepoints would allow delack=2 indefinitely btw

When preparing an ACK to be sent:

If (CI.g > 0) or
   ((E1.i != 0) and (CI.i != 0))   # E1.g = 0 is to skip this 
                                   # if only the 2nd CI.c ACK
    # has to be sent - effectively alternating CI.c and E1.c on ACKs
    # should give slightly better resiliency against ack losses
    If CI.i == 0                   # updates to CI.c allowed
    and CI.g > 0                   # update is meaningful
        CI.i = 1                   # may be larger 
                                   # if more resiliency is reqd
        CI.c += min(CI.cp-1,CI.g)  # CI.cp-1 is 3 for 4 codepoints, 
                                   # 4 for 5 etc
        CI.c = CI.c modulo CI.cp*CI.cp # using modulo the square of 
                                   # available codepoints, 
                                   # for convenience (debugging)
        CI.g -= min(CI.cp-1,CI.g)
    Else 
        CI.i--                     # just in case CI.i was set to 
                                   # more than 1 for resiliency
    Send next ACK with ECI = CI.cm[CI.c modulo CI.cp]
Else
    If (E1.g > 0) or (E1.i != 0)
        If (E1.i == 0) and (E1.g > 0)
            E1.i = 1
            E1.c += min(E1.cp-1,E1.g)
            E1.c = E1.c modulo E1.cp*E1.cp
            E1.g -= min(E1.cp-1,E1.g)
        Else
            E1.i--
        Send next ACK with ECI = E1.cm[E1.c modulo E1.cp]
    Else
        Send next ACK with ECI = CI.cm[CI.c modulo CI.cp] # default action
		]]></artwork></figure>
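		    <t>The receiver steps above can be sketched in Python as follows. This is a minimal, non-normative sketch under stated assumptions: the names Signal, should_ack_now and prepare_ack are hypothetical and do not appear in this document, and prepare_ack returns an index into CI.cm[] or E1.cm[] rather than a wire codepoint, because the mapping tables are not reproduced here.</t>

		    <figure><artwork><![CDATA[
```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Signal:
    """Receiver state for one signal (CI or E1)."""
    cp: int      # codepoints available (CI.cp / E1.cp)
    c: int = 0   # counter, kept modulo cp*cp
    g: int = 0   # gauge: marks seen but not yet fed back
    i: int = 0   # iteration: pending repeats of current value

    def mark(self) -> None:
        """Record one received mark (CE or ECT(1))."""
        self.g += 1

    def advance(self) -> int:
        """Update or repeat the counter; return index into cm[]."""
        if self.i == 0 and self.g > 0:
            self.i = 1  # assumption: one repeat, as in the text
            step = min(self.cp - 1, self.g)
            self.c = (self.c + step) % (self.cp * self.cp)
            self.g -= step
        else:
            self.i -= 1
        return self.c % self.cp


def should_ack_now(ci: Signal, e1: Signal) -> bool:
    """True if a pending delayed ACK should be cancelled."""
    return ci.g > 5 or e1.g > 3


def prepare_ack(ci: Signal, e1: Signal) -> Tuple[str, int]:
    """Pick the signal to feed back and the index to send."""
    if ci.g > 0 or (e1.i != 0 and ci.i != 0):
        return ("CI", ci.advance())
    if e1.g > 0 or e1.i != 0:
        return ("E1", e1.advance())
    return ("CI", ci.c % ci.cp)  # default action
```
		]]></artwork></figure>

		    <t>In this sketch, with CI.cp = 4, five CE marks received before the next ACK are fed back over three ACKs as the indices 3, 3 (the repeat) and 1, leaving CI.c = 5 and CI.g = 0.</t>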
    
		    <t>Sender:</t>
		    
		    <t>Counters:</t>
		    
		    <!--<texttable>
			    <ttcol align="center">Name</ttcol>
			    
			    <ttcol align="center">Description</ttcol>
			    
			    <c>CI.r</c><c>current value of CEs seen by receiver</c>
			    
			    <c>E1.s</c><c>sum of all sent ECT(1) marked packets (up to snd.nxt)</c>
			    
			    <c>E1.s(t)</c><c>value of E1.s at time (in sequence space) t</c>
			    
			    <c>E1.r</c><c>value signaled by receiver about received ECT(1) segments</c>
			    
			    <c>E1.r(t)</c><c>value of E1.r at time (in sequence space) t</c>
			    
			    <c>CI.r(t)</c><c>ditto</c>
			    
		    </texttable>-->
		    
		    <t>
		    CI.r - current value of CEs seen by receiver<vspace blankLines="0" />
		    E1.s - sum of all sent ECT(1) marked packets (up to snd.nxt)<vspace blankLines="0" />
		    E1.s(t) - value of E1.s at time (in sequence space) t<vspace blankLines="0" />
		    E1.r - value signaled by receiver about received ECT(1) segments<vspace blankLines="0" />
		    E1.r(t) - value of E1.r at time (in sequence space) t<vspace blankLines="0" />
		    CI.r(t) - value of CI.r at time (in sequence space) t</t>
		    
		    <figure><artwork><![CDATA[
# Note: With a codepoint-implementation, 
# a reverse table ECI[n] -> CI.r / E1.r is needed.
# This example is simplified with 4/4 codepoints 
# instead of 5/3
    
If ACK with NS=0
  CI.r += (ECI + 4 - (CI.r mod CI.cp)) mod CI.cp
  # The wire protocol transports the absolute value
  # of the receiver-side counter.
  # Thus the (positive only) delta needs to be calculated,
  # and added to the sender-side counter.
If ACK with NS=1
  E1.r += (ECI + 4 - (E1.r mod E1.cp)) mod E1.cp

# Before CI.r or E1.r reach a (binary) rollover,
# they need to roll over at some multiple of CI.cp
# and E1.cp respectively.

CI.r = CI.r modulo (CI.cp * n_CI)
E1.r = E1.r modulo (E1.cp * n_E1)

# (An implementation may choose to use a single constant,
# i.e., 3^4*5^4 for 16-bit integers,
# or 3^8*5^8 for 32-bit integers.)

# The following test can (probabilistically) reveal
# whether the receiver or path is not properly
# handling ECN (CE, E1) marks.

If not E1.r(t) <= E1.s(t) <= E1.r(t) + CI.r(t)
  # -> receiver lies (or too many ACKs got lost,
  # which the sender can also check).
		    ]]></artwork></figure>
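		    <t>The sender-side bookkeeping above can be sketched in Python as follows. This is a non-normative sketch under stated assumptions: the names advance, SenderState and feedback_plausible are hypothetical, the ECI value is assumed to have already been mapped back to a counter value modulo the number of codepoints, the 4/4 codepoint split of the simplified example is used, and the multiplier n is an arbitrary illustrative choice. The plausibility test ignores counter rollover.</t>

		    <figure><artwork><![CDATA[
```python
from dataclasses import dataclass


def advance(counter: int, eci: int, cp: int, wrap: int) -> int:
    """Add the (positive only) delta implied by eci to counter.

    eci carries the receiver-side counter modulo cp. wrap must be
    a multiple of cp so that rollover keeps the counter aligned.
    """
    delta = (eci + cp - counter % cp) % cp
    return (counter + delta) % wrap


@dataclass
class SenderState:
    ci_cp: int = 4    # CI.cp (simplified 4/4 example)
    e1_cp: int = 4    # E1.cp
    n: int = 1 << 12  # hypothetical multiplier n_CI = n_E1
    ci_r: int = 0     # CI.r
    e1_r: int = 0     # E1.r

    def on_ack(self, ns: int, eci: int) -> None:
        """Update CI.r (NS=0) or E1.r (NS=1) from one ACK."""
        if ns == 0:
            wrap = self.ci_cp * self.n
            self.ci_r = advance(self.ci_r, eci, self.ci_cp, wrap)
        else:
            wrap = self.e1_cp * self.n
            self.e1_r = advance(self.e1_r, eci, self.e1_cp, wrap)


def feedback_plausible(e1_r: int, e1_s: int, ci_r: int) -> bool:
    """Check E1.r(t) <= E1.s(t) <= E1.r(t) + CI.r(t)."""
    return e1_r <= e1_s <= e1_r + ci_r
```
		    ]]></artwork></figure>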
   </section>

 </back>
</rfc>
