<?xml version="1.0" encoding="US-ASCII"?>
<!-- This template is for creating an Internet Draft using xml2rfc,
    which is available here: http://xml.resource.org. -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!-- One method to get references from the online citation libraries.
    There has to be one entity for each item to be referenced. 
    An alternate method (rfc include) is described in the references. -->

<!ENTITY RFC3168 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3168.xml">
<!ENTITY RFC2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY RFC2018 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2018.xml">
<!ENTITY RFC3522 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3522.xml">
<!ENTITY RFC3708 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3708.xml">
<!ENTITY RFC4015 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.4015.xml">
<!ENTITY RFC5562 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5562.xml">
<!ENTITY RFC5681 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5681.xml">
<!ENTITY RFC5682 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5682.xml">

]>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<!-- used by XSLT processors -->
<!-- For a complete list and description of processing instructions (PIs), 
    please see http://xml.resource.org/authoring/README.html. -->
<!-- Below are generally applicable Processing Instructions (PIs) that most I-Ds might want to use.
    (Here they are set differently than their defaults in xml2rfc v1.32) -->
<?rfc strict="yes" ?>
<!-- give errors regarding ID-nits and DTD validation -->
<!-- control the table of contents (ToC) -->
<?rfc toc="yes"?>
<!-- generate a ToC -->
<?rfc tocdepth="4"?>
<!-- the number of levels of subsections in ToC. default: 3 -->
<!-- control references -->
<?rfc symrefs="yes"?>
<!-- use symbolic references tags, i.e, [RFC2119] instead of [1] -->
<?rfc sortrefs="yes" ?>
<!-- sort the reference entries alphabetically -->
<!-- control vertical white space 
    (using these PIs as follows is recommended by the RFC Editor) -->
<?rfc compact="yes" ?>
<!-- do not start each main section on a new page -->
<?rfc subcompact="no" ?>
<!-- keep one blank line between list items -->
<!-- end of list of popular I-D processing instructions -->
<rfc category="exp" 
     docName="draft-ietf-conex-tcp-modifications-02" 
	ipr="trust200902">
 <!-- category values: std, bcp, info, exp, and historic
    ipr values: trust200902, noModificationTrust200902, noDerivativesTrust200902,
       or pre5378Trust200902
    you can add the attributes updates="NNNN" and obsoletes="NNNN" 
    they will automatically be output with "(if approved)" -->

 <!-- ***** FRONT MATTER ***** -->

 <front>
   <!-- The abbreviated title is used in the page header - it is only necessary if the 
        full title is longer than 39 characters -->

   <title>TCP modifications for Congestion Exposure</title>

   <!-- add 'role="editor"' below for the editors if appropriate -->

   <!-- Another author who claims to be an editor -->

    <author fullname="Mirja Kuehlewind" initials="M." role="editor"
    	surname="Kuehlewind">
    	<organization>University of Stuttgart</organization>
    	<address>
    		<postal>
    			<street>Pfaffenwaldring 47</street>
    			<code>70569</code>
    			<city>Stuttgart</city>
    			<country>Germany</country>
    		</postal>
    		<email>mirja.kuehlewind@ikr.uni-stuttgart.de</email>
    	</address>
    </author>
    
    <author fullname="Richard Scheffenegger" initials="R."
           surname="Scheffenegger">
     <organization>NetApp, Inc.</organization>
     <address>
       <postal>
         <street>Am Euro Platz 2</street>
         <code>1120</code>
         <city>Vienna</city>
         <region></region>
         <country>Austria</country>
       </postal>
       <phone>+43 1 3676811 3146</phone>
       <email>rs@netapp.com</email>
     </address>
    </author>

   <date year="2012" />


   <area>Transport</area>

   <workgroup>Congestion Exposure (ConEx)</workgroup>

   <keyword>Internet-Draft</keyword>
   <keyword>I-D</keyword>

   <abstract>
	   <t>Congestion Exposure (ConEx) is a mechanism by which senders inform the network 
		   about the congestion encountered by previous packets on the same flow.
		   This document describes the necessary modifications to use ConEx with the 
		   Transmission Control Protocol (TCP). 
	   </t>
   </abstract>
 </front>

 <middle>
   <section title="Introduction">
	   <t>Congestion Exposure (ConEx) is a mechanism by which senders inform the network 
		   about the congestion encountered by previous packets on the same flow.
		   This document describes the necessary modifications to use ConEx with the 
		   Transmission Control Protocol (TCP). The ConEx signal is based on loss or 
		   ECN marks <xref target="RFC3168"/> as congestion indications. The sender obtains this congestion information through the existing TCP feedback mechanisms from the receiver to the sender.
	    </t>
	    <t>With standard TCP without Selective Acknowledgments (SACK) 
		    <xref target="RFC2018"/> the actual number of losses is hard to detect; 
		    we therefore recommend enabling SACK when using ConEx. 
		    However, both cases, with and without SACK support, are discussed later on. 
		    </t>
	    <t>Explicit Congestion Notification (ECN) is defined in such a way that only a 
	    	single congestion signal is guaranteed to be delivered per Round-trip Time 
	    	(RTT) from the receiver to the sender. For ConEx a more accurate <!-- timely? --> feedback signal 
		    would be beneficial. Such an extension to ECN is defined in a separate document 
		    <xref target="draft-kuehlewind-conex-accurate-ecn"/>, as it can also be useful 
		    for other mechanisms, e.g. <xref target="DCTCP"/>, or whenever the congestion 
		    control reaction should be proportional to the experienced congestion. ConEx also works with classic ECN, but it is less accurate when multiple congestion markings occur within one RTT.
	    </t>
	    <!--<t>The current version of this draft is only a first collection of ConEx-based TCP
		    modification and should not be regared as feature-complete as 
		    the specification for the abstract ConEx mechanism is still under discussion.
		    The next version will also go more precisely into implementation details.
	    </t>-->
	    <t>
		    ConEx is being defined as a destination option for IPv6. 
		    Four bits have been defined, namely the X (ConEx-capable), 
		    the L (loss experienced), the E (ECN experienced) and the C (credit) bit. 
	     </t>

     <section title="Requirements Language">
       <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
       "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
       document are to be interpreted as described in <xref
       target="RFC2119"/>.</t>
     </section>
   </section>

   <section title="Sender-side Modifications">
	   <t>A ConEx sender MUST negotiate both SACK and either ECN or the more accurate ECN feedback 
	   	in the TCP handshake if these TCP extensions are available at the sender. 
		Depending on the capabilities of the receiver, the following operation modes exist:
		  <list style="symbols">
			 <t> Full-ConEx (SACK and accurate ECN feedback) </t>
			 <t> accECN-ConEx (no SACK but accurate ECN feedback)</t>
			 <t> ECN-ConEx (no SACK and no accurate ECN feedback but 'classic' ECN)</t>
			 <t> SACK-ECN-ConEx (SACK and 'classic' instead of accurate ECN)</t>
			 <t> SACK-ConEx (SACK but no ECN at all)</t>
			 <t> Basic-ConEx (neither SACK nor ECN) </t>
		  </list>
     </t>
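     <t>The mapping from the negotiated receiver capabilities to the operation modes listed above can be sketched as follows. This is only an illustrative Python sketch; the function name and its parameters are hypothetical and are not defined by this document.</t>
     <t><figure><artwork><![CDATA[
```python
from typing import Optional


def conex_mode(sack: bool, ecn: Optional[str]) -> str:
    """Return the ConEx operation mode for the negotiated capabilities.

    sack: True if SACK was negotiated in the handshake.
    ecn:  "accurate", "classic", or None if ECN is not supported
          (hypothetical encoding chosen for this sketch).
    """
    if ecn == "accurate":
        return "Full-ConEx" if sack else "accECN-ConEx"
    if ecn == "classic":
        return "SACK-ECN-ConEx" if sack else "ECN-ConEx"
    if ecn is None:
        return "SACK-ConEx" if sack else "Basic-ConEx"
    raise ValueError("unknown ECN capability: %r" % (ecn,))
```
]]></artwork></figure></t>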
     <t>A ConEx sender MUST expose congestion to the network according to the congestion information received 
	     via ECN or based on loss information provided by the TCP feedback loop. A TCP sender SHOULD account for congestion
	     byte-wise (and not packet-wise). A sender MUST mark <!--the respective number of payload bytes in -->
	     subsequent packets (after the congestion notification) with the respective ConEx bit in the IP header.
	   </t> 
	   <t>With SACK only the number of lost bytes is known, but not the number of packets carrying these bytes. 
	   	 With classic ECN only an indication that a marking occurred is given, which provides neither an exact number 
	   	 of bytes nor of packets. As network congestion is usually byte-congestion, the exact number of bytes should 
	   	 be taken into account, if available, to make the ConEx signal as exact as possible.
	   </t>
	   <t>The congestion accounting for the different operation modes is described in the next section, and 
	     the handling of the IPv6 bits themselves in the section after that.
     </t>
   </section>

   <section title="Accounting congestion">
	   <t>
		A TCP sender SHOULD account for congestion byte-wise (and not packet-wise) based on
		the congestion information received via ECN or the loss detection provided by TCP.
		For this purpose a TCP sender maintains two separate counters for the number of outstanding bytes
		that need to be ConEx marked with either the E bit or the L bit. 
      </t>
      <t>
	      The outstanding bytes accounted based on ECN feedback information are maintained
	      in the congestion exposure gauge (CEG). The accounting of these bytes from the ECN feedback 
	      is explained in more detail next. 
      </t>
      <t>
	      The outstanding bytes for congestion indications based on loss are maintained
	      in the loss exposure gauge (LEG); this accounting is explained after the CEG 
	      accounting. 
      </t>
      <t> 
	      The subtraction of bytes which have been ConEx marked from both counters is explained in the
	      next section.
      </t>
      <t>
	      Usually all bytes of an IP packet must be accounted for. Therefore the sender
	      SHOULD take the headers into account.  If equally sized packets or at  
	      least equally distributed packet sizes can be assumed, the
	      sender MAY account for only the TCP payload bytes, as the ConEx-marked
	      packets as well as the original packets causing the congestion will
	      both contain about the same number of headers. 
	</t>
	<t>
		If a sender sends differently sized packets with unequally distributed packet
		sizes, the sender might be able to reconstruct 
	      the exact number of headers based on information about which packet sizes have been sent in the last RTT.
	      Otherwise, if no additional information is available, the worst-case number of headers SHOULD be estimated 
	      in a conservative way based on the minimum packet size (of all packets sent in the last RTT).
      </t>
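	<t>One possible reading of the conservative header estimate is sketched below in Python. It is an assumption of this sketch, not a rule of this document, that the worst case is obtained by treating every affected packet as if it carried the smallest payload sent in the last RTT; the function name, the parameters and the header length used in the example are hypothetical.</t>
	<t><figure><artwork><![CDATA[
```python
def bytes_to_account(payload_bytes: int, min_payload: int,
                     header_len: int) -> int:
    """Conservative number of IP bytes to expose for payload_bytes.

    Assumes (for this sketch only) that all affected packets were as
    small as the smallest payload sent in the last RTT, which yields
    the largest possible number of headers.
    """
    if payload_bytes <= 0:
        return 0
    if min_payload <= 0 or header_len < 0:
        raise ValueError("min_payload must be > 0, header_len >= 0")
    # Integer ceiling division: worst-case number of packets.
    max_packets = (payload_bytes + min_payload - 1) // min_payload
    return payload_bytes + max_packets * header_len
```
]]></artwork></figure></t>
	<t>For example, with 3000 payload bytes to account for, a minimum payload of 500 bytes and an assumed header length of 60 bytes per packet, the sketch accounts for 3000 + 6*60 = 3360 bytes.</t>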
      
     <section title="ECN">
	     <t>ECN is an IP/TCP mechanism that allows network nodes to mark packets
		   with the Congestion Experienced (CE) mark instead of (early) 
		   dropping them when congestion occurs. <!--As soon as a CE marks is received at the receiver, it will feed this information back to the sender. --> As soon as a CE mark is seen at the receiver, with classic ECN it will feed this information back to the sender by setting the Echo Congestion Experienced (ECE) bit in the TCP header until a packet with Congestion Window Reduced (CWR) bit in the TCP header is received to acknowledge the reception of the congestion notification. The sender sets the CWR bit in the TCP header once when the first ECE of a congestion notification is received.
	     </t>
	     <t>A receiver can support the accurate ECN feedback scheme, the 
		     'classic' ECN or neither. In the case ECN is not supported at 
		     all, the transport is not ECN-capable and no ECN marks will occur, 
		     thus the E bit will never be set. In the other cases a ConEx sender 
		     MUST maintain a gauge for the number of outstanding bytes that have to
		     be ConEx marked with the E bit, 
		     the congestion exposure gauge (CEG).
	     </t>
	     <t>The CEG is increased when ECN information is received from an ECN-capable 
	     	receiver supporting the 'classic' ECN scheme or the accurate ECN 
	     	feedback scheme. When the ConEx sender receives an ACK indicating one or more
	     	segments were received with a CE mark, CEG is increased by the appropriate 
	     	number of bytes.
		<!--sent by the IP layer (e.g. by MTU bytes for each SMSS segment).
	     	Whenever a packet is sent with the E bit set, this gauge is decreased by the
	     	IP size of that packet. -->
		</t>
		
		<t>In case of duplicate acknowledgements the number of newly acknowledged bytes will be zero even though (CE marked) data has been received. Therefore, we calculate a variable DeliveredData. DeliveredData covers the total number of bytes that the current ACK indicates have been delivered to the receiver, relative to all past ACKs. With SACK, DeliveredData is increased by the number of bytes given by changes in the SACK information. 
		Note that the change in the SACK information can also be negative if the number of acknowledged bytes increases. Without SACK, DeliveredData
      		is estimated to be 1 SMSS on duplicate acknowledgements, and on a
      		subsequent partial or full ACK, DeliveredData is estimated to be the
   		change in acknowledged bytes, minus one SMSS for each preceding duplicate ACK.</t>
		
		<t>DeliveredData = acked_bytes + SACK_diff + (is_dup)*1SMSS - (is_after_dup)*num_dup*1SMSS
		</t>
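		<t>The DeliveredData formula above can be transcribed directly, as in the following Python sketch. The function and parameter names mirror the terms of the formula; the default SMSS of 1460 bytes is only an example value assumed for illustration.</t>
		<t><figure><artwork><![CDATA[
```python
def delivered_data(acked_bytes: int, sack_diff: int = 0,
                   is_dup: bool = False, is_after_dup: bool = False,
                   num_dup: int = 0, smss: int = 1460) -> int:
    """Bytes newly delivered to the receiver according to this ACK.

    acked_bytes:  change in cumulatively acknowledged bytes
    sack_diff:    change in SACKed bytes (may be negative)
    is_dup:       this ACK is a duplicate ACK (used without SACK)
    is_after_dup: first partial/full ACK after duplicate ACKs
    num_dup:      number of preceding duplicate ACKs
    """
    data = acked_bytes + sack_diff
    if is_dup:
        data += smss              # estimate: one segment delivered
    if is_after_dup:
        data -= num_dup * smss    # already counted on the dup ACKs
    return data
```
]]></artwork></figure></t>
		<t>For example, without SACK, three duplicate ACKs each yield 1460 bytes, and a following ACK that cumulatively acknowledges four segments yields 4*1460 - 3*1460 = 1460 bytes.</t>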
		
		<t>The two cases, with and without more accurate ECN depending on the 
		receiver capability, are discussed in the following sections.
	     </t>
       <!--<t>TBD: Discussion to set ECN in which packets. Initially apply RFC5562 rules
	       ([SYN,ACK] and data segments only), as security implications of ECN on
	       control packets ([SYN], pure [ACK], window probe, window update, ...) is an open 
	       research question. However, running bidirectional ECN on all TCP segments 
	       including TCP control packets, may allow for more timely and accurate ConEx 
	       signals. Also, ConEx provides a framework to possibly address some of these 
	       security risks.</t>-->
	     <section title="Accurate ECN feedback">
		     <t>With a more accurate ECN feedback scheme either the number of marked packets/received CE marks or directly the number of marked bytes is
		     known. In the latter case the CEG can 
		     directly be increased by the number of marked bytes. 
		     Otherwise, if D is assumed to be the number of marks,
		     <!-- Otherwise when the accurate ECN feedback scheme is supported by the 
	      	receiver, the receiver will maintain an echo congestion counter 
	      	(ECC). The ECC will hold the number of CE marks received. A 
	      	sender that is understanding the accurate ECN feedback will be 
	      	able to reconstruct this ECC value on the sender side by 
	      	maintaining a counter ECC.r.
	      </t>
	      <t>On the arrival of every ACK, the sender calculates
	      	the difference D between the local ECC.r counter, and the signaled
	      	value of the receiver side ECC counter. The value
	      	of ECC.r is increased by D, and D is assumed to be the number 
	      	of CE marked packets that arrived at the receiver since it 
	      	sent the previously received ACK.
	      </t>
	      <t>Whenever the counter ECC.r is increased, -->
		the gauge CEG has to 
	      	be increased by the amount of bytes sent which 
	      	were marked:</t>
		<t>CEG += min( SMSS*D, DeliveredData )</t>
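		<t>As a small worked illustration of the update rule above, the following Python sketch computes the CEG increase for one ACK. The function name and the example SMSS of 1460 bytes are assumptions made for this sketch.</t>
		<t><figure><artwork><![CDATA[
```python
def ceg_increase_accurate_ecn(ce_marks: int, delivered_data: int,
                              smss: int = 1460) -> int:
    """CEG increase for one ACK with accurate ECN feedback.

    ce_marks is D, the number of newly reported CE marks; the
    increase is capped by the bytes this ACK reports as delivered.
    """
    return min(smss * ce_marks, delivered_data)
```
]]></artwork></figure></t>
		<t>With D=2 and 4380 delivered bytes the CEG grows by 2920 bytes; with D=3 but only 1000 delivered bytes it grows by 1000 bytes.</t>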
		<!--<t>CEG += min( (SMSS+IPhl+TCPhl)*D, acked_bytes + (IPhl+TCPhl)*D )</t>
		<t>where IPhl is the IP heder length which is when using IPv4 20 Byte and with IPv6 40 Byte and where TCPhl is the TCP header length which is 20 byte. Those values give the respective header length withour any options. If an option is negociated which has to be carried in every packet, the size of the option MUST be added for each header. Otherwise the best case with no option header is assumed.</t>-->
	      	
	     </section>
	     <section title="Classic ECN support">
		     
	     	<t>A ConEx sender that communicates with a classic ECN receiver (conforming
	     		to <xref target="RFC3168"/> or <xref target="RFC5562"/>) MAY
	     		run in one of these modes:
	     <list style="symbols">
		     <t>Full compliance mode:<vspace blankLines="1"/>
			     The ConEx sender fully conforms to
     				all the semantics of the ECN signaling as defined by 
     				<xref target="RFC5562"/>. In this mode, only a single
     				congestion indication can be signaled by the receiver
     				per RTT. Whenever the ECE flag toggles from "0" to "1", 
				the gauge CEG is increased by at most one SMSS:<!-- plus headers.-->
				<vspace blankLines="1"/>
				CEG += min(SMSS, DeliveredData)
				<vspace blankLines="1"/>
     				Note that under severe congestion, a session adhering
     				to these semantics may not provide enough ConEx marks. 
     				This may cause appropriate sanctions by an audit device in 
				a ConEx enabled network.</t>
			<t>Simple compatibility mode:<vspace blankLines="1"/>
				<!--Alternatively, a ConEx sender MAY
    				set the CWR flag opportunistically, to extract more than
     				one ECE indication per RTT. In 
     				the most simple form, CWR can be set on a permanent basis.-->
				The sender will set the CWR flag permanently to force the receiver to signal
				only one ECE per CE mark. Unfortunately, the use of delayed
				ACKs <xref target="RFC5681"/>, as is common today, will prevent feedback of every CE mark: a CWR confirmation will be received before the ECE can be sent out with the next ACK. 
				With an ACK rate of M, about (M-1)/M of the CE indications will not be signaled back by
				the receiver (e.g. 50% with M=2 for delayed ACKs). Thus, in this mode the ConEx sender MUST
				increase CEG <!--by a count ofM*SMSS-->
				<!--M*(SMSS+IP.header+TCP.header)<vspace blankLines="1"/>-->
				as if M congestion notifications were received
				for each received ECE signal:
				<vspace blankLines="1"/>
				CEG += min(M*SMSS, DeliveredData + (M-1)*SMSS)
				<vspace blankLines="1"/>
				In case of a congestion event with low congestion (that is, when only a very
				small number of packets get marked), the sender might miss the whole
				congestion event. 
				On average the sender will send sufficient ConEx marks due to the scheme proposed
				above but these ConEx marks might be shifted in time. 
				Regarding congestion control it is not a general problem to miss a congestion event
				as, by chance, a marking scheme in the network node might also
				miss a certain flow. 
				In the case where no other flow is reacting, the congestion level will increase
				and it will get more likely that the congestion feedback is delivered. 
				To provide a fair share over time, a TCP sender implementing this simple ECN compatibility mode could react more strongly
				when receiving an ECN feedback signal. This of course depends on the congestion control
				used. <!--A TCP sender using this scheme MUST take the impact on congestion control
				into account.--></t>
			<t>Advanced compatibility mode:<vspace blankLines="1"/>
				To avoid the loss of ECN feedback information in the proposed simple compatibility mode, a sender could 
     				set CWR only on those data segments that
     				will actually trigger a (delayed) ACK. The sender would need an additional control loop to estimate which data segment will trigger an ACK.
				Such a more sophisticated heuristic could extract
     				congestion notifications more timely. <!--A ConEx sender MAY
     				choose to implement such an heuristic.--> In addition, if this advanced compatibility mode is used, further
     				heuristics SHOULD be implemented, to determine the value
     				of each ECE notification. E.g. for each consecutive ACK received
     				with the ECE flag set, 
				CEG should be increased by 
				min( M*SMSS, DeliveredData). 
				Otherwise, if the preceding ACK was received with the ECE flag cleared,
				CEG need only be increased by at most one SMSS:
				<vspace blankLines="1"/>
				if previous_marked: CEG += min( M*SMSS, DeliveredData)
				<vspace blankLines="0"/>
				else: CEG += min(SMSS, DeliveredData)
				<vspace blankLines="1"/>
				<!--D should be increased by one, and
				CEG increased by<vspace blankLines="1"/>
				CEG += min((SMSS+IP.header+TCP.header)*D, acked_bytes+(IP+TCP Header)*D)
				<vspace blankLines="1"/>
     				If an ACK is received with the ECE flag cleared, D must be
     				set to zero. -->
				This heuristic is conservative during more
     				serious congestion, and more relaxed at low congestion
     				levels.</t>
	     	</list>
	     	</t>
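	     	<t>The three modes above differ only in how much the CEG is increased when an ACK with the ECE flag set arrives. The following Python sketch puts the three update rules side by side. The function name, the mode labels, the flag parameters and the example values (SMSS of 1460 bytes, ACK rate M=2) are assumptions of this sketch; the caller is assumed to invoke it only for ACKs that carry ECE.</t>
	     	<t><figure><artwork><![CDATA[
```python
def ceg_increase_classic(mode: str, delivered_data: int,
                         smss: int = 1460, ack_rate: int = 2,
                         ece_rising: bool = True,
                         previous_marked: bool = False) -> int:
    """CEG increase for one ACK carrying ECE under classic ECN.

    mode:            "full", "simple" or "advanced"
    ack_rate:        M, data segments acknowledged per ACK
    ece_rising:      ECE toggled from 0 to 1 with this ACK
    previous_marked: the preceding ACK also carried ECE
    """
    if mode == "full":
        # Only one congestion indication per RTT is counted.
        return min(smss, delivered_data) if ece_rising else 0
    if mode == "simple":
        # Account as if M notifications had been received.
        return min(ack_rate * smss,
                   delivered_data + (ack_rate - 1) * smss)
    if mode == "advanced":
        if previous_marked:
            return min(ack_rate * smss, delivered_data)
        return min(smss, delivered_data)
    raise ValueError("unknown mode: %r" % (mode,))
```
]]></artwork></figure></t>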
	     	
       </section>
     </section>
     <section title="Loss Detection with/without SACK">
	     <t>For all the data segments that are determined by a ConEx sender 
	     	as lost, (at least) the same number of TCP payload bytes MUST be sent 
	     	with the ConEx L bit set.  Loss detection typically happens by use 
	     	of duplicate ACKs, or the firing of the retransmission timer.  A 
	     	ConEx sender MUST maintain a loss exposure gauge (LEG), indicating 
	     	the number of outstanding bytes that must be sent with the ConEx
	     	L bit.  When a data segment is retransmitted, LEG will be 
	     	increased by the TCP payload size of the packet containing the retransmission,
		assuming equally sized segments such that the retransmitted packet will have 
		the same number of headers as the original ones.
	     	When sending subsequent segments<!-- (including TCP control segments)-->, 
	     	the ConEx L bit is set as long as LEG is positive, and LEG is 
	     	decreased by the size of the sent TCP payload with the ConEx L bit set.
        </t>
	<t>Any retransmission may be spurious.  To accommodate that, a ConEx
         sender SHOULD make use of heuristics to detect such spurious
         retransmissions (e.g. F-RTO <xref target="RFC5682"/>, DSACK 
         <xref target="RFC3708"/>, and Eifel <xref target="RFC3522"/>, 
         <xref target="RFC4015"/>).  When such a heuristic has determined 
         that a certain number of packets were retransmitted
         erroneously, the ConEx sender should subtract the payload size of these
         TCP packets from LEG. 
        </t>
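	<t>The LEG bookkeeping for retransmissions and for retransmissions later found to be spurious can be sketched as follows. The class and method names are hypothetical; how a spurious retransmission is detected (F-RTO, DSACK, Eifel) is outside this sketch, and only the increase and the correction of the gauge are shown.</t>
	<t><figure><artwork><![CDATA[
```python
class LossExposureGauge:
    """Outstanding bytes that still have to be sent with the L bit."""

    def __init__(self) -> None:
        self.leg = 0

    def on_retransmission(self, payload_len: int) -> None:
        # A segment was declared lost and is retransmitted.
        self.leg += payload_len

    def on_spurious_retransmission(self, payload_len: int) -> None:
        # A detection heuristic found the retransmission erroneous;
        # take its payload bytes out of the gauge again.
        self.leg -= payload_len
```
]]></artwork></figure></t>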
        <t>Note that the above heuristic delays the ConEx signal by one 
         segment, and also decouples it from the retransmissions themselves, as
         some control packets (e.g. pure ACKs, window probes, or window updates) 
         may be sent in between data segment retransmissions.
         A simpler approach would be to set the ConEx signal for each
         retransmitted data segment.  However, it is important to remember that 
         a ConEx signal and TCP segments do not natively belong together. 
        </t> 
	<t>If SACK is not available or SACK information has been reset for any reason, 
		spurious retransmissions are more likely. In this case it might be valuable to 
		slightly delay the ConEx loss feedback until a spurious retransmission can 
		be detected. But the ConEx signal MUST NOT be delayed by more than one RTT.
	</t>

     </section>
   </section>

   <section title="Setting the ConEx IPv6 Bits"> 
	     
		<!-- RS: remove IPv6 - Conex signals (bits) should be defined agnostic of IP version, right?) -->
		<t>
		     ConEx is being defined as a destination option for IPv6. 
		     Four bits have been defined, namely the X (ConEx-capable), 
		     the L (loss experienced), the E (ECN experienced) and the C (credit) bit. 
	     </t>
	     <t>By setting the X bit a packet is marked as ConEx-capable. 
		     All packets carrying payload MUST be marked with the X bit set, including retransmissions. No congestion feedback information is available for control packets, such as pure ACKs, which do not carry any payload. Thus these packets should not be taken into account when determining ConEx information. These packets MUST carry a ConEx Destination Option with
		     the X bit unset.
	     </t>
	     <!--<t>By setting the X bit a packet is marked as ConEx-capable. It is not 
		     decided yet which or if any packets should not be ConEx capable. 
		     (e.g. control packets as pure ACKs or retransmits). It is not defined yet which bits 
		     (E, L, C) can be set at the same time in one (data) packet. It is assumed 
		     that ConEx marked packets are accounted by their respective IP size, as 
		     all the signals (Loss, ECN) are attributes of an IP packet, not a TCP segment 
		     or merely the TCP payload. Further discussion is needed here.
	   </t>-->
    <section title="Setting the E and the L Bit">
		   <t>As long as the CEG or LEG counter is positive, ConEx-capable packets SHOULD be marked with
			   E or L respectively, and the CEG or LEG counter is decreased by the TCP payload bytes carried in this
			   packet. If the CEG or LEG counter is negative, the respective counter SHOULD be reset to zero within one RTT 
			   after it was last decreased, or one RTT after recovery if no further 
			   congestion occurred.
			   <!-- This can be done by remembering the seq# of the next packet send after the LEG went negative and reset the LEG if the respective ACk is received.-->
			   <!--is drained by one byte with every 
			   packet sent out, as ConEX information are only meaningful for a certain time:
			   <vspace blankLines="1" />
			   if CEG > 0: CEG -= TCPpayload.length else: CEG -= 1<vspace blankLines="0" />
			   if LEG > 0: LEG -= TCPpayload.length else: LEG -= 1
			   -->
			   <!--As ConEx credits have only a limited lifetime, 
			   whenever the gauge becomes negative, it should be drained at a low 
			   rate (e.g. 1 count per sent packet).-->
			   
		   </t>
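		   <t>The marking rule above can be sketched in Python as follows. The function names and the dictionary holding the two gauges are hypothetical. It is also an assumption of this sketch that the E and L bits may be set in the same packet, and the timing of the reset (within one RTT) is left to the caller.</t>
		   <t><figure><artwork><![CDATA[
```python
def mark_packet(gauges: dict, payload_len: int) -> set:
    """Return the ConEx bits for one outgoing packet.

    gauges maps "CEG" and "LEG" to outstanding bytes and is updated
    in place. Packets without payload carry the X bit unset and are
    not accounted.
    """
    if payload_len == 0:
        return set()
    bits = {"X"}
    for bit, gauge in (("E", "CEG"), ("L", "LEG")):
        if gauges[gauge] > 0:
            bits.add(bit)
            gauges[gauge] -= payload_len   # may become negative
    return bits


def reset_negative(gauges: dict) -> None:
    """Reset gauges that went negative back to zero."""
    for gauge in ("CEG", "LEG"):
        if gauges[gauge] < 0:
            gauges[gauge] = 0
```
]]></artwork></figure></t>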
   </section>
    
     <section title="Credit Bits">
	     <t>The ConEx abstract mechanism requires that the transport SHOULD signal sufficient 
		     credit in advance to cover any reasonably expected congestion during its
		     feedback delay. To be very conservative the number of credits would need to equal
		     the number of packets in flight, as every packet could get lost or congestion marked.
		     With a more moderate view, only an increase in the sending rate should cause
		     congestion.
	     </t>
	     <t>
		     TODO: More general description to always maintain at least flight_size - flight_size_prev credits. -> Can the number of credits in the audit decrease?
		     
	     </t>
	     <!--<t>TBD: When increasing the congestion window while in CA, will the
		     one additional segment need further credits?
	     </t>-->
	     <t>For a TCP sender using the <xref target="RFC5681"/> congestion control algorithm, we recommend 
		     sending credit only in Slow Start, as in Congestion Avoidance an increase of one segment per RTT
		     should only cause a minor amount of congestion marks (usually at most one). If a more aggressive
		     congestion control is used, a sufficient number of credits needs to be set.
	     </t>
		    
	     <t>In TCP Slow Start the sending rate increases exponentially, that is, it doubles every RTT. 
		     Thus the number of credits should equal half the number of packets in flight in
		     every RTT. Under the assumption that no marks become invalid during the whole
		     Slow Start phase, the marks of previous RTTs are summed up. Thus marking
		     every fourth packet provides sufficient credits in Slow Start, as can be seen in <xref target="SS_credit"/>.
	     </t>
	     <t>
	     <figure 
		     title="Credits in Slow Start (with an initial window of 3)"
		     align="center" anchor="SS_credit">
			<artwork align="center"><![CDATA[
RTT1	|------XC------>|
	|------X------->|
	|------X------->|   credit=1  in_flight=3
	|		|
RTT2	|------X------->|
	|------XC------>|
	|------X------->|
	|------X------->|
	|------X------->|
	|------XC------>|   credit=3  in_flight=6
	|		|
RTT3	|------X------->|
	|------X------->|
	|------X------->|
	|------XC------>|
	|------X------->|
	|------X------->|
	|------X------->|
	|------XC------>|
	|------X------->|
	|------X------->|
	|------X------->|
	|------XC------>|   credit=6  in_flight=12
	|	.	|
	|	:	|
			]]></artwork></figure></t>
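	     <t>The numbers in the figure above can be reproduced with a short simulation. The following Python sketch marks every fourth packet, starting with the first one, and sums the credits over the RTTs of Slow Start. The function name and its parameters are hypothetical, and the sketch assumes that the window exactly doubles every RTT and that credits never expire.</t>
	     <t><figure><artwork><![CDATA[
```python
def slow_start_credits(initial_window: int = 3, rtts: int = 3,
                       mark_every: int = 4) -> list:
    """Return one (in_flight, accumulated_credit) pair per RTT."""
    rows = []
    sent = 0        # packets sent so far in the connection
    credit = 0      # packets sent so far with the C bit set
    window = initial_window
    for _ in range(rtts):
        for _ in range(window):
            if sent % mark_every == 0:
                credit += 1
            sent += 1
        rows.append((window, credit))
        window *= 2  # Slow Start: window doubles every RTT
    return rows
```
]]></artwork></figure></t>
	     <t>With the default values the sketch yields 1, 3 and 6 accumulated credits for 3, 6 and 12 packets in flight, matching the figure.</t>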
	     <!--<t>TBD: Detailed discussion around 1/4th ConEx C marking during slow start; Any slow
		     start (session start, idle restart, RTO) or specific slow starts. Spurious RTO 
		     interaction?
	     </t>-->
	     
	     <!--<t>If a ConEx sender detects an increasing number of losses even though 
		     the sender reduced the sending rate, the sender SHOULD assume that 
		     those losses are incorporated by an audit device and thus should send 
		     further credits. Up to now its not clear if the credits say valid as long as the connection 
		     is established or if an expiration of the credits need to be assumed by the sender.
	     </t>-->
	     <!--<t>TBD: additional loss would reduce the sending rate, 
		     until enough credits are available to sustain a sending rate. Adding
		     one C bit for each loss recovery episode may be simpler?
	     </t>-->
	     <!--
		 3.1.  On Beginning of a TCP session
		 
		 A conex sender should build conex credits, by sending the 1st and 3rd
		 data segment with the conex C bit set.  For a detailed discussion as
		 to the background for choosing these two data segments to build conex
		 credits with the network, see [I-D.briscoe-tsvwg-re-ecn-tcp].
		 
		 In addition, a conex sender SHOULD maintain a gauge to account for
		 the number of (data) bytes [packets], which need to be sent with the
		 conex L bit set, and a similar gauge to account for the segments that
		 still have to carry the conex E bit.  As multiple indications of lost
		 segments or ECN marks can arrive simultaneously at the sender, a
		 second counter should be maintained for both the conex L and E bits,
		 to space the sending of these conex bits more evenly.  For
		 implementation details, see Appendix A.
		 
		 3.5.  On restarting idle TCP Connections
		 
		 Conex signals are valid only for a limited amount of time.
		 Furthermore, TCP does not currently account for lost ACKs or allow
		 the use of ECN marking on control segments (e.g. pure ACKs, window
		 probes, window updates, FIN or RST segments without data).  In a
		 common TCP connection, data will flow only unidirectional for a
		 certain period of time, while only TCP control packets are traversing
		 the return half-connection.  Subsequently, the data direction may
		 change.  If a TCP determines, that the duration between two sent data
		 segments becomes too large, it will reduce it's congestion window
		 (see Section 4.1 in [RFC5681]).  This MUST be accompanied in a conex
		 sender by the building of new conex credits, by setting conex C bit
		 in the 1st and 3rd data segment sent after the restart.  Furthermore,
		 the gauges, counters and flags maintained by the sender should be
		 reinitialized to zero, as any previous value will be invalid at that
		 point in time.
		 -->
     </section>
     
     <!--<section title="Credit Bits during Congestion Avoidance">
     </section>-->
     <section title="Loss of ConEx information">
	     <t>The audit function can have incorrect information if, e.g., ConEx markings got lost on the path (or the sender estimated a wrong number of ConEx markings due to a lack of feedback information). In this case the audit might wrongly penalize a sender.
	     TODO: Further action needed by the sender?</t>
     </section>
     

     
   </section>

   <section title="Timeliness of the ConEx Signals">
	   <t>ConEx signals will in any case be evaluated by a network node with a slight time delay of about one RTT.
		   Therefore, it is not absolutely necessary to signal ConEx
		   bits immediately when they become known (e.g.  L and E bits), but a sender SHOULD send the 
		   ConEx signal with the next available packet. In cases where it 
		   is preferable to slightly delay the ConEx signal, the sender MUST NOT delay the 
		   ConEx signal by more than one RTT.
	   </t>
	   <t>Multiple ConEx bits may become available for signaling at the same
		   time, for example when the sender receives an ACK indicating both
		   that at least one segment has been lost and that one or more ECN
		   marks were received.  This may happen during excessive congestion,
		   when queues overflow and some packets are marked while others still
		   have to be dropped.  It may also happen when ACKs are lost, so that a
		   subsequent ACK carries summary information not previously available
		   to the sender.
	   </t>
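	   <t>As a purely illustrative sketch, and not part of the mechanism specified here, the two paragraphs above could be modelled as a queue of pending signals, each stamped with the time at which it became known. The class name ConexSignalQueue, its method names, and the choice to set at most one ConEx bit per packet are assumptions made for this example only. The sketch is written in Python and counts signals rather than bytes to keep it short.</t>
	   <figure><artwork><![CDATA[
```python
from collections import deque


class ConexSignalQueue:
    """Hypothetical helper: pending ConEx signals, oldest first."""

    def __init__(self, rtt):
        self.rtt = rtt          # current RTT estimate, in seconds
        self.pending = deque()  # entries: (bit, time it became known)

    def on_feedback(self, now, lost_segments=0, ecn_marks=0):
        # A single ACK may report loss and ECN marks together.
        for _ in range(lost_segments):
            self.pending.append(("L", now))
        for _ in range(ecn_marks):
            self.pending.append(("E", now))

    def next_packet_bit(self, now):
        # Attach the oldest pending signal to the next packet.
        # Returns (bit, overdue); overdue is True if the signal
        # waited longer than one RTT before being sent.
        if not self.pending:
            return None, False
        bit, known_at = self.pending.popleft()
        return bit, (now - known_at) > self.rtt
```
]]></artwork></figure>
	   <t>In this sketch, a caller would invoke on_feedback() for each received ACK and next_packet_bit() for each outgoing packet. An overdue result indicates that the one-RTT bound was exceeded, for example because too few packets were available to carry the signals.</t>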
	   <!--<t>It may be preferrable to signal only one ConEx bit per segment, and to
		   space out the signaling of multiple bits across a (short) period of
		   time - or number of segments.  However, that delay should not be
		   excessive, and ideally also shorter than the RTT of the affected TCP
		   session.  The heuristic sketched in Appendix A uses a maximum delay
		   of 10 packets or 1/4 of the congestion window, whatever is smaller
		   to minimize delay.
	   </t>-->
	   <!--<t>It is important to remember, that ConEx bits and TCP retransmissions
		   do not interact with each other.  However, a retransmission should be
		   accompanied by one ConEx L bit in close proximity nevertheless. This does not mean, 
		   that TCP retransmissions may never contain ConEx marks. In 
		   a typical scenario using SACK, the first retransmission would not carry
		   a ConEx L bit, while subsequent retransmissions in the same recovery
		   episode, would be marked with the ConEx L bit.
		   Spreading the ConEx bits over a small number of segments increases
		   the likelihood that most devices along the path will see some
		   ConEx marks even during heavy congestion.
	   </t>-->
     </section>
   
   <section anchor="Acknowledgements" title="Acknowledgements">
	   <t>The authors would like to thank Bob Briscoe, who contributed the initial ideas and valuable feedback.</t>
   </section>

   <!-- Possibly a 'Contributors' section ... -->

   <section anchor="IANA" title="IANA Considerations">
	   <t>This document does not have any requests to IANA.</t>
   </section>

   <section anchor="Security" title="Security Considerations">
	   <t>With some of the advanced ECN compatibility modes, a sender may miss congestion notifications and consequently not decrease its sending rate. If the congestion persists, the likelihood of receiving a congestion notification increases. In the worst case, the sender will still react correctly to loss, which prevents congestion collapse.</t>
   </section>
 </middle>

 <!--  *****BACK MATTER ***** -->

 <back>

   <references title="Normative References">
     <!--?rfc include="http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml"?-->
     &RFC2119;
     &RFC3168;
     &RFC2018;
     &RFC5562;
     &RFC5681;

   </references>

<references title="Informative References">

	   &RFC3522;
	
	   &RFC3708;
	
	   &RFC4015;

	   &RFC5682;
	

	   <?rfc include="reference.I-D.briscoe-tsvwg-re-ecn-tcp.xml"?>

    <reference anchor="draft-kuehlewind-conex-accurate-ecn" >
    	<front>
    		<title>Accurate ECN Feedback in TCP</title>
    		
    		<author initials="M" surname="Kuehlewind">
    			<organization></organization></author>
    		<author initials="R" surname="Scheffenegger">
    			<organization></organization></author>
    		<date  month="Jun" year="2011"/>
    	</front>
    	<seriesInfo name="Internet-Draft" value="draft-kuehlewind-conex-accurate-ecn-00"/>
    </reference>
    			
	   
		<reference anchor="DCTCP">
			<front>
				<title>DCTCP: Efficient Packet Transport for the Commoditized Data Center</title>
				
				<author initials="M" surname="Alizadeh">
					<organization></organization></author>
				<author initials="A" surname="Greenberg">
					<organization></organization></author>
				<author initials="D" surname="Maltz">
					<organization></organization></author>
				<author initials="J" surname="Padhye">
					<organization></organization></author>
				<author initials="P" surname="Patel">
					<organization></organization></author>
				<author initials="B" surname="Prabhakar">
					<organization></organization></author>
				<author initials="S" surname="Sengupta">
					<organization></organization></author>
				<author initials="M" surname="Sridharan">
					<organization></organization></author>
				<date month="Jan" year="2010"/>
			</front>
		</reference>
		
		
    </references>

<!--    <section anchor="ApxA" title="Spacing conex marks evenly">
    	<t>Under certain circumstances, very high conex marking rates may 
    		need to be signaled. However, as conex maintains running averages, it
    		may be beneficial to send these marks more evenly spaced, rather than in
    		bursts of consecutive segments that all have the conex bits set.</t>
    	<t>If only a single conex mark needs to be sent, it should be sent immediately
    		to maintain optimal timeliness. Any subsequent conex marks may be delayed 
    		slightly, to disentangle retransmissions of the transport protocol from
    		packets carrying conex marks.</t>
    	<t>The following algorithm will provide such a method. When very high marking
    		rates are required, it will automatically set a conex mark with every sent
    		packet.</t>
    	<t>When an ACK is received, the sender determines the number of bytes which
    		need to be sent with a conex mark, and the next segment to be sent should 
    		carry at least part of the conex signal:</t>
	<figure><preamble>CEF .. congestion.experienced.flag<vspace blankLines="0" />
		CEG .. congestion.experienced.gauge<vspace blankLines="0" />
		CEC .. congestion.experienced.counter<vspace blankLines="0" />
		LEF .. loss.experienced.flag<vspace blankLines="0" />
		LEG .. loss.experienced.gauge<vspace blankLines="0" />
LEC .. loss.experienced.counter</preamble><artwork><![CDATA[
if (marks.received > 0) {
   CEF = 1;                # for the immediate mark
   CEG += bytes.received.with.CE; # for the delayed mark
}
if (lost segment retransmitted) {
   LEF = 1;
   LEG += IP.size.of.retransmitted.segment;              
}
]]></artwork></figure>
			<t>When sending a segment, the following algorithm is run to determine
				if the segment should carry a conex mark. Note that the counter is
				initialized to at most 10*(MTU of the path). This spaces two consecutive received
				marks at most 10 full sized data segments apart.</t>
			<t>For each of the L (loss) and E (ECN) conex bits, a similar algorithm 
				needs to run.</t>
<figure><artwork><![CDATA[
if ((LEF == 0) and (LEG > 0)) {
  if ((LEC <= 0) or (LEG >= LEC)) {
    LEG -= IP.size.of.segment.to.be.sent;
    LEC = min(10*(SMSS+IP.header+TCP.header), Flightsize/4); 
    LEF = 1; 
  } else {
    LEC -= LEG;
  }
} 
if ((CEF == 0) and (CEG > 0)) {
  if ((CEC <= 0) or (CEG >= CEC)) {
    CEG -= IP.size.of.segment.to.be.sent;
    CEC = min(10*(SMSS+IP.header+TCP.header), Flightsize/4); 
    CEF = 1; 
  } else {
    CEC -= CEG;
  }
} 

if (LEF == 1) {
  LEF = 0;
  SendSegment(conex.L.bit);
} else {
  if (CEF == 1) {
    CEF = 0;
    SendSegment(conex.E.bit);
  } else 
    SendSegment;
}
if (CEG < 0) {
  CEG += 1; # slowly reduce credit 
}
if (LEG < 0) {
  LEG += 1;
}
]]></artwork></figure>

    </section>-->

  <section title="Revision history">
  	  <t>RFC Editor: This section is to be removed before RFC publication.</t>
      <t>00 ... initial draft, early submission to meet deadline.
      </t>
	    <t>01 ... refined draft, updated LEG "drain" from per-packet to RTT-based.
	  	</t>
	    
	    <t><vspace blankLines='100' /></t>
		
	  </section> 
  

 </back>
</rfc>
