<?xml version="1.0" encoding="US-ASCII"?>
<!-- This template is for creating an Internet Draft using xml2rfc,
    which is available here: http://xml.resource.org. -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!-- One method to get references from the online citation libraries.
    There has to be one entity for each item to be referenced. 
    An alternate method (rfc include) is described in the references. -->

<!ENTITY RFC3168 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3168.xml">
<!ENTITY RFC2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY RFC2018 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2018.xml">
<!ENTITY RFC3522 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3522.xml">
<!ENTITY RFC3708 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3708.xml">
<!ENTITY RFC4015 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.4015.xml">
<!ENTITY RFC5562 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5562.xml">
<!ENTITY RFC5681 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5681.xml">
<!ENTITY RFC5682 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5682.xml">
<!ENTITY RFC6789 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.6789.xml">

]>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<!-- used by XSLT processors -->
<!-- For a complete list and description of processing instructions (PIs), 
    please see http://xml.resource.org/authoring/README.html. -->
<!-- Below are generally applicable Processing Instructions (PIs) that most I-Ds might want to use.
    (Here they are set differently than their defaults in xml2rfc v1.32) -->
<?rfc strict="yes" ?>
<!-- give errors regarding ID-nits and DTD validation -->
<!-- control the table of contents (ToC) -->
<?rfc toc="yes"?>
<!-- generate a ToC -->
<?rfc tocdepth="4"?>
<!-- the number of levels of subsections in ToC. default: 3 -->
<!-- control references -->
<?rfc symrefs="yes"?>
<!-- use symbolic references tags, i.e, [RFC2119] instead of [1] -->
<?rfc sortrefs="yes" ?>
<!-- sort the reference entries alphabetically -->
<!-- control vertical white space 
    (using these PIs as follows is recommended by the RFC Editor) -->
<?rfc compact="yes" ?>
<!-- do not start each main section on a new page -->
<?rfc subcompact="no" ?>
<!-- keep one blank line between list items -->
<!-- end of list of popular I-D processing instructions -->
<rfc category="exp" docName="draft-ietf-conex-tcp-modifications-05" ipr="trust200902">
 <!-- category values: std, bcp, info, exp, and historic
    ipr values: trust200902, noModificationTrust200902, noDerivativesTrust200902,
       or pre5378Trust200902
    you can add the attributes updates="NNNN" and obsoletes="NNNN" 
    they will automatically be output with "(if approved)" -->

 <!-- ***** FRONT MATTER ***** -->

 <front>
   <!-- The abbreviated title is used in the page header - it is only necessary if the 
        full title is longer than 39 characters -->

   <title>TCP modifications for Congestion Exposure</title>

   <!-- add 'role="editor"' below for the editors if appropriate -->

   <!-- Another author who claims to be an editor -->

    <author fullname="Mirja Kuehlewind" initials="M." role="editor"
      surname="Kuehlewind">
      <organization>University of Stuttgart</organization>
      <address>
        <postal>
          <street>Pfaffenwaldring 47</street>
          <code>70569</code>
          <city>Stuttgart</city>
          <country>Germany</country>
        </postal>
        <email>mirja.kuehlewind@ikr.uni-stuttgart.de</email>
      </address>
    </author>
    
    <author fullname="Richard Scheffenegger" initials="R."
           surname="Scheffenegger">
     <organization>NetApp, Inc.</organization>
     <address>
       <postal>
         <street>Am Euro Platz 2</street>
         <code>1120</code>
         <city>Vienna</city>
         <region></region>
         <country>Austria</country>
       </postal>
       <phone>+43 1 3676811 3146</phone>
       <email>rs@netapp.com</email>
     </address>
    </author>

   <date year="2014" />


   <area>Transport</area>

   <workgroup>Congestion Exposure (ConEx)</workgroup>

   <keyword>Internet-Draft</keyword>
   <keyword>I-D</keyword>

   <abstract>
     <t>Congestion Exposure (ConEx) is a mechanism by which senders inform the network 
       about the congestion encountered by previous packets on the same flow.
       This document describes the necessary modifications to use ConEx with the 
       Transmission Control Protocol (TCP). 
     </t>
   </abstract>
 </front>

 <middle>
   <section title="Introduction">
     <t>Congestion Exposure (ConEx) is a mechanism by which senders inform the network 
       about the congestion encountered by previous packets on the same flow. ConEx concepts 
       and use cases are further explained in <xref target="RFC6789"/>. The abstract ConEx
       mechanism is explained in <xref target="draft-ietf-conex-abstract-mech"/>.
       This document describes the necessary modifications to use ConEx with the 
       Transmission Control Protocol (TCP). 
     </t>
     <!--<t>
       ConEx is defined as a destination option for IPv6 <xref target="draft-ietf-conex-destopt"/>. 
       The use of four bits have been defined, namely the X (ConEx-capable), 
       the L (loss experienced), the E (ECN experienced) and C (credit) bit. 
     </t>-->
     <t>
       The ConEx signal is based on loss or Explicit Congestion Notification (ECN) 
       marks <xref target="RFC3168"/> as a congestion indication. This congestion 
       information is retrieved by the sender based on existing feedback mechanisms 
       from the receiver to the sender in TCP.
     </t>
     <t>
       This document describes mechanisms for both TCP with and without the
       Selective Acknowledgment (SACK) extension <xref target="RFC2018"/>. However, ConEx benefits
       from more accurate information about the number of packets dropped in the
       network. We therefore recommend using the SACK extension when using TCP
       with ConEx.
     </t>
     <t>
       While loss-based congestion feedback should be minimized, ECN could provide more
       fine-grained feedback information, from which ConEx-based traffic measurement or management
       mechanisms would benefit. Unfortunately, the current ECN feedback does not reflect multiple congestion markings
       that occur within the same Round-Trip Time (RTT). A more accurate feedback extension to ECN is
       defined in a separate document <xref target="draft-kuehlewind-tcpm-accurate-ecn"/>, 
       as it is also useful for other mechanisms.
       <!-- as e.g. <xref target="DCTCP"/> or whenever the congestion 
       control reaction should be proportional to the experienced congestion. ConEx also works with classic ECN but it is less accurate when multiple congestion markings occur within on RTT.-->
      </t>
      <!--<t>The current version of this draft is only a first collection of ConEx-based TCP
        modification and should not be regared as feature-complete as 
        the specification for the abstract ConEx mechanism is still under discussion.
        The next version will also go more precisely into implementation details.
      </t>-->

     <section title="Requirements Language">
       <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
         "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
         document are to be interpreted as described in <xref target="RFC2119"/>.
       </t>
     </section>
   </section>

   <section title="Sender-side Modifications">
     <t>A ConEx sender MUST negotiate for both SACK and ECN, or the more accurate ECN feedback, 
       in the TCP handshake if these TCP extensions are available at the sender. 
       Thus a ConEx sender SHOULD also implement SACK and ECN.
       Depending on the capability of the receiver, the following operation modes exist:
       <list style="symbols">
         <t> SACK-accECN-ConEx (SACK and accurate ECN feedback) </t>
         <t> accECN-ConEx (no SACK but accurate ECN feedback)</t>
         <t> ECN-ConEx (no SACK and no accurate ECN feedback but 'classic' ECN)</t>
         <t> SACK-ECN-ConEx (SACK and 'classic' instead of accurate ECN)</t>
         <t> SACK-ConEx (SACK but no ECN at all)</t>
         <t> Basic-ConEx (neither SACK nor ECN) </t>
       </list>
     </t>
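     <t>The mapping from negotiated capabilities to the operation modes listed above can be
       sketched as follows. This is only an illustration: the names ConExMode and select_mode
       and the three boolean inputs are assumptions made for this example and are not defined
       by this document. The sketch further assumes that accurate ECN feedback, if negotiated,
       takes precedence over 'classic' ECN.
     </t>
     <figure><artwork><![CDATA[
```python
from enum import Enum


class ConExMode(Enum):
    SACK_ACCECN = "SACK-accECN-ConEx"
    ACCECN = "accECN-ConEx"
    ECN = "ECN-ConEx"
    SACK_ECN = "SACK-ECN-ConEx"
    SACK = "SACK-ConEx"
    BASIC = "Basic-ConEx"


def select_mode(sack: bool, ecn: bool, acc_ecn: bool) -> ConExMode:
    """Map negotiated receiver capabilities to an operation mode.

    Assumption: accurate ECN feedback wins over classic ECN.
    """
    if acc_ecn:
        return ConExMode.SACK_ACCECN if sack else ConExMode.ACCECN
    if ecn:
        return ConExMode.SACK_ECN if sack else ConExMode.ECN
    return ConExMode.SACK if sack else ConExMode.BASIC
```
]]></artwork></figure>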
     <t>A ConEx sender MUST expose all congestion information to the network according to the congestion information received 
       by ECN or based on loss information provided by the TCP feedback loop. A TCP sender SHOULD account for congestion
       byte-wise (and not packet-wise). A sender MUST mark <!--the respective number of payload bytes in -->
       subsequent packets (after the congestion notification) with the respective ConEx bit in the IP header.
       Furthermore, a ConEx sender must send enough credit to cover all experienced congestion for the connection so far, as well as the risk of congestion for the current transmission (see <xref target="credits"/>).
     </t> 
     <t>With SACK only the number of lost payload bytes is known, but not the number of packets carrying these bytes. 
       With classic ECN only an indication is given that a marking occurred but not the exact number 
       of payload bytes nor packets. As network congestion is usually byte-congestion <xref target="draft-briscoe-tsvwg-byte-pkt-mark"/>, the exact number of bytes should 
       be taken into account, if available, to make the ConEx signal as exact as possible.
     </t>
     <t>The congestion accounting for each operation mode is described in the next section, and 
       the handling of the ConEx bits themselves in the section after that.
     </t>
   </section>

   <section title="Accounting congestion">
     <t>A ConEx sender that accounts for congestion byte-wise, based on
       the congestion information received by ECN or loss detection provided by TCP, will 
       maintain two different counters. These counters hold the number of outstanding bytes
       that should be ConEx marked with either the E bit or the L bit in subsequent packets. 
     </t>
     <t>The outstanding bytes accounted based on ECN feedback information are maintained
       in the congestion exposure gauge (CEG). The accounting of these bytes from the ECN feedback 
       is explained in more detail next in <xref target="ECN"/>. 
     </t>
     <t>The outstanding bytes for congestion indications based on loss are maintained
       in the loss exposure gauge (LEG) and the accounting is explained in <xref target="loss"/>. 
     </t>
     <t>Furthermore, those counters will be reduced every time a ConEx capable packet with the E or L bit
	     set is sent. This is explained for both counters in <xref target="settingBits"/>.
     </t>
     <t>Usually all bytes of an IP packet must be accounted for. Therefore 
       the sender SHOULD take the headers into account, too. If equal sized packets, or
       at least equally distributed packet sizes, can be assumed,
       the sender MAY account for the TCP payload bytes only. In this case there should be about the same
       number of ConEx marked packets as original packets that caused the congestion, so 
       both contain about the same number of header bytes. This case is assumed for simplification in the following sections.
     </t>
     <t>Otherwise, if a sender sends packets of different sizes (with unequally
       distributed packet sizes), the sender needs to memorize or estimate the number of ECN-marked or 
       lost packets. A sender might be able to reconstruct the number of packets, and thus the header 
       bytes, if the sizes of all packets that were sent during the last RTT are known.
       If no additional information is available, the worst-case number of packets, and thus header bytes, should 
       be estimated conservatively based on the minimum packet size (of all packets sent in the 
       last RTT). If the number of ConEx marked packets is smaller (or larger) than the estimated number 
       of ECN-marked or lost packets, the additional header bytes should be added to (or can be subtracted
       from) the respective counter.
     </t>
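     <t>A minimal sketch of the conservative estimate described above is given below. The function
       name and its parameters are hypothetical. The sketch assumes that the congested amount is
       known in payload bytes, that min_packet_bytes is the smallest payload size sent in the last
       RTT, and that all packets carry the same header size.
     </t>
     <figure><artwork><![CDATA[
```python
def worst_case_header_bytes(congested_bytes: int,
                            min_packet_bytes: int,
                            header_bytes: int) -> int:
    """Conservative header-byte estimate for congested data.

    Assumes every affected packet was as small as the smallest
    packet of the last RTT, which maximizes the packet count.
    """
    if congested_bytes <= 0:
        return 0
    if min_packet_bytes <= 0:
        raise ValueError("min_packet_bytes must be positive")
    # Integer ceiling division: a partial packet counts fully.
    packets = -(-congested_bytes // min_packet_bytes)
    return packets * header_bytes
```
]]></artwork></figure>
     <t>For example, 2900 congested payload bytes with a minimum payload size of 500 bytes and
       40 header bytes per packet yield six packets and thus 240 header bytes.
     </t>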
      
     <section title="ECN" anchor="ECN">
      <t>ECN <xref target="RFC3168"/> is an IP/TCP mechanism that allows network nodes to mark packets
         with the Congestion Experienced (CE) mark instead of (early) dropping 
         them when congestion occurs. <!--As soon as a CE marks is received at 
         the receiver, it will feed this information back to the sender. --> 
         As soon as a CE mark is seen at the receiver, with classic ECN it will 
         feed this information back to the sender by setting the Echo Congestion 
         Experienced (ECE) bit in the TCP header of all subsequent ACKs until a packet with Congestion 
         Window Reduced (CWR) bit in the TCP header is received to acknowledge 
         the reception of the congestion notification. The sender sets the CWR 
         bit in the TCP header once when the first ECE of a congestion 
         notification is received.
       </t>
       <t>A receiver can support 'classic' ECN, a more accurate ECN feedback scheme, or neither. 
         If ECN is not supported at all, no ECN marks will occur and 
         thus the E bit will never be set. Otherwise, a ConEx sender 
         must maintain a counter, the congestion exposure gauge (CEG), for the number of outstanding 
         bytes that have to be ConEx marked with the E bit.
       </t>
       <t>The CEG is increased when ECN information is received from an ECN-capable 
         receiver supporting the 'classic' ECN scheme or the accurate ECN 
         feedback scheme. When the ConEx sender receives an ACK indicating one or more
         segments were received with a CE mark, CEG is increased by the appropriate 
	 number of bytes as described further below.
    <!--sent by the IP layer (e.g. by MTU bytes for each SMSS segment).
         Whenever a packet is sent with the E bit set, this gauge is decreased by the
         IP size of that packet. -->
    </t>
    
    <t>Unfortunately in case of duplicate acknowledgements the number of newly acknowledged bytes will be zero even though (CE marked) data has been received. Therefore, we increase the CEG by DeliveredData, as defined below: 
    </t>
    <t>DeliveredData covers the number of bytes that have been newly delivered to the receiver. On each arrival of an ACK, DeliveredData is calculated from the newly acknowledged bytes (acked_bytes) indicated by the current ACK, relative to all past ACKs. With SACK, DeliveredData is additionally increased by the number of bytes provided by (new) SACK information (SACK_diff). Note that SACK_diff can be negative if fewer unacknowledged bytes are announced in the new SACK information than in the previous ACK. In this case, data is newly acknowledged (in acked_bytes) that has previously already been accounted to DeliveredData based on SACK information. 
	Without SACK, DeliveredData
          is estimated to be 1 SMSS on duplicate acknowledgements. For the subsequent partial or full ACK,
	  DeliveredData is estimated to be the newly acknowledged bytes, minus one SMSS for each preceding duplicate ACK.</t>
    
    <t>DeliveredData = acked_bytes + SACK_diff + (is_dup)*1SMSS - (is_after_dup)*num_dup*1SMSS
    </t>
    <t>Thus is_dup is one if the current ACK is a duplicate ACK without SACK, and zero otherwise. 
      is_after_dup is one only for the next full or partial ACK after a number of duplicate ACKs 
      without SACK, and num_dup counts the number of duplicate ACKs in a row.
    </t>
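    <t>The DeliveredData calculation above can be sketched per ACK as follows. The class
      DeliveryTracker and its method on_ack are hypothetical names. As a simplification made only
      for this example, an ACK that acknowledges no new bytes is treated as a duplicate ACK; a real
      TCP stack applies further conditions. The result is not clamped, so it follows the formula
      even when the outcome is negative.
    </t>
    <figure><artwork><![CDATA[
```python
from dataclasses import dataclass


@dataclass
class DeliveryTracker:
    smss: int
    sack_enabled: bool
    num_dup: int = 0  # duplicate ACKs in a row (non-SACK only)

    def on_ack(self, acked_bytes: int, sack_diff: int = 0) -> int:
        """Return DeliveredData for one incoming ACK."""
        if self.sack_enabled:
            # is_dup and is_after_dup are both zero with SACK.
            return acked_bytes + sack_diff
        if acked_bytes == 0:
            # Duplicate ACK without SACK: estimate one SMSS.
            self.num_dup += 1
            return self.smss
        # Next partial or full ACK: take back the estimates.
        delivered = acked_bytes - self.num_dup * self.smss
        self.num_dup = 0
        return delivered
```
]]></artwork></figure>
    <t>Without SACK and with an SMSS of 1000 bytes, three duplicate ACKs followed by an ACK for
      4000 new bytes give four DeliveredData values of 1000 bytes each, so the total matches the
      4000 bytes that were actually acknowledged.
    </t>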
    
    <t>The two cases, with and without more accurate ECN depending on the 
    receiver capability, are discussed in the following sections.
       </t>
       <!--<t>TBD: Discussion to set ECN in which packets. Initially apply RFC5562 rules
         ([SYN,ACK] and data segments only), as security implications of ECN on
         control packets ([SYN], pure [ACK], window probe, window update, ...) is an open 
         research question. However, running bidirectional ECN on all TCP segments 
         including TCP control packets, may allow for more timely and accurate ConEx 
         signals. Also, ConEx provides a framework to possibly address some of these 
         security risks.</t>-->
       <section title="Accurate ECN feedback">
         <t>With a more accurate ECN feedback scheme, either the number of marked packets (received CE marks) or directly the number of marked bytes is
         known. In the latter case the CEG can 
         directly be increased by the number of marked bytes. 
         Otherwise, if D is assumed to be the number of marks,
         <!-- Otherwise when the accurate ECN feedback scheme is supported by the 
          receiver, the receiver will maintain an echo congestion counter 
          (ECC). The ECC will hold the number of CE marks received. A 
          sender that is understanding the accurate ECN feedback will be 
          able to reconstruct this ECC value on the sender side by 
          maintaining a counter ECC.r.
        </t>
        <t>On the arrival of every ACK, the sender calculates
          the difference D between the local ECC.r counter, and the signaled
          value of the receiver side ECC counter. The value
          of ECC.r is increased by D, and D is assumed to be the number 
          of CE marked packets that arrived at the receiver since it 
          sent the previously received ACK.
        </t>
        <t>Whenever the counter ECC.r is increased, -->
    the gauge CEG will be conservatively increased by one SMSS for each marking, or at most by the number of
    newly acknowledged bytes:</t>
    <!--has to be increased by the amount of bytes sent which were marked:</t>-->
    <t>CEG += min(SMSS*D, DeliveredData)</t>
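    <t>As a worked illustration of the update rule above, the following sketch computes the CEG
      increment for one ACK. The function name ceg_increase is an assumption of this example;
      marks corresponds to D.
    </t>
    <figure><artwork><![CDATA[
```python
def ceg_increase(marks: int, smss: int, delivered_data: int) -> int:
    """CEG increment for one ACK: min(SMSS*D, DeliveredData)."""
    return min(smss * marks, delivered_data)
```
]]></artwork></figure>
    <t>With an SMSS of 1460 bytes, two marks and 5000 delivered bytes increase the CEG by 2920
      bytes, whereas five marks and 3000 delivered bytes increase it by only 3000 bytes.
    </t>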
    <!--<t>CEG += min( (SMSS+IPhl+TCPhl)*D, acked_bytes + (IPhl+TCPhl)*D )</t>
    <t>where IPhl is the IP heder length which is when using IPv4 20 Byte and with IPv6 40 Byte and where TCPhl is the TCP header length which is 20 byte. Those values give the respective header length withour any options. If an option is negociated which has to be carried in every packet, the size of the option MUST be added for each header. Otherwise the best case with no option header is assumed.</t>-->
          
       </section>
       <section title="Classic ECN support">
	 <t>If the ConEx sender fully conforms to the semantics of the ECN signaling as defined by 
           <xref target="RFC5562"/>, it will receive one full RTT of delayed ACKs with the ECE flag set 
	   whenever at least one CE mark was received by the receiver.
	   As the sender cannot estimate how many packets have actually been CE marked during this RTT, 
	   the most conservative assumption should be taken, namely assuming that all packets were marked.
	   This can be achieved by increasing the CEG by DeliveredData for each ACK with the ECE flag:
	   <vspace blankLines="1"/>
	   CEG += DeliveredData
         </t>
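	 <t>The conservative rule above can be sketched as a loop over received ACKs. The function
	   name and the representation of an ACK as a (DeliveredData, ECE flag) pair are assumptions
	   made only for this example.
	 </t>
	 <figure><artwork><![CDATA[
```python
def ceg_after_acks(acks, ceg=0):
    """Apply CEG += DeliveredData for every ACK carrying ECE.

    acks: iterable of (delivered_data, ece_flag) pairs.
    """
    for delivered_data, ece in acks:
        if ece:
            ceg += delivered_data
    return ceg
```
]]></artwork></figure>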
	 
	 <t>Optionally a ConEx sender could implement an Advanced Compatibility Mode:<vspace blankLines="1"/>
	 To extract more than one ECE indication per RTT, a ConEx sender could
        set the CWR flag opportunistically to force the receiver to signal
	only one ECE per CE mark. Unfortunately, the use of delayed
        ACKs <xref target="RFC5681"/>, as is common today, prevents feedback of every CE mark. 
	If a CWR confirmation is received before the ECE can be sent out with the next ACK,
	ECN feedback information could get lost.
	Thus a sender should set CWR only on those data segments that will actually trigger a (delayed) ACK. 
	The sender would need an additional control loop to estimate which data segments will trigger an ACK.
	Such a more sophisticated heuristic could extract congestion notifications in a more timely manner.
	Still, the CEG needs to be increased by DeliveredData, as one or more CE marked packets could be
	acknowledged by one delayed ACK.
       </t>
         
       <!--<t>A ConEx sender that communicates with a classic ECN receiver (conforming
           to <xref target="RFC3168"/> or <xref target="RFC5562"/>) may
           run in one of these modes:
       <list style="symbols">
         <t>Full compliance mode:<vspace blankLines="1"/>
           The ConEx sender fully conforms to
             all the semantics of the ECN signaling as defined by 
             <xref target="RFC5562"/>. In this mode, only a single
             congestion indication can be signaled by the receiver
             per RTT. Whenever the ECE flag toggles from "0" to "1", 
        the gauge CEG is increased at maximum by the SMSS:--><!-- plus headers.-->
        <!--<vspace blankLines="1"/>
        CEG += min(SMSS, DeliveredData)
        <vspace blankLines="1"/>
             Note that most often, a session adhering
             to these semantics may not provide enough ConEx marks as usually more than one CE mark 
	     will occur during one congestion event (within one RTT). We assume that 
	     the credits build up during the Slow Start phase will cover the mismatch
	     for short connections with only light congestion. Otherwise this will cause appropriate
	     sanctions by an audit device in a ConEx enabled network. To avoid this in any case, on whole RTT
	     of packets need to be regarded as congestion marked. Thus increasing the CEG by the number
	     of DeliveredData for each ACK with the ECE bit set, would cover the worst case estimation.-->
	<!-- ToDo: Or should we increase for each ECE instead because overestimation is better than underestimation-->
	<!--</t>
      <t>Simple compatibility mode:<vspace blankLines="1"/>-->
        <!--Alternatively, a ConEx sender MAY
            set the CWR flag opportunistically, to extract more than
             one ECE indication per RTT. In 
             the most simple form, CWR can be set on a permanent basis.-->
        <!--The sender will set the CWR permanently to force the receiver to signal
        only one ECE per CE mark. Unfortunately, the use of delayed
        ACKs <xref target="RFC5681"/>, as it is usually done today, will prevent a feedback of every CE mark. An CWR confirmation will be received before the ECE can be sent out with the next ACK. 
        With an ACK rate of M, about M-1/M CE indications will not be signaled back by
        the receiver (e.g. 50% with M=2 for delayed ACKs). Thus, in this mode the ConEx sender MUST
        increase CEG --><!--by a count ofM*SMSS-->
        <!--M*(SMSS+IP.header+TCP.header)<vspace blankLines="1"/>-->
        <!--as if M congestion notification were received
        for each received ECE signal:
        <vspace blankLines="1"/>
        CEG += min(M*SMSS, DeliveredData + (M-1)*SMSS)
        <vspace blankLines="1"/>
        In case of a congestion event with low congestion (that means when only a very
        smaller number of packets get marked), the sender might miss the whole
	congestion event. Even though the sender will send sufficient ConEx marks on 
	average due to the scheme proposed
	above, these ConEx marks might be shifted in time and an audit might penalize this behavior. 
        Regarding congestion control, it is not a general problem to miss a congestion event
        as, by chance, a marking scheme in the network node might also
        miss a certain flow. 
        In the case where no other flow is reacting, the congestion level will increase
        and it will get more likely that the congestion feedback is delivered. 
        To provide a fair share over time, a TCP sender implementing this simple ECN compatibility mode could react more strongly
        when receiving an ECN feedback signal. This of course depends on the congestion control
        used. --><!--A TCP sender using this scheme MUST take the impact on congestion control
        into account.--><!--</t>
      <t>Advanced compatibility mode:<vspace blankLines="1"/>
        To avoid the loss of ECN feedback information in the proposed simple compatibility mode, a sender could 
             set CWR only on those data segments, that
             will actually trigger a (delayed) ACK. The sender would need an additional control loop to estimated which data segment will trigger an ACK.
        Such a more sophisticated heuristics could extract
             congestion notifications more timely. --><!--A ConEx sender MAY
             choose to implement such an heuristic.--> 
	     <!--In addition, if this advanced compatibility mode is used, further
             heuristics SHOULD be implemented, to determine the value
             of each ECE notification. E.g. for each consecutive ACK received
             with the ECE flag set, 
        CEG should be increased by 
        min( M*SSMS, DeliveredData). 
        Else if the predecessor ACK was received with the ECE flag cleared,
        CEG need only be increase at maximum by one SMSS:
        <vspace blankLines="1"/>
        if previous_marked: CEG += min( M*SSMS, DeliveredData)
        <vspace blankLines="0"/>
        else: CEG += min(SMSS, DeliveredData)
        <vspace blankLines="1"/>-->
        <!--D should be increased by one, and
        CEG increased by<vspace blankLines="1"/>
        CEG += min((SMSS+IP.header+TCP.header)*D, acked_bytes+(IP+TCP Header)*D)
        <vspace blankLines="1"/>
             If an ACK is received with the ECE flag cleared, D must be
             set to zero. -->
        <!--This heuristic is conservative during more
             serious congestion, and more relaxed at low congestion
             levels.</t>
         </list>
        </t>-->
         
       </section>
     </section>
     <section title="Loss Detection" anchor="loss">
       <t><!--For all the data segments that are determined by a ConEx sender 
         as lost, (at least) the same number of TCP payload bytes MUST be be sent 
         with the ConEx L bit set.-->
	 <!--Loss detection in TCP typically happens by the use 
         of duplicate ACKs, or the firing of the retransmission timer.-->  
	 A ConEx sender MUST maintain a loss exposure gauge (LEG), indicating 
         the number of outstanding bytes that must be sent with the ConEx
         L bit.  When a data segment is retransmitted, LEG will be 
         increased by the number of TCP payload bytes contained in the retransmission,
         assuming equal sized segments such that the retransmitted packet will have 
         the same number of header bytes as the original one.
         <!--When sending subsequent segments, 
         the ConEx L bit is set as long as LEG is positive, and LEG is 
         decreased by the size of the sent TCP payload bytes with the ConEx L bit set.-->
        </t>
         <t>Any retransmission may be spurious.  To accommodate that, a ConEx
         sender SHOULD make use of heuristics to detect such spurious
         retransmissions (e.g. F-RTO <xref target="RFC5682"/>, DSACK 
         <xref target="RFC3708"/>, and Eifel <xref target="RFC3522"/>, 
         <xref target="RFC4015"/>).  When such a heuristic has determined
         that a certain number of packets were retransmitted
         erroneously, the ConEx sender should subtract the payload size of these
         TCP packets from LEG. 
        </t>
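        <t>The LEG bookkeeping described above can be sketched as follows. The class and method
          names are hypothetical, and the detection of spurious retransmissions itself (e.g. by
          F-RTO, DSACK or Eifel) is assumed to happen elsewhere and only report the affected
          payload size.
        </t>
        <figure><artwork><![CDATA[
```python
class LossExposureGauge:
    """Outstanding bytes to be sent with the ConEx L bit."""

    def __init__(self) -> None:
        self.leg = 0

    def on_retransmit(self, payload_bytes: int) -> None:
        # Every retransmission is first assumed to repair a loss.
        self.leg += payload_bytes

    def on_spurious(self, payload_bytes: int) -> None:
        # A heuristic found the retransmission to be unnecessary.
        self.leg -= payload_bytes
```
]]></artwork></figure>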
        <!--<t>Note that the above heuristics delays the ConEx signal by one 
         segment, and also decouples them from the retransmissions themselves, as
         some control packets (e.g. pure ACKs, window probes, or window updates) 
         may be sent in between data segment retransmissions.
         A simpler approach would be to set the ConEx signal for each
         retransmitted data segment.  However, it is important to remember, that 
         a ConEx signal and TCP segments do not natively belong together. 
        </t>--> 
	<section title="Without SACK Support">
	<t>If multiple losses occur within one RTT and SACK is not used, it may take several RTTs
	until all lost data is retransmitted. With the scheme described above, the ConEx information would then be 
	delayed considerably, although timeliness is important for ConEx.</t>

	<t>For ConEx it is not important to know which data got lost, only how much.
	During the first RTT after the initial loss detection, the amount of received data, and thus also the 
	amount of lost data, can be estimated based on the number of received ACKs.
	Thus without SACK, the information needed for the ConEx
	feedback can be available with an additional delay of one RTT by using the 
	following estimation algorithm:</t>
	
        <t>If SACK information is not available, a ConEx sender should maintain
	an additional Loss Estimation Counter (LEC). With the first retransmission 
	of a congestion event, LEC is set to:
	<vspace blankLines="1"/>
	LEC = f - 3*SMSS 
	<vspace blankLines="1"/>
	where f is the current flight size in bytes. At this point in the 
	transmission, in the worst case, all packets in flight except the three that triggered 
	the duplicate ACKs could have been lost. 
	For each retransmission that is sent, the LEG will still be increased, 
	but the LEC will also be decreased by the payload size of the retransmission.
	During the following RTT, LEC should be reduced by SMSS for each ACK that is 
	received. Thus after one RTT the LEC estimates the number of outstanding bytes that 
	should be ConEx L marked.
	To avoid delaying this information further, LEG should now be increased by LEC. 
	From then on, every following retransmission should only reduce the LEC and not 
	increase the LEG until the LEC is zero, as those bytes have already been accounted for.</t>
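	<t>The estimation algorithm above can be sketched as a small state machine. All names are
	hypothetical. Two points are assumptions of this example rather than statements of this
	document: the LEC is clamped at zero, and once the LEC is exhausted any remaining
	retransmitted bytes are added to the LEG again. Detecting that one RTT has elapsed is left
	to the caller.</t>
	<figure><artwork><![CDATA[
```python
class LossEstimator:
    """LEG/LEC bookkeeping for a sender without SACK."""

    def __init__(self, smss: int) -> None:
        self.smss = smss
        self.leg = 0
        self.lec = 0
        self.phase = "idle"  # idle -> estimating -> settled

    def on_retransmit(self, payload: int, flight_size: int) -> None:
        if self.phase == "idle":
            # First retransmission of a congestion event.
            self.lec = max(flight_size - 3 * self.smss, 0)
            self.phase = "estimating"
        if self.phase == "estimating":
            self.leg += payload
            self.lec = max(self.lec - payload, 0)
        else:
            # Bytes already moved into LEG are not counted twice.
            covered = min(payload, self.lec)
            self.lec -= covered
            self.leg += payload - covered

    def on_ack(self) -> None:
        if self.phase == "estimating":
            self.lec = max(self.lec - self.smss, 0)

    def on_rtt_elapsed(self) -> None:
        if self.phase == "estimating":
            self.leg += self.lec
            self.phase = "settled"
```
]]></artwork></figure>
	<t>With an SMSS of 1000 bytes and 10000 bytes in flight, a first retransmission of 1000 bytes
	followed by four ACKs leaves an LEC of 2000 bytes, so the LEG is 3000 bytes after one RTT. The
	next two retransmissions of 1000 bytes each then only drain the LEC.</t>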
	</section>

     </section>
   </section>

   <section title="Setting the ConEx Bits"> 
       
    <!-- RS: remove IPv6 - Conex signals (bits) should be defined agnostic of IP version, right?) -->
      <t>
         ConEx is defined as a destination option for IPv6 <xref target="draft-ietf-conex-destopt"/>. 
         Four bits have been defined, namely the X (ConEx-capable), 
         the L (loss experienced), the E (ECN experienced) and the C (credit) bit. 
       </t>
       <t>By setting the X bit a packet is marked as ConEx-capable. 
         All packets carrying payload MUST be marked with the X bit set, including retransmissions. No congestion feedback information is available about control packets such as pure ACKs, which do not carry any payload. Thus these packets should not be taken into account when determining ConEx information. These packets MUST carry a ConEx Destination Option with the X bit unset.
       </t>
       <!--<t>By setting the X bit a packet is marked as ConEx-capable. It is not 
         decided yet which or if any packets should not be ConEx capable. 
         (e.g. control packets as pure ACKs or retransmits). It is not defined yet which bits 
         (E, L, C) can be set at the same time in one (data) packet. It is assumed 
         that ConEx marked packets are accounted by their respective IP size, as 
         all the signals (Loss, ECN) are attributes of an IP packet, not a TCP segment 
         or merely the TCP payload. Further discussion is needed here.
     </t>-->
    <section title="Setting the E and the L Bit" anchor="settingBits">
       <t>As long as the CEG or LEG counter is positive, ConEx-capable packets SHOULD be marked
         with E or L respectively, and the CEG or LEG counter is decreased by the number of TCP
         payload bytes carried in this packet. If the CEG or LEG counter is negative, 
         the respective counter SHOULD be reset to zero within one RTT after it was
         last decreased, or within one RTT after recovery if no further congestion
         occurred.
         <!-- This can be done by remembering the seq# of the next packet send after the
         LEG went negative and reset the LEG if the respective ACk is received.-->
         <!--is drained by one byte with every 
         packet sent out, as ConEX information are only meaningful for a certain time:
         <vspace blankLines="1" />
         if CEG > 0: CEG -= TCPpayload.length else: CEG -= 1<vspace blankLines="0" />
         if LEG > 0: LEG -= TCPpayload.length else: LEG -= 1
         -->
         <!--As ConEx credits have only a limited lifetime, 
         whenever the gauge becomes negative, it should be drained at a low 
         rate (e.g. 1 count per sent packet).-->
         
       </t>
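       <t>The marking rule above can be illustrated with a minimal sketch. It is not a
         normative algorithm: the class ConExGauges and its method names are hypothetical,
         the gauges are counted in TCP payload bytes, and the caller is assumed to invoke
         reset_negative() within the one-RTT bound given above. Pure ACKs are modeled
         as packets with a payload length of zero that carry no marks.</t>
       <figure><artwork><![CDATA[
```python
from dataclasses import dataclass


@dataclass
class ConExGauges:
    """Hypothetical holder for the two gauges named in the text."""
    leg: int = 0  # loss exposure gauge (LEG), in payload bytes
    ceg: int = 0  # congestion exposure gauge (CEG), in payload bytes

    def mark_packet(self, payload_len: int) -> set:
        """Return the set of ConEx bits for one outgoing packet."""
        if payload_len == 0:
            return set()          # pure ACK: X bit unset, no other marks
        bits = {"X"}              # every data packet is ConEx-capable
        if self.leg > 0:          # outstanding loss exposure
            bits.add("L")
            self.leg -= payload_len
        if self.ceg > 0:          # outstanding ECN exposure
            bits.add("E")
            self.ceg -= payload_len
        return bits

    def reset_negative(self) -> None:
        """Reset negative gauges to zero (assumed to run within one RTT)."""
        self.leg = max(self.leg, 0)
        self.ceg = max(self.ceg, 0)


g = ConExGauges(leg=2000, ceg=500)
first = g.mark_packet(1460)    # both gauges positive: X, L and E
second = g.mark_packet(1460)   # only LEG still positive: X and L
third = g.mark_packet(1460)    # both gauges negative: X only
g.reset_negative()             # both gauges return to zero
```
]]></artwork></figure>
       <t>In this sketch a gauge may overshoot below zero by less than one packet's payload,
         which is why the reset step is needed.</t>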
       <t>If SACK information is not available 
        <!--or SACK information has been reset for any reason-->
	       spurious retransmissions are more likely. In this case it might be valuable to 
	       slightly delay the ConEx loss feedback until a spurious retransmission could 
	       be detected. However, the ConEx signal MUST NOT be delayed by more than one RTT as long as 
	       data packets are sent out.
       </t>
   </section>
    
   <section anchor="credits" title="Credit Bits">
       <t>The ConEx abstract mechanism requires that sufficient 
         credit be signaled in advance to cover the expected congestion during the
	 feedback delay of one RTT. A ConEx sender should maintain a counter c of the sent 
    	 credits in bytes. If congestion occurs, credits will be consumed and the c counter should 
	 be reduced by the number of bytes that were lost or estimated to be ECN-marked. 
	 If the risk of congestion was estimated wrongly and thus too few credits were sent, 
	 the c counter becomes zero but cannot become negative.
       </t>
       
       <t> The number of credits sent should always equal 
	  the number of bytes in flight, as all packets could potentially get lost 
	  or congestion marked. Thus a ConEx sender should monitor the 
	  number of bytes in flight f. If f ever becomes larger than c, the 
	  ConEx sender SHOULD send new credits. Remember that c will be decreased if 
	  congestion occurs. 
       </t>
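       <t>As an illustration only, the bookkeeping of c against f might be sketched as follows.
         The class CreditCounter and its methods are hypothetical names; the sketch assumes that
         one packet sent with the C bit adds its payload size to c, and that flight_size is the
         number of bytes in flight including the packet about to be sent.</t>
       <figure><artwork><![CDATA[
```python
class CreditCounter:
    """Hypothetical tracker for the sent-credit counter c (in bytes)."""

    def __init__(self) -> None:
        self.c = 0

    def on_congestion(self, nbytes: int) -> None:
        """Consume credits for lost or ECN-marked bytes; c never goes below zero."""
        self.c = max(self.c - nbytes, 0)

    def on_send(self, payload_len: int, flight_size: int) -> bool:
        """Return True if the packet being sent should carry the C bit.

        flight_size is f, assumed to include the packet being sent.
        """
        if flight_size > self.c:      # f larger than c: send new credit
            self.c += payload_len
            return True
        return False


cc = CreditCounter()
# Three packets grow the flight size; a fourth replaces an ACKed one.
marks = [cc.on_send(1000, f) for f in (1000, 2000, 3000, 3000)]
cc.on_congestion(1000)                # one packet lost: c drops to 2000
after_loss = cc.on_send(1000, 3000)   # f exceeds c again, so credit is re-sent
```
]]></artwork></figure>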
	 
       <!--<t> With a more moderate view, only an increase in the sending
       rate should cause loss. In contrast the number of ECN markings
       within one RTT not only depends actions taken by the sender.
       In general if Active Queue Management (AQM) is used, packets will be
       dropped or marked depending on the parameterization of the
       respective AQM scheme and influenced by the cross traffic.
       Thus the number of losses and marks could also be larger than
       Expected momentarily. The maximum number per congestion event could
       potentially be estimated over time. This case is not further
       expanded here. </t>-->
       
       <t>In TCP Slow Start, the congestion window might grow much larger than 
       during the rest of the transmission. Thus a sender could consider sending
       fewer than f credits, at the risk of being penalized by an audit.
       In any case the credits should at least cover the increase in sending rate.
       As the sending rate increases exponentially in Slow Start, doubling every RTT,
       a ConEx sender should at least cover half the number of packets in flight by credits. 
       Note that the number of losses or markings within one RTT does not only 
       depend on actions taken by the sender. In general, the behavior of the cross traffic and, 
       if Active Queue Management (AQM) is used, the respective parameterization influence
       how many packets get dropped or marked. 
       But if the AQM in use is not overly aggressive with ECN marking, sending half the
       flight size as credits
       should be sufficient for both congestion signaled by loss and congestion signaled by 
       ECN. <!--Thus the number of credits SHOULD equal at least half 
       the number of packets in flight in Slow Start.--> <!--Under the 
       assumption that the previously sent ConEx marks will stay valid 
       for the whole Slow Start phase,--> Marking every fourth packet will 
       provide the respective number of credits in Slow Start, as can be seen in 
       Figure <xref target="SS_credit" />. </t>
       
       <t>
       <figure 
         title="Credits in Slow Start (with an initial window of 3)"
         align="center" anchor="SS_credit">
      <artwork align="center"><![CDATA[
RTT1  |------XC------>|
      |------X------->|
      |------X------->|   credit=1  in_flight=3
      |               |
RTT2  |------X------->|
      |------XC------>|
      |------X------->|
      |------X------->|
      |------X------->|
      |------XC------>|   credit=3  in_flight=6
      |               |
RTT3  |------X------->|
      |------X------->|
      |------X------->|
      |------XC------>|
      |------X------->|
      |------X------->|
      |------X------->|
      |------XC------>|
      |------X------->|
      |------X------->|
      |------X------->|
      |------XC------>|   credit=6  in_flight=12
      |      .        |
      |      :        |
      ]]></artwork></figure></t>
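       <t>The numbers in <xref target="SS_credit" /> can be reproduced with a small simulation.
         This is a sketch under simplifying assumptions that are not part of the specification:
         credits and flight size are counted in packets, no credits are consumed, every packet
         is acknowledged, and the window exactly doubles each RTT. The function name
         slow_start_credits and its parameters are hypothetical.</t>
       <figure><artwork><![CDATA[
```python
def slow_start_credits(initial_window: int, rtts: int, every: int = 4):
    """Return one (cumulative credit, in-flight) pair per RTT, in packets."""
    rows = []
    sent = 0        # packets sent so far
    credit = 0      # packets sent with the C bit so far
    window = initial_window
    for _ in range(rtts):
        for _ in range(window):
            if sent % every == 0:   # packets 1, 5, 9, ... carry the C bit
                credit += 1
            sent += 1
        rows.append((credit, window))
        window *= 2                 # assumed doubling per RTT in Slow Start
    return rows


rows = slow_start_credits(initial_window=3, rtts=3)
for rtt, (credit, in_flight) in enumerate(rows, start=1):
    print(f"RTT{rtt}: credit={credit} in_flight={in_flight}")
```
]]></artwork></figure>
       <t>From the second RTT on, the accumulated credit equals half the flight size in this
         simulation; in the first RTT a single credit covers a window of three packets.</t>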
  
   <!--TODO: More general description to maintain 
	  always at least flight_size - flight_size_prev credits. -> Can the 
	  number of credts in the audit decrease?-->
  
  <!--<t>
    A ComEx sender needs to monitor the increase in the sending rate by calculating d as the number of packets in flight in the last RTT minus the number of packets in flight in the previous RTT.
    Instead of remembering the number of packets in flight, d can also most often be 
    derived from the congestion control algorithm. When using the <xref target="RFC5681"/>
    d should also be 1 (expect in Slow Start) as the congestion window is at maximum increased by one packet per RTT.
  </t>-->
  <t>
    It is possible that the audit loses state, e.g. due to rerouting or memory limitations. 
    The sender needs to detect this case and resend credits. Thus a ConEx sender should 
    reset the credit count c to zero if losses occur in two subsequent RTTs (assuming that the 
    sending rate was correctly reduced based on the received congestion signal).
    
    <!--<list style="letter">
      <t> if the number of losses is much larger than the increase in sending rate d.
      The increase in the sending rate d can be calculated as the number of packets in flight in the last RTT minus the number of packets in flight in the previous RTT.
      Instead of remembering the number of packets in flight, d can also most often be 
      derived from the congestion control algorithm. When using the <xref target="RFC5681"/>
      d should also be 1 (expect in Slow Start) as the congestion window is at maximum increased by one packet per RTT.</t>
      <t> if looses occur eventhough the level of ECN marking did not increase. Therefore,
      a ConEx sender need to monitor the number of ECN markings if a more accurate ECN feedback is used. Otherwise, a ConEx sender needs to reset c everytime a loss occurs in the same RTT than ECN markings.</t>
      </list> -->
  </t>
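  <t>One possible way to implement this heuristic is sketched below. The class
    AuditStateLossDetector and the per-RTT callback end_of_rtt() are hypothetical; the sketch
    assumes the sender can tell at the end of each RTT whether a loss was detected in it.
    Once c is zero, the flight size exceeds c and the sender sends new credits.</t>
  <figure><artwork><![CDATA[
```python
class AuditStateLossDetector:
    """Hypothetical helper: reset c when two subsequent RTTs both see loss."""

    def __init__(self, credit: int = 0) -> None:
        self.c = credit                    # sent-credit counter, in bytes
        self.loss_in_previous_rtt = False

    def end_of_rtt(self, loss_in_this_rtt: bool) -> bool:
        """Call once per RTT; return True if c was reset to zero."""
        reset = loss_in_this_rtt and self.loss_in_previous_rtt
        if reset:
            self.c = 0                     # assume the audit lost its state
        self.loss_in_previous_rtt = loss_in_this_rtt
        return reset


d = AuditStateLossDetector(credit=3000)
# Loss, no loss, loss, loss: only the last RTT follows a lossy RTT.
resets = [d.end_of_rtt(loss) for loss in (True, False, True, True)]
```
]]></artwork></figure>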
  <!--<t>TBD: When increasing the congestion window while in CA, will the
    one additional segment need further credits?
  </t>-->
  <!--<t>For TCP sender using the <xref target="RFC5681"/> congestion control algorithm, we recommend to 
    only send credit in Slow Start, as in Congestion Avoidance an increase of one segment per RTT
    should only cause a minor amount of congestion marks (usually at max one). If a more aggressive
    congestion control is used, a sufficient amount of credits need to be set.
       </t>-->
       <!--<t>TBD: Detailed discussion around 1/4th ConEx C marking during slow start; Any slow
         start (session start, idle restart, RTO) or specific slow starts. Spurious RTO 
         interaction?
       </t>-->
       
       <!--<t>If a ConEx sender detects an increasing number of losses even though 
         the sender reduced the sending rate, the sender SHOULD assume that 
         those losses are incorporated by an audit device and thus should send 
         further credits. Up to now its not clear if the credits stay valid as long as the connection 
         is established or if an expiration of the credits need to be assumed by the sender.
       </t>-->
       <!--<t>TBD: additional loss would reduce the sending rate, 
         until enough credits are available to sustain a sending rate. Adding
         one C bit for each loss recovery episode may be simpler?
       </t>-->
       <!--
     3.1.  On Beginning of a TCP session
     
     A conex sender should build conex credits, by sending the 1st and 3rd
     data segment with the conex C bit set.  For a detailed discussion as
     to the background for choosing these two data segments to build conex
     credits with the network, see [I-D.briscoe-tsvwg-re-ecn-tcp].
     
     In addition, a conex sender SHOULD maintain a gauge to account for
     the number of (data) bytes [packets], which need to be sent with the
     conex L bit set, and a similar gauge to account for the segments that
     still have to carry the conex E bit.  As multiple indications of lost
     segments or ECN marks can arrive simultaneously at the sender, a
     second counter should be maintained for both the conex L and E bits,
     to space the sending of these conex bits more evenly.  For
     implementation details, see Appendix A.
     
     3.5.  On restarting idle TCP Connections
     
     Conex signals are valid only for a limited amount of time.
     Furthermore, TCP does not currently account for lost ACKs or allow
     the use of ECN marking on control segments (e.g. pure ACKs, window
     probes, window updates, FIN or RST segments without data).  In a
     common TCP connection, data will flow only unidirectional for a
     certain period of time, while only TCP control packets are traversing
     the return half-connection.  Subsequently, the data direction may
     change.  If a TCP determines, that the duration between two sent data
     segments becomes too large, it will reduce it's congestion window
     (see Section 4.1 in [RFC5681]).  This MUST be accompanied in a conex
     sender by the building of new conex credits, by setting conex C bit
     in the 1st and 3rd data segment sent after the restart.  Furthermore,
     the gauges, counters and flags maintained by the sender should be
     reinitialized to zero, as any previous value will be invalid at that
     point in time.
     -->
     </section>
     <!--<section title="Credit Bits during Congestion Avoidance">
     </section>-->
   </section>
     
     
     <section anchor="sec43" title="Loss of ConEx information">
       <!--<t>The audit can have wrong information if e.g. ConEx marks got lost (or a wrong number
         of ConEx marking has been estimated by the sender due to a lack of feedback
         information). In this case the audit might penalize a sender wrongly. 
         The ConEx sender should detect this case and send further credits 
         which should solve the situation (see <xref target="credits" />).
         </t>-->
       <t>
       Packets that carry a ConEx marking can also get lost. 
       A ConEx sender must remember which packets were marked with the L, the E or the C bit.
       If one of these packets is detected to be lost, the sender should increase the respective
       gauge, LEG or CEG, by the number of lost payload bytes.
       </t>
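       <t>A minimal sketch of this bookkeeping follows. The class MarkedPacketLog, its methods
         and the use of a sequence number as key are assumptions made for illustration. The
         sketch only restores the L and E gauges; how a lost packet carrying the C bit is
         compensated is not defined by the text above and is deliberately left out. The regular
         loss accounting for the lost packet itself is also outside this sketch.</t>
       <figure><artwork><![CDATA[
```python
class MarkedPacketLog:
    """Hypothetical record of which sent packets carried L, E or C."""

    def __init__(self) -> None:
        self.marks = {}   # sequence number -> (marked bits, payload bytes)
        self.leg = 0      # loss exposure gauge, in payload bytes
        self.ceg = 0      # congestion exposure gauge, in payload bytes

    def on_send(self, seq: int, bits: set, payload_len: int) -> None:
        """Remember the packet only if it carries L, E or C."""
        marked = bits & {"L", "E", "C"}
        if marked:
            self.marks[seq] = (marked, payload_len)

    def on_ack(self, seq: int) -> None:
        """The marks arrived; nothing needs to be re-signaled."""
        self.marks.pop(seq, None)

    def on_loss(self, seq: int) -> None:
        """Re-arm the gauge for each L or E mark lost with the packet."""
        marked, payload_len = self.marks.pop(seq, (set(), 0))
        if "L" in marked:
            self.leg += payload_len
        if "E" in marked:
            self.ceg += payload_len


log = MarkedPacketLog()
log.on_send(1, {"X", "L"}, 1000)
log.on_send(1001, {"X"}, 1000)            # unmarked: not recorded
log.on_send(2001, {"X", "E", "C"}, 500)
log.on_send(2501, {"X", "L"}, 1000)
log.on_ack(2501)                          # delivered: no re-signaling
for seq in (1, 1001, 2001):
    log.on_loss(seq)
```
]]></artwork></figure>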
     </section>


   <section title="Timeliness of the ConEx Signals">
       <t>ConEx signals can only be evaluated by a network
       node with a time delay of about one RTT after the congestion occurred.
       <!--Therefore, it is not absolutely necessary to immediately signal ConEx
       bits when they become known (e.g.  L and E bits), but-->
       To avoid further delays, a ConEx sender SHOULD send the 
       ConEx signaling with the next available packet. In cases where it 
       is preferable to slightly delay the ConEx signal, 
       the sender MUST NOT delay the ConEx signal by more than one RTT.
     </t>
     <t>Multiple ConEx bits may become available for signaling at the same
       time, for example when the sender receives an ACK that indicates both
       that at least one segment has been lost and that one or
       more ECN marks were received. This may happen
       during excessive congestion, where the queues overflow even though ECN is used: all
       forwarded packets are marked, while others still have to be dropped.
       It may also happen when ACKs are lost, so that a 
       subsequent ACK carries summary information not previously available 
       to the sender. As a ConEx-capable packet can carry different ConEx marks at the same time,
       this information does not need to be distributed over several packets and thus
       can be sent without further delay.
     </t>
     <!--<t>It may be preferrable to signal only one ConEx bit per segment, and to
       space out the signaling of multiple bits across a (short) period of
       time - or number of segments.  However, that delay should not be
       excessive, and ideally also shorter than the RTT of the affected TCP
       session.  The heuristic sketched in Appendix A uses a maximum delay
       of 10 packets or 1/4 of the congestion window, whatever is smaller
       to minimize delay.
     </t>-->
     <!--<t>It is important to remember, that ConEx bits and TCP retransmissions
       do not interact with each other.  However, a retransmission should be
       accompanied by one ConEx L bit in close proximity nevertheless. This does not mean, 
       that TCP retransmissions may never contain ConEx marks. In 
       a typical scenario using SACK, the first retransmission would not carry
       a ConEx L bit, while subsequent retransmissions in the same recovery
       episode, would be marked with the ConEx L bit.
       Spreading the ConEx bits over a small number of segments increases
       the likelihood that most devices along the path will see some
       ConEx marks even during heavy congestion.
     </t>-->
     </section>
   
   <section anchor="Acknowledgements" title="Acknowledgements">
   <t>The authors would like to thank Bob Briscoe, who contributed initial ideas and
     valuable feedback. Moreover, thanks to Jana Iyengar, who provided valuable feedback.</t>
   </section>

   <!-- Possibly a 'Contributors' section ... -->

   <section anchor="IANA" title="IANA Considerations">
     <t>This document does not have any requests to IANA.</t>
   </section>

   <section anchor="Security" title="Security Considerations">
     <t>With some of the advanced ECN compatibility modes it is possible to miss congestion notifications. In that case the sender will not decrease its sending rate. If the congestion is persistent, the likelihood of receiving a congestion notification increases. In the worst case the sender will still react correctly to loss, which prevents a congestion collapse.</t>
   </section>
 </middle>

 <!--  *****BACK MATTER ***** -->

 <back>

   <references title="Normative References">
     <!--?rfc include="http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml"?-->
     &RFC2119;
     &RFC3168;
     &RFC2018;
     &RFC5681;
     
     <reference anchor="draft-ietf-conex-destopt" >
	     <front>
		     <title>IPv6 Destination Option for ConEx</title>
		     
		     <author initials="S" surname="Krishnan">
			     <organization></organization></author>
		     <author initials="M" surname="Kuehlewind">
			     <organization></organization></author>
		     <author initials="C" surname="Ucendo">
			     <organization></organization></author>
		     <date  month="March" year="2013"/>
	     </front>
	     <seriesInfo name="Internet-Draft" value="draft-ietf-conex-destopt-04"/>
    </reference>
    
    <reference anchor="draft-ietf-conex-abstract-mech" >
	    <front>
		    <title>Congestion Exposure (ConEx) Concepts and Abstract Mechanism</title>
		    
		    <author initials="M" surname="Mathis">
			    <organization></organization></author>
		    <author initials="B" surname="Briscoe">
			    <organization></organization></author>
		    <date  month="October" year="2012"/>
	    </front>
	    <seriesInfo name="Internet-Draft" value="draft-ietf-conex-abstract-mech-06"/>
    </reference>
   </references>

<references title="Informative References">
	
    &RFC5562;

     &RFC3522;
  
     &RFC3708;
  
     &RFC4015;

     &RFC5682;
     
     &RFC6789;  

     <?rfc include="reference.I-D.briscoe-tsvwg-re-ecn-tcp.xml"?>

     <reference anchor="draft-kuehlewind-tcpm-accurate-ecn" >
      <front>
        <title>More Accurate ECN Feedback in TCP</title>
        
        <author initials="M" surname="Kuehlewind">
          <organization></organization></author>
        <author initials="R" surname="Scheffenegger">
          <organization></organization></author>
        <date  month="Jun" year="2013"/>
      </front>
      <seriesInfo name="Internet-Draft" value="draft-kuehlewind-tcpm-accurate-ecn-02"/>
    </reference>
    

    <reference anchor="draft-briscoe-tsvwg-byte-pkt-mark" >
	    <front>
		    <title>Byte and Packet Congestion Notification</title>
		    
		    <author initials="B" surname="Briscoe">
			    <organization></organization></author>
		    <author initials="J" surname="Manner">
			    <organization></organization></author>
		    <date  month="May" year="2013"/>
	    </front>
	    <seriesInfo name="Internet-Draft" value="draft-briscoe-tsvwg-byte-pkt-mark-010"/>
    </reference>
     
    <reference anchor="DCTCP">
      <front>
        <title>DCTCP: Efficient Packet Transport for the Commoditized Data Center</title>
        
        <author initials="M" surname="Alizadeh">
          <organization></organization></author>
        <author initials="A" surname="Greenberg">
          <organization></organization></author>
        <author initials="D" surname="Maltz">
          <organization></organization></author>
        <author initials="J" surname="Padhye">
          <organization></organization></author>
        <author initials="P" surname="Patel">
          <organization></organization></author>
        <author initials="B" surname="Prabhakar">
          <organization></organization></author>
        <author initials="S" surname="Sengupta">
          <organization></organization></author>
        <author initials="M" surname="Sridharan">
          <organization></organization></author>
        <date month="Jan" year="2010"/>
      </front>
    </reference>
    
    
    </references>

<!--    <section anchor="ApxA" title="Spacing conex marks evenly">
      <t>Under certain circumstances, very high marking conex marking rates may 
        need to be signaled. However, as conex maintains running averages, it
        may be beneficial to send these marks more evenly spaced, than in bursts
        of consecutive segments, all with the conex bits set.</t>
      <t>If only a single conex mark needs to be sent, it should be sent immediately
        to maintain optimal timeliness. Any subsequent conex marks may be delayed 
        slightly, to disentangle retransmissions of the transport protocol from
        packets carrying conex marks.</t>
      <t>The following algorithm will provide such a method. When very high marking
        rates are required, it will automatically set a conex mark with every sent
        packet.</t>
      <t>When an ACK is received, the sender determines the number of bytes which
        need to be sent with a conex mark, and the next segment to be sent should 
        carry at least part of the conex signal:</t>
  <figure><preamble>CEF .. congestion.expirienced.flag<vspace blankLines="0" />
    CEG .. congestion.expirienced.gauge<vspace blankLines="0" />
    CEC .. congestion.expirienced.counter<vspace blankLines="0" />
    LEF .. loss.expirienced.flag<vspace blankLines="0" />
    LEG .. loss.expirienced.gauge<vspace blankLines="0" />
LEC .. loss.expirienced.counter</preamble><artwork><![CDATA[
if (marks.received > 0) {
   CEF = 1;                # for the immediate mark
   CEG += bytes.received.with.CE; # for the delayed mark
}
if (lost segment retransmitted) {
   LEF = 1;
   LEG += IP.size.of.retransmitted.segment;              
}
]]></artwork></figure>
      <t>When sending a segment, the following algorithm is run to determine
        if the segment should carry a conex mark. Note that the counter is
        initialized to at most 10*(MTU of the path). This spaces two consecutive received
        marks at most 10 full sized data segments apart.</t>
      <t>For each of the L (loss) and E (ECN) conex bits, a similar algorithm 
        needs to run.</t>
<figure><artwork><![CDATA[
if ((LEF == 0) and (LEG > 0)) {
  if ((LEC <= 0) or (LEG >= LEC)) {
    LEG -= IP.size.of.segment.to.be.sent;
    LEC = min(10*(SMSS+IP.header+TCP.header), Flightsize/4); 
    LEF = 1; 
  } else {
    LEC -= LEG;
  }
} 
if ((CEF == 0) and (CEG > 0)) {
  if ((CEC <= 0) or (CEG >= CEC)) {
    CEG -= IP.size.of.segment.to.be.sent;
    CEC = min(10*(SMSS+IP.header+TCP.header), Flightsize/4); 
    CEF = 1; 
  } else {
    CEC -= CEG;
  }
} 

if (LEF == 1) {
  LEF = 0;
  SendSegment(conex.L.bit);
} else {
  if (CEF == 1) {
    CEF == 0;
    SendSegment(conex.E.bit);
  } else 
    SendSegment;
}
if (CEG < 0) {
  CEG += 1; # slowly reduce credit 
}
if (LEG < 0) {
  LEG += 1;
}
}
]]></artwork></figure>

    </section>-->

  <section title="Revision history">
      <t>RFC Editor: This section is to be removed before RFC publication.</t>
      <t>00 ... initial draft, early submission to meet deadline.
      </t>
      <t>01 ... refined draft, updated LEG "drain" from per-packet to RTT-based.
      </t>
      <t>02 ... added <xref target="sec43"/> and expanded discussion about ECN interaction.
      </t>
      <t>03 ... expanded the discussion around credit bits.
      </t>
      <t>04 ... review comments of Jana addressed. (Change in full compliance mode.)
      </t>
      <t>05 ... changes on Loss Detection without SACK, support of classic ECN and credit handling.
      </t>
      
      <t><vspace blankLines='100' /></t>
    
    </section> 
  

 </back>
</rfc>
