<?xml version="1.0" encoding="US-ASCII"?>
<!-- This template is for creating an Internet Draft using xml2rfc,
    which is available here: http://xml.resource.org. -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!-- One method to get references from the online citation libraries.
    There has to be one entity for each item to be referenced. 
    An alternate method (rfc include) is described in the references. -->
<!ENTITY RFC3168 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3168.xml">
<!ENTITY RFC2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY RFC2018 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2018.xml">
<!ENTITY RFC3522 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3522.xml">
<!ENTITY RFC3708 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3708.xml">
<!ENTITY RFC4015 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.4015.xml">
<!ENTITY RFC5681 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5681.xml">
<!ENTITY RFC5682 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5682.xml">
<!ENTITY RFC6789 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.6789.xml">
<!ENTITY RFC7141 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.7141.xml">
]>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<?rfc strict="yes" ?>
<?rfc toc="yes"?>
<?rfc tocdepth="4"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes" ?>
<?rfc compact="yes" ?>
<?rfc subcompact="no" ?>
<?rfc comments="no"?>
<rfc category="exp" docName="draft-ietf-conex-tcp-modifications-10"
     ipr="trust200902">

  <front>

    <title abbrev="TCP Modifications for ConEx">TCP modifications for
    Congestion Exposure</title>

    <author fullname="Mirja Kuehlewind" initials="M." role="editor"
            surname="Kuehlewind">
      <organization>ETH Zurich</organization>

      <address>
        <postal>
          <street/>

          <code/>

          <city/>

          <country>Switzerland</country>
        </postal>

        <email>mirja.kuehlewind@tik.ee.ethz.ch</email>
      </address>
    </author>

    <author fullname="Richard Scheffenegger" initials="R."
            surname="Scheffenegger">
      <organization>NetApp, Inc.</organization>

      <address>
        <postal>
          <street>Am Euro Platz 2</street>

          <code>1120</code>

          <city>Vienna</city>

          <region/>

          <country>Austria</country>
        </postal>

        <phone></phone>

        <email>rs.ietf@gmx.at</email>
      </address>
    </author>

    <date year="2015"/>

    <area>Transport</area>

    <workgroup>Congestion Exposure (ConEx)</workgroup>

    <keyword>Internet-Draft</keyword>

    <keyword>I-D</keyword>

    <abstract>
      <t>Congestion Exposure (ConEx) is a mechanism by which senders inform
      the network about expected congestion based on congestion feedback from
      previous packets in the same flow. This document describes the necessary
      modifications to use ConEx with the Transmission Control Protocol
      (TCP).</t>
    </abstract>
  </front>

  <middle>
    <section title="Introduction">
      <t>Congestion Exposure (ConEx) is a mechanism by which senders inform
      the network about expected congestion based on congestion feedback from
      previous packets in the same flow. ConEx concepts and use cases are
      further explained in <xref target="RFC6789"/>. The abstract ConEx
      mechanism is explained in <xref
      target="draft-ietf-conex-abstract-mech"/>. This document describes the
      necessary modifications to use ConEx with the Transmission Control
      Protocol (TCP).</t>

      <t>The markings for ConEx signaling are defined in the ConEx Destination
      Option (CDO) for IPv6 <xref target="draft-ietf-conex-destopt"/>.
      Specifically, the use of four flags is defined: X (ConEx-capable), L
      (loss experienced), E (ECN experienced) and C (credit).</t>

      <t>ConEx signaling is based on loss or Explicit Congestion Notification
      (ECN) marks <xref target="RFC3168"/> as congestion indications. The
      sender collects this congestion information based on existing TCP
      feedback mechanisms from the receiver to the sender. No changes are
      needed at the receiver to implement ConEx signaling. Therefore no
      additional negotiation is needed to implement and use ConEx at the
      sender. This document specifies the sender's actions that are needed to
      provide meaningful ConEx information to the network.</t>

      <t>Section <xref format="counter" target="mods"/> provides an overview
      of the modifications needed for TCP senders to implement ConEx. First,
      congestion information has to be extracted from TCP's loss or ECN
      feedback as described in section <xref format="counter"
      target="account"/>. Section <xref format="counter" target="bits"/>
      details how to set the CDO marking based on this congestion information.
      <xref target="sec43"/> discusses loss of packets carrying ConEx
      information. Section <xref format="counter" target="timeliness"/> 
      discusses timeliness of the ConEx feedback signal, given
      congestion is a temporary state.</t>

      <t>This document describes congestion accounting for TCP with and
      without the Selective Acknowledgment (SACK) extension <xref
      target="RFC2018"/> (in section <xref format="counter" target="loss"/>).
      However, ConEx benefits from the more accurate information that SACK
      provides about the number of bytes dropped in the network. It is
      therefore preferable to use the SACK extension when
      using TCP with ConEx. The detailed mechanism to set the L flag in
      response to a loss-based congestion feedback signal is given in section
      <xref format="counter" target="settingBits"/>.</t>

      <t>While loss has to be minimized, ECN can provide more fine-grained
      feedback information. ConEx-based traffic measurement or management
      mechanisms could benefit from this. Unfortunately, the current ECN
      feedback mechanism does not reflect multiple congestion markings if they
      occur within the same Round-Trip Time (RTT). A more accurate feedback
      extension to ECN (AccECN) is proposed in a separate document <xref
      target="draft-kuehlewind-tcpm-accurate-ecn"/>, as this is also useful
      for other mechanisms. <!-- as e.g. <xref target="DCTCP"/> or whenever the congestion 
      control reaction should be proportional to the experienced congestion. 
      ConEx also works with classic ECN but it is less accurate when multiple 
      congestion markings occur within on RTT.--></t>

      <t>Congestion accounting for both classic ECN feedback and AccECN
      feedback is explained in detail in section <xref format="counter"
      target="ECN"/>. Setting the E flag in response to ECN-based congestion
      feedback is again detailed in section <xref format="counter"
      target="settingBits"/>.</t>

      <section title="Requirements Language">
        <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
        "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
        document are to be interpreted as described in <xref
        target="RFC2119"/>.</t>
      </section>
    </section>

    <section anchor="mods" title="Sender-side Modifications">
      <t>This section gives an overview of actions that need to be taken by a
      TCP sender modified to use ConEx signaling.</t>

      <t>In the TCP handshake, a ConEx sender MUST negotiate for SACK and ECN,
      preferably with AccECN feedback. Therefore, a ConEx sender MUST also
      implement SACK and ECN. Depending on the capability of the receiver, the
      following operation modes exist:
      <list style="symbols">
          <t>SACK-accECN-ConEx (SACK and accurate ECN feedback)</t>

	  <t>SACK-ECN-ConEx (SACK and 'classic' instead of accurate ECN)</t>

          <t>accECN-ConEx (no SACK but accurate ECN feedback)</t>

          <t>ECN-ConEx (no SACK and no accurate ECN feedback but 'classic'
          ECN)</t>

          <t>SACK-ConEx (SACK but no ECN at all)</t>

          <t>Basic-ConEx (neither SACK nor ECN)</t>
        </list>
      </t>

      <!--<texttable anchor="conextcpmods_tab_modes" title="ConEx modes.">
        <ttcol>SACK</ttcol>

        <ttcol>ECN</ttcol>

        <c>S</c>

        <c>A</c>

        <c>S</c>

        <c>C</c>

        <c>S</c>

        <c>-</c>

        <c>-</c>

        <c>A</c>

        <c>-</c>

        <c>C</c>

        <c>-</c>

        <c>-</c>

        <postamble>S: SACK enabled; A: AccECN enabled; C: Classic ECN <xref
        target="RFC3168"/> enabled</postamble>
      </texttable>-->

      <t>A ConEx sender MUST expose all congestion information to the network
      according to the congestion information received by ECN or based on loss
      information provided by the TCP feedback loop. A TCP sender SHOULD count
      congestion byte-wise (rather than packet-wise; see next paragraph).
      After any congestion notification, a sender MUST mark <!--the respective number of payload bytes in -->
      subsequent packets with the appropriate ConEx flag in the IP header.
      Furthermore, a ConEx sender must send enough credit to cover all
      experienced congestion for the connection so far, as well as the risk of
      congestion for the current transmission (see <xref
      target="credits"/>).</t>

      <t>With SACK the number of lost payload bytes is known, but not the
      number of packets carrying these bytes. With classic ECN only an
      indication is given that a marking occurred but not the exact number of
      payload bytes nor packets. As network congestion is usually
      byte-congestion <xref target="RFC7141"/>, the byte-size of a packet
      marked with a CDO flag is defined to represent that number of bytes of
      congestion signaling <xref target="draft-ietf-conex-destopt"/>.
      Therefore the exact number of bytes should be taken into account, if
      available, to make the ConEx signal as exact as possible.</t>

      <t>Detailed mechanisms for congestion counting in each operation mode
      are described in the next section.</t>
    </section>

    <section anchor="account" title="Counting Congestion">
      <t>A ConEx TCP sender maintains two counters: one that counts congestion
      based on the information retrieved by loss detection, and a second that
      accounts for ECN-based congestion feedback. These counters hold the
      number of outstanding bytes that should be ConEx marked with
      the L flag or the E flag, respectively, in subsequent packets.</t>

      <t>The outstanding bytes for congestion indications based on loss are
      maintained in the loss exposure gauge (LEG), as explained in <xref
      target="loss"/>.</t>

      <t>The outstanding bytes counted based on ECN feedback information are
      maintained in the congestion exposure gauge (CEG), as explained in <xref
      target="ECN"/>.</t>

      <t>When the sender sends a ConEx-capable packet with the E or L flag set,
      it reduces the respective counter by the byte-size of the packet. This
      is explained for both counters in <xref target="settingBits"/>. </t>
      
      <t>Note that all bytes of an IP packet must be counted in the LEG or CEG to capture the
          right number of bytes that should be marked. Therefore the
      sender SHOULD take the payload and headers into account, up to and
      including the IP header. However, in TCP the information regarding how large the headers
      of a lost or marked packet were is usually not available, as only payload data will be acknowledged.
      <!--Therefore, as well as the TCP payload bytes, an
      appropriate number of header bytes SHOULD be added to the gauge for each
      packet of congestion feedback. And the sender SHOULD subtract header
      bytes from the gauge for each marked packet sent.--></t>

      <t>If equal-sized packets, or at least equally distributed packet sizes,
      can be assumed, the sender MAY only add and subtract TCP payload bytes.
      In this case there should be about the same number of ConEx marked
      packets as the original packets that were causing the congestion. Thus
      both contain about the same number of header bytes so they will cancel
      out. This case is assumed for simplicity in the following sections.</t>

      <t>Otherwise, if a sender sends different sized packets (with unequally
      distributed packet sizes), the sender needs to memorize or estimate the
      number of lost or ECN-marked packets. If the sender has sufficient memory available,
      the most accurate way to reconstruct the number of lost or marked packets is
      to remember the sequence number of all sent but not acknowledged packets.
      In this case a sender is able to reconstruct the number of packets and
      thus the header bytes that were sent during the last RTT.
      Otherwise, e.g. if not enough memory is available, the sender should estimate the packet size,
      for instance from a known packet size distribution pattern or by using
      the minimum packet size seen in the last RTT.</t>
      
      <t>If the number of newly sent-out packets with the ConEx L or E flag set is
      smaller (or larger) than this estimated number of lost/ECN-marked
      packets, the additional header bytes should be added to (or can be
      subtracted from) the respective gauge.</t>

      <section anchor="loss" title="Loss Detection">
        
          <t>This section applies whether or not SACK support is available.
          The following subsection (<xref target="withoutSACK"/>) handles the case when SACK is not
          available.</t>

          <t>A TCP sender detects losses and subsequently retransmits the lost data. 
          Therefore, a ConEx sender can simply set the ConEx L flag on all
          retransmissions in order to at least cover the amount of bytes lost.
          If this approach is taken, no LEG is needed.</t>	            

          <t>However, any retransmission may be spurious. In this case more bytes 
          have been marked than necessary. To compensate for this effect, a ConEx sender
          can maintain a local signed counter, the loss exposure gauge (LEG), that indicates the number of
          outstanding bytes to be sent with the ConEx L flag and that can also become negative.</t>
          
          <t>Using the LEG, when a TCP
          sender decides that a data segment needs to be retransmitted, it
          will increase LEG by the size of the TCP payload bytes in the
          retransmission (assuming equal sized segments such that the
          retransmitted packet will have the same number of header bytes as
              the original ones):</t>
          
          <t>For each retransmission:<vspace blankLines="1"/>
              LEG += payload</t>
              
          <t>Note that the way the LEG is reduced when ConEx L markings are set is described
              in section <xref target="bits"/>.</t>
          
          <t>Further to accommodate spurious retransmissions, a ConEx
          sender SHOULD make use of heuristics to detect such spurious
          retransmissions (e.g. F-RTO <xref target="RFC5682"/>, DSACK <xref
          target="RFC3708"/>, and Eifel <xref target="RFC3522"/>, <xref
          target="RFC4015"/>) if already available in a given implementation.
          If no mechanism for detecting spurious retransmissions is available,
          the ConEx sender MAY choose to implement one of the mechanisms stated above. However,
          given the inaccuracy that ConEx may have anyway and the
          timeliness of ConEx information, a ConEx sender MAY also choose not to compensate for
          spurious retransmissions. In this case, if spurious retransmissions occur, the
          ConEx sender has simply sent too many ConEx signals, which would, e.g., decrease the congestion allowance
          in a ConEx policer unnecessarily.</t>
          
          <t>If a heuristic method is used to detect spurious retransmission and has determined that a
          certain number of packets were retransmitted erroneously, the ConEx
          sender subtracts the payload size of these TCP packets from
          LEG.</t>
          
          <t>If a spurious retransmission is detected:<vspace blankLines="1"/>
              LEG -= payload</t>

          <t>Note that LEG can become negative if too many L markings have already been sent.
              This case is further discussed in section <xref target="timeliness"/>.</t>
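
          <t>The LEG bookkeeping described above can be sketched as follows.
          This is only an illustration under the equal-sized-segment assumption
          made in this section; the class and method names are hypothetical and
          not part of any TCP stack.</t>

          <figure>
            <artwork><![CDATA[
```python
class LossExposureGauge:
    """Signed byte counter of loss still to be exposed with the L flag.

    Hypothetical helper for illustration; only TCP payload bytes are
    counted (equal-sized segments assumed, so headers cancel out).
    """

    def __init__(self) -> None:
        self.leg = 0  # may become negative

    def on_retransmission(self, payload: int) -> None:
        # For each retransmission: LEG += payload
        self.leg += payload

    def on_spurious_retransmission(self, payload: int) -> None:
        # If a spurious retransmission is detected: LEG -= payload
        self.leg -= payload

    def on_l_marked_packet_sent(self, payload: int) -> None:
        # Each packet sent with the L flag pays off its payload size.
        self.leg -= payload


gauge = LossExposureGauge()
gauge.on_retransmission(1460)           # segment declared lost
gauge.on_l_marked_packet_sent(1460)     # loss exposed to the network
gauge.on_spurious_retransmission(1460)  # retransmission was unnecessary
print(gauge.leg)
```
]]></artwork>
          </figure>

          <t>In this sketch the gauge ends below zero because the loss was
          exposed before the retransmission turned out to be spurious.</t>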


        <section title="Without SACK Support" anchor="withoutSACK">
          <t>If multiple losses occur within one RTT and SACK is not used, it
          may take several RTTs until all lost data is retransmitted. With the
          scheme described above, the ConEx information will be delayed
          considerably, but timeliness is important for ConEx. For ConEx,
          it is important to know how much data was lost; it is
          not important to know which data was lost. During the first RTT after the initial loss
          detection, the amount of received data and thus also the amount of
          lost data can be estimated based on the number of received ACKs.</t>
          
          <t>Therefore, a ConEx sender can use the following algorithm
          to estimate the number of lost bytes with an additional delay of
          one RTT using an additional Loss Estimation Counter (LEC):</t>

          <figure>
            <artwork><![CDATA[   flight_bytes:      current flight size in bytes
   retransmit_bytes:  payload size of the retransmission

   At the first retransmission in a congestion event LEC is set:

      LEC = flight_bytes - 3*SMSS 

      (At this point in time in the transmission, in the worst case,
      all packets in flight minus three that triggered the dupACKs
      could have been lost.)

]]></artwork>
          </figure>
          <figure>
<artwork><![CDATA[   Then during the first RTT of the congestion event:

      For each retransmission:
         LEG += retransmit_bytes
         LEC -= retransmit_bytes
   
      For each ACK:
         LEC -= SMSS


   After one RTT:

      LEG += LEC

      (The LEC now estimates the number of outstanding bytes  
      that should be ConEx L marked.)


   After the first RTT for each following retransmission:

      if (LEC > 0): LEC -= retransmit_bytes
      else if (LEC==0): LEG += retransmit_bytes

      if (LEC < 0): LEG += -LEC

      (The LEG is not increased for those bytes that were 
      already counted.)
]]></artwork>
          </figure>
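
          <t>The pseudocode above can be sketched in Python as follows. The
          class and method names are hypothetical, and the caller is assumed
          to signal the end of the first RTT. One further assumption goes
          beyond the pseudocode: after a negative LEC has been folded into the
          LEG, this sketch resets LEC to zero so that the same overshoot is
          not added again on later retransmissions.</t>

          <figure>
            <artwork><![CDATA[
```python
class LossEstimator:
    """Loss estimation without SACK (illustrative sketch only)."""

    def __init__(self, smss: int) -> None:
        self.smss = smss
        self.leg = 0             # loss exposure gauge, in bytes
        self.lec = 0             # loss estimation counter, in bytes
        self.in_first_rtt = False

    def start_congestion_event(self, flight_bytes: int) -> None:
        # Worst case: everything in flight was lost except the three
        # segments that triggered the duplicate ACKs.
        self.lec = flight_bytes - 3 * self.smss
        self.in_first_rtt = True

    def on_ack(self) -> None:
        # During the first RTT every ACK proves one segment arrived.
        if self.in_first_rtt:
            self.lec -= self.smss

    def end_first_rtt(self) -> None:
        # After one RTT: LEG += LEC
        self.leg += self.lec
        self.in_first_rtt = False

    def on_retransmission(self, retransmit_bytes: int) -> None:
        if self.in_first_rtt:
            self.leg += retransmit_bytes
            self.lec -= retransmit_bytes
            return
        if self.lec > 0:
            self.lec -= retransmit_bytes   # bytes were already counted
        elif self.lec == 0:
            self.leg += retransmit_bytes
        if self.lec < 0:
            self.leg += -self.lec
            self.lec = 0  # assumption, not stated in the pseudocode


# 10 segments of 1000 bytes in flight, two of them lost.
est = LossEstimator(smss=1000)
est.start_congestion_event(flight_bytes=10000)
est.on_retransmission(1000)   # first retransmission
for _ in range(5):            # remaining ACKs of the first RTT
    est.on_ack()
est.end_first_rtt()
est.on_retransmission(1000)   # second loss, already covered by LEC
print(est.leg, est.lec)
```
]]></artwork>
          </figure>

          <t>In this example the LEG already holds the payload of both lost
          segments after one RTT, so the later retransmission does not
          increase it again.</t>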
        </section>
      </section>

      <section anchor="ECN" title="ECN">
        <t>ECN <xref target="RFC3168"/> is an IP/TCP mechanism that allows
        network nodes to mark packets with the Congestion Experienced (CE)
        mark instead of dropping them when congestion occurs.</t>

        <t>A receiver might support 'classic' ECN, the more accurate ECN
        feedback scheme (AccECN), or neither. In the case that ECN is not
        supported for a connection, of course, no ECN marks will occur; thus
        the sender will never set the E flag. Otherwise, a ConEx sender needs to
        maintain a signed counter, the congestion exposure gauge (CEG), for
        the number of outstanding bytes that have to be ConEx marked with the
        E flag.</t>

        <t>The CEG is increased when ECN information is received from an
        ECN-capable receiver supporting the 'classic' ECN scheme or the
        accurate ECN feedback scheme. When the ConEx sender receives an ACK
        indicating one or more segments were received with a CE mark, CEG is
        increased by the appropriate number of bytes as described further
        below.</t>

        <t>Unfortunately in case of duplicate acknowledgements the number of
        newly acknowledged bytes will be zero even though (CE marked) data has
        been received. Therefore, we increase the CEG by DeliveredData, as
        defined below:</t>

        <t>DeliveredData = acked_bytes + SACK_diff + (is_dup)*1SMSS -
        (is_after_dup)*num_dup*1SMSS</t>

        <t>DeliveredData covers the number of bytes that have been newly
        delivered to the receiver. Therefore, on each arrival of an ACK,
        DeliveredData will be increased by the newly acknowledged bytes
        (acked_bytes) as indicated by the current ACK, relative to all past
        ACKs. The formula depends on whether SACK is available: if SACK is not
        available, SACK_diff is always zero, whereas if SACK information is
        available, is_dup and is_after_dup are always zero.</t>

        <t>With SACK, DeliveredData is increased by the number of bytes provided by 
        (new) SACK information (SACK_diff). Note that if fewer unacknowledged bytes are
        announced in the new SACK information than in the previous ACK, 
        SACK_diff can be negative. In this case, data is newly acknowledged 
        (in acked_bytes), that has previously already been accumulated into 
        DeliveredData based on SACK information.</t>

        <t>Otherwise, without SACK, DeliveredData is increased by 1 SMSS
        on duplicate acknowledgements because duplicate acknowledgements do not
        acknowledge any new data (and acked_bytes will be zero). For the
        subsequent partial or full ACK, acked_bytes covers all newly acknowledged
        bytes, including those already accounted for with the receipt of any duplicate
        acknowledgement. Therefore DeliveredData
        is reduced by one SMSS for each preceding duplicate ACK. Consequently,
        is_dup is one if the current ACK is a duplicated ACK without SACK,
        and zero otherwise. is_after_dup is only one for the next full or
        partial ACK after a number of duplicated ACKs without SACK and
        num_dup counts the number of duplicated ACKs in a row 
        (which usually is 3 or more). </t>
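
        <t>As an illustration, the DeliveredData formula can be written as a
        small function. The function name and the example values (an SMSS of
        1000 bytes, three duplicate ACKs) are assumptions chosen for this
        sketch only.</t>

        <figure>
          <artwork><![CDATA[
```python
def delivered_data(acked_bytes: int, smss: int, sack_diff: int = 0,
                   is_dup: bool = False, is_after_dup: bool = False,
                   num_dup: int = 0) -> int:
    """DeliveredData for one incoming ACK, following the formula above.

    With SACK, pass sack_diff and leave the dup-related arguments at
    their defaults; without SACK, leave sack_diff at zero.
    """
    return (acked_bytes + sack_diff
            + int(is_dup) * smss
            - int(is_after_dup) * num_dup * smss)


SMSS = 1000

# Without SACK: three duplicate ACKs, then a full ACK covering the
# retransmitted segment plus the three segments behind it.
dup_acks = [delivered_data(0, SMSS, is_dup=True) for _ in range(3)]
full_ack = delivered_data(4 * SMSS, SMSS, is_after_dup=True, num_dup=3)
total = sum(dup_acks) + full_ack
print(dup_acks, full_ack, total)
```
]]></artwork>
        </figure>

        <t>Summed over the four ACKs, the sketch accounts for exactly the
        4000 bytes that were delivered, without counting any byte twice.</t>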

        <t>With classic ECN, one congestion marked packet causes
        continuous congestion feedback for a whole round trip, thus hiding the
        arrival of any further congestion marked packets during that round
        trip. A more accurate ECN feedback scheme (AccECN) is needed
        to ensure that feedback properly reflects the extent of congestion
        marking. The two cases, with and without a receiver capable of AccECN,
        are discussed in the following sections.</t>

        <!--<t>TBD: Discussion to set ECN in which packets. Initially apply RFC5562 rules
         ([SYN,ACK] and data segments only), as security implications of ECN on
         control packets ([SYN], pure [ACK], window probe, window update, ...) is an open 
         research question. However, running bidirectional ECN on all TCP segments 
         including TCP control packets, may allow for more timely and accurate ConEx 
         signals. Also, ConEx provides a framework to possibly address some of these 
         security risks.</t>-->

        <section title="Accurate ECN Feedback">
          <t>With a more accurate ECN feedback scheme (AccECN) that is supported by the receiver,
          either the number of marked packets or the number of marked bytes will be fed back from the
          receiver to the sender and is therefore known at the sender side. In the latter case, the CEG can directly be increased by the
          number of marked bytes. Otherwise, if D is assumed to be the number
          of marks, the gauge (CEG) will be conservatively increased by one
          SMSS for each marking or, at most, by the number of newly acknowledged bytes:</t>

          <t>CEG += min(SMSS*D, DeliveredData)</t>
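
          <t>A minimal sketch of this update is given below. The function and
          parameter names are hypothetical; how the marked packet or byte
          count is obtained from the AccECN feedback is outside the scope of
          the sketch.</t>

          <figure>
            <artwork><![CDATA[
```python
from typing import Optional


def ceg_increase_accecn(delivered_data: int, smss: int,
                        marked_packets: int = 0,
                        marked_bytes: Optional[int] = None) -> int:
    """Bytes to add to the CEG for one ACK carrying AccECN feedback."""
    if marked_bytes is not None:
        # Byte-accurate feedback: use it directly.
        return marked_bytes
    # Packet-count feedback (D marks): assume one SMSS per mark, but
    # never more than the data this ACK newly delivered.
    return min(smss * marked_packets, delivered_data)


print(ceg_increase_accecn(3000, 1460, marked_packets=1))
print(ceg_increase_accecn(1000, 1460, marked_packets=2))
print(ceg_increase_accecn(3000, 1460, marked_bytes=500))
```
]]></artwork>
          </figure>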

        </section>

        <section title="Classic ECN Support">

          <t>With classic ECN, as soon as a CE mark is seen at the receiver, it
          will feed this information back to the sender by setting the Echo
          Congestion Experienced (ECE) flag in the TCP header of subsequent
          ACKs. Once the sender receives the first ECE of a congestion
          notification, it sets the CWR flag in the TCP header once. When this
          packet with Congestion Window Reduced (CWR) flag in the TCP header
          arrives at the receiver, acknowledging its first ECE feedback, the
          receiver stops setting ECE.</t>

          <t>If the ConEx sender fully conforms to the semantics of ECN
          signaling as defined by <xref target="RFC3168"/>, it will receive 
          one full RTT of ACKs with the ECE
          flag set whenever at least one CE mark was received by the receiver.
          As the sender cannot estimate how many packets have actually been CE
          marked during this RTT, the most conservative assumption MAY be
          taken, namely assuming that all packets were marked. This can be
          achieved by increasing the CEG by DeliveredData for each ACK with
          the ECE flag:</t>
          <!--<vspace blankLines="1"/>-->
          <t>CEG += DeliveredData</t>
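
          <t>The conservative accounting for classic ECN can be sketched as
          follows. The representation of an ACK as a (DeliveredData, ECE)
          pair is an assumption made for this illustration.</t>

          <figure>
            <artwork><![CDATA[
```python
def ceg_after_acks(acks):
    """Accumulate the CEG over a sequence of ACKs with classic ECN.

    Each element of acks is a (delivered_data, ece) pair, where ece
    tells whether the ACK carried the ECE flag.
    """
    ceg = 0
    for delivered, ece in acks:
        if ece:
            # Conservative assumption: all delivered bytes were marked.
            ceg += delivered
    return ceg


acks = [(1460, False), (1460, True), (2920, True), (1460, False)]
print(ceg_after_acks(acks))
```
]]></artwork>
          </figure>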

          <t>Optionally, a ConEx sender could implement the following
          technique (which does not conform to <xref target="RFC3168"/>),
          called advanced compatibility mode, to considerably
          improve its estimate of the number of ECN-marked packets:</t>
	  
          <t>To extract more than one ECE indication per RTT, a
          ConEx sender could set the CWR flag continuously to force the
          receiver to signal only one ECE per CE mark. Unfortunately, the use
          of delayed ACKs <xref target="RFC5681"/> (which is common) will
          prevent feedback of every CE mark; if a CWR confirmation is received
          before the ECE can be sent out on the next ACK, ECN feedback
          information could get lost (depending on the actual receiver 
          implementation). Thus a sender SHOULD set CWR only on
          those data segments that will presumably trigger a (delayed) ACK. The
          sender would need an additional control loop to estimate which data
          segments will trigger an ACK in order to extract more timely
          congestion notifications. Still, the CEG SHOULD be increased by
          DeliveredData, as one or more CE marked packets could be
          acknowledged by one delayed ACK.</t>

        </section>
      </section>
    </section>

    <section anchor="bits" title="Setting the ConEx Flags">

      <t>By setting the X flag, a packet is marked as ConEx-capable. All
      packets carrying payload MUST be marked with the X flag set, including
      retransmissions. The X flag SHOULD be zero only if no congestion feedback
      information is (currently) available (e.g. for control packets on a
      connection that has not sent any user data for some time and is therefore sending
      only pure ACKs that do not carry any payload).</t>

      <section anchor="settingBits" title="Setting the E or the L Flag">

	<t>As described in section <xref target="loss"/>, the sender needs to maintain
        a CEG counter and might maintain a LEG counter. If no LEG is used, all
        retransmissions will be marked with the L flag.</t>

        <t>Further, as long as the LEG or CEG counter is positive, the sender marks
        each ConEx-capable packet with L or E respectively, and decreases the
        LEG or CEG counter by the TCP payload bytes carried in the marked
        packet (assuming headers are not being counted because packet sizes
        are regular). No matter how small the value of LEG or CEG, if the value is
        positive, the sender MUST NOT defer packet marking; this ensures that ConEx
	signals are timely. Therefore the value of LEG and CEG will commonly
        be negative.</t>

        <t>If both LEG and CEG are positive, the sender MUST mark each
        ConEx-capable packet with both L and E. If a credit signal is also
        pending (see next section), the C flag can be set as
        well.</t>
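
        <t>The marking rule can be sketched as follows. The names are
        hypothetical, only payload bytes are counted (regular packet sizes
        assumed), and the C flag is left out of this sketch.</t>

        <figure>
          <artwork><![CDATA[
```python
from dataclasses import dataclass


@dataclass
class Gauges:
    leg: int = 0  # loss exposure gauge, in bytes
    ceg: int = 0  # congestion exposure gauge, in bytes


def mark_packet(g: Gauges, payload: int,
                is_retransmission: bool = False,
                use_leg: bool = True) -> set:
    """Return the ConEx flags for a packet that carries payload."""
    flags = {"X"}  # every packet with payload is ConEx-capable
    if use_leg:
        if g.leg > 0:          # mark at once, however small the gauge
            flags.add("L")
            g.leg -= payload
    elif is_retransmission:
        flags.add("L")         # no LEG: mark all retransmissions
    if g.ceg > 0:
        flags.add("E")
        g.ceg -= payload
    return flags


g = Gauges(leg=100, ceg=3000)
for _ in range(4):
    print(sorted(mark_packet(g, 1460)), g.leg, g.ceg)
```
]]></artwork>
        </figure>

        <t>In this run the first packet carries both L and E, and both gauges
        end up negative because marking is never deferred.</t>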
        
      </section>

      <section anchor="credits" title="Setting the Credit Flag">
        <t>The ConEx abstract mechanism <xref
        target="draft-ietf-conex-abstract-mech"/> requires that sufficient
        credit MUST be signaled in advance to cover the expected congestion
        during the feedback delay of one RTT.</t>

        <t>To monitor the credit state at the audit, a ConEx sender needs to maintain a
        Credit State Counter (CSC) in bytes. If congestion occurs, credits
        will be consumed and the CSC is reduced by the number of
        bytes that were lost or estimated to be ECN-marked. If the risk of
        congestion was estimated wrongly and thus too few credits were sent,
        the CSC becomes zero but cannot go negative.</t>

        <t>To be sure that the credit state in the audit never reaches zero, the number
        of credits should always equal the number of bytes in flight as all 
        packets could potentially get lost or congestion marked. In this case a ConEx 
        sender also monitors the number of bytes in flight F.  If F ever becomes larger 
        than CSC, the ConEx sender sets the C flag on each ConEx-capable packet and
        increases CSC by the payload size of each marked packet until CSC is no less than
        F again. However, a ConEx sender might also be less conservative and send fewer
        credits, if it e.g. assumes based on previous experience that the congestion will
        be low on a certain path.</t>
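
        <t>The conservative credit policy can be sketched as follows. The
        names are hypothetical, and flight_bytes is assumed to include the
        packet that is about to be sent.</t>

        <figure>
          <artwork><![CDATA[
```python
class CreditState:
    """Credit State Counter (CSC) of a conservative ConEx sender."""

    def __init__(self) -> None:
        self.csc = 0  # bytes of credit assumed to be held at the audit

    def on_congestion(self, congested_bytes: int) -> None:
        # Credits are consumed by loss or ECN marks; CSC stops at zero.
        self.csc = max(0, self.csc - congested_bytes)

    def on_send(self, payload: int, flight_bytes: int) -> bool:
        """Return True if the packet should carry the C flag."""
        if flight_bytes > self.csc:
            self.csc += payload
            return True
        return False


cs = CreditState()
flight = 0
marks = []
for _ in range(3):                  # three 1000-byte packets
    flight += 1000
    marks.append(cs.on_send(1000, flight))
cs.on_congestion(1000)              # one packet lost
marks.append(cs.on_send(1000, 3000))
marks.append(cs.on_send(1000, 3000))
print(marks, cs.csc)
```
]]></artwork>
        </figure>

        <t>Here the first three packets and the first packet after the loss
        carry credit; once CSC equals the flight size again, marking
        stops.</t>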

        <t>Recall that CSC will be decreased whenever congestion occurs;
        therefore CSC will need to be replenished as soon as CSC drops below F.
        Also recall that the sender can set the C flag on a ConEx-capable
        packet whether or not the E or L flags are also set.</t>

        <t>In TCP Slow Start, the congestion window might grow much larger
        than during the rest of the transmission. A sender could therefore consider
        sending fewer than F credits, at the risk of being penalized by an audit
        function. However, the credits should at least cover the increase
        in sending rate. Given the exponential increase as implemented in the TCP
        Slow Start algorithm which means that the sending rate doubles every RTT,
        a ConEx sender should at least cover half the number of packets
        in flight by credits.</t>

        <t> Note that the number of losses or markings
        within one RTT does not solely depend on the sender's actions. In
        general, the behavior of the cross traffic, whether Active Queue
        Management (AQM) is used and how it is parameterized influence how
        many packets might be dropped or marked. As long as any AQM
        encountered is not overly aggressive with ECN marking, sending half
        the flight size as credits should be sufficient whether congestion is
        signaled by loss or ECN.</t>
       
        <t>To maintain half of the packets in flight as credits, half of the
        packets of the initial window must also be C marked. In Slow
        Start marking every fourth packet introduces the correct amount of 
        credit as can be seen in <xref target="SS_credit"/>.</t>

        <figure align="center" anchor="SS_credit"
                title="Credits in Slow Start (with an initial window of 3)">
          <artwork align="center"><![CDATA[                        in_flight  credits
RTT1  |------XC------>|     1         1
      |------X------->|     2         1
      |------XC------>|     3         2
      |               |
RTT2  |------X------->|     3         2
      |------X------->|     4         2
      |------X------->|     4         2
      |------XC------>|     5         3
      |------X------->|     5         3
      |------X------->|     6         3
      |               |
RTT3  |------X------->|     6         3
      |------XC------>|     7         4
      |------X------->|     7         4
      |------X------->|     8         4
      |------X------->|     8         4
      |------XC------>|     9         5
      |------X------->|     9         5
      |------X------->|    10         5
      |------X------->|    10         5
      |------XC------>|    11         6
      |------X------->|    11         6
      |------X------->|    12         6
      |      .        |
      |      :        |
      ]]></artwork>
        </figure>
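
        <t>The counts in <xref target="SS_credit"/> can be reproduced with a
        small simulation. The following Python sketch is illustrative only and
        not part of the proposed mechanism. The function name
        slow_start_credits and the marking rule (set C whenever the credits
        sent so far cover less than half of the packets in flight) are
        assumptions made for this example, as is the simplification that every
        ACK acknowledges one packet and releases two new ones.</t>

        <figure>
          <artwork><![CDATA[
```python
def slow_start_credits(initial_window=3, rtts=3):
    """Simulate C marking in Slow Start; return one row per packet sent.

    Each row is (rtt, in_flight, credits, c_marked). The marking rule
    (mark when the credits cover less than half of the flight) is an
    assumption chosen to match the figure, not a normative algorithm.
    """
    rows = []
    credits = 0
    in_flight = 0
    for rtt in range(1, rtts + 1):
        if rtt == 1:
            # Initial window: nothing is ACKed yet, so every packet
            # sent adds one packet to the flight.
            growth = [1] * initial_window
        else:
            # Each ACK removes one packet from the flight and releases
            # two: the first keeps the flight size, the second grows it.
            growth = [0, 1] * in_flight
        for step in growth:
            in_flight += step
            c_marked = 2 * credits < in_flight
            if c_marked:
                credits += 1
            rows.append((rtt, in_flight, credits, c_marked))
    return rows
```
]]></artwork>
        </figure>

        <t>With the default arguments, this sketch marks the 1st, 3rd, 7th,
        11th, 15th and 19th packet, i.e., every fourth packet after the
        initial window, and ends the third RTT with 12 packets in flight
        and 6 credits.</t>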

        <t>It is possible that a TCP flow will encounter an audit function
        without relevant flow state, e.g., due to rerouting or memory
        limitations. Therefore, the sender needs to detect this case and
        resend credits. A ConEx sender might reset the credit counter CSC to
        zero if losses occur in subsequent RTTs (assuming that the sending
        rate was correctly reduced based on the received congestion signal and
        using a conservatively large RTT estimate).</t>

       <t><!-- This section proposes concrete algorithms for determining how much
        credit to signal during congestion avoidance and slow start. However,
        experimentation in credit setting algorithms is expected and
        encouraged. -->
        This section proposes a concrete algorithm for determining how much
        credit to signal (with a separate approach used for Slow Start). However,
        experimentation in credit setting algorithms is expected and
        encouraged.
        The wider goal of ConEx is to reflect the 'cost' of the risk of
        causing congestion on those that contribute most to it. Thus,
        experimentation is encouraged to improve or maintain
        performance while reducing the risk of causing congestion, and
        therefore potentially reducing the need to signal so much credit.</t>

      </section>


    </section>

    <section anchor="sec43" title="Loss of ConEx Information">

      <t>Packets carrying ConEx signals could be discarded themselves. This
      will be a second-order problem (e.g., if the loss probability is 0.1%,
      the probability of losing a ConEx L signal will be 0.1% of 0.1% = 0.0001%).
      Further, the penalty an audit induces should be proportional to the mismatch
      between expected ConEx marks and observed congestion, so the audit might only
      slightly increase the loss level of this flow. Therefore, an implementer MAY
      choose to ignore this problem, accepting instead the risk that an audit function
      might wrongly penalize a flow.
      </t>

      <t>Nonetheless, a ConEx sender is responsible for always signaling sufficient
      congestion feedback and therefore SHOULD remember which packets were marked
      with the L, the E, or the C flag. If one of these packets is
      detected as lost, the sender SHOULD increase the respective gauge(s),
      LEG or CEG, by the number of lost payload bytes in addition to
      increasing LEG for the loss.</t>
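
      <t>The bookkeeping described above could look roughly as follows. This
      Python sketch is non-normative. The class and method names are
      hypothetical, and mapping a lost L mark to LEG and a lost E mark to CEG
      is one reading of "the respective gauge(s)". Because the text names no
      gauge for lost C marks, the sketch only records their bytes in a
      separate, assumed counter.</t>

      <figure>
        <artwork><![CDATA[
```python
class MarkTracker:
    """Remember ConEx-marked packets so lost marks can be re-signaled."""

    def __init__(self):
        self.leg = 0           # loss exposure gauge, in bytes
        self.ceg = 0           # congestion (ECN) exposure gauge, in bytes
        self.lost_credits = 0  # assumed counter: bytes of lost C marks
        self.marked = {}       # seq -> (set of flags, payload bytes)

    def on_send(self, seq, payload_bytes, flags):
        # Only packets that carry at least one ConEx flag are remembered.
        if flags:
            self.marked[seq] = (set(flags), payload_bytes)

    def on_acked(self, seq):
        # The mark was delivered; no need to remember it any longer.
        self.marked.pop(seq, None)

    def on_loss(self, seq, payload_bytes):
        # Every detected loss increases LEG by the lost payload bytes.
        self.leg += payload_bytes
        flags, marked_bytes = self.marked.pop(seq, (set(), 0))
        # A lost mark has to be signaled again, so the respective
        # gauge is increased once more by the same number of bytes.
        if "L" in flags:
            self.leg += marked_bytes
        if "E" in flags:
            self.ceg += marked_bytes
        if "C" in flags:
            self.lost_credits += marked_bytes
```
]]></artwork>
      </figure>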
    </section>

    <section anchor="timeliness" title="Timeliness of the ConEx Signals">
      <t>ConEx signals will only be useful to a network node within a time
      delay of about one RTT after the congestion occurred. To avoid
      further delays, a ConEx sender SHOULD send the ConEx signaling on the
      next available packet.</t>

      <t>Any or all of the ConEx flags can be used in the same packet, which
      allows delay to be minimized when multiple signals are pending.
      The need to set multiple ConEx flags at the same time can occur if, e.g.,
      an ACK is received by the sender that simultaneously indicates that at
      least one ECN mark was received and that one or more segments were
      lost. This may happen during excessive congestion, if the
      queues overflow even though ECN is used, so that all forwarded packets are
      marked while others have to be dropped. Another case when this
      might happen is when ACKs are lost, so that a subsequent ACK
      carries summary information not previously available to the sender.</t>

      <t>If a flow becomes application-limited, there could be insufficient bytes to
      send to reduce the gauges to zero or below. In such cases, the sender
      cannot help but delay ConEx signals. Nonetheless, as long as the sender
      is marking all outgoing packets, an audit function is unlikely to
      penalize ConEx-marked packets. Therefore, no matter how long a gauge has
      been positive, a sender MUST NOT reduce the gauge by more than the ConEx
      marked bytes it has sent.</t>
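
      <t>A minimal sketch of this behavior in Python is given below, under
      stated assumptions. The names Gauges and mark_next_packet are
      hypothetical, and the rule "set a flag while the corresponding gauge
      is positive" is a simplification for illustration. The sketch sets
      every pending flag on the next packet and reduces each gauge only by
      the bytes of packets that actually carried the mark, so an
      application-limited sender that sends nothing leaves its gauges
      untouched.</t>

      <figure>
        <artwork><![CDATA[
```python
from dataclasses import dataclass


@dataclass
class Gauges:
    leg: int = 0  # bytes still to be signaled with the L flag
    ceg: int = 0  # bytes still to be signaled with the E flag


def mark_next_packet(gauges, marked_bytes):
    """Return the ConEx flags for the next outgoing packet.

    marked_bytes is the number of bytes the sender accounts for this
    packet. Both flags may be set on the same packet when both gauges
    are positive, which avoids delaying either signal.
    """
    flags = set()
    if gauges.leg > 0:
        flags.add("L")
        gauges.leg -= marked_bytes  # reduced only by bytes really sent
    if gauges.ceg > 0:
        flags.add("E")
        gauges.ceg -= marked_bytes
    return flags
```
]]></artwork>
      </figure>

      <t>For example, starting from LEG = 1000 and CEG = 3000 bytes, three
      1460-byte packets would carry the flags L and E, then E, then E,
      leaving LEG at -460 and CEG at -1380 bytes.</t>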

      <t>If the CEG or LEG counter is negative, the respective counter MAY
      be reset to zero within one RTT after it was decreased the last time or
      one RTT after recovery if no further congestion occurred.</t>

      <!--<t>If SACK information is not available or SACK information has been reset for any reason
      spurious retransmission are more likely. In this case it might be
      valuable to slightly delay the ConEx loss feedback until a spurious
      retransmission might be detected. But the ConEx signal MUST NOT be
      delayed more than one RTT if as long as data packets are sent out.</t>-->


    </section>
    
    <section title="Open Areas for Experimentation">
        
        <t>All proposed mechanisms in this document are
            experimental, and therefore further large-scale experimentation in the
            Internet is required to evaluate if the signaling provided by
            these mechanisms is accurate and timely enough to produce value
            for ConEx-based (traffic management or other) mechanisms.</t>
        
        <t>The current ConEx specifications assume that congestion is
            counted in number of bytes (including the IP header that
            directly encapsulates the CDO and everything that IP header
            encapsulates) <xref target="draft-ietf-conex-destopt"/>. This
            decision was taken because most network devices today experience
            byte-congestion where the memory is filled exactly with the number
            of bytes a packet carries  <xref target="RFC7141"/>. However, there are also devices that
            may allocate a certain amount of memory per packet, no matter how
            large a packet is. These devices get congested based on the
            number of packets in their memory and therefore in this case congestion
            is determined by the number of packets that have been
            lost or marked. Furthermore, a transport layer endpoint, such as a
            TCP sender or receiver, might not know the exact number of bytes
            that a lower layer was carrying. Therefore a TCP endpoint
            may only be able to estimate the number of congested bytes
            (assuming that all lower layer headers have the same length). Whether
            this estimation is sufficient to work with needs
            to be further evaluated in tests in the Internet together
            with different auditor implementations.</t>
        
        <t>Further, the proposed marking schemes in this document are
            designed under the assumption that all TCP packets of a
            ConEx-capable flow are of equal size or that flows have a constant
            mean packet size over a rather small time frame, like one RTT or less.
            Most implementations are likely to make the same assumption, and it
            probably holds for most traffic flows.
            If this proposed scheme is used, it is necessary to evaluate how
            much accuracy degrades if this precondition is not met. Evaluating
            with real traffic from different applications is especially important
            in deciding whether the proposed schemes are
            sufficient or whether a more complex scheme is needed.</t>
        
        <t>In this context, the proposed scheme to set credit
            markings in Slow Start runs the risk of providing an insufficient
            number of markings, which can cause an audit function to penalize
            the flow. Both the proposed credit scheme for Slow Start and
            the scheme in Congestion Avoidance must be evaluated together
            with one or more specific implementations of a ConEx auditor to
            ensure that both algorithms, in the sender and in the auditor,
            work properly together with a low risk of false positives (which
            would lead to penalization of an honest sender). However, if a
            sender is wrongly assumed to cheat, the penalization by the audit
            should be adequate and should allow an honest sender using a
            congestion control scheme that is commonly used today to recover
            quickly.</t>
        
        <t>Another open issue is the accuracy of the ECN feedback signal.
            At the time of publication of this document, no AccECN mechanism
            has been specified yet, and AccECN will also take some time
            to be widely deployed. This document proposes an advanced
            compatibility mode for Classic ECN. The proposed mechanism can
            provide more accurate feedback by utilizing the way Classic ECN is
            specified but has a higher risk of losing information. To figure
            out how high this risk is in a real deployment scenario, further
            experimental evaluation is needed. The following argument is
            intended to show that suppressing repetitions of ECE is nevertheless
            safe against possible congestion collapse due to lost
            congestion feedback; it should be further confirmed by
            experimentation:</t>
        
        <t>Repetition of ECE in Classic ECN is intended to ensure reliable
            delivery of congestion feedback. However, with advanced
            compatibility mode, it is possible to miss congestion
            notifications. This can happen in some implementations if delayed
            acknowledgements are used. Further, an ACK
            containing ECE can simply get lost. If only a few CE marks are
            received within one congestion event (e.g., only one), the loss of
            one acknowledgement due to (heavy) congestion on the reverse path
            can prevent any congestion notification from being received by the
            sender.</t>
        
        <t>However, if loss of feedback exacerbates congestion on the
            forward path, more forward packets will be CE marked, increasing
            the likelihood that feedback from at least one CE will get through
            per RTT. As long as one ECE reaches the sender per RTT, the
            sender's congestion response will be the same as if CWR were not
            continuous. The only way that heavy congestion on the forward path
            could be completely hidden would be if all ACKs on the reverse
            path were lost. If total ACK loss persisted, the sender would time
            out and do a congestion response anyway. Therefore, the problem
            seems confined to potential suppression of a congestion response
            during light congestion.</t>
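
        <t>As a rough illustration of this argument, the following Python
            sketch computes the probability that at least one ECE reaches the
            sender within an RTT. It assumes, purely for illustration, that
            each CE mark is fed back on its own ACK and that ACKs are lost
            independently with the same probability. Neither these
            assumptions nor the function name come from this document.</t>

        <figure>
          <artwork><![CDATA[
```python
def p_feedback_received(ce_marks, ack_loss_prob):
    """Probability that at least one ECE-carrying ACK survives in an RTT.

    Assumes one ECE-carrying ACK per CE mark and independent ACK loss
    with probability ack_loss_prob (both are simplifications).
    """
    if ce_marks <= 0:
        return 0.0
    # All ACKs are lost with probability ack_loss_prob ** ce_marks.
    return 1.0 - ack_loss_prob ** ce_marks
```
]]></artwork>
        </figure>

        <t>Under these assumptions, with half of all ACKs lost, a single CE
            mark is reported with probability 0.5, while four CE marks in the
            same RTT raise this to 0.9375. Only when every ACK is lost does
            the probability drop to zero.</t>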
        
        <t>Furthermore, even if loss of all ECN feedback leads to no congestion 
            response, the worst that could happen would be loss instead of 
            ECN-signaled congestion on the forward path. Given compatibility 
            mode does not affect loss feedback, there would be no risk of 
            congestion collapse.</t>
        
    </section>

    <section anchor="Acknowledgements" title="Acknowledgements">
      <t>The authors would like to thank Bob Briscoe, who contributed his
      initial ideas <xref target="I-D.briscoe-conex-re-ecn-tcp"/> and valuable
      feedback. Moreover, thanks to Jana Iyengar, who also provided valuable
      feedback.</t>
    </section>

    <!-- Possibly a 'Contributors' section ... -->

    <section anchor="IANA" title="IANA Considerations">
      <t>This document does not have any requests to IANA.</t>
    </section>

    <section anchor="Security" title="Security Considerations">
      <!-- BB: The following para has been dealt with under Classic ECN support, 
           because it concerns protocol safety not security 
           (it is not to do with protecting against deliberate malice)
      <t>With some of the advanced ECN compatibility modes it is possible to
      miss congestion notifications. Thus a sender will not decrease its
      sending rate. If the congestion is persistent, the likelihood to receive
      a congestion notification increases. In the worst case the sender will
      still react correctly to loss. This will prevent a congestion
      collapse.</t>
      -->

      <t>General ConEx security considerations are covered extensively in the
      ConEx abstract mechanism <xref
      target="draft-ietf-conex-abstract-mech"/>. This section covers
      TCP-specific concerns that may occur with the addition of ConEx to TCP
      (while not discussing general well-known attacks against TCP).
      It is assumed that any altering of ConEx information can be detected by 
      protection mechanisms in the IP layer and is therefore not discussed here
      but in <xref target="draft-ietf-conex-destopt"/>. Further, 
      <xref target="draft-ietf-conex-destopt"/> 
      describes how to use ConEx to mitigate flooding attacks by using 
      preferential drop where the use of ConEx can even increase security.</t>

      <t>The ConEx modifications to TCP provide no mechanism for a receiver to
      force a sender not to use ConEx. A receiver can degrade the accuracy of
      ConEx by claiming that it does not support SACK, AccECN or ECN, but the
      sender will never have to turn ConEx off. Further, the receiver cannot force the
      sender to mark ConEx more conservatively in order to cover the
      risk of any inaccuracy. Instead, it is always the sender's choice either to mark
      very conservatively, which ensures that the audit always sees enough markings
      to not penalize the flow, or to estimate the needed number of markings more
      tightly. This second case can lead to inaccurate marking and therefore increases
      the likelihood of loss at an audit function, which will only harm the receiver itself.</t>

      <t>Assuming the sender is limited in some way by a congestion allowance
      or quota, a receiver could spoof more loss or ECN congestion feedback
      than it actually experiences, in an attempt to make the sender draw down
      its allowance faster than necessary. However, over-declaring congestion
      simply makes the sender slow down. If the receiver is interested in the
      content it will not want to harm its own performance.</t>

      <t>However, if the receiver is solely interested in making the sender
      draw down its allowance, the net effect will depend on the sender's
      congestion control algorithm, as permanently adding more and more
      congestion would cause the sender to reduce its sending rate further
      and further. Therefore, a receiver
      can only maintain a certain congestion level that corresponds to a certain
      sending rate. With New Reno <xref target="RFC5681"/>,
      doubling congestion feedback causes the sender to reduce its sending rate such
      that it would only consume about sqrt(2) = 1.4
      times more congestion allowance. However, to improve scaling, congestion
      control algorithms are tending towards less responsive algorithms like
      Cubic or Compound TCP, and ultimately to linear algorithms like DCTCP
      <xref target="DCTCP"/> that aim to maintain the same congestion level independent
      of the current sending rate and always reduce their sending window if the signaled
      congestion feedback is higher. In each case, if the receiver doubles congestion
      feedback, it causes the sender to respectively consume more allowance by
      a factor of 1.2, 1.15 or 1, where 1 implies the attack has become
      completely ineffective, as no further congestion allowance is consumed but the flow will
      decrease its sending rate to a minimum instead.</t>
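
      <t>The factors quoted above are consistent with a simple scaling
      argument, sketched below in Python. The sketch assumes, beyond what this
      document states, that the sending rate of a congestion control scales as
      1/p^b with the congestion level p, so that the consumed allowance (rate
      times p) scales as p^(1-b). The exponents 0.5 (New Reno), 0.75 (Cubic),
      0.8 (Compound TCP) and 1 (DCTCP) are commonly cited approximations used
      here as assumptions, and the function name is hypothetical.</t>

      <figure>
        <artwork><![CDATA[
```python
def allowance_factor(b, feedback_multiplier=2.0):
    """Growth of consumed allowance when congestion feedback is multiplied.

    Assumes rate ~ 1 / p**b, hence allowance ~ rate * p ~ p**(1 - b).
    """
    return feedback_multiplier ** (1.0 - b)


# Assumed response-function exponents (approximations that are not
# taken from this document).
EXPONENTS = {
    "New Reno": 0.5,
    "Cubic": 0.75,
    "Compound TCP": 0.8,
    "DCTCP": 1.0,
}

FACTORS = {name: allowance_factor(b) for name, b in EXPONENTS.items()}
```
]]></artwork>
      </figure>

      <t>Under these assumptions, doubling the feedback yields factors of
      roughly 1.41, 1.19, 1.15 and 1.00, respectively.</t>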
    </section>
  </middle>

  <!--  *****BACK MATTER ***** -->

  <back>
    <references title="Normative References">
      <!--?rfc include="http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml"?-->

      &RFC2119;

      &RFC3168;

      &RFC2018;

      &RFC5681;

      <reference anchor="draft-ietf-conex-destopt">
        <front>
          <title>IPv6 Destination Option for ConEx</title>

          <author initials="S" surname="Krishnan">
            <organization/>
          </author>

          <author initials="M" surname="Kuehlewind">
            <organization/>
          </author>

          <author initials="C" surname="Ucendo">
            <organization/>
          </author>

          <date month="March" year="2013"/>
        </front>

        <seriesInfo name="Internet-Draft" value="draft-ietf-conex-destopt-04"/>
      </reference>

      <reference anchor="draft-ietf-conex-abstract-mech">
        <front>
          <title>Congestion Exposure (ConEx) Concepts and Abstract
          Mechanism</title>

          <author initials="M" surname="Mathis">
            <organization/>
          </author>

          <author initials="B" surname="Briscoe">
            <organization/>
          </author>

          <date month="October" year="2012"/>
        </front>

        <seriesInfo name="Internet-Draft"
                    value="draft-ietf-conex-abstract-mech-06"/>
      </reference>
    </references>

    <references title="Informative References">

      &RFC3522;

      &RFC3708;

      &RFC4015;

      &RFC5682;

      &RFC6789;

      &RFC7141;

      <?rfc include="reference.I-D.briscoe-conex-re-ecn-tcp.xml"?>

      <reference anchor="draft-kuehlewind-tcpm-accurate-ecn">
        <front>
          <title>More Accurate ECN Feedback in TCP</title>

          <author initials="M" surname="Kuehlewind">
            <organization/>
          </author>

          <author initials="R" surname="Scheffenegger">
            <organization/>
          </author>

          <date month="Jun" year="2013"/>
        </front>

        <seriesInfo name="Internet-Draft"
                    value="draft-kuehlewind-tcpm-accurate-ecn-02"/>
      </reference>

      <reference anchor="DCTCP">
        <front>
          <title>DCTCP: Efficient Packet Transport for the Commoditized Data
          Center</title>

          <author initials="M" surname="Alizadeh">
            <organization/>
          </author>

          <author initials="A" surname="Greenberg">
            <organization/>
          </author>

          <author initials="D" surname="Maltz">
            <organization/>
          </author>

          <author initials="J" surname="Padhye">
            <organization/>
          </author>

          <author initials="P" surname="Patel">
            <organization/>
          </author>

          <author initials="B" surname="Prabhakar">
            <organization/>
          </author>

          <author initials="S" surname="Sengupta">
            <organization/>
          </author>

          <author initials="M" surname="Sridharan">
            <organization/>
          </author>

          <date month="Jan" year="2010"/>
        </front>
      </reference>
    </references>

    <section title="Revision history">
      <t>RFC Editor: This section is to be removed before RFC publication.</t>

      <t>00 ... initial draft, early submission to meet deadline.</t>

      <t>01 ... refined draft, updated LEG "drain" from per-packet to
      RTT-based.</t>

      <t>02 ... added <xref target="sec43"/> and expanded discussion about ECN
      interaction.</t>

      <t>03 ... expanded the discussion around credit bits.</t>

      <t>04 ... review comments of Jana addressed. (Change in full compliance
      mode.)</t>

      <t>05 ... changes on Loss Detection without SACK, support of classic ECN
      and credit handling.</t>

      <t>07 ... review feedback provided by Nandita</t>

      <t>08 ... based on Bob's feedback: Wording edits and structuring of a few 
      paragraphs; change of SHOULD to MAY for resetting negative LEG/CEG;
      additional security considerations provided by Bob (thanks!).</t>

      <t>09 ... experimentation section added</t>

      <t>10 ... final review comments based on IETF last call</t>

      <!--<t><vspace blankLines="100"/></t>-->
    </section>
  </back>
</rfc>
