<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?xml-stylesheet type='text/xsl' href='http://xml.resource.org/authoring/rfc2629.xslt' ?>
<!-- Alterations to I-D/RFC boilerplate -->
<?rfc private="" ?>
<!-- Default private="" Produce an internal memo 2.5pp shorter than an I-D or RFC -->
<?rfc rfcprocack="yes" ?>
<!-- Default rfcprocack="no" add a short sentence acknowledging xml2rfc -->
<?rfc strict="no" ?>
<!-- Default strict="no" Don't check I-D nits -->
<?rfc rfcedstyle="yes" ?>
<!-- Default rfcedstyle="yes" attempt to closely follow finer details from the latest observable RFC-Editor style -->
<!-- IETF process -->
<?rfc iprnotified="no" ?>
<!-- Default iprnotified="no" I haven't disclosed existence of IPR to IETF -->
<!-- ToC format -->
<?rfc toc="yes" ?>
<!-- Default toc="no" No Table of Contents -->
<!-- Cross referencing, footnotes, comments -->
<?rfc symrefs="yes"?>
<!-- Default symrefs="no" Don't use anchors, but use numbers for refs -->
<?rfc sortrefs="yes"?>
<!-- Default sortrefs="no" Don't sort references into order -->
<?rfc comments="yes" ?>
<!-- Default comments="no" Don't render comments -->
<?rfc inline="no" ?>
<!-- Default inline="no" if comments is "yes", then render comments inline; otherwise render them in an `Editorial Comments' section -->
<!-- Pagination control -->
<?rfc compact="yes"?>
<!-- Default compact="no" Start sections on new pages -->
<?rfc subcompact="no"?>
<!-- Default subcompact="(as compact setting)" yes/no is not quite as compact as yes/yes -->
<!-- HTML formatting control -->
<?rfc emoticonic="yes" ?>
<!-- Default emoticonic="no" Doesn't prettify HTML format -->
<rfc category="bcp" docName="draft-briscoe-tsvwg-ecn-encap-guidelines-01"
     ipr="trust200902" updates="3819">
  <front>
    <title abbrev="ECN Encapsulation Guidelines">Guidelines for Adding
    Congestion Notification to Protocols that Encapsulate IP</title>

    <author fullname="Bob Briscoe" initials="B." surname="Briscoe">
      <organization>BT</organization>

      <address>
        <postal>
          <street>B54/77, Adastral Park</street>

          <street>Martlesham Heath</street>

          <city>Ipswich</city>

          <code>IP5 3RE</code>

          <country>UK</country>
        </postal>

        <phone>+44 1473 645196</phone>

        <email>bob.briscoe@bt.com</email>

        <uri>http://bobbriscoe.net/</uri>
      </address>
    </author>

    <date day="22" month="October" year="2012" />

    <area>Transport</area>

    <workgroup>Transport Area Working Group</workgroup>

    <keyword>Congestion Control and Management</keyword>

    <keyword>Congestion Notification</keyword>

    <keyword>Information Security</keyword>

    <keyword>Tunnelling</keyword>

    <keyword>Encapsulation &amp; Decapsulation</keyword>

    <keyword>Protocol</keyword>

    <keyword>ECN</keyword>

    <keyword>Layering</keyword>

    <abstract>
      <t>The purpose of this document is to guide the design of congestion
      notification in any lower layer or tunnelling protocol that encapsulates
      IP. The aim is for explicit congestion signals to propagate consistently
      from lower layer protocols into IP. Then the IP internetwork layer can
      act as a portability layer to carry congestion notification from
      non-IP-aware congested nodes up to the transport layer (L4). Following
      these guidelines should assure interworking between new lower layer
      congestion notification mechanisms, whether specified by the IETF or
      other standards bodies.</t>
    </abstract>
  </front>

  <!-- ================================================================ -->

  <middle>
    <!-- ================================================================ -->

    <section anchor="ecnencap_Introduction" title="Introduction">
      <t>Explicit Congestion Notification (ECN <xref target="RFC3168"></xref>)
      is defined in the IP header (v4 &amp; v6) to allow a resource to notify
      the onset of queue build-up without having to drop packets, by
      explicitly marking a proportion of packets with the congestion
      experienced (CE) codepoint.<!--In the layered model of communication, each layer accepts requests to forward PDUs and eventually returns 
a status code to the higher layer. Without ECN, each layer returns either a 'delivered' status code or an 
implicit 'not delivered'. Explicit notification of congestion adds a useful 'delivered but congestion 
experienced' status code to each layer interface.-->ECN removes nearly all
      congestion loss and it cuts delays for two main reasons: i) it avoids
      the delays recovering from congestion losses, which particularly
      benefits small flows, making their completion time predictably short
      <xref target="RFC2884"></xref>; and ii) as ECN is used more widely by
      end-systems, it will gradually remove the need to configure a degree of
      delay into buffers before they start to notify congestion (the cause of
      bufferbloat). The latter delay is needed because drop involves a trade-off
      between sending a timely signal and trying to avoid impairment, whereas
      ECN is solely a signal, so there is no harm in triggering it earlier.</t>

      <t>Some lower layer technologies (e.g. MPLS, Ethernet) are used to form
      large subnetworks with IP-aware nodes only at the edges. Particularly
      now that end-system protocols are finally being deployed without their
      earlier deficiencies, even the buffers of well-provisioned interior
      switches will often need to signal episodes of queuing. However, the
      above benefits of ECN can only be fully realised if the relevant
      subnetwork technology supports it. Propagation of ECN is defined for
      MPLS <xref target="RFC5129"></xref>, and is being defined for TRILL
      <xref target="trill-rbridge-options"></xref>, but it remains to be
      defined for a number of other subnetwork technologies.</t>

      <t>Similarly, ECN propagation is yet to be defined for many tunnelling
      protocols. <xref target="RFC6040"></xref> defines how ECN should be
      propagated for IP-in-IP <xref target="RFC2003"></xref> and IPsec <xref
      target="RFC4301"></xref> tunnels. However, as Section 9.3 of RFC3168
      pointed out, ECN support will need to be defined for other tunnelling
      protocols, e.g. L2TP <xref target="RFC2661"></xref>, GRE [<xref
      format="counter" target="RFC1701"></xref>, <xref format="counter"
      target="RFC2784"></xref>], PPTP <xref target="RFC2637"></xref> and GTP
      [<xref format="counter" target="GTPv1"></xref>, <xref format="counter"
      target="GTPv1-U"></xref>, <xref format="counter"
      target="GTPv2-C"></xref>].<!--Add PMIP (Proxy Mobile IPv6) [RFC5213]?--></t>

      <t>The purpose of this document is to guide the addition of congestion
      notification to any subnet technology or tunnelling protocol, so that
      lower layer equipment can signal congestion explicitly and it will
      propagate consistently into encapsulated (higher layer) headers,
      otherwise the signals will not reach their ultimate destination.</t>

      <t>Incremental deployment is the trickiest aspect when adding support
      for ECN. The original ECN protocol in IP <xref target="RFC3168"></xref>
      was carefully designed so that a congested buffer would not mark a
      packet (rather than drop it) unless both source and destination hosts
      were ECN-capable. Otherwise its congestion markings would never be
      detected and congestion would just deteriorate further. However, to
      support congestion marking below the IP layer, it is not sufficient to
      only check that the two end-points support ECN; correct operation also
      depends on the decapsulator propagating congestion notifications
      faithfully. Otherwise, a legacy decapsulator might silently fail to
      propagate any ECN signals from the outer to the forwarded header. Then
      the lost signals would never be detected and again congestion would
      deteriorate further. The guidelines given later require protocol
      designers to carefully consider incremental deployment, and suggest
      various safe approaches for different circumstances.</t>

      <t>Of course, the IETF does not have standards authority over every link
      layer protocol. So this document gives guidelines for designing
      propagation of congestion notification across the interface between IP
      and protocols that may encapsulate IP (i.e. that can be layered beneath
      IP). Each lower layer technology will exhibit different issues and
      compromises, so the IETF or the relevant standards body must be free to
      define the specifics of each lower layer congestion notification scheme.
      Nonetheless, if the guidelines are followed, congestion notification
      should interwork between different technologies, using IP in its role as
      a 'portability layer'.</t>

      <t>It has not been possible to give common guidelines for all lower
      layer technologies, because they do not all fit a common pattern.
      Instead they have been divided into a few distinct modes of operation:
      Feed-Forward-and-Up, Feed-Up-and-Forward, Feed-Backward and
      Null. These are described in <xref target="ecnencap_Modes"></xref>, then
      in the following sections separate guidelines are given for each
      mode.</t>

      <t>This document updates the advice to subnetwork designers about ECN in
      Section 13 of <xref target="RFC3819"></xref>.</t>

      <section anchor="ecnencap_Scope" title="Scope">
        <t>This document only concerns wire protocol processing of explicit
        notification of congestion and makes no changes or recommendations
        concerning algorithms for congestion marking or congestion
        response.</t>

        <t>This document focuses on the congestion notification interface
        between IP (v4 or v6) and lower layer protocols that can encapsulate
        IP. However, it is likely that the guidelines will also be useful when
        a lower layer protocol or tunnel encapsulates itself (e.g. Ethernet
        MAC in MAC <xref target="IEEE802.1Qah"></xref>) or when it
        encapsulates other protocols.</t>
      </section>
    </section>

    <!-- ================================================================ -->

    <section anchor="ecnencap_Reqs_Language" title="Terminology">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in RFC 2119 <xref
      target="RFC2119"></xref>.</t>

      <t>Further terminology used within this document:<list style="hanging">
          <t hangText="Protocol data unit (PDU):">Information that is
          delivered as a unit among peer entities of a layered network
          consisting of protocol control information (typically a header) and
          possibly user data (payload) of that layer. The scope of this
          document includes layer 2 and layer 3 networks, where the PDU is
          respectively termed a frame or a packet (or a cell in ATM). PDU is a
          general term for any of these. This definition also includes a
          payload with a shim header lying somewhere between layer 2 &amp;
          3.</t>

          <t hangText="Transport:">The end-to-end transmission control
          function, conventionally considered at layer-4 in the OSI reference
          model. Given that the audience for this document will often use the word
          transport to mean low level bit carriage, whenever the term is used
          it will be qualified, e.g. 'L4 transport'.</t>

          <t hangText="Encapsulator:">The link or tunnel endpoint function
          that adds an outer header to a PDU (also termed the 'link ingress',
          the 'subnet ingress', the 'ingress tunnel endpoint' or just the
          'ingress' where the context is clear).</t>

          <t hangText="Decapsulator:">The link or tunnel endpoint function
          that removes an outer header from a PDU (also termed the 'link
          egress', the 'subnet egress', the 'egress tunnel endpoint' or just
          the 'egress' where the context is clear).</t>

          <t hangText="Incoming header:">The header of an arriving PDU before
          encapsulation.</t>

          <t hangText="Outer header:">The header added to encapsulate a
          PDU.</t>

          <t hangText="Inner header:">The header encapsulated by the outer
          header.</t>

          <t hangText="Outgoing header:">The header forwarded by the
          decapsulator.</t>

          <t hangText="CE:">Congestion Experienced <xref
          target="RFC3168"></xref></t>

          <t hangText="ECT:">ECN-Capable Transport <xref
          target="RFC3168"></xref></t>

          <t hangText="Not-ECT:">Not ECN-Capable Transport <xref
          target="RFC3168"></xref></t>

          <t hangText="ECN-PDU:">A PDU that is part of a feedback loop within
          which the nodes necessary to propagate explicit congestion
          notifications back to the load regulator are ECN-capable. This is
          intended to be a general term for a PDU at any layer, not just an IP
          PDU. An IP packet with a non-zero ECN field would be an ECN-PDU, but
          the term is intended to also be used to describe PDUs of protocols
          that encapsulate IP packets, where it has been checked that the
          necessary egress nodes and endpoints in the feedback loop for that
          PDU will propagate congestion notification.</t>

          <t hangText="Not-ECN-PDU:">A PDU that is part of a feedback-loop
          within which some nodes necessary to propagate explicit congestion
          notifications back to the load regulator are not ECN-capable.</t>

          <t hangText="Load Regulator:">For each flow of PDUs, the transport
          function that is capable of controlling the data rate. Typically
          located at the data source, but in-path nodes can regulate load in
          some congestion control arrangements (e.g. admission control or
          policing nodes). Note the term "a function capable of controlling
          the load" deliberately includes a transport that doesn't actually
          control the load but ought to (e.g. an application without
          congestion control that uses UDP).</t>

          <t hangText="Congestion Baseline:">The location of the function on
          the path that initialised the values of all congestion notification
          fields in a sequence of packets, before any are set to the
          congestion experienced (CE) codepoint if they experience congestion
          further downstream. Typically the original data source at
          layer-4.</t>
        </list></t>
    </section>

    <section anchor="ecnencap_Modes" title="Modes of Operation">
      <t>This section sets down the different modes by which information is
      passed between the lower layer and the higher one. It acts as a
      reference framework for the following sections, which give normative
      guidelines for designers of explicit congestion notification protocols,
      taking each mode separately in turn:<list style="hanging">
          <t hangText="Feed-Forward-and-Up:">Nodes feed forward congestion
          notification towards the destination within the lower layer then up
          the layers (like IP does). The following local optimisation is
          possible:<list style="hanging">
              <t hangText="Feed-Up-and-Forward:">A lower layer switch feeds-up
              congestion notification directly into the ECN field in the
              higher layer (IP) header, irrespective of whether it is at the
              egress of a subnet.</t>
            </list></t>

          <t hangText="Feed-Backward:">Nodes feed back congestion signals
          towards the ingress of the lower layer and (optionally) attempt to
          control congestion within their own layer.</t>

          <t hangText="Null:">Nodes cannot experience congestion at the lower
          layer except at ingress nodes that are also IP-aware (or
          equivalently higher-layer-aware).</t>
        </list></t>

      <section anchor="ecnencap_Forward" title="Feed-Forward-and-Up Mode">
        <t>Many subnet technologies are based on self-contained protocol data
        units (PDUs) or frames sent unreliably. They provide no feedback
        channel at the subnetwork layer, instead relying on higher layers
        (e.g. TCP) to feed back loss signals.</t>

        <t>In these cases, ECN may best be supported by standardising explicit
        notification of congestion into the specific link layer protocol. It
        will then also be necessary to define how the egress of the lower
        layer subnet propagates this explicit signal into the forwarded upper
        layer (IP) header. It can then continue forwards until it finally
        reaches the destination transport (at L4). Then typically the
        destination will feed this congestion notification back to the source
        transport using an end-to-end protocol (e.g. TCP).</t>

        <t>This mode is illustrated in <xref
        target="ecnencap_Fig_Feed-Forward-and-Up"></xref>. Along the middle of
        the figure, layers 2, 3 &amp; 4 of the protocol stack are shown, and
        one packet is shown along the bottom as it progresses across the
        network from source to destination, crossing two subnets connected by
        a router, and crossing two switches on the path across each subnet.
        Congestion at the output of the first switch (shown as *) leads to a
        congestion marking in the L2 header (shown as C in the illustration of
        the packet). The chevrons show the progress of the resulting
        congestion indication. It is propagated from link to link across the
        subnet in the L2 header, then when the router removes the marked L2
        header, it propagates the marking up into the L3 (IP) header. The
        router forwards the marked L3 header into subnet 2, and when it adds a
        new L2 header it copies the L3 marking into the L2 header as well, as
        shown by the 'C's in both layers (assuming the technology of subnet 2
        also supports explicit congestion marking).</t>

        <t>Note that there is no implication that each 'C' marking is encoded
        the same; a different encoding might be used for the 'C' marking in
        each protocol.</t>

        <t>Finally, for completeness, we show the L3 marking arriving at the
        destination, where the host transport protocol (e.g. TCP) feeds it
        back to the source in the L4 acknowledgement (the 'C' at L4 in the
        packet at the top of the diagram).</t>

        <figure align="center" anchor="ecnencap_Fig_Feed-Forward-and-Up"
                title="Feed-Forward-and-Up Mode">
          <artwork><![CDATA[                     _ _ _ 
          /_______  | | |C|  ACK Packet (V)
          \         |_|_|_|
 +---+        layer: 2 3 4 header                            +---+
 |  <|<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< Packet V <<<<<<<<<<<<<|<< |L4
 |   |                         +---+                         | ^ |
 |   | . . . . . . Packet U. . | >>|>>> Packet U >>>>>>>>>>>>|>^ |L3
 |   |     +---+     +---+     | ^ |     +---+     +---+     |   |
 |   |     |  *|>>>>>|>>>|>>>>>|>^ |     |   |     |   |     |   |L2
 |___|_____|___|_____|___|_____|___|_____|___|_____|___|_____|___|
 source          subnet A      router       subnet B         dest
     __ _ _ _    __ _ _ _    __ _ _        __ _ _ _
    |  | | | |  |  | | |C|  |  | |C|      |  | |C|C|  Data________\
    |__|_|_|_|  |__|_|_|_|  |__|_|_|      |__|_|_|_|  Packet (U)  /
 layer: 4 3 2A      4 3 2A      4 3           4 3 2B
 header]]></artwork>
        </figure>

        <t>Of course, modern networks are rarely as simple as this text-book
        example, often involving multiple nested layers. Nonetheless, the
        example illustrates the general idea of feeding congestion
        notification forward then upward whenever a header is removed at the
        egress of a subnet.</t>
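
        <t>As a non-normative illustration, the following Python sketch
        traces the 'C' markings of packet U through the walk-through above.
        The Pdu class and the function names are hypothetical, introduced
        only for this example. The sketch assumes that every PDU is an
        ECN-PDU and that each protocol's encoding can be reduced to a single
        flag per header, which a real design would not be able to assume.</t>

        <figure align="center">
          <artwork><![CDATA[
```python
from dataclasses import dataclass


@dataclass
class Pdu:
    """Hypothetical PDU with one congestion flag per header layer."""
    l3_marked: bool = False  # 'C' in the L3 (IP) header
    l2_marked: bool = False  # 'C' in the current L2 header


def switch_forward(pdu, congested):
    """An L2 switch marks the L2 header if its output is congested."""
    if congested:
        pdu.l2_marked = True
    return pdu


def router_decapsulate(pdu):
    """Subnet egress: propagate the L2 marking up into L3."""
    pdu.l3_marked = pdu.l3_marked or pdu.l2_marked
    pdu.l2_marked = False  # the marked L2 header has been removed
    return pdu


def router_encapsulate(pdu, subnet_supports_marking):
    """Subnet ingress: copy the L3 marking into the new L2 header."""
    pdu.l2_marked = pdu.l3_marked and subnet_supports_marking
    return pdu


def traverse():
    """Return (l3_marked, l2_marked) at three points on the path."""
    pdu = Pdu()
    trace = []
    pdu = switch_forward(pdu, congested=True)   # first switch (*)
    trace.append((pdu.l3_marked, pdu.l2_marked))
    pdu = switch_forward(pdu, congested=False)  # second switch
    pdu = router_decapsulate(pdu)
    trace.append((pdu.l3_marked, pdu.l2_marked))
    pdu = router_encapsulate(pdu, subnet_supports_marking=True)
    trace.append((pdu.l3_marked, pdu.l2_marked))
    return trace
```
]]></artwork>
        </figure>

        <t>The three recorded states correspond to the second, third and
        fourth data packets drawn along the bottom of the figure.</t>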

        <t>Note that the FECN (forward ECN) bit in Frame Relay and the
        explicit forward congestion indication (EFCI <xref
        target="ITU-T.I.371"></xref>) bit in ATM user data cells follow a
        feed-forward pattern. However, in ATM, this is only as part of a
        feed-forward-and-backward pattern at the lower layer, not
        feed-forward-and-up out of the lower layer; the intention was
        never to interface to IP ECN at the subnet egress. To our knowledge,
        Frame Relay FECN is solely used to detect where more capacity should
        be provisioned <xref target="Buck00"></xref>.</t>
      </section>

      <section anchor="ecnencap_Up" title="Feed-Up-and-Forward Mode">
        <t>Ethernet is particularly difficult to extend incrementally to
        support explicit congestion notification. One way to support ECN in
        such cases has been to use so-called 'layer-3 switches'. These are
        Ethernet switches that delve into the Ethernet payload to find an IP
        header and manipulate or act on certain IP fields (specifically
        Diffserv &amp; ECN). For instance, in Data Center TCP <xref
        target="DCTCP"></xref>, layer-3 switches are configured to mark the
        ECN field of the IP header within the Ethernet payload when their
        output buffer becomes congested. With respect to switching, a layer-3
        switch acts solely on the addresses in the Ethernet header; it doesn't
        use IP addresses, and it doesn't decrement the TTL field in the IP
        header.</t>

        <figure align="center" anchor="ecnencap_Fig_Feed-Up"
                title="Feed-Up-and-Forward Mode">
          <artwork><![CDATA[                     _ _ _ 
          /_______  | | |C|  ACK packet (V)
          \         |_|_|_|
 +---+        layer: 2 3 4 header                            +---+
 |  <|<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< Packet V <<<<<<<<<<<<<|<< |L4
 |   |                         +---+                         | ^ |
 |   | . . .  >>>> Packet U >>>|>>>|>>> Packet U >>>>>>>>>>>>|>^ |L3
 |   |     +--^+     +---+     |   |     +---+     +---+     |   |
 |   |     |  *|     |   |     |   |     |   |     |   |     |   |L2
 |___|_____|___|_____|___|_____|___|_____|___|_____|___|_____|___|
 source          subnet E      router       subnet F         dest
     __ _ _ _    __ _ _ _    __ _ _        __ _ _ _
    |  | | | |  |  | |C| |  |  | |C|      |  | |C|C|  data________\
    |__|_|_|_|  |__|_|_|_|  |__|_|_|      |__|_|_|_|  packet (U)  /
 layer: 4 3 2       4 3 2       4 3           4 3 2
 header]]></artwork>
        </figure>

        <t>By comparing <xref target="ecnencap_Fig_Feed-Up"></xref> with <xref
        target="ecnencap_Fig_Feed-Forward-and-Up"></xref>, it can be seen that
        subnet E (perhaps a subnet of layer-3 Ethernet switches) works in
        feed-up-and-forward mode by notifying congestion directly into L3 at
        the point of congestion, even though the congested switch does not
        otherwise act at L3. In this example, the technology in subnet F (e.g.
        MPLS) does support ECN natively, so when the router adds the layer-2
        header it copies the ECN marking from L3 to L2 as well.</t>
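
        <t>The marking step performed by such a layer-3 switch can be
        sketched as follows. This is an illustration only, not taken from
        any implementation: the function names are hypothetical, and the
        sketch assumes an untagged Ethernet frame (no VLAN tag) carrying
        IPv4. It sets the two-bit ECN field (the low-order bits of the
        second octet of the IPv4 header) to CE and recomputes the header
        checksum, leaving the TTL and addresses untouched. Returning None
        for a packet that cannot be marked is also an assumption; the
        switch would then fall back to whatever its queue management
        would otherwise do.</t>

        <figure align="center">
          <artwork><![CDATA[
```python
import struct

ETH_HEADER_LEN = 14        # assumes no VLAN tag
ETHERTYPE_IPV4 = 0x0800
NOT_ECT, CE = 0b00, 0b11   # ECN codepoints from RFC 3168


def ipv4_checksum(header):
    """Ones-complement checksum over an IPv4 header."""
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def mark_ce(frame):
    """Return a copy of frame with CE set, or None if not markable."""
    ethertype = struct.unpack_from("!H", frame, 12)[0]
    if ethertype != ETHERTYPE_IPV4:
        return None
    ip = ETH_HEADER_LEN
    ihl = (frame[ip] & 0x0F) * 4          # IPv4 header length in octets
    if frame[ip + 1] & 0b11 == NOT_ECT:
        return None                       # transport is not ECN-capable
    out = bytearray(frame)
    out[ip + 1] |= CE                     # set both ECN bits
    out[ip + 10:ip + 12] = b"\x00\x00"    # zero checksum before recompute
    checksum = ipv4_checksum(bytes(out[ip:ip + ihl]))
    struct.pack_into("!H", out, ip + 10, checksum)
    return bytes(out)
```
]]></artwork>
        </figure>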
      </section>

      <section anchor="ecnencap_Backward" title="Feed-Backward Mode">
        <t>In some layer 2 technologies, explicit congestion notification has
        been defined for use internally within the subnet with its own
        feedback and load regulation, but typically the interface with IP for
        ECN has not been defined.</t>

        <t>For instance, for the available bit-rate (ABR) service in ATM, the
        relative rate mechanism was one of the more popular mechanisms for
        managing traffic, tending to supersede earlier designs. In this
        approach ATM switches send special resource management (RM) cells in
        both the forward and backward directions to control the ingress rate
        of user data into a virtual circuit. If a switch buffer is approaching
        congestion or congested it sends an RM cell back towards the ingress
        with respectively the No Increase (NI) or Congestion Indication (CI)
        bit set in its message type field <xref target="ATM-TM-ABR"></xref>.
        The ingress then holds or decreases its sending bit-rate
        accordingly.</t>
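
        <t>The ingress response just described might be sketched as below.
        The function name and all of the numeric parameters (step,
        decrease_factor, min_rate, peak_rate) are illustrative assumptions
        and are not taken from the ATM traffic management specification;
        only the hold-on-NI and decrease-on-CI behaviour comes from the
        description above.</t>

        <figure align="center">
          <artwork><![CDATA[
```python
def next_rate(rate, ni, ci, step=1.0, decrease_factor=0.5,
              min_rate=1.0, peak_rate=100.0):
    """Return the ingress sending rate after one backward RM cell.

    ni, ci: the No Increase and Congestion Indication bits.
    The remaining parameters are hypothetical tuning values.
    """
    if ci:
        # Congested: decrease the rate, but not below the floor.
        return max(min_rate, rate * decrease_factor)
    if ni:
        # Approaching congestion: hold the current rate.
        return rate
    # No congestion signalled: probe upwards, capped at the peak rate.
    return min(peak_rate, rate + step)
```
]]></artwork>
        </figure>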

        <figure align="center" anchor="ecnencap_Fig_Feed-Backward"
                title="Feed-Backward Mode">
          <artwork><![CDATA[                     _ _ _ 
          /_______  | | |C|  ACK packet (X)
          \         |_|_|_|
 +---+        layer: 2 3 4 header                            +---+
 |  <|<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< Packet X <<<<<<<<<<<<<|<< |L4
 |   |                         +---+                         | ^ |
 |   |                         |  *|>>> Packet W >>>>>>>>>>>>|>^ |L3
 |   |     +---+     +---+     |   |     +---+     +---+     |   |
 |   |     |   |     |   |     |  <|<<<<<|<<<|<(V)<|<<<|     |   |L2
 |   | . . | . |Packet U | . . | . | . . | . | . . | .*| . . |   |L2
 |___|_____|___|_____|___|_____|___|_____|___|_____|___|_____|___|
 source          subnet G      router       subnet H         dest
     __ _ _ _    __ _ _ _    __ _ _        __ _ _ _   later
    |  | | | |  |  | | | |  |  | | |      |  | |C| |  data________\
    |__|_|_|_|  |__|_|_|_|  |__|_|_|      |__|_|_|_|  packet (W)  /
        4 3 2       4 3 2       4 3           4 3 2
                                        _        
                                  /__  |C|  Feedback control
                                  \    |_|  cell/frame (V)
                                        2    
     __ _ _ _    __ _ _ _    __ _ _        __ _ _ _   earlier
    |  | | | |  |  | | | |  |  | | |      |  | | | |  data________\
    |__|_|_|_|  |__|_|_|_|  |__|_|_|      |__|_|_|_|  packet (U)  /
layer:  4 3 2       4 3 2       4 3           4 3 2
header
]]></artwork>
        </figure>

        <t>ATM's feed-backward approach doesn't fit well when layered beneath
        IP's feed-forward approach, unless the initial data source is the
        same node as the ATM ingress. <!--(which would be the case if ATM had achieved its aspiration of becoming the global internetwork standard, 
rather than just a subnetwork technology)--><xref
        target="ecnencap_Fig_Feed-Backward"></xref> shows the feed-backward
        approach being used in subnet H. If the final switch on the path is
        congested (*), it doesn't feed-forward any congestion indications on
        packet (U). Instead it sends a control cell (V) back to the router at
        the ATM ingress.</t>

        <t>However, the backward feedback doesn't reach the original data
        source directly because IP doesn't support backward feedback (and
        subnet G is independent of subnet H). Instead, the router in the
        middle throttles down its sending rate but the original data source
        doesn't reduce its rate. The resulting rate mismatch causes the middle
        router's buffer at layer 3 to back up until it becomes congested,
        which it signals forwards on later data packets at layer 3 (e.g.
        packet W). Note that the forward signal from the middle router is not
        triggered directly by the backward signal. Rather, it is triggered by
        congestion resulting from the middle router's mismatched rate response
        to the backward signal.</t>
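
        <t>A toy model may help to show why the forward signal is delayed.
        In the sketch below, the source keeps sending at a constant rate
        while the middle router, having been throttled by the backward
        signal, drains its layer-3 queue more slowly. The function name,
        the per-tick fluid model and the marking threshold are all
        assumptions made for illustration; real queue management
        algorithms are outside the scope of this document.</t>

        <figure align="center">
          <artwork><![CDATA[
```python
def first_forward_mark(source_rate, throttled_rate, mark_threshold,
                       max_ticks=1000):
    """Return the first tick at which the router's L3 queue exceeds
    mark_threshold, or None if it never does within max_ticks.

    Rates are in packets per tick; the queue starts empty.
    """
    queue = 0
    for tick in range(1, max_ticks + 1):
        queue += source_rate                  # arrivals from subnet G
        queue -= min(queue, throttled_rate)   # departures into subnet H
        if queue > mark_threshold:
            return tick                       # router starts marking
    return None
```
]]></artwork>
        </figure>

        <t>For example, with a source rate of 10, a throttled rate of 6
        and a threshold of 20, the queue grows by 4 packets per tick and
        first exceeds the threshold on tick 6. The forward signal
        therefore lags the backward signal by however long the mismatch
        takes to fill the queue.</t>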

        <t>In response to this later forward signalling, end-to-end feedback
        at layer-4 finally completes the tortuous path of congestion
        indications back to the origin data source, as before.</t>

        <!--To summarise so far, feeding congestion notification backwards can reach the source faster, but only 
if the congested subnet is directly connected to the original data source. In a more general case, 
feedback takes a tortuous path part-way backwards, which can lead to queuing at the higher layer in 
the middle of the network, which can in turn trigger a much-delayed feed-forward signal, which then 
has to be fed back from destination to source.-->
      </section>

      <section title="Null Mode">
        <t>Often link and physical layer resources are 'non-blocking' by
        design. In these cases congestion notification may be implemented but
        it does not need to be deployed at the lower layer; ECN in IP would be
        sufficient.</t>

        <t>A degenerate example is a point-to-point Ethernet link. Excess
        loading of the link merely causes the queue from the higher layer to
        back up, while the lower layer remains immune to congestion. Even a
        whole meshed subnetwork can be made immune to interior congestion by
        limiting ingress capacity and careful sizing of links, particularly if
        multi-path routing is used to ensure even worst-case patterns of load
        cannot congest any link.</t>
      </section>
    </section>

    <section anchor="ecnencap_Guidelines_Forward"
             title="Feed-Forward-and-Up Mode: Guidelines for Adding Congestion Notification">
      <t>These guidelines are consistent with the guidelines on the design of
      alternate schemes for IP tunnelling of the ECN field <xref
      target="RFC6040"></xref> and the more general best current practice for
      the design of alternate ECN schemes given in <xref
      target="RFC4774"></xref>.</t>

      <t>The capitalised term 'SHOULD (NOT)' has often been used in preference
      to 'MUST (NOT)' because it is difficult to know the compromises that
      will be necessary in each protocol design. If a particular protocol
      design chooses to contradict a 'SHOULD (NOT)' given in the advice below,
      it MUST include a sound justification.</t>

      <section anchor="ecnencap_WireProtocolECNSupport"
               title="Wire Protocol Design: Indication of ECN Support">
        <t>A lower layer (or subnet) congestion notification protocol<list
            style="numbers">
            <t>SHOULD NOT apply explicit congestion notifications to PDUs that
            are destined for legacy layer-4 transport implementations that
            will not understand ECN, and</t>

            <t>SHOULD NOT apply explicit congestion notifications to PDUs that
            are destined for a legacy subnet egress that will fail to
            propagate them onward into the higher layer.<vspace
            blankLines="1" />We use the term ECN-PDU for a PDU on a feedback
            loop that will propagate congestion notification properly because
            it meets both these criteria. And a Not-ECN-PDU is a PDU on a
            feedback loop that does not meet both criteria, and will therefore
            not propagate congestion notification properly. A corollary of the
            above is that a lower layer congestion notification protocol:</t>

            <t>SHOULD be able to distinguish ECN-PDUs from Not-ECN-PDUs.</t>
          </list></t>

        <t>In IP, if the ECN field in each PDU is cleared to the Not-ECT (not
        ECN-capable transport) codepoint, it indicates that the L4 transport
        will not understand congestion markings. A congested buffer must not
        mark these Not-ECT PDUs, and therefore has to drop some. The mechanism
        a lower layer uses to distinguish the ECN-capability of PDUs need not
        mimic that of IP, but it should achieve the same outcome. For
        instance, ECN-capable feedback loops might use PDUs that are
        identified by a particular set of labels or tags. Alternatively,
        logical link protocols that use flow state might determine whether a
        PDU can be congestion marked by checking for ECN-support in the flow
        state.</t>
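
        <t>As a rough sketch of the outcome described above, the following
        Python fragment shows what a congested buffer might do once its queue
        management algorithm has selected a PDU to carry a congestion signal.
        The Pdu class, its ecn_pdu flag and the function name are assumptions
        made purely for illustration; how a real lower layer identifies an
        ECN-PDU (a header field, a label or tag, or flow state) is left
        open.</t>

        <figure><artwork><![CDATA[
```python
from dataclasses import dataclass


@dataclass
class Pdu:
    ecn_pdu: bool          # hypothetical flag: True for an ECN-PDU
    marked: bool = False   # explicit congestion indication


def signal_congestion(pdu):
    """Return the PDU to forward, or None if it is dropped."""
    if pdu.ecn_pdu:
        # The feedback loop will propagate an explicit mark.
        pdu.marked = True
        return pdu
    # Not-ECN-PDU: drop is the only signal that will be understood.
    return None
```
]]></artwork></figure>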

        <t>The per-domain checking of ECN support in MPLS <xref
        target="RFC5129"></xref> is a good example of a way to avoid sending
        congestion markings to transports that will not understand
        them, without using any header space in the subnet protocol.</t>

        <t>In MPLS, header space is extremely limited, so RFC 5129 does
        not provide a field in the MPLS header to indicate whether the PDU is
        an ECN-PDU or a Not-ECN-PDU. Instead, interior nodes in a domain are
        allowed to set explicit congestion indications without checking
        whether the PDU is destined for a transport that will understand them.
        Nonetheless, this is made safe by requiring that the network operator
        upgrades all decapsulating edges of a whole domain at once, if any
        switch within the domain is configured to mark rather than drop during
        congestion. Therefore, there will be an implementation of an
        ECN-capable decapsulator on any edge node that might decapsulate a
        packet, which will check whether the higher layer transport is
        ECN-capable. When decapsulating a CE-marked packet, if the
        decapsulator discovers that the higher layer (inner header) indicates
        the transport is not ECN-capable, it drops the packet on behalf of the
        earlier congested node (see Decapsulation Guideline <xref
        target="ecnencap_dropNot-ECTinnerCEouter"></xref> in <xref
        target="ecnencap_DecapGuidelines"></xref>).</t>

        <t>Note that it was only appropriate to define such an incremental
        deployment strategy because MPLS is targeted solely at professional
        operators, who can be expected to ensure that a whole subnetwork is
        consistently configured. This strategy might not be appropriate for
        other link technologies targeted at zero-configuration deployment or
        deployment by the general public (e.g. Ethernet). For such
        'plug-and-play' environments it will be necessary to invent a failsafe
        approach that ensures congestion markings will never fall into black
        holes, no matter how inconsistently a system is put together.
        Alternatively, congestion notification relying on correct system
        configuration could be confined to flavours of Ethernet intended only
        for professional network operators, such as IEEE 802.1ah Provider
        Backbone Bridges (PBB).</t>

        <t>Note that these guidelines do not require the subnet wire protocol
        to be changed at all to accommodate congestion notification. Another
        way to add congestion notification without consuming header space in
        the subnet protocol might be to use a control plane protocol in
        parallel.</t>
      </section>

      <section anchor="ecnencap_EncapGuidelines"
               title="Encapsulation Guidelines">
        <t><list style="numbers">
            <t>Egress Capability Check: A subnet ingress needs to be sure that
            the corresponding egress of a subnet will propagate any congestion
            notification added to the outer header across the subnet. This is
            necessary in addition to checking that an incoming PDU indicates
            an ECN-capable (L4) transport. Examples of how this guarantee
            might be provided include:<list style="symbols">
                <t>by configuration (e.g. if any label switches in a domain
                support ECN marking, <xref target="RFC5129"></xref> requires
                all egress nodes to have been configured to propagate ECN)</t>

                <t>by the ingress explicitly checking that the egress
                propagates ECN (e.g. TRILL uses IS-IS to check path
                capabilities before using critical options <xref
                target="trill-rbridge-options"></xref>)</t>

                <t>by inherent design of the protocol (e.g. by encoding ECN
                marking on the outer header in such a way that a legacy egress
                that does not understand ECN will consider the PDU corrupt and
                discard it, thus at least propagating a form of congestion
                signal).</t>
              </list>If the ingress cannot guarantee that the egress will
            propagate congestion notification, the ingress SHOULD disable ECN
            when it forwards the PDU at the lower layer. An example of how the
            ingress might disable ECN at the lower layer would be by setting
            the outer header of the PDU to identify it as a Not-ECN-PDU.</t>

            <t anchor="ecnencap_Encap_Copy">Standard Congestion Monitoring
            Baseline: Once the ingress to a subnet has established that the
            egress will correctly propagate ECN, on encapsulation it SHOULD
            encode the same level of congestion in outer headers as is
            arriving in incoming headers. For example, it could copy any
            incoming congestion notification into the outer header of the
            lower layer protocol.<vspace blankLines="1" />This ensures that
            all outer headers reflect congestion accumulated along the whole
            upstream path, not just since the ingress of the subnet. More
            precisely, congestion notifications in outer headers SHOULD
            reflect congestion experienced along the whole path since the node
            that regulates the load for that path (the Load Regulator,
            typically the data source) and no other node should re-initialise
            the amount of CE markings to zero along the way. <vspace
            blankLines="1" />This guideline is intended to ensure that any
            bulk congestion monitoring of outer headers (e.g. by a network
            management node monitoring ECN in passing frames) is most
            meaningful. For instance, if an operator measures CE in 0.4% of
            passing packets, this information is only useful if the operator
            knows where the proportion of CE markings was last initialised to
            0% (the Congestion Baseline). Such monitoring information will not
            be useful if some subnet ingress nodes reset all outer CE markings
            while others copy incoming CE markings into the outer.<vspace
            blankLines="1" />The most information can be extracted if the
            Congestion Baseline is standardised at the node that is regulating
            the load (the Load Regulator, typically the data source).
            Then the operator can measure both congestion since the Load
            Regulator, and congestion since the subnet ingress. The latter can
            be measured by subtracting the level of CE markings on inner
            headers from that on outer headers.<!--{ToDo: It may be safe to assume a subnetwork technology will not span a trust boundary. 
Especially if copy on encap is not desirable, e.g. if using Floyd's 1-bit MPLS scheme.}

{ToDo - either make this a separate case, move it to modes, or delete it} 
In some circumstances (e.g. pseudowire emulations with link-local flow control), the whole 
path is divided into segments, each with its own congestion notification and feedback loop. 
In these cases, the function that regulates load at the start of each segment will need to 
reset congestion notification (i.e. clear any accumulated congestion notifications) at the 
start of its segment.
--></t>
          </list></t>
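
        <t>The two encapsulation guidelines above can be sketched as a single
        function. This is only an illustration under simplifying assumptions:
        the Cn enumeration, the function name and the egress_propagates flag
        are hypothetical, and in practice the flag would be established by
        configuration, by an explicit capability check or by protocol design,
        as listed in the first guideline.</t>

        <figure><artwork><![CDATA[
```python
from enum import IntEnum


class Cn(IntEnum):
    """Hypothetical congestion notification states of a header."""
    NOT_ECN = 0   # Not-ECN-PDU
    ECN = 1       # ECN-PDU, no congestion indicated
    CE = 2        # ECN-PDU, congestion experienced


def encapsulate(inner, egress_propagates):
    """Return the state to encode in the outer header."""
    if not egress_propagates:
        # Guideline 1: no guarantee that the egress propagates
        # congestion notification, so disable ECN at the lower layer.
        return Cn.NOT_ECN
    # Guideline 2: encode the same level of congestion as arrived,
    # here by copying. A Not-ECN inner also yields a Not-ECN outer.
    return inner
```
]]></artwork></figure>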
      </section>

      <section anchor="ecnencap_DecapGuidelines"
               title="Decapsulation Guidelines">
        <t>A subnet egress SHOULD NOT simply copy congestion notification from
        outer headers to the forwarded header. It SHOULD calculate the
        outgoing congestion notification field from the inner and outer
        headers, using the following rules. If there is any conflict, rules
        earlier in the list take precedence over rules later in the list:<list
            style="numbers">
            <t anchor="ecnencap_dropNot-ECTinnerCEouter">If the arriving inner
            header is a Not-ECN-PDU it implies the L4 transport will not
            understand explicit congestion markings. Then:<list
                style="symbols">
                <t>If the outer header carries an explicit congestion marking,
                the packet SHOULD be dropped, because drop is the only
                indication of congestion that the L4 transport will
                understand.</t>

                <!--{ToDo: RFC6040 allows forwarding if it is not the most severe marking.}-->

                <t>If the outer is either an ECN-PDU that carries no
                indication of congestion or a Not-ECN-PDU, the PDU SHOULD be
                forwarded, but still as a Not-ECN-PDU.</t>
              </list></t>

            <t>If the outer header does not support explicit congestion
            notification (a Not-ECN-PDU), but the inner header does (an
            ECN-PDU), the inner header SHOULD be forwarded unchanged.</t>

            <t>In some lower layer protocols congestion may be signalled as a
            numerical level, such as in the control frames of quantised
            congestion notification <xref target="IEEE802.1Qau"></xref>. If
            such an encoding encapsulates an ECN-capable IP packet, a function
            will be needed to convert the quantised congestion level into the
            frequency of congestion markings in outgoing IP packets.</t>

            <t>Congestion indications may be encoded by a severity level. For
            instance increasing levels of congestion might be encoded by
            numerically increasing indications, e.g. pre-congestion
            notification (PCN) can be encoded in each PDU at three severity
            levels in IP or MPLS <xref target="RFC6660"></xref>.<vspace
            blankLines="1" />If the arriving inner header is an ECN-PDU, where
            the inner and outer headers carry indications of congestion of
            different severity, the more severe indication SHOULD be forwarded
            in preference to the less severe. Obviously, if the severities in
            both inner and outer are the same, the same severity should be
            forwarded.</t>

            <t>The inner and outer headers might carry a combination of
            congestion notification fields that should not be possible given
            any currently used protocol transitions. For instance, if
            Encapsulation Guideline <xref target="ecnencap_Encap_Copy"></xref>
            in <xref target="ecnencap_EncapGuidelines"></xref> had been
            followed, it should not be possible to have a less severe
            indication of congestion in the outer than in the inner. It MAY be
            appropriate to log unexpected combinations of headers and possibly
            raise an alarm. If a safe outgoing codepoint can be defined for
            such a PDU, the PDU SHOULD be forwarded rather than dropped.
            <vspace blankLines="1" />Some implementers discard PDUs with
            currently unused combinations of headers just in case they
            represent an attack. However, an approach using alarms and
            policy-mediated drop is preferable to hard-coded drop, so that
            operators can keep track of possible attacks but currently unused
            combinations are not precluded from future use through new
            standards actions.</t>
          </list></t>
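
        <t>The precedence of the rules above might be expressed as in the
        following sketch. It makes several assumptions that are not part of
        these guidelines: a Not-ECN-PDU is represented as None, an ECN-PDU as
        an integer severity (0 meaning no congestion indicated), and the names
        are hypothetical. The conversion of a quantised congestion level into
        a marking frequency (the third rule) is not modelled.</t>

        <figure><artwork><![CDATA[
```python
import logging

NOT_ECN = None   # assumed representation of a Not-ECN-PDU
DROP = "drop"    # assumed return value meaning 'discard the PDU'


def decapsulate(inner, outer):
    """Return the outgoing severity, NOT_ECN, or DROP."""
    # Rule 1: the L4 transport will not understand explicit marks.
    if inner is NOT_ECN:
        if outer is not NOT_ECN and outer > 0:
            return DROP
        return NOT_ECN
    # Rule 2: the outer does not support congestion notification.
    if outer is NOT_ECN:
        return inner
    # Rule 5: an outer less severe than the inner is unexpected if
    # the encapsulator copied; log it, but still forward.
    if outer < inner:
        logging.warning("outer %d less severe than inner %d",
                        outer, inner)
    # Rule 4: forward the more severe indication.
    return max(inner, outer)
```
]]></artwork></figure>

        <t>In this sketch the unexpected combination is forwarded with the
        inner (more severe) indication, which is one possible choice of safe
        outgoing codepoint; whether to drop instead would be a matter of
        operator policy rather than hard-coded behaviour.</t>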
      </section>

      <section title="Reframing and Congestion Markings">
        <t>Where framing boundaries are different between two layers,
        congestion indications SHOULD be propagated on the basis that a
        congestion indication on a PDU applies to all the octets in the PDU.
        On average, an encapsulator or decapsulator SHOULD approximately
        preserve the number of marked octets arriving and leaving (counting
        the size of inner headers, but not added encapsulating headers).</t>

        <t>The next departing frame SHOULD be immediately marked even if only
        enough incoming marked octets have arrived for part of the departing
        frame. This ensures that any outstanding congestion-marked octets are
        propagated immediately, rather than held back waiting for a frame no
        bigger than the outstanding marked octets, which might involve a
        long wait.</t>

        <t>For instance, an algorithm for marking departing frames could
        maintain a counter representing the balance of arriving marked octets
        minus departing marked octets. It adds the size of every marked frame
        that arrives and if the counter is positive it marks the next frame to
        depart and subtracts its size from the counter. This will often leave
        a negative remainder in the counter, which is deliberate.</t>
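
        <t>A minimal sketch of such a counter is given below. The class and
        method names are hypothetical, and sizes are assumed to be supplied in
        octets, counting inner headers but not added encapsulating
        headers.</t>

        <figure><artwork><![CDATA[
```python
class ReframeMarker:
    """Balance of arriving marked octets minus departing ones."""

    def __init__(self):
        self.balance = 0

    def arrive(self, size, marked):
        """Account for an arriving frame of 'size' octets."""
        if marked:
            self.balance += size

    def depart(self, size):
        """Return True if the departing frame should be marked."""
        if self.balance > 0:
            # Mark immediately, even if fewer marked octets are
            # outstanding than the frame holds; the balance may go
            # negative, which is deliberate.
            self.balance -= size
            return True
        return False
```
]]></artwork></figure>

        <t>For example, if one marked 100-octet frame arrives and the next
        departing frame holds 1500 octets, that frame is marked and the
        balance becomes -1400, so no further departing frame is marked until
        more than 1400 further marked octets have arrived.</t>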
      </section>
    </section>

    <section anchor="ecnencap_Guidelines_Up"
             title="Feed-Up-and-Forward Mode: Guidelines for Adding Congestion Notification">
      <t>Marking the IP header while switching at layer-2 (by using a layer-3
      switch) seems to represent a layering violation. However, it can be
      considered as a benign optimisation if the guidelines below are
      followed. Feed-up-and-forward is certainly not a general alternative to
      implementing feed-forward congestion notification in the lower layer,
      because:<list style="symbols">
          <t>IPv4 and IPv6 are not the only layer-3 protocols that might be
          encapsulated by lower layer protocols</t>

          <t>Link-layer encryption might be in use, making the layer-2 payload
          inaccessible</t>

          <t>Many Ethernet switches do not have 'layer-3 switch' capabilities
          so they cannot read and modify an IP payload</t>

          <t>It might be costly to find an IP header (v4 or v6) when it may be
          encapsulated by more than one Ethernet header (e.g. when using
          multiple encapsulations of MAC in MAC <xref
          target="IEEE802.1Qah"></xref>).</t>
        </list></t>

      <t>Nonetheless, configuring a layer-3 switch to look for an ECN field in
      an encapsulated IP header is a useful optimisation. If the
      implementation follows the guidelines below, this optimisation does not
      have to be confined to a controlled environment such as within a data
      centre; it could usefully be applied on any network, even if the
      operator cannot be sure that none of the above issues will apply:<list
          style="numbers">
          <t>If a native lower-layer congestion notification mechanism exists
          for a subnet technology, it is safe to mix feed-up-and-forward with
          feed-forward-and-up on other switches in the same subnet. However,
          it will generally be more efficient to use the native mechanism.</t>

          <t>The depth of search for an IP header SHOULD be limited. If an IP
          header is not found soon enough, or an unrecognised or unreadable
          header is encountered, the switch SHOULD resort to an alternative
          means of signalling congestion (e.g. drop, or the native lower layer
          mechanism if available).</t>

          <t>It is sufficient to use the first IP header found in the stack;
          the egress of the relevant tunnel can propagate congestion
          notification upwards to any more deeply encapsulated IP headers
          later.</t>
        </list></t>
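
      <t>The second and third guidelines could be sketched as follows. The
      representation of a frame as a list of header dictionaries, the set of
      recognised header types, the depth limit of 4 and the return values are
      all assumptions made for illustration; these guidelines do not specify
      any of them. The check for Not-ECT reflects the general rule that a
      packet from a transport that is not ECN-capable must not be marked.</t>

      <figure><artwork><![CDATA[
```python
MAX_DEPTH = 4   # assumed limit; no specific value is recommended

IP_TYPES = {"ipv4", "ipv6"}
KNOWN_L2 = {"ethernet", "vlan", "mac-in-mac"}   # illustrative only


def find_ip_header(headers, max_depth=MAX_DEPTH):
    """Return the index of the first IP header, or None."""
    for depth, header in enumerate(headers[:max_depth]):
        if header["type"] in IP_TYPES:
            return depth          # first IP header is sufficient
        if header["type"] not in KNOWN_L2:
            return None           # unrecognised or unreadable
    return None                   # not found soon enough


def mark_or_fall_back(headers):
    """Return 'marked', 'drop' or 'fallback' for a congested PDU."""
    idx = find_ip_header(headers)
    if idx is None:
        # E.g. drop, or the native lower layer mechanism.
        return "fallback"
    ip = headers[idx]
    if ip["ecn"] == "not-ect":
        return "drop"             # transport is not ECN-capable
    ip["ecn"] = "ce"
    return "marked"
```
]]></artwork></figure>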
    </section>

    <section anchor="ecnencap_Guidelines_Backward"
             title="Feed-Backward Mode: Guidelines for Adding Congestion Notification">
      <t>It can be seen from <xref target="ecnencap_Backward"></xref> that
      congestion notification in a subnet using feed-backward mode has
      generally not been designed to be directly coupled with IP layer congestion
      notification. The subnet attempts to minimise congestion internally, and
      if the incoming load at the ingress exceeds capacity through the subnet,
      the layer 3 buffer into the ingress backs up. Thus, a feed-backward mode
      subnet is in some sense similar to a null mode subnet, in that there is
      no need for any direct interaction between the subnet and higher layer
      congestion notification. Therefore no detailed protocol design
      guidelines are appropriate. Nonetheless, a more general guideline is
      appropriate: <list style="numbers">
          <t>A subnetwork technology intended to eventually interface to IP
          SHOULD NOT be designed using only the feed-backward mode, which is
          certainly best for a stand-alone subnet, but would need to be
          modified to work efficiently as part of the wider Internet, because
          IP uses feed-forward-and-up mode.</t>
        </list></t>

      <t>The feed-backward approach does at least work beneath IP, but it can
      result in very inefficient and sluggish congestion control, except
      if it is confined to the subnet directly connected to the original data
      source, when it is faster than feed-forward. It would be possible to
      design a protocol that could work in feed-backward mode for paths that
      only cross one subnet, and in feed-forward-and-up mode for paths that
      cross more than one subnet.</t>

      <t>In the early days of TCP/IP, a similar feed-backward approach was
      tried for explicit congestion signalling, using source-quench (SQ) ICMP
      control packets. However, SQ fell out of favour and is now formally
      deprecated <xref target="RFC6633"></xref>. The main problem was that it
      is hard for a data source to tell the difference between a spoofed SQ
      message and a quench request from a genuine buffer on the path. It is
      also hard for a lower layer buffer to address an SQ message to the
      original source, which may be buried within many layers of headers, and
      possibly encrypted.</t>

      <t>Quantised congestion notification (QCN—also known as backward
      congestion notification or BCN) <xref target="IEEE802.1Qau"></xref> uses
      a feed-backward mode very similar to ATM. However, QCN confines its
      applicability to scenarios where all endpoints are directly attached by
      the same Ethernet technology, and is used, for example, in storage area
      networks (SANs). If a QCN subnet were connected into a wider IP-based
      internetwork (e.g. when attempting to interconnect SANs within multiple
      data centres) it would suffer the same inefficiency as shown in <xref
      target="ecnencap_Fig_Feed-Backward"></xref>.</t>
    </section>

    <!-- ================================================================ -->

    <!-- ================================================================ -->

    <section anchor="ecnencap_IANA_Considerations" title="IANA Considerations">
      <t>This memo includes no request to IANA.</t>
    </section>

    <!-- ================================================================ -->

    <section anchor="ecnencap_Security_Considerations"
             title="Security Considerations">
      <t>{TBA}</t>
    </section>

    <!-- ================================================================ -->

    <section anchor="ecnencap_Conclusions" title="Conclusions">
      <t>{TBA}</t>
    </section>

    <!-- ================================================================ -->

    <section anchor="ecnencap_Acknowledgements" title="Acknowledgements">
      <t>Thanks to Gorry Fairhurst for extensive initial review.</t>

      <t>Bob Briscoe produced early drafts while partly funded by Trilogy, a
      research project (ICT-216372) supported by the European Community under
      its Seventh Framework Programme. The views expressed here are those of
      the author only.</t>
    </section>

    <!-- ================================================================ -->

    <section anchor="ecnencap_Comments_Solicited" title="Comments Solicited">
      <t>Comments and questions are encouraged and very welcome. They can be
      addressed to the IETF Transport Area working group mailing list
      &lt;tsvwg@ietf.org&gt;, and/or to the authors.</t>
    </section>
  </middle>

  <back>
    <!-- ================================================================ -->

    <references title="Normative References">
      <?rfc include="reference.RFC.2119" ?>

      <?rfc include='reference.RFC.3168'?>

      <?rfc include='reference.RFC.3819'?>

      <?rfc include='reference.RFC.4774'?>
    </references>

    <references title="Informative References">
      <?rfc include='reference.RFC.2003'?>

      <?rfc include='reference.RFC.2637'?>

      <?rfc include='reference.RFC.2661'?>

      <?rfc include='reference.RFC.2784'?>

      <?rfc include='reference.RFC.2884'?>

      <?rfc include='reference.RFC.1701'?>

      <?rfc include='reference.RFC.4301'?>

      <?rfc include='reference.RFC.5129'?>

      <?rfc include='reference.RFC.6040'?>

      <?rfc include='reference.RFC.6633'?>

      <?rfc include='reference.RFC.6660'?>

      <?rfc include='localref.I-D.ietf-trill-rbridge-options'?>

      <?rfc include='localref.IEEE802.1Qah.MACinMAC'?>

      <?rfc include='localref.IEEE802.1QauCongNotif'?>

      <?rfc include='localref.ITU-T.I.371_ATMTrafficMgmt'?>

      <?rfc include='localref.Alizadeh10.DCTCP'?>

      <?rfc include='localref.Buckwalter00.FrameRelay'?>

      <reference anchor="GTPv1">
        <front>
          <title>GPRS Tunnelling Protocol (GTP) across the Gn and Gp
          interface</title>

          <author>
            <organization>3GPP</organization>
          </author>

          <date />
        </front>

        <seriesInfo name="Technical Specification" value="TS 29.060" />
      </reference>

      <reference anchor="GTPv1-U">
        <front>
          <title>General Packet Radio System (GPRS) Tunnelling Protocol User
          Plane (GTPv1-U)</title>

          <author>
            <organization>3GPP</organization>
          </author>

          <date />
        </front>

        <seriesInfo name="Technical Specification" value="TS 29.281" />
      </reference>

      <reference anchor="GTPv2-C">
        <front>
          <title>Evolved General Packet Radio Service (GPRS) Tunnelling
          Protocol for Control plane (GTPv2-C)</title>

          <author>
            <organization>3GPP</organization>
          </author>

          <date year="" />
        </front>

        <seriesInfo name="Technical Specification" value="TS 29.274" />
      </reference>

      <reference anchor="ATM-TM-ABR">
        <front>
          <title>Understanding the Available Bit Rate (ABR) Service Category
          for ATM VCs</title>

          <author>
            <organization>Cisco</organization>
          </author>

          <date day="5" month="June" year="2005" />
        </front>

        <seriesInfo name="Design Technote" value="10415" />

        <format target="http://www.cisco.com/en/US/tech/tk39/tk51/technologies_tech_note09186a00800fbc76.shtml"
                type="HTML" />
      </reference>
    </references>

    <section title="Outstanding Document Issues">
      <t><list style="numbers">
          <t>[GF] Concern that certain guidelines warrant a MUST (NOT) rather
          than a SHOULD (NOT). Given the guidelines say that if any SHOULD
          (NOT)s are not followed, a strong justification will be needed, they
          have been left as SHOULD (NOT) pending further list discussion. In
          particular:<list style="symbols">
              <t>If inner is a Not-ECN-PDU and outer is CE (or highest
              severity congestion level), MUST (not SHOULD) drop?</t>
            </list></t>

          <t>[GF] Impact of Diffserv on alternate marking schemes (referring
          to RFC3168, RFC4774 &amp; RFC2983)</t>

          <t>Security Considerations</t>
        </list></t>
    </section>

    <section anchor="ecnencap_Doc_Changes"
             title="Changes in This Version (to be removed by RFC Editor)">
      <t><list style="hanging">
          <t hangText="From briscoe-00 to 01:"><list style="symbols">
              <t>Intended status: BCP (was Informational) &amp; updates 3819
              added.</t>

              <t>Briefer Introduction: Introductory para justifying benefits
              of ECN. Moved all but a brief enumeration of modes of operation
              to their own new section (from both Intro &amp; Scope).
              Introduced incr. deployment as most tricky part.</t>

              <t>Tightened &amp; added to terminology section</t>

              <t>Structured with Modes of Operation, then Guidelines section
              for each mode.</t>

              <t>Tightened up guideline text to remove vagueness / passive
              voice / ambiguity and highlight main guidelines as numbered
              items.</t>

              <t>Added Outstanding Document Issues Appendix</t>

              <t>Updated references</t>
            </list></t>
        </list></t>
    </section>
  </back>
</rfc>
