One document matched: draft-bryant-filsfils-fat-pw-01.xml


<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc symrefs="no"?>
<?rfc sortrefs="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<?rfc toc="yes"?>
<rfc category="std" docName="draft-bryant-filsfils-fat-pw-01" ipr="full3978">
  <front>
    <title abbrev="LB-fat-pw">Load Balancing Fat MPLS Pseudowires</title>

    <author fullname="Stewart Bryant" initials="S" surname="Bryant">
      <organization>Cisco Systems</organization>

      <address>
        <postal>
          <street>250 Longwater Ave</street>

          <city>Reading</city>

          <code>RG2 6GB</code>

          <country>United Kingdom</country>
        </postal>

        <phone>+44-208-824-8828</phone>

        <email>stbryant@cisco.com</email>
      </address>
    </author>

    <author fullname="Clarence Filsfils" initials="C" surname="Filsfils">
      <organization>Cisco Systems</organization>

      <address>
        <postal>
          <street></street>

          <city>Brussels</city>

          <code></code>

          <country>Belgium</country>
        </postal>

        <email>cfilsfil@cisco.com</email>
      </address>
    </author>

    <author fullname="Ulrich Drafz" initials="U" surname="Drafz">
      <organization>Deutsche Telekom</organization>

      <address>
        <postal>
          <street></street>

          <city>Muenster</city>

          <region></region>

          <code></code>

          <country>Germany</country>
        </postal>

        <phone></phone>

        <facsimile></facsimile>

        <email>Ulrich.Drafz@t-com.net</email>

        <uri></uri>
      </address>
    </author>

    <date year="2008" />

    <area>Internet</area>

    <workgroup>PWE3</workgroup>

    <keyword></keyword>

    <keyword>pseudowire</keyword>

    <keyword>MPLS</keyword>

    <keyword>Internet-Draft</keyword>

    <abstract>
      <t>Where the payload carried over a pseudowire carries a number of
      identifiable flows it can in some circumstances be desirable to carry
      those flows over the equal cost multiple paths that exist in the packet
      switched network. This draft describes two methods of achieving that,
      the one by including an additional label in the label stack, the other
      by using a block of alternative pseudowire labels.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC2119</xref>.</t>
    </note>
  </front>

  <middle>
    <section title="Introduction">
      <t>A pseudowire is defined as a mechanism that carries the essential
      elements of an emulated service from one provider edge (PE) to one or
      more other PEs over a packet switched network (PSN) <xref
      target="RFC3985"></xref>.</t>

      <t>A pseudowire is normally transported over one single network path,
      even if multiple ECMP paths exit between the ingress and egress PEs<xref
      target="RFC4385"> </xref> <xref target="RFC4928"></xref>. This is
      required to preserve the characteristics of the emulated service (e.g.
      avoid misordering for example for SAToP pseudowire’s <xref
      target="RFC4553"></xref>). Except in the extreme case described in
      Section 6, the new capability proposed in this draft does not change
      this default property of pseudowires.</t>

      <t>Some pseudowire's are used to transport large volumes of IP traffic
      between routers at two locations. One example of this is the use of an
      Ethernet pseudowire to create a virtual direct link between a pair of
      routers. Such pseudowire’s may carry from hundred’s of Mbps
      to Gbps of traffic. Such pseudowire’s do not require ordering to
      be preserved between packets of the pseudowire. They only require
      ordering to be preserved within the context of each individual
      transported IP flow. Some operators have requested the ability to
      explicitly configure such a pseudowire to leverage the availability of
      multiple ECMP paths. This allows for better capacity planning as the
      statistical multiplexing of a larger number of smaller flows is more
      efficient than with a smaller set of larger flows. Although Ethernet is
      used as an example above, the mechanisms decsribed in this draft are
      general mechanisms that may be applied to any pseudowire type in which
      there are identifable flows, and in which the there is no requirement to
      preserve the order between flows.</t>

      <section title="ECMP in Label Switched Routers">
        <t>Label switched routers commonly hash the label stack or some
        elements of the label stack as a method of discrminating between
        flows, in order to distribute those flows over the available equal
        cost multiple paths that exist in the network. Since the label at the
        bottom of stack is usually the label most closely associated with the
        flow, this normally provides the greatest entropy and hence is
        normally included in the hash. The describes two methods of setting
        bottom of stack label in order to facilitate the load balancing of the
        flows with a pseudowire over the available ECMPs.</t>
      </section>

      <t><list style="symbols">
          <t>Use of a load balance label</t>

          <t>Allocation of multiple pseudowire labels</t>
        </list></t>

      <t>The load balance label mechanism is the more general and more
      powerful method. The pseudowire label block is an OPTINAL method that
      may be negotiated by PEs unable to support the additional label needed
      by the load balance label method.</t>

      <section title="Load Balance Label">
        <t>In this approach an additional label is interposed between the
        pseudowire label and the control word, or if the control word is not
        present, between the pseudowire label and the pseudowire payload. This
        additional label is called the pseudowire load balancing label (LB
        label). Indivisible flows within the pseudowire MUST be mapped to the
        same pseudowire LB label by the ingress PE. The pseudowire load
        balancing label stimulates the correct ECMP load balancing behaviour
        in the PSN. On receipt of the pseudowire packet at the egress PE
        (which knows this additional label is present) the label is discarded
        without processing.</t>

        <t>Note that there is no protocol constraint on the value of a LB
        label.</t>
      </section>

      <section title="Pseudowire Label Block">
        <t>In this approach a contiguous block of pseudowire labels are
        allocated to each pseudowire by the egress PE. Flows carried by the
        pseudowire labels are spread over the set of pseudowire labels, such
        that Indivisible flows within the pseudowire MUST be mapped to the
        same pseudowire label by the ingress PE. The use of a multiplicity of
        alternate pseudowire labels stimulates the correct ECMP load balancing
        behaviour in the PSN.</t>

        <t>Note that the pseudowire labels MUST be allocated as a contiguous
        block.</t>

        <t>Support for this method is OPTIONAL.</t>
      </section>
    </section>

    <section title="Native Service Processing Function">
      <t>The Native Service Processing (NSP) function is a component of a PE
      that has knowledge of the structure of the emulated service and is able
      to take action on the service outside the scope of the pseudowire. In
      this case it is required that the NSP in the ingress PE identify flows,
      or groups of flows within the service, and indicate the flow (group)
      identity of each packet as it is passed to the pseudowire forwarder.
      Since this is an NSP function, by definition, the method used to
      identify a flow is outside the scope of the pseudowire design.
      Similarly, since the NSP is internal to the PE, the method of flow
      indication to the pseudowire forwarder is outside the scope of this
      document</t>
    </section>

    <section title="Pseudowire Forwarder">
      <t>The pseudowire forwarder must be provided with a method of mapping
      flows to load balanced paths. Where the method chosen is the label block
      method, the forwarder uses the flow information provided by the NSP to
      allocate a flow to one of the VC labels in the load balancing label
      block. In all other respects forwarder operation is identical to the
      normal single VC label case.</t>

      <t>When the load balance label method is used the forwarder must
      generate a label for the flow or group of flows. How the load balance
      label values are determined is outside the scope of this document,
      however the load balance label allocated to a flow SHOULD remain
      constant. It is recommended that the method chosen to generate the load
      balancing labels introduces a high degree of entropy in their values, to
      maximise the entropy presented to the ECMP path selection mechanism in
      the LSRs in the PSN, and hence distribute the flows as evenly as
      possible over the available PSN ECMP paths. The forwarder at the ingress
      PE prepends the pseudowire control word (if applicable), then prepends
      either the pseudowire load balancing label, followed by the pseudowire
      label. Alternatively it prepends the pseudowire control word (if
      applicable), then selects and appends one of the allocated pseudowire
      labels.</t>

      <t>The forwarder at the egress PE uses the pseudowire label to identify
      the pseudowire. If the label block approach is used operation is
      identical to the current non-load balanced case. Alternatively, from the
      pseudowire context, the egress PE can determine whether a pseudowire
      load balancing label is present, and if one is present, the label is
      discarded.</t>

      <t>All other pseudowire forwarding operations are unmodified by the
      inclusion of the pseudowire load balancing label.</t>

      <section title="Encapsulation when using LB Label">
        <t>The PWE3 Protocol Stack Reference Model modified to include
        pseudowire LB label is shown in <xref target="PStack"></xref>
        below</t>

        <figure anchor="PStack" title="PWE3 Protocol Stack Reference Model">
          <artwork><![CDATA[
   +-------------+                                +-------------+
   |  Emulated   |                                |  Emulated   |
   |  Ethernet   |                                |  Ethernet   |
   | (including  |         Emulated Service       | (including  |
   |  VLAN)      |<==============================>|  VLAN)      |
   |  Services   |                                |  Services   |
   +-------------+                                +-------------+
   | Load balance|                                | Load balance|
   +-------------+            Pseudowire          +-------------+
   |Demultiplexer|<==============================>|Demultiplexer|
   +-------------+                                +-------------+
   |    PSN      |            PSN Tunnel          |    PSN      |
   |   MPLS      |<==============================>|   MPLS      |
   +-------------+                                +-------------+
   |  Physical   |                                |  Physical   |
   +-----+-------+                                +-----+-------+

]]></artwork>

          <postamble></postamble>
        </figure>

        <t></t>

        <t>The encapsulation of a pseudowire with a pseudowire LB label is
        shown in <xref target="Encap"></xref> below</t>

        <figure anchor="Encap"
                title="Encapsulation of a pseudowire with a pseudowire load balancing label">
          <artwork><![CDATA[
    +-------------------------------+
    |      MPLS Tunnel label(s)     | n*4 octets (four octets per label)
    +-------------------------------+
    |      PW label                 |  4 octets
    +-------------------------------+
    |      Load Balance label       |  4 octets 
    +-------------------------------+
    |   Optional Control Word       |  4 octets
    +-------------------------------+
    |            Payload            |
    |                               |
    |                               |  n octets
    |                               |
    +-------------------------------+

]]></artwork>

          <postamble></postamble>
        </figure>

        <t></t>
      </section>
    </section>

    <section title="Load Balance Signaling">
      <t>This section describes the signalling procedures when <xref
      target="RFC4447"></xref> is used.</t>

      <section title="Signaling Load Balance Label">
        <t>When using the signalling procedures in <xref
        target="RFC4447"></xref>, there is a Pseudowire Interface Parameter
        Sub-TLV type used to signal the desire to include the load balance
        label when advertising a VC label.</t>

        <t>The presence of this parameter indicates that the egress PE
        requests that the ingress PE place a load balance label between the
        pseudowire label and the control word (or is the control word is not
        present between the pseudowire label and the pseudowire payload).</t>

        <t>If the ingress PE recognises load balance label indicator parameter
        but does not wish to include the load balance label, it need only
        issue its own label mapping message for the opposite direction without
        including the load balance label Indicator. This will prevent
        inclusion of the load balance label in either direction.</t>

        <t>If PWE3 signalling <xref target="RFC4447"></xref> is not in use for
        a pseudowire, then whether the load balance label is used MUST be
        identically provisioned in both PEs at the pseudowire endpoints. If
        there is no provisioning support for this option, the default
        behaviour is not to include the load balance label.</t>

        <t>Note that what is signaled is the desire to include the load
        balance lable in the label stack. The value of the label is a local
        matter for the ingress PE, and the label value itself is not
        signaled.</t>

        <section title="Structure of Load Balance Label TLV">
          <t>The structure of the load balance label TLV is shown in <xref
          target="MultipleVCTLV"></xref>.</t>

          <figure anchor="MultipleVCTLV" title="Multiple VC TLV">
            <preamble></preamble>

            <artwork><![CDATA[ 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| LBL           |    Length     |    must be zero               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+]]></artwork>

            <postamble></postamble>
          </figure>

          <t></t>

          <t>Where:</t>

          <t><list style="symbols">
              <t>LBL is the load balance label TLV identifier assigned by
              IANA.</t>

              <t>Length is the length of the TLV in octets and is 4.</t>
            </list></t>
        </section>
      </section>

      <section title="Signalling Label Block">
        <t>When using the signalling procedures in <xref
        target="RFC4447"></xref>, there is a Pseudowire Interface load balance
        sub-TLV type used to signal the desire load balance the pseudowire
        over a block of pseudowire VC labels.</t>

        <t>The presence of this TLV indicates that the egress PE requests that
        the ingress PE distribute the ingress flows present in the pseudowire
        over the block of VC labels sent in this TLV.</t>

        <t>This mechanism is fully backwards compatible. If the ingress PE
        does not recognise the load balance TLV or does not wish to use it, it
        simply ignores this TLV.</t>

        <section title="Structure of Multiple VC TLV">
          <t>The field structure is defined in <xref
          target="multTLV"></xref>.</t>
        </section>

        <t></t>

        <figure anchor="multTLV" title="Multiple VC TLV">
          <preamble></preamble>

          <artwork><![CDATA[ 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| MultipleVC    |   Length=12   |    must be zero               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            First Label                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            Last Label                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+]]></artwork>

          <postamble></postamble>
        </figure>

        <t></t>

        <t>Where:</t>

        <t><list style="symbols">
            <t>MultipleVC is the TLV identifier assigned by IANA</t>

            <t>Length is 12</t>

            <t>First Label is a 20-bit label value as specified in <xref
            target="RFC3032"></xref> represented as a 20-bit number in a 4
            octet field that indicates the start of the load balance label
            block</t>

            <t>Last Label is a 20-bit label value as specified in <xref
            target="RFC3032"></xref> represented as a 20-bit number in a 4
            octet field that indicates the end of the load balance label
            block</t>
          </list>First Label SHOULD be the normal VC for this pseudowire.</t>
      </section>
    </section>

    <section title="OAM">
      <t>The following OAM considerations apply to both methods of load
      balancing.</t>

      <t>Where the OAM is only to be used to perform a basic test that the
      pseudowires have been configured at the PEs, <xref
      target="I-D.ietf-pwe3-vccv">VCCV</xref> messages may be sent using any
      load balance pseudowire path, i.e. over any of the multiple pseudowire
      labels, or using any pseudowire load balance label.</t>

      <t>Where it is required to verify that a pseudowire is fully functional
      for all flows<xref target="I-D.ietf-pwe3-vccv">, VCCV</xref> connection
      verification message MUST be sent over each ECMP path to the pseudowire
      egress PE. This problem is difficult to solve and scales poorly. . We
      believe that this problem is addressed by the following two methods:</t>

      <t><list counter="" style="numbers">
          <t>If a failure occurs within the PSN, this failure will normally be
          detected by the PSN's IGP (link/node failure, link or BFD or IGP
          hello detection), and the IGP convergence will naturally modify the
          ECMP set of network paths between the Ingress and Egress PE's. Hence
          the PW is only impacted during the normal IGP convergence time.</t>

          <t>If the failure is related to the individual corruption of an LFIB
          entry in a router, then only the network path using that specific
          entry is impacted. If the PW is load balanced over multiple network
          paths, then this failure can only be detected if, by chance, the
          transported OAM flow is mapped onto the impacted network path, or
          all paths are tested. This type of error may be better solved be
          solved by other means such as LSP self test <xref
          target="I-D.ietf-mpls-lsr-self-test"></xref>.</t>
        </list>To troubleshoot the MPLS PSN, including multiple paths, the
      techniques described in <xref target="RFC4378"></xref> and <xref
      target="RFC4379"></xref> can be used.</t>
    </section>

    <section title="Applicability">
      <t>The requirement to load-balance over multiple PSN paths occurs when
      the ratio between the PW access speed and the PSN’s core link
      bandwidth is large (e.g. >= 0.1). ATM and FR are unlikely to meet
      this property. Ethernet does and this is the reason why this document
      focuses on ethernet. Applications for other high-access-bandwidth
      PW’s (fiber-channel) may be defined in the future.</t>

      <t>This design applies to MPLS pseudowires where it is meaningful to
      deconstruct the packets presented to the ingress PE into flows. The
      mechanism described in this document promotes the distribution of flows
      within the pseudowire over different network paths. This in turn means
      that whilst packets within a flow are delivered in order (subject to
      normal IP delivery perturbations due to topology variation), order is
      not maintained amongst packets of different flows. It is not proposed to
      associate a different sequence number with each flow. If sequence number
      support is required this mechanism is not applicable.</t>

      <t>Where it is known that the traffic carried by the ethernet pseudowire
      is IP the method of identifying the flows are well known and can be
      applied. Such methods typically include hashing on the source and
      destination addresses, the protocol ID and higher-layer flow-dependent
      fields such as TCP/UDP ports, L2TPv3 Session ID’s etc.</t>

      <t>Where it is known that the traffic carried by the ethernet pseudowire
      is non-IP, techniques used for link bundling between Ethernet switches
      may be reused. In this case however the latency distribution would be
      larger than is found in the link bundle case. The acceptability of the
      increased latency is for further study. Of particular importance the
      Ethernet control frames SHOULD always be mapped to the same PSN path to
      ensure in-order delivery.</t>

      <t>If the payload of an Ethernet PW is made of a single inner flow (i.e.
      an encrypted connection between two routers), then the functionality
      described in this document does not give any benefits, though neither
      does it give any drawbacks. This is unlikely to be a showstopper for two
      reasons:</t>

      <t><list style="symbols">
          <t>Firstly, the customer of a high-bandwidth PW service has
          incentive to get the best transport service because an inefficient
          use of the PSN leads to jitter and eventually to loss to the
          PW’s payload.</t>

          <t>Secondly, the customer is usually able to tailor their
          applications to generate many flows in the PSN. A well-known example
          is massive data transport between servers which use many parallel
          TCP sessions. This same technique can be used by any transport
          protocol: multiple UDP ports, multiple L2TPv3 Session ID’s,
          multiple GRE keys may be used to decompose a large flow into smaller
          components. This approach may be applied to IPSEC where multiple
          SPI’s may be allocated to the same security association.</t>
        </list>A node within the PSN is not able to perform
      deep-packet-inspection (DPI) of the PW as the PW technology is not
      self-describing: the structure of the PW payload is only known to the
      ingress and egress PE devices. The two methods proposed in this document
      solve this limitation.</t>

      <t>The methods describe in this document are transparent to the PSN and
      as such do not require any new capability from the PSN.</t>
    </section>

    <section title="Comparision of the Approaches">
      <t>There are a number of advantages and disadvantages to each
      approach:</t>

      <t><list style="symbols">
          <t>The LB label method has a better statistical multiplexing
          capability.</t>

          <t>The LB label method has a better semantic than the PW Label block
          approach as this latter merges the pseudowire emultiplexor and the
          load balance semantics.</t>

          <t>The LB label preserves label space and hence the FIB table
          size.</t>

          <t>The PW Label Block preserves the data plane path of the egress
          PE</t>

          <t>The PW Label Block has better hardware backwards
          compatibility.</t>

          <t>The PW Label Block approach increases the use of LFIB space. This
          appoach may particularly problematic in p2mp and mp2mp topologies,
          since the impact on LFIB space is felt by all PE nodes using the PW
          Labels Block approach.</t>

          <t>It is not clear how many labels need to be allocated for a
          particular PW, the more diverse input is provided to load-hashing
          algorithms within core LSR's the more evenly flows are distributed
          over component-links. Some operators have expressed concerns at the
          level of tuning that they will need undertake to establish and
          maintain the desired level of load balance.</t>

          <t>Both approach anyway require data plane forwarding change for the
          ingress PE.</t>

          <t>The LB Label method requires a data plane forwarding change for
          the egress PE.</t>
        </list>It is desirable that based on an analysis of the two load
      balance mechanisms, one will chosen as the method to persue. If
      consensus is reached that only one method is needed this document will
      be simplified by removing the dataplane and signaling test that
      describes the alternative solution.</t>
    </section>

    <section title="Security Considerations">
      <t>The pseudowire generic security considerations described in <xref
      target="RFC3985"></xref> and the security considerations applicable to a
      specific pseudowire type (for example, in the case of an Ethernet
      pseudowire <xref target="RFC4448"></xref> apply.</t>

      <t>There are no additional security risks introduced by this design.</t>
    </section>

    <section title="IANA Considerations">
      <t></t>

      <t>IANA is requested to allocate the next available values from the IETF
      Consensus range in the Pseudowire Interface Parameters Sub-TLV type
      Registry as a Load Balance Label indicator.</t>

      <figure>
        <artwork><![CDATA[
Parameter  Length       Description

TBD         4            Load Balancing Label
TBD        12            Load Balance VC Block]]></artwork>

        <postamble></postamble>
      </figure>
    </section>

    <section title="Congestion Considerations">
      <t>The congestion considerations applicable to pseudowires as described
      in <xref target="RFC3985"></xref> and any additional congestion
      considerations developed at the time of publication apply to this
      design.</t>

      <t>The ability to explicitly configure a PW to leverage the availability
      of multiple ECMP paths is beneficial to capacity planning as, all other
      parameters being constant, the statistical multiplexing of a larger
      number of smaller flows is more efficient than with a smaller number of
      larger flows.</t>

      <t>Note that if the classification into flows is only performed on IP
      packets the behaviour of those flows in the face of congestion will be
      as already defined by the IETF for packets of that type and no
      additional congestion processing is required.</t>

      <t>Where flows that are not IP are classified pseudowire congestion
      avoidance must be applied to each non-IP load balance group.</t>
    </section>

    <section title="Acknowledgements">
      <t>The authors wish to thank Joerg Kuechemann, Wilfried Maas, Luca
      Martini and Mark Townsley for valuable comments and contributions to
      this design.</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include='reference.RFC.2119'?>

      <?rfc include='reference.RFC.4447'?>

      <?rfc include='reference.RFC.4448'?>

      <?rfc include='reference.RFC.4928'?>

      <?rfc include='reference.RFC.4553'?>

      <?rfc include='reference.RFC.4385'?>

      <?rfc include='reference.RFC.4378'?>

      <?rfc include='reference.RFC.4379'?>

      <?rfc include='reference.RFC.3032'?>
    </references>

    <references title="Informative References">
      <?rfc include='reference.RFC.3985'?>

      <?rfc include='reference.I-D.ietf-pwe3-vccv'?>

      <?rfc include='reference.I-D.ietf-mpls-lsr-self-test'?>
    </references>
  </back>
</rfc>

PAFTECH AB 2003-20262026-04-21 22:33:46