<?xml version="1.0"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">

<!-- This is built from a template for a generic Internet Draft. Suggestions for
     improvement welcome - write to Brian Carpenter, brian.e.carpenter @ gmail.com -->

<!-- This can be converted using the Web service at http://xml.resource.org/experimental.html
     (which supports the latest, sometimes undocumented and under-tested, features.) -->

<?rfc toc="yes"?>            <!-- You want a table of contents -->
<?rfc symrefs="yes"?>        <!-- Use symbolic labels for references -->
<?rfc sortrefs="yes"?>       <!-- This sorts the references -->
<?rfc iprnotified="no" ?>    <!-- Change to "yes" if someone has disclosed IPR for the draft -->
<?rfc compact="yes"?>

<!-- You need one entry like the following for each RFC referenced -->


<!ENTITY RFC2119 PUBLIC ''
  'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119'>
<!ENTITY RFC2629 PUBLIC ''
  'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2629'>
<!ENTITY RFC2460 PUBLIC ''
  'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2460'>
<!ENTITY RFC3697 PUBLIC ''
  'http://xml.resource.org/public/rfc/bibxml/reference.RFC.3697'>
<!ENTITY RFC2474 PUBLIC ''
  'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2474'>
<!ENTITY RFC2991 PUBLIC ''
  'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2991'>
<!ENTITY RFC2784 PUBLIC ''
  'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2784'>




<!-- You need one entry like the following for each I-D referenced -->

<!ENTITY DRAFT-flowup SYSTEM "http://xml.resource.org/public/rfc/bibxml3/reference.I-D.ietf-6man-flow-3697bis.xml">

<!-- This defines the specific filename and version number of your draft (and inserts the appropriate IETF boilerplate) -->
<rfc ipr="trust200902" docName="draft-ietf-6man-flow-ecmp-03" category="bcp">  


<front>
<title abbrev="Flow Label for tunnel ECMP/LAG">Using the IPv6 flow label for equal cost multipath routing  and link aggregation in tunnels</title>


<author initials="B. E." surname="Carpenter" fullname="Brian Carpenter">
    <organization abbrev="Univ. of Auckland"></organization>
    <address>
      <postal>
        <street>Department of Computer Science</street>
        <street>University of Auckland</street>
        <street>PB 92019</street>
        <city>Auckland</city>
        <region></region>
        <code>1142</code>
        <country>New Zealand</country>
      </postal>
      
      <email>brian.e.carpenter@gmail.com</email>
    </address>
</author>

<author initials="S." surname="Amante" fullname="Shane Amante">
<organization abbrev="Level 3"></organization>
  <address>
      <postal>
        <street>Level 3 Communications, LLC</street>
        <street>1025 Eldorado Blvd</street>
        <city>Broomfield</city>
        <region>CO</region>
        <code>80021</code>
        <country>USA</country>
      </postal>
      <email>shane@level3.net</email>
  </address>
</author>

<date day="20" month="June" year="2011" />

<!-- <area>Internet</area>
<workgroup>6MAN</workgroup> -->

 


<abstract>

<t>The IPv6 flow label has certain restrictions on its use. This document describes how those restrictions
apply when using the flow label for load balancing by equal cost multipath routing, and for
link aggregation, particularly for IP-in-IPv6 tunneled traffic.
</t>
    
</abstract>
</front>

<middle>
<section anchor="intro" title="Introduction">

<t>When several network paths between the same two nodes are known by the routing system to 
be equally good (in terms of capacity and latency), it may be desirable to share traffic among
them. Two such techniques are known as equal cost multipath routing (ECMP) and link aggregation
(LAG) <xref target="IEEE802.1AX"/>. There are of course numerous
possible approaches to this, but certain goals need to be met: </t>
<list style="symbols">
<t>Roughly equal share of traffic on each path.
<vspace blankLines="0" />
(In some cases, the multiple paths might not all have the same capacity and the goal
might be appropriately weighted traffic shares rather than equal shares. This would affect
the load sharing algorithm, but would not otherwise change the argument.)
</t>
<t>Minimize or avoid out-of-order delivery for individual traffic flows.</t>
<t>Minimize idle time on any path when the queue is non-empty.</t>
</list>
<t>There is some conflict between these goals: for example, strictly avoiding idle time
could cause a small packet sent on an idle path to overtake a bigger packet from
the same flow, causing out-of-order delivery. </t>

<t>One lightweight approach to ECMP or LAG is this: if there are N
equally good paths to choose from, then form a modulo(N) hash <xref target="RFC2991"/>
from a defined set of fields in each packet header that are certain
to have the same values throughout the duration of a flow, and use the
resulting output hash value to select a particular path.  If the hash function is
chosen so that the output values have a uniform statistical distribution,
this method will share traffic roughly equally between the N paths.
If the header fields included in the hash input are consistent, all packets
from a given flow will generate the same hash output value, so out-of-order
delivery will not occur.  Assuming a large number of unique flows are
involved, it is also probable that the method will avoid idle time, 
since the queue for each link will remain non-empty. </t>
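
<t>As a minimal sketch of the modulo(N) approach described above, the following Python fragment
hashes the source and destination addresses and reduces the result modulo the number of paths.
The function name select_path and the use of SHA-256 are assumptions made purely for illustration;
forwarding hardware would normally use a much cheaper hash with similar distribution properties. </t>

<figure><artwork><![CDATA[
```python
import hashlib
import ipaddress


def select_path(src: str, dst: str, n_paths: int) -> int:
    """Map a {source, destination} 2-tuple to a path index."""
    key = (ipaddress.ip_address(src).packed
           + ipaddress.ip_address(dst).packed)
    # SHA-256 is a stand-in for whatever uniform hash is used.
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_paths


# All packets of one flow share the same key, hence the same path.
first = select_path("2001:db8::1", "2001:db8::2", 4)
again = select_path("2001:db8::1", "2001:db8::2", 4)
assert first == again
```
]]></artwork></figure>

<t>Because the key is built only from fields that are constant for the life of a flow, packets of
that flow cannot be reordered by the path choice, while a large population of distinct flows is
spread roughly evenly over the N paths. </t>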

<section anchor="choice" title="Choice of IP Header Fields for Hash Input">

<t>In the remainder of this document, we will use the term "flow" to
represent a sequence of packets that may be identified by either the
source and destination IP addresses alone {2-tuple} or the source
and destination IP addresses, protocol and source and destination
port numbers {5-tuple}.  It should be noted that the latter is more
specifically referred to as a "microflow" in <xref target="RFC2474"/>,
but this term is not used in connection with the flow label in <xref target="RFC3697"/>.
</t>

<t>The question is, then, which header fields are used to identify a flow and to
serve as input keys to a modulo(N) hash algorithm. 
A common choice when routing general traffic is simply to use a hash of the source and destination
IP addresses, i.e., the 2-tuple. This is necessary and sufficient to avoid out-of-order delivery and, 
with a wide variety of sources and destinations, as one finds in
the core of the network, is often statistically sufficient to distribute load evenly.
In practice, many implementations use the 5-tuple {dest addr, source addr, protocol, dest port, source port}
as input keys to the hash function, to maximize the probability of evenly sharing traffic over
the equal cost paths. However, including transport layer information	
as input keys to a hash may be a problem for IP fragments <xref target="RFC2991"/> or for encrypted traffic.	
Including the protocol and port numbers, totalling 40 bits, in the hash input
makes the hash slightly more expensive to compute but does
improve the hash distribution, due to the variable nature of ephemeral ports. Ephemeral port numbers
are quite well distributed <xref target="Lee10"/> and will typically contribute 16 variable bits. 
However, in the case of IPv6, transport layer information is inconvenient to extract, due to the variable 
placement of and variable length of next-headers; all implementations must be capable
of skipping over next-headers, even if they are rarely present in actual traffic. 
In fact, <xref target="RFC2460"/> implies that next-headers, except hop-by-hop options, are not	
normally inspected by intermediate nodes in the network. This situation may be challenging for
some hardware implementations, raising the potential that network equipment vendors might 
sacrifice the length of the fields extracted from an IPv6 header.
</t>
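
<t>To illustrate why the variable placement and length of next-headers is inconvenient, the
following Python sketch walks the extension header chain of an IPv6 packet until it reaches the
upper-layer header. The function name find_transport is hypothetical, and the sketch assumes a
complete, well-formed packet held in memory; it is not intended to describe any particular
router implementation. </t>

<figure><artwork><![CDATA[
```python
# Headers whose length is (Hdr Ext Len + 1) * 8 octets:
# hop-by-hop options (0), routing (43), destination options (60).
GENERIC_EXT = {0, 43, 60}
FRAGMENT = 44   # fixed length of 8 octets
AH = 51         # length is (Payload Len + 2) * 4 octets


def find_transport(packet: bytes):
    """Return (protocol, offset) of the upper-layer header.

    Assumes a complete, well-formed IPv6 packet. For a
    non-initial fragment the returned offset does not point
    at a transport header, so port numbers are unavailable.
    """
    next_header = packet[6]   # Next Header field of fixed header
    offset = 40               # fixed IPv6 header length
    while True:
        if next_header in GENERIC_EXT:
            step = (packet[offset + 1] + 1) * 8
        elif next_header == FRAGMENT:
            step = 8
        elif next_header == AH:
            step = (packet[offset + 1] + 2) * 4
        else:
            # Anything else is treated as the upper-layer protocol.
            return next_header, offset
        next_header = packet[offset]
        offset += step
```
]]></artwork></figure>

<t>The loop needs one dependent read per extension header before the port numbers can even be
located, whereas the addresses and the flow label sit at fixed offsets in the first 40 octets. </t>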

<t>It is worth noting that the possible presence of a GRE header <xref target="RFC2784"/> and the possible presence of
a GRE key within that header creates a similar challenge to the possible presence of IPv6 extension
headers; anything that complicates header analysis is undesirable. </t> 

<t>The situation is different in IP-in-IP tunneled scenarios. 
Identifying a flow inside the tunnel is more complicated,	
particularly because nearly all hardware can only identify	
flows based on information contained in the outermost IP header.
Assume that traffic from
many sources to many destinations is aggregated in a single IP-in-IP tunnel from 
tunnel end point (TEP) A to TEP B (see figure).
Then all the packets forming the tunnel have outer source address A and outer destination address B. 
In all probability they also have the same port and protocol numbers. If there are
multiple paths between routers R1 and R2, and ECMP or LAG is applied to choose a
particular path, the 2-tuple or 5-tuple, and its hash, will be constant and no load sharing will be achieved.
If there is a high proportion of traffic from one or a small number of tunnels, traffic will not be distributed as intended across
the paths between R1 and R2. </t>

<figure><artwork>
   _____           _____               _____           _____ 
  | TEP |_________| R1  |-------------| R2  |_________| TEP | 
  |__A__|         |_____|-------------|_____|         |__B__|
          tunnel          ECMP or LAG         tunnel
                              here 

</artwork></figure>

<t>As noted above, for IPv6, the 5-tuple is in any case quite inconvenient to extract
due to the next-header placement. The question therefore arises whether the 20-bit 
flow label in IPv6 packets would be suitable for use as input to an ECMP or LAG 
hash algorithm, especially in the case of tunnels where the inner packet header
is inaccessible. If the flow label could be used in place of the port numbers and protocol
number in the 5-tuple, the implementation would be simplified. </t>
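
<t>The effect can be sketched in a few lines of Python. The helper path_for, the TEP addresses
(taken from the documentation prefix) and the use of SHA-256 are all assumptions for illustration.
With only the outer 2-tuple as input, every tunneled packet selects the same path; adding the
outer flow label to the hash input lets differently labeled flows spread across the paths. </t>

<figure><artwork><![CDATA[
```python
import hashlib
import ipaddress


def path_for(src, dst, n_paths, flow_label=None):
    """Path index from the outer 2-tuple, or 3-tuple if labeled."""
    key = (ipaddress.IPv6Address(src).packed
           + ipaddress.IPv6Address(dst).packed)
    if flow_label is not None:
        key += flow_label.to_bytes(3, "big")   # 20 bits fit in 3 octets
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_paths


TEP_A, TEP_B = "2001:db8:a::1", "2001:db8:b::1"

# 2-tuple only: 1000 tunneled packets, all on a single path.
paths_2tuple = {path_for(TEP_A, TEP_B, 4) for _ in range(1000)}

# 3-tuple: 1000 arbitrary stand-in labels, one per user flow.
paths_3tuple = {path_for(TEP_A, TEP_B, 4, label)
                for label in range(1, 1001)}
```
]]></artwork></figure>

<t>Here paths_2tuple holds a single path index, while paths_3tuple is expected to cover all four
paths, since the flow label is the only field that differs between the tunneled flows. </t>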

</section> <!-- choice -->

<section anchor="rules" title="Flow label rules">

<t>The flow label was left experimental by <xref target="RFC2460"/> but
was better defined by <xref target="RFC3697"/>. We quote three rules
from that RFC:</t>
<list style="numbers">
<t>"The Flow Label value set by the source MUST be delivered unchanged to
   the destination node(s)."</t>
<t>"IPv6 nodes MUST NOT assume any mathematical or other properties
of the Flow Label values assigned by source nodes." </t>
<t>"Router performance SHOULD NOT be dependent on the distribution of the Flow Label
values. Especially, the Flow Label bits alone make poor material for a hash key." </t>
</list>

<t>These rules, especially the last one, have caused designers to hesitate about using the flow label in support of ECMP or LAG.
In practice, most nodes today set the flow label to zero, and the first rule definitely forbids the routing system from changing
the flow label once a packet has left the source node. For normal IPv6 traffic, a flow label that is typically zero adds no value to
an ECMP or LAG hash, but neither does it harm the distribution of the hash values. </t>

<t>However, in the
case of an IP-in-IPv6 tunnel, the TEP is itself the source node of the outer packets. Therefore, a TEP
may freely set a flow label in the outer IPv6 header of the packets it sends into the tunnel. </t>

<t>The second and third rules quoted above need to be seen in the context of <xref target="RFC3697"/>,
which assumes that routers using the flow label in some way will be involved in some sort
of method of establishing flow state:
"To enable flow-specific treatment, flow state needs to be established
on all or a subset of the IPv6 nodes on the path from the source to
the destination(s)."  The RFC should perhaps have made clear that a router that has participated
in flow state establishment can rely on properties of the resulting flow label values without
further signaling.
If a router knows these properties, rule 2 is irrelevant, and it can choose to deviate from rule 3. </t>

<t>In the tunneling situation sketched above, routers R1 and R2 can rely on the flow labels
set by TEP A and TEP B being assigned by a known method. This allows an ECMP or LAG method to
be based on the flow label consistently with <xref target="RFC3697"/>, regardless of whether
the non-tunnel traffic carries non-zero flow label values. </t>

<t>At the time of this writing, the IETF is preparing a revision of RFC 3697
<xref target="I-D.ietf-6man-flow-3697bis"/>.
That revision is fully compatible with the present document and obviates
the concerns resulting from the above three rules. Therefore, the present specification
applies both to RFC 3697 and to its successor. </t>

</section> <!-- rules -->

</section> <!-- intro -->

<section title="Normative Notation">
 <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in <xref target="RFC2119"/>.</t>
 </section>

<section anchor="guide" title="Guidelines">

<t>We assume that the routers supporting ECMP or LAG (R1 and R2 in the above figure) are unaware that 
they are handling tunneled traffic. If it is desired to include the IPv6 flow label in an ECMP 
or LAG hash in the tunneled scenario shown above, the following guidelines apply: </t>

<list style="symbols">
<t>Inner packets MUST be encapsulated in an outer IPv6 packet whose source and destination
addresses are those of the tunnel end points (TEPs). </t>

<t>The flow label in the outer packet SHOULD be set by the sending TEP to a 20-bit 
value in accordance with <xref target="I-D.ietf-6man-flow-3697bis"/>. The same flow label value MUST be used for all packets
in a single user flow, as determined by the IP header fields of the inner packet. </t>
  
<t>To achieve this, the sending TEP MUST classify all packets into flows, once it has determined that they
should enter a given tunnel, and then write the relevant flow label into the outer IPv6 header.
A user flow could be identified by the sending TEP most simply by its {destination, source} address
2-tuple or by its 5-tuple {dest addr, source addr, protocol, dest port, source port}. 
At present, there would be little point 
in using the {dest addr, source addr, flow label} 3-tuple of the inner packet,
but doing so would be a future-proof option. The choice of n-tuple
is an implementation choice in the sending TEP. </t>

  <list style="symbols">
  <t>As specified in <xref target="I-D.ietf-6man-flow-3697bis"/>, the flow label values should be
  chosen from a uniform distribution. Such values will be
  suitable as input to a load balancing hash function and will be hard for a malicious third party to predict. </t> 
  <t>The sending TEP MAY perform stateless flow label assignment, by using a suitable 20-bit hash of
  the inner IP header's 2-tuple or 5-tuple as the flow label value. </t>
  <t>If the inner packet is an IPv6 packet, its flow label value could also be included
  in this hash. </t>
  <t>This stateless method creates a small probability of two different user flows
  hashing to the same flow label. Since <xref target="I-D.ietf-6man-flow-3697bis"/> allows a source (the TEP in this case)
  to define any set of packets that it wishes as a single flow, occasionally labeling two user
  flows as a single flow through the tunnel is acceptable. </t>
  </list>

<t>At intermediate router(s) that perform load distribution, the hash algorithm used
      to determine the outgoing component-link in an ECMP and/or LAG toward the next-hop
      MUST minimally include the 3-tuple {dest addr, source addr, flow label}
      and MAY also include the remaining components of the 5-tuple.
      This applies whether the traffic is tunneled traffic only, or a mixture
      of normal traffic and tunneled traffic.  </t>

  <list style="symbols">
  <t>Intermediate IPv6 router(s) will presumably encounter a mixture of tunneled 
     traffic and normal IPv6 traffic. Because of this, the design should also include
     {protocol, dest port, source port} as input keys to the ECMP and/or LAG hash
     algorithms, to provide additional entropy for flows whose flow label is set to zero,
     including non-tunneled traffic flows. </t>
  </list>
</list>
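
<t>The stateless assignment at the sending TEP can be sketched as follows. The function names
stateless_flow_label and outer_header_word, and the choice of SHA-256 truncated to 20 bits, are
assumptions for illustration only; any hash with a suitably uniform 20-bit output could be
substituted. The second function shows where the label sits in the first 32 bits of the outer
IPv6 header (4-bit version, 8-bit traffic class, 20-bit flow label). </t>

<figure><artwork><![CDATA[
```python
import hashlib
import ipaddress


def stateless_flow_label(src, dst, proto, sport, dport):
    """20-bit label derived from the inner packet's 5-tuple."""
    key = (ipaddress.ip_address(src).packed
           + ipaddress.ip_address(dst).packed
           + bytes([proto])
           + sport.to_bytes(2, "big")
           + dport.to_bytes(2, "big"))
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:3], "big") & 0xFFFFF


def outer_header_word(flow_label, traffic_class=0):
    """First 32 bits of the outer IPv6 header."""
    return (6 << 28) | (traffic_class << 20) | flow_label


# One inner TCP flow: every packet gets the same outer label.
label = stateless_flow_label("2001:db8:1::10", "2001:db8:2::20",
                             6, 49152, 443)
word = outer_header_word(label)
```
]]></artwork></figure>

<t>No per-flow state is kept at the TEP: the label is recomputed for each packet and is identical
for all packets of a user flow. The sketch does not treat a result of zero specially; a
deployment might choose to remap it, since zero is also the value carried by unlabeled traffic. </t>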

</section> <!-- guide -->




<section anchor="security" title="Security Considerations">

<t>The flow label is not protected in any way and can be forged by an on-path
attacker. However, it is expected that tunnel end-points and the ECMP or LAG
paths will be part of managed infrastructure that is well protected against
on-path attacks. Off-path attackers are unlikely to guess a valid flow label
if an apparently pseudo-random value is used.
In either case, the worst an attacker could do against ECMP or LAG is to attempt
to selectively overload a particular path. For further discussion, see <xref target="I-D.ietf-6man-flow-3697bis"/>.
</t>

   
</section> <!-- security -->


<section anchor="iana" title="IANA Considerations">
   <t>This document requests no action by IANA. </t>
</section> <!-- iana -->




<section anchor="ack" title="Acknowledgements">

<t>This document was suggested by corridor discussions at IETF76. Joel Halpern made
crucial comments on an early version. We are grateful to Qinwen Hu for general
discussion about the flow label. 
Valuable comments and contributions were made by Jarno Rajahalme, Brian Haberman, Sheng Jiang, Thomas Narten,
and others.</t>

<t>This document was produced using the xml2rfc tool
<xref target="RFC2629"/>.</t>

</section> <!-- ack -->


<section anchor ="changes" title="Change log [RFC Editor: please remove]">

<t>draft-ietf-6man-flow-ecmp-03: minor editorial fixes,  AD comments, 2011-06-20.</t>

<t>draft-ietf-6man-flow-ecmp-02: updated after further comments, 2011-05-02.
Note that RFC3697bis becomes a normative reference. </t>

<t>draft-ietf-6man-flow-ecmp-01: updated after WG Last Call, 2011-02-10</t>

<t>draft-ietf-6man-flow-ecmp-00: after WG adoption at IETF 79, 2010-12-02</t>

<t>draft-carpenter-flow-ecmp-03: clarifications after further comments, 2010-10-07</t>

<t>draft-carpenter-flow-ecmp-02: updated after IETF77 discussion, especially adding LAG, 
changed to BCP language, added second author, 2010-04-14</t>

<t>draft-carpenter-flow-ecmp-01: updated after comments, 2010-02-18</t>

<t>draft-carpenter-flow-ecmp-00: original version, 2010-01-19</t>



</section> <!-- changes -->

</middle>

<back>

<references title="Normative References">

&RFC2119;
&RFC2460;
&RFC3697;
&DRAFT-flowup;

</references>

<references title="Informative References">


&RFC2474;
&RFC2629;
&RFC2991;
&RFC2784;


<reference anchor='Lee10' target='http://www.cs.auckland.ac.nz/~brian/udptcp-paper-cam-submit.pdf'>
<front>
<title>Observations of UDP to TCP Ratio and Port Numbers</title>
<author initials="D. J." surname="Lee" fullname="DongJin Lee"/>
<author initials="B. E." surname="Carpenter" fullname="Brian E. Carpenter"/> 
<author initials="N." surname="Brownlee" fullname="Nevil Brownlee"/> 
<date month='May' year='2010'/>
</front>
<seriesInfo name="Fifth International Conference on Internet Monitoring and Protection" value='ICIMP 2010' />
</reference>

<reference anchor="IEEE802.1AX">
<front>
<title>Link Aggregation</title>
<author>
<organization>Institute of Electrical and Electronics Engineers</organization>
</author>
<date month="" year="2008"/>
</front>
<seriesInfo name="IEEE" value="Standard 802.1AX-2008"/>
</reference>
    
</references>


</back>
</rfc>

