<?xml version="1.0"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<!-- This is built from a template for a generic Internet Draft. Suggestions for
improvement welcome - write to Brian Carpenter, brian.e.carpenter @ gmail.com -->
<!-- This can be converted using the Web service at http://xml.resource.org/experimental.html
(which supports the latest, sometimes undocumented and under-tested, features.) -->
<?rfc toc="yes"?> <!-- You want a table of contents -->
<?rfc symrefs="yes"?> <!-- Use symbolic labels for references -->
<?rfc sortrefs="yes"?> <!-- This sorts the references -->
<?rfc iprnotified="no" ?> <!-- Change to "yes" if someone has disclosed IPR for the draft -->
<?rfc compact="yes"?>
<!-- You need one entry like the following for each RFC referenced -->
<!ENTITY RFC2119 PUBLIC ''
'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119'>
<!ENTITY RFC2629 PUBLIC ''
'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2629'>
<!ENTITY RFC2460 PUBLIC ''
'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2460'>
<!ENTITY RFC3697 PUBLIC ''
'http://xml.resource.org/public/rfc/bibxml/reference.RFC.3697'>
<!ENTITY RFC2474 PUBLIC ''
'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2474'>
<!ENTITY RFC2991 PUBLIC ''
'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2991'>
<!-- You need one entry like the following for each I-D referenced -->
<!ENTITY DRAFT-ice SYSTEM "http://xml.resource.org/public/rfc/bibxml3/reference.I-D.ietf-mmusic-ice.xml">
<!-- This defines the specific filename and version number of your draft (and inserts the appropriate IETF boilerplate) -->
<rfc ipr="trust200902" docName="draft-carpenter-flow-ecmp-03" category="bcp">
<front>
<title abbrev="Flow Label for tunnel ECMP/LAG">Using the IPv6 flow label for equal cost multipath routing and link aggregation in tunnels</title>
<author initials="B. E." surname="Carpenter" fullname="Brian Carpenter">
<organization abbrev="Univ. of Auckland"></organization>
<address>
<postal>
<street>Department of Computer Science</street>
<street>University of Auckland</street>
<street>PB 92019</street>
<city>Auckland</city>
<region></region>
<code>1142</code>
<country>New Zealand</country>
</postal>
<email>brian.e.carpenter@gmail.com</email>
</address>
</author>
<author initials="S." surname="Amante" fullname="Shane Amante">
<organization abbrev="Level 3"></organization>
<address>
<postal>
<street>Level 3 Communications, LLC</street>
<street>1025 Eldorado Blvd</street>
<city>Broomfield</city>
<region>CO</region>
<code>80021</code>
<country>USA</country>
</postal>
<email>shane@level3.net</email>
</address>
</author>
<date day="7" month="October" year="2010" />
<!-- <area>Internet</area>
<workgroup>6MAN</workgroup> -->
<abstract>
<t>The IPv6 flow label has certain restrictions on its use. This document describes how those restrictions
apply when using the flow label for load balancing by equal cost multipath routing, and for
link aggregation, particularly for tunneled traffic.
</t>
</abstract>
</front>
<middle>
<section anchor="intro" title="Introduction">
<t>When several network paths between the same two nodes are known by the routing system to
be equally good (in terms of capacity and latency), it may be desirable to share traffic among
them. Two such techniques are known as equal cost multipath routing (ECMP) and link aggregation
(LAG) <xref target="IEEE802.1AX"/>. There are of course numerous
possible approaches to this, but certain goals need to be met: </t>
<list style="symbols">
<t>Roughly equal share of traffic on each path.</t>
<t>Work-conserving method (no idle time when queue is non-empty).</t>
<t>Minimize or avoid out-of-order delivery for individual traffic flows.</t>
</list>
<t>There is some conflict between these goals: for example, strictly avoiding idle time
could cause a small packet sent on an idle path to overtake a bigger packet from
the same flow, causing out-of-order delivery. </t>
<t>One lightweight approach to ECMP or LAG is this: if there are N
equally good paths to choose from, then form a modulo(N) hash <xref target="RFC2991"/>
from a consistent set of fields in each packet header, and use the
resulting value to select a particular path. If the hash function is
chosen so that the hash values have a uniform statistical distribution,
this method will share traffic roughly equally between the N paths.
If the header fields included in the hash are consistent, all packets
from a given flow will generate the same hash, so out-of-order
delivery will not occur. Assuming a large number of unique flows are
involved, it is also probable that the method will be work-conserving,
since the queue for each link will remain non-empty. </t>
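<t>As an illustration only, the modulo(N) selection described above might be sketched as follows.
The function name select_path, the use of SHA-256 as the hash, and the example addresses are
assumptions made for this sketch and are not taken from <xref target="RFC2991"/>; real forwarding
hardware typically uses a much cheaper hash function. </t>
<figure><artwork><![CDATA[
```python
import hashlib
import ipaddress


def select_path(src: str, dst: str, n_paths: int) -> int:
    """Return a path index in [0, n_paths) from a hash of {src, dst}."""
    # Build the key from a consistent set of header fields.
    key = (ipaddress.ip_address(src).packed
           + ipaddress.ip_address(dst).packed)
    # SHA-256 is only a stand-in for a uniformly distributed hash.
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_paths


if __name__ == "__main__":
    # Every packet of one flow maps to the same path index.
    print(select_path("2001:db8::1", "2001:db8::2", 4))
```
]]></artwork></figure>
<t>Because the key is built from the same header fields for every packet, all packets of a given
flow select the same path, while distinct flows are spread over the N paths. </t>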
<t>The question with such a method is which IP header fields are chosen to identify
a flow and, consequently, are used as input keys to a modulo(N) hash
algorithm. </t>
<t>In the remainder of this document, we will use the term "flow" to
represent a sequence of packets that may be identified by either the
source and destination IP addresses alone {2-tuple} or the source
and destination IP addresses, protocol and source and destination
port numbers {5-tuple}. It should be noted that the latter is more
specifically referred to as a "microflow" in <xref target="RFC2474"/>,
but this term is not used in connection with the flow label in <xref target="RFC3697"/>.
</t>
<t>Returning to the question of which IP header fields should identify a flow, there are two common choices.
A minimal choice in the routing system is simply to use a hash of the source and destination
IP addresses, i.e., the 2-tuple. This is necessary and sufficient to avoid out-of-order delivery, and
with a wide variety of sources and destinations, as one finds in
the core of the network, sometimes sufficient to achieve work-conserving load sharing.
In practice, implementations often use the 5-tuple {dest addr, source addr, protocol, dest port, source port}
as input keys to the hash function, to maximize the probability of evenly sharing traffic over
the equal cost paths.
However, including transport layer information
as input keys to a hash may be a problem for IPv4 fragments <xref target="RFC2991"/>.
In addition, including protocol and destination port numbers in the hash will
make it slightly more expensive to compute without particularly improving
the hash distribution, due to the prevalence of well-known port numbers and popular
protocol numbers. Ephemeral ports, on the other hand, are quite well distributed <xref target="Lee10"/>.
In the case of IPv6, protocol numbers are particularly inconvenient due to the variable
placement and variable length of next-headers. Moreover, <xref target="RFC2460"/>
recommends that all next-headers, except hop-by-hop options, should not be
inspected by intermediate nodes in the network, presumably
to make introduction of new next-headers more straightforward.
</t>
<t>The situation is different in tunneled scenarios.
Identifying a flow inside the tunnel is more complicated,
particularly because nearly all hardware can only identify
flows based on information contained in the outermost IP header.
Assume that traffic from
many sources to many destinations is aggregated in a single IP-in-IP tunnel from
tunnel end point (TEP) A to TEP B (see figure).
Then all the packets forming the tunnel have outer source address A and outer destination address B.
In all probability they also have the same port and protocol numbers. If there are
multiple paths between routers R1 and R2, and ECMP or LAG is applied to choose a
particular path, the 5-tuple and its hash will be constant and no load sharing will be achieved.
If there is much tunnel traffic, this will result in a high probability of congestion on one
of the paths between R1 and R2. </t>
<figure><artwork>
_____ _____ _____ _____
| TEP |_________| R1 |-------------| R2 |_________| TEP |
|__A__| |_____|-------------|_____| |__B__|
tunnel ECMP or LAG tunnel
here
</artwork></figure>
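<t>The lack of load sharing in the tunnel scenario shown in the figure can be illustrated with a
small sketch. The TEP addresses, the helper names and the SHA-256 stand-in hash are assumptions
made for this example; the outer packets are modelled as IPv6-in-IPv6 (next header 41) with both
ports set to zero, since there is no outer transport header. </t>
<figure><artwork><![CDATA[
```python
import hashlib
import ipaddress

TEP_A = "2001:db8:a::1"  # hypothetical tunnel end point addresses
TEP_B = "2001:db8:b::1"


def five_tuple_path(src, dst, proto, sport, dport, n_paths):
    """Hash the outer 5-tuple, then reduce modulo the path count."""
    key = (ipaddress.ip_address(src).packed
           + ipaddress.ip_address(dst).packed
           + bytes([proto])
           + sport.to_bytes(2, "big")
           + dport.to_bytes(2, "big"))
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_paths


def tunnel_paths(n_packets, n_paths):
    """Set of paths used by n_packets tunneled from TEP A to TEP B."""
    # The outer header is identical for every packet, whatever the
    # inner flow, so the hash input never changes.
    return {five_tuple_path(TEP_A, TEP_B, 41, 0, 0, n_paths)
            for _ in range(n_packets)}
```
]]></artwork></figure>
<t>However many packets are sent, tunnel_paths() returns a set containing a single path index,
which is the situation in which one of the paths between R1 and R2 becomes congested. </t>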
<t>Also, for IPv6, the total number of bits in the 5-tuple is quite
large (296), and the fields are inconvenient to extract due to the next-header placement.
This may be challenging for some hardware implementations, raising the
possibility that network equipment vendors might truncate the fields
extracted from an IPv6 header. The question therefore arises whether the 20-bit
flow label in IPv6 packets would be suitable for use as input to an ECMP or LAG
hash algorithm. If it could be used in place of the port numbers and protocol
number in the 5-tuple, the hash calculation would be simplified. </t>
<t>The flow label is left experimental by <xref target="RFC2460"/> but
is better defined by <xref target="RFC3697"/>. We quote three rules
from that RFC:</t>
<list style="numbers">
<t>"The Flow Label value set by the source MUST be delivered unchanged to
the destination node(s)."</t>
<t>"IPv6 nodes MUST NOT assume any mathematical or other properties
of the Flow Label values assigned by source nodes." </t>
<t>"Router performance SHOULD NOT be dependent on the distribution of the Flow Label
values. Especially, the Flow Label bits alone make poor material for a hash key." </t>
</list>
<t>These rules, especially the last one, have caused designers to hesitate about using
the flow label in support of ECMP or LAG. The fact is that today most nodes set a zero value
in the flow label, and the first rule definitely forbids the routing system from changing
the flow label once a packet has left the source node. Considering normal IPv6 traffic,
the fact that the flow label is typically zero means that it would add no value to
an ECMP or LAG hash. But neither would it do any harm to the distribution of the hash values. If the
community at some stage agrees to set pseudo-random flow labels in the majority of
traffic flows, this would add to the value of the hash. </t>
<t>However, in the
case of an IP-in-IPv6 tunnel, the TEP is itself the source node of the outer packets. Therefore, a TEP
may freely set a flow label in the outer IPv6 header of the packets it sends into the tunnel. In
particular, it may follow the <xref target="RFC3697"/> suggestion to set a pseudo-random value. </t>
<t>The second and third rules quoted above need to be seen in the context of <xref target="RFC3697"/>,
which assumes that routers using the flow label in some way will be involved in some sort
of method of establishing flow state:
"To enable flow-specific treatment, flow state needs to be established
on all or a subset of the IPv6 nodes on the path from the source to
the destination(s)." The RFC should perhaps have made clear that a router that has participated
in flow state establishment can rely on properties of the resulting flow label values without
further signaling.
If a router knows these properties, rule 2 is irrelevant, and it can choose to deviate from rule 3. </t>
<t>In the tunneling situation sketched above, routers R1 and R2 can rely on the flow labels
set by TEP A and TEP B being assigned by a known method. This allows a safe ECMP or LAG method to
be based on the flow label without breaching <xref target="RFC3697"/>. </t>
</section> <!-- intro -->
<section title="Normative Notation">
<t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in <xref target="RFC2119"/>.</t>
</section>
<section anchor="guide" title="Guidelines">
<t>We assume that the routers supporting ECMP or LAG (R1 and R2 in the above figure) are unaware that
they are handling tunneled traffic. If it is desired to include the IPv6 flow label in an ECMP
or LAG hash in the tunneled scenario shown above, the following guidelines apply: </t>
<list style="symbols">
<t>Inner packets MUST be encapsulated in an outer IPv6 packet whose source and destination
addresses are those of the tunnel end points (TEPs). </t>
<t>The flow label in the outer packet SHOULD be set by the sending TEP to a pseudo-random 20-bit
value in accordance with <xref target="RFC3697"/>. The same flow label value MUST be used for all packets
in a single user flow, as determined by the IP header fields of the inner packet. </t>
<list style="symbols">
<t>Note that this rule is a SHOULD rather than a MUST, to permit individual implementers to take
an alternative approach if they wish to do so. Such an alternative MUST conform to <xref target="RFC3697"/>. </t>
</list>
<t>The sending TEP MUST classify all packets into flows, once it has determined that they
should enter a given tunnel, and then write the relevant flow label into the outer IPv6 header.
A user flow could be identified by the ingress TEP most simply by its {destination, source} address
pair (coarse) or by its 5-tuple {dest addr, source addr, protocol, dest port, source port}
(fine). This is an implementation detail in the sending TEP. </t>
<list style="symbols">
<t>It might be possible to make this classifier stateless, by using a suitable 20-bit hash of
the inner IP header's 2-tuple or 5-tuple as the pseudo-random flow label value. </t>
</list>
<t>At intermediate router(s) that perform load distribution of tunneled packets
whose source address is a TEP, the hash algorithm used to determine
the outgoing component-link in an ECMP and/or LAG toward the next-hop
MUST minimally include the triple {dest addr, source addr, flow label}
to meet the <xref target="RFC3697"/> rules. </t>
<list style="symbols">
<t>Intermediate router(s) MAY also include {protocol, dest
port, source port} as input keys to the ECMP and/or LAG hash
algorithms, to provide sufficient entropy
in cases where the flow label is currently set to zero. </t>
</list>
</list>
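<t>The following non-normative sketch illustrates one way the guidelines above might fit together:
a stateless classifier at the sending TEP and a hash over the triple at an intermediate router.
The function names, the SHA-256 stand-in hash, and the choice to remap a computed label of zero
(a value that <xref target="RFC3697"/> uses for packets not belonging to any flow) are all
assumptions made for this example, not requirements of this document. </t>
<figure><artwork><![CDATA[
```python
import hashlib
import ipaddress


def _digest(*parts: bytes) -> int:
    """Stand-in hash: first 4 bytes of SHA-256 as an integer."""
    raw = hashlib.sha256(b"".join(parts)).digest()
    return int.from_bytes(raw[:4], "big")


def tep_flow_label(src, dst, proto, sport, dport):
    """Stateless 20-bit flow label from the inner packet's 5-tuple."""
    label = _digest(ipaddress.ip_address(src).packed,
                    ipaddress.ip_address(dst).packed,
                    bytes([proto]),
                    sport.to_bytes(2, "big"),
                    dport.to_bytes(2, "big")) & 0xFFFFF
    # Assumption: avoid 0, which marks packets outside any flow.
    return label or 1


def router_path(outer_src, outer_dst, flow_label, n_paths):
    """Path choice from {dest addr, source addr, flow label}."""
    return _digest(ipaddress.ip_address(outer_dst).packed,
                   ipaddress.ip_address(outer_src).packed,
                   flow_label.to_bytes(3, "big")) % n_paths
```
]]></artwork></figure>
<t>In this sketch, packets of one user flow always receive the same outer flow label and therefore
follow the same path, while different user flows in the same tunnel can be spread across paths
even though their outer addresses are identical. </t>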
</section> <!-- guide -->
<section anchor="security" title="Security Considerations">
<t>The flow label is not protected in any way and can be forged by an on-path
attacker. Off-path attackers are unlikely to guess a valid flow label
if a pseudo-random value is used.
In either case, the worst an attacker could do against ECMP or LAG is to attempt
to selectively overload a particular path. For further discussion, see <xref target="RFC3697"/>. </t>
</section> <!-- security -->
<section anchor="iana" title="IANA Considerations">
<t>This document requests no action by IANA. </t>
</section> <!-- iana -->
<section anchor="ack" title="Acknowledgements">
<t>This document was suggested by corridor discussions at IETF76. Joel Halpern made
crucial comments on an early version. We are grateful to Qinwen Hu for general
discussion about the flow label.
Valuable comments and contributions were made by Jarno Rajahalme, Brian Haberman, Sheng Jiang,
and others.</t>
<t>This document was produced using the xml2rfc tool
<xref target="RFC2629"/>.</t>
</section> <!-- ack -->
<section anchor="changes" title="Change log">
<t>draft-carpenter-flow-ecmp-03: clarifications after further comments, 2010-10-07</t>
<t>draft-carpenter-flow-ecmp-02: updated after IETF77 discussion, especially adding LAG,
changed to BCP language, added second author, 2010-04-14</t>
<t>draft-carpenter-flow-ecmp-01: updated after comments, 2010-02-18</t>
<t>draft-carpenter-flow-ecmp-00: original version, 2010-01-19</t>
</section> <!-- changes -->
</middle>
<back>
<references title="Normative References">
&RFC2119;
&RFC2460;
&RFC3697;
</references>
<references title="Informative References">
&RFC2474;
&RFC2629;
&RFC2991;
<!-- &DRAFT-ice; -->
<reference anchor='Lee10' target='http://www.cs.auckland.ac.nz/~brian/udptcp-paper-cam-submit.pdf'>
<front>
<title>Observations of UDP to TCP Ratio and Port Numbers</title>
<author initials="D. J." surname="Lee" fullname="DongJin Lee"/>
<author initials="B. E." surname="Carpenter" fullname="Brian E. Carpenter"/>
<author initials="N." surname="Brownlee" fullname="Nevil Brownlee"/>
<date month='May' year='2010'/>
</front>
<seriesInfo name="Fifth International Conference on Internet Monitoring and Protection" value='ICIMP 2010' />
</reference>
<reference anchor="IEEE802.1AX">
<front>
<title>Link Aggregation</title>
<author>
<organization>Institute of Electrical and Electronics Engineers</organization>
</author>
<date month="" year="2008"/>
</front>
<seriesInfo name="IEEE" value="Standard 802.1AX-2008"/>
</reference>
</references>
</back>
</rfc>