<?xml version="1.0" encoding="US-ASCII"?>
<!-- This template is for creating an Internet Draft using xml2rfc,
     which is available here: http://xml.resource.org. -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!-- One method to get references from the online citation libraries.
     There has to be one entity for each item to be referenced. 
     An alternate method (rfc include) is described in the references. -->
<!ENTITY RFC2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY RFC2918 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2918.xml">
<!ENTITY RFC4760 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.4760.xml">
<!ENTITY RFC4271 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.4271.xml">
<!ENTITY RFC5492 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5492.xml">
<!ENTITY RFC4724 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.4724.xml">
<!ENTITY I-D.ietf-idr-dynamic-cap SYSTEM "http://xml.resource.org/public/rfc/bibxml3/reference.I-D.ietf-idr-dynamic-cap.xml">
]>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<!-- used by XSLT processors -->
<!-- For a complete list and description of processing instructions (PIs), 
     please see http://xml.resource.org/authoring/README.html. -->
<!-- Below are generally applicable Processing Instructions (PIs) that most I-Ds might want to use.
     (Here they are set differently than their defaults in xml2rfc v1.32) -->
<?rfc strict="yes" ?>
<!-- give errors regarding ID-nits and DTD validation -->
<!-- control the table of contents (ToC) -->
<?rfc toc="yes"?>
<!-- generate a ToC -->
<?rfc tocdepth="4"?>
<!-- the number of levels of subsections in ToC. default: 3 -->
<!-- control references -->
<?rfc symrefs="yes"?>
<!-- use symbolic references tags, i.e, [RFC2119] instead of [1] -->
<?rfc sortrefs="yes" ?>
<!-- sort the reference entries alphabetically -->
<!-- control vertical white space 
     (using these PIs as follows is recommended by the RFC Editor) -->
<?rfc compact="yes" ?>
<!-- do not start each main section on a new page -->
<?rfc subcompact="no" ?>
<!-- keep one blank line between list items -->
<!-- end of list of popular I-D processing instructions -->
<rfc category="std" docName="draft-ietf-idr-bgp-nh-cost-00"
     ipr="pre5378Trust200902">
  <!-- category values: std, bcp, info, exp, and historic
     ipr values: full3667, noModification3667, noDerivatives3667
     you can add the attributes updates="NNNN" and obsoletes="NNNN" 
     they will automatically be output with "(if approved)" -->

  <!-- ***** FRONT MATTER ***** -->

  <front>
    <!-- The abbreviated title is used in the page header - it is only necessary if the 
         full title is longer than 39 characters -->

    <title abbrev="draft-ietf-idr-bgp-nh-cost">Carrying next-hop cost
    information in BGP</title>

    <!-- add 'role="editor"' below for the editors if appropriate -->

    <!-- Another author who claims to be an editor -->

    <author fullname="Ilya Varlashkin" initials="I.V." surname="Varlashkin">
      <organization>Easynet Global Services</organization>

      <address>
        <email>ilya.varlashkin@easynet.com</email>
      </address>
    </author>

    <author fullname="Robert Raszuk" initials="R" surname="Raszuk">
      <organization>NTT MCL Inc.</organization>

      <address>
        <postal>
          <street>101 S Ellsworth Avenue Suite 350</street>

          <city>San Mateo</city>

          <region>CA</region>

          <code>94401</code>

          <country>US</country>
        </postal>

        <email>robert@raszuk.net</email>
      </address>
    </author>

    <date year="2012"/>

    <!-- Meta-data Declarations -->

    <area>General</area>

    <workgroup>Internet Engineering Task Force</workgroup>

    <!-- WG name at the upperleft corner of the doc,
         IETF is fine for individual submissions.  
	 If this element is not present, the default is "Network Working Group",
         which is used by the RFC Editor as a nod to the history of the IETF. -->

    <keyword>IDR</keyword>

    <keyword>BGP</keyword>

    <!-- Keywords will be incorporated into HTML output
         files in a meta tag but they have no effect on text or nroff
         output. If you submit your draft to the RFC Editor, the
         keywords will be used for the search engine. -->

    <abstract>
      <t>This document describes a new BGP SAFI for exchanging next-hop cost
      information, so that the best path can be calculated from a peer's
      perspective rather than from the local BGP speaker's own
      perspective.</t>
    </abstract>
  </front>

  <middle>
    <section title="Motivation">
      <t>In certain situations route-reflector clients may not get the optimum
      path to certain destinations. ADDPATH solves this problem by letting a
      route-reflector advertise multiple paths for a given prefix. If the
      number of advertised paths is sufficiently large, route-reflector
      clients can choose the same route as they would in a full mesh. This
      approach, however, places an additional burden on the control plane.
      The solutions proposed in [BGP-ORR] use a different approach: instead of
      calculating the best path from the local speaker's own perspective, the
      calculations use the cost from the client to the next-hops. Although
      these solutions eliminate the need to transmit redundant routing
      information between peers, there are scenarios where the cost to the
      next-hop cannot be obtained accurately using these methods. For example,
      if the next-hop information itself has been learned via BGP, then a
      simple SPF run on the link-state database is not sufficient to obtain
      the cost information. To address such scenarios, this document proposes
      a solution in which cost information to the next-hops is carried within
      BGP itself using a dedicated SAFI.</t>
    </section>

    <section title="NEXT-HOP INFORMATION BASE">
      <t>To facilitate further description of the proposed solution, we
      introduce a new table that holds all known next-hops and the costs to
      reach them from various routers on the network.</t>

      <t>The Next-Hop Information Base (NHIB) stores the cost to reach a
      next-hop from an arbitrary router on the network. This information is
      essential for choosing the best path from a peer's perspective rather
      than from the BGP speaker's own perspective. In canonical form an NHIB
      entry is a triplet (router, next-hop, cost); however, this
      specification does not impose any restriction on how BGP
      implementations store that information internally. The cost in the NHIB
      does not have to be an IGP cost, but all costs in the NHIB MUST be
      comparable with each other.</t>

      <t>The NHIB can be populated from various sources, both static and
      dynamic. This document focuses on populating the NHIB using BGP.
      However, protocols other than BGP could also be used to populate the
      NHIB.</t>
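
      <t>As a non-normative illustration, the canonical triplet form could be
      held in a simple keyed table. The following Python sketch is only one
      possible layout; the class name NHIB, its method names, and the use of
      router names as strings are assumptions made for this example and are
      not part of this specification. The sample costs are the ones used in
      the "Trivial case" scenario.</t>

      <figure>
        <artwork><![CDATA[
```python
from typing import Dict, Optional, Tuple

UNREACHABLE = 0xFFFFFFFF  # all-ones cost used by this document


class NHIB:
    """Hypothetical Next-Hop Information Base.

    Entries are keyed by (router, next_hop) and hold one cost each.
    """

    def __init__(self) -> None:
        self._cost: Dict[Tuple[str, str], int] = {}

    def update(self, router: str, next_hop: str, cost: int) -> None:
        # One canonical triplet (router, next-hop, cost) per key.
        self._cost[(router, next_hop)] = cost

    def cost(self, router: str, next_hop: str) -> Optional[int]:
        # None means "no information yet", distinct from UNREACHABLE.
        return self._cost.get((router, next_hop))

    def remove_router(self, router: str) -> None:
        # Drop everything that was learned from one router.
        for key in [k for k in self._cost if k[0] == router]:
            del self._cost[key]


nhib = NHIB()
nhib.update("R4", "R1", 8)
nhib.update("R4", "R2", 1)
```
]]></artwork>
      </figure>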
    </section>

    <section title="BGP BEST PATH SELECTION MODIFICATION">
      <t>This section applies regardless of the method used to populate the
      NHIB.</t>

      <t>When a BGP speaker conforming to this specification selects routes
      to be advertised to a peer, it SHOULD, after step (d) of Section
      9.1.2.2 of <xref target="RFC4271"/>, use the cost information from the
      NHIB for that peer rather than its own IGP cost to the next-hop.</t>
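
      <t>The per-peer comparison can be sketched as follows. This is a
      minimal, non-normative Python illustration: the function name
      best_next_hop, the dictionary layout, and the rule of skipping
      next-hops that have no NHIB entry are assumptions of the sketch. The
      costs from R4 are taken from the "Trivial case" scenario, while the
      costs from R3 are invented for the example.</t>

      <figure>
        <artwork><![CDATA[
```python
from typing import Dict, List, Optional, Tuple

UNREACHABLE = 0xFFFFFFFF

# (router, next_hop) -> cost, i.e. the NHIB in canonical triplet form.
Nhib = Dict[Tuple[str, str], int]


def best_next_hop(peer: str, candidates: List[str],
                  nhib: Nhib) -> Optional[str]:
    """Pick the candidate next-hop with the lowest cost seen by peer.

    candidates are the next-hops of paths still tied after step (d).
    Next-hops with no NHIB entry or with the all-ones cost are
    skipped (an assumption of this sketch).
    """
    usable = [(nhib[(peer, nh)], nh) for nh in candidates
              if nhib.get((peer, nh), UNREACHABLE) != UNREACHABLE]
    if not usable:
        return None
    # min() on (cost, next_hop) tuples; the name only breaks exact
    # ties here, a real speaker would continue with later steps.
    return min(usable)[1]


nhib = {("R4", "R1"): 8, ("R4", "R2"): 1,   # from "Trivial case"
        ("R3", "R1"): 1, ("R3", "R2"): 9}   # invented for the sketch
```
]]></artwork>
      </figure>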
    </section>

    <section title="USING BGP TO POPULATE NHIB">
      <t>This section describes an extension to the base BGP specification
      that allows BGP speakers to exchange next-hop cost information via a
      new SAFI in order to populate the NHIB. Although next-hop costs are
      exchanged via a dedicated SAFI, this information is vital to the best
      path selection process for other AFI/SAFI (e.g. IPv4 and IPv6
      unicast). It is therefore recommended that next-hop cost information
      be exchanged before other AFI/SAFI.</t>

      <section title="NEXT-HOP SAFI">
        <t>This document introduces the Next-Hop SAFI (NH SAFI), with a value
        to be assigned by IANA, for the purpose of exchanging information
        about the cost to next-hops.</t>
      </section>

      <section title="CAPABILITY ADVERTISEMENT">
        <t>A BGP speaker willing to exchange next-hop information MUST
        advertise this in the OPEN message using BGP Capability Code 1
        (Multiprotocol Extensions, see <xref target="RFC4760"/>), setting the
        AFI appropriately to indicate IPv4 or IPv6 and the SAFI to the value
        assigned by IANA for NH SAFI. Note that if a BGP speaker wishes to
        exchange cost information for both IPv4 and IPv6, then it MUST
        advertise two capabilities: one NH SAFI for IPv4 and one NH SAFI for
        IPv6.</t>
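
        <t>For illustration only, the two capabilities could be built as in
        the Python sketch below, which follows the Multiprotocol Extensions
        capability layout (AFI, reserved octet, SAFI). The constant NH_SAFI
        is an arbitrary placeholder, since the real value has not yet been
        assigned by IANA, and the function name mp_capability is
        hypothetical.</t>

        <figure>
          <artwork><![CDATA[
```python
import struct

AFI_IPV4 = 1
AFI_IPV6 = 2
NH_SAFI = 250  # placeholder only; real value to be assigned by IANA


def mp_capability(afi: int, safi: int) -> bytes:
    """Encode one Multiprotocol Extensions capability."""
    # Code (1 octet) = 1, Length (1 octet) = 4, then the value:
    # AFI (2 octets), Reserved (1 octet, zero), SAFI (1 octet).
    return struct.pack("!BBHBB", 1, 4, afi, 0, safi)


# A speaker exchanging costs for both IPv4 and IPv6 sends both.
caps = [mp_capability(AFI_IPV4, NH_SAFI),
        mp_capability(AFI_IPV6, NH_SAFI)]
```
]]></artwork>
        </figure>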
      </section>

      <section title="INFORMATION ENCODING">
        <t>To request the cost to a next-hop from a peer, or to inform a peer
        about the cost to a next-hop, BGP attribute 14 (MP_REACH_NLRI) is
        used as follows:</t>

        <t><list style="numbers">
            <t>AFI is set to indicate IPv4 or IPv6 (whichever is
            appropriate)</t>

            <t>SAFI is set to NH SAFI</t>

            <t>Network Address of Next-Hop field is zeroed out</t>

            <t>NLRI field is encoded as shown in the next figure</t>
          </list></t>

        <t><figure>
            <artwork><![CDATA[ +-------------+------------+
 |  NEXT_HOP   |     cost   |
 +-------------+------------+]]></artwork>
          </figure></t>

        <t>Here cost is a 32-bit unsigned integer (values described below),
        and NEXT_HOP is the AFI-specific address of the next-hop whose cost
        is being communicated or requested. The size of the NEXT_HOP field is
        inferred from the total length of attribute 14.</t>

        <t>To request the cost to an arbitrary next-hop from a peer, a BGP
        speaker sets the cost field to zero.</t>

        <t>To inform a peer about the cost to a next-hop, a BGP speaker sets
        the cost field to the actual cost value.</t>

        <t>To inform a peer that a next-hop is not reachable, the cost field
        is set to all-ones (0xFFFFFFFF).</t>
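
        <t>The NLRI layout and the three cost values can be sketched in
        Python as follows. This is a non-normative illustration that handles
        a single NLRI entry in isolation; the function names encode_nlri and
        decode_nlri are hypothetical, and carrying the cost in network byte
        order is an assumption, since the text above does not state the byte
        order.</t>

        <figure>
          <artwork><![CDATA[
```python
import ipaddress
import struct
from typing import Tuple, Union

REQUEST = 0               # cost value that asks the peer for its cost
UNREACHABLE = 0xFFFFFFFF  # cost value marking next-hop unreachable

Address = Union[ipaddress.IPv4Address, ipaddress.IPv6Address]


def encode_nlri(next_hop: str, cost: int) -> bytes:
    """NEXT_HOP address octets followed by the 32-bit cost."""
    # Network byte order for the cost is an assumption of this sketch.
    addr = ipaddress.ip_address(next_hop)
    return addr.packed + struct.pack("!I", cost)


def decode_nlri(data: bytes) -> Tuple[Address, int]:
    """Infer the NEXT_HOP size from the total length of the entry."""
    if len(data) not in (8, 20):  # 4 or 16 address octets + 4 of cost
        raise ValueError("unexpected NH SAFI NLRI length")
    (cost,) = struct.unpack("!I", data[-4:])
    return ipaddress.ip_address(data[:-4]), cost


wire = encode_nlri("192.0.2.1", 8)
```
]]></artwork>
        </figure>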
      </section>

      <section title="SESSION ESTABLISHMENT">
        <t>BGP speakers willing to exchange next-hop information SHOULD NOT
        establish more than one session for a given AFI and NH SAFI, even
        using different transport addresses. This can be ensured, for
        example, by checking the peer's Router ID.</t>
      </section>

      <section title="INFORMATION EXCHANGE">
        <t>Typically NH SAFI sessions will be established between a
        route-reflector and its internal peers (both clients and
        non-clients). As soon as the NH SAFI session is ESTABLISHED, requests
        for next-hop costs and information about next-hop costs MAY be sent
        independently. That is, a route-reflector MAY send multiple requests
        without waiting for a response, and its peers MAY send cost
        information before or after receiving such a request. On the other
        hand, route-reflectors SHOULD request cost information from their
        internal peers as soon as possible (for the reasons stated in section
        "BGP best path selection modification"). A BGP speaker does not need
        to track outstanding requests to the peer.</t>

        <t>When a BGP speaker receives a request for cost information, it
        MUST reply either with the actual cost to the given next-hop (not
        necessarily the IGP cost, but whatever has been chosen to be carried
        in NH SAFI) or with the cost set to all-ones, indicating that the
        next-hop is unreachable.</t>

        <t>Note that the BGP speaker MUST use longest match rather than exact
        match for the next-hop.</t>
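
        <t>A reply computed with longest match could look like the following
        non-normative Python sketch. The function name reply_cost and the
        prefix-to-cost table are hypothetical stand-ins for whatever routing
        data an implementation uses to derive costs; the prefixes and costs
        are invented for the example.</t>

        <figure>
          <artwork><![CDATA[
```python
import ipaddress
from typing import Dict

UNREACHABLE = 0xFFFFFFFF


def reply_cost(next_hop: str, table: Dict[str, int]) -> int:
    """Return the cost of the longest prefix covering next_hop.

    If no prefix covers the address, the all-ones cost is returned
    to signal that the next-hop is unreachable.
    """
    addr = ipaddress.ip_address(next_hop)
    best_len, best_cost = -1, UNREACHABLE
    for prefix, cost in table.items():
        net = ipaddress.ip_network(prefix)
        if net.version == addr.version and addr in net:
            if net.prefixlen > best_len:
                best_len, best_cost = net.prefixlen, cost
    return best_cost


routes = {"10.0.0.0/8": 20, "10.1.0.0/16": 5}  # invented values
```
]]></artwork>
        </figure>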

        <t>When a BGP speaker detects a change in the cost to a previously
        advertised next-hop, with a delta equal to or exceeding the
        configured advertisement threshold, it SHOULD inform the peer by
        advertising the new cost or 0xFFFFFFFF.</t>
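
        <t>The threshold check can be sketched in Python as shown below. This
        is a non-normative illustration; the function name should_advertise
        is hypothetical, and treating any transition to or from the
        unreachable value as always worth advertising is an assumption of
        the sketch rather than a rule stated in this document.</t>

        <figure>
          <artwork><![CDATA[
```python
UNREACHABLE = 0xFFFFFFFF


def should_advertise(advertised: int, current: int,
                     threshold: int) -> bool:
    """True when the cost change is worth telling the peer about."""
    if advertised == current:
        return False  # no change, nothing to send
    # Assumption: a transition to or from unreachable is always sent,
    # since subtracting the all-ones marker is not a meaningful delta.
    if UNREACHABLE in (advertised, current):
        return True
    # Delta equal to or exceeding the configured threshold.
    return abs(current - advertised) >= threshold
```
]]></artwork>
        </figure>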

        <t>When a BGP speaker discovers a new next-hop among candidate routes,
        it SHOULD request cost information from the peer.</t>
      </section>

      <section title="TERMINATION OF NH SAFI SESSION">
        <t>When a BGP speaker terminates (for whatever reason) an NH SAFI
        session with a peer, it SHOULD remove all cost information received
        from that peer unless instructed by configuration to do otherwise.</t>
      </section>

      <section title="GRACEFUL RESTART AND ROUTE REFRESH">
        <t>NH SAFI sessions can use the graceful restart and route refresh
        mechanisms in the same way as they are used for IPv4 and IPv6
        unicast.</t>
      </section>
    </section>

    <section title="Security considerations">
      <t>No new security issues are introduced to the BGP protocol by this
      specification.</t>
    </section>

    <section title="IANA Considerations">
      <t>IANA is requested to allocate a value for the Next-Hop Subsequent
      Address Family Identifier (NH SAFI).</t>
    </section>
  </middle>

  <!--  *****BACK MATTER ***** -->

  <back>
    <references title="Normative References">
      &RFC4760;

      &RFC4271;
    </references>

    <references title="Informative References">
      &RFC2918;

      <?rfc include="reference.I-D.raszuk-bgp-optimal-route-reflection"?>
    </references>

    <section title="USAGE SCENARIOS">
      <section title="Trivial case">
        <figure>
          <artwork><![CDATA[
     --+---NetA---+--
       |          |
      r1          r2
       |          |
       R1--RR-----R2
       | \        |
       |  +------R4
       R3        
        ]]></artwork>
        </figure>

        <t>In this scenario r1 and r2, along with NetA, are part of AS1;
        R1-R4, along with RR, are in AS2.</t>

        <t>If RR implements non-optimized route-reflection, then it will
        choose the path to NetA via R1 and advertise it to both R3 and R4.
        This choice is good from R3's perspective, but it results in
        suboptimal traffic flow from R4 to NetA.</t>

        <t>Using NH SAFI, the route-reflector will learn that the cost from R4
        to R1 is 8 whereas the cost to R2 is only 1. RR will announce NetA to
        R4 with the next-hop set to R2, while its announcement to R3 will
        still have R1 as the next-hop. Both R3 and R4 will now send traffic to
        NetA via the closest exit, achieving the same behaviour as if a full
        iBGP mesh had been configured.</t>
      </section>

      <section title="Non-IGP based cost">
        <t>When it is desirable to direct traffic over an exit other than the
        one with the smallest IGP cost, NH SAFI can be used to convey a cost
        that is not based on the IGP. For example, a network operator may
        arrange exit points in order of administrative preference and
        configure routers to send this instead of the IGP cost. The route
        reflector will then calculate the best path based on administrative
        preference rather than IGP metrics.</t>

        <t>Network operators should exercise care to ensure that all routers
        up to and including the exit point do not divert packets onto a
        different path; otherwise routing loops may occur. One way to achieve
        this is to have consistent administrative preference among all
        routers. Another option is to use a tunneling mechanism (e.g. an
        MPLS-TE tunnel) between the source and the exit point, provided that
        the router serving as the exit point will send packets out of the
        network rather than diverting them to another exit point.</t>
      </section>

      <section title="Multiple route-reflectors">
        <t>This example demonstrates that NH SAFI peerings are necessary only
        between routers that already exchange other AFI/SAFI.</t>

        <figure>
          <artwork><![CDATA[
                           |
R1----R3---------R5----R7--+
      |           |        |
     RR1          |       NetA
      |          RR2       |
      |           |        |
R2----R4---------R6----R8--+
                           |
       ]]></artwork>
        </figure>

        <t>In the above network the routers R1-R4 are clients of RR1, and
        R5-R8 are clients of RR2. RR1 and RR2 also peer with each other and
        use ADDPATH.</t>

        <t>RR2 learns about NetA from R7 and R8. Since it sends not just the
        best path but all paths to RR1, there is no need for RR2 to learn the
        cost from R1 and R2 towards R7 and R8. RR1, on the other hand, does
        exchange NH SAFI information with R1 and R2, so that each of them
        receives the routes that are best from its own perspective.</t>

        <t>In addition to ADDPATH, a mechanism could be devised that would
        allow RR2 to learn how many alternative routes it needs to send to
        RR1. For example, if NetA were also connected to R9 (not shown) and
        all clients of RR1 preferred R7 as the exit point and R9 as the
        next-best, then there would be no need for RR2 to send NetA routes
        with next-hop R8 to RR1.</t>

        <t>Discussion: the authors would like to solicit discussion on
        whether there is sufficient interest in such a mechanism.</t>
      </section>

      <section title="Inter-AS MPLS VPN">
        <t>The previous example could be transposed to an Inter-AS MPLS VPN
        Option C scenario. In this case route reflectors RR1 and RR2 can be
        in different autonomous systems. Essentially, the behaviour of the
        routers remains as already described.</t>
      </section>

      <section title="Corner case">
        <figure>
          <artwork><![CDATA[
   --+---NetA--+--
     |         |
RR---R1        R2
       \      /
        R3---R4       ]]></artwork>
        </figure>

        <t>In the above network the cost from R3 to R1 is 10; all other costs
        are 1. If RR advertises NetA to R3 based on cost information received
        from R3, but uses its own cost when advertising NetA to R4, a loop
        will form. This is why section "BGP best path selection
        modification" requires RR to have next-hop cost information for
        every next-hop and every peer.</t>

        <t>Note that the problem is the same as if RR did not use the
        extensions described in this document and R3 peered directly with R1
        and R2, while R4 peered only with RR.</t>
      </section>
    </section>
  </back>
</rfc>
