<?xml version="1.0" encoding="US-ASCII"?>
<!-- edited with XMLSPY v5 rel. 3 U (http://www.xmlspy.com)
     by Daniel M Kohn (private) -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!ENTITY rfc2119 PUBLIC "" "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY rfc4690 PUBLIC "" "http://xml.resource.org/public/rfc/bibxml/reference.RFC.4960.xml">
]>
<rfc category="std" docName="draft-ietf-tsvwg-sctp-failover-15.txt"
     ipr="trust200902">
  <?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>

  <?rfc toc="yes" ?>

  <?rfc symrefs="yes" ?>

  <?rfc sortrefs="yes"?>

  <?rfc iprnotified="no" ?>

  <?rfc strict="yes" ?>

  <front>
    <title abbrev="SCTP-PF">SCTP-PF: Quick Failover Algorithm in SCTP</title>

    <author fullname="Yoshifumi Nishida" initials="Y.N" surname="Nishida">
      <organization>GE Global Research</organization>

      <address>
        <postal>
          <street>2623 Camino Ramon</street>

          <city>San Ramon</city>

          <region>CA</region>

          <code>94583</code>

          <country>USA</country>
        </postal>

        <email>nishida@wide.ad.jp</email>
      </address>
    </author>

    <author fullname="Preethi Natarajan" initials="P.N" surname="Natarajan">
      <organization>Cisco Systems</organization>

      <address>
        <postal>
          <street>510 McCarthy Blvd</street>

          <city>Milpitas</city>

          <region>CA</region>

          <code>95035</code>

          <country>USA</country>
        </postal>

        <email>prenatar@cisco.com</email>
      </address>
    </author>

    <author fullname="Armando Caro" initials="A.C" surname="Caro">
      <organization>BBN Technologies</organization>

      <address>
        <postal>
          <street>10 Moulton St.</street>

          <city>Cambridge</city>

          <region>MA</region>

          <code>02138</code>

          <country>USA</country>
        </postal>

        <email>acaro@bbn.com</email>
      </address>
    </author>

    <author fullname="Paul D. Amer" initials="P.A" surname="Amer">
      <organization>University of Delaware</organization>

      <address>
        <postal>
          <street>Computer Science Department - 434 Smith Hall</street>

          <city>Newark</city>

          <region>DE</region>

          <code>19716-2586</code>

          <country>USA</country>
        </postal>

        <email>amer@udel.edu</email>
      </address>
    </author>

    <author fullname="Karen E. E. Nielsen" initials="K.N" surname="Nielsen">
      <organization>Ericsson</organization>

      <address>
        <postal>
          <street>Kistavägen 25</street>

          <city>Stockholm</city>

          <region/>

          <code>164 80</code>

          <country>Sweden</country>
        </postal>

        <email>karen.nielsen@tieto.com</email>
      </address>
    </author>

    <date/>

    <abstract>
      <t>SCTP supports multi-homing. However, when the failover operation
      specified in RFC4960 is followed, there can be significant delay and
      performance degradation in the data transfer path failover. To overcome
      this problem this document specifies a quick failover algorithm
      (SCTP-PF) based on the introduction of a Potentially Failed (PF) state
      in SCTP Path Management.</t>

      <t>The document also specifies a dormant state operation of SCTP. This
      dormant state operation is required to be followed by an SCTP-PF
      implementation, but it may equally well be applied by a standard RFC4960
      SCTP implementation.</t>

      <t>Additionally, the document introduces an alternative switchback
      operation mode called Primary Path Switchover that will be beneficial in
      certain situations. This mode of operation applies both to a standard
      RFC4960 SCTP implementation and to an SCTP-PF implementation.</t>

      <t>The procedures defined in the document require only minimal
      modifications to the RFC4960 specification. The procedures are
      sender-side only and do not impact the SCTP receiver.</t>
    </abstract>
  </front>

  <middle>
    <section title="Introduction">
      <t>The Stream Control Transmission Protocol (SCTP) specified in <xref
      target="RFC4960"/> supports multi-homing at the transport layer. SCTP's
      multi-homing features include failure detection and failover procedures
      to provide network interface redundancy and improved end-to-end fault
      tolerance. In SCTP's current failure detection procedure, the sender
      must experience Path.Max.Retrans (PMR) number of consecutive failed
      timer-based retransmissions on a destination address before detecting a
      path failure. Until detecting the path failure, the sender continues to
      transmit data on the failed path. The prolonged time in which <xref
      target="RFC4960"/> SCTP continues to use a failed path severely degrades
      the performance of the protocol. To address this problem, this document
      specifies a quick failover algorithm (SCTP-PF) based on the introduction
      of a new Potentially Failed (PF) path state in SCTP path management. The
      performance deficiencies of the <xref target="RFC4960"/> failover
      operation, and the improvements obtainable from the introduction of a
      Potentially Failed state in SCTP, were proposed and documented in <xref
      target="NATARAJAN09"/> for Concurrent Multipath Transfer SCTP <xref
      target="IYENGAR06"/>.</t>

      <t>While SCTP-PF can accelerate the failover process and improve
      performance, it can also increase the risk that an SCTP endpoint enters
      the dormant state, in which all destination addresses are inactive. <xref
      target="RFC4960"/> leaves the protocol operation during dormant state to
      implementations and encourages avoiding the state as much as
      possible by careful tuning of the Path.Max.Retrans (PMR) and
      Association.Max.Retrans (AMR) parameters. We specify a dormant state
      operation for SCTP-PF which makes SCTP-PF provide the same disruption
      tolerance as <xref target="RFC4960"/> even though the dormant state may
      be entered more quickly. The dormant state operation may equally well be
      applied by an <xref target="RFC4960"/> implementation, where it
      provides added fault tolerance for situations in which the tuning
      of the Path.Max.Retrans (PMR) and Association.Max.Retrans (AMR)
      parameters fails to prevent the dormant state from being entered.</t>

      <t>The operation after the recovery of a failed path also impacts the
      performance of the protocol. With the procedures specified in <xref
      target="RFC4960"/> SCTP will, after a failover from the primary path,
      switch back to use the primary path for data transfer as soon as this
      path becomes available again. From a performance perspective such a
      forced switchback of the data transmission path can be suboptimal as the
      CWND towards the original primary destination address has to be rebuilt
      once data transfer resumes, <xref target="CARO02"/>. As an optional
      alternative to the switchback operation of <xref target="RFC4960"/>,
      this document specifies a Primary Path Switchover procedure
      which avoids such forced switchbacks of the data transfer path. The
      Primary Path Switchover operation was originally proposed in <xref
      target="CARO02"/>.</t>

      <t>While SCTP-PF is primarily motivated by a desire to improve
      multi-homed operation, the feature also applies to single-homed SCTP
      operation. Here the algorithm serves to provide increased failure
      detection on idle associations, whereas the failover or switchback
      aspects of the algorithm will not be activated. This is discussed in
      more detail in Appendix C.</t>

      <t>A brief description of the motivation for the introduction of the
      Potentially Failed state, including a discussion of alternative
      approaches to mitigate the deficiencies of the <xref target="RFC4960"/>
      failover operation, is given in the Appendices. A discussion of the path
      bouncing effects that might be caused by frequent switchovers is also
      provided there.</t>
    </section>

    <section title="Conventions and Terminology">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119"/>.</t>
    </section>

    <section anchor="SCTP-PF"
             title="SCTP with Potentially Failed Destination State (SCTP-PF)">
      <section title="Overview">
        <t>To minimize the performance impact during failover, the sender
        should stop transmitting data to a failed destination address as
        early as possible. In the <xref target="RFC4960"/> SCTP path
        management scheme, the sender stops transmitting data to a destination
        address only after the destination address is marked inactive. This
        process takes a significant amount of time as it requires the error
        counter of the destination address to exceed the Path.Max.Retrans
        (PMR) threshold. The issue cannot simply be mitigated by lowering of
        the PMR threshold because this may result in spurious failure
        detection and unnecessary prevention of the usage of a preferred
        primary path. Also, due to the coupled tuning of the Path.Max.Retrans
        (PMR) and the Association.Max.Retrans (AMR) parameter values in <xref
        target="RFC4960"/>, lowering the PMR threshold may result in
        lowering the AMR threshold, which would decrease
        the fault tolerance of SCTP.</t>

        <t>The solution provided in this document is to extend the SCTP path
        management scheme of <xref target="RFC4960"/> by the addition of the
        Potentially Failed (PF) state as an intermediate state in between the
        active and inactive state of a destination address in the <xref
        target="RFC4960"/> path management scheme, and let the failover of
        data transfer away from a destination address be driven by the
        entering of the PF state instead of by the entering of the inactive
        state. Thereby SCTP may perform quick failover without negatively impacting
        the overall fault tolerance of <xref target="RFC4960"/> SCTP. At the
        same time, RTO-based HEARTBEAT probing is initiated towards a
        destination address once it enters PF state. Thereby SCTP may quickly
        ascertain whether network connectivity towards the destination address
        is broken or whether the failover was spurious. If the
        failover was spurious, data transfer may quickly resume towards the
        original destination address.</t>

        <t>The new failure detection algorithm assumes that loss detected by a
        timeout implies either severe congestion or network connectivity
        failure. It recommends that by default a destination address is
        classified as PF at the occurrence of the first timeout.</t>
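        <figure>
          <preamble>The threshold logic can be sketched in C as follows. This
          is a minimal illustration and not part of the specification: the
          type and function names are hypothetical, and PFMR denotes the
          PotentiallyFailed.Max.Retrans parameter specified in the next
          section.</preamble>

          <artwork><![CDATA[
```c
#include <stdint.h>

/* Hypothetical names; not defined by RFC 4960 or this document. */
enum pf_path_state { PATH_ACTIVE, PATH_PF, PATH_INACTIVE };

/*
 * Map a destination address error counter to a path state.
 * pfmr is PotentiallyFailed.Max.Retrans, pmr is Path.Max.Retrans.
 */
enum pf_path_state
pf_classify(uint16_t error_count, uint16_t pfmr, uint16_t pmr)
{
    if (error_count > pmr)
        return PATH_INACTIVE; /* RFC 4960 threshold exceeded */
    if (error_count > pfmr)
        return PATH_PF;       /* intermediate threshold exceeded */
    return PATH_ACTIVE;
}
```
]]></artwork>

          <postamble>Assuming PFMR = 0 and PMR = 5, the first timeout (error
          counter 1) yields the PF state and the sixth consecutive timeout
          yields the inactive state. When PFMR is greater than or equal to
          PMR, the sketch never returns the PF state.</postamble>
        </figure>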
      </section>

      <section title="Specification of the SCTP-PF Procedures">
        <t>The SCTP-PF operation is specified as follows: <list
            style="numbers">
            <t>The sender maintains a new tunable SCTP Protocol Parameter
            called PotentiallyFailed.Max.Retrans (PFMR). The PFMR defines the
            new intermediate PF threshold on the destination address error
            counter. When this threshold is exceeded the destination address is classified
            as PF. The RECOMMENDED value of PFMR is 0, but other values MAY be
            used. If PFMR is set to be greater than or equal to
            Path.Max.Retrans (PMR), the resulting PF threshold will be so high
            that the destination address will reach the inactive state before
            it can be classified as PF.</t>

            <t>The error counter of an active destination address is
            incremented as specified in <xref target="RFC4960"/>. This means
            that the error counter of the destination address will be
            incremented each time the T3-rtx timer expires, or each time a
            HEARTBEAT chunk is sent when idle and not acknowledged within an
            RTO. When the value in the destination address error counter
            exceeds PFMR, the endpoint MUST mark the destination address as in
            the PF state.</t>

            <t>An SCTP-PF sender SHOULD NOT send data to destination addresses
            in PF state when alternative destination addresses in active state
            are available. Specifically this means that: <list style="hanging">
                <t hangText="i">When there is outbound data to send and the
                destination address presently used for data transmission is in
                PF state, the sender SHOULD choose a destination address in
                active state, if one exists, and use this
                destination address for data transmission.</t>

                <t hangText="ii">When retransmitting data that has timed out
                and the sender thus by <xref target="RFC4960"/>, section
                6.4.1, should attempt to pick a new destination address for
                data retransmission, the sender SHOULD choose an alternate
                destination transport address in active state if one
                exists.</t>

                <t hangText="iii">When there is outbound data to send and the
                SCTP user explicitly requests to send data to a destination
                address in PF state, the sender SHOULD send the data to an
                alternate destination address in active state if one
                exists.</t>
              </list>When choosing among multiple destination addresses in
            active state, an SCTP sender will follow the guiding principles of
            section 6.4.1 of <xref target="RFC4960"/> and choose the most
            divergent source-destination pair compared with, for i., the
            destination address in PF state from which it performs the
            failover, and, for ii., the destination address towards which the
            data timed out. Rules for picking the most divergent
            source-destination pair are an implementation decision and are not
            specified within this document.
            <vspace />
            <vspace />
            In all cases, the sender MUST NOT change the state of the
            chosen destination address, whether this state be active or PF,
            and it MUST NOT clear the error counter of the destination address
            as a result of choosing the destination address for data
            transmission.</t>

            <t>When the destination addresses are all in PF state or some in
            PF state and some in inactive state, the sender MUST choose one
            destination address in PF state and transmit or retransmit data to
            this destination address using the following rules: <list
                style="letters">
                <t>The sender SHOULD choose the destination in PF state with
                the lowest error count (fewest consecutive timeouts) for data
                transmission and transmit or retransmit data to this
                destination.</t>

                <t>When there are multiple destination addresses in PF state
                with the same error count, the sender should base the choice
                among them on the <xref target="RFC4960"/>, section
                6.4.1, principles of choosing most divergent
                source-destination pairs when executing (potentially
                consecutive) retransmission. Rules for picking the most
                divergent source-destination pair are an implementation
                decision and are not specified within this document.</t>
              </list> The sender MUST NOT change the state or the error
            counter of any destination address regardless of whether it has
            been chosen for transmission or not.</t>

            <t>The HB.interval of the Path Heartbeat function of <xref
            target="RFC4960"/> MUST be ignored for destination addresses in PF
            state. Instead HEARTBEAT chunks are sent to destination addresses
            in PF state once per RTO. HEARTBEAT chunks SHOULD be sent to
            destination addresses in PF state, but the sending of HEARTBEATS
            MUST honor whether the Path Heartbeat function (Section 8.3 of
            <xref target="RFC4960"/>) is enabled for the destination address
            or not. I.e., if the Path Heartbeat function is disabled for the
            destination address in question, HEARTBEATS MUST NOT be sent. Note
            that when the Path Heartbeat function is disabled, it may take
            longer to transition a destination address in PF state back to
            active state.</t>

            <t>HEARTBEATs are sent when a destination address reaches the PF
            state. When a HEARTBEAT chunk is not acknowledged within the RTO,
            the sender increments the error counter and exponentially backs
            off the RTO value. If the error counter is less than PMR, the
            sender transmits another packet containing the HEARTBEAT chunk
            immediately after timeout expiration on the previous HEARTBEAT.
            When data is being transmitted to a destination address in the PF
            state, the transmission of a HEARTBEAT chunk MAY be omitted in
            cases where the receipt of a SACK of the data or a T3-rtx timer
            expiration on the data can provide equivalent information, such as
            the case where the data chunk has been transmitted to a single
            destination address only. Likewise, the timeout of a HEARTBEAT
            chunk MAY be ignored if data is outstanding towards the
            destination address.</t>

            <t>When the sender receives a HEARTBEAT ACK from a HEARTBEAT sent
            to a destination address in PF state, the sender SHOULD clear the
            error counter of the destination address and transition the
            destination address back to active state. When the sender resumes
            data transmission on a destination address after a transition of
            the destination address from PF to active state, it MUST do this
            following the prescriptions of Section 7.2 of <xref
            target="RFC4960"/>. In a situation where a HEARTBEAT ACK arrives
            while there is data outstanding towards the destination address to
            which the HEARTBEAT was sent, then an implementation MAY choose to
            not have the HEARTBEAT ACK reset the error counter, but have the
            error counter reset await the fate of the outstanding data
            transmission. This situation can happen when data is sent to a
            destination address in PF state.</t>

            <t>Additional (PMR - PFMR) consecutive timeouts on a destination
            address in PF state confirm the path failure, upon which the
            destination address transitions to the inactive state. As
            described in <xref target="RFC4960"/>, the sender (i) SHOULD
            notify the ULP about this state transition, and (ii) transmit
            HEARTBEAT chunks to the inactive destination address at a lower
            HB.interval frequency as described in Section 8.3 of <xref
            target="RFC4960"/> (when the Path Heartbeat function is enabled
            for the destination address).</t>

            <t>Acknowledgments for chunks that have been transmitted to
            multiple destinations (i.e., a chunk which has been retransmitted
            to a different destination address than the destination address to
            which the chunk was first transmitted) SHOULD NOT clear the error
            count for an inactive destination address and SHOULD NOT
            move a destination address in PF state back to active state,
            since a sender cannot disambiguate whether the ACK was for the
            original transmission or the retransmission(s). An SCTP sender MAY
            clear the error counter and move a destination address back to
            active state if it has information other than the acknowledgment
            that uniquely determines which destination, among multiple
            destination addresses, the chunk reached. This document does not
            specify what such information could consist of, nor how it could
            be obtained.</t>

            <t>Acknowledgments for data chunks that have been transmitted to
            one destination address only MUST clear the error counter for the
            destination address and MUST transition a destination address in
            PF state back to active state. This situation can happen when new
            data is sent to a destination address in the PF state. It can also
            happen in situations where the destination address is in the PF
            state due to the occurrence of a spurious T3-rtx timeout and
            acknowledgments start to arrive for data sent prior to the
            occurrence of the spurious T3-rtx timeout, while the data has not
            yet been retransmitted
            towards other destinations. This document does not specify special
            handling for detection of or reaction to spurious T3-rtx timeouts,
            e.g., for special operation vis-a-vis the congestion control
            handling or data retransmission operation towards a destination
            address which undergoes a transition from active to PF to active
            state due to a spurious T3-rtx timeout. But it is noted that this
            is an area which would benefit from additional attention,
            experimentation and specification for single-homed SCTP as well as
            for multi-homed SCTP protocol operation.</t>

            <t>When all destination addresses are in inactive state, and SCTP
            protocol operation thus is said to be in dormant state, the
            prescriptions given in <xref target="dormant"/> shall be
            followed.</t>

            <t>The SCTP stack SHOULD expose the PF state of its destination
            addresses to the ULP as well as provide the means to notify the
            ULP of state transitions of its destination addresses from active
            to PF, and vice-versa. However, it is recommended that an SCTP
            stack implementing SCTP-PF also allow the ULP to be kept
            ignorant of the PF state of its destinations and the associated
            state transitions, thus allowing the simpler state
            transition model of RFC4960 to be retained in the ULP. For this
            reason it is recommended that an SCTP stack implementing SCTP-PF
            also provide
            the ULP with the means to suppress exposure of the PF state and
            the associated state transitions.</t>
          </list></t>
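        <figure>
          <preamble>The destination selection rules of items 3 and 4 above
          can be sketched in C as follows. This is a simplified illustration
          under stated assumptions, not a definitive implementation: the
          structure and function names are hypothetical, the first address in
          active state is taken without regard to the primary path or to path
          divergence, and ties between addresses in PF state are broken by
          lowest index rather than by the divergence principles of RFC 4960,
          section 6.4.1.</preamble>

          <artwork><![CDATA[
```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; not defined by RFC 4960 or this document. */
enum pf_path_state { PATH_ACTIVE, PATH_PF, PATH_INACTIVE };

struct pf_path {
    enum pf_path_state state;
    uint16_t error_count;   /* consecutive timeouts */
};

/*
 * Return the index of the destination to use for data, or -1
 * when every destination is inactive (dormant state).  The
 * paths are only read: neither the state nor the error counter
 * of any destination is changed by the selection.
 */
int
pf_select_destination(const struct pf_path *paths, size_t n)
{
    int best_pf = -1;

    for (size_t i = 0; i < n; i++) {
        if (paths[i].state == PATH_ACTIVE)
            return (int)i;      /* item 3: prefer active */
        if (paths[i].state == PATH_PF &&
            (best_pf < 0 ||
             paths[i].error_count < paths[best_pf].error_count))
            best_pf = (int)i;   /* item 4a: lowest error count */
    }
    return best_pf;
}
```
]]></artwork>

          <postamble>A return value of -1 corresponds to the dormant state,
          for which the dormant state procedure applies
          instead.</postamble>
        </figure>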
      </section>
    </section>

    <section anchor="dormant" title="Dormant State Operation">
      <t>In a situation with complete disruption of the communication
      between the SCTP endpoints, the aggressive HEARTBEAT transmissions of
      SCTP-PF on destination addresses in PF state may make the association
      enter dormant state faster than a standard <xref target="RFC4960"/> SCTP
      implementation given the same setting of Path.Max.Retrans (PMR) and
      Association.Max.Retrans (AMR). For example, an SCTP association with two
      destination addresses typically would reach dormant state in half the
      time of an <xref target="RFC4960"/> SCTP implementation in such
      situations. This is because an SCTP-PF sender will send HEARTBEATS and
      data retransmissions in parallel at RTO intervals when there are
      multiple destination addresses in PF state. This argument presumes that
      RTO &lt;&lt; HB.interval of <xref target="RFC4960"/>. With the design
      goal that SCTP-PF shall provide the same level of disruption tolerance
      as an <xref target="RFC4960"/> SCTP implementation with the same
      Path.Max.Retrans (PMR) and Association.Max.Retrans (AMR) setting, we
      prescribe that an SCTP-PF implementation SHOULD operate as described
      below in <xref target="dormant_details"/> during dormant state.</t>

      <t>An SCTP-PF implementation MAY choose a different dormant state
      operation than the one described below in <xref
      target="dormant_details"/> provided that the solution chosen does not
      decrease the fault tolerance of the SCTP-PF operation.</t>

      <t>The below prescription for SCTP-PF dormant state handling SHOULD NOT
      be coupled to the value of the PFMR, but solely to the activation of
      SCTP-PF logic in an SCTP implementation.</t>

      <t>It is noted that the below dormant state operation is considered to
      provide added disruption tolerance also for an <xref target="RFC4960"/>
      SCTP implementation, and that it can be sensible for an <xref
      target="RFC4960"/> SCTP implementation to follow this mode of operation.
      For an <xref target="RFC4960"/> SCTP implementation the continuation of
      data transmission during dormant state makes SCTP more robust
      in situations where some, or all, alternative paths
      of an SCTP association approach, or reach, inactive state before the
      primary path used for data transmission observes trouble.</t>

      <section anchor="dormant_details" title="SCTP Dormant State Procedure">
        <t><list style="letters">
            <t>When the destination addresses are all in inactive state and
            data is available for transfer, the sender MUST choose one
            destination and transmit data to this destination address.</t>

            <t>The sender MUST NOT change the state of the chosen destination
            address (it remains in inactive state) and it MUST NOT clear the
            error counter of the destination address as a result of choosing
            the destination address for data transmission.</t>

            <t>The sender SHOULD choose the destination in inactive state with
            the lowest error count (fewest consecutive timeouts) for data
            transmission. When there are multiple destinations with the same
            error count in inactive state, the sender SHOULD attempt to pick
            the most divergent source-destination pair from the last
            source-destination pair where failure was observed. Rules for
            picking the most divergent source-destination pair are an
            implementation decision and are not specified within this
            document. To support
            differentiation of inactive destination addresses based on their
            error count, SCTP will need to allow the destination address error
            counters to be incremented up to some reasonable limit
            above PMR+1, thus changing the prescriptions of <xref
            target="RFC4960"/>, section 8.3, in this respect. The exact limit
            to apply is not specified in this document, but it is considered
            reasonable for the limit to be an order of magnitude
            higher than the PMR value. A sender MAY choose to deploy other
            strategies than the strategy defined here. The strategy of
            prioritizing the last active destination address, i.e., the
            destination address with the lowest error count, is optimal when
            some paths are permanently inactive, but suboptimal when a path
            instability is transient.</t>
          </list></t>
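        <figure>
          <preamble>The selection strategy of item c above can be sketched in
          C as follows. This is a minimal illustration under stated
          assumptions: all names are hypothetical, every path passed in is
          assumed to be in inactive state, the cap on the error counter is a
          caller-supplied value that this document does not fix, and ties are
          broken by lowest index rather than by path
          divergence.</preamble>

          <artwork><![CDATA[
```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structure; only the error counter matters here. */
struct dormant_path {
    uint32_t error_count;   /* consecutive timeouts */
};

/*
 * Count one more timeout on an inactive destination.  The
 * counter may grow past PMR+1 but stops at 'limit', an assumed
 * cap (for instance ten times PMR).
 */
void
dormant_count_timeout(struct dormant_path *p, uint32_t limit)
{
    if (p->error_count < limit)
        p->error_count++;
}

/*
 * Pick the inactive destination with the fewest consecutive
 * timeouts.  Returns -1 when n is 0.  The selection neither
 * changes a path state nor clears an error counter.
 */
int
dormant_select(const struct dormant_path *paths, size_t n)
{
    int best = -1;

    for (size_t i = 0; i < n; i++) {
        if (best < 0 ||
            paths[i].error_count < paths[best].error_count)
            best = (int)i;
    }
    return best;
}
```
]]></artwork>

          <postamble>Because the counters keep growing while data times out,
          repeated failures on the chosen destination eventually make another
          destination the one with the lowest error count.</postamble>
        </figure>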
      </section>
    </section>

    <section anchor="permanent_failover" title="Primary Path Switchover">
      <t>The objective of the Primary Path Switchover operation is to allow
      the SCTP sender to continue data transmission on a new working path even
      when the old primary destination address becomes active again. This is
      achieved by having SCTP perform a switchover of the primary path to the
      new working path if the error counter of the primary path exceeds a
      certain threshold. This mode of operation can be applied not only to
      SCTP-PF implementations, but also to <xref target="RFC4960"/>
      implementations.</t>

      <t>The Primary Path Switchover operation requires only sender side
      changes. The details are:</t>

      <t><list style="numbers">
          <t>The sender maintains a new tunable parameter, called
          Primary.Switchover.Max.Retrans (PSMR). For SCTP-PF implementations,
          the PSMR MUST be set greater than or equal to the PFMR value. For
          <xref target="RFC4960"/> implementations the PSMR MUST be set
          greater than or
          equal to the PMR value. Implementations MUST reject any other values
          of PSMR.</t>

          <t>When the path error counter on a set primary path exceeds PSMR,
          the SCTP implementation MUST autonomously select and set a new
          primary path.</t>

          <t>The primary path selected by the SCTP implementation MUST be the
          path which at the given time would be chosen for data transfer. A
          previously failed primary path can be used as data transfer path as
          per normal path selection when the present data transfer path
          fails.</t>

          <t>For SCTP-PF, the recommended value of PSMR is PFMR when Primary
          Path Switchover operation mode is used. This means that no forced
          switchback to a previously failed primary path is performed. An
          SCTP-PF implementation of Primary Path Switchover MUST support the
          setting of PSMR = PFMR. An SCTP-PF implementation of Primary Path
          Switchover MAY support settings of PSMR > PFMR.</t>

          <t>For <xref target="RFC4960"/> SCTP, the recommended value of PSMR
          is PMR when Primary Path Switchover is used. This means that no
          forced switchback to a previously failed primary path is performed.
          An <xref target="RFC4960"/> SCTP implementation of Primary Path
          Switchover MUST support the setting of PSMR = PMR. An <xref
          target="RFC4960"/> SCTP implementation of Primary Path Switchover
          MAY support settings of PSMR > PMR.</t>

          <t>It MUST be possible to disable the Primary Path Switchover
          operation and obtain the standard switchback operation of <xref
          target="RFC4960"/>.</t>
        </list></t>
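      <figure>
        <preamble>The PSMR constraints of item 1 and the trigger of item 2
        above can be sketched in C as follows. This is a minimal
        illustration, not a definitive implementation: the function and
        parameter names are hypothetical, and the selection of the new
        primary path itself is left out.</preamble>

        <artwork><![CDATA[
```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Check a requested PSMR value.  For SCTP-PF the lower bound
 * is PFMR; for a plain RFC 4960 implementation it is PMR.
 * Values below the bound are to be rejected.
 */
bool
psmr_is_valid(uint16_t psmr, bool sctp_pf,
              uint16_t pfmr, uint16_t pmr)
{
    return psmr >= (sctp_pf ? pfmr : pmr);
}

/*
 * Decide whether a new primary path is to be selected: the
 * switchover mode is enabled and the error counter of the set
 * primary path exceeds PSMR.
 */
bool
psmr_should_switch(bool enabled, uint16_t primary_errors,
                   uint16_t psmr)
{
    return enabled && primary_errors > psmr;
}
```
]]></artwork>

        <postamble>With the switchover mode disabled, the second function
        always returns false, which corresponds to the standard switchback
        operation of RFC 4960.</postamble>
      </figure>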

      <t>The manner of switchover operation that is optimal in a given
      scenario depends on the relative quality of a set primary path versus
      the quality of the alternative paths available, as well as on the extent
      to which the mode of operation is desired to enforce traffic
      distribution over a number of network paths. For example, load
      distribution of traffic from multiple SCTP associations may be enforced
      by distribution of the set primary paths with <xref target="RFC4960"/>
      switchback operation. However, as <xref target="RFC4960"/> switchback
      behavior is suboptimal in certain situations, especially in scenarios
      where a number of equally good paths are available, an SCTP
      implementation MAY also support, as alternative behavior, the Primary
      Path Switchover mode of operation and MAY enable it based on
      applications' requests.</t>

      <t>For an SCTP implementation that implements the Primary Path
      Switchover operation, this specification RECOMMENDS that the standard
      <xref target="RFC4960"/> switchback operation be retained as the
      default operation.</t>
    </section>

    <section title="Suggested SCTP Protocol Parameter Values">
      <t>This document does not alter the values recommended in <xref
      target="RFC4960"/> for the SCTP Protocol Parameters defined
      there.</t>

      <t>The following protocol parameter value is RECOMMENDED:<list style="empty">
          <t>PotentiallyFailed.Max.Retrans (PFMR) - 0</t>
        </list></t>
    </section>

    <section title="Socket API Considerations">
      <t>This section describes how the socket API defined in <xref
      target="RFC6458"/> is extended to provide a way for the application to
      control and observe the SCTP-PF behavior as well as the Primary Path
      Switchover function.</t>

      <t>Please note that this section is informational only.</t>

      <t>A socket API implementation based on <xref target="RFC6458"/> is
      extended in two ways. First, the existing SCTP_PEER_ADDR_CHANGE event
      is extended to provide a notification when a peer address enters or
      leaves the potentially failed state. Second, the potentially failed
      state of a peer address is exposed through the existing
      SCTP_GET_PEER_ADDR_INFO socket option.</t>

      <t>Furthermore, two new read/write socket options for the level
      IPPROTO_SCTP, with the names SCTP_PEER_ADDR_THLDS and
      SCTP_EXPOSE_POTENTIALLY_FAILED_STATE, are defined as described below.
      The first socket option is used to control the values of the PFMR and
      PSMR parameters described in <xref target="SCTP-PF"/> and in <xref
      target="permanent_failover"/>. The second one controls the exposure of
      the potentially failed path state.</t>

      <t>Support for the SCTP_PEER_ADDR_THLDS and
      SCTP_EXPOSE_POTENTIALLY_FAILED_STATE socket options also needs to be
      added to the function sctp_opt_info().</t>

      <section anchor="pf_support_api"
               title="Support for the Potentially Failed Path State">
        <t>As defined in <xref target="RFC6458"/>, the SCTP_PEER_ADDR_CHANGE
        event is provided if the status of a peer address changes. In addition
        to the state changes described in <xref target="RFC6458"/>, this event
        is also provided if a peer address enters or leaves the potentially
        failed state. The notification as defined in <xref target="RFC6458"/>
        uses the following structure:</t>

        <figure>
          <artwork>
struct sctp_paddr_change {
  uint16_t spc_type;
  uint16_t spc_flags;
  uint32_t spc_length;
  struct sockaddr_storage spc_aaddr;
  uint32_t spc_state;
  uint32_t spc_error;
  sctp_assoc_t spc_assoc_id;
};
</artwork>
        </figure>

        <t><xref target="RFC6458"/> defines the constants SCTP_ADDR_AVAILABLE,
        SCTP_ADDR_UNREACHABLE, SCTP_ADDR_REMOVED, SCTP_ADDR_ADDED, and
        SCTP_ADDR_MADE_PRIM to be provided in the spc_state field. This
        document additionally defines the new constant
        SCTP_ADDR_POTENTIALLY_FAILED, which is reported if the affected
        address becomes potentially failed.</t>
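
        <t>As a rough illustration only, an application might inspect the
        spc_state field of a received notification along the following
        lines. The demo_ prefixed structure, functions, and constants are
        placeholders invented for this sketch, and the numeric values are
        not taken from any implementation. A real application would use the
        SCTP_ADDR_* constants and the sctp_paddr_change structure from the
        header of its SCTP socket API.</t>

        <figure>
          <artwork><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder constants for this sketch only; a real application
 * uses the SCTP_ADDR_* constants from its SCTP API header. */
enum demo_spc_state {
  DEMO_ADDR_AVAILABLE = 1,
  DEMO_ADDR_UNREACHABLE,
  DEMO_ADDR_REMOVED,
  DEMO_ADDR_ADDED,
  DEMO_ADDR_MADE_PRIM,
  DEMO_ADDR_POTENTIALLY_FAILED
};

/* Cut-down stand-in for struct sctp_paddr_change. */
struct demo_paddr_change {
  uint32_t spc_state;  /* only field used by this sketch */
};

/* Return 1 if the event reports the potentially failed state,
 * which an application may treat as an early warning. */
int demo_is_early_warning(const struct demo_paddr_change *spc)
{
  assert(spc != NULL);
  return spc->spc_state == DEMO_ADDR_POTENTIALLY_FAILED;
}

/* Map the reported state to a label suitable for logging. */
const char *demo_state_name(const struct demo_paddr_change *spc)
{
  assert(spc != NULL);
  switch (spc->spc_state) {
  case DEMO_ADDR_AVAILABLE:          return "available";
  case DEMO_ADDR_UNREACHABLE:        return "unreachable";
  case DEMO_ADDR_REMOVED:            return "removed";
  case DEMO_ADDR_ADDED:              return "added";
  case DEMO_ADDR_MADE_PRIM:          return "made primary";
  case DEMO_ADDR_POTENTIALLY_FAILED: return "potentially failed";
  default:                           return "unknown";
  }
}
```
]]></artwork>
        </figure>

        <t>An application could use such a check, for example, to shift
        load to another association before the address is reported as
        unreachable.</t>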

        <t>The SCTP_GET_PEER_ADDR_INFO socket option defined in <xref
        target="RFC6458"/> can be used to query the state of a peer address.
        It uses the following structure:</t>

        <figure>
          <artwork>
struct sctp_paddrinfo {
  sctp_assoc_t spinfo_assoc_id;
  struct sockaddr_storage spinfo_address;
  int32_t spinfo_state;
  uint32_t spinfo_cwnd;
  uint32_t spinfo_srtt;
  uint32_t spinfo_rto;
  uint32_t spinfo_mtu;
};
</artwork>
        </figure>

        <t><xref target="RFC6458"/> defines the constants SCTP_UNCONFIRMED,
        SCTP_ACTIVE, and SCTP_INACTIVE to be provided in the spinfo_state
        field. This document additionally defines the new constant
        SCTP_POTENTIALLY_FAILED, which is reported if the peer address is
        potentially failed.</t>
      </section>

      <section title="Peer Address Thresholds (SCTP_PEER_ADDR_THLDS) Socket Option">
        <t>Applications can control the SCTP-PF behavior by getting or setting
        the number of consecutive timeouts before a peer address is considered
        potentially failed or unreachable. The same socket option is used by
        applications to set and get the number of timeouts before the primary
        path is changed automatically by the Primary Path Switchover function.
        This socket option uses the level IPPROTO_SCTP and the name
        SCTP_PEER_ADDR_THLDS.</t>

        <t>The following structure is used to access and modify the
        thresholds:</t>

        <figure>
          <artwork>
struct sctp_paddrthlds {
  sctp_assoc_t spt_assoc_id;
  struct sockaddr_storage spt_address;
  uint16_t spt_pathmaxrxt;
  uint16_t spt_pathpfthld;
  uint16_t spt_pathcpthld;
};
</artwork>
        </figure>

        <t><list style="hanging">
            <t hangText="spt_assoc_id:">This parameter is ignored for
            one-to-one style sockets. For one-to-many style sockets the
            application may fill in an association identifier or
            SCTP_FUTURE_ASSOC. It is an error to use SCTP_{CURRENT|ALL}_ASSOC
            in spt_assoc_id.</t>

            <t hangText="spt_address:">This specifies which peer address is of
            interest. If a wild card address is provided, this socket option
            applies to all current and future peer addresses.</t>

            <t hangText="spt_pathmaxrxt:">Each peer address of interest is
            considered unreachable if its path error counter exceeds
            spt_pathmaxrxt.</t>

            <t hangText="spt_pathpfthld:">Each peer address of interest is
            considered Potentially Failed if its path error counter exceeds
            spt_pathpfthld.</t>

            <t hangText="spt_pathcpthld:">Each peer address of interest is no
            longer considered the primary remote address if its path error
            counter exceeds spt_pathcpthld. Using a value of 0xffff disables
            the selection of a new primary peer address. If an implementation
            does not support the automatic selection of a new primary
            address, it should indicate an error with errno set to EINVAL if a
            value different from 0xffff is used in spt_pathcpthld. For
            SCTP-PF, the setting of spt_pathcpthld &lt; spt_pathpfthld should
            be rejected with errno set to EINVAL. For <xref target="RFC4960"/>
            SCTP, the setting of spt_pathcpthld &lt; spt_pathmaxrxt should be
            rejected with errno set to EINVAL. An SCTP-PF implementation MAY
            support only setting of spt_pathcpthld = spt_pathpfthld and
            spt_pathcpthld = 0xffff, and an <xref target="RFC4960"/> SCTP
            implementation MAY support only setting of spt_pathcpthld =
            spt_pathmaxrxt and spt_pathcpthld = 0xffff. In these cases, SCTP
            shall reject setting of other values with errno set to EINVAL.</t>
          </list></t>
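
        <t>The acceptance rules for spt_pathcpthld described above could be
        checked roughly as in the following sketch. It is not taken from any
        implementation. The demo_ prefixed names and the two flags
        (pf_enabled, has_switchover) are assumptions made for illustration,
        and the function returns EINVAL instead of setting errno so that it
        stays self-contained.</t>

        <figure>
          <artwork><![CDATA[
```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CPTHLD_DISABLED 0xffffu  /* switchover disabled */

/* Cut-down stand-in for struct sctp_paddrthlds. */
struct demo_thlds {
  uint16_t spt_pathmaxrxt;  /* PMR  */
  uint16_t spt_pathpfthld;  /* PFMR */
  uint16_t spt_pathcpthld;  /* PSMR */
};

/* Return 0 if spt_pathcpthld is acceptable, EINVAL otherwise.
 * pf_enabled:     non-zero if SCTP-PF is in use (assumed flag).
 * has_switchover: non-zero if Primary Path Switchover is
 *                 implemented at all (assumed flag). */
int demo_check_cpthld(const struct demo_thlds *t, int pf_enabled,
                      int has_switchover)
{
  uint16_t min_allowed;

  assert(t != NULL);
  if (t->spt_pathcpthld == DEMO_CPTHLD_DISABLED)
    return 0;            /* disabling is always acceptable */
  if (!has_switchover)
    return EINVAL;       /* only 0xffff can be accepted */
  /* Lower bound: PFMR for SCTP-PF, PMR for RFC 4960 SCTP. */
  min_allowed = pf_enabled ? t->spt_pathpfthld
                           : t->spt_pathmaxrxt;
  if (t->spt_pathcpthld < min_allowed)
    return EINVAL;
  return 0;
}
```
]]></artwork>
        </figure>

        <t>An implementation that supports only the minimal settings would
        additionally reject every value other than the lower bound and
        0xffff.</t>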
      </section>

      <section title="Exposing the Potentially Failed Path State (SCTP_EXPOSE_POTENTIALLY_FAILED_STATE) Socket Option">
        <t>Applications can control the exposure of the potentially failed
        path state in the SCTP_PEER_ADDR_CHANGE event and the
        SCTP_GET_PEER_ADDR_INFO socket option as described in <xref
        target="pf_support_api"/>. The default value is
        implementation-specific.</t>

        <t>This socket option uses the level IPPROTO_SCTP and the name
        SCTP_EXPOSE_POTENTIALLY_FAILED_STATE.</t>

        <t>The following structure is used to control the exposure of the
        potentially failed path state:</t>

        <figure>
          <artwork>
struct sctp_assoc_value {
  sctp_assoc_t assoc_id;
  uint32_t assoc_value;
};
</artwork>
        </figure>

        <t><list style="hanging">
            <t hangText="assoc_id:">This parameter is ignored for one-to-one
            style sockets. For one-to-many style sockets the application may
            fill in an association identifier or SCTP_FUTURE_ASSOC. It is an
            error to use SCTP_{CURRENT|ALL}_ASSOC in assoc_id.</t>

            <t hangText="assoc_value:">The potentially failed path state is
            exposed if and only if this parameter is non-zero.</t>
          </list></t>
      </section>
    </section>

    <section title="Security Considerations">
      <t>Security considerations for the use of SCTP and its APIs are
      discussed in <xref target="RFC4960"/> and <xref target="RFC6458"/>.</t>

      <t>The logic introduced by this document does not impact existing SCTP
      messages on the wire. Also, this document does not introduce any new
      SCTP messages on the wire that require new security considerations.</t>

      <t>SCTP-PF makes SCTP not only more robust during primary path
      failure/congestion but also more vulnerable to network
      connectivity/congestion attacks on the primary path. SCTP-PF makes it
      easier for an attacker to trick SCTP into changing the data transfer
      path, since the duration of time for which an attacker needs to
      negatively influence the network connectivity is much shorter than
      with <xref target="RFC4960"/>. However, SCTP-PF does not constitute a
      significant change in the duration of time and effort an attacker
      needs to keep SCTP away from the primary path. With the standard
      switchback operation, <xref target="RFC4960"/> SCTP resumes data
      transfer on its primary path as soon as the next HEARTBEAT
      succeeds.</t>

      <t>On the other hand, usage of the Primary Path Switchover mechanism
      does change the threat analysis. This is because on-path attackers can
      force a permanent change of the data transfer path by blocking the
      primary path until the switchover of the primary path is triggered by
      the Primary Path Switchover algorithm. This is especially the case
      when the Primary Path Switchover is used together with SCTP-PF with the
      particular setting of PSMR = PFMR = 0, as Primary Path Switchover then
      happens already at the first retransmission timeout experienced. Users
      of the Primary Path Switchover mechanism should be aware of this
      fact.</t>

      <t>The event notification of path state transitions from the active to
      the potentially failed state and vice versa gives attackers an
      increased possibility to generate more local events. However, it is
      assumed that event notifications are rate-limited in the
      implementation to address this threat.</t>
    </section>

    <section title="IANA Considerations">
      <t>This document does not create any new registries or modify the rules
      for any existing registries managed by IANA.</t>
    </section>

    <section title="Acknowledgements">
      <t>The authors wish to thank Michael Tuexen for his many invaluable
      comments and for his very substantial support with the making of this
      document.</t>
    </section>

    <section title="Proposed Change of Status (to be Deleted before Publication)">
      <t>Initially, this work appeared to entail some changes to the
      Congestion Control (CC) operation of SCTP, and for this reason the work
      was proposed as Experimental. These intended changes to the CC
      operation have since been judged to be irrelevant and are no longer
      part of the specification. As the specification entails no other
      potentially harmful features, consensus exists in the WG to bring the
      work forward as PS.</t>

      <t>Initially, concerns were expressed about the possibility that the
      mechanism introduces path bouncing with potentially harmful network
      impacts. These concerns are believed to be unfounded. This issue is
      addressed in <xref target="path_bouncing"/>.</t>

      <t>It is noted that the feature specified by this document is
      implemented by multiple SCTP software implementations and,
      furthermore, that various variants of the solution have been deployed
      in telephony signaling environments for several years with good
      results.</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119" ?>

      <?rfc include="reference.RFC.4960" ?>
    </references>

    <references title="Informative References">
      <?rfc include="reference.RFC.6458" ?>

      <reference anchor="IYENGAR06" target="">
        <front>
          <title>Concurrent Multipath Transfer using SCTP Multihoming over
          Independent End-to-end Paths.</title>

          <author fullname="" initials="J." surname="Iyengar">
            <organization/>
          </author>

          <author fullname="" initials="P." surname="Amer">
            <organization/>
          </author>

          <author fullname="" initials="R." surname="Stewart">
            <organization/>
          </author>

          <date month="10" year="2006"/>
        </front>

        <seriesInfo name="IEEE/ACM Trans on Networking" value="14(5)"/>
      </reference>

      <reference anchor="NATARAJAN09" target="">
        <front>
          <title>Concurrent Multipath Transfer during Path Failure</title>

          <author fullname="" initials="P." surname="Natarajan">
            <organization/>
          </author>

          <author fullname="" initials="N." surname="Ekiz">
            <organization/>
          </author>

          <author fullname="" initials="P." surname="Amer">
            <organization/>
          </author>

          <author fullname="" initials="R." surname="Stewart">
            <organization/>
          </author>

          <date month="5" year="2009"/>
        </front>

        <seriesInfo name="Computer Communications" value=""/>
      </reference>

      <reference anchor="JUNGMAIER02" target="">
        <front>
          <title>On the use of SCTP in failover scenarios</title>

          <author fullname="" initials="A." surname="Jungmaier">
            <organization/>
          </author>

          <author fullname="" initials="E." surname="Rathgeb">
            <organization/>
          </author>

          <author fullname="" initials="M." surname="Tuexen">
            <organization/>
          </author>

          <date month="7" year="2002"/>
        </front>

        <seriesInfo name="World Multiconference on Systemics, Cybernetics and Informatics"
                    value=""/>
      </reference>

      <reference anchor="GRINNEMO04" target="">
        <front>
          <title>Performance of SCTP-controlled failovers in M3UA-based
          SIGTRAN networks</title>

          <author fullname="" initials="K-J" surname="Grinnemo">
            <organization/>
          </author>

          <author fullname="" initials="A." surname="Brunstrom">
            <organization/>
          </author>

          <date month="4" year="2004"/>
        </front>

        <seriesInfo name="Advanced Simulation Technologies Conference"
                    value=""/>
      </reference>

      <reference anchor="FALLON08" target="">
        <front>
          <title>SCTP Switchover Performance Issues in WLAN
          Environments</title>

          <author fullname="" initials="S." surname="Fallon">
            <organization/>
          </author>

          <author fullname="" initials="P." surname="Jacob">
            <organization/>
          </author>

          <author fullname="" initials="Y." surname="Qiao">
            <organization/>
          </author>

          <author fullname="" initials="L." surname="Murphy">
            <organization/>
          </author>

          <author fullname="" initials="E." surname="Fallon">
            <organization/>
          </author>

          <author fullname="" initials="A." surname="Hanley">
            <organization/>
          </author>

          <date month="1" year="2008"/>
        </front>

        <seriesInfo name="IEEE CCNC" value="2008"/>
      </reference>

      <reference anchor="CARO04" target="">
        <front>
          <title>End-to-End Failover Thresholds for Transport Layer
          Multihoming</title>

          <author fullname="" initials="A." surname="Caro Jr.">
            <organization/>
          </author>

          <author fullname="" initials="P." surname="Amer">
            <organization/>
          </author>

          <author fullname="" initials="R." surname="Stewart">
            <organization/>
          </author>

          <date month="11" year="2004"/>
        </front>

        <seriesInfo name="MILCOM 2004" value=""/>
      </reference>

      <reference anchor="CARO05" target="">
        <front>
          <title>End-to-End Fault Tolerance using Transport Layer
          Multihoming</title>

          <author fullname="" initials="A." surname="Caro Jr.">
            <organization/>
          </author>

          <date month="1" year="2005"/>
        </front>

        <seriesInfo name="Ph.D Thesis, University of Delaware" value=""/>
      </reference>

      <reference anchor="CARO02" target="">
        <front>
          <title>A Two-level Threshold Recovery Mechanism for SCTP</title>

          <author fullname="" initials="A." surname="Caro Jr.">
            <organization/>
          </author>

          <author fullname="" initials="J." surname="Iyengar">
            <organization/>
          </author>

          <author fullname="" initials="P." surname="Amer">
            <organization/>
          </author>

          <author fullname="" initials="G." surname="Heinz">
            <organization/>
          </author>

          <author fullname="" initials="R." surname="Stewart">
            <organization/>
          </author>

          <date month="7" year="2002"/>
        </front>

        <seriesInfo name="Tech report, CIS Dept, University of Delaware"
                    value=""/>
      </reference>
    </references>

    <section anchor="alternative_approach"
             title="Discussions of Alternative Approaches">
      <t>This section lists alternative approaches for the issues described in
      this document. Although these approaches do not require updating <xref
      target="RFC4960"/>, we do not recommend them for the reasons described
      below.</t>

      <section title="Reduce Path.Max.Retrans (PMR)">
        <t>Smaller values for Path.Max.Retrans shorten the failover duration,
        and in fact this is recommended in some research results <xref
        target="JUNGMAIER02"/> <xref target="GRINNEMO04"/> <xref
        target="FALLON08"/>. However, to significantly reduce the failover
        time it is necessary to go down (as with PFMR) to Path.Max.Retrans=0,
        and with this setting SCTP switches to another destination address
        already on a single timeout, which may result in spurious failover.
        Spurious failover is a problem in <xref target="RFC4960"/> SCTP
        because the transmission of HEARTBEATs on the abandoned primary path,
        unlike in SCTP-PF, is governed by 'HB.interval' also during the
        failover process. 'HB.interval' is usually set in the order of
        seconds (the recommended value is 30 seconds), so when the primary
        path becomes inactive, the next HEARTBEAT may be transmitted only
        many seconds later; with the recommended value, only 30 seconds
        later. Meanwhile, the primary path may long since have recovered, if
        it needed recovery at all (indeed, the failover could be truly
        spurious). In such situations, after the failover, an endpoint is
        forced to wait in the order of many seconds before it can resume
        transmission on the primary path. Furthermore, once it returns to
        the primary path, the CWND needs to be rebuilt anew, a process from
        which the throughput has already suffered on the alternate path.
        Using a smaller value for 'HB.interval' might help in this
        situation, but it would result in a general waste of bandwidth, as
        the more frequent HEARTBEATs would also be sent when no problems
        are observed. The bandwidth overhead may be diminished by having the
        ULP use a smaller 'HB.interval' only on the path that is currently
        set to be the primary path, but this adds complexity to the ULP.</t>

        <t>In addition, smaller Path.Max.Retrans values also affect the
        'Association.Max.Retrans' value. When the SCTP association's error
        count exceeds the Association.Max.Retrans threshold, the SCTP sender
        considers the peer endpoint unreachable and terminates the
        association. Section 8.2 in <xref target="RFC4960"/> recommends that
        the Association.Max.Retrans value should not be larger than the
        summation of the Path.Max.Retrans of each of the destination
        addresses. Otherwise, the SCTP sender considers its peer reachable
        even when all destinations are INACTIVE. To avoid this dormant state
        operation, an <xref target="RFC4960"/> SCTP implementation SHOULD
        reduce Association.Max.Retrans accordingly whenever it reduces
        Path.Max.Retrans. However, a smaller Association.Max.Retrans value
        decreases the fault tolerance of SCTP, as it increases the chances
        of association termination during minor congestion events.</t>
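
        <t>The relationship between the two parameters can be expressed as
        a simple configuration check. The following is only a sketch; the
        function name demo_amr_within_sum and its interface are invented
        for illustration.</t>

        <figure>
          <artwork><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if amr (Association.Max.Retrans) does not exceed the
 * sum of the n per-destination Path.Max.Retrans values in pmr,
 * 0 otherwise. */
int demo_amr_within_sum(unsigned amr, const unsigned *pmr,
                        size_t n)
{
  unsigned long sum = 0;
  size_t i;

  assert(pmr != NULL || n == 0);
  for (i = 0; i < n; i++)
    sum += pmr[i];
  return amr <= sum;
}
```
]]></artwork>
        </figure>

        <t>With two destination addresses and Path.Max.Retrans lowered to 0
        on both, only Association.Max.Retrans = 0 passes this check, which
        illustrates how strongly the fault tolerance of the association is
        reduced.</t>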
      </section>

      <section title="Adjust RTO related parameters">
        <t>As several research results indicate, we can also shorten the
        duration of the failover process by adjusting RTO-related parameters
        <xref target="JUNGMAIER02"/> <xref target="FALLON08"/>. During the
        failover process, the RTO keeps being doubled. However, if we choose
        a smaller value for RTO.max, we can stop the exponential growth of
        the RTO at some point. Also, choosing smaller values for RTO.initial
        or RTO.min can contribute to keeping the RTO value small.</t>

        <t>Similar to reducing Path.Max.Retrans, the advantage of this
        approach is that it requires no modification to the current
        specification, although it needs to ignore several recommendations
        described in Section 15 of <xref target="RFC4960"/>. However, this
        approach requires sufficient knowledge about the network
        characteristics between the endpoints. Otherwise, it can introduce
        adverse side effects such as spurious timeouts.</t>

        <t>The significant issue with this approach, however, is that even if
        RTO.max is lowered to an optimal low value, as long as
        Path.Max.Retrans is kept at the <xref target="RFC4960"/> recommended
        value, the reduction of RTO.max does not reduce the failover time
        sufficiently to prevent severe performance degradation during
        failover.</t>
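
        <t>The effect can be estimated with a simplified model, sketched
        below. The model assumes that every transmission on the path times
        out, that the RTO starts at a given value and doubles after each
        timeout up to RTO.max, and that the failure is detected when the
        error counter first exceeds the threshold. The function name
        demo_failover_secs is invented for illustration, and real failover
        times also depend on the measured RTT.</t>

        <figure>
          <artwork><![CDATA[
```c
#include <assert.h>

/* Seconds until the path error counter first exceeds
 * max_retrans, i.e. the sum of max_retrans + 1 consecutive
 * timeouts with exponential backoff capped at rto_max. */
unsigned demo_failover_secs(unsigned rto_start, unsigned rto_max,
                            unsigned max_retrans)
{
  unsigned rto = rto_start;
  unsigned total = 0;
  unsigned i;

  assert(rto_start > 0 && rto_start <= rto_max);
  for (i = 0; i <= max_retrans; i++) {
    total += rto;
    /* Double the RTO without exceeding rto_max. */
    rto = (rto > rto_max / 2) ? rto_max : rto * 2;
  }
  return total;
}
```
]]></artwork>
        </figure>

        <t>Under these assumptions, a starting RTO of 1 second with the
        recommended RTO.max of 60 seconds and Path.Max.Retrans of 5 gives
        1 + 2 + 4 + 8 + 16 + 32 = 63 seconds. Lowering RTO.max to a
        hypothetical 2 seconds still gives 1 + 2 + 2 + 2 + 2 + 2 = 11
        seconds, whereas a threshold of 0 gives a single RTO.</t>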
      </section>
    </section>

    <section anchor="path_bouncing"
             title="Discussions for Path Bouncing Effect">
      <t>The methods described in this document can accelerate the failover
      process. Hence, they might introduce a path bouncing effect in which
      the sender keeps changing the data transmission path frequently. This
      may sound harmful to data transfer; however, several research results
      indicate that there is no serious problem with SCTP in terms of the
      path bouncing effect <xref target="CARO04"/> <xref
      target="CARO05"/>.</t>

      <t>There are two main reasons for this. First, SCTP is basically
      designed for multipath communication, which means SCTP maintains all
      path-related parameters (CWND, ssthresh, RTT, error count, etc.) per
      destination address. These parameters cannot be affected by path
      bouncing. In addition, when SCTP migrates the data transfer to another
      path, it starts with the minimal or the initial CWND. Hence, there is
      little chance of packet reordering or duplication.</t>

      <t>Second, even if all communication paths between the end nodes share
      the same bottleneck, SCTP-PF results in a behavior already allowed by
      <xref target="RFC4960"/>.</t>
    </section>

    <section anchor="sh" title="SCTP-PF for SCTP Single-homed Operation">
      <t>For a single-homed SCTP association, the only tangible effect of
      activating SCTP-PF operation is enhanced failure detection: potential
      notification of the PF state of the sole destination address and, for
      idle associations, more rapid entering, and notification, of the
      inactive state of the destination address and more rapid endpoint
      failure detection. It is believed that neither of these effects is
      harmful, provided adequate dormant state operation is implemented, and
      furthermore that they may be particularly useful for applications that
      deploy multiple SCTP associations for load-balancing purposes. The
      early notification of the PF state may be used for preventive
      measures, as the entering of the PF state can be used as a warning of
      potential congestion. Depending on the PMR value, the aggressive
      HEARTBEAT transmission in the PF state may speed up endpoint failure
      detection (that is, the sole path error counter exceeding the AMR
      threshold) on idle associations when an HB.interval value that is
      large relative to the RTO (e.g., 30 seconds) is used.</t>
    </section>
  </back>
</rfc>
