Traffic Engineering Working Group               Wai Sum Lai, AT&T
Internet Draft                             Dave McDysan, WorldCom
<draft-ietf-tewg-restore-hierarchy-01.txt>           (Co-Editors)
Category: Informational
Expiration Date: January 2003                   Jim Boyle, PDNets
                                                    Malin Carlzon
                                                Rob Coltun, Movaz
                                                Tim Griffin, AT&T
                                                          Ed Kern
                                           Tom Reddington, Lucent
                                                        July 2002
Network Hierarchy and Multilayer Survivability
Status of this Memo
This document is an Internet-Draft and is in full conformance
with all provisions of Section 10 of RFC2026 [1].
Internet-Drafts are working documents of the Internet
Engineering Task Force (IETF), its areas, and its working
groups. Note that other groups may also distribute working
documents as Internet-Drafts. Internet-Drafts are draft
documents valid for a maximum of six months and may be updated,
replaced, or obsoleted by other documents at any time. It is
inappropriate to use Internet-Drafts as reference material or
to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed
at http://www.ietf.org/shadow.html.
1. Abstract
This document is the deliverable out of the Network Hierarchy
and Survivability Techniques Design Team established within the
Traffic Engineering Working Group. This team collected and
documented current and near term requirements for survivability
and hierarchy in service provider environments. For clarity,
an expanded set of definitions is included. The team
determined that there appears to be a need to define a small
set of interoperable survivability approaches in packet and
non-packet networks. Suggested approaches include path-based
mechanisms as well as one that repairs connections in proximity
to the network fault. They operate primarily at a single
network layer. For hierarchy, there did not appear to be a driving
near-term need for work on "vertical hierarchy," defined as
communication between network layers such as TDM/optical and
MPLS. In particular, instead of direct exchange of signaling
and routing between vertical layers, some looser form of
coordination and communication, such as the specification of
hold-off timers, is a nearer term need. For "horizontal
hierarchy" in data networks, there are several pressing needs.
The requirement is to be able to set up many LSPs in a service
provider network with hierarchical IGP. This is necessary to
support layer 2 and layer 3 VPN services that require edge-to-
edge signaling across a core network.
Please send comments to te-wg@ops.ietf.org
2. Conventions used in this document
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described
in RFC-2119 [2].
Table of Contents
1. Abstract
2. Conventions used in this document
3. Introduction
4. Terminology and Concepts
4.1 Hierarchy
4.1.1 Vertical Hierarchy
4.1.2 Horizontal Hierarchy
4.2 Survivability Terminology
4.2.1 Survivability
4.2.2 Generic Operations
4.2.3 Survivability Techniques
4.2.4 Survivability Performance
4.3 Survivability Mechanisms: Comparison
5. Survivability
5.1 Scope
5.2 Required initial set of survivability mechanisms
5.2.1 1:1 Path Protection with Pre-Established Capacity
5.2.2 1:1 Path Protection with Pre-Planned Capacity
5.2.3 Local Restoration
5.2.4 Path Restoration
5.3 Applications Supported
5.4 Timing Bounds for Survivability Mechanisms
5.5 Coordination Among Layers
5.6 Evolution Toward IP Over Optical
6. Hierarchy Requirements
6.1 Historical Context
6.2 Applications for Horizontal Hierarchy
6.3 Horizontal Hierarchy Requirements
7. Survivability and Hierarchy
8. Security Considerations
9. References
10. Acknowledgments
11. Author's Addresses
Appendix A: Questions used to help develop requirements
Full Copyright Statement
3. Introduction
This document presents a proposal of the near-term and
practical requirements for network survivability and hierarchy
in current service provider environments. With feedback from
the working group solicited, the objective is to help focus the
work that is being addressed in the TEWG (Traffic Engineering
Working Group), CCAMP (Common Control and Measurement Plane
Working Group), and other working groups. A main goal of this
work is to provide some expedience for required functionality
in multi-vendor service provider networks. The initial focus
is primarily on intra-domain operations. However, to maintain
consistency in the provision of end-to-end service in a multi-
provider environment, rules governing the operations of
survivability mechanisms at domain boundaries must also be
specified. While such issues are raised and discussed, where
appropriate, they will not be treated in depth in the initial
release of this document.
The document first develops a set of definitions to be used
later in this document and potentially in other documents as
well. It then addresses the requirements and issues associated
with service restoration and hierarchy, and ends with a short
discussion of survivability in a hierarchical context.
Here is a summary of the findings:
A. Survivability Requirements
o need to define a small set of interoperable survivability
approaches in packet and non-packet networks
o suggested survivability mechanisms include
- 1:1 path protection with pre-established backup capacity
(non-shared)
- 1:1 path protection with pre-planned backup capacity
(shared)
- local restoration with repairs in proximity to the
network fault
- path restoration through source-based rerouting
o timing bounds for service restoration to support voice call
cutoff (140 msec to 2 sec), protocol timer requirements in
premium data services, and mission critical applications
o use of restoration priority for service differentiation
B. Hierarchy Requirements
B.1. Horizontally Oriented Hierarchy (Intra-Domain)
o ability to set up many LSPs in a service provider network
with hierarchical IGP, for the support of layer 2 and layer 3
VPN services
o requirements for multi-area traffic engineering need to be
developed to provide guidance for any necessary protocol
extensions
B.2. Vertically Oriented Hierarchy
The following functionality for survivability is common on most
routing equipment today.
o near-term need is some loose form of coordination and
communication based on the use of nested hold-off timers,
instead of direct exchange of signaling and routing between
vertical layers
o means for an upper layer to immediately begin recovery
actions in the event that a lower layer is not configured to
perform recovery
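The nested hold-off timer idea in B.2 can be illustrated with a
minimal sketch. It is one possible reading, not a specified
mechanism: the layer names, recovery times, and guard interval
below are hypothetical values chosen for illustration. Each
layer waits for every lower layer that is configured for
recovery; a lower layer that is not configured for recovery
adds no delay, so the layer above it can act immediately.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Layer:
    name: str                # hypothetical layer label, e.g. "optical"
    recovery_enabled: bool   # is this layer configured to perform recovery?
    recovery_time_ms: float  # assumed worst-case recovery time of this layer


def hold_off_timers(layers: List[Layer], guard_ms: float = 10.0) -> Dict[str, float]:
    """Return a hold-off timer (ms) per layer; layers are ordered lowest first.

    A layer's timer is the sum of (recovery time + guard) over all lower
    layers that are configured for recovery. Lower layers without recovery
    contribute nothing, so the upper layer does not wait for them.
    """
    timers: Dict[str, float] = {}
    wait_ms = 0.0
    for layer in layers:
        timers[layer.name] = wait_ms
        if layer.recovery_enabled:
            wait_ms += layer.recovery_time_ms + guard_ms
    return timers


# Illustrative three-layer stack (all values are assumptions).
stack = [
    Layer("optical", recovery_enabled=True, recovery_time_ms=50.0),
    Layer("sonet", recovery_enabled=False, recovery_time_ms=60.0),
    Layer("mpls", recovery_enabled=True, recovery_time_ms=100.0),
]
timers = hold_off_timers(stack)
```

In this example the MPLS layer holds off only for the optical
layer, because the SONET layer is not configured to recover.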
C. Survivability Requirements in Horizontal Hierarchy
o protection of end-to-end connection is based on a
concatenated set of connections, each protected within its
area
o mechanisms for connection routing may include (1) a network
element that participates on both sides of a boundary (e.g.,
OSPF ABR) - note that this is a common point of failure; (2)
route server
o need for inter-area signaling of survivability information
(1) to enable a "least common denominator" survivability
mechanism at the boundary; (2) to convey the success or
failure of the service restoration action; e.g., if a part
of a "connection" is down on one side of a boundary, there
is no need for the other side to recover from failures
4. Terminology and Concepts
4.1 Hierarchy
Hierarchy is a technique to build scalable complex systems. It
is based on an abstraction, at each level, of what is most
significant from the details and internal structures of the
levels further away. This approach makes use of a general
property of all hierarchical systems composed of related
subsystems: interactions between subsystems decrease as the
level of communication between subsystems decreases.
Network hierarchy is an abstraction of part of a network's
topology, routing and signaling mechanisms. Abstraction may be
used as a mechanism to build large networks or as a technique
for enforcing administrative, topological, or geographic
boundaries. For example, network hierarchy might be used to
separate the metropolitan and long-haul regions of a network,
or to separate the regional and backbone sections of a network,
or to interconnect service provider networks (with BGP which
reduces a network to an Autonomous System).
In this document, network hierarchy is considered from two
perspectives:
(1) Vertically oriented: between two network technology layers
(2) Horizontally oriented: between two areas or administrative
subdivisions within the same network technology layer
4.1.1 Vertical Hierarchy
Vertical hierarchy is the abstraction, or reduction in
information, which would be of benefit when communicating
information across network technology layers, as in propagating
information between optical and router networks.
In the vertical hierarchy, the total network functions are
partitioned into a series of functional or technological layers
with clear logical, and possibly even physical, separation
between adjacent layers. Survivability mechanisms either
currently exist or are being developed at multiple layers in
networks [3]. The optical layer is now becoming capable of
providing dynamic ring and mesh restoration functionality, in
addition to traditional 1+1 or 1:1 protection. The SDH/SONET
layer provides survivability capability with automatic
protection switching (APS), as well as self-healing ring and
mesh restoration architectures. Similar functionality has been
defined in the ATM Layer, with work ongoing to also provide
such functionality using MPLS [4]. At the IP layer, rerouting
is used to restore service continuity following link and node
outages. Rerouting at the IP layer, however, occurs after a
period of routing convergence, which may require from a few
seconds to several minutes to complete [5].
4.1.2 Horizontal Hierarchy
Horizontal hierarchy is the abstraction that allows a network
at one technology layer, for instance a packet network, to
scale. Examples of horizontal hierarchy include BGP
confederations, separate Autonomous Systems, and multi-area
OSPF.
In the horizontal hierarchy, a large network is partitioned
into multiple smaller, non-overlapping sub-networks. The
partitioning criteria can be based on topology, network
function, administrative policy, or service domain demarcation.
Two networks at the *same* hierarchical level, e.g., two
Autonomous Systems in BGP, may share a peer relation with each
other through some loose form of coupling. On the other hand,
for routing in large networks using multi-area OSPF,
abstraction through the aggregation of routing information is
achieved through a hierarchical partitioning of the network.
4.2 Survivability Terminology
In alphabetical order, the following terms are defined in this
section:
backup entity, same as protection entity (section 4.2.2)
extra traffic (section 4.2.2)
non-revertive mode (section 4.2.2)
normalization (section 4.2.2)
preemptable traffic, same as extra traffic (section 4.2.2)
preemption priority (section 4.2.4)
protection (section 4.2.3)
protection entity (section 4.2.2)
protection switching (section 4.2.3)
protection switch time (section 4.2.4)
recovery (section 4.2.2)
recovery by rerouting, same as restoration (section 4.2.3)
recovery entity, same as protection entity (section 4.2.2)
restoration (section 4.2.3)
restoration priority (section 4.2.4)
restoration time (section 4.2.4)
revertive mode (section 4.2.2)
shared risk group (SRG) (section 4.2.2)
survivability (section 4.2.1)
working entity (section 4.2.2)
4.2.1 Survivability
Survivability is the capability of a network to maintain
service continuity in the presence of faults within the network
[6]. Survivability mechanisms such as protection and
restoration are implemented either on a per-link basis, on a
per-path basis, or throughout an entire network to alleviate
service disruption at affordable costs. The degree of
survivability is determined by the network's capability to
survive single failures, multiple failures, and equipment
failures.
4.2.2 Generic Operations
This document does not discuss the sequence of events of how
network failures are monitored, detected, and mitigated. For
more detail of this aspect, see [4]. Also, the repair process
following a failure is out of the scope here.
A working entity is the entity that is used to carry traffic in
normal operation mode. Depending on the context, an entity can
be a channel or a transmission link in the physical layer, an
LSP in MPLS, or a logical bundle of one or more LSPs.
A protection entity, also called backup entity or recovery
entity, is the entity that is used to carry protected traffic
in recovery operation mode, i.e., when the working entity is in
error or has failed.
Extra traffic, also referred to as preemptable traffic, is the
traffic carried over the protection entity while the working
entity is active. Extra traffic is not protected, i.e., when
the protection entity is required to protect the traffic that
is being carried over the working entity, the extra traffic is
preempted.
A shared risk group (SRG) is a set of network elements that are
collectively impacted by a specific fault or fault type. For
example, a shared risk link group (SRLG) is the union of all
the links on those fibers that are routed in the same physical
conduit in a fiber-span network. This concept includes,
besides shared conduit, other types of compromise such as
shared fiber cable, shared right of way, shared optical ring,
shared office without power sharing, etc. The span of an SRG,
such as the length of the sharing for compromised outside
plant, needs to be considered on a per fault basis. The
concept of SRG can be extended to represent a "risk domain" and
its associated capabilities and summarization for traffic
engineering purposes. See [7] for further discussion.
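As a minimal sketch of how SRG information might be used,
assume each link is tagged with the set of SRGs it belongs to.
The link names, SRG labels, and function names below are
hypothetical. Two paths are failure disjoint with respect to
the known SRGs when they share no SRG.

```python
from typing import Dict, List, Set


def srgs_on_path(path: List[str], srgs_of_link: Dict[str, Set[str]]) -> Set[str]:
    """Union of the SRGs of every link on the path."""
    risks: Set[str] = set()
    for link in path:
        risks |= srgs_of_link.get(link, set())
    return risks


def shared_risks(path_a: List[str], path_b: List[str],
                 srgs_of_link: Dict[str, Set[str]]) -> Set[str]:
    """SRGs common to both paths; an empty set means SRG-disjoint."""
    return srgs_on_path(path_a, srgs_of_link) & srgs_on_path(path_b, srgs_of_link)


# Hypothetical topology: links A-B and A-C leave node A in the same conduit.
srgs_of_link = {
    "A-B": {"conduit-1"},
    "A-C": {"conduit-1"},
    "C-B": {"conduit-2"},
    "A-D": {"conduit-3"},
    "D-B": {"conduit-4"},
}
working = ["A-B"]
candidate_1 = ["A-C", "C-B"]   # link-disjoint, but shares conduit-1
candidate_2 = ["A-D", "D-B"]   # SRG-disjoint from the working path

risks_1 = shared_risks(working, candidate_1, srgs_of_link)
risks_2 = shared_risks(working, candidate_2, srgs_of_link)
```

The first candidate shows why link disjointness alone is not
enough: it uses different links but is still exposed to the
same conduit cut as the working path.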
Normalization is the sequence of events and actions taken by a
network that returns the network to the preferred state upon
completing repair of a failure. This could include the
switching or rerouting of affected traffic to the original
repaired working entities or new routes. Revertive mode refers
to the case where traffic is automatically returned to a
repaired working entity (also called switch back).
Recovery is the sequence of events and actions taken by a
network after the detection of a failure to maintain the
required performance level for existing services (e.g.,
according to service level agreements) and to allow
normalization of the network. The actions include notification
of the failure followed by two parallel processes: (1) a repair
process with fault isolation and repair of the failed
components, and (2) a reconfiguration process using
survivability mechanisms to maintain service continuity. In
protection, reconfiguration involves switching the affected
traffic from a working entity to a protection entity. In
restoration, reconfiguration involves path selection and
rerouting for the affected traffic.
Revertive mode is a procedure in which revertive action, i.e.,
switch back from the protection entity to the working entity,
is taken once the failed working entity has been repaired. In
non-revertive mode, such action is not taken. To minimize
service interruption, switch-back in revertive mode should be
performed at a time when there is the least impact on the
traffic concerned, or by using the make-before-break concept.
Non-revertive mode applies where there is no preferred path, or
where it is desirable to avoid the further service disruption
that a revertive switching operation would cause. A switch-back
to the original working path is then either not desired or not
possible, since the original path may no longer exist after the
occurrence of a fault on that path.
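The difference between the two modes can be sketched as a small
state function. This is an illustration only; the names Entity
and next_active are hypothetical, and timing considerations such
as when to schedule the switch-back are left out.

```python
from enum import Enum


class Entity(Enum):
    WORKING = "working"
    PROTECTION = "protection"


def next_active(active: Entity, working_ok: bool, revertive: bool) -> Entity:
    """Return the entity that carries the protected traffic next."""
    if active is Entity.WORKING:
        # Switch to protection only when the working entity has failed.
        return Entity.WORKING if working_ok else Entity.PROTECTION
    # Traffic is on the protection entity: switch back only in revertive
    # mode, and only once the working entity has been repaired.
    if working_ok and revertive:
        return Entity.WORKING
    return Entity.PROTECTION


# After repair, revertive mode switches back; non-revertive mode stays put.
after_repair_revertive = next_active(Entity.PROTECTION, working_ok=True, revertive=True)
after_repair_non_revertive = next_active(Entity.PROTECTION, working_ok=True, revertive=False)
```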
4.2.3 Survivability Techniques
Protection, also called protection switching, is a
survivability technique based on predetermined failure
recovery: as the working entity is established, a protection
entity is also established. Protection techniques can be
implemented by several architectures: 1+1, 1:1, 1:n, and m:n.
In the context of SDH/SONET, they are referred to as Automatic
Protection Switching (APS).
In the 1+1 protection architecture, a protection entity is
dedicated to each working entity. The dual-feed mechanism is
used whereby the working entity is permanently bridged onto the
protection entity at the source of the protected domain. In
normal operation mode, identical traffic is transmitted
simultaneously on both the working and protection entities. At
the other end (sink) of the protected domain, both feeds are
monitored for alarms and maintenance signals. A selection
between the working and protection entity is made based on some
predetermined criteria, such as the transmission performance
requirements or defect indication.
In the 1:1 protection architecture, a protection entity is also
dedicated to each working entity. The protected traffic is
normally transmitted by the working entity. When the working
entity fails, the protected traffic is switched to the
protection entity. The two ends of the protected domain must
signal detection of the fault and initiate the switchover.
In the 1:n protection architecture, a dedicated protection
entity is shared by n working entities. In this case, not all
of the affected traffic may be protected.
The m:n architecture is a generalization of the 1:n
architecture: m dedicated protection entities are shared by n
working entities, where typically m <= n.
Restoration, also referred to as recovery by rerouting [4], is
a survivability technique that establishes new paths or path
segments on demand, for restoring affected traffic after the
occurrence of a fault. The resources in these alternate paths
are the currently unassigned (unreserved) resources in the same
layer. Preemption of extra traffic may also be used if spare
resources are not available to carry the higher-priority
protected traffic. As initiated by detection of a fault on the
working path, the selection of a recovery path may be based on
preplanned configurations, network routing policies, or current
network status such as network topology and fault information.
Signaling is used for establishing the new paths to bypass the
fault. Thus, restoration involves a path selection process
followed by rerouting of the affected traffic from the working
entity to the recovery entity.
4.2.4 Survivability Performance
Protection switch time is the time interval from the occurrence
of a network fault until the completion of the protection-
switching operations. It includes the detection time necessary
to initiate the protection switch, any hold-off time to allow
for interworking of protection schemes, and the switch
completion time.
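As a worked example of this definition, the protection switch
time can be treated as the sum of its three components. The
component values below are hypothetical; the 140 msec figure is
the lower end of the voice call cutoff range quoted elsewhere
in this document and is used here only as an example bound.

```python
def protection_switch_time_ms(detection_ms: float, hold_off_ms: float,
                              switch_completion_ms: float) -> float:
    """Sum of detection time, hold-off time, and switch completion time."""
    return detection_ms + hold_off_ms + switch_completion_ms


# Hypothetical component values, in milliseconds.
total_ms = protection_switch_time_ms(detection_ms=10.0,
                                     hold_off_ms=50.0,
                                     switch_completion_ms=30.0)
meets_bound = total_ms <= 140.0  # example bound taken from the voice case
```

Note that any hold-off time counts against the bound, so a long
hold-off at an upper layer directly lengthens the service
interruption seen at that layer.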
Restoration time is the time interval from the occurrence of a
network fault to the instant when either the affected traffic
is completely restored, or spare resources are exhausted and/or
no more extra traffic exists that can be preempted to make
room.
Restoration priority is a method of giving preference to
protect higher-priority traffic ahead of lower-priority
traffic. Its use is to help determine the order of restoring
traffic after a failure has occurred. The purpose is to
differentiate service restoration time as well as to control
access to available spare capacity for different classes of
traffic.
Preemption priority is a method of determining which traffic
can be disconnected in the event that not all traffic with a
higher restoration priority is restored after the occurrence of
a failure.
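The interplay of restoration priority and preemption can be
sketched as follows. This is a simplified model under stated
assumptions: a single pool of spare capacity, a single pool of
preemptable extra traffic, and the convention that a lower
number means higher restoration priority. The names Demand and
restore are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Demand:
    name: str
    bandwidth: int
    restoration_priority: int  # assumption: 0 is restored first


def restore(failed: List[Demand], spare: int,
            preemptable: int) -> Tuple[List[str], List[str], int]:
    """Restore demands in priority order.

    Spare capacity is used first; extra traffic is preempted only when
    spare capacity alone cannot carry the demand. Returns the restored
    names, the unrestored names, and the amount of extra traffic preempted.
    """
    restored: List[str] = []
    unrestored: List[str] = []
    preempted = 0
    for demand in sorted(failed, key=lambda d: d.restoration_priority):
        if demand.bandwidth <= spare:
            spare -= demand.bandwidth
            restored.append(demand.name)
        elif demand.bandwidth <= spare + (preemptable - preempted):
            preempted += demand.bandwidth - spare
            spare = 0
            restored.append(demand.name)
        else:
            unrestored.append(demand.name)
    return restored, unrestored, preempted


failed = [
    Demand("bronze", bandwidth=50, restoration_priority=2),
    Demand("gold", bandwidth=40, restoration_priority=0),
    Demand("silver", bandwidth=30, restoration_priority=1),
]
result = restore(failed, spare=50, preemptable=30)
```

Here the highest-priority demand fits in spare capacity, the
next one is restored by preempting some extra traffic, and the
lowest-priority demand is left unrestored.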
4.3 Survivability Mechanisms: Comparison
In a survivable network design, spare capacity and diversity
must be built into the network from the beginning to support
some degree of self-healing whenever failures occur. A common
strategy is to associate each working entity with a protection
entity having either dedicated resources or shared resources
that are pre-reserved or reserved-on-demand. According to the
methods of setting up a protection entity, different approaches
to providing survivability can be classified. Generally,
protection techniques are based on having a dedicated
protection entity set up prior to failure. Such is not the
case in restoration techniques, which mainly rely on the use of
spare capacity in the network. Hence, in terms of trade-offs,
protection techniques usually offer fast recovery from failure
with enhanced availability, while restoration techniques
usually achieve better resource utilization.
A 1+1 protection architecture is rather expensive since
resource duplication is required for the working and protection
entities. It is generally used for specific services that need
a very high availability.
A 1:1 architecture is inherently slower in recovering from
failure than a 1+1 architecture since communication between
both ends of the protection domain is required to perform the
switch-over operation. An advantage is that the protection
entity can optionally be used to carry low-priority extra
traffic in normal operation, if traffic preemption is allowed.
Packet networks can pre-establish a protection path for later
use with pre-planned but not pre-reserved capacity. That is,
if no packets are sent onto a protection path, then no
bandwidth is consumed. This is not the case in transmission
networks like optical or TDM where path establishment and
resource reservation cannot be decoupled.
In the 1:n protection architecture, traffic is normally sent on
the working entities. When multiple working entities have
failed simultaneously, only one of them can be restored by the
common protection entity. This contention could be resolved by
assigning a different preemptive priority to each working
entity. As in the 1:1 case, the protection entity can
optionally be used to carry preemptable traffic in normal
operation.
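The contention described above can be sketched as a simple
priority-based selection. The sketch assumes that a lower
number means higher priority, and it takes the number of
protection entities as a parameter m so that the same function
also covers the m:n case. The entity names and the function
name assign_protection are hypothetical.

```python
from typing import List, Tuple


def assign_protection(failed: List[Tuple[str, int]],
                      m: int = 1) -> Tuple[List[str], List[str]]:
    """Choose which failed working entities get a protection entity.

    `failed` holds (name, priority) pairs. The m highest-priority
    entities are protected; the rest are left unprotected. Ties keep
    their input order because Python's sort is stable.
    """
    ranked = sorted(failed, key=lambda entry: entry[1])
    protected = [name for name, _ in ranked[:m]]
    unprotected = [name for name, _ in ranked[m:]]
    return protected, unprotected


# Three working entities fail at once in a 1:n group.
failed = [("w3", 2), ("w1", 0), ("w2", 1)]
one_for_n = assign_protection(failed, m=1)
two_for_n = assign_protection(failed, m=2)
```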
While the m:n architecture can improve system availability with
small cost increases, it has rarely been implemented or
standardized.
When compared with protection mechanisms, restoration
mechanisms are generally more frugal as no resources are
committed until after the fault occurs and the location of the
fault is known. However, restoration mechanisms are inherently
slower, since more must be done following the detection of a
fault. Also, the time it takes for the dynamic selection and
establishment of alternate paths may vary, depending on the
amount of traffic and connections to be restored, and is
influenced by the network topology, technology employed, and
the type and severity of the fault. As a result, restoration
time tends to be more variable than the protection switch time
needed with pre-selected protection entities. Hence, in using
restoration mechanisms, it is essential to use restoration
priority to ensure that service objectives are met cost-
effectively.
Once the network routing algorithms have converged after a
fault, it may be preferable, in some cases, to reoptimize the
network by performing a reroute based on the current state of
the network and network policies.
5. Survivability
5.1 Scope
Interoperable approaches to network survivability were
determined to be an immediate requirement in packet networks as
well as in SDH/SONET framed TDM networks. Not as pressing at
this time were techniques that would cover all-optical networks
(e.g., where framing is unknown), as the control of these
networks in a multi-vendor environment appeared to have some
other hurdles to first deal with. Also, not of immediate
interest were approaches to coordinate or explicitly
communicate survivability mechanisms across network layers
(such as from a TDM or optical network to/from an IP network).
However, a capability should be provided for a network operator
to perform fault notification and to control the operation of
survivability mechanisms among different layers. This may
require the development of corresponding OAM functionality.
However, such issues and those related to OAM are currently
outside the scope of this document. (For proposed MPLS OAM
requirements, see [8, 9]).
The initial scope is to address only "backhoe failures" in the
inter-office connections of a service provider network. A link
connection in the router layer typically comprises multiple
spans in the lower layers. Therefore, the types of network
failures that cause a recovery to be performed include
link/span failures. However, linecard and node failures may
not need to be treated any differently than their respective
link/span failures, as a router failure may be represented as a
set of simultaneous link failures.
Depending on the actual network configuration, a drop-side
interface (e.g., between a customer and an access router, or
between a router and an optical cross-connect) may be
considered either inter-domain or inter-layer. Another inter-
domain scenario is the use of intra-office links for
interconnecting a metro network and a core network, with both
networks being administered by the same service provider.
Failures at such interfaces may be similarly protected by the
mechanisms of this section.
Other more complex failure mechanisms such as systematic
control-plane failure, configuration error, or breach of
security are not within the scope of the survivability
mechanisms discussed in this document. Network impairments such
as congestion that result in lower throughput are also not
covered.
5.2 Required initial set of survivability mechanisms
5.2.1 1:1 Path Protection with Pre-Established Capacity
In this protection mode, the head end of a working connection
establishes a protection connection to the destination. There
should be the ability to maintain relative restoration
priorities between working and protection connections, as well
as between different classes of protection connections.
In normal operation, traffic is only sent on the working
connection, though the ability to signal that traffic will be
sent on both connections (1+1 Path for signaling purposes)
would be valuable in non-packet networks. Some distinction
between working and protection connections is likely, either
through explicit objects, or preferably through implicit
methods such as general classes or priorities. Head ends need
the ability to create connections that are as failure disjoint
as possible from each other. This requires SRG information
that can be generally assigned to either nodes or links and
propagated through the control or management plane. In this
mechanism, capacity in the protection connection is pre-
established; however, it should be capable of carrying
preemptable extra traffic in non-packet networks. When
protection capacity is called into service during recovery,
there should be the ability to promote the protection
connection to working status (for non-revertive mode operation)
with some form of make-before-break capability.
5.2.2 1:1 Path Protection with Pre-Planned Capacity
Similar to the above 1:1 protection with pre-established
capacity, the protection connection in this case is also pre-
signaled. The difference is in the way protection capacity is
assigned. With pre-planned capacity, the mechanism supports
the ability for the protection capacity to be shared, or
"double-booked". Operators need the ability to provision
different amounts of protection capacity according to expected
failure modes and service level agreements. Thus, an operator
may wish to provision sufficient restoration capacity to handle
a single failure affecting all connections in an SRG, or may
wish to provision less or more restoration capacity.
Mechanisms should be provided to allow restoration capacity on
each link to be shared by SRG-disjoint failures. In a sense,
this is 1:1 from a path perspective; however, the protection
capacity in the network (on a link by link basis) is shared in
a 1:n fashion, e.g., see the proposals in [10, 11]. If
capacity is planned but not allocated, some form of signaling
could be required before traffic may be sent on protection
connections, especially in TDM networks.
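To illustrate the sharing of protection capacity among SRG-
disjoint failures, the following sketch sizes the protection
capacity on one link for any single SRG failure. It is an
assumption-laden illustration and not taken from the proposals
cited above: the Connection record, the link and SRG labels, and
the function names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Connection:
    name: str
    bandwidth: int
    working_srgs: FrozenSet[str]      # SRGs traversed by the working path
    protection_links: FrozenSet[str]  # links used by the protection path


def shared_backup_capacity(connections: List[Connection], link: str) -> int:
    """Capacity needed on `link` to survive any single SRG failure.

    For each SRG, add up the bandwidth of connections whose working path
    is hit by that SRG and whose protection path uses `link`; the link
    must be sized for the worst single SRG.
    """
    demand_per_srg: Dict[str, int] = defaultdict(int)
    for conn in connections:
        if link in conn.protection_links:
            for srg in conn.working_srgs:
                demand_per_srg[srg] += conn.bandwidth
    return max(demand_per_srg.values(), default=0)


def dedicated_backup_capacity(connections: List[Connection], link: str) -> int:
    """Capacity needed on `link` if nothing is shared (non-shared 1:1)."""
    return sum(c.bandwidth for c in connections if link in c.protection_links)


conns = [
    Connection("c1", 10, frozenset({"s1"}), frozenset({"X"})),
    Connection("c2", 20, frozenset({"s2"}), frozenset({"X"})),
    Connection("c3", 5, frozenset({"s1"}), frozenset({"X"})),
]
shared = shared_backup_capacity(conns, "X")
dedicated = dedicated_backup_capacity(conns, "X")
```

In this example the shared figure is smaller than the dedicated
one because connections c1 and c3 never need link X at the same
time as c2 under a single SRG failure.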
The use of this approach improves network resource utilization,
but may require more careful planning. So, initial deployment
might be based on 1:1 path protection with pre-established
capacity and the local restoration mechanism to be described
next.
5.2.3 Local Restoration
Due to the time impact of signal propagation, dynamic recovery
of an entire path may not meet the service requirements of some
networks. The solution to this is to restore connectivity of
the link or span in immediate proximity to the fault, e.g., see
the proposals in [12, 13]. At a minimum, this approach should
be able to protect against connectivity-type SRGs, though
protecting against node-based SRGs might be worthwhile. Also,
this approach is applicable to support restoration on the
inter-domain and inter-layer interconnection scenarios using
intra-office links as described in the Scope Section.
Head end systems must have some control as to whether their
connections are candidates for or excluded from local
restoration. For example, best-effort and preemptable traffic
may be excluded from local restoration; they only get restored
if there is bandwidth available. This type of control may
require the definition of an object in signaling.
Since local restoration may be suboptimal, a means for head end
systems to later perform path-level re-grooming must be
supported for this approach.
5.2.4 Path Restoration
In this approach, connections that are impacted by a fault are
rerouted by the originating network element upon notification
of connection failure. Such a source-based approach is
efficient for network resources, but typically takes longer to
accomplish restoration. It does not involve any new
mechanisms; it is mentioned here merely as another common
approach to protecting against faults in a network.
5.3 Applications Supported
With service continuity under failure as a goal, a network is
"survivable" if, in the face of a network failure, connectivity
is interrupted for a "brief" period and then recovered before
the network failure ends. The length of this interrupted
period is dependent on the application supported. Here are
Lai, et al Category - Expiration 13
some typical applications and considerations that drive the
requirements for an acceptable protection switch time or
restoration time:
- Best-effort data: recovery of network connectivity by
rerouting at the IP layer would be sufficient
- Premium data service: need to meet TCP timeout or application
protocol timer requirements
- Voice: call cutoff is in the range of 140 msec to 2 sec (the
time that a person waits after interruption of the speech
path before hanging up or the time that a telephone switch
will disconnect a call)
- Other real-time service (e.g., streaming, fax) where an
interruption would cause the session to terminate
- Mission-critical applications that cannot tolerate even brief
interruptions, for example, real-time financial transactions
5.4 Timing Bounds for Survivability Mechanisms
The approach to picking the types of survivability mechanisms
recommended was to consider a spectrum of mechanisms that can
be used to protect traffic with varying characteristics of
survivability and speed of protection/restoration, and then
attempt to select a few general points that provide some
coverage across that spectrum. The focus of this work is to
provide requirements to which a small set of detailed proposals
may be developed, allowing the operator some (limited)
flexibility in approaches to meeting their design goals in
engineering multi-vendor networks. Requirements of different
applications as listed in the previous sub-section were
discussed in general terms; however, no one on the team would
likely attest that the timing bounds below are scientifically
proven to meet any specific application's needs. A few
assumptions include:
1. Approaches that protection switch without propagation of
information are likely to be faster than those that do
require some form of fault notification to some or all
elements in a network.
2. Approaches that require some form of signaling after a fault
will also likely suffer some timing impact.
Proposed timing bounds for different survivability mechanisms
are as follows (all bounds are exclusive of signal
propagation):
1:1 path protection with pre-established capacity: 100-500 ms
1:1 path protection with pre-planned capacity: 100-750 ms
Local restoration: 50 ms
Path restoration: 1-5 seconds
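The proposed bounds can be compared against an application's tolerance with simple arithmetic. The sketch below is illustrative only: the dictionary restates the bounds listed above, the function name is hypothetical, and the 20 ms propagation allowance is an assumed figure (the bounds themselves exclude signal propagation).

```python
# Proposed bounds from the list above, in milliseconds (lower, upper).
TIMING_BOUNDS_MS = {
    "1:1 path protection, pre-established capacity": (100, 500),
    "1:1 path protection, pre-planned capacity": (100, 750),
    "local restoration": (50, 50),
    "path restoration": (1000, 5000),
}


def mechanisms_within(tolerance_ms, propagation_ms=0):
    """Return mechanisms whose upper bound plus propagation fits the tolerance."""
    return [
        name
        for name, (_low, high) in TIMING_BOUNDS_MS.items()
        if high + propagation_ms <= tolerance_ms
    ]


# Voice call cutoff is quoted as 140 msec to 2 sec; check both ends.
strict = mechanisms_within(140, propagation_ms=20)
relaxed = mechanisms_within(2000, propagation_ms=20)
```

Under these assumptions only local restoration fits the strict end of the voice range, while all mechanisms except path restoration fit the relaxed end.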
To ensure that the service requirements for different
applications can be met within the above timing bounds,
restoration priority must be implemented to determine the order
in which connections are restored (to minimize service
restoration time as well as to gain access to available spare
capacity on the best paths). For example, mission-critical
applications may require high restoration priority. At the
fiber layer, priority might be given to certain classes of
customers rather than to specific applications, with the
traffic types enclosed within the customer aggregate.
Preemption priority should only be used in the event that not
all connections can be restored, in which case connections with
lower preemption priority should be released. Depending on a
service provider's strategy in provisioning network resources
for backup, preemption may or may not be needed in the network.
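The interplay of restoration priority and preemption priority can be sketched as follows. This is a minimal illustration, not a specified procedure: the names are hypothetical, a higher number is assumed to mean a higher priority, and capacity is treated as a single pool rather than per link.

```python
from dataclasses import dataclass


@dataclass
class Connection:
    name: str
    bandwidth: int
    restoration_priority: int   # assumed: higher value is restored earlier
    preemption_priority: int    # assumed: lower value is released first


def restore(failed, spare_capacity):
    """Restore failed connections in restoration-priority order."""
    restored, unrestored = [], []
    for conn in sorted(failed, key=lambda c: -c.restoration_priority):
        if conn.bandwidth <= spare_capacity:
            spare_capacity -= conn.bandwidth
            restored.append(conn.name)
        else:
            unrestored.append(conn)
    return restored, unrestored, spare_capacity


def preempt_for(conn, working, spare_capacity):
    """Names of lower-priority connections to release so conn fits, or None."""
    released = []
    for victim in sorted(working, key=lambda c: c.preemption_priority):
        if spare_capacity >= conn.bandwidth:
            break
        if victim.preemption_priority >= conn.preemption_priority:
            break   # remaining connections are at least as important
        spare_capacity += victim.bandwidth
        released.append(victim.name)
    return released if spare_capacity >= conn.bandwidth else None


failed = [
    Connection("bulk", 40, 1, 1),
    Connection("trading", 60, 3, 3),
    Connection("voice", 30, 2, 2),
]
working = [Connection("extra", 25, 0, 0), Connection("video", 50, 2, 2)]

restored, unrestored, spare = restore(failed, spare_capacity=70)
voice, bulk = unrestored
released_for_voice = preempt_for(voice, working, spare)
released_for_bulk = preempt_for(bulk, working, spare)
```

Preemption is consulted only for connections that could not be restored from spare capacity, which mirrors the ordering suggested in the text.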
5.5 Coordination Among Layers
A common design goal for networks with multiple technological
layers is to provide the desired level of service in the most
cost-effective manner. Multilayer survivability may allow the
optimization of spare resources through the improvement of
resource utilization by sharing spare capacity across different
layers, though further investigations are needed. Coordination
during recovery among different network layers (e.g., IP,
SDH/SONET, optical layer) might necessitate development of
vertical hierarchy. The benefits of providing survivability
mechanisms at multiple layers, and the optimization of the
overall approach, must be weighed against the associated cost
and service impacts.
A default coordination mechanism for inter-layer interaction
could be the use of nested timers and current SDH/SONET fault
monitoring, as has been done traditionally for backward
compatibility. Thus, when lower-layer recovery happens in a
longer time period than higher-layer recovery, a hold-off timer
is utilized to avoid contention between the different
single-layer survivability schemes. In other words, multilayer
interaction is addressed by having each successively higher
multiplexing level operate at a protection/restoration time
scale greater than that of the next lower layer. This can
impact the
overall time to recover service. For example, if SDH/SONET
protection switching is used, MPLS recovery timers must wait
until SDH/SONET has had time to switch. Setting such timers
involves a tradeoff between rapid recovery and creation of a
race condition where multiple layers are responding to the same
fault, potentially allocating resources in an inefficient
manner.
In other configurations where the lower layer does not have a
restoration capability or is not expected to protect, say an
unprotected SDH/SONET linear circuit, then there must be a
mechanism for the lower layer to trigger the higher layer to
take recovery actions immediately. This difference in network
configuration means that implementations must allow for
adjustment of hold-off timer values and/or a means for a lower
layer to immediately indicate to a higher layer that a fault
has occurred so that the higher layer can take restoration or
protection actions.
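The two cases above (a protecting lower layer with a hold-off timer, and an unprotected lower layer that triggers the higher layer immediately) might be modeled as in the sketch below. The names and the 60 ms hold-off value are assumptions chosen for illustration; no specific timer values are proposed here.

```python
from dataclasses import dataclass


@dataclass
class LayerConfig:
    hold_off_ms: int              # adjustable by the operator
    lower_layer_protects: bool    # e.g., SDH/SONET protection switching in use


def effective_hold_off_ms(cfg, immediate_trigger=False):
    """How long the higher layer waits after a fault is detected."""
    if immediate_trigger or not cfg.lower_layer_protects:
        return 0
    return cfg.hold_off_ms


def higher_layer_should_act(cfg, ms_since_fault, fault_cleared,
                            immediate_trigger=False):
    """True when the higher layer (e.g., MPLS) should begin recovery."""
    if fault_cleared:
        return False   # the lower layer has already repaired the fault
    return ms_since_fault >= effective_hold_off_ms(cfg, immediate_trigger)


protected = LayerConfig(hold_off_ms=60, lower_layer_protects=True)
unprotected = LayerConfig(hold_off_ms=60, lower_layer_protects=False)

waits = higher_layer_should_act(protected, 20, fault_cleared=False)
acts_after_timer = higher_layer_should_act(protected, 60, fault_cleared=False)
stands_down = higher_layer_should_act(protected, 80, fault_cleared=True)
acts_at_once = higher_layer_should_act(unprotected, 0, fault_cleared=False)
```

A longer hold-off reduces the chance of both layers reacting to the same fault, at the cost of slower higher-layer recovery when the lower layer fails to repair it.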
Furthermore, faults at higher layers should not trigger
restoration or protection actions at lower layers [3, 4].
It was felt that current approaches to coordinating
survivability mechanisms did not have significant operational
shortfalls. These approaches include protecting traffic solely
at one layer (e.g., at the IP layer over linear WDM, or at the
SDH/SONET layer). Where survivability mechanisms might be
deployed at several layers, such as when a routed network rides
an SDH/SONET protected network, it was felt that current
coordination approaches were sufficient in many cases. One
exception is the hold-off of MPLS recovery until
the completion of SDH/SONET protection switching as described
above. This limits the recovery time of fast MPLS restoration.
Also, by design, the operations and mechanisms within a given
layer tend to be invisible to other layers.
5.6 Evolution Toward IP Over Optical
As more pressing requirements for survivability and horizontal
hierarchy for edge-to-edge signaling are met with technical
proposals, it is believed that the benefits of merging (in some
manner) the control planes of multiple layers will be outlined.
When these benefits are self-evident, it would then seem to be
the right time to review if vertical hierarchy mechanisms are
needed, and what the requirements might be. For example, a
future requirement might be to provide a better match between
the recovery requirements of IP networks and the recovery
capability of optical transport. One such proposal is
described in [14].
6. Hierarchy Requirements
Efforts in the area of network hierarchy should focus on
mechanisms that would allow more scalable edge-to-edge
signaling, or signaling across networks with existing network
hierarchy (such as multi-area OSPF). This appears to be a more
urgent need than mechanisms that might be needed to
interconnect networks at different layers.
6.1 Historical Context
One reason for horizontal hierarchy is functionality (e.g.,
metro versus backbone). Geographic "islands" or partitions
reduce the need for interoperability and make administration
and operations less complex. Using a simpler, more
interoperable, survivability scheme at metro/backbone
boundaries is natural for many provider network architectures.
In transmission networks, creating geographic islands of
different vendor equipment has been done for a long time
because multi-vendor interoperability has been difficult to
achieve. Traditionally, providers have to coordinate the
equipment on either end of a "connection," and making this
interoperable reduces complexity. A provider should be able to
concatenate survivability mechanisms in order to provide a
"protected link" to the next higher level. Think of SDH/SONET
rings connecting to TDM DXCs with 1+1 line-layer protection
between the ADM and the DXC port. The TDM connection, e.g., a
DS3, is protected, but usually all equipment on each SDH/SONET
ring is from a single vendor. The DXC cross connections are
controlled by the provider and the ports are physically
protected resulting in a highly available design. Thus,
concatenation of survivability approaches can be used to
cascade across horizontal hierarchy. While not perfect, it is
workable in the near- to mid-term until multi-vendor
interoperability is achieved.
While the problems associated with multi-vendor
interoperability may necessitate horizontal hierarchy as a
practical matter in the near to mid-term (at least this has
been the case in TDM networks), there should not be a technical
reason for it in the standards developed by the IETF for core
networks, or even most access networks. Establishing
interoperability of survivability mechanisms between multi-
vendor equipment in core IP networks is urgently required to
enable adoption of IP as a viable core transport technology and
to facilitate the traffic engineering of future multi-service
IP networks [3].
Some of the largest service provider networks currently run a
single area/level IGP. Some service providers, as well as many
large enterprise networks, run multi-area OSPF to gain
increases in scalability. Often, the hierarchy was part of the
original design, so it is difficult to say whether the network
truly required the hierarchy to reach its current size.
Some proposals on improved mechanisms to address network
hierarchy have been suggested [15, 16, 17, 18, 19]. This
document aims to provide the concrete requirements so that
these and other proposals can first aim to meet some limited
objectives.
6.2 Applications for Horizontal Hierarchy
A primary driver for intra-domain horizontal hierarchy is
signaling capabilities in the context of edge-to-edge VPNs,
potentially across traffic-engineered data networks. There are
a number of different approaches to layer 2 and layer 3 VPNs
and they are currently being addressed by different emerging
protocols in the provider-provisioned VPNs (e.g., virtual
routers) and Pseudo Wire Edge-to-Edge Emulation (PWE3) efforts
based on MPLS and/or IP tunnels. These may or may not need
explicit signaling from edge to edge, but it is a common
perception that in order to meet SLAs, some form of edge-to-
edge signaling may be required.
With a large number of edges (N), scalability is concerned with
avoiding the O(N^2) properties of edge-to-edge signaling.
However, the main issue here is not the volume of signaling
itself, such as in O(N^2) meshes with a "connection" between
every edge pair. Even if establishing and maintaining that many
connections is feasible in a large network, there might be an
impact on core survivability mechanisms that would cause
protection/restoration times to grow with N^2, which would be
undesirable. While some value of N may be inevitable,
approaches to reduce N (e.g., to pull in from the edge to
aggregation points) might be of value.
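The effect of reducing N can be seen with a short calculation. The sketch below counts the connections in a full mesh; the function name and the example sizes (200 edges pulled in to 20 aggregation points) are assumptions chosen only to show the scale of the difference.

```python
def full_mesh_connections(n, bidirectional=False):
    """Connections needed for a full mesh among n endpoints.

    Unidirectional connections (e.g., one LSP per direction) need
    n * (n - 1); bidirectional connections need half as many.
    """
    pairs = n * (n - 1)
    return pairs // 2 if bidirectional else pairs


edge_mesh = full_mesh_connections(200)          # mesh among all edges
aggregated_mesh = full_mesh_connections(20)     # mesh among aggregation points
reduction = edge_mesh / aggregated_mesh
```

A tenfold reduction in N yields roughly a hundredfold reduction in connections that core survivability mechanisms may have to handle after a fault.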
Thus, most service providers feel that O(N^2) meshes are not
necessary for VPNs, and that the number of tunnels to support
VPNs would be within the scalability bounds of current
protocols and implementations. While that may be the case,
there is currently a lack of ability to signal MPLS tunnels
from edge to edge across IGP hierarchy, such as OSPF areas.
This may
require the development of signaling standards that support
dynamic establishment and potentially restoration of LSPs
across a 2-level IGP hierarchy.
For routing scalability, especially in data applications, a
major concern is the amount of processing/state that is
required in the variety of network elements. If some nodes
might not be able to communicate and process the state of every
other node, it might be preferable to limit the information.
There is one school of thought that says that the amount of
information contained by a horizontal barrier should be
significant, and that the impacts this might have on optimality
in route selection and on the ability to provide global
survivability are acceptable tradeoffs.
6.3 Horizontal Hierarchy Requirements
Mechanisms are required to allow for edge-to-edge signaling of
connections through a network. One network scenario includes
medium to large networks that currently have hierarchical
interior routing such as multi-area OSPF or multi-level IS-IS.
The primary context of this is edge-to-edge signaling which is
thought to be required to assure the SLAs for the layer 2 and
layer 3 VPNs that are being carried across the network.
Another possible context would be edge-to-edge signaling in TDM
SDH/SONET networks with IP control, where metro and core
networks again might be in a hierarchical interior routing
domain.
To support edge-to-edge signaling in the above network
scenarios within the framework of existing horizontal
hierarchies, current traffic engineering (TE) methods [20, 6]
may need to be extended. Requirements for multi-area TE need
to be developed to provide guidance for any necessary protocol
extensions.
7. Survivability and Hierarchy
When horizontal hierarchy exists in a network technology layer,
a question arises as to how survivability can be provided along
a connection that crosses hierarchical boundaries.
In designing protocols to meet the requirements of hierarchy,
an approach to consider is that boundaries are either clean, or
are of minimal value. However, the concept of network elements
that participate on both sides of a boundary might be a
consideration (e.g., OSPF ABRs). That would allow for devices
on either side to take an intra-area approach within their
region of knowledge, and for the ABR to do this in both areas,
and splice the two protected connections together at a common
point (granted it is a common point of failure now). If the
limitations of this approach start to appear in operational
settings, then perhaps it would be time to start thinking about
route-servers and signaling propagated directives. However,
one initial approach might be to signal through a common border
router, and to consider the service as protected because it
consists of a concatenated set of connections, each protected
within its area. Another approach might be to have a least
common denominator mechanism at the boundary, e.g., 1+1 port
protection. There should also be some standardized means for a
survivability scheme on one side of such a boundary to
communicate with the scheme on the other side regarding the
success or failure of the recovery action. For example, if a
part of a "connection" is down on one side of such a boundary,
there is no need for the other side to recover from failures.
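The concatenation approach and the cross-boundary status exchange might be sketched as follows. The data model is hypothetical: each area reports whether its segment is currently up (possibly after its own recovery), and a segment attempts recovery only if no other segment of the same connection is reported down.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    area: str
    protected: bool   # protected by an intra-area mechanism


def service_is_protected(segments):
    """A concatenated service is treated as protected if every segment is."""
    return all(seg.protected for seg in segments)


def should_attempt_recovery(area, segment_up):
    """Recover in 'area' only if all other areas report their segment up.

    segment_up maps an area name to True (up or recovered) or False
    (down, recovery failed), as learned across the boundary.
    """
    return all(up for other, up in segment_up.items() if other != area)


segments = [
    Segment("area 1", True),
    Segment("area 0", True),
    Segment("area 2", True),
]
status = {"area 1": False, "area 0": True, "area 2": True}

protected = service_is_protected(segments)
area2_recovers = should_attempt_recovery("area 2", status)
area1_recovers = should_attempt_recovery("area 1", status)
```

Here area 2 skips recovery because the connection is already down in area 1, while area 1 may still retry since the rest of the connection is intact.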
In summary, at this time, approaches as described above that
allow concatenation of survivability schemes across
hierarchical boundaries seem sufficient.
8. Security Considerations
The set of SRGs that are defined for a network under a common
administrative control and the corresponding assignment of
these SRGs to nodes and links within the administrative control
is sensitive information and needs to be protected. An SRG is
an acknowledgement that nodes and links that belong to an SRG
are susceptible to a common threat. An adversary with access
to information contained in an SRG could use that information
to design an attack, determine the scope of damage caused by
the attack and, therefore, maximize the effect of the attack.
The label used to refer to a particular SRG must allow for an
encoding such that sensitive information such as physical
location, function, purpose, customer, fault type, etc. is not
readily discernible by unauthorized users.
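One possible way to obtain labels with this property, offered only as an assumption and not as a mechanism defined by this document, is to derive each label from a keyed hash of the operator's internal SRG description. The sketch below uses HMAC-SHA256 from the Python standard library; the key, the description strings, and the function name are hypothetical.

```python
import hashlib
import hmac


def opaque_srg_label(secret_key: bytes, srg_description: str,
                     length: int = 8) -> str:
    """Derive a label that does not reveal the SRG description.

    Truncating the digest to 'length' bytes shortens the label but
    raises the collision probability, so the operator would still
    keep a private table mapping labels back to SRGs.
    """
    digest = hmac.new(secret_key, srg_description.encode("utf-8"),
                      hashlib.sha256).hexdigest()
    return digest[: 2 * length]


key = b"operator-private-key"   # placeholder; a real key would be random
conduit = "conduit 17, Main St bridge, customer X"
label = opaque_srg_label(key, conduit)
other = opaque_srg_label(key, "central office power feed B")
```

The label is stable for a given key and description, so it can be advertised consistently, yet it carries no readable location or customer information.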
SRG information that is propagated through the control and
management plane should allow for an encryption mechanism. An
example of an approach would be to use IPsec [21] on all
packets carrying SRG information.
9. References
1 Bradner, S., "The Internet Standards Process -- Revision 3",
BCP 9, RFC 2026, October 1996.
2 Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
3 K. Owens, V. Sharma, and M. Oommen, "Network Survivability
Considerations for Traffic Engineered IP Networks,"
Internet-Draft, Work in Progress, May 2002.
4 V. Sharma, B. Crane, S. Makam, K. Owens, C. Huang, F.
Hellstrand, J. Weil, L. Andersson, B. Jamoussi, B. Cain, S.
Civanlar, and A. Chiu, "Framework for MPLS-based Recovery,"
Internet-Draft, Work in Progress, May 2002.
5 M. Thorup, "Fortifying OSPF/ISIS Against Link Failure,"
http://www.research.att.com/~mthorup/PAPERS/lf_ospf.ps
6 D.O. Awduche, A. Chiu, A. Elwalid, I. Widjaja, and X. Xiao,
"Overview and Principles of Internet Traffic Engineering,"
RFC 3272, May 2002.
7 S. Dharanikota, R. Jain, D. Papadimitriou, R. Hartani, G.
Bernstein, V. Sharma, C. Brownmiller, Y. Xue, and J. Strand,
"Inter-domain routing with Shared Risk Groups," Internet-
Draft, Work in Progress, July 2001.
8 N. Harrison, P. Willis, S. Davari, E. Cuevas, B. Mack-Crane,
E. Franze, H. Ohta, T. So, S. Goldfless, and F. Chen,
"Requirements for OAM in MPLS Networks," Internet-Draft,
Work in Progress, December 2001.
9 D. Allan and M. Azad, "A Framework for MPLS User Plane OAM,"
Internet-Draft, Work in Progress, July 2001.
10 S. Kini, M. Kodialam, T.V. Lakshman, S. Sengupta, and C.
Villamizar, "Shared Backup Label Switched Path Restoration,"
Internet-Draft, Work in Progress, May 2001.
11 G. Li, C. Kalmanek, J. Yates, G. Bernstein, F. Liaw, and V.
Sharma, "RSVP-TE Extensions For Shared-Mesh Restoration in
Transport Networks," Internet-Draft, Work in Progress, July
2001.
12 P. Pan (Editor), D.H. Gan, G. Swallow, J. Vasseur, D.
Cooper, A. Atlas, and M. Jork, "Fast Reroute Extensions to
RSVP-TE for LSP Tunnels," Internet-Draft, Work in Progress,
January 2002.
13 A. Atlas, C. Villamizar, and C. Litvanyi, "MPLS RSVP-TE
Interoperability for Local Protection/Fast Reroute,"
Internet-Draft, Work in Progress, July 2001.
14 A. Chiu and J. Strand, "Joint IP/Optical Layer Restoration
after a Router Failure," Proc. OFC'2001, Anaheim, CA, March
2001.
15 K. Kompella and Y. Rekhter, "Multi-area MPLS Traffic
Engineering," Internet-Draft, Work in Progress, March 2001.
16 G. Ash, et al, "Requirements for Multi-Area TE," Internet-
Draft, Work in Progress, September 2001.
17 A. Iwata, N. Fujita, G.R. Ash, and A. Farrel, "Crankback
Routing Extensions for MPLS Signaling," Internet-Draft, Work
in Progress, July 2001.
18 C-Y. Lee, A. Celer, N. Gammage, S. Ghanti, and G. Ash,
"Distributed Route Exchangers," Internet-Draft, Work in
Progress, March 2001.
19 C-Y. Lee and S. Ghanti, "Path Request and Path Reply Message,"
Internet-Draft, Work in Progress, July 2001.
20 D. Awduche, J. Malcolm, J. Agogbua, M. O'Dell, J. McManus,
"Requirements for Traffic Engineering Over MPLS," RFC 2702,
September 1999.
21 S. Kent and R. Atkinson, "Security Architecture for the
Internet Protocol," RFC 2401, November 1998.
10. Acknowledgments
Much of the direction taken in this document, and by the team in
its initial effort, was steered by the insightful questions provided
by Bala Rajagoplan, Greg Bernstein, Yangguang Xu, and Avri Doria.
The set of questions is attached as Appendix A in this document.
After the release of the first draft, a number of comments were
received. Thanks go to Jerry Ash, Sudheer Dharanikota, Chuck
Kalmanek, Dan Koller, Lyndon Ong, Steve Plote, and Yong Xue for
their input.
11. Authors' Addresses
Wai Sum Lai
AT&T
200 Laurel Avenue
Middletown, NJ 07748, USA
Tel: +1 732-420-3712
wlai@att.com
Dave McDysan
WorldCom
22001 Loudoun County Pkwy
Ashburn, VA 20147, USA
dave.mcdysan@wcom.com
Jim Boyle
Protocol Driven Networks
Tel: +1 919-852-5160
jboyle@pdnets.com
Malin Carlzon
malin@sunet.se
Rob Coltun
Movaz
Tim Griffin
AT&T
180 Park Avenue
Florham Park, NJ 07932, USA
Tel: +1 973-360-7238
griffin@research.att.com
Ed Kern
ejk@tech.org
Tom Reddington
Lucent Technologies
67 Whippany Rd
Whippany, NJ 07981, USA
Tel: +1 973-386-7291
treddington@bell-labs.com
Appendix A: Questions used to help develop requirements
A. Definitions
1. In determining the specific requirements, the design team should
precisely define the concepts "survivability", "restoration",
"protection", "protection switching", "recovery", "re-routing" etc.
and their relations. This would enable the requirements document to
describe precisely which of these will be addressed.
In the following, the term "restoration" is used to indicate the
broad set of policies and mechanisms used to ensure survivability.
B. Network types and protection modes
1. What is the scope of the requirements with regard to the types of
networks covered? Specifically, are the following in scope:
Restoration of connections in mesh optical networks (opaque or
transparent)
Restoration of connections in hybrid mesh-ring networks
Restoration of LSPs in MPLS networks (composed of LSRs overlaid on a
transport network, e.g., optical)
Any other types of networks?
Is commonality of approach, or optimization of approach more
important?
2. What are the requirements with regard to the protection modes to
be supported in each network type covered? (Examples of protection
modes include 1+1, M:N, shared mesh, UPSR, BLSR, newly defined modes
such as P-cycles, etc.)
3. What are the requirements on local span (i.e., link by link)
protection and end-to-end protection, and the interaction between
them? For example, what should be the granularity of connections
for each type (single connection, bundle of connections, etc.)?
C. Hierarchy
1. Vertical (between two network layers):
What are the requirements for the interaction between
restoration procedures across two network layers, when these
features are offered in both layers? (For example, an MPLS network
realized over point-to-point optical connections.) In such a case,
(a) Are there any criteria to choose which layer should provide
protection?
(b) If both layers provide survivability features, what are the
requirements to coordinate these mechanisms?
(c) How is the current lack of cross-layer coordination
functionality hampering operations?
(d) Would the benefits be worth additional complexity associated
with routing isolation (e.g. VPN, areas), security, address
isolation and policy / authentication processes?
2. Horizontal (between two areas or administrative subdivisions
within the same network layer):
(a) What are the criteria that trigger the creation of protocol
or administrative boundaries pertaining to restoration? (e.g.,
scalability? multi-vendor interoperability? multi-provider? what
are the practical issues?) Should multi-vendor operation
necessitate hierarchical separation?
When such boundaries are defined:
(b) What are the requirements on how protection/restoration is
performed end-to-end across such boundaries?
(c) If different restoration mechanisms are implemented on two
sides of a boundary, what are the requirements on their interaction?
What is the primary driver of horizontal hierarchy? (select one)
- functionality (e.g., metro versus backbone)
- routing scalability
- signaling scalability
- current network architecture, trying to layer on TE on top of
an already hierarchical network architecture
- routing and signaling
For signaling scalability, is it
- manageability
- processing/state of network
- edge-to-edge N^2 type issue
For routing scalability, is it
- processing/state of network
- are you flat and want to go hierarchical
- or already hierarchical?
- data or TDM application?
D. Policy
1. What are the requirements for policy support during
protection/restoration, e.g., restoration priority, preemption, etc.?
E. Signaling Mechanisms
1. What are the requirements on the signaling transport mechanism
(e.g., in-band over SDH/SONET overhead bytes, out-of-band over an IP
network, etc.) used to communicate restoration protocol messages
between network elements? What are the bandwidth and other
requirements on the signaling channels?
2. What are the requirements on fault detection/localization
mechanisms (which are the prelude to performing restoration
procedures) in the case of opaque and transparent optical networks?
What are the requirements in the case of MPLS restoration?
3. What are the requirements on signaling protocols to be used in
restoration procedures (e.g., high priority processing, security,
etc.)?
4. Are there any requirements on the operation of restoration
protocols?
F. Quantitative
1. What are the quantitative requirements (e.g., latency) for
completing restoration under different protection modes (for both
local and end-to-end protection)?
G. Management
1. What information should be measured/maintained by the control
plane at each network element pertaining to restoration events?
2. What are the requirements for the correlation between control
plane and data plane failures from the restoration point of view?
Full Copyright Statement
"Copyright (C) The Internet Society (date). All Rights Reserved.
This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph
are included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other than
English.
The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.