Internet-Draft Grenville Armitage
Lucent Technologies
July 3rd, 1997
Redundant MARS architectures and SCSP
<draft-armitage-ion-mars-scsp-03.txt>
Status of this Memo
This document was submitted to the IETF Internetworking over NBMA
(ION) WG. Publication of this document does not imply acceptance by
the ION WG of any ideas expressed within. Comments should be
submitted to the ion@nexen.com mailing list.
Distribution of this memo is unlimited.
This memo is an internet draft. Internet Drafts are working documents
of the Internet Engineering Task Force (IETF), its Areas, and its
Working Groups. Note that other groups may also distribute working
documents as Internet Drafts.
Internet Drafts are draft documents valid for a maximum of six
months. Internet Drafts may be updated, replaced, or obsoleted by
other documents at any time. It is not appropriate to use Internet
Drafts as reference material or to cite them other than as a "working
draft" or "work in progress".
Please check the 1id-abstracts.txt listing contained in the
internet-drafts shadow directories on ds.internic.net (US East
Coast), nic.nordu.net (Europe), ftp.isi.edu (US West Coast), or
munnari.oz.au (Pacific Rim) to learn the current status of any
Internet Draft.
Abstract
The Server Cache Synchronisation Protocol (SCSP) has been proposed as
a general mechanism for synchronising the contents of databases.
This document identifies a range of distributed MARS (RFC 2022)
scenarios, highlights associated problems, and describes the issues
that must be addressed when using SCSP to synchronize a distributed
MARS.
Document History
July 1997
Version 03. Updated references, cleaned up some text, and now points
to the new MARS/SCSP specification by Jim Luciani and Anthony Gallo
(draft-armitage-ion-mars-scsp-spec is no longer being worked on).
Updated the author's contact details.

Armitage Expires January 3rd, 1998 [Page 1]
November 1996
Version 02. Removed appendix describing specific SCSP based
solution. This will be developed within
draft-armitage-ion-mars-scsp-spec.
1. Introduction.
SCSP [1] is being developed within the Internetworking over NBMA (ION)
working group as a general solution for synchronizing distributed
databases such as distributed Next Hop Servers [2] and MARSs [3].
This document attempts to identify the range of redundant MARS
scenarios and describe the associated problems that will need to be
addressed by an SCSP based solution [4].
In the current MARS model a Cluster (typically the same scope as a
LIS) consists of a number of MARS Clients (IP/ATM interfaces in
routers and/or hosts) utilizing the services of a single MARS. This
MARS is responsible for tracking the IP group membership information
across all Cluster members, and providing on-demand associations
between IP multicast group identifiers (addresses) and multipoint ATM
forwarding paths. It is also responsible for allocating Cluster
Member IDs (CMIs) to Cluster members (inserted into outgoing data
packets, to allow reflected packet detection when Multicast Servers
are placed in the data path).
Two different, but significant goals motivate the distribution of the
MARS functionality across a number of physical entities:
Fault tolerance
If a client discovers the MARS it is using has
failed, it can switch to another MARS and continue
operation where it left off.
Load sharing
A logically single MARS is realized using a number
of individual MARS entities. MARS Clients in a given
Cluster are shared amongst the individual MARS
entities.
A general solution to Load Sharing may also provide Fault Tolerance
to the MARS Clients. However, it is not necessarily true that methods
for supporting Fault Tolerance will also support Load Sharing.
Some additional terminology is required to describe the options.
These reflect the differing relationships the MARSs have with each
other and the Cluster members (clients).
Fault tolerant model:
Active MARS
The single MARS serving the Cluster. It allocates
CMIs and tracks group membership changes by itself.
It is the sole entity that constructs replies to
MARS_REQUESTs.
Backup MARS
An additional MARS that tracks the information being
generated by the Active MARS. Cluster members may
re-register with a Backup MARS if the Active MARS
fails, and they'll assume the Backup has sufficient
up to date knowledge of the Cluster's state to take
the role of Active MARS.
Living Group
The set of Active MARS and current Backup MARS
entities. When a MARS entity dies it falls out of
the Living Group. When it restarts, it rejoins the
Living Group. Election of the Active MARS takes
place amongst the members of the Living Group.
MARS Group
The total set of MARS entities configured to be part
of the distributed MARS. This is the combination of
the Living Group and 'dead' MARS entities that may
be currently dying, dead, or restarting. The list is
constructed in the following order {Active MARS,
Backup MARS, ... Backup MARS, dead MARS,.... dead
MARS}. If there are no 'dead' MARS entities, the
MARS Group and Living Group are identical.
Load sharing model:
Active Sub-MARS
Each simultaneously active MARS entity forming part
of a distributed MARS is an Active Sub-MARS. Each
Active Sub-MARS must create the impression that it
performs all the operations of a single Active MARS
- allocating CMIs and tracking group membership
information within the Cluster. MARS_REQUESTs sent
to a single Active Sub-MARS return information
covering the entire Cluster.
Active Sub-MARS Group
The set of Active Sub-MARS entities that are
currently representing the co-ordinated distributed
MARS for the Cluster. Cluster members are
distributed amongst all the members of the Active
Sub-MARS Group.
Backup Sub-MARS
A MARS entity that tracks the activities of an
Active Sub-MARS, and is able to become a member of
the Active Sub-MARS group when failure occurs.
MARS Group
The set of Active Sub-MARS and Backup Sub-MARS
entities. When a MARS entity dies it falls out of
the MARS Group. When it restarts, it rejoins the
MARS Group. Election of the Active Sub-MARS entities
takes place amongst the members of the MARS Group.
Load Sharing does not involve complete distribution of processing
load and database size amongst the members of the Active Sub-MARS
Group. This is discussed further in section 4.
The rest of this document looks at a variety of different failure and
load sharing scenarios, and describes what is expected of the various
MARS entities. Section 2 begins by reviewing the existing Client
interface to the MARS. Section 3 takes a closer look at the
problems faced by the Fault Tolerant service. Section 4 expands on
this to include the additional demands of Load Sharing.
2. MARS Client expectations.
MARS Clients (and Multicast Servers) expect only one MARS to be
currently Active, with zero or more Backup MARS entities available in
case a failure of the Active MARS is detected. From their perspective
the Active MARS is the target of their registrations, MARS_REQUESTs,
and group membership changes. The Active MARS is the source of group
membership change information for other Cluster members, and
MARS_REDIRECT_MAP messages listing the currently available Backup
MARS entities.
A MARS client will act as though:
MARS_REQUESTs to the Active MARS return Cluster-wide information.
MARS_JOINs and MARS_LEAVEs sent to the Active MARS have Cluster-
wide impact when necessary.
MARS_JOINs and MARS_LEAVEs received from the Active MARS represent
Cluster-wide activity.
The MARS entities listed in received MARS_REDIRECT_MAP messages
are legitimate Backup MARS entities for the Cluster.
MARS Clients have a specific behavior during MARS failure (Section
5.4 of [3]). When a MARS Client detects a failure of its MARS, it
steps to the next member of the Backup MARS list (from the most
recent MARS_REDIRECT_MAP) and attempts to re-register. If the re-
registration fails, the process repeats until a functional MARS is
found.
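The client-side failover walk described above (step to the next MARS
in the list, wrap around, and repeat until a re-registration succeeds)
can be sketched as follows. The function and parameter names are
illustrative, not taken from [3], and the retry bound stands in for
whatever pacing a real client would apply:

```python
def find_working_mars(redirect_map, attempt_register, max_cycles=3):
    """Cycle through the MARS entities from the most recent
    MARS_REDIRECT_MAP until one accepts a re-registration.

    redirect_map     - ordered list of MARS addresses (Active first)
    attempt_register - predicate standing in for an actual
                       re-registration attempt (hypothetical hook)
    """
    for _ in range(max_cycles):       # a real client paces retries
        for mars in redirect_map:
            if attempt_register(mars):
                return mars           # functional MARS found
    return None                       # no MARS responded this pass
```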
Sections 5.4.1 and 5.4.2 of [3] describe how a MARS Client, after
successfully re-registering with a MARS, re-issues all the MARS_JOIN
messages that it had sent to its previous MARS. This causes the new
MARS to build a group membership database reflecting that of the
failed MARS immediately prior to its failure. (This behaviour is
required for the case where there is only one MARS available and it
suffers a crash/reboot cycle.) The MARS Clients behave like a
distributed cache 'memory', imposing their group membership state
onto the newly restarted MARS.
(It is worth noting that the MARS itself will propagate MARS_JOINs
out on ClusterControlVC for each group re-joined by a MARS Client.
Other MARS Clients will treat the new MARS_JOINs as redundant
information - if they already have a pt-mpt VC out to a given group,
the re-joining group member will already be a leaf node.)
An alternative use of the MARS_REDIRECT_MAP message is also provided
- forcing Clients to shift from one MARS to another even when failure
has not occurred. This is achieved when a Client receives a
MARS_REDIRECT_MAP message where the first listed MARS address is not
the same as the address of the MARS it is currently using. The
client then uses bit 7 of the mar$redirf flag to control whether a
'hard' or 'soft' redirect will be performed. If the bit is reset, a
'soft' redirect occurs which does not include re-joining all groups.
(In contrast, a client re-registering after actual MARS failure
performs a 'hard' redirect.)
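The redirect decision described above can be summarised in a short
sketch: the client first checks whether the first listed MARS is the
one it is already using, and only then consults bit 7 of mar$redirf.
Function and label names are illustrative, not from [3]:

```python
HARD_REDIRECT_BIT = 0x80  # bit 7 of the mar$redirf flag

def handle_redirect_map(current_mars, mars_list, redirf):
    """Decide a client's reaction to a MARS_REDIRECT_MAP.

    Returns 'stay' (no change), 'hard' (re-register and re-join
    all groups), or 'soft' (re-register without re-joining).
    """
    if not mars_list or mars_list[0] == current_mars:
        return "stay"                 # already using the listed Active MARS
    if redirf & HARD_REDIRECT_BIT:
        return "hard"                 # bit set: hard redirect
    return "soft"                     # bit reset: soft redirect
```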
The current MARS specification is not clear on how MARS Clients
should handle the Cluster Sequence Number (CSN). In Section 5.4.1.2
of [3] it says:
"When a new cluster member starts up it should initialise HSN to
zero. When the cluster member sends the MARS_JOIN to register
(described later), the HSN will be correctly updated to the
current CSN value when the endpoint receives the copy of its
MARS_JOIN back from the MARS."
(The HSN - Host Sequence Number - is the MARS Client's own opinion of
what the last seen CSN value was. Revalidation is triggered if the
CSN exceeds HSN + 1.)
Although the text in [3] is not explicit, a MARS Client MUST reset
its own HSN to the CSN value carried in the registration MARS_JOIN
returned by the new MARS (section 5.2.3 [3]).
The reason for this is as follows:
CSN increments occur every time a message is transmitted on
ClusterControlVC. This can occur very rapidly.
It may not be reasonable to keep the Backup MARS entities up to date
with the CSN from the Active MARS, considering how much inter-MARS
SCSP traffic this would imply.
If the HSN is not updated using the Backup MARS's CSN, and the
Backup's CSN is lower than the client's HSN, no warnings are
given. This opens a window of opportunity for cluster members to
lose messages on the new ClusterControlVC and not detect the
losses.
It is not a major issue if the HSN is updated using the Backup MARS's
CSN, and the Backup's CSN was higher than the client's original HSN.
This should only occur when the MARS Client is doing a hard-redirect
or re-registration after MARS failure, in which case complete group
revalidation must occur anyway (section 5.4.1 [3]). A soft-redirect
is reserved only for those cases when the Backup MARS is known to be
fully synchronised with the Active MARS.
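The CSN handling argued for in this section reduces to a simple rule:
adopt the MARS's CSN unconditionally on (re)registration, and
otherwise trigger revalidation when the CSN jumps more than one past
the HSN. A minimal sketch (not normative; names are illustrative):

```python
def update_hsn(hsn, csn, registering=False):
    """Update a client's Host Sequence Number on receipt of a CSN.

    On (re)registration the client resets its HSN to the MARS's
    CSN, even if that CSN is lower.  Otherwise a gap larger than
    one signals lost ClusterControlVC messages and forces group
    revalidation.  Returns (new_hsn, revalidate_needed).
    """
    if registering:
        return csn, False             # adopt the new MARS's CSN
    revalidate = csn > hsn + 1        # gap implies lost messages
    return csn, revalidate
```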
3. Architectures for Fault Tolerance.
This section looks at the possible situations that will be faced by a
Fault Tolerant distributed MARS model. No attempt will be made to
consider the more general goals of Load Sharing amongst the
distributed MARS entities.
The following initial Cluster arrangement will be assumed for all
examples:
C1 C2 C3
| | |
------------- M1 ------------
|
M2--M3--M4
The MARS Group for this Cluster is {M1, M2, M3, M4}. Initially the
Living Group is equivalent to the MARS Group. The Cluster members
(C1, C2, and C3) begin by using M1 as the Active MARS for the
Cluster. M2, M3, and M4 are the Backup MARS entities.
The Active MARS regularly transmits a MARS_REDIRECT_MAP on
ClusterControlVC containing the members of the MARS Group (not the
Living Group, this will be discussed in section 3.5). In this example
M1 transmits a MARS_REDIRECT_MAP specifying {M1, M2, M3, M4}.
Communication between M1, M2, M3, and M4 (to co-ordinate their roles
as Active and Backup MARS entities) is completely independent of the
communication between M1 and C1, C2, and C3. (The lines represent
associations, rather than actual VCs. M1 has pt-pt VCs between itself
and the cluster members, in addition to ClusterControlVC spanning out
to the cluster members.)
3.1 Initial failure of an Active MARS.
Assume the initial Cluster configuration is functioning properly.
Now assume some failure mode kills M1 without affecting the set of
Backup MARS entities. Each Cluster member re-registers with M2 (the
next MARS in the MARS_REDIRECT_MAP list), leaving the rebuilt cluster
looking like this:
C1 C2 C3
| | |
------------- M2 ------------
|
M3--M4
As noted in section 2, re-registering with M2 involves each cluster
member re-issuing its outstanding MARS_JOINs to M2. This will occur
whether or not M2 had prior knowledge of the group membership
database in M1.
The Living Group is now {M2, M3, M4}. In the immediate aftermath of
the cluster's rebuilding, M2 must behave as the Active MARS. This
includes transmitting a new version of MARS_REDIRECT_MAP that lists
the re-ordered MARS Group, {M2, M3, M4, M1}.
3.2 Failure of the Active MARS and a Backup MARS.
In the scenario of section 3.1 it is possible that M2 was also
affected by the condition that caused M1 to fail. In such a situation,
the MARS clients would have tried to re-register with M3, then M4,
then cycled back to M1. This sequence would repeat until one of the
set {M1, M2, M3, M4} allowed the clients to re-register. (Although
the Living Group is now {M3, M4}, the MARS Clients are not aware of
this and will cycle through the list of MARS entities they last
received in a MARS_REDIRECT_MAP.)
There is a potential here for the MARS Clients to end up re-
registering with different MARS entities. Consider what might occur
if M2's failure is transient. C1, C2, and C3 may not necessarily
attempt to re-register at exactly the same time. If C1 makes the
first attempt and discovers M2 is not responding, it will shift to
M3. If C2 and C3 attempt to re-register with M2 a short time later,
and M2 responds, we end up with the following cluster arrangement:
C1 C2 C3
| | |
M3 M2 ------------
| |
----- M4 ------
Obviously M2 and M3 cannot both behave as the Active MARS, because
they are attached to only a subset of the Cluster's members.
The solution is for members of a Living Group to elect and enforce
their own notion of who the Active MARS should be. This must occur
whenever a current member dies, or a new member joins the Living
Group. This can be utilized as follows:
The elected Active MARS builds an appropriate MARS Group list to
transmit in MARS_REDIRECT_MAPs. The elected Active MARS will be
listed first in the MARS_REDIRECT_MAP.
The Backup MARS entities obtain copies of this MARS_REDIRECT_MAP.
Clients that attempt to register with a Backup MARS will
temporarily succeed. However, the Backup MARS will immediately
issue its MARS_REDIRECT_MAP (with bit 7 of the mar$redirf flag
set).
Receipt of this MARS_REDIRECT_MAP causes the client to perform a
hard-redirect back to the indicated Active MARS.
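The enforcement step above (a Backup MARS accepting a wayward client's
registration, then immediately pushing it back to the elected Active
MARS) might be sketched as below. The message is shown as a plain
dictionary with illustrative field names; only the terminology follows
the draft:

```python
def backup_response(backup, active, mars_group):
    """What a Backup MARS sends to a client that registered with
    it while an elected Active MARS exists (illustrative sketch).

    mars_group - the redirect list built by the elected Active
                 MARS, with the Active MARS listed first.
    """
    assert backup != active, "the Active MARS would not redirect"
    return {
        "type": "MARS_REDIRECT_MAP",
        "mars_list": mars_group,      # elected Active MARS first
        "redirf_bit7": 1,             # hard redirect back to Active
    }
```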
Two election procedures would be triggered when M2's transient
failure caused it to leave and then rejoin the Living Group. Depending
how the election procedure is defined, the scenario described above
could have resulted in C1 shifting back to M2 (if M2 was re-elected
Active MARS), or C2 and C3 being told to move on to M3 (if M3
retained its position as Active MARS, attained when M2 originally
failed).
3.3 Tracking Cluster Member IDs during re-registration.
One piece of information that is not supplied by cluster members
during re-registration/re-joining is their Cluster Member ID (CMI) -
this must be supplied by the new Active MARS. It is highly desirable
that when a cluster member re-registers with M2 it be assigned the
same CMI that it obtained from M1. To ensure this, the Active MARS
MUST ensure that the Backup MARSs are aware of the ATM addresses and
CMIs of every cluster member.
This requirement stems from the use of CMIs in multicast data
AAL_SDUs for reflected packet detection. During the transition from
M1 to M2, some cluster members may transition earlier than others. If
they are assigned the same CMI as a pre-transition cluster member to
whom they are currently sending IP packets, the recipient will
discard these packets as though they were reflections from an MCS.
In the absence of a CMI tracking scheme, the problem would correct
itself once all cluster members had transitioned to M2. However, it
is preferable to avoid this interval completely, since there is
little reason for MARS failures to interrupt on-going data paths
between cluster members.
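The CMI-preservation requirement above amounts to the new Active MARS
consulting a replicated {ATM address: CMI} table before allocating a
fresh CMI. A minimal sketch, with illustrative names and a naive
free-CMI counter standing in for real allocation bookkeeping:

```python
def assign_cmi(cmi_table, atm_addr, next_free):
    """Assign a CMI on (re)registration, preserving any CMI the
    member held under the failed Active MARS.

    cmi_table - {atm_addr: cmi} replicated from the previous
                Active MARS to the Backups, as argued above.
    Returns (assigned_cmi, updated_next_free).
    """
    if atm_addr in cmi_table:
        return cmi_table[atm_addr], next_free   # keep the old CMI
    cmi_table[atm_addr] = next_free             # genuinely new member
    return next_free, next_free + 1
```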
3.4 Re-introducing failed MARS entities into the Cluster.
As noted in Section 3.2, the Living Group must have a mechanism for
electing and enforcing its choice of Active MARS. A byproduct of this
election must be to prioritize the Backup MARS entities, so that the
Active MARS can issue useful MARS_REDIRECT_MAP messages.
While MARS Clients only react when the Active MARS dies, the Living
Group must react when any one of its members dies. Conversely, when a
new member joins the Living Group (presumably a previously dead MARS
that has been restarted), a decision needs to be made about what role
the new member plays in the Living Group.
Two possibilities exist:
The new member becomes a Backup MARS, and is listed by the Active
MARS in subsequent MARS_REDIRECT_MAP messages.
The new member is immediately elected to take over as Active MARS.
Simply adding a new Backup MARS causes no disruption to the Cluster.
For example, if M1 restarted after the simple example in section 3.1,
M2 (as the Active MARS) might continue to send {M2, M3, M4, M1} in
its MARS_REDIRECT_MAP messages. Since M2 is still listed as the
Active MARS, MARS Clients will take no further action. (If M1 had
some characteristics that make it more desirable than M3 or M4, M2
might instead start sending {M2, M1, M3, M4}, but the immediate
effect would be the same.)
However, it is possible that M1 has characteristics that make it
preferable to any of the other Living Group members whenever it is
available. (This might include throughput, attachment point in the
ATM network, fundamental reliability of the underlying hardware,
etc.) Ideally, once M1 has recovered it is immediately re-elected to
the position of Active MARS. This action does have the ability to
temporarily disrupt MARS Clients, so it should be performed using the
soft-redirect function (Section 5.4.3 of [3]).
The soft-redirect avoids having each MARS Client re-join the
multicast groups it was a member of (consequently, the new Active
MARS must have synchronized its database with the previous Active
MARS prior to the redirection). Using the example from Section 3.1
again, once M1 had rejoined the Living Group and synchronized with
M2, M2 would stop sending MARS_REDIRECT_MAPs with {M2, M3, M4, M1}
and start sending MARS_REDIRECT_MAPs with {M1, M2, M3, M4}. Bit 7 of
the mar$redirf flag would be reset to indicate a soft redirect.
Cluster members re-register with M1, and generate a lot less
signaling traffic than would have been evident if a hard-redirect was
used.
Hard-redirects are used by Backup MARS entities to force wayward MARS
Clients back to the elected Active MARS.
3.5 Sending the MARS Group in MARS_REDIRECT_MAP.
It is important that MARS_REDIRECT_MAPs contain the entire MARS Group
rather than just the Living Group. Whilst the dead MARS entities (if
any) are obviously of no immediate benefit to a MARS Client,
including them in the MARS_REDIRECT_MAP improves the chances of a
Cluster recovering from a catastrophic failure of all MARS entities
in the MARS Group.
Consider what might happen if only the Living Group were listed in
MARS_REDIRECT_MAP. As each Active MARS dies, the Living Group
shrinks, and each MARS Client is updated with a smaller list of
Backup MARS entities to cycle through during the next MARS failure
(as described in section 2). If the final MARS fails, the MARS Client
is potentially left with a list of just one MARS entity to keep re-
trying (the last Living Group advertised by the Active MARS).
There is no way to predict that the final Active MARS to die will
restart even if the rest of the MARS Group does. By listing the
entire MARS Group we improve the chances of a MARS Client eventually
finding a restarted MARS entity after the final MARS dies.
Prioritizing the list in each MARS_REDIRECT_MAP, such that Backup
MARS entities known to be alive are ahead of dead MARS entities,
ensures this approach does not cause MARS Clients any problems while
the Living Group has one or more members.
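The ordering rule just described (elected Active MARS first, then
live Backups, then dead entities from the MARS Group) can be sketched
directly; the draft does not fix an exact algorithm, so this is one
illustrative construction:

```python
def build_redirect_list(active, living, mars_group):
    """Order the MARS Group for a MARS_REDIRECT_MAP so that dead
    entities are included but never shadow live Backups.

    active     - the elected Active MARS
    living     - set of MARS entities in the Living Group
    mars_group - full configured MARS Group (any order)
    """
    backups = [m for m in mars_group if m in living and m != active]
    dead = [m for m in mars_group if m not in living]
    return [active] + backups + dead
```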
3.6 The impact of Multicast Servers.
The majority of the analysis presented for MARS Clients applies to
Multicast Servers (MCS) as well. They utilize the MARS in a parallel
fashion to MARS Clients, and respond to MARS_REDIRECT_MAP (received
over ServerControlVC) in the same way. In the same way that MARS
Clients re-join their groups after a hard-redirect, MCSs also re-
register (using MARS_MSERV) for groups that they are configured to
support.
However, the existence of MCS supported groups imposes a very
important requirement on the Living Group. Consider what would happen
if the Backup MARS M2 in section 3.1 had no knowledge of which groups
were MCS supported immediately after the failure of M1.
Active MARS fails.
Cluster members and MCSs gradually detect the failure, and begin
re-registering with the first available Backup MARS.
Cluster members re-join all groups they were members of.
As the Backup (now Active) MARS receives these MARS_JOINs it
propagates them on its new ClusterControlVC.
Simultaneously each MCS re-registers for all groups they were
configured to support.
If a MARS_MSERV arrives for a group that already has cluster
members, the new Active MARS transmits an appropriate MARS_MIGRATE
on its new ClusterControlVC.
Assume that group X was MCS supported prior to M1's failure. Each
cluster member had a pt-mpt VC out to the MCS (a single leaf node).
MARS failure occurs, and each cluster member re-registers with M2.
The pt-mpt VC for group X is unchanged. Now cluster members begin
re-issuing MARS_JOINs to M2. If the MCS for group X has not yet re-
registered to support group X, M2 thinks the group is VC Mesh based,
so it propagates the MARS_JOINs on ClusterControlVC. Other cluster
members then update their pt-mpt VC for group X to add each 'new'
leaf node. This results in cluster members forwarding their data
packets to the MCS and some subset of the cluster members directly.
This is not good. When the MCS finally re-registers to support group
X, M2 will issue a MARS_MIGRATE. This fixes every cluster member's
pt-mpt VC for group X, but the transient period is quite messy.
If the entire Living Group is constantly aware of which groups are
MCS supported, a newly elected Active MARS can take temporary action
to avoid the scenario above. An obvious solution is for the new
Active MARS to internally treat the groups as MCS supported even
before the MCSs themselves have correctly re-registered. It would
suppress MARS_JOINs on ClusterControlVC for that group, just as
though the MCS was actually registered. Ultimately the MCS would re-
register, and operation continues normally. [This issue needs further
careful thought, especially to cover the situation where the MCS
fails to re-register in time. Perhaps the new Active MARS fakes a
MARS_LEAVE on ClusterControlVC for the MCS if it doesn't re-register
in the appropriate time? In theory at least this would correctly
force the group back to being VC Mesh based.]
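The interim behaviour suggested above, which the draft explicitly
leaves open, might look like the following sketch: the newly elected
Active MARS suppresses MARS_JOIN propagation for groups it knows were
MCS supported, and starts a timer so the group can fall back to a VC
mesh if the MCS never re-registers. All names are illustrative:

```python
def propagate_join(group, mcs_groups, mcs_registered, timers):
    """Decide whether a newly elected Active MARS propagates a
    MARS_JOIN for `group` on ClusterControlVC.

    mcs_groups     - groups known (via the Living Group) to have
                     been MCS supported before the failover
    mcs_registered - groups whose MCS has re-registered so far
    timers         - per-group fallback timers (hypothetical)
    """
    if group in mcs_groups and group not in mcs_registered:
        # Suppress, as though the MCS were still registered; arm
        # a timer to fake a MARS_LEAVE if it never comes back.
        timers.setdefault(group, "awaiting MCS re-registration")
        return False
    return True                       # ordinary VC Mesh behaviour
```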
4. Architectures for Load Sharing.
The issue of Load Sharing is typically raised during discussions on
the scaling limits for a Cluster [5]. Some of the 'loads' that are
of interest in a MARS Cluster are:
Number of MARS_JOIN/LEAVE messages handled per second by a given
MARS.
Number of MARS_REQUESTs handled per second by a given MARS.
Size of the group membership database.
Number of SVCs terminating on a given MARS entity from MARS
Clients and MCSs.
Number of SVCs traversing intervening ATM switches on their way to
a MARS that is topologically distant from some or all of its MARS
Clients and/or MCSs.
Having more than one MARS entity available does not affect each of
these loads equally.
It can be assumed that the average number of MARS_JOIN/LEAVE events
within a Cluster will rise as the number of Cluster members rises.
The group membership state changes of all Cluster members must be
propagated to all other Cluster members whenever they occur.
Subdividing a Cluster among a number of Active Sub-MARSs does not
change the fact that each Active Sub-MARS must track each and every
MARS_JOIN/LEAVE event. The MARS_JOIN/LEAVE event load is therefore
going to be effectively the same in each Active Sub-MARS as it would
have been for a single Active MARS. (An 'event' between the Active
Sub-MARSs is most likely an SCSP activity conveying the semantic
equivalent of the MARS_JOIN/LEAVE.)
If each Active Sub-MARS has a complete view of the cluster's group
membership, they can answer MARS_REQUESTs using locally held
information. It is possible that the average MARS_REQUEST rate
perceived by any one Active Sub-MARS would be lower than that
perceived by a single Active MARS. However, it is worth noting that
steady-state MARS_REQUEST load is likely to be significantly lower
than steady-state MARS_JOIN/LEAVE load anyway (since a MARS_REQUEST
is only used when a source first establishes a pt-mpt VC to a group -
subsequent group changes are propagated using MARS_JOIN/LEAVE
events). Distributing this load may not be a sufficiently valuable
goal to warrant the complexity of a Load Sharing distributed MARS
solution.
Partitioning the group membership database among the Active Sub-MARS
entities would actually work against the reduction in MARS_REQUEST
traffic per Active Sub-MARS. With a partitioned database each
MARS_REQUEST received by an Active Sub-MARS would require a
consequential query to the other members of the Active Sub-MARS
Group. The net effect would be to bring the total processing load
for handling MARS_REQUEST events (per Active Sub-MARS) back up to the
level that a single Active MARS would see. It would seem that a fully
replicated database across the Active Sub-MARS Group is preferable.
SVC limits at any given MARS are not actually as important as they
might seem. A single Active MARS would terminate an SVC per MARS
Client or MCS, and originate two pt-mpt SVCs (ClusterControlVC and
ServerControlVC). It might be argued that if a MARS resides over an
ATM interface that supports only X SVCs, then splitting the MARS into
two Active Sub-MARS would allow approximately 2*X MARS Clients and/or
MCSs (and so forth for 3, 4, ... N Active Sub-MARSs). However,
consider the wider context (discussed in [5]). If your ATM NICs are
technologically limited to X SVCs, then the MARS Clients and MCSs
making up the Cluster are likely to be similarly technologically
limited. Having 2 Active Sub-MARSs will not change the fact that your
cluster cannot have more than X members. (Consider that a VC Mesh for
any given multicast group could end up with a mesh of X by X, or an
MCS for the same group would have to terminate SVCs from up to X
sources.) So conserving SVCs at the MARS may not be a valid reason to
deploy a Load Sharing distributed MARS solution.
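The arithmetic behind this argument can be made concrete with a small
worked example (the numbers and function are purely illustrative):

```python
def cluster_limits(x_svcs, n_sub_mars):
    """Illustrative arithmetic for the SVC-limit argument above.

    Splitting the MARS across N Sub-MARSs multiplies the SVCs
    that can terminate on the MARS side, but the per-NIC limit X
    still caps cluster size and per-group VC mesh fan-out at the
    members themselves.
    """
    mars_side_capacity = n_sub_mars * x_svcs   # the 'N*X' argument
    max_cluster_members = x_svcs               # capped by member NICs
    mesh_vcs_per_group = x_svcs * x_svcs       # worst-case X by X mesh
    return mars_side_capacity, max_cluster_members, mesh_vcs_per_group
```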
SVC distributions across the switches of an ATM cloud can be
significantly affected by placement of MARS Clients relative to the
MARS itself. This load does benefit from the use of multiple Active
Sub-MARSs. If MARS Clients are configured to use a 'topologically
local' Active Sub-MARS, we reduce the number of long-haul pt-pt SVCs
that might otherwise traverse an ATM cloud to a single Active MARS.
Of the loads identified above, this one is arguably the only one that
justifies a Load Sharing distributed MARS solution.
The rest of this section will look at a number of scenarios that
arise when attempting to provide a Load Sharing distributed MARS.
4.1 Partitioning the Cluster.
A partitioned cluster has the following characteristics:
ClusterControlVC (CCVC) is partitioned into a number of sub-CCVCs,
one for each Active Sub-MARS. The leaf nodes of each sub-CCVC are
those cluster members making up the cluster partition served by
the associated Active Sub-MARS.
MARS_JOIN/LEAVE traffic to one Active Sub-MARS must propagate out
on each and every sub-CCVC to ensure Cluster wide distribution.
This propagation must occur quickly, as it will impact the overall
group change latency perceived by MARS Clients around the Cluster.
Allocation of CMIs across the cluster must be co-ordinated amongst
the Active Sub-MARSs to ensure no CMI conflicts within the
cluster.
Each sub-CCVC must carry MARS_REDIRECT_MAP messages with a MARS
list appropriate for the partition it sends to.
Each Active Sub-MARS must be capable of answering a MARS_REQUEST
or MARS_GROUPLIST_QUERY with information covering the entire
Cluster.
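The CMI co-ordination requirement in the list above could be met in
several ways; one simple static scheme (illustrative only, not
prescribed by this draft) is to partition the CMI space among the
Active Sub-MARSs so no two can ever allocate the same value:

```python
def cmi_ranges(sub_mars_ids, cmi_space=0xFFFF):
    """Statically partition the CMI space among Active Sub-MARSs.

    Returns {sub_mars_id: (first_cmi, last_cmi)} with disjoint,
    contiguous ranges, guaranteeing no CMI conflicts within the
    cluster without any runtime co-ordination.
    """
    n = len(sub_mars_ids)
    width = cmi_space // n
    return {
        sid: (i * width + 1, (i + 1) * width)
        for i, sid in enumerate(sub_mars_ids)
    }
```

A dynamic scheme would use less of the CMI space but would require
inter-Sub-MARS signalling on every allocation.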
Three mechanisms are possible for distributing MARS Clients among the
available Active Sub_MARSs.
MARS Clients could be manually configured with (or learn from a
configuration server) the ATM address of their administratively
assigned Active Sub-MARS. Each Active Sub-MARS simply accepts
whoever registers with it as a cluster member.
MARS Clients could be manually configured with (or learn from a
configuration server) an Anycast ATM address representing "the
nearest" Active Sub-MARS. Each Active Sub-MARS simply accepts
whoever registers with it as a cluster member.
MARS Clients could be manually configured with (or learn from a
configuration server) the ATM address of an arbitrary Active Sub-
MARS. The Active Sub-MARS entities have a mechanism for deciding
which clients should register with which Active Sub-MARS. If a
client registers with an incorrect Active Sub-MARS, it will be
redirected to the correct one.
Regardless of the mechanism used, it must be kept in mind that MARS
Clients themselves have no idea that they are being served by an
Active Sub-MARS. They see a single Active MARS at all times.
The Anycast ATM address approach is attractive, but no such service
is available under UNI 3.0 or UNI 3.1. This limits us to configuring
clients with the specific ATM address of the Active Sub-MARS to use
when a client first starts up.
Finally, if an Active Sub-MARS is capable of redirecting a MARS
Client to another Active Sub-MARS on-demand, then the client's choice
of initial Active Sub-MARS is more flexible. However, while dynamic
reconfiguration is desirable, it makes complex demands on the Client
and MARS interactions. One issue is the choice of metric for matching
MARS Clients to particular Active Sub-MARSs. Ideally this should be based
on topological location of the MARS Clients. However, this implies
that any given Active Sub-MARS has the ability to deduce the ATM
topology between a given MARS Client and the other members of the
Active Sub-MARS Group. Unless MARS entities are restricted to running
on switch control processors, this may not be possible.
4.2 What level of Fault Tolerance?
Providing Load Sharing does not necessarily encompass Fault Tolerance
as described in section 3. A number of different service levels are
possible:
At the simplest end there are no Backup Sub-MARS entities. Each
Active Sub-MARS looks after only one partition. If the Active
Sub-MARS fails then all the cluster members in the associated
partition have no MARS support until the Active Sub-MARS returns.
An alternative is to provide each Active Sub-MARS with one or more
Backup Sub-MARS entities. Cluster members switch to the Backup(s)
for their partition (previously advertised by their Active Sub-
MARS) if the Active Sub-MARS fails. If the Backups for the
partition all fail, the associated partition has no MARS support
until one of the Sub-MARS entities restarts. Backup Sub-MARS
entities serving one partition may not be dynamically re-assigned
to another partition.
A refinement on the preceding model would allow temporary re-
assignment of Backup Sub-MARS entities from one partition to
another.
The most complex model requires a set of MARS entities from which
a subset may at any one time be Active Sub-MARS entities
supporting the Cluster, while the remaining entities form a pool
of Backup Sub-MARS entities. The partitioning of the cluster
amongst the available Active Sub-MARS entities is dynamic. The
number of Active Sub-MARS entities may also vary with time,
implying that partitions may change in size and scope dynamically.
The following subsections touch on these different models.
4.3 Simple Load Sharing, no Fault Tolerance.
In the simplest model each partition has one Active Sub-MARS, there
are no backups, and no dynamic reconfiguration is available. Each
Active Sub-MARS supports any cluster member that chooses to register
with it.
Consider a cluster with 4 MARS Clients, and 2 Active Sub-MARSs. The
following picture shows one possible configuration, where the cluster
members are split evenly between the sub-MARSs:
C1 C2 C3 C4
| | | |
----- M1 ------ ----- M2 -----
| |
-----------------------
C1, C2, C3, and C4 all consider themselves to be members of the same
Cluster. M1 manages a sub-CCVC with {C1, C2} as leaf nodes -
Partition 1. M2 manages a sub-CCVC with {C3, C4} as leaf nodes -
Partition 2. M1 and M2 form the Active Sub-MARS Group, and exchange
cluster co-ordination information using SCSP.
When a MARS_JOIN/LEAVE event occurs in Partition 1, M1 uses SCSP to
indicate the group membership transition to M2, which fabricates an
equivalent MARS_JOIN/LEAVE message out to Partition 2. (For example,
when C1 issues a MARS_JOIN/LEAVE message it is propagated to {C1, C2}
via M1. M1 also indicates the group state change to M2, which sends a
matching MARS_JOIN/LEAVE to {C3, C4}.)
As discussed earlier in this section, MARS_REQUEST processing is
expedited if each Active Sub-MARS keeps a local copy of the group
membership database for the entire cluster. This is a reasonable
requirement, and imposes no additional demands on the data flow
between each Active Sub-MARS (since every MARS_JOIN/LEAVE event
results in SCSP updates to every other Active Sub-MARS).
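The propagation and local-database behaviour just described can be sketched in a few lines. This models message flow only, not SCSP's actual cache alignment protocol; all class, attribute, and group names are illustrative assumptions.

```python
class ActiveSubMars:
    def __init__(self, name):
        self.name = name
        self.membership = {}     # cluster-wide DB: group -> set of members
        self.peers = []          # the other Active Sub-MARSs (SCSP peers)
        self.sub_ccvc_log = []   # messages fabricated on the local sub-CCVC

    def handle_join(self, client, group):
        """A client in our own partition joins a group."""
        self._apply_join(client, group)
        for peer in self.peers:              # notify every peer via SCSP
            peer.scsp_update(client, group)

    def scsp_update(self, client, group):
        """A peer reported a JOIN elsewhere: update DB, replay locally."""
        self._apply_join(client, group)

    def _apply_join(self, client, group):
        self.membership.setdefault(group, set()).add(client)
        # fabricate an equivalent MARS_JOIN on our own sub-CCVC
        self.sub_ccvc_log.append(("MARS_JOIN", client, group))

    def mars_request(self, group):
        """Answer from locally held, cluster-wide information."""
        return sorted(self.membership.get(group, set()))
```

After M1 processes a JOIN from C1, M2 can answer a MARS_REQUEST for the same group from its own copy of the database, as the text requires.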
Cluster members registering with either M1 or M2 must receive a CMI
that is unique within the scope of the entire cluster. Since each
Active Sub-MARS is administratively configured, and no dynamic
partitioning is supported, two possibilities emerge:
Divide the CMI space into non-overlapping blocks, and assign each
block to a different Active Sub-MARS. The Active Sub-MARS then
assigns CMIs from its allocated CMI block.
Define a distributed CMI allocation mechanism for dynamic CMI
allocation amongst the Active Sub-MARS entities.
Since this scheme is fundamentally oriented towards fairly static
configurations, a dynamic CMI allocation scheme would appear to be
overkill. Network administrators should assign CMI blocks in roughly
the same proportion that they assign clients to each Active Sub-MARS
(to minimize the chances of an Active Sub-MARS running out of CMIs).
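The block-based option might look like the following sketch; the block boundaries and client names are invented for illustration.

```python
class CmiBlockAllocator:
    """Allocates Cluster Member IDs from one administratively assigned,
    non-overlapping block, so CMIs are unique cluster-wide without any
    run-time coordination between Active Sub-MARSs."""

    def __init__(self, low, high):
        self.free = list(range(low, high + 1))   # inclusive CMI block
        self.assigned = {}                        # client -> CMI

    def register(self, client):
        if client in self.assigned:               # re-registration keeps CMI
            return self.assigned[client]
        if not self.free:
            raise RuntimeError("CMI block exhausted; enlarge this block")
        cmi = self.free.pop(0)
        self.assigned[client] = cmi
        return cmi

    def deregister(self, client):
        self.free.append(self.assigned.pop(client))
```

Two Active Sub-MARSs given disjoint blocks (say 1-1024 and 1025-2048) can never hand out conflicting CMIs.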
The MARS_REDIRECT_MAP message from each Active Sub-MARS lists only
itself, since there are no backups. M1 lists {M1}, and M2 lists
{M2}.
4.4 Simple Load Sharing, intra-partition Fault Tolerance.
A better solution exists when each Active Sub-MARS has one or more
Backup Sub-MARS entities available. The diagram from section 4.3
might become:
C1 C2 C3 C4
| | | |
----- M1 ------ ----- M2 -----
/ \ / \
| -------------------- |
M3 M4
In this case M3 is a Backup for M1, and M4 is a Backup for M2. M3 is
never shared with M2, and M4 is never shared with M1.
This model is a union of section 3 and section 4.3, applying section
3's rules within the context of a Partition instead of the entire
Cluster. M1 is the Active MARS for the Partition, and M3 behaves as a
member of the Living Group for the Partition. The one key difference
is that the Active MARS for each Partition are also members of the
Active Sub-MARS Group for the Cluster, and share information using
SCSP as described in section 4.3. As a consequence, the election of a
partition's Backup MARS to Active MARS must also trigger election
into the Cluster's Active Sub-MARS Group.
Borrowing from section 3, each Active Sub-MARS transmits a
MARS_REDIRECT_MAP containing the Sub-MARS entities assigned to the
partition (whether Active, Backup, or dead). In this example M1
would list {M1, M3}, and M2 would list {M2, M4}.
If M1 failed, the procedures from section 3 would be applied within
the context of Partition 1 to elect M3 to Active Sub-MARS:
C1 C2 C3 C4
| | | |
----- M3 ------ ----- M2 -----
| | \
----------------------- |
M4
Clients in Partition 1 would now receive MARS_REDIRECT_MAPs from M3
listing {M3, M1}. Clients in Partition 2 would see no change.
If M1 recovers, an intra-partition re-election procedure may see M3
and M1 swap places or M3 remain as the Active Sub-MARS with M1 as a
Backup Sub-MARS. (Parameters affecting the election choice between M1
and M3 would now include the topological distance between M3 and the
partition's cluster members.)
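The per-partition election and MARS_REDIRECT_MAP ordering described above amount to rotating the partition's ordered Sub-MARS list. The sketch below models this bookkeeping only; election parameters such as topological distance are ignored, and the names are illustrative.

```python
class Partition:
    """Tracks the ordered Sub-MARS list one partition advertises."""

    def __init__(self, sub_marses):
        self.sub_marses = list(sub_marses)   # index 0 is the Active Sub-MARS

    @property
    def active(self):
        return self.sub_marses[0]

    def redirect_map(self):
        """Contents of this partition's MARS_REDIRECT_MAP, Active first."""
        return list(self.sub_marses)

    def fail_active(self):
        """The Active Sub-MARS dies: promote the first Backup, but keep
        the dead entity listed so clients can still find it later."""
        dead = self.sub_marses.pop(0)
        self.sub_marses.append(dead)
```

For section 4.4's Partition 1, a failure of M1 leaves M3 active and advertising {M3, M1}, matching the example above.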
The information shared between an Active Sub-MARS and its associated
Backup Sub-MARS(s) now also includes the CMI block that has been
assigned to the partition.
4.5 Dynamically configured Load sharing.
A completely general version of the model in section 4.4 would allow
the following additional freedoms:
The Active Sub-MARS Group can grow or shrink in number over time,
implying that partitions can have time-varying numbers of cluster
members.
Backup Sub-MARS entities may be elected at any time to support any
partition.
If such flexibility exists, each Active Sub-MARS can effectively
become each other's Backup Sub-MARS. Shifting clients from a failed
Active Sub-MARS to another Active Sub-MARS is partition
reconfiguration from the perspective of the Sub-MARSs, but is fault
tolerant MARS service from the perspective of the clients.
However, a number of problems must be solved before we can implement
completely general re-configuration of cluster partitions. The most
important one is how a single Active Sub-MARS can redirect a subset
of the MARS Clients attached to it, while retaining the rest.
For example, assume this initial configuration:
C1 C2 C3 C4
| | | |
----- M1 ------ ----- M2 -----
| |
-----------------------
M1 lists {M1, M2} in its MARS_REDIRECT_MAPs, and M2 lists {M2, M1}.
The cluster members neither know nor care that the Backup MARS listed
by their Active MARS is actually an Active MARS for another partition
of the Cluster.
If M1 fails, its partition of the cluster collapses. C1 and C2 re-
register with M2, and the picture becomes:
C1 C2 C3 C4
| | | |
--------------------------- M2 -----
All cluster members start receiving MARS_REDIRECT_MAPs from M2,
listing {M2, M1}. Unfortunately, we currently have no obvious
mechanism for re-partitioning the cluster once M1 has recovered. M2
needs some way of inducing C1 and C2 to perform a soft-redirect (or
hard, if appropriate) to M1, without losing C3 and C4.
One way of avoiding this scenario is to insist that the number of
partitions cannot change, even while Active Sub-MARSs fail.
Provision enough Active Sub-MARSs for the desired load sharing, and
then provide a pool of shared Backup Sub-MARSs. The starting
configuration might be redrawn as:
C1 C2 C3 C4
| | | |
----- M1 ------ ----- M2 -----
| |
-----------------------
| |
M3 M4
In this case M1 lists {M1, M3, M4} in its MARS_REDIRECT_MAPs, and M2
lists {M2, M3, M4}. If M1 fails, the MARS Group reconfigures to:
C1 C2 C3 C4
| | | |
----- M3 ------ ----- M2 -----
| |
-----------------------
|
M4
Now, if M3 stays up while M1 is recovering from its failure, there
will be a period within which M3 lists {M3, M4, M1} in its
MARS_REDIRECT_MAPs, and M2 lists {M2, M4, M1}. This implies that the
failure of M1, and the promotion of M3 into the Active Sub-MARS
Group, causes M2 to re-evaluate the list of available Backup Sub-
MARSs too.
When M1 is detected to be available again, it might be placed on the
list of Backup Sub-MARSs. The cluster would be configured as:
C1 C2 C3 C4
| | | |
----- M3 ------ ----- M2 -----
| |
-----------------------
| |
M1 M4
M3 lists {M3, M1, M4} in its MARS_REDIRECT_MAPs, and M2 lists {M2,
M4, M1}. (Unchanged from the MARS_REDIRECT_MAPs immediately after M1
died. As discussed in section 3, it is important to list all possible
MARS entities to assist clients in recovering from catastrophic MARS
failure.)
M1 may be re-elected as Active Sub-MARS for {C1, C2}, requiring M3 to
trigger a soft-redirect in MARS Clients back to M1. The Active Sub-
MARS Group must also be updated.
There are additional problems with sharing the Backup Sub-MARS
entities. If M1 and M2 failed simultaneously the cluster would
probably rebuild itself to look like:
C1 C2 C3 C4
| | | |
----- M3 ------ ----- M4 -----
| |
-----------------------
However, as described in section 3, transient failures of a Backup
Sub-MARS might cause M3 to be unavailable during the failure of M1.
This would lead to the topology we saw earlier:
C1 C2 C3 C4
| | | |
--------------------------- M4 -----
The two partitions have collapsed into one.
An obvious additional requirement is that M1 and M2 list opposite
sequences of Backup Sub-MARSs in their MARS_REDIRECT_MAPs. For
example, if M1 listed {M1, M3, M4} and M2 also listed {M2, M3, M4},
the cluster would look like this after a simultaneous failure of M1
and M2:
C1 C2 C3 C4
| | | |
--------------------------- M3 -----
|
M4
Again, the two partitions have collapsed into one.
A not entirely foolproof solution would be for the Active MARS to
issue specifically targeted MARS_REDIRECT_MAP messages on the pt-pt
VCs that each client has open to it. If C1 and C2 still had their
pt-pt VCs open (e.g. after re-registration), M3 could send them
private MARS_REDIRECT_MAPs listing {M4, M3}, forcing only C1 and C2
to re-direct.
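The targeted-redirect idea amounts to choosing, per client, which MARS_REDIRECT_MAP to send. A trivial sketch, with the client and MARS names taken from the example above:

```python
def targeted_redirects(clients, move_set, default_map, private_map):
    """Decide which MARS_REDIRECT_MAP each client should receive:
    clients being pushed out get the private map on their pt-pt VC,
    everyone else keeps seeing the normal map on the sub-CCVC."""
    return {c: (private_map if c in move_set else default_map)
            for c in clients}
```

Only C1 and C2 see a map that puts M4 first, so only they re-direct.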
Another possibility is for the remaining Active Sub-MARS entities to
split into multiple logical Active Sub-MARS entities, and manage each
partition separately (with a separate sub-CCVC for its members) until
one of the real Sub-MARS entities restarts. The secondary 'logical'
Active Sub-MARS could then redirect the partition back to the newly
restarted 'real' Active Sub-MARS.
Both of these approaches require further thought.
4.6 What about Multicast Servers ?
As noted in section 3, it is imperative that knowledge of MCS
supported groups is propagated to Backup MARS entities to minimize
transient changes to pt-mpt SVCs out of the clients during an Active
MARS failure. However, with a partitioned Cluster the issue becomes
more complex.
For an Active Sub-MARS to correctly filter the MARS_JOIN/LEAVE
messages it may transmit on its local sub-CCVC, it MUST know which
groups are, cluster wide, being supported by an MCS. Since the MCS in
question may have registered with only one Active Sub-MARS, the
Active Sub-MARS Group must exchange timely information on MCS
registrations and supported groups.
The propagation of MCS information must be carefully tracked at each
Active Sub-MARS, as it determines whether the local partition should
see a MARS_JOIN, MARS_LEAVE, or MARS_MIGRATE on the sub-CCVC (or
nothing at all). There may well be race conditions where one Active
Sub-MARS is processing a group MARS_JOIN, while simultaneously an MCS
is registering to support the same group with a different Active
Sub-MARS. The problem is not unsolvable, it just requires careful
design.
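The filtering decision can be summarised as a small rule set. The rules below paraphrase RFC 2022's MCS behaviour in simplified form and are illustrative only; the race conditions between simultaneous events are exactly what this sketch does not solve.

```python
def sub_ccvc_message(event, group, mcs_groups):
    """What (if anything) an Active Sub-MARS fabricates on its sub-CCVC
    for a membership event. event is 'JOIN' or 'LEAVE'; mcs_groups is
    the cluster-wide set of MCS-supported groups."""
    if group in mcs_groups:
        # Traffic for this group flows via the MCS, so ordinary
        # membership churn is not flooded to cluster members.
        return None
    return "MARS_" + event

def mcs_registered(group, mcs_groups):
    """An MCS has registered for the group somewhere in the cluster:
    every partition must see a MARS_MIGRATE."""
    mcs_groups.add(group)
    return "MARS_MIGRATE"
```

The cluster-wide mcs_groups set is precisely the information the Active Sub-MARS Group must keep synchronised.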
Finally, all preceding discussions in section 4 on partitioning of
ClusterControlVC also apply to ServerControlVC (SCVC). An MCS may
attach to any Active Sub-MARS, which then must originate a sub-SCVC.
It is not yet clear how a distributed Active Sub-MARS Group would
interact with a distributed MCS supporting the same multicast group.
5. Conclusion
For the purely fault-tolerant model (section 3), the requirements
are:
Active MARS election from amongst the Living Group must be
possible whenever a MARS Group member dies or restarts (section
3.2 and 3.4).
When a new Active MARS is elected, and there already exists an
operational Active MARS, complete database synchronisation between
the two is required before a soft-redirect is initiated by the
current Active MARS (section 3.4)
The Living Group's members must have an up to date map of the CMI
allocation (section 3.3).
The Living Group's members must have an up to date map of the MCS
supported groups (section 3.6).
The entire MARS Group is transmitted in MARS_REDIRECT_MAPs. The
only change that occurs as entities die or Living Group elections
occur is to the order in which MARS addresses are listed.
No special additions are required to handle client requests (e.g.
MARS_REQUEST or MARS_GROUPLIST_QUERY), since there is only a single
Active MARS.
For the load sharing models (section 4), the problems described in
section 4.5 make the fully dynamic partition model very unattractive.
The fixed load sharing approaches in sections 4.3 and 4.4 admit a
significantly simpler solution, while providing a valuable service.
Active Sub-MARSs track the cluster wide group membership for all
groups so they can answer MARS_REQUESTs from locally held
information.
CMI mappings to actual cluster members need to be propagated
amongst Active and Backup Sub-MARSs. In addition, the CMI space
needs to be split into non-overlapping blocks so that each Active
Sub-MARS can allocate CMIs that are unique cluster-wide.
To ensure each Active Sub-MARS can filter the JOIN/LEAVE traffic
it propagates on its Sub-CCVC, information on what groups are MCS
supported MUST be distributed around the Active Sub-MARS Group,
not just between Active Sub-MARSs and their Backups.
Security Considerations
This document is Informational, and does not specify any new protocol
or extensions to the MARS (RFC 2022) protocol. As such, security
issues are not specifically addressed. The security impact of
specific modifications to the RFC 2022 specification for distributed
MARS support will be described in the appropriate future documents.
Acknowledgments
Jim Rubas and Anthony Gallo of IBM helped clarify some points in the
initial release. Rob Coulton and Carl Marcinik of FORE Systems
engaged in helpful discussions after the June 1996 IETF presentation.
Author's Address
Grenville Armitage
Bell Laboratories, Lucent Technologies.
101 Crawfords Corner Rd,
Holmdel, NJ, 07733
USA
Email: gja@lucent.com
References
[1] J. Luciani, G. Armitage, J. Halpern, "Server Cache
Synchronization Protocol (SCSP) - NBMA", INTERNET DRAFT, draft-ietf-
ion-scsp-01.txt, March 1997.
[2] J. Luciani, et al, "NBMA Next Hop Resolution Protocol (NHRP)",
INTERNET DRAFT, draft-ietf-rolc-nhrp-11.txt, February 1997.
[3] G. Armitage, "Support for Multicast over UNI 3.0/3.1 based ATM
Networks.", RFC 2022, Bellcore, November 1996.
[4] J. Luciani, A. Gallo, "A Distributed MARS Service Using SCSP",
INTERNET DRAFT, draft-ietf-ion-scsp-mars-00.txt, July 1997
[5] G. Armitage, "Issues affecting MARS Cluster Size", RFC 2121,
Bellcore, March 1997.