Network Working Group R. Whittle
Internet-Draft First Principles
Intended status: Experimental January 18, 2010
Expires: July 22, 2010
Fast Payload Replication mapping distribution for Ivip
draft-whittle-ivip-fpr-00.txt
Abstract
Fast Payload Replication (FPR) is a technique for fanning out the
payloads of individual packets to large numbers of recipients. By
trading off efficiency for robustness, the system can be made highly
tolerant of random packet loss or loss of connection from some
upstream Replicators. FPR is simpler and less efficient than
Reliable Multicast or Secure Multicast, but can operate on a global
scale over the DFZ. It is a host-to-host arrangement and is
independent of routers and network topology. Packets are DTLS
encrypted so spoofed packets cannot enter the Replicator system.
Since it is not completely robust against packet or link loss, or
secure against an attack which compromises a Replicator, the basic
FPR should be supplemented with Missing Payload Servers and end-to-
end authentication of received data in order to make an entirely
robust and secure system. FPR is being developed as part of a global
fast-push mapping distribution system for the Ivip core-edge
separation scalable routing architecture. It should be able to fan
out information to hundreds of thousands of recipients, worldwide, in
less than a second. FPR may have other applications.
Status of this Memo
This Internet-Draft is submitted to IETF in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.
Whittle Expires July 22, 2010 [Page 1]
Internet-Draft Fast Payload Replication January 2010
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
This Internet-Draft will expire on July 22, 2010.
Copyright Notice
Copyright (c) 2010 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the BSD License.
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4
2. Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1. Potentially high data volume . . . . . . . . . . . . . . . 13
2.2. Independent of routers and network structure . . . . . . . 13
2.3. Simple UDP (DTLS) only operation . . . . . . . . . . . . . 13
2.4. Good but not perfect robustness . . . . . . . . . . . . . 14
2.5. Flexible trade-off of efficiency for robustness . . . . . 14
2.6. Robustness against DoS may be achieved with private
network links . . . . . . . . . . . . . . . . . . . . . . 15
3. Non-goals . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1. Not intended to provide end-to-end security . . . . . . . 16
3.2. No autodiscovery or monitoring . . . . . . . . . . . . . . 16
3.3. No attempt to automatically adapt to varying PMTUs . . . . 16
3.4. Not intended for mass market consumer applications . . . . 17
4. Streams of packets for Replicators and QSDs . . . . . . . . . 19
5. Packet payloads and identification . . . . . . . . . . . . . . 23
6. The Fresh vs. Repeat Algorithm . . . . . . . . . . . . . . . . 26
7. RUAS functionality . . . . . . . . . . . . . . . . . . . . . . 28
8. Replicator Functionality . . . . . . . . . . . . . . . . . . . 29
9. QSD Functionality . . . . . . . . . . . . . . . . . . . . . . 31
10. Further elaborations . . . . . . . . . . . . . . . . . . . . . 33
10.1. Missing Payload Servers (MSPs) . . . . . . . . . . . . . . 33
10.2. Delaying the output of Replicators . . . . . . . . . . . . 35
10.3. Private network links to avoid DoS attacks . . . . . . . . 36
11. Security Considerations . . . . . . . . . . . . . . . . . . . 39
12. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 40
13. Informative References . . . . . . . . . . . . . . . . . . . . 41
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 42
1. Introduction
This is a fresh document written quickly in an effort to support the
RRG debate - so please excuse the lack of sub-headings and any rough
spots.
This ID explores a combination of techniques which as far as I know
is novel. It may be useful for various other applications than the
one it was developed for: the central part of a fast-push mapping
distribution system, on a global level, for the Ivip core-edge
separation architecture. [I-D.whittle-ivip-arch]
[I-D.whittle-ivip-db-fast-push] Ivip is intended to solve the routing
scaling problem for both IPv4 and IPv6 - whilst also being a good
basis for the TTR Mobility architecture. A system such as Ivip could
also support TTR mobility, irrespective of its support for routing
scalability. [TTR Mobility]
FPR stands for "Fast Payload Replication", although "Directed
Flooding Packet Payload Replication" would also be appropriate. (It
seems that the acronym "FPR" is only used in the IETF for "Frame
Policing Ratio" in [RFC3133].)
FPR is intended to be suitable for a business environment where
multiple companies combine to create a shared infrastructure, with no
single point of failure.
The units of FPR data replication are payloads contained within
individual UDP packets. The simplest way to implement FPR will
probably be to use DTLS [RFC4347] encryption and authentication
between the sources of packets, the Replicators, and the destination
devices which use these payloads. It would also be possible to use
IPsec Authentication Header or a specifically written authentication
arrangement to protect these devices from accepting spoofed packets.
For simplicity, the use of DTLS is assumed in the following
discussion.
While the following discussion of FPR does address other possible
uses, the focus is on FPR as a component of the Ivip fast-push
mapping distribution system. In this application, the final
destinations of the payloads are full mapping database query servers,
known as QSDs (Query Server with a full Database) and the sources of
the packets are multiple devices known as RUASes (Root Update
Authorization Servers). For simplicity, IPv4 is assumed in the
examples, but FPR is equally suitable for use with IPv6.
In a later section I discuss another network element - a Missing
Payload Server (MPS). This would extend the basic FPR system of
Replicators to provide the QSDs with a distributed set of servers
from which to obtain payloads which did not arrive in packets from
the QSD's multiple upstream Replicators.
[RUAS-1] One of 20 or so RUASes, each of which drives its packets
| | | to all three level 0 Replicators. Each level 0
| | V Replicator receives a complete set of data to be sent
| V | to hundreds of thousands of QSDs.
V | |
| | \--------->---------\ Three fully meshed level 0
| | \ Replicators drive each other
| \--->---\ \ and send streams to 20 level
| \ \ 1 Replicators. Each stream
| /------------<-->--------\ \ contains packets with
| / \ \ \ payloads from all RUASes.
[R0-0 ]--<-->--[R0-1 ]--<-->--[R0-2 ]
//|||\\ //|||\\ //|||\\
| / | \ /-<--/ | 30 level 1 Replicators each
| / | \/ V receive two streams of
| /--<--/ | /\-->--\ | packets from an upstream
V / V / \ | level 0 Replicator.
| / | / \ |
[R1-00] [R1-01] [R1-02] Each drives 20 streams to
//|||\\ //|||\\ //|||\\ one of 300 level 2
| \/---<--/ \ Replicators.
| /\ \ | | / | /
| /---<-----/--------------<---------------------[R1-29]
V / / \ \ | | /
| / / /-------------<------------[R1-12]
| / | / \ \ |
| / V / \->-\ /--<---[R1-07]
| / | / \ /
[R2-000] [R2-001] [R2-002] Each level 2 Replicator
//|||\\ //|||\\ //|||\\ drives 20 streams to a
level 3 Replicator.
etc. etc. etc.
3,000 level 3 Replicators.
etc. etc. etc.
30,000 level 4 Replicators:
\ | | /
[R4-10472] [R4-27610] 300,000 QSDs each receive
//||||\\ //||||\\ two streams from level 4
\ / Replicators.
[QSD]
Figure 1: Five levels of Replicators drive hundreds of thousands of
QSDs.
Figure 1 depicts a system with 3 level 0 Replicators, 30 level 1
Replicators, 300 level 2 Replicators etc. MPSes (Missing Payload
Servers) are not shown. The Ivip system would probably involve 5 to
8 level 0 Replicators, but 3 makes for a clearer diagram.
The amplification factor is how many output streams a Replicator
sends divided by how many it receives. This will vary depending on
local choices, but I have shown all Replicators producing 20 streams
and all level 1 and greater Replicators consuming two.
The amplification factor per level may be higher than this so fewer
levels may be required to drive the required numbers of QSDs. In the
long-term future, it is possible that hundreds of thousands of QSDs
in ISP and larger end-user networks will receive mapping updates, in
order to serve the ITRs in those networks. This ID explores the
design of such a large-scale system. When Ivip is introduced, the
whole system would be much simpler, such as with 3 Level 0
Replicators, higher amplification factors due to the initially low
rate of updates and fewer levels due to this and the initially lower
number of QSDs to be driven.
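The level sizes in Figure 1 follow directly from the amplification
factor. The following is a minimal sketch of that arithmetic, assuming
every device sends 20 streams and every device below level 0 receives
two, as drawn in the figure. The function and parameter names are
illustrative only and are not part of FPR.

```python
def level_sizes(level0: int, outputs: int, inputs: int, levels: int) -> list[int]:
    """Return the number of devices at each level, top to bottom.

    Every device sends `outputs` streams and every device below
    level 0 receives `inputs` streams, so each level is
    outputs / inputs times larger than the level above it.
    """
    sizes = [level0]
    for _ in range(levels):
        streams = sizes[-1] * outputs    # streams sent down by this level
        sizes.append(streams // inputs)  # devices those streams can feed
    return sizes


# Figure 1: 3 level 0 Replicators, then levels 1 to 4, then the QSDs.
sizes = level_sizes(level0=3, outputs=20, inputs=2, levels=5)
for level, count in enumerate(sizes[:-1]):
    print(f"level {level}: {count} Replicators")
print(f"QSDs: {sizes[-1]}")
```

With a higher amplification factor per level, the same function shows
how many fewer levels would be needed to reach a given number of QSDs.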
The principle is that the level 0 Replicators are fully meshed and
(except for any lost packets, dead links or failure in one of these
Replicators) receive the full set of packets to be sent out to all
level 1 Replicators.
If a level 1 Replicator has a packet missing from one of its upstream
level 0 Replicators, then it will usually be able to obtain the same
payload from the equivalent packet from its other upstream level 0
Replicator.
A level 1 Replicator which is missing a packet from both its sources
will not be able to send this packet's payload to its 20 downstream
devices. However, depending on how the cross-linking is structured,
generally those Replicators will obtain the payload that packet
carried from the equivalent packet from their other source.
This principle continues to the end - where the QSD is generally able
to cope with a missing packet by using the payload of the equivalent
packet from the other upstream level 4 Replicator.
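The redundancy principle above can be illustrated with a small sketch
in which a device passes on the first copy of each payload it sees and
drops later copies. This is not the Fresh vs. Repeat Algorithm
described later in this document. It is only an illustration, and the
class name, the (source, payload_id) key and the unbounded set of seen
identifiers are all assumptions made for the example.

```python
class PayloadDeduplicator:
    """Pass on the first copy of each payload; drop later copies."""

    def __init__(self) -> None:
        # Identifiers already seen. A real device would need to bound
        # this; the sketch lets it grow without limit.
        self.seen: set[tuple[str, int]] = set()

    def accept(self, source: str, payload_id: int) -> bool:
        """Return True if this is the first copy of the payload."""
        key = (source, payload_id)
        if key in self.seen:
            return False
        self.seen.add(key)
        return True


# Two upstream feeds carrying the same payloads, each with one loss.
feed_a = [("RUAS-1", 1), ("RUAS-1", 3), ("RUAS-1", 4)]  # payload 2 lost
feed_b = [("RUAS-1", 1), ("RUAS-1", 2), ("RUAS-1", 4)]  # payload 3 lost

dedup = PayloadDeduplicator()
forwarded = [pid for src, pid in feed_a + feed_b if dedup.accept(src, pid)]
print(sorted(forwarded))
```

Although each feed lost a packet, every payload is passed on once,
because the two losses did not affect the same payload.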
The maximum length of the packets needs to be chosen so as not to
violate any Path MTU from one Replicator to the next, and to the
final recipient devices.
The data to be carried needs to be split into individual blocks, one
for each payload, of around 1300 bytes. This suits Ivip reasonably
well, since an RUAS will frequently have no updates to send in a
given period of time, such as 0.3 seconds, and will sometimes have
enough updates to fill several payloads.
Consequently, the RUAS only sends a packet when it needs to, and the
length and number of the packets reflect the amount of data to be
sent.
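As a minimal sketch of this splitting, assuming a fixed limit of 1300
bytes and a simple byte-wise split (a real RUAS would presumably split
on update boundaries, which this example ignores):

```python
MAX_PAYLOAD_BYTES = 1300  # "around 1300 bytes"; the exact value is an assumption


def split_into_payloads(data: bytes, max_len: int = MAX_PAYLOAD_BYTES) -> list[bytes]:
    """Split pending data into blocks of at most max_len bytes.

    Returns an empty list when there is nothing to send, so no
    packet is generated in an idle period.
    """
    return [data[i:i + max_len] for i in range(0, len(data), max_len)]


idle = split_into_payloads(b"")
busy = split_into_payloads(bytes(3000))
print(len(idle), [len(block) for block in busy])
```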
Ivip mapping updates apply to a particular MAB (Mapped Address Block):
a DFZ-advertised prefix which encompasses a block of SPI (Scalable
Provider Independent) address space. Each MAB is split into many
(potentially hundreds of thousands of) arbitrary length ranges of
address space called "micronets". Each micronet is mapped to a
single ETR (Egress Tunnel Router) address. The most common mapping
update is to change the ETR address of an existing micronet. Other
updates join and split micronets, or announce that the RUAS which
controls this MAB has made a snapshot of the full state of the MAB's
mapping - which a QSD can download for the purpose of initialising
its copy of the mapping database, or to overcome any errors which
have somehow accumulated in the section of it which concerns this
MAB.
Within a short period of time, such as 100ms, there may be a series
of mapping updates concerning a particular MAB. All mapping updates
for a given MAB always arrive in payloads from one RUAS. If a
complete series arrives in a single payload, then this is relatively
simple. If this payload is missing from the streams of packets which
arrive from the QSD's two or more upstream Replicators, then the QSD
will obtain the missing payload within a few seconds from an MPS.
Then this series of changes will be applied to the MAB, and the
slight delay will be of no consequence.
If the series of changes spans two or more payloads, and one or more
of these is missing, then the situation is more complex. The QSD
generally can't apply any changes to the MAB until it has the full
series which was transmitted effectively at the same time. This is
because the series may involve zeroing the mapping of micronets,
splitting and joining them and then setting the mapping of the
resulting micronet to a new ETR address. The series of changes is
best applied all at once, so the QSD needs to buffer this MAB's
changes and only apply them once it has the missing payloads.
In either case, where updates to the mapping of one MAB are delayed,
the QSD must buffer subsequently received updates and apply them in
order. This is for the same reason: that some updates alter the
structure of the micronets and so cannot be applied out of order.
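The buffering described above might be sketched as follows. The
per-MAB sequence number is a hypothetical device for this example,
since the actual payload identification is described in a later
section. The sketch covers only in-order application per MAB. It does
not model applying a multi-payload series all at once.

```python
class MabUpdateBuffer:
    """Hold back updates for a MAB until all earlier ones have arrived."""

    def __init__(self) -> None:
        self.next_seq: dict[str, int] = {}             # next expected, per MAB
        self.pending: dict[str, dict[int, str]] = {}   # held-back updates, per MAB
        self.applied: list[tuple[str, int, str]] = []  # stands in for the database

    def receive(self, mab: str, seq: int, update: str) -> None:
        expected = self.next_seq.setdefault(mab, 0)
        if seq < expected:
            return  # copy of an update which was already applied
        held = self.pending.setdefault(mab, {})
        held[seq] = update
        # Apply every update now contiguous with what has been applied.
        while self.next_seq[mab] in held:
            current = self.next_seq[mab]
            self.applied.append((mab, current, held.pop(current)))
            self.next_seq[mab] = current + 1


buf = MabUpdateBuffer()
buf.receive("MAB-A", 0, "split micronet")
buf.receive("MAB-A", 2, "set ETR")         # held: update 1 is missing
buf.receive("MAB-B", 0, "set ETR")         # other MABs are not delayed
buf.receive("MAB-A", 1, "join micronets")  # e.g. recovered from an MPS
print(buf.applied)
```

Only the MAB with the missing payload is delayed. Updates for other
MABs continue to be applied as they arrive.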
This illustrates that the Ivip system will work fine with all or most
updates being applied within a second or so of them being sent by the
RUAS, but that in the small number of cases where there is a delay
of a few seconds due to a missing payload, this is of no serious
negative consequence. No end-user network absolutely relies on ITRs
changing their tunneling behavior within a second or two of a mapping
update being sent. It suffices that most will do this, and on rare
occasions some ITRs will change a few seconds later.
Even if all changes to ITR tunneling behaviour were suspended for a
minute or two, or some ITRs lagged behind the rest by a few minutes,
no serious harm would occur. The worst outcome would be a consequent
delay in multihoming service restoration (which for a minute or two
is highly undesirable, but not disastrous) and likewise delays in
traffic engineering changes or packets to mobile devices being sent
to a new TTR, rather than the old one. In the mobile case, this
simply means the mobile node needs to maintain its tunnel to the old
TTR for a few minutes longer than it otherwise would. (Such changes
in mapping only occur if the mobile node moves 1000km or so - not
every time it gains a new access network or IP address.)
The foregoing discussion illustrates that it is good enough for FPR
to generally be very fast and robust, but for it to sometimes involve
delays of a few seconds, or more rarely, a few minutes.
The total data-rate of the "complete stream" of packets handled by
the FPR system needs to be carefully bounded in order to ensure that
no device in the system will be overloaded. In Ivip, there
will be multiple, asynchronous, independent sources of packets which
drive the input to the "level 0 Replicator" part of the
system. In this ID, these are assumed to be a set of RUASes (Root
Update Authorization Servers). The exact number of these is not
particularly important, but in practice there might be a few dozen.
These would need to coordinate their packet rate to ensure that at no
instant was the FPR expected to carry more than some data rate of
packets.
In the fully designed Fast-Push Mapping System, there may be an
additional source of packets which is not an RUAS. Nonetheless, the
discussion below anticipating 20 or so RUASes is
sufficient to explore the current FPR design.
FPR is secure against an attacker sending spoofed packets, since
these will not pass the DTLS software which accepts packets into each
Replicator and QSD.
FPR is not secure against an attacker gaining control of one or more
Replicators. In order to achieve end-to-end integrity, the final
recipient device (in Ivip, a QSD) will need to be able to
authenticate each payload's worth of mapping data, or larger bodies of
mapping data assembled from multiple payloads. This will probably be
via a public key signature of that payload or larger body of data,
which is included in the payload data stream - using the public key
of the RUAS which sent the payloads.
Likewise, in order to be able to ensure confidentiality against an
attacker who can snoop packets being sent by Replicators, the
application data must be protected by encryption. In the present
design, DTLS both encrypts and authenticates the payload of each
packet. Confidentiality is not required for Ivip. (I have not yet
determined if there is a DTLS mode which supplies only
authentication.)
Since FPR is not absolutely robust against link loss or random packet
loss, a complete system which delivers data entirely robustly must
supplement FPR with some method by which the recipient can request
missing packets. This could be external to the FPR system, but in a
later section I explore the possibility of integrating Missing
Payload Servers into the FPR system of Replicators. Forward Error
Correction is another method of coping with a certain level of lost
packets, but this involves considerable complexity and overhead. It
also involves handling data in long block lengths which are not
suitable for Ivip.
FPR's use in Ivip is a critical part of the core-edge separation
system. The FPR part of the mapping distribution system is a single
global-scale system and it is intended to run reliably, continually,
delivering data to potentially hundreds of thousands of QSDs all over
the world. A halt in its operation for a few seconds or minutes
would not be disastrous, but would delay multihoming service
restoration, mapping changes for inbound TE, and the ability of
mobile nodes to choose a closer TTR. The Ivip system won't fail if
the FPR system stops for minutes or even tens of minutes.
Nonetheless, the FPR system is intended to operate continually,
indefinitely, despite its individual component Replicators being
taken in and out of service and the connections between them being
changed from time to time.
FPR is most suited to an application which requires very fast
replication of information, including perhaps on a global scale,
where it is important that the data generally arrive quickly, but
that occasional lost packets and consequent delays obtaining
replacements will not be a problem.
FPR is much less efficient than ordinary multicast, in which a
single stream is replicated into multiple streams at one or more
points in the distribution system. Efficiency is traded off directly
for greater robustness against packet loss.
FPR does not rely on conventional multicast protocols or router
capabilities. It should be possible to implement an FPR Replicator
as a user space daemon on any server.
There is nothing particularly surprising about the outcomes of the
FPR arrangement. However, this combination of capabilities is a
crucial component of Ivip. FPR's capability to convey "mapping"
information in essentially real-time from end-user networks (or
entities they authorise to control their mapping) to hundreds of
thousands of QSDs in ISP and large end-user networks all over the
world enables Ivip to achieve at least two major benefits compared to
other core-edge separation systems, most prominently LISP
[I-D.ietf-lisp].
Firstly, there is no need for ITRs (Ingress Tunnel Routers) to have
to choose between multiple ETRs (Egress Tunnel Routers) - since the
end-user network is able to control the ITR tunneling behavior in
real-time. (ITRs receive mapping from QSDs in response to queries,
and QSDs send updates to ITRs if they receive changed mapping via the
FPR system.)
Secondly, this modularly separates control of ITR behavior from the
core-edge separation scheme itself, enabling the end-user network to
control the ITRs for whatever purposes they desire, and with whatever
techniques and information they employ.
Without a real-time global mapping distribution system, the other
core-edge separation architectures to date cannot control ITRs
directly, and so must build all the system's reachability testing and
decision-making capabilities into each ITR and give end-users
control via more complex mapping which includes multiple ETR
addresses.
This discussion of core-edge separation architectures is of no direct
relevance to FPR as a subsystem. However, it illustrates that FPR's
particular capabilities are crucial to being able to make some
attractive architectural choices in a core-edge separation scheme.
Reliable Multicast [RFC2887] would not be as suitable, since it
involves a single stream of packets, whereas in Ivip, FPR will fan
out multiple independent streams, one from each of 20 or so RUASes.
Reliable Multicast involves long blocks of data for its Forward Error
Correction arrangement, which would introduce delays in the sending
and receiving of application data. In Ivip, this would delay and
very much complicate the reception of data, particularly when the
data rate from each RUAS is low.
Neither Reliable Multicast nor Secure Multicast [RFC3740] is robust
against lost packets and dead links, while FPR can be used in a way
which gives a much higher degree of robustness against these.
As far as I know, most multicast protocols assume the use of routers
at specific parts of the network. FPR is intended to operate without
reference to routers or the address structures inherent in networks.
FPR Replicators can be implemented in servers on arbitrary stable
global unicast addresses. The structure of the links between
Replicators has no reliance on network topology and can be
arbitrarily chosen. FPR is intended to work reliably with links
across the DFZ and so be able to scale well to a global distribution
of recipient devices.
In summary, FPR is relatively simple and may complement established
multicast protocols rather than exceed their performance in the
applications they are best suited to. At least with Ivip, FPR's
apparently unique capabilities will enable a larger system to be
designed in ways which would not be possible - or at least not as
easy - with existing techniques.
I intend to write the requisite software for FPR - code for a
Replicator - later in 2010.
2. Goals
2.1. Potentially high data volume
FPR should be able to handle relatively high data volumes.
The limiting factor with DTLS is likely to be the ability of
Replicator software to send the output streams each with its own DTLS
protection. If a customised authentication arrangement was used
instead, then each Replicator could send essentially identical
packets to all its downstream devices, saving on the separate
cryptographic processing of each stream which would be inherent in
DTLS or IPsec.
At present, I am unsure of the efficiency of using DTLS to produce
large numbers of output streams of packets, such as 20 or 50. With
4-core 64-bit CPUs clocked at close to 3GHz, I would not be surprised
if a modern COTS (Commercial Off The Shelf) server could fill a
gigabit Ethernet link. However, this remains to be determined. The
average data rates per stream with Ivip would be fractions of a
megabit per second, even with the largest imaginable deployment.
2.2. Independent of routers and network structure
While FPR could be implemented in routers, it is intended to be
implemented as software in a server. The use of DTLS means that a
user-space daemon with inbuilt DTLS capabilities can be operated on any
server, since there are no special demands on the operating system.
In a global system, Replicators can be on any stable global unicast
address. In a private network the addresses need only be stable. In
all cases, the passage of packets between Replicators is controlled
directly and in no way depends on the topology or addressing
structure of the network. This makes FPR suitable for a global
packet replication system with links across the DFZ.
2.3. Simple UDP (DTLS) only operation
Since DTLS sessions are set up via the same UDP ports which are used
for data transfer, the entire Replicator could use a single UDP port.
This should facilitate recipients being behind NAT, since the
recipient device makes the DTLS link to its upstream Replicators.
Replicators themselves cannot be behind NAT, since the DTLS session
could not be established to them. Replicators, at least for Ivip,
are generally meant to be at well-connected data centers where the
multiple links to other data centers can be used to ensure physical
diversity of the streams being sent to any one Replicator.
2.4. Good but not perfect robustness
If a recipient receives streams of packets from two upstream
Replicators, and both of these feeds are disrupted in some way, then
the recipient
will not get some packet payloads. FPR has no NACK, but in a later
section I discuss a system of "Missing Payload Servers".
The aim is to make FPR delivery of packets for any recipient with
reasonably good network links (such as by the streams arriving via
two physical links from Replicators with different topological
locations) highly robust against individual packet losses or the
failure or unreachability of an upstream Replicator. The purpose is
to make missing payload recovery a rare enough event that the
occasional delays and extra traffic it involves are not significant
problems.
There can be no perfectly robust system, of course, in the event that
all links from outside sources are disrupted at the same time.
2.5. Flexible trade-off of efficiency for robustness
By choosing how many input streams each Replicator or recipient
device has, and by choosing these to arrive from Replicators near and
far (geographically and topologically) it should be possible to
achieve a wide range of compromises between efficiency and
robustness.
These choices can be made at a local level, for each particular
Replicator or recipient. While the discussion below generally
assumes each will receive two feeds, it will be possible to configure
them to receive more than this number of feeds.
For instance, a Replicator or recipient could be given five
feeds, each arriving over a different physical link, each from a
Replicator whose location in the network is topologically different
from the others. Each such upstream Replicator should, ideally, have
feeds from other upstream Replicators which are at least partially
diverse with respect to each other. Then, the ability of the recipient to
receive all packet payloads could be extremely robust against random
packet losses and against outages in routers, data-links and other
Replicators - at the cost of requiring outputs from more upstream
Replicators and paying for the bandwidth of their multiple incoming
streams.
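The trade-off can be put in rough numbers. The sketch below assumes
that each feed loses any given packet independently with the same
probability, which real feeds sharing links or upstream Replicators
will not fully satisfy. The 1% loss figure and the 200 kbps stream
rate are purely illustrative.

```python
def miss_probability(per_feed_loss: float, feeds: int) -> float:
    """Chance that every copy of one payload is lost.

    Assumes losses on the feeds are independent of each other.
    """
    return per_feed_loss ** feeds


def incoming_kbps(stream_kbps: float, feeds: int) -> float:
    """Bandwidth consumed by receiving the complete stream on each feed."""
    return stream_kbps * feeds


for feeds in (1, 2, 3, 5):
    p = miss_probability(0.01, feeds)
    cost = incoming_kbps(200, feeds)
    print(f"{feeds} feeds: P(miss) = {p:.0e}, incoming = {cost:.0f} kbps")
```

Under these assumptions, each extra feed costs one more copy of the
stream and multiplies the chance of missing a payload by the per-feed
loss rate.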
2.6. Robustness against DoS may be achieved with private network links
No device on the open Internet can be reliably protected against a
flood of packets generated by botnets. In order to minimise the
damage such an attack could have on an FPR system, the higher level
Replicators (closer to level 0, and including level 0) of the
inverted tree structure would need to be linked by private network
links.
At some point in the Replication hierarchy, where Replicators are
sufficiently numerous, the links to the next level (numerically
higher, but lower in the inverted tree) could be carried by the
public Internet. A DDoS attack with a given bandwidth capacity would
only be able to affect a subset of the Replicators at that level,
depending on how many there are at that level and whether their input
capacity was 100Mbps or 1Gbps. Depending on all the factors, it
would be possible to ensure that even the largest botnet DoS attacks
have little impact on the delivery of data to recipients, if there are
one or more levels of Replicators below this.
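A rough upper bound on how many Replicators at one level an attack
could saturate is the attack bandwidth divided by the input capacity
of each Replicator. The sketch below assumes the attacker spreads
traffic so as to just fill each targeted input link. The 100Gbps
attack size is an illustrative figure, not a measurement.

```python
def max_saturated(attack_bps: int, replicators: int, input_capacity_bps: int) -> int:
    """Upper bound on Replicators whose input links the attack can fill."""
    return min(replicators, attack_bps // input_capacity_bps)


ATTACK_BPS = 100_000_000_000  # assumed 100Gbps attack, for illustration only

for count, capacity in ((30, 1_000_000_000),
                        (3000, 1_000_000_000),
                        (3000, 100_000_000)):
    hit = max_saturated(ATTACK_BPS, count, capacity)
    print(f"{count} Replicators at {capacity // 1_000_000}Mbps: "
          f"at most {hit} saturated ({100 * hit / count:.1f}%)")
```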
Private network links between Replicators are a significant
expense, but will probably be justified for Ivip in order to ensure
this critical piece of Internet infrastructure can only be partly
affected by the largest DoS attacks.
3. Non-goals
3.1. Not intended to provide end-to-end security
While Replicators only receive feeds from upstream Replicators they
are configured to use, and which accept their credentials, and while
DTLS protects the payloads of packets between the Replicators and
from the Replicators to the recipient devices, FPR does not provide
end-to-end security against either alteration of the data or snooping
of its contents.
This is because the recipient has no way of knowing that all the
upstream Replicators it relies upon are not under the control of an
attacker. A single such compromised Replicator could drive packets
to most or all of its downstream Replicators by sending out packets
with the identification numbers expected from the genuine source a
little earlier than the genuine packets.
The use of DTLS to protect packets sent from Replicators to other
Replicators and to recipient devices is intended primarily to prevent
any of these accepting a spoofed packet generated by an attacker who
does not control any Replicators. This protects against attackers
injecting their own packets with bogus payloads.
3.2. No autodiscovery or monitoring
The current description is for the basic functions of Replicators and,
later, Missing Payload Servers. In some applications it may be
desirable for the Replicators to automatically choose their upstream
and downstream Replicators. In almost any practical system, some
kind of diagnostic functions would be needed in order to evaluate
performance and debug problems. Such capabilities are for future
work.
3.3. No attempt to automatically adapt to varying PMTUs
To be deployed across today's DFZ, all packets would need to be less
than 1500 bytes long. I will assume 1470 bytes, for convenience, as
a PMTU which can reasonably be expected in any DFZ path, because I
have observed Google servers sending unfragmentable packets of this
length. [DFZ-unfrag-1470]
The FPR system of Replicators has no PMTUD capabilities - and any
PMTU problem encountered by a packet will not result in an RFC 1191
ICMP "fragmentation needed" (Packet Too Big) message being sent beyond
the upstream Replicator which sent the packet. Replicators would
ignore such a message.
The Missing Payload Servers receive streams of packets just like
Whittle Expires July 22, 2010 [Page 16]
Internet-Draft Fast Payload Replication January 2010
Replicators and QSDs, so they need to be located where there are no
local PMTU restrictions which would prevent the reception of packets
of the chosen maximum length. Missing Payload Servers communicate
with each other, and handle requests from QSDs, via TCP - which does
not involve any special MTU constraints.
In Ivip, it is likely that some Replicators, Missing Payload Servers
and QSDs will be located in end-user networks which use SPI (Scalable
Provider Independent) addresses. Packets addressed to SPI addresses
will pass through an ITR and ETR. (A Replicator may include an
inbuilt ITR function so the packets it sends don't have to go to any
separate ITR.) If encapsulation is the method used for ITR to ETR
tunneling then for IPv4, this involves a 20 byte IP-in-IP header. So
the maximum length of a packet which could be handled by the FPR
system in this scenario - a UDP packet with DTLS header and payload -
is 1450 bytes.
In an ISP or end-user network today where gigabit Ethernet interfaces
are always used and where all MTUs support ~9kbyte jumbo-frames, it
would be possible to run an FPR network with ~9kbyte packets.
If a 1450 byte FPR system was successfully operating over the DFZ, at
some time in the future, when all DFZ paths and likewise paths
between all Replicators and recipients could support ~9kbyte packets,
there could be a transition to using these larger packets.
Replicators will handle ~9kbyte packets and in principle the same
Replicators could begin handling the larger packets without any need
for reconfiguring the entire system. If the numbering systems by
which the packet payloads are identified did not overlap, and if the
Replicators had the capacity, the same system of Replicators could
handle the 1450 byte packets and ~9kbyte packets simultaneously.
These larger packets would involve a different way of splitting up
the data to be transmitted. Recipient devices (QSDs) may have
software which copes automatically with different packet formats, but
a more likely scenario is that the switch to jumboframes in the
future would be accompanied by somewhat different ways of carrying
the data - and so by the need for updated recipient software.
3.4. Not intended for mass market consumer applications
At each point - a Replicator, Missing Payload Server or QSD -
redundancy is bought by increasing the incoming bandwidth, according
to how many upstream Replicators are used. This is expensive for
high data-rate applications and so FPR is not intended as a system
for delivering audio or video material to mass-market end-users.
It is intended for recipients in ISP networks where the two or more
feeds can be chosen to arrive via different physical links, different
peering points and different border routers - so the physical
diversity available in these settings can be directly employed to
provide increased robustness. Since FPR lacks PMTUD capability, it
is best used in scenarios where the location of Replicators and
recipients is stable and carefully planned, with regard to any PMTU
limitations which may affect them.
4. Streams of packets for Replicators and QSDs
All packets discussed below are those which pass the DTLS
authentication process, and are presented to the FPR code as DTLS
payloads, each consisting of an FPR header and FPR payload. (For
simplicity, much of this discussion assumes that these packets are
only received by Replicators and QSDs. However, Missing Payload
Servers will also receive streams of packets, in exactly the same
way.)
In all cases, the receiving device does not distinguish between
packets which arrive from one incoming stream and those which arrive
from another. This information is available from the DTLS software,
but is not important to how the device processes the payloads of each
incoming packet.
In this discussion the term QSD (Ivip full-database query server) is
used to denote the devices which receive the packets and put their
payloads to use, rather than sending the payloads to others, as
Replicators do. This helps explain the FPR system's role within
Ivip. If FPR was used for another purpose, the packets would be
received and used by some other device.
It would be possible for a single device to function as both a
Replicator and a QSD. This may make sense during initial Ivip
introduction. However, in a fully deployed Ivip system, with the QSD
handling many requests from ITRs (directly and via caching QSCs) and
with the QSD having a significant workload receiving the packets and
processing them to update its database, separate servers for the QSD
and Replicator functions would be the best approach.
The Replicator and QSD code would share some common elements - for
the reception and processing of incoming packets. Replicators and
QSDs are both required to receive multiple streams of packets. While
they may operate with a single stream, two would be a typical number
to receive and they may be required to receive many more. These
statements apply also to the code for the Missing Payload Server.
A QSD or Missing Payload Server only receives streams from upstream
Replicators. Two streams would be a typical number, but perhaps as
many as 5 could be used to maximise robustness.
A level 1 or greater Replicator receives typically two or more
streams from upstream Replicators in the numerically lower numbered
level - which is "above" (upstream) in the inverted tree structure.
Level 0 Replicators receive streams from one or potentially many
sources of packets. In Ivip, the sources are multiple RUASes. They
also receive a stream from every other level 0 Replicator.
The FPR system handles the sum of the unique payloads sent by all
RUASes. For instance, in a given time period such as 100ms, RUAS-0
sends streams of packets to all five level 0 Replicators, with each
stream containing 7 packets with a set of 7 unique DTLS payloads (FPR
headers and FPR payloads). While the packets received by one level 0
Replicator are all different from those received by another, due to
DTLS encryption, each level 0 Replicator receives 7 packets from
RUAS-0, and the DTLS payloads of the packets received by one level 0
Replicator are identical to the DTLS payloads received by each
other level 0 Replicator.
The purpose of RUAS-0 sending five streams containing the same DTLS
payloads, one stream to each level 0 Replicator, is to maximise the
fault tolerance of the system. If one or two level 0 Replicators are
down, or if they can't be reached from RUAS-0, then there will be no
loss of data being sent to the QSDs. Even if RUAS-0 was only able to
send its 7 packets to a single level 0 Replicator, or if a single set
of 7 was sent to various level 0 Replicators (such as packet 0 to
R0-0, packet 1 and 2 to R0-3 and the rest to R0-5) then the system
would still deliver all payloads to the QSDs. This is due to the
level 0 Replicators being "fully meshed". Every one has an output
stream to every other one. So as long as at least one packet with a
given payload arrives at any level 0 Replicator, within a fraction of
a second, all other level 0 Replicators will receive it as well.
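As an illustration only, the flooding behaviour of the fully meshed
level 0 Replicators can be sketched as a small simulation. The
function name, the use of integer indices for Replicators and the
particular scattering of packets are assumptions made for this sketch.
Each Replicator forwards a payload to its peers only the first time
it sees it, which is why the flooding terminates.

```python
from collections import deque


def flood_level0(n_replicators, arrivals):
    """Simulate flooding among fully meshed level 0 Replicators.

    arrivals maps a payload id to the set of Replicator indices which
    received that payload directly from the source. Returns a list
    holding, for each Replicator, the set of payload ids it ends up with.
    """
    seen = [set() for _ in range(n_replicators)]
    queue = deque((r, p) for p, direct in arrivals.items() for r in direct)
    while queue:
        replicator, payload = queue.popleft()
        if payload in seen[replicator]:
            continue  # Repeat: ignored and not forwarded again
        seen[replicator].add(payload)  # Fresh: keep and forward to peers
        for peer in range(n_replicators):
            if peer != replicator:
                queue.append((peer, payload))
    return seen


# Seven payloads scattered over three of five Replicators.
scattered = {0: {0}, 1: {3}, 2: {3}, 3: {4}, 4: {4}, 5: {4}, 6: {4}}
result = flood_level0(5, scattered)
```

In this sketch every one of the five Replicators ends up holding all
seven payloads, even though each payload reached only one of them
directly.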
In all cases, the receiving device (Replicator or QSD) establishes
the DTLS session with the source of the packets. To continue with
the five level example of Figure 1, QSDs establish their DTLS
sessions with level 4 Replicators. Typically two would be a good
choice, but more could be used. The QSD is configured to use
particular level 4 Replicators and the DTLS session can only be
established if each level 4 Replicator accepts the username and
password provided by the QSD.
Similarly, Replicators at levels 4, 3, 2 and 1 establish DTLS
sessions with Replicators at the level above (numerically one less).
Each level 0 Replicator establishes a DTLS session with each other
level 0 Replicator, and with each RUAS.
In this discussion, it is assumed that there is a strict layering of
Replicators. While level 0 is fully meshed, there is no meshing of
other levels - no level 3 Replicator receives a stream of packets
from any other level 3 Replicator. Also, no Replicator at levels 1
or greater is shown accepting a stream from Replicators at any level
other than the one above. The diagram shows the Replicator system
ending at level 4, and with the next level being composed entirely of
QSDs.
There is nothing to prevent a QSD being driven partly or wholly by
streams from Replicators in levels other than 4.
Nor is there anything to prevent a Replicator getting some of its
streams from levels other than the one above. For instance, it would
be possible, in principle, to cross-connect all level 3 Replicators.
However, due to their large number (3,000) this would be impractical
and inefficient. The strict layering of Replicators is not
absolutely required, and it may make sense to have a Replicator
driven by streams from a level 2 and a level 3 Replicator. It would
also be possible to take a stream from a level 4 Replicator and feed
it to a level 3 or 2 Replicator. This cannot result in the
equivalent of "routing loops", since most or all of the packets which
arrive from this link will contain payloads which the level 3 or 2
Replicator has already received - so those packets will not lead to
any further action. The advantage of taking such a stream from a
level 4 Replicator which depends on different level 3 or 2
Replicators than those which feed the receiving Replicator directly
is that it provides diversity. If
those streams from the directly used level 3 or 2 Replicators are
disrupted, it is unlikely that there will be the same disruption in
the stream received from the topologically distant level 4
Replicator.
I have presented FPR in a strictly layered arrangement because this
is easier to depict and is theoretically the most efficient way of
fanning out information. However, the details of connections between
Replicators and QSDs are not technically constrained by the FPR
system, and can be chosen freely to trade off bandwidth and computing
resources for robustness according to local conditions.
For instance, if a level 3 Replicator in Sydney Australia has
incoming streams from level 2 Replicators in Sydney and Singapore,
analysis of the connections might reveal that these two operate from
three or four level 1 Replicators which are not ideally diverse in a
topological sense, with their origins being mainly in the USA.
Assuming that the higher level Replicator outputs are more difficult
to obtain access to than those of the lower levels, it would be
possible to have a third stream feed this level 3 Replicator, from a
level 3 or 4 Replicator in Russia, which has most of its incoming
streams arriving from Europe. Typically, the packets arriving from
the Russian Replicator would arrive later than those from the higher
level Sydney and Singapore Replicators, and so would be ignored.
However, if there was a network outage which affected both the Sydney
and Singapore Replicators, even for a fraction of a second, the
payloads in the packets arriving from Russia would be used
automatically.
Each stream sent to a QSD, Missing Payload Server or a level 1 or
greater Replicator is, under ideal circumstances (no packet loss), a
"complete stream" in that its packets contain a complete set of DTLS
payloads, each of which every QSD should ideally receive at least once.
This is also true of the streams each level 0 Replicator receives
from other level 0 Replicators. If there is a single external source
of packets, then ideally, that source will send a separate "complete"
stream of packets to every level 0 Replicator. Due to the fully-
meshed flooding arrangement of the level 0 Replicators, then -
assuming there were no packet losses - it would suffice for the
single external source to send a single complete stream to just one
level 0 Replicator. Alternatively, the single source could send a
complete stream in various subsets, each to a different level 0
Replicator.
When the FPR system is used in Ivip, it is intended to receive
packets from multiple external sources - each an RUAS system.
Ideally, every RUAS will send its subset of the complete stream to
every level 0 Replicator. In this case, the "complete" stream is the
sum of all packets sent by all external sources. In fact, it would
suffice (assuming again no packet losses) for each external source to
send just a single set of packets to just one level 0 Replicator, or
scattered to various level 0 Replicators, because the payload of each
packet will flood to all the other level 0 Replicators.
At the highest level - level 0 - the FPR system involves brute-force
flooding and fully-meshed redundancy to ensure that in ordinary
circumstances every level 0 Replicator receives the "complete stream"
- either directly from the one or more external sources, or from its
level 0 peers.
For a global, real-time, system such as Ivip, I anticipate that 4 to
8 level 0 Replicators would suffice. Each would be in a
geographically and topologically different location, and they would
all be meshed by private network links which would, ideally, be
geographically and topologically diverse. In a section below I
discuss the use of private networks to protect against DoS attacks.
5. Packet payloads and identification
Replicators and QSDs decipher each received packet from upstream
Replicators (or for the level 0 Replicators, from RUASes and other
level 0 Replicators) and use the first 32 bits of the DTLS payload to
decide what to do with the entire DTLS payload. The options are to
use it because it is deemed to be "Fresh" payload - a Replicator
replicating it, or a QSD using the payload to update its database -
or to "ignore" it, because the device has already received a packet
with the same payload, meaning it is deemed to be a "Repeat" payload.
Some or potentially all bits of the FPR header are used by the
Replicator or QSD to decide whether the Replicator or QSD has already
received a packet with the same payload. This section describes this
process in principle, and the next describes one way this process
could be implemented in Replicators, Missing Payload Servers and
QSDs.
FPR's units of replication and flooding are payloads of packets. If
IPv4 packets are limited to 1450 bytes then there are 1422 bytes
available after the IP and UDP headers. If the DTLS header involves
an overhead of 50 bytes then 1372 bytes remain as the "DTLS payload".
(I have not yet researched DTLS in sufficient detail to determine
exactly what the overhead would be. This would depend on choice of
encryption algorithm and Message Authentication Code.)
At the start of the DTLS payload, a fixed number of bits must be
devoted to identifying the packet. I will refer to this as the "FPR
header" and for now assume it is 32 bits. The remainder of the DTLS
payload is the "FPR payload" - and is available for application data.
With these assumptions, the application data is contained in the 1368
byte FPR payload. The FPR software in the QSD (or other recipient
device, if FPR is used for a purpose other than Ivip) should make the
FPR header bits available along with the FPR payload, since it may be
helpful in processing the FPR payload.
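The byte budget above can be checked with a short calculation. This
is a sketch under the assumptions stated in the text: an IPv4 header
without options, and a 50 byte DTLS overhead which is a placeholder
rather than a measured figure. The constant and function names are
illustrative only.

```python
IP_HEADER = 20       # IPv4 header without options
UDP_HEADER = 8
DTLS_OVERHEAD = 50   # placeholder from the text, not a measured value
FPR_HEADER = 4       # the assumed 32 bit FPR header


def dtls_payload_bytes(packet_limit):
    """Bytes left for the DTLS payload in a packet of packet_limit bytes."""
    return packet_limit - IP_HEADER - UDP_HEADER - DTLS_OVERHEAD


def fpr_payload_bytes(packet_limit):
    """Bytes left for application data after the FPR header."""
    return dtls_payload_bytes(packet_limit) - FPR_HEADER


budget = fpr_payload_bytes(1450)
```

A different DTLS cipher suite or MAC would change DTLS_OVERHEAD and
therefore the result, so the figures should be recomputed once the
real overhead is known.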
In the current design, the only function of the "FPR header" is to
enable each Replicator and QSD to use this header to decide whether
each incoming packet, once deciphered from its DTLS form, is either a
"fresh" or a "repeat" packet within the timeframe T. The exact
algorithm for this decision is described in the following section.
For now, the definitions are loosely:
Fresh:
No packet with FPR header bits identical to this packet's FPR
header bits has been received in the recent time period T.
(Therefore, the packet is assumed to contain a fresh FPR
payload and so must be replicated.)
Repeat:
One or more packets with FPR header bits identical to this
packet's FPR header bits HAVE been received in the recent
time period T. (The first such packet was replicated, so
subsequent packets, which presumably have the same FPR payload,
are ignored.)
The idea is that in any given scenario, there may be some mechanism
by which a packet could be so delayed by the routing system and
Replicators that it arrives some time D later than it otherwise
would. The algorithm needs to identify any delayed packet with the
same payload as one already received as a "Repeat" which can be
ignored.
There are various ways of achieving these goals. One approach is to
use the 32 bit FPR header in the following manner. This is purely
for Ivip. Other applications would choose to identify the payloads
differently. (This is a preliminary exploration, to demonstrate one
way of performing the algorithm.)
10 bits: epoch in 1 sec increments (epochsec):
When the RUAS sends out packets with this payload, it sets
these 10 bits according to the current time (epoch), quantized
to seconds units. RUASes should agree on a common timebase, so
all the packets sent at a particular time by all RUASes have
the same, or +/-1, values for these bits. This wraps around
every 17 minutes 4 secs. Maybe it would be better to have 20
or 32 bits here. This value is not currently used for the
Fresh / Repeat algorithm, but it will be used by QSDs and
Missing Payload Servers for identifying recent payloads.
7 bits: RUAS identifier (ruas):
This identifies the RUAS which sent the payload, from 128
possible RUASes.
1 bit: Normal / Jumbo (nj)
0 means normal ~1500 byte packet size. 1 means ~9kbyte packet
size. To support simultaneous reception of both types of
packet, the RUAS will maintain separate sequence number
counters for each set of packets.
14 bits: Sequence number (seq):
The RUAS sends out each payload with a sequence number which is
one more than that used for the previously sent payload. The
Fresh / Repeat algorithm doesn't rely on this sequential order,
but it will help with the retrieval of missing payloads from
Missing Payload Servers. These numbers wrap around every
16,384 payloads. The RUAS should not send more than 1000
packets a second, which is about 1.3 megabytes a second,
assuming the packets are ~1500 bytes. So "seq" can't wrap
around in less than 16.384 seconds.
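A minimal sketch of packing and unpacking such a header follows. The
text gives the field widths but not their bit positions or byte
order, so this sketch assumes the fields are laid out in the order
listed, most significant first, in a big-endian 32 bit word. The
function names are illustrative.

```python
EPOCH_BITS, RUAS_BITS, NJ_BITS, SEQ_BITS = 10, 7, 1, 14


def pack_header(epochsec, ruas, nj, seq):
    """Pack the four fields into 4 bytes (assumed layout, big-endian)."""
    fields = ((epochsec, EPOCH_BITS), (ruas, RUAS_BITS),
              (nj, NJ_BITS), (seq, SEQ_BITS))
    for value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError("field value does not fit in its bit width")
    # seq occupies bits 0-13, nj bit 14, ruas bits 15-21, epochsec 22-31.
    word = (epochsec << 22) | (ruas << 15) | (nj << 14) | seq
    return word.to_bytes(4, "big")


def unpack_header(data):
    """Return (epochsec, ruas, nj, seq) from the first 4 bytes of data."""
    word = int.from_bytes(data[:4], "big")
    return ((word >> 22) & 0x3FF, (word >> 15) & 0x7F,
            (word >> 14) & 0x1, word & 0x3FFF)


header = pack_header(epochsec=600, ruas=42, nj=0, seq=12345)
fields = unpack_header(header)
```

The ten bit "epochsec" value would be the sender's clock in seconds
modulo 1024, which matches the wrap-around of 17 minutes 4 seconds.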
The next section explains how these bits are used.
6. The Fresh vs. Repeat Algorithm
This section describes an algorithm for deciding whether a DTLS payload
(FPR header and FPR payload) is "Fresh" or a "Repeat". I tried using
sliding windows and ran into problems. This approach maintains a
timer for each sequence number, for each RUAS. This would have been
prohibitive in the past, but a quad core ~3GHz CPU would use only a
small fraction of its power running these timers.
There could be other ways of implementing this algorithm. The aim
here is to show a practical approach, which may not be optimal.
For each of the 128 RUASes, the software maintains an array of 2^14
timer (down-counter) variables for ~1500 byte packets and another such
array for ~9kbyte packets. (It will be many years before the DFZ
supports ~9k byte packets, but the code should be ready to support a
separate stream of such packets.)
In the implementation below, the timer variables are 4 bits each, but
I use only 3 bits. The 4 bit timer variables are in a
multidimensional array, indexed on "ruas" (2^7), "nj" (2^1) and "seq"
(2^14). So there are 2^22 4 bit timer variables, occupying 2
megabytes of RAM. This is a few cents' worth of DRAM. The whole
array fits well within the L2 cache of modern multi-core CPUs, which
is typically 8 megabytes.
When a packet arrives and is successfully deciphered, the software
looks at three fields: "ruas", "nj" and "seq" in the FPR header.
The software uses these to index into the array and read a particular
timer variable.
If the timer value is zero, then the payload is deemed to be "Fresh".
This is because this payload is the first to be received from this
RUAS with this "seq" number in the last 10 or so seconds. The
software then sets the timer variable to 5.
Later, after this RUAS has sent another 16,384 payloads, it will send
another payload with the same "seq" value. But by then, more than 10
seconds will have elapsed and this timer value will have reached zero
- so that new payload will be recognised as Fresh too.
If the timer variable is non-zero, this payload is deemed to be a
"Repeat" and no further action is taken on it, or the timer variable.
This would occur if another packet with the same payload arrives less
than 10 to 12 seconds after the first one.
Meanwhile, a background process steps through all 2^22 timer values
every 2 seconds - about two million a second, which is a fraction of a
CPU's
worth of work, and modern chips have four CPUs. If the value is non-
zero, it is decremented. If it is zero (which most of them will be)
the variable is not changed. There is no need for locking these
timer variables, since these two types of access are thread-safe.
The payload handling code only writes the variable if it was zero and
the timer code only writes to it if it was non-zero.
Since the down-counting operation is asynchronous with respect to the
payload handling code, it could be 0.0 to 2.0 seconds before the
first decrement operation. The actual time required for the counter
to reach zero after a Fresh packet is recognised will be between 10.0
and 12.0 seconds.
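The relationship between the initial timer value, the sweep period
and the resulting window can be worked through with a short
calculation. This sketch assumes each pass of the background process
decrements every non-zero timer by exactly one, and that the first
pass reaches a given timer between 0 and one sweep period after it
was set. The function name is illustrative.

```python
def expiry_window(initial, sweep_period):
    """Return (earliest, latest) seconds from setting a timer to zero.

    The first decrement arrives between 0 and sweep_period seconds
    after the timer is set. The remaining (initial - 1) decrements
    follow at intervals of exactly sweep_period.
    """
    earliest = (initial - 1) * sweep_period
    latest = initial * sweep_period
    return earliest, latest


window_for_6 = expiry_window(6, 2.0)
window_for_5 = expiry_window(5, 2.0)
```

Under these assumptions an initial value of 6 gives a window of 10.0
to 12.0 seconds, while an initial value of 5 gives 8.0 to 10.0
seconds, so an implementation should check its chosen initial value
against the window it intends to enforce.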
This arrangement will reject as a "Repeat" any second occurrence of a
payload which arrives up to 10 seconds after the first ("Fresh") one.
It may reject one which arrives as much as 12 seconds later. I
assume that the Replicators themselves do not delay packets and that
the routing system would never delay a packet by as much as 10
seconds. If a longer time is required, this algorithm could be
modified.
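The following is a minimal single-threaded sketch of the timer array
approach, not a definitive implementation. For simplicity it stores
one byte per timer rather than packing them into 4 bits, and the
caller invokes sweep() explicitly instead of running a background
process every 2 seconds. The class and method names, and the index
layout within the array, are assumptions. The default initial value
of 5 is the one given in the text.

```python
# Lookup table mapping each byte value v to max(v - 1, 0).
_DECREMENT = bytes([0] + list(range(255)))


class FreshRepeatFilter:
    """Timer array indexed on ruas (7 bits), nj (1 bit), seq (14 bits)."""

    def __init__(self, initial=5):
        self.initial = initial
        self.timers = bytearray(1 << 22)  # every timer starts at zero

    @staticmethod
    def _index(ruas, nj, seq):
        return (ruas << 15) | (nj << 14) | seq

    def is_fresh(self, ruas, nj, seq):
        """Return True for a Fresh payload and start its timer."""
        i = self._index(ruas, nj, seq)
        if self.timers[i] == 0:
            self.timers[i] = self.initial
            return True
        return False  # Repeat: the timer is left untouched

    def sweep(self):
        """One pass of the down-counter: decrement all non-zero timers."""
        self.timers = self.timers.translate(_DECREMENT)


flt = FreshRepeatFilter()
first = flt.is_fresh(ruas=3, nj=0, seq=100)
second = flt.is_fresh(ruas=3, nj=0, seq=100)
```

A production version would need to consider how packed 4 bit values
sharing a byte are updated when the payload handling code and the
down-counter run in different threads.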
7. RUAS functionality
With the assumptions from the previous section, there can be up to
128 RUASes. Each can generate up to 1000 DTLS payloads per second.
However, the total FPR system will have a specified maximum data
rate, probably at a granularity of a short time such as a few
milliseconds to a few tens of milliseconds. Therefore, there needs
to be some arrangement by which the RUASes cooperate so the rate at
which packets (really DTLS payloads) are replicated by the level 0
Replicators does not exceed this maximum.
Each RUAS has its own section of the 2^22 value numbering range to
use as sequence numbers for its DTLS payloads - the values it writes
to the "ruas", "nj" and "seq" fields in the FPR header. For the
~1500 byte stream of payloads, the RUAS must cycle through the 2^14
range of "seq" sequentially. If it is also sending jumboframe
packets, it will maintain an independent counter to set the "seq"
bits in those payloads.
As part of this cycling, the RUAS should not, within 12.0 seconds,
generate two DTLS payloads with the same particular value for "seq"
in their FPR headers, but with different FPR payloads.
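A sketch of how an RUAS might allocate "seq" values, with an
independent counter for each of the normal and jumboframe streams,
follows. The class and function names are hypothetical. The rate
limit of 1000 payloads per second and the 12.0 second figure are
taken from the text.

```python
SEQ_MODULUS = 1 << 14    # 16,384 sequence numbers per stream
MAX_RATE = 1000          # payloads per second, from the text
REUSE_GUARD = 12.0       # seconds within which "seq" must not repeat


class RuasSequencer:
    """Allocates sequential "seq" values, one counter per "nj" value."""

    def __init__(self):
        self.next_seq = [0, 0]  # index 0: normal, index 1: jumboframe

    def allocate(self, nj):
        seq = self.next_seq[nj]
        self.next_seq[nj] = (seq + 1) % SEQ_MODULUS
        return seq


def min_wrap_seconds(rate=MAX_RATE):
    """Shortest time in which "seq" can wrap at the given payload rate."""
    return SEQ_MODULUS / rate


sequencer = RuasSequencer()
normal = [sequencer.allocate(0) for _ in range(3)]
jumbo = sequencer.allocate(1)
```

At the stated maximum rate the shortest wrap time exceeds the 12.0
second guard, so sequential allocation alone satisfies the rule as
long as the rate limit is respected.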
All RUASes should use a common timebase for setting the "epochsec"
field in the payloads they generate.
At a bare minimum, for the RUAS to successfully launch a DTLS
payload, it must deliver a packet containing that payload to at least
one level 0 Replicator. This is assuming all the level 0 Replicators
are operating and that they are fully meshed - with each receiving a
stream from the others. If the RUAS only delivered the payload to a
single level 0 Replicator, which was not sending a stream to any
other level 0 Replicator, but was sending streams to all its
downstream level 1 Replicators, then depending on the interconnections
at the various levels, this may not result in the payload being
delivered to all QSDs.
Therefore, the RUAS should ideally have a stream to each level 0
Replicator, to maximise the chance that most or all of these
Replicators receive the payload directly, or from another such
Replicator.
The question of how RUASes format the data in the FPR payload, for
the purposes of reassembly in the QSDs and so that QSDs can use end-
to-end encryption to check its authenticity, is outside the scope of
the FPR system.
8. Replicator Functionality
Most Replicators will need to receive two or perhaps a few more
streams from upstream Replicators. Level 0 Replicators will receive
many more streams. Firstly, they will receive a stream from each
other level 0 Replicator. Secondly they will receive a stream from
each RUAS.
The same Replicator code should be usable at all levels, so
Replicators in general should be capable of receiving over 100 input
streams. This does not mean the total volume of packets would be 100
times the complete set of payloads the FPR system is replicating.
If we assume an upper limit of 8 level 0 Replicators, then the worst
case quantity of packets any Replicator must handle is 8 times the
total actually being replicated. This would be when a level 0
Replicator receives the total set collectively from the 100 or so
RUASes and then receives the same set from each of the 7 streams from
the other level 0 Replicators. So this provides a reasonable
definition of how many DTLS sessions a Replicator may need to create
to "upstream" devices - and the total volume of data it should be
able to receive via these sessions.
Except for monitoring purposes, the Replicator makes no distinction
between DTLS payloads which arrive from any of its upstream sources.
Each such payload is handled, as described above, by the Fresh /
Repeat algorithm. Only payloads deemed Fresh require any further
action.
Each Fresh payload is replicated to all the downstream devices, each
with its own DTLS protection, due to each such session having
different session keys and states. Just as the incoming streams are
unidirectional, so are the output streams. Apart from DTLS
handshakes, a Replicator neither sends packets to upstream devices
nor receives them from downstream devices.
The replication process does not alter the DTLS payload. There is no
hop-count or checksum to check or update. The same DTLS payload is
simply sent out via all downstream DTLS sessions. It would be best
if this was scheduled to even out the flow of packets for each such
session. So a DTLS payload would be sent out on session 0, then on
session 1, etc. rather than sending two or more different DTLS
payloads on any one session one after the other.
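The output ordering described above can be sketched as a generator
which emits each payload on every session before moving to the next
payload. The function name is illustrative, and the sessions are
represented by plain integers rather than real DTLS sessions.

```python
def send_order(payloads, sessions):
    """Yield (session, payload) pairs, one payload at a time.

    Each payload goes out on session 0, then session 1, and so on,
    so no session is given two different payloads back to back.
    """
    for payload in payloads:
        for session in sessions:
            yield session, payload


order = list(send_order(["A", "B"], [0, 1, 2]))
```

With more than one downstream session, consecutive sends on any one
session are separated by the sends on all the other sessions, which
spreads the flow of packets on each session.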
When a Replicator is handling a jumboframe stream as well as the
ordinary ~1500 byte stream, it maintains separate input and output
sessions for the jumboframe packets. So the structure of links
between Replicators for jumboframe packets could be identical to that
for ~1500 byte packets, could be similar or could be entirely
different. Therefore, a Replicator which handles both will need
approximately double the DTLS sessions and of course bandwidth and
CPU power to handle both.
9. QSD Functionality
The QSD receives incoming streams as just described for Replicators.
However, a QSD would receive either only ~1500 byte streams or only
jumboframe streams. Therefore, its Fresh / Repeat algorithm needs
only half as many timer variables as that of a Replicator.
QSDs don't receive streams from the numerous RUASes, and it is
probably safe to assume that no-one would run a QSD with more than 8
input streams. So while a QSD is only required to handle up to 8 or
so DTLS sessions, each of these streams would be a complete stream,
so the incoming data rate requirement is the same as that of a
Replicator - 8 times the total data rate of the complete stream.
When Fresh DTLS payloads are received, their contents - the 32 bit
FPR header and FPR payload - are passed to the rest of the QSD software,
and the mapping information in these payloads will be interpreted as
will be described in [I-D.whittle-ivip-db-fast-push].
This processing will involve some kind of end-to-end integrity
checking, involving the public key of the RUAS which sent the
payload. With the above arrangement, the RUAS of the packet can
easily be determined from the "ruas" field in the FPR header.
Perhaps it will be possible to individually authenticate every
payload - but I am concerned about devoting too much space in every
payload to the required MAC bits. This concern would not apply to
jumboframe payloads which are much longer. Checking each payload
would be simpler, but more costly in terms of CPU resources and space
used in each payload. Assembling information from multiple payloads
into a larger block for authentication would be more efficient, but
more complex. It also means that a missing payload will delay the
use of information in other payloads.
Exactly how the end-to-end authentication will be done is for future
work. It depends more on the Ivip mapping system than on the FPR
system itself, so I intend to explore this in the future in
[I-D.whittle-ivip-db-fast-push].
QSDs will also need to recognise any missing packets and to download
a replacement. The algorithm for this is for further work, but the
10 bit "epochsec" field will also be useful for this. Missing
packets could be detected, after a second or two, by a gap in the
sequence numbers of payloads from a given RUAS.
Perhaps one form of request for missing packets might be to send two
32 bit values, containing the FPR headers of the successfully
received payloads which bracket the assumed missing packets. The
full 32 bits uniquely identifies each payload, and the 1 second
resolution "epochsec" field will enable the Missing Payload Server to
narrow down its search through its cache.
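As a sketch of how the bracketing pair of FPR headers could be turned
into a list of missing sequence numbers, allowing for wrap-around of
the 14 bit "seq" field, consider the following. It assumes "seq"
occupies the low 14 bits of the header with "nj" and "ruas" in the 8
bits above it, which the text does not specify. The function name is
hypothetical, and the request format itself remains future work.

```python
SEQ_MODULUS = 1 << 14


def missing_between(before_header, after_header):
    """List the "seq" values missing between two received headers.

    Both headers are 32 bit integers and must carry the same "ruas"
    and "nj" values, since each stream has its own sequence counter.
    """
    if (before_header >> 14) & 0xFF != (after_header >> 14) & 0xFF:
        raise ValueError("headers belong to different streams")
    first = before_header & 0x3FFF
    last = after_header & 0x3FFF
    gap = (last - first - 1) % SEQ_MODULUS  # handles wrap-around
    return [(first + 1 + k) % SEQ_MODULUS for k in range(gap)]


def header(ruas, nj, seq):
    """Build the low 22 bits of a header for this example."""
    return (ruas << 15) | (nj << 14) | seq


wrapped = missing_between(header(3, 0, 16382), header(3, 0, 1))
```

In this example the two payloads missing across the wrap-around
point are those with "seq" values 16383 and 0.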
10. Further elaborations
The above is a reasonably exhaustive exposition on the early design
phase of a simple, but flexible, data replication system. Here are
some elaborations to be more fully developed in the future.
10.1. Missing Payload Servers (MPSs)
I originally planned for QSDs to request payloads they did not
receive from a handful of HTTP servers run by each RUAS. This could
have scaling problems, so I have developed an alternative which is
closely integrated with the basic FPR system of Replicators.
In Ivip, the RUAS will be making snapshots of the mapping information
for each MAB (Mapped Address Block) on a regular basis, such as every
5 minutes or so. It will make these snapshots (in a compressed form)
available via several HTTP servers so QSDs all over the world can
download them during initialization. If a QSD was more than a few
minutes behind with missing payloads, then it would be better for it
to download the most recent snapshot instead and apply the updates it
has received since that snapshot was made. So the Missing Payload
Server probably only needs to handle packets from the last 5 or so
minutes. This fits well with the 10 bit "epochsec" field in the FPR
header.
I haven't yet decided how a QSD can specify which missing payload(s)
it wants. One method may be to send the 32 bit FPR headers of the
last payload received before the missing payloads and of the first
payload received afterwards.
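As a minimal sketch of that one possible method, the two bracketing headers could be carried as eight bytes in network byte order. The function names and the choice of a raw binary encoding are assumptions for illustration; the request format has not been decided.

```python
import struct


def encode_request(before_hdr, after_hdr):
    """Pack the two bracketing 32 bit FPR headers, most significant byte first.

    before_hdr: header of the last payload received before the gap.
    after_hdr:  header of the first payload received after the gap.
    """
    return struct.pack("!II", before_hdr, after_hdr)


def decode_request(data):
    """Recover the two 32 bit headers from an 8 byte request."""
    before_hdr, after_hdr = struct.unpack("!II", data)
    return before_hdr, after_hdr
```

If HTTP or HTTPS is used, these eight bytes could be sent as a request body, or the two values could instead be written as hexadecimal text in the request. Either choice is open.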
I considered using a UDP protocol for requesting and receiving
missing payloads from MPSes, but chose TCP instead, probably HTTP or
HTTPS over TCP. This avoids any PMTU problems and removes the need
for acks and for resending queries and responses. TCP also avoids
a difficulty inherent in lightweight UDP protocols, where the MPS
could be used to amplify small query packets with spoofed source
addresses into larger responses to DoS a victim.
An MPS (Missing Payload Server) is a COTS server running software with
an input stage identical to that of a Replicator or QSD. That is, it
uses DTLS to receive two or more streams from Replicators and it uses
the Fresh / Repeat algorithm to ignore all but the first appearance
of a new payload.
An MPS at a particular location would receive a stream from one or
more physically and topologically close Replicators and ideally from
some physically and topologically distant Replicators. "Topology" in
this case means not just the underlying DFZ topology, but also that
distant Replicator's location in the "topology" of upstream
Replicators.
The aim is to receive at least one local stream, which is inexpensive
- probably from a Replicator in the same data center - and one or a
few streams from distant Replicators. This is so that in the event
of the local Replicator suffering an outage and so missing some
packets, it is likely that the distant one will not be missing the
same packets. If the local outage, such as a complete loss of
connectivity for a few seconds, or significant packet loss due to
congestion, also affects the ability of the MPS to receive packets
from distant Replicators, then the same packets may be lost. A
simple workaround for this is to have the distant Replicator delay
its stream by ten seconds or so. Such delayed outputs from a
Replicator should only be used to drive QSDs and MPSes - never another
Replicator.
MPSes do not need to interpret the packets in order to update a
mapping database, as does a QSD. The MPS does not need to interpret
the payloads at all, or perform end-to-end authentication on their
contents. The MPS only needs to store complete DTLS payloads for ten
minutes or so and be able to provide them to requesters. The
requesters will be either QSDs or other MPSes. So an MPS is a
relatively light-weight network element. It may be quite busy at
times responding to queries and sending out payloads, but most of the
time, it is storing payloads in a simple fashion, and is not required
to do any work on their contents.
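The storage role described above could be sketched roughly as below. The class name PayloadCache, the use of the 32 bit FPR header as the lookup key, and the 600 second retention are assumptions for illustration. Arrival times are passed in explicitly and are assumed to be non-decreasing.

```python
from collections import OrderedDict

RETENTION_SECONDS = 600    # "ten minutes or so"; the exact value is an assumption


class PayloadCache:
    """Holds opaque payloads, keyed by FPR header, for a limited time."""

    def __init__(self, retention=RETENTION_SECONDS):
        self.retention = retention
        self._items = OrderedDict()    # header -> (arrival_time, payload), oldest first

    def store(self, header, payload, now):
        self.expire(now)
        # Only the first appearance of a payload is kept.
        if header not in self._items:
            self._items[header] = (now, payload)

    def expire(self, now):
        # Entries are in arrival order, so expiry only inspects the front.
        while self._items:
            oldest_header = next(iter(self._items))
            arrival_time, _ = self._items[oldest_header]
            if now - arrival_time <= self.retention:
                break
            self._items.popitem(last=False)

    def get(self, header):
        entry = self._items.get(header)
        return None if entry is None else entry[1]
```

The payload bytes are never inspected, which matches the point that an MPS does no work on their contents.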
The request protocol does not need to be secure, since Ivip mapping
information is public information. However, each MPS may wish to
restrict its queriers to those which match an ACL.
By some means TBD, each QSD and MPS could be configured to use
several MPSes - including perhaps a distant one which is unlikely to
be affected by any brief local outage which caused this QSD to be
missing some packets. (Folks in North America, Siberia and Africa
would have reason to give each other access to their MPSes!)
In ordinary operation, each MPS would have a complete list of recent
packets. If it was missing some packets, it would determine this by
looking at the FPR headers and finding a gap in the "seq" numbers
recently received for each RUAS.
It would be scalable for each MPS to maintain a TCP connection with
another MPS so the two could use the one link to request and deliver
missing packets in both directions. Therefore, the MPSes could be
arranged in multiple partially meshed groups - or these could be
connected and so form a single global network of MPSes. The request
protocol would probably need an option to cancel a request. For
instance, an MPS in Los Angeles might first request one or more
missing packets from an MPS in New York. But if the NY MPS replies
that it too is missing these packets, the LA MPS might request them
from an MPS in Beijing - which responds that it has them, and starts
sending them. The LA MPS will then want to cancel the request to the
NY MPS. HTTP or HTTPS is probably a good protocol for this purpose.
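The Los Angeles example can be sketched in a simplified, sequential form. Here each MPS is represented by a hypothetical query function which returns whatever subset of the requested payloads it holds; the function and parameter names are illustrative only. Because the servers are asked one at a time, no cancellation is needed in this sketch. A real implementation with requests outstanding to several MPSes at once is where the cancel option would matter.

```python
def fetch_missing(wanted, servers):
    """Ask each MPS in turn for the payloads still outstanding.

    wanted:  iterable of 32 bit FPR headers identifying missing payloads.
    servers: list of (name, query) pairs, where query(headers) returns a
             dict {header: payload} for the headers that server holds.
    Returns (payloads, still_missing, asked).
    """
    outstanding = set(wanted)
    payloads = {}
    asked = []
    for name, query in servers:
        if not outstanding:
            break                      # nothing left to ask for
        asked.append(name)
        found = query(sorted(outstanding))
        for header, payload in found.items():
            if header in outstanding:
                payloads[header] = payload
                outstanding.discard(header)
    return payloads, outstanding, asked
```

In the example above, the New York query would return an empty result, the Beijing query would return the payloads, and any MPS later in the list would not be contacted.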
QSDs would use the same protocol for querying MPSes. Whether the QSD
starts an HTTP(S)-TCP connection when it needs missing packets, or
whether it maintains such a connection in readiness, would be a
matter for local policy.
In this scenario, MPSes form an interdependent network, which will be
highly robust. Most MPSes will have all the recent packets. Those
which don't will automatically obtain them from other MPSes within a
few seconds.
An ISP which runs one or a few QSDs could run an MPS
for all of them, with long-lasting sessions to a few other MPSes in
nearby and distant ISPs. Alternatively, the QSDs could use one or
more MPSes operated by other organisations, perhaps on a commercial
basis.
Since an MPS is simply software running on a COTS server, MPSes are
not expensive or difficult to deploy. It would be possible to run an
MPS on the same host as a QSD, but if they are using streams from the
same Replicators, then there will be a high correlation between the
sets of packets which each function misses. Therefore, it makes
sense for each QSD to use a nearby MPS, and then a distant one,
rather than to run an MPS at the same site which will need to make
much the same queries of other MPSes as the QSD would.
10.2. Delaying the output of Replicators
If a QSD or MPS relied on a single upstream physical link, or a
router or other device which might be subject to transient
disruption, then having multiple streams from upstream Replicators
will not necessarily ensure the QSD gets all the payloads which are
sent. This is because the disruption will likely affect all such
streams, which will be carrying much the same payloads at the same
time.
A possible workaround is to have one or more of the streams delayed
at its source - in the output function of its Replicator. If one
such stream was delayed by 5 seconds, then it would typically be able
to deliver every payload which was not delivered during a 4 second
disruption.
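The effect of the delay can be checked with a small simulation. This is a sketch under simplifying assumptions: transit time is ignored, the outage drops everything arriving on the shared link while it lasts, and the payload rate and outage start time are arbitrary illustrative values. Times are integers in tenths of a second.

```python
def survivors(send_times, delay, outage_start, outage_len):
    """Send times of payloads which arrive outside the outage on one stream.

    A payload sent at time t arrives at t + delay. Anything arriving while
    the shared link is down is lost.
    """
    outage_end = outage_start + outage_len
    return {t for t in send_times
            if not (outage_start <= t + delay < outage_end)}


def lost_on_both(send_times, delay, outage_start, outage_len):
    """Payloads missed by both the undelayed stream and the delayed one."""
    direct = survivors(send_times, 0, outage_start, outage_len)
    delayed = survivors(send_times, delay, outage_start, outage_len)
    return set(send_times) - (direct | delayed)


# One payload every 0.1 s for 30 s, with a 4 s outage starting at t = 10 s.
SEND_TIMES = range(300)
print(len(lost_on_both(SEND_TIMES, 50, 100, 40)))   # second stream delayed 5 s
print(len(lost_on_both(SEND_TIMES, 30, 100, 40)))   # second stream delayed 3 s
```

With these assumed numbers, a 5 second delay leaves no payload lost on both streams, while a 3 second delay still loses the payloads sent during the first second of the outage. The delay needs to exceed the length of the disruption.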
So it may be desirable for delays such as this to be an option when a
QSD or MPS requests a stream from a Replicator. A Replicator does
not see a QSD request any differently from the request from another
Replicator. So the question arises as to whether the stream from one
Replicator to another should be delayed - and if so, by how much.
It should be reasonably safe for a QSD or MPS to receive a stream
with a delay of a few seconds, since the QSD does not propagate the
payloads any further. The delay could be chosen locally so that,
when added to a reasonable estimate of the longest delay affecting
packets going into that Replicator, there is still a safety margin
within the minimum timeout of the QSD's timers for the purposes of
Fresh / Repeat detection.
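The constraint on the locally chosen delay amounts to simple arithmetic: the chosen delay, plus the worst expected upstream delay, plus a safety margin, must fit within the minimum Fresh / Repeat timeout. A minimal sketch follows, in which the function name and all of the example values are hypothetical.

```python
def max_stream_delay(min_timeout, worst_upstream_delay, safety_margin):
    """Largest output delay (seconds) which stays inside the timer budget.

    min_timeout:          minimum Fresh / Repeat timeout of the QSD's timers.
    worst_upstream_delay: estimate of the longest delay affecting packets
                          going into the Replicator.
    safety_margin:        headroom to keep in reserve.
    Returns 0.0 if no delay can be afforded.
    """
    budget = min_timeout - worst_upstream_delay - safety_margin
    return max(budget, 0.0)
```

For instance, with an assumed 15 second minimum timeout, a 3 second worst case upstream delay and a 2 second margin, the stream could be delayed by up to 10 seconds.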
To delay the packets received by a Replicator would be much more
problematic. These delayed payloads could be propagated to other
Replicators - and these delays could be added to by similar
arrangements between other Replicators. Then, the total delay might
exceed the limits of the Fresh / Repeat algorithm and QSDs and
Replicators would mistake older payloads for ones which were actually
sent 15 seconds or so later. (This could be prevented with a more
elaborate algorithm, which also uses the 10 "epochsec" bits, but I
think this raises further complications.) This would only occur with
the highest allowed data-rates which might not occur in practice
until the system was being used intensively - many years in the
future.
I think that delaying a stream to a Replicator could in principle
improve its robustness if its two streams were likely to be subject
to the same brief disruptions. However, it would be better to locate
Replicators at data centres with multiple physical links and to try
to ensure that the streams are most likely to arrive over diverse
links. QSDs will cope with missing packets, and the aim of the
Replicator system is to minimise the number of packets they miss.
However, Replicators should not contribute to delays which might
disrupt the ability of QSDs and other Replicators to correctly
distinguish a Fresh packet from a Repeat.
10.3. Private network links to avoid DoS attacks
The Replicator system as described above is a promising method of
fanning out information to a very large number of recipient devices
all over the world, in fractions of a second. While the system is
distributed and has no single point of failure, if it were used for a
purpose as important as distributing mapping for a core-edge
separation system such as Ivip, it would no doubt be threatened by
DoS attacks in the form of gigabits per second of packets directed
from large numbers of hacked botnet PCs.
Internet protocols are intended to operate on the open Internet.
However, the use of FPR may be a partial exception. Some root
nameservers are toughened against DoS by being distributed to
multiple high-bandwidth sites using anycast. In principle, by having
enough fully meshed level 0 Replicators, the same goal could be
achieved - for an attack to succeed, it would need to overwhelm all,
or almost all of the devices at the same time.
To some extent this can be achieved with Replicators, but it would
probably be best to toughen the system against DoS attacks by linking
the RUASes, the level 0 Replicators and at least the level 1
Replicators over private network links with assured bandwidth and no
possibility of being affected by packets arriving from the Internet.
In this case, the RUASes and level 0 Replicators may have private
addresses. The level 1 Replicators may also have private addresses
on their input side - the part which makes DTLS links to level 0
Replicators.
In this model, the output side of the level 1 Replicators would
be on public addresses so level 2 Replicators could establish
sessions with them. Probably these level 1 Replicators would have
two separate gigabit Ethernet ports - one for the private address and
upstream links and the other for the public addresses and downstream
links.
The downstream public address of the level 1 Replicators might be the
target of a DoS attack, but once the sessions have been established,
Replicators do not need to receive any packets on those DTLS
sessions. So a DoS attempt there would have little or no effect.
Instead, a DoS attack would need to focus on the level 2 Replicators.
Ideally, depending on the capacity of the attackers, these would be
so numerous that an attack could only disrupt a subset of them. Even
then, due to the cross-linked nature of the Replicator system, the
impact of that attack on QSDs may be greatly diluted due to level 3
and 4 Replicators working fine from streams arriving from level 2
Replicators which were not targeted.
Using private network links to fully mesh the level 0 Replicators,
and for their streams to the level 1 Replicators, is a non-trivial
matter. However, the benefits of a fast push mapping distribution
system for a core-edge separation scheme, for the Internet in
general, are immense - so this expense is worth considering.
If the first two levels are carefully optimised so that there are,
for instance, 5 to 8 level 0 Replicators (only two or three are
needed for highly reliable operation) and 50 to 100 level 1
Replicators (of which quite a few could be dead without significantly
disrupting streams to QSDs) then this system could drive 1000 to 2000
or perhaps more level 2 Replicators. This would probably make the
system largely immune to DoS attacks - but of course the exact
details would need to be considered at the time of deployment.
11. Security Considerations
For future work, but see notes above about the need for end-to-end
authentication, and hardening against DoS attacks.
12. IANA Considerations
[To do.]
13. Informative References
[DFZ-unfrag-1470]
Whittle, R., "Google sends 1470 byte unfragmentable
packets", August 2008, <http://www.firstpr.com.au/ip/ivip/
ipv4-bits/actual-packets.html>.
[I-D.ietf-lisp]
Farinacci, D., Fuller, V., Meyer, D., and D. Lewis,
"Locator/ID Separation Protocol (LISP)",
draft-ietf-lisp-05 (work in progress), September 2009.
[I-D.whittle-ivip-arch]
Whittle, R., "Ivip (Internet Vastly Improved Plumbing)
Architecture", draft-whittle-ivip-arch-04 (work in
progress), January 2010.
[I-D.whittle-ivip-db-fast-push]
Whittle, R., "Ivip Mapping Database Fast Push",
draft-whittle-ivip-db-fast-push-03 (work in progress),
January 2010.
[RFC2887] Handley, M., Floyd, S., Whetten, B., Kermode, R.,
Vicisano, L., and M. Luby, "The Reliable Multicast Design
Space for Bulk Data Transfer", RFC 2887, August 2000.
[RFC3133] Dunn, J. and C. Martin, "Terminology for Frame Relay
Benchmarking", RFC 3133, June 2001.
[RFC3740] Hardjono, T. and B. Weis, "The Multicast Group Security
Architecture", RFC 3740, March 2004.
[RFC4347] Rescorla, E. and N. Modadugu, "Datagram Transport Layer
Security", RFC 4347, April 2006.
[TTR Mobility]
Whittle, R. and S. Russert, "TTR Mobility Extensions for
Core-Edge Separation Solutions to the Internet's Routing
Scaling Problem", August 2008,
<http://www.firstpr.com.au/ip/ivip/TTR-Mobility.pdf>.
Author's Address
Robin Whittle
First Principles
Email: rw@firstpr.com.au
URI: http://www.firstpr.com.au/ip/ivip/