Network Working Group R. Whittle
Internet-Draft First Principles
Intended status: Experimental January 13, 2010
Expires: July 17, 2010
Ivip Mapping Database Fast Push
draft-whittle-ivip-db-fast-push-02.txt
Abstract
From the base of draft-whittle-ivip-arch-03 and later, this ID
describes in greater detail Ivip's fast-push mapping distribution
system. This accepts mapping changes from end-user networks or
organizations they authorise to make these changes. The mapping
changes are handled by RUAS (Root Update Authorization Server)
companies who collectively run a small set of Launch servers and a
global network of Replicator servers. Each second, the Launch
servers send sets of packets with mapping updates to a larger number
of Level 1 Replicators, each of which gets at least two feeds of
these mapping updates from different Launch servers. Each Level 1
Replicator fans out the mapping changes to multiple Level 2
Replicators, which also receive at least two feeds from upstream
Level 1 Replicators. In this way, within a fraction of a second, the
mapping changes are fanned out securely and reliably to full database
query servers (QSDs) in ISPs and some end-user networks all over the
Net. Additionally, QSDs can download missing packets and snapshots of
segments of the mapping database. A WAG of 4 billion mapping changes
a year gives a raw data rate for IPv6 mapping changes of only 32kbps.
TTR mobility only involves mapping changes if the MN moves a large
distance, such as 1000km. Multihoming service restoration updates
would be infrequent. Mapping changes for TE could be numerous
depending on cost. It is hard to imagine a scenario where mapping
changes would present significant difficulties in terms of bandwidth
or in terms of the capacity of QSDs to handle them.
Status of this Memo
This Internet-Draft is submitted to IETF in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
Whittle Expires July 17, 2010 [Page 1]
Internet-Draft Ivip DB Fast Push January 2010
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
This Internet-Draft will expire on July 17, 2010.
Copyright Notice
Copyright (c) 2010 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the BSD License.
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1. Outline of the RUAS, Launch and Replicator systems . . . . 4
1.2. Assumptions . . . . . . . . . . . . . . . . . . . . . . . 6
1.3. It may not be so daunting... . . . . . . . . . . . . . . . 7
2. Goals, Non-Goals and Challenges . . . . . . . . . . . . . . . 8
2.1. Goals . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2. Non-goals . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3. Challenges . . . . . . . . . . . . . . . . . . . . . . . . 10
3. Definition of Terms . . . . . . . . . . . . . . . . . . . . . 11
3.1. SPI - Scalable PI space . . . . . . . . . . . . . . . . . 11
3.1.1. Conventional global unicast address space . . . . . . 11
3.2. MAB - Mapped Address Block . . . . . . . . . . . . . . . . 11
3.3. UAB - User Address Block . . . . . . . . . . . . . . . . . 11
3.4. Micronet . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.5. RUAS - Root Update Authorisation System . . . . . . . . . 12
3.6. UAS - Update Authorisation System . . . . . . . . . . . . 13
3.7. UMUC - User Mapping Update Command . . . . . . . . . . . . 13
3.8. SUMUC - Signed User Mapping Update Command . . . . . . . . 15
3.9. MABUS - Update Stream specific to one MAB . . . . . . . . 15
3.10. Launch server . . . . . . . . . . . . . . . . . . . . . . 15
3.11. Replicator . . . . . . . . . . . . . . . . . . . . . . . . 16
3.12. QSD - Query Server with full Database . . . . . . . . . . 16
3.13. QSC - Query Server with Cache . . . . . . . . . . . . . . 17
4. Update Authorities and User Interfaces . . . . . . . . . . . . 18
4.1. RUAS Outputs . . . . . . . . . . . . . . . . . . . . . . . 19
4.1.1. Updates every second . . . . . . . . . . . . . . . . . 19
4.1.2. MAB snapshots . . . . . . . . . . . . . . . . . . . . 19
4.1.3. Missing packet servers . . . . . . . . . . . . . . . . 21
4.2. Authentication of RUAS-generated data . . . . . . . . . . 22
4.2.1. Snapshot and missing packet files . . . . . . . . . . 22
4.2.2. Mapping updates . . . . . . . . . . . . . . . . . . . 22
4.3. RUAS - UAS interconnection . . . . . . . . . . . . . . . . 24
5. The Launch system . . . . . . . . . . . . . . . . . . . . . . 30
5.1. Phase 1 - collecting updates from RUASes . . . . . . . . . 30
5.2. Phase 2 - checksum comparison . . . . . . . . . . . . . . 31
5.3. Phase 3 - identical update streams . . . . . . . . . . . . 32
6. Replicators . . . . . . . . . . . . . . . . . . . . . . . . . 34
6.1. Scaling limits . . . . . . . . . . . . . . . . . . . . . . 35
6.2. Managing Replicators . . . . . . . . . . . . . . . . . . . 37
7. Security Considerations . . . . . . . . . . . . . . . . . . . 39
8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 40
9. Informative References . . . . . . . . . . . . . . . . . . . . 41
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 42
1. Introduction
The aim of this I-D is to establish that Ivip's fast-push mapping
distribution system (FMS) is practical and desirable for very large
numbers of micronets (EIDs in LISP terminology) and rates of change
of the mapping database. Please refer to [I-D.whittle-ivip-arch] for
an explanation of Ivip in general. A glossary of Ivip and some
general scalable routing terms and acronyms is:
[I-D.whittle-ivip-glossary].
This is a revision of the 00 and 01 versions, with the only
substantial change being a much lower estimate of the worst-case
number of updates, with a correspondingly lower worst-case required
bandwidth.
The most unusual and demanding part of Ivip's fast-push system is the
network of "Replicator" servers which fan the mapping updates out to
potentially hundreds of thousands of full database query servers
(QSDs) at ISP and end-user network sites all over the world.
1.1. Outline of the RUAS, Launch and Replicator systems
The largest part of the FMS is comprised of thousands (perhaps
several hundred thousand in the long term future) of essentially
identical "Replicator" servers. There may be other, better,
approaches, but this ID describes the current design. This
cross-linked, tree-like structure of Replicators in some ways resembles a
tree of multicast routers. However, each Replicator receives at
least two streams of identical mapping data, so it is much less
likely to miss a packet from this stream than if it only received the
packets from a single source.
The first level of Replicators (level 1) is driven by a small set of
Launch servers, which are geographically and topologically diverse,
but which work as a team to reliably send the mapping update packets
to the Level 1 Replicators. The Launch servers gain this
information, second-by-second, from a small number (ten to a few
dozen at most) of RUAS systems (Root Update Authorization Servers), each
belonging to a different RUAS company.
At the first level, each Replicator receives two identical streams,
over separate authenticated and encrypted links, from two different
Launch servers in different geographical locations, and over
different physical long distance links. The Launch system and
perhaps the first level (1) of Replicators will probably be
implemented with private network links, rather than relying on open
Internet addresses which are subject to flooding attacks.
If a packet goes missing from one stream, it will probably be present
in the second. As the packets arrive, the Replicator takes the first
one from either stream and sends its contents out simultaneously on a
larger number of similar links to the next level of Replicators.
Consequently, the delay time for update information passing through a
Replicator will be no more than a few to ten milliseconds, and is
comparable to the delays imposed by a packet traversing a router.
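The "first copy wins" behaviour described above can be sketched as
follows. This is only an illustration: it assumes each update packet
carries a sequence number (the packet format is TBD in this ID), and
the class and callback names are hypothetical.

```python
from typing import Callable, List, Set


class Replicator:
    """Forwards the first copy of each update packet and drops later duplicates."""

    def __init__(self, downstream: List[Callable[[int, bytes], None]]) -> None:
        self.downstream = downstream      # one send function per next-level link
        self.seen: Set[int] = set()       # sequence numbers already forwarded (assumed field)

    def on_packet(self, seq: int, payload: bytes) -> bool:
        """Handle a packet from any upstream feed; return True if it was forwarded."""
        if seq in self.seen:
            return False                  # duplicate arriving on the other feed
        self.seen.add(seq)
        for send in self.downstream:
            send(seq, payload)            # fan out to every next-level link
        return True


received = []
rep = Replicator([lambda s, p: received.append(("a", s)),
                  lambda s, p: received.append(("b", s))])
rep.on_packet(1, b"update-1")   # arrives first on one feed: forwarded
rep.on_packet(1, b"update-1")   # same packet on the second feed: dropped
rep.on_packet(2, b"update-2")
```

A real Replicator would also need to expire old entries from the set
of seen sequence numbers; that is omitted here for brevity.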
In this way, each Replicator consumes two identical streams from
geographically and topologically different sources, and fans the
content of the streams out to some larger number of Replicators or
QSDs at the next level. This number of output streams per Replicator
may be in the tens to one hundred range, depending on the volume of
updates. Initially, it would be quite high, when update rates are
low - meaning that the initial global Replicator network could serve
the growing number of QSDs with few levels of Replicators, and with
each one fanning out updates to a large number of Replicators at the
next level.
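The relationship between fan-out and the number of Replicator levels
can be illustrated with a small calculation. It is a sketch under
stated assumptions: every consumer (Replicator or QSD) takes exactly
two feeds, every Replicator at a level emits the same number of
streams, and the Level 1 count used in the example is hypothetical.

```python
def replicator_levels(level1_count: int, fan_out: int, qsd_count: int,
                      feeds_per_node: int = 2) -> int:
    """Return how many Replicator levels are needed to feed qsd_count QSDs."""
    levels, nodes = 1, level1_count
    while True:
        # Streams offered by this level, divided by the feeds each consumer takes.
        capacity = nodes * fan_out // feeds_per_node
        if capacity >= qsd_count:
            return levels
        if capacity <= nodes:
            raise ValueError("fan-out too small for the tree to grow")
        nodes, levels = capacity, levels + 1


# Hypothetical example: 10 Level 1 Replicators feeding 100,000 QSDs.
low_fan_out_levels = replicator_levels(10, 20, 100_000)
high_fan_out_levels = replicator_levels(10, 100, 100_000)
```

With these illustrative inputs, a fan-out of 20 needs four levels and
a fan-out of 100 needs three, which is the effect described above:
higher fan-out while update rates are low means fewer levels.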
After some number of levels of replication, determined by local
conditions, the streams deliver the update information at a QSD.
Ideally, each QSD will receive two streams from two geographically
dispersed Replicators. These need not be at the same level, so the
system is relatively flexible, and each Replicator will generally be
sending a complete stream of packets.
The Launch system generates the stream as a variable number of
packets on a regular schedule, such as every second. Data within
each packet enables QSDs to authenticate the mapping information, and
to request from remote servers any packets which did not arrive.
Snapshots of segments of the mapping database are taken regularly by
each RUAS. Each snapshot contains a complete copy of the mapping of
one MAB (Mapped Address Block) at a particular instant. At that
point in time, a hash function of the mapping data for this MAB is
generated and within a few seconds is sent to all QSDs. This enables
each QSD to verify that its copy of the mapping for this MAB is fully
up-to-date.
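One way a QSD might compare its copy of a MAB against the announced
hash is sketched below. The hash algorithm (SHA-256), the 4-byte IPv4
field encoding and the dictionary layout are all assumptions made for
illustration; this ID does not specify them.

```python
import hashlib
from typing import Dict, Tuple

# (micronet start, length) -> ETR address, all as integers (hypothetical layout)
MabMapping = Dict[Tuple[int, int], int]


def mab_digest(mapping: MabMapping) -> str:
    """Hash a MAB's mapping in a canonical (sorted) order."""
    h = hashlib.sha256()
    for (start, length), etr in sorted(mapping.items()):
        h.update(start.to_bytes(4, "big"))
        h.update(length.to_bytes(4, "big"))
        h.update(etr.to_bytes(4, "big"))
    return h.hexdigest()


def copy_is_current(local: MabMapping, announced_digest: str) -> bool:
    """Return True if the local copy matches the digest sent by the RUAS."""
    return mab_digest(local) == announced_digest
```

Sorting before hashing matters: the RUAS and the QSD must serialise
the same mapping identically, regardless of the order in which
updates were applied.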
During initialisation, and if an error is found in the local copy of
the mapping for a particular MAB, the QSD downloads snapshots from
HTTP servers provided by the RUAS companies. The QSD buffers all
updates for the MAB which arrive after the snapshot and hash message.
Once the snapshot is downloaded and unpacked into the QSD's copy of
the mapping database, the buffered updates are applied and the
database then contains an up-to-date copy of mapping for this MAB.
Updates are then applied as they arrive from the two or more upstream
Replicators.
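The snapshot-then-replay sequence can be sketched as follows. The
sketch assumes that updates and snapshots carry a comparable sequence
number so that the QSD can tell which buffered updates postdate the
snapshot; that field, and the simplified update tuple, are
hypothetical.

```python
from typing import Dict, List, Tuple

Update = Tuple[int, int, int]   # (sequence number, micronet start, new ETR address)


def rebuild_mab(snapshot: Dict[int, int], snapshot_seq: int,
                buffered: List[Update]) -> Dict[int, int]:
    """Load a snapshot, then replay buffered updates that postdate it."""
    mapping = dict(snapshot)                 # micronet start -> ETR address
    for seq, start, etr in sorted(buffered): # replay in sequence order
        if seq > snapshot_seq:               # older updates are already in the snapshot
            mapping[start] = etr
    return mapping


# Two updates arrive after the snapshot (sequence 50); one predates it.
rebuilt = rebuild_mab({100: 1, 200: 2}, 50,
                      [(52, 100, 9), (49, 200, 7), (51, 100, 8)])
```

Replaying in sequence order ensures that the later of the two updates
to the same micronet is the one which remains in the database.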
1.2. Assumptions
For the purposes of this discussion, it is assumed there will be a
single global Ivip system, with multiple organisations being
responsible for the management of the various blocks of address space
which are managed with Ivip.
It would also be possible for an organisation to establish an Ivip-
like system, without reference to any IETF RFCs, and to conduct a
business renting out address space in small, flexible, chunks, with
portability and multihoming via any ISP who provides the requisite,
relatively simple, ETRs. This is the most likely scenario: one or
more independent Ivip-like systems operated by different companies,
primarily to support TTR mobility [TTR Mobility], but also usable
for portability, multihoming and inbound Traffic Engineering for
non-mobile end-user networks.
For simplicity, this ID assumes that Ivip development will be
coordinated into a single global system, as DNS is, following
appropriate IETF engineering work and administrative decisions in
RIRs and other relevant organisations. A development timeframe of
2010 to ca. 2014 is assumed, with widespread deployment being
achieved later in the decade, for IPv4 at least.
The IPv4 FMS is identical in principle to the IPv6 FMS. The server
software which implements the Replicators will probably remain as two
separate items, but a single server could run them both,
independently, and so be both an IPv4 and IPv6 Replicator. Each RUAS
would have both IPv4 and IPv6 sections, with separate outputs of
mapping data. The Launch servers for IPv4 would be physically
different and independent of those for IPv6.
In addition to the global fast push database update distribution
system discussed in this ID, Ivip also involves Query Servers sending
"notifications" to ITRs which recently requested mapping for a
micronet whose mapping has just changed. This is a second form of
push - on a local scale - and is outlined in [I-D.whittle-ivip-arch].
This ID concentrates on IPv4, since the future map-encap scheme is
more urgently required for IPv4 than for IPv6. In principle, the
same arrangements will apply for IPv6, with a different and more
verbose data format than the 12 or so bytes required for each IPv4
mapping update. It may make sense to defer finalisation of any
future IPv6 map-encap scheme until substantial operational experience
has been gained with the IPv4 scheme.
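As a rough sense of scale, the Abstract's estimate of 4 billion
mapping changes a year can be combined with the per-update sizes
mentioned here. The calculation below is illustrative only: it uses
the Abstract's 32kbps IPv6 figure to back out the implied IPv6 update
size, and treats the "12 or so bytes" IPv4 figure as exact.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60          # 31,536,000

updates_per_year = 4_000_000_000               # estimate from the Abstract
ipv6_rate_bps = 32_000                         # IPv6 raw data rate from the Abstract
ipv4_bytes_per_update = 12                     # "12 or so bytes" per IPv4 update

updates_per_second = updates_per_year / SECONDS_PER_YEAR
implied_ipv6_bytes = ipv6_rate_bps / 8 / updates_per_second
ipv4_rate_bps = updates_per_second * ipv4_bytes_per_update * 8

print(f"{updates_per_second:.1f} updates per second")
print(f"{implied_ipv6_bytes:.1f} bytes per IPv6 update (implied)")
print(f"{ipv4_rate_bps / 1000:.1f} kbps raw rate for IPv4")
```

This works out to roughly 127 updates per second, an implied IPv6
update size of about 31.5 bytes, and an IPv4 raw rate of about
12 kbps, before any packet or security overhead.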
1.3. It may not be so daunting...
Ivip documentation is written with a preference for detailed
discussion over terseness. So Ivip IDs may appear rather daunting at
first. Hopefully these IDs will be clearly understandable, and the
reader will recognise that this scalable routing solution is a
momentous development, requiring detailed consideration. Ivip goes
beyond the formal RRG requirements of providing portability (the only
way of allowing free choice of alternative ISPs), multihoming and
inbound traffic engineering, by also providing, through TTR mobility, a
global mobility system for both IPv4 and IPv6. While no mapping
changes are required unless the Mobile Node moves a large distance,
such as 1000km or more, it is important that the Ivip FMS be able to
scale to very large numbers of updates and cope with mapping
databases for up to 10^10 micronets.
This ID focuses on handling billions of micronets and potentially
thousands or tens of thousands of updates a second. These data-rates
may sound high today, but domestic customers are already downloading
full quality video in real-time. By the time such large levels of
adoption arise, the bandwidth needed for these will not be a
significant obstacle.
Also, during initial deployment, the demands on the fast push system
will be far lighter than those anticipated below, so the system might
initially be somewhat simpler. In the initial stages of
introduction, there may be little need to deploy dedicated servers
for the "Replicator" functions, since the volume of updates may be so
light as to make it practical to run this software on existing
servers, such as nameservers.
Furthermore, in the early years of introduction, when there are
hundreds of thousands or a few million micronets, the low level of
update packets (compared to the highest imaginable levels
contemplated below) should enable each Replicator to fan out to many
more next-level Replicators than would be possible when hundreds of
millions or billions of micronets are handled by the system. This
would mean fewer levels of Replicators, fewer Replicators and
generally faster delivery of the mapping information than would be
possible with current technology if the system was handling billions
of micronets.
So this ID explores how the FMS would be structured in the most
demanding future scenarios which can be realistically expected.
Building the initial FMS for trials and early services won't be as
daunting as it may look from the diagrams and discussions below.
2. Goals, Non-Goals and Challenges
2.1. Goals
The overall goal of the fast push system is to enable end-users, who
manage the mapping of their one or more micronets of address space,
to securely, reliably and easily communicate their mapping change
command to some organisation with which they have a business
relationship, so that that change will be propagated to every QSD as
soon as possible.
"As soon as possible" means typical delay times of a few seconds,
ideally zero seconds, but in practice probably four to five seconds.
(Most of this delay is in the RUAS and Launch systems, which could be
optimised in the future to process the updates much faster than this,
without affecting the much larger Replicator system.)
"Reliably" means that in the great majority of cases, the QSDs
receive every mapping change as expected and that in the relatively
rare event of this being impossible due to packet loss, that the QSD
can recover from this situation within one or at the most two seconds
by requesting a copy of the packet from a remote HTTP server provided
by the RUAS company whose mapping update packet was lost.
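Detecting which packets need to be fetched could look like the
following sketch. It assumes that packets within a one-second batch
are numbered consecutively and that the QSD knows the first and last
numbers of the batch; neither detail is fixed by this ID, and the
HTTP retrieval itself is not shown.

```python
from typing import Iterable, List


def missing_packets(received: Iterable[int], first: int, last: int) -> List[int]:
    """Return sequence numbers in [first, last] that never arrived on any feed."""
    got = set(received)                  # duplicates from multiple feeds collapse here
    return [seq for seq in range(first, last + 1) if seq not in got]


# Packets 3 and 5 were lost on both feeds; packet 4 arrived twice.
to_fetch = missing_packets([1, 2, 4, 4, 6], first=1, last=6)
```

Each number in the resulting list would then be requested from the
missing packet server of the relevant RUAS company.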
Reliability also involves robustness against DoS attacks. This can
never be completely protected against for any device on the open
Internet, since its link(s) can easily be flooded by packets sent
from botnets etc. A workaround for DoS attacks would be to run the
first few levels of Replicators via global private network links.
These levels would be owned and operated by the RUAS companies
working together. This would enable reliable feeds to hundreds or
perhaps a thousand or so Replicators all over the Net, which would
mean that a DoS attack against a small number of Replicators could
only affect a small portion of the total system.
"Securely" means that each QSD which receives the updates will be
able to instantly verify that the updates are genuine, rather than
the result of an attacker who might, for instance, send forged
packets to that device or to some other part of the fast push system.
The data format for the mapping update packets is TBD. It is
possible that each packet's contents could be signed by the RUAS
which originated it. In the present design, the use of DTLS (RFC 4347)
links between Launch servers, Replicators and QSDs is assumed to
provide sufficient security. The data format needs to provide for
open-ended extensions in the future and to support authentication at
the time of reception.
The mapping change command, as sent by the end-user, or by some other
organisation or device which has the end-user's credentials, would
involve the length of the micronet being checked to ensure it is the
same as the currently configured length of the micronet which starts
at that location. The end-user's command might be part of an
encrypted exchange involving a challenge-response protocol and the
end-user's private key. Alternatively, an encrypted link could be
used, such as via HTTPS, and a conventional username and password
given as part of the command.
The end-user would previously have communicated directly or
indirectly with their RUAS to configure their total assigned address
space into one or more micronets. This ID concentrates on the
changes of ETR address for existing micronets, but the mapping change
packets will also contain information about how existing micronets
have been deleted and replaced by other micronets, smaller or larger
and with different start and end-points.
RUASes and the multiple servers of the Launch system are few in
number and will be administered carefully, so this ID does not
consider automated aids to their management and debugging. However,
the Replicators will be numerous and operated by a wide range of
organisations. Future work will concern maximising the degree to
which the Replicator system can be robustly and easily managed,
rather than requiring a great deal of manual configuration etc.
In order to debug the way the Ivip system is used, such as transient
erroneous or malicious mapping updates which cause packets to be
tunnelled to addresses where they are not welcome, there will need to
be a system which monitors all mapping changes and keeps a lasting
record of them. Then, aggrieved parties can search such a system for
the address on which they received the unwanted packets, and so
determine the micronet involved. This will enable the aggrieved
party to complain to the RUAS which is responsible for that micronet.
This "mapping history" function could be performed by one or multiple
separate systems, each simply taking a feed from the Replicator
system.
2.2. Non-goals
Apart from checking the ETR address against any specific exclusion
lists (such as specific prefixes, private RFC 1918 and multicast
space) and to ensure it is not part of a Mapped Address Block (MAB -
a BGP advertised prefix containing micronets), the entire Ivip system
takes no interest in whether there is a device at that address,
whether the address is advertised in BGP, whether there is or was an
ETR at that address, whether the ETR is reachable or whether the ETR
can deliver packets to the micronet's destination device.
These are all matters which fall under the responsibility of the end-
user network whose micronet this ETR address is for.
It is not a goal of the system to keep mapping changes secret from
any party. This would be impossible. Therefore, it cannot be a goal
of this or probably any core-edge elimination scheme that in a mobile
setting, the movement of an individual's device could not be inferred
by anyone who monitors the mapping updates. However, the mapping
only concerns the currently active TTR. MNs can still use a TTR no
matter where they are physically connected, and using a TTR hundreds
or even thousands of km distant will probably present no serious
difficulties due to path-length or lost packets. So mapping changes
need not indicate much, or anything, about the physical location of
the MN.
Replicators perform a best-effort copying of mapping update packets.
They do not store these packets for any appreciable time or attempt
to request a packet in the sequence which is missing from their two
or more input streams.
2.3. Challenges
There are obvious challenges building a global network which is
distributed, to avoid any single point of failure whilst also being
highly reliable, coordinated and secure. For this network to
propagate information from one of many input points to a very large
number (potentially millions) of endpoints, with very low levels of
loss, is a further challenge on the open Internet.
The Launch system and Level 1 and perhaps Level 2 Replicators could
operate over private network links. However, the final levels of the
Replicator system - those which drive the QSDs - need to operate on
the open Internet, as do the end-users' methods of interaction with
the RUASes, directly or indirectly.
The closest existing technology to what is required may be Reliable
Multicast, but this is optimised for long block lengths. This
technology should be considered in greater depth as an alternative to
what is proposed here, but the rest of this ID is based on the
assumption that novel techniques are required.
3. Definition of Terms
3.1. SPI - Scalable PI space
Once Ivip is operational, a growing subset of the global unicast
addresses will be handled by ITRs tunnelling the packets to an ETR,
which delivers the packets to the destination. This subset is used
by end-user networks and provides portability, multihoming and
inbound traffic engineering in a manner which is highly scalable: it
does not overly burden DFZ routers.
SPI space is "mapped" by Ivip and this mapping system can divide it
into smaller sections than is possible with BGP in the DFZ - a 256 IP
address granularity for IPv4, due to a widely enforced convention on
the lengths of routes which are accepted.
The granularity with which Ivip maps SPI space - dividing it into
micronets (described below) - is single IP addresses for IPv4, and /64
prefixes for IPv6.
3.1.1. Conventional global unicast address space
This is global unicast address space as it is used today. With Ivip,
this will be a subset of the full unicast space - the part which is
not used for SPI space. The LISP term for this is "RLOC" space.
3.2. MAB - Mapped Address Block
A MAB is a BGP advertised prefix which is used as SPI space. DITRs
(Default ITRs in the DFZ) all over the Net advertise this prefix,
tunnelling the packets to ETRs according to the current mapping for
the destination address of each packet.
A MAB could, in principle, be as large as a /8. Larger MABs are
preferred in general, because each one burdens the BGP system with
only a single advertisement, but includes the SPI space of
potentially hundreds of thousands of end-user networks. However, for
reasons discussed below - including load sharing between ITRs and
ease of initially loading snapshots of the mapping database - it may
be best if MABs are more typically in the /12 to /17 range for IPv4.
3.3. UAB - User Address Block
Each MAB typically contains address space which has been assigned by
some means to many (perhaps tens of thousands) separate end-users. A
UAB is a contiguous range of addresses within a MAB which is assigned
to one end-user. UABs are important divisions for the RUAS company,
but UABs are not specifically mentioned or needed in the mapping
update packets handled by Launch servers and Replicators. Nor are
UABs relevant to the operation of QSDs, QSCs (caching query servers),
ITRs or ETRs.
A MAB could be assigned entirely to one end-user - as might be the
case if the end-user converted a prefix of theirs which was
previously conventional PI space to be managed as SPI space by the
Ivip system. Generally speaking, MABs are ideally large (short
prefixes) and each contains space for multiple end-users. Generally,
MABs are owned or at least administered by MAB companies, who rent
SPI space to end-user networks.
An end-user might have multiple UABs in a MAB, UABs in multiple MABs
from the same company or UABs in MABs from multiple MAB companies.
For simplicity, this ID assumes each end-user has a single UAB.
UABs are specified by starting address and length, in the units
mentioned above: IPv4 addresses or IPv6 /64s. While a MAB is always
on power of two boundaries of these units, since it is a prefix
advertised in the DFZ, UABs and micronets have arbitrary starting
points and lengths - they are not at all constrained by binary
"prefix" boundaries.
3.4. Micronet
Following Bill Herrin's suggestion, the term "micronet" refers to a
range of SPI space for which all addresses have the same mapping. In
LISP, these are known as EID prefixes. In Ivip, a micronet need not
be on binary boundaries - it is specified by a starting address and a
length, in units of single IPv4 addresses or IPv6 /64 prefixes.
An end-user could use their entire UAB as a single micronet, or they
could split it into as many micronets as they wish, and change these
divisions dynamically.
Any micronet which is mapped to zero (its ETR address is 0.0.0.0 in
IPv4) will cause ITRs to drop any packets addressed to this micronet.
A micronet can be defined within the whole or part of a contiguous
range of address space which is currently mapped to zero, by the fast
push mapping distribution system carrying an update message
specifying the new micronet's starting address, its length, and a
non-zero address for its mapping. (Future work: decide exactly what
instructions are needed and which sequences of operations are
allowable for making new micronets in place of existing ones.)
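Because micronets are defined by a start and a length rather than by
a prefix, a lookup structure cannot rely on longest-prefix matching.
The following sketch shows one possible approach, a sorted table
searched with bisection. The table layout and function names are
hypothetical, and the addresses are documentation addresses chosen
purely for illustration.

```python
import bisect
import ipaddress
from typing import List, Optional, Tuple

# Sorted, non-overlapping (start, length, etr) entries for one MAB, as integers.
Micronets = List[Tuple[int, int, int]]


def ip(text: str) -> int:
    """Convert dotted-quad text to an integer."""
    return int(ipaddress.IPv4Address(text))


def lookup_etr(table: Micronets, addr: str) -> Optional[str]:
    """Return the ETR address for addr, or None if packets should be dropped."""
    a = ip(addr)
    starts = [start for start, _, _ in table]
    i = bisect.bisect_right(starts, a) - 1     # last micronet starting at or before a
    if i < 0:
        return None
    start, length, etr = table[i]
    if a >= start + length or etr == 0:
        return None                            # beyond the micronet, or mapped to 0.0.0.0
    return str(ipaddress.IPv4Address(etr))


# Micronets of length 3, 5 and 1: none is constrained to a power of two.
table = [
    (ip("203.0.113.0"), 3, ip("192.0.2.1")),
    (ip("203.0.113.3"), 5, 0),                 # mapped to zero: ITRs drop packets
    (ip("203.0.113.8"), 1, ip("198.51.100.7")),
]
```

Here `lookup_etr(table, "203.0.113.2")` returns the first micronet's
ETR address, while any address in the zero-mapped second micronet
returns `None`.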
3.5. RUAS - Root Update Authorisation System
Multiple RUASes collectively generate the total stream of mapping
update messages. Each RUAS is responsible for one or more MABs.
There may be a dozen to a few dozen RUASes. (Having more RUAS companies
is good for competition and innovation, but creates some difficulties for
the Launch servers, which must reach an agreement on which updates to
send from all these RUASes.) Each RUAS either receives mapping
updates directly from end-user networks (or their appointed
Multihoming Monitoring companies) - or may receive these indirectly via
intermediate organisations, each of which runs a UAS.
3.6. UAS - Update Authorisation System
A UAS is the system of an organisation which accepts mapping change
commands from end-users, and conveys them directly - or perhaps
indirectly via another UAS - to the RUAS which handles the relevant
MAB. An RUAS which accepts mapping update commands from end-users
does so via its own UAS system.
A UAS accepts upstream input from end-users and/or other UASes. It
generates output to downstream RUASes and/or other UASes. One UAS
may have relationships with multiple RUASes. A MAB may be assigned
to an RUAS and control of parts of this may be delegated to multiple
UASes. A single UAS may work only with a single RUAS, or with
multiple and perhaps all RUASes.
Whether the MAB itself is administratively assigned (by an RIR, or
some national Internet Registry) to the UAS or to the RUAS is not
important in a technical sense. End-users will take care in choosing
address space according to the RUAS (and any UASes) it depends upon,
because the reliability of this MAB's address space will forever be
dependent on these organisations.
The number of RUASes will be limited to enable them to efficiently
and reliably work together with their jointly operated system of
Launch servers to create a single stream of updates for the entire
Ivip system. The ability of companies with UASes to act as agents
for RUAS companies and/or to have their own MABs which they contract
a RUAS to handle the mapping for, will enable a large number of
organisations to compete in the rental of SPI space.
3.7. UMUC - User Mapping Update Command
A UMUC is whatever action the end-user performs on one or more
different user-interfaces of whatever UAS they use to change the
mapping of their one or more micronets. The system would also be
able to tell the user the current mapping and also confirm that a
requested change to the mapping was acceptable. In other words, the
system lets end-user networks (and/or whichever Multihoming
Monitoring company they contract to control the mapping of their
micronets) "see" (server-to-human and server-to-server) how their
UAB is broken into micronets and what ETR addresses those micronets
are mapped to.
The system could also provide diagnostics such as testing the
reachability of their network via one or more ETR addresses. The
system would also enable trialling mapping changes and altered
micronet boundaries without actually executing the changes - so the
end-user network operators can check that their proposed changes are
valid before actually making them.
QSDs will only accept certain kinds of updates, and it is vital that
the mapping updates are applied in the order they are sent - and that
these updates are in themselves valid. For instance, it may be best
(from the point of view of QSDs sending updates to their queriers,
and therefore directly or indirectly to ITRs) for micronets to be
mapped to an ETR address of 0.0.0.0 before being split or joined.
In addition to testing proposed changes for validity, the UAS system
should be able to combine multiple updates into a single set, to be
executed in order, but at the same time. The complete set would be
sent on the FMS in a single second. For instance, one set could map
the ETR address of an 8-address micronet to zero, split it into three
smaller micronets and then set the ETR address of each.
When testing proposed changes, or deciding whether to accept changes
which have been ordered with the end-user network's credentials, the
UAS system would generate an error if the mapping was to a disallowed
address - multicast, SPI space, private address space or some other
prefix to which the Ivip system does not support the tunnelling of
packets. Similarly, an error would be generated if the end-user
attempted to change the mapping for some address space outside their
UAB, or if they defined a new micronet within that space with
non-zero mapping, or one which overlapped some addresses for which
the mapping was currently non-zero.
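The checks in the preceding paragraph could be sketched roughly as
follows. This is only an illustration under stated assumptions: the
function name check_umuc, the error strings, the tuple layout of
existing micronets and the SPI_PREFIXES list are all hypothetical,
and IPv4 is assumed for simplicity. Note that Python's is_private
test covers more special-purpose ranges than RFC 1918 space alone.

```python
import ipaddress

# Assumption: SPI space is known to the UAS as a list of prefixes. The
# prefix here is only the example MAB used later in this document.
SPI_PREFIXES = [ipaddress.ip_network("20.0.0.0/14")]


def check_umuc(start, length, etr, uab, existing):
    """Return None if the command is acceptable, else an error string.

    start and etr are dotted-quad strings, length is an address count,
    uab is a (first, last) pair of strings, and existing is a list of
    (first_int, last_int, etr_int) micronets already defined.
    """
    first = int(ipaddress.IPv4Address(start))
    last = first + length - 1
    uab_first, uab_last = (int(ipaddress.IPv4Address(a)) for a in uab)
    if length < 1 or first < uab_first or last > uab_last:
        return "micronet is outside the UAB"

    target = ipaddress.IPv4Address(etr)
    zero = int(target) == 0          # 0.0.0.0 means "not mapped"
    if not zero:
        if target.is_multicast or target.is_private:
            return "ETR address is multicast or private"
        if any(target in net for net in SPI_PREFIXES):
            return "ETR address is in SPI space"

    is_existing = False
    for m_first, m_last, m_etr in existing:
        if (m_first, m_last) == (first, last):
            is_existing = True       # remapping a micronet already defined
        elif first <= m_last and m_first <= last and m_etr != 0:
            return "overlaps a micronet with non-zero mapping"
    if not is_existing and not zero:
        return "new micronet must start with zero mapping"
    return None
```

A UAS could run the same function both for trial requests and for
real ones, executing the change only when the result is None.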
For the sake of discussion, it will be assumed that all UMUCs have
passed these validity tests at the UAS and are for valid mapping
addresses - so a UMUC is a successfully accepted update command from
the end-user, or from some person or system with the end-user's
credentials.
There could be many methods by which this command is communicated,
including HTTPS web forms with username and password authentication.
SSL sessions might be more suitable for automated mapping change
systems, such as those of a Multihoming Monitoring company which the
end-user authorises to control the mapping of some or all of their
UAB.
In addition to authentication, the command takes the form of the
starting address of the micronet, the length of the micronet, and the
single ETR IP address to which this micronet's mapping will be
changed.
3.8. SUMUC - Signed User Mapping Update Command
This is the information contained in a UMUC, signed by the UAS which
accepted it from the user (or by some other UAS), being handed down
the tree to another UAS or to the RUAS of the tree, so that the
recipient UAS/RUAS can verify the signature and regard the UMUC as
authoritative.
3.9. MABUS - Update Stream specific to one MAB
This is a stream of data by which the real-time updates to the
mapping data for any one MAB are conveyed. For the purposes of
discussion, the RUASes and the Launch system are assumed to work in a
synchronized fashion, generating a body of updates for each MAB once
a second. (Probably the case of no updates will be codified
specifically in the update stream, rather than just resulting in no
mention of the MAB.)
Each RUAS will generate one MABUS for each of its MABs. So each
second, the RUASes collectively generate a variable length body of
update information for every MAB in the Ivip system. The MABUS
includes mapping changes (altering ETR addresses of existing
micronets), changes to micronet boundaries and snapshot messages
(described above). The data format would be extensible for purposes
not yet anticipated.
The contents of each MABUS may be digitally signed at some stage
(before or after being broken up and assembled with other MABUSes
into multiple packets). This applies if the Ivip design involves the
QSDs being able to authenticate all the mapping changes, snapshot
messages etc. they receive for each MAB, such as via the public key
of the RUAS which is responsible for this MAB.
3.10. Launch server
A small number (such as 8) of widely dispersed Launch servers are
operated by the RUASes and work together to generate, every second,
multiple identical streams of packets to Replicators in the first
level (1) of the Replicator system. Each Launch server receives its
input in the previous second from the RUASes.
3.11. Replicator
A cross-linked, tree-like system of Replicators forms a redundant,
reliable, high-speed distribution system for delivering mapping
updates to full database ITRs and Query Servers all over the Net.
Each Replicator receives one or more (typically two) streams of
update packets from upstream Replicators or Launch servers. These
source streams should come from widely topologically separated
sources, ideally over two separate physical links. For instance, a
Replicator in Berlin might receive its update streams from London and
Berlin, from two sources in Berlin which are in different ISP
networks, or from any combination which minimises the likelihood that
both sources will be disrupted by any one fault.
The Replicator identifies the packets in each input stream by a
simple sequence number at the start of the payload, and another
number indicating which second in time the packet belongs to. The
Replicator uses data in the received packets to tell it how many
packets to receive in each second. For packets of each sequence
number in a given second, the first packet to arrive with this
sequence number has its data extracted from the DTLS packet, and this
data is used to create a separate DTLS-protected packet for each of
the 20 or so downstream Replicators on the next (numerically 1
greater) level.
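The first-arrival rule described above might look something like the
following sketch. The header layout (a 32-bit second followed by a
16-bit sequence number), the class name and the use of plain callables
as output streams are assumptions made for illustration; DTLS
handling is left out entirely and the payload is taken to be already
decrypted.

```python
import struct

# Assumed header: 32-bit second number, then 16-bit sequence number.
HEADER = struct.Struct("!IH")


class Replicator:
    def __init__(self, downstream):
        self.downstream = downstream   # one callable per output stream
        self.seen = set()              # (second, seq) pairs already sent on

    def receive(self, payload):
        """Handle one payload from either upstream feed.

        Returns True if it was forwarded, False if it was a duplicate.
        """
        second, seq = HEADER.unpack_from(payload)
        if (second, seq) in self.seen:
            return False               # the other feed delivered it first
        self.seen.add((second, seq))
        for send in self.downstream:
            send(payload)              # real code would re-protect with DTLS
        return True

    def expire(self, oldest_second):
        """Forget packets from seconds earlier than oldest_second."""
        self.seen = {k for k in self.seen if k[0] >= oldest_second}
```

Only a small set of recently seen numbers is kept, which is
consistent with Replicators needing no hard drive storage.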
In this way, unless the same numbered packet is lost from both input
streams, each Replicator receives the full set of mapping update
packets for this second, and sends them to tens or perhaps hundreds
of downstream devices, which are other Replicators, or QSDs. 20
output streams is assumed in examples below. Since the recipient
Replicators are assumed to receive two streams, each level of
Replicators in these examples has an amplification factor of 10.
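As a rough worked example of this amplification, the following
sketch counts how many Replicator levels would be needed to feed a
given number of devices. The function name and the figure of 100
first-level Replicators are purely illustrative assumptions; only
the 20 outputs and 2 inputs come from the examples in this document.

```python
def levels_needed(targets, first_level, outputs=20, inputs=2):
    """Replicator levels needed so the last level can feed `targets`
    devices, each of which receives `inputs` streams."""
    gain = outputs // inputs          # 20 / 2 = 10 per level
    levels, width = 1, first_level
    while width * gain < targets:     # this level cannot feed them all
        width *= gain                 # so add another level below it
        levels += 1
    return levels
```

With 100 first-level Replicators, four levels would be enough to give
one million QSDs two streams each.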
The receive and send links use DTLS, which prevents an attacker from
spoofing these packets and so altering the behavior of ITRs.
Replicators could be implemented in routers, but are probably best
implemented in ordinary software on a COTS (Commercial Off The Shelf)
server running GNU/Linux, BSD etc. Replicators do not cache
information and need no hard drive storage. A server performing as a
QSD could also operate as a Replicator.
3.12. QSD - Query Server with full Database
QSDs get a full feed of updates from one or more Replicators. When
they boot, they download individual snapshot files for each MAB in
the Ivip system.
QSDs respond immediately to queries from nearby ITRs and from caching
Query Servers (QSCs) - and send notifications to these if mapping
data changes for a micronet which was the subject of a recent query.
QSDs have no routing or traffic handling functions. In a full-scale
billion-plus micronet deployment they need a lot of memory, so the
best way to implement a QSD is probably on an ordinary server with
one or more gigabit Ethernet interfaces. No hard drive is required,
except perhaps for logging purposes.
3.13. QSC - Query Server with Cache
A QSC could be implemented in a router or more likely a COTS server.
It does not route packets, and its memory and computational
requirements are likely to be modest compared to those of a QSD.
There is no need for a full feed of updates from the Replicator
system. However, each QSC must be able to get mapping information
from one or more upstream QSDs - or via upstream QSCs which
themselves access upstream QSDs.
The easiest way to implement a QSC would be software on a modest
server, which would only need a hard drive for logging purposes.
4. Update Authorities and User Interfaces
This section is a detailed discussion of the fast push mapping
distribution system itself, starting with the systems which accept
commands from end-users (or their authorised representatives or
systems) and prepare the information for the Launch system.
This is the early stage of an ambitious design, so a number of
options are contemplated. This section of the system may not need
IETF standardised protocols, since only a small number of
organisations need to interact to make it work. The Replicators and
the data format of mapping updates do need to be standardized. The
purpose of exploring the RUAS and Launch server systems is to
estimate the difficulty of constructing them - and hopefully to show
that an approach like this is feasible and desirable. There may well
be easier approaches than the ones explored here.
Probably the closest thing to them would be the large scale systems
for managing DNS, such as for .com and other major TLDs. I don't
know anything about these and people with experience in such systems
could probably design the UAS, RUAS and perhaps Launch server systems
better than I could.
The real-time nature of these systems of controlling ITR behavior has
no precedent. Generally, the system should work on a continual
basis. However, if there is a technical problem or the system is
stopped for a few minutes to do an upgrade or whatever, the Internet
is not going to grind to a halt. In that downtime, end-user networks
which experience a multihoming failure will have to wait for their
connectivity to be restored. Likewise, end-user networks which send
mapping changes for inbound TE will have to wait. The effect on TTR
mobility would be minor, since mapping changes are not required when
the MN changes its physical connections, including when moving to an
entirely different access network. The delay in mapping changes
means that those few MNs which have chosen a new, closer, TTR will
need to wait for traffic to be tunneled to that new TTR - meaning
they will need to keep up the tunnel to the old, and now more
distant, TTR for these minutes. Normally, with mapping changes
getting to ITRs in a few seconds, the MN could terminate the tunnel
to the old TTR within a few seconds of the ITRs beginning their
tunneling to the new TTR.
The final authority to control mapping information is fully devolved
to end-users, who by means of a username and password or some other
authentication method, are able to issue commands to define micronets
within their UAB, and to map each micronet to any ETR address.
However the physical authority to control the mapping of all Mapped
space within a single MAB rests with a single RUAS. That RUAS may be
acting for a UAS which administers a MAB. The RUAS may administer
it - perhaps on behalf of another company - and may delegate control
of parts of it to one or more UASes. The RUAS may have relationships
directly to the end-users of this MAB, through its own UAS. Here we
discuss the flow of information and trust between these various
entities, in real-time, so that every second (for example - the
actual time period will need to be carefully considered) each RUAS
assembles a body of update information for each of its MABs.
In the diagrams below, each RUAS or UAS is depicted as a single
entity. Each such entity acts as a single functional block, but will
typically be implemented as a redundant system over several servers.
4.1. RUAS Outputs
4.1.1. Updates every second
Every second (or some other time-period not exceeding two or three
seconds), for each MAB the RUAS is authoritative for, the RUAS
generates a set of mapping updates, and works with other RUASes to
integrate this into the next second's output from the Launch system.
As previously mentioned, these updates are primarily actual mapping
updates for individual micronets within the MAB, but also contain
occasional messages to the effect that a snapshot of this MAB's full
mapping database has been made and is, or soon will be, available via
various servers.
4.1.2. MAB snapshots
Every few minutes (or some other time period, as chosen by the RUAS,
but with some reasonable maximum defined by a BCP) the RUAS makes a
copy of the complete mapping information for a MAB. Snapshots for
each MAB are independent of each other, and so can be done with
different frequencies.
The snapshot is in a format which needs to be standardized, so it can
be downloaded and understood by any ITRD or QSD, now and in the
future. This data format needs to be extensible to cover new kinds
of mapping information and other functions not yet anticipated -
which will be ignored by devices which are not capable of these
functions.
The exact format is for future work, but it might for instance
begin with some identifying information about the MAB, a block
defining that the following data concerns IPv4 micronet mapping
information (and snapshot announcements), with the possibility of
other blocks containing different kinds of data. Binary format would
probably be best, and the file could then be compressed with gzip
etc.
Each such file will be given a distinctive name, according to a
standardised format, which indicates at least the MAB starting
address and length, and the time of the snapshot.
The snapshot process will take a second or two to complete from the
time it is initiated, and the resulting file will be copied to a
number of servers, ideally located in a variety of locations around
the Net.
Each such server would be run by the RUAS directly, or as part of all
RUASes working together. The servers can probably be conventional
HTTP servers, so that QSDs can download the snapshots when needed.
There is scope for some careful design with DNS so that there is an
automatic structure in the domain names of these servers, enabling an
expandable system to be automatically used by QSDs without manual
configuration.
These files will be publicly available, and need to be made available
for somewhat longer than the cycle time of snapshots. So with a ten
minute snapshot cycle, the previous snapshot should be available for
a while - probably 10 minutes or so - after the new one is available.
Snapshots are downloaded by QSDs when they boot, and if they suffer a
disruption in mapping updates which necessitates a reload of this
part of the complete mapping database. To facilitate this, MABs
should not be too large in terms of IPv4 addresses or IPv6 /64s - or
at least should not contain too many micronets - since this would
make individual snapshot files excessively large.
At boot time, or when re-synching, the QSD will monitor the update
streams for each MAB until a snapshot announcement is found. It will
then buffer all subsequent updates and download the snapshot as soon
as it is available. Once the snapshot has arrived, and been unpacked
to RAM, the buffered updates are applied to it. Then, this MAB's
part of the mapping database is up-to-date and the ITR can begin
advertising this MAB, and therefore tunnelling all packets which are
addressed to this MAB.
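The boot-time sequence for one MAB can be pictured as a small state
machine. The sketch below is an assumption-laden illustration: the
class name, the state names and the tuple form of updates
(("snapshot",) for an announcement, ("map", start, length, etr) for a
mapping change) are all hypothetical, and downloading the file is
left to the caller.

```python
class MabSync:
    """Bring one MAB's mapping up to date from a snapshot plus the
    updates which arrive while the snapshot is being fetched."""

    def __init__(self):
        self.state = "WAIT_ANNOUNCE"
        self.buffer = []
        self.mapping = None            # {(start, length): etr} once loaded

    def on_update(self, update):
        if self.state == "WAIT_ANNOUNCE":
            # Earlier updates are already reflected in the snapshot.
            if update[0] == "snapshot":
                self.state = "BUFFERING"
        elif self.state == "BUFFERING":
            self.buffer.append(update)
        else:
            self._apply(update)

    def on_snapshot_loaded(self, mapping):
        """Call once the snapshot has been downloaded and unpacked."""
        self.mapping = dict(mapping)
        for update in self.buffer:     # replay in arrival order
            self._apply(update)
        self.buffer.clear()
        self.state = "READY"

    def _apply(self, update):
        if update[0] == "map":
            _, start, length, etr = update
            self.mapping[(start, length)] = etr
```

Only once the state is READY would this MAB's part of the database
be treated as usable.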
In order to reduce total path lengths for these file downloads, and
likewise for retrieving missing packets from the same servers, it
would be desirable if each QSD in a given location could access a
nearby snapshot server. It may be desirable to have every snapshot
of every MAB in a single server, or a single set of servers which are
accessed by geographically close QSDs. Anycast is not a good
technology for this, since file retrieval is best done via TCP
sessions. The ITR system itself can't be used, to avoid circular
dependencies - so the servers must be on conventional addresses.
Likewise, any DNS servers involved in this server system need to be
strictly on conventional addresses.
Each QSD needs to be configured with, or to automatically discover,
two or more such servers - at least one of which is relatively close
- so the data can be found despite one server being down.
From the point of view of the QSD, seeking a snapshot for a given MAB
of a particular RUAS, the address to request the file from could be
made up from the RUAS identifier yyyy which is contained in the
snapshot announcement (in the stream of mapping updates),
concatenated with a locally configured "xxxxx" and
"ipv4.ivipservers.net". In the event that this server was
unavailable, one or more locally configured alternatives to this
initial "xxxxx" value could be tried - including one or more for
nearby countries.
The most significant 24 bits of the MAB's starting address (probably
48 bits for IPv6, assuming this is the granularity of BGP
advertisements) would be transformed into a text string such as
150.101.072. A similar transformation of the precise time of the
snapshot would result in a second text string, and these would be
used to reliably identify the appropriate directory and file in the
server.
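One way the naming scheme just described might be realised is
sketched below. The order in which the labels are joined, the
timestamp format and the ".snap" suffix are assumptions, since the
text leaves them open; the function names are hypothetical.

```python
import time


def snapshot_host(ruas_id, local_label, suffix="ipv4.ivipservers.net"):
    """Join the RUAS identifier, the local label and the fixed suffix.

    Assumption: RUAS identifier first, then the local label.
    """
    return f"{ruas_id}.{local_label}.{suffix}"


def mab_label(start):
    """Top 24 bits of an IPv4 MAB start address as zero-padded text."""
    a, b, c, _ = (int(part) for part in start.split("."))
    return f"{a:03d}.{b:03d}.{c:03d}"


def snapshot_path(start, unix_time):
    """Directory and file name for one snapshot (assumed layout)."""
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime(unix_time))
    return f"/{mab_label(start)}/{stamp}.snap"
```

For a MAB starting at 150.101.72.0, mab_label gives the string
150.101.072 used as the example in the text.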
4.1.3. Missing packet servers
The cross-linked tree-structured Launch and Replicator systems should
provide a robust method of delivering the complete set of MAB updates
every second, to every ITRD and QSD. There may be more subtle and
efficient methods than this somewhat brute-force approach, which
involves typically a doubling of the amount of update traffic in the
pursuit of robustness. However, the rate of updates will only be
problematic by current standards at a date so far in the future that
the technology of the day will render the task far less daunting than
it would now be.
In the event that an ITRD or QSD misses one or more packets, it will
be able to easily identify which are missing, due to the sequence
numbers built into their payloads. These numbers will transform
easily into an address from which the missing packets can be
retrieved, probably via HTTP, from one of the servers described
previously which provide snapshots.
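A minimal sketch of this gap detection follows. It assumes that
sequence numbers within a second start at zero and that the expected
count is known from data in the received packets; the URL layout and
both function names are hypothetical.

```python
def missing_sequences(received, expected_count):
    """Sequence numbers not yet seen for one second of updates.

    received is a set of sequence numbers; numbering is assumed to
    run from 0 to expected_count - 1.
    """
    return [n for n in range(expected_count) if n not in received]


def missing_packet_url(host, second, seq):
    """Address of one missing packet file (assumed path layout)."""
    return f"http://{host}/missing/{second}/{seq}"
```

A QSD would call missing_sequences once it believes the second is
complete, then fetch each listed packet from a nearby server.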
4.2. Authentication of RUAS-generated data
Careful consideration must be given to how QSDs can quickly and
reliably ensure that the information they receive ostensibly from
each RUAS is genuine. Perhaps the DTLS links to two upstream
Replicators will be considered good enough. But that places too much
trust in Replicators which are probably controlled by other
organisations than the one running the QSD and its dependent ITRs.
Being able to direct traffic to an attacker's site, by means of
altering the mapping information in an ITR, is such a threat to
security, and such an attractive proposition for attackers, that some
kind of digital signing of the update packets themselves will almost
certainly be required. At this early stage of development, the model
is pretty simple.
4.2.1. Snapshot and missing packet files
Each RUAS has a key pair and signs the MAB snapshot and missing
packet files with its private key. QSDs can verify the signature
with the RUAS's public key, subject to a PKI arrangement of
certificates, or some other simpler arrangement.
Both these types of files are only handled occasionally, so the
overhead in performing crypto operations is insignificant.
4.2.2. Mapping updates
This principle does not apply to the update information contained in
packets received from the Replicator system. It would be onerous to
individually authenticate each packet, or each body of updates from
each RUAS contained in potentially multiple packets. At present, I
can't see an alternative. The system needs to be highly secure
against attack, because even a second or two of an ITR mapping
packets to the attacker's site constitutes an unacceptable breach.
At least two types of attack can be contemplated. Firstly, the
attacker could send spoofed packets to a QSD or Replicator intending
them to be received before the genuine packets. This would be
essentially impossible with DTLS protection of the packets coming
from the upstream Replicator. Secondly, the attacker could somehow
gain control of the upstream Replicators for a given QSD. The
protocols can't protect against an attacker gaining control of a QSD,
RUAS or UAS system. Neither is there any protection against an
attacker who has obtained the credentials to send mapping changes for
the victim's micronets.
The second attack - gaining control of Replicators outside the
network of the QSD - is still credible. Internet communications are
always vulnerable to attackers gaining control of a router. If we
assume or somehow require that Replicators are as robust against
general attack, and have their passwords as closely guarded, as DFZ
routers - then perhaps the level of threat is so similar to the
existing level that no further measures need to be taken.
Here is an exploration of possible attacks and defences. Today, to
snoop on packets, divert packets and/or to perform man-in-the-middle
attacks, an attacker needs to gain control of a router. As far as I
know, this is not a serious problem in the DFZ or in ISP networks
today - but nonetheless, SSH and SSL/TLS are routinely used for the
many transactions which do need to be secure.
If we consider a QSD in the network of ISP-A, and assume the ITRs in
this network, and any connected end-user networks, use this QSD, then
the question is how protocols can protect the QSD's mapping data if
ISP-A does everything right. (If ISP-A is sloppy with security, then
protocols can't protect the QSD itself against being compromised.)
Digital signing of every packet would work fine - but is expensive
and the signature wastes valuable space in each packet. Digital
signing of data in multiple packets looks more attractive.
The reliance on two upstream Replicators, outside ISP-A's network,
and presumably in networks of two other ISPs or transit providers,
might appear to make the attacker's task more difficult. However,
this is not the case. Even a transient success in altering the
mappings would still be a security breach - and if a single
compromised Replicator sent packets a little sooner than the
uncompromised one, then the QSD would never notice that the packets
which arrived second differed from the first ones. The first
received would be used to alter the mapping and so control the
behavior of ITRs. (Any such attack would probably require the QSD to
download a snapshot for the affected MAB - so we must also protect
against such attacks from a DoS perspective.)
It would be possible to have QSDs check the streams of packets
received from both Replicators against each other. This would be
inexpensive, but it would not really help, since the attacker could
launch a flood of packets to temporarily disrupt some router which
carries the packets from the non-compromised Replicator. The QSD has
to operate entirely from one Replicator's packets if the other
Replicator dies. So it seems that the attacker needs to gain control
only of a single upstream Replicator to be successful.
An attacker gaining control of a Replicator one level above the
immediate upstream Replicator might also succeed, since by sending
its packets a little earlier, its packets would be accepted by the
next level and so sent to multiple QSDs.
It is not good enough to detect the forged mapping information after
it has been used to update the mapping database. So it seems there
is no alternative to signing the update packets themselves - or more
likely the contents of multiple such packets as a single unit.
If the body of data to be signed was spread over 5 packets, then the
QSD couldn't use any of this information if a single packet was
missing. Therefore, perhaps the "missing packet" system could be
simplified to work not on packets, but on entire blocks of data - the
same size block which is signed.
Another approach would be to have the Launch system add one or more
packets to the stream, containing an MD5 (or some better function)
hash of either each packet, or each body of update information from
each RUAS. This packet would be signed by some authority - the
consortium formed by the RUAS companies which runs the Launch servers
and first few levels of Replicators. It would be trivial to have a
checksum for the entire second's worth of updates, but then a single
missing packet would make it impossible to check the rest.
Perhaps each RUAS's set of updates can be broken into sections, such
as packets or something typically bigger than packets, with hashes
for each section enclosed in another packet, with that set of hashes
signed by the RUAS.
The MD5 checksums could be sent twice, for robustness, and some care
would be needed in deciding how much update information each one
covers. A separate hash for every packet would be conceptually
simple and enable individual packets to be accepted immediately, even
if another packet was not received and so required a "missing packet"
request. However, this would increase the number of hashes to
transmit.
The current proposal is to have a hash for the updates for each MAB
for which updates are received, which may be less than a packet, or
perhaps more.
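The per-MAB hash idea could be sketched as below. SHA-256 is used
here in place of MD5, as one example of "some better function".
Checking the signature on the hash list itself is assumed to happen
elsewhere and is not shown; the function names and the use of a
dictionary keyed by MAB are illustrative assumptions.

```python
import hashlib
import hmac


def hash_list(updates_by_mab):
    """Sender side: one digest per MAB's body of updates."""
    return {mab: hashlib.sha256(body).digest()
            for mab, body in updates_by_mab.items()}


def verified_mabs(received, hashes):
    """Receiver side: MABs whose update body matches the hash list.

    hashes must already have had its signature checked. A MAB with
    no entry, or with a mismatching digest, is not returned.
    """
    ok = []
    for mab, body in received.items():
        expected = hashes.get(mab)
        actual = hashlib.sha256(body).digest()
        if expected is not None and hmac.compare_digest(actual, expected):
            ok.append(mab)
    return sorted(ok)
```

With this arrangement a missing or forged body affects only its own
MAB, and the updates for every other MAB can still be applied.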
There are multiple ways of solving this problem. I doubt anyone
would argue that it is so difficult as to warrant the abandonment of
the entire fast-push, local query server concept. With more work
later, I believe a satisfactory method can be found for the QSD to
ensure the updates are authentic before applying them.
4.3. RUAS - UAS interconnection
This section depicts a single tree of delegated responsibility for
the user control of mapping of one MAB. The Root UAS at the base of
the tree is run by Company X - RUAS-X. RUAS-X could be authoritative
for other MABs, and each such tree of delegation may have the same
set of other UAS systems, or it could be different. Each delegation
tree is separate from the delegation trees of other MABs, even if
they look similar, because the tree includes specific subsets of the
whole MAB address range as one of the defining characteristics of its
branches and leaves.
The initial action which leads to the database being changed is a
user generated (manually or by the user's equipment or by a system
authorised by the user) UMUC (User Mapping Update Command).
For authorising and feeding UMUCs to the RUAS-X, there is a tree as
depicted in Figure 1. Delegation of authority flows up the tree as
the total address range of the MAB is split at each branching
junction. This tree structure involves data, in the form of SUMUCs
(Signed User Mapping Update Commands) flowing down towards the root
of the tree. (Data would also flow up the tree so each user-
interface leaf could tell end-users what their current mapping was,
could test their requests against constraints etc.) The idea is that
RUAS-X could delegate control of one or more subsets of the MAB's
total range of addresses to some other system, which in turn could
delegate control to other systems. There would be no absolute limit
on the height (usually called depth) of these hierarchies.
The servers which handle the end-user interaction need to be among
the leaves of this tree structure, so as not to burden the RUAS-X
database servers themselves with details of user interaction. This
enables various companies to give different kinds of control for the
mapping of the SPI space their branch of the tree controls. Figure 1
does not show RUAS-X having any user interface servers, but it could.
The simplest arrangement would be the RUAS having simply a user-
interface server and no tree of other UASes.
There would need to be IETF standardised methods by which some server
could execute a UMUC with the user-interface servers of any of these
UASes. This standardisation would be especially important for
multihoming, because some reasonably trusted company could run an
automated monitoring system, and have the credentials (username,
password, key etc.) stored in their system so their system can change
the mapping of one or more micronets the moment one link was detected
to be faulty. It is vital that there be a standardised method by
which all multihoming monitoring companies could send these mapping
change commands (and queries about the current state of mapping) to
UASes. Also, the company (such as X, Y or Z in Figure 1) which
controls a particular range of the Mapped space may offer such a
multihoming monitoring system itself.
The tree in this example controls an MAB with the address range
20.0.0.0 to 20.3.255.255. In this example, company X has been
assigned by an RIR the entire range 20.0.0.0 to 20.3.255.255.
Company X leases to Y a quarter of this: 20.1.0.0 to 20.1.255.255.
These divisions are on binary boundaries, but they need not be. It
would be just as possible for X to delegate to Y an arbitrary subset
of the whole range, or the entire range - or just one IPv4 address or
IPv6 /64.
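The delegation in this example can be checked with simple range
arithmetic, which also shows why binary boundaries are not required.
The helper names are hypothetical, and the rule that delegated ranges
must not overlap one another is an assumption rather than something
stated in the text.

```python
import ipaddress


def to_range(first, last):
    """Inclusive IPv4 range as a pair of integers."""
    return (int(ipaddress.IPv4Address(first)),
            int(ipaddress.IPv4Address(last)))


def can_delegate(parent, child, already_delegated=()):
    """True if child lies inside parent and clashes with no sibling."""
    inside = parent[0] <= child[0] <= child[1] <= parent[1]
    clash = any(child[0] <= d[1] and d[0] <= child[1]
                for d in already_delegated)
    return inside and not clash


def size(rng):
    """Number of addresses in an inclusive range."""
    return rng[1] - rng[0] + 1
```

Here to_range("20.1.0.0", "20.1.255.255") is a quarter of X's range,
and an arbitrary range such as 20.2.0.5 to 20.2.0.9 would be just as
acceptable.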
X's Root Update Authorisation Server (RUAS) has a private key for
signing all the MAB snapshot files it periodically creates and makes
available. The same key would be used for signing the list of hashes
which are used to authenticate the updates for each MAB, as mentioned
previously.
In this example, company Y delegates control of some of its space to
company Z, and Z has an end-user U, who needs to control the mapping
of a UAB containing one or more micronets in Z's range.
Z has various interfaces by which U can do this, with its own
arrangements for authentication, for monitoring a multihoming system
and making changes automatically etc. Ideally there might be one or
more automated, host-to-server, IETF-standardised protocols so all
end users and their appointed multihoming monitoring companies could
have standardised software for talking to whichever company's servers
they use to control the mapping of their IP address(es).
                User-R   User-S  User-T  User-U    Multihoming
                    \        \     |       |       Monitoring
                     \        \    |       |       Inc.
                      \     ...................    /
                       \----. Web interface   .---/
                            . other protocols .
                            . etc.            .
                            .......UAS-Z.......
                                     |
           Other companies           |
           like Y and Z              |
                         /-----<----/
  |    |            \ | /
  |    |             \|/
  |    |            UAS-Y
   \   |              |
    \  |  /----<-----/
     \ | /
      \|/
    RUAS-X   Root Update Authorisation Server company X
      | \
      |  \
      V   \->-[ Multiple web servers for MAB snapshot ]
      |       [ and missing packet files.             ]
      |
      |       Other RUASes like RUAS-X, each authoritative
      |       for mapping one or more MABs and producing
      |       regular MAB snapshots and update streams
      |       which are sent to all Query Servers.
       \
        \         |   |   |        /
         \        |   |   |       /
          \       |   |   |      /
           \      |   |   |     /
            \     |   |   |    /
             \    |   |   |   |
              |   |   |   |   |
              V   V   V   V   V
              |   |   |   |   |
              Each line depicts 8 streams of packets with
              identical payloads - one stream for each of
              the 8 Launch servers.
Figure 1: Delegation tree of UASes above one RUAS.
When User-U (or a device or system with User-U's credentials) changes
the mapping of their micronet via a web interface, this is done via
Z's website, after authenticating by whatever means Z requires.
This causes UAS-Z to generate a signed copy of
this update command (a SUMUC) and to send it to UAS-Y. This may
include multiple commands to be executed in order.
The simplest SUMUC would be a change to the ETR address of an
existing micronet. This would consist of three items (assuming IPv4
for simplicity): A starting address for which micronet this update
covers, the number of IP addresses covered by the micronet to be
changed (>=1) (or alternatively the last address of the micronet),
and a new mapping value - a 32 bit ETR address. The SUMUC could also
include a time in the future at which the update should be executed. In
that case, it would be stored by RUAS-X and sent to the FMS at the
appointed time.
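
As an illustration only, the three items of the simplest IPv4 mapping
change, plus the optional future execution time, could be modelled as
below. The class name, field names and the use of a Unix timestamp
are assumptions made for this sketch; this draft does not define a
wire format.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address
from typing import Optional


@dataclass(frozen=True)
class MappingChange:
    """One IPv4 mapping change (hypothetical layout, not a wire format)."""
    start: IPv4Address                # first address of the micronet
    length: int                       # number of addresses covered, >= 1
    etr: IPv4Address                  # new mapping value (0.0.0.0 = zeroed)
    execute_at: Optional[int] = None  # assumed: Unix time of a deferred update

    def __post_init__(self):
        if self.length < 1:
            raise ValueError("a micronet covers at least one address")

    @property
    def last(self) -> IPv4Address:
        """Last address of the micronet, the alternative to sending length."""
        return self.start + (self.length - 1)


change = MappingChange(IPv4Address("192.0.2.16"), 8, IPv4Address("203.0.113.5"))
```

Here change.last is 192.0.2.23, showing that the length and the last
address carry the same information.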
Mapping change commands would also include commands to join and split
micronets. Sequences of these commands would be sent, in order - and
the UAS should check their validity before putting them into a SUMUC.
So a SUMUC consists of one or multiple mapping change commands
concerning a particular micronet, or perhaps a set of micronets. The
commands will be executed in order, but as if at once.
If the SUMUC consists simply of changing a micronet's ETR address,
including zeroing it, then this will be applied by every QSD and
updates sent to any ITRs which need it. Multiple such changes all
together in the one SUMUC would cause the same effects, for multiple
micronets. However, if the changes involved a sequence of changes
affecting the same SPI addresses, the QSD will update ITRs (its
queriers, which could be ITRs or QSCs) to the final state of the
mapping after the changes.
For instance a sequence of changes could zero two micronets (set
their ETR address to 0.0.0.0) and then join them into one micronet.
The resulting micronet could then be split into five micronets and
each one mapped to a different ETR address. The QSD may have a
querier which is caching the mapping for the first original micronet,
but not the other. It will send that querier updates which define
the new mapping arrangements for exactly that range of SPI addresses
which the original response covered. This avoids the ITR (or the
QSC, if that is the querier) having to be told about a larger amount
of SPI space than it was told about in the initial reply. As noted
previously, these newly defined micronets, each of which will now be
in the cache of the ITR or QSC, will be flushed from the cache at the
same time as the originally cached micronet would have been.
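
The restriction of updates to the originally cached range can be
sketched as follows. This is a minimal illustration which assumes
micronets are represented as inclusive integer ranges with a symbolic
ETR label; the function name and data layout are hypothetical.

```python
def clip_to_cached_range(new_micronets, cached_start, cached_end):
    """Return (start, end, etr) tuples restricted to the cached range.

    new_micronets: iterable of (start, end, etr), inclusive integer bounds.
    """
    clipped = []
    for start, end, etr in new_micronets:
        lo, hi = max(start, cached_start), min(end, cached_end)
        if lo <= hi:                      # this micronet overlaps the cache
            clipped.append((lo, hi, etr))
    return clipped


# Two 10-address micronets were zeroed, joined, then split into five
# micronets of 4 addresses each.
after_split = [(0, 3, "E1"), (4, 7, "E2"), (8, 11, "E3"),
               (12, 15, "E4"), (16, 19, "E5")]
# The querier had cached only the first original micronet: addresses 0..9.
updates = clip_to_cached_range(after_split, 0, 9)
```

The querier is told about 0..3, 4..7 and 8..9 only, and learns
nothing about addresses 10..19.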
UAS-Y trusts this SUMUC because it can authenticate UAS-Z's
signature. It strips off the signature and adds its own, before
passing the SUMUC down to the next level: RUAS-X.
RUAS-X likewise has a copy of UAS-Y's public key and, within a
fraction of a second of U initiating the UMUC, the master copy of
this MAB's database in RUAS-X is altered accordingly. (This would
be a distributed, redundant database system.)
Authority is delegated up the tree, because UAS-Y will only accept
update commands if they are signed by one of its branch UASes, and
for the particular address range that UAS has been authorised to
control.
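
The verify, check-range and re-sign step performed by UAS-Y might
look like the sketch below. To keep it self-contained it uses
HMAC-SHA256 with shared keys as a stand-in for the public-key
signatures described above; a real system would use asymmetric
signatures. The class, method names, payload encoding and integer
address ranges are all assumptions for illustration.

```python
import hashlib
import hmac


def sign(key: bytes, payload: bytes) -> bytes:
    # Stand-in only: HMAC with a shared key, NOT a public-key signature.
    return hmac.new(key, payload, hashlib.sha256).digest()


def encode(first, last, etr) -> bytes:
    # Hypothetical payload encoding of one mapping change.
    return f"{first}-{last}->{etr}".encode()


class UAS:
    def __init__(self, name, own_key):
        self.name = name
        self.own_key = own_key
        self.branches = {}   # branch name -> (key, (first, last) delegated)

    def authorise(self, branch, key, first, last):
        self.branches[branch] = (key, (first, last))

    def relay(self, branch, first, last, etr, signature):
        """Verify a branch UAS's SUMUC, then re-sign it for the next level."""
        key, (auth_first, auth_last) = self.branches[branch]
        payload = encode(first, last, etr)
        if not hmac.compare_digest(sign(key, payload), signature):
            raise PermissionError("bad signature")
        if first < auth_first or last > auth_last:
            raise PermissionError("range not delegated to this branch")
        return payload, sign(self.own_key, payload)


uas_y = UAS("UAS-Y", b"key-y")
uas_y.authorise("UAS-Z", b"key-z", 1000, 1999)
sig_z = sign(b"key-z", encode(1200, 1207, "203.0.113.5"))
payload, sig_y = uas_y.relay("UAS-Z", 1200, 1207, "203.0.113.5", sig_z)
```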
User-U may have given their username and password etc. to Multihoming
Monitoring Inc. so this company can monitor their multihoming links
and change the mapping as soon as one link goes down. UAS-Z doesn't
know or care who actually makes the change - as long as they can
authenticate themselves for whatever micronet they want to change the
mapping of. UAS-Z would keep an audit trail of all interactions such
as with User-U or Multihoming Monitoring Inc.
5. The Launch system
In this discussion 8 Launch servers will be assumed. The exact
number could be varied over time. Initial introduction could no
doubt be done with a simpler system, but the purpose of this
discussion is to explore how the system could scale to very large
numbers of micronets (billions) and large numbers of updates per
second.
The exact logic of the Launch system remains to be determined. The
following is a rough guide to how it might be done. I understand
there are some protocols for making distributed decisions, including
in a robust way if not all participants are active or have the full
amount of information others have.
The task of the Launch system in every cycle - in this example every
second - is to collate the update information from all the RUASes,
agree on what has been collected, and then to generate multiple
streams of packets containing that information, from multiple
locations, to the widely geographically dispersed level 1
Replicators. Links between the Launch servers would best be private
links, to avoid packet flooding attacks. Likewise the links to level
1 Replicators.
Each Launch server has a link to every other Launch server, and every
RUAS has a link to every Launch server. This may seem rather over-
engineered, but the system will be robust in the event of failure of
quite a few of these links, and the task at hand is a momentous one,
deserving considerable effort to make it fast and reliable.
The exact details of how packets are handled, how information is
combined into packets etc. remain for future work.
Each Launch server may be a single physical server, with a live
backup at the same address, or a redundant cluster of servers which
behaves as if it is one device.
While the Launch servers are sending out the update packets for one
second, they are comparing notes about updates to be sent in the next
second and collecting updates to be sent in the second after that.
Perhaps this one second timing clock will prove to be too ambitious,
or the operations may be broken into four phases, rather than three.
5.1. Phase 1 - collecting updates from RUASes
In phase 1, all RUASes attempt to send their complete set of updates
to every Launch server, where they are buffered in readiness for
Phase 2. The Launch server authenticates this information, by
standard cryptographic means based on the public key of each RUAS or
simply by using SSH for the communications protocol.
The contents of each RUAS's updates are then collected, and an MD5
(or some other hash algorithm) hash is created for each one.
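
A minimal sketch of this per-RUAS hashing step follows. It uses
SHA-256 as the "other hash algorithm", and assumes each RUAS's
buffered updates are available as a single byte string; both choices,
and the function name, are assumptions for illustration.

```python
import hashlib


def hash_updates(updates_by_ruas):
    """Map each RUAS name to a hex digest of its buffered update bytes."""
    return {ruas: hashlib.sha256(data).hexdigest()
            for ruas, data in updates_by_ruas.items()}


digests = hash_updates({"RUAS-X": b"update-set-1", "RUAS-W": b"update-set-2"})
```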
5.2. Phase 2 - checksum comparison
Each Launch server sends to every other Launch server its record of
the hashes of the updates received from each RUAS.
This enables each Launch server to identify its state as one of the
following:
o Normal: no received set of checksums from other Launch servers
includes updates from more RUASes, or from different RUASes, than
were received by this Launch server - and all the hashes agree
with the locally generated hashes. Therefore, this Launch server
has established that it correctly received the complete set of
updates.
o Missing updates: One or more received lists contained checksums
from an RUAS for which this Launch server did not correctly
receive any updates. Therefore, this Launch server has
established that it has missed out on updates from one or more
RUASes.
o Invalid updates: The local checksum value for one or more RUAS
sets of updates differs from two or more checksums from other
Launch servers which agree with each other. The Launch server
has established that it received an erroneous copy of at least
one RUAS's set of updates.
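
The three states above can be sketched as a single classification
function. This is a rough illustration, not a complete algorithm: it
assumes each Launch server holds a mapping from RUAS name to digest,
and it treats a single disagreeing peer as that peer's problem, so the
local server still reports Normal in that case. The function name and
state strings are hypothetical.

```python
from collections import Counter


def classify(local, peers):
    """local: {ruas: digest}; peers: list of {ruas: digest} dicts
    received from the other Launch servers."""
    # Missing updates: a peer lists a RUAS we received nothing from.
    for peer in peers:
        if set(peer) - set(local):
            return "missing"
    # Invalid updates: our digest differs from one on which two or
    # more peers agree.
    for ruas, digest in local.items():
        counts = Counter(p[ruas] for p in peers if ruas in p)
        for peer_digest, n in counts.items():
            if n >= 2 and peer_digest != digest:
                return "invalid"
    return "normal"


good = {"RUAS-X": "aa", "RUAS-W": "bb"}
state_ok = classify(good, [good, good, good])
state_missing = classify({"RUAS-X": "aa"}, [good])
state_invalid = classify({"RUAS-X": "zz", "RUAS-W": "bb"}, [good, good])
```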
Each Launch server now sends a signed message to the other Launch
servers, containing the state determined above: Normal, invalid
updates or missing updates.
Those Launch servers which are in the Normal state count how many
Launch servers, themselves included, are in this state. If the number
is at or above some "quorum" constant, say 4 in an 8 server system,
then each such Launch server is ready to send the collected updates
in phase 3. These Launch
servers independently process the same update data into a series of
packets, with sequence numbers which can easily be identified by the
recipient devices - initially level 1 Replicators but ultimately
QSDs. Those packets are stored, ready for transmission in phase 3.
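
The quorum test might be expressed as below. The sketch assumes the
reading in which 4 or more of the 8 Launch servers suffices and in
which the local server counts itself; the constant, function name and
state strings are illustrative assumptions.

```python
QUORUM = 4  # assumed value for an 8 server system


def ready_for_phase3(my_state, peer_states, quorum=QUORUM):
    """True if this server is Normal and enough servers agree."""
    if my_state != "normal":
        return False
    normal_count = 1 + sum(1 for s in peer_states if s == "normal")
    return normal_count >= quorum


# This server plus 3 Normal peers out of 7 meets a quorum of 4.
launch = ready_for_phase3("normal", ["normal"] * 3 + ["missing"] * 4)
```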
Normally, all 8 Launch servers will receive the same information
correctly, and so will participate in phase 3. The purpose of the
quorum constant is to ensure that there will never be a condition in
which
only one or two Launch servers participate in phase 3. The idea is
that the updates will be launched into the Replicator network
robustly, or not at all. Robustly means 4 or more of the 8 Launch
servers all launch the same information, and the others launch
nothing. If only 3, 2 or 1 Launch servers sent the information, or
if some Launch servers sent different information from the others,
then it is possible that some QSDs would not get the full set of
updates.
With further development work, it should be possible to fine-tune
this system to adequately guard against single or multiple points of
failure, but also to ensure that the system only sends out data when
it can send from at least four, or some constant number of Launch
servers. Careful analysis will be required to anticipate various
failure modes. Devising this involves quite a lot of work, but it
only needs to be done once, for this one set of Launch servers.
Updates to the software can be done without much fuss - it is not
like having to change the functionality of all QSDs.
RUASes monitor the output of the Launch system, and if a particular
second's worth of updates is not sent, then the RUAS will send them
again soon.
This raises some potential ordering difficulties, where one second
contains a command to map a micronet to zero, and the next second
contains a command to map part of it to some valid address. While
these should be combined in the one second, if they were not, and the
first second was not sent, then the second second's command would
fail in the QSD, because it would be defining a new smaller micronet
in part of a micronet which was not at the time mapped to zero. If a
QSD for some reason misses all or part of the updates for a given
MAB, it needs to buffer subsequent updates until it can retrieve the
missing packet(s) - since updates must be applied in the correct
order.
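
The buffering behaviour described for the QSD can be sketched as
follows. It assumes update packets carry consecutive integer sequence
numbers, which this draft leaves for future work; the class and method
names are hypothetical, and a list stands in for the mapping database.

```python
class OrderedApplier:
    """Apply update packets strictly in sequence order, holding back
    any which arrive after a gap until the gap has been filled."""

    def __init__(self, first_seq=0):
        self.next_seq = first_seq
        self.pending = {}     # seq -> packet held back behind a gap
        self.applied = []     # stands in for changes to the database

    def receive(self, seq, packet):
        if seq < self.next_seq or seq in self.pending:
            return            # duplicate, e.g. from a second feed
        self.pending[seq] = packet
        while self.next_seq in self.pending:
            self.applied.append(self.pending.pop(self.next_seq))
            self.next_seq += 1

    def missing(self):
        """Sequence numbers still to be fetched as missing packets."""
        if not self.pending:
            return []
        return [s for s in range(self.next_seq, max(self.pending))
                if s not in self.pending]


qsd = OrderedApplier()
qsd.receive(0, "zero micronet")
qsd.receive(2, "map part of it")      # packet 1 was lost: held back
gap = qsd.missing()
qsd.receive(1, "retrieved later")     # gap filled: 1 then 2 are applied
```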
The above algorithm will need to be extended so that a flaky RUAS,
which only transmits to a few Launch servers, will not cause the
quorum test to fail, due for instance to two Launch servers getting
its updates, and the rest recognising that they didn't.
5.3. Phase 3 - identical update streams
Those Launch servers which have the full set of update data now send
the packets they generated, in separate DTLS protected streams, to
level 1 Replicators. It would probably be best if the packets are
sent in numeric sequence, with sending times decided to spread the
packets over the whole second. Exactly how many level 1 Replicators
there are, and how many are driven by each Launch server, will be a
matter for further work.
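
One simple way to spread a cycle's packets over the whole second is
even spacing, sketched below. Even spacing is only one possible
policy and is an assumption here, as are the function and parameter
names.

```python
def send_offsets(packet_count, cycle_seconds=1.0):
    """Evenly spaced send times, in seconds from the start of the cycle,
    for packets sent in numeric sequence."""
    if packet_count <= 0:
        return []
    interval = cycle_seconds / packet_count
    return [i * interval for i in range(packet_count)]


offsets = send_offsets(4)
```

With four packets in a one second cycle, the packets are sent at 0,
0.25, 0.5 and 0.75 seconds.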
The result in each cycle will be that either the full set of updates
is sent out, robustly, by all or almost all Launch servers, or
nothing is sent. Because each level 1 Replicator receives at least
two feeds from separate Launch servers, in all but the most
pathological cases every level 1 Replicator will receive the same
set of information and so launch it to the level 2 Replicators. Even
if there is a relatively high packet loss from some or many of these,
and some broken links, all, or almost all level 2 Replicators will
receive a full set of packets. This pattern of redundancy, for a
doubling in bandwidth used, continues all the way to QSDs.
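
The way two redundant feeds mask packet loss can be sketched with a
Replicator which forwards the first copy of each packet it sees and
drops later copies. The sequence-number based duplicate detection,
the class name and the callback interface are assumptions for this
sketch.

```python
class Replicator:
    def __init__(self, downstream):
        self.downstream = downstream   # one callable per output feed
        self.seen = set()              # a real system would expire old entries

    def on_packet(self, seq, payload):
        """Forward the first copy of a packet; drop later copies."""
        if seq in self.seen:
            return False
        self.seen.add(seq)
        for send in self.downstream:
            send(seq, payload)
        return True


delivered = []
rep = Replicator([lambda s, p: delivered.append(("down-1", s)),
                  lambda s, p: delivered.append(("down-2", s))])
for seq in (0, 1):          # upstream feed A lost packet 2
    rep.on_packet(seq, b"...")
for seq in (1, 2):          # upstream feed B lost packet 0
    rep.on_packet(seq, b"...")
```

Although each upstream feed lost one packet, both downstream feeds
receive packets 0, 1 and 2 exactly once.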
6. Replicators
Further work is required to reach a more precise description of how
the update information is placed in packets, and signed in such a way
that QSDs can be sure they have received the correct information. If
we assume that this problem can be solved, then the following
description of the functionality of individual Replicators and the
way they are arranged will lead to an understanding of how they will
form a robust, packet amplifying, global network for delivering the
output of the Launch system to a million or more QSDs.
(See Figure 1, "Delegation tree of UASes above one RUAS", for the
arrangement above each RUAS.)
\ | / } Update information from end-users - directly
\ V / } or indirectly - to one of a dozen or so RUASes.
\|/
RUAS-X ->--------------[snapshot & missing packet HTTP servers]
/|\
/ V \ Streams of packets containing identical real-time
| mapping updates to the 8 Launch servers.
|
\ \ | / / Each of the 8 Launch servers gets a
\ \ V / / stream from each RUAS.
\ \ | / /
<>[Launch server N]<> The 8 Launch servers have links with each
/ / | \ \ other. Each second, each one sends a set
/ / V \ \ of update packets to 20 level 1
/ / | \ \ Replicators. Each level 1 Replicator
| receives two streams, each from a
| different Launch server.
\
\ / Even with packet losses and link failures,
\ / most of the 80 level 1 Replicators receive
level 1 \ / a complete set of update packets, each
[Replicator] second, which they each replicate to 20
/ / | \ \ level 2 Replicators.
/ / V \ \
/ | | | \ In this example, each Replicator consumes
| two feeds from the upstream level, and
/ generates 20 feeds to Replicators in
/ the level below (numbered one above the
\ / current level). So each level involves
\ / 10 times the number of Replicators.
level 2 \ /
[Replicator] These figures might be typical of later
/ / | \ \ years with 10^9+ micronets and 100k+
/ / V \ \ QSDs to drive. In the first five or
/ | | | \ ten years, with fewer updates, the
/ | | | \ amplification ratio of each level could
/ | | | \ be much higher, with fewer levels.
/ | | | \
| | | Replicators are well-connected COTS
| | servers at peering points and ISP data
| | centers, though the 8000 Level 3 and
[Levels 3 and 4] 80,000 Level 4 Replicators may be in
[Replicators ] ISPs and larger end-user networks.
\ | \ /
\ | \ / Up to 800k QSDs get two or more ideally
\ | \ / identical full feeds of updates.
QSD QSD
Figure 2: Multiple levels of Replicators drive hundreds of thousands
of QSDs.
6.1. Scaling limits
The Replicator system is scalable to any size simply by adding
Replicators. Assuming two input streams for each Replicator, N
output streams gives an N/2 amplification of stream numbers per
level. N could be quite high in the early years of introduction,
when the number of micronets and updates is small by comparison with
the design target of one to ten billion micronets, with accompanying
update rates driven by their use for inbound TE for multihomed non-
mobile end-user networks and by mobile devices selecting new TTRs.
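
The N/2 amplification can be checked against the figures used in
Figure 2: 8 Launch servers, 20 output feeds per device and 2 input
feeds per device. The function and parameter names below are
illustrative only.

```python
def replicators_per_level(launch_servers, feeds_out, feeds_in, levels):
    """Number of Replicators at each level, given the feeds each device
    generates and the feeds each device consumes."""
    counts = []
    sources = launch_servers
    for _ in range(levels):
        sources = sources * feeds_out // feeds_in
        counts.append(sources)
    return counts


level_counts = replicators_per_level(8, 20, 2, 4)
qsds = level_counts[-1] * 20 // 2
```

This gives 80, 800, 8000 and 80,000 Replicators in levels 1 to 4, and
up to 800,000 QSDs, matching the figure.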
First, a maximal IPv4 example will be considered. Assume a billion
micronets, most of them for single IP addresses. Presumably most of
these will be for individual end-users, at home or with mobile
devices. The update rate will be relatively low for multihoming the
home and office-based micronets.
The update rate due to inbound TE is impossible to predict. Being
able to steer traffic dynamically to maximise utilization of multiple
links is economically highly attractive. Market mechanisms will tend
to set prices for updates which balance competing concerns. If the
price is too low, there will be more of them and the Replicator
system will need to be improved to cope with them - so the price
would rise to either reduce the number, or pay for the upgrades.
It is possible that the RUASes collectively could set prices low
and still make a profit running their operation and many of the
Replicators - with a very high volume of TE updates. If this
grew to the point where those operating QSDs found they had to spend
money upgrading their QSDs just to cope with the volume, then there
would be the possibility that they could instead program their QSDs
to ignore the most frequent updates which had patterns resembling TE
updates.
Then, in order for the RUASes to be able to continue charging for
these TE updates, the RUASes might need to pay QSD operators to
accept such a high level of updates. This would probably be
excessively expensive - so RUASes would be under strong pressure to
limit the total rate of updates to a level the great majority of QSD
operators are happy with. The price of updates will not deter their
use for multihoming service restoration - and this would represent a
small proportion of total updates. Higher prices per update would
reduce the number for TE, in a highly elastic manner. Likewise,
higher prices per update would cause mobile users (or more directly
the TTR companies, who are paying for each update) not to change TTRs
as often.
So overall, it is impossible to state with confidence what update
rates might be expected.
Even with the entire Earth's population owning a mobile device with
its own micronets, if we pick some figure, such as 1000 km, within
which there is no significant benefit in choosing a closer TTR, then
a WAG (Wild-Ass Guess) could be based on airline passenger numbers.
If we assume that each such trip would be long enough to require a
new TTR, then we would get some very approximate worst-case figure.
Statistics from the International Air Transport Association
[IATA-2009] indicate that commercial airlines carried 2.271 billion
passengers in 2008. I have not been able to find estimates for the
number of people travelling large distances by road or train, but it
is reasonable to assume these are relatively small compared to the
numbers of airline passengers. Most travel by car and train involves
trips short enough, with a return trip home, that there will be no
need to use a closer TTR during the whole trip. Truck drivers
crossing continents might be an exception, but the number of such
trips would be small compared to the 2 billion airline passenger
figure.
There could be growth in passenger numbers and it is possible that on
long trips, the aircraft's satellite link would connect to several
ground stations, with the MNs in the aircraft therefore (ideally)
changing their mapping to a new TTR near the ground station. (This
is explored in [TTR Mobility].) There are various ways of
extrapolating these figures, such as with population growth. For
simplicity, I will double the 2 billion figure and use this to
roughly include all mapping changes due to multihoming service
restoration and TE. So I have a WAG of 4 billion mapping changes a
year.
This is about 127 updates a second, taken here as 128 for
convenience.
The raw data for a change to an IPv6 micronet's ETR address is 32
bytes: 64 bits for the micronet's starting /64, another 64 bits for
its length or end, and 128 bits for the ETR address. 128 of these a
second is 4k bytes a second - 32kbps. There would be peaks and
troughs, and there could be peaks due to a major outage driving many
end-user networks to switch ETRs for multihoming service restoration.
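
The arithmetic behind these figures can be reproduced as follows.
The only assumption added here is a 365 day year; the variable names
are illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60           # 31,536,000
changes_per_year = 4_000_000_000                # the WAG above
per_second = changes_per_year / SECONDS_PER_YEAR

# 64 bit start + 64 bit length or end + 128 bit ETR address.
BYTES_PER_CHANGE = (64 + 64 + 128) // 8

rate_bps_rounded = 128 * BYTES_PER_CHANGE * 8   # using 128 updates a second
rate_bps_exact = per_second * BYTES_PER_CHANGE * 8
```

The unrounded rate is a little under 127 updates a second. At 128
updates a second the raw rate is 4096 bytes, or 32,768 bits, a second.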
If there were 5 or 10 billion mobile devices, each with a micronet,
many of these would keep using the same TTR from one year to the
next. There would be a mapping change when the micronet was assigned
to a given handset, and then another when the handset was no longer
used, or replaced by another. So there would also be a significant
background level of administrative mapping changes with billions of
micronets for mobile devices.
It is hard to imagine a scenario in which the update rate would
require prohibitive volumes of data, even by today's standards, for
any substantial ISP. The flow of update packets would be somewhat
greater than this raw data rate due to the need for packing them into
some kind of robust format, having hashes of them with digital
signatures etc. The total amount of mapping data coming into an ISP
would be 2 to 4 times this due to the need for feeds from two or more
Replicators. Still, by the time such high levels of adoption could
occur, the bandwidth they require will surely not present a
significant difficulty for any ISP, or for larger end-user networks
which want to run their own ITRs and wish to have their own QSDs,
rather than relying on the QSDs of their ISPs.
6.2. Managing Replicators
Replicators should be easy to create and deploy. Any substantial
server with the requisite software, in a suitable location, will do
the job - but it should be well secured against attackers gaining
root access. A successful system will require some mechanisms which
ensure reliable operation with a minimal amount of configuration and
ongoing management.
In the current model, each Replicator normally receives feeds from
two upstream Replicators, and generates some figure N feeds for
downstream devices. Each Replicator should be able to request and
quickly gain a replacement feed from another upstream Replicator if
one of those it is using becomes unavailable, or unreliable.
This requires that Replicators in general be operating below
capacity, so that when others in their level fail, they can take up
the slack. This needs to be locally configured beforehand, with
upstream Replicators of organisations which have agreed to provide
the feeds, and with downstream Replicators of organisations who have
requested them.
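
Selection of a replacement feed from a locally configured list of
upstream Replicators might be sketched as below. The function name,
the health-check callback and the order of preference are all
assumptions; this draft does not specify how feeds are requested.

```python
def choose_feeds(current, candidates, healthy, wanted=2):
    """Keep current feeds which are still healthy, then fill up from
    the pre-agreed candidate upstream Replicators, in listed order."""
    feeds = [u for u in current if healthy(u)]
    for u in candidates:
        if len(feeds) >= wanted:
            break
        if u not in feeds and healthy(u):
            feeds.append(u)
    return feeds


# Upstream "B" has become unreliable; "C" and "D" were agreed beforehand.
feeds = choose_feeds(["A", "B"], ["C", "D"], healthy=lambda u: u != "B")
```

The Replicator keeps feed A and takes C as its replacement for B.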
It is possible to imagine a sophisticated, distributed, management
system for the Replicator network. This could be developed over
time, since for initial deployment, considerable manual configuration
and less automation would be acceptable.
7. Security Considerations
This ID mentions some authentication and security problems and
possible solutions to them, but full consideration of security can
only occur when the architecture is fleshed out in greater detail.
8. IANA Considerations
For future work.
9. Informative References
[I-D.whittle-ivip-arch]
Whittle, R., "Ivip (Internet Vastly Improved Plumbing)
Architecture", draft-whittle-ivip-arch-03 (work in
progress), January 2010.
[I-D.whittle-ivip-glossary]
Whittle, R., "Glossary of some Ivip and scalable routing
terms", draft-whittle-ivip-glossary-00 (work in progress),
January 2010.
[IATA-2009]
"Fact sheet: industry statistics", September 2009,
<http://www.iata.org/NR/rdonlyres/
8BDAFB17-EED8-45D3-92E2-590CD87A3144/0/
FactSheetIndustryFactsSept09.pdf>.
[TTR Mobility]
Whittle, R. and S. Russert, "TTR Mobility Extensions for
Core-Edge Separation Solutions to the Internet's Routing
Scaling Problem", August 2008,
<http://www.firstpr.com.au/ip/ivip/TTR-Mobility.pdf>.
Author's Address
Robin Whittle
First Principles
Email: rw@firstpr.com.au
URI: http://www.firstpr.com.au/ip/ivip/