Network Working Group M. Mathis
Internet-Draft J. Heffner
Expires: November 30, 2004 PSC
K. Lahey
Freelance
June 2004
Path MTU Discovery
draft-ietf-pmtud-method-02
Status of this Memo
By submitting this Internet-Draft, I certify that any applicable
patent or other IPR claims of which I am aware have been disclosed,
and any of which I become aware will be disclosed, in accordance with
RFC 3668.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
This Internet-Draft will expire on November 30, 2004.
Copyright Notice
Copyright (C) The Internet Society (2004). All Rights Reserved.
Abstract
This document describes a robust new method for Path MTU Discovery
that relies on TCP or another Packetization Layer to probe an Internet
path with progressively larger packets. This method is described as
an extension to RFC 1191 and RFC 1981, which specify ICMP based Path
MTU Discovery for IP versions 4 and 6, respectively. This document
does not define a protocol, but rather a method to use features of
existing protocols to discover the path MTU.
Mathis, et al. Expires November 30, 2004 [Page 1]
Internet-Draft Path MTU Discovery June 2004
The general strategy of the new algorithm is to start with a small
MTU and probe upward, testing successively larger MTUs by probing
with single packets. If the probe is successfully delivered, then
the MTU is raised. If the probe is lost, it is treated as an MTU
limitation and not as a congestion signal.
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3
2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3
3. Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
4. Requirements . . . . . . . . . . . . . . . . . . . . . . . . . 9
5. Implementation Issues . . . . . . . . . . . . . . . . . . . . 10
5.1 Layering . . . . . . . . . . . . . . . . . . . . . . . . . 10
5.1.1 Accounting for Header Sizes . . . . . . . . . . . . . 10
5.1.2 Storing PMTU information . . . . . . . . . . . . . . . 11
5.2 Lower Layers . . . . . . . . . . . . . . . . . . . . . . . 12
5.2.1 Generating Probes . . . . . . . . . . . . . . . . . . 12
5.2.2 Selecting the initial MTU . . . . . . . . . . . . . . 14
5.2.3 Normal sequence of events to raise the MTU . . . . . . 14
5.2.4 Processing MTU Indications . . . . . . . . . . . . . . 15
5.2.5 Probing Intervals . . . . . . . . . . . . . . . . . . 20
5.2.6 Host fragmentation . . . . . . . . . . . . . . . . . . 21
5.2.7 Multicast . . . . . . . . . . . . . . . . . . . . . . 22
5.3 Search Strategy . . . . . . . . . . . . . . . . . . . . . 22
5.3.1 Search . . . . . . . . . . . . . . . . . . . . . . . . 23
5.3.2 Monitor . . . . . . . . . . . . . . . . . . . . . . . 24
5.3.3 Suspend . . . . . . . . . . . . . . . . . . . . . . . 24
5.4 Specific Packetization Layers . . . . . . . . . . . . . . 24
5.4.1 Probing method using TCP . . . . . . . . . . . . . . . 24
5.4.2 Probing method using SCTP . . . . . . . . . . . . . . 25
5.4.3 Probing Method for IP Fragmentation . . . . . . . . . 27
5.4.4 Issues for other transport protocols . . . . . . . . . 27
5.5 Operational Integration . . . . . . . . . . . . . . . . . 27
5.5.1 Interoperation with prior algorithms . . . . . . . . . 27
5.5.2 Interoperation over subnets with dissimilar MTUs . . . 28
5.5.3 Interoperation with tunnels . . . . . . . . . . . . . 28
5.5.4 Diagnostic tools . . . . . . . . . . . . . . . . . . . 29
5.5.5 Management interface . . . . . . . . . . . . . . . . . 29
6. References . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6.1 Normative References . . . . . . . . . . . . . . . . . . . . 30
6.2 Informative References . . . . . . . . . . . . . . . . . . . 31
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 32
A. Security Considerations . . . . . . . . . . . . . . . . . . . 32
B. IANA considerations . . . . . . . . . . . . . . . . . . . . . 32
C. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 33
Intellectual Property and Copyright Statements . . . . . . . . 34
1. Introduction
This document describes a method for Packetization Layer Path MTU
Discovery (PLPMTUD) which is an extension to existing Path MTU
discovery methods as described in RFC 1191 [2] and RFC 1981 [3]. The
proper MTU is determined by starting with small packets and probing
with successively larger packets. The bulk of the algorithm is
implemented above IP, in the transport layer (e.g. TCP) or other
"Packetization Protocol" that is responsible for determining packet
boundaries.
This document draws heavily on RFC 1191 [2] and RFC 1981 [3] for
terminology, ideas and some of the text.
The methods described in this document apply to both IPv4 and IPv6,
and to many transport protocols. This document does not define a
protocol,
but rather a method to use features of existing protocols to discover
the path MTU. It does not require cooperation from the lower layers
(except that they are consistent about what packet sizes are
acceptable) or the far node. Variants in implementations will not
cause interoperability problems.
The methods described in this document are carefully designed to
maximize robustness in the presence of less than ideal
implementations of other protocols or Internet components.
For the sake of clarity we uniformly prefer TCP and IPv6 terminology. In
the terminology section we also present the analogous IPv4 terms and
concepts for the IPv6 terminology. In a few situations we describe
specific details that are different between IPv4 and IPv6.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [4].
This draft is a product of the Path MTU Discovery (pmtud) working
group of the IETF. Please send comments and suggestions to
pmtud@ietf.org. Interim drafts and other useful information will be
posted at http://www.psc.edu/~mathis/MTU/pmtud/index.html .
2. Terminology
IP Either IPv4 [1] or IPv6 [7].
node A device that implements IP.
router A node that forwards IP packets not explicitly addressed to
itself.
host Any node that is not a router.
upper layer A protocol layer immediately above IP. Examples are
transport protocols such as TCP and UDP, control protocols such as
ICMP, routing protocols such as OSPF, and Internet or lower-layer
protocols being "tunneled" over (i.e., encapsulated in) IP such as
IPX, AppleTalk, or IP itself.
link A communication facility or medium over which nodes can
communicate at the link layer, i.e., the layer immediately below
IP. Examples are Ethernets (simple or bridged); PPP links; X.25,
Frame Relay, or ATM networks; and Internet (or higher) layer
"tunnels", such as tunnels over IPv4 or IPv6. Occasionally we use
the slightly more general term "lower layer" for this concept.
interface A node's attachment to a link.
address An IP-layer identifier for an interface or a set of
interfaces.
packet An IP header plus payload.
MTU Maximum Transmission Unit, the size in bytes of the largest IP
packet, including the IP header and payload, that can be
transmitted on a link or path. Note that this could more properly
be called the IP MTU, to be consistent with how other standards
organizations use the acronym MTU.
link MTU The Maximum Transmission Unit, i.e., maximum IP packet size
in bytes, that can be conveyed in one piece over a link. Beware
that this definition differs from the definition used by other
standards organizations.
For IETF documents, link MTU is uniformly defined as the IP MTU
over the link. This includes the IP header, but excludes link
layer headers and other framing which is not part of IP or the IP
payload.
Beware that other standards organizations generally define link
MTU to include the link layer headers.
path The set of links traversed by a packet between a source node and
a destination node.
PMTU, path MTU The minimum link MTU of all the links in a path
between a source node and a destination node.
classical PMTU discovery The process described in RFC 1191 and RFC 1981,
in which nodes rely on ICMP "Packet Too Big" messages to learn the
MTU of a path.
PL, packetization layer The layer of the network stack which segments
data into packets.
PLPMTUD Packetization Layer Path MTU Discovery, the method described
in this document, which is an extension to classical PMTU
discovery.
Packet Too Big message An ICMP message reporting that an IP packet is
too large to forward. This is the IPv6 term that corresponds to
the IPv4 "ICMP Can't fragment" message.
flow A context in which MTU discovery is applied. This is naturally
an instance of the packetization protocol, e.g. one side of a TCP
connection.
MPS The maximum IP payload size available over a specific path. This
is typically the path MTU minus the IP header. As an example, this
is the maximum TCP packet size, including TCP payload and headers
but not including IP headers. This has also been called the "L3
MTU".
MSS The TCP Maximum Segment Size, the maximum payload size available
to the TCP layer. This is typically the path MPS minus the size of
the TCP header.
probe packet A packet which is being used to test a path for a larger
MTU.
probe size The size of a packet being used to probe for a larger MTU.
successful probe The probe packet was delivered through the network
and acknowledged by the Packetization Layer on the far node.
inconclusive probe The probe packet was not delivered, but there were
other lost packets close enough to the probe that it cannot be
presumed that the probe was lost because it was larger than the
path MTU. By implication the probe might have been lost due to
something other than MTU (such as congestion), so the results are
inconclusive. Inconclusive probes are generally repeated at the
same probe size, after a suitable delay.
failed probe The probe packet was not delivered and there were no
other lost packets close to the probe. This is taken as an
indication that the probe was larger than the path MTU, and future
probes should generally be at smaller sizes.
errored probe There were losses or timeouts during the verification
phase which suggest a potentially disruptive failure or network
condition. These are generally retried only after substantially
longer intervals.
probe gap The payload data that will be lost and need to be
retransmitted if the probe is not delivered.
probe phase The interval (time or protocol events) between when a
probe is sent, and when it is determined that the probe
succeeded, failed or was inconclusive.
verification phase An additional interval during which the new path
MTU is considered provisional. Packet losses or timeouts are
treated as an indication that there may be a problem with the
provisional MTU.
transition phase The interval between the probe phase and the
verification phase, during which packets using the new MTU
propagate to the far node and the acknowledgment propagates back.
full stop timeout A timeout where none of the packets transmitted
after some event are acknowledged by the receiver, including any
retransmissions. This is taken as an indication of some failure
condition in the network, such as a routing change onto a link
with a smaller MTU. For the sake of PLPMTUD we suggest the
following definition of a full stop timeout: the loss of one full
window of data and at least one retransmission or at least 6
consecutive packets including at least 2 retransmissions (along
with two retransmission timer expirations). [@@@ This probably
needs some experimentation.]
search strategy The heuristics used to choose successive probe sizes
to converge to the proper path MTU, as described in Section 5.3.
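As an illustration only, the four probe outcomes defined above could
be distinguished as in the following Python sketch. The names
ProbeOutcome and classify_probe and the three boolean inputs are
hypothetical and are not defined by this document. How a
Packetization Layer decides that other losses were "close" to the
probe is deliberately left open here.

```python
from enum import Enum


class ProbeOutcome(Enum):
    SUCCESSFUL = "successful"
    INCONCLUSIVE = "inconclusive"
    FAILED = "failed"
    ERRORED = "errored"


def classify_probe(probe_acked, other_losses_near_probe, verification_trouble):
    """Map observations about one probe onto the outcomes defined above.

    All three arguments are hypothetical booleans supplied by the
    Packetization Layer:
      probe_acked             -- the far node acknowledged the probe
      other_losses_near_probe -- other packets were lost close to the probe
      verification_trouble    -- losses or timeouts in the verification phase
    """
    if probe_acked:
        # Delivered, but trouble during verification makes the probe errored.
        if verification_trouble:
            return ProbeOutcome.ERRORED
        return ProbeOutcome.SUCCESSFUL
    if other_losses_near_probe:
        # The loss may be due to congestion, so nothing can be concluded.
        return ProbeOutcome.INCONCLUSIVE
    # Isolated loss of the probe: taken as an MTU limit.
    return ProbeOutcome.FAILED
```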
3. Overview
This document describes a method for TCP or other packetization
protocols to dynamically discover the MTU of a path without relying
on explicit signals from the network. These procedures are applicable
to TCP and other transport- or application-level packetization
protocols in which the receiver always reports to the sender complete
information about which packets were lost in the network.
The general strategy of the new procedure is for the packetization
layer to find the proper MTU by probing with progressively larger
packets, without disrupting its normal protocol operation. If a probe
packet is successfully delivered, then the path MTU is provisionally
raised. If there are no additional losses during the subsequent
verification phase, then the path MTU is confirmed (verified) to be
at least as large as the provisional MTU. PLPMTUD can then probe
again with an even larger MTU, according to the MTU search strategy
described in Section 5.3.
The verification phase is used to detect some situations where
raising the MTU raises the packet loss rate. For example if a link
is striped across multiple physical channels with inconsistent MTUs,
it is possible that a probe will be delivered even if it is too large
for some of the physical channels. In such cases raising the path MTU
to the probe size will cause severe periodic loss and abysmal
performance. The verification phase is designed to prevent the path
MTU from being raised if doing so causes excessive packet losses.
A conservative implementation of PLPMTUD would use a full round trip
time for the verification phase. In this case each time PLPMTUD
raises the MTU it takes three full round trip times to do so. It
takes one round trip for the probe phase, during which the probe
propagates to the far node and an acknowledgment is returned. The
second round trip is the transitional phase, during which data
packets using the provisional MTU propagate to the far node and are
acknowledged. During the third and final round trip time, it is
verified that raising the MTU does not cause excessive loss.
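The three round trips described above can be pictured as a small
state machine. The following Python sketch is a simplification under
stated assumptions: the class name MtuRaise, its method names, and the
idea of driving it with one call per event are all hypothetical, and
every loss or timeout is reduced to abandoning the attempt.

```python
class MtuRaise:
    """Tracks one attempt to raise the MTU from `current` to `probe_size`."""

    def __init__(self, current, probe_size):
        self.mtu = current            # last verified MTU
        self.probe_size = probe_size  # candidate MTU being tested
        self.phase = "probe"          # probe -> transition -> verification -> done

    def on_probe_acked(self):
        # Round trip 1 ends: the probe was delivered and acknowledged.
        if self.phase == "probe":
            self.phase = "transition"

    def on_first_provisional_packet_acked(self):
        # Round trip 2 ends: data sent at the provisional MTU was acknowledged.
        if self.phase == "transition":
            self.phase = "verification"

    def on_verification_complete(self):
        # Round trip 3 ends without excessive loss: the new MTU is verified.
        if self.phase == "verification":
            self.mtu = self.probe_size
            self.phase = "done"

    def on_loss_or_timeout(self):
        # Simplification: any trouble abandons the attempt.
        if self.phase != "done":
            self.phase = "aborted"

    def sending_mtu(self):
        # The provisional MTU is in use from the transition phase onward.
        if self.phase in ("transition", "verification", "done"):
            return self.probe_size
        return self.mtu
```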
The isolated loss of a probe packet (with or without a Packet Too Big
message) is treated as an indication of an MTU limit, and not as a
congestion indicator. In this case alone, the packetization protocol
is permitted to retransmit the probe gap without adjusting the
congestion window.
If there is a timeout or any additional lost packets during any of
the three phases, the loss is treated as a congestion indication as
well as an indication of some sort of failure of the PLPMTUD process.
The congestion indication is treated like any other congestion
indication: window or rate adjustments are mandatory per the relevant
congestion control standards [8]. Probing can resume with some new
probe size after a delay which is determined by the nature of the
indicated failure.
The most likely (and least serious) PLPMTUD failure is the link
experiencing legitimate congestion related losses at about the same
time as the probe. In this case, it is appropriate to retry the
probe (with the same probe size) as soon as the packetization layer
has fully adapted to the congestion and recovered from the losses.
In other cases, additional losses or timeouts indicate problems with
the link or packetization layer, and that probes may be disruptive.
In these situations it is desirable to use progressively longer
delays depending on the severity of the failure and whether it
persists.
PLPMTUD can optionally process Packet Too Big messages to select the
provisional MTU for faster convergence in exchange for a slight
decrease in robustness. Processing malicious or erroneous Packet Too
Big messages can cause PLPMTUD to arrive at the incorrect MTU for a
path, which is likely to reduce protocol performance. There are
several different options for processing Packet Too Big messages: at
one extreme they could be completely ignored; at the other extreme,
all of them could be accepted (fully implementing classical PMTUD
within PLPMTUD).
We advocate a compromise, where Packet Too Big messages are only
processed in conjunction with probes (described in Section 5.2.4.1),
and Packetization Layer timeouts (described in Section 5.2.4.3).
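The three options above can be summarized in a short Python sketch.
The policy names and the function should_process_ptb are hypothetical;
the two boolean inputs stand in for state that the Packetization Layer
would have to supply. The sketch only decides whether a message is
considered at all and does not validate its contents.

```python
def should_process_ptb(policy, probe_outstanding, recovering_from_timeout):
    """Decide whether a received Packet Too Big message is processed.

    policy is one of the hypothetical strings:
      "ignore"     -- never process Packet Too Big messages
      "accept_all" -- process all of them, as classical PMTUD does
      "compromise" -- process them only in conjunction with a probe or a
                      Packetization Layer timeout
    """
    if policy == "ignore":
        return False
    if policy == "accept_all":
        return True
    if policy == "compromise":
        return probe_outstanding or recovering_from_timeout
    raise ValueError("unknown policy: %r" % (policy,))
```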
Relatively few details of this procedure affect interoperability with
other standards or Internet protocols. These details are specified
in RFC2119 standards language in Section 4.
Most of the difficulty in implementing PLPMTUD arises because it
needs to be implemented in several different places within a single
node. In general each packetization protocol needs to have its own
implementation of PLPMTUD. Furthermore, the natural mechanism to
share path MTU information between concurrent or subsequent
connections over the same path is a path information cache in the IP
layer. The various packetization protocols need to have the means to
access and update the shared cache in the IP layer. This memo
describes PLPMTUD in terms of its primary subsystems without fully
describing how they are assembled into a complete implementation.
Section 5 describes: the separation into layers; the mechanics of
probing from the point of view of the lower layers; Maximum Payload
Size search heuristics; implementation in specific Packetization
Layers; and operational integration issues.
The vast majority of the implementation details are recommendations
based on experiences with earlier versions of path MTU discovery.
These are motivated by a desire to maximize robustness of PLPMTUD in
the presence of less than ideal implementations as they exist in the
field.
4. Requirements
All Internet nodes SHOULD implement PLPMTUD in order to discover and
take advantage of the largest MTU supported along the Internet path.
Links MUST NOT deliver packets that are larger than their MTU. Links
that have parametric limitations (e.g. MTU bounds due to limited
clock stability) MUST include explicit mechanisms to consistently
reject packets that might otherwise be nondeterministically
delivered.
All hosts SHOULD use IPv4 fragmentation in a mode that mimics IPv6
functionality. All fragmentation SHOULD be done on the host, and all
IPv4 packets, including fragments, SHOULD have the DF bit set such
that they will not be fragmented (again) in the network. See Section
5.2.6.
The requirements below only apply to those implementations that
include PLPMTUD.
If the Packetization Layer uses application data to implement PLPMTUD
it MUST use a loss reporting mechanism (e.g. TCP SACK)
which avoids spurious retransmission of other data when a probe
packet is lost.
A Packetization Layer using application data for probes MUST NOT send
a probe unless it has sufficient following data available to send
such that a lost probe will trigger Fast Retransmit or a similar data
recovery algorithm.
A Packetization Layer using application data for probes SHOULD NOT
send a probe packet unless the flow is expected to have at least the
3 round trips worth of data needed to successfully complete the
probe, transition and verification phases.
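The two preconditions above for probes built from application data
might be checked as in the following Python sketch. The function name
and all parameters are hypothetical. We assume a TCP-like sender in
which three duplicate acknowledgments trigger Fast Retransmit, each
following segment elicits one duplicate acknowledgment, and one
congestion window of data is sent per round trip.

```python
def may_send_probe(bytes_queued_after_probe, current_mss, flow_bytes_expected,
                   cwnd_bytes, dupack_threshold=3):
    """Return (allowed, advisable) for sending a probe built from application data.

    allowed   -- enough following data is queued that a lost probe would
                 trigger Fast Retransmit (the MUST NOT rule above)
    advisable -- the flow is also expected to last the three round trips
                 needed for the probe, transition and verification phases
                 (the SHOULD NOT rule above)
    """
    # Assumption: each following full-sized segment elicits one duplicate ACK.
    following_segments = bytes_queued_after_probe // current_mss
    allowed = following_segments >= dupack_threshold
    # Assumption: one congestion window of data is sent per round trip.
    advisable = allowed and flow_bytes_expected >= 3 * cwnd_bytes
    return allowed, advisable
```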
Normal congestion control algorithms MUST remain in effect under all
conditions except when only an isolated probe packet is detected to
be lost. In this case alone the normal congestion (window or data
rate) reduction can be suppressed. If any other lost data is
detected, all normal congestion control MUST take place.
When a probe is lost and normal congestion control is suppressed as
permitted above, then the Packetization Layer MUST NOT probe again
until an interval at least equal to one normal congestion control
cycle has elapsed. For TCP and TCP friendly protocols this generally
means one
round trip of elapsed time for each packet permitted under the
current congestion window.
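A minimal Python sketch of the two rules above follows. The function
names and parameters are hypothetical. The delay is a direct reading
of "one round trip of elapsed time for each packet permitted under the
current congestion window" and is only a lower bound on the interval.

```python
def congestion_reduction_suppressed(probe_lost, other_data_lost, timeout):
    """True only when the probe is the sole loss, the one permitted exception."""
    return probe_lost and not other_data_lost and not timeout


def min_reprobe_delay(cwnd_packets, rtt_seconds):
    """Lower bound on the wait before probing again after a suppressed
    reduction: one round trip per packet in the current congestion window."""
    return cwnd_packets * rtt_seconds
```

For example, with a window of 8 packets and a 250 ms round trip time
the next probe would wait at least 2 seconds.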
If PLPMTUD updates the MTU for a particular path, all Packetization
Layer sessions that share the flow (path) must be notified.
Whenever the MTU is raised, the congestion state variables must be
rescaled so as not to raise the window size in bytes (or data rate in
bytes per second).
Whenever the MTU is reduced (e.g. when unconditionally processing
ICMP Packet Too Big messages) the congestion state variables must be
rescaled so as not to raise the window size in packets.
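The two rescaling rules can be illustrated with integer arithmetic.
The following Python sketch uses hypothetical function names and
assumes that the implementation keeps its congestion window either in
packets (first function) or in bytes (second function). The floor of
one packet is our assumption, not a requirement of this document.

```python
def rescale_cwnd_packets_on_raise(cwnd_packets, old_size, new_size):
    """cwnd kept in packets: shrink the count so the window in bytes does not grow."""
    if new_size < old_size:
        raise ValueError("new_size must not be smaller than old_size")
    # Round down; assumption: never go below one packet.
    return max(1, (cwnd_packets * old_size) // new_size)


def rescale_cwnd_bytes_on_reduce(cwnd_bytes, old_size, new_size):
    """cwnd kept in bytes: shrink it so the window in packets does not grow."""
    if new_size > old_size:
        raise ValueError("new_size must not be larger than old_size")
    # Same number of packets as before, each now carrying new_size bytes.
    packets = max(1, cwnd_bytes // old_size)
    return packets * new_size
```

For example, 10 packets of 1024 bytes become 7 packets of 1460 bytes
(10220 bytes, not more than the original 10240), and a 14600 byte
window of 1460 byte packets becomes 10240 bytes of 1024 byte packets.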
All implementations MUST include a mechanism to implement diagnostic
tools that do not rely on the operating system's implementation of
path MTU discovery. This specifically requires the ability to send
packets that are larger than the known MTU for the path, and to
collect any resultant ICMP error messages. See Section 5.5.4.
5. Implementation Issues
This section discusses a number of issues related to the
implementation of Path MTU Discovery. This is not a specification,
but rather a set of notes provided as an aid for implementers.
The issues include:
o The separation into layers.
o The mechanics of probing, as seen by IP and below.
o Search strategy.
o How to implement PLPMTUD in specific Packetization Layers.
o How to improve Operational Integration and deployment.
5.1 Layering
5.1.1 Accounting for Header Sizes
Packetization Layer Path MTU Discovery is most easily implemented by
splitting its functions between layers. The IP layer is in the best
place to keep shared state, collect the ICMP messages, track IP
header sizes and manage MTU information from the link layer
interfaces. However the procedures that PLPMTUD uses for probing,
verification and scanning for the path MTU are very tightly coupled
to the data recovery and congestion control state machines in the
Packetization Layer. The most difficult part of implementing
PLPMTUD is properly splitting the implementation between the layers.
Note that this layering is consistent with the advice in the current
PMTUD specifications [2][3]. Today, many implementations of classical
PMTU Discovery are already split along these same layers.
Early implementation of PLPMTUD revealed that it is critically
important to have a good clean mechanism for accounting header sizes
at all layers. This is because each Packetization Layer does its
calculations in its own natural data units, which are almost always a
reflection of the service that the Packetization Layer provides to
the application or other upper layers. For example, TCP naturally
performs all of its calculations in terms of sequence numbers and
segment sizes. The size of the probe gap is the size of the data
segment that was carried by the probe packet. However, the
MTU size being probed, ICMP MTU, etc. are measures of full packets,
which not only include the TCP data (measured in sequence space) but
also include fixed TCP and IP headers, and may include IPv6 extension
headers or IPv4 options, TCP options and even IPsec AH or ESP headers
as well.
PLPMTUD requires frequent translation between these two domains: the
Packetization Layer's natural data unit and full IP packet sizes.
While there are a number of possible ways to accurately implement
dual size measures, our experience has been that it is best if the
IP layer and the Packetization Layer communicate across their boundary
in terms of the IP Maximum Payload Size or MPS. The MPS is the only
size measure that is common to both the IP and Packetization Layers,
because it exactly matches the boundary between the layers. The IP
layer is responsible for adding or deducting its own headers when
translating between MTU and MPS. Likewise the Packetization Layer is
responsible for adding or deducting its own headers when performing
calculations in its natural data units.
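The division of labor described above can be sketched in Python as
follows. The function names are hypothetical. The header sizes are the
fixed sizes of the IPv4, IPv6 and TCP headers without options or
extension headers; anything beyond those is passed in explicitly.

```python
IPV4_HEADER = 20   # bytes, without options
IPV6_HEADER = 40   # bytes, without extension headers
TCP_HEADER = 20    # bytes, without options


def mtu_to_mps(mtu, ip_header=IPV6_HEADER, ip_extra=0):
    """IP layer: deduct its own header plus any options or extension headers."""
    return mtu - ip_header - ip_extra


def mps_to_mtu(mps, ip_header=IPV6_HEADER, ip_extra=0):
    """IP layer: add its own headers back to obtain a full packet size."""
    return mps + ip_header + ip_extra


def mps_to_mss(mps, tcp_options=0):
    """TCP as the Packetization Layer: deduct the TCP header and options."""
    return mps - TCP_HEADER - tcp_options


def mss_to_mps(mss, tcp_options=0):
    """TCP as the Packetization Layer: add the TCP header and options back."""
    return mss + TCP_HEADER + tcp_options
```

For a 1500 byte MTU this gives an MPS of 1460 bytes over IPv6 and
1480 bytes over IPv4. With 12 bytes of TCP options on the IPv4 path,
the segment payload is 1448 bytes.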
This document does not take a stance on the placement of IPsec, which
logically sits between IP and the Packetization Layer. As far as
PLPMTUD is concerned IPsec can be treated either as part of IP or as
part of the Packetization Layer, as long as the accounting is
consistent within any given implementation. If IPsec is treated as
part of the IP layer, then each security association to a remote node
may need to be treated as a separate flow for PLPMTUD, if they have
different length security headers. If IPsec is treated as part of the
packetization layer, the IPsec header size has to be included in the
Packetization Layer's header size calculations.
5.1.2 Storing PMTU information
This memo uses the concept of a "flow" to define the scope in which
path MTU information is used. Each flow locally stores its maximum
payload size (MPS), which is used for packetizing data.
Packetization Layers may communicate with the IP layer to store or
access cached MPS values, providing a means by which similar flows
may share information. The IP layer also stores PMTU and derived MPS
information when it receives Packet Too Big messages.
Ideally, a PMTU value should be associated with a specific path
traversed by packets exchanged between the source and destination
nodes. However, in most cases a node will not have enough
information to completely and accurately identify such a path.
Rather, a node must associate a PMTU value with some local
representation of a path. It is left to the implementation to select
the local representation of a path.
An implementation could use the destination address as the local
representation of a path. The PMTU value associated with a
destination would be the minimum PMTU learned across the set of all
paths in use to that destination. The set of paths in use to a
particular destination is expected to be small, in many cases
consisting of a single path. This approach will result in the use of
optimally sized packets on a per-destination basis. This approach
integrates nicely with the conceptual model of a host as described in
[ND@@@@]: a PMTU value could be stored with the corresponding entry
in the destination cache. However, NAT and other forms of middle
boxes may exhibit differing MTUs at a single IP address.
If IPv6 flows are in use, an implementation could use the IPv6 flow
id [7][14] as the local representation of a path. Packets sent to a
particular destination but belonging to different flows may use
different paths, with the choice of path depending on the flow id.
This approach will result in the use of optimally sized packets on a
per-flow basis, providing finer granularity than PMTU values
maintained on a per-destination basis.
For source routed packets (i.e. packets containing an IPv6 Routing
header, or IPv4 LSRR or SSRR options), the source route may further
qualify the local representation of a path. An implementation
could use source route information in the local representation of a
path.
If IPsec is in use, the security association can also be used to
represent a path.
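One possible shape for such a cache is sketched below in Python. The
class PmtuCache and its methods are hypothetical and are not an
interface defined by this document. The key always contains the
destination address and may be further qualified by a flow id, a
source route, or a security association. The record_limit method keeps
the minimum value learned for a key, as suggested above for
per-destination storage; the store method overwrites the entry, which
an implementation would need after a verified increase.

```python
class PmtuCache:
    """Hypothetical IP-layer cache of MPS values, keyed by a local
    representation of a path."""

    def __init__(self, default_mps):
        self.default_mps = default_mps
        self._entries = {}

    @staticmethod
    def key(destination, flow_id=None, source_route=None, security_assoc=None):
        # Optional qualifiers refine the path; None means "not used".
        route = tuple(source_route) if source_route is not None else None
        return (destination, flow_id, route, security_assoc)

    def lookup(self, key):
        # Unknown paths fall back to the configured default.
        return self._entries.get(key, self.default_mps)

    def record_limit(self, key, mps):
        # Several real paths may share one key, so keep the minimum learned.
        self._entries[key] = min(mps, self._entries.get(key, mps))

    def store(self, key, mps):
        # Overwrite the entry, e.g. after a larger size has been verified.
        self._entries[key] = mps
```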
5.2 Lower Layers
5.2.1 Generating Probes
A new candidate MTU is tested by sending one "probe packet", which is
larger than the current MTU. In this section we present a couple of
possible ways to alter packetization layers to generate probe
packets. The different techniques incur different overheads in
three areas: difficulty in generating the probe packet (in terms of
packetization layer implementation complexity and computational
overhead), possible additional network capacity consumed by the probes,
and the overhead of recovering from failed probes (both network and
protocol overhead).
For example some protocols might be extended to allow padding with
dummy data within their packets. This would greatly simplify the
implementation because the probing can be performed without
participation from the application and if the probe fails, the
missing data (the "probe gap") is assured to fit within the current
MTU when it is retransmitted. However, the padding does consume
network capacity without carrying any useful payload.
This technique does not work for TCP, because there is not a separate
length field or other mechanism to differentiate between padding and
real payload data. With TCP the natural approach is to send
additional payload data in an over-sized segment. There are several
variants which have different tradeoffs.
In one method, after a TCP probe segment has been sent the subsequent
segment(s) may be sent as though the probe segment was not
over-sized. Thus if the probe segment is lost, it will leave a gap
in the sequence space that is exactly the correct size to be filled
by one segment at the current MTU. Since this method generates
overlapping data, it will cause duplicate acknowledgments if the
probe is successfully delivered. The sender must be capable of
ignoring these expected duplicate acknowledgments in a manner which
will not cause unnecessary retransmission or congestion window
reduction.
In the second method, after a TCP probe segment has been sent,
subsequent TCP segments are sent in a non-overlapping manner. If the
probe segment is lost, it will leave a gap which will require
retransmission of multiple segments to fill. This method has lower
overhead for successful probes, but it requires more complexity in
the retransmit logic to correctly retransmit the missing data (the
"probe gap") with multiple segments that fit into the old MTU, while
properly suppressing the congestion adjustments for this one
situation and no others.
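The difference between the two TCP methods is easiest to see in
sequence space. The following Python sketch is illustrative only: the
function names are hypothetical, sizes are TCP payload bytes, and
sequence number wraparound is ignored.

```python
def segments_after_probe(probe_start, probe_len, current_mss, count, overlapping):
    """Return (start, length) sequence ranges for `count` segments sent
    after a probe.

    overlapping=True  -- first method: continue as though the probe had
                         been an ordinary segment of current_mss bytes
    overlapping=False -- second method: continue after the end of the probe
    """
    first = probe_start + (current_mss if overlapping else probe_len)
    return [(first + i * current_mss, current_mss) for i in range(count)]


def gap_if_probe_lost(probe_start, probe_len, current_mss, overlapping):
    """Return the (start, length) retransmissions that fill the probe gap,
    each fitting within the old segment size."""
    gap_len = current_mss if overlapping else probe_len
    out, offset = [], 0
    while offset < gap_len:
        length = min(current_mss, gap_len - offset)
        out.append((probe_start + offset, length))
        offset += length
    return out
```

With a current segment size of 1024 bytes and a 1460 byte probe
starting at sequence number 1000, the first method leaves a gap that
one 1024 byte retransmission fills, while the second method leaves a
gap that needs two retransmissions of 1024 and 436 bytes.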
Several Packetization protocols may be best served by using an
adjunct protocol for MTU probing: a separate protocol (or protocol
feature) that does not carry any real application data. This greatly
simplifies implementation because nothing needs to be retransmitted
when the probe is lost, but it does consume network capacity without
delivering any useful payload.
Two important examples of this come to mind: SCTP [9] which might use
its existing HEARTBEAT facility padded with dummy data to fill out
the probe packet; and IP fragmentation which is sometimes used as a
Packetization layer for carrying oversized datagrams as described in
Section 5.2.6. In the case of IP fragmentation an entirely separate
protocol is needed, which has to use the diagnostic interface described
in Section 5.5.4.
It should be clear that nearly all packetization layers can be
adapted to support PLPMTUD, possibly in more than one way.
5.2.2 Selecting the initial MTU
When the PLPMTUD process is started the initial MTU should normally
be set such that the Packetization Layer can carry 1 kByte data
segments. This initial MTU should be 1 kByte plus space for IP and
Packetization layer headers. (see Section 5.1 on accounting for
headers). With this MTU, RFC2414 [6] allows TCP and other
transport protocols to start with an initial window of 4 packets.
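The arithmetic behind this choice can be checked with a short Python
sketch. The initial window formula is the upper bound given in RFC
2414, min(4*MSS, max(2*MSS, 4380 bytes)). We assume here that "1
kByte" means 1024 bytes and that headers carry no options; the
function names are hypothetical.

```python
def rfc2414_initial_window(mss):
    """Upper bound on the initial window in bytes, per RFC 2414."""
    return min(4 * mss, max(2 * mss, 4380))


def initial_mtu(data_bytes=1024, pl_header=20, ip_header=40):
    """Initial MTU: the data segment plus Packetization Layer and IP headers.

    The defaults assume TCP over IPv6 without options or extension headers.
    """
    return data_bytes + pl_header + ip_header
```

Under these assumptions the initial MTU is 1084 bytes over IPv6 and
1064 bytes over IPv4. A 1024 byte segment size yields an initial
window of 4096 bytes, or 4 packets, whereas a 1460 byte segment size
yields 4380 bytes, or 3 packets.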
We suspect, but have not confirmed, that TCP actually starts faster
(and completes sooner for small packets) with 1kB packets rather than
1500 byte packets because the 2nd data ACK occurs one round trip
earlier.
This initial MTU should also be configurable. One of the
configuration options should be to set it to default to the
interface's MTU, to mimic classical PMTUD behavior. (See Section 5.5.1.)
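As an illustration only, the arithmetic behind this recommendation can be sketched in Python. The sketch assumes that "1 kByte" means 1024 bytes and that the headers are a 20 byte IPv4 header and a 20 byte TCP header with no options; the function names are ours and are not part of any implementation. The initial window formula is the one given in RFC 2414, min(4*MSS, max(2*MSS, 4380 bytes)).

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def initial_mtu(payload=1024, ip_header=IPV4_HEADER, pl_header=TCP_HEADER):
    """Initial MTU: 1 kByte of data plus IP and Packetization Layer headers."""
    return payload + ip_header + pl_header


def rfc2414_initial_window(mss):
    """Initial window in bytes: min(4*MSS, max(2*MSS, 4380))."""
    return min(4 * mss, max(2 * mss, 4380))


def initial_window_packets(mss):
    """Number of full-sized segments that fit in the initial window."""
    return rfc2414_initial_window(mss) // mss


print(initial_mtu())                 # MTU for 1024 byte segments
print(initial_window_packets(1024))  # initial window with 1 kByte segments
print(initial_window_packets(1460))  # initial window with 1500 byte packets
```

Under these assumptions a 1024 byte segment size yields a four packet initial window, while a 1460 byte segment size yields three.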
5.2.3 Normal sequence of events to raise the MTU
If the probe size is smaller than the actual path MTU and there are
no other losses, the normal sequence of events to probe and raise the
MTU will be:
1. The probe is sent, followed by more packets at the current MTU.
By definition PLPMTUD enters the probe phase. The probe
propagates through the network and the far node acknowledges it
(or possibly later data, if acknowledgements are cumulative and
delayed acknowledgement is in effect).
2. The acknowledgement for the probe reaches the data sender. By
definition, this ends the probe phase.
3. The packetization layer provisionally raises the MTU to the probe
size. PLPMTUD enters the transitional phase when it starts
sending data using the provisional MTU.
Note that implementations that use packet counts for congestion
accounting (e.g. keep cwnd in units of packets) must re-scale
their congestion accounting such that raising the MTU does not
raise the data rate (bytes/second) or the total congestion window
in bytes.
If the implementation packetizes the data at the application
programming interface, it may transmit already queued data at the
current MTU before raising the MTU. In this case this data is not
part of either the probing or transition phases, because all of
the packets in flight fit within the current MTU.
4. Once the first packet of the transitional phase is acknowledged,
PLPMTUD enters the verification phase. In principle the
verification phase can be of arbitrary duration, however at this
time we are recommending one full window of data (i.e. one full
round trip time) for most Packetization Layers.
5. Once sufficient data has been delivered and acknowledged at the
provisional MTU, the provisional MTU is considered verified and the
path MTU is updated. PLPMTUD can then probe for an even larger MTU,
as described in the searching strategy in Section 5.3.
Other events described in the next section are treated as exceptions
and alter or cancel some of the steps above.
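The five steps above can be sketched as a small state machine. This is a minimal illustration in Python, not a specification: the class, the method names and the NONE phase (used before a probe and while already queued data drains at the current MTU) are our own assumptions, and no loss handling is modeled.

```python
from enum import Enum


class Phase(Enum):
    NONE = "none"                  # no probe outstanding (hypothetical name)
    PROBE = "probe"
    TRANSITION = "transition"
    VERIFICATION = "verification"


class RaiseMtu:
    """Tracks the loss-free sequence of events that raises the MTU."""

    def __init__(self, current_mtu):
        self.current_mtu = current_mtu
        self.provisional_mtu = None
        self.probe_size = None
        self.phase = Phase.NONE

    def _require(self, phase):
        if self.phase is not phase:
            raise ValueError("unexpected event in phase %s" % self.phase.value)

    def send_probe(self, probe_size):      # step 1
        self._require(Phase.NONE)
        self.probe_size = probe_size
        self.phase = Phase.PROBE

    def probe_acked(self):                 # steps 2 and 3: provisional raise
        self._require(Phase.PROBE)
        self.provisional_mtu = self.probe_size
        self.phase = Phase.NONE            # queued data may still drain

    def start_provisional_sending(self):   # step 3: first provisional packet
        if self.provisional_mtu is None:
            raise ValueError("no provisional MTU")
        self._require(Phase.NONE)
        self.phase = Phase.TRANSITION

    def first_provisional_acked(self):     # step 4
        self._require(Phase.TRANSITION)
        self.phase = Phase.VERIFICATION

    def window_verified(self):             # step 5
        self._require(Phase.VERIFICATION)
        self.current_mtu = self.provisional_mtu
        self.provisional_mtu = None
        self.probe_size = None
        self.phase = Phase.NONE
```

The path MTU changes only in the final step; until then the provisional MTU is kept separately so that it can be abandoned if verification fails.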
5.2.4 Processing MTU Indications
The descriptions below assume that the Packetization Layer protocol
has a TCP fast retransmit style mechanism to synchronously
detect the loss of a probe packet and trigger retransmission, without
loss of the protocol's self clock. If this fails, then some sort of
retransmission timeout will serve to catch the loss. It also
assumes that there is some mechanism to detect full-stop timeouts.
If any of these events (or the receipt of an ICMP Packet Too Big
message) occurs during the above process to raise the MTU, then
it is processed as indicated in the following sections.
5.2.4.1 Processing Packet Too Big Messages
Classical PMTU discovery specifies the generation of Packet Too Big
Messages if an over-sized packet (e.g. a probe) encounters a link
that has a smaller MTU. Since these messages can not be authenticated
they introduce a number of well documented attacks against classical
PMTUD [5].
With PLPMTUD these messages are not required for correct operation,
and in principle can be summarily ignored at the expense of slower
convergence to the proper MTU. However we believe that a slightly
better compromise is to process Packet Too Big messages in two
specific contexts: in conjunction with a PLPMTUD probe or a full-stop
timeout.
Every Packet Too Big Message should be subjected to the following
checks:
o If globally forbidden then discard the message.
o If forbidden by the application then discard the message.
o If this path has been tagged "bogus ICMP messages" then discard
the message.
o If the reported MTU fails consistency checks then set the "bogus ICMP
messages" flag for this path and discard the message. These
consistency checks include:
* unrecognized or unparseable enclosed header,
* reported MTU is larger than the size indicated by the enclosed
header,
* reported MTU is larger than the current MTU, provisional MTU or
probe size as appropriate, or
* failure of an ICMP consistency check specific to the
Packetization Layer (e.g. the SCTP Verification-Tag mechanism
[9][16]).
To ease migration, it is suggested that implementations may
include global controls to suppress some or all of the consistency
checks.
If the Packet Too Big Message is acceptable under all of these checks,
do one of two things depending on a global configuration switch:
emulate classical path MTU discovery by processing the message
immediately (i.e. set the path MTU to the size indicated in the
message) or save the "ICMP MTU", pending another PLPMTUD event. In
the latter case the saved ICMP MTU will only be acted upon under
appropriate conditions if there are lost probes, verification packets
or a full stop timeout. This greatly reduces the impact of
fraudulent ICMP Packet Too Big messages.
In either case if the Packetization Layer calls for specific actions
in response to a Packet Too Big message, that action should be
invoked only at the point when the path MTU is updated from the ICMP
MTU.
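The checks and the two possible dispositions described in this section can be sketched as follows. This is a minimal Python illustration under stated assumptions: the PathState fields and function names are hypothetical, the Packetization Layer specific check is reduced to a boolean supplied by the caller, and "as appropriate" is interpreted as comparing against the probe size if a probe is outstanding, otherwise the provisional MTU, otherwise the current MTU.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PathState:
    current_mtu: int
    provisional_mtu: Optional[int] = None
    probe_size: Optional[int] = None
    bogus_icmp: bool = False          # the "bogus ICMP messages" tag
    icmp_mtu: Optional[int] = None    # saved "ICMP MTU"


def accept_packet_too_big(path, reported_mtu, enclosed_len, *,
                          globally_forbidden=False, app_forbidden=False,
                          pl_check_ok=True):
    """Return True if the message passes every check.

    enclosed_len is the packet size taken from the enclosed header,
    or None if that header could not be parsed.
    """
    if globally_forbidden or app_forbidden or path.bogus_icmp:
        return False
    # Assumed reading of "current MTU, provisional MTU or probe size".
    limit = path.probe_size or path.provisional_mtu or path.current_mtu
    consistent = (enclosed_len is not None
                  and reported_mtu <= enclosed_len
                  and reported_mtu <= limit
                  and pl_check_ok)
    if not consistent:
        path.bogus_icmp = True        # tag the path and discard
        return False
    return True


def process_packet_too_big(path, reported_mtu, emulate_classical):
    """Apply an accepted message according to the global switch."""
    if emulate_classical:
        path.current_mtu = reported_mtu   # classical PMTUD behavior
    else:
        path.icmp_mtu = reported_mtu      # saved for a later PLPMTUD event
```

Note that once a path has been tagged, every later message for that path is discarded, including messages that would otherwise pass the checks.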
5.2.4.2 Packetization Layer Detects Lost Packets
Each packetization protocol has its own mechanism to detect lost
packets and request the retransmission of missing data. The primary
signals used by the packetization layer are these protocol specific
loss indications. The packetization layer is responsible for
retransmitting the lost data and notifying PLPMTUD that there was a
loss.
o If the probe itself was lost, and there were no other losses
during the probe phase (the RTT between when the probe was sent
and the loss detected) then it is taken as an indication that the
path MTU is smaller than the probe size. In this situation alone
the Packetization Layer is permitted to retransmit the missing
data (the "probe gap") without adjusting its congestion window or
data transmission rate.
If an accepted Packet Too Big Message was received after the probe
was sent, and it passes the additional checks that the ICMP MTU is
greater than the current MTU and less than the probe size, then
set the probe size to the ICMP MTU, and restart the probe process
from step 1 in Section 5.2.3.
If there was not an accepted Packet Too Big Message, then the
indicated event is a "probe failure", which can be retried with a
smaller probe size after a suitable delay for a probe_fail_event.
See Section 5.2.5 for more complete descriptions of failure
events.
o If there are losses during the probe phase and the probe was not
lost, then the probe was successful. However, since additional
losses have the potential to spoil the verification phase, it is
important that PLPMTUD not progress into the transition phase
(step 3 above) until after the Packetization Layer has fully
recovered from the losses and completed the congestion window (or
rate) adjustment.
o If there are losses during the probe phase and the probe was also
lost, the outcome depends on the presence of an ICMP MTU set by an
acceptable Packet Too Big Message.
If there was an accepted Packet Too Big Message received since the
probe was sent, and it passes the additional checks that the ICMP
MTU is greater than the current MTU and less than the probe size,
then set the probe size to the ICMP MTU, and once the
Packetization Layer completes the recovery from the losses then
restart the probe process from step 1 in Section 5.2.3.
If there was not an accepted Packet Too Big Message, then the
probe is inconclusive because the lost probe might have been
caused by congestion. The probe can be retried after a suitable
delay for a probe_inconclusive_event.
o It is unlikely that losses during the transition phase are caused
by PLPMTUD, however they do potentially complicate the
verification phase. Note that we are referring to losses that are
followed by acknowledgement of packets that were sent at the old
MTU, while the transition to the provisional MTU is still
propagating through the network. The first acknowledgement of
the provisional MTU (and the transition to the verification phase)
is most likely going to occur during the recovery of the losses in
transition phase. It is important that the Packetization Layer
retransmission machinery distinguish between losses at the old MTU
(transition phase) and the provisional MTU (the verification
phase, discussed next).
o Losses during the verification phase are taken as an indication
that the path may have a non-uniform MTU or some other problems
such that raising the MTU substantially raises the loss rate. If
so, this is potentially a very serious problem, so the provisional
MTU is considered to have errored and the path MTU is set back to
the previously verified MTU (the previously current MTU).
Packet loss during the verification phase might also be due to
coincidental congestion on the path, unrelated to the probe, so it
would seem to be desirable to re-probe the path. The risk is that
this effectively raises the tolerated loss threshold because even
though raising the MTU seemed to cause additional loss, there is a
statistical chance that repeated attempts to verify a new MTU may
yield a false pass. The compromise is to re-probe once with
the same probe size (after delay probe_inconclusive_event), and if
this also fails, then the probe may not be retried until after a
suitable delay for a verification_error_event, which exponentially
increases on each successive failure.
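The probe phase cases above reduce to a small decision table. The following Python sketch illustrates only those cases (losses in the transition and verification phases are not modeled); the function name and the returned labels, other than the two event names taken from this document, are hypothetical.

```python
def probe_phase_outcome(probe_lost, other_losses, icmp_mtu,
                        current_mtu, probe_size):
    """Classify the result of a probe phase.

    icmp_mtu is the MTU from an accepted Packet Too Big Message received
    since the probe was sent, or None if there was no such message.
    """
    # Additional check: the ICMP MTU must lie strictly between the
    # current MTU and the probe size to be used as a new probe size.
    icmp_usable = (icmp_mtu is not None
                   and current_mtu < icmp_mtu < probe_size)

    if not probe_lost:
        # Probe delivered; wait for loss recovery before the transition.
        return "success_after_recovery" if other_losses else "success"
    if icmp_usable:
        if other_losses:
            return "reprobe_at_icmp_mtu_after_recovery"
        return "reprobe_at_icmp_mtu"
    if other_losses:
        return "probe_inconclusive_event"  # loss may be due to congestion
    return "probe_fail_event"              # retry with a smaller probe
```

Only the "probe_fail_event" and "reprobe_at_icmp_mtu" outcomes correspond to the one situation in which the probe gap may be retransmitted without a congestion adjustment, because only then was the probe the sole loss.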
5.2.4.3 Packetization Layer Retransmission Timeout
Note that we do not make distinctions between the various methods
that different Packetization Layers might use for detecting and
retransmitting lost packets. It is preferable that the
Packetization Layer uses a recovery mechanism similar to TCP SACK or
fast retransmit (or other "synchronous" loss recovery mechanism) to
detect losses and recover as quickly as possible.
Under some conditions the Packetization Layer may have to rely on
retransmission timeouts or other fairly disruptive techniques to
recover from losses. Since these greatly increase the cost of
failed probes, it is recommended that PLPMTUD use even longer delays
before re-probing. In these situations replace probe_fail_event with
probe_timeout_event.
5.2.4.4 Packetization Layer Full Stop Timeout
Under all conditions (not just during MTU probing) a full stop
timeout should be taken as an indication of some significantly
disruptive event in the network, such as a router failure or a
routing change to a path with a smaller MTU.
If the ICMP MTU is set, and it is less than the current MTU (or
provisional MTU during the transitional phase), then the path MTU can
be reduced to the ICMP MTU. This is the only situation (a full stop
timeout) outside of a probe in which we recommend that the path MTU be
set from the ICMP MTU. (In Section 5.5.1 we relax this recommendation
to facilitate migration to PLPMTUD in exchange for slightly less
protection from corrupt Packet Too Big messages.)
Note that whenever a problem with the path causes a full-stop
timeout (also known as a "persistent timeout" in other documents),
several different path restart/recovery algorithms may be invoked at
different layers in the stack. Some device drivers may be restarted
[@@], router discovery [@@], ES-IS [@@] and so forth. We recommend
that in most situations the first action should be to set the path MTU
down. Note that this recommendation is really beyond the scope of
this document, and may require substantial additional research.
Therefore, if there is a full stop timeout and there was not an ICMP
message indicating a reason (Packet Too Big, Net unreachable, etc, or
the ICMP message was ignored for some reason), we suggest that the
first recovery action should be to set the path MTU down to a safe
minimum "restart MTU" value and to reset the PLPMTUD search state, so
PLPMTUD will start over again searching for the proper MTU. The
default restart_MTU should be the minimum MTU as specified by IPv4
(576)[1] or IPv6 (1280) [7] as appropriate, unless overridden by some
global control (See Section 5.5.5).
If and only if the full stop timeout happens during the probe or
transition phases (e.g. after sending data using the provisional
MTU but before any of it is acknowledged) is it considered likely
that raising the MTU caused the full stop timeout. If so this
situation is likely to be cyclic, because resetting the PLPMTUD
search state is likely to eventually cause re-probing the same
problematic MTU.
It is tempting to define additional states to detect recurrent full
stop timeouts. However in today's hostile network environment, there
is little tolerance for nodes that are so fragile that they can be
disrupted by something as simple as oversized packets. Therefore we
do not feel that it is worth the overhead of specifying a state
machine that is capable of automatically detecting these situations and
disabling PLPMTUD. However, it is important that there be a manual
way to disable or limit probing on specific paths. See Section
5.5.5.
5.2.5 Probing Intervals
Section 5.2.4.2 describes a number of probe failure events. In all
cases the basic response is the same: to wait some time interval
(dependent on the specific event and possibly the history) and then
to probe again. For events that are "inconclusive", it is generally
appropriate to re-probe with the same probe size. For events that
are identified as "failed probes" it is generally appropriate to
re-probe with a smaller probe size. The search strategy described
in Section 5.3 is used to select probe sizes.
Many of the intervals below are specified in terms of elapsed round
trips relative to the current congestion window. This is because
TCP and other Packetization Layer protocols tend to exhibit periodic
losses which cause periodic variations of the congestion window and
possibly the data rate. It is preferable that the PLPMTUD probes are
scheduled near the low point of these cycles to minimize ambiguities
caused by congestion losses.
In order from least to most serious:
probe_inconclusive_event Other lost packets near the lost probe made
the probe result ambiguous. Since the loss of non-probe packets
requires a window (or data rate) reduction, it is desirable to
schedule the re-probe (at the same probe size) at one round trip
time after the end of the loss recovery. This will be almost the
minimum congestion window size, with a small cushion to minimize
the chances that correlated losses caused by some other bursty
connection spoil another probe.
probe_fail_event A probe fail event is the one situation under which
the Packetization layer is permitted not to treat loss as a
congestion signal. Because there is some small risk that
suppressing congestion control might have unanticipated
consequences (even for one isolated loss), we require that probe
fail events be less frequent than the normal period for losses
under standard congestion control. Specifically after a probe
fail event and suppressed congestion control, PLPMTUD may not
probe again until an interval which is comparable to the expected
interval between congestion control events. See Section 4.
The simplest estimate of the interval to the next congestion event
is the same number of round trips as the current window in
packets.
probe_timeout_event Since this event was detected by a timeout, it is
relatively disruptive to protocol operation. Furthermore, since
the event indirectly includes a window adjustment that may have
been caused by the MTU probe, it is important that the probe not
be repeated until congestion has more than recovered from the
loss. Therefore we recommend five times the probe_fail_event
interval. I.e. five times as many round trips as the current
congestion window in packets.
verification_error_event A verification fail event indicates that a
probe was delivered and the verification phase failed twice
separated by a congestion adjustment (so the second verification
phase was at a low point in the congestion control cycle). This is
an indication that one of the following three things might have
happened: repeated losses unrelated to PLPMTUD; the path is
striped across links with dissimilar MTUs; or the link layer has
some parametric limitation such that raising the MTU greatly
increases the random error rate.
The optimal method of responding to this situation is an open
research question. We believe that the correct response is some
combination of exponentially lengthening backoffs (e.g. starting
at 1 minute and quadrupling on each repeat) and implicitly
treating the situation as a probe fail (and choosing a smaller
probe size) after some threshold number of repeated
verification_error_events.
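The numeric intervals suggested above can be illustrated with a short Python sketch. The function names are ours; the values come from the text (the current window in packets as a count of round trips, five times that for timeouts, and a backoff starting at 1 minute that quadruples on each repeat). The re-probe after a probe_inconclusive_event is tied to the end of loss recovery rather than to a fixed interval, so it is not modeled here.

```python
def probe_fail_interval_rtts(cwnd_packets):
    """Round trips to wait after a probe_fail_event.

    Simplest estimate of the interval to the next congestion event:
    as many round trips as the current window in packets.
    """
    return cwnd_packets


def probe_timeout_interval_rtts(cwnd_packets):
    """Round trips to wait after a probe_timeout_event (five times longer)."""
    return 5 * probe_fail_interval_rtts(cwnd_packets)


def verification_error_delay_seconds(repeat_count, base=60, factor=4):
    """Seconds to wait after a verification_error_event.

    repeat_count is 0 for the first error on a path; the delay starts
    at one minute and quadruples on each repeat (example values only).
    """
    return base * factor ** repeat_count


for n in range(3):
    print(n, verification_error_delay_seconds(n))
```

With a congestion window of 20 packets this gives 20 round trips after a failed probe and 100 round trips after a probe that was recovered by timeout.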
5.2.6 Host fragmentation
Packetization layers are encouraged to avoid sending messages that
will require fragmentation (for the case against fragmentation, see
[17][18]). However this is not always possible. Some packetization
layers, such as a UDP application outside the kernel, may be unable
to change the size of the messages they send. This may result in packet
sizes that exceed the Path MTU.
IPv4 permitted such applications to send packets without DF set.
Oversized packets without DF would be fragmented in the network or
sending host when they encountered a link with a small MTU. In some
cases, packets could be fragmented more than once if there were
cascaded links with progressively smaller MTUs.
This approach is no longer recommended. We now recommend that IPv4
implementations use a strategy that mimics IPv6 functionality. When
an application sends datagrams that are larger than the known path
MTU they should be fragmented to the path MTU in the host IP layer
even if they are smaller than the link MTU of the first hop networks
directly attached to the host. The DF bit should be set on the
fragments, so they will not be fragmented again in the network.
This technique will minimize future surprises as the Internet
migrates to IPv6. Otherwise there is the potential for widely
deployed applications or services relying on IPv4 fragmentation, in a
way that can not be implemented in IPv6. At least one major operating
system already uses this strategy.
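As an illustration of fragmenting to the path MTU in the host, the following Python sketch computes the payload sizes of the IPv4 fragments for one datagram. It assumes a 20 byte IPv4 header without options and relies on the IPv4 rule that every fragment except the last carries a multiple of 8 bytes of payload; the function name is hypothetical, and setting the DF bit on each fragment is left to the IP layer.

```python
def fragment_payload_sizes(ip_payload_len, path_mtu, ip_header=20):
    """Split an IP payload into fragment payloads that fit the path MTU.

    All fragments except the last carry a multiple of 8 bytes, because
    the IPv4 fragment offset is expressed in 8 byte units.
    """
    max_payload = (path_mtu - ip_header) // 8 * 8
    if max_payload <= 0:
        raise ValueError("path MTU too small for any fragment payload")
    sizes = []
    remaining = ip_payload_len
    while remaining > max_payload:
        sizes.append(max_payload)
        remaining -= max_payload
    sizes.append(remaining)
    return sizes


# A 3000 byte IP payload sent over a path with a known MTU of 1400 bytes,
# even though the first hop link might allow 1500 byte packets.
print(fragment_payload_sizes(3000, 1400))
```

Each resulting packet, including its header, is no larger than the path MTU, so with DF set the fragments are never fragmented again in the network.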
Note that in principle the IP fragmentation layer is an example of a
Packetization Layer, so it could implement full PLPMTUD in the
fragmentation process.
5.2.7 Multicast
In the case of a multicast destination address, copies of a packet
may traverse many different paths to reach many different nodes. The
local representation of the "path" to a multicast destination must in
fact represent a potentially large set of paths.
Minimally, an implementation could maintain a single MPS value to be
used for all packets originated from the node. This MPS value would
be the minimum MPS learned across the set of all paths in use by the
node. This approach is likely to result in the use of smaller
packets than is necessary for many paths.
Alternatively, if the application using multicast gets complete
delivery reports (unlikely because this requirement has poor scaling
properties), PLPMTUD could be implemented in multicast protocols.
5.3 Search Strategy
The search strategy described here is only a guide for implementors.
A standard algorithm is not specified because the strategy can
include many heuristics to optimize MPS selection for a given path.
Particularly, it may be appropriate for different protocols to follow
different strategies. There is opportunity for future improvements
to this algorithm.
The search strategy uses three variables:
SEARCH_MAX is the largest MPS that a flow might be able to use.
It is determined by such considerations as interface MTU, widths
of protocol length fields, and possibly other protocol-dependent
values, such as the TCP MSS option. In many cases it would be
the same as the classical MTU discovery initial MSS, minus the IP
layer headers.
SEARCH_LOW is the largest validated MPS, and should be used as the
effective MPS by the packetization layer. It is the same as the
current validated MTU minus the IP layer headers. The initial
value for SEARCH_LOW should be a parameter, but a value of 1024
may be a reasonable default.
SEARCH_HIGH is the least invalidated MPS. In most cases it will
be the most recent failed probe size minus the IP layer headers.
When PLPMTUD is initialized SEARCH_HIGH should be set to
SEARCH_MAX.
There are three major states: Search, Monitor and Suspend. In the
Search state, it incrementally searches for the largest MPS that the
path can support, narrowing the difference between SEARCH_LOW and
SEARCH_HIGH. Once this gap is sufficiently narrow, the probing
algorithm enters the Monitor state where it probes infrequently to
detect if the path MPS has become larger.
If the MPS probing is determined harmful, perhaps by persistent probe
failures, the flow may enter the Suspend state, completely disabling
MPS probing.
5.3.1 Search
In the Search state, the strategy follows a multi-phase scan. If
SEARCH_HIGH >= SEARCH_MAX, a coarse scan is used. In this mode, each
probe's payload size should be MIN(2 * SEARCH_LOW, SEARCH_MAX). If
SEARCH_HIGH < SEARCH_MAX, the fine scan mode should be used.
The fine scan algorithm may pursue a number of different methods for
choosing probe sizes. It may be useful to choose probe sizes so that
the final IP packet will fit common link MTUs, for example 1500,
4352, 9000, 17914. Optionally, probes smaller than these values by
common tunnel header sizes may be used.
When using some protocols, the cost for a failed probe may be
significantly higher than the cost of a successful probe due to
retransmission and consequent delay jitter as seen by the
application. For this reason, one possible approach to the fine scan
could be to use probes of size SEARCH_LOW + d, for some increment d.
It should enter the Monitor state when SEARCH_LOW + d >= SEARCH_HIGH.
This will result in at most one additional probe failure.
Another approach may be to use a simple binary search where each
probe size is (SEARCH_LOW + SEARCH_HIGH) / 2, entering the Monitor
state when SEARCH_LOW + s >= SEARCH_HIGH for some threshold s. This
will converge quickly, but may have a higher number of probe
failures. It is more appropriate for a protocol whose probes consist
entirely of padding.
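The coarse scan followed by the binary search variant of the fine scan can be sketched in Python as below. This is one possible reading, not a standard algorithm. Two details are our own assumptions rather than part of the text: the max_failed flag, which ends the coarse scan once a probe of size SEARCH_MAX itself has failed, and the rule that the search stops when SEARCH_LOW has reached SEARCH_MAX. The run_search helper simulates a path whose true MPS is known, purely for illustration.

```python
def next_probe_size(search_low, search_high, search_max,
                    threshold=16, max_failed=False):
    """Return the next probe payload size, or None to enter Monitor."""
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    if search_high >= search_max and not max_failed:
        # Coarse scan: double the validated size, capped at SEARCH_MAX.
        if search_low >= search_max:
            return None  # assumed: nothing larger is left to probe
        return min(2 * search_low, search_max)
    # Fine scan (binary search variant).
    if search_low + threshold >= search_high:
        return None
    return (search_low + search_high) // 2


def run_search(path_mps, search_low=1024, search_max=9000, threshold=16):
    """Simulate the Search state against a path with a known MPS."""
    search_high = search_max
    max_failed = False
    probes = []
    while True:
        size = next_probe_size(search_low, search_high, search_max,
                               threshold, max_failed)
        if size is None:
            return search_low, probes
        probes.append(size)
        if size <= path_mps:
            search_low = size       # probe validated
        else:
            search_high = size      # probe invalidated
            if size >= search_max:
                max_failed = True


low, probes = run_search(path_mps=1460)
print(low, probes)
```

For a path MPS of 1460 bytes the simulated search first fails at 2048 bytes, then narrows the gap by halving until SEARCH_LOW is within the threshold of SEARCH_HIGH.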
5.3.2 Monitor
In the Monitor state, a probe of size SEARCH_HIGH should be sent at
most once every MONITOR_INTERVAL seconds. If the probe succeeds,
then SEARCH_HIGH should be set to SEARCH_MAX, and the state should be
set to Search.
If there is evidence that no flow traffic is reaching its
destination, such as repeated timeouts with no acknowledgements in
TCP, it may be that the connection was re-routed to a path with a
smaller MTU, and the Packet Too Big messages are ignored or filtered.
In this case, SEARCH_LOW and SEARCH_HIGH should be set to initial
values, and the Search state should be entered.
5.3.3 Suspend
In the Suspend state, probing is entirely disabled, and the MPS
should be set to 512 bytes. The Suspend state should only be used if
it is heuristically determined that probing is causing harmful
failures.
5.4 Specific Packetization Layers
In this section we discuss specific implementation issues for different
Packetization Layer protocols.
5.4.1 Probing method using TCP
TCP has no mechanism that could be used to distinguish between real
application data and some other form of padding that might be used to
fill out probe packets. Therefore, TCP must generate probes by
sending oversized segments that are carrying real data from upper
layers. As previously mentioned there are two approaches that TCP
might use to minimize the overheads associated with the probing
process.
A TCP implementation of PLPMTUD can elect to send subsequent segments
overlapping the probe as though the probe segment was not oversized.
This has the advantage that TCP only needs to retransmit one segment
at the current MTU to recover from failed probes. However the
duplicate data in the probe does consume network resources and will
cause duplicate acknowledgments. It is important that these extra
duplicate acknowledgments not trigger Fast Retransmit. This can be
guaranteed by limiting the largest probe segment size to twice the
current segment size (causing at most 1 duplicate acknowledgment) or
three times the current segment size (causing at most 2 duplicate
acknowledgments).
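The relationship between probe size and duplicate acknowledgments can be illustrated with a short Python sketch. It rests on our own simplified model, not on anything specified here: sizes are segment payloads, the probe and the following segments all arrive in order, and a following segment produces a duplicate acknowledgment only if it lies entirely within the data already delivered by the probe. The function name is hypothetical.

```python
def duplicate_acks_from_probe(probe_payload, segment_payload):
    """Count duplicate ACKs caused by segments overlapping a probe.

    The probe covers bytes [0, P). Following segments are sent as if the
    probe had the normal size S, so they cover [S, 2S), [2S, 3S), ...
    A segment is entirely duplicate when it ends at or before P.
    """
    if segment_payload <= 0 or probe_payload < segment_payload:
        raise ValueError("probe must be at least one segment in size")
    return probe_payload // segment_payload - 1


print(duplicate_acks_from_probe(2048, 1024))  # probe of twice the size
print(duplicate_acks_from_probe(3072, 1024))  # probe of three times the size
```

Under this model a probe of up to twice the current segment size causes at most one duplicate acknowledgment and a probe of up to three times the size causes at most two, which stays below the three duplicate acknowledgments that conventionally trigger Fast Retransmit.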
The other approach is to send non-overlapping segments following the
probe. Although this is cleaner from a protocol architecture
standpoint it clashes with many of the optimizations used to improve the
efficiency of data motion within many operating systems. In
particular many implementations divide the data into segments and
pre-compute checksums as the data is copied out of user space. In
these implementations it can be very expensive to adjust segment
boundaries after the data is already queued.
If TCP is using SACK or any other variable length headers, the
headers on the probe and verification packets should be padded to the
maximum possible length. Otherwise, future options may cause delivery
problems if they cause IP packets that are larger than the MTU.
Note that the header size and overhead calculations described in
Section 5.1 apply here. TCP's natural data accounting units are
sequence space and Maximum Segment Size. However, the PLPMTUD
process is described in terms of total packet size, which is larger
than the MSS by all fixed and optional headers.
At the point when TCP is ready to start the verification phase, it is
permitted to transmit already queued data at the old MTU rather than
re-packetize it. This postpones the verification process by the time
required to send the queued data.
If the verification phase experiences any segment losses, TCP is
required to pull back to the prior MSS. Since failing the
verification phase should be an infrequent error condition it is less
important that this be as efficient as probing.
5.4.1.1 Window management
Some TCP implementations keep the congestion window in units of
segments. When segment size is increased during a connection, a
conservative implementation should scale cwnd so that, in units of
bytes, it will remain unchanged.
It is recommended that TCP should not probe a new MPS if that MPS
will likely result in a cwnd of less than 5 segments.
If the network becomes too congested, it is recommended that the MPS
be reduced to a smaller size as determined by a heuristic. The
recommended heuristic is to reduce the MPS by half if ssthresh is
reduced to 5 segments or smaller, with a minimum MPS of 512 bytes.
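The three recommendations in this section can be sketched in Python for an implementation that keeps cwnd and ssthresh in units of segments. The function names are hypothetical, and rounding the rescaled window down (so that the window in bytes never grows) is our own assumption.

```python
def rescale_cwnd_segments(cwnd_segments, old_mps, new_mps):
    """Rescale cwnd so that it stays (at most) the same number of bytes."""
    return (cwnd_segments * old_mps) // new_mps


def may_probe(cwnd_segments, current_mps, probe_mps, min_segments=5):
    """Probe only if the rescaled cwnd would be at least 5 segments."""
    rescaled = rescale_cwnd_segments(cwnd_segments, current_mps, probe_mps)
    return rescaled >= min_segments


def reduced_mps(mps, ssthresh_segments, floor=512, trigger=5):
    """Halve the MPS when ssthresh falls to 5 segments or fewer.

    The result is never smaller than the 512 byte minimum MPS.
    """
    if ssthresh_segments <= trigger:
        return max(mps // 2, floor)
    return mps


print(rescale_cwnd_segments(20, 1024, 2048))  # same bytes, fewer segments
print(may_probe(8, 1024, 2048))               # rescaled cwnd would be too small
print(reduced_mps(1460, 4))                   # congested path, MPS halved
```

Doubling the segment size halves the window measured in segments, so a connection with a window of 8 segments would drop to 4 and should not probe.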
5.4.2 Probing method using SCTP
In the SCTP protocol packetization is the responsibility of the
application or protocol above SCTP. The application writes a set
message to SCTP and SCTP will "chunkify" it into appropriate sized
pieces. Some implementations MAY bundle multiple data chunks
together, but this is NOT required implementation behavior. By
implication not all SCTP implementations can easily generate probes
sending additional application data. In particular any implementation
that does not implement data chunk bundling would not be able to
implement a probe.
For SCTP the recommended method for generating probes is to pad SCTP
HEARTBEAT messages to the desired probe size. A successful probe
will be acknowledged without delay by the peer SCTP implementation
returning the same HEARTBEAT as a HEARTBEAT-ACK. This assures that
both directions will support the probed MTU size. [@@@@@ note that
both sides of the path are tested]
The verification phase is entered after a successful probe. For
implementations that can bundle multiple DATA chunks the verification
phase completes when a window's worth of bundled DATA chunks are
exchanged at the new MTU value. An SCTP implementation SHOULD arrange
its fragmentation point to be a suitable multiple of the new MTU size
(e.g. if the MTU size is 1500 bytes in IPv4 then a fragmentation
point of 718 bytes might be selected during the verification phase.
This would allow the two bundled DATA chunks to be put together to
exactly equal the proposed new PMTU. After verification is complete
the fragmentation point can then be set to the actual PMTU assuming
that this new value is the smallest MTU of all of the SCTP paths).
An SCTP implementation is allowed to transmit already fragmented DATA
chunks that cannot be bundled together at the new MTU value that were
previously queued. For implementations that do not allow DATA chunk
bundling three subsequent HEARTBEAT messages should be sent over the
next XX@@ RTT's padded to the new proposed MTU value. If all of the
HEARTBEATs are successful then the new PMTU should be adopted for the path.
[@@@@NOTE: it might be simpler to always use multiple HB's to prove
in a PMTU during verification, I leave this up to you. One thing to
keep in mind is that SCTP normally fragments its messages to the
SMALLEST PMTU of all paths... since SCTP is multi-homed this makes it
so any data chunk can fit on ANY path. Most implementations DO bundle
data chunks for this very reason... its easy to do and it allows
larger PMTU's on different paths to be utilized. So using the HB may
be more efficient... its definitely simpler... I leave it to you to
choose. We may also want to mention the ICMP issue with SCTP since a
validated ICMP message with SCTP can always be trusted].
The SCTP Verification-Tag is designed to increase SCTP's robustness in
the presence of a number of attacks, including forged ICMP messages.
It relies on a 32 bit Verification Tag which is initialized to a
random value during connection establishment and placed in the first
64 bits of all SCTP messages. All subsequent messages (including ICMP
messages, which copy at least the first 64 bits of the message) must
match the original Verification Tag, or they are rejected as being
likely attacks against the connection [9][16].
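As an illustration of this check, the following Python sketch extracts the Verification Tag from the bytes of the original SCTP packet quoted in an ICMP message. It assumes the caller has already stripped the quoted IP header, and it relies on the SCTP common header layout in which the first 64 bits are the source port (16 bits), the destination port (16 bits) and the Verification Tag (32 bits). The function names are hypothetical; expected_tag stands for the tag this endpoint places in the packets it sends on the association.

```python
import struct


def quoted_verification_tag(quoted_sctp_bytes):
    """Return the Verification Tag from the first 64 bits, or None."""
    if len(quoted_sctp_bytes) < 8:
        return None  # not enough of the original packet was quoted
    _src_port, _dst_port, tag = struct.unpack("!HHI", quoted_sctp_bytes[:8])
    return tag


def icmp_matches_association(quoted_sctp_bytes, expected_tag):
    """True if the quoted packet carries the tag we send with."""
    tag = quoted_verification_tag(quoted_sctp_bytes)
    return tag is not None and tag == expected_tag


# Example: the quoted start of a packet sent with tag 0xDEADBEEF.
quoted = struct.pack("!HHI", 5000, 6000, 0xDEADBEEF)
print(icmp_matches_association(quoted, 0xDEADBEEF))
```

A message that fails this comparison would be treated as failing the Packetization Layer specific consistency check in Section 5.2.4.1.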
It is believed that the Verification Tag mechanism is strong enough
that SCTP could unconditionally process Packet Too Big messages
that would reduce the path MTU at arbitrary times. As written, this
document does not encourage this method. The PLPMTUD ICMP validity
checks are cascaded with the SCTP checks, such that the messages are
processed only if they meet all consistency checks. In particular,
PLPMTUD only uses the ICMP MTU value following a probe, during MTU
verification, or following a hard stop timeout.
To change this an implementation would have to suppress some of the
checks in Section 5.2.4.1 for SCTP.
5.4.3 Probing Method for IP Fragmentation
As mentioned in Section 5.2.6, datagram protocols (such as UDP) can
rely on IP fragmentation as a packetization layer. Since the IP
layer does not have any way to determine if the fragments were
delivered, it can not do the probing directly. The probing has to
be done with an adjunct protocol that uses the diagnostic API
(Section 5.5.4) to send oversized probes, and some other API to
update the MPS stored in the IP layer.
5.4.4 Issues for other transport protocols
Some transport protocols (such as ISO TP4 [ISOTP]) are not allowed to
repacketize when doing a retransmission. That is, once an attempt is
made to transmit a segment of a certain size, the transport cannot
split the contents of the segment into smaller segments for
retransmission. In such a case, the original segment can be
fragmented by the IP layer during retransmission. Subsequent
segments, when transmitted for the first time, should be no larger
than allowed by the Path MTU.
5.5 Operational Integration
5.5.1 Interoperation with prior algorithms
Properly functioning Path MTU discovery is critical to the robust and
efficient operation of the Internet. Any major change (as described
in this document) has the potential to be very disruptive if it
contains any errors or oversights. Therefore, we offer a deployment
strategy in which classical PMTUD operation as described in RFC 1191
and RFC 1981 is unmodified and PLPMTUD is only invoked following a
full stop timeout, presumably due to an "ICMP black hole". To do
this:
o Relax the ICMP checks in Section 5.2.4.1 specifically to allow an
ICMP Packet Too Large message to reduce the MTU at arbitrary
times.
o When there is no cached MTU, use the Interface MTU as specified by
classical PMTU discovery, rather than the initial MTU as specified in
Section 5.2.2.
o MTU searching as described in Section 5.3 is disabled entirely or
starts in the monitor state.
o A full stop timeout is processed as described in Section 5.2.4.4.
This becomes the only mechanism to invoke the rest of PLPMTUD.
When configured in this manner, PLPMTUD will increase the robustness
of classical PMTU discovery in the presence of ICMP black holes and
other ICMP problems, with minimal exposure to unanticipated problems
during deployment. Since this configuration does not help robustness
in the presence of malicious or erroneous ICMP messages, it is not
recommended for the long term.
5.5.2 Interoperation over subnets with dissimilar MTUs
With classical PMTUD, the ingress router to a subnet is responsible
for knowing what size packets can be delivered to every node attached
to that subnet. For most subnet types, this requires that the
entire subnet has a single MTU which is common to every attached
node. (For a few subnet types (e.g., ATM [12]) the nodes on a subnet
can negotiate the MTU on a pairwise basis, and the ingress router
is responsible for knowing the MTU to each of its peers.)
This requirement has proven to be a major impediment to deploying
larger MTUs in the operational Internet. Often one single node which
does not support a larger MTU effectively vetoes raising the MTU on a
subnet, because the ingress router does not have a mechanism to
generate the proper Packet Too Big Message for the one attached node
with a smaller MTU.
With PLPMTUD, this requirement is completely relaxed. As long as
oversized packets addressed to the nodes with the smaller MTU are
reliably discarded, PLPMTUD will find the proper MTU for these nodes.
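A toy model may make this concrete. Assume, purely for illustration,
a subnet on which one node accepts 9000 byte packets and another only
1500 byte packets, and on which oversized packets are silently
discarded. The binary search below is a simplified stand-in for the
search strategy of Section 5.3, and the starting size of 1024 is an
arbitrary example value.

```python
def make_path(node_mtu: int):
    """Model a path that silently discards packets above node_mtu."""
    def delivers(size: int) -> bool:
        return size <= node_mtu
    return delivers


def search_mtu(delivers, low: int, high: int) -> int:
    """Return the largest size in [low, high] that is delivered.

    Assumes packets of size `low` are known to be delivered. Each
    loop iteration corresponds to sending one probe of size `probe`.
    """
    while low < high:
        probe = (low + high + 1) // 2
        if delivers(probe):
            low = probe        # probe arrived: raise the floor
        else:
            high = probe - 1   # probe lost: lower the ceiling
    return low


small_node = make_path(1500)
jumbo_node = make_path(9000)
print(search_mtu(small_node, 1024, 9000))  # 1500
print(search_mtu(jumbo_node, 1024, 9000))  # 9000
```

Each destination converges on its own MTU without the ingress router
generating any Packet Too Big messages.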
5.5.3 Interoperation with tunnels
PLPMTUD is specifically designed to solve many of the problems that
people are experiencing today due to poor interactions between
classical MTU discovery, IPsec, and various sorts of tunnels [5].
As long as the tunnel reliably discards packets that are too large,
PLPMTUD will discover an appropriate MTU for the path.
Unfortunately, due to the pervasive problems with classical PMTU
discovery, many manufacturers of various types of VPN/tunneling
equipment have resorted to ignoring the DF bit. This not only
violates the IP standard and many recommendations to the contrary
[17][18], it also violates the only requirement that PLPMTUD places
on the link layer: that oversized packets are reliably discarded.
It is imperative that people understand the impact of ignoring the DF
bit both to applications and to PLPMTUD.
We do understand the reality of the situation. It is important that
vendors who are building devices that violate the DF specification
understand that PLPMTUD requires that probe packets be discarded, and
that sending ICMP Packet Too Big messages alone is insufficient to
prevent wholesale fragmentation if the probe packets are delivered.
Therefore, it is imperative that devices that do not honor DF include
packet size history caches and other heuristics to robustly detect
and discard probe packets, if delivering them would require
fragmentation.
5.5.4 Diagnostic tools
All implementations MUST include facilities for MTU discovery
diagnostic tools that implement PLPMTUD or other MTU discovery
algorithms in user mode without help or interference by the PMTUD
algorithm present in the operating system. This requires a
mechanism by which a diagnostic application can send packets that are
larger than the operating system's notion of the current path MTU and
collect any resulting Packet Too Big Messages or other ICMP messages.
For IPv4 the diagnostic application must be able to set the DF bit.
At this time nearly all operating systems support two modes for
sending UDP datagrams: one which silently fragments packets that are
too large, and another that rejects packets that are too large.
Neither of these modes is suitable for efficiently diagnosing
problems with MTU discovery, such as routers that return Packet
Too Big messages containing incorrect size information.
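As one illustration of such a facility, Linux kernels that postdate
this document offer a third mode, IP_PMTUDISC_PROBE, which sets DF
but ignores the kernel's cached path MTU. The sketch below assumes
Linux; the numeric fallbacks are the values from the Linux headers,
the function name is hypothetical, and collecting the resulting ICMP
messages is not shown.

```python
import socket
import sys

# Linux socket option values, used as fallbacks when the Python
# socket module does not export the names.
IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
IP_PMTUDISC_PROBE = getattr(socket, "IP_PMTUDISC_PROBE", 3)


def open_probe_socket():
    """Return a UDP socket suitable for oversized DF probes, or None.

    The socket sets DF on outgoing datagrams and does not reject
    datagrams larger than the kernel's notion of the path MTU.
    None is returned on platforms without this facility.
    """
    if not sys.platform.startswith("linux"):
        return None  # the option numbers above are Linux-specific
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    except OSError:
        return None
    try:
        sock.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER,
                        IP_PMTUDISC_PROBE)
    except OSError:
        sock.close()
        return None
    return sock
```

A diagnostic tool could then send datagrams of chosen sizes with
sendto() and observe which of them are acknowledged by its peer.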
5.5.5 Management interface
It is suggested that an implementation provide a way for a system
utility program to:
o Globally disable all ICMP Packet Too Large message processing
o Globally suppress some or all ICMP consistency checks described in
Section 5.2.4.1. Setting this option forgoes some possible
security improvements, in exchange for making PLPMTUD behave more
like classical PMTU discovery. (See Section 5.5.1)
o Globally permit ICMP Packet Too Large messages to unconditionally
reduce the MTU, even if there were no lost packets.
Setting this option forgoes some possible security improvements, in
exchange for making PLPMTUD behave more like classical PMTU
discovery. (See Section 5.5.1)
o Globally adjust timer intervals for specific classes of probe
failures
In addition, it is important that there be a mechanism to permit per
path controls to override specific parts of the PLPMTUD algorithm.
All of these per path controls can be preset from similar global
controls.
o Disable MTU searching on a given path, such that new MTU values are
never probed.
o Set the initial MTU for a given path. This could be used to
speed convergence in relatively static environments. There
should be an option to cause PLPMTUD to choose the same initial
value as would be chosen by classical PMTU discovery (i.e.,
typically the Interface MTU). This is used in the mode described
in Section 5.5.1 where PLPMTUD is used only for black hole
detection in classical PMTU discovery.
o Limit the maximum probed MTU for a given path. This permits a
manual configuration to work around a link that spuriously
delivers packets that are larger than the useful path MTU.
o Per path and per application controls to disable ICMP processing,
to further limit possible damage from malicious Packet Too Big
messages (in addition to the global controls).
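The relationship between global and per path controls might be
modeled as in the sketch below. All names are hypothetical, timer
adjustments and per application controls are omitted, and the
classical_compat() preset is only one reading of the configuration
described in Section 5.5.1.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class Controls:
    icmp_processing: bool = True         # process Packet Too Big at all
    icmp_checks: bool = True             # Section 5.2.4.1 checks
    icmp_unconditional: bool = False     # reduce MTU without lost packets
    search_enabled: bool = True          # probe for larger MTUs
    initial_mtu: Optional[int] = None    # None: use Section 5.2.2 default
    max_probe_mtu: Optional[int] = None  # None: no administrative limit


def classical_compat(interface_mtu: int) -> Controls:
    """Preset approximating the deployment mode of Section 5.5.1."""
    return Controls(icmp_unconditional=True, search_enabled=False,
                    initial_mtu=interface_mtu)


class ControlTable:
    """Global controls with optional per path overrides."""

    def __init__(self, global_controls: Controls) -> None:
        self._global = global_controls
        self._per_path: Dict[str, Controls] = {}

    def for_path(self, path: str) -> Controls:
        # Paths without an override inherit the global controls.
        return self._per_path.get(path, self._global)

    def override(self, path: str, **changes) -> None:
        # Start from the path's current controls, then apply changes.
        self._per_path[path] = replace(self.for_path(path), **changes)
```

For example, override("192.0.2.1", max_probe_mtu=1500) would limit
probing on one path while leaving every other path on the global
settings.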
6. References
6.1 Normative References
[1] Postel, J., "Internet Protocol", STD 5, RFC 791, September 1981.
[2] Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191,
November 1990.
[3] McCann, J., Deering, S. and J. Mogul, "Path MTU Discovery for IP
version 6", RFC 1981, August 1996.
[4] Bradner, S., "Key words for use in RFCs to Indicate Requirement
Levels", BCP 14, RFC 2119, March 1997.
[5] Kent, S. and R. Atkinson, "Security Architecture for the
Internet Protocol", RFC 2401, November 1998.
[6] Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's
Initial Window", RFC 2414, September 1998.
[7] Deering, S. and R. Hinden, "Internet Protocol, Version 6 (IPv6)
Specification", RFC 2460, December 1998.
[8] Floyd, S., "Congestion Control Principles", BCP 41, RFC 2914,
September 2000.
[9] Stewart, R., Xie, Q., Morneault, K., Sharp, C., Schwarzbauer,
H., Taylor, T., Rytina, I., Kalla, M., Zhang, L. and V. Paxson,
"Stream Control Transmission Protocol", RFC 2960, October 2000.
6.2 Informative References
[10] Mogul, J., Kent, C., Partridge, C. and K. McCloghrie, "IP MTU
discovery options", RFC 1063, July 1988.
[11] Knowles, S., "IESG Advice from Experience with Path MTU
Discovery", RFC 1435, March 1993.
[12] Atkinson, R., "Default IP MTU for use over ATM AAL5", RFC 1626,
May 1994.
[13] Sung, T., "TCP And UDP Over IPX Networks With Fixed Path MTU",
RFC 1791, April 1995.
[14] Partridge, C., "Using the Flow Label Field in IPv6", RFC 1809,
June 1995.
[15] Lahey, K., "TCP Problems with Path MTU Discovery", RFC 2923,
September 2000.
[16] Stewart, R., "Stream Control Transmission Protocol (SCTP)
Implementors Guide", draft-ietf-tsvwg-sctpimpguide-10 (work in
progress), December 2003.
[17] Kent, C. and J. Mogul, "Fragmentation considered harmful",
Proc. SIGCOMM '87 vol. 17, No. 5, October 1987.
[18] Mathis, M., Heffner, J. and B. Chandler, "Fragmentation
Considered Very Harmful", draft-mathis-frag-harmful-00 (work in
progress), July 2004.
Authors' Addresses
Matt Mathis
Pittsburgh Supercomputing Center
4400 Fifth Avenue
Pittsburgh, PA 15213
US
Phone: 412-268-3319
EMail: mathis@psc.edu
John W. Heffner
Pittsburgh Supercomputing Center
4400 Fifth Avenue
Pittsburgh, PA 15213
US
Phone: 412-268-2329
EMail: jheffner@psc.edu
Kevin Lahey
Freelance
EMail: kml@patheticgeek.net
Appendix A. Security Considerations
Under all conditions the PLPMTUD procedure described in this document
is at least as secure as the current standard path MTU discovery
procedures described in RFC 1191 [2] and RFC 1981 [3].
In the recommended configuration, PLPMTUD is significantly harder to
attack than current procedures, because ICMP messages are cached and
only processed in connection with lost packets. This effectively
prevents blind attacks on the path MTU discovery system.
Furthermore, since this algorithm is designed for robust operation
without any ICMP (or other messages from the network), it can be
configured to ignore all ICMP messages (globally or on a per
application basis). In this configuration it cannot be attacked,
unless the attacker can identify and selectively cause probe packets
to be lost.
Appendix B. IANA considerations
None.
Appendix C. Acknowledgements
Most of the SCTP text was contributed by Randall Stewart.
Matt Mathis and John Heffner are supported in this work by a grant
from Cisco Systems, Inc.
Intellectual Property Statement
The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; nor does it represent that it has
made any independent effort to identify any such rights. Information
on the IETF's procedures with respect to rights in IETF Documents can
be found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository at
http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at
ietf-ipr@ietf.org.
Disclaimer of Validity
This document and the information contained herein are provided on an
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Copyright Statement
Copyright (C) The Internet Society (2004). This document is subject
to the rights, licenses and restrictions contained in BCP 78, and
except as set forth therein, the authors retain all their rights.
Acknowledgment
Funding for the RFC Editor function is currently provided by the
Internet Society.