http://stupid.domain.name/ietf/

One document matched: draft-iyengar-minion-protocol-02.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc toc="no"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<rfc category="std" docName="draft-iyengar-minion-protocol-02" ipr="trust200902">
  <front>
    <title abbrev="Minion Wire Protocol">Minion - Wire Protocol</title>

    <author fullname="Janardhan Iyengar" initials="J" surname="Iyengar">
      <organization>Franklin and Marshall College</organization>
      <address>
        <postal>
          <street>Mathematics and Computer Science</street>
          <street>PO Box 3003</street>
          <city>Lancaster</city>
          <region>Pennsylvania</region>
          <code>17604-3003</code>
          <country>USA</country>
        </postal>
        <phone>+1 717 358 4774</phone>
        <email>janardhan.iyengar@fandm.edu</email>
      </address>
    </author>

    <author fullname="Stuart Cheshire" initials="S" surname="Cheshire">
      <organization abbrev="Apple">Apple Inc.</organization>
      <address>
        <postal>
          <street>1 Infinite Loop</street>
          <city>Cupertino</city>
          <region>California</region>
          <code>95014</code>
          <country>USA</country>
        </postal>
        <phone>+1 408 974 3207</phone>
        <email>cheshire@apple.com</email>
      </address>
    </author>

    <author fullname="Josh Graessley" initials="J" surname="Graessley">
      <organization abbrev="Apple">Apple Inc.</organization>
      <address>
        <postal>
          <street>1 Infinite Loop</street>
          <city>Cupertino</city>
          <region>California</region>
          <code>95014</code>
          <country>USA</country>
        </postal>
        <phone>+1 408 974 5710</phone>
        <email>jgraessley@apple.com</email>
      </address>
    </author>

    <date day='21' month='October' year='2013'/>

    <abstract>
      <t>Minion uses TCP-format packets on-the-wire, for
      compatibility with existing NATs, Firewalls, and similar middleboxes,
      but provides a richer set of facilities to the application, as described
      in the Minion Service Model document. This document specifies the
      details of the on-the-wire protocol used to provide those services.</t>
    </abstract>
  </front>

  <middle>

    <?rfc needLines="5" ?>
    <section anchor="terminology" title="Conventions and Terminology Used in this Document">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL"
      in this document are to be interpreted as described in "Key words for use
      in RFCs to Indicate Requirement Levels" <xref target="RFC2119"/>.</t>

      <t>This document uses terminology like "kernel" and "user-level",
      as those terms pertain to many of today's Unix-like operating systems.
      Equivalent concepts apply to software that is built using a different
      architectural model than may not include such an obvious kernel/user split.</t>
    </section>

    <?rfc needLines="37" ?>
    <section anchor="intro" title="Introduction">
      <t>Minion uses TCP-format packets on-the-wire, to provide full
      compatibility with existing NATs, Firewalls, and similar middleboxes,
      but provides a richer set of facilities to the application, described in the
      <xref target="minserv">Minion Service Model and Conceptual API document</xref>.
      This document specifies the details of the on-the-wire protocol used to
      provide those services. Before reading this protocol specification document,
      familiarity with the <xref target="minserv">Minion Service Model</xref>
      is strongly recommended. That information is not repeated here.</t>

      <t>Minion runs over a standard TCP connection. Therefore, IP
      addresses and TCP ports are used just as they are with
      <xref target="RFC0793">TCP</xref>.</t>

      <t>Minion is also designed to be able to use a modified TCP connection
      which supports out-of-order delivery, giving better low-latency
      performance on lossy networks, for use by the kinds of application
      that today would use <xref target="RFC0768">UDP</xref>
      to achieve low-latency delivery.
      The goal of providing low-latency delivery -- and consequently
      the need to be able to handle a data stream that may have gaps --
      is reflected in various aspects of the Minion protocol design,
      such as the use of DTLS instead of TLS, and the use of
      <xref target="COBS">Consistent Overhead Byte Stuffing</xref>
      for reliably extracting messages from an incomplete data stream.
      Minion is able to take advantage of out-of-order delivery where
      the network stack offers that, but Minion does not require it.
      Minion still works correctly when the performance benefits of
      out-of-order delivery are not available.</t>

      <t>Minion supports messages of arbitrary size.
      Large messages are broken into chunks a little under 16 kilobytes each
      (the DTLS maximum record size, minus a few bytes for Minion header).
      At the receiving end the Minion chunks are reassembled into Minion
      messages and delivered to the client application.
      Small messages are sent in a single Minion chunk.</t>

      <t>Normally messages are sent by the client as a single atomic unit,
      and delivered to the receiving client as a single atomic unit.
      For messages too large to fit conveniently in memory, the message
      may be built incrementally by the sender, and delivered to the
      receiving client incrementally, a chunk at a time.</t>

      <t>When a Minion message is complete, or has at least one maximum
      Minion chunk size of data accumulated, then if it is eligible to
      be sent according to the message ordering facilities offered by the
      <xref target="minserv">Minion Service Model</xref>
      (Sender Ordering, Receiver Ordering, and Chaining)
      a Minion chunk is generated.</t>

      <t>Each Minion chunk contains a Minion chunk header
      followed by the client's message data, as described in 
      <xref target="chunks"/> "<xref target="chunks" format="title"/>".</t>

      <t>Each Minion chunk is encrypted using <xref target="RFC6347">DTLS</xref>.</t>

      <t>Each encrypted DTLS payload is then framed using RECOBS, as described in 
      <xref target="RECOBS"/> "<xref target="RECOBS" format="title"/>",
      so that it begins with a 00 byte and ends with an FF byte.</t>

      <t>The framed, encrypted chunk is then enqueued for transmission.</t>

      <t>If the kernel networking code supports multiple priorities, then
      the framed, encrypted chunk is placed in the transmission queue for
      the stated priority level. Any time the TCP congestion window and/or
      receive window rules allow more data to be sent, data is drawn from
      the highest-priority non-empty transmit buffer, assigned the next
      block of unused TCP sequence numbers, formed into a TCP segment, and
      transmitted on the wire. This just-in-time TCP sequencing mechanism
      has the effect of causing higher-priority data to be inserted right at
      the front of the conceptual combined transmit buffer, at the earliest
      possible byte boundary, unconstrained by message or chunk boundaries
      in the lower-priority messages. This is possible because the RECOBS
      framing is robust to pre-emption at any arbitrary byte boundary.</t>

      <t>Note that, when priorities are supported, chunks above the
      lowest priority MUST be delivered to the kernel in such a way that
      they are sent completely before the kernel resumes sending the
      lower-priority traffic. The RECOBS framing supports interrupting a
      lower priority stream with a higher-priority chunk, but not
      alternating back and forth between two priority levels. Once a
      higher-priority chunk interrupts lower-priority traffic, the
      higher-priority chunk must be completed before the lower-priority
      traffic resumes. Typically this is easily achieved by delivering
      the chunk to the kernel atomically in a single write call.</t>

    <?rfc needLines="20" ?>
      <section anchor="NAT" title="Comparison of TCP and UDP NAT Traversal">
        <t>When connecting to a server with a globally routable address,
        TCP is generally preferable to UDP. TCP includes the SYN and FIN
        bits which tell a NAT gateway when a connection starts and ends.
        In particular, the FIN bit tells the NAT gateway when it can
        discard state related to that mapping. UDP has no defined
        connection start/end indicators, which means that unused UDP
        mappings are much more likely to accumulate, which means that
        NAT gateways tend to be more aggressive about timing out UDP
        mappings <xref target="Study"/>, which means that clients using
        UDP need to be more aggressive about sending keepalive traffic,
        which is bad both for network efficiency and for battery life.
        <xref target="RFC6887">Port Control Protocol (PCP)</xref> offers
        some future hope of alleviating this problem by allowing clients
        to explicitly negotiate for longer mapping lifetimes, but PCP is
        not yet widely deployed. In the meantime, if use of UDP increases,
        NAT gateways are likely to be accumulating mappings even more
        rapidly, with no way to differentiate which are still required
        and which may be safely discarded, with the result that UDP
        mappings may have to be discarded even more aggressively. While
        a discarded UDP mapping can be recreated by another outgoing UDP
        packet, in the time between when the UDP mapping is discarded
        and then recreated, the client is cut off an unable to receive
        inbound communication from server or peer at the other end.
        Therefore, we believe that it is preferable to use TCP where
        possible.</t>

        <t>However, when connecting to a peer which is itself
        also behind a NAT gateway, in the absence of
        <xref target="RFC6887">PCP support</xref>, techniques like
        <xref target="RFC5245">Interactive Connectivity Establishment (ICE)</xref>
        are used, and research has shown that
        <xref target="RFC5128">there are cases where
        ICE works for UDP but not for TCP</xref>.</t>

        <t>To accomodate both usage scenarios, Minion is generally used
        with standard TCP format packets, but for peer-to-peer scenarios
        where TCP ICE is found not to work, Minion can be used
        <xref target="TCPoUDP">encapsulated inside UDP</xref> instead.</t>

      </section>
    </section>

    <?rfc needLines="22" ?>
    <section anchor="chunks" title="Minion Chunk Format">
      <t>A Minion Chunk begins with an eight-byte header, followed
      by the client's message data:</t>

      <figure align="left" anchor="Chunk" title="Minion Chunk Format">
<artwork align="center"><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|C|    Code     |Pri|     This Minion Chunk ID                  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Reserved      |RCP|     Referenced Minion Chunk ID            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
:                                                               :
:                     Minion Chunk Data                         :
:                                                               :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
      </figure>

      <t>If the Complete ('C') bit is zero, this message is
      incomplete; the receiver should expect to receive additional
      continuation chunks for this message. If the Complete bit is
      one, this message is complete; there will be no subsequent
      continuation chunks for this message.</t>

      <t>The seven-bit chunk code identifies what type of chunk this is,
      as described below.</t>

      <t>The two-bit priority field indicates the priority level for this
      message, with 0 being the highest priority and 3 being the default
      (lowest-level) priority.</t>

      <t>Every Minion chunk has a Chunk ID. This is a 22-bit value
      assigned from a monotonically increasing 22-bit cyclic counter.
      This means that Chunk IDs are reused every 2^22 chunks.
      At any given moment in time though, only a small portion of the
      22-bit ID space is actively in use, so Chunk IDs are not ambiguous.
      Each of the four priority levels has its own 22-bit Chunk ID space,
      i.e., Priority 1 Chunk 7 and Priority 2 Chunk 7 are different chunks.
      Also, the Chunk ID spaces in opposite directions on a connection
      are separate. Each sender is responsible for selecting the Chunk IDs
      for the chunks it sends.</t>

      <t>In some cases it is useful to refer to messages by ID, and the
      terms "Message ID" and "Chunk ID" are sometimes used interchangeably.
      For a message that is sent using a single chunk, the Message ID is the
      same as the Chunk ID. For a message that is sent using multiple chunks,
      the Message ID is the Chunk ID of the *final* chunk of the message.
      One implication of this is that a message's ID is undefined until
      the message is complete.</t>

      <t>Because Chunk IDs are eventually reused, issues of ID lifetime
      must be carefully considered in the Minion protocol design.
      For example, since a remote peer could, in principle, wait an
      arbitrary long length of time before replying to a message,
      the Message ID of a request that is awaiting a response MUST NOT
      be reused until the response has been received, and the client
      has disposed of the request message. Otherwise, a reply could
      be ambiguous, if there were two outstanding request messages
      both using the same Message ID at the same time.
      Likewise, the last Chunk ID of an incomplete message MUST NOT be
      reused until some subsequent chunk has been added to that message,
      referencing the previous Chunk ID.</t>

      <t>The Reserved field MUST be set to zero on transmission, and
      MUST be ignored on reception.</t>

      <t>For chunk types that need to refer to some other chunk,
      the Referenced Minion Chunk Priority (RCP) and
      Referenced Minion Chunk ID fields identify the referenced chunk.
      Note that some chunk types refer to chunks going in the same
      direction (e.g., a continuation chunk) and some chunk types refer to
      chunks going in the reverse direction (e.g., a reply chunk).
      For chunk types that do not to refer to any other chunk, these two fields
      MUST be set to zero on transmission, and MUST be ignored on reception.</t>

      <t>The Minion Chunk payload data follows the Minion Chunk Header.</t>

      <t>There is no explicit length field in the Minion Chunk Header,
      because the chunk length is determined implicitly in the RECOBS
      decoding step.</t>

      <?rfc needLines="18" ?>
      <section anchor="codes" title="Minion Chunk Codes">
        <t>The seven-bit chunk code identifies what kind of chunk this is.
        There are 128 chunk codes available.
        The following eight chunk codes are currently defined:
          <list style='hanging' hangIndent="5">

            <t hangText="  00">Continuation.
            This is a continuation of a previously incomplete message.
            The Referenced Minion Chunk ID
            identifies what previous chunk this is adding to.
            (If the Complete bit is one then this chunk is the final
            chunk and completes the message; no further chunks for this
            message will be arriving.)</t>

            <t hangText="  01">Cancellation.
            This is a cancellation of a previously incomplete message.
            The Referenced Minion Chunk ID
            identifies what previous chunk this is cancelling.
            In this case Complete bit is unused; the Complete bit MUST be
            set to zero on transmission, and MUST be ignored on reception.</t>

            <t hangText="  02">Unordered Message.
            This chunk begins a new unordered message.
            The Referenced Minion Chunk ID is unused, and MUST be set to
            zero on transmission, and MUST be ignored on reception.</t>

            <t hangText="  03">Sender Ordered Message.
            This chunk begins a new Sender Ordered message.
            If received out-of-order, it should nonetheless be delivered
            immediately to the receiving client.
            The Referenced Minion Chunk ID is used to deduce the Sender
            Ordering that should be applied if the receiving client
            generates a reply to this message.
            If the received message identifed by the Referenced Minion
            Chunk ID generated a reply A', then a reply to this message
            should have an automatic Sender Ordering dependency that it
            follow message A'.</t>

            <t hangText="  04">Receiver Ordered Message.
            This chunk begins a new Receiver Ordered message.
            This message is subject to Receiver Ordering; it MUST NOT be
            delivered to the receiving client until the message indicated
            by the Referenced Minion Chunk ID field has been delivered.
            If the receiving client generates a reply to this message,
            then the reply should have an automatic Receiver Ordering
            dependency that it follow the reply to the message indicated
            by the Referenced Minion Chunk ID field.</t>

            <t hangText="  05">Superseding Messages.
            This chunk begins a new message that supersedes a preceding message.
            The Referenced Minion Chunk ID identifies the preceding message.
            If the preceding message has not yet been received, then when it does
            arrive, it MUST be silently discarded.</t>

            <t hangText="  06">Chained Message.
            This chunk begins a new message that chains on after a preceding message.
            The Referenced Minion Chunk ID identifies the preceding message.
            This message MUST NOT be delivered to the receiving client
            until the previous message of the chain as been delivered to
            the receiving client, and this message MUST be delivered to
            the receiving client in a manner that indicates to the client
            that it is related to the previous message.</t>

            <t hangText="  07">Reply/Acknowledge.
            This chunk begins a new message which is an explicit reply to a
            previously received message.
            The Referenced Minion Chunk ID
            identifies the received message to which this is a reply.
            A reply may be empty, in which case it serves as a simple acknowledgement
            that the request was received and accepted, or it may contain data.
            It is anticipated that future Minion protocol development
            will create additional Minion chunk codes to negotiate
            future protocol features.
            For these capability negotiation messages, an empty reply
            referencing the request serves as an acknowledgement that
            the requested protocol feature is supported.</t>

            <t hangText="  08">Reject.
            A Minion Reject code indicates that the referenced received
            message had an error or was not accepted for some other reason.
            A Reject Message may be empty, or may contain data giving
            information concerning the reason for the rejection.
            It is possible to reject an incomplete message that is
            still arriving, by sending a Reject referencing the
            most recent Chunk ID for that partial message.
            The sender will respond by sending a Cancellation for that
            message, confirming that no further chunks will be sent.
            When used for Minion protocol capability negotiation,
            a Reject message referencing the request indicates
            that the requested protocol feature is not supported.</t>

            <t hangText="  09">End Minion.
            It is anticipated that there
            will be existing application protocols that initially add
            Minion as an optional feature, which they use only when the
            remote peer indicates it also has Minion support, and
            otherwise they will communicate using the existing protocol
            without the Minion features. Such application protocols
            typically will first connect using their existing protocol,
            and then negotiate an "upgrade" to Minion framing. For
            symmetry, it would be good if such an "upgrade" were not an
            irreversible one-way path. We would like to offer the ability
            for applications to connect over raw TCP, switch to Minion
            for some message exchanges, and then drop back to raw TCP for
            some subsequent communication.
            This Minion chunk code exists to signal, "This is the final
            Minion-format message you will receive in this particular
            Minion session; after this you're on your own."</t>
          </list>
        </t>
      </section>
    </section>

    <?rfc needLines="28" ?>
    <section anchor="RECOBS" title="Recursively Embeddable COBS">
      <t><xref target="COBS">Consistent Overhead Byte Stuffing</xref>
      allows complete messages to be reliably located within an
      incomplete data stream that may contain gaps.</t>

      <t>COBS works by transforming the payload data to eliminate all
      occurrences of zero bytes. This is like PPP byte stuffing, but
      more efficient; COBS has a worst-case data size overhead below 0.5%.
      Having created a zero-free payload, the payloads can then be
      concatenated into a single byte stream, separated by single zero
      bytes, and the zero bytes unambiguously mark the boundaries
      between payloads, because we know the payloads themselves no longer
      contain any zero bytes. At the receiving end the transformation
      is reversed to recreate the original payload data.</t>

    <?rfc needLines="12" ?>
      <t><xref target="COBS">The transformation process</xref> is, in effect,
      a simple run length encoding. An extremely simplified summary of the
      original 1997 COBS encoding is as follows:
        <list style="symbols">
          <t>If the payload begins with three nonzero bytes followed by
          a zero, then the output is the byte value 4 (the run length)
          followed by the three nonzero bytes, and the subsequent zero
          is skipped.</t>
          <t>If that is followed by fifty nonzero bytes followed by a
          zero, then the output is the byte value 51 (the run length)
          followed by the fifty nonzero bytes, and the subsequent zero
          is skipped.</t>
          <t>This process is repeated until the entire payload has been
          replaced by its zero-free equivalent.</t>
        </list>
      </t>

    <?rfc needLines="8" ?>
      <t>Recursively Embeddable COBS (RECOBS) is a derivative of the
      original 1997 COBS encoding. RECOBS code bytes have the following meanings:</t>
      <t>
        <?rfc subcompact="yes" ?>
        <list style='hanging' hangIndent="5">
          <t hangText="  00">New payload begins</t>
          <t hangText="  01">Represents a single zero byte</t>
          <t hangText="  02">Two bytes: a single nonzero byte, followed by a single zero byte</t>
          <t hangText="  03">Three bytes: two nonzero bytes, followed by a single zero byte</t>
          <t hangText="   n">Represents n bytes: n-1 nonzero bytes, followed by a zero byte</t>
          <t hangText="  FD">253 bytes: 252 nonzero bytes, followed by a single zero byte</t>
          <t hangText="  FE">253 bytes: 253 nonzero bytes, with *no* following zero byte</t>
          <t hangText="  FF">Payload ends</t>
        </list>
        <?rfc subcompact="no" ?>
      </t>

      <t>This has the effect that, after encoding, every payload has
      unambiguous bookends; every payload begins with a single 00, and
      ends with a single FF. Using this encoding, recursive embedding
      becomes possible. At *any* point in the encoded byte stream it is
      now possible to interrupt the byte stream, insert a new
      RECOBS-encoded payload, and then resume the previous byte stream.</t>

      <t>At the receiving end, the decoder is part-way through decoding
      a payload when the interruption occurs. The decoder sees a 00,
      which is not legal in RECOBS-encoded data, so the decoder knows a
      new payload is beginning. Because the decoder has not yet seen the
      FF end-marker for the previous payload, it knows that payload is
      incomplete, so it saves its decoding state for later resumption.
      The decoder then proceeds to decode the embedded payload.
      When the decoder sees the FF end-marker for the embedded payload,
      it delivers that fully decoded payload to the waiting client, and
      then resumes its decoding of the previously interrupted payload.</t>

      <t>In principle this recursive embedding could be nested arbitrarily
      deeply, limited only by the amount of storage the decoder has available
      for partially-received payloads and their associated decoding state.</t>

      <t>In practice, Minion limits RECOBS embedding to four levels
      (the base level plus three levels of nested interruption) to establish
      a defined upper bound on the amount of storage required by a decoder.</t>
    </section>

    <?rfc needLines="28" ?>
    <section anchor="flow" title="Flow Control">

      <t><xref target="RFC0793">TCP</xref> implements flow control in
      the form of the advertised receive window. This is to prevent a
      faster sender from overwhelming a slower receiver. Minion requires
      similar protection to prevent a slower receiver running out of
      memory trying to buffer messages arriving faster than it can
      handle them.</t>

      <t>For a pure user-level library implementation of Minion, this is
      achieved by having the library set an upper bound on the amount of
      memory it will use for storing received messages that have not yet
      been handled by the client. Once this limit is met, the library
      ceases reading TCP data from the kernel, which causes the TCP
      receive window to fill up, which causes the sender to stop
      sending. Once the client consumes some messages, the library then
      reads more data from the kernel, the TCP receive window opens up,
      and the sender is permitted to send more data.</t>

      <t>However, this means that there is some duplication of buffering
      -- the TCP receive window in the kernel and additional buffering
      in the user-level library. For this reason a kernel extension is
      proposed where a client (the Minion library in this case) can read
      data from the connection *without* raising the TCP receive window.
      In a sense it is reading the data "secretly", without admitting to
      the sender at the other end that it has been read. Those bytes, even
      though read into user space, are still counted against the TCP
      receive window. Later, after the client application has actually
      consumed the message, another kernel call is made to acknowledge
      consumption of those bytes, and the TCP receive window is raised.</t>

      <t>This mechanism integrates message-level flow control with TCP's
      byte-level flow control, rather than having two independent flow
      control mechanisms happening concurrently at different levels,
      in ways that might interact badly with each other.</t>

      <t>Note that the Minion protocol design will have to consider
      possible deadlock situations. For example, suppose one Minion host
      is refusing to consume any more Minion Chunks because it wishes to
      send a Reject message for them, but it cannot, because the peer's
      receive window is closed. Suppose also that the reason the peer's
      receive window is closed is because the peer also is sitting on a
      pile of unwanted Minion Chunks that it refuses to consume until
      it can send a Reject message for them. Possible deadlocks such as
      these need to be considered, and mechanisms to avoid them created.</t>

    </section>

    <?rfc needLines="5" ?>
    <section anchor="retransmission" title="Retransmission Policy">

      <t>One of the main arguments that is often presented to justify
      why a particular application protocol is built on UDP instead of
      TCP is that, "UDP is better for 'real time' applications." The
      supporting reasoning for this is often that, "TCP insists on
      continuing to retransmit data long after the client doesn't need
      any more." In truth the real problem is not retransmission; it is
      that the conventional TCP APIs don't allow received data to be
      delivered out of order. Suppose a TCP sender has 50 packets in
      flight at any given time (e.g., the bandwidth x delay product is 75
      kB) then the loss of a single packet causes all 49 following
      packets to stall at the receiver because the API doesn't allow for
      them to be delivered to the client until the missing packet has
      been received.</t>

      <t>Minion solves this problem by allowing data to be delivered
      as it arrives, even if there are gaps. But the argument still
      remains that even after removing the ordering requirement
      at the receiver, it may still be a waste of bandwidth to
      retransmit data that will arrive too late to be useful.
      And indeed, it is possible with TCP to fraudulently acknowledge
      segments that were in fact not received, and this will cause
      the sender to not retransmit those segments.</t>

      <t>However, we chose not to use fraudulent acknowledgements to
      suppress retransmissions, because certain NATs, Firewalls and
      other middleboxes may block traffic if they observe implausible
      protocol actions which they find suspicious. One of the important
      goals of Minion is 100% compatibility with today's existing
      Internet devices, not 99% compatibility.</t>

      <!-- /** JRI: I need to think about these arguments some more,
               and they need some rewriting. **/

      <t>Furthermore we argue that unnecessary retransmissions are
      so rare that optimising them away at the expense of reduced
      compatibility would be a poor design choice.</t>

      <t>A retransmission is only *unnecessary* when the message is
      'real-time' in nature, and the receiver is coded to handle the
      case where that message fails to arrive.</t>

      <t>A retransmission is also only *unnecessary* in the case where the
      retransmission arrives too late to be useful. Often, on networks
      with low round-trip delay, TCP fast retransmit may deliver the
      retransmission fast enough to be useful. We should not assume that
      retransmissions necessarily always arrive too late to be useful.</t>

      <t>And finally, the issue of retransmission only arises for
      packets that are lost, and we argue that packet loss is rare in
      the scenarios Minion is designed for.</t>

      <t>Many people seem to believe that random unexplained packet loss
      is widespread and common. We do not believe this to be true, and
      do not know from where this meme originates. Perhaps Computer
      Science students learn in their first computer networking class
      that a key difference between circuit switching and packet
      switching is that packet switching tolerates and expects packet
      loss, and they take this lesson a little too much to heart.
      Packet loss is inevitably caused when data is sent at a rate
      higher than the network can carry, and the only appropriate response
      to this is to send at a lower rate. Aside from this inevitable congestion
      loss, other random unexplained packet loss is extremely rare.</t>

      <t>The fact that random unexplained packet loss cannot be
      commonplace in a functioning network can be illustrated via a
      simple thought experiment. Suppose 10% random unexplained packet
      loss were typical for a normal Internet link. Then a packet which
      has to traverse 30 links to get to its destination would have a
      probability of 0.9^30 of surviving for 30 hops. It would have only
      a 4.24% chance of getting to its destination. So the first link,
      in addition to the packets it managed to lose itself, would also
      have 95% of its successful packets also lost, by other links on
      the path, so that even for the packets it did manage to send, 95%
      of them would just have been a waste of resources (bandwidth,
      spectrum, power, battery life, etc.) because they never made it to
      their final destination. Therefore, we argue, even though
      the Internet architecture tolerates some level of loss and can
      handle it, overall loss levels need to remain low (below 1%)
      or the network as a whole becomes horribly inefficient.
      For links like Wi-Fi that have inherently higher loss rates, this
      lossiness needs to be corrected via link-layer acknowledgements
      and retransmissions to bring the effective loss rate well below 1%
      (which is indeed exactly what Wi-Fi does).</t>

      <t>Given that we argue that packet loss should be below 1% in a
      functioning network, the cost of retransmitting those lost
      -->

      <t>We expect packet loss to be about 1% (at most a few percent) in
      a functioning network, and the cost of retransmitting those lost
      packets, even in the extreme case where *all* the retransmissions
      turn out to be unnecessary, is an overhead of about 1%.
      We argue that an overhead of about 1% is an acceptable price to
      pay in exchange for 100% compatibility with existing NATs,
      Firewalls and other middleboxes.</t>

    </section>

    <?rfc needLines="12" ?>
    <section anchor="kernel" title="Optional Kernel Extensions">

      <t>While Minion can be implemented entirely as a user-level
      library built on top of existing standard networking APIs like BSD
      sockets, it can also benefit from some optional kernel extensions:
        <list style="hanging">
          <t hangText="Send Priorities"><vspace blankLines="0" />
          Normal TCP APIs transmit data strictly in the order is is
          given to the kernel. The addition of priority support allows a
          sendmsg() call to be used in conjunction with cmsg ancillary
          data to indicate the priority level of the data. For normal
          applications this capability would be of little use because it
          would most likely result in corruption of the data stream, but
          it is useful with Minion because the RECOBS encoding is robust
          against message insertion at arbitrary byte boundaries.
          An alternative way to achieve a similar effect is, instead of
          buffering data in the kernel, to keep the data in the
          user-space library for as long as possible. When the TCP
          congestion window and/or receive window rules allow more data
          to be sent, the kernel generates some kind of upcall (e.g., a
          kevent notification) to the user-space library informing it of
          the ability to transmit, and the user-space library responds
          by selecting which particular block of data to hand to the
          kernel next.</t>

          <t hangText="Just-In-Time Data Generation"><vspace blankLines="0" />
          Through operational experience, we have learned
          (not that this was any great surprise) that excessive
          buffering in the kernel leads to poor behaviors.
          For example, two messages at the same priority level are not
          interleaved effectively if the first message is swallowed
          whole by the kernel, and held in kernel buffers,
          before the second message is even created.
          When that happens, the result is that the first message is
          sent in its entirety, followed by the second message in its
          entirety, with no interleaving.
          <vspace blankLines="1" />
          To prevent this unintended serialization, we need to avoid
          irrevocably handing off data to the kernel prematurely.
          We want to give the kernel enough data to keep the pipeline full
          (an amount equal to the connection's Bandwidth Delay Product)
          but no more.
          <vspace blankLines="1" />
          To this end, rather than having the kernel indicate that a
          socket is writable any time the kernel has space available to
          buffer more data, we'd like the kernel to indicate that a socket
          is writable only when TCP (according its protocol rules, such as
          receive window, congestion window, and Nagle's Algorithm)
          would be willing to send data, but has no data available to send.
          When this situation occurs, the socket becomes writable, and
          the client (the user-level Minion library) is able to perform a
          just-in-time determination of what data ought to be sent next.
          <vspace blankLines="1" />
          This just-in-time data generation could be achieved in the BSD
          sockets API by adding a new socket option.
          When using this new socket option, a socket will only be
          writable when TCP is actively waiting for new data.
          If the context-switching latency or software overhead is such
          that it takes the user-level code a little too long to generate
          data strictly on demand, then a middle ground can be achieved
          by modifying the new socket option such that a socket will only be
          writable when the socket has less data buffered than it expects
          to need imminently. For example, a TCP connection in slow start
          expects it will need four TCP segments when the next ack arrives.
          When used this way, if an incoming ACK allows TCP to send out
          four segments then those four segments are already buffered
          and ready in the kernel, and the socket then becomes writable again
          to allow the user-level code to generate the next four segments,
          so that they will be ready and waiting the next time TCP is able
          to transmit additional segments.
          <vspace blankLines="1" />
          We are currently experimenting with just-in-time data generation.
          If it proves to be as effective as we hope, it might even work well
          enough to provide effective priority support too, eliminating the
          need for the "Send Priorities" kernel extension.</t>

          <t hangText="Immediate Receive"><vspace blankLines="0" />
          Normal TCP APIs deliver data only in TCP sequence number
          order. The addition of support for new cmsg ancillary data in
          the recvmsg() call allows the user-space library to request
          *any* available data, not only in-order data. The cmsg
          ancillary data returned from the recvmsg() call indicates to
          the user-space library where in the TCP sequence space this
          particular block of data lies. A setsockopt() option
          (or equivalent) is also required to put the socket into this
          "Immediate Receive" mode, to inform the kernel that the client
          will accept out-of-order data on this socket, and therefore
          the client should be notified (via select(), kevent(), etc.),
          not only when there is in-order data available to be read,
          but also when there is out-of-order data available to be read.</t>

          <t hangText="Integrated Receive Window"><vspace blankLines="0" />
          Normal TCP APIs raise the receive window any time data is read
          out of the kernel into user space. The addition of new cmsg
          ancillary data in the recvmsg() call allows the user-space
          library to request that the kernel return received data
          *without* reflecting this in its receive window calculation.
          After the client application has consumed the message data
          from the user-space Minion library, the Minion library makes a
          subsequent recvmsg() call with appropriate cmsg ancillary data
          to inform the kernel how many bytes to add back into its
          receive window. In essence, the receive window boundary is
          stretched outside the kernel to account for data held by
          *both* the kernel *and* the user-space Minion library.</t>
        </list>
      </t>

      <t>These optional kernel extensions are a key part of what makes
      Minion compelling. Minion can be adopted today by any application,
      using Minion as a purely user-space library.
      Such an application performs as well as any application can
      when it is built on top of standard TCP.
      However, unlike an application built on top of standard TCP, Minion
      offers the promise of future kernel support for even better performance.
      Any given application with its own application-specific protocol
      is unlikely to receive special kernel support to make just that
      one application work better. But when many applications all use
      the Minion protocol, it then becomes reasonable to add kernel
      support to improve all of those applications.</t>

    </section>

    <?rfc needLines="33" ?>
    <section anchor="tcp" title="TCP Deviations">
      <t>When implemented entirely as a user-level library, Minion
      naturally adheres to the TCP specifications (insofar as the
      underlying operating system adheres to the TCP specifications)
      because Minion is merely using the operating system's networking
      APIs.</t>

      <t>When optional kernel extensions are in use, they may allow
      Minion to deviate from classical TCP protocol rules.
      One such instance of this deviation has already been identified.
      The TCP protocol rules allow a sender to send a FIN to end a
      connection, and then follow it with additional data bytes (with
      higher TCP sequence numbers, so that they fall later in the data
      stream) which the receiver is expected to discard because it
      recognizes that they fall after the FIN in the data stream.
      When out-of-order delivery is enabled, it's possible that if the
      TCP segment containing the FIN is lost or delayed, then subsequent
      TCP segments containing data bytes could be incorrectly delivered
      to the client application, when the TCP protocol rules dictate
      that they should have been discarded. The ability to send data
      following the FIN that the receiver is expected to discard is
      incompatible with out-of-order delivery. Note that this is referring
      to data that follows the FIN in TCP sequence number space, not
      data that follows the FIN in transmission order. If, after the FIN
      has been sent, previously transmitted data is lost and needs to be
      retransmitted, then this does not cause any problems; the
      bytes in such retransmitted TCP segments fall *before* the FIN in
      TCP sequence number space, not after. As a result of this observation,
      TCP's protocol rules, when used with Minion traffic,
      are effectively modified as follows:
        <list style="symbols">
          <t>A client using Minion MUST NOT send new data on a connection
          after that connection has been closed (i.e. a FIN indication has
          been sequenced and sent).</t>
        </list>
      In reality we do not expect this to be a major burden to TCP implementations.
      We are not aware of TCP implementations that send data after a connection
      is closed and then rely on the receiver to discard that data.</t>
    </section>

    <?rfc needLines="5" ?>
    <section anchor="iana" title="IANA Considerations">
      <t>No IANA actions are required by this document.</t>
    </section>

    <?rfc needLines="6" ?>
    <section anchor="security" title="Security Considerations">
      <t>We take security seriously. As this work develops, this section will
      contain details of any known security issues and possible mitigations.</t>
    </section>

    <?rfc needLines="5" ?>
    <section title="Acknowledgements">
      <t>Many thanks to Bryan Ford, Padma Bhooma and Anumita Biswas
      for their contributions to the development of Minion.</t>
      <t>Thanks to Joe Touch for pointing out that Minion restricts TCP's
      ability to send data, after a connection is closed, that will then be
      ignored by the receiver.</t>
    </section>

  </middle>

  <?rfc needLines="23" ?>
  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.0793" ?>
      <?rfc include="reference.RFC.2119" ?>
      <?rfc include="reference.RFC.6347" ?>

      <reference anchor="minserv">
        <front>
          <title>Minion - Service Model and Conceptual API</title>
          <author fullname="Janardhan Iyengar" initials="J" surname="Iyengar">
            <organization>Franklin and Marshall College</organization></author>
          <date month="October" year="2013" />
        </front>
        <seriesInfo name="Internet-Draft" value="draft-iyengar-minion-concept-02" />
        <format type="TXT"
          target="http://www.ietf.org/internet-drafts/draft-iyengar-minion-concept-02.txt" />
      </reference>

      <reference anchor="COBS" target="http://stuartcheshire.org/papers/COBSforToN.pdf">
        <front>
          <title>Consistent Overhead Byte Stuffing</title>
          <author fullname="Stuart Cheshire" initials="S" surname="Cheshire">
            <organization>Stanford University</organization></author>
          <author fullname="Mary Baker"      initials="M" surname="Baker"   >
            <organization>Stanford University</organization></author>
          <date month="September" year="1997" />
        </front>
        <format type="PDF" target="http://stuartcheshire.org/papers/COBSforToN.pdf" />
      </reference>

    </references>

    <references title="Informative References">

      <reference anchor="Study" target="http://conferences.sigcomm.org/imc/2010/papers/p260.pdf">
        <front>
          <title>An Experimental Study of Home Gateway Characteristics</title>
          <author fullname="Seppo Hatonen"   initials="S" surname="Hatonen">
            <organization>University of Helsinki</organization></author>
          <author fullname="Aki Nyrhinen"    initials="A" surname="Nyrhinen">
            <organization>University of Helsinki</organization></author>
          <author fullname="Lars Eggert"     initials="L" surname="Eggert">
            <organization>Nokia Research Center</organization></author>
          <author fullname="Stephen Strowes" initials="S" surname="Strowes">
            <organization>University of Glasgow</organization></author>
          <author fullname="Pasi Sarolahti"  initials="P" surname="Sarolahti">
            <organization>HIIT / Aalto University</organization></author>
          <author fullname="Markku Kojo"     initials="M" surname="Kojo">
            <organization>University of Helsinki</organization></author>
          <date month="September" year="1997" />
        </front>
        <format type="PDF" target="http://conferences.sigcomm.org/imc/2010/papers/p260.pdf" />
        <format type="PDF" target="http://eggert.org/papers/2010-imc-hgw-study.pdf" />
      </reference>

      <?rfc include="reference.RFC.0768" ?>
      <?rfc include="reference.RFC.5128"?>
      <?rfc include="reference.RFC.5245"?>
      <?rfc include="reference.RFC.6887"?>

      <reference anchor="TCPoUDP">
        <front>
          <title>Encapsulation of TCP and other Transport Protocols over UDP</title>
          <author fullname="Stuart Cheshire" initials="S" surname="Cheshire">
            <organization abbrev="Apple">Apple Inc.</organization></author>
          <author fullname="Josh Graessley" initials="J" surname="Graessley">
            <organization abbrev="Apple">Apple Inc.</organization></author>
          <author fullname="Stuart Cheshire" initials="S" surname="Cheshire">
            <organization abbrev="Apple">Apple Inc.</organization></author>
          <date month="June" year="2013" />
        </front>
        <seriesInfo name="Internet-Draft" value="draft-cheshire-tcp-over-udp-00" />
        <format type="TXT"
          target="http://www.ietf.org/internet-drafts/draft-cheshire-tcp-over-udp-00.txt" />
      </reference>

    </references>
  </back>
</rfc>
PAFTECH AB 2003-2026
2026-04-23 05:25:50