<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
    <!ENTITY rfc2119 PUBLIC '' 
	'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml'>
    <!ENTITY rfc2581 PUBLIC '' 
	'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2581.xml'>
]>
<rfc category="exp" ipr="trust200902"
	docName="draft-ietf-ledbat-congestion-03.txt">

<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>

<?rfc toc="yes" ?>
<?rfc symrefs="yes" ?>
<?rfc sortrefs="yes" ?>
<?rfc iprnotified="no" ?>
<?rfc strict="yes" ?>
<?rfc compact="yes" ?>

    <front>
        <title abbrev="LEDBAT">Low Extra Delay Background Transport (LEDBAT)</title>
        <author initials='S' surname="Shalunov" fullname='Stanislav Shalunov'>
          <organization>BitTorrent Inc</organization>
          <address>
            <postal>
              <street>612 Howard St, Suite 400</street>
              <city>San Francisco</city> <region>CA</region> <code>94105</code>
              <country>USA</country>
            </postal>
            <email>shalunov@bittorrent.com</email>
            <uri>http://shlang.com</uri>
          </address>
        </author>
        <author initials='G' surname="Hazel" fullname='Greg Hazel'>
          <organization>BitTorrent Inc</organization>
          <address>
            <postal>
              <street>612 Howard St, Suite 400</street>
              <city>San Francisco</city> <region>CA</region> <code>94105</code>
              <country>USA</country>
            </postal>
            <email>greg@bittorrent.com</email>
          </address>
        </author>
        <author initials='J' surname="Iyengar" fullname='Janardhan Iyengar'>
          <organization>Franklin and Marshall College</organization>
          <address>
            <postal>
              <street>415 Harrisburg Ave.</street>
              <city>Lancaster</city> <region>PA</region> <code>17603</code>
              <country>USA</country>
            </postal>
            <email>jiyengar@fandm.edu</email>
          </address>
        </author>
        
        <date/>
		<area>Transport</area>

		<workgroup>LEDBAT WG</workgroup>
        <abstract>
	
			<t>LEDBAT is an experimental delay-based congestion control algorithm
			  that attempts to utilize the available bandwidth on an end-to-end path
			  while limiting the consequent increase in queueing delay on the path.
			  LEDBAT uses changes in one-way delay measurements
			  to limit congestion induced in the network by the LEDBAT flow.
			  LEDBAT is designed largely for use by background bulk-transfer applications;
			  it is designed to be no more aggressive than TCP congestion control
			  and yields in the presence of competing TCP flows,
			  thus reducing interference with the network performance of the competing flows.</t>
			 </abstract>
			 </front>

    <middle>
        <section title="Requirements notation">
            <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL",
            "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
            and "OPTIONAL" in this document are to be interpreted as
            described in <xref target="RFC2119"/>.</t>
        </section>

		<section title="Introduction">

		  <t>
			TCP congestion control <xref target="RFC2581"/>,
			the predominant congestion control mechanism used on the Internet,
			aims to share bandwidth at a bottleneck link equitably
			among flows competing at the bottleneck.
			While TCP works well for many applications,
			applications
			such as software updates or file-sharing applications
			prefer to use bandwidth available in the network
			without interfering with the network performance of 
			other interactive applications.
			Such "background" traffic
			can yield bandwidth to TCP-based "foreground" traffic
			by reacting earlier than TCP to congestion signals.
		  </t>
			
		  <t> LEDBAT is an experimental delay-based congestion control mechanism
			that allows applications that send large amounts of data,
			such as peer-to-peer applications,
			particularly over links with deep buffers,
			such as residential uplinks,
			to operate in the background
			without interfering with the performance of interactive applications.
			LEDBAT uses one-way delay measurements to detect congestion on the data path,
			and keeps latency across the tight link in the end-to-end path low
            while attempting to utilize the available bandwidth on the end-to-end path. </t>
			
		  <!--
			The predominant congestion control mechanism used on the Internet,
			TCP congestion control [XXXRFC2581],
			requires data loss to detect congestion.
			A TCP sender increases its congestion window [XXXRFC2581]
			until a loss occurs,
			which, 
			in the absence of Active Queue Management (AQM), 
			occurs only
			when the queue at the
			bottleneck link
			on the end-to-end path overflows.
			Even with AQM, 
			The queueing delay at the bottleneck link
			increases significantly before
			TCP responds to congestion at the tight link.

			This increased delay can be significant;
			default parameters on customer-side ADSL modems [XXXcite]
			can result in seconds of queueing delay on the ADSL uplink alone.
			While these large queueing delays have no known
			benefit, they have substantial drawbacks for interactive
			applications---"lag" increases for interactive games,
			and voice/video communication suffers from the consequently high roundtrip times.
			-->
			<!--
			It has been deployed by BitTorrent
            in the wild first with the BitTorrent DNA client (a P2P-based CDN) and now with the uTorrent
            client. This mechanism not only allows to keep delay across a bottleneck low, but also
            yields quickly in the presence of competing traffic with loss-based congestion
            control.</t>

		   	<t>Beyond its utility for P2P, LEDBAT enables other advanced networking applications to
            better get out of the way of interactive apps.</t>

		   	<t>In addition to direct and immediate benefits for P2P and other application that can
            benefit from scavenger service, LEDBAT could point the way for a possible future evolution
            of the Internet where loss is not part of the designed behavior and delay is
            minimized.</t>
			-->
		  	<section title="Design Goals">
			<t>As a "scavenger" mechanism for the Internet, LEDBAT's design goals are to:
				<list style="numbers">
					<t>Keep	delay low when no other traffic is present </t>
					<t>Add little to the queuing delays induced by TCP traffic</t>
					<t>Quickly yield to traffic sharing the same bottleneck queue that uses standard TCP congestion control </t>
					<t>Utilize end-to-end available bandwidth </t>
					<t>Operate well in networks with FIFO queuing with drop-tail discipline</t>
				</list>
			</t>
			</section>
		  	<section title="Applicability">
			  <t> LEDBAT is a "scavenger" congestion control mechanism---a LEDBAT flow attempts to utilize all available bandwidth
				and  yields quickly to a competing TCP flow---and is primarily motivated by background bulk-transfer applications,
				such as peer-to-peer file transfers and software updates.
				It can be used for any application that needs to run in the "background",
				to reduce the application's impact on the network
				and on other interactive network applications.</t>
		  
			  <t> LEDBAT can be used with any
				transport protocol
				capable of carrying timestamps and
				acknowledging data frequently---LEDBAT can be easily used with TCP, SCTP, and DCCP.</t>
			  <!--
			  <t> The constants specified in this document
				are based on XXX.  (XXX What contexts is LEDBAT applicable in? Residential networks? Others?) 
				</t>-->
		  	  </section>
		</section>
		
		<section title="LEDBAT Congestion Control">
		  <section title="Overview">
			<t> A TCP sender increases its congestion window
			  until a loss occurs,
			  which, 
			  in the absence of any Active Queue Management (AQM) in the network,
			  occurs only
			  when the queue at the
			  bottleneck link
			  on the end-to-end path overflows.
			  Since packet loss at the bottleneck link is often preceded by an increase in the queueing delay at the bottleneck link,
			  LEDBAT congestion control uses this increase in queueing delay as an early signal of congestion,
			  enabling it to respond to congestion earlier than TCP,
			  and enabling it to yield bandwidth to a competing TCP flow.
			</t>

			<t> LEDBAT employs one-way delay measurements
			  to estimate queueing delay.
			  When the estimated queueing delay 
			  is less than a pre-determined target,
			  LEDBAT infers that the network is not yet congested,
			  and increases its sending rate to utilize any spare capacity in the network.
			  When the estimated queueing delay
			  becomes greater than a pre-determined target,
			  LEDBAT decreases its sending rate quickly
			  as a response to potential congestion in the network.</t>
		  </section>

		  <section title="Preliminaries">
			<t> For the purposes of explaining LEDBAT, 
			  we assume a transport sender that uses fixed-size
			  segments and a receiver that acknowledges each segment separately.
			  It is straightforward to apply the mechanisms described here 
			  with variable-sized segments
			  and with delayed acknowledgments.
			  A LEDBAT sender uses a congestion window (cwnd)
			  that gates the amount of data that the sender can send into the network in one RTT.
			</t>
			
			<t>LEDBAT requires that each data segment carry a "timestamp" from the sender,
			  based on which the receiver computes the one-way delay from the sender,
			  and sends this computed value back to the sender.</t>

			<t> In addition to the LEDBAT mechanism described below,
			  we note that a slow start mechanism can be used as specified in <xref target="RFC2581"/>.
			  Since slow start increases the window faster than
			  the LEDBAT mechanism does,
			  conservative congestion control implementations employing LEDBAT
			  may skip slow start altogether
			  and start with an initial window of XXX MSS.</t>
		  </section>
			
		  <section title="Receiver-Side Operation">
			<t> A LEDBAT receiver operates as follows:
				<figure><artwork><![CDATA[
on data_packet:
  remote_timestamp = data_packet.timestamp
  acknowledgement.delay = local_timestamp() - remote_timestamp
  # fill in other fields of acknowledgement
  acknowledgement.send()
]]></artwork></figure></t>
			</section>

			<section title="Sender-Side Operation">
			  <section title="An Overview">
			<t>As a first approximation, a LEDBAT sender operates as shown below.  
			  TARGET is the maximum queueing delay that LEDBAT itself can introduce in the network,
			  and GAIN determines the rate at which the congestion window changes;
			  both constants are specified later.
			</t>
				<figure><artwork><![CDATA[
on initialization:
  base_delay = +infinity

on acknowledgement:
  current_delay = acknowledgement.delay
  base_delay = min(base_delay, current_delay)
  queuing_delay = current_delay - base_delay
  off_target = TARGET - queuing_delay
  cwnd += GAIN * off_target / cwnd
]]></artwork></figure>
			</section>

			<section title="The Complete Sender Algorithm">
			  <t>The simplified mechanism above ignores noise filtering and base delay expiration.  
			  The full sender-side algorithm is specified below:</t>
			  <t><figure><artwork><![CDATA[
on initialization:
  set all NOISE_FILTER delays used by current_delay() to +infinity
  set all BASE_HISTORY delays used by base_delay() to +infinity
  last_rollover = -infinity # More than a minute in the past.

on acknowledgement:
  delay = acknowledgement.delay
  update_base_delay(delay)
  update_current_delay(delay)
  queuing_delay = current_delay() - base_delay()
  off_target = TARGET - queuing_delay + random_input()
  cwnd += GAIN * off_target / cwnd
  # flight_size() is the amount of currently not acked data.
  max_allowed_cwnd = ALLOWED_INCREASE + TETHER*flight_size()
  cwnd = min(cwnd, max_allowed_cwnd)

random_input()
  # random() is a PRNG between 0.0 and 1.0
  # NB: RANDOMNESS_AMOUNT is normally 0
  RANDOMNESS_AMOUNT * TARGET * ((random() - 0.5)*2)

update_current_delay(delay)
  # Maintain a list of NOISE_FILTER last delays observed.
  forget the earliest of NOISE_FILTER current_delays
  add delay to the end of current_delays

current_delay()
  min(the NOISE_FILTER delays stored by update_current_delay)

update_base_delay(delay)
  # Maintain BASE_HISTORY min delays. Each represents a minute.
  if round_to_minute(now) != round_to_minute(last_rollover)
    last_rollover = now
    forget the earliest of base delays
    add delay to the end of base_delays
  else
    last of base_delays = min(last of base_delays, delay)

base_delay()
  min(the BASE_HISTORY min delays stored by update_base_delay)
]]></artwork></figure>
			</t>
			</section>
			</section>
			<section title="Parameter Values">
			<t>The TARGET parameter MUST be set to 100 milliseconds. GAIN
			SHOULD be set to 1 so that the maximum ramp-up rate is the same as for
			TCP.  BASE_HISTORY SHOULD be
			10; it MUST be no less than 2 and SHOULD NOT be more than
			20.  NOISE_FILTER SHOULD be 1; it MAY be tuned so that it
			is at least 1 and no more than cwnd/2.  ALLOWED_INCREASE
			SHOULD be 1 packet; it MUST be at least 1 packet and
			SHOULD NOT be more than 3 packets. TETHER SHOULD be 1.5;
			it MUST be greater than 1. RANDOMNESS_AMOUNT SHOULD be 0;
			it MUST be between 0 and 0.1 inclusive.</t>
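			<t>As a non-normative illustration, the parameter values and
			  their MUST-level constraints above can be collected as
			  follows (a sketch; the constant names mirror the draft's
			  parameters, but the code itself is not part of the
			  specification):</t>
			<figure><artwork><![CDATA[
```python
# Illustrative (non-normative) parameter values from this section.
TARGET = 0.100         # seconds; MUST be 100 milliseconds
GAIN = 1               # SHOULD be 1: max ramp-up matches TCP's
BASE_HISTORY = 10      # SHOULD be 10; MUST be >= 2, SHOULD NOT be > 20
NOISE_FILTER = 1       # SHOULD be 1; MAY be tuned within [1, cwnd/2]
ALLOWED_INCREASE = 1   # packets; MUST be >= 1, SHOULD NOT be > 3
TETHER = 1.5           # SHOULD be 1.5; MUST be > 1
RANDOMNESS_AMOUNT = 0  # SHOULD be 0; MUST be in [0, 0.1]

def check_parameters():
    """Check the MUST-level constraints stated in the draft."""
    assert TARGET == 0.100
    assert BASE_HISTORY >= 2
    assert NOISE_FILTER >= 1
    assert ALLOWED_INCREASE >= 1
    assert TETHER > 1
    assert 0 <= RANDOMNESS_AMOUNT <= 0.1
    return True
```
]]></artwork></figure>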
			
			<t>Note that using the same TARGET
			  value across LEDBAT flows is important, since flows
			  using different TARGET values will not share a
			  bottleneck equitably---flows with higher values will 
			  get a  larger share of the bottleneck bandwidth.</t>

			</section>
		</section>
		<section title="Understanding LEDBAT Mechanisms">
			<t>This section describes and
			  provides insight into the delay estimation
			  and window management mechanisms 
			  used in LEDBAT congestion control.
			</t>
			
			<section title="Delay Estimation">
			  <t>LEDBAT estimates congestion in the network
				based on observed increase in queueing delay in the network.
				To observe an increase in the queueing delay in the network,
				LEDBAT must separate the queueing delay component 
				from the rest of the end-to-end delay.
				This section explains how LEDBAT decomposes the 
				observed changes in end-to-end delay into these two components.</t>

			<t> LEDBAT estimates congestion in the direction of data flow.
			  To avoid measuring queue build-up on the reverse path (or ack path),
			  LEDBAT uses changes in one-way delay estimates.
			  The extant One-Way Active Measurement
			  Protocol (OWAMP) [XXXcite]
			  can be used for measuring one-way delay,
			  but since LEDBAT is used for sending data, 
			  and since LEDBAT requires only changes in one-way delay to infer congestion,
			  simply adding a timestamp to the data segments
			  and a measurement result field in the ack packets 
			  seems sufficient.
			  Doing so also avoids the pitfall of
			  measurement packets being treated
			  differently from the data packets in the network.</t>	
			  
			<section title="Estimating Base Delay">
			  <t>End-to-end delay
			  can be decomposed into transmission (or serialization) delay,
			  propagation (or speed-of-light) delay,
			  queueing delay,
			  and processing delay.
			  On any given path,
			  barring some noise,
			  all delay components except for queueing delay are constant;
			  over time, we expect only the queueing delay on the path
			  to change as the queue sizes at the links change.
			  Since	queuing delay is always additive to the end-to-end delay, 
			  we estimate the
			  sum of the constant delay components,
			  which we call "base delay",
			  to be the minimum delay observed on the end-to-end path. 
			  Using the minimum observed delay 
			  also allows LEDBAT to eliminate noise in the delay estimation,
			  such as due to spikes in processing delay at a node on the path.</t>
			
			<t>To respond to true changes in the base delay due to route changes,
			  LEDBAT uses only "recent" measurements---measurements over the last N minutes---in estimating
			  the base delay.
			  To implement an approximate
			  minimum over the last N minutes,
			  a LEDBAT sender stores N+1 separate minima---N for the last N minutes,
			  and one for the running current minute.
			  At the end of the current minute, the window moves---the 
			  earliest minimum is dropped and the latest minimum is added.
			  When the connection is idle for a given minute, 
			  no data is available for the one-way delay and,
			  therefore, no minimum is stored. 
			  When the connection has
			  been idle for N minutes, 
			  the measurement begins anew.</t>
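			<t>A minimal sketch (non-normative) of this rolling
			  per-minute minimum, mirroring the update_base_delay()
			  pseudocode (BASE_HISTORY per-minute minima, with the
			  oldest dropped at each minute rollover); the class and
			  method names are illustrative:</t>
			<figure><artwork><![CDATA[
```python
class BaseDelayHistory:
    """Keep BASE_HISTORY per-minute delay minima; the last entry is
    the running minimum for the current minute."""

    def __init__(self, base_history=10):
        self.minima = [float("inf")] * base_history  # oldest first
        self.current_minute = None

    def update(self, delay, now):
        minute = int(now // 60)
        if minute != self.current_minute:
            # Minute rollover: forget the earliest minimum and
            # start a fresh one for the new minute.
            self.current_minute = minute
            self.minima.pop(0)
            self.minima.append(delay)
        else:
            self.minima[-1] = min(self.minima[-1], delay)

    def base_delay(self):
        return min(self.minima)

h = BaseDelayHistory(base_history=10)
h.update(0.050, now=0)    # 50 ms sample in minute 0
h.update(0.080, now=30)   # same minute: the minimum stays 50 ms
h.update(0.065, now=65)   # minute 1: a new running minimum begins
queueing_delay = 0.080 - h.base_delay()   # estimate vs. 50 ms base
```
]]></artwork></figure>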

			<t> The duration of the observation window itself is a tradeoff between 
			  robustness of measurement and responsiveness to change:
			  a larger observation window yields a more accurate base delay if the true base delay is unchanged,
			  whereas a smaller observation window results in faster response to true changes in the base delay.</t>

			  <!--Assuming that the
			  queuing delay distribution density has a non-zero
			  integral from zero to any sufficiently small upper
			  limit, minimum is also an asymptotically consistent
			  estimate of the constant fraction of the delay. We can
			  thus estimate the queuing delay as the difference
			  between current and base delay as usual.-->
			</section>

			<section title="Estimating Queueing Delay">
				<t>Given that the base delay is constant,
				  the queueing delay is represented by the variable component 
				  of the measured end-to-end delay.
				  We measure queueing delay as simply the
				  difference between an end-to-end delay measurement
				  and the current estimate of base delay.</t>
			</section>
			</section>

			<section title="Managing the Congestion Window">
			  <section title="Window Increase: Probing For More Bandwidth">
				<t> A LEDBAT sender increases its
				  congestion window fastest
				  when the queuing delay estimate is zero.
				  To be friendly to competing TCP flows,
				  we set this highest rate of window growth
				  to be the same as TCP's.
				  In other words,
				  the LEDBAT window at most doubles per round-trip time.
				  Since the queuing delay estimate is always non-negative,
				  this growth rate ensures that a LEDBAT flow
				  never ramps up faster than a competing TCP flow over the same path.
				  </t>
			  </section>
				
			  <section title="Window Decrease: Responding To Congestion">
				<t> When the sender's queuing delay estimate is lower
				  than the target, the sending rate should be increased.
				  When the sender's queueing delay estimate is higher than the target,
				  the sending rate should be reduced.
				  LEDBAT uses a simple linear controller to determine the sending rate
				  as a function of the delay estimate, where the
				  response is proportional to the difference between the
				  current queueing delay estimate and the target.
				  In limited experiments with BitTorrent nodes,
				  this controller seems to work well.</t>

				<t> To deal with severe congestion when several
				  packets are dropped in the network,
				  and to provide a fallback against
				  incorrect queuing delay estimates,
				  a LEDBAT sender halves its cwnd
				  when a loss event is detected. 
				  As with NewReno, 
				  LEDBAT reduces its cwnd by half at
				  most once per RTT. 
				  Note that, unlike TCP-like loss-based
				  congestion control, LEDBAT
				  does not induce losses and so it normally does
				  not rely on losses to determine the sending
				  rate. LEDBAT's reaction to loss is thus less
				  important than it is in the case of loss-based congestion
				  control. For LEDBAT, reducing the congestion window on
				  loss is a fallback mechanism in case of severe congestion and
				  in the case of incorrect delay estimates.</t>
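				<t>A non-normative sketch of the combined behavior
				  described above (the linear controller plus the loss
				  fallback) follows; the function name, units, and the
				  loss_detected flag are illustrative, and the bookkeeping
				  for halving at most once per RTT is omitted:</t>
				<figure><artwork><![CDATA[
```python
def ledbat_cwnd_update(cwnd, queueing_delay, loss_detected,
                       target=0.100, gain=1.0):
    """One per-acknowledgement window update (illustrative sketch)."""
    if loss_detected:
        # Fallback for severe congestion or bad delay estimates:
        # halve the window, as NewReno does.
        return cwnd / 2.0
    # Linear controller: response proportional to the distance of
    # the queueing delay estimate from the target.
    off_target = target - queueing_delay
    return cwnd + gain * off_target / cwnd
```
]]></artwork></figure>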
			  </section>

			</section>
		</section>

		<section title="Choosing Parameter Values">
		  <t> Through this discussion, we hope to encourage informed experimentation with LEDBAT.</t>
			  <section title="Queuing Delay Target">
				<t>Consider the queuing delay at the bottleneck.  This
				delay is the extra delay induced by congestion
				control. One of our design goals is to keep this delay
				low. However, when this delay is zero, the queue is
				empty, the link has no backlog to send, and the link
				is thus not saturated. Hence, our design goal is to
				keep the queuing delay low, but non-zero.</t>

				<t>How low do we want the queuing delay to be? Because
				another design goal is to be deployable on networks
				with only simple FIFO queuing and drop-tail
				discipline, we can't rely on explicit signaling of
				the queuing delay.  We therefore estimate it
				using external measurements. These measurements
				will have an error at least on the order of best-case
				scheduling delays in the operating systems involved.
				There is thus a good reason to make the queuing delay
				larger than this error, but there is no reason to push
				the delay much higher. Thus, we adopt a fixed delay
				target that we aim to maintain.</t>
			  </section>
		</section>

		<section title="Discussion">
		  <section title="Framing Considerations">
			  <t>The actual framing and wire format of the protocol(s)
				using the LEDBAT congestion control mechanism are outside
				the scope of this document, which describes only the
				congestion control part.</t>
			  <t>One implication is that the sender needs the one-way
				delay from the sender to the receiver. An
				obvious way to support this is to use a framing that
				timestamps packets at the sender and conveys the measured
				one-way delay back to the sender in ack packets. This is
				the method we'll keep in mind for the purposes of
				exposition. Other methods are possible and valid.</t>
			  <t>Another implication is the receipt of frequent ACK
				packets. The exposition below assumes one ACK per data
				packet, but any reasonably small number of data packets
				per ACK will work as long as there is at least one ACK
				every round-trip time.</t>
			  <t>The protocols to which this congestion control
				mechanism is applicable, with possible appropriate
				extensions, are TCP, SCTP, DCCP, etc. It is not a goal of
				this document to cover such applications. The mechanism
				can also be used with proprietary transport protocols,
				e.g., those built over UDP for P2P applications.</t>
		  </section>

		  <section title="Competing With TCP Flows">
			<t>Consider competition between a LEDBAT connection and a
			  connection governed by loss-based congestion control (on
			  a FIFO bottleneck with drop-tail discipline).
			  The loss-based connection needs to experience loss to
			  back off, and loss only occurs after the bottleneck queue
			  fills and delay reaches its maximum.  LEDBAT will thus
			  receive a congestion indication sooner than the loss-based
			  connection.  If LEDBAT can ramp down faster than the
			  loss-based connection ramps up, LEDBAT will
			  yield. LEDBAT ramps down when the queuing delay estimate
			  exceeds the target: the larger the excess, the faster the
			  ramp-down.  When the loss-based connection is standard
			  TCP, LEDBAT will yield at precisely the same rate as TCP
			  is ramping up when the queuing delay is double the
			  target.</t>
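			<t>The claim about yielding at double the target can be
			  checked directly from the controller step in the sender
			  pseudocode: when the queuing delay is 2*TARGET, the
			  per-acknowledgement decrease exactly mirrors the maximal
			  per-acknowledgement increase (the one taken when the
			  queuing delay is zero). A small non-normative check:</t>
			<figure><artwork><![CDATA[
```python
TARGET = 0.100  # seconds
GAIN = 1.0

def per_ack_change(cwnd, queueing_delay):
    # The controller step from the sender pseudocode.
    return GAIN * (TARGET - queueing_delay) / cwnd

cwnd = 20.0
max_increase = per_ack_change(cwnd, 0.0)         # fastest ramp-up
yield_rate = per_ack_change(cwnd, 2 * TARGET)    # queue at 2x target
assert yield_rate == -max_increase               # equal and opposite
```
]]></artwork></figure>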

			<t>LEDBAT is most aggressive when its queuing delay
			  estimate is most wrong in the downward direction.
			  Since the queuing delay estimate is non-negative, the
			  worst possible case is when somehow the estimate is
			  always returned as zero.  In this case, LEDBAT will ramp
			  up as fast as TCP and halve its rate on loss.  Thus, in
			  the case of worst possible failure of the estimates,
			  LEDBAT will behave identically to TCP.  This provides an
			  extra safety net.</t>
  
		  </section>

		  <section title="Fairness Among LEDBAT Flows">
			<t>The design goals of LEDBAT center around the aggregate
			  behavior of LEDBAT flows when they compete with standard
			  TCP. It is also interesting to consider how LEDBAT flows
			  share bottleneck bandwidth when they compete only among
			  themselves.</t>
			<t>LEDBAT as described so far lacks a mechanism
			  specifically designed to equalize utilization between
			  these flows. The observed behavior of existing
			  implementations indicates that a rough equalization, in
			  fact, does occur.</t>
			<t>The delay measurements used as control inputs by LEDBAT
			  contain some amount of noise and errors. The linear
			  controller converts this input noise into the same
			  amount of output noise. As a result, the component of the
			  noise that is uncorrelated between flows
			  randomly shuffles some amount of bandwidth
			  between flows. The amount shuffled during each RTT is
			  proportional to the noise divided by the target
			  delay. The random-walk trajectory of bandwidth utilized
			  by each of the flows over time tends to the fair
			  share. The timescales on which the rates become
			  comparable are proportional to the target delay
			  multiplied by the RTT and divided by the noise.</t>
			<t>In complex real-life systems, the main concern is
			  usually reducing the amount of noise, which is
			  plentiful unless actively suppressed. In some circumstances,
			  however, the measurements might be "too good" -- since
			  the equalization timescale is inversely proportional to
			  the noise, perfect measurements would result in a lack of
			  convergence.</t>
			<t>Under these circumstances, it may be beneficial to
			  introduce some artificial randomness into the inputs
			  (or, equivalently, outputs) of the controller. Note that
			  most systems should not require this and should be
			  primarily concerned with reducing, not adding,
			  noise.</t>

			<t>With delay-based congestion control systems, there's a
			  concern about the ability of late comers to measure the
			  base delay correctly. Suppose a LEDBAT flow saturates a
			  bottleneck; another LEDBAT flow starts and proceeds to
			  measure the base delay and the current delay and to
			  estimate the queuing delay. If the bottleneck always
			  contains target delay worth of packets, the second flow
			  would see the bottleneck as empty start building a
			  second target delay worth of queue on top of the
			  existing queue. The concern ("late comers' advantage")
			  is that the initial flow would now back off because it
			  sees the real delay and the late comer would use the
			  whole capacity.</t>
			<t>However, once the initial flow yields, the late comer
			  immediately measures the true base delay and the two
			  flows operate from the same (correct) inputs.</t>
			<t>Additionally, in practice this concern is further
			  alleviated by the burstiness of network traffic: all
			  that's needed to measure the base delay is one small
			  gap. These gaps can occur for a variety of reasons: the
			  OS may delay the scheduling of the sending process until
			  a time slice ends, the sending computer might be
			  unusually busy for some number of milliseconds or tens
			  of milliseconds, etc. If such a gap occurs while the
			  late comer is starting, base delay is immediately
			  correctly measured. With a small number of flows, this
			  appears to be the main mechanism of regulating the late
			  comers' advantage.</t>
		  </section>
<!--		  <section title="Deployment Status">
		  </section>
-->		</section>
		
		<section title="IANA Considerations">
		  <t>There are no IANA considerations for this document.</t>
		</section>

        <section title="Security Considerations">
        	<t>A network element on the path might choose to induce higher
        	delay measurements than the real queuing delay so that
        	LEDBAT backs off even when no congestion is present.
        	Shaping of traffic into an artificially narrow bottleneck
        	can't be counteracted, but tampering with the timestamp
        	field can be, and SHOULD be.  A protocol using the LEDBAT congestion control
        	SHOULD authenticate the timestamp and delay fields,
        	preferably as part of authenticating most of the rest of
        	the packet, with the exception of volatile header fields.
        	The choice of an authentication mechanism that resists
        	man-in-the-middle attacks is outside the scope of this
        	document.</t>
        </section>
	</middle>

    <back>
        <references title='Normative References'>&rfc2119;&rfc2581;
		</references>

		<section anchor="app-additional" title="Timestamp errors">
		  <t>One-way delay measurement needs to deal with timestamp
			errors. We'll use the same locally linear clock model and
			the same terminology as Network Time Protocol (NTP). This
			model is valid for any differentiable clocks.  NTP uses
			the term "offset" to refer to difference from true time
			and "skew" to refer to difference of clock rate from the
			true rate.  The clock will thus have a fixed offset from
			the true time and a skew.  We'll consider what we need to
			do about the offset and the skew separately.</t>
				
		  <section title="Clock offset">
			<t>First, consider the case of zero skew. The offset of
			  each of the two clocks shows up as a fixed error in
			  one-way delay measurement. The difference of the offsets
			  is the absolute error of the one-way delay estimate. We
			  won't use this estimate directly, however. We'll use the
			  difference between that and a base delay. Because the
			  error (difference of clock offsets) is the same for the
			  current and base delay, it cancels from the queuing
			  delay estimate, which is what we'll use. Clock offset is
			  thus irrelevant to the design.</t>
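			<t>This cancellation is easy to verify numerically (a
			  non-normative sketch; the offsets chosen are arbitrary):</t>
			<figure><artwork><![CDATA[
```python
def measured_owd(true_delay, sender_offset, receiver_offset):
    """One-way delay computed from two offset (but skew-free)
    clocks: receiver timestamp minus sender timestamp."""
    return true_delay + (receiver_offset - sender_offset)

o_s, o_r = 120.0, -35.0                    # arbitrary clock offsets
base = measured_owd(0.040, o_s, o_r)       # minimum seen (empty queue)
current = measured_owd(0.090, o_s, o_r)    # 50 ms of queueing delay
queueing_estimate = current - base
assert abs(queueing_estimate - 0.050) < 1e-9   # offsets cancel
```
]]></artwork></figure>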
		  </section>

		  <section title="Clock skew">
			<t>Now consider the skew. For a given clock, skew
			  manifests in a linearly changing error in the time
			  estimate.  For a given pair of clocks, the difference in
			  skews is the skew of the one-way delay estimate.  Unlike
			  the offset, this no longer cancels in the computation of
			  the queuing delay estimate.  On the other hand, while
			  the offset could be huge, with some clocks off by
			  minutes or even hours or more, the skew is typically not
			  too bad.  For example, NTP is designed to work with most
			  clocks, yet it gives up when the skew is more than 500
			  parts per million (PPM).  Typical skews of clocks that
			  have never been trained seem to often be around 100-200
			  PPM.  Previously trained clocks could have 10-20 PPM
			  skew due to temperature changes.  A 100-PPM skew means
			  accumulating 6 milliseconds of error per minute.  The
			  expiration of base delay related to route changes mostly
			  takes care of clock skew.  A technique to specifically
			  compute and cancel it is trivially possible and involves
			  tracking base delay skew over a number of minutes and
			  then correcting for it, but usually isn't necessary,
			  unless the target is unusually low, the skew is
			  unusually high, or the base interval is unusually long.
			  It is not further described in this document.</t>
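			<t>The error-accumulation arithmetic above can be checked
			  directly:</t>
			<figure><artwork><![CDATA[
```python
# 100 PPM of skew accumulates 6 ms of error per minute.
skew_ppm = 100
error_per_minute = skew_ppm * 1e-6 * 60   # seconds per minute
assert abs(error_per_minute - 0.006) < 1e-9
```
]]></artwork></figure>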
					
			<t>For cases when the base interval is long or the skew is
			  high or the target is low, a technique to correct for
			  skew can be beneficial. The technique described here or
			  a different mechanism MAY be used by
			  implementations. The technique described is still
			  experimental, but it is actually currently used. The
			  pseudocode in the specification below does not include
			  any of the skew correction algorithms.</t>
			
			<section title="Deployed clock skew correction mechanism">
            <t>Clock drift can be in two directions: either the
            sender's clock runs faster than the receiver's, or vice
            versa. We refer to the former situation as clock drift "in
            sender's favor" and to the latter as clock drift "in
            receiver's favor".</t>

            <t>When the clock drift is "in sender's favor", nothing
            special needs to be done, because the timestamp
            differences (i.e., the raw delay estimates) will grow
            smaller with time, and thus the base delay will be
            continuously updated with the drift.</t>

            <t>When the clock drift is "in receiver's favor", the raw
            delay estimates will drift upwards over time, needlessly
            suppressing the throughput. This is the case that can
            benefit from a special mechanism. Assume symmetrical
            framing (i.e., the same information about delays is
            transmitted in both directions). If the sender can detect
            the receiver reducing its base delay, it can infer that
            this is due to clock drift "in receiver's favor" and
            compensate by increasing the sender's base delay by the
            same amount. Since, in our implementation, the receiver
            sends back the raw timestamp, the sender can run the same
            base delay calculation algorithm for the receiver that it
            runs for itself; when this calculation reduces the
            inferred receiver's base delay, the sender increases its
            own base delay by the same amount.</t>
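The sender-side bookkeeping described above might be sketched as follows (Python; class and attribute names are illustrative, not part of the specification):

```python
# Illustrative sketch of the deployed drift-compensation idea: the
# sender tracks a base delay for each direction; when the inferred
# receiver-side base delay decreases (clock drift "in receiver's
# favor"), the sender raises its own base delay by the same amount.
class DriftCompensator:
    def __init__(self):
        self.own_base_delay = float('inf')       # min raw delay seen, sender -> receiver
        self.receiver_base_delay = float('inf')  # min raw delay inferred, receiver -> sender

    def on_own_delay_sample(self, raw_delay):
        # Normal base delay update for the sender's own direction.
        self.own_base_delay = min(self.own_base_delay, raw_delay)

    def on_receiver_delay_sample(self, raw_delay):
        # Run the same base delay calculation for the receiver's
        # direction, using the raw timestamps it echoes back.
        if raw_delay < self.receiver_base_delay:
            shift = (self.receiver_base_delay - raw_delay
                     if self.receiver_base_delay != float('inf') else 0.0)
            self.receiver_base_delay = raw_delay
            # Compensate: the receiver's base went down, so ours goes up.
            self.own_base_delay += shift
```

A decrease of X in the inferred receiver-side base delay thus raises the sender's base delay by X, cancelling drift "in receiver's favor".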
          </section>
          
          <section title="Skew correction with faster virtual clock">
            <t>This is an alternative skew correction algorithm,
            currently under consideration and not deployed in the
            wild.</t>
            
            <t>Since having a faster clock on the sender is,
            relatively speaking, a non-problem, one can use two
            different virtual clocks in each LEDBAT implementation:
            for example, the default machine clock when the instance
            is acting as a receiver, and a virtual clock, computed
            from the machine clock through a simple linear
            transformation, when the instance is acting as a
            sender. Make the virtual clock, e.g., 500 PPM faster than
            the machine clock. Since 500 PPM exceeds the variability
            of most clocks (plus or minus 100 PPM), any sender's clock
            is then very likely to be faster than any receiver's
            clock, thus benefitting from the implicit correction of
            taking the minimum as the base delay.</t>
            
            <t>Note that this approach is not compatible with the one
            described in the preceding section.</t>
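A linear virtual clock of this sort could be as simple as the following sketch (Python; the 500-PPM figure comes from the text above, the names are illustrative):

```python
# Illustrative sketch: a virtual sending clock derived from the
# machine clock by a linear transformation, run 500 PPM fast so the
# local clock is very likely faster than any receiver's clock.
VIRTUAL_SKEW_PPM = 500

def virtual_send_time(machine_time):
    # Scale the machine clock by (1 + 500e-6); used only when this
    # instance is acting as a sender.
    return machine_time * (1 + VIRTUAL_SKEW_PPM * 1e-6)
```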
          </section>
          
          <section title="Skew correction with estimating drift">
            <t>This is an alternative skew correction algorithm,
            currently under consideration and not deployed in the
            wild.</t>
            
            <t>The history of base delay minima we already keep for
            each minute provides us with direct means of computing the
            clock skew difference between the two hosts.  Namely, we
            can fit a linear function to the set of base delay
            estimates for each minute. The slope of the function is an
            estimate of the clock skew difference for the given pair
            of sender and receiver. Once the clock skew difference is
            estimated, it can be used to correct the clocks so that
            they advance at nearly the same rate. Namely, the clock
            needs to be corrected by half of the estimated skew
            amount, since the other half will be corrected by the
            other endpoint. Note that the skew differences are then
            maintained for each connection and the virtual clocks used
            with each connection can differ, since they do not attempt
            to estimate the skew with respect to the true time, but
            instead with respect to the other endpoint.</t>
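The per-minute fit might look like the following sketch (Python; a plain least-squares slope over the base-delay history, with names that are illustrative only):

```python
# Illustrative sketch: estimate the clock skew difference between two
# hosts by fitting a line to one base-delay minimum per minute.
def estimate_skew(base_delays):
    # base_delays: one minimum per minute; the least-squares slope is
    # the estimated skew difference (seconds of drift per minute).
    n = len(base_delays)
    xs = range(n)  # minute index
    mean_x = (n - 1) / 2.0
    mean_y = sum(base_delays) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, base_delays))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

def local_correction_per_minute(base_delays):
    # Each endpoint corrects by half of the estimated difference; the
    # other endpoint corrects the other half.
    return estimate_skew(base_delays) / 2.0
```

For example, a 100-PPM skew difference shows up as base delay minima growing by about 6 ms per minute, so the fitted slope is about 0.006 s/minute and each endpoint corrects by half of that.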
            
          </section>

          <section title="Byzantine skew correction">
            <t>This is an alternative skew correction algorithm,
            currently under consideration and not deployed in the
            wild.</t>

            <t>When it is known that each host maintains long-lived
            connections to a number of different other hosts, a
            byzantine scheme can be used to estimate the skew with
            respect to true time: calculate the skew difference for
            each of the peer hosts as described in the preceding
            section, then take the median of the skew
            differences.</t>

            <t>This inherent clock drift can then be corrected with a
            linear transformation before the clock data is used in the
            algorithm from the preceding section, the currently
            deployed algorithm, or nearly any other skew correction
            algorithm.</t>

            <t>While this scheme is not universally applicable, it
            combines well with other schemes, since it is essentially
            a clock training mechanism. It also acts the fastest,
            since its state is preserved across connections.</t>
          </section>
		  </section>
		</section>
    </back>

</rfc>
