http://stupid.domain.name/ietf/

One document matched: draft-yourtchenko-two-party-initial-window-00.xml
<?xml version='1.0' encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM 'rfc2629.dtd' [
<!ENTITY rfc5681 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5681.xml">
<!ENTITY rfc2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY rfc4782 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.4782.xml">
<!ENTITY rfc2140 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2140.xml">
<!ENTITY I-D.ietf-tcpm-initcwnd SYSTEM "http://xml.resource.org/public/rfc/bibxml3/reference.I-D.ietf-tcpm-initcwnd">
<!ENTITY I-D.allman-tcpm-bump-initcwnd SYSTEM "http://xml.resource.org/public/rfc/bibxml3/reference.I-D.allman-tcpm-bump-initcwnd">
]>

<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<?rfc toc='yes'?>
<?rfc rfcprocack="yes" ?>
<?rfc symrefs="yes" ?>
<?rfc iprnotified="no" ?>
<?rfc strict="yes" ?>
<?rfc compact="yes" ?>
<?rfc subcompact="no" ?>
<?rfc sortrefs="yes" ?>
<?rfc colonspace='yes' ?>
<?rfc tocindent='yes' ?>

<rfc category="std" docName="draft-yourtchenko-two-party-initial-window-00" ipr="trust200902">
<front>
<title abbrev='Two-Party TCP Initial Window Estimate'>TIE: Two-Party Selection of TCP Initial Congestion Window Estimate</title>

<author initials='A.' surname='Yourtchenko' fullname='Andrew Yourtchenko'>
<organization>Cisco</organization>

<address>
<postal>
<street>7 de Kleetlaan</street>
<city>Diegem</city><code>1831</code>
<country>Belgium</country>
</postal>

<phone>+32 2 704 5494</phone>
<email>ayourtch@cisco.com</email>
</address>
</author>

<author fullname="Ali Begen" initials="A." surname="Begen">      
<organization>Cisco</organization>
<address>
<postal>
<street>181 Bay Street</street>
<city>Toronto</city>
<region>ON</region>
<code>M5J 2T3</code>
<country>Canada</country>
</postal>
<email>abegen@cisco.com</email>
</address>
</author>

<author fullname="Anantha Ramaiah" initials="A." surname="Ramaiah">
<organization>Cisco</organization>
<address>
<postal>
<street>170 Tasman Drive</street>
<city>San Jose</city>
<region>CA</region>
<code>95134</code>
<country>USA</country>
</postal>
<phone>+1 (408) 525-6486</phone>
<email>ananth@cisco.com</email>
</address>
</author>


<date year='2010' />

<abstract>
<t>
There are several proposals on increasing the TCP initial congestion window to a new hardcoded value and subsequent updates of this hardcoded value. This document explores the mechanism of dynamically determining the potentially unbounded Initial Window. This would allow for the future growth after only one software upgrade of the existing TCP stacks. Also this would allow automatic adjustment of the Initial Window size depending on the connectivity properties of the particular Internet host, while minimizing the retransmissions that result from overly optimistic initial congestion window selection.
</t>
</abstract>
</front>
<middle>

<section title='Introduction'>
<t>
There is a growing need to have the value of the initial congestion window increased from the 2-4 segments <xref target="RFC5681">RFC5681</xref> to bigger values - in order to decrease the latency on the initial exchange in client-server protocols like HTTP. This would have a big positive impact on the short-lived connections. <xref target="URL.IW10-references-page">IW10 references page</xref> has the references to some of the related studies. 
</t>
<t>
There are two alternative proposals that both talk about increasing the initial congestion window. <xref target="I-D.ietf-tcpm-initcwnd">I-D.ietf-tcpm-initcwnd</xref> proposes a fixed new value of 10, and <xref target="I-D.allman-tcpm-bump-initcwnd">I-D.allman-tcpm-bump-initcwnd</xref> proposes a fixed value of 6 as well as the schedule for the future increases in the initial congestion window value.
</t>
<t>
The first of the two notes that there is a very modest increase in retransmissions for most applications (0.3% to 0.7%, and for a specific application that created multiple connections, it was between 0.9% and 1.85%). Also that test data shows that slower connections show noticeably bigger increase in retransmissions in case of a bigger initial congestion window.
</t>
<t>
The second document specifies the initial congestion window of 6 segments, with the schedule for future increases in that value. Although this value should result in a smaller impact than the value of 10, the potential problem with the "predefined schedule" approach is that it would force the upgrade cycles of the existing installations.
</t>
<t>
There was also a proposal for Quick Start <xref target="RFC4782">RFC4782</xref> in which the nodes explicitly supply the desired rate of data. That proposal, however, assumes the cooperation of all the routers across the path - and in case the routers do not support this extension, the hosts will use the full proposed bandwidth - therefore increasing the potential for extra packet loss.
</t>
<t>
Another performance enhancement keeps the RTT and cwnd estimates in a per-host cache <xref target="RFC2140">RFC2140</xref> - implemented in a FreeBSD (and reportedly in Linux and Solaris) TCP stack. The practice shows it is a very useful enhancement, but it would not cater for the cases where the hosts did not communicate before - as is the case for a lot of client-server interactions on the Internet.
</t>
<t>
In this document we explore an alternative approach, in which the initial congestion window would be probed over time by the Internet host, and present the initial simulations that in the authors' opinion illustrate the bigger flexibility of it.
</t>
<t>
We claim that choosing the initial congestion window value in an adaptive manner is able to cope with networks and applications that vary widely in their nature. There are networks that can easily handle quite large (even larger than 10) initial congestion window values, while there are others that still struggle with values of 2-4.

 

There are applications that will greatly benefit from using large initial congestion window values. For example, HTTP transactions that tend to last for a short amount of time (often less than 10 RTTs) will naturally favor larger initial congestion window values since the download times could be reduced substantially. On the other hand, HTTP transactions that will last longer (e.g., big file downloads) are unlikely to see any improvement at all. For some other types of applications, a larger initial congestion window bundled with an aggressive sender or receiver window may do more harm than benefit. While performance improvements can be observed individually, the overall performance across the network may degrade. One could suggest to set the initial congestion window values per application, however, even within a single application (or application-layer protocol), there could be enough variation that warrants setting the initial congestion window value differently. Furthermore, this could be seen as a layer violation. To follow the KISS approach, we propose to determine the initial congestion window value in an adaptive manner that takes the past observations into account. For multi-interface hosts, the value could be specified per interface.
</t>
<t>
There exist also some proposals like <xref target="URL.Adaptive-Start">Adaptive Start</xref>, which explore the change to the slow start algorithm itself in order to optimize the case of short-lived connections. Changes of that magnitude are out of scope for this document - however, the mechanism described here could be used as input for these alternative mechanisms as well.
</t>
</section>
<section title='Notational Conventions'>
<t>
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in <xref target="RFC2119">RFC2119</xref>.
</t>
</section>
<section title='The Case for a Two-Party Adaptive Approach'>
<t>
The use of hardcoded initial congestion window is attractive in its simplicity. However, it has a drawback: to cater well for all the deployments it either needs to be a "lowest common denominator" or make a preference towards a better service for some scenarios at the expense of others. The current initial congestion window has been in place for a relatively long time - so it could have been incorporated as one of the constraints into the design of networking components as well as the design criterion for the networks themselves. Unilaterally increasing this value could impose a choice on some of the networks - observe the increased amount of retransmissions or upgrade. A typical upgrade cycle in a network usually involves a large amount of testing and verification by the network architecture team - likewise, the change of the product design assumptions is also a difficult and laborious process. We believe that the adaptive approach to the initial congestion window would allow to adapt the behavior of TCP to the de-facto capabilities of the equipment, not otherwise. Additionally, the nature of adaptive algorithm would allow for future growth - opening the room for more innovative networking solutions.
</t>
<t>
One might argue that a single-party adaptive approach is enough. According to the simulations, the difference in efficiency of a single-party approach is where it is intuitively expected to be: in the middle between the static value choice and the two party approach that is proposed in this document. Arguably adding the second party can be called an optimization - however, we think this optimization is significant enough to discuss it separately.
</t>
</section>
<section title='How to Choose the Initial Congestion Window'>
<t>
We propose the choice of the initial congestion window to become part of the standard three-way handshake - much like the choice of the Maximum Segment Size (MSS). In the SYN segment, each host would notify the other party of the value for the Initial Congestion Window Estimate (ICWE) that they consider would be the best from their perspective, and after completing the three-way handshake both would select the minimum of the two values. This would allow to incorporate the knowledge about the connectivity from both sides of the connection and allow for a better performance without storing a lot of state information - the knowledge about the initial congestion window value will be distributed across all the communicating hosts, and will be supplied and used just when it is needed.
</t>
<t>
Graphically the process can be depicted as follows:

<figure><artwork>
TCP A                                 TCP B
-----                                 -----
  |                                     |
  |--- TCP SYN, ICWE_A ----->           |
  |                                     |
  |        <--- TCP SYNACK, ICWE_B -----|
  |                                     |
ICW_A = min(ICWE_A,ICWE_B)         ICW_B = min(ICWE_A,ICWE_B)
  |                                     |
 ...                                   ...
</artwork></figure>
</t>
<t>
The method of communication of the ICWE values is described in a separate section below.
</t>
<t>
There are several assumptions that this method makes:
</t>
<t>
<list style="symbols">
<t>"At least one of the sides has a component in the path that has the significant impact on burst tolerance of the path across all of the connections this node makes". This assumption is not correct in all cases, however, it is a weaker assumption than the one imposed by the fixed initial congestion window value - all paths may tolerate the burst of X segments, where X is the initial congestion window value. It is also weaker than the assumption that the quality of one connection is always correlated to the quality of another one - which the single-party mechanism makes.</t><t>"The congestion tolerance properties of the path are symmetric". This assumption is the result of the fact that both sides select the common ICWE. This is obviously untrue for a significant portion of cases (e.g., ADSL). However, the typical use case for such environments is that the hosts will be most interested in the best possible use of the capacities of the downstream path - i.e., the ICWE will matter only in one direction. The adaptive nature of the algorithm allows to find the good value for the initial congestion window in case the use patterns are different.</t></list>

</t>
</section>
<section title='How to Update the Initial Congestion Window Estimate'>
<t>
The operation of the host in the initial state starts with the congestion window that is equal to the currently defined one (e.g between 2 and 4 segments). From there on, the host will need to look at the retransmission rate and make a decision whether to increment the congestion window estimate or to decrement it. This decision process needs to satisfy the following criteria:
</t>
<t>
<list style="symbols">
<t>It MUST NOT be overly dynamic. The loss of the segments may be caused either by the inherent properties of the path, temporary congestion or by packet corruption as in the case of wireless transmission.</t><t>It MUST NOT be overly static. If the characteristics of the path change, the host must be able to adapt to the new conditions on a timescale that is closer to order of minutes than to order of weeks.</t><t>It MUST be conservative to avoid the negative effects on the Internet as a whole.</t></list>

</t>
<t>
For the purposes of the initial simulation experiments, the decision process is naive: <list style="symbols">
<t>The party with the smaller ICWE is allowed to increment it if it experienced no losses for the initial burst of traffic within the N past connections. </t><t>The party with the smaller ICWE MUST decrement it in case it experienced loss on one of its connections.</t><t>If there is no way to conclusively determine the presence or absence of loss (e.g., a smaller amount of data than chosen ICWE was sent over a connection), ICWE values on both sides MUST remain unchanged.</t></list>

</t>
<t>
The rationale in the second bullet point to decrement the smaller of the two ICWE estimates, not the larger one, is as follows. Suppose we have two parties - party A, which enjoys better connectivity and can use larger initial congestion window values, and a party B that uses smaller congestion window values. If they make a connection, the ICWE proposed by B will be smaller and will be chosen for the connection. If this estimate is inadequate - it is the responsibility of B, since it was its ICWE that was chosen.
</t>
<t>
This also allows to not influence the connections between the well-connected hosts by the connections from the hosts that have poorer connectivity.
</t>
<t>
This algorithm may not necessarily satisfy all of the requirements above in all cases. More experimentation is needed to find a better mechanism.
</t>
<t>
{DISCUSS: if the connection experiences slow start several times, can we count it as several slow starts ? The reliable detection/action in this case is only possible by one side - then only if the sender had a smaller ICWE can anything be done. } 
</t>
</section>
<section title='How to Communicate the ICWE'>
<t>
The first obvious choice of communicating the ICWE is introducing the new TCP option. However, this approach is difficult in its practical implementation because of the limited TCP option space. Instead, another approach is proposed below.
</t>
<t>
Based on the observations, the TCP Window Size in the initial segment's TCP header is an an even value. This means that the TCP stack can signal to the peer its awareness about the adaptive initial congestion window by setting the least significant bit to 1. Then, the TCP Window field will contain the ICWE value sent by this peer.
</t>
<t>
Upon sending the ACKs for the data received, each side can then communicate the TCP Window according to the TCP specification.
</t>
<t>
{discuss: setting the LSB to 1 for the window allows to have an 'update' for ICWE in some form, throughout the life of the connection. Whether it is needed is a separate topic. Probably not}.
</t>
<t>
An argument against such "LSB signaling" would be that we are changing the semantics of the TCP Window field. However, this equally goes about using TCP Window to clamp the initial congestion window imposed by the other methods - because in that case it is no longer an amount of data that the host is ready to receive, but the amount of data that the host thinks the other party should transmit.
</t>
<t>
During the transmission, the ICWE is supposed to be encoded in the same metric as TCP Window - e.g., in octets. This mechanism can help with interoperability with the stacks that do not perform the adaptive initial congestion window selection, but that are overly optimistic with their congestion window.
</t>
</section>
<section title='How to Determine the Loss of the Initial Burst of Data'>
<t>
{FIXME: this section needs more brainstorming/work}
</t>
<t>
Determining the loss of the initial congestion window of data on the side that has filled it by sending the data is simple - it can be done by verifying if it did any retransmits of this data or not. However, since the algorithm does not imply the communication between the parties beyond the exchange of ICWEs in the very beginning of the connection, both sides need to detect the congestion - and on the receiver side of the burst it may not be easy.
</t>
<t>
{this is where the discussion is needed}
</t>
<t>
There may be a condition where the amount of data sent is smaller than the initial congestion window - i.e., it is not possibly to determine whether the ICWE estimate was right or wrong. In such cases the values of the ICWE on both sides MUST remain unchanged.
</t>
</section>
<section title='Simulation Results'>
<t>
This section shows the results obtained using the simulation enclosed in the Appendix A.
</t>
<t>
{FIXME: to be expanded, for now just pasting from the tcpm mail}

<figure><artwork>

1) Static IW=10, access capacity 5..8, backbone capacity 8..15

./a.out 10 0 1000 5 8 8 15 1 | grep round

Results from the round: loss: 423, headroom: 0, avg IW size: 10.000000
Results from the round: loss: 436, headroom: 0, avg IW size: 10.000000
Results from the round: loss: 423, headroom: 0, avg IW size: 10.000000
Results from the round: loss: 426, headroom: 0, avg IW size: 10.000000

2) One-side-adjustment of the IW:

./a.out 10 1 1000 5 8 8 15 1 | grep round

Results from the round: loss: 66, headroom: 10, avg IW size: 6.380000
Results from the round: loss: 52, headroom: 14, avg IW size: 6.180000
Results from the round: loss: 62, headroom: 10, avg IW size: 6.350000
Results from the round: loss: 63, headroom: 9, avg IW size: 6.210000

3) Two-party adjustment of the IW:

./a.out 10 1 1000 5 8 8 15 0 | grep round

Results from the round: loss: 1, headroom: 32, avg IW size: 5.420000
Results from the round: loss: 0, headroom: 37, avg IW size: 5.350000
Results from the round: loss: 1, headroom: 29, avg IW size: 5.350000
Results from the round: loss: 1, headroom: 41, avg IW size: 5.350000
Results from the round: loss: 0, headroom: 50, avg IW size: 5.290000
</artwork></figure>
</t>
<t>
It needs noting that the value of N needs to be sufficiently large in order to show the best results, below is the result with N=10:

<figure><artwork>

./a.out 10 1 10 5 8 8 15 0 | grep round

Results from the round: loss: 13, headroom: 27, avg IW size: 5.630000
Results from the round: loss: 12, headroom: 6, avg IW size: 5.690000
Results from the round: loss: 10, headroom: 16, avg IW size: 5.650000
Results from the round: loss: 12, headroom: 23, avg IW size: 5.680000
Results from the round: loss: 12, headroom: 13, avg IW size: 5.650000

</artwork></figure>
</t>
<t>
We can see that the results are still better than the one-sided judgement, however, they are of the same order of magnitude.
</t>
</section>
<section title='Conclusions'>
<t>
The choice of the best possible criteria for the increase/decrease of the ICWE values is to be explored further, however, the initial results show that the proposed algorithm provides the results better than unilateral decision on the initial congestion window.
</t>
</section>
<section title='Interoperability With the Hosts That Do Not Support TIE'>
<t>
If one of the sides of the connection does support the method mentioned in this document, and the other does not, then the side supporting it MAY use its best judgement with respect to the initial congestion window, namely the fixed increased initial congestion window of 10 as described in <xref target="I-D.ietf-tcpm-initcwnd">I-D.ietf-tcpm-initcwnd</xref>. However, in this case the hosts MUST implement the appropriate mechanisms for fallback in case of the losses due to this increased initial window similar to the ones proposed by Joe Touch.
</t>
<t>
{FIXME: this section needs more discussion/further work}
</t>
</section>
<section title='Acknowledgements'>
<t>
Thanks to Philip Gladstone, Dan Wing, Preethi Natarajan, Stig Venaas for the review and useful comments and suggestions.
</t>
</section>
<section title='IANA Considerations'>
<t>
None.
</t>
</section>
<section title='Security Considerations'>
<t>
The adaptive nature of this mechanism is a challenge in case of malicious hosts who want to affect the peer's estimate for the future connections. The malicious host would use the larger ICWE than the party and then force the packet loss on the connection. This would result in the decrease of the other party's ICWE for the subsequent connections.
</t>
<t>
However, this would work only in case of low volume of the legitimate connections - in case there is a high volume, the ICWE will quickly return to the good value based on the other connections' information. 
</t>
<t>
{FIXME: There is potentially more of interesting things here. Needs more work}
</t>
</section>
<section title='Simulation Code'>
<t>
The below code is enclosed so the readers can verify the results and run their own experiments:

<figure><artwork>
#include <stdio.h>

#define COUNT 10

int access_capacity[COUNT];
int backbone_capacity[COUNT][COUNT];
int node_iw[COUNT];
int node_noloss[COUNT];

void fill_initial(int iw, int lac, int hac, int lbc, int hbc) {
  int i, j;
  for(i=0; i<COUNT; i++) {
    node_iw[i] = iw;
    node_noloss[i] = 0;
    access_capacity[i] = rand() % (hac-lac) + lac;
    for(j=0;j<COUNT;j++) {
      backbone_capacity[i][j] = rand() % (hbc-lbc) + lbc;
    }
  }
}

void iterate(int no_hint, int adjust, int loss_thresh, 
             int *rloss, int *rheadroom, int *riw) {
  // return the loss on the connection
  int i,j;
  int iw;
  int mincap;
  int loss;
  int headroom;

  i = rand() % COUNT;
  j = rand() % COUNT;

  if(no_hint) {
     iw = node_iw[i];
  } else {
     iw = node_iw[j] < node_iw[i] ? node_iw[j] : node_iw[i];
  }

  *riw = iw;
  printf("Connection between %d[iw %d, cap %d] "
         "and %d[iw %d, cap %d], resulting "
         "iw = %d, backbone cap: %d\n",
         i, node_iw[i], access_capacity[i],
         j, node_iw[j], access_capacity[j],
         iw, backbone_capacity[i][j] );

  // calculate the min capacity = min of two access and backbone

  mincap = backbone_capacity[i][j];
  if (mincap > access_capacity[i]) { mincap = access_capacity[i]; }
  if (mincap > access_capacity[j]) { mincap = access_capacity[j]; }

  // loss occurs if iw is bigger than a minimum capacity.
  loss = iw > mincap ? iw - mincap : 0;

  // headroom is a measure of inefficiency - how much 
  // did we underuse the capacity of the path
  headroom = iw < mincap ? mincap - iw : 0;

  if (adjust) {

    if (no_hint) {
      if (loss) {
        if(node_iw[i] > 2) { node_iw[i]--; }
      } else {
        node_noloss[i]++;
        if(node_noloss[i] > loss_thresh) { node_iw[i] += 1; }
      }
    } else {
        
      // A very simplistic mechanism of adapting the ICWE
      // Loss = decrement the ICWE on the node 
      // that had a smaller ICWE
      // No loss for more than loss_thresh 
      //      consequtive connections =
      //      = increment the ICWE on the side 
      //      that had the smaller ICWE.
      if (loss) {
        node_noloss[i] = node_noloss[j] = 0; 
     
        if (node_iw[i] < node_iw[j]) {
          // node_iw[i] /= 2;
          // if(node_iw[j] > 2) { node_iw[j] -= 1; }
          if(node_iw[i] > 2) { node_iw[i] -= 1; }
        } else {
          // node_iw[j] /= 2;
          // if(node_iw[i] > 2) { node_iw[i] -= 1; }
          if(node_iw[j] > 2) { node_iw[j] -= 1; }
        }
      } else {
        node_noloss[i]++;
        node_noloss[j]++;

        // no loss, can try incrementing the smaller side
        if (node_iw[i] < node_iw[j]) {
          if(node_noloss[i] > loss_thresh) { node_iw[i] += 1; }
        } else {
          if(node_noloss[j] > loss_thresh) { node_iw[j] += 1; }
        }
      }
    } 
  }

  printf("   result headroom: %d, loss: %d, "
         "new iw %d = %d, new iw %d = %d\n", 
         headroom, loss, 
         i, node_iw[i],
         j, node_iw[j]);

  *rloss = loss;
  *rheadroom = headroom;
}


int main(int argc, char **argv) {
  int i = 0;
  int loss, headroom;
  int iws;
  int closs, cheadroom;
  int ciw;
  int count = 100;
  if(argc < 9) {
    printf("usage: %d <initial window> <adjust: 1/0>"
           " <IW increment success threshold> "
           "<lower access cap> <higher access cap> "
           "<lower bb cap> <higher bb cap> "
           "<one-side-only-decision>\n");
    exit(0);
  }
  fill_initial(atoi(argv[1]), atoi(argv[4]), 
               atoi(argv[5]), atoi(argv[6]), atoi(argv[7]));
  while(1) {
    iws = loss = headroom = 0;
    for(i=0;i<count;i++) {
      iterate(atoi(argv[8]), atoi(argv[2]), 
              atoi(argv[3]), &closs, &cheadroom, &ciw);
      loss += closs;
      headroom += cheadroom;
      iws += ciw;
    }
    printf("Results from the round: loss: %d, " 
           "headroom: %d, avg IW size: %f\n", 
           loss, headroom, ((float)iws)/((float)count));
    
  }
}

</artwork></figure>
</t>
<t>
</t></section>
</middle>
<back>
<references title="Normative References">
&rfc5681;
&rfc2119;
</references>
<references title="Informative References">
&rfc4782;
&rfc2140;
&I-D.ietf-tcpm-initcwnd;
&I-D.allman-tcpm-bump-initcwnd;
<reference anchor="URL.IW10-references-page" target="http://code.google.com/speed/protocols/tcpm-IW10.html">
  <front>
    <title>IW10 references page</title>
    <author/>
    <date/>
  </front>
</reference>
<reference anchor="URL.Adaptive-Start" target="http://paleale.eecs.berkeley.edu/~varaiya/papers_ps.dir/jeng.pdf">
  <front>
    <title>Adaptive Start</title>
    <author/>
    <date/>
  </front>
</reference>
</references>

</back>
</rfc>
PAFTECH AB 2003-2026
2026-04-23 14:19:40