<?xml version="1.0" encoding="US-ASCII"?>
<?xml-stylesheet type='text/xsl' href='http://xml.resource.org/authoring/rfc2629.xslt' ?>
<?rfc toc="yes" ?>
<!-- Default toc="no" No Table of Contents -->
<?rfc symrefs="yes" ?>
<!-- Default symrefs="no" Don't use anchors, but use numbers for refs -->
<?rfc sortrefs="yes" ?>
<!-- Default sortrefs="no" Don't sort references into order -->
<?rfc compact="yes" ?>
<!-- Default compact="no" Start sections on new pages -->
<?rfc strict="no" ?>
<!-- Default strict="no" Don't check I-D nits -->
<?rfc rfcedstyle="yes" ?>
<!-- Default rfcedstyle="yes" attempt to closely follow finer details from the latest observable RFC-Editor style -->
<?rfc linkmailto="yes" ?>
<!-- Default linkmailto="yes" generate mailto: URL, as appropriate -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<rfc category="info" docName="draft-briscoe-conex-data-centre-00"
     ipr="trust200902">
  <front>
    <title abbrev="Initial ConEx Deployment Examples">Network Performance
    Isolation in Data Centres using Congestion Exposure (ConEx)</title>

    <author fullname="Bob Briscoe" initials="B." surname="Briscoe">
      <organization>BT</organization>

      <address>
        <postal>
          <street>B54/77, Adastral Park</street>

          <street>Martlesham Heath</street>

          <city>Ipswich</city>

          <code>IP5 3RE</code>

          <country>UK</country>
        </postal>

        <phone>+44 1473 645196</phone>

        <email>bob.briscoe@bt.com</email>

        <uri>http://bobbriscoe.net/</uri>
      </address>
    </author>

    <author fullname="Murari Sridharan" initials="M." surname="Sridharan">
      <organization>Microsoft</organization>

      <address>
        <postal>
          <street>1 Microsoft Way</street>

          <city>Redmond</city>

          <region>WA</region>

          <code>98052</code>

          <country>USA</country>
        </postal>


        <email>muraris@microsoft.com</email>

      </address>
    </author>

    <date day="09" month="July" year="2012" />

    <area>Transport Area</area>

    <workgroup>ConEx</workgroup>

    <keyword>Internet-Draft</keyword>

    <abstract>
      <t>This document describes how a multi-tenant data centre operator can
      isolate tenants from network performance degradation due to each other's
      usage, but without losing the multiplexing benefits of a LAN-style
      network where anyone can use any amount of any resource. No per-tenant
      configuration and no implementation changes are required on network
      equipment. Instead, the solution is implemented with a simple change to
      the hypervisor (or container) on each physical server, beneath the
      tenant's virtual machines. These collectively enforce a very simple
      distributed contract - a single network allowance that each tenant can
      allocate among their virtual machines. The solution is simplest and most
      efficient using layer-3 switches that support explicit congestion
      notification (ECN) and if the sending operating system supports
      congestion exposure (ConEx). Nonetheless, an arrangement is described so
      that the operator can unilaterally deploy a complete solution while
      operating systems are being incrementally upgraded to support ConEx.</t>
    </abstract>
  </front>

  <middle>
    <!-- ====================================================================== -->

    <section anchor="pidc_intro" title="Introduction">
      <t>A number of companies offer hosting of virtual machines on their data
      centre infrastructure—so-called infrastructure as a service
      (IaaS). Set amounts of processing power, memory, storage and network
      are offered. Although processing power, memory and storage are
      relatively simple to allocate on the 'pay as you go' basis that has
      become common, the network is less easy to allocate given it is a
      naturally distributed system.</t>

      <t>This document describes how a data centre infrastructure provider can
      deploy congestion policing at every ingress to the data centre network,
      e.g. in all the hypervisors (or containers) in a data centre that
      provides virtualised 'cloud' computing facilities. These bulk congestion
      policers pick up congestion information in the data packets traversing
      the network, using one of two approaches: feedback tunnels or ConEx.
      Then, these policers at the ingress edge have sufficient information to
      limit the amount of congestion any tenant can cause anywhere in the data
      centre. This isolates the network performance experienced by each tenant
      from the behaviour of all the others, without any tenant-related
      configuration of any of the switches.</t>

      <t>The key to the solution is the use of congestion-bit-rate rather than
      bit-rate as the policing metric. <spanx style="emph">How</spanx> this
      works is very simple and quick to describe (<xref
      target="pidc_Outline_Design"></xref> outlines the design and <xref
      target="pidc_design"></xref> gives details).</t>

      <t>However, it is much more difficult to understand <spanx style="emph">why</spanx>
      this approach provides performance isolation. In particular, why it
      provides performance isolation across a network of links, even though
      there is apparently no isolation mechanism in each link. <xref
      target="pidc_Intuition"></xref> builds up an intuition for why the
      approach works, and why other approaches fall down in different ways.
      The explanation builds as follows:<list style="symbols">
          <t>Starting with the simple case of long-running flows focused on any
          one bottleneck link in the network, tenants get weighted shares of
          the link, much like weighted round robin, but with no mechanism in
          any of the links;</t>

          <t>In the more realistic case where flows are not all long-running
          but a mix of short to very long, it is explained that bit-rate is
          not a sufficient metric for isolating performance; how often a
          tenant is <spanx style="emph">not</spanx> sending is the significant
          factor for performance isolation, not whether bit-rate is shared
          equally whenever it is sending;</t>

          <t>Although it might seem that data volume would be a good measure
          of how often a tenant does not send, we then show that a tenant can
          send a large volume of data but hardly affect the performance of
          others — by being very responsive to congestion. Using
          congestion-volume (congestion-bit-rate over time) in a policer
          encourages large data senders to give other tenants much higher
          performance, whereas using straight volume as an allocation metric
          provides no isolation at all from tenants who send the same volume
          but are oblivious to its effect on others (the widespread behaviour
          today);</t>

          <t>We then show that a policer based on the congestion-bit-rate
          metric works across a network of links treating it as a pool of
          capacity, whereas other approaches treat each link independently,
          which is why the proposed approach requires none of the
          configuration complexity on switches that is involved in other
          approaches.</t>
        </list></t>

      <t>The solution would also be just as applicable to isolate the network
      performance of different departments within the data centre of an
      enterprise, which could be implemented without virtualisation. However,
      it will be described as a multi-tenant scenario, which is the more
      difficult case from a security point of view.</t>

      <t>{ToDo: Meshed, pref multipath resource pool, not unnecessarily
      constrained paths.}</t>
    </section>

    <section anchor="pidc_Design_Features" title="Design Features">
      <t>The following goals are met by the design, each of which is explained
      subsequently: <list style="symbols">
          <t>Performance isolation</t>

          <t>No loss of LAN-like openness and multiplexing benefits</t>

          <t>Zero tenant-related switch configuration</t>

          <t>No change to existing switch implementations</t>

          <t>Weighted performance differentiation</t>

          <t>Ultra-Simple contract—per-tenant network-wide allowance</t>

          <t>Sender constraint, but with transferable allowance</t>

          <t>Transport-agnostic</t>

          <t>Extensible to wide-area and inter-data-centre interconnection</t>
        </list></t>

      <t><list style="hanging">
          <t hangText="Performance Isolation with Openness of a LAN:">The
          primary goal is to ensure that each tenant of a data centre receives
          a minimum assured performance from the whole network resource pool,
          but without losing the efficiency savings from multiplexed use of
          shared infrastructure (work-conserving). There is no need for
          partitioning or reservation of network resources.</t>

          <t hangText="Zero Tenant-Related Switch Configuration:">Performance
          isolation is achieved with no per-tenant configuration of switches.
          All switch resources are potentially available to all tenants.
          <vspace blankLines="1" />Separately, <spanx style="emph">forwarding</spanx>
          isolation may (or may not) be configured to ensure one tenant cannot
          receive traffic from another's virtual network. However, <spanx
          style="emph">performance</spanx> isolation is kept completely
          orthogonal, and adds nothing to the configuration complexity of the
          network.</t>

          <t hangText="No New Switch Implementation:">Straightforward
          commodity switches (or routers) are sufficient. Bulk explicit
          congestion notification (ECN) is recommended, which is available in
          a large and growing range of layer-3 switches (a layer-3 switch does
          switching at layer-2, but it can use the Diffserv and ECN fields for
          traffic control if an IP header can be found). Once the network
          supports ECN, the performance isolation function is confined to the
          hypervisor (or container) and the operating systems on the
          hosts.</t>

          <t hangText="Weighted Performance Differentiation:">A tenant gets
          network performance in proportion to their allowance when
          constrained by others, with no constraint otherwise. Importantly,
          the assurance is not just instantaneous, but over time. And the
          assurance is not just localised to each link but network-wide. This
          will be explained with numerical examples later.</t>

          <t hangText="Ultra-Simple Contract:">The tenant needs to decide only
          two things: The peak bit-rate connecting each virtual machine to the
          network (as today) and an overall 'usage' allowance. This document
          focuses on the latter. A tenant just decides one number for her
          contracted allowance that can be shared over all her virtual
          machines (VMs). The 'usage' allowance is a measure of
          congestion-bit-rate, which will be explained later, but most tenants
          will just think of it as a number, where more is better. A tenant
          has no need to decide in advance which VMs will need more allowance
          and which less—an automated process allocates the allowance
          across the VMs, shifting more to those that need it most, as they
          use it. Therefore, performance cannot be constrained by poor choice
          of allocations between VMs, removing a whole dimension from the
          problem that tenants face when choosing their traffic contract. The
          allocation process can be operated by the tenant, or provided by the
          data centre operator as part of an additional platform as a service
          (PaaS) offer.</t>

          <t hangText="Sender Constraint with transferable allowance:">By
          default, constraints are always placed on data senders, determined
          by the sending party's traffic contract. Nonetheless, if the
          receiving party (or any other party) wishes to enhance performance
          it can arrange this with the sender at the expense of its own
          allowance. <vspace blankLines="1" />For instance, when a tenant's VM
          sends data to a storage facility the tenant that owns the VM
          consumes her allowance for enhanced sending performance. But by
          default when she later retrieves data from storage, the storage
          facility is the sender, so the storage facility consumes its
          allowance to determine performance in the reverse direction.
          Nonetheless, during the retrieval request, the storage facility can
          require that its sending 'costs' are covered by the receiving VM's
          allowance.</t>

          <t hangText="Transport-Agnostic:">In a well-provisioned network,
          enforcement of performance isolation rarely introduces constraints
          on network behaviour. However, it continually counts how much each
          tenant is limiting the performance of others, and it will intervene
          to enforce performance isolation, but only against those tenants
          who most persistently constrain others. This performance isolation
          is oblivious to flows and to the protocols and algorithms being used
          above the IP layer.</t>

          <t hangText="Interconnection:">The solution is designed so that
          interconnected networks can ensure each is accountable for the
          performance degradation it contributes to in other networks. If
          necessary, one network has the information to intervene at its
          ingress to limit traffic from another network that is degrading
          performance. Alternatively, with the proposed protocols, networks
          can see sufficient information in traffic arriving at their borders
          to give their neighbours financial incentives to limit the traffic
          themselves.<vspace blankLines="1" />The present document focuses on
          a single-provider scenario, but evolution to interconnection with
          other data centres over wide-area networks, and interconnection with
          access networks is briefly discussed in <xref
          target="pidc_evolution"></xref>.</t>
        </list></t>
    </section>

    <!-- ====================================================================== -->

    <section anchor="pidc_Outline_Design" title="Outline Design">
      <t>This section outlines the essential features of the design. Design
      details will be given in <xref target="pidc_design"></xref>.<list
          style="hanging">
          <t hangText="Edge policing:">Traffic policing is located at the
          policy enforcement point where each sending host connects to the
          network, typically beneath the tenant's operating system in the
          hypervisor controlled by the infrastructure operator. In this
          respect, the approach has a similar arrangement to the Diffserv
          architecture with traffic policers forming a ring around the network
          <xref target="RFC2475"></xref>.</t>

          <t hangText="Congestion policing:">However, unlike Diffserv, traffic
          policing limits congestion-bit-rate, not bit-rate. Congestion
          bit-rate is the product of congestion probability and bit-rate. For
          instance, if the instantaneous congestion probability (cf. loss
          probability) across a network path were 0.02% and a tenant's maximum
          contracted congestion-bit-rate was 600kb/s, then the policer would
          allow the tenant to send at a bit-rate of up to 3Gb/s (because 3Gb/s
          x 0.02% = 600kb/s). The detailed design section describes how
          congestion policers at the network ingress know the congestion that
          each packet will encounter in the network. {ToDo: rewrite this
          section to describe how a congestion policer works, not to focus
          just on units.}</t>

          <t hangText="Hose model:">The congestion policer controls all
          traffic from a particular sender without regard to destination,
          similar to the Diffserv 'hose' model. {ToDo: dual policer, and
          multiple hoses for long-term average.}</t>

          <t hangText="Flow policing unnecessary:">A congestion policer could
          be designed to focus policing on the particular data flow(s)
          contributing most to the excess congestion-bit-rate. However we will
          explain why bulk policing should be sufficient.</t>

          <t hangText="FIFO forwarding:">Each network queue only needs a
          first-in first-out discipline, with no need for any priority
          scheduling. If scheduling by traffic class is used (for whatever
          reason), congestion policing can be used to isolate tenants from
          each other within each class. {ToDo: Say this the other way
          round.}</t>

          <t hangText="ECN marking recommended:">All queues that might become
          congested should support bulk ECN marking, but packets that do not
          support ECN marking can be accommodated.</t>
        </list>In the proposed approach, the network operator deploys capacity
      as usual—using previous experience to determine a reasonable
      contention ratio at every tier of the network. Then, the tenant
      contracts with the operator for an allowance that determines the rate at
      which the congestion policer allows each tenant to contribute to
      congestion {ToDo: Dual policer}. <xref target="pidc_parameter"></xref>
      discusses how the operator would determine this allowance. Each VM's
      congestion policer limits its peak congestion-bit-rate as well as
      limiting the overall average per tenant.</t>
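      <t>To make the policing mechanism concrete, the sketch below models a
      bulk congestion policer as a simple token bucket. This is a
      hypothetical illustration, not a normative part of the design: the
      bucket fills at the tenant's contracted congestion-bit-rate and is
      drained by the size of each congestion-marked packet, so traffic that
      causes no congestion is never policed.</t>

      <figure>
        <artwork><![CDATA[
```python
# Illustrative sketch (names and parameters are assumptions, not from
# this draft): a bulk congestion policer as a token bucket. The bucket
# fills at the contracted congestion-bit-rate and drains by the size of
# every congestion-marked packet; when it empties, traffic is policed.

class CongestionPolicer:
    def __init__(self, allowance_bps, bucket_depth_bits):
        self.fill_rate = allowance_bps    # contracted congestion-bit-rate
        self.depth = bucket_depth_bits    # tolerance for congestion bursts
        self.level = bucket_depth_bits    # start with a full bucket
        self.last_t = 0.0

    def admit(self, t, size_bits, congestion_marked):
        # Refill at the contracted rate, capped at the bucket depth.
        self.level = min(self.depth,
                         self.level + (t - self.last_t) * self.fill_rate)
        self.last_t = t
        if congestion_marked:
            self.level -= size_bits       # congested bits consume allowance
        return self.level > 0             # False => police this packet
```
]]></artwork>
      </figure>

      <t>With the numbers above, a tenant sending 3Gb/s through 0.02%
      congestion produces 600kb/s of congestion-marked bits, which exactly
      matches a 600kb/s fill rate, so the bucket level stays stable and
      nothing is policed.</t>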
    </section>

    <!-- ====================================================================== -->

    <section anchor="pidc_Intuition" title="Performance Isolation: Intuition">
      <t>Network performance isolation traditionally meant that each user
      could be sure of a minimum guaranteed bit-rate. Such assurances are
      useful if traffic from each tenant follows relatively predictable paths
      and is fairly constant. If traffic demand is more dynamic and
      unpredictable (both over time and across paths), minimum bit-rate
      assurances can still be given, but they have to be very small relative
      to the available capacity.</t>

      <t>This either means the shared capacity has to be greatly overprovided
      so that the assured level is large enough, or the assured level has to
      be small. The former is unnecessarily expensive; the latter doesn't
      really give a sufficiently useful assurance.</t>

      <t>Another form of isolation is to guarantee that each user will get 1/N
      of the capacity of each link, where N is the number of active users at
      each link. This is fine if the number of active users (N) sharing a link
      is fairly predictable. However, if large numbers of tenants do not
      typically share any one link but at any time they all could (as in a
      data centre), a 1/N assurance is fairly worthless. Again, given N is
      typically small but could be very large, either the shared capacity has
      to be expensively overprovided, or the assured bit-rate has to be
      worthlessly small.</t>

      <t>Both these traditional forms of isolation try to give the tenant an
      assurance about instantaneous bit-rate by constraining the instantaneous
      bit-rate of everyone else. However, there are two mistakes in this
      approach. The amount of capacity left for a tenant to transfer data as
      quickly as possible depends on:<list style="numbers">
          <t>the load <spanx style="emph">over time</spanx> of everyone
          else</t>

          <t>how much everyone else yields to the increase in <spanx
          style="emph">congestion</spanx> when someone else tries to transfer
          data</t>
        </list></t>

      <t>This is why limiting congestion-bit-rate over time is the key to
      network performance isolation. It focuses policing only on those tenants
      who go fast over congested path(s) excessively and persistently over
      time. This keeps congestion below a design threshold everywhere so that
      everyone else can go fast.</t>

      <t>Congestion policing can and will enforce a congestion response if a
      particular tenant sends traffic that is completely unresponsive to
      congestion. However, the purpose of congestion policing is not to
      intervene in everyone's rate control all the time. Rather it is to
      encourage each tenant to avoid being policed — to keep the
      aggregate of all their flows' responses to congestion within an overall
      envelope. Nonetheless, the upper bound set by the congestion policer
      still ensures that each tenant's minimum performance is isolated from
      the combined effect of everyone else.</t>

      <t>It has not been easy to find a way to give the intuition on why
      congestion policing isolates performance, particularly across a network
      of links rather than just on a single link. The approach used in this
      section is to describe the system as if everyone were using the
      congestion response they would be forced to use if congestion policing
      had to intervene. We therefore call this the boundary model of
      congestion control. It is a very simple congestion response, so it is
      much easier to understand than if we introduced all the square-root
      terms and other complexity of New Reno TCP's response. And it means we
      don't have to try to describe a mix of responses.</t>

      <t>We cannot emphasise enough that the intention is not to make
      individual flows conform to this boundary response to congestion. Indeed
      the intention is to allow a diverse evolving mix of congestion
      responses, but constrained in total within a simple overall
      envelope.</t>

      <t>After describing and further justifying this simple boundary
      model of congestion control, we start by considering long-running flows
      sharing one link. Then we will consider on-off traffic, before widening
      the scope from one link to a network of links and to links of different
      sizes. Then we will depart from the initial simplified model of
      congestion control and consider diverse congestion control algorithms,
      including no end-system response at all.</t>

      <t>Formal analysis to back-up the intuition provided by this section
      will be made available in a more extensive companion technical report
      <xref target="conex-dc_tr"></xref>.</t>

      <section anchor="pidc_Initial_CC_Model"
               title="Simple Boundary Model of Congestion Control">
        <t>The boundary model of congestion control ensures a flow's bit-rate
        is inversely proportional to the congestion level that it detects. For
        instance, if congestion probability doubles, the flow's bit-rate
        halves. This is called a scalable congestion control because it
        maintains the same rate of congestion signals (marked or dropped
        packets) no matter how fast it goes. Examples are Relentless TCP and
        Scalable TCP [ToDo: add refs].</t>

        <t>New Reno-like TCP algorithms <xref target="RFC5681"></xref> have
        been widely replaced by alternatives closer to this scalable ideal
        (e.g. Cubic TCP, Compound TCP [ToDo: add refs]), because at high rates
        New Reno generated congestion signals too infrequently to track
        available capacity fast enough <xref target="RFC3649"></xref>. More
        recent TCP updates (e.g. data centre TCP) are becoming closer still to
        the scalable ideal.</t>

        <t>It is necessary to carefully distinguish congestion-bit-rate, which
        is an absolute measure of the rate of congested bits, from congestion
        probability, which is a relative measure of the proportion of
        congested bits to all bits. For instance, consider a scenario where a
        flow with scalable congestion control is alone in a 1Gb/s link, then
        another similar flow from another tenant joins it. Both will push up
        the congestion probability, which will push down their rates until
        together they fit into the link. Because the first flow's rate has to
        halve to accommodate the new flow, congestion probability will double
        (let's say from 0.002% to 0.004%), by our initial assumption of a
        scalable congestion control. When it is alone on the link, the
        congestion-bit-rate of the flow is 20kb/s (= 1Gb/s * 0.002%), and when
        it shares the link it is still 20kb/s (= 500Mb/s * 0.004%).</t>
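        <t>The arithmetic in this example can be checked directly. The helper
        below is purely illustrative:</t>

        <figure>
          <artwork><![CDATA[
```python
# Check that a scalable congestion control keeps the congestion-bit-rate
# (bit-rate x congestion probability) constant as capacity is shared.
import math

def congestion_bit_rate(bit_rate_bps, congestion_prob_pc):
    # congestion probability is given as a percentage
    return bit_rate_bps * congestion_prob_pc / 100

alone   = congestion_bit_rate(1e9,   0.002)  # 1Gb/s at 0.002% congestion
sharing = congestion_bit_rate(500e6, 0.004)  # 500Mb/s at 0.004% congestion
assert math.isclose(alone, 20e3) and math.isclose(sharing, 20e3)
```
]]></artwork>
        </figure>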

        <t>In summary, a congestion control can be considered scalable if the
        bit-rate of packets carrying congestion signals (the
        congestion-bit-rate) always stays the same no matter how much capacity
        it finds available. This ensures there will always be enough signals
        in a round trip time to keep the dynamics under control.</t>

        <t>Reminder: Making individual flows conform to this boundary or
        scalable response to congestion is a non-goal. Although we start this
        explanation with this specific simple end-system congestion response,
        this is just to aid intuition.</t>
      </section>

      <section anchor="pidc_long-running" title="Long-Running Flows">
        <t><xref target="pidc_Tab-long_flows"></xref> shows various scenarios
        where each of five tenants has contracted for 40kb/s of
        congestion-bit-rate in order to share a 1Gb/s link. In order to help
        intuition, we start with the (unlikely) scenario where all their flows
        are long-running. Long-running flows will try to use all the link
        capacity, so for simplicity we take utilisation as a round 100%.</t>

        <t>In the case we have just described (scenario A) neither tenant's
        policer is intervening at all, because both their congestion
        allowances are 40kb/s and each sends only one flow that contributes
        20kb/s of congestion — half the allowance.</t>

        <texttable anchor="pidc_Tab-long_flows"
                   title="Bit-rates that a congestion policer allocates to five tenants sharing a 1Gb/s link with various numbers (#) of long-running flows all using 'scalable congestion control'">
          <ttcol align="right">Tenant</ttcol>

          <ttcol>contracted congestion-bit-rate (kb/s)</ttcol>

          <ttcol>scenario A # : Mb/s</ttcol>

          <ttcol>scenario B # : Mb/s</ttcol>

          <ttcol>scenario C # : Mb/s</ttcol>

          <ttcol>scenario D # : Mb/s</ttcol>

          <c></c>

          <c></c>

          <c></c>

          <c></c>

          <c></c>

          <c></c>

          <c>(a)</c>

          <c>40</c>

          <c>1 : 500</c>

          <c>5 : 250</c>

          <c>5 : 200</c>

          <c>5 : 250</c>

          <c>(b)</c>

          <c>40</c>

          <c>1 : 500</c>

          <c>3 : 250</c>

          <c>3 : 200</c>

          <c>2 : 250</c>

          <c>(c)</c>

          <c>40</c>

          <c>- : ---</c>

          <c>3 : 250</c>

          <c>3 : 200</c>

          <c>2 : 250</c>

          <c>(d)</c>

          <c>40</c>

          <c>- : ---</c>

          <c>2 : 250</c>

          <c>2 : 200</c>

          <c>1 : 125</c>

          <c>(e)</c>

          <c>40</c>

          <c>- : ---</c>

          <c>- : ---</c>

          <c>2 : 200</c>

          <c>1 : 125</c>

          <c></c>

          <c>Congestion probability</c>

          <c>0.004%</c>

          <c>0.016%</c>

          <c>0.02%</c>

          <c>0.016%</c>
        </texttable>

        <t>Scenario B shows a case where four of the tenants all send 2 or
        more long-running flows. Recall that each flow always contributes
        20kb/s no matter how fast it goes. Therefore the policers of tenants
        (a-c) limit them to two flows-worth of congestion (2 x 20kb/s =
        40kb/s). Tenant (d) is only asking for 2 flows, so it gets them
        without being policed, and all four get the same quarter share of the
        link.</t>

        <t>Scenario C is similar, except the fifth tenant (e) joins in, so
        they all get equal 1/5 shares of the link.</t>

        <t>In Scenario D, only tenant (a) asks for more than two flows, so
        (a)'s policer limits it to two flows-worth of congestion, and everyone
        else gets the number of flows-worth that they ask for. This means that
        tenants (d) and (e) get less than everyone else, because they asked
        for less than they would have been allowed. (Similarly, in Scenarios A
        and B, some of the tenants are inactive, so they get zero, which is
        also less than they could have had if they had wanted.)</t>

        <t>With lots of long-running flows, as in scenarios B and C,
        congestion policing seems to emulate round robin scheduling,
        equalising the bit-rate of each tenant, no matter how many flows they
        run. By configuring different contracted allowances for each tenant,
        it can easily be seen that congestion policing could emulate weighted
        round robin (WRR), with the relative sizes of the allowances acting as
        the weights.</t>
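        <t>The WRR-like outcome can be expressed as a one-line calculation.
        The following is an illustrative model of the equilibrium, not part
        of the mechanism itself: when every active tenant runs enough
        long-running scalable flows to hit its policer, each tenant's
        bit-rate settles in proportion to its contracted allowance.</t>

        <figure>
          <artwork><![CDATA[
```python
# Illustrative model: long-running scalable flows policed at their
# allowances divide the link in proportion to those allowances,
# emulating weighted round robin with no mechanism in the queue.

def policed_shares(capacity_bps, allowances_bps):
    total = sum(allowances_bps.values())
    return {t: capacity_bps * a / total
            for t, a in allowances_bps.items()}

# Scenario B/C style: four equal 40kb/s allowances give equal quarter
# shares of a 1Gb/s link; unequal allowances would act as WRR weights.
shares = policed_shares(1e9, {'a': 40e3, 'b': 40e3, 'c': 40e3, 'd': 40e3})
```
]]></artwork>
        </figure>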

        <t>Scenario D departs from round-robin. This is deliberate, the idea
        being that tenants are free to take less than their share in the short
        term, which allows them to take more at other times, as we will see in
        <xref target="pidc_wcc"></xref>. In Scenario D, policing focuses only
        on the tenant (a) that is continually exceeding its contract. This
        policer focuses discard solely on tenant a's traffic so that it cannot
        cause any more congestion at the shared link (shown as 0.016% in the
        last row).</t>

        <t>To summarise so far, ingress congestion policers control
        congestion-bit-rate in order to indirectly assure a minimum bit-rate
        per tenant. With lots of long-running flows, the outcome is somewhat
        similar to WRR, but without the need for any mechanism in each
        queue.</t>
      </section>

      <section anchor="pidc_On-Off_Flows" title="On-Off Flows">
        <t>Aiming to behave like round-robin (or weighted round-robin) is only
        useful when all flows are infinitely long. For transfers of finite
        size, congestion policing isolates one tenant's performance from the
        behaviour of others in a way that WRR would not, as will now be
        explained.</t>

        <t><xref target="pidc_Fig_on-off"></xref> compares two example
        scenarios where tenant 'b' regularly sends small files in the top
        chart and the same size files but more often in the bottom chart (a
        higher 'on-off ratio'). This is the typical behaviour of a Web server
        when more clients request more files at peak time. Meanwhile, in this
        example, tenant c's behaviour doesn't change between the two
        scenarios — it sends a couple of large files, each starting at the
        same time in both cases.</t>

        <t>The capacity of the link that 'b' and 'c' share is shown as the
        full height of the plot. The files sent by 'b' are shown as little
        rectangles. 'b' can go at the full bit-rate of the link when 'c' is
        not sending, which is represented by the tall thin rectangles labelled
        'b' near the middle. We assume for simplicity that 'b' and 'c' divide
        up the bit-rate equally. So, when both 'b' and 'c' are sending, the
        'b' rectangles are half the height (bit-rate) and twice the duration
        relative to when 'b' sends alone. The area of a file to be transferred
        stays the same, whether tall and thin or short and fat, because the
        area represents the size of the file (bit-rate x duration = file
        size). The files from 'c' look like inverted castellations, because
        'c' uses half the link rate while each file from 'b' completes, then
        'c' can fill the link until 'b' starts the next file. The
        cross-hatched areas represent idle times when no-one is sending.</t>

        <t>For this simple scenario we ignore start-up dynamics and just focus
        on the rate and duration of flows that are long enough to stabilise,
        which is why they can be represented as simple rectangles. We will
        introduce the effect of flow startups later.</t>

        <t>In the bottom case, where 'b' sends more often, the gaps between
        b's transfers are smaller, so 'c' has less opportunity to use the
        whole line rate. This stretches out the time it takes for 'c' to
        complete its file transfers (recall a file always has the same area,
        which represents its size). Although 'c' finishes later, it still
        starts the next flow at the same time. In turn, this means 'c' is
        sending during a greater proportion of b's transfers, which extends
        b's average completion time too.</t>

        <?rfc needLines="23"?>

        <figure anchor="pidc_Fig_on-off"
                title="In the lower case, the on-off ratio of 'b' has increased, which extends all the completion times of 'c' and 'b'">
          <artwork><![CDATA[
 ^ bit-rate
 |
 |---------------------------------------,--.---------,--.----,-------
 |                                       |  |\/\/\/\/\|  |/\/\|       
 |                     c                 | b|/\/\/\/\/| b|\/\/|  c    
 |------.      ,-----.      ,-----.      |  |\/\/\/\/\|  |/\/\|    ,--
 |  b   |      |  b  |      |  b  |      |  |/\/\/\/\/|  |\/\/|    | b
 |      |      |     |      |     |      |  |\/\/\/\/\|  |/\/\|    |  
 +------'------'-----'------'-----'------'--'---------'--'----'----'-->
                                                                   time
 ^ bit-rate
 |
 |---------------------------------------------------.--,--.--,-------
 |                                                   |/\|  |\/|       
 |                         c                         |\/| b|/\| c     
 |------.  ,-----.  ,-----.  ,-----.  ,-----.  ,-----./\|  |\/|  ,----
 | b    |  | b   |  | b   |  | b   |  | b   |  | b   |\/|  |/\|  | b  
 |      |  |     |  |     |  |     |  |     |  |     |/\|  |\/|  |    
 +------'--'-----'--'-----'--'-----'--'-----'--'-----'--'--'--'--'---->
                                                                   time
]]></artwork>
        </figure>

        <t>Round-robin would do little if anything to isolate 'c' from the
        effect of 'b' sending files more often. Round-robin is designed to
        force 'b' and 'c' to share the capacity equally when they are both
        active. But in both scenarios they already share capacity equally when
        they are both active. The difference is in how often they are active.
        Round-robin and other traditional fair queuing techniques don't have
        any memory to sense that 'b' has been active more of the time.</t>

        <t>In contrast, a congestion policer can tell when one tenant is
        sending files more frequently, by measuring the rate at which the
        tenant is contributing to congestion. Our aim is to show that policers
        will be able to isolate performance properly by using the right metric
        (congestion bit-rate), rather than using the wrong metric (bit-rate),
        which doesn't sense whether the load over time is large or small.</t>

        <section anchor="pidc_on-off-numerical-no-policer"
                 title="Numerical Examples Without Policing">
          <t>The usefulness of the congestion bit-rate metric will now be
          illustrated with the numerical examples in <xref
          target="pidc_Tab_on-off"></xref>. The scenarios illustrate what the
          congestion bit-rate would be without any policing or scheduling
          action in the network. Then this metric can be monitored and limited
          by a policer, to prevent one tenant from harming the performance of
          others.</t>

          <t>The second and third columns (file-size and inter-arrival time)
          fully represent the behaviour of each tenant in each scenario. All
          the other columns merely characterise the outcome in various ways.
          The inter-arrival time (T) is the average time between starting one
          file and the next. For instance, tenant 'b' sends a 16Mb file every
          200ms on average. The formula in the heading of some columns shows
          how the column was derived from other columns.</t>

          <t>Scenario E is contrived so that the three tenants all offer the
          same load to the network, even though they send files of very
          different size (S). The files sent by tenant 'a' are 100 times
          smaller than those of tenant 'b', but 'a' sends them 100 times more
          often. In turn, b's files are 100 times smaller than c's, but 'b' in
          turn sends them 100 times more often. Graphicallyy, the scenario
          would look similar to <xref target="pidc_Fig_on-off"></xref>, except
          with three sizes of file, not just two. Scenarios E-G are designed
          to roughly represent various distributions of file sizes found in
          data centres, but still to be simple enough to facilitate intuition,
          even though each tenant would not normally send just one size
          file.</t>

          <t>The average completion time (t) and the maximum were calculated
          from a fairly simple analytical model (documented in a companion
          technical report <xref target="conex-dc_tr"></xref>). Using one data
          point as an example, it can be seen that a 1600Mb (200MB) file from
          tenant 'c' completes in 1905ms (about 1.9s). The files that are 100
          times smaller complete 100 times more quickly on average. In fact,
          in this scenario with equal loads, each tenant perceives that their
          files are being transferred at the same rate of 840Mb/s on average
          (file-size divided by completion time, as shown in the apparent
          bit-rate column). Thus all three tenants perceive they are getting
          84% of the 1Gb/s link on average (a benefit of multiplexing, and of
          utilisation being low, at 240Mb/s / 1Gb/s = 24% in this case).</t>

          <t>The completion times of the smaller files vary significantly,
          depending on whether a larger file transfer is proceeding at the
          same time. We have already seen this effect in <xref
          target="pidc_Fig_on-off"></xref>, where, when tenant b's files share
          with 'c', they take twice as long to complete as when they don't.
          This is why the maximum completion time is greater than the average
          for the small files, whereas there is imperceptible variance for the
          largest files.</t>

          <t>The final column shows how congestion bit-rate will be a useful
          metric to enforce performance isolation (the figures illustrate the
          situation before any enforcement mechanism is added). In the case of
          equal loads (scenario E), average congestion bit-rates are all
          equal. In scenarios F and G average congestion bit-rates are higher,
          because all tenants are placing much more load on the network over
          time, even though each still sends at equal rates to others when
          they are active together. <xref target="pidc_Fig_on-off"></xref>
          illustrated a similar effect in the difference between the top and
          bottom scenarios.</t>

          <t>The maximum instantaneous congestion bit-rate is nearly always
          20kb/s. That is because, by definition, all the tenants are using
          scalable congestion controls with a constant congestion rate of
          20kb/s. As we saw in <xref target="pidc_Initial_CC_Model"></xref>,
          the congestion rate of a particular scalable congestion control is
          always the same, no matter how many other flows it competes
          with.</t>

          <t>Once it is understood that the congestion bit-rate of one
          scalable flow is always 'w' whenever the flow is active, it becomes
          clear what the congestion bit-rate will be when averaged over time;
          it will simply be 'w' multiplied by the proportion of time that the
          tenant's file transfers are active. That is, w*t/T. For instance, in
          scenario E, on average tenant b's flows start 200ms apart, but they
          complete in 19ms. So they are active for 19/200 = 10% of the time
          (rounded). A tenant that causes a congestion bit-rate of 20kb/s for
          10% of the time will have an average congestion-bit-rate of 2kb/s,
          as shown.</t>

          <t>To summarise so far, no matter how many files transfer at the
          same time, each scalable flow still contributes to congestion at the
          same rate, but it contributes for more of the time, because each
          transfer stretches into the gap before the tenant's next flow
          starts.</t>
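          <t>As a sanity check on the w*t/T rule, the short Python sketch
          below (purely illustrative; the variable names are ours)
          reproduces tenant b's average congestion-bit-rate in scenario
          E:</t>

          <figure><artwork><![CDATA[
```python
# Average congestion-bit-rate of an on-off tenant: the w*t/T rule.
# Values are those of tenant 'b' in scenario E (names illustrative).
w = 20.0   # kb/s: congestion-bit-rate while a scalable flow is active
t = 19.0   # ms: average completion time of each file transfer
T = 200.0  # ms: average inter-arrival time between transfers

active_fraction = t / T                    # active ~10% of the time
avg_congestion_rate = w * active_fraction  # time-averaged, in kb/s
print(round(avg_congestion_rate, 1))       # 1.9, i.e. ~2kb/s as tabulated
```
]]></artwork></figure>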

          <?rfc needLines="30"?>

          <texttable anchor="pidc_Tab_on-off"
                     title="How the effect on others of various file-transfer behaviours can be measured by the resulting congestion-bit-rate">
            <ttcol align="right">Tenant</ttcol>

            <ttcol align="right">File size S (Mb)</ttcol>

            <ttcol align="right">Ave. inter-arrival T (ms)</ttcol>

            <ttcol align="right">Ave. load S/T (Mb/s)</ttcol>

            <ttcol align="right">Completion time ave : max t (ms)</ttcol>

            <ttcol align="center">Apparent bit-rate ave : min S/t
            (Mb/s)</ttcol>

            <ttcol align="center">Congestion bit-rate ave : max w*t/T
            (kb/s)</ttcol>

            <!-- . . -->

            <c></c>

            <c></c>

            <c></c>

            <c></c>

            <c>Scenario E</c>

            <c></c>

            <c></c>

            <!-- . . -->

            <c>a</c>

            <c>0.16</c>

            <c>2</c>

            <c>80</c>

            <c>0.19 : 0.48</c>

            <c>840 : 333</c>

            <c>  2 : 20</c>

            <!-- . . -->

            <c>b</c>

            <c>16</c>

            <c>200</c>

            <c>80</c>

            <c>19 :   35</c>

            <c>840 : 460</c>

            <c>  2 : 20</c>

            <!-- . . -->

            <c>c</c>

            <c>1600</c>

            <c>20000</c>

            <c>80</c>

            <c>1905 : 1905</c>

            <c>840 : 840</c>

            <c>  2 : 20</c>

            <!-- . . -->

            <c></c>

            <c></c>

            <c></c>

            <c>____</c>

            <c></c>

            <c></c>

            <c></c>

            <!-- . . -->

            <c></c>

            <c></c>

            <c></c>

            <c>240</c>

            <c></c>

            <c></c>

            <c></c>

            <!-- . . -->

            <c></c>

            <c></c>

            <c></c>

            <c></c>

            <c>Scenario F</c>

            <c></c>

            <c></c>

            <!-- . . -->

            <c>a</c>

            <c>0.16</c>

            <c>0.67</c>

            <c>240</c>

            <c>0.31 : 0.48</c>

            <c>516 : 333</c>

            <c>  9 : 20</c>

            <!-- . . -->

            <c>b</c>

            <c>16</c>

            <c>50</c>

            <c>320</c>

            <c>29 :   42</c>

            <c>557 : 380</c>

            <c> 11 : 20</c>

            <!-- . . -->

            <c>c</c>

            <c>1600</c>

            <c>10000</c>

            <c>160</c>

            <c>3636 : 3636</c>

            <c>440 : 440</c>

            <c>  7 : 20</c>

            <!-- . . -->

            <c></c>

            <c></c>

            <c></c>

            <c>____</c>

            <c></c>

            <c></c>

            <c></c>

            <!-- . . -->

            <c></c>

            <c></c>

            <c></c>

            <c>720</c>

            <c></c>

            <c></c>

            <c></c>

            <!-- . . -->

            <c></c>

            <c></c>

            <c></c>

            <c></c>

            <c>Scenario G</c>

            <c></c>

            <c></c>

            <!-- . . -->

            <c>a</c>

            <c>0.16</c>

            <c>0.67</c>

            <c>240</c>

            <c>0.33 : 0.64</c>

            <c>481 : 250</c>

            <c> 10 : 20</c>

            <!-- . . -->

            <c>b</c>

            <c>16</c>

            <c>40</c>

            <c>400</c>

            <c>32 :   46</c>

            <c>505 : 345</c>

            <c> 16 : 40</c>

            <!-- . . -->

            <c>c</c>

            <c>1600</c>

            <c>10000</c>

            <c>160</c>

            <c>4543 : 4543</c>

            <c>352 : 352</c>

            <c>  9 : 20</c>

            <!-- . . -->

            <c></c>

            <c></c>

            <c></c>

            <c>____</c>

            <c></c>

            <c></c>

            <c></c>

            <!-- . . -->

            <c></c>

            <c></c>

            <c></c>

            <c>800</c>

            <c></c>

            <c></c>

            <c></c>

            <postamble>Single link of capacity 1Gb/s. Each tenant uses a
            scalable congestion control which contributes a
            congestion-bit-rate for each flow of w = 20kb/s.</postamble>
          </texttable>

          <t>In scenario F, clients have increased the rate they request files
          from tenants a, b and c respectively by 3x, 4x and 2x relative to
          scenario E. The tenants send the same size files but 3x, 4x and 2x
          more often. For instance tenant 'b' is sending 16Mb files four times
          as often as before, and they now take longer as well -- nearly 29ms
          rather than 19ms -- because the other tenants are active more often
          too, so completion gets squeezed to later. Consequently, tenant 'b'
          is now sending 57% of the time, so its congestion-bit-rate is 20kb/s
          * 57% = 11kb/s. This is nearly 6x higher than in scenario E,
          reflecting both b's own increase by 4x and that this increase
          coincides with everyone else increasing their load.</t>

          <t>In scenario G, tenant 'b' increases its load even further, to 5x
          what it offered in scenario E: it sends the same size files but 5x
          more often. This results in average utilisation of 800Mb/s / 1Gb/s
          = 80%, compared to 72% in scenario F and only 24% in scenario
          E.</t>

          <t>Completion times rise for everyone due to the overall rise in
          load, but the congestion rates of 'a' and 'c' don't rise anything
          like as much as that of 'b', because they still leave large gaps
          between files. For instance, tenant 'c' completes each large file
          transfer in 4.5s (compared to 1.9s in scenario E), but it still only
          sends files every 10s. So 'c' only sends 45% of the time, which is
          reflected in its congestion bit-rate of 20kb/s * 45% = 9kb/s.</t>

          <t>In contrast, on average tenant 'b' can only complete each
          medium-sized file transfer in 32ms (compared to 19ms in scenario E),
          but on average it starts sending another file after 40ms. So 'b'
          sends 79% of the time, which is reflected in its congestion bit-rate
          of 20kb/s * 79% = 16kb/s (rounded).</t>

          <t>However, during the 45% of the time that 'c' sends a large file,
          b's completion time is higher than average (as shown in <xref
          target="pidc_Fig_on-off"></xref>). In fact, as shown in the maximum
          completion time column, 'b' completes in 46ms, but it starts sending
          a new file after 40ms, which is before the previous one has
          completed. Therefore, during each of c's large files, 'b' sends
          46/40 = 115% of the time on average.</t>

          <t>This actually means 'b' is overlapping two files for 15% of the
          time on average and sending one file for the remaining 85%. Whenever
          two file transfers overlap, 'b' will be causing 2 x 20kb/s = 40kb/s
          of congestion, which explains why tenant b in scenario G is the only
          case with a maximum congestion rate of 40kb/s rather than 20kb/s as
          in every other case. Over the duration of c's large files, 'b' would
          therefore cause congestion at an average rate of 20kb/s * 85% +
          40kb/s * 15% = 23kb/s (or more simply 20kb/s * 115% = 23kb/s). Of
          course, when 'c' is not sending a large file, 'b' will contribute
          less to congestion, which is why its average congestion rate is
          16kb/s overall, as discussed earlier.</t>
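          <t>The overlap arithmetic above can be checked with the rounded
          table values (46ms maximum completion time, 40ms inter-arrival
          time); rounding differences aside, the illustrative sketch below
          arrives at the same 23kb/s average:</t>

          <figure><artwork><![CDATA[
```python
# Scenario G, tenant 'b' while 'c' sends a large file (rounded values).
w = 20.0      # kb/s of congestion per active flow
t_max = 46.0  # ms: b's completion time during c's large files
T = 40.0      # ms: b's inter-arrival time

busy = t_max / T        # 1.15: 'b' is sending 115% of the time
overlap = busy - 1.0    # fraction of time two transfers overlap
rate = w * (1 - overlap) + 2 * w * overlap  # same as w * t_max / T
print(rate)             # 23.0 kb/s
```
]]></artwork></figure>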

          <!--The reason the maximum averaged congestion-bit-rate is higher with small files is simply because of the way it is defined. The congestion rate is averaged over the time between one file and the next, then the maximum is taken for each tenant. For short inter-arrival times, this maximum will occur only in the worst case when all tenants are transferring files, while for longer inter-arrival times, there will be more of a mix of rates of congestion during the time over which it is averaged.-->
        </section>

        <section title="Congestion Policing of On-Off Flows">
          <t>Still referring to the numerical examples in <xref
          target="pidc_Tab_on-off"></xref>, we will now discuss the effect of
          limiting each tenant with a congestion policer.</t>

          <t>The network operator might have deployed congestion policers to
          cap each tenant's average congestion rate to 16kb/s. None of the
          tenants are exceeding this limit in any of the scenarios, but tenant
          'b' is just shy of it in scenario G. Therefore all the tenants would
          be free to behave in all sorts of ways like those of scenarios E-G,
          but they would be prevented from degrading the performance of the
          other tenants beyond the point reached by tenant 'b' in scenario G.
          If tenant 'b' added more load, the policer would prevent the extra
          load entering the network by focusing drop solely on tenant 'b',
          preventing the other tenants from experiencing any more congestion
          due to tenant 'b'. Then tenants 'a' and 'c' would be assured the
          (average) apparent bit-rates shown, whatever the behaviour of
          'b'.</t>

          <t>If 'a' added more load, 'c' would not suffer. Instead 'b' would
          go over limit and its rate would be trimmed during congestion peaks,
          sacrificing some of its lead to 'a'. Similarly, if 'c' added more
          load, 'b' would be made to sacrifice some of its performance, so
          that 'a' would not suffer. Further, if more tenants arrived to share
          the same link, the policer would force 'b' to sacrifice performance
          in favour of the additional tenants.</t>

          <t>There is nothing special about a policer limit of 16kb/s. The
          example when discussing infinite flows used a limit of 40kb/s per
          tenant. And some tenants can be given higher limits than others
          (e.g. at an additional charge). If the operator gives out congestion
          limits that together add up to a higher amount without increasing
          the link capacity, it merely allows the tenants to apply more load
          (e.g. more files of the same size in the same time), but each at a
          lower bit-rate.</t>
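          <t>For concreteness, such a policer could be realised as a token
          bucket that fills at the tenant's contracted congestion-bit-rate
          and drains by the volume of congestion-marked traffic. The Python
          below is only our illustrative sketch, not a mechanism specified by
          any ConEx document; the class and parameter names are invented for
          illustration:</t>

          <figure><artwork><![CDATA[
```python
# Illustrative token-bucket congestion policer for one tenant (a sketch
# under our own assumptions, not a normative ConEx mechanism). The bucket
# fills at the tenant's contracted congestion-bit-rate and drains by the
# bits of each congestion-marked packet; an empty bucket triggers policing.

class CongestionPolicer:
    def __init__(self, fill_rate_bps, depth_bits):
        self.fill_rate = fill_rate_bps  # e.g. 16e3 b/s for a 16kb/s cap
        self.depth = depth_bits         # burst tolerance
        self.tokens = depth_bits
        self.last = 0.0                 # time of last marked packet (s)

    def on_congestion_marked(self, now, pkt_bits):
        """Account one congestion-marked packet; False means police."""
        self.tokens = min(self.depth,
                          self.tokens + (now - self.last) * self.fill_rate)
        self.last = now
        self.tokens -= pkt_bits
        return self.tokens >= 0

# A tenant causing 12kb/s of congestion stays within a 16kb/s allowance:
p = CongestionPolicer(16e3, 16e3)
print(p.on_congestion_marked(1.0, 12000))  # True
# whereas one causing 20kb/s of congestion soon exhausts it:
q = CongestionPolicer(16e3, 16e3)
print(q.on_congestion_marked(1.0, 20000))  # False
```
]]></artwork></figure>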

          <t>{ToDo: Discuss min bit-rates}</t>

          <t>{ToDo: discuss instantaneous limits and how they protect the
          minimum bit-rate of other tenants}</t>
        </section>
      </section>

      <section anchor="pidc_wcc" title="Weighted Congestion Controls">
        <t>At high speed, congestion controls such as Cubic TCP, Data Centre
        TCP, Compound TCP etc all contribute to congestion at widely differing
        rates, which is called their 'aggressiveness' or 'weight'. So far, we
        have made the simplifying assumption of a scalable congestion control
        algorithm that contributes to congestion at a constant rate of w =
        20kb/s. We now assume tenant 'c' uses a similar congestion control to
        before, but with different parameters in the algorithm so that its
        weight is still constant, but at w = 2.2kb/s.</t>

        <t>Tenant 'b' still uses w = 20kb/s for its smaller files, so when the
        two compete for the 1Gb/s link, they will share it in proportion to
        their weights, 20:2.2 (or 90%:10%). That is, 'b' and 'c' will
        respectively get (20/22.2)*1Gb/s = 900Mb/s and (2.2/22.2)*1Gb/s =
        100Mb/s of the 1Gb/s link. <xref target="pidc_Fig_weighted"></xref>
        shows the situation before (upper) and after (lower) this change.</t>

        <t>When the two compete, 'b' transfers each file 9/5 as fast as
        before (900Mb/s rather than 500Mb/s), so it completes each one in 5/9
        of the time. 'b' still contributes congestion at the same rate of
        20kb/s, but for only 5/9 as long as before. Therefore, relative to
        before, 'b' uses up only 5/9 as much of its allowance.</t>

        <t>Tenant 'c' contributes congestion at 2.2/22.2 of its previous
        rate, that is 2kb/s rather than 20kb/s. Although tenant 'b' goes
        faster, each of its files finishes and gets out of the way sooner, so
        'c' can catch up to where it would have been anyway after each 'b'
        file, and should complete hardly any later than before. Tenant 'c'
        will probably lose some completion time because it has to accelerate
        and decelerate more. But, whenever it is sending a file, 'c' saves
        (20kb/s - 2kb/s) = 18kb of allowance every second, which it can use
        for other transfers.</t>

        <figure anchor="pidc_Fig_weighted"
                title="Weighted congestion controls with equal weights (upper) and unequal (lower)">
          <artwork><![CDATA[
 ^ bit-rate
 |
 |---------------------------------------------------.--,--.--,-------
 |                                                   |/\|  |\/|       
 |                         c                         |\/| b|/\| c     
 |------.  ,-----.  ,-----.  ,-----.  ,-----.  ,-----./\|  |\/|  ,----
 | b    |  | b   |  | b   |  | b   |  | b   |  | b   |\/|  |/\|  | b  
 |      |  |     |  |     |  |     |  |     |  |     |/\|  |\/|  |    
 +------'--'-----'--'-----'--'-----'--'-----'--'-----'--'--'--'--'---->
                                                                   time

 ^ bit-rate
 |
 |---------------------------------------------------.--,--.--,-------
 |---.     ,---.    ,---.    ,---.    ,---.    ,---. |/\|  |\/|  ,---.
 |   |     |   |    |   |  c |   |    |   |    |   | |\/| b|/\| c|   |
 |   |     |   |    |   |    |   |    |   |    |   | |/\|  |\/|  |   |
 | b |     | b |    | b |    | b |    | b |    | b | |\/|  |/\|  | b |
 |   |     |   |    |   |    |   |    |   |    |   | |/\|  |\/|  |   |
 +---'-----'---'----'---'----'---'----'---'----'---'-'--'--'--'--'---'>
                                                                   time
]]></artwork>
        </figure>
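        <t>The 90%:10% split above follows from simple proportional
        arithmetic, sketched below (variable names are ours, for illustration
        only):</t>

        <figure><artwork><![CDATA[
```python
# Bit-rate shares implied by weighted congestion controls: when flows
# compete, each gets capacity in proportion to its congestion weight.
C = 1000.0             # Mb/s: the shared link
w_b, w_c = 20.0, 2.2   # kb/s: congestion weights of 'b' and 'c'

rate_b = C * w_b / (w_b + w_c)  # ~900 Mb/s for 'b'
rate_c = C * w_c / (w_b + w_c)  # ~100 Mb/s for 'c'
print(round(rate_b), round(rate_c))  # 901 99
```
]]></artwork></figure>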

        <t>It seems too good to be true that both tenants gain so much and
        lose so little by 'c' reducing its aggressiveness. The gains are
        unlikely to be as perfect as this simple model predicts, but we
        believe they will be nearly as substantial.</t>

        <t>It might seem that everyone can keep gaining by agreeing to reduce
        their weights, ad infinitum. However, the lower the weight, the fewer
        congestion signals the control receives, so it starts to lose control
        during dynamics. Nonetheless, congestion policing should encourage
        congestion control designs to keep reducing their weights, although
        they will have to stop once they reach the minimum congestion
        necessary to maintain sufficient control signals.</t>
      </section>

      <section anchor="pidc_Network_of_Links" title="A Network of Links">
        <t>So far we have only considered a single link. Congestion policing
        at the network edge is designed to work across a network of links,
        treating them all as a pool of resources, as we shall now explain. We
        will use the dual-homed topology shown in <xref
        target="pidc_Fig_dual-homed"></xref> (stretching the bounds of ASCII
        art) as a very simple example of a pool of resources.</t>

        <t>In this case there are 48 servers (H1, H2, ... Hn where n=48) on
        the left, with on average 8 virtual machines (VMs) running on each
        (e.g. server n is running Vn1, Vn2, ... to Vnm where m = 8). Each
        server is connected by two 1Gb/s links, one to each top-of-rack
        switch, S1 and S2. To the right of the switches, there are 6 links of
        10Gb/s each, connecting onwards to customer networks or to the rest
        of the data centre. There is a total of 48 *2 *1Gb/s = 96Gb/s
        capacity between the 48 servers and the 2 switches, but there is only
        6 *10Gb/s = 60Gb/s to the right of the switches. Nonetheless, data
        centres are often designed with some level of contention like this,
        because at the ToR switches a proportion of the traffic from certain
        hosts turns round locally towards other hosts in the same rack.</t>

        <figure align="center" anchor="pidc_Fig_dual-homed"
                title="Dual-Homed Topology -- a Simple Resource Pool">
          <artwork><![CDATA[
    virtual      hosts        switches
    machines
  
V11 V12     V1m                        __/
*   * ...   *   H1 ,-.__________+--+__/
 \___\__   __\____/`-'       __-|S1|____,--
                      `. _ ,' ,'|  |_______
        .       H2 ,-._,`.  ,'  +--+
        .        . `-'._  `.
        .        .      `,' `.
                 .     ,' `-. `.+--+_______
Vn1 Vn2     Vnm       /      `-_|S2|____
*   * ...   *   Hn ,-.__________|  |__  `--
 \___\__   __\____/`-'          +--+  \__
                                         \
]]></artwork>
        </figure>

        <t>The congestion policer proposed in this document is based on the
        'hose' model, where a tenant's congestion allowance can be used for
        sending data over any path, including many paths at once. Therefore,
        any one of the virtual machines on the left can use its allowance to
        contribute to congestion on any or all of the 6 links on the right (or
        any other link in the diagram actually, including those from the
        server to the switches and those turning back to other hosts).</t>

        <t>Nonetheless, if congestion policers are to enforce performance
        isolation, they should stop one tenant squeezing the capacity
        available to another tenant who needs to use a particular bottleneck
        link or links. They should work whether the offending tenant is acting
        deliberately or merely carelessly.</t>

        <t>The only way a tenant can become squeezed is if another tenant uses
        more of the bottleneck capacity, which can only happen if the other
        tenant sends more flows (or more aggressive flows) over that link. In
        the following we will call the tenant that is shifting flows 'active',
        and the ones already on a link 'passive'. These terms have been chosen
        so as not to imply one is bad and the other good -- just
        different.</t>

        <t>The active tenant will increase flow completion times for all
        tenants (passive and active) using that bottleneck. Such an active
        tenant might shift flows from other paths to focus them onto one,
        which would not of itself use up any more congestion allowance (recall
        that a scalable congestion control uses up its congestion allowance at
        the same rate per flow whatever bit-rate it is going at <xref
        target="pidc_On-Off_Flows"></xref> and therefore whatever path it is
        using). However, although the instantaneous rate at which the active
        tenant uses up its allowance won't alter, the increased completion
        times due to increased congestion will use up more of the active
        tenant's allowance over time (same rate but for more of the time). If
        the passive tenants are using up part of their allowances on other
        links, the increase in congestion will use up a relatively smaller
        proportion of their allowances. Once such an increase exceeds the
        active tenant's congestion allowance, the congestion policer will
        protect the passive tenants from further performance degradation.</t>

        <t>A policer may not even have to directly intervene for tenants to be
        protected; load balancing may remove the problem first. Load balancing
        might either be provided by the network (usually just random), or some
        of the 'passive' tenants might themselves actively shift traffic off
        the increasingly congested bottleneck and onto other paths. Some of
        them might be using the multipath TCP protocol (MPTCP -- see
        experimental <xref target="RFC6356"></xref>) that would achieve this
        automatically, or ultimately they might shift their virtual machine to
        a different endpoint to circumvent the congestion hot-spot completely.
        Even if one passive tenant were not using MPTCP or could not shift
        easily, others shifting away would achieve the same outcome.
        Essentially, the deterrent effect of congestion policers encourages
        everyone to even out congestion, shifting load away from hot spots.
        Then performance isolation becomes an emergent property of everyone's
        behaviour, due to the deterrent effect of policers, rather than always
        through explicit policer intervention.</t>

        <t>{ToDo: Add numerical example}</t>

        <t>In contrast, enforcement mechanisms based on scheduling algorithms
        like WRR or WFQ have to be deployed at each link, and each one works
        in isolation from the others. Therefore, no scheduler knows how much
        of the other links a tenant is using. This is fine for networks with
        a single known bottleneck per customer (e.g. many access networks).
        However, in data centres there are many potential bottlenecks and
        each tenant generally only uses a share of a small number of them. A
        mechanism like WRR would not isolate anyone's performance if it gave
        every tenant the right to use the same share of all the links in the
        network, without regard to how many they were using.</t>

        <t>The correct approach, as proposed here, is to give a tenant a share
        of the whole pool, not the same share of each link.</t>
      </section>

      <section anchor="pidc_Links_Diff_Sizes" title="Links of Different Sizes">
        <t>Congestion policing treats a Mb/s of capacity in one link as
        identical to a Mb/s of capacity in another link, even if the size of
        each link is different. For instance, consider the case where one of
        the three links to the right of each switch in <xref
        target="pidc_Fig_dual-homed"></xref> were upgraded to 40Gb/s while the
        other two remained at 10Gb/s (perhaps to accommodate the extra traffic
        from a couple of the dual homed 1Gb/s servers being upgraded to
        dual-homed 10Gb/s).</t>

        <t>Two congestion control algorithms running at the same rate will
        cause the same level of congestion probability, whatever size link
        they are sharing. <list style="symbols">
            <t>If 50 equal flows share a 10Gb/s link (10Gb/s / 50 = 200Mb/s
            each) they will cause 0.01% congestion probability;</t>

            <t>If 200 equal flows share a 40Gb/s link (40Gb/s / 200 = 200Mb/s
            each) they will still cause 0.01% congestion probability;</t>
          </list>This is because the congestion probability is determined by
        the congestion control algorithms, not by the link.</t>

        <t>Therefore, if an average of 300 flows were spread across the above
        links (1x 40Gb/s and 2 x 10Gb/s), the numbers on each link would tend
        towards respectively 200:50:50, so that each flow would get 200Mb/s
        and each link would have 0.01% congestion on it. Sometimes, there
        might be more flows on the bigger link, resulting in less than 200Mb/s
        per flow and congestion higher than 0.01%. However, whenever the
        congestion level was less on one link than another, congestion
        policing would encourage flows to balance out the congestion level
        across the links (as long as some flows could use congestion balancing
        mechanisms like MPTCP).</t>
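        <t>The arithmetic above can be sketched in a few lines. The following
        illustrative Python fragment (the constant per-flow
        congestion-bit-rate and all names are assumptions for illustration,
        not from this document) shows why each link ends up with the same
        per-flow rate and the same congestion level:</t>

        <figure><artwork>
```python
# Sketch of the arithmetic above, assuming a scalable congestion
# control in which each flow's congestion-bit-rate v = p * r stays
# constant (v = 20 kb/s here, i.e. 200 Mb/s at 0.01% congestion
# probability). Link sizes and flow counts are those in the text.

def per_flow_rate(capacity_bps, n_flows):
    """Equal share of a link's capacity among its flows."""
    return capacity_bps / n_flows

def congestion_probability(flow_rate_bps, v_bps):
    """Scalable control: p = v / r."""
    return v_bps / flow_rate_bps

V = 20e3  # assumed constant per-flow congestion-bit-rate, bits/s

links = [(40e9, 200), (10e9, 50), (10e9, 50)]  # (capacity, flows)
rates = [per_flow_rate(c, n) for c, n in links]
probs = [congestion_probability(r, V) for r in rates]
# Every link ends up with 200 Mb/s per flow and 0.01% congestion.
```
        </artwork></figure>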

        <t>In summary, all the outcomes of congestion policing described so
        far (emulating WRR etc) apply across a pool of diverse link sizes just
        as much as they apply to single links.</t>
      </section>

      <section anchor="Diverse_Algorithms"
               title="Diverse Congestion Control Algorithms">
        <t>Throughout this explanation we have assumed a scalable congestion
        control algorithm, which we justified in <xref
        target="pidc_Initial_CC_Model"></xref> as the 'boundary' case if
        congestion policing had to intervene, which is all that is relevant
        when considering whether the policer can enforce performance
        isolation.</t>

        <t>This performance isolation approach still works, whether or not the
        congestion controls in daily use by tenants fit this scalable model. A
        bulk congestion policer constrains the sum of all the congestion
        controls being used by a tenant so that they collectively remain below
        a large-scale envelope that is itself shaped like the sum of many
        scalable algorithms. Bulk congestion policers will constrain the
        overall congestion effect (the sum) of any mix of algorithms within
        this envelope, including flows that are completely unresponsive to
        congestion.
        This is explained around Fig 3 of <xref target="CongPol"></xref>.</t>

        <t>{ToDo, summarise the relevant part of that paper here and perhaps
        even add ASCII art for the plot...}</t>

        <t>{ToDo, bring in discussion of slow-start as effectively another
        variant of congestion control, with considerable overshoots, etc.}</t>

        <t>The defining difference between the scalable congestion we have
        assumed and the congestion controls in widespread production operating
        systems (New Reno, Compound, Cubic, Data Centre TCP etc) is the way
        congestion probability decreases as flow-rate increases (for a
        long-running flow). With a scalable congestion control, if flow-rate
        doubles, congestion probability halves. Whereas, with most production
        congestion controls, if flow-rate doubles, congestion probability
        reduces to less than half. For instance, New Reno TCP reduces
        congestion to a quarter. The responses of Cubic and Compound are
        closer to the ideal scalable control than to New Reno, but they
        deliberately do not depart too far from New Reno, to ensure they can
        co-exist happily with it.</t>
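        <t>This contrast can be checked numerically using the usual
        steady-state approximations (a sketch only; the square-root law for
        New Reno is the standard model, and the constant is arbitrary):</t>

        <figure><artwork>
```python
# Numerical check of the responses described above. Steady-state
# models (illustrative, constants arbitrary; only ratios matter):
#   scalable control: r = k / p       (p halves when r doubles)
#   New Reno:         r = k / sqrt(p) (p quarters when r doubles)

def p_scalable(rate, k=1.0):
    return k / rate

def p_newreno(rate, k=1.0):
    # r = k / sqrt(p)  =>  p = (k / r) ** 2
    return (k / rate) ** 2

ratio_scalable = p_scalable(2.0) / p_scalable(1.0)  # 0.5
ratio_newreno = p_newreno(2.0) / p_newreno(1.0)     # 0.25
```
        </artwork></figure>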

        <!--This means that all the production controls sit on the safe side of the scalable model we have assumed. 
That is, when a congestion policer constrains congestion, production transports will constrain their 
bit-rate more than if they used a scalable algorithm.-->
      </section>
    </section>

    <!-- ====================================================================== -->

    <section anchor="pidc_design" title="Design">
      <t>The design involves the following elements, all involving changes
      solely in the hypervisor or operating systems, not network switches:
      <list style="hanging">
          <t hangText="Congestion Information at Ingress:">This information
          needs to be trusted by the operator of the data centre
          infrastructure, therefore it cannot just use the feedback in the
          end-to-end transport (e.g. TCP SACK or ECN echo congestion
          experienced flags) that might anyway be encrypted. Trusted
          congestion feedback may be implemented in either of the following
          two ways: <list style="letters">
              <t>either as a shim in both sending and receiving hypervisors
              using an edge-to-edge (host-host) tunnel, with feedback messages
              reporting congestion back to the sending host's hypervisor (in
              addition to the e2e feedback at the transport layer).</t>

              <t>or in the sending operating system using the congestion
              exposure protocol (ConEx <xref
              target="ConEx-Abstract-Mech"></xref>);</t>
            </list>Approach a) could be applied solely to traffic from
          operating systems that do not yet support the simpler approach
          b).<vspace blankLines="1" />The host-host feedback tunnel (approach
          a) is easier to implement if a tunnelling overlay is already in use
          in the data centre. For instance, we believe it would be possible to
          build the necessary feedback facilities using the proposed network
          virtualisation approach based on generic routing encapsulation (GRE)
          <xref target="nvgre"></xref>. The tunnel egress would also need to
          be able to detect congestion. This would be simple for e2e flows
          with ECN enabled, because this will lead to ECN also being enabled
          in the outer IP header <xref target="RFC6040"></xref>. However, for
          non-ECN enabled flows, it is more problematic. It might be possible
          to add sequence numbers to the outer headers, as is done in many
          pseudowire technologies. However, a simpler alternative is possible
          in a data centre where the switches can be ECN-enabled. It would
          then be possible to enable ECN in the outer headers, even if the e2e
          transport is not ECN-capable (Not-ECT in the inner header). At the
          egress, if the outer header is marked 'congestion experienced' but
          the inner is Not-ECT, the packet would have to be dropped, since
          drop is the only congestion signal the e2e transport would
          understand. But before dropping it, the ECN marking in the outer
          header would have served its purpose as a congestion signal to the
          tunnel egress. Beyond
          this, implementation details of approach a) are still work in
          progress. <vspace blankLines="1" />If the ConEx option is used
          (approach b), a congestion audit function will also be required as a
          shim in the hypervisor (or container) layer where data leaves the
          network and enters the receiving host. The ConEx option is only
          applicable if the guest OS at the sender has been modified to send
          ConEx markings. For IPv6 this protocol is defined in <xref
          target="conex-destopt"></xref>. The ConEx markings could be encoded
          in the IPv4 header by hiding them within the packet ID field as
          proposed in <xref target="intarea-ipv4-id-reuse"></xref>.</t>
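          <t>The tunnel-egress behaviour just described might be sketched as
          follows (an illustrative fragment, not from any specification;
          only the ECN codepoint values follow RFC 3168):</t>

          <figure><artwork>
```python
# Illustrative sketch of the tunnel-egress behaviour for approach a):
# an outer-header ECN mark is counted as congestion feedback, and if
# the inner (e2e) header is Not-ECT the packet is then dropped, since
# drop is the only congestion signal that transport understands.
# Function and variable names are assumptions for illustration.

NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11  # RFC 3168 codepoints

def egress(outer_ecn, inner_ecn, size_bytes, congested_bytes):
    """Return (deliver, updated congestion-volume count)."""
    if outer_ecn == CE:
        congested_bytes += size_bytes  # report to sending hypervisor
        if inner_ecn == NOT_ECT:
            # e2e transport is not ECN-capable: drop is its only signal
            return False, congested_bytes
    return True, congested_bytes
```
          </artwork></figure>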

          <t hangText="Congestion Policing:">A bulk congestion policing
          function would be associated with each tenant's virtual machine to
          police all the traffic it sends into the network. It would most
          likely be implemented as a shim in the hypervisor. It would be
          expected that various policer designs might be developed, but here
          we propose a simple but effective one in order to be concrete. A
          token bucket is filled with tokens at a constant rate that
          represents the tenant's congestion allowance. The bucket is drained
          by the size of every packet with a congestion marking, as described
          in <xref target="CongPol"></xref>. If approach a) were used to get
          "Congestion Information at Ingress", the bucket would be
          drained by congestion feedback from the tunnel egress. If approach b)
          were used, the bucket would be drained by ConEx markings on the
          actual data packets being forwarded (ConEx re-inserts the e2e
          feedback from the transport receiver back onto packets on the
          forward data path).<vspace blankLines="0" />{ToDo: Add details of
          congestion burst limiting}<vspace blankLines="1" />While the data
          centre network operator only needs to police congestion in bulk,
          tenants may wish to enforce their own limits on individual users or
          applications, as sub-limits of their overall allowance. Given all
          the information used for policing is readily available to tenants in
          the transport layer below their sender, any such per-flow, per-user
          or per-application limitations can be readily applied. The tenant
          may operate their own fine-grained policing software, or such
          detailed control capabilities may be offered as part of the platform
          (platform as a service or PaaS) above the more general
          infrastructure as a service (IaaS).</t>
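          <t>The simple token-bucket policer proposed above might be
          sketched as follows (illustrative only; all names are assumptions,
          and the sanction applied when the bucket empties is deliberately
          left abstract):</t>

          <figure><artwork>
```python
# A minimal sketch of the bulk congestion policer described above: a
# token bucket filled at a constant rate representing the tenant's
# congestion allowance, drained by the size of every congestion-marked
# packet. What the policer does when the bucket runs dry (drop, delay,
# throttle) is left abstract here.

class CongestionPolicer:
    def __init__(self, allowance_bps, depth_bits):
        self.allowance = allowance_bps  # fill rate, congestion-bits/s
        self.depth = depth_bits         # bucket depth, bits
        self.tokens = depth_bits
        self.last = 0.0

    def on_packet(self, now, size_bits, congestion_marked):
        # Top up at the allowance rate, capped at the bucket depth.
        self.tokens = min(self.depth,
                          self.tokens + (now - self.last) * self.allowance)
        self.last = now
        if congestion_marked:
            self.tokens -= size_bits    # drain by congestion-volume
        return self.tokens > 0          # False: policer should intervene
```
          </artwork></figure>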

          <t hangText="Distributed Token Buckets:">A customer may run virtual
          machines on multiple physical nodes, in which case the data centre
          operator would ensure that it deployed a policer in the hypervisor
          on each node where the customer was running a VM, at the time each
          VM was instantiated. The DC operator would arrange for them to
          collectively enforce the per-customer congestion allowance, as a
          distributed policer.<vspace blankLines="1" />A function to
          distribute a customer's tokens to the policer associated with each
          of the customer's VMs would be needed. This could be similar to the
          distributed rate limiting of <xref target="DRL"></xref>.
          Alternatively, a logically centralised bucket of congestion tokens
          could be used with simple 1-1 communication between it and each
          local token bucket in the hypervisor under each VM. <vspace
          blankLines="1" />Importantly, traditional bit-rate tokens cannot
          simply be reassigned from one VM to another without implications
          for the balance of network loading (requiring operator intervention
          each time), whereas congestion tokens can be freely reassigned
          between different VMs, because a congestion token is equivalent at
          any place or time in a network.<vspace blankLines="1" />As well as
          distribution of tokens between the VMs of a tenant, it would
          similarly be feasible to allow transfer of tokens between tenants,
          also without breaking the performance isolation properties of the
          system. Secure token transfer mechanisms could be built above the
          underlying policing design described here. Therefore the details of
          token transfer need not concern us here, and can be deferred to
          future work.</t>
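          <t>The simpler, centralised alternative described above might be
          sketched as follows (the equal-split policy and all names are
          assumptions for illustration):</t>

          <figure><artwork>
```python
# Illustrative sketch of a logically centralised bucket of congestion
# tokens per tenant, with 1-1 transfers to the local token bucket
# under each of the tenant's VMs. Because a congestion token is
# equivalent anywhere in the network, any split policy is safe without
# operator intervention; equal shares are shown here.

def distribute(central_tokens, local_buckets):
    """Move all tokens from the central pool to the local buckets."""
    share = central_tokens / len(local_buckets)
    for bucket in local_buckets:
        bucket['tokens'] += share
    return 0.0  # central pool now empty

tenant_vms = [{'tokens': 0.0}, {'tokens': 0.0}, {'tokens': 0.0}]
remaining = distribute(50e3, tenant_vms)
```
          </artwork></figure>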

          <t hangText="Switch/Router Support:">Network switches/routers would
          not need any modification. However, both congestion detection by the
          tunnel (approach a) and ConEx audit (approach b) would be easier if
          switches supported ECN. <vspace blankLines="1" />Data centre TCP
          (DCTCP) might be used as well, although it is not essential. DCTCP
          is designed for data centres; it requires ECN, modified sender and
          receiver TCP algorithms, and a more aggressive active queue
          management (AQM) algorithm in the L3 switches: a step threshold for
          ECN marking at a very shallow queue length.</t>
        </list></t>
    </section>

    <section anchor="pidc_parameter" title="Parameter Setting">
      <t>{ToDo: }</t>

      <!-- <t>ToDo: stitch together the following snippets:</t>

      <t>Then the operator determines the loss probability that results when a
      flow with a modern congestion control (e.g. Cubic or DCTCP) runs
      continuously at a certain bit-rate (that is, a certain window with a
      typical round trip time and packet size). For instance, a single 200Mb/s
      flow drives loss probability up to 0.1% (with 1500B packets and 6ms
      RTT).</t>

      <t></t>

      <t>sets a target operating level of congestion for all paths through the
      network (e.g. 0.01% loss probability). The operator limits every
      tenant's contribution to congestion anywhere in the pool of capacity
      (e.g. one tenant may be limited to 50kb of congestion-volume per second,
      which means 50kb/s of discarded packets). Then, as long as congestion
      remains below the target level of 0.01%, this tenant will always be able
      to send at 500Mb/s, because 500Mb/s of traffic @ 0.01% loss probability
      = 50kb/s of lost bits.</t>

      <t>As long as path loss probability remains at 0.01%, if the tenant sent
      any more than 500Mb/s of data it would contribute more than the limit of
      50kb/s of loss. The operator monitors whether the bit-rate of loss
      exceeds this limit, and if it does, which the operator prevents the
      tenant from exceeding, by limiting the bit-rate of data that the tenant
      can send if the bit-rate of losses exceeds the agreed limit of
      50kb/s.</t>

      <t>The operator also deploys sufficient capacity so that if everyone is
      running at their congestion limit, all their bit-rates will still be .
      As long as congestion is below this level everywhere, every data source
      will be able to introduce more traffic</t>
      -->
    </section>

    <section anchor="pidc_deployment" title="Incremental Deployment">
      <section anchor="pidc_migration" title="Migration">
        <t>A pre-requisite for ingress congestion policing is the function
        entitled "Congestion Information at Ingress" in <xref
        target="pidc_design"></xref>. Tunnel feedback (approach a) is a more
        processing-intensive change to the hypervisors, but it can be deployed
        unilaterally by the data centre operator in all hypervisors (or
        containers), without requiring support in guest operating systems.</t>

        <t>Using ConEx markings (approach b) is only applicable if a
        particular guest OS supports marking its outgoing packets. But,
        where available, this approach is simpler and more
        efficient.</t>

        <t>Both functions could be implemented in each hypervisor, and a
        simple filter could be installed to allow ConEx packets through into
        the data centre network (approach b) without going through the
        feedback tunnel shim, while non-ConEx packets would need to be
        tunnelled and to elicit tunnel feedback (approach a). This would
        provide an incremental deployment scenario with the best of both
        worlds: it would work for unmodified guest OSs, but for guest OSs with
        ConEx support, it would require less processing (therefore being
        faster) and not require the considerable overhead of a duplicate
        feedback channel between hypervisors (sending and forwarding a large
        proportion of tiny packets).</t>
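        <t>The hypervisor filter described above might be sketched as follows
        (names and the classification predicate are illustrative
        assumptions): packets from ConEx-capable guests bypass the feedback
        tunnel, while all other packets are tunnelled so the egress can
        return congestion feedback.</t>

        <figure><artwork>
```python
# Sketch of the incremental-deployment filter: ConEx-marked packets
# pass straight through (ConEx audit happens at the receiver), while
# unmarked packets take the edge-to-edge feedback tunnel.

def classify(packet):
    if packet.get('conex_marked'):
        return 'passthrough'  # guest OS supports ConEx markings
    return 'tunnel'           # unmodified guest OS: tunnel feedback

flows = [{'conex_marked': True}, {'conex_marked': False}]
decisions = [classify(p) for p in flows]
```
        </artwork></figure>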

        <t>{ToDo: Note that the main reason for preferring ConEx information
        will be because it is designed to represent a conservative expectation
        of congestion, whereas tunnel feedback represents congestion only
        after it has happened.}</t>
      </section>

      <section anchor="pidc_evolution" title="Evolution">
        <t>Initially, the approach would be confined to intra-data centre
        traffic. With the addition of ECN support on network equipment in the
        WAN between data centres, it could straightforwardly be extended to
        inter-data centre scenarios, including across interconnected backbone
        networks.</t>

        <t>Having proved the approach within and between data centres and
        across interconnect, more mass-market devices might be expected to
        turn on support for ECN feedback, and ECN might be enabled in
        equipment in wider networks most likely to be bottlenecks (access and
        backhaul). </t>
      </section>
    </section>

    <section anchor="pidc_alternates" title="Related Approaches">
      <t>The Related Work section of <xref target="CongPol"></xref> provides a
      useful comparison of the approach proposed here against other attempts
      to solve similar problems.</t>

      <t>When the hose model is used with Diffserv, capacity has to be
      considerably over-provisioned for all the unfortunate cases when
      multiple sources of traffic happen to coincide even though they are all
      in-contract at their respective ingress policers. Even so, every node
      within a Diffserv network also has to be configured to limit higher
      traffic classes to a maximum rate in case of really unusual traffic
      distributions that would starve lower priority classes. Therefore, for
      really important performance assurances, Diffserv is used in the 'pipe'
      model where the policer constrains traffic separately for each
      destination, and sufficient capacity is provided at each network node
      for the sum of all the peak contracted rates for paths crossing that
      node.</t>

      <t>In contrast, the congestion policing approach is designed to give
      full performance assurances across a meshed network (the hose model),
      without having to divide a network up into pipes. If an unexpected
      distribution of traffic from all sources focuses on a congestion
      hotspot, it will increase the congestion-bit-rate seen by the policers
      of all sources contributing to the hot-spot. The congestion policers
      then focus on these sources, which in turn limits the severity of the
      hot-spot. </t>

      <t>The critical improvement over Diffserv is that the ingress edges
      receive information about any congestion occurring in the middle, so they
      can limit how much congestion occurs, wherever it happens to occur.
      Previously Diffserv edge policers had to limit traffic generally in case
      it caused congestion, because they never knew whether it would
      (open-loop control).</t>

      <t>Congestion policing mechanisms could be used to assure the
      performance of one data flow (the 'pipe' model), but this would involve
      unnecessary complexity, given the approach works well for the 'hose'
      model.</t>

      <t>Therefore, congestion policing allows capacity to be provisioned for
      the average case, not for the near-worst case when many unlikely cases
      coincide. It assures performance for all traffic using just one traffic
      class, whereas Diffserv only assures performance for a small proportion
      of traffic by partitioning it off into higher priority classes and
      over-provisioning relative to the traffic contracts sold for this
      class.</t>

      <t>{ToDo: Refer to <xref target="pidc_Intuition"></xref> for comparison
      with WRR and WFQ}</t>

      <t>Seawall {ToDo} <xref target="Seawall"></xref></t>
    </section>

    <!-- ====================================================================== -->

    <section anchor="pidc_security" title="Security Considerations"></section>

    <!-- ====================================================================== -->

    <section anchor="pidc_IANA" title="IANA Considerations">
      <t>This document does not require actions by IANA.</t>
    </section>

    <!-- ====================================================================== -->

    <section anchor="pidc_conclusions" title="Conclusions">
      <t>{ToDo}</t>

      <t></t>
    </section>

    <!-- ====================================================================== -->

    <section title="Acknowledgments">
      <t></t>
    </section>

    <!-- ====================================================================== -->
  </middle>

  <back>
    <references title="Informative References">
      <?rfc include='reference.RFC.2475'?>

      <?rfc include='reference.RFC.3649'?>

      <?rfc include='reference.RFC.5681'?>

      <?rfc include='reference.RFC.6040'?>

      <?rfc include='reference.RFC.6356'?>

      <reference anchor="ConEx-Abstract-Mech">
        <front>
          <title>Congestion Exposure (ConEx) Concepts and Abstract
          Mechanism</title>

          <author fullname="Matt Mathis" initials="M" surname="Mathis">
            <organization>Google</organization>
          </author>

          <author fullname="Bob Briscoe" initials="B" surname="Briscoe">
            <organization>BT</organization>
          </author>

          <date day="31" month="October" year="2011" />
        </front>

        <seriesInfo name="Internet-Draft"
                    value="draft-ietf-conex-abstract-mech-03" />

        <format target="http://www.ietf.org/internet-drafts/draft-ietf-conex-abstract-mech-03.txt"
                type="TXT" />
      </reference>

      <reference anchor="conex-destopt">
        <front>
          <title>IPv6 Destination Option for ConEx</title>

          <author fullname="Suresh Krishnan" initials="S" surname="Krishnan">
            <organization></organization>
          </author>

          <author fullname="Mirja Kuehlewind" initials="M"
                  surname="Kuehlewind">
            <organization></organization>
          </author>

          <author fullname="Carlos Ucendo" initials="C" surname="Ucendo">
            <organization></organization>
          </author>

          <date day="30" month="October" year="2011" />

          <abstract>
            <t>Conex is a mechanism by which senders inform the network about
            the congestion encountered by packets earlier in the same flow.
            This document specifies an IPv6 destination option that is capable
            of carrying conex markings in IPv6 datagrams.</t>
          </abstract>
        </front>

        <seriesInfo name="Internet-Draft" value="draft-ietf-conex-destopt-01" />

        <format target="http://www.ietf.org/internet-drafts/draft-ietf-conex-destopt-01.txt"
                type="TXT" />
      </reference>

      <reference anchor="intarea-ipv4-id-reuse">
        <front>
          <title>Reusing the IPv4 Identification Field in Atomic
          Packets</title>

          <author fullname="Bob Briscoe" initials="B" surname="Briscoe">
            <organization></organization>
          </author>

          <date day="12" month="March" year="2012" />

          <abstract>
            <t>This specification takes a new approach to extensibility that
            is both principled and a hack. It builds on recent moves to
            formalise the increasingly common practice where fragmentation in
            IPv4 more closely matches that of IPv6. The large majority of IPv4
            packets are now 'atomic', meaning indivisible. In such packets,
            the 16 bits of the IPv4 Identification (IPv4 ID) field are
            redundant and could be freed up for the Internet community to put
            to other uses, at least within the constraints imposed by their
            original use for reassembly. This specification defines the
            process for redefining the semantics of these bits. It uses the
            previously reserved control flag in the IPv4 header to indicate
            that these 16 bits have new semantics. Great care is taken
            throughout to ease incremental deployment, even in the presence of
            middleboxes that incorrectly discard or normalise packets that
            have the reserved control flag set.</t>
          </abstract>
        </front>

        <seriesInfo name="Internet-Draft"
                    value="draft-briscoe-intarea-ipv4-id-reuse-01" />

        <format target="http://www.ietf.org/internet-drafts/draft-briscoe-intarea-ipv4-id-reuse-01.txt"
                type="TXT" />
      </reference>

      <reference anchor="CongPol"
                 target="http://bobbriscoe.net/projects/refb/#polfree">
        <front>
          <title>Policing Freedom to Use the Internet Resource Pool</title>

          <author fullname="Arnaud Jacquet" initials="A" surname="Jacquet">
            <organization>BT</organization>
          </author>

          <author fullname="Bob Briscoe" initials="B" surname="Briscoe">
            <organization>BT &amp; UCL</organization>
          </author>

          <author fullname="Toby Moncaster" initials="T" surname="Moncaster">
            <organization>BT</organization>
          </author>

          <date month="December" year="2008" />
        </front>

        <seriesInfo name="Proc ACM Workshop on Re-Architecting the Internet (ReArch'08)"
                    value="" />

        <format target="http://www.bobbriscoe.net/projects/2020comms/refb/policer_rearch08.pdf"
                type="PDF" />
      </reference>

      <reference anchor="Seawall"
                 target="http://research.microsoft.com/en-us/projects/seawall/">
        <front>
          <title>Seawall: Performance Isolation in Cloud Datacenter
          Networks</title>

          <author fullname="Alan Shieh" initials="A" surname="Shieh">
            <organization>Microsoft and Cornell Uni</organization>
          </author>

          <author fullname="Srikanth Kandula" initials="S" surname="Kandula">
            <organization>Microsoft</organization>
          </author>

          <author fullname="Albert Greenberg" initials="A" surname="Greenberg">
            <organization>Microsoft</organization>
          </author>

          <author fullname="Changhoon Kim" initials="C" surname="Kim">
            <organization>Microsoft</organization>
          </author>

          <date month="June" year="2010" />
        </front>

        <seriesInfo name="Proc 2nd USENIX Workshop on Hot Topics in Cloud Computing"
                    value="" />

        <format target="http://www.usenix.org/event/nsdi11/tech/full_papers/Shieh.pdf"
                type="PDF" />
      </reference>

      <reference anchor="DRL"
                 target="http://doi.acm.org/10.1145/1282427.1282419">
        <front>
          <title>Cloud control with distributed rate limiting</title>

          <author fullname="Barath Raghavan" initials="B" surname="Raghavan">
            <organization></organization>
          </author>

          <author fullname="Kashi Vishwanath" initials="K"
                  surname="Vishwanath">
            <organization></organization>
          </author>

          <author fullname="Sriram Ramabhadran" initials="S"
                  surname="Ramabhadran">
            <organization></organization>
          </author>

          <author fullname="Kenneth Yocum" initials="K" surname="Yocum">
            <organization></organization>
          </author>

          <author fullname="Alex Snoeren" initials="A" surname="Snoeren">
            <organization></organization>
          </author>

          <date year="2007" />
        </front>

        <seriesInfo name="ACM SIGCOMM CCR" value="37(4):337--348" />

        <format target="http://doi.acm.org/10.1145/1282427.1282419" type="PDF" />
      </reference>

      <reference anchor="conex-dc_tr">
        <front>
          <title>Network Performance Isolation in Data Centres by Congestion
          Exposure to Edge Policers</title>

          <author fullname="Bob Briscoe" initials="B" surname="Briscoe">
            <organization>BT</organization>
          </author>

          <date day="03" month="November" year="2011" />
        </front>

        <seriesInfo name="BT Technical Report" value="TR-DES8-2011-004" />

        <annotation>Work in progress</annotation>
      </reference>

      <reference anchor="nvgre">
        <front>
          <title>NVGRE: Network Virtualization using Generic Routing
          Encapsulation</title>

          <author fullname="Murari Sridharan" initials="M" surname="Sridharan">
            <organization></organization>
          </author>

          <author fullname="Albert Greenberg" initials="A" surname="Greenberg">
            <organization></organization>
          </author>

          <author fullname="Narasimhan Venkataramaiah" initials="N"
                  surname="Venkataramaiah">
            <organization></organization>
          </author>

          <author fullname="Yu-Shun Wang" initials="Y" surname="Wang">
            <organization></organization>
          </author>

          <author fullname="Kenneth Duda" initials="K" surname="Duda">
            <organization></organization>
          </author>

          <author fullname="Ilango Ganga" initials="I" surname="Ganga">
            <organization></organization>
          </author>

          <author fullname="Geng Lin" initials="G" surname="Lin">
            <organization></organization>
          </author>

          <author fullname="Mark Pearson" initials="M" surname="Pearson">
            <organization></organization>
          </author>

          <author fullname="Patricia Thaler" initials="P" surname="Thaler">
            <organization></organization>
          </author>

          <author fullname="Chait Tumuluri" initials="C" surname="Tumuluri">
            <organization></organization>
          </author>

          <date day="8" month="July" year="2012" />

          <abstract>
            <t>This document describes the usage of Generic Routing
            Encapsulation (GRE) header for Network Virtualization, called
            NVGRE, in multi- tenant datacenters. Network Virtualization
            decouples virtual networks and addresses from physical network
            infrastructure, providing isolation and concurrency between
            multiple virtual networks on the same physical network
            infrastructure. This document also introduces a Network
            Virtualization framework to illustrate the use cases, but the
            focus is on specifying the data plane aspect of NVGRE.</t>
          </abstract>
        </front>

        <seriesInfo name="Internet-Draft"
                    value="draft-sridharan-virtualization-nvgre-01" />

        <format target="http://www.ietf.org/internet-drafts/draft-sridharan-virtualization-nvgre-01.txt"
                type="TXT" />
      </reference>
    </references>

    <section title="Summary of Changes between Drafts">
      <t>Detailed changes are available from
      http://tools.ietf.org/html/draft-briscoe-conex-data-centre</t>

      <t><list style="hanging">
          <t
          hangText="From draft-briscoe-conex-initial-deploy-02 to draft-briscoe-conex-data-centre-00:"><list
              style="symbols">
              <t>Split off data-centre scenario as a separate document, by
              popular request.</t>
            </list></t>
        </list></t>
    </section>
  </back>
</rfc>
