<?xml version="1.0" encoding="US-ASCII"?>
<?xml-stylesheet type='text/xsl' href='http://xml.resource.org/authoring/rfc2629.xslt' ?>
<?rfc toc="yes" ?>
<!-- Default toc="no" No Table of Contents -->
<?rfc symrefs="yes" ?>
<!-- Default symrefs="no" Don't use anchors, but use numbers for refs -->
<?rfc sortrefs="yes" ?>
<!-- Default sortrefs="no" Don't sort references into order -->
<?rfc compact="yes" ?>
<!-- Default compact="no" Start sections on new pages -->
<?rfc strict="no" ?>
<!-- Default strict="no" Don't check I-D nits -->
<?rfc rfcedstyle="yes" ?>
<!-- Default rfcedstyle="yes" attempt to closely follow finer details from the latest observable RFC-Editor style -->
<?rfc linkmailto="yes" ?>
<!-- Default linkmailto="yes" generate mailto: URL, as appropriate -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<rfc category="info" docName="draft-briscoe-conex-data-centre-00"
ipr="trust200902">
<front>
<title abbrev="Initial ConEx Deployment Examples">Network Performance
Isolation in Data Centres using Congestion Exposure (ConEx)</title>
<author fullname="Bob Briscoe" initials="B." surname="Briscoe">
<organization>BT</organization>
<address>
<postal>
<street>B54/77, Adastral Park</street>
<street>Martlesham Heath</street>
<city>Ipswich</city>
<code>IP5 3RE</code>
<country>UK</country>
</postal>
<phone>+44 1473 645196</phone>
<email>bob.briscoe@bt.com</email>
<uri>http://bobbriscoe.net/</uri>
</address>
</author>
<author fullname="Murari Sridharan" initials="M." surname="Sridharan">
<organization>Microsoft</organization>
<address>
<postal>
<street>1 Microsoft Way</street>
<city>Redmond</city>
<region>WA</region>
<code>98052</code>
<country></country>
</postal>
<phone></phone>
<facsimile></facsimile>
<email>muraris@microsoft.com</email>
<uri></uri>
</address>
</author>
<date day="09" month="July" year="2012" />
<area>Transport Area</area>
<workgroup>ConEx</workgroup>
<keyword>Internet-Draft</keyword>
<abstract>
<t>This document describes how a multi-tenant data centre operator can
isolate tenants from network performance degradation due to each other's
usage, but without losing the multiplexing benefits of a LAN-style
network where anyone can use any amount of any resource. Zero per-tenant
configuration and no implementation change are required on network
equipment. Instead the solution is implemented with a simple change to
the hypervisor (or container) on each physical server, beneath the
tenant's virtual machines. These collectively enforce a very simple
distributed contract - a single network allowance that each tenant can
allocate among their virtual machines. The solution is simplest and most
efficient if the layer-3 switches support explicit congestion
notification (ECN) and if the sending operating system supports
congestion exposure (ConEx). Nonetheless, an arrangement is described so
that the operator can unilaterally deploy a complete solution while
operating systems are being incrementally upgraded to support ConEx.</t>
</abstract>
</front>
<middle>
<!-- ====================================================================== -->
<section anchor="pidc_intro" title="Introduction">
<t>A number of companies offer hosting of virtual machines on their data
centre infrastructure—so-called infrastructure as a service
(IaaS). A set amount of processing power, memory, storage and network
are offered. Although processing power, memory and storage are
relatively simple to allocate on the 'pay as you go' basis that has
become common, the network is less easy to allocate given it is a
naturally distributed system.</t>
<t>This document describes how a data centre infrastructure provider can
deploy congestion policing at every ingress to the data centre network,
e.g. in all the hypervisors (or containers) in a data centre that
provides virtualised 'cloud' computing facilities. These bulk congestion
policers pick up congestion information in the data packets traversing
the network, using one of two approaches: feedback tunnels or ConEx.
Then, these policers at the ingress edge have sufficient information to
limit the amount of congestion any tenant can cause anywhere in the data
centre. This isolates the network performance experienced by each tenant
from the behaviour of all the others, without any tenant-related
configuration of any of the switches.</t>
<t>The key to the solution is the use of congestion-bit-rate rather than
bit-rate as the policing metric. <spanx style="emph">How </spanx>this
works is very simple and quick to describe (<xref
target="pidc_Outline_Design"></xref> outlines the design and <xref
target="pidc_design"></xref> gives the details).</t>
<t>However, it is much more difficult to understand <spanx style="emph">why</spanx>
this approach provides performance isolation; in particular, why it
provides performance isolation across a network of links, even though
there is apparently no isolation mechanism in each link. <xref
target="pidc_Intuition"></xref> builds up an intuition for why the
approach works, and why other approaches fall down in different ways.
The explanation builds as follows:<list style="symbols">
<t>Starting with the simple case of long-running flows focused on any
one bottleneck link in the network, tenants get weighted shares of
the link, much like weighted round robin, but with no mechanism in
any of the links;</t>
<t>In the more realistic case where flows are not all long-running
but a mix of short to very long, it is explained that bit-rate is
not a sufficient metric for isolating performance; how often a
tenant is <spanx style="emph">not</spanx> sending is the significant
factor for performance isolation, not whether bit-rate is shared
equally whenever it is sending;</t>
<t>Although it might seem that data volume would be a good measure
of how often a tenant does not send, we then show that a tenant can
send a large volume of data but hardly affect the performance of
others — by being very responsive to congestion. Using
congestion-volume (congestion-bit-rate over time) in a policer
encourages large data senders to give other tenants much higher
performance, whereas using straight volume as an allocation metric
provides no isolation at all from tenants who send the same volume
but are oblivious to its effect on others (the widespread behaviour
today);</t>
<t>We then show that a policer based on the congestion-bit-rate
metric works across a network of links treating it as a pool of
capacity, whereas other approaches treat each link independently,
which is why the proposed approach requires none of the
configuration complexity on switches that is involved in other
approaches.</t>
</list></t>
<t>The solution would also be just as applicable to isolate the network
performance of different departments within the data centre of an
enterprise, which could be implemented without virtualisation. However,
it will be described as a multi-tenant scenario, which is the more
difficult case from a security point of view.</t>
<t>{ToDo: Meshed, pref multipath resource pool, not unnecessarily
constrained paths.}</t>
</section>
<section anchor="pidc_Design_Features" title="Design Features">
<t>The following goals are met by the design, each of which is explained
subsequently: <list style="symbols">
<t>Performance isolation</t>
<t>No loss of LAN-like openness and multiplexing benefits</t>
<t>Zero tenant-related switch configuration</t>
<t>No change to existing switch implementations</t>
<t>Weighted performance differentiation</t>
<t>Ultra-Simple contract—per-tenant network-wide allowance</t>
<t>Sender constraint, but with transferable allowance</t>
<t>Transport-agnostic</t>
<t>Extensible to wide-area and inter-data-centre interconnection</t>
</list></t>
<t><list style="hanging">
<t hangText="Performance Isolation with Openness of a LAN:">The
primary goal is to ensure that each tenant of a data centre receives
a minimum assured performance from the whole network resource pool,
but without losing the efficiency savings from multiplexed use of
shared infrastructure (work-conserving). There is no need for
partitioning or reservation of network resources.</t>
<t hangText="Zero Tenant-Related Switch Configuration:">Performance
isolation is achieved with no per-tenant configuration of switches.
All switch resources are potentially available to all tenants.
<vspace blankLines="1" />Separately, <spanx style="emph">forwarding</spanx>
isolation may (or may not) be configured to ensure one tenant cannot
receive traffic from another's virtual network. However, <spanx
style="emph">performance</spanx> isolation is kept completely
orthogonal, and adds nothing to the configuration complexity of the
network.</t>
<t hangText="No New Switch Implementation:">Straightforward
commodity switches (or routers) are sufficient. Bulk explicit
congestion notification (ECN) is recommended, which is available in
a large and growing range of layer-3 switches (a layer-3 switch does
switching at layer-2, but it can use the Diffserv and ECN fields for
traffic control if an IP header can be found). Once the network
supports ECN, the performance isolation function is confined to the
hypervisor (or container) and the operating systems on the
hosts.</t>
<t hangText="Weighted Performance Differentiation:">A tenant gets
network performance in proportion to their allowance when
constrained by others, with no constraint otherwise. Importantly,
the assurance is not just instantaneous, but over time. And the
assurance is not just localised to each link but network-wide. This
will be explained with numerical examples later.</t>
<t hangText="Ultra-Simple Contract:">The tenant needs to decide only
two things: The peak bit-rate connecting each virtual machine to the
network (as today) and an overall 'usage' allowance. This document
focuses on the latter. A tenant just decides one number for her
contracted allowance that can be shared over all her virtual
machines (VMs). The 'usage' allowance is a measure of
congestion-bit-rate, which will be explained later, but most tenants
will just think of it as a number, where more is better. A tenant
has no need to decide in advance which VMs will need more allowance
and which less—an automated process allocates the allowance
across the VMs, shifting more to those that need it most, as they
use it. Therefore, performance cannot be constrained by poor choice
of allocations between VMs, removing a whole dimension from the
problem that tenants face when choosing their traffic contract. The
allocation process can be operated by the tenant, or provided by the
data centre operator as part of an additional platform as a service
(PaaS) offer.</t>
<t hangText="Sender Constraint with transferrable allowance:">By
default, constraints are always placed on data senders, determined
by the sending party's traffic contract. Nonetheless, if the
receiving party (or any other party) wishes to enhance performance
it can arrange this with the sender at the expense of its own
allowance. <vspace blankLines="1" />For instance, when a tenant's VM
sends data to a storage facility the tenant that owns the VM
consumes her allowance for enhanced sending performance. But by
default when she later retrieves data from storage, the storage
facility is the sender, so the storage facility consumes its
allowance to determine performance in the reverse direction.
Nonetheless, during the retrieval request, the storage facility can
require that its sending 'costs' are covered by the receiving VM's
allowance.</t>
<t hangText="Transport-Agnostic:">In a well-provisioned network,
enforcement of performance isolation rarely introduces constraints
on network behaviour. However, it continually counts how much each
tenant is limiting the performance of others, and it will intervene
to enforce performance isolation, but against only those customers
who most persistently constrain others. This performance isolation
is oblivious to flows and to the protocols and algorithms being used
above the IP layer.</t>
<t hangText="Interconnection:">The solution is designed so that
interconnected networks can ensure each is accountable for the
performance degradation it contributes to in other networks. If
necessary, one network has the information to intervene at its
ingress to limit traffic from another network that is degrading
performance. Alternatively, with the proposed protocols, networks
can see sufficient information in traffic arriving at their borders
to give their neighbours financial incentives to limit the traffic
themselves.<vspace blankLines="1" />The present document focuses on
a single-provider scenario, but evolution to interconnection with
other data centres over wide-area networks, and interconnection with
access networks is briefly discussed in <xref
target="pidc_evolution"></xref>.</t>
</list></t>
</section>
<!-- ====================================================================== -->
<section anchor="pidc_Outline_Design" title="Outline Design">
<t>This section outlines the essential features of the design. Design
details will be given in <xref target="pidc_design"></xref>.<list
style="hanging">
<t hangText="Edge policing:">Traffic policing is located at the
policy enforcement point where each sending host connects to the
network, typically beneath the tenant's operating system in the
hypervisor controlled by the infrastructure operator. In this
respect, the approach has a similar arrangement to the Diffserv
architecture with traffic policers forming a ring around the network
<xref target="RFC2475"></xref>.</t>
<t hangText="Congestion policing:">However, unlike Diffserv, traffic
policing limits congestion-bit-rate, not bit-rate. Congestion
bit-rate is the product of congestion probability and bit-rate. For
instance, if the instantaneous congestion probability (cf. loss
probability) across a network path were 0.02% and a tenant's maximum
contracted congestion-bit-rate was 600kb/s, then the policer would
allow the tenant to send at a bit-rate of up to 3Gb/s (because 3Gb/s
x 0.02% = 600kb/s). The detailed design section describes how
congestion policers at the network ingress know the congestion that
each packet will encounter in the network; a sketch of one possible
policer formulation is given at the end of this section. {ToDo: rewrite
this section to describe how a congestion policer works, not to focus
just on units.}</t>
<t hangText="Hose model:">The congestion policer controls all
traffic from a particular sender without regard to destination,
similar to the Diffserv 'hose' model. {ToDo: dual policer, and
multiple hoses for long-term average.}</t>
<t hangText="Flow policing unnecessary:">A congestion policer could
be designed to focus policing on the particular data flow(s)
contributing most to the excess congestion-bit-rate. However we will
explain why bulk policing should be sufficient.</t>
<t hangText="FIFO forwarding:">Each network queue only needs a
first-in first-out discipline, with no need for any priority
scheduling. If scheduling by traffic class is used (for whatever
reason), congestion policing can be used to isolate tenants from
each other within each class. {ToDo: Say this the other way
round.}</t>
<t hangText="ECN marking recommended:">All queues that might become
congested should support bulk ECN marking, but packets that do not
support ECN marking can be accommodated.</t>
</list>In the proposed approach, the network operator deploys capacity
as usual—using previous experience to determine a reasonable
contention ratio at every tier of the network. Then, the tenant
contracts with the operator for an allowance that determines the rate at
which the congestion policer allows each tenant to contribute to
congestion {ToDo: Dual policer}. <xref target="pidc_parameter"></xref>
discusses how the operator would determine this allowance. Each VM's
congestion policer limits its peak congestion-bit-rate as well as
limiting the overall average per tenant.</t>
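<t>The exact policer design is outside the scope of this outline, but
the following sketch illustrates one possible formulation as a token
bucket that fills at the tenant's contracted congestion-bit-rate and is
drained by the bits of congestion-marked (or ConEx-marked) packets. It
is purely illustrative; the class and variable names are assumptions
made for this example, not part of any specification.</t>
<figure>
<artwork><![CDATA[
# Illustrative sketch of a bulk congestion policer as a token bucket.
# Tokens accumulate at the contracted congestion-bit-rate; each
# congestion-marked bit drains one token.  When the bucket is empty,
# discard is focused on this tenant's traffic.

class CongestionPolicer:
    def __init__(self, allowance_bps, depth_bits):
        self.fill_rate = allowance_bps   # contracted congestion-bit-rate
        self.depth = depth_bits          # permitted burst of congestion
        self.tokens = depth_bits
        self.last = 0.0                  # time of previous update (s)

    def admit(self, now, pkt_bits, congestion_marked):
        # Refill for the elapsed interval, capped at the bucket depth.
        self.tokens = min(self.depth,
                          self.tokens + self.fill_rate * (now - self.last))
        self.last = now
        if not congestion_marked:
            return True                  # unmarked packets are not policed
        if self.tokens >= pkt_bits:
            self.tokens -= pkt_bits      # congestion-marked bits use allowance
            return True
        return False                     # allowance exhausted: drop here
]]></artwork>
</figure>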
</section>
<!-- ====================================================================== -->
<section anchor="pidc_Intuition" title="Performance Isolation: Intuition">
<t>Network performance isolation traditionally meant that each user
could be sure of a minimum guaranteed bit-rate. Such assurances are
useful if traffic from each tenant follows relatively predictable paths
and is fairly constant. If traffic demand is more dynamic and
unpredictable (both over time and across paths), minimum bit-rate
assurances can still be given, but they have to be very small relative
to the available capacity.</t>
<t>This either means the shared capacity has to be greatly overprovided
so that the assured level is large enough, or the assured level has to
be small. The former is unnecessarily expensive; the latter doesn't
really give a sufficiently useful assurance.</t>
<t>Another form of isolation is to guarantee that each user will get 1/N
of the capacity of each link, where N is the number of active users at
each link. This is fine if the number of active users (N) sharing a link
is fairly predictable. However, if large numbers of tenants do not
typically share any one link but at any time they all could (as in a
data centre), a 1/N assurance is fairly worthless. Again, given N is
typically small but could be very large, either the shared capacity has
to be expensively overprovided, or the assured bit-rate has to be
worthlessly small.</t>
<t>Both these traditional forms of isolation try to give the tenant an
assurance about instantaneous bit-rate by constraining the instantaneous
bit-rate of everyone else. However, there are two mistakes in this
approach. The amount of capacity left for a tenant to transfer data as
quickly as possible depends on:<list style="numbers">
<t>the load <spanx style="emph">over time</spanx> of everyone
else</t>
<t>how much everyone else yields to the increase in <spanx
style="emph">congestion</spanx> when someone else tries to transfer
data</t>
</list></t>
<t>This is why limiting congestion-bit-rate over time is the key to
network performance isolation. It focuses policing only on those tenants
who go fast over congested path(s) excessively and persistently over
time. This keeps congestion below a design threshold everywhere so that
everyone else can go fast.</t>
<t>Congestion policing can and will enforce a congestion response if a
particular tenant sends traffic that is completely unresponsive to
congestion. However, the purpose of congestion policing is not to
intervene in everyone's rate control all the time. Rather it is to
encourage each tenant to avoid being policed &#8212; to keep the
aggregate of all their flows' responses to congestion within an overall
envelope. Nonetheless, the upper bound set by the congestion policer
still ensures that each tenant's minimum performance is isolated from
the combined effect of everyone else.</t>
<t>It has not been easy to find a way to give the intuition on why
congestion policing isolates performance, particularly across a network
of links, not just on a single link. The approach used in this section
is to describe the system as if everyone is using the congestion
response they would be forced to use if congestion policing had to
intervene. We therefore call this the boundary model of congestion
control. It is a very simple congestion response, so it is much easier
to understand than if we introduced all the square root terms and other
complexity of New Reno TCP's response. And it means we don't have to try
to describe a mix of responses.</t>
<t>We cannot emphasise enough that the intention is not to make
individual flows conform to this boundary response to congestion. Indeed
the intention is to allow a diverse evolving mix of congestion
responses, but constrained in total within a simple overall
envelope.</t>
<t>After describing and further justifying this simple boundary
model of congestion control, we start by considering long-running flows
sharing one link. Then we will consider on-off traffic, before widening
the scope from one link to a network of links and to links of different
sizes. Then we will depart from the initial simplified model of
congestion control and consider diverse congestion control algorithms,
including no end-system response at all.</t>
<t>Formal analysis to back-up the intuition provided by this section
will be made available in a more extensive companion technical report
<xref target="conex-dc_tr"></xref>.</t>
<section anchor="pidc_Initial_CC_Model"
title="Simple Boundary Model of Congestion Control">
<t>The boundary model of congestion control ensures a flow's bit-rate
is inversely proportional to the congestion level that it detects. For
instance, if congestion probability doubles, the flow's bit-rate
halves. This is called a scalable congestion control because it
maintains the same rate of congestion signals (marked or dropped
packets) no matter how fast it goes. Examples are Relentless TCP and
Scalable TCP [ToDo: add refs].</t>
<t>New Reno-like TCP algorithms <xref target="RFC5681"></xref> have
been widely replaced by alternatives closer to this scalable ideal
(e.g. Cubic TCP, Compound TCP [ToDo: add refs]), because at high rates
New Reno generated congestion signals too infrequently to track
available capacity fast enough <xref target="RFC3649"></xref>. More
recent TCP updates (e.g. data centre TCP) are becoming closer still to
the scalable ideal.</t>
<t>It is necessary to carefully distinguish congestion bit-rate, which
is an absolute measure of the rate of congested bits vs. congestion
probability, which is a relative measure of the proportion of
congested bits to all bits. For instance, consider a scenario where a
flow with scalable congestion control is alone in a 1Gb/s link, then
another similar flow from another tenant joins it. Both will push up
the congestion probability, which will push down their rates until
they together fit into the link. Because the flow's rate has to halve
to accommodate the new flow, congestion probability will double (let's
say from 0.002% to 0.004%), by our initial assumption of a scalable
congestion control. When it is alone on the link, the
congestion-bit-rate of the flow is 20kb/s (= 1Gb/s * 0.002%), and when
it shares the link it is still 20kb/s (= 500Mb/s * 0.004%).</t>
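<t>The following minimal sketch (purely illustrative, with assumed
variable names) restates this numerically: because a scalable control's
congestion probability is its weight divided by its bit-rate, the
congestion-bit-rate it contributes is always the weight itself:</t>
<figure>
<artwork><![CDATA[
# Illustrative check: for a scalable congestion control with weight
# w = 20kb/s, congestion probability p = w / bit-rate, so the
# congestion-bit-rate (bit-rate * p) is always w.

w = 20e3                        # congestion-bit-rate per flow (20kb/s)

for rate in (1e9, 500e6):       # alone at 1Gb/s, then sharing at 500Mb/s
    p = w / rate                # 0.002%, then 0.004%
    print(p * 100, rate * p)    # congestion-bit-rate stays at 20kb/s
]]></artwork>
</figure>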
<t>In summary, a congestion control can be considered scalable if the
bit-rate of packets carrying congestion signals (the
congestion-bit-rate) always stays the same no matter how much capacity
it finds available. This ensures there will always be enough signals
in a round trip time to keep the dynamics under control.</t>
<t>Reminder: Making individual flows conform to this boundary or
scalable response to congestion is a non-goal. Although we start this
explanation with this specific simple end-system congestion response,
this is just to aid intuition.</t>
</section>
<section anchor="pidc_long-running" title="Long-Running Flows">
<t><xref target="pidc_Tab-long_flows"></xref> shows various scenarios
where each of five tenants has contracted for 400kb/s of
congestion-bit-rate in order to share a 1Gb/s link. In order to help
intuition, we start with the (unlikely) scenario where all their flows
are long-running. Long-running flows will try to use all the link
capacity, so for simplicity we take utilisation as a round 100%.</t>
<t>In the case we have just described (scenario A) neither tenant's
policer is intervening at all, because both their congestion
allowances are 40kb/s and each sends only one flow that contributes
20kb/s of congestion — half the allowance.</t>
<texttable anchor="pidc_Tab-long_flows"
title="Bit-rates that a congestion policer allocates to five tenants sharing a 1Gb/s link with various numbers (#) of long-running flows all using 'scalable congestion control'">
<ttcol align="right">
Tenant</ttcol>
<ttcol>contracted congestion- bit-rate kb/s</ttcol>
<ttcol>scenario A
# : Mb/s</ttcol>
<ttcol>scenario B
# : Mb/s</ttcol>
<ttcol>scenario C
# : Mb/s</ttcol>
<ttcol>scenario D
# : Mb/s</ttcol>
<c></c>
<c></c>
<c></c>
<c></c>
<c></c>
<c></c>
<c>(a)</c>
<c>40</c>
<c>1 : 500</c>
<c>5 : 250</c>
<c>5 : 200</c>
<c>5 : 250</c>
<c>(b)</c>
<c>40</c>
<c>1 : 500</c>
<c>3 : 250</c>
<c>3 : 200</c>
<c>2 : 250</c>
<c>(c)</c>
<c>40</c>
<c>- : ---</c>
<c>3 : 250</c>
<c>3 : 200</c>
<c>2 : 250</c>
<c>(d)</c>
<c>40</c>
<c>- : ---</c>
<c>2 : 250</c>
<c>2 : 200</c>
<c>1 : 125</c>
<c>(e)</c>
<c>40</c>
<c>- : ---</c>
<c>- : ---</c>
<c>2 : 200</c>
<c>1 : 125</c>
<c></c>
<c>Congestion probability</c>
<c> 0.004%</c>
<c> 0.016%</c>
<c> 0.02%</c>
<c> 0.016%</c>
</texttable>
<t>Scenario B shows a case where four of the tenants all send 2 or
more long-running flows. Recall that each flow always contributes
20kb/s no matter how fast it goes. Therefore the policers of tenants
(a-c) limit them to two flows-worth of congestion (2 x 20kb/s =
40kb/s). Tenant (d) is only asking for 2 flows, so it gets them
without being policed, and all four get the same quarter share of the
link.</t>
<t>Scenario C is similar, except the fifth tenant (e) joins in, so
they all get equal 1/5 shares of the link.</t>
<t>In Scenario D, only tenant (a) asks for more than two flows, so
(a)'s policer limits it to two flows-worth of congestion, and everyone
else gets the number of flows-worth that they ask for. This means that
tenants (d & e) get less than everyone else, because they asked for
less than they would have been allowed. (Similarly, in Scenarios A
& B, some of the tenants are inactive, so they get zero, which is
also less than they could have had if they had wanted.)</t>
<t>With lots of long-running flows, as in scenarios B & C,
congestion policing seems to emulate round robin scheduling,
equalising the bit-rate of each tenant, no matter how many flows they
run. By configuring different contracted allowances for each tenant,
it can easily be seen that congestion policing could emulate weighted
round robin (WRR), with the relative sizes of the allowances acting as
the weights.</t>
<t>Scenario D departs from round-robin. This is deliberate, the idea
being that tenants are free to take less than their share in the short
term, which allows them to take more at other times, as we will see in
<xref target="pidc_wcc"></xref>. In Scenario D, policing focuses only
on the tenant (a) that is continually exceeding its contract. This
policer focuses discard solely on tenant a's traffic so that it cannot
cause any more congestion at the shared link (shown as 0.016% in the
last row).</t>
<t>To summarise so far, ingress congestion policers control
congestion-bit-rate in order to indirectly assure a minimum bit-rate
per tenant. With lots of long-running flows, the outcome is somewhat
similar to WRR, but without the need for any mechanism in each
queue.</t>
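<t>The allocations in <xref target="pidc_Tab-long_flows"></xref> can be
reproduced with a simple model: each tenant is effectively capped at
(allowance / w) flow-equivalents, and the link capacity is shared in
proportion to each tenant's capped flow-equivalents. The sketch below
is purely illustrative; the function and variable names are assumptions
for this example, not part of any specification.</t>
<figure>
<artwork><![CDATA[
# Illustrative model behind the table: each tenant is capped at
# allowance/w flow-equivalents of a scalable congestion control and the
# link is shared in proportion to those flow-equivalents.

C = 1e9            # link capacity (1Gb/s)
w = 20e3           # congestion-bit-rate per scalable flow (20kb/s)
allowance = 40e3   # contracted congestion-bit-rate per tenant (40kb/s)

def share(flows_requested):
    capped = [min(n, allowance / w) for n in flows_requested]
    total = sum(capped)
    rates = [C * m / total for m in capped]   # per-tenant bit-rate
    congestion = total * w / C                # link congestion probability
    return rates, congestion

# Scenario D: tenants (a)-(e) request 5, 2, 2, 1 and 1 long-running flows.
rates, p = share([5, 2, 2, 1, 1])
print([r / 1e6 for r in rates])  # [250.0, 250.0, 250.0, 125.0, 125.0] Mb/s
print(p * 100)                   # 0.016 (%), as in the last row
]]></artwork>
</figure>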
</section>
<section anchor="pidc_On-Off_Flows" title="On-Off Flows">
<t>Aiming to behave like round-robin (or weighted round-robin) is only
useful when all flows are infinitely long. For transfers of finite
size, congestion policing isolates one tenant's performance from the
behaviour of others &#8212; which WRR would not, as will now be
explained.</t>
<t><xref target="pidc_Fig_on-off"></xref> compares two example
scenarios where tenant 'b' regularly sends small files in the top
chart and the same size files but more often in the bottom chart (a
higher 'on-off ratio'). This is the typical behaviour of a Web server
when more clients request more files at peak time. Meanwhile, in this
example, tenant c's behaviour doesn't change between the two scenarios
-- it sends a couple of large files, each starting at the same time in
both cases.</t>
<t>The capacity of the link that 'b' and 'c' share is shown as the
full height of the plot. The files sent by 'b' are shown as little
rectangles. 'b' can go at the full bit-rate of the link when 'c' is
not sending, which is represented by the tall thin rectangles labelled
'b' near the middle. We assume for simplicity that 'b' and 'c' divide
up the bit-rate equally. So, when both 'b' and 'c' are sending, the
'b' rectangles are half the height (bit-rate) and twice the duration
relative to when 'b' sends alone. The area of a file to be transferred
stays the same, whether tall and thin or short and fat, because the
area represents the size of the file (bit-rate x duration = file
size). The files from 'c' look like inverted castellations, because
'c' uses half the link rate while each file from 'b' completes, then
'c' can fill the link until 'b' starts the next file. The
cross-hatched areas represent idle times when no-one is sending.</t>
<t>For this simple scenario we ignore start-up dynamics and just focus
on the rate and duration of flows that are long enough to stabilise,
which is why they can be represented as simple rectangles. We will
introduce the effect of flow startups later.</t>
<t>In the bottom case, where 'b' sends more often, the gaps between
b's transfers are smaller, so 'c' has less opportunity to use the
whole line rate. This stretches out the time it takes for 'c' to
complete its file transfers (recall a file will always have the same
area, which represents its size). Although 'c' finishes later, it still
starts the next flow at the same time. In turn, this means 'c' is
sending during a greater proportion of b's transfers, which extends
b's average completion time too.</t>
<?rfc needLines="23"?>
<figure anchor="pidc_Fig_on-off"
title="In the lower case, the on-off ratio of 'b' has increased, which extends all the completion times of 'c' and 'b'">
<artwork><![CDATA[
^ bit-rate
|
|---------------------------------------,--.---------,--.----,-------
| | |\/\/\/\/\| |/\/\|
| c | b|/\/\/\/\/| b|\/\/| c
|------. ,-----. ,-----. | |\/\/\/\/\| |/\/\| ,--
| b | | b | | b | | |/\/\/\/\/| |\/\/| | b
| | | | | | | |\/\/\/\/\| |/\/\| |
+------'------'-----'------'-----'------'--'---------'--'----'----'-->
time
^ bit-rate
|
|---------------------------------------------------.--,--.--,-------
| |/\| |\/|
| c |\/| b|/\| c
|------. ,-----. ,-----. ,-----. ,-----. ,-----./\| |\/| ,----
| b | | b | | b | | b | | b | | b |\/| |/\| | b
| | | | | | | | | | | |/\| |\/| |
+------'--'-----'--'-----'--'-----'--'-----'--'-----'--'--'--'--'---->
time
]]></artwork>
</figure>
<t>Round-robin would do little if anything to isolate 'c' from the
effect of 'b' sending files more often. Round-robin is designed to
force 'b' and 'c' to share the capacity equally when they are both
active. But in both scenarios they already share capacity equally when
they are both active. The difference is in how often they are active.
Round-robin and other traditional fair queuing techniques don't have
any memory to sense that 'b' has been active more of the time.</t>
<t>In contrast, a congestion policer can tell when one tenant is
sending files more frequently, by measuring the rate at which the
tenant is contributing to congestion. Our aim is to show that policers
will be able to isolate performance properly by using the right metric
(congestion bit-rate), rather than using the wrong metric (bit-rate),
which doesn't sense whether the load over time is large or small.</t>
<section anchor="pidc_on-off-numerical-no-policer"
title="Numerical Examples Without Policing">
<t>The usefulness of the congestion bit-rate metric will now be
illustrated with the numerical examples in <xref
target="pidc_Tab_on-off"></xref>. The scenarios illustrate what the
congestion bit-rate would be without any policing or scheduling
action in the network. Then this metric can be monitored and limited
by a policer, to prevent one tenant from harming the performance of
others.</t>
<t>The 2nd & 3rd columns (file-size and inter-arrival time)
fully represent the behaviour of each tenant in each scenario. All
the other columns merely characterise the outcome in various ways.
The inter-arrival time (T) is the average time between starting one
file and the next. For instance, tenant 'b' sends a 16Mb file every
200ms on average. The formula in the heading of some columns shows
how the column was derived from other columns.</t>
<t>Scenario E is contrived so that the three tenants all offer the
same load to the network, even though they send files of very
different size (S). The files sent by tenant 'a' are 100 times
smaller than those of tenant 'b', but 'a' sends them 100 times more
often. In turn, b's files are 100 times smaller than c's, but 'b'
sends them 100 times more often. Graphically, the scenario
would look similar to <xref target="pidc_Fig_on-off"></xref>, except
with three sizes of file, not just two. Scenarios E-G are designed
to roughly represent various distributions of file sizes found in
data centres, but still to be simple enough to facilitate intuition,
even though each tenant would not normally send just one size
file.</t>
<t>The average completion time (t) and the maximum were calculated
from a fairly simple analytical model (documented in a companion
technical report <xref target="conex-dc_tr"></xref>). Using one data
point as an example, it can be seen that a 1600Mb (200MB) file from
tenant 'c' completes in 1905ms (about 1.9s). The files that are 100
times smaller complete 100 times more quickly on average. In fact,
in this scenario with equal loads, each tenant perceives that their
files are being transferred at the same rate of 840Mb/s on average
(file-size divided by completion time, as shown in the apparent
bit-rate column). Thus all three tenants perceive they
are getting 84% of the 1Gb/s link on average (due to the benefit of
multiplexing and utilisation being low at 240Mb/s / 1Gb/s = 24% in
this case).</t>
<t>The completion times of the smaller files vary significantly,
depending on whether a larger file transfer is proceeding at the
same time. We have already seen this effect in <xref
target="pidc_Fig_on-off"></xref>, where, when tenant b's files share
with 'c', they take twice as long to complete as when they don't.
This is why the maximum completion time is greater than the average
for the small files, whereas there is imperceptible variance for the
largest files.</t>
<t>The final column shows how congestion bit-rate will be a useful
metric to enforce performance isolation (the figures illustrate the
situation before any enforcement mechanism is added). In the case of
equal loads (scenario E), average congestion bit-rates are all
equal. In scenarios F and G average congestion bit-rates are higher,
because all tenants are placing much more load on the network over
time, even though each still sends at equal rates to others when
they are active together. <xref target="pidc_Fig_on-off"></xref>
illustrated a similar effect in the difference between the top and
bottom scenarios.</t>
<t>The maximum instantaneous congestion bit-rate is nearly always
20kb/s. That is because, by definition, all the tenants are using
scalable congestion controls with a constant congestion rate of
20kb/s. As we saw in <xref target="pidc_Initial_CC_Model"></xref>,
the congestion rate of a particular scalable congestion control is
always the same, no matter how many other flows it competes
with.</t>
<t>Once it is understood that the congestion bit-rate of one
scalable flow is always 'w' and doesn't change whenever a flow is
active, it becomes clear what the congestion bit-rate will be when
averaged over time; it will simply be 'w' multiplied by the
proportion of time that the tenant's file transfers are active. That
is, w*t/T. For instance, in scenario E, on average tenant b's flows
start 200ms apart, but they complete in 19ms. So they are active for
19/200 = 10% of the time (rounded). A tenant that causes a
congestion bit-rate of 20kb/s for 10% of the time will have an
average congestion-bit-rate of 2kb/s, as shown.</t>
<t>To summarise so far, no matter how many more files transfer at
the same time, each scalable flow still contributes to congestion at
the same rate, but it contributes for more of the time, because it
squeezes out into the gap before its next flow starts.</t>
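<t>As an illustrative cross-check of the final column of <xref
target="pidc_Tab_on-off"></xref> (the names below are assumptions for
the example only), the time-averaged congestion-bit-rate is simply w
scaled by the proportion of time the tenant's transfers are active:</t>
<figure>
<artwork><![CDATA[
# Illustrative cross-check of the average congestion-bit-rate column:
# a tenant that causes w while active, for t out of every T ms on
# average, causes w * t / T averaged over time.

w = 20.0   # congestion-bit-rate while active (kb/s)

def avg_congestion_rate(completion_ms, interarrival_ms):
    return w * completion_ms / interarrival_ms

print(avg_congestion_rate(19, 200))      # tenant 'b', scenario E: ~2 kb/s
print(avg_congestion_rate(3636, 10000))  # tenant 'c', scenario F: ~7 kb/s
]]></artwork>
</figure>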
<?rfc needLines="30"?>
<texttable anchor="pidc_Tab_on-off"
title="How the effect on others of various file-transfer behaviours can be measured by the resulting congestion-bit-rate">
<ttcol align="right"> Ten
ant</ttcol>
<ttcol align="right">File size
S Mb</ttcol>
<ttcol align="right">Ave. inter- arr- ival
T
ms</ttcol>
<ttcol align="right">Ave. load
S/T
Mb/s</ttcol>
<ttcol align="right">Completion time
ave : max
t
ms </ttcol>
<ttcol align="center">Apparent bit-rate
ave : min S/t
Mb/s</ttcol>
<ttcol align="center">Congest- ion bit-rate
ave : max
w*t/T
kb/s</ttcol>
<!-- . . -->
<c></c>
<c></c>
<c></c>
<c></c>
<c>Scenario E</c>
<c></c>
<c></c>
<!-- . . -->
<c>a</c>
<c>0.16</c>
<c>2</c>
<c>80</c>
<c>0.19 : 0.48</c>
<c>840 : 333</c>
<c> 2 : 20</c>
<!-- . . -->
<c>b</c>
<c>16</c>
<c>200</c>
<c>80</c>
<c>19 : 35</c>
<c>840 : 460</c>
<c> 2 : 20</c>
<!-- . . -->
<c>c</c>
<c>1600</c>
<c>20000</c>
<c>80</c>
<c>1905 : 1905</c>
<c>840 : 840</c>
<c> 2 : 20</c>
<!-- . . -->
<c></c>
<c></c>
<c></c>
<c>____</c>
<c></c>
<c></c>
<c></c>
<!-- . . -->
<c></c>
<c></c>
<c></c>
<c>240</c>
<c></c>
<c></c>
<c></c>
<!-- . . -->
<c></c>
<c></c>
<c></c>
<c></c>
<c>Scenario F</c>
<c></c>
<c></c>
<!-- . . -->
<c>a</c>
<c>0.16</c>
<c>0.67</c>
<c>240</c>
<c>0.31 : 0.48</c>
<c>516 : 333</c>
<c> 9 : 20</c>
<!-- . . -->
<c>b</c>
<c>16</c>
<c>50</c>
<c>320</c>
<c>29 : 42</c>
<c>557 : 380</c>
<c> 11 : 20</c>
<!-- . . -->
<c>c</c>
<c>1600</c>
<c>10000</c>
<c>160</c>
<c>3636 : 3636</c>
<c>440 : 440</c>
<c> 7 : 20</c>
<!-- . . -->
<c></c>
<c></c>
<c></c>
<c>____</c>
<c></c>
<c></c>
<c></c>
<!-- . . -->
<c></c>
<c></c>
<c></c>
<c>720</c>
<c></c>
<c></c>
<c></c>
<!-- . . -->
<c></c>
<c></c>
<c></c>
<c></c>
<c>Scenario G</c>
<c></c>
<c></c>
<!-- . . -->
<c>a</c>
<c>0.16</c>
<c>0.67</c>
<c>240</c>
<c>0.33 : 0.64</c>
<c>481 : 250</c>
<c> 10 : 20</c>
<!-- . . -->
<c>b</c>
<c>16</c>
<c>40</c>
<c>400</c>
<c>32 : 46</c>
<c>505 : 345</c>
<c> 16 : 40</c>
<!-- . . -->
<c>c</c>
<c>1600</c>
<c>10000</c>
<c>160</c>
<c>4543 : 4543</c>
<c>352 : 352</c>
<c> 9 : 20</c>
<!-- . . -->
<c></c>
<c></c>
<c></c>
<c>____</c>
<c></c>
<c></c>
<c></c>
<!-- . . -->
<c></c>
<c></c>
<c></c>
<c>800</c>
<c></c>
<c></c>
<c></c>
<postamble>Single link of capacity 1Gb/s. Each tenant uses a
scalable congestion control which contributes a
congestion-bit-rate for each flow of w = 20kb/s.</postamble>
</texttable>
<t>In scenario F, clients have increased the rate they request files
from tenants a, b and c respectively by 3x, 4x and 2x relative to
scenario E. The tenants send the same size files but 3x, 4x and 2x
more often. For instance tenant 'b' is sending 16Mb files four times
as often as before, and they now take longer as well -- nearly 29ms
rather than 19ms -- because the other tenants are active more often
too, so completion gets squeezed to later. Consequently, tenant 'b'
is now sending 57% of the time, so its congestion-bit-rate is 20kb/s
* 57% = 11kb/s. This is nearly 6x higher than in scenario E,
reflecting both b's own increase by 4x and that this increase
coincides with everyone else increasing their load.</t>
<t>In scenario G, tenant 'b' increases even more, to 5x the load it
offered in scenario E. This results in average utilisation of
800Mb/s / 1Gb/s = 80%, compared to 72% in scenario F and only 24% in
scenario E. 'b' sends the same files but 5x more often, so its load
rises 5x.</t>
<t>Completion times rise for everyone due to the overall rise in
load, but the congestion rates of 'a' and 'c' don't rise anything
like as much as that of 'b', because they still leave large gaps
between files. For instance, tenant 'c' completes each large file
transfer in 4.5s (compared to 1.9s in scenario E), but it still only
sends files every 10s. So 'c' only sends 45% of the time, which is
reflected in its congestion bit-rate of 20kb/s * 45% = 9kb/s.</t>
<t>In contrast, on average tenant 'b' can only complete each
medium-sized file transfer in 32ms (compared to 19ms in scenario E),
but on average it starts sending another file after 40ms. So 'b'
sends 79% of the time, which is reflected in its congestion bit-rate
of 20kb/s * 79% = 16kb/s (rounded).</t>
<t>However, during the 45% of the time that 'c' sends a large file,
b's completion time is higher than average (as shown in <xref
target="pidc_Fig_on-off"></xref>). In fact, as shown in the maximum
completion time column, 'b' completes in 46ms, but it starts sending
a new file after 40ms, which is before the previous one has
completed. Therefore, during each of c's large files, 'b' sends
46/40 = 115% of the time on average.</t>
<t>This actually means 'b' is overlapping two files for 15% of the
time on average and sending one file for the remaining 85%. Whenever
two file transfers overlap, 'b' will be causing 2 x 20kb/s = 40kb/s
of congestion, which explains why tenant b in scenario G is the only
case with a maximum congestion rate of 40kb/s rather than 20kb/s as
in every other case. Over the duration of c's large files, 'b' would
therefore cause congestion at an average rate of 20kb/s * 85% +
40kb/s * 15% = 23kb/s (or more simply 20kb/s * 115% = 23kb/s). Of
course, when 'c' is not sending a large file, 'b' will contribute
less to congestion, which is why its average congestion rate is
16kb/s overall, as discussed earlier.</t>
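<t>The same relationship extends when transfers overlap, as the
following illustrative calculation (with assumed variable names) shows;
the activity ratio simply exceeds 100%:</t>
<figure>
<artwork><![CDATA[
# Illustrative: when transfers overlap, the time-averaged congestion
# rate is still w times the activity ratio t/T, even when that ratio
# exceeds 100%.

w = 20.0                  # kb/s per active transfer
activity = 46.0 / 40.0    # b's completion time / inter-arrival (115%)

overlap = activity - 1.0  # fraction of time two transfers overlap (15%)
print(w * (1 - overlap) + 2 * w * overlap)  # ~23 kb/s
print(w * activity)                         # same result, more simply
]]></artwork>
</figure>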
<!--The reason the maximum averaged congestion-bit-rate is higher with small files is simply because of the way it is defined. The congestion rate is averaged over the time between one file and the next, then the maximum is taken for each tenant. For short inter-arrival times, this maximum will occur only in the worst case when all tenants are transferring files, while for longer inter-arrival times, there will be more of a mix of rates of congestion during the time over which it is averaged.-->
</section>
<section title="Congestion Policing of On-Off Flows">
<t>Still referring to the numerical examples in <xref
target="pidc_Tab_on-off"></xref>, we will now discuss the effect of
limiting each tenant with a congestion policer.</t>
<t>The network operator might have deployed congestion policers to
cap each tenant's average congestion rate to 16kb/s. None of the
tenants are exceeding this limit in any of the scenarios, but tenant
'b' is just shy of it in scenario G. Therefore all the tenants would
be free to behave in all sorts of ways like those of scenarios E-G,
but they would be prevented from degrading the performance of the
other tenants beyond the point reached by tenant 'b' in scenario G.
If tenant 'b' added more load, the policer would prevent the extra
load entering the network by focusing drop solely on tenant 'b',
preventing the other tenants from experiencing any more congestion
due to tenant 'b'. Then tenants 'a' and 'c' would be assured the
(average) apparent bit-rates shown, whatever the behaviour of
'b'.</t>
<t>If 'a' added more load, 'c' would not suffer. Instead 'b' would
exceed its limit and its rate would be trimmed during congestion peaks,
sacrificing some of its share to 'a'. Similarly, if 'c' added more
load, 'b' would be made to sacrifice some of its performance, so
that 'a' would not suffer. Further, if more tenants arrived to share
the same link, the policer would force 'b' to sacrifice performance
in favour of the additional tenants.</t>
<t>There is nothing special about a policer limit of 16kb/s. The
example when discussing infinite flows used a limit of 40kb/s per
tenant. And some tenants can be given higher limits than others
(e.g. at an additional charge). If the operator gives out congestion
limits that together add up to a higher amount but it doesn't
increase the link capacity, it merely allows the tenants to apply
more load (e.g. more files of the same size in the same time), but
each with lower bit-rate.</t>
<t>{ToDo: Discuss min bit-rates}</t>
<t>{ToDo: discuss instantaneous limits and how they protect the
minimum bit-rate of other tenants}</t>
</section>
</section>
<section anchor="pidc_wcc" title="Weighted Congestion Controls">
<t>At high speed, congestion controls such as Cubic TCP, Data Centre
TCP, Compound TCP etc all contribute to congestion at widely differing
rates, which is called their 'aggressiveness' or 'weight'. So far, we
have made the simplifying assumption of a scalable congestion control
algorithm that contributes to congestion at a constant rate of w =
20kb/s. We now assume tenant 'c' uses a similar congestion control to
before, but with different parameters in the algorithm so that its
weight is still constant, but at w = 2.2kb/s.</t>
<t>Tenant 'b' still uses w = 20kb/s for its smaller files, so when the
two compete for the 1Gb/s link, they will share it in proportion to
their weights, 20:2.2 (or 90%:10%). That is, 'b' and 'c' will
respectively get (20/22.2)*1Gb/s = 900Mb/s and (2.2/22.2)*1Gb/s =
100Mb/s of the 1Gb/s link. <xref target="pidc_Fig_weighted"></xref>
shows the situation before (upper) and after (lower) this change.</t>
<t>When the two compete, 'b' transfers each file 9/5 times as fast as
before (900Mb/s rather than 500Mb/s), so it completes them in 5/9 of
the time. 'b' still contributes congestion at the same rate of 20kb/s,
but for only 5/9 of the time. Therefore, relative to before, each of
b's files uses up only 5/9 as much of its allowance.</t>
<t>Tenant 'c' contributes congestion at 2.2/22.2 of its previous rate,
that is 2kb/s rather than 20kb/s. Although tenant 'b' goes faster, as
each file finishes, it gets out of the way sooner, so 'c' can catch up
to where it got to before after each 'b' file and should complete
hardly any later than before. Tenant 'c' will probably lose some
completion time because it has to accelerate and decelerate more. But,
whenever it is sending a file, 'c' gains (20kb/s - 2kb/s) = 18kb of
allowance every second, which it can use for other transfers.</t>
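<t>These shares follow directly from the weights, as the following
illustrative sketch shows (the values and names are assumptions for the
example): when tenants compete for a link, each gets capacity in
proportion to its weight, while contributing congestion at its own
weight for as long as it is active.</t>
<figure>
<artwork><![CDATA[
# Illustrative: scalable controls with different weights share a link
# in proportion to their weights.

C = 1e9                              # shared link (1Gb/s)
weights = {'b': 20e3, 'c': 2.2e3}    # congestion-bit-rates (weights), b/s

total = sum(weights.values())
for tenant, wgt in weights.items():
    rate = C * wgt / total
    print(tenant, round(rate / 1e6), 'Mb/s')  # b: ~900 Mb/s, c: ~100 Mb/s
]]></artwork>
</figure>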
<figure anchor="pidc_Fig_weighted"
title="Weighted congestion controls with equal weights (upper) and unequal (lower)">
<artwork><![CDATA[
^ bit-rate
|
|---------------------------------------------------.--,--.--,-------
| |/\| |\/|
| c |\/| b|/\| c
|------. ,-----. ,-----. ,-----. ,-----. ,-----./\| |\/| ,----
| b | | b | | b | | b | | b | | b |\/| |/\| | b
| | | | | | | | | | | |/\| |\/| |
+------'--'-----'--'-----'--'-----'--'-----'--'-----'--'--'--'--'---->
time
^ bit-rate
|
|---------------------------------------------------.--,--.--,-------
|---. ,---. ,---. ,---. ,---. ,---. |/\| |\/| ,---.
| | | | | | c | | | | | | |\/| b|/\| c| |
| | | | | | | | | | | | |/\| |\/| | |
| b | | b | | b | | b | | b | | b | |\/| |/\| | b |
| | | | | | | | | | | | |/\| |\/| | |
+---'-----'---'----'---'----'---'----'---'----'---'-'--'--'--'--'---'>
time
]]></artwork>
</figure>
<t>It seems too good to be true that both tenants gain so much and
lose so little by 'c' reducing its aggressiveness. The gains are
unlikely to be as perfect as this simple model predicts, but we
believe they will be nearly as substantial.</t>
<t>It might seem that everyone can keep gaining by everyone agreeing
to reduce their weights, ad infinitum. However, the lower the weight,
the fewer signals the congestion control gets, so it starts to lose its
control during dynamics. Nonetheless, congestion policing should
encourage congestion control designs to keep reducing their weights,
but they will have to stop when they reach the minimum necessary
congestion in order to maintain sufficient control signals.</t>
</section>
<section anchor="pidc_Network_of_Links" title="A Network of Links">
<t>So far we have only considered a single link. Congestion policing
at the network edge is designed to work across a network of links,
treating them all as a pool of resources, as we shall now explain. We
will use the dual-homed topology shown in <xref
target="pidc_Fig_dual-homed"></xref> (stretching the bounds of ASCII
art) as a very simple example of a pool of resources.</t>
<t>In this case there are 48 servers (H1, H2, ... Hn where n=48)
on the left, with on average 8 virtual machines (VMs) running on each
(e.g. server n is running Vn1, Vn2, ... to Vnm where m = 8). Each
server is connected by two 1Gb/s links, one to each top-of-rack switch
S1 & S2. To the right of the switches, there are 6 links of 10Gb/s
each, connecting onwards to customer networks or to the rest of the
data centre. There is a total of 48 * 2 * 1Gb/s = 96Gb/s capacity
between the 48 servers and the 2 switches, but there is only 6 * 10Gb/s
= 60Gb/s to the right of the switches. Nonetheless, data centres are
often designed with some level of contention like this, because at the
ToR switches a proportion of the traffic from certain hosts turns
round locally towards other hosts in the same rack.</t>
<figure align="center" anchor="pidc_Fig_dual-homed"
title="Dual-Homed Topology -- a Simple Resource Pool">
<artwork><![CDATA[
virtual hosts switches
machines
V11 V12 V1m __/
* * ... * H1 ,-.__________+--+__/
\___\__ __\____/`-' __-|S1|____,--
`. _ ,' ,'| |_______
. H2 ,-._,`. ,' +--+
. . `-'._ `.
. . `,' `.
. ,' `-. `.+--+_______
Vn1 Vn2 Vnm / `-_|S2|____
* * ... * Hn ,-.__________| |__ `--
\___\__ __\____/`-' +--+ \__
\
]]></artwork>
</figure>
<t>The congestion policer proposed in this document is based on the
'hose' model, where a tenant's congestion allowance can be used for
sending data over any path, including many paths at once. Therefore,
any one of the virtual machines on the left can use its allowance to
contribute to congestion on any or all of the 6 links on the right (or
any other link in the diagram actually, including those from the
server to the switches and those turning back to other hosts).</t>
<t>Nonetheless, if congestion policers are to enforce performance
isolation, they should stop one tenant squeezing the capacity
available to another tenant who needs to use a particular bottleneck
link or links. They should work whether the offending tenant is acting
deliberately or merely carelessly.</t>
<t>The only way a tenant can become squeezed is if another tenant uses
more of the bottleneck capacity, which can only happen if the other
tenant sends more flows (or more aggressive flows) over that link. In
the following we will call the tenant that is shifting flows 'active',
and the ones already on a link 'passive'. These terms have been chosen
so as not to imply one is bad and the other good — just
different.</t>
<t>The active tenant will increase flow completion times for all
tenants (passive and active) using that bottleneck. Such an active
tenant might shift flows from other paths to focus them onto one,
which would not of itself use up any more congestion allowance (recall
from <xref target="pidc_On-Off_Flows"></xref> that a scalable congestion
control uses up its congestion allowance at the same rate per flow
whatever bit-rate it is going at, and therefore whatever path it is
using). However, although the instantaneous rate at which the active
tenant uses up its allowance won't alter, the increased completion
times due to increased congestion will use up more of the active
tenant's allowance over time (same rate but for more of the time). If
the passive tenants are using up part of their allowances on other
links, the increase in congestion will use up a relatively smaller
proportion of their allowances. Once such an increase exceeds the
active tenant's congestion allowance, the congestion policer will
protect the passive tenants from further performance degradation.</t>
<t>A policer may not even have to directly intervene for tenants to be
protected; load balancing may remove the problem first. Load balancing
might either be provided by the network (usually just random), or some
of the 'passive' tenants might themselves actively shift traffic off
the increasingly congested bottleneck and onto other paths. Some of
them might be using the multipath TCP protocol (MPTCP — see
experimental <xref target="RFC6356"></xref>) that would achieve this
automatically, or ultimately they might shift their virtual machine to
a different endpoint to circumvent the congestion hot-spot completely.
Even if one passive tenant were not using MPTCP or could not shift
easily, others shifting away would achieve the same outcome.
Essentially, the deterrent effect of congestion policers encourages
everyone to even out congestion, shifting load away from hot spots.
Then performance isolation becomes an emergent property of everyone's
behaviour, due to the deterrent effect of policers, rather than always
through explicit policer intervention.</t>
<t>{ToDo: Add numerical example}</t>
<t>In contrast, enforcement mechanisms based on scheduling algorithms
like WRR or WFQ have to be deployed at each link, and each one works
in isolation from the others. Therefore, each one doesn't know how
much of other links the tenant is using. This is fine for networks
with a single known bottleneck per customer (e.g. many access
networks). However, in data centres there are many potential
bottlenecks and each tenant generally only uses a share of a small
number of them. A mechanism like WRR would not isolate anyone's
performance if it gave every tenant the right to use the same share of
all the links in the network, without regard to how many they were
using.</t>
<t>The correct approach, as proposed here, is to give a tenant a share
of the whole pool, not the same share of each link.</t>
</section>
<section anchor="pidc_Links_Diff_Sizes" title="Links of Different Sizes">
<t>Congestion policing treats a Mb/s of capacity in one link as
identical to a Mb/s of capacity in another link, even if the size of
each link is different. For instance, consider the case where one of
the three links to the right of each switch in <xref
target="pidc_Fig_dual-homed"></xref> were upgraded to 40Gb/s while the
other two remained at 10Gb/s (perhaps to accommodate the extra traffic
from a couple of the dual homed 1Gb/s servers being upgraded to
dual-homed 10Gb/s).</t>
<t>Two congestion control algorithms running at the same rate will
cause the same level of congestion probability, whatever size link
they are sharing. <list style="symbols">
<t>If 50 equal flows share a 10Gb/s link (10Gb/s / 50 = 200Mb/s
each) they will cause 0.01% congestion probability;</t>
<t>If 200 equal flows share a 40Gb/s link (40Gb/s / 200 = 200Mb/s
each) they will still cause 0.01% congestion probability;</t>
</list>This is because the congestion probability is determined by
the congestion control algorithms, not by the link.</t>
<t>Therefore, if an average of 300 flows were spread across the above
links (1x 40Gb/s and 2 x 10Gb/s), the numbers on each link would tend
towards respectively 200:50:50, so that each flow would get 200Mb/s
and each link would have 0.01% congestion on it. Sometimes, there
might be more flows on the bigger link, resulting in less than 200Mb/s
per flow and congestion higher than 0.01%. However, whenever the
congestion level was less on one link than another, congestion
policing would encourage flows to balance out the congestion level
across the links (as long as some flows could use congestion balancing
mechanisms like MPTCP).</t>
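<t>The following illustrative calculation (values assumed from the
example above) shows why balancing congestion leads to the 200:50:50
split and the same 0.01% congestion probability on every link:</t>
<figure>
<artwork><![CDATA[
# Illustrative: with congestion balanced across links, flows spread in
# proportion to link capacity, so every flow gets the same bit-rate and
# every link sees the same congestion probability (p = w / per-flow rate).

w = 20e3                      # congestion-bit-rate per flow (20kb/s)
links = [40e9, 10e9, 10e9]    # one 40Gb/s and two 10Gb/s links
total_flows = 300

per_flow = sum(links) / total_flows              # 200 Mb/s per flow
for capacity in links:
    flows = total_flows * capacity / sum(links)  # 200, 50 and 50 flows
    print(flows, (w / per_flow) * 100)           # 0.01 (%) on every link
]]></artwork>
</figure>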
<t>In summary, all the outcomes of congestion policing described so
far (emulating WRR etc) apply across a pool of diverse link sizes just
as much as they apply to single links.</t>
</section>
<section anchor="Diverse_Algorithms"
title="Diverse Congestion Control Algorithms">
<t>Throughout this explanation we have assumed a scalable congestion
control algorithm, which we justified in <xref
target="pidc_Initial_CC_Model"></xref> as the 'boundary' case if
congestion policing had to intervene, which is all that is relevant
when considering whether the policer can enforce performance
isolation.</t>
<t>This performance isolation approach still works, whether or not the
congestion controls in daily use by tenants fit this scalable model. A
bulk congestion policer constrains the sum of all the congestion
controls being used by a tenant so that they collectively remain below
a large-scale envelope that is itself shaped like the sum of many
scalable algorithms. Bulk congestion policers will constrain the
overall congestion effect (the sum) of any mix of algorithms within
it, including flows that are completely unresponsive to congestion.
This is explained around Fig 3 of <xref target="CongPol"></xref>.</t>
<t>{ToDo, summarise the relevant part of that paper here and perhaps
even add ASCII art for the plot...}</t>
<t>{ToDo, bring in discussion of slow-start as effectively another
variant of congestion control, with considerable overshoots, etc.}</t>
<t>The defining difference between the scalable congestion we have
assumed and the congestion controls in widespread production operating
systems (New Reno, Compound, Cubic, Data Centre TCP etc) is the way
congestion probability decreases as flow-rate increases (for a
long-running flow). With a scalable congestion control, if flow-rate
doubles, congestion probability halves. Whereas, with most production
congestion controls, if flow-rate doubles, congestion probability
reduces to less than half; for instance, New Reno TCP reduces
congestion probability to a quarter. The responses of Cubic and
Compound are closer to the ideal scalable control than New Reno's is,
but they deliberately do not depart too far from New Reno, so that
they can co-exist happily with it.</t>
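<t>The difference can be summarised as the steady-state relation
between a flow's rate x and the congestion probability p it causes:
roughly p proportional to 1/x for a scalable control, and p
proportional to 1/x^2 for New Reno. The short Python sketch below is
illustrative only, with both curves normalised through an arbitrary
reference point of 0.01% congestion at 200Mb/s; it shows that doubling
the rate halves p for the scalable control but reduces it to a quarter
for New Reno.</t>
<figure>
<artwork><![CDATA[
# Illustrative only: both relations normalised through an
# arbitrary reference point of 0.01% congestion at 200 Mb/s.
X0, P0 = 200e6, 1e-4

def p_scalable(x):
    # p proportional to 1/x: doubling x halves p
    return P0 * (X0 / x)

def p_newreno(x):
    # p proportional to 1/x^2: doubling x quarters p
    return P0 * (X0 / x) ** 2

for x in (100e6, 200e6, 400e6):
    print(int(x), p_scalable(x), p_newreno(x))
]]></artwork>
</figure>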
<!--This means that all the production controls sit on the safe side of the scalable model we have assumed.
That is, when a congestion policer constrains congestion, production transports will constrain their
bit-rate more than if they used a scalable algorithm.-->
</section>
</section>
<!-- ====================================================================== -->
<section anchor="pidc_design" title="Design">
<t>The design comprises the following elements, all of which involve
changes solely in the hypervisor or operating systems, not in network
switches:
<list style="hanging">
<t hangText="Congestion Information at Ingress:">This information
needs to be trusted by the operator of the data centre
infrastructure, therefore it cannot just use the feedback in the
end-to-end transport (e.g. TCP SACK or ECN echo congestion
experienced flags) that might anyway be encrypted. Trusted
congestion feedback may be implemented in either of the following
two ways: <list style="letters">
<t>either as a shim in both sending and receiving hypervisors
using an edge-to-edge (host-host) tunnel, with feedback messages
reporting congestion back to the sending host's hypervisor (in
addition to the e2e feedback at the transport layer).</t>
<t>or in the sending operating system using the congestion
exposure protocol (ConEx <xref
target="ConEx-Abstract-Mech"></xref>);</t>
</list>Approach a) could be applied solely to traffic from
operating systems that do not yet support the simpler approach
b).<vspace blankLines="1" />The host-host feedback tunnel (approach
a) is easier to implement if a tunnelling overlay is already in use
in the data centre. For instance, we believe it would be possible to
build the necessary feedback facilities using the proposed network
virtualisation approach based on generic routing encapsulation (GRE)
<xref target="nvgre"></xref>. The tunnel egress would also need to
be able to detect congestion. This would be simple for e2e flows
with ECN enabled, because this will lead to ECN also being enabled
in the outer IP header <xref target="RFC6040"></xref>. However, for
non-ECN enabled flows, it is more problematic. It might be possible
to add sequence numbers to the outer headers, as is done in many
pseudowire technologies. However, a simpler alternative is possible
in a data centre where the switches can be ECN-enabled. It would
then be possible to enable ECN in the outer headers, even if the e2e
transport is not ECN-capable (Not-ECT in the inner header). At the
egress, if the outer header is marked 'congestion experienced' but
the inner is Not-ECT, the packet would have to be dropped, because
loss is the only congestion signal the e2e transport would
understand. But before the packet is dropped, the ECN marking in the
outer header would have served its purpose as a congestion signal to
the tunnel egress (a minimal illustration of this egress behaviour is
sketched after this list). Beyond this, implementation details of
approach a) are still work in progress.<vspace blankLines="1" />If
the ConEx option is used
(approach b), a congestion audit function will also be required as a
shim in the hypervisor (or container) layer where data leaves the
network and enters the receiving host. The ConEx option is only
applicable if the guest OS at the sender has been modified to send
ConEx markings. For IPv6 this protocol is defined in <xref
target="conex-destopt"></xref>. The ConEx markings could be encoded
in the IPv4 header by hiding them within the packet ID field as
proposed in <xref target="intarea-ipv4-id-reuse"></xref>.</t>
<t hangText="Congestion Policing:">A bulk congestion policing
function would be associated with each tenant's virtual machine to
police all the traffic it sends into the network. It would most
likely be implemented as a shim in the hypervisor. Various policer
designs might be developed, but here we propose a simple but
effective one in order to be concrete (a minimal sketch appears after
this list). A
token bucket is filled with tokens at a constant rate that
represents the tenant's congestion allowance. The bucket is drained
by the size of every packet with a congestion marking, as described
in <xref target="CongPol"></xref>. If approach a) were used to get
"Congestion Information at the Ingress" , the bucket would be
drained by congestion feedback from the tunel egress. If approach b)
were used, the bucket would be drained by ConEx markings on the
actual data packets being forwarded (ConEx re-inserts the e2e
feedback from the transport receiver back onto packets on the
forward data path).<vspace blankLines="0" />{ToDo: Add details of
congestion burst limiting}<vspace blankLines="1" />While the data
centre network operator only needs to police congestion in bulk,
tenants may wish to enforce their own limits on individual users or
applications, as sub-limits of their overall allowance. Given all
the information used for policing is readily available to tenants in
the transport layer below their sender, any such per-flow, per-user
or per-application limitations can be readily applied. The tenant
may operate their own fine-grained policing software, or such
detailed control capabilities may be offered as part of the platform
(platform as a service or PaaS) above the more general
infrastructure as a service (IaaS).</t>
<t hangText="Distributed Token Buckets:">A customer may run virtual
machines on multiple physical nodes, in which case the data centre
operator would ensure that it deployed a policer in the hypervisor
on each node where the customer was running a VM, at the time each
VM was instantiated. The DC operator would arrange for them to
collectively enforce the per-customer congestion allowance, as a
distributed policer.<vspace blankLines="1" />A function to
distribute a customer's tokens to the policer associated with each
of the customer's VMs would be needed. This could be similar to the
distributed rate limiting of <xref target="DRL"></xref>.
Alternatively, a logically centralised bucket of congestion tokens
could be used with simple 1-1 communication between it and each
local token bucket in the hypervisor under each VM. <vspace
blankLines="1" />Importantly, traditional bit-rate tokens cannot
simply be reassigned from one VM to another without implications for
the balance of network loading (requiring operator intervention each
time), whereas congestion tokens can be freely reassigned between
different VMs, because a congestion token is equivalent at any place
or time in a network.<vspace blankLines="1" />As well as
distribution of tokens between the VMs of a tenant, it would
similarly be feasible to allow transfer of tokens between tenants,
also without breaking the performance isolation properties of the
system. Secure token transfer mechanisms could be built above the
underlying policing design described here. Therefore the details of
token transfer need not concern us here, and can be deferred to
future work.</t>
<t hangText="Switch/Router Support:">Network switches/routers would
not need any modification. However, both congestion detection by the
tunnel (approach a) and ConEx audit (approach b) would be easier if
switches supported ECN. <vspace blankLines="1" />Data Centre TCP
(DCTCP) might also be used, although it is not essential. DCTCP is
designed for data centres and requires ECN, modified sender and
receiver TCP algorithms, and a more aggressive active queue
management (AQM) algorithm in the L3 switches; the AQM marks ECN
using a step threshold at a very shallow queue length.</t>
</list></t>
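<t>As a purely illustrative aid to the "Congestion Information at
Ingress" and "Congestion Policing" elements above, the following
Python sketch combines a per-tenant congestion token bucket with the
tunnel-egress congestion detection of approach a). It is not part of
the design: the names (CongestionTokenBucket, egress_decapsulate and
so on) are invented for this sketch, and what the policer does once
the bucket empties is a simplistic stand-in for the burst-limiting
details still marked as a ToDo above.</t>
<figure>
<artwork><![CDATA[
import time

class CongestionTokenBucket:
    """Per-tenant bulk congestion policer (hypervisor shim).
    Filled at the tenant's congestion allowance in bit/s of
    congestion-volume; drained by the size of every packet
    that carries a ConEx marking (approach b) or is reported
    congested by tunnel feedback (approach a)."""

    def __init__(self, fill_rate_bps, depth_bits):
        self.fill_rate = fill_rate_bps
        self.depth = depth_bits
        self.tokens = depth_bits
        self.last = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        add = self.fill_rate * (now - self.last)
        self.tokens = min(self.depth, self.tokens + add)
        self.last = now

    def drain(self, bits):
        """Called once per congestion-marked or congestion-
        reported packet, with the packet size in bits."""
        self._refill()
        self.tokens -= bits

    def within_allowance(self):
        """While true, the tenant's traffic is unrestricted;
        once the bucket empties the policer would restrict the
        tenant (details are a ToDo in the text above)."""
        self._refill()
        return self.tokens > 0


def egress_decapsulate(outer_ce, inner_ect, bits, report):
    """Tunnel-egress behaviour for approach (a), assuming ECN
    is enabled in the outer header even when the inner packet
    is Not-ECT.  Returns True if the packet is forwarded."""
    if outer_ce:
        report(bits)        # feed congestion back to the
                            # sending hypervisor's policer
        if not inner_ect:
            return False    # drop: loss is the only signal the
                            # e2e transport would understand
    return True


# Example: a 50 kb/s congestion allowance. In reality the
# feedback would travel back over the network rather than by
# a direct function call.
bucket = CongestionTokenBucket(fill_rate_bps=50e3, depth_bits=1e6)
egress_decapsulate(outer_ce=True, inner_ect=False,
                   bits=12000, report=bucket.drain)
print(bucket.within_allowance())
]]></artwork>
</figure>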
</section>
<section anchor="pidc_parameter" title="Parameter Setting">
<t>{ToDo: }</t>
<!-- <t>ToDo: stitch together the following snippets:</t>
<t>Then the operator determines the loss probability that results when a
flow with a modern congestion control (e.g. Cubic or DCTCP) runs
continuously at a certain bit-rate (that is, a certain window with a
typical round trip time and packet size). For instance, a single 200Mb/s
flow drives loss probability up to 0.1% (with 1500B packets and 6ms
RTT).</t>
<t></t>
<t>sets a target operating level of congestion for all paths through the
network (e.g. 0.01% loss probability). The operator limits every
tenant's contribution to congestion anywhere in the pool of capacity
(e.g. one tenant may be limited to 50kb of congestion-volume per second,
which means 50kb/s of discarded packets). Then, as long as congestion
remains below the target level of 0.01%, this tenant will always be able
to send at 500Mb/s, because 500Mb/s of traffic @ 0.01% loss probability
= 50kb/s of lost bits.</t>
<t>As long as path loss probability remains at 0.01%, if the tenant sent
any more than 500Mb/s of data it would contribute more than the limit of
50kb/s of loss. The operator monitors whether the bit-rate of loss
exceeds this limit, and if it does, which the operator prevents the
tenant from exceeding, by limiting the bit-rate of data that the tenant
can send if the bit-rate of losses exceeds the agreed limit of
50kb/s.</t>
<t>The operator also deploys sufficient capacity so that if everyone is
running at their congestion limit, all their bit-rates will still be .
As long as congestion is below this level everywhere, every data source
will be able to introduce more traffic</t>
-->
</section>
<section anchor="pidc_deployment" title="Incremental Deployment">
<section anchor="pidc_migration" title="Migration">
<t>A pre-requisite for ingress congestion policing is the function
entitled "Congestion Information at Ingress" in <xref
target="pidc_design"></xref>. Tunnel feedback (approach a) is a more
processing-intensive change to the hypervisors, but it can be deployed
unilaterally by the data centre operator in all hypervisors (or
containers), without requiring support in guest operating systems.</t>
<t>Using ConEx markings (approach b) is only applicable if a
particular guest OS supports marking its outgoing packets with ConEx.
However, where available, this approach is simpler and more
efficient.</t>
<t>Both functions could be implemented in each hypervisor, and a
simple filter could be installed to allow ConEx packets (approach b)
straight through into the data centre network without going through
the feedback tunnel shim, while non-ConEx packets would need to be
tunnelled and to elicit tunnel feedback (approach a), as sketched
below. This would provide an incremental deployment scenario with the
best of both worlds: it would work for unmodified guest OSs, but for
guest OSs with ConEx support it would require less processing
(therefore being faster) and would avoid the considerable overhead of
a duplicate feedback channel between hypervisors (sending and
forwarding a large proportion of tiny packets).</t>
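<t>A minimal sketch of such a filter follows, purely for
illustration; has_conex_marking() and the two send paths are
hypothetical placeholders for hypervisor-specific functions, not
mechanisms defined by this document.</t>
<figure>
<artwork><![CDATA[
def has_conex_marking(packet):
    # Placeholder test: e.g. inspect the IPv6 ConEx destination
    # option [conex-destopt] or the reused IPv4 ID encoding
    # [intarea-ipv4-id-reuse]; the real check is
    # platform-specific.
    return packet.get("conex", False)

def ingress_filter(packet, send_native, send_tunnelled):
    """Send-side hypervisor shim for incremental deployment:
    ConEx-marked traffic (approach b) bypasses the feedback
    tunnel, while all other traffic is tunnelled so that the
    receiving hypervisor can return congestion feedback
    (approach a)."""
    if has_conex_marking(packet):
        send_native(packet)
    else:
        send_tunnelled(packet)

# Example use with trivial stand-ins for the two send paths:
ingress_filter({"conex": True},
               send_native=lambda p: print("native:", p),
               send_tunnelled=lambda p: print("tunnelled:", p))
]]></artwork>
</figure>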
<t>{ToDo: Note that the main reason for preferring ConEx information
will be because it is designed to represent a conservative expectation
of congestion, whereas tunnel feedback represents congestion only
after it has happened.}</t>
</section>
<section anchor="pidc_evolution" title="Evolution">
<t>Initially, the approach would be confined to intra-data centre
traffic. With the addition of ECN support on network equipment in the
WAN between data centres, it could straightforwardly be extended to
inter-data centre scenarios, including across interconnected backbone
networks.</t>
<t>Once the approach had been proven within and between data centres
and across interconnects, more mass-market devices might be expected
to turn on support for ECN feedback, and ECN might be enabled in the
equipment of wider networks most likely to be bottlenecks (access and
backhaul).</t>
</section>
</section>
<section anchor="pidc_alternates" title="Related Approaches">
<t>The Related Work section of <xref target="CongPol"></xref> provides a
useful comparison of the approach proposed here against other attempts
to solve similar problems.</t>
<t>When the hose model is used with Diffserv, capacity has to be
considerably over-provisioned for all the unfortunate cases when
multiple sources of traffic happen to coincide even though they are all
in-contract at their respective ingress policers. Even so, every node
within a Diffserv network also has to be configured to limit higher
traffic classes to a maximum rate in case of really unusual traffic
distributions that would starve lower priority classes. Therefore, for
really important performance assurances, Diffserv is used in the 'pipe'
model where the policer constrains traffic separately for each
destination, and sufficient capacity is provided at each network node
for the sum of all the peak contracted rates for paths crossing that
node.</t>
<t>In contrast, the congestion policing approach is designed to give
full performance assurances across a meshed network (the hose model),
without having to divide a network up into pipes. If an unexpected
distribution of traffic from all sources focuses on a congestion
hotspot, it will increase the congestion-bit-rate seen by the policers
of all sources contributing to the hot-spot. The congestion policers
then focus on these sources, which in turn limits the severity of the
hot-spot. </t>
<t>The critical improvement over Diffserv is that the ingress edges
receive information about any congestion occurring in the middle, so they
can limit how much congestion occurs, wherever it happens to occur.
Previously Diffserv edge policers had to limit traffic generally in case
it caused congestion, because they never knew whether it would (open
loop control).</t>
<t>Congestion policing mechanisms could be used to assure the
performance of one data flow (the 'pipe' model), but this would involve
unnecessary complexity, given the approach works well for the 'hose'
model.</t>
<t>Therefore, congestion policing allows capacity to be provisioned for
the average case, not for the near-worst case when many unlikely cases
coincide. It assures performance for all traffic using just one traffic
class, whereas Diffserv only assures performance for a small proportion
of traffic by partitioning it off into higher priority classes and
over-provisioning relative to the traffic contracts sold for this
class.</t>
<t>{ToDo: Refer to <xref target="pidc_Intuition"></xref> for comparison
with WRR &amp; WFQ}</t>
<t>Seawall {ToDo} <xref target="Seawall"></xref></t>
</section>
<!-- ====================================================================== -->
<section anchor="pidc_security" title="Security Considerations"></section>
<!-- ====================================================================== -->
<section anchor="pidc_IANA" title="IANA Considerations">
<t>This document does not require actions by IANA.</t>
</section>
<!-- ====================================================================== -->
<section anchor="pidc_conclusions" title="Conclusions">
<t>{ToDo}</t>
<t></t>
</section>
<!-- ====================================================================== -->
<section title="Acknowledgments">
<t></t>
</section>
<!-- ====================================================================== -->
</middle>
<back>
<references title="Informative References">
<?rfc include='reference.RFC.2475'?>
<?rfc include='reference.RFC.3649'?>
<?rfc include='reference.RFC.5681'?>
<?rfc include='reference.RFC.6040'?>
<?rfc include='reference.RFC.6356'?>
<reference anchor="ConEx-Abstract-Mech">
<front>
<title>Congestion Exposure (ConEx) Concepts and Abstract
Mechanism</title>
<author fullname="Matt Mathis" initials="M" surname="Mathis">
<organization>Google</organization>
</author>
<author fullname="Bob Briscoe" initials="B" surname="Briscoe">
<organization>BT</organization>
</author>
<date day="31" month="October" year="2011" />
</front>
<seriesInfo name="Internet-Draft"
value="draft-ietf-conex-abstract-mech-03" />
<format target="http://www.ietf.org/internet-drafts/draft-ietf-conex-abstract-mech-03.txt"
type="TXT" />
</reference>
<reference anchor="conex-destopt">
<front>
<title>IPv6 Destination Option for Conex</title>
<author fullname="Suresh Krishnan" initials="S" surname="Krishnan">
<organization></organization>
</author>
<author fullname="Mirja Kuehlewind" initials="M"
surname="Kuehlewind">
<organization></organization>
</author>
<author fullname="Carlos Ucendo" initials="C" surname="Ucendo">
<organization></organization>
</author>
<date day="30" month="October" year="2011" />
<abstract>
<t>Conex is a mechanism by which senders inform the network about
the congestion encountered by packets earlier in the same flow.
This document specifies an IPv6 destination option that is capable
of carrying conex markings in IPv6 datagrams.</t>
</abstract>
</front>
<seriesInfo name="Internet-Draft" value="draft-ietf-conex-destopt-01" />
<format target="http://www.ietf.org/internet-drafts/draft-ietf-conex-destopt-01.txt"
type="TXT" />
</reference>
<reference anchor="intarea-ipv4-id-reuse">
<front>
<title>Reusing the IPv4 Identification Field in Atomic
Packets</title>
<author fullname="Bob Briscoe" initials="B" surname="Briscoe">
<organization></organization>
</author>
<date day="12" month="March" year="2012" />
<abstract>
<t>This specification takes a new approach to extensibility that
is both principled and a hack. It builds on recent moves to
formalise the increasingly common practice where fragmentation in
IPv4 more closely matches that of IPv6. The large majority of IPv4
packets are now 'atomic', meaning indivisible. In such packets,
the 16 bits of the IPv4 Identification (IPv4 ID) field are
redundant and could be freed up for the Internet community to put
to other uses, at least within the constraints imposed by their
original use for reassembly. This specification defines the
process for redefining the semantics of these bits. It uses the
previously reserved control flag in the IPv4 header to indicate
that these 16 bits have new semantics. Great care is taken
throughout to ease incremental deployment, even in the presence of
middleboxes that incorrectly discard or normalise packets that
have the reserved control flag set.</t>
</abstract>
</front>
<seriesInfo name="Internet-Draft"
value="draft-briscoe-intarea-ipv4-id-reuse-01" />
<format target="http://www.ietf.org/internet-drafts/draft-briscoe-intarea-ipv4-id-reuse-01.txt"
type="TXT" />
</reference>
<reference anchor="CongPol"
target="http://bobbriscoe.net/projects/refb/#polfree">
<front>
<title>Policing Freedom to Use the Internet Resource Pool</title>
<author fullname="Arnaud Jacquet" initials="A" surname="Jacquet">
<organization>BT</organization>
</author>
<author fullname="Bob Briscoe" initials="B" surname="Briscoe">
<organization>BT &amp; UCL</organization>
</author>
<author fullname="Toby Moncaster" initials="T" surname="Moncaster">
<organization>BT</organization>
</author>
<date month="December" year="2008" />
</front>
<seriesInfo name="Proc ACM Workshop on Re-Architecting the Internet (ReArch'08)"
value="" />
<format target="http://www.bobbriscoe.net/projects/2020comms/refb/policer_rearch08.pdf"
type="PDF" />
</reference>
<reference anchor="Seawall"
target="http://research.microsoft.com/en-us/projects/seawall/">
<front>
<title>Seawall: Performance Isolation in Cloud Datacenter
Networks</title>
<author fullname="Alan Shieh" initials="A" surname="Shieh">
<organization>Microsoft and Cornell Uni</organization>
</author>
<author fullname="Srikanth Kandula" initials="S" surname="Kandula">
<organization>Microsoft</organization>
</author>
<author fullname="Albert Greenberg" initials="A" surname="Greenberg">
<organization>Microsoft</organization>
</author>
<author fullname="Changhoon Kim" initials="C" surname="Kim">
<organization>Microsoft</organization>
</author>
<date month="June" year="2010" />
</front>
<seriesInfo name="Proc 2nd USENIX Workshop on Hot Topics in Cloud Computing"
value="" />
<format target="http://www.usenix.org/event/nsdi11/tech/full_papers/Shieh.pdf"
type="PDF" />
</reference>
<reference anchor="DRL"
target="http://doi.acm.org/10.1145/1282427.1282419">
<front>
<title>Cloud control with distributed rate limiting</title>
<author fullname="Barath Raghavan" initials="B" surname="Raghavan">
<organization></organization>
</author>
<author fullname="Kashi Vishwanath" initials="K"
surname="Vishwanath">
<organization></organization>
</author>
<author fullname="Sriram Ramabhadran" initials="S"
surname="Ramabhadran">
<organization></organization>
</author>
<author fullname="Kenneth Yocum" initials="K" surname="Yocum">
<organization></organization>
</author>
<author fullname="Alex Snoeren" initials="A" surname="Snoeren">
<organization></organization>
</author>
<date month="" year="2007" />
</front>
<seriesInfo name="ACM SIGCOMM CCR" value="37(4)337--348" />
<format target="http://doi.acm.org/10.1145/1282427.1282419" type="PDF" />
</reference>
<reference anchor="conex-dc_tr">
<front>
<title>Network Performance Isolation in Data Centres by Congestion
Exposure to Edge Policers</title>
<author fullname="Bob" surname="Briscoe">
<organization>BT</organization>
</author>
<date day="03" month="November" year="2011" />
</front>
<seriesInfo name="BT Technical Report" value="TR-DES8-2011-004" />
<annotation>Work in progress</annotation>
</reference>
<reference anchor="nvgre">
<front>
<title>NVGRE: Network Virtualization using Generic Routing
Encapsulation</title>
<author fullname="Murari Sridhavan" initials="M" surname="Sridhavan">
<organization></organization>
</author>
<author fullname="Albert Greenberg" initials="A" surname="Greenberg">
<organization></organization>
</author>
<author fullname="Narasimhan Venkataramaiah" initials="N"
surname="Venkataramaiah">
<organization></organization>
</author>
<author fullname="Yu-Shun Wang" initials="Y" surname="Wang">
<organization></organization>
</author>
<author fullname="Kenneth Duda" initials="K" surname="Duda">
<organization></organization>
</author>
<author fullname="Ilango Ganga" initials="I" surname="Ganga">
<organization></organization>
</author>
<author fullname="Geng Lin" initials="G" surname="Lin">
<organization></organization>
</author>
<author fullname="Mark Pearson" initials="M" surname="Pearson">
<organization></organization>
</author>
<author fullname="Patricia Thaler" initials="P" surname="Thaler">
<organization></organization>
</author>
<author fullname="Chait Tumuluri" initials="C" surname="Tumuluri">
<organization></organization>
</author>
<date day="8" month="July" year="2012" />
<abstract>
<t>This document describes the usage of Generic Routing
Encapsulation (GRE) header for Network Virtualization, called
NVGRE, in multi-tenant datacenters. Network Virtualization
decouples virtual networks and addresses from physical network
infrastructure, providing isolation and concurrency between
multiple virtual networks on the same physical network
infrastructure. This document also introduces a Network
Virtualization framework to illustrate the use cases, but the
focus is on specifying the data plane aspect of NVGRE.</t>
</abstract>
</front>
<seriesInfo name="Internet-Draft"
value="draft-sridharan-virtualization-nvgre-01" />
<format target="http://www.ietf.org/internet-drafts/draft-sridharan-virtualization-nvgre-01.txt"
type="TXT" />
</reference>
</references>
<section title="Summary of Changes between Drafts">
<t>Detailed changes are available from
http://tools.ietf.org/html/draft-briscoe-conex-data-centre</t>
<t><list style="hanging">
<t
hangText="From draft-briscoe-conex-initial-deploy-02 to draft-briscoe-conex-data-centre-00:"><list
style="symbols">
<t>Split off data-centre scenario as a separate document, by
popular request.</t>
</list></t>
</list></t>
</section>
</back>
</rfc>