<?xml version="1.0" encoding="US-ASCII"?>
<?xml-stylesheet type='text/xsl' href='http://xml.resource.org/authoring/rfc2629.xslt' ?>
<?rfc toc="yes" ?>
<!-- Default toc="no" No Table of Contents -->
<?rfc symrefs="yes" ?>
<!-- Default symrefs="no" Don't use anchors, but use numbers for refs -->
<?rfc sortrefs="yes" ?>
<!-- Default sortrefs="no" Don't sort references into order -->
<?rfc compact="yes" ?>
<!-- Default compact="no" Start sections on new pages -->
<?rfc strict="no" ?>
<!-- Default strict="no" Don't check I-D nits -->
<?rfc rfcedstyle="yes" ?>
<!-- Default rfcedstyle="yes" attempt to closely follow finer details from the latest observable RFC-Editor style -->
<?rfc linkmailto="yes" ?>
<!-- Default linkmailto="yes" generate mailto: URL, as appropriate -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!ENTITY I-D.ietf-conex-abstract-mech PUBLIC "" "http://xml.resource.org/public/rfc/bibxml3/reference.I-D.ietf-conex-abstract-mech.xml">
<!ENTITY I-D.briscoe-conex-initial-deploy PUBLIC "" "http://xml.resource.org/public/rfc/bibxml3/reference.I-D.briscoe-conex-initial-deploy.xml">
<!ENTITY I-D.ietf-conex-mobile PUBLIC "" "http://xml.resource.org/public/rfc/bibxml3/reference.I-D.ietf-conex-mobile.xml">
<!ENTITY I-D.briscoe-conex-data-centre PUBLIC "" "http://xml.resource.org/public/rfc/bibxml3/reference.I-D.briscoe-conex-data-centre.xml">
]>
<rfc category="info" docName="draft-briscoe-conex-policing-00"
ipr="trust200902">
<front>
<title abbrev="Congestion Policing">Network Performance Isolation using
Congestion Policing</title>
<author fullname="Bob Briscoe" initials="B." surname="Briscoe">
<organization>BT</organization>
<address>
<postal>
<street>B54/77, Adastral Park</street>
<street>Martlesham Heath</street>
<city>Ipswich</city>
<code>IP5 3RE</code>
<country>UK</country>
</postal>
<phone>+44 1473 645196</phone>
<email>bob.briscoe@bt.com</email>
<uri>http://bobbriscoe.net/</uri>
</address>
</author>
<date day="18" month="February" year="2013"/>
<area>Transport Area</area>
<workgroup>ConEx</workgroup>
<keyword>Internet-Draft</keyword>
<abstract>
<t>This document describes why policing using congestion information can
isolate users from network performance degradation due to each other's
usage, but without losing the multiplexing benefits of a LAN-style
network where anyone can use any amount of any resource. Extensive
numerical examples and diagrams are given. The document is agnostic to
      how the congestion information reaches the policer. The Congestion
      Exposure (ConEx) protocol is recommended, but other tunnel feedback
mechanisms have been proposed.</t>
</abstract>
</front>
<middle>
<!-- ====================================================================== -->
<section anchor="copo_intro" title="Introduction">
<t>This document is informative, not normative. It describes why
congestion policing isolates the network performance of different
parties who are using a network, and why it is more useful than other
forms of performance isolation such as peak and committed information
rate policing, weighted round-robin or weighted fair-queueing.</t>
<t>The main point is that congestion policing isolates performance even
when the load applied by everyone is highly variable, where variation
may be over time, or by spreading traffic across different paths (the
hose model). Unlike other isolation techniques, congestion policing is
      an edge mechanism that works over a pool of links across a whole network,
not just at one link. If the load from everyone happens to be constant,
and there is a single bottleneck, congestion policing gives a nearly
identical outcome to weighted round-robin or weighted fair-queuing at
that bottleneck. But otherwise, congestion policing is a far more
general solution. </t>
<t>The document complements others that describe various network
arrangements that could use congestion policing, such as in a broadband
access network <xref target="I-D.briscoe-conex-initial-deploy"/>, in a
mobile access network <xref target="I-D.ietf-conex-mobile"/> or in a
data centre network <xref target="I-D.briscoe-conex-data-centre"/>.
Nonetheless, all the numerical examples use the data centre
scenario.</t>
<t>The key to the solution is the use of congestion-bit-rate rather than
bit-rate as the policing metric. <spanx style="emph">How </spanx>this
      works is very simple and quick to describe (<xref
      target="copo_Congestion_Policer"/>).</t>
      <t>However, it is much more difficult to understand <spanx style="emph">why</spanx>
      this approach provides performance isolation, and in particular why it
      provides isolation across a network of links, even though
      there is apparently no isolation mechanism in each link. <xref
target="copo_Intuition"/> builds up an intuition for why the approach
works, and why other approaches fall down in different ways. The
explanation builds as follows:<list style="symbols">
          <t>Starting with the simple case of long-running flows focused on any
one bottleneck link in the network, tenants get weighted shares of
the link, much like weighted round robin, but with no mechanism in
any of the links;</t>
<t>In the more realistic case where flows are not all long-running
but a mix of short to very long, it is explained that bit-rate is
not a sufficient metric for isolating performance; how often a
tenant is <spanx style="emph">not</spanx> sending is the significant
factor for performance isolation, not whether bit-rate is shared
equally whenever a source happens to be sending;</t>
<t>Although it might seem that data volume would be a good measure
of how often a tenant does not send, we then show that a tenant can
send a large volume of data but hardly affect the performance of
others — by being more responsive to congestion. Using
congestion-volume (congestion-bit-rate over time) in a policer
encourages large data senders to do this, giving other tenants much
higher performance while hardly affecting their own performance.
          In contrast, using straight volume as an allocation metric makes no
          distinction between high volume sources that yield and high volume
sources that do not yield (the widespread behaviour today);</t>
<t>We then show that a policer based on the congestion-bit-rate
metric works across a network of links treating it as a pool of
capacity, whereas other approaches treat each link independently,
which is why the proposed approach requires none of the
configuration complexity on switches that is involved in other
approaches.</t>
<t>We also show that a congestion policer can be arranged to limit
bursts of congestion from sources that focus traffic onto a single
link, even where one source may consist of a large aggregate of
sources.</t>
<t>We show that a congestion policer rewards traffic that shifts to
less congested paths (e.g. multipath TCP or virtual machine
motion).</t>
<t>We show that congestion policing works on the pool of links,
irrespective of whether individual links have significantly
different capacities.</t>
<t>We show that a congestion policer allows a wide variety of
responses to congestion (e.g. New Reno TCP <xref target="RFC5681"/>,
Cubic TCP, Compound TCP, Data Centre TCP <xref target="DCTCP"/> and
even unresponsive UDP traffic), while still encouraging and
enforcing a sufficient response to congestion from all sources taken
together, so that the performance of each application is
sufficiently isolated from others.</t>
</list></t>
</section>
<section anchor="copo_Congestion_Policer"
title="Example Bulk Congestion Policer">
<t>In order to explain why a congestion policer works, we first describe
a particular design of congestion policer, called a bulk congestion
policer. The aim is to give a concrete example to motivate the protocol
changes necessary to implement such a policer. This is merely an
informative document, so there is no intention to standardise this or
any specific congestion policing algorithm. The aim is for different
implementers to be free to develop their own improvements to this
design.</t>
<t>A number of other documents describe deployment arrangements that
include a congestion policer, for instance in a broadband access network
<xref target="I-D.briscoe-conex-initial-deploy"/>, in a mobile access
network <xref target="I-D.ietf-conex-mobile"/> and in a data centre
network <xref target="I-D.briscoe-conex-data-centre"/>. The congestion
policer described in the present document could be deployed in any of
these arrangements. However, to be concrete, the example link
capacities, topology and terminology used in the present document assume
a multi-tenant data centre scenario <xref
target="I-D.briscoe-conex-data-centre"/>. It is believed that similar
examples could be drawn up for the other scenarios, just using different
numbers for link capacities, etc and replacing some terminology, e.g.
substituting 'end-customer' for 'tenant' throughout.</t>
<t><xref target="copo_Fig_CongPol"/> illustrates a bulk congestion
policer. It is very similar to a regular token bucket for policing
bit-rate except the tokens do not represent bits, rather they represent
'congestion-bits'. Using Congestion Exposure (ConEx) <xref
target="I-D.ietf-conex-abstract-mech"/> is probably the easiest way to
understand what congestion-bits are. However, there are other valid ways
to signal congestion information to a congestion policer without ConEx
(e.g. tunnelled feedback described in <xref
target="I-D.briscoe-conex-data-centre"/>.</t>
<t>Using ConEx as an example, a ConEx-based congestion token bucket
ignores any packets unless they are ConEx-marked. The ConEx protocol
includes audit capabilities that force the sender to reveal information
it receives about losses anywhere on the downstream path (e.g. at the
points marked X in <xref target="copo_Fig_CongPol"/>). The sender
signals this information with a ConEx-marking in the IP header of the
packets it sends into the network, using an arrangement called
re-inserted feedback or re-feedback. So, every time the sender detects a
loss of a 1500B packet, it has to apply a ConEx-marking to a subsequent
1500B packet (or it has to ConEx-mark at least as many bytes made up of
smaller packets). In <xref target="copo_Fig_CongPol"/> ConEx-marked
packets are shown tagged with a '*'.</t>
<t>A bulk congestion policer is deployed at every point where traffic
enters an operator's network (similar to the Diffserv architecture). At
any one point of entry, the operator assigns a congestion token bucket
to each tenant that is sending traffic into the network. The operator
assigns a contracted token-fill-rate w1 to tenant 1, w2 to tenant 2, and
in general wi to tenant with index i, where i = 1,2,... This contracted
token-fill-rate, wi, can be thought of like a weight, where a higher
weight allows a tenant to contribute more traffic through a downstream
congested link or to contribute traffic into more congested links, as we
shall see.</t>
<t>Tenants wanting to send more traffic can be assigned a higher weight
(probably paying more for it). Typically, tenants might sign-up to a
traffic contract in two parts: <list style="numbers">
<t>a peak rate, which represents the capacity of their access to the
network</t>
          <t>a congestion token-fill rate, which limits how much of a
          tenant's traffic can restrict the traffic of others.</t>
</list></t>
<t>In <xref target="copo_Fig_CongPol"/>, as each packet arrives, the
classifier assigns it to the congestion token bucket of the relevant
tenant. The token bucket ignores all non-ConEx-marked packets - they do
not drain any tokens from the bucket. Otherwise, if a packet is
      ConEx-marked and, say, 1500B in size it will drain 1500B of tokens from
the congestion token bucket. If the tenant is contributing to more
congestion than its contracted congestion-fill-rate, the bucket level
will decrease, and if it persists, the bucket level will become low
enough to cause the policer to start discarding a proportion of all that
tenant's traffic (whether ConEx-marked or not).</t>
<figure anchor="copo_Fig_CongPol"
title="Bulk Congestion Policer Schematic">
<artwork><![CDATA[ ______________________
| | | | Legend |
|w1 |w2 |wi | |
| | | | [_] [_]packet stream |
V V V | |
congestion . . . | [*] marked packet |
token bucket| . | | . | __|___| | ___ |
__|___| | . | | |:::| | \ / policer |
| |:::| __|___| | |:::| | /_\ |
| +---+ | +---+ | +---+ | |
bucket depth | : | : | : | /\ marking meter |
controls the | . | : | . | \/ |
policer _V_ . | : | . |______________________|
____\ /__/\___________________________ downstream
/[*] /_\ \/ [_] | : [_] | : [_] \ /->network
class-/ | . | . \ / /--->
ifier/ _V_ . | . \ / /
__,--.__________________\ /__/\___________________\______/____/ loss
`--' [_] [*] [_] /_\ \/ [_] | . [*] / \ \-X--->
\ | . / \-->
\ _V_ : / \ loss
\__________________________\ /__/\_____/ \-X--->
[_] [*] [_] [*] [_] /_\ \/ [_] \
\--->
]]></artwork>
</figure>
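      <t>To make the bucket mechanics concrete, the following minimal sketch
      (in Python, purely illustrative; all names are invented for this
      example, and no particular implementation is implied by this document)
      shows how a congestion token bucket might fill at the contracted rate
      wi, drain only on ConEx-marked bytes, and ramp up policing as it
      approaches empty:</t>
      <figure>
        <artwork><![CDATA[
import random

class CongestionTokenBucket:
    """Illustrative bulk congestion token bucket (not a specification)."""

    def __init__(self, fill_rate_bps, depth_bits):
        self.fill_rate = fill_rate_bps  # contracted fill-rate, wi
        self.depth = depth_bits         # bucket depth
        self.level = depth_bits         # start full
        self.last = 0.0                 # time of last update (seconds)

    def _update(self, now):
        # tokens accumulate at the contracted rate, capped at the depth
        self.level = min(self.depth,
                         self.level + self.fill_rate * (now - self.last))
        self.last = now

    def packet(self, now, size_bits, conex_marked):
        self._update(now)
        if conex_marked:
            # only ConEx-marked packets drain tokens
            self.level = max(0.0, self.level - size_bits)
        # discard probability ramps up over, say, the bottom tenth of
        # the bucket (the exact ramp is a free design choice)
        threshold = 0.1 * self.depth
        if self.level < threshold and \
                random.random() > self.level / threshold:
            return 'discard'   # applied to marked and unmarked alike
        return 'forward'
]]></artwork>
      </figure>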
<t>Note that this is called a bulk congestion policer because it polices
all the flows of a tenant even if they spread out in a hose pattern
across numerous downstream networks and even if only a few of these
flows are primarily to blame for contributing heavily to downstream
congestion, while others find plenty of capacity downstream. This
approach is motivated by simplicity rather than precision (if precision
were required other policer designs could be used). Simplicity might be
more important than precision if capacity were generally
well-provisioned and the policer was intended to intervene only rarely
to limit unusually extreme behaviour. Also, a network might take such a
simple but somewhat heavy-handed approach in order to motivate tenants
to implement more complex flow-specific controls themselves <xref
target="CongPol"/>.</t>
<t>Various more sophisticated congestion policer designs have been
evaluated <xref target="CPolTrilogyExp"/>. In these experiments, it was
found that it is better if the policer gradually increases discards as
the bucket becomes empty. Also isolation between tenants is better if
each tenant is policed based on the combination of two buckets, not one
(<xref target="copo_Dual_CTB"/>):<list style="numbers">
<t>A deep bucket (that would take minutes or even hours to fill at
the contracted fill-rate) that constrains the tenant's long-term
average rate of congestion (wi)</t>
<t>a very shallow bucket (e.g. only two or three MTU) that is filled
considerably faster than the deep bucket (c * wi), where c = ~10,
which prevents a tenant storing up a large backlog of tokens then
causing congestion in one large burst.</t>
</list>In this arrangement each marked packet drains tokens from both
buckets, and the probability of policer discard is taken as the worse of
the two buckets.</t>
<figure anchor="copo_Dual_CTB"
title="Dual Congestion Token Bucket (in place of each single bucket in the previous figure)">
<artwork><![CDATA[ | |
Legend: |c*wi |wi
See previous figure V V
. .
. | . | deep bucket
________________________|___|
| . |:::|
|______________|___| |:::|
| shallow +---+ +---+
worse of the| bucket
two buckets| \____ ____/
triggers| \ / both buckets
policing V : drained by
___ . marked packets
___________\ /___________________/ \__________________
[_] [_] /_\ [_] [*] [_] \ / [_] [_] [_]
]]></artwork>
</figure>
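      <t>Continuing the sketch above, the dual arrangement might combine the
      two buckets as follows (again hypothetical code, assuming the
      CongestionTokenBucket class from the earlier sketch; the depth and c
      values are merely the example figures from the text):</t>
      <figure>
        <artwork><![CDATA[
class DualCongestionPolicer:
    """Illustrative dual bucket: deep for the long-term average,
    shallow for bursts; policing acts on the worse of the two."""

    def __init__(self, wi_bps, c=10, mtu_bits=12000):
        # deep bucket: takes about an hour to fill at rate wi
        self.deep = CongestionTokenBucket(wi_bps, 3600 * wi_bps)
        # shallow bucket: ~3 MTU deep, filled c times faster
        self.shallow = CongestionTokenBucket(c * wi_bps, 3 * mtu_bits)

    def packet(self, now, size_bits, conex_marked):
        # each marked packet drains tokens from both buckets
        deep = self.deep.packet(now, size_bits, conex_marked)
        shallow = self.shallow.packet(now, size_bits, conex_marked)
        # act on the worse outcome of the two buckets
        return 'discard' if 'discard' in (deep, shallow) else 'forward'
]]></artwork>
      </figure>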
<t>This design of congestion policer will be sufficient to understand
the rest of the document. However, it should be remembered that this is
only an example, not an ideal design. Also it should be remembered that
other mechanisms such as tunnel feedback can be used instead of
ConEx.</t>
</section>
<!-- ====================================================================== -->
<section anchor="copo_Intuition"
title="Network Performance Isolation: Intuition">
<section anchor="copo_Problem" title="The Problem">
<t>Network performance isolation traditionally meant that each user
could be sure of a minimum guaranteed bit-rate. Such assurances are
useful if traffic from each tenant follows relatively predictable
paths and is fairly constant. If traffic demand is more dynamic and
unpredictable (both over time and across paths), minimum bit-rate
assurances can still be given, but they have to be very small relative
to the available capacity, because a large number of users might all
        want to simultaneously share any one link, even though they rarely
all use it at the same time.</t>
<t>This either means the shared capacity has to be greatly
overprovided so that the assured level is large enough, or the assured
level has to be small. The former is unnecessarily expensive; the
latter doesn't really give a sufficiently useful assurance.</t>
<t>Round robin or fair queuing are other forms of isolation that
guarantee that each user will get 1/N of the capacity of each link,
where N is the number of active users at each link. This is fine if
the number of active users (N) sharing a link is fairly predictable.
However, if large numbers of tenants do not typically share any one
link but at any time they all could (as in a data centre), a 1/N
assurance is fairly worthless. Again, given N is typically small but
could be very large, either the shared capacity has to be expensively
overprovided, or the assured bit-rate has to be worthlessly small. The
        argument is no different for the weighted forms of these algorithms
        (WRR & WFQ).</t>
<t>Both these traditional forms of isolation try to give one tenant
assured instantaneous bit-rate by constraining the instantaneous
bit-rate of everyone else. This approach is flawed except in the
special case when the load from every tenant on every link is
continuous and fairly constant. The reality is usually very different:
sources are on-off and the route taken varies, so that on any one link
a source is more often off than on.<list style="empty">
<t>In these more realistic (non-constant) scenarios, the capacity
available for any one tenant depends much more on <spanx
style="emph">how often</spanx> everyone else uses a link, not just
<spanx style="emph">how much</spanx> bit-rate everyone else would
be entitled to if they did use it.</t>
</list></t>
<t>For instance, if 100 tenants are using a 1Gb/s link for 1% of the
time, there is a good chance each will get the full 1Gb/s link
capacity. But if just six of those tenants suddenly start using the
link 50% of the time, whenever the other 94 tenants need the link,
they will typically find 3 of these heavier tenants using it already.
If a 1/N approach like round-robin were used, then the light tenants
        would suddenly get 1/4 * 1Gb/s = 250Mb/s on average. Round-robin
cannot claim to isolate tenants from each other if they usually get
1Gb/s but sometimes they get 250Mb/s (and only 10Mb/s guaranteed in
the worst case when all 100 tenants are active).</t>
<t>In contrast, congestion policing is the key to network performance
isolation because it focuses policing only on those tenants that go
fast over congested path(s) excessively and persistently over time.
This keeps congestion below a design threshold everywhere so that
everyone else can go fast. In this way, congestion policing takes
account of highly variable loads (varying in time and varying across
routes). And, if everyone's load happens to be constant, congestion
policing converges on the same outcome as the traditional forms of
isolation.</t>
<t>The other flaw in the traditional approaches to isolation like WRR
& WFQ is that they actually prevent long-running flows from
yielding to brief bursts from lighter tenants. A long-running flow can
yield to brief flows and still complete nearly as soon as it would
have otherwise (the brief flows complete sooner, freeing up the
capacity for the longer flow sooner—see <xref
target="copo_wcc"/>). However, WRR & WFQ prevent flows from even
seeing the congestion signals that would allow them to co-ordinate
between themselves, because they isolate each tenant completely into
separate queues.</t>
<t>In summary, superficially, traditional approaches with separate
queues sound good for isolation, but:<list style="numbers">
<t>not when everyone's load is variable, so each tenant has no
assurance about how many other queues there will be;</t>
<t>and not when each tenant can no longer even see the congestion
signals from other tenants, so no-one's control algorithms can
determine whether they would benefit most by pushing harder or
yielding.</t>
</list></t>
</section>
<section anchor="copo_Approach" title="Approach">
        <t>It has not been easy to find a way to give the intuition on why
        congestion policing isolates performance, particularly across a
        network of links, not just on a single link. The approach used in this
        section is to describe the system as if everyone is using the
        congestion response they would be forced to use if congestion policing
        had to intervene. We therefore call this the boundary model of
        congestion control. It is a very simple congestion response, so it is
        much easier to understand than if we introduced all the square root
        terms and other complexity of New Reno TCP's response. And it means we
        don't have to try to describe a mix of responses.</t>
<t>The purpose of congestion policing is not to intervene in
        everyone's rate control all the time. Rather it is to encourage each
tenant to <spanx style="emph">avoid</spanx> being policed — to
keep the aggregate of all their flows' rates within an overall
envelope that is sufficiently responsive to congestion. Nonetheless,
congestion policing can and will enforce a congestion response if a
particular tenant sends traffic that is completely unresponsive to
congestion. This upper bound enforced by the congestion policer is
what ensures that each tenant's minimum performance is isolated from
the combined effect of everyone else.</t>
<t>We cannot emphasise enough that the intention is not to make
individual flows conform to this boundary response to congestion.
Indeed the intention is to allow a diverse evolving mix of congestion
responses, but constrained in total within a simple overall
envelope.</t>
<t>After describing and further justifying using the simple boundary
model of congestion control, we start by considering long-running
flows sharing one link. Then we will consider on-off traffic, before
widening the scope from one link to a network of links and to links of
different sizes. Then we will depart from the initial simplified model
of congestion control and consider diverse congestion control
algorithms, including completely unresponsive end-systems.</t>
<t>Formal analysis to back-up the intuition provided by this section
will be made available in a more extensive companion technical report
<xref target="conex-dc_tr"/>.</t>
</section>
<section anchor="copo_Terminology" title="Terminology">
        <t>The term congestion probability will be used as a generalisation of
either loss probability or ECN marking probability.</t>
<t>It is necessary to carefully distinguish:<list style="symbols">
<t>congestion bit-rate (or just congestion rate), which is an
absolute measure of the rate of congested bits</t>
<t>congestion probability, which is a relative measure of the
proportion of congested bits to all bits</t>
</list>Sometimes people loosely talk of loss rate when they mean
loss probability (measured in %). In this document, we will keep to
strict terminology, so loss rate would be measured in lost packets per
second, or lost bits per second.</t>
</section>
<section anchor="copo_Initial_CC_Model"
title="Simple Boundary Model of Congestion Control">
<t>The boundary model of congestion control ensures a flow's bit-rate
is inversely proportional to the congestion level that it detects. For
instance, if congestion probability doubles, the flow's bit-rate
halves. This is called a scalable congestion control because it
maintains the same rate of congestion signals (marked or dropped
packets) no matter how fast it goes. Examples from the research
community are Relentless TCP <xref target="Relentless"/> and Scalable
TCP <xref target="STCP"/>.</t>
<t>In production systems, TCP algorithms based on New Reno <xref
target="RFC5681"/> have been widely replaced by alternatives closer to
this scalable ideal (e.g. Cubic TCP in Linux and Compound TCP in
Windows), because at high rates New Reno generated congestion signals
too infrequently to track available capacity fast enough <xref
target="RFC3649"/>. More recent TCP updates (e.g. Data Centre TCP
<xref target="DCTCP"/> in Windows 8 and available in Linux) are
becoming closer still to the scalable ideal.</t>
<t>For instance, consider a scenario where a flow with scalable
congestion control is alone in a 1Gb/s link, then another similar flow
from another tenant joins it. Both will push up the congestion
probability, which will push down their rates until they together fit
        into the link. Because the flow's rate has to halve to accommodate the
        new flow, congestion probability will double (let's say from 0.002% to
        0.004%), by our initial assumption of two similar scalable
congestion controls. When it is alone on the link, the
congestion-bit-rate of the flow is 20kb/s (=1Gb/s * 0.002%), and when
it shares the link it is still 20kb/s (= 500Mb/s * 0.004%).</t>
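        <t>This invariant can be checked with a line of arithmetic (an
        illustrative sketch only, using the example figures above):</t>
        <figure>
          <artwork><![CDATA[
def congestion_bit_rate(bit_rate_bps, congestion_probability):
    # congestion-bit-rate = bit-rate x congestion probability
    return bit_rate_bps * congestion_probability

assert round(congestion_bit_rate(1e9, 0.00002)) == 20000    # alone: 20kb/s
assert round(congestion_bit_rate(500e6, 0.00004)) == 20000  # sharing: 20kb/s
]]></artwork>
        </figure>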
<t>In summary, a congestion control can be considered scalable if the
bit-rate of packets carrying congestion signals (the
congestion-bit-rate) always stays the same no matter how much capacity
it finds available. This ensures there will always be enough signals
in a round trip time to keep the dynamics under control.</t>
<t>Reminder: Making individual flows conform to this boundary or
scalable response to congestion is a non-goal. Although we start this
explanation with this specific simple end-system congestion response,
this is just to aid intuition.</t>
</section>
<section anchor="copo_long-running" title="Long-Running Flows">
<t><xref target="copo_Tab-long_flows"/> shows various scenarios where
each of five tenants has contracted for 40kb/s of congestion-bit-rate
in order to share a 1Gb/s link. In order to help intuition, we start
with the (unlikely) scenario where all their flows are long-running. A
number of long-running flows will try to use all the link capacity, so
for simplicity we take utilisation as a round figure of 100%.</t>
        <t>In scenario A, only tenants (a) and (b) are active, and neither
        tenant's policer is intervening at all, because both their congestion
        allowances are 40kb/s and each sends only one flow that contributes
        20kb/s of congestion — half the allowance.</t>
<texttable anchor="copo_Tab-long_flows"
title="Bit-rates that a congestion policer allocates to five tenants sharing a 1Gb/s link with various numbers (#) of long-running flows all using 'scalable congestion control'">
<ttcol align="right">
Tenant</ttcol>
<ttcol>contracted congestion- bit-rate kb/s</ttcol>
<ttcol>scenario A
# : Mb/s</ttcol>
<ttcol>scenario B
# : Mb/s</ttcol>
<ttcol>scenario C
# : Mb/s</ttcol>
<ttcol>scenario D
# : Mb/s</ttcol>
<c/>
<c/>
<c/>
<c/>
<c/>
<c/>
<c>(a)</c>
<c>40</c>
<c>1 : 500</c>
<c>5 : 250</c>
<c>5 : 200</c>
<c>5 : 250</c>
<c>(b)</c>
<c>40</c>
<c>1 : 500</c>
<c>3 : 250</c>
<c>3 : 200</c>
<c>2 : 250</c>
<c>(c)</c>
<c>40</c>
<c>- : ---</c>
<c>3 : 250</c>
<c>3 : 200</c>
<c>2 : 250</c>
<c>(d)</c>
<c>40</c>
<c>- : ---</c>
<c>2 : 250</c>
<c>2 : 200</c>
<c>1 : 125</c>
<c>(e)</c>
<c>40</c>
<c>- : ---</c>
<c>- : ---</c>
<c>2 : 200</c>
<c>1 : 125</c>
<c/>
<c>Congestion probability</c>
<c> 0.004%</c>
<c> 0.016%</c>
<c> 0.02%</c>
<c> 0.016%</c>
</texttable>
<t>Scenario B shows a case where four of the tenants all send 2 or
more long-running flows. Recall, from the assumption of scalable
controls, that each flow always contributes 20kb/s no matter how fast
it goes. Therefore the policers of tenants (a-c) limit them to two
flows-worth of congestion (2 x 20kb/s = 40kb/s). Tenant (d) is only
asking for 2 flows, so it gets them without being policed, and all
four get the same quarter share of the link.</t>
<t>Scenario C is similar, except the fifth tenant (e) joins in, so
they all get equal 1/5 shares of the link.</t>
<t>In Scenario D, only tenant (a) sends more than two flows, so (a)'s
policer limits it to two flows-worth of congestion, and everyone else
gets the number of flows-worth that they ask for. This means that
tenants (d&e) get less than everyone else, which is fine because
they asked for less than they would have been allowed. (Similarly, in
Scenarios A & B, some of the tenants are inactive, so they get
zero, which is also less than they could have had if they had
wanted.)</t>
<t>With lots of long-running flows, as in scenarios B & C,
congestion policing seems to emulate per-tenant round robin
scheduling, equalising the bit-rate of each tenant, no matter how many
flows they run. By configuring different contracted allowances for
each tenant, it can easily be seen that congestion policing could
emulate weighted round robin (WRR), with the relative sizes of the
allowances acting as the weights.</t>
<t>Scenario D departs from round-robin. This is deliberate, the idea
being that tenants are free to take less than their share in the short
term, which allows them to take more at other times, as we will see in
<xref target="copo_wcc"/>. In Scenario D, policing focuses only on the
tenant (a) that is continually exceeding its contract. This policer
focuses discard solely on tenant a's traffic so that it cannot cause
any more congestion at the shared link (shown as 0.016% in the last
row).</t>
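        <t>The outcomes in <xref target="copo_Tab-long_flows"/> can be
        reproduced with a simple model, sketched below under the assumptions
        already stated (every flow is long-running and scalable with w =
        20kb/s, and the policer effectively caps each tenant at allowance/w
        flows-worth of congestion); the code is illustrative only:</t>
        <figure>
          <artwork><![CDATA[
W_FLOW = 20e3      # congestion-bit-rate per scalable flow (b/s)
ALLOWANCE = 40e3   # contracted congestion-bit-rate per tenant (b/s)
CAPACITY = 1e9     # link capacity (b/s)

def shares(flows_per_tenant):
    # each tenant is capped at ALLOWANCE/W_FLOW flows-worth (here 2)
    eff = [min(n, ALLOWANCE / W_FLOW) for n in flows_per_tenant]
    rates = [CAPACITY * e / sum(eff) for e in eff]   # b/s per tenant
    congestion_prob = sum(eff) * W_FLOW / CAPACITY
    return rates, congestion_prob

print(shares([1, 1]))            # scenario A: 500Mb/s each, 0.004%
print(shares([5, 3, 3, 2]))      # scenario B: 250Mb/s each, 0.016%
print(shares([5, 3, 3, 2, 2]))   # scenario C: 200Mb/s each, 0.02%
print(shares([5, 2, 2, 1, 1]))   # scenario D: 250,250,250,125,125Mb/s
]]></artwork>
        </figure>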
<t>To summarise so far, ingress congestion policers control
congestion-bit-rate in order to indirectly assure a minimum bit-rate
per tenant. With lots of long-running flows, the outcome is somewhat
similar to WRR, but without the need for any mechanism in each
queue.</t>
<t>However, aiming to behave like WRR is only useful when all flows
are infinitely long. When load is variable because transfers are of
finite size, congestion policing properly isolates one tenant's
        performance from the behaviour of others — something WRR cannot do,
        as will now be explained.</t>
</section>
<section anchor="copo_On-Off_Flows" title="On-Off Flows">
<t><xref target="copo_Fig_on-off"/> compares two example scenarios
where tenant 'b' regularly sends small files in the top chart and the
same size files but more often in the bottom chart (a higher 'on-off
ratio'). This is the typical behaviour of a Web server when more
clients request more files at peak time. Meanwhile, in this example,
tenant c's behaviour doesn't change between the two scenarios -- it
sends a couple of large files, each starting at the same time in both
cases.</t>
<t>The capacity of the link that 'b' and 'c' share is shown as the
full height of the plot. The files sent by 'b' are shown as little
rectangles. 'b' can go at the full bit-rate of the link when 'c' is
not sending, which is represented by the tall thin rectangles part-way
through the trace labelled 'b'. We assume for simplicity that 'b' and
'c' divide up the bit-rate equally. So, when both 'b' and 'c' are
sending, the 'b' rectangles are half the height (bit-rate) and twice
the duration relative to when 'b' sends alone. The area of a file to
be transferred stays the same, whether tall and thin or short and fat,
because the area represents the size of the file (bit-rate x duration
= file size). The files from 'c' look like inverted castellations,
because 'c' uses half the link rate while each file from 'b'
completes, then 'c' can fill the link until 'b' starts the next file.
The cross-hatched areas represent idle times when no-one is
sending.</t>
<t>For this simple scenario we ignore start-up dynamics and just focus
on the rate and duration of flows that are long enough to stabilise,
which is why they can be represented as simple rectangles. We will
introduce the effect of flow startups later.</t>
<t>In the bottom case, where 'b' sends more often, the gaps between
b's transfers are smaller, so 'c' has less opportunity to use the
whole line rate. This squeezes out the time it takes for 'c' to
complete its file transfers (recall a file will always have the same
area which represents its size). Although 'c' finishes later, it still
starts the next flow at the same time. In turn, this means 'c' is
        sending during a greater proportion of b's transfers, which extends
b's average completion time too.</t>
<?rfc needLines="23"?>
<figure anchor="copo_Fig_on-off"
title="In the lower case, the on-off ratio of 'b' has increased, which extends all the completion times of 'c' and 'b'">
<artwork><![CDATA[
^ bit-rate
|
|---------------------------------------,--.---------,--.----,-------
| | |\/\/\/\/\| |/\/\|
| c | b|/\/\/\/\/| b|\/\/| c
|------. ,-----. ,-----. | |\/\/\/\/\| |/\/\| ,--
| b | | b | | b | | |/\/\/\/\/| |\/\/| | b
| | | | | | | |\/\/\/\/\| |/\/\| |
+------'------'-----'------'-----'------'--'---------'--'----'----'-->
time
^ bit-rate
|
|---------------------------------------------------.--,--.--,-------
| |/\| |\/|
| c |\/| b|/\| c
|------. ,-----. ,-----. ,-----. ,-----. ,-----./\| |\/| ,----
| b | | b | | b | | b | | b | | b |\/| |/\| | b
| | | | | | | | | | | |/\| |\/| |
+------'--'-----'--'-----'--'-----'--'-----'--'-----'--'--'--'--'---->
time
]]></artwork>
</figure>
<t>Round-robin would do little if anything to isolate 'c' from the
effect of 'b' sending files more often. Round-robin is designed to
force 'b' and 'c' to share the capacity equally when they are both
active. But in both scenarios they already share capacity equally when
they are both active. The difference is in how often they are active.
Round-robin and other traditional fair queuing techniques don't have
any memory to sense that 'b' has been active more of the time.</t>
<t>In contrast, a congestion policer can tell when one tenant is
sending files more frequently, by measuring the rate at which the
tenant is contributing to congestion. Our aim is to show that policers
will be able to isolate performance properly by using the right metric
(congestion bit-rate), rather than using the wrong metric (bit-rate),
which doesn't sense whether the load over time is large or small.</t>
<section anchor="copo_on-off-numerical-no-policer"
title="Numerical Examples Without Policing">
<t>The usefulness of the congestion bit-rate metric will now be
illustrated with the numerical examples in <xref
target="copo_Tab_on-off"/>. The scenarios illustrate what the
congestion bit-rate would be without any policing or scheduling
action in the network. Then this metric can be monitored and limited
by a policer, to prevent one tenant from harming the performance of
others.</t>
<t>The 2nd & 3rd columns (file-size and inter-arrival time)
fully represent the behaviour of each tenant in each scenario. All
the other columns merely characterise the outcome in various ways.
The inter-arrival time (T) is the average time between starting one
file and the next. For instance, tenant 'b' sends a 16Mb file every
200ms on average. The formula in the heading of some columns shows
how the column was derived from other columns.</t>
<t>Scenario E is contrived so that the three tenants all offer the
same load to the network, even though they send files of very
different size (S). The files sent by tenant 'a' are 100 times
smaller than those of tenant 'b', but 'a' sends them 100 times more
often. In turn, b's files are 100 times smaller than c's, but 'b' in
turn sends them 100 times more often. Graphically, the scenario
would look similar to <xref target="copo_Fig_on-off"/>, except with
three sizes of file, not just two. Scenarios E-G are designed to
roughly represent various distributions of file sizes found in data
centres, but still to be simple enough to facilitate intuition, even
though each tenant would not normally send just one size file.</t>
<t>The average completion time (t) and the maximum were calculated
          from a fairly simple analytical model (documented in a companion
technical report <xref target="conex-dc_tr"/>). Using one data point
as an example, it can be seen that a 1600Mb (200MB) file from tenant
'c' completes in 1905ms (about 1.9s). The files that are 100 times
smaller complete 100 times more quickly on average. In fact, in this
scenario with equal loads, each tenant perceives that their files
are being transferred at the same rate of 840Mb/s on average
(file-size divided by completion time, as shown in the apparent
          bit-rate column). Thus all three tenants perceive they
          are getting 84% of the 1Gb/s link on average (due to the benefit of
multiplexing and utilisation being low at 240Mb/s / 1Gb/s = 24% in
this case).</t>
<t>The completion times of the smaller files vary significantly,
depending on whether a larger file transfer is proceeding at the
same time. We have already seen this effect in <xref
target="copo_Fig_on-off"/>, where, when tenant b's files share with
'c', they take twice as long to complete as when they don't. This is
why the maximum completion time is greater than the average for the
small files, whereas there is imperceptible variance for the largest
files.</t>
<t>The final column shows how congestion bit-rate will be a useful
metric to enforce performance isolation (as we said, the figures
illustrate the situation before any enforcement mechanism is added).
In the case of equal loads (scenario E), average congestion
bit-rates are all equal. In scenarios F and G average congestion
bit-rates are higher, because all tenants are placing much more load
on the network over time, even though each still sends at equal
rates to others when they are active together. <xref
target="copo_Fig_on-off"/> illustrated a similar effect in the
difference between the top and bottom scenarios.</t>
<t>The maximum instantaneous congestion bit-rate is nearly always
20kb/s. That is because, by definition, all the tenants are using
scalable congestion controls with a constant congestion rate of
20kb/s, and each tenant's flows rarely overlap because the
per-tenant load in these scenarios is fairly low. As we saw in <xref
target="copo_Initial_CC_Model"/>, the congestion rate of a
particular scalable congestion control is always the same, no matter
how many other flows it competes with.</t>
          <t>Once it is understood that the congestion bit-rate of one
          scalable flow is always 'w' whenever the flow is active, it
          becomes clear what the congestion bit-rate will be when
averaged over time; it will simply be 'w' multiplied by the
proportion of time that the tenant's file transfers are active. That
is, w*t/T. For instance, in scenario E, on average tenant b's flows
start 200ms apart, but they complete in 19ms. So they are active for
19/200 = 10% of the time (rounded). A tenant that causes a
congestion bit-rate of 20kb/s for 10% of the time will have an
average congestion-bit-rate of 2kb/s, as shown.</t>
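          <t>As a quick illustrative check of this w*t/T formula (a sketch
          only, using tenant 'b' in scenario E of <xref
          target="copo_Tab_on-off"/>):</t>
          <figure>
            <artwork><![CDATA[
def avg_congestion_rate(w_bps, completion_ms, interarrival_ms):
    # active for t/T of the time, causing w while active
    return w_bps * completion_ms / interarrival_ms

# tenant 'b', scenario E: w = 20kb/s, t = 19ms, T = 200ms
print(avg_congestion_rate(20e3, 19, 200))   # 1900 b/s, i.e. ~2kb/s
]]></artwork>
          </figure>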
<t>To summarise so far, no matter how many more files other tenants
transfer at the same time, each scalable flow still contributes to
congestion at the same rate, but it contributes for more of the
time, because it squeezes out into the gap before the next flow from
that tenant starts.</t>
<?rfc needLines="30"?>
<texttable anchor="copo_Tab_on-off"
title="How the effect on others of various file-transfer behaviours can be measured by the resulting congestion-bit-rate">
<ttcol align="right"> Ten
ant</ttcol>
<ttcol align="right">File size
S Mb</ttcol>
<ttcol align="right">Ave. inter- arr- ival
T
ms</ttcol>
<ttcol align="right">Ave. load
S/T
Mb/s</ttcol>
<ttcol align="right">Completion time
ave : max
t
ms </ttcol>
<ttcol align="center">Apparent bit-rate
ave : min S/t
Mb/s</ttcol>
<ttcol align="center">Congest- ion bit-rate
ave : max
w*t/T
kb/s</ttcol>
<!-- . . -->
<c/>
<c/>
<c/>
<c/>
<c>Scenario E</c>
<c/>
<c/>
<!-- . . -->
<c>a</c>
<c>0.16</c>
<c>2</c>
<c>80</c>
<c>0.19 : 0.48</c>
<c>840 : 333</c>
<c> 2 : 20</c>
<!-- . . -->
<c>b</c>
<c>16</c>
<c>200</c>
<c>80</c>
<c>19 : 35</c>
<c>840 : 460</c>
<c> 2 : 20</c>
<!-- . . -->
<c>c</c>
<c>1600</c>
<c>20000</c>
<c>80</c>
<c>1905 : 1905</c>
<c>840 : 840</c>
<c> 2 : 20</c>
<!-- . . -->
<c/>
<c/>
<c/>
<c>____</c>
<c/>
<c/>
<c/>
<!-- . . -->
<c/>
<c/>
<c/>
<c>240</c>
<c/>
<c/>
<c/>
<!-- . . -->
<c/>
<c/>
<c/>
<c/>
<c>Scenario F</c>
<c/>
<c/>
<!-- . . -->
<c>a</c>
<c>0.16</c>
<c>0.67</c>
<c>240</c>
<c>0.31 : 0.48</c>
<c>516 : 333</c>
<c> 9 : 20</c>
<!-- . . -->
<c>b</c>
<c>16</c>
<c>50</c>
<c>320</c>
<c>29 : 42</c>
<c>557 : 380</c>
<c> 11 : 20</c>
<!-- . . -->
<c>c</c>
<c>1600</c>
<c>10000</c>
<c>160</c>
<c>3636 : 3636</c>
<c>440 : 440</c>
<c> 7 : 20</c>
<!-- . . -->
<c/>
<c/>
<c/>
<c>____</c>
<c/>
<c/>
<c/>
<!-- . . -->
<c/>
<c/>
<c/>
<c>720</c>
<c/>
<c/>
<c/>
<!-- . . -->
<c/>
<c/>
<c/>
<c/>
<c>Scenario G</c>
<c/>
<c/>
<!-- . . -->
<c>a</c>
<c>0.16</c>
<c>0.67</c>
<c>240</c>
<c>0.33 : 0.64</c>
<c>481 : 250</c>
<c> 10 : 20</c>
<!-- . . -->
<c>b</c>
<c>16</c>
<c>40</c>
<c>400</c>
<c>32 : 46</c>
<c>505 : 345</c>
<c> 16 : 40</c>
<!-- . . -->
<c>c</c>
<c>1600</c>
<c>10000</c>
<c>160</c>
<c>4543 : 4543</c>
<c>352 : 352</c>
<c> 9 : 20</c>
<!-- . . -->
<c/>
<c/>
<c/>
<c>____</c>
<c/>
<c/>
<c/>
<!-- . . -->
<c/>
<c/>
<c/>
<c>800</c>
<c/>
<c/>
<c/>
<postamble>Single link of capacity 1Gb/s. Each tenant uses a
scalable congestion control which contributes a
congestion-bit-rate for each flow of w = 20kb/s.</postamble>
</texttable>
<t>In scenario F, clients have increased the rate they request files
from tenants a, b and c respectively by 3x, 4x and 2x relative to
scenario E. The tenants send the same size files but 3x, 4x and 2x
more often. For instance tenant 'b' is sending 16Mb files four times
as often as before, and they now take longer as well -- nearly 29ms
rather than 19ms -- because the other tenants are active more often
too, so completion gets squeezed to later. Consequently, tenant 'b'
is now sending 57% of the time, so its congestion-bit-rate is 20kb/s
* 57% = 11kb/s. This is nearly 6x higher than in scenario E,
reflecting both b's own increase by 4x and that this increase
coincides with everyone else increasing their load.</t>
<t>In scenario G, tenant 'b' increases even more, to 5x the load it
offered in scenario E. This results in average utilisation of
800Mb/s / 1Gb/s = 80%, compared to 72% in scenario F and only 24% in
scenario E. 'b' sends the same files but 5x more often, so its load
rises 5x.</t>
<t>Completion times rise for everyone due to the overall rise in
load, but the congestion rates of 'a' and 'c' don't rise anything
like as much as that of 'b', because they still leave large gaps
between files. For instance, tenant 'c' completes each large file
transfer in 4.5s (compared to 1.9s in scenario E), but it still only
sends files every 10s. So 'c' only sends 45% of the time, which is
reflected in its congestion bit-rate of 20kb/s * 45% = 9kb/s.</t>
<t>In contrast, on average tenant 'b' can only complete each
medium-sized file transfer in 32ms (compared to 19ms in scenario E),
but on average it starts sending another file after 40ms. So 'b'
sends 79% of the time, which is reflected in its congestion bit-rate
of 20kb/s * 79% = 16kb/s (rounded).</t>
<t>However, during the 45% of the time that 'c' sends a large file,
b's completion time is higher than average (as shown in <xref
target="copo_Fig_on-off"/>). In fact, as shown in the maximum
completion time column, 'b' completes in 46ms, but it starts sending
a new file after 40ms, which is before the previous one has
completed. Therefore, during each of c's large files, 'b' sends
          46/40 = 115% of the time on average.</t>
          <t>This actually means 'b' is overlapping two files for 15% of the
          time on average and sending one file for the remaining 85%. Whenever
two file transfers overlap, 'b' will be causing 2 x 20kb/s = 40kb/s
of congestion, which explains why tenant b in scenario G is the only
case with a maximum congestion rate of 40kb/s rather than 20kb/s as
in every other case. Over the duration of c's large files, 'b' would
          therefore cause congestion at an average rate of 20kb/s * 85% +
          40kb/s * 15% = 23kb/s (or more simply 20kb/s * 115% = 23kb/s). Of
course, when 'c' is not sending a large file, 'b' will contribute
less to congestion, which is why its average congestion rate is
16kb/s overall, as discussed earlier.</t>
<!--The reason the maximum averaged congestion-bit-rate is higher with small files is simply because of the way it is defined. The congestion rate is averaged over the time between one file and the next, then the maximum is taken for each tenant. For short inter-arrival times, this maximum will occur only in the worst case when all tenants are transferring files, while for longer inter-arrival times, there will be more of a mix of rates of congestion during the time over which it is averaged.-->
</section>
<section title="Congestion Policing of On-Off Flows">
<t>Still referring to the numerical examples in <xref
target="copo_Tab_on-off"/>, we will now discuss the effect of
limiting each tenant with a congestion policer.</t>
<t>The network operator might have deployed congestion policers to
cap each tenant's average congestion rate to 16kb/s. None of the
tenants are exceeding this limit in any of the scenarios, but tenant
'b' is just shy of it in scenario G. Therefore all the tenants would
be free to behave in all sorts of ways like those of scenarios E-G,
but they would be prevented from degrading the performance of the
other tenants beyond the point reached by tenant 'b' in scenario G.
If tenant 'b' added more load, the policer would prevent the extra
load entering the network by focusing drop solely on tenant 'b',
preventing the other tenants from experiencing any more congestion
due to tenant 'b'. Then tenants 'a' and 'c' would be assured the
(average) apparent bit-rates shown, whatever the behaviour of
'b'.</t>
<t>If 'a' added more load, 'c' would not suffer. Instead 'b' would
go over limit and its rate would be trimmed during congestion peaks,
sacrificing some of its lead to 'a'. Similarly, if 'c' added more
load, 'b' would be made to sacrifice some of its performance, so
that 'a' would not suffer. Further, if more tenants arrived to share
the same link, the policer would force 'b' to sacrifice performance
in favour of the additional tenants.</t>
<t>There is nothing special about a policer limit of 16kb/s. The
example when discussing infinite flows used a limit of 40kb/s per
tenant. And some tenants can be given higher limits than others
(e.g. at an additional charge). If the operator gives out congestion
limits that together add up to a higher amount but it doesn't
increase the link capacity, it merely allows the tenants to apply
more load (e.g. more files of the same size in the same time), but
each with lower bit-rate.</t>
<!--{ToDo: Consider adding a scenario with much lower load from one tenant with small flows than another with large flows
and comparing with WRR}
{ToDo: Then discuss min bit-rates}-->
</section>
</section>
<section anchor="copo_wcc" title="Weighted Congestion Controls">
<t>At high speed, congestion controls such as Cubic TCP, Data Centre
TCP, Compound TCP etc all contribute to congestion at widely differing
rates, which is called their 'aggressiveness' or 'weight'. So far, we
have made the simplifying assumption of a scalable congestion control
algorithm that contributes to congestion at a constant rate of w =
20kb/s. We now assume tenant 'c' uses a similar congestion control to
before, but with different parameters in the algorithm so that its
weight is still constant, but much lower, at w = 2.2kb/s.</t>
<t>Tenant 'b' still uses w = 20kb/s for its smaller files, so when the
two compete for the 1Gb/s link, they will share it in proportion to
their weights, 20:2.2 (or 90%:10%). That is, when competing, 'b' and
'c' will respectively get (20/22.2)*1Gb/s = 900Mb/s and
(2.2/22.2)*1Gb/s = 100Mb/s of the 1Gb/s link. <xref
target="copo_Fig_weighted"/> shows the situation before (upper) and
after (lower) this change.</t>
<t>When the two compete, 'b' transfers each file 9/5 faster than
before (900Mb/s rather than 500Mb/s), so it completes them in 5/9 of
the time. 'b' still contributes congestion at the same rate of 20kb/s,
        but for only 5/9 of the time. Therefore, relative to before, 'b'
        uses up its allowance only 5/9 as quickly.</t>
<t>Even though 'b' completes each small file much faster, 'c' should
complete its file transfer hardly any later than before. Although
tenant 'b' goes faster, as each 'b' file finishes, it gets out of the
way sooner. Irrespective of its lower weight, 'c' can still use the
full link when 'b' is not present, because weights only determine
shares between active flows. Because 'c' has the whole link sooner it
can catch up to the same point it would have reached in its download
if 'b' had been slower but longer. Tenant 'c' will probably lose some
completion time because it has to accelerate and decelerate more. But,
        whenever it is sending a file, 'c' gains (20kb/s - 2.2kb/s) = ~18kb of
allowance every second, which it can use for other transfers.</t>
<figure anchor="copo_Fig_weighted"
title="Weighted congestion controls with equal weights (upper) and unequal (lower)">
<artwork><![CDATA[
^ bit-rate
|
|---------------------------------------------------.--,--.--,-------
| |/\| |\/|
| c |\/| b|/\| c
|------. ,-----. ,-----. ,-----. ,-----. ,-----./\| |\/| ,----
| b | | b | | b | | b | | b | | b |\/| |/\| | b
| | | | | | | | | | | |/\| |\/| |
+------'--'-----'--'-----'--'-----'--'-----'--'-----'--'--'--'--'---->
time
^ bit-rate
|
|---------------------------------------------------.--,--.--,-------
|---. ,---. ,---. ,---. ,---. ,---. |/\| |\/| ,---.
| | | | | | c | | | | | | |\/| b|/\| c| |
| | | | | | | | | | | | |/\| |\/| | |
| b | | b | | b | | b | | b | | b | |\/| |/\| | b |
| | | | | | | | | | | | |/\| |\/| | |
+---'-----'---'----'---'----'---'----'---'----'---'-'--'--'--'--'---'>
time
]]></artwork>
</figure>
<t>It seems too good to be true that both tenants gain so much and
lose so little by 'c' reducing its aggressiveness. The gains are
unlikely to be as perfect as this simple model predicts, but we
believe they will be nearly as substantial.</t>
<t>It might seem that everyone can keep gaining by everyone agreeing
to reduce their weights, ad infinitum. However, the lower the weight,
        the fewer signals the congestion control gets, so it starts to lose its
control during dynamics. Nonetheless, congestion policing should
encourage congestion control designs to keep reducing their weights,
but they will have to stop when they reach the minimum necessary
congestion in order to maintain sufficient control signals.</t>
<!--{ToDo: Consider adding a concrete scenario with specified flow arrival rates and say three simultaneous larger flows on average.}-->
</section>
<section anchor="copo_Network_of_Links" title="A Network of Links">
<t>So far we have only considered a single link. Congestion policing
at the network edge is designed to work across a network of links,
treating them all as a pool of resources, as we shall now explain. We
will use the dual-homed data centre topology shown in <xref
target="copo_Fig_dual-homed"/> (stretching the bounds of ASCII art) as
a simple example of a pool of resources.</t>
        <t>In this case there are 48 servers (H1, H2, ... Hn where n=48)
on the left, with on average 8 virtual machines (VMs) running on each
(e.g. server n is running Vn1, Vn2, ... to Vnm where m = 8). Each
server is connected by two 1Gb/s links, one to each top-of-rack switch
S1 & S2. To the right of the switches, there are 6 links of 10Gb/s
each, connecting onwards to customer networks or to the rest of the
data centre. There is a total of 48 *2 *1Gb/s = 96Gb/s capacity
between the 48 servers and the 2 switches, but there is only 6 *10Gb/s
= 60Gb/s to the right of the switches. Nonetheless, data centres are
often designed with some level of contention like this, because at the
ToR switches a proportion of the traffic from certain hosts turns
round locally towards other hosts in the same rack.</t>
<figure align="center" anchor="copo_Fig_dual-homed"
title="Dual-Homed Topology -- a Simple Resource Pool">
<artwork><![CDATA[
virtual hosts switches
machines
V11 V12 V1m __/
* * ... * H1 ,-.__________+--+__/
\___\__ __\____/`-' __-|S1|____,--
`. _ ,' ,'| |_______
. H2 ,-._,`. ,' +--+
. . `-'._ `.
. . `,' `.
. ,' `-. `.+--+_______
Vn1 Vn2 Vnm / `-_|S2|____
* * ... * Hn ,-.__________| |__ `--
\___\__ __\____/`-' +--+ \__
\
]]></artwork>
</figure>
<t>The congestion policer proposed in this document is based on the
'hose' model, where a tenant's congestion allowance can be used for
sending data over any path, including many paths at once. Therefore,
any one of the virtual machines on the left can use its allowance to
contribute to congestion on any or all of the 6 links on the right (or
any other link in the diagram actually, including those from the
server to the switches and those turning back to other hosts).</t>
<t>Nonetheless, if congestion policers are to enforce performance
isolation, they should stop one tenant squeezing the capacity
available to another tenant who needs to use a particular bottleneck
link or links (e.g. to reach a particular receiver). Policing should
work whether the offending tenant is acting deliberately or merely
carelessly.</t>
<section anchor="copo_Focused_Load"
title="Numerical Example: Isolation from Focused Load">
<t>Consider the following scenario: two tenants have equal
congestion allowances of 40kb/s, and let's imagine they each run
servers that send about 1 flow per second. All their flows use the
same scalable congestion control and all use the same aggressiveness
or weight of 20kb/s.</t>
<t>Tenant A has clients all over the network, so it spreads its
traffic over numerous links, and its flows usually complete within 1
second. Therefore, on average, A has one flow running at any one
instant, so it consumes 20kb/s from its 40kb/s congestion
allowance.</t>
<t>Then tenant B decides (carelessly or deliberately) to concentrate
all its flows into one of the links that A sometimes uses for one of
its flows. All the flows through this new bottleneck (A's, B's and
those from other tenants) will still each consume 20kb/s of
allowance while they are active (<xref
target="copo_On-Off_Flows"/>), but they will all take longer to
complete. Let's say they take 15 times longer to complete on average
(i.e. 15s not 1s). And let's assume 5% of tenant A's flows pass
through this link. Then each tenant will consume congestion
allowance at:<list style="symbols">
<t>Tenant A: (95% * 1 * 20kb/s) + (5% * 15 * 20kb/s) =
34kb/s</t>
<t>Tenant B: (100% * 15 * 20kb/s) = 300kb/s</t>
</list></t>
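          <t>The arithmetic above amounts to the following sketch (purely
          illustrative; the 15x completion-time stretch and 5% path share are
          the assumptions from the scenario):</t>
          <figure>
            <artwork><![CDATA[
W = 20e3   # congestion-bit-rate of each active flow (b/s)

# tenant A: 95% of flows still take 1s; 5% take 15x longer
tenant_a = 0.95 * 1 * W + 0.05 * 15 * W   # = 34e3 b/s
# tenant B: all flows focused through the bottleneck
tenant_b = 1.00 * 15 * W                  # = 300e3 b/s

print(tenant_a, tenant_b)    # 34kb/s and 300kb/s
print(tenant_b / 40e3)       # bucket drains 7.5x faster than it fills
]]></artwork>
          </figure>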
<t>Tenant B's congestion token bucket will therefore drain 7.5 times
faster than it is filling (300 / 40 = 7.5) so it will rapidly empty
and start dropping some of tenant B's traffic before it enters the
network. In fact, once tenant B's bucket is empty, it will be
limited to no more than 40kb/s of congestion at the bottleneck link,
rather than 300kb/s otherwise. In effect, the policer shifts packet
          drops in excess of the 40kb/s allowance out of the shared link and
into itself, to focus excess discards only on tenant B.</t>
<t>Once tenant B's bucket empties, tenant A's completion time for
the flows through this bottleneck would not return to 1s, but it
would be limited to about 2s (not 15s), and it would consume its
          congestion allowance at about 21kb/s, well within its limit of
40kb/s.</t>
</section>
<section anchor="copo_Short_Term" title="Isolation in the Short-Term">
<t>It seems problematic that tenant B's congestion policer takes
time to empty before it kicks in, during which time other tenants
suffer reduced performance. This is deliberate—to allow some
give and take over time. Nonetheless, it is possible to limit this
problem as well, using the dual token bucket already described in
<xref target="copo_Fig_dual-homed"/>. The second shallow bucket sets
an immediate limit on how much tenant B can harm the performance of
others. The fill-rate of the shallow bucket is c times that of the
deeper bucket (let's say c = 8), so tenant B is immediately limited
to 8 * 40kb/s = 320kb/s of congestion. In the scenario described
above, tenant B only reaches 300kb/s but at least this backstop
prevents congestion from tenant B ever exceeding 320kb/s.</t>
</section>
<section anchor="copo_Load_Balancing"
title="Encouraging Load Balancing">
<t>A policer may not even have to directly intervene for tenants to
be protected; it might encourage load balancing to remove the
problem first. Load balancing might either be provided by the
network (usually just random), or some of the tenants might
themselves actively shift traffic off the increasingly congested
bottleneck and onto other paths. Some of them might be using the
multipath TCP protocol (MPTCP — see experimental <xref
target="RFC6356"/>) that would achieve this automatically, or
ultimately if the congestion signals persisted, automatic VM
placement algorithms might shift their virtual machine to a
different endpoint to circumvent the congestion hot-spot completely.
Even if some tenants were not using MPTCP or could not shift easily,
others shifting away would achieve the same outcome. Essentially,
the deterrent effect of congestion policers encourages everyone to
even out congestion across the network, shifting load away from hot
spots. Then performance isolation becomes an emergent property of
everyone's behaviour, due to the deterrent effect of policers,
rather than always through explicit policer intervention.</t>
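<t>The following toy model illustrates this emergent balancing. It
greedily moves one flow at a time from the most congested link to
the least congested one, roughly as a population of MPTCP senders
would. The function and parameter names are invented, and the
congestion model (p = weight * flows / capacity, as for the scalable
controls assumed throughout) is a deliberate simplification.</t>
<figure>
<artwork><![CDATA[
# Toy illustration of congestion balancing across a link pool.
# flows[i] is the number of equal flows on link i.

def rebalance(flows, capacity, weight=20e3, steps=100):
    for _ in range(steps):
        p = [weight * n / c for n, c in zip(flows, capacity)]
        hot, cool = p.index(max(p)), p.index(min(p))
        if p[hot] <= p[cool] or flows[hot] == 0:
            break                  # congestion already even
        flows[hot] -= 1            # move one flow off the
        flows[cool] += 1           # hot-spot to the coolest link
    return flows

print(rebalance([290, 5, 5], [40e9, 10e9, 10e9]))
# -> [200, 50, 50]: every link ends at the same congestion level
]]></artwork>
</figure>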
<t>In such an environment, a policer is needed that shares out the
whole pool of network resources, not one that just controls shares
of each link individually.</t>
<t>In contrast, enforcement mechanisms based on scheduling
algorithms like WRR or WFQ have to be deployed at each link, and
each one works in ignorance of how much of other links the tenant is
using. These algorithms were designed for networks with a single
known bottleneck per customer (e.g. many access networks). However,
in data centres there are many potential bottlenecks and each tenant
generally only uses a share of a small number of them. A mechanism
like WRR would not isolate anyone's performance if it gave every
tenant the right to use the same share of all the links in the
network, without regard to how many they were using.</t>
</section>
</section>
<section anchor="copo_Tenant_Diff_Sizes"
title="Tenants of Different Sizes">
<t>The scenario in <xref target="copo_Network_of_Links"/> assumed
tenants A and B had equal congestion allowances. If one tenant
bought a huge allowance (e.g. 500kb/s) while another tenant bought a
small allowance (e.g. 10kb/s), it would seem that the former could
focus all its traffic on one link to swamp the latter.</t>
<t>At this point, it should be pointed out that congestion
allowances do not remove the need for sound capacity planning. If
we further assume that tenant A has bought the same access capacity
as tenant B, then 50 times more congestion allowance implies that
tenant A expects to be provided with 50 times the contention ratio
of tenant B; that is, any single link they share will be big enough
for B to still have enough capacity if it gets one share for every
50 used by A.</t>
<t>However, there is still a problem in that A can store up congestion
allowance and cause much more than 500kb/s of congestion in one burst.
The dual token bucket arrangement (<xref
target="copo_Fig_dual-homed"/>) limits tenant A to c * 500kb/s of
congestion, but if c=8 as before, 8 * 500kb/s = 4Mb/s of congestion in
a burst would seriously swamp tenant B if all focused on one link
where tenant B was quietly consuming its 10kb/s allowance.</t>
<t>The answer to this problem is that the burst limit c can be smaller
for larger tenants, tending to 1 for the largest tenants (i.e. a
single shallow bucket would suffice for huge tenants). The reasoning
is that the c factor allows a tenant some give-and-take over time, but
the aggregate of traffic from a larger tenant consists of enough flows
that it has sufficient give-and-take within its own aggregate without
needing give-and-take from others.</t>
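<t>To illustrate, the sketch below tapers the burst factor c with
tenant size. The taper formula is purely hypothetical (this document
does not prescribe one); it merely shows c at about 8 for small
tenants tending towards 1 as the contracted fill-rate grows, which
shrinks the burst that a large tenant could focus on one link.</t>
<figure>
<artwork><![CDATA[
# Hypothetical taper of the burst factor c with tenant size.
# Nothing in this document prescribes this formula.

def burst_factor(fill, ref_fill=40e3, c_small=8.0):
    """c = c_small for small tenants, tending to 1 as fill grows."""
    return 1.0 + (c_small - 1.0) * min(1.0, ref_fill / fill)

for fill in (10e3, 40e3, 500e3):
    c = burst_factor(fill)
    print(fill / 1e3, "kb/s -> c =", round(c, 2),
          "-> burst limit =", round(c * fill / 1e3), "kb/s")
# 10 kb/s  -> c = 8.0  -> burst limit =  80 kb/s
# 40 kb/s  -> c = 8.0  -> burst limit = 320 kb/s
# 500 kb/s -> c = 1.56 -> burst limit = 780 kb/s
]]></artwork>
</figure>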
</section>
<section anchor="copo_Links_Diff_Sizes" title="Links of Different Sizes">
<t>Congestion policing treats a megabit per second of capacity in
one link as identical to a megabit per second of capacity in another
link, even if the size of
each link is different. For instance, consider the case where one of
the three links to the right of each switch in <xref
target="copo_Fig_dual-homed"/> were upgraded to 40Gb/s while the other
two remained at 10Gb/s (perhaps to accommodate the extra traffic from
a couple of the dual homed 1Gb/s servers being upgraded to dual-homed
10Gb/s).</t>
<t>A congestion control algorithm running at the same rate will cause
the same level of congestion probability, whatever size link it is
sharing. For example:<list style="symbols">
<t>If 50 equal flows share a 10Gb/s link (10Gb/s / 50 = 200Mb/s
each) they will cause 0.01% congestion probability;</t>
<t>If 200 of the same flows share a 40Gb/s link (40Gb/s / 200 =
200Mb/s each) they will still cause 0.01% congestion
probability;</t>
</list>This is because the congestion probability is determined by
the congestion control algorithms, not by the link.</t>
<t>Therefore, if an average of 300 flows were spread across the above
links (1 x 40Gb/s and 2 x 10Gb/s), the numbers on each link would
tend towards 200:50:50 respectively, so that each flow would get 200Mb/s
and each link would have 0.01% congestion on it. Sometimes, there
might be more flows on one link, resulting in less than 200Mb/s per
flow and congestion higher than 0.01%. However, whenever the
congestion level was greater on one link than another, congestion
policing would encourage flows to balance out the congestion level
across the links (as long as some flows could use congestion balancing
mechanisms like MPTCP).</t>
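<t>These numbers are easy to verify. With the scalable control
assumed throughout (per-flow rate x = v/p, so p = v/x), the sketch
below reproduces the 200:50:50 split, using the same 20kb/s weight
as in the earlier scenarios.</t>
<figure>
<artwork><![CDATA[
# Check: with flows split in proportion to capacity, every link
# sees the same per-flow rate and the same congestion level.

v = 20e3                   # per-flow weight [b/s]
links = [(40e9, 200), (10e9, 50), (10e9, 50)]  # (capacity, flows)

for capacity, n in links:
    x = capacity / n       # per-flow bit-rate [b/s]
    p = v / x              # congestion probability on this link
    print(f"{x/1e6:.0f} Mb/s per flow; {100*p:.2f}% congestion")
# every link: 200 Mb/s per flow; 0.01% congestion
]]></artwork>
</figure>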
<t>In summary, all the outcomes of congestion policing described so
far (emulating WRR, etc) apply across a pool of diverse link sizes
just as much as they apply to single links.</t>
</section>
<section anchor="Diverse_Algorithms"
title="Diverse Congestion Control Algorithms">
<t>Throughout this explanation we have assumed a scalable congestion
control algorithm, which we justified (<xref
target="copo_Initial_CC_Model"/>) as the 'boundary' case when
congestion policing has to intervene. That boundary case is all that
is relevant when considering whether the policer can enforce
performance isolation.</t>
<t>The defining difference between the scalable congestion we have
assumed and the congestion controls in widespread production operating
systems (New Reno, Compound, Cubic, Data Centre TCP etc) is the way
congestion probability decreases as flow-rate increases (for a
long-running flow). With a scalable congestion control, if flow-rate
doubles, congestion probability halves. Whereas, with most production
congestion controls, if flow-rate doubles, congestion probability
reduces to less than half. For instance, New Reno TCP reduces
congestion to a quarter. The responses of Cubic and Compound are
closer to the ideal scalable control than to New Reno's, but they
deliberately do not depart too far from New Reno, so that they can
still co-exist happily with it.</t>
<t>Congestion policing still works, whether or not the congestion
controls in daily use by tenants fit the scalable model. A bulk
congestion policer constrains the sum of all the congestion controls
being used by a tenant so that they collectively remain below an
aggregate envelope that is itself shaped like the sum of many scalable
algorithms. Bulk congestion policers will constrain the overall
congestion effect (the sum) of any mix of algorithms within that
envelope, including flows that are completely unresponsive to
congestion.</t>
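<t>In other words, the policer only needs to meter the tenant's
aggregate congestion-bit-rate, i.e. the sum over all flows of
bit-rate multiplied by congestion probability, and compare it with
the contracted fill-rate w. The sketch below illustrates this
envelope check; all the numbers in it are invented examples.</t>
<figure>
<artwork><![CDATA[
# A bulk policer is agnostic to the mix of algorithms: it sees
# only the aggregate congestion-bit-rate. The aggregate is
# sustainable iff it stays at or below the contracted fill-rate
# w, i.e. under the w/p 'Policed' envelope. Example values only.

w = 40e3                 # contracted fill-rate [b/s]
p = 0.0015               # prevailing congestion probability

flows = [                # (label, bit-rate [b/s]) -- any mix
    ("New Reno", 10e6),
    ("Cubic",    12e6),
    ("UDP, unresponsive", 2e6),
]
aggregate = sum(rate * p for _, rate in flows)
print(aggregate / 1e3, "kb/s of", w / 1e3, "kb/s allowance")
print("within envelope:", aggregate <= w)   # True (36 <= 40)
]]></artwork>
</figure>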
<t>This will now be explained using <xref
target="copo_Fig_Diff_Responses"/> (if the ASCII art is
incomprehensible, see a similar plot at Figure 3 of <xref
target="CongPol"/>).</t>
<figure anchor="copo_Fig_Diff_Responses"
title="Schematic Plot of Bulk Policing of Traffic with Different Congestion Responses">
<artwork><![CDATA[ bit-rate[b/s]
/ \
| * / / / / / / / / / / / / / / / / /
| +New Reno * Policed / / / / / / / / / / / / /
| + TCP * / / / / / / / / / / / / / / / /
| + * / / / / / / / / / / / / / / /
| - - - - - -+- - - - - - - - -*- - - - - - - - - - - - - - -
| + * / / / / / unresponsive/ /
| + * / / / / / / / / / / /
| + * / / / / / / / / / /
| +* / / / / / / /
| * / / /+/ /
| * / /
|____________________________________________________________\
0 ' 0.2%' ' 0.4%' 'congestion/
]]></artwork>
</figure>
<t>The congestion policer prevents the aggregate of a tenant's traffic
from operating in the hatched region shown in <xref
target="copo_Fig_Diff_Responses"/> (there is deliberately no vertical
scale, because the plot is schematic only). The tenant can have either
high congestion or high rate, but not both at the same time. It can be
seen that the single New Reno TCP flow responds to congestion in a
similar way, but with a flatter slope. It can also be seen that, at
0.25% prevailing congestion, this policer would allow roughly two
simultaneous New Reno TCP flows without intervening. At about 0.45%
congestion it would allow only one, and above 0.45% it would start to
police even one flow. But at much lower congestion levels typical in a
high-capacity data centre (e.g. 0.01%) the policer would allow
numerous New Reno TCP flows (extrapolating the curve upwards).</t>
<t>The horizontal plot represents an unresponsive (UDP) flow. The
congestion policer would allow one of these flows to run unimpeded up
to roughly 0.3% congestion, when the policer would effectively take
control and impose its own congestion response on this flow, dropping
more UDP packets for higher levels of prevailing congestion. If a
tenant ran two of these UDP flows (or one at twice the rate), the
policer would leave them alone unless prevailing congestion was above
about 0.22% when it would intervene (by extrapolation of the visible
plots).</t>
<t>The New Reno TCP curve follows the characteristic k/sqrt(p) rule,
while the 'Policed' curve follows a simple w/p rule, where k is the
New Reno constant and w is the constant contracted fill-rate of a
particular policer. A scalable TCP (not shown) would follow a v/p rule
where v is the weight of the individual flow. If we imagine a scenario
where w = 8v, the policer would allow 8 of these scalable TCP flows
(averaged over time), and this would be true at any congestion level,
because the combined curve of 8 scalable TCPs would exactly match
the 'Policed' curve.</t>
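<t>The three response rules can be compared numerically. In the
sketch below the constants k, w and v are arbitrary choices (w = 8v
as in the example above, and k picked so that the curves cross
roughly as in the schematic figure); none of these values is
prescribed anywhere in this document.</t>
<figure>
<artwork><![CDATA[
# How many flows fit under the w/p envelope at congestion p?
#   New Reno:  x = k/sqrt(p)  ->  n = w/(k*sqrt(p))
#   Scalable:  x = v/p        ->  n = w/v  (independent of p)
from math import sqrt

k = 1.6e6       # New Reno constant [b/s] (arbitrary choice)
v = 20e3        # scalable per-flow weight [b/s]
w = 8 * v       # policer fill-rate, w = 8v as in the text

for p in (0.0001, 0.0025, 0.0045):
    n_reno = w / (k * sqrt(p))
    print(f"p={100*p:.2f}%: {n_reno:.1f} New Reno flows"
          f" or {w/v:.0f} scalable flows")
# p=0.01%: 10.0 New Reno flows or 8 scalable flows
# p=0.25%: 2.0 New Reno flows or 8 scalable flows
# p=0.45%: 1.5 New Reno flows or 8 scalable flows
]]></artwork>
</figure>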
<t>So far, the discussion has focused on the congestion avoidance
phase of each congestion response. As each TCP flow starts, it
typically increases exponentially (TCP slow-start) until it overshoots
available capacity causing a spike of loss due to the feedback delay
over the following round-trip. The ConEx audit mechanism requires the
source to send credit markings in advance to cover the risk of these
spikes of loss. Credit marked ConEx packets drain tokens from the
congestion policer similarly to regular ConEx marked packets, which
reduces the available congestion allowance for other flows. If the
tenant starts new flows fast enough, these credit markings alone will
empty the token bucket so that the policer starts to discard some
packets from the start-up phase before they can enter the network.
(The tunnel feedback method for signalling congestion back to the
policer lacks such a credit mechanism; this is an important advantage
of the ConEx protocol, given that Internet traffic generally consists
of a high proportion of short flows.)</t>
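<t>In terms of the policer sketched earlier, credit handling is
simply a matter of which packets drain the bucket. The fragment
below is illustrative; the flag names are invented, and admit() is
the method from the dual-bucket sketch above.</t>
<figure>
<artwork><![CDATA[
# Credit-marked packets drain the congestion token bucket just
# like regular ConEx-marked packets; unmarked traffic passes
# without draining tokens. (Flag names invented; see the
# DualCongestionPolicer sketch earlier.)

def on_packet(policer, pkt_bits, conex_marked, credit_marked):
    if conex_marked or credit_marked:
        return policer.admit(pkt_bits)  # may discard when empty
    return True                         # forward unmarked packet
]]></artwork>
</figure>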
<t>A tenant could run a mix of multiple different types of TCP and UDP
flows, as long as, when all stacked up, the sum of them all still
fitted under the 'Policed' curve at the prevailing level of
congestion.</t>
</section>
</section>
<!-- ====================================================================== -->
<section anchor="copo_parameter" title="Parameter Setting">
<t>The primary parameter to set for each tenant is the contracted
fill-rate of their congestion token bucket. For a server sending
predominantly long-running flows (e.g. a video server or an indexing
robot filling search databases), this fill-rate can be determined from
the required number of simultaneous TCP flows, and the typical
congestion-bit-rate of one TCP flow.</t>
<t>For a scalable TCP flow, the congestion bit-rate is constant. But
currently (2013) TCP Cubic is predominant, and the congestion bit-rate
of one Cubic flow is not constant, but depends somewhat on bit-rate and
RTT. For example, a Cubic flow with 100Mb/s throughput and 50ms RTT
has a congestion-bit-rate of ~2kb/s, whereas at 200Mb/s and 5ms RTT, its
congestion-bit-rate is ~6kb/s. Therefore, a reasonable rule of thumb for
Cubic's congestion-bit-rate at rates and RTTs of about this order is
5kb/s.</t>
<t>Therefore, if a tenant wanted to serve no more on average than 12
simultaneous TCP Cubic flows, their contracted congestion-rate should be
roughly 12 * 5kb/s = 60kb/s.</t>
<t>The contracted fill rate of a congestion policer should not need to
change as flow-rates and link capacities scale, assuming the tenant
still utilises the higher capacity to the same extent.</t>
<t>A congestion policer based on the dual token bucket in <xref
target="copo_Dual_CTB"/> also needs to specify the c factor that
determines the maximum congestion-rate allowed, rather than the
average. A theoretical approach to setting this factor is being
worked on but, in the meantime, a figure of about 10 seems reasonable
for a tenant
sending only a few simultaneous flows (see <xref
target="copo_Short_Term"/>), while <xref
target="copo_Tenant_Diff_Sizes"/> explains that the c factor should tend
to reduce down to 1 for very large tenants.</t>
<t>The dual token bucket makes isolation fairly insensitive to the depth
of the deeper bucket, because the shallower bucket constrains the
congestion burst rate, which makes the congestion burst size less
important. It is believed that the deeper bucket depth could be minutes
or even hours of token fill. The depth of the shallower bucket should
be just a few MTU; more experimentation is needed to determine how
few.</t>
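<t>Gathering these rules of thumb into one place, the sketch below
derives a set of initial parameters from a target number of
simultaneous flows. Every constant in it (the 5kb/s per-Cubic-flow
figure, a c factor of about 10, an hour's worth of deep bucket, 3
MTU of shallow bucket) is an initial value for experimentation, as
discussed above, not a firm recommendation.</t>
<figure>
<artwork><![CDATA[
# Initial congestion-policer parameters for a tenant sending
# mostly long-running Cubic flows (experimental starting points
# only, per the discussion in this section).

MTU_BITS = 1500 * 8

def policer_params(n_flows, per_flow=5e3, c=10, shallow_mtus=3):
    fill = n_flows * per_flow           # contracted fill [b/s]
    return {
        "fill_rate": fill,              # 12 flows -> 60kb/s
        "shallow_fill_rate": c * fill,  # bounds the burst rate
        "deep_depth": fill * 3600,      # ~1 hour of token fill
        "shallow_depth": shallow_mtus * MTU_BITS,
    }

print(policer_params(12))
# {'fill_rate': 60000.0, 'shallow_fill_rate': 600000.0,
#  'deep_depth': 216000000.0, 'shallow_depth': 36000}
]]></artwork>
</figure>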
<t>Indeed, all the above recommendations on parameter setting are best
considered as initial values to be used in experiments to determine
whether other values would be better.</t>
<t>No recommendations can yet be given for parameter settings for a
tenant running a large proportion of short flows. This requires further
experimentation.</t>
</section>
<!-- ====================================================================== -->
<section anchor="copo_security" title="Security Considerations">
<t>This document is about performance isolation; therefore the whole
document concerns security. Where ConEx is used, the specification of
the ConEx Abstract Mechanism <xref
target="I-D.ietf-conex-abstract-mech"/> should be consulted for security
matters, particularly the design of audit to assure the integrity of
ConEx signals.</t>
</section>
<!-- ====================================================================== -->
<section anchor="copo_IANA" title="IANA Considerations">
<t>This document does not require actions by IANA.</t>
</section>
<!-- ====================================================================== -->
<section anchor="copo_conclusions" title="Conclusions">
<t>{ToDo}</t>
<t/>
</section>
<!-- ====================================================================== -->
<section title="Acknowledgments">
<t/>
</section>
<!-- ====================================================================== -->
</middle>
<back>
<references title="Informative References">
<?rfc include='reference.RFC.3649'?>
<?rfc include='reference.RFC.5681'?>
<?rfc include='reference.RFC.6356'?>
&I-D.ietf-conex-abstract-mech;
&I-D.briscoe-conex-initial-deploy;
&I-D.ietf-conex-mobile;
<reference anchor="I-D.briscoe-conex-data-centre">
<front>
<title>Network Performance Isolation in Data Centres using
Congestion Policing</title>
<author fullname="Bob Briscoe" initials="B" surname="Briscoe">
<organization/>
</author>
<author fullname="Murari Sridharan" initials="M" surname="Sridharan">
<organization/>
</author>
<date day="" month="February" year="2013"/>
<abstract>
<t>This document describes how a multi-tenant data centre operator
can isolate tenants from network performance degradation due to
each other's usage, but without losing the multiplexing benefits
of a LAN- style network where anyone can use any amount of any
resource. Zero per-tenant configuration and no implementation
change is required on network equipment. Instead the solution is
implemented with a simple change to the hypervisor (or container)
on each physical server, beneath the tenant's virtual machines.
These collectively enforce a very simple distributed contract - a
single network allowance that each tenant can allocate among their
virtual machines. The solution is simplest and most efficient
using layer-3 switches that support explicit congestion
notification (ECN) and if the sending operating system supports
congestion exposure (ConEx). Nonetheless, an arrangement is
described so that the operator can unilaterally deploy a complete
solution while operating systems are being incrementally upgraded
to support ConEx.</t>
</abstract>
</front>
<seriesInfo name="Internet-Draft"
value="draft-briscoe-conex-data-centre-01"/>
<format target="http://www.ietf.org/internet-drafts/draft-briscoe-conex-data-centre-01.txt"
type="TXT"/>
</reference>
<reference anchor="CongPol"
target="http://bobbriscoe.net/projects/refb/#polfree">
<front>
<title>Policing Freedom to Use the Internet Resource Pool</title>
<author fullname="Arnaud Jacquet" initials="A" surname="Jacquet">
<organization>BT</organization>
</author>
<author fullname="Bob Briscoe" initials="B" surname="Briscoe">
<organization>BT &amp; UCL</organization>
</author>
<author fullname="Toby Moncaster" initials="T" surname="Moncaster">
<organization>BT</organization>
</author>
<date month="December" year="2008"/>
</front>
<seriesInfo name="Proc ACM Workshop on Re-Architecting the Internet (ReArch'08)"
value=""/>
<format target="http://www.bobbriscoe.net/projects/2020comms/refb/policer_rearch08.pdf"
type="PDF"/>
</reference>
<reference anchor="Relentless"
target="http://www.hpcc.jp/pfldnet2009/Program.html">
<front>
<title>Relentless Congestion Control</title>
<author fullname="Matt Mathis" initials="M" surname="Mathis">
<organization>PSC</organization>
</author>
<date month="May" year="2009"/>
</front>
<seriesInfo name="Proc. Int'l Wkshp on Protocols for Future, Large-scale & Diverse Network Transports (PFLDNeT'09)"
value=""/>
<format target="http://www.hpcc.jp/pfldnet2009/Program_files/1569198525.pdf"
type="PDF"/>
</reference>
<reference anchor="STCP"
target="http://www-lce.eng.cam.ac.uk/~ctk21/scalable/">
<front>
<title>Scalable TCP: Improving Performance in Highspeed Wide Area
Networks</title>
<author fullname="Tom Kelly" initials="T" surname="Kelly">
<organization>Uni Cambridge</organization>
</author>
<date month="April" year="2003"/>
</front>
<seriesInfo name="ACM SIGCOMM Computer Communication Review"
value="32(2)"/>
</reference>
<reference anchor="DCTCP"
target="http://portal.acm.org/citation.cfm?id=1851192">
<front>
<title>Data Center TCP (DCTCP)</title>
<author fullname="Mohammad Alizadeh" initials="M" surname="Alizadeh">
<organization/>
</author>
<author fullname="Albert Greenberg" initials="A" surname="Greenberg">
<organization/>
</author>
<author fullname="David A. Maltz" initials="D.A." surname="Maltz">
<organization/>
</author>
<author fullname="Jitendra Padhye" initials="J" surname="Padhye">
<organization/>
</author>
<author fullname="Parveen Patel" initials="P" surname="Patel">
<organization/>
</author>
<author fullname="Balaji Prabhakar" initials="B" surname="Prabhakar">
<organization/>
</author>
<author fullname="Sudipta Sengupta" initials="S" surname="Sengupta">
<organization/>
</author>
<author fullname="Murari Sridharan" initials="M" surname="Sridharan">
<organization/>
</author>
<date month="October" year="2010"/>
</front>
<seriesInfo name="ACM SIGCOMM CCR" value="40(4)63--74"/>
<format target="http://ccr.sigcomm.org/drupal/files/p63_0.pdf"
type="PDF"/>
</reference>
<reference anchor="CPolTrilogyExp"
target="http://trilogy-project.org/publications/deliverables.html">
<front>
<title>Progress on resource control</title>
<author fullname="Costin Raiciu" initials="C" role="editor"
surname="Raiciu">
<organization>UCL</organization>
</author>
<date month="December" year="2009"/>
</front>
<seriesInfo name="Trilogy EU 7th Framework Project ICT-216372 Deliverable"
value="9"/>
<format target="http://typo3.trilogy-project.eu/fileadmin/publications/Deliverables/D9_-_Progress_on_resource_control.pdf"
type="PDF"/>
</reference>
<reference anchor="conex-dc_tr">
<front>
<title>Network Performance Isolation in Data Centres by Congestion
Exposure to Edge Policers</title>
<author fullname="Bob" surname="Briscoe">
<organization>BT</organization>
</author>
<date day="03" month="November" year="2011"/>
</front>
<seriesInfo name="BT Technical Report" value="TR-DES8-2011-004"/>
<annotation>Work in progress</annotation>
</reference>
</references>
<section title="Summary of Changes between Drafts">
<t>This draft was separated out from the -00 version of <xref
target="I-D.briscoe-conex-data-centre"/>, taken mostly from what was
section 4. Changes from that draft are listed below:</t>
<!--Detailed changes are available from http://tools.ietf.org/html/draft-briscoe-conex-policing-->
<t><list style="hanging">
<t
hangText="From section 4 of draft-briscoe-conex-data-centre-00:"><list
style="symbols">
<t>New Introduction</t>
<t>Added section 2. "Example Bulk Congestion Policer"</t>
<t>Rearranged and clarified the previous introductory text that
preceded what is now Section 3.4 into three subsections:<list
style="empty">
<t>3.1 "The Problem"</t>
<t>3.2 "Approach"</t>
<t>3.3 "Terminology"</t>
</list></t>
<t>Corrected numerical examples:<list style="symbols">
<t>Section 3.4 "Long-RunningFlows": 0.04% to 0.004% and
400kb/s to 40kb/s</t>
<t>Section 3.6.1 "Numerical Examples Without Policing":
10kb/s to 20kb/s</t>
<t>section 3.7 "Weighted Congestion Controls": 22.2 to
20</t>
</list></t>
<t>Clarified where necessary, especially the descriptions of the
scenarios in 3.7 "Weighted Congestion Controls" and 3.8 "A
Network of Links".</t>
<t>Section 3.8 "A Network of Links": added a numerical example
including congestion burst limiting that was not covered at all
previously</t>
<t>Added Section 3.9 "Tenants of Different Sizes"</t>
<t>Filled in absent text in:<list style="symbols">
<t>Section 3.11 "Diverse Congestion Controls"</t>
<t>Section 4. "Parameters Settings"</t>
<t>Section 5. "Security Considerations"</t>
</list></t>
<t>Minor corrections throughout and updated references.</t>
</list></t>
</list></t>
</section>
</back>
</rfc>