One document matched: draft-zheng-tcpincast-00.txt


Network Working Group                                         H.Zheng
Internet Draft                                              CEIE,BJTU
Intended status: Experimental                                  C.Qiao
Expires: December 28, 2016                                       SUNY
                                                               K.Chen
                                                               Y.Zhao
                                                            CEIE,BJTU
                                                        June 25, 2016



     An Effective Approach to Preventing TCP Incast Throughput Collapse
                         for Data Center Networks
                      draft-zheng-tcpincast-00.txt


Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on December 28, 2016.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents



Zheng,et al.          Expires December 28, 2016                [Page 1]

Internet-Draft    Preventing TCP Incast Thr. Col. for DCN      June 2016


   carefully, as they describe your rights and restrictions with
   respect to this document.

Abstract

   This document presents an effective solution to the known TCP incast
   problem in data center networks. The incast problem refers to a
   drastic TCP throughput drop when the number of servers synchroni-
   cally sending data to the same receiver is too large. Our idea is
   avoiding packet losses before TCP incast happens. The scheme is
   limiting the number of concurrent senders such that the link can be
   filled as fully as possible but no packet losses. In this document
   we examine the condition that the link can be saturated but no
   packet losses. Then based on the condition we propose an approach to
   estimates the reasonable number of concurrent senders. Our approach
   does not modify TCP protocol itself and can thus be applied to any
   TCP variant, and works regardless of the type of data center network
   topology and throughput limitation. Analysis and simulation results
   show that our approach eliminates the incast problem and noticeably
   improves TCP throughput.

Table of Contents

   1. Introduction ................................................ 3
   2. Model ....................................................... 4
      2.1. TCP Incast Model........................................ 4                                
      2.2. TCP rabbits in data centers............................. 5                                           
   3. The Condition SBNPL ......................................... 6
      3.1. Why limit the number of concurrent senders ..............6
      3.2. Assumptions and notations............................... 7                                         
      3.3. The condition SBNPL..................................... 7                                   
   4. Performance evaluation....................................... 8                                 
      4.1. Simulation Configuration................................ 8                                        
      4.2. Throughput performance.................................. 9                                      
      4.3. The effect of the limitation to the buffer size on
      throughput .................................................. 9
      4.4. The effect of BDP on throughput .........................9
   5. Related Work ................................................ 9
   6. Conclusions ................................................ 11
   7. Security Considerations..................................... 12                                  
   8. IANA Considerations ........................................ 12
   9. References ................................................. 12
   10. Acknowledgments ........................................... 14






Zheng,et al.          Expires December 28, 2016                [Page 2]

Internet-Draft    Preventing TCP Incast Thr. Col. for DCN      June 2016


1. Introduction

   Data centers are becoming one of hottest topics in both research
   communities and IT industry. It is attractive to build data centers
   atop  standard  TCP/IP  and  Ethernet  due  to  their  technological
   advantages and economy of scale. Although TCP has been used widely
   in the Internet and works well, TCP does not work well in data
   center networks. A reported open problem is TCP incast [1-4].

   TCP  incast,  which  results  in  gross  under-utilization  of  link
   capacity, occurs in synchronized many-to-one communication patterns.
   In such a communication pattern, a receiver issues data requests to
   multiple senders. The senders respond to the request and return an
   amount of data to the receiver. The data from all senders pass
   through a bottleneck link in a many-to-one pattern to the receiver.
   When  the  number  of  synchronized  senders  increases,  throughput
   observed at the receiver drops to one or two orders of magnitude
   below the link capacity.

   Several factors, including high bandwidth and low latency of a data
   center network, lead to TCP incast [3]. The barrier synchronized
   many-to-one communication pattern triggers multiple servers to
   transmit packets to a receiver concurrently, resulting in a
   potentially large amount of traffic simultaneously poured into the
   network. Due to limited buffer space at the switches, such traffic
   can overload the buffer, resulting in packet losses. TCP recovers
   from most of packet losses through timeout retransmission. The
   timeout duration is at least hundreds of milliseconds, which is
   orders of magnitude greater than a typical round trip time in a data
   center network. A server that suffers timeout is stalled even though
   other servers can use the available bandwidth to complete
   transmitting. Due to the synchronized communication, however, the
   receiver has to wait for the slowest server that suffers timeout.
   During such a waiting period, the bottleneck link may be fully idle.
   This results in under-utilization of the link and performance
   collapse.

   This document addresses the above TCP throughput collapse problem.
   We propose an approach to preventing TCP throughput collapse. We
   focus on avoiding packet losses before TCP incast happens. The basic
   idea is to restrict the number of servers allowed to transmit data
   to the receiver at the same time to a reasonable value so as to
   avoid TCP incast.

   For given limited amount of data required by the receiver, too few
   servers send data at a time may not fully utilize the bandwidth on
   the link. Too many servers, however due to the limited switch buffer,


Zheng,et al.          Expires December 28, 2016                [Page 3]

Internet-Draft    Preventing TCP Incast Thr. Col. for DCN      June 2016


   simultaneously sending to the receiver will lead to overfilling of
   the  buffer,  resulting  in  TCP  timeouts  and  then  TCP  incast.
   Accordingly the number of servers that are permitted to send data at
   the same time should be carefully determined. Our objective is to
   find a reasonable value of number of co-existing servers to ensure
   that the aggregate bandwidth is utilized efficiently but without
   packet losses.

   Our approach is to calculate the reasonable number of servers
   allowed to send data at the same time. In this document, we give the
   condition that the bottleneck link can be saturated but no packet
   losses (SBNPL). We show that if there is sufficient amount of
   traffic on the bottleneck link while the backlog created at the
   switch buffer never exceeds the buffer size, the bottleneck link can
   be utilized effectively without losses. Based on this condition, we
   then propose an approach to calculating the reasonable number of
   concurrent TCP senders. The simulation results and analysis show our
   approach noticeably improves throughput performance.The approach is
   an application-layer mechanism. It can be a global scheduling
   application or an application staggering requests to a limited
   number of senders. It also can be implemented on senders, which can
   coordinately skew their responses. The advantage of our approaches
   is that it does not need to modify TCP protocol or the switch.
   The reminder of this document is organized as follows. In section 2,
   we describe model discussed in this document, including TCP incast
   model and TCP flow characteristics in data center. We examine the
   condition that the bottleneck link can be saturated but no packet
   losses in section 3. Simulation results for evaluating performance
   are presented in section 4. In section 6 we conclude this document
   and discuss future works.
2. Model

2.1. TCP Incast Model

   TCP incast occurs in synchronized many-to-one communication patterns.
   Research works have shown that many applications in data centers use
   many-to-one communication pattern [3,5,6].  A simplified model is
   originally presented in [1]. It is a distributed storage system, a
   file is split into data blocks. Each data block is stripped across N
   servers, such that each server stores a chunk of that particular
   data block, denoted by Sender Request Unit (SRU). We call the number
   of servers across which the file is stripped as stripping width. The
   receiver requests servers for that particular block; each server
   responds and returns the chunk, i.e., SRU, stored on it. Only after
   the receiver has gotten all chunks of the current block, it requests



Zheng,et al.          Expires December 28, 2016                [Page 4]

Internet-Draft    Preventing TCP Incast Thr. Col. for DCN      June 2016


   for the next block. Essentially, this is a Bulk Synchronous Model
   (BSP) [7].
   Built atop TCP/IP and Gigabit Ethernet, the underlying network is
   constructed with high bandwidth (1~10 Gbps, even higher) and low
   latency (RTT of 10~100 microseconds).  The link connecting the
   receiver and the switch is the only bottleneck.
   In our model, each sender only opens one TCP connection for
   transmission. TCP connections are kept open until the transfer of
   the whole file completes, or the receiver does not request any more.
   It is because that it is not necessary to open an individual
   connection for transferring each data block of the file.
   The transfer of a whole file consists of 'rounds'. A round begins
   when the receiver requests for a data block, and ends when the
   receiver has gotten all chunks of that block. In any round, each TCP
   connection transfers limited amount of data, which equals to the
   size of SRU, according to TCP congestion control algorithm.
   Consequently the transfer of a SRU in each round can be viewed as a
   succession of 'sub-rounds'.  At the beginning of any sub-round, each
   TCP sender transmits SEND_W packets back-to-back, where SEND_W is
   the current size of TCP send window. These packets are so-called
   'outstanding packets', which have been sent but whose ACK's have yet
   to be received by the sender.  If no packet losses, each outstanding
   packet must either be in the buffer queue, or be in the delay pipe
   (it is in the pipe from the sender to the receiver or its associated
   ACK packet is in the pipe in the other direction). Once all packets
   falling within the current send window have been transmitted, no
   other packets are sent until the first ACK is received for one of
   these packets. The reception of this ACK signals the end of the
   current sub-round and the beginning of the next sub-round. So the
   duration of a sub-round is a RTT. At the beginning of the next sub-
   round, a group of NEW_SEND_W new packets will be sent, where
   NEW_SEND_W is the new size of TCP send window. When all packets
   associated to that particular SRU have been acknowledged, the
   transfer of that particular block chunk completes. The next round
   starts when the slowest TCP connection completes the transfer of its
   corresponding SRU. Note that when the new round begins, the value of
   TCP congestion window is that of congestion window when the last
   round ends.

2.2. TCP rabbits in data centers

   TCP has unique characteristics in the context of data centers. More
   specifically, a TCP flow is generally referred to as either a TCP
   mouse if it is short (so short that it never exits the slow-start



Zheng,et al.          Expires December 28, 2016                [Page 5]

Internet-Draft    Preventing TCP Incast Thr. Col. for DCN      June 2016


   phase), or a TCP elephant if it has a bulk of data to transfer and
   lasts very long.

   From the discussion above, we observe that the size of a TCP flow is
   between a mouse and an elephant. In particular, a TCP flow may enter
   both the slow-start phase and the congestion avoidance phase during
   data transmission. We call such a TCP flow a "rabbit".

   Note that a rabbit TCP is elusive: it may exit slow start and enter
   congestion avoidance in the first round during which the first data
   block is transmitted (in the case of narrow striping) or in later
   rounds during which subsequent data blocks are sent (in the case of
   wider striping). While one may use different equations to model a
   TCP rabbit's behaviors in different rounds, or for different widths
   of striping, it greatly complicates the implementation. Moreover
   when there are multiple co-active rabbits, the interaction between
   them will make it much harder to model and predict their behaviors.
   Our objective is to find an approach that: is easy to implement,
   works regardless of TCP phase, TCP parameters configuration and
   scalability of data center and does not need to modify TCP protocol
   itself.

3. The Condition SBNPL

   Denote the reasonable number of servers as senders to simultaneously
   transmit with m. In this section at first we will explain why
   limiting the number of concurrent senders can guarantee throughput
   of the receiver. Then we will deduce the condition that the link can
   be saturated but no packet losses.
3.1. Why limit the number of concurrent senders

   Given stripping width, observe throughput of the receiver varying
   the number of concurrent TCP senders, i.e., the value of m. Our
   experiment (for a block of 1 MB stripped across 256 servers with
   buffer size of 64 KB) shows that throughput firstly increases then
   drops with the increasing of m. Because the volume of data sent by a
   TCP sender is just limited to 4 KB, when there are few concurrent
   TCP senders, the total amount of data transmitted by these senders
   is not sufficient to saturate the link, such that throughput of the
   receiver is low. With the increasing of the value of m, there are
   more concurrent TCP senders pouring data onto the link, and
   throughput increases till it reaches a top. When the value of m
   becomes larger, the potentially large amounts of traffic contributed
   by these m concurrent senders exhausts the buffering capacity of the
   link, which leads to packet losses; the throughput falls from the
   top instead of rising. We call the top throughput as the optimal


Zheng,et al.          Expires December 28, 2016                [Page 6]

Internet-Draft    Preventing TCP Incast Thr. Col. for DCN      June 2016


   throughput, and the value of m corresponding to the optimal
   throughput as the optimal number of concurrent senders.
   If the number of concurrent TCP senders exactly equals to the
   optimal value, the receiver can obtain optimal throughput. The
   simulation result also shows that even if the number of concurrent
   senders is not optimal, there exists a value range of m where the
   receiver obtains good throughput performance. By "good" throughput
   performance, we mean the receiver achieves 98 percent or above of
   the optimal throughput. Hence we view those values of m  within this
   range as reasonable number of concurrent senders. It may be hard to
   find the exact optimal value of m such that the receiver achieves
   optimal throughput, while find the reasonable value of m is more
   feasible. In order to find the reasonable value of m, we deduce the
   condition SBNPL in later subsection.
3.2. Assumptions and notations

   Ignoring the time that it takes to read data from disks or caches,
   once each sender receives a request from the receiver, it responds
   and returns its corresponding chunk. Thus at the beginning of any
   sub-round, all senders simultaneously inject data to the network.
   Because TCP connections are synchronized, an assumption also adopted
   in [8], they share both the buffer and the bottleneck link bandwidth
   quite equitably. That is, each TCP connection shares the buffer and
   the link bandwidth with B/m and C/m, respectively, where B is the
   buffer size and C is the bandwidth of the bottleneck link.
   Assume that both TCP send and receive socket buffer sizes are large
   enough that TCP send window SEND_W is limited by its congestion
   window CON_W, i.e., SEND_W = CON_W. All links have same bandwidth C,
   and propagating delay d. The switch has buffer capacity B, and
   manages queue with Drop Tail algorithm.

3.3. The condition SBNPL

   It is enough to analyze the transfer in a round, since for any round
   the data block's size is same and the block is transferred according
   to TCP congestion control algorithm. It is worth noting that at the
   beginning of a round the TCP send window size is the one when the
   last round ends.
   Due to synchronization, the windows evolutions of all TCP
   connections are synchronized, at any sub-round the total amount of
   data poured onto the bottleneck,M, is merely the sum of amount of
   data contributed by each connection at the current time, that is
   SUM_M.




Zheng,et al.          Expires December 28, 2016                [Page 7]

Internet-Draft    Preventing TCP Incast Thr. Col. for DCN      June 2016


   As mentioned above, if measured in units of packets, these M/P
   outstanding packets, where P is the size of a packet, must either be
   in the buffer queue, or be in the delay pipe, assuming no packet
   losses. If M is not less than the bandwidth delay product of the
   bottleneck, the bottleneck can be saturated. When these m TCP
   connections share the bottleneck, each connection's bandwidth delay
   product equals to its share of the link bandwidth times RTT time
   that it traverses the bottleneck. Because they are synchronized,
   each connection's share of the bottleneck is equitable, (i.e., which
   is C/m), and the RTTs are similar, so the total bandwidth delay
   product is the sum of m connections' bandwidth delay product, which
   equals to the bottleneck bandwidth times the average RTT,
   i.e.,C*RTT_AVR [9]. Ignoring the transmission time on the link
   between senders (or the receiver) and the switch, and processing
   time on the switch and/or the senders and the receiver, RTT_AVR is
   four times of the propagating delay of the bottleneck.
   In literature[10] it showed that when there is only one TCP
   connection on the bottleneck link, the link can be saturated but no
   packet losses, at each time as long as the amount of data poured
   onto the link satisfies the condition C*RTT_AVR=M=C*RTT_AVR+B. A
   simple extending of this condition to a version where m concurrent
   synchronized TCP connections share the link is
   C*RTT_AVR=SUM_M=C*RTT_AVR+B. However, for TCP incast problem, this
   condition does not hold because of highly bursty traffic. At the
   beginning of a round, for each TCP connection it is possible that
   its window has increased enough large that SEND_W packets are
   transmitted in back-to-back manner, resulting in flash crowd at the
   buffer. Such a traffic burstiness can lead to a backlog created at
   the buffer even when M<C*RTT_AVR, and packet losses can happen even
   when M<C*RTT_AVR+B.
4. Performance evaluation

4.1. Simulation Configuration

   We use network simulator NS-2 [11]to run simulation experiments, and
   leverage the NS-2 source code provided in [1,3] for our performance
   evaluation. Inspired by distributed storage and bulk block transfers
   in parallel processing applications such as MapReduce, we use the
   workload that is described in Section 2. Especially the data block
   size is fixed; it was thought to be representative of communication
   patterns in popular distributed storage systems [3,4].

   The performance metric is throughput over the bottleneck link, given
   by the total bytes received by the receiver divided by the finishing
   time of the last sender. We explore throughput by varying parameters,
   such as the number of servers, switch buffer size, data blocks size


Zheng,et al.          Expires December 28, 2016                [Page 8]

Internet-Draft    Preventing TCP Incast Thr. Col. for DCN      June 2016


   and bandwidth delay product. As the number of servers increases,
   each server would transmit a decreasing chunk block, for the size of
   SRU is 1/n MB, where n is the number of servers. We vary the value
   of n from 4 till the value where the size of SRU is 2 KB, which is
   of representative of minor chunk of data. Data block sizes are 1MB
   and 4MB, respectively, and BDP varies from 12.5 to 25. We run 100
   times simulation experiments, each simulation run 20 seconds. The
   throughput is averaged over 100 experiments.

4.2. Throughput performance

   We fix the size of data blocks 1MB. The simulation result shows that
   our proposed approach improves incast throughput notice.

4.3. The effect of the limitation to the buffer size on throughput

   We redo simulation experiments varying the buffer size B , given
   network bandwidth delay product of 12 KB. We consider three cases:
   B>=C*RTT_AVR and B<C*RTT_AVR-S .
   The simulation result shows that our proposed approach can make sure
   the link utilized as fully as possible without packet losses. It is
   because sufficient enough data are poured onto the link. As long as
   B is not less than the bandwidth delay product, the utilization of
   the link can be 50% above.
4.4. The effect of BDP on throughput

   The relation of bandwidth delay product to TCP throughput is well-
   known in the literature, such as [12,13]. We do simulation
   experiments varying bandwidth delay product, given the buffer size
   of 64 KB. The calculation of m with the improved approach is
   independent on network bandwidth delay product. The simulation
   result shows when bandwidth delay product increases, RTT increases
   and throughput decreases due to TCP itself congestion control
   mechanism. Nonethe-less, our approach improves incast throughput
   noticeably; the utilization of the bottleneck link is 70% above even
   when bandwidth delay product is higher.
5. Related Work

   The idea of restricting the number of co-active servers (i.e.,
   servers that send data simultaneously to the receiver) is originally
   suggested in [14]. However the authors did not give any clue on how
   to determine the value of the number of co-active servers. In [15],
   the authors proposed an approach to calculating this value. However,
   the assumption was that TCP always works in the slow-start phase and
   never enters congestion avoidance. Our work use a more realistic


Zheng,et al.          Expires December 28, 2016                [Page 9]

Internet-Draft    Preventing TCP Incast Thr. Col. for DCN      June 2016


   assumption: TCP does enter both phases. It will be shown that it
   brings more challenges for the calculating in next section.

   The idea of avoiding packet losses so as to prevent throughput
   collapse due to incast is also used in ICTCP proposed in [16]. The
   difference between our work and ICTCP is that our approach works
   without requiring any changes made to TCP protocols while ICTCP
   modifies TCP receiver to avoid incast congestion.The authors of
   [17]provided an admission control to TCP senders in order to limit
   the number of concurrent senders. However, their scheme requests the
   switch can report the network situation.

   DCTCP in [5] tries to improve TCP throughput based on the idea of
   reducing the switch buffer occupation. It achieves the goal by using
   marking scheme at switches and modifying TCP congestion control
   algorithm to react the marking scheme. Work [3] lowered TCP's
   retransmission timeout value to improve TCP throughput, however,
   most systems lack the fine-grained TCP timer required for such a low
   RTO. Work [1] studied whether the TCP variants, such as TCP SACK and
   TCP New Reno, and further improvements to TCP loss recovery, such as
   Limited Transmit, can prevent the incast problem.

   Some algorithms, such as Quantized Congestion Notification (QCN)
   [18]and Fair Quantized Congestion Notification (FQCN) [19]are
   developed to provide congestion control at the switch or Ethernet
   layer in data center networks. It has been shown that QCN can
   effectively control link rates very rapidly in datacenter networks.
   However, it performs poorly when TCP incast occurs. FQCN, as an
   enhanced QCN, improve fairness of multiple flows sharing the link
   capacity.

   Different  from  those  existing  works  which  either  improve  TCP
   protocol itself or are done at the Ethernet level, work [20]proposed
   a control protocol that is customized for the datacenter environment.
   The proposed uses explicit rate control to apportion bandwidth
   according to flow deadlines, motivated by the soft real-time nature
   of large scale web applications in today's datacenters.

   Manish Jain et al. reported the condition that a bulk TCP transfer
   can saturate a bottleneck link but no packet losses [21]. Although
   their work can be extended to a version for many TCP transfer on a
   bottleneck link, in our work barrier synchronized many-to-one
   communication pattern in incast brings about differences and
   challenges for deriving the condition SBNPL in three aspects. First,
   in our work each TCP connection just has limited amount of data to
   transmit. Yet, a "bulk" TCP transfer in existing works has infinite
   data to send. Second, in our incast model TCP enters both the slow-


Zheng,et al.          Expires December 28, 2016               [Page 10]

Internet-Draft    Preventing TCP Incast Thr. Col. for DCN      June 2016


   start and congestion avoidance phases, while existing works assumed
   TCP works either in congestion avoidance  [21,22] or slow-start phase
   [21,15]. Because of the barrier synchronized transfer, the
   successive data blocks transmissions are related: when the transfer
   of a particular data block begins, TCP stays at which phase, slow-
   start or congestion avoidance, not only depends on the last data
   block transmission but also the number of senders. Third, our
   objective is to determine how many TCP transfers can simultaneously
   exist on the bottleneck link so as to utilize the link as fully as
   possible; the objective of  [21] is to determine TCP socket buffer
   size so that the TCP transfer receives its maximum feasible
   throughput.
   The effect of the buffer on throughput, and the relations of
   bandwidth delay product and TCP congestion control algorithm to
   sizing the buffer are well-known in the networking literature.
   References  [23,24] described the rule-of-thumb and square-root rule
   for determining the size of the buffer, respectively. Both of them
   come from a desire to keep a congested link, that is, there are
   packet losses on the link, as busy as possible. Different from these
   existing works, in this document we discuss sizing the buffer
   assuming no packet losses on the link.
6. Conclusions

   By restricting the number of concurrent TCP senders that
   simultaneously transmit on the bottleneck is an effective method to
   avoid packet losses, thus TCP timeouts. This method eliminates the
   root cause of incast throughput collapse. However, it needs
   carefulness to find the condition that the bottleneck can be
   saturated but no packet losses when there are many TCP senders co-
   active on the bottleneck. Considering traffic burstiness and the
   delay pipe buffering ability, we examine this condition, and
   calculate the reasonable number of concurrent senders. Our approach
   is simple and easy to implement, and improve throughput performance.
   It is hard to find the exact optimal number of concurrent TCP
   senders such that the bottleneck link is fully utilized. In the
   future works, we will devote ourselves to analyzing the possibility
   of finding the optimal number of concurrent TCP senders, and to
   finding it if it exists. We also will extend our work for more
   complicated data center networks, where the path from receiver to
   servers traverses multiple Ethernet switches, or there are multiple
   receivers on a single switch or multiple switches sending requests
   to a shared subset of servers.





Zheng,et al.          Expires December 28, 2016               [Page 11]

Internet-Draft    Preventing TCP Incast Thr. Col. for DCN      June 2016


7. Security Considerations

   This document makes no changes to the underlying security of TCP. No
   new security issues are raised within this document.

8. IANA Considerations

   This document includes no request to INNA. Existing IANA registries
   for TCP parameters are sufficient.

9. References



   [1]   A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R.
         Ganger, G. A. Gibson and S. Seshan, "Measurement and analysis
         of TCP throughput collapse in cluster-based storage systems,"
         FAST '08, Feb. 2008, San Jose, CA.
   [2]   D. Nagle, D. Serenyi and A. Matthews, "The panasas activescale
         storage cluster: delivering scalable high bandwidth storage",
         Proc. the 2004 ACM/IEEE conference on Supercomputing,
         Washington DC, USA, 2004.
   [3]   V. Vasudevan, A. Phanishayee, H. Shah, E. Krevat, D. G.
         Andersen, G.R. Ganger, G. A. Gibson and B. Mueller, "Safe and
         effective fine-grained TCP retransmissions for data center
         communication", SIGCOMM'09, August 17-21, 2009, Barcelona,
         Spain.
   [4]   Y.Chen, R.Grith, J.Liu, A.D.Joseph, and R.H. Katz.
         Understanding TCP incast throughput collapse in data center
         rnetworks. In Proc. Workshop: Research  on Enterprise
         Networking, Barcelona,Spain,Aug.2009.
   [5]   M.Alizadeh, A.Greenberg, D.A.Maltz, J.Padhye, P.Patel,
         B.Prabhakar, S.Sengupta, andM.Sridharan. Data Center TCP
         (DCTCP). Proceedings of ACM SIGCOMM, 2010.
   [6]   S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R.
         Chaiken. The Nature of Datacenter Traffic: Measurements &
         Analysis. In ACM IMC 2009, Nov 4-6, 2009, Chicago, USA
   [7]   BSP Website. (2012). Available: http://www.bsp-worldwide.org
   [8]   Jiao Zhang, Fengyuan Ren and Chuang Lin, "Modeling and
         Understanding  TCP  Incast  in Data Center Networks," Proc.
         IEEE INFOCOM 2011, July, pp.1377-1385, doi:
         10.1109/INFCOM.2011.5934923.



Zheng,et al.          Expires December 28, 2016               [Page 12]

Internet-Draft    Preventing TCP Incast Thr. Col. for DCN      June 2016




   [9]   S. Shenker, L. Zhang, D. Clark, "some observation on the
         dynamics of a congestion control algorithm," ACM Computer
         Communications Review, vol. 20, 1990,  pp. 30-39.
   [10]  Manish Jain , Ravi S. Prasad and Constantinos Dovrolis, "The
         TCP Bandwidth-Delay Product revisited: network buffering,
         cross traffic, and socket buffer auto-sizing," CERCS Technical
         Reports, 2003.
   [11]  The network simulator 2, http://www.isi.edu/nsnam/ns.
   [12]  S. Shalunov and B. Teitelbaum, " Bulk TCP Use and Performance
         on Internet2," 2002. Also see:
         http://netflow.internet2.edu/weekly/.
   [13]  D. Katabi, M. Handley and C. Rohrs, "Congestion Control for
         High Bandwidth Delay Product Networks," Proc. ACM SIGCOMM 2002.
   [14]  E. Keevat, V. Vasudevan, A. Phanishayee, D. G. Andersen, G.R.
         Ganger, G. A. Gibson and S. Seshan, "On application-level
         approaches to avoiding TCP throughput collapse in cluster-
         based storage systems", Supercomputing '07, Nov. 10-16, 2007,
         Reno, NV.
   [15]  Yongxiang Zhao, Changjia Chen, "A scheme to solve TCP
         transmission problem in data center", 2010 Cross-Strait
         Conference on Information Science and Technology, CSCIST 2010,
         July 9-11, 2010, Qinghuang Dao, China.
   [16]  H. Wu, Z. Feng, C. Guo and Y. Zhang, ICTCP: Incast congestion
         control for tcp in data center networks. In ACM CoNext 2010,
         Nov 30 - Dec 3, 2010 Philadelphia, USA
   [17]  Adrian Tam, Kang Xi, Yang Xu, H. Jonathan Chao, "Preventing
         TCP Incast Throughput Collapse at the Initiation, Continuation,
         and Termination", Proc. IEEE/ACM International Workshop on
         Quality of Service (IWQOS 2012), Coimbra, Portugal, June, 2012.
   [18]  M. Alizadeh, B. Atikoglu, A. Kabbani, A. Lakshmikantha, R. Pan,
         B. Prabhakar, and M. Seaman, "Data Center Transport Mechanisms:
         Congestion Control Theory and IEEE Standardization," Proc. the
         46th Annual Allerton Conference, Illinois, USA, Sep. 2008, pp.
         1270-1277.
   [19]  Yan Zhang, and Nirwan Ansari, "On mitigating TCP Incast in
         Data Center Networks," Proc. IEEE INFOCOM 2011, Shanghai,
         China, July. 2011, pp. 51-55.
   [20]  Christo Wilson, Hitesh Ballani, Thomas Karagiannis and Ant
         Rowtron, "Better Never than Late: Meeting Deadlines in


Zheng,et al.          Expires December 28, 2016               [Page 13]

Internet-Draft    Preventing TCP Incast Thr. Col. for DCN      June 2016




         Datacenter Networks," Proc. ACM SIGCOMM 2011, New York, USA,
         2011, pp. 50-61.
   [21]  Manish Jain , Ravi S. Prasad and Constantinos Dovrolis, "The
         TCP Bandwidth-Delay Product revisited: network buffering,
         cross traffic, and socket buffer auto-sizing," CERCS Technical
         Reports, 2003.
   [22]  Jiao Zhang, Fengyuan Ren and Chuang Lin, "Modeling and
         Understanding  TCP  Incast  in Data Center Networks," Proc.
         IEEE INFOCOM 2011, July, pp.1377-1385, doi:
         10.1109/INFCOM.2011.5934923.
   [23]  C. Villamizar and C. Song, "High performance tcp in ansnet,"
         ACM Computer Communications Review, vol. 24, pp. 45-60, 1994.
   [24]  G. Appenzeller, I. Keslassy and N. McKeown, "Sizing router
         buffers," ACM SIGCOMM Computer Communication Review, vol. 34,
         Oct. 2004, doi:10.1145/1030194.1015499.

10. Acknowledgments

   We thank the research group lead by Professor David Andersen
   (Carnegie Mellon University) for sharing the NS-2 simulation codes
   and in particular, Amar Phanishayee (Carnegie Mellon University) for
   help with understanding these codes. We also thank Professor
   Yongxiang Zhao and Professor Changjia Chen of Beijing Jiaotong
   University for feedback and support. Discussing with Professor
   Hongfang Yu (University of Electronic Science and Technology of
   China) and Bingli Guo (Beijing University of Posts and
   Telecommunications) also help our work.

   This document was prepared using 2-Word-v2.0.template.dot.














Zheng,et al.          Expires December 28, 2016               [Page 14]

Internet-Draft    Preventing TCP Incast Thr. Col. for DCN      June 2016


Authors' Addresses

   Hongyun Zheng
   CEIE, BJTU
   Beijing,China
   Email: hyzheng@bjtu.edu.cn

   Chunming Qiao
   SUNY
   Buffalo,NY,U.S.A
   Email: qiao@computer.org

   Kai Chen
   CEIE, BJTU
   Beijing,China
   Email: 10120063@bjtu.edu.cn

   Yongxiang Zhao
   CEIE, BJTU
   Beijing,China
   Email: yxzhao@bjtu.edu.cn



























Zheng,et al.          Expires December 28, 2016               [Page 15]


PAFTECH AB 2003-20262026-04-24 15:47:52