Network Working Group                                            L. Deng
Internet-Draft                                              China Mobile
Intended status: Informational                         February 13, 2014
Expires: August 17, 2014


                End Point Properties for Peer Selection
                   draft-deng-taps-datacenter-00.txt

Abstract

   Traffic patterns and transport-layer performance goals within a data
   center differ from those on the Internet.  This draft discusses use
   cases for a transport API from the perspective of an application
   running in a data center environment, and proposes potential
   requirements for the design of such an API.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 17, 2014.

Copyright Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of




Deng                     Expires August 17, 2014                [Page 1]

Internet-Draft   End Point Properties for Peer Selection   February 2014


   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
   2.  Terminology . . . . . . . . . . . . . . . . . . . . . . . . .   3
   3.  Usecases  . . . . . . . . . . . . . . . . . . . . . . . . . .   3
     3.1.  VM related traffic  . . . . . . . . . . . . . . . . . . .   3
     3.2.  Application Priorities  . . . . . . . . . . . . . . . . .   3
     3.3.  Access Type differentiation . . . . . . . . . . . . . . .   4
     3.4.  Delay Tolerant Traffic  . . . . . . . . . . . . . . . . .   4
   4.  Transport Optimization in DC  . . . . . . . . . . . . . . . .   4
     4.1.  Performance degradation in DC . . . . . . . . . . . . . .   4
       4.1.1.  Incast Collapse . . . . . . . . . . . . . . . . . . .   4
       4.1.2.  Long tail of RTT  . . . . . . . . . . . . . . . . . .   5
       4.1.3.  Buffer Pressure . . . . . . . . . . . . . . . . . . .   5
     4.2.  Transport Optimization Goals/Mechanisms . . . . . . . . .   5
   5.  DC Transport API Considerations . . . . . . . . . . . . . . .   6
     5.1.  information flow from app to transport  . . . . . . . . .   6
     5.2.  information flow from transport to app  . . . . . . . . .   6
   6.  Security Considerations . . . . . . . . . . . . . . . . . . .   7
   7.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .   7
   8.  References  . . . . . . . . . . . . . . . . . . . . . . . . .   7
     8.1.  Normative References  . . . . . . . . . . . . . . . . . .   7
     8.2.  Informative References  . . . . . . . . . . . . . . . . .   7
   Author's Address  . . . . . . . . . . . . . . . . . . . . . . . .   7

1.  Introduction

   The traffic pattern in a data center is quite different from that of
   the Internet.  First, almost all traffic in a data center (over 90%)
   is carried by TCP.  Second, TCP flows vary widely in data volume and
   duration: while most flows are very short and complete in fewer than
   2-3 round trips, most of the traffic volume belongs to a few
   long-lasting flows.  ToR switches multiplex tens of concurrent TCP
   flows most of the time.

   The reason behind such a traffic pattern is a combination of three
   types of data traffic:

   (1) highly delay-sensitive short flows resulting from the distributed
   computing model employed pervasively for delay-sensitive applications
   (web search/social networking);

   (2) highly delay-sensitive short flows for cluster control/
   management; and

   (3) delay-tolerant backup/synchronization data traffic with large
   data volume.

2.  Terminology

   DC: Data Center, a facility used to house computer systems and
   associated components, such as telecommunications and storage
   systems.

   ToR: Top of Rack switch, a switch that usually sits on top of a rack
   of servers, interconnects the local servers within the rack, and
   serves as their entrance to other parts of the data center.

   VM: Virtual Machine, a software implementation of a machine (i.e., a
   computer) that executes programs like a physical machine.

   VM migration: the process of moving a running virtual machine or
   application between different physical machines.

   NIC: Network Interface Controller, a computer hardware component
   that connects a computer to a computer network.

   DCB: Data Center Bridging, a set of enhancements to Ethernet local
   area networks for use in data center environments, such as lossless
   Ethernet.

3.  Usecases

   In addition to the web search/query example in the Introduction,
   this section presents other use cases for optimized data delivery
   within a DC.

3.1.  VM related traffic

   In virtualized data centers, to cope with the reliability concerns
   arising from relatively unreliable commodity hardware platforms, it
   is common practice to keep several identical VM instances running on
   different physical servers as backups for each other.  In such
   cases, TCP flows for VM backup or migration, although considerably
   larger in data volume and longer in duration than typical user
   traffic, are also delay sensitive.

3.2.  Application Priorities

   For a data center accommodating multiple applications, it is common
   practice to differentiate resource provisioning under congestion,
   depending on the operator's/service provider's marketing or
   provisioning strategies or on each application's own user
   expectations.  For instance, physical resources in a data center
   could be shared between a delay-sensitive web search engine and
   document/music sharing applications.  Within the data center,
   traffic from load balancers to servers and from servers to databases
   is multiplexed on the internal DC network.

3.3.  Access Type differentiation

   Given the various access types for a specific application, the DC
   operator may want to enforce different QoS policies for specific
   groups of users according to their access type.  For instance, if
   the service provider is currently targeting the mobile market, it
   could prioritize mobile traffic over fixed traffic.

   Among potentially competing service providers, a provider may want
   to prioritize traffic from its own subscribers over that of
   third-party users.

3.4.  Delay Tolerant Traffic

   Delay-tolerant traffic, including software upgrades and active
   measurement traffic for bandwidth detection, should not impact
   production traffic.

4.  Transport Optimization in DC

   To fully understand why a special transport service is needed for
   the DC environment as compared to the Internet, it is useful to look
   first at what problems an optimized transport service would solve
   from the perspective of a DC application.

4.1.  Performance degradation in DC

   In particular, the following three issues have been identified in
   the DC environment in terms of transport performance.

4.1.1.  Incast Collapse

   To reduce CAPEX, cheap shallow-buffered ToR switches are dominant in
   today's data centers.  An aggregator is a server that divides a task
   into a group of subtasks and collects the responses from its worker
   servers for result aggregation.  The buffer space of the ToR switch
   in front of an aggregator is often exhausted the instant the workers
   submit their subtask results through highly synchronized TCP flows,
   resulting in persistent packet loss on the affected flows.  The
   resulting timeouts cause dramatic performance degradation, since the
   regular RTT in a data center (less than 10ms) can be orders of
   magnitude smaller than the traditional TCP RTO configuration (200ms).
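
   As a rough illustration of the RTT/RTO mismatch described above, the
   following Python sketch computes how many round trips fit inside one
   retransmission timeout.  The 200 ms RTO and the 10 ms RTT bound come
   from the text; the 0.25 ms RTT is an assumed value used only for
   illustration, and the function name is hypothetical.

```python
def rtts_lost_per_timeout(rto_ms: float, rtt_ms: float) -> float:
    """Return how many round trips fit inside one retransmission timeout."""
    if rtt_ms <= 0:
        raise ValueError("rtt_ms must be positive")
    return rto_ms / rtt_ms


RTO_MS = 200.0  # traditional TCP RTO configuration cited in the text

# 10 ms is the upper bound given in the text; 0.25 ms is an assumed value.
for rtt_ms in (10.0, 0.25):
    idle = rtts_lost_per_timeout(RTO_MS, rtt_ms)
    print(f"RTT {rtt_ms} ms: one timeout idles the flow for {idle:.0f} RTTs")
```

   Even at the 10 ms bound, a single timeout idles a flow for 20 round
   trips, which is far longer than the 2-3 round trips in which most
   short flows would otherwise complete.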



4.1.2.  Long tail of RTT

   Due to the greedy nature of traditional TCP algorithms, large-volume
   long flows steadily build up the buffer queues in the switches along
   their path, adding considerable queuing delay at those switches for
   the highly delay-sensitive short flows.
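
   The added delay can be estimated as the time needed to drain the
   standing queue at the port's line rate.  The Python sketch below is
   a back-of-the-envelope calculation only; the queue depth (100
   packets of 1500 bytes) and the 1 Gbit/s port speed are assumed
   values, not measurements from this document.

```python
def queuing_delay_us(queued_bytes: int, link_rate_bps: float) -> float:
    """Microseconds needed to drain queued_bytes at link_rate_bps."""
    if queued_bytes < 0:
        raise ValueError("queued_bytes must be non-negative")
    if link_rate_bps <= 0:
        raise ValueError("link_rate_bps must be positive")
    return queued_bytes * 8 * 1e6 / link_rate_bps


# Assumed values: 100 full-size packets queued by long flows, 1 Gbit/s port.
QUEUED_BYTES = 100 * 1500
LINK_RATE_BPS = 1e9

delay = queuing_delay_us(QUEUED_BYTES, LINK_RATE_BPS)
print(f"a short flow arriving behind this queue waits about {delay:.0f} us")
```

   Under these assumptions the wait is 1200 microseconds per congested
   hop, which is significant for a flow that needs only a few round
   trips to complete.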

4.1.3.  Buffer Pressure

   Due to the greedy nature of traditional TCP algorithms, large-volume
   long flows steadily build up the buffer queues in the switches along
   their path, further reducing the buffer space actually available to
   accommodate delay-sensitive short flows, even if those short flows
   are not submitted at the same time.

4.2.  Transport Optimization Goals/Mechanisms

   Since both the hardware and the software in a DC are typically
   deployed and highly customized by a single service operator, various
   private solutions to these issues exist, including cross-layer and
   cross-boundary (network + end host) hybrid ones.

   In solving the above issues, various proposals are made in order to
   meet some of the following optimization goals:

   (1) Reduce unnecessary loss/timeout: since TCP performance loss is
   mainly caused by packet losses and retransmission timeouts, it is
   proposed that a finer-grained RTO configuration and timing framework
   could largely mitigate the resulting performance degradation
   [Pannas].  In the meantime, there has been work from the IEEE DCB
   family to provide lossless Ethernet service at the link layer, which
   can be used to avoid packet loss at the IP layer and has been
   demonstrated to be effective in a coupled solution for DC transport
   optimization [detail].

   (2) Mitigate the performance impact of loss/timeout: delay-based CC
   algorithms are expected to be more robust to packet losses and
   timeouts in mitigating the incast collapse issue in a DC [vegas].

   (3) Control/avoid lengthy buffer queues: as queuing delay
   substantially affects the RTT in a DC environment, performance can
   be improved by keeping the buffer queues short [dctcp] or even empty
   [hull].  To do so, the sender may sense the queue at switches
   through explicit feedback (ECN [dctcp]) or implicit delay variation
   (Vegas [vegas]).

   (4) Delay prioritized buffer queuing: during resource-bounded
   periods, it is essential to make efficient use of limited resources
   to deliver the demanded service, rather than fair-sharing among all
   competitors and ultimately failing them all.  Proposals have been
   made to allow applications to explicitly indicate a flow's delivery
   preferences (either by absolute deadline information [d3] or by
   relative priorities [detail]), in order to improve the overall
   delivery success rate.

   (5) Smooth traffic bursts: on one hand, (distributed) applications
   can be refined to introduce a random offset in concurrent short-flow
   submission; on the other hand, a random offset can be introduced
   into the RTO back-off calculation to mitigate retransmission
   synchronization [Pannas].  Moreover, physical pacing at the NIC
   level is proposed to counter the traffic bursts caused by server
   performance optimization techniques [d2tcp].
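
   As an illustration of the goal of keeping buffer queues short
   through explicit feedback, the following Python sketch shows a
   DCTCP-style sender reaction to ECN marks, following the published
   DCTCP design [dctcp]: the sender keeps a running estimate (alpha) of
   the fraction of marked packets and reduces its window in proportion
   to that estimate.  This is a simplified sketch, not a complete
   congestion controller; the class name, the gain of 1/16, the initial
   values, and the one-packet-per-window increase are assumptions made
   for the example.

```python
from dataclasses import dataclass


@dataclass
class DctcpSender:
    """Simplified DCTCP-style window control (illustrative only)."""
    cwnd: float = 10.0    # congestion window in packets (assumed start)
    alpha: float = 0.0    # running estimate of the marked fraction
    g: float = 1.0 / 16   # estimation gain (assumed value)

    def on_window_acked(self, acked: int, marked: int) -> None:
        """Update state after one window of ACKs has been received."""
        if acked <= 0 or not 0 <= marked <= acked:
            raise ValueError("need acked > 0 and 0 <= marked <= acked")
        fraction = marked / acked
        # Exponentially weighted moving average of the marked fraction.
        self.alpha = (1 - self.g) * self.alpha + self.g * fraction
        if marked:
            # Reduce in proportion to the extent of congestion, not by half.
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
        else:
            # Simplified additive increase of one packet per window.
            self.cwnd += 1.0


sender = DctcpSender(cwnd=100.0)
sender.on_window_acked(acked=16, marked=16)
print(f"alpha={sender.alpha:.4f} cwnd={sender.cwnd:.2f}")
```

   With a light marking history the window shrinks only slightly,
   whereas sustained marking drives alpha toward 1 and the reduction
   toward the conventional halving.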

5.  DC Transport API Considerations

5.1.  information flow from app to transport

   (1) Delivery related: information from the application about its
   expectations of the transport service regarding delivery.  For
   example, the delivery goal could be specified in the form of:

   (1.1) an absolute delay requirement; or

   (1.2) a relative priority indication.

   (2) Retransmission related: information from the application about
   how the transport should deal with packet losses.  For example, the
   information could include:

   (2.1) whether loss recovery is needed;

   (2.2) if so, the preferred retransmission timeout granularity.

   (3) Pacing related: information from the application about its
   expectations regarding traffic pacing.  For example, the information
   could include:

   (3.1) traffic duration, in case of a policy that paces long flows
   only;

   (3.2) burstiness expectation.
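
   The three groups of information listed above could be carried in a
   per-flow hint structure.  The Python sketch below is one possible
   shape for such a structure and is not a proposed API; all class and
   field names, units, and defaults are hypothetical and chosen only to
   mirror the list items.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class DeliveryHint:
    """Item (1): exactly one of deadline or priority is given."""
    deadline_ms: Optional[float] = None  # (1.1) absolute delay requirement
    priority: Optional[int] = None       # (1.2) relative priority

    def __post_init__(self) -> None:
        if (self.deadline_ms is None) == (self.priority is None):
            raise ValueError("give exactly one of deadline_ms or priority")


@dataclass(frozen=True)
class RetransmissionHint:
    """Item (2): how the transport should treat packet loss."""
    loss_recovery: bool = True                # (2.1)
    rto_granularity_us: Optional[int] = None  # (2.2), only with recovery

    def __post_init__(self) -> None:
        if not self.loss_recovery and self.rto_granularity_us is not None:
            raise ValueError("RTO granularity requires loss recovery")


@dataclass(frozen=True)
class PacingHint:
    """Item (3): what the application expects about pacing."""
    expected_duration_ms: Optional[float] = None  # (3.1)
    bursty: Optional[bool] = None                 # (3.2)


@dataclass(frozen=True)
class FlowHints:
    """All application-to-transport hints for one flow."""
    delivery: DeliveryHint
    retransmission: RetransmissionHint = field(
        default_factory=RetransmissionHint)
    pacing: PacingHint = field(default_factory=PacingHint)


# A short query flow with a 20 ms deadline and fine-grained timeouts.
hints = FlowHints(
    delivery=DeliveryHint(deadline_ms=20.0),
    retransmission=RetransmissionHint(rto_granularity_us=200),
    pacing=PacingHint(expected_duration_ms=5.0, bursty=True),
)
print(hints)
```

   Making the deadline and the priority mutually exclusive reflects the
   "or" between items (1.1) and (1.2); a design that allows both would
   need a rule for which one takes precedence.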

5.2.  information flow from transport to app

   Congestion information: information from the network device or the
   local transport layer about the congestion status of the current
   transport link.
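
   One way to expose such information is a callback that the
   application registers with the transport.  The Python sketch below
   shows that pattern only; the class names and the two metrics
   (fraction of ECN-marked packets and estimated queuing delay) are
   assumptions for illustration and are not defined by this document.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class CongestionInfo:
    """Hypothetical congestion report passed from transport to app."""
    source: str                 # "network" (device) or "local" (transport)
    ecn_marked_fraction: float  # assumed metric, in the range 0.0 to 1.0
    queuing_delay_us: float     # assumed metric, estimated queuing delay

    def __post_init__(self) -> None:
        if self.source not in ("network", "local"):
            raise ValueError("source must be 'network' or 'local'")
        if not 0.0 <= self.ecn_marked_fraction <= 1.0:
            raise ValueError("ecn_marked_fraction must be within [0, 1]")


Listener = Callable[[CongestionInfo], None]


class CongestionNotifier:
    """Delivers congestion reports to every subscribed application."""

    def __init__(self) -> None:
        self._listeners: List[Listener] = []

    def subscribe(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def publish(self, info: CongestionInfo) -> None:
        # Iterate over a copy so listeners may subscribe during delivery.
        for listener in list(self._listeners):
            listener(info)


notifier = CongestionNotifier()
received: List[CongestionInfo] = []
notifier.subscribe(received.append)
notifier.publish(CongestionInfo("network", 0.25, 120.0))
print(received[0])
```

   An application could use such reports, for example, to defer
   delay-tolerant transfers while the reported marking fraction stays
   high.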




6.  Security Considerations

   TBA.

7.  IANA Considerations

   TBA.

8.  References

8.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

8.2.  Informative References

   [Pannas]   Vasudevan, V., Phanishayee, A., and H. Shah, "Safe and
              effective fine-grained TCP retransmissions for datacenter
              communication", 2009.

   [d2tcp]    Vamanan, B., Hasan, J., and T. Vijaykumar, "Deadline-aware
              datacenter TCP (D2TCP)", 2012.

   [d3]       Wilson, C., Ballani, H., and T. Karagiannis, "Better never
              than late: meeting deadlines in datacenter networks",
              2011.

   [dctcp]    Alizadeh, M., Greenberg, A., and D. Maltz, "Data center
              TCP (DCTCP)", 2011.

   [detail]   Zats, D., Das, T., and P. Mohan, "DeTail: reducing the
              flow completion time tail in datacenter networks", 2012.

   [hull]     Alizadeh, M., Kabbani, A., and T. Edsall, "Less is more:
              trading a little bandwidth for ultra-low latency in the
              data center", 2012.

   [vegas]    Lee, C., Jang, K., and S. Moon, "Reviving delay-based TCP
              for data centers", 2012.

Author's Address

   Lingli Deng
   China Mobile

   Email: denglingli@chinamobile.com


