Network Working Group L. Deng
Internet-Draft China Mobile
Intended status: Informational February 13, 2014
Expires: August 17, 2014
End Point Properties for Peer Selection
draft-deng-taps-datacenter-00.txt
Abstract
Traffic within a data center exhibits distinctive patterns and
transport-layer performance goals that differ from those of traffic
on the Internet. This draft discusses use cases for a transport API
from the perspective of an application running in a data center
environment, and proposes potential requirements for the design of
such an API.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on August 17, 2014.
Copyright Notice
Copyright (c) 2014 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
Deng Expires August 17, 2014 [Page 1]
Internet-Draft End Point Properties for Peer Selection February 2014
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction
2. Terminology
3. Usecases
3.1. VM related traffic
3.2. Application Priorities
3.3. Access Type differentiation
3.4. Delay Tolerant Traffic
4. Transport Optimization in DC
4.1. Performance degradation in DC
4.1.1. Incast Collapse
4.1.2. Long tail of RTT
4.1.3. Buffer Pressure
4.2. Transport Optimization Goals/Mechanisms
5. DC Transport API Considerations
5.1. information flow from app to transport
5.2. information flow from transport to app
6. Security Considerations
7. IANA Considerations
8. References
8.1. Normative References
8.2. Informative References
Author's Address
1. Introduction
Traffic patterns in a data center differ considerably from those on
the Internet. First, almost all traffic in a data center (over 90%)
is carried by TCP. Second, TCP flows vary widely in data volume and
duration: while most flows are very short and complete within 2-3
round trips, most of the traffic volume belongs to a few long-lasting
flows. ToR switches multiplex tens of concurrent TCP flows most of
the time.
This traffic pattern results from a combination of three types of
data traffic:
(1) highly delay-sensitive short flows resulting from the distributed
computing model used pervasively by delay-sensitive applications
(web search/social networking);
(2) highly delay-sensitive short flows for cluster control/management;
and
(3) delay-tolerant backup/synchronization data traffic with large
data volume.
2. Terminology
DC: Data Center, is a facility used to house computer systems and
associated components, such as telecommunications and storage
systems.
ToR: Top of Rack switch, a switch that usually sits on top of a rack
of servers. It interconnects the local servers within the rack and
serves as their entrance to other parts of the data center.
VM: Virtual Machine, is a software implementation of a machine (i.e.,
a computer) that executes programs like a physical machine.
VM migration: the process of moving a running virtual machine or
application between different physical machines.
NIC: Network Interface Controller, is a computer hardware component
that connects a computer to a computer network.
DCB: Data Center Bridging, refers to a set of enhancements to
Ethernet local area networks for use in data center environments,
such as lossless Ethernet.
3. Usecases
In addition to the web search/query example in the Introduction,
other use cases for optimized data delivery within a DC are presented
below.
3.1. VM related traffic
In virtualized data centers, it is common practice to keep several
identical VM instances running on different physical servers as
backups for each other, in order to cope with the reliability
concerns arising from relatively unreliable commodity hardware
platforms. In such cases, TCP flows for VM backup or migration,
although considerably larger in data volume and longer in duration
than typical user traffic, are also delay sensitive.
3.2. Application Priorities
For a data center accommodating multiple applications, it is common
practice to differentiate resource provisioning under congestion,
depending on the operator/service provider's marketing/provisioning
strategies or on each application's own user expectations. For
instance,
physical resources in a data center could be shared between a
delay-sensitive web search engine and document/music sharing
applications. Within the data center, traffic from load balancers to
servers and from servers to databases is multiplexed on the internal
DC network.
3.3. Access Type differentiation
Given the various access types for a specific application, the DC
operator may want to enforce different QoS policies for specific
groups of users according to their access type. For instance, if the
service provider is currently focusing on the mobile market, it
could prioritize mobile traffic over fixed traffic.
Where service providers potentially compete, a provider may want to
prioritize traffic from its own subscribers over that of third-party
users.
3.4. Delay Tolerant Traffic
Delay-tolerant traffic, including software upgrades and active
measurement traffic for bandwidth detection, should not impact
production traffic.
4. Transport Optimization in DC
To understand why a DC environment needs a special transport service
as compared to the Internet, it helps to look first at the problems
an optimized transport service would have to solve from the
perspective of a DC application.
4.1. Performance degradation in DC
In particular, the following three issues have been identified in the
DC environment in terms of transport performance.
4.1.1. Incast Collapse
To reduce CAPEX, cheap shallow-buffered ToR switches are dominant in
today's data centers. Consider an aggregator, i.e., a server that
divides a task into a group of subtasks and collects the responses
from the corresponding worker servers for result aggregation. The
buffer space of the ToR switch in front of the aggregator is often
exhausted at the instant the workers submit their subtask results
through highly synchronized TCP flows, resulting in packet loss
across the affected flows. The resulting timeouts cause a dramatic
performance degradation, since the regular RTT in a data center (less
than 10ms) is at least an order of magnitude smaller than the
traditional TCP RTO configuration (200ms).
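As a rough illustration of the scale involved, the following Python
sketch compares the completion time of a short flow with and without
a single retransmission timeout. The 3-round-trip flow length, the
10ms RTT bound and the 200ms RTO are taken from this document; the
function name, the idealized timing model and the assumption that
exactly one timeout occurs are illustrative only.

```python
def completion_time_ms(round_trips, rtt_ms, timeouts=0, rto_ms=200.0):
    """Idealized flow completion time: one RTT per round trip plus
    one full RTO of idle waiting for every retransmission timeout."""
    return round_trips * rtt_ms + timeouts * rto_ms


# Values from the text: short flows finish in 2-3 round trips, RTT is
# below 10 ms, and the traditional RTO is 200 ms.
clean = completion_time_ms(round_trips=3, rtt_ms=10.0)
stalled = completion_time_ms(round_trips=3, rtt_ms=10.0, timeouts=1)
slowdown = stalled / clean

print(f"no loss: {clean:.0f} ms, one timeout: {stalled:.0f} ms, "
      f"slowdown: {slowdown:.1f}x")
```

Under these assumptions a single timeout stretches the flow from 30ms
to 230ms, and the relative penalty grows as the actual RTT shrinks.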
4.1.2. Long tail of RTT
Due to the greedy nature of traditional TCP algorithms, large-volume
long flows steadily build up the buffer queues in switches along the
path, adding considerable queuing delay for the highly
delay-sensitive short flows.
4.1.3. Buffer Pressure
Due to the greedy nature of traditional TCP algorithms, large-volume
long flows steadily build up the buffer queues in switches along the
path, further reducing the buffer space actually available to
accommodate delay-sensitive short flows, even when the flows are not
submitted at the same time.
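The two effects described in Sections 4.1.2 and 4.1.3 can be
quantified with simple arithmetic. The Python sketch below is only an
illustration: the link rate, the buffer size and the share of the
buffer held by long flows are assumed numbers, not measurements from
this document, and the function names are hypothetical.

```python
def queuing_delay_ms(queued_bytes, link_rate_bps):
    """Time a newly arrived packet waits behind queued_bytes in a FIFO."""
    return queued_bytes * 8 / link_rate_bps * 1000.0


def free_buffer_bytes(buffer_bytes, long_flow_bytes):
    """Buffer left for bursts of short flows once long flows hold
    their share of the queue."""
    return max(buffer_bytes - long_flow_bytes, 0)


# Assumed, illustrative numbers (not from the draft): a 1 Gbit/s port
# with a 128 KiB buffer, of which long flows keep 96 KiB occupied.
LINK_RATE_BPS = 1_000_000_000
BUFFER_BYTES = 128 * 1024
LONG_FLOW_BYTES = 96 * 1024

delay = queuing_delay_ms(LONG_FLOW_BYTES, LINK_RATE_BPS)
headroom = free_buffer_bytes(BUFFER_BYTES, LONG_FLOW_BYTES)
print(f"extra queuing delay: {delay:.3f} ms, headroom: {headroom} bytes")
```

The first value corresponds to the long tail of RTT (delay added to
every short flow that crosses the port), the second to buffer
pressure (space left to absorb a burst of short flows).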
4.2. Transport Optimization Goals/Mechanisms
Since both hardware and software in a data center are typically
deployed and highly customized by a single service operator, various
private solutions to these issues exist, including cross-layer and
cross-boundary (network + end host) hybrid ones.
To solve the above issues, various proposals have been made to meet
some of the following optimization goals:
(1) Reduce unnecessary loss/timeout: since TCP performance loss is
mainly caused by packet losses/retransmission timeouts, it has been
proposed that a finer-tuned RTO configuration and timing framework
could largely mitigate the resulting performance degradation
[Pannas]. In the meantime, work from the IEEE DCB family provides
lossless Ethernet service at the link layer, which can be used to
avoid packet loss as seen by the IP layer and has been demonstrated
to be effective in a coupled solution for DC transport optimization
[detail].
(2) Mitigate the performance impact of loss/timeout: delay-based CC
algorithms are expected to be more robust to packet losses/timeouts
in mitigating the incast collapse issue in DCs [vegas].
(3) Control/avoid lengthy buffer queues: as queuing delay
substantially impacts the RTT in a DC environment, delay can be cut,
and performance thereby improved, by keeping the buffer queues short
[dctcp] or even empty [hull]. To do that, the sender may sense the
queue at switches through explicit feedback (ECN [dctcp]) or
implicit delay variation (Vegas [vegas]).
(4) Delay-prioritized buffer queuing: during resource-bounded
periods, it is essential to use the limited resources efficiently to
deliver the demanded service, rather than sharing them fairly among
all competitors and ultimately failing them all. Proposals have been
made to allow applications to explicitly indicate a flow's delivery
preferences (either by absolute deadline information [d3] or by
relative priorities [detail]), in order to improve the overall
delivery success rate.
(5) Smooth traffic bursts: on the one hand, (distributed)
applications could be refined to introduce a random offset into
concurrent short flow submission; on the other hand, a random offset
could be introduced into the RTO back-off calculation to mitigate
retransmission synchronization [Pannas]. Moreover, physical pacing
at the NIC level has been proposed to counter the effect of traffic
bursts caused by server performance optimization techniques [d2tcp].
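To make the explicit-feedback approach to keeping queues short more
concrete, the following Python sketch shows a sender that scales its
congestion window by the fraction of ECN-marked acknowledgments. It
is loosely modeled on the scheme described in [dctcp] and is a
simplification under stated assumptions: the class and method names
are hypothetical, the gain of 1/16 and the initial window are assumed
values, and slow start, loss handling and timers are left out.

```python
class EcnWindowController:
    """Sender-side window control driven by the fraction of
    ECN-marked ACKs (simplified sketch, not a full TCP)."""

    def __init__(self, cwnd=10.0, gain=1.0 / 16, min_cwnd=1.0):
        self.cwnd = cwnd          # congestion window, in segments
        self.gain = gain          # weight of the newest sample (assumed)
        self.min_cwnd = min_cwnd
        self.alpha = 0.0          # smoothed estimate of marked fraction

    def on_window_acked(self, acked, marked):
        """Update state after one window of ACKs; return the new cwnd."""
        if acked <= 0:
            raise ValueError("acked must be positive")
        fraction = marked / acked
        self.alpha = (1 - self.gain) * self.alpha + self.gain * fraction
        if marked > 0:
            # Back off in proportion to the estimated congestion level.
            self.cwnd = max(self.cwnd * (1 - self.alpha / 2), self.min_cwnd)
        else:
            # No marks seen: probe for bandwidth with additive increase.
            self.cwnd += 1.0
        return self.cwnd


ctl = EcnWindowController(cwnd=20.0)
after_clean = ctl.on_window_acked(acked=20, marked=0)
after_marked = ctl.on_window_acked(acked=21, marked=21)
print(after_clean, after_marked)
```

Because the reduction is proportional to the smoothed marking
fraction rather than a fixed halving, a brief burst of marks shrinks
the window only slightly, which is what allows queues to stay short
without sacrificing throughput.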
5. DC Transport API Considerations
5.1. information flow from app to transport
(1) Delivery related: information from the application about the
delivery service it expects from the transport. For example, the
delivery goal could be specified in the form of
(1.1) an absolute delay requirement; or
(1.2) a relative priority indication.
(2) Retransmission related: information from the application about
how the transport should deal with packet losses. For example, the
information could include:
(2.1) whether loss recovery is needed;
(2.2) if so, the preferred retransmission timeout granularity.
(3) Pacing related: information from the application about its
expectations for traffic pacing. For example, the information
could include:
(3.1) traffic duration, in case of a policy that paces long flows only;
(3.2) burstiness expectation.
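One possible way to group the items listed above is a single per-flow
hint record that the application hands to the transport. The Python
sketch below is purely hypothetical: this document does not define
any type, field name, unit or default, and treating items (1.1) and
(1.2) as mutually exclusive is an assumption drawn from the "or"
between them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TransportHints:
    """Hypothetical per-flow hints passed from application to transport."""

    # (1) delivery related: at most one of the two is expected
    deadline_ms: Optional[float] = None        # (1.1) absolute delay
    priority: Optional[int] = None             # (1.2) relative priority
    # (2) retransmission related
    loss_recovery: bool = True                 # (2.1) recovery needed?
    rto_granularity_us: Optional[int] = None   # (2.2) timer granularity
    # (3) pacing related
    expected_duration_ms: Optional[float] = None  # (3.1) traffic duration
    bursty: Optional[bool] = None              # (3.2) burstiness

    def __post_init__(self):
        if self.deadline_ms is not None and self.priority is not None:
            raise ValueError("give either a deadline or a priority")
        if self.deadline_ms is not None and self.deadline_ms <= 0:
            raise ValueError("deadline_ms must be positive")
        if not self.loss_recovery and self.rto_granularity_us is not None:
            raise ValueError("rto_granularity_us needs loss_recovery")


# A delay-sensitive query flow and a delay-tolerant backup flow.
query = TransportHints(deadline_ms=20.0, rto_granularity_us=500)
backup = TransportHints(priority=7, expected_duration_ms=60_000.0,
                        bursty=False)
print(query)
print(backup)
```

Keeping every field optional lets an application state only what it
knows, leaving the transport free to apply its own defaults for the
rest.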
5.2. information flow from transport to app
Congestion information: information from the network device or the
local transport layer about the congestion status of the current
transport link.
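The reverse direction could be exposed to the application as a
subscription interface. The following Python sketch is an assumption
made for illustration only: the record fields (origin of the report,
fraction of congestion-marked packets, smoothed RTT) and all names
are hypothetical, since this document only states that congestion
information comes from the network device or the local transport
layer.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class CongestionInfo:
    """Hypothetical congestion report delivered to the application."""
    source: str             # "network" (e.g., ECN marks) or "local"
    marked_fraction: float  # share of recent packets signaling congestion
    srtt_ms: float          # smoothed RTT seen by the local transport


class CongestionNotifier:
    """Lets the application subscribe to congestion reports."""

    def __init__(self) -> None:
        self._listeners: List[Callable[[CongestionInfo], None]] = []

    def subscribe(self, listener: Callable[[CongestionInfo], None]) -> None:
        self._listeners.append(listener)

    def publish(self, info: CongestionInfo) -> None:
        # Called by the transport whenever its congestion view changes.
        for listener in self._listeners:
            listener(info)


received = []
notifier = CongestionNotifier()
notifier.subscribe(received.append)
notifier.publish(
    CongestionInfo(source="network", marked_fraction=0.25, srtt_ms=0.8)
)
print(received)
```

An application could use such reports, for example, to postpone
delay-tolerant transfers while the link is congested.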
6. Security Considerations
TBA.
7. IANA Considerations
TBA.
8. References
8.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
8.2. Informative References
[Pannas] Vasudevan, V., Phanishayee, A., and H. Shah, "Safe and
effective fine-grained TCP retransmissions for datacenter
communication", 2009.
[d2tcp] Vamanan, B., Hasan, J., and T. Vijaykumar, "Deadline-aware
datacenter tcp (d2tcp)", 2012.
[d3] Wilson, C., Ballani, H., and T. Karagiannis, "Better never
than late: meeting deadlines in datacenter networks",
2011.
[dctcp] Alizadeh, M., Greenberg, A., and D. Maltz, "Data center
tcp", 2011.
[detail]
Zats, D., Das, T., and P. Mohan, "DeTail: reducing the
flow completion time tail in datacenter networks", 2012.
[hull] Alizadeh, M., Kabbani, A., and T. Edsall, "Less is more:
trading a little bandwidth for ultra-low latency in the
data center", 2012.
[vegas] Lee, C., Jang, K., and S. Moon, "Reviving delay-based TCP
for data centers", 2012.
Author's Address
Lingli Deng
China Mobile
Email: denglingli@chinamobile.com