<?xml version="1.0" encoding="UTF-8"?>
<!-- This template is for creating an Internet Draft using xml2rfc,
which is available here: http://xml.resource.org. -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!-- One method to get references from the online citation libraries.
There has to be one entity for each item to be referenced.
An alternate method (rfc include) is described in the references. -->

<!ENTITY RFC2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY RFC2234 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2234.xml">
<!ENTITY I-D.narten-iana-considerations-rfc2434bis SYSTEM "http://xml.resource.org/public/rfc/bibxml3/reference.I-D.narten-iana-considerations-rfc2434bis.xml">
]>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<!-- used by XSLT processors -->
<!-- For a complete list and description of processing instructions (PIs),
please see http://xml.resource.org/authoring/README.html. -->
<!-- Below are generally applicable Processing Instructions (PIs) that most I-Ds might want to use.
(Here they are set differently than their defaults in xml2rfc v1.32) -->
<?rfc strict="yes" ?>
<!-- give errors regarding ID-nits and DTD validation -->
<!-- control the table of contents (ToC) -->
<?rfc toc="yes"?>
<!-- generate a ToC -->
<?rfc tocdepth="4"?>
<!-- the number of levels of subsections in ToC. default: 3 -->
<!-- control references -->
<?rfc symrefs="yes"?>
<!-- use symbolic references tags, i.e, [RFC2119] instead of [1] -->
<?rfc sortrefs="yes" ?>
<!-- sort the reference entries alphabetically -->
<!-- control vertical white space
(using these PIs as follows is recommended by the RFC Editor) -->
<?rfc compact="yes" ?>
<!-- do not start each main section on a new page -->
<?rfc subcompact="yes" ?>
<!-- keep one blank line between list items -->
<!-- end of list of popular I-D processing instructions -->
<rfc category="info" docName="draft-deng-taps-datacenter-00.txt" ipr="trust200902">

	<!-- ***** FRONT MATTER ***** -->
	<front>
		<!-- The abbreviated title is used in the page header - it is only necessary if the
		full title is longer than 39 characters -->
		<title>End Point Properties for Peer Selection</title>
		<!-- add 'role="editor"' below for the editors if appropriate -->
		<!-- Another author who claims to be an editor -->
		<author fullname="Lingli Deng" initials='L.'
			surname="Deng">
			<organization>China Mobile</organization>
			<address>
				<email>denglingli@chinamobile.com</email>
			</address>
		</author>

		<date day="13" month="February" year="2014" />
		<!-- Meta-data Declarations -->
		<area></area>
		<workgroup></workgroup>

		<keyword></keyword>
		<abstract>
			<t> Traffic patterns and transport-layer performance goals within a data center differ from 
            those on the Internet.  This draft discusses use cases for transport APIs from the 
            perspective of an application running in a data center environment, and proposes potential 
            requirements for the design of such APIs.
			</t>
		</abstract>
	</front>

	<middle>


		<section anchor="intro" title="Introduction">
			<t> The traffic pattern in a data center is quite different from that of the Internet.
                First, almost all traffic in a data center (over 90%) is carried by TCP. 
                Second, TCP flows vary widely in data volume and duration: 
                while most flows are very short and complete in fewer than 2-3 round trips, most of 
                the traffic volume belongs to a few long-lasting flows.  ToR switches multiplex 
                tens of concurrent TCP flows most of the time.
            </t>
            <t> This traffic pattern results from a combination of the following types of data traffic: </t>
            
            <t> (1) highly delay-sensitive short flows resulting from the distributed computing model employed 
            pervasively by interactive Internet applications (web search/social networking); </t>
            <t> (2) highly delay-sensitive short flows for cluster control/management; and </t>
            <t> (3) delay-tolerant background flows for backup/synchronization with considerably large data volume.</t>
           
            
  		 </section>

    <section title="Terminology">
      
    <t>DC: Data Center, a facility used to house computer systems and associated components, such as 
    telecommunications and storage systems.</t>

	<t>ToR: Top of Rack switch, a switch that usually sits on top of a rack of servers, interconnects the local 
    servers within the rack, and serves as their entrance to other parts of the data center network.</t>

	<t>VM: Virtual Machine, a software implementation of a machine (i.e., a computer) that executes 
    programs like a physical machine.</t>
	
	<t>VM Migration: Virtual Machine Migration, the process of moving a running virtual machine 
    or application between different physical machines.</t>
    
    <t>NIC: Network Interface Controller, a computer hardware component that connects a computer to 
    a computer network.</t>
    
    <t>DCB: Data Center Bridging, a set of enhancements to Ethernet local area networks for 
    use in data center environments, such as lossless Ethernet. </t>

</section>

		<section title="Use Cases">
			<t> In addition to the web search/query example described in the Introduction, other use cases for
            optimized data delivery within a DC are presented below.
            </t>


		<section title="VM Related Traffic">
			<t> In virtualized data centers, it is common practice to keep several identical VM instances running on 
            different physical servers as backups for one another, to cope with the reliability concerns arising from
            relatively unreliable commodity hardware platforms. In such cases, TCP flows for VM backup or migration, 
            although considerably larger in data volume and longer in duration than typical user traffic, are also delay sensitive. 
            </t>
        </section>
				
		<section title="Application Priorities">
			<t> For a data center accommodating multiple applications, the operator would prefer to differentiate 
            resource provisioning under congestion, according to the DC operator's provisioning policy or each 
            application's own characteristics. </t>
            <t> For instance, if physical resources in a data center are shared between a delay-sensitive web search engine
            and a relatively delay-tolerant document/music sharing application, both applications'
            data traffic shares the links from load balancer to servers and from servers to database, and is multiplexed on 
            the internal DC network.
            </t>
        </section>

		<section title="Access Type Differentiation">
            <t> Given the various access types for a specific application, the DC operator may want to enforce different 
            QoS policies for selected groups of users, according to their access type. For instance, if the service 
            provider is currently focusing on the mobile market, it could prioritize mobile traffic over fixed traffic. 
            </t>
 			<t> Where service providers potentially compete, a provider may also want to prioritize direct traffic from its own 
            application over traffic from third-party users.
            </t>
        </section>
        
		<section title="Delay Tolerant Traffic">
            <t> Delay-tolerant traffic, including background software upgrades and other management traffic such
            as active measurement data for performance monitoring/fault detection, should not impact any 
            production traffic.
            </t>            
        </section>
    </section>	

	<section title="Transport Optimization in DC">
			<t> To understand why a DC environment needs special transport services compared to the Internet,
            it helps to look first at what problems an optimized transport service would solve from the perspective of
            a DC application, beginning with the performance degradation issues it faces.
            </t>


		<section title="Performance Degradation in DC">
			<t> In particular, the following three issues have been identified in DC environments in terms of transport performance.
            </t>
				
		      <section title="Incast Collapse">
			<t> To reduce CAPEX, cheap shallow-buffered ToR switches dominate data centers today and will continue 
            to do so. Hence the buffer space of the ToR switch in front of an aggregator (the server that is responsible 
            for dividing a task into a group of subtasks and collecting responses from its worker servers for result 
            aggregation) is easily exhausted the instant that workers submit their subtask results through highly synchronized TCP flows, 
            resulting in persistent packet loss on the affected flows. The resulting timeouts cause a dramatic 
            performance degradation, since the regular RTT in a data center (less than 10ms) is at least an order of magnitude 
            smaller than the traditional TCP RTO configuration (200ms).
            </t>
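            <t> As a rough illustration of the arithmetic behind incast collapse, the following Python sketch 
            checks whether a synchronized burst overflows a shallow switch buffer and estimates how much a single 
            retransmission timeout inflates flow completion time. The 10ms RTT bound and the 200ms RTO come from 
            the text above; the worker count, response size, buffer size and function names are hypothetical 
            values chosen only for illustration, and the model ignores queue draining during the burst.
            </t>
            <figure><artwork><![CDATA[
```python
"""Back-of-envelope model of incast collapse (illustrative only)."""

RTT_MS = 10.0   # upper bound on DC RTT quoted in the text
RTO_MS = 200.0  # traditional TCP RTO quoted in the text


def burst_overflows(workers, response_bytes, buffer_bytes):
    """True if simultaneous responses exceed the port buffer.

    Simplification: all responses arrive at once and nothing
    drains from the queue during the burst.
    """
    return workers * response_bytes > buffer_bytes


def completion_ms(round_trips, timeouts):
    """Flow completion time with the given number of RTO stalls."""
    return round_trips * RTT_MS + timeouts * RTO_MS


# Hypothetical numbers: 40 workers, 4 KB each, 128 KB buffer.
overflow = burst_overflows(40, 4096, 128 * 1024)
clean = completion_ms(round_trips=2, timeouts=0)
stalled = completion_ms(round_trips=2, timeouts=1)
slowdown = stalled / clean
```
]]></artwork></figure>
            <t> Under these assumed values the burst does not fit in the buffer, and one timeout makes a 
            two-round-trip flow 11 times slower (220ms instead of 20ms).
            </t>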
            </section>

		  <section title="Long Tail of RTT">
            <t> Due to the greedy nature of traditional TCP algorithms, large-volume long flows 
            progressively build up queues in the buffers of intermediaries along the path, resulting in considerable queuing delay at 
            switches for short delay-sensitive flows.
            </t>
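            <t> The queuing delay seen by a short flow can be approximated as the backlog ahead of its packet 
            divided by the drain rate of the port. The following Python sketch shows that calculation; the 
            backlog size, the link speed and the function name are assumptions made for illustration and are 
            not measurements from this document.
            </t>
            <figure><artwork><![CDATA[
```python
def queuing_delay_us(backlog_bytes, link_bps):
    """Microseconds a new packet waits behind the queued bytes."""
    return backlog_bytes * 8 * 1_000_000 / link_bps


# Hypothetical values: 100 KB queued by long flows, 1 Gbit/s port.
delay_us = queuing_delay_us(100_000, 1_000_000_000)
```
]]></artwork></figure>
            <t> With these assumed values, the backlog adds 800 microseconds at that hop to every packet 
            of a short flow that arrives behind it.
            </t>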
        </section>
        
		<section title="Buffer Pressure">
            <t> Another effect of long queues in the buffers of intermediaries along the path is that they further reduce the 
            buffer space actually available to accommodate bursty delay-sensitive short flows, even if those flows are not submitted 
            at the same time.
            </t>
            
        </section>
        </section>
        
        <section title="Transport Optimization Goals">
			<t> Since both hardware and software are typically deployed and customized by a single DC 
            operator, various private solutions to these issues have been proposed, including cross-layer and cross-boundary 
            ones (requiring cooperation between network devices and end hosts).
            </t>
				
            <t> The various proposals for solving the above issues aim to meet some of the following optimization
            goals:</t>
            <t> (1) Reduce loss/timeout occurrence: since TCP performance degradation is caused by packet losses and retransmission
            timeouts, it has been proposed that a finer-tuned RTO configuration and a finer-grained timing framework could 
            largely mitigate the impact <xref target="Pannas"></xref>. In the meantime, work from the IEEE 
            DCB family provides lossless Ethernet service at the link layer, which can be used to avoid packet 
            loss as seen from the IP layer and has been demonstrated to be effective in a coupled solution for DC transport optimization
            <xref target="detail"></xref>.</t>
            <t> (2) Mitigate impact from loss/timeout: delay-based congestion control (CC) algorithms are expected to be more robust
            to packet losses/timeouts in mitigating the incast collapse issue in DCs <xref target="vegas"></xref>.</t>
            <t> (3) Avoid lengthy buffer queues: as queuing delay substantially impacts the RTT in a DC environment, 
            performance can be improved by keeping the buffer queues short <xref target="dctcp"></xref> or even
            empty <xref target="hull"></xref>. To do that, the sender may sense the queue at switches through explicit feedback (ECN
            <xref target="dctcp"></xref>) or implicit delay variation (Vegas <xref target="vegas"></xref>).</t>
            <t> (4) Delay-prioritized buffer queuing: during resource-bounded periods, it is essential to use limited
            resources efficiently to deliver the most desirable service, rather than fair-sharing among all competitors and ultimately failing them all.
            Proposals have been made to allow applications to explicitly indicate a flow's delivery preferences (either by 
            absolute deadline information <xref target="d3"></xref> or by relative priorities <xref target="detail"></xref>), in order to improve the overall
            delivery success rate.</t>
            <t> (5) Smooth traffic bursts: on the one hand, (distributed) applications could be refined to introduce a random offset to avoid
            peaks of concurrent short-flow submission; on the other hand, a random offset could be introduced into the RTO backoff calculation to 
            mitigate retransmission synchronization <xref target="Pannas"></xref>. Moreover, physical pacing at the NIC level has been proposed to counter
            the effect of traffic bursts caused by general OS server performance optimization techniques <xref target="d2tcp"></xref>.</t>
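            <t> Goal (3) can be illustrated with a sender that scales its window reduction by the fraction of 
            ECN-marked segments, in the style of the ECN-based approach cited above. The following Python code is 
            a minimal sketch under stated assumptions, not a definitive implementation: the class and method names, 
            the gain of 1/16, the one-segment floor and the once-per-window update are all assumptions of the sketch 
            rather than values specified in this document.
            </t>
            <figure><artwork><![CDATA[
```python
class EcnWindow:
    """Sender-side window control driven by the ECN mark fraction."""

    def __init__(self, cwnd, gain=1.0 / 16):
        self.cwnd = float(cwnd)  # congestion window, in segments
        self.gain = gain         # smoothing weight (assumed value)
        self.alpha = 0.0         # running estimate of marked fraction

    def on_window_acked(self, acked, marked):
        """Update once per window of acknowledged segments."""
        fraction = marked / acked if acked else 0.0
        # Exponentially weighted moving average of the mark fraction.
        self.alpha += self.gain * (fraction - self.alpha)
        if marked:
            # Cut in proportion to the extent of congestion,
            # instead of halving on any sign of congestion.
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
        else:
            self.cwnd += 1.0  # additive increase when unmarked
        return self.cwnd
```
]]></artwork></figure>
            <t> Because the reduction tracks how many segments were marked, a lightly marked window shrinks only 
            slightly, which is what lets switch queues stay short without giving up throughput for long flows.
            </t>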

            
        </section>
        </section>



		<section title="DC Transport API Considerations">
            <t> Based on the above discussion, the following information flows should be supported by optimized
            APIs between the application and the core transport service.
            </t>
            
            <section title="Information Flow From Above">

                    <t>(1) Delivery related: information from the application about its expectations for data delivery. 
                    For example, an explicit performance expectation could be specified by:</t>

                            <t>(1.1) an absolute delay requirement; or</t>
                            <t>(1.2) a relative priority indication.</t>

                    
                    <t>(2) Retransmission related: information from the application about how the transport should deal with packet losses.
                     For example, the information could include: </t>

                            <t>(2.1) whether loss recovery is needed; </t>
                            <t>(2.2) if so, the preferred retransmission timeout granularity. </t>

                    
                    <t>(3) Pacing related: information from the application about its expectations for the flow, used to decide whether pacing applies.
                    For example, the information could include: </t>

                            <t>(3.1) traffic duration, in case of a policy that paces long flows only;</t>
                            <t>(3.2) burstiness expectation.</t>
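                    <t>The three groups of information above could be carried in a single per-flow structure. 
                    The following Python sketch is one hypothetical shape for it; the type name, field names, 
                    units and validation rules are assumptions made for illustration and are not an API defined 
                    by this document. In particular, treating (1.1) and (1.2) as mutually exclusive is only one 
                    reading of the "or" in the list.</t>
                    <figure><artwork><![CDATA[
```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FlowHints:
    """Hypothetical per-flow hints passed from application down."""

    # (1) Delivery related: a deadline or a priority.
    deadline_ms: Optional[float] = None
    priority: Optional[int] = None
    # (2) Retransmission related.
    loss_recovery: bool = True
    rto_granularity_us: Optional[int] = None
    # (3) Pacing related.
    expected_duration_ms: Optional[float] = None
    bursty: Optional[bool] = None

    def validate(self):
        """Check the combinations assumed by this sketch."""
        if self.deadline_ms is not None and self.priority is not None:
            raise ValueError("give a deadline or a priority, not both")
        wants_rto = self.rto_granularity_us is not None
        if wants_rto and not self.loss_recovery:
            raise ValueError("RTO granularity needs loss recovery")
        return self


# A short, delay-sensitive query with a 5 ms deadline.
query = FlowHints(deadline_ms=5.0, rto_granularity_us=500).validate()
```
]]></artwork></figure>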
           
               
            </section>
            
            <section title="Information Flow From Below">

                    <t>Congestion status: refers to information from the network device or local transport layer about the congestion status of the current 
                    transport connection/path.</t>
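                    <t>One way to deliver congestion status upward is a callback that the application 
                    registers on a connection. The following Python sketch assumes such a design; the status 
                    fields, class names and method names are hypothetical and only indicate the kind of 
                    information that might be reported.</t>
                    <figure><artwork><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CongestionStatus:
    """Hypothetical snapshot reported to the application."""

    ecn_marked_fraction: float  # 0.0 to 1.0, from network feedback
    smoothed_rtt_us: int        # from the local transport layer
    retransmit_timeouts: int    # count since the last report


class Connection:
    """Minimal stand-in for a transport connection."""

    def __init__(self):
        self._listeners = []

    def subscribe(self, callback):
        """Register a callable that accepts a CongestionStatus."""
        self._listeners.append(callback)

    def report(self, status):
        """Called by the transport to push status upward."""
        for callback in self._listeners:
            callback(status)


# The application records each report it receives.
seen = []
conn = Connection()
conn.subscribe(seen.append)
conn.report(CongestionStatus(0.25, 300, 0))
```
]]></artwork></figure>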

            </section>
       </section>
        
        
	
		<section title="Security Considerations">
			<t>TBA.</t>
		</section>

		<section title="IANA Considerations">
			<t>TBA.</t>
		</section>

		<!-- This PI places the pagebreak correctly (before the section title) in the text output. -->
		<?rfc needLines="8" ?>
	</middle>
	<!--  *****BACK MATTER ***** -->
	<back>
		<!-- References split into informative and normative -->

		<!-- There are 2 ways to insert reference entries from the citation libraries:
		1. define an ENTITY at the top, and use "ampersand character"RFC2629; here (as shown)
		2. simply use a PI "less than character"?rfc include="reference.RFC.2119.xml"?> here
		(for I-Ds: include="reference.I-D.narten-iana-considerations-rfc2434bis.xml")

		Both are cited textually in the same manner: by using xref elements.
		If you use the PI option, xml2rfc will, by default, try to find included files in the same
		directory as the including file. You can also define the XML_LIBRARY environment variable
		with a value containing a set of directories to search.  These can be either in the local
		filing system or remote ones accessed by http (http://domain/dir/... ).-->
		
			<references title="Normative References">
				<!--?rfc include="http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml"?-->
				&RFC2119;	
            </references>
			<references title="Informative References">

            <reference anchor="Pannas">
                <front>
                <title>Safe and effective fine-grained TCP retransmissions for datacenter communication</title>
                <author initials="V." surname="Vasudevan">
                <organization/>
                </author>
                <author initials="A." surname="Phanishayee">
                <organization/>
                </author>
                <author initials="H." surname="Shah">
                <organization/>
                </author>
                <date year="2009"/>
                </front>
              </reference>

            <reference anchor="detail">
                <front>
                <title>DeTail: reducing the flow completion time tail in datacenter networks</title>
                <author initials="D." surname="Zats">
                <organization/>
                </author>
                <author initials="T." surname="Das">
                <organization/>
                </author>
                <author initials="P." surname="Mohan">
                <organization/>
                </author>
                <date year="2012"/>
                </front>
              </reference>	
              			

            <reference anchor="vegas">
                <front>
                <title>Reviving delay-based TCP for data centers</title>
                <author initials="C." surname="Lee">
                <organization/>
                </author>
                <author initials="K." surname="Jang">
                <organization/>
                </author>
                <author initials="S." surname="Moon">
                <organization/>
                </author>
                <date year="2012"/>
                </front>
              </reference>	
              
 
            <reference anchor="dctcp">
                <front>
                <title>Data center TCP (DCTCP)</title>
                <author initials="M." surname="Alizadeh">
                <organization/>
                </author>
                <author initials="A." surname="Greenberg">
                <organization/>
                </author>
                <author initials="D." surname="Maltz">
                <organization/>
                </author>
                <date year="2011"/>
                </front>
              </reference>	
              
              
            <reference anchor="hull">
                <front>
                <title>Less is more: trading a little bandwidth for ultra-low latency in the data center</title>
                <author initials="M." surname="Alizadeh">
                <organization/>
                </author>
                <author initials="A." surname="Kabbani">
                <organization/>
                </author>
                <author initials="T." surname="Edsall">
                <organization/>
                </author>
                <date year="2012"/>
                </front>
              </reference>	
              
               
            <reference anchor="d2tcp">
                <front>
                <title>Deadline-aware datacenter TCP (D2TCP)</title>
                <author initials="B." surname="Vamanan">
                <organization/>
                </author>
                <author initials="J." surname="Hasan">
                <organization/>
                </author>
                <author initials="T." surname="Vijaykumar">
                <organization/>
                </author>
                <date year="2012"/>
                </front>
              </reference>	
                                         
            <reference anchor="d3">
                <front>
                <title>Better never than late: meeting deadlines in datacenter networks</title>
                <author initials="C." surname="Wilson">
                <organization/>
                </author>
                <author initials="H." surname="Ballani">
                <organization/>
                </author>
                <author initials="T." surname="Karagiannis">
                <organization/>
                </author>
                <date year="2011"/>
                </front>
              </reference>	
              </references>
              
	</back>
	<!-- Here we use entities that we defined at the beginning. -->


	



</rfc>
