One document matched: draft-shah-armd-arp-reduction-00.txt



   Working Group: ARMD                                  Himanshu Shah 
   Intended Status: Proposed Standard                      Ciena Corp 
   Internet Draft                                           
                                                                      
   Expiration Date: May, 2011 
                                                                      
                                                    October 18, 2010 
                                                                      
                                         
    
                                                                      
                                                                      
 
    
 
                        ARP Reduction in Data Center
                   draft-shah-armd-arp-reduction-00.txt 
                                      
 
 
Status of this Memo 
    
   This Internet-Draft is submitted in full conformance with the 
   provisions of BCP 78 and BCP 79. 
    
   Internet-Drafts are working documents of the Internet Engineering 
   Task Force (IETF), its areas, and its working groups.  Note that 
   other groups may also distribute working documents as Internet-
   Drafts. 
    
   Internet-Drafts are draft documents valid for a maximum of six 
   months and may be updated, replaced, or obsoleted by other documents 
   at any time. It is inappropriate to use Internet-Drafts as reference 
   material or to cite them other than as "work in progress." 
    
   The list of current Internet-Drafts can be accessed at 
   http://www.ietf.org/1id-abstracts.html 
    
   The list of Internet-Draft Shadow Directories can be accessed at 
   http://www.ietf.org/shadow.html 
    
   This Internet-Draft will expire on May 18, 2011 
    
    
    
Copyright Notice 
    
   Copyright (c) 2010 IETF Trust and the persons identified as the 
   document authors.  All rights reserved. 
    
     
   Shah, et al.            Expires May 2011                          1 
   Internet Draft   draft-shah-arp-reduction-00.txt 
                                    
    
   This document is subject to BCP 78 and the IETF Trust's Legal 
   Provisions Relating to IETF Documents 
   (http://trustee.ietf.org/license-info) in effect on the date of 
   publication of this document. Please review these documents 
   carefully, as they describe your rights and restrictions with 
   respect to this document.  Code Components extracted from this 
   document must include Simplified BSD License text as described in 
   Section 4.e of the Trust Legal Provisions and are provided without 
   warranty as described in the Simplified BSD License. 
    
    
Abstract 
    
   With advent of virtual machine (VM) technologies, a host is able to 
   support multiple VMs in a single physical machine. The data center 
   application leverages these capabilities to instantiate upwards of 
   10s to 100s of VMs in a server. Each VM operates as an independent 
   IP host with its own MAC address associated with a virtual Network 
   Interface Card (vNIC) that maps to a single physical Ethernet 
   interface. These physical servers are typically stacked in a rack 
   with its Ethernet interface connected to top-of-the-rack (ToR) 
   switch. The ToR switches are interconnected through End-of-the-Row 
   (EoR) switch which in turn is connected to core switches.  
    
   As discussed in [ARP-Problem] the VM hosts use ARP broadcasts to 
   find other VM hosts and use periodic (broadcast) gratuitous ARPs to 
   refresh their IP to MAC address binding in other VM hosts. Such 
   broadcasts in a large data center with potentially thousands of VM 
   hosts in a layer-2 based topology can cause havoc.  
    
   This document describes a solution whereby a ToR switch assumes the 
   handling of the ARP broadcasts based on the ARP table that it 
   maintains by gleaning information from the passing ARP PDUs. When 
   the information is not new, gratuitous ARP PDUs are dropped and ARP 
   broadcast requests from hosts are responded by the switch from the 
   learned ARP information instead of forwarding them out. 
    
    
Conventions 
    
   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",  
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 
   document are to be interpreted as described in RFC 2119 [RFC 2119].  
        
 
    
Table of Contents 
    
     
 Copyright Notice .................................................... 1 
     
   Shah, et al.            Expires May 2011                          2 
   Internet Draft   draft-shah-arp-reduction-00.txt 
                                    
    
Abstract.............................................................. 2 
1.0 Contributing Authors ............................................. 3 
2.0 Overview.......................................................... 3 
 2.1 Terminology ..................................................... 4 
3.0 Topology.......................................................... 5 
4.0 Configuration..................................................... 5 
5.0 Building the ARP tables........................................... 6 
 5.1 ARP Requests .................................................... 6 
 5.2 ARP Response .................................................... 7 
 5.3 Gratuitous ARP .................................................. 7 
 5.4 Host movement ................................................... 7 
 5.5 IPv6 Hosts ...................................................... 8 
 5.6 External ARP servers ............................................ 8 
 6.0 Conclusion ...................................................... 8 
7.0 Security Considerations........................................... 9 
8.0 References........................................................ 9 
 8.1 Normative References ............................................ 9 
 8.2 Informative References .......................................... 9 
9.0 Author's Address.................................................. 9 
    
1.0  Contributing Authors 
    
   This document is the combined effort of the following individuals 
   and many others who have carefully reviewed this document and 
   provided the technical clarifications 
    
   Linda Durbar                 Huawei 
   Sue Hares                    Huawei 
   T Sridhar          Force10 Networks 
    
    
    
2.0 Overview 
    
   The following factors exasperate the effect of ARP broadcasts in the 
   data centers. 
     . Ever increasing dependence on applications that run in data 
        center 
     . Large number of physical server hosts in the server farms 
     . Use of large number of VMs in the physical server host 
     . Each VM to have its own IP and MAC address and they all reside 
        in the same subnet as it allows them to move around in 
        different physical hosts based on fair resource distribution 
        policies. This requires the data center networks to be layer 2 
        based 
     
   Shah, et al.            Expires May 2011                          3 
   Internet Draft   draft-shah-arp-reduction-00.txt 
                                    
    
     . Each VM resort to frequent ARP broadcasts as request to find 
        the target VMs and/or gratuitous ARPs to refresh its IP to MAC 
        address binding in peers with whom they tend to chat more 
        often. This also stems from the fact that VMs holds a 
        relatively small ARP table and use more aggressive age out in 
        order to accommodate the 'most' active peers in the table. 
    
   The broadcast as such in layer 2 networks has far reaching impacts; 
   i.e. wastage in network bandwidth as well as CPU resources used by 
   all the VMs while processing superfluous ARP broadcasts.  
    
   It appears that it is possible to minimize the ills of ARP 
   broadcasts in the data center network in a relatively simpler 
   fashion. The solution requires the first hop Ethernet Switches, 
   typically ToR, to maintain ARP table learned from the passing ARP 
   PDUs and selectively propagate and/or proxy on behalf of the remote 
   peer. These types of ARP processing principles are well known and 
   used/described in L2VPN Working Group documents such as [ARP-
   Mediation] and [IPLS]. 
    
   The following sections describe the inner-workings of ARP snooping, 
   learning and maintaining ARP tables, using the learned information 
   to limit broadcast propagation and proxy (the response) on behalf of 
   the remote peers. 
    
    
     
    
    
    
2.1 Terminology 
    
    
        ToR            Top-of-Rack. An Ethernet switch present on top 
                       of a rack which provides network connectivity to 
                       the servers present on the rack.   
    
        Downlink       Downlink in this document refers to local host 
                        (servers in the rack) facing Ethernet 
                        connection in the ToR switch. 
          
         
        Uplink         Uplink in this document refers network facing 
                        Ethernet connection in the ToR switch. 
                        Typically, the uplinks from ToRs connect to 
                        end-of-rack switches.  
         
        EoR            End-of-Rack Ethernet switch. This is more of an 
                        aggregation switch. Uplinks from ToR connects 
                        to EoR and uplink from EoR connects to Core 
                        switch.  
         
     
   Shah, et al.            Expires May 2011                          4 
   Internet Draft   draft-shah-arp-reduction-00.txt 
                                    
    
        Host/Server    The host or server term is used in this 
                        document to refer to an IP host or server. An 
                        IP host could be a one physical entity or a 
                        logical entity (as a Virtual Machine) in a 
                        physical host. The term server refers to its 
                        application role in data center. Both terms are 
                        used interchangeably or together and mean IP 
                        end station. 
         
        Local hosts    This term is used in the context of a ToR 
                        switch to denote the (VM) hosts connected to a 
                        ToR on the downlink, i.e. directly connected 
                        hosts  
         
        Remote hosts    This term is used in the context of a ToR to 
                        denote the hosts that are accessible through 
                        uplink of the ToR.  
         
        VM             Virtual Machine. This is a logical instance of 
                        a host that operates independently in a 
                        physical host and has its own IP and MAC 
                        address. The VM architecture allows efficient 
                        use of physical host resources in data center 
                        application. 
         
         
         
3.0 Topology 
    
   An example topology of a data center network that is referred in 
   this document is that of an hierarchical connectivity of low to high 
   density Ethernet switches that provide flat (common broadcast 
   domain) layer 2-based network for the servers in the data center. 
   Each server host, thus connected is said to be on the same subnet 
   and communicate directly using IP without having to go through a 
   router or default gateway. In other words, an IPv4 host (VM or 
   otherwise) on this network can find another IPv4 host's MAC address 
   using the ARP methodologies. 
    
       
4.0 Configuration 
    
   It is assumed that ARP reduction methodologies that are defined in 
   this document will be limited to ToR switches. We believe that 
   maximum benefits of restraining ARP broadcasts in the network can be 
   achieved by the first hop (or directly connected to host) switches 
   without placing additional burden on second or third tier switches.  
    
   The ToR switches will need to be configured with this feature 
   enabled. Each Ethernet interface needs to be identified as a type of 
   downlink or uplink within the context of this feature. The ARP 
   reduction feature treats ARP frames received from downlink or uplink 
   differently as described in the following sections. 
     
   Shah, et al.            Expires May 2011                          5 
   Internet Draft   draft-shah-arp-reduction-00.txt 
                                    
    
     
   It is possible for the operator to configure various ARP reduction 
   related parameters; such as - 
     . ARP aging timer 
     . Size of the ARP table 
     . Static entries of IP to MAC address 
    
   There are situations where low cost ToR switches do not have the 
   needed capacity to process ARP reduction functions.  Under those 
   circumstances, external ARP server (described below) approach can be 
   considered.   
    
    
5.0 Building the ARP tables 
 
   When enabled, ToR switch will start monitoring the data frames for 
   the ARP PDUs. The ARP PDU processing is recommended to be handled in 
   the following manner. 
     . All ARP request PDUs should be redirected to control plane CPU 
     . All gratuitous ARP PDUs should be redirected to control plane 
        CPU 
     . All ARP response PDUs should be bi-casted; one copy sent to 
        control plane CPU and other copy forwarded out normally. 
    
   The ARP table can become large. The scaling factor dictates that the 
   table be maintained in the control plane memory as compared to 
   hardware tables in the forwarding plane. In either case, it is 
   prudent that 'local host' is preferred over 'remote host' when 
   placing the IP to MAC address association entries in the contested 
   ARP table space. 
    
    
5.1 ARP Requests 
    
   The ARP requests are broadcast frames. The ToR gleans the IP and MAC 
   address from the ARP PDU. The source IP and MAC address association 
   is learned or updated/refreshed, if already learned. The destination 
   IP address is searched in the ARP table. If an entry exists, the 
   associated MAC address from the table is used to prepare a unicast 
   ARP reply PDU. The same MAC address is also used as the source MAC 
   address in the MAC header that is prepended to the unicast ARP reply 
   PDU. 
    
   If the destination IP address in the request is not present in the 
   table, then the original ARP request PDU is broadcasted to all the 
   switch ports except the source port the request was received from. 
   However, if the requested (destination) IP address is present in the 
   ARP table, unicast ARP response PDU is prepared as described above 
   and sent to the egress port based on which port the target existed 
   and original ARP request PDU is dropped. 
    
   The intent is to try preventing propagation of ARP request PDU 
   broadcasts as much as possible using the information present in the 
     
   Shah, et al.            Expires May 2011                          6 
   Internet Draft   draft-shah-arp-reduction-00.txt 
                                    
    
   ARP table. The following observations can be made from such 
   behavior. 
      . Most of the ARP requests from the local hosts for the local 
         hosts can be prevented most of the times 
      . Most of the ARP requests from the remote hosts for the local 
         hosts can be prevented from forwarding towards downlinks or 
         other uplinks 
      . Many of the ARP requests from the local hosts for the remote 
         hosts can be prevented from forwarding towards uplinks, if 
         remote host IP to MAC association is known. 
    
5.2 ARP Response 
    
   The unicast ARP response is gleaned to learn/update the ARP table 
   for source and destination IP/MAC address association and forwarded 
   out as a normal frame. 
    
5.3 Gratuitous ARP 
    
   The Gratuitous ARP reply is a broadcast ARP PDU with destination IP 
   address and MAC address of the sender. It is typically used by the 
   (VM) IP hosts to keep its association fresh in peer's ARP cache.  
    
   The ToR switch should process Gratuitous ARP in the following 
   manner. 
      . Learn/update/refresh the ARP table entry 
      . If ARP entry was new or existed with different information then 
         gratuitous ARP PDU is forwarded out otherwise the PDU is 
         dropped. 
    
   The important goal for handling of the gratuitous ARP PDU from the 
   downlinks (i.e. local hosts) is to not propagate into the 'network' 
   (i.e. to uplinks) if the information is not new.  
      
5.4 Host movement 
    
   The VM architecture allows movements of VMs to different physical 
   server entities based on optimum resource utilization policies. The 
   act of movement is called vMotion and the flexibility adds 
   attraction for its use. The vMotion could be manual (operator 
   initiated) or automatic in reaction to demands placed by the 
   application users. The important point is that in either case, 
   vMotion is not transparent and is made known to the network. There 
   is ongoing work in IEEE 802.1Qbe standards organization to 
   coordinate/communicate the presence and capabilities of the VMs to 
   the directly connected network switch. 
    
   It is expected that ToR would leverage the knowledge obtained of 
   newborn VM to update local ARP table as well as to notify the 
   network (other switches) via unsolicited gratuitous ARP on behalf of 
   the VM. The details of such procedures will be described in the 
   subsequent revisions of this document. 
    
     
   Shah, et al.            Expires May 2011                          7 
   Internet Draft   draft-shah-arp-reduction-00.txt 
                                    
    
5.5 IPv6 Hosts 
    
   The IPv6 hosts use Neighbor Discovery procedures that are different 
   from ARP methodologies used by IPv4 hosts. The details of handling 
   of Neighbor discovery procedures will be described in the subsequent 
   updates to this document. 
     
5.6 External ARP servers 
    
   It is possible that in some configuration, the ToR switches may not 
   be capable to handle the ARP reduction procedures. For such 
   configuration, it is possible to outsource the ARP reduction 
   procedures to one or more external ARP server hosts. The ToR 
   switches will then be configured to, 
      . Identify the interface(s) connected to the ARP servers. Such 
         interface(s) must be separate from downlink and uplinks that 
         are connected to 'host reachable' (or native) networks as 
         described above. This concept is similar to how switches treat 
         'management' network separate from user data network. 
      . All broadcast ARP PDUs are forwarded to interface(s) where ARP 
         servers reside 
      . All unicast ARP PDUs received from 'native' interfaces are bi-
         casted; one copy of the PDU is forwarded to ARP server and 
         other forwarded normally 
      . All ARP PDUs received from the ARP server interface(s) are the 
         results of the ARP-Reduction procedure based PDUs generated by 
         the ARP servers. They are handled in the following manner. 
           o The source MAC address is not learned 
           o Instead, the source MAC address is used to determine the 
              'real' native ingress interface. That is, switch will 
              treat the packet as if it was received from the interface 
              where source appears to reside and make the forwarding 
              decision based on destination MAC address and the newly 
              determined ingress port. 
      . If multiple ARP server interfaces are to exist (in order to 
         avoid single point of failure), an ARP PDU received from one 
         ARP server interface is never forwarded out to other ARP 
         server interface(s) (i.e. split horizon rule). 
    
6.0 Conclusion 
    
   Based on the procedures described in this document, it is possible 
   for ToR switches in the data center to dampen ARP broadcasts 
   significantly. The solution is not new, based on well known 
   procedures, non-intrusive and low hanging fruit that strives to 
   curtail broadcasts that are increasingly becoming a problem in the 
   data centers. In essence, ToR switches are facilitating the 
   offloading of the extended ARP table management from the IP hosts 
   unto itself. The ARP table timeout can be tuned higher by the 
   operator based on the available switch resources and network traffic 
   behavior. The larger capacity of the ARP table directly translates 
   to more effective subduing of the ARP broadcasts. An additional 
   approach is described to further offload ARP table and PDU 
     
   Shah, et al.            Expires May 2011                          8 
   Internet Draft   draft-shah-arp-reduction-00.txt 
                                    
    
   management to dedicated server(s) for reduced capacity low end ToR 
   or as a cost effective solution. 
    
    
7.0 Security Considerations 
    
   The details of the security aspects will be addressed in future 
   revision. 
    
8.0 References 
    
8.1 Normative References 
    
    
   [ARP] RFC 826, STD 37, D. Plummer, "An Ethernet Address Resolution 
      Protocol:  Or Converting Network Protocol Addresses to 48.bit 
      Ethernet Addresses for Transmission on Ethernet Hardware". 
    
   [ARP-Problem] L.Dunbar et al., "Scalable Address Resolution for 
      Large Data Center Problem Statements", draft-dunbar-arp-for-
      large-dc-problem-statement-00.txt. 
    
    
8.2 Informative References 
    
   [ARP-Mediation] H. Shah et al., "ARP Mediation for IP interworking 
      in Layer 2 VPN", draft-ietf-l2vpn-arp-mediation-13.txt. 
     
   [IPLS] H.Shah et al., "IP-only LAN service",  
      draft-ietf-l2vpn-ipls-09.txt. 
 
   [PROXY-ARP] RFC 925, J. Postel, "Multi-LAN Address Resolution". 
    
    
9.0 Author's Address 
    
   Himanshu Shah 
   Ciena Corp 
   Email: hshah@ciena.com 
    
 
     
   Shah, et al.            Expires May 2011                          9 

PAFTECH AB 2003-20262026-04-19 08:29:10