<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc strict="yes" ?>
<?rfc toc="yes"?>
<?rfc tocdepth="3"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<rfc category="std" docName="draft-kompella-nvo3-server2nve-01"
     ipr="trust200902">
  <front>
    <title abbrev="Signaling VM Activity to NVE">Signaling Virtual Machine
    Activity to the Network Virtualization Edge</title>

    <author fullname="Kireeti Kompella" initials="K." surname="Kompella">
      <organization>Juniper Networks</organization>

      <address>
        <postal>
          <street>1194 N. Mathilda Ave.</street>

          <city>Sunnyvale</city>

          <region>CA</region>

          <code>94089</code>

          <country>US</country>
        </postal>

        <email>kireeti@juniper.net</email>
      </address>
    </author>

    <author fullname="Yakov Rekhter" initials="Y." surname="Rekhter">
      <organization>Juniper Networks</organization>

      <address>
        <postal>
          <street>1194 N. Mathilda Ave.</street>

          <city>Sunnyvale</city>

          <region>CA</region>

          <code>94089</code>

          <country>US</country>
        </postal>

        <email>yakov@juniper.net</email>
      </address>
    </author>

    <author fullname="Thomas Morin" initials="T." surname="Morin">
      <organization>France Telecom - Orange Labs</organization>

      <address>
        <postal>
          <street>2, avenue Pierre Marzin</street>

          <city>Lannion</city>

          <code>22307</code>

          <country>France</country>
        </postal>

        <email>thomas.morin@orange.com</email>
      </address>
    </author>

    <author fullname="David L. Black" initials="D.L." surname="Black">
      <organization>EMC Corporation</organization>

      <address>
        <postal>
          <street>176 South St.</street>

          <city>Hopkinton</city>

          <region>MA</region>

          <code>01748</code>
        </postal>

        <email>david.black@emc.com</email>
      </address>
    </author>

    <date day="22" month="October" year="2012"/>

    <area>Routing</area>

    <keyword>Internet-Draft</keyword>

    <keyword>VM vmotion orchestration</keyword>

    <abstract>
      <t>This document proposes a simplified approach for provisioning the
      networking parameters related to Virtual Machine creation, migration and
      termination on servers. The idea is to provision the server, then have
      the server signal the requisite parameters to the relevant network
      device(s). Such an approach reduces the workload on the provisioning
      system and simplifies the data model that the provisioning system needs
      to maintain. Furthermore, it is more resilient to topology changes in
      server-network connectivity, for example, reconnecting a server to a
      different network port or switch.</t>
    </abstract>
  </front>

  <middle>
    <section anchor="intro" title="Introduction">
      <t>To create a Virtual Machine (VM) on a server in a data center, one
      must specify parameters for the CPU, storage, network and appliance
      aspects of the VM. At a minimum, this requires provisioning the server
      that will host the VM, and the Network Virtualization Edge (NVE) that
      will implement the virtual network for the VM. Similar considerations
      apply to live migration and terminating VMs. This document proposes
      mechanisms whereby a server can be provisioned with all of the parameters
      for the VM, and the server in turn signals the networking aspects to the
      NVE. The NVE may be located on the server or in an external network
      switch that may be directly connected to the server or accessed via an
      L2 (Ethernet) LAN or VLAN. The following subsections capture the
      abstract sequence of steps for VM creation, live migration and
      deletion.</t>

      <section anchor="VMcreate" title="VM Creation">
        <t>This subsection describes an abstract sequence of steps involved in
        creating a VM and making it operational. The following steps are
        intended as an illustrative example, not as prescriptive text; the
        goal is to capture sufficient detail to set a context for the
        signaling described in <xref target="sig"/>.</t>

        <t>Creating a VM requires: <list style="numbers">
            <t>gathering the CPU, network, storage, and appliance parameters
            required for the VM;</t>

            <t>deciding which server, network, storage and appliance devices
            best match the VM requirements in the current state of the data
            center;</t>

            <t>provisioning the server with the VM parameters;</t>

            <t>provisioning the network element(s) to which the server is
            connected with the network-related parameters of the VM;</t>

            <t>informing the network element(s) to which the server is
            connected about the VM's peer VMs, storage devices and other
            appliances with which the VM needs to communicate;</t>

            <t>informing the network element(s) to which a VM's peer VMs are
            connected about the new VM and its addresses;</t>

            <t>provisioning storage with the storage-related parameters;
            and</t>

            <t>provisioning necessary appliances (firewalls, load balancers
            and "middle boxes").</t>
          </list></t>

        <t>While shown as a numbered sequence above, some of these steps may
        be concurrent (e.g., server, storage and network provisioning for the
        new VM may be done concurrently).</t>

        <t>Steps 1 and 2 are primarily information gathering. For Steps 3 to
        8, the provisioning system talks actively to servers, network
        switches, storage and appliances, and must know the details of the
        physical server, network, storage and appliance connectivity
        topologies. Step 4 is typically done using just provisioning, whereas
        Steps 5 and 6 may be a combination of provisioning and other
        techniques. Steps 4 to 6 accomplish the task of provisioning the
        network for a VM, the result of which is a Data Center Virtual Private
        Network (DCVPN) overlaid on the physical network.</t>

        <t>This document focuses on the case where the network elements in
        Step 4 are not co-resident with the server, and shows how the
        provisioning in Step 4 can be replaced by signaling between server and
        network, using information from Step 3. This document also shows how
        Step 4 can interact seamlessly with some of the realizations of Steps
        5 and 6.</t>
      </section>

      <section anchor="VMmigrate" title="VM Live Migration">
        <t>This subsection describes an abstract sequence of steps involved in
        live migration of a VM. Live migration is sometimes referred to as
        "hot" migration, in that from an external viewpoint, the VM appears to
        continue to run while being migrated to another server (e.g., TCP
        connections generally survive this class of migration). In contrast,
        suspend/resume (or "cold") migration consists of suspending VM
        execution on one server and resuming it on another. The following live
        migration steps are intended as an illustrative example, not as
        prescriptive text; the goal is to capture sufficient detail to set a
        context for the signaling described in <xref target="sig"/>.</t>

        <t>For simplicity, this set of abstract steps assumes shared storage,
        so that the VM's storage is accessible to the source and destination
        servers. Live migration of a VM requires: <list style="numbers">
            <t>deciding which server should be the destination of the
            migration based on the VM's requirements, data center state and
            reason for the migration;</t>

            <t>provisioning the destination server with the VM parameters and
            creating a VM to receive the live migration;</t>

            <t>provisioning the network element(s) to which the destination
            server is connected with the network-related parameters of the
            VM;</t>

            <t>transferring the VM's memory image between the source and
            destination servers;</t>

            <t>actually moving the VM: pausing the VM's execution on the
            source server, transferring the VM's execution state and any
            remaining memory state to the destination server and continuing
            the VM's execution on the destination server;</t>

            <t>informing the network element(s) to which the destination
            server is connected about the VM's peer VMs, storage devices and
            other appliances with which the VM needs to communicate;</t>

            <t>informing the network element(s) to which a VM's peer VMs are
            connected about the VM's new location;</t>

            <t>activating the VM's network parameters at the destination
            server;</t>

            <t>deactivating the VM's network parameters at the source
            server;</t>

            <t>deprovisioning the VM from the network element(s) to which the
            source server is connected; and</t>

            <t>deleting the VM at the source server.</t>
          </list></t>

        <t>While shown as a numbered sequence above, some of these steps may
        be concurrent (e.g., moving the VM and associated network
        changes).</t>

        <t>Step 1 is primarily information gathering. For Steps 2, 3, 10 and
        11, the provisioning system talks actively to servers, network
        switches and appliances, and must know the details of the physical
        server, network and appliance connectivity topologies. Steps 4 and 5
        are usually handled directly by the servers involved. Steps 6 to 9 may
        be handled by the servers (e.g., a gratuitous ARP or RARP from the
        destination server may accomplish all four steps) or other
        techniques.</t>

        <t>This document focuses on the case where the network elements are
        not co-resident with the server, and shows how the provisioning in
        Step 3 and the deprovisioning in Step 10 can be replaced by signaling
        between server and network, using information from Step 2. This
        document also shows how Step 3 can interact seamlessly with some of
        the realizations of Steps 6 and 7.</t>
      </section>

      <section anchor="VMterminate" title="VM Termination">
        <t>This subsection describes an abstract sequence of steps involved in
        termination of a VM, also referred to as "powering off" a VM. The
        following termination steps are intended as an illustrative example,
        not as prescriptive text; the goal is to capture sufficient detail to
        set a context for the signaling described in <xref target="sig"/>.</t>

        <t>Termination of a VM requires: <list style="numbers">
            <t>ensuring that the VM is no longer executing;</t>

            <t>deactivating the VM's network parameters at the server;</t>

            <t>deprovisioning the VM from the network element(s) to which the
            server is connected; and</t>

            <t>deleting the VM from the server (the VM's image may remain in
            storage for reuse).</t>
          </list></t>

        <t>While shown as a numbered sequence above, some of these steps may
        be concurrent (e.g., network deprovisioning and VM deletion).</t>

        <t>Steps 1, 2 and 4 are handled by the server, based on instructions
        from the provisioning system. For Step 3, the provisioning system
        talks actively to servers, network switches, storage and appliances,
        and must know the details of the physical server, network, storage and
        appliance connectivity topologies.</t>

        <t>This document focuses on the case where the network elements in
        Step 3 are not co-resident with the server, and shows how the
        deprovisioning in Step 3 can be replaced by signaling between server
        and network.</t>
      </section>
    </section>

    <section anchor="conv" title="Conventions and Acronyms Used">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119"/>.</t>

      <t>The following acronyms are used: <list style="empty">
          <t>DCVPN: Data Center Virtual Private Network -- a virtual
          connectivity topology overlaid on physical devices to provide
          virtual devices with the connectivity they need and isolation from
          other DCVPNs</t>

          <t>NVE: Network Virtualization Edge -- the entities that realize
          private communication among VMs in a DCVPN <list style="empty">
              <t>l-NVE: local NVE -- with respect to a VM, the NVE elements
              to which it is directly connected</t>

              <t>r-NVE: remote NVE -- with respect to a VM, the NVE elements
              to which the VM's peer VMs are connected</t>
            </list></t>

          <t>NVGRE: Network Virtualization using Generic Routing
          Encapsulation</t>

          <t>VDP: VSI Discovery and Configuration Protocol</t>

          <t>VID: 12-bit VLAN tag or identifier used locally between a server
          and its l-NVE</t>

          <t>VLAN: Virtual Local Area Network</t>

          <t>VM: Virtual Machine (same as Virtual Station)</t>

          <t>peer VM: with respect to a VM, other VMs in the VM's DCVPN</t>

          <t>VNID: DCVPN Identifier (sometimes called a Group Identifier)</t>

          <t>VSI: Virtual Station Interface</t>

          <t>VXLAN: Virtual eXtensible Local Area Network</t>
        </list></t>
    </section>

    <section anchor="VN" title="Virtual Networks">
      <t>The goal of provisioning networks for VMs is to create an "isolation
      domain" wherein a group of VMs can talk freely to each other, but
      communication to and from VMs outside that group is restricted (either
      prohibited, or mediated via a router, a firewall or other network
      gateway). Such an isolation domain, sometimes called a Closed User
      Group, here will be called a Data Center Virtual Private Network
      (DCVPN). The network elements on the outer border or edge of the overlay
      portion of a Virtual Network are called Network Virtualization Edges
      (NVEs).</t>

      <t>A DCVPN is assigned a global "name" that identifies it in the
      management plane; this name is unique in the scope of the data center,
      and may also be unique across several cooperating data centers. A DCVPN is
      also assigned an identifier unique in the scope of the data center, the
      Virtual Network Group ID (VNID). The VNID is a control plane entity. A
      data plane tag is also needed to distinguish different DCVPNs' traffic;
      more on this later.</t>

      <t>For a given VM, the NVE can be classified into two parts: the network
      elements to which the VM's server is directly connected (the local NVE
      or l-NVE), and those to which peer VMs are connected (the remote NVE or
      r-NVE). In some cases, the l-NVE is co-resident with the server hosting
      the VM; in other cases, the l-NVE is separate (distributed l-NVE). The
      latter case is the one of interest in this document.</t>

      <t>A created VM is added to a DCVPN through Steps 4 to 6 in
      <xref target="VMcreate"/>, which can be recast as follows. In Step 4, the
      l-NVE(s) are informed about the VM's VNID, network addresses and
      policies, and the l-NVE and server agree on how to distinguish traffic
      for different DCVPNs from and to the server. In Step 5 the relevant
      r-NVE elements and the addresses of their VMs are discovered. In Step 6,
      the r-NVE(s) are informed of the presence of the new VM and obtain its
      addresses.</t>

      <t>Once a DCVPN is created, the next steps for network provisioning are
      to create and apply policies, such as those for QoS or access control. These
      occur in three flavors: policies for all VMs in the group, policies for
      individual VMs, and policies for communication across DCVPN
      boundaries.</t>

      <section title="Current Mode of Operation">
        <t>DCVPNs are often realized as Ethernet VLAN segments. A VLAN segment
        satisfies the communication properties of a DCVPN. A VLAN also has
        data plane mechanisms for discovering network elements (Layer 2
        switches, aka bridges) and VM addresses. When a DCVPN is realized as a
        VLAN, Step 4 requires provisioning both the server and l-NVE with the
        VLAN tag that identifies the DCVPN. Step 6 requires provisioning all
        involved network elements with the same VLAN tag. Address learning is
        done by flooding, and the announcement of a new VM is typically by a
        "gratuitous ARP".</t>

        <t>While VLANs are familiar and well-understood, they fail to scale on
        several dimensions. Underlying VLANs is a Layer 2 infrastructure. The
        number of independent VLANs in a Layer 2 domain is limited by the size
        of the VLAN tag. Data plane techniques (flooding and broadcast) are
        another source of serious concern as the overall size of the network
        grows.</t>
      </section>

      <section title="Future Mode of Operation">
        <t>There are several scalable realizations of DCVPNs that address the
        isolation requirements of DCVPNs as well as the need for a scalable
        substrate for DCVPNs and the need for scalable mechanisms for NVE and
        VM address discovery. While these are not the goal of this document, a
        secondary goal of this document is to show how the signaling that
        replaces Step 4 can seamlessly interact with several of these
        realizations of DCVPNs.</t>

        <t>VLAN tags (VIDs) will be used as the data plane tag to distinguish
        traffic for different DCVPNs between a server and its l-NVE. As used
        here, VIDs have only local significance between a server and its NVE;
        they are not to be confused with VLANs, which are a data center-wide
        concept. Data plane tags between l-NVE and r-NVE depend on the
        encapsulation mechanism used among the NVEs; the l-NVE is expected to
        map between VIDs and intra-NVE tags in both directions.</t>
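
        <t>The two-way mapping described above can be sketched as follows.
        This is a minimal illustration, not a specified mechanism: the class
        name, the method names, the string port identifiers and the integer
        overlay tags are all assumptions made for the example, and it assumes
        a single VID per port for each overlay tag.</t>

        <figure>
          <artwork><![CDATA[
```python
class TagTranslator:
    """Maps port-local VIDs to overlay tags and back (illustrative only)."""

    def __init__(self):
        self._to_overlay = {}  # (port, vid) -> overlay tag
        self._to_local = {}    # (port, overlay tag) -> vid

    def bind(self, port, vid, overlay_tag):
        # VIDs 1..4094 are valid; a VID is significant only on its port.
        if not 1 <= vid <= 4094:
            raise ValueError(f"invalid VID {vid}")
        self._to_overlay[(port, vid)] = overlay_tag
        self._to_local[(port, overlay_tag)] = vid

    def to_overlay(self, port, vid):
        """Server-to-network direction: tag used among the NVEs."""
        return self._to_overlay[(port, vid)]

    def to_local(self, port, overlay_tag):
        """Network-to-server direction: VID to use on this port."""
        return self._to_local[(port, overlay_tag)]


translator = TagTranslator()
translator.bind("port-1", 100, 5001)
translator.bind("port-2", 100, 7002)  # same VID, other port, other DCVPN
```
]]></artwork>
        </figure>

        <t>Keying both directions on the port reflects the local significance
        of VIDs: the same VID value on two ports can belong to two different
        DCVPNs.</t>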
      </section>
    </section>

    <section title="Provisioning DCVPNs">
      <t>For VM creation as described in <xref target="VMcreate"/>,
      Step 3 provisions the server; Steps 4 and 5 provision the l-NVE
      elements; Step 6 provisions the r-NVE elements.</t>

      <t>In some cases, the l-NVE elements live within the server; in this
      case, Steps 3 and 4 are "single-touch" in that the provisioning system
      only needs to talk to the server, and both CPU and network parameters
      can be applied by the server. However, in other cases, the l-NVE is
      separate from the server, requiring that the provisioning system talk
      independently to both the server and l-NVE. This scenario, which we call
      "distributed local NVE", is the one considered in this document. This
      document resurrects "single-touch" provisioning in the distributed l-NVE
      case.</t>

      <t>The approach here is to provision the server, then have the server
      signal the requisite parameters to the l-NVE. Such an approach reduces
      the workload on the provisioning system, allowing it to scale both in
      the number of elements it can manage and in the rate at which it
      can process changes. It also simplifies the data model that the
      provisioning system needs to have; in particular, the provisioning
      system does not have to maintain a full, up-to-date map of server to
      network connectivity. Furthermore, it is more resilient to topology
      changes in server-network connectivity that have not yet been
      transmitted to the provisioning system. For example, if a server is
      reconnected to a different port or a different l-NVE to recover from a
      malfunctioning port, the server can contact the new l-NVE over the new
      port without the provisioning system being aware of the change.</t>

      <t>While the current document focuses on provisioning networking
      parameters via signaling, future extensions may address the provisioning
      of storage and middle-box parameters in a similar fashion. Companion
      documents will describe how NVEs to which peer VMs are connected can get
      the required networking information via signaling rather than by
      provisioning and/or other means.</t>
    </section>

    <section anchor="sig" title="Signaling">
      <section title="Preliminaries">
        <t>There are three common operations in a virtualized data center:
        creating a VM; migrating a VM from one physical server to another; and
        terminating a VM. Creating a VM requires "associating" it with its
        DCVPN and "activating" that association; decommissioning a VM requires
        "deactivating" the VM's association with the DCVPN and then
        "dissociating" the VM from its DCVPN. Moving a VM consists of
        associating it with its DCVPN in its new location, then dissociating
        it from its old location. The deactivation operation is often
        implicit in another operation, but is called out here for symmetry and
        completeness.<!--
	      stick to these terms, or use attach/pre-attach/detach?

	  Thomas: with the text above, it seems to me that pre-associate is not needed anymore : the preliminary work would be triggered by "associate" while
 the final work (the VM has brought up its interface, or has been fully respawned after a migration) would be triggered by "activate".  
Hence, I would suggest removing the sentence  '' To facilitate a smooth migration of a VM, there is one additional operation, "pre-associate" '''.

On a related matter: I don't see yet why we need (or why it will be useful) to distinguish two steps for decommisionning a VM: deactivate and dissassociate.

.--></t>

      </section>

      <section title="VM Operations">
        <section anchor="nwparam" title="Network Parameters">
          <t>For each VM association operation, a subset of the following
          information is needed from server to l-NVE: <list style="hanging">
              <t hangText="operation:">one of pre-associate, associate, or
              dissociate.</t>

              <t hangText="authentication:">proof that this operation was
              authorized by the provisioning system</t>

              <t hangText="VNID:">identifier of DCVPN to which VM belongs</t>

              <t hangText="VID:">tag to use between server and l-NVE to
              distinguish DCVPN traffic; the value zero in an associate or
              pre-associate operation is a request to the l-NVE to assign an
              unused VID. This specification is meant to be extensible: the
              VID can be a VLAN ID, but also any other means of locally
              multiplexing traffic between the server and the NVE. In the
              case where the NVE is implemented on the server, the VID can be
              the local name of a virtual network interface.</t>

              <t hangText="table type:">realization of DCVPN on NVE (see
              below).</t>

              <t hangText="address entries:">addresses for VM on server</t>

              <t hangText="policy:">VM-specific network policies, such as
              access control lists and/or QoS policies</t>

              <t hangText="hold time:">time (in milliseconds) to keep a VM's
              addresses after it migrates away from this l-NVE. This is set to
              zero when a VM is terminated.</t>

              <t hangText="per-address-VID-allocation:">boolean flag that can
              optionally be set to "yes", in which case the VID allocated to
              the VM is distinct from the VIDs allocated to other VMs
              connected to the same DCVPN on the same NVE port; as a result,
              all traffic to and from the VM transits the NVE, even traffic
              exchanged with VMs of the same DCVPN</t>
            </list></t>

          <t>Activate and deactivate are data plane operations that reference
          the VID, and additionally provide authentication, table type and
          address entries information. When an activate is realized via a
          "gratuitous ARP" in the data plane, the VID is in the Ethernet
          header, and all of the other parameters are obtained by mapping the
          VID and the port on which the frame containing it was received to
          information established by a prior associate operation.</t>

          <t>Realizations of DCVPNs include, among others, E-VPNs (<xref
          target="I-D.ietf-l2vpn-evpn"/>), IP VPNs (<xref target="RFC4364"/>),
          NVGRE (<xref target="I-D.sridharan-virtualization-nvgre"/>), TRILL
          (<xref target="RFC6325"/>), VPLS (<xref target="RFC4761"/>, <xref
          target="RFC4762"/>), and VXLAN (<xref
          target="I-D.mahalingam-dutt-dcops-vxlan"/>). The table type
          implicitly defines whether forwarding at the NVE for the DCVPN is at
          Layer 2 or Layer 3 or both.</t>

          <t>Typically, for the pre-associate and associate messages, all the
          information except hold time would be needed. For the dissociate
          message, all the above information except VID and table type would
          be needed.</t>
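
          <t>As a rough illustration of the per-message parameter sets just
          described, the sketch below checks a message for the fields that
          are typically needed. It is not a wire format: the field names, the
          dictionary representation and the function name are assumptions
          made for the example, and the sets encode only the "typical" case
          stated above.</t>

          <figure>
            <artwork><![CDATA[
```python
# Field names are hypothetical; the text names the parameters
# but does not define an encoding for them.
ALL_FIELDS = frozenset({
    "operation", "authentication", "vnid", "vid", "table_type",
    "address_entries", "policy", "hold_time",
    "per_address_vid_allocation",
})

# Typically needed fields per operation, as described in the text.
TYPICAL_FIELDS = {
    "pre-associate": ALL_FIELDS - {"hold_time"},
    "associate": ALL_FIELDS - {"hold_time"},
    "dissociate": ALL_FIELDS - {"vid", "table_type"},
}


def missing_fields(message):
    """Return sorted names of typically needed fields that are absent."""
    operation = message.get("operation")
    if operation not in TYPICAL_FIELDS:
        raise ValueError(f"unknown operation: {operation!r}")
    # Test key presence, not truthiness: a VID of 0 is a valid request
    # for the l-NVE to assign an unused VID.
    return sorted(TYPICAL_FIELDS[operation] - set(message))


associate = {
    "operation": "associate", "authentication": b"token", "vnid": 42,
    "vid": 0, "table_type": "evpn",
    "address_entries": ["00:11:22:33:44:55"],
    "policy": [], "per_address_vid_allocation": False,
}
dissociate = {
    "operation": "dissociate", "authentication": b"token", "vnid": 42,
    "address_entries": ["00:11:22:33:44:55"],
    "policy": [], "per_address_vid_allocation": False,
}
```
]]></artwork>
          </figure>

          <t>With these example messages, the associate message is complete,
          while the dissociate message lacks only the hold time.</t>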

          <t>Operations are stateful, that is, they remain in place until
          superseded by another operation. For example, on receiving an
          associate message, an NVE is expected to create and maintain the
          DCVPN table for a VM until the NVE receives a dissociate message to
          remove the table. A separate liveness protocol may be run between
          server and NVE to let each side know that the other is still
          operational; if the liveness protocol fails, each side may remove
          all state installed in response to messages from the other.</t>
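
          <t>The stateful behavior above can be pictured with the following
          sketch. It assumes, purely for illustration, that an NVE tracks
          associations per server as (VNID, port) pairs; the class and method
          names are hypothetical, and the liveness protocol itself is not
          modeled.</t>

          <figure>
            <artwork><![CDATA[
```python
class SignaledState:
    """State an NVE holds in response to messages from each server."""

    def __init__(self):
        self._by_server = {}  # server id -> set of (vnid, port)

    def associate(self, server, vnid, port):
        # State persists until explicitly removed.
        self._by_server.setdefault(server, set()).add((vnid, port))

    def dissociate(self, server, vnid, port):
        self._by_server.get(server, set()).discard((vnid, port))

    def liveness_failed(self, server):
        """Drop and return everything installed on behalf of server."""
        return self._by_server.pop(server, set())

    def associations(self, server):
        return set(self._by_server.get(server, set()))


state = SignaledState()
state.associate("server-a", 42, "p1")
state.associate("server-a", 43, "p1")
state.associate("server-b", 42, "p2")
state.dissociate("server-a", 43, "p1")
```
]]></artwork>
          </figure>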

          <t>In the descriptions below, we assume that the NVE layer provides
          a mechanism for control plane distribution of VM addresses, as
          opposed to doing this in the data plane. If this is not the case,
          NVE elements can skip the parts of the procedures below that involve
          address distribution.</t>

          <t>As VIDs are local to server-NVE communication, in fact to a
          specific port connecting these two elements, a mapping table
          containing 4-tuples of the following form will prove useful to the
          NVE:</t>

          <figure align="center">
            <artwork><![CDATA[<VID, port, VNID, VM address entries>
	      ]]></artwork>
          </figure>

          <t>The procedures below assume that the NVE systematically reorders
          the provided VM address entries before inserting or looking up
          entries in this mapping table.</t>

          <t>Note that valid values of VID are from 1 to 4094, inclusive. A
          value of 0 is used to mean "unassigned". When a VID can be shared by
          more than one VM, it is necessary to reference-count entries in this
          table. Entries in this table have multiple uses:<list
              style="symbols">
              <t>Find the VNID for a VID and port for association, activation
              and traffic forwarding;</t>

              <t>Determine whether a VID exists (has already been assigned)
              for a VNID and port; and</t>

              <t>Determine which &lt;VID, port&gt; pairs to use for forwarding
              VNID traffic that requires flooding.</t>
            </list></t>
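
          <t>The three uses listed above can be sketched as lookups over a
          small in-memory table. This is a minimal illustration under stated
          assumptions: the class and method names are hypothetical, ports are
          plain strings, and sorting is used as one possible way to put the
          VM address entries into a systematic order. Because each entry
          includes the VM address entries, two VMs sharing a VID simply
          produce two entries.</t>

          <figure>
            <artwork><![CDATA[
```python
class MappingTable:
    """Illustrative table of <VID, port, VNID, VM address entries>."""

    def __init__(self):
        # (vid, port, canonical address tuple) -> vnid
        self._entries = {}

    @staticmethod
    def _canonical(addresses):
        # Sorting is one possible systematic ordering (an assumption).
        return tuple(sorted(addresses))

    def add(self, vid, port, vnid, addresses):
        if not 1 <= vid <= 4094:  # 0 means "unassigned"
            raise ValueError(f"invalid VID {vid}")
        key = (vid, port, self._canonical(addresses))
        self._entries[key] = vnid

    def vnid_for(self, vid, port):
        """Use 1: find the VNID for a VID and port (None if unknown)."""
        for (v, p, _), vnid in self._entries.items():
            if (v, p) == (vid, port):
                return vnid
        return None

    def vid_for(self, vnid, port):
        """Use 2: a VID already assigned for VNID and port, else 0."""
        for (v, p, _), n in self._entries.items():
            if (n, p) == (vnid, port):
                return v
        return 0

    def flood_targets(self, vnid):
        """Use 3: <VID, port> pairs for flooded traffic of this VNID."""
        return sorted({(v, p) for (v, p, _), n in self._entries.items()
                       if n == vnid})


table = MappingTable()
table.add(100, "p1", 42, ["10.0.0.2", "00:11:22:33:44:55"])
table.add(100, "p1", 42, ["10.0.0.3", "00:11:22:33:44:66"])
table.add(200, "p2", 42, ["10.0.0.4"])
```
]]></artwork>
          </figure>

          <t>A linear scan keeps the sketch short; an implementation would
          more likely keep separate indexes for each of the three
          lookups.</t>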

          <!--
	  Thomas's notes:
	  mapping different VMs of a same VNID to different VLANs
	   o text from a previous email:

	    - I'm thinking that we should offer the option of letting a server ask for different local VLAN (ie VID)
for multiple VMs of a same VNID. This would allow the NVE to do policing on the traffic between two VMs hosted
on a same server (a la VEPA). Since it would be at the expense of a detour, this would of course remain optional.

	    - If you agree, this means that the mapping maintained by the NVE would be <VNID,port, address> -> VID ,
and that we need the server to pass an additional parameter, a boolean with the semantic of "please provide a
distinct VID for two distinct addresses of a same VNID"; a possible name would be "per-address-VID-allocation".

	   o we agree to keep the idea, float it around, and see where it goes, maybe as something optional
	  -->

          <!--	    David notes: Need to think about assumption that VM's addresses are always passed in associate operation.
	    Reference counting is going to be problematic if this isn't done, although VID per VM would address that.

     Also, reference counting needs to be double-checked in the operation descriptions below.
	

Thomas: please find in this revision a proposed rewrite to support for this optional behavior, it seems to me that we don't need any refcount if the table has a 4-tuple including the VM address
-->

          <!--
-->
        </section>

        <section title="Creating a VM">
          <t>When a VM is instantiated on a server, it is assigned a VNID, VM
          addresses and a table type for the DCVPN. The VM addresses may be
          any of IPv4, IPv6 and MAC addresses. There may also be network
          policies specific to the VM. To connect the VM to its DCVPN, the
          server signals these parameters to the l-NVE via an "associate"
          operation followed by an "activate" operation to put the parameters
          into use. (Note that the l-NVE may consist of more than one
          device.)</t>

          <t>On receiving an associate message on port P from server S, an NVE
          device does the following: <list style="format A.%d:">
              <t>Validate the authentication (if present). If validation
              fails, inform the provisioning system, log the error, and stop
              processing the associate message. This validation may include
              authorization checks.</t>

              <t>Check the per-address-VID-allocation flag in the associate
              message:<list style="symbols">
                  <t>if this flag is not set:<list style="symbols">
                      <t>Check if the VID in the associate message is zero
                      (i.e., an allocation request); if so, look up the VID
                      for &lt;VNID, P, VM address entries&gt;; if there is
                      none, allocate a new VID.</t>

                      <t>If the VID in the associate message is non-zero, look
                      up &lt;VNID, P, VM address entries&gt; for an already
                      allocated VID. If the lookup returns the provided VID,
                      keep it associated with &lt;VNID, P, VM address
                      entries&gt;. If the result is zero (no VID has been
                      allocated), associate the provided VID with &lt;VNID,
                      P, VM address entries&gt;. Otherwise, the provided VID
                      does not match the one in use for &lt;VNID, P&gt;, so
                      respond to S with an error, and stop processing the
                      associate message.</t>
                    </list></t>

                  <t>if this flag is set, check if the VID in the associate
                  message is zero:<list style="symbols">
                      <t>if so (this is an allocation request), allocate a new
                      VID, distinct from other VIDs allocated on this
                      port;</t>

                      <t>if the VID is non-zero, check that the provided VID
                      is distinct from other VIDs allocated on this port; if
                      so, associate the VID with <VNID, P, VM address
                      entries>. If not, the provided VID does not match the
                      per-address-VID-constraint, so respond to S with an
                      error, and stop processing the associate message.</t>
                    </list></t>
                </list></t>

              <t>Add the &lt;VID, P, VM address entries&gt; -&gt; VNID mapping
              to the mapping table.</t>

              <t>If a table of appropriate type (as signaled) for VNID does
              not already exist, create it, and add the VM's addresses to
              it.</t>

              <t>Communicate with the control plane to advertise the VM's
              addresses, and also to get the addresses of other VMs in the
              DCVPN. Populate the table with the VM's addresses and any
              addresses learned from the control plane (some control planes
              may not provide all or even any of the other addresses in the
              DCVPN at this point).</t>

              <t>Finally, respond to S with the VID for &lt;VNID, P, VM
              address entries&gt;, and indicate that the operation was
              successful.</t>
            </list>After a successful associate, the network has been
          provisioned (at least in the local NVE) for the VM's traffic, but
          forwarding has not been enabled. On receiving an activate message on
          port P from server S, an NVE device does the following (activate is
          a one-way message that does not have a response):</t>

          <t><list counter="1" style="format B.%d:">
              <t>Validate the authentication (if present). If validation
              fails, inform the provisioning system, log the error, and stop
              processing the activate message. This validation may include
              authorization checks.</t>

              <t>Check if the VID in the activate message is zero. If so, log
              the error, and stop processing the activate message.</t>

              <t>Use the VID and port P to look up the VNID from a previous
              associate message. If there is no VNID, log the error and stop
              processing the activate message.</t>

              <t>If forwarding is not enabled for &lt;VID, P, VM address
              entries&gt;, enable it, mapping VID -&gt; VNID.</t>

              <t>If the activate message is a dataplane frame that requires
              forwarding beyond the NVE (e.g., a "gratuitous ARP"), use the
              activated forwarding to send the dataplane frame via the virtual
              network identified by the VNID.</t>
            </list></t>
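
          <t>The associate and activate procedures above can be sketched in
          code. The following Python is a minimal, non-normative illustration
          only. The NvePort class, its field names, the AssociateError
          exception, and the starting VID of 100 are hypothetical and are not
          defined by this document. The sketch models the state of a single
          port P, follows the text literally in keying VID lookups on the VNID
          and the VM address entries, and omits authentication (A.1, B.1),
          table creation and control plane interaction (A.4, A.5), and
          dataplane frame forwarding (B.5). Where the text says to log an
          error and stop, activate() simply returns None.</t>

          <figure>
            <artwork><![CDATA[
```python
from dataclasses import dataclass, field


class AssociateError(Exception):
    """Error response sent to server S (hypothetical name)."""


@dataclass
class NvePort:
    """Illustrative state for one port P on an NVE device."""
    # (VNID, VM addresses) -> VID, i.e. <VNID, P, VM address entries>
    vid_by_key: dict = field(default_factory=dict)
    # VID -> VNID, the mapping table of step A.3
    vnid_by_vid: dict = field(default_factory=dict)
    # VIDs with forwarding enabled (step B.4)
    active: set = field(default_factory=set)
    next_vid: int = 100  # arbitrary first VID to hand out

    def _allocate(self):
        # Pick a VID not yet allocated on this port.
        while self.next_vid in self.vnid_by_vid:
            self.next_vid += 1
        vid = self.next_vid
        self.next_vid += 1
        return vid

    def associate(self, vnid, addrs, vid=0, per_address_vid=False):
        key = (vnid, frozenset(addrs))
        existing = self.vid_by_key.get(key, 0)  # 0 means "none"
        if not per_address_vid:
            if vid == 0:
                # Allocation request: reuse a known VID or allocate.
                vid = existing or self._allocate()
            elif existing not in (0, vid):
                raise AssociateError("VID does not match VID in use")
        else:
            if vid == 0:
                vid = self._allocate()
            elif any(v == vid for k, v in self.vid_by_key.items()
                     if k != key):
                raise AssociateError("VID already used on this port")
        self.vid_by_key[key] = vid      # step A.2
        self.vnid_by_vid[vid] = vnid    # step A.3
        return vid                      # step A.6 response

    def activate(self, vid):
        # Steps B.2 and B.3: reject a zero or unknown VID.
        vnid = self.vnid_by_vid.get(vid) if vid != 0 else None
        if vnid is None:
            return None  # the text says to log the error and stop
        self.active.add(vid)  # step B.4: enable forwarding
        return vnid
```
]]></artwork>
          </figure>

          <t>Under these assumptions, a first call such as
          NvePort().associate(10, ["mac1"]) returns VID 100, and a later
          activate(100) on the same object returns VNID 10 and marks the VID
          as forwarding.</t>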

          <!--
	    PRIVATE NOTE: need to authenticate that a server's request for
	    network connectivity is genuine: NVE cannot blindly trust
	    server.  Need to prevent replay attacks.  Yet would be nice to
	    have a "one-time" token whereby a server can resignal at a
	    future point in time to a different lNVE if a failure occurs,
	    for example if server is rehomed to new lNVE.

	    Also, need to take care more generally of the case of server
	    dual-homed to ToRs.
	  -->

          <!--
	  authorization (further details from Thomas's notes):
	   - cloud os give token to server
	   - server proves legitimacy to hsot VM in VN to TOR1 by showing token
	   - TOR1 gives back another token, allowing server to prove ... to TOR2 in the future
	   - this second token is periodically refreshed bw server and TOR1
	   - the second token is verifieable by TOR2 by some means, e.g. key shared by TOR1 and TOR2
	   - (with a timestamp, periodicity small enough to cover clock drift)
	  -->
        </section>

        <section title="Terminating a VM">
          <t>On receiving a request from the provisioning system to terminate
          a VM, the server sends a dissociate message to the l-NVE with the
          hold time set to zero. The dissociate message contains the
          operation, authentication, VNID, table type, and VM addresses. On
          receiving the dissociate message on port P from server S, each NVE
          device L does the following: <list style="format D.%d:">
              <t>Validate the authentication (if present). If validation
              fails, inform the provisioning system, log the error, and stop
              processing the dissociate message.</t>

              <t>Delete the VM's addresses from the mapping table and delete
              any VM-specific network policies associated with any of the VM
              addresses. If the VNID table is empty after deleting the VM's
              addresses, optionally delete the table and any network policies
              for the VNID.</t>

              <t>Respond to S saying that the operation was successful.</t>
            </list></t>
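
          <t>Steps D.2 and D.3 can be sketched as follows. This Python
          fragment is a non-normative illustration. The function name, the
          dictionary-based tables, and the delete_empty_table parameter are
          hypothetical. Authentication (D.1) and the hold time are not
          modeled, and returning False for an unknown VNID is an assumption,
          since the text does not say how that case is handled.</t>

          <figure>
            <artwork><![CDATA[
```python
def dissociate(tables, addr_policies, vnid_policies, vnid, vm_addrs,
               delete_empty_table=True):
    """Apply steps D.2 and D.3; return True if the VNID was known.

    tables:        VNID -> set of VM addresses
    addr_policies: VM address -> VM-specific policy
    vnid_policies: VNID -> policy for the whole VNID
    """
    table = tables.get(vnid)
    if table is None:
        return False  # assumption: unknown VNID reported as failure
    for addr in vm_addrs:
        table.discard(addr)            # remove the VM's address
        addr_policies.pop(addr, None)  # and its VM-specific policy
    if not table and delete_empty_table:
        # Optional cleanup once no VM addresses remain for the VNID.
        del tables[vnid]
        vnid_policies.pop(vnid, None)
    return True
```
]]></artwork>
          </figure>

          <t>The delete_empty_table flag reflects that deleting an empty VNID
          table is optional in step D.2; an implementation might keep the
          table to speed up a later associate for the same VNID.</t>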
        </section>

        <section title="Migrating a VM">
          <t><!--	    David says: This needs a lot of work.  Should ensure that above two cases are correct before doing it.
	  -->NOTE: This subsection has not been updated from the -00 version of this
          draft; it will be updated in the forthcoming -02 version. The set of
          VM migration steps is known to be incomplete; material on
          concurrent actions and race conditions (based on list discussion)
          should be added, and new step PA.5 is anticipated to need
          generalization to encompass control planes that may not push all
          addressing changes to all relevant rNVEs. Please ignore the text in
          this subsection and beyond; this document is a draft and the
          authors are working on it.</t>

          <t>Suppose that a VM is to be migrated from server S (connected to
          l-NVE device L) to server S' (connected to l-NVE device L'). The
          sequence of steps for migration is: <list counter="-1"
              style="format M.%d:">
              <t>S' gets a request to prepare to receive a copy of the VM from
              S.</t>

              <t>S gets a request to copy the VM to S'.</t>

              <t>S then gets a request to terminate the VM on S.</t>

              <t>Finally, S' gets a request to start up the VM on S'.</t>
            </list></t>

          <t>At Step M.1, S' initiates the move, and also sends a
          pre-associate message to L', including the pre-associate
          information. The processing of a pre-associate message (PA.1 to
          PA.7) for L' is the same as that of an associate message (A.1 to
          A.6), with the following change to step 5: <list style="hanging">
              <t hangText="PA.5:">Communicate with each rNVE device to
              advertise the VM's addresses but as non-preferred
              destinations(*). Also get the addresses of other VMs in the
              DCVPN. Populate the table with the VM's addresses and addresses
              learned from each rNVE. <!--
		    routing-oriented: rewrite to be neutral
		--></t>
            </list></t>

          <t>(*) See <xref target="VNif"/> for some mechanisms for doing this.
          This is necessary so that L' does not attract traffic to the VM's
          new location before the migration is complete, yet L knows ahead of
          time how to send traffic to L' (Step D.2), minimizing traffic loss
          to the VM when migration is complete.</t>

          <t>At step M.2, S initiates the VM copy. If at any time L hears
          advertisements from L' about how to communicate with the VM in its
          new location (as non-preferred destinations), L stores that
          information for use in step D.2.</t>

          <t>At step M.3, S terminates the running of the VM on itself, and
          sends a dissociate message to L with a non-zero hold time (either
          what the provisioning system sends, or a default value). L processes
          the dissociate message as above.</t>
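
          <t>The role of non-preferred destinations during migration can be
          illustrated with a small sketch. The Python below is
          non-normative. The pick_destination function, the shape of the
          adverts table, and the sample address "vm-mac" are hypothetical,
          and the sketch assumes that the advertisement for the old location
          is no longer present once the dissociate has taken effect. Actual
          mechanisms depend on the DCVPN control plane.</t>

          <figure>
            <artwork><![CDATA[
```python
def pick_destination(adverts, vm_addr):
    """Return the NVE to send to, or None if vm_addr is unknown.

    adverts maps a VM address to a list of (nve, preferred) pairs.
    """
    candidates = adverts.get(vm_addr, [])
    for nve, preferred in candidates:
        if preferred:
            return nve
    # No preferred destination: fall back to a non-preferred one.
    return candidates[0][0] if candidates else None


# During the copy, L is preferred and L' is known but non-preferred.
adverts = {"vm-mac": [("L", True), ("L'", False)]}
before = pick_destination(adverts, "vm-mac")

# Assumption: after step M.3 the entry for L is gone.
adverts["vm-mac"] = [(n, p) for n, p in adverts["vm-mac"] if n != "L"]
after = pick_destination(adverts, "vm-mac")
```
]]></artwork>
          </figure>

          <t>In this sketch traffic keeps going to L while the copy is in
          progress and switches to L' as soon as L is no longer a candidate,
          without waiting for a new advertisement from L'.</t>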

          <!--
	  <t>
	    PRIVATE NOTE: need to take care of the case where L and L' are
	    the same device.  Also, should the VNID table on L' be
	    activated on a pre-associate message, or should it be
	    programmed to drop traffic until an associate message is
	    received?  Finally, is a pre-dissociate message needed?
	  </t>
	  -->
        </section>
      </section>

      <section title="Signaling Protocols">
        <t>Several protocols could be used to signal the above messages. One
        could invent a new protocol for this purpose, or reuse an existing
        protocol such as LLDP, XMPP, HTTP REST, or VDP
        <xref target="VDP"/>, a new protocol standardized for the purpose of
        signaling a VM's network parameters from server to l-NVE. Several
        factors influence the choice of protocol(s); at this time, the focus
        is on what needs to be signaled, leaving for later the choice of how
        the information is signaled, and specific encodings.</t>

        <!--
	discovery of IP address for a REST API
	 - link local anycast address present on all ToR
	 - server has two address on the tor facing interface: its own and an address in 169.254.x.x. in VLAN 0
	-->

        <!--
	<t>
	  With VDP, several further options remain.  One can enhance the
	  existing TLVs with sub-TLVs that carry needed information that
	  is not present today.  One can add new filter types to carry
	  some of this information.  Or one can create a parallel set of
	  TLVs under the umbrella "OUI TLV", with the OUI set to the
	  IETF's or IANA's OUI.  All of these approaches seem feasible,
	  although The last option may be the least disruptive.
	</t>
	-->
      </section>

      <section title="Liveness">
        <t>Procedures to handle failures of the server or of the NVE will be
        covered in a further revision.</t>

        <!--
	o issue1: what if the server crashes ?
	 - VM is respawn elsewhere
	 - traffic need to go there and not to the old location
	o maybe we could use "generation" id style BGP attribute
	 - issue is how to make this work for other ToR2tOR VN networking solutions
	 - if we use this we don't need a keepalive technique for issue1
	o issue2: what if the ToR goes down ?
	 - the server needs to re-associate to another ToR
	   - in this case, we need a trigger/keepalive to associate to backup ToR
	 - or the server is dual-associating a VM to two ToRs
	   - in this case we need a trigger to let the server avoid sending traffic to the ToR which is down
	o candidates keepalive mechanisms
	 - BFD
	 - Ethernet OAM
	 - (laser) [aka loss of light]
	 - rely on the cloud OS which needs to know that a server is down (ToR?)
	-->
      </section>
    </section>

    <section anchor="VNif" title="Interfacing with DCVPN Control Planes">
      <t>The control plane for a DCVPN manages the creation/deletion,
      membership and span of the DCVPN (<xref
      target="I-D.narten-nvo3-overlay-problem-statement"/>, <xref
      target="I-D.kreeger-nvo3-overlay-cp"/>). Such a control plane needs to
      work with the server-to-NVE signaling in a coordinated manner, to ensure
      that address changes at a local NVE are reflected appropriately in
      remote NVEs. The details of such coordination will be specified in a
      companion document.</t>
    </section>

    <section anchor="sec-con" title="Security Considerations"/>

    <section anchor="iana-con" title="IANA Considerations"/>

    <section title="Acknowledgments">
      <t>Many thanks to Amit Shukla for his help with the details of EVB and
      his insight into data center issues.</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include='reference.RFC.2119'?>

      <reference anchor="VDP">
        <front>
          <title>Edge Virtual Bridging (802.1Qbg) (work in progress)</title>

          <author fullname="" initials="" surname="">
            <organization>IEEE 802.1 Working Group</organization>
          </author>

          <date year="2012"/>
        </front>
      </reference>
    </references>

    <references title="Informative References">
      <?rfc include='reference.RFC.4364'?>

      <?rfc include='reference.RFC.4761'?>

      <?rfc include='reference.RFC.4762'?>

      <?rfc include='reference.RFC.6325'?>

      <?rfc include='reference.I-D.ietf-l2vpn-evpn'?>

      <?rfc include='reference.I-D.kreeger-nvo3-overlay-cp'?>

      <?rfc include='reference.I-D.mahalingam-dutt-dcops-vxlan'?>

      <?rfc include='reference.I-D.narten-nvo3-overlay-problem-statement'?>

      <?rfc include='reference.I-D.sridharan-virtualization-nvgre'?>
    </references>
  </back>
</rfc>
