INTERNET-DRAFT J. Williams
draft-williams-iwarp-ift-01.txt Emulex Corporation
Expires: August 2003
February 2003
iWARP Framing for TCP
1. Status of this Memo
This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026. Internet-Drafts are work-
ing documents of the Internet Engineering Task Force (IETF), its
areas, and its working groups. Note that other groups may also dis-
tribute working documents as Internet-Drafts. Internet-Drafts are
draft documents valid for a maximum of six months and may be
updated, replaced, or obsoleted by other documents at any time. It
is inappropriate to use Internet-Drafts as reference material or to
cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
2. Abstract
A framing protocol is defined for DDP over TCP that is fully
compliant with applicable TCP RFCs and fully interoperable with
existing TCP implementations.
The protocol offers two things: definition of the DDP record
boundaries within the TCP stream, and added data integrity
by means of CRC protection. In addition, an adaption
mechanism is defined that confirms use of RDDP and negotiates
parameters including use of CRCs, markers, and enabling of
speculative placement.
3. Acknowledgements
The detailed and painstaking effort of the RDMA Consortium members
is acknowledged, as well as that of others who have contributed to
defining an RDMA over IP protocol.
J. Williams Expires: August 2003 [Page 1]
INTERNET-DRAFT iWARP Framing for TCP February 2003
4. Adaption Indication
It is expected that RDDP mode may be used immediately at connection
setup time, or alternatively may be initiated at some time after
the connection has been set up. The latter case would be used
when an existing ULP negotiates an upgrade to RDDP mode.
The adaption indication is sent in band, but MUST NOT be the initial
offer to use RDDP. Either the ULP MUST first agree to use RDDP, or
the port number MUST be one that is agreed to be used for RDDP.
The sending of the adaption indication is a confirmation of this
agreement to use RDDP, and also a means to negotiate the particular
parameters of the RDDP connection.
4.1 Adaption Indication Format
The adaption indication is exactly 64 bytes in length and has
the following format.
+--------------+---------------+---------------+---------------+
| Magic number ( 0x52444450 ) |
+--------------+---------------+---------------+---------------+
| version | BC version |
+--------------+---------------+---------------+---------------+
| Flags |
+--------------+---------------+---------------+---------------+
| Sender's VTag | marker interval |
+--------------+---------------+---------------+---------------+
| |
| reserved (44 bytes) |
| |
| |
+--------------+---------------+---------------+---------------+
| CRC |
+--------------+---------------+---------------+---------------+
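The layout above can be illustrated with a minimal packing sketch. The
following are assumptions that this document does not state: fields are
in network byte order, version, BC version, VTag and marker interval
are 16 bits each, the CRC is CRC-32c, and the CRC covers the first 60
bytes. The function names are hypothetical.

```python
import struct

MAGIC = 0x52444450            # ASCII "RDDP", from the format diagram
INDICATION_LEN = 64
# Assumed layout: magic(32) version(16) bc_version(16) flags(32)
# vtag(16) marker_interval(16) reserved(44 bytes), then the CRC(32).
_BODY = struct.Struct("!IHHIHH44x")   # the 60 bytes assumed to be CRC-covered


def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC-32c (Castagnoli), reflected polynomial 0x82F63B78."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def pack_indication(version, bc_version, flags, vtag, marker_interval):
    """Build a 64-byte adaption indication; reserved bytes are sent as zero."""
    body = _BODY.pack(MAGIC, version, bc_version, flags, vtag, marker_interval)
    return body + struct.pack("!I", crc32c(body))


def unpack_indication(data: bytes):
    """Validate length, CRC and magic, then return the five fields."""
    if len(data) != INDICATION_LEN:
        raise ValueError("adaption indication must be exactly 64 bytes")
    body, (crc,) = data[:60], struct.unpack("!I", data[60:])
    if crc32c(body) != crc:
        raise ValueError("adaption indication CRC mismatch")
    magic, version, bc_version, flags, vtag, interval = _BODY.unpack(body)
    if magic != MAGIC:
        raise ValueError("bad magic number")
    return version, bc_version, flags, vtag, interval
```

The reserved bytes are written as zero and skipped on receipt, which
matches the rule that reserved bits are ignored by the receiver.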
4.2 Field definitions
Version Defines protocol version of the adaption indication.
This document describes version 1. The field
MUST have a value of one.
BcVersion Backward compatible version. For this version of the
protocol, this field MUST be set to 1. In general it
is set to the minimum version with which the sender is
backward compatible. On receipt, any BC version greater
than the receiver's current version MUST be rejected as
incompatible. A received version greater than the
receiver's current version need not be rejected, as long
as the BC version is not also greater.
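As an illustration of the version rule, the receiver's decision can be
sketched as below. The function name and the MY_VERSION constant are
hypothetical; only the comparison itself comes from the field
definitions.

```python
MY_VERSION = 1  # the protocol version described by this document


def is_compatible(rx_version: int, rx_bc_version: int,
                  my_version: int = MY_VERSION) -> bool:
    """Return True unless the peer's BC version exceeds our own version."""
    # rx_version is deliberately not compared: a newer peer is acceptable
    # as long as it declares backward compatibility with our version.
    return rx_bc_version <= my_version
```

For example, a peer announcing version 3 with BC version 1 is accepted
by a version 1 receiver, while version 3 with BC version 2 is rejected.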
The following flag bits are defined starting with the MSB of
the flag field and proceeding left to right. The remainder of the
flag field is reserved.
PH Adaption is a two-phase negotiation, with phases
numbered 0 and 1. Phase 0 is sent from node A
to node B. On receipt of phase 0, node B sends phase
1 back to node A. This bit indicates phase 0 or 1.
R Indicates sender wants to use RDMA mode. If both
ends set this bit, then connection is in RDMA mode.
If both ends clear this bit, then the connection is
in DDP mode. If ends disagree, then the connection
MUST be closed in error.
HC Indicates that the sender will be including a header
CRC. If the sender's choice to use or not use a header
CRC is unacceptable to the receiver, the receiver
SHOULD abort the connection.
PC Indicates that the sender will be including a payload
CRC. If the sender's choice to use or not use a payload
CRC is unacceptable to the receiver, the receiver
SHOULD abort the connection.
IH Indicates that the sender will ignore and not check
any header CRC which the receiver of the adaption
indication sends. If this is unacceptable to the
receiver of the adaption indication, the connection
should be closed.
IP Indicates that the sender will ignore and not check
any payload CRC which the receiver of the adaption
indication sends. If this is unacceptable to the
receiver of the adaption indication, the connection
should be closed in error.
SP Indicates the sender is giving the receiver permission
to do speculative placement. If this bit is set, the
receiver MAY do speculative placement. If not set,
the receiver MUST NOT do speculative placement.
See the section on speculative placement below for a
full description.
SM Indicates the sender of the adaption indication is willing
to send periodic markers. Markers will be sent only if
the SM bit is set and the other end of the connection sets
the RM bit in its adaption indication.
RM Indicates the sender of the adaption indication wishes
to receive periodic markers. Markers will be received
if and only if the RM bit is set and the other end
sets the SM bit in its adaption indication.
Sender's VTag
The sender's VTag is a 16 bit value that SHOULD be
selected at random using a good random number
generator. It is included in all outgoing RDDP
PDUs sent by the receiver of this adaption indication.
The main purpose is to provide padding to align
the RDDP header; however, it also provides some added
checking as to the validity of the IFT header, and
some added confidence for doing speculative placement.
Marker Interval
Used to negotiate use of periodic markers and determine
the interval at which they will be sent. The exact method
of negotiation is specified in the section on periodic
markers.
CRC A CRC computed on the adaption indication is REQUIRED
regardless of whether header or payload CRCs are
negotiated in either direction.
reserved All reserved bits MUST be set to zero by the sender
and ignored by the receiver.
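The way the flag bits of the two adaption indications combine can be
sketched as follows. The bit positions assume a 32 bit flag field with
the flags assigned in the order listed above, starting at the MSB. The
negotiate function and the result keys are hypothetical names used only
for illustration.

```python
# Assumed positions: PH is bit 31 (the MSB), then R, HC, PC, IH, IP, SP, SM, RM.
PH, R, HC, PC, IH, IP, SP, SM, RM = (1 << (31 - i) for i in range(9))


def negotiate(a_flags: int, b_flags: int) -> dict:
    """Combine the flags sent by node A (phase 0) and node B (phase 1)."""
    if bool(a_flags & R) != bool(b_flags & R):
        # The ends disagree on RDMA mode: the connection is closed in error.
        raise ConnectionError("R bit mismatch")
    return {
        "mode": "RDMA" if a_flags & R else "DDP",
        # Markers flow in a direction only if the sender sets SM
        # and the receiver of those markers sets RM.
        "markers_a_to_b": bool(a_flags & SM and b_flags & RM),
        "markers_b_to_a": bool(b_flags & SM and a_flags & RM),
        # SP grants permission to the receiver of the indication.
        "a_may_speculate": bool(b_flags & SP),
        "b_may_speculate": bool(a_flags & SP),
    }
```

The CRC related bits (HC, PC, IH, IP) are left out of this sketch
because their acceptability is a local policy decision of each receiver.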
5. Protocol definition
5.1 Frame format
+--------------+---------------+---------------+---------------+
| header_size | payload_size |
+--------------+---------------+---------------+---------------+
| receiver's VTag | |
+--------------+---------------+ |
| DDP Header |
| |
| |
+--------------+---------------+---------------+---------------+
| Header CRC (if negotiated) |
+--------------+---------------+---------------+---------------+
| |
| |
| |
| DDP Payload |
| |
| |
| |
| |
| |
+--------------+---------------+---------------+---------------+
| Payload CRC (if negotiated) |
+--------------+---------------+---------------+---------------+
The header_size and payload_size are 16 bit fields containing the
number of bytes in the DDP header and payload respectively.
The header CRC covers both the IFT header (header_size and
payload_size fields) and the DDP header. The payload CRC covers
only the DDP payload. The CRC uses the CRC-32c algorithm as
defined in [iSCSI]. Note that the RDMA header (if present) and
headers associated with higher level protocols are considered
as part of the DDP payload, and not covered by the IFT header CRC.
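A minimal sketch of building and parsing one frame is given below. It
assumes network byte order for the length, VTag and CRC fields, and it
assumes that the header CRC is computed over every byte from
header_size through the end of the DDP header, which includes the VTag;
this document does not state either point explicitly. The function
names are hypothetical.

```python
import struct


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32c (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def build_frame(vtag, ddp_header, ddp_payload, header_crc=True, payload_crc=True):
    """Serialize one IFT frame; CRCs are included only if negotiated."""
    head = struct.pack("!HHH", len(ddp_header), len(ddp_payload), vtag) + ddp_header
    out = head
    if header_crc:
        out += struct.pack("!I", crc32c(head))         # assumed coverage, see above
    out += ddp_payload
    if payload_crc:
        out += struct.pack("!I", crc32c(ddp_payload))  # covers the DDP payload only
    return out


def parse_frame(frame, header_crc=True, payload_crc=True):
    """Return (vtag, ddp_header, ddp_payload, bytes_consumed) or raise ValueError."""
    header_size, payload_size, vtag = struct.unpack_from("!HHH", frame)
    pos = 6 + header_size
    ddp_header = frame[6:pos]
    if header_crc:
        (crc,) = struct.unpack_from("!I", frame, pos)
        if crc != crc32c(frame[:pos]):
            raise ValueError("header CRC error")
        pos += 4
    ddp_payload = frame[pos:pos + payload_size]
    if len(ddp_payload) != payload_size:
        raise ValueError("truncated payload")
    pos += payload_size
    if payload_crc:
        (crc,) = struct.unpack_from("!I", frame, pos)
        if crc != crc32c(ddp_payload):
            raise ValueError("payload CRC error")
        pos += 4
    return vtag, ddp_header, ddp_payload, pos
```

The header CRC is checked before the payload is touched, so a parser of
this shape can decide on placement as soon as the header is verified.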
The above format shows the DDP header and DDP payload size as
being a multiple of four bytes. This is not necessary, however,
and no padding is added to align the CRC. An example of
an unaligned frame is shown below.
+--------------+---------------+---------------+---------------+
| header_size | payload_size |
+--------------+---------------+---------------+---------------+
| receiver's VTag | |
+--------------+---------------+ |
| |
| DDP Header |
| +---------------+---------------|
| | |
+--------------+---------------+---------------+---------------+
| Header CRC (if negotiated) | |
+--------------+---------------+ |
| |
| |
| |
| DDP Payload |
| |
| |
| |
+ +---------------+---------------+---------------+
| | Payload CRC (if negotiated) |
+--------------+---------------+---------------+---------------+
| |
+--------------+
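Because no padding is added, every field offset follows directly from
the two size fields. The sketch below computes those offsets; the
function name is hypothetical, and the 16 bit width of the VTag is
taken from the field definition of the sender's VTag.

```python
def frame_layout(header_size, payload_size, header_crc=True, payload_crc=True):
    """Return ({field: (offset, length)}, total_length) for one IFT frame."""
    parts = [("header_size", 2), ("payload_size", 2), ("vtag", 2),
             ("ddp_header", header_size)]
    if header_crc:
        parts.append(("header_crc", 4))
    parts.append(("ddp_payload", payload_size))
    if payload_crc:
        parts.append(("payload_crc", 4))
    layout, pos = {}, 0
    for name, length in parts:
        layout[name] = (pos, length)   # fields are packed back to back
        pos += length
    return layout, pos
```

With an 11 byte DDP header and a 21 byte payload, for instance, the
header CRC starts at byte 17 and the payload CRC at byte 42, neither of
which is a multiple of four.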
6. Ordering semantics
The IFT layer receives all data in order from the TCP layer and
delivers all data in order to the DDP layer. Note that this
does not preclude a merged layer implementation from placing
data out of order, but any such implementation MUST be functionally
equivalent to a layered implementation in which TCP delivers
all data in order.
7. Motivation
The IFT header contains the size of the DDP header and DDP payload
in bytes. Assuming in-order processing of received TCP data,
this is fully sufficient to define the DDP PDU boundaries.
The header and payload CRCs are 32 bits each, and provide
additional protection against data corruption. In the event
of a CRC error on received data, the IFT layer will notify
the next layer that the data contains an error. That next
layer will define the action to be taken. A typical action
is to close the connection with a fatal error.
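How the two size fields delimit PDUs in an in-order byte stream can be
sketched as below. The sketch assumes network byte order and a 6 byte
fixed IFT header (two 16 bit sizes plus the 16 bit VTag); CRC
verification is deliberately left out, and the function name is
hypothetical.

```python
import struct


def split_frames(stream: bytes, header_crc=True, payload_crc=True):
    """Yield (start, end) offsets of each complete frame in an in-order stream."""
    crc_bytes = (4 if header_crc else 0) + (4 if payload_crc else 0)
    pos = 0
    while pos + 6 <= len(stream):
        header_size, payload_size = struct.unpack_from("!HH", stream, pos)
        end = pos + 6 + header_size + payload_size + crc_bytes
        if end > len(stream):
            break          # the rest of this frame has not been received yet
        yield pos, end
        pos = end          # the next frame starts immediately, with no padding
```

A caller would keep any bytes after the last yielded frame and retry
once more TCP data has arrived.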
7.1 Motivation for separating header and payload CRCs.
There are three important reasons for this separation.
First, because of TCP segmentation, the entire payload may
not be received together with the header. The next protocol
layer (DDP) may wish to place the portion of the payload that has
been received, but this can't be safely done until the header,
which indicated where the data should be placed, has been
verified.
The second important reason is that it makes hardware
implementations significantly more efficient in that the
payload CRC can be calculated as the payload data is streamed
from NIC memory to host memory. This streaming can't take
place until the header (and therefore the destination host address)
has been verified correct.
There are only three ways the payload CRC can be verified:
on the way into the NIC buffer, on the way out of the NIC buffer
towards host memory, or by making a separate access
to the NIC buffer memory just for the purpose of CRC
verification. The third option causes significant
inefficiencies in terms of required memory bandwidth.
Verifying the CRC while the data is on the way into the
NIC buffer is great if it can be done, but is generally
not practical if the CRC is part of a layer above the
transport (TCP in this case) layer. This is because the transport
processing must be done first, and the time between receiving
the packet on the link and writing it to buffer memory is too brief
to complete the transport processing.
Therefore the proposal requires checking only the header CRC with
a separate access to buffer memory, and allows the payload CRC to
be verified as the data is streamed from NIC buffer to host buffer.
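The streaming verification described above relies on the CRC being
computable incrementally. A minimal software sketch follows, in which a
bytearray stands in for host memory and a list of chunks stands in for
the transfers out of the NIC buffer; the class and function names are
hypothetical.

```python
class Crc32c:
    """Incremental CRC-32c (Castagnoli), reflected polynomial 0x82F63B78."""

    def __init__(self):
        self._state = 0xFFFFFFFF

    def update(self, chunk: bytes) -> None:
        crc = self._state
        for byte in chunk:
            crc ^= byte
            for _ in range(8):
                crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
        self._state = crc

    def value(self) -> int:
        return self._state ^ 0xFFFFFFFF


def stream_payload(chunks, expected_crc: int, sink: bytearray) -> bool:
    """Copy payload chunks to the sink while accumulating the payload CRC."""
    crc = Crc32c()
    for chunk in chunks:
        sink.extend(chunk)   # stands in for the transfer to host memory
        crc.update(chunk)    # CRC is updated in the same pass over the data
    return crc.value() == expected_crc
```

The payload is already in the sink when the result is known, so a
False return has to be reported to the next layer as a data error.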
The third reason is that some applications may require a header
CRC but not require a payload CRC. This may be for a number
of reasons, including the presence of added end-to-end checks
at the ULP level, or simply the ability of the ULP to tolerate
data errors (but not placement errors).
7.2 Motivation for not requiring padding to align CRCs.
Experience building hardware for [iSCSI] has shown that the
hardware required to compute unaligned CRCs is trivial, requiring
only a couple of byte shifters. The hardware required to insert
the padding is an order of magnitude more complex, and affects
control timing in ways that require complex verification.
Since the only claimed benefit of padding insertion was to
simplify hardware design, and since the result was exactly
the opposite, this proposal does not include padding.
8. Speculative Placement
Speculative placement is done by the receiver of an RDDP PDU.
On receiving an out-of-order TCP segment, the receiver
guesses at the location of the RDDP PDU within the
TCP segment. This guess is confirmed by doing a number
of checks; if the checks pass, the payload data is placed
directly. If the confirmation fails,
speculative placement MUST NOT be done.
If the PDU identified appears to be in error, the error
MUST NOT be reported speculatively. In the case of either the
failed confirmation or PDU error, nothing may be done with the
PDU until it can be processed in order.
Prior to delivering the payload data to the ULP, the
alignment is verified by receiving all preceding PDUs and
using the IFT length fields to absolutely verify that
the alignment was correct. If this verification fails,
the PDU MUST be placed again. If an implementation has
discarded the original after placing it (which it typically
will do), then it MUST withhold TCP acknowledgement of this
segment and force the remote end to retransmit it.
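The final alignment verification can be pictured as comparing each
speculatively used stream offset against the frame boundaries derived
in order from the IFT length fields. The sketch below is an
illustration only; it assumes offsets are counted from the start of the
RDDP byte stream, and the function name is hypothetical.

```python
from itertools import accumulate


def classify(speculative_offsets, frame_lengths):
    """Sort speculative offsets into (confirmed, redo, pending) lists.

    frame_lengths holds the total length of each frame processed in
    order so far, as derived from the IFT length fields.
    """
    ends = list(accumulate(frame_lengths))
    processed = ends[-1] if ends else 0   # bytes verified in order so far
    starts = {0, *ends[:-1]}              # true frame boundaries
    confirmed, redo, pending = [], [], []
    for off in speculative_offsets:
        if off >= processed:
            pending.append(off)           # in-order processing has not reached it
        elif off in starts:
            confirmed.append(off)         # the guessed alignment was correct
        else:
            redo.append(off)              # the PDU must be placed again
    return confirmed, redo, pending
```

An offset in the redo list corresponds to the case where the data must
be placed again, or retransmitted if the original was discarded.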
8.1 Confirmation checks
Before doing speculative placement, the IFT and DDP headers
should be checked. Specific fields checked include the
IFT header length, IFT payload length, IFT VTag, IFT header
CRC, DDP STag, DDP QN, DDP MSN, DDP TO, DDP DV. The
subset of these which exists SHOULD all be checked, and
any field containing an invalid value disqualifies the PDU for
speculative placement.
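A sketch of the IFT part of these checks is shown below. The size
limits and the ddp_check callback are hypothetical parameters, since
the DDP field checks belong to the DDP layer; network byte order and
header CRC coverage from header_size through the end of the DDP header
are also assumptions. The function returns False instead of raising,
because errors are not to be reported speculatively.

```python
import struct


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32c (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def confirm_candidate(segment, offset, expected_vtag, header_crc=True,
                      max_header_size=64, max_payload_size=0xFFFF,
                      ddp_check=lambda ddp_header: True):
    """Return True only if every available check passes at this offset."""
    if offset + 6 > len(segment):
        return False
    header_size, payload_size, vtag = struct.unpack_from("!HHH", segment, offset)
    if not 0 < header_size <= max_header_size:   # hypothetical sanity limit
        return False
    if payload_size > max_payload_size:          # hypothetical sanity limit
        return False
    if vtag != expected_vtag:
        return False
    end = offset + 6 + header_size
    if header_crc:
        if end + 4 > len(segment):
            return False
        (crc,) = struct.unpack_from("!I", segment, end)
        if crc != crc32c(segment[offset:end]):
            return False
    elif end > len(segment):
        return False
    return bool(ddp_check(segment[offset + 6:end]))  # STag, QN, MSN, TO, DV checks
```

A failed check simply leaves the segment to be processed later in order.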
8.2 Overwrite checks
Before speculative placement is done, the RDDP implementation
MUST ensure that no previously placed data is overwritten. This
is necessary to ensure that, if the speculative placement is
being done in error, the error is recoverable.
This implies that the RDDP implementation must track what
portions of a buffer have been written at any time or,
if it loses track, do no further
speculative placements in that buffer.
If an in-order placement would overwrite a previously done
speculative placement, then the in-order placement should be
done and the preceding speculative placement regarded as
invalid; the invalidated placement needs to be redone in order.
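One possible way to track written portions of a buffer is a list of
half-open byte ranges, as sketched below. The PlacementMap class and
its methods are hypothetical; a real implementation would also convert
confirmed speculative ranges to in-order ranges, which is omitted here.

```python
class PlacementMap:
    """Tracks which byte ranges [start, end) of one buffer have been written."""

    def __init__(self):
        self._written = []   # list of (start, end, is_speculative)

    def _overlaps(self, start, end, rng):
        return rng[0] < end and start < rng[1]

    def place_speculative(self, start, end) -> bool:
        """Record a speculative write; refuse if it would overwrite anything."""
        if any(self._overlaps(start, end, r) for r in self._written):
            return False
        self._written.append((start, end, True))
        return True

    def place_in_order(self, start, end) -> list:
        """Record an in-order write; return the speculative ranges it invalidates."""
        invalid = [(r[0], r[1]) for r in self._written
                   if r[2] and self._overlaps(start, end, r)]
        # Invalidated speculative ranges are dropped; they must be redone in order.
        self._written = [r for r in self._written
                         if not (r[2] and self._overlaps(start, end, r))]
        self._written.append((start, end, False))
        return invalid
```

An in-order write always proceeds, while a speculative write is refused
whenever it touches any range already recorded.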
8.3 Unwritten Portions of Buffers
Because speculative placement may write erroneously to unwritten
portions of buffers, applications that allow speculative
placement MUST assume that when a buffer is delivered to them
by RDDP, any unwritten portion of a buffer contains unpredictable
data. Applications MUST NOT assume that unwritten portions
of buffers are unmodified.
9. Periodic Markers
Periodic markers may be inserted in the data stream as a means
of locating the beginning of RDDP PDUs when TCP segments are
received out of order.
Details of markers are TBD.
10. IFT and SCTP
IFT is a framing protocol for TCP only and does not address SCTP.
It is expected that IFT will be a temporary transitional solution
in the event that SCTP ultimately achieves widespread use.
It would become a long-term solution only if SCTP fails to achieve
widespread use and acceptance.
11. Security Considerations
It is expected that IFT introduces no new security considerations.
It has all the strengths and weaknesses normally associated with
TCP, and creates no new weaknesses. All application related
security issues are the responsibility of higher layer protocols.
12. References
[MPA] P. Culley et al., draft-culley-iwarp-mpa-00.txt,
September 16, 2001
[iSCSI] Satran, Julian, draft-ietf-iscsi-15.txt, July 30, 2002
[TCP] Postel, J., "Transmission Control Protocol - DARPA Internet
Program Protocol Specification", RFC 793, September 1981.
[DDP] H. Shah et al., "Direct Data Placement over Reliable
Transports", RDMA Consortium Draft Specification draft-shah-
rdmap-ddp-00.txt, September 2002
[RDMA] R. Recio et al., "RDMA Protocol Specification", RDMA
Consortium Draft Specification draft-recio-rdmap-rdma-00.txt,
September 2002
[SCTP] R. Stewart et al., "Stream Control Transmission Protocol",
RFC 2960, October 2000.
13. Author's Addresses
Jim Williams
Emulex Corporation
580 Main Street
Bolton, MA 01740 USA
Phone: +1 978 779 7224
Email: jim.williams@emulex.com