One document matched: draft-lennox-clue-rtp-usage-01.xml
<?xml version='1.0'?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!ENTITY rfc2119 SYSTEM
'reference.RFC.2119.xml'>
<!ENTITY rtp SYSTEM
'reference.RFC.3550.xml'>
<!ENTITY clueusecases SYSTEM
'reference.I-D.ietf-clue-telepresence-use-cases.xml'>
<!ENTITY cluerequirements SYSTEM
'reference.I-D.ietf-clue-telepresence-requirements.xml'>
<!ENTITY clueframework SYSTEM
'reference.I-D.ietf-clue-framework.xml'>
<!ENTITY westerlundrtpmux SYSTEM
'reference.I-D.westerlund-avtcore-multiplex-architecture.xml'>
<!ENTITY rtcwebmux SYSTEM
'reference.I-D.lennox-rtcweb-rtp-media-type-mux.xml'>
<!ENTITY h281 SYSTEM
'reference.ITU.H281.1994.xml'>
<!ENTITY h224rtp SYSTEM
'reference.RFC.4573.xml'>
<!ENTITY contentattr SYSTEM
'reference.RFC.4796.xml'>
<!ENTITY hdrext SYSTEM
'reference.RFC.5285.xml'>
<!ENTITY ccm SYSTEM
'reference.RFC.5104.xml'>
]>
<?rfc toc="yes" ?>
<?rfc strict="yes" ?>
<?rfc compact="yes" ?>
<?rfc sortrefs="yes" ?>
<?rfc symrefs="yes" ?>
<rfc category='std' ipr='trust200902' docName='draft-lennox-clue-rtp-usage-01'>
<front>
<title abbrev='RTP Usage for Telepresence'>
Real-Time Transport Protocol (RTP) Usage for Telepresence Sessions
</title>
<author initials='J.' surname='Lennox'
fullname='Jonathan Lennox'>
<organization abbrev='Vidyo'>
Vidyo, Inc.
</organization>
<address>
<postal>
<street>433 Hackensack Avenue</street>
<street>Seventh Floor</street>
<city>Hackensack</city> <region>NJ</region>
<code>07601</code>
<country>US</country>
</postal>
<email>jonathan@vidyo.com</email>
</address>
</author>
<author initials="A." surname="Romanow"
fullname="Allyn Romanow">
<organization>Cisco Systems</organization>
<address>
<postal>
<street> </street>
<city>San Jose</city> <region>CA</region>
<code>95134</code>
<country>USA</country>
</postal>
<email>allyn@cisco.com</email>
</address>
</author>
<author initials="P." surname="Witty"
fullname="Paul Witty">
<organization>Cisco Systems</organization>
<address>
<postal>
<street> </street>
<city>Langley</city> <region>England</region>
<country>UK</country>
</postal>
<email>pauwitty@cisco.com</email>
</address>
</author>
<date />
<area>RAI</area>
<workgroup>RTCWEB</workgroup>
<keyword>I-D</keyword>
<keyword>Internet-Draft</keyword>
<!-- TODO: more keywords -->
<abstract>
<t>
This document describes mechanisms and recommended practice
for transmitting the media streams of telepresence sessions
using the Real-Time Transport Protocol (RTP).
</t>
</abstract>
</front>
<middle>
<section title='Introduction' anchor='introduction'>
<t>Telepresence systems, of the architecture described by
<xref target='I-D.ietf-clue-telepresence-use-cases' />
and <xref target='I-D.ietf-clue-telepresence-requirements' />, will send and
receive multiple media streams, where the number of streams in use is
potentially large and asymmetric between endpoints, and streams can
come and go dynamically. These characteristics lead to a number of architectural
design choices which, while still in the scope of potential
architectures envisioned by the <xref target='RFC3550'>Real-Time
Transport Protocol</xref>, must be fairly different than those
typically implemented by the current generation of voice or video
conferencing systems. This document makes recommendations about how
streams should be encoded and transmitted in RTP for this telepresence
architecture.</t>
</section>
<section title='Terminology'>
<t>The key words "MUST", "MUST NOT", "REQUIRED",
"SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
and "OPTIONAL" in this document
are to be interpreted as described in <xref
target='RFC2119'>RFC 2119</xref> and indicate requirement levels for
compliant implementations.</t>
</section>
<section title='Source multiplexing - overview'>
<t>Telepresence sessions have lots of media streams: easily dozens at a time
(given, e.g., a continuous presence screen in a multi-point
conference), potentially out of a
possible pool of hundreds. Furthermore, endpoints will have an
asymmetric number of media streams.</t>
<t>In such an environment the usual model of existing SIP endpoints –-
sending zero or one source (in each direction) per RTP session --
doesn’t scale, and mapping asymmetric numbers of sources to sessions
is needlessly complex.</t>
<t>Therefore, telepresence systems SHOULD use a single RTP session per
media type, except where
there's a need to give sessions different transport treatment. All
sources of the same media type are sent over this single RTP
session. This architecture (known as "source multiplexing") was
defined by <xref target='RFC3550' />, but was used rarely until more
recently by some Telepresence systems.
</t>
<t>
Multiplexing multiple media streams in this way has additional
advantages. It makes going through middle boxes
considerably easier, as it allows Telepresence devices to work through
SIP B2BUAs that do not support multiple media lines of
the same media type. It also simplifies NAT and firewall traversal by allowing
endpoint to deal with only a single address/port mapping
per media type rather than multiple mappings.
</t>
<t>
During call setup, a single RTP session is negotiated for each media type.
In SDP, only one media line is negotiated per media and multiple media
streams are sent over the same UDP channel negotiated
using the SDP media line.
</t>
<t>
A number of protocol issues involved in multiplexing RTP streams
into session are discussed in
<xref target='I-D.westerlund-avtcore-multiplex-architecture' /> and
<xref target='I-D.lennox-rtcweb-rtp-media-type-mux' />. In this
draft we concentrate on examining the demultiplexing of RTP
streams, in the specific context of telepresence systems.
</t>
<t>
A key issue to work out is how a receiver interprets the multiple
streams it receives, and corrolates them with the captures it has
requested. In some cases, the
<xref target='I-D.ietf-clue-framework'>CLUE Framework</xref>'s
concept of the "capture" maps cleanly to the RTP concept of an SSRC,
but in some cases it does not.
</t>
<t>
First we will consider the cases that need to be considered. We will then examine
the two most obvious approaches to demultiplexing, showing their pros and cons.
We then describe a third possible alternative.
</t>
</section>
<section title='Use Cases' anchor='usecases'>
<t>There are three distinct use cases relevant for telepresence systems:</t>
<t>Static stream choice:</t>
<t>In this case, the streams sent over the multiplex are constant over
the complete session. An example is a triple-camera system to MCU in
which left, center and right streams are sent for the duration of
the session.</t>
<t>This describes an endpoint to endpoint, endpoint to multipoint
device, and equivalently a transcoding multipoint device to
endpoint.</t>
<t>This is illustrated in <xref target='p2p-static-streams' />.</t>
<figure anchor="p2p-static-streams" title="Point to Point Static Streams">
<artwork>
<![CDATA[
,'''''''''''| +-----------Y
| | | |
| +--------+|"""""""""""""""""""""""""""|+--------+ |
| |EndPoint||---------------------------||EndPoint| |
| +--------+|"""""""""""""""""""""""""""|+--------+ |
| | | |
"-----------' "------------
]]>
</artwork>
</figure>
<t>Dynamic streams from a finite set:</t>
<t>In this case, the receiver has requested a smaller number of
streams than the number of media sources that are available, and
expects the sender to switch the sources being sent
based on criteria chosen by the sender. (This is called
auto-switched in the <xref target='I-D.ietf-clue-framework'>CLUE
Framework</xref>.)</t>
<t>An example is a triple-camera system to two-screen system, in which
the sender needs to switch either LC -> LR, or CR -> LR.</t>
<t>This describes an endpoint to endpoint, endpoint to multipoint
device, and a transcoding device to endpoint.</t>
<t>This is illustrated in <xref target='p2p-finite-streams' />.</t>
<figure anchor="p2p-finite-streams" title="Point to Point Finite Source Streams">
<artwork>
<![CDATA[
,'''''''''''| +-----------Y
| | |+--------+ |
| +--------+|"""""""""""""""""""""""""""||EndPoint| |
| |EndPoint|| |+--------+_|
| +--------+'''''''''' '''''''''''
| |........
"-----------'
]]>
</artwork>
</figure>
<t>Dynamic streams from an infinite set:</t>
<t>This case describes a switched multipoint device to endpoint, in
which the multipoint device can choose to send any streams received
from any other endpoints within the conference
to the endpoint.</t>
<t>For example, in an MCU to triple-screen system, the MCU could send
e.g. LCR of a triple-camera system -> LCR, or CCC of three
single-camera endpoints -> LCR. </t>
<t>This is illustrated in <xref target='multipoint-infinite-streams' />.</t>
<figure anchor="multipoint-infinite-streams" title="Multipoint Infinite Streams">
<artwork>
<![CDATA[
+-+--+--+
| |EP| `-.
| +--+ |`.`-.
+-------`. `. `.
`-.`. `-.
`.`-. `-.
`-.`. `-.-------+ +------+
+--+--+---+ `.`.| +---+ ---------------| +--+ |
| |EP| +----.....:=. |MCU| ...............| |EP| |
| +--+ |"""""""""--| +---+ |______________| +--+ |
+---------+"""""""""";'.'.'.'---+ +------+
.'.'.'.'
.'.'.'.'
/ /.'.'
.'.::-'
+--+--+--+ .'.::'
| |EP| .'.::'
| +--+ .::'
+--------.'
]]>
</artwork>
</figure>
<!-- " (close quote to keep emacs xml mode happy) -->
<t>Within any of these cases, every stream within the multiplexed
session MUST have a unique SSRC. The SSRC is chosen at random
<xref target='RFC3550' /> to ensure uniqueness (within the
conference), and contains no meaningful information.</t>
<t>Any source may choose to restart a stream at any time, resulting in
a new SSRC. For example, a transcoding MCU might, for reasons of load
balancing, transfer an encoder onto a different DSP, and throw away
all context of the encoding at this state, sending an RTCP BYE
message for the old SSRC, and picking a new SSRC for the stream when
started on the new DSP.</t>
<t>Because of this possibility of changing the SSRC at any time, all
our use cases can be considered to be the third and most difficult
case, that of dynamic streams from an infinite set. Thus, this is
the only case we will consider.</t>
</section>
<section title='Demultiplexing'>
<t>There are two obvious choices in order to demultiplex: the SSRC,
which is guaranteed to be unique for a stream, but conveys no
intrinsic useful information, or an additional multiplex ID tagged
on to media packets. There may be other choices, e.g., payload type
number, which might be appropriate for multiplexing one audio with
one video stream on the same RTP session, but this not relevant for
the cases discussed here.</t>
<t>For receivers with limited decoding resources, it is particularly
important to ensure that the number of streams which the receiver is
expecting to receive never exceeds the maximum number it has
requested. On a change of stream, the receiver can be expected to
have a one-out, one-in policy, so that the decoder of the stream
currently being decoded is stopped before starting the decoder for
the stream replacing it. The sender should therefore indicate to
the receiver which stream will be replaced upon a stream change.</t>
<section title='Using the SSRC for demultiplexing'>
<t>Using the SSRC has the advantage of being included already in each
RTP packet. However, there are some disadvantages to
consider. First, the SSRC needs to be linked to some metadata to
associate it to the capture stream. This is because although it
uniquely identifies a media stream, it does not indicate which of
the requested streams each SSRC is tied to. If more than one media
stream is expected, it is therefore required to send some additional
metadata to indicate the link between the SSRC and the CLUE stream
ID. This is simply a mapping from transmitted SSRC to stream ID,
updated as new SSRCs replace old ones.</t>
<t>Because of the one-out, one-in codec policy, the receiver must know
in advance of receiving the media stream how to allocate its
decoding resources. Athough it could cache incoming media received
before it knows what multiplex stream it applies to, this will
require an unknown amount of storage space (particularly if the
metadata is lost), and could lead to significant latency, after
which the receiver may not find it possible to catch up because of
resource constraints, or else it would require an expensive state
refresh, such as a <xref target='RFC5104'>Full Intra Request
(FIR)</xref>.</t>
<t>In addition, a receiver will have to store lookup tables of SSRCs
to stream IDs/decoders etc. Because of the large SSRC space (32
bits), this will have to be in the form of something like a hash
map, and a lookup will have to be performed for every incoming
packet, which may prove costly on the receiver side.</t>
<t>Consider the choices for where to put the metadata. The metadata
could be sent in the CLUE messaging. The use of a reliable
transport means that it can be sure that the metadata will not be
lost, but if this reliability is acheived through retransmission,
the time taken for the metadata to reach all receivers (particularly
in a very large scale conference, e.g., with thousands of users) could result in
very poor switching times, providing a bad user experience.</t>
<t>A second option for sending the metadata is in RTCP, for instance
as a new SDES item. This is likely to
follow the same path as media, and therefore if the metadata is sent
slightly in advance of the media, it can be expected to be received
in advance of the media. However, because RTCP is lossy, the
metadata may not be received for some time, resulting in the receiver of the
media not knowing how to route the received media. A system of
acks and retransmissions could mitigate this, but this results in
the same high switching latency behaviour as discussed for using
CLUE as a transport for the metadata.</t>
</section>
<section title='Multiplex ID'>
<t>The second option is to tag each media packet with
an <xref target='RFC5285'>RTP header extension</xref> carrying a
multiplex ID.
This means that a receiver immediately knows how to interpret received
media, even when an unknown SSRC is seen. As long as the media
carries a known
multiplex ID, it can be assumed that this media stream will replace
the stream currently being received with that multiplex ID.</t>
<t>This gives significant advantages to switching latency, as a switch
between sources can be acheived without any form of negotiation with
the receiver. There is no chance of receiving media without knowing
to which switched capture it belongs.</t>
<t>Although multiplex IDs may be chosen by either the sender or
receiver, the multiplex ID can, if chosen by the receiver, contain
semantic information relevant to the receiver. For example, on a
large multipoint device with many DSPs, the receiver chosen
multiplex ID could identify the DSP to which the media should be
sent, and possibly contain routing information to the DSP.</t>
<t>However, there are also significant disadvantages in using a
multiplex ID. It introduces additional processing costs.</t>
<t>Multiplex IDs are scoped only within one hop (i.e., within a
cascaded conference a multiplex ID that is used from the source to the first MCU
is not meaningful between two MCUs, or between an MCU and a
receiver), and so they may need to be modified at every stage.</t>
<t>To add or modify the multiplex ID is an expensive operation,
particularly if SRTP is used to authenticate the packet.
Modification to the contents of the RTP header requires a
reauthentication of the complete packet, and this could prove to be
a limiting factor in the throughput of a multipoint device. However,
it may be that reauthentication is required in any case due to the
nature of SDP. SDP permits the receiver to choose payload types,
meaning that a similar option to modify the payload type in the
packet header will cause the need to reauthenticate.</t>
</section>
<section title='Combined approach'>
<t>The two major flaws of the above methods (poor switching
performance of SSRC multiplexing, high computational cost on
switching nodes) can be mitigated with a combined method. In this,
the multiplex ID can be included in packets belonging to the first
frame of media (typically an IDR/GDR), but following this only the
SSRC is used to demultiplex.</t>
<t>Because the IDR is already required to be received before any
further frames can be decoded, this does not create any further
restrictions on the media stream -- existing mechanisms to ensure
the reliability of an IDR frame can be used.
It does introduce extra
complexity on the demultiplex side, requiring a two stage process of
inspecting the packet for a multiplex ID, and, if it is not present,
looking for the SSRC in a table of known streams.</t>
<t>The solution is somewhat more complex if it is possible for a
source to change which switched capture is sending it: for instance, in the
second example in <xref target='usecases' />, when the sender
switches from sending LC -> LR to sending CR -> LR, the sender's "C" source
moves from the receiver's "R" multiplex ID to the receiver's "L"
multiplex ID. For reasons of coding efficiency, it is desirable in
this case to avoid sending a new IDR frame for the "C" stream, if
the receiver's architecture allows the same decoding state to be
used for its various multiplex IDs. In this case, the multiplex ID
could be sent for a small number of frames after the source's
multiplex ID has changed.</t>
</section>
</section>
<section title='Transmission of presentation sources'>
<t>Most existing videoconferencing systems use separate RTP sessions
for main and presentation video sources, distinguished by the
<xref target='RFC4796'>SDP content attribute</xref>. The use of
<xref target='I-D.ietf-clue-framework' />the CLUE telepresence
framework to describe multiplexed streams can remove this
need. However, it could still be useful in some cases to make the
distinction between presentation and main video sources at the
transport layer. In particular, if different treatment is desired at
the transport layer or below (e.g. different VLANs, different QoS
characteristics, etc.) for main video vs
presentiation, the use of multiple RTP sessions m lines with
different transport addresses could would be necessary.</t>
</section>
<section title='Other considerations'>
<t>As currently defined, <xref target='ITU.H281.1994'>H.281 Far-End Camera
Control</xref><xref target='RFC4573' /> does not, in SIP-based
videoconferences, support selecting among multiple remote sources
(though it does in H.323 conferences controled by an MCU, which can
assign terminal IDs to sources). When RTP sessions contain multiple sources,
this limitation becomes pressing. (However, this problem does not
appear to be in scope of the CLUE working group.)</t>
</section>
<section title='Security Considerations' anchor='security'>
<t>The security considerations for multiplexed RTP do not seem to be
different than for non-multiplexed RTP.</t>
</section>
<section title='IANA Considerations' anchor='iana'>
<t>This document makes no requests of IANA.</t>
<t>Note to RFC Editor: please remove this section before publication
as an RFC.</t>
</section>
</middle>
<back>
<references title='Normative References'>
&rfc2119;
&rtp;
</references>
<references title='Informative References'>
&clueusecases;
&cluerequirements;
&clueframework;
&h281;
&h224rtp;
&contentattr;
&hdrext;
&rtcwebmux;
&westerlundrtpmux;
&ccm;
</references>
<!--
<section title='Open issues'>
<t><list style='symbols'>
<t></t>
</list></t>
</section>
-->
<!--
<section title='Changes From Earlier Versions'>
<t>Note to the RFC-Editor: please remove this section prior to publication
as an RFC.</t>
<section title='Changes From Draft -00'>
<t><list style='symbols'>
<t></t>
</list></t>
</section>
</section>
-->
</back>
</rfc>
| PAFTECH AB 2003-2026 | 2026-04-24 05:39:19 |