<?xml version='1.0'?>

<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
    <!ENTITY rfc2119 SYSTEM
      'reference.RFC.2119.xml'>
    <!ENTITY rtp SYSTEM
      'reference.RFC.3550.xml'>
    <!ENTITY clueusecases SYSTEM
      'reference.I-D.ietf-clue-telepresence-use-cases.xml'>
    <!ENTITY cluerequirements SYSTEM
      'reference.I-D.ietf-clue-telepresence-requirements.xml'>
    <!ENTITY clueframework SYSTEM
      'reference.I-D.ietf-clue-framework.xml'>
    <!ENTITY westerlundrtpmux SYSTEM
      'reference.I-D.westerlund-avtcore-multiplex-architecture.xml'>
    <!ENTITY rtcwebmux SYSTEM
      'reference.I-D.lennox-rtcweb-rtp-media-type-mux.xml'>
    <!ENTITY h281 SYSTEM
      'reference.ITU.H281.1994.xml'> 
    <!ENTITY h224rtp SYSTEM
      'reference.RFC.4573.xml'>
    <!ENTITY contentattr SYSTEM
      'reference.RFC.4796.xml'>
    <!ENTITY hdrext SYSTEM
      'reference.RFC.5285.xml'>
    <!ENTITY ccm SYSTEM
      'reference.RFC.5104.xml'>
]>

<?rfc toc="yes" ?>
<?rfc strict="yes" ?>
<?rfc compact="yes" ?>
<?rfc sortrefs="yes" ?>
<?rfc symrefs="yes" ?>

<rfc category='std' ipr='trust200902' docName='draft-lennox-clue-rtp-usage-01'>
    <front>
        <title abbrev='RTP Usage for Telepresence'>
		Real-Time Transport Protocol (RTP) Usage for Telepresence Sessions
        </title>

        <author initials='J.' surname='Lennox'
                fullname='Jonathan Lennox'>
            <organization abbrev='Vidyo'>
               Vidyo, Inc.
            </organization>
            <address>
               <postal>
                   <street>433 Hackensack Avenue</street>
                   <street>Seventh Floor</street>
                   <city>Hackensack</city> <region>NJ</region>
                   <code>07601</code>
                   <country>US</country>
               </postal>
               <email>jonathan@vidyo.com</email>
            </address>
        </author>

        <author initials="A." surname="Romanow"
                fullname="Allyn Romanow">
            <organization>Cisco Systems</organization>
    
            <address>
               <postal>
				   <street> </street>
                   <city>San Jose</city> <region>CA</region>
                   <code>95134</code>
                   <country>USA</country>
               </postal>
               <email>allyn@cisco.com</email>
            </address>
        </author>

		<author initials="P." surname="Witty"
				fullname="Paul Witty">
            <organization>Cisco Systems</organization>
    
            <address>
               <postal>
				   <street> </street>
				   <city>Langley</city> <region>England</region>
				   <country>UK</country>
			   </postal>
			   <email>pauwitty@cisco.com</email>
			</address>
		</author>

        <date />
        <area>RAI</area>
        <workgroup>RTCWEB</workgroup>

        <keyword>I-D</keyword>
        <keyword>Internet-Draft</keyword>
		<!-- TODO: more keywords -->

        <abstract>
		<t>
		  This document describes mechanisms and recommended practice
		  for transmitting the media streams of telepresence sessions
		  using the Real-Time Transport Protocol (RTP).
		</t>
        </abstract>

    </front>

<middle>

<section title='Introduction' anchor='introduction'>

<t>Telepresence systems, of the architecture described by
<xref target='I-D.ietf-clue-telepresence-use-cases' />
and <xref target='I-D.ietf-clue-telepresence-requirements' />, will send and
receive multiple media streams, where the number of streams in use is
potentially large and asymmetric between endpoints, and streams can
come and go dynamically.  These characteristics lead to a number of architectural
design choices which, while still in the scope of potential
architectures envisioned by the <xref target='RFC3550'>Real-Time
Transport Protocol</xref>, must be fairly different than those
typically implemented by the current generation of voice or video
conferencing systems.  This document makes recommendations about how
streams should be encoded and transmitted in RTP for this telepresence
architecture.</t>

</section>

<section title='Terminology'>

<t>The key words "MUST", "MUST NOT", "REQUIRED", 
"SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", 
and "OPTIONAL" in this document
are to be interpreted as described in <xref
target='RFC2119'>RFC 2119</xref> and indicate requirement levels for
compliant implementations.</t>

</section>

<section title='Source multiplexing - overview'>

<t>Telepresence sessions have lots of media streams: easily dozens at a time
  (given, e.g., a continuous presence screen in a multi-point
  conference), potentially out of a
  possible pool of hundreds.  Furthermore, endpoints will have an
  asymmetric number of media streams.</t>

<t>In such an environment the usual model of existing SIP endpoints --
  sending zero or one source (in each direction) per RTP session --
  doesn’t scale, and mapping asymmetric numbers of sources to sessions
  is needlessly complex.</t>

<t>Therefore, telepresence systems SHOULD use a single RTP session per
  media type, except where
  there's a need to give sessions different transport treatment.  All
  sources of the same media type are sent over this single RTP
  session.  This architecture (known as "source multiplexing") was
  defined by <xref target='RFC3550' />, but was rarely used until its
  recent adoption by some telepresence systems.
</t>

<t>
Multiplexing multiple media streams in this way has additional
advantages.   It makes going through middle boxes  
considerably easier, as it allows Telepresence devices to work through
SIP B2BUAs that do not support multiple media lines of  
the same media type. It also simplifies NAT and firewall traversal by allowing
endpoints to deal with only a single address/port mapping
per media type rather than multiple mappings.
</t>

<t>
During call setup, a single RTP session is negotiated for each media type.  
In SDP, only one media line is negotiated per media type, and multiple media
streams are sent over the same UDP channel negotiated
using that SDP media line.
</t>


<t>
   A number of protocol issues involved in multiplexing RTP streams
   into a single session are discussed in
   <xref target='I-D.westerlund-avtcore-multiplex-architecture' /> and
   <xref target='I-D.lennox-rtcweb-rtp-media-type-mux' />. In this
   draft we concentrate on examining the demultiplexing of RTP
   streams, in the specific context of telepresence systems.
</t>

<t>
  A key issue to work out is how a receiver interprets the multiple
  streams it receives, and correlates them with the captures it has
  requested.  In some cases, the
  <xref target='I-D.ietf-clue-framework'>CLUE Framework</xref>'s
  concept of the "capture" maps cleanly to the RTP concept of an SSRC,
  but in some cases it does not.
</t>

<t>
   We first set out the use cases that need to be handled. We then examine
   the two most obvious approaches to demultiplexing, showing their pros and cons.
   Finally, we describe a third alternative that combines the two.
</t>

</section>

<section title='Use Cases' anchor='usecases'>

<t>There are three distinct use cases relevant for telepresence systems:</t>

<t>Static stream choice:</t>

<t>In this case, the streams sent over the multiplex are constant over
  the complete session. An example is a triple-camera system to MCU in
  which left, center and right streams are sent for the duration of
  the session.</t>
<t>This covers the endpoint to endpoint, endpoint to multipoint
  device, and, equivalently, transcoding multipoint device to
  endpoint scenarios.</t>
<t>This is illustrated in <xref target='p2p-static-streams' />.</t>

<figure anchor="p2p-static-streams" title="Point to Point Static Streams">
<artwork>
<![CDATA[
            ,'''''''''''|                           +-----------Y
            |           |                           |           |
            | +--------+|"""""""""""""""""""""""""""|+--------+ |
            | |EndPoint||---------------------------||EndPoint| |
            | +--------+|"""""""""""""""""""""""""""|+--------+ |
            |           |                           |           |
            "-----------'                           "------------
]]>
</artwork>
</figure>

<t>Dynamic streams from a finite set:</t>

<t>In this case, the receiver has requested a smaller number of
  streams than the number of media sources that are available, and
  expects the sender to switch the sources being sent
  based on criteria chosen by the sender.  (This is called
  auto-switched in the <xref target='I-D.ietf-clue-framework'>CLUE
  Framework</xref>.)</t>

<t>An example is a triple-camera system to two-screen system, in which
  the sender needs to switch either LC -> LR, or CR -> LR.</t>

<t>This covers the endpoint to endpoint, endpoint to multipoint
  device, and transcoding device to endpoint scenarios.</t>

<t>This is illustrated in <xref target='p2p-finite-streams' />.</t>


<figure anchor="p2p-finite-streams" title="Point to Point Finite Source Streams">
<artwork>
<![CDATA[
            ,'''''''''''|                           +-----------Y
            |           |                           |+--------+ |
            | +--------+|"""""""""""""""""""""""""""||EndPoint| |
            | |EndPoint||                           |+--------+_|
            | +--------+''''''''''                   '''''''''''
            |           |........
            "-----------'
]]>
</artwork>
</figure>


<t>Dynamic streams from an infinite set:</t>

<t>This case describes a switched multipoint device to endpoint, in
  which the multipoint device can choose to send any streams received
  from any other endpoints within the conference 
  to the endpoint.</t>
<t>For example, in an MCU to triple-screen system,  the MCU could send
  e.g. LCR of a triple-camera system -> LCR, or CCC of three
  single-camera endpoints -> LCR. </t>
<t>This is illustrated in <xref target='multipoint-infinite-streams' />.</t>

<figure anchor="multipoint-infinite-streams" title="Multipoint Infinite Streams">
<artwork>
<![CDATA[
           +-+--+--+
           | |EP|  `-.
           | +--+  |`.`-.
           +-------`. `. `.
                     `-.`. `-.
                        `.`-. `-.
                          `-.`.  `-.-------+              +------+
           +--+--+---+       `.`.|  +---+  ---------------| +--+ |
           |  |EP|   +----.....:=.  |MCU|  ...............| |EP| |
           |  +--+   |"""""""""--|  +---+  |______________| +--+ |
           +---------+"""""""""";'.'.'.'---+              +------+
                              .'.'.'.'
                            .'.'.'.'
                           / /.'.'
                         .'.::-'
            +--+--+--+ .'.::'
            |  |EP|  .'.::'
            |  +--+  .::'
            +--------.'
]]>
</artwork>
</figure>
<!-- " (close quote to keep emacs xml mode happy) -->

<t>Within any of these cases, every stream within the multiplexed
  session MUST have a unique SSRC.  The SSRC is chosen at random
  <xref target='RFC3550' /> to ensure uniqueness (within the
  conference), and contains no meaningful information.</t>
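<t>As an illustration only, the following Python sketch shows one way
  an implementation might pick a random 32-bit SSRC that does not
  collide with the SSRCs it already knows about in the session.  The
  function name pick_ssrc and the in_use set are assumptions made for
  this example; the run-time collision detection specified by
  <xref target='RFC3550' /> is not shown.</t>

<figure>
<artwork>
<![CDATA[
```python
import secrets


def pick_ssrc(in_use):
    """Return a random 32-bit SSRC that is not in in_use."""
    while True:
        candidate = secrets.randbits(32)
        if candidate not in in_use:
            return candidate


# Hypothetical usage: three sources joining one RTP session.
in_use = set()
for _ in range(3):
    in_use.add(pick_ssrc(in_use))
```
]]>
</artwork>
</figure>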

<t>Any source may choose to restart a stream at any time, resulting in
  a new SSRC. For example, a transcoding MCU might, for reasons of load
  balancing, transfer an encoder onto a different DSP, and throw away
  all context of the encoding at this state, sending an RTCP BYE
  message for the old SSRC, and picking a new SSRC for the stream when
  started on the new DSP.</t>
<t>Because of this possibility of changing the SSRC at any time, all
  our use cases can be considered to be the third and most difficult
  case, that of dynamic streams from an infinite set. Thus, this is
  the only case we will consider.</t>


</section>

<section title='Demultiplexing'>
<t>There are two obvious choices in order to demultiplex: the SSRC,
  which is guaranteed to be unique for a stream, but conveys no
  intrinsic useful information, or an additional multiplex ID tagged
  on to media packets.  There may be other choices, e.g., payload type
  number, which might be appropriate for multiplexing one audio with
  one video stream on the same RTP session, but this is not relevant for
  the cases discussed here.</t>

<t>For receivers with limited decoding resources, it is particularly
  important to ensure that the number of streams which the receiver is
  expecting to receive never exceeds the maximum number it has
  requested.  On a change of stream, the receiver can be expected to
  have a one-out, one-in policy, so that the decoder of the stream
  currently being decoded is stopped before starting the decoder for
  the stream replacing it.  The sender should therefore indicate to
  the receiver which stream will be replaced upon a stream change.</t>
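<t>The one-out, one-in policy described above might be sketched as
  follows.  This is a minimal illustration, not a prescribed design:
  the DecoderPool class, its slot numbering, and the returned event
  tuples are all hypothetical names introduced for this example.</t>

<figure>
<artwork>
<![CDATA[
```python
class DecoderPool:
    """Fixed set of decoder slots, one per requested stream."""

    def __init__(self, max_streams):
        self.max_streams = max_streams
        self.active = {}  # slot index -> SSRC being decoded

    def switch(self, slot, new_ssrc):
        """Replace the stream in a slot, stopping the old one
        before the new one is started."""
        if not 0 <= slot < self.max_streams:
            raise ValueError("slot outside requested streams")
        events = []
        old = self.active.pop(slot, None)
        if old is not None:
            events.append(("stop", old))    # one out ...
        self.active[slot] = new_ssrc
        events.append(("start", new_ssrc))  # ... one in
        return events
```
]]>
</artwork>
</figure>

<t>Because a slot is always vacated before it is refilled, the number
  of running decoders in this sketch never exceeds max_streams.</t>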

<section title='Using the SSRC for demultiplexing'>
<t>Using the SSRC has the advantage of being included already in each
  RTP packet. However, there are some disadvantages to
  consider. First, the SSRC needs to be linked to some metadata to
  associate it to the capture stream. This is because although it
  uniquely identifies a media stream, it does not indicate which of
  the requested streams each SSRC is tied to.  If more than one media
  stream is expected, it is therefore required to send some additional
  metadata to indicate the link between the SSRC and the CLUE stream
  ID.  This is simply a mapping from transmitted SSRC to stream ID,
  updated as new SSRCs replace old ones.</t>

<t>Because of the one-out, one-in codec policy, the receiver must know
  in advance of receiving the media stream how to allocate its
  decoding resources. Although it could cache incoming media received
  before it knows what multiplex stream it applies to, this will
  require an unknown amount of storage space (particularly if the
  metadata is lost), and could lead to significant latency, after
  which the receiver may not find it possible to catch up because of
  resource constraints, or else it would require an expensive state
  refresh, such as a <xref target='RFC5104'>Full Intra Request
  (FIR)</xref>.</t>

<t>In addition, a receiver will have to store lookup tables of SSRCs
  to stream IDs/decoders etc.  Because of the large SSRC space (32
  bits), this will have to be in the form of something like a hash
  map, and a lookup will have to be performed for every incoming
  packet, which may prove costly on the receiver side.</t>
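<t>A minimal sketch of such a lookup table is shown below, assuming
  the SSRC to stream ID mapping has already been delivered by some
  metadata channel.  The SsrcRouter class and its method names are
  hypothetical; only the position of the SSRC field (bytes 8 to 11 of
  the fixed RTP header) is taken from <xref target='RFC3550' />.</t>

<figure>
<artwork>
<![CDATA[
```python
class SsrcRouter:
    """Hash-map lookup from SSRC to stream ID."""

    def __init__(self):
        self.ssrc_to_stream = {}

    def update_mapping(self, stream_id, new_ssrc):
        """Bind new_ssrc to stream_id, dropping the SSRC that
        previously carried that stream."""
        for ssrc, sid in list(self.ssrc_to_stream.items()):
            if sid == stream_id:
                del self.ssrc_to_stream[ssrc]
        self.ssrc_to_stream[new_ssrc] = stream_id

    def route(self, packet):
        """Return the stream ID for an RTP packet, or None if
        its SSRC has no known mapping yet."""
        if len(packet) < 12:
            raise ValueError("shorter than fixed RTP header")
        ssrc = int.from_bytes(packet[8:12], "big")
        return self.ssrc_to_stream.get(ssrc)
```
]]>
</artwork>
</figure>

<t>A packet for which route returns None is exactly the problem case
  discussed above: media has arrived before its metadata.</t>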

<t>Consider the choices for where to put the metadata.  The metadata
  could be sent in the CLUE messaging.  The use of a reliable
  transport means that it can be sure that the metadata will not be
  lost, but if this reliability is achieved through retransmission,
  the time taken for the metadata to reach all receivers (particularly
  in a very large scale conference, e.g., with thousands of users) could result in
  very poor switching times, providing a bad user experience.</t>

<t>A second option for sending the metadata is in RTCP, for instance
  as a new SDES item.  This is likely to
  follow the same path as media, and therefore if the metadata is sent
  slightly in advance of the media, it can be expected to be received
  in advance of the media.  However, because RTCP is lossy, the
  metadata may not be received for some time, resulting in the receiver of the
  media not knowing how to route the received media.  A system of
  acks and retransmissions could mitigate this, but this results in
  the same high switching latency behaviour as discussed for using
  CLUE as a transport for the metadata.</t>

</section>

<section title='Multiplex ID'>
<t>The second option is to tag each media packet with
  an <xref target='RFC5285'>RTP header extension</xref> carrying a
  multiplex ID.
  This means that a receiver immediately knows how to interpret received
  media, even when an unknown SSRC is seen.  As long as the media
  carries a known
  multiplex ID, it can be assumed that this media stream will replace
  the stream currently being received with that multiplex ID.</t>
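<t>To make this concrete, the following sketch extracts a multiplex
  ID from the one-byte-header form of
  the <xref target='RFC5285'>RTP header extension</xref>.  The
  extension identifier MUX_EXT_ID and the function name find_mux_id
  are assumptions for illustration; this document does not define an
  extension element, and real identifiers are negotiated per
  session.</t>

<figure>
<artwork>
<![CDATA[
```python
MUX_EXT_ID = 1  # hypothetical; real IDs are negotiated


def find_mux_id(packet, ext_id=MUX_EXT_ID):
    """Return the multiplex ID bytes from a one-byte-header
    extension, or None if the packet carries none."""
    if len(packet) < 12 or packet[0] >> 6 != 2:
        raise ValueError("not an RTP version 2 packet")
    if not packet[0] & 0x10:  # X bit clear: no extension
        return None
    # Skip the fixed header and the CSRC list.
    off = 12 + 4 * (packet[0] & 0x0F)
    if len(packet) < off + 4:
        raise ValueError("truncated extension header")
    profile = int.from_bytes(packet[off:off + 2], "big")
    words = int.from_bytes(packet[off + 2:off + 4], "big")
    if profile != 0xBEDE:  # not the one-byte-header form
        return None
    pos, end = off + 4, off + 4 + 4 * words
    if end > len(packet):
        raise ValueError("truncated extension data")
    while pos < end:
        byte = packet[pos]
        if byte == 0:  # padding byte
            pos += 1
            continue
        elem_id, length = byte >> 4, (byte & 0x0F) + 1
        if elem_id == 15:  # reserved: stop parsing
            break
        if pos + 1 + length > end:
            raise ValueError("element overruns extension")
        if elem_id == ext_id:
            return packet[pos + 1:pos + 1 + length]
        pos += 1 + length
    return None
```
]]>
</artwork>
</figure>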

<t>This gives significant advantages to switching latency, as a switch
  between sources can be achieved without any form of negotiation with
  the receiver.  There is no chance of receiving media without knowing
  to which switched capture it belongs.</t>

<t>Although multiplex IDs may be chosen by either the sender or
  receiver, the multiplex ID can, if chosen by the receiver, contain
  semantic information relevant to the receiver. For example, on a
  large multipoint device with many DSPs, the receiver-chosen
  multiplex ID could identify the DSP to which the media should be
  sent, and possibly contain routing information to the DSP.</t>

<t>However, there are also significant disadvantages in using a
  multiplex ID. It introduces additional processing costs.</t>

<t>Multiplex IDs are scoped only within one hop (i.e., within a
  cascaded conference a multiplex ID that is used from the source to the first MCU
  is not meaningful between two MCUs, or between an MCU and a
  receiver), and so they may need to be modified at every stage.</t> 

<t>To add or modify the multiplex ID is an expensive operation,
  particularly if SRTP is used to authenticate the packet.
  Modification to the contents of the RTP header requires a
  reauthentication of the complete packet, and this could prove to be
  a limiting factor in the throughput of a multipoint device.  However,
  it may be that reauthentication is required in any case due to the
  nature of SDP. SDP permits the receiver to choose payload types,
  meaning that a similar operation, modifying the payload type in the
  packet header, will also require reauthentication.</t>

</section>

<section title='Combined approach'>
<t>The two major flaws of the above methods (poor switching
  performance of SSRC demultiplexing, high computational cost of
  multiplex IDs on switching nodes) can be mitigated with a combined
  method.  In this,
  the multiplex ID can be included in packets belonging to the first
  frame of media (typically an IDR/GDR), but following this only the
  SSRC is used to demultiplex.</t>

<t>Because the IDR is already required to be received before any
  further frames can be decoded, this does not create any further
  restrictions on the media stream -- existing mechanisms to ensure
  the reliability of an IDR frame can be used.
  It does introduce extra
  complexity on the demultiplex side, requiring a two stage process of
  inspecting the packet for a multiplex ID, and, if it is not present,
  looking for the SSRC in a table of known streams.</t>
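<t>The two-stage process might look like the following sketch, which
  works on already-parsed values rather than raw packets.  The
  CombinedDemux class and its route method are hypothetical names for
  this example, and evicting the SSRC previously bound to a multiplex
  ID is one possible policy, assumed here for simplicity.</t>

<figure>
<artwork>
<![CDATA[
```python
class CombinedDemux:
    """Two-stage demultiplexer: multiplex ID first, then SSRC."""

    def __init__(self, known_mux_ids):
        self.known_mux_ids = set(known_mux_ids)
        self.ssrc_to_mux = {}

    def route(self, ssrc, mux_id=None):
        """Return the multiplex ID a packet belongs to, or None
        if it cannot be placed."""
        if mux_id is not None:
            # Stage 1: a tagged packet (re)binds its SSRC.
            if mux_id not in self.known_mux_ids:
                return None
            for old, mid in list(self.ssrc_to_mux.items()):
                if mid == mux_id and old != ssrc:
                    del self.ssrc_to_mux[old]  # replaced stream
            self.ssrc_to_mux[ssrc] = mux_id
            return mux_id
        # Stage 2: untagged packet, fall back to the SSRC table.
        return self.ssrc_to_mux.get(ssrc)
```
]]>
</artwork>
</figure>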

<t>The solution is somewhat more complex if it is possible for a
  source to move from one switched capture to another: for instance, in the
  second example in <xref target='usecases' />, when the sender
  switches from sending LC -> LR to sending CR -> LR, the sender's "C" source
  moves from the receiver's "R" multiplex ID to the receiver's "L"
  multiplex ID.  For reasons of coding efficiency, it is desirable in
  this case to avoid sending a new IDR frame for the "C" stream, if
  the receiver's architecture allows the same decoding state to be
  used for its various multiplex IDs.  In this case, the multiplex ID
  could be sent for a small number of frames after the source's
  multiplex ID has changed.</t>

</section>

</section>

<section title='Transmission of presentation sources'>

<t>Most existing videoconferencing systems use separate RTP sessions
  for main and presentation video sources, distinguished by the
  <xref target='RFC4796'>SDP content attribute</xref>.  The use of
  the <xref target='I-D.ietf-clue-framework'>CLUE telepresence
  framework</xref> to describe multiplexed streams can remove this
  need. However, it could still be useful in some cases to make the
  distinction between presentation and main video sources at the
  transport layer. In particular, if different treatment is desired at
  the transport layer or below (e.g. different VLANs, different QoS
  characteristics, etc.) for main video vs.
  presentation, the use of multiple RTP sessions, on m-lines with
  different transport addresses, would be necessary.</t>

</section>

<section title='Other considerations'>

<t>As currently defined, <xref target='ITU.H281.1994'>H.281 Far-End Camera
Control</xref> <xref target='RFC4573' /> does not, in SIP-based
videoconferences, support selecting among multiple remote sources
(though it does in H.323 conferences controlled by an MCU, which can
assign terminal IDs to sources).  When RTP sessions contain multiple sources,
this limitation becomes pressing.  (However, this problem does not
appear to be in scope of the CLUE working group.)</t>

</section>

<section title='Security Considerations' anchor='security'>

<t>The security considerations for multiplexed RTP do not seem to be
  different than for non-multiplexed RTP.</t>

</section>

<section title='IANA Considerations' anchor='iana'>

<t>This document makes no requests of IANA.</t>

<t>Note to RFC Editor: please remove this section before publication
  as an RFC.</t>

</section>

</middle>

<back>

<references title='Normative References'>

&rfc2119;

&rtp;

</references>

<references title='Informative References'>

&clueusecases;

&cluerequirements;

&clueframework;

&h281;

&h224rtp;

&contentattr;

&hdrext;

&rtcwebmux;

&westerlundrtpmux;

&ccm;

</references>

<!--
<section title='Open issues'>

<t><list style='symbols'>

<t></t>

</list></t>

</section>
-->

<!-- 
<section title='Changes From Earlier Versions'>

<t>Note to the RFC-Editor: please remove this section prior to publication
as an RFC.</t>

<section title='Changes From Draft -00'>

<t><list style='symbols'>

<t></t>


</list></t>

</section>

</section>

-->

</back>

</rfc>