<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<rfc category="info" docName="draft-westerlund-clue-multistream-conference-00"
     ipr="trust200902">
  <front>
    <title abbrev="MSMC">Multi-Stream Media Conferencing</title>

    <author fullname="Magnus Westerlund" initials="M." surname="Westerlund">
      <organization>Ericsson</organization>

      <address>
        <postal>
          <street>Farogatan 6</street>

          <city>SE-164 80 Kista</city>

          <country>Sweden</country>
        </postal>

        <phone>+46 10 714 82 87</phone>

        <email>magnus.westerlund@ericsson.com</email>
      </address>
    </author>

    <author fullname="Bo Burman" initials="B." surname="Burman">
      <organization>Ericsson</organization>

      <address>
        <postal>
          <street>Farogatan 6</street>

          <city>SE-164 80 Kista</city>

          <country>Sweden</country>
        </postal>

        <phone>+46 10 714 13 11</phone>

        <email>bo.burman@ericsson.com</email>
      </address>
    </author>

    <date day="3" month="February" year="2012" />

    <abstract>
      <t>This memo describes a multimedia multi-party conferencing
      architecture based on use of multiple Real-Time Transport Protocol (RTP)
      streams.</t>
    </abstract>
  </front>

  <middle>
    <section title="Introduction">
      <t>Multimedia multi-party conferencing is being improved through the use
      of multiple media streams per media type. At the same time, the number
      of different types of end-points that are capable of participating in a
      multimedia conference increases, and so does the number of access types
      that end-points may use to connect. This memo describes a number of use
      cases relevant to that scenario. The use cases aim to use as high media
      quality as possible, while making as efficient use of available
      resources as possible and accommodating a high degree of end-point and
      network diversity.</t>
    </section>

    <section title="Requirements">
      <t>The use cases in this memo should handle diversity in end-point
      capability such as media quality, processing power, and network
      interface. The perceived end-user media quality is in turn impacted by
      media stream bitrate and by the number of streams per media type, as well
      as by end-to-end delay, which should also be taken into account. Diversity
      in network capacity, network quality, and user preferences based on
      location, device, etc., should also be handled. These properties should
      be expected to change between conference sessions and even within the
      same session.</t>

      <t>Different types of conference topologies must be supported, including
      centralized, multicast, and point-to-point. It should also be possible
      to use cascaded, centralized conferences as well as combinations of
      unicast and multicast. The relevant RTP base topologies are described in
      <xref target="RFC5117"></xref>.</t>

      <t>Use cases including an RTP Mixer should avoid adding delay or
      reducing quality by forwarding streams as unmodified as possible, with
      reduced processing requirements in the RTP Mixer as an added advantage.</t>

      <t>The conference use cases should strive for continuous presence,
      presenting media from as many participants as reasonably possible. At
      the same time, they should strive to use as high quality media as possible
      from each participant. The bandwidth used for the conference should at
      the same time be as low as possible. Those three requirements are in
      conflict and there is no generally accepted, optimum trade-off.</t>

      <t>The number of participants in the conference shall be assumed to be
      unlimited. It may thus not be possible to create a continuous
      presence experience with media from all participants being presented to
      all other participants, and only media from a limited sub-set of
      participants may be presented to each individual participant. This is
      seen as the hardest conferencing use case, and most other use cases
      should also be covered by it. The number of other participants'
      media streams presented simultaneously on a given end-point is
      allowed to vary and may depend on a number of preconditions.</t>
    </section>

    <section title="Use Cases">
      <t>The sub-sections below describe multi-stream conferencing use cases
      relevant to each RTP base topology. Each use case is constructed so as to
      cover as many requirements as possible and must consequently include a
      number of different end-point types.</t>

      <t>In the figures below, the involved end-points are for clarity drawn
      as only sending or only receiving, but may in a real scenario of course
      have the possibility to both send and receive.</t>

      <t>The common figure legends for the end-point capability categories
      used in all use cases are:<list style="hanging">
          <t hangText="D:">Dual channel, high quality, for example conference
          room client. Streams in figures are denoted by '||' or '='. The term
          'channel' is chosen as a generic term and may apply to any media. A
          channel is related to a single media source in the sender and a
          media sink in the receiver. For video media, the channel source is
          typically a camera and the sink a screen or window. For audio, the
          channel source is typically a microphone and the sink a loudspeaker.
          A dual channel sending end-point thus has two independent media
          sources that can be sent simultaneously to a receiving end-point.
          Nothing prevents those dual channels from using other qualities than
          'high', but that is chosen as an example in this memo to simplify
          the description.</t>

          <t hangText="H:">High quality, single channel, for example meeting
          room client. Streams in figures are denoted by '|' or '-'.</t>

          <t hangText="M:">Medium quality, single channel, for example desktop
          client. Streams in figures are denoted by ':' or '..'.</t>

          <t hangText="L:">Low quality, single channel, for example mobile
          client. Streams in figures are denoted by '''.</t>
        </list>It is of course possible to divide end-points into more
      categories, but the chosen ones should make it possible to highlight
      most of the relevant topics.</t>

      <t>Note also that the use cases are kept as separated and clean-cut as
      possible to simplify the description. Most use cases, especially the
      <xref target="sec-mix">RTP Mixer ones</xref>, could be combined into
      larger, more complex scenarios.</t>

      <section anchor="sec-p2p" title="Point to Point">
        <t>The point to point case is close to trivial, since it is assumed
        that capability exchange during setup will be able to negotiate the
        best quality between the two end-points. When the number of channels
        differs between sender and receiver, the channel aspects of <xref
        target="sec-mix-dual"></xref> apply.</t>
      </section>

      <section anchor="sec-mix" title="RTP Mixer">
        <t>Each link between the RTP Mixer and an end-point is in principle a
        <xref target="sec-p2p">Point to Point connection</xref> as described
        above. One major difference from a Point to Point Connection is that
        the RTP Mixer should represent and act according to some combination
        of the wishes and needs of multiple end-points on the other side of
        the Mixer. That may include handling of conflicting or partly
        conflicting requirements, and the way to resolve those is not
        generally defined but will typically depend on RTP Mixer design,
        configuration, applied policies, or some combination.</t>

        <section anchor="sec-mix-incompatible" title="Incompatible Codecs">
          <t>This use case is the only feasible one when end-points have
          incompatible codecs; transcoding is in that case always
          necessary. The incompatibility is not
          necessarily only due to different codec types, but may also be
          caused by limited codec capacity, or limitations in media stream
          transport between the end-points. This use case is trivial in the
          sense that the Mixer is assumed to always be capable of creating a
          dedicated media mix to each receiving end-point. It may be used as
          an overall strategy, or as part of other use cases.</t>

          <t>A special case of transcoding is when the sender is configured to
          use scalable encoding and receivers do not support scalability.
          Transcoding of scalable streams to non-scalable streams is often a
          less complex operation than transcoding in general.</t>

          <figure align="center" anchor="fig-mix-incompatible"
                  title="Transcoding Incompatible Codecs">
            <artwork><![CDATA[
       +---------+
       | Codec 1 |
       +---------+
            |
            v
        +-------+
        | Mixer |
        +-------+
       /         \ Trans-
      /           \ coded
     v             v stream
+---------+   +---------+
| Codec 1 |   | Codec 2 |
+---------+   +---------+
]]></artwork>
          </figure>
        </section>

        <section title="Low Quality End-Point">
          <t>A low quality end-point has by definition the lowest media
          quality in the conference, and as a sender it is assumed that all
          other end-points can receive and present the media without
          restrictions. Some of the more capable end-points will have to
          choose how to present the received media in the best way, but it can
          always be presented.</t>

          <t>As a receiver, a low quality end-point is only capable of
          receiving streams from other low quality end-points (without
          transcoding).</t>

          <t>It is conceivable that a receiver of a certain quality category
          (not only low quality) can receive higher quality streams and reduce
          the quality locally such that it is feasible for presentation, but
          that will in general waste both bandwidth and processing
          resources.</t>

          <figure align="center" anchor="fig-mix-low"
                  title="Low Quality Stream">
            <artwork><![CDATA[
       +---+
       | L |
       +---+
         '
         v
     +-------+    +---+---+
     | Mixer |'''>|   D   |
     +-------+    +---+---+
    '    '    '
   '     '     '
  v      v      v
+---+  +---+  +---+
| L |  | M |  | H |
+---+  +---+  +---+
]]></artwork>
          </figure>
        </section>

        <section anchor="sec-mix-medium" title="Medium Quality End-Point">
          <t>Similar to above, the medium quality sender media stream is
          assumed to be possible to receive without restrictions in all but
          the low quality end-point. For the medium quality media stream to
          reach also the low quality end-point, there are three options that
          are described in the sub-sections below.</t>

          <t>When receiving, a medium quality end-point is capable of
          receiving other medium quality streams, as well as low quality
          streams (without transcoding).</t>

          <section anchor="sec-med-transcoding" title="Transcoding">
            <t>Transcode the stream when sent towards low quality receivers,
            as described <xref target="sec-mix-incompatible">above</xref>.
            This could sometimes be feasible quality-wise, especially if the
            quality difference between the medium quality and the low quality
            streams is large, making the reduced medium quality stream
            relatively close quality-wise to un-encoded low quality media.
            End-to-end delay will however always suffer.</t>

            <figure align="center" anchor="fig-mix-med-transcoding"
                    title="Medium Quality Stream Transcoding">
              <artwork><![CDATA[
       +---+
       | M |
       +---+
         :
         v
     +-------+    +---+---+
     | Mixer |...>|   D   |
     +-------+    +---+---+
    '    :    :
  T'     :     :
  v      v      v
+---+  +---+  +---+
| L |  | M |  | H |
+---+  +---+  +---+
T: Transcoded stream
]]></artwork>
            </figure>
          </section>

          <section anchor="sec-med-simulcast" title="Simulcast">
            <t>Encode both a medium quality and a low quality stream from the
            same un-encoded source data and simulcast them. The RTP Mixer can,
            without having to transcode, forward the low quality stream
            towards the low quality end-points, and forward the medium quality
            stream towards all other end-points.</t>

            <figure align="center" anchor="fig-mix-med-simulcast"
                    title="Medium Quality Stream Simulcast">
              <artwork><![CDATA[
       +---+
       | M |
       +---+
        ' :
        v v
     +-------+    +---+---+
     | Mixer |...>|   D   |
     +-------+    +---+---+
    '    :    :
   '     :     :
  v      v      v
+---+  +---+  +---+
| L |  | M |  | H |
+---+  +---+  +---+
]]></artwork>
            </figure>
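
            <t>The forwarding rule above can be sketched in a few lines of
            Python. This is only an illustration under stated assumptions,
            not a definitive Mixer design: the quality ranking, the function
            names and the receiver names are hypothetical and are not defined
            by this memo.</t>

            <figure align="center">
              <artwork><![CDATA[
```python
# Hypothetical ranking of the memo's end-point categories.
# "D" (dual channel, high quality) is assumed to rank like "H".
QUALITY_RANK = {"L": 0, "M": 1, "H": 2, "D": 2}


def pick_simulcast_version(available, receiver_quality):
    """Pick the best simulcast version the receiver can handle.

    available: quality labels the sender simulcasts, e.g. ["L", "M"].
    receiver_quality: category of the receiving end-point.
    Returns None when no version fits, in which case the Mixer
    would have to fall back to transcoding.
    """
    limit = QUALITY_RANK[receiver_quality]
    usable = [q for q in available if QUALITY_RANK[q] <= limit]
    if not usable:
        return None
    return max(usable, key=QUALITY_RANK.get)


def forward_plan(available, receivers):
    """Map each receiver name to the version forwarded to it."""
    return {
        name: pick_simulcast_version(available, quality)
        for name, quality in receivers.items()
    }


# Medium quality sender simulcasting "L" and "M", as in the figure.
plan = forward_plan(["L", "M"], {"l": "L", "m": "M", "h": "H", "d": "D"})
print(plan)
```
]]></artwork>
            </figure>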
          </section>

          <section anchor="sec-med-scalable" title="Scalable Coding">
            <t>As a variant of simulcast, if it is possible to use a scalable
            codec, create a scalable stream with one low quality sub-stream
            and one sub-stream that together with the low quality sub-stream
            can reconstruct a medium quality stream. Similar but not identical
            to the above, the RTP Mixer can, without having to transcode,
            forward the low quality sub-stream towards the low quality
            end-points, and forward both the low quality and the medium
            quality sub-streams (jointly describing a medium quality stream)
            to all other end-points.</t>

            <figure align="center" anchor="fig-mix-med-scalable"
                    title="Medium Quality Scalable Stream">
              <artwork><![CDATA[
       +---+
       | M |
       +---+
        ':
        vv 
     +-------+    +---+---+
     | Mixer |...>|   D   |
     +-------+'''>+---+---+
    '   ':   ':
   '    ':    ':
  v     vv     vv
+---+  +---+  +---+
| L |  | M |  | H |
+---+  +---+  +---+
]]></artwork>
            </figure>
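
            <t>The difference from simulcast is that a receiver gets the base
            sub-stream plus every enhancement sub-stream up to its own
            quality, rather than one self-contained version. A minimal Python
            sketch of that choice follows. It assumes, hypothetically, that
            layers are labelled with the end-point categories of this memo
            and that the sender always includes every layer below the highest
            one it sends.</t>

            <figure align="center">
              <artwork><![CDATA[
```python
# Assumed layer order, base layer first (labels are illustrative).
LAYER_ORDER = ["L", "M", "H"]


def layers_to_forward(sent_layers, receiver_quality):
    """Return the sub-streams to forward to one receiver.

    sent_layers: layers present in the sender's scalable stream.
    receiver_quality: highest quality the receiver can present.
    The result always starts at the base layer, since enhancement
    layers cannot be decoded without the layers below them.
    """
    top = LAYER_ORDER.index(receiver_quality)
    wanted = LAYER_ORDER[: top + 1]
    return [layer for layer in wanted if layer in sent_layers]


# Medium quality sender with two sub-streams, as in the figure.
for receiver in ("L", "M", "H"):
    print(receiver, layers_to_forward(["L", "M"], receiver))
```
]]></artwork>
            </figure>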
          </section>
        </section>

        <section anchor="sec-mix-high"
                 title="Single Channel High Quality End-Point">
          <t>This use case is very similar to the <xref
          target="sec-mix-medium">medium quality end-point case</xref> above.
          The difference is that there are now two different (sample)
          categories of end-points that cannot receive the high quality stream
          instead of one category. Simulcast and scalable streams must thus be
          extended to three versions or three sub-streams, respectively.</t>

          <t>As a receiver, all streams from other end-points can be received.
          The only exception is when multiple streams from a single end-point
          are used, such as from D.</t>

          <section anchor="sec-high-transcoding" title="Transcoding">
            <t>Similar to <xref target="sec-med-transcoding">Medium
            Quality</xref>, just that the transcoding needs to produce two
            different qualities instead of one.</t>

            <figure align="center" anchor="fig-mix-high-transcoding"
                    title="High Quality Stream Transcoding">
              <artwork><![CDATA[
       +---+
       | H |
       +---+
         |
         v
     +-------+    +---+---+
     | Mixer |--->|   D   |
     +-------+    +---+---+
    '    :    |
  T'    T:     |
  v      v      v
+---+  +---+  +---+
| L |  | M |  | H |
+---+  +---+  +---+
T: Transcoded stream
]]></artwork>
            </figure>
          </section>

          <section anchor="sec-high-simulcast" title="Simulcast">
            <t>Similar to <xref target="sec-med-simulcast">Medium
            Quality</xref>, just that three instead of two simulcast streams
            need to be sent.</t>

            <figure align="center" anchor="fig-mix-high-simulcast"
                    title="High Quality Stream Simulcast">
              <artwork><![CDATA[
       +---+
       | H |
       +---+
       ' : |
       v v v
     +-------+    +---+---+
     | Mixer |--->|   D   |
     +-------+    +---+---+
    '    :    |
   '     :     |
  v      v      v
+---+  +---+  +---+
| L |  | M |  | H |
+---+  +---+  +---+
]]></artwork>
            </figure>
          </section>

          <section anchor="sec-high-scalable" title="Scalable Coding">
            <t>Similar to <xref target="sec-med-scalable">Medium
            Quality</xref>, just that three instead of two scalable layers are
            used.</t>

            <figure align="center" anchor="fig-mix-high-scalable"
                    title="High Quality Scalable Stream">
              <artwork><![CDATA[
       +---+
       | H |
       +---+
        ':|
        vvv
     +-------+--->+---+---+
     | Mixer |...>|   D   |
     +-------+'''>+---+---+
    '   ':   ':|
   '    ':    ':|
  v     vv     vvv
+---+  +---+  +---+
| L |  | M |  | H |
+---+  +---+  +---+
]]></artwork>
            </figure>
          </section>
        </section>

        <section anchor="sec-mix-dual"
                 title="Dual Channel High Quality End-Point">
          <t>Again, this use case is very similar to <xref
          target="sec-mix-high">the one above</xref>. The major difference is
          that this end-point is capable of sending and receiving dual, high
          quality streams where each stream has to be treated in a similar way
          to the previous section.</t>

          <t>When using multiple inter-related media, such as video with
          corresponding audio, those media streams need not only be
          synchronized time-wise, just as for single channel end-points, but
          their spatial relation also needs to be established. An example is a left
          camera with an attached microphone and a right camera with an
          attached microphone. In general it is likely always desirable to be
          able to relate streams from a multi-channel end-point in a defined
          way, representing related sub-parts of a larger scene, both
          intra-media and inter-media. Description and signaling of stream
          relations is a complex problem in itself, which is currently <xref
          target="I-D.ietf-clue-framework">work in progress in CLUE WG</xref>
          and will not be elaborated further in this memo.</t>

          <t>Another major, additional, aspect to account for is that the RTP
          Mixer needs to choose how to map dual (or multiple) streams onto a
          single stream, when forwarding towards end-points that have fewer
          receive channels than the sender. This problem is similar to
          choosing a limited set of participants from a potentially unlimited
          set, which is described <xref
          target="sec-mix-select">below</xref>.</t>

          <t>A dual-channel (or multi-channel) receiving end-point that is
          receiving fewer simultaneous channel streams from a sending
          end-point than the maximum possible this end-point can handle, will
          have to decide which one(s) of the available receive channels should
          be used for each received stream. This decision can also be made by
          the RTP Mixer, if it knows the concept of multi-channel clients, has
          information about how many simultaneous channels the individual
          receiver supports, and knows how those channels should be
          related.</t>

          <t>There may of course be end-points that have capability for more
          than two simultaneous channels. It is also possible to envision
          end-points where the number of receive channels differs from the
          number of send channels.</t>
        </section>

        <section anchor="sec-mix-select"
                 title="Mixer Source and Sink Selection">
          <t>When a Mixer cannot forward all available streams to each client,
          it has to choose a small set out of a potentially very large set of
          streams in the conference. Multiple strategies are possible for that
          choice. The Mixer may use different strategies towards different
          receivers, depending on for example their capabilities or
          preferences.</t>

          <t>Another dimension of selection exists when the conference contains
          end-points with different numbers of channels (multiple media streams
          of the same media type). On the Mixer receive side, it may be
          necessary to select a few streams from multi-stream media. On the
          Mixer send side, it may be necessary to select to which channel in a
          multi-stream capable receiver a certain stream should be sent.</t>

          <t>The choice of source may be either algorithmic (pre-configured)
          or manual (user controlled from one or more end-points). Speech
          activity is a commonly used algorithmic measure to choose which
          participants' media streams to forward, but it is not the only
          conceivable measure. Which and how many streams to select can either
          be based on some algorithm included in the RTP Mixer (for example
          the N most active speakers), or it can be controlled by the
          conference owner, or even by the receiving users individually.</t>
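
          <t>As an illustration only, the two selection modes could be
          combined as in the following Python sketch. The activity scores,
          the function name and the tie-breaking rule are assumptions made
          for the example; this memo does not define a particular speech
          activity measure or selection algorithm.</t>

          <figure align="center">
            <artwork><![CDATA[
```python
def select_sources(activity, n, manual_choice=None):
    """Choose at most n streams to forward to one receiver.

    activity: dict mapping stream id to a speech activity score
        (a hypothetical measure; higher means more active).
    manual_choice: optional list of stream ids picked by a user.
        When given, it overrides the algorithmic choice.
    """
    if manual_choice is not None:
        # Manual mode: keep the user's order, drop unknown streams.
        return [s for s in manual_choice if s in activity][:n]
    # Algorithmic mode: the n most active speakers; ties are
    # broken by stream id so the result is deterministic.
    ranked = sorted(activity, key=lambda s: (-activity[s], s))
    return ranked[:n]


activity = {1: 0.9, 2: 0.1, 3: 0.7, 4: 0.5, 5: 0.0}
print(select_sources(activity, 3))
print(select_sources(activity, 3, manual_choice=[2, 3, 5]))
```
]]></artwork>
          </figure>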

          <t>The figure below depicts the case where a receiving end-point
          explicitly selects streams through signaling (*) to the Mixer. Both
          the media senders and their streams are numbered for clarity, and
          the conceptual signaling message is contained in {...}.</t>

          <figure align="center" anchor="fig-receiver-selection"
                  title="Receiver Stream Selection">
            <artwork><![CDATA[
     +----+  +----+  +----+
     | L1 |  | M2 |  | S3 |
     +----+  +----+  +----+
         '      :      |
          1'   2:   3|
            v   v   v  
+----+ 4  +-----------+  5 +----+
| M4 |...>|   Mixer   |<'''| L5 |
+----+    +-----------+    +----+
              ^   |1,3,4
       {1,3,4}*   v
              +---+
              | H |
              +---+
]]></artwork>
          </figure>

          <t>To be able to make an informed choice on what streams to select,
          the user at the receiving end-point will need information about
          which conference participant corresponds to which stream, and
          possibly also other meta-information about the streams and sending
          end-points.</t>
        </section>

        <section anchor="sec-mix-compose" title="Media Composition">
          <t>When it is desirable that the RTP Mixer <xref
          target="sec-mix-select">selects</xref> and forwards a larger number
          of simultaneous streams than what the receiving end-point can
          support, the Mixer has the option to make a composition of multiple
          streams onto fewer streams, possibly to only a single stream. Which
          and how many streams to compose is typically based on the selection,
          as described in the previous section.</t>

          <t>The composition operation is basically independent from
          selection. In general, Mixer composition requires transcoding. A
          Mixer composition use case example is depicted below. To simplify
          the picture, only a single receiver is included. Also, both the
          media senders and their streams are numbered for clarity. In the
          below example, the Mixer has chosen to compose streams 1, 3 and 4
          into the single stream sent to the receiving end-point.</t>

          <figure align="center" anchor="fig-mixer-composition"
                  title="Mixer Composition">
            <artwork><![CDATA[
     +----+  +----+  +----+
     | L1 |  | M2 |  | S3 |
     +----+  +----+  +----+
         '      :      |
          1'   2:   3|
            v   v   v  
+----+ 4  +-----------+  5 +----+
| M4 |...>|   Mixer   |<'''| L5 |
+----+    +-----------+    +----+
                |1,3,4
                v
              +---+
              | H |
              +---+
]]></artwork>
          </figure>

          <t>An alternative to composition in the Mixer is to let the
          end-point do local composition by sending it multiple, un-composed
          streams. This could avoid transcoding at the cost of sending
          multiple streams, which is depicted below, where streams 2, 3 and 5
          are sent to the receiving client for local composition, as an
          example.</t>

          <figure align="center" anchor="fig-local-composition"
                  title="Local Composition">
            <artwork><![CDATA[
     +----+  +----+  +----+
     | L1 |  | M2 |  | S3 |
     +----+  +----+  +----+
         '      :      |
          1'   2:   3|
            v   v   v  
+----+ 4  +-----------+  5 +----+
| M4 |...>|   Mixer   |<'''| L5 |
+----+    +-----------+    +----+
            2: 3| 5'
             :  |  '
             v  v  v
             +-----+
             |  H  |
             +-----+
]]></artwork>
          </figure>

          <t>When the senders offer streams of multiple qualities, either the
          mixer or the local composition can select and combine media of
          different qualities. Use of multiple qualities could help optimize
          resource utilization for transport, decoding and rendering. In the
          figure below, one high quality stream (3), one medium quality stream
          (4) and two low quality streams (1 and 2) are selected for local
          composition, as an example.</t>

          <figure align="center" anchor="fig-multi-local-composition"
                  title="Multi Quality Local Composition">
            <artwork><![CDATA[
     +----+  +----+  +----+
     | L1 |  | M2 |  | S3 |
     +----+  +----+  +----+
         '    ' :    ' : |
          1' 2'2: 3'3:3|
       4   v  v v  v v v
+----+...>+-------------+  5 +----+
| M4 | 4  |    Mixer    |<'''| L5 |
+----+'''>+-------------+    +----+
          3|  4:  1'  2'
             |  : '  '
              v v v v
             +-------+
             |   H   |
             +-------+
]]></artwork>
          </figure>

          <t>This local composition scenario can be further enhanced by the
          Mixer providing different quality streams to the receiver, based on
          the Mixer selection algorithm. One example could be to let the Mixer
          forward the stream from the most active speaker as a high quality
          stream, and forward the less active speakers as lower quality
          streams. To use this in the local composition, the receiving
          end-point must know the streams' different roles, which requires a
          stream role agreement between the Mixer and the receiving end-point.
          In the figure above, this can be achieved by tagging for example the
          leftmost stream from the mixer as having the "most active" role. The
          role agreement can be made through signaling between Mixer and
          receiving end-point.</t>
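
          <t>One possible shape of such a role assignment is sketched below
          in Python. The role labels and the choice of "H" and "L" as the two
          forwarded qualities are hypothetical, and the sketch assumes the
          streams have already been ranked by the Mixer selection
          algorithm.</t>

          <figure align="center">
            <artwork><![CDATA[
```python
def assign_roles(ranked_streams):
    """Attach a role and a forwarding quality to each stream.

    ranked_streams: stream ids ordered from most to least active.
    Returns a list of (stream id, role, quality) tuples, where
    the role labels and quality labels are illustrative only.
    """
    plan = []
    for position, stream in enumerate(ranked_streams):
        if position == 0:
            # The most active speaker is forwarded in high quality.
            plan.append((stream, "most active", "H"))
        else:
            # Everyone else is forwarded in a lower quality.
            plan.append((stream, "less active", "L"))
    return plan


for entry in assign_roles([3, 4, 1]):
    print(entry)
```
]]></artwork>
          </figure>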

          <t>When the receiving end-point supports reception and presentation
          of several channels (for example has several screens), it is
          possible to combine Mixer composition with local composition of
          multiple un-composed streams by sending one or more composed streams
          and one or more un-composed streams from the Mixer.</t>
        </section>
      </section>

      <section anchor="sec-multicast" title="Multicast">
        <t>In the multicast or multi-unicast case, each media stream from a
        single sender will reach multiple receivers unmodified. This can be
        achieved by multicast addressing, or by multi-unicast and <xref
        target="RFC5117">RTP Translators</xref>.</t>

        <t>This use case is similar to when an RTP Mixer is neither performing
        <xref target="sec-mix-compose">composition</xref> nor <xref
        target="sec-mix-select">source selection</xref>, but is forwarding all
        streams (and qualities, if more than one) to all receivers. The entire
        load of stream and quality selection for presentation is put on the
        receiving end-point.</t>

        <t>In the figure below, multi-unicast through the use of an RTP
        Translator is depicted since the figure becomes clearer than with a
        full mesh multicast. The figure illustrates non-scalable streams, but
        it is of course also possible to multicast scalable streams.</t>

        <figure align="center" anchor="fig-multicast"
                title="Multi-Unicast of Multiple Qualities">
          <artwork><![CDATA[
        +---+
        | H |
        +---+
        ' : |
        v v v
  +---------------+--->+---+---+
  |  Translator   |...>|   D   |
  +---------------+'''>+---+---+
  ' : | ' : | ' : |
 ' : |  ' : |  ' : |
v v v   v v v   v v v
+---+   +---+   +---+
| L |   | M |   | H |
+---+   +---+   +---+
]]></artwork>
        </figure>

      </section>
    </section>

    <section anchor="sec-rtp-usage" title="RTP Usage">
      <t>This section discusses how <xref target="RFC3550">RTP
      transport</xref> could be used with the scenarios discussed in the
      previous section. It complements, extends and partially presents an
      alternative solution to what is described in <xref
      target="I-D.lennox-clue-rtp-usage"></xref>.</t>

      <section anchor="sec-ssrc-csrc" title="Use of SSRC and CSRC">
        <t>It is assumed that each RTP media stream in the use cases in the
        previous section is identified by an SSRC. There already exist methods
        to <xref target="RFC4575">convey information about each media stream
        and sending end-point in a conference</xref>. Those methods also
        provide means to correlate that information with the stream SSRC. This
        information could be sufficient for a receiving end-point to make
        <xref target="sec-mix-select">informed media stream selection
        decisions</xref>.</t>

        <t>When the RTP Mixer associates a <xref target="sec-mix-compose">role
        to a stream</xref>, for example "most active speaker" or "leftmost
        channel", it is possible to associate that additional property to an
        SSRC belonging to the Mixer, while also keeping the original SSRC in
        the RTP packet as CSRC. This way, it is possible to apply special
        treatment to received streams based on their SSRC without losing the
        ability to identify the original source, using existing RTP
        functionality.</t>

        <figure align="center" anchor="fig-ssrc-csrc"
                title="SSRC and CSRC from Mixer">
          <artwork><![CDATA[
     +----+  +----+  +----+
     | L1 |  | M2 |  | S3 |
     +----+  +----+  +----+
    SSRC1' SSRC2: SSRC3|
          1'   2:   3|
     SSRC4  v   v   v  SSRC5
+----+ 4  +-----------+  5 +----+
| M4 |...>|   Mixer   |<'''| L5 |
+----+    +-----------+    +----+
        SSRC6 :2 5' SSRC7
       (CSRC2)v   v(CSRC5)
              +---+
              | H |
              +---+
]]></artwork>
        </figure>

        <t>It is also possible in a receiving end-point to let each role (and
        Mixer SSRC) map towards a specific media decoder, since that Mixer
        SSRC would rarely (if ever) change during a session other than due to
        SSRC conflicts, while the CSRC would typically change every time a new
        stream is selected for that specific role, for example "active
        speaker".</t>
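
        <t>As a minimal sketch of the receiver behavior described above, the
        following Python fragment reads the SSRC and CSRC list from the fixed
        <xref target="RFC3550">RTP</xref> header and routes each packet to a
        decoder slot keyed on the Mixer SSRC, while tracking the CSRC as the
        original source. The names parse_rtp_sources and DecoderRouter are
        hypothetical and not taken from any specification. Header extensions,
        padding and SSRC collision handling are deliberately ignored.</t>

        <figure title="Sketch: mapping Mixer SSRC to decoder, CSRC to source">
          <artwork><![CDATA[
import struct


def parse_rtp_sources(packet):
    """Return (ssrc, [csrc, ...]) from an RTP packet header."""
    if len(packet) < 12:
        raise ValueError("shorter than the 12-byte fixed header")
    first_byte = packet[0]
    if first_byte >> 6 != 2:
        raise ValueError("not RTP version 2")
    csrc_count = first_byte & 0x0F  # CC field, 0..15
    end = 12 + 4 * csrc_count
    if len(packet) < end:
        raise ValueError("truncated CSRC list")
    (ssrc,) = struct.unpack_from("!I", packet, 8)
    csrcs = list(struct.unpack_from("!%dI" % csrc_count, packet, 12))
    return ssrc, csrcs


class DecoderRouter:
    """Hypothetical receiver: one decoder slot per Mixer SSRC."""

    def __init__(self):
        self.slot_by_ssrc = {}    # Mixer SSRC -> decoder slot index
        self.source_by_ssrc = {}  # Mixer SSRC -> current original source

    def route(self, packet):
        """Return (slot, original_source, source_changed)."""
        ssrc, csrcs = parse_rtp_sources(packet)
        next_slot = len(self.slot_by_ssrc)
        slot = self.slot_by_ssrc.setdefault(ssrc, next_slot)
        # Without a CSRC, the SSRC itself is the original source.
        original = csrcs[0] if csrcs else ssrc
        changed = self.source_by_ssrc.get(ssrc) != original
        self.source_by_ssrc[ssrc] = original
        return slot, original, changed
]]></artwork>
        </figure>

        <t>In this sketch the decoder slot stays fixed for the lifetime of a
        Mixer SSRC, and a change of CSRC only signals that the role is now
        filled by a different original source.</t>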

        <t>Note that when simulcast is used, different simulcast versions can
        typically use different SSRCs. When scalable coding is used, different
        layers can sometimes be sent within a single SSRC using a single
        Payload Type and thus cannot be distinguished at the RTP level.
        Identification of different layers will then have to be codec
        specific. Some scalable codecs can also send different layers on
        separate SSRCs or using separate Payload Types.</t>
      </section>

      <section anchor="sec-signal-ext" title="Signaling Extensions">
        <t>End-points and Mixers <xref target="sec-mix-dual">supporting
        multiple channels</xref> need to know how many simultaneous channels
        can be accepted by a receiver and will be used by a sender.
        Assuming that each channel is sent as a single SSRC, there should be
        <xref target="I-D.westerlund-avtcore-max-ssrc">signaling that limits
        the number of SSRCs in an RTP session</xref>. This maps well to the
        above suggested relation between SSRC and media decoders, since the
        suggested limitation then also expresses the maximum simultaneously
        available decoding resources.</t>
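
        <t>The relation between a signaled SSRC limit and decoding resources
        can be illustrated with a small Python sketch. It assumes that each
        side has already learned the other side's limit through some
        signaling. The actual syntax of such signaling is not reproduced
        here, and the class name SsrcLimiter is hypothetical.</t>

        <figure title="Sketch: admitting SSRCs up to a negotiated limit">
          <artwork><![CDATA[
class SsrcLimiter:
    """Hypothetical receiver-side limit on simultaneous SSRCs."""

    def __init__(self, sender_max_send, receiver_max_recv):
        # The usable limit is bounded by both ends of the session.
        self.limit = min(sender_max_send, receiver_max_recv)
        self.active = set()

    def accept(self, ssrc):
        """Return True if a decoder is available for this SSRC."""
        if ssrc in self.active:
            return True
        if len(self.active) >= self.limit:
            return False
        self.active.add(ssrc)
        return True

    def release(self, ssrc):
        """Free the decoder, for example after the stream has ended."""
        self.active.discard(ssrc)
]]></artwork>
        </figure>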

        <t>When representing a specific media source in several different
        qualities, and when using <xref target="sec-med-simulcast">simulcast to
        transport</xref> them rather than sending them as scalable layers contained in a
        single stream, those separate streams need to be <xref
        target="I-D.westerlund-avtcore-rtp-simulcast">signaled as simulcast
        versions</xref>, in order for the receiver to be able to <xref
        target="sec-mix-select">apply correct selection logic</xref>.</t>

        <t>When a conference is configured to let <xref
        target="sec-mix-select">individual users at receiving end-points
        choose which streams to receive</xref>, responsive <xref
        target="I-D.westerlund-dispatch-stream-selection">selection signaling
        between end-point and RTP Mixer</xref> is needed to initiate the
        stream selection. This selection is also applicable to the media
        streams included in a composition.</t>

        <t>Note that when simulcast is used, different simulcast versions can
        typically use different SSRCs. When scalable coding is used, different
        layers can sometimes be sent within a single SSRC using a single
        Payload Type and thus cannot use the parts of SDP signaling that
        rely on those identifiers. Identification of different layers will
        then have to be codec specific. Some scalable codecs can also send
        different layers on separate SSRCs or using separate Payload Types.</t>
      </section>

      <section anchor="sec-optimizations" title="Optimizations">
        <t>When it is desirable to minimize the number of UDP ports used by an
        end-point, for example to reduce the resources for NAT and firewall
        traversal, it should be possible to send <xref
        target="I-D.westerlund-avtcore-transport-multiplexing">all media
        streams from all RTP sessions on a single UDP port</xref>. This should
        preferably be done without losing any important RTP functionality.
        However, transport resource priority and Quality of Service handling
        are typically performed based on the 5-tuple (source and destination
        addresses and ports, and protocol). Differentiating media stream
        priority can therefore still require more than one UDP port, and thus
        more than one 5-tuple.</t>

        <t>In a conference use case with multiple sending end-points and where
        the receiving end-points do not make use of all available streams,
        there is a risk that some of the sent streams are not used by any
        receiver. The probability for this increases when end-points provide
        multiple streams of different qualities. The need for a certain stream
        can change very quickly, for example when the need is based on
        conditions of other streams such as speech activity. To save bandwidth
        and processing resources in the sending end-point, it would thus be
        desirable for an RTP Mixer to be able to quickly turn off or <xref
        target="I-D.westerlund-avtext-rtp-stream-pause">pause individual
        streams</xref> that are no longer used in any media mix sent to
        receiving end-points, and even more importantly be able to quickly
        resume those streams when they are needed again.</t>
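
        <t>The decision logic an RTP Mixer would need for this can be
        sketched in Python as follows. The sketch only computes which source
        streams to pause or resume whenever a receiver changes its
        selection. How the pause and resume requests are carried to the
        sending end-point is outside its scope. The class name
        PauseController and its methods are hypothetical.</t>

        <figure title="Sketch: deciding which source streams to pause">
          <artwork><![CDATA[
class PauseController:
    """Hypothetical Mixer logic tracking unused source streams."""

    def __init__(self, source_ssrcs):
        self.sources = set(source_ssrcs)
        self.selection = {}  # receiver id -> SSRCs used in its mix
        self.paused = set()

    def select(self, receiver, ssrcs):
        """Record a receiver's streams; return (to_pause, to_resume)."""
        self.selection[receiver] = set(ssrcs) & self.sources
        in_use = set().union(*self.selection.values())
        # Resume paused streams that some receiver now needs.
        to_resume = sorted(self.paused & in_use)
        # Pause running streams that no receiver uses any longer.
        to_pause = sorted(self.sources - in_use - self.paused)
        self.paused = self.sources - in_use
        return to_pause, to_resume
]]></artwork>
        </figure>

        <t>For example, with source SSRCs 1, 2 and 3 and a single receiver
        that first selects only SSRC 2, the sketch asks to pause 1 and 3.
        When that receiver later selects 2 and 3, it asks to resume 3.</t>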

        <t>In use cases where <xref target="sec-mix-dual">multiple media
        streams</xref> are used in a single RTP session, when SDP is used as
        signaling protocol, and specifically when the <xref
        target="sec-signal-ext">number of streams depends on the SDP
        negotiation outcome</xref>, the currently defined bandwidth signaling
        attribute is only capable of describing the maximum possible bandwidth
        usage for the most demanding alternative. It would be desirable to
        <xref target="I-D.westerlund-mmusic-sdp-bw-attribute">express
        bandwidth requirements in a more precise way</xref>.</t>
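
        <t>The gap between the signaled worst case and the actual need can be
        shown with simple arithmetic. The Python sketch below assumes, purely
        for illustration, that every stream uses the same bitrate. The
        function name and the figures are invented and do not come from any
        specification.</t>

        <figure title="Sketch: worst-case versus negotiated bandwidth">
          <artwork><![CDATA[
def bandwidth_bounds(per_stream_kbps, max_streams, negotiated_streams):
    """Return (signaled_worst_case_kbps, actually_needed_kbps)."""
    if not 0 <= negotiated_streams <= max_streams:
        raise ValueError("negotiated stream count out of range")
    # A single value has to cover the most demanding alternative.
    worst_case = per_stream_kbps * max_streams
    # What the negotiated outcome really requires.
    needed = per_stream_kbps * negotiated_streams
    return worst_case, needed


# Offer allows up to 3 streams of 1000 kbps; only 1 is negotiated.
worst_case, needed = bandwidth_bounds(1000, 3, 1)
]]></artwork>
        </figure>

        <t>With these assumed figures the single bandwidth value has to state
        3000 kbps, although the negotiated outcome needs only 1000 kbps.</t>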

        <t>While any RTP stream relations such as for example spatial
        co-location of related audio and video streams should be possible to
        express in session signaling or another application signaling protocol,
        there may be times when it is desirable that RTP stream <xref
        target="I-D.westerlund-avtext-rtcp-sdes-srcname">SSRC relations</xref>
        such as simulcast alternatives or related FEC streams can be seen
        directly in the RTP or RTCP streams. This would allow for processing
        of media and related streams in middle boxes, without the need to have
        access to all higher layer signaling. Keeping protocol layer
        separation will enable some architectural freedom and may ease future
        extensions.</t>
      </section>
    </section>

    <section anchor="sec-security" title="Security Considerations">
      <t>Any security considerations relevant to this memo are described in
      the RFCs and drafts referenced in the <xref target="sec-rtp-usage">RTP
      Usage section</xref>.</t>
    </section>

    <section anchor="sec-iana" title="IANA Considerations">
      <t>This document makes no request of IANA.</t>

      <t>Note to RFC Editor: this section may be removed on publication as an
      RFC.</t>
    </section>

    <section anchor="sec-acknowledgements" title="Acknowledgements">
      <t></t>
    </section>
  </middle>

  <back>
    <references title="Informative References">
      <?rfc include='reference.RFC.3550'?>

      <?rfc include='reference.RFC.4575'?>

      <?rfc include='reference.RFC.5117'?>

      <?rfc include='reference.I-D.ietf-clue-framework'?>

      <?rfc include='reference.I-D.westerlund-avtcore-max-ssrc'?>

      <?rfc include='reference.I-D.westerlund-avtcore-rtp-simulcast'?>

      <?rfc include='reference.I-D.westerlund-avtcore-transport-multiplexing'?>

      <?rfc include='reference.I-D.westerlund-avtext-rtcp-sdes-srcname'?>

      <?rfc include='reference.I-D.westerlund-avtext-rtp-stream-pause'?>

      <?rfc include='reference.I-D.westerlund-dispatch-stream-selection'?>

      <?rfc include='reference.I-D.westerlund-mmusic-sdp-bw-attribute'?>

      <?rfc include='reference.I-D.lennox-clue-rtp-usage'?>
    </references>
  </back>
</rfc>