<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<?rfc autobreaks="yes"?>
<rfc category="info"
docName="draft-ietf-avtext-rtp-grouping-taxonomy-00"
ipr="trust200902">
<front>
<title abbrev="RTP Grouping Taxonomy">A Taxonomy of Grouping
Semantics and Mechanisms for Real-Time Transport Protocol (RTP)
Sources</title>
<author fullname="Jonathan Lennox" initials="J." surname="Lennox">
<organization abbrev="Vidyo">Vidyo, Inc.</organization>
<address>
<postal>
<street>433 Hackensack Avenue</street>
<street>Seventh Floor</street>
<city>Hackensack</city>
<region>NJ</region>
<code>07601</code>
<country>US</country>
</postal>
<email>jonathan@vidyo.com</email>
</address>
</author>
<author fullname="Kevin Gross" initials="K." surname="Gross">
<organization abbrev="AVA">AVA Networks, LLC</organization>
<address>
<postal>
<street/>
<city>Boulder</city>
<region>CO</region>
<country>US</country>
</postal>
<email>kevin.gross@avanw.com</email>
</address>
</author>
<author fullname="Suhas Nandakumar" initials="S"
surname="Nandakumar">
<organization>Cisco Systems</organization>
<address>
<postal>
<street>170 West Tasman Drive</street>
<city>San Jose</city>
<region>CA</region>
<code>95134</code>
<country>US</country>
</postal>
<email>snandaku@cisco.com</email>
</address>
</author>
<author fullname="Gonzalo Salgueiro" initials="G"
surname="Salgueiro">
<organization>Cisco Systems</organization>
<address>
<postal>
<street>7200-12 Kit Creek Road</street>
<city>Research Triangle Park</city>
<region>NC</region>
<code>27709</code>
<country>US</country>
</postal>
<email>gsalguei@cisco.com</email>
</address>
</author>
<author fullname="Bo Burman" initials="B." surname="Burman">
<organization>Ericsson</organization>
<address>
<postal>
<street>Farogatan 6</street>
<city>SE-164 80 Kista</city>
<country>Sweden</country>
</postal>
<phone>+46 10 714 13 11</phone>
<email>bo.burman@ericsson.com</email>
</address>
</author>
<!-- Add more authors here! -->
<date day="5" month="November" year="2013"/>
<area>Real Time Applications and Infrastructure (RAI)</area>
<keyword>I-D</keyword>
<keyword>Internet-Draft</keyword>
<!-- TODO: more keywords -->
<abstract>
<t>The terminology about, and associations among, Real-Time
Transport Protocol (RTP) sources can be complex and somewhat
opaque. This document describes a number of existing and
proposed relationships among RTP sources, and attempts to define
common terminology for discussing protocol entities and their
relationships.</t>
</abstract>
</front>
<middle>
<section anchor="introduction" title="Introduction">
<t>The existing taxonomy of sources in RTP is often regarded as
confusing and inconsistent. Consequently, a deep understanding
of how the different terms relate to each other becomes a real
challenge. Frequently cited examples of this confusion are (1)
how different protocols that make use of RTP use the same terms
to signify different things and (2) how the complexities
addressed at one layer are often glossed over or ignored at
another.</t>
<t>This document attempts to provide some clarity by reviewing
the semantics of various aspects of sources in RTP. As an
organizing mechanism, it approaches this by describing various
ways that RTP sources can be grouped and associated
together.</t>
<t>All non-specific references to ControLling mUltiple streams
for tElepresence (CLUE) in this document map to <xref
target="I-D.ietf-clue-framework"/> and all references to Web
Real-Time Communications (WebRTC) map to <xref
target="I-D.ietf-rtcweb-overview"/>.</t>
</section>
<section title="Concepts">
<t>This section defines concepts that serve to identify and name
various transformations and streams in a given RTP usage. For
each concept, an attempt is made to list any alternate
definitions and usages that co-exist today, along with various
characteristics that further describe the concept. These
concepts are divided into two categories: one related to the
chain of streams and transformations that media can be subject
to, and the other for the entities involved in the
communication.</t>
<section title="Media Chain">
<t>This section contains the concepts that can be involved in
taking a sequence of physical world stimuli (sound waves,
photons, key-strokes) at a sender side and transporting them to
a receiver, which may recover a sequence of physical stimuli.
This chain of concepts is of two main types: streams and
transformations. Streams are time-based sequences of samples
of the physical stimulus in various representations, while
transformations change the representation of the streams in
some way.</t>
<t>The examples below are basic ones, and it is important to
keep in mind that this conceptual model enables more complex
usages. Some will be discussed further in later sections of
this document. In general, the following applies to this
model:<list style="symbols">
<t>A transformation may have zero or more inputs and one
or more outputs.</t>
<t>A Stream is of some type.</t>
<t>A Stream has one source transformation and one or more
sink transformations (with the exception of <xref
target="physical-stimulus">Physical Stimulus</xref>, which
can have no source or sink transformation).</t>
<t>Streams can be forwarded from a transformation output
to any number of inputs on other transformations that
support that type.</t>
<t>If the output of a transformation is sent to multiple
transformations, those streams will be identical; it takes
a transformation to make them different.</t>
<t>There are no formal limitations on how streams are
connected to transformations; this may include loops if
required by a particular transformation.</t>
</list> It is also important to remember that this is a
conceptual model; thus, real-world implementations may look
different and have a different structure.</t>
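<t>As an illustration only (not part of any specification; all class
and attribute names here are invented), the model of transformations
connected by streams might be sketched in code as follows:</t>
<figure align="center">
<artwork><![CDATA[
```python
# Illustrative sketch of the conceptual model: transformations have
# zero or more inputs and one or more outputs, and each stream links
# its source transformation to its sink transformations.

class Stream:
    def __init__(self, kind, source):
        self.kind = kind      # e.g. "Raw Stream", "Source Stream"
        self.source = source  # the transformation that produced it
        self.sinks = []       # transformations consuming this stream

class Transformation:
    def __init__(self, name, inputs=()):
        self.name = name
        self.inputs = list(inputs)
        for stream in self.inputs:
            stream.sinks.append(self)

    def output(self, kind):
        return Stream(kind, source=self)

# Part of the sender-side chain.
capture = Transformation("Media Capture")
raw = capture.output("Raw Stream")
media_source = Transformation("Media Source", [raw])
source_stream = media_source.output("Source Stream")
encoder = Transformation("Media Encoder", [source_stream])
encoded = encoder.output("Encoded Stream")
```
]]></artwork>
</figure>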
<t>To provide a basic understanding of the relationships in
the chain, we first introduce the concepts for the <xref
target="fig-sender-chain">sender side</xref>. This covers the
chain from physical stimulus until media packets are emitted
onto the network.</t>
<figure align="center" anchor="fig-sender-chain"
title="Sender Side Concepts in the Media Chain">
<artwork><![CDATA[ Physical Stimulus
|
V
+--------------------+
| Media Capture |
+--------------------+
|
Raw stream
V
+--------------------+
| Media Source |<- Synchronization Timing
+--------------------+
|
Source Stream
V
+--------------------+
| Media Encoder |
+--------------------+
|
Encoded Stream +-----------+
V | V
+--------------------+ | +--------------------+
| Media Packetizer | | | Media Redundancy |
+--------------------+ | +--------------------+
| | |
+------------+ Redundancy Packet Stream
Source Packet Stream |
V V
+--------------------+ +--------------------+
| Media Transport | | Media Transport |
+--------------------+ +--------------------+
]]></artwork>
</figure>
<t>In <xref target="fig-sender-chain"/> we have included a
branched chain to cover the concepts for using redundancy to
improve the reliability of the transport. The Media Transport
concept is an aggregate that is decomposed below in <xref
target="media-stream-decomposition"/>.</t>
<t>Below we review a <xref
target="fig-receiver-chain">receiver media chain</xref>
matching the sender side, to look at the inverse
transformations and their attempts to recover streams possibly
identical to those in the sender chain. Note that the
streams out of a reverse transformation, like the Source
Stream out of the Media Decoder, are in many cases not the same
as the corresponding ones on the sender side; thus they are
prefixed with "Received" to denote a potentially modified
version. The reason they are not the same lies in
transformations that can be irreversible. For example,
lossy source coding in the Media Encoder prevents the Source
Stream out of the Media Decoder from being the same as the one
fed into the Media Encoder. Other reasons include packet loss
or late loss in the Media Transport transformation that even
Media Repair, if used, fails to repair. It should be noted
that some transformations are not always present, like Media
Repair, which cannot operate without Redundancy Packet
Streams.</t>
<figure align="center" anchor="fig-receiver-chain"
title="Receiver Side Concepts of the Media Chain">
<artwork><![CDATA[+--------------------+ +--------------------+
| Media Transport | | Media Transport |
+--------------------+ +--------------------+
| |
Received Packet Stream Received Redundancy PS
| |
| +-------------------+
V V
+--------------------+
| Media Repair |
+--------------------+
|
Repaired Packet Stream
V
+--------------------+
| Media Depacketizer |
+--------------------+
|
Received Encoded Stream
V
+--------------------+
| Media Decoder |
+--------------------+
|
Received Source Stream
V
+--------------------+
| Media Sink |--> Synchronization Information
+--------------------+
|
Received Raw Stream
V
+--------------------+
| Media Renderer |
+--------------------+
|
V
Physical Stimulus
]]></artwork>
</figure>
<section anchor="physical-stimulus" title="Physical Stimulus">
<t>The physical stimulus is a physical event that can be
captured and provided as media to a receiver. This includes
sound waves making up audio, photons in a light field that
is visible, or other excitations or interactions with
sensors, like keystrokes on a keyboard.</t>
</section>
<section anchor="media-capture" title="Media Capture">
<t>The process of transforming the <xref
target="physical-stimulus">Physical Stimulus</xref> into
captured media. The Media Capture performs a digital
sampling of the physical stimulus, usually periodically, and
outputs this in some representation as a <xref
target="raw-stream">Raw Stream</xref>. Due to the periodic
sampling, or at least to being timed asynchronous events, this
data forms some type of media data stream.
The Media Capture is normally instantiated in some type of
device, i.e., a media capture device. Examples of different
types of media capture devices are digital cameras,
microphones connected to A/D converters, and keyboards.</t>
<section title="Alternate Usages">
<t>The CLUE WG uses the term "Capture Device" to identify
a physical capture device.</t>
<t>WebRTC WG uses the term "Recording Device" to refer to
the locally available capture devices in an
end-system.</t>
</section>
<section title="Characteristics">
<t><list style="symbols">
<t>A Media Capture is identified either by a
hardware/manufacturer ID or via a session-scoped
device identifier, as mandated by the application
usage.</t>
<t>A Media Capture can generate an <xref
target="encoded-stream">Encoded Stream</xref> if the
capture device supports such a configuration.</t>
</list></t>
</section>
</section>
<section anchor="raw-stream" title="Raw Stream">
<t>The time progressing stream of digitally sampled
information, usually periodically sampled, provided by a
<xref target="media-capture">Media Capture</xref>.</t>
</section>
<section anchor="media-source" title="Media Source">
<t>A Media Source is the logical source of a reference clock
synchronized, time progressing, digital media stream, called
a <xref target="source-stream">Source Stream</xref>. This
transformation takes one or more <xref
target="raw-stream">Raw Streams</xref> and provides a Source
Stream as output. This output has been synchronized with
some reference clock, even if just a system local wall
clock.</t>
<t>The output can be of different types. One type is
directly associated with a particular Media Capture's Raw
Stream. Others are more conceptual sources, like an <xref
target="fig-media-source-mixer">audio mix of multiple Raw
Streams</xref>, a mixed selection of the three loudest
inputs based on speech activity, or a selection of a
particular video based on the current speaker, i.e.,
typically based on other Media Sources.</t>
<figure align="center" anchor="fig-media-source-mixer"
title="Conceptual Media Source in form of Audio Mixer">
<artwork><![CDATA[ Raw Raw Raw
Stream Stream Stream
| | |
V V V
+--------------------------+
| Media Source |<-- Reference Clock
| Mixer |
+--------------------------+
|
V
Source Stream]]></artwork>
</figure>
<t/>
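<t>As an illustration of such a conceptual Media Source, the core of
an audio mixer can be sketched as summing time-aligned samples from
several Raw Streams (illustrative only; a real mixer also aligns,
resamples, and adjusts levels):</t>
<figure align="center">
<artwork><![CDATA[
```python
# Illustrative audio-mix sketch: sum time-aligned 16-bit PCM samples
# from several Raw Streams into one Source Stream, clamping to the
# valid sample range. The simple summation is for illustration only.

def mix(raw_streams):
    """raw_streams: equal-length lists of 16-bit signed samples."""
    mixed = []
    for samples in zip(*raw_streams):
        total = sum(samples)
        # Clamp to the 16-bit signed range instead of wrapping around.
        mixed.append(max(-32768, min(32767, total)))
    return mixed

source_stream = mix([[1000, -2000, 30000], [500, -500, 10000]])
# The third sample pair would overflow 16 bits, so it is clamped.
```
]]></artwork>
</figure>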
<section title="Alternate Usages">
<t>The CLUE WG uses the term "Media Capture" for this
purpose. A CLUE Media Capture is identified via indexed
notation. The terms Audio Capture and Video Capture are
used to identify Audio Sources and Video Sources
respectively. Concepts such as "Capture Scene", "Capture
Scene Entry" and "Capture" provide a flexible framework to
represent media captured spanning spatial regions.</t>
<t>The WebRTC WG defines the term "RtcMediaStreamTrack" to
refer to a Media Source. An "RtcMediaStreamTrack" is
identified by the ID attribute.</t>
<!--MW: I think the below SDP is a bit misplaced. Do we need a special section to discuss
relation to SDP terminology. Or should this be focused and other interpretations
be added?-->
<t>Typically a Media Source is mapped to a single m=line
via the Session Description Protocol (SDP) <xref
target="RFC4566"/> unless mechanisms such as
Source-Specific attributes are in place <xref
target="RFC5576"/>. In the latter cases, an m=line can
represent either multiple Media Sources, multiple <xref
target="packet-stream">Packet Streams</xref>, or both.</t>
</section>
<section title="Characteristics">
<t><list style="symbols">
<t>At any point, it can represent a physically captured
source or a conceptual source.</t>
<!--MW: Put back a discussion of relation between Media Capture and Media sources?-->
</list></t>
</section>
</section>
<section anchor="source-stream" title="Source Stream">
<t>A time-progressing stream of digital samples that has
been synchronized with a reference clock and comes from a
particular <xref target="media-source">Media
Source</xref>.</t>
</section>
<section anchor="media-encoder" title="Media Encoder">
<t>A Media Encoder is a transform that is responsible for
encoding the media data from a <xref
target="source-stream">Source Stream</xref> into another
representation, usually more compact, that is output as an
<xref target="encoded-stream">Encoded Stream</xref>.</t>
<t>The Media Encoder step commonly includes pre-encoding
transformations, such as scaling, resampling, etc. The Media
Encoder can have a significant number of configuration
options that affect the properties of the encoded stream.
These include properties such as bit-rate, start points for
decoding, resolution, bandwidth, or other fidelity-affecting
properties. In many communication systems, the codec actually
used is also an important factor, not only its
parameters.</t>
<t>Scalable Media Encoders deserve special mention, as they
produce multiple outputs that are potentially of different
types. A scalable Media Encoder takes one input Source
Stream and encodes it into multiple output streams of two
different types: at least one Encoded Stream that is
independently decodable, and one or more <xref
target="dependent-stream">Dependent Streams</xref> that
require at least one Encoded Stream, and zero or more other
Dependent Streams, to be decodable. A Dependent
Stream's dependency is one of the grouping relations this
document discusses further in <xref target="svc"/>.</t>
<figure align="center" anchor="fig-scalable-media-encoder"
title="Scalable Media Encoder Input and Outputs">
<artwork><![CDATA[ Source Stream
|
V
+--------------------------+
| Scalable Media Encoder |
+--------------------------+
| | ... |
V V V
Encoded Dependent Dependent
Stream Stream Stream
]]></artwork>
</figure>
<t/>
<section title="Alternate Usages">
<t>Within the SDP usage, an SDP media description (m=line)
describes part of the necessary configuration required for
encoding purposes.</t>
<t>CLUE's "Capture Encoding" provides specific encoding
configuration for this purpose.</t>
</section>
<section title="Characteristics">
<t><list style="symbols">
<t>A Media Source can be multiply encoded by different
Media Encoders to provide various encoded
representations.</t>
</list></t>
</section>
</section>
<section anchor="encoded-stream" title="Encoded Stream">
<t>A stream of time-synchronized encoded media that can be
independently decoded.</t>
<section title="Characteristics">
<t><list style="symbols">
<t>Due to temporal dependencies, an Encoded Stream may
have limitations in where decoding can be started.
These entry points, for example Intra frames from a
video encoder, may require identification and their
generation may be event based or configured to occur
periodically.</t>
</list></t>
</section>
</section>
<section anchor="dependent-stream" title="Dependent Stream">
<t>A stream of time-synchronized encoded media fragments
that depends on one or more <xref
target="encoded-stream">Encoded Streams</xref>, and zero or
more other Dependent Streams, to be decodable.</t>
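<t>The dependency rule can be illustrated with a small sketch (names
and structure are invented; this is not a normative algorithm): a
Dependent Stream is decodable only if every stream it depends on is
available and itself decodable.</t>
<figure align="center">
<artwork><![CDATA[
```python
# Sketch: decide whether a stream can be decoded, given a dependency
# map. "deps" maps a stream name to the set of stream names it
# depends on; an Encoded Stream has an empty set. Dependencies are
# assumed to form a DAG (no cycles). All names are illustrative.

def decodable(stream, deps, available):
    if stream not in available:
        return False
    return all(decodable(d, deps, available) for d in deps[stream])

deps = {
    "enc": set(),             # independently decodable Encoded Stream
    "dep1": {"enc"},          # depends on the Encoded Stream
    "dep2": {"enc", "dep1"},  # depends on both of the above
}
```
]]></artwork>
</figure>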
<section title="Characteristics">
<t><list style="symbols">
<t>Each Dependent Stream has a set of dependencies.
These dependencies must be understood by the parties
in a multi-media session that intend to use a
Dependent Stream.</t>
</list></t>
</section>
</section>
<section anchor="media_packetizer" title="Media Packetizer">
<t>The transformation of taking one or more <xref
target="encoded-stream">Encoded</xref> or <xref
target="dependent-stream">Dependent Streams</xref> and putting
their content into one or more sequences of packets,
normally RTP packets, output as <xref
target="packet-stream">Source Packet Streams</xref>. This
step includes generating both RTP payloads and RTP
packets.</t>
<t>The Media Packetizer can use multiple inputs when
producing a single Packet Stream. One such example is
packetization when using SVC: in the Single Stream Transport
(SST) usage of the payload format, both an Encoded Stream
and Dependent Streams are packetized into a single Source
Packet Stream using a single SSRC.</t>
<t>The Media Packetizer can also produce multiple Packet
Streams, for example when Encoded and/or Dependent Streams
are distributed over multiple Packet Streams, possibly in
different RTP sessions.</t>
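<t>The RTP-packet-generating step can be sketched as prepending the
fixed RTP header of RFC 3550, Section 5.1 to each payload. This is a
simplified sketch: header extensions, CSRC lists, and padding are
omitted, and the function name and parameter values are
illustrative.</t>
<figure align="center">
<artwork><![CDATA[
```python
import struct

# Sketch: prepend the 12-byte fixed RTP header (RFC 3550, Sec. 5.1)
# to a payload. Extension, CSRC list, and padding are omitted.

def packetize(payload, pt, seq, timestamp, ssrc, marker=False):
    first = 2 << 6                             # V=2, P=0, X=0, CC=0
    second = (int(marker) << 7) | (pt & 0x7F)  # M bit, payload type
    header = struct.pack("!BBHII", first, second,
                         seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF,
                         ssrc & 0xFFFFFFFF)
    return header + payload

pkt = packetize(b"\x01\x02", pt=96, seq=1, timestamp=48000,
                ssrc=0x1234)
```
]]></artwork>
</figure>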
<section title="Alternate Usages">
<t>An RTP sender is part of the Media Packetizer.</t>
</section>
<section title="Characteristics">
<t><list style="symbols">
<t>The Media Packetizer selects which
Synchronization Sources (SSRCs) <xref
target="RFC3550"/>, and in which RTP sessions, are
used.</t>
<t>The Media Packetizer can combine multiple Encoded or
Dependent Streams into one or more Packet Streams.</t>
</list></t>
</section>
</section>
<section anchor="packet-stream" title="Packet Stream">
<t>A stream of RTP packets containing media data, source or
redundant. The Packet Stream is identified by an SSRC
belonging to a particular RTP session. The RTP session is
identified as discussed in <xref target="rtp-session"/>.</t>
<t>A Source Packet Stream is a packet stream containing at
least some content from an Encoded Stream. Source material
is any media material that is produced for transport over
RTP without any additional redundancy applied to cope with
network transport losses. Compare this with the <xref
target="redundancy-packet-stream">Redundancy Packet
Stream</xref>.</t>
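<t>Since the SSRC is carried in every RTP packet header, a receiver
can sort incoming packets into Packet Streams by reading it. A
minimal, purely illustrative sketch (invented names, minimal error
handling):</t>
<figure align="center">
<artwork><![CDATA[
```python
# Sketch: group the packets of one RTP session into per-SSRC Packet
# Streams. The SSRC occupies bytes 8-11 of the fixed RTP header
# (RFC 3550, Section 5.1).

def ssrc_of(packet):
    if len(packet) < 12:
        raise ValueError("too short for an RTP fixed header")
    return int.from_bytes(packet[8:12], "big")

def demultiplex(packets):
    streams = {}
    for packet in packets:
        streams.setdefault(ssrc_of(packet), []).append(packet)
    return streams

# A minimal example packet: fixed header followed by a payload.
header = bytes([0x80, 96, 0, 1]) + (48000).to_bytes(4, "big")
pkt = header + (0xCAFE).to_bytes(4, "big") + b"payload"
```
]]></artwork>
</figure>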
<section title="Alternate Usages">
<t>The term "Stream" is used by the CLUE WG to define an
encoded Media Source sent via RTP. "Capture Encoding",
"Encoding Groups" are defined to capture specific details
of the encoding scheme.</t>
<t>RFC 3550 <xref target="RFC3550"/> uses the terms media
stream, audio stream, video stream, and streams of (RTP)
packets interchangeably. It defines the SSRC as "the
source of a stream of RTP packets, ..."</t>
<t>The equivalent mapping of a Packet Stream in SDP <xref
target="RFC4566"/> is defined per usage. For example, each
Media Description (m=line) and its associated attributes can
describe one Packet Stream, properties for multiple
Packet Streams, or properties for an RTP session (via the
<xref target="RFC5576"/> mechanisms, for example).</t>
</section>
<section title="Characteristics">
<t><list style="symbols">
<t>Each Packet Stream is identified by a unique
Synchronization source (SSRC) <xref target="RFC3550"/>
that is carried in every RTP and RTP Control Protocol
(RTCP) packet header in a specific RTP session
context.</t>
<t>At any given point in time, a Packet Stream can
have one and only one SSRC.</t>
<t>Each Packet Stream defines a unique RTP sequence
numbering and timing space.</t>
<t>Several Packet Streams may map to a single Media
Source via the source transformations.</t>
<t>Several Packet Streams can be carried over a single
RTP Session.</t>
</list></t>
</section>
</section>
<section anchor="media-redundancy" title="Media Redundancy">
<t>Media Redundancy is a transformation that generates
redundant or repair packets, sent out as a Redundancy Packet
Stream, to mitigate network transport impairments, like
packet loss and delay.</t>
<t>Media Redundancy comes in many flavors: it may
generate independent Repair Streams that are used in
addition to the Source Stream (<xref target="RFC4588">RTP
Retransmission</xref> and some <xref
target="RFC5109">FEC</xref>), it may generate a new Source
Stream by combining redundancy information with source
information (using <xref target="RFC5109">XOR FEC</xref> as
a <xref target="RFC2198">redundancy payload</xref>), or it
may completely replace the source information with only
redundancy packets.</t>
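<t>The XOR idea behind parity-based FEC can be sketched as follows.
This is a simplification: real FEC such as RFC 5109 also protects RTP
header fields and handles unequal payload lengths, and the names here
are illustrative.</t>
<figure align="center">
<artwork><![CDATA[
```python
# Simplified sketch of XOR-based redundancy: the redundancy payload
# is the bitwise XOR of the source payloads it protects. Equal-length
# payloads are assumed for brevity.

def xor_parity(payloads):
    parity = bytearray(len(payloads[0]))
    for payload in payloads:
        for i, byte in enumerate(payload):
            parity[i] ^= byte
    return bytes(parity)

a = b"\x01\x02\x03"
b = b"\x10\x20\x30"
parity = xor_parity([a, b])
```
]]></artwork>
</figure>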
</section>
<section anchor="redundancy-packet-stream"
title="Redundancy Packet Stream">
<t>A <xref target="packet-stream">Packet Stream</xref> that
contains no original source data, only redundant data that
may be combined with one or more <xref
target="received-packet-stream">Received Packet
Streams</xref> to produce <xref
target="repaired-packet-stream">Repaired Packet
Streams</xref>.</t>
</section>
<section anchor="media-transport" title="Media Transport">
<t>A Media Transport defines the transformation that the
<xref target="packet-stream">Packet Streams</xref> are
subjected to by the end-to-end transport from one RTP sender
to one specific RTP receiver (an RTP session may contain
multiple RTP receivers per sender). Each Media Transport is
defined by a transport association that is identified by a
5-tuple (source address, source port, destination address,
destination port, transport protocol). Each transport
association normally contains only a single RTP session,
although a proposal exists for sending <xref
target="I-D.westerlund-avtcore-transport-multiplexing">multiple
RTP sessions over one transport association</xref>.</t>
<section title="Characteristics">
<t><list style="symbols">
<t>Media Transport transmits Packet Streams of RTP
Packets from a source transport address to a
destination transport address.</t>
</list></t>
</section>
<section anchor="media-stream-decomposition"
title="Media Stream Decomposition">
<t>The Media Transport concept sometimes needs to be
decomposed into more steps to enable discussion of how the
network transforms what a sender emits before
it is received by the receiver. Thus we also provide this
<xref target="fig-media-transport">Media Transport
decomposition</xref>.</t>
<figure align="center" anchor="fig-media-transport"
title="Decomposition of Media Transport">
<artwork><![CDATA[ Packet Stream
|
V
+--------------------------+
| Media Transport Sender |
+--------------------------+
|
Sent Packet Stream
V
+--------------------------+
| Network Transport |
+--------------------------+
|
Transported Packet Stream
V
+--------------------------+
| Media Transport Receiver |
+--------------------------+
|
V
Received Packet Stream
]]></artwork>
</figure>
<t/>
<section anchor="media-transport-sender"
title="Media Transport Sender">
<t>The first transformation within the <xref
target="media-transport">Media Transport</xref> is the
Media Transport Sender, where the sending <xref
target="end-point">End-Point</xref> takes a Packet
Stream and emits the packets onto the network, using the
transport association established for this Media
Transport, thus creating a <xref
target="sent-packet-stream">Sent Packet Stream</xref>.
In this process it transforms the Packet Stream in
several ways. First, the packets gain the necessary protocol
headers for the transport association, for example IP
and UDP headers, thus forming IP/UDP/RTP packets. In
addition, the Media Transport Sender may queue, pace, or
otherwise affect how the packets are emitted onto the
network, thus adding the delay, jitter, and inter-packet
spacing that characterize the Sent Packet Stream.</t>
</section>
<section anchor="sent-packet-stream"
title="Sent Packet Stream">
<t>The Sent Packet Stream is the Packet Stream as it
enters the first hop of the network path to its
destination. The Sent Packet Stream is identified using
network transport addresses, e.g., for IP/UDP, the 5-tuple
(source IP address, source port, destination IP address,
destination port, and protocol (UDP)).</t>
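<t>The identifying 5-tuple can be sketched as a simple record. The
addresses and ports below are illustrative, drawn from the
documentation address ranges:</t>
<figure align="center">
<artwork><![CDATA[
```python
from collections import namedtuple

# Sketch of the 5-tuple identifying a transport association, and
# thus a Sent Packet Stream. Values are illustrative.

FiveTuple = namedtuple(
    "FiveTuple",
    ["src_addr", "src_port", "dst_addr", "dst_port", "protocol"])

t1 = FiveTuple("198.51.100.1", 50000, "203.0.113.9", 50002, "UDP")
t2 = FiveTuple("198.51.100.1", 50000, "203.0.113.9", 50002, "UDP")
t3 = t1._replace(src_port=50004)
# t1 and t2 identify the same association; t3 identifies another.
```
]]></artwork>
</figure>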
</section>
<section anchor="network-transport"
title="Network Transport">
<t>Network Transport is the transformation that the
<xref target="sent-packet-stream">Sent Packet
Stream</xref> is subjected to by traveling from the
source to the destination through the network. These
transformations include loss of some packets, varying
delay on a per-packet basis, packet duplication, and
packet header or data corruption. These transformations
produce a <xref
target="transported-packet-stream">Transported Packet
Stream</xref> at the exit of the network path.</t>
</section>
<section anchor="transported-packet-stream"
title="Transported Packet Stream">
<t>The Packet Stream that is emitted out of the network
path at the destination, subjected to the <xref
target="network-transport">Network Transport's
transformation</xref>.</t>
</section>
<section title="Media Transport Receiver">
<t>The receiving <xref
target="end-point">End-Point's</xref> transformation of
the <xref target="transported-packet-stream">Transported
Packet Stream</xref> by its reception process, which
results in the <xref
target="received-packet-stream">Received Packet
Stream</xref>. This transformation includes verifying the
transport checksums and discarding corrupted packets
whose checksums do not match. Other
transformations can include delay variations in
receiving a packet on the network interface and
providing it to the application.</t>
</section>
</section>
</section>
<section anchor="received-packet-stream"
title="Received Packet Stream">
<t>The <xref target="packet-stream">Packet Stream</xref>
resulting from the Media Transport's transformation, i.e.
subjected to packet loss, packet corruption, packet
duplication and varying transmission delay from sender to
receiver.</t>
</section>
<section anchor="received-redundancy-ps"
title="Received Redundancy Packet Stream">
<t>The <xref target="redundancy-packet-stream">Redundancy
Packet Stream</xref> resulting from the Media Transport's
transformation, i.e. subjected to packet loss, packet
corruption, and varying transmission delay from sender to
receiver.</t>
</section>
<section title="Media Repair">
<t>A transformation that takes as input one or more <xref
target="packet-stream">Source Packet Streams</xref> as well
as <xref target="redundancy-packet-stream">Redundancy Packet
Streams</xref>, and attempts to combine them to counter the
transformations introduced by the <xref
target="media-transport">Media Transport</xref>, minimizing
the difference between the <xref
target="source-stream">Source Stream</xref> and the <xref
target="received-source-stream">Received Source
Stream</xref> after the <xref target="media-decoder">Media
Decoder</xref>. The output is a <xref
target="repaired-packet-stream">Repaired Packet
Stream</xref>.</t>
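<t>With a simple XOR-parity scheme, the repair step can recover a
single lost payload by XOR-ing the redundancy payload with the
payloads that did arrive. This is a simplification: real repair also
reconstructs RTP header fields, equal-length payloads are assumed,
and all names are illustrative.</t>
<figure align="center">
<artwork><![CDATA[
```python
# Sketch: recover one lost payload from an XOR parity payload and
# the payloads that were received.

def recover(parity, received):
    missing = bytearray(parity)
    for payload in received:
        for i, byte in enumerate(payload):
            missing[i] ^= byte
    return bytes(missing)

p1 = b"\x0a\x0b"
p2 = b"\x0c\x0d"
p3 = b"\xf0\xf1"
parity = bytes(x ^ y ^ z for x, y, z in zip(p1, p2, p3))
lost = recover(parity, [p1, p3])  # suppose p2 was lost in transport
```
]]></artwork>
</figure>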
</section>
<section anchor="repaired-packet-stream"
title="Repaired Packet Stream">
<t>A <xref target="received-packet-stream">Received Packet
Stream</xref> for which <xref
target="received-redundancy-ps">Received Redundancy Packet
Stream</xref> information has been used to try to re-create
the <xref target="packet-stream">Packet Stream</xref> as it
was before <xref target="media-transport">Media
Transport</xref>.</t>
</section>
<section title="Media Depacketizer">
<t>A Media Depacketizer takes one or more <xref
target="packet-stream">Packet Streams</xref>, depacketizes
them, and attempts to reconstitute the <xref
target="encoded-stream">Encoded Streams</xref> or <xref
target="dependent-stream">Dependent Streams</xref> present
in those Packet Streams.</t>
</section>
<section anchor="received-encoded-stream"
title="Received Encoded Stream">
<t>The received version of an <xref
target="encoded-stream">Encoded Stream</xref>.</t>
</section>
<section anchor="media-decoder" title="Media Decoder">
<t>A Media Decoder is a transformation that is responsible
for decoding <xref target="encoded-stream">Encoded
Streams</xref> and any <xref
target="dependent-stream">Dependent Streams</xref> into a
<xref target="source-stream">Source Stream</xref>.</t>
<section title="Alternate Usages">
<t>Within the context of SDP, an m=line describes the
necessary configuration and identification (RTP Payload
Types) required to decode one or more incoming
Media Streams.</t>
</section>
<section title="Characteristics">
<t><list style="symbols">
<t>A Media Decoder is the entity that has to
deal with any errors in the encoded streams that
result from corruption or from failures to repair packet
losses. This is because a media decoder is generally
forced to produce some output periodically; it thus
commonly includes concealment methods.</t>
</list></t>
</section>
</section>
<section anchor="received-source-stream"
title="Received Source Stream">
<t>The received version of a <xref
target="source-stream">Source Stream</xref>.</t>
</section>
<section anchor="media-sink" title="Media Sink">
<t>The Media Sink receives a <xref
target="source-stream">Source Stream</xref> that contains,
usually periodically, sampled media data together with
associated synchronization information. Depending on the
application, this Source Stream then needs to be transformed
into a <xref target="raw-stream">Raw Stream</xref> that is
sent in synchronization with the output from other Media
Sinks to a <xref target="media-render">Media Render</xref>.
The Media Sink may also be connected with a <xref
target="media-source">Media Source</xref> and be used as
part of a conceptual Media Source.</t>
<section title="Characteristics">
<t><list style="symbols">
<t>The Media Sink can further transform the Source
Stream into a representation that is suitable for
rendering on the Media Render, as defined by the
application or system-wide configuration. This includes
sample scaling, level adjustments, etc.</t>
</list></t>
</section>
</section>
<section title="Received Raw Stream">
<t>The received version of a <xref target="raw-stream">Raw
Stream</xref>.</t>
</section>
<section anchor="media-render" title="Media Render">
<t>A Media Render takes a <xref target="raw-stream">Raw
Stream</xref> and converts it into <xref
target="physical-stimulus">Physical Stimulus</xref> that a
human user can perceive. Examples of such devices are
screens, D/A converters connected to amplifiers and
loudspeakers.</t>
<section title="Characteristics">
<t><list style="symbols">
<t>An End Point can potentially have multiple Media
Renders for each media type.</t>
</list></t>
</section>
</section>
</section>
<section anchor="communication-entities"
title="Communication Entities">
<t>This section contains the concepts for the entities
involved in the communication.</t>
<section anchor="end-point" title="End Point">
<t>A single addressable entity sending or receiving RTP
packets. It may be decomposed into several functional
blocks, but as long as it behaves as a single RTP stack
entity it is classified as a single "End Point".</t>
<section title="Alternate Usages">
<t>The CLUE Working Group (WG) uses the terms "Media
Provider" and "Media Consumer" to describe aspects of an End
Point pertaining to its sending and receiving
functionality.</t>
</section>
<section title="Characteristics">
<t>End Points can be identified in several different ways.
While RTCP Canonical Names (CNAMEs) <xref
target="RFC3550"/> provide a globally unique and stable
identification mechanism for the duration of the
Communication Session (see <xref target="comm-session"/>),
their validity applies exclusively within a <xref
target="syncontext">Synchronization Context</xref>. Thus
one End Point can have multiple CNAMEs. Therefore,
mechanisms outside the scope of RTP, such as application
defined mechanisms, must be used to ensure End Point
identification when outside this Synchronization
Context.</t>
</section>
</section>
<section anchor="rtp-session" title="RTP Session">
<t>An RTP session is an association among a group of
participants communicating with RTP. It is a group
communications channel which can potentially carry a number
of Packet Streams. Within an RTP session, every participant
can find meta-data and control information (over RTCP) about
all the Packet Streams in the RTP session. The bandwidth of
the RTCP control channel is shared between all participants
within an RTP Session.</t>
<section title="Alternate Usages">
<t>Within the context of SDP, a single m=line can map to a
single RTP Session, or multiple m=lines can map to a single
RTP Session. The latter is enabled via multiplexing
schemes such as BUNDLE <xref
target="I-D.ietf-mmusic-sdp-bundle-negotiation"/>.</t>
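<t>As an illustrative sketch (hypothetical addresses, ports,
and payload types), an SDP description where an audio and a
video m=line are negotiated into a single RTP Session over one
Media Transport could use the BUNDLE grouping semantics as
follows:</t>
<figure title="Illustrative SDP Sketch of BUNDLE">
<artwork><![CDATA[v=0
o=alice 2890844526 2890844527 IN IP4 198.51.100.1
s=-
c=IN IP4 198.51.100.1
t=0 0
a=group:BUNDLE foo bar
m=audio 10000 RTP/AVP 0
a=mid:foo
m=video 10000 RTP/AVP 96
a=rtpmap:96 VP8/90000
a=mid:bar
]]></artwork>
</figure>
<t>Note that both m=lines use the same port, reflecting the
shared Media Transport.</t>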
</section>
<section title="Characteristics">
<t><list style="symbols">
<t>Typically, an RTP Session can carry one or more
Packet Streams.</t>
<t>An RTP Session shares a single SSRC space as
defined in RFC3550 <xref target="RFC3550"/>. That is,
the End Points participating in an RTP Session can see
an SSRC identifier transmitted by any of the other End
Points. An End Point can receive an SSRC either as
SSRC or as a Contributing source (CSRC) in RTP and
RTCP packets, as defined by the endpoints' network
interconnection topology.</t>
<t>An RTP Session uses at least two <xref
target="media-transport">Media Transports</xref>, one
for sending and one for receiving. Commonly, the
receiving Media Transport is the reverse direction of
the one used for sending. An RTP Session may use many
Media Transports, and these define the session's network
interconnection topology. A single Media Transport
normally cannot transport more than one RTP Session,
unless a solution for multiplexing multiple RTP
Sessions over a single Media Transport is used. One
example of such a scheme is <xref
target="I-D.westerlund-avtcore-transport-multiplexing">Multiple
RTP Sessions on a Single Lower-Layer
Transport</xref>.</t>
<t>Multiple RTP Sessions can be related.</t>
</list></t>
</section>
</section>
<section anchor="participant" title="Participant">
<t>A participant is an entity reachable by a single
signaling address, and is thus related more to the signaling
context than to the media context.</t>
<section title="Characteristics">
<t><list style="symbols">
<t>A single signaling-addressable entity, using an
application-specific signaling address space, for
example a SIP URI.</t>
<t>A participant can have several <xref
target="multimedia-session">Multimedia
Sessions</xref>.</t>
<t>A participant can have several associated transport
flows, including several separate local transport
addresses for those transport flows.</t>
<!--MW: I can't understand what the purpose is of the last bullet regarding many
transport flows. It needs to be aligned with the rest of the concept language.
But I am unable to change it because I don't understand what one attempts
to say.
BoB: Speculatively, it is just trying to prohibit definig a Participant as
being one end of a single Media Transport. This bullet is then not needed,
as a single Multimedia Session can already have multiple Media Transports.
-->
</list></t>
</section>
</section>
<section anchor="multimedia-session"
title="Multimedia Session">
<t>A multimedia session is an association among a group of
participants engaged in the communication via one or more
<xref target="rtp-session">RTP Sessions</xref>. It defines
logical relationships among <xref
target="media-source">Media Sources</xref> that appear in
multiple RTP Sessions.</t>
<section title="Alternate Usages">
<t>RFC4566 <xref target="RFC4566"/> defines a multimedia
session as a set of multimedia senders and receivers and
the data streams flowing from senders to receivers.</t>
<t>RFC3550 <xref target="RFC3550"/> defines it as a set of
concurrent RTP sessions among a common group of
participants. For example, a video conference (which is a
multimedia session) may contain an audio RTP session and a
video RTP session.</t>
</section>
<section title="Characteristics">
<t><list style="symbols">
<t>A Multimedia Session can be composed of several
parallel RTP Sessions with potentially multiple Packet
Streams per RTP Session.</t>
<t>Each participant in a Multimedia Session can have a
multitude of Media Captures and Media Rendering
devices.</t>
</list></t>
</section>
</section>
<section anchor="comm-session" title="Communication Session">
<t>A Communication Session is an association among a group
of participants communicating with each other via a set of
Multimedia Sessions.</t>
<section title="Alternate Usages">
<t>The <xref target="RFC4566">Session Description Protocol
(SDP)</xref> defines a multimedia session as a set of
multimedia senders and receivers and the data streams
flowing from senders to receivers. In that definition it
is, however, not clear whether a multimedia session includes
both the sender's and the receiver's views of the same RTP
Packet Stream.</t>
</section>
<section title="Characteristics">
<t><list style="symbols">
<t>Each participant in a Communication Session is
identified via an application-specific signaling
address.</t>
<t>A Communication Session is composed of at least one
Multimedia Session per participant, involving one or
more parallel RTP Sessions with potentially multiple
Packet Streams per RTP Session.</t>
</list> For example, in a full mesh communication, the
Communication Session consists of a set of separate
Multimedia Sessions between each pair of Participants.
Another example is a centralized conference, where the
Communication Session consists of a set of Multimedia
Sessions between each Participant and the conference
handler.</t>
</section>
</section>
</section>
</section>
<section title="Relations at Different Levels">
<t>This section uses the concepts from previous section and look
at different types of relationships among them. These
relationships occur at different levels and for different
purposes. The section is organized such as to look at the level
where a relation is required. The reason for the relationship
may exist at another step in the media handling chain. For
example, using Simulcast (discussed in <xref
target="simulcast"/>) needs to determine relations at Packet
Stream level, however the reason to relate Packet Streams is
that multiple Media Encoders use the same Media Source, i.e. to
be able to identify a common Media Source.</t>
<section title="Media Source Relations">
<t><xref target="media-source">Media Sources</xref> are
commonly grouped and related to an <xref
target="end-point">End Point</xref> or a <xref
target="participant">Participant</xref>. This occurs for
several reasons, both for application logic and for media
handling purposes. These cases are further discussed
below.</t>
<section anchor="syncontext" title="Synchronization Context">
<t>A Synchronization Context defines a requirement on a
strong timing relationship between the Media Sources,
typically requiring alignment of clock sources. Such a
relationship can be identified in multiple ways, as listed
below. A single Media Source can only belong to a single
Synchronization Context, since it is assumed that a single
Media Source can only have a single media clock and
requiring alignment to several Synchronization Contexts (and
thus reference clocks) will effectively merge those into a
single Synchronization Context.</t>
<!--MW: The following paragraph may be quite misplaced. Should be reconsidered when improving
text for the relations between RTP Sessions, Multimedia Sessions and Communication
Sessions.-->
<t>A single Multimedia Session can contain media from one or
more Synchronization Contexts. An example of that is a
Multimedia Session containing one set of audio and video for
communication purposes belonging to one Synchronization
Context, and another set of audio and video for presentation
purposes (like playing a video file) with a separate
Synchronization Context that has no strong timing
relationship and need not be strictly synchronized with the
audio and video used for communication.</t>
<section title="RTCP CNAME">
<t>RFC3550 <xref target="RFC3550"/> describes inter-media
synchronization between RTP Sessions based on RTCP CNAME,
and on RTP and Network Time Protocol (NTP) <xref
target="RFC5905"/> formatted timestamps of a reference
clock. As indicated in <xref
target="I-D.ietf-avtcore-clksrc"/>, despite using NTP
format timestamps, it is not required that the clock be
synchronized to an NTP source.</t>
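<t>As a non-normative illustration, an RTCP Sender Report (SR)
associates an RTP timestamp with an NTP-format wallclock time,
which allows a receiver to place any RTP timestamp from the
same Packet Stream on the reference clock timeline:</t>
<figure title="Illustrative RTP-to-Wallclock Mapping">
<artwork><![CDATA[wallclock(R) = T_SR + (R - R_SR) / clock_rate

Example (8 kHz audio clock):
  SR reports T_SR = 10.000 s and R_SR = 160000.
  A packet with RTP timestamp R = 168000 then maps to
  10.000 + (168000 - 160000) / 8000 = 11.000 s.
]]></artwork>
</figure>
<t>Streams sharing the same CNAME and reference clock can be
aligned by comparing their respective wallclock mappings.</t>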
</section>
<section title="Clock Source Signaling">
<t><xref target="I-D.ietf-avtcore-clksrc"/> provides a
mechanism to signal the clock source in SDP both for the
reference clock as well as the media clock, thus allowing
a Synchronization Context to be defined beyond the one
defined by the usage of CNAME source descriptions.</t>
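<t>As an illustrative sketch (with hypothetical clock
identifiers), the "ts-refclk" and "mediaclk" SDP attributes
from <xref target="I-D.ietf-avtcore-clksrc"/> could declare a
common reference clock for two media descriptions, and thereby
a shared Synchronization Context:</t>
<figure title="Illustrative SDP Sketch of Clock Source Signaling">
<artwork><![CDATA[m=audio 5004 RTP/AVP 96
a=rtpmap:96 L24/48000/2
a=ts-refclk:ptp=IEEE1588-2008:39-A7-94-FF-FE-07-CB-D0
a=mediaclk:direct=963214424
m=video 5006 RTP/AVP 97
a=rtpmap:97 raw/90000
a=ts-refclk:ptp=IEEE1588-2008:39-A7-94-FF-FE-07-CB-D0
a=mediaclk:direct=0
]]></artwork>
</figure>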
</section>
<section title="CLUE Scenes">
<t>In CLUE, the "Capture Scene", "Capture Scene Entry" and
"Captures" concepts define an implied Synchronization
Context.</t>
</section>
<section title="Implicitly via RtcMediaStream">
<t>The WebRTC WG defines "RtcMediaStream" with one or more
"RtcMediaStreamTracks". All tracks in an "RtcMediaStream"
are intended to be possible to synchronize when
rendered.</t>
</section>
<section title="Explicitly via SDP Mechanisms">
<t>RFC5888 <xref target="RFC5888"/> defines an m=line
grouping mechanism called "Lip Synchronization (LS)" for
establishing the synchronization requirement across
m=lines when they map to individual sources.</t>
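<t>As an illustrative sketch (hypothetical ports and
identification tags), an audio and a video m=line can be marked
for lip synchronization using the LS semantics together with
"mid" identification tags:</t>
<figure title="Illustrative SDP Sketch of LS Grouping">
<artwork><![CDATA[a=group:LS 1 2
m=audio 30000 RTP/AVP 0
a=mid:1
m=video 30002 RTP/AVP 31
a=mid:2
]]></artwork>
</figure>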
<t>RFC5576 <xref target="RFC5576"/> extends the above
mechanism when multiple media sources are described by a
single m=line.</t>
</section>
</section>
<section title="End Point">
<t>Some applications require knowledge of which Media
Sources originate from a particular <xref
target="end-point">End Point</xref>. This can support such
decisions as packet routing between parts of the topology,
based on knowing the End Point origin of the Packet
Streams.</t>
<t>In RTP, this identification has been overloaded with the
Synchronization Context through the usage of the source
description CNAME item. This works for some usages, but
sometimes it breaks down, for example when an End Point has
two sets of Media Sources with different Synchronization
Contexts, such as the audio and video of the human
participant together with a set of Media Sources for the
audio and video of a shared movie. Thus, an End Point may
have multiple CNAMEs. The CNAMEs or the Media Sources
themselves can be related to the End Point.</t>
</section>
<section title="Participant">
<t>In communication scenarios, it is commonly necessary to
know which Media Sources originate from which <xref
target="participant">Participant</xref>, for example to
enable the application to display Participant identity
information correctly associated with the Media Sources.
This association is currently handled through the signaling
solution, which points at a specific Multimedia Session
where the Media Sources may be explicitly or implicitly tied
to a particular End Point.</t>
<t>Participant information becomes more problematic for
Media Sources that are generated through mixing or other
conceptual processing of Raw Streams or Source Streams that
originate from different Participants. Such Media Sources
can thus have a dynamically varying set of origins and
Participants. RTP contains the concept of Contributing
Sources (CSRCs), which carry such information about the
previous-step origin of the included media content at the
RTP level.</t>
</section>
<section title="WebRTC MediaStream">
<t>An RtcMediaStream, in addition to requiring a single
Synchronization Context as discussed above, is also an
explicit grouping of a set of Media Sources, as identified
by RtcMediaStreamTracks, within the RtcMediaStream.</t>
</section>
</section>
<section title="Packetization Time Relations">
<t>At RTP Packetization time, there exists a possibility for a
number of different types of relationships between <xref
target="encoded-stream">Encoded Streams</xref>, <xref
target="dependent-stream">Dependent Streams</xref> and <xref
target="packet-stream">Packet Streams</xref>. These are caused
by grouping together or distributing these different types of
streams into Packet Streams. This section will look at such
relationships.</t>
<section title="Single Stream Transport of SVC">
<t><xref target="RFC6190">Scalable Video Coding</xref> has a
mode of operation where Encoded Streams and Dependent
Streams from the SVC Media Encoder are grouped together in a
single Source Packet Stream using the SVC RTP payload
format.</t>
</section>
<section title="Multi-Channel Audio">
<t>There exist a number of RTP payload formats that can
carry multi-channel audio, despite the codec being a mono
encoder. Multi-channel audio can be viewed as multiple Media
Sources sharing a common Synchronization Context. These are
then independently encoded by a Media Encoder, and the
different Encoded Streams are then packetized together, in a
time-synchronized way, into a single Source Packet Stream
using the codec's RTP payload format. Examples of such
codecs are <xref target="RFC3551">PCMA and PCMU</xref>,
<xref target="RFC4867">AMR</xref>, and <xref
target="RFC5404">G.719</xref>.</t>
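<t>As an illustrative sketch (hypothetical port and payload
type), such multi-channel packetization is signalled through
the channel count in the SDP rtpmap encoding parameters; for
example, two-channel PCMA at 8 kHz could be offered with a
dynamic payload type as:</t>
<figure title="Illustrative SDP Sketch of Multi-Channel Audio">
<artwork><![CDATA[m=audio 49230 RTP/AVP 118
a=rtpmap:118 PCMA/8000/2
]]></artwork>
</figure>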
</section>
<section title="Redundancy Format">
<t>The <xref target="RFC2198">RTP Payload for Redundant
Audio Data</xref> defines how one can transport redundant
audio data together with primary data in the same RTP
payload. The redundant data can be a time-delayed version of
the primary, or another time-delayed Encoded Stream using a
different Media Encoder to encode the same Media Source as
the primary, as depicted below in <xref
target="fig-red-rfc2198"/>.</t>
<figure align="center" anchor="fig-red-rfc2198"
title="Concept for usage of Audio Redundancy with different Media Encoders">
<artwork><![CDATA[+--------------------+
| Media Source |
+--------------------+
|
Source Stream
|
+------------------------+
| |
V V
+--------------------+ +--------------------+
| Media Encoder | | Media Encoder |
+--------------------+ +--------------------+
| |
| +------------+
Encoded Stream | Time Delay |
| +------------+
| |
| +------------------+
V V
+--------------------+
| Media Packetizer |
+--------------------+
|
V
Packet Stream ]]></artwork>
</figure>
<t>The Redundancy format thus provides the necessary
meta-information to correctly relate different parts of the
same Encoded Stream, or, in the case <xref
target="fig-red-rfc2198">depicted above</xref>, to relate
the Received Source Stream fragments coming out of different
Media Decoders so that they can be combined into a less
erroneous Source Stream.</t>
</section>
</section>
<section title="Packet Stream Relations">
<t>This section discusses various cases of relationships among
Packet Streams. This is a common relation to handle in RTP,
because Packet Streams are separate and have their own SSRCs,
implying independent sequence number and timestamp spaces.
The underlying reasons for the Packet Stream relationships are
different, as can be seen in the cases below. The different
Packet Streams can be handled within the same RTP Session or
different RTP Sessions to accomplish different transport
goals. This separation of Packet Streams is further discussed
in <xref target="packet-stream-separation"/>.</t>
<section anchor="simulcast" title="Simulcast">
<t>A Media Source represented as multiple independent
Encoded Streams constitutes a simulcast of that Media
Source. <xref target="fig-simulcast"/> below represents an
example of a Media Source that is encoded into three
separate and different Simulcast streams, that are in turn
sent on the same Media Transport flow. When using Simulcast,
the Packet Streams may be sharing RTP Session and Media
Transport, or be separated on different RTP Sessions and
Media Transports, or be any combination of these two. It is
other considerations that affect which usage is desirable,
as discussed in <xref
target="packet-stream-separation"/>.</t>
<figure anchor="fig-simulcast"
title="Example of Media Source Simulcast">
<artwork align="center"><![CDATA[ +----------------+
| Media Source |
+----------------+
Source Stream |
+----------------------+----------------------+
| | |
v v v
+------------------+ +------------------+ +------------------+
| Media Encoder | | Media Encoder | | Media Encoder |
+------------------+ +------------------+ +------------------+
| Encoded | Encoded | Encoded
| Stream | Stream | Stream
v v v
+------------------+ +------------------+ +------------------+
| Media Packetizer | | Media Packetizer | | Media Packetizer |
+------------------+ +------------------+ +------------------+
| Source | Source | Source
| Packet | Packet | Packet
| Stream | Stream | Stream
+-----------------+ | +-----------------+
| | |
V V V
+-------------------+
| Media Transport |
+-------------------+
]]></artwork>
</figure>
<t>The simulcast relation between the Packet Streams is the
common Media Source. In addition, to be able to identify the
common Media Source, a receiver of the Packet Stream may
need to know which configuration or encoding goals lie
behind the produced Encoded Stream and its properties, so as
to enable selection of the stream that is most useful in the
application at that moment.</t>
</section>
<section anchor="svc"
title="Layered Multi-Stream Transmission">
<t>Multi-stream transmission (MST) is a mechanism by which
different portions of a layered encoding of a Source Stream
are sent using separate Packet Streams (sometimes in
separate RTP sessions). MSTs are useful for receiver control
of layered media.</t>
<t>A Media Source represented as an Encoded Stream and
multiple Dependent Streams constitutes a Media Source that
has layered dependency. The figure below represents an
example of a Media Source that is encoded into three
dependent layers, where two layers are sent on the same
Media Transport using different Packet Streams, i.e. SSRCs,
and the third layer is sent on a separate Media Transport,
i.e. a different RTP Session.</t>
<figure align="center" anchor="fig-ddp"
title="Example of Media Source Layered Dependency">
<artwork align="center"><![CDATA[ +----------------+
| Media Source |
+----------------+
|
|
V
+---------------------------------------------------------+
| Media Encoder |
+---------------------------------------------------------+
| | |
Encoded Stream Dependent Stream Dependent Stream
| | |
V V V
+----------------+ +----------------+ +----------------+
|Media Packetizer| |Media Packetizer| |Media Packetizer|
+----------------+ +----------------+ +----------------+
| | |
Packet Stream Packet Stream Packet Stream
| | |
+------+ +------+ |
| | |
V V V
+-----------------+ +-----------------+
| Media Transport | | Media Transport |
+-----------------+ +-----------------+
]]></artwork>
</figure>
<t>The SVC MST relation needs to identify the common Media
Encoder origin for the Encoded and Dependent Streams. The
SVC RTP Payload RFC is not particularly explicit about how
this relation is to be implemented. When using different RTP
Sessions, thus different Media Transports, and as long as
there is only one Packet Stream per Media Encoder and a
single Media Source in each RTP Session, common SSRC and
CNAMEs can be used to identify the common Media Source. When
multiple Packet Streams are sent from one Media Encoder in
the same RTP Session, then CNAME is the only currently
specified RTP identifier that can be used. In cases where
multiple Media Encoders use multiple Media Sources sharing
Synchronization Context, and thus having a common CNAME,
additional heuristics need to be applied to create the MST
relationship between the Packet Streams.</t>
</section>
<section anchor="repair" title="Robustness and Repair">
<t>Packet Streams may be protected by Redundancy Packet
Streams during transport. Several approaches, listed below,
can achieve the same result: <list style="symbols">
<t>Duplication of the original Packet Stream,</t>
<t>Duplication of the original Packet Stream with a time
offset,</t>
<t>Forward Error Correction (FEC) techniques, and</t>
<t>Retransmission of lost packets (either globally or
selectively).</t>
</list></t>
<section title="RTP Retransmission">
<t>The <xref target="fig-rtx">figure below</xref>
represents an example where a Media Source's Source Packet
Stream is protected by a <xref
target="RFC4588">retransmission (RTX) flow</xref>. In this
example the Source Packet Stream and the Redundancy Packet
Stream share the same Media Transport.</t>
<figure align="center" anchor="fig-rtx"
title="Example of Media Source Retransmission Flows">
<artwork align="center"><![CDATA[+--------------------+
| Media Source |
+--------------------+
|
V
+--------------------+
| Media Encoder |
+--------------------+
| Retransmission
Encoded Stream +--------+ +---- Request
V | V V
+--------------------+ | +--------------------+
| Media Packetizer | | | RTP Retransmission |
+--------------------+ | +--------------------+
| | |
+------------+ Redundancy Packet Stream
Source Packet Stream |
| |
+---------+ +---------+
| |
V V
+-----------------+
| Media Transport |
+-----------------+
]]></artwork>
</figure>
<t>The <xref target="fig-rtx">RTP Retransmission
example</xref> helps illustrate that this mechanism works
purely on the Source Packet Stream. The RTP Retransmission
transform buffers the sent Source Packet Stream and upon
requests emits a retransmitted packet with some extra
payload header as a Redundancy Packet Stream. The <xref
target="RFC4588">RTP Retransmission mechanism</xref> is
specified so that there is a one to one relation between
the Source Packet Stream and the Redundancy Packet Stream.
Thus a Redundancy Packet Stream needs to be associated
with its Source Packet Stream upon being received. This is
done based on CNAME selectors and heuristics to match
requested packets for a given Source Packet Stream with
the original sequence number in the payload of any new
Redundancy Packet Stream using the RTX payload format. In
cases where the Redundancy Packet Stream is sent in a
separate RTP Session from the Source Packet Stream, these
sessions are related, e.g. using the <xref
target="RFC5888">SDP Media Grouping's</xref> FID
semantics.</t>
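<t>As an illustrative sketch of the session-multiplexed case
(hypothetical ports, payload types, and identification tags),
the Source Packet Stream and its Redundancy Packet Stream can
be related via the FID grouping semantics, with the "apt"
parameter of the RTX payload format pointing at the protected
payload type:</t>
<figure title="Illustrative SDP Sketch of RTP Retransmission">
<artwork><![CDATA[a=group:FID 1 2
m=video 20000 RTP/AVPF 96
a=rtpmap:96 VP8/90000
a=mid:1
m=video 20002 RTP/AVPF 97
a=rtpmap:97 rtx/90000
a=fmtp:97 apt=96;rtx-time=3000
a=mid:2
]]></artwork>
</figure>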
</section>
<section title="Forward Error Correction">
<t>The <xref target="fig-fec">figure below</xref>
represents an example where two Media Sources' Source
Packet Streams are protected by FEC. Source Packet Stream
A has a Media Redundancy transformation in FEC Encoder 1.
This produces a Redundancy Packet Stream 1, that is only
related to Source Packet Stream A. FEC Encoder 2,
however, takes two Source Packet Streams (A and B) and
produces a Redundancy Packet Stream 2 that protects them
together, i.e., Redundancy Packet Stream 2 relates to two
Source Packet Streams (a FEC group). FEC decoding, when
needed due to packet loss or packet corruption at the
receiver, requires knowledge about which Source Packet
Streams the FEC encoding was based on.</t>
<t>In <xref target="fig-fec"/> all Packet Streams are sent
on the same Media Transport. This is however not the only
possible choice. Numerous combinations exist for spreading
these Packet Streams over different Media Transports to
achieve the communication application's goal.</t>
<figure align="center" anchor="fig-fec"
title="Example of FEC Flows">
<artwork align="center"><![CDATA[+--------------------+ +--------------------+
| Media Source A | | Media Source B |
+--------------------+ +--------------------+
| |
V V
+--------------------+ +--------------------+
| Media Encoder A | | Media Encoder B |
+--------------------+ +--------------------+
| |
Encoded Stream Encoded Stream
V V
+--------------------+ +--------------------+
| Media Packetizer A | | Media Packetizer B |
+--------------------+ +--------------------+
| |
Source Packet Stream A Source Packet Stream B
| |
+-----+-------+-------------+ +-------+------+
| V V V |
| +---------------+ +---------------+ |
| | FEC Encoder 1 | | FEC Encoder 2 | |
| +---------------+ +---------------+ |
| | | |
| Redundancy PS 1 Redundancy PS 2 |
V V V V
+----------------------------------------------------------+
| Media Transport |
+----------------------------------------------------------+
]]></artwork>
</figure>
<t>As FEC encoding exists in various forms, the methods
for relating FEC Redundancy Packet Streams with their source
information in Source Packet Streams are many. The <xref
target="RFC5109">XOR-based RTP FEC payload format</xref>
is defined in such a way that a Redundancy Packet Stream
has a one-to-one relation with a Source Packet Stream. In
fact, the RFC requires the Redundancy Packet Stream to use
the same SSRC as the Source Packet Stream. This requires
either using a separate RTP Session or using the <xref
target="RFC2198">Redundancy RTP Payload format</xref>. The
underlying relation requirement for this FEC format and a
particular Redundancy Packet Stream is to know the related
Source Packet Stream, including its SSRC.</t>
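<t>As an illustrative sketch (hypothetical ports, payload
types, and identification tags), when the Redundancy Packet
Stream is carried in a separate RTP Session, the source and
FEC m=lines can be related using the FEC semantics of the
<xref target="RFC5888">SDP Grouping Framework</xref>:</t>
<figure title="Illustrative SDP Sketch of FEC Grouping">
<artwork><![CDATA[a=group:FEC 1 2
m=video 30000 RTP/AVP 100
a=rtpmap:100 H263-1998/90000
a=mid:1
m=application 30002 RTP/AVP 101
a=rtpmap:101 ulpfec/90000
a=mid:2
]]></artwork>
</figure>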
<t><!--MW: Here we could ad something about FECFRAME and generalized block FEC that can
protect multiple Packet Streams with one Redundancy Packet Stream. However, that do requrie
usage of explicit Source Packet Information. --></t>
</section>
</section>
<section anchor="packet-stream-separation"
title="Packet Stream Separation">
<t>Packet Streams can be separated based exclusively on
their SSRCs, at the RTP Session level, or at the
Multimedia Session level, as explained below.</t>
<t>When the Packet Streams that have a relationship are all
sent in the same RTP Session and are uniquely identified
based on their SSRC only, it is termed an SSRC-Only Based
Separation. Such streams can be related via RTCP CNAME to
identify that the streams belong to the same End Point.
<xref target="RFC5576"/>-based approaches, when used, can
explicitly relate various such Packet Streams.</t>
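<t>As an illustrative sketch (hypothetical SSRC values and
CNAME), <xref target="RFC5576"/> allows such relations to be
expressed in SDP by listing the SSRCs, and groups of SSRCs,
within a single m=line:</t>
<figure title="Illustrative SDP Sketch of SSRC-Only Based Separation">
<artwork><![CDATA[m=video 40000 RTP/AVP 96
a=rtpmap:96 VP8/90000
a=ssrc:11111 cname:user@example.com
a=ssrc:22222 cname:user@example.com
a=ssrc-group:FID 11111 22222
]]></artwork>
</figure>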
<t>On the other hand, when related Packet Streams
are sent in the context of different RTP Sessions to
achieve separation, it is known as RTP Session-based
separation. This is commonly used when the different Packet
Streams are intended for different Media Transports.</t>
<t>Several mechanisms that use RTP Session-based separation
rely on it to enable an implicit grouping mechanism
expressing the relationship. The solutions have been based
on using the same SSRC value in the different RTP Sessions
to implicitly indicate their relation. That way, no explicit
RTP-level mechanism has been needed; only signalling-level
relations have been established, using semantics from the <xref
target="RFC5888">Grouping of Media Lines framework</xref>.
Examples of this are <xref target="RFC4588">RTP
Retransmission</xref>, <xref target="RFC6190">SVC Multi
Stream Transmission</xref> and <xref target="RFC5109">XOR
Based FEC</xref>. RTCP CNAME explicitly relates Packet
Streams across different RTP Sessions, as explained in the
previous section. Such a relationship can be used to perform
inter-media synchronization.</t>
<t>Packet Streams that are related and need to be associated
can be part of different Multimedia Sessions, rather than
just different RTP sessions within the same Multimedia
Session context. This puts further demand on the scope of
the mechanism(s) and its handling of identifiers used for
expressing the relationships.</t>
</section>
</section>
<section title="Multiple RTP Sessions over one Media Transport">
<t><xref
target="I-D.westerlund-avtcore-transport-multiplexing"/>
describes a mechanism that allows several RTP Sessions to be
carried over a single underlying Media Transport. The main
reasons for doing this are related to the impact of using one
or more Media Transports, i.e., using a common network path or
potentially different ones. With a single Media Transport,
there is a reduced need for NAT/FW traversal resources and no
need for flow-based QoS.</t>
<t>However, Multiple RTP Sessions over one Media Transport
makes it clear that a single Media Transport 5-tuple is not
sufficient to express which RTP Session context a particular
Packet Stream exists in. Complexities in the relationship
between Media Transports and RTP Sessions already exist, as one
RTP Session can contain multiple Media Transports; e.g., even a
peer-to-peer RTP Session with RTP/RTCP multiplexing requires
two Media Transports, one in each direction. The relationship
between Media Transports and RTP Sessions as well as
additional levels of identifiers need to be considered in both
signalling design and when defining terminology.</t>
</section>
</section>
<section anchor="topologies"
title="Topologies and Communication Entities">
<t>This section reviews some communication topologies and looks
at the relationships among the communication entities that are
defined in <xref target="communication-entities"/>. This section
doesn't deal with discussions about the streams and their
relation to the transport. Instead, it covers the aspects that
enable the transport of those streams. For example, it explains
the <xref target="media-transport">Media Transports</xref> that
exist between the <xref target="end-point">End Points</xref>
that are part of an <xref target="rtp-session">RTP
Session</xref>, and their relationship to the <xref
target="multimedia-session">Multimedia Session</xref> between
<xref target="participant">Participants</xref> and the
established <xref target="comm-session">Communication
Session</xref>.</t>
<section title="Point-to-Point Communication">
<t><xref target="fig-p2p-basic"/> shows a very basic
point-to-point Communication Session between A and B. It uses
two different RTP Sessions, one for audio and one for video,
between A's and B's End Points. Assume that the Multimedia
Session shared by the Participants is established using SIP
(i.e., there is a SIP dialog between A and B). A high-level
representation of this communication scenario is shown in
<xref target="fig-p2p-basic"/>.</t>
<figure align="center" anchor="fig-p2p-basic"
title="Point to Point Communication">
<artwork><![CDATA[
+---+ +---+
| A |<------->| B |
+---+ +---+
]]></artwork>
</figure>
<t>However, this picture gets slightly more complex when
redrawn using the communication entities concepts defined
earlier in this document.</t>
<figure align="center" anchor="fig-p2p"
title="Point to Point Communication Session with two RTP Sessions">
<artwork><![CDATA[
+-----------------------------------------------------------+
| Communication Session |
| |
| +----------------+ +----------------+ |
| | Participant A | +-------------+ | Participant B | |
| | | | Multi-Media | | | |
| | +-------------+|<=>| Session |<=>|+-------------+ | |
| | | End Point A || |(SIP Dialog) | || End Point B | | |
| | | || +-------------+ || | | |
| | | +-----------++---------------------++-----------+ | | |
| | | | RTP Session| | | | | |
| | | | Audio |---Media Transport-->| | | | |
| | | | |<--Media Transport---| | | | |
| | | +-----------++---------------------++-----------+ | | |
| | | || || | | |
| | | +-----------++---------------------++-----------+ | | |
| | | | RTP Session| | | | | |
| | | | Video |---Media Transport-->| | | | |
| | | | |<--Media Transport---| | | | |
| | | +-----------++---------------------++-----------+ | | |
| | +-------------+| |+-------------+ | |
| +----------------+ +----------------+ |
+-----------------------------------------------------------+
]]></artwork>
</figure>
<t><xref target="fig-p2p"/> shows that the two RTP Sessions
exist only between the two End Points A and B, over their
respective Media Transports. The Multi-Media Session
establishes the association between the two Participants and
configures these RTP sessions and the Media Transports that
are used.</t>
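<t>As an illustrative sketch only (not defined by this
document), such a SIP-negotiated Multi-Media Session could be
described by an SDP offer carrying one "m=" line per RTP
session; the addresses, ports, and payload types below are
invented for the example.</t>

```sdp
v=0
o=A 2890844526 2890844526 IN IP4 a.example.com
s=-
c=IN IP4 192.0.2.1
t=0 0
m=audio 49170 RTP/AVP 0
a=rtpmap:0 PCMU/8000
m=video 51372 RTP/AVP 96
a=rtpmap:96 H264/90000
```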
</section>
<section anchor="central-conferencing"
title="Central Conferencing">
<t>This section looks at the central conferencing
communication topology, where a number of participants, like
A, B, C, and D in <xref target="fig-central-conf-basic"/>,
communicate using an RTP mixer.</t>
<figure anchor="fig-central-conf-basic"
title="Centralized Conferencing using an RTP Mixer">
<artwork><![CDATA[+---+ +------------+ +---+
| A |<---->| |<---->| B |
+---+ | | +---+
| Mixer |
+---+ | | +---+
| C |<---->| |<---->| D |
+---+ +------------+ +---+
]]></artwork>
</figure>
<t>In this case each of the Participants establishes their
Multi-Media Session with the Conference Bridge. Thus,
negotiation for the establishment of the RTP sessions used and
their configuration happens between these entities. The
Participants have their End Points (A, B, C, D) and the
Conference Bridge has the host running the RTP mixer, referred
to as End Point M in <xref target="fig-central-conf"/>.
However, despite the individual establishment of four
Multi-Media Sessions and the corresponding Media Transports
for each of the RTP sessions between the respective End Points
and the Conference Bridge, there are actually only two RTP
sessions, one for audio and one for video, as these RTP
sessions are, in this topology, shared among all the
Participants.</t>
<figure anchor="fig-central-conf"
title="Central Conferencing with Two Participants A and B communicating over a Conference Bridge">
<artwork><![CDATA[+-------------------------------------------------------------------+
| Communication Session |
| |
| +----------------+ +----------------+ |
| | Participant A | +-------------+ | Conference | |
| | | | Multi-Media | | Bridge | |
| | +-------------+|<=====>| Session A |<=====>|+-------------+ | |
| | | End Point A || |(SIP Dialog) | || End Point M | | |
| | | || +-------------+ || | | |
| | | +-----------++-----------------------------++-----------+ | | |
| | | | RTP Session| | | | | |
| | | | Audio |-------Media Transport------>| | | | |
| | | | |<------Media Transport-------| | | | |
| | | +-----------++-----------------------------++------+ | | | |
| | | || || | | | | |
| | | +-----------++-----------------------------++----+ | | | | |
| | | | RTP Session| | | | | | | |
| | | | Video |-------Media Transport------>| | | | | | |
| | | | |<------Media Transport-------| | | | | | |
| | | +-----------++-----------------------------++ | | | | | |
| | +-------------+| || | | | | | |
| +----------------+ || | | | | | |
| || | | | | | |
| +----------------+ || | | | | | |
| | Participant B | +-------------+ || | | | | | |
| | | | Multi-Media | || | | | | | |
| | +-------------+|<=====>| Session B |<=====>|| | | | | | |
| | | End Point B || |(SIP Dialog) | || | | | | | |
| | | || +-------------+ || | | | | | |
| | | +-----------++-----------------------------++ | | | | | |
| | | | RTP Session| | | | | | | |
| | | | Video |-------Media Transport------>| | | | | | |
| | | | |<------Media Transport-------| | | | | | |
| | | +-----------++-----------------------------++----+ | | | | |
| | | || || | | | | |
| | | +-----------++-----------------------------++------+ | | | |
| | | | RTP Session| | | | | |
| | | | Audio |-------Media Transport------>| | | | |
| | | | |<------Media Transport-------| | | | |
| | | +-----------++-----------------------------++-----------+ | | |
| | +-------------+| |+-------------+ | |
| +----------------+ +----------------+ |
+-------------------------------------------------------------------+
]]></artwork>
</figure>
<t>It is important to stress that in the case of <xref
target="fig-central-conf"/>, it might appear that the
Multi-Media Session context is scoped between A and B over M.
This is not always true; the contexts can extend further. In
this case the RTP session and its common SSRC space extend
beyond what occurs between A and M and between B and M,
respectively.</t>
</section>
<section title="Full Mesh Conferencing">
<t>This section looks at the case where three Participants
(A, B and C) wish to communicate. They establish individual
Multi-Media Sessions and RTP sessions between themselves and
each of the other two peers; thus, each Participant provides
two copies of its media, one to each of the other
Participants. <xref target="fig-full-mesh-basic"/> shows a
high level representation of such a topology.</t>
<figure align="center" anchor="fig-full-mesh-basic"
title="Full Mesh Conferencing with three Participants A, B and C">
<artwork><![CDATA[+---+ +---+
| A |<---->| B |
+---+ +---+
^ ^
\ /
\ /
v v
+---+
| C |
+---+
]]></artwork>
</figure>
<t>In this particular case there are two aspects worth noting.
The first is that there will be multiple Multi-Media Sessions
per Communication Session between the Participants. This has
not been true in the earlier examples, with the exception of
the Centralized Conferencing in <xref
target="central-conferencing"/>. The second aspect is the
consideration of whether one needs to maintain relationships
between entities and concepts, for example Media Sources,
between these different Multi-Media Sessions and between
Packet Streams in the independent RTP sessions configured by
those Multi-Media Sessions.</t>
<figure align="center" anchor="fig-full-mesh"
title="Full Mesh Conferencing between three Participants A, B and C">
<artwork><![CDATA[ +-----------------------------------------+
| Participant A |
+----------+ | +--------------------------------------+|
| Multi- | | | End Point A ||
| Media |<======>| | ||
| Session | | |+-------+ +-------+ +-------+ ||
| 1 | | || RTP 1 |<----| MS A1 |---->| RTP 2 | ||
+----------+ | || | +-------+ | | ||
^^ | +|-------|-------------------|-------|-+|
|| +--|-------|-------------------|-------|--+
|| | | ^^ | |
VV | | || | |
+-------------------------|-------|----+ || | |
| Participant B | | | VV | |
| +-----------------------|-------|---+| +----------+ | |
| | End Point B +----->| | || | Multi- | | |
| | | +-------+ || | Media | | |
| | +-------+ | +-------+ || | Session | | |
| | | MS B1 |------+----->| RTP 3 | || | 2 | | |
| | +-------+ | | || +----------+ | |
| +-----------------------|-------|---+| ^^ | |
+-------------------------|-------|----+ || | |
^^ | | || | |
|| | | VV | |
|| +--|-------|-------------------|-------|--+
VV | | | Participant C | | |
+----------+ | +|-------|-------------------|-------|-+|
| Multi- | | || | End Point C | | ||
| Media |<======>| |+-------+ +-------+ ||
| Session | | | ^ +-------+ ^ ||
| 3 | | | +---------| MS C1 |---------+ ||
+----------+ | | +-------+ ||
| +--------------------------------------+|
+-----------------------------------------+
]]></artwork>
</figure>
<t>For the sake of clarity, <xref target="fig-full-mesh"/>
above does not include all these concepts. The Media Sources
(MS) from a given End Point are sent to the two peers. This
requires encoding and Media Packetization to enable the Packet
Streams to be sent over Media Transports in the context of the
RTP sessions depicted. The RTP sessions 1, 2, and 3 are
independent, and established in the context of the Multi-Media
Sessions 1, 2 and 3, respectively. The joint Communication
Session that the full figure represents is not drawn (unlike
in <xref target="fig-central-conf"/>) in order to save space.
Each End Point, however, combines the received representations
of the peers' Media Sources and plays them back.</t>
<t>It is noteworthy that the full mesh conferencing topologies
described here have the potential for creating loops. Compare,
for example, the above full mesh with a mixing three-party
communication session as depicted in <xref
target="fig-three-relay"/>. In this example A's Media Source
A1 is sent to B over a Multi-Media Session (A-B). In B the
Media Source A1 is mixed with Media Source B1 and the
resulting Media Source (MS AB) is sent to C over a Multi-Media
Session (B-C). If C and A were to establish a Multi-Media
Session (A-C) and C were to act in the same role as B, then A
would receive a Media Source from C that contains a mix of A,
B and C's individual Media Sources. This would result in A
playing out a time-delayed version of its own signal (i.e.,
the system has created an echo path).</t>
<figure anchor="fig-three-relay"
title="Mixing Three Party Communication Session">
<artwork><![CDATA[+--------------+ +--------------+ +--------------+
| A | | B +-------+ | | C |
| | | | MS B1 | | | |
| | | +-------+ | | |
| +-------+ | | | | | |
| | MS A1 |----|--->|-----+ MS AB -|--->| |
| +-------+ | | | | |
+--------------+ +--------------+ +--------------+
]]></artwork>
</figure>
<t>The looping issue can be avoided, detected or prevented
using two general methods. The first method is to take great
care when setting up and establishing the communication
session if Participants have any mixing or forwarding
capacity, so that one does not end up receiving a partial or
full representation of one's own media while believing it is
someone else's. The other method is to maintain unique
identifiers at the Communication Session level for all Media
Sources, and to ensure that any Packet Stream received
identifies the Media Sources that contributed to its
content.</t>
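<t>The second method can be sketched in a few lines. The
following is a hypothetical illustration only, not a mechanism
defined by this document: each forwarded or mixed stream
carries the set of contributing Media Source identifiers
(analogous to how an RTP mixer lists contributors in the CSRC
field), and a Participant treats any received stream that
lists one of its own Media Sources as an echo.</t>

```python
# Hypothetical sketch (not part of the taxonomy): loop detection by
# tracking contributing Media Source identifiers, analogous to how an
# RTP mixer lists contributors in the CSRC field.

def mix(streams):
    """Mixing: the contributor set of the output is the union of the
    contributor sets of all input streams."""
    contributors = set()
    for s in streams:
        contributors.update(s["contributors"])
    return {"contributors": contributors}

def is_echo(own_sources, received):
    """A received Packet Stream that lists one of our own Media
    Sources as a contributor closes an echo path."""
    return bool(own_sources.intersection(received["contributors"]))

# A sends MS A1 to B; B mixes in MS B1 and forwards to C; C mixes in
# MS C1 and forwards back to A, closing the loop described above.
a1 = {"contributors": {"A1"}}
ms_ab = mix([a1, {"contributors": {"B1"}}])
ms_abc = mix([ms_ab, {"contributors": {"C1"}}])
assert is_echo({"A1"}, ms_abc)
assert not is_echo({"A1"}, {"contributors": {"B1", "C1"}})
```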
</section>
<section title="Source-Specific Multicast">
<t>In one-to-many media distribution cases (e.g., IPTV), where
one Media Sender or a set of Media Senders is allowed to send
Packet Streams on a particular Source-Specific Multicast (SSM)
group to many receivers (R), there are some different aspects
to consider. <xref target="fig-ssm-basic"/> presents a high
level SSM system for RTP/RTCP defined in <xref
target="RFC5760"/>. In this case, several Media Senders send
their Packet Streams to the Distribution Source, which is the
only entity allowed to send to the SSM group. The Receivers
joining the SSM group can provide RTCP feedback on their
reception by sending unicast feedback to a Feedback Target
(FT).</t>
<figure anchor="fig-ssm-basic"
title="Source-Specific Multicast Communication Topology">
<artwork><![CDATA[+--------+ +-----+
|Media | | | Source-Specific
|Sender 1|<----->| D S | Multicast (SSM)
+--------+ | I O | +--+----------------> R(1)
| S U | | | |
+--------+ | T R | | +-----------> R(2) |
|Media |<----->| R C |->+ | : | |
|Sender 2| | I E | | +------> R(n-1) | |
+--------+ | B | | | | | |
: | U | +--+--> R(n) | | |
: | T +-| | | | |
: | I | |<---------+ | | |
+--------+ | O |F|<---------------+ | |
|Media | | N |T|<--------------------+ |
|Sender M|<----->| | |<-------------------------+
+--------+ +-----+ RTCP Unicast
FT = Feedback Target
]]></artwork>
</figure>
<t>Here the Media Transports from the Distribution Source to
all the SSM receivers (R) have the same 5-tuple, but in
reality follow different paths. Also, the Multi-Media Sessions
between the Distribution Source and the individual receivers
are normally identical, because configuration information
flows only one way, from the Distribution Source to the
receivers. This information is typically embedded in
Electronic Program Guides (EPGs), distributed by the Session
Announcement Protocol (SAP) <xref target="RFC2974"/> or other
one-way protocols. In some cases load balancing occurs, for
example by providing the receiver with a set of Feedback
Targets from which it randomly selects one.</t>
<t>This scenario differs significantly from the previously
described communication topologies due to the asymmetric
nature of the RTP Session context across the Distribution
Source. The Distribution Source forms a focal point,
collecting the unicast RTCP feedback from the receivers and
then re-distributing it to the Media Senders. Each Media
Sender and the Distribution Source establish their own
Multi-Media Session context for the underlying RTP Sessions,
but with a shared RTCP context across all the receivers.</t>
<t>To improve readability, <xref target="fig-ssm-basic"/>
intentionally hides the details of the various entities.
Expanding on this, one can think of Media Senders
being part of one or more Multi-Media Sessions grouped under a
Communication Session. The Media Sender in this scenario
refers to the Media Packetizer transformation <xref
target="media_packetizer"/>. The Packet Stream generated by
such a Media Sender can be part of its own RTP Session or can
be multiplexed with other Packet Streams within an End Point.
The latter case requires careful consideration since the
re-distributed RTCP packets now correspond to a single RTP
Session Context across all the Media Senders.</t>
</section>
</section>
<section anchor="security" title="Security Considerations">
<t>This document simply tries to clarify the confusion
prevalent in RTP taxonomy, caused by inconsistent usage across
the many technologies and protocols making use of RTP. It
does not introduce any new security considerations beyond
those already well documented in the RTP protocol itself <xref
target="RFC3550"/> and in the respective specifications of the
various protocols making use of it.</t>
<t>Hopefully having a well-defined common terminology and
understanding of the complexities of the RTP architecture will
help lead us to better standards, avoiding security
problems.</t>
</section>
<section title="Acknowledgements">
<t>This document borrows many concepts from several other
documents, such as WebRTC <xref
target="I-D.ietf-rtcweb-overview"/>, CLUE <xref
target="I-D.ietf-clue-framework"/>, and the Multiplexing
Architecture <xref
target="I-D.westerlund-avtcore-transport-multiplexing"/>.
The authors would like to thank all the authors of each of
those documents.</t>
<t>The authors would also like to acknowledge the insights,
guidance and contributions of Magnus Westerlund, Roni Even, Paul
Kyzivat, Colin Perkins, Keith Drage, and Harald Alvestrand.</t>
</section>
<section title="Contributors">
<t>Magnus Westerlund has contributed the concept model for the
media chain using transformations and streams model, including
rewriting pre-existing concepts into this model and adding
missing concepts. The first proposal for updating the
relationships and the topologies based on this concept was also
performed by Magnus.</t>
</section>
<section anchor="iana" title="IANA Considerations">
<t>This document makes no request of IANA.</t>
</section>
</middle>
<back>
<references title="Normative References">
<?rfc include="reference.RFC.3550"?>
<reference anchor="UML">
<front>
<title>OMG Unified Modeling Language (OMG UML),
Superstructure, V2.2</title>
<author>
<organization abbrev="OMG">Object Management
Group</organization>
</author>
<date month="February" year="2009"/>
</front>
<seriesInfo name="OMG" value="formal/2009-02-02"/>
<format target="http://www.omg.org/spec/UML/2.2/Superstructure/PDF/"
type="PDF"/>
</reference>
</references>
<references title="Informative References">
<?rfc include='reference.RFC.2198'?>
<?rfc include='reference.RFC.2974'?>
<?rfc include="reference.RFC.3264"?>
<?rfc include='reference.RFC.3551'?>
<?rfc include="reference.RFC.4566"?>
<?rfc include='reference.RFC.4588'?>
<?rfc include='reference.RFC.4867'?>
<?rfc include='reference.RFC.5109'?>
<?rfc include='reference.RFC.5404'?>
<?rfc include="reference.RFC.5576"?>
<?rfc include='reference.RFC.5760'?>
<?rfc include="reference.RFC.5888"?>
<?rfc include="reference.RFC.5905"?>
<?rfc include='reference.RFC.6190'?>
<?rfc include="reference.RFC.6222"?>
<?rfc include="reference.I-D.ietf-clue-framework"?>
<?rfc include="reference.I-D.ietf-rtcweb-overview"?>
<?rfc include="reference.I-D.ietf-mmusic-sdp-bundle-negotiation"?>
<?rfc include="reference.I-D.ietf-avtcore-clksrc"?>
<?rfc include="reference.I-D.westerlund-avtcore-transport-multiplexing"?>
</references>
<section title="Changes From Earlier Versions">
<t>NOTE TO RFC EDITOR: Please remove this section prior to
publication.</t>
<section title="Modifications Between Version -02 and -03">
<t><list style="symbols">
<t>Section 4 rewritten (and new communication topologies
added) to reflect the major updates to Sections 1-3</t>
<t>Section 8 removed (carryover from initial -00
draft)</t>
<t>General clean up of text, grammar and nits</t>
</list></t>
</section>
<section title="Modifications Between Version -01 and -02">
<t><list style="symbols">
<t>Section 2 rewritten to add both streams and
transformations in the media chain.</t>
<t>Section 3 rewritten to focus on exposing
relationships.</t>
</list></t>
</section>
<section title="Modifications Between Version -00 and -01">
<t><list style="symbols">
<t>Too many to list</t>
<t>Added new authors</t>
<t>Updated content organization and presentation</t>
</list></t>
</section>
</section>
</back>
</rfc>