<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rfc SYSTEM 'rfcXXXX.dtd'>
<rfc category='info' ipr='trust200902' docName='draft-ivov-avt-slic-00'>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<?rfc toc='yes' ?>
<?rfc symrefs='yes' ?>
<?rfc sortrefs='yes'?>
<?rfc iprnotified='no' ?>
<?rfc strict='yes' ?>
<?rfc compact='yes' ?>
<front>
<title abbrev='Sound Level Indicators in Conferences'>
Delivering Conference Participant Sound Level Indicators in RTP
Streams
</title>
<author initials='E.' surname='Ivov' fullname='Emil Ivov'>
<organization abbrev='SIP Communicator'>SIP Communicator</organization>
<address>
<postal>
<street></street>
<city>Strasbourg</city>
<code>67000</code>
<country>France</country>
</postal>
<email>emcho@sip-communicator.org</email>
</address>
</author>
<author initials='E.' surname='Marocco' fullname='Enrico Marocco'>
<organization>Telecom Italia</organization>
<address>
<postal>
<street>Via G. Reiss Romoli, 274</street>
<city>Turin</city>
<code>10148</code>
<country>Italy</country>
</postal>
<email>enrico.marocco@telecomitalia.it</email>
</address>
</author>
<date month='June' year='2009' />
<abstract>
<t>
This document describes a mechanism for RTP-level mixers in
audio conferences to deliver sound level information about the
individual participants. Such sound level indicators are
transported in the same RTP packets as the audio data they
pertain to.
</t>
</abstract>
</front>
<middle>
<section title='Introduction'>
<t>
The Framework for Conferencing with the Session Initiation
Protocol (SIP) defined in
<xref target="RFC4353">RFC 4353</xref>
presents an overall architecture for multi-party conferencing.
Among other things, the framework borrows from
<xref target="RFC3550">RTP</xref>
and extends the concept of a mixer entity "responsible for
combining the media streams that make up a conference, and
generating one or more output streams that are delivered to
recipients". Every participant would hence receive, in a flat
single stream, media originating from all the others.
</t>
<t>
Using such centralized mixer-based architectures simplifies
support for conference calls on the client side since they would
hardly differ from one-to-one conversations. However, the
method also introduces a few limitations. The flat nature of
the streams that a mixer would output and send to participants
makes it difficult for users to identify the original source of
what they are hearing.
</t>
<t>
Mechanisms that allow the mixer to send participants cues about
current speakers (e.g. the CSRC fields in
<xref target='RFC3550'>RTP</xref>) only provide binary
speaking/silent indications. There are, however, a number of use cases
where one would require more detailed information. Possible
examples include the presence of background
chat/noise/music/typing, someone breathing noisily in their
microphone, or other cases where identifying the source of the
disturbance would make it easy to remove it (e.g. by sending a
private IM to the concerned party asking them to mute their
microphone). A more advanced scenario could involve an intense
discussion between multiple participants whom the user does not
personally know. Sound level information would help the user
recognize the speakers by associating them with richer (but
still human-readable) characteristics such as loudness and
speed.
</t>
<t>
One way of presenting such information in a user-friendly
manner would be for a conferencing client to attach sound level
indicators to the corresponding participant-related components
in the user interface, as displayed in
<xref target='figure-conference-ui' />.
</t>
<figure anchor="figure-conference-ui">
<artwork>
<![CDATA[
 ------------------------
|                        |
|  00:42 |  Weekly Call  |
|                        |
|------------------------|
|                        |
| Alice |======    | (S) |
|                        |
| Bob   |=         |     |
|                        |
| Carol |          | (M) |
|                        |
| Dave  |===       |     |
|                        |
|________________________|
]]>
</artwork>
<postamble>
Displaying detailed speaker information to the user by
including sound level for every participant.
</postamble>
</figure>
<t>
Implementing a user interface like the above requires analysis
of the media sent from other participants. In a conventional
audio conference this is only possible for the mixer since all
other conference participants are generally receiving a single,
flat audio stream and therefore have no immediate way of
determining individual sound levels.
</t>
<t>
This document specifies an RTP header extension that allows such
mixers to deliver sound level information to conference
participants by including it directly in the RTP packets
transporting the corresponding audio data.
</t>
</section>
<section title="Terminology">
<t>
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described
in <xref target="RFC2119">RFC 2119</xref>.
</t>
</section>
<section title='Protocol Operation'>
<t>
According to <xref target='RFC3550'>RFC 3550</xref> a mixer
is expected to include in outgoing RTP packets a list of
identifiers (CSRC IDs) indicating the sources that contributed
to the resulting stream. The presence of such CSRC IDs allows an
RTP client to determine, in a binary way, the active speaker(s)
at any given moment. RTCP also provides a basic mechanism to map
the CSRC IDs to user identities through the CNAME field. More
advanced mechanisms may exist depending on the signaling
protocol used to establish and control a conference. In the case
of the <xref target="RFC3261">Session Initiation Protocol</xref>,
for example, the <xref target="RFC4575">Event Package for
Conference State</xref> defines a &lt;src-id&gt; tag which binds
CSRC IDs to media streams and SIP URIs.
</t>
<t>
This document describes an RTP header extension that allows
mixers to indicate the sound level of every conference
participant (CSRC) in addition to simply indicating their
on/off status. This new header extension is based on the
<xref target="RFC5285"> "General Mechanism for RTP Header
Extensions"</xref>.
</t>
<t>
Each instance of this header extension contains a list of
one-octet sound level values (see <xref target='hdr-fmt'/>). Such
values indicate sound level on a 0 to 255 scale, where 0 is
silence (i.e. equivalent to omitting the corresponding source ID
from the CSRC list) and 255 corresponds to a threshold accepted by
the mixer implementation as the maximum sound level that a
participant is likely to reach during a conference.
</t>
<t>
Every sound level value pertains to the CSRC identifier
located at the corresponding position in the CSRC list. In other
words, the first value would indicate the sound level of the
conference participant represented by the first CSRC identifier
in that packet and so forth. The number and order of these
values MUST therefore match the number and order of the CSRC
IDs present in the same packet.
</t>
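<t>
The positional rule above can be illustrated with a short,
non-normative sketch in Python. The function name pair_levels is
hypothetical and not part of this specification; the sketch only
assumes that a receiver already holds the CSRC list and the list
of sound level values extracted from the same packet.
</t>
<figure>
<artwork>
<![CDATA[
```python
def pair_levels(csrc_list, levels):
    """Map each CSRC identifier to the level at the same position.

    Hypothetical helper: the i-th level belongs to the i-th CSRC.
    """
    if len(csrc_list) != len(levels):
        # Number and order of values must match the CSRC list.
        raise ValueError("level count must match CSRC count")
    if len(csrc_list) > 15:
        raise ValueError("an RTP header carries at most 15 CSRC IDs")
    if any(not 0 <= value <= 255 for value in levels):
        raise ValueError("levels are one-octet values (0..255)")
    return dict(zip(csrc_list, levels))


# Two contributing sources: the first is loud, the second nearly silent.
by_source = pair_levels([0x1111AAAA, 0x2222BBBB], [200, 3])
```
]]>
</artwork>
</figure>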
<t>
When encoding sound level information, a mixer SHOULD include in
a packet the information that corresponds to the audio data being
transported in that same packet. It is important that these
values follow the actual stream as closely as possible.
Therefore, a mixer SHOULD also calculate the values after the
original contributing stream has undergone any processing
such as level normalization or noise reduction.
</t>
<t>
Note that in some cases a mixer may be sending an RTP audio
stream that only contains sound level information and no actual
audio. Updating a (web) interface conference module may be one
reason for this to happen.
</t>
<!-- t>
Absence of sound level information in an RTP packet SHOULD be
interpreted by receivers as an indication that sound level for
the CSRC IDs present in the packet remains unchanged as per the
previous packet and is set to 0 (silence) for all other
participants. Note however that this mechanism is unreliable
since the last packet containing a change in the sound level
may have never reached some receivers and this could lead to
inconsistencies. It is therefore RECOMMENDED that mixers always
deliver sound level information when there is at least one
non-silent party.
</t -->
<t>
It may sometimes happen that a conference involves more than a
single mixer. In such cases each of the mixers MAY choose to
relay the CSRC list and sound level information they receive
from peer mixers (as long as the total CSRC count remains below
16). Given that the maximum sound level is not precisely defined
by this specification, it is likely that in such situations
average sound levels would be perceptibly different for the
participants located behind the different mixers.
</t>
</section>
<section title='Header Format' anchor='hdr-fmt'>
<t>
The sound level indicators are delivered to the receivers
in-band using the <xref target='RFC5285'>"General Mechanism for
RTP Header Extensions"</xref>. The payload of this extension
(the transmitted list of sound level values) is a sequence of
8-bit unsigned integers.
<figure>
<preamble>
The form of the sound level indicators extension block is
as follows:
</preamble>
<artwork>
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  ID   |  len  |    level 1    |    level 2    |    level 3  ...
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
</artwork>
</figure>
The 4-bit len field holds the number of data bytes (i.e. sound
level values) that follow the one-byte header in this header
extension element, minus one. Therefore, the value zero in this
field indicates that one byte of data follows. A value of 15 is
not allowed by this specification and MUST NOT be used, as the
RTP header can carry a maximum of 15 CSRC IDs. The maximum value
allowed is therefore 14, indicating a following sequence of 15
sound level values.
</t>
<t>
Note that use of the two-byte header defined in
<xref target='RFC5285'>RFC 5285</xref> follows the same rules,
the only change being the length of the ID and len fields.
</t>
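<t>
As a non-normative illustration of the one-byte-header layout
described in this section, the following Python sketch encodes and
decodes a single extension element. The function names are
hypothetical. The sketch covers only the element itself (the ID/len
octet plus the level octets); the enclosing RTP header extension
block and any padding defined in RFC 5285 are deliberately left out.
</t>
<figure>
<artwork>
<![CDATA[
```python
def encode_element(ext_id, levels):
    """Build a one-byte-header element carrying sound level values."""
    if not 1 <= ext_id <= 14:
        # IDs 0 and 15 are not usable with the one-byte header.
        raise ValueError("extension ID must be in 1..14")
    if not 1 <= len(levels) <= 15:
        raise ValueError("between 1 and 15 level values are allowed")
    # High nibble: ID. Low nibble: number of data bytes minus one.
    header = (ext_id << 4) | (len(levels) - 1)
    # bytes() rejects values outside 0..255 with a ValueError.
    return bytes([header]) + bytes(levels)


def decode_element(data):
    """Return (ext_id, levels) parsed from the start of data."""
    ext_id = data[0] >> 4
    len_field = data[0] & 0x0F
    if len_field == 15:
        raise ValueError("len value 15 is not allowed here")
    count = len_field + 1
    levels = list(data[1:1 + count])
    if len(levels) != count:
        raise ValueError("truncated extension element")
    return ext_id, levels


# Three levels mapped to local ID 7: len field is 3 - 1 = 2.
element = encode_element(7, [200, 3, 0])
```
]]>
</artwork>
</figure>
<t>
With the values above, the first octet is 0x72 (ID 7 in the high
nibble, len 2 in the low nibble), followed by the three level
octets.
</t>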
</section>
<section title='Signaling Information' anchor='sig-info'>
<t>
The URI for declaring the sound level header extension in an SDP
extmap attribute and mapping it to a local extension header
identifier is "urn:ietf:params:rtp-hdrext:csrc-sound-level".
There is no additional setup information needed for this
extension (i.e. no extensionattributes).
</t>
<t>
An example attribute line in the SDP for a conference might be:
</t>
<figure>
<artwork>
a=extmap:7 urn:ietf:params:rtp-hdrext:csrc-sound-level
</artwork>
</figure>
<t>
The above mapping will most often be provided per media stream
(in the media-level section(s) of SDP, i.e., after an "m=" line)
or globally if there is more than one stream containing sound
level indicators in a session.
</t>
<t>
Presence of the above attribute in the SDP description of a
media stream indicates that some or all RTP packets in that
stream will contain the sound level RTP header extension.
</t>
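<t>
A receiver needs the local identifier from the "extmap" attribute
in order to recognize the extension in incoming packets. The
following non-normative Python sketch shows one way to extract it.
The function names are hypothetical, and the parser is
intentionally minimal: it ignores any extension attributes after
the URI and reports an omitted direction as None rather than
assuming a default.
</t>
<figure>
<artwork>
<![CDATA[
```python
SOUND_LEVEL_URI = "urn:ietf:params:rtp-hdrext:csrc-sound-level"


def parse_extmap(line):
    """Parse 'a=extmap:<id>[/<direction>] <URI> ...'.

    Returns (local_id, direction, uri), with direction set to None
    when it is omitted, or None if the line is not an extmap line.
    """
    prefix = "a=extmap:"
    if not line.startswith(prefix):
        return None
    fields = line[len(prefix):].split()
    if len(fields) < 2:
        return None
    ident, _, direction = fields[0].partition("/")
    return int(ident), direction or None, fields[1]


def sound_level_id(sdp_lines):
    """Return the local ID mapped to the sound level URI, or None."""
    for line in sdp_lines:
        parsed = parse_extmap(line.strip())
        if parsed and parsed[2] == SOUND_LEVEL_URI:
            return parsed[0]
    return None


example = parse_extmap("a=extmap:7 " + SOUND_LEVEL_URI)
```
]]>
</artwork>
</figure>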
<t>
Conferencing clients that support sound level indicators and
have no mixing capabilities SHOULD always include the
direction parameter in the "extmap" attribute, setting it to
"recvonly". Conference focus entities with mixing
capabilities MAY omit the direction or set it to "sendrecv" in
SDP offers. Such entities SHOULD set it to "sendonly" in SDP
answers to offers with a "recvonly" parameter and to
"sendrecv" when answering "sendrecv" offers.
</t>
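<t>
The answering behavior recommended above for a focus with mixing
capabilities can be summarized in a few lines of non-normative
Python. The function name is hypothetical. Treating an omitted
direction like "sendrecv" is an assumption made for this sketch;
offers with any other direction are not covered by the text above
and are therefore rejected here.
</t>
<figure>
<artwork>
<![CDATA[
```python
def focus_answer_direction(offered):
    """Direction a mixing focus puts in its answer (sketch)."""
    if offered == "recvonly":
        # The offerer only wants to receive levels: answer sendonly.
        return "sendonly"
    if offered in (None, "sendrecv"):
        # Assumption: an omitted direction is handled like sendrecv.
        return "sendrecv"
    raise ValueError("direction not covered by this sketch")


answer = focus_answer_direction("recvonly")
```
]]>
</artwork>
</figure>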
<t>
<xref target='client-focus'/> and <xref
target='focus-focus'/> show two example offer/answer exchanges:
one between a conferencing client and a focus, and one between
two conference focus entities.
</t>
<figure anchor="client-focus">
<artwork>
v=0
o=alice 2890844526 2890844526 IN IP6 host.example.com
c=IN IP6 host.example.com
t=0 0
m=audio 49170 RTP/AVP 0 4
a=rtpmap:0 PCMU/8000
a=rtpmap:4 G723/8000
a=extmap:1/recvonly urn:ietf:params:rtp-hdrext:csrc-sound-level
v=0
i=A Seminar on the session description protocol
o=conf-focus 2890844730 2890844730 IN IP6 focus.example.net
c=IN IP6 focus.example.net
t=0 0
m=audio 52543 RTP/AVP 0
a=rtpmap:0 PCMU/8000
a=extmap:1/sendonly urn:ietf:params:rtp-hdrext:csrc-sound-level
</artwork>
<postamble>
A client-initiated example SDP offer/answer exchange
negotiating an audio stream with one-way flow of sound
level information.
</postamble>
</figure>
<figure anchor="focus-focus">
<artwork>
v=0
i=Un seminaire sur le protocole de description des sessions
o=fr-focus 2890844730 2890844730 IN IP6 focus.fr.example.net
c=IN IP6 focus.fr.example.net
t=0 0
m=audio 49170 RTP/AVP 0
a=rtpmap:0 PCMU/8000
a=extmap:1/sendrecv urn:ietf:params:rtp-hdrext:csrc-sound-level
v=0
i=A Seminar on the session description protocol
o=us-focus 2890844526 2890844526 IN IP6 focus.us.example.net
c=IN IP6 focus.us.example.net
t=0 0
m=audio 52543 RTP/AVP 0
a=rtpmap:0 PCMU/8000
a=extmap:1/sendrecv urn:ietf:params:rtp-hdrext:csrc-sound-level
</artwork>
<postamble>
An example SDP offer/answer exchange between two conference
focus entities with mixing capabilities negotiating an audio
stream with bidirectional flow of sound level information.
</postamble>
</figure>
</section>
<section title='Security Considerations'>
<t>
<list style='numbers'>
<t>
This document defines a means of attributing sound level
to a particular participant in a conference. An attacker may
try to modify the content of RTP packets in a way that would
make sound activity from one participant appear as coming
from another.
</t>
<t>
Furthermore, the fact that sound level values would not be
encrypted even in an SRTP session may be of concern in some
cases where the activity of a particular participant in a
conference is confidential.
</t>
<t>
Both of the above are concerns that stem from the design of
the RTP protocol itself. It is therefore important that,
according to the needs of a particular scenario,
implementors and deployers consider the use of a lower-level
security and authentication mechanism.
</t>
</list>
</t>
</section>
<section title='IANA Considerations'>
<t>
This document defines a new extension URI that, if approved,
would need to be added to the RTP Compact Header Extensions
sub-registry of the Real-Time Transport Protocol (RTP)
Parameters registry, according to the following data:
</t>
<figure>
<artwork>
Extension URI: urn:ietf:params:rtp-hdrext:csrc-sound-level
Description: Sound level indicators
Contact: emcho@sip-communicator.org
Reference: RFC XXXX
</artwork>
</figure>
</section>
<section title='Open Issues'>
<t>
At the time of writing of this document the authors have no
clear view on whether and how the following issues should
be addressed here:
<list style='numbers'>
<t>
Specific sound level mappings. The current version of this
specification treats sound level indicators as referable
to any scale chosen by the mixer. The only constraints
are that the value 0 corresponds to participant
inactivity/silence and the value 255 to a
level that would appear to users as loud but still
attainable. It is however possible to map specific levels
(e.g. measured in dBm) with the purpose of achieving
cross-mixer uniformity of these values. An obvious tradeoff
here is the increased complexity of implementation that
would require mixers to convert sound level to whatever
specific unit they use for internal estimation, which could
be non-trivial in a number of cases.
</t>
<t>
Sound levels in video streams. This specification allows
use of sound level values in "silent" audio streams that
don't otherwise carry any payload thus allowing their
delivery within systems where the various focus/mixer
components communicate with each other as conference
participants. The same train of thought may very well
justify sound level transport in video streams.
</t>
</list>
</t>
</section>
<section title="Acknowledgments">
<t>
Roni Even, Ingemar Johansson, and several others provided
helpful feedback over the dispatch mailing list.
</t>
<t>
SIP Communicator's participation in this specification is
funded by the NLnet Foundation.
</t>
</section>
<section title='Appendix: An alternative approach'>
<t>
The <xref target='I-D.ivov-dispatch-slic-ps'>problem statement
</xref> preceding this document originally favored a slightly
different resolution approach that the authors feel may still
be relevant and therefore worth publishing here.
</t>
<t>
A very simple way for a mixer to use the CSRC fields as a
transport means for sound level indication would be to extend
their meaning over a series of packets rather than a single
one. This way it could be specified that the sound level of
a particular participant, represented on a zero to ten scale,
corresponds to the number of occurrences of its CSRC
identifier in the ten most recent RTP packets received from
the mixer.
</t>
<t>
For example, consider a conference call with four
participants: Alice, Bob, Carol, and Dave. At a certain
point in time Alice has a sound level of 6/10, Bob 1/10,
Carol is silent or in other words 0/10 and Dave has a level
of 3/10. In order to describe this state the mixer could
have sent the last ten RTP packets with the following CSRC
configuration:
</t>
<texttable anchor="tab:differences">
<ttcol width="10%"></ttcol>
<ttcol width="9%"> P1 </ttcol>
<ttcol width="9%"> P2 </ttcol>
<ttcol width="9%"> P3 </ttcol>
<ttcol width="9%"> P4 </ttcol>
<ttcol width="9%"> P5 </ttcol>
<ttcol width="9%"> P6 </ttcol>
<ttcol width="9%"> P7 </ttcol>
<ttcol width="9%"> P8 </ttcol>
<ttcol width="9%"> P9 </ttcol>
<ttcol width="9%"> P10 </ttcol>
<c> Alice </c><c>+</c><c>+</c><c>+</c><c>+</c><c>+</c>
<c>+</c><c></c><c></c><c></c><c></c>
<c> Bob </c> <c></c><c>+</c><c></c><c></c><c></c><c></c>
<c></c><c></c><c></c><c></c>
<c> Carol </c> <c></c><c></c><c></c><c></c><c></c><c></c>
<c></c><c></c><c></c><c></c>
<c> Dave </c> <c></c><c></c><c></c><c></c><c></c><c></c>
<c></c><c>+</c><c>+</c><c>+</c>
<postamble>
A possible representation of a particular sound level
configuration through the presence/absence of CSRC IDs in
subsequent RTP packets.
</postamble>
</texttable>
<t>
The graphical interface of a user agent involved in such a
conference (like the one sketched in
<xref target='figure-conference-ui'/>) would then display
correct sound levels simply by showing, for each participant,
as many ticks as there were occurrences of the respective CSRC
in the previous ten RTP packets.
</t>
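<t>
On the receiving side, the counting described above could be
sketched as follows. This is a non-normative Python illustration;
the class name LevelTracker is hypothetical, and participant names
stand in for the 32-bit CSRC identifiers to keep the example
readable. The packet sequence replays the table above.
</t>
<figure>
<artwork>
<![CDATA[
```python
from collections import deque


class LevelTracker:
    """Derive 0..10 levels from CSRC presence in recent packets."""

    def __init__(self, window=10):
        # Only the CSRC sets of the most recent packets are kept.
        self._recent = deque(maxlen=window)

    def on_packet(self, csrc_list):
        self._recent.append(frozenset(csrc_list))

    def level(self, csrc):
        # Level = number of recent packets that listed this CSRC.
        return sum(csrc in packet for packet in self._recent)


# Packets P1..P10 from the table above.
packets = (
    [["alice"], ["alice", "bob"]]
    + [["alice"]] * 4
    + [[]]
    + [["dave"]] * 3
)
tracker = LevelTracker()
for csrc_list in packets:
    tracker.on_packet(csrc_list)

levels = {name: tracker.level(name)
          for name in ("alice", "bob", "carol", "dave")}
```
]]>
</artwork>
</figure>
<t>
After the ten packets, the sketch yields 6 for Alice, 1 for Bob,
0 for Carol, and 3 for Dave, matching the example.
</t>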
<t>
The algorithm for encoding sound level information this way
is relatively simple. In order to determine whether or not
to include a particular CSRC a mixer should:
<list style='symbols'>
<t>
include the CSRC if the sound level of the participant
in the current packet is greater than the number of
occurrences of that same CSRC in the nine previous
packets;
</t>
<t>
omit the CSRC if the sound level of the participant in
the current packet is lower than or equal to the number
of occurrences of that same CSRC in the nine previous
packets.
</t>
</list>
</t>
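<t>
The two rules above can be sketched in a few lines of
non-normative Python. The function names should_include and
simulate are hypothetical. The simulation assumes a single
participant whose level stays constant on the zero-to-ten scale
and a history that starts with no occurrences.
</t>
<figure>
<artwork>
<![CDATA[
```python
from collections import deque


def should_include(level, previous_nine):
    """Apply the include/omit rule for one CSRC.

    previous_nine holds one boolean per previous packet, True when
    the CSRC was present in that packet.
    """
    return level > sum(previous_nine)


def simulate(level, packets=20):
    """Return the presence pattern for a constant level (sketch)."""
    history = deque([False] * 9, maxlen=9)  # assumed empty start
    pattern = []
    for _ in range(packets):
        present = should_include(level, history)
        pattern.append(present)
        history.append(present)
    return pattern


pattern = simulate(3)
```
]]>
</artwork>
</figure>
<t>
For a constant level of 3, the sketch includes the CSRC in the
first three packets, omits it from the next seven, and then
repeats, so that any ten consecutive packets contain the CSRC
exactly three times.
</t>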
<t>
There are several advantages to using this approach, the
most obvious being its simplicity as well as the fact that
sound level information is transported together with the
parts of the audio stream that it actually concerns which
should make synchronization straightforward.
</t>
<t>
The technique would also work with other signaling protocols
using RTP such as <xref target='RFC3920'>XMPP's</xref> Jingle
extensions for example.
</t>
<t>
One of the first disadvantages that come to mind with this
approach is the fact that a mixer would not be able to indicate
the level in a single packet but would have to distribute it over
a succession of up to ten packets, which would reduce the
reactivity of the representation.
</t>
<t>
It is probably worth mentioning, however, that a granularity
that allows switching from a level of zero to ten and back to
zero again instantly is not of much use anyway,
since such UI updates would be barely perceptible to the user.
Still, this is a UI decision, and making it at the protocol level
may bring some inconveniences.
</t>
<t>
Another possible problem would come from implementations using
CSRC presence in a binary way to determine current speaker.
When running against a mixer that supports sound level
indication such implementations may appear to be jumpy as
the participants that they are designating as active may be
changing status too rapidly.
</t>
</section>
</middle>
<back>
<references title='Normative References'>
<?rfc include="reference.RFC.2119"?>
<?rfc include="reference.RFC.3550"?>
<?rfc include="reference.RFC.5285"?>
</references>
<references title='Informative References'>
<?rfc include="reference.I-D.ietf-mmusic-ice"?>
<?rfc include="reference.I-D.ivov-dispatch-slic-ps"?>
<?rfc include="reference.RFC.3261"?>
<?rfc include="reference.RFC.3551"?>
<?rfc include="reference.RFC.3920"?>
<?rfc include="reference.RFC.4353"?>
<?rfc include="reference.RFC.4575"?>
</references>
</back>
</rfc>