<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rfc SYSTEM 'rfcXXXX.dtd'>
<rfc category='info' ipr='trust200902' docName='draft-ivov-dispatch-slic-ps-00'>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<?rfc toc='yes' ?>
<?rfc symrefs='yes' ?>
<?rfc sortrefs='yes'?>
<?rfc iprnotified='no' ?>
<?rfc strict='yes' ?>
<front>
<title abbrev='Sound Level Indicators in Conferences'>
Dispatching Sound Level Indicators in Conferences (Problem
Statement)
</title>
<author initials='E.' surname='Ivov' fullname='Emil Ivov'>
<organization abbrev='SIP Communicator'>SIP Communicator</organization>
<address>
<postal>
<street></street>
<city>Strasbourg</city>
<code>67000</code>
<country>France</country>
</postal>
<email>emcho@sip-communicator.org</email>
</address>
</author>
<author initials='E.' surname='Marocco' fullname='Enrico Marocco'>
<organization>Telecom Italia</organization>
<address>
<postal>
<street>Via G. Reiss Romoli, 274</street>
<city>Turin</city>
<code>10148</code>
<country>Italy</country>
</postal>
<email>enrico.marocco@telecomitalia.it</email>
</address>
</author>
<date month='May' year='2009' />
<abstract>
<t>
The Conferencing Framework described in RFC 4353 defines the
semantics necessary for conducting conference calls with the
Session Initiation Protocol (SIP). It also introduces a mixer
entity responsible for combining all media streams and
delivering them to the participants of the call. This document
describes the lack of a standardized way for such mixers to
deliver information about the audio activity (sound level) of
participants in a conference call and discusses a few possible
ways of transporting such information.
</t>
</abstract>
</front>
<middle>
<section title='The Problem'>
<t>
The Framework for Conferencing with the Session Initiation
Protocol defined in
<xref target="RFC4353">RFC 4353</xref>
presents an overall architecture for multi-party conferencing.
Among other things, the framework borrows from
<xref target="RFC3550">RTP</xref>
and extends the concept of a mixer entity "responsible for
combining the media streams that make up a conference, and
generating one or more output streams that are delivered to
recipients". Every participant would hence receive, in a flat
single stream, media originating from the others.
</t>
<t>
Using such centralized mixer-based architectures simplifies
support for conference calls on the client side, since such
calls hardly differ from one-to-one conversations. However, the
method also introduces a few limitations. The flat nature of
the streams that a mixer would output and send to participants
makes it difficult for users to identify the original source of
what they are hearing.
</t>
<t>
The IETF has already defined mechanisms (e.g. the CSRC fields
in <xref target='RFC3550'>RTP</xref>) that allow the mixer to
send participants cues about the current speakers, but these
only provide a binary speaking/silent indication. There are
still a number of use cases that require more detailed
information. Possible examples include the
presence of background chat/noise/music/typing, someone
breathing noisily in their microphone, or other cases where
identifying the source of the disturbance would make it easy
to remove it (e.g. by sending a private IM to the concerned
party asking them to mute their microphone).
</t>
<t>
One way of presenting such information in a user friendly
manner would be for a conferencing client to attach sound level
indicators to the corresponding participant related components
in the user interface as displayed in
<xref target='figure-conference-ui' />.
</t>
<figure anchor="figure-conference-ui">
<artwork>
<![CDATA[
------------------------
| |
| 00:42 | Weekly Call |
| |
|------------------------|
| |
| Alice |====== | (S) |
| |
| Bob |= | |
| |
| Carol | | (M) |
| |
| Dave |=== | |
| |
|________________________|
]]>
</artwork>
<postamble>
Delivering detailed speaker information to the user by
displaying sound level for every participant.
</postamble>
</figure>
<t>
Implementing a user interface like the above on the client side,
however, would be quite delicate (if at all possible) since, as
already mentioned, conference participants generally receive a
single, flat audio stream and therefore have no immediate way
of determining individual sound levels based solely on the
media. With today's common conferencing solutions a mixer is
the only party aware of such information. It therefore seems
like a logical next step to determine what would be the best
way to allow a mixer to deliver such information to conference
participants.
</t>
<t>
The rest of this document investigates existing IETF mechanisms
that could be extended in order to allow for a way to transport
sound level information.
</t>
</section>
<section title='Possible Approaches'>
<t>
This section examines several existing mechanisms and how they
could be used to transport participant sound level indicators.
</t>
<section title='An Extension to the Conference State Event
Package for SIP'>
<t>
<xref target="RFC4575">RFC 4575</xref> defines a conference
event package for tightly coupled conferences using the
Session Initiation Protocol (SIP) events framework. It
allows for the delivery of various conference related
details such as conference descriptions, participant count
and identity. The document also provides a way of indicating
who the speakers are at any given moment by specifying a
mechanism for mapping conference participants to RTP
SSRC/CSRC identifiers. All these details are dispatched in
an asynchronous manner using the SIP events framework, or,
in other words, through NOTIFY SIP requests following an
initial SUBSCRIBE from a participant. It may therefore seem
logical to try and extend the framework by adding the syntax
necessary to convey sound levels.
</t>
<t>
Further thought on the subject, however, raises numerous
issues with such an approach. Sound level in human speech
changes quickly and would therefore require frequent updates
(i.e. approximately once every 50-100 ms). In order for the
update of the user interface to appear "natural" to the user,
sound level information would probably have to be delivered
after every one or two RTP packets. Using
<xref target="RFC4575">RFC 4575</xref>, or SIP in general,
for this would generate traffic on the (often low-bandwidth)
signaling path comparable to, if not exceeding, that of the
media itself.
</t>
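<figure>
<preamble>
The following back-of-the-envelope calculation, written in
Python, illustrates the comparison made above. The NOTIFY size,
the codec bit rate and the packetization interval are
assumptions chosen for illustration only; they are not
measurements and do not come from any specification.
</preamble>
<artwork>
<![CDATA[
```python
# Rough comparison of signalling vs. media bandwidth.
# All sizes below are illustrative assumptions, not measurements.

NOTIFY_BYTES = 600        # assumed size of one SIP NOTIFY (headers + body)
RTP_PAYLOAD_BYTES = 160   # assumed 20 ms of 64 kbit/s audio
RTP_OVERHEAD_BYTES = 40   # RTP (12) + UDP (8) + IPv4 (20) headers
PACKET_INTERVAL_MS = 20   # assumed packetization interval


def kbits_per_second(bytes_per_message, interval_ms):
    """Bandwidth in kbit/s for one message every interval_ms."""
    messages_per_second = 1000 / interval_ms
    return bytes_per_message * 8 * messages_per_second / 1000


media = kbits_per_second(
    RTP_PAYLOAD_BYTES + RTP_OVERHEAD_BYTES, PACKET_INTERVAL_MS
)
notify_slow = kbits_per_second(NOTIFY_BYTES, 100)  # update every 100 ms
notify_fast = kbits_per_second(NOTIFY_BYTES, 50)   # update every 50 ms

print(f"media:           {media:.0f} kbit/s")
print(f"NOTIFY @ 100 ms: {notify_slow:.0f} kbit/s")
print(f"NOTIFY @ 50 ms:  {notify_fast:.0f} kbit/s")
```
]]>
</artwork>
<postamble>
Under these assumptions the media stream uses 80 kbit/s, while
NOTIFY-based updates would use 48 kbit/s at one update every
100 ms and 96 kbit/s at one update every 50 ms.
</postamble>
</figure>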
<t>
It is probably also worth mentioning that the use of
<xref target="RFC4575">RFC 4575</xref> for such a feature
would make the mechanism incompatible with non-SIP signaling
protocols like, for example,
<xref target='RFC3920'>XMPP</xref> and its Jingle extensions.
</t>
</section>
<section title='Various RTP Extensions'>
<t>
The sound levels of different human voices in a conversation
are exactly the kind of fast-changing information that RTP
seems well suited to carry. Additionally, RTP syntax,
through the CSRC list in the RTP packet header and one or
more SDES RTCP packets, already allows a mixer to specify
the identities of the users whose voices were aggregated in
a mixed stream. It therefore seems natural to consider an
extension to RTP as a possible approach for carrying such
information.
</t>
<t>
A first option for extending RTP is to define an RTP header
extension as specified in <xref target='RFC3550'>RFC
3550</xref> that would allow encoding sound level indicators
for each element of the CSRC list. The main advantage of
such an approach would be its very small bandwidth
overhead; however, the RTP header extension mechanism was
initially meant only for experimentation and its use for
specifying new features is explicitly discouraged.
<list style='empty'>
<t>
A possible workaround for such a limitation could be the
definition of that extension in a new RTP profile, in
turn defined as an extension of the Audio/Video profile
specified in <xref target='RFC3551'>RFC
3551</xref>. However, the complexity this would add to the
profile negotiation process, especially when done with
<xref target='I-D.ietf-mmusic-ice'>ICE</xref>, makes the
approach overkill for the goal it tries to achieve.
</t>
</list>
</t>
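<figure>
<preamble>
A minimal sketch, in Python, of what such a header extension
could look like. It follows the generic layout of RTP header
extensions in RFC 3550: a 16-bit profile-defined identifier and
a 16-bit length counted in 32-bit words, followed by the
extension data. The identifier value, the one-byte-per-CSRC
encoding and the function names are hypothetical and serve only
as an illustration; a sender would also have to set the X bit
in the RTP header.
</preamble>
<artwork>
<![CDATA[
```python
import struct

# Hypothetical identifier; RFC 3550 leaves this field profile-defined.
HYPOTHETICAL_EXT_ID = 0x5C1C


def pack_level_extension(levels):
    """Pack one level byte (0-255) per CSRC into a header extension.

    The levels must be in the same order as the packet's CSRC list.
    """
    body = bytes(levels)
    padding = (-len(body)) % 4         # pad body to a 32-bit boundary
    body += b"\x00" * padding
    length_in_words = len(body) // 4   # excludes the 4-byte ext. header
    header = struct.pack("!HH", HYPOTHETICAL_EXT_ID, length_in_words)
    return header + body


def unpack_level_extension(extension, csrc_count):
    """Return the per-CSRC levels from an extension built as above."""
    ext_id, length_in_words = struct.unpack("!HH", extension[:4])
    if ext_id != HYPOTHETICAL_EXT_ID:
        raise ValueError("not a sound level extension")
    body = extension[4:4 + 4 * length_in_words]
    return list(body[:csrc_count])


ext = pack_level_extension([6, 1, 0, 3, 9])
print(len(ext))                        # bytes added to the packet
print(unpack_level_extension(ext, 5))
```
]]>
</artwork>
<postamble>
With this hypothetical layout, levels for up to four
contributors add 8 bytes to a packet, and the five levels in
the example add 12.
</postamble>
</figure>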
<t>
Alternatively, the syntax needed for encoding sound level
indicators for the participants in an audio conference can
be specified as a new payload type for the RTP Audio/Video
profile defined in <xref target='RFC3551'>RFC
3551</xref>. The drawback of such an approach is the
significant increase in the number of RTP packets it would
generate: even though the amount of additional information
is very small, encoding it in a new payload would require a
separate RTP packet for each update (which, for a decent
user experience, should happen several times per second).
</t>
</section>
<section title='Extending the Role of the CSRC Identifiers in RTP'>
<t>
The <xref target="RFC3550">RTP</xref> specification defines
a Synchronization Source (SSRC) identifier. SSRCs are used
by every RTP source (e.g. every participant in a conference
call) and they are meant to be globally unique within a
particular RTP Session. Again, according to the
specification, mixers are expected to record the SSRC
identifiers of all contributing streams as a list of CSRC
identifiers in the RTP packets transporting the resulting
combined stream. In the case of a conference call this
would mean that if the mixer is respecting the above, every
participant would receive the SSRC identifier of every other
active participant.
</t>
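<figure>
<preamble>
The sketch below, in Python, shows how a receiver could read
the SSRC and the CSRC list from the fixed RTP header laid out
in RFC 3550. The function name and the sample identifier values
are made up for the example; header extensions and the payload
are ignored.
</preamble>
<artwork>
<![CDATA[
```python
import struct


def parse_ssrc_and_csrcs(packet):
    """Return (ssrc, [csrc, ...]) read from a raw RTP packet."""
    if len(packet) < 12:
        raise ValueError("shorter than the 12-byte fixed RTP header")
    first_byte = packet[0]
    if first_byte >> 6 != 2:
        raise ValueError("not an RTP version 2 packet")
    csrc_count = first_byte & 0x0F     # CC field: 0 to 15 contributors
    header_end = 12 + 4 * csrc_count
    if len(packet) < header_end:
        raise ValueError("truncated CSRC list")
    ssrc = struct.unpack("!I", packet[8:12])[0]
    csrcs = struct.unpack(f"!{csrc_count}I", packet[12:header_end])
    return ssrc, list(csrcs)


# Sample packet from a mixer: V=2, CC=2, payload type 0,
# sequence number 1, timestamp 160, made-up SSRC and CSRC values.
header = struct.pack("!BBHII", 0x82, 0, 1, 160, 0x11111111)
packet = header + struct.pack("!II", 0xA11CE, 0xB0B) + b"payload"

ssrc, csrcs = parse_ssrc_and_csrcs(packet)
print(hex(ssrc), [hex(c) for c in csrcs])
```
]]>
</artwork>
</figure>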
<t>
<xref target="RFC4575">RFC 4575</xref> then defines a way of
mapping an SSRC identifier to an actual conference
participant through the &lt;src-id&gt; tag. The mapping
provides a way of determining which are the currently active
(i.e. speaking) conference call participants.
</t>
<t>
A very simple way for a mixer to use the CSRC fields as a
means of transporting sound level indications would be to
extend their meaning over a series of packets rather than a
single one. It could then be specified that the sound level
of a particular participant, represented on a scale of zero
to ten, corresponds to the number of occurrences of its CSRC
identifier in the ten most recent RTP packets received from
the mixer.
</t>
<t>
For example, consider a conference call with four
participants: Alice, Bob, Carol, and Dave. At a certain
point in time Alice has a sound level of 6/10, Bob 1/10,
Carol is silent or in other words 0/10 and Dave has a level
of 3/10. In order to describe this state the mixer could
have sent the last ten RTP packets with the following CSRC
configuration:
</t>
<texttable anchor="tab:differences">
<ttcol width="10%"></ttcol>
<ttcol width="9%"> P1 </ttcol>
<ttcol width="9%"> P2 </ttcol>
<ttcol width="9%"> P3 </ttcol>
<ttcol width="9%"> P4 </ttcol>
<ttcol width="9%"> P5 </ttcol>
<ttcol width="9%"> P6 </ttcol>
<ttcol width="9%"> P7 </ttcol>
<ttcol width="9%"> P8 </ttcol>
<ttcol width="9%"> P9 </ttcol>
<ttcol width="9%"> P10 </ttcol>
<c> Alice </c><c>+</c><c>+</c><c>+</c><c>+</c><c>+</c>
<c>+</c><c></c><c></c><c></c><c></c>
<c> Bob </c> <c></c><c>+</c><c></c><c></c><c></c><c></c>
<c></c><c></c><c></c><c></c>
<c> Carol </c> <c></c><c></c><c></c><c></c><c></c><c></c>
<c></c><c></c><c></c><c></c>
<c> Dave </c> <c></c><c></c><c></c><c></c><c></c><c></c>
<c></c><c>+</c><c>+</c><c>+</c>
<postamble>
A possible representation of a particular sound level
configuration through the presence/absence of CSRC
identifiers in subsequent RTP packets.
</postamble>
</texttable>
<t>
The graphical interface of a user agent involved in such a
conference (like the one sketched in
<xref target='figure-conference-ui'/>) would then display
correct sound levels simply by showing, for each
participant, as many ticks as there were occurrences of the
respective CSRC in the previous ten RTP packets.
</t>
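<figure>
<preamble>
The receiver side of this scheme can be sketched in Python as
follows. The class and method names are hypothetical, and
participant names stand in for the 32-bit CSRC identifiers to
keep the example readable. The sample input reproduces the ten
packets from the table above.
</preamble>
<artwork>
<![CDATA[
```python
from collections import deque

WINDOW = 10  # number of most recent packets considered


class LevelDecoder:
    """Derives 0-10 sound levels from received CSRC lists."""

    def __init__(self):
        # Only the CSRC sets of the last WINDOW packets are kept.
        self._recent = deque(maxlen=WINDOW)

    def on_packet(self, csrcs):
        """Record the CSRC list of a newly received packet."""
        self._recent.append(frozenset(csrcs))

    def level(self, csrc):
        """Count how many of the recent packets listed this CSRC."""
        return sum(1 for listed in self._recent if csrc in listed)


# Packets P1 to P10 from the table.
packets = (
    [{"alice"}, {"alice", "bob"}]
    + [{"alice"}] * 4
    + [set()]
    + [{"dave"}] * 3
)
decoder = LevelDecoder()
for csrcs in packets:
    decoder.on_packet(csrcs)

for name in ("alice", "bob", "carol", "dave"):
    print(f"{name:<6}|{'=' * decoder.level(name):<10}|")
```
]]>
</artwork>
<postamble>
For this input the decoder reports 6 for Alice, 1 for Bob, 0
for Carol and 3 for Dave, matching the levels in the example.
</postamble>
</figure>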
<t>
The algorithm for encoding sound level information this way
is relatively simple. In order to determine whether or not
to include a particular CSRC a mixer should:
<list style='symbols'>
<t>
include the CSRC if the sound level of the participant
in the current packet is greater than the number of
occurrences of that same CSRC in the nine previous
packets;
</t>
<t>
omit the CSRC if the sound level of the participant in
the current packet is lower than or equal to the number
of occurrences of that same CSRC in the nine previous
packets.
</t>
</list>
</t>
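<figure>
<preamble>
The two rules above can be sketched in Python as follows. The
class and method names are hypothetical, participant names
stand in for CSRC identifiers, and the sound levels are assumed
to be already quantized to the zero to ten scale.
</preamble>
<artwork>
<![CDATA[
```python
from collections import deque

WINDOW = 10  # levels are expressed over the ten most recent packets


class LevelEncoder:
    """Decides, per packet, which CSRCs a mixer should list."""

    def __init__(self):
        # csrc -> inclusion flags for the nine previous packets
        self._history = {}

    def csrcs_for_next_packet(self, levels):
        """levels maps each CSRC to its current level (0 to 10)."""
        included = []
        for csrc, level in levels.items():
            history = self._history.setdefault(
                csrc, deque(maxlen=WINDOW - 1)
            )
            occurrences = sum(history)
            include = level > occurrences  # otherwise omit the CSRC
            history.append(include)
            if include:
                included.append(csrc)
        return included


encoder = LevelEncoder()
levels = {"alice": 6, "bob": 1, "carol": 0, "dave": 3}
sent = [encoder.csrcs_for_next_packet(levels) for _ in range(30)]

last_ten = sent[-WINDOW:]
for name in levels:
    print(name, sum(name in csrcs for csrcs in last_ten))
```
]]>
</artwork>
<postamble>
With constant input levels, the number of occurrences of each
CSRC in the last ten packets settles at the participant's
level. The exact packets that carry a given CSRC may differ
from the table in this section, which shows only one possible
distribution.
</postamble>
</figure>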
<t>
There are several advantages to using this approach, the
most obvious being its simplicity and the fact that sound
level information is transported together with the parts of
the audio stream that it actually concerns, which should
make synchronization straightforward.
</t>
<t>
The technique would also work with other signaling protocols
using RTP such as <xref target='RFC3920'>XMPP's</xref> Jingle
extensions for example.
</t>
<t>
One of the first disadvantages that comes to mind with this
approach is that a mixer would not be able to indicate a
level in a single packet but would have to distribute it over
a succession of up to ten packets, which would reduce the
reactivity of the representation.
</t>
<t>
It is probably worth mentioning, however, that a granularity
that allows switching from a level of zero to ten and back to
zero again instantly is not of much use anyway, since such UI
updates would be barely perceptible to the user. Still, this
is a UI decision, and making it at the protocol level may
bring some inconveniences.
</t>
<t>
Another possible problem would come from implementations that
use CSRC presence in a binary way to determine the current
speaker. When running against a mixer that supports sound
level indication, such implementations may appear jumpy, as
the participants that they designate as active may change
status too rapidly.
</t>
</section>
</section>
<section title='Security Considerations'>
<t>
<list style='numbers'>
<t>
A man-in-the-middle (MITM) could modify sound level
indicators and make participants believe that someone is
speaking when they actually are not.
</t>
<t>
Should some authentication method be used to prevent this?
</t>
<t>
Could this break compatibility with SRTP?
</t>
</list>
</t>
</section>
</middle>
<back>
<references title='Informative References'>
<reference anchor='RFC4353'>
<front>
<title>A Framework for Conferencing with the Session
Initiation Protocol (SIP)</title>
<author initials='J.' surname='Rosenberg' fullname='J. Rosenberg'>
<organization />
</author>
<date year='2006' month='February' />
<abstract>
<t>The Session Initiation Protocol (SIP) supports the
initiation, modification, and termination of media
sessions between user agents. These sessions are managed
by SIP dialogs, which represent a SIP relationship between
a pair of user agents. Because dialogs are between pairs
of user agents, SIP's usage for two-party communications
(such as a phone call), is obvious. Communications
sessions with multiple participants, generally known as
conferencing, are more complicated. This document defines
a framework for how such conferencing can occur. This
framework describes the overall architecture, terminology,
and protocol components needed for multi-party
conferencing. This memo provides information for the
Internet community.</t>
</abstract>
</front>
<seriesInfo name='RFC' value='4353' />
<format type='TXT' octets='67405'
target='ftp://ftp.isi.edu/in-notes/rfc4353.txt' />
</reference>
<reference anchor='RFC4575'>
<front>
<title>A Session Initiation Protocol (SIP) Event Package for
Conference State</title>
<author initials='J.' surname='Rosenberg' fullname='J. Rosenberg'>
<organization />
</author>
<author initials='H.' surname='Schulzrinne' fullname='H. Schulzrinne'>
<organization />
</author>
<author initials='O.' surname='Levin' fullname='O. Levin'>
<organization />
</author>
<date year='2006' month='August' />
<abstract>
<t>This document defines a conference event package for
tightly coupled conferences using the Session Initiation
Protocol (SIP) events framework, along with a data format
used in notifications for this package. The conference
package allows users to subscribe to a conference Uniform
Resource Identifier (URI). Notifications are sent about
changes in the membership of this conference and
optionally about changes in the state of additional
conference components. [STANDARDS TRACK]</t>
</abstract>
</front>
<seriesInfo name='RFC' value='4575' />
<format type='TXT' octets='97484'
target='ftp://ftp.isi.edu/in-notes/rfc4575.txt' />
</reference>
<reference anchor='RFC3550'>
<front>
<title>RTP: A Transport Protocol for Real-Time
Applications</title>
<author initials='H.' surname='Schulzrinne' fullname='H. Schulzrinne'>
<organization />
</author>
<author initials='S.' surname='Casner' fullname='S. Casner'>
<organization />
</author>
<author initials='R.' surname='Frederick' fullname='R. Frederick'>
<organization />
</author>
<author initials='V.' surname='Jacobson' fullname='V. Jacobson'>
<organization />
</author>
<date year='2003' month='July' />
<abstract>
<t>This memorandum describes RTP, the real-time
transport protocol. RTP provides end-to-end network
transport functions suitable for applications transmitting
real-time data, such as audio, video or simulation data,
over multicast or unicast network services. RTP does not
address resource reservation and does not guarantee
quality-of- service for real-time services. The data
transport is augmented by a control protocol (RTCP) to
allow monitoring of the data delivery in a manner scalable
to large multicast networks, and to provide minimal
control and identification functionality. RTP and RTCP are
designed to be independent of the underlying transport and
network layers. The protocol supports the use of RTP-level
translators and mixers. Most of the text in this
memorandum is identical to RFC 1889 which it obsoletes.
There are no changes in the packet formats on the wire,
only changes to the rules and algorithms governing how the
protocol is used. The biggest change is an enhancement to
the scalable timer algorithm for calculating when to send
RTCP packets in order to minimize transmission in excess
of the intended rate when many participants join a session
simultaneously. [STANDARDS TRACK]</t>
</abstract>
</front>
<seriesInfo name='STD' value='64' />
<seriesInfo name='RFC' value='3550' />
<format type='TXT' octets='259985'
target='ftp://ftp.isi.edu/in-notes/rfc3550.txt' />
<format type='PS' octets='630740'
target='ftp://ftp.isi.edu/in-notes/rfc3550.ps' />
<format type='PDF' octets='504117'
target='ftp://ftp.isi.edu/in-notes/rfc3550.pdf' />
</reference>
<reference anchor='RFC3920'>
<front>
<title abbrev='XMPP Core'>Extensible Messaging and Presence
Protocol (XMPP): Core</title>
<author initials='P.' surname='Saint-Andre' fullname='Peter Saint-Andre'
role='editor'>
<organization>Jabber Software Foundation</organization>
<address>
<email>stpeter@jabber.org</email>
</address>
</author>
<date year='2004' month='October' />
<area>Applications</area>
<workgroup>XMPP Working Group</workgroup>
<keyword>Presence</keyword>
<keyword>XML</keyword>
<keyword>Extensible Markup Language</keyword>
<abstract>
<t>This memo defines the core features of the Extensible
Messaging and Presence Protocol (XMPP), a protocol for
streaming Extensible Markup Language (XML) elements in
order to exchange structured information in close to real
time between any two network endpoints. While XMPP
provides a generalized, extensible framework for
exchanging XML data, it is used mainly for the purpose of
building instant messaging and presence applications
that meet the requirements of RFC 2779.</t>
</abstract>
</front>
</reference>
<reference anchor="RFC3551">
<front>
<title>
RTP Profile for Audio and Video Conferences with Minimal
Control
</title>
<author initials="H." surname="Schulzrinne"
fullname="H. Schulzrinne">
<organization/>
</author>
<author initials="S." surname="Casner"
fullname="S. Casner">
<organization/>
</author>
<date year="2003" month="July"/>
</front>
<seriesInfo name="STD" value="65"/>
<seriesInfo name="RFC" value="3551"/>
<format type="TXT" octets="106621"
target="ftp://ftp.isi.edu/in-notes/rfc3551.txt"/>
<format type="PS" octets="317286"
target="ftp://ftp.isi.edu/in-notes/rfc3551.ps"/>
<format type="PDF" octets="237831"
target="ftp://ftp.isi.edu/in-notes/rfc3551.pdf"/>
</reference>
<reference anchor='I-D.ietf-mmusic-ice'>
<front>
<title>
Interactive Connectivity Establishment (ICE): A
Methodology for Network Address Translator (NAT) Traversal
for Offer/Answer Protocols
</title>
<author initials='J' surname='Rosenberg'
fullname='Jonathan Rosenberg'>
<organization />
</author>
<date month='October' day='29' year='2007' />
</front>
<seriesInfo name='Internet-Draft'
value='draft-ietf-mmusic-ice-19' />
<format type='TXT'
target='http://www.ietf.org/internet-drafts/draft-ietf-mmusic-ice-19.txt' />
</reference>
</references>
</back>
</rfc>