Network Working Group E. Ivov
Internet-Draft SIP Communicator
Intended status: Informational E. Marocco
Expires: November 23, 2009 Telecom Italia
May 22, 2009
Dispatching Sound Level Indicators in Conferences (Problem Statement)
draft-ivov-dispatch-slic-ps-00
Status of this Memo
This Internet-Draft is submitted to IETF in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
This Internet-Draft will expire on November 23, 2009.
Copyright Notice
Copyright (c) 2009 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents in effect on the date of
publication of this document (http://trustee.ietf.org/license-info).
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document.
Ivov & Marocco Expires November 23, 2009 [Page 1]
Internet-Draft Sound Level Indicators in Conferences May 2009
Abstract
The Conferencing Framework described in RFC 4353 defines the
semantics necessary for conducting conference calls with the session
initiation protocol. It also introduces a mixer entity responsible
for combining all media streams and delivering them to the
participants of the call. This document presents the lack of a
standardized way for such mixers to deliver information about the
audio activity (sound level) of participants in a conference call.
The document describes the problem and discusses a few possible ways
of transporting such information.
Table of Contents
1. The Problem
2. Possible Approaches
2.1. An Extension to the Conference State Event Package for SIP
2.2. Various RTP Extensions
2.3. Extending the Role of the CSRC Identifiers in RTP
3. Security Considerations
4. Informative References
Authors' Addresses
1. The Problem
The Framework for Conferencing with the Session Initiation Protocol
defined in RFC 4353 [RFC4353] presents an overall architecture for
multi-party conferencing. Among other things, the framework borrows
from RTP [RFC3550] and extends the concept of a mixer entity
"responsible for combining the media streams that make up a
conference, and generating one or more output streams that are
delivered to recipients". Every participant would hence receive, in
a single flat stream, media originating from the others.
Using such centralized mixer-based architectures simplifies support
for conference calls on the client side since they would hardly
differ from one-to-one conversations. However, the method also
introduces a few limitations. The flat nature of the streams that a
mixer would output and send to participants makes it difficult for
users to identify the original source of what they are hearing.
The IETF has already defined mechanisms (e.g. the CSRC fields in RTP
[RFC3550]) that allow the mixer to send participants cues about the
current speakers, but these only provide a binary speaking/silent
indication. There are still a number of use cases that require more
detailed information. Possible examples include the presence of
background chat, noise, music, or typing, someone breathing noisily
into their microphone, or other cases where identifying the source of
the disturbance would make it easy to remove it (e.g. by sending a
private IM to the concerned party asking them to mute their
microphone).
One way of presenting such information in a user-friendly manner
would be for a conferencing client to attach sound level indicators
to the corresponding participant-related components in the user
interface, as displayed in Figure 1.
 ------------------------
|                        |
| 00:42 |  Weekly Call   |
|                        |
|------------------------|
|                        |
| Alice |======    | (S) |
|                        |
| Bob   |=         |     |
|                        |
| Carol |          | (M) |
|                        |
| Dave  |===       |     |
|                        |
|________________________|
Delivering detailed speaker information to the user by displaying
sound level for every participant.
Figure 1
Implementing a user interface like the one above on the client side,
however, would be difficult (if possible at all) since, as already
mentioned, conference participants generally receive a single, flat
audio stream and therefore have no immediate way of determining
per-participant sound levels based solely on the media. With today's
common conferencing solutions the mixer is the only party aware of
such information. It therefore seems like a logical next step to
determine the best way for a mixer to deliver such information to
conference participants.
The rest of this document investigates existing IETF mechanisms that
could be extended in order to allow for a way to transport sound
level information.
2. Possible Approaches
This section examines various existing mechanisms and their possible
use for transporting participant sound level indicators.
2.1. An Extension to the Conference State Event Package for SIP
RFC 4575 [RFC4575] defines a conference event package for tightly
coupled conferences using the Session Initiation Protocol (SIP)
events framework. It allows for the delivery of various conference
related details such as conference descriptions, participant count
and identity. The document also provides a way of indicating who the
speakers are at any given moment by specifying a mechanism for
mapping conference participants to RTP SSRC/CSRC identifiers. All
these details are dispatched in an asynchronous manner using the SIP
events framework, or, in other words, through NOTIFY SIP requests
following an initial SUBSCRIBE from a participant. It may therefore
seem logical to try and extend the framework by adding the syntax
necessary to convey sound levels.
Further thought on the subject, however, raises a number of issues
with such an approach. Sound level in human speech is a very
time-sensitive characteristic which would require frequent updates
(i.e. approximately once every 50-100 ms). In order for the update
of the user interface to appear "natural" to the user, sound level
information would probably have to be delivered after every one or
two RTP packets. Using RFC 4575 [RFC4575] or SIP in general for this
would generate traffic on the (often low-bandwidth) signaling path
comparable to, if not exceeding, the media itself.
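As a rough illustration of the orders of magnitude involved, the
following sketch computes the signaling bitrate that periodic NOTIFY
requests would generate towards a single subscriber. The 600-byte
message size is purely an assumed figure chosen for illustration
(real NOTIFY sizes depend on the headers and the XML body), and the
function name is hypothetical. The 64 kbit/s reference value is the
nominal G.711 codec rate.

```python
# Hypothetical back-of-the-envelope estimate; the message size is an
# assumption made for illustration, not a measured value.
ASSUMED_NOTIFY_BYTES = 600   # assumed size of one NOTIFY, headers + body
G711_KBPS = 64.0             # nominal G.711 audio payload rate


def notify_bitrate_kbps(message_bytes, interval_ms):
    """Bitrate in kbit/s of one message sent every interval_ms."""
    updates_per_second = 1000 / interval_ms
    return message_bytes * 8 * updates_per_second / 1000


# Update intervals taken from the 50-100 ms range mentioned in the text.
slow = notify_bitrate_kbps(ASSUMED_NOTIFY_BYTES, 100)
fast = notify_bitrate_kbps(ASSUMED_NOTIFY_BYTES, 50)
print(f"one update per 100 ms: {slow} kbit/s")
print(f"one update per  50 ms: {fast} kbit/s")
print(f"G.711 audio payload:   {G711_KBPS} kbit/s")
```

Under these assumptions the signaling load per subscriber falls in
the same range as the audio payload, and exceeds it at the faster
update rate.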
It is probably also worth mentioning that the use of RFC 4575
[RFC4575] for such a feature would make the mechanism incompatible
with non-SIP signaling protocols like, for example, XMPP [RFC3920]
and its Jingle extensions.
2.2. Various RTP Extensions
The sound levels of different human voices in a conversation are
exactly the kind of rapidly changing information that RTP seems well
suited to carry. Additionally, RTP syntax, through the CSRC list in
the RTP packet header and one or more SDES RTCP packets, already
allows a mixer to specify the identities of the users whose voices
were aggregated in a mixed stream. It thus seems natural to consider
an extension to RTP as a possible approach for carrying such
information.
A first option for extending RTP is to define an RTP header extension
as specified in RFC 3550 [RFC3550] that would allow encoding sound
level indicators for each element of the CSRC list. The main
advantage of such an approach would be its very small bandwidth
overhead; however, the RTP header extension mechanism was initially
meant only for experimentation and its use for specifying new
features is explicitly discouraged.
A possible workaround for this limitation could be to define the
extension in a new RTP profile, itself defined as an extension of
the Audio/Video profile specified in RFC 3551 [RFC3551]. However,
the complexity this would introduce in the profile negotiation
process, especially when done with ICE [I-D.ietf-mmusic-ice], makes
the approach overkill for the goal it tries to achieve.
Alternatively, the syntax needed for encoding sound level indicators
for the participants in an audio conference can be specified as a new
payload type for the RTP Audio/Video profile defined in RFC 3551
[RFC3551]. The drawback of such an approach is the significant
increase in the number of RTP packets it would generate: even if the
amount of additional information were very small, encoding it in a
new payload would require a separate RTP packet for each update
(which, for a decent user experience, should happen several times
per second).
2.3. Extending the Role of the CSRC Identifiers in RTP
The RTP [RFC3550] specification defines a Synchronization Source
(SSRC) identifier. SSRCs are used by every RTP source (e.g. every
participant in a conference call) and they are meant to be globally
unique within a particular RTP Session. Again, according to the
specification, mixers are expected to record the SSRC identifiers of
all contributing streams as a list of CSRC identifiers in the RTP
packets transporting the resulting combined stream. In the case of a
conference call this would mean that, if the mixer follows the
specification, every participant would receive the SSRC identifier
of every other active participant.
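For reference, the sketch below shows how a receiver could read the
CSRC list from a mixed RTP packet. It relies only on the fixed header
layout of RFC 3550 (a 4-bit CSRC count in the first octet, the SSRC
at octets 8-11, and the 32-bit CSRC entries following the 12-octet
fixed header). The function name and the sample identifier values are
hypothetical.

```python
import struct

RTP_FIXED_HEADER_LEN = 12  # octets, per RFC 3550


def parse_ssrc_and_csrcs(packet):
    """Return (ssrc, [csrc, ...]) from a raw RTP packet (hypothetical helper)."""
    if len(packet) < RTP_FIXED_HEADER_LEN:
        raise ValueError("packet shorter than the RTP fixed header")
    first_octet = packet[0]
    if first_octet >> 6 != 2:
        raise ValueError("not an RTP version 2 packet")
    csrc_count = first_octet & 0x0F          # CC field, 0..15
    end = RTP_FIXED_HEADER_LEN + 4 * csrc_count
    if len(packet) < end:
        raise ValueError("truncated CSRC list")
    ssrc = struct.unpack_from("!I", packet, 8)[0]
    csrcs = [
        struct.unpack_from("!I", packet, RTP_FIXED_HEADER_LEN + 4 * i)[0]
        for i in range(csrc_count)
    ]
    return ssrc, csrcs


# Sample packet: V=2, CC=2, PT=0, seq=1, timestamp=160, made-up identifiers.
sample = struct.pack("!BBHII", 0x80 | 2, 0, 1, 160, 0x11111111)
sample += struct.pack("!II", 0xAAAA0001, 0xDDDD0004)
ssrc, csrcs = parse_ssrc_and_csrcs(sample)
print(hex(ssrc), [hex(c) for c in csrcs])
```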
RFC 4575 [RFC4575] then defines a way of mapping an SSRC identifier
to an actual conference participant through the <src-id> tag. The
mapping provides a way of determining which are the currently active
(i.e. speaking) conference call participants.
A very simple way for a mixer to use the CSRC fields as a means of
transporting sound level indications would be to extend their meaning
over a series of packets rather than a single one. It could then be
specified that the sound level of a particular participant,
represented on a scale of zero to ten, corresponds to the number of
occurrences of its CSRC identifier in the ten most recent RTP packets
received from the mixer.
For example, consider a conference call with four participants:
Alice, Bob, Carol, and Dave. At a certain point in time Alice has a
sound level of 6/10, Bob 1/10, Carol is silent (in other words,
0/10), and Dave has a level of 3/10. In order to describe this state
the mixer could have sent the last ten RTP packets with the following
CSRC configuration:
+-------+----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 |
+-------+----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Alice | + | + | + | + | + | + | | | | |
| | | | | | | | | | | |
| Bob | | + | | | | | | | | |
| | | | | | | | | | | |
| Carol | | | | | | | | | | |
| | | | | | | | | | | |
| Dave | | | | | | | | + | + | + |
+-------+----+-----+-----+-----+-----+-----+-----+-----+-----+------+
A possible representation of a particular sound level configuration
through the presence/absence of CSRC identifiers in subsequent RTP
packets.
Table 1
The graphical interface of a user agent involved in such a conference
(like the one sketched in Figure 1) would then display correct sound
levels simply by showing, for each participant, as many ticks as
there were occurrences of the respective CSRC in the previous ten RTP
packets.
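A minimal sketch of this receiver-side counting is given below. It
assumes the CSRC list of each incoming packet has already been
extracted; the class name is hypothetical, and participant names
stand in for the 32-bit CSRC identifiers. The sample data reproduces
the ten packets of Table 1.

```python
from collections import Counter, deque

WINDOW = 10  # number of most recent packets considered, as in the text


class LevelDecoder:
    """Counts CSRC occurrences over the last WINDOW packets (hypothetical)."""

    def __init__(self, window=WINDOW):
        # The deque silently drops the oldest packet once it is full.
        self._recent = deque(maxlen=window)

    def on_packet(self, csrcs):
        self._recent.append(frozenset(csrcs))

    def levels(self):
        # Level of a participant = number of recent packets listing its CSRC.
        counts = Counter()
        for csrcs in self._recent:
            counts.update(csrcs)
        return counts


# Packets P1..P10 of Table 1.
table1 = [
    {"Alice"}, {"Alice", "Bob"}, {"Alice"}, {"Alice"}, {"Alice"}, {"Alice"},
    set(), {"Dave"}, {"Dave"}, {"Dave"},
]
decoder = LevelDecoder()
for packet_csrcs in table1:
    decoder.on_packet(packet_csrcs)
levels = decoder.levels()
for name in ("Alice", "Bob", "Carol", "Dave"):
    print(f"{name:5} |{'=' * levels[name]}")
```

A participant that never appears in the window, such as Carol here,
simply has a count of zero.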
The algorithm for encoding sound level information this way is
relatively simple. In order to determine whether or not to include a
particular CSRC a mixer should:
o include the CSRC if the sound level of the participant in the
current packet is greater than the number of occurrences of that
same CSRC in the nine previous packets;
o omit the CSRC if the sound level of the participant in the current
packet is lower than or equal to the number of occurrences of
that same CSRC in the nine previous packets.
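The two rules above can be sketched on the mixer side as follows.
This is only an illustration under stated assumptions: the class and
method names are hypothetical, sound levels are assumed to be already
quantized to integers from 0 to 10, and participant names stand in
for CSRC identifiers.

```python
from collections import deque

WINDOW = 10  # levels are expressed over the ten most recent packets


class LevelEncoder:
    """Chooses the CSRCs to list in each outgoing packet (hypothetical)."""

    def __init__(self, window=WINDOW):
        # Only the nine previous packets are needed to decide the next one.
        self._previous = deque(maxlen=window - 1)

    def next_csrcs(self, current_levels):
        """current_levels maps each CSRC to its level (integer 0..10)."""
        chosen = set()
        for csrc, level in current_levels.items():
            seen = sum(1 for past in self._previous if csrc in past)
            if level > seen:        # include rule
                chosen.add(csrc)
            # otherwise (level <= seen) the CSRC is omitted
        self._previous.append(frozenset(chosen))
        return chosen


# Levels from the example: Alice 6/10, Bob 1/10, Carol 0/10, Dave 3/10.
state = {"Alice": 6, "Bob": 1, "Carol": 0, "Dave": 3}
encoder = LevelEncoder()
sent = [encoder.next_csrcs(state) for _ in range(25)]


def count_last_ten(packets, csrc):
    return sum(1 for p in packets[-WINDOW:] if csrc in p)


for name in state:
    print(name, count_last_ten(sent, name))
```

With constant levels, the number of occurrences in any ten
consecutive packets settles on the intended level. The position of
the identifiers within the window may differ from the one shown in
Table 1 (here Dave's CSRC is listed in the first three packets rather
than the last three), but the counts seen by the receiver are the
same.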
There are several advantages to using this approach, the most obvious
being its simplicity and the fact that sound level information is
transported together with the parts of the audio stream that it
actually concerns, which should make synchronization straightforward.
The technique would also work with other signaling protocols that use
RTP, such as XMPP's [RFC3920] Jingle extensions.
One of the first disadvantages that come to mind with this approach
is the fact that the mixer would not be able to indicate a level in a
single packet but would have to distribute it over a succession of up
to ten packets, which would reduce the responsiveness of the
representation.
It is probably worth mentioning, however, that a granularity that
allows switching from a level of zero to ten and back to zero again
instantly is not of much use anyway, since such UI updates would be
barely perceptible to the user. Still, this is a UI decision, and
making it at the protocol level may bring some inconveniences.
Another possible problem would come from implementations that use
CSRC presence in a binary way to determine the current speaker. When
running against a mixer that supports sound level indication, such
implementations may appear jumpy, as the participants that they
designate as active may change status too rapidly.
3. Security Considerations
1. A MITM could modify sound level indicators and make participants
believe that someone is speaking when they actually are not.
2. Should some authentication method be used to resolve this?
3. Could this break compatibility with SRTP?
4. Informative References
[I-D.ietf-mmusic-ice]
Rosenberg, J., "Interactive Connectivity Establishment
(ICE): A Methodology for Network Address Translator (NAT)
Traversal for Offer/Answer Protocols",
draft-ietf-mmusic-ice-19 (work in progress), October 2007.
[RFC3550] Schulzrinne, H., Casner, S., Frederick, R., and V.
Jacobson, "RTP: A Transport Protocol for Real-Time
Applications", STD 64, RFC 3550, July 2003.
[RFC3551] Schulzrinne, H. and S. Casner, "RTP Profile for Audio and
Video Conferences with Minimal Control", STD 65, RFC 3551,
July 2003.
[RFC3920] Saint-Andre, P., Ed., "Extensible Messaging and Presence
Protocol (XMPP): Core", RFC 3920, October 2004.
[RFC4353] Rosenberg, J., "A Framework for Conferencing with the
Session Initiation Protocol (SIP)", RFC 4353,
February 2006.
[RFC4575] Rosenberg, J., Schulzrinne, H., and O. Levin, "A Session
Initiation Protocol (SIP) Event Package for Conference
State", RFC 4575, August 2006.
Authors' Addresses
Emil Ivov
SIP Communicator
Strasbourg 67000
France
Email: emcho@sip-communicator.org
Enrico Marocco
Telecom Italia
Via G. Reiss Romoli, 274
Turin 10148
Italy
Email: enrico.marocco@telecomitalia.it