<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rfc SYSTEM 'rfcXXXX.dtd'>

<rfc category='info' ipr='trust200902' docName='draft-ivov-avt-slic-00'>

<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>

<?rfc toc='yes' ?>
<?rfc symrefs='yes' ?>
<?rfc sortrefs='yes'?>
<?rfc iprnotified='no' ?>
<?rfc strict='yes' ?>
<?rfc compact='yes' ?>
  <front>

    <title abbrev='Sound Level Indicators in Conferences'>
      Delivering Conference Participant Sound Level Indicators in RTP
      Streams
    </title>

    <author initials='E.' surname='Ivov' fullname='Emil Ivov'>
      <organization abbrev='SIP Communicator'>SIP Communicator</organization>
      <address>
        <postal>
          <street></street>
          <city>Strasbourg</city>
          <code>67000</code>
          <country>France</country>
        </postal>
        <email>emcho@sip-communicator.org</email>
      </address>
    </author>
    <author initials='E.' surname='Marocco' fullname='Enrico Marocco'>
      <organization>Telecom Italia</organization>
      <address>
        <postal>
          <street>Via G. Reiss Romoli, 274</street>
          <city>Turin</city>
          <code>10148</code>
          <country>Italy</country>
        </postal>
        <email>enrico.marocco@telecomitalia.it</email>
      </address>
    </author>

    <date month='June' year='2009' />

    <abstract>
      <t>
        This document describes a mechanism that allows RTP-level
        mixers in audio conferences to deliver sound level information
        about the individual participants. Such sound level indicators
        are transported in the same RTP packets as the audio data they
        pertain to.
      </t>
    </abstract>
  </front>

  <middle>
    <section title='Introduction'>
      <t>
        The Framework for Conferencing with the Session Initiation
        Protocol (SIP) defined in
        <xref target="RFC4353">RFC 4353</xref>
        presents an overall architecture for multi-party conferencing.
        Among other things, the framework borrows from
        <xref target="RFC3550">RTP</xref>
        and extends the concept of a mixer entity "responsible for
        combining the media streams that make up a conference, and
        generating one or more output streams that are delivered to
        recipients". Every participant would hence receive, in a single
        flat stream, media originating from all the others.
      </t>
      <t>
        Using such centralized mixer-based architectures simplifies
        support for conference calls on the client side, since such
        calls hardly differ from one-to-one conversations. However, the
        method also introduces a few limitations. The flat nature of
        the streams that a mixer outputs and sends to participants
        makes it difficult for users to identify the original source of
        what they are hearing.
      </t>
      <t>
        Mechanisms that allow the mixer to send participants cues about
        current speakers (e.g. the CSRC fields in
        <xref target='RFC3550'>RTP</xref>) only provide binary
        speaking/silent indications. There are, however, a number of use
        cases that require more detailed information. Possible
        examples include the presence of background
        chat/noise/music/typing, someone breathing noisily into their
        microphone, or other cases where identifying the source of the
        disturbance would make it easy to remove it (e.g. by sending a
        private IM to the concerned party asking them to mute their
        microphone). A more advanced scenario could involve an intense
        discussion between multiple participants whom the user does not
        personally know. Sound level information would help the user
        recognize the speakers by associating them with more complex
        (but still human-readable) characteristics such as loudness and
        speed.
      </t>
      <t>
        One way of presenting such information in a user-friendly
        manner would be for a conferencing client to attach sound level
        indicators to the corresponding participant-related components
        in the user interface, as shown in
        <xref target='figure-conference-ui' />.
      </t>
      <figure anchor="figure-conference-ui">
        <artwork>
<![CDATA[
                      ------------------------
                     |                        |
                     |  00:42 |  Weekly Call  |
                     |                        |
                     |------------------------|
                     |                        |
                     | Alice |======    | (S) |
                     |                        |
                     | Bob   |=         |     |
                     |                        |
                     | Carol |          | (M) |
                     |                        |
                     | Dave  |===       |     |
                     |                        |
                     |________________________|
]]>
        </artwork>
        <postamble>
          Displaying detailed speaker information to the user by
          including sound level for every participant.
        </postamble>
      </figure>
      <t>
        Implementing a user interface like the above requires analysis
        of the media sent by the other participants. In a conventional
        audio conference this is only possible for the mixer, since all
        other conference participants generally receive a single,
        flat audio stream and therefore have no immediate way of
        determining individual sound levels.
      </t>
      <t>
        This document specifies an RTP extension header that allows such
        mixers to deliver sound level information to conference
        participants by including it directly in the RTP packets
        transporting the corresponding audio data.
      </t>
    </section>
    <section title="Terminology">
      <t>
        The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
        NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
        "OPTIONAL" in this document are to be interpreted as described
        in <xref target="RFC2119">RFC 2119</xref>.
      </t>
    </section>
    <section title='Protocol Operation'>
      <t>
        According to <xref target='RFC3550'>RFC 3550</xref>, a mixer
        is expected to include in outgoing RTP packets a list of
        identifiers (CSRC IDs) indicating the sources that contributed
        to the resulting stream. The presence of such CSRC IDs allows an
        RTP client to determine, in a binary way, the active speaker(s)
        at any given moment. RTCP also provides a basic mechanism to map
        the CSRC IDs to user identities through the CNAME field. More
        advanced mechanisms may exist, depending on the signaling
        protocol used to establish and control a conference. In the case
        of the <xref target="RFC3261">Session Initiation Protocol</xref>,
        for example, the <xref target="RFC4575">Event Package for
        Conference State</xref> defines a &lt;src-id&gt; tag which binds
        CSRC IDs to media streams and SIP URIs.
      </t>
      <t>
        This document describes an RTP header extension that allows
        mixers to indicate the sound-level of every conference
        participant (CSRC) in addition to simply indicating their
        on/off status. This new header extension is based on the
        <xref target="RFC5285"> "General Mechanism for RTP Header
        Extensions"</xref>.
       </t>
       <t>
        Each instance of this header contains a list of one-octet
        sound level values (see <xref target='hdr-fmt'/>). Such values
        indicate sound level on a 0 to 255 scale, where 0 is silence
        (i.e. equivalent to omitting the corresponding source ID from
        the CSRC list) and 255 corresponds to a threshold accepted by
        the mixer implementation as the maximum sound level that a
        participant is likely to reach during a conference.
      </t>
      <t>
        Every sound level value pertains to the CSRC identifier
        located at the corresponding position in the CSRC list. In other
        words, the first value would indicate the sound level of the
        conference participant represented by the first CSRC identifier
        in that packet and so forth. The number and order of these
        values MUST therefore match the number and order of the CSRC
        IDs present in the same packet.
       </t>
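       <t>
        The positional rule above can be sketched in a few lines of
        Python. This is only an illustration under stated assumptions:
        the function name pair_levels and the sample CSRC identifiers
        are hypothetical and are not defined by this specification.
       </t>
       <figure>
        <artwork>
<![CDATA[
```python
def pair_levels(csrc_ids, levels):
    """Map each CSRC identifier to the level at the same position."""
    if len(csrc_ids) != len(levels):
        raise ValueError("number of levels must match number of CSRC IDs")
    if len(csrc_ids) > 15:
        raise ValueError("an RTP header carries at most 15 CSRC IDs")
    for level in levels:
        if not 0 <= level <= 255:
            raise ValueError("sound levels are one-octet values (0-255)")
    return dict(zip(csrc_ids, levels))


# Hypothetical CSRC identifiers for three contributing participants.
mapping = pair_levels([0x1111, 0x2222, 0x3333], [153, 25, 76])
```
]]>
        </artwork>
       </figure>
       <t>
        A receiver following this sketch would discard the extension
        when the two lists differ in length, rather than guess which
        value belongs to which participant.
       </t>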
       <t>
        When encoding sound level information, a mixer SHOULD include in
        a packet the information that corresponds to the audio data
        being transported in that same packet. It is important that
        these values follow the actual stream as closely as possible.
        Therefore, a mixer SHOULD also calculate the values after the
        original contributing stream has undergone any processing such
        as level normalization or noise reduction.
       </t>
       <t>
        Note that in some cases a mixer may be sending an RTP audio
        stream that only contains sound level information and no actual
        audio. One possible reason is to keep a (web) conference
        interface module updated.
       </t>
       <!-- t>
        Absence of sound level information in an RTP packet SHOULD be
        interpreted by receivers as an indication that sound level for
        the CSRC IDs present in the packet remains unchanged as per the
        previous packet and is set to 0 (silence) for all other
        participants. Note however that this mechanism is unreliable
        since the last packet containing a change in the sound level
        may have never reached some receivers and this could lead to
        inconsistencies. It is therefore RECOMMENDED that mixers always
        deliver sound level information when there is at least one
        non-silent party.
       </t -->
       <t>
        It may sometimes happen that a conference involves more than a
        single mixer. In such cases each of the mixers MAY choose to
        relay the CSRC list and sound-level information they receive
        from peer mixers (as long as the total CSRC count remains below
        16). Given that the maximum sound level is not precisely defined
        by this specification, it is likely that in such situations
        average sound levels would be perceptibly different for the
        participants located behind the different mixers.
       </t>
    </section>
    <section title='Header Format' anchor='hdr-fmt'>
      <t>
        The sound level indicators are delivered to the receivers
        in-band using the <xref target='RFC5285'>"General Mechanism for
        RTP Header Extensions"</xref>.  The payload of this extension
        (the transmitted list of sound level values) is a sequence of
        8-bit unsigned integers.

        <figure>
          <preamble>
            The form of the sound level indicators extension block is
            as follows:
          </preamble>
          <artwork>
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  ID   |  len  |    level 1    |    level 2    |    level 3   ...
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
          </artwork>
        </figure>
        The 4-bit len field holds the number of data bytes (i.e. sound
        level values) that follow the one-byte header in this header
        extension element, minus one. A value of zero in this field
        therefore indicates that one byte of data follows. A value of
        15 is not allowed by this specification and MUST NOT be used,
        since the RTP header can carry a maximum of 15 CSRC IDs. The
        maximum value allowed is therefore 14, indicating a sequence of
        15 sound level values.
      </t>
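      <t>
        As an illustration of the layout above, the following Python
        sketch builds and parses a single extension element in the
        one-byte header form. The function names encode_element and
        decode_element are hypothetical, and the local identifier 7 is
        arbitrary. The sketch covers only the element itself and not
        the enclosing RFC 5285 extension block or its padding.
      </t>
      <figure>
        <artwork>
<![CDATA[
```python
def encode_element(ext_id, levels):
    """Build a one-byte-header extension element with sound levels."""
    if not 1 <= ext_id <= 14:
        raise ValueError("one-byte header IDs range from 1 to 14")
    if not 1 <= len(levels) <= 15:
        raise ValueError("between 1 and 15 sound level values allowed")
    # High nibble: local ID. Low nibble: number of data bytes minus one.
    header = (ext_id << 4) | (len(levels) - 1)
    # bytes() rejects values outside 0..255 with a ValueError.
    return bytes([header]) + bytes(levels)


def decode_element(data):
    """Return (ext_id, levels) parsed from a one-byte-header element."""
    ext_id = data[0] >> 4
    count = (data[0] & 0x0F) + 1
    if count > 15:
        raise ValueError("len value 15 is not allowed for this extension")
    if len(data) < 1 + count:
        raise ValueError("element is shorter than its len field claims")
    return ext_id, list(data[1:1 + count])


element = encode_element(7, [153, 25, 76])
# element holds the four bytes 0x72 0x99 0x19 0x4C
```
]]>
        </artwork>
      </figure>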
      <t>
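        Note that use of the two-byte header defined in
        <xref target='RFC5285'>RFC 5285</xref> follows the same rules,
        the only changes being the size of the ID and len fields and
        the fact that, in the two-byte form, the length field carries
        the number of data bytes itself rather than that number minus
        one.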
      </t>
    </section>
    <section title='Signaling Information' anchor='sig-info'>
      <t>
        The URI for declaring the sound level header extension in an SDP
        extmap attribute and mapping it to a local extension header
        identifier is "urn:ietf:params:rtp-hdrext:csrc-sound-level".
        There is no additional setup information needed for this
        extension (i.e. no extensionattributes).
      </t>
      <t>
        An example attribute line in the SDP for a conference might be:
      </t>
      <figure>
        <artwork>
        a=extmap:7 urn:ietf:params:rtp-hdrext:csrc-sound-level
        </artwork>
      </figure>
      <t>
        The above mapping will most often be provided per media stream
        (in the media-level section(s) of SDP, i.e., after an "m=" line)
        or globally if there is more than one stream containing sound
        level indicators in a session.
      </t>
      <t>
        Presence of the above attribute in the SDP description of a
        media stream indicates that some or all RTP packets in that
        stream would contain the sound level information RTP extension
        header.
      </t>
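      <t>
        A receiver needs to learn which local identifier has been
        mapped to the sound level URI before it can recognize the
        extension in incoming packets. The Python sketch below shows
        one way of extracting that mapping from an "extmap" attribute
        line. The function name parse_extmap is hypothetical, and the
        sketch ignores any extension attributes that may follow the
        URI.
      </t>
      <figure>
        <artwork>
<![CDATA[
```python
SOUND_LEVEL_URI = "urn:ietf:params:rtp-hdrext:csrc-sound-level"


def parse_extmap(line):
    """Parse 'a=extmap:<id>[/<direction>] <URI>' into a tuple."""
    prefix = "a=extmap:"
    line = line.strip()
    if not line.startswith(prefix):
        raise ValueError("not an extmap attribute line")
    fields = line[len(prefix):].split()
    if len(fields) < 2:
        raise ValueError("extmap needs an identifier and a URI")
    # The direction, when present, follows the identifier after a '/'.
    ext_id, _, direction = fields[0].partition("/")
    return int(ext_id), direction or None, fields[1]


local_id, direction, uri = parse_extmap(
    "a=extmap:7 urn:ietf:params:rtp-hdrext:csrc-sound-level")
```
]]>
        </artwork>
      </figure>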
      <t>
        Conferencing clients that support sound level indicators and
        have no mixing capabilities SHOULD always include the
        direction parameter in the "extmap" attribute setting it to
        "recvonly".  Conference focus entities with mixing
        capabilities MAY omit the direction or set it to "sendrecv" in
        SDP offers. Such entities SHOULD set it to "sendonly" in SDP
        answers to offers with a "recvonly" parameter and to
        "sendrecv" when answering other "sendrecv" offers.
      </t>
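      <t>
        The answering behavior recommended above for a focus with
        mixing capabilities can be summarized with the following Python
        sketch. The function name focus_answer_direction is
        hypothetical. Treating an omitted direction as "sendrecv" is a
        simplifying assumption of this sketch, and directions that the
        text above does not cover are rejected rather than guessed.
      </t>
      <figure>
        <artwork>
<![CDATA[
```python
def focus_answer_direction(offered):
    """Direction a mixing focus would place in its SDP answer."""
    if offered is None:
        # Assumption of this sketch: no direction means "sendrecv".
        offered = "sendrecv"
    if offered == "recvonly":
        return "sendonly"
    if offered == "sendrecv":
        return "sendrecv"
    raise ValueError("direction not covered by this sketch: " + offered)


answer = focus_answer_direction("recvonly")
```
]]>
        </artwork>
      </figure>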
      <t>
        The following <xref target='client-focus'/> and <xref
        target='focus-focus'/> show two example offer/answer exchanges
        between a conferencing client and a focus, and between two
        conference focus entities.
      </t>
      <figure anchor="client-focus">
        <artwork>
  v=0
  o=alice 2890844526 2890844526 IN IP6 host.example.com
  s=-
  c=IN IP6 host.example.com
  t=0 0
  m=audio 49170 RTP/AVP 0 4
  a=rtpmap:0 PCMU/8000
  a=rtpmap:4 G723/8000
  a=extmap:1/recvonly urn:ietf:params:rtp-hdrext:csrc-sound-level

  v=0
  o=conf-focus 2890844730 2890844730 IN IP6 focus.example.net
  s=-
  i=A Seminar on the session description protocol
  c=IN IP6 focus.example.net
  t=0 0
  m=audio 52543 RTP/AVP 0
  a=rtpmap:0 PCMU/8000
  a=extmap:1/sendonly urn:ietf:params:rtp-hdrext:csrc-sound-level
        </artwork>
        <postamble>
          A client-initiated example SDP offer/answer exchange
          negotiating an audio stream with a one-way flow of sound
          level information.
        </postamble>
      </figure>
      <figure anchor="focus-focus">
        <artwork>
  v=0
  o=fr-focus 2890844730 2890844730 IN IP6 focus.fr.example.net
  s=-
  i=Un seminaire sur le protocole de description des sessions
  c=IN IP6 focus.fr.example.net
  t=0 0
  m=audio 49170 RTP/AVP 0
  a=rtpmap:0 PCMU/8000
  a=extmap:1/sendrecv urn:ietf:params:rtp-hdrext:csrc-sound-level

  v=0
  o=us-focus 2890844526 2890844526 IN IP6 focus.us.example.net
  s=-
  i=A Seminar on the session description protocol
  c=IN IP6 focus.us.example.net
  t=0 0
  m=audio 52543 RTP/AVP 0
  a=rtpmap:0 PCMU/8000
  a=extmap:1/sendrecv urn:ietf:params:rtp-hdrext:csrc-sound-level
        </artwork>
        <postamble>
          An example SDP offer/answer exchange between two conference
          focus entities with mixing capabilities negotiating an audio
          stream with a bidirectional flow of sound level information.
        </postamble>
      </figure>
    </section>
    <section title='Security Considerations'>
      <t>
        <list style='numbers'>
          <t>
            This document defines a means of attributing sound level
            to a particular participant in a conference. An attacker may
            try to modify the content of RTP packets in a way that would
            make sound activity from one participant appear as coming
            from another.
          </t>
          <t>
            Furthermore, the fact that sound level values would not be
            encrypted even in an SRTP session, where header extensions
            are sent in the clear, may be of concern in some cases
            where the activity of a particular participant in a
            conference is confidential.
          </t>
          <t>
            Both of the above are concerns that stem from the design of
            RTP itself. It is therefore important that, according to
            the needs of a particular scenario, implementors and
            deployers consider the use of a lower-level security and
            authentication mechanism.
          </t>
        </list>
      </t>
    </section>
    <section title='IANA Considerations'>
      <t>
        This document defines a new extension URI that, if approved,
        would need to be added to the RTP Compact Header Extensions
        sub-registry of the Real-Time Transport Protocol (RTP)
        Parameters registry, according to the following data:
      </t>
      <figure>
        <artwork>
        Extension URI: urn:ietf:params:rtp-hdrext:csrc-sound-level
        Description:   Sound level indicators
        Contact:       emcho@sip-communicator.org
        Reference:     RFC XXXX
        </artwork>
      </figure>
    </section>
    <section title='Open Issues'>
      <t>
        At the time of writing of this document, the authors have no
        clear view on whether and how the following issues should be
        addressed here:
        <list style='numbers'>
          <t>
            Specific sound level mappings. The current version of this
            specification treats sound level indicators as relative to
            a scale chosen by the mixer. The only constraints are that
            the value 0 corresponds to participant inactivity/silence
            and the value 255 to a level that would appear to users as
            loud but still attainable. It is, however, possible to map
            values to specific levels (e.g. measured in dBm) with the
            purpose of achieving cross-mixer uniformity. An obvious
            tradeoff here is the increased implementation complexity,
            since mixers would have to convert between the specified
            unit and whatever unit they use for internal estimation,
            which could be non-trivial in a number of cases.
          </t>
          <t>
            Sound levels in video streams. This specification allows
            use of sound level values in "silent" audio streams that
            don't otherwise carry any payload thus allowing their
            delivery within systems where the various focus/mixer
            components communicate with each other as conference
            participants. The same train of thought may very well
            justify sound level transport in video streams.
          </t>
        </list>
      </t>
    </section>
    <section title="Acknowledgments">
        <t>
          Roni Even, Ingemar Johansson, and several others provided
          helpful feedback over the dispatch mailing list.
        </t>
        <t>
          SIP Communicator's participation in this specification is
          funded by the NLnet Foundation.
        </t>
    </section>
    <section title='Appendix: An alternative approach'>
        <t>
          The <xref target='I-D.ivov-dispatch-slic-ps'>problem statement
          </xref> preceding this document originally favored a slightly
          different resolution approach that the authors feel may still
          be relevant and therefore worth publishing here.
        </t>
        <t>
          A very simple way for a mixer to use the CSRC fields as a
          transport means for sound level indication would be to extend
          their meaning over a series of packets rather than a single
          one. This way it could be specified that the sound-level of
          a particular participant, represented on a zero to ten scale,
          corresponds to the number of occurrences of its CSRC
          identifier in the ten most recent RTP packets received from
          the mixer.
        </t>
        <t>
          For example, consider a conference call with four
          participants: Alice, Bob, Carol, and Dave. At a certain
          point in time Alice has a sound level of 6/10, Bob 1/10,
          Carol is silent or in other words 0/10 and Dave has a level
          of 3/10. In order to describe this state the mixer could
          have sent the last ten RTP packets with the following CSRC
          configuration:
        </t>
        <texttable anchor="tab:differences">
          <ttcol width="10%"></ttcol>
          <ttcol width="9%"> P1 </ttcol>
          <ttcol width="9%"> P2 </ttcol>
          <ttcol width="9%"> P3 </ttcol>
          <ttcol width="9%"> P4 </ttcol>
          <ttcol width="9%"> P5 </ttcol>
          <ttcol width="9%"> P6 </ttcol>
          <ttcol width="9%"> P7 </ttcol>
          <ttcol width="9%"> P8 </ttcol>
          <ttcol width="9%"> P9 </ttcol>
          <ttcol width="9%"> P10 </ttcol>
          <c> Alice </c><c>+</c><c>+</c><c>+</c><c>+</c><c>+</c>
                        <c>+</c><c></c><c></c><c></c><c></c>
          <c> Bob </c>  <c></c><c>+</c><c></c><c></c><c></c><c></c>
                        <c></c><c></c><c></c><c></c>
          <c> Carol </c> <c></c><c></c><c></c><c></c><c></c><c></c>
                         <c></c><c></c><c></c><c></c>
          <c> Dave </c> <c></c><c></c><c></c><c></c><c></c><c></c>
                        <c></c><c>+</c><c>+</c><c>+</c>
          <postamble>
            A possible representation of a particular sound level
            configuration through the presence/absence of CSRC IDs in
            subsequent RTP packets.
          </postamble>
        </texttable>
        <t>
          The graphical interface of a user agent involved in such a
          conference (like the one sketched in
          <xref target='figure-conference-ui'/>) would then display
          correct sound levels simply by showing, for each participant,
          as many ticks as there were occurrences of the respective
          CSRC in the previous ten RTP packets.
        </t>
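        <t>
          The receiver side of this scheme amounts to a sliding-window
          count. The Python sketch below reproduces the example from
          the table above. The function name levels_from_packets is
          hypothetical, and participant names stand in for the 32-bit
          CSRC identifiers.
        </t>
        <figure>
          <artwork>
<![CDATA[
```python
from collections import Counter

WINDOW = 10  # number of most recent packets taken into account


def levels_from_packets(csrc_lists):
    """Count, per CSRC, its occurrences in the last WINDOW packets."""
    counts = Counter()
    for csrcs in csrc_lists[-WINDOW:]:
        counts.update(set(csrcs))  # a CSRC counts once per packet
    return counts


# CSRC lists of packets P1..P10 as laid out in the table.
packets = (
    [["alice"], ["alice", "bob"]]
    + [["alice"]] * 4
    + [[]]
    + [["dave"]] * 3
)
levels = levels_from_packets(packets)
# levels: alice 6, bob 1, dave 3; carol is absent and reads as 0
```
]]>
          </artwork>
        </figure>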
        <t>
          The algorithm for encoding sound level information this way
          is relatively simple. In order to determine whether or not
          to include a particular CSRC a mixer should:
          <list style='symbols'>
            <t>
              include the CSRC if the sound level of the participant
              in the current packet is greater than the number of
              occurrences of that same CSRC in the nine previous
              packets;
            </t>
            <t>
              omit the CSRC if the sound level of the participant in
              the current packet is lower than or equal to the number
              of occurrences of that same CSRC in the nine previous
              packets.
            </t>
          </list>
        </t>
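        <t>
          The two rules above can be sketched on the mixer side as
          follows. The class name CsrcLevelEncoder and its method are
          hypothetical, levels are expressed on the zero to ten scale
          described earlier, and participant names again stand in for
          CSRC identifiers.
        </t>
        <figure>
          <artwork>
<![CDATA[
```python
from collections import deque


class CsrcLevelEncoder:
    """Decides which CSRCs to list in each outgoing packet."""

    def __init__(self):
        # CSRC sets of the nine previous packets, oldest first.
        self.history = deque(maxlen=9)

    def next_csrc_list(self, levels):
        """levels maps each CSRC to its current level (0 to 10)."""
        included = []
        for csrc, level in levels.items():
            seen = sum(1 for sent in self.history if csrc in sent)
            # Include only while the level exceeds the recent count.
            if level > seen:
                included.append(csrc)
        self.history.append(frozenset(included))
        return included


encoder = CsrcLevelEncoder()
current = {"alice": 6, "bob": 1, "carol": 0, "dave": 3}
sent = [encoder.next_csrc_list(current) for _ in range(25)]
```
]]>
          </artwork>
        </figure>
        <t>
          With levels held constant, as in this usage example, every
          window of ten consecutive packets from the tenth packet
          onward lists each CSRC exactly as many times as its level.
        </t>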
        <t>
          There are several advantages to using this approach, the
          most obvious being its simplicity as well as the fact that
          sound level information is transported together with the
          parts of the audio stream that it actually concerns which
          should make synchronization straightforward.
        </t>
        <t>
          The technique would also work with other signaling protocols
          using RTP such as <xref target='RFC3920'>XMPP's</xref> Jingle
          extensions for example.
        </t>
        <t>
          One of the first disadvantages that come to mind with this
          approach is the fact that a mixer would not be able to
          indicate a level in a single packet but would have to
          distribute it over a succession of up to ten packets, which
          would reduce the responsiveness of the representation.
        </t>
        <t>
          It is probably worth mentioning, however, that a granularity
          that allows switching from a level of zero to ten and back to
          zero again instantly is not of much use anyway, since such UI
          updates would be barely perceptible to the user. Still, this
          is a UI decision, and making it at the protocol level may
          bring some inconveniences.
        </t>
        <t>
          Another possible problem would come from implementations using
          CSRC presence in a binary way to determine the current
          speaker. When running against a mixer that supports sound
          level indication, such implementations may appear to be jumpy,
          as the participants that they designate as active may be
          changing status too rapidly.
        </t>
    </section>
  </middle>
  <back>
    <references title='Normative References'>
      <?rfc include="reference.RFC.2119"?>
      <?rfc include="reference.RFC.3550"?>
      <?rfc include="reference.RFC.5285"?>
    </references>
    <references title='Informative References'>
      <?rfc include="reference.I-D.ietf-mmusic-ice"?>
      <?rfc include="reference.I-D.ivov-dispatch-slic-ps"?>
      <?rfc include="reference.RFC.3261"?>
      <?rfc include="reference.RFC.3551"?>
      <?rfc include="reference.RFC.3920"?>
      <?rfc include="reference.RFC.4353"?>
      <?rfc include="reference.RFC.4575"?>
    </references>
  </back>
</rfc>
