<?xml version='1.0'?>

<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
    <!ENTITY hdrext SYSTEM
      'reference.RFC.5285.xml'>
    <!ENTITY red SYSTEM
      'reference.RFC.2198.xml'>
    <!ENTITY rfc2119 SYSTEM
      'reference.RFC.2119.xml'>
    <!ENTITY rtp SYSTEM
      'reference.RFC.3550.xml'>
    <!ENTITY slic SYSTEM
      'reference.I-D.ietf-avtext-mixer-to-client-audio-level.xml'>
    <!ENTITY srtp SYSTEM
      'reference.RFC.3711.xml'>
    <!ENTITY hdrcrypt SYSTEM
      'reference.I-D.ietf-avtcore-srtp-encrypted-header-ext.xml'>
    <!ENTITY vbr-srtp SYSTEM
      'reference.I-D.perkins-avt-srtp-vbr-audio.xml'>
]>

<?rfc toc="yes" ?>
<?rfc strict="yes" ?>
<?rfc compact="yes" ?>
<?rfc sortrefs="yes" ?>
<?rfc symrefs="yes" ?>

<rfc category='std' ipr='pre5378Trust200902' 
     docName='draft-ietf-avtext-client-to-mixer-audio-level-04'>
    <front>
        <title abbrev='Client-to-Mixer Audio Level Indication'>
        A Real-Time Transport Protocol (RTP) Header Extension for 
        Client-to-Mixer Audio Level Indication
        </title>

        <author initials='J.' surname='Lennox'
                fullname='Jonathan Lennox'
                role="editor">
            <organization abbrev='Vidyo'>
               Vidyo, Inc.
            </organization>
            <address>
               <postal>
                   <street>433 Hackensack Avenue</street>
                   <street>Seventh Floor</street>
                   <city>Hackensack</city> <region>NJ</region>
                   <code>07601</code>
                   <country>US</country>
               </postal>
               <email>jonathan@vidyo.com</email>
            </address>
        </author>

        <author initials='E.' surname='Ivov'
                fullname='Emil Ivov'>
            <organization abbrev='Jitsi'>
               Jitsi
            </organization>
            <address>
               <postal>
                   <street> </street>
                   <city>Strasbourg</city>
                   <code>67000</code>
                   <country>France</country>
               </postal>
               <email>emcho@jitsi.org</email>
            </address>
        </author>

        <author initials='E.' surname='Marocco'
                fullname='Enrico Marocco'>
            <organization abbrev='Telecom Italia'>
               Telecom Italia
            </organization>
            <address>
               <postal>
                   <street>Via G. Reiss Romoli, 274</street>
                   <city>Turin</city>
                   <code>10148</code>
                   <country>Italy</country>
               </postal>
               <email>enrico.marocco@telecomitalia.it</email>
            </address>
        </author>


        <date />
        <area>RAI</area>
        <workgroup>AVT</workgroup>

        <keyword>I-D</keyword>
        <keyword>Internet-Draft</keyword>
        <!-- TODO: more keywords -->

        <abstract>
        <t>This document defines a mechanism by which packets of
        Real-Time Transport Protocol (RTP) audio streams can indicate,
        in an RTP header extension, the audio level of the audio
        sample carried in the RTP packet.  In large conferences, this
        can reduce the load on an audio mixer or other middlebox which
        wants to forward only a few of the loudest audio streams,
        without requiring it to decode and measure every stream that
        is received.</t>
        </abstract>

    </front>

<middle>

<section title='Introduction' anchor='introduction'>

<t>In a centralized <xref target='RFC3550'>Real-Time Transport Protocol
(RTP)</xref> audio conference, an audio mixer or
  forwarder receives audio streams from many or all of the conference
  participants.  It then selectively forwards some of them to
  other participants in the conference.  In
  large conferences, it is possible that such a server might be
  receiving a large number of streams, of which only a few are intended to be
  forwarded to the other conference participants.</t>

<t>In such a scenario, in order to pick the audio streams to forward,
  a centralized server needs to decode, measure audio levels, and
  possibly perform voice activity detection on audio data from a large
  number of streams.  The need for such processing limits the size or
  number of conferences such a server can support.</t>

<t>As an alternative, this document defines
  an <xref target='RFC5285'>RTP header extension</xref> through which
  senders of audio packets can indicate the audio level of the
  packets' payload, reducing the processing load for a server.
</t>

<t>The header extension in this draft is different from, but
  complementary to, the one defined in
  <xref target='I-D.ietf-avtext-mixer-to-client-audio-level' />, which
  defines a mechanism by which audio mixers can indicate to clients the
  levels of the contributing sources that made up the mixed audio.</t>

</section>

<section title='Terminology'>

<t>The key words "MUST", "MUST NOT", "REQUIRED",
"SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
and "OPTIONAL" in this document
are to be interpreted as described in <xref
target='RFC2119'>RFC 2119</xref> and indicate requirement levels for
compliant implementations.</t>

</section>

<section title='Audio Levels' anchor="levels">

<t>The audio level header extension carries the level of the audio 
in the <xref target='RFC3550'>RTP</xref> payload of the packet it is
associated with. This information 
is carried in an RTP header extension element as defined by the 
<xref target='RFC5285'>"General Mechanism for RTP Header Extensions"
</xref>.</t>

<t>The payload of the audio level header extension element can be 
encoded using the one-byte or the two-byte header defined in <xref 
target='RFC5285'/>. <xref target='exthdr'/> and <xref target='exthdr2'/>
show sample audio level encodings for each of them.</t>

<figure anchor='exthdr'>
<artwork>
<![CDATA[
       0                   1
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |  ID   | len=0 |V| level       |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]>
</artwork>
<postamble>Sample audio level encoding using the one-byte header format
</postamble>
</figure>

<figure anchor='exthdr2'>
<artwork>
<![CDATA[
     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |      ID       |     len=1     |V|    level    |    0 (pad)    |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]>
</artwork>
<postamble>Sample audio level encoding using the two-byte header format
</postamble>
</figure>

<t>Note that, as indicated in <xref target='RFC5285'/>, the length field
in the one-byte header format takes the value 0 to indicate that 1 byte
of data follows. In the two-byte header format, on the other hand, the
length field carries the actual number of data bytes and so takes the
value 1.</t>

<t>The magnitude of the audio level itself is packed into the seven 
least significant bits of the single byte of the header extension,
shown in <xref target='exthdr'/> and <xref target='exthdr2'/>. The least
significant bit of the audio level magnitude is packed into the least 
significant bit of the byte. The most significant bit of the byte is 
used as a separate flag bit "V", defined below.</t>
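
<t>As a non-normative illustration of the bit layout described above,
the following Python sketch packs and unpacks the element in both
header formats. The function names are hypothetical and exist only
for this example; the sketch assumes that the caller supplies an
extension identifier valid for the chosen format and handles any
padding of the overall header extension block.</t>

<figure>
<artwork>
<![CDATA[
```python
from typing import Tuple


def _data_byte(level: int, voice: bool) -> int:
    """Combine the V flag (MSB) with the 7-bit level."""
    if not 0 <= level <= 127:
        raise ValueError("level is 0 to 127 (-dBov)")
    return (0x80 if voice else 0x00) | level


def encode_one_byte(ext_id: int, level: int, voice: bool) -> bytes:
    """Element in the one-byte header format: ID, len=0, V, level."""
    if not 1 <= ext_id <= 14:
        raise ValueError("one-byte header IDs are 1 to 14")
    header = ext_id << 4  # low nibble is len=0: one data byte
    return bytes([header, _data_byte(level, voice)])


def encode_two_byte(ext_id: int, level: int, voice: bool) -> bytes:
    """Element in the two-byte header format: ID, len=1, V, level."""
    if not 1 <= ext_id <= 255:
        raise ValueError("two-byte header IDs are 1 to 255")
    # Trailing padding, if needed, is left to the caller.
    return bytes([ext_id, 1, _data_byte(level, voice)])


def decode_data_byte(data: int) -> Tuple[int, bool]:
    """Return (level, v_bit) from the single data byte."""
    return data & 0x7F, bool(data & 0x80)
```
]]>
</artwork>
</figure>

<t>For instance, under these assumptions a level of 30 (-30 dBov) with
the V bit set and an extension identifier of 6 would be carried as the
two octets 0x60 0x9E in the one-byte header format.</t>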

<t>The audio level is expressed in -dBov, with values from 0 to 127 
representing 0 to -127 dBov.  dBov is the level, in decibels, relative 
to the overload point of the system, i.e. the maximum-amplitude signal 
that can be handled by the system without clipping. (Note: 
Representation relative to the overload point of a system is 
particularly useful for digital implementations, since one does not 
need to know the relative calibration of the analog circuitry.)  For 
example, in the case of <xref target='ITU.G711.1988'>u-law (audio/pcmu) 
audio</xref>, the 0 dBov reference would be a square wave with values
+/- 8031.  (This translates to 6.18 dBm0, relative to u-law's dBm0 
definition in Table 6 of G.711.)</t>

<t>The audio level for digital silence, for example for a muted audio 
source, MUST be represented as 127 (-127 dBov), regardless of the 
dynamic range of the encoded audio format.</t>

<t> The audio level header extension only carries the level of the 
audio in the RTP payload of the packet it is associated with, with no 
long-term averaging or smoothing applied. That level is measured as a 
root mean square of all the samples in the measured range.</t>

<t>To simplify implementation of the encoding procedures described here, 
the reference implementation section in <xref 
target='I-D.ietf-avtext-mixer-to-client-audio-level' /> provides a 
sample Java implementation of an audio level calculator that helps 
obtain such values from raw linear PCM audio samples.</t>
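
<t>Independently of that implementation, the root mean square
calculation described above can be sketched in a few lines of Python.
This is a non-normative sketch: the function name is hypothetical, and
it assumes 16-bit linear PCM input with an overload point of 32767,
which is an assumption of this example rather than a requirement of
this document.</t>

<figure>
<artwork>
<![CDATA[
```python
import math
from typing import Sequence


def audio_level(samples: Sequence[int], overload: int = 32767) -> int:
    """Return 0..127, representing 0 to -127 dBov.

    The level is the RMS of the samples relative to the overload
    point, with no averaging across packets.
    """
    if not samples:
        return 127
    mean_square = sum(s * s for s in samples) / len(samples)
    rms = math.sqrt(mean_square)
    if rms == 0:
        return 127  # digital silence
    dbov = 20.0 * math.log10(rms / overload)
    # Values are carried as -dBov, clamped to the 7-bit range.
    return min(127, max(0, round(-dbov)))
```
]]>
</artwork>
</figure>

<t>With these assumptions, a full-scale square wave yields 0 and a
buffer of all-zero samples yields 127.</t>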

<t>In addition, a flag bit (labeled V) optionally indicates whether
the encoder believes the audio packet contains voice activity.  If the
V bit is in use, the value 1 indicates that the encoder believes the
audio packet contains voice activity, and the value 0 indicates that
the encoder believes it does not.  (The voice activity detection
algorithm is unspecified and left implementation-specific.)  If the V
bit is not in use, its value is unspecified and MUST be ignored by
receivers.  The use of the V bit is signaled using the extension
attribute "vad", discussed in <xref target='signaling' />.</t>

<t>When this header extension is used with RTP data sent using 
<xref target='RFC2198'>the RTP Payload for Redundant Audio Data</xref>,
the header's data describes the contents of the primary encoding.</t>

<t>Note: This audio level is defined in the same manner as the audio
noise level in the <xref target="RFC3389">RTP Payload for Comfort Noise
specification</xref>. In the comfort noise specification, the overall 
magnitude of the noise level in comfort noise is encoded into the first
byte of the payload, with spectral information about the noise in 
subsequent bytes. This specification's audio level parameter is defined 
so as to be identical to the comfort noise payload's noise-level byte.
</t>

</section>

<section title='Signaling (Setup) Information' anchor='signaling'>

<t>The URI for declaring this header extension in an extmap attribute is
   "urn:ietf:params:rtp-hdrext:ssrc-audio-level".</t>

<t>It has a single extension attribute, named "vad".  It takes the
   form "vad=on" or "vad=off".  If the header extension element is signaled
   with "vad=on", the "V" bit described in <xref target='levels' /> is
   in use, and MUST be set by senders.  If the header extension element is
   signaled with "vad=off", the "V" bit is not in use, and its value
   MUST be ignored by receivers.  If the "vad" extension attribute is
   not specified, the default is "vad=on".</t>

<t>An example attribute line in the SDP for a conference might hence
be:</t>
<figure>
  <artwork>
    a=extmap:6 urn:ietf:params:rtp-hdrext:ssrc-audio-level vad=on
  </artwork>
</figure>
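
<t>A receiver of such an attribute line needs to recover the extension
identifier and the state of the "vad" attribute. The following
non-normative Python sketch shows one way this might be done; the
function name is hypothetical, and the sketch assumes a single
well-formed attribute line rather than a complete SDP parser.</t>

<figure>
<artwork>
<![CDATA[
```python
from typing import Optional, Tuple

AUDIO_LEVEL_URI = "urn:ietf:params:rtp-hdrext:ssrc-audio-level"


def parse_extmap(line: str) -> Optional[Tuple[int, bool]]:
    """Return (ext_id, vad_in_use), or None for other attributes."""
    prefix = "a=extmap:"
    if not line.startswith(prefix):
        return None
    fields = line[len(prefix):].split()
    if len(fields) < 2 or fields[1] != AUDIO_LEVEL_URI:
        return None
    # The identifier may be followed by an optional "/direction".
    ext_id = int(fields[0].split("/")[0])
    vad_in_use = True  # default when "vad" is not specified
    for attribute in fields[2:]:
        if attribute == "vad=off":
            vad_in_use = False
        elif attribute == "vad=on":
            vad_in_use = True
    return ext_id, vad_in_use
```
]]>
</artwork>
</figure>

<t>Applied to the example line above, this sketch would return the
identifier 6 with the V bit in use.</t>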

<t>The "vad" extension attribute only controls the semantics of this
   header extension attribute, and does not make any statement about
   whether the sender is using any other voice activity detection
   features such as discontinuous transmission, comfort noise, or
   silence suppression.</t>

<t>Using the mechanisms of <xref target='RFC5285' />, an endpoint MAY
signal multiple instances of the header extension element, with
different values of the vad attribute, so long as these instances use
different values for the extension identifier.  However, again
following the rules of <xref target='RFC5285' />, the semantics chosen
for a header extension element (including its vad setting) for a
particular extension identifier value MUST NOT be changed within an
RTP session.</t>

</section>

<section title='Considerations on Use' anchor='use'>

<t>Mixers and forwarders generally ought not to base audio forwarding
  decisions directly on packet-by-packet audio level information, but
  rather ought to apply some analysis of the audio levels and trends.
  This general rule applies whether audio levels are provided by
  endpoints (as defined in this document), or are calculated at a
  server, as would be done in the absence of this information.  This
  section discusses several issues that mixers and forwarders may
  wish to take into account.  (Note that this section provides design
  guidance only, and is not normative.)</t>

<t>First of all, audio levels generally ought to be measured over longer
  intervals than that of a single audio packet.  In order to avoid
  false-positives for short bursts of sound (such as a cough or a
  dropped microphone), it is often useful to require that a
  participant's audio level be maintained for some period of time
  before considering it to be "real", i.e. some type of low-pass
  filter ought to be applied to the audio levels.  Note, though, that such
  filtering must be balanced with the need to avoid clipping of
  the beginning of a speaker's speech.</t>
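
<t>One simple form of such a filter is an exponential moving average
over the per-packet level values. The following Python sketch is
illustrative only: the class name, the smoothing factor of 0.2, and
the choice to average directly in the -dBov domain are all assumptions
of this example, not recommendations of this document.</t>

<figure>
<artwork>
<![CDATA[
```python
class SmoothedLevel:
    """One-pole low-pass filter over per-packet levels (-dBov)."""

    def __init__(self, alpha: float = 0.2, initial: float = 127.0):
        self.alpha = alpha    # larger values react faster
        self.value = initial  # start from digital silence

    def update(self, level: int) -> float:
        """Fold one packet's level into the running estimate."""
        self.value += self.alpha * (level - self.value)
        return self.value
```
]]>
</artwork>
</figure>

<t>With this sketch, a single loud packet after silence moves the
estimate only part of the way toward the reported level, while a
sustained run of loud packets brings it close to that level. A larger
smoothing factor reduces the risk of clipping the start of speech at
the cost of reacting more strongly to short bursts.</t>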

<t>Additionally, different participants may have their audio input set
  differently.  It may be useful to apply some sort of automatic gain
  control to the audio levels.  There are a number of possible
  approaches to achieving this, e.g. by measuring peak audio levels,
  by measuring average audio levels during speech, or by measuring
  background audio levels (average audio levels during non-speech).</t>

</section>

<section title='Security Considerations' anchor='security'>

<t>A malicious endpoint could choose to set the values in this
  header extension falsely, so as to falsely claim that audio or voice
  is or is not present.  It is not clear what could be gained by
  falsely claiming that audio is not present, but an endpoint falsely
  claiming that audio is present could perform a denial-of-service
  attack on an audio conference, so as to send silence to suppress
  other conference members' audio.  Thus, if a device relies on audio
  level data from untrusted endpoints, it SHOULD periodically audit the
  level information transmitted, taking appropriate corrective action
  against endpoints that appear to be sending incorrect data.  (However,
  as it is valid for an endpoint to choose to measure audio levels prior to
  encoding, some degree of discrepancy could be present.  This would not
  indicate that an endpoint is malicious.)</t>
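
<t>As a non-normative sketch of such an audit, a device might decode an
occasional packet, measure its level, and compare the result with the
level the endpoint reported. In the Python sketch below, the function
name, the 6 dB tolerance, and the 50 percent threshold are hypothetical
values chosen for illustration; a real deployment would need to tune
them to allow for the legitimate discrepancies noted above.</t>

<figure>
<artwork>
<![CDATA[
```python
from typing import Iterable, Tuple


def looks_incorrect(
    checks: Iterable[Tuple[int, int]],
    tolerance_db: int = 6,
    max_bad_fraction: float = 0.5,
) -> bool:
    """Flag an endpoint whose reported levels often disagree.

    Each check is (reported_level, measured_level), both in -dBov,
    for one spot-checked packet.
    """
    checks = list(checks)
    if not checks:
        return False  # nothing audited yet
    bad = sum(
        1 for reported, measured in checks
        if abs(reported - measured) > tolerance_db
    )
    return bad / len(checks) > max_bad_fraction
```
]]>
</artwork>
</figure>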

<t>In the <xref target='RFC3711'>Secure Real-Time Transport Protocol
    (SRTP)</xref>, RTP header extensions are authenticated but not
    encrypted.  When this header extension is used, audio levels are
    therefore visible on a packet-by-packet basis to an attacker
    passively observing the audio stream.  As discussed in
    <xref target='I-D.perkins-avt-srtp-vbr-audio' />, such an attacker
    might be able to infer information about the conversation,
    possibly with phoneme-level resolution.  In scenarios where this is a
    concern, additional mechanisms SHOULD be used to protect the
    confidentiality of the header extension.  Such a mechanism could be
    <xref target='I-D.ietf-avtcore-srtp-encrypted-header-ext'>header
    extension encryption</xref>, or a lower-level security and
    authentication mechanism.</t>

</section>

<section title='IANA Considerations' anchor='iana'>

<t>This document defines a new extension URI to the RTP Compact Header
  Extensions subregistry of the Real-Time Transport Protocol (RTP)
  Parameters registry, according to the following data:</t>

<t>
<list style='hanging'>
<t hangText='Extension URI:'>urn:ietf:params:rtp-hdrext:ssrc-audio-level</t>
<t hangText='Description:'>Audio Level</t>
<t hangText='Contact:'>jonathan@vidyo.com</t>
<t hangText='Reference:'>RFC &rfc.number;</t>
</list>
</t>

<t>Note to RFC Editor: please replace "RFC &rfc.number;" with the number of this RFC.</t>

</section>

</middle>

<back>

<references title='Normative References'>

&hdrext;

&red;

&rfc2119;

&rtp;

</references>

<references title='Informative References'>

<?rfc include="reference.RFC.3389"?>
<reference anchor="ITU.G711.1988">
  <front>
  <title>Pulse Code Modulation (PCM) of Voice Frequencies</title>
  <author>
    <organization>
      International Telecommunications Union
    </organization>
  </author>
  <date month="November" year="1988" />
  </front>
  <seriesInfo name="ITU-T" value="Recommendation G.711" />
</reference>

&slic;

&srtp;

&vbr-srtp;

&hdrcrypt;

</references>

<section title='Changes From Earlier Versions'>

<t>Note to the RFC-Editor: please remove this section prior to publication
as an RFC.</t>

<section title='Changes From Draft -03'>

<t><list style='symbols'>

  <t>Added vad extension attribute to negotiate use of the V bit.</t> 
  <t>Addressed editorial comments made on the mailing list.</t> 

</list></t>

</section>

<section title='Changes From Draft -02'>
        <t>
          <list style='symbols'>
            <t>Changed encoding related text so that it would cover both
            the one-byte and the two-byte header formats.</t>
            <t>Clarified use of root mean square for dBov calculation
            </t>
<t>Added references to the sample level calculator in
   <xref target='I-D.ietf-avtext-mixer-to-client-audio-level' />.</t>

<t>Changed affiliation for Emil Ivov.</t>
            <t>Other minor editorial changes.</t> 
          </list>
        </t>
      </section>

<section title='Changes From Draft -01'>

<t><list style='symbols'>

<t>Changed the URI for declaring this header extension from
  "urn:ietf:params:rtp-hdrext:audio-level" to
  "urn:ietf:params:rtp-hdrext:ssrc-audio-level" for consistency with
  <xref target="I-D.ietf-avtext-mixer-to-client-audio-level"/>.</t>

<t>Removed the "Limitations" section; it was discussing a potential extension that
  consensus indicated was out of scope of this document.</t>

<t>Closed the P.56 open issue. It was agreed at IETF 80 that P.56 is
mostly about speech levels and the levels transported by the extension
defined here should also be able to serve as an indication for noise.
</t>

<t>Closed the open issue about transmitting noise floor information.
Noise floor is (loosely) inferrable by observing the per-packet level
information over a period of time, so the additional complexity seemed
unnecessary.</t>

<t>Editorial changes for consistency with <xref
 target="I-D.ietf-avtext-mixer-to-client-audio-level"/>.</t>

<t>Moved several descriptions of normative items that previously had only been described
  in informative sections of the text.</t>

<t>Other editorial clarifications.</t>

</list></t>

</section>

<section title='Changes From Individual Submission Draft -01'>

<t><list style='symbols'>

<t>This version is primarily a document refresh.</t>

<t>Emil Ivov and Enrico Marocco have been added as co-authors.</t>

<t>Additional open issues listed.</t>

</list></t>

</section>

<section title='Changes From Individual Submission Draft -00'>

<t><list style='symbols'>

<t>The draft name has been changed to clarify that this document
  defines Client-To-Mixer Audio Levels, to more clearly distinguish
  it from <xref target='I-D.ietf-avtext-mixer-to-client-audio-level' />.
</t>

<t>The header extension format has been changed from a two-byte to a
  one-byte payload, eliminating the 7 reserved bits and the one
  must-be-zero bit.</t>

<t>The sections <xref target='use'>Considerations on Use</xref> and
  Limitations have been added.</t>

<t>It has been noted that senders MAY indicate -127 dBov for digital
  silence, and that level measurement MAY be done prior to encoding
  audio.</t>

<t>A reference
  to <xref target='I-D.ietf-avtcore-srtp-encrypted-header-ext' />
  has been added to the security considerations.</t>

<t>The term "header extension" is now used consistently throughout
  the document (as opposed to "extension header").</t>

</list></t>

</section>

</section>

</back>

</rfc>