One document matched: draft-ietf-avtext-client-to-mixer-audio-level-01.xml
<?xml version='1.0'?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!ENTITY hdrext SYSTEM
'reference.RFC.5285.xml'>
<!ENTITY red SYSTEM
'reference.RFC.2198.xml'>
<!ENTITY rfc2119 SYSTEM
'reference.RFC.2119.xml'>
<!ENTITY rtp SYSTEM
'reference.RFC.3550.xml'>
<!ENTITY slic SYSTEM
'reference.I-D.ietf-avtext-mixer-to-client-audio-level.xml'>
<!ENTITY srtp SYSTEM
'reference.RFC.3711.xml'>
<!ENTITY hdrcrypt SYSTEM
'reference.I-D.lennox-avt-srtp-encrypted-extension-headers.xml'>
<!ENTITY vbr-srtp SYSTEM
'reference.I-D.perkins-avt-srtp-vbr-audio.xml'>
]>
<?rfc toc="yes" ?>
<?rfc strict="yes" ?>
<?rfc compact="yes" ?>
<?rfc sortrefs="yes" ?>
<?rfc symrefs="yes" ?>
<rfc category='std' ipr='pre5378Trust200902' docName='draft-ietf-avtext-client-to-mixer-audio-level-01'>
<front>
<title abbrev='Client-to-Mixer Audio Level Indication'>
A Real-Time Transport Protocol (RTP) Header Extension for Client-to-Mixer Audio Level Indication
</title>
<author initials='J.' surname='Lennox'
fullname='Jonathan Lennox'
role="editor">
<organization abbrev='Vidyo'>
Vidyo, Inc.
</organization>
<address>
<postal>
<street>433 Hackensack Avenue</street>
<street>Seventh Floor</street>
<city>Hackensack</city> <region>NJ</region>
<code>07601</code>
<country>US</country>
</postal>
<email>jonathan@vidyo.com</email>
</address>
</author>
<author initials='E.' surname='Ivov'
fullname='Emil Ivov'>
<organization abbrev='Jitsi'>
Jitsi
</organization>
<address>
<postal>
<street> </street>
<city>Strasbourg</city>
<code>67000</code>
<country>France</country>
</postal>
<email>emcho@jitsi.org</email>
</address>
</author>
<author initials='E.' surname='Marocco'
fullname='Enrico Marocco'>
<organization abbrev='Telecom Italia'>
Telecom Itialia
</organization>
<address>
<postal>
<street>Via G. Reiss Romoli, 274</street>
<city>Turin</city>
<code>10148</code>
<country>Italy</country>
</postal>
<email>enrico.marocco@telecomitalia.it</email>
</address>
</author>
<date />
<area>RAI</area>
<workgroup>AVT</workgroup>
<keyword>I-D</keyword>
<keyword>Internet-Draft</keyword>
<!-- TODO: more keywords -->
<abstract>
<t>This document defines a mechanism by which packets of
Real-Time Transport Protocol (RTP) audio streams can indicate,
in an RTP header extension, the audio level of the audio
sample carried in the RTP packet. In large conferences, this
can reduce the load on an audio mixer or other middlebox which
wants to forward only a few of the loudest audio streams,
without requiring it to decode and measure every stream that
is received.</t>
</abstract>
</front>
<middle>
<section title='Introduction' anchor='introduction'>
<t>In a centralized <xref target='RFC3550'>Real-Time Transport Protocol
(RTP)</xref> audio conference, an audio mixer or
forwarder receives audio streams from many or all of the conference
participants. It then selectively forwards some of them to
other participants in the conference. In
large conferences, it is possible that such a server might be
receiving a large number of streams, of which only a few should be
forwarded to the other conference participants.</t>
<t>In such a scenario, in order to pick the audio streams to forward,
a centralized server needs to decode, measure audio levels, and
possibly perform voice activity detection on audio data from a large
number of streams. The need for such processing limits the size or
number of conferences such a server can support.</t>
<t>As an alternative, this document defines
an <xref target='RFC5285'>RTP header extension</xref> through which
senders of audio packets can indicate the audio level of the
packets' payload, reducing the processing load for a server.
</t>
<t>The header extension in this draft is different to, but
complementary with, the one defined in
<xref target='I-D.ietf-avtext-mixer-to-client-audio-level' />, which
defines a mechanism by which audio mixers can indicate to clients the
levels of the contributing sources that made up the mixed audio.</t>
</section>
<section title='Terminology'>
<t>The key words "MUST", "MUST NOT", "REQUIRED",
"SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
and "OPTIONAL" in this document
are to be interpreted as described in <xref
target='RFC2119'>RFC 2119</xref> and indicate requirement levels for
compliant implementations.</t>
</section>
<section title='Audio Levels'>
<t>The audio level header extension carries both the level of the
audio carried in the RTP payload of the packet it is associated
with, as well as an indication as to whether voice activity
has been detected in the packet.</t>
<t>The form of the audio level extension block is as follows:</t>
<figure anchor='exthdr'>
<artwork>
<![CDATA[
0 1
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ID | len=0 |V| level |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]>
</artwork>
</figure>
<t>The length field takes the value 0 to indicate that 1 byte follows.</t>
<t>The audio level is defined in the same manner as is audio noise
level in the <xref target='RFC3389'>RTP Comfort Noise</xref>
specification. In that specification, the overall magnitude of
the noise level is encoded into the first byte of the payload, with
spectral information about the noise in subsequent bytes. This
specification's audio level parameter is defined so as to be
identical to the comfort noise payload's noise-level byte.</t>
<t>The magnitude of the audio level is packed into the seven least
significant bits of the single byte of the header extension,
shown in <xref target='exthdr'/>.
The least significant bit of the audio level
magnitude is packed into the least significant bit of the byte.
The most significant bit of the byte is used as a separate flag bit
"V", defined below.</t>
<t>The audio level is expressed in -dBov, with values from 0 to 127
representing 0 to -127 dBov. dBov is the level, in decibels,
relative to the overload point of the system, i.e. the maximum-amplitude
signal that can be handled by the system without clipping.
(Note: Representation relative to the
overload point of a system is particularly useful for digital
implementations, since one does not need to know the relative
calibration of the analog circuitry.) For example, in the case of
<xref target='ITU.G711.1988'>u-law (audio/pcmu) audio</xref>, the 0
dBov reference would be a square wave with values
+/- 8031. (This translates to 6.18 dBm0, relative to u-law's dBm0
definition in Table 6 of G.711.)</t>
<t>The reference implementation section in
<xref target='I-D.ietf-avtext-mixer-to-client-audio-level' />
provides a sample implementation of an audio level calculator
that helps obtain such values from raw audio samples.</t>
<t>In addition, a flag bit (labeled V) indicates whether
the encoder believes the audio packet contains voice activity (1) or
does not (0). The voice activity detection algorithm is
unspecified and left implementation-specific.</t>
<t>The audio level for digital silence (e.g. all-0 pcmu audio),
for example for a muted audio source, MAY be represented as 127 (-127
dBov), regardless of the dynamic range of the encoded audio format.</t>
<t>When this header extension is used with RTP data sent
using <xref target='RFC2198'>the RTP Payload for Redundant Audio
Data</xref>, the header's data describes the contents of the
primary encoding.</t>
</section>
<section title='Signaling (Setup) Information'>
<t>The URI for declaring this header extension in an extmap attribute is
"urn:ietf:params:rtp-hdrext:audio-level". There is no additional setup
information needed for this extension (no extensionattributes).</t>
</section>
<section title='Considerations on Use' anchor='use'>
<t>Mixers and forwarders generally should not base audio forwarding
decisions directly on packet-by-packet audio level information, but
rather should apply some analysis of the audio levels and trends.
This general rule applies whether audio levels are provided by
endpoints (as defined in this document), or are calculated at a
server, as would be done in the absence of this information. This
section discusses several issues that mixers and forwarders may
wish to take into account. (Note that this section provides design
guidance only, and is not normative.)</t>
<t>First of all, audio levels should generally be measured over longer
intervals than that of a single audio packet. In order to avoid
false-positives for short bursts of sound (such as a cough or a
dropped microphone), it is often useful to require that a
participant's audio level be maintained for some period of time
before considering it to be "real", i.e. some type of low-pass
filter should be applied to the audio levels. Note, though, that such
filtering must be balanced with the need to avoid clipping of
the beginning of a speaker's speech.</t>
<t>Additionally, different participants may have their audio input set
differently. It may be useful to apply some sort of automatic gain
control to the audio levels. There are a number of possible
approaches to acheiving this, e.g. by measuring peak audio levels,
by average audio levels during speech, or by measuring background
audio levels (average audio level levels during non-speech).</t>
</section>
<section title='Limitations' anchor='limitations'>
<t>The audio levels carried by the extension header defined by this
document are defined as dBov, decibels below system overload.</t>
<t>In principle, it could be more useful to have, instead, dB SPL,
decibels of sound pressure level. In traditional telephony systems,
telephone handsets were calibrated such that a particular (e.g.)
u-law audio level, or analog voltage, corresponded to a particular
sound pressure level at the handset's mouthpiece.</t>
<t>However, in many environments, this information is not available.
Notably, PC soundcard hardware can only determine the levels of mic-
or line-in at the hardware input, and operating systems usually
allow further adjustments of audio input levels without providing
information about these transformations to applications.
Furthermore, in many circumstances, such as speech synthesis or
mixed audio, an "audio" signal may in fact never have actually
existed as sound pressure at all.</t>
<t>Thus, while information about the correspondance between dB SPL and
dBov, or encoded audio, could be useful, this document does not
attempt to define it. If there are circumstances in which this
information would be useful, a separate header extension would be
straightforward to define. (The information carried by such a
header extension could indeed be useful independently from the
information in the header extension defined by this document.)</t>
</section>
<section title='Security Considerations' anchor='security'>
<t>A malicious endpoint could choose to set the values in this
header extension falsely, so as to falsely claim that audio or voice
is or is not present. It is not clear what could be gained by
falsely claiming that audio is not present, but an endpoint falsely
claiming that audio is present could perform a denial-of-service
attack on an audio conference, so as to send silence to suppress
other conference members' audio. Thus, a device relying on audio
level data from untrusted endpoints SHOULD periodically audit the
level information transmitted, taking appropriate corrective action
if endpoints appear to be sending incorrect data. (Note that
endpoints MAY choose to measure audio levels prior to encoding, so
some degree of discrepancy SHOULD be tolerated.)</t>
<t>In the <xref target='RFC3711'>Secure Real-Time Transport Protocol
(SRTP)</xref>, RTP header extensions are authenticated but not
encrypted. When this header extension is used, audio levels are
therefore visible on a packet-by-packet basis to an attacker
passively observing the audio stream. As discussed in
<xref target='I-D.perkins-avt-srtp-vbr-audio' />, such an attacker
might be able to infer information about the conversation,
possibly with phoneme-level resolution. In scenarios where this is a
concern, additional mechanisms SHOULD be used to protect the
confidentiality of the header extension. One solution would be
<xref target='I-D.lennox-avt-srtp-encrypted-extension-headers'>header
extension encryption</xref>.</t>
</section>
<section title='IANA Considerations' anchor='iana'>
<t>This document defines a new extension URI to the RTP Compact Header
Extensions subregistry of the Real-Time Transport Protocol (RTP)
Parameters registry, according to the following data:</t>
<t>
<list style='hanging'>
<t hangText='Extension URI:'>urn:ietf:params:rtp-hdrext:audio-level</t>
<t hangText='Description:'>Audio Level</t>
<t hangText='Contact:'>jonathan@vidyo.com</t>
<t hangText='Reference:'>RFC &rfc.number;</t>
</list>
</t>
</section>
</middle>
<back>
<references title='Normative References'>
&hdrext;
&red;
&rfc2119;
&rtp;
</references>
<references title='Informative References'>
<?rfc include="reference.RFC.3389"?>
<reference anchor="ITU.G711.1988">
<front>
<title>Pulse Code Modulation (PCM) of Voice Frequencies</title>
<author>
<organization>
International Telecommunications Union
</organization>
</author>
<date month="November" year="1988" />
</front>
<seriesInfo name="ITU-T" value="Recommendation G.711" />
</reference>
<reference anchor="ITU.P56.1993">
<front>
<title>Objective Measurement of Active Speech Level</title>
<author>
<organization>
International Telecommunications Union
</organization>
</author>
<date month="March" year="1988" />
</front>
<seriesInfo name="ITU-T" value="Recommendation P.56" />
</reference>
&slic;
&srtp;
&vbr-srtp;
&hdrcrypt;
</references>
<section title='Open issues'>
<t><list style='symbols'>
<t>In order to more accurately determine signal-to-noise ratio, it
would be useful for a sender to also send its estimate of its
current audio noise floor. If so, it's unclear whether this would
be better as a separate header extension element, or added to this
header extension element.</t>
<t>It has been suggested to reference <xref target='ITU.P56.1993'>ITU
P.56</xref> for level measurement. This needs to be
investigated.</t>
</list></t>
</section>
<section title='Changes From Earlier Versions'>
<t>Note to the RFC-Editor: please remove this section prior to publication
as an RFC.</t>
<section title='Changes From Draft -01'>
<t><list style='symbols'>
<t>Added references to the sample level calculator in
<xref target='I-D.ietf-avtext-mixer-to-client-audio-level' />.</t>
<t>Changed affiliation for Emil Ivov.</t>
</list></t>
</section>
</section>
</back>
</rfc>
| PAFTECH AB 2003-2026 | 2026-04-24 04:27:20 |