One document matched: draft-wenger-rtcweb-layered-codec-00.txt
RTCWEB S. Wenger
Internet-Draft A. Eleftheriadis
Intended status: Standards Track Vidyo
Expires: January 5, 2012 July 4, 2011
The Case for Layered Codecs
draft-wenger-rtcweb-layered-codec-00
Abstract
RTCWEB is in the process of developing a protocol infrastructure and
a browser API to support browser-to-browser real-time communication
over IP. Real-time communication necessarily requires the use of
encoders and decoders (codecs) for media data. The document
advocates mandating support for a class of codecs known as scalable
or layered codecs, for their superior network adaptivity, error
resilience, and application adaptivity. Examples are provided for
use cases currently under discussion, focusing on video coding as
the, perhaps, most challenging media type currently under
consideration.
Status of this Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on January 5, 2012.
Copyright Notice
Copyright (c) 2011 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
Wenger & Eleftheriadis Expires January 5, 2012 [Page 1]
Internet-Draft The Case for Layered Codecs July 2011
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3
2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3
3. Detour: Suggested additional Requirements . . . . . . . . . . 3
4. Scalable Codecs in the defined Use Cases . . . . . . . . . . . 4
4.1. Use case: Browser to browser use-cases . . . . . . . . . . 5
4.2. Telephony use cases . . . . . . . . . . . . . . . . . . . 7
4.3. Video conferencing use cases . . . . . . . . . . . . . . . 8
4.3.1. Use-case: Multiparty Video Communication . . . . . . . 8
4.3.2. Use-case: Video Conferencing w/ Central Server . . . . 10
4.4. Embedded voice communication use cases . . . . . . . . . . 10
4.5. Bandwidth/QoS/mobility use cases . . . . . . . . . . . . . 11
5. Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . 11
6. Security Considerations . . . . . . . . . . . . . . . . . . . 12
7. References . . . . . . . . . . . . . . . . . . . . . . . . . . 12
7.1. Normative References . . . . . . . . . . . . . . . . . . . 12
7.2. Informative References . . . . . . . . . . . . . . . . . . 12
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 13
Wenger & Eleftheriadis Expires January 5, 2012 [Page 2]
Internet-Draft The Case for Layered Codecs July 2011
1. Introduction
In this document we advocate the mandatory support of scalable coding
techniques for RTCWEB. If consensus towards such as position is not
achievable for whatever technical, commercial, or legal reasons, we
suggest that not having mandatory codecs may be better than mandating
the use of an inferior, non-scalable codec.
The current status of RTCWEB's use case and requirements discussion
is captured in [I-D.ietf-rtcweb-use-cases-and-requirements]. (Note:
Version 00 of this I-D was used for the comparison in this document.)
We use this document as a guide and attempt to compare the broad
categories of scalable and non-scalable codecs in terms of how well
they can satisfy the requirements specified therein. We use video
codecs as an example, and conclude that scalable codecs are
considerably better-suited for RTCWEB than non-scalable codecs.
We first address some additional - and, we believe, self-evident -
requirements.
2. Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119] and
indicate requirement levels for compliant implementations.
3. Detour: Suggested additional Requirements
There are certain requirements that, we believe, appear so obvious to
the authors of [I-D.ietf-rtcweb-use-cases-and-requirements] that they
have not been explicitly captured. One of these requirements is:
Fxx: The browser MUST enable communication with a peer (other
browser(s), MCUs, etc.) with a latency adequate for "real-time"
communication.
This is a crucial requirement when selecting a (video) codec for
reasons that will become apparent shortly. We believe that RTCWEB
should strive for solutions allowing latency (not counting long-
distance network delays) below 200 msec. Furthermore, any solution
that cannot guarantee a latency below 500 msec (before dropping video
altogether) should not be considered. At the time of writing there
are mailing list discussions ongoing regarding a requirement that may
be differently formulated but addresses the same operational aspect.
We don't specifically care about the formulation, but highly
Wenger & Eleftheriadis Expires January 5, 2012 [Page 3]
Internet-Draft The Case for Layered Codecs July 2011
recommend to go through the pain of agreeing to fixed millisecond
numbers.
An additional requirement that should be considered relates to
heterogeneity:
Fxy: The browser must be able to send fully decodable bitstreams,
ideally without wasting resources, for a broad range of receivers
(handheld to desktop) and, in multi-party cases, for a heterogeneous
receiver population.
Seems obvious? Yes, to us. Have the implications, however, been
thoroughly considered? Not that we are aware of. We discuss this
further when addressing the multiparty use case.
Finally, an additional requirement that relates to the heterogeneity
of peers has to do with the heterogeneity of connection
characteristics among peers:
Fxz: The browser MUST be able to perform error control on multiple
streams, with potentially different error characteristics,
simultaneously.
In a full-mesh, multipoint scenario, for example, handling of
individual peer connection characteristics can be very challenging.
Assume, by way of example, that in a five-way conference one of the
streams gets damaged or that the corresponding connection is subject
to congestion. If the browser choses to produce a more robust
stream, then the other three receiving peers will be penalized
through the overhead spent for error protection and resulting
inferior picture quality, even if they have perfect connections to
the sender. If, on the other hand, the browser choses to produce
separate streams for each of its peers, then it must be able to
handle an increasing computational load. Even if the computational
load is not a concern, battery power consumption certainly is. In
fact, there is only one thing that grows faster than CPU power
(according to Moore's law): per-pixel cycle demands of video codecs.
Also, we should bear in mind that even when staying with moderately
complex (and efficient) video codecs, video resolution constantly
goes up. Worse, resolution grows in both horizontal and vertical
dimensions, whereas CPU power grows only in one dimension.
4. Scalable Codecs in the defined Use Cases
The concept of scalable coding (for video, audio, or any media)
consists of providing a representation of the source at multiple
levels of fidelity using a bitstream comprised of a corresponding
Wenger & Eleftheriadis Expires January 5, 2012 [Page 4]
Internet-Draft The Case for Layered Codecs July 2011
number of layers. Typically these layers are formed in a pyramidal
fashion, such that a given level of fidelity requires the
availability of some or all of the lower layers (i.e., those that
correspond to lower levels of fidelity). The dependency exists
because the encoding process uses prediction between layers, thus
providing increased compression efficiency. If no such prediction is
used, then the layers are independent of one another, and we have
what is called simulcasting. Indeed, from a media coding point of
view, simulcasting is scalable coding without inter-layer prediction.
In simulcast, each layer provides a different level of fidelity and
is independently decodable.
One example of a standardized codec that uses such pyramidal layering
to implement scalable coding is H.264/SVC. MPEG-2 and H.263+ are two
other similar examples. The scalable layers in H.264/SVC enhance
fidelity in any of three dimensions: temporal rate, spatial
resolution, or spatial quality (or signal-to-noise ratio, SNR). More
than one such enhancement can exist in a given video at the same
time, i.e., scalability enhancements can be combined.
Scalable coding has been a well-known design architecture in the
media coding community since at least 1993, but it has only recently
been used in products. As a relatively recent addition to the
commercial audio-video communication arsenal, it is not yet widely
known among people who are not video coding experts. An overview of
the H.264/SVC standard can be read in [wiegand2007]. A detailed
discussion of how it can be used in videoconferencing systems is
provided in [eleft2006].
In the following we revisit each of the use cases and discuss how
scalable coding, and particularly pyramidal scalable video coding,
can be used to provide a significantly improved user experience.
4.1. Use case: Browser to browser use-cases
This category has four sub-cases. The sub-cases "simple video
communication" and "simple video communication with inter-operator
calling" are, from a codec viewpoint, the same, with the exception of
the mentioned interoperability argument resulting in a need for a
baseline codec. From a (video) codec viewpoint the "Hockey" sub-case
appears to consist of two independent video streams being sent by the
mobile phone. (We are actually not sure whether this, rather exotic,
multi-camera scenario warrants its own use case, but we are not
opposed to it, either.) The sub-case "video size change" appears, to
us, to introduce a feature that would be relevant to all video-
capable use cases. As a result, from a codec viewpoint, all four use
cases appear to have similar requirements and can be discussed
jointly.
Wenger & Eleftheriadis Expires January 5, 2012 [Page 5]
Internet-Draft The Case for Layered Codecs July 2011
On the surface, these use cases appear to be trivially supported by
any video codec as long as basic rate control and error resilience
features are utilized (probably with the help of protocol support
such as RTCP receiver reports and feedback messages therein,
retransmission, and so on).
Let's see, however, how scalable codecs can offer significant
advantages over non-scalable codecs.
First, temporal scalability (where pictures coded in an enhancement
layer can be decoded in combination with a lower frame rate base
layer) can greatly enhance error resilience, through techniques such
as the one described in [wenger1998]. Especially in conjunction with
feedback messages, virtually latency-neutral repair mechanisms can be
devised without relying on latency- and bandwidth-unfriendly intra
(I) pictures. It is important to note that such "reactive"
techniques do not penalize a system when no errors happen to occur.
Also, by being reactive, they are amenable to heterogeneous error
control support (thus addressing requirement Fxz).
(Note: sending intra pictures is typically not a good idea. First,
when transmitting video over a bandwidth-limited link that is close
to capacity (e.g., a 3G link), an I frame can bring the latency up to
the seconds range. Second, I frames are much larger than P or B
frames and, therefore, statistically much more likely to be hit by a
packet loss. Sending I frames for error control is, in almost all
cases, a bad idea. It has been done for many years, but only because
few, if any, better error control mechanisms were available.)
A second scalability dimension is spatial scalability. Here the
enhancement layer enables decoding of the video in a higher
resolution than a base layer, but can predict information from the
base layer. Spatial scalability can also offer advantages for error
control, something that we would gladly elaborate on in the future.
In this use case, however, its key advantage lies in the graceful
and, if properly implemented, latency-neutral handling of both
changes in available network bandwidth (addressing the bandwidth and
error characteristics aspects of F23, among others) and receiver-side
rendering requirements (addressing F22).
Most, if not all, video coding experts will agree that there is
nothing better than shedding pixels when running into a bandwidth
issue, and this can be trivially done (without sending an I frame!)
when using spatially scalable coding techniques. Similarly,
recovering from bandwidth shortages can be achieved without sending
an I frame, again in those cases where the sender properly implements
spatial scalability. These techniques are generally more flexible
and often lead to better-quality pictures compared to the rate-
Wenger & Eleftheriadis Expires January 5, 2012 [Page 6]
Internet-Draft The Case for Layered Codecs July 2011
control techniques used in single-layer codecs (QP adjustment).
Another point that speaks for spatial scalability is the option to
gracefully and quickly react to rendering size requirements, i.e.,
users enlarging or "full-screening" the rendering window (F22).
Again adding a high resolution scalable layer can be done without
sending an I frame (which, as already pointed out, are evil), and, if
implemented properly, can be done without interrupting the smoothness
of the video playback experience. (Note that even today's most
advanced spatial scalability techniques still encode only a small set
of discrete picture sizes in their finite, and small, number of
layers. As a result, some form of resampling in the rendering
interface will probably still be required.)
It has been remarked that simulcasting low and high resolution
representations of the same picture can have very similar positive
effects. This is correct. However, simulcast comes at a price,
which is bandwidth and compute cycle requirements. The quite
extensive subjective evaluation tests performed by MPEG in
conjunction with the H.264/SVC verification tests have shown that a
2006-generation scalable video encoder offers between 17% and 40%
bitrate savings over a same generation encoder using simulcast
[mpeg2007].
(Note that "subjective" evaluation is the Mercedes of media quality
testing, whereas "objective" evaluation compares to a pre-"Volt"
Chevrolet :-). Subjective tests, when performed properly, are by no
means unscientific and they are, in fact, considered a much better
way to assess the quality of a codec compared with objective tests
(e.g., the PSNR, or less crude metrics devised by groups such as the
Video Quality Experts Group). Unfortunately, subjective evaluations
are also orders of magnitude more expensive and time-consuming than
objective tests, as it takes many person-hours to evaluate a single
test case, whereas objective tests take only CPU cycles.)
Further, while a modern scalable encoder creating two spatial layers
may require roughly the same number of cycles as two encoders coding
the same pixel count, there are savings in the decoder due to what
SVC calls "single loop decoding" (i.e., the fact that the receiver
only needs to maintain a prediction loop just for the layer that is
being decoded, and not for the layers that the to-be-decoded layer
may depend on).
4.2. Telephony use cases
As the telephony use cases appear to address interaction with legacy
telephone equipment only, there is probably little use for the
quality scalability that modern sclabale speech/audio codecs offer.
Wenger & Eleftheriadis Expires January 5, 2012 [Page 7]
Internet-Draft The Case for Layered Codecs July 2011
As formulated, from a codec viewpoint, any speech/audio codec used in
telco environments (and many that have been specified outside this
environment) ought to be acceptable choices.
4.3. Video conferencing use cases
There are two sub-cases in this broad category: full-mesh video
conferencing, and centralized conferencing with resolution switching.
The use cases also make particular assumptions regarding audio; the
first case involves mono audio with panning at the web application,
whereas the second involves mono or stereo audio with mixing at the
server.
We fist point out that these two use cases offer only a small subset
of the functionality offered in today's multiparty videoconferencing
solutions. In particular, the centralized server sub-case appears to
deal with one, not frequently used, aspect of MCU-based communication
that is implemented in a crude and unoptimized fashion (see below).
We believe that video conferencing use cases in a standard drafted in
2011 should encompass at least as much functionality as is routinely
offered in 2005 generation MCUs.
We wonder, specifically, why the group is not (yet?) considering
other use cases in the same broad category of centralized
cponferencing that require more flexibile server architectures. For
example, sometimes active speaker size-up is not wanted, i.e., in
scenarios involving presentations with questions. Or, one wants to
be able to up-size more than one (but less than the whole population)
of speakers. Or a full continuous presence conference where it is up
to the user of each receiving browser to decide how he/she wishes to
render the received signals. All of these capabilities are available
with today's MCUs. Many similar scenarios could be described. We
plan to contribute more detailed descriptions in the future unless we
receive pushback.
We now examine each of the two use-case categories.
4.3.1. Use-case: Multiparty Video Communication
On the surface, this use case is not very different from the Simple
Video use case, at least from a codec viewpoint. Indeed, it appears
that a browser just needs to be able to simultaneously process N
incoming independent video bitstreams (as spelled out in F12 and
F14). A slightly deeper examination will reveal that this use case
appears to be a picture-perfect example of why scalable codecs offer
superior performance.
First off, all the arguments that were made for scalable codecs in
Wenger & Eleftheriadis Expires January 5, 2012 [Page 8]
Internet-Draft The Case for Layered Codecs July 2011
the Simple Video use case still apply. Let's consider, however, what
happens when more than one stream is received.
A first consideration is the CPU requirements for decoding all the
pictures in full resolution. Suppose a browser is receiving, for
example, four video pictures in standard TV resolution and in a non-
scalable format. The computational load for decoding these pictures
in addition to the load of encoding one's outgoing picture would
pretty much max out today's desktop CPUs. In typical multipoint
layouts, the reconstructed pictures of most of the peers would be
shown in thumbnail format, thus throwing away anywhere between 75-95%
of the reconstructed pixel count! What a waste.
But waste what? Battery power? Yes, for handheld devices, which do
not have the screen real-estate to show all the pictures in full
resolution. We come to that in a moment. Electricity? Yes. This is
not a joke. A desktop motherboard can easily double its power intake
based on CPU load and memory access. Say, a motherboard's power
consumption goes up from 100W to 200W, just because video is decoded
in unnecessarily high resolution. One kWh of electricity costs at
one of the author's home (Bay Area, California, USA, PG&E as the
electricity provider) about 26 cents at his consumption level (single
f amily house). This means that 8 hours of unnecessarily decoding of
full- resolution video that is not rendered at full resolution will
cost $0.20. This, by coincidence, is the exact same maximum amount
that MPEG-LA charges as licensing fees for using H.264/SVC :-) Never
mind the numbers of trees that can be saved...
A second consideration is when multiple streams are received, or
transmitted, by the browser in a heterogeneous receiver population.
Browsers are today ubiquitous and can be found anywhere from handheld
devices with QVGA screens to gaming racks with multiple 2K pixel
resolution screens. That, incidentally, is a big part of the appeal
of the RTCWEB activity. If you are sitting behind a 23-inch 1080p
monitor, your display is approximately 120 dpi. A QVGA image,
perfectly appropriate for being rendered on the screens of many
smartphones, would be just 2-in by 1.5-in on the PC screen. On one
of the authors' laptop (a 1080p, 13.3-in model, better than 200 dpi)
it would literally be the size of a thumbnail. Not exactly what one
expects to see, judging form daily uses of Skype and our own
(Vidyo's) desktop video conferencing services. Further, when
upsampling to, say, a quarter HD resolution, a QVGA signal simply
doesn't look good no matter how well your upsample filter is
designed.
This means that a transmitting browser that creates bitstreams for a
receiver population including a smartphone and a PC, would have to,
ideally, generate at least three resolutions: thumbnail, something
Wenger & Eleftheriadis Expires January 5, 2012 [Page 9]
Internet-Draft The Case for Layered Codecs July 2011
useful for smartphones (e.g., QVGA), and something useful for PC
users (VGA or better). More option would be better. There are
commercial products available today that ship, using a software codec
on a high-end PC, 1080p60 video conferencing. Two years down the
road, and a mainstream PC will be able to handle that type of load.
It has already been pointed out that the compute cycle requirements
for simulcast and scalable encoding are roughly the same, assuming a
common set of coding tools (i.e., H.264 Baseline profile vs. H.264
Scalable Baseline profile). The sending bandwidth requirements,
however, are not the same: scalable beats simulcast by several tens
of percentage points.
There are additional considerations for simulcast vs. scalable coding
that have to do with RTP packetization and access unit alignment,
resolution switching, and error resilience, that require much more
extensive analysis than what's intended in this document.
4.3.2. Use-case: Video Conferencing w/ Central Server
This particular use case is interesting because, as formulated, it
appears to assume that multiple video resolutions are produced by
each sending browser and simulcast to the server. As formulated, the
receiving browser receives the active speaker in full resolution and
the other participants in low resolution. The server selects what to
forward based on speech activity. Apparently, what is emulated here
is one operation point implemented in many continuous presence MCUs
(namely voice activation), without requiring the video transcoding
features of a continuous presence MCU.
This use case almost begs for use of scalable video coding. The
simulcasting alternative would have all the drawbacks discussed in
the Simple Video Communication Service. For example, consider what
happens when the active speaker changes. How is the receiving
participant to switch resolutions between the two simulcast streams?
It appears to require the request (from server to sending browser) of
an I frame, with all the drawbacks that entails. Actually, as
described, there appears to be no value in simulcasting all
resolutions to the server, because of the need of such interaction...
4.4. Embedded voice communication use cases
The benefits discussed in multiparty video communication can apply
here for scalable audio codecs. Scalable audio codecs haven't been
mentioned prominently before, as in audio-visual systems the cost of
video processing (in terms of compute cycles and bit rate, among
others) typically outweighs that of audio. The requirements for
today's games are so high that the CPU and bandwidth requirements for
Wenger & Eleftheriadis Expires January 5, 2012 [Page 10]
Internet-Draft The Case for Layered Codecs July 2011
a couple of audio channels may fall into the category of background
noise. However, note that this ceases to be the case as the number
of audio channels increases and the quality and (therefore)
complexity of the audio codecs grows. We understand that the audio
quality here would probably not be restricted to toll quality, and
require something like Opus or MP3 for coding. While one or two of
these codecs probably would still qualify as "background noise" on a
gaming rack, 20 certainly do not. What appears to be useful here
would be scalability both in terms of bitrate scalability and
complexity scalability.
4.5. Bandwidth/QoS/mobility use cases
The use cases as formulated appear to have limited impact on the
codec choices beyond aspects already raised above. NIC changes (or
other changes that materially affect connectivity) are best addressed
by codecs that can flexibly change their operation points without
essentially restarting the codec (such as sending I frames in video,
or user-generated codebooks in audio).
If QoS mechanisms that take advantage of QoS marking are supported in
the underlying network - something that is at least questionable in
today's Internet, but may be very relevant in the future and in
special application fields such as the military - the pyramid
structure of scalable codecs makes the selection of bits that are
best to be transported at higher QoS trivial: the lower the layer,
the higher the desired QoS. We note that scalable systems today
emulate such a desirable network behavior by protecting base layers
better than enhancement layers, through techniques such as
retransmission or FEC.
5. Concluding Remarks
Scalability, while known as a concept for decades, is a relatively
new technique in the commercial sphere of video and audio
communication products. As a result, a lot of people are not
familiar with how it works, and how it can be beneficial to real-time
communication systems.
Most significantly, scalable coding has been successfully used to
solve several fundamental, decades-old challenges in packet video
communication, and it therefore behooves any standardization activity
that considers codec design or adoption to take it into serious
consideration.
Wenger & Eleftheriadis Expires January 5, 2012 [Page 11]
Internet-Draft The Case for Layered Codecs July 2011
6. Security Considerations
None
7. References
7.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
7.2. Informative References
[I-D.ietf-rtcweb-use-cases-and-requirements]
Holmberg, C., Hakansson, S., and G. Eriksson, "Web Real-
Time Communication Use-cases and Requirements",
draft-ietf-rtcweb-use-cases-and-requirements-00 (work in
progress), June 2011.
[eleft2006]
Eleftheriadis, A., Cinvanlar, M., and O. Shapiro,
"Multipoint Videoconferencing with Scalable Video Coding,
Journal of Zhejiang University SCIENCE A, Vol. 7, Nr. 5,
April 2006, pp. 696-705. (Proceedings of the Packet Video
2006 Workshop.)", 2006.
[mpeg2007]
HHI and MPEG, "MPEG Verification Test for SVCd", 2007, <ht
tp://ip.hhi.de/imagecom_G1/savce/MPEG-Verification-Test/
MPEG-Verification-Test.htm>.
[wenger1998]
Wenger, S., "Temporal scalability using P-pictures for
low-latency applications, Multimedia Signal Processing
Workshop 1998, available from
http://www.stewe.org/papers/mmsp98/243/index.htm", 1998.
[wiegand2007]
Schwarz, H., Marpe, D., and T. Wiegand, "Overview of the
Scalable Video Coding Extension of the H.264/AVC Standard,
IEEE Transactions on Circuits and Systems for Video
Technology, Special Issue on Scalable Video Coding, vol.
17, no. 9, pp. 1103-1120, September 2007, Invited Paper.",
2006.
Wenger & Eleftheriadis Expires January 5, 2012 [Page 12]
Internet-Draft The Case for Layered Codecs July 2011
Authors' Addresses
Stephan Wenger
Vidyo, Inc.
2400 SKyfarm Dr.
Hillsborough, CA 94010
US
Email: stewe@stewe.org
Alex Eleftheriadis
Vidyo, Inc.
433 Hackensack Avenue
Seventh Floor
Hackensack, NJ 07601
US
Email: alex@vidyo.com
Wenger & Eleftheriadis Expires January 5, 2012 [Page 13]
| PAFTECH AB 2003-2026 | 2026-04-24 05:26:51 |