Overview: This document presents Swift, a deep learning-based adaptive video streaming system that uses layered neural coding to provide efficient and flexible video compression and decoding. Swift consists of three main parts: 1) a server-side encoder and decoder; 2) a client-side single-shot decoder; and 3) an adaptive bitrate (ABR) protocol suited to layered coding. By optimizing layered coding, Swift achieves higher video quality and faster reaction times under low-bandwidth conditions and varying compute resources, while reducing bandwidth usage.
Intended audience: researchers, engineers, and developers of video streaming systems.
Use cases and goals: 1) video platforms that must deliver high-quality streams under unstable network conditions; 2) scenarios that require flexibly adjusting video quality to match different clients' compute capabilities; 3) companies and research institutions seeking to improve streaming quality of experience (QoE).
Additional notes: beyond its technical novelty, Swift addresses the long-standing challenges of traditional layered coding and demonstrates the potential of deep learning for video codecs; by reducing encoding overhead and decoding latency, Swift significantly improves overall streaming performance.
Swift: Adaptive Video Streaming with Layered Neural Codecs
Mallesham Dasari, Kumara Kahatapitiya, Samir R. Das, Aruna Balasubramanian, Dimitris Samaras
Stony Brook University
Abstract
Layered video coding compresses video segments into layers (additional code bits). Decoding with each additional layer improves the video quality incrementally. This approach has potential for very fine-grained rate adaptation. However, layered coding has not seen much success in practice because of its cross-layer compression overheads and decoding latencies. We take a fresh approach to layered video coding by exploiting recent advances in video coding using deep learning techniques. We develop Swift, an adaptive video streaming system that includes i) a layered encoder that learns to encode a video frame into layered codes by purely encoding residuals from previous layers, without introducing any cross-layer compression overheads; ii) a decoder that can fuse together a subset of these codes (based on availability) and decode them all in one go; and iii) an adaptive bitrate (ABR) protocol that synergistically adapts video quality based on available network and client-side compute capacity. Swift can be integrated easily into the current streaming ecosystem without any change to network protocols and applications, by simply replacing the current codecs with the proposed layered neural video codec when appropriate GPU or similar accelerator functionality is available on the client side. Extensive evaluations reveal Swift's multi-dimensional benefits over prior video streaming systems.
1 Introduction
Internet video delivery often encounters highly variable and unpredictable network conditions. Despite various advances, delivering the highest possible video quality remains a challenging problem because of this uncertainty. The problem is more acute in wireless networks, where channel conditions and mobility add to the uncertainty [39, 46]. Interestingly, next-generation wireless networks may make the problem even more challenging (e.g., 60 GHz/mmWave [10, 11, 38]).
To counter the challenges posed by such varying network capacity, current video delivery solutions predominantly practice adaptive streaming (e.g., DASH [50]), where a source video is split into segments that are encoded at the server into multiple bitrates providing different video qualities, and a client runs an adaptive bitrate (ABR) algorithm to dynamically select the highest quality that fits within the estimated network capacity for the next segment to be downloaded.
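For concreteness, the core of such a rate-based ABR decision can be sketched in a few lines. This is a minimal sketch; the bitrate ladder and the safety factor below are illustrative assumptions, not values from any deployed system:

```python
# Minimal throughput-based ABR sketch (illustrative; not any specific system's algorithm).
BITRATE_LADDER_KBPS = [400, 1200, 2500, 5300, 12000]  # hypothetical quality levels

def select_quality(throughput_estimate_kbps: float, safety: float = 0.9) -> int:
    """Pick the highest quality level whose bitrate fits under the estimated capacity."""
    budget = throughput_estimate_kbps * safety  # leave headroom for estimation error
    best = 0
    for level, bitrate in enumerate(BITRATE_LADDER_KBPS):
        if bitrate <= budget:
            best = level
    return best

print(select_quality(3000))  # -> 2, since 2500 kbps fits under 0.9 * 3000
```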
Figure 1: Layered vs. regular coding methods. In regular coding, the video segments are coded independently at different qualities (Rate_1 at 360p up to Rate_M at 4K). In layered coding, a given quality level Q_m can be reconstructed by combining the codes for multiple layers (c_0, c_0 + c_1, ..., c_0 + c_1 + ... + c_L), thus facilitating incremental upgrades or downgrades.

Need for layered coding. Most current commercial ABR algorithms adopt a monolithic encoding practice (e.g., H.265 [53]), where the same video segment is encoded 'independently' for each quality level. The decision to fetch a segment at a certain quality is considered final once the ABR algorithm makes a determination based on its estimate of the network capacity. However, these estimates are far from accurate, resulting in either underutilizing or overshooting the network capacity. For example, the ABR algorithm may fetch at a low quality by underestimating the network capacity, or it may fetch at a high quality, causing video stalls, by overestimating it. Consequently, even optimal ABR algorithms fail to provide a good quality of experience (QoE), because such rigid methods do not fit the needs of the streaming conditions.
An alternate technique, called layered coding, has long been studied [12, 14, 36, 47, 67] and can avoid the above streaming issues. The key idea is that, instead of independently encoding the segment at different qualities, the segment is encoded into layers; the base layer provides a certain video quality, and additional layers improve the video quality when applied over the base layer. See Figure 1. This means that, if the network throughput improves, the client can fetch additional layers to improve video quality at a much lower cost compared to a regular codec. (We use the term regular coding for the current practice of independent encoding in multiple qualities, as in current standards such as H.265/HEVC [53].)
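With layered coding, upgrading a buffered segment reduces to deciding how many additional enhancement layers fit into the spare download budget. A hedged sketch of that decision, with hypothetical layer sizes:

```python
# Sketch of an incremental layer-upgrade decision for one buffered segment
# (illustrative layer sizes; not taken from the paper).
LAYER_SIZE_KBITS = [800, 400, 300, 250, 200]  # base layer + enhancement layers

def extra_layers_to_fetch(layers_downloaded: int,
                          spare_capacity_kbits: float) -> int:
    """Return how many additional enhancement layers fit in the spare budget."""
    extra = 0
    for size in LAYER_SIZE_KBITS[layers_downloaded:]:
        if size > spare_capacity_kbits:
            break  # layers must be applied in order, so stop at the first miss
        spare_capacity_kbits -= size
        extra += 1
    return extra

# With the base layer already fetched and 750 kbits to spare, the segment
# can be upgraded by two enhancement layers (400 + 300 <= 750).
print(extra_layers_to_fetch(1, 750))  # -> 2
```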
Challenges with layered coding. Layered coding, however, faces two nontrivial challenges: compression overhead and coding latency. The compression overhead mainly comes from forgoing inter-layer frame prediction to avoid reconstruction drift in quality [29, 42, 61, 67]. The decoding latency, on the other hand, is a function of the number of layers, as each layer needs to be decoded separately. (We use the terms coding or codec for encoding and decoding together; we also use encoding/compression and decoding/decompression interchangeably.) Notwithstanding these issues, some studies have indeed applied layered coding in streaming and have shown slightly better QoE compared to regular coding methods, benefiting from its ability to do dynamic quality upgrades [31]. However, they do not address either the overhead or the latency issues directly. Industry streaming solutions continue to adopt regular codecs, shipping them in hardware to avoid computational challenges, which makes it harder to adopt new innovations.
Neural video codecs. A learning approach to video coding has shown tremendous improvement in compression efficiency in just a few years [43, 60, 65]. Figure 2 shows bits-per-pixel vs. PSNR plots for several generations of codecs of two types: neural codecs that use deep learning, and traditional algorithmic codecs that use the popular H.26x standards. (Bits-per-pixel captures compression efficiency and PSNR, the peak signal-to-noise ratio, captures image quality; together the two metrics capture codec performance.) It took algorithmic codecs 18 years to make the same progress that neural codecs achieved in the last 4 years! One reason for this rapid development is that neural codecs run in software that can be integrated as part of the application, supporting agile codec development and royalty-free codecs. Further, they run on data-parallel platforms such as GPUs that are increasingly available and powerful.
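Both metrics are straightforward to compute; a small sketch for 8-bit frames (the helper names are ours, not from the paper):

```python
import numpy as np

def bits_per_pixel(compressed_size_bytes: int, width: int, height: int) -> float:
    """Compression efficiency: total code bits divided by the pixel count."""
    return compressed_size_bytes * 8 / (width * height)

def psnr(original: np.ndarray, decoded: np.ndarray, peak: float = 255.0) -> float:
    """Image quality: peak signal-to-noise ratio in dB for 8-bit frames."""
    mse = np.mean((original.astype(np.float64) - decoded.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10 * np.log10(peak ** 2 / mse)
```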
There are several insights in using neural codecs for video coding: 1) unlike traditional layered coding methods, where it is nontrivial to handcraft each layer to carry unique information, a neural network's loss function can be optimized to encode a video frame into unique layered codes by purely encoding residuals from previous layers, without introducing reconstruction drift (throughout the paper, the term 'layer' refers to compressed code layers, not neural network layers); 2) a neural network can be trained to accept a subset of the layered codes and decode all of them in a single shot, which again was traditionally difficult to do with a handcrafted algorithm due to nonlinear relationships among the codes; and 3) neural codecs enable software-driven coding. We note here that GPUs or similar accelerators for neural network computation are critical for success with neural codecs. Fortunately, they are increasingly common in modern devices.
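To illustrate the second insight, here is a minimal sketch of a decoder that accepts any prefix of the layered codes and reconstructs the frame in one forward pass. The architecture (channel-wise concatenation with zero-filled placeholders for missing layers, a small convolutional head, and codes kept at frame resolution for brevity) is our illustrative assumption, not Swift's actual network:

```python
import torch
import torch.nn as nn

class FusionDecoder(nn.Module):
    """Sketch: decode any prefix of layered codes in a single forward pass.

    Missing higher layers are zero-filled so one network handles every subset
    (an assumption for illustration; spatial upsampling is omitted for brevity).
    """
    def __init__(self, num_layers: int, code_channels: int):
        super().__init__()
        self.num_layers = num_layers
        self.net = nn.Sequential(
            nn.Conv2d(num_layers * code_channels, 64, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),  # reconstruct an RGB frame
        )

    def forward(self, codes: list[torch.Tensor]) -> torch.Tensor:
        zero = torch.zeros_like(codes[0])
        padded = codes + [zero] * (self.num_layers - len(codes))
        return self.net(torch.cat(padded, dim=1))  # one shot, any subset
```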
Swift. Based on the above insights, we present Swift, a novel video streaming system using layered coding built on the principles of neural video codecs [32, 60, 65]. (The source code of Swift is available at https://github.com/VideoForage/swift.) We show that learning can address the challenges of layered coding mentioned earlier: there is no additional compression overhead with further layering, and the decoding latency is independent of the number of layers. Swift consists of three design components: i) a server-side encoder plus decoder, ii) a client-side decoder, and iii) an ABR protocol adapted to layered coding and varying compute capacity (in addition to varying network capacity).
Figure 2: Evolution of neural and algorithmic video codecs, showing compression-efficiency plots across generations (neural codecs: 4 years of progress; algorithmic codecs: 18 years).
We evaluate Swift with diverse video content and FCC-released real-world network traces [8]. We compare Swift with state-of-the-art streaming algorithms that combine either regular coding [35, 51, 52] or layered coding [31] with state-of-the-art ABR algorithms. In terms of QoE, Swift outperforms the next-best streaming alternative by 45%. It does so using 16% less bandwidth, and it has a lower reaction time to changing network conditions. In terms of the neural codec, Swift's layered coding outperforms the state-of-the-art layered codec (SHVC [12]) by 58% in compression ratio and by 4× (for six layers) in decoding latency. In summary, our contributions are the following:
• We show how deep learning-based coding can make layered coding both practical and high-performing, while addressing the existing challenges that stymied interest in layered coding.
• We design and build Swift to demonstrate a practical layered-coding-based video streaming system. Swift is an embodiment of deep learning-based encoding and decoding methods along with a purpose-built ABR protocol.
• We comprehensively evaluate and showcase the multi-dimensional benefits of Swift in terms of QoE, bandwidth usage, reaction times, and compression efficiency.
2 Motivation
2.1 Limitations of Today's Video Streaming Due to Regular Coding
Today's video providers predominantly use source rate adaptation (e.g., MPEG-DASH [50]), where video segments are encoded at different qualities on the server and an adaptive bitrate (ABR) algorithm chooses the best-quality segment to download based on the network capacity.
The widely deployed streaming solutions use regular, standards-driven, algorithmic coding methods such as H.265/HEVC [53] or VP9 [4] for encoding video segments. These coding methods do not allow segments to be upgraded or downgraded based on network conditions.
Figure 3 illustrates this problem with an example experiment (the methodology is described in §6.1). The figure shows the quality of segments fetched by different state-of-the-art ABR algorithms that use regular coding. During the experiment, the throughput improves drastically at the 100-second mark. Two state-of-the-art streaming algorithms, Pensieve [35] and BOLA [52], cannot upgrade the quality of a segment once it has been downloaded, causing a slow reaction to the improved throughput. As Figure 3(b) shows, BOLA-FS [51], a version of BOLA, does allow a higher-quality segment to be re-downloaded when network conditions improve; however, the previously downloaded lower-quality segment is discarded, resulting in wasted bandwidth.

Figure 3: Limitations of today's ABR algorithms because of regular coding: either slow reaction to network conditions, or bandwidth wastage to achieve a faster reaction to the highest quality. (a) Most ABR algorithms (BOLA, Pensieve) cannot upgrade the quality of a video segment once downloaded and are slow to react to changing network conditions. (b) BOLA-FS does allow video quality to be upgraded by re-downloading a higher-quality segment; however, the previously downloaded segment is wasted. The reaction latency includes the time to notice the throughput increase as well as to play the buffered segments, so the segment duration (5 s here) plays a role. Pensieve aggressively controls video quality fluctuations to compensate for incorrect bandwidth prediction, hence the sudden jump in quality compared to BOLA.
2.2 Layered Coding
A more suitable coding method to address the above issues is layered coding, where a video segment is encoded into a base layer (providing the lowest playback quality) and multiple enhancement layers, as shown in Figure 1. Layered coding clearly gives much finer control over rate adaptation than regular coding. For example, multiple enhancement layers for the same segment can be fetched incrementally as the estimate of the network capacity improves closer to the playback time, which is not possible with regular coding.

Figure 4: Compression efficiency of traditional layered coding. We use H.265 [53] and its layered extension SHVC [12] to encode the videos (described in §6.1). The single-layer bitrate curve is the same for both; the additional layers are for SHVC. As shown, SHVC requires 2.5× more bits for 4 layers than a single layer at the same quality.
2.3 Challenges of Adopting Traditional Layered Coding in Video Streaming
Layered coding has typically been developed and implemented as an extension of a regular coding technique. Published standards demonstrate this dependency: SHVC [12] was developed as an extension of H.265 [53]; similarly, the older SVC [47] is an extension of H.264 [57]. Developing layered coding as an extension on top of regular coding introduces multiple challenges in real-life deployments:
1) Cross-layer compression overhead: The key to the large compression gains of current-generation video coding standards (e.g., a ≈2000× compression ratio for H.265 [53]) is inter-frame prediction: consecutive frames are similar, so it is efficient to simply encode the difference between consecutive frames. However, using inter-layer frame prediction across the enhancement layers of the current frame with respect to the previous frame makes the video quality drift during decoding [29, 42, 61, 67]. To minimize or avoid this drift, most layered coding methods do not use inter-frame prediction across layers and thus lose out on its compression benefits [11, 17, 31]. In effect, to achieve the same quality, layered coding (e.g., SHVC) requires significantly more bits than its regular counterpart (e.g., H.265). In our study, we find that a 4-layer SHVC encoding needs 2.5× the bits per pixel of its regular counterpart, H.265 (see Figure 4).
2) High encoding and decoding latency: The computational complexity of these algorithmic codecs mainly comes from the motion estimation process during inter-frame prediction [53, 57]. During motion estimation, it is useful, for each pixel, to encode its motion vector, i.e., where its relative location was in the previous frame. The motion vectors are computed for each frame by dividing the frame into thousands of blocks of pixels and searching for a similar block in the previous frames. In general, codecs search blocks over a set of previous frames for each frame, which is computationally expensive. The process becomes even more complex for layered coding, because each layer has to be decoded one after the other owing to the dependency of each layer on the previous one (to exploit content redundancy) [11, 27, 30]. This serial process makes the latency a function of the number of layers, so the latency increases progressively as the number of layers grows.
Figure 5: Latency challenges of traditional layered coding (per-frame decoding latency in ms vs. number of layers, SHVC vs. x265). The decoder is run on a high-end desktop (as described in §5) using a single-threaded implementation of SHVC [3].
Figure 5 shows the per-frame decoding latency of the state-of-the-art layered coding (i.e., SHVC) for a 1-min video on a desktop with the configuration described in §6.1. As shown, it takes more than 100 ms to decode each frame with 5 layers, an order-of-magnitude increase in coding latency compared to its regular counterpart H.265 (an x265 [7] implementation). Despite several past optimizations, this range of latencies makes real-time decoding infeasible on heterogeneous platforms. Recent studies (e.g., Jigsaw [11]) tackle this challenge by proposing a lightweight layered coding method (using a GPU implementation), but the latency is still a function of the number of layers.
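To make the scaling concrete, a toy latency model under an assumed fixed per-layer decode cost (the 20 ms figure below is an illustrative placeholder, not a measurement):

```python
# Toy model: traditional layered decoding is serial because layer k needs
# layer k-1's output, so per-frame latency grows linearly with layer count
# (cf. Figure 5). The per-layer cost is an assumed placeholder value.
def serial_decode_latency_ms(num_layers: int, per_layer_ms: float = 20.0) -> float:
    return num_layers * per_layer_ms

print([serial_decode_latency_ms(n) for n in range(1, 6)])
# -> [20.0, 40.0, 60.0, 80.0, 100.0], i.e., ~100 ms per frame at 5 layers
```

A single-shot neural decoder instead decodes any subset of layers in one pass, so its latency is roughly independent of the layer count.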
Because of these challenges, traditional layered coding is not used in practice today. In this work, rather than approaching this problem with yet another extension, we explore layered coding via a clean-slate, learning-based approach, aiming for efficient layered compression by embracing the opportunities of new hardware capabilities (e.g., GPUs and other data-parallel accelerators).
2.4 Layered Coding using Neural Codecs
Video compression has recently experienced a paradigm shift in the computer vision community due to new advances in deep learning [32, 43, 60, 65]. The compression/decompression here is achieved using neural networks that we refer to as neural video codecs.

The basic idea is the use of an AutoEncoder (AE), a neural network architecture for learning efficient encodings that has long been used for dimensionality reduction [20]. The AE consists of an encoder and a decoder. The encoder converts an input video to a code vector of lower dimension than the input, and the decoder reconstructs (perhaps with a small error) the original input video from the low-dimension code vector. The neural network weight parameters (W_i for the encoder and W'_i for the decoder) are trained by minimizing the reconstruction error, that is, the difference between the input and the output of the decoder. The smaller the code, the larger the compression factor, but the higher the reconstruction error.

Figure 6: Illustrating the residuals (r_0, ..., r_3) from an original frame to a series of compressed-then-decoded frames (MS-SSIM rising from 0.90 to 0.99 against the original's 1.0). MS-SSIM [56] is a perceptual measure of image quality. A highly compressed frame (lowest MS-SSIM) has more residual information (r_0).
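A minimal PyTorch sketch of such an autoencoder and its reconstruction-loss training step; the layer sizes, channel counts, and learning rate are illustrative assumptions, not Swift's actual network:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AutoEncoder(nn.Module):
    """Minimal convolutional autoencoder: frame -> low-dimensional code -> frame."""
    def __init__(self, code_channels: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(  # downsample 4x spatially
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, code_channels, 4, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(  # mirror: upsample back to the input size
            nn.ConvTranspose2d(code_channels, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
frame = torch.rand(1, 3, 64, 64)        # stand-in for a video frame
opt.zero_grad()
loss = F.mse_loss(model(frame), frame)  # reconstruction error
loss.backward()
opt.step()
```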
Our insight in using autoencoders is that their loss function can be optimized to encode a video frame into unique layered codes by purely encoding residuals from previous layers, unlike traditional layered coding, where it is nontrivial to handcraft each layer to carry unique information.
3 Swift
3.1 Overview
Autoencoders have already been shown to provide similar or better performance than traditional codecs [32, 43, 60]. Recent work such as Elf-vc [43] is also able to use autoencoders to provide flexible-rate video coding that fits a target network capacity or achieves a target compression quality. However, current work does not provide a way to encode the video into incrementally decodable layers. To do this, we make use of residuals to form layered codes. A residual is the difference between the input to an encoder and the output of the corresponding decoder. Residuals have been used in the past for tasks such as recognition and compression to improve the application's efficiency (e.g., classification accuracy or compression efficiency) [19, 55, 60].
Swift uses residuals to design layered codecs for video streaming. The idea is to employ a chain of autoencoders of identical structure. Each autoencoder in the chain encodes the residual from the previous layer, with the very first autoencoder in the chain (implementing the base layer) encoding the input video frames. Figure 6 shows an example, where the residuals from an original frame are shown for a series of progressively compressed-then-decoded frames. The first decoded frame (marked MS-SSIM = 0.90) has a relatively high loss from the original frame; as a result, the residual r_0 carries more information. When this residual information is used for the next layer's compression, the resulting decoded frame is closer to the original, and in turn the residual carries less information, and so on.
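A minimal sketch of this chaining, assuming the illustrative AutoEncoder class from the earlier snippet (the real Swift encoder/decoder networks and training procedure are more elaborate):

```python
import torch

# Layered residual encoding: a chain of identical autoencoders, each encoding
# what the previous layers failed to reconstruct. `AutoEncoder` here is the
# illustrative class sketched in §2.4, not the paper's actual network.

def encode_layers(frame: torch.Tensor, aes: list) -> list[torch.Tensor]:
    codes, target = [], frame
    for ae in aes:                            # ae = one layer's autoencoder
        code = ae.encoder(target)
        codes.append(code)
        target = target - ae.decoder(code)    # residual feeds the next layer
    return codes

def decode_layers(codes: list[torch.Tensor], aes: list) -> torch.Tensor:
    # Reference serial reconstruction: summing the per-layer decoded outputs
    # approximates the frame, since each layer decodes the previous residual.
    # (Swift instead trains a fusion decoder to do this in a single shot.)
    return sum(ae.decoder(c) for ae, c in zip(aes, codes))
```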
The above 'iterative' chaining implicitly represents a layered encoding mechanism. Each iteration (i.e., layer) pro-