Overview: This document presents Swift, a deep learning-based adaptive video streaming system that uses layered neural coding to provide efficient and flexible video compression and decoding. Swift consists of three main parts: 1) a server-side encoder and decoder; 2) a client-side single-shot decoder; and 3) an adaptive bitrate (ABR) protocol suited to layered coding. By optimizing layered coding, Swift achieves higher video quality and faster reaction times under low-bandwidth conditions and varying compute resources, while reducing bandwidth usage.
Intended audience: researchers, engineers, and developers of video streaming systems.
Use cases and goals: 1) video platforms that must deliver high-quality streams under unstable network conditions; 2) scenarios that require flexibly adjusting video quality to match different clients' compute capabilities; 3) companies and research institutions seeking to improve streaming quality of experience (QoE).
Additional notes: beyond its technical novelty, Swift addresses the long-standing challenges of traditional layered coding and demonstrates the potential of deep learning for video codecs; by reducing encoding overhead and decoding latency, Swift significantly improves overall streaming performance.
Swift: Adaptive Video Streaming with Layered Neural Codecs
Mallesham Dasari, Kumara Kahatapitiya, Samir R. Das, Aruna Balasubramanian, Dimitris Samaras
Stony Brook University
Abstract
Layered video coding compresses video segments into layers (additional code bits). Decoding with each additional layer improves the video quality incrementally. This approach has potential for very fine-grained rate adaptation. However, layered coding has not seen much success in practice because of its cross-layer compression overheads and decoding latencies. We take a fresh approach to layered video coding by exploiting recent advances in video coding using deep learning techniques. We develop Swift, an adaptive video streaming system that includes i) a layered encoder that learns to encode a video frame into layered codes by purely encoding residuals from previous layers, without introducing any cross-layer compression overheads; ii) a decoder that can fuse together a subset of these codes (based on availability) and decode them all in one go; and iii) an adaptive bitrate (ABR) protocol that synergistically adapts video quality based on available network and client-side compute capacity. Swift can be integrated easily into the current streaming ecosystem without any change to network protocols and applications, by simply replacing the current codecs with the proposed layered neural video codec when appropriate GPU or similar accelerator functionality is available on the client side. Extensive evaluations reveal Swift's multi-dimensional benefits over prior video streaming systems.
1 Introduction
Internet video delivery often encounters highly variable and unpredictable network conditions. Despite various advances, delivering the highest possible video quality remains a challenging problem because of this uncertainty. The problem is more acute in wireless networks, where channel conditions and mobility add to the uncertainty [39, 46]. Interestingly, next-generation wireless networks may make the problem even more challenging (e.g., 60 GHz/mmWave [10, 11, 38]).
To counter the challenges posed by such varying network capacity, current video delivery solutions predominantly practice adaptive streaming (e.g., DASH [50]), where a source video is split into segments that are encoded at the server into multiple bitrates providing different video qualities, and a client runs an adaptive bitrate (ABR) algorithm to dynamically select the highest quality that fits within the estimated network capacity for the next segment to be downloaded.
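For concreteness, the core of such a rate-based ABR decision can be sketched in a few lines. This is a minimal sketch; the bitrate ladder and the safety factor below are illustrative assumptions, not values from any deployed system:

```python
# Minimal throughput-based ABR sketch (illustrative; not any specific system's algorithm).
BITRATE_LADDER_KBPS = [400, 1200, 2500, 5300, 12000]  # hypothetical quality levels

def select_quality(throughput_estimate_kbps: float, safety: float = 0.9) -> int:
    """Pick the highest quality level whose bitrate fits under the estimated capacity."""
    budget = throughput_estimate_kbps * safety  # leave headroom for estimation error
    best = 0
    for level, bitrate in enumerate(BITRATE_LADDER_KBPS):
        if bitrate <= budget:
            best = level
    return best

print(select_quality(3000))  # -> 2, since 2500 kbps fits under 0.9 * 3000
```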
Figure 1: Layered vs. regular coding methods. In regular coding, the video segments are coded independently at different qualities (Rate_1 at 360p up to Rate_M at 4K). In layered coding, a given quality level Q_m can be reconstructed by combining the codes for multiple layers (c_0, c_0 + c_1, ..., c_0 + c_1 + ... + c_L), thus facilitating incremental upgrades or downgrades.

Need for layered coding. Most current commercial ABR algorithms adopt a monolithic encoding practice (e.g., H.265 [53]), where the same video segment is encoded 'independently' for each quality level. The decision to fetch a segment at a certain quality is considered final once the ABR algorithm makes a determination based on its estimate of the network capacity. However, these estimates are far from accurate, resulting in either underutilizing or overshooting the network capacity. For example, the ABR algorithm may fetch at a low quality by underestimating the network capacity, or it may fetch at a high quality, causing video stalls, by overestimating it. Consequently, even optimal ABR algorithms fail to provide a good quality of experience (QoE), because such rigid methods do not fit the needs of the streaming conditions.
An alternate technique, called layered coding, has long been studied [12, 14, 36, 47, 67] and can avoid the above streaming issues. The key idea is that, instead of independently encoding the segment at different qualities, the segment is encoded into layers; the base layer provides a certain video quality, and additional layers improve the video quality when applied over the base layer. See Figure 1. This means that, if the network throughput improves, the client can fetch additional layers to improve video quality at a much lower cost compared to a regular codec. (We use the term regular coding for the current practice of independent encoding in multiple qualities, as in current standards such as H.265/HEVC [53].)
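With layered coding, upgrading a buffered segment reduces to deciding how many additional enhancement layers fit into the spare download budget. A hedged sketch of that decision, with hypothetical layer sizes:

```python
# Sketch of an incremental layer-upgrade decision for one buffered segment
# (illustrative layer sizes; not taken from the paper).
LAYER_SIZE_KBITS = [800, 400, 300, 250, 200]  # base layer + enhancement layers

def extra_layers_to_fetch(layers_downloaded: int,
                          spare_capacity_kbits: float) -> int:
    """Return how many additional enhancement layers fit in the spare budget."""
    extra = 0
    for size in LAYER_SIZE_KBITS[layers_downloaded:]:
        if size > spare_capacity_kbits:
            break  # layers must be applied in order, so stop at the first miss
        spare_capacity_kbits -= size
        extra += 1
    return extra

# With the base layer already fetched and 750 kbits to spare, the segment
# can be upgraded by two enhancement layers (400 + 300 <= 750).
print(extra_layers_to_fetch(1, 750))  # -> 2
```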
Challenges with layered coding. Layered coding, however, faces two nontrivial challenges: compression overhead and coding latency. The compression overhead mainly comes from forgoing inter-layer frame prediction to avoid reconstruction drift in quality [29, 42, 61, 67]. The decoding latency, on the other hand, is a function of the number of layers, as each layer needs to be decoded separately. (We use the terms coding or codec for encoding and decoding together; we also use encoding/compression and decoding/decompression interchangeably.) Notwithstanding these issues, some studies have indeed applied layered coding in streaming and have shown slightly better QoE compared to regular coding methods, benefiting from its ability to do dynamic quality upgrades [31]. However, they do not address either the overhead or the latency issues directly. Industry streaming solutions continue to adopt regular codecs, shipping them in hardware to avoid computational challenges, which makes it harder to adopt new innovations.
Neural video codecs. A learning approach to video coding has shown tremendous improvement in compression efficiency in just a few years [43, 60, 65]. Figure 2 shows bits-per-pixel vs. PSNR plots for several generations of codecs of two types: neural codecs that use deep learning, and traditional algorithmic codecs that use the popular H.26x standards. (Bits-per-pixel captures compression efficiency and PSNR, the peak signal-to-noise ratio, captures image quality; together the two metrics capture codec performance.) It took algorithmic codecs 18 years to make the same progress that neural codecs achieved in the last 4 years! One reason for this rapid development is that neural codecs run in software that can be integrated as part of the application, supporting agile codec development and royalty-free codecs. Further, they run on data-parallel platforms such as GPUs that are increasingly available and powerful.
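Both metrics are straightforward to compute; a small sketch for 8-bit frames (the helper names are ours, not from the paper):

```python
import numpy as np

def bits_per_pixel(compressed_size_bytes: int, width: int, height: int) -> float:
    """Compression efficiency: total code bits divided by the pixel count."""
    return compressed_size_bytes * 8 / (width * height)

def psnr(original: np.ndarray, decoded: np.ndarray, peak: float = 255.0) -> float:
    """Image quality: peak signal-to-noise ratio in dB for 8-bit frames."""
    mse = np.mean((original.astype(np.float64) - decoded.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10 * np.log10(peak ** 2 / mse)
```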
There are several insights in using neural codecs for video coding: 1) unlike traditional layered coding methods, where it is nontrivial to handcraft each layer to carry unique information, a neural network's loss function can be optimized to encode a video frame into unique layered codes by purely encoding residuals from previous layers, without introducing reconstruction drift (throughout the paper, the term 'layer' refers to compressed code layers, not neural network layers); 2) a neural network can be trained to accept a subset of the layered codes and decode all of them in a single shot, which again was traditionally difficult to do with a handcrafted algorithm due to nonlinear relationships among the codes; and 3) neural codecs enable software-driven coding. We note here that GPUs or similar accelerators for neural network computation are critical for success with neural codecs. Fortunately, they are increasingly common in modern devices.
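To illustrate the second insight, here is a minimal sketch of a decoder that accepts any prefix of the layered codes and reconstructs the frame in one forward pass. The architecture (channel-wise concatenation with zero-filled placeholders for missing layers, a small convolutional head, and codes kept at frame resolution for brevity) is our illustrative assumption, not Swift's actual network:

```python
import torch
import torch.nn as nn

class FusionDecoder(nn.Module):
    """Sketch: decode any prefix of layered codes in a single forward pass.

    Missing higher layers are zero-filled so one network handles every subset
    (an assumption for illustration; spatial upsampling is omitted for brevity).
    """
    def __init__(self, num_layers: int, code_channels: int):
        super().__init__()
        self.num_layers = num_layers
        self.net = nn.Sequential(
            nn.Conv2d(num_layers * code_channels, 64, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),  # reconstruct an RGB frame
        )

    def forward(self, codes: list[torch.Tensor]) -> torch.Tensor:
        zero = torch.zeros_like(codes[0])
        padded = codes + [zero] * (self.num_layers - len(codes))
        return self.net(torch.cat(padded, dim=1))  # one shot, any subset
```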
Swift. Based on the above insights, we present Swift, a novel video streaming system using layered coding built on the principles of neural video codecs [32, 60, 65]. (The source code of Swift is available at https://github.com/VideoForage/swift.) We show that learning can address the challenges of layered coding mentioned earlier: there is no additional compression overhead with further layering, and the decoding latency is independent of the number of layers. Swift consists of three design components: i) a server-side encoder plus decoder, ii) a client-side decoder, and iii) an ABR protocol adapted to layered coding and varying compute capacity (in addition to varying network capacity).
Figure 2: Evolution of neural and algorithmic video codecs, showing compression-efficiency plots across generations (neural codecs: 4 years of progress; algorithmic codecs: 18 years).
We evaluate Swift with diverse video content and FCC-released real-world network traces [8]. We compare Swift with state-of-the-art streaming algorithms that combine either regular coding [35, 51, 52] or layered coding [31] with state-of-the-art ABR algorithms. In terms of QoE, Swift outperforms the next-best streaming alternative by 45%. It does so using 16% less bandwidth, and it has a lower reaction time to changing network conditions. In terms of the neural codec, Swift's layered coding outperforms the state-of-the-art layered codec (SHVC [12]) by 58% in compression ratio and by 4× (for six layers) in decoding latency. In summary, our contributions are the following:
• We show how deep learning-based coding can make layered coding both practical and high-performing, while addressing the existing challenges that stymied interest in layered coding.
• We design and build Swift to demonstrate a practical layered-coding-based video streaming system. Swift is an embodiment of deep learning-based encoding and decoding methods along with a purpose-built ABR protocol.
• We comprehensively evaluate and showcase the multi-dimensional benefits of Swift in terms of QoE, bandwidth usage, reaction times, and compression efficiency.
2 Motivation
2.1 Limitations of Today's Video Streaming Due to Regular Coding
Today's video providers predominantly use source rate adaptation (e.g., MPEG-DASH [50]), where video segments are encoded at different qualities on the server and an adaptive bitrate (ABR) algorithm chooses the best-quality segment to download based on the network capacity.
The widely deployed streaming solutions use regular, standards-driven, algorithmic coding methods such as H.265/HEVC [53] or VP9 [4] for encoding video segments. These coding methods do not allow segments to be upgraded or downgraded based on network conditions.
Figure 3 illustrates this problem with an example experiment (the methodology is described in §6.1). The figure shows the quality of segments fetched by different state-of-the-art ABR algorithms that use regular coding. During the experiment, the throughput improves drastically at the 100-second mark. Two state-of-the-art streaming algorithms, Pensieve [35] and BOLA [52], cannot upgrade the quality of a segment once it has been downloaded, causing a slow reaction to the improved throughput. As Figure 3(b) shows, BOLA-FS [51], a version of BOLA, does allow a higher-quality segment to be re-downloaded when network conditions improve; however, the previously downloaded lower-quality segment is discarded, resulting in wasted bandwidth.

Figure 3: Limitations of today's ABR algorithms because of regular coding: either slow reaction to network conditions, or bandwidth wastage to achieve a faster reaction to the highest quality. (a) Most ABR algorithms (BOLA, Pensieve) cannot upgrade the quality of a video segment once downloaded and are slow to react to changing network conditions. (b) BOLA-FS does allow video quality to be upgraded by re-downloading a higher-quality segment; however, the previously downloaded segment is wasted. The reaction latency includes the time to notice the throughput increase as well as to play the buffered segments, so the segment duration (5 s here) plays a role. Pensieve aggressively controls video quality fluctuations to compensate for incorrect bandwidth prediction, hence the sudden jump in quality compared to BOLA.
2.2 Layered Coding
A more suitable coding method to address the above issues is layered coding, where a video segment is encoded into a base layer (providing the lowest playback quality) and multiple enhancement layers, as shown in Figure 1. Layered coding clearly gives much finer control over rate adaptation than regular coding. For example, multiple enhancement layers for the same segment can be fetched incrementally as the estimate of the network capacity improves closer to the playback time, which is not possible with regular coding.

Figure 4: Compression efficiency of traditional layered coding. We use H.265 [53] and its layered extension SHVC [12] to encode the videos (described in §6.1). The single-layer bitrate curve is the same for both; the additional layers are for SHVC. As shown, SHVC requires 2.5× more bits for 4 layers than a single layer at the same quality.
2.3 Challenges of Adopting Traditional Layered Coding in Video Streaming
Layered coding has typically been developed and implemented as an extension of a regular coding technique. Published standards demonstrate this dependency: SHVC [12] was developed as an extension of H.265 [53]; similarly, the older SVC [47] is an extension of H.264 [57]. Developing layered coding as an extension on top of regular coding introduces multiple challenges in real-life deployments:
1) Cross-layer compression overhead: The key to the large compression gains of current-generation video coding standards (e.g., a ≈2000× compression ratio for H.265 [53]) is inter-frame prediction: consecutive frames are similar, so it is efficient to simply encode the difference between consecutive frames. However, using inter-layer frame prediction across the enhancement layers of the current frame with respect to the previous frame makes the video quality drift during decoding [29, 42, 61, 67]. To minimize or avoid this drift, most layered coding methods do not use inter-frame prediction across layers and thus lose out on its compression benefits [11, 17, 31]. In effect, to achieve the same quality, layered coding (e.g., SHVC) requires significantly more bits than its regular counterpart (e.g., H.265). In our study, we find that a 4-layer SHVC encoding needs 2.5× the bits per pixel of its regular counterpart, H.265 (see Figure 4).
2) High encoding and decoding latency: The computational complexity of these algorithmic codecs mainly comes from the motion estimation process during inter-frame prediction [53, 57]. During motion estimation, it is useful, for each pixel, to encode its motion vector, i.e., where its relative location was in the previous frame. The motion vectors are computed for each frame by dividing the frame into thousands of blocks of pixels and searching for a similar block in the previous frames. In general, codecs search blocks over a set of previous frames for each frame, which is computationally expensive. The process becomes even more complex for layered coding, because each layer has to be decoded one after the other owing to the dependency of each layer on the previous one (to exploit content redundancy) [11, 27, 30]. This serial process makes the latency a function of the number of layers, so the latency increases progressively as the number of layers grows.
Figure 5: Latency challenges of traditional layered coding (per-frame decoding latency in ms vs. number of layers, SHVC vs. x265). The decoder is run on a high-end desktop (as described in §5) using a single-threaded implementation of SHVC [3].
Figure 5 shows the per-frame decoding latency of the state-of-the-art layered coding (i.e., SHVC) for a 1-min video on a desktop with the configuration described in §6.1. As shown, it takes more than 100 ms to decode each frame with 5 layers, an order-of-magnitude increase in coding latency compared to its regular counterpart H.265 (an x265 [7] implementation). Despite several past optimizations, this range of latencies makes real-time decoding infeasible on heterogeneous platforms. Recent studies (e.g., Jigsaw [11]) tackle this challenge by proposing a lightweight layered coding method (using a GPU implementation), but the latency is still a function of the number of layers.
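To make the scaling concrete, a toy latency model under an assumed fixed per-layer decode cost (the 20 ms figure below is an illustrative placeholder, not a measurement):

```python
# Toy model: traditional layered decoding is serial because layer k needs
# layer k-1's output, so per-frame latency grows linearly with layer count
# (cf. Figure 5). The per-layer cost is an assumed placeholder value.
def serial_decode_latency_ms(num_layers: int, per_layer_ms: float = 20.0) -> float:
    return num_layers * per_layer_ms

print([serial_decode_latency_ms(n) for n in range(1, 6)])
# -> [20.0, 40.0, 60.0, 80.0, 100.0], i.e., ~100 ms per frame at 5 layers
```

A single-shot neural decoder instead decodes any subset of layers in one pass, so its latency is roughly independent of the layer count.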
Because of these challenges, traditional layered coding is not used in practice today. In this work, rather than approaching this problem with yet another extension, we explore layered coding via a clean-slate, learning-based approach, aiming for efficient layered compression by embracing the opportunities of new hardware capabilities (e.g., GPUs and other data-parallel accelerators).
2.4 Layered Coding using Neural Codecs
Video compression has recently experienced a paradigm shift in the computer vision community due to new advances in deep learning [32, 43, 60, 65]. The compression/decompression here is achieved using neural networks that we refer to as neural video codecs.

The basic idea is the use of an AutoEncoder (AE), a neural network architecture for learning efficient encodings that has long been used for dimensionality reduction [20]. The AE consists of an encoder and a decoder. The encoder converts an input video to a code vector of lower dimension than the input, and the decoder reconstructs (perhaps with a small error) the original input video from the low-dimension code vector. The neural network weight parameters (W_i for the encoder and W'_i for the decoder) are trained by minimizing the reconstruction error, that is, the difference between the input and the output of the decoder. The smaller the code, the larger the compression factor, but the higher the reconstruction error.

Figure 6: Illustrating the residuals (r_0, ..., r_3) from an original frame to a series of compressed-then-decoded frames (MS-SSIM rising from 0.90 to 0.99 against the original's 1.0). MS-SSIM [56] is a perceptual measure of image quality. A highly compressed frame (lowest MS-SSIM) has more residual information (r_0).
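A minimal PyTorch sketch of such an autoencoder and its reconstruction-loss training step; the layer sizes, channel counts, and learning rate are illustrative assumptions, not Swift's actual network:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AutoEncoder(nn.Module):
    """Minimal convolutional autoencoder: frame -> low-dimensional code -> frame."""
    def __init__(self, code_channels: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(  # downsample 4x spatially
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, code_channels, 4, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(  # mirror: upsample back to the input size
            nn.ConvTranspose2d(code_channels, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
frame = torch.rand(1, 3, 64, 64)        # stand-in for a video frame
opt.zero_grad()
loss = F.mse_loss(model(frame), frame)  # reconstruction error
loss.backward()
opt.step()
```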
Our insight in using autoencoders is that their loss function can be optimized to encode a video frame into unique layered codes by purely encoding residuals from previous layers, unlike traditional layered coding, where it is nontrivial to handcraft each layer to carry unique information.
3 Swift
3.1 Overview
Autoencoders have already been shown to provide similar or better performance than traditional codecs [32, 43, 60]. Recent work such as Elf-vc [43] is also able to use autoencoders to provide flexible-rate video coding that fits a target network capacity or achieves a target compression quality. However, current work does not provide a way to encode the video into incrementally decodable layers. To do this, we make use of residuals to form layered codes. A residual is the difference between the input to an encoder and the output of the corresponding decoder. Residuals have been used in the past for tasks such as recognition and compression to improve the application's efficiency (e.g., classification accuracy or compression efficiency) [19, 55, 60].
Swift uses residuals to design layered codecs for video streaming. The idea is to employ a chain of autoencoders of identical structure. Each autoencoder in the chain encodes the residual from the previous layer, with the very first autoencoder in the chain (implementing the base layer) encoding the input video frames. Figure 6 shows an example, where the residuals from an original frame are shown for a series of progressively compressed-then-decoded frames. The first decoded frame (marked MS-SSIM = 0.90) has a relatively high loss from the original frame; as a result, the residual r_0 carries more information. When this residual information is used for the next layer's compression, the resulting decoded frame is closer to the original, and in turn the residual carries less information, and so on.
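A minimal sketch of this chaining, assuming the illustrative AutoEncoder class from the earlier snippet (the real Swift encoder/decoder networks and training procedure are more elaborate):

```python
import torch

# Layered residual encoding: a chain of identical autoencoders, each encoding
# what the previous layers failed to reconstruct. `AutoEncoder` here is the
# illustrative class sketched in §2.4, not the paper's actual network.

def encode_layers(frame: torch.Tensor, aes: list) -> list[torch.Tensor]:
    codes, target = [], frame
    for ae in aes:                            # ae = one layer's autoencoder
        code = ae.encoder(target)
        codes.append(code)
        target = target - ae.decoder(code)    # residual feeds the next layer
    return codes

def decode_layers(codes: list[torch.Tensor], aes: list) -> torch.Tensor:
    # Reference serial reconstruction: summing the per-layer decoded outputs
    # approximates the frame, since each layer decodes the previous residual.
    # (Swift instead trains a fusion decoder to do this in a single shot.)
    return sum(ae.decoder(c) for ae, c in zip(aes, codes))
```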
The above 'iterative' chaining implicitly represents a layered encoding mechanism. Each iteration (i.e., layer) pro-