Open-Vocabulary Video Anomaly Detection
Peng Wu¹, Xuerong Zhou¹, Guansong Pang²*, Yujia Sun³, Jing Liu³, Peng Wang¹*, Yanning Zhang¹
¹Northwestern Polytechnical University, ²Singapore Management University, ³Xidian University
(*Corresponding Authors)
{xdwupeng, zxr2333}@gmail.com, gspang@smu.edu.sg, yjsun@stu.xidian.edu.cn
neouma@163.com, {peng.wang, ynzhang}@nwpu.edu.cn
Abstract
Current video anomaly detection (VAD) approaches with
weak supervision are inherently limited to a closed-set set-
ting and may struggle in open-world applications where
there can be anomaly categories in the test data unseen
during training. A few recent studies attempt to tackle a
more realistic setting, open-set VAD, which aims to de-
tect unseen anomalies given seen anomalies and normal
videos. However, such a setting focuses on predicting frame
anomaly scores, having no ability to recognize the specific
categories of anomalies, despite the fact that this ability is
essential for building more informed video surveillance sys-
tems. This paper takes a step further and explores open-
vocabulary video anomaly detection (OVVAD), in which we
aim to leverage pre-trained large models to detect and cate-
gorize seen and unseen anomalies. To this end, we propose
a model that decouples OVVAD into two mutually comple-
mentary tasks – class-agnostic detection and class-specific
classification – and jointly optimizes both tasks. In particu-
lar, we devise a semantic knowledge injection module to
introduce semantic knowledge from large language models
for the detection task, and design a novel anomaly synthesis
module to generate pseudo unseen anomaly videos with the
help of large vision generation models for the classification
task. The injected semantic knowledge and synthesized
anomalies substantially extend our model’s capability in
detecting and categorizing a variety of seen and unseen
anomalies. Extensive experiments on three widely-used
benchmarks demonstrate that our model achieves state-of-
the-art performance on the OVVAD task.
1. Introduction
Video anomaly detection (VAD), which aims at detecting
unusual events that do not conform to expected patterns,
has become a growing concern in both academia and indus-
try communities due to its promising application prospects
*
Corresponding Authors
in, such as, intelligent video surveillance and video content
review. Through several years of vigorous development,
VAD has made significant progress with many works con-
tinuously emerging.
Traditional VAD can be broadly classified into two
types based on the supervision mode, i.e., semi-supervised
VAD [17] and weakly supervised VAD [38]. The main dif-
ference between them lies in the availability of abnormal
training samples. Although they are different in terms of
supervision mode and model design, both can be roughly
regarded as classification tasks. In the case of semi-
supervised VAD, it falls under the category of one-class
classification, while weakly supervised VAD pertains to bi-
nary classification. Specifically, semi-supervised VAD as-
sumes that only normal samples are available during the
training stage, and the test samples which do not conform
to these normal training samples are identified as anomalies,
as shown in Fig. 1(a). Most existing methods essentially en-
deavor to learn the one-class pattern, i.e., normal pattern, by
means of one-class classifiers [50] or self-supervised learn-
ing techniques, e.g., frame reconstruction [9], frame predic-
tion [17], jigsaw puzzles [44], etc. Similarly, as illustrated
in Fig. 1(b), weakly supervised VAD can be seen as a binary
classification task with the assumption that both normal and
abnormal samples are available during the training phase
but the precise temporal annotations of abnormal events are
unknown. Previous approaches widely adopt a binary clas-
sifier with the multiple instance learning (MIL) [38] or Top-
K mechanism [27] to discriminate between normal and ab-
normal events. In general, existing approaches to both
semi-supervised and weakly supervised VAD restrict their
focus to classification and use a corresponding discriminator
to categorize each video frame. While these practices have
achieved significant success on several widely-used bench-
marks, they are limited to detecting a closed set of anomaly
categories and are unable to handle arbitrary unseen anoma-
lies. This limitation restricts their application in open-world
scenarios and poses a risk of increased missed detections,
as many real-world anomalies in actual deployment are not
present in the training data.
[Figure 1: four panels. (a) Semi-supervised VAD: trained on normality, detects nonconformity. (b) Weakly supervised VAD: trained on normality and seen anomalies (e.g., fighting), detects seen anomalies. (c) Open-set VAD: trained on normality and seen anomalies, detects unseen anomalies. (d) Open-vocabulary VAD: trained on normality and seen anomalies, detects and categorizes both seen and unseen anomalies (e.g., fighting, crash).]
Figure 1. Comparison of different VAD tasks.
To address this issue, a few recent works explore a whole
new line of VAD, i.e., open-set VAD [1, 5, 66, 67]. The
core purpose of open-set VAD is to train a model with nor-
mal and seen abnormal samples to detect unseen anoma-
lies (see Fig. 1(c)). For example, if the abnormal training sam-
ples include only fighting and shooting events, it is ex-
pected that the trained model can detect abnormal events
that occur in the road accident scene. Compared to tra-
ditional VAD, open-set VAD breaks out of the closed-set
dilemma and thus possesses the ability to deal with open-world
problems. Although these works partly reveal their open-
world capacity, they fall short in addressing semantic un-
derstanding of the abnormal categories, which leads to an
ambiguous detection process in the open world.
Recently, large language/vision model pre-training [11,
29, 34, 64] has been phenomenally successful across a wide
range of downstream tasks [13–15, 24, 25, 28, 47, 48, 58,
65] on account of its learned cross-modal prior knowledge
and powerful transfer learning ability, which also allows us
to tackle open-vocabulary video anomaly detection (OV-
VAD). Therefore, in this paper, we propose a novel model
built upon large pre-trained vision/language models for OV-
VAD that aims to detect and categorize seen and unseen
anomalies, as shown in Fig. 1(d). Compared to previous
VAD, OVVAD has high practical value as it can pro-
vide more informed, fine-grained detection results, but it
is more challenging since 1) it needs not only to de-
tect but also to categorize the anomalies; 2) it needs to tackle
seen (base) as well as unseen (novel) anomalies. To address
these challenges, we explicitly disentangle the OVVAD task
into two mutually complementary sub-tasks: one is class-
agnostic detection, and the other is class-specific cat-
egorization. To improve the class-agnostic detection, we
make efforts from two aspects. We first introduce a nearly
weight-free temporal adapter (TA) module to model tem-
poral relationships, and then introduce a novel semantic
knowledge injection (SKI) module designed to incorpo-
rate textual knowledge into visual signals with the assistance
of large language models (an illustrative sketch follows this
paragraph). To enhance the class-specific
categorization, we take inspiration from the contrastive
language-image pre-training (CLIP) model [29] and use a
scalable way to categorize anomalies, i.e., alignment be-
tween textual labels and videos; furthermore, we design
a novel anomaly synthesis (NAS) module to generate visual
materials (e.g., images and videos) to help the model bet-
ter identify novel anomalies. Based on these operations,
our model achieves state-of-the-art performance on three
popular benchmarks for OVVAD, attaining 86.40% AUC,
66.53% AP and 62.94% AUC on UCF-Crime [38], XD-
Violence [51] and UBnormal [1], respectively.
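As flagged above, here is a purely illustrative sketch of semantic knowledge injection: one plausible mechanism is cross-attention from frame features to a bank of LLM-generated text embeddings, fused residually. This is an assumption for intuition only, not the paper’s actual SKI design, and all names are hypothetical.

```python
import torch
import torch.nn as nn

class SemanticInjection(nn.Module):
    """Hypothetical SKI-style fusion: frames attend to a text-embedding bank."""

    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)

    def forward(self, frame_feats, text_bank):
        # frame_feats: (1, n, c); text_bank: (1, m, c) LLM-derived embeddings.
        injected, _ = self.attn(frame_feats, text_bank, text_bank)
        return frame_feats + injected  # residual fusion of textual knowledge
```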
We summarize our contributions as follows:
• We explore video anomaly detection under a challenging
yet practically important open-vocabulary setting. To our
knowledge, this is the first work for OVVAD.
• We then propose a model built on top of pre-trained large
models that disentangles the OVVAD task into two mutu-
ally complementary sub-tasks – class-agnostic detection
and class-specific categorization – and jointly optimizes
them for accurate OVVAD.
• In the class-agnostic detection task, we design a nearly
weight-free temporal adapter module and a semantic
knowledge injection module for substantially-enhanced
normal/abnormal frame detection.
• In the fine-grained anomaly classification task, we in-
troduce a novel anomaly synthesis module to generate
pseudo unseen anomaly videos for accurate classification
of novel anomaly types.
2. Related Work
Semi-supervised VAD. Mainstream solutions build
a normal pattern in a self-supervised manner (e.g., recon-
struction and prediction) or a one-class manner. As for the
self-supervised manner [8, 54, 56], reconstruction-based
approaches [4, 21, 22, 33, 39, 55, 60] typically leverage
encoder-decoder frameworks to reconstruct normal events
and compute the reconstruction errors; events
with large reconstruction errors are classified as anomalies.
Follow-up prediction-based approaches [17, 19] focus on
predicting the future frame from previous video frames and
determining whether it is an anomalous frame by calculating
the difference between the predicted frame and the actual
frame. Recent work [37] combined reconstruction- and
prediction-based approaches to improve detection perfor-
mance. As for one-class models, some works endeavor
to learn normal patterns by making use of one-class frame-
works [35], e.g., one-class support vector machine and its
extension (OCSVM [36], SVDD [50], GODS [45]).
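To make the reconstruction paradigm concrete, here is a minimal sketch in which an autoencoder trained only on normal data scores test frames by reconstruction error; the architecture and feature dimensions are illustrative assumptions, not those of any cited method.

```python
import torch
import torch.nn as nn

class FrameAutoencoder(nn.Module):
    """Toy autoencoder over per-frame feature vectors (dimensions assumed)."""

    def __init__(self, dim=512, bottleneck=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
                                     nn.Linear(128, bottleneck))
        self.decoder = nn.Sequential(nn.Linear(bottleneck, 128), nn.ReLU(),
                                     nn.Linear(128, dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

@torch.no_grad()
def reconstruction_scores(model, frames):
    """frames: (n, dim). Larger reconstruction error => more anomalous."""
    recon = model(frames)
    return ((frames - recon) ** 2).mean(dim=-1)
```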
Weakly supervised VAD. In contrast to semi-supervised
VAD, weakly supervised VAD [10, 40] trains on normal
as well as abnormal samples; it can be regarded as a
binary classification task that aims to detect anomalies at the
frame level without precise temporal annotations. As
a pioneer work, Sultani et al. [38] first proposed a large-
scale benchmark and trained a lightweight network with
MIL mechanism. Then Zhong et al. [61] proposed a graph
convolutional network based approach to capture the simi-
larity relations and temporal relations across frames. Tian
et al. [42] introduced self-attention blocks and pyramid di-
lated convolution layers to capture multi-scale temporal re-
lations. Wu et al. [51, 52] built the largest-scale benchmark
that includes audio-visual signals and proposed a multi-task
model to deal with coarse- and fine-grained VAD. Zaheer
et al. [57] presented a clustering assisted weakly super-
vised framework with novel normalcy suppression mech-
anism. Li et al. [16] proposed a transformer-based net-
work with self-training multi-sequence learning. Zhang et
al. [59] attempted to exploit the completeness and uncer-
tainty of pseudo labels. The above approaches simply used
video or audio inputs encoded by pre-trained models such
as C3D [43] and I3D [3]; although a few works [12, 23, 53]
introduced CLIP to the weakly supervised VAD task,
they used only its powerful visual features and ignored
the zero-shot ability of CLIP.
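For intuition about the MIL and Top-K mechanisms referenced above, the sketch below scores a video by the mean of its k highest-scoring snippets and trains with binary cross-entropy against video-level labels; it is a simplified assumption, not the exact loss of [38] or [27].

```python
import torch
import torch.nn.functional as F

def topk_mil_loss(snippet_scores, video_labels, k=3):
    """snippet_scores: (B, T) per-snippet anomaly scores in [0, 1].
    video_labels: (B,) with 1 for abnormal and 0 for normal videos."""
    topk = snippet_scores.topk(k, dim=1).values  # k highest-scoring snippets per video
    video_scores = topk.mean(dim=1)              # (B,) video-level anomaly prediction
    return F.binary_cross_entropy(video_scores, video_labels.float())
```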
Open-set VAD. The VAD task naturally carries an open-world
requirement. Faced with this requirement, tra-
ditional semi-supervised works are prone to produc-
ing many false alarms, while weakly supervised works are ef-
fective at detecting known anomalies but may fail on un-
seen anomalies. Open-set VAD aims to train a model
based on normality and seen anomalies, and attempts to
detect unseen anomalies. Acsintoae et al. [1] developed
the first benchmark called UBnormal for supervised open-
set VAD task. Zhu et al. [67] proposed an approach to
deal with open-set VAD task by integrating evidential deep
learning and normalizing flows into a MIL framework. Be-
sides, Ding et al. [5] proposed a multi-head network based
model to learn the disentangled anomaly representations,
with each head dedicated to capturing one specific type
of anomaly. Compared to our model, the above works
mainly devote themselves to open-world detection and over-
look anomaly categorization; moreover, they also
fail to take full advantage of pre-trained models.
3. Method
Problem Statement. The studied problem, OVVAD, can be formally stated as follows. Suppose we are given a set of training samples $X = \{x_i\}_{i=1}^{N+A}$, where $X_n = \{x_i\}_{i=1}^{N}$ is the set of normal samples and $X_a = \{x_i\}_{i=N+1}^{N+A}$ is the set of abnormal samples. Each sample $x_i$ in $X_a$ has a corresponding video-level category label $y_i \in C_{base}$. Here, $C_{base}$ represents the set of base (seen) anomaly categories, and $C$ is the union of $C_{base}$ and $C_{novel}$, where $C_{novel}$ stands for the set of novel (unseen) anomaly categories. Based on the training samples $X$, the objective is to train a model capable of detecting and categorizing both base and novel anomalies. Specifically, the goal of the model is to predict an anomaly confidence for each frame and to identify the anomaly category if anomalies are present in the video.
3.1. Overall Framework
Traditional methods based on closed-set classification are ill-suited to VAD under the open-vocabulary scenario. To this end, we leverage language-image pre-training models, e.g., CLIP, as the foundation, thanks to their powerful zero-shot generalization ability. As illustrated in Fig. 2, given a training video, we first feed it into the image encoder of CLIP, $\Phi_{CLIP\text{-}v}$, to obtain frame-level features $x_f$ of shape $n \times c$, where $n$ is the number of video frames and $c$ is the feature dimension. These features then pass through the TA module, the SKI module, and a detector to produce frame-level anomaly confidences $p$; this pipeline mainly serves the class-agnostic detection task.
On the other hand, for class-specific categorization, we take
inspiration from other open-vocabulary works across dif-
ferent vision tasks [31, 46, 63] and use a cross-modal align-
ment mechanism. Specifically, we first aggregate the frame-
level features into a video-level feature and generate textual
features/embeddings of the anomaly categories; finally, we
estimate the anomaly category based on the alignment
between the video-level feature and the textual features
(see the sketch below). Moreover, we introduce the NAS
module to generate potential novel anomalies with the
assistance of large language models (LLMs) and AI-generated
content (AIGC) models to support novel category identification.
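A minimal sketch of this alignment-based categorization is shown below, assuming mean pooling as the video-level aggregation and a fixed softmax temperature; both choices, and all names, are illustrative rather than the paper’s exact design.

```python
import torch

def categorize_by_alignment(frame_feats, text_embeds, temperature=0.01):
    """frame_feats: (n, c) CLIP frame features of one video.
    text_embeds: (K, c) CLIP text embeddings of K anomaly category names.
    Returns a (K,) probability distribution over categories."""
    video_feat = frame_feats.mean(dim=0)                      # assumed mean aggregation
    video_feat = video_feat / video_feat.norm()               # L2-normalize
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
    logits = text_embeds @ video_feat / temperature           # scaled cosine similarities
    return logits.softmax(dim=-1)
```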
3.2. Temporal Adapter Module
Temporal dependencies play a vital role in VAD [49, 62].
In this work, we employ the frozen image encoder of CLIP
to attain vision features, but it lacks consideration of tem-
poral dependencies since CLIP is pre-trained on image-text
pairs. To bridge the gap between images and videos, the use
of a temporal transformer [13, 25] has emerged as a rou-
tine practice in recent studies. However, such a paradigm
suffers from a clear performance degradation on novel cate-
gories [13, 32]; a possible reason is that the additional param-
eters in the temporal transformer could specialize to the train-
ing set, thus harming generalization towards novel cate-
gories. Therefore, we design a nearly weight-free temporal
adapter for temporal dependencies, which is built on top of
classical graph convolutional networks. Mathematically, it
can be presented as follows,
$x_t = \mathrm{LN}\left(\mathrm{softmax}(H)\, x_f\right) \quad (1)$
where $\mathrm{LN}$ is the layer normalization operation, $H$ is the adjacency matrix, and the softmax normalization ensures that each row of $H$ sums to one. Such a design captures contextual dependencies based on the positional distance between any two frames. The adjacency matrix is
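The following is a minimal PyTorch sketch of Eq. (1). Since the construction of $H$ is not shown above, the exponentially decaying, distance-based adjacency used here (controlled by a hypothetical decay parameter `sigma`) is an assumption for illustration; the LayerNorm is the adapter’s only learnable component, which is what makes it nearly weight-free.

```python
import torch
import torch.nn as nn

class TemporalAdapter(nn.Module):
    """Nearly weight-free temporal adapter: x_t = LN(softmax(H) x_f).
    The distance-based adjacency is an assumed stand-in for the paper's H."""

    def __init__(self, dim, sigma=1.0):
        super().__init__()
        self.ln = nn.LayerNorm(dim)  # the module's only learnable parameters
        self.sigma = sigma           # hypothetical decay rate for the adjacency

    def forward(self, x_f):
        # x_f: (n, c) frame-level features from the frozen CLIP image encoder.
        n = x_f.size(0)
        idx = torch.arange(n, device=x_f.device, dtype=x_f.dtype)
        dist = (idx[:, None] - idx[None, :]).abs()  # |i - j| positional distance
        h = -dist / self.sigma                      # nearer frames get larger entries
        attn = h.softmax(dim=-1)                    # each row of softmax(H) sums to one
        return self.ln(attn @ x_f)                  # Eq. (1)
```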