Open-Vocabulary Video Anomaly Detection
Peng Wu¹, Xuerong Zhou¹, Guansong Pang²*, Yujia Sun³, Jing Liu³, Peng Wang¹*, Yanning Zhang¹
¹Northwestern Polytechnical University, ²Singapore Management University, ³Xidian University
(*Corresponding Authors)
{xdwupeng, zxr2333}@gmail.com, gspang@smu.edu.sg, yjsun@stu.xidian.edu.cn
neouma@163.com, {peng.wang, ynzhang}@nwpu.edu.cn
Abstract
Current video anomaly detection (VAD) approaches with
weak supervision are inherently limited to a closed-set set-
ting and may struggle in open-world applications where
there can be anomaly categories in the test data unseen
during training. A few recent studies attempt to tackle a
more realistic setting, open-set VAD, which aims to de-
tect unseen anomalies given seen anomalies and normal
videos. However, such a setting focuses on predicting frame
anomaly scores, having no ability to recognize the specific
categories of anomalies, despite the fact that this ability is
essential for building more informed video surveillance sys-
tems. This paper takes a step further and explores open-
vocabulary video anomaly detection (OVVAD), in which we
aim to leverage pre-trained large models to detect and cate-
gorize seen and unseen anomalies. To this end, we propose
a model that decouples OVVAD into two mutually comple-
mentary tasks – class-agnostic detection and class-specific
classification – and jointly optimizes both tasks. In particu-
lar, we devise a semantic knowledge injection module to
introduce semantic knowledge from large language models
for the detection task, and design a novel anomaly synthesis
module to generate pseudo unseen anomaly videos with the
help of large vision generation models for the classification
task. The injected semantic knowledge and synthesized
anomalies substantially extend our model’s capability in
detecting and categorizing a variety of seen and unseen
anomalies. Extensive experiments on three widely-used
benchmarks demonstrate that our model achieves state-of-
the-art performance on the OVVAD task.
1. Introduction
Video anomaly detection (VAD), which aims at detecting
unusual events that do not conform to expected patterns,
has become a growing concern in both academia and indus-
try communities due to its promising application prospects
*
Corresponding Authors
in, such as, intelligent video surveillance and video content
review. Through several years of vigorous development,
VAD has made significant progress with many works con-
tinuously emerging.
Traditional VAD can be broadly classified into two
types based on the supervision mode, i.e., semi-supervised
VAD [17] and weakly supervised VAD [38]. The main dif-
ference between them lies in the availability of abnormal
training samples. Although they are different in terms of
supervision mode and model design, both can be roughly
regarded as classification tasks. In the case of semi-
supervised VAD, it falls under the category of one-class
classification, while weakly supervised VAD pertains to bi-
nary classification. Specifically, semi-supervised VAD as-
sumes that only normal samples are available during the
training stage, and the test samples which do not conform
to these normal training samples are identified as anomalies,
as shown in Fig. 1(a). Most existing methods essentially en-
deavor to learn the one-class pattern, i.e., normal pattern, by
means of one-class classifiers [50] or self-supervised learn-
ing techniques, e.g., frame reconstruction [9], frame predic-
tion [17], jigsaw puzzles [44], etc. Similarly, as illustrated
in Fig. 1(b), weakly supervised VAD can be seen as a binary
classification task with the assumption that both normal and
abnormal samples are available during the training phase
but the precise temporal annotations of abnormal events are
unknown. Previous approaches widely adopt a binary clas-
sifier with the multiple instance learning (MIL) [38] or Top-
K mechanism [27] to discriminate between normal and ab-
normal events. In general, existing approaches to both
semi-supervised and weakly supervised VAD restrict their
focus to classification and use a corresponding discriminator
to categorize each video frame. While these practices have
achieved significant success on several widely-used bench-
marks, they are limited to detecting a closed set of anomaly
categories and are unable to handle arbitrary unseen anoma-
lies. This limitation restricts their application in open-world
scenarios and poses a risk of increased missed detections,
as many real-world anomalies in actual deployment are not
present in the training data.
[Figure 1: four panels. (a) Semi-supervised VAD: trained on normality, detects nonconformity. (b) Weakly supervised VAD: trained on normality and seen anomalies (e.g., fighting), detects seen anomalies. (c) Open-set VAD: trained on normality and seen anomalies, detects unseen anomalies. (d) Open-vocabulary VAD: trained on normality and seen anomalies, detects and categorizes both seen and unseen anomalies (e.g., fighting, crash).]
Figure 1. Comparison of different VAD tasks.
To address this issue, a few recent works explore a whole
new line of VAD, i.e., open-set VAD [1, 5, 66, 67]. The
core purpose of open-set VAD is to train a model with nor-
mal and seen abnormal samples to detect unseen anoma-
lies (see Fig. 1(c)). For example, if the abnormal training sam-
ples include only fighting and shooting events, it is ex-
pected that the trained model can detect abnormal events
that occur in the road accident scene. Compared to tra-
ditional VAD, open-set VAD breaks out of the closed-set
dilemma and thus possesses the ability to deal with open-world
problems. Although these works partly reveal their open-
world capacity, they fall short in addressing semantic un-
derstanding of the abnormal categories, which leads to an
ambiguous detection process in the open world.
Recently, large language/vision model pre-training [11,
29, 34, 64] has been phenomenally successful across a wide
range of downstream tasks [13–15, 24, 25, 28, 47, 48, 58,
65] on account of its learned cross-modal prior knowledge
and powerful transfer learning ability, which also allows us
to tackle open-vocabulary video anomaly detection (OV-
VAD). Therefore, in this paper, we propose a novel model
built upon large pre-trained vision/language models for OV-
VAD that aims to detect and categorize seen and unseen
anomalies, as shown in Fig. 1(d). Compared to previous
VAD, OVVAD has high practical value as it can pro-
vide more informed, fine-grained detection results, but it
is more challenging since 1) it needs not only to de-
tect but also to categorize the anomalies; 2) it needs to tackle
seen (base) as well as unseen (novel) anomalies. To address
these challenges, we explicitly disentangle the OVVAD task
into two mutually complementary sub-tasks: one is class-
agnostic detection, and the other is class-specific cat-
egorization. To improve the class-agnostic detection, we
make efforts from two aspects. We first introduce a nearly
weight-free temporal adapter (TA) module to model tem-
poral relationships, and then introduce a novel semantic
knowledge injection (SKI) module designed to incorpo-
rate textual knowledge into visual signals with the assistance
of large language models (an illustrative sketch follows this
paragraph). To enhance the class-specific
categorization, we take inspiration from the contrastive
language-image pre-training (CLIP) model [29] and use a
scalable way to categorize anomalies, i.e., alignment be-
tween textual labels and videos; furthermore, we design
a novel anomaly synthesis (NAS) module to generate visual
materials (e.g., images and videos) to help the model bet-
ter identify novel anomalies. Based on these operations,
our model achieves state-of-the-art performance on three
popular benchmarks for OVVAD, attaining 86.40% AUC,
66.53% AP and 62.94% AUC on UCF-Crime [38], XD-
Violence [51] and UBnormal [1], respectively.
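As flagged above, here is a purely illustrative sketch of semantic knowledge injection: one plausible mechanism is cross-attention from frame features to a bank of LLM-generated text embeddings, fused residually. This is an assumption for intuition only, not the paper’s actual SKI design, and all names are hypothetical.

```python
import torch
import torch.nn as nn

class SemanticInjection(nn.Module):
    """Hypothetical SKI-style fusion: frames attend to a text-embedding bank."""

    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)

    def forward(self, frame_feats, text_bank):
        # frame_feats: (1, n, c); text_bank: (1, m, c) LLM-derived embeddings.
        injected, _ = self.attn(frame_feats, text_bank, text_bank)
        return frame_feats + injected  # residual fusion of textual knowledge
```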
We summarize our contributions as follows:
• We explore video anomaly detection under a challenging
yet practically important open-vocabulary setting. To our
knowledge, this is the first work for OVVAD.
• We then propose a model built on top of pre-trained large
models that disentangles the OVVAD task into two mutu-
ally complementary sub-tasks – class-agnostic detection
and class-specific categorization – and jointly optimizes
them for accurate OVVAD.
• In the class-agnostic detection task, we design a nearly
weight-free temporal adapter module and a semantic
knowledge injection module for substantially-enhanced
normal/abnormal frame detection.
• In the fine-grained anomaly classification task, we in-
troduce a novel anomaly synthesis module to generate
pseudo unseen anomaly videos for accurate classification
of novel anomaly types.
2. Related Work
Semi-supervised VAD. Mainstream solutions build
a normal pattern in a self-supervised manner (e.g., recon-
struction and prediction) or a one-class manner. As for the
self-supervised manner [8, 54, 56], reconstruction-based
approaches [4, 21, 22, 33, 39, 55, 60] typically leverage
encoder-decoder frameworks to reconstruct normal events
and compute the reconstruction errors; events
with large reconstruction errors are classified as anomalies.
Follow-up prediction-based approaches [17, 19] focus on
predicting the future frame from previous video frames and
determining whether it is an anomalous frame by calculating
the difference between the predicted frame and the actual
frame. Recent work [37] combined reconstruction- and
prediction-based approaches to improve detection perfor-
mance. As for one-class models, some works endeavor
to learn normal patterns by making use of one-class frame-
works [35], e.g., one-class support vector machine and its
extension (OCSVM [36], SVDD [50], GODS [45]).
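To make the reconstruction paradigm concrete, here is a minimal sketch in which an autoencoder trained only on normal data scores test frames by reconstruction error; the architecture and feature dimensions are illustrative assumptions, not those of any cited method.

```python
import torch
import torch.nn as nn

class FrameAutoencoder(nn.Module):
    """Toy autoencoder over per-frame feature vectors (dimensions assumed)."""

    def __init__(self, dim=512, bottleneck=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
                                     nn.Linear(128, bottleneck))
        self.decoder = nn.Sequential(nn.Linear(bottleneck, 128), nn.ReLU(),
                                     nn.Linear(128, dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

@torch.no_grad()
def reconstruction_scores(model, frames):
    """frames: (n, dim). Larger reconstruction error => more anomalous."""
    recon = model(frames)
    return ((frames - recon) ** 2).mean(dim=-1)
```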
Weakly supervised VAD. In contrast to semi-supervised
VAD, weakly supervised VAD [10, 40] trains on normal
as well as abnormal samples; it can be regarded as a
binary classification task that aims to detect anomalies at the
frame level without precise temporal annotations. As
a pioneer work, Sultani et al. [38] first proposed a large-
scale benchmark and trained a lightweight network with
MIL mechanism. Then Zhong et al. [61] proposed a graph
convolutional network based approach to capture the simi-
larity relations and temporal relations across frames. Tian
et al. [42] introduced self-attention blocks and pyramid di-
lated convolution layers to capture multi-scale temporal re-
lations. Wu et al. [51, 52] built the largest-scale benchmark
that includes audio-visual signals and proposed a multi-task
model to deal with coarse- and fine-grained VAD. Zaheer
et al. [57] presented a clustering assisted weakly super-
vised framework with novel normalcy suppression mech-
anism. Li et al. [16] proposed a transformer-based net-
work with self-training multi-sequence learning. Zhang et
al. [59] attempted to exploit the completeness and uncer-
tainty of pseudo labels. The above approaches simply used
video or audio inputs encoded by pre-trained models such
as C3D [43] and I3D [3]; although a few works [12, 23, 53]
introduced CLIP to the weakly supervised VAD task,
they used only its powerful visual features and ignored
the zero-shot ability of CLIP.
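For intuition about the MIL and Top-K mechanisms referenced above, the sketch below scores a video by the mean of its k highest-scoring snippets and trains with binary cross-entropy against video-level labels; it is a simplified assumption, not the exact loss of [38] or [27].

```python
import torch
import torch.nn.functional as F

def topk_mil_loss(snippet_scores, video_labels, k=3):
    """snippet_scores: (B, T) per-snippet anomaly scores in [0, 1].
    video_labels: (B,) with 1 for abnormal and 0 for normal videos."""
    topk = snippet_scores.topk(k, dim=1).values  # k highest-scoring snippets per video
    video_scores = topk.mean(dim=1)              # (B,) video-level anomaly prediction
    return F.binary_cross_entropy(video_scores, video_labels.float())
```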
Open-set VAD. The VAD task naturally carries an open-world
requirement. Faced with this requirement, tra-
ditional semi-supervised works are prone to produc-
ing many false alarms, while weakly supervised works are ef-
fective at detecting known anomalies but may fail on un-
seen anomalies. Open-set VAD aims to train a model
based on normality and seen anomalies, and attempts to
detect unseen anomalies. Acsintoae et al. [1] developed
the first benchmark called UBnormal for supervised open-
set VAD task. Zhu et al. [67] proposed an approach to
deal with open-set VAD task by integrating evidential deep
learning and normalizing flows into a MIL framework. Be-
sides, Ding et al. [5] proposed a multi-head network based
model to learn the disentangled anomaly representations,
with each head dedicated to capturing one specific type
of anomaly. Compared to our model, the above works
mainly devote themselves to open-world detection and over-
look anomaly categorization; moreover, they also
fail to take full advantage of pre-trained models.
3. Method
Problem Statement. The studied problem, OVVAD, can be formally stated as follows. Suppose we are given a set of training samples $X = \{x_i\}_{i=1}^{N+A}$, where $X_n = \{x_i\}_{i=1}^{N}$ is the set of normal samples and $X_a = \{x_i\}_{i=N+1}^{N+A}$ is the set of abnormal samples. Each sample $x_i$ in $X_a$ has a corresponding video-level category label $y_i \in C_{base}$. Here, $C_{base}$ represents the set of base (seen) anomaly categories, and $C$ is the union of $C_{base}$ and $C_{novel}$, where $C_{novel}$ stands for the set of novel (unseen) anomaly categories. Based on the training samples $X$, the objective is to train a model capable of detecting and categorizing both base and novel anomalies. Specifically, the goal of the model is to predict an anomaly confidence for each frame and to identify the anomaly category if anomalies are present in the video.
3.1. Overall Framework
Traditional methods based on closed-set classification are ill-suited to VAD under the open-vocabulary scenario. To this end, we leverage language-image pre-training models, e.g., CLIP, as the foundation, thanks to their powerful zero-shot generalization ability. As illustrated in Fig. 2, given a training video, we first feed it into the image encoder of CLIP, $\Phi_{CLIP\text{-}v}$, to obtain frame-level features $x_f$ of shape $n \times c$, where $n$ is the number of video frames and $c$ is the feature dimension. These features then pass through the TA module, the SKI module, and a detector to produce frame-level anomaly confidences $p$; this pipeline mainly serves the class-agnostic detection task.
On the other hand, for class-specific categorization, we take
inspiration from other open-vocabulary works across dif-
ferent vision tasks [31, 46, 63] and use a cross-modal align-
ment mechanism. Specifically, we first aggregate the frame-
level features into a video-level feature and generate textual
features/embeddings of the anomaly categories; finally, we
estimate the anomaly category based on the alignment
between the video-level feature and the textual features
(see the sketch below). Moreover, we introduce the NAS
module to generate potential novel anomalies with the
assistance of large language models (LLMs) and AI-generated
content (AIGC) models to support novel category identification.
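A minimal sketch of this alignment-based categorization is shown below, assuming mean pooling as the video-level aggregation and a fixed softmax temperature; both choices, and all names, are illustrative rather than the paper’s exact design.

```python
import torch

def categorize_by_alignment(frame_feats, text_embeds, temperature=0.01):
    """frame_feats: (n, c) CLIP frame features of one video.
    text_embeds: (K, c) CLIP text embeddings of K anomaly category names.
    Returns a (K,) probability distribution over categories."""
    video_feat = frame_feats.mean(dim=0)                      # assumed mean aggregation
    video_feat = video_feat / video_feat.norm()               # L2-normalize
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
    logits = text_embeds @ video_feat / temperature           # scaled cosine similarities
    return logits.softmax(dim=-1)
```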
3.2. Temporal Adapter Module
Temporal dependencies play a vital role in VAD [49, 62].
In this work, we employ the frozen image encoder of CLIP
to attain vision features, but it lacks consideration of tem-
poral dependencies since CLIP is pre-trained on image-text
pairs. To bridge the gap between images and videos, the use
of a temporal transformer [13, 25] has emerged as a rou-
tine practice in recent studies. However, such a paradigm
suffers from a clear performance degradation on novel cate-
gories [13, 32]; a possible reason is that the additional param-
eters in the temporal transformer could specialize to the train-
ing set, thus harming generalization towards novel cate-
gories. Therefore, we design a nearly weight-free temporal
adapter for temporal dependencies, which is built on top of
classical graph convolutional networks. Mathematically, it
can be presented as follows,
$x_t = \mathrm{LN}\left(\mathrm{softmax}(H)\, x_f\right) \quad (1)$
where $\mathrm{LN}$ is the layer normalization operation, $H$ is the adjacency matrix, and the softmax normalization ensures that each row of $H$ sums to one. Such a design captures contextual dependencies based on the positional distance between any two frames. The adjacency matrix is
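The following is a minimal PyTorch sketch of Eq. (1). Since the construction of $H$ is not shown above, the exponentially decaying, distance-based adjacency used here (controlled by a hypothetical decay parameter `sigma`) is an assumption for illustration; the LayerNorm is the adapter’s only learnable component, which is what makes it nearly weight-free.

```python
import torch
import torch.nn as nn

class TemporalAdapter(nn.Module):
    """Nearly weight-free temporal adapter: x_t = LN(softmax(H) x_f).
    The distance-based adjacency is an assumed stand-in for the paper's H."""

    def __init__(self, dim, sigma=1.0):
        super().__init__()
        self.ln = nn.LayerNorm(dim)  # the module's only learnable parameters
        self.sigma = sigma           # hypothetical decay rate for the adjacency

    def forward(self, x_f):
        # x_f: (n, c) frame-level features from the frozen CLIP image encoder.
        n = x_f.size(0)
        idx = torch.arange(n, device=x_f.device, dtype=x_f.dtype)
        dist = (idx[:, None] - idx[None, :]).abs()  # |i - j| positional distance
        h = -dist / self.sigma                      # nearer frames get larger entries
        attn = h.softmax(dim=-1)                    # each row of softmax(H) sums to one
        return self.ln(attn @ x_f)                  # Eq. (1)
```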