Audio scene recognition based on audio events and topic models


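Audio scene recognition classifies audio content into a specific scene or context. It is a research branch of computer science and artificial intelligence, and it plays an important role in speech recognition and audio information processing. The accuracy of audio scene recognition directly affects how well an intelligent system can process audio data. The field has a wide range of applications, such as intelligent assistants, security monitoring, automatic translation, and content retrieval.

Audio events are the basis of audio scene recognition. They are the basic units for describing an audio signal and can be thought of as the "words" of audio, for example the sound of a car starting or a person laughing. Detecting and classifying audio events is one of the core tasks of audio scene recognition. Traditional audio event analysis relies on hand-crafted features and classifiers. This works to some extent, but it usually adapts poorly to changing environments and has limited scalability.

Topic models have attracted attention in many fields in recent years. Their purpose is to automatically discover the hidden semantic topic structure in a collection of documents. In audio scene recognition, topic models are used to discover the latent topics of audio documents. Topic analysis is commonly performed on the document-word co-occurrence matrix, which determines topics from the distribution of words in each document. This approach, however, ignores the advantages that audio events may have in expressing an audio document.

The paper proposes a new audio scene recognition algorithm based on audio events and topic models, which performs topic analysis on the document-event co-occurrence matrix. The method rests on the hypothesis that the event distribution of an audio document is more in line with the human way of thinking than its word distribution, so the topic distribution obtained from the document-event co-occurrence matrix represents the audio document better. The novelty lies in using a different analysis matrix from the existing methods in order to extract a topic distribution that represents the audio document more accurately, and in obtaining better recognition results from that topic distribution.

The two main topic models mentioned in the paper are probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA). Both are used to discover hidden topic structure in large amounts of text data. PLSA was originally used for text information retrieval, and LDA is a probabilistic model that improves on PLSA. Both reveal the topic structure of a document collection by treating each document as a probability distribution over words or events.

The support vector machine (SVM) is another method widely used in pattern recognition and machine learning. It is a supervised learning model for classification and regression analysis. In audio scene recognition, an SVM can handle high-dimensional data and find the separating hyperplane that maximizes the margin between classes.

The contributions of the paper include:
1. A new audio scene recognition algorithm based on the document-event co-occurrence matrix. Compared with methods that use the document-word co-occurrence matrix, the proposed algorithm extracts a topic distribution that expresses the audio documents more accurately and therefore achieves better recognition results.
2. A simpler method for obtaining the document-event co-occurrence matrix, which makes the audio scene recognition process more efficient and practical.
3. A method for weighting the event distribution of audio documents, which emphasizes the audio events that reflect the unique topics of a document and suppresses the audio events that are common to many topics.

One challenge for audio scene recognition is interference from environmental noise, which reduces the accuracy of audio event detection and in turn degrades scene recognition performance. To overcome this, researchers try to combine audio enhancement techniques, deep learning, and additional contextual information to improve the recognition accuracy of audio events. In addition, as artificial intelligence advances, audio scene recognition is moving toward more intelligent, adaptive, and personalized systems. Researchers expect future systems to understand complex audio environments better and to provide more accurate and richer scene information.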


Knowledge-Based Systems 125 (2017) 1–12

Contents lists available at ScienceDirect

Knowledge-Based Systems

journal homepage: www.elsevier.com/locate/knosys

Audio scene recognition based on audio events and topic model

Yan Leng, Nai Zhou, Chengli Sun, Xinyan Xu, Qi Yuan, Chuanfu Cheng, Yunxia Liu, Dengwang Li a,∗

Shandong Province Key Laboratory of Medical Physics and Image Processing Technology, Institute of Biomedical Sciences, School of Physics and Electronics, Shandong Normal University, Ji’nan 250014, China

School of Information, Nanchang Hangkong University, Nanchang 330063, China

Department of Computer Science and Technology, Shandong College of Electronic Technology, Ji’nan 250014, China

Shandong Provincial Key Laboratory of Network Based Intelligent Computing, School of Information Science and Engineering, University of Jinan, Ji’nan 250014, China

Article info

Article history:

Received 9 May 2016

Revised 1 April 2017

Accepted 7 April 2017

Available online 8 April 2017

Keywords:

Audio scene recognition

Audio event

Topic model

PLSA

LDA

Support vector machine

Abstract

Topic models are an active research area that is attracting attention from many fields. Recently, several studies have applied topic models to audio scene recognition (ASR). Most of these studies use the document-word co-occurrence matrix for topic analysis. In this work, we propose a new ASR algorithm based on audio events and topic models, which uses the document-event co-occurrence matrix for topic analysis. Our work is based on the hypothesis that, for an audio document, the event distribution is more in line with the human way of thinking than the word distribution, so the topic distribution obtained from the document-event co-occurrence matrix represents the audio document better. The contributions of this work are as follows: (1) we propose an ASR algorithm which uses the document-event co-occurrence matrix for topic analysis. Compared with current studies which use the document-word co-occurrence matrix, the proposed algorithm extracts a topic distribution that expresses the audio documents better and therefore obtains better recognition results; (2) we propose a much easier method to obtain the document-event co-occurrence matrix; (3) we propose a method to weight the event distribution of audio documents. This weighting method emphasizes the audio events that are important in reflecting the unique topics of the audio documents and suppresses the audio events that are common to many topics. Experimental results on two public datasets verify the effectiveness of the proposed ASR algorithm, and also verify the necessity and effectiveness of the proposed weighting method. The ideas in this work are not limited to ASR and can be extended to many other fields, such as video classification.

1. Introduction

Audio scene recognition (ASR) is the task of identifying the environment in which an audio stream was produced; in other words, it uses audio information to perceive the surrounding environment. Compared with visual information, audio information has several advantages. First, audio is not affected by lighting, so an audio-based system can work under weak light conditions. Second, audio is not limited by the field of view and can cover a wide range. Third, audio better protects people’s privacy, so it can be used in private settings such as bathrooms and bedrooms. In addition, audio data is cheaper to acquire than visual data. For these reasons, audio information has recently been widely used [1–3], and one of its important applications is ASR.

∗ Corresponding author.

E-mail addresses: lyansdu@163.com, lidengwang@sdnu.edu.cn (D. Li).

ASR has many useful applications. On mobile devices, ASR makes the devices smart [4], because it enables them to perceive the surrounding environment and adjust their status accordingly. In the aquaculture industry [5], ASR enables a sound classifier to take the context environment into account, which helps to estimate the feed consumption of prawns more precisely. ASR can also be applied to smart homes [6], among other settings.

http://dx.doi.org/10.1016/j.knosys.2017.04.001

In this work we introduce the topic model into ASR. Topic models have achieved great success in text analysis, and several recent studies have applied them to ASR. The methods in these studies follow the same paradigm as text analysis: the audio document is analogous to the text document, and the audio frames are analogous to the words. The common paradigm of topic-model-based ASR therefore consists of the following steps: segmenting the audio documents into frames; creating the audio vocabulary by vector quantization; mapping the frames to audio words according to the audio vocabulary; counting the audio words to generate the document-word co-occurrence matrix; analyzing the co-occurrence matrix with the topic model to generate the topic distribution of each audio document; and taking the topic distribution as the feature set for scene recognition.

These studies use the document-word co-occurrence matrix for topic analysis. We think, however, that the document-event co-occurrence matrix is a more reasonable choice, because it is more in line with the way humans recognize scenes. When humans recognize an audio scene, we first figure out which audio events are present in the audio document; then, by summing up the audio events, we consider which topics they reflect; and after analyzing the topics, we finally determine the type of audio scene. We therefore hypothesize that, for an audio document, the event distribution is more in line with the human way of thinking than the word distribution, so the topic distribution obtained from the document-event co-occurrence matrix represents the audio document better. Based on this hypothesis, we propose an ASR algorithm which uses the document-event co-occurrence matrix for topic analysis.
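The word-based paradigm described above (frames, audio vocabulary, audio words, counts) can be sketched in a few lines. This is only an illustrative sketch and not the implementation used in the cited studies: the codebook is assumed to be given (in practice it would come from a vector quantizer), the 2-D "features" are toy values, and the function names `nearest_codeword` and `document_word_matrix` are hypothetical.

```python
from collections import Counter

def nearest_codeword(frame, codebook):
    """Return the index of the codeword closest to `frame` (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))

def document_word_matrix(documents, codebook):
    """Build a document-word count matrix.

    documents: list of documents, each a list of frame feature vectors.
    codebook:  list of codeword vectors (the "audio vocabulary").
    Returns one row per document and one column per audio word.
    """
    matrix = []
    for frames in documents:
        # Map every frame to its audio word, then count occurrences.
        counts = Counter(nearest_codeword(f, codebook) for f in frames)
        matrix.append([counts.get(k, 0) for k in range(len(codebook))])
    return matrix

# Toy data: a hypothetical 3-word vocabulary and two short "documents".
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
documents = [
    [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8)],
    [(4.8, 5.1), (5.2, 4.9), (0.0, 0.2)],
]
matrix = document_word_matrix(documents, codebook)
print(matrix)  # [[1, 2, 0], [1, 0, 2]]
```

Each row of `matrix` is the word histogram of one audio document; this is the input that a topic model such as PLSA or LDA would analyze in the word-based approach.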

The contributions of our work are as follows: (1) we propose an ASR algorithm which uses the document-event co-occurrence matrix for topic analysis. Compared with current studies which use the document-word co-occurrence matrix, the proposed algorithm extracts a topic distribution that expresses the audio documents better and therefore obtains better recognition results; (2) we propose a much easier method to obtain the document-event co-occurrence matrix. One natural way to obtain this matrix is to first recognize the audio events in the audio documents with a classification model and then count them. This requires constructing the classification model, and when the number of audio events is large, the amount of computation is large as well. Our proposed method does not need a classification model; it only needs a matrix factorization through the topic model. A further problem with the classification-based approach is that, because audio events are sometimes misclassified, the document-event co-occurrence matrix obtained from the test set of an audio scene class may be poorly consistent with the one obtained from the training set of the same class. The proposed method avoids this problem; (3) we propose a method to weight the event distribution of audio documents. This weighting method emphasizes the audio events that play important roles in reflecting the unique topics of the audio documents and suppresses the audio events that are common to many topics. As a result, it helps to extract a topic distribution that expresses the audio documents better. Experimental results on two public datasets verify the effectiveness of the proposed ASR algorithm, and verify the necessity and effectiveness of the proposed weighting method.
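The weighting formula itself is not given in this part of the paper. To make the stated goal concrete (emphasize distinctive events, suppress events that occur everywhere), the sketch below substitutes a plain inverse-document-frequency weight, which is a stand-in borrowed from text retrieval and not the authors' method. The count matrix and the function names are hypothetical.

```python
import math

def idf_weights(doc_event_counts):
    """IDF-style weight per event: log(N / number of documents containing the event).

    NOTE: a stand-in for illustration, not the weighting proposed in the paper.
    """
    n_docs = len(doc_event_counts)
    n_events = len(doc_event_counts[0])
    weights = []
    for e in range(n_events):
        df = sum(1 for row in doc_event_counts if row[e] > 0)
        weights.append(math.log(n_docs / df) if df else 0.0)
    return weights

def weighted_event_distribution(row, weights):
    """Scale one document's event counts by the weights and renormalize to sum to 1."""
    scaled = [c * w for c, w in zip(row, weights)]
    total = sum(scaled)
    if total == 0:
        # Every event in this document got weight 0: fall back to the raw counts.
        scaled, total = list(row), sum(row)
    return [s / total for s in scaled] if total else [0.0] * len(row)

# Hypothetical counts: event 0 occurs in every document, events 1 and 2 in one each.
counts = [[4, 1, 0], [3, 0, 2], [5, 0, 0]]
weights = idf_weights(counts)
doc0 = weighted_event_distribution(counts[0], weights)
```

In this toy matrix, event 0 appears in all three documents and receives weight 0, so `doc0` puts all of its mass on event 1, the only event that distinguishes that document.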

The rest of the paper is organized as follows. Section 2 discusses the related work; Section 3 briefly introduces two topic models, PLSA and LDA; Section 4 describes the proposed ASR algorithm; Section 5 shows the experimental results; and Section 6 gives conclusions and future work.

2. Related work

Many kinds of topic models have been proposed for text analysis. Among them, PLSA (Probabilistic Latent Semantic Analysis) and LDA (Latent Dirichlet Allocation) are the two most popular. PLSA was first proposed by Hofmann [7]. It models each document as a distribution over latent topics and each topic as a distribution over words, but it makes no assumption about how the document-topic distribution is generated. Later, Blei et al. [8] extended PLSA by introducing a Dirichlet prior on the document-topic distribution, and proposed LDA.
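In standard notation, the difference between the two models can be written compactly. The symbols below (d for a document, z for a latent topic, w for a word, θ_d for the document-topic distribution, φ_z for the topic-word distribution, α for the Dirichlet parameter) are the usual textbook choices and are assumed here; they are not necessarily the notation used later in the paper.

```latex
% PLSA: each document d mixes latent topics z; each topic emits words w.
\[
  P(w \mid d) = \sum_{z} P(w \mid z)\, P(z \mid d)
\]
% LDA: the same mixture, but the document-topic distribution has a Dirichlet prior.
\[
  \theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad
  z \mid \theta_d \sim \mathrm{Multinomial}(\theta_d), \qquad
  w \mid z \sim \mathrm{Multinomial}(\phi_z)
\]
```

In PLSA, P(z | d) is a free parameter for every training document, whereas in LDA it is drawn from a shared prior, which is the extension described above.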

PLSA and LDA have achieved great success in text analysis. Hofmann applied PLSA to automated indexing of documents [7]; Xu et al. used LDA to identify implicit features in Chinese reviews [9]; and Zhang et al. utilized LDA to improve short text classification [10]. PLSA and LDA are not limited to text analysis; they have also been applied to many other fields. For example, Pliakos et al. applied PLSA to image classification [11], and Zhou et al. applied LDA to expert finding in question answering communities [12].

PLSA and LDA have also been applied to the audio field [13–20]. Hazen et al. [13] used PLSA to summarize the topic content of an audio corpus. In [13], the number of occurrences of each word in each audio document was first estimated with an automatic speech recognition system; PLSA was then used to learn the latent topics of the audio documents; these latent topics were ranked by importance; finally, signature words were adopted to describe the content of the topics. Mesaros et al. [14] used PLSA to help detect audio events. In [14], an HMM was used to detect the audio events; to generate the inter-model transition probabilities of the HMM network, PLSA was adopted to estimate the prior probabilities of audio events, and these prior probabilities were then used as the inter-model transition probabilities. Hu et al. [15] and Kim et al. [16] applied LDA to audio retrieval. In [15], the authors improved the traditional LDA and proposed Gaussian-LDA for audio retrieval. Unlike traditional LDA, which models the topic-word distribution with a multinomial distribution, Gaussian-LDA models it with a Gaussian distribution. In this way, Gaussian-LDA avoids the information loss caused by VQ (vector quantization). In [16], the authors created the audio vocabulary through LBG-VQ (Linde–Buzo–Gray Vector Quantization), extracted the topic distribution of each audio clip through LDA, and finally adopted an SVM (Support Vector Machine) for classification. Later, in 2012, the algorithm proposed in [16] was applied to audio tag classification [17].

There are also studies which applied PLSA and LDA to ASR [18–20]. All of these methods follow the common paradigm of ASR described in the introduction: creating the audio vocabulary, mapping audio frames to audio words, counting the document-word co-occurrence matrix, generating the document-topic distribution of each audio clip through the topic model, and performing classification. In [18] and [19], the authors adopted PLSA as the topic model and an SVM as the classifier, and created the audio vocabulary with RPCL (Rival Penalized Competitive Learning) clustering and GMM (Gaussian Mixture Model) clustering, respectively. In [20], Kim et al. adopted LDA as the topic model and an SVM as the classifier, and used LBG-VQ to create the audio vocabulary.
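The step "generating the document-topic distribution through the topic model" can be illustrated with a small PLSA fit by expectation-maximization. This is a minimal pure-Python sketch under simplifying assumptions (random initialization, a fixed number of iterations, no tempering or convergence check); it is not the implementation used in [18–20], and the function name `plsa` and the toy count matrix are hypothetical.

```python
import random

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSA with EM on a count matrix (rows: documents, columns: words).

    Returns (p_z_d, p_w_z) with p_z_d[d][z] = P(z|d) and p_w_z[z][w] = P(w|z).
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def normalize(v):
        s = sum(v)
        # Fall back to a uniform distribution if there is no mass to normalize.
        return [x / s for x in v] if s > 0 else [1.0 / len(v)] * len(v)

    # Random positive initialization of both parameter tables.
    p_z_d = [normalize([rng.random() + 1e-3 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 1e-3 for _ in range(n_words)])
             for _ in range(n_topics)]

    for _ in range(n_iter):
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: posterior P(z | d, w) is proportional to P(z|d) * P(w|z).
                post = normalize([p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)])
                # Accumulate expected counts for the M-step.
                for z in range(n_topics):
                    new_z_d[d][z] += n * post[z]
                    new_w_z[z][w] += n * post[z]
        # M-step: renormalize the expected counts into probabilities.
        p_z_d = [normalize(row) for row in new_z_d]
        p_w_z = [normalize(row) for row in new_w_z]
    return p_z_d, p_w_z

# Toy document-word counts with two groups of documents using disjoint words.
toy_counts = [[5, 4, 0, 0], [4, 5, 0, 0], [0, 0, 5, 4], [0, 0, 4, 5]]
p_z_d, p_w_z = plsa(toy_counts, n_topics=2, n_iter=30)
```

Each row of `p_z_d` is the topic distribution of one document. In the paradigm described above, these rows are the feature vectors passed to the classifier.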

In this work we also adopt PLSA and LDA to perform ASR. The difference between our algorithm and the PLSA/LDA-based ASR algorithms above [18–20] is that ours uses the document-event co-occurrence matrix for topic analysis, whereas the authors of [18–20] used the document-word co-occurrence matrix. Our choice is based on the hypothesis that, for an audio document, the event distribution is more in line with the human way of thinking than the word distribution, so the topic distribution obtained from the document-event co-occurrence matrix represents the audio document better.
