2 Y. Leng et al. / Knowledge-Based Systems 125 (2017) 1–12
the audio documents into frames, creating the audio vocabulary
by vector quantization, mapping the frames into audio words
according to the audio vocabulary, counting the audio words to
generate the document-word co-occurrence matrix, analyzing the
co-occurrence matrix by the topic model to generate the topic
distribution for each audio document, taking the topic distribution
as the feature set to perform scene recognition. In these studies,
the document-word co-occurrence matrix is used for topic analysis.
We argue, however, that the document-event co-occurrence matrix is
a more reasonable basis for topic analysis, because it is more in
line with the way humans recognize scenes. When humans recognize an
audio scene, they first identify which audio events are present in
the audio document, then infer from these events the topics that
the events reflect, and finally, after analyzing the topics,
determine the type of audio scene. We therefore hypothesize that,
for an audio document, the event distribution is more in line with
the human way of thinking than the word distribution, and that the
topic distribution obtained from the document-event co-occurrence
matrix consequently represents the audio document better.
Based on the above hypothesis, we propose an ASR algorithm
which uses the document-event co-occurrence matrix for topic
analysis.
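The conventional front end described above (vector quantization of frames, mapping frames to audio words, and counting words per document) can be sketched as follows. This is a minimal illustration, not the implementation used in the cited studies: it assumes each document is already a NumPy array of frame-level feature vectors, uses plain k-means as a stand-in for whichever VQ method a given study adopts, and all function names and the toy data are hypothetical.

```python
import numpy as np

def quantize(frames, codebook):
    """Map each frame to the index of its nearest codeword (its 'audio word')."""
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def kmeans_codebook(frames, n_words, n_iter=20, seed=0):
    """Learn an audio vocabulary with plain k-means (assumed VQ method)."""
    rng = np.random.default_rng(seed)
    # Initialise the codewords from randomly chosen frames.
    start = rng.choice(len(frames), size=n_words, replace=False)
    codebook = frames[start].copy()
    for _ in range(n_iter):
        labels = quantize(frames, codebook)
        for k in range(n_words):
            members = frames[labels == k]
            if len(members) > 0:  # keep the old codeword if a cluster is empty
                codebook[k] = members.mean(axis=0)
    return codebook

def document_word_matrix(documents, codebook):
    """Count audio words per document: rows are documents, columns are words."""
    n_words = len(codebook)
    counts = np.zeros((len(documents), n_words), dtype=int)
    for d, frames in enumerate(documents):
        counts[d] = np.bincount(quantize(frames, codebook), minlength=n_words)
    return counts

# Toy data: two "documents" of 50 and 80 random 4-dimensional frames.
rng = np.random.default_rng(0)
documents = [rng.normal(size=(50, 4)), rng.normal(loc=3.0, size=(80, 4))]
codebook = kmeans_codebook(np.vstack(documents), n_words=8)
counts = document_word_matrix(documents, codebook)
```

Each row of `counts` sums to the number of frames in the corresponding document; this matrix is what a topic model would then analyze.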
The contributions of our work are as follows. (1) We propose an
ASR algorithm which uses the document-event co-occurrence matrix
for topic analysis. Compared with current studies, which use the
document-word co-occurrence matrix for topic analysis, the proposed
algorithm extracts a topic distribution that expresses the audio
documents better and therefore yields better recognition results.
(2) We propose a much easier method to obtain the document-event
co-occurrence matrix. One natural way to obtain this matrix is to
first recognize the audio events in the audio documents with a
classification model and then perform statistical analysis. This
approach, however, requires constructing the classification model,
and when the number of audio events is large the amount of
calculation is great. Our proposed method does not need a
classification model; it only needs a matrix factorization
through the topic model. Another problem with obtaining the
document-event co-occurrence matrix through a classification model
is that, due to the misclassification of audio events, the
document-event co-occurrence matrix obtained from the test set may
be poorly consistent with that obtained from the training set for
the same audio scene class; the proposed method avoids this
problem. (3) We propose a method to weight the event distribution
of audio documents. This weighting method emphasizes the audio
events that play important roles in reflecting the unique topics of
the audio documents and suppresses the audio events that are common
to many topics. As a result, it helps to extract a topic
distribution that better expresses the audio documents.
Experimental results on two public datasets have verified the
effectiveness of the proposed ASR algorithm, as well as the
necessity and effectiveness of the proposed weighting method.
The rest of the paper is organized as follows. Section 2 dis-
cusses the related work; Section 3 briefly introduces two
topic models: PLSA and LDA; Section 4 describes the proposed
ASR algorithm; Section 5 shows the experimental results, and
Section 6 gives conclusions and future work.
2. Related work
Many kinds of topic models have been proposed for text analysis.
Among them, PLSA (Probabilistic Latent Semantic Analysis) and
LDA (Latent Dirichlet Allocation) are the two most popular.
PLSA was first proposed by Hofmann [7]. It models each document
as a distribution over latent topics and each topic as a
distribution over words, but it makes no assumption about how the
document-topic distribution is generated. Later, Blei et al. [8]
extended PLSA by introducing a Dirichlet prior on the
document-topic distribution, and proposed LDA.
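In standard notation, the two models described above can be written as follows. The symbols are assumptions chosen for illustration ($d$ a document, $w$ a word, $z$ one of $K$ latent topics, $\theta_d$ the document-topic distribution, $\beta_z$ the topic-word distribution, $\alpha$ the Dirichlet parameter) and may differ from the notation used in Section 3.

```latex
% PLSA: each document is a mixture of topics, each topic a distribution over words;
% P(z | d) is a free parameter with no generative assumption.
P(w \mid d) = \sum_{z=1}^{K} P(w \mid z)\, P(z \mid d)

% LDA: the document-topic distribution is itself drawn from a Dirichlet prior.
\theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad
z_{d,n} \sim \mathrm{Multinomial}(\theta_d), \qquad
w_{d,n} \sim \mathrm{Multinomial}(\beta_{z_{d,n}})
```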
PLSA and LDA have achieved great success in text analysis.
Hofmann applied PLSA to automated indexing of documents [7]; Xu
et al. used LDA to identify implicit features in Chinese
reviews [9]; and Zhang et al. utilized LDA to improve short text
classification [10]. PLSA and LDA are not limited to text
analysis; they have also been applied to many other fields. For
example, Pliakos et al. applied PLSA to image classification [11],
and Zhou et al. applied LDA to expert finding in question-answer
communities [12]. PLSA and LDA have also been applied to the
audio field [13–20]. Hazen et al. [13] used PLSA to summarize the
topic content of an audio corpus. In [13], the number of
occurrences of each word in each audio document was first
estimated through an automatic speech recognition system; PLSA was
then used to learn the latent topics of the audio documents; these
latent topics were ranked according to importance; and finally,
signature words were adopted to describe the content of the topics.
Mesaros et al. [14] used PLSA to help detect audio events. In
[14], HMMs were used to detect the audio events; to generate the
inter-model transition probabilities of the HMM network, PLSA was
adopted to estimate the prior probabilities of audio events, and
these prior probabilities were then used as the inter-model
transition probabilities. Hu et al. [15] and Kim et al. [16]
applied LDA to audio retrieval. In [15], the authors improved the
traditional LDA and proposed Gaussian-LDA for audio retrieval.
Unlike the traditional LDA, which uses a multinomial distribution
to model the topic-word distribution, Gaussian-LDA adopts a
Gaussian distribution. In this way, Gaussian-LDA avoids the
information loss caused by VQ (vector quantization). In [16], the
authors created the audio vocabulary through LBG-VQ
(Linde–Buzo–Gray Vector Quantization), extracted the topic
distribution of each audio clip through LDA, and finally adopted
an SVM (Support Vector Machine) for classification. Later, in
2012, the algorithm proposed in [16] was applied to audio tag
classification [17].
There are also studies which applied PLSA and LDA to ASR
[18–20]. All these methods follow the common paradigm of ASR
described in the introduction: creating the audio vocabulary,
mapping audio frames into audio words, counting the document-word
co-occurrence matrix, generating the document-topic distribution
for each audio clip through the topic model, and performing
classification. In [18] and [19], the authors adopted PLSA as the
topic model, utilized SVM for classification, and used RPCL (Rival
Penalized Competitive Learning) clustering and GMM (Gaussian
Mixture Model) clustering, respectively, to create the audio
vocabulary. In [20], Kim et al. adopted LDA as the topic model,
utilized SVM for classification, and used LBG-VQ to create the
audio vocabulary.
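The topic-model step of this paradigm, turning a co-occurrence count matrix into a document-topic distribution, can be illustrated with a textbook EM procedure for PLSA. This is a minimal sketch under stated assumptions, not the implementation of [18–20]: the function name `plsa`, the random initialisation, the iteration count, and the toy count matrix are all hypothetical, and the subsequent classification step (SVM in the cited works) is omitted.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSA by EM; returns P(z|d) with shape (docs, topics)
    and P(w|z) with shape (topics, words)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    # Random initial distributions, each row normalised to sum to one.
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    eps = 1e-12  # guards against division by zero
    for _ in range(n_iter):
        # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z);
        # the array has shape (docs, topics, words).
        post = p_z_d[:, :, None] * p_w_z[None, :, :]
        post /= post.sum(axis=1, keepdims=True) + eps
        # M-step: re-estimate both distributions from expected counts.
        weighted = counts[:, None, :] * post
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps
    return p_z_d, p_w_z

# Toy co-occurrence matrix: 4 documents, 4 vocabulary entries.
counts = np.array([[5, 4, 0, 0],
                   [4, 6, 1, 0],
                   [0, 0, 7, 5],
                   [0, 1, 6, 6]])
p_z_d, p_w_z = plsa(counts, n_topics=2)
```

Each row of `p_z_d` is the topic distribution of one document and would serve as that document's feature vector for the classifier. The same routine applies unchanged to any non-negative document-by-column count matrix.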
In this work we also adopt PLSA and LDA to perform ASR. The
difference from the above PLSA/LDA-based ASR algorithms [18–20] is
that our algorithm uses the document-event co-occurrence matrix for
topic analysis, whereas the authors of [18–20] used the
document-word co-occurrence matrix. Our use of the document-event
co-occurrence matrix is based on the hypothesis that, for an audio
document, the event distribution is more in line with the human way
of thinking than the word distribution, and that the topic
distribution obtained from the document-event co-occurrence matrix
therefore represents the audio document better.