from supervised learning in which a classifier is trained for each distinct
word on a corpus of manually sense-annotated examples, to completely
unsupervised methods that cluster occurrences of words, thereby inducing senses. Recent advances in WSD have benefited greatly from the
availability of corpora annotated with word senses. Most accurate WSD
systems to date exploit supervised methods which automatically learn
cues useful for disambiguation from manually sense-annotated data.
For machine learning-based WSD systems, commonly used algo-
rithms include the Naïve Bayesian model, decision trees, maximum entropy, the support vector machine (SVM), and so on. Among these, kernel
methods (Hofmann et al., 2008; Shawe-Taylor and Cristianini, 2004),
such as SVM, regularized least-squares classification (RLSC) and kernel
principal component analysis (KPCA), have demonstrated excellent
performance in terms of accuracy and robustness (Giuliano et al., 2009;
Gliozzo et al., 2005; Jin et al., 2008; Joshi et al., 2006; Lee and Ng, 2002;
Lee et al., 2004; Pahikkala et al., 2009; Popescu, 2004; Wang et al.,
2014, 2013a, 2015; Wu et al., 2004). Recently, Li et al. (2016) presented an extensive survey of the state of the art in the field of kernel methods
for WSD, concentrating on issues such as data representation, kernel
selection and learning algorithms. Basically, kernel methods work by
mapping the data from the input space into a high-dimensional (possibly
infinite) feature space, which is usually chosen to be a reproducing
kernel Hilbert space (RKHS), and then building linear algorithms in the
feature space to implement nonlinear counterparts in the input space.
The mapping, rather than being given in an explicit form, is determined
implicitly by specifying a kernel function, which computes the inner
product between each pair of data points in the feature space. There are
several reasons that make kernel methods applicable to WSD and other
NLP problems (Li et al., 2016; Wang et al., 2015). Firstly, instead of man-
ual construction of feature space for the learning task, kernel functions
provide an alternative way to design useful features in the feature space
automatically, thereby ensuring the necessary representational power.
Secondly, kernel methods offer a flexible and efficient way to define
application-specific kernels for introducing background knowledge and
explicitly modeling linguistic insights. This property makes it possible to notably improve the performance of general learning methods and to adapt them easily to a specific application. Finally, kernel methods
can be naturally applied to non-vectorial types of data, thus taking into account the structure of the data and greatly reducing the need for careful feature engineering on such structures.
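As an aside on the implicit mapping mentioned above, the short sketch below (hypothetical Python/NumPy code, not taken from the paper) verifies numerically that a degree-2 polynomial kernel computes exactly the inner product of an explicitly constructed feature map, so the feature space never needs to be built explicitly.

```python
import numpy as np

def poly2_kernel(x, z):
    """Degree-2 homogeneous polynomial kernel k(x, z) = (x . z) ** 2."""
    return np.dot(x, z) ** 2

def poly2_features(x):
    """Explicit feature map for the same kernel: all monomials x_i * x_j."""
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

# The kernel value and the inner product in the explicit feature space agree,
# which is exactly what the kernel trick exploits.
print(poly2_kernel(x, z))                            # 20.25
print(np.dot(poly2_features(x), poly2_features(z)))  # 20.25
```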
From the point of view of modularization, kernel methods consist of
two main components, namely the kernel and the actual learning algorithm.
The kernel can be considered as an interface between the input data and
the learning algorithm, and is the key component to ensure the good
performance of kernel methods (Shawe-Taylor and Cristianini, 2004;
Wang et al., 2009). In fact, for real applications, the kernel is the only task-specific component of kernel methods. In the domain of WSD, the
widely used kernel is the ‘‘Bag of Words’’ (BOW) kernel (Shawe-Taylor
and Cristianini, 2004), which is based on the BOW representation of the
context in which an ambiguous word occurs. In this representation, each
word or term constitutes a dimension in a vector space, independent
of other terms in the same context. Despite its ease of use, this kernel
suffers from well-known limitations, mostly due to its inability to
exploit semantic similarity between terms: contexts containing terms that are different but semantically related will be considered unrelated.
To address this problem, a number of attempts have been made to
incorporate semantic knowledge into the BOW kernel, resulting in the
so-called semantic kernels (Shawe-Taylor and Cristianini, 2004). For
example, semantic kernels that use external semantic knowledge provided by word thesauri or ontologies were proposed to improve kernel-based WSD systems (Jin et al., 2008; Joshi et al., 2006). In the
absence of external semantic knowledge, Latent Semantic Indexing (LSI)
technology was applied to capture the semantic relations between terms
(Giuliano et al., 2009; Gliozzo et al., 2005). More information about the
semantic kernel can be found in the text categorization literature (Altınel et al., 2015; Cristianini et al., 2002), a more general application domain than WSD.
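To make the contrast concrete, the sketch below (hypothetical Python/NumPy code; the toy vocabulary, contexts and term-similarity matrix are illustrative assumptions, not data from the paper) compares the plain BOW kernel with a simple semantic kernel of the form k(d1, d2) = d1^T S d2, in which a term-similarity (semantic smoothing) matrix S lets two contexts that share no terms but contain related words obtain a nonzero similarity.

```python
import numpy as np

# Toy vocabulary and two contexts of an ambiguous word that share no terms
# but contain semantically related ones.
vocab = ["loan", "money", "interest", "river", "water"]
d1 = np.array([1, 0, 0, 0, 0], dtype=float)   # context mentions "loan"
d2 = np.array([0, 1, 1, 0, 0], dtype=float)   # context mentions "money", "interest"

def bow_kernel(a, b):
    """Plain Bag-of-Words kernel: inner product of the term-count vectors."""
    return float(a @ b)

# Illustrative term-similarity (semantic smoothing) matrix S; in practice it
# would come from a thesaurus, an ontology, LSI, or co-occurrence statistics.
# S should be symmetric and positive semi-definite for k to be a valid kernel.
S = np.eye(len(vocab))
S[0, 1] = S[1, 0] = 0.8   # "loan" ~ "money"
S[0, 2] = S[2, 0] = 0.6   # "loan" ~ "interest"

def semantic_kernel(a, b, S):
    """A simple semantic-kernel form: k(a, b) = a^T S b."""
    return float(a @ S @ b)

print(bow_kernel(d1, d2))           # 0.0 -> unrelated under plain BOW
print(semantic_kernel(d1, d2, S))   # 1.4 -> related once term similarity is used
```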
In our previous studies (Wang et al., 2014, 2013a), we proposed to
apply the semantic diffusion kernel (Kandola et al., 2003) to improve
the WSD system. The semantic diffusion kernel is obtained through a matrix exponentiation transformation of the given kernel matrix, and implicitly exploits higher-order co-occurrences to infer semantic
similarity between terms. Geometrically, this kernel models semantic
similarities by means of a diffusion process on a graph defined by
lexicon and co-occurrence information. However, the diffusion is an
unsupervised process, which fails to exploit the class information in a
classification scenario and may not be optimal for the supervised WSD
system. Chakraborti et al. (2006, 2007) introduced a simple approach
called ‘‘sprinkling’’ to incorporate class labels of documents into LSI. In
sprinkling, a set of class-specific artificial terms are appended to the rep-
resentations of documents of the corresponding class. LSI is then applied
to the sprinkled term-document matrix, resulting in a concept space that
better reflects the underlying class distribution of documents. Recently,
this approach was also applied to sprinkle Latent Dirichlet Allocation
(LDA) topics for weakly supervised text classification (Hingmire and
Chakraborti, 2014). The underlying rationale of this approach is that the sprinkled terms help exploit the class information of text documents during classification. Motivated by these works,
in this paper we present a sprinkled semantic diffusion kernel with
application to WSD. The basic idea is to construct an augmented term-
document matrix by encoding class information as additional terms and
appending them to training documents. Diffusion is then performed
on the augmented term-document matrix to learn the semantic matrix,
which is the key component of semantic kernels. In this way, the words
belonging to the same class are indirectly drawn closer to each other,
hence the class-specific word correlations are strengthened. Although
the idea behind the sprinkled semantic diffusion kernel is very similar to
that of sprinkled LSI, to the best of our knowledge, our work is the first to simultaneously exploit higher-order co-occurrences and class information to construct a semantic smoothing kernel for the supervised WSD task.
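A minimal sketch of this construction is given below (hypothetical Python/NumPy/SciPy code; the matrix sizes, the sprinkling weight, and the particular diffusion form S = expm(lambda * G), with G the co-occurrence matrix of the augmented terms, are assumptions made for illustration and may differ in detail from the paper's formulation).

```python
import numpy as np
from scipy.linalg import expm

# Term-document matrix D: rows = terms, columns = training contexts (documents).
# Values are term counts; the sizes here are purely illustrative.
D = np.array([[1, 0, 2, 0],
              [0, 1, 0, 1],
              [1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)
labels = np.array([0, 1, 0, 1])     # sense label of each training context
n_classes = 2
alpha = 1.0                         # sprinkling weight (assumed)

# Sprinkling: append one artificial "class term" per sense and set it to alpha
# in the documents of that class, yielding an augmented term-document matrix.
sprinkle = alpha * np.eye(n_classes)[:, labels]   # shape (n_classes, n_docs)
D_aug = np.vstack([D, sprinkle])

# Diffusion on the augmented matrix: G is the (augmented) term co-occurrence
# matrix and S = expm(lam * G) is the learned semantic matrix.
lam = 0.1
G = D_aug @ D_aug.T
S = expm(lam * G)

# Sprinkled semantic diffusion kernel between training contexts:
# K[i, j] = d_i^T S d_j, computed on the augmented document vectors.
K = D_aug.T @ S @ D_aug
print(K.shape)                      # (4, 4) Gram matrix over the training contexts
```

At prediction time a test context carries no sense label, so under this construction its sprinkled entries would simply be zero.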
The remainder of this article is organized as follows. Section 2
briefly introduces the kernel methods in general and SVM in particular.
Section 3 then details the proposed sprinkled semantic diffusion kernel
with application to WSD. The proposed kernel is demonstrated with
several Senseval/Semeval benchmark examples in Section 4, followed
by conclusions and some potential directions for future work.
2. Kernel methods and SVM
Kernel methods have been highly successful in solving various
problems in the machine learning and data mining communities (Hofmann et
al., 2008; Shawe-Taylor and Cristianini, 2004). These methods map data
points from the input space to some feature space where even relatively
simple algorithms such as linear methods can deliver very impressive
performance. The most attractive feature of kernel methods is that they
can be applied in high-dimensional feature spaces without suffering from
the high cost of explicitly computing the feature map. This is possible
with the kernel trick, i.e., using a valid kernel function $k$ on any set $\mathcal{X}$ (input space). A function $k(\mathbf{x}, \mathbf{x}')$ is a valid kernel if and only if for any finite set it produces symmetric and positive semi-definite Gram matrices. For such $k \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ ($\mathbb{R}$ denotes the set of real numbers), it is known that there exists a mapping $\boldsymbol{\phi} \colon \mathcal{X} \to \mathcal{F}$ ($\mathcal{F}$ denotes the feature space induced by a kernel function) into a reproducing kernel Hilbert space (RKHS), such that $k(\mathbf{x}, \mathbf{x}') = \langle \boldsymbol{\phi}(\mathbf{x}), \boldsymbol{\phi}(\mathbf{x}') \rangle$ for any $\mathbf{x}, \mathbf{x}' \in \mathcal{X}$.¹ Popular kernel functions include the linear kernel, the polynomial kernel and the Gaussian kernel.
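This validity condition is easy to check numerically; the sketch below (hypothetical Python/NumPy code on randomly generated inputs) builds the Gram matrices of the linear, polynomial and Gaussian kernels on a small sample and verifies that each is symmetric and positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))        # 10 sample points in a 3-d input space

def gram(kernel, X):
    """Gram (kernel) matrix K with K[i, j] = k(x_i, x_j)."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

kernels = {
    "linear":     lambda x, z: x @ z,
    "polynomial": lambda x, z: (x @ z + 1.0) ** 3,
    "gaussian":   lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0),
}

for name, k in kernels.items():
    K = gram(k, X)
    symmetric = np.allclose(K, K.T)
    psd = np.all(np.linalg.eigvalsh(K) >= -1e-10)   # tolerance for round-off
    print(f"{name}: symmetric={symmetric}, PSD={psd}")
```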
We here consider the SVM, the most well-known kernel method
in practice. In a binary classification problem, we are given $l$ pairs of training samples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$, where $\mathbf{x}_i \in \mathcal{X}$ and $y_i \in \{+1, -1\}$. The
¹ With a kernel $k(\mathbf{x}_i, \mathbf{x}_j)$ for any $\mathbf{x}_i, \mathbf{x}_j \in \mathcal{X}$, the Gram matrix or kernel matrix is given by $\mathbf{K}_{i,j} = k(\mathbf{x}_i, \mathbf{x}_j)$. Since the Gram matrix and kernel function are essentially equivalent, we can refer to one or the other as ‘‘kernel’’ without risk of confusion.