CIKM ’20, October 19–23, 2020, Virtual Event, Ireland Su Yan, Xin Chen, Ran Huo, Xu Zhang, and Leyu Lin
efficiency. It is noteworthy that the user profile is highly connected
to the strategies and models used in the recall and ranking layers; it is,
therefore, necessary to build an accurate user profile for online
news recommendation systems.
The process of building a user-tag profile is shown in the blue
dashed rectangle in Figure 1. First, we collect data and process it
in the feature generation and label organization steps (the process
is detailed in the Experiment Setup section). Then, for each user, all
"clicked tags" (tags within news articles that the user clicked) in the
user's reading history are collected in the candidate selection step,
and the user's preference for these tags is calculated in the model
training step. One user can have thousands of clicked tags. Moreover,
when predicting a user's preferences on unseen tags, the candidate
set can contain a large number of tags.
While building the user-tag profile, it is essential to select features
and learn feature interactions efficiently, since the model input
consists of numerous sparse features from multiple fields, as shown
in the left part of Figure 1. In news recommendation systems, clicked
articles are taken as positive samples, while articles browsed but not
clicked are taken as negative samples. However, directly applying
this scheme to tagging treats every tag within a clicked news
article as a positive sample, which can be problematic, since a
user who clicks an article may be interested in only one of its tags.
Based on these considerations, we aim to answer the
following two questions:
• RQ1: How to automatically select useful features and learn
the interactions between features within and among different
fields?
• RQ2: How to learn a user's preference over the different tags in
each clicked news article?
In recent years, deep neural networks have been widely used
in recommendation systems, where user and news article features
are utilized and their interactions are learned. One widely used
deep recommendation model is the YouTube model proposed by
Google [4]. However, we have identified two weaknesses of this
model that we aim to improve. First, the YouTube model uses
an average pooling layer to merge multiple input feature embeddings,
failing to consider that useful and useless features should be
assigned different weights. Furthermore, the YouTube model uses
concatenation to merge features across different fields and feeds the
merged output into the upper layer through an MLP (multilayer
perceptron). In experiments, we observed that the weights of some fields
are underestimated, especially when these fields are not highly
related to the labels, which hinders feature fusion across fields.
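To make the first weakness concrete, the following minimal numpy sketch contrasts uniform average pooling (the YouTube-style merge) with a single-query attention pool that scores each feature embedding before merging. The query vector and dimensions here are illustrative assumptions, not the paper's actual parameters.

```python
import numpy as np

def average_pool(embeddings):
    """YouTube-style merge: every feature gets the same weight,
    regardless of how useful it is."""
    return embeddings.mean(axis=0)

def attention_pool(embeddings, query):
    """Weighted merge: a learned query vector scores each feature
    embedding, and a softmax turns the scores into weights."""
    scores = embeddings @ query                 # one usefulness score per feature
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax over features
    return weights @ embeddings                 # weighted sum, shape (d,)

rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 8))   # 5 sparse-feature embeddings of dim 8
q = rng.normal(size=8)          # hypothetical learned query vector
avg = average_pool(emb)
att = attention_pool(emb, q)
```

Both pools return a vector of the embedding dimension; the difference is only in how the per-feature weights are assigned.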
To avoid these pitfalls of the YouTube model, and inspired by the
attention mechanism [20], which captures useful word and sentence
embeddings for document classification, we design an attention fusion
layer within and across feature fields. In the original attention
mechanism, only one query vector determines feature
usefulness. We believe that multi-head attention helps preserve more
useful features from multiple aspects; our model therefore uses
two query vectors, each shared across the attention units of one
head.
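A minimal sketch of this idea, under the assumption that each of the two query vectors drives one attention head and is shared across all fields (function name and shapes are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def shared_query_multihead_pool(embeddings, queries):
    """Pool one field's feature embeddings with several shared query
    vectors: each query acts as one attention head producing its own
    weighted sum, and the heads are concatenated."""
    heads = []
    for q in queries:                   # one head per shared query vector
        w = softmax(embeddings @ q)     # normalized usefulness weights
        heads.append(w @ embeddings)    # head output, shape (d,)
    return np.concatenate(heads)        # shape (num_heads * d,)

rng = np.random.default_rng(1)
field_emb = rng.normal(size=(6, 8))     # 6 sparse features in one field
queries = rng.normal(size=(2, 8))       # two shared query vectors (two heads)
fused = shared_query_multihead_pool(field_emb, queries)
```

With two heads over 8-dimensional embeddings, the fused field representation is a 16-dimensional vector, so each head can emphasize a different aspect of the same field.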
Furthermore, we propose a cross feature layer to enhance model
performance. FM-based feature interaction methods such as AFM [18]
and NFM [7] are widely used, where the Hadamard products of pairwise
hidden vectors are summed into a single vector of the same size as
the hidden vectors. In the user profiling task, these
methods may lose user information. It is therefore
advisable to output all inner-product values of the pairwise hidden
vectors. This helps to learn a user's multiple interests and leads to
higher performance in multi-label classification.
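The contrast can be sketched as follows: an NFM-style bi-interaction collapses all pairwise Hadamard products into one d-dimensional vector, whereas keeping every pairwise inner product yields one scalar per feature pair. This is an illustrative sketch of the two output shapes, not the paper's exact layer.

```python
import numpy as np

def nfm_style_cross(vectors):
    """NFM-style bi-interaction: sum the Hadamard products of all
    pairs of hidden vectors into one vector of the same size d."""
    n, d = vectors.shape
    out = np.zeros(d)
    for i in range(n):
        for j in range(i + 1, n):
            out += vectors[i] * vectors[j]   # elementwise product
    return out                               # shape (d,)

def all_inner_products(vectors):
    """Keep each pairwise inner product as a separate scalar,
    yielding an n*(n-1)/2-dimensional dense crossed vector."""
    n, _ = vectors.shape
    return np.array([vectors[i] @ vectors[j]
                     for i in range(n) for j in range(i + 1, n)])

rng = np.random.default_rng(2)
hidden = rng.normal(size=(4, 8))       # 4 hidden vectors of dim 8
summed = nfm_style_cross(hidden)       # shape (8,)
crossed = all_inner_products(hidden)   # shape (6,): 4 choose 2 pairs
```

Note that summing the crossed scalars recovers the sum of the NFM-style output, which shows the summed form discards exactly the per-pair detail that the all-pairs form keeps.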
The contributions of this paper are as follows: 1) We propose a
user-tag profiling model (UTPM), which makes use of multiple
fields of user information and is also suitable for other user profiling
tasks. 2) In this model, we introduce a multi-head attention mechanism
with shared query vectors to capture the important attributes
within each field and to merge multiple fields by assigning each field
a reasonable weight. 3) We propose a specially designed FM-based
cross feature layer to promote user profiling, where all crossed values,
along with the linear values, are fed as a dense vector to the next layer
to generate the final user embedding. 4) Specifically for the
user-tag profiling task, where each news article contains several
tags, we design a joint loss to learn each user-tag preference, which is
shown to achieve better performance than separate
training.
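As a rough illustration of the joint-loss idea, one plausible sketch scores every tag of a single impression against the user embedding and sums the per-tag binary cross-entropies, so all user-tag preferences in that article are learned together rather than one tag at a time. The function name, labels, and shapes below are hypothetical, not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def joint_tag_loss(user_emb, tag_embs, labels):
    """Hypothetical joint loss sketch: one preference logit per tag
    (user-tag inner product), with a summed binary cross-entropy so
    all tags of the impression are trained jointly."""
    logits = tag_embs @ user_emb      # one score per tag in the article
    probs = sigmoid(logits)
    eps = 1e-9                        # numerical safety for the logs
    return -np.sum(labels * np.log(probs + eps)
                   + (1 - labels) * np.log(1 - probs + eps))

rng = np.random.default_rng(3)
u = rng.normal(size=8)               # final user embedding
tags = rng.normal(size=(5, 8))       # embeddings of 5 tags in one article
y = np.array([1., 0., 1., 0., 0.])   # assumed per-tag preference labels
loss = joint_tag_loss(u, tags, y)
```

Because the loss sums over tags of the same impression, gradients for all of that article's tags flow through the shared user embedding in one step, in contrast to training a separate binary task per tag.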
2 RELATED WORK
Tag recommendation system.
In recent years, tag recommendation techniques have received increasing
attention. An adaptation of user-based collaborative filtering and a
graph-based recommender have been proposed for tag recommendation
systems [8]. Meanwhile, Vig introduces a tag-based explainable
recommendation system [16], studying two key components, tag relevance
and user-tag preference, both of which
improve effectiveness. Researchers at Sina Weibo designed an integrated
recommendation algorithm to collectively explore the social
relationships among users and the co-occurrence and semantic
relationships among tags [19]. So far, the methods applied to
the tag recommendation problem have mainly been collaborative
algorithms based simply on the co-occurrence between users
and tags. In reality, however, these methods cannot fully utilize
multi-field user information, which could help discover users' interests. It
is necessary to apply state-of-the-art deep models to enhance the
performance of user profiling tasks.
Attention Mechanism.
The attention mechanism originates from Neural Machine
Translation (NMT) [1], where words are assigned
different weights in different contexts. Attention has been
used successfully not only in a variety of NLP tasks, including
reading comprehension, abstractive summarization, and textual
entailment [10], but also in recommendation systems [3, 21]. One
type of self-attention [7] learns normalized weights over words and
sentences for a given classification task, where only one query vector
is learned. Liu [11] utilizes this self-attention mechanism to fuse
features among fields and achieves better performance than
concatenation with an MLP for the online news recommendation task. Another
kind of self-attention [15] studies the inner correlation between
words, where each word is assigned its own query vector. Song [14]
utilizes this self-attention mechanism to design a stack of cross
networks called AutoInt, which can learn high-order feature
interactions among different fields.