Top-K Off-Policy Correction for a REINFORCE Recommender System
Minmin Chen∗, Alex Beutel∗, Paul Covington∗, Sagar Jain, Francois Belletti, Ed H. Chi
Google, Inc., Mountain View, CA
minminc,alexbeutel,pcovington,sagarj,belletti,edchi@google.com
∗Authors contributed equally.
ABSTRACT
Industrial recommender systems deal with extremely large action spaces – many millions of items to recommend. Moreover, they need to serve billions of users, who are unique at any point in time, making for a complex user state space. Luckily, huge quantities of logged implicit feedback (e.g., user clicks, dwell time) are available for learning. Learning from this logged feedback is, however, subject to biases caused by only observing feedback on recommendations selected by previous versions of the recommender. In this work, we present a general recipe for addressing such biases in a production top-K recommender system at YouTube, built with a policy-gradient-based algorithm, i.e., REINFORCE [48]. The contributions of the paper are: (1) scaling REINFORCE to a production recommender system with an action space on the order of millions; (2) applying off-policy correction to address data biases in learning from logged feedback collected from multiple behavior policies; (3) proposing a novel top-K off-policy correction to account for our policy recommending multiple items at a time; (4) showcasing the value of exploration. We demonstrate the efficacy of our approaches through a series of simulations and multiple live experiments on YouTube.
ACM Reference Format:
Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, Ed H. Chi. 2019. Top-K Off-Policy Correction for a REINFORCE Recommender System. In The Twelfth ACM International Conference on Web Search and Data Mining (WSDM '19), February 11–15, 2019, Melbourne, VIC, Australia. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3289600.3290999
1 INTRODUCTION
Recommender systems are relied on, throughout industry, to help
users sort through huge corpuses of content and discover the small
fraction of content they would be interested in. This problem is
challenging because of the huge number of items that could be rec-
ommended. Furthermore, surfacing the right item to the right user
at the right time requires the recommender system to constantly
adapt to users’ shifting interest (state) based on their historical
interaction with the system [6]. Unfortunately, we observe relatively little data for such a large state and action space, with most users only having been exposed to a small fraction of items and providing explicit feedback to an even smaller fraction. That is, recommender systems receive extremely sparse data for training in general; e.g., the Netflix Prize dataset was only 0.1% dense [5]. As a result, a good amount of research in recommender systems explores different mechanisms for treating this extreme sparsity. Learning from implicit user feedback, such as clicks and dwell time, as well as filling in unobserved interactions, has been an important step in improving recommenders [19], but the problem remains an open one.
In a mostly separate line of research, reinforcement learning (RL) has recently achieved impressive advances in games [38, 46] as well as robotics [22, 25]. RL in general focuses on building agents that take actions in an environment so as to maximize some notion of long-term reward. Here we explore framing recommendation as building RL agents to maximize each user's long-term satisfaction with the system. This offers us new perspectives on recommendation problems as well as opportunities to build on top of recent RL advances. However, there are significant challenges in putting this perspective into practice.
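As background, the standard on-policy REINFORCE setup [48] that the off-policy and top-K corrections in this paper build on maximizes the expected cumulative reward of a parameterized policy \(\pi_\theta\), with the gradient estimated from sampled trajectories; the textbook formulation (not the exact notation introduced later in the paper) is
\[
\max_{\theta} \; J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\Big[ \sum_{t} r(s_t, a_t) \Big],
\qquad
\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\Big[ \sum_{t} R_t \, \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \Big],
\]
where \(\tau = (s_0, a_0, s_1, a_1, \ldots)\) is a trajectory generated by \(\pi_\theta\) and \(R_t\) is the cumulative (possibly discounted) reward from step \(t\) onward. In the recommendation setting, \(s_t\) is the user state, \(a_t\) the recommended item, and \(r\) an immediate measure of user satisfaction such as a click or dwell time.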
As introduced above, recommender systems deal with large state and action spaces, and this is particularly exacerbated in industrial settings. The set of items available to recommend is non-stationary, and new items are brought into the system constantly, resulting in an ever-growing action space with new items having even sparser feedback. Further, user preferences over these items are shifting all the time, resulting in continuously evolving user states. Being able to reason through this large number of actions in such a complex environment poses unique challenges in applying existing RL algorithms. Here we share our experience adapting the REINFORCE algorithm [48] to a neural candidate generator (a top-K recommender system) with extremely large action and state spaces.
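To make this concrete, below is a minimal, self-contained sketch (in plain NumPy) of a REINFORCE-style update for a softmax policy over an item catalogue. All names here (item_embeddings, user_state, reinforce_step) are illustrative placeholders and the catalogue is tiny; a production candidate generator such as the one described in this paper would instead use a learned neural state representation and an efficient sampled softmax over millions of items.

import numpy as np

rng = np.random.default_rng(0)

num_items, state_dim = 1000, 32          # toy sizes; production catalogues are in the millions
item_embeddings = rng.normal(scale=0.1, size=(num_items, state_dim))
user_state = rng.normal(size=state_dim)  # stand-in for a learned user-state representation

def softmax_policy(state):
    """Probability of recommending each item given the user state."""
    logits = item_embeddings @ state
    logits -= logits.max()               # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def reinforce_step(state, action, reward, lr=0.01):
    """One REINFORCE update: reward-weighted gradient of log pi(action | state)."""
    global item_embeddings
    probs = softmax_policy(state)
    # d log pi(a|s) / d item_embeddings = (one_hot(a) - probs) outer state
    grad = -np.outer(probs, state)
    grad[action] += state
    item_embeddings = item_embeddings + lr * reward * grad

# One logged interaction: sample a recommendation, observe a reward, update the policy.
probs = softmax_policy(user_state)
action = rng.choice(num_items, p=probs)
reward = 1.0                             # e.g., the user clicked or had long dwell time
reinforce_step(user_state, action, reward)

Note that this toy update assumes the logged action was sampled from the current policy; correcting for actions that were selected by other behavior policies is exactly the off-policy problem discussed next.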
In addition to the massive action and state spaces, RL for recommendation is distinct in its limited availability of data. Classic RL applications have overcome data inefficiencies by collecting large quantities of training data with self-play and simulation [38]. In contrast, the complex dynamics of the recommender system have made simulation for generating realistic recommendation data non-viable. As a result, we cannot easily probe for reward in previously unexplored areas of the state and action space, since observing reward requires giving a real recommendation to a real user. Instead, the model relies mostly on data made available from the previous recommendation models (policies), most of which we cannot control or can no longer control. To most effectively utilize logged feedback from other policies, we take an off-policy learning