Journal of Machine Learning Research 3 (2002) 397-422 Submitted 11/01; Published 11/02
Using Confidence Bounds for
Exploitation-Exploration Trade-offs
Peter Auer pauer@igi.tu-graz.ac.at
Graz University of Technology
Institute for Theoretical Computer Science
Inffeldgasse 16b
A-8010 Graz, Austria
Editor: Philip M. Long
Abstract
We show how a standard tool from statistics — namely confidence bounds — can be used
to elegantly deal with situations which exhibit an exploitation-exploration trade-off. Our
technique for designing and analyzing algorithms for such situations is general and can be
applied when an algorithm has to make exploitation-versus-exploration decisions based on
uncertain information provided by a random process.
We apply our technique to two models with such an exploitation-exploration trade-off.
For the adversarial bandit problem with shifting our new algorithm suffers only $\tilde{O}\bigl((ST)^{1/2}\bigr)$
regret with high probability over $T$ trials with $S$ shifts. Such a regret bound was previously
known only in expectation. The second model we consider is associative reinforcement
learning with linear value functions. For this model our technique improves the regret from
$\tilde{O}\bigl(T^{3/4}\bigr)$ to $\tilde{O}\bigl(T^{1/2}\bigr)$.
Keywords: Online Learning, Exploitation-Exploration, Bandit Problem, Reinforcement
Learning, Linear Value Function
1. Introduction
In this paper we consider situations which exhibit an exploitation-exploration trade-off. In
such a scenario an algorithm repeatedly makes decisions to maximize its rewards — the
exploitation — but the algorithm has only limited knowledge about the process generating
the rewards. Thus occasionally the algorithm might decide to do exploration which improves
the knowledge about the reward generating process, but which is not necessarily maximizing
the current reward.
If the knowledge about the reward generating process can be captured by a set of random
variables, then confidence bounds provide a very useful tool to deal with the exploitation-
exploration trade-off. The estimated means (or a similar quantity) of the random variables
reflect the current knowledge of the algorithm in a condensed form and guide further ex-
ploitation. The widths of the confidence bounds reflect the uncertainty of the algorithm’s
knowledge and will guide further exploration. By relating means and widths we can obtain
criteria on when to explore and when to exploit. How such a criterion is constructed de-
pends on the actual model under consideration. In the remainder of this paper we consider
two such models in detail, the adversarial bandit problem with shifting and associative reinforcement
learning with linear value functions. The bandit problem is perhaps the most
generic way to model an exploitation-exploration trade-off (Robbins, 1952, Lai and Robbins,
1985, Berry and Fristedt, 1985, Agrawal, 1995, Auer et al., 1995, Sutton and Barto,
1998). In this paper we will consider a worst-case variant of the bandit problem with
shifting. Furthermore, we will consider associative reinforcement learning with linear value
functions (Kaelbling, 1994a,b, Sutton and Barto, 1998, Abe and Long, 1999). In this model
exploration is more involved since knowledge about a functional dependency has to be
collected.
Using confidence bounds to deal with an exploitation-exploration trade-off is not a new
idea (e.g. Kaelbling, 1994a,b, Agrawal, 1995). What is new in this paper is that we use
confidence bounds in rather complicated situations and that we are still able to prove
rigorous performance bounds. Thus we believe that confidence bounds can be successfully
applied in many such situations with an exploitation-exploration trade-off. Furthermore,
since algorithms which use confidence bounds can be tuned quite easily, we expect that
such algorithms will prove useful in practical applications.
In Section 2 we start off with the random bandit problem. The random bandit problem
is a typical model for the trade-off between exploitation and exploration. Using upper con-
fidence bounds, very simple and almost optimal algorithms for the random bandit problem
have been derived. We briefly review this previous work since it illuminates the main ideas
of using upper confidence bounds. In Section 3 we introduce the adversarial bandit problem
with shifting and compare our new results with the previously known results. In Section 4
we define the model for associative reinforcement learning with linear value functions and
discuss our results for this model.
2. Upper Confidence Bounds for the Random Bandit Problem
The random bandit problem was originally proposed by Robbins (1952). It formalizes an
exploitation-exploration trade-off where in each trial $t = 1, \dots, T$ one out of $K$ possible
alternatives has to be chosen. We denote the choice for trial $t$ by $i(t) \in \{1, \dots, K\}$. For
the chosen alternative a reward $x_{i(t)}(t) \in \mathbb{R}$ is collected, and the rewards for the other
alternatives $x_i(t) \in \mathbb{R}$, $i \in \{1, \dots, K\} \setminus \{i(t)\}$, are not revealed. The goal of an algorithm
for the bandit problem is to maximize its total reward $\sum_{t=1}^{T} x_{i(t)}(t)$. For the random bandit
problem$^1$ it is assumed that in each trial the rewards $x_i(t)$ are drawn independently from
some fixed but unknown distributions $D_1, \dots, D_K$. The expected total reward of a learning
algorithm should be close to the expected total reward given by the best distribution $D_i$.
Thus the regret of a learning algorithm for the random bandit problem is defined as
$$\bar{R}(T) = \max_{i \in \{1, \dots, K\}} E\left[\sum_{t=1}^{T} x_i(t)\right] - E\left[\sum_{t=1}^{T} x_{i(t)}(t)\right].$$
In this model the exploitation-exploration trade-off is reflected on the one hand by the necessity
for trying all alternatives, and on the other hand by the regret suffered when trying an
alternative which is not optimal: too little exploration might make a sub-optimal alternative
look better than the optimal one because of random fluctuations, while too much exploration
prevents the algorithm from playing the optimal alternative often enough, which also results
in a larger regret.

1. The term “bandit problem” (or more precisely “$K$-armed bandit problem”) reflects the problem of a
gambler in a room with various slot machines. In each trial the gambler has to decide which slot
machine he wants to play. To maximize his total gain or reward his (rational) choice will be based on
the previously collected rewards.
Lai and Robbins (1985) have shown that an optimal algorithm achieves $\bar{R}(T) = \Theta(\ln T)$
as $T \to \infty$ when the variances of the distributions $D_i$ are finite.$^2$ Agrawal (1995) has shown
that a quite simple learning algorithm suffices to obtain such performance. This simple
algorithm is based on upper confidence bounds of the form $\hat{\mu}_i(t) + \sigma_i(t)$ for the expected
rewards $\mu_i$ of the distributions $D_i$. Here $\hat{\mu}_i(t)$ is an estimate for the true expected reward
$\mu_i$, and $\sigma_i(t)$ is chosen such that $\hat{\mu}_i(t) - \sigma_i(t) \le \mu_i \le \hat{\mu}_i(t) + \sigma_i(t)$ with high probability.
In each trial $t$ the algorithm selects the alternative with maximal upper confidence bound
$\hat{\mu}_i(t) + \sigma_i(t)$. Thus an alternative $i$ is selected if $\hat{\mu}_i(t)$ is large or if $\sigma_i(t)$ is large. Informally
we may say that a trial is an exploration trial if an alternative with large $\sigma_i(t)$ is chosen,
since in this case the estimate $\hat{\mu}_i(t)$ is rather unreliable. When an alternative with large
$\hat{\mu}_i(t)$ is chosen we may call such a trial an exploitation trial. Since $\sigma_i(t)$ decreases rapidly
with each choice of alternative $i$, the number of exploration trials is limited. If $\sigma_i(t)$ is small
then $\hat{\mu}_i(t)$ is close to $\mu_i$, and an alternative is selected in an exploitation trial only if it is
indeed the optimal alternative with maximal $\mu_i$. Thus the use of upper confidence bounds
automatically trades off between exploitation and exploration.

2. Their result is even more general.
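To make this selection rule concrete, here is a minimal Python sketch of an index policy of this form. It is only an illustration: the width $\sigma_i(t) = \sqrt{2\ln t / n_i(t)}$ (with $n_i(t)$ the number of times alternative $i$ has been chosen so far) is a common textbook choice that we assume here rather than the exact confidence bound of Agrawal (1995), and the function and variable names are ours.

import math
import random

def ucb_bandit(pull, K, T):
    # Play T trials of a K-armed random bandit, always choosing the alternative
    # with the largest upper confidence bound mu_hat_i(t) + sigma_i(t).
    counts = [0] * K      # n_i(t): how often alternative i has been chosen so far
    means = [0.0] * K     # mu_hat_i(t): empirical mean reward of alternative i
    total = 0.0
    for t in range(1, T + 1):
        if t <= K:
            i = t - 1     # play every alternative once to initialize the estimates
        else:
            ucb = [means[j] + math.sqrt(2.0 * math.log(t) / counts[j]) for j in range(K)]
            i = max(range(K), key=lambda j: ucb[j])
        x = pull(i)       # only the reward of the chosen alternative is revealed
        counts[i] += 1
        means[i] += (x - means[i]) / counts[i]
        total += x
    return total

if __name__ == "__main__":
    probs = [0.2, 0.5, 0.7]   # unknown Bernoulli means of D_1, D_2, D_3 (illustrative only)
    print(ucb_bandit(lambda i: float(random.random() < probs[i]), K=3, T=10000))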
3. The Adversarial Bandit Problem with Shifts
The adversarial bandit problem was first analyzed by Auer et al. (1995). In contrast to the
random bandit problem the adversarial bandit problem makes no statistical assumptions on
how the rewards $x_i(t)$ are generated. Thus, the rewards might be generated in an adversarial
way to make life hard for an algorithm in this model. Since the rewards are not random
any more, the regret of an algorithm for the adversarial bandit problem is defined as
$$R(T) = \max_{i \in \{1, \dots, K\}} \sum_{t=1}^{T} x_i(t) - \sum_{t=1}^{T} x_{i(t)}(t) \qquad (1)$$
where $R(T)$ might be a random variable depending on a possible randomization of the
algorithm. In our paper (Auer et al., 1995) we have derived a randomized algorithm which
achieves $E[R(T)] = O\bigl(T^{2/3}\bigr)$ for bounded rewards. In a subsequent paper (Auer et al.,
1998) this algorithm has been improved to yield $E[R(T)] = O\bigl(T^{1/2}\bigr)$. In the same paper
we have also shown that a variant of the algorithm satisfies $R(T) = O\bigl(T^{2/3}(\ln T)^{1/3}\bigr)$ with
high probability. This bound was improved again (Auer et al., 2000), as we can show that
$R(T) = O\bigl(T^{1/2}(\ln T)^{1/2}\bigr)$ with high probability. This is almost optimal since a lower bound
$E[R(T)] = \Omega\bigl(T^{1/2}\bigr)$ has already been shown (Auer et al., 1995). This lower bound holds
even if the rewards are generated at random as for the random bandit problem. The reason
is that for suitable distributions $D_i$ we have
$$E\left[\max_{i \in \{1, \dots, K\}} \sum_{t=1}^{T} x_i(t)\right] = \max_{i \in \{1, \dots, K\}} E\left[\sum_{t=1}^{T} x_i(t)\right] + \Omega\bigl(\sqrt{T}\bigr)$$
and thus $E[R(T)] = \bar{R}(T) + \Omega\bigl(\sqrt{T}\bigr)$.
In the current paper we consider an extension of the adversarial bandit problem where
the bandits are allowed to “shift”: the algorithm has to keep track of the alternative which gives
the highest reward even if this best alternative changes over time. Formally we compare the
total reward collected by the algorithm with the total reward of the best schedule with $S$
segments $\{1, \dots, t_1\}, \{t_1 + 1, \dots, t_2\}, \dots, \{t_{S-1} + 1, \dots, T\}$. The regret of the algorithm
with respect to the best schedule with $S$ segments is defined as
$$R_S(T) = \max_{0 = t_0 < t_1 < \cdots < t_S = T} \sum_{s=1}^{S} \max_{i \in \{1, \dots, K\}} \sum_{t = t_{s-1}+1}^{t_s} x_i(t) - \sum_{t=1}^{T} x_{i(t)}(t).$$
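For concreteness, the benchmark in the first term of $R_S(T)$, that is, the best total reward over all schedules with $S$ segments, can be computed offline by a small dynamic program over the segment boundaries. The following Python sketch is our own illustration; it assumes full knowledge of all rewards, which the learner of course never has.

def best_shifting_benchmark(x, S):
    # x[t][i] is the reward x_i(t+1) of alternative i at trial t+1 (0-based lists).
    # Returns the maximum over 0 = t_0 < t_1 < ... < t_S = T of
    # sum_s max_i sum_{t in segment s} x_i(t).
    T, K = len(x), len(x[0])
    NEG = float("-inf")
    # best[s][t]: best reward of a schedule with s segments covering the first t trials
    best = [[NEG] * (T + 1) for _ in range(S + 1)]
    best[0][0] = 0.0
    for s in range(1, S + 1):
        for t in range(1, T + 1):
            seg = [0.0] * K               # seg[i] = reward of i summed over trials u+1, ..., t
            for u in range(t - 1, -1, -1):
                for i in range(K):
                    seg[i] += x[u][i]
                if best[s - 1][u] > NEG:  # last segment consists of trials u+1, ..., t
                    best[s][t] = max(best[s][t], best[s - 1][u] + max(seg))
    return best[S][T]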
We will show that $R_S(T) = O\bigl(\sqrt{TS\ln(T)}\bigr)$ with high probability. This is essentially
optimal since any algorithm which has to solve $S$ independent adversarial bandit problems
of length $T/S$ will suffer $\Omega\bigl(\sqrt{TS}\bigr)$ regret. For a different algorithm Auer et al. (2000)
show the bound $E[R_S(T)] = O\bigl(\sqrt{TS\ln(T)}\bigr)$, but for that algorithm the variance of the
regret is so large that no interesting bound on the regret can be given that holds with high
probability.
3.1 The Algorithm for the Adversarial Bandit Problem with Shifting
Our algorithm ShiftBand (Figure 1) for the adversarial bandit problem with shifting
combines several approaches which have been found useful in the context of the bandit
problem or in the context of shifting targets. One of the main ingredients is that the
algorithm calculates estimates for the rewards of all the alternatives. For a single trial these
estimates are given by $\hat{x}_i(t)$, since the expectation of such an estimate equals the true
reward $x_i(t)$.
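To spell out this (implicit) step: with $\hat{x}_i(t)$ as defined in Figure 1, alternative $i$ is chosen with probability $p_i(t)$, in which case $\hat{x}_i(t) = x_i(t)/p_i(t)$, and otherwise $\hat{x}_i(t) = 0$. Hence, conditioned on the previous trials and taken over the algorithm's internal randomization,
$$E\bigl[\hat{x}_i(t)\bigr] = p_i(t)\cdot\frac{x_i(t)}{p_i(t)} + \bigl(1 - p_i(t)\bigr)\cdot 0 = x_i(t).$$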
Another ingredient is the exponential weighting scheme which for each alternative calculates
a weight $w_i(t)$ from an estimate of the cumulative rewards so far. Such exponential
weighting schemes have been used for the analysis of the adversarial bandit problem by Auer
et al. (1995, 1998, 2000). In contrast to previous algorithms (Auer et al., 1995, 1998) we
use in this paper an estimate of the cumulative rewards which does not give the correct ex-
pectation but which — as an upper confidence bound — overestimates the true cumulative
rewards. This over-estimation emphasizes exploration over exploitation, which in turn gives
more reliable estimates for the true cumulative rewards. This is the reason why we are able
to give bounds on the regret which hold with high probability. In the algorithm ShiftBand
the upper bound on the cumulative reward is present only implicitly. Intuitively the sum
$$\sum_{\tau=1}^{t} \left( \hat{x}_i(\tau) + \frac{\alpha}{p_i(\tau)\sqrt{TK/S}} \right)$$
can be seen as this upper confidence bound on the cumulative reward $\sum_{\tau=1}^{t} x_i(\tau)$. In the
analysis of the algorithm the relationship between this confidence bound and the cumulative
regret will be made precise. In contrast to the random bandit problem this confidence bound
is not a confidence bound for an external random process$^3$ but a confidence bound for the
effect of the internal randomization of the algorithm.

3. This external random process generates the rewards of the random bandit problem.
Algorithm ShiftBand

Parameters: Reals $\alpha, \beta, \eta > 0$, $\gamma \in (0, 1]$, the number of trials $T$, the number of
segments $S$.

Initialization: $w_i(1) = 1$ for $i = 1, \dots, K$.

Repeat for $t = 1, \dots, T$:

1. Choose $i(t)$ at random according to the probabilities $p_i(t)$, where
$$p_i(t) = (1 - \gamma)\,\frac{w_i(t)}{W(t)} + \frac{\gamma}{K} \quad \text{and} \quad W(t) = \sum_{i=1}^{K} w_i(t).$$

2. Collect reward $x_{i(t)}(t) \in [0, 1]$.

3. For $i = 1, \dots, K$ set
$$\hat{x}_i(t) = \begin{cases} x_i(t)/p_i(t) & \text{if } i = i(t), \\ 0 & \text{otherwise,} \end{cases}$$
and calculate the update of the weights as
$$w_i(t+1) = w_i(t) \cdot \exp\left\{\eta\left(\hat{x}_i(t) + \frac{\alpha}{p_i(t)\sqrt{TK/S}}\right)\right\} + \frac{\beta}{K}\, W(t).$$

Figure 1: Algorithm ShiftBand
The application of confidence bounds to capture the internal randomization of the algorithm
shows that confidence bounds are a quite versatile tool.
The weights reflect the algorithm’s belief in which alternative is best. An alternative
is either chosen at random proportional to its weight, or with probability γ an arbitrary
alternative is chosen for additional exploration.
The final ingredient is a mechanism for dealing with shifting. The basic approach is to
lower bound the weights $w_i(t)$ appropriately. This is achieved by the term $\frac{\beta}{K} W(t)$ in the
calculation of the weights $w_i(t)$. Lower bounding the weights means that it does not take too
long for a small weight to become large again if it corresponds to the new best alternative
after a shift. Similar ideas for dealing with changing targets have been used by Littlestone
and Warmuth (1994), Auer and Warmuth (1998), and Herbster and Warmuth (1998).
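Putting these ingredients together, the following Python sketch transcribes Figure 1 essentially line by line; the function name shiftband and the reward callback get_reward are our own choices for illustration and not part of the paper.

import math
import random

def shiftband(get_reward, K, T, S, alpha, beta, eta, gamma):
    # One run of algorithm ShiftBand as described in Figure 1.
    # get_reward(i, t) must return the reward x_i(t) in [0, 1] of the chosen
    # alternative i at trial t; the rewards of the other alternatives stay hidden.
    w = [1.0] * K                        # w_i(1) = 1
    bonus_scale = math.sqrt(T * K / S)   # the sqrt(TK/S) factor of the confidence bonus
    total_reward = 0.0
    for t in range(1, T + 1):
        W = sum(w)
        # Step 1: mix the normalized weights with uniform exploration of probability gamma.
        p = [(1.0 - gamma) * w[i] / W + gamma / K for i in range(K)]
        i_t = random.choices(range(K), weights=p)[0]
        # Step 2: collect the reward of the chosen alternative only.
        x = get_reward(i_t, t)
        total_reward += x
        # Step 3: importance-weighted estimate, optimistic bonus, and weight update,
        # including the beta/K * W(t) term that lower bounds the weights.
        for i in range(K):
            x_hat = x / p[i] if i == i_t else 0.0
            w[i] = w[i] * math.exp(eta * (x_hat + alpha / (p[i] * bonus_scale))) + (beta / K) * W
    return total_reward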
In the remainder of this section we assume that all rewards $x_i(t) \in [0, 1]$. If the
rewards $x_i(t)$ are not in $[0, 1]$ but bounded, then an appropriate scaling gives results similar
to the theorems below.
Theorem 1 We use the notation of (1) and Figure 1. If $T$, $S$, $K$, and $\delta$ are such that
$T \ge 144\, KS \ln(TK/\delta)$, and algorithm ShiftBand is run with parameters $\alpha = 2\sqrt{\ln(T^3 K/\delta)}$,
$\beta = 1/T$, $\gamma = 2K\eta$, and $\eta = \sqrt{\ln(TK)\, S/(TK)}$, then the regret of the algorithm satisfies
$$R_S(T) \le 11\sqrt{TKS \ln(T^3 K/\delta)} \qquad (2)$$
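For convenience, the parameter settings of Theorem 1 can be computed as follows. This small helper is our own addition; the formulas are copied directly from the theorem statement.

import math

def theorem1_parameters(T, K, S, delta):
    # Parameter choices of Theorem 1; the theorem assumes T >= 144*K*S*ln(T*K/delta).
    alpha = 2.0 * math.sqrt(math.log(T**3 * K / delta))
    beta = 1.0 / T
    eta = math.sqrt(math.log(T * K) * S / (T * K))
    gamma = 2.0 * K * eta
    return alpha, beta, gamma, eta

# Example (these values satisfy the condition of Theorem 1):
# alpha, beta, gamma, eta = theorem1_parameters(T=1000000, K=10, S=5, delta=0.05)
# total = shiftband(get_reward, K=10, T=1000000, S=5,
#                   alpha=alpha, beta=beta, eta=eta, gamma=gamma)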