Journal of Machine Learning Research 3 (2002) 397-422 Submitted 11/01; Published 11/02
Using Confidence Bounds for
Exploitation-Exploration Trade-offs
Peter Auer pauer@igi.tu-graz.ac.at
Graz University of Technology
Institute for Theoretical Computer Science
Inffeldgasse 16b
A-8010 Graz, Austria
Editor: Philip M. Long
Abstract
We show how a standard tool from statistics — namely confidence bounds — can be used
to elegantly deal with situations which exhibit an exploitation-exploration trade-off. Our
technique for designing and analyzing algorithms for such situations is general and can be
applied when an algorithm has to make exploitation-versus-exploration decisions based on
uncertain information provided by a random process.
We apply our technique to two models with such an exploitation-exploration trade-off.
For the adversarial bandit problem with shifting our new algorithm suffers only $\tilde{O}\bigl((ST)^{1/2}\bigr)$
regret with high probability over $T$ trials with $S$ shifts. Such a regret bound was previously
known only in expectation. The second model we consider is associative reinforcement
learning with linear value functions. For this model our technique improves the regret from
$\tilde{O}\bigl(T^{3/4}\bigr)$ to $\tilde{O}\bigl(T^{1/2}\bigr)$.
Keywords: Online Learning, Exploitation-Exploration, Bandit Problem, Reinforcement
Learning, Linear Value Function
1. Introduction
In this paper we consider situations which exhibit an exploitation-exploration trade-off. In
such a scenario an algorithm repeatedly makes decisions to maximize its rewards — the
exploitation — but the algorithm has only limited knowledge about the process generating
the rewards. Thus occasionally the algorithm might decide to do exploration which improves
the knowledge about the reward generating process, but which is not necessarily maximizing
the current reward.
If the knowledge about the reward generating process can be captured by a set of random
variables, then confidence bounds provide a very useful tool to deal with the exploitation-
exploration trade-off. The estimated means (or a similar quantity) of the random variables
reflect the current knowledge of the algorithm in a condensed form and guide further ex-
ploitation. The widths of the confidence bounds reflect the uncertainty of the algorithm’s
knowledge and will guide further exploration. By relating means and widths we can obtain
criteria on when to explore and when to exploit. How such a criterion is constructed de-
pends on the actual model under consideration. In the remainder of this paper we consider
two such models in detail, the adversarial bandit problem with shifting and associative reinforcement
learning with linear value functions. The bandit problem is perhaps the most
generic way to model an exploitation-exploration trade-off (Robbins, 1952, Lai and Robbins,
1985, Berry and Fristedt, 1985, Agrawal, 1995, Auer et al., 1995, Sutton and Barto,
1998). In this paper we will consider a worst-case variant of the bandit problem with
shifting. Furthermore, we will consider associative reinforcement learning with linear value
functions (Kaelbling, 1994a,b, Sutton and Barto, 1998, Abe and Long, 1999). In this model
exploration is more involved since knowledge about a functional dependency has to be
collected.
Using confidence bounds to deal with an exploitation-exploration trade-off is not a new
idea (e.g. Kaelbling, 1994a,b, Agrawal, 1995). What is new in this paper is that we use
confidence bounds in rather complicated situations and that we are still able to prove
rigorous performance bounds. Thus we believe that confidence bounds can be successfully
applied in many such situations with an exploitation-exploration trade-off. Furthermore,
since algorithms which use confidence bounds can be tuned quite easily, we expect that
such algorithms will prove useful in practical applications.
In Section 2 we start off with the random bandit problem. The random bandit problem
is a typical model for the trade-off between exploitation and exploration. Using upper con-
fidence bounds, very simple and almost optimal algorithms for the random bandit problem
have been derived. We briefly review this previous work since it illuminates the main ideas
of using upper confidence bounds. In Section 3 we introduce the adversarial bandit problem
with shifting and compare our new results with the previously known results. In Section 4
we define the model for associative reinforcement learning with linear value functions and
discuss our results for this model.
2. Upper Confidence Bounds for the Random Bandit Problem
The random bandit problem was originally proposed by Robbins (1952). It formalizes an
exploitation-exploration trade-off where in each trial $t = 1, \dots, T$ one out of $K$ possible
alternatives has to be chosen. We denote the choice for trial $t$ by $i(t) \in \{1, \dots, K\}$. For
the chosen alternative a reward $x_{i(t)}(t) \in \mathbb{R}$ is collected, and the rewards for the other
alternatives $x_i(t) \in \mathbb{R}$, $i \in \{1, \dots, K\} \setminus \{i(t)\}$, are not revealed. The goal of an algorithm
for the bandit problem is to maximize its total reward $\sum_{t=1}^{T} x_{i(t)}(t)$. For the random bandit
problem$^1$ it is assumed that in each trial the rewards $x_i(t)$ are drawn independently from
some fixed but unknown distributions $D_1, \dots, D_K$. The expected total reward of a learning
algorithm should be close to the expected total reward given by the best distribution $D_i$.
Thus the regret of a learning algorithm for the random bandit problem is defined as
$$\bar{R}(T) = \max_{i \in \{1, \dots, K\}} E\left[\sum_{t=1}^{T} x_i(t)\right] - E\left[\sum_{t=1}^{T} x_{i(t)}(t)\right].$$
In this model the exploitation-exploration trade-off is reflected on the one hand by the necessity
for trying all alternatives, and on the other hand by the regret suffered when trying an
alternative which is not optimal: too little exploration might make a sub-optimal alternative
look better than the optimal one because of random fluctuations, while too much exploration
prevents the algorithm from playing the optimal alternative often enough, which also results
in a larger regret.

1. The term “bandit problem” (or more precisely “$K$-armed bandit problem”) reflects the problem of a
gambler in a room with various slot machines. In each trial the gambler has to decide which slot
machine he wants to play. To maximize his total gain or reward his (rational) choice will be based on
the previously collected rewards.
Lai and Robbins (1985) have shown that an optimal algorithm achieves $\bar{R}(T) = \Theta(\ln T)$
as $T \to \infty$ when the variances of the distributions $D_i$ are finite.$^2$ Agrawal (1995) has shown
that a quite simple learning algorithm suffices to obtain such performance. This simple
algorithm is based on upper confidence bounds of the form $\hat{\mu}_i(t) + \sigma_i(t)$ for the expected
rewards $\mu_i$ of the distributions $D_i$. Here $\hat{\mu}_i(t)$ is an estimate for the true expected reward
$\mu_i$, and $\sigma_i(t)$ is chosen such that $\hat{\mu}_i(t) - \sigma_i(t) \le \mu_i \le \hat{\mu}_i(t) + \sigma_i(t)$ with high probability.
In each trial $t$ the algorithm selects the alternative with maximal upper confidence bound
$\hat{\mu}_i(t) + \sigma_i(t)$. Thus an alternative $i$ is selected if $\hat{\mu}_i(t)$ is large or if $\sigma_i(t)$ is large. Informally
we may say that a trial is an exploration trial if an alternative with large $\sigma_i(t)$ is chosen,
since in this case the estimate $\hat{\mu}_i(t)$ is rather unreliable. When an alternative with large
$\hat{\mu}_i(t)$ is chosen we may call such a trial an exploitation trial. Since $\sigma_i(t)$ decreases rapidly
with each choice of alternative $i$, the number of exploration trials is limited. If $\sigma_i(t)$ is small
then $\hat{\mu}_i(t)$ is close to $\mu_i$, and an alternative is selected in an exploitation trial only if it is
indeed the optimal alternative with maximal $\mu_i$. Thus the use of upper confidence bounds
automatically trades off between exploitation and exploration.

2. Their result is even more general.
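To make this selection rule concrete, here is a minimal Python sketch of an index policy of this form. It is only an illustration: the width $\sigma_i(t) = \sqrt{2\ln t / n_i(t)}$ (with $n_i(t)$ the number of times alternative $i$ has been chosen so far) is a common textbook choice that we assume here rather than the exact confidence bound of Agrawal (1995), and the function and variable names are ours.

import math
import random

def ucb_bandit(pull, K, T):
    # Play T trials of a K-armed random bandit, always choosing the alternative
    # with the largest upper confidence bound mu_hat_i(t) + sigma_i(t).
    counts = [0] * K      # n_i(t): how often alternative i has been chosen so far
    means = [0.0] * K     # mu_hat_i(t): empirical mean reward of alternative i
    total = 0.0
    for t in range(1, T + 1):
        if t <= K:
            i = t - 1     # play every alternative once to initialize the estimates
        else:
            ucb = [means[j] + math.sqrt(2.0 * math.log(t) / counts[j]) for j in range(K)]
            i = max(range(K), key=lambda j: ucb[j])
        x = pull(i)       # only the reward of the chosen alternative is revealed
        counts[i] += 1
        means[i] += (x - means[i]) / counts[i]
        total += x
    return total

if __name__ == "__main__":
    probs = [0.2, 0.5, 0.7]   # unknown Bernoulli means of D_1, D_2, D_3 (illustrative only)
    print(ucb_bandit(lambda i: float(random.random() < probs[i]), K=3, T=10000))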
3. The Adversarial Bandit Problem with Shifts
The adversarial bandit problem was first analyzed by Auer et al. (1995). In contrast to the
random bandit problem the adversarial bandit problem makes no statistical assumptions on
how the rewards $x_i(t)$ are generated. Thus, the rewards might be generated in an adversarial
way to make life hard for an algorithm in this model. Since the rewards are not random
any more, the regret of an algorithm for the adversarial bandit problem is defined as
$$R(T) = \max_{i \in \{1, \dots, K\}} \sum_{t=1}^{T} x_i(t) - \sum_{t=1}^{T} x_{i(t)}(t) \qquad (1)$$
where $R(T)$ might be a random variable depending on a possible randomization of the
algorithm. In our paper (Auer et al., 1995) we have derived a randomized algorithm which
achieves $E[R(T)] = O\bigl(T^{2/3}\bigr)$ for bounded rewards. In a subsequent paper (Auer et al.,
1998) this algorithm has been improved to yield $E[R(T)] = O\bigl(T^{1/2}\bigr)$. In the same paper
we have also shown that a variant of the algorithm satisfies $R(T) = O\bigl(T^{2/3}(\ln T)^{1/3}\bigr)$ with
high probability. This bound was improved again (Auer et al., 2000), as we can show that
$R(T) = O\bigl(T^{1/2}(\ln T)^{1/2}\bigr)$ with high probability. This is almost optimal since a lower bound
$E[R(T)] = \Omega\bigl(T^{1/2}\bigr)$ has already been shown (Auer et al., 1995). This lower bound holds
even if the rewards are generated at random as for the random bandit problem. The reason
is that for suitable distributions $D_i$ we have
$$E\left[\max_{i \in \{1, \dots, K\}} \sum_{t=1}^{T} x_i(t)\right] = \max_{i \in \{1, \dots, K\}} E\left[\sum_{t=1}^{T} x_i(t)\right] + \Omega\bigl(\sqrt{T}\bigr)$$
and thus $E[R(T)] = \bar{R}(T) + \Omega\bigl(\sqrt{T}\bigr)$.
In the current paper we consider an extension of the adversarial bandit problem where
the bandits are allowed to “shift”: the algorithm has to keep track of the alternative which gives
the highest reward even if this best alternative changes over time. Formally we compare the
total reward collected by the algorithm with the total reward of the best schedule with $S$
segments $\{1, \dots, t_1\}, \{t_1 + 1, \dots, t_2\}, \dots, \{t_{S-1} + 1, \dots, T\}$. The regret of the algorithm
with respect to the best schedule with $S$ segments is defined as
$$R_S(T) = \max_{0 = t_0 < t_1 < \cdots < t_S = T} \sum_{s=1}^{S} \max_{i \in \{1, \dots, K\}} \sum_{t = t_{s-1}+1}^{t_s} x_i(t) - \sum_{t=1}^{T} x_{i(t)}(t).$$
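For concreteness, the benchmark in the first term of $R_S(T)$, that is, the best total reward over all schedules with $S$ segments, can be computed offline by a small dynamic program over the segment boundaries. The following Python sketch is our own illustration; it assumes full knowledge of all rewards, which the learner of course never has.

def best_shifting_benchmark(x, S):
    # x[t][i] is the reward x_i(t+1) of alternative i at trial t+1 (0-based lists).
    # Returns the maximum over 0 = t_0 < t_1 < ... < t_S = T of
    # sum_s max_i sum_{t in segment s} x_i(t).
    T, K = len(x), len(x[0])
    NEG = float("-inf")
    # best[s][t]: best reward of a schedule with s segments covering the first t trials
    best = [[NEG] * (T + 1) for _ in range(S + 1)]
    best[0][0] = 0.0
    for s in range(1, S + 1):
        for t in range(1, T + 1):
            seg = [0.0] * K               # seg[i] = reward of i summed over trials u+1, ..., t
            for u in range(t - 1, -1, -1):
                for i in range(K):
                    seg[i] += x[u][i]
                if best[s - 1][u] > NEG:  # last segment consists of trials u+1, ..., t
                    best[s][t] = max(best[s][t], best[s - 1][u] + max(seg))
    return best[S][T]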
We will show that $R_S(T) = O\bigl(\sqrt{TS\ln(T)}\bigr)$ with high probability. This is essentially
optimal since any algorithm which has to solve $S$ independent adversarial bandit problems
of length $T/S$ will suffer $\Omega\bigl(\sqrt{TS}\bigr)$ regret. For a different algorithm Auer et al. (2000)
show the bound $E[R_S(T)] = O\bigl(\sqrt{TS\ln(T)}\bigr)$, but for that algorithm the variance of the
regret is so large that no interesting bound on the regret can be given that holds with high
probability.
3.1 The Algorithm for the Adversarial Bandit Problem with Shifting
Our algorithm ShiftBand (Figure 1) for the adversarial bandit problem with shifting
combines several approaches which have been found useful in the context of the bandit
problem or in the context of shifting targets. One of the main ingredients is that the
algorithm calculates estimates for the rewards of all the alternatives. For a single trial these
estimates are given by $\hat{x}_i(t)$, since the expectation of such an estimate equals the true
reward $x_i(t)$.
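To spell out this (implicit) step: with $\hat{x}_i(t)$ as defined in Figure 1, alternative $i$ is chosen with probability $p_i(t)$, in which case $\hat{x}_i(t) = x_i(t)/p_i(t)$, and otherwise $\hat{x}_i(t) = 0$. Hence, conditioned on the previous trials and taken over the algorithm's internal randomization,
$$E\bigl[\hat{x}_i(t)\bigr] = p_i(t)\cdot\frac{x_i(t)}{p_i(t)} + \bigl(1 - p_i(t)\bigr)\cdot 0 = x_i(t).$$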
Another ingredient is the exponential weighting scheme which for each alternative calculates
a weight $w_i(t)$ from an estimate of the cumulative rewards so far. Such exponential
weighting schemes have been used for the analysis of the adversarial bandit problem by Auer
et al. (1995, 1998, 2000). In contrast to previous algorithms (Auer et al., 1995, 1998) we
use in this paper an estimate of the cumulative rewards which does not give the correct ex-
pectation but which — as an upper confidence bound — overestimates the true cumulative
rewards. This over-estimation emphasizes exploration over exploitation, which in turn gives
more reliable estimates for the true cumulative rewards. This is the reason why we are able
to give bounds on the regret which hold with high probability. In the algorithm ShiftBand
the upper bound on the cumulative reward is present only implicitly. Intuitively the sum
$$\sum_{\tau=1}^{t} \left( \hat{x}_i(\tau) + \frac{\alpha}{p_i(\tau)\sqrt{TK/S}} \right)$$
can be seen as this upper confidence bound on the cumulative reward $\sum_{\tau=1}^{t} x_i(\tau)$. In the
analysis of the algorithm the relationship between this confidence bound and the cumulative
regret will be made precise. In contrast to the random bandit problem this confidence bound
is not a confidence bound for an external random process$^3$ but a confidence bound for the
effect of the internal randomization of the algorithm.

3. This external random process generates the rewards of the random bandit problem.
Algorithm ShiftBand

Parameters: Reals $\alpha, \beta, \eta > 0$, $\gamma \in (0, 1]$, the number of trials $T$, the number of
segments $S$.

Initialization: $w_i(1) = 1$ for $i = 1, \dots, K$.

Repeat for $t = 1, \dots, T$:

1. Choose $i(t)$ at random according to the probabilities $p_i(t)$, where
$$p_i(t) = (1 - \gamma)\,\frac{w_i(t)}{W(t)} + \frac{\gamma}{K} \quad \text{and} \quad W(t) = \sum_{i=1}^{K} w_i(t).$$

2. Collect reward $x_{i(t)}(t) \in [0, 1]$.

3. For $i = 1, \dots, K$ set
$$\hat{x}_i(t) = \begin{cases} x_i(t)/p_i(t) & \text{if } i = i(t), \\ 0 & \text{otherwise,} \end{cases}$$
and calculate the update of the weights as
$$w_i(t+1) = w_i(t) \cdot \exp\left\{\eta\left(\hat{x}_i(t) + \frac{\alpha}{p_i(t)\sqrt{TK/S}}\right)\right\} + \frac{\beta}{K}\, W(t).$$

Figure 1: Algorithm ShiftBand
The application of confidence bounds to capture the internal randomization of the algorithm
shows that confidence bounds are a quite versatile tool.
The weights reflect the algorithm’s belief in which alternative is best. An alternative
is either chosen at random proportional to its weight, or with probability γ an arbitrary
alternative is chosen for additional exploration.
The final ingredient is a mechanism for dealing with shifting. The basic approach is to
lower bound the weights $w_i(t)$ appropriately. This is achieved by the term $\frac{\beta}{K} W(t)$ in the
calculation of the weights $w_i(t)$. Lower bounding the weights means that it does not take too
long for a small weight to become large again if it corresponds to the new best alternative
after a shift. Similar ideas for dealing with changing targets have been used by Littlestone
and Warmuth (1994), Auer and Warmuth (1998), and Herbster and Warmuth (1998).
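Putting these ingredients together, the following Python sketch transcribes Figure 1 essentially line by line; the function name shiftband and the reward callback get_reward are our own choices for illustration and not part of the paper.

import math
import random

def shiftband(get_reward, K, T, S, alpha, beta, eta, gamma):
    # One run of algorithm ShiftBand as described in Figure 1.
    # get_reward(i, t) must return the reward x_i(t) in [0, 1] of the chosen
    # alternative i at trial t; the rewards of the other alternatives stay hidden.
    w = [1.0] * K                        # w_i(1) = 1
    bonus_scale = math.sqrt(T * K / S)   # the sqrt(TK/S) factor of the confidence bonus
    total_reward = 0.0
    for t in range(1, T + 1):
        W = sum(w)
        # Step 1: mix the normalized weights with uniform exploration of probability gamma.
        p = [(1.0 - gamma) * w[i] / W + gamma / K for i in range(K)]
        i_t = random.choices(range(K), weights=p)[0]
        # Step 2: collect the reward of the chosen alternative only.
        x = get_reward(i_t, t)
        total_reward += x
        # Step 3: importance-weighted estimate, optimistic bonus, and weight update,
        # including the beta/K * W(t) term that lower bounds the weights.
        for i in range(K):
            x_hat = x / p[i] if i == i_t else 0.0
            w[i] = w[i] * math.exp(eta * (x_hat + alpha / (p[i] * bonus_scale))) + (beta / K) * W
    return total_reward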
In the remainder of this section we assume that all rewards $x_i(t) \in [0, 1]$. If the
rewards $x_i(t)$ are not in $[0, 1]$ but bounded, then an appropriate scaling gives results similar
to the theorems below.
Theorem 1 We use the notation of (1) and Figure 1. If $T$, $S$, $K$, and $\delta$ are such that
$T \ge 144\, KS \ln(TK/\delta)$, and algorithm ShiftBand is run with parameters $\alpha = 2\sqrt{\ln(T^3 K/\delta)}$,
$\beta = 1/T$, $\gamma = 2K\eta$, and $\eta = \sqrt{\ln(TK)\, S/(TK)}$, then the regret of the algorithm satisfies
$$R_S(T) \le 11\sqrt{TKS \ln(T^3 K/\delta)} \qquad (2)$$
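For convenience, the parameter settings of Theorem 1 can be computed as follows. This small helper is our own addition; the formulas are copied directly from the theorem statement.

import math

def theorem1_parameters(T, K, S, delta):
    # Parameter choices of Theorem 1; the theorem assumes T >= 144*K*S*ln(T*K/delta).
    alpha = 2.0 * math.sqrt(math.log(T**3 * K / delta))
    beta = 1.0 / T
    eta = math.sqrt(math.log(T * K) * S / (T * K))
    gamma = 2.0 * K * eta
    return alpha, beta, gamma, eta

# Example (these values satisfy the condition of Theorem 1):
# alpha, beta, gamma, eta = theorem1_parameters(T=1000000, K=10, S=5, delta=0.05)
# total = shiftband(get_reward, K=10, T=1000000, S=5,
#                   alpha=alpha, beta=beta, eta=eta, gamma=gamma)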