David Silver's Reinforcement Learning course (10 lectures).

Lecture 4: Model-Free Prediction

Introduction: Model-Free Reinforcement Learning
■ Last lecture: planning by dynamic programming (solve a known MDP)
■ This lecture: model-free prediction (estimate the value function of an unknown MDP)
■ Next lecture: model-free control (optimise the value function of an unknown MDP)

Monte-Carlo Reinforcement Learning
■ MC methods learn directly from episodes of experience
■ MC is model-free: no knowledge of MDP transitions/rewards
■ MC learns from complete episodes: no bootstrapping
■ MC uses the simplest possible idea: value = mean return
■ Caveat: can only apply MC to episodic MDPs (all episodes must terminate)

Monte-Carlo Policy Evaluation
■ Goal: learn v_π from episodes of experience under policy π:
    S_1, A_1, R_2, ..., S_k ~ π
■ Recall that the return is the total discounted reward:
    G_t = R_{t+1} + γ R_{t+2} + ... + γ^{T-1} R_T
■ Recall that the value function is the expected return:
    v_π(s) = E_π[G_t | S_t = s]
■ Monte-Carlo policy evaluation uses the empirical mean return instead of the expected return

First-Visit Monte-Carlo Policy Evaluation
To evaluate state s:
■ The first time-step t that state s is visited in an episode,
■ increment counter N(s) ← N(s) + 1
■ increment total return S(s) ← S(s) + G_t
■ Value is estimated by mean return: V(s) = S(s) / N(s)
■ By the law of large numbers, V(s) → v_π(s) as N(s) → ∞

Every-Visit Monte-Carlo Policy Evaluation
To evaluate state s:
■ Every time-step t that state s is visited in an episode,
■ increment counter N(s) ← N(s) + 1
■ increment total return S(s) ← S(s) + G_t
■ Value is estimated by mean return: V(s) = S(s) / N(s)
■ Again, V(s) → v_π(s) as N(s) → ∞

Blackjack Example
■ States (200 of them):
    current sum (12-21)
    dealer's showing card (ace-10)
    do I have a "useable" ace? (yes-no)
■ Action stick: stop receiving cards (and terminate)
■ Action twist: take another card (no replacement)
■ Reward for stick:
    +1 if sum of cards > sum of dealer cards
    0 if sum of cards = sum of dealer cards
    -1 if sum of cards < sum of dealer cards
■ Reward for twist:
    -1 if sum of cards > 21 (and terminate)
    0 otherwise
■ Transitions: automatically twist if sum of cards < 12

Blackjack Value Function after Monte-Carlo Learning
[Figure: value function after 10,000 and 500,000 episodes, with usable-ace and no-usable-ace panels]
■ Policy: stick if sum of cards ≥ 20, otherwise twist

Incremental Monte-Carlo: Incremental Mean
The mean μ_1, μ_2, ... of a sequence x_1, x_2, ... can be computed incrementally:
    μ_k = (1/k) Σ_{j=1}^{k} x_j
        = (1/k) (x_k + (k-1) μ_{k-1})
        = μ_{k-1} + (1/k) (x_k - μ_{k-1})
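The first-visit and every-visit update rules above can be sketched in Python. This is a minimal illustration, not the lecture's code: `generate_episode` is a hypothetical callable that returns one terminating episode as a list of (S_t, R_{t+1}) pairs under the policy being evaluated.

```python
from collections import defaultdict

def mc_policy_evaluation(generate_episode, num_episodes, gamma=1.0, first_visit=True):
    """Monte-Carlo policy evaluation (first-visit or every-visit).

    generate_episode() must return one terminating episode as a list of
    (state, reward) pairs, where reward is R_{t+1} received after state S_t.
    """
    N = defaultdict(int)    # visit counts N(s)
    S = defaultdict(float)  # total returns S(s)
    V = defaultdict(float)  # value estimates V(s) = S(s) / N(s)
    for _ in range(num_episodes):
        episode = generate_episode()
        # Compute the return G_t = R_{t+1} + gamma * G_{t+1} by a backward pass.
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()
        seen = set()
        for state, G in returns:
            if first_visit and state in seen:
                continue  # first-visit MC: only the first occurrence of s counts
            seen.add(state)
            N[state] += 1
            S[state] += G
            V[state] = S[state] / N[state]
    return V
```

On an episode that visits a state twice, the two variants differ: first-visit averages only the return from the first occurrence, every-visit averages the returns from all occurrences of that state.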
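The Blackjack setup described above can be simulated with a short sketch. Two assumptions go beyond the slides: an infinite deck (a common simplification of drawing without replacement) and a dealer who draws until reaching 17 or more; `blackjack_episode` and `stick_on_20` are illustrative names, not from the lecture.

```python
import random

def draw_card():
    # Infinite-deck assumption: ace = 1, face cards count as 10.
    return min(random.randint(1, 13), 10)

def hand_value(cards):
    """Return (best total, whether an ace is usable as 11)."""
    total = sum(cards)
    usable = 1 in cards and total + 10 <= 21
    return (total + 10, True) if usable else (total, False)

def blackjack_episode(policy):
    """Play one episode; return a list of (state, reward) pairs,
    where state = (player sum, dealer showing card, usable ace)."""
    player = [draw_card(), draw_card()]
    dealer_showing = draw_card()
    episode = []
    while True:
        total, usable = hand_value(player)
        if total < 12:
            player.append(draw_card())  # automatically twist below 12
            continue
        state = (total, dealer_showing, usable)
        if policy(state) == "twist":
            player.append(draw_card())
            if hand_value(player)[0] > 21:
                episode.append((state, -1))  # bust, episode terminates
                return episode
            episode.append((state, 0))
        else:
            # Stick: dealer plays out (assumed: draws to 17+), then compare sums.
            dealer = [dealer_showing, draw_card()]
            while hand_value(dealer)[0] < 17:
                dealer.append(draw_card())
            dealer_total = hand_value(dealer)[0]
            if dealer_total > 21 or total > dealer_total:
                reward = 1
            elif total == dealer_total:
                reward = 0
            else:
                reward = -1
            episode.append((state, reward))
            return episode

# The policy evaluated in the slides: stick on 20 or 21, otherwise twist.
def stick_on_20(state):
    return "stick" if state[0] >= 20 else "twist"
```

Feeding these episodes into a first-visit MC evaluator and averaging over many episodes would reproduce a value surface like the one shown in the slides.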
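The incremental-mean identity above translates directly into code: each new value moves the running mean toward it by a step of size 1/k. This is a small sketch; `incremental_mean` is an illustrative helper, not from the lecture.

```python
def incremental_mean(xs):
    """Running means of xs via mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k."""
    mu = 0.0
    means = []
    for k, x in enumerate(xs, start=1):
        mu = mu + (x - mu) / k  # nudge the old mean toward x_k by 1/k
        means.append(mu)
    return means
```

This form matters for Monte-Carlo learning: replacing 1/k with a constant step size α gives the non-stationary update used later, which forgets old episodes.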

...展开详情