Reinforcement Learning – Policy Optimization

Pieter Abbeel
UC Berkeley / OpenAI / Gradescope

Reinforcement Learning
[Figure: reinforcement learning diagram with action $u_t$. Figure source: Sutton & Barto, 1998]
John Schulman & Pieter Abbeel – OpenAI + UC Berkeley

Policy Optimization
[Figure: diagram showing the policy $\pi_\theta(u|s)$ and action $u_t$. Figure source: Sutton & Barto, 1998]

Policy Optimization
- Consider control policy parameterized by parameter vector $\theta$:

  $$\max_\theta \; \mathbb{E}\left[\sum_{t=0}^{H} R(s_t) \,\middle|\, \pi_\theta\right]$$

- Often stochastic policy class (smooths out the problem):
  $\pi_\theta(u|s)$: probability of action $u$ in state $s$

[Figure: diagram showing the policy $\pi_\theta(u|s)$ and action $u_t$. Figure source: Sutton & Barto, 1998]
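
The objective above can be made concrete with a small numerical sketch. The tabular softmax parameterization, the two-state toy MDP, and every name below (`softmax_policy`, `estimate_objective`, `P`, `R`) are assumptions chosen for illustration and do not come from the slides. The sketch estimates $\mathbb{E}[\sum_{t=0}^{H} R(s_t) \mid \pi_\theta]$ by averaging returns over sampled rollouts.

```python
import numpy as np

def softmax_policy(theta, s):
    """pi_theta(. | s) for a tabular softmax policy (an illustrative assumption).
    theta has shape (n_states, n_actions)."""
    logits = theta[s] - np.max(theta[s])  # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def estimate_objective(theta, P, R, s0, H, n_rollouts, rng):
    """Monte Carlo estimate of E[sum_{t=0}^{H} R(s_t) | pi_theta].
    P[s, u, s'] is the transition probability, R[s] the state reward."""
    n_states, n_actions = theta.shape
    total = 0.0
    for _ in range(n_rollouts):
        s, ret = s0, 0.0
        for t in range(H + 1):
            ret += R[s]                                        # collect R(s_t)
            u = rng.choice(n_actions, p=softmax_policy(theta, s))  # u_t ~ pi_theta(.|s_t)
            s = rng.choice(n_states, p=P[s, u])                # s_{t+1} ~ P(.|s_t, u_t)
        total += ret
    return total / n_rollouts

# Hypothetical toy MDP: action 0 always leads to state 0, action 1 to state 1.
rng = np.random.default_rng(0)
P = np.zeros((2, 2, 2))
P[:, 0, 0] = 1.0
P[:, 1, 1] = 1.0
R = np.array([0.0, 1.0])  # only state 1 is rewarding

theta_uniform = np.zeros((2, 2))                    # both actions equally likely
theta_good = np.array([[0.0, 20.0], [0.0, 20.0]])   # strongly prefers action 1

J_uniform = estimate_objective(theta_uniform, P, R, s0=0, H=4, n_rollouts=2000, rng=rng)
J_good = estimate_objective(theta_good, P, R, s0=0, H=4, n_rollouts=2000, rng=rng)
```

In this toy setting, policy optimization amounts to moving $\theta$ from something like `theta_uniform` toward something like `theta_good`, since the latter yields the larger estimated objective. Because the policy is stochastic, the objective varies smoothly with $\theta$, which is the sense in which a stochastic policy class smooths out the problem.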
Why Policy Optimization
- Often $\pi$ can be simpler than $Q$ or $V$
  - E.g., robotic grasp
- $V$: doesn't prescribe actions
  - Would need dynamics model (+ compute 1 Bellman back-up)
- $Q$: need to be able to efficiently solve $\arg\max_u Q_\theta(s, u)$
  - Challenge for continuous / high-dimensional action spaces*

* Some recent work (partially) addressing this:
  NAF: Gu, Lillicrap, Sutskever, Levine, ICML 2016
  Input Convex NNs: Amos, Xu, Kolter, arXiv 2016
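
The three bullets above can be contrasted in a minimal sketch of how an action would be chosen from $V$, from $Q$, and from $\pi$. All function names, the toy MDP, and the grid-search treatment of continuous actions are assumptions made for illustration. They are not taken from the slides and are not the methods of the cited papers.

```python
import numpy as np

def act_from_V(V, P, R, s):
    """V alone does not prescribe an action: the dynamics model P[s, u, s']
    and one Bellman back-up are needed to score each action."""
    q = R[s] + P[s] @ V          # q[u] = R(s) + sum_s' P(s'|s,u) V(s')
    return int(np.argmax(q))

def act_from_Q(Q, s):
    """With a small discrete action set, arg max_u Q(s, u) is plain enumeration."""
    return int(np.argmax(Q[s]))

def act_from_Q_continuous(q_fn, s, candidates):
    """For continuous actions, arg max_u Q(s, u) is itself an optimization problem.
    Searching a candidate grid is one crude option; the grid size grows
    exponentially with the action dimension."""
    values = np.array([q_fn(s, u) for u in candidates])
    return candidates[int(np.argmax(values))]

def act_from_policy(theta, s, rng):
    """A policy outputs an action directly: sample u ~ pi_theta(.|s).
    A tabular softmax is assumed here."""
    logits = theta[s] - np.max(theta[s])
    p = np.exp(logits)
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

# Hypothetical toy MDP: action 0 leads to state 0, action 1 leads to state 1.
P = np.zeros((2, 2, 2))
P[:, 0, 0] = 1.0
P[:, 1, 1] = 1.0
R = np.array([0.0, 1.0])
V = np.array([0.0, 1.0])          # illustrative value estimates
Q = R[:, None] + P @ V            # Q[s, u] from one back-up of V

u_from_V = act_from_V(V, P, R, s=0)
u_from_Q = act_from_Q(Q, s=0)
u_star = act_from_Q_continuous(lambda s, u: -(u - 0.3) ** 2, 0, np.linspace(-1.0, 1.0, 201))
u_from_pi = act_from_policy(np.zeros((2, 2)), 0, np.random.default_rng(0))
```

Only `act_from_policy` avoids both the dynamics model and the inner maximization. That is the practical argument for optimizing $\pi_\theta$ directly when the action space is continuous or high-dimensional.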