Reinforcement Learning (RL): A Happy Union of AI and Decision/Control Ideas
[Figure: two complementary strands of ideas that came together in the late 80s-early 90s.
Decision/Control/DP: Principle of Optimality, Markov Decision Problems, POMDP, Policy Iteration, Value Iteration.
AI/RL: Learning through Experience, Simulation, Model-Free Methods, Feature-Based Representations, A*/Games/Heuristics.]
Historical highlights
Exact DP, optimal control (Bellman, Shannon, 1950s ...)
First major successes: Backgammon programs (Tesauro, 1992, 1996)
Algorithmic progress, analysis, applications, first books (mid 90s ...)
Machine Learning, BIG Data, Robotics, Deep Neural Networks (mid 2000s ...)
AlphaGo and AlphaZero (DeepMind, 2016, 2017)
AlphaZero was Trained Using Self-Generated Data
[Figure: the AlphaZero architecture. At the current state x_k, the move is selected by an ℓ-step lookahead minimization (guided by Monte Carlo tree search, MCTS), with a cost-to-go approximation J̃ at the end of the lookahead:

min_{u_k, µ_{k+1}, ..., µ_{k+ℓ-1}} E{ g_k(x_k, u_k, w_k) + Σ_{m=k+1}^{k+ℓ-1} g_m(x_m, µ_m(x_m), w_m) + J̃_{k+ℓ}(x_{k+ℓ}) }

A deep neural network maps position features F(i) to a position “value” and move “probabilities”; the approximate cost J̃_µ(F(i), r) ≈ J_µ(i) is a feature-based parametric architecture with weight vector r. Training proceeds by self-learning/policy iteration: the current policy µ is evaluated approximately, and an “improved” policy µ̂ is generated by the lookahead minimization. AlphaZero (Google DeepMind) was learned from scratch with 4 hours of training, plays much better than all computer programs, and plays differently; the same algorithm learned multiple games (Go, Shogi).]
The “current” player plays games that are used to “train” an “improved” player
At a given position, the “move probabilities” and the “value” of the position are approximated by a deep neural net (NN)
Successive NNs are trained using self-generated data and a form of regression
A form of randomized policy improvement, Monte Carlo Tree Search (MCTS), generates the move probabilities (a schematic sketch of the full self-play loop follows this list)
AlphaZero bears similarity to earlier works, e.g., TD-Gammon (Tesauro, 1992), but
is more complicated because of the MCTS and the deep NN
The success of AlphaZero is due to a skillful implementation/integration of known
ideas, and awesome computational power
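To make the loop above concrete, here is a minimal Python sketch of the self-play/policy-iteration cycle: the “current” player generates games, a value/policy network is fit to the self-generated targets by regression, and the “improved” player combines the network with MCTS-style move probabilities. This is not DeepMind's code; the game, the network class, and the MCTS stand-in (CountingGame, ValuePolicyNet, mcts_move_probabilities) are hypothetical names invented purely for illustration.

import random

class ValuePolicyNet:
    """Tabular stand-in for the deep NN giving a position value and move probabilities."""
    def __init__(self):
        self.value = {}    # position -> estimated value
        self.policy = {}   # position -> {move: probability}

    def predict(self, position):
        return self.value.get(position, 0.0), self.policy.get(position)

    def fit(self, examples):
        # "A form of regression": here just a running average toward the self-generated targets.
        for position, target_value, target_policy in examples:
            self.value[position] = 0.5 * (self.value.get(position, 0.0) + target_value)
            self.policy[position] = target_policy

def mcts_move_probabilities(net, position, legal_moves):
    # Stand-in for MCTS-based randomized policy improvement; a real implementation
    # would run net-guided simulations from this position.
    return {m: 1.0 / len(legal_moves) for m in legal_moves}

class CountingGame:
    """Trivial toy game so the sketch runs end to end: add 1 or 2 until reaching 5."""
    def initial(self): return 0
    def terminal(self, p): return p >= 5
    def legal_moves(self, p): return [1, 2]
    def next(self, p, m): return p + m
    def outcome(self, p): return 1.0   # value assigned to the terminal position

def self_play_game(net, game):
    """The current player plays one game; returns (position, outcome, move probabilities) triples."""
    position, history = game.initial(), []
    while not game.terminal(position):
        probs = mcts_move_probabilities(net, position, game.legal_moves(position))
        history.append((position, probs))
        move = random.choices(list(probs), weights=list(probs.values()))[0]
        position = game.next(position, move)
    z = game.outcome(position)
    return [(pos, z, pi) for pos, pi in history]

def train(iterations=3, games_per_iteration=20):
    net, game = ValuePolicyNet(), CountingGame()
    for _ in range(iterations):
        data = []                                   # self-generated data
        for _ in range(games_per_iteration):
            data.extend(self_play_game(net, game))
        net.fit(data)                               # successive "networks" trained by regression
    return net

print(train().predict(0))

In the real system the tabular stand-ins are replaced by a deep net trained by gradient-based regression and by net-guided MCTS move probabilities; the structure of the loop is the same.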
Approximate DP/RL Methodology is now Ambitious and Universal
Exact DP applies (in principle) to a very broad range of optimization problems
Deterministic ↔ Stochastic
Combinatorial optimization ↔ Optimal control w/ infinite state/control spaces
One decision maker ↔ Two-player games
... BUT it is plagued by the curse of dimensionality and the need for a mathematical model
Approximate DP/RL overcomes the difficulties of exact DP by:
Approximation (use neural nets and other architectures to reduce dimension)
Simulation (use a computer model in place of a math model; a toy sketch combining these two ideas is given below)
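As a concrete, deliberately toy illustration of these two ingredients, the sketch below fits a linear feature-based cost approximation J̃(x, r) = φ(x)′r by least-squares regression on costs sampled from a simulator, and then uses it inside a one-step lookahead. The scalar system, the feature map phi, and the base policy are assumptions made up for this example, not something prescribed in these slides.

import numpy as np

rng = np.random.default_rng(0)

def simulate_cost_to_go(x, policy, step, stage_cost, horizon=20):
    """Monte Carlo: roll the simulator forward under a fixed policy and sum stage costs."""
    total = 0.0
    for _ in range(horizon):
        u = policy(x)
        total += stage_cost(x, u)
        x = step(x, u)
    return total

def phi(x):
    """Hand-crafted features of the state (an assumed choice for this toy example)."""
    return np.array([1.0, x, x * x])

# A toy scalar system standing in for "a computer model in place of a math model".
step = lambda x, u: 0.9 * x + u + 0.1 * rng.standard_normal()
stage_cost = lambda x, u: x * x + u * u
base_policy = lambda x: -0.5 * x

# Approximation: fit the weight vector r by least-squares regression on simulated costs.
states = rng.uniform(-2.0, 2.0, size=200)
targets = np.array([simulate_cost_to_go(x, base_policy, step, stage_cost) for x in states])
Phi = np.vstack([phi(x) for x in states])
r, *_ = np.linalg.lstsq(Phi, targets, rcond=None)

def lookahead_policy(x, controls=np.linspace(-1.0, 1.0, 21)):
    """One-step lookahead using the fitted approximation in place of the exact cost-to-go."""
    scores = [stage_cost(x, u) + phi(step(x, u)) @ r for u in controls]
    return controls[int(np.argmin(scores))]

print("fitted weights r:", r, " control at x=1.5:", lookahead_policy(1.5))

Replacing the linear architecture with a neural network, and the one-step lookahead with multistep lookahead or rollout, leads to the more elaborate AlphaZero-style schemes discussed above.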
State of the art:
Broadly applicable methodology: can address a broad range of challenging
problems (deterministic/stochastic/dynamic, discrete/continuous, games, etc.)
There are no methods that are guaranteed to work for all or even most problems
There are enough methods to try with a reasonable chance of success for most
types of optimization problems
Role of the theory: Guide the art, delineate the sound ideas