
Mastering the Game of Go without Human Knowledge

David Silver*, Julian Schrittwieser*, Karen Simonyan*, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, Demis Hassabis.

DeepMind, 5 New Street Square, London EC4A 3TW.

*These authors contributed equally to this work.

A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here, we introduce an algorithm based solely on reinforcement learning, without human data, guidance, or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo's own move selections and also the winner of AlphaGo's games. This neural network improves the strength of tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100-0 against the previously published, champion-defeating AlphaGo.

Much progress towards artificial intelligence has been made using supervised learning systems that are trained to replicate the decisions of human experts [1–4]. However, expert data is often expensive, unreliable, or simply unavailable. Even when reliable data is available, it may impose a ceiling on the performance of systems trained in this manner. In contrast, reinforcement learning systems are trained from their own experience, in principle allowing them to exceed human capabilities, and to operate in domains where human expertise is lacking. Recently, there has been rapid progress towards this goal, using deep neural networks trained by reinforcement learning. These systems have outperformed humans in computer games such as Atari [6, 7] and 3D virtual environments [8–10]. However, the most challenging domains in terms of human intellect, such as the game of Go, widely viewed as a grand challenge for artificial intelligence, require precise and sophisticated lookahead in vast search spaces. Fully general methods have not previously achieved human-level performance in these domains.

AlphaGo was the first program to achieve superhuman performance in Go. The published version, which we refer to as AlphaGo Fan, defeated the European champion Fan Hui in October 2015. AlphaGo Fan utilised two deep neural networks: a policy network that outputs move probabilities, and a value network that outputs a position evaluation. The policy network was trained initially by supervised learning to accurately predict human expert moves, and was subsequently refined by policy-gradient reinforcement learning. The value network was trained to predict the winner of games played by the policy network against itself. Once trained, these networks were combined with a Monte-Carlo Tree Search (MCTS) [13–15] to provide a lookahead search, using the policy network to narrow down the search to high-probability moves, and using the value network (in conjunction with Monte-Carlo rollouts using a fast rollout policy) to evaluate positions in the tree. A subsequent version, which we refer to as AlphaGo Lee, used a similar approach (see Methods), and defeated Lee Sedol, the winner of 18 international titles, in March 2016.

Our program, AlphaGo Zero, differs from AlphaGo Fan and AlphaGo Lee in several important aspects. First and foremost, it is trained solely by self-play reinforcement learning, starting from random play, without any supervision or use of human data. Second, it only uses the black and white stones from the board as input features. Third, it uses a single neural network, rather than separate policy and value networks. Finally, it uses a simpler tree search that relies upon this single neural network to evaluate positions and sample moves, without performing any Monte-Carlo rollouts. To achieve these results, we introduce a new reinforcement learning algorithm that incorporates lookahead search inside the training loop, resulting in rapid improvement and precise and stable learning. Further technical differences in the search algorithm, training procedure and network architecture are described in Methods.

1 Reinforcement Learning in AlphaGo Zero

Our new method uses a deep neural network $f_\theta$ with parameters $\theta$. This neural network takes as an input the raw board representation $s$ of the position and its history, and outputs both move probabilities and a value, $(\mathbf{p}, v) = f_\theta(s)$. The vector of move probabilities $\mathbf{p}$ represents the probability of selecting each move (including pass), $p_a = \Pr(a \mid s)$. The value $v$ is a scalar evaluation, estimating the probability of the current player winning from position $s$. This neural network combines the roles of both policy network and value network into a single architecture. The neural network consists of many residual blocks of convolutional layers [16, 17] with batch normalisation and rectifier non-linearities (see Methods).
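To make the shape of $(\mathbf{p}, v) = f_\theta(s)$ concrete, here is a minimal sketch of a network with one shared input and two output heads. It is not the residual architecture described above: `make_toy_network`, the linear heads, the random initial weights, and the `tanh` squashing of the value (which maps it into (-1, 1)) are all assumptions made for illustration. The only details taken from the text are that the policy covers every move including pass, and that a single network produces both outputs.

```python
import math
import random

BOARD_POINTS = 19 * 19          # intersections on a standard Go board
NUM_MOVES = BOARD_POINTS + 1    # every intersection plus "pass"


def softmax(logits):
    """Turn arbitrary real scores into a probability distribution."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_toy_network(num_features, seed=0):
    """Return a toy f_theta with random weights (hypothetical, linear heads only)."""
    rng = random.Random(seed)
    w_policy = [[rng.gauss(0.0, 0.01) for _ in range(num_features)]
                for _ in range(NUM_MOVES)]
    w_value = [rng.gauss(0.0, 0.01) for _ in range(num_features)]

    def f_theta(features):
        # Policy head: one logit per move, normalised into probabilities p.
        logits = [sum(w * x for w, x in zip(row, features)) for row in w_policy]
        p = softmax(logits)
        # Value head: a single scalar v, squashed into (-1, 1) (an assumption).
        v = math.tanh(sum(w * x for w, x in zip(w_value, features)))
        return p, v

    return f_theta
```

Calling `make_toy_network(num_features=4)` and evaluating the result on a feature vector of length 4 returns a list of 362 move probabilities that sum to one, together with a single scalar value.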

The neural network in AlphaGo Zero is trained from games of self-play by a novel reinforcement learning algorithm. In each position $s$, an MCTS search is executed, guided by the neural network $f_\theta$. The MCTS search outputs probabilities $\boldsymbol{\pi}$ of playing each move. These search probabilities usually select much stronger moves than the raw move probabilities $\mathbf{p}$ of the neural network $f_\theta(s)$; MCTS may therefore be viewed as a powerful policy improvement operator [20, 21]. Self-play with search, using the improved MCTS-based policy to select each move and then using the game winner $z$ as a sample of the value, may be viewed as a powerful policy evaluation operator. The main idea of our reinforcement learning algorithm is to use these search operators repeatedly in a policy iteration procedure [22, 23]: the neural network's parameters are updated to make the move probabilities and value $(\mathbf{p}, v) = f_\theta(s)$ more closely match the improved search probabilities and self-play winner $(\boldsymbol{\pi}, z)$; these new parameters are used in the next iteration of self-play to make the search even stronger. Figure 1 illustrates the self-play training pipeline.
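The update described here pulls $\mathbf{p}$ towards $\boldsymbol{\pi}$ and $v$ towards $z$. One plausible way to score how far a prediction is from its targets is a squared error on the value plus a cross-entropy on the policy. The sketch below assumes exactly that form. The function name `policy_value_loss`, the equal weighting of the two terms, the `eps` guard, and the absence of any regularisation term are choices of this sketch; the precise objective is the paper's Equation 1, which is not reproduced in this section.

```python
import math


def policy_value_loss(p, v, pi, z, eps=1e-12):
    """Squared value error plus cross-entropy between pi (target) and p (prediction).

    p  : predicted move probabilities from the network
    v  : predicted value from the network
    pi : search probabilities from MCTS (training target for p)
    z  : game winner from self-play (training target for v)
    """
    value_error = (z - v) ** 2
    # eps guards against log(0) when the network assigns zero probability.
    cross_entropy = -sum(t * math.log(q + eps) for t, q in zip(pi, p))
    return value_error + cross_entropy
```

The loss shrinks as the network's move probabilities concentrate on the moves the search preferred and as its value prediction approaches the actual game outcome, which is the direction of improvement the paragraph above describes.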

Figure 1: Self-play reinforcement learning in AlphaGo Zero. a The program plays a game $s_1, \ldots, s_T$ against itself. In each position $s_t$, a Monte-Carlo tree search (MCTS) $\alpha_\theta$ is executed (see Figure 2) using the latest neural network $f_\theta$. Moves are selected according to the search probabilities computed by the MCTS, $a_t \sim \boldsymbol{\pi}_t$. The terminal position $s_T$ is scored according to the rules of the game to compute the game winner $z$. b Neural network training in AlphaGo Zero. The neural network takes the raw board position $s_t$ as its input, passes it through many convolutional layers with parameters $\theta$, and outputs both a vector $\mathbf{p}_t$, representing a probability distribution over moves, and a scalar value $v_t$, representing the probability of the current player winning in position $s_t$. The neural network parameters $\theta$ are updated so as to maximise the similarity of the policy vector $\mathbf{p}_t$ to the search probabilities $\boldsymbol{\pi}_t$, and to minimise the error between the predicted winner $v_t$ and the game winner $z$ (see Equation 1). The new parameters are used in the next iteration of self-play a.

The Monte-Carlo tree search uses the neural network $f_\theta$ to guide its simulations (see Figure 2). Each edge $(s, a)$ in the search tree stores a prior probability $P(s, a)$, a visit count $N(s, a)$, and an action-value $Q(s, a)$. Each simulation starts from the root state and iteratively selects moves that maximise an upper confidence bound $Q(s, a) + U(s, a)$, where $U(s, a) \propto P(s, a) / (1 + N(s, a))$ [12, 24], until a leaf node $s'$ is encountered. This leaf position is expanded and evaluated just once by the network to generate both prior probabilities and evaluation, $(P(s', \cdot), V(s')) = f_\theta(s')$. Each edge $(s, a)$ traversed in the simulation is updated to increment its visit count $N(s, a)$, and to update its action-value to the mean evaluation over these simulations, $Q(s, a) = \frac{1}{N(s, a)} \sum_{s' \mid s, a \to s'} V(s')$, where $s, a \to s'$ indicates that a simulation eventually reached $s'$ after taking move $a$ from position $s$.

Figure 2: Monte-Carlo tree search in AlphaGo Zero. a Each simulation traverses the tree by selecting the edge with maximum action-value $Q$, plus an upper confidence bound $U$ that depends on a stored prior probability $P$ and visit count $N$ for that edge (which is incremented once traversed). b The leaf node is expanded and the associated position $s$ is evaluated by the neural network $(P(s, \cdot), V(s)) = f_\theta(s)$; the vector of $P$ values is stored in the outgoing edges from $s$. c Action-values $Q$ are updated to track the mean of all evaluations $V$ in the subtree below that action. d Once the search is complete, search probabilities $\boldsymbol{\pi}$ are returned, proportional to $N^{1/\tau}$, where $N$ is the visit count of each move from the root state and $\tau$ is a parameter controlling temperature.
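The per-edge statistics and the two update rules described above (select by $Q + U$, then back up the leaf evaluation as a running mean) can be sketched as follows. The names `Edge`, `select_move`, `backup`, and the constant `c_puct` are hypothetical. The text only states that $U$ is proportional to $P / (1 + N)$, so the sketch uses a fixed constant of proportionality, treats unvisited edges as having $Q = 0$, and ignores the change of perspective between the two players; all three are simplifying assumptions.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    prior: float              # P(s, a), supplied by the network at expansion
    visits: int = 0           # N(s, a)
    total_value: float = 0.0  # sum of leaf evaluations V(s') reached via this edge

    @property
    def q(self):
        # Q(s, a): mean evaluation over simulations; 0 if unvisited (assumption).
        return self.total_value / self.visits if self.visits else 0.0


def select_move(edges, c_puct=1.0):
    """Return the move maximising Q(s, a) + U(s, a), with U = c_puct * P / (1 + N)."""
    def score(item):
        _, edge = item
        return edge.q + c_puct * edge.prior / (1 + edge.visits)
    return max(edges.items(), key=score)[0]


def backup(path, leaf_value):
    """Update every edge traversed in one simulation with the leaf evaluation."""
    for edge in path:
        edge.visits += 1
        edge.total_value += leaf_value
```

With two untried moves the higher prior wins the selection; once a move has been tried and evaluated poorly, its lower $Q$ and shrunken $U$ let the other move take over, which is the exploration behaviour the bound is meant to produce.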

MCTS may be viewed as a self-play algorithm that, given neural network parameters $\theta$ and a root position $s$, computes a vector of search probabilities recommending moves to play, $\boldsymbol{\pi} = \alpha_\theta(s)$, proportional to the exponentiated visit count for each move, $\pi_a \propto N(s, a)^{1/\tau}$, where $\tau$ is a temperature parameter.
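As a small worked example of $\pi_a \propto N(s, a)^{1/\tau}$, the sketch below converts root visit counts into search probabilities. The function name `search_probabilities` and the dictionary representation of visit counts are assumptions made for illustration.

```python
def search_probabilities(visit_counts, tau=1.0):
    """Return pi with pi_a proportional to N(s, a) ** (1 / tau)."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    weights = {a: n ** (1.0 / tau) for a, n in visit_counts.items()}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one move must have been visited")
    return {a: w / total for a, w in weights.items()}
```

With visit counts of 30 and 10, a temperature of 1 gives probabilities of 0.75 and 0.25, which is simply the share of visits. Lowering the temperature to 0.5 squares the counts before normalising, giving 0.9 and 0.1, so smaller values of $\tau$ concentrate play on the most-visited move.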

The neural network is trained by a self-play reinforcement learning algorithm that uses MCTS to play each move. First, the neural network is initialised to random weights $\theta_0$. At each subsequent iteration $i \geq 1$, games of self-play are generated (Figure 1a). At each time-step $t$, an MCTS search $\boldsymbol{\pi}_t = \alpha_{\theta_{i-1}}(s_t)$ is executed using the previous iteration of neural network $f_{\theta_{i-1}}$, and a move is played by sampling the search probabilities $\boldsymbol{\pi}_t$. A game terminates at step $T$ when
