Mastering the game of Go without human knowledge
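This is the complete English-language paper. Translated abstract: A long-standing goal of artificial intelligence is an algorithm that lets a machine learn, in challenging specialist domains, from an infant-like state (no experience or prior knowledge) to a superhuman level. Recently, AlphaGo became the first program to defeat a human world champion at Go. The tree search in AlphaGo uses deep neural networks to evaluate positions and select moves, and can also carry out reinforcement learning through self-play. This paper (nature24270) introduces an algorithm based purely on reinforcement learning, with no human data, guidance, or domain knowledge beyond the rules of the game. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo's own move selections and the winner of its games. This network strengthens the tree search, yields better move selection, and provides stronger self-play for the next iteration. Starting from "infancy", our new program AlphaGo Zero showed superhuman ability, with a record of 100 wins to 0 losses against the earlier, champion-defeating version of AlphaGo.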
1 Reinforcement Learning in AlphaGo Zero

Our new method uses a deep neural network $f_\theta$ with parameters $\theta$. This neural network takes as an input the raw board representation $s$ of the position and its history, and outputs both move probabilities and a value, $(p, v) = f_\theta(s)$. The vector of move probabilities $p$ represents the probability of selecting each move (including pass), $p_a = \Pr(a \mid s)$. The value $v$ is a scalar evaluation, estimating the probability of the current player winning from position $s$. This neural network combines the roles of both policy network and value network [12] into a single architecture. The neural network consists of many residual blocks of convolutional layers [16, 17] with batch normalisation and rectifier non-linearities (see Methods).

The neural network in AlphaGo Zero is trained from games of self-play by a novel reinforcement learning algorithm. In each position $s$, an MCTS search is executed, guided by the neural network $f_\theta$. The MCTS search outputs probabilities $\pi$ of playing each move. These search probabilities usually select much stronger moves than the raw move probabilities $p$ of the neural network $f_\theta(s)$; MCTS may therefore be viewed as a powerful policy improvement operator [20, 21]. Self-play with search, using the improved MCTS-based policy to select each move and then using the game winner $z$ as a sample of the value, may be viewed as a powerful policy evaluation operator. The main idea of our reinforcement learning algorithm is to use these search operators repeatedly in a policy iteration procedure [22, 23]: the neural network's parameters are updated to make the move probabilities and value $(p, v) = f_\theta(s)$ more closely match the improved search probabilities and self-play winner $(\pi, z)$; these new parameters are used in the next iteration of self-play to make the search even stronger. Figure 1 illustrates the self-play training pipeline.

The Monte-Carlo tree search uses the neural network $f_\theta$ to guide its simulations (see Figure 2).
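To make the interface $(p, v) = f_\theta(s)$ concrete, here is a minimal sketch in plain Python. It is not the residual network of the paper: `ToyDualNet`, its single linear layer per head, the flat `features` vector standing in for the board representation, and the `tanh` on the value head are all assumptions made for illustration. The `tanh` is chosen only so that $v$ lies on the same scale as a game outcome $z \in \{-1, +1\}$.

```python
import math
import random

BOARD_POINTS = 19 * 19          # intersections on a 19x19 board
NUM_MOVES = BOARD_POINTS + 1    # every intersection plus "pass"


def softmax(logits):
    """Turn arbitrary scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class ToyDualNet:
    """Hypothetical stand-in for f_theta: two heads reading the same input."""

    def __init__(self, num_features, seed=0):
        rng = random.Random(seed)
        # One weight row per move for the policy head (assumed shape).
        self.policy_w = [[rng.gauss(0.0, 0.01) for _ in range(num_features)]
                         for _ in range(NUM_MOVES)]
        # One weight vector for the scalar value head.
        self.value_w = [rng.gauss(0.0, 0.01) for _ in range(num_features)]

    def __call__(self, features):
        logits = [sum(w * x for w, x in zip(row, features))
                  for row in self.policy_w]
        p = softmax(logits)                                   # move probabilities
        v = math.tanh(sum(w * x for w, x in zip(self.value_w, features)))
        return p, v


# Usage with made-up input features.
rng = random.Random(1)
features = [rng.random() for _ in range(8)]
net = ToyDualNet(num_features=8)
p, v = net(features)
```

The point of the sketch is only the shape of the output: one probability per move, including pass, and one scalar value, both computed from the same input.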
Each edge $(s, a)$ in the search tree stores a prior probability $P(s, a)$, a visit count $N(s, a)$, and an action-value $Q(s, a)$. Each simulation starts from the root state and iteratively selects moves that maximise an upper confidence bound $Q(s, a) + U(s, a)$, where $U(s, a) \propto P(s, a) / (1 + N(s, a))$ [12, 24], until a leaf node $s'$ is encountered. This leaf position is expanded and evaluated just once by the network to generate both prior probabilities and evaluation, $(P(s', \cdot), V(s')) = f_\theta(s')$. Each edge $(s, a)$ traversed in the simulation is updated to increment its visit count $N(s, a)$, and to update its action-value to the mean evaluation over these simulations, $Q(s, a) = \frac{1}{N(s, a)} \sum_{s' \mid s, a \to s'} V(s')$, where $s, a \to s'$ indicates that a simulation eventually reached $s'$ after taking move $a$ from position $s$.

MCTS may be viewed as a self-play algorithm that, given neural network parameters $\theta$ and a root position $s$, computes a vector of search probabilities recommending moves to play, $\pi = \alpha_\theta(s)$, proportional to the exponentiated visit count for each move, $\pi_a \propto N(s, a)^{1/\tau}$, where $\tau$ is a temperature parameter.

Figure 1: Self-play reinforcement learning in AlphaGo Zero. a Self-play. The program plays a game $s_1, \ldots, s_T$ against itself. In each position $s_t$, a Monte-Carlo tree search (MCTS) $\alpha_\theta$ is executed (see Figure 2) using the latest neural network $f_\theta$. Moves are selected according to the search probabilities computed by the MCTS, $a_t \sim \pi_t$. The terminal position $s_T$ is scored according to the rules of the game to compute the game winner $z$. b Neural network training in AlphaGo Zero. The neural network takes the raw board position $s_t$ as its input, passes it through many convolutional layers with parameters $\theta$, and outputs both a vector $p_t$, representing a probability distribution over moves, and a scalar value $v_t$, representing the probability of the current player winning in position $s_t$. The neural network parameters $\theta$ are updated so as to maximise the similarity of the policy vector $p_t$ to the search probabilities $\pi_t$, and to minimise the error between the predicted winner $v_t$ and the game winner $z$ (see Equation 1). The new parameters are used in the next iteration of self-play.

Figure 2: Monte-Carlo tree search in AlphaGo Zero. a Select. Each simulation traverses the tree by selecting the edge with maximum action-value $Q$, plus an upper confidence bound $U$ that depends on a stored prior probability $P$ and visit count $N$ for that edge (which is incremented once traversed). b Expand and evaluate. The leaf node is expanded and the associated position $s$ is evaluated by the neural network, $(P(s, \cdot), V(s)) = f_\theta(s)$; the vector of $P$ values is stored in the outgoing edges from $s$. c Backup. Action-values $Q$ are updated to track the mean of all evaluations $V$ in the subtree below that action. d Play. Once the search is complete, search probabilities $\pi$ are returned, proportional to $N^{1/\tau}$, where $N$ is the visit count of each move from the root state and $\tau$ is a parameter controlling temperature.

The neural network is trained by a self-play reinforcement learning algorithm that uses MCTS to play each move. First, the neural network is initialised to random weights $\theta_0$. At each subsequent iteration $i \ge 1$, games of self-play are generated (Figure 1a). At each time-step $t$, an MCTS search $\pi_t = \alpha_{\theta_{i-1}}(s_t)$ is executed using the previous iteration of neural network $f_{\theta_{i-1}}$, and a move is played by sampling the search probabilities $\pi_t$. A game terminates at step $T$ when both players pass, when the search value drops below a resignation threshold, or when the game exceeds a maximum length; the game is then scored to give a final reward of $r_T \in \{-1, +1\}$ (see Methods for details). The data for each time-step $t$ is stored as $(s_t, \pi_t, z_t)$, where $z_t = \pm r_T$ is the game winner from the perspective of the current player at step $t$.
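The three search operations described above (selection by $Q + U$, backup of the mean evaluation, and search probabilities proportional to $N^{1/\tau}$) can be sketched as follows. This is a simplified illustration, not the paper's implementation: it covers only the edges leaving the root, uses fixed made-up leaf evaluations in place of a network, and ignores the change of perspective between the two players. The text gives $U$ only up to proportionality, so the constant `c_puct` is an assumed placeholder, as are the names `Edge`, `select_action`, `backup` and `search_probabilities`.

```python
class Edge:
    """Statistics stored on one edge (s, a) of the search tree."""

    def __init__(self, prior):
        self.P = prior   # prior probability P(s, a)
        self.N = 0       # visit count N(s, a)
        self.W = 0.0     # running sum of evaluations below this edge

    @property
    def Q(self):
        # Mean evaluation over the simulations through this edge;
        # an unvisited edge is given Q = 0 (an assumption of this sketch).
        return self.W / self.N if self.N else 0.0


def select_action(edges, c_puct=1.0):
    """Pick the action maximising Q + U, with U = c_puct * P / (1 + N)."""
    return max(edges,
               key=lambda a: edges[a].Q + c_puct * edges[a].P / (1 + edges[a].N))


def backup(path, value):
    """Increment the visit count and update the mean on every traversed edge."""
    for edge in path:
        edge.N += 1
        edge.W += value


def search_probabilities(edges, tau=1.0):
    """Return pi with pi_a proportional to N(s, a) ** (1 / tau)."""
    weights = {a: edge.N ** (1.0 / tau) for a, edge in edges.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


# Demo: a root with three moves and hypothetical, fixed leaf evaluations.
priors = {"a": 0.5, "b": 0.3, "c": 0.2}
leaf_value = {"a": 0.1, "b": 0.6, "c": -0.4}
root = {a: Edge(prior) for a, prior in priors.items()}
for _ in range(200):
    action = select_action(root)
    backup([root[action]], leaf_value[action])
pi = search_probabilities(root, tau=1.0)
```

In this toy run the prior initially steers the search towards move `a`, but once move `b` has been evaluated its higher mean value dominates and it collects almost all visits. Lowering `tau` sharpens `pi` further towards the most visited move.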
In parallel (Figure 1b), new network parameters $\theta_i$ are trained from data $(s, \pi, z)$ sampled uniformly among all time-steps of the last iteration(s) of self-play. The neural network $(p, v) = f_{\theta_i}(s)$ is adjusted to minimise the error between the predicted value $v$ and the self-play winner $z$, and to maximise the similarity of the neural network move probabilities $p$ to the search probabilities $\pi$. Specifically, the parameters $\theta$ are adjusted by gradient descent on a loss function $l$ that sums over mean-squared error and cross-entropy losses respectively,

$(p, v) = f_\theta(s), \quad l = (z - v)^2 - \pi^\top \log p + c \lVert \theta \rVert^2 \qquad (1)$

where $c$ is a parameter controlling the level of L2 weight regularisation (to prevent overfitting).

2 Empirical Analysis of AlphaGo Zero Training

We applied our reinforcement learning pipeline to train our program AlphaGo Zero. Training started from completely random behaviour and continued without human intervention for approximately 3 days.

Over the course of training, 4.9 million games of self-play were generated, using 1,600 simulations for each MCTS, which corresponds to approximately 0.4s thinking time per move. Parameters were updated from 700,000 mini-batches of 2,048 positions. The neural network contained 20 residual blocks (see Methods for further details).

Figure 3: Empirical evaluation of AlphaGo Zero. a Performance of self-play reinforcement learning. The plot shows the performance of each MCTS player $\alpha_{\theta_i}$ from each iteration $i$ of reinforcement learning in AlphaGo Zero. Elo ratings were computed from evaluation games between different players, using 0.4 seconds of thinking time per move (see Methods). For comparison, a similar player trained by supervised learning from human data, using the KGS data-set, is also shown. b Prediction accuracy on human professional moves. The plot shows the accuracy of the neural network $f_{\theta_i}$, at each iteration of self-play $i$, in predicting human professional moves from the GoKifu data-set. The accuracy measures the percentage of positions in which the neural network assigns the highest probability to the human move. The accuracy of a neural network trained by supervised learning is also shown. c Mean-squared error (MSE) on human professional game outcomes. The plot shows the MSE of the neural network $f_{\theta_i}$, at each iteration of self-play $i$, in predicting the outcome of human professional games from the GoKifu data-set. The MSE is between the actual outcome $z \in \{-1, +1\}$ and the neural network value $v$, scaled by a factor of 1/4 to the range [0, 1]. The MSE of a neural network trained by supervised learning is also shown. (Plot labels: training time in hours on the x-axis; series Reinforcement Learning, Supervised Learning, and AlphaGo Lee.)

Figure 3a shows the performance of AlphaGo Zero during self-play reinforcement learning, as a function of training time, on an Elo scale [25]. Learning progressed smoothly throughout training, and did not suffer from the oscillations or catastrophic forgetting suggested in prior literature [26-28]. Surprisingly, AlphaGo Zero outperformed AlphaGo Lee after just 36 hours; for comparison, AlphaGo Lee was trained over several months. After 72 hours, we evaluated AlphaGo Zero against the exact version of AlphaGo Lee that defeated Lee Sedol, under the same 2 hour time controls and match conditions as were used in the man-machine match in Seoul (see Methods). AlphaGo Zero used a single machine with 4 Tensor Processing Units (TPUs) [29], while AlphaGo Lee was distributed over many machines and used 48 TPUs.
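For readers unfamiliar with the Elo scale used in Figure 3a, the sketch below shows how a rating gap maps to an expected score under the conventional Elo model (a logistic curve in base 10 with a 400-point scale). This is the textbook formula, given here as an assumption; the exact rating procedure of the paper is described in its Methods and is not reproduced here. The function name and the ratings in the example are hypothetical.

```python
def expected_score(rating_a, rating_b):
    """Expected score of player A against player B (conventional Elo model)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# Hypothetical ratings: equal players, then a 200-point gap in A's favour.
even = expected_score(3000, 3000)
gap_200 = expected_score(3200, 3000)
```

Under this model equal ratings give an expected score of one half, and a 200-point advantage gives an expected score of roughly three quarters.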
AlphaGo Zero defeated AlphaGo Lee by 100 games to 0 (see Extended Data Figure 5 and Supplementary Information).

To assess the merits of self-play reinforcement learning, compared to learning from human data, we trained a second neural network (using the same architecture) to predict expert moves in the KGS data-set; this achieved state-of-the-art prediction accuracy compared to prior work [12, 30-33] (see Extended Data Tables 1 and 2 respectively). Supervised learning achieved better initial performance, and was better at predicting the outcome of human professional games (Figure 3). Notably, although supervised learning achieved higher move prediction accuracy, the self-learned player performed much better overall, defeating the human-trained player within the first 24 hours of training. This suggests that AlphaGo Zero may be learning a strategy that is qualitatively different to human play.

To separate the contributions of architecture and algorithm, we compared the performance of the neural network architecture in AlphaGo Zero with the previous neural network architecture used in AlphaGo Lee (see Figure 4). Four neural networks were created, using either separate policy and value networks, as in AlphaGo Lee, or combined policy and value networks, as in AlphaGo Zero; and using either the convolutional network architecture from AlphaGo Lee or the residual network architecture from AlphaGo Zero. Each network was trained to minimise the same loss function (Equation 1) using a fixed data-set of self-play games generated by AlphaGo Zero after 72 hours of self-play training. Using a residual network was more accurate, achieved lower error, and improved performance in AlphaGo by over 600 Elo. Combining policy and value together into a single network slightly reduced the move prediction accuracy, but reduced the value error and boosted playing performance in AlphaGo by around another 600 Elo.
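The loss of Equation 1, which all four networks were trained to minimise, can be written out for a single training example as below. This is a minimal sketch in plain Python under stated assumptions: the function name `alphago_zero_loss`, the flat list `theta` standing in for the network weights, the default value of `c`, and the numbers in the example are illustrative placeholders, not values taken from the text.

```python
import math


def alphago_zero_loss(p, v, pi, z, theta, c=1e-4):
    """l = (z - v)^2 - pi^T log p + c * ||theta||^2 for one example."""
    # Mean-squared-error term between predicted value and game winner.
    value_loss = (z - v) ** 2
    # Cross-entropy between search probabilities pi and network policy p.
    # Terms with pi_a == 0 contribute nothing and are skipped to avoid log(0).
    policy_loss = -sum(pi_a * math.log(p_a)
                       for pi_a, p_a in zip(pi, p) if pi_a > 0)
    # L2 weight regularisation.
    l2_penalty = c * sum(w * w for w in theta)
    return value_loss + policy_loss + l2_penalty


# Made-up example with three moves.
pi = [0.9, 0.1, 0.0]             # search probabilities from MCTS
theta = [0.5, -0.5]              # stand-in for the network weights
close = alphago_zero_loss(p=[0.7, 0.2, 0.1], v=0.5, pi=pi, z=1.0, theta=theta)
far = alphago_zero_loss(p=[0.1, 0.2, 0.7], v=-0.5, pi=pi, z=1.0, theta=theta)
```

The loss shrinks as the policy `p` moves towards the search probabilities `pi` and as the value `v` moves towards the game winner `z`, which is why `close` is smaller than `far` in this example.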
This gain from combining policy and value is partly due to improved computational efficiency, but more importantly the dual objective regularises the network to a common representation that supports multiple use cases.

Figure 4: Comparison of neural network architectures in AlphaGo Zero and AlphaGo Lee. Comparison of neural network architectures using either separate ("sep") or combined policy and value networks ("dual"), and using either convolutional ("conv") or residual networks ("res"). The combinations "dual-res" and "sep-conv" correspond to the neural network architectures used in AlphaGo Zero and AlphaGo Lee respectively. Each network was trained on a fixed data-set generated by a previous run of AlphaGo Zero. a Each trained network was combined with AlphaGo Zero's search to obtain a different player. Elo ratings were computed from evaluation games between these different players, using 5 seconds of thinking time per move. b Prediction accuracy on human professional moves (from the GoKifu data-set) for each network architecture. c Mean-squared error on human professional game outcomes (from the GoKifu data-set) for each network architecture.

3 Knowledge Learned by AlphaGo Zero

AlphaGo Zero discovered a remarkable level of Go knowledge during its self-play training process. This included fundamental elements of human Go knowledge, and also non-standard strategies beyond the scope of traditional Go knowledge.

Figure 5 shows a timeline indicating when professional joseki (corner sequences) were discovered (Figure 5a, Extended Data Figure 1); ultimately AlphaGo Zero preferred new joseki variants that were previously unknown (Figure 5b, Extended Data Figure 2).
Figure 5c and the Supplementary Information show several fast self-play games played at different stages of training. Tournament length games played at regular intervals throughout training are shown in Extended Data Figure 3 and Supplementary Information. AlphaGo Zero rapidly progressed from entirely random moves towards a sophisticated understanding of Go concepts including fuseki (opening), tesuji (tactics), life-and-death, ko (repeated board situations), yose (endgame), capturing races, sente (initiative), shape, influence and territory, all discovered from first principles. Surprisingly, shicho ("ladder" capture sequences that may span the whole board), one of the first elements of Go knowledge learned by humans, were only understood by AlphaGo Zero much later in training.

4 Final Performance of AlphaGo Zero

We subsequently applied our reinforcement learning pipeline to a second instance of AlphaGo Zero using a larger neural network and over a longer duration. Training again started from completely random behaviour and continued for approximately 40 days.

Over the course of training, 29 million games of self-play were generated. Parameters were updated from 3.1 million mini-batches of 2,048 positions each. The neural network contained