Mastering the game of Go without human knowledge

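This document is the complete English-language paper. Abstract (translated): A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa (with no prior experience or knowledge base), to reach superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a human world champion at Go. The tree search in AlphaGo evaluates positions and selects moves using deep neural networks, which were trained in part by reinforcement learning from self-play. This paper (nature24270) introduces an algorithm based solely on reinforcement learning, without human data, guidance, or domain knowledge beyond the rules of the game. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo's own move selections and the winner of its games. This network strengthens the tree search, yielding better move selection and stronger self-play in the next iteration. Starting tabula rasa, the new program AlphaGo Zero achieved superhuman performance, winning 100 games to 0 against the earlier, champion-defeating AlphaGo.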
1 Reinforcement Learning in AlphaGo Zero

Our new method uses a deep neural network $f_\theta$ with parameters $\theta$. This neural network takes as an input the raw board representation $s$ of the position and its history, and outputs both move probabilities and a value, $(p, v) = f_\theta(s)$. The vector of move probabilities $p$ represents the probability of selecting each move (including pass), $p_a = \Pr(a \mid s)$. The value $v$ is a scalar evaluation, estimating the probability of the current player winning from position $s$. This neural network combines the roles of both policy network and value network [12] into a single architecture. The neural network consists of many residual blocks of convolutional layers [16, 17] with batch normalisation and rectifier non-linearities (see Methods).

The neural network in AlphaGo Zero is trained from games of self-play by a novel reinforcement learning algorithm. In each position $s$, an MCTS search is executed, guided by the neural network $f_\theta$. The MCTS search outputs probabilities $\pi$ of playing each move. These search probabilities usually select much stronger moves than the raw move probabilities $p$ of the neural network $f_\theta(s)$; MCTS may therefore be viewed as a powerful policy improvement operator [20, 21]. Self-play with search, using the improved MCTS-based policy to select each move and then using the game winner $z$ as a sample of the value, may be viewed as a powerful policy evaluation operator. The main idea of our reinforcement learning algorithm is to use these search operators repeatedly in a policy iteration procedure [22, 23]: the neural network's parameters are updated to make the move probabilities and value $(p, v) = f_\theta(s)$ more closely match the improved search probabilities and self-play winner $(\pi, z)$; these new parameters are used in the next iteration of self-play to make the search even stronger. Figure 1 illustrates the self-play training pipeline.

The Monte-Carlo tree search uses the neural network $f_\theta$ to guide its simulations (see Figure 2).
Each edge $(s, a)$ in the search tree stores a prior probability $P(s, a)$, a visit count $N(s, a)$, and an action-value $Q(s, a)$. Each simulation starts from the root state and iteratively selects moves that maximise an upper confidence bound $Q(s, a) + U(s, a)$, where $U(s, a) \propto P(s, a) / (1 + N(s, a))$ [12, 24], until a leaf node $s'$ is encountered. This leaf position is expanded and evaluated just once by the network to generate both prior probabilities and evaluation, $(P(s', \cdot), V(s')) = f_\theta(s')$. Each edge $(s, a)$ traversed in the simulation is updated to increment its visit count $N(s, a)$, and to update its action-value to the mean evaluation over these simulations, $Q(s, a) = \frac{1}{N(s, a)} \sum_{s' \mid s, a \to s'} V(s')$, where $s, a \to s'$ indicates that a simulation eventually reached $s'$ after taking move $a$ from position $s$.

Figure 1 (panels: a Self-play, b Neural network training): Self-play reinforcement learning in AlphaGo Zero. a The program plays a game $s_1, \ldots, s_T$ against itself. In each position $s_t$, a Monte-Carlo tree search (MCTS) $\alpha_\theta$ is executed (see Figure 2) using the latest neural network $f_\theta$. Moves are selected according to the search probabilities computed by the MCTS, $a_t \sim \pi_t$. The terminal position $s_T$ is scored according to the rules of the game to compute the game winner $z$. b Neural network training in AlphaGo Zero. The neural network takes the raw board position $s_t$ as its input, passes it through many convolutional layers with parameters $\theta$, and outputs both a vector $p_t$, representing a probability distribution over moves, and a scalar value $v_t$, representing the probability of the current player winning in position $s_t$. The neural network parameters $\theta$ are updated so as to maximise the similarity of the policy vector $p_t$ to the search probabilities $\pi_t$, and to minimise the error between the predicted winner $v_t$ and the game winner $z$ (see Equation 1). The new parameters are used in the next iteration of self-play (panel a).

Figure 2 (panels: a Select, b Expand and evaluate, c Backup, d Play): Monte-Carlo tree search in AlphaGo Zero. a Each simulation traverses the tree by selecting the edge with maximum action-value $Q$, plus an upper confidence bound $U$ that depends on a stored prior probability $P$ and visit count $N$ for that edge (which is incremented once traversed). b The leaf node is expanded and the associated position $s$ is evaluated by the neural network, $(P(s, \cdot), V(s)) = f_\theta(s)$; the vector of $P$ values is stored in the outgoing edges from $s$. c Action-values $Q$ are updated to track the mean of all evaluations $V$ in the subtree below that action. d Once the search is complete, search probabilities $\pi$ are returned, proportional to $N^{1/\tau}$, where $N$ is the visit count of each move from the root state and $\tau$ is a parameter controlling temperature.

MCTS may be viewed as a self-play algorithm that, given neural network parameters $\theta$ and a root position $s$, computes a vector of search probabilities recommending moves to play, $\pi = \alpha_\theta(s)$, proportional to the exponentiated visit count for each move, $\pi_a \propto N(s, a)^{1/\tau}$, where $\tau$ is a temperature parameter.

The neural network is trained by a self-play reinforcement learning algorithm that uses MCTS to play each move. First, the neural network is initialised to random weights $\theta_0$. At each subsequent iteration $i \ge 1$, games of self-play are generated (Figure 1a). At each time-step $t$, an MCTS search $\pi_t = \alpha_{\theta_{i-1}}(s_t)$ is executed using the previous iteration of neural network $f_{\theta_{i-1}}$, and a move is played by sampling the search probabilities $\pi_t$. A game terminates at step $T$ when both players pass, when the search value drops below a resignation threshold, or when the game exceeds a maximum length; the game is then scored to give a final reward of $r_T \in \{-1, +1\}$ (see Methods for details). The data for each time-step $t$ is stored as $(s_t, \pi_t, z_t)$, where $z_t = \pm r_T$ is the game winner from the perspective of the current player at step $t$.
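The edge statistics, selection rule, backup step and temperature-scaled search probabilities described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Edge`, `select_action`, `backup` and `search_probabilities` are hypothetical, the constant `c_puct` stands in for the unspecified proportionality factor in $U(s, a) \propto P(s, a) / (1 + N(s, a))$, unvisited edges are given $Q = 0$, and sign handling between the two alternating players is left out.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    """Statistics stored on one edge (s, a) of the search tree."""
    prior: float             # P(s, a), taken from the network's move probabilities
    visits: int = 0          # N(s, a)
    value_sum: float = 0.0   # running sum of leaf evaluations V(s')

    @property
    def q(self) -> float:
        # Q(s, a): mean evaluation; 0.0 for an unvisited edge (an assumption)
        return self.value_sum / self.visits if self.visits else 0.0


def select_action(edges, c_puct=1.0):
    """Return the move maximising Q(s, a) + U(s, a), with U = c_puct * P / (1 + N).

    c_puct is an assumed constant; the text only gives U up to proportionality.
    """
    return max(
        edges,
        key=lambda a: edges[a].q + c_puct * edges[a].prior / (1 + edges[a].visits),
    )


def backup(path, leaf_value):
    """Increment N and fold the leaf evaluation into the mean on each traversed edge.

    Simplification: leaf_value is applied with the same sign to every edge.
    """
    for edge in path:
        edge.visits += 1
        edge.value_sum += leaf_value


def search_probabilities(edges, tau=1.0):
    """Return pi_a proportional to N(s, a) ** (1 / tau) over the root's edges."""
    weights = {a: e.visits ** (1.0 / tau) for a, e in edges.items()}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no simulations have been run from this position")
    return {a: w / total for a, w in weights.items()}
```

With visit counts of 1 and 2 on two root moves, `tau=1.0` gives probabilities of 1/3 and 2/3, while `tau=0.5` sharpens them to 0.2 and 0.8; smaller temperatures concentrate play on the most-visited move.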
In parallel (Figure 1b), new network parameters $\theta_i$ are trained from data $(s, \pi, z)$ sampled uniformly among all time-steps of the last iteration(s) of self-play. The neural network $(p, v) = f_{\theta_i}(s)$ is adjusted to minimise the error between the predicted value $v$ and the self-play winner $z$, and to maximise the similarity of the neural network move probabilities $p$ to the search probabilities $\pi$. Specifically, the parameters $\theta$ are adjusted by gradient descent on a loss function $l$ that sums over mean-squared error and cross-entropy losses respectively,

$$(p, v) = f_\theta(s), \qquad l = (z - v)^2 - \pi^{\top} \log p + c \lVert \theta \rVert^2 \tag{1}$$

where $c$ is a parameter controlling the level of L2 weight regularisation (to prevent overfitting).

2 Empirical Analysis of AlphaGo Zero Training

We applied our reinforcement learning pipeline to train our program AlphaGo Zero. Training started from completely random behaviour and continued without human intervention for approximately 3 days.

Over the course of training, 4.9 million games of self-play were generated, using 1,600 simulations for each MCTS, which corresponds to approximately 0.4 s thinking time per move. Parameters were updated from 700,000 mini-batches of 2,048 positions. The neural network contained 20 residual blocks (see Methods for further details).

Figure 3a shows the performance of AlphaGo Zero during self-play reinforcement learning, as a function of training time, on an Elo scale [25]. Learning progressed smoothly throughout training, and did not suffer from the oscillations or catastrophic forgetting suggested in prior literature [26-28]. Surprisingly, AlphaGo Zero outperformed AlphaGo Lee after just 36 hours; for comparison, AlphaGo Lee was trained over several months. After 72 hours, we evaluated AlphaGo Zero against the exact version of AlphaGo Lee that defeated Lee Sedol, under the same 2-hour time controls and match conditions as were used in the man-machine match in Seoul (see Methods). AlphaGo Zero used a single machine with 4 Tensor Processing Units (TPUs) [29], while AlphaGo Lee was distributed over many machines and used 48 TPUs.

Figure 3: Empirical evaluation of AlphaGo Zero. a Performance of self-play reinforcement learning. The plot shows the performance of each MCTS player $\alpha_{\theta_i}$ from each iteration $i$ of reinforcement learning in AlphaGo Zero. Elo ratings were computed from evaluation games between different players, using 0.4 seconds of thinking time per move (see Methods). For comparison, a similar player trained by supervised learning from human data, using the KGS dataset, is also shown. b Prediction accuracy on human professional moves. The plot shows the accuracy of the neural network $f_{\theta_i}$, at each iteration of self-play $i$, in predicting human professional moves from the GoKifu dataset. The accuracy measures the percentage of positions in which the neural network assigns the highest probability to the human move. The accuracy of a neural network trained by supervised learning is also shown. c Mean-squared error (MSE) on human professional game outcomes. The plot shows the MSE of the neural network $f_{\theta_i}$, at each iteration of self-play $i$, in predicting the outcome of human professional games from the GoKifu dataset. The MSE is between the actual outcome $z \in \{-1, +1\}$ and the neural network value $v$, scaled by a factor of $\frac{1}{4}$ to the range [0, 1]. The MSE of a neural network trained by supervised learning is also shown. (Recoverable from the plot: horizontal axis is training time in hours; legend entries are Reinforcement Learning, Supervised Learning and AlphaGo Lee.)
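Both the training objective (Equation 1) and the Figure 3c metric rest on the squared error between the outcome $z$ and the predicted value $v$. The sketch below evaluates the per-position loss $l = (z - v)^2 - \pi^{\top} \log p + c \lVert \theta \rVert^2$ and the outcome MSE scaled by 1/4. It is a plain-Python illustration under assumptions: the function names are hypothetical, `theta` is treated as a flat list of weights, terms with $\pi_a = 0$ are skipped (the usual $0 \cdot \log 0 = 0$ convention), and no value for `c` is implied by the text above.

```python
import math


def alphago_zero_loss(p, v, pi, z, theta, c):
    """Per-position loss: (z - v)^2 - pi . log(p) + c * ||theta||^2.

    p, pi : sequences of move probabilities (network output and search target)
    v, z  : predicted value and game winner from the current player's perspective
    theta : flat sequence of network weights (an illustrative simplification)
    c     : L2 regularisation strength, supplied by the caller
    """
    value_loss = (z - v) ** 2
    # Cross-entropy between search probabilities and network probabilities;
    # entries with pi_a == 0 contribute nothing and are skipped.
    policy_loss = -sum(t * math.log(q) for t, q in zip(pi, p) if t > 0)
    l2_penalty = c * sum(w * w for w in theta)
    return value_loss + policy_loss + l2_penalty


def scaled_outcome_mse(outcomes, values):
    """Mean of (z - v)^2 over positions, divided by 4 so it lies in [0, 1]."""
    squared = [(z - v) ** 2 for z, v in zip(outcomes, values)]
    return sum(squared) / len(squared) / 4.0
```

The division by 4 follows from the ranges involved: with $z \in \{-1, +1\}$ and $v$ in $[-1, 1]$, the largest possible squared error is $(1 - (-1))^2 = 4$.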
AlphaGo Zero defeated AlphaGo Lee by 100 games to 0 (see Extended Data Figure 5 and Supplementary Information).

To assess the merits of self-play reinforcement learning, compared to learning from human data, we trained a second neural network (using the same architecture) to predict expert moves in the KGS dataset; this achieved state-of-the-art prediction accuracy compared to prior work [12, 30-33] (see Extended Data Table 1 and 2 respectively). Supervised learning achieved better initial performance, and was better at predicting the outcome of human professional games (Figure 3). Notably, although supervised learning achieved higher move prediction accuracy, the self-learned player performed much better overall, defeating the human-trained player within the first 24 hours of training. This suggests that AlphaGo Zero may be learning a strategy that is qualitatively different to human play.

To separate the contributions of architecture and algorithm, we compared the performance of the neural network architecture in AlphaGo Zero with the previous neural network architecture used in AlphaGo Lee (see Figure 4). Four neural networks were created, using either separate policy and value networks, as in AlphaGo Lee, or combined policy and value networks, as in AlphaGo Zero; and using either the convolutional network architecture from AlphaGo Lee or the residual network architecture from AlphaGo Zero. Each network was trained to minimise the same loss function (Equation 1) using a fixed dataset of self-play games generated by AlphaGo Zero after 72 hours of self-play training. Using a residual network was more accurate, achieved lower error, and improved performance in AlphaGo by over 600 Elo. Combining policy and value together into a single network slightly reduced the move prediction accuracy, but reduced the value error and boosted playing performance in AlphaGo by around another 600 Elo.
This is partly due to improved computational efficiency, but more importantly the dual objective regularises the network to a common representation that supports multiple use cases.

Figure 4: Comparison of neural network architectures in AlphaGo Zero and AlphaGo Lee. Comparison of neural network architectures using either separate ("sep") or combined policy and value networks ("dual"), and using either convolutional ("conv") or residual networks ("res"). The combinations "dual-res" and "sep-conv" correspond to the neural network architectures used in AlphaGo Zero and AlphaGo Lee respectively. Each network was trained on a fixed dataset generated by a previous run of AlphaGo Zero. a Each trained network was combined with AlphaGo Zero's search to obtain a different player. Elo ratings were computed from evaluation games between these different players, using 5 seconds of thinking time per move. b Prediction accuracy on human professional moves (from the GoKifu dataset) for each network architecture. c Mean-squared error on human professional game outcomes (from the GoKifu dataset) for each network architecture.

3 Knowledge Learned by AlphaGo Zero

AlphaGo Zero discovered a remarkable level of Go knowledge during its self-play training process. This included fundamental elements of human Go knowledge, and also non-standard strategies beyond the scope of traditional Go knowledge.

Figure 5 shows a timeline indicating when professional joseki (corner sequences) were discovered (Figure 5a, Extended Data Figure 1); ultimately AlphaGo Zero preferred new joseki variants that were previously unknown (Figure 5b, Extended Data Figure 2).
Figure 5c and the Supplementary Information show several fast self-play games played at different stages of training. Tournament-length games played at regular intervals throughout training are shown in Extended Data Figure 3 and Supplementary Information. AlphaGo Zero rapidly progressed from entirely random moves towards a sophisticated understanding of Go concepts including fuseki (opening), tesuji (tactics), life-and-death, ko (repeated board situations), yose (endgame), capturing races, sente (initiative), shape, influence and territory, all discovered from first principles. Surprisingly, shicho ("ladder" capture sequences that may span the whole board), one of the first elements of Go knowledge learned by humans, were only understood by AlphaGo Zero much later in training.

4 Final Performance of AlphaGo Zero

We subsequently applied our reinforcement learning pipeline to a second instance of AlphaGo Zero using a larger neural network and over a longer duration. Training again started from completely random behaviour and continued for approximately 40 days.

Over the course of training, 29 million games of self-play were generated. Parameters were updated from 3.1 million mini-batches of 2,048 positions each. The neural network contained
2017-10-19: The original AlphaGo Zero paper, published in Nature.