Mastering the game of Go without human knowledge

2017-10-19 22:19:13, 2.44 MB PDF

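This is the complete English-language paper. Translated abstract: A long-standing goal of artificial intelligence is an algorithm that lets a machine learn, in challenging domains, to progress from a blank-slate state (no experience, no knowledge base) to a superhuman level. Recently, AlphaGo became the first program to defeat a human world champion at Go. The tree search in AlphaGo uses deep neural networks to evaluate positions and select moves, and can even perform reinforcement learning through self-play. This paper (nature24270) introduces an algorithm based purely on reinforcement learning, without human data, guidance, or domain knowledge beyond the rules of the game. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo's own move selections and the winner of its games. This neural network improves the strength of the tree search, yielding better move selection and stronger self-play in the next iteration. Starting from a blank slate, our new program AlphaGo Zero achieved superhuman performance, winning 100 games to 0 against the previous, champion-defeating version of AlphaGo.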
1 Reinforcement Learning in AlphaGo Zero

Our new method uses a deep neural network $f_\theta$ with parameters $\theta$. This neural network takes as an input the raw board representation $s$ of the position and its history, and outputs both move probabilities and a value, $(\mathbf{p}, v) = f_\theta(s)$. The vector of move probabilities $\mathbf{p}$ represents the probability of selecting each move (including pass), $p_a = \Pr(a \mid s)$. The value $v$ is a scalar evaluation, estimating the probability of the current player winning from position $s$. This neural network combines the roles of both policy network and value network [12] into a single architecture. The neural network consists of many residual blocks of convolutional layers [16, 17] with batch normalisation and rectifier non-linearities (see Methods).

The neural network in AlphaGo Zero is trained from games of self-play by a novel reinforcement learning algorithm. In each position $s$, an MCTS search is executed, guided by the neural network $f_\theta$. The MCTS search outputs probabilities $\boldsymbol{\pi}$ of playing each move. These search probabilities usually select much stronger moves than the raw move probabilities $\mathbf{p}$ of the neural network $f_\theta(s)$; MCTS may therefore be viewed as a powerful policy improvement operator [20, 21]. Self-play with search, using the improved MCTS-based policy to select each move and then using the game winner $z$ as a sample of the value, may be viewed as a powerful policy evaluation operator. The main idea of our reinforcement learning algorithm is to use these search operators repeatedly in a policy iteration procedure [22, 23]: the neural network's parameters are updated to make the move probabilities and value $(\mathbf{p}, v) = f_\theta(s)$ more closely match the improved search probabilities and self-play winner $(\boldsymbol{\pi}, z)$; these new parameters are used in the next iteration of self-play to make the search even stronger. Figure 1 illustrates the self-play training pipeline.

The Monte-Carlo tree search uses the neural network $f_\theta$ to guide its simulations (see Figure 2).
Each edge $(s, a)$ in the search tree stores a prior probability $P(s, a)$, a visit count $N(s, a)$, and an action-value $Q(s, a)$. Each simulation starts from the root state and iteratively selects moves that maximise an upper confidence bound $Q(s, a) + U(s, a)$, where $U(s, a) \propto P(s, a) / (1 + N(s, a))$ [12, 24], until a leaf node $s'$ is encountered. This leaf position is expanded and evaluated just once by the network to generate both prior probabilities and evaluation, $(P(s', \cdot), V(s')) = f_\theta(s')$. Each edge $(s, a)$ traversed in the simulation is updated to increment its visit count $N(s, a)$, and to update its action-value to the mean evaluation over these simulations, $Q(s, a) = \frac{1}{N(s, a)} \sum_{s' \mid s, a \to s'} V(s')$, where $s, a \to s'$ indicates that a simulation eventually reached $s'$ after taking move $a$ from position $s$.

Figure 1: Self-play reinforcement learning in AlphaGo Zero. a The program plays a game $s_1, \ldots, s_T$ against itself. In each position $s_t$, a Monte-Carlo tree search (MCTS) $\alpha_\theta$ is executed (see Figure 2) using the latest neural network $f_\theta$. Moves are selected according to the search probabilities computed by the MCTS, $a_t \sim \boldsymbol{\pi}_t$. The terminal position $s_T$ is scored according to the rules of the game to compute the game winner $z$. b Neural network training in AlphaGo Zero. The neural network takes the raw board position $s_t$ as its input, passes it through many convolutional layers with parameters $\theta$, and outputs both a vector $\mathbf{p}_t$, representing a probability distribution over moves, and a scalar value $v_t$, representing the probability of the current player winning in position $s_t$. The neural network parameters $\theta$ are updated so as to maximise the similarity of the policy vector $\mathbf{p}_t$ to the search probabilities $\boldsymbol{\pi}_t$, and to minimise the error between the predicted winner $v_t$ and the game winner $z$ (see Equation 1). The new parameters are used in the next iteration of self-play, as in a.

Figure 2: Monte-Carlo tree search in AlphaGo Zero (panels: a Select, b Expand and evaluate, c Backup, d Play). a Each simulation traverses the tree by selecting the edge with maximum action-value $Q$, plus an upper confidence bound $U$ that depends on a stored prior probability $P$ and visit count $N$ for that edge (which is incremented once traversed). b The leaf node is expanded and the associated position $s$ is evaluated by the neural network, $(P(s, \cdot), V(s)) = f_\theta(s)$; the vector of $P$ values is stored in the outgoing edges from $s$. c Action-values $Q$ are updated to track the mean of all evaluations $V$ in the subtree below that action. d Once the search is complete, search probabilities $\boldsymbol{\pi}$ are returned, proportional to $N^{1/\tau}$, where $N$ is the visit count of each move from the root state and $\tau$ is a parameter controlling temperature.

MCTS may be viewed as a self-play algorithm that, given neural network parameters $\theta$ and a root position $s$, computes a vector of search probabilities recommending moves to play, $\boldsymbol{\pi} = \alpha_\theta(s)$, proportional to the exponentiated visit count for each move, $\pi_a \propto N(s, a)^{1/\tau}$, where $\tau$ is a temperature parameter.

The neural network is trained by a self-play reinforcement learning algorithm that uses MCTS to play each move. First, the neural network is initialised to random weights $\theta_0$. At each subsequent iteration $i \geq 1$, games of self-play are generated (Figure 1a). At each time-step $t$, an MCTS search $\boldsymbol{\pi}_t = \alpha_{\theta_{i-1}}(s_t)$ is executed using the previous iteration of neural network $f_{\theta_{i-1}}$, and a move is played by sampling the search probabilities $\boldsymbol{\pi}_t$. A game terminates at step $T$ when both players pass, when the search value drops below a resignation threshold, or when the game exceeds a maximum length; the game is then scored to give a final reward of $r_T \in \{-1, +1\}$ (see Methods for details). The data for each time-step $t$ is stored as $(s_t, \boldsymbol{\pi}_t, z_t)$, where $z_t = \pm r_T$ is the game winner from the perspective of the current player at step $t$.
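The edge statistics and the three search operations described above (selection by Q + U, backup of the mean evaluation, and temperature-scaled search probabilities) can be illustrated with a minimal sketch. It is not the authors' implementation. The constant `C_PUCT`, the choice of Q = 0 for unvisited edges, and the helper `run_toy_search` are assumptions introduced here. The toy search is one ply deep and replaces the neural network evaluation with fixed, hypothetical leaf values, so no change of player perspective is needed.

```python
from dataclasses import dataclass

C_PUCT = 1.0  # assumed exploration constant; the text only gives U ∝ P / (1 + N)


@dataclass
class Edge:
    """Statistics stored on one edge (s, a) of the search tree."""
    prior: float            # P(s, a), which would come from the network's policy output
    visits: int = 0         # N(s, a)
    value_sum: float = 0.0  # running sum of leaf evaluations V(s')

    @property
    def q(self) -> float:
        # Q(s, a) is the mean evaluation; 0.0 for an unvisited edge is an assumption.
        return self.value_sum / self.visits if self.visits else 0.0


def select_move(edges):
    """Return the move maximising Q(s, a) + U(s, a)."""
    def score(move):
        edge = edges[move]
        return edge.q + C_PUCT * edge.prior / (1 + edge.visits)
    return max(edges, key=score)


def backup(edge, leaf_value):
    """Increment N(s, a) and fold one evaluation into the mean Q(s, a)."""
    edge.visits += 1
    edge.value_sum += leaf_value


def search_probabilities(edges, tau=1.0):
    """Return pi with pi_a proportional to N(s, a) ** (1 / tau), for tau > 0."""
    weights = {move: edge.visits ** (1.0 / tau) for move, edge in edges.items()}
    total = sum(weights.values())
    return {move: w / total for move, w in weights.items()}


def run_toy_search(priors, leaf_values, n_simulations):
    """One-ply toy search: every move leads straight to a leaf with a fixed value."""
    edges = {move: Edge(prior=p) for move, p in priors.items()}
    for _ in range(n_simulations):
        move = select_move(edges)
        backup(edges[move], leaf_values[move])
    return edges


# Hypothetical priors and leaf values, chosen only to exercise the functions.
edges = run_toy_search({0: 0.6, 1: 0.3, 2: 0.1}, {0: -0.2, 1: 0.5, 2: 0.0}, 100)
pi = search_probabilities(edges, tau=1.0)
print({move: round(p, 3) for move, p in pi.items()})
```

In this toy run the prior initially favours move 0, but its poor evaluation shifts almost all later visits to move 1, which is the sense in which the visit counts act as an improved policy. Lowering `tau` sharpens the returned distribution towards the most-visited move.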
In parallel (Figure 1b), new network parameters $\theta_i$ are trained from data $(s, \boldsymbol{\pi}, z)$ sampled uniformly among all time-steps of the last iteration(s) of self-play. The neural network $(\mathbf{p}, v) = f_{\theta_i}(s)$ is adjusted to minimise the error between the predicted value $v$ and the self-play winner $z$, and to maximise the similarity of the neural network move probabilities $\mathbf{p}$ to the search probabilities $\boldsymbol{\pi}$. Specifically, the parameters $\theta$ are adjusted by gradient descent on a loss function $l$ that sums over mean-squared error and cross-entropy losses respectively,

$$(\mathbf{p}, v) = f_\theta(s), \qquad l = (z - v)^2 - \boldsymbol{\pi}^\top \log \mathbf{p} + c \lVert \theta \rVert^2 \tag{1}$$

where $c$ is a parameter controlling the level of L2 weight regularisation (to prevent overfitting).

2 Empirical Analysis of AlphaGo Zero Training

We applied our reinforcement learning pipeline to train our program AlphaGo Zero. Training started from completely random behaviour and continued without human intervention for approximately 3 days.

Over the course of training, 4.9 million games of self-play were generated, using 1,600 simulations for each MCTS, which corresponds to approximately 0.4 s thinking time per move. Parameters were updated from 700,000 mini-batches of 2,048 positions. The neural network contained 20 residual blocks (see Methods for further details).

Figure 3: Empirical evaluation of AlphaGo Zero (all panels plotted against training time in hours, comparing reinforcement learning with supervised learning). a Performance of self-play reinforcement learning. The plot shows the performance of each MCTS player $\alpha_{\theta_i}$ from each iteration $i$ of reinforcement learning in AlphaGo Zero. Elo ratings were computed from evaluation games between different players, using 0.4 seconds of thinking time per move (see Methods). For comparison, a similar player trained by supervised learning from human data, using the KGS data-set, is also shown. b Prediction accuracy on human professional moves. The plot shows the accuracy of the neural network $f_{\theta_i}$, at each iteration of self-play $i$, in predicting human professional moves from the GoKifu data-set. The accuracy measures the percentage of positions in which the neural network assigns the highest probability to the human move. The accuracy of a neural network trained by supervised learning is also shown. c Mean-squared error (MSE) on human professional game outcomes. The plot shows the MSE of the neural network $f_{\theta_i}$, at each iteration of self-play $i$, in predicting the outcome of human professional games from the GoKifu data-set. The MSE is between the actual outcome $z \in \{-1, +1\}$ and the neural network value $v$, scaled by a factor of $\frac{1}{4}$ to the range $[0, 1]$. The MSE of a neural network trained by supervised learning is also shown.

Figure 3a shows the performance of AlphaGo Zero during self-play reinforcement learning, as a function of training time, on an Elo scale [25]. Learning progressed smoothly throughout training, and did not suffer from the oscillations or catastrophic forgetting suggested in prior literature [26-28]. Surprisingly, AlphaGo Zero outperformed AlphaGo Lee after just 36 hours; for comparison, AlphaGo Lee was trained over several months. After 72 hours, we evaluated AlphaGo Zero against the exact version of AlphaGo Lee that defeated Lee Sedol, under the 2 hour time controls and match conditions as were used in the man-machine match in Seoul (see Methods). AlphaGo Zero used a single machine with 4 Tensor Processing Units (TPUs) [29], while AlphaGo Lee was distributed over many machines and used 48 TPUs.
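For reference, the training objective described earlier (squared value error, plus policy cross-entropy, plus L2 weight regularisation) can be written out for a single position as a minimal sketch. The function name `alphago_zero_loss`, the flat list `theta` standing in for the network weights, and the default value of `c` are assumptions made for illustration. A real training loop would average this over a mini-batch and differentiate it with respect to the parameters.

```python
import math


def alphago_zero_loss(z, v, pi, p, theta, c=1e-4):
    """Single-position loss l = (z - v)^2 - pi^T log p + c * ||theta||^2.

    z     -- game winner from the current player's perspective, -1 or +1
    v     -- predicted value for the position
    pi    -- search probabilities over moves
    p     -- network move probabilities over the same moves
    theta -- flat list of weights (a stand-in for the real parameters)
    c     -- L2 regularisation strength (illustrative default)
    """
    value_loss = (z - v) ** 2
    # Moves with pi_a == 0 contribute nothing, so skip them to avoid log(0).
    policy_loss = -sum(pi_a * math.log(p_a) for pi_a, p_a in zip(pi, p) if pi_a > 0)
    l2_penalty = c * sum(w * w for w in theta)
    return value_loss + policy_loss + l2_penalty


# Hypothetical numbers, chosen only to exercise the function.
loss = alphago_zero_loss(
    z=1.0, v=0.5,
    pi=[0.5, 0.5, 0.0], p=[0.5, 0.25, 0.25],
    theta=[1.0, -2.0], c=0.1,
)
print(round(loss, 4))
```

The cross-entropy term is smallest when the network's move probabilities equal the search probabilities, and the squared-error term is smallest when the predicted value equals the game outcome, which matches the two goals stated for the update.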
AlphaGo Zero defeated AlphaGo Lee by 100 games to 0 (see Extended Data Figure 5 and Supplementary Information).

To assess the merits of self-play reinforcement learning, compared to learning from human data, we trained a second neural network (using the same architecture) to predict expert moves in the KGS data-set; this achieved state-of-the-art prediction accuracy compared to prior work [12, 30-33] (see Extended Data Table 1 and 2 respectively). Supervised learning achieved better initial performance, and was better at predicting the outcome of human professional games (Figure 3). Notably, although supervised learning achieved higher move prediction accuracy, the self-learned player performed much better overall, defeating the human-trained player within the first 24 hours of training. This suggests that AlphaGo Zero may be learning a strategy that is qualitatively different to human play.

To separate the contributions of architecture and algorithm, we compared the performance of the neural network architecture in AlphaGo Zero with the previous neural network architecture used in AlphaGo Lee (see Figure 4). Four neural networks were created, using either separate policy and value networks, as in AlphaGo Lee, or combined policy and value networks, as in AlphaGo Zero; and using either the convolutional network architecture from AlphaGo Lee or the residual network architecture from AlphaGo Zero. Each network was trained to minimise the same loss function (Equation 1) using a fixed data-set of self-play games generated by AlphaGo Zero after 72 hours of self-play training. Using a residual network was more accurate, achieved lower error, and improved performance in AlphaGo by over 600 Elo. Combining policy and value together into a single network slightly reduced the move prediction accuracy, but reduced the value error and boosted playing performance in AlphaGo by around another 600 Elo. This is partly due to improved computational efficiency, but more importantly the dual objective regularises the network to a common representation that supports multiple use cases.

Figure 4: Comparison of neural network architectures in AlphaGo Zero and AlphaGo Lee. Comparison of neural network architectures using either separate ("sep") or combined policy and value networks ("dual"), and using either convolutional ("conv") or residual networks ("res"). The combinations "dual-res" and "sep-conv" correspond to the neural network architectures used in AlphaGo Zero and AlphaGo Lee respectively. Each network was trained on a fixed data-set generated by a previous run of AlphaGo Zero. a Each trained network was combined with AlphaGo Zero's search to obtain a different player. Elo ratings were computed from evaluation games between these different players, using 5 seconds of thinking time per move. b Prediction accuracy on human professional moves (from the GoKifu data-set) for each network architecture. c Mean-squared error on human professional game outcomes (from the GoKifu data-set) for each network architecture.

3 Knowledge Learned by AlphaGo Zero

AlphaGo Zero discovered a remarkable level of Go knowledge during its self-play training process. This included fundamental elements of human Go knowledge, and also non-standard strategies beyond the scope of traditional Go knowledge.

Figure 5 shows a timeline indicating when professional joseki (corner sequences) were discovered (Figure 5a, Extended Data Figure 1); ultimately AlphaGo Zero preferred new joseki variants that were previously unknown (Figure 5b, Extended Data Figure 2). Figure 5c and the Supplementary Information show several fast self-play games played at different stages of training. Tournament length games played at regular intervals throughout training are shown in Extended Data Figure 3 and Supplementary Information. AlphaGo Zero rapidly progressed from entirely random moves towards a sophisticated understanding of Go concepts including fuseki (opening), tesuji (tactics), life-and-death, ko (repeated board situations), yose (endgame), capturing races, sente (initiative), shape, influence and territory, all discovered from first principles. Surprisingly, shicho ("ladder" capture sequences that may span the whole board), one of the first elements of Go knowledge learned by humans, were only understood by AlphaGo Zero much later in training.

4 Final Performance of AlphaGo Zero

We subsequently applied our reinforcement learning pipeline to a second instance of AlphaGo Zero using a larger neural network and over a longer duration. Training again started from completely random behaviour and continued for approximately 40 days.

Over the course of training, 29 million games of self-play were generated. Parameters were updated from 3.1 million mini-batches of 2,048 positions each. The neural network contained
