【Paper translation】Mastering the game of Go without human knowledge (Self-taught: without relying ...


【Original authors and source: Silver D, Schrittwieser J, Simonyan K, et al. Mastering the game of Go without human knowledge[J]. Nature, 2017, 550(7676):354-359.】

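【This translation was done mainly by COCO, who is still getting used to the MarkDown editor, so some formulas in the article have problems; please bear with us.】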

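【Original】A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on reinforcement learning, without human data, guidance or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo’s own move selections and also the winner of AlphaGo’s games. This neural network improves the strength of the tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100–0 against the previously published, champion-defeating AlphaGo.

【Translation】A long-standing goal of artificial intelligence is an algorithm that learns on its own, from a blank slate, and reaches superhuman ability in challenging domains. Recently, AlphaGo became the first program to defeat a human world champion at Go. AlphaGo's tree search used deep neural networks to evaluate positions and select moves. Those networks were trained by supervised learning on the moves of human professional players and by reinforcement learning from self-play. Here we introduce an algorithm based solely on reinforcement learning, with no human data, guidance or domain knowledge beyond the rules of the game. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo's own move selections and the winner of its games. This network strengthens the tree search, which yields better move selection and stronger self-play in the next iteration. Starting from scratch, our new program AlphaGo Zero achieved superhuman performance, winning 100–0 against the previously published AlphaGo (the version that played Lee Sedol).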
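【Original】Much progress towards artificial intelligence has been made using supervised learning systems that are trained to replicate the decisions of human experts. However, expert data sets are often expensive, unreliable or simply unavailable. Even when reliable data sets are available, they may impose a ceiling on the performance of systems trained in this manner. By contrast, reinforcement learning systems are trained from their own experience, in principle allowing them to exceed human capabilities, and to operate in domains where human expertise is lacking. Recently, there has been rapid progress towards this goal, using deep neural networks trained by reinforcement learning. These systems have outperformed humans in computer games, such as Atari and 3D virtual environments. However, the most challenging domains in terms of human intellect—such as the game of Go, widely viewed as a grand challenge for artificial intelligence—require a precise and sophisticated lookahead in vast search spaces. Fully general methods have not previously achieved human-level performance in these domains.
【Translation】Artificial intelligence has made great progress with supervised learning systems trained to reproduce the decisions of human experts. However, expert data sets are often expensive, unreliable or simply unavailable. Even when reliable data sets exist, they may put a ceiling on the performance of systems trained this way. By contrast, reinforcement learning systems are trained from their own experience, so in principle they can exceed human ability and operate in domains where human expertise is lacking. Recently, deep neural networks trained by reinforcement learning have made rapid progress towards this goal. These systems have outperformed humans in computer games such as Atari and in 3D virtual environments. However, the domains that are most challenging for human intellect, such as Go, which is widely regarded as a grand challenge for artificial intelligence, require precise and sophisticated lookahead (reading several moves ahead) in vast search spaces. Fully general methods have not previously reached human-level performance in these domains.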
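【Original】AlphaGo was the first program to achieve superhuman performance in Go. The published version, which we refer to as AlphaGo Fan, defeated the European champion Fan Hui in October 2015. AlphaGo Fan used two deep neural networks: a policy network that outputs move probabilities and a value network that outputs a position evaluation. The policy network was trained initially by supervised learning to accurately predict human expert moves, and was subsequently refined by policy-gradient reinforcement learning. The value network was trained to predict the winner of games played by the policy network against itself. Once trained, these networks were combined with a Monte Carlo tree search (MCTS) to provide a lookahead search, using the policy network to narrow down the search to high-probability moves, and using the value network (in conjunction with Monte Carlo rollouts using a fast rollout policy) to evaluate positions in the tree. A subsequent version, which we refer to as AlphaGo Lee, used a similar approach, and defeated Lee Sedol, the winner of 18 international titles, in March 2016.
【Translation】AlphaGo was the first program to achieve superhuman performance at Go. The previously published version, which we call AlphaGo Fan, defeated the European champion Fan Hui (head coach of the French national Go team) in October 2015. AlphaGo Fan used two deep neural networks: a policy network that outputs the probability of each next move, and a value network that outputs an evaluation of the position, that is, the chance of winning from it. The policy network was first trained by supervised learning to predict the moves of human professional players accurately, and was then refined by policy-gradient reinforcement learning. The value network was trained to predict the winner of games that the policy network played against itself. Once trained, these networks were combined with Monte Carlo tree search (MCTS) to look ahead: the policy network narrows the search to high-probability moves, and the value network, together with Monte Carlo rollouts that use a fast rollout policy, evaluates positions in the tree. A later version, which we call AlphaGo Lee, used a similar approach and in March 2016 defeated Lee Sedol, the winner of 18 international titles.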
【Original】Our program, AlphaGo Zero, differs from AlphaGo Fan and AlphaGo Lee in several important aspects. First and foremost, it is trained solely by self-play reinforcement learning, starting from random play, without any supervision or use of human data. Second, it uses only the black and white stones from the board as input features. Third, it uses a single neural network, rather than separate policy and value networks. Finally, it uses a simpler tree search that relies upon this single neural network to evaluate positions and sample moves, without performing any Monte Carlo rollouts. To achieve these results, we introduce a new reinforcement learning algorithm that incorporates lookahead search inside the training loop, resulting in rapid improvement and precise and stable learning. Further technical differences in the search algorithm, training procedure and network architecture are described in Methods.

【Translation】Our current program, AlphaGo Zero, differs from AlphaGo Fan and AlphaGo Lee in several respects. First, it is trained solely by self-play reinforcement learning, starting from random play, with no supervision and no human data. Second, it uses only the black and white stones on the board as input features. Third, it uses a single neural network rather than separate policy and value networks. Finally, it uses a simpler tree search that relies on this single network to evaluate positions and sample moves, without any Monte Carlo rollouts. To achieve these results, we introduce a new reinforcement learning algorithm that performs lookahead search inside the training loop, which gives rapid improvement and precise, stable learning. Further technical differences in the search algorithm, training procedure and network architecture are described in Methods.

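【Original】Reinforcement learning in AlphaGo Zero. Our new method uses a deep neural network $f_\theta$ with parameters $\theta$. This neural network takes as an input the raw board representation $s$ of the position and its history, and outputs both move probabilities and a value, $(p, v) = f_\theta(s)$. The vector of move probabilities $p$ represents the probability of selecting each move $a$ (including pass), $p_a = \Pr(a \mid s)$. The value $v$ is a scalar evaluation, estimating the probability of the current player winning from position $s$. This neural network combines the roles of both policy network and value network into a single architecture. The neural network consists of many residual blocks of convolutional layers with batch normalization and rectifier nonlinearities (see Methods).

【Translation】Reinforcement learning in AlphaGo Zero. Our new method uses a deep neural network $f_\theta$ with parameters $\theta$. The network takes the raw board representation $s$ of the position and its history as input, and outputs both move probabilities and a value, $(p, v) = f_\theta(s)$. The vector of move probabilities $p$ gives the probability of selecting each move $a$ (including pass), $p_a = \Pr(a \mid s)$. The value $v$ is a scalar evaluation that estimates the probability of the current player winning from position $s$. This network merges the policy network and the value network into a single architecture. It consists of many residual blocks of convolutional layers with batch normalization and rectifier nonlinearities (see Methods).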
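【Original】The neural network in AlphaGo Zero is trained from games of self-play by a novel reinforcement learning algorithm. In each position $s$, an MCTS search is executed, guided by the neural network $f_\theta$. The MCTS search outputs probabilities $\pi$ of playing each move. These search probabilities usually select much stronger moves than the raw move probabilities $p$ of the neural network $f_\theta(s)$; MCTS may therefore be viewed as a powerful policy improvement operator (refs 20, 21). Self-play with search—using the improved MCTS-based policy to select each move, then using the game winner $z$ as a sample of the value—may be viewed as a powerful policy evaluation operator. The main idea of our reinforcement learning algorithm is to use these search operators repeatedly in a policy iteration procedure (refs 22, 23): the neural network’s parameters are updated to make the move probabilities and value $(p, v) = f_\theta(s)$ more closely match the improved search probabilities and self-play winner $(\pi, z)$; these new parameters are used in the next iteration of self-play to make the search even stronger. Figure 1 illustrates the self-play training pipeline.

【Translation】The neural network in AlphaGo Zero is trained from games of self-play by a new reinforcement learning algorithm. In each position $s$, an MCTS search is run under the guidance of the neural network $f_\theta$. The MCTS search outputs a probability distribution $\pi$ over moves. These search probabilities usually select much stronger moves than the raw move probabilities $p$ output by the network, so MCTS can be viewed as a powerful policy improvement operator. Self-play with search, in which the improved MCTS-based policy selects each move and the game winner $z$ is then used as a sample of the value, can be viewed as a powerful policy evaluation operator. The main idea of our reinforcement learning algorithm is to use these search operators repeatedly in a policy iteration procedure: the network parameters are updated so that the move probabilities and value $(p, v)$ match the improved search probabilities and self-play winner $(\pi, z)$ more closely, and the new parameters are used in the next iteration of self-play to make the search even stronger. Figure 1 illustrates the self-play training pipeline.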

【Original】Figure 1 | Self-play reinforcement learning in AlphaGo Zero.

【Translation】Figure 1 | Self-play reinforcement learning in AlphaGo Zero.

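【Original】a, The program plays a game $s_1, \ldots, s_T$ against itself. In each position $s_t$, an MCTS $\alpha_\theta$ is executed (see Fig. 2) using the latest neural network $f_\theta$. Moves are selected according to the search probabilities computed by the MCTS, $a_t \sim \pi_t$. The terminal position $s_T$ is scored according to the rules of the game to compute the game winner $z$.
【Translation】a, The program plays a game $s_1, \ldots, s_T$ against itself. In each position $s_t$, an MCTS $\alpha_\theta$ is run (see Fig. 2) using the latest neural network $f_\theta$. Moves are chosen according to the search probabilities computed by the MCTS, $a_t \sim \pi_t$. The terminal position $s_T$ is scored under the rules of the game to determine the game winner $z$.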
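The labelling step in panel a can be illustrated with a small Python sketch. The function name `label_positions`, the `black_won` flag and the convention that Black moves at even time-steps are assumptions made for this illustration; they are not taken from the paper.

```python
def label_positions(states, search_probs, black_won):
    """Attach the game outcome z_t to every stored position of one game.

    Assumption: Black moves at even time-steps (t = 0, 2, ...) and White at
    odd ones, so the outcome flips sign on every move.
    """
    outcome_for_black = 1 if black_won else -1
    examples = []
    for t, (state, pi) in enumerate(zip(states, search_probs)):
        # z_t is the final result seen from the player to move at step t.
        z = outcome_for_black if t % 2 == 0 else -outcome_for_black
        examples.append((state, pi, z))
    return examples
```

Each returned tuple pairs a position with its search probabilities and the outcome from the mover's point of view, which is the form of training data the caption describes.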
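【Original】b, Neural network training in AlphaGo Zero. The neural network takes the raw board position $s_t$ as its input, passes it through many convolutional layers with parameters $\theta$, and outputs both a vector $p_t$, representing a probability distribution over moves, and a scalar value $v_t$, representing the probability of the current player winning in position $s_t$. The neural network parameters $\theta$ are updated to maximize the similarity of the policy vector $p_t$ to the search probabilities $\pi_t$, and to minimize the error between the predicted winner $v_t$ and the game winner $z$ (see equation (1)). The new parameters are used in the next iteration of self-play as in a.
【Translation】b, Neural network training in AlphaGo Zero. The network takes the raw board position $s_t$ as input, passes it through many convolutional layers with parameters $\theta$, and outputs a vector $p_t$, a probability distribution over moves, together with a scalar value $v_t$, the probability that the current player wins from position $s_t$. The parameters $\theta$ are updated to maximize the similarity between the policy vector $p_t$ and the search probabilities $\pi_t$, and to minimize the error between the predicted winner $v_t$ and the game winner $z$ (see equation (1)). The new parameters are used in the next iteration of self-play, as in a.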
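【Original】The MCTS uses the neural network $f_\theta$ to guide its simulations (see Fig. 2). Each edge $(s, a)$ in the search tree stores a prior probability $P(s, a)$, a visit count $N(s, a)$, and an action value $Q(s, a)$. Each simulation starts from the root state and iteratively selects moves that maximize an upper confidence bound $Q(s, a) + U(s, a)$, where $U(s, a) \propto P(s, a) / (1 + N(s, a))$ (refs 12, 24), until a leaf node $s'$ is encountered. This leaf position is expanded and evaluated only once by the network to generate both prior probabilities and evaluation, $(P(s', \cdot), V(s')) = f_\theta(s')$. Each edge $(s, a)$ traversed in the simulation is updated to increment its visit count $N(s, a)$, and to update its action value to the mean evaluation over these simulations, $Q(s, a) = \frac{1}{N(s, a)} \sum_{s' \mid s, a \to s'} V(s')$, where $s, a \to s'$ indicates that a simulation eventually reached $s'$ after taking move $a$ from position $s$.

【Translation】The MCTS uses the neural network $f_\theta$ to guide its simulations (see Fig. 2). Each edge $(s, a)$ in the search tree stores a prior probability $P(s, a)$, a visit count $N(s, a)$ and an action value $Q(s, a)$. Each simulation starts at the root and repeatedly selects the move that maximizes the upper confidence bound $Q(s, a) + U(s, a)$, where $U(s, a) \propto P(s, a) / (1 + N(s, a))$ (refs 12, 24), until it reaches a leaf node $s'$. The leaf position is expanded and evaluated only once by the network, which produces the prior probabilities and the value, $(P(s', \cdot), V(s')) = f_\theta(s')$. Every edge $(s, a)$ traversed in the simulation is then updated: its visit count $N(s, a)$ is incremented and its action value is set to the mean evaluation over these simulations, $Q(s, a) = \frac{1}{N(s, a)} \sum_{s' \mid s, a \to s'} V(s')$, where $s, a \to s'$ means that a simulation eventually reached $s'$ after taking move $a$ from position $s$.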
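The selection and backup rules described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class and function names are hypothetical, the constant `c` stands in for the unspecified constant of proportionality in $U(s, a)$, an unvisited edge is given $Q = 0$, and the sign handling needed for two alternating players is left out.

```python
class Edge:
    """Statistics stored on one edge (s, a) of the search tree."""

    def __init__(self, prior):
        self.P = prior  # prior probability P(s, a) from the network
        self.N = 0      # visit count N(s, a)
        self.W = 0.0    # sum of leaf evaluations V(s') seen through this edge

    @property
    def Q(self):
        # Mean evaluation; taken as 0 for an unvisited edge (an assumption).
        return self.W / self.N if self.N else 0.0


def select_action(edges, c=1.0):
    """Pick the action maximising Q(s, a) + U(s, a), with U = c * P / (1 + N)."""
    return max(edges, key=lambda a: edges[a].Q + c * edges[a].P / (1 + edges[a].N))


def backup(path, leaf_value):
    """Update every edge traversed in one simulation with the leaf evaluation."""
    for edge in path:
        edge.N += 1
        edge.W += leaf_value
```

With priors 0.7 and 0.3 the first selection follows the prior; after one poor evaluation is backed up through the favoured edge, its mean value drops and its bonus shrinks, so the search turns to the other move.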

【Original】Figure 2 | MCTS in AlphaGo Zero.

【Translation】Figure 2 | MCTS in AlphaGo Zero.

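【Original】a, Each simulation traverses the tree by selecting the edge with maximum action value $Q$, plus an upper confidence bound $U$ that depends on a stored prior probability $P$ and visit count $N$ for that edge (which is incremented once traversed).
【Translation】a, Each simulation traverses the tree by selecting the edge with the largest action value $Q$ plus an upper confidence bound $U$, which depends on the stored prior probability $P$ and the visit count $N$ of that edge (incremented once the edge is traversed).
【Original】b, The leaf node is expanded and the associated position $s$ is evaluated by the neural network, $(P(s, \cdot), V(s)) = f_\theta(s)$; the vector of $P$ values are stored in the outgoing edges from $s$.
【Translation】b, The leaf node is expanded and the associated position $s$ is evaluated by the neural network, $(P(s, \cdot), V(s)) = f_\theta(s)$; the vector of $P$ values is stored in the outgoing edges from $s$.
【Original】c, Action value $Q$ is updated to track the mean of all evaluations $V$ in the subtree below that action.
【Translation】c, The action value $Q$ is updated to track the mean of all evaluations $V$ in the subtree below that action.
【Original】d, Once the search is complete, search probabilities $\pi$ are returned, proportional to $N^{1/\tau}$, where $N$ is the visit count of each move from the root state and $\tau$ is a parameter controlling temperature.
【Translation】d, Once the search is complete, the search probabilities $\pi$ are returned, proportional to $N^{1/\tau}$, where $N$ is the visit count of each move from the root and $\tau$ is a temperature parameter.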
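【Original】MCTS may be viewed as a self-play algorithm that, given neural network parameters $\theta$ and a root position $s$, computes a vector of search probabilities recommending moves to play, $\pi = \alpha_\theta(s)$, proportional to the exponentiated visit count for each move, $\pi_a \propto N(s, a)^{1/\tau}$, where $\tau$ is a temperature parameter.
【Translation】MCTS can be viewed as a self-play algorithm that, given neural network parameters $\theta$ and a root position $s$, computes a vector of search probabilities recommending moves to play, $\pi = \alpha_\theta(s)$, proportional to the exponentiated visit count of each move, $\pi_a \propto N(s, a)^{1/\tau}$, where $\tau$ is a temperature parameter.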
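As a minimal sketch of the relation $\pi_a \propto N(s, a)^{1/\tau}$, the function below turns root visit counts into search probabilities. The function name is hypothetical, and treating $\tau = 0$ as the greedy limit (all probability on the most visited move, first one on ties) is an assumption of this sketch; it also assumes at least one visit has been recorded.

```python
def search_probabilities(visit_counts, tau=1.0):
    """Return pi with pi_a proportional to N_a ** (1 / tau)."""
    if tau == 0:
        # Limit tau -> 0: put all probability on the most visited move.
        best = max(range(len(visit_counts)), key=lambda a: visit_counts[a])
        return [1.0 if a == best else 0.0 for a in range(len(visit_counts))]
    weights = [n ** (1.0 / tau) for n in visit_counts]
    total = sum(weights)
    return [w / total for w in weights]
```

With $\tau = 1$ the probabilities are simply the normalised visit counts; lowering $\tau$ sharpens the distribution towards the most visited move.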

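【Original】The neural network is trained by a self-play reinforcement learning algorithm that uses MCTS to play each move. First, the neural network is initialized to random weights $\theta_0$. At each subsequent iteration $i \ge 1$, games of self-play are generated (Fig. 1a). At each time-step $t$, an MCTS search $\pi_t = \alpha_{\theta_{i-1}}(s_t)$ is executed using the previous iteration of neural network $f_{\theta_{i-1}}$ and a move is played by sampling the search probabilities $\pi_t$. A game terminates at step $T$ when both players pass, when the search value drops below a resignation threshold or when the game exceeds a maximum length; the game is then scored to give a final reward of $r_T \in \{-1, +1\}$ (see Methods for details). The data for each time-step $t$ is stored as $(s_t, \pi_t, z_t)$, where $z_t = \pm r_T$ is the game winner from the perspective of the current player at step $t$. In parallel (Fig. 1b), new network parameters $\theta_i$ are trained from data $(s, \pi, z)$ sampled uniformly among all time-steps of the last iteration(s) of self-play. The neural network $(p, v) = f_{\theta_i}(s)$ is adjusted to minimize the error between the predicted value $v$ and the self-play winner $z$, and to maximize the similarity of the neural network move probabilities $p$ to the search probabilities $\pi$. Specifically, the parameters $\theta$ are adjusted by gradient descent on a loss function $l$ that sums over the mean-squared error and cross-entropy losses, respectively:

$$(p, v) = f_\theta(s) \quad \text{and} \quad l = (z - v)^2 - \pi^{\mathsf{T}} \log p + c \lVert \theta \rVert^2 \qquad (1)$$

where $c$ is a parameter controlling the level of L2 weight regularization (to prevent overfitting).

【Translation】The neural network is trained by a self-play reinforcement learning algorithm that uses MCTS to play each move. First, the network is initialized to random weights $\theta_0$. At each subsequent iteration $i \ge 1$, games of self-play are generated (Fig. 1a). At each time-step $t$, an MCTS search $\pi_t = \alpha_{\theta_{i-1}}(s_t)$ is run using the network from the previous iteration, $f_{\theta_{i-1}}$, and a move is played by sampling from the search probabilities $\pi_t$. A game ends at step $T$ when both players pass, when the search value falls below a resignation threshold, or when the game exceeds a maximum length; the game is then scored to give a final reward $r_T \in \{-1, +1\}$ (see Methods for details). The data for each time-step $t$ is stored as $(s_t, \pi_t, z_t)$, where $z_t = \pm r_T$ is the game winner from the point of view of the player to move at step $t$. In parallel (Fig. 1b), new network parameters $\theta_i$ are trained on data $(s, \pi, z)$ sampled uniformly from all time-steps of the last iteration(s) of self-play. The network $(p, v) = f_{\theta_i}(s)$ is adjusted to minimize the error between the predicted value $v$ and the self-play winner $z$, and to maximize the similarity between the network's move probabilities $p$ and the search probabilities $\pi$. Specifically, the parameters $\theta$ are adjusted by gradient descent on a loss function $l$ that sums a mean-squared error term and a cross-entropy term:

$$(p, v) = f_\theta(s) \quad \text{and} \quad l = (z - v)^2 - \pi^{\mathsf{T}} \log p + c \lVert \theta \rVert^2 \qquad (1)$$

where $c$ is a parameter controlling the level of L2 weight regularization (to prevent overfitting).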
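To make the three terms of the loss concrete, here is a small Python sketch that evaluates it for a single training example. The function name, the flat list `theta` standing in for the network weights, and the default value of `c` are illustrative assumptions; a real training loop would compute this over mini-batches inside an automatic-differentiation framework.

```python
import math


def alphago_zero_loss(p, v, pi, z, theta, c=1e-4):
    """Loss for one example: (z - v)^2 - pi^T log p + c * ||theta||^2."""
    value_loss = (z - v) ** 2
    # Cross-entropy between search probabilities pi and network probabilities p.
    # Terms with pi_a == 0 contribute nothing and are skipped to avoid log(0).
    policy_loss = -sum(pi_a * math.log(p_a) for pi_a, p_a in zip(pi, p) if pi_a > 0)
    l2_penalty = c * sum(w * w for w in theta)
    return value_loss + policy_loss + l2_penalty
```

For example, with a uniform two-move policy `p = [0.5, 0.5]`, search target `pi = [1.0, 0.0]`, prediction `v = 0` and outcome `z = 1`, the value term is 1 and the policy term is $\ln 2$.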

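【Original】Empirical analysis of AlphaGo Zero training
We applied our reinforcement learning pipeline to train our program AlphaGo Zero. Training started from completely random behavior and continued without human intervention for approximately three days. Over the course of training, 4.9 million games of self-play were generated, using 1,600 simulations for each MCTS, which corresponds to approximately 0.4 s thinking time per move. Parameters were updated from 700,000 mini-batches of 2,048 positions. The neural network contained 20 residual blocks.
Empirical analysis of AlphaGo Zero training
【Translation】We applied our reinforcement learning pipeline to train AlphaGo Zero. Training started from completely random play and ran without human intervention for about three days. Over the course of training, 4.9 million games of self-play were generated, with 1,600 simulations per MCTS, which corresponds to about 0.4 s of thinking time per move. The parameters were updated from 700,000 mini-batches of 2,048 positions each. The neural network contained 20 residual blocks.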

【Original】Figure 3a shows the performance of AlphaGo Zero during self-play reinforcement learning, as a function of training time, on an Elo scale (ref. 25). Learning progressed smoothly throughout training, and did not suffer from the oscillations or catastrophic forgetting that have been suggested in previous literature. Surprisingly, AlphaGo Zero outperformed AlphaGo Lee after just 36 h. In comparison, AlphaGo Lee was trained over several months. After 72 h, we evaluated AlphaGo Zero against the exact version of AlphaGo Lee that defeated Lee Sedol, under the same 2 h time controls and match conditions that were used in the man–machine match in Seoul (see Methods). AlphaGo Zero used a single machine with 4 tensor processing units (TPUs; ref. 29), whereas AlphaGo Lee was distributed over many machines and used 48 TPUs. AlphaGo Zero defeated AlphaGo Lee by 100 games to 0 (see Extended Data Fig. 1 and Supplementary Information).

【Translation】Figure 3a shows the performance of AlphaGo Zero during self-play reinforcement learning on an Elo scale, with training time on the horizontal axis. Learning progressed smoothly throughout training and did not suffer from the oscillations or catastrophic forgetting suggested in earlier literature. Surprisingly, AlphaGo Zero outperformed AlphaGo Lee after only 36 hours of training; by comparison, AlphaGo Lee was trained over several months. After 72 hours, we evaluated AlphaGo Zero against the exact version of AlphaGo Lee that defeated Lee Sedol, under the same 2-hour time controls and match conditions as the match in Seoul (see Methods). AlphaGo Zero ran on a single machine with 4 TPUs, whereas AlphaGo Lee was distributed over many machines and used 48 TPUs. AlphaGo Zero defeated AlphaGo Lee by 100 games to 0 (see Extended Data Fig. 1 and Supplementary Information).

【Original】Figure 3 | Empirical evaluation of AlphaGo Zero.

【Translation】Figure 3 | Empirical evaluation of AlphaGo Zero.

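【Original】a, Performance of self-play reinforcement learning. The plot shows the performance of each MCTS player $\alpha_{\theta_i}$ from each iteration $i$ of reinforcement learning in AlphaGo Zero. Elo ratings were computed from evaluation games between different players, using 0.4 s of thinking time per move (see Methods). For comparison, a similar player trained by supervised learning from human data, using the KGS dataset, is also shown.
【Translation】a, Performance of self-play reinforcement learning. The plot shows the performance of each MCTS player $\alpha_{\theta_i}$ from each iteration $i$ of reinforcement learning in AlphaGo Zero. Elo ratings were computed from evaluation games between the different players, with 0.4 s of thinking time per move (see Methods). For comparison, a similar player trained by supervised learning on human data from the KGS dataset is also shown.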
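【Original】b, Prediction accuracy on human professional moves. The plot shows the accuracy of the neural network $f_{\theta_i}$, at each iteration of self-play $i$, in predicting human professional moves from the GoKifu dataset. The accuracy measures the percentage of positions in which the neural network assigns the highest probability to the human move. The accuracy of a neural network trained by supervised learning is also shown.

【Translation】b, Prediction accuracy on human professional moves. The plot shows how accurately the neural network $f_{\theta_i}$, at each iteration $i$ of self-play, predicts human professional moves from the GoKifu dataset. The accuracy is the percentage of positions in which the network assigns the highest probability to the human move. The accuracy of a neural network trained by supervised learning is also shown.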

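【Original】c, Mean-squared error (MSE) of human professional game outcomes. The plot shows the MSE of the neural network $f_{\theta_i}$, at each iteration of self-play $i$, in predicting the outcome of human professional games from the GoKifu dataset. The MSE is between the actual outcome $z \in \{-1, +1\}$ and the neural network value $v$, scaled by a factor of 1/4 to the range of 0–1. The MSE of a neural network trained by supervised learning is also shown.

【Translation】c, Mean-squared error (MSE) on human professional game outcomes. The plot shows the MSE of the neural network $f_{\theta_i}$, at each iteration $i$ of self-play, when predicting the outcome of human professional games from the GoKifu dataset. The MSE is taken between the actual outcome $z \in \{-1, +1\}$ and the network value $v$, and is scaled by a factor of 1/4 to the range 0–1. The MSE of a neural network trained by supervised learning is also shown.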
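The factor of 1/4 follows from the range of the squared error. As a short worked bound, assuming the network value satisfies $v \in [-1, 1]$ (an assumption here; the caption itself only states the range of $z$):

$$z \in \{-1, +1\},\quad v \in [-1, 1] \;\Rightarrow\; |z - v| \le 2 \;\Rightarrow\; 0 \le \tfrac{1}{4}(z - v)^2 \le 1.$$

The worst case, a confident prediction of the wrong winner ($v = -z$), gives a scaled error of exactly 1, and an uninformative prediction $v = 0$ gives 0.25.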

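【Original】To assess the merits of self-play reinforcement learning, compared to learning from human data, we trained a second neural network (using the same architecture) to predict expert moves in the KGS Server dataset; this achieved state-of-the-art prediction accuracy compared to previous work (refs 12, 30–33) (see Extended Data Tables 1 and 2 for current and previous results, respectively). Supervised learning achieved a better initial performance, and was better at predicting human professional moves (Fig. 3). Notably, although supervised learning achieved higher move prediction accuracy, the self-learned player performed much better overall, defeating the human-trained player within the first 24 h of training. This suggests that AlphaGo Zero may be learning a strategy that is qualitatively different to human play.

【Translation】To assess the merits of self-play reinforcement learning relative to learning from human data, we trained a second neural network (with the same architecture) to predict expert moves in the KGS Server dataset; it achieved state-of-the-art prediction accuracy compared with previous work (refs 12, 30–33) (see Extended Data Tables 1 and 2 for current and previous results, respectively). Supervised learning performed better at the start and was better at predicting human professional moves (Fig. 3). Notably, although supervised learning reached higher move prediction accuracy, the self-taught player performed much better overall and defeated the player trained on human data within the first 24 hours of training. This suggests that AlphaGo Zero may be learning a strategy that is qualitatively different from human play.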

【Original】To separate the contributions of architecture and algorithm, we compared the performance of the neural network architecture in AlphaGo Zero with the previous neural network architecture used in AlphaGo Lee (see Fig. 4). Four neural networks were created, using either separate policy and value networks, as were used in AlphaGo Lee, or combined policy and value networks, as used in AlphaGo Zero; and using either the convolutional network architecture from AlphaGo Lee or the residual network architecture from AlphaGo Zero. Each network was trained to minimize the same loss function (equation (1)), using a fixed dataset of self-play games generated by AlphaGo Zero after 72 h of self-play training. Using a residual network was more accurate, achieved lower error and improved performance in AlphaGo by over 600 Elo. Combining policy and value together into a single network slightly reduced the move prediction accuracy, but reduced the value error and boosted playing performance in AlphaGo by around another 600 Elo. This is partly due to improved computational efficiency, but more importantly the dual objective regularizes the network to a common representation that supports multiple use cases.
【Translation】To separate the contributions of architecture and algorithm, we compared the performance of the neural network architecture in AlphaGo Zero with the architecture previously used in AlphaGo Lee (see Fig. 4). Four neural networks were created, using either separate policy and value networks, as in AlphaGo Lee, or a combined policy and value network, as in AlphaGo Zero; and using either the convolutional architecture from AlphaGo Lee or the residual architecture from AlphaGo Zero. Each network was trained to minimize the same loss function (equation (1)) on a fixed dataset of self-play games generated by AlphaGo Zero after 72 hours of self-play training. The residual network was more accurate, achieved lower error and improved AlphaGo's performance by over 600 Elo. Combining policy and value into a single network slightly reduced move prediction accuracy, but reduced the value error and raised AlphaGo's playing performance by around another 600 Elo. This is partly due to better computational efficiency, but more importantly the dual objective regularizes the network towards a common representation that supports multiple use cases.
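What is Elo?
The Elo rating system is a method, created by the Hungarian-American physicist Arpad Elo, for measuring skill level in competitive games; it is now the widely accepted standard for rating players.
How did Elo come about?

Originally, the Elo rating system was a statistics-based method for rating chess players. It was later widely adopted in chess, Go, football, basketball and other sports. The competitive matchmaking systems of online games such as League of Legends and World of Warcraft also use this rating system, as do many Destiny websites today.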

【Original】Figure 4 | Comparison of neural network architectures in AlphaGo Zero and AlphaGo Lee.

【Translation】Figure 4 | Comparison of neural network architectures in AlphaGo Zero and AlphaGo Lee.

【Original】Comparison of neural network architectures using either separate (sep) or combined policy and value (dual) networks, and using either convolutional (conv) or residual (res) networks. The combinations ‘dual–res’ and ‘sep–conv’ correspond to the neural network architectures used in AlphaGo Zero and AlphaGo Lee, respectively. Each network was trained on a fixed dataset generated by a previous run of AlphaGo Zero.
【Translation】Comparison of neural network architectures that use either separate (sep) or combined (dual) policy and value networks, and either convolutional (conv) or residual (res) networks. The combinations ‘dual–res’ and ‘sep–conv’ correspond to the architectures used in AlphaGo Zero and AlphaGo Lee, respectively. Each network was trained on a fixed dataset generated by a previous run of AlphaGo Zero.
【Original】a, Each trained network was combined with AlphaGo Zero’s search to obtain a different player. Elo ratings were computed from evaluation games between these different players, using 5 s of thinking time per move.
b, Prediction accuracy on human professional moves (from the GoKifu dataset) for each network architecture.
c, MSE of human professional game outcomes (from the GoKifu dataset) for each network architecture.
【Translation】a, Each trained network was combined with AlphaGo Zero's search to obtain a different player. Elo ratings were computed from evaluation games between these players, with 5 s of thinking time per move.
b, Prediction accuracy on human professional moves (from the GoKifu dataset) for each network architecture.
c, MSE on human professional game outcomes (from the GoKifu dataset) for each network architecture.
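【Original】Knowledge learned by AlphaGo Zero
AlphaGo Zero discovered a remarkable level of Go knowledge during its self-play training process. This included not only fundamental elements of human Go knowledge, but also non-standard strategies beyond the scope of traditional Go knowledge.
【Translation】Knowledge learned by AlphaGo Zero
AlphaGo Zero discovered a remarkable amount of Go knowledge during its self-play training. This included not only the fundamental elements of human Go knowledge, but also non-standard strategies beyond the scope of traditional Go knowledge.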

【Original】Figure 5 shows a timeline indicating when professional joseki (corner sequences) were discovered (Fig. 5a and Extended Data Fig. 2); ultimately AlphaGo Zero preferred new joseki variants that were previously unknown (Fig. 5b and Extended Data Fig. 3). Figure 5c shows several fast self-play games played at different stages of training (see Supplementary Information). Tournament-length games played at regular intervals throughout training are shown in Extended Data Fig. 4 and in the Supplementary Information. AlphaGo Zero rapidly progressed from entirely random moves towards a sophisticated understanding of Go concepts, including fuseki (opening), tesuji (tactics), life-and-death, ko (repeated board situations), yose (endgame), capturing races, sente (initiative), shape, influence and territory, all discovered from first principles. Surprisingly, shicho (‘ladder’ capture sequences that may span the whole board)—one of the first elements of Go knowledge learned by humans—were only understood by AlphaGo Zero much later in training.

【Translation】Figure 5 shows a timeline of when professional joseki (corner sequences) were discovered (Fig. 5a and Extended Data Fig. 2); in the end AlphaGo Zero preferred new joseki variants that were previously unknown (Fig. 5b and Extended Data Fig. 3). Figure 5c shows several fast self-play games played at different stages of training (see Supplementary Information). Tournament-length games played at regular intervals throughout training are shown in Extended Data Fig. 4 and in the Supplementary Information. AlphaGo Zero progressed rapidly from entirely random moves to a sophisticated understanding of Go concepts, including fuseki (opening), tesuji (tactics), life-and-death, ko (repeated board situations), yose (endgame), capturing races, sente (initiative), shape, influence and territory, all discovered from first principles. Surprisingly, shicho (‘ladder’ capture sequences that may span the whole board), one of the first pieces of Go knowledge that humans learn, was understood by AlphaGo Zero only much later in training.

【Original】Figure 5 | Go knowledge learned by AlphaGo Zero.

【Translation】Figure 5 | Go knowledge learned by AlphaGo Zero.

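【Original】a, Five human joseki (common corner sequences) discovered during AlphaGo Zero training. The associated timestamps indicate the first time each sequence occurred (taking account of rotation and reflection) during self-play training. Extended Data Figure 2 provides the frequency of occurrence over training for each sequence.
【Translation】a, Five human joseki (common corner sequences) discovered during AlphaGo Zero training. The associated timestamps show the first time each sequence occurred during self-play training (taking rotation and reflection into account). Extended Data Figure 2 gives the frequency of occurrence of each sequence over training.
【Original】b, Five joseki favoured at different stages of self-play training. Each displayed corner sequence was played with the greatest frequency, among all corner sequences, during an iteration of self-play training. The timestamp of that iteration is indicated on the timeline. At 10 h a weak corner move was preferred. At 47 h the 3–3 invasion was most frequently played. This joseki is also common in human professional play; however, AlphaGo Zero later discovered and preferred a new variation. Extended Data Figure 3 provides the frequency of occurrence over time for all five sequences and the new variation.
【Translation】b, Five joseki favoured at different stages of self-play training. Each corner sequence shown was the most frequently played of all corner sequences during one iteration of self-play training. The timestamp of that iteration is marked on the timeline. At 10 h a weak corner move was preferred. At 47 h the 3–3 invasion was played most often. This joseki is also common in human professional play; however, AlphaGo Zero later discovered and preferred a new variation. Extended Data Figure 3 gives the frequency over time of all five sequences and the new variation.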

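【Original】c, The first 80 moves of three self-play games that were played at different stages of training, using 1,600 simulations (around 0.4 s) per search. At 3 h, the game focuses greedily on capturing stones, much like a human beginner. At 19 h, the game exhibits the fundamentals of life-and-death, influence and territory. At 70 h, the game is remarkably balanced, involving multiple battles and a complicated ko fight, eventually resolving into a half-point win for white. See Supplementary Information for the full game.

【Translation】c, The first 80 moves of three self-play games played at different stages of training, with 1,600 simulations (around 0.4 s) per search. At 3 h, the game focuses greedily on capturing stones, much like a human beginner. At 19 h, the game shows the fundamentals of life-and-death, influence and territory. At 70 h, the game is remarkably balanced, with multiple battles and a complicated ko fight, and ends in a half-point win for white. See Supplementary Information for the full game.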

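【Original】Final performance of AlphaGo Zero

We subsequently applied our reinforcement learning pipeline to a second instance of AlphaGo Zero using a larger neural network and over a longer duration. Training again started from completely random behaviour and continued for approximately 40 days.

Over the course of training, 29 million games of self-play were generated. Parameters were updated from 3.1 million mini-batches of 2,048 positions each. The neural network contained 40 residual blocks. The learning curve is shown in Fig. 6a. Games played at regular intervals throughout training are shown in Extended Data Fig. 5 and in the Supplementary Information.

【Translation】Final performance of AlphaGo Zero

We then applied our reinforcement learning pipeline to a second instance of AlphaGo Zero, using a larger neural network and a longer training period. Training again started from completely random behaviour and continued for about 40 days.

Over the course of training, 29 million games of self-play were generated. The parameters were updated from 3.1 million mini-batches of 2,048 positions each. The neural network contained 40 residual blocks. The learning curve is shown in Fig. 6a. Games played at regular intervals throughout training are shown in Extended Data Fig. 5 and in the Supplementary Information.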

【Original】Figure 6 | Performance of AlphaGo Zero.

【Translation】Figure 6 | Performance of AlphaGo Zero.

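【Original】a, Learning curve for AlphaGo Zero using a larger 40-block residual network over 40 days. The plot shows the performance of each player $\alpha_{\theta_i}$ from each iteration $i$ of our reinforcement learning algorithm. Elo ratings were computed from evaluation games between different players, using 0.4 s per search (see Methods).

【Translation】a, Learning curve for AlphaGo Zero using a larger 40-block residual network over 40 days. The plot shows the performance of each player $\alpha_{\theta_i}$ from each iteration $i$ of our reinforcement learning algorithm. Elo ratings were computed from evaluation games between the different players, with 0.4 s per search (see Methods).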

【Original】b, Final performance of AlphaGo Zero. AlphaGo Zero was trained for 40 days using a 40-block residual neural network. The plot shows the results of a tournament between: AlphaGo Zero, AlphaGo Master (defeated top human professionals 60–0 in online games), AlphaGo Lee (defeated Lee Sedol), AlphaGo Fan (defeated Fan Hui), as well as previous Go programs Crazy Stone, Pachi and GnuGo. Each program was given 5 s of thinking time per move. AlphaGo Zero and AlphaGo Master played on a single machine on the Google Cloud; AlphaGo Fan and AlphaGo Lee were distributed over many machines. The raw neural network from AlphaGo Zero is also included, which directly selects the move $a$ with maximum probability $p_a$, without using MCTS. Programs were evaluated on an Elo scale (ref. 25): a 200-point gap corresponds to a 75% probability of winning.

【Translation】b, Final performance of AlphaGo Zero. AlphaGo Zero was trained for 40 days using a 40-block residual neural network. The plot shows the results of a tournament between AlphaGo Zero, AlphaGo Master (which defeated top human professionals 60–0 in online games), AlphaGo Lee (which defeated Lee Sedol), AlphaGo Fan (which defeated Fan Hui), and the earlier Go programs Crazy Stone, Pachi and GnuGo. Each program was given 5 s of thinking time per move. AlphaGo Zero and AlphaGo Master each played on a single machine on the Google Cloud; AlphaGo Fan and AlphaGo Lee were distributed over many machines. The raw neural network from AlphaGo Zero is also included; it selects the move $a$ with maximum probability $p_a$ directly, without using MCTS. Programs were rated on an Elo scale: a 200-point gap corresponds to a 75% probability of winning.
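The link between a rating gap and a win probability can be checked with a short Python sketch. It assumes the standard logistic Elo formula with a scale of 400 points; the function name is hypothetical, and the caption itself only states the 200-point rule of thumb.

```python
def elo_win_probability(rating_a, rating_b):
    """Expected score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
```

Under this assumption a 200-point gap gives a win probability of about 0.76, in line with the roughly 75% quoted in the caption, and equal ratings give 0.5.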

【Original】We evaluated the fully trained AlphaGo Zero using an internal tournament against AlphaGo Fan, AlphaGo Lee and several previous Go programs. We also played games against the strongest existing program, AlphaGo Master—a program based on the algorithm and architecture presented in this paper but using human data and features (see Methods)—which defeated the strongest human professional players 60–0 in online games in January 2017 (ref. 34). In our evaluation, all programs were allowed 5 s of thinking time per move; AlphaGo Zero and AlphaGo Master each played on a single machine with 4 TPUs; AlphaGo Fan and AlphaGo Lee were distributed over 176 GPUs and 48 TPUs, respectively. We also included a player based solely on the raw neural network of AlphaGo Zero; this player simply selected the move with maximum probability.

【Translation】We evaluated the fully trained AlphaGo Zero in an internal tournament against AlphaGo Fan, AlphaGo Lee and several earlier Go programs. We also played it against the strongest existing program, AlphaGo Master, which is based on the algorithm and architecture presented in this paper but uses human data and features (see Methods), and which defeated the strongest human professional players 60–0 in online games in January 2017. In our evaluation, all programs were allowed 5 s of thinking time per move; AlphaGo Zero and AlphaGo Master each played on a single machine with 4 TPUs; AlphaGo Fan and AlphaGo Lee were distributed over 176 GPUs and 48 TPUs, respectively. We also included a player based solely on the raw neural network of AlphaGo Zero, which simply selects the move with maximum probability.

【Original】Figure 6b shows the performance of each program on an Elo scale. The raw neural network, without using any lookahead, achieved an Elo rating of 3,055. AlphaGo Zero achieved a rating of 5,185, compared to 4,858 for AlphaGo Master, 3,739 for AlphaGo Lee and 3,144 for AlphaGo Fan.

【Translation】Figure 6b shows the performance of each program on an Elo scale. The raw neural network, without any lookahead, achieved an Elo rating of 3,055. AlphaGo Zero achieved a rating of 5,185, compared with 4,858 for AlphaGo Master, 3,739 for AlphaGo Lee and 3,144 for AlphaGo Fan.

【Original】Finally, we evaluated AlphaGo Zero head to head against AlphaGo Master in a 100-game match with 2-h time controls. AlphaGo Zero won by 89 games to 11 (see Extended Data Fig. 6 and Supplementary Information).

【Translation】Finally, we evaluated AlphaGo Zero head to head against AlphaGo Master in a 100-game match with 2-hour time controls. AlphaGo Zero won by 89 games to 11 (see Extended Data Fig. 6 and Supplementary Information).

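【Original】Conclusion

Our results comprehensively demonstrate that a pure reinforcement learning approach is fully feasible, even in the most challenging of domains: it is possible to train to superhuman level, without human examples or guidance, given no knowledge of the domain beyond basic rules. Furthermore, a pure reinforcement learning approach requires just a few more hours to train, and achieves much better asymptotic performance, compared to training on human expert data. Using this approach, AlphaGo Zero defeated the strongest previous versions of AlphaGo, which were trained from human data using handcrafted features, by a large margin.

【Translation】Conclusion

Our results show that a pure reinforcement learning approach is fully feasible even in the most challenging domains: it is possible to train to a superhuman level without human examples or guidance and with no domain knowledge beyond the basic rules. Furthermore, compared with training on human expert data, a pure reinforcement learning approach needs only a few more hours of training and reaches much better asymptotic performance. Using this approach, AlphaGo Zero defeated, by a large margin, the strongest previous versions of AlphaGo, which were trained from human data using handcrafted features.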

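【Original】Humankind has accumulated Go knowledge from millions of games played over thousands of years, collectively distilled into patterns, proverbs and books. In the space of a few days, starting tabula rasa, AlphaGo Zero was able to rediscover much of this Go knowledge, as well as novel strategies that provide new insights into the oldest of games.

【Translation】Humankind has accumulated Go knowledge from millions of games played over thousands of years, and has distilled it into patterns, proverbs and books. In just a few days, starting from scratch, AlphaGo Zero was able to rediscover much of this Go knowledge, along with new strategies that offer fresh insights into this oldest of games.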

【原文】METHODS

Reinforcement learning. Policy iteration is a classic algorithm that generates a sequence of improving policies, by alternating between policy evaluation—estimating the value function of the current policy—and policy improvement—using the current value function to generate a better policy. A simple approach to policy evaluation is to estimate the value function from the outcomes of sampled trajectories. A simple approach to policy improvement is to select actions greedily with respect to the value function. In large state spaces, approximations are necessary to evaluate each policy and to represent its improvement.

【翻译】方法

强化学习。策略迭代是一种经典算法，它在“策略评估”（估计当前策略的价值函数）和“策略改善”（利用当前价值函数产生更好的策略）之间交替进行，从而产生一系列不断改进的策略。策略评估的一个简单方法是根据采样轨迹的结果来估计价值函数。策略改善的一个简单方法是依据价值函数贪婪地选择动作。在大的状态空间中，必须借助近似方法来评估每个策略并表示其改进。
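The alternation between policy evaluation and policy improvement described above can be sketched on a tiny tabular problem. This is only an illustration, not the paper's method: the function name `policy_iteration`, the transition table `P`, the reward table `R`, the discount `gamma` and the number of evaluation sweeps are all assumptions made for the example, and the evaluation step uses exact Bellman backups rather than sampled trajectories.

```python
def policy_iteration(P, R, gamma=0.9, eval_sweeps=200):
    """Tabular policy iteration on a small, fully known MDP.

    P[s][a] is a list of next-state probabilities and R[s][a] the expected
    reward; both are hypothetical inputs for this toy illustration.
    """
    n_states, n_actions = len(P), len(P[0])

    def q_value(s, a, V):
        # One-step lookahead: immediate reward plus discounted expected value.
        return R[s][a] + gamma * sum(p * V[s2] for s2, p in enumerate(P[s][a]))

    policy = [0] * n_states
    while True:
        # Policy evaluation: repeated Bellman backups under the current policy.
        V = [0.0] * n_states
        for _ in range(eval_sweeps):
            V = [q_value(s, policy[s], V) for s in range(n_states)]
        # Policy improvement: act greedily with respect to the value function.
        improved = [max(range(n_actions), key=lambda a: q_value(s, a, V))
                    for s in range(n_states)]
        if improved == policy:
            return policy, V
        policy = improved
```

In a two-state example where only staying in state 1 is rewarded, the loop settles after one improvement step on "move to state 1, then stay there". In Go the state space is far too large for such tables, which is why the text says approximations are needed.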

【原文】Classification-based reinforcement learning improves the policy using a simple Monte Carlo search. Many rollouts are executed for each action; the action with the maximum mean value provides a positive training example, while all other actions provide negative training examples; a policy is then trained to classify actions as positive or negative, and used in subsequent rollouts. This may be viewed as a precursor to the policy component of AlphaGo Zero’s training algorithm when τ → 0.

【翻译】基于分类的强化学习利用简单的蒙特卡洛搜索来改进策略。对每个动作执行多次rollout；平均价值最大的动作提供一个正训练样例，而所有其他动作提供负训练样例；随后训练一个策略将动作分类为正例或负例，并将其用于后续的rollout。这可以看作是τ→0时AlphaGo Zero训练算法中策略部分的前身。
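To see why the limit τ → 0 connects the two approaches, recall that the paper defines the search probabilities as proportional to the exponentiated visit counts, π(a) ∝ N(a)^(1/τ). The sketch below is a minimal illustration under that definition; the function name `search_probabilities` and the sample counts are assumptions made for the example.

```python
def search_probabilities(visit_counts, tau):
    """Turn visit counts into move probabilities proportional to N ** (1 / tau)."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    top = max(visit_counts)
    if top <= 0:
        raise ValueError("at least one move must have been visited")
    # Divide by the largest count first so the exponentiation cannot overflow.
    weights = [(n / top) ** (1.0 / tau) for n in visit_counts]
    total = sum(weights)
    return [w / total for w in weights]
```

With τ = 1 the probabilities are simply the normalised counts. As τ shrinks, almost all of the mass moves to the most visited move, so the training target approaches the single positive example used by classification-based methods.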

【原文】A more recent instantiation, classification-based modified policy iteration (CBMPI), also performs policy evaluation by regressing a value function towards truncated rollout values, similar to the value component of AlphaGo Zero; this achieved state-of-the-art results in the game of Tetris. However, this previous work was limited to simple rollouts and linear function approximation using handcrafted features.

【翻译】较新的一个实例是基于分类的修正策略迭代（CBMPI），它还通过将价值函数向截断的rollout值回归来执行策略评估，这类似于AlphaGo Zero的价值部分；该方法在俄罗斯方块游戏中取得了最先进的结果。然而，此前的这项工作仅限于简单的rollout和使用手工特征的线性函数近似。

【原文】The AlphaGo Zero self-play algorithm can similarly be understood as an approximate policy iteration scheme in which MCTS is used for both policy improvement and policy evaluation. Policy improvement starts with a neural network policy, executes an MCTS based on that policy’s recommendations, and then projects the (much stronger) search policy back into the function space of the neural network. Policy evaluation is applied to the (much stronger) search policy: the outcomes of self-play games are also projected back into the function space of the neural network. These projection steps are achieved by training the neural network parameters to match the search probabilities and self-play game outcome respectively.

【翻译】AlphaGo Zero的自我博弈算法同样可以理解为一种近似策略迭代方案，其中MCTS同时用于策略改善和策略评估。策略改善从一个神经网络策略开始，根据该策略的建议执行MCTS，然后将（强得多的）搜索策略投影回神经网络的函数空间。策略评估则应用于这个（强得多的）搜索策略：自我博弈对局的结果同样被投影回神经网络的函数空间。这些投影步骤是通过训练神经网络参数，使其分别匹配搜索概率和自我博弈对局结果来实现的。
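The two projection steps correspond to the two data terms of the paper's training loss, l = (z − v)² − πᵀ log p + c‖θ‖², where p and v are the network's move probabilities and value, π the search probabilities and z the game outcome. A minimal sketch of that loss for a single position follows. The function name `alphago_zero_loss` and the plain-list representation of the parameters are assumptions for illustration, and the default `c` is a placeholder rather than a claim about the authors' setting.

```python
import math


def alphago_zero_loss(p, v, pi, z, theta, c=1e-4):
    """Loss for one position: value error + policy cross-entropy + L2 penalty."""
    # Policy evaluation target: squared error between predicted value and game outcome.
    value_loss = (z - v) ** 2
    # Policy improvement target: cross-entropy between search and network probabilities.
    # Terms with zero search probability contribute nothing and are skipped.
    policy_loss = -sum(t * math.log(q) for t, q in zip(pi, p) if t > 0)
    # L2 regularisation on the parameters (passed here as a flat list of numbers).
    l2_penalty = c * sum(w * w for w in theta)
    return value_loss + policy_loss + l2_penalty
```

Minimising the first term pulls the value head towards the self-play outcome, and minimising the second pulls the policy head towards the search probabilities, which is the sense in which both results are "projected back" into the network.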

【原文】Guo et al. 7 also project the output of MCTS into a neural network, either by regressing a value network towards the search value, or by classifying the action selected by MCTS. This approach was used to train a neural network for playing Atari games; however, the MCTS was fixed—there was no policy iteration—and did not make any use of the trained networks.

【翻译】Guo等人7也将MCTS的输出投影到神经网络中，方法是将价值网络向搜索价值回归，或者对MCTS选择的动作进行分类。该方法曾被用于训练玩Atari游戏的神经网络；然而，其中的MCTS是固定的（没有策略迭代），并且没有利用训练好的网络。

【原文】Self-play reinforcement learning in games. Our approach is most directly applicable to zero-sum games of perfect information. We follow the formalism of alternating Markov games described in previous work12, noting that algorithms based on value or policy iteration extend naturally to this setting39.

【翻译】游戏中的自我博弈强化学习。我们的方法最直接地适用于完全信息的零和博弈。我们沿用先前工作12中描述的交替马尔可夫博弈的形式化框架，并指出基于价值迭代或策略迭代的算法可以自然地推广到这一设定39。

【原文】Self-play reinforcement learning has previously been applied to the game of Go. NeuroGo40,41 used a neural network to represent a value function, using a sophisticated architecture based on Go knowledge regarding connectivity, territory and eyes. This neural network was trained by temporal-difference learning42 to predict territory in games of self-play, building on previous work43. A related approach, RLGO44, represented the value function instead by a linear combination of features, exhaustively enumerating all 3 × 3 patterns of stones; it was trained by temporal-difference learning to predict the winner in games of self-play. Both NeuroGo and RLGO achieved a weak amateur level of play.

【翻译】自我博弈强化学习此前已被应用于围棋。NeuroGo40,41使用神经网络来表示价值函数，并采用了一种基于连接、地盘和眼等围棋知识的复杂架构。该神经网络在此前工作43的基础上，通过时间差分学习42进行训练，用于预测自我博弈对局中的地盘归属。另一个相关的方法RLGO44则改用特征的线性组合来表示价值函数，并穷举了棋子的所有3×3模式；它通过时间差分学习进行训练，用于预测自我博弈对局的胜者。NeuroGo和RLGO都只达到了较弱的业余水平。
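For readers unfamiliar with temporal-difference learning on a linear value function, the textbook TD(0) update can be sketched as below. This is a generic illustration and not a reconstruction of NeuroGo or RLGO: the function name `td0_update`, the feature vectors and the step size `alpha` are assumptions made for the example.

```python
def td0_update(weights, features, next_features, reward, alpha=0.1, gamma=1.0):
    """One TD(0) step for a linear value function V(s) = weights . features(s)."""
    value = sum(w * f for w, f in zip(weights, features))
    next_value = sum(w * f for w, f in zip(weights, next_features))
    # TD error: how much the bootstrapped target disagrees with the current estimate.
    td_error = reward + gamma * next_value - value
    # For a linear model the gradient of V(s) with respect to the weights
    # is just the feature vector of the current state.
    return [w + alpha * td_error * f for w, f in zip(weights, features)]
```

At the end of a game, passing an all-zero `next_features` vector makes the target equal to the final reward, for example +1 for a win and -1 for a loss.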

【原文】MCTS may also be viewed as a form of self-play reinforcement learning45. The nodes of the search tree contain the value function for the positions encountered during search; these values are updated to predict the winner of simulated games of self-play. MCTS programs have previously achieved strong amateur level in Go46,47, but used substantial domain expertise: a fast rollout policy, based on handcrafted features13,48, that evaluates positions by running simulations until the end of the game; and a tree policy, also based on handcrafted features, that selects moves within the search tree47.

【翻译】MCTS也可以看作是一种自我博弈强化学习45。搜索树的节点包含搜索过程中遍历的棋局的价值函数；更新这些值来预测自我博弈的赢家。MCTS程序以前在围棋领域达到了较强的业余段位水平46,47，但使用了大量的领域专业知识：基于手工特征的快速走棋策略，模拟运行直到比赛结束来评价棋局；树策略，也是基于手工制作特征的，在搜索树47中选择落子动作。

【原文】Self-play reinforcement learning approaches have achieved high levels of performance in other games: chess49-51, checkers52, backgammon53, othello54, Scrabble55 and most recently poker56. In all of these examples, a value function was trained by regression54-56 or temporal-difference learning49-53 from training data generated by self-play. The trained value function was used as an evaluation function in an alpha–beta search49-54, a simple Monte Carlo search55,57 or counterfactual regret minimization56. However, these methods used handcrafted input features49-53,56 or handcrafted feature templates54,55. In addition, the learning process used supervised learning to initialize weights58, hand-selected weights for piece values49,51,52, handcrafted restrictions on the action space56 or used pre-existing computer programs as training opponents49,50, or to generate game records51.

【翻译】自我博弈强化学习方法在其他游戏中也取得了很高的水平：国际象棋49-51、西洋跳棋52、西洋双陆棋53、奥赛罗棋54、拼字游戏55以及最近的扑克56。在所有这些例子中，价值函数都是利用自我博弈生成的训练数据，通过回归54-56或时间差分学习49-53训练得到的。训练好的价值函数被用作α-β搜索49-54、简单的蒙特卡洛搜索55,57或反事实遗憾最小化56中的评估函数。然而，这些方法使用了手工设计的输入特征49-53,56或手工设计的特征模板54,55。此外，学习过程还使用监督学习来初始化权重58，为棋子价值手工选定权重49,51,52，对动作空间施加手工设计的限制56，或者使用已有的计算机程序作为训练对手49,50或用来生成棋谱51。

【原文】Many of the most successful and widely used reinforcement learning methods were first introduced in the context of zero-sum games: temporal-difference learning was first introduced for a checkers-playing program59, while MCTS was introduced for the game of Go13. However, very similar algorithms have subsequently proven highly effective in video games6-8,10, robotics60, industrial control61-63 and online recommendation systems64,65.

【翻译】许多最成功、应用最广泛的强化学习方法最初都是在零和博弈的背景下提出的：时间差分学习最早是为一个跳棋程序提出的59，而MCTS则是为围棋提出的13。然而，非常相似的算法随后在电子游戏6-8,10、机器人60、工业控制61-63和在线推荐系统64,65中被证明非常有效。

【原文】AlphaGo versions. We compare three distinct versions of AlphaGo:

【翻译】AlphaGo版本。我们比较了三个不同的AlphaGo版本：

【原文】(1) AlphaGo Fan is the previously published program12 that played against Fan Hui in October 2015. This program was distributed over many machines using 176 GPUs.

【翻译】（1）AlphaGo Fan是此前已发表的程序12，曾于2015年10月与樊麾对弈。这个程序分布在多台机器上，使用了176个GPU。

【原文】(2) AlphaGo Lee is the program that defeated Lee Sedol 4–1 in March 2016. It was previously unpublished, but is similar in most regards to AlphaGo Fan12. However, we highlight several key differences to facilitate a fair comparison. First, the value network was trained from the outcomes of fast games of self-play by AlphaGo, rather than games of self-play by the policy network; this procedure was iterated several times—an initial step towards the tabula rasa algorithm presented in this paper. Second, the policy and value networks were larger than those described in the original paper—using 12 convolutional layers of 256 planes—and were trained for more iterations. This player was also distributed over many machines using 48 TPUs, rather than GPUs, enabling it to evaluate neural networks faster during search.

【翻译】（2）AlphaGo Lee是在2016年3月以4：1击败李世石的程序。这个程序此前未发表，但它在大多数方面与AlphaGo Fan12相似。然而，为了便于公平比较，我们着重说明几个关键差异。首先，价值网络是利用AlphaGo快速自我博弈对局的结果进行训练的，而不是利用策略网络的自我博弈对局；这个过程迭代了数次，是迈向本文所提出的从零开始（tabula rasa）算法的第一步。其次，策略网络和价值网络都比原论文中描述的更大（使用12个卷积层，每层256个平面），并且训练的迭代次数更多。这个程序同样分布在多台机器上，使用48个TPU而不是GPU，因此能够在搜索过程中更快地评估神经网络。

【原文】(3) AlphaGo Master is the program that defeated top human players by 60–0 in January 201734. It was previously unpublished, but uses the same neural network architecture, reinforcement learning algorithm, and MCTS algorithm as described in this paper. However, it uses the same handcrafted features and rollouts as AlphaGo Lee12 and training was initialized by supervised learning from human data.

【翻译】（3）AlphaGo Master是在2017年1月以60：0击败人类顶尖棋手的程序34。此前该程序未曾发表，但它使用了与本文所述相同的神经网络结构、强化学习算法和MCTS算法。不过，它使用了与AlphaGo Lee12相同的手工特征和rollout，并且其训练是通过对人类数据的监督学习来初始化的。

【原文】(4) AlphaGo Zero is the program described in this paper. It learns from self-play reinforcement learning, starting from random initial weights, without using rollouts, with no human supervision and using only the raw board history as input features. It