David Silver (DeepMind) (Nov 2021)

David Silver (DeepMind Sr. Researcher) – AlphaZero Fundamentals (Nov 2021)

We’ve applied MuZero inside DeepMind to optimize the codecs which are used on internet traffic. Bearing a mind 50% of all internet traffic is video, that’s a significant saving. And more exciting applications coming soon. I wish I could tell you about some of them that I can’t.
– Silver @ 46:47

Chapters

00:00:16 A Family of Search Algorithms

00:06:47 Understanding Search Algorithms

Focus on Fundamentals:
David Silver sets the agenda to go back to basics and focus on the underlying theory of various reinforcement learning algorithms. He plans to discuss algorithms such as Expecting Max Search, Monte Carlo Search, Nested Monte Carlo Search, and Alpha Zero Search.

The methods used to explain these algorithms will involve equations, diagrams, and pseudocode.

Common Characteristics of Algorithms:
All the algorithms under discussion have a common method: they evaluate a policy and estimate a value function.

The value function estimates the expected total reward for each state or each action within a state, termed as the state value function and the action value function, respectively.

Algorithms iteratively evaluate and improve policies using these value functions, with the aim to eventually arrive at the optimal value function and policy given enough time and resources.

Role of Backup Operators:
The concept of “backup operators” is introduced, represented by bold letters. These operators modify the value function or policy.

An algorithm is essentially a sequence or composition of these backup operators.

The primary goal is to find a composition of operators that will produce the optimal value and policy.

Types of Backups:
Two main kinds of backups are discussed: evaluation backups and improvement backups. The former evaluates the current policy and the latter improves the policy based on those values.

Backup diagrams, used to illustrate the dependencies involved in each backup, are also discussed.

Backup Operator Examples:
The Bellman evaluation backup operator is presented as an example. This operator computes the expected value for each state and updates the value function at the root state.

Improvement backups are achieved by updating the policy towards the action that has the maximum backed-up value.

Other backup operators like TD0, n-step TD backup, and Monte Carlo backups are also mentioned.

Sampling to Break Curse of Dimensionality:
Noting the high cost of the Bellman evaluation backups due to the consideration of the entire action and state space, Silver mentions a method to break the curse of dimensionality – by sampling.

Instead of considering all different successors, the backup can be achieved by sampling a pathway through the tree, and this sampling-based method leads to TD0 backup or n-step TD backup.

Composition of Backup Operators:
The composition of backup operators is an important concept introduced by Silver. By composing these operators in different orders, different results can be achieved.

Silver emphasizes that regardless of the operators used, if we repeatedly evaluate and improve using any of these operators infinitely often across the entire state space, the result will converge to the optimal value function and the optimal policy, an idea known as policy iteration.

Efficient Policy Iteration:
Towards the end of the discussion, Silver raises a key question about the efficiency of policy iteration. While the concept guarantees convergence to the optimal policy if applied infinitely, he asks which states should be prioritized for the iteration process to converge more rapidly and efficiently.

00:15:59 Search Control

Concept of Search Control:
Search control refers to the strategy used to decide the order in which backup operations are applied in a state. There are many different strategies for search control, but the discussion focuses on two core ideas: recursion and rollouts.

Importance of Recursion in Search Control:
In recursion, improvements are made recursively at each successive state before performing a backup from the successor states. This approach essentially forms a search tree where the value and policy at each successor state are determined before backing up to the higher level.

If the problem structure is tree-like or even a directed acyclic graph (DAG), recursion provides a nice ordering of backups.

With recursion, once the right value of each successor state is computed, only a single backup is needed to ensure the correct value of a state. Recursion helps localize memory resources to the particular subtree being worked on, which is particularly useful when working with value functions or policies with vast memory requirements.

Rollouts as an Effective Strategy in Search Control:
Rollouts involve sampling future trajectories using the current policy, instead of considering the full exhaustive search tree. Rollouts focus on what’s likely to happen in the future according to your policy, ignoring improbable states. It allows for the correct calculation of the vast majority of values.

Importantly, rollouts enable deep search. As the computation grows exponentially with the depth of the search tree in an exhaustive depth-first search, rollouts break this “curse of dimensionality” by merely sampling. This allows for deep dives into the problem.

Combining Recursion and Rollouts: Recursive Rollout:
Recursive rollout attempts to combine the advantages of both recursion and rollouts. Examples include nested Monte Carlo search or nested Monte Carlo tree search.

The speaker argues that Alpha Zero, a highly successful AI algorithm, operates within the recursive rollout framework. At the outer level, it performs rollouts at the game level, and at the inner level, from each of the states it reaches, it performs inner rollouts within its Monte Carlo search.

00:21:04 Search Algorithms

00:28:45 Nested Monte Carlo Tree Search and Function Approximators

Nested Monte Carlo Search:
The Nested Monte Carlo Search is a method that combines recursion and rollouts in Monte Carlo tree search. The approach uses recursion to perform additional rollouts from each state encountered during a rollout.

The nested rollouts undergo evaluation and improvement operations before the primary rollout is backed up. This method can be applied recursively to as many levels as needed.

The process is similar in the Nested Monte Carlo Tree Search, where inner rollouts are performed first followed by an outer backup using a Monte Carlo backup. The Monte Carlo backup could be replaced by Temporal Difference (TD) backup if preferred.

Pseudocode for Nested Monte Carlo Tree Search:
In a rollout procedure, actions are sampled from a policy and rewards are sampled from the environment/model iteratively until a terminal state or depth limit is reached.

In a search procedure, multiple rollouts are performed and evaluation backups are applied along the way to evaluate the process using Monte Carlo backups.

A nested Monte Carlo tree search includes a nested layer in the rollout procedure where, before proceeding at the current level of the algorithm, a search is performed one level lower, improving the value and policy.

Introduction of Function Approximators in RL:
Table lookup algorithms, although exact, are unfeasible in problems with large state spaces. Instead, function approximators like neural networks can be used.

A new projection operator is introduced that projects the results from lower-level searches onto a representation that the neural network can handle. The projection operator finds the best fit using the neural network parameters to backup targets, providing powerful algorithms when interleaved with evaluation and improvement backups. Every step involves evaluation, improvement, and a step of gradient descent to optimize parameters to fit the targets as well as possible.

Transient Approximations:
There are two methods of passing parameters in the function: call by reference and call by value. Using call by value allows for transient approximations, where a new set of parameters is instantiated each time the function recurses.

This approach enables the neural network to focus its resources on finding the best local approximation for the current point of nesting. This concept is akin to the depth of search idea but allows the consideration of a new approximation at every stage.

This strategy permits the memory resources to focus solely on the local search problem, compute the root value, discard the entire thing, and move to another part in the search.

00:37:47 AlphaZero & Beyond

AlphaZero’s Structure:
AlphaZero operates as a two-level nested Monte Carlo tree search with transient approximation.

The first level essentially conducts a Monte Carlo tree search. The inner level uses temporal difference backups as a valuation operator. It also uses a variant of UCT (Upper Confidence Bound applied to Trees) or PUCT (a variant of UCT that uses a prior probability distribution to guide the search) to improve the policy based on visit counts.

The outer level plays out games. From each state reached, several inner rollouts are performed.

In recent versions of AlphaZero, temporal difference learning is more effective than Monte Carlo backups. There is no improvement operator at this level, only an identity operator combined with the projection of the neural network.

MuZero Extension:
MuZero takes AlphaZero and learns a model in a value equivalent way, then applies value alpha zero to that learned model.

It constructs a composed value function. The model of the environment (which is being learned) is composed multiple times to give a sequence of actions.

This technique is successful in learning the rules and achieving superhuman performance in games like chess, shogi, and go. It also outperforms in standard reinforcement frameworks like Atari.

Gumbel AlphaZero Extension:
While the PUCT method used in AlphaZero is not a true improvement operator, it can be replaced with one, improving AlphaZero’s performance especially with small numbers of simulations.

With a true policy improvement operator, AlphaZero can run effectively even with limited inner rollouts for each state encountered in the outer rollouts.

Policy Improvement at Outer Level:
Policy improvement can be introduced at the outer level using a policy gradient operator. This can significantly enhance performance in certain environments, such as Atari.

Sampling for Large Action Spaces:
AlphaZero can handle large or continuous action spaces by using sampling. Actions are sampled using the agent’s policy at each level of recursion, and local Monte Carlo search is performed with respect to those actions.

A special improvement operator can correct for any bias introduced by this sampling method.

Recursion Levels:
The focus on two levels of recursion in AlphaZero and MuZero is pragmatic. The first level is special because it’s the only possible way to traverse a real environment, like a robot, without a reset.

More recursion levels could potentially provide better targets, especially when combined with approximation. Deeper recursion might be pursued as more computational resources become available.

Applications of AlphaZero:
AlphaZero has been successfully applied to various fields, including chemical synthesis and optimizing quantum dynamics.

Abstract

In the world of Artificial Intelligence, search algorithms hold a key position in solving complex problems. At the forefront of these breakthroughs is AlphaZero, an algorithm that has demonstrated the ability to master multiple games like Chess, Shogi, and Go, at a superhuman level. Spearheaded by David Silver, these advancements were made possible by iterating on prior methods, focusing on the fundamentals of reinforcement learning, introducing the concept of backup operators, and detailing intricate search control strategies. This article explores the evolution and implications of AlphaZero, sheds light on the underlying reinforcement learning algorithms, and discusses the strategic interplay between recursion and rollouts in search control.

AlphaZero, as discussed by David Silver, stems from the evolution of search algorithms starting with AlphaGo, which defeated the human Go champion in 2016. Its successor, AlphaGoZero, improved by learning to play without any human knowledge, relying solely on the known game rules. AlphaZero then took the next step by mastering multiple games with only the game rules at its disposal. The most recent iteration, mu0, expanded on this, learning the rules of various games, demonstrating the potential to master other environments with unknown dynamics.

AlphaZero’s core ability to learn from zero prior knowledge and play at a superhuman level is revolutionary. The algorithm starts with a randomly initialized neural network and, through thousands of self-play games, learns and improves, eventually defeating existing world champion programs. It’s not just the victory but the process of learning, refining, and innovating strategies that have led to rethinking game play by humans. Several books about AlphaZero’s strategies reveal how its unique approaches have altered our understanding of these games.

At the heart of these algorithms, Silver presents the common thread of evaluating a policy and estimating a value function. The algorithms aim to iteratively improve policies using these value functions, ultimately aiming for the optimal value function and policy. Here, backup operators, which modify the value function or policy, play a pivotal role. The aim is to sequence or compose these operators in a way that produces optimal results. Moreover, Silver introduces two types of backups: evaluation backups and improvement backups, the former evaluates the current policy, and the latter improves the policy based on those values.

To effectively manage backup operations in a state, Silver introduces the concept of search control. The focus here is on two core ideas: recursion and rollouts. Recursion makes improvements at each successive state before performing a backup from the successor states, which helps localize memory resources. Rollouts, on the other hand, involve sampling future trajectories using the current policy, thereby enabling deep search by sampling instead of considering the full exhaustive search tree.

Moreover, Silver dives into the intricacies of various search algorithms in reinforcement learning, like Expecting Max Search, Monte Carlo Search, Temporal Difference (TD) search, and Nested Monte Carlo Tree Search. Each algorithm has its unique approach but is fundamentally linked to the same principle of applying evaluation and improvement backups to estimate the optimal value function and policy.

With AlphaZero functioning as a two-level nested Monte Carlo tree search with transient approximation, further extensions have been introduced, such as MuZero and Gumbel AlphaZero. The former learns a model in a value equivalent way and then applies value alpha zero to that learned model. Gumbel AlphaZero, on the other hand, replaces the PUCT method used in AlphaZero with a true policy improvement operator, enhancing AlphaZero’s performance especially with small numbers of simulations.

Silver’s work on AlphaZero and its extensions stands as a testament to the potential of artificial intelligence, leading us to rethink the ways we approach problems and strategies. It’s not just about winning games; it’s about understanding the underlying principles of learning and adaptation. The continuous refinement and evolution of these algorithms highlight the very essence of AI: the ability to learn, adapt, and overcome.

Notes by: Systemic01

David Silver (DeepMind Sr. Researcher) – AlphaZero Fundamentals (Nov 2021)

Chapters

Abstract

Related posts: