David Silver (DeepMind Sr. Researcher) – AlphaZero Fundamentals (Nov 2021)

We’ve applied MuZero inside DeepMind to optimize the codecs which are used on internet traffic. Bearing in mind 50% of all internet traffic is video, that’s a significant saving. And more exciting applications coming soon. I wish I could tell you about some of them, but I can’t.

– Silver @ 46:47

Chapters

00:00:16 A Family of Search Algorithms
00:06:47 Understanding Search Algorithms
00:15:59 Search Control
00:21:04 Search Algorithms
00:28:45 Nested Monte Carlo Tree Search and Function Approximators
00:37:47 AlphaZero & Beyond

Abstract

In the world of artificial intelligence, search algorithms hold a key position in solving complex problems. At the forefront of these breakthroughs is AlphaZero, an algorithm that has mastered multiple games, including chess, shogi, and Go, at a superhuman level. In this talk, David Silver traces how these advances were made possible by iterating on prior methods, focusing on the fundamentals of reinforcement learning, introducing the concept of backup operators, and detailing search control strategies. This article explores the evolution and implications of AlphaZero, sheds light on the underlying reinforcement learning algorithms, and discusses the interplay between recursion and rollouts in search control.

AlphaZero, as discussed by David Silver, stems from an evolution of search algorithms that began with AlphaGo, which defeated the human world champion at Go in 2016. Its successor, AlphaGo Zero, improved on it by learning to play without any human knowledge, relying solely on the known rules of the game. AlphaZero then took the next step, mastering multiple games with only the game rules at its disposal. The most recent iteration, MuZero, expanded on this by learning its own model of each game rather than being given the rules, demonstrating the potential to master other environments with unknown dynamics.

AlphaZero’s core ability to learn from zero prior knowledge and play at a superhuman level is revolutionary. The algorithm starts with a randomly initialized neural network and, through millions of self-play games, learns and improves, eventually defeating existing world-champion programs. It is not just the victories but the process of learning, refining, and inventing strategies that has led humans to rethink how these games are played. Several books about AlphaZero’s strategies describe how its unconventional approaches have altered our understanding of these games.

At the heart of these algorithms, Silver presents a common thread: evaluating a policy and estimating a value function. The algorithms iteratively improve policies using these value functions, ultimately aiming for the optimal value function and policy. Here, backup operators, which modify the value function or policy, play a pivotal role; the aim is to sequence or compose these operators in a way that produces optimal results. Silver distinguishes two types of backups, evaluation backups and improvement backups: the former evaluates the current policy, while the latter improves the policy based on those value estimates.
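To make the two backup types concrete, here is a minimal tabular policy-iteration sketch in Python. The two-state MDP (the P, R, and gamma values) is purely illustrative and not from the talk; it simply shows evaluation backups and improvement backups being composed until the policy is stable.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP, purely for illustration (not from the talk).
# P[s, a] is the (deterministic) next state, R[s, a] the reward, gamma the discount.
P = np.array([[0, 1], [1, 0]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma = 0.9
n_states, n_actions = R.shape

V = np.zeros(n_states)                  # value-function estimate
policy = np.zeros(n_states, dtype=int)  # current greedy policy

def evaluation_backups(sweeps=50):
    """Evaluation backups: re-estimate V(s) under the current policy."""
    for _ in range(sweeps):
        for s in range(n_states):
            a = policy[s]
            V[s] = R[s, a] + gamma * V[P[s, a]]

def improvement_backup(s):
    """Improvement backup: choose the action maximising the one-step lookahead."""
    return int(np.argmax(R[s] + gamma * V[P[s]]))

# Compose the two backups (policy iteration) until the policy stops changing.
while True:
    evaluation_backups()
    new_policy = np.array([improvement_backup(s) for s in range(n_states)])
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print("V:", V, "policy:", policy)
```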

To manage where backup operations are applied, Silver introduces the concept of search control. The focus here is on two core ideas: recursion and rollouts. Recursion makes improvements at each successor state before performing a backup from those successor states, which helps localize memory resources. Rollouts, on the other hand, sample future trajectories using the current policy, enabling deep search by sampling rather than expanding the full exhaustive search tree.
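A toy sketch of the two search-control ideas, using a hypothetical chain environment (the step, actions, and is_terminal helpers and the chain itself are illustrative assumptions, not details from the talk): the rollout samples a single deep trajectory under the current policy, while the recursive form backs up values exhaustively from successor states to a limited depth.

```python
GAMMA = 0.95

# Toy deterministic chain environment, purely illustrative:
# states 0..5, action +1/-1 moves along the chain, reward 1 on reaching state 5.
def step(state, action):
    nxt = max(0, min(5, state + action))
    return nxt, 1.0 if nxt == 5 else 0.0

def is_terminal(state):
    return state == 5

def actions(state):
    return (-1, +1)

def rollout_value(state, policy, depth=50):
    """Rollout: sample one deep trajectory with the current policy and
    return its discounted return (sampling instead of exhaustive search)."""
    total, discount = 0.0, 1.0
    for _ in range(depth):
        if is_terminal(state):
            break
        state, reward = step(state, policy(state))
        total += discount * reward
        discount *= GAMMA
    return total

def recursive_value(state, depth):
    """Recursion: back up the best value from each successor state,
    improving locally before the backup (exhaustive but depth-limited)."""
    if depth == 0 or is_terminal(state):
        return 0.0
    return max(reward + GAMMA * recursive_value(nxt, depth - 1)
               for action in actions(state)
               for nxt, reward in [step(state, action)])

# Example: compare both estimates from state 0 under an always-move-right policy.
print(rollout_value(0, policy=lambda s: +1))
print(recursive_value(0, depth=6))
```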

Moreover, Silver dives into the details of various search algorithms in reinforcement learning, such as expectimax search, Monte Carlo search, temporal-difference (TD) search, and nested Monte Carlo tree search. Each algorithm has its own approach, but all are fundamentally linked by the same principle: applying evaluation and improvement backups to estimate the optimal value function and policy.
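As one concrete instance, a minimal Monte Carlo search on the toy chain above might look like the sketch below, reusing GAMMA, step, actions, and rollout_value from the previous sketch (the random default policy is another illustrative assumption, not a detail from the talk): rollouts provide evaluation backups for each root action, and the greedy choice at the root is the improvement backup.

```python
import random

def monte_carlo_search(state, n_simulations=200):
    """Minimal Monte Carlo search: estimate each root action's value by
    averaging sampled rollout returns (evaluation backups), then choose
    the action with the best average (an improvement backup at the root)."""
    estimates = {}
    for a in actions(state):
        nxt, reward = step(state, a)
        returns = [reward + GAMMA * rollout_value(nxt, policy=lambda s: random.choice(actions(s)))
                   for _ in range(n_simulations)]
        estimates[a] = sum(returns) / n_simulations
    return max(estimates, key=estimates.get)

print(monte_carlo_search(0))  # expected to prefer moving right, toward the reward
```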

With AlphaZero understood as a two-level nested Monte Carlo tree search with transient approximation, further extensions have been introduced, such as MuZero and Gumbel AlphaZero. The former learns a model in a value-equivalent way and then applies AlphaZero-style search to that learned model. Gumbel AlphaZero, on the other hand, replaces the PUCT rule used in AlphaZero with a true policy improvement operator, enhancing performance especially at small numbers of simulations.
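For context, the PUCT selection rule that Gumbel AlphaZero replaces can be sketched roughly as follows. The Node structure is a simplified, hypothetical stand-in rather than DeepMind's actual implementation, and c_puct is the usual exploration constant from the AlphaZero papers.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    """Hypothetical, simplified search-tree node (illustrative only)."""
    priors: dict                              # P(s, a) from the policy network
    children: dict = field(default_factory=dict)
    visit_count: int = 0
    total_value: float = 0.0

def puct_select(node, c_puct=1.25):
    """AlphaZero-style PUCT selection: trade off the current value estimate
    Q(s, a) against the network prior P(s, a), scaled by visit counts.
    Gumbel AlphaZero replaces this heuristic with a policy-improvement
    operator based on sampling actions without replacement."""
    total_visits = sum(c.visit_count for c in node.children.values())
    def score(action):
        child = node.children[action]
        q = child.total_value / child.visit_count if child.visit_count else 0.0
        u = c_puct * node.priors[action] * math.sqrt(total_visits) / (1 + child.visit_count)
        return q + u
    return max(node.children, key=score)
```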

Silver’s work on AlphaZero and its extensions stands as a testament to the potential of artificial intelligence, leading us to rethink the ways we approach problems and strategies. It’s not just about winning games; it’s about understanding the underlying principles of learning and adaptation. The continuous refinement and evolution of these algorithms highlight the very essence of AI: the ability to learn, adapt, and overcome.


Notes by: Systemic01