Geoffrey Hinton (Google Scientific Advisor) – Lecture 13/16 (Jan 2017)
Abstract
The Evolution of Machine Learning: From Backpropagation to Deep Belief Nets
Unraveling the Journey of Backpropagation and its Alternatives
Historical Context of Backpropagation
Backpropagation, a pivotal algorithm in deep learning, was invented independently several times in the 1970s and 1980s. Its true potential was only unlocked decades later, in the mid-2000s, when faster computers, much larger labeled datasets, and better weight-initialization techniques removed the obstacles that had held it back. The prevailing explanation for its early shortcomings was that it simply could not learn multiple layers of nonlinear features, but the real hurdles were practical, not theoretical.
Support Vector Machines (SVMs), a powerful tool for classification tasks, gained popularity during this period. They use the “kernel trick” to operate implicitly in a very high-dimensional feature space, where they find a maximum-margin separating hyperplane; maximizing the margin is an effective way to limit overfitting. Despite their strengths, SVMs cannot learn multiple layers of adaptive feature representations.
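As a concrete illustration of the kernel trick, the sketch below, which assumes the scikit-learn library and a made-up toy dataset (neither appears in the lecture), fits an RBF-kernel SVM to points that are not linearly separable in their original two dimensions; the kernel implicitly maps them into a high-dimensional space where a maximum-margin hyperplane can separate them.

```python
# A minimal sketch of a max-margin classifier with the kernel trick.
# scikit-learn, the toy data, and the hyperparameters are illustrative choices.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy 2-D data that is not linearly separable: inside vs. outside the unit circle.
X = rng.normal(size=(200, 2))
y = (np.sum(X**2, axis=1) > 1.0).astype(int)

# An RBF kernel implicitly maps the 2-D inputs into a high-dimensional feature
# space, where the SVM finds a maximum-margin separating hyperplane.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)

print("training accuracy:", clf.score(X, y))
print("number of support vectors:", clf.support_vectors_.shape[0])
```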
Computational Limitations and the Rise of SVMs
The limited success of early backpropagation attempts can be attributed to insufficient computational resources and the lack of large datasets. These factors hindered its ability to learn complex patterns, giving rise to the popularity of SVMs, which excelled in specific classification tasks. However, SVMs faced their own challenges in learning multi-layered representations.
The 1995 Bet: A Turning Point
In 1995, a wager between Larry Jackel and Vladimir Vapnik captured the era’s uncertainty about neural networks. Jackel bet that by 2000 there would be a theoretical understanding of why large neural nets trained with backpropagation work well; Vapnik bet against it, with the side condition that he would still win if he supplied the explanation himself. In the end, both sides turned out to be wrong: no such theory appeared, yet large backpropagation networks eventually succeeded anyway, because their limitations had been practical (compute and data) rather than theoretical.
Geoffrey Hinton’s Shift
Geoffrey Hinton’s transition to belief nets in the early 1990s marked a significant shift in the field. Confronted with the scarcity of labeled data, Hinton turned to generative models, which aim to model the input data itself rather than predict a label from it, and this led him to belief nets as an alternative to purely discriminative gradient-descent learning. Graphical models combine discrete graph structures with probability distributions to represent dependencies between variables, offering a potent framework for generative modeling.
Statistical vs. AI Machine Learning Tasks
This period also witnessed a clear distinction between statistical and AI machine learning tasks. Statistical tasks dealt with low-dimensional, noisy data, while AI tasks demanded handling high-dimensional data with intricate structures.
The Emergence and Challenges of Sigmoid Belief Nets
Background and Learning Challenges
Sigmoid belief nets, introduced by Radford Neal in 1992, offered a new direction in machine learning. They are directed graphical models made of neuron-like units, and because each unit’s conditional distribution is locally normalized, there is no global partition function, which makes learning simpler than in Boltzmann machines. Learning remained hard in practice, however, because phenomena like “explaining away” make it difficult to sample from the posterior distribution over hidden states. Boltzmann machines, early examples of undirected graphical models, had already used real-valued probabilistic computations to infer the states of their units; sigmoid belief nets carried similar ideas over to the directed setting.
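To make the idea of a locally normalized, directed model concrete, here is a minimal sketch of top-down (ancestral) sampling in a small sigmoid belief net; the layer sizes and random weights are invented for illustration and are not from the lecture.

```python
# A minimal sketch of ancestral sampling in a two-hidden-layer sigmoid belief
# net: each unit turns on with a probability given by a logistic function of
# its parents' states. All sizes and weights here are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_top, n_hidden, n_visible = 3, 5, 8
b_top = np.zeros(n_top)                                    # biases of the top layer
W_top = rng.normal(scale=0.5, size=(n_top, n_hidden))      # top -> hidden weights
W_hid = rng.normal(scale=0.5, size=(n_hidden, n_visible))  # hidden -> visible weights

# Sample each layer given its parents. The model is "locally normalized":
# every conditional p(unit on | parents) is a valid probability on its own,
# so there is no global partition function to compute.
h_top = (rng.random(n_top) < sigmoid(b_top)).astype(float)
h_hid = (rng.random(n_hidden) < sigmoid(h_top @ W_top)).astype(float)
v = (rng.random(n_visible) < sigmoid(h_hid @ W_hid)).astype(float)

print("sampled visible vector:", v)
```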
Deep Belief Nets: Overcoming Limitations
Researchers developed Monte Carlo methods, variational methods, and the wake-sleep algorithm to get around the difficulty of sampling from the posterior. These approaches approximate the posterior distribution rather than compute it exactly, a significant departure from earlier learning methods. Deep networks with many hidden layers also learned very slowly when their weights were poorly initialized, and backpropagation could get stuck in poor local optima. Overcoming these challenges required moving beyond plain backpropagation.
Variational Learning and Neurobiological Implications
Variational learning, popular in the late 1990s, approximates the posterior distribution rather than sampling from it, acknowledging that exact inference is impractical; a common choice is a factorial distribution over binary hidden vectors, in which the units are treated as independent. This approach even suggested a new theory of how the brain might learn. In unsupervised learning, the aim is to maximize the probability that a generative model would produce the observed sensory input, so the focus is on modeling the structure of the input rather than an input-output mapping. Generative models can be energy-based, like Boltzmann machines, or causal models built from idealized neurons.
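As a small illustration of what a factorial distribution over binary vectors means, the sketch below computes the probability of one binary state vector when each hidden unit is treated as an independent Bernoulli variable; the per-unit probabilities are made up for the example.

```python
# A minimal sketch of a factorial (fully independent) approximate posterior
# over binary hidden vectors. The per-unit probabilities are illustrative.
import numpy as np

q = np.array([0.9, 0.2, 0.7])   # q[i] = probability that hidden unit i is on

def factorial_prob(s, q):
    """Probability of binary vector s under independent Bernoulli units."""
    return float(np.prod(q**s * (1 - q)**(1 - s)))

s = np.array([1, 0, 1])
print(factorial_prob(s, q))     # 0.9 * 0.8 * 0.7 = 0.504
```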
The Flaws of the Wake-Sleep Algorithm
Despite its innovative approach, the Wake-Sleep Algorithm had drawbacks. It struggled with data-sparse regions and incorrect mode averaging. This highlighted the complexities of learning deep belief nets and emphasized the need for more robust algorithms.
Wake-Sleep Algorithm:
– The wake-sleep algorithm trains a generative model with two sets of weights: recognition weights, used to compute an approximate posterior over hidden states, and generative weights, which define the model’s probability distribution over data. (A minimal code sketch follows the two phase descriptions below.)
Wake Phase:
– Data is input at the visible layer.
– Forward pass using recognition weights generates stochastic binary states for hidden units.
– Given these sampled hidden states, maximum-likelihood learning is performed for the generative weights using a local delta rule.
Sleep Phase:
– The process starts at the top hidden layer, where binary states are sampled from the prior.
– These states are propagated downwards through the generative weights to produce a “dream” data vector.
– The recognition weights are trained to recover the hidden states from this generated data.
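The sketch below shows a single wake-sleep update for one hidden layer, written in NumPy; the layer sizes, learning rate, and the random stand-in for a training vector are assumptions made for illustration, not settings from the lecture.

```python
# A minimal sketch of one wake-sleep update with a single hidden layer.
# Sizes, learning rate, and the stand-in "data" vector are illustrative.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
sample = lambda p: (rng.random(p.shape) < p).astype(float)   # stochastic binary states

n_vis, n_hid, lr = 6, 4, 0.1
R = rng.normal(scale=0.1, size=(n_vis, n_hid))   # recognition weights (bottom-up)
G = rng.normal(scale=0.1, size=(n_hid, n_vis))   # generative weights (top-down)
b_hid = np.zeros(n_hid)                          # generative prior over hidden units
b_vis = np.zeros(n_vis)                          # generative visible biases

v_data = sample(np.full(n_vis, 0.5))             # stand-in for one training vector

# Wake phase: drive the net bottom-up with recognition weights, then train the
# generative weights to reconstruct the layer below (a local delta rule that
# does maximum-likelihood learning given the sampled hidden states).
h = sample(sigmoid(v_data @ R))
p_vis = sigmoid(h @ G + b_vis)
G += lr * np.outer(h, v_data - p_vis)
b_vis += lr * (v_data - p_vis)
b_hid += lr * (h - sigmoid(b_hid))               # fit the prior to the hidden activity

# Sleep phase: fantasize data top-down with generative weights, then train the
# recognition weights to recover the hidden states that produced the fantasy.
h_dream = sample(sigmoid(b_hid))
v_dream = sample(sigmoid(h_dream @ G + b_vis))
q_hid = sigmoid(v_dream @ R)
R += lr * np.outer(v_dream, h_dream - q_hid)
```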
Flaws of the Algorithm:
– The recognition weights are trained to invert the generative model in regions of data space where there is no real data, which is wasteful.
– The recognition weights do not follow the gradient of the log probability of the data (or even of its variational bound), which leads to incorrect mode averaging.
– The independence (factorial) approximation is often a poor fit to the true posterior, especially for the top hidden layer, where explaining away makes the units strongly dependent.
Mode Averaging:
– Because of the independence approximation, the recognition distribution spreads its probability across several modes of the posterior rather than committing to one.
– This hurts recognition performance: instead of selecting the most likely explanation, the model assigns substantial probability to configurations that are individually very unlikely (see the numeric sketch after this list).
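The numeric sketch below illustrates mode averaging with a made-up bimodal posterior over two hidden units: the mode-averaged factorial approximation spreads probability 0.25 over all four configurations, while a factorial distribution that commits to one mode ends up closer (in KL divergence) to the true posterior.

```python
# A small numeric illustration of mode averaging. The posterior and the two
# candidate factorial approximations below are made-up numbers.
import numpy as np

configs = [(0, 0), (0, 1), (1, 0), (1, 1)]
# A bimodal posterior over two hidden units: almost all mass on (0,1) and (1,0).
p = np.array([0.02, 0.48, 0.48, 0.02])

def factorial_dist(q1, q2):
    """Joint over the four configurations if the two units are independent."""
    return np.array([(q1 if s1 else 1 - q1) * (q2 if s2 else 1 - q2)
                     for s1, s2 in configs])

def kl(q, p):
    """KL divergence KL(q || p) in nats."""
    return float(np.sum(q * np.log(q / p)))

mode_avg = factorial_dist(0.5, 0.5)    # what sleep-phase learning tends to produce
one_mode = factorial_dist(0.98, 0.02)  # a factorial fit that commits to one mode

print("mode averaging:", mode_avg, " KL(q||p) =", round(kl(mode_avg, p), 3))
print("one mode:      ", one_mode, " KL(q||p) =", round(kl(one_mode, p), 3))
# Mode averaging gives 0.25 to each configuration, including two that are very
# unlikely under p; committing to one mode yields a lower KL(q||p) here.
```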
Criticisms:
– Karl Friston has proposed that the brain uses this algorithm, but Hinton believes it has too many problems.
– Hinton suggests that better algorithms exist for learning generative models.
The Evolving Landscape of Machine Learning
The journey of machine learning, from backpropagation to deep belief nets, reflects a dynamic field that continuously adapts to new challenges and discoveries. The historical context provides valuable insights into the evolution of these algorithms, underscoring the impact of computational limitations, data availability, and the quest for more effective learning methods. As we delve deeper into the fields of AI and machine learning, understanding these developments becomes crucial for future innovations and advancements.
Notes by: Random Access