Geoffrey Hinton (Google) (Aug 2015)

Geoffrey Hinton (Google Scientific Advisor) – A Computational Principle that Explains Sex, the Brain, and Sparse Coding (Aug 2015)

Chapters

00:00:08 Neural Communication: Advantages of Stochastic Binary Spikes over Real Numbers

00:09:43 Geometric Mean Model Averaging for Neural Networks

Neural Network Communication:
When simulating large neural networks with billions of neurons, the neurons must be spread across multiple processes. The processes communicate neuron states, which can be transmitted either as individual bits or as 32-bit real numbers. Sending one bit per neuron state cuts communication by a factor of 32 compared with sending real numbers, a gain comparable to the speedup achieved by using GPUs.
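The factor of 32 can be checked with a little arithmetic. The sketch below is only an illustration, assuming NumPy; the neuron count and the variable names are hypothetical and do not come from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons = 1_000_000  # hypothetical number of neuron states to transmit

# Firing probabilities stored as 32-bit reals (4 bytes per neuron).
probs = rng.random(n_neurons, dtype=np.float32)

# Stochastic binary spikes: each neuron emits 1 with its firing probability.
spikes = rng.random(n_neurons) < probs

# Pack eight spikes into each byte before "sending" them.
packed = np.packbits(spikes)

bytes_real = probs.nbytes   # payload when sending real numbers
bytes_bits = packed.nbytes  # payload when sending one bit per neuron
ratio = bytes_real / bytes_bits
print(bytes_real, bytes_bits, ratio)
```

The receiving process can recover the spikes with `np.unpackbits(packed)[:n_neurons]`.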

Model Averaging Challenges:
Training large deep neural networks is time-consuming, making it impractical to train millions of models. Averaging the predictions of many models is beneficial for improving performance, but it is slow for both training and testing. Decision trees are faster and can be averaged efficiently, but they are not as powerful as deep neural networks.

Model Averaging with Sampling Architectures:
A method is proposed for efficient model averaging with neural networks. Each time an image is presented to the network, each hidden unit is removed at random with probability 0.5, which effectively samples an architecture. The network is trained on this sampled architecture, with weights shared across all architectures. At test time, the predictions of all possible architectures are combined using a geometric mean, resulting in improved accuracy. For a single hidden layer with a softmax output, this geometric mean can be computed exactly by keeping all hidden units and halving their outgoing weights.
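For a single hidden layer feeding a softmax output, the geometric mean over all sampled architectures does not have to be computed by enumeration: it equals the output of the full network with the hidden activities (equivalently, the outgoing weights) halved. The sketch below checks this numerically on a tiny network. It assumes NumPy, a ReLU hidden layer, and random weights; the sizes and names are illustrative, not taken from the lecture.

```python
import itertools
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 6, 3          # tiny, so all 2**6 masks can be enumerated
W1 = rng.normal(size=(n_hidden, n_in))   # input -> hidden weights
W2 = rng.normal(size=(n_out, n_hidden))  # hidden -> output weights
b2 = rng.normal(size=n_out)              # output biases (never dropped)
x = rng.normal(size=n_in)
h = np.maximum(0.0, W1 @ x)              # hidden activities of the full network

# Predictions of every architecture obtained by dropping a subset of hidden units.
log_probs = []
for mask in itertools.product([0.0, 1.0], repeat=n_hidden):
    p = softmax(W2 @ (h * np.array(mask)) + b2)
    log_probs.append(np.log(p))

# Normalized geometric mean of the 2**n_hidden predictive distributions.
geo_mean = np.exp(np.mean(log_probs, axis=0))
geo_mean /= geo_mean.sum()

# "Mean network": keep every hidden unit but halve its contribution.
mean_net = softmax(W2 @ (0.5 * h) + b2)

print(np.allclose(geo_mean, mean_net))
```

The equality holds because averaging the log-probabilities over all masks averages the logits, and each hidden unit is present in exactly half of the masks.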

Weight Sharing Regularization:
Sharing weights across different architectures acts as a strong regularizer, preventing overfitting. This regularization is more effective than techniques like L1 and L2 regularization.

Implementation Details:
For networks with multiple hidden layers, the math is not exact, but the approximation is close and works well in practice. Leaving out half the units in every hidden layer typically gives better results than leaving out half the units in just one hidden layer. The same trick can be applied to the input layer, where it corresponds to the input corruption used in a denoising autoencoder.
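A minimal sketch of what dropping units in the input and in every hidden layer might look like, assuming NumPy and a bias-free ReLU network. The function name `forward`, the layer sizes, and the use of the same keep probability for the input layer are assumptions made for illustration.

```python
import numpy as np

def forward(x, weights, rng=None, keep_prob=0.5, train=True):
    """Forward pass of a bias-free ReLU network with dropout on the input
    and on every hidden layer. `rng` is required when train=True."""
    a = x
    for i, W in enumerate(weights):
        if train:
            # Sample an architecture: keep each unit with probability keep_prob.
            mask = rng.random(a.shape) < keep_prob
            a = a * mask
        else:
            # Test time: keep all units and scale by keep_prob. With more than
            # one hidden layer this only approximates the geometric mean.
            a = a * keep_prob
        z = W @ a
        a = z if i == len(weights) - 1 else np.maximum(0.0, z)
    return a

rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 5)), rng.normal(size=(8, 8)), rng.normal(size=(3, 8))]
x = rng.normal(size=5)

train_out = forward(x, weights, rng=rng, train=True)   # one sampled architecture
test_out = forward(x, weights, train=False)            # deterministic mean network
print(train_out.shape, test_out.shape)
```

Each training call samples a different thinned network, while the test-time call is deterministic.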

Similar Techniques:
A similar technique is commonly used at Google for logistic regression with a large number of features. By randomly dropping out some features, the model’s performance can be improved.

00:18:51 Dropout Variants for Neural Networks

00:21:59 Dropout and Regularization for Deep Networks

00:24:23 Dropout Regularization for Deep Neural Networks

00:28:58 Dropout: Model Averaging and Regularization

00:32:39 Sparse Coding and Dropout: Understanding Generalization and Genetic Algorithms

00:37:56 Stochastic Neurons: Improved Generalization through Spiking and Dropout

00:41:21 Stochastic Regularization using Boltzmann Machines

Why Brains Don’t Use Spike Times for Real Values:
Neuroscientists argue that brains cannot use spike times to transmit real values due to the energy cost of sending spikes. Hinton challenges this notion, asserting that sending spikes accurately doesn’t require more energy than sending inaccurate spikes.

Evolution’s Role in Neural Network Design:
Evolution’s goal is not to fit training data but to create systems that generalize well. To achieve good generalization, brains employ a large number of neural parameters compared to training examples. Random dropout or sending stochastic bits acts as a regularizer, preventing overfitting.

Restricted Boltzmann Machines vs. Denoising Autoencoders:
Restricted Boltzmann Machines (RBMs) and denoising autoencoders achieve similar performance. RBMs make the hidden units noisy, while denoising autoencoders make the visible units noisy. The noise in RBMs is crucial for their performance, not the complex mathematics of generative models.

Contrastive Divergence (CD) Training:
CD training is commonly used for RBMs, but it also works well for deep nets. CD aims to match the distribution of reconstructions to the distribution of data, not to reconstruct individual data points accurately. Noisy pixels are ignored in RBMs trained with maximum likelihood, but they affect the hidden units in CD training.

CD Learning Rule:
The CD learning rule consists of two terms: a reconstruction error term and a weight update term. The reconstruction error term is the derivative of the reconstruction error with respect to the weight. The weight update term is the expectation of a visible unit and a hidden unit being on together when driven by the data, minus the same expectation when making a reconstruction.
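The difference-of-expectations term can be sketched as a one-step CD update for a binary RBM. This is a simplified illustration assuming NumPy, with biases omitted and with probabilities used in place of samples for the reconstruction; the function name `cd1_weight_gradient` and all sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_weight_gradient(v_data, W, rng):
    """CD-1 statistic for a bias-free binary RBM.
    v_data: (batch, n_visible) binary data; W: (n_visible, n_hidden)."""
    # Hidden probabilities driven by the data, then a stochastic binary sample.
    p_h_data = sigmoid(v_data @ W)
    h_sample = (rng.random(p_h_data.shape) < p_h_data).astype(float)

    # Reconstruction of the visible units, and the hidden response to it.
    p_v_recon = sigmoid(h_sample @ W.T)
    p_h_recon = sigmoid(p_v_recon @ W)

    # <v h> with the data minus <v h> with the reconstruction, batch-averaged.
    positive = v_data.T @ p_h_data
    negative = p_v_recon.T @ p_h_recon
    return (positive - negative) / v_data.shape[0]

rng = np.random.default_rng(0)
v_batch = (rng.random((16, 10)) < 0.5).astype(float)  # toy binary "data"
W = 0.01 * rng.normal(size=(10, 4))
grad = cd1_weight_gradient(v_batch, W, rng)
W_new = W + 0.1 * grad                                # 0.1 is an arbitrary learning rate
print(grad.shape)
```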

Backpropagation in CD Training:
CD training performs backpropagation in a forward direction instead of the usual backward direction. It takes the error derivative from the visible units and multiplies it by the weights from the hidden units to get the error derivative for the hidden units. This error derivative is then propagated forward through the non-linearity to update the weights.

00:49:13 Boltzmann Machines: Training Methods and Fast Mixing

00:58:55 Variational Learning for Boltzmann Machine Training

01:01:59 Deep Belief Nets and Boltzmann Machines

Abstract

Harnessing Computational Intelligence: Unraveling the Mysteries of Neural Communication and Evolution (Updated)

In the field of computational intelligence, Geoffrey Hinton, a prominent figure, has proposed a groundbreaking principle that intertwines the mechanisms of sexual reproduction, neural communication, and efficient computational processes. This article examines Hinton’s insights and their implications for understanding neural networks and evolutionary biology.

The Core of Computational Ingenuity

Geoffrey Hinton’s hypothesis posits that the brain may use spike timing to represent real numbers, enabling efficient scalar product computations. This challenges traditional views on signal processing by suggesting that the timing of spikes in cortical neurons, previously regarded as noise, is a sophisticated computational strategy. Hinton’s approach suggests that stochastic binary spikes suit the brain’s task of fitting many models to a complex reality, which makes the resulting models more adaptable.

Co-Adaptations and Sexual Reproduction

Hinton emphasizes the importance of sexual reproduction in disrupting co-adaptations, which can lead to evolutionary dead-ends and sensitivity to environmental changes. He argues for the advantage of small co-adaptations of a few genes over complex co-adaptations of many genes, as the former are less likely to be disrupted by sexual reproduction.

Stochastic Binary Spikes in Neural Communication

Hinton challenges the notion that neurons cannot communicate real numbers by presenting a model that uses spike times for this purpose. This model posits that the brain employs stochastic binary spikes, which are more effective in fitting models to complex reality. The efficiency of this approach is highlighted by its reduced communication needs, paralleling the efficiency gains achieved by GPUs.

Revolutionizing Neural Network Training: Dropout Regularization

Hinton’s introduction of dropout regularization for neural networks is a significant advancement. This technique involves randomly dropping out hidden units during training to create a diverse ensemble of models. It serves as a strong regularizer, preventing overfitting and reducing the need for large datasets. Dropout’s application extends to complex network architectures, consistently lowering error rates.

Model Averaging Challenges

Hinton addresses challenges in training large deep neural networks through model averaging with sampling architectures. He proposes a method where hidden units are randomly removed during training, effectively sampling different architectures. This shared-weights approach leads to improved accuracy by combining predictions of all possible architectures at test time.

Dropout and Naive Bayes

Hinton compares dropout to Naive Bayes, which can be seen as an extreme version of dropout in which logistic regression is trained on one feature at a time. He suggests that learning the probability of dropping out each feature can enhance performance, as demonstrated in experiments with dropout techniques on MNIST.

The Dropout Phenomenon: Beyond Regularization

Dropout is more than a regularization tool; it’s a paradigm shift in neural network dynamics. By randomly setting units to zero during training, the network learns robust and generalizable features, functioning similarly to adding noise to the network.

Sparse Coding and Genetic Algorithms: Extending the Dropout Concept

Hinton’s dropout concept is related to sparse coding and genetic algorithms. Randomly setting coefficients to zero during sparse coding training, similar to dropout, improves generalization. In genetic algorithms, dropout enhances neuron robustness to the loss of collaborators.

Understanding Neural Networks Through the Lens of Evolution

Hinton’s insights extend to the evolutionary aspects of neural networks. He suggests that evolution prioritizes generalization over fitting training data. This principle is evident in the brain’s use of random dropout and stochastic bits as regularizers against overfitting.

Sparse Coding, Dropout, and Genetic Algorithms

Sparse coding generalizes well to new data because it uses a large dictionary of basis functions and enforces most coefficients to be zero. Randomly setting coefficients to zero during training, akin to dropout, bolsters generalization. Dropout makes neurons robust to the loss of collaborators, enhancing genetic algorithm performance. Neurons sending spikes randomly is analogous to dropout in a hidden layer.

Insights into Stochasticity in Neural Networks and Learning

Neurons spiking can be viewed as hidden units with dropout, having the same expected output. Stochasticity, introduced by dropout and stochastic neurons, improves neural network performance on generalization tasks. This stochastic nature of cortical neurons is key to the brain’s efficient learning and generalization abilities.
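The "same expected output" claim can be illustrated with a quick simulation. The sketch below assumes NumPy and compares a real-valued unit under dropout with a neuron that emits a binary spike at a matching rate; the activity value, keep probability, and trial count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
activity = 0.3       # hypothetical real-valued output of a hidden unit
keep_prob = 0.5      # dropout keeps the unit half of the time
n_trials = 200_000

# Dropout view: send the real value when kept, 0 when dropped.
dropout_out = activity * (rng.random(n_trials) < keep_prob)

# Spiking view: send a single bit with probability activity * keep_prob.
spike_out = (rng.random(n_trials) < activity * keep_prob).astype(float)

expected = activity * keep_prob
print(dropout_out.mean(), spike_out.mean(), expected)
```

Both averages approach the same expected value; what differs is the variance of the signal, which is the noise that acts as a regularizer.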

Insights and Key Points from Geoffrey Hinton’s Lecture on Restricted Boltzmann Machines

Hinton counters the argument that brains can’t use spike times to transmit real values due to energy costs, asserting that accurate spikes don’t require more energy than inaccurate ones. He emphasizes that evolution aims for systems that generalize well, with brains employing numerous neural parameters for this purpose. Hinton compares Restricted Boltzmann Machines (RBMs) and denoising autoencoders, highlighting the importance of noise in RBMs. He explains that Contrastive Divergence (CD) training, used for RBMs and deep nets, aims to match data distributions rather than reconstruct individual data points accurately.

Boltzmann Machines and Autoencoders

Restricted Boltzmann machines (RBMs) can be viewed as a type of autoencoder with stochastic binary hidden units. RBM training, akin to maximum likelihood learning, employs contrastive divergence (CD) for efficiency. CD training can be seen as a stochastic approximation of the full derivative.

Fantasy Particles and Learning

Boltzmann machines use fantasy particles, representing different states of hidden and visible units, for training. These fantasy particles are instrumental in altering the energy landscape during training by increasing energy around themselves and decreasing it around the data. This approach facilitates fast mixing of the Markov chain and assists the fantasy particles in escaping local minima.

Negative Particles and Agitators

In the learning process, negative particles act like political agitators, identifying and rectifying issues (local minima) that lack sufficient data. These particles play a crucial role in ensuring a thorough exploration of the energy landscape and in learning effectively from the data.

FastPCD and Overlay Energy Surfaces

FastPCD, a method used in training, utilizes two energy surfaces: a slow-changing base surface and a rapidly decaying overlay. The overlay surface enables quick learning and good mixing without disturbing the long-term energy function, proving particularly useful for training Boltzmann machines with connections between hidden units.
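The two-surface idea can be sketched with a slow weight matrix plus a rapidly decaying "fast" overlay that only the fantasy particles see. The code below is a simplified illustration assuming NumPy and a bias-free RBM (no hidden-to-hidden connections); the function name `fpcd_step`, the learning rates, and the decay constant are assumptions, not values from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def fpcd_step(v_data, W, W_fast, v_fantasy, rng, lr=0.01, lr_fast=0.01, decay=0.95):
    """One update with a slow surface W and a decaying overlay W_fast."""
    # Data-driven statistics use only the slow weights.
    positive = v_data.T @ sigmoid(v_data @ W) / len(v_data)

    # Fantasy particles move on the combined surface (slow + overlay).
    W_eff = W + W_fast
    h_fantasy = sample(sigmoid(v_fantasy @ W_eff), rng)
    v_fantasy = sample(sigmoid(h_fantasy @ W_eff.T), rng)
    negative = v_fantasy.T @ sigmoid(v_fantasy @ W_eff) / len(v_fantasy)

    grad = positive - negative
    W = W + lr * grad                          # slow, long-term energy function
    W_fast = decay * W_fast + lr_fast * grad   # overlay learns fast, then fades
    return W, W_fast, v_fantasy

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4
data = sample(np.full((20, n_visible), 0.5), rng)       # toy binary data
W = 0.01 * rng.normal(size=(n_visible, n_hidden))
W_fast = np.zeros_like(W)
v_fantasy = sample(np.full((10, n_visible), 0.5), rng)  # persistent particles
for _ in range(10):
    W, W_fast, v_fantasy = fpcd_step(data, W, W_fast, v_fantasy, rng)
print(W.shape, W_fast.shape, v_fantasy.shape)
```

Because the overlay raises the energy wherever the fantasy particles currently sit, they are pushed to move on quickly, while the decay keeps the overlay from distorting the slow weights in the long run.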

Training Full Boltzmann Machines

The training of full Boltzmann machines, which permit connections between all hidden units, is complex due to the intricate energy landscape. Various methods, including Markov chain Monte Carlo and mean-field approaches, have been proposed to address this challenge.

Variational Methods: Their Drawbacks and Effective Applications

Variational methods, used to approximate the posterior distribution in Boltzmann machine training, face challenges due to a negative term that penalizes the difference between true and approximating distributions.

Instabilities in Variational Learning for Boltzmann Machines

Variational learning with a negative sign causes increasing scale divergence, leading to unstable behavior and the failure of these methods in Boltzmann machine training.

Wake-Sleep Algorithm

The wake-sleep algorithm uses variational learning in its sleep phase, aiming to minimize the Kullback-Leibler divergence between true and approximating distributions. However, this approach is problematic and can yield incorrect results.

Effective Use of Variational Learning

A more successful application of variational learning involves replacing the negative term with a positive one, allowing for the minimization of KL divergence. This adjustment leads to a more stable and effective training process for Boltzmann machines.

Boltzmann Machines and Deep Belief Nets

Boltzmann machines are trained using persistent contrastive divergence (PCD) for model expectations and variational learning for data-dependent expectations. Pre-training Boltzmann machines with sensible weights enhances their performance.

A New Horizon in Computational Intelligence

Geoffrey Hinton’s contributions, ranging from proposing novel computational principles to pioneering dropout regularization and advancing Boltzmann machine learning, represent significant strides in our understanding of neural networks and evolutionary biology. His insights challenge existing paradigms in signal processing and neural communication, paving the way for more robust and efficient computational models. These advancements echo across various domains of artificial intelligence and biological understanding, marking a new horizon in computational intelligence.

Notes by: Alkaid
