Geoffrey Hinton (Google Scientific Advisor) – What’s Wrong with Convolutional Neural Networks (Apr 2017)


Chapters

00:01:27 Capsules: A New Approach to Neural Networks
00:04:42 The Computational Basis of High Dimensional Coincidences
00:08:24 Convolutional Neural Networks and Their Limitations
00:19:25 Coordinate Frames and Equivariance in Visual Perception
00:29:29 Inverse Graphics and Coincidence Filtering for Shape Recognition
00:34:12 Routing in Capsule Networks
00:38:28 Routing by Agreement in Capsule Networks
00:44:23 Capsule Networks: Explaining Pose Estimation and Agreement Computation
00:49:45 Capsule Networks for Image Recognition
00:56:35 Capsules for Object Recognition
01:06:01 Deep Learning Algorithms for Object Perception
01:09:47 Capsule Networks for Efficient Representation of Complex Features

Abstract

Understanding Capsule Networks: Revolutionizing Neural Network Architecture and Perception

Capsule Networks: A Structural Revolution in Neural Networks

The recent advancements in neural network architecture, particularly the introduction of capsule networks, mark a significant shift in our approach to artificial intelligence. Capsule networks, developed by Geoffrey Hinton, aim to add structural organization and entity representation to neural networks. This innovative concept enhances the network’s ability to recognize and understand objects and their properties. Unlike traditional neural networks that lack explicit entity representation, capsules group neurons to represent entities, each with presence probability and pose parameters like orientation, size, and velocity.

Capsules are proposed as a new building block for neural networks that addresses this lack of explicit structure. Each capsule represents an entity and its properties, such as orientation, size, velocity, and color, and capsules communicate with each other hierarchically, passing on information about the entities they represent.
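
As a minimal sketch of the idea, a single capsule's output can be thought of as a presence probability plus a small vector of pose parameters; the field names and numbers below are illustrative, not taken from the talk:

    import numpy as np
    from dataclasses import dataclass

    @dataclass
    class CapsuleOutput:
        """What one capsule communicates upward (illustrative container)."""
        presence: float      # probability that the entity this capsule represents exists
        pose: np.ndarray     # instantiation parameters, e.g. x, y, scale, orientation

    # Example: a capsule for a "nose" part, detected with high confidence, whose
    # pose encodes its position, scale and orientation in the image.
    nose = CapsuleOutput(presence=0.93, pose=np.array([12.0, 30.0, 1.1, 0.25]))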

Rectangular coordinate frames are suggested to be inherent features of the human visual system, aiding accurate judgment of object orientation. These frames are implemented in the brain using separate neural activities, facilitating precise representation of object pose.

Capsule Computation: Mimicking Human Perception

The computation within capsule networks mimics human perception techniques, employing methods similar to RANSAC and Hough transforms in computer vision. Capsules receive predictions about generalized pose from lower-level capsules, identifying and agreeing upon consistent predictions while ignoring outliers. This method significantly improves the ability of the network to perceive and interpret complex data, mirroring the human brain’s columnar structure and its high-dimensional coincidence detection capabilities.

Each prediction is a vector in the pose space of a higher-level capsule, and the goal is to find a tight cluster of predictions, since such a cluster signals that the corresponding entity is present. These pose spaces are high-dimensional, on the order of 20 to 50 dimensions, and high-dimensional coincidences are very unlikely to occur by chance: the more dimensions in which two things agree, the less likely it is that their agreement is accidental. The same principle can be used to detect meaningful patterns in other kinds of data, such as intelligence information.
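
A small numerical illustration of why high-dimensional agreement is such strong evidence (the tolerance, dimensions, and uniform random model below are arbitrary choices, not values from the talk):

    import numpy as np

    rng = np.random.default_rng(0)

    def chance_agreement(dims, tol=0.1, trials=100_000):
        """Fraction of random vector pairs in [0, 1]^dims that agree within tol in every coordinate."""
        a = rng.random((trials, dims))
        b = rng.random((trials, dims))
        return np.mean(np.all(np.abs(a - b) < tol, axis=1))

    for d in (1, 2, 4, 8):
        print(d, chance_agreement(d))
    # Agreement in one dimension happens by chance about 19% of the time; agreement in
    # eight dimensions almost never does, so observed agreement signals a real entity.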

The visual system utilizes two types of equivariance: place equivariance and rate-coded equivariance. Place equivariance involves the active neurons changing as the input moves across space, while rate-coded equivariance involves the activity of the same neurons changing as the input moves.
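
A toy contrast between the two codes, assuming a one-dimensional "retina" (purely illustrative, not a model from the talk): under place coding a shift of the input changes which unit is active, while under rate coding the same unit stays responsible and only its activity level changes.

    import numpy as np

    def place_code(position, n_units=10):
        """Place-coded: the entity at position p activates the unit at index p."""
        activity = np.zeros(n_units)
        activity[position] = 1.0
        return activity

    def rate_code(position, n_units=10):
        """Rate-coded: a single unit always represents the entity; its rate encodes position."""
        return np.array([position / (n_units - 1)])

    print(place_code(2), place_code(3))  # different units light up as the input moves
    print(rate_code(2), rate_code(3))    # the same unit changes its value as the input moves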

The Shortcomings of Convolutional Neural Networks

Convolutional Neural Networks (ConvNets) have been the cornerstone of image recognition. However, their reliance on pooling layers, which throw away precise positional information as they shrink the feature maps, is what Hinton regards as a critical flaw. Pooling aims for viewpoint invariance, which is at odds with human perception, where the same object can look quite different depending on the coordinate frame imposed on it. ConvNets have no notion of a coordinate frame at all, which limits their psychological accuracy as a model of shape perception, as evidenced by psychological experiments such as Irvin Rock's.

Hinton argues that CNNs, specifically the pooling layers, are a poor fit for the psychology of shape perception. He emphasizes the importance of rectangular coordinate frames in human shape recognition and how a slight shift in the coordinate frame can make the same object unrecognizable. CNNs lack the notion of imposing coordinate frames, making it difficult to explain how the same pixels can be processed differently based on the coordinate frame.
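
A tiny demonstration of the information loss being criticized here: two patches whose strong response sits in different places produce exactly the same output after 2x2 max pooling (the numbers are made up for illustration).

    import numpy as np

    def max_pool_2x2(x):
        """2x2, stride-2 max pooling over a 2-D feature map."""
        h, w = x.shape
        return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

    a = np.array([[0.0, 0.9],
                  [0.1, 0.0]])
    b = np.array([[0.9, 0.0],
                  [0.0, 0.1]])

    print(max_pool_2x2(a), max_pool_2x2(b))  # both give [[0.9]]: the feature's exact position is gone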

Hinton’s Vision: From Tetrahedron Puzzle to Psychological Evidence

Geoffrey Hinton’s exploration of shape perception, including the Tetrahedron puzzle, showcases the limitations of current models. His findings demonstrate that humans perceive shapes by imposing rectangular coordinate frames, suggesting a more complex, multi-parameter representation of object pose. This insight led Hinton to propose alternative approaches involving coordinate frames and routing information, promising a better capture of human shape perception.

Hinton presents a simple puzzle involving two pieces of a tetrahedron that most people find challenging to reassemble. He demonstrates how people’s natural coordinate frame for the tetrahedron differs from the one imposed on the individual pieces, leading to the puzzle’s difficulty. This puzzle highlights the importance of coordinate frames in shape perception and the limitations of CNNs in capturing this aspect.

Very young children track proto-objects that are not fully fleshed out; their low-resolution vision may actually make learning easier. The relational aspects of object perception, such as how objects transform linearly under changes of viewpoint, are hardwired into the system.

The Role of Equivariance and Linear Manifold in Perception

In this new model, convolutional networks without max pooling exhibit viewpoint equivariance: their representations change systematically as the object's position changes. Beneath the highly non-linear relationship between pixels and an object's coordinate representation lies a linear manifold: once pose is expressed in coordinate form, viewpoint changes act linearly, which is what makes it possible to recognize objects from viewpoints vastly different from the training data. Vision, conceptualized as inverse graphics, reconstructs 3D structure from 2D images, a critical challenge for current neural networks, which lack any built-in bias for generalizing across viewpoints.

Hinton points out that CNNs fail to utilize the underlying linear manifold, which is a powerful tool in computer graphics and is believed to be involved in human shape perception. He argues that CNNs’ lack of attention to this linear manifold limits their ability to deal with viewpoint effects effectively.
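
A minimal sketch of the linearity being referred to, borrowing the standard computer-graphics representation of pose as a homogeneous transformation matrix (the 2-D setting and the specific numbers are illustrative assumptions): a change of viewpoint acts on pose coordinates by an ordinary matrix multiplication, i.e. linearly, even though the corresponding change in pixel space is highly non-linear.

    import numpy as np

    def pose(tx, ty, theta):
        """2-D pose as a homogeneous transformation matrix (rotation + translation)."""
        c, s = np.cos(theta), np.sin(theta)
        return np.array([[c, -s, tx],
                         [s,  c, ty],
                         [0,  0,  1]])

    object_pose = pose(2.0, 1.0, 0.3)        # pose of an object relative to the camera
    viewpoint_change = pose(-1.0, 0.5, 0.7)  # the camera shifts and rotates

    # The object's new pose is just a matrix product: a linear operation on pose coordinates.
    new_object_pose = viewpoint_change @ object_pose
    print(new_object_pose)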

Humans are good at recognizing objects and judging their orientation quickly, within about 250 milliseconds. Mental rotation of objects, especially in 3D, is a much slower process that takes far longer than recognition.

Capsule Networks and Inverse Graphics

Capsule networks use "inverse graphics" to derive the pose of a whole object from the poses of its parts. The part-whole relationships are encoded in weights that are viewpoint-invariant, staying the same regardless of perspective, while each capsule's pose outputs are equivariant, changing with the viewpoint. A capsule represents an object's identity together with its pose, and shape recognition works by checking whether the pose predictions made by familiar parts agree: strong agreement indicates that the object is present. This approach parallels the Hough transform but leverages modern machine learning for feature extraction.
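
A rough sketch of this agreement test, under the simplifying assumption of 2-D poses represented as homogeneous matrices and hand-picked part-whole relationships (the face/mouth/nose names and all numbers are illustrative): each detected part multiplies its observed pose by a fixed, viewpoint-invariant matrix to predict the pose of the whole, and the whole is accepted when those predictions coincide.

    import numpy as np

    def pose(tx, ty, theta):
        c, s = np.cos(theta), np.sin(theta)
        return np.array([[c, -s, tx], [s, c, ty], [0, 0, 1]])

    # Fixed, viewpoint-invariant part-whole relationships (where the mouth and the
    # nose sit within a face); these play the role of learned weight matrices.
    mouth_in_face = pose(0.0, -1.0, 0.0)
    nose_in_face = pose(0.0, 0.5, 0.0)

    # Observed poses of the parts in the image; they move together with the viewpoint.
    true_face = pose(3.0, 2.0, 0.4)
    mouth_observed = true_face @ mouth_in_face
    nose_observed = true_face @ nose_in_face

    # Each part predicts the pose of the whole by undoing its own part-whole transform.
    face_from_mouth = mouth_observed @ np.linalg.inv(mouth_in_face)
    face_from_nose = nose_observed @ np.linalg.inv(nose_in_face)

    # Strong agreement between the two predictions is evidence that a face is present.
    print(np.allclose(face_from_mouth, face_from_nose))  # True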

To understand the brain, we need to understand the computations it performs. Detecting high-dimensional coincidences is one such computation, and it may be carried out by cortical columns. Capsule networks may be more robust to noise and occlusion than traditional neural networks, and they may also be better at modeling complex relationships between entities.

Capsule networks aim to find agreement between activity vectors, essentially chasing the covariance structure. The computation involved in finding these agreements efficiently among high-dimensional random data presents a challenge. Hinton suggests he has ideas for improving this efficiency but will discuss them only if they prove successful.

Routing and Decision-Making in Capsule Networks

Routing in capsule networks is pivotal, directing information based on relevance and compatibility. High-level capsules establish relationships between parts and wholes, forming a parse tree that represents the hierarchical organization. This process, aided by Hinton’s “routing by agreement” algorithm, emphasizes consistency over sheer signal strength.
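
A minimal sketch of what routing by agreement can look like in code, in the spirit of the dynamic-routing procedure Hinton and collaborators published later in 2017 (Sabour, Frosst & Hinton, "Dynamic Routing Between Capsules"); the dimensions, iteration count, and random votes are arbitrary, and the learned transformation matrices that would normally produce the predictions are omitted.

    import numpy as np

    def squash(v):
        """Shrink a vector's length into [0, 1) while keeping its direction."""
        n = np.linalg.norm(v)
        return (n ** 2 / (1.0 + n ** 2)) * v / (n + 1e-9)

    def route_by_agreement(predictions, iterations=3):
        """predictions: (num_lower, num_upper, dim) - each lower capsule's predicted
        pose vector for each higher-level capsule."""
        num_lower, num_upper, _ = predictions.shape
        logits = np.zeros((num_lower, num_upper))            # routing logits
        for _ in range(iterations):
            coupling = np.exp(logits)
            coupling /= coupling.sum(axis=1, keepdims=True)  # softmax over upper capsules
            # Each upper capsule's output is the squashed, coupling-weighted sum of its votes.
            outputs = np.stack([squash((coupling[:, j, None] * predictions[:, j]).sum(axis=0))
                                for j in range(num_upper)])
            # Votes that agree with an output get routed more strongly to that capsule.
            logits += np.einsum('ijd,jd->ij', predictions, outputs)
        return outputs, coupling

    votes = np.random.default_rng(0).normal(size=(6, 2, 4))  # 6 lower capsules, 2 upper, 4-D poses
    outputs, coupling = route_by_agreement(votes)
    print(coupling.round(2))  # routing weights concentrate where the votes agree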

Hinton criticizes the pooling operation in CNNs, viewing it as a primitive approach to routing information. He emphasizes the need for a more sophisticated routing mechanism to handle viewpoint changes and dimension hopping, where information moves from one set of pixels to another. He draws a parallel to the coding of patient records in hospitals, where mixing different coding schemes can hinder machine learning efforts.

Hinton’s graduate student attempted to apply capsule networks to speech recognition. The task proved challenging, but the student managed to achieve performance comparable to standard neural networks using capsule-based ideas. Capsule networks appear to be more naturally suited for vision tasks.

Innovations and Limitations in Capsule Network Technology

Capsule networks bring a plethora of innovations: primary capsules extracting features, coordinate transforms for prediction adjustments, and unsupervised learning for model parameter determination. However, they face limitations like computational intensity and challenges in handling multiple simultaneous digits or deeper hierarchies.
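
A rough sketch of what the primary-capsule stage might do, assuming convolutional feature maps are already available (the shapes and the length-as-presence convention are illustrative assumptions, not a specification from the talk): groups of channels at each position are reinterpreted as pose vectors, and the length of each vector is treated as the probability that the corresponding feature is present.

    import numpy as np

    def squash(v, axis=-1):
        """Map each vector's length into [0, 1) while preserving its direction."""
        n = np.linalg.norm(v, axis=axis, keepdims=True)
        return (n ** 2 / (1.0 + n ** 2)) * v / (n + 1e-9)

    # Stand-in for the output of a convolutional layer: 32 channels over a 6x6 grid.
    feature_maps = np.random.default_rng(0).normal(size=(32, 6, 6))

    # Reinterpret the 32 channels at each grid position as four 8-D capsule pose vectors.
    capsule_dim = 8
    primary = feature_maps.reshape(4, capsule_dim, 6, 6).transpose(0, 2, 3, 1)  # (4, 6, 6, 8)
    primary = squash(primary)                    # per-capsule non-linearity
    presence = np.linalg.norm(primary, axis=-1)  # vector length plays the role of presence
    print(primary.shape, presence.shape)         # (4, 6, 6, 8) (4, 6, 6)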

Geoffrey Hinton’s vision extends to unsupervised learning, identifying natural classes for efficient learning with fewer labels. His methods outperform traditional unsupervised pre-training and generic mixture models. These advancements promise significant improvements in object recognition, mirroring children’s ability to perceive the essence of objects despite low-resolution vision.

Hinton’s algorithm for learning object representations currently takes about two days to run, compared with roughly ten minutes for existing algorithms. Heuristic approximations and computational optimizations could significantly reduce the learning time.

Coordinate Frames, Equivariance, and Linear Manifolds: Supplements

Coordinate Frames:

Hinton proposes that rectangular coordinate frames are built-in features of the human visual system, allowing us to accurately judge the orientation of objects. These frames are implemented in the brain using separate neural activities, enabling precise representation of object pose.

Equivariance:

Geoffrey Hinton draws a distinction between place equivariance and rate-coded equivariance, highlighting that the visual system utilizes both at different processing levels. While place equivariance involves the active neurons changing as the input moves across space, rate-coded equivariance involves the activity of the same neurons changing as the input moves.

Linear Manifolds:

The linear manifold concept suggests that, once an object is described by explicit pose coordinates, the effect of a viewpoint change on that internal representation is linear. This built-in bias allows efficient extrapolation and generalization across different viewpoints, which current neural networks struggle with. Hinton proposes that the visual system uses these manifolds for object representation.

Capsule Networks as the Future of AI Perception

Capsule networks, with their structured approach and advanced perception capabilities, represent a groundbreaking shift in neural network architecture. Their ability to mimic human perception, address the limitations of ConvNets, and incorporate concepts like equivariance and linear manifolds positions them at the forefront of AI development. While challenges remain, the potential of capsule networks to revolutionize how machines interpret and understand the world is immense.


Notes by: Alkaid