Geoffrey Hinton (Google Scientific Advisor) – What’s Wrong with Convolutional Neural Networks (Apr 2017)


Chapters

00:01:27 Capsules: A New Approach to Neural Networks
00:04:42 The Computational Basis of High Dimensional Coincidences
00:08:24 Convolutional Neural Networks and Their Limitations
00:19:25 Coordinate Frames and Equivariance in Visual Perception
00:29:29 Inverse Graphics and Coincidence Filtering for Shape Recognition
00:34:12 Routing in Capsule Networks
00:38:28 Routing by Agreement in Capsule Networks
00:44:23 Capsule Networks: Explaining Pose Estimation and Agreement Computation
00:49:45 Capsule Networks for Image Recognition
00:56:35 Capsules for Object Recognition
01:06:01 Deep Learning Algorithms for Object Perception
01:09:47 Capsule Networks for Efficient Representation of Complex Features

Abstract

Understanding Capsule Networks: Revolutionizing Neural Network Architecture and Perception

Capsule Networks: A Structural Revolution in Neural Networks

The recent advancements in neural network architecture, particularly the introduction of capsule networks, mark a significant shift in our approach to artificial intelligence. Capsule networks, developed by Geoffrey Hinton, aim to add structural organization and entity representation to neural networks. This innovative concept enhances the network’s ability to recognize and understand objects and their properties. Unlike traditional neural networks that lack explicit entity representation, capsules group neurons to represent entities, each with presence probability and pose parameters like orientation, size, and velocity.

Capsules are proposed as a new building block for neural networks that addresses this lack of explicit structure. Each capsule represents an entity and its properties, such as orientation, size, velocity, and color, and capsules communicate with each other hierarchically, passing on information about the entities they represent.
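
As a minimal sketch of the idea, a single capsule's output can be thought of as a presence probability plus a small vector of pose parameters; the field names and numbers below are illustrative, not taken from the talk:

    import numpy as np
    from dataclasses import dataclass

    @dataclass
    class CapsuleOutput:
        """What one capsule communicates upward (illustrative container)."""
        presence: float      # probability that the entity this capsule represents exists
        pose: np.ndarray     # instantiation parameters, e.g. x, y, scale, orientation

    # Example: a capsule for a "nose" part, detected with high confidence, whose
    # pose encodes its position, scale and orientation in the image.
    nose = CapsuleOutput(presence=0.93, pose=np.array([12.0, 30.0, 1.1, 0.25]))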

Rectangular coordinate frames are suggested to be inherent features of the human visual system, aiding accurate judgment of object orientation. These frames are implemented in the brain using separate neural activities, facilitating precise representation of object pose.

Capsule Computation: Mimicking Human Perception

The computation within capsule networks mimics human perception techniques, employing methods similar to RANSAC and Hough transforms in computer vision. Capsules receive predictions about generalized pose from lower-level capsules, identifying and agreeing upon consistent predictions while ignoring outliers. This method significantly improves the ability of the network to perceive and interpret complex data, mirroring the human brain’s columnar structure and its high-dimensional coincidence detection capabilities.

Each prediction is a vector in the pose space of a higher-level capsule, and the goal is to find a tight cluster of predictions, since such a cluster signals that the corresponding entity is present. These pose spaces are high-dimensional, on the order of 20 to 50 dimensions, and high-dimensional coincidences are very unlikely to occur by chance: the more dimensions in which two things agree, the less likely it is that their agreement is accidental. The same principle can be used to detect meaningful patterns in other kinds of data, such as intelligence information.
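
A small numerical illustration of why high-dimensional agreement is such strong evidence (the tolerance, dimensions, and uniform random model below are arbitrary choices, not values from the talk):

    import numpy as np

    rng = np.random.default_rng(0)

    def chance_agreement(dims, tol=0.1, trials=100_000):
        """Fraction of random vector pairs in [0, 1]^dims that agree within tol in every coordinate."""
        a = rng.random((trials, dims))
        b = rng.random((trials, dims))
        return np.mean(np.all(np.abs(a - b) < tol, axis=1))

    for d in (1, 2, 4, 8):
        print(d, chance_agreement(d))
    # Agreement in one dimension happens by chance about 19% of the time; agreement in
    # eight dimensions almost never does, so observed agreement signals a real entity.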

The visual system utilizes two types of equivariance: place equivariance and rate-coded equivariance. Place equivariance involves the active neurons changing as the input moves across space, while rate-coded equivariance involves the activity of the same neurons changing as the input moves.
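
A toy contrast between the two codes, assuming a one-dimensional "retina" (purely illustrative, not a model from the talk): under place coding a shift of the input changes which unit is active, while under rate coding the same unit stays responsible and only its activity level changes.

    import numpy as np

    def place_code(position, n_units=10):
        """Place-coded: the entity at position p activates the unit at index p."""
        activity = np.zeros(n_units)
        activity[position] = 1.0
        return activity

    def rate_code(position, n_units=10):
        """Rate-coded: a single unit always represents the entity; its rate encodes position."""
        return np.array([position / (n_units - 1)])

    print(place_code(2), place_code(3))  # different units light up as the input moves
    print(rate_code(2), rate_code(3))    # the same unit changes its value as the input moves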

The Shortcomings of Convolutional Neural Networks

Convolutional Neural Networks (ConvNets) have been the cornerstone of image recognition. However, their reliance on pooling layers, which throw away precise positional information as they shrink the feature maps, is what Hinton regards as a critical flaw. Pooling aims for viewpoint invariance, which is at odds with human perception, where the same object can look quite different depending on the coordinate frame imposed on it. ConvNets have no notion of a coordinate frame at all, which limits their psychological accuracy as a model of shape perception, as evidenced by psychological experiments such as Irvin Rock's.

Hinton argues that CNNs, specifically the pooling layers, are a poor fit for the psychology of shape perception. He emphasizes the importance of rectangular coordinate frames in human shape recognition and how a slight shift in the coordinate frame can make the same object unrecognizable. CNNs lack the notion of imposing coordinate frames, making it difficult to explain how the same pixels can be processed differently based on the coordinate frame.
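
A tiny demonstration of the information loss being criticized here: two patches whose strong response sits in different places produce exactly the same output after 2x2 max pooling (the numbers are made up for illustration).

    import numpy as np

    def max_pool_2x2(x):
        """2x2, stride-2 max pooling over a 2-D feature map."""
        h, w = x.shape
        return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

    a = np.array([[0.0, 0.9],
                  [0.1, 0.0]])
    b = np.array([[0.9, 0.0],
                  [0.0, 0.1]])

    print(max_pool_2x2(a), max_pool_2x2(b))  # both give [[0.9]]: the feature's exact position is gone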

Hinton’s Vision: From Tetrahedron Puzzle to Psychological Evidence

Geoffrey Hinton’s exploration of shape perception, including the Tetrahedron puzzle, showcases the limitations of current models. His findings demonstrate that humans perceive shapes by imposing rectangular coordinate frames, suggesting a more complex, multi-parameter representation of object pose. This insight led Hinton to propose alternative approaches involving coordinate frames and routing information, promising a better capture of human shape perception.

Hinton presents a simple puzzle involving two pieces of a tetrahedron that most people find challenging to reassemble. He demonstrates how people’s natural coordinate frame for the tetrahedron differs from the one imposed on the individual pieces, leading to the puzzle’s difficulty. This puzzle highlights the importance of coordinate frames in shape perception and the limitations of CNNs in capturing this aspect.

Very young children track proto-objects that are not fully fleshed out; their low-resolution vision may actually make learning easier. The relational aspects of object perception, such as how objects transform linearly under changes of viewpoint, are hardwired into the system.

The Role of Equivariance and Linear Manifold in Perception

In this new model, convolutional networks without max pooling exhibit viewpoint equivariance: their representations change systematically as the object's position changes. Beneath the highly non-linear relationship between pixels and an object's coordinate representation lies a linear manifold: once pose is expressed in coordinate form, viewpoint changes act linearly, which is what makes it possible to recognize objects from viewpoints vastly different from the training data. Vision, conceptualized as inverse graphics, reconstructs 3D structure from 2D images, a critical challenge for current neural networks, which lack any built-in bias for generalizing across viewpoints.

Hinton points out that CNNs fail to utilize the underlying linear manifold, which is a powerful tool in computer graphics and is believed to be involved in human shape perception. He argues that CNNs’ lack of attention to this linear manifold limits their ability to deal with viewpoint effects effectively.
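
A minimal sketch of the linearity being referred to, borrowing the standard computer-graphics representation of pose as a homogeneous transformation matrix (the 2-D setting and the specific numbers are illustrative assumptions): a change of viewpoint acts on pose coordinates by an ordinary matrix multiplication, i.e. linearly, even though the corresponding change in pixel space is highly non-linear.

    import numpy as np

    def pose(tx, ty, theta):
        """2-D pose as a homogeneous transformation matrix (rotation + translation)."""
        c, s = np.cos(theta), np.sin(theta)
        return np.array([[c, -s, tx],
                         [s,  c, ty],
                         [0,  0,  1]])

    object_pose = pose(2.0, 1.0, 0.3)        # pose of an object relative to the camera
    viewpoint_change = pose(-1.0, 0.5, 0.7)  # the camera shifts and rotates

    # The object's new pose is just a matrix product: a linear operation on pose coordinates.
    new_object_pose = viewpoint_change @ object_pose
    print(new_object_pose)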

Humans are good at recognizing objects and judging their orientation quickly, within about 250 milliseconds. Mental rotation of objects, especially in 3D, is a much slower process that takes far longer than recognition.

Capsule Networks and Inverse Graphics

Capsule networks use "inverse graphics" to derive the pose of a whole object from the poses of its parts. The part-whole relationships are encoded in weights that are viewpoint-invariant, staying the same regardless of perspective, while each capsule's pose outputs are equivariant, changing with the viewpoint. A capsule represents an object's identity together with its pose, and shape recognition works by checking whether the pose predictions made by familiar parts agree: strong agreement indicates that the object is present. This approach parallels the Hough transform but leverages modern machine learning for feature extraction.
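
A rough sketch of this agreement test, under the simplifying assumption of 2-D poses represented as homogeneous matrices and hand-picked part-whole relationships (the face/mouth/nose names and all numbers are illustrative): each detected part multiplies its observed pose by a fixed, viewpoint-invariant matrix to predict the pose of the whole, and the whole is accepted when those predictions coincide.

    import numpy as np

    def pose(tx, ty, theta):
        c, s = np.cos(theta), np.sin(theta)
        return np.array([[c, -s, tx], [s, c, ty], [0, 0, 1]])

    # Fixed, viewpoint-invariant part-whole relationships (where the mouth and the
    # nose sit within a face); these play the role of learned weight matrices.
    mouth_in_face = pose(0.0, -1.0, 0.0)
    nose_in_face = pose(0.0, 0.5, 0.0)

    # Observed poses of the parts in the image; they move together with the viewpoint.
    true_face = pose(3.0, 2.0, 0.4)
    mouth_observed = true_face @ mouth_in_face
    nose_observed = true_face @ nose_in_face

    # Each part predicts the pose of the whole by undoing its own part-whole transform.
    face_from_mouth = mouth_observed @ np.linalg.inv(mouth_in_face)
    face_from_nose = nose_observed @ np.linalg.inv(nose_in_face)

    # Strong agreement between the two predictions is evidence that a face is present.
    print(np.allclose(face_from_mouth, face_from_nose))  # True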

To understand the brain, we need to understand the computations it performs. Detecting high-dimensional coincidences is one such computation, and it may be carried out by cortical columns. Capsule networks may be more robust to noise and occlusion than traditional neural networks, and they may also be better at modeling complex relationships between entities.

Capsule networks aim to find agreement between activity vectors, essentially chasing the covariance structure. The computation involved in finding these agreements efficiently among high-dimensional random data presents a challenge. Hinton suggests he has ideas for improving this efficiency but will discuss them only if they prove successful.

Routing and Decision-Making in Capsule Networks

Routing in capsule networks is pivotal, directing information based on relevance and compatibility. High-level capsules establish relationships between parts and wholes, forming a parse tree that represents the hierarchical organization. This process, aided by Hinton’s “routing by agreement” algorithm, emphasizes consistency over sheer signal strength.
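
A minimal sketch of what routing by agreement can look like in code, in the spirit of the dynamic-routing procedure Hinton and collaborators published later in 2017 (Sabour, Frosst & Hinton, "Dynamic Routing Between Capsules"); the dimensions, iteration count, and random votes are arbitrary, and the learned transformation matrices that would normally produce the predictions are omitted.

    import numpy as np

    def squash(v):
        """Shrink a vector's length into [0, 1) while keeping its direction."""
        n = np.linalg.norm(v)
        return (n ** 2 / (1.0 + n ** 2)) * v / (n + 1e-9)

    def route_by_agreement(predictions, iterations=3):
        """predictions: (num_lower, num_upper, dim) - each lower capsule's predicted
        pose vector for each higher-level capsule."""
        num_lower, num_upper, _ = predictions.shape
        logits = np.zeros((num_lower, num_upper))            # routing logits
        for _ in range(iterations):
            coupling = np.exp(logits)
            coupling /= coupling.sum(axis=1, keepdims=True)  # softmax over upper capsules
            # Each upper capsule's output is the squashed, coupling-weighted sum of its votes.
            outputs = np.stack([squash((coupling[:, j, None] * predictions[:, j]).sum(axis=0))
                                for j in range(num_upper)])
            # Votes that agree with an output get routed more strongly to that capsule.
            logits += np.einsum('ijd,jd->ij', predictions, outputs)
        return outputs, coupling

    votes = np.random.default_rng(0).normal(size=(6, 2, 4))  # 6 lower capsules, 2 upper, 4-D poses
    outputs, coupling = route_by_agreement(votes)
    print(coupling.round(2))  # routing weights concentrate where the votes agree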

Hinton criticizes the pooling operation in CNNs, viewing it as a primitive approach to routing information. He emphasizes the need for a more sophisticated routing mechanism to handle viewpoint changes and dimension hopping, where information moves from one set of pixels to another. He draws a parallel to the coding of patient records in hospitals, where mixing different coding schemes can hinder machine learning efforts.

Hinton’s graduate student attempted to apply capsule networks to speech recognition. The task proved challenging, but the student managed to achieve performance comparable to standard neural networks using capsule-based ideas. Capsule networks appear to be more naturally suited for vision tasks.

Innovations and Limitations in Capsule Network Technology

Capsule networks bring a plethora of innovations: primary capsules extracting features, coordinate transforms for prediction adjustments, and unsupervised learning for model parameter determination. However, they face limitations like computational intensity and challenges in handling multiple simultaneous digits or deeper hierarchies.
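
A rough sketch of what the primary-capsule stage might do, assuming convolutional feature maps are already available (the shapes and the length-as-presence convention are illustrative assumptions, not a specification from the talk): groups of channels at each position are reinterpreted as pose vectors, and the length of each vector is treated as the probability that the corresponding feature is present.

    import numpy as np

    def squash(v, axis=-1):
        """Map each vector's length into [0, 1) while preserving its direction."""
        n = np.linalg.norm(v, axis=axis, keepdims=True)
        return (n ** 2 / (1.0 + n ** 2)) * v / (n + 1e-9)

    # Stand-in for the output of a convolutional layer: 32 channels over a 6x6 grid.
    feature_maps = np.random.default_rng(0).normal(size=(32, 6, 6))

    # Reinterpret the 32 channels at each grid position as four 8-D capsule pose vectors.
    capsule_dim = 8
    primary = feature_maps.reshape(4, capsule_dim, 6, 6).transpose(0, 2, 3, 1)  # (4, 6, 6, 8)
    primary = squash(primary)                    # per-capsule non-linearity
    presence = np.linalg.norm(primary, axis=-1)  # vector length plays the role of presence
    print(primary.shape, presence.shape)         # (4, 6, 6, 8) (4, 6, 6)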

Geoffrey Hinton’s vision extends to unsupervised learning, identifying natural classes for efficient learning with fewer labels. His methods outperform traditional unsupervised pre-training and generic mixture models. These advancements promise significant improvements in object recognition, mirroring children’s ability to perceive the essence of objects despite low-resolution vision.

Hinton’s algorithm for learning object representations currently takes about two days to run, compared with roughly ten minutes for existing algorithms. Heuristic approximations and computational optimizations could significantly reduce the learning time.

Coordinate Frames, Equivariance, and Linear Manifolds: Supplements

Coordinate Frames:

Hinton proposes that rectangular coordinate frames are built-in features of the human visual system, allowing us to accurately judge the orientation of objects. These frames are implemented in the brain using separate neural activities, enabling precise representation of object pose.

Equivariance:

Geoffrey Hinton draws a distinction between place equivariance and rate-coded equivariance, highlighting that the visual system utilizes both at different processing levels. While place equivariance involves the active neurons changing as the input moves across space, rate-coded equivariance involves the activity of the same neurons changing as the input moves.

Linear Manifolds:

The linear manifold concept suggests that, once an object is described by explicit pose coordinates, the effect of a viewpoint change on that internal representation is linear. This built-in bias allows efficient extrapolation and generalization across different viewpoints, which current neural networks struggle with. Hinton proposes that the visual system uses these manifolds for object representation.

Capsule Networks as the Future of AI Perception

Capsule networks, with their structured approach and advanced perception capabilities, represent a groundbreaking shift in neural network architecture. Their ability to mimic human perception, address the limitations of ConvNets, and incorporate concepts like equivariance and linear manifolds positions them at the forefront of AI development. While challenges remain, the potential of capsule networks to revolutionize how machines interpret and understand the world is immense.


Notes by: Alkaid