Geoffrey Hinton (Google Scientific Advisor) – What is Wrong With Convolutional Neural Nets? (Sep 2017)


Chapters

00:00:00 Beyond the Convolutional Neural Network
00:03:38 Vector-Based Neurons for Covariance Detection and Nonlinearity Exploration
00:08:26 Capsules and Covariance: A New Approach to Neural Networks
00:19:36 Exploring the Limitations of Convolutional Neural Networks in Object Perception
00:29:36 Coordinate Representation of Faces for Interpolation and Extrapolation
00:31:57 Understanding Capsule Networks for Shape Representation
00:35:38 Equivariance and Place Coding in Capsule Networks
00:39:21 Geometric Vision with Capsules
00:50:02 Capsules: A New Way of Thinking About Vision

Abstract

Geoffrey Hinton’s Insights and the Evolution of Neural Networks: Transforming Computer Vision with Capsule Networks

In the field of neural network research, Geoffrey Hinton’s pioneering work stands as a cornerstone, particularly his concept of capsule networks. This article presents Hinton’s critique of current neural network architectures, including convolutional neural networks (CNNs), and his proposal of capsules as a more effective alternative. Hinton’s insights encompass the limitations of conventional methods in representing objects and the advantages of capsules in handling orientation, noise, and deformation. By representing viewpoint changes equivariantly and encoding object parts and their relationships efficiently, capsule networks offer a promising new direction in computer vision. However, challenges such as computational cost and limited exploration also surface in Hinton’s discourse.



Current Neural Network Architectures: Limitations and Innovations

Hinton challenges the prevailing belief that the current neural network architecture, consisting of layers of ReLUs and LSTMs, is the optimal approach, and he emphasizes the importance of exploring diverse architectures that may outperform existing models. A single layer of standard neurons cannot compute the “same” function, which outputs 1 when its two binary inputs are identical and 0 otherwise. One solution is to introduce a hidden layer whose units detect the specific agreeing patterns, one-one and zero-zero, and combine them to compute the same function. An alternative approach is to build neurons capable of computing the same function directly. Spiking neurons offer some inspiration for such same units: in certain brain regions, coincidence detectors use spike timing to detect small differences in delay between two signals. Same units can simplify certain computations and make them more efficient, because they allow the correlation between two activity vectors to be measured directly, which is crucial for tasks involving covariance structure.
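
A minimal numpy sketch may make the contrast concrete. The weights are hand-picked for illustration rather than taken from the talk: one hidden unit detects the one-one case, another the zero-zero case, and a hypothetical “same unit” measures the agreement of two activity vectors directly as a normalized correlation.

```python
import numpy as np

def step(x):
    """Hard-threshold activation."""
    return float(np.asarray(x) > 0)

def same_with_hidden_layer(a, b):
    """The 'same' function (1 iff a == b) for binary inputs, built with a
    hidden layer: one unit fires only for (1, 1), another only for (0, 0),
    and the output unit ORs the two together."""
    h_one_one = step(a + b - 1.5)      # active only when both inputs are 1
    h_zero_zero = step(0.5 - a - b)    # active only when both inputs are 0
    return step(h_one_one + h_zero_zero - 0.5)

def same_unit(u, v):
    """A hypothetical 'same unit' measuring agreement between two activity
    vectors directly, as a normalized correlation."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, same_with_hidden_layer(a, b))     # 1.0, 0.0, 0.0, 1.0

print(same_unit(np.array([1.0, 2.0]), np.array([2.0, 4.0])))  # ~1.0: strong agreement
```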

Early neural networks famously struggled with problems of this kind, XOR being the classic example, but architectures with hidden layers have since overcome these challenges.



The Concept and Potential of Capsules

Capsules are vectors that represent an entity’s properties. Information flows from one capsule layer to the next through weighted connections. Agreement among high-dimensional capsule outputs serves as a robust filter, because close agreement in a high-dimensional space is extremely unlikely to occur by chance.

Entities and Objects: Capsules aim to identify and represent entities within a neural network, whereas traditional neural networks lack a structured approach to identifying entities. Hebb emphasized the importance of explaining why visual perception is organized into objects. Capsules address this concern by representing the world in terms of entities with properties.

Capsule Architecture and Communication: Capsules are vectors representing entity properties, such as shape and color, and they communicate by passing vectors through weighted connections, offering a more natural and intuitive object representation. The connectionist binding problem is a challenge in traditional neural networks, where determining which features belong to which object is difficult. Capsules tackle this problem by assuming only one entity of a particular type exists within a receptive field. Psychological evidence, such as crowding, supports this commitment: crowding occurs when multiple similar features are present within the same receptive field, making them harder to perceive individually. Each capsule is associated with a logistic unit that indicates its activation or presence. Activation occurs when a pattern of votes from lower-level entities exhibits high-dimensional agreement.
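
The agreement test can be sketched in a few lines of numpy. Everything specific here is invented for illustration: the pose dimensionality, the identity weight matrices, and the variance-based score, which is a crude stand-in for the learned logistic unit described above.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_parts = 16, 5                     # pose dimensionality, number of parts

# Each lower-level capsule casts a vote: its pose transformed by a weight
# matrix (identity here, purely to stage an easy case of agreement).
weight_matrices = [np.eye(dim) for _ in range(n_parts)]
agreeing_poses = np.tile(rng.normal(size=dim), (n_parts, 1))  # parts of one object
random_poses = rng.normal(size=(n_parts, dim))                # unrelated clutter

def capsule_activation(part_poses, Ws):
    votes = np.stack([W @ p for W, p in zip(Ws, part_poses)])
    spread = votes.var(axis=0).mean()    # tight vote cluster -> small spread
    logit = 0.5 - spread                 # crude stand-in for a learned logistic unit
    return 1.0 / (1.0 + np.exp(-10.0 * logit))

print(f"agreeing votes -> activation {capsule_activation(agreeing_poses, weight_matrices):.3f}")
print(f"random votes   -> activation {capsule_activation(random_poses, weight_matrices):.3f}")
```

Because the votes live in a 16-dimensional space, a tight cluster is overwhelmingly unlikely to arise from clutter, which is what makes agreement such a robust filter.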

Initially proposed in 2011, capsules have shown effectiveness in tasks like object recognition and image segmentation, pointing towards a potential revolution in computer vision.



Hinton’s Critique of Conventional Networks and the Proposal of Capsules

Hinton criticizes current neural networks for their inability to represent objects effectively, losing essential information such as pose during pooling operations. Capsules, which represent pose explicitly and equivariantly, address these limitations and maintain a consistent representation of objects across different viewpoints.

Viewpoint Variation and Equivariance: Hinton highlights the importance of considering different viewpoints when representing objects. Capsules aim to provide equivariance, where the properties of the representation change in the same way as the properties of the object. This differs from convolutional neural networks (CNNs) with pooling layers, which lose explicit information about the pose of an object.
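
The difference can be seen in a toy numpy example (the one-dimensional feature map is invented for illustration): max pooling reports that a feature is present but not where it is, while a capsule-style readout carries an explicit pose that shifts in step with the input.

```python
import numpy as np

def pooled(activations):
    """Global max pooling: invariant. The output says the feature is
    present, but information about *where* it is has been discarded."""
    return activations.max()

def capsule(activations):
    """Capsule-style readout: presence plus an explicit pose (here just a
    position), which changes along with the input: equivariance."""
    return activations.max(), int(activations.argmax())

feature_map = np.zeros(10); feature_map[2] = 1.0   # feature detected at position 2
shifted_map = np.roll(feature_map, 3)              # the object moves by 3

print(pooled(feature_map), pooled(shifted_map))    # 1.0 1.0 -> identical: pose lost
print(capsule(feature_map), capsule(shifted_map))  # (1.0, 2) (1.0, 5) -> pose moves too
```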

Perceptual Frames and Different Interpretations: Hinton discusses the concept of perceptual frames, which influence how we perceive objects. He uses the example of a square rotated 45 degrees, which can be perceived as either a square or a diamond, depending on the frame imposed. This demonstrates that our knowledge of objects is relative to the frame we use to perceive them.

Cube Demonstration and the Hexahedron: Hinton presents a demonstration using a wireframe cube to illustrate the concept of perceptual frames. Holding the cube so that one corner is at the top and the diagonally opposite corner is at the bottom, and then asking participants to point out the remaining corners, reveals that many people describe a different shape, such as a hexahedron, rather than the familiar cube. This highlights the influence of the imposed coordinate system on our perception.

Tetrahedron Puzzle: Hinton introduces a puzzle involving a tetrahedron sliced into two pieces with a square cross-section. Many people find it challenging to reassemble the pieces into a tetrahedron, despite it being a two-piece jigsaw puzzle. This difficulty arises because our perceptual system imposes a frame of reference on the pieces, making it hard to see the solution.

Limitations of Convolutions: Hinton emphasizes that current CNNs cannot represent the same object in multiple ways, as humans do. They lack the ability to hold two completely different representations of one object, such as the cube and the hexahedron above, which is necessary for a comprehensive understanding of the object.



Advanced Concepts in Neural Networks: Manifolds and Coordinate Representations

Hinton discusses the importance of identifying underlying manifolds in data for effective interpolation and extrapolation. The article explores how coordinate representation can efficiently encode object properties and handle significant viewpoint variations.

Linear Manifolds and Limitations: Neural network researchers emphasize finding the underlying manifold of high-dimensional data so that they can interpolate and extrapolate along it. However, generic manifold-learning methods extrapolate only locally, much like a Taylor expansion that is accurate only near the data, which prevents extrapolation over large distances. Hinton offers a linear example: if the data are known to lie on a linear manifold, a researcher who insists on applying generic manifold-finding machinery rather than exploiting the known linear structure would be seen as misguided.

Coordinate Representation: If a face is represented by the coordinates of its features, interpolation and extrapolation become linear operations. Moving linearly in this coordinate space produces new valid faces rather than non-face images, and applying the same linear transformation to all of the coordinates preserves the structure of the face.
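
A small numpy sketch of this point, with invented keypoints standing in for detected facial features: blending two coordinate representations (even extrapolating beyond one of them) yields another plausible face configuration, and applying one shared linear map rotates the whole face coherently.

```python
import numpy as np

# Invented (x, y) keypoints: left eye, right eye, nose, mouth.
face_a = np.array([[0.30, 0.60], [0.70, 0.60], [0.50, 0.40], [0.50, 0.20]])
face_b = np.array([[0.25, 0.65], [0.75, 0.65], [0.50, 0.45], [0.50, 0.15]])

def blend(a, b, alpha):
    """alpha in [0, 1] interpolates; alpha outside [0, 1] extrapolates.
    Either way, the result is a valid configuration of face coordinates."""
    return (1 - alpha) * a + alpha * b

theta = np.deg2rad(15)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

print(blend(face_a, face_b, 1.5))   # extrapolate past face_b: still a face
print(face_a @ rotation.T)          # one linear map on every keypoint: a rotated face
```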



Hinton’s Vision for Future Neural Network Architectures

Hinton suggests a return to geometric vision techniques from the 1980s, with capsules encoding spatial relationships between features. He criticizes the over-reliance on large-scale datasets and discriminative training objectives in current deep learning approaches. The article underscores Hinton’s call for more modular and interpretable neural network architectures, focusing on understanding underlying principles rather than empirical results alone.
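
One way to picture the geometric approach is with explicit pose matrices: each part’s observed pose, composed with the inverse of its fixed part-to-whole relationship, is a vote for the pose of the whole, and agreement among votes is evidence that the whole is present. The 2-D poses and part names below are invented for illustration.

```python
import numpy as np

def pose(tx, ty, theta):
    """A 2-D pose (translation plus rotation) as a homogeneous 3x3 matrix."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]])

# Fixed spatial relationships: where each part sits within the whole.
nose_in_face = pose(0.0, -0.1, 0.0)
mouth_in_face = pose(0.0, -0.3, 0.0)

# Observed part poses, generated from a face that is rotated and translated.
face_true = pose(2.0, 1.0, np.deg2rad(30))
nose_obs = face_true @ nose_in_face
mouth_obs = face_true @ mouth_in_face

# Each part votes for the face's pose by undoing its part-to-whole relation.
vote_from_nose = nose_obs @ np.linalg.inv(nose_in_face)
vote_from_mouth = mouth_obs @ np.linalg.inv(mouth_in_face)

print(np.allclose(vote_from_nose, vote_from_mouth))   # True: the votes agree
```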



Capsule Networks: A Paradigm Shift in Deep Learning

Capsule networks represent a significant advancement in computer vision, overcoming the limitations of CNNs in handling viewpoint and orientation variations.

Innovative Features of Capsule Networks: Capsule networks excel in recognizing objects regardless of their position or orientation, offering robustness to noise and a compact representation of object parts.

Challenges and Future Directions: Despite their potential, capsule networks face challenges like computational cost and the need for further exploration and theoretical understanding.



Supplemental Additions

Convolutions and Equivariance:

– Traditional convolutional neural networks use max pooling to achieve invariance to object transformations. However, equivariance, not invariance, is desired in neural networks.

– Convolutions themselves do not make representations invariant; it is the max pooling that achieves invariance in convolutional networks.

Rate Coding and Place Coding:

– Rate coding represents changes in an object’s position by changing the real-valued activities of the same neurons.

– Place coding represents changes in an object’s position by activating different neurons.
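
A tiny numpy illustration of the distinction, using textbook-style encodings rather than anything specific from the talk:

```python
import numpy as np

def rate_code(x):
    """One neuron whose real-valued activity tracks the position x."""
    return np.array([x])

def place_code(x, n_neurons=10):
    """n neurons tiling [0, 1): *which* neuron fires codes the position."""
    code = np.zeros(n_neurons)
    code[int(x * n_neurons)] = 1.0
    return code

print(rate_code(0.37))    # [0.37]: same neuron, different value
print(place_code(0.37))   # one-hot at index 3: different neuron, same value
```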

Capsules and Coding Position:

– Capsules can code for an object’s position using both rate coding and place coding.

– Fine position is rate-coded by a capsule’s real-valued activity, while coarse position is place-coded by which capsule is active.

– In the visual system, simple cells rate-code fine position through the phase of their response, while coarse position is place-coded by which cell is active.
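
Combining the two codes, as these bullets describe, might look like the following sketch, where the tiling into ten capsules is an invented toy:

```python
import numpy as np

def capsule_code(x, n_capsules=10):
    """Coarse position: which capsule is active (place code).
    Fine position: that capsule's real-valued output (rate code)."""
    idx = int(x * n_capsules)          # which region of [0, 1) contains x
    offset = x * n_capsules - idx      # where x falls within that region
    activations = np.zeros(n_capsules)
    activations[idx] = offset
    return idx, offset, activations

idx, offset, _ = capsule_code(0.37)
print(idx, round(offset, 2))           # 3 0.7 -> coarse region 3, fine offset 0.7
```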

The Inferotemporal Pathway and Coding Transformation:

– The inferotemporal pathway in the visual system transforms place coding into rate coding.

– As we move up the visual processing hierarchy, representations become more rate-coded and less place-coded.

In conclusion, Geoffrey Hinton’s insights and the development of capsule networks mark a transformative phase in neural network research. With their novel approach to object representation and ability to handle complex visual tasks, capsule networks stand at the forefront of the next generation of computer vision technologies. However, as with any pioneering technology, they bring forth challenges that necessitate ongoing research and development.


Notes by: BraveBaryon