Geoffrey Hinton (Google Scientific Advisor) – Does the Brain do Inverse Graphics? (Aug 2015)


Chapters

00:00:08 Coordinate Frames in Object Recognition
00:05:30 Object Recognition and the Influence of Coordinate Frames
00:10:19 Mental Images as Structural Representations
00:16:26 Neural Network Representation of Geometric Relations in Computer Vision
00:23:00 Learning Geometric Relationships for Image Generation
00:28:08 Extracting Primitive Parts and Poses from Images
00:31:49 Capsule Networks: A Novel Approach to Shape Recognition
00:40:10 Transforming Autoencoders for Learning Low-Level Features
00:44:46 Neural Network Reconstruction of Stereo Images
00:47:26 Understanding the 3D Structure of Objects Using Transforming Autoencoders

Abstract

The Evolution of Neural Networks: From Convolutional to Capsule Systems

Object recognition with artificial neural networks has advanced considerably in recent years. This article traces the progression of neural network design, beginning with Geoffrey Hinton's critique of convolutional neural networks (CNNs) and following the emergence of capsule networks: structures intended to more closely emulate how humans perceive and represent objects.



1. Geoffrey Hinton’s Critique and the Limitations of CNNs

Geoffrey Hinton, a prominent figure in AI, argues that while CNNs have achieved notable success in object recognition, they suffer from a fundamental limitation: subsampling (pooling) discards information about the precise spatial relationships between features. Knowing that an eye and a mouth are present is not enough for face recognition; the network must also know how they are arranged. Because pooling keeps only the strongest activation in each region, this arrangement is lost, which impedes higher-level recognition tasks. To address this, Hinton proposes a hierarchical representation of objects analogous to the one used in computer graphics.

He further suggests a shift from invariant representations, which throw away "irrelevant" variation such as lighting and viewpoint, to equivariant ones, in which neural activities change along with the object's pose while the knowledge of the underlying shape stays fixed in the weights.
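The invariant-versus-equivariant distinction can be made concrete with a toy sketch (not from the talk itself): a 1-D "feature map" with a single active feature. Max-pooling yields an invariant code that cannot distinguish two positions of the same feature, while a code that keeps the feature's position is equivariant, shifting as the input shifts.

```python
import numpy as np

def detect(signal):
    """Return (presence, position) of the strongest activation."""
    idx = int(np.argmax(signal))
    return signal[idx], idx

x = np.zeros(8)
x[2] = 1.0                   # a feature at position 2
shifted = np.roll(x, 3)      # the same feature, moved to position 5

invariant_a = x.max()        # max-pooling: 1.0
invariant_b = shifted.max()  # still 1.0 -- position has been discarded

_, pose_a = detect(x)        # equivariant: 2
_, pose_b = detect(shifted)  # equivariant: 5 -- activity moves with the feature

print(invariant_a == invariant_b)  # True: the invariant code cannot tell them apart
print(pose_a, pose_b)              # 2 5: the equivariant code preserves the shift
```

The invariant code is useful for saying "the feature is present," but only the equivariant code supports reasoning about spatial arrangement, which is Hinton's point.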

2. The Illusions of Object Recognition and Mental Image Representation

The traditional understanding of object recognition faces challenges when dealing with rotating objects or interpreting them in different orientations. This is evident in phenomena like the diamond and square illusion or the cube illusion, where changing coordinate frames lead to varied interpretations of the same object. These illusions demonstrate that our mental images are not just 2D arrays but are more elaborate structures with viewpoints that are influenced by knowledge and perspective. For instance, a tilted diamond or an upright square can affect our perception of right angles, and imagining a wireframe cube can challenge our ability to identify its corners in an unfamiliar orientation. The cube, with its three-fold rotational symmetry, can be perceived in multiple ways, such as a crown with triangular flaps or a zigzag with a central rectangle. These interpretations highlight the brain’s ability to represent objects in multiple ways relative to coordinate frames.

3. Mental Image Formation and Computer Vision

The contrast between computer graphics and computer vision becomes clear in this context. Computer graphics renders an image from a hierarchical description of parts and their poses; computer vision inverts this process, recovering the parts and their poses from the image (hence "inverse graphics"). In mental image formation, we relate different parts of a task or concept, enabling efficient processing. This extends to object recognition, where mental rotation is part of a more complex process involving the identification of mirror images and orientations. Both computer graphics and animal vision handle viewpoint seamlessly, with graphics using a hierarchy of parts and transformation matrices. To understand shapes, humans impose coordinate frames on objects: recognizing the outline of a country, for example, often fails without additional context, demonstrating our reliance on these frames.

4. Linear Models in Neural Representation

Linear models offer significant advantages in neural networks, especially for extrapolation in tasks like recognizing objects from novel 3D viewpoints: because their behavior changes consistently with their inputs, they extrapolate reliably to sizes and orientations not seen in training, which makes them data-efficient. In this framework, pose estimation extracts the poses of an object's parts and represents each as a vector of position, orientation, and size, with handedness represented explicitly. The invariant knowledge about a shape lives in the weights relating part poses to the whole, not in the activities. Mental rotation remains important for unfamiliar orientations, where the brain modifies its representation of an object until it matches a familiar one.
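The linearity of the part-whole relation can be sketched with poses as 2-D similarity transforms in homogeneous coordinates (the function names and numbers below are illustrative, not from the talk). A part's pose is the whole's pose composed with a fixed part-relative transform, so any change of viewpoint applied to the whole carries the parts along exactly:

```python
import numpy as np

def similarity(theta, scale, tx, ty):
    """2-D similarity transform (rotation, scale, translation) as a 3x3 matrix."""
    c, s = np.cos(theta) * scale, np.sin(theta) * scale
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0,  0, 1.0]])

# fixed relation: the part sits one unit to the right of the whole, at half size
part_rel = similarity(0.0, 0.5, 1.0, 0.0)

for theta in (0.0, 0.7, 2.0):            # any viewpoint, including unseen ones
    whole = similarity(theta, 1.0, 0.0, 0.0)
    part = whole @ part_rel              # part pose = whole pose x relative pose
    # the part's position rotates rigidly with the whole: (cos(theta), sin(theta))
    print(np.round(part[:2, 2], 3))
```

Because the composition is a matrix product, it behaves the same for every viewpoint; this is what lets the model extrapolate rather than interpolate.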

5. Factor Analysis for Shape Recognition

Factor analysis is a powerful tool for identifying latent factors that explain relationships between observed variables. In shape recognition, it can learn a linear model relating the poses of an object's parts. Charlie Tang's model uses factor analysis to learn the relationship between the pose of the whole shape and the poses of its parts; by manipulating the factors, new instances of the shape with different orientations and sizes can be generated. The model requires neither prior knowledge of the shape nor a known correspondence between features, and it can handle substantial distortions in the input data.
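A simplified stand-in for this idea can be sketched in a few lines (the dimensions and names are illustrative, and this sketch cheats in one respect: it treats the whole-shape factors as observed and fits the loadings by least squares, whereas true factor analysis would infer the factors as latent variables via EM). Part poses are generated as a linear function of a few whole-shape factors plus noise; recovering the loading matrix then lets us generate new shapes by manipulating the factors:

```python
import numpy as np

rng = np.random.default_rng(0)

n, k, d = 200, 3, 8                 # samples, factors, part-pose dimensions
W_true = rng.normal(size=(d, k))    # true loadings: factors -> part poses
Z = rng.normal(size=(n, k))         # whole-shape factors (orientation, size, ...)
X = Z @ W_true.T + 0.01 * rng.normal(size=(n, d))   # observed part poses

# fit the loadings by least squares (a full FA model would also learn noise)
W_hat, *_ = np.linalg.lstsq(Z, X, rcond=None)       # shape (k, d)

# generate a novel shape instance by choosing new factor values
z_new = np.array([1.5, -0.5, 0.0])  # e.g. a new orientation/size combination
x_new = z_new @ W_hat               # part poses for the generated shape

print(np.allclose(W_hat.T, W_true, atol=0.05))
```

The key property mirrored here is the one the text describes: once the linear relation is learned, varying the factors produces coherent new poses for all parts at once.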

6. The Emergence of Capsule Networks

Capsule networks, introduced by Geoffrey Hinton, represent a significant advancement in neural networks. Capsules, or groups of neurons, jointly encode the presence and pose of a visual entity, enhancing shape recognition. These networks are resilient against noise and deformations, understand part-whole relationships, and are more efficient than CNNs. They can generalize to viewpoints and distortions unseen in training, requiring less data. Capsules encapsulate the results of internal computation, and the networks can be trained using methods like transforming autoencoders or by defining simple decoders with fixed templates.
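The core recognition mechanism can be sketched as pose agreement (this is a minimal illustration, not Hinton's exact formulation; the part-to-whole transforms below are hand-set, where a real capsule network would learn them). Each part capsule casts a "vote" for the whole's pose by composing its observed pose with the inverse of its part-relative transform; tight agreement among votes is strong evidence the whole is present, because agreement is very unlikely by chance in a high-dimensional pose space:

```python
import numpy as np

def pose(theta, tx, ty):
    """Rigid 2-D pose (rotation + translation) in homogeneous coordinates."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, tx], [s, c, ty], [0, 0, 1.0]])

# hand-set part-to-whole transforms (learned weights in a real capsule network)
part_to_whole = [np.linalg.inv(pose(0.0, 1.0, 0.0)),    # right-hand part
                 np.linalg.inv(pose(0.0, -1.0, 0.0))]   # left-hand part

whole = pose(0.4, 2.0, 1.0)
observed_parts = [whole @ pose(0.0, 1.0, 0.0),          # parts in a valid layout
                  whole @ pose(0.0, -1.0, 0.0)]

# each part votes for the whole's pose; matching votes signal recognition
votes = [p @ t for p, t in zip(observed_parts, part_to_whole)]
agreement = np.linalg.norm(votes[0] - votes[1])
print(agreement < 1e-9)   # True: the votes coincide, so the whole is detected
```

If the parts were scrambled into an invalid layout, the votes would disagree and the whole-object capsule would stay silent, which is exactly the part-whole sensitivity CNN pooling lacks.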

7. Transforming Autoencoders and 3D Structure Understanding

Transforming autoencoders are a step forward in learning low-level features, including transformations like translation, and extending this learning to 3D. This approach offers superior computational efficiency and a refined understanding of objects’ 3D structure and transformations.
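The data flow of a transforming autoencoder can be sketched conceptually (nothing is learned here; a real TAE learns both encoder and decoder from pairs of images related by a known transformation). The encoder extracts a pose, the known shift is added to that pose, and the decoder renders from the shifted pose; training would compare the rendering with the genuinely transformed input:

```python
import numpy as np

SIZE = 16

def render(x, y):
    """Stand-in decoder: draw a Gaussian blob at pose (x, y)."""
    yy, xx = np.mgrid[0:SIZE, 0:SIZE]
    return np.exp(-((xx - x) ** 2 + (yy - y) ** 2) / 4.0)

def encode(img):
    """Stand-in encoder: recover the blob's pose as its intensity centroid."""
    yy, xx = np.mgrid[0:SIZE, 0:SIZE]
    total = img.sum()
    return (img * xx).sum() / total, (img * yy).sum() / total

img = render(5.0, 7.0)
dx, dy = 3.0, -2.0                       # the transformation, told to the network

x, y = encode(img)                       # extract the pose
reconstruction = render(x + dx, y + dy)  # apply the known shift in pose space
target = render(5.0 + dx, 7.0 + dy)     # what the shifted input actually looks like

print(np.abs(reconstruction - target).max() < 1e-2)
```

The crucial point mirrored here is that the transformation is applied to the pose code, not to the pixels, which is what forces the encoder to learn genuinely pose-like features.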

8. Capsule Network’s Architecture and Its Unsupervised Clustering Capabilities

The capsule network’s architecture, particularly in reconstructing MNIST digits, demonstrates its efficiency in using learned templates to capture key shape features. It employs factor analysis on capsule outputs to represent digits linearly, allowing for accurate reconstruction of distorted digits. The network uses sparsity or dropout during training, enhancing its ability to reconstruct digits under considerable translations.

9. Transforming Autoencoders: Utilizing Knowledge of Image Transformations for Learning

Transforming autoencoders (TAEs) learn meaningful features from images by using information about image transformations. This approach furnishes more information than static images, enabling the network to learn pose parameters beneficial for tasks like object recognition and tracking. However, TAEs are more intricate to train than CNNs and are not as widely used.

10. Neural Network Learns 3D Structure and Reconstructs Stereo Pairs

Geoffrey Hinton’s research in training a neural network with stereo image pairs to reconstruct a transformed stereo pair showcases the network’s ability to understand the 3D structure of objects. This ability extends to comprehending depths and transforming the 3D structure in line with viewpoint changes. The network’s competence is further evidenced in its successful reconstruction of stereo pairs for new objects, like cars with spoilers, demonstrating its capacity to generalize and understand foreshortening.

11. Transforming Autoencoders and Dropout

Transforming autoencoders (TAEs) can reconstruct objects from diverse perspectives, indicating an internal understanding of their 3D structure. The dropout technique improves neural network performance by preventing overfitting: units are randomly deactivated during training, which improves generalization. Further information on transforming autoencoders is available on Geoffrey Hinton's website, and though the dropout paper was rejected by Science, it is accessible on arXiv. This progression, from CNNs to capsule networks and transforming autoencoders, marks a significant advance in the field, moving toward systems that more closely emulate human perception and understanding.
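The dropout mechanism mentioned above is simple enough to sketch directly (this uses "inverted" dropout, the common modern variant, which rescales at training time rather than at test time as in the original paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p, training):
    """Drop each unit with probability p during training; rescale survivors
    so the expected activation matches test time, when all units are used."""
    if not training:
        return activations
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

a = np.ones(100000)
dropped = dropout(a, p=0.5, training=True)

print(np.mean(dropped == 0.0))   # about 0.5 of the units are silenced
print(dropped.mean())            # expectation is preserved, close to 1.0
```

Because a different random subnetwork is trained on each example, no unit can rely on specific co-adapted partners, which is the mechanism behind the improved generalization the notes describe.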


Notes by: ChannelCapacity999