Geoffrey Hinton (Google Scientific Advisor) – Capsule Theory talk at MIT (Nov 2017)


Chapters

00:00:37 Capsule Networks: A Novel Approach to Neural Net Architectures
00:06:36 Neural Networks for Object Recognition and Caption Generation
00:09:43 Why Convolutional Nets Are No Good
00:12:47 The Impossibility of Reassembling a Tetrahedron
00:19:07 Evidence for the Use of Rectangular Coordinate Frames in Human Vision
00:24:23 Neural Network Invariance in Visual Systems
00:29:11 Computer Vision from the 1980s: Utilizing Parts and Invariance
00:35:13 Parse Trees via Weak Bets and Clusters
00:38:10 Capsule Networks for Object Recognition
00:49:27 Capsule Networks for Digit Recognition
00:56:16 Learning Rotation-Invariant Object Recognition
01:09:14 Neural Networks That Seek Agreement Between Activity Vectors

Abstract

Revolutionizing AI: The Emergence of Capsule Networks and the Evolution of Convolutional Neural Networks

In the dynamic landscape of artificial intelligence, Geoffrey Hinton’s proposal of “capsules” in neural networks marks a transformative step. These capsules, characterized by two key parameters (a presence probability and a generalized pose) and a distinct agreement-based routing mechanism, aim to enhance entity representation and structural integrity in AI models. This approach, set against the achievements and limitations of the Convolutional Neural Networks (ConvNets) pioneered by Yann LeCun, signifies a pivotal shift toward a more nuanced and effective understanding of visual processing in computational models. Hinton’s insights, juxtaposed with the inherent challenges of existing ConvNets, underscore the need for a paradigm shift in how AI understands and interacts with the world around it.

Expanding on Main Ideas:

Capsule Networks introduce a novel way to represent entities in neural networks, providing the explicit entity representation that current models lack. Each capsule operates on two key parameters: the probability that an entity is present and the entity’s generalized pose, which carries essential spatial information. Capsules also feature a unique routing mechanism based on agreement between predictions, ensuring coherent and consistent entity representation. This approach, inspired by the brain’s functioning, promises to revolutionize capabilities in visual processing and intelligence.
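
As a rough, illustrative sketch of this two-part output (not Hinton’s implementation; the 6-D pose vector and the capsule_output helper are invented for the example):

```python
import numpy as np

# Illustrative picture of one capsule's output: a presence probability
# plus a generalized pose. Here the pose is a 6-D vector standing in for
# position, scale, orientation, etc.; capsule papers use vectors or
# small matrices for the same role.
rng = np.random.default_rng(0)

def capsule_output(logit, pose):
    """Return (presence probability, pose vector) for one capsule."""
    presence = 1.0 / (1.0 + np.exp(-logit))  # squash the logit into [0, 1]
    return presence, np.asarray(pose, dtype=float)

p, pose = capsule_output(logit=2.0, pose=rng.normal(size=6))
print(f"present with p={p:.2f}, pose={np.round(pose, 2)}")
```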

ConvNets, pioneered by Yann LeCun, have significantly advanced object recognition, employing learned feature detectors and progressively larger spatial domains across layers. However, Geoffrey Hinton has critiqued ConvNets, particularly max pooling, for their limitations in accurately perceiving shapes and ensuring viewpoint invariance. He emphasizes the need for a more sophisticated method to represent objects from different viewpoints. ConvNets lack the representation of rectangular coordinate frames, crucial for precise object recognition, an aspect that capsule networks aim to address. Challenges faced by ConvNets, such as difficulty in recognizing handedness and incorporating mental rotation, demonstrate limitations in their approach to equivariance and object representation.

The integration of ConvNets with recurrent neural networks has opened new avenues, like image captioning. This integration leverages the activities of a trained ConvNet’s last hidden layer to train a recurrent network, raising questions about the true nature of understanding in AI models. Is the generated caption a reflection of true understanding, or just a byproduct of language model capabilities?

Innovations in ConvNets include big receptive fields, rate-coded equivariance, and the utilization of linear manifolds for object representation. Capsule networks, promising in recognizing occluded objects and building hierarchical object representations, face challenges like computational expense and training complexity. Hinton’s proof of concept on the MNIST dataset demonstrated the practical viability of capsule networks, yet their widespread application is still emerging.

Hinton’s innovative approach in AI merges unsupervised and supervised learning, enhancing performance with fewer labeled examples and outperforming traditional methods. This method demonstrates the system’s ability to decompose and reconstruct objects from their parts, achieving impressive accuracy with minimal labeled data, and highlighting the potential of capsule networks.

Convolutional Nets Cannot Explain the Effect of Coordinate Frames on Object Recognition:

Hinton presents a tetrahedron puzzle that challenges many, including MIT professors. The solution hinges on recognizing and aligning the natural coordinate frame of the tetrahedron’s pieces. People familiar with tetrahedral cartons find the puzzle easier because their mental models already align with the pieces’ natural coordinate frame.

Coordinate Frames Significantly Impact Object Recognition:

Hinton argues that humans use rectangular coordinate frames for object recognition, a concept absent in convolutional nets. He demonstrates this through the example of accurately judging right angles within a tilted square when perceived as a square but not as a diamond. Hinton suggests that the visual system and computer graphics share a commonality in imposing coordinate frames on objects and their parts.

Imposing Coordinate Frames:

The visual system imposes a hierarchy of rectangular coordinate frames to recognize objects, utilizing linear transformations to map points between frames. Hinton acknowledges potential challenges to this argument, suggesting a complex interplay in our perception process.
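
A minimal sketch of this idea in code, assuming 2-D homogeneous transforms; the specific frames below are arbitrary examples, not values from the talk:

```python
import numpy as np

# A hierarchy of coordinate frames, as in computer graphics: a point in
# a part's frame is mapped into the object's frame, then into the
# viewer's frame, purely by matrix multiplication.

def frame(angle_deg, tx, ty):
    """2-D homogeneous transform: rotate by angle_deg, then translate."""
    a = np.deg2rad(angle_deg)
    return np.array([[np.cos(a), -np.sin(a), tx],
                     [np.sin(a),  np.cos(a), ty],
                     [0.0,        0.0,       1.0]])

object_in_viewer = frame(30.0, 5.0, 2.0)   # object's frame in viewer coords
part_in_object   = frame(-10.0, 1.0, 0.5)  # part's frame in object coords

point_in_part = np.array([0.2, 0.1, 1.0])  # homogeneous 2-D point
print(np.round(object_in_viewer @ part_in_object @ point_in_part, 3))
```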

Evidence for Separate Neural Activities:

Hinton proposes that coordinate frames are represented by multiple neural activities instead of a single neuron. This is illustrated by the example of recognizing a capital letter ‘R’ and determining its orientation, where immediate determination of handedness is challenging, hinting at a distributed representation.

High-Order Parity Problems and Handedness:

Hinton explains that neural networks struggle with high-order parity problems, and handedness recognition is one of them: when a coordinate frame is represented by several numbers dispersed across a matrix, as in computer graphics, handedness corresponds to the sign of a determinant, a parity-like function of all those numbers. This is why convolutional nets fail to recognize handedness.
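
To make the determinant reading concrete (toy matrices, invented for the example):

```python
import numpy as np

# If a coordinate frame is stored as a matrix of numbers, handedness is
# the sign of its determinant: a single mirrored axis flips it, even
# though the individual entries look almost identical.
right_handed = np.eye(3)
left_handed = np.eye(3)
left_handed[:, 0] *= -1  # mirror the x-axis

print(np.linalg.det(right_handed))  #  1.0 -> right-handed
print(np.linalg.det(left_handed))   # -1.0 -> left-handed
```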

Continuous Transformation and Mental Rotation:

The brain addresses the handedness problem through continuous transformations, simplifying the task. Mental rotation tasks serve not primarily for object recognition but for resolving handedness issues.

Receptive Field Size and Accuracy:

For precise object localization, the visual system can use populations of neurons with large, overlapping receptive fields, letting the population as a whole pinpoint position far more accurately than any single neuron. The alternative, carving the image into numerous tiny dedicated regions, sacrifices resolution and scales only to a limited number of object types, such as faces.
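
A toy sketch of coarse coding under simple assumptions (Gaussian receptive fields and a centre-of-mass decoder; none of the constants come from the talk):

```python
import numpy as np

# Eight neurons with big, overlapping Gaussian receptive fields along
# one dimension. Decoding the whole population recovers position far
# more precisely than the width of any single field.
centers = np.linspace(0.0, 10.0, 8)  # widely spaced preferred positions
width = 2.0                          # large, heavily overlapping fields

def encode(x):
    return np.exp(-0.5 * ((x - centers) / width) ** 2)

def decode(activity):
    # centre-of-mass readout, one simple decoder among many
    return float(np.sum(activity * centers) / np.sum(activity))

print(decode(encode(3.7)))  # ~3.7, despite each field having width 2.0
```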

Equivariance and Place Coding:

Convolutional networks without max pooling exhibit equivariance: neural activities change in step with viewpoint changes rather than staying invariant. There are two types. In place-coded equivariance, a large movement of a feature shifts its representation to different neurons; in rate-coded equivariance, a small movement changes the activity levels of the same neurons. The visual system likely uses both.
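
A minimal demonstration of place-coded equivariance with a toy 1-D convolution (kernel and signal invented for the example):

```python
import numpy as np

# Without pooling, convolution is place-equivariant: shift the input by
# k positions and the peak of the feature map shifts by k as well.
kernel = np.array([1.0, 2.0, 1.0])  # a toy feature detector
signal = np.zeros(20)
signal[5] = 1.0                     # a feature at position 5

def feature_map(x):
    return np.convolve(x, kernel, mode="same")

print(np.argmax(feature_map(signal)))              # -> 5
print(np.argmax(feature_map(np.roll(signal, 4))))  # -> 9: it moved along
```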

Invariance and Training Data:

Current neural networks achieve invariance by training on multiple viewpoints, necessitating extensive data and time. This method lacks an inherent bias for viewpoint generalization, posing overfitting risks.

Linear Manifold and Extrapolation:

A more effective method involves using a linear manifold for object pose representation, enabling massive extrapolation and recognition of objects in varied sizes, orientations, and positions with limited training data.
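
A sketch of why a linear (matrix) pose representation extrapolates, using 2-D rotations as stand-in poses; the angles are arbitrary:

```python
import numpy as np

# If poses are matrices, a viewpoint change multiplies every pose by the
# same transform, so the part-to-whole relationship is untouched even
# under a viewpoint far outside the training data.

def rot(a):
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

whole_pose = rot(0.3)
part_pose = whole_pose @ rot(0.1)  # the part sits at +0.1 rad in the whole

viewpoint = rot(2.5)               # a large, never-seen rotation
relation_before = np.linalg.inv(whole_pose) @ part_pose
relation_after = np.linalg.inv(viewpoint @ whole_pose) @ (viewpoint @ part_pose)
print(np.allclose(relation_before, relation_after))  # True: it extrapolates
```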

Inverse Graphics:

Hinton frames vision as an inverse graphics problem: computer graphics maps a hierarchy of poses to an image, and the vision system should perform those operations in reverse, recovering the poses of parts and wholes from the pixels.

Capsule Network: Routing Information:

In capsule networks, routing information to the right capsules is crucial. Hinton likens this to an attention mechanism, but one in which lower-level capsules choose where to send their output, directing information to the higher-level capsules best equipped to handle it.

Capsule Network: Routing Agreement, Primary Capsules, and Predictions:

Capsule networks use a “routing by agreement” algorithm, which differs from loopy belief propagation. Primary capsules, the first level, convert pixel intensities into pose parameters together with a probability that the entity is present. Each primary capsule then predicts the pose and presence of specific higher-level entities, and these predictions are weighted and combined into final outcomes. The network, trainable via backpropagation, learns to extract features, predict poses, and assemble parts into larger entities. The relationship between a part’s features and an entity’s pose is linear, and the network models the predictions with a mixture of a Gaussian and a uniform distribution.
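
A minimal sketch of routing-by-agreement in the simpler dot-product style of the companion “Dynamic Routing Between Capsules” paper (the Gaussian-mixture variant described here is sketched in the next section); sizes and data are invented:

```python
import numpy as np

# preds[i, j] is lower capsule i's predicted pose vector for higher
# capsule j. Votes that agree reinforce each other's routing weights.
rng = np.random.default_rng(1)
num_in, num_out, dim = 6, 2, 4
preds = rng.normal(size=(num_in, num_out, dim))
preds[:4, 0] = [1.0, 0.0, 0.0, 0.0]  # four inputs agree on capsule 0

logits = np.zeros((num_in, num_out))
for _ in range(3):                   # a few routing iterations
    route = np.exp(logits)
    route /= route.sum(axis=1, keepdims=True)          # softmax over outputs
    out = np.einsum("ij,ijd->jd", route, preds)        # weighted vote
    out /= np.linalg.norm(out, axis=1, keepdims=True)  # unit-length output
    logits += np.einsum("ijd,jd->ij", preds, out)      # reward agreement

print(np.round(route, 2))  # rows 0-3 route strongly to capsule 0
```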

EM-based Capsule Learning:

Capsule networks model the incoming predictions with a mixture of a Gaussian and a uniform distribution, with the mixture score indicating how likely the predictions form a meaningful object cluster. The network employs softmax and backpropagation for classification, with an inner EM loop for clustering the prediction vectors. Visualizing capsule activations helps in understanding how the data cluster. Although similar in performance to convolutional neural networks on the MNIST dataset, capsule networks are slower to train. Future work aims to extend these networks to more complex scenarios and to explore unsupervised learning for acquiring primary capsules.
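
A toy E-step/M-step for the Gaussian-plus-uniform mixture, with invented votes and constants; a full implementation would run this inside the routing loop for each higher-level capsule:

```python
import numpy as np

# Eight tightly clustered votes plus four outliers. The Gaussian claims
# the cluster; the uniform "junk" component absorbs the stragglers.
rng = np.random.default_rng(2)
votes = np.concatenate([rng.normal(0.0, 0.1, size=(8, 2)),
                        rng.uniform(-3.0, 3.0, size=(4, 2))])

mu, var, mix = votes.mean(axis=0), np.full(2, 1.0), 0.5
uniform_density = 1.0 / 36.0  # uniform over the [-3, 3] x [-3, 3] box

for _ in range(10):
    # E-step: responsibility of the Gaussian for each vote
    gauss = np.exp(-0.5 * (((votes - mu) ** 2) / var).sum(axis=1))
    gauss /= 2.0 * np.pi * np.sqrt(var.prod())
    r = mix * gauss / (mix * gauss + (1.0 - mix) * uniform_density)
    # M-step: refit the Gaussian to the votes it claims
    mu = (r[:, None] * votes).sum(axis=0) / r.sum()
    var = (r[:, None] * (votes - mu) ** 2).sum(axis=0) / r.sum() + 1e-6
    mix = r.mean()

print(np.round(r, 2))  # ~1 for the clustered votes, ~0 for the outliers
```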

In-depth Analysis of Geoffrey Hinton’s Presentation on Unsupervised Learning and Object Recognition:

Hinton highlights how unsupervised learning aids object recognition: the capsules learn, in effect, an innate graphics model and reconstruct images from their parts. After the unsupervised stage, supervised learning concatenates the pose parameters from each capsule and models each class with factor analysis. The method is efficient with limited labeled examples, outperforms standard methods given the same labeled data, and shows promise in capturing linear manifolds and finding natural classes.
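
A hedged sketch of that supervised stage with synthetic stand-in data (the array shapes are invented, and in the described setup one such factor-analysis model would be fit per class):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Concatenate each capsule's pose parameters into one vector per image,
# then fit a factor analysis model to those vectors for a single class.
rng = np.random.default_rng(3)
n_images, n_capsules, pose_dim = 200, 10, 6
poses = rng.normal(size=(n_images, n_capsules, pose_dim))
X = poses.reshape(n_images, -1)  # one concatenated vector per image

fa = FactorAnalysis(n_components=5).fit(X)
print(fa.score(X))  # average log-likelihood; compare across class models
```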

Capsule Networks: Understanding the Essence and Key Concept:

Capsule networks differ from traditional neural networks in their focus on agreement among activity vectors, aiming to capture the covariance structure within data for pattern and relationship identification. The challenge lies in efficient computation as data dimensionality increases, a focus area for ongoing research in the field.

This comprehensive review of “Revolutionizing AI: The Emergence of Capsule Networks and the Evolution of Convolutional Neural Networks” elucidates the significant strides and challenges in the field of AI, highlighting the innovative work of Geoffrey Hinton and others in pushing the boundaries of neural network capabilities and understanding.


Notes by: crash_function