Geoffrey Hinton (Google Scientific Advisor) – Capsule Theory talk at MIT (Nov 2017)
Chapters
00:00:37 Capsule Networks: A Novel Approach to Neural Net Architectures
Neural Networks’ Limitations: Current neural networks lack a clear notion of “entities” and have limited structural levels.
Introducing Capsules: Capsules represent entities and their properties, such as presence, orientation, size, and color.
Capsule Properties: Capsules output probabilities of entity presence and generalized pose information. They seek agreement among predictions from lower-level capsules.
High-Dimensional Coincidences: High-dimensional coincidences are unlikely to occur by chance. They indicate a significant event or entity presence.
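As a rough illustration of this point (not from the talk, and using made-up uniformly random "predictions"), the sketch below estimates by simulation how often two independent pose vectors agree within a small tolerance; the chance falls off roughly as (2·eps)^d with dimensionality d, so close agreement among several high-dimensional predictions is very unlikely to be an accident.

```python
import numpy as np

# Illustrative only: estimate how often two independent, uniformly random
# d-dimensional "pose predictions" agree to within eps in every dimension.
rng = np.random.default_rng(0)
eps, trials = 0.1, 200_000

for d in (1, 2, 4, 8, 16):
    a = rng.uniform(size=(trials, d))
    b = rng.uniform(size=(trials, d))
    close = np.all(np.abs(a - b) < eps, axis=1)   # agreement in every dimension
    print(f"d={d:2d}  chance of agreement ~ {close.mean():.1e}")
```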
Understanding the Brain: To understand the brain, we need to comprehend the computations it performs. Capsules provide a computational model for entity recognition and pose estimation.
Mini Columns as Capsules: Mini columns in the brain might be responsible for capsule computations.
Wild Speculation: Hinton acknowledges that his theory is speculative, but argues it is grounded in computations the brain must perform in order to function.
00:06:36 Neural Networks for Object Recognition and Caption Generation
Pooling Layers: Neural networks employ pooling layers to achieve translation invariance, which makes them less sensitive to the exact location of features in an image. Pooling neurons observe nearby neurons in a layer and report the activity level of the most active one, discarding its specific location. This reduces the number of active neurons, allowing for more feature types in subsequent layers.
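A minimal sketch of 2x2 max pooling on a single feature map (an assumed NumPy toy, not code from the talk) makes the trade-off visible: each pooled value keeps how strongly a feature was detected in its window while discarding exactly where.

```python
import numpy as np

def max_pool_2x2(fmap):
    """2x2 max pooling over one feature map; the argmax position is discarded."""
    h, w = fmap.shape
    fmap = fmap[: h - h % 2, : w - w % 2]          # crop to an even size
    blocks = fmap.reshape(h // 2, 2, w // 2, 2)    # group pixels into 2x2 windows
    return blocks.max(axis=(1, 3))                 # keep only the strongest response

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(x))
# Moving the strongest activation within a window leaves the output unchanged.
```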
ConvNets Architecture: Convolutional neural networks (ConvNets) consist of multiple layers of learned feature detectors that operate locally within the image. These feature detectors are replicated across space, assuming that a feature worth detecting in one location is likely to be relevant in others. As the layers progress, the spatial domains of the feature detectors increase, capturing larger regions of the image.
Interleaved Pooling: ConvNets incorporate pooling layers between feature extraction layers to achieve translation invariance and reduce the number of active neurons. Pooling layers can be max pooling, average pooling, or other types of pooling operations. By attending to the most active feature in a local region, pooling layers provide a degree of translational invariance and facilitate the identification of higher-level features.
Challenges and Drawbacks: The success of ConvNets with pooling layers poses a challenge for developing alternative approaches, as their effectiveness makes them difficult to replace. The reliance on max pooling to achieve translation invariance comes at the cost of losing precise positional information. Overlapping pools can mitigate this loss, but they reduce the advantage gained from having fewer active units.
Image Captioning with Recurrent Neural Networks: By feeding the activities of the last hidden layer of a convolutional net trained for object recognition into a recurrent neural network, captions can be generated for images. The recurrent network is trained to produce a caption from the visual features extracted by the convolutional net. Because caption generation involves stochastic sampling, the same image can yield different captions, raising questions about the true level of understanding and about how much of the output comes from the language model rather than the vision system.
Why Hinton Doesn’t Believe in Pooling: Pooling is a poor fit for the psychology of shape perception. We want knowledge to be invariant to viewpoint, not neural activities. Convolutional nets fail to use an underlying linear manifold, which computer graphics and brains use.
Pooling as a Primitive Routing Mechanism: Pooling is an ineffective way to handle dimension hopping caused by viewpoint changes. Machine learning algorithms need to sort out dimension hopping to work effectively. Convolutional net researchers haven’t adequately addressed the routing problem.
Introduction to Hinton’s Talk: Hinton aims to persuade the audience that max pooling in convolutional nets is problematic, despite their impressive performance. He begins by arguing that pooling is a poor fit for the psychology of shape perception.
00:12:47 The Difficulty of Reassembling a Tetrahedron
Convolutional Nets Cannot Explain the Effect of Coordinate Frames on Object Recognition: Convolutional nets cannot explain how the same pixels can be processed differently depending on the coordinate frame, as they have no notion of imposing a coordinate frame.
Coordinate Frames Significantly Impact Object Recognition: Changing the rectangular coordinate frame of an object can make it unrecognizable, demonstrating the significant effect of coordinate frames on object recognition.
The Tetrahedron Puzzle: Hinton demonstrates the difficulty of a simple puzzle involving two pieces of a tetrahedron that most people, including MIT professors, find challenging.
The Key to Solving the Puzzle: The key to solving the puzzle lies in recognizing the natural coordinate frame of the pieces and aligning it with the natural coordinate frame of the tetrahedron.
The Role of Experience and Models: People who have experience with tetrahedral cartons or quadrahedron models find the puzzle easier to solve, as they have a mental model that aligns with the natural coordinate frame of the pieces.
Irvin Rock’s Map Experiment: Hinton cites Irvin Rock’s experiment in which people shown a map of Africa upside down recognized it only when told to imagine Sarah Palin’s perspective.
Conclusion: Hinton concludes that convolutional nets are not psychologically correct due to their inability to explain the effect of coordinate frames on object recognition, supported by evidence from the tetrahedron puzzle and Irvin Rock’s map experiment.
00:19:07 Evidence for the Use of Rectangular Coordinate Frames in Human Vision
Rectangular Frames and Convolutional Nets: Hinton presents evidence suggesting that humans use rectangular coordinate frames to recognize objects, unlike convolutional nets. He points out that people can accurately judge right angles within a tilted square when viewing it as a square but not as a diamond. Hinton argues that computer graphics and the visual system share similarities in imposing coordinate frames on objects and parts.
Imposing Coordinate Frames: The visual system imposes a hierarchy of rectangular coordinate frames to recognize objects. This process involves mapping points between frames using linear transformations. Hinton acknowledges the possibility of challenges to his argument.
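A small sketch of what "mapping points between frames using linear transformations" means in practice, assuming 2D homogeneous coordinates and made-up poses: each matrix relates a child frame to its parent frame, and composing matrices carries a point from a part's frame all the way into the image frame.

```python
import numpy as np

def pose(tx, ty, theta, s):
    """2D similarity transform (translate, rotate, scale) in homogeneous coordinates."""
    c, si = s * np.cos(theta), s * np.sin(theta)
    return np.array([[c, -si, tx],
                     [si,  c, ty],
                     [0.,  0., 1.]])

# Made-up hierarchy: image <- face <- nose. Each matrix maps a point expressed
# in the child's coordinate frame into the parent's coordinate frame.
face_in_image = pose(100., 80., np.deg2rad(15.), 2.0)
nose_in_face = pose(0., -5., 0., 0.3)

tip_in_nose = np.array([0., 1., 1.])                     # a point given in nose coordinates
tip_in_image = face_in_image @ nose_in_face @ tip_in_nose
print(tip_in_image[:2])                                  # the same point in image coordinates
```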
Evidence for Separate Neural Activities: Hinton proposes that coordinate frames are represented by multiple neural activities rather than a single neuron. He demonstrates this with the example of recognizing a capital letter R and determining its orientation. The inability to immediately determine handedness is seen as evidence for this distributed representation.
High-Order Parity Problems and Handedness: Hinton explains that neural networks struggle with high-order parity problems, which are relevant to determining handedness. This difficulty leads to the inability of convolutional nets to recognize handedness. The representation of coordinate frames is spread across multiple numbers, similar to computer graphics.
Continuous Transformation and Mental Rotation: The brain solves the handedness problem by performing continuous transformations to simplify the task. Mental rotation tasks are not primarily for object recognition but for resolving handedness issues.
Conclusion of Argument One: Humans use embedded rectangular coordinate frames in objects and parts. These coordinate frames are represented by multiple neural activities, indicating a distributed representation.
Argument Two: Equivariance and Convolutional Nets: Convolutional nets aim to make representations invariant to viewpoint, which differs from human perception. When humans see a face, they know its precise orientation and location, not just its identity. Some neuroscientists believe that large receptive fields imply low accuracy, but Hinton disagrees.
00:24:23 Neural Network Invariance in Visual Systems
Receptive Field Size and Accuracy: For high accuracy in object localization, neurons should have large overlapping receptive fields: the overlaps carve space into many tiny regions, so the pattern of activity across the fields pins down position precisely. The cost is that this only works when few instances of an entity type are present at once, which makes it suitable for entities such as faces.
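A toy 1D illustration of the argument (made-up numbers, not from the talk): a handful of very broad, overlapping receptive fields still localizes a single object precisely, because the joint pattern of activities is read out rather than any single field.

```python
import numpy as np

# Six neurons with wide, overlapping Gaussian receptive fields on a 1D line.
centers = np.linspace(0.0, 10.0, 6)   # receptive field centers
width = 3.0                            # each field is far wider than the spacing

def responses(x):
    return np.exp(-((x - centers) ** 2) / (2 * width ** 2))

def decode(resp, grid=np.linspace(0.0, 10.0, 1001)):
    # nearest-template readout over a fine grid of candidate positions
    templates = np.exp(-((grid[:, None] - centers) ** 2) / (2 * width ** 2))
    return float(grid[np.argmin(((templates - resp) ** 2).sum(axis=1))])

for true_x in (2.0, 2.13, 7.4):
    print(true_x, "->", decode(responses(true_x)))   # recovered to grid precision
```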
Equivariance and Place Coding: Convolutional networks without max pooling exhibit equivariance, meaning the neural activities change in step with viewpoint changes. There are two types: place-coded equivariance (which neurons are active changes as the object moves) and rate-coded equivariance (the same neurons stay active but their activity values change). The visual system likely uses both, with place coding at low levels (small domains) and rate coding at higher levels (larger domains).
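The difference between the two kinds of equivariance can be shown with a toy 1D example (assumed data, not from the talk): shifting the input changes which unit is most active in the place-coded case, whereas in the rate-coded case the same unit stays responsible and only its pose value changes.

```python
import numpy as np

# Place-coded equivariance: a bank of position-specific detectors.
# Shifting the input changes WHICH detector is active.
signal = np.zeros(12); signal[3] = 1.0
shifted = np.roll(signal, 2)
print("active unit before shift:", int(signal.argmax()))    # 3
print("active unit after  shift:", int(shifted.argmax()))   # 5

# Rate-coded equivariance: one capsule-like unit with an explicit pose output.
# Shifting the input leaves the SAME unit responsible, but its pose value changes.
def capsule_pose(x):
    positions = np.arange(len(x))
    return float((positions * x).sum() / x.sum())            # location as a real number

print("pose before shift:", capsule_pose(signal))            # 3.0
print("pose after  shift:", capsule_pose(shifted))           # 5.0
```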
Invariance and Training Data: Current neural networks deal with invariance by training on various viewpoints, requiring substantial training data and time. This approach lacks a built-in bias for generalizing across viewpoints, leading to potential overfitting.
Linear Manifold and Extrapolation: A better approach is to use a linear manifold to represent object poses, allowing for massive extrapolation. This enables recognition of objects with different sizes, orientations, and positions, even with limited training data.
Inverse Graphics: Hinton proposes using inverse graphics within the vision system, literally performing graphics operations in reverse. This approach aligns with the idea that vision is an inverse graphics problem, where the system reconstructs the 3D world from 2D images.
00:29:11 Computer Vision from the 1980s: Utilizing Parts and Invariance
Inverse Graphics: Hinton advocates "inverse graphics" for computer vision: the pose of a discovered part is used to predict the pose of the whole, inverting what computer graphics does (computing part poses from the whole's pose). Because the part-whole relationship is a fixed linear transformation that does not depend on viewpoint, this provides viewpoint invariance: the relationship between a whole and its parts remains constant regardless of the viewpoint.
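A minimal sketch of this idea with 2D homogeneous matrices and made-up poses: the fixed part-to-whole matrices play the role of the viewpoint-invariant weights, and each observed part predicts the whole's pose by multiplying its own pose by the inverse of its part-whole matrix. The predictions from different parts agree for any viewpoint, which is exactly the coincidence a higher-level capsule looks for.

```python
import numpy as np

def pose(tx, ty, theta, s):
    """2D similarity transform in homogeneous coordinates (same helper as above)."""
    c, si = s * np.cos(theta), s * np.sin(theta)
    return np.array([[c, -si, tx], [si, c, ty], [0., 0., 1.]])

# Made-up, viewpoint-INVARIANT part-whole relationships (the "weights"):
# where the mouth and nose sit inside the face's own coordinate frame.
mouth_in_face = pose(0.0, -4.0, 0.0, 0.5)
nose_in_face = pose(0.0, -1.0, 0.0, 0.3)

# Pick an arbitrary viewpoint: the face's pose in the image.
face_in_image = pose(120.0, 60.0, np.deg2rad(30.0), 1.7)

# Graphics direction: whole pose -> part poses (what a renderer would do).
mouth_in_image = face_in_image @ mouth_in_face
nose_in_image = face_in_image @ nose_in_face

# Inverse graphics: each observed part predicts the whole's pose.
face_from_mouth = mouth_in_image @ np.linalg.inv(mouth_in_face)
face_from_nose = nose_in_image @ np.linalg.inv(nose_in_face)

print(np.allclose(face_from_mouth, face_from_nose))   # True: the parts agree
print(np.allclose(face_from_mouth, face_in_image))    # True: and they are right
```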
Invariance and Crowding: Hinton emphasizes the importance of viewpoint invariance in the weights of neural networks, rather than in the neural activities themselves. He explains that capsule networks can achieve perfect viewpoint invariance in the weights, but this is limited by the fact that each capsule can only deal with one entity at a time due to its use of simultaneity for binding. Crowding occurs when things are placed too close together, making it difficult to see them, and it is caused by violating the constraint of one entity per capsule.
Shape Recognition and Coincidence Filtering: Hinton proposes a method for shape recognition using capsule networks, where familiar parts are identified and their pose parameters are used to predict the pose of the larger object. Coincidence filtering is employed to determine whether the predictions from different parts are consistent with each other, indicating the presence of the larger object. This approach is robust to noise and can ignore predictions that do not agree with the majority.
Modern Hough Transforms: Hinton draws a comparison between capsule networks and Hough transforms, which are traditional methods for shape recognition. He argues that capsule networks can be viewed as modern Hough transforms, but with the key difference that they utilize machine learning to extract reliable high-dimensional features from pixels. This allows capsule networks to make point predictions, eliminating the need for bins and multiple votes in traditional Hough transforms.
Routing Information: Hinton discusses the importance of routing information to the appropriate capsules in a capsule network. He suggests that routing can be achieved through attention mechanisms, where the most active neurons are routed to the relevant capsules. This routing principle ensures that information is directed to the capsules that are best equipped to handle it.
Parse Tree Assumptions: Because objects are opaque, an image can be modeled as a parse tree: each discovered part has exactly one parent, or possibly no parent (the single-parent constraint).
Capsule Behavior: When a part is discovered (e.g., a circle), the capsule sends its pose to multiple high-level capsules with different weights. The high-level capsule looks for clusters of incoming weak bets that agree. Initially, each capsule uses prior knowledge to send weak bets to high-level capsules with high weights for likely parents and low weights for unlikely parents. A “magic computation” (not specified) is used to find the clusters. The capsules use top-down feedback and lateral interactions to refine their predictions. The result is a parse tree where each discovered part is assigned to a parent, establishing the part-whole relationships.
Routing Agreement: Capsule networks employ a “routing by agreement” algorithm for routing information between capsules. This algorithm uses consistency to route information, unlike loopy belief propagation, which revises opinions based on incoming information. Routing by agreement allows for efficient and effective routing of information based on high-dimensional agreement.
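Below is a simplified, toy variant of routing by agreement (not the talk's exact procedure, which fits a Gaussian-plus-uniform mixture with EM): lower-level predictions are softly assigned to parents, each parent forms a consensus from its current votes, and routing weights are increased for predictions that agree with that consensus.

```python
import numpy as np

# Assume each lower-level capsule i has already produced a pose prediction
# pred[i, j] for each higher-level capsule j (e.g., via a learned transform).
rng = np.random.default_rng(1)
n_lower, n_higher, dim = 6, 2, 4
pred = rng.normal(size=(n_lower, n_higher, dim))
# Make the first four lower capsules agree about higher capsule 0:
pred[:4, 0] = np.array([1.0, -0.5, 0.3, 2.0]) + 0.01 * rng.normal(size=(4, dim))

logits = np.zeros((n_lower, n_higher))                     # routing logits
for _ in range(3):                                          # a few routing iterations
    c = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # softmax over parents
    mean = (c[:, :, None] * pred).sum(axis=0) / c.sum(axis=0)[:, None]  # per-parent consensus
    agreement = (pred * mean[None, :, :]).sum(axis=-1)      # dot product with the consensus
    logits += agreement                                      # reward predictions that agree

c = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(np.round(c, 2))   # the four agreeing capsules now route strongly to parent 0
```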
Primary Capsules: Primary capsules are the first level of capsules that have explicit pose coordinates. They convert pixel intensities into vectors of pose parameters and indicate the presence or absence of an entity. The pose parameters and activation of primary capsules are learned through gradient descent from the right answers.
Predictions: Each primary capsule makes predictions about the pose and presence of a specific entity, such as an edge, corner, or object part. These predictions are made by applying coordinate transforms to the pose parameters of the primary capsule. Predictions from different primary capsules are weighted and combined to make final predictions for the presence and pose of entities in the image.
Backpropagation and Learning: The capsule network can be trained using backpropagation to minimize the error between the network’s predictions and the ground truth labels. The weights of the coordinate transforms, biases, and other parameters are learned during training. The network can learn to extract features, predict poses, and assemble parts into larger entities.
Additional Insights: The relationship between pixels and features is highly nonlinear, but the relationship between feature poses and larger entity poses is linear. The network uses a mixture of a Gaussian and a uniform distribution to model the predictions for each high-level capsule. The mean, variance, and mixing proportion of the Gaussian distribution are learned during training. The network can learn to identify and cluster entities based on their pose parameters and presence probabilities.
Gaussian and Uniform Mixture Score: A mixture of a Gaussian and a uniform distribution is used to model the data. The score is calculated as the difference in the log probabilities of the data under the mixture and under the uniform distribution. A higher score indicates a tighter cluster, which is more likely to be a meaningful object.
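A small sketch of that score with assumed parameters and synthetic votes: the log-likelihood of the votes under the Gaussian-plus-uniform mixture minus their log-likelihood under the uniform alone. Tightly clustered votes score high; scattered votes score near or below zero.

```python
import numpy as np

def cluster_score(votes, mu, var, pi, lo=-10.0, hi=10.0):
    """Log p(votes | Gaussian+uniform mixture) minus log p(votes | uniform only).
    mu/var describe the fitted Gaussian, pi its mixing proportion, and
    [lo, hi]^d is the support of the uniform component (all illustrative)."""
    n, d = votes.shape
    log_unif = -d * np.log(hi - lo)
    log_gauss = (-0.5 * ((votes - mu) ** 2 / var + np.log(2 * np.pi * var))).sum(axis=1)
    log_mix = np.logaddexp(np.log(pi) + log_gauss, np.log(1 - pi) + log_unif)
    return log_mix.sum() - n * log_unif

rng = np.random.default_rng(0)
tight = rng.normal(loc=1.0, scale=0.05, size=(10, 3))    # votes that agree
spread = rng.uniform(-10, 10, size=(10, 3))               # unrelated votes
for votes in (tight, spread):
    mu, var = votes.mean(axis=0), votes.var(axis=0) + 1e-6
    print(round(cluster_score(votes, mu, var, pi=0.8), 1))
# The tight cluster gets a large positive score; the scattered votes do not.
```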
Softmax and Backpropagation: The score is used as the logit in a Softmax function to make a decision about the class of the object. If the decision is wrong, the score for the correct class is increased and the score for the wrong class is decreased. Derivatives of the score are backpropagated through the system to update the weights.
EM-based Clustering: An inner loop of EM (Expectation-Maximization) is used to find the cluster of data points that best explains the observations. This process involves iteratively updating the parameters of the Gaussian and uniform distributions to maximize the score.
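A toy version of that inner loop (illustrative parameters, not the talk's implementation): alternate between computing each vote's responsibility under the Gaussian versus the uniform, and re-fitting the Gaussian's mean, variance, and mixing proportion from those responsibilities.

```python
import numpy as np

rng = np.random.default_rng(2)
d, lo, hi = 2, -10.0, 10.0
votes = np.vstack([rng.normal([2.0, -1.0], 0.1, size=(8, d)),   # agreeing votes
                   rng.uniform(lo, hi, size=(5, d))])            # clutter

mu, var, pi = votes.mean(axis=0), votes.var(axis=0), 0.5          # crude initialization
log_unif = -d * np.log(hi - lo)

for _ in range(20):
    # E-step: responsibility of the Gaussian for each vote
    log_g = (-0.5 * ((votes - mu) ** 2 / var + np.log(2 * np.pi * var))).sum(axis=1)
    r = pi * np.exp(log_g) / (pi * np.exp(log_g) + (1 - pi) * np.exp(log_unif))
    # M-step: re-fit the Gaussian and the mixing proportion
    mu = (r[:, None] * votes).sum(axis=0) / r.sum()
    var = (r[:, None] * (votes - mu) ** 2).sum(axis=0) / r.sum() + 1e-6
    pi = r.mean()

print(np.round(mu, 2), np.round(var, 3), round(float(pi), 2))   # recovers the tight cluster
```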
Visualizing Capsule Activations: The posterior probability of each data point being accounted for by the Gaussian is visualized as a circle. The size of the circle indicates the strength of the belief that the data point belongs to the cluster. The votes for different classes are shown as clusters in the coordinate space.
Comparison with Convolutional Neural Networks: The capsule network achieves similar performance to a convolutional neural network on the MNIST dataset. However, the capsule network is much slower to train due to the inner loop of EM.
Limitations and Future Work: The capsule network currently handles only single digits and does not redistribute votes or reweight them based on routing by agreement. Future work will focus on extending the model to handle multiple simultaneous digits, deeper hierarchies, and real images. Unsupervised learning will be explored to obtain primary capsules without labeled data.
How Unsupervised Learning Helps with Object Recognition: Unsupervised learning lets the system learn a generative graphics model and reconstruct images from its capsules. The graphics decoder learns a template for each entity and can translate, scale, and add the templates to render the image; by learning to invert this rendering process, the system discovers what the entities should be.
Applying Supervised Learning on Top of Unsupervised: After unsupervised learning, supervised learning can be applied by concatenating pose parameters from each capsule. Factor analysis is used to find underlying factors that model the relationship between the whole and the part. Fitting a mixture of factor analyzers to the concatenated instantiation parameters of the primary capsules yields good results.
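As a hedged sketch of this stage: scikit-learn has no mixture-of-factor-analyzers estimator, so the example below approximates the idea with one FactorAnalysis model per class fitted to synthetic stand-ins for the concatenated capsule pose vectors, classifying test cases by the highest log-likelihood.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n_per_class, n_caps, pose_dim, n_classes = 200, 8, 6, 3
dim = n_caps * pose_dim                       # concatenated pose vector length

def sample_class(basis, n, rng):
    """Fake concatenated pose vectors lying near a class-specific linear manifold."""
    z = rng.normal(size=(n, 4))               # 4 underlying factors
    return z @ basis + 0.1 * rng.normal(size=(n, dim))

bases = [rng.normal(size=(4, dim)) for _ in range(n_classes)]
X_train = [sample_class(b, n_per_class, rng) for b in bases]
models = [FactorAnalysis(n_components=4).fit(x) for x in X_train]

test = sample_class(bases[1], 10, rng)        # new samples from class 1
log_liks = np.stack([m.score_samples(test) for m in models], axis=1)
print(log_liks.argmax(axis=1))                # mostly 1s: highest likelihood wins
```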
Efficiency of the Approach: With 2,000 labeled examples, the approach achieves a 1.75% error rate, significantly better than standard unsupervised followed by supervised methods. This efficiency is attributed to capturing the linear manifold and finding natural classes through unsupervised learning.
Comparison with Other Methods: The approach outperforms unsupervised pre-training followed by supervised learning, as well as generic mixture models such as a mixture of factor analyzers fitted to raw pixels. Achieving similar performance with a generic mixture model would likely require fitting around 1,000 components.
Human Object Perception and Low-Resolution Representations: The role of low-resolution representations in facilitating the mapping from pixel space to pose space is not yet clear and requires further investigation.
Human Performance in Object Recognition: Humans are relatively good at recognizing objects in different orientations, with a penalty of around 10 milliseconds for upside-down objects and a longer delay for mental rotation. Mental rotation is much slower for 3D objects compared to 2D objects.
Computational Approximations for Faster Learning: To improve the learning speed of the system, optimizations such as avoiding MATLAB programming and employing more efficient coding practices can be explored.
01:09:14 Neural Networks That Seek Agreement Between Activity Vectors
Core Essence of Capsule Networks: Capsule networks differ from traditional neural networks in their core mechanism of “agreement between activity vectors.” The focus is not on the agreement between a weight vector and an activity vector, as seen in filters, but on agreement among activity vectors.
Seeking Covariance Structure: Capsule networks aim to capture the covariance structure within data, which allows for the identification of patterns and relationships among different features. This computation involves finding high-dimensional agreements among a large set of random elements.
Efficient Computation: The challenge lies in making this computation efficient, as the dimensionality of the data increases. Geoffrey Hinton suggests that there are potential ways to achieve efficiency, which he plans to discuss further if they prove successful.
Abstract
Revolutionizing AI: The Emergence of Capsule Networks and the Evolution of Convolutional Neural Networks
In the dynamic landscape of artificial intelligence, Geoffrey Hinton's proposal of "capsules" in neural networks marks a transformative step. These capsules output two key quantities, a probability that an entity is present and a generalized pose, and use a distinct agreement-based routing mechanism, with the aim of giving AI models explicit entity representation and richer structure. This approach, set alongside critiques and advancements in Convolutional Neural Networks (ConvNets) pioneered by Yann LeCun, signals a pivotal shift towards a more nuanced and effective understanding of visual processing and intelligence in computational models. Hinton's insights, weighed against the inherent challenges and limitations of existing ConvNets, underscore the need for a paradigm shift in how AI understands and interacts with the world around it.
Expanding on Main Ideas:
Capsule Networks introduce a novel way to represent entities in neural networks, enhancing explicit entity representation which is lacking in current models. These capsules operate on two key parameters: the probability of an entity’s presence and its generalized pose, providing essential spatial information. They also feature a unique routing mechanism based on agreement between predictions, ensuring coherent and consistent entity representation. This innovative approach, inspired by the brain’s functioning, promises to revolutionize capabilities in visual processing and intelligence.
ConvNets, pioneered by Yann LeCun, have significantly advanced object recognition, employing learned feature detectors and progressively larger spatial domains across layers. However, Geoffrey Hinton has critiqued ConvNets, particularly max pooling, for their limitations in accurately perceiving shapes and ensuring viewpoint invariance. He emphasizes the need for a more sophisticated method to represent objects from different viewpoints. ConvNets lack the representation of rectangular coordinate frames, crucial for precise object recognition, an aspect that capsule networks aim to address. Challenges faced by ConvNets, such as difficulty in recognizing handedness and incorporating mental rotation, demonstrate limitations in their approach to equivariance and object representation.
The integration of ConvNets with recurrent neural networks has opened new avenues, like image captioning. This integration leverages the activities of a trained ConvNet’s last hidden layer to train a recurrent network, raising questions about the true nature of understanding in AI models. Is the generated caption a reflection of true understanding, or just a byproduct of language model capabilities?
Hinton's proposed alternatives include large overlapping receptive fields, rate-coded equivariance, and the use of linear manifolds for object representation. Capsule networks, promising for recognizing occluded objects and building hierarchical object representations, face challenges such as computational expense and training complexity. Hinton's proof of concept on the MNIST dataset demonstrated the practical viability of capsule networks, yet their widespread application is still emerging.
Hinton’s innovative approach in AI merges unsupervised and supervised learning, enhancing performance with fewer labeled examples and outperforming traditional methods. This method demonstrates the system’s ability to decompose and reconstruct objects from their parts, achieving impressive accuracy with minimal labeled data, and highlighting the potential of capsule networks.
Convolutional Nets Cannot Explain the Effect of Coordinate Frames on Object Recognition:
Hinton presents a tetrahedron puzzle that challenges many, including MIT professors. The solution hinges on recognizing and aligning the natural coordinate frame of tetrahedron pieces. People familiar with tetrahedral cartons or quadrahedron models find this puzzle easier due to their mental models aligning with the natural coordinate frame of the pieces.
Rectangular Frames and Convolutional Nets:
In further evidence, Hinton notes that humans use rectangular coordinate frames for object recognition, unlike convolutional nets. This is exemplified by the ability to judge right angles in a tilted square when viewed as a square, not a diamond. He argues that both computer graphics and the visual system impose coordinate frames on objects and parts.
Imposing Coordinate Frames:
The visual system imposes a hierarchy of rectangular coordinate frames to recognize objects, utilizing linear transformations to map points between frames. Hinton acknowledges potential challenges to this argument, suggesting a complex interplay in our perception process.
Evidence for Separate Neural Activities:
Hinton proposes that coordinate frames are represented by multiple neural activities instead of a single neuron. This is illustrated by the example of recognizing a capital letter ‘R’ and determining its orientation, where immediate determination of handedness is challenging, hinting at a distributed representation.
High-Order Parity Problems and Handedness:
Hinton explains that neural networks struggle with high-order parity problems relevant to handedness recognition. This struggle is evident in convolutional nets’ inability to recognize handedness, with coordinate frame representation dispersed across multiple numbers, akin to computer graphics.
Continuous Transformation and Mental Rotation:
The brain addresses the handedness problem through continuous transformations, simplifying the task. Mental rotation tasks serve not primarily for object recognition but for resolving handedness issues.
Receptive Field Size and Accuracy:
For precise object localization, neurons require large overlapping receptive fields. This division into numerous tiny regions increases surface area for accurate representation but at the cost of resolution, making it ideal only for limited object types like faces.
Equivariance and Place Coding:
Convolutional networks without max pooling exhibit equivariance, where neural activities vary in tandem with viewpoint changes. There are two types of equivariance: place equivariance and rate-coded equivariance, with the visual system likely using both types.
Invariance and Training Data:
Current neural networks achieve invariance by training on multiple viewpoints, necessitating extensive data and time. This method lacks an inherent bias for viewpoint generalization, posing overfitting risks.
Linear Manifold and Extrapolation:
A more effective method involves using a linear manifold for object pose representation, enabling massive extrapolation and recognition of objects in varied sizes, orientations, and positions with limited training data.
Inverse Graphics:
Hinton suggests using inverse graphics within the vision system, essentially performing graphics operations in reverse, aligning with the concept that vision is an inverse graphics problem.
Capsule Network: Routing Information:
In capsule networks, routing information to the right capsules is crucial. Hinton proposes using attention mechanisms for routing, directing information to capsules best equipped to handle it.
Capsule Network: Routing Agreement, Primary Capsules, and Predictions:
Capsule networks use a "routing by agreement" algorithm, differing from loopy belief propagation. Primary capsules, the first level, convert pixel intensities into pose parameters and indicate entity presence. Each primary capsule predicts the pose and presence of specific entities, and these predictions are weighted and combined for the final outcome. The network, trainable via backpropagation, learns to extract features, predict poses, and assemble parts into larger entities. While the mapping from pixels to features is highly nonlinear, the relationship between part poses and whole poses is linear, and the network models its predictions with a mixture of a Gaussian and a uniform distribution.
EM-based Capsule Learning:
Capsule networks utilize a Gaussian and uniform distribution mix, with scores indicating the likelihood of meaningful object clusters. The network employs Softmax and backpropagation for classification, with an EM loop for clustering data points. Capsule activation visualization helps in understanding data clustering. Although similar in performance to convolutional neural networks on the MNIST dataset, capsule networks are slower in training. Future work aims to extend these networks to handle more complex scenarios and explore unsupervised learning for primary capsule acquisition.
In-depth Analysis of Geoffrey Hinton’s Presentation on Unsupervised Learning and Object Recognition:
Hinton highlights how unsupervised learning aids in object recognition by extracting innate graphics models and reconstructing images through capsules. Post unsupervised learning, supervised learning concatenates pose parameters from each capsule, using factor analysis for modeling. This method, efficient with limited labeled examples, outperforms standard methods and shows promise in capturing linear manifolds and finding natural classes.
Capsule Networks: Understanding the Essence and Key Concept:
Capsule networks differ from traditional neural networks in their focus on agreement among activity vectors, aiming to capture the covariance structure within data for pattern and relationship identification. The challenge lies in efficient computation as data dimensionality increases, a focus area for ongoing research in the field.
This comprehensive review of “Revolutionizing AI: The Emergence of Capsule Networks and the Evolution of Convolutional Neural Networks” elucidates the significant strides and challenges in the field of AI, highlighting the innovative work of Geoffrey Hinton and others in pushing the boundaries of neural network capabilities and understanding.