Geoffrey Hinton (Google Scientific Advisor) – Capsule Theory talk at MIT (Nov 2017)
Chapters
00:00:37 Capsule Networks: A Novel Approach to Neural Net Architectures
Neural Networks’ Limitations: Current neural networks lack a clear notion of “entities” and have limited structural levels.
Introducing Capsules: Capsules represent entities and their properties, such as presence, orientation, size, and color.
Capsule Properties: Capsules output probabilities of entity presence and generalized pose information. They seek agreement among predictions from lower-level capsules.
High-Dimensional Coincidences: High-dimensional coincidences are unlikely to occur by chance. They indicate a significant event or entity presence.
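As a rough illustration of this point (not from the talk, and using made-up uniformly random "predictions"), the sketch below estimates by simulation how often two independent pose vectors agree within a small tolerance; the chance falls off roughly as (2·eps)^d with dimensionality d, so close agreement among several high-dimensional predictions is very unlikely to be an accident.

```python
import numpy as np

# Illustrative only: estimate how often two independent, uniformly random
# d-dimensional "pose predictions" agree to within eps in every dimension.
rng = np.random.default_rng(0)
eps, trials = 0.1, 200_000

for d in (1, 2, 4, 8, 16):
    a = rng.uniform(size=(trials, d))
    b = rng.uniform(size=(trials, d))
    close = np.all(np.abs(a - b) < eps, axis=1)   # agreement in every dimension
    print(f"d={d:2d}  chance of agreement ~ {close.mean():.1e}")
```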
Understanding the Brain: To understand the brain, we need to comprehend the computations it performs. Capsules provide a computational model for entity recognition and pose estimation.
Mini Columns as Capsules: Mini columns in the brain might be responsible for capsule computations.
Wild Speculation: Hinton acknowledges that his theory is speculative, but argues it is grounded in computations the brain must perform in order to function.
00:06:36 Neural Networks for Object Recognition and Caption Generation
Pooling Layers: Neural networks employ pooling layers to achieve translation invariance, which makes them less sensitive to the exact location of features in an image. Pooling neurons observe nearby neurons in a layer and report the activity level of the most active one, discarding its specific location. This reduces the number of active neurons, allowing for more feature types in subsequent layers.
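A minimal sketch of 2x2 max pooling on a single feature map (an assumed NumPy toy, not code from the talk) makes the trade-off visible: each pooled value keeps how strongly a feature was detected in its window while discarding exactly where.

```python
import numpy as np

def max_pool_2x2(fmap):
    """2x2 max pooling over one feature map; the argmax position is discarded."""
    h, w = fmap.shape
    fmap = fmap[: h - h % 2, : w - w % 2]          # crop to an even size
    blocks = fmap.reshape(h // 2, 2, w // 2, 2)    # group pixels into 2x2 windows
    return blocks.max(axis=(1, 3))                 # keep only the strongest response

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(x))
# Moving the strongest activation within a window leaves the output unchanged.
```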
ConvNets Architecture: Convolutional neural networks (ConvNets) consist of multiple layers of learned feature detectors that operate locally within the image. These feature detectors are replicated across space, assuming that a feature worth detecting in one location is likely to be relevant in others. As the layers progress, the spatial domains of the feature detectors increase, capturing larger regions of the image.
Interleaved Pooling: ConvNets incorporate pooling layers between feature extraction layers to achieve translation invariance and reduce the number of active neurons. Pooling layers can be max pooling, average pooling, or other types of pooling operations. By attending to the most active feature in a local region, pooling layers provide a degree of translational invariance and facilitate the identification of higher-level features.
Challenges and Drawbacks: The success of ConvNets with pooling layers poses a challenge for developing alternative approaches, as their effectiveness makes them difficult to replace. The reliance on max pooling to achieve translation invariance comes at the cost of losing precise positional information. Overlapping pools can mitigate this loss, but they reduce the advantage gained from having fewer active units.
Image Captioning with Recurrent Neural Networks: By feeding the activities of the last hidden layer of a convolutional net trained for object recognition into a recurrent neural network, captions can be generated for images. The recurrent network is trained to produce a caption from the visual features extracted by the convolutional net. Because caption generation involves stochastic sampling, the same image can yield different captions, raising questions about the true level of understanding and about how much of the output comes from the language model rather than the vision system.
Why Hinton Doesn’t Believe in Pooling: Pooling is a poor fit for the psychology of shape perception. We want knowledge to be invariant to viewpoint, not neural activities. Convolutional nets fail to use an underlying linear manifold, which computer graphics and brains use.
Pooling as a Primitive Routing Mechanism: Pooling is an ineffective way to handle dimension hopping caused by viewpoint changes. Machine learning algorithms need to sort out dimension hopping to work effectively. Convolutional net researchers haven’t adequately addressed the routing problem.
Introduction to Hinton’s Talk: Hinton aims to persuade the audience that max pooling in convolutional nets is problematic, despite their impressive performance. He begins by arguing that pooling is a poor fit for the psychology of shape perception.
00:12:47 The Difficulty of Reassembling a Tetrahedron
Convolutional Nets Cannot Explain the Effect of Coordinate Frames on Object Recognition: Convolutional nets cannot explain how the same pixels can be processed differently depending on the coordinate frame, as they have no notion of imposing a coordinate frame.
Coordinate Frames Significantly Impact Object Recognition: Changing the rectangular coordinate frame of an object can make it unrecognizable, demonstrating the significant effect of coordinate frames on object recognition.
The Tetrahedron Puzzle: Hinton demonstrates the difficulty of a simple puzzle involving two pieces of a tetrahedron that most people, including MIT professors, find challenging.
The Key to Solving the Puzzle: The key to solving the puzzle lies in recognizing the natural coordinate frame of the pieces and aligning it with the natural coordinate frame of the tetrahedron.
The Role of Experience and Models: People who have experience with tetrahedral cartons or quadrahedron models find the puzzle easier to solve, as they have a mental model that aligns with the natural coordinate frame of the pieces.
Irvin Rock’s Map Experiment: Hinton cites Irvin Rock’s experiment in which people shown a map of Africa upside down recognized it only when told to imagine Sarah Palin’s perspective.
Conclusion: Hinton concludes that convolutional nets are not psychologically correct due to their inability to explain the effect of coordinate frames on object recognition, supported by evidence from the tetrahedron puzzle and Irvin Rock’s map experiment.
00:19:07 Evidence for the Use of Rectangular Coordinate Frames in Human Vision
Rectangular Frames and Convolutional Nets: Hinton presents evidence suggesting that humans use rectangular coordinate frames to recognize objects, unlike convolutional nets. He points out that people can accurately judge right angles within a tilted square when viewing it as a square but not as a diamond. Hinton argues that computer graphics and the visual system share similarities in imposing coordinate frames on objects and parts.
Imposing Coordinate Frames: The visual system imposes a hierarchy of rectangular coordinate frames to recognize objects. This process involves mapping points between frames using linear transformations. Hinton acknowledges the possibility of challenges to his argument.
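A small sketch of what "mapping points between frames using linear transformations" means in practice, assuming 2D homogeneous coordinates and made-up poses: each matrix relates a child frame to its parent frame, and composing matrices carries a point from a part's frame all the way into the image frame.

```python
import numpy as np

def pose(tx, ty, theta, s):
    """2D similarity transform (translate, rotate, scale) in homogeneous coordinates."""
    c, si = s * np.cos(theta), s * np.sin(theta)
    return np.array([[c, -si, tx],
                     [si,  c, ty],
                     [0.,  0., 1.]])

# Made-up hierarchy: image <- face <- nose. Each matrix maps a point expressed
# in the child's coordinate frame into the parent's coordinate frame.
face_in_image = pose(100., 80., np.deg2rad(15.), 2.0)
nose_in_face = pose(0., -5., 0., 0.3)

tip_in_nose = np.array([0., 1., 1.])                     # a point given in nose coordinates
tip_in_image = face_in_image @ nose_in_face @ tip_in_nose
print(tip_in_image[:2])                                  # the same point in image coordinates
```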
Evidence for Separate Neural Activities: Hinton proposes that coordinate frames are represented by multiple neural activities rather than a single neuron. He demonstrates this with the example of recognizing a capital letter R and determining its orientation. The inability to immediately determine handedness is seen as evidence for this distributed representation.
High-Order Parity Problems and Handedness: Hinton explains that neural networks struggle with high-order parity problems, which are relevant to determining handedness. This difficulty leads to the inability of convolutional nets to recognize handedness. The representation of coordinate frames is spread across multiple numbers, similar to computer graphics.
Continuous Transformation and Mental Rotation: The brain solves the handedness problem by performing continuous transformations to simplify the task. Mental rotation tasks are not primarily for object recognition but for resolving handedness issues.
Conclusion of Argument One: Humans use embedded rectangular coordinate frames in objects and parts. These coordinate frames are represented by multiple neural activities, indicating a distributed representation.
Argument Two: Equivariance and Convolutional Nets: Convolutional nets aim to make representations invariant to viewpoint, which differs from human perception. When humans see a face, they know its precise orientation and location, not just its identity. Some neuroscientists believe that large receptive fields imply low accuracy, but Hinton disagrees.
00:24:23 Neural Network Invariance in Visual Systems
Receptive Field Size and Accuracy: For high accuracy in object localization, neurons should have large overlapping receptive fields: the overlaps carve space into many tiny regions, so the pattern of activity across the fields pins down position precisely. The cost is that this only works when few instances of an entity type are present at once, which makes it suitable for entities such as faces.
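A toy 1D illustration of the argument (made-up numbers, not from the talk): a handful of very broad, overlapping receptive fields still localizes a single object precisely, because the joint pattern of activities is read out rather than any single field.

```python
import numpy as np

# Six neurons with wide, overlapping Gaussian receptive fields on a 1D line.
centers = np.linspace(0.0, 10.0, 6)   # receptive field centers
width = 3.0                            # each field is far wider than the spacing

def responses(x):
    return np.exp(-((x - centers) ** 2) / (2 * width ** 2))

def decode(resp, grid=np.linspace(0.0, 10.0, 1001)):
    # nearest-template readout over a fine grid of candidate positions
    templates = np.exp(-((grid[:, None] - centers) ** 2) / (2 * width ** 2))
    return float(grid[np.argmin(((templates - resp) ** 2).sum(axis=1))])

for true_x in (2.0, 2.13, 7.4):
    print(true_x, "->", decode(responses(true_x)))   # recovered to grid precision
```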
Equivariance and Place Coding: Convolutional networks without max pooling exhibit equivariance, meaning the neural activities change in step with viewpoint changes. There are two types: place-coded equivariance (which neurons are active changes as the object moves) and rate-coded equivariance (the same neurons stay active but their activity values change). The visual system likely uses both, with place coding at low levels (small domains) and rate coding at higher levels (larger domains).
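The difference between the two kinds of equivariance can be shown with a toy 1D example (assumed data, not from the talk): shifting the input changes which unit is most active in the place-coded case, whereas in the rate-coded case the same unit stays responsible and only its pose value changes.

```python
import numpy as np

# Place-coded equivariance: a bank of position-specific detectors.
# Shifting the input changes WHICH detector is active.
signal = np.zeros(12); signal[3] = 1.0
shifted = np.roll(signal, 2)
print("active unit before shift:", int(signal.argmax()))    # 3
print("active unit after  shift:", int(shifted.argmax()))   # 5

# Rate-coded equivariance: one capsule-like unit with an explicit pose output.
# Shifting the input leaves the SAME unit responsible, but its pose value changes.
def capsule_pose(x):
    positions = np.arange(len(x))
    return float((positions * x).sum() / x.sum())            # location as a real number

print("pose before shift:", capsule_pose(signal))            # 3.0
print("pose after  shift:", capsule_pose(shifted))           # 5.0
```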
Invariance and Training Data: Current neural networks deal with invariance by training on various viewpoints, requiring substantial training data and time. This approach lacks a built-in bias for generalizing across viewpoints, leading to potential overfitting.
Linear Manifold and Extrapolation: A better approach is to use a linear manifold to represent object poses, allowing for massive extrapolation. This enables recognition of objects with different sizes, orientations, and positions, even with limited training data.
Inverse Graphics: Hinton proposes using inverse graphics within the vision system, literally performing graphics operations in reverse. This approach aligns with the idea that vision is an inverse graphics problem, where the system reconstructs the 3D world from 2D images.
00:29:11 Computer Vision from the 1980s: Utilizing Parts and Invariance
Inverse Graphics: Hinton advocates "inverse graphics" for computer vision: the pose of a discovered part is used to predict the pose of the whole, inverting what computer graphics does (computing part poses from the whole's pose). Because the part-whole relationship is a fixed linear transformation that does not depend on viewpoint, this provides viewpoint invariance: the relationship between a whole and its parts remains constant regardless of the viewpoint.
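A minimal sketch of this idea with 2D homogeneous matrices and made-up poses: the fixed part-to-whole matrices play the role of the viewpoint-invariant weights, and each observed part predicts the whole's pose by multiplying its own pose by the inverse of its part-whole matrix. The predictions from different parts agree for any viewpoint, which is exactly the coincidence a higher-level capsule looks for.

```python
import numpy as np

def pose(tx, ty, theta, s):
    """2D similarity transform in homogeneous coordinates (same helper as above)."""
    c, si = s * np.cos(theta), s * np.sin(theta)
    return np.array([[c, -si, tx], [si, c, ty], [0., 0., 1.]])

# Made-up, viewpoint-INVARIANT part-whole relationships (the "weights"):
# where the mouth and nose sit inside the face's own coordinate frame.
mouth_in_face = pose(0.0, -4.0, 0.0, 0.5)
nose_in_face = pose(0.0, -1.0, 0.0, 0.3)

# Pick an arbitrary viewpoint: the face's pose in the image.
face_in_image = pose(120.0, 60.0, np.deg2rad(30.0), 1.7)

# Graphics direction: whole pose -> part poses (what a renderer would do).
mouth_in_image = face_in_image @ mouth_in_face
nose_in_image = face_in_image @ nose_in_face

# Inverse graphics: each observed part predicts the whole's pose.
face_from_mouth = mouth_in_image @ np.linalg.inv(mouth_in_face)
face_from_nose = nose_in_image @ np.linalg.inv(nose_in_face)

print(np.allclose(face_from_mouth, face_from_nose))   # True: the parts agree
print(np.allclose(face_from_mouth, face_in_image))    # True: and they are right
```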
Invariance and Crowding: Hinton emphasizes the importance of viewpoint invariance in the weights of neural networks, rather than in the neural activities themselves. He explains that capsule networks can achieve perfect viewpoint invariance in the weights, but this is limited by the fact that each capsule can only deal with one entity at a time due to its use of simultaneity for binding. Crowding occurs when things are placed too close together, making it difficult to see them, and it is caused by violating the constraint of one entity per capsule.
Shape Recognition and Coincidence Filtering: Hinton proposes a method for shape recognition using capsule networks, where familiar parts are identified and their pose parameters are used to predict the pose of the larger object. Coincidence filtering is employed to determine whether the predictions from different parts are consistent with each other, indicating the presence of the larger object. This approach is robust to noise and can ignore predictions that do not agree with the majority.
Modern Hough Transforms: Hinton draws a comparison between capsule networks and Hough transforms, which are traditional methods for shape recognition. He argues that capsule networks can be viewed as modern Hough transforms, but with the key difference that they utilize machine learning to extract reliable high-dimensional features from pixels. This allows capsule networks to make point predictions, eliminating the need for bins and multiple votes in traditional Hough transforms.
Routing Information: Hinton discusses the importance of routing information to the appropriate capsules in a capsule network. He suggests that routing can be achieved through attention mechanisms, where the most active neurons are routed to the relevant capsules. This routing principle ensures that information is directed to the capsules that are best equipped to handle it.
Parse Tree Assumptions: Because objects are opaque, an image can be modeled as a parse tree: each discovered part has exactly one parent, or possibly no parent (the single-parent constraint).
Capsule Behavior: When a part is discovered (e.g., a circle), the capsule sends its pose to multiple high-level capsules with different weights. The high-level capsule looks for clusters of incoming weak bets that agree. Initially, each capsule uses prior knowledge to send weak bets to high-level capsules with high weights for likely parents and low weights for unlikely parents. A “magic computation” (not specified) is used to find the clusters. The capsules use top-down feedback and lateral interactions to refine their predictions. The result is a parse tree where each discovered part is assigned to a parent, establishing the part-whole relationships.
Routing Agreement: Capsule networks employ a “routing by agreement” algorithm for routing information between capsules. This algorithm uses consistency to route information, unlike loopy belief propagation, which revises opinions based on incoming information. Routing by agreement allows for efficient and effective routing of information based on high-dimensional agreement.
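Below is a simplified, toy variant of routing by agreement (not the talk's exact procedure, which fits a Gaussian-plus-uniform mixture with EM): lower-level predictions are softly assigned to parents, each parent forms a consensus from its current votes, and routing weights are increased for predictions that agree with that consensus.

```python
import numpy as np

# Assume each lower-level capsule i has already produced a pose prediction
# pred[i, j] for each higher-level capsule j (e.g., via a learned transform).
rng = np.random.default_rng(1)
n_lower, n_higher, dim = 6, 2, 4
pred = rng.normal(size=(n_lower, n_higher, dim))
# Make the first four lower capsules agree about higher capsule 0:
pred[:4, 0] = np.array([1.0, -0.5, 0.3, 2.0]) + 0.01 * rng.normal(size=(4, dim))

logits = np.zeros((n_lower, n_higher))                     # routing logits
for _ in range(3):                                          # a few routing iterations
    c = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # softmax over parents
    mean = (c[:, :, None] * pred).sum(axis=0) / c.sum(axis=0)[:, None]  # per-parent consensus
    agreement = (pred * mean[None, :, :]).sum(axis=-1)      # dot product with the consensus
    logits += agreement                                      # reward predictions that agree

c = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(np.round(c, 2))   # the four agreeing capsules now route strongly to parent 0
```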
Primary Capsules: Primary capsules are the first level of capsules that have explicit pose coordinates. They convert pixel intensities into vectors of pose parameters and indicate the presence or absence of an entity. The pose parameters and activation of primary capsules are learned through gradient descent from the right answers.
Predictions: Each primary capsule makes predictions about the pose and presence of a specific entity, such as an edge, corner, or object part. These predictions are made by applying coordinate transforms to the pose parameters of the primary capsule. Predictions from different primary capsules are weighted and combined to make final predictions for the presence and pose of entities in the image.
Backpropagation and Learning: The capsule network can be trained using backpropagation to minimize the error between the network’s predictions and the ground truth labels. The weights of the coordinate transforms, biases, and other parameters are learned during training. The network can learn to extract features, predict poses, and assemble parts into larger entities.
Additional Insights: The relationship between pixels and features is highly nonlinear, but the relationship between feature poses and larger entity poses is linear. The network uses a mixture of a Gaussian and a uniform distribution to model the predictions for each high-level capsule. The mean, variance, and mixing proportion of the Gaussian distribution are learned during training. The network can learn to identify and cluster entities based on their pose parameters and presence probabilities.
Gaussian and Uniform Mixture Score: A mixture of a Gaussian and a uniform distribution is used to model the data. The score is calculated as the difference in the log probabilities of the data under the mixture and under the uniform distribution. A higher score indicates a tighter cluster, which is more likely to be a meaningful object.
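A small sketch of that score with assumed parameters and synthetic votes: the log-likelihood of the votes under the Gaussian-plus-uniform mixture minus their log-likelihood under the uniform alone. Tightly clustered votes score high; scattered votes score near or below zero.

```python
import numpy as np

def cluster_score(votes, mu, var, pi, lo=-10.0, hi=10.0):
    """Log p(votes | Gaussian+uniform mixture) minus log p(votes | uniform only).
    mu/var describe the fitted Gaussian, pi its mixing proportion, and
    [lo, hi]^d is the support of the uniform component (all illustrative)."""
    n, d = votes.shape
    log_unif = -d * np.log(hi - lo)
    log_gauss = (-0.5 * ((votes - mu) ** 2 / var + np.log(2 * np.pi * var))).sum(axis=1)
    log_mix = np.logaddexp(np.log(pi) + log_gauss, np.log(1 - pi) + log_unif)
    return log_mix.sum() - n * log_unif

rng = np.random.default_rng(0)
tight = rng.normal(loc=1.0, scale=0.05, size=(10, 3))    # votes that agree
spread = rng.uniform(-10, 10, size=(10, 3))               # unrelated votes
for votes in (tight, spread):
    mu, var = votes.mean(axis=0), votes.var(axis=0) + 1e-6
    print(round(cluster_score(votes, mu, var, pi=0.8), 1))
# The tight cluster gets a large positive score; the scattered votes do not.
```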
Softmax and Backpropagation: The score is used as the logit in a Softmax function to make a decision about the class of the object. If the decision is wrong, the score for the correct class is increased and the score for the wrong class is decreased. Derivatives of the score are backpropagated through the system to update the weights.
EM-based Clustering: An inner loop of EM (Expectation-Maximization) is used to find the cluster of data points that best explains the observations. This process involves iteratively updating the parameters of the Gaussian and uniform distributions to maximize the score.
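A toy version of that inner loop (illustrative parameters, not the talk's implementation): alternate between computing each vote's responsibility under the Gaussian versus the uniform, and re-fitting the Gaussian's mean, variance, and mixing proportion from those responsibilities.

```python
import numpy as np

rng = np.random.default_rng(2)
d, lo, hi = 2, -10.0, 10.0
votes = np.vstack([rng.normal([2.0, -1.0], 0.1, size=(8, d)),   # agreeing votes
                   rng.uniform(lo, hi, size=(5, d))])            # clutter

mu, var, pi = votes.mean(axis=0), votes.var(axis=0), 0.5          # crude initialization
log_unif = -d * np.log(hi - lo)

for _ in range(20):
    # E-step: responsibility of the Gaussian for each vote
    log_g = (-0.5 * ((votes - mu) ** 2 / var + np.log(2 * np.pi * var))).sum(axis=1)
    r = pi * np.exp(log_g) / (pi * np.exp(log_g) + (1 - pi) * np.exp(log_unif))
    # M-step: re-fit the Gaussian and the mixing proportion
    mu = (r[:, None] * votes).sum(axis=0) / r.sum()
    var = (r[:, None] * (votes - mu) ** 2).sum(axis=0) / r.sum() + 1e-6
    pi = r.mean()

print(np.round(mu, 2), np.round(var, 3), round(float(pi), 2))   # recovers the tight cluster
```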
Visualizing Capsule Activations: The posterior probability of each data point being accounted for by the Gaussian is visualized as a circle. The size of the circle indicates the strength of the belief that the data point belongs to the cluster. The votes for different classes are shown as clusters in the coordinate space.
Comparison with Convolutional Neural Networks: The capsule network achieves similar performance to a convolutional neural network on the MNIST dataset. However, the capsule network is much slower to train due to the inner loop of EM.
Limitations and Future Work: The capsule network currently handles only single digits and does not redistribute votes or reweight them based on routing by agreement. Future work will focus on extending the model to handle multiple simultaneous digits, deeper hierarchies, and real images. Unsupervised learning will be explored to obtain primary capsules without labeled data.
How Unsupervised Learning Helps with Object Recognition: Unsupervised learning lets the system learn a generative graphics model and reconstruct images from its capsules. The graphics decoder learns a template for each entity and can translate, scale, and add the templates to render the image; by learning to invert this rendering process, the system discovers what the entities should be.
Applying Supervised Learning on Top of Unsupervised: After unsupervised learning, supervised learning can be applied by concatenating pose parameters from each capsule. Factor analysis is used to find underlying factors that model the relationship between the whole and the part. Fitting a mixture of factor analyzers to the concatenated instantiation parameters of the primary capsules yields good results.
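As a hedged sketch of this stage: scikit-learn has no mixture-of-factor-analyzers estimator, so the example below approximates the idea with one FactorAnalysis model per class fitted to synthetic stand-ins for the concatenated capsule pose vectors, classifying test cases by the highest log-likelihood.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n_per_class, n_caps, pose_dim, n_classes = 200, 8, 6, 3
dim = n_caps * pose_dim                       # concatenated pose vector length

def sample_class(basis, n, rng):
    """Fake concatenated pose vectors lying near a class-specific linear manifold."""
    z = rng.normal(size=(n, 4))               # 4 underlying factors
    return z @ basis + 0.1 * rng.normal(size=(n, dim))

bases = [rng.normal(size=(4, dim)) for _ in range(n_classes)]
X_train = [sample_class(b, n_per_class, rng) for b in bases]
models = [FactorAnalysis(n_components=4).fit(x) for x in X_train]

test = sample_class(bases[1], 10, rng)        # new samples from class 1
log_liks = np.stack([m.score_samples(test) for m in models], axis=1)
print(log_liks.argmax(axis=1))                # mostly 1s: highest likelihood wins
```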
Efficiency of the Approach: With 2,000 labeled examples, the approach achieves a 1.75% error rate, significantly better than standard unsupervised followed by supervised methods. This efficiency is attributed to capturing the linear manifold and finding natural classes through unsupervised learning.
Comparison with Other Methods: The approach outperforms unsupervised pre-training followed by supervised learning, as well as generic mixture models such as a mixture of factor analyzers fitted to raw pixels. Achieving similar performance with a generic mixture model would likely require fitting around 1,000 components.
Human Object Perception and Low-Resolution Representations: The role of low-resolution representations in facilitating the mapping from pixel space to pose space is not yet clear and requires further investigation.
Human Performance in Object Recognition: Humans are relatively good at recognizing objects in different orientations, with a penalty of around 10 milliseconds for upside-down objects and a longer delay for mental rotation. Mental rotation is much slower for 3D objects compared to 2D objects.
Computational Approximations for Faster Learning: To improve the learning speed of the system, optimizations such as avoiding MATLAB programming and employing more efficient coding practices can be explored.
01:09:14 Neural Networks That Seek Agreement Between Activity Vectors
Core Essence of Capsule Networks: Capsule networks differ from traditional neural networks in their core mechanism of “agreement between activity vectors.” The focus is not on the agreement between a weight vector and an activity vector, as seen in filters, but on agreement among activity vectors.
Seeking Covariance Structure: Capsule networks aim to capture the covariance structure within data, which allows for the identification of patterns and relationships among different features. This computation involves finding high-dimensional agreements among a large set of random elements.
Efficient Computation: The challenge lies in making this computation efficient, as the dimensionality of the data increases. Geoffrey Hinton suggests that there are potential ways to achieve efficiency, which he plans to discuss further if they prove successful.
Abstract
Revolutionizing AI: The Emergence of Capsule Networks and the Evolution of Convolutional Neural Networks
In the dynamic landscape of artificial intelligence, Geoffrey Hinton's proposal of "capsules" in neural networks marks a transformative step. These capsules output two key quantities, a probability that an entity is present and a generalized pose, and use a distinct agreement-based routing mechanism, with the aim of giving AI models explicit entity representation and richer structure. This approach, set alongside critiques and advancements in Convolutional Neural Networks (ConvNets) pioneered by Yann LeCun, signals a pivotal shift towards a more nuanced and effective understanding of visual processing and intelligence in computational models. Hinton's insights, weighed against the inherent challenges and limitations of existing ConvNets, underscore the need for a paradigm shift in how AI understands and interacts with the world around it.
Expanding on Main Ideas:
Capsule Networks introduce a novel way to represent entities in neural networks, enhancing explicit entity representation which is lacking in current models. These capsules operate on two key parameters: the probability of an entity’s presence and its generalized pose, providing essential spatial information. They also feature a unique routing mechanism based on agreement between predictions, ensuring coherent and consistent entity representation. This innovative approach, inspired by the brain’s functioning, promises to revolutionize capabilities in visual processing and intelligence.
ConvNets, pioneered by Yann LeCun, have significantly advanced object recognition, employing learned feature detectors and progressively larger spatial domains across layers. However, Geoffrey Hinton has critiqued ConvNets, particularly max pooling, for their limitations in accurately perceiving shapes and ensuring viewpoint invariance. He emphasizes the need for a more sophisticated method to represent objects from different viewpoints. ConvNets lack the representation of rectangular coordinate frames, crucial for precise object recognition, an aspect that capsule networks aim to address. Challenges faced by ConvNets, such as difficulty in recognizing handedness and incorporating mental rotation, demonstrate limitations in their approach to equivariance and object representation.
The integration of ConvNets with recurrent neural networks has opened new avenues, like image captioning. This integration leverages the activities of a trained ConvNet’s last hidden layer to train a recurrent network, raising questions about the true nature of understanding in AI models. Is the generated caption a reflection of true understanding, or just a byproduct of language model capabilities?
Hinton's proposed alternatives include large overlapping receptive fields, rate-coded equivariance, and the use of linear manifolds for object representation. Capsule networks, promising for recognizing occluded objects and building hierarchical object representations, face challenges such as computational expense and training complexity. Hinton's proof of concept on the MNIST dataset demonstrated the practical viability of capsule networks, yet their widespread application is still emerging.
Hinton’s innovative approach in AI merges unsupervised and supervised learning, enhancing performance with fewer labeled examples and outperforming traditional methods. This method demonstrates the system’s ability to decompose and reconstruct objects from their parts, achieving impressive accuracy with minimal labeled data, and highlighting the potential of capsule networks.
Convolutional Nets Cannot Explain the Effect of Coordinate Frames on Object Recognition:
Hinton presents a tetrahedron puzzle that challenges many, including MIT professors. The solution hinges on recognizing and aligning the natural coordinate frame of tetrahedron pieces. People familiar with tetrahedral cartons or quadrahedron models find this puzzle easier due to their mental models aligning with the natural coordinate frame of the pieces.
Rectangular Frames and Convolutional Nets:
In further evidence, Hinton notes that humans use rectangular coordinate frames for object recognition, unlike convolutional nets. This is exemplified by the ability to judge right angles in a tilted square when viewed as a square, not a diamond. He argues that both computer graphics and the visual system impose coordinate frames on objects and parts.
Imposing Coordinate Frames:
The visual system imposes a hierarchy of rectangular coordinate frames to recognize objects, utilizing linear transformations to map points between frames. Hinton acknowledges potential challenges to this argument, suggesting a complex interplay in our perception process.
Evidence for Separate Neural Activities:
Hinton proposes that coordinate frames are represented by multiple neural activities instead of a single neuron. This is illustrated by the example of recognizing a capital letter ‘R’ and determining its orientation, where immediate determination of handedness is challenging, hinting at a distributed representation.
High-Order Parity Problems and Handedness:
Hinton explains that neural networks struggle with high-order parity problems relevant to handedness recognition. This struggle is evident in convolutional nets’ inability to recognize handedness, with coordinate frame representation dispersed across multiple numbers, akin to computer graphics.
Continuous Transformation and Mental Rotation:
The brain addresses the handedness problem through continuous transformations, simplifying the task. Mental rotation tasks serve not primarily for object recognition but for resolving handedness issues.
Receptive Field Size and Accuracy:
For precise object localization, neurons require large overlapping receptive fields. This division into numerous tiny regions increases surface area for accurate representation but at the cost of resolution, making it ideal only for limited object types like faces.
Equivariance and Place Coding:
Convolutional networks without max pooling exhibit equivariance, where neural activities vary in tandem with viewpoint changes. There are two types of equivariance: place equivariance and rate-coded equivariance, with the visual system likely using both types.
Invariance and Training Data:
Current neural networks achieve invariance by training on multiple viewpoints, necessitating extensive data and time. This method lacks an inherent bias for viewpoint generalization, posing overfitting risks.
Linear Manifold and Extrapolation:
A more effective method involves using a linear manifold for object pose representation, enabling massive extrapolation and recognition of objects in varied sizes, orientations, and positions with limited training data.
Inverse Graphics:
Hinton suggests using inverse graphics within the vision system, essentially performing graphics operations in reverse, aligning with the concept that vision is an inverse graphics problem.
Capsule Network: Routing Information:
In capsule networks, routing information to the right capsules is crucial. Hinton proposes using attention mechanisms for routing, directing information to capsules best equipped to handle it.
Capsule Network: Routing Agreement, Primary Capsules, and Predictions:
Capsule networks use a "routing by agreement" algorithm, differing from loopy belief propagation. Primary capsules, the first level, convert pixel intensities into pose parameters and indicate entity presence. Each primary capsule predicts the pose and presence of specific entities, and these predictions are weighted and combined for the final outcome. The network, trainable via backpropagation, learns to extract features, predict poses, and assemble parts into larger entities. While the mapping from pixels to features is highly nonlinear, the relationship between part poses and whole poses is linear, and the network models its predictions with a mixture of a Gaussian and a uniform distribution.
EM-based Capsule Learning:
Capsule networks utilize a Gaussian and uniform distribution mix, with scores indicating the likelihood of meaningful object clusters. The network employs Softmax and backpropagation for classification, with an EM loop for clustering data points. Capsule activation visualization helps in understanding data clustering. Although similar in performance to convolutional neural networks on the MNIST dataset, capsule networks are slower in training. Future work aims to extend these networks to handle more complex scenarios and explore unsupervised learning for primary capsule acquisition.
In-depth Analysis of Geoffrey Hinton’s Presentation on Unsupervised Learning and Object Recognition:
Hinton highlights how unsupervised learning aids in object recognition by extracting innate graphics models and reconstructing images through capsules. Post unsupervised learning, supervised learning concatenates pose parameters from each capsule, using factor analysis for modeling. This method, efficient with limited labeled examples, outperforms standard methods and shows promise in capturing linear manifolds and finding natural classes.
Capsule Networks: Understanding the Essence and Key Concept:
Capsule networks differ from traditional neural networks in their focus on agreement among activity vectors, aiming to capture the covariance structure within data for pattern and relationship identification. The challenge lies in efficient computation as data dimensionality increases, a focus area for ongoing research in the field.
This comprehensive review of “Revolutionizing AI: The Emergence of Capsule Networks and the Evolution of Convolutional Neural Networks” elucidates the significant strides and challenges in the field of AI, highlighting the innovative work of Geoffrey Hinton and others in pushing the boundaries of neural network capabilities and understanding.