Geoffrey Hinton (Google Scientific Advisor) – How to represent part-whole hierarchies in a neural net (Jul 2021)
Chapters
00:00:01 A Unifying Representation: GLOM, by Geoff Hinton
Introduction of Geoff Hinton: Geoff Hinton, a distinguished scholar in artificial intelligence, is introduced as the speaker for the second keynote session of the SGP. His academic journey is highlighted, including his PhD from the University of Edinburgh, postdocs at the University of Sussex and the University of California, San Diego, and faculty positions at Carnegie Mellon University and the University of Toronto. He is currently an Emeritus Distinguished Professor at the University of Toronto and works part-time at Google.
Contributions to Artificial Intelligence: Hinton’s extensive contributions to artificial intelligence are mentioned, encompassing Boltzmann machines, variational learning, capsule networks, and data visualization. He played a pivotal role in introducing and popularizing the backpropagation algorithm, a cornerstone of modern machine learning, and was among the first to use backpropagation to learn word embeddings, a significant advance in natural language processing.
Recognition and Awards: Hinton’s achievements were recognized with the prestigious Turing Award in 2018, shared with Yann LeCun and Yoshua Bengio. This award acknowledged their conceptual and engineering breakthroughs that transformed deep neural networks into a critical component of computing.
Keynote Address Topic: Hinton’s keynote address will focus on his latest work, a unifying representation called GLOM. The audience is invited to learn about this innovative concept from Hinton himself.
00:02:04 GLOM: A Hierarchical System of Embeddings for Visual Perception
Background: Traditional neural networks are limited by their inability to dynamically represent hierarchical structures like parse trees, which are essential for understanding images. Geoffrey Hinton proposes GLOM, a novel neural network architecture inspired by recent advancements in transformers, unsupervised learning, and generative models.
GLOM: GLOM is a conceptual design for a neural network that simulates the human visual system. It uses a hierarchical structure of embeddings to represent different levels of abstraction, from individual pixels to complex objects. These embeddings are dynamically updated to form “islands of agreement,” which represent the parse tree of the image. The network settles somewhat like an Ising model or Markov random field, but with real-valued embedding vectors rather than binary states, and it uses coordinate transformations to communicate between levels.
Part-Whole Hierarchy and Coordinate Frames: Hinton emphasizes the psychological reality of the part-whole hierarchy and coordinate frames in human vision. He argues that we perceive images by breaking them down into parts and subparts, which are then organized into a hierarchical structure. Hinton provides a demonstration to illustrate the use of coordinate frames in visual perception.
Conclusion: GLOM remains a theoretical concept, but it serves as a starting point for exploring how the human visual system might work. Hinton encourages further research and collaboration to refine the design and investigate its potential applications.
The Cube Experiment: Geoffrey Hinton presents an experiment involving a cube placed in an unusual orientation to explore human visual perception.
Missing Corners: Holding the cube by two diagonally opposite corners and asked to point out the remaining corners, many people place only four, arranged in a square, even though six corners remain.
Zigzag Ring: The six missing corners form a zigzag ring structure, unfamiliar to most people.
Different Interpretations: The cube can be perceived in different ways, such as a crown or a hexahedron, leading to distinct interpretations.
Coordinate Frames: Human vision is sensitive to alignment with rectangular coordinate frames rather than specific shapes or features.
Right Angles: When the cube is perceived as a tilted cube, humans are acutely sensitive to right angles and notice even slight deformations.
Multiple Representations: Unlike the Necker cube, whose two interpretations correspond to different realities, the different interpretations of this cube represent the same reality, much as a single sentence can have two different parses that share the same truth conditions.
00:14:45 Structural Descriptions and Viewpoint Assignment in Mental Images
Structural Descriptions for Objects: Hinton presents a structural description of a crown, using a diagram from 1979. This description includes objects with intrinsic coordinate frames and coordinate transforms between objects and parts. It resembles old-fashioned computer graphics representations of geometry.
The Relationship Between Nodes and Viewpoint: Each node in a structural description can have an associated coordinate transform to the viewer. This allows for propagation of consistent viewpoint information over the entire structural description. It enables efficient computation of relationships between distant parts in the structure.
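To make the viewpoint-propagation idea concrete, here is a minimal Python sketch under simplifying assumptions: poses are 2-D rigid transforms stored as 3x3 homogeneous matrices, and the crown/flap names and all numbers are purely illustrative, not taken from Hinton’s diagram.

```python
# Minimal sketch: propagating viewpoint information through a structural
# description. Poses are 2-D homogeneous transforms (3x3 numpy arrays).
import numpy as np

def pose(tx, ty, theta):
    """Build a 2-D rigid transform: rotate by theta, then translate."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0,  0, 1.0]])

# Object-to-part transform from the structural description (illustrative).
crown_to_flap = pose(0.0, 1.0, np.pi / 6)   # where a flap sits on the crown

# Viewpoint: a single crown-to-viewer transform assigned at the root node.
crown_to_viewer = pose(2.0, 3.0, np.pi / 4)

# Propagation: each node's viewer transform is its parent's viewer
# transform composed with the parent-to-node transform.
flap_to_viewer = crown_to_viewer @ crown_to_flap

# Relationships between distant parts are now cheap to compute, e.g.,
# where the flap's origin lands in viewer coordinates.
print(flap_to_viewer @ np.array([0.0, 0.0, 1.0]))
```

The point the sketch illustrates is that assigning one viewer transform at the root suffices: composition propagates consistent viewpoint information to every node in the description.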
Mental Images: Hinton argues that mental images are not made of pixels but involve the assignment of viewpoint information: they have viewpoints and allow relationships between different parts of the image to be computed. They also have a definite scale (neither tiny nor huge) and are typically oriented with north vertical.
Mental Imagery Task: Hinton provides an example of a mental imagery task involving navigating a series of directions and determining the direction back to the starting point. The way people approach this task suggests that mental images have viewpoints and that viewpoint information is used to compute relationships.
Characteristics of Mental Images: Taken together, these observations suggest that mental images carry assigned viewpoint information, which is what makes the efficient computation of spatial relationships possible.
00:18:53 Contrastive Self-Supervised Learning for Image Representation Extraction
Introduction: Contrastive self-supervised learning aims to extract similar representations from different patches of an image. The idea was first proposed in 1992 (by Becker and Hinton) but only gained wide recognition much later.
SimCLR: SimCLR is a contrastive self-supervised learning algorithm developed in Toronto. It takes two augmented crops of an image (x_i and x_j) and passes them through a deep neural network f to obtain representations (h_i and h_j). These representations are then mapped to a lower-dimensional space by a projection head g, and a contrastive loss is applied that makes z_i and z_j agree if they come from the same image and disagree if they come from different images.
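A toy numpy sketch of this contrastive objective may help; it is not the real SimCLR implementation. Random linear maps stand in for the encoder f and projection head g, the “augmentation” is additive noise, and negatives are drawn only from the other view of the batch (the actual NT-Xent loss uses all 2N-1 other examples).

```python
# Toy sketch of a SimCLR-style contrastive loss, with random linear maps
# standing in for the encoder f and projection head g.
import numpy as np

rng = np.random.default_rng(0)
f = rng.normal(size=(64, 32))   # stand-in encoder: 64-dim crop -> 32-dim h
g = rng.normal(size=(32, 16))   # stand-in projection head: h -> 16-dim z

def embed(x):
    z = x @ f @ g
    return z / np.linalg.norm(z, axis=-1, keepdims=True)  # unit-norm z

# Two augmented "crops" per image, for a batch of 4 images.
crops_i = rng.normal(size=(4, 64))
crops_j = crops_i + 0.1 * rng.normal(size=(4, 64))  # stand-in augmentation

zi, zj = embed(crops_i), embed(crops_j)
tau = 0.5
sim = zi @ zj.T / tau                   # cosine similarities / temperature
# For each z_i, the matching z_j is the positive; the other images in the
# batch serve as negatives (simplified relative to full NT-Xent).
loss = -np.mean(np.diag(sim) - np.log(np.exp(sim).sum(axis=1)))
print("contrastive loss:", loss)
```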
Applications: Contrastive self-supervised learning is effective in extracting representations of scenes and, in cases like ImageNet, objects.
Classification: After obtaining representations using contrastive self-supervised learning, a linear classifier with softmax can be used for classification. This approach achieves comparable performance to supervised learning methods while only training the final layer of weights.
Limitation: Contrastive self-supervised learning is not intuitively satisfying for object recognition: two crops of the same image may contain different objects, yet the objective still forces their representations to agree.
Overview of GLOM: GLOM is a neural network architecture designed to overcome exactly this problem of forcing the same output vector onto patches of an image that contain different objects. It uses attention so that agreement is sought only where it is spatially coherent with familiar shapes.
Goal: GLOM aims to obtain an island of identical vectors at the object level, so that, for example, every location on a face shares the same face vector even while the nose and mouth locations carry different part-level vectors. Converting different representations of the same object into identical representations enables the extraction of common structure.
Inspiration from Biology: GLOM is partly inspired by biology: every cell carries a complete set of instructions for making proteins, and the environment determines which proteins are actually expressed. The vector of protein expressions in a cell is analogous to the vector of multi-level embeddings in a GLOM column, and an organ, composed of cells with similar expression vectors, is analogous to an island of agreement representing an object.
Replicating Weights and Knowledge: GLOM replicates the same weights across all columns, just as every cell replicates the full genome. The replication looks wasteful but makes local processing simple and efficient.
Hardware Equivalents: A column of hardware representing a small patch of the image is the analogue of a cell, and its stack of embedding vectors at different levels is the analogue of the cell’s protein expression vector.
Vision as a Sampling Process: Vision is a sampling process in which an outer loop of fixations decides which part of the scene to process in detail.
00:25:20 Hierarchical Neural Fields: A New Framework for Object Recognition
Architecture of the Hierarchical Neural Fields Model: This model comprises numerous columns, each containing a three-level embedding system. The embedding at each level is influenced by bottom-up, top-down, and local interactions. Bottom-up interactions predict higher-level embeddings from lower-level ones; top-down interactions predict lower-level embeddings from higher-level ones; and local interactions perform attention-weighted averaging of nearby columns’ embeddings at the same level, promoting agreement among similar embeddings.
Key Characteristics of the Model: The model uses a drastically simplified version of the transformer architecture for attention, with weights given by the exponentiated scalar products of the embeddings themselves (query, key, and value are all the same vector). This lets simple attention-weighted averaging serve as the sole interaction between locations. As the network settles, it forms distinct regions or “islands” of identical embeddings, and these islands provide the segmentation of the image into objects.
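Below is a minimal numpy sketch of the settling process for a single level of embeddings across a row of columns, under stated assumptions: the bottom-up and top-down networks are stand-in random linear maps (the real ones are multi-layer nets applying coordinate transforms), and the equal mixing weights are illustrative.

```python
# Toy sketch of GLOM settling steps for the embeddings at one level
# across a 1-D row of columns. All networks are stand-in linear maps.
import numpy as np

rng = np.random.default_rng(1)
N, D = 8, 16                          # number of columns, embedding dim
level = rng.normal(size=(N, D))       # current embeddings at this level
below = rng.normal(size=(N, D))       # embeddings one level down
above = rng.normal(size=(N, D))       # embeddings one level up
W_up = rng.normal(size=(D, D)) / np.sqrt(D)
W_down = rng.normal(size=(D, D)) / np.sqrt(D)

for _ in range(10):                   # settling iterations
    bottom_up = below @ W_up          # predictions from the level below
    top_down = above @ W_down         # predictions from the level above
    # Simplified attention: the weight on a neighbour is the exponentiated
    # scalar product of the two embeddings (query = key = value).
    logits = level @ level.T
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    neighbour_avg = attn @ level      # attention-weighted same-level average
    # New embedding: equal-weight average of the contributions (illustrative).
    level = 0.25 * (level + bottom_up + top_down + neighbour_avg)
```

Iterating this update is what allows islands of near-identical vectors to emerge at higher levels.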
Handling Object Vector Ambiguities: The model addresses the challenge of using the same object vector for different locations within an object. It employs hierarchical neural fields, where the object vector includes information about the object’s pose relative to image coordinates. This allows the model to generate location-specific predictions, such as distinguishing between a mouth and a nose within the same object.
Neural Fields for Representing Uniform Features: Neural fields are introduced as a means of representing uniform features across multiple pixels. An example is given of a gradient across four pixels, represented by a single set of coefficients (A and B). This representation allows for efficient reconstruction of the pixels from the coefficients.
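As a concrete toy example of such a neural field (writing the coefficients as a and b):

```python
# Toy sketch of a neural field for a uniform gradient: four pixel values
# are represented by two coefficients (a, b) of a linear function of
# position, and are reconstructed on demand from that compact code.
import numpy as np

a, b = 0.5, 1.0              # the compact "code" for this patch
xs = np.arange(4.0)          # the four pixel positions
pixels = a * xs + b          # reconstruction: [1.0, 1.5, 2.0, 2.5]
print(pixels)
```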
Transformational Random Fields for Disambiguating Parts: Transformational random fields are employed to handle ambiguities in identifying object parts. A possible mouth, for instance, can specify the pose of a nose it expects to find in relation to itself. This enables the model to disambiguate the mouth and nose by searching for the expected nose pose.
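The sketch below illustrates this disambiguation step under simplifying assumptions: poses are 3x3 homogeneous matrices, and the mouth-to-nose relation and all numbers are invented for illustration.

```python
# Toy sketch of parts disambiguating each other via coordinate transforms:
# a candidate mouth predicts where a nose should be, and a candidate nose
# is checked against that prediction.
import numpy as np

def pose(tx, ty, theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, tx], [s, c, ty], [0, 0, 1.0]])

mouth_pose = pose(1.0, 2.0, 0.1)        # candidate mouth, image coordinates
mouth_to_nose = pose(0.0, 1.5, 0.0)     # learned relation: nose above mouth

expected_nose = mouth_pose @ mouth_to_nose  # where the mouth expects a nose
candidate_nose = pose(0.85, 3.5, 0.1)       # an actual candidate nose

# Agreement score: how close the candidate is to the expectation.
err = np.linalg.norm(expected_nose - candidate_nose)
print("pose mismatch:", err)
```

If the mismatch is small, the candidate mouth and nose support each other’s interpretation; if large, the combination is rejected.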
Ambiguous Multimodal Predictions: Predicting the whole object from its ambiguous parts is challenging, especially when dealing with ambiguous parts like circles that could be eyes, wheels, etc. This results in multimodal predictions at the next level up, requiring a more complex mechanism for disambiguation.
Iterative Net for Common Mode Selection: Disambiguation is achieved through an iterative network that averages similar predictions, selecting the common mode. This process identifies the most likely interpretation of the ambiguous parts based on agreement among nearby localities.
Log Probabilities for Combining Distributions: Combining multimodal predictions is done with unnormalized log probabilities rather than raw probabilities: averaging log probabilities corresponds to taking a geometric mean of the distributions, which sharpens the modes on which the predictions agree.
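Here is a toy numpy demonstration of the idea; the five discrete hypotheses and their scores are invented for illustration.

```python
# Toy sketch of combining two multimodal predictions by averaging their
# unnormalized log probabilities: shared modes survive, disputed modes
# are suppressed.
import numpy as np

def softmax(logp):
    e = np.exp(logp - logp.max())
    return e / e.sum()

# Two predictions over 5 discrete hypotheses, as unnormalized log probs.
logp1 = np.array([2.0, 2.0, -1.0, -1.0, -1.0])  # favours hypotheses 0 and 1
logp2 = np.array([2.0, -1.0, 2.0, -1.0, -1.0])  # favours hypotheses 0 and 2

combined = softmax((logp1 + logp2) / 2)  # geometric mean of distributions
print(combined)  # mass concentrates on hypothesis 0, the shared mode
```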
00:36:05 Composing High-Level Concepts from Sparse Feature Detectors
Introduction: Geoffrey Hinton presents GLOM, an imaginary system that addresses the challenge of representing parse trees without dynamically allocating neurons to nodes.
GLOM’s Functionality: GLOM uses embedding vectors to represent objects or concepts. Each neuron acts as a basis function in an unnormalized log probability space, with its activity serving as the coefficient on that basis function. The basis functions are vague, which allows ambiguous objects to be represented, and combining neuron activities in the log probability space yields a sharper representation.
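The following toy sketch, with an invented 1-D pose axis and Gaussian-shaped log-probability bumps as basis functions, shows how vague bases combine into a sharp distribution:

```python
# Toy sketch of neurons as basis functions in an unnormalized log
# probability space: each neuron contributes a vague log-probability
# "bump", and the weighted sum of bumps yields a much sharper
# distribution than any single neuron could.
import numpy as np

xs = np.linspace(-3, 3, 61)                   # an invented 1-D pose axis

def bump(center, width=1.5):
    return -((xs - center) ** 2) / (2 * width ** 2)  # broad log-prob basis

basis = np.stack([bump(c) for c in (-2.0, -0.5, 0.5, 2.0)])
activities = np.array([0.1, 1.0, 1.0, 0.1])   # neuron activities = coefficients

log_prob = activities @ basis                 # combine in log space
prob = np.exp(log_prob - log_prob.max())
prob /= prob.sum()
print("sharp mode at pose:", xs[prob.argmax()])  # near 0, where bumps agree
```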
Visualizing GLOM’s Performance: GLOM’s ability to reconstruct objects from ambiguous data is demonstrated using an example of ellipses forming objects such as faces and sheep.
Training GLOM: GLOM can be trained in an unsupervised manner, similar to language models like BERT. One objective encourages the formation of “islands” of similar embeddings. Another trains the bottom-up and top-down neural networks to agree with a consensus embedding derived from information at the same level and from other columns. This training promotes island formation and facilitates the sharing of knowledge between locations.
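A toy numpy sketch of these two training signals follows; all networks are stand-in random linear maps, and a random vector stands in for the attention-weighted lateral input.

```python
# Toy sketch of GLOM's two unsupervised training signals: (1) BERT-like
# denoising of a masked input and (2) agreement between the bottom-up
# and top-down predictions and the consensus embedding.
import numpy as np

rng = np.random.default_rng(2)
D = 16
x = rng.normal(size=D)                   # a location's input embedding
x_masked = np.where(rng.random(D) < 0.25, 0.0, x)  # mask a quarter of it

W_up, W_down, W_dec = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))

bottom_up = x_masked @ W_up              # prediction from the level below
top_down = rng.normal(size=D) @ W_down   # stand-in prediction from above
lateral = rng.normal(size=D)             # stand-in attention-weighted average

consensus = (bottom_up + top_down + lateral) / 3.0

recon_loss = np.mean((consensus @ W_dec - x) ** 2)     # denoising objective
agree_loss = (np.mean((bottom_up - consensus) ** 2) +  # island-forming
              np.mean((top_down - consensus) ** 2))    # agreement objective
print("total loss:", recon_loss + agree_loss)
```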
Addressing Potential Objections: Hinton argues that replicating embedding vectors for a single object is not wasteful during the search for an interpretation of an image. He compares the multiple identical vectors pointing to the same object to pointers in computer science, and notes that the cost is further reduced by sparse long-range interactions, which need only sample a few locations to obtain sufficient information.
Summary of GLOM: GLOM combines recent ideas in neural networks to tackle the problem of representing parse trees without dynamic allocation of neurons. It employs universal capsules, represented by embedding vectors, that can represent anything. A detailed account is given in a paper available on arXiv.
Additional Questions: The possibility of skip connections between different levels of the architecture is raised. Hinton suggests that the hardware levels can be thought of as a window that moves over the hierarchy in a scene, using the example of the Bohr atom to illustrate how attention can be focused on particular levels of the hierarchy.
00:44:41 Neuromorphic Networks: Using Hardware for Thinking About Consciousness
Hardware Usage for Thinking: The brain uses similar hardware for thinking about the solar system and other concepts, enabling a mapping between reality and the hardware. Fixations allow for sequential changes in the mapping between hardware levels and the world.
Retina and Zoom: The retina’s non-uniform resolution means that moving the eyes effectively zooms: fixating a region brings it into the high-resolution fovea.
Skip Connections in Initialization: Convolutional neural networks are used for initializing different embedding levels. Direct initialization of higher-level objects is possible, bypassing lower levels.
Combining Parts to Detect Objects: Capsule networks allow for combining parts to detect certain objects. The challenge lies in integrating top-down and bottom-up information.
00:47:05 Self-Supervised Learning for 3D Scenes: From Object-Centric Views to Whole Scenes
Top-down interactions in capsule networks: In previous capsule papers, top-down interactions were not allowed to change the representation of the level below, resulting in a bottom-up approach that resolved ambiguities one layer at a time. In stacked capsule autoencoders, a set transformer is used to let parts disambiguate each other, much as a nose and a mouth disambiguate each other via coordinate transforms.
Coordinate transforms and attention weighted averaging: Glom simplifies the process by transmitting ambiguity to the next level via coordinate transforms, which can be resolved by attention weighted averaging. This approach relies on the ability of neural networks to represent multimodal distributions using neurons as basis functions in an unnormalized log probability space.
Applying self-supervised methods to 3D scenes: In human vision, when focusing on an object in an image (similar to ImageNet), it is often possible to find a large bounding box that contains that object. For 3D scenes, however, processing the entire scene at once may be necessary. To apply self-supervised methods to 3D scenes, it may be beneficial to adopt a similar approach to object-centric views, but further research is needed to determine how to effectively transition from an object-centric view to 3D scenes.
00:49:38 Falsifying Theories of Brain Function Through Engineering
Key Concepts and Insights:
Attention and Visual Processing: Our visual system uses attention to process specific parts of an image in fine detail while disregarding the periphery. The GLOM model is not yet properly developed for recognizing small objects in the periphery; Hinton deliberately keeps the architecture simple and acknowledges that more development is needed.
Object Perception and Attention Allocation: Hinton suggests that we can only see one object at a time. Recognizing an object requires seeing its parts, but fixating on a specific part yields a richer representation than general viewing. The model in the presentation covers a single fixation and does not address the sequential allocation of attention.
Questions and Responses:
Discovery of New Facial Features: Hinton notes that a well-functioning system could potentially identify facial features not previously known, though the current model’s limitations prevent a definitive answer. He discusses how neural fields with coordinate frames tend to align with an object’s natural symmetries, maximizing efficiency of representation.
Falsifying Hypotheses in Computational Neuroscience: Hinton emphasizes the advantage engineers have in falsifying models by demonstrating their practical shortcomings. Before computers, theories about the brain lacked empirical validation; building and testing models computationally allows direct falsification based on performance.
Leveraging GLOM with Limited Unlabeled Data: While the human visual system uses relatively little unlabeled data, Hinton acknowledges the need to explore how GLOM can be optimized for such scenarios, suggesting co-distillation and contrastive learning approaches as directions to investigate.
00:55:57 Unsupervised Learning Through Contrastive Representation
Introduction of an Unsupervised Learning System: Hinton proposes a novel unsupervised learning system that does not rely on labeled data. The system is designed to learn a hierarchical representation of images.
Key Components of the System:
Partial Autoencoder: The system masks out a patch of an image and iteratively fills it in, similar to an autoencoder.
Contrastive Learning: The system aims to make nearby patches of the image have similar representations.
Together, these two components enable the system to learn without labels.
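A toy numpy sketch of the two ingredients follows; the “image” is a row of random patch vectors and the infilling rule is a stand-in for a learned decoder.

```python
# Toy sketch: (1) mask a patch and iteratively fill it in from its
# surroundings, (2) encourage nearby patches to have similar
# representations.
import numpy as np

rng = np.random.default_rng(3)
patches = rng.normal(size=(6, 8))     # six 8-dim patch representations
target = patches[2].copy()
patches[2] = 0.0                      # mask out patch 2

# (1) Partial autoencoding: repeatedly re-estimate the masked patch as a
# blend of its spatial neighbours (stand-in for the real decoder).
for _ in range(10):
    patches[2] = 0.5 * patches[2] + 0.25 * (patches[1] + patches[3])
print("infill error:", np.linalg.norm(patches[2] - target))

# (2) Contrastive-style agreement: a loss that pulls neighbouring patch
# representations together (squared distance between adjacent patches).
agreement_loss = np.mean((patches[:-1] - patches[1:]) ** 2)
print("agreement loss:", agreement_loss)
```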
Connection to Brain Injuries: Hinton discusses a case of semantic access dyslexia, in which brain-damaged patients make semantic errors when reading words. He suggests that studying such brain injuries can help validate the effectiveness of brain models.
Relevance to Neural Network Models: Hinton highlights the relevance of neural network models in understanding brain functions. Specifically, he mentions models that map letters to semantics in word reading.
Conclusion: Hinton’s presentation introduces an unsupervised learning system that leverages partial autoencoders and contrastive learning. He emphasizes the potential of studying brain injuries to validate brain models and the role of neural networks in understanding brain functions.
Lateral Interactions and Semantic Attractors: Neural networks with attractors, where certain familiar sets of semantic features form an attractor through lateral interactions, can model the semantics of language.
Forward Pathway to Attractors: A forward pathway gets the network from the letters of a word to the semantics of the word, leading it to the appropriate attractor.
Damage to Attractors or Pathways: If either the attractor or the forward pathway is damaged, the letters can lead the network to a nearby but different attractor, producing a similar word such as “apricot” instead of “peach” (a toy sketch of this behavior follows below).
Neural Net Models and Neuropsychology: These neural net models explain seemingly strange phenomena observed by neuropsychologists in terms of mundane neural mechanisms.
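To make the attractor story concrete, here is a toy Hopfield-style sketch; the 32-bit patterns standing in for the semantics of “peach” and “apricot” are invented, and the damage is random deletion of connections.

```python
# Toy Hopfield-style sketch of semantic attractors. Two similar semantic
# patterns ("peach", "apricot") are stored via Hebbian learning; lateral
# interactions pull a noisy cue to the nearest attractor. Damaging the
# weights may send the same cue to the nearby wrong attractor.
import numpy as np

rng = np.random.default_rng(4)
peach = np.where(rng.random(32) < 0.5, 1.0, -1.0)
apricot = peach.copy()
apricot[:6] *= -1                     # a nearby but distinct attractor

W = np.outer(peach, peach) + np.outer(apricot, apricot)
np.fill_diagonal(W, 0)

def settle(state, W, steps=20):
    for _ in range(steps):
        state = np.where(W @ state >= 0, 1.0, -1.0)  # lateral updates
    return state

cue = peach.copy()
cue[10:13] *= -1                      # forward pathway delivers a noisy cue
print("intact net recovers peach:", np.array_equal(settle(cue, W), peach))

W_damaged = W * (rng.random(W.shape) > 0.4)  # knock out ~40% of connections
out = settle(cue, W_damaged)
print("damaged net -> peach:", np.array_equal(out, peach),
      "| -> apricot:", np.array_equal(out, apricot))
```

With the intact weights the noisy cue settles to “peach”; with enough damage the same cue may land in the nearby “apricot” attractor, mirroring the neuropsychological errors described above.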
Abstract
Geoffrey Hinton’s Visionary Leap in AI: Unraveling the Intricacies of the Human Mind and Visual Perception
Geoffrey Hinton, an acclaimed AI pioneer from the University of Toronto and Google, has significantly advanced the field of artificial intelligence. His groundbreaking work includes the development of the GLOM neural network, a transformative approach aiming to closely mimic human visual processing. This system, which combines elements of transformers, unsupervised learning, and generative models, seeks to understand and replicate the hierarchical and psychological realities of human vision. Hinton’s contribution extends beyond GLOM, encompassing breakthroughs in deep neural networks, earning him the Turing Award in 2018, and his innovative work in areas such as contrastive self-supervised learning and the efficient representation of complex visual structures. His insights into mental imagery, structural parsing, and the handling of ambiguous visual information mark a significant step towards bridging the gap between human cognitive abilities and artificial intelligence.
Introduction of Geoff Hinton:
Geoff Hinton is a luminary in the field of artificial intelligence, known for his pioneering work on Boltzmann machines, variational learning, capsule networks, and data visualization. He popularized the backpropagation algorithm, a fundamental technique in deep learning, and played a key role in introducing word embeddings in natural language processing. His transformative contributions to the field earned him the Turing Award in 2018, which he shared with Yann LeCun and Yoshua Bengio.
Segment Summaries and Main Ideas
Geoff Hinton’s Contributions and Current Work:
Geoffrey Hinton’s pioneering work in AI, especially his development of the GLOM framework, has significantly influenced the field’s understanding of visual perception. GLOM’s unique approach to image processing emphasizes part-whole hierarchies and coordinate frames, providing a fresh perspective on how humans perceive and interpret visual information. Hinton’s structural description of a crown, utilizing a 1979 diagram, includes objects with intrinsic coordinate frames and transforms between objects and parts, reminiscent of old-fashioned computer graphics representations of geometry.
Conceptual Framework of GLOM:
GLOM’s architecture, inspired by human vision, encompasses a hierarchical structure of embedding vectors and neural fields to represent and process images. This system mirrors the human brain’s method of visual perception, focusing on parts and wholes, and adeptly manages the complexity of visual information. Each node in a structural description can have an associated coordinate transform to the viewer, facilitating the propagation of consistent viewpoint information throughout the structural description. This allows for efficient computation of relationships between distant parts in the structure.
Understanding Human Perception with GLOM:
Hinton’s insights into the psychological reality of human vision, demonstrated through experiments like cube perception and sentence ambiguity, highlight the complexity and multidimensionality of human perception. GLOM’s design reflects these intricacies, offering a pathway to more human-like AI. Hinton argues that mental images are not pixel-based but involve viewpoint assignment, allowing for the computation of relationships between different parts of the mental image. Mental images typically orient with north being vertical. He also provides an example of a mental imagery task involving navigation and orientation, further illustrating the use of viewpoint information in mental images.
The Role of Contrastive Self-Supervised Learning:
Contrastive self-supervised learning, a method Hinton helped introduce and refine, is crucial in GLOM’s ability to process and interpret visual information. The technique, first proposed in 1992 and widely recognized only much later, focuses on extracting similar representations from different image patches. SimCLR, a contrastive self-supervised learning algorithm developed in Toronto, processes two augmented crops of an image to obtain representations, which are then dimensionally reduced and subjected to a contrastive loss. The approach is effective at extracting scene representations and, on tasks like ImageNet, object representations. Once representations are obtained, a linear classifier with softmax achieves performance comparable to supervised methods while training only the final layer of weights.
The Architectural Complexity and Innovations in GLOM:
GLOM’s intricate architecture, involving multiple levels of embeddings, complex coordinate transformations, and echo chambers, showcases an advanced approach to handling ambiguity and representing visual structures in a way that mimics the human brain’s processing capabilities. GLOM, inspired partly by biology, aims to achieve spatial coherence of familiar shapes through attention and to represent objects with identical vectors at the object level. The architecture replicates weights, similar to knowledge replication in cells, and uses hardware equivalents such as a column of hardware representing a small patch of an image.
GLOM’s Implementation and Future Potential:
The implementation of GLOM, utilizing techniques like unsupervised learning and knowledge transfer, demonstrates its versatility in AI, capable of addressing complex visual tasks and advancing our understanding of both artificial and human intelligence. Vision is conceptualized as a sampling process, where the focus is on detailed processing of specific areas.
Hinton’s Perspective on AI and Neuroscience:
Hinton’s work, while primarily focused on artificial intelligence, also provides significant insights into neuroscience and cognitive science. He advocates the use of neural networks in scientific research and validates brain models through computational approaches, creating a unique intersection between AI and human cognition.
Geoffrey Hinton’s development of the GLOM neural network exemplifies the potential of AI in mimicking and understanding human cognitive processes. By bridging AI, visual perception, and cognitive science, his contributions deepen our understanding of both artificial and human intelligence, opening new paths for exploration and innovation. His work not only enhances our comprehension of the complex nature of the human mind but also paves the way for more advanced and human-like artificial intelligence systems.
Update:
Hinton’s recent work further explores GLOM’s architecture and functionality, describing a three-level embedding system within each column of the hierarchical neural fields model. He emphasizes bottom-up, top-down, and local interactions in shaping these embeddings and introduces transformational random fields for disambiguating parts, which lets the model locate parts relative to each other. He also proposes an iterative network that uses log probabilities to combine distributions and select the most likely interpretation of ambiguous parts.
Supplemental Update:
Hinton’s unsupervised learning system learns a hierarchical representation of images without relying on labeled data, using partial autoencoders and contrastive learning. He suggests studying brain injuries to validate brain models and explores neural networks with attractors to model semantic meaning in language. These developments continue to bridge the gap between AI and neuroscience, providing valuable insights into the human mind.