Geoffrey Hinton (Google Scientific Advisor) – What is wrong with convolutional neural nets? | Fields Institute (Sep 2017)


Chapters

00:00:00 Alternative Neural Network Architectures
00:02:16 Exploring Novel Neuronal Architectures for Enhanced Machine Vision
00:08:29 Capsules and Covariance in Deep Learning
00:20:06 Exploring the Significance of Coordinate Systems in Visual Perception
00:27:32 How Peculiarities in Human Perception Hinder AI Development
00:31:23 Coordinate Representations for Shape and Viewpoint
00:35:40 Capsule Networks: Equivariance and Place vs. Rate Coding
00:41:24 Discovering Viewpoint Invariant Representations through Capsule Networks
00:52:13 Understanding Capsule Networks: From Concept to Implementation
00:56:27 Capsule Networks: Advantages and Applications

Abstract

Revolutionizing Visual Processing: Insights from Geoffrey Hinton on Neural Networks and Capsule Networks

The field of artificial intelligence and neural networks has seen revolutionary advancements, particularly in visual processing, thanks to the pioneering work of Geoffrey Hinton and his exploration of neural network non-linearities and the development of capsule networks. This article delves into Hinton’s significant contributions, highlighting the transition from traditional neural networks to the innovative capsule networks that promise to reshape our understanding and capabilities in computer vision.

Transforming Neural Network Non-linearities

Historically, neural networks employed sigmoid or tanh units for non-linearity. However, Hinton’s exploration into rectified linear units (ReLUs) marked a paradigm shift. ReLUs, being easier to back-propagate through and less prone to saturation, have demonstrated superior performance in various applications. The critical insight here is the significance of the choice of non-linearity in neural network design, a factor often overlooked but fundamental in determining the efficiency and capability of these networks.
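To make the saturation point concrete, here is a minimal NumPy sketch (not from the talk) comparing the gradients of the three nonlinearities: the sigmoid and tanh gradients vanish for large inputs, while the ReLU gradient stays at exactly 1 for any positive input, which is what makes it easy to back-propagate through.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

# Gradients of each nonlinearity at the sample points.
grad_sigmoid = sigmoid(x) * (1.0 - sigmoid(x))  # ~4.5e-05 at x = 10 (saturated)
grad_tanh    = 1.0 - np.tanh(x) ** 2            # ~8.2e-09 at x = 10 (saturated)
grad_relu    = (x > 0).astype(float)            # exactly 1.0 at x = 10

print(grad_sigmoid, grad_tanh, grad_relu, sep="\n")
```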

Hinton’s insights extend beyond just alternative non-linear units. He introduces the concept of neurons performing “same” functions – outputting 1 if two input vectors are identical and 0 otherwise – a novel approach offering simplicity in certain computations. Further, Hinton proposes the exploration of vector nonlinearities, functions operating on vectors instead of scalar values. This approach, particularly in the field of image recognition, could significantly improve neural network performance by directly detecting covariance structures, a crucial aspect in image processing.
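The talk does not pin these ideas down to formulas, so the following is a hypothetical sketch of the intent: a "same" unit that fires only when its two input vectors match, whatever the particular values are, and one possible vector nonlinearity that acts on a whole vector at once rather than squashing each component independently.

```python
import numpy as np

def same_unit(u, v, tol=1e-6):
    """Toy 'same' unit: outputs 1.0 when the two input vectors are
    (numerically) identical and 0.0 otherwise, regardless of the values."""
    return 1.0 if np.allclose(u, v, atol=tol) else 0.0

def vector_nonlinearity(v):
    """One possible vector nonlinearity: rescale the whole vector by a
    function of its norm. Unlike a pointwise ReLU, the components are
    coupled, so the unit responds to the vector as a whole."""
    n = np.linalg.norm(v)
    return v * n / (1.0 + n)

print(same_unit(np.array([1.0, 2.0]), np.array([1.0, 2.0])))  # 1.0
print(same_unit(np.array([1.0, 2.0]), np.array([2.0, 1.0])))  # 0.0
print(vector_nonlinearity(np.array([3.0, 4.0])))              # direction kept
```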

Rethinking Neural Network Architectures

Hinton challenges the conventional wisdom that convolutional neural nets are the best approach to neural networks, emphasizing the need to explore alternative architectures that may outperform current methods. He highlights the evolution of non-linear activation functions in neural networks over the past 30 years, from the sigmoid unit to the tanh unit and eventually to the ReLU (max of input and zero). Hinton’s early adoption of ReLUs in 2004 was based on insights from Hugh Wilson, a neuroscientist at York University. Initial experiments showed promise, but more rigorous studies were needed to demonstrate the superiority of ReLUs.

Introducing Capsule Networks

A groundbreaking development in Hinton’s career is the introduction of capsule networks, inspired by the human visual system. Capsules are vector-based units encapsulating entities or objects within an image, with each vector value representing different properties like orientation, size, or color. This design addresses the ‘binding problem’ in visual processing – the challenge of associating different features of an object into a coherent representation.
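One concrete realization of this idea, from the capsule paper Hinton's group published around the time of this talk (Sabour, Frosst & Hinton, 2017), is the "squashing" nonlinearity: the length of a capsule's output vector is compressed into [0, 1) so it can be read as the probability that the entity is present, while the vector's direction, which carries the entity's properties, is left unchanged. A sketch:

```python
import numpy as np

def squash(s, eps=1e-9):
    """v = (|s|^2 / (1 + |s|^2)) * s / |s|   (Sabour et al., 2017).
    The length of v encodes presence probability; its direction
    encodes the entity's properties."""
    sq_norm = np.sum(s ** 2)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

s = np.array([3.0, 4.0])         # raw capsule vector, length 5
v = squash(s)
print(np.linalg.norm(v))         # ~0.96: entity very likely present
print(v / np.linalg.norm(v))     # [0.6, 0.8]: direction preserved
```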

Capsule networks stand out for handling transformations such as translation, rotation, and scale equivariantly: the representation changes along with the object, while recognition of the object's identity remains stable across these changes, enhancing object recognition even under varied conditions. Hinton argues that they can surpass traditional convolutional neural networks in object recognition tasks, showing robustness against noise and occlusions and providing a more interpretable representation of the input data.

These networks have promising applications in diverse areas such as image classification, object detection, pose estimation, and medical imaging. Their approach to visual processing, particularly in handling occlusions and noise, marks a notable departure from conventional methodologies.

Equivariance vs. Invariance and the Role of Capsules

In the context of capsules, Hinton introduces the concept of equivariance – a property where changes in an object’s properties lead to corresponding changes in its representation, as opposed to invariance where the representation remains unchanged despite such variations. This distinction is crucial in understanding the effectiveness of capsule networks, especially in their ability to handle variations in viewpoint, a known challenge in traditional visual processing techniques.

Capsules are designed as an alternative to pooling layers in convolutional neural networks, which often lose critical pose information of objects. By preserving this information, capsules enable more nuanced and accurate object recognition.
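A toy contrast (not an implementation of either layer) makes the difference visible: max pooling reports only that a feature occurred somewhere, so two shifted inputs become indistinguishable, whereas a capsule-style readout keeps the pose alongside the presence.

```python
import numpy as np

def max_pool_response(feature_map):
    return feature_map.max()                       # "something is there"

def capsule_style_response(feature_map):
    presence = feature_map.max()
    pose = np.unravel_index(feature_map.argmax(), feature_map.shape)
    return presence, pose                          # "there, at (row, col)"

a = np.zeros((4, 4)); a[0, 1] = 1.0                # feature near the top
b = np.zeros((4, 4)); b[3, 2] = 1.0                # same feature, shifted

print(max_pool_response(a) == max_pool_response(b))          # True: pose discarded
print(capsule_style_response(a), capsule_style_response(b))  # poses differ
```

The pooled output is invariant (it does not change when the feature moves); the capsule-style output is equivariant (its pose component moves with the feature).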

Geoffrey Hinton’s Perspectives on Neural Networks and Human Perception

Hinton’s insights extend to the limitations of current neural networks in forming distinct internal representations of objects, a capability inherently present in human perception. He illustrates this with the example of perceptual frames – our perception of objects like squares or cubes varies depending on the imposed frame, highlighting the relative nature of our knowledge of objects.

This perspective leads to the exploration of coordinate representations of shape in neural networks. By describing facial features with explicit coordinates, for instance, a network can reconstruct and manipulate faces with ease. This representation places faces on a linear manifold, which makes manipulation and extrapolation of facial features straightforward.
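A hypothetical sketch of this linear manifold: if each face is stored as the (x, y) coordinates of a fixed set of landmarks (the landmark names and numbers below are invented for illustration), then blending two faces is a single linear operation, and pushing the blend weight past 1 extrapolates to a caricature.

```python
import numpy as np

# Rows: left eye, right eye, nose tip, mouth centre (illustrative only).
face_a = np.array([[30.0, 40.0], [70.0, 40.0], [50.0, 60.0], [50.0, 80.0]])
face_b = np.array([[25.0, 45.0], [75.0, 45.0], [50.0, 65.0], [50.0, 85.0]])

def blend(fa, fb, t):
    """t = 0 reproduces face_a, t = 1 reproduces face_b,
    t = 1.5 extrapolates beyond face_b (a caricature)."""
    return (1.0 - t) * fa + t * fb

print(blend(face_a, face_b, 0.5))   # a plausible in-between face
print(blend(face_a, face_b, 1.5))   # an exaggerated face
```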

Capsules, Coordinate Representation, and Extrapolation

Capsules aim to capture object properties, including coordinates, albedo, and velocity, enabling efficient representation of position, scale, orientation, and shear. This representation is crucial in handling viewpoint variations, a challenge in traditional neural networks. Capsules, if successful, could extrapolate from limited viewpoint variation in training data to significant variations in real-world scenarios.
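Position, scale, orientation, and shear are exactly the parameters of an affine transform, so one way to read this claim (an illustrative sketch, not the talk's notation) is that a capsule's pose is a small matrix, and the relation between a part and its whole is a fixed matrix product that holds at every viewpoint. Knowledge learned at one viewpoint then transfers to extreme viewpoints for free.

```python
import numpy as np

def pose(tx, ty, scale, theta):
    """2D pose as a homogeneous transform encoding position, scale and
    rotation (a shear term could be added the same way)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[scale * c, -scale * s, tx],
                     [scale * s,  scale * c, ty],
                     [0.0,        0.0,       1.0]])

# The part-whole relation (e.g., nose relative to face) is viewpoint-
# independent; only the whole's pose changes with viewpoint.
face_in_image = pose(100.0, 50.0, 2.0, np.pi / 6)  # an unseen, extreme viewpoint
nose_in_face  = pose(0.0, 5.0, 0.3, 0.0)           # learned once, reused always
nose_in_image = face_in_image @ nose_in_face       # predicted by composition
print(nose_in_image)
```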

Hinton also discusses the distinction between place coding and rate coding in neural networks: in place coding, which capsule is active signals the presence of an object, while in rate coding, the activity levels within a capsule represent the object's properties. He argues that a transition from place coding to rate coding as one moves up the hierarchy is vital to the visual system's processing of information.
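A hypothetical side-by-side of the two schemes for encoding, say, a horizontal position x: in a place code, the identity of the active unit carries x, so finer positions need more units; in a rate code, a single unit's real-valued activity carries x directly.

```python
import numpy as np

positions = np.linspace(0.0, 1.0, 5)   # the bins a place code commits to

def place_code(x):
    """WHICH unit is active encodes the position."""
    code = np.zeros_like(positions)
    code[np.abs(positions - x).argmin()] = 1.0
    return code

def rate_code(x):
    """HOW ACTIVE a single unit is encodes the position."""
    return np.array([x])

print(place_code(0.25))   # [0. 1. 0. 0. 0.]
print(rate_code(0.25))    # [0.25]
```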

Capsule Networks: Architecture, Training, and Applications

Capsule networks consist of hierarchically arranged groups of neurons, each encoding different object properties. Training these networks, though challenging, can be achieved through various methods like backpropagation and reinforcement learning. These networks have shown promising results in tasks like object recognition and pose estimation.

The potential applications of capsule networks are vast, ranging from revolutionizing computer vision areas like object recognition and image segmentation to developing new, more efficient types of neural networks. Hinton's presentation on capsule networks, using the MNIST digits as a demonstration, showcased their effectiveness, achieving error rates as low as 0.25%, comparable to state-of-the-art systems.

Different Percepts of the Same Input: The Frame Problem

Our perception of an object can vary depending on the frame we impose on it. For example, a square rotated 45 degrees can be perceived either as a tilted square or as an upright diamond, and questions about its orientation receive different answers depending on which percept is adopted. This frame-dependence highlights the challenge of representing knowledge about objects in a way that is independent of the imposed frame of reference.
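The point is purely geometric, as a small sketch shows: the very same four corner points read as a "diamond" in the upright frame and as a "square" once the points are rotated 45 degrees (equivalently, once the describing frame is rotated the other way).

```python
import numpy as np

pts = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])  # a "diamond"

theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.round(pts @ R.T, 3))
# [[ 0.707  0.707] [-0.707  0.707] [-0.707 -0.707] [ 0.707 -0.707]]
# Axis-aligned corners: the identical shape now reads as a square.
```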

The Cube Demonstration and the Frame Problem

A wireframe cube is rotated so that two diagonally opposite corners are vertically aligned, and participants are asked to use their hands to indicate where the remaining corners of the cube are. Most point to only four corners, describing a shape such as two square-based pyramids stuck base to base, when in fact six corners remain. This illustrates the frame problem: there is another way of seeing the cube, which Hinton calls the "hexahedron" view, consisting of two tripods rotated 60 degrees relative to each other, with the six remaining corners forming a zigzag ring around the middle.

Perceptual Differences Between Humans and Neural Nets

Hinton highlights a fundamental difference between human and neural net perception: current neural nets cannot form multiple distinct internal representations of the same input, such as the two different views of a cube. This suggests that they process information differently and may be unable to solve certain problems that humans solve easily.

The Tetrahedron Puzzle and Mental Rotation

Hinton presents a puzzle in which a tetrahedron is sliced into two identical pieces, exposing a square cross-section; the challenge is to reassemble the tetrahedron from the two pieces. He notes, anecdotally, that the time taken to solve it correlates with years of tenure among MIT professors; some individuals, like Carl Hewitt, even attempt to prove it impossible. The puzzle highlights the importance of mental rotation, the ability to visualize an object from different perspectives, in human problem-solving.

Perceptual Biases and the Frame Problem

Hinton points out that human perception often imposes a frame of reference on objects, leading to difficulties in seeing the solution to puzzles like the tetrahedron puzzle. This is because the perceptual system uses coordinate frames that don’t align with the object’s actual structure. This bias can be overcome by using a coordinate representation of shape, which allows for more efficient representation and manipulation of objects.

Dynamic Routing and the Binding Problem

Dynamic routing is a mechanism for capsules to communicate and determine which capsule represents the most relevant information. This process is crucial for the success of capsules in handling object recognition tasks. By using dynamic routing, capsules can group together features that belong to the same object, even when those features are separated in the image. This addresses the binding problem, which is the challenge of associating different features of an object into a coherent representation.
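Below is a stripped-down NumPy sketch of routing-by-agreement, after the procedure published by Hinton's group around the time of this talk (Sabour, Frosst & Hinton, 2017). A real capsule layer also learns the transformation matrices that produce the prediction vectors u_hat, which are simply given here.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    n2 = (s ** 2).sum(axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

def route(u_hat, n_iters=3):
    """u_hat[i, j]: lower capsule i's predicted vector for higher capsule j,
    shape (I, J, D). Each iteration sends more of capsule i's output to
    the higher capsules whose outputs agree with i's predictions."""
    I, J, _ = u_hat.shape
    b = np.zeros((I, J))                                      # routing logits
    for _ in range(n_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coeffs
        s = (c[:, :, None] * u_hat).sum(axis=0)               # weighted votes
        v = squash(s)                                         # (J, D) outputs
        b += (u_hat * v[None, :, :]).sum(axis=-1)             # agreement bonus
    return v

# Two lower capsules that agree about higher capsule 0 but cancel on 1:
u_hat = np.array([[[1.0, 0.0], [0.0, 1.0]],
                  [[1.0, 0.1], [0.0, -1.0]]])
print(np.linalg.norm(route(u_hat), axis=-1))  # capsule 0 wins the agreement
```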

Conclusion

Geoffrey Hinton’s contributions to the field of neural networks and visual processing, from redefining non-linearities in neural networks to introducing capsule networks, represent a significant leap in our understanding and capability in artificial intelligence. His insights into the nature of vision, the importance of equivariance over invariance, and the potential of vector nonlinearities have paved the way for more advanced, efficient, and robust systems in visual processing. As we continue to explore and develop these innovative technologies, the impact of Hinton’s work will undoubtedly resonate for years to come in the field of artificial intelligence and beyond.

Supplemental Notes:

Computational Theory of Vision:

– Hinton emphasizes the importance of understanding the relationship between objects and their viewpoints.

– He suggests using the NORB database to study this relationship.

– Current neural networks achieve good results on NORB, but there is room for improvement.

Training Neural Networks with Transformed Data:

– Hinton proposes a new approach to training neural networks with transformed data.

– Providing the network with pairs of images together with the transformation relating them would be more informative than simply transforming the images without saying how; a sketch of such a training triple follows this list.
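A hypothetical sketch of what such a training triple could look like (the function and parameters below are invented for illustration): the network sees the image, the transformed image, and the transformation parameters, so its pose outputs can be supervised to change by exactly the stated amount.

```python
import numpy as np
from scipy.ndimage import shift

def make_pair(image, max_shift=3, rng=np.random.default_rng(0)):
    """Return (image, shifted image, shift vector). The shift vector is
    the extra information Hinton suggests handing to the network."""
    d = rng.integers(-max_shift, max_shift + 1, size=2)
    return image, shift(image, d, order=0), d.astype(float)

img = np.zeros((28, 28))
img[10:18, 10:18] = 1.0            # a toy MNIST-sized input
x, x_shifted, delta = make_pair(img)
# Training signal: the pose outputs for x and x_shifted should differ
# by exactly `delta`.
print(delta)
```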

Uniformity of Knowledge in Capsule Networks:

– Hinton believes that the knowledge at every layer of a capsule network should be the same.

– This means that low-level knowledge about edges should be the same everywhere, and high-level knowledge about faces should also be the same everywhere.

– He wants this knowledge captured in capsules that each work over a big range at a high level and deal with the within-range variation using rate coding.

Number of Capsules Needed in a Capsule Network:

– Hinton estimates that a capsule system that works really well would require about a billion neurons, a scale comparable to that of the human visual system.

– The number of capsules needed should be linear in the number of pixels in the image.

– If the network needs to learn more different types of things, more capsules would be needed.

ReLU Networks and the Computational Theory of Vision:

– Hinton wonders whether ReLU networks can adequately implement the computational theory of vision.

– He is concerned that ReLUs may be able to fake the desired behavior, making it difficult to determine if they are truly implementing the theory.


Notes by: BraveBaryon