00:00:08 Octopus, Kittens, and Babies: From Seeing to Doing
Introduction: Fei-Fei Li is a renowned scholar in the field of AI and robotics. Her contributions include ImageNet, deep learning research, and a popular TED talk on computer vision. Recently, she has focused on closing the perception-action loop and interdisciplinary research.
AI Vision and Robotics: Li says she feels like a student learning robotics and enjoys working with her amazing students. Her talk shares her thoughts on AI vision and robotics. She believes vision is essential for doing and sees robotics as a natural extension of computer vision.
Outline: Part 1: A brief history of computer vision and why connecting vision to robotics is exciting. Part 2: Focusing on recent work in her lab on vision-driven robotics, enabling vision and hands or grippers to interact with the world.
Significance of Vision: Professor Li emphasizes the importance of vision as a cornerstone of intelligence. Vision allows humans to perceive, understand, and interact with their environment.
00:03:15 Deep Learning Revolution: The Watershed Moment and Its Impact on Computer Vision
Vision’s Importance in Evolution: Vision has been vital for animals since evolution began and remains crucial for advanced species like humans. Half of the brain’s cortical area is dedicated to visual processing, emphasizing vision’s importance in intelligence.
Recent Progress in Computer Vision: Significant progress in computer vision has been made, as seen in tasks like ImageNet’s 1,000-way object classification using neural networks, notably convolutional neural networks. The year 2012 marked a turning point for deep learning, thanks to the paper by Geoff Hinton and his students on the ImageNet Challenge.
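To make the 1,000-way classification task concrete, here is a minimal sketch of classifying a single image with a pretrained convolutional network. It assumes PyTorch and torchvision are installed; the file "cat.jpg" is a hypothetical local image used only for illustration, not material from the talk.

```python
# A minimal sketch (not from the talk) of 1,000-way ImageNet classification
# with a pretrained convolutional network.
import torch
from PIL import Image
from torchvision import models, transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet channel statistics
                         std=[0.229, 0.224, 0.225]),
])

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.eval()

image = preprocess(Image.open("cat.jpg")).unsqueeze(0)  # shape (1, 3, 224, 224)
with torch.no_grad():
    logits = model(image)                               # shape (1, 1000): one score per ImageNet class
probs = logits.softmax(dim=1)
top5 = torch.topk(probs, k=5)
print(top5.indices[0].tolist(), top5.values[0].tolist())  # top-5 class ids and probabilities
```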
Deep Learning Revolution: The deep learning revolution emerged from the convergence of three forces: computing power, algorithms (especially neural networks), and data. This convergence led to a surge in AI growth, evident in conference attendance, startups, markets, and other measures.
Historical Perspective: The speaker reflects on the origins of the current progress in computer vision and its implications for future research directions. They connect this historical perspective to their personal excitement for vision and robotics.
00:05:42 Origins of Neural Networks and the Search for North Star Problems in Computer Vision
Neural Network Inspiration from Neuroscience: In the late 1950s, Hubel and Wiesel discovered that neurons in the cat’s visual cortex respond to simple oriented edges, and that the mammalian visual system is organized hierarchically, processing information from simple edges up to complex, object-like information. This insight inspired neural network algorithms modeled on the hierarchical organization of the visual pathway. In the 1980s, Rumelhart, Hinton, and Williams made a major breakthrough with the backpropagation learning rule for neural networks.
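As an illustration of the kind of local feature Hubel and Wiesel described, and of what a convolutional network’s first layer typically learns on its own, here is a toy oriented-edge filter applied to a synthetic image. The kernel, the image, and the use of numpy/scipy are illustrative assumptions, not material from the talk.

```python
# Toy example: an oriented-edge (Sobel-style) filter, the kind of local feature
# early visual neurons respond to and that a CNN's first layer tends to learn.
import numpy as np
from scipy.signal import convolve2d

vertical_edge = np.array([[-1, 0, 1],
                          [-2, 0, 2],
                          [-1, 0, 1]], dtype=float)   # responds to left-to-right intensity changes

image = np.zeros((8, 8))
image[:, 4:] = 1.0                                    # dark left half, bright right half

response = convolve2d(image, vertical_edge, mode="same")
print(response[4])                                    # large-magnitude response where the intensity changes
```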
Search for North Star Problems in Computer Vision: Computer vision started in the 1960s with Larry Roberts’ PhD thesis on shape understanding from image analysis. The field spent many years searching for its North Star problems, the foundational problems to work on. Early efforts focused on edges and segmentation, followed by 3D and stereo vision for depth understanding. In the late 1990s, some pioneers began to focus on object recognition, though mostly on single-object recognition.
Major Advances in Cognitive Neuroscience: Meanwhile, cognitive neuroscience made significant advances in understanding human object detection and recognition. Molly Potter’s work showed that humans can detect novel objects even at very high presentation speeds (around 100 milliseconds per frame). Simon Thorpe and colleagues demonstrated that complex object categorization in the human brain can occur as fast as 150 milliseconds. Nancy Kanwisher and colleagues found evidence for brain regions that respond selectively to specific object categories.
00:14:05 Computer Vision's North Star: ImageNet and the Path to Object Recognition
North Star for Computer Vision: Cognitive neuroscience revealed devoted brain areas for object categorization. This revolution influenced computer vision, highlighting the importance of object recognition at the category level.
Early Datasets: Mid-sized datasets emerged in the early 2000s; the PASCAL VOC dataset pioneered the field’s benchmark work on object recognition.
Inspiration from Cognitive Science: Collaboration with Pietro Perona at Caltech and George Miller at Princeton inspired work on the ImageNet dataset. WordNet and the scale of concepts in the human brain motivated the creation of ImageNet.
ImageNet: A North Star and a Path: ImageNet dataset and challenge served as a North Star for computer vision. The challenge provided a path for the field to reach that North Star.
Neural Networks’ Dominance: Neural networks dominated the ImageNet Challenge results in the early years. Examples of winning architectures include AlexNet, GoogLeNet, VGGNet, and ResNet.
00:16:43 Visual Storytelling: From Image Captioning to Scene Graph Understanding
Computer Vision’s Shift from Image Classification to Storytelling: Li’s interest in storytelling through images led to a surge in research on image captioning, dense captioning, and paragraph captioning. Rapid progress in storytelling with images occurred within a few years of the ImageNet Challenge.
Exploring Deeper Image and Relationship Understanding: Li and her students focused on relationship understanding as a crucial aspect of human perception and cognition, particularly for social reasoning. They developed scene graphs, a representation connecting the entities in an image with their relationships.
Scene Graphs: A Powerful Tool for Many Tasks: Scene graphs have been applied to image retrieval, relationship understanding, automatic image-to-scene-graph generation, sentence-to-scene-graph parsing, and GAN-based generation of images from scene graphs. This line of research has led to numerous exciting developments.
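As a concrete picture of what a scene graph is, here is a minimal sketch of the data structure: objects as nodes with attributes, and (subject, predicate, object) relationships as edges. The example scene and class names are illustrative, not taken from the talk.

```python
# A minimal sketch of a scene graph: objects (nodes) with attributes,
# connected by (subject, predicate, object) relationship edges.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SceneObject:
    name: str
    attributes: List[str] = field(default_factory=list)

@dataclass
class Relationship:
    subject: SceneObject
    predicate: str
    obj: SceneObject

man = SceneObject("man", ["standing"])
horse = SceneObject("horse", ["brown"])
meadow = SceneObject("field", ["grassy"])

scene_graph = [
    Relationship(man, "feeding", horse),          # edges capture how entities relate,
    Relationship(horse, "standing in", meadow),   # not just which entities are present
]

for rel in scene_graph:
    print(f"{rel.subject.name} --{rel.predicate}--> {rel.obj.name}")
```

Image retrieval with scene graphs then amounts to matching a query graph like this against the graphs extracted from a database of images.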
Plato’s Allegory of the Cave and Static Vision: Li introduces Plato’s allegory of the cave as a metaphor for the limitations of static vision. Plato’s prisoners, watching only 2D shadows projected on the cave wall, lack depth and motion information, much like a vision system trained only on static images.
00:21:06 Active Perception and Interaction for Intelligent Systems
Evolutionary Perspective: Vision and other perceptual sensors played a critical role in the Cambrian explosion, driving evolution and speciation. Vision led to the development of more sophisticated animal intelligence.
Developmental Psychology: Early childhood development emphasizes movement, exploration, and interaction with the environment. The combination of perception and action stimulates the brain and influences cognitive development.
Held and Hein Kitten Experiment: Kittens that were moved only passively (P kittens) developed deficits in visually guided perception and motor skills, while kittens allowed to move actively (A kittens) developed robust motor and perception systems. The experiment highlights the bidirectional relationship between perception and action.
Active and Interactive Intelligence: Intelligence emerges from active perception and interaction with the real world. It is embodied and active, balancing exploration and exploitation; multi-modal, involving various sensory systems; multi-task and generalizable, capable of handling diverse tasks and environments; and social and interactive, engaging with others and the environment.
Robotics and RL Applications: The lab explores baby-inspired actions through RL, using curiosity as a driving force for learning, and develops self-models that guide agents’ exploration of the environment.
00:28:08 Automated Tool Assembly and Use with Visual Feedback
Introduction of Self-Model and Curiosity Learning: The self-model is a constantly updated world model driven by curiosity learning, which pushes the agent to explore both itself and the real world. The self-model learns ego motion, object recognition, and more in distinct stages.
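Below is a hedged sketch of the curiosity-learning idea: the agent maintains a learned forward model (a stand-in for the self-model) and receives an intrinsic reward proportional to how badly that model predicts what actually happens, so poorly understood states get explored more. The network sizes, dimensions, and reward form are illustrative assumptions, not the lab’s actual implementation.

```python
# Sketch: curiosity as intrinsic reward. A learned forward model predicts the
# next state from (state, action); its prediction error is the curiosity signal
# that pushes the agent toward parts of the world it does not yet understand.
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    def __init__(self, state_dim: int = 32, action_dim: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64),
            nn.ReLU(),
            nn.Linear(64, state_dim),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))

def curiosity_reward(model: ForwardModel, state, action, next_state) -> float:
    """Intrinsic reward = mean squared prediction error of the forward model."""
    with torch.no_grad():
        predicted = model(state, action)
    return ((predicted - next_state) ** 2).mean().item()

# One illustrative transition: the worse the model's prediction, the larger the reward.
model = ForwardModel()
s, a, s_next = torch.randn(32), torch.randn(4), torch.randn(32)
print(curiosity_reward(model, s, a, s_next))
```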
Interaction with Tools: Humans are unique in their ability to use and build sophisticated tools. For robots, the challenge lies in high-dimensional visual observations and complex learning in visual worlds. An encoder-decoder network transforms high-dimensional visual information into actions by inferring a keypoint representation (grasp points, function points, and effect points), translating complex visual input into a compact action space.
Keypoint Representation for Tool Usage: Keypoints are a robust, classic idea in computer vision. The grasp point, function point, and effect point support downstream tasks such as pushing, reaching, and hammering, and new tools can even be assembled from parts, based on their keypoints, to perform a task.
Conclusion: Key point representation enables assembling new tools and performing downstream tasks.
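To make the keypoint abstraction concrete, here is a minimal sketch: a tool is reduced to a grasp point and a function point, a task supplies an effect point, and planning happens in that compact space rather than in raw pixels. The 2D coordinates and the simple "move the function point onto the effect point" rule are illustrative assumptions, not the lab’s actual controller.

```python
# Sketch of the keypoint representation: a tool reduced to grasp/function points,
# a task reduced to an effect point, and a simple plan computed in that
# low-dimensional space instead of in raw pixels. Values are illustrative only.
from dataclasses import dataclass
import numpy as np

@dataclass
class ToolKeypoints:
    grasp: np.ndarray      # where the gripper should hold the tool
    function: np.ndarray   # the working part of the tool (e.g., a hammer's head)

@dataclass
class Task:
    effect: np.ndarray     # where the effect should be applied (e.g., the nail)

def plan_offset(tool: ToolKeypoints, task: Task) -> np.ndarray:
    """Displacement that brings the tool's function point onto the effect point."""
    return task.effect - tool.function

hammer = ToolKeypoints(grasp=np.array([0.0, 0.0]), function=np.array([0.0, 0.3]))
nail = Task(effect=np.array([0.5, 0.4]))
print(plan_offset(hammer, nail))   # move the end-effector by this much to strike the nail
```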
00:33:18 Neural Task Programming for Robot Imitation Tasks
Long Horizon Interactive Tasks: Long horizon interactive tasks involve complex sequential processes with many choices and state changes, e.g., making an iced latte. Planning for these tasks requires considering basic skills, object interactions, state changes, and overall context. The search space for these tasks is prohibitively large.
Structured Task Representation: To control the explosive state space, structured task representation is key. One-shot visual imitation involves imitating a task given a single video demonstration. Hierarchical programs can reduce task space complexity by breaking down the task into subtasks and executing them sequentially.
Neural Task Programming (NTP): NTP generates neural programs that act as reactive policies controlling the robot’s execution of demonstrated tasks. NTP traverses down a hierarchical structure of subtasks, executing primitive actions such as moving and picking at the lowest level. NTP generalizes better to unseen tasks than flat or sequential execution methods.
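The hierarchical idea behind NTP can be pictured as recursive task expansion: a high-level task is broken into subtasks until primitive actions are reached. In NTP the expansion is produced by a learned network conditioned on the demonstration; in this illustrative sketch it is a hand-written table, and the iced-latte subtasks are assumptions, not the paper’s.

```python
# Sketch of hierarchical task decomposition in the spirit of NTP: tasks expand
# recursively into subtasks until primitive actions are reached. In NTP the
# expansion is predicted by a network; here it is a hand-written table.
PRIMITIVES = {"pick", "place", "pour"}

DECOMPOSITION = {                      # illustrative hierarchy, not from the talk
    "make_iced_latte": ["add_ice", "add_espresso", "add_milk"],
    "add_ice":         ["pick", "place"],
    "add_espresso":    ["pick", "pour"],
    "add_milk":        ["pick", "pour"],
}

def execute(task: str, depth: int = 0) -> None:
    """Recursively expand subtasks; execute primitive actions at the leaves."""
    indent = "  " * depth
    if task in PRIMITIVES:
        print(f"{indent}primitive: {task}")
        return
    for subtask in DECOMPOSITION[task]:
        print(f"{indent}{task} -> {subtask}")
        execute(subtask, depth + 1)

execute("make_iced_latte")
```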
Neural Task Graph Inference: NTP requires a lot of supervision. Neural task graph inference uses a task graph generator and executor to improve generalization with weaker supervision. The task graph generator creates a task graph representation from video demonstrations. The task graph executor decides actions based on the generated task graph and current observations. Neural task graphs generalize better than NTP and other baselines.
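A task graph can be pictured as subtasks annotated with preconditions and effects, with the executor repeatedly choosing any not-yet-completed subtask whose preconditions hold in the currently observed state. The sketch below is an illustrative reconstruction of that idea; the specific subtasks and predicates are assumptions, not from the paper.

```python
# Sketch of the task-graph idea: each node is a subtask with preconditions and
# effects, and the executor picks whichever remaining subtask is enabled by the
# current state. Subtasks and predicates are illustrative assumptions.
task_graph = {
    "grind_beans":   {"needs": set(),                       "adds": {"ground_coffee"}},
    "add_ice":       {"needs": set(),                       "adds": {"iced_glass"}},
    "brew_espresso": {"needs": {"ground_coffee"},           "adds": {"espresso"}},
    "pour_espresso": {"needs": {"espresso", "iced_glass"},  "adds": {"iced_latte"}},
}

state: set = set()   # in the real system this would come from visual observations
done: set = set()

while len(done) < len(task_graph):
    ready = [t for t, spec in task_graph.items()
             if t not in done and spec["needs"] <= state]
    if not ready:
        break        # no subtask is enabled: the plan cannot proceed
    current = ready[0]
    state |= task_graph[current]["adds"]
    done.add(current)
    print(f"executed {current}; state = {sorted(state)}")
```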
Perception and Action Loop: The loop between perception and action is crucial for interactive real-world tasks.
Conclusion: The original and fundamental function of the nervous system is to link perception with action. Octopuses have superb visual and manipulative capabilities, making them fascinating creatures for studying intelligence.
00:41:11 AI Experts Discuss the Next "ImageNet" for Robotics
Robotics and ImageNet: Li emphasizes striving for fundamental capabilities in robotics rather than aiming for a specific dataset analogous to ImageNet.
Journey Towards North Star: The focus should be on identifying the “North Stars” of robotics, which are fundamental functionalities and capabilities inspired by the natural world.
Manipulation as a North Star: Manipulation, involving hand movements, planning, and tool creation, is seen as a key area of exploration in robotics.
Quantification and Benchmarking: The need to quantify and create benchmarkable languages for assessing and evaluating manipulation skills in robotics is highlighted.
Collective Effort: A collaborative approach is encouraged, emphasizing the value of collective efforts in advancing robotics towards its North Stars.
Abstract
The Future of Vision: Bridging Perception and Action in AI and Robotics
In the world of artificial intelligence and robotics, vision is more than just seeing: it is the gateway to understanding and interacting with the environment. This is the core message of Fei-Fei Li, a distinguished professor at Stanford University and a leading authority in AI and computer vision. Her recent talk, “Octopus, Kittens, and Babies: From Seeing to Doing,” encapsulates over two decades of research and development in the field. Li argues that vision is integral to intelligence and stresses the importance of connecting perception with action. Her focus on the convergence of computer vision, deep learning, neuroscience, and robotics outlines a future where AI systems interact with the world with the same richness and complexity as living beings.
Vision and Intelligence: The Core of AI
Fei-Fei Li’s profound enthusiasm for vision, stemming from its fundamental role in intelligence, sets the stage for understanding its complexity. Vision occupies half of the brain’s cortical processing, signifying its critical role in our perception of the world. The progress in computer vision, especially the development of convolutional neural networks showcased by ImageNet’s 1,000-way object classification, marks a significant milestone in AI.
North Star for Computer Vision
Cognitive neuroscience revealed devoted brain areas for object categorization. This revolution influenced computer vision, highlighting the importance of object recognition at the category level. The ImageNet dataset and challenge served as a North Star for computer vision, providing a path for the field to reach that North Star. Neural networks dominated the ImageNet challenge results in the early years, with examples of early winning architectures including AlexNet, GoogLeNet, VGGNet, and ResNet.
The Quest for “Holy Grail” Problems in Computer Vision
Fei-Fei Li highlights the continuous search for fundamental problems in computer vision, akin to the quest for the “Holy Grail.” This journey has evolved from understanding edges and 3D vision, leading to practical applications like Google Street View, to focusing on object recognition. Cognitive neuroscience has significantly contributed to this domain, especially in object detection and recognition.
Deep Learning and Neuroscience: Foundations of Today’s AI
The deep learning revolution, powered by the convergence of computing power, algorithms, and data, particularly following the 2012 ImageNet Challenge, has been a watershed moment in AI. Its roots can be traced back to the late 1950s, with Hubel and Wiesel’s groundbreaking neuroscience research on the mammalian visual system. Their insights into hierarchical information processing laid the groundwork for neural network algorithms, further propelled by the backpropagation learning rule breakthrough in the 1980s.
The Intersection of Neuroscience and Computer Vision
The fusion of neuroscience and computer vision has fostered a deeper understanding of visual processing and the development of sophisticated computer vision algorithms. This interdisciplinary approach has been instrumental in the success of deep learning, with neural networks like AlexNet, GoogLeNet, and ResNet dominating successive ImageNet challenges.
Beyond Recognition: The Richness of Image Captioning and Scene Graphs
Li’s interest in storytelling through images led to a surge in research on image captioning, dense captioning, and paragraph captioning, with rapid progress occurring within a few years of the ImageNet Challenge. Image captioning and scene graphs represent a leap beyond single-label classification. These advancements enable AI to narrate the story of an image and understand complex relationships within it. Scene graphs, in particular, facilitate tasks like image retrieval and relationship understanding by representing images as networks of interconnected entities.
Active Vision and Embodied Intelligence
Moving from the static vision akin to Plato’s allegory of the cave, Fei-Fei Li emphasizes the importance of active and embodied vision. Real vision involves interaction with the environment, a concept supported by evolutionary theories, baby development studies, and experiments. Intelligence emerges from this active perception and interaction, characterized as exploratory, multi-modal, multi-task, generalizable, social, and interactive. Vision and other perceptual sensors played a critical role in the Cambrian explosion, driving evolution and speciation. Vision led to the development of more sophisticated animal intelligence.
Translating Active Vision into Robotics and AI
Li’s lab has been at the forefront of translating the philosophy of active vision into practical AI applications. This includes developing a self-aware “world model” that updates through exploration, researching how algorithms can learn to use tools in visual environments, and enabling robots to assemble new tools for specific tasks.
Navigating Complex Tasks with Structured Task Representation
Addressing long-horizon interactive tasks, like making an iced latte, requires navigating a complex sequence of actions. Li’s approach learns a structured task representation to control the exponential growth of the state space, utilizing one-shot visual imitation and Neural Task Programming (NTP).
Neural Task Graph Inference for Generalization
While NTP offers a structured approach, it requires extensive supervision. Neural Task Graph inference (NTG) presents an improvement, offering better generalization with weaker supervision through a task graph generator and a task graph executor.
The Crucial Link Between Perception and Action
The talk culminates in emphasizing the vital loop between perception and action, fundamental for interactive real-world tasks. The octopus, with its impressive visual system and manipulation abilities, serves as an exemplar of this link. The future of robotics lies in identifying and quantifying these “North Stars” of fundamental capabilities, with manipulation as a promising focus.
Conclusion
Fei-Fei Li’s insights pave the way for a new era in AI and robotics, where the focus shifts from mere vision to active, interactive intelligence. Her work not only underscores the importance of vision in AI but also sets a blueprint for future research directions, emphasizing the need for a holistic approach that integrates neuroscience, computer vision, and robotics. This philosophy of embodied intelligence, where seeing leads to doing, promises to revolutionize how AI systems interact with and understand the world around them.