Fei-Fei Li (Stanford Professor) – From Seeing to Doing (Dec 2021)


Chapters

00:00:02 From Seeing to Doing: Understanding and Interacting with the Real World
00:04:34 From Human Vision to Object Recognition: A Journey Through ImageNet
00:13:59 Perception of Visual Intelligence
00:19:59 Active Agents See to Act in the World
00:30:22 Benchmarking Embodied AI through Large-Scale, Diverse, and Ecological Tasks
00:35:18 Robotics: From Seeing to Doing

Abstract

The Evolution and Impact of Vision in Intelligence: From Biological Evolution to AI Revolution

The Role of Vision in Evolution and AI

The significance of vision in evolution and artificial intelligence (AI) cannot be overstated. Approximately 540 million years ago, the Cambrian explosion marked a dramatic increase in the diversity of animal species. Zoologist Andrew Parker suggests this surge was propelled by the evolution of vision, which not only triggered an evolutionary arms race but also became a pivotal factor in the development of intelligence. This progression in visual capabilities led to complex nervous systems and, eventually, to humans with advanced cognitive abilities.

Fei-Fei Li, a prominent figure in AI and computer vision, views vision as a cornerstone of intelligence, both in biological entities and artificial machines. Her research aims to understand intelligence and build intelligent machines through the lens of vision. This approach recognizes two fundamental aspects of vision: understanding the real world and using visual information to guide actions and interactions with the environment.

The Role of Vision in Evolution and AI (Supplemental)

Embodied and Active Visual Intelligence: Moving Beyond Passive Understanding

Real visual experience is dynamic: fully understanding how to interact with objects requires active movement and exploration. Philosopher Peter Godfrey-Smith emphasized the fundamental role of the nervous system in linking perception to action. Experiments on kittens support this view, showing that active exploration is crucial for the development of the visual system. In addition, mirror neurons, found in humans and monkeys, play a significant role in the perception of movement and in imitation.

Advancements in Visual Understanding and Computer Vision

Visual understanding, particularly in AI, hinges on the ability to recognize and understand objects in various contexts. Humans excel at this task, but for computers, it’s a challenge due to the diversity of objects and the complexity of translating visual information into numerical data. The evolution of object recognition in computer vision has seen a significant transformation from early geometric models to the machine learning revolution, especially with the advent of deep learning.

Key datasets such as PASCAL VOC and Fei-Fei Li’s ImageNet have been instrumental in this transformation. ImageNet, with its vast collection of images across numerous categories, has become a benchmark for object recognition, pushing training and evaluation to real-world scale. This advancement was further propelled by the ImageNet Challenge, which ran from 2010 to 2017 and set the stage for the deep learning revolution in object recognition.

Enhancing Scene Understanding: Beyond Objects

Understanding visual scenes extends beyond mere object recognition. The relationships between objects form a critical component of scene comprehension. Scene graph representation, a method that captures object identities, attributes, and their interrelations, has emerged as a vital tool in this domain. The Visual Genome dataset, with its extensive collection of images, objects, relationships, and textual descriptions, has enabled significant research in scene graph representation and relationship recognition.
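A scene graph can be pictured as a small data structure: objects with attributes are nodes, and relationships are directed, labeled edges. The following is a minimal sketch of that idea, not the Visual Genome format itself; the class names, method names, and example scene are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    """One object in the scene plus its attributes (illustrative names)."""
    name: str
    attributes: list = field(default_factory=list)


@dataclass
class SceneGraph:
    """Objects are nodes; (subject, predicate, object) triples are directed edges."""
    objects: dict = field(default_factory=dict)
    relationships: list = field(default_factory=list)

    def add_object(self, obj_id, name, attributes=()):
        self.objects[obj_id] = SceneObject(name, list(attributes))

    def relate(self, subject_id, predicate, object_id):
        # Both endpoints must already be nodes in the graph.
        if subject_id not in self.objects or object_id not in self.objects:
            raise KeyError("unknown object id")
        self.relationships.append((subject_id, predicate, object_id))

    def _phrase(self, obj_id):
        obj = self.objects[obj_id]
        return " ".join(obj.attributes + [obj.name])

    def describe(self):
        """Render each relationship as a short caption-like phrase."""
        return [
            f"{self._phrase(s)} {p} {self._phrase(o)}"
            for s, p, o in self.relationships
        ]


# Hypothetical scene, invented for this example.
graph = SceneGraph()
graph.add_object("o1", "man", ["young"])
graph.add_object("o2", "horse", ["brown"])
graph.add_object("o3", "hat")
graph.relate("o1", "riding", "o2")
graph.relate("o1", "wearing", "o3")
print(graph.describe())
```

Keeping relationships as explicit triples is what lets the same structure serve both relationship recognition and the generation of short textual descriptions.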

The Intersection of Perception and Language

The integration of perception and language is a significant aspect of visual intelligence. Datasets like ImageNet and Visual Genome, along with methodologies like scene graph representation, have contributed immensely to this field. Image captioning, including dense and paragraph captioning, exemplifies this intersection, where models generate textual descriptions for visual content.

Embodied Cognition and Active Perception (Supplemental)

Embodied Agents: Key Ingredients for Acting in the World

Embodied agents are essential for active interaction in the world, requiring capabilities for embodied perception and action. These agents need the ability to move around, explore, and exploit their environment. Embodiment encompasses multimodal perception, multitasking, generalization, and social interaction.

The Future of AI and Robotics: Embracing Embodied Cognition

Robots in the Real World

For robots to effectively perform complex human tasks, they must possess embodied cognition, enabling them to move, explore, and interact with their environment. This requirement extends to generalizing knowledge to new situations and collaborating with other agents.

Learning Approaches in AI (Supplemental)

Intrinsic Motivation and Explorative Learning

Intrinsic motivation drives explorative learning, in which agents learn without a specific goal. This type of learning can be guided by novelty-based, skill-based, or world-model-based motivations. The proposed self-aware agent pairs a world model with a self-model: the self-model predicts the world model's loss, and the agent chooses actions that maximize this predicted loss. Seeking out what it cannot yet predict lets the agent learn through self-exploration, much as human babies do.
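The world-model and self-model interplay can be illustrated with a deliberately tiny, tabular sketch. It is not the model from the talk, which is assumed to use learned neural networks; the environment, learning rates, and variable names below are all invented for illustration. The world model predicts the outcome of each action, the self-model predicts the world model's loss, and the agent always picks the action whose predicted loss is highest.

```python
# Toy setting (assumed): each action has a fixed, initially unknown outcome.
true_outcome = [0.0, 1.0, -2.0, 3.0]
n_actions = len(true_outcome)

world_model = [0.0] * n_actions     # predicted outcome per action
predicted_loss = [1.0] * n_actions  # self-model: expected world-model loss
lr_world, lr_self = 0.5, 0.5        # hypothetical learning rates
visits = [0] * n_actions

for step in range(200):
    # Curiosity policy: choose the action the self-model expects
    # the world model to predict worst.
    action = max(range(n_actions), key=lambda a: predicted_loss[a])

    outcome = true_outcome[action]
    loss = (world_model[action] - outcome) ** 2

    # The world model learns the outcome; the self-model learns the loss.
    world_model[action] += lr_world * (outcome - world_model[action])
    predicted_loss[action] += lr_self * (loss - predicted_loss[action])
    visits[action] += 1

print(visits)
print([round(w, 3) for w in world_model])
```

As the world model improves on an action, the predicted loss for that action falls and the agent's attention shifts elsewhere, so no external reward is needed to make it cover every action.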

Goal-Based Exploitative Learning

Exploitative learning involves goal-directed tasks. Current robotic learning often focuses on short-horizon, skill-level tasks, and getting robots to perform longer-horizon tasks is a key challenge. Compositional representations allow skill-level tasks to be stacked hierarchically, so that robots can compose them automatically into longer-horizon tasks and recover from interruptions.
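Hierarchical composition can be sketched as recursive expansion of a task into primitive skills. This is only an illustration of the idea, not the method presented in the talk: the task library, the skill names, and the `expand` and `resume` helpers are hypothetical.

```python
# Hypothetical task library: composite tasks map to ordered subtasks;
# anything not listed is treated as a primitive skill.
TASK_LIBRARY = {
    "set_table": ["place_plate", "place_cup"],
    "place_plate": ["pick(plate)", "move_to(table)", "release(plate)"],
    "place_cup": ["pick(cup)", "move_to(table)", "release(cup)"],
}


def expand(task, library):
    """Recursively flatten a task into its sequence of primitive skills."""
    if task not in library:
        return [task]
    primitives = []
    for subtask in library[task]:
        primitives.extend(expand(subtask, library))
    return primitives


def resume(task, library, completed):
    """After an interruption, return the primitives that still need to run."""
    return expand(task, library)[completed:]


plan = expand("set_table", TASK_LIBRARY)
print(plan)
print(resume("set_table", TASK_LIBRARY, completed=4))
```

Because the long-horizon plan is derived from the hierarchy rather than stored as one flat script, an interrupted run can pick up from the last completed primitive instead of starting over.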

Generalization of Long Horizon Tasks

Curriculum learning is used to train robots on long-horizon tasks through a series of simpler target tasks. This approach enables generalization to different simulated environments and tasks.

Challenges and Future Directions

Despite significant progress, robots still struggle with common sense reasoning and adapting to unexpected situations. Future research should focus on developing deeper world understanding and adaptability in robots.

The Future of AI and Robotics: Embracing Embodied Cognition (Supplemental)

Remaining Challenges

Achieving real-world task performance remains a challenge in robotic learning. Embodied perception and action require addressing the key missing piece that connects perception to action.

Behavior Benchmarking for Robotic Embodied Agents

Fei-Fei Li discusses the difficulty of behavioral tasks and the importance of benchmarking robotic embodied agents against behavioral datasets. She presents a graph comparing the performance of a robotic agent with default behavior against state-of-the-art algorithms; the default behavior yields performance close to zero, which shows how hard these tasks remain for robotic learning. She highlights the iGibson simulation environment, which supports the BEHAVIOR challenge and provides a benchmark for robotic embodied agents. She also presents her group's work in robotic learning, including curiosity-based exploratory learning and long-horizon, task-driven learning, and emphasizes vision as a cornerstone of intelligence that enables both understanding and action in the real world.

Li cites J. J. Gibson’s ecological approach to perception as an inspiration for this research on robotic learning.

Conclusion

The journey from the evolution of vision in the natural world to its emulation and enhancement in artificial intelligence and robotics represents a remarkable synergy of biology and technology. Vision, once a survival tool in the natural world, now stands at the forefront of intelligence, both biological and artificial. As AI continues to evolve, the integration of visual understanding, embodied cognition, and interactive learning models paves the way for a future where intelligent machines can perceive, understand, and interact with the world in ways that were once the sole domain of living beings.


Notes by: Hephaestus