Fei-Fei Li (Stanford Professor) – From Seeing to Doing (Nov 2022)


Chapters

00:00:29 The Evolution of Vision and Its Impact on Intelligence
00:02:57 Rapid Vision and Object Recognition: A Cornerstone of Human and Computer Intelligence
00:09:29 Evolution of Computer Vision from Hand-Designed Models to Deep Learning
00:14:08 Beyond Object Labeling: Scene Understanding and Relationships
00:23:06 Linking Perception with Action: The Role of Embodied Intelligence Systems
00:25:35 Robotic Learning for Unstructured Environments
00:29:41 Robotic Learning in Complex Environments: Challenges and Opportunities
00:38:11 Behavior: A Benchmark for Everyday Household Activities in Virtual Environments
00:44:01 Simulating Realistic Embodied AI Environments for Benchmarking and Training
00:47:52 Challenges and Opportunities in Robotic Learning

Abstract

Exploring the Evolution and Future of Vision and Intelligence: From the Cambrian Explosion to Embodied AI

The Dawn of Vision: Triggering the Cambrian Explosion

About 540 million years ago, animal life in the primordial soup was simple, with few species and little diversity. Then came the Cambrian explosion, a sudden proliferation of animal species over a geologically short period. This event produced a diverse range of animals with different shapes and behaviors, forever transforming the animal kingdom. Various theories attempt to explain the Cambrian explosion, including climate change and chemical changes in the water. One prominent theory suggests that the sudden evolution of vision triggered this dramatic surge in diversity.

The ability to see and sense the world led to an evolutionary arms race, where animals either evolved or faced extinction. Vision became a primary sensing apparatus for most animals, providing a significant advantage in survival and reproduction. This fundamental shift not only reshaped the animal kingdom but also set the stage for complex interactions and survival strategies based on visual perception.

Human Vision: The Cornerstone of Our Interaction with the World

For humans, vision is central to survival, navigation, manipulation, interaction, and understanding the world. A staggering 50% of our neocortex is involved in visual processing, underscoring the importance of visual intelligence in our daily lives. The rapid comprehension of scenes, the ability to recognize objects within milliseconds, and the challenge of deciphering complex visual environments are all a testament to the sophistication of human visual intelligence.

Visual Intelligence: From Scene Understanding to Object Recognition

Advancements in understanding visual intelligence have been monumental, especially in the fields of rapid scene understanding, object recognition, and the challenges posed by diverse visual inputs. Early attempts at object recognition in computer science involved creating hand-crafted models, which evolved into statistical machine learning approaches. The introduction of large-scale datasets like ImageNet further revolutionized the field, leading to the development of convolutional neural networks and deep learning techniques that dramatically improved object recognition accuracy.
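The modern recipe described above can be shown in a few lines. Below is a minimal sketch of ImageNet-style object recognition with a pretrained convolutional network; it assumes PyTorch and torchvision are installed, and the input filename is hypothetical.

```python
# Minimal sketch: classifying an image with a pretrained CNN, in the
# spirit of the post-ImageNet deep learning approaches described above.
import torch
from torchvision import models, transforms
from PIL import Image

# Standard ImageNet preprocessing: resize, center-crop, normalize.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.eval()

image = Image.open("kitchen.jpg").convert("RGB")  # hypothetical input
batch = preprocess(image).unsqueeze(0)            # add batch dimension

with torch.no_grad():
    logits = model(batch)
top5 = logits.softmax(dim=1).topk(5)
print(top5.indices, top5.values)  # ImageNet class indices and confidences
```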

Beyond Object Labeling: Deepening Understanding of Scenes

Recent progress in computer vision has moved beyond mere object labeling. The introduction of scene graphs and datasets like Visual Genome and Action Genome has enabled a more profound understanding of the relationships and interactions within a scene. These advancements are critical in enhancing algorithms for visual relationship estimation and generating detailed captions and narratives for images, pushing the boundaries of how machines interpret and interact with visual data.
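To make the scene-graph idea concrete, here is a minimal sketch of the underlying data structure in the spirit of Visual Genome: objects with attributes, connected by directed relationship edges. All names and coordinates are illustrative, not taken from the actual dataset format.

```python
# Minimal sketch of a scene graph: objects carry attributes and a
# bounding box; relationships are directed subject-predicate-object edges.
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    name: str
    bbox: tuple                                # (x, y, width, height) in pixels
    attributes: list = field(default_factory=list)

@dataclass
class Relationship:
    subject: SceneObject
    predicate: str                             # e.g. "sitting on", "holding"
    obj: SceneObject

# "A person sitting on a wooden bench, holding a red cup."
person = SceneObject("person", (120, 40, 80, 200))
bench = SceneObject("bench", (60, 180, 220, 90), attributes=["wooden"])
cup = SceneObject("cup", (150, 110, 20, 30), attributes=["red"])

graph = [
    Relationship(person, "sitting on", bench),
    Relationship(person, "holding", cup),
]

for r in graph:
    print(r.subject.name, r.predicate, r.obj.name)
```

A structure like this is what visual relationship estimation algorithms predict, and what caption generators can walk to produce detailed image narratives.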

Perception and Action: The Dynamic Duo

Vision is not just about understanding the world; it is also about participating in it. The relationship between perception and behavior is active and dynamic in the animal world, and embodied AI systems are designed to reflect this. Embodied AI agents, whether robots or computer simulations, can perceive their surroundings and take actions to interact with them. This embodied approach allows AI systems to learn more naturally and develop a deeper understanding of the world.
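The perception-action loop at the heart of such agents is straightforward to sketch. The example below assumes a Gymnasium-style environment interface; CartPole-v1 merely stands in for an embodied task, and the random policy is a placeholder for a learned one.

```python
# Minimal sketch of a perception-action loop, assuming the Gymnasium API.
import gymnasium as gym

env = gym.make("CartPole-v1")
observation, info = env.reset(seed=0)

for _ in range(200):
    # Perceive: the observation encodes the agent's view of the world.
    # Act: a real agent maps observation -> action with a learned policy;
    # here a random action stands in.
    action = env.action_space.sample()
    observation, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        observation, info = env.reset()

env.close()
```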

The Philosophical and Biological Foundations of Vision

The philosophical implications of vision, as highlighted by Plato’s Allegory of the Cave, remind us that vision is not merely passive observation but an active, participatory process. This is mirrored in biology by Held and Hein’s classic kitten-carousel experiment, in which only the actively moving kitten developed normal visual perception while its passively carried twin did not, showing that active engagement with the environment is crucial for the development of normal perception.

Fei-Fei Li and her colleagues at Stanford University emphasize the need for advances in robotics and embodied AI, particularly in unstructured environments. Their work explores various aspects of robotic learning, including exploratory learning and long-horizon task planning; modular learning approaches and curriculum learning with generative task models address some of the critical challenges in robotic learning. Their approach advocates creating robotic learning environments that mimic the real world, particularly households, which offer ecological complexity, dynamics, uncertainty, high variability, interactivity, social aspects, and inherent multitasking.
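As one illustration of the curriculum idea, here is a minimal sketch of curriculum learning over generated tasks. The task generator and success model below are hypothetical stand-ins, not the group's actual generative task models.

```python
# Minimal sketch of curriculum learning: generate tasks near the agent's
# current competence, advancing difficulty on success and backing off on
# failure. All functions here are illustrative placeholders.
import random

def generate_task(difficulty: float) -> dict:
    """Hypothetical generative task model: harder tasks have more subgoals."""
    return {"difficulty": difficulty, "subgoals": 1 + int(difficulty * 9)}

def attempt(task: dict) -> bool:
    """Stand-in for a learned policy; success gets rarer as tasks grow."""
    return random.random() > task["difficulty"] * 0.8

difficulty = 0.1
for episode in range(1000):
    task = generate_task(difficulty)
    if attempt(task):
        difficulty = min(1.0, difficulty + 0.01)   # push the frontier
    else:
        difficulty = max(0.0, difficulty - 0.01)   # retreat to easier tasks

print(f"final curriculum difficulty: {difficulty:.2f}")
```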

The introduction of large-scale, diverse, and photorealistic benchmarks like the BEHAVIOR benchmark and the OmniGibson simulation environment marks significant progress in evaluating and advancing embodied AI algorithms. These environments aim to replicate real-world complexity and challenge AI systems to navigate and interact with dynamic, uncertain, multitask scenarios.

Real-world Applications and Challenges in Embodied AI

The ecological approach to robotic learning, inspired by J.J. Gibson’s work, advocates creating realistic, household-like environments for robot training. State-of-the-art algorithms perform well on short-horizon tasks when given simplified “magic” actions, but struggle without them, and performance drops further as object pose and appearance diversity increase. Reinforcement learning algorithms have particular difficulty with long-horizon tasks such as decorating a store or cleaning a table; assumptions about action primitives and object states are still necessary for reasonable performance. To test sim-to-real algorithms, the team built a digital twin of a real-world apartment, an environment that lets researchers challenge robots in genuine real-world settings.
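The action-primitive assumption can be illustrated with a short sketch: a long-horizon chore is hand-decomposed into primitives the robot is assumed to execute reliably. Every name below is hypothetical and is not an actual BEHAVIOR or OmniGibson API.

```python
# Minimal sketch: a long-horizon task ("cleaning a table") scripted as a
# sequence of assumed-reliable action primitives. Without such primitives,
# an RL agent must discover each motion from scratch, which is where
# current algorithms still struggle.
from typing import Callable

def navigate_to(target: str) -> bool:
    print(f"navigating to {target}")
    return True

def grasp(obj: str) -> bool:
    print(f"grasping {obj}")
    return True

def place(obj: str, surface: str) -> bool:
    print(f"placing {obj} on {surface}")
    return True

plan: list[Callable[[], bool]] = [
    lambda: navigate_to("table"),
    lambda: grasp("plate"),
    lambda: navigate_to("sink"),
    lambda: place("plate", "sink"),
]

for step in plan:
    if not step():
        print("primitive failed; replanning needed")
        break
```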

Bridging Perception and Action in AI

The journey from the Cambrian explosion to modern-day AI research encapsulates the profound relationship between vision and intelligence. The advancements in computer vision and embodied AI highlight the ongoing quest to create systems that not only perceive but also meaningfully interact with the world. As we continue to unravel the complexities of visual intelligence, the fusion of perception and action remains a key frontier in the evolution of intelligent systems.


Notes by: Flaneur