Fei-Fei Li (Stanford) (Dec 2021)

Fei-Fei Li (Stanford Professor) – From Seeing to Doing (Dec 2021)

Chapters

00:00:02 From Seeing to Doing: Understanding and Interacting with the Real World

00:04:34 From Human Vision to Object Recognition: A Journey Through ImageNet

00:13:59 Perception of Visual Intelligence

00:19:59 Active Agents See to Act in the World

Defining Passive Understanding and Its Limitations:
Passive understanding, like that of the prisoners in Plato’s Allegory of the Cave who only watch shadows on a wall, limits our ability to interact with the world effectively. Real visual experience is dynamic, requiring active movement and exploration to fully grasp how to interact with objects.

Visual Intelligence as Linking Perception and Action:
A quote from philosopher Peter Godfrey-Smith highlights the fundamental role of the nervous system in connecting perception to action. Experiments on kittens demonstrate the importance of active exploration for visual system development. Mirror neurons in humans and monkeys facilitate movement perception and imitation.

Embodied Agents: Key Ingredients for Acting in the World:
Embodied agents should be capable of embodied perception and action. Embodied agents should be able to move around, explore, and exploit the environment. Embodiment involves multimodal perception, multitasking, generalization, and social interaction.

The Far-Reaching Dream of Robotic Learning:
Creating robots that can perform complex human tasks is a long-term goal of AI research. Robots like Rosie represent this aspiration, though much work remains to be done.

Intrinsic Motivation and Explorative Learning:
Intrinsic motivation drives explorative learning, where agents learn without a specific goal. Novelty-based, skill-based, and world model-based motivations guide exploration. The proposed self-aware agent pairs a world model, which predicts the consequences of actions, with a self-model, which predicts the world model’s loss. The agent picks the actions that the self-model expects to maximize that loss, which steers exploration toward what it does not yet understand. The model demonstrates the ability to learn through self-exploration, similar to human babies.

Goal-Based Exploitative Learning:
Exploitative learning involves goal-directed tasks. Current robotic learning often focuses on short-horizon, skill-level tasks. Getting robots to perform longer horizon tasks is a key challenge. Compositional representation enables hierarchical stacking of skill-level tasks. By composing tasks automatically, robots can perform longer horizon tasks and recover from interruptions.

Generalization of Long Horizon Tasks:
Curriculum learning is used to train robots on long horizon tasks through a series of simpler target tasks. This approach enables generalization to different simulated environments and tasks.

Remaining Challenges:
Achieving real-world task performance remains a challenge in robotic learning. Embodied perception and action require addressing the key missing piece that connects perception to action.

00:30:22 Benchmarking Embodied AI through Large-Scale, Diverse, and Ecological Tasks

Current Robotic Tasks:
Current robotic tasks are typically skill-level and short-horizon, lacking standard metrics and often failing in real-world scenarios.

Ecological Approach:
Fei-Fei Li emphasizes the need for an ecological approach to perception and robotic learning, inspired by J.J. Gibson’s work. This involves creating benchmarks that are large scale, diverse, ecological, complex, and have standardized evaluation metrics.

Behavior Benchmark:
Behavior is a benchmark for everyday household activities in virtual, interactive, and ecological environments. It is enabled by the simulation environment iGibson 2.0, which is object-centric and allows for realistic object modeling, photorealistic rendering, and full physically simulated action execution.

iGibson Environment:
iGibson is inspired by concurrent work like Habitat, ThreeDWorld, SAPIEN, and AI2-THOR. It aims to be realistic in object modeling and photorealistic in rendering, and it allows for both kinematic and non-kinematic state changes, as well as fully physically simulated action execution. It also supports a VR interface for human demonstration.

Behavior Dataset:
Behavior consists of 100 different tasks gathered from the American Time Use Survey (ATUS), which the U.S. Bureau of Labor Statistics conducts by sampling what Americans do in their daily lives. It covers a wide range of tasks compared to other datasets, which focus on a narrower band of tasks. The statistics of Behavior track the general statistics of the ATUS activities.

Ecological and Complex:
Behavior is ecological in general, with extensive statistical analysis showing the diversity of objects and scenes. It also aims for long horizons and complexity, with an average task length of 300 to 20,000 steps, compared to other task benchmarks whose tasks are typically shorter than 100 steps or between 100 and 1,000 steps.

Standardized Evaluation Metrics:
Behavior attempts to standardize evaluation metrics by allowing a logic-based representation to score the end state compared to the initial state.
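
As a rough illustration of what logic-based scoring of an end state could look like, the Python sketch below represents a goal as a set of ground predicates and reports how many hold in the final state. The predicate names (`inside`, `on_top`, `open`, `closed`), the object names, the `score_end_state` helper, and the fraction-satisfied score are all assumptions made for illustration; none of them is taken from the Behavior codebase.

```python
from typing import FrozenSet, Tuple

# A ground predicate such as ("inside", "apple", "fridge").
Predicate = Tuple[str, ...]
State = FrozenSet[Predicate]


def score_end_state(goal: FrozenSet[Predicate], initial: State, final: State) -> dict:
    """Compare the final state to the goal and to the initial state.

    Returns whether the full goal holds, the fraction of goal predicates that
    hold at the end, and the subset the agent newly achieved (true at the end
    but not at the start).
    """
    satisfied = goal & final
    newly_achieved = satisfied - initial
    return {
        "success": satisfied == goal,
        "fraction_satisfied": len(satisfied) / len(goal) if goal else 1.0,
        "newly_achieved": newly_achieved,
    }


# Hypothetical "put the apple away" task: the apple ends up in the fridge,
# but the fridge door is left open, so only half of the goal is met.
initial_state = frozenset({("on_top", "apple", "table"), ("open", "fridge")})
final_state = frozenset({("inside", "apple", "fridge"), ("open", "fridge")})
goal_state = frozenset({("inside", "apple", "fridge"), ("closed", "fridge")})

result = score_end_state(goal_state, initial_state, final_state)
```

Scoring against predicates rather than raw trajectories means two agents that reach the same end state by different routes receive the same score, which is one reason a logic-based representation lends itself to standardized evaluation.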

Human VR Demo:
Behavior also allows for human VR demonstrations, which provide a human reference against which an agent’s efficiency of execution can be benchmarked.

00:35:18 Robotics: From Seeing to Doing

Abstract

The Evolution and Impact of Vision in Intelligence: From Biological Evolution to AI Revolution

The Role of Vision in Evolution and AI

The significance of vision in evolution and artificial intelligence (AI) cannot be overstated. Approximately 540 million years ago, the Cambrian explosion marked a dramatic increase in animal species diversity. Zoologist Andrew Parker suggests this surge was propelled by the evolution of vision, which not only triggered an evolutionary arms race but also became a pivotal factor in the development of intelligence. This progression in visual capabilities led to complex nervous systems and, eventually, to humans with advanced cognitive abilities.

Fei-Fei Li, a prominent figure in AI and computer vision, views vision as a cornerstone of intelligence, both in biological entities and artificial machines. Her research aims to understand intelligence and build intelligent machines through the lens of vision. This approach recognizes two fundamental aspects of vision: understanding the real world and using visual information to guide actions and interactions with the environment.

The Role of Vision in Evolution and AI (Supplemental)

Embodied and Active Visual Intelligence: Moving Beyond Passive Understanding

Real visual experience is dynamic and requires active movement and exploration to fully understand how to interact with objects. Philosopher Peter Godfrey-Smith emphasized the fundamental role of the nervous system in linking perception to action. This is further evidenced by experiments on kittens, which show that active exploration is crucial for the development of the visual system. Additionally, mirror neurons in humans and monkeys play a significant role in the perception of movement and imitation.

Advancements in Visual Understanding and Computer Vision

Visual understanding, particularly in AI, hinges on the ability to recognize and understand objects in various contexts. Humans excel at this task, but for computers, it’s a challenge due to the diversity of objects and the complexity of translating visual information into numerical data. The evolution of object recognition in computer vision has seen a significant transformation from early geometric models to the machine learning revolution, especially with the advent of deep learning.

Key datasets such as PASCAL VOC and Fei-Fei Li’s ImageNet have been instrumental in this transformation. ImageNet, with its vast collection of images across numerous categories, has become a benchmark for object recognition, pushing the boundaries of training and understanding at a real-world scale. This advancement was further propelled by the ImageNet Challenge, which, from 2010 to 2017, set the stage for the deep learning revolution in object recognition.

Enhancing Scene Understanding: Beyond Objects

Understanding visual scenes extends beyond mere object recognition. The relationships between objects form a critical component of scene comprehension. Scene graph representation, a method that captures object identities, attributes, and their interrelations, has emerged as a vital tool in this domain. The Visual Genome dataset, with its extensive collection of images, objects, relationships, and textual descriptions, has enabled significant research in scene graph representation and relationship recognition.
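
To make the scene graph idea concrete, here is a minimal Python sketch of a data structure that stores objects, their attributes, and the relationships between them. The class names (`SceneObject`, `SceneGraph`), the method names, and the example scene are hypothetical; this is not the Visual Genome data format, only one simple way such a graph might be held in memory.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SceneObject:
    name: str
    attributes: List[str] = field(default_factory=list)


@dataclass
class SceneGraph:
    objects: List[SceneObject] = field(default_factory=list)
    # Each relationship is (subject index, predicate, object index).
    relationships: List[Tuple[int, str, int]] = field(default_factory=list)

    def add_object(self, name: str, *attributes: str) -> int:
        """Add an object node and return its index in the graph."""
        self.objects.append(SceneObject(name, list(attributes)))
        return len(self.objects) - 1

    def relate(self, subject: int, predicate: str, obj: int) -> None:
        """Add a directed edge: subject --predicate--> object."""
        self.relationships.append((subject, predicate, obj))

    def triples(self) -> List[str]:
        """Render relationships as readable 'subject predicate object' strings."""
        return [
            f"{self.objects[s].name} {p} {self.objects[o].name}"
            for s, p, o in self.relationships
        ]


# Hypothetical scene: a smiling man wearing a red hat rides a brown horse.
graph = SceneGraph()
man = graph.add_object("man", "smiling")
horse = graph.add_object("horse", "brown")
hat = graph.add_object("hat", "red")
graph.relate(man, "riding", horse)
graph.relate(man, "wearing", hat)
```

Calling `graph.triples()` on this example yields the two relationship strings "man riding horse" and "man wearing hat", the kind of structured description that relationship recognition models try to predict from pixels.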

The Intersection of Perception and Language

The integration of perception and language is a significant aspect of visual intelligence. Datasets like ImageNet and Visual Genome, along with methodologies like scene graph representation, have contributed immensely to this field. Image captioning, including dense and paragraph captioning, exemplifies this intersection, where models generate textual descriptions for visual content.

Embodied Cognition and Active Perception (Supplemental)

Embodied Agents: Key Ingredients for Acting in the World

Embodied agents are essential for active interaction in the world, requiring capabilities for embodied perception and action. These agents need the ability to move around, explore, and exploit their environment. Embodiment encompasses multimodal perception, multitasking, generalization, and social interaction.

The Future of AI and Robotics: Embracing Embodied Cognition

Robots in the Real World

For robots to effectively perform complex human tasks, they must possess embodied cognition, enabling them to move, explore, and interact with their environment. This requirement extends to generalizing knowledge to new situations and collaborating with other agents.

Learning Approaches in AI (Supplemental)

Intrinsic Motivation and Explorative Learning

Intrinsic motivation drives explorative learning, where agents learn without a specific goal. This type of learning is guided by novelty-based, skill-based, and world model-based motivations. The proposed self-aware agent pairs a world model, which predicts the consequences of actions, with a self-model, which predicts the world model’s loss. The agent picks the actions that the self-model expects to maximize that loss, which steers exploration toward what it does not yet understand and demonstrates an ability to learn through self-exploration, similar to human babies.
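
The interplay between a world model and a self-model can be sketched with a deliberately tiny, tabular toy in Python. Everything here is an assumption made for illustration: the `ToySelfAwareAgent` class, the deterministic toy world in which action i always yields a fixed number, the learning rate, and the optimistic initial loss estimates. The agent described in the talk would use learned models operating on visual input; this sketch only shows the selection rule of choosing the action with the highest predicted world-model loss.

```python
from typing import List


class ToySelfAwareAgent:
    """Tabular toy: a world model plus a self-model of the world model's loss."""

    def __init__(self, n_actions: int, lr: float = 0.5) -> None:
        self.lr = lr
        # World model: predicted outcome of each action.
        self.world_model: List[float] = [0.0] * n_actions
        # Self-model: predicted world-model loss for each action.
        # Optimistic start so every action looks worth trying at least once.
        self.predicted_loss: List[float] = [1.0] * n_actions

    def choose_action(self) -> int:
        # Curiosity: pick the action expected to surprise the world model most.
        return max(range(len(self.predicted_loss)), key=self.predicted_loss.__getitem__)

    def update(self, action: int, outcome: float) -> float:
        error = outcome - self.world_model[action]
        loss = error ** 2
        # Improve the world model on what was just observed.
        self.world_model[action] += self.lr * error
        # Move the self-model's loss estimate toward the observed loss.
        self.predicted_loss[action] += self.lr * (loss - self.predicted_loss[action])
        return loss


def explore(outcomes: List[float], steps: int) -> ToySelfAwareAgent:
    """Run the agent in a deterministic toy world where action i yields outcomes[i]."""
    agent = ToySelfAwareAgent(len(outcomes))
    for _ in range(steps):
        action = agent.choose_action()
        agent.update(action, outcomes[action])
    return agent


true_outcomes = [1.0, -2.0, 0.5]
agent = explore(true_outcomes, steps=200)
```

No reward or goal is given, yet the agent keeps returning to whichever action it currently predicts worst, so the world model ends up accurate for every action. Once an action is well understood its predicted loss drops and the agent's attention moves elsewhere.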

Goal-Based Exploitative Learning

Exploitative learning involves goal-directed tasks. Current robotic learning often focuses on short-horizon, skill-level tasks. Getting robots to perform longer horizon tasks is a key challenge. Compositional representation enables hierarchical stacking of skill-level tasks, allowing robots to perform longer horizon tasks and recover from interruptions by composing tasks automatically.
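
One way to picture hierarchical composition and recovery from interruptions is the small Python sketch below. The `Skill` and `Task` classes, the set-of-facts world state, and the tea-making example are hypothetical stand-ins chosen for illustration; they are not the representation used in the work described in the talk, where the composition is learned rather than hand-written.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Skill:
    """Primitive skill: makes one fact true in the world state."""
    name: str
    achieves: str

    def done(self, state: Set[str]) -> bool:
        return self.achieves in state

    def step(self, state: Set[str], log: List[str]) -> None:
        state.add(self.achieves)
        log.append(self.name)


@dataclass
class Task:
    """Composite task: an ordered list of skills or nested tasks."""
    name: str
    children: list = field(default_factory=list)

    def done(self, state: Set[str]) -> bool:
        return all(child.done(state) for child in self.children)

    def step(self, state: Set[str], log: List[str]) -> None:
        # Re-check from the first child so undone work is noticed and redone.
        for child in self.children:
            if not child.done(state):
                child.step(state, log)
                return


make_tea = Task("make tea", [
    Task("boil water", [Skill("fill kettle", "kettle_full"),
                        Skill("heat kettle", "water_hot")]),
    Skill("add tea bag", "bag_in_cup"),
    Skill("pour water", "tea_ready"),
])

state: Set[str] = set()
log: List[str] = []
for tick in range(10):
    if make_tea.done(state):
        break
    make_tea.step(state, log)
    if tick == 2:
        state.discard("bag_in_cup")  # Interruption: someone removes the tea bag.
```

Because each step re-evaluates which sub-task is still unfinished instead of replaying a fixed script, the interruption simply causes "add tea bag" to be executed a second time before the task continues.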

Generalization of Long Horizon Tasks

Curriculum learning is used to train robots on long horizon tasks through a series of simpler target tasks. This approach enables generalization to different simulated environments and tasks.
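
The general shape of a curriculum can be sketched in Python as a loop that orders tasks from easy to hard and advances only once the learner succeeds often enough. The `run_curriculum` function, the 0.9 success threshold, the `ToyLearner` whose skill grows by a fixed amount per round, and the block-stacking task names are all assumptions for illustration; the talk does not specify how its curriculum is scheduled.

```python
from typing import Callable, List, Tuple


def run_curriculum(
    tasks: List[Tuple[str, float]],
    train_and_evaluate: Callable[[float], float],
    threshold: float = 0.9,
    max_rounds_per_task: int = 100,
) -> List[Tuple[str, int]]:
    """Train on tasks from easiest to hardest, advancing once success >= threshold.

    Each task is (name, difficulty). Returns (name, rounds spent) for every
    task that was mastered, in the order it was mastered.
    """
    history: List[Tuple[str, int]] = []
    for name, difficulty in sorted(tasks, key=lambda t: t[1]):
        for round_idx in range(1, max_rounds_per_task + 1):
            success_rate = train_and_evaluate(difficulty)
            if success_rate >= threshold:
                history.append((name, round_idx))
                break
        else:
            break  # Stuck on this task, so harder ones are not attempted.
    return history


class ToyLearner:
    """Stand-in for a policy: skill grows by a fixed amount per training round."""

    def __init__(self) -> None:
        self.skill = 0.0

    def train_and_evaluate(self, difficulty: float) -> float:
        self.skill += 0.5
        return min(1.0, self.skill / difficulty)


learner = ToyLearner()
history = run_curriculum(
    [("stack 4 blocks", 4.0), ("stack 2 blocks", 1.0), ("stack 3 blocks", 2.0)],
    learner.train_and_evaluate,
)
```

In this toy run the skill built up on the easier tasks carries over, so the learner reaches the hardest task already partway there instead of starting it from scratch.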

Challenges and Future Directions

Despite significant progress, robots still struggle with common sense reasoning and adapting to unexpected situations. Future research should focus on developing deeper world understanding and adaptability in robots.

The Future of AI and Robotics: Embracing Embodied Cognition (Supplemental)

Remaining Challenges

Achieving real-world task performance remains a challenge in robotic learning. Embodied perception and action require addressing the key missing piece that connects perception to action.

Behavior Benchmarking for Robotic Embodied Agents

Fei-Fei Li discusses the difficulty of the Behavior tasks and the importance of benchmarking robotic embodied agents against such datasets. She presents a graph comparing the performance of a robotic agent with default behavior against state-of-the-art algorithms. The default behavior results in performance close to zero, demonstrating the challenges in robotic learning. Li highlights the iGibson environment, which enables the Behavior challenge and provides a benchmark for robotic embodied agents. She also presents her group’s work in robotic learning, including curiosity-based exploratory learning and long-horizon task-driven learning, and emphasizes the importance of vision as a cornerstone of intelligence, allowing for understanding and action in the real world.

Li cites J.J. Gibson’s ecological approach to perception as an inspiration for her group’s research on perception and robotic learning.

Conclusion

The journey from the evolution of vision in the natural world to its emulation and enhancement in artificial intelligence and robotics represents a remarkable synergy of biology and technology. Vision, once a survival tool in the natural world, now stands at the forefront of intelligence, both biological and artificial. As AI continues to evolve, the integration of visual understanding, embodied cognition, and interactive learning models paves the way for a future where intelligent machines can perceive, understand, and interact with the world in ways that were once the sole domain of living beings.

Notes by: Hephaestus
