Fei-Fei Li (Stanford Professor) – Octopus, Kittens & Babies (Jul 2020)


Chapters

00:00:08 Octopus, Kittens, and Babies: From Seeing to Doing
00:03:15 Deep Learning Revolution: The Watershed Moment and Its Impact on Computer Vision
00:05:42 Origins of Neural Networks and the Search for North Star Problems in Computer Vision
00:14:05 Computer Vision's North Star: ImageNet and the Path to Object Recognition
00:16:43 Visual Storytelling: From Image Captioning to Scene Graph Understanding
00:21:06 Active Perception and Interaction for Intelligent Systems
00:28:08 Automated Tool Assembly and Use with Visual Feedback
00:33:18 Neural Task Programming for Robot Imitation Tasks
00:41:11 AI Experts Discuss Next Imaging Frontier for Robotics

Abstract

The Future of Vision: Bridging Perception and Action in AI and Robotics

In the world of artificial intelligence and robotics, vision is more than just seeing; it is the gateway to understanding and interacting with the environment. This is the core message of Fei-Fei Li, a distinguished professor at Stanford University and a leading authority in AI and computer vision. Her recent talk, “Octopus, Kittens, and Babies: From Seeing to Doing,” encapsulates over two decades of research and development in the field. Li argues that vision is integral to intelligence and stresses the importance of connecting perception with action. Her focus on the convergence of computer vision, deep learning, neuroscience, and robotics outlines a future where AI systems interact with the world with the same richness and complexity as living beings.

Vision and Intelligence: The Core of AI

Fei-Fei Li’s profound enthusiasm for vision stems from its fundamental role in intelligence: roughly half of the brain’s cortex is involved in visual processing, signifying its critical role in our perception of the world. The progress in computer vision, especially convolutional neural networks performing 1,000-way object classification on ImageNet, marks a significant milestone in AI.

North Star for Computer Vision

Cognitive neuroscience revealed dedicated brain areas for object categorization, and this finding influenced computer vision by highlighting the importance of recognition at the category level. The ImageNet dataset and challenge served as a North Star for computer vision, giving the field a concrete target to work toward. Convolutional neural networks dominated the ImageNet challenge from 2012 onward, with landmark architectures including AlexNet, GoogLeNet, VGGNet, and ResNet.
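
For concreteness, here is roughly what 1,000-way ImageNet classification looks like today with an off-the-shelf pretrained network. This is a minimal sketch using torchvision’s pretrained ResNet-50, not the talk’s own code; the image path is a placeholder and the architecture is interchangeable.

```python
# Minimal sketch: 1,000-way ImageNet classification with a pretrained CNN.
import torch
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.DEFAULT            # ImageNet-1k weights
model = resnet50(weights=weights).eval()      # inference mode
preprocess = weights.transforms()             # matching resize/crop/normalize

img = Image.open("example.jpg").convert("RGB")  # placeholder input image
batch = preprocess(img).unsqueeze(0)            # shape: (1, 3, 224, 224)

with torch.no_grad():
    logits = model(batch)                       # shape: (1, 1000)
probs = logits.softmax(dim=1)
top_prob, top_idx = probs.max(dim=1)
print(weights.meta["categories"][top_idx.item()], top_prob.item())
```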

The Quest for “Holy Grail” Problems in Computer Vision

Fei-Fei Li highlights the field’s continuous search for fundamental problems, akin to the quest for the “Holy Grail.” This journey evolved from the study of edges and 3D vision, which enabled practical applications like Google Street View, toward object recognition as a central problem. Cognitive neuroscience has contributed significantly to this domain, especially in object detection and recognition.

Deep Learning and Neuroscience: Foundations of Today’s AI

The deep learning revolution, powered by the convergence of computing power, algorithms, and data, and crystallized by the 2012 ImageNet Challenge, has been a watershed moment in AI. Its roots trace back to the late 1950s and Hubel and Wiesel’s groundbreaking neuroscience research on the mammalian visual system. Their insights into hierarchical information processing laid the groundwork for neural network algorithms, further propelled by the backpropagation learning rule in the 1980s.

The Intersection of Neuroscience and Computer Vision

The fusion of neuroscience and computer vision has fostered a deeper understanding of visual processing and the development of sophisticated computer vision algorithms. This interdisciplinary approach has been instrumental in the success of deep learning, with neural networks like AlexNet, GoogLeNet, and ResNet dominating successive ImageNet challenges.

Beyond Recognition: The Richness of Image Captioning and Scene Graphs

Li’s interest in storytelling through images spurred a surge of research on image captioning, dense captioning, and paragraph captioning, with rapid progress coming within a few years of the ImageNet Challenge. Image captioning and scene graphs represent a leap beyond single-label classification: they enable AI to narrate the story of an image and to understand the complex relationships within it. Scene graphs, in particular, represent an image as a network of interconnected entities, which facilitates tasks like image retrieval and relationship understanding.
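
To make the scene-graph idea concrete, a graph can be as simple as entities plus (subject, predicate, object) triples. The following is an illustrative sketch (the entity and relation names are invented, not from the talk); querying by predicate hints at how scene graphs support image retrieval.

```python
# Minimal sketch of a scene graph: entities as nodes, pairwise
# relationships as labeled directed edges.
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    entities: set = field(default_factory=set)
    relations: set = field(default_factory=set)  # (subject, predicate, object)

    def add(self, subj, pred, obj):
        self.entities.update({subj, obj})
        self.relations.add((subj, pred, obj))

    def query(self, pred):
        """Return all entity pairs linked by a given predicate --
        the structured lookup that underlies relationship-based retrieval."""
        return [(s, o) for s, p, o in self.relations if p == pred]

g = SceneGraph()
g.add("man", "riding", "horse")
g.add("man", "wearing", "hat")
print(g.query("riding"))  # [('man', 'horse')]
```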

Active Vision and Embodied Intelligence

Moving beyond static vision, which Li likens to Plato’s allegory of the cave, she emphasizes active and embodied vision. Real vision involves interaction with the environment, a view supported by evolutionary theory, infant development studies, and experiments. Intelligence emerges from this active perception and interaction, which is exploratory, multi-modal, multi-task, generalizable, social, and interactive. Vision and other perceptual sensors played a critical role in the Cambrian explosion, driving evolution and speciation, and vision in turn led to the development of more sophisticated animal intelligence.

Translating Active Vision into Robotics and AI

Li’s lab has been at the forefront of translating the philosophy of active vision into practical AI applications. This includes developing a self-aware “world model” that updates through exploration, researching how algorithms can learn to use tools in visual environments, and enabling robots to assemble new tools for specific tasks.

Navigating Complex Tasks with Structured Task Representation

Addressing long-horizon interactive tasks, like making iced lattes, requires navigating a complex sequence of actions. Li’s approach involves learning a structured task representation to control the state space’s exponential growth, utilizing one-shot visual imitation and Neural Task Programming (NTP).

Neural Task Graph Inference for Generalization

While NTP offers a structured approach, it requires extensive supervision. Neural Task Graph Inference (NTG) improves on it, achieving better generalization with weaker supervision through a task graph generator and executor.

The Crucial Link Between Perception and Action

The talk culminates in emphasizing the vital loop between perception and action, fundamental for interactive real-world tasks. The octopus, with its impressive visual system and manipulation abilities, serves as an exemplar of this link. The future of robotics lies in identifying and quantifying these “North Stars” of fundamental capabilities, with manipulation as a promising focus.

Conclusion

Fei-Fei Li’s insights pave the way for a new era in AI and robotics, where the focus shifts from mere vision to active, interactive intelligence. Her work not only underscores the importance of vision in AI but also sets a blueprint for future research directions, emphasizing the need for a holistic approach that integrates neuroscience, computer vision, and robotics. This philosophy of embodied intelligence, where seeing leads to doing, promises to revolutionize how AI systems interact with and understand the world around them.



Self-Model and Curiosity Learning

* A self-model is a constantly updated world model driven by curiosity learning.

* Curiosity learning drives the self-model to explore the self and the real world (a minimal sketch follows this list).

* The self-model learns ego motion, object recognition, and more in distinct stages.
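
The talk does not spell out the implementation; one common way to realize curiosity-driven learning is to reward the agent with the prediction error of its own forward model, so it seeks out what it cannot yet predict. The sketch below follows that recipe; all dimensions and module shapes are illustrative assumptions, not the lab’s actual system.

```python
# Sketch of curiosity as forward-model prediction error (ICM-style).
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, EMB_DIM = 64, 8, 32   # illustrative sizes

encoder = nn.Sequential(nn.Linear(STATE_DIM, EMB_DIM), nn.ReLU())
forward_model = nn.Linear(EMB_DIM + ACTION_DIM, EMB_DIM)  # predicts next embedding

def intrinsic_reward(state, action, next_state):
    """High reward where the world model predicts poorly,
    driving exploration toward the unfamiliar."""
    phi, phi_next = encoder(state), encoder(next_state)
    pred_next = forward_model(torch.cat([phi, action], dim=-1))
    return (pred_next - phi_next).pow(2).mean(dim=-1)  # prediction error

s, a, s2 = torch.randn(1, STATE_DIM), torch.randn(1, ACTION_DIM), torch.randn(1, STATE_DIM)
print(intrinsic_reward(s, a, s2).item())
```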

Interaction with Tools

* Humans are unique in their ability to use and build sophisticated tools.

* High-dimensional visual observations make robot learning in visual worlds challenging.

* An encoder-decoder network transforms high-dimensional visual information into actions (sketched after this list).

* Key point representation involves inferring grasp points, function points, and effect points.

* Key point translation maps complex visual input into a compact action space.
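
As a concrete illustration of the encoder-decoder idea, the sketch below maps an image of a tool to three 2-D key points (grasp, function, and effect). The architecture and sizes are illustrative assumptions, not the lab’s actual model.

```python
# Sketch: encoder-decoder from an image to three 2-D key points.
import torch
import torch.nn as nn

class KeypointNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                # image -> feature vector
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.decoder = nn.Linear(32, 3 * 2)          # 3 key points x (x, y)

    def forward(self, img):
        return self.decoder(self.encoder(img)).view(-1, 3, 2)

net = KeypointNet()
img = torch.randn(1, 3, 64, 64)                      # dummy RGB observation
grasp, function, effect = net(img)[0]                # one 2-D point each
print(grasp, function, effect)
```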

Key Point Representation for Tool Usage

* Key point representation is a robust and classic idea in computer vision.

* Key points include grasp point, function point, and effect point.

* Downstream tasks include pushing, reaching, and hammering (a hammering example is sketched after this list).

* Building on key points, robots can assemble new tools to perform tasks.
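
Once the three key points are available, a downstream task can be phrased directly in terms of them. The sketch below derives a simple hammering command; all coordinates and names are invented for illustration.

```python
# Sketch: turning key points into a hammering action.
import numpy as np

def hammering_action(grasp, function, effect):
    """Hold the tool at the grasp point and drive its function point
    toward the effect point (the target to strike)."""
    approach = effect - function              # vector to the target
    distance = np.linalg.norm(approach)
    return {"hold_at": grasp,
            "strike_dir": approach / distance,
            "travel": distance}

grasp    = np.array([0.10, 0.00])   # where the robot holds the tool
function = np.array([0.30, 0.00])   # hammer head
effect   = np.array([0.30, 0.20])   # nail position
print(hammering_action(grasp, function, effect))
```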

Long Horizon Interactive Tasks

* Long horizon interactive tasks involve complex sequential processes with many choices and state changes, such as making an iced latte.

* Planning for these tasks requires considering basic skills, object interactions, state changes, and overall context.

* The search space for these tasks is prohibitively large.

Structured Task Representation

* Structured task representation is key to controlling the explosive state space.

* One-shot visual imitation involves imitating a task given a single video demonstration.

* Hierarchical programs can reduce task space complexity by breaking down the task into subtasks and executing them sequentially (see the sketch after this list).
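
The sketch below illustrates the core of hierarchical decomposition: the controller searches over a small tree of subtasks instead of the full flat action sequence. Task names are invented for illustration.

```python
# Sketch: hierarchical decomposition of a long-horizon task.
TASK_TREE = {
    "make_iced_latte": ["brew_espresso", "add_ice", "pour_milk"],
    "brew_espresso":   ["grind_beans", "pull_shot"],
}

def execute(task):
    """Recursively expand subtasks; leaves are primitive robot skills."""
    for sub in TASK_TREE.get(task, []):
        execute(sub)
    if task not in TASK_TREE:                # primitive skill
        print("executing primitive:", task)

execute("make_iced_latte")
```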

Neural Task Programming (NTP)

* NTP generates neural programs that act as reactive policies to control the robot’s execution of demonstrated tasks.

* NTP traverses down a hierarchical structure of subtasks, executing movements and picking actions at each level.

* NTP generalizes better to unseen tasks compared to flat or sequential execution methods (a sketch of the recursive control flow follows this list).
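
What distinguishes NTP from a fixed task tree is that the decomposition is produced by a learned, reactive policy conditioned on the demonstration and the current observation. The sketch below mocks that policy with a scripted lookup to show only the control flow; nothing here is the paper’s actual API.

```python
# Sketch of NTP-style recursive control flow with a mocked core network.
PRIMITIVES = {"move_to", "grip", "release"}

def core_network(program, demo, obs):
    """Stand-in for the learned policy; a real NTP predicts the next
    subprogram (and the demo segment it attends to) from its inputs."""
    scripted = {
        "pick_and_place": ["pick", "place"],
        "pick": ["move_to", "grip"],
        "place": ["move_to", "release"],
    }
    return scripted.get(program, [])

def run(program, demo, obs):
    if program in PRIMITIVES:
        print("primitive:", program)          # robot API call
        return
    for child in core_network(program, demo, obs):
        run(child, demo, obs)                 # recurse down the hierarchy

run("pick_and_place", demo=None, obs=None)
```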

Neural Task Graph Inference

* NTP requires a lot of supervision.

* Neural task graph inference uses a task graph generator and executor to improve generalization with weaker supervision.

* The task graph generator creates a task graph representation from video demonstrations.

* The task graph executor decides actions based on the generated task graph and current observations.

* Neural task graphs generalize better than NTP and other baselines (see the sketch after this list).
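
The sketch below captures the generator/executor split in miniature: the generator is mocked to return a fixed graph of subtasks with precondition edges, and the executor runs any subtask whose preconditions are met. All task names are invented for illustration.

```python
# Sketch of the task-graph idea: precondition edges plus a greedy executor.
from dataclasses import dataclass, field

@dataclass
class TaskGraph:
    preconditions: dict = field(default_factory=dict)  # task -> prerequisite tasks

def generate_graph(demonstration):
    """Stand-in for the learned generator that parses a video demo."""
    return TaskGraph(preconditions={
        "pick_cup":  set(),
        "pour_milk": {"pick_cup"},
        "add_ice":   {"pick_cup"},
        "serve":     {"pour_milk", "add_ice"},
    })

def execute(graph):
    """Stand-in for the learned executor: run any subtask whose
    preconditions are satisfied, given the current state."""
    done = set()
    while len(done) < len(graph.preconditions):
        ready = [t for t, pre in graph.preconditions.items()
                 if t not in done and pre <= done]
        task = ready[0]              # a learned policy would choose here
        print("executing:", task)
        done.add(task)

execute(generate_graph(demonstration=None))
```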

Unveiling the North Stars

* Li emphasizes the significance of striving for fundamental capabilities rather than aiming for specific datasets like ImageNet in robotics.

* The focus should be on identifying the “North Stars” of robotics, which are fundamental functionalities and capabilities inspired by the natural world.

* Manipulation, involving hand movements, planning, and tool creation, is seen as a key area of exploration in robotics.

* The need to quantify manipulation skills and create a benchmarkable language for assessing and evaluating them is highlighted.

* A collaborative approach is encouraged, emphasizing the value of collective efforts in advancing robotics towards its North Stars.


Notes by: QuantumQuest