Visuomotor skill learning.
Imitation learning is a prominent approach to learning visuomotor skills
from demonstrations. Diffusion-based behavior cloning improves multi-modal
action modeling but tends to overfit to scene layouts seen in
demonstrations. Prior work addresses this by scaling up data collection,
synthesizing demonstrations, or discovering skills with reinforcement
learning; however, these approaches often yield a single policy that
struggles to generalize beyond the training horizon, or are restricted
to simulation.
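To make the multi-modality claim concrete, the following is a minimal
sketch of the idea behind diffusion-based behavior cloning, not any cited
system's implementation: the policy samples an action by iteratively
denoising Gaussian noise conditioned on the observation, so different
noise seeds can land in different modes of the demonstrated action
distribution. The step count, noise schedule, and the noise_model and
sample_action names are illustrative assumptions.

    import numpy as np

    # Illustrative DDPM-style action sampler for a behavior-cloning policy.
    # noise_model stands in for a trained network eps_theta(a_t, obs, t);
    # it is a hypothetical stub here so the sketch runs end to end.
    T = 50                              # diffusion steps (assumed)
    betas = np.linspace(1e-4, 0.02, T)  # noise schedule (assumed)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    def noise_model(a_t, obs, t):
        return np.zeros_like(a_t)  # stub for the learned noise predictor

    def sample_action(obs, action_dim=7, rng=np.random.default_rng(0)):
        # Reverse diffusion: start from Gaussian noise and denoise
        # step by step, conditioning each prediction on the observation.
        a = rng.standard_normal(action_dim)
        for t in reversed(range(T)):
            eps = noise_model(a, obs, t)
            a = (a - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) \
                / np.sqrt(alphas[t])
            if t > 0:  # inject noise on all but the final step
                a += np.sqrt(betas[t]) * rng.standard_normal(action_dim)
        return a

    action = sample_action(obs=None)  # different seeds can yield different modes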
Compositional generalization for robot manipulation.
Generalizing by recombining previously learned skills has been explored
through hierarchical structures, language-guided temporal segmentation,
and symbolic action representations (manual, learned, or with added
segmentation annotations). In contrast, STACK constructs spatiotemporal
compositional action representations directly from data using
foundation-model priors and explicitly reasons about geometric
constraints, thereby avoiding the limitations of purely discrete
symbolic representations.
Foundation models for robotics.
Foundation models have been adapted as vision-language-action models for
end-to-end visuomotor learning, or used to provide structural priors
such as open-world perception, goal interpretation, model specification,
visual dynamics prediction, and high-level decision-making. STACK is
distinctive in asking whether foundation models can supply these
structural abstractions themselves, by discovering spatial and temporal
structure directly from raw demonstrations.