Learning Composable Skills by Discovering
Spatial and Temporal Structure with Foundation Models

To appear at ICRA 2026
1Stanford University    2MIT
denotes equal advising

All tasks learned from just 5–10 demonstrations.

Bookshelf — new geometric constraints
Kitchen — longer horizon (8 steps)
Mug tree — longer horizon (8 steps)
Bookshelf — longer horizon (6 steps)

Summary

01

Structure from foundation models — no manual annotations.

Foundation models extract spatial and temporal structure from unsegmented demonstrations, eliminating the need for manual segmentation and hand-designed object-centric representations.

02

Skill samplers and their geometric effects.

For each discovered skill, we learn a diffusion-based trajectory sampler and a skill effect model, both operating in the reference frame of the relevant scene element — enabling geometry-aware skill composition.

03

Real-world generalization to novel task variations.

STACK enables long-horizon, bimanual, and non-prehensile manipulation in real-world settings, generalizing to new scene configurations, new geometric constraints, and longer task horizons beyond those in training.


Abstract

We present STACK, a framework for discovering and learning composable manipulation skills from unsegmented demonstrations by leveraging spatial and temporal structure extracted from foundation models. STACK automatically extracts temporal structure by segmenting raw demonstrations into short-horizon skills using a video-language model, and spatial structure by identifying skill-relevant elements in 3D point cloud observations. For each discovered skill, we learn a diffusion-based trajectory sampler and a skill effect model, both of which operate in the reference frame of the relevant scene element. At test time, given a language goal, STACK segments the 3D scene, samples skill trajectories, and composes them by simulating geometric effects. Across diverse real-world manipulation tasks, this enables generalization to new scene configurations, new geometric constraints, and task horizons longer than those seen in training.


Motivation

STACK teaser figure
STACK discovers and learns composable skills from a handful of unsegmented demonstrations using spatial and temporal structure extracted by foundation models.

Humans effortlessly perform long-horizon manipulation tasks. Take the everyday task of storing books on a shelf: we pick up books regardless of their placement, adapt to new obstacles on the shelf by adjusting where books go, and plan ahead by mentally simulating different layouts. Robots today cannot handle such diverse task variations from limited data.

Prior work shows that structure can help generalize from limited demonstrations. Temporal structure decomposes demonstrations into short-horizon composable skills that can be recomposed into new sequences. Spatial structure helps skills generalize to new environments by focusing on only the relevant scene elements. The challenge is that these structures are typically hand-designed. Our insight: video-language models can automatically extract both — giving us the dexterity of learned skills with the generalization of planning, without hand-coded predicates.


Overview

A short walkthrough of the full framework — skill discovery, learning, and planning.


Method

STACK method overview

A video-language model performs temporal segmentation, dividing each demonstration into skill segments with natural-language descriptions. Vision foundation models then perform spatial segmentation to extract point clouds of the skill-relevant entities. A trajectory sampler and effect model are trained for each discovered skill. At test time, skills are composed to solve novel long-horizon tasks.

Discovering Skills with Foundation Models

For each raw demonstration, STACK extracts temporal skill boundaries, then identifies the relevant entities for each skill. A video-language model receives the task goal, video frames, and proprioceptive cues (gripper width, joint torques); for bimanual scenes we overlay color-coded arm masks. The model returns skill timestamps in mm:ss format and a summary of each skill. Segmentation is run in two stages: a proposal stage generates coarse candidates, and a refinement stage revises their boundaries and descriptions.
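As a concrete sketch, the two-stage querying and mm:ss parsing described above might look as follows. All names here (`query_vlm`, `SkillSegment`, the stubbed `fake_vlm`) are hypothetical stand-ins for illustration, not the paper's actual interface:

```python
from dataclasses import dataclass

@dataclass
class SkillSegment:
    start_s: int   # segment start, in seconds
    end_s: int     # segment end, in seconds
    summary: str   # natural-language description of the skill

def parse_mmss(stamp: str) -> int:
    """Convert an 'mm:ss' timestamp returned by the model into seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)

def segment_demo(query_vlm, goal, frames, proprio):
    """Two-stage temporal segmentation: propose coarse candidates, then refine."""
    # Stage 1: propose coarse skill boundaries from video + proprioception.
    proposals = query_vlm(stage="propose", goal=goal, frames=frames, proprio=proprio)
    # Stage 2: revisit the candidates and revise boundaries/descriptions.
    refined = query_vlm(stage="refine", goal=goal, frames=frames, candidates=proposals)
    return [SkillSegment(parse_mmss(s), parse_mmss(e), summary)
            for (s, e, summary) in refined]

# Stubbed model call, standing in for a real video-language-model API:
def fake_vlm(stage, **kw):
    if stage == "propose":
        return [("00:00", "00:07", "grasp the mug")]
    return [("00:00", "00:06", "grasp the mug"),
            ("00:06", "00:15", "hang the mug on the tree")]

segs = segment_demo(fake_vlm, "hang the mug", frames=[], proprio=[])
```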

For every skill segment, STACK extracts the names of the relevant entities and reconstructs their point clouds. Open-vocabulary detectors use those names to find bounding boxes on the first frame; SAM2 then segments and tracks the entity through the video. Foundation models also determine skill arguments and drive fully automatic data augmentation in the entity's reference frame — no task-specific tuning.
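The last step of this pipeline, turning a segmentation mask into an entity point cloud, is a standard pinhole back-projection. A minimal sketch, assuming a metric depth image and a boolean mask (e.g. from a SAM2-style segmenter); the function name and argument layout are illustrative:

```python
import numpy as np

def mask_to_pointcloud(depth, mask, K):
    """Back-project masked depth pixels into a 3D point cloud (camera frame).

    depth: (H, W) metric depth image
    mask:  (H, W) boolean entity mask
    K:     (3, 3) pinhole intrinsics
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    v, u = np.nonzero(mask)          # pixel rows/cols inside the mask
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)   # (N, 3) entity point cloud

# Example: single masked pixel at (row 0, col 1), unit depth and intrinsics.
pc = mask_to_pointcloud(np.ones((2, 2)),
                        np.array([[False, True], [False, False]]),
                        np.eye(3))
```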

Learning Composable Skills

For each discovered skill, STACK trains a trajectory sampler that maps the segmented point clouds of relevant entities to a sequence of end-effector poses, conditioned on robot proprioception. The sampler is a conditional diffusion model, which captures the multimodal distribution of valid trajectories and generalizes across number of entities, number of arms, and dependence on previous end-effector state.
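For intuition, a bare-bones DDPM-style reverse sampling loop for a pose trajectory is sketched below. The `eps_model` argument stands in for the learned network conditioned on encoded point clouds and proprioception; the stub used here just predicts zero noise. This is a generic diffusion sampler, not the paper's specific architecture:

```python
import numpy as np

def ddpm_sample(eps_model, cond, horizon=16, dim=7, steps=50, rng=None):
    """Minimal DDPM reverse process for a trajectory of end-effector poses.

    eps_model(x_t, t, cond) predicts the noise at step t; `cond` stands in for
    the encoded entity point clouds + proprioception.
    """
    rng = np.random.default_rng(rng)
    betas = np.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    abar = np.cumprod(alphas)
    x = rng.standard_normal((horizon, dim))          # start from Gaussian noise
    for t in reversed(range(steps)):
        eps = eps_model(x, t, cond)
        # Posterior mean of the reverse step (standard DDPM update).
        x = (x - betas[t] / np.sqrt(1.0 - abar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                                    # add noise except at t = 0
            x += np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x                                          # (horizon, dim) trajectory

# Stub denoiser: always predicts zero noise.
traj = ddpm_sample(lambda x, t, c: np.zeros_like(x), cond=None, rng=0)
```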

STACK also trains a skill effect model that predicts how executing a skill transforms the relevant entities — a rigid-body transform in SE(3) applied to each entity's point cloud. This prediction enables test-time filtering of infeasible trajectories via collision and kinematic checks, which is what makes long-horizon planning tractable. Labels come from ICP alignment between the first and final frames of each skill segment.
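The ICP labels reduce, once correspondences are fixed, to the classic least-squares rigid alignment (Kabsch/Umeyama) between the entity's first- and last-frame point clouds. A self-contained sketch of that inner step, with noiseless corresponding points for illustration:

```python
import numpy as np

def rigid_transform(src, dst):
    """Least-squares SE(3) transform (R, t) mapping src points onto dst.

    With known correspondences this is the inner step of ICP; effect labels
    come from aligning an entity's point cloud before and after a skill.
    """
    mu_s, mu_d = src.mean(0), dst.mean(0)
    H = (src - mu_s).T @ (dst - mu_d)        # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t

# Recover a known rotation about z plus a translation.
rng = np.random.default_rng(0)
src = rng.standard_normal((20, 3))
c, s = np.cos(0.5), np.sin(0.5)
R_true = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
t_true = np.array([0.1, -0.2, 0.3])
R, t = rigid_transform(src, src @ R_true.T + t_true)
```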

Planning with Learned Skills

At test time, a vision-language model proposes candidate skill sequences for the new scene given a natural-language goal. The entity names in each skill drive open-vocabulary detection and segmentation to extract point clouds. STACK then runs a tree search starting from the first skill: for each skill it samples M candidate trajectories, predicts their effects, and checks collisions by pairwise k-d tree distance. Valid trajectories update the scene geometry; the next skill plans against the updated scene. Transitions between learned skills use a free-space transit skill and a move skill (rigid attachment during transport), with cuRobo computing collision-free connecting paths.
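The planning loop above can be caricatured as follows. This is a greedy, backtracking-free variant of the search, with stub callables in place of the learned sampler and effect model, and a brute-force pairwise distance where a real implementation would use a k-d tree; treat every name here as hypothetical:

```python
import numpy as np

def min_pair_dist(a, b):
    """Smallest distance between two point clouds (a k-d tree scales better)."""
    return np.min(np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1))

def plan(skills, scene, sample_traj, predict_effect, M=8, clearance=0.01):
    """Greedy forward search over a proposed skill sequence.

    `scene` maps entity name -> (N, 3) point cloud; sample_traj(skill, scene)
    and predict_effect(skill, traj, scene) stand in for the learned models.
    """
    plan_out = []
    for skill in skills:
        for _ in range(M):                    # sample candidate trajectories
            traj = sample_traj(skill, scene)
            new_scene = predict_effect(skill, traj, scene)
            names = list(new_scene)
            ok = all(min_pair_dist(new_scene[a], new_scene[b]) > clearance
                     for i, a in enumerate(names) for b in names[i + 1:])
            if ok:                            # commit; plan the next skill
                scene = new_scene             # against the updated geometry
                plan_out.append((skill, traj))
                break
        else:
            return None                       # no feasible candidate found
    return plan_out

# Toy scene with two well-separated entities and identity-effect stubs.
scene0 = {"mug": np.zeros((4, 3)), "tree": np.ones((4, 3))}
result = plan(skills=["pick mug", "hang mug"],
              scene=scene0,
              sample_traj=lambda skill, scene: np.zeros((8, 7)),
              predict_effect=lambda skill, traj, scene: scene)
```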


Real World Results

We evaluate STACK on three generalization axes — new scene configuration, new geometric constraints, and longer task horizons — across three real-world environments: bookshelf, kitchen, and mug tree.

Real-world evaluation setups: bookshelf, kitchen, mug tree

Generalization Results

Average partial success rate across 10 trials for baselines and ablations on the three generalization axes — SC (new scene configuration), GC (new geometric constraints), and LH (longer task horizons) — across the three domains.

Method            |  Mug Tree          |  Bookshelf         |  Kitchen
                  |  SC    GC    LH    |  SC    GC    LH    |  SC    GC    LH
DP3-Long          |  0     0     0     |  0     0     0     |  0     0     0
DP3-Short         |  16.7  10    0     |  0     0     0     |  0     0     0
BLADE             |  86.7  70    67.1  |  90.0  46.7  35.0  |  85.0  65.0  68.8
STACK (ours)      |  90.0  80.0  77.4  |  90.0  93.3  86.1  |  85.0  75.0  71.6
w/o Spatial Aug.  |  50.0  50.0  14.6  |  50.0  40.0  23.6  |  80.0  75.0  64.5
w/o Effect        |  83.3  75.0  56.7  |  85.0  47.9  67.5  |  70.0  60.0  67.9

Temporal Segmentation Results

STACK outperforms segmentation baselines with better alignment to ground truth and fewer spurious segments. ES, MoF: higher is better; #IS: lower is better.

Method              |  Mug Tree           |  Bookshelf          |  Kitchen
                    |  ES↑   MoF↑  #IS↓   |  ES↑   MoF↑  #IS↓   |  ES↑   MoF↑  #IS↓
UVD                 |  0.55  0.29  2.6    |  0.42  0.18  4      |  0.41  0.19  7.2
Contact Heuristic   |  0.50  0.53  1.2    |  0.50  0.31  0.36   |  0.70  0.57  0.16
Video-LM Zero-Shot  |  0.76  0.68  0      |  0.52  0.47  0.1    |  0.59  0.31  1
STACK (ours)        |  0.76  0.68  0      |  0.84  0.76  0      |  0.77  0.72  0

Bookshelf

Test Case 1
Test Case 2
Test Case 3
Test Case 1
Test Case 2
6 Steps
5 Steps

Kitchen

Test Case 1
Test Case 2
Test Case 1
Test Case 2
8 Steps
6 Steps

Mug Tree

Test Case 1
Test Case 2
Test Case 3
Test Case 1
Test Case 2
6 Steps
8 Steps

Related Work

Visuomotor skill learning. Imitation learning is a prominent approach to learning visuomotor skills from demonstrations. Diffusion-based behavior cloning improves multi-modal action modeling but tends to overfit to scene layouts seen in demonstrations. Prior work addresses this by scaling up data, synthesizing demonstrations, or discovering skills with reinforcement learning — but often yields a single policy that struggles to generalize beyond the training horizon, or is restricted to simulation.

Compositional generalization for robot manipulation. Generalizing by recombining previously learned skills has been explored through hierarchical structures, language-guided temporal segmentation, and symbolic action representations (manual, learned, or with added segmentation annotations). STACK constructs spatiotemporal compositional action representations directly from data using foundation-model priors, and explicitly reasons about geometric constraints — avoiding the limitations of purely discrete symbolic representations.

Foundation models for robotics. Foundation models have been adapted as vision-language-action models for end-to-end visuomotor learning, or used to provide structural priors: open-world perception, goal interpretation, model specification, visual dynamics prediction, and high-level decision-making. STACK is distinctive in asking whether foundation models can provide these structural abstractions themselves by discovering spatial and temporal structure from raw demonstrations.


Future Directions

STACK currently assumes rigid-body and articulated-object interactions and does not handle deformable objects such as cloth. Evaluations focus on open-loop skill composition; the learned trajectory samplers operate at 2–5 Hz and STACK supports replanning after each execution, but extending to dynamic tasks is left for future work. STACK can also be extended to stochastic settings by leveraging diffusion-based effect models to capture multiple outcomes and probabilistic planners to reason under uncertainty. As task horizons grow, planning becomes a harder constraint-satisfaction problem, and hierarchical planning or mixed planning-and-execution strategies may be necessary for scalability.


BibTeX

@inproceedings{nie2026stack,
  title     = {Learning Composable Skills by Discovering Spatial and Temporal Structure with Foundation Models},
  author    = {Nie, Neil and Huang, Wenlong and Mao, Jiayuan and Fei-Fei, Li and Liu, Weiyu and Wu, Jiajun},
  booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
  year      = {2026}
}

Acknowledgments

This work is in part supported by Analog Devices, AFOSR YIP FA9550-23-1-0127, ONR N00014-23-1-2355, ONR YIP N00014-24-1-2117, ONR MURI N00014-24-1-2748, and NSF RI #2211258.