Embodied AI Research

\(\mathcal{F}_1\): A Vision-Language-Action Model Bridging Understanding and Generation to Actions

Operating in dynamic environments requires anticipating future changes. For instance, filming a Formula 1 race demands predicting the car's next position to capture it effectively. We introduce $\mathcal{F}_1$, a novel paradigm that integrates visual foresight generation into the decision-making pipeline, enabling robots to plan and execute complex tasks in dynamic environments via predictive inverse dynamics.
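
In schematic form (the notation here is ours, chosen for illustration rather than taken from the paper), the generation side first synthesizes a visual foresight of the scene, and the action side then solves an inverse dynamics problem toward that foresight:

\begin{align*}
\hat{o}_{t+H} &= g_{\phi}\!\left(o_{\le t},\, \ell\right) && \text{visual foresight from past observations } o_{\le t} \text{ and instruction } \ell \\
a_{t:t+H} &= \pi_{\theta}\!\left(o_{\le t},\, \ell,\, \hat{o}_{t+H}\right) && \text{predictive inverse dynamics: the action chunk that realizes the foresight}
\end{align*}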

Research Highlights

Visual Foresight

Predictive inverse dynamics modeling for planning-based control

Mixture-of-Transformer

Three specialized experts for understanding, generation, and action

Progressive Training

Three-stage alignment, pretraining, and adaptation strategy

Framework Overview

VLA Paradigm Evolution

VLA Paradigm Comparison
Comparison of VLA paradigms. (a) Early end-to-end policies like ACT and DP lack semantic grounding. (b) VLM-integrated policies like $\pi_0$ and GR00T N1 enhance understanding but remain reactive. (c) Visual prediction-based policies like VPP and Genie Envisioner anticipate future states but lack semantic grounding. (d) Our $\mathcal{F}_1$ framework integrates understanding, generation, and execution for robust foresight-driven control.

$\mathcal{F}_1$ Architecture

$\mathcal{F}_1$ Architecture
$\mathcal{F}_1$ framework overview. The Mixture-of-Transformer architecture comprises three core components: an understanding expert, a generation expert, and an action expert. The understanding expert processes instructions and observations, the generation expert synthesizes a foresight image from this shared context, and the foresight is fed to the action expert for predictive inverse dynamics modeling.
  • 🧠 Understanding Expert: Processes natural language instructions and visual observations to establish shared multimodal representations, leveraging pretrained vision-language knowledge for robust semantic grounding.
  • 🔮 Generation Expert: Employs a next-scale prediction mechanism to synthesize goal-conditioned visual foresight, providing explicit planning targets that guide subsequent action execution.
  • 🤖 Action Expert: Implements predictive inverse dynamics modeling to map multimodal context into executable robot actions, incorporating foresight for goal-directed and temporally consistent behavior.
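
The sketch below shows one way this three-expert flow could be wired up in PyTorch-style code. It is a minimal illustration under our own assumptions: the class F1Sketch and its module names are placeholders, the real Mixture-of-Transformer shares attention across modality-specific parameter sets rather than calling experts sequentially, and the generation expert in the paper uses next-scale (coarse-to-fine) token prediction, which a single encoder pass stands in for here.

import torch
import torch.nn as nn

class F1Sketch(nn.Module):
    """Illustrative three-expert layout (placeholder code, not the released implementation)."""

    def __init__(self, d_model=1024, n_heads=8, depth=4, action_dim=7, horizon=16):
        super().__init__()
        def expert():
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=depth)
        self.understanding_expert = expert()   # semantic grounding of instruction + observation
        self.generation_expert = expert()      # visual foresight (paper: next-scale prediction)
        self.action_expert = expert()          # predictive inverse dynamics over context + foresight
        self.action_head = nn.Linear(d_model, horizon * action_dim)
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, vision_tokens, language_tokens):
        # 1) Understanding expert fuses instruction and observation tokens into a shared context.
        context = self.understanding_expert(torch.cat([language_tokens, vision_tokens], dim=1))
        # 2) Generation expert synthesizes foresight (goal-image) tokens from that context.
        foresight = self.generation_expert(context)
        # 3) Action expert maps context + foresight to an executable action chunk.
        fused = self.action_expert(torch.cat([context, foresight], dim=1))
        actions = self.action_head(fused.mean(dim=1))          # pool, then decode
        return actions.view(-1, self.horizon, self.action_dim)

# Example with made-up shapes: 2 samples, 196 vision tokens, 32 language tokens, d_model=1024.
# model = F1Sketch()
# chunk = model(torch.randn(2, 196, 1024), torch.randn(2, 32, 1024))   # -> (2, 16, 7)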

Three-Stage Training Recipe

1. Pretrain Stage I: aligns the generation expert with the understanding expert so that it acquires the ability to generate visual foresight images.

2. Pretrain Stage II: jointly trains all three experts on large-scale vision-language-action datasets to build unified understanding, generation, and action capabilities.

3. Post-train Stage: fine-tunes the model on specific robot platforms with domain-specific data and tasks so it can execute actions in downstream environments.
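
As a rough picture of how this recipe might be expressed as a training schedule, the snippet below lists which experts are updated in each stage. The module names mirror the sketch above, and the train/freeze splits and data mixtures are assumptions for illustration, not the released configuration.

# Hypothetical per-stage schedule mirroring the three-stage recipe (illustrative assumptions).
TRAINING_STAGES = [
    {
        "name": "pretrain_stage_i",
        "train": ["generation_expert"],                       # learn to produce visual foresight
        "freeze": ["understanding_expert", "action_expert"],  # assumption: keep the pretrained VLM fixed
        "data": "observation/instruction pairs with future frames as targets (assumed)",
    },
    {
        "name": "pretrain_stage_ii",
        "train": ["understanding_expert", "generation_expert", "action_expert"],
        "freeze": [],
        "data": "large-scale vision-language-action datasets",
    },
    {
        "name": "post_train",
        "train": ["understanding_expert", "generation_expert", "action_expert"],  # assumption: full fine-tune
        "freeze": [],
        "data": "domain-specific demonstrations from the target robot platform",
    },
]

def trainable_parameters(model, stage):
    """Yield only the parameters of the experts listed as trainable for the given stage."""
    for name in stage["train"]:
        yield from getattr(model, name).parameters()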

Real-World Robot Experiments

  • Flower (Genie-1): precise flower manipulation.
  • Handover (R2H) (Genie-1): safe object transfer between robot and human.
  • Tea (Shelf) (Genie-1): tea cup manipulation from a shelf to a table.
  • Long-horizon (ARX LIFT II): 10-step sequential manipulation task execution.
  • Dynamic Environment (ARX LIFT II): real-time tracking of moving objects on a conveyor belt.
  • Sweep (Franka): rapid adaptation for cleaning and organization tasks.

Performance Highlights

Real-World Tasks (Genie-1)

Task | $\mathcal{F}_1$ | $\pi_0$ | Improvement
Flower | 80.0% | 66.7% | +13.3%
Handover (R2H) | 73.3% | 40.0% | +33.3%
Tea (Shelf) | 86.7% | 73.3% | +13.4%
Average | 80.0% | 60.0% | +20.0%

Dynamic Environments

Task | Platform | $\mathcal{F}_1$ | $\pi_0$ | Improvement
Long-horizon | ARX LIFT II | 40.0% | 0.0% | +40.0%
Dynamic Env | ARX LIFT II | 66.7% | 33.3% | +33.4%
Adaptation | Franka | 66.7% | 53.3% | +13.4%
Average | All | 57.8% | 28.9% | +28.9%

Citation

@article{f1_vla_2025,
  title={F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions},
  author={Qi Lv and Weijie Kong and Hao Li and Jia Zeng and Zherui Qiu and Delin Qu and Haoming Song and Qizhi Chen and Xiang Deng and Michael Yu Wang and Liqiang Nie and Jiangmiao Pang},
  eprint={2509.06951},
  archivePrefix={arXiv},
  year={2025},
  url={https://arxiv.org/abs/2509.06951}
}