Embodied AI Research

\(\mathcal{F}_1\): A Vision-Language-Action Model Bridging Understanding and Generation to Actions

Operating in dynamic environments requires anticipating future changes. For instance, filming a Formula 1 race demands predicting the car's next position to capture it effectively. We introduce $\mathcal{F}_1$, a novel paradigm that integrates visual foresight generation into the decision-making pipeline, enabling robots to plan and execute complex tasks in dynamic environments via predictive inverse dynamics.
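
In schematic form (the notation here is ours, chosen for illustration rather than taken from the paper), the generation side first synthesizes a visual foresight of the scene, and the action side then solves an inverse dynamics problem toward that foresight:

\begin{align*}
\hat{o}_{t+H} &= g_{\phi}\!\left(o_{\le t},\, \ell\right) && \text{visual foresight from past observations } o_{\le t} \text{ and instruction } \ell \\
a_{t:t+H} &= \pi_{\theta}\!\left(o_{\le t},\, \ell,\, \hat{o}_{t+H}\right) && \text{predictive inverse dynamics: the action chunk that realizes the foresight}
\end{align*}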

Research Highlights

Visual Foresight

Predictive inverse dynamics modeling for planning-based control

Mixture-of-Transformer

Three specialized experts for understanding, generation, and action

Progressive Training

Three-stage alignment, pretraining, and adaptation strategy

Framework Overview

VLA Paradigm Evolution

VLA Paradigm Comparison
Comparison of VLA paradigms. (a) Early end-to-end policies like ACT and DP lack semantic grounding. (b) VLM-integrated policies like $\pi_0$ and GR00T N1 enhance understanding but remain reactive. (c) Visual prediction-based policies like VPP and Genie Envisioner anticipate future states but lack semantic grounding. (d) Our $\mathcal{F}_1$ framework integrates understanding, generation, and execution for robust foresight-driven control.

$\mathcal{F}_1$ Architecture

$\mathcal{F}_1$ Architecture
$\mathcal{F}_1$ framework overview. The Mixture-of-Transformer architecture comprises three core components: an understanding expert, a generation expert, and an action expert. The understanding expert processes instructions and observations, the generation expert synthesizes a foresight image from this shared context, and the foresight is fed to the action expert for predictive inverse dynamics modeling.
  • 🧠 Understanding Expert: Processes natural language instructions and visual observations to establish shared multimodal representations, leveraging pretrained vision-language knowledge for robust semantic grounding.
  • 🔮 Generation Expert: Employs a next-scale prediction mechanism to synthesize goal-conditioned visual foresight, providing explicit planning targets that guide subsequent action execution.
  • 🤖 Action Expert: Implements predictive inverse dynamics modeling to map multimodal context into executable robot actions, incorporating foresight for goal-directed and temporally consistent behavior.
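
The sketch below shows one way this three-expert flow could be wired up in PyTorch-style code. It is a minimal illustration under our own assumptions: the class F1Sketch and its module names are placeholders, the real Mixture-of-Transformer shares attention across modality-specific parameter sets rather than calling experts sequentially, and the generation expert in the paper uses next-scale (coarse-to-fine) token prediction, which a single encoder pass stands in for here.

import torch
import torch.nn as nn

class F1Sketch(nn.Module):
    """Illustrative three-expert layout (placeholder code, not the released implementation)."""

    def __init__(self, d_model=1024, n_heads=8, depth=4, action_dim=7, horizon=16):
        super().__init__()
        def expert():
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=depth)
        self.understanding_expert = expert()   # semantic grounding of instruction + observation
        self.generation_expert = expert()      # visual foresight (paper: next-scale prediction)
        self.action_expert = expert()          # predictive inverse dynamics over context + foresight
        self.action_head = nn.Linear(d_model, horizon * action_dim)
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, vision_tokens, language_tokens):
        # 1) Understanding expert fuses instruction and observation tokens into a shared context.
        context = self.understanding_expert(torch.cat([language_tokens, vision_tokens], dim=1))
        # 2) Generation expert synthesizes foresight (goal-image) tokens from that context.
        foresight = self.generation_expert(context)
        # 3) Action expert maps context + foresight to an executable action chunk.
        fused = self.action_expert(torch.cat([context, foresight], dim=1))
        actions = self.action_head(fused.mean(dim=1))          # pool, then decode
        return actions.view(-1, self.horizon, self.action_dim)

# Example with made-up shapes: 2 samples, 196 vision tokens, 32 language tokens, d_model=1024.
# model = F1Sketch()
# chunk = model(torch.randn(2, 196, 1024), torch.randn(2, 32, 1024))   # -> (2, 16, 7)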

Three-Stage Training Recipe

1. Pretrain Stage I: aligns the generation expert with the understanding expert so that it acquires the ability to generate visual foresight images.

2. Pretrain Stage II: jointly trains all three experts on large-scale vision-language-action datasets to build unified understanding, generation, and action capabilities.

3. Post-train Stage: fine-tunes the model on specific robot platforms with domain-specific data and tasks so it can execute actions in downstream environments.
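
As a rough picture of how this recipe might be expressed as a training schedule, the snippet below lists which experts are updated in each stage. The module names mirror the sketch above, and the train/freeze splits and data mixtures are assumptions for illustration, not the released configuration.

# Hypothetical per-stage schedule mirroring the three-stage recipe (illustrative assumptions).
TRAINING_STAGES = [
    {
        "name": "pretrain_stage_i",
        "train": ["generation_expert"],                       # learn to produce visual foresight
        "freeze": ["understanding_expert", "action_expert"],  # assumption: keep the pretrained VLM fixed
        "data": "observation/instruction pairs with future frames as targets (assumed)",
    },
    {
        "name": "pretrain_stage_ii",
        "train": ["understanding_expert", "generation_expert", "action_expert"],
        "freeze": [],
        "data": "large-scale vision-language-action datasets",
    },
    {
        "name": "post_train",
        "train": ["understanding_expert", "generation_expert", "action_expert"],  # assumption: full fine-tune
        "freeze": [],
        "data": "domain-specific demonstrations from the target robot platform",
    },
]

def trainable_parameters(model, stage):
    """Yield only the parameters of the experts listed as trainable for the given stage."""
    for name in stage["train"]:
        yield from getattr(model, name).parameters()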

Real-World Robot Experiments

  • Flower (Genie-1): precise flower manipulation.
  • Handover (R2H) (Genie-1): safe object transfer between robot and human.
  • Tea (Shelf) (Genie-1): tea cup manipulation from a shelf to a table.
  • Long-horizon (ARX LIFT II): 10-step sequential manipulation task execution.
  • Dynamic Environment (ARX LIFT II): real-time tracking of moving objects on a conveyor belt.
  • Sweep (Franka): rapid adaptation for cleaning and organization tasks.

Performance Highlights

Real-World Tasks (Genie-1)

Task | $\mathcal{F}_1$ | $\pi_0$ | Improvement
Flower | 80.0% | 66.7% | +13.3%
Handover (R2H) | 73.3% | 40.0% | +33.3%
Tea (Shelf) | 86.7% | 73.3% | +13.4%
Average | 80.0% | 60.0% | +20.0%

Dynamic Environments

Task | Platform | $\mathcal{F}_1$ | $\pi_0$ | Improvement
Long-horizon | ARX LIFT II | 40.0% | 0.0% | +40.0%
Dynamic Env | ARX LIFT II | 66.7% | 33.3% | +33.4%
Adaptation | Franka | 66.7% | 53.3% | +13.4%
Average | All | 57.8% | 28.9% | +28.9%

Citation

@article{f1_vla_2025,
  title={F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions},
  author={Qi Lv and Weijie Kong and Hao Li and Jia Zeng and Zherui Qiu and Delin Qu and Haoming Song and Qizhi Chen and Xiang Deng and Michael Yu Wang and Liqiang Nie and Jiangmiao Pang},
  eprint={2509.06951},
  archivePrefix={arXiv},
  year={2025},
  url={https://arxiv.org/abs/2509.06951}
}