Operating in dynamic environments requires anticipating future changes. For instance, filming Formula 1 racing demands predicting the car's next position to capture it effectively. We introduce $\mathcal{F}_1$, a novel paradigm that integrates visual foresight generation into the decision-making pipeline, enabling robots to plan and execute complex tasks in dynamic environments via predictive inverse dynamics.
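At its core, predictive inverse dynamics is a two-step loop: first synthesize a visual foresight of how the scene should look as the task progresses, then infer the actions that carry the robot from the current observation to that foresight. The sketch below illustrates the loop with placeholder networks; the module names, embedding sizes, and 16-step action chunk are illustrative assumptions, not the released architecture.

```python
# Minimal sketch of the predictive inverse dynamics loop (PyTorch).
# Module names, embedding sizes, and the 16-step action chunk are
# illustrative placeholders, not the released F1 architecture.
import torch
import torch.nn as nn


class ForesightGenerator(nn.Module):
    """Predicts a future-observation embedding from the current observation and the instruction."""

    def __init__(self, obs_dim=512, lang_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + lang_dim, 1024), nn.GELU(), nn.Linear(1024, obs_dim)
        )

    def forward(self, obs_emb, lang_emb):
        return self.net(torch.cat([obs_emb, lang_emb], dim=-1))


class InverseDynamics(nn.Module):
    """Infers the action chunk that moves the robot from the current state toward the foresight."""

    def __init__(self, obs_dim=512, action_dim=7, horizon=16):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim, 1024), nn.GELU(), nn.Linear(1024, horizon * action_dim)
        )

    def forward(self, obs_emb, foresight_emb):
        flat = self.net(torch.cat([obs_emb, foresight_emb], dim=-1))
        return flat.view(-1, self.horizon, self.action_dim)


def plan_and_act(obs_emb, lang_emb, generator, inverse_dynamics):
    """Foresight-then-action: imagine the goal observation, then act toward it."""
    foresight = generator(obs_emb, lang_emb)        # predicted future observation
    actions = inverse_dynamics(obs_emb, foresight)  # actions that realize the prediction
    return actions


if __name__ == "__main__":
    gen, inv = ForesightGenerator(), InverseDynamics()
    obs = torch.randn(1, 512)   # placeholder visual embedding
    lang = torch.randn(1, 512)  # placeholder instruction embedding
    print(plan_and_act(obs, lang, gen, inv).shape)  # torch.Size([1, 16, 7])
```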
- Predictive inverse dynamics modeling for planning-based control
- Three specialized experts for understanding, generation, and action (sketched after this list)
- Three-stage alignment, pretraining, and adaptation strategy
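One plausible way to wire the three experts together is sketched below: an understanding expert encodes observation and instruction tokens, a generation expert predicts foresight tokens from that context, and an action expert decodes motor commands from both. The tiny transformer blocks, token counts, and the final pooling are assumptions made for this sketch rather than the paper's exact design.

```python
# Hypothetical layout of the three experts (names, depths, and token counts are
# assumptions for illustration, not the paper's exact transformer design).
import torch
import torch.nn as nn


class Expert(nn.Module):
    """A tiny transformer encoder standing in for one specialized expert."""

    def __init__(self, dim=256, depth=2, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, tokens):
        return self.blocks(tokens)


class ThreeExpertPolicy(nn.Module):
    def __init__(self, dim=256, action_dim=7, num_foresight_tokens=16):
        super().__init__()
        self.understanding = Expert(dim)  # encodes observation + instruction tokens
        self.generation = Expert(dim)     # predicts visual foresight tokens
        self.action = Expert(dim)         # fuses context and foresight into actions
        self.foresight_query = nn.Parameter(torch.randn(1, num_foresight_tokens, dim))
        self.action_head = nn.Linear(dim, action_dim)
        self.num_foresight_tokens = num_foresight_tokens

    def forward(self, obs_tokens, lang_tokens):
        context = self.understanding(torch.cat([obs_tokens, lang_tokens], dim=1))
        queries = self.foresight_query.expand(context.size(0), -1, -1)
        foresight = self.generation(torch.cat([context, queries], dim=1))
        foresight = foresight[:, -self.num_foresight_tokens:]       # keep predicted tokens
        fused = self.action(torch.cat([context, foresight], dim=1))
        return self.action_head(fused.mean(dim=1))                  # single action, for brevity


if __name__ == "__main__":
    policy = ThreeExpertPolicy()
    obs = torch.randn(2, 64, 256)   # placeholder image tokens
    lang = torch.randn(2, 8, 256)   # placeholder text tokens
    print(policy(obs, lang).shape)  # torch.Size([2, 7])
```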
1. Alignment: align the generation expert with the understanding expert so that it acquires the capability of generating visual foresight images.
2. Pretraining: jointly train all three experts on large-scale vision-language-action datasets to acquire understanding, generation, and action capabilities.
3. Adaptation: fine-tune on specific robot platforms with domain-specific data and tasks to execute actions in downstream environments (a training-schedule sketch follows this list).
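Read as a training recipe, the three stages amount to changing which experts are updated and on what data. The snippet below expresses that as a freeze/unfreeze schedule; the specific parameter choices (for example, updating every expert in the last stage) and the optimizer settings are assumptions for illustration, while the stage goals come from the list above.

```python
# Illustrative three-stage schedule (alignment -> pretraining -> adaptation).
# Only the stage goals come from the list above; the freeze/unfreeze choices
# and optimizer settings are assumptions made for this sketch.
import torch
import torch.nn as nn


class TinyPolicy(nn.Module):
    """Stand-in with three expert sub-modules, mirroring the layout sketched earlier."""

    def __init__(self, dim=256, action_dim=7):
        super().__init__()
        self.understanding = nn.Linear(dim, dim)
        self.generation = nn.Linear(dim, dim)
        self.action = nn.Linear(dim, action_dim)


def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag


def configure_stage(policy, stage):
    """Return an optimizer over the parameters a given stage is meant to update."""
    if stage == "alignment":
        # Align the generation expert with the (frozen) understanding expert.
        set_trainable(policy.understanding, False)
        set_trainable(policy.generation, True)
        set_trainable(policy.action, False)
    elif stage == "pretraining":
        # Jointly train all three experts on vision-language-action data.
        for expert in (policy.understanding, policy.generation, policy.action):
            set_trainable(expert, True)
    elif stage == "adaptation":
        # Fine-tune on platform-specific data; updating every expert here is
        # one plausible choice, not a documented one.
        for expert in (policy.understanding, policy.generation, policy.action):
            set_trainable(expert, True)
    else:
        raise ValueError(f"unknown stage: {stage}")
    trainable = [p for p in policy.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-4)


if __name__ == "__main__":
    policy = TinyPolicy()
    for stage in ("alignment", "pretraining", "adaptation"):
        opt = configure_stage(policy, stage)
        n = sum(p.numel() for g in opt.param_groups for p in g["params"])
        print(stage, "trainable parameters:", n)
```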
- Precise flower manipulation on the Genie-1 platform
- Safe object transfer between robot and human
- Tea cup manipulation from shelf to table
- 10-step sequential manipulation task execution
- Real-time tracking of moving objects on a conveyor belt
- Rapid adaptation for cleaning and organization tasks
Task success rates on the Genie-1 platform, compared with $\pi_0$:

| Task | $\mathcal{F}_1$ | $\pi_0$ | Improvement |
|---|---|---|---|
| Flower | 80.0% | 66.7% | +13.3% |
| Handover (R2H) | 73.3% | 40.0% | +33.3% |
| Tea (Shelf) | 86.7% | 73.3% | +13.4% |
| Average | 80.0% | 60.0% | +20.0% |
Task success rates on additional platforms and settings, compared with $\pi_0$:

| Task | Platform | $\mathcal{F}_1$ | $\pi_0$ | Improvement |
|---|---|---|---|---|
| Long-horizon | ARX LIFT II | 40.0% | 0.0% | +40.0% |
| Dynamic Env | ARX LIFT II | 66.7% | 33.3% | +33.4% |
| Adaptation | Franka | 66.7% | 53.3% | +13.4% |
| Average | All | 57.8% | 28.9% | +28.9% |
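The derived columns in both tables follow the same simple arithmetic; the short check below reproduces the Improvement column and Average row of the first table, assuming Improvement is an absolute (percentage-point) difference and Average an unweighted mean, which the reported numbers are consistent with.

```python
# Reproduces the derived columns of the first table, assuming "Improvement" is an
# absolute (percentage-point) difference and "Average" an unweighted mean.
genie1 = {
    "Flower": (80.0, 66.7),
    "Handover (R2H)": (73.3, 40.0),
    "Tea (Shelf)": (86.7, 73.3),
}

for task, (f1, pi0) in genie1.items():
    print(f"{task}: improvement = {f1 - pi0:+.1f} points")

f1_avg = sum(f1 for f1, _ in genie1.values()) / len(genie1)
pi0_avg = sum(pi0 for _, pi0 in genie1.values()) / len(genie1)
print(f"Average: {f1_avg:.1f}% vs {pi0_avg:.1f}% -> {f1_avg - pi0_avg:+.1f} points")
```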
@article{f1_vla_2025,
  title={F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions},
  author={Qi Lv and Weijie Kong and Hao Li and Jia Zeng and Zherui Qiu and Delin Qu and Haoming Song and Qizhi Chen and Xiang Deng and Michael Yu Wang and Liqiang Nie and Jiangmiao Pang},
  eprint={2509.06951},
  archivePrefix={arXiv},
  year={2025},
  url={https://arxiv.org/abs/2509.06951}
}