Vision-language-action model
A vision-language-action model (VLA) is a foundation model that allows a robot to be controlled through visual observations and natural-language commands.[1]
One method for constructing a VLA is to fine-tune a vision-language model (VLM) on robot trajectory data combined with large-scale visual-language data[2] or Internet-scale vision-language tasks.[3]
Examples of VLAs include RT-2 from Google DeepMind.[4]
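In models such as RT-2, robot actions are represented as sequences of discrete tokens so that the fine-tuned VLM can emit them as ordinary text output.[3] The following is a minimal sketch of that idea, assuming a uniform 256-bin discretization of a normalized 7-dimensional end-effector action; the bin count, action dimensions, and value ranges here are illustrative assumptions rather than the exact scheme used by any particular model.

```python
import numpy as np

NUM_BINS = 256                        # assumed size of the action-token vocabulary
ACTION_LOW, ACTION_HIGH = -1.0, 1.0   # assumed normalized range of each action dimension

def action_to_tokens(action):
    """Map a continuous action vector to discrete bin indices (illustrative)."""
    action = np.clip(np.asarray(action, dtype=np.float64), ACTION_LOW, ACTION_HIGH)
    # Scale each dimension to [0, NUM_BINS - 1] and round to the nearest bin.
    scaled = (action - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW) * (NUM_BINS - 1)
    return np.rint(scaled).astype(int)

def tokens_to_action(tokens):
    """Invert the discretization, recovering an approximate continuous action."""
    tokens = np.asarray(tokens, dtype=np.float64)
    return tokens / (NUM_BINS - 1) * (ACTION_HIGH - ACTION_LOW) + ACTION_LOW

# Example: a 7-D action (3 translation, 3 rotation, 1 gripper command).
action = [0.12, -0.40, 0.05, 0.0, 0.25, -0.10, 1.0]
tokens = action_to_tokens(action)     # discrete tokens the language model would output
recovered = tokens_to_action(tokens)  # approximately reconstructs the original action
print(tokens)
print(np.round(recovered, 3))
```

Under this kind of scheme, the robot controller simply decodes the model's generated tokens back into continuous commands, so no separate action head is required beyond the language model's existing output vocabulary.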
References
- ^ Jeong, Hyeongyo; Lee, Haechan; Kim, Changwon; Shin, Sungta (October 2024). "A Survey of Robot Intelligence with Large Language Models". Applied Sciences. 14 (19) – via EBSCOhost.
- ^ Fan, L.; Chen, Z.; Xu, M.; Yuan, M.; Huang, P.; Huang, W. (2024). "Language Reasoning in Vision-Language-Action Model for Robotic Grasping". 2024 China Automation Congress (CAC): 6656–6661. doi:10.1109/CAC63892.2024.10865585.
- ^ "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control". arXiv. July 28, 2023. Retrieved March 13, 2025.
- ^ Dotson, Kyt (July 28, 2023). "Google unveils RT-2, an AI language model for telling robots what to do". Silicon Angle. Retrieved March 13, 2025.