
Vision-language-action model

From Wikipedia, the free encyclopedia

A vision-language-action model (VLA) is a foundation model that enables the control of robot actions through visual observations and natural-language commands.[1]
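As a schematic illustration of this interface (not the API of any published model), the Python sketch below shows how such a policy might be invoked: a hypothetical ToyVLA class takes a camera image together with a natural-language instruction and returns a low-level robot action. The class name, the Action fields, and the stub behaviour are assumptions made purely for illustration.

    # Illustrative sketch only: a hypothetical VLA interface, not any specific model's API.
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class Action:
        """A low-level robot command: end-effector motion and gripper state."""
        delta_position: np.ndarray   # (x, y, z) translation
        delta_rotation: np.ndarray   # (roll, pitch, yaw)
        gripper_open: bool

    class ToyVLA:
        """Placeholder policy; a real VLA would run a vision-language backbone here."""
        def predict(self, image: np.ndarray, instruction: str) -> Action:
            # A trained model would condition on both the camera image and the
            # language command; this stub simply returns a fixed action.
            return Action(np.zeros(3), np.zeros(3), gripper_open=True)

    policy = ToyVLA()
    camera_frame = np.zeros((224, 224, 3), dtype=np.uint8)  # dummy RGB observation
    print(policy.predict(camera_frame, "pick up the red block"))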

One method for constructing a VLA is to fine-tune a vision-language model (VLM) on robot trajectory data together with large-scale vision-language data[2] or Internet-scale vision-language tasks.[3]
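One ingredient of this fine-tuning recipe, as described for RT-2 in reference [3], is representing each robot action as a short sequence of discrete tokens so that the VLM can emit actions the same way it emits text. The sketch below illustrates such an action discretisation in Python; the bin count, the normalised action range, and the function names are illustrative assumptions rather than the published implementation.

    # Sketch of action tokenisation for VLM fine-tuning on robot trajectories.
    # Bin count and action range here are illustrative assumptions.
    import numpy as np

    NUM_BINS = 256                        # each action dimension maps to one of 256 bins
    ACTION_LOW, ACTION_HIGH = -1.0, 1.0   # assumed normalised action range

    def action_to_tokens(action: np.ndarray) -> list[int]:
        """Map a continuous action vector to integer token ids the VLM can output as text."""
        clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
        scaled = (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)  # scale to [0, 1]
        return (scaled * (NUM_BINS - 1)).round().astype(int).tolist()

    def tokens_to_action(tokens: list[int]) -> np.ndarray:
        """Inverse mapping used at inference time to recover a robot command."""
        scaled = np.asarray(tokens, dtype=float) / (NUM_BINS - 1)
        return scaled * (ACTION_HIGH - ACTION_LOW) + ACTION_LOW

    # Each trajectory step becomes an (image, instruction, action-token string) training example.
    tokens = action_to_tokens(np.array([0.1, -0.5, 0.0]))
    print(tokens, tokens_to_action(tokens))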

Examples of VLAs include RT-2 from Google DeepMind.[4]

References

  1. ^ Jeong, Hyeongyo; Lee, Haechan; Kim, Changwon; Shin, Sungta (October 2024). "A Survey of Robot Intelligence with Large Language Models". Applied Sciences. 14 (19): 8868. doi:10.3390/app14198868.
  2. ^ Fan, L.; Chen, Z.; Xu, M.; Yuan, M.; Huang, P.; Huang, W. (2024). "Language Reasoning in Vision-Language-Action Model for Robotic Grasping". 2024 China Automation Congress (CAC). pp. 6656–6661. doi:10.1109/CAC63892.2024.10865585. ISBN 979-8-3503-6860-4.
  3. ^ Brohan, Anthony; et al. (July 28, 2023). "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control". arXiv:2307.15818 [cs.RO].
  4. ^ Dotson, Kyt (July 28, 2023). "Google unveils RT-2, an AI language model for telling robots what to do". Silicon Angle. Retrieved March 13, 2025.