Vision-language model

A vision–language model (VLM) is a type of artificial intelligence system that can jointly interpret and generate information from both images and text, extending the capabilities of large language models (LLMs), which are limited to text. It is an example of multimodal learning.

Many widely used commercial applications now rely on this ability. OpenAI introduced vision capabilities to its GPT-4V variant of the GPT-4 model, enabling users to incorporate uploaded photographs or diagrams into their discussions with ChatGPT. It has since become an integral part of ChatGPT's standard offering. Similar capabilities were added to Google’s Gemini, Anthropic’s Claude 3 Opus,[1] and Microsoft’s Copilot with Vision[2]. Alongside these models, several open-source vision–language models — such as LLaVA,[3] InstructBLIP,[4] and MiniGPT-4[5] — have been released by the research community, offering smaller-scale alternatives for experimentation and academic study.

History

Vision–language models evolved from image captioning systems, which were designed to take images alone (without accompanying instructions) and produce descriptions.

Most image captioning systems used an encoder-decoder architecture, where an encoder summarized images into feature vectors, which were fed to a decoder to generate the associated description. Early methods (early 2010s) used handcrafted visual features to encode images, and n-gram or rule-based text templates to generate descriptions.[6][7]

With the rise of deep learning, neural networks became dominant in image captioning. In 2015, methods emerged that used variations of convolutional neural networks (CNN) to encode images, and recurrent neural networks (RNN) to generate the captions.[8][9] By 2018, transformer networks replaced RNNs in the role of language decoders.[10] Importantly, training of network parameters was based on datasets of image-text pairs, like MS COCO[11]. The scope of applications was also broadened, to include visual question answering (VQA),[12] phrase grounding[13] and others.

In 2021, OpenAI's release of CLIP (Contrastive Language–Image Pretraining) was a major step towards the later evolution of VLMs. Rather than focusing on a specific task like image captioning, CLIP is a general-purpose foundation model which can be extended to a broad range of downstream tasks. Importantly, CLIP's components were trained on a vast dataset of 400 million image-text pairs, producing powerful image and text encoders. CLIP's general-purpose structure also places this capability at the disposal of systems with far smaller computational budgets.

Starting in 2022, a plethora of VLM architectures have been proposed, based on similar design philosophies (elaborated below). These included Google DeepMind's proprietary Flamingo[14] and an open-source variant,[15] LLaVA,[3] SalesForce's InstructBLIP,[4] Microsoft's Kosmos,[16] KAUST's MiniGPT-4[5] and others. All of these merged a separately trained CLIP-like image encoder with an off-the-shelf large language model (LLM) for text processing, stitched together using specialized components. The resulting joint system was trained on curated datasets.

The release of GPT-4V in 2023 marked the emergence of highly impactful commercial applications. This was quickly followed by other systems mentioned above (including Google’s Gemini, Anthropic’s Claude 3 Opus,[17] and Microsoft’s Copilot with Vision[18]). These applications are substantially more powerful for general-purpose tasks, typically containing far more parameters, trained on massive datasets, and requiring enormous compute power. Their architectures have not been disclosed.

Architecture

The input to VLMs consists of vision elements (images and videos) and text. The output is typically corresponding text. Generative models, which also generate vision elements (e.g., DALL-E), are beyond the scope of this article.

Below is a description of a few representative models for which the architecture is known. Commercial VLMs like GPT-4V, whose designs were not publicly disclosed, are likely based on similar concepts.

LLaVA 1.0

LLaVA (Large Language and Vision Assistant)[3] 1.0 is a simple model which captures some of the main concepts of open-source VLMs. The input to the model is an image and an accompanying textual language instruction.

Architecture of LLaVA 1.0

Language model backbone

Conceptually, the design is built around an off-the-shelf foundation LLM (a fine-tuned variant of Llama[19] called Vicuna), with components patched on to support the image inputs.

LLaVA borrows the tokenizer and the transformer modules (including their weights) from Vicuna, and uses them to handle the accompanying text. Recall that in a legacy (non-VLM) application of Vicuna, the tokenizer converts text into a stream of tokens, which are fed into the transformer module, which in turn produces a stream of response tokens. These are then converted back to text using the tokenizer.
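
This legacy text-only flow can be illustrated with a short, hedged sketch using the Hugging Face transformers library; the checkpoint name is one publicly available Vicuna variant and stands in for whichever weights a given LLaVA build actually uses.

```python
# Hedged sketch of the legacy (text-only) flow: text -> tokens -> response tokens -> text.
# The model identifier is an assumption for illustration; weights download on first use.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")

prompt = "What is a vision-language model?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids   # text -> token ids
output_ids = model.generate(input_ids, max_new_tokens=64)      # token ids -> response token ids
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))  # token ids -> text
```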

Vision encoding

To this, LLaVA adds two components, to support image inputs:

  • Vision Encoder: This is constructed from an off-the-shelf, separately trained CLIP model (specifically, variant ViT-L/14) from OpenAI. The vision encoder converts the image into an array of embedding vectors (more on this below), which encode useful information about the image. This information cannot be used directly by the LLM, because the LLM is designed to receive tokens, which have different dimensions. Furthermore, being an off-the-shelf LLM, Vicuna was not trained to recognize and respond to such information.
  • Projection (known elsewhere as a bridge): This module links the vision encoder with the LLM. Namely, it is a simple matrix of trainable parameters, which converts the vision encoder outputs to the dimensions expected by the LLM, and can also be trained (see below) to produce representations useful to the LLM. Its outputs are called image tokens.

The image tokens are prepended to the text tokens and processed by the LLM exactly as ordinary text tokens, yielding the final response.
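
As an illustration of this data flow (a minimal sketch, not the released LLaVA code), the following uses assumed embedding widths for CLIP ViT-L/14 and Vicuna, projects the image features with a single linear layer, and prepends the resulting image tokens to the text-token embeddings.

```python
# Minimal sketch of the LLaVA 1.0 data flow with illustrative dimensions.
import torch
import torch.nn as nn

d_vision, d_llm = 1024, 4096            # assumed CLIP ViT-L/14 and Vicuna widths
num_patches, num_text_tokens = 256, 32

projection = nn.Linear(d_vision, d_llm)  # the trainable "bridge"

image_features = torch.randn(1, num_patches, d_vision)    # stand-in for vision encoder output
text_embeddings = torch.randn(1, num_text_tokens, d_llm)  # stand-in for LLM text embeddings

image_tokens = projection(image_features)                      # (1, 256, 4096)
llm_input = torch.cat([image_tokens, text_embeddings], dim=1)  # image tokens come first
print(llm_input.shape)  # torch.Size([1, 288, 4096])
```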

A simple modification of the CLIP ViT-L/14 vision encoder was used to obtain more effective encoded vectors. As that module is a vision transformer, a straightforward application would have used the class token at the output of its last transformer layer (see vision transformer class) as a single vector output. LLaVA 1.0, however, uses the grid (non-class) tokens at the output of the previous (penultimate) layer, to produce multiple vector outputs. The grid tokens correspond to spatial patches in the image input and thus capture finer-granularity information.
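
A hedged sketch of this selection, using the Hugging Face CLIPVisionModel wrapper (the checkpoint name and input shape are illustrative assumptions):

```python
# Select the penultimate-layer grid (patch) tokens from a CLIP vision tower.
import torch
from transformers import CLIPVisionModel

vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
pixel_values = torch.randn(1, 3, 224, 224)   # placeholder for a preprocessed image

with torch.no_grad():
    outputs = vision_tower(pixel_values, output_hidden_states=True)

penultimate = outputs.hidden_states[-2]   # output of the second-to-last transformer layer
grid_tokens = penultimate[:, 1:, :]       # drop the class token, keep the patch tokens
print(grid_tokens.shape)                  # (1, 256, 1024) for a 16x16 patch grid
```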

Training

Training was required to align the modules so that they could be combined into a single model. In VLM terminology, this step is referred to as instruction tuning. LLaVA 1.0 achieved this in two stages. Stage 1 focused on preliminary alignment of the projection layer. Only the weights of that module were trained, with those of the other modules being frozen. The dataset was a subset of the CC3M[20] dataset of image-captioning pairs. This dataset was small (595,000 pairs), and limited in its scope, containing only simple image-caption pairs. Stage 2 focused on a more elaborate training of both the projection layer and the LLM. The vision encoder remained frozen. A rich training dataset (LLaVA-Instruct-158K[21]) of image-text pairs was produced for this stage, by harnessing a text-only LLM (GPT-4) to convert the simple captions of image-caption pairs (from the COCO dataset[11]) into elaborate conversation-style prompts.
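
The two-stage freezing schedule can be sketched as follows (toy stand-in modules replace the real vision encoder, projection, and LLM; this illustrates the schedule, not the official training script):

```python
# Illustrative freezing schedule for LLaVA 1.0's two training stages.
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

# hypothetical stand-ins for the three modules
vision_encoder = nn.Linear(1024, 1024)
projection = nn.Linear(1024, 4096)
language_model = nn.Linear(4096, 4096)

# Stage 1: preliminary alignment -- only the projection is trained
set_trainable(vision_encoder, False)
set_trainable(language_model, False)
set_trainable(projection, True)

# Stage 2: instruction tuning -- projection and LLM train, vision encoder stays frozen
set_trainable(language_model, True)
```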

Subsequent versions of LLaVA introduced several improvements over LLaVA 1.0. Notable conceptual improvements include the replacement in LLaVA 1.5[22] of the simple projection module with a more elaborate MLP. LLaVA-NeXT[23] added support for multiple image aspect ratios, beyond LLaVA 1.0's fixed 224x224 input.

Flamingo

Predating LLaVA 1.0 by a year, Flamingo[14] (DeepMind, 2022) involves a more elaborate design than LLaVA. Among its benefits are support for multiple images in a single conversation and support for video.

Architecture of the Flamingo VLM

Architecturally, the design involves a more tightly-coupled integration between the language and vision modules, and a perceiver-resampler module (described below).

LLM and Vision Backbones

Like LLaVA, Flamingo begins with an independently designed LLM and vision encoder, for text analysis and image embedding, respectively. Both are pre-trained for their narrow purposes, independently of their final utility as components of Flamingo. Furthermore, as components, their weights remain frozen in the course of joint training (see below).

Flamingo uses DeepMind's off-the-shelf Chinchilla as its LLM backbone. For the vision encoder, they opted for a non-transformer design (the ResNet-based NFNet-F6[24]). They trained this using a CLIP-style contrastive loss on image-caption pairs from the ALIGN dataset[25] and a specially curated dataset called LTIP (Long Text & Image Pairs).

The vision encoder takes single images as inputs (more on videos below) and produces a two-dimensional grid of feature vectors.

Perceiver-Resampler

The perceiver-resampler[26] component plays a key role in supporting video and a variable number of images at the Flamingo input.

Multiple consecutive images (one or more) are first fed one-by-one into the vision encoder, producing a three-dimensional grid of feature vectors. Videos are converted into a sequence of images by sampling at a rate of 1 frame per second. The resulting grid is flattened into a long, variable-size array of feature vectors.

The perceiver-resampler converts this into a short, fixed-length array of tokens. It uses a design based on cross-attention between a fixed number of learned query vectors (whose values are determined by training) and (key, value) pairs derived from the array of feature vectors.
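
A minimal sketch of this idea, with illustrative dimensions rather than Flamingo's actual configuration: a fixed set of learned query vectors cross-attends to a variable-length array of visual features and returns a fixed-length token array.

```python
# Learned-query cross-attention: variable-length input, fixed-length output.
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    def __init__(self, dim=1024, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))  # learned query vectors
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_features):          # (batch, variable_length, dim)
        batch = visual_features.shape[0]
        queries = self.latents.unsqueeze(0).expand(batch, -1, -1)
        out, _ = self.attn(queries, visual_features, visual_features)
        return out                               # (batch, num_latents, dim)

resampler = PerceiverResampler()
print(resampler(torch.randn(2, 500, 1024)).shape)   # torch.Size([2, 64, 1024])
```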

Note that in this context, the consecutive images are assumed to be contiguous, without intervening text. The general case will be discussed later, below.

Gated cross-attention/dense blocks

These are multiple blocks (see the figure above) that play a role parallel to LLaVA's projection module, serving as an interface between the vision and text processing modules. Their design, however, is more entangled with the language model.

Gated cross-attention and dense block

Specifically, between select transformer blocks of the language model, Flamingo inserts these cross-attention-and-dense blocks. These blocks resemble the decoder blocks of encoder-decoder transformer architectures. That is, their queries are obtained from the preceding legacy self-attention transformer block of the backbone LLM. Their keys and values are derived from the vision feature vectors. Their outputs are forwarded to the consecutive backbone LLM block. They also include skip connections.

One important modification to the added blocks, relative to the blocks of encoder-decoder transformers, is the inclusion of tanh gating. These small modules multiply their inputs by the tanh of a trainable scalar weight, yielding a factor in the interval (-1, 1), specific to each such block. These factors modulate the impact of the cross-attention/dense block on the text generation process. They are initialized at zero at the beginning of training, when the weights of the other modules of the block are still untrained and random. As training progresses, their magnitudes gradually increase. These gates have a crucial role in ensuring training stability.
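
The following sketch illustrates the gating mechanism (an illustration of the idea, not DeepMind's implementation; dimensions are arbitrary):

```python
# Gated cross-attention/dense block: contributions are scaled by tanh of a
# trainable scalar initialised to zero, so the block has no effect at the start.
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.attn_gate = nn.Parameter(torch.zeros(1))   # tanh(0) = 0 at initialization
        self.ffn_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden, vision_tokens):
        # queries come from the LLM's text stream; keys/values from vision features
        attn_out, _ = self.cross_attn(text_hidden, vision_tokens, vision_tokens)
        text_hidden = text_hidden + torch.tanh(self.attn_gate) * attn_out       # gated skip connection
        text_hidden = text_hidden + torch.tanh(self.ffn_gate) * self.ffn(text_hidden)
        return text_hidden
```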

Chunking

To support interleaved images and text sequences, Flamingo introduced a simple adaptation which arguably increases performance for in-context (few-shot) learning. Specifically, they break the input stream into chunks, each of which contains a single vision input (image, contiguous sequence of images, or video). When applying the above-mentioned cross-attention between text and visual features, text tokens are only allowed to attend to the vision input within their chunk. Other vision inputs are masked out.

Note that text tokens are still indirectly influenced by all vision inputs, via the intra-text self-attention.
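
The masking rule can be illustrated as follows, with made-up chunk assignments: each text token may attend only to the vision tokens belonging to its own chunk.

```python
# Illustrative chunk-based cross-attention mask (chunk ids are made up).
import torch

text_chunk_ids = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2])   # chunk id of each text token
vision_chunk_ids = torch.tensor([0, 0, 1, 1, 2, 2])       # chunk id of each vision token

# mask[i, j] is True where text token i may attend to vision token j
mask = text_chunk_ids.unsqueeze(1) == vision_chunk_ids.unsqueeze(0)
print(mask.int())
```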

Training

During training, the language and vision backbones are frozen (as noted above). Training used three datasets: the LTIP dataset mentioned above, a curated dataset of video-text pairs (called VTP), and a massive dataset of interleaved text-image sequences derived from HTML documents (MultiModal MassiveWeb, M3W).

Qwen2-VL

Qwen2-VL[27] by Alibaba (2024) has a conceptually simple architecture that nonetheless provides functionality and flexibility exceeding Flamingo's.

Like LLaVA 1.0 and Flamingo, it begins with a backbone language model (Qwen2) and a vision encoder (a DFN[28] vision transformer).

Like LLaVA and unlike Flamingo, Qwen2-VL uses unified processing of vision and text data, feeding all tokens into the input of the language model, rather than injecting them into internal blocks. Tokens are then treated equally by the language model, using self-attention rather than cross-attention. To support interleaved text and vision data, and delimit streams of tokens from the latter, special tokens (vision_start and vision_end) are used.
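
A schematic example of such a unified, delimited input stream (the token strings shown are placeholders following the description above, not the exact special tokens of the released model):

```python
# Schematic interleaving of text tokens and delimited vision tokens.
text_tokens = ["Describe", "this", "image", ":"]
vision_tokens = ["<img_tok_1>", "<img_tok_2>", "<img_tok_3>"]   # from the vision encoder + MLP

stream = text_tokens + ["<|vision_start|>"] + vision_tokens + ["<|vision_end|>"]
print(stream)
```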

Naive dynamic resolution

A key difference from LLaVA 1.0 and Flamingo is that the vision encoder supports arbitrary image resolutions, without first reshaping the image to a fixed shape. The number of encoded tokens is variable and depends on the image shape.

Videos are sampled at 2 frames per second to produce a stream of images, which are each encoded separately. This too contributes to the variable length of the token array.

An MLP aligns the embedding dimension of the ViT with that of the language model. It also has a role in reducing the dimensionality of the image encoding by merging the vector embeddings of 2x2 adjacent patches (see ViT). Video encoding additionally benefits from a 3D convolution that operates on the temporal dimension as well.
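
A sketch of the 2x2 merge, with assumed embedding widths (not taken from the Qwen2-VL release): embeddings of four adjacent patches are concatenated and passed through an MLP that also matches the language model's width, reducing the token count by a factor of four.

```python
# Merge 2x2 neighbourhoods of patch embeddings, then project to the LLM width.
import torch
import torch.nn as nn

d_vit, d_llm = 1280, 3584        # assumed ViT and language-model widths
H, W = 32, 32                    # patch grid of one image (even dimensions)

merger = nn.Sequential(
    nn.Linear(4 * d_vit, 4 * d_vit), nn.GELU(), nn.Linear(4 * d_vit, d_llm)
)

patches = torch.randn(1, H, W, d_vit)
# group each 2x2 neighbourhood into one vector of size 4 * d_vit
grouped = patches.reshape(1, H // 2, 2, W // 2, 2, d_vit)
grouped = grouped.permute(0, 1, 3, 2, 4, 5).reshape(1, (H // 2) * (W // 2), 4 * d_vit)
vision_tokens = merger(grouped)
print(vision_tokens.shape)   # torch.Size([1, 256, 3584])
```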

Multimodal rotary positional encoding (M-RoPE)

Standard positional encoding is poorly suited for vision data, especially when the data is encoded into variable-length token embeddings. Specifically, its 1-dimensional representation loses the spatial layout of images and temporal continuity of video.

Qwen2-VL uses a multidimensional variant for vision data, which it calls multimodal rotary positional encoding (M-RoPE). With this implementation, each token is assigned a triplet of indices (i,x,y), defined as follows. For images, x and y represent the spatial coordinates of the token in the image, while i is constant for all tokens of the image and equals the sequence number of the image within the unified input stream to the model. The M-RoPE positional encoding is then constructed by interleaving separate 1-D RoPE encodings for the three indices.[29]

Video positional encoding uses similar triplets, except that i is not constant and progresses with each image in the video stream. It thus encodes temporal location.
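
The index assignment can be sketched as follows (grid sizes are arbitrary examples; the actual RoPE rotations applied to these indices are omitted):

```python
# Assign (i, x, y) M-RoPE index triplets to the tokens of an image or video.
import torch

def mrope_indices(num_frames: int, h: int, w: int, start_i: int = 0) -> torch.Tensor:
    """Return a (num_frames*h*w, 3) tensor of (i, x, y) position triplets."""
    i = torch.arange(start_i, start_i + num_frames).view(-1, 1, 1).expand(-1, h, w)
    x = torch.arange(h).view(1, -1, 1).expand(num_frames, -1, w)
    y = torch.arange(w).view(1, 1, -1).expand(num_frames, h, -1)
    return torch.stack([i, x, y], dim=-1).reshape(-1, 3)

image_ids = mrope_indices(num_frames=1, h=4, w=6)   # i is constant for a still image
video_ids = mrope_indices(num_frames=8, h=4, w=6)   # i advances with each sampled frame
print(image_ids[:3])
```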

Training

Qwen2-VL is trained in a three-stage process that progressively integrates visual and linguistic understanding. In Stage 1, the vision encoder is trained, keeping the other modules frozen. In Stage 2, the entire architecture is unfrozen, and in Stage 3, the vision encoder is frozen while the language model is fine-tuned. The training dataset includes a diverse range of modalities, ultimately amounting to 1.4 trillion tokens (including encoded vision tokens). The training loss is computed over text tokens at the output of the language model.

Visual grounding

A key functionality that is enabled by multimodal positional encoding is visual grounding; namely, the ability to reason about specific objects within an image. M-RoPE's preservation of the spatial location of image tokens is essential for this.

To support grounding, much of the training data includes information on objects in images, including captions and bounding box coordinates. Training dataset preparation involves formatting this data into a standard structure, which includes special tokens (object_ref_start, object_ref_end, box_start, box_end).
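
A hypothetical example of such formatting, using the special token names mentioned above (the exact serialization and token spellings used in Qwen2-VL's training data may differ):

```python
# Hypothetical serialization of a grounding annotation with special tokens.
caption = "a red bicycle"
box = (120, 45, 380, 300)   # (x1, y1, x2, y2) in image pixel coordinates

sample = (
    "<|object_ref_start|>" + caption + "<|object_ref_end|>"
    "<|box_start|>(" + ",".join(str(v) for v in box) + ")<|box_end|>"
)
print(sample)
```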

References

  1. ^ "Vision". Claude Docs. Retrieved 2025-10-15.
  2. ^ "Using Copilot Vision with Microsoft Copilot - Microsoft Support". support.microsoft.com. Retrieved 2025-10-15.
  3. ^ a b c Liu, Haotian; Li, Chunyuan; Wu, Qingyang; Lee, Yong Jae (2023-12-11), Visual Instruction Tuning, arXiv, doi:10.48550/arXiv.2304.08485, arXiv:2304.08485, retrieved 2025-10-24
  4. ^ a b Dai, Wenliang; Li, Junnan; Li, Dongxu; Tiong, Anthony Meng Huat; Zhao, Junqi; Wang, Weisheng; Li, Boyang; Fung, Pascale; Hoi, Steven (2023-06-15), InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning, arXiv, doi:10.48550/arXiv.2305.06500, arXiv:2305.06500, retrieved 2025-10-15
  5. ^ a b Zhu, Deyao; Chen, Jun; Shen, Xiaoqian; Li, Xiang; Elhoseiny, Mohamed (2023-10-02), MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models, arXiv, doi:10.48550/arXiv.2304.10592, arXiv:2304.10592, retrieved 2025-10-15
  6. ^ Kulkarni, Girish; Premraj, Visruth; Dhar, Sagnik; Li, Siming; Choi, Yejin; Berg, Alexander C; Berg, Tamara L (25 June 2011). "Baby talk: Understanding and generating simple image descriptions". Conference on Computer Vision and Pattern Recognition (CVPR): 1601–1608. doi:10.1109/CVPR.2011.5995466.
  7. ^ Paragios, Nikos; Daniilidis, Kostas; Maragos, Petros (2010). Computer Vision - ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part IV. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer Berlin Heidelberg Springer e-books. ISBN 978-3-642-15561-1.
  8. ^ Vinyals, Oriol; Toshev, Alexander; Bengio, Samy; Erhan, Dumitru (2015). "Show and Tell: A Neural Image Caption Generator". CVPR 2015: 3156–3164.
  9. ^ Xu, Kelvin; Ba, Jimmy Lei; Kiros, Ryan; Cho, Kyunghyun; Courville, Aaron; Salakhutdinov, Ruslan; Zemel, Richard S.; Bengio, Yoshua (2015-07-06). "Show, attend and tell: neural image caption generation with visual attention". Proceedings of the 32nd International Conference on International Conference on Machine Learning. ICML'15. Lille, France: JMLR.org: 2048–2057.
  10. ^ Lu, Jiasen; Batra, Dhruv; Parikh, Devi; Lee, Stefan (2019-08-06), ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, arXiv, doi:10.48550/arXiv.1908.02265, arXiv:1908.02265, retrieved 2025-10-20
  11. ^ a b "COCO - Common Objects in Context". cocodataset.org. Retrieved 2025-10-20.
  12. ^ "Vision-Language Models (VLM) vs Visual Question Answering (VQA) in 2025?". www.gravio.com. Retrieved 2025-10-20.
  13. ^ "What is Phrase Grounding?". Roboflow Blog. 2024-11-13. Retrieved 2025-10-21.
  14. ^ a b Alayrac, Jean-Baptiste; Donahue, Jeff; Luc, Pauline; Miech, Antoine; Barr, Iain; Hasson, Yana; Lenc, Karel; Mensch, Arthur; Millican, Katie (2022-11-15), Flamingo: a Visual Language Model for Few-Shot Learning, arXiv, doi:10.48550/arXiv.2204.14198, arXiv:2204.14198, retrieved 2025-10-23
  15. ^ Awadalla, Anas; Gao, Irena; Gardner, Josh; Hessel, Jack; Hanafy, Yusuf; Zhu, Wanrong; Marathe, Kalyani; Bitton, Yonatan; Gadre, Samir (2023-08-07), OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models, arXiv, doi:10.48550/arXiv.2308.01390, arXiv:2308.01390, retrieved 2025-10-23
  16. ^ Huang, Shaohan; Dong, Li; Wang, Wenhui; Hao, Yaru; Singhal, Saksham; Ma, Shuming; Lv, Tengchao; Cui, Lei; Mohammed, Owais Khan (2023-03-01), Language Is Not All You Need: Aligning Perception with Language Models, arXiv, doi:10.48550/arXiv.2302.14045, arXiv:2302.14045, retrieved 2025-10-23
  17. ^ "Vision". Claude Docs. Retrieved 2025-10-15.
  18. ^ "Using Copilot Vision with Microsoft Copilot - Microsoft Support". support.microsoft.com. Retrieved 2025-10-15.
  19. ^ "Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality | LMSYS Org". lmsys.org. Retrieved 2025-10-25.
  20. ^ Sharma, Piyush; Ding, Nan; Goodman, Sebastian; Soricut, Radu (2018). Gurevych, Iryna; Miyao, Yusuke (eds.). "Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning". Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics: 2556–2565. doi:10.18653/v1/P18-1238.
  21. ^ "liuhaotian/LLaVA-Instruct-150K · Datasets at Hugging Face". huggingface.co. 2025-06-06. Retrieved 2025-10-25.
  22. ^ Liu, Haotian; Li, Chunyuan; Li, Yuheng; Lee, Yong Jae (2024-05-15), Improved Baselines with Visual Instruction Tuning, arXiv, doi:10.48550/arXiv.2310.03744, arXiv:2310.03744, retrieved 2025-10-26
  23. ^ Li, Feng; Zhang, Renrui; Zhang, Hao; Zhang, Yuanhan; Li, Bo; Li, Wei; Ma, Zejun; Li, Chunyuan (2024-07-28), LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models, arXiv, doi:10.48550/arXiv.2407.07895, arXiv:2407.07895, retrieved 2025-10-26
  24. ^ Brock, Andrew; De, Soham; Smith, Samuel L.; Simonyan, Karen (2021-02-11), High-Performance Large-Scale Image Recognition Without Normalization, arXiv, doi:10.48550/arXiv.2102.06171, arXiv:2102.06171, retrieved 2025-10-28
  25. ^ Jia, Chao; Yang, Yinfei; Xia, Ye; Chen, Yi-Ting; Parekh, Zarana; Pham, Hieu; Le, Quoc V.; Sung, Yunhsuan; Li, Zhen (2021-06-11), Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision, arXiv, doi:10.48550/arXiv.2102.05918, arXiv:2102.05918, retrieved 2025-10-28
  26. ^ Jaegle, Andrew; Gimeno, Felix; Brock, Andrew; Zisserman, Andrew; Vinyals, Oriol; Carreira, Joao (2021-06-23), Perceiver: General Perception with Iterative Attention, arXiv, doi:10.48550/arXiv.2103.03206, arXiv:2103.03206, retrieved 2025-10-28
  27. ^ Wang, Peng; Bai, Shuai; Tan, Sinan; Wang, Shijie; Fan, Zhihao; Bai, Jinze; Chen, Keqin; Liu, Xuejing; Wang, Jialin (2024-10-03), Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution, arXiv, doi:10.48550/arXiv.2409.12191, arXiv:2409.12191, retrieved 2025-10-29
  28. ^ Fang, Alex; Jose, Albin Madappally; Jain, Amit; Schmidt, Ludwig; Toshev, Alexander; Shankar, Vaishaal (2023-11-06), Data Filtering Networks, arXiv, doi:10.48550/arXiv.2309.17425, arXiv:2309.17425, retrieved 2025-10-30
  29. ^ tangbasky (2025-09-08). "Qwen2-VL's RoPE Variant— M-RoPE". Everyday AI. Retrieved 2025-10-30.