Text-to-image model
A text-to-image model is a machine learning model which takes a natural language description as input and produces an image matching that description. Such models began to be developed in the mid-2010s as a result of advances in deep neural networks. In 2022, the output of state-of-the-art text-to-image models, such as OpenAI's DALL-E 2, Google Brain's Imagen, and Stability AI's Stable Diffusion, began to approach the quality of real photographs and human-drawn art.
Text-to-image models generally combine a language model, which transforms the input text into a latent representation, and a generative image model, which produces an image conditioned on that representation. The most effective models have generally been trained on massive amounts of image and text data scraped from the web.[1]
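The following is a minimal, illustrative sketch (in PyTorch) of this two-part structure: a toy text encoder maps a token sequence to a latent vector, and a toy generator produces an image conditioned on that vector. All module choices, layer sizes, and names are hypothetical simplifications for exposition, not the architecture of any particular published model.

```python
# Illustrative sketch only: a text encoder produces a latent representation,
# and a generative image model is conditioned on it. Sizes are arbitrary.
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=128, latent_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, latent_dim, batch_first=True)

    def forward(self, token_ids):                  # (batch, seq_len)
        _, h = self.rnn(self.embed(token_ids))     # h: (1, batch, latent_dim)
        return h.squeeze(0)                        # (batch, latent_dim)

class ImageGenerator(nn.Module):
    """Upsamples a text latent (plus noise) into a small RGB image."""
    def __init__(self, latent_dim=256, noise_dim=64):
        super().__init__()
        self.fc = nn.Linear(latent_dim + noise_dim, 128 * 8 * 8)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, text_latent, noise):
        x = self.fc(torch.cat([text_latent, noise], dim=1))
        return self.deconv(x.view(-1, 128, 8, 8))  # (batch, 3, 32, 32)

# Usage: encode a (toy) tokenized caption, then sample an image from it.
tokens = torch.randint(0, 1000, (1, 12))           # stand-in for a real tokenizer
latent = TextEncoder()(tokens)
image = ImageGenerator()(latent, torch.randn(1, 64))
print(image.shape)                                 # torch.Size([1, 3, 32, 32])
```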
History
Before the rise of deep learning, there were some limited attempts to build text-to-image models, but they were limited to effectively creating collages by arranging together existing component images, such as from a database of clip art.[2][3]
The more tractable inverse problem, image captioning, saw a number of successful deep learning approaches prior to the first text-to-image models.[4]
The first modern text-to-image model, alignDRAW, was introduced in 2015 by researchers from the University of Toronto. AlignDRAW extended the previously-introduced DRAW architecture (which used a recurrent variational autoencoder with an attention mechanism) to be conditioned on text sequences.[4] Images generated by alignDRAW were blurry and not photorealistic, but the model was able to generalize to objects not represented in the training data (such as a red school bus), and appropriately handled novel prompts such as "a stop sign is flying in blue skies", showing that it was not merely "memorizing" data from the training set.[4][5]
In 2016, Reed, Akata, Yan et al. became the first to use generative adversarial networks for the text-to-image task.[5][6] With models trained on narrow, domain-specific datasets, they were able to generate "visually plausible" images of birds and flowers from text captions like "an all black bird with a distinct thick, rounded bill". A model trained on the more diverse COCO dataset produced images which were "from a distance... encouraging", but which lacked coherence in their details.[5]
Architecture and training
Text-to-image models are trained on large datasets of (text, image) pairs, often scraped from the web. With their 2022 Imagen model, Google Brain reported positive results from using a large language model trained separately on a text-only corpus (with its weights subsequently frozen), a departure from the previously standard approach.[7]
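As a rough illustration of the frozen-encoder setup described above, the sketch below (reusing the toy TextEncoder and ImageGenerator classes from the earlier example) freezes the text encoder's weights and updates only the image model on (text, image) pairs. The reconstruction loss is a placeholder for exposition, not the diffusion objective actually used by Imagen.

```python
# Illustrative training step with a frozen text encoder: only the image
# generator's parameters receive gradient updates.
import torch
import torch.nn.functional as F

text_encoder = TextEncoder()
text_encoder.requires_grad_(False)         # freeze: no gradients reach the encoder
text_encoder.eval()

generator = ImageGenerator()
optimizer = torch.optim.Adam(generator.parameters(), lr=1e-4)

# One toy training step on a random (caption, image) batch.
captions = torch.randint(0, 1000, (8, 12))
images = torch.rand(8, 3, 32, 32) * 2 - 1  # targets scaled to [-1, 1]

with torch.no_grad():                      # encoder weights stay fixed
    latents = text_encoder(captions)

optimizer.zero_grad()
fake = generator(latents, torch.randn(8, 64))
loss = F.mse_loss(fake, images)            # placeholder loss, not Imagen's objective
loss.backward()
optimizer.step()
```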
References
- ^ Vincent, James (May 24, 2022). "All these images were generated by Google's latest text-to-image AI". The Verge. Vox Media. Retrieved May 28, 2022.
- ^ Agnese, Jorge; Herrera, Jonathan; Tao, Haicheng; Zhu, Xingquan (October 2019). "A Survey and Taxonomy of Adversarial Neural Networks for Text-to-Image Synthesis" (PDF).
- ^ Zhu, Xiaojin; Goldberg, Andrew B.; Eldawy, Mohamed; Dyer, Charles R.; Strock, Bradley (2007). "A text-to-picture synthesis system for augmenting communication" (PDF). AAAI. 7: 1590–1595.
- ^ a b c Mansimov, Elman; Parisotto, Emilio; Lei Ba, Jimmy; Salakhutdinov, Ruslan (November 2015). "Generating Images from Captions with Attention". ICLR.
- ^ a b c Reed, Scott; Akata, Zeynep; Logeswaran, Lajanugen; Schiele, Bernt; Lee, Honglak (June 2016). "Generative Adversarial Text to Image Synthesis" (PDF). International conference on machine learning.
- ^ Frolov, Stanislav; Hinz, Tobias; Raue, Federico; Hees, Jörn; Dengel, Andreas (December 2021). "Adversarial text-to-image synthesis: A review". Neural Networks. 144: 187–209.
- ^ Saharia, Chitwan (23 May 2022). "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding".