Text-to-image personalization

Text-to-Image personalization is a task in deep learning for computer graphics that augments pre-trained text-to-image generative models. In this task, a generative model trained on large-scale data (usually a foundation model) is adapted so that it can generate images of novel, user-provided concepts. These concepts are typically unseen during training, and may represent specific objects (such as the user’s pet) or more abstract categories (such as new artistic styles^[1] or object relations; arXiv:2303.13495).

Text-to-Image personalization methods typically bind the novel (personal) concept to new words in the vocabulary of the model. These words can then be used in future prompts to invoke the concept for subject-driven generation, inpainting, style transfer and even to correct biases in the model. To do so, methods either optimize word embeddings, fine-tune the generative model itself, or combine both approaches.

Technology

Text-to-Image personalization was first proposed in August 2022 by two concurrent works, Textual Inversion^[1] and DreamBooth^[2].

In both cases, a user provides a few images (typically 3–5) of a concept, like their own dog, together with a coarse descriptor of the concept class (like the word "dog"). The model then learns to represent the subject through a reconstruction-based objective, where prompts referring to the subject are expected to reconstruct images from the training set.
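For diffusion-based models, a reconstruction objective of this kind is commonly written as a denoising loss. The form below is a sketch under that assumption; the symbols are introduced here for illustration and do not come from the text above. Here x is a training image of the concept, x_t is its noised version at timestep t, y is a prompt referring to the subject, c is the text encoder, and \phi stands for whichever parameters the method trains (a new word embedding, the model weights, or both).

```latex
\mathcal{L}(\phi) \;=\;
  \mathbb{E}_{x,\, y,\, \epsilon \sim \mathcal{N}(0, I),\, t}
  \left[ \left\lVert \epsilon - \epsilon_\theta\!\left(x_t,\, t,\, c(y)\right) \right\rVert_2^2 \right]
```

Minimizing this loss over \phi encourages prompts that mention the subject to denoise towards the user's training images.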

In Textual Inversion, the personalized concepts are introduced into the text-to-image model by adding new words to the vocabulary of the model. Typical text-to-image models represent words (and sometimes parts-of-words) as tokens, or indices in a predefined dictionary. During generation, an input prompt is converted into such tokens, each of which is converted into a ‘word-embedding’: a continuous vector representation which is learned for each token as part of the model’s training. Textual Inversion proposes to optimize a new word-embedding vector for representing the novel concept. This new embedding vector can then be assigned to a user-chosen string, and invoked whenever the user’s prompt contains this string^[1].
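The token-to-embedding lookup and the idea of optimizing a single new vector can be sketched with a toy example. This is a minimal illustration in plain Python, not the published implementation: the four-word vocabulary, the placeholder string `<my-dog>`, the 4-dimensional embeddings and the squared-error loss towards a fixed `target` vector are all assumptions. In a real system the gradient would come from the image reconstruction loss of the generative model.

```python
import random

# Toy vocabulary: token string -> row index in the embedding table.
vocab = {"a": 0, "photo": 1, "of": 2, "dog": 3}
DIM = 4
random.seed(0)
# One vector per token; these stand in for the frozen, pre-trained table.
embeddings = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in vocab]


def add_concept_token(placeholder, init_word):
    """Append a new row for `placeholder`, initialised from a coarse class word."""
    vocab[placeholder] = len(embeddings)
    embeddings.append(list(embeddings[vocab[init_word]]))
    return vocab[placeholder]


def encode(prompt):
    """Convert a prompt into token indices, then into embedding vectors."""
    return [embeddings[vocab[word]] for word in prompt.split()]


def toy_loss_and_grad(vector, target):
    """Squared error to a fixed target; a stand-in for the real reconstruction loss."""
    diff = [v - t for v, t in zip(vector, target)]
    return sum(d * d for d in diff), [2 * d for d in diff]


new_id = add_concept_token("<my-dog>", "dog")
frozen_before = [list(row) for row in embeddings[:new_id]]
target = [0.5, -0.25, 0.1, 0.9]  # hypothetical "ideal" embedding for the concept

initial_loss, _ = toy_loss_and_grad(embeddings[new_id], target)
for _ in range(200):
    _, grad = toy_loss_and_grad(embeddings[new_id], target)
    # Only the new row is updated; every other embedding stays frozen.
    embeddings[new_id] = [v - 0.1 * g for v, g in zip(embeddings[new_id], grad)]
final_loss, _ = toy_loss_and_grad(embeddings[new_id], target)

# The learned vector is used whenever the prompt contains the placeholder string.
prompt_vectors = encode("a photo of <my-dog>")
```

Because only one row of the table changes, the rest of the model behaves exactly as before for prompts that do not contain the placeholder.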

In DreamBooth, rather than optimizing a new word vector, the full generative model itself is fine-tuned. The user first selects an existing token, typically one which rarely appears in prompts. The subject itself is then represented by a string containing this token, followed by a coarse descriptor of the subject's class. A prompt describing the subject will then take the form: "A photo of <token> <class>" (e.g. "a photo of sks cat" when learning to represent a specific cat). The text-to-image model is then tuned so that prompts of this form will generate images of the subject^[2].
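The prompt format described above can be sketched as a small helper. The function name `build_subject_prompt` and the second template are hypothetical and only illustrate how the identifier token and class word are combined; they are not part of any DreamBooth codebase.

```python
def build_subject_prompt(token, class_name, template="a photo of {token} {class_name}"):
    """Fill a prompt template with the rare identifier token and the coarse class."""
    return template.format(token=token, class_name=class_name)


# Prompt of the form used while fine-tuning on the subject's images.
training_prompt = build_subject_prompt("sks", "cat")

# After tuning, the same token-plus-class string can appear in novel prompts.
novel_prompt = build_subject_prompt(
    "sks", "cat", template="an oil painting of {token} {class_name} on a beach"
)

print(training_prompt)
print(novel_prompt)
```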

Extensions

Several approaches have been proposed to refine and improve on the original methods. These include the following.

(1) Low-rank Adaptation (LoRA) - an adapter-based technique for efficient finetuning of models^[3]. In the case of text-to-image models, LoRA is typically used to modify the cross-attention layers of a diffusion model.

(2) Perfusion - a low rank update method that also locks the activations of the key matrix in the diffusion model's cross attention layers to the concept's coarse class^[4].

(3) Extended Textual Inversion - a technique that learns an individual word embedding for each layer in the diffusion model's denoising network^[5].

(4) Encoder-based methods that use another neural network to quickly personalize a model^[6] (arXiv:2302.13848).
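The low-rank idea behind item (1) can be illustrated with a small sketch: a frozen weight matrix W0 is combined with a trainable update B·A of rank r, scaled by alpha / r. The matrix sizes, the `alpha` value and the helper names below are assumptions chosen for illustration, and the code uses plain Python lists rather than any particular deep learning library.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w0, b, a, alpha):
    """Return W0 + (alpha / r) * B @ A, where r is the rank of the update."""
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w0, delta)
    ]


random.seed(0)
d_out, d_in, rank = 6, 8, 2
# Frozen pre-trained weight (random here, as a stand-in).
w0 = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# Trainable factors: A starts small and random, B starts at zero.
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
b = [[0.0] * rank for _ in range(d_out)]

# With B at zero the adapted weight equals the original weight.
w_start = lora_weight(w0, b, a, alpha=2.0)

full_params = d_out * d_in            # parameters in the full matrix
lora_params = rank * (d_out + d_in)   # parameters in the two low-rank factors
print(full_params, lora_params)
```

Only the two small factors need to be trained and stored, which is why the saving grows as the layer gets larger relative to the rank.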

Challenges and limitations

Text-to-image personalization methods must contend with several challenges. At their core is the goal of achieving high fidelity to the personal concept while maintaining high alignment between novel prompts containing the subject and the generated images (typically referred to as ‘editability’).

Another challenge that personalization methods must contend with is memory requirements. Initial implementations of personalization methods required more than 20 gigabytes of GPU memory, and more recent approaches have reported requirements of more than 40 gigabytes^[6]. However, optimizations such as Flash Attention (arXiv:2205.14135) have since reduced this requirement considerably.

Approaches that tune the entire generative model may also create checkpoints that are several gigabytes in size, making it difficult to share or store many models. Embedding-based approaches require only a few kilobytes, but typically struggle to preserve identity while maintaining editability. More recent approaches have proposed hybrid tuning goals which optimize both an embedding and a subset of network weights. These can reduce storage requirements to as little as 100 kilobytes while achieving quality comparable to full tuning methods^[4].
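The gap between these storage figures follows from simple arithmetic. The sketch below assumes 32-bit floats, an embedding width of 768 and a model with one billion parameters; these numbers are illustrative assumptions, not values reported in the text above.

```python
BYTES_PER_FLOAT32 = 4

# Assumed sizes, for illustration only.
embedding_dim = 768                # assumed width of one word-embedding vector
full_model_params = 1_000_000_000  # assumed parameter count of a full model

embedding_bytes = embedding_dim * BYTES_PER_FLOAT32
checkpoint_bytes = full_model_params * BYTES_PER_FLOAT32

print(f"one embedding: {embedding_bytes / 1024:.1f} KiB")
print(f"full checkpoint: {checkpoint_bytes / 1e9:.1f} GB")
```

Under these assumptions a single learned embedding takes about 3 KiB, while a full checkpoint takes about 4 GB, consistent with the "few kilobytes" and "several gigabytes" orders of magnitude described above.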

Finally, optimization processes can be lengthy, requiring several minutes of tuning for each novel concept. Encoder and quick-tuning methods aim to reduce this to seconds or less (arXiv:2304.03411).

References

[1] Gal, Rinon; Alaluf, Yuval; Atzmon, Yuval; Patashnik, Or; Bermano, Amit Haim; Chechik, Gal; Cohen-Or, Daniel (2022-09-29). "An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion".
[2] Ruiz, Nataniel; Li, Yuanzhen; Jampani, Varun; Pritch, Yael; Rubinstein, Michael; Aberman, Kfir (2023). "DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation". pp. 22500–22510.
[3] Hu, Edward J.; Shen, Yelong; Wallis, Phillip; Allen-Zhu, Zeyuan; Li, Yuanzhi; Wang, Shean; Wang, Lu; Chen, Weizhu (2021-10-06). "LoRA: Low-Rank Adaptation of Large Language Models".
[4] Tewel, Yoad; Gal, Rinon; Chechik, Gal; Atzmon, Yuval (2023-07-23). "Key-Locked Rank One Editing for Text-to-Image Personalization". ACM SIGGRAPH 2023 Conference Proceedings. SIGGRAPH '23. New York, NY, USA: Association for Computing Machinery: 1–11. doi:10.1145/3588432.3591506. ISBN 979-8-4007-0159-7.
[5] Lorenzi, Daniele (2023-07-22). "Meet P+: A Rich Embeddings Space for Extended Textual Inversion in Text-to-Image Generation". MarkTechPost. Retrieved 2023-08-29.
[6] Gal, Rinon; Arar, Moab; Atzmon, Yuval; Bermano, Amit H.; Chechik, Gal; Cohen-Or, Daniel (2023-07-26). "Encoder-based Domain Tuning for Fast Personalization of Text-to-Image Models". ACM Transactions on Graphics. 42 (4): 150:1–150:13. doi:10.1145/3592133. ISSN 0730-0301.
