Text-to-video model

From Wikipedia, the free encyclopedia

A video generated using OpenAI's Sora text-to-video model.

A text-to-video model is a machine learning model that uses a natural language description as input to produce a video relevant to the input text.[1] Advancements during the 2020s in the generation of high-quality, text-conditioned videos have largely been driven by the development of video diffusion models.[2]

Models

There are different models, including open source models. CogVideo, which accepts Chinese-language input,[3] is the earliest text-to-video model to be developed, with 9.4 billion parameters; a demo version of its open-source code was first presented on GitHub in 2022.[4] That year, Meta Platforms released a partial text-to-video model called "Make-A-Video",[5][6][7] and Google Brain (later Google DeepMind) introduced Imagen Video, a text-to-video model built on a 3D U-Net.[8][9][10][11][12]

In March 2023, a research paper titled "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation" presented a novel approach to video generation.[13] The VideoFusion model decomposes the diffusion process into two components, base noise and residual noise, with the base noise shared across frames to ensure temporal coherence. By utilizing a pre-trained image diffusion model as a base generator, the model efficiently generated high-quality and coherent videos. Fine-tuning the pre-trained model on video data addressed the domain gap between image and video data, enhancing the model's ability to produce realistic and consistent video sequences.[14] That same month, Adobe introduced its Firefly AI features.[15]
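The noise decomposition can be illustrated with a short sketch. Every frame shares one base-noise tensor and adds its own residual noise, so the starting latents of different frames are correlated while each remains unit-variance Gaussian. This is an illustrative NumPy approximation, not the paper's exact parameterization; the mixing weight lam and the tensor shapes are assumptions for the example:

```python
import numpy as np

def decomposed_noise(num_frames, shape, lam=0.8, rng=None):
    """Sample per-frame noise as sqrt(lam)*base + sqrt(1-lam)*residual.

    The base noise is shared across frames (temporal coherence);
    the residual noise is independent per frame (frame diversity).
    Each frame's noise remains (approximately) unit-variance Gaussian.
    """
    rng = rng or np.random.default_rng(0)
    base = rng.standard_normal(shape)          # shared across all frames
    frames = []
    for _ in range(num_frames):
        residual = rng.standard_normal(shape)  # sampled fresh per frame
        frames.append(np.sqrt(lam) * base + np.sqrt(1 - lam) * residual)
    return np.stack(frames)

noise = decomposed_noise(num_frames=8, shape=(16, 16))
# Frames share the base component, so their noise is positively correlated.
corr = np.corrcoef(noise[0].ravel(), noise[1].ravel())[0, 1]
```

A larger lam trades frame-to-frame diversity for stronger temporal coherence; the shared base is what keeps consecutive frames from flickering independently.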

In January 2024, Google announced development of a text-to-video model named Lumiere which is anticipated to integrate advanced video editing capabilities.[16] Matthias Niessner and Lourdes Agapito at AI company Synthesia work on developing 3D neural rendering techniques that can synthesise realistic video by using 2D and 3D neural representations of shape, appearance, and motion for controllable video synthesis of avatars.[17] In June 2024, Luma Labs launched its Dream Machine video tool.[18][19] That same month,[20] Kuaishou extended its Kling AI text-to-video model to international users. In July 2024, TikTok owner ByteDance released Jimeng AI in China through its subsidiary Faceu Technology.[21] By September 2024, the Chinese AI company MiniMax had debuted its video-01 model, joining other established Chinese AI model companies such as Zhipu AI, Baichuan, and Moonshot AI.[22]

Alternative approaches to text-to-video models include[23] Google's Phenaki, Hour One, Colossyan,[3] Runway's Gen-3 Alpha,[24][25] and OpenAI's Sora.[26][27] Several additional text-to-video models, such as Plug-and-Play, Text2LIVE, and TuneAVideo, have emerged.[28] Google is also preparing to launch a video generation tool named Veo for YouTube Shorts in 2025.[29] FLUX.1 developer Black Forest Labs has announced its text-to-video model SOTA.[30]

Architecture and training

Several architectures have been used to create text-to-video models. Similar to text-to-image models, these models can be trained using recurrent neural networks (RNNs) such as long short-term memory (LSTM) networks, which have been used for pixel transformation models and stochastic video generation models, aiding consistency and realism respectively.[31] Transformer models are an alternative to these. Generative adversarial networks (GANs), variational autoencoders (VAEs), which can aid in the prediction of human motion,[32] and diffusion models have also been used to develop the image generation aspects of the model.[33]
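The diffusion component mentioned above is trained by corrupting clean data with noise and learning to reverse the corruption. The forward (noising) process admits a closed form: the noised sample at step t is sqrt(ᾱ_t)·x0 + sqrt(1−ᾱ_t)·ε, where ᾱ_t is the cumulative product of (1−β_t). A minimal NumPy sketch of this forward process, with an illustrative linear beta schedule (the schedule values and the stand-in "video" shape are assumptions, and the learned denoising network is omitted):

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng=None):
    """Closed-form forward diffusion: noise x0 directly to timestep t."""
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

alpha_bar = make_alpha_bar()
x0 = np.ones((4, 8, 8))  # stand-in "video": 4 frames of 8x8 values
x_noisy, eps = q_sample(x0, t=999, alpha_bar=alpha_bar)
```

During training, a network is given x_noisy and t and asked to predict eps; video models extend the denoiser with temporal layers so frames are denoised jointly rather than independently.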

Text-video datasets used to train models include, but are not limited to, WebVid-10M, HDVILA-100M, CCV, ActivityNet, and Panda-70M.[34][35] These datasets contain millions of original videos of interest, generated videos, captioned videos, and textual information that help train models for accuracy. Text-prompt datasets used to train models include, but are not limited to, PromptSource, DiffusionDB, and VidProM.[34][35] These datasets provide the range of text inputs needed to teach models how to interpret a variety of textual prompts.

The video generation process involves synchronizing the text inputs with video frames, ensuring alignment and consistency throughout the sequence.[35] This predictive process is subject to decline in quality as the length of the video increases due to resource limitations.[35]
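Text-video alignment of the kind described above is commonly measured by embedding the prompt and each generated frame into a shared space and scoring their cosine similarity, as in CLIP-style evaluation. A toy sketch with random stand-in embeddings (the embedding dimension, the drift schedule, and the embeddings themselves are illustrative assumptions, not outputs of a real encoder):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def alignment_scores(text_emb, frame_embs):
    """Score each frame embedding against the prompt embedding."""
    return [cosine_similarity(text_emb, f) for f in frame_embs]

rng = np.random.default_rng(0)
text_emb = rng.standard_normal(64)
# Stand-in frame embeddings: early frames stay close to the prompt while
# later frames drift, mimicking the quality decline over longer sequences.
frame_embs = [text_emb + 0.1 * (i + 1) * rng.standard_normal(64)
              for i in range(8)]
scores = alignment_scores(text_emb, frame_embs)
```

In practice a real text encoder and image encoder replace the random vectors; a falling score across frames signals that the video is drifting away from the prompt as it lengthens.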

Limitations

Despite the rapid evolution of text-to-video models in their performance, a primary limitation is that they are very computationally heavy, which limits their capacity to provide high-quality and lengthy outputs.[36][37] Additionally, these models require a large amount of specific training data to be able to generate high-quality and coherent outputs, which raises the issue of accessibility.[37][36]
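The computational burden can be made concrete with rough arithmetic: in a transformer-based video model, the cost of full spatiotemporal self-attention grows with the square of the token count, and the token count grows linearly with clip length. A back-of-envelope sketch (the per-frame patch count is an illustrative assumption):

```python
def attention_cost(num_frames, patches_per_frame=256):
    """Relative self-attention cost: quadratic in the total token count."""
    tokens = num_frames * patches_per_frame
    return tokens ** 2

# Doubling the clip length quadruples the attention cost.
ratio = attention_cost(32) / attention_cost(16)
```

This quadratic scaling is one reason current models cap clip length at a few seconds, and why factorized (separate spatial and temporal) attention is a common mitigation.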

Moreover, models may misinterpret textual prompts, resulting in video outputs that deviate from the intended meaning. This can occur due to limitations in capturing semantic context embedded in text, which affects the model’s ability to align generated video with the user’s intended message.[37][35] Various models, including Make-A-Video, Imagen Video, Phenaki, CogVideo, GODIVA, and NUWA, are currently being tested and refined to enhance their alignment capabilities and overall performance in text-to-video generation.[37]

Ethics

The deployment of Text-to-Video models raises ethical considerations related to content generation. These models have the potential to create inappropriate or unauthorized content, including explicit material, graphic violence, misinformation, and likenesses of real individuals without consent.[38] Ensuring that AI-generated content complies with established standards for safe and ethical usage is essential, as content generated by these models may not always be easily identified as harmful or misleading. The ability of AI to recognize and filter out NSFW or copyrighted content remains an ongoing challenge, with implications for both creators and audiences.[38]

Impacts and applications

Text-to-video models offer a broad range of applications that may benefit various fields, from education and promotion to the creative industries. These models can streamline content creation for training videos, movie previews, gaming assets, and visualizations, making it easier to generate high-quality, dynamic content.[39] These features provide users with economic and practical benefits. The feature film The Reality of Time, the world's first full-length movie to fully integrate generative AI for video, was completed in 2024 and is narrated in part by John de Lancie (known for his role as "Q" in Star Trek: The Next Generation). Its production utilized advanced AI tools, including Runway Gen-3 Alpha and Kling 1.6, as described in the book Cinematic A.I., which explores the limitations of text-to-video technology, the challenges of implementing it, and how image-to-video techniques were employed for many of the film's key shots.

Comparison of existing models

| Model/Product | Company | Year released | Status | Key features | Capabilities | Pricing | Video length | Supported languages |
|---|---|---|---|---|---|---|---|---|
| Synthesia | Synthesia | 2019 | Released | AI avatars, multilingual support for 60+ languages, customization options[40] | Specialized in realistic AI avatars for corporate training and marketing[40] | Subscription-based, starting around $30/month | Varies based on subscription | 60+ |
| InVideo AI | InVideo | 2021 | Released | AI-powered video creation, large stock library, AI talking avatars[40] | Tailored for social media content with platform-specific templates[40] | Free plan available, paid plans starting at $16/month | Varies depending on content type | Multiple (not specified) |
| Fliki | Fliki AI | 2022 | Released | Text-to-video with AI avatars and voices, extensive language and voice support[40] | Supports 65+ AI avatars and 2,000+ voices in 70 languages[40] | Free plan available, paid plans starting at $30/month | Varies based on subscription | 70+ |
| Runway Gen-2 | Runway AI | 2023 | Released | Multimodal video generation from text, images, or videos[41] | High-quality visuals, various modes like stylization and storyboard[41] | Free trial, paid plans (details not specified) | Up to 16 seconds | Multiple (not specified) |
| Pika Labs | Pika Labs | 2024 | Beta | Dynamic video generation, camera and motion customization[42] | User-friendly, focused on natural dynamic generation[42] | Currently free during beta | Flexible, supports longer videos with frame continuation | Multiple (not specified) |
| Runway Gen-3 Alpha | Runway AI | 2024 | Alpha | Enhanced visual fidelity, photorealistic humans, fine-grained temporal control[43] | Ultra-realistic video generation with precise key-framing and industry-level customization[43] | Free trial available, custom pricing for enterprises | Up to 10 seconds per clip, extendable | Multiple (not specified) |
| OpenAI Sora | OpenAI | 2024 | Alpha | Deep language understanding, high-quality cinematic visuals, multi-shot videos[44] | Capable of creating detailed, dynamic, and emotionally expressive videos; still under development with safety measures[44] | Pricing not yet disclosed | Expected to generate longer videos; duration specifics TBD | Multiple (not specified) |

References

  1. ^ Artificial Intelligence Index Report 2023 (PDF) (Report). Stanford Institute for Human-Centered Artificial Intelligence. p. 98. Multiple high quality text-to-video models, AI systems that can generate video clips from prompted text, were released in 2022.
  2. ^ Melnik, Andrew; Ljubljanac, Michal; Lu, Cong; Yan, Qi; Ren, Weiming; Ritter, Helge (6 May 2024). "Video Diffusion Models: A Survey". arXiv:2405.03150 [cs.CV].
  3. ^ a b Wodecki, Ben (11 August 2023). "Text-to-Video Generative AI Models: The Definitive List". AI Business. Informa. Retrieved 18 November 2024.
  4. ^ CogVideo, THUDM, 12 October 2022, retrieved 12 October 2022
  5. ^ Davies, Teli (29 September 2022). "Make-A-Video: Meta AI's New Model For Text-To-Video Generation". Weights & Biases. Retrieved 12 October 2022.
  6. ^ Monge, Jim Clyde (3 August 2022). "This AI Can Create Video From Text Prompt". Medium. Retrieved 12 October 2022.
  7. ^ "Meta's Make-A-Video AI creates videos from text". www.fonearena.com. Retrieved 12 October 2022.
  8. ^ "google: Google takes on Meta, introduces own video-generating AI". The Economic Times. 6 October 2022. Retrieved 12 October 2022.
  9. ^ Monge, Jim Clyde (3 August 2022). "This AI Can Create Video From Text Prompt". Medium. Retrieved 12 October 2022.
  10. ^ "Nuh-uh, Meta, we can do text-to-video AI, too, says Google". The Register. Retrieved 12 October 2022.
  11. ^ "Papers with Code - See, Plan, Predict: Language-guided Cognitive Planning with Video Prediction". paperswithcode.com. Retrieved 12 October 2022.
  12. ^ "Papers with Code - Text-driven Video Prediction". paperswithcode.com. Retrieved 12 October 2022.
  13. ^ Luo, Zhengxiong; Chen, Dayou; Zhang, Yingya; Huang, Yan; Wang, Liang; Shen, Yujun; Zhao, Deli; Zhou, Jingren; Tan, Tieniu (2023). "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation". arXiv:2303.08320 [cs.CV].
  14. ^ Luo, Zhengxiong; Chen, Dayou; Zhang, Yingya; Huang, Yan; Wang, Liang; Shen, Yujun; Zhao, Deli; Zhou, Jingren; Tan, Tieniu (2023). "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation". arXiv:2303.08320 [cs.CV].
  15. ^ "Adobe launches Firefly Video model and enhances image, vector and design models. Adobe Newsroom". Adobe Inc. 10 October 2024. Retrieved 18 November 2024.
  16. ^ Yirka, Bob (26 January 2024). "Google announces the development of Lumiere, an AI-based next-generation text-to-video generator". Tech Xplore. Retrieved 18 November 2024.
  17. ^ "Text to Speech for Videos". Synthesia.io. Retrieved 17 October 2023.
  18. ^ Nuñez, Michael (12 June 2024). "Luma AI debuts 'Dream Machine' for realistic video generation, heating up AI media race". VentureBeat. Retrieved 18 November 2024.
  19. ^ Fink, Charlie. "Apple Debuts Intelligence, Mistral Raises $600 Million, New AI Text-To-Video". Forbes. Retrieved 18 November 2024.
  20. ^ Franzen, Carl (12 June 2024). "What you need to know about Kling, the AI video generator rival to Sora that's wowing creators". VentureBeat. Retrieved 18 November 2024.
  21. ^ "ByteDance joins OpenAI's Sora rivals with AI video app launch". Reuters. 6 August 2024. Retrieved 18 November 2024.
  22. ^ "Chinese ai "tiger" minimax launches text-to-video-generating model to rival OpenAI's sora". Yahoo! Finance. 2 September 2024. Retrieved 18 November 2024.
  23. ^ Text2Video-Zero, Picsart AI Research (PAIR), 12 August 2023, retrieved 12 August 2023
  24. ^ Kemper, Jonathan (1 July 2024). "Runway's Sora competitor Gen-3 Alpha now available". THE DECODER. Retrieved 18 November 2024.
  25. ^ "Generative AI's Next Frontier Is Video". Bloomberg.com. 20 March 2023. Retrieved 18 November 2024.
  26. ^ "OpenAI teases 'Sora,' its new text-to-video AI model". NBC News. 15 February 2024. Retrieved 18 November 2024.
  27. ^ Kelly, Chris (25 June 2024). "Toys R Us creates first brand film to use OpenAI's text-to-video tool". Marketing Dive. Informa. Retrieved 18 November 2024.
  28. ^ Jin, Jiayao; Wu, Jianhang; Xu, Zhoucheng; Zhang, Hang; Wang, Yaxin; Yang, Jielong (4 August 2023). "Text to Video: Enhancing Video Generation Using Diffusion Models and Reconstruction Network". 2023 2nd International Conference on Computing, Communication, Perception and Quantum Technology (CCPQT). IEEE. pp. 108–114. doi:10.1109/CCPQT60491.2023.00024. ISBN 979-8-3503-4269-7.
  29. ^ Forlini, Emily Dreibelbis (18 September 2024). "Google's veo text-to-video AI generator is coming to YouTube shorts". PC Magazine. Retrieved 18 November 2024.
  30. ^ "Announcing Black Forest Labs". Black Forest Labs. 1 August 2024. Retrieved 18 November 2024.
  31. ^ Bhagwatkar, Rishika; Bachu, Saketh; Fitter, Khurshed; Kulkarni, Akshay; Chiddarwar, Shital (17 December 2020). "A Review of Video Generation Approaches". 2020 International Conference on Power, Instrumentation, Control and Computing (PICC). IEEE. pp. 1–5. doi:10.1109/PICC51425.2020.9362485. ISBN 978-1-7281-7590-4.
  32. ^ Kim, Taehoon; Kang, ChanHee; Park, JaeHyuk; Jeong, Daun; Yang, ChangHee; Kang, Suk-Ju; Kong, Kyeongbo (3 January 2024). "Human Motion Aware Text-to-Video Generation with Explicit Camera Control". 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). IEEE. pp. 5069–5078. doi:10.1109/WACV57701.2024.00500. ISBN 979-8-3503-1892-0.
  33. ^ Singh, Aditi (9 May 2023). "A Survey of AI Text-to-Image and AI Text-to-Video Generators". 2023 4th International Conference on Artificial Intelligence, Robotics and Control (AIRC). IEEE. pp. 32–36. arXiv:2311.06329. doi:10.1109/AIRC57904.2023.10303174. ISBN 979-8-3503-4824-8.
  34. ^ a b Miao, Yibo; Zhu, Yifan; Dong, Yinpeng; Yu, Lijia; Zhu, Jun; Gao, Xiao-Shan (8 September 2024). "T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models". arXiv:2407.05965 [cs.CV].
  35. ^ a b c d e Zhang, Ji; Mei, Kuizhi; Wang, Xiao; Zheng, Yu; Fan, Jianping (August 2018). "From Text to Video: Exploiting Mid-Level Semantics for Large-Scale Video Classification". 2018 24th International Conference on Pattern Recognition (ICPR). IEEE. pp. 1695–1700. doi:10.1109/ICPR.2018.8545513. ISBN 978-1-5386-3788-3.
  36. ^ a b Bhagwatkar, Rishika; Bachu, Saketh; Fitter, Khurshed; Kulkarni, Akshay; Chiddarwar, Shital (17 December 2020). "A Review of Video Generation Approaches". 2020 International Conference on Power, Instrumentation, Control and Computing (PICC). IEEE. pp. 1–5. doi:10.1109/PICC51425.2020.9362485. ISBN 978-1-7281-7590-4.
  37. ^ a b c d Singh, Aditi (9 May 2023). "A Survey of AI Text-to-Image and AI Text-to-Video Generators". 2023 4th International Conference on Artificial Intelligence, Robotics and Control (AIRC). IEEE. pp. 32–36. arXiv:2311.06329. doi:10.1109/AIRC57904.2023.10303174. ISBN 979-8-3503-4824-8.
  38. ^ a b Miao, Yibo; Zhu, Yifan; Dong, Yinpeng; Yu, Lijia; Zhu, Jun; Gao, Xiao-Shan (8 September 2024). "T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models". arXiv:2407.05965 [cs.CV].
  39. ^ Singh, Aditi (9 May 2023). "A Survey of AI Text-to-Image and AI Text-to-Video Generators". 2023 4th International Conference on Artificial Intelligence, Robotics and Control (AIRC). IEEE. pp. 32–36. arXiv:2311.06329. doi:10.1109/AIRC57904.2023.10303174. ISBN 979-8-3503-4824-8.
  40. ^ a b c d e f "Top AI Video Generation Models of 2024". Deepgram. Retrieved 30 August 2024.
  41. ^ a b "Runway Research | Gen-2: Generate novel videos with text, images or video clips". runwayml.com. Retrieved 30 August 2024.
  42. ^ a b Sharma, Shubham (26 December 2023). "Pika Labs' text-to-video AI platform opens to all: Here's how to use it". VentureBeat. Retrieved 30 August 2024.
  43. ^ a b "Runway Research | Introducing Gen-3 Alpha: A New Frontier for Video Generation". runwayml.com. Retrieved 30 August 2024.
  44. ^ a b "Sora | OpenAI". openai.com. Retrieved 30 August 2024.