The emergence of artificial intelligence (AI) has been one of the most significant technological advancements of the 21st century. From self-driving cars to virtual assistants and chatbots, AI has become ubiquitous in our daily lives. Its effects on many industries, and on society at large, are profound.
Adding to these developments, Nvidia, the renowned American graphics processing unit manufacturer, in partnership with researchers from Cornell University, has unveiled an AI video generation model named VideoLDM. The new model can generate high-resolution videos from text descriptions.
VideoLDM: Nvidia AI Video Generation Model
From a text description, the model can create videos with a resolution of up to 2048 x 1280 pixels at 24 frames per second, with a maximum runtime of 4.7 seconds. The model is built on the Stable Diffusion neural network. Of the 4.1 billion parameters in the Nvidia solution, only 2.7 billion were trained on video.
This is quite modest by the standards of modern AI. Using a Latent Diffusion Model (LDM) approach, the engineers were able to produce a wide range of high-definition videos that are both diverse and temporally consistent.
The research team from Nvidia and Cornell University highlights two capabilities of the model: personalized video generation and convolutional-in-time synthesis. For personalization, image LDM backbones that were previously fine-tuned with DreamBooth are combined with the temporal layers trained in VideoLDM, turning text-to-image generation into text-to-video.
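The idea of inserting temporal layers into a frozen image backbone can be illustrated with a minimal sketch. This is not Nvidia's code: the layer shapes, weights, and function names below are hypothetical simplifications, showing only how a per-frame (spatial) layer and a cross-frame (temporal) layer compose.

```python
import numpy as np

# Hypothetical sketch of VideoLDM's core idea: a pretrained image layer
# processes each frame independently, and a newly trained temporal layer
# then mixes information across frames. All names/shapes are illustrative.

def spatial_layer(frames, w):
    # Frozen, pretrained image layer: applied per frame, no cross-frame mixing.
    # frames: (T, C) simplified per-frame features; w: (C, C) weights.
    return frames @ w

def temporal_layer(frames, kernel):
    # Trainable temporal layer: a 1-D convolution along the time axis,
    # giving each frame awareness of its neighbors.
    T, C = frames.shape
    k = len(kernel)
    pad = k // 2
    padded = np.pad(frames, ((pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(frames)
    for t in range(T):
        for j in range(k):
            out[t] += kernel[j] * padded[t + j]
    return out

rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 4))   # 8 frames, 4 feature channels each
w = rng.standard_normal((4, 4))        # frozen spatial weights
kernel = np.array([0.25, 0.5, 0.25])   # learned temporal smoothing kernel

video_features = temporal_layer(spatial_layer(frames, w), kernel)
print(video_features.shape)  # (8, 4): same shape, now temporally mixed
```

Because the temporal layer is convolutional in time, the same weights apply to sequences of any length, which is what allows the trained model to be run on longer clips than it was trained on.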
Slightly longer clips can be produced with no quality loss by applying the learned temporal layers convolutionally over time. Additionally, the model can generate videos of driving scenes, which can last up to 5 minutes at a resolution of 1024×512 pixels.
By using bounding boxes to set up a specific scene, synthesizing a suitable starting frame, and then generating convincing video from it, it is possible to simulate a particular driving scenario. The model can also generate several plausible continuations from a single initial frame, providing multimodal predictions of how a driving scene may unfold.
The research is being presented at the Computer Vision and Pattern Recognition (CVPR) conference, taking place June 18-22 in Vancouver. The described neural network is currently only a research project, and it is unknown when Nvidia will make something similar available to the general public.