Nvidia has unveiled Fugatto 1 (Foundational Generative Audio Transformer Opus 1), a generative AI model that creates novel sounds from simple text prompts and optional contextual audio inputs. The company sees Fugatto "as a tool for creatives, enabling them to quickly realize their sonic dreams and unheard sounds—an instrument for imagination, not just a substitute for creativity."
Research Insights
In their research paper, the Nvidia team explains that large language models (LLMs) trained on text can infer how to carry out instructions because text data embeds rich context about how it was produced. Models trained solely on audio lack this capability, since recorded audio carries no comparable information about how it was generated.
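The paper's remedy, broadly, is to have text models synthesize instructions and captions for existing audio, creating the instruction-to-sound pairings that raw recordings lack. The Python sketch below illustrates that general idea only; the helpers describe_audio and llm_rewrite_as_instruction are hypothetical stand-ins, not Nvidia's actual pipeline.

```python
# Hypothetical sketch: pairing raw audio with synthesized instructions so a
# model can learn text-conditioned audio generation. The helper names below
# are illustrative placeholders, not Nvidia's API.

from dataclasses import dataclass

@dataclass
class TrainingExample:
    audio_path: str
    instruction: str

def describe_audio(audio_path: str) -> str:
    """Placeholder: an audio-captioning model would return a description here."""
    return "a dog barking over an electronic dance beat"

def llm_rewrite_as_instruction(caption: str) -> str:
    """Placeholder: a text LLM would turn the caption into a generation instruction."""
    return f"Generate {caption}."

def build_example(audio_path: str) -> TrainingExample:
    caption = describe_audio(audio_path)
    return TrainingExample(audio_path, llm_rewrite_as_instruction(caption))

print(build_example("clips/edm_dog.wav").instruction)
```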
Technical Details
Nvidia's Fugatto employs a purpose-built dataset that spans a broad range of sounds, along with a technique for interpreting and combining instructions known as ComposableART. This gives the model emergent abilities to blend and transform sounds, including combinations of instructions it was never specifically trained on.
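Public descriptions of ComposableART frame it as weighting and mixing instructions at inference time so users can blend them continuously. The numpy sketch below illustrates that weighted-composition idea, assuming the model exposes a per-instruction prediction step; model_output is a placeholder, not Nvidia's implementation.

```python
# A minimal numpy sketch of weighted instruction composition, in the spirit of
# ComposableART as described publicly (not Nvidia's actual code). model_output()
# stands in for one generation step conditioned on a single instruction.

import numpy as np

rng = np.random.default_rng(0)

def model_output(latent: np.ndarray, instruction: str) -> np.ndarray:
    """Placeholder for the model's prediction under one instruction."""
    seed = abs(hash(instruction)) % (2**32)
    return latent + np.random.default_rng(seed).normal(size=latent.shape)

def composed_output(latent, instructions, weights):
    """Blend per-instruction predictions with user-chosen weights."""
    weights = np.asarray(weights, dtype=float) / np.sum(weights)  # normalize to sum to 1
    preds = [model_output(latent, ins) for ins in instructions]
    return sum(w * p for w, p in zip(weights, preds))

latent = rng.normal(size=(8,))
mix = composed_output(latent, ["saxophone playing", "cat meowing"], [0.6, 0.4])
print(mix.shape)
```

In a real system the per-instruction predictions would come from the trained model, and the weights act as a dial letting an artist slide between, say, "saxophone" and "meowing."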
Demonstration Examples
Nvidia has published several demonstrations of the model's capabilities on Fugatto's GitHub page. Notable examples include a dog barking in time with electronic dance music, a typewriter that softly whispers each letter as it is typed, and a saxophone that barks or meows.
For now, Nvidia has no plans to make the model publicly available.