NVIDIA has introduced Fugatto, a 2.5 billion-parameter AI audio generator designed to change how sound is created and transformed. Developed by a team of generative AI researchers, Fugatto can produce and manipulate music, voices, and sounds from simple text and audio prompts. Billed as a “Swiss Army knife for sound,” the model pushes the boundaries of AI-powered creativity, enabling users to generate sounds never heard before.
The All-in-One Solution for Audio AI
While many AI models specialize in a single task, such as composing songs or modulating voices, Fugatto handles a range of audio generation and transformation tasks in one model. It can craft music snippets from textual descriptions, add or remove instruments from existing tracks, and more. That versatility opens up uses in several fields:
- Music Production: Artists can prototype song ideas, experiment with styles, and fine-tune tracks.
- Advertising: Agencies can tailor voiceovers with different accents and emotional tones for localized campaigns.
- Education: Language learning tools could replicate a learner’s chosen voice, from a family member to a fictional character.
- Gaming: Developers can dynamically modify or create audio assets based on in-game actions.
Emergent Capabilities in Generative AI Research
Fugatto leverages emergent properties—unexpected abilities arising from its diverse training—allowing users to combine free-form instructions into complex, layered outputs. For instance, it can produce speech in a French accent infused with sadness or blend auditory elements like thunderstorms transitioning into birdsong. With fine-tuning, Fugatto can perform tasks it wasn’t explicitly trained on, such as generating high-quality singing voices from text prompts.
Fugatto’s ComposableART feature enables real-time instruction blending, giving creators nuanced control over attributes like accent intensity or tonal shifts. The model’s temporal interpolation feature further allows users to shape how sound evolves over time, such as crafting a thunderstorm crescendo that transitions into a serene dawn chorus.
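To make the idea of blending instructions over time more concrete, here is a minimal Python sketch. It is not Fugatto's implementation or API; the function names and the toy vectors are hypothetical. It assumes each instruction (for example, "thunderstorm" and "birdsong") yields some numeric conditioning vector, and it combines them with a weighted sum in the style of classifier-free guidance, using weights that shift linearly from the first instruction to the second.

```python
from typing import List, Sequence, Tuple


def crossfade_weights(num_steps: int) -> List[Tuple[float, float]]:
    """Hypothetical linear schedule: weight moves from instruction A to B."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    return [
        (1.0 - i / (num_steps - 1), i / (num_steps - 1))
        for i in range(num_steps)
    ]


def blend(
    uncond: Sequence[float],
    conds: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Weighted blend: uncond + sum_k w_k * (cond_k - uncond).

    `uncond` stands in for an unconditioned signal and each entry of
    `conds` for the signal produced under one instruction (assumed inputs).
    """
    out = list(uncond)
    for cond, w in zip(conds, weights):
        for j, value in enumerate(cond):
            out[j] += w * (value - uncond[j])
    return out


# Toy vectors standing in for "thunderstorm" and "birdsong" conditioning.
thunder = [1.0, 0.0]
birdsong = [0.0, 1.0]
baseline = [0.0, 0.0]

for w_thunder, w_birds in crossfade_weights(5):
    print(blend(baseline, [thunder, birdsong], (w_thunder, w_birds)))
```

Changing the schedule, for instance holding the first weight high before a late drop, would change how quickly one sound gives way to the other. Adjusting a single weight is a simple analogue of controlling how strongly one attribute, such as an accent, comes through.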
“In my tests,” said Rohan Badlani, an AI researcher who helped design the model, “Fugatto often made me feel like an artist.”
Fugatto’s creation was a monumental undertaking. Its 2.5 billion parameters were trained on NVIDIA DGX systems using 32 H100 Tensor Core GPUs. The development team—a global collaboration spanning Brazil, India, China, and beyond—spent over a year curating millions of diverse audio samples and uncovering new relationships in data.