Sunday, December 01, 2024

New AI Audio Model by Nvidia Synthesizes Sounds That Have Never Existed

Amazing stuff! 

Maybe this is exactly what, say, contemporary music needs, since it has felt so stale for years now. A new sound machine!

"Nvidia’s newly revealed ‘Fugatto’ model looks to go a step further, using new synthetic training methods and inference-level combination techniques to “transform any mix of music, voices, and sounds,” including the synthesis of sounds that have never existed.

While Fugatto isn’t available for public testing yet, a sample-filled website showcases how Fugatto can be used to dial a number of distinct audio traits and descriptions up or down, resulting in everything from the sound of saxophones barking to people speaking underwater to ambulance sirens singing in a kind of choir. While the results on display can be a bit hit or miss, the vast array of capabilities on display here helps support Nvidia’s description of Fugatto as ‘a Swiss Army knife for sound.’"

"... Fugatto is a foundational generative transformer model that builds on the team’s prior work in areas such as speech modeling, audio vocoding and audio understanding.

The full version uses 2.5 billion parameters and was trained on a bank of NVIDIA DGX systems packing 32 NVIDIA H100 Tensor Core GPUs.

Fugatto was made by a diverse group of people from around the world, including India, Brazil, China, Jordan and South Korea. Their collaboration made Fugatto’s multi-accent and multilingual capabilities stronger. ..."

From the abstract:
"Fugatto is a versatile audio synthesis and transformation model capable of following free-form text instructions with optional audio inputs. While large language models (LLMs) trained with text on a simple next-token prediction objective can learn to infer instructions directly from the data, models trained solely on audio data lack this capacity. This is because audio data does not inherently contain the instructions that were used to generate it. To overcome this challenge, we introduce a specialized dataset generation approach optimized for producing a wide range of audio generation and transformation tasks, ensuring the data reveals meaningful relationships between audio and language. Another challenge lies in achieving compositional abilities -- such as combining, interpolating between, or negating instructions -- using data alone. To address it, we propose ComposableART, an inference-time technique that extends classifier-free guidance to compositional guidance. It enables the seamless and flexible composition of instructions, leading to highly customizable audio outputs outside the training distribution. Our evaluations across a diverse set of tasks demonstrate that Fugatto performs competitively with specialized models, while ComposableART enhances its sonic palette and control over synthesis. Most notably, we highlight our framework's ability to execute emergent sounds and tasks -- sonic phenomena that transcend conventional audio generation -- unlocking new creative possibilities. Demo website: https://fugatto.github.io/"
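To make the "classifier-free guidance extended to compositional guidance" idea concrete, here is a minimal numerical sketch. This is not Nvidia's ComposableART implementation (which is not public); it only illustrates the general principle behind such techniques: a model produces an unconditional prediction and one conditional prediction per instruction, and the guidance directions are combined with per-instruction weights. Positive weights combine instructions, fractional weights interpolate between them, and a negative weight pushes away from (negates) an instruction. All function and variable names are illustrative.

```python
import numpy as np

def cfg(uncond, cond, scale):
    # Standard classifier-free guidance: move the unconditional
    # prediction toward the conditional one by a guidance scale.
    return uncond + scale * (cond - uncond)

def compositional_guidance(uncond, conds, weights):
    # Hypothetical compositional variant: sum the guidance directions
    # contributed by several instruction conditionings. With a single
    # condition and weight w this reduces to plain CFG with scale w;
    # a negative weight steers the output away from that instruction.
    out = uncond.copy()
    for cond, w in zip(conds, weights):
        out += w * (cond - uncond)
    return out

# Toy example with stand-in "model predictions" (random vectors):
rng = np.random.default_rng(0)
uncond = rng.normal(size=8)          # prediction with no instruction
cond_bark = rng.normal(size=8)       # prediction for "barking"
cond_sax = rng.normal(size=8)        # prediction for "saxophone"

# Combine two instructions, weighting each equally:
mixed = compositional_guidance(uncond, [cond_bark, cond_sax], [1.5, 1.5])

# Negate one instruction while keeping the other:
contrast = compositional_guidance(uncond, [cond_bark, cond_sax], [2.0, -1.0])
```

The appeal of doing this at inference time, as the abstract notes, is that combinations like "barking saxophone" never need to appear in the training data: the model is only queried once per instruction, and the blending happens in the guidance arithmetic.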

New AI Audio Model Synthesizes Sounds That Have Never Existed - Human Progress

Nvidia’s new AI audio model can synthesize sounds that have never existed "What does a screaming saxophone sound like? The Fugatto model has an answer..."

Now Hear This: World’s Most Flexible Sound Machine Debuts (company blog post) "Using text and audio as inputs, a new generative AI model from NVIDIA can create any combination of music, voices and sounds."

Fugatto 1: Foundational Generative Audio Transformer Opus 1 (open access; blind submission to ICLR 2025)
