Common Sense: NVIDIA Open Sources Parakeet TDT 0.6B: Achieving a New Standard for Automatic Speech Recognition ASR and Transcribes an Hour of Audio in One Second

Thursday, May 08, 2025

NVIDIA Open Sources Parakeet TDT 0.6B: Achieving a New Standard for Automatic Speech Recognition ASR and Transcribes an Hour of Audio in One Second

This seems to be quite impressive!

Apparently, the Parakeet ASR models were first released by NVIDIA in January 2024. It published a research paper on it in September of 2023 (see below)

"NVIDIA has unveiled Parakeet TDT 0.6B, a state-of-the-art automatic speech recognition (ASR) model that is now fully open-sourced on Hugging Face. With 600 million parameters, a commercially permissive CC-BY-4.0 license, and a staggering real-time factor (RTF) of 3386, this model sets a new benchmark for performance and accessibility in speech AI.

Blazing Speed and Accuracy

At the heart of Parakeet TDT 0.6B’s appeal is its unmatched speed and transcription quality. The model can transcribe 60 minutes of audio in just one second, a performance that’s over 50x faster than many existing open ASR models. On Hugging Face’s Open ASR Leaderboard, Parakeet V2 achieves a 6.05% word error rate (WER)—the best-in-class among open models. ..."

NVIDIA Open Sources Parakeet TDT 0.6B: Achieving a New Standard for Automatic Speech Recognition ASR and Transcribes an Hour of Audio in One Second - MarkTechPost

Investigating End-to-End ASR Architectures for Long Form Audio Transcription (open access)

Figure 2. Architecture of the NVIDIA Parakeet encoder with blocks of downsampling and subsampling, conformer encoder blocks with limited context attention (LCA), and global token (GT)

Thursday, May 08, 2025

NVIDIA Open Sources Parakeet TDT 0.6B: Achieving a New Standard for Automatic Speech Recognition ASR and Transcribes an Hour of Audio in One Second

No comments: