Making an Awesome Transcribing Bot
Harnessing NVIDIA Parakeet for High-Throughput Multilingual Speech-to-Text
Transcription used to be a tedious task, often requiring expensive services or clunky software that struggled with accents and multiple languages. Today, we have access to state-of-the-art models that can handle these challenges with ease. In this project, we’ll explore how to build a powerful transcribing bot using NVIDIA Parakeet.
What is Parakeet?
Parakeet is a family of state-of-the-art Automatic Speech Recognition (ASR) models developed by NVIDIA. Specifically, we’ll be looking at the parakeet-tdt-0.6b-v3 model.
This model is a Token-and-Duration Transducer (TDT), a breakthrough architecture that is significantly faster than traditional models while maintaining incredible accuracy. It’s multilingual, supporting 25 European languages, and it’s designed for high-throughput transcription—perfect for processing large volumes of audio or video content.
Key Features:
- Multilingual Support: Handles English, Spanish, French, German, and many more.
- High Speed: Up to 10x faster than traditional models.
- Precision: Provides accurate word-level timestamps and punctuation.
- Robustness: Works well across various audio qualities and accents.
Installation Guide
Setting up a modern ASR pipeline can be intimidating, but I’ve broken it down into a few simple options depending on your preferred environment.
uv is the modern way to manage Python environments. It’s fast and reliable.
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
# Create environment and install NeMo
uv venv
source .venv/bin/activate
uv pip install -U "nemo_toolkit[asr]" torch
If you prefer a more traditional data science environment, use conda:
# Create and activate environment
conda create -n transcribing-bot python=3.10 -y
conda activate transcribing-bot
# Install dependencies
pip install -U "nemo_toolkit[asr]" torch
For a "set it and forget it" setup with all drivers pre-configured, use Docker:
# Pull the NVIDIA PyTorch container
docker pull nvcr.io/nvidia/pytorch:24.01-py3
# Run with GPU access, mounting the current directory as /workspace
docker run --gpus all -it --rm -v "$(pwd):/workspace" nvcr.io/nvidia/pytorch:24.01-py3
# Alternative: start from a plain CUDA runtime image (see the installation notes below)
docker run --gpus all -it --rm nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04
Then, inside the container, install NeMo:
pip install -U "nemo_toolkit[asr]"
Download a sample audio clip to test with:
wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
Finally, in Python, load the model and transcribe the clip:
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-tdt-0.6b-v3")
output = asr_model.transcribe(['2086-149220-0033.wav'])
print(output[0].text)
Notes on Installing
If you start from the plain CUDA runtime image instead of the PyTorch container, the following sequence sets up a working environment:
docker run --gpus all -it --rm \
nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04
apt update && apt install -y python3-pip
pip install --upgrade pip
pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 \
--index-url https://download.pytorch.org/whl/cu121
pip install "nemo_toolkit[asr]"
A quick import confirms that everything is in place:
import nemo.collections.asr as nemo_asr
Using the Bot
Once you have your environment set up, transcribing audio is remarkably simple with the NeMo library. Here is a basic script to get you started:
import nemo.collections.asr as nemo_asr
# 1. Load the Parakeet model
# This will automatically download the weights from HuggingFace
model_name = "nvidia/parakeet-tdt-0.6b-v3"
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name=model_name)
# 2. Transcribe your audio file
# Note: 16kHz mono .wav files are recommended for best results
audio_file = "my_meeting_recording.wav"
transcription = asr_model.transcribe([audio_file])
# 3. See the results!
print(f"Result: {transcription[0].text}")
Pro Tip: Timestamps
If you need to know when exactly a word was said (e.g., for generating subtitles), the TDT architecture allows you to extract timestamps with high precision. You can enable this by passing extra parameters to the transcribe method.
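As a sketch of how you might use those timestamps (the exact return structure varies between NeMo versions, so the `timestamps=True` argument and the `timestamp["word"]` field names shown in the comments are assumptions, not verified API), the helper below takes word-level entries with `word`, `start`, and `end` keys and groups them into caption-style lines by splitting on pauses:

```python
# Hedged sketch: in recent NeMo releases the call might look like
#   output = asr_model.transcribe(["my_meeting_recording.wav"], timestamps=True)
#   word_entries = output[0].timestamp["word"]
# Each entry is assumed to be a dict with "word", "start", and "end" keys.

def group_into_lines(word_entries, max_gap=0.8):
    """Group word-timestamp dicts into caption lines, starting a new
    line whenever the pause between words exceeds max_gap seconds."""
    groups = []
    current = []
    for entry in word_entries:
        if current and entry["start"] - current[-1]["end"] > max_gap:
            groups.append(current)
            current = []
        current.append(entry)
    if current:
        groups.append(current)
    # Render each group as a (start, end, text) tuple
    return [
        (grp[0]["start"], grp[-1]["end"], " ".join(w["word"] for w in grp))
        for grp in groups
    ]

# Hand-made entries standing in for real model output
sample = [
    {"word": "hello", "start": 0.0, "end": 0.4},
    {"word": "world", "start": 0.5, "end": 0.9},
    {"word": "goodbye", "start": 2.5, "end": 3.0},
]
print(group_into_lines(sample))
# → [(0.0, 0.9, 'hello world'), (2.5, 3.0, 'goodbye')]
```

The pause threshold (0.8 s here) is just a starting point; tune it to taste for your subtitle length and reading speed.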
Why This Matters
For engineers, ASR is a building block for countless applications:
- Meeting Summarizers: Automatically capture and summarize team discussions.
- Accessibility Tools: Provide real-time captions for video content.
- Searchable Archives: Turn years of audio recordings into a searchable database.
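As a toy illustration of the searchable-archive idea, here is a minimal in-memory word index over transcript text (plain Python, no external dependencies; the file names and transcript strings are made up, and a real system would use a proper search engine):

```python
from collections import defaultdict

def build_index(transcripts):
    """Map each lowercase word to the set of recording IDs containing it."""
    index = defaultdict(set)
    for rec_id, text in transcripts.items():
        for word in text.lower().split():
            index[word.strip(".,!?")].add(rec_id)
    return index

# Hypothetical transcripts, as the bot above might produce them
transcripts = {
    "standup_monday.wav": "We shipped the new login page.",
    "standup_tuesday.wav": "The login bug is fixed, release on Friday.",
}
index = build_index(transcripts)
print(sorted(index["login"]))
# → ['standup_monday.wav', 'standup_tuesday.wav']
```

Feeding every transcription the bot produces into an index like this turns an audio archive into something you can grep in milliseconds.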
By using an open-source, high-performance model like Parakeet, you’re not just building a bot; you’re creating a scalable solution that respects data privacy and offers professional-grade performance.
The world is full of audio waiting to be understood—let’s get to work!
Putting What We Learned into Practice
There is a very insightful segment from the movie Waking Life (2001)[1] that I think captures a lot of recent events and how fast we seem to be evolving.
"Telescopic Evolution" (Waking Life)