Audio to SRT Converter with Speaker Diarization
Convert audio files to formatted SRT subtitles with automatic speaker detection and identification.
Running on: CPU | Processing time: 5-15 minutes
Step 1: Authentication
Required: You need a Hugging Face token for speaker diarization.
- Create a free account at Hugging Face (if you don't have one)
- Get your token at Settings → Access Tokens
- Accept the user agreement at pyannote/speaker-diarization-3.1
- Paste your token below (starts with hf_...)
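Under the hood, the token is what lets the app download the gated pyannote model. A minimal sketch of that step, assuming the standard pyannote.audio API (the Space's actual loading code may differ):

```python
from pyannote.audio import Pipeline

# The gated model only downloads if your token belongs to an account
# that accepted the user agreement for pyannote/speaker-diarization-3.1.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_...",  # replace with your own token
)
```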
Step 2: Upload Your Audio
Supports MP3, WAV, Opus, M4A, and most audio formats
Step 3: Identify Speakers (Optional)
The system automatically detects up to 3 speakers in order of appearance.
- Without names: Speakers appear as "Speaker 00", "Speaker 01", etc.
- With names: Your custom names appear instead (e.g., "Daniel", "Sarah")
- Descriptions: Optional notes to help you identify speakers (not shown in output)
Tip: Listen to the first 30 seconds of your audio to identify who speaks first!
Expected processing time:
• Transcription: 2-5 minutes
• Speaker detection: 3-10 minutes
• Formatting: ~30 seconds
Watch the progress bar for real-time updates!
How This Tool Works
Process Overview
Audio Upload
- Upload any audio file (MP3, WAV, M4A, Opus, etc.)
- File is automatically converted to WAV format for processing
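One common way to do this conversion is pydub (an ffmpeg wrapper); the sketch below assumes that library and a 16 kHz mono target, which both Whisper and pyannote handle well. The Space's actual conversion step may differ.

```python
from pydub import AudioSegment

# Load any supported format and export a 16 kHz mono WAV for processing.
audio = AudioSegment.from_file("episode.m4a")
audio = audio.set_frame_rate(16000).set_channels(1)
audio.export("episode.wav", format="wav")
```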
Speech-to-Text Transcription
- Uses OpenAI's Whisper (large-v2 model)
- Generates accurate word-level timestamps
- Supports English-language audio
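For reference, word-level timestamps look roughly like this with the openai-whisper package (the Space may use a different wrapper, such as faster-whisper):

```python
import whisper

model = whisper.load_model("large-v2")
result = model.transcribe("episode.wav", language="en", word_timestamps=True)

# Each segment carries a list of words with individual start/end times.
for segment in result["segments"]:
    for word in segment["words"]:
        print(f'{word["start"]:7.2f} {word["end"]:7.2f} {word["word"]}')
```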
Speaker Diarization
- Uses Pyannote Audio 3.1 for speaker detection
- Automatically identifies up to 3 different speakers
- Labels speakers in order of first appearance
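A sketch of how the speaker cap and first-appearance labeling could be implemented, reusing the `pipeline` object from Step 1 (illustrative only, not the Space's exact code):

```python
# Cap detection at 3 speakers and relabel them in order of first appearance.
diarization = pipeline("episode.wav", max_speakers=3)

labels = {}
for turn, _, raw_label in diarization.itertracks(yield_label=True):
    if raw_label not in labels:
        labels[raw_label] = f"Speaker {len(labels):02d}"
    print(f"{turn.start:6.1f}s - {turn.end:6.1f}s  {labels[raw_label]}")
```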
Text Cleaning & Formatting
- Removes filler words (um, uh, like, you know, etc.)
- Splits text into readable sentence blocks
- Adds speaker labels to each subtitle
- Generates standard SRT format
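A rough sketch of the cleaning step (the Space's actual filler list and sentence splitter are not published here, so treat this as illustrative):

```python
import re

# Naive filler-word removal; a real implementation needs more care with
# words like "like" that are sometimes meaningful.
FILLERS = re.compile(r"\b(um+|uh+|erm*|you know|like)\b,?\s*", re.IGNORECASE)

def clean_text(text: str) -> str:
    text = FILLERS.sub("", text)
    return re.sub(r"\s{2,}", " ", text).strip()

def split_sentences(text: str) -> list[str]:
    # One sentence per subtitle block, split on ., ! or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
```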
Features
- Automatic speaker detection - No manual marking needed
- Custom speaker names - Replace "Speaker 00" with real names
- Clean text - Filler words automatically removed
- Smart formatting - One speaker per subtitle, one sentence per block
- Standard SRT format - Works with all video players and editors
- GPU acceleration - Fast processing on T4 GPU
Tips for Best Results
Before Processing
- Listen to the first minute of your audio to identify speakers
- Note the order speakers appear (first voice = Voice 1, etc.)
- Use clear names for easy identification in subtitles
Audio Quality
- Better audio quality = more accurate transcription
- Minimize background noise for best speaker detection
- Clear speech separation helps diarization accuracy
Speaker Identification
- You don't need to fill in all 3 voices if you have fewer speakers
- If you skip speaker names, output will show "Speaker 00", "Speaker 01", etc.
- Descriptions are just for your reference and don't affect the output
Output Format
Your SRT file will look like this:
1
00:00:01,234 --> 00:00:05,678
(Daniel) Welcome to the podcast.
2
00:00:06,123 --> 00:00:10,456
(Sarah) Thanks for having me.
3
00:00:11,789 --> 00:00:15,234
(Daniel) Let's dive into today's topic.
Each subtitle block includes:
- Subtitle number
- Start and end timestamps (HH:MM:SS,mmm format)
- Speaker name in parentheses
- Cleaned, formatted text
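The timestamp and block layout can be produced with a couple of small helpers; the function names below are illustrative, not the Space's actual code:

```python
def srt_timestamp(seconds: float) -> str:
    # SRT requires HH:MM:SS,mmm with a comma before the milliseconds.
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_block(index: int, start: float, end: float, speaker: str, text: str) -> str:
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n({speaker}) {text}\n"

# Reproduces the first block from the example above.
print(srt_block(1, 1.234, 5.678, "Daniel", "Welcome to the podcast."))
```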
Troubleshooting
"Error: You need to accept the user agreement"
- Visit pyannote/speaker-diarization-3.1
- Click "Agree and access repository"
- Try processing again
"Error: Invalid Hugging Face token"
- Check your token at HF Settings
- Make sure you copied the full token (starts with hf_)
- Generate a new token if needed
Processing takes too long
- Normal processing: 5-15 minutes for typical audio files
- First run may download models (~1-2 GB)
- Longer files (60+ minutes) may take 20-30 minutes
Wrong speaker labels
- Speakers are detected in order of first appearance
- Voice 1 = first person to speak, Voice 2 = second, etc.
- Re-listen to your audio to identify the correct order
Privacy & Security
- Your audio files are processed temporarily and not stored
- Your HF token is only used for this session and never saved
- All processing happens on Hugging Face's secure infrastructure
- Generated SRT files are temporarily stored for download only
Technical Details
Models Used:
- Whisper large-v2 (OpenAI) - Speech-to-text
- Pyannote 3.1 - Speaker diarization
Hardware:
- NVIDIA T4 GPU with CUDA support
- 16GB GPU memory
- Automatic FP16 optimization
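In practice this usually means moving models to CUDA when it is available and letting Whisper decode in FP16 there; a sketch, assuming the `model` and `pipeline` objects from the earlier examples:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# pyannote pipelines are moved explicitly; openai-whisper's transcribe()
# takes an fp16 flag and falls back to FP32 on CPU.
pipeline.to(device)
result = model.transcribe("episode.wav", fp16=(device.type == "cuda"))
```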
Supported Audio Formats: MP3, WAV, M4A, AAC, Opus, FLAC, OGG, WMA, and more
Support
If you encounter issues or have suggestions, please visit the Space's community tab or create an issue.