Audio to SRT Converter with Speaker Diarization
Convert audio files to formatted SRT subtitles with automatic speaker detection and identification.
Running on: CPU | Processing time: 5-15 minutes
Step 1: Authentication
Required: You need a Hugging Face token for speaker diarization.
- Create a free account at Hugging Face (if you don't have one)
- Get your token at Settings → Access Tokens
- Accept the user agreement at pyannote/speaker-diarization-3.1
- Paste your token below (starts with hf_...)
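Under the hood, the token is what lets the app download the gated pyannote model. A minimal sketch of that step, assuming the standard pyannote.audio API (the Space's actual loading code may differ):

```python
from pyannote.audio import Pipeline

# The gated model only downloads if your token belongs to an account
# that accepted the user agreement for pyannote/speaker-diarization-3.1.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_...",  # replace with your own token
)
```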
Step 2: Upload Your Audio
Supports MP3, WAV, Opus, M4A, and most audio formats
Step 3: Identify Speakers (Optional)
The system automatically detects up to 3 speakers in order of appearance.
- Without names: Speakers appear as "Speaker 00", "Speaker 01", etc.
- With names: Your custom names appear instead (e.g., "Daniel", "Sarah")
- Descriptions: Optional notes to help you identify speakers (not shown in output)
Tip: Listen to the first 30 seconds of your audio to identify who speaks first!
Expected processing time:
• Transcription: 2-5 minutes
• Speaker detection: 3-10 minutes
• Formatting: ~30 seconds
Watch the progress bar for real-time updates!
How This Tool Works
Process Overview
Audio Upload
- Upload any audio file (MP3, WAV, M4A, Opus, etc.)
- File is automatically converted to WAV format for processing
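One common way to do this conversion is pydub (an ffmpeg wrapper); the sketch below assumes that library and a 16 kHz mono target, which both Whisper and pyannote handle well. The Space's actual conversion step may differ.

```python
from pydub import AudioSegment

# Load any supported format and export a 16 kHz mono WAV for processing.
audio = AudioSegment.from_file("episode.m4a")
audio = audio.set_frame_rate(16000).set_channels(1)
audio.export("episode.wav", format="wav")
```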
Speech-to-Text Transcription
- Uses OpenAI's Whisper (large-v2 model)
- Generates accurate word-level timestamps
- Supports English-language audio
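For reference, word-level timestamps look roughly like this with the openai-whisper package (the Space may use a different wrapper, such as faster-whisper):

```python
import whisper

model = whisper.load_model("large-v2")
result = model.transcribe("episode.wav", language="en", word_timestamps=True)

# Each segment carries a list of words with individual start/end times.
for segment in result["segments"]:
    for word in segment["words"]:
        print(f'{word["start"]:7.2f} {word["end"]:7.2f} {word["word"]}')
```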
Speaker Diarization
- Uses Pyannote Audio 3.1 for speaker detection
- Automatically identifies up to 3 different speakers
- Labels speakers in order of first appearance
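A sketch of how the speaker cap and first-appearance labeling could be implemented, reusing the `pipeline` object from Step 1 (illustrative only, not the Space's exact code):

```python
# Cap detection at 3 speakers and relabel them in order of first appearance.
diarization = pipeline("episode.wav", max_speakers=3)

labels = {}
for turn, _, raw_label in diarization.itertracks(yield_label=True):
    if raw_label not in labels:
        labels[raw_label] = f"Speaker {len(labels):02d}"
    print(f"{turn.start:6.1f}s - {turn.end:6.1f}s  {labels[raw_label]}")
```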
Text Cleaning & Formatting
- Removes filler words (um, uh, like, you know, etc.)
- Splits text into readable sentence blocks
- Adds speaker labels to each subtitle
- Generates standard SRT format
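A rough sketch of the cleaning step (the Space's actual filler list and sentence splitter are not published here, so treat this as illustrative):

```python
import re

# Naive filler-word removal; a real implementation needs more care with
# words like "like" that are sometimes meaningful.
FILLERS = re.compile(r"\b(um+|uh+|erm*|you know|like)\b,?\s*", re.IGNORECASE)

def clean_text(text: str) -> str:
    text = FILLERS.sub("", text)
    return re.sub(r"\s{2,}", " ", text).strip()

def split_sentences(text: str) -> list[str]:
    # One sentence per subtitle block, split on ., ! or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
```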
Features
- Automatic speaker detection - No manual marking needed
- Custom speaker names - Replace "Speaker 00" with real names
- Clean text - Filler words automatically removed
- Smart formatting - One speaker per subtitle, one sentence per block
- Standard SRT format - Works with all video players and editors
- GPU acceleration - Fast processing on T4 GPU
Tips for Best Results
Before Processing
- Listen to the first minute of your audio to identify speakers
- Note the order speakers appear (first voice = Voice 1, etc.)
- Use clear names for easy identification in subtitles
Audio Quality
- Better audio quality = more accurate transcription
- Minimize background noise for best speaker detection
- Clear speech separation helps diarization accuracy
Speaker Identification
- You don't need to fill in all 3 voices if you have fewer speakers
- If you skip speaker names, output will show "Speaker 00", "Speaker 01", etc.
- Descriptions are just for your reference and don't affect the output
Output Format
Your SRT file will look like this:
1
00:00:01,234 --> 00:00:05,678
(Daniel) Welcome to the podcast.
2
00:00:06,123 --> 00:00:10,456
(Sarah) Thanks for having me.
3
00:00:11,789 --> 00:00:15,234
(Daniel) Let's dive into today's topic.
Each subtitle block includes:
- Subtitle number
- Start and end timestamps (HH:MM:SS,mmm format)
- Speaker name in parentheses
- Cleaned, formatted text
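The timestamp and block layout can be produced with a couple of small helpers; the function names below are illustrative, not the Space's actual code:

```python
def srt_timestamp(seconds: float) -> str:
    # SRT requires HH:MM:SS,mmm with a comma before the milliseconds.
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_block(index: int, start: float, end: float, speaker: str, text: str) -> str:
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n({speaker}) {text}\n"

# Reproduces the first block from the example above.
print(srt_block(1, 1.234, 5.678, "Daniel", "Welcome to the podcast."))
```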
Troubleshooting
"Error: You need to accept the user agreement"
- Visit pyannote/speaker-diarization-3.1
- Click "Agree and access repository"
- Try processing again
"Error: Invalid Hugging Face token"
- Check your token at HF Settings
- Make sure you copied the full token (starts with hf_)
- Generate a new token if needed
Processing takes too long
- Normal processing: 5-15 minutes for typical audio files
- First run may download models (~1-2 GB)
- Longer files (60+ minutes) may take 20-30 minutes
Wrong speaker labels
- Speakers are detected in order of first appearance
- Voice 1 = first person to speak, Voice 2 = second, etc.
- Re-listen to your audio to identify the correct order
Privacy & Security
- Your audio files are processed temporarily and not stored
- Your HF token is only used for this session and never saved
- All processing happens on Hugging Face's secure infrastructure
- Generated SRT files are temporarily stored for download only
Technical Details
Models Used:
- Whisper large-v2 (OpenAI) - Speech-to-text
- Pyannote 3.1 - Speaker diarization
Hardware:
- NVIDIA T4 GPU with CUDA support
- 16GB GPU memory
- Automatic FP16 optimization
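In practice this usually means moving models to CUDA when it is available and letting Whisper decode in FP16 there; a sketch, assuming the `model` and `pipeline` objects from the earlier examples:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# pyannote pipelines are moved explicitly; openai-whisper's transcribe()
# takes an fp16 flag and falls back to FP32 on CPU.
pipeline.to(device)
result = model.transcribe("episode.wav", fp16=(device.type == "cuda"))
```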
Supported Audio Formats: MP3, WAV, M4A, AAC, Opus, FLAC, OGG, WMA, and more
Support
If you encounter issues or have suggestions, please visit the Space's community tab or create an issue.