Audio to SRT Converter with Speaker Diarization

Convert audio files to formatted SRT subtitles with automatic speaker detection and identification.

Running on: CPU | Processing time: 5-15 minutes

Step 1: Authentication

Required: You need a Hugging Face token for speaker diarization.
  1. Create a free account at Hugging Face (if you don't have one)
  2. Get your token at Settings → Access Tokens
  3. Accept the user agreements at pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0 (the diarization pipeline depends on both)
  4. Paste your token below (starts with hf_...)
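
If you want to sanity-check a token before pasting it, the snippet below is a minimal sketch of a format check based only on the hf_ prefix mentioned above. The function name looks_like_hf_token is hypothetical, and the check does not contact Hugging Face, so it cannot tell whether a token is valid or has access to the pyannote models.

```python
def looks_like_hf_token(token: str) -> bool:
    """Cheap format check only (hypothetical helper).

    It confirms the hf_ prefix and that something follows it.
    It does NOT verify the token with Hugging Face.
    """
    token = token.strip()  # ignore whitespace picked up when copy-pasting
    return token.startswith("hf_") and len(token) > len("hf_")


if __name__ == "__main__":
    print(looks_like_hf_token("hf_exampleToken123"))  # True
    print(looks_like_hf_token("exampleToken123"))     # False
```

A check like this only catches paste mistakes such as a missing prefix or an empty field; an expired token or an unaccepted user agreement will still only show up when the diarization model is loaded.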

Step 2: Upload Your Audio

Supports MP3, WAV, Opus, M4A, and most other audio formats.

Step 3: Identify Speakers (Optional)

The system automatically detects up to 3 speakers and numbers them in order of first appearance.

  • Without names: Speakers appear as "Speaker 00", "Speaker 01", etc.
  • With names: Your custom names appear instead (e.g., "Daniel", "Sarah")
  • Descriptions: Optional notes to help you identify speakers (not shown in output)
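
The naming rules above can be sketched roughly as follows. This is an illustration under assumptions, not the app's actual code: the helper names, the (start_seconds, raw_label) segment shape, and the raw labels "A" and "B" are all hypothetical.

```python
from typing import Dict, List, Tuple


def order_by_first_appearance(segments: List[Tuple[float, str]]) -> Dict[str, int]:
    """Map each raw diarization label to an index in order of first appearance.

    segments is assumed to be a list of (start_seconds, raw_label) pairs.
    """
    order: Dict[str, int] = {}
    for _, raw_label in sorted(segments, key=lambda seg: seg[0]):
        # setdefault only assigns an index the first time a label is seen
        order.setdefault(raw_label, len(order))
    return order


def display_name(index: int, custom_names: List[str]) -> str:
    """Return the custom name if one was entered, else the default label."""
    if index < len(custom_names) and custom_names[index].strip():
        return custom_names[index].strip()
    return f"Speaker {index:02d}"


if __name__ == "__main__":
    segments = [(5.0, "B"), (0.0, "A"), (7.5, "A")]
    order = order_by_first_appearance(segments)
    names = ["Daniel", ""]  # second name left blank
    for raw_label, index in order.items():
        print(raw_label, "->", display_name(index, names))
```

In this sketch the first voice heard gets the first name you enter, which is why listening to the opening of the recording helps you fill in the fields in the right order.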

Tip: Listen to the first 30 seconds of your audio to identify who speaks first!


Expected processing time:
• Transcription: 2-5 minutes
• Speaker detection: 3-10 minutes
• Formatting: ~30 seconds

Watch the progress bar for real-time updates!
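
For reference, the formatting step produces SRT cues: a counter, a time range written as HH:MM:SS,mmm --> HH:MM:SS,mmm, and the subtitle text. The sketch below shows one way to build such a cue in Python. The function names and the "Name: text" layout for the speaker are assumptions for illustration; the app's actual output layout may differ.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def srt_cue(index: int, start: float, end: float, speaker: str, text: str) -> str:
    """Build one SRT cue; the 'speaker: text' layout is an assumed convention."""
    return (
        f"{index}\n"
        f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
        f"{speaker}: {text}\n"
    )


if __name__ == "__main__":
    cues = [
        srt_cue(1, 0.0, 2.5, "Daniel", "Hello."),
        srt_cue(2, 2.5, 4.0, "Speaker 01", "Hi there."),
    ]
    # Cues in an SRT file are separated by a blank line
    print("\n".join(cues))
```

Note that SRT uses a comma, not a period, before the milliseconds; players can be strict about this.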

Results