Purpose
Transcribing spoken audio and identifying who said what is a critical but time-consuming task for journalists, researchers, investigators, legal professionals, and podcasters. Most built-in transcription tools offer limited accuracy, no speaker identification, and poor handling of long or noisy recordings.
This tool solves that by combining OpenAI’s Whisper (for highly accurate transcription) with pyannote-audio (for speaker diarization) in a streamlined, flexible script that runs on your local machine — no cloud services or APIs required once installed.
What This Tool Does
This script takes any audio file (e.g., `.m4a`, `.mp3`, `.wav`) and produces:
- A full transcript of the spoken content using OpenAI’s Whisper model
- Automatically labeled speaker turns using pyannote’s diarization pipeline
- Output in:
  - Human-readable `.txt` format with timestamps and speaker labels
  - Optional `.srt` subtitles for use in video editors or players
Features
- Automatic conversion of any supported audio format to `.wav` using ffmpeg
- Caching of transcripts so you can resume without reprocessing
- A `--force` option to reprocess transcription
- A `--srt` option to generate subtitles
Use Cases
- Interview and focus group transcription with clear speaker labeling
- Audio logs, meeting notes, or legal depositions
- Podcast editing or subtitling
- Case studies and research involving qualitative audio
Platform Support
This tool works on:
- macOS
- Windows
- Linux
All you need is a working Python 3.9+ environment, ffmpeg, and access to a Hugging Face account for the diarization model.
Prerequisites
Python 3.9 or newer
Recommended installation: https://www.python.org/downloads/
Verify with:
python --version
ffmpeg
Required for converting audio files to `.wav` format.
- macOS: brew install ffmpeg
- Windows: Download from https://ffmpeg.org/download.html and add it to your system PATH.
- Linux: sudo apt install ffmpeg
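To confirm ffmpeg is installed and visible on your PATH, run:
ffmpeg -version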
Virtual Environment (Recommended)
To isolate dependencies and avoid version conflicts:
python -m venv diarize_env
source diarize_env/bin/activate # On Windows: diarize_env\Scripts\activate
Python Packages
These versions match the expected environment of the pretrained diarization models:
pip install torch==1.13.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
pip install pyannote.audio==1.1.1
pip install git+https://github.com/openai/whisper.git
pip install ffmpeg-python tqdm
Note: Newer versions of `torch` or `pyannote.audio` may introduce breaking changes. I recommend using the versions shown above for consistent results.
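As a quick sanity check after installing, you can try the same imports the script itself relies on; if this command fails, the environment is not set up the way the script expects:
python -c "import torch, whisper; from pyannote.audio import Pipeline; print('torch', torch.__version__)"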
Hugging Face Account & Access Token
`pyannote.audio` requires a token to access gated models.
- Create an account: https://huggingface.co
- Create a token: https://huggingface.co/settings/tokens
- Accept the model terms of use for the following repositories:
  - https://huggingface.co/pyannote/speaker-diarization
  - https://huggingface.co/pyannote/segmentation
  - https://huggingface.co/pyannote/embedding
  - https://huggingface.co/pyannote/clustering
- Add your token to the script under:
HUGGINGFACE_TOKEN = "your_token_here"
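If you prefer not to hard-code the token, one small variation (my suggestion, not part of the script below) is to read it from an environment variable and fall back to the placeholder:

import os

# Assumes you export HUGGINGFACE_TOKEN in your shell before running the tool.
HUGGINGFACE_TOKEN = os.environ.get("HUGGINGFACE_TOKEN", "your_token_here")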
Running the Tool
Once prerequisites are met and the script is configured, run the tool from your terminal:
python transcribe_diarize.py <input_audio_file> [--force] [--srt]
Arguments
- `<input_audio_file>`: The path to the audio file you want to transcribe (supports `.m4a`, `.mp3`, `.wav`, etc.)
Options
- `--force`: Forces re-running Whisper transcription, even if cached.
- `--srt`: Generates a `.srt` subtitle file in addition to the plain text output.
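Running the tool again on the same file without `--force` reuses the cached Whisper transcription and only re-runs diarization and alignment:
python transcribe_diarize.py interview.m4a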
Output Files
For an input file named `interview.m4a`, the script will produce the following files. `<hash>` is a short identifier (the first 12 characters of the input file's SHA-256 hash) that the script adds so cached results stay tied to a specific recording.

| File | Description |
| --- | --- |
| `interview.wav` | Converted audio file used by Whisper and pyannote |
| `interview_<hash>_whisper_transcript.json` | Cached transcription result (used for resuming) |
| `interview_<hash>_diarized_transcript.txt` | Text transcript with speaker labels and timestamps |
| `interview_<hash>_diarized_transcript.srt` | Subtitle file (if `--srt` is used) |
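For reference, the two transcript formats look like this (illustrative placeholder content; speaker names such as SPEAKER_00 are assigned automatically by pyannote):

interview_<hash>_diarized_transcript.txt:
[SPEAKER_00] 0:00:12 – Thanks for joining me today.
[SPEAKER_01] 0:00:15 – Happy to be here.

interview_<hash>_diarized_transcript.srt:
1
00:00:12,340 --> 00:00:15,800
SPEAKER_00: Thanks for joining me today.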
Example Usage
python transcribe_diarize.py interview.m4a --force --srt
This command will:
- Convert `interview.m4a` to `.wav`
- Re-run Whisper transcription
- Run speaker diarization
- Output both `.txt` and `.srt` transcripts
Script
Place the following code in a file named `transcribe_diarize.py`:
import os
import sys
import json
import time
import datetime
import hashlib
from pathlib import Path
import whisper
from tqdm import tqdm
from pyannote.audio import Pipeline
import ffmpeg
# --------------------------- Configuration --------------------------- #
HUGGINGFACE_TOKEN = "XXX" # Replace with your Hugging Face token
WHISPER_MODEL = "medium"
# --------------------------- Argument Parsing --------------------------- #
if len(sys.argv) < 2:
    print("Usage: python transcribe_diarize.py <input_audio_file> [--force] [--srt]")
    sys.exit(1)
input_file = Path(sys.argv[1])
force = "--force" in sys.argv
export_srt = "--srt" in sys.argv
if not input_file.exists():
    print(f"Error: File not found – {input_file}")
    sys.exit(1)
# --------------------------- Utility: File Hash --------------------------- #
def file_hash(path, block_size=65536):
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(block_size):
            hasher.update(chunk)
    return hasher.hexdigest()[:12]  # Shorten to 12 characters for brevity
# --------------------------- File Conversion --------------------------- #
converted_wav = input_file.with_suffix(".wav")
if input_file.suffix.lower() != ".wav":
    print(f"Converting {input_file.name} to WAV format...")
    try:
        ffmpeg.input(str(input_file)).output(str(converted_wav), ac=1, ar=16000).run(quiet=True, overwrite_output=True)
        print(f"Converted to: {converted_wav}")
    except Exception as e:
        print(f"Error during ffmpeg conversion: {e}")
        sys.exit(1)
else:
    converted_wav = input_file
# --------------------------- Output File Naming --------------------------- #
hash_id = file_hash(input_file)
base_name = input_file.stem
prefix = f"{base_name}_{hash_id}"
transcript_json = input_file.parent / f"{prefix}_whisper_transcript.json"
final_transcript_txt = input_file.parent / f"{prefix}_diarized_transcript.txt"
srt_output = input_file.parent / f"{prefix}_diarized_transcript.srt"
# --------------------------- Whisper Transcription --------------------------- #
print("Loading Whisper model...")
start_time = time.time()
whisper_model = whisper.load_model(WHISPER_MODEL)
print(f"Whisper model loaded in {time.time() - start_time:.1f} seconds")
if transcript_json.exists() and not force:
    print(f"Using cached transcription: {transcript_json}")
    with open(transcript_json, "r", encoding="utf-8") as f:
        result = json.load(f)
else:
    print("Running Whisper transcription (this may take several minutes)...")
    start_time = time.time()
    result = whisper_model.transcribe(str(converted_wav), verbose=False)  # Show progress bar
    with open(transcript_json, "w", encoding="utf-8") as f:
        json.dump(result, f)
    print(f"Transcription completed and saved in {time.time() - start_time:.1f} seconds")
# --------------------------- Pyannote Diarization --------------------------- #
print("Loading speaker diarization pipeline...")
try:
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization",
        use_auth_token=HUGGINGFACE_TOKEN
    )
except Exception as e:
    print("Failed to load diarization model. Make sure you've accepted model access:")
    print(" - https://huggingface.co/pyannote/speaker-diarization")
    print(" - https://huggingface.co/pyannote/segmentation")
    print(" - https://huggingface.co/pyannote/embedding")
    print(" - https://huggingface.co/pyannote/clustering")
    print(f"Error: {e}")
    sys.exit(1)
print("Running diarization (this may take a few minutes)...")
start_time = time.time()
diarization = pipeline(str(converted_wav))
print(f"Diarization completed in {time.time() - start_time:.1f} seconds")
# --------------------------- Output Alignment --------------------------- #
def format_time(seconds):
    return str(datetime.timedelta(seconds=int(seconds)))  # H:MM:SS for the text transcript

def format_srt_timestamp(seconds):
    # SRT timestamps use the form HH:MM:SS,mmm
    total_ms = int(round(seconds * 1000))
    hours, total_ms = divmod(total_ms, 3_600_000)
    minutes, total_ms = divmod(total_ms, 60_000)
    secs, millis = divmod(total_ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"
print("Aligning transcript with diarization...")
speaker_segments = []
for turn, _, speaker in tqdm(diarization.itertracks(yield_label=True), desc="Speaker Segments"):
    speaker_segments.append({
        "start": turn.start,
        "end": turn.end,
        "speaker": speaker
    })
aligned_output = []
for segment in tqdm(result['segments'], desc="Merging Segments"):
    start = segment['start']
    text = segment['text'].strip()
    for seg in speaker_segments:
        if seg["start"] <= start < seg["end"]:
            speaker = seg["speaker"]
            timestamp = format_time(start)
            aligned_output.append({
                "speaker": speaker,
                "timestamp": timestamp,
                "start": start,
                "end": segment['end'],
                "text": text
            })
            break
# Save to plain text
with open(final_transcript_txt, "w", encoding="utf-8") as f:
    for line in aligned_output:
        f.write(f"[{line['speaker']}] {line['timestamp']} – {line['text']}\n")
print(f"Transcript saved to: {final_transcript_txt}")
# Save to SRT format
if export_srt:
print("Generating SRT file...")
with open(srt_output, "w", encoding="utf-8") as f:
for idx, entry in enumerate(aligned_output, 1):
f.write(f"{idx}\n")
f.write(f"{format_srt_timestamp(entry['start'])} --> {format_srt_timestamp(entry['end'])}\n")
f.write(f"{entry['speaker']}: {entry['text']}\n\n")
print(f"SRT file saved to: {srt_output}")