Transcribe and Diarize Audio with Whisper and Pyannote

Purpose

Transcribing spoken audio and identifying who said what is a critical but time-consuming task for journalists, researchers, investigators, legal professionals, and podcasters. Most built-in transcription tools offer limited accuracy, no speaker identification, and poor handling of long or noisy recordings.

This tool solves that by combining OpenAI’s Whisper (for highly accurate transcription) with pyannote.audio (for speaker diarization) in a streamlined, flexible script that runs on your local machine. Apart from the one-time model downloads from Hugging Face, no cloud services or APIs are required.

What This Tool Does

This script takes any audio file (e.g., .m4a, .mp3, .wav) and produces:

  • A full transcript of the spoken content using OpenAI’s Whisper model
  • Automatically labeled speaker turns using pyannote’s diarization pipeline
  • Output in:
    • Human-readable .txt format with timestamps and speaker labels
    • Optional .srt subtitles for use in video editors or players

Features

  • Automatic conversion of any supported audio format to .wav using ffmpeg
  • Caching of transcripts so you can resume without reprocessing
  • A --force option to reprocess transcription
  • A --srt option to generate subtitles

Use Cases

  • Interview and focus group transcription with clear speaker labeling
  • Audio logs, meeting notes, or legal depositions
  • Podcast editing or subtitling
  • Case studies and research involving qualitative audio

Platform Support

This tool works on:

  • macOS
  • Windows
  • Linux

All you need is a working Python 3.9+ environment, ffmpeg, and access to a Hugging Face account for the diarization model.

Prerequisites

Python 3.9 or newer

Recommended installation: https://www.python.org/downloads/
Verify with:

python --version   # on some systems: python3 --version

ffmpeg

Required for converting audio files to .wav format.
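
Typical installation commands (adjust for your platform's package manager):

brew install ffmpeg        # macOS (Homebrew)
sudo apt install ffmpeg    # Debian/Ubuntu

On Windows, download a build from https://ffmpeg.org/download.html and add it to your PATH.

Verify with:

ffmpeg -version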

Virtual Environment (Recommended)

To isolate dependencies and avoid version conflicts, create and activate a virtual environment:

python -m venv diarize_env
source diarize_env/bin/activate   # On Windows: diarize_env\Scripts\activate

Python Packages

These versions match the expected environment of the pretrained diarization models:

pip install torch==1.13.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
pip install pyannote.audio==2.1.1
pip install git+https://github.com/openai/whisper.git
pip install ffmpeg-python tqdm

Note: Newer versions of torch or pyannote.audio may introduce breaking changes. I recommend using the versions shown above for consistent results.
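
To confirm that the pinned versions are active in your environment, a quick check:

python -c "import torch, pyannote.audio; print(torch.__version__, pyannote.audio.__version__)"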

Hugging Face Account & Access Token

pyannote.audio requires a token to access gated models.

  1. Create an account: https://huggingface.co

  2. Create a token: https://huggingface.co/settings/tokens

  3. Accept the model terms of use for the following repositories:

     • https://huggingface.co/pyannote/speaker-diarization
     • https://huggingface.co/pyannote/segmentation

  4. Add your token to the script under:

HUGGINGFACE_TOKEN = "your_token_here"
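
To sanity-check the token before running the full script, this one-liner (huggingface_hub ships as a pyannote.audio dependency) should print your account details:

python -c "from huggingface_hub import HfApi; print(HfApi().whoami(token='your_token_here'))"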

Running the Tool

Once prerequisites are met and the script is configured, run the tool from your terminal:

python transcribe_diarize.py <input_audio_file> [--force] [--srt]

Arguments

  • <input_audio_file>: The path to the audio file you want to transcribe (supports .m4a, .mp3, .wav, etc.)

Options

  • --force: Forces re-running Whisper transcription, even if cached.
  • --srt: Generates a .srt subtitle file in addition to the plain text output.

Output Files

For an input file named interview.m4a, the script will produce:

File                                        Description
interview.wav                               Converted audio file used by Whisper and pyannote
interview_<hash>_whisper_transcript.json    Cached transcription result (used for resuming)
interview_<hash>_diarized_transcript.txt    Text transcript with speaker labels and timestamps
interview_<hash>_diarized_transcript.srt    Subtitle file (if --srt is used)

Here <hash> is the first 12 characters of the input file's SHA-256 hash, which ties cached results to the exact audio they came from.
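
The .txt transcript contains one line per Whisper segment; speaker labels such as SPEAKER_00 are assigned automatically by pyannote. Illustrative lines:

[SPEAKER_00] 0:00:05 – Thanks for joining me today.
[SPEAKER_01] 0:00:09 – Happy to be here.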

Example Usage

python transcribe_diarize.py interview.m4a --force --srt

This command will:

  • Convert interview.m4a to .wav
  • Re-run Whisper transcription
  • Run speaker diarization
  • Output both .txt and .srt transcripts

Script

Place the following code in a file named transcribe_diarize.py:

import sys
import json
import time
import datetime
import hashlib
from pathlib import Path
import whisper
from tqdm import tqdm
from pyannote.audio import Pipeline
import ffmpeg

# --------------------------- Configuration --------------------------- #

HUGGINGFACE_TOKEN = "XXX"  # Replace with your Hugging Face token
WHISPER_MODEL = "medium"
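# Other Whisper sizes: tiny, base, small, large (larger = more accurate, slower)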

# --------------------------- Argument Parsing --------------------------- #

if len(sys.argv) < 2:
    print("Usage: python transcribe_diarize.py <input_audio_file> [--force] [--srt]")
    sys.exit(1)

input_file = Path(sys.argv[1])
force = "--force" in sys.argv
export_srt = "--srt" in sys.argv

if not input_file.exists():
    print(f"Error: File not found – {input_file}")
    sys.exit(1)

# --------------------------- Utility: File Hash --------------------------- #

def file_hash(path, block_size=65536):
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(block_size):
            hasher.update(chunk)
    return hasher.hexdigest()[:12]  # Shorten to 12 characters for brevity

# --------------------------- File Conversion --------------------------- #

converted_wav = input_file.with_suffix(".wav")
if input_file.suffix.lower() != ".wav":
    print(f"Converting {input_file.name} to WAV format...")
    try:
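        # ac=1 (mono) and ar=16000 (16 kHz) match the input format Whisper expects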
        ffmpeg.input(str(input_file)).output(str(converted_wav), ac=1, ar=16000).run(quiet=True, overwrite_output=True)
        print(f"Converted to: {converted_wav}")
    except Exception as e:
        print(f"Error during ffmpeg conversion: {e}")
        sys.exit(1)
else:
    converted_wav = input_file

# --------------------------- Output File Naming --------------------------- #
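# Outputs are keyed to the input file's content hash, so a replaced or edited
# audio file never reuses a stale transcript cache.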

hash_id = file_hash(input_file)
base_name = input_file.stem
prefix = f"{base_name}_{hash_id}"

transcript_json = input_file.parent / f"{prefix}_whisper_transcript.json"
final_transcript_txt = input_file.parent / f"{prefix}_diarized_transcript.txt"
srt_output = input_file.parent / f"{prefix}_diarized_transcript.srt"

# --------------------------- Whisper Transcription --------------------------- #

print("Loading Whisper model...")
start_time = time.time()
whisper_model = whisper.load_model(WHISPER_MODEL)
print(f"Whisper model loaded in {time.time() - start_time:.1f} seconds")

if transcript_json.exists() and not force:
    print(f"Using cached transcription: {transcript_json}")
    with open(transcript_json, "r", encoding="utf-8") as f:
        result = json.load(f)
else:
    print("Running Whisper transcription (this may take several minutes)...")
    start_time = time.time()
    result = whisper_model.transcribe(str(converted_wav), verbose=False)  # Show progress bar
    with open(transcript_json, "w", encoding="utf-8") as f:
        json.dump(result, f)
    print(f"Transcription completed and saved in {time.time() - start_time:.1f} seconds")

# --------------------------- Pyannote Diarization --------------------------- #

print("Loading speaker diarization pipeline...")
try:
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization",
        use_auth_token=HUGGINGFACE_TOKEN
    )
except Exception as e:
    print("Failed to load diarization model. Make sure you've accepted model access:")
    print(" - https://huggingface.co/pyannote/speaker-diarization")
    print(" - https://huggingface.co/pyannote/segmentation")
    print(" - https://huggingface.co/pyannote/embedding")
    print(" - https://huggingface.co/pyannote/clustering")
    print(f"Error: {e}")
    sys.exit(1)

print("Running diarization (this may take a few minutes)...")
start_time = time.time()
diarization = pipeline(str(converted_wav))
print(f"Diarization completed in {time.time() - start_time:.1f} seconds")

# --------------------------- Output Alignment --------------------------- #

def format_time(seconds):
    # H:MM:SS for transcript timestamps (fractional seconds dropped)
    return str(datetime.timedelta(seconds=int(seconds)))

def format_srt_timestamp(seconds):
    # SRT needs HH:MM:SS,mmm; build it explicitly because str(timedelta)
    # omits microseconds entirely for whole-second values
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

print("Aligning transcript with diarization...")
speaker_segments = []
for turn, _, speaker in tqdm(diarization.itertracks(yield_label=True), desc="Speaker Segments"):
    speaker_segments.append({
        "start": turn.start,
        "end": turn.end,
        "speaker": speaker
    })

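# Assign each Whisper segment the speaker whose diarization turn contains the
# segment's start time; segments outside every turn are labeled UNKNOWN rather
# than silently dropped.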
aligned_output = []
for segment in tqdm(result['segments'], desc="Merging Segments"):
    start = segment['start']
    text = segment['text'].strip()
    speaker = "UNKNOWN"
    for seg in speaker_segments:
        if seg["start"] <= start < seg["end"]:
            speaker = seg["speaker"]
            break
    aligned_output.append({
        "speaker": speaker,
        "timestamp": format_time(start),
        "start": start,
        "end": segment['end'],
        "text": text
    })

# Save to plain text
with open(final_transcript_txt, "w", encoding="utf-8") as f:
    for line in aligned_output:
        f.write(f"[{line['speaker']}] {line['timestamp']} – {line['text']}\n")
print(f"Transcript saved to: {final_transcript_txt}")

# Save to SRT format
if export_srt:
    print("Generating SRT file...")
    with open(srt_output, "w", encoding="utf-8") as f:
        for idx, entry in enumerate(aligned_output, 1):
            f.write(f"{idx}\n")
            f.write(f"{format_srt_timestamp(entry['start'])} --> {format_srt_timestamp(entry['end'])}\n")
            f.write(f"{entry['speaker']}: {entry['text']}\n\n")
    print(f"SRT file saved to: {srt_output}")
