There's no need to pay subscription fees to an online transcription service: OpenAI's Whisper (https://openai.com/index/whisper/) runs locally on your computer and can generate subtitles for all kinds of videos. This guide walks you through installing Python, Whisper, FFmpeg, and yt-dlp on macOS to transcribe videos into subtitles. It covers both Apple Silicon (M1/M2/M3) and Intel Macs, and includes setting up a virtual environment, downloading YouTube videos, and optimizing performance.
Step 1: Install Python
Check if Python is Installed
Open Terminal:
- Press Cmd + Space, type Terminal, and press Enter.
Type:
python3 --version
If you see a version like Python 3.10.6, Python is installed. Ensure it's 3.8–3.11. If it's older (e.g., 3.7) or missing, proceed to install.
Note: macOS includes a system Python (e.g., /usr/bin/python3). Don't use it; we'll install a separate version to avoid conflicts.
Install Python (if needed)
Download Python from python.org:
- Visit python.org.
- Choose Python 3.10 or 3.11 (e.g., 3.10.9 is stable as of 2025). Click the macOS installer link for your chip (universal, Intel, or Apple Silicon).
Run the installer:
- Double-click the .pkg file (e.g., python-3.10.9-macos11.pkg).
- Follow the prompts. It installs Python to /Applications/Python 3.10/ and adds python3 to your PATH.
Verify installation:
Open a new Terminal and type:
python3 --version
You should see the installed version (e.g., Python 3.10.9).
Check pip (Python’s package manager):
pip3 --version
It should show a version tied to Python 3.10 or 3.11 (e.g., pip 23.2 from ...).
Update PATH (if needed)
If python3 isn't found, add Python to your PATH:
Open Terminal and edit your shell profile:
nano ~/.zshrc
(macOS uses Zsh by default since Catalina; if using Bash, edit ~/.bash_profile).
Add this line at the end:
export PATH="/Library/Frameworks/Python.framework/Versions/3.10/bin:$PATH"
Replace 3.10 with your version (e.g., 3.11).
Save: press Ctrl + O, Enter, then Ctrl + X.
Apply:
source ~/.zshrc
Verify: python3 --version should now work.
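If PATH problems persist, these commands show exactly which binaries your shell resolves (a troubleshooting sketch, not a required step):

```shell
# Show which python3/pip3 the shell finds; both should live under the
# python.org framework path if your PATH edit took effect.
command -v python3 || echo "python3 not found - check PATH"
command -v pip3 || echo "pip3 not found - check PATH"
# Running pip through the interpreter avoids picking up a stale pip3 shim:
python3 -m pip --version || echo "pip is not available for this python3"
```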
Step 2: Set Up a Project Folder and Virtual Environment
A virtual environment isolates Whisper’s dependencies, keeping your system clean.
Create a Project Folder
In Terminal, make a folder for your project:
mkdir ~/whisper_project
cd ~/whisper_project
This creates whisper_project in your home directory (/Users/yourusername/).
Set Up a Virtual Environment
Create the environment:
python3 -m venv venv
Activate it:
source venv/bin/activate
Your prompt should change (e.g., (venv) yourusername@MacBook-Pro whisper_project %), showing the environment is active.
Note: Always activate the environment when working on this project. If you close Terminal, reopen it, navigate to ~/whisper_project, and run source venv/bin/activate again.
Deactivate (Later):
When done, type:
deactivate
to exit the virtual environment.
Step 3: Install Whisper
Install Whisper:
With the virtual environment activated, install Whisper:
pip3 install -U openai-whisper
This downloads Whisper and dependencies (e.g., PyTorch, ~1–2GB). It may take 2–5 minutes, depending on your internet speed.
On Apple Silicon (M1/M2/M3), PyTorch can use Metal Performance Shaders (MPS) for acceleration, depending on your PyTorch version.
Verify Installation:
Test Whisper:
whisper --help
You should see a help message listing command options (e.g., --model, --language). If you get "command not found," ensure the virtual environment is active and reinstall Whisper.
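If you'd rather check your toolchain in one place, here's a small Python sketch (my own helper, not part of Whisper) that reports which required command-line tools are visible on PATH:

```python
import shutil

def check_tool(name):
    """Return the full path of a CLI tool, or None if it's not on PATH."""
    return shutil.which(name)

# ffmpeg is installed in the next step, so it may legitimately be missing here.
for tool in ("whisper", "ffmpeg", "python3"):
    path = check_tool(tool)
    print(f"{tool}: {path or 'NOT FOUND'}")
```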
Step 4: Install FFmpeg
Whisper needs FFmpeg to extract audio from video files (e.g., MP4 to WAV for transcription).
Install Homebrew (if not installed)
Homebrew is a package manager for macOS, making FFmpeg installation easy.
Check if Homebrew is installed:
brew --version
If it’s missing, install it:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Follow the prompts (it may ask for your password). Homebrew installs to /usr/local/bin (Intel) or /opt/homebrew/bin (Apple Silicon).
Add Homebrew to PATH (if needed)
If brew isn't found afterwards, add it to your PATH (the path below is the Apple Silicon location; on Intel, /usr/local/bin is usually on PATH already):
echo 'eval "$(/opt/homebrew/bin/brew shellenv)"' >> ~/.zshrc
source ~/.zshrc
Install FFmpeg
With Homebrew installed, run:
brew install ffmpeg
This downloads and installs FFmpeg (~100–200MB, 1–3 minutes).
Verify:
ffmpeg -version
You should see FFmpeg's version (e.g., ffmpeg version 6.0). If not, ensure Homebrew's PATH is set correctly.
Step 5 (Optional): Install macOS Certificates for Python
Why: Python installed from python.org doesn't automatically use macOS's system certificate store (unlike the pre-installed /usr/bin/python3). This can cause SSL verification to fail if certificates aren't set up manually.
How:
- Install the certifi package, which provides a trusted certificate bundle:
source ~/whisper_project/venv/bin/activate
pip3 install certifi
- Run the Install Certificates.command script that came with your python.org installation:
- Open Finder and go to /Applications/Python 3.11/ (or your installed version's folder).
- Double-click Install Certificates.command. It runs in Terminal and updates Python's SSL certificates.
You’ll see output like:
-- pip install --upgrade certifi --
Successfully installed certifi-2023.x.x
-- removing any existing file or link --
-- creating symlink to certifi certificate bundle --
-- setting permissions --
-- update complete --
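To confirm Python can now load a CA bundle, here's a quick stdlib-only check (an illustrative snippet of my own, not part of the certificate installer):

```python
import ssl

# Build a default SSL context and count the CA certificates it loaded;
# a count of 0 suggests the certificate setup is still incomplete.
ctx = ssl.create_default_context()
stats = ctx.cert_store_stats()
print(f"Loaded {stats['x509_ca']} CA certificates")
```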
Step 6: Prepare Your Video File
You need a local video file for Whisper to transcribe. If your video is a YouTube URL, we’ll download it.
If You Have a Local Video
Copy your video (e.g., my_video.mp4) to the whisper_project folder:
cp /path/to/my_video.mp4 ~/whisper_project/
Example: If it’s in Downloads, use:
cp ~/Downloads/my_video.mp4 ~/whisper_project/
Supported formats: MP4, MOV, AVI, MKV, etc.
If Downloading from YouTube
Install yt-dlp:
pip3 install yt-dlp
Ensure the virtual environment is active.
Download the video:
yt-dlp -f "bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best" "https://www.youtube.com/watch?v=VIDEO_ID" -o video.mp4
Replace VIDEO_ID with the video's ID (e.g., K3CR6RiWS2U for the Nintendo Switch 2 video).
Example:
yt-dlp -f "bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best" "https://www.youtube.com/watch?v=K3CR6RiWS2U" -o nintendo_switch_2.mp4
This saves nintendo_switch_2.mp4 in ~/whisper_project.
Note: Only download videos you have permission to use (e.g., for personal use or fair use under copyright law). YouTube’s terms prohibit unauthorized redistribution.
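If you script your downloads, the video ID can be pulled from a standard watch URL with the standard library. This is a hypothetical helper of my own, shown for standard `watch?v=` URLs only; short `youtu.be` links would need extra handling:

```python
from urllib.parse import urlparse, parse_qs

def youtube_video_id(url):
    """Extract the ID from a standard youtube.com/watch?v=... URL."""
    query = parse_qs(urlparse(url).query)
    return query.get("v", [None])[0]

print(youtube_video_id("https://www.youtube.com/watch?v=K3CR6RiWS2U"))  # K3CR6RiWS2U
```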
Step 7: Generate Subtitles with Whisper
Now, we’ll transcribe the video’s audio into subtitles.
Run Whisper:
Ensure you're in the whisper_project folder with the virtual environment activated:
cd ~/whisper_project
source venv/bin/activate
Run Whisper on your video:
whisper video.mp4 --model medium --language en --output_format srt
Replace video.mp4 with your video's filename (e.g., nintendo_switch_2.mp4).
--model medium: Uses the medium model for a good balance of accuracy and speed. Options:
- tiny: Fastest, least accurate.
- base: Fast, decent accuracy.
- small: Balanced.
- medium: Recommended for quality.
- large: Most accurate, slowest.
--language en: Assumes English audio. For other languages, use codes like es (Spanish) or fr (French), or omit the flag to auto-detect.
--output_format srt: Outputs subtitles in SRT format (timed text). Alternatives: txt (plain text), vtt (WebVTT).
Example for the Nintendo Switch 2 video:
whisper nintendo_switch_2.mp4 --model medium --language en --output_format srt
What Happens:
- Whisper uses FFmpeg to extract audio from the video.
- It transcribes the audio into text, segmenting it into subtitle entries with timestamps.
- It saves video.srt (e.g., nintendo_switch_2.srt) in ~/whisper_project.
Processing time depends on your Mac:
- M1/M2/M3 Mac: ~5–10 minutes for a 10-minute video (medium model) with MPS acceleration.
- Intel Mac: ~20–30 minutes for a 10-minute video (CPU only).
- Longer videos or larger models (e.g., large) take proportionally longer.
Example SRT Output:
Open video.srt with TextEdit:
open -a TextEdit video.srt
You’ll see something like:
1
00:00:00,000 --> 00:00:02,500
Yo, MKBHD here.
2
00:00:02,501 --> 00:00:05,000
Welcome to your first look and hands-on with the Nintendo Switch 2.
Each entry has a number, timestamp, and text, ideal for subtitles or article writing.
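The SRT format is simple enough to parse with the standard library. Here's a sketch of my own (using only Python's `re`, not a Whisper feature) that splits entries into (index, start, end, text) tuples:

```python
import re

# Matches one SRT entry: index line, timestamp line, then text up to a blank line.
SRT_BLOCK = re.compile(
    r"(\d+)\s*\n"
    r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\s*\n"
    r"(.*?)(?=\n\n|\Z)",
    re.S,
)

def parse_srt(text):
    """Return a list of (index, start, end, text) tuples from SRT content."""
    return [(int(i), s, e, t.strip()) for i, s, e, t in SRT_BLOCK.findall(text)]

sample = """1
00:00:00,000 --> 00:00:02,500
Yo, MKBHD here.

2
00:00:02,501 --> 00:00:05,000
Welcome to your first look and hands-on with the Nintendo Switch 2.
"""

for index, start, end, line in parse_srt(sample):
    print(index, start, end, line)
```

This is handy for spell-checking or word counts before editing the file by hand.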
Step 8: Verify and Edit Subtitles
Check the SRT File:
Open video.srt in TextEdit or a code editor like Visual Studio Code (free, code.visualstudio.com).
Read through to ensure accuracy. Whisper is ~90–95% accurate for clear English audio but may misinterpret names (e.g., “MKBHD” as “MKV HD”) or technical terms (e.g., “Joy-Con”).
Edit if Needed:
- Fix errors in TextEdit (e.g., change “Nintendo Switch too” to “Nintendo Switch 2”).
- For easier editing, use Aegisub (a free subtitle editor):
- Download from aegisub.org (~30MB).
- Open Aegisub, load video.srt (File > Open Subtitles), and drag your video file into the video pane.
- Play the video to check timing and edit text/timestamps as needed.
- Save the corrected SRT (File > Save Subtitles).
- Install Aegisub via Homebrew (optional):
brew install aegisub
Convert to TXT (Optional):
If you want plain text for an article, run Whisper with TXT output:
whisper video.mp4 --model medium --language en --output_format txt
This creates video.txt without timestamps (e.g., "Yo, MKBHD here. Welcome to your first look…").
Alternatively, manually strip timestamps from video.srt:
- Open it in TextEdit, copy only the text lines (excluding numbers and timestamps), and paste them into a new file (e.g., subtitles.txt).
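Stripping by hand gets tedious for long files; here's a small sketch (an illustrative helper of my own, not a Whisper feature) that does the same filtering automatically:

```python
def srt_to_text(srt):
    """Drop index and timestamp lines from SRT content, keeping only dialogue."""
    kept = []
    for line in srt.splitlines():
        line = line.strip()
        if not line or line.isdigit() or "-->" in line:
            continue
        kept.append(line)
    return " ".join(kept)

sample = "1\n00:00:00,000 --> 00:00:02,500\nYo, MKBHD here.\n"
print(srt_to_text(sample))  # Yo, MKBHD here.
```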
Step 9: Optimizing Performance on macOS
Apple Silicon (M1/M2/M3)
Whisper can use Metal Performance Shaders (MPS) on Apple Silicon for faster processing, depending on your PyTorch version.
To confirm MPS is active:
python3 -c "import torch; print(torch.backends.mps.is_available())"
It should return True. If not, ensure PyTorch is updated:
pip3 install --upgrade torch torchvision torchaudio
MPS makes the medium model ~2–3x faster than an Intel CPU (e.g., 5–10 minutes for a 10-minute video vs. 20–30).
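A more defensive version of the check above (a sketch that assumes PyTorch 1.12+, where MPS support first appeared; it degrades gracefully if torch isn't importable):

```python
def best_device():
    """Report the fastest device PyTorch could use on this machine."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"

print(best_device())
```

On an M1/M2/M3 Mac with a recent PyTorch this should print mps; on an Intel Mac, cpu.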
Choose Model Size
Start with medium for quality and speed. Use large for complex audio (e.g., accents, background noise) but expect ~2x longer processing. Use small or base for quick tests or clear audio.
Manage Memory
Whisper's medium model uses ~2–4GB RAM. If your Mac has low RAM (e.g., 8GB), close other apps or use the small model.
For long videos, split them with FFmpeg:
ffmpeg -i video.mp4 -c copy -map 0 -segment_time 600 -f segment chunk_%03d.mp4
This creates 10-minute chunks (e.g., chunk_000.mp4). Transcribe each separately.
Batch Processing
Transcribe multiple videos:
for video in *.mp4; do whisper "$video" --model medium --language en --output_format srt; done
Save Disk Space
Delete the video file after transcription if not needed:
rm video.mp4
Step 10: Expected Performance on Mac
M1/M2/M3 Mac (e.g., MacBook Air M1, 8GB RAM):
- ~5–10 minutes for a 10-minute video (medium model).
- ~10–20 minutes with the large model.
Intel Mac (e.g., 2019 MacBook Pro, 4-core i5):
- ~20–30 minutes for a 10-minute video (medium model).
- ~40–60 minutes with the large model.
Accuracy: ~90–95% for clear English audio (e.g., YouTube tech reviews). Lower for noisy audio, accents, or overlapping speakers. Manual editing fixes most errors.
Output Size: SRT files are small (~10–100KB for a 10-minute video); TXT is similar.