Audio & Video to Text: 100% Private, No Uploads

Transcribe meetings, lectures, podcasts, and Zoom recordings with OpenAI's Whisper AI, running entirely in your browser. Works with MP3, WAV, M4A, MP4, and WebM files up to 150 minutes. Your recordings never leave your device. Export as plain text, SRT subtitles, Markdown, or structured JSON.

100% Private
No Uploads
Runs in Browser
Works Offline (after first load)


100% private: your audio never leaves your device.

The Whisper AI model runs entirely in your browser. Zero server uploads.

All 99 Whisper-supported languages are available; type to search by name or code (e.g. Thai, th). Auto-detect works well for most audio; set the language manually if detection is wrong.
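Search-by-name-or-code can be sketched roughly as below; the language list here is a small illustrative subset, not the full 99, and the data shape is an assumption for the sketch:

```javascript
// Sketch: filter a language list by display name or ISO 639-1 code.
// The list is an illustrative subset, not Whisper's full 99 languages.
const LANGUAGES = [
  { code: "en", name: "English" },
  { code: "th", name: "Thai" },
  { code: "de", name: "German" },
];

function searchLanguages(query, languages = LANGUAGES) {
  const q = query.trim().toLowerCase();
  if (!q) return languages; // empty query shows everything
  return languages.filter(
    (l) => l.name.toLowerCase().includes(q) || l.code === q
  );
}
```

Matching the code exactly (rather than by substring) keeps short queries like "th" from pulling in every name that merely contains those letters.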

Paragraph breaks are inserted at natural pauses (>1.5 s) to aid readability; this is not true speaker diarization.

How it works

1

Drop your file

Drag any MP3, WAV, M4A, MP4, or WebM file (audio or video, up to 150 min). It stays in your browser; zero uploads.

2

Whisper AI transcribes locally

The model runs in a Web Worker via WebGPU or WASM. First run downloads the model (~75 MB for Base); later runs load it from the browser cache.

3

Export to your format

Jump through the transcript with the audio player, then download TXT, SRT subtitles, Markdown for Notion, or JSON.
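The SRT export in step 3 can be sketched as follows. The segment shape ({ start, end, text } with times in seconds) is an assumption for this sketch, not the tool's actual internal format:

```javascript
// Sketch: render timestamped segments as SRT subtitle text.
// Segment shape ({ start, end, text }, seconds) is assumed for illustration.
function toSrtTime(seconds) {
  const ms = Math.round(seconds * 1000);
  const h = String(Math.floor(ms / 3600000)).padStart(2, "0");
  const m = String(Math.floor((ms % 3600000) / 60000)).padStart(2, "0");
  const s = String(Math.floor((ms % 60000) / 1000)).padStart(2, "0");
  const frac = String(ms % 1000).padStart(3, "0"); // SRT uses a comma separator
  return `${h}:${m}:${s},${frac}`;
}

function toSrt(segments) {
  return segments
    .map(
      (seg, i) =>
        `${i + 1}\n${toSrtTime(seg.start)} --> ${toSrtTime(seg.end)}\n${seg.text.trim()}\n`
    )
    .join("\n");
}
```

TXT and Markdown exports are simpler still (join the segment texts), which is why SRT, with its cue numbers and comma-separated millisecond timestamps, is the format worth sketching.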

How the Audio Transcriber Works: Technical Details

SimpleTool uses OpenAI Whisper models (distributed as ONNX via the Hugging Face CDN) executed with Transformers.js v4 inside a browser Web Worker. The model runs via WebGPU on supported hardware with an automatic WebAssembly (WASM) fallback for CPU-only devices. No server connection is opened during transcription; the only network request is the one-time model download.

Audio files are decoded via the browser's native Web Audio API and resampled to 16 kHz mono Float32 using OfflineAudioContext, the exact format Whisper was trained on. Long recordings are processed in 30-second chunks with 5-second overlap to preserve context across boundaries. Audio extracted from video containers (MP4, WebM, MOV) is handled by the same pipeline.
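The 30-second windows with 5-second overlap can be computed over the 16 kHz sample buffer roughly like this (a sketch of the chunking arithmetic, not the tool's actual code):

```javascript
// Sketch: split a 16 kHz mono sample buffer into 30 s chunks
// overlapping by 5 s, so each window advances 25 s and context
// is preserved across chunk boundaries.
function chunkBoundaries(totalSamples, sampleRate = 16000, chunkSec = 30, overlapSec = 5) {
  const chunk = chunkSec * sampleRate;                // 480,000 samples
  const step = (chunkSec - overlapSec) * sampleRate;  // advance 400,000 samples
  const bounds = [];
  for (let start = 0; start < totalSamples; start += step) {
    bounds.push([start, Math.min(start + chunk, totalSamples)]);
    if (start + chunk >= totalSamples) break; // last chunk reached the end
  }
  return bounds;
}
```

For a 60-minute recording this yields a chunk roughly every 25 seconds; the decoder's text from each overlap region is merged so words spanning a boundary are not cut in half.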

Paragraph breaks are inferred from natural pauses (silence gaps >1.5 seconds) to aid readability. This approximates speaker turns for meeting audio but is not true speaker diarization. Word-level timestamps are preserved in the JSON export for downstream tooling.
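Given word-level timestamps, the pause-based paragraphing can be sketched as follows (the word shape { text, start, end } is an assumed format for illustration):

```javascript
// Sketch: group words into paragraphs wherever the silence gap between
// one word's end and the next word's start exceeds 1.5 s.
function toParagraphs(words, gapSec = 1.5) {
  const paragraphs = [];
  let current = [];
  for (const w of words) {
    const prev = current[current.length - 1];
    if (prev && w.start - prev.end > gapSec) {
      paragraphs.push(current.map((x) => x.text).join(" "));
      current = [];
    }
    current.push(w);
  }
  if (current.length) paragraphs.push(current.map((x) => x.text).join(" "));
  return paragraphs;
}
```

Because speakers in a meeting usually pause between turns, this heuristic often lands breaks at turn changes, but it has no notion of who is speaking, which is why it is not diarization.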