Audio & Video to Text: 100% Private, No Uploads

Transcribe meetings, lectures, podcasts, and Zoom recordings with OpenAI's Whisper AI, running entirely in your browser. Works with MP3, WAV, M4A, MP4, and WebM files up to 150 minutes. Your recordings never leave your device. Export as plain text, SRT subtitles, Markdown, or structured JSON.

100% Private
No Uploads
Runs in Browser
Works Offline (after first load)


100% private: your audio never leaves your device.

The Whisper AI model runs entirely in your browser. Zero server uploads.

All 99 Whisper-supported languages are available; type to search by name or code (e.g. Thai, th). Auto-detect works well for most audio; set the language manually if detection is wrong.
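Search-by-name-or-code can be sketched roughly as below; the language list here is a small illustrative subset, not the full 99, and the data shape is an assumption for the sketch:

```javascript
// Sketch: filter a language list by display name or ISO 639-1 code.
// The list is an illustrative subset, not Whisper's full 99 languages.
const LANGUAGES = [
  { code: "en", name: "English" },
  { code: "th", name: "Thai" },
  { code: "de", name: "German" },
];

function searchLanguages(query, languages = LANGUAGES) {
  const q = query.trim().toLowerCase();
  if (!q) return languages; // empty query shows everything
  return languages.filter(
    (l) => l.name.toLowerCase().includes(q) || l.code === q
  );
}
```

Matching the code exactly (rather than by substring) keeps short queries like "th" from pulling in every name that merely contains those letters.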

Paragraph breaks are inserted at natural pauses (>1.5 s) to aid readability; this is not true speaker diarization.

How it works

1

Drop your file

Drag any MP3, WAV, M4A, MP4, or WebM file (audio or video, up to 150 min). It stays in your browser; zero uploads.

2

Whisper AI transcribes locally

The model runs in a Web Worker via WebGPU or WASM. First run downloads the model (~75 MB for Base); later runs load it from the browser cache.

3

Export to your format

Jump through the transcript with the audio player, then download TXT, SRT subtitles, Markdown for Notion, or JSON.
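The SRT export in step 3 can be sketched as follows. The segment shape ({ start, end, text } with times in seconds) is an assumption for this sketch, not the tool's actual internal format:

```javascript
// Sketch: render timestamped segments as SRT subtitle text.
// Segment shape ({ start, end, text }, seconds) is assumed for illustration.
function toSrtTime(seconds) {
  const ms = Math.round(seconds * 1000);
  const h = String(Math.floor(ms / 3600000)).padStart(2, "0");
  const m = String(Math.floor((ms % 3600000) / 60000)).padStart(2, "0");
  const s = String(Math.floor((ms % 60000) / 1000)).padStart(2, "0");
  const frac = String(ms % 1000).padStart(3, "0"); // SRT uses a comma separator
  return `${h}:${m}:${s},${frac}`;
}

function toSrt(segments) {
  return segments
    .map(
      (seg, i) =>
        `${i + 1}\n${toSrtTime(seg.start)} --> ${toSrtTime(seg.end)}\n${seg.text.trim()}\n`
    )
    .join("\n");
}
```

TXT and Markdown exports are simpler still (join the segment texts), which is why SRT, with its cue numbers and comma-separated millisecond timestamps, is the format worth sketching.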

How the Audio Transcriber Works: Technical Details

SimpleTool uses OpenAI Whisper models (distributed as ONNX via the Hugging Face CDN) executed with Transformers.js v4 inside a browser Web Worker. The model runs via WebGPU on supported hardware with an automatic WebAssembly (WASM) fallback for CPU-only devices. No server connection is opened during transcription; the only network request is the one-time model download.

Audio files are decoded via the browser's native Web Audio API and resampled to 16 kHz mono Float32 using OfflineAudioContext, the exact format Whisper was trained on. Long recordings are processed in 30-second chunks with 5-second overlap to preserve context across boundaries. Audio extracted from video containers (MP4, WebM, MOV) is handled by the same pipeline.
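The 30-second windows with 5-second overlap can be computed over the 16 kHz sample buffer roughly like this (a sketch of the chunking arithmetic, not the tool's actual code):

```javascript
// Sketch: split a 16 kHz mono sample buffer into 30 s chunks
// overlapping by 5 s, so each window advances 25 s and context
// is preserved across chunk boundaries.
function chunkBoundaries(totalSamples, sampleRate = 16000, chunkSec = 30, overlapSec = 5) {
  const chunk = chunkSec * sampleRate;                // 480,000 samples
  const step = (chunkSec - overlapSec) * sampleRate;  // advance 400,000 samples
  const bounds = [];
  for (let start = 0; start < totalSamples; start += step) {
    bounds.push([start, Math.min(start + chunk, totalSamples)]);
    if (start + chunk >= totalSamples) break; // last chunk reached the end
  }
  return bounds;
}
```

For a 60-minute recording this yields a chunk roughly every 25 seconds; the decoder's text from each overlap region is merged so words spanning a boundary are not cut in half.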

Paragraph breaks are inferred from natural pauses (silence gaps >1.5 seconds) to aid readability. This approximates speaker turns for meeting audio but is not true speaker diarization. Word-level timestamps are preserved in the JSON export for downstream tooling.
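Given word-level timestamps, the pause-based paragraphing can be sketched as follows (the word shape { text, start, end } is an assumed format for illustration):

```javascript
// Sketch: group words into paragraphs wherever the silence gap between
// one word's end and the next word's start exceeds 1.5 s.
function toParagraphs(words, gapSec = 1.5) {
  const paragraphs = [];
  let current = [];
  for (const w of words) {
    const prev = current[current.length - 1];
    if (prev && w.start - prev.end > gapSec) {
      paragraphs.push(current.map((x) => x.text).join(" "));
      current = [];
    }
    current.push(w);
  }
  if (current.length) paragraphs.push(current.map((x) => x.text).join(" "));
  return paragraphs;
}
```

Because speakers in a meeting usually pause between turns, this heuristic often lands breaks at turn changes, but it has no notion of who is speaking, which is why it is not diarization.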