Audio & Video to Text: 100% Private, No Uploads
Transcribe meetings, lectures, podcasts, and Zoom recordings with OpenAI's Whisper AI, running entirely in your browser. Works with MP3, WAV, M4A, MP4, and WebM files up to 150 minutes long. Your recordings never leave your device. Export as plain text, SRT subtitles, Markdown, or structured JSON.
100% private: your audio never leaves your device.
The Whisper AI model runs entirely in your browser. Zero server uploads.
All 99 Whisper-supported languages are available: type to search by name or code (e.g. Thai, th). Auto-detect works well for most audio; set the language manually if detection is wrong.
Drag & drop audio or video, or click to browse
MP3 · WAV · M4A · MP4 · WebM · up to 150 min
Paragraph breaks are inserted at natural pauses (>1.5s) to aid readability; this is not true speaker diarization.
How it works
Drop your file
Drag any MP3, WAV, M4A, MP4, or WebM file (audio or video, up to 150 min). It stays in your browser, with zero uploads.
Whisper AI transcribes locally
The model runs in a Web Worker via WebGPU or WASM. Only the first run downloads the model (~75 MB for Base); later runs start without the download.
Export to your format
Jump through the transcript with the audio player, then download TXT, SRT subtitles, Markdown for Notion, or JSON.
How the Audio Transcriber Works: Technical Details
SimpleTool uses OpenAI Whisper models (distributed as ONNX via the Hugging Face CDN) executed with Transformers.js v4 inside a browser Web Worker. The model runs via WebGPU on supported hardware, with an automatic WebAssembly (WASM) fallback for CPU-only devices. No server connection is opened during transcription; the only network request is the one-time model download.
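The WebGPU-or-WASM choice described above can be sketched as a small helper. This is a minimal illustration, not SimpleTool's actual code: the names `Backend`, `NavigatorLike`, and `chooseBackend` are hypothetical, and the check is reduced to whether the browser exposes `navigator.gpu` at all.

```typescript
// Hypothetical helper: these names are illustrative, not taken from SimpleTool's source.
type Backend = "webgpu" | "wasm";

// Minimal shape of the part of `navigator` this sketch inspects.
interface NavigatorLike {
  gpu?: unknown;
}

// Prefer WebGPU when the browser exposes `navigator.gpu`; otherwise fall back to WASM.
// Assumption: a production check would usually also await
// `navigator.gpu.requestAdapter()` and fall back if it resolves to null.
function chooseBackend(nav: NavigatorLike): Backend {
  return nav.gpu !== undefined && nav.gpu !== null ? "webgpu" : "wasm";
}
```

Inside the worker, the result of `chooseBackend(self.navigator)` would then be handed to whatever option the inference library uses to select its execution backend; how SimpleTool wires this up is not stated in the text.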
Audio files are decoded via the browser's native Web Audio API and resampled to 16 kHz mono Float32 using OfflineAudioContext, the exact format Whisper was trained on. Long recordings are processed in 30-second chunks with 5-second overlap to preserve context across boundaries. Audio extracted from video containers (MP4, WebM, MOV) is handled by the same pipeline.
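To make the chunking concrete, the sketch below computes the sample ranges for 30-second windows at 16 kHz. It assumes "5-second overlap" means each window shares its first 5 seconds with the end of the previous one (so windows start every 25 seconds); the function name `chunkRanges` and its parameters are hypothetical, and decoding and resampling are left to the Web Audio API as described above.

```typescript
// Hypothetical helper: returns [start, end) sample ranges for overlapping chunks.
// Assumption: consecutive chunks share `overlapSec` seconds of audio.
function chunkRanges(
  totalSamples: number,
  sampleRate = 16000,
  chunkSec = 30,
  overlapSec = 5,
): Array<[number, number]> {
  const size = chunkSec * sampleRate;                 // samples per chunk
  const step = (chunkSec - overlapSec) * sampleRate;  // distance between chunk starts
  if (step <= 0) throw new Error("overlap must be shorter than the chunk");

  const ranges: Array<[number, number]> = [];
  for (let start = 0; start < totalSamples; start += step) {
    const end = Math.min(start + size, totalSamples);
    ranges.push([start, end]);
    if (end === totalSamples) break; // last chunk reached the end of the audio
  }
  return ranges;
}
```

For a 70-second recording this yields three chunks covering 0 to 30 s, 25 to 55 s, and 50 to 70 s; each range can be cut out of the resampled buffer with `Float32Array.prototype.subarray(start, end)`.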
Paragraph breaks are inferred from natural pauses (silence gaps >1.5 seconds) to aid readability. This approximates speaker turns for meeting audio but is not true speaker diarization. Word-level timestamps are preserved in the JSON export for downstream tooling.
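The pause-based paragraph rule can be sketched as follows. This is an illustration under assumptions: the `Segment` shape (start and end times in seconds plus text) and the function name `groupIntoParagraphs` are hypothetical, and the real transcript data may be structured differently.

```typescript
// Hypothetical segment shape: times are in seconds.
interface Segment {
  start: number;
  end: number;
  text: string;
}

// Start a new paragraph whenever the silence between two consecutive
// segments is longer than `pauseSec` (1.5 s by default).
function groupIntoParagraphs(segments: Segment[], pauseSec = 1.5): string[] {
  const paragraphs: string[][] = [];
  let prevEnd: number | null = null;

  for (const seg of segments) {
    const startsNew = prevEnd === null || seg.start - prevEnd > pauseSec;
    if (startsNew) paragraphs.push([]);
    paragraphs[paragraphs.length - 1].push(seg.text.trim());
    prevEnd = seg.end;
  }
  return paragraphs.map((parts) => parts.join(" "));
}
```

Because the rule looks only at silence, two speakers who talk without a gap end up in the same paragraph, which is why the result approximates speaker turns rather than performing diarization.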