DevLog 20250706: Speech to Text Transcription using OpenAI Whisper

Originally published at dev.to

Overview

Mobile phones have had audio input for a long time, but none of the default options are particularly satisfactory. And despite the rise of capable online AI-based transcription services, there's still no easy tool for very simple scenarios like "turn this recording into some text."

In 2022, OpenAI released Whisper, a powerful model capable of transcribing many languages - but even now, there's no straightforward way to use it without invoking the API directly.

The API

Under the hood, Whisper is a deep neural network trained end-to-end to map raw audio to text. Conceptually, you:

  1. Provide an audio input - the model analyzes the waveform to extract linguistic and acoustic features.
  2. Leverage learned representations - its multi-layer architecture handles background noise, varied accents, and low-quality recordings.
  3. Produce a transcription - Whisper outputs a sequence of text that you can display, store, or post-process.

This high-level interaction keeps things simple: feed in speech, get back text - no need to manage model internals or low-level signal processing.
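The three-step interaction above boils down to a single HTTP call. Below is a minimal stdlib-only sketch, assuming the endpoint and form-field names from OpenAI's speech-to-text docs and an `OPENAI_API_KEY` environment variable; in practice you'd likely use the official `openai` SDK instead, whose equivalent call is `client.audio.transcriptions.create(model="whisper-1", file=f)`.

```python
# Minimal sketch of the Whisper transcription call using only the
# standard library. Endpoint and field names follow OpenAI's
# speech-to-text docs; OPENAI_API_KEY is assumed to be set.
import json
import os
import urllib.request
import uuid

API_URL = "https://api.openai.com/v1/audio/transcriptions"

def build_multipart(audio_bytes: bytes, filename: str, model: str = "whisper-1"):
    """Assemble a multipart/form-data body carrying the model name and file."""
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="model"\r\n\r\n{model}\r\n'
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode() + audio_bytes + f"\r\n--{boundary}--\r\n".encode()
    return boundary, body

def transcribe(path: str) -> str:
    """Feed in speech, get back text: POST the file, return the transcript."""
    with open(path, "rb") as f:
        boundary, body = build_multipart(f.read(), os.path.basename(path))
    req = urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text"]
```

That's the whole surface area a tool like Transcriber needs to wrap: one file upload in, one JSON field (`text`) out.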

The Utility

Today I'm sharing our free Transcriber tool, which I've been using for almost half a year. It does a solid job at what it's meant to do:
https://methodox.itch.io/transcriber

We likely won't have time to develop it further, but sharing it online makes it more accessible for others looking for a similar solution.

Screenshot

Challenges

Currently, there's a limit on audio length due to OpenAI API restrictions. It would also be ideal to add real-time transcription - something like Google Voice IME.



Thanks for the clear overview and sharing the Transcriber tool! A bit more depth on how you handle API limits or plans for real-time transcription would be really helpful—any thoughts on that?

Hi Ben, thanks for the question! The Transcriber tool doesn't work around API limits natively - it's merely a shell around OpenAI's Whisper API.

As for plans, here are some ideas:

  1. Whisper limits file size to 25 MB, and it accepts mp3, mp4, mpeg, mpga, m4a, wav, and webm - instead of sending wav, one can send mp3, which compresses well. For human voices, the loss in audio quality won't matter.
  2. A naive way to implement real-time transcription would be to periodically ask OpenAI to transcribe the partial recording captured so far. That should give good context-awareness but is less efficient in API use. On the other hand, OpenAI does offer a streaming transcription API that should definitely be looked into: https://platform.openai.com/docs/guides/speech-to-text#streaming-the-transcription-of-an-ongoing-audio-recording Note that streaming requires WebSockets, though.
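Idea 1 can be sketched in a few lines. This assumes the `ffmpeg` CLI is available on PATH; the function names and the 64 kbit/s mono setting are illustrative choices, not part of Transcriber:

```python
# Sketch of idea 1: shrink a WAV recording to MP3 before upload so it
# fits under Whisper's 25 MB per-file limit. Assumes ffmpeg is installed.
import os
import subprocess

MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # Whisper's per-file limit

def needs_compression(path: str) -> bool:
    """True when the file exceeds the Whisper upload limit."""
    return os.path.getsize(path) > MAX_UPLOAD_BYTES

def mp3_command(src: str, dst: str, bitrate: str = "64k") -> list:
    """Build the ffmpeg invocation; 64 kbit/s mono is ample for speech."""
    return ["ffmpeg", "-y", "-i", src, "-ac", "1", "-b:a", bitrate, dst]

def compress_to_mp3(src: str, dst: str) -> str:
    """Transcode src to MP3 at dst, raising on ffmpeg failure."""
    subprocess.run(mp3_command(src, dst), check=True)
    return dst
```

A caller would check `needs_compression` first and only transcode (and upload the MP3) when the original WAV is too large.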

It turns out OpenAI has since updated their docs and now provides quite extensive guidance on handling longer inputs.
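For reference, the chunking approach from that guidance can be sketched with the stdlib `wave` module (OpenAI's own example uses PyDub); the chunk length and function name here are illustrative:

```python
# Sketch of handling longer inputs: split a WAV file into fixed-length
# chunks that each fit under the API limit, transcribe them one by one,
# and join the text. Each chunk is returned as complete WAV bytes.
import io
import wave

def split_wav(path: str, chunk_seconds: int = 600) -> list:
    """Split a WAV file into chunks of at most chunk_seconds each."""
    chunks = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = src.getframerate() * chunk_seconds
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            buf = io.BytesIO()
            with wave.open(buf, "wb") as dst:
                dst.setparams(params)  # header is patched to the chunk's length on close
                dst.writeframes(frames)
            chunks.append(buf.getvalue())
    return chunks
```

One caveat from the docs: naive splitting can cut a sentence mid-word, so chunk boundaries are best placed at silences when accuracy matters.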

Please let me know if you find any technique particularly helpful!
