Free Transcription with Whisper from OpenAI

It's a great tool to have in the toolbox.

Whisper is a free command line tool from OpenAI for transcribing audio files. I’ve been using it for about a year, and I’ve been happy with the results. This is how OpenAI describes the tool:

Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. We show that the use of such a large and diverse dataset leads to improved robustness to accents, background noise and technical language.

Some of the benefits include:

  • Local data processing. I appreciate not having to upload files to the cloud for processing. Whisper is a package that you can install on your computer, and it runs on your hardware. This gives me peace of mind.
  • Great out-of-the-box accuracy. By my guesstimate, the transcripts I’ve generated with Whisper have been at least 80% accurate, depending on audio quality and the speaker’s enunciation. That's more than sufficient for my purposes.
  • Five different model sizes (tiny, base, small, medium, and large). Each model has a different number of parameters. The larger models run more slowly than the smaller ones, but they provide better accuracy. You can pick the size of the model that suits your needs.
  • Free to run. I’ve used Rev in the past for automated transcriptions, which can get pretty spendy. Whisper is just as good and it’s free.
  • Cross-language transcription and translation. Whisper can transcribe speech in a variety of languages, and it can translate that speech into English. For example, it can take audio in French and produce an English transcript.
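
To make the translation bullet concrete, here is a minimal sketch using Whisper's Python API rather than the command line. It assumes the openai-whisper package is installed; the helper name translate_to_english and the file name in the usage note are made up for illustration. Whisper's built-in translate task produces English output.

```python
def translate_to_english(model, audio_path, source_language=None):
    """Transcribe speech in another language and return English text.

    `model` is anything with a Whisper-style transcribe() method, such as
    the object returned by whisper.load_model().
    """
    options = {"task": "translate"}
    if source_language is not None:
        # Leaving this out lets Whisper detect the spoken language itself.
        options["language"] = source_language
    result = model.transcribe(audio_path, **options)
    return result["text"].strip()
```

To try it, load a model with whisper.load_model("medium") and call translate_to_english(model, "interview-fr.mp3", source_language="fr"), where the file name is a placeholder for your own recording.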

Setting up Whisper

Whisper does require some setup up front. If you’re comfortable doing basic command line work, setup will be a breeze. Follow the directions on the Whisper repo on GitHub.
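
Once you've followed the repo's directions, a quick sanity check can confirm that everything landed on your PATH. The sketch below assumes the install gives you a whisper command and that ffmpeg (which Whisper uses to read audio files) is installed separately; the helper name missing_tools is my own. It doesn't replace the official setup steps.

```python
import shutil


def missing_tools(tools=("whisper", "ffmpeg"), which=shutil.which):
    """Return the command names from `tools` that cannot be found on the PATH."""
    return [name for name in tools if which(name) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("whisper and ffmpeg are both available.")
```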

If you’re new to the command line, you can check out how I got started, as well as this tutorial from freeCodeCamp and this guide from Tania Rascia.

Running Whisper

Once you have Whisper set up, all you need to do is save your audio file on your machine, fire up the command line, and run the whisper command to transcribe the audio. I like to save the audio in its own folder, navigate to the folder in the terminal, and then run the command from there. This ensures that the transcript is automatically saved in the same folder as the audio file.

Here’s what a typical command looks like:

whisper AUDIO-FILE-NAME.mp3 --model medium --output_format txt

Here's how it works:

  • whisper triggers the application to run.
  • AUDIO-FILE-NAME.mp3 is the audio file to be transcribed.
  • --model medium tells Whisper which model to use.
  • --output_format txt tells Whisper to output a single text file. If an output format is not specified, Whisper will generate .txt, .vtt, .srt, .tsv, and .json versions of the transcript.
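
If you'd rather kick off the same command from a script (handy when you have a folder full of recordings), here's a minimal sketch in Python. It only wraps the command shown above; the function names build_whisper_command and transcribe are my own, and it assumes the whisper command is on your PATH.

```python
import subprocess


def build_whisper_command(audio_file, model="medium", output_format="txt"):
    """Assemble the command shown above as a list of arguments."""
    return [
        "whisper",
        audio_file,
        "--model", model,
        "--output_format", output_format,
    ]


def transcribe(audio_file, **options):
    """Run Whisper and raise an error if it exits with a failure code."""
    subprocess.run(build_whisper_command(audio_file, **options), check=True)


# Print the command instead of running it, to check it before committing.
print(" ".join(build_whisper_command("AUDIO-FILE-NAME.mp3")))
```

Calling transcribe("AUDIO-FILE-NAME.mp3", model="tiny") would then run the real thing with the tiny model.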

For most things, I find the medium model to be a good starting point. It seems to offer a good balance of speed and accuracy. If you have a longer file to transcribe (30 minutes or longer), I recommend processing it with the tiny model first, spot-checking the result, and then running it through a larger model if you need more accuracy.
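
One wrinkle with the tiny-first approach: Whisper names the transcript after the audio file, so a second pass run from the same folder will likely overwrite the first. Here's a rough sketch that sends each pass to its own folder using the --output_dir flag. The function name and the "transcript-" folder names are arbitrary choices of mine, and the file name is a placeholder.

```python
from pathlib import Path


def two_pass_commands(audio_file, first="tiny", second="medium"):
    """Build one command per model, each writing to its own folder."""
    audio = Path(audio_file)
    commands = []
    for model in (first, second):
        # Hypothetical naming scheme: transcript-tiny, transcript-medium, ...
        output_dir = audio.parent / f"transcript-{model}"
        commands.append([
            "whisper", str(audio),
            "--model", model,
            "--output_format", "txt",
            "--output_dir", str(output_dir),
        ])
    return commands


quick_pass, careful_pass = two_pass_commands("podcast.mp3")
print(" ".join(quick_pass))
print(" ".join(careful_pass))
```

Run the first command, spot-check the text file it produces, and only run the second if the quick pass isn't accurate enough.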

⏲️
As a test, I transcribed a 49-minute podcast. After running the medium model of Whisper for 45 minutes on an M1 MacBook Pro, it had transcribed just 15 minutes of the podcast. By contrast, when I processed the same audio using the tiny model, it was done in under 7 minutes.
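
To put those numbers in perspective, here's the back-of-the-envelope math as a short script. It assumes the medium model would have kept the same pace for the rest of the file, which I didn't verify, and it treats "under 7 minutes" as 7.

```python
def realtime_factor(audio_minutes, processing_minutes):
    """Minutes of audio transcribed per minute of processing."""
    return audio_minutes / processing_minutes


medium = realtime_factor(15, 45)  # 15 minutes of audio in 45 minutes
tiny = realtime_factor(49, 7)     # 49 minutes of audio in (just under) 7 minutes

# Projected time for the medium model to finish all 49 minutes,
# assuming it keeps the same pace throughout.
medium_total = 49 / medium

print(f"medium: {medium:.2f}x real time, about {medium_total:.0f} minutes for the full episode")
print(f"tiny:   at least {tiny:.0f}x real time")
```

At that pace, the medium model would need roughly 147 minutes for the whole episode, about 21 times longer than the tiny model took.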

Whisper has a bunch of additional settings that you can tweak by adding arguments to your command. To see a list of all available settings, run whisper --help.

Limitations of Whisper

  • Whisper cannot identify different speakers. If you’re trying to transcribe a podcast or another recording with multiple people, this may not be the tool for you.
  • Whisper may miss jargon and technical terms. Even though Whisper was trained on a sizable data set, if your content includes a lot of specialized language, you may have to go on a find-and-replace safari and/or manually comb through the content.
  • Whisper struggles with poor-quality recordings. In my experience, it can handle background noise and reverb pretty well, but the quality of the audio directly impacts the quality of the transcription. Don't expect magical results without decent input. It's a good idea to clean up your audio first with Audition, Audacity, or even GarageBand. A little compression and noise reduction can work wonders.
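
The find-and-replace safari mentioned above can be partly automated. Here's a minimal sketch that applies a list of corrections to a transcript. The misheard terms are hypothetical examples, not real Whisper output, and the function name fix_terms is my own; you'd build your own list as you spot recurring mistakes.

```python
import re


def fix_terms(text, corrections):
    """Replace each misheard term with its correction, matching whole words only."""
    for wrong, right in corrections.items():
        pattern = r"\b" + re.escape(wrong) + r"\b"
        # A function replacement keeps backslashes in `right` literal.
        text = re.sub(pattern, lambda _match, r=right: r, text, flags=re.IGNORECASE)
    return text


# Hypothetical examples of terms a transcript might get wrong.
corrections = {
    "cooper netties": "Kubernetes",
    "post gress": "Postgres",
}

sample = "We moved the Post Gress database onto cooper netties last year."
print(fix_terms(sample, corrections))
```

The whole-word matching keeps a correction from firing inside a longer word, but you'll still want to skim the result by hand.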

Final Thoughts

Whisper has become my go-to tool for quick transcriptions of audio files. Cloud-based transcription services have their place, and I'll continue to use them if they're the best option, but it's extremely helpful to have a local, free, and remarkably accurate tool accessible with a few keystrokes.

For more info on Whisper, check out Introducing Whisper from OpenAI.