OpenAI Whisper is an automatic speech recognition (ASR) and transcription model. It was trained on over 680,000 hours of multilingual data collected from the web and can transcribe speech in 97 different languages. Whisper can be used as a standalone command-line tool or incorporated into an application as a library. It is open-source and free to use.
In this tutorial, we will learn
- Installation and Running on Windows, Mac, and Ubuntu
- OpenAI Whisper parameters
- Whisper performance
- Setting up with Python
- Setting up with Node.js
Are you looking to transcribe YouTube videos?
Visit our free online tool coswafe.com to transcribe your YouTube videos. Convert your audio to SRT, JSON, and text files easily.
1. Installation and Running on Windows, Mac, and Ubuntu
In this section, we will learn how to set up dependencies for OpenAI Whisper and use it as a standalone application. Also, we will learn to extract text from audio clips.
Whisper requires Python 3.8+, pip, and a recent version of PyTorch to run. It also needs the ffmpeg command-line tool. There are a few other Python package dependencies, most notably Hugging Face Transformers for its fast tokenizer implementation and ffmpeg-python for reading audio files. Rust may be needed if tokenizers does not provide a pre-built wheel for your platform. This tutorial won’t go into details about setting up Rust and building tokenizers.
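Before moving on, a quick sanity check from a terminal can confirm the core prerequisites (assuming python3 and pip are already on your PATH; on Windows the commands may be python and pip instead):
# confirm the Python and pip versions
python3 --version   # should report 3.8 or newer
pip3 --version
# install or update PyTorch if it is not already present
pip install -U torch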
1.a. Setting up dependencies
- Windows
On Windows, you need to have Chocolatey or Scoop package manager installed to set up the ffmpeg command-line tool.
# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg
# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
Developer mode should be enabled, if it is not already, so that you can install applications from any source. You can enable Developer mode from Settings > Privacy & security > For developers > Developer mode.
- Mac
You will need the Homebrew package manager installed, and then run the following command in the terminal:
# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg
If you are prompted about installing applications from unknown sources, go to System Preferences > Security & Privacy > General and select Anywhere under the Allow apps downloaded from section.
- Ubuntu
Installing on Ubuntu is pretty straightforward with the following command
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg
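Whichever platform you are on, you can verify that ffmpeg was installed correctly by checking its version:
# verify that ffmpeg is on the PATH
ffmpeg -version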
1.b. Installing Whisper
The following command will pull and install the latest commit of Whisper from its GitHub repository, along with its Python dependencies:
pip install git+https://github.com/openai/whisper.git
The installation may warn that the directories containing the newly installed scripts are not on your PATH. If it does, add those paths to your environment variables accordingly.
1.c. Running the Whisper command line
Once the above step completes successfully, you are ready to extract the text from an audio file by running the following command:
whisper sample.mp3
It takes some time to process the audio and generates output in five different formats, namely .json, .srt, .tsv, .txt, and .vtt.
# Run the command
chilarai@chilarai:~$ whisper sample.mp3
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:04.000] My thought, I have nobody by a beauty and will as you've poured.
[00:04.000 --> 00:09.800] Mr. Rochester is sir, and that so don't find simpus, and devoted abode, to hath might in
[00:09.800 --> 00:34.620] her name.
# Output files
chilarai@chilarai:$ ls
sample.mp3 sample.mp3.json sample.mp3.srt sample.mp3.tsv sample.mp3.txt sample.mp3.vtt
1.d. Customizing Whisper
If you run the following command in your terminal, you will see a lot of customization options for the input and output arguments. I will explain a few of the common arguments here so that it is easier to understand how to extract better results from Whisper. A combined example follows the list below.
whisper --help
- --model: Selects which model to use for processing. Each model has been trained with a different number of parameters, described in detail in the next section.
- --output_format: Selects the output format from the options: txt, vtt, srt, tsv, json, all. The default value is all.
- --verbose: Whether to print progress and debug messages (default: True).
- --task: Whether to transcribe or translate (default: transcribe). Transcribe performs speech recognition in the spoken language, whereas translate converts speech in another language into English text.
- --language: Specifies the language spoken in the audio. If none is provided, Whisper detects the language automatically.
- --temperature: Controls the randomness of decoding. By default it is 0 (greedy decoding); if the transcription is not good, you can gradually increase it up to 1.
- --beam_size: Number of beams in beam search, a heuristic search algorithm that explores a graph of candidate transcriptions.
- --best_of: Number of candidates when sampling with non-zero temperature.
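Putting a few of these options together, a typical invocation might look like the following (sample.mp3 is just a placeholder file name):
# transcribe an English clip with the medium model and write only an SRT file
whisper sample.mp3 --model medium --language en --output_format srt --temperature 0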
2. OpenAI Whisper parameters
The performance of a model is determined by a variety of parameters. A model is regarded as good if it achieves high accuracy on production or test data and generalizes effectively to unseen data. A machine learning model's parameters determine how the input data is transformed into the desired output. Whisper comes in five model sizes, each trained with a different number of parameters. Here is the list:
| Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|---|---|---|---|---|---|
| tiny | 39 M | tiny.en | tiny | ~1 GB | ~32x |
| base | 74 M | base.en | base | ~1 GB | ~16x |
| small | 244 M | small.en | small | ~2 GB | ~6x |
| medium | 769 M | medium.en | medium | ~5 GB | ~2x |
| large | 1550 M | N/A | large | ~10 GB | 1x |
For English-only audio, you can use the models with the .en suffix, which tend to perform better. The difference in transcription performance becomes less significant for the small.en and medium.en models. In general, the more parameters a model has, the better its transcription quality; however, larger models consume more hardware resources, run more slowly, and incur more costs.
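For example, to pick an English-only checkpoint from the command line, pass its name to the --model flag (sample.mp3 is a placeholder):
# use the English-only small model
whisper sample.mp3 --model small.en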
3. OpenAI Whisper performance
The transcription performance of Whisper differs from language to language. OpenAI has published an official per-language breakdown of zero-shot Word Error Rate (WER). Word Error Rate (WER) is a common metric for measuring the speech-to-text accuracy of automatic speech recognition (ASR) systems. Microsoft claims a word error rate of 5.1%, and Google boasts a WER of 4.9%. For comparison, human transcriptionists average a word error rate of 4%. A 4% WER means 96% of the transcript is correct.
As those metrics indicate, the WER for many languages is quite high, meaning Whisper makes many transcription errors in them. However, this can be improved by fine-tuning a model, which will be covered in another tutorial.
Whisper can run on both CPU and GPU. However, with the larger models it is advisable to use a GPU, since they involve a lot of processing and run much faster there.
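If you have a CUDA-capable GPU and a CUDA build of PyTorch installed, you can select the device explicitly from the command line (a small sketch; sample.mp3 is a placeholder):
# run the medium model on the GPU
whisper sample.mp3 --model medium --device cuda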
4. Setting up Whisper with Python
If you are only interested in setting up Whisper with Node.js, you may skip this section and jump to the next one.
Integrating Whisper with Python is pretty straightforward. To transcribe simple English speech into text, use the following code and save it as transcribe.py:
import whisper
model = whisper.load_model("base")
result = model.transcribe("sample.mp3")
print(result["text"])
In the program above, we load the base model and transcribe the sample.mp3 file. Since no language option is given, Whisper will detect the language automatically and print the text to the console. Now run the program as follows:
# Run the command
chilarai@chilarai:~$ python3 transcribe.py
# Command Output
My thought I have nobody by a beauty and will as you poured. Mr. Rochester is sub in that so-don't find simplest, and devoted about, to what might in a-
That’s how you integrate the Whisper program with Python.
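The transcribe() call also accepts many of the options described in section 1.d, passed as keyword arguments. Here is a minimal sketch; the parameter values are illustrative:
import whisper

model = whisper.load_model("base")
# skip automatic language detection and keep decoding deterministic
result = model.transcribe("sample.mp3", language="en", temperature=0.0)
print(result["text"])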
Update: Official Python Whisper model
OpenAI has released an official paid, hosted model named “whisper-1”, which is optimized for speed. It is used by installing the openai package and requires an OpenAI API key, which you can obtain from your OpenAI account.
Install openai using the following
pip install openai
To transcribe an audio file, use the following code
import os
import openai

# read the API key from the environment instead of hard-coding it
openai.api_key = os.getenv("OPENAI_API_KEY")

audio_file = open("audio.mp3", "rb")  # audio file to be transcribed
transcript = openai.Audio.transcribe("whisper-1", audio_file)  # the parameters described above can also be passed here
print(transcript["text"])  # print the transcribed text
This code will generate transcripts much faster.
5. Setting up Whisper with Node.js
Integrating Node.js with Whisper is not a straightforward job, as there is no official SDK or API available. However, interacting with Whisper is not very difficult. The main idea behind the implementation lies in the fact that Node.js can interact with the Whisper command-line application.
The program depends on the child_process module, which is built into Node.js, so no separate installation is required.
Let’s create a program that interacts with the Whisper command line. Save the file as nodewhisper.js
const { spawn } = require("child_process");
const outdata = spawn("whisper", ["sample.mp3"]);

outdata.stdout.on("data", data => {
  console.log(`stdout: ${data}`);
});

outdata.stderr.on("data", data => {
  console.log(`stderr: ${data}`);
});

outdata.on("error", (error) => {
  console.log(`error: ${error.message}`);
});

outdata.on("close", code => {
  console.log(`child process exited with code ${code}`);
});
In line 1 of the code above, we import spawn from the child_process module, which lets us create child processes of our main Node.js process and execute shell commands with them.
In line 2, we call the whisper shell command using spawn(), which streams all of the command's output.
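If you want to pass any of the command-line options from section 1.d, simply add them to the argument array. A small sketch; the flag values are illustrative:
const { spawn } = require("child_process");

// request the small model and write only an SRT file
const outdata = spawn("whisper", ["sample.mp3", "--model", "small", "--output_format", "srt"]);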
Run the command using
# Run the command
chilarai@chilarai:~$ node nodewhisper.js
# Command Output
My thought I have nobody by a beauty and will as you poured. Mr. Rochester is sub in that so-don't find simplest, and devoted about, to what might in a-
You see, it wasn’t that hard, was it?
Update: Official OpenAI Whisper Node.js library!
The tutorial above uses the free, open-source Whisper model to transcribe your audio. However, OpenAI has recently unveiled a paid, hosted model called “whisper-1”, which is optimized for processing speed.
You will need an OpenAI API key, which you can obtain from your OpenAI account. After that, install the openai library using the following:
npm install openai
Once the library is installed, use the code below to interact with the OpenAI library
const fs = require("fs");
const { Configuration, OpenAIApi } = require("openai");

const configuration = new Configuration({
  apiKey: "<KEY OBTAINED ABOVE>",
});

async function transcribe() {
  const openai = new OpenAIApi(configuration);
  const resp = await openai.createTranscription(
    fs.createReadStream("audio.mp3"), // audio input file
    "whisper-1",                      // Whisper model name
    undefined,                        // prompt (optional)
    "json",                           // output format: json, text, srt, verbose_json, or vtt
    0,                                // temperature
    "en"                              // ISO language code, e.g. "en" for English
  );
  console.log(resp.data);
}

transcribe();
The parameters passed to openai.createTranscription() have already been explained in detail above. Running this will give you the transcription output of your audio.
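With the json response format, the transcribed text itself is available on resp.data.text (assuming the request succeeds):
// inside transcribe(), after the request resolves
console.log(resp.data.text); // just the transcribed text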
Conclusion
OpenAI Whisper is a versatile and powerful tool for transcription and translation. In this tutorial, we have learned how to install Whisper on Windows, Mac, and Ubuntu. We have also compared the performance of the available models. Finally, we saw how to integrate Whisper with Python and Node.js. In a later tutorial, we’ll learn how to use Whisper’s more advanced features and how to add them to our programs.
If you want to read about how to train an OpenAI GPT-3 model, visit https://techpro.ninja/how-to-train-openai-gpt-3/