Voice control on PC and RaspberryPi with Whisper


The idea of the project is to give voice instructions to interact with our PC or our Raspberry Pi, using Whisper, OpenAI's speech-to-text model.

We will give a command that Whisper will transcribe into text; the text is then analyzed to execute the appropriate action, which can range from running a program to applying voltage to the Raspberry Pi pins.

I am going to use an old Raspberry Pi 2, a USB microphone, and the speech-to-text model recently released by OpenAI, Whisper. At the end of the article you can learn a little more about Whisper.

All of it programmed in Python.

You can see a demonstration of how it works in this video, controlling the PC by voice.

Mounting

To use it with the PC, we will only need a microphone.

If you are going to mount it on the Raspberry Pi, you will need a USB microphone, because its audio jack is output-only.

We need:

  • USB microphone
  • Raspberry Pi with an operating system (Raspbian, for example)
  • Electronics (LED, wires, a 480-ohm resistor and a breadboard)

Since the general purpose of the tool is voice recognition, I find it very useful to integrate it into the operation of other devices.

We connect the LED to pin 17, which is the one we will turn on and off in this experiment.

Code development

It is divided into three parts: first, the audio recording, for which I have borrowed code from GeeksforGeeks, since I don't know those libraries well; second, the conversion of audio to text with Whisper; and third, the processing of that text and the response on the Raspberry Pi.

In the test example I am only going to interact with an LED, making it light up or blink, but we could extend the script to fit our needs.

I'm aware that this is a Raspberry Pi 2 and it's going to be much slower than a Raspberry Pi 4, but for testing it's fine.

Before you can get it working, you will need to install the following:

# Install Whisper
pip install git+https://github.com/openai/whisper.git
sudo apt update && sudo apt install ffmpeg

# For the audio recording to work
python3 -m pip install sounddevice --user
pip install git+https://github.com/WarrenWeckesser/wavio.git

# If you are going to install it on the Raspberry Pi,
# grant permissions to use the GPIO
sudo apt install python3-gpiozero
sudo usermod -aG gpio <username>
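Before running the main script, it can save some head-scratching to check that the Python dependencies actually installed. A minimal sketch using only the standard library (the helper name is my own):

```python
import importlib.util

def dependencias_faltantes(mods=("whisper", "sounddevice", "wavio", "gpiozero")):
    """Return the list of modules that are not importable in this environment."""
    return [m for m in mods if importlib.util.find_spec(m) is None]

# Prints an empty list when everything is installed.
print(dependencias_faltantes())
```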

All the code

#!/usr/bin/env python3
import time

import whisper
from gpiozero import LED
import sounddevice as sd
from scipy.io.wavfile import write
import wavio as wv


def main():
    inicio = time.time()
    record_audio()

    model = whisper.load_model("tiny")
    result = model.transcribe("audio1.wav")
    words = result["text"].split()

    for word in words:
        word = word.replace(',', '').replace('.', '').lower()
        if word in ('enciende', 'encender'):
            encender()
            break
        if word in ('parpadea', 'parpadear'):
            parpadear()
            break
    fin = time.time()
    print(fin - inicio)

def encender():
    LED(17).on()

def parpadear():
    light = LED(17)
    while True:
        light.on()
        time.sleep(1)
        light.off()
        time.sleep(1)

def record_audio():
    # Sampling frequency
    freq = 44100
    # Recording duration in seconds
    duration = 5
    # Start the recorder with the given duration and sample frequency
    recording = sd.rec(int(duration * freq),
                       samplerate=freq, channels=2)
    # Wait until the recording is finished
    sd.wait()
    # Write the NumPy array to an audio file with the given sampling frequency
    write("audio0.wav", freq, recording)
    # Write the same array with wavio, using 16-bit samples
    wv.write("audio1.wav", recording, freq, sampwidth=2)

main()



I haven't been able to test it on the Raspberry Pi because I don't have a microSD for it, or a USB microphone to connect, but as soon as I try it I will correct any errors that may have slipped in.

Step by step explanation of the code

#!/usr/bin/env python3

The shebang tells the device which language we have programmed in and which interpreter to use. Although it seems trivial, leaving it out often causes errors.

Imported libraries

import whisper
import time
from gpiozero import LED
import sounddevice as sd
from scipy.io.wavfile import write
import wavio as wv

whisper, to work with the model; time, which I use to measure how long the script takes to run; gpiozero, to work with the GPIO pins of the Raspberry Pi; and sounddevice, scipy and wavio to record the audio.

The functions

I have created 4 functions:

  • main()
  • encender()
  • parpadear()
  • record_audio()

encender() simply applies voltage to pin 17 of the Raspberry Pi, where in this case we have connected the LED for testing.

def encender ():
    LED(17).on()

parpadear() is like encender(), but it makes the LED blink by turning it on and off inside a loop.

def parpadear():
    light = LED(17)
    while True:
        light.on()
        time.sleep(1)
        light.off()
        time.sleep(1)
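Since the while True loop never returns, a variation that blinks a fixed number of times can be handy. Here is a sketch of my own where the on/off callables stand in for light.on/light.off, so it also runs off the Pi:

```python
import time

def parpadear_n(veces, intervalo=1.0, on=lambda: None, off=lambda: None):
    """Toggle the LED `veces` times; on the Pi, pass light.on and light.off
    as `on` and `off`. The defaults do nothing so the sketch runs anywhere."""
    estados = []
    for _ in range(veces):
        on()
        estados.append("on")
        time.sleep(intervalo)
        off()
        estados.append("off")
        time.sleep(intervalo)
    return estados
```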

With record_audio() we record the audio file.

def record_audio():
    # Sampling frequency
    freq = 44100
    # Recording duration in seconds
    duration = 5
    # Start the recorder with the given duration and sample frequency
    recording = sd.rec(int(duration * freq),
                       samplerate=freq, channels=2)
    # Wait until the recording is finished
    sd.wait()
    # Write the NumPy array to an audio file with the given sampling frequency
    write("audio0.wav", freq, recording)
    # Write the same array with wavio, using 16-bit samples
    wv.write("audio1.wav", recording, freq, sampwidth=2)
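For reference, that last write amounts to producing a 16-bit (sampwidth=2) RIFF/WAVE file, which can also be done with the standard library wave module. In this sketch a short sine tone stands in for a real microphone capture:

```python
import math
import struct
import wave

freq = 44100       # sampling frequency, as in the script
duration = 0.1     # a short 440 Hz tone instead of a recording

# Generate 16-bit signed samples, the same format wavio writes with sampwidth=2.
n = int(freq * duration)
samples = [int(32767 * math.sin(2 * math.pi * 440 * i / freq)) for i in range(n)]

with wave.open("tono.wav", "wb") as f:
    f.setnchannels(1)     # mono is enough for speech recognition
    f.setsampwidth(2)     # 2 bytes per sample
    f.setframerate(freq)
    f.writeframes(struct.pack(f"<{n}h", *samples))
```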

main() is the main function. Notice that the only thing we have outside of functions is the call to main() at the end of the script. This way, on startup, the libraries are imported first and then the function is called.

def main():
    inicio = time.time()
    record_audio()

    model = whisper.load_model("tiny")
    result = model.transcribe("audio1.wav")
    words = result["text"].split()

    for word in words:
        word = word.replace(',', '').replace('.', '').lower()
        if word in ('enciende', 'encender'):
            encender()
            break
        if word in ('parpadea', 'parpadear'):
            parpadear()
            break
    fin = time.time()
    print(fin - inicio)

We save the time at which we start executing the function, then call the audio-recording function, which records our instruction in an audio file (.wav, .mp3, etc.) that we will later convert to text.

    inicio = time.time()
    record_audio()
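As a side note, for measuring elapsed time the standard library also offers time.perf_counter(), which has higher resolution than time.time() and is not affected by clock adjustments:

```python
import time

inicio = time.perf_counter()
time.sleep(0.1)              # stands in for record_audio() and the transcription
fin = time.perf_counter()
print(f"{fin - inicio:.2f} s")
```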

  

Once we have the audio, Whisper is called and we tell it which model we want to use. There are 5 available, and we will use tiny: although it is the least accurate, it is the fastest, and the audio will be simple, only 3 or 4 words.

    model = whisper.load_model("tiny")
    result = model.transcribe("audio1.wav")
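For reference, here is a summary of the five model sizes as listed in the openai/whisper README (the parameter counts are approximate, verify them before relying on them):

```python
# Approximate parameter counts for the five Whisper model sizes
# (values recalled from the openai/whisper README; treat as assumptions).
WHISPER_MODELS = {
    "tiny":   "~39 M parameters",
    "base":   "~74 M parameters",
    "small":  "~244 M parameters",
    "medium": "~769 M parameters",
    "large":  "~1550 M parameters",
}

for name, size in WHISPER_MODELS.items():
    print(f"{name}: {size}")
```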

  

With this we have the audio converted to text and saved in a variable. Let's modify it a bit.

We convert the result into a list containing each word of the audio:

    words = result["text"].split()

  

And everything is ready to interact with our device. Now we just have to create the conditions we want.

If the audio has word X, do Y. Since we have the words in a list, it is very easy to add conditions:

    for word in words:
        word = word.replace(',', '').replace('.', '').lower()
        if word in ('enciende', 'encender'):
            encender()
            break
        if word in ('parpadea', 'parpadear'):
            parpadear()
            break
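The chain of ifs can also be written as a dispatch table, which makes adding a new command a one-line change. This is my own sketch; the two handlers return strings here instead of driving the GPIO:

```python
def encender():
    return "LED on"          # stand-in for LED(17).on()

def parpadear():
    return "LED blinking"    # stand-in for the blink loop

# Map trigger words to handler functions.
ACCIONES = {
    "enciende": encender, "encender": encender,
    "parpadea": parpadear, "parpadear": parpadear,
}

def ejecutar(texto):
    # Normalize each word and run the first matching action.
    for palabra in texto.split():
        accion = ACCIONES.get(palabra.strip(",.").lower())
        if accion:
            return accion()
    return None
```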

  

The line

        word = word.replace(',', '').replace('.', '').lower()

converts the words of the audio to lowercase and removes commas and periods, to avoid errors in the comparisons.

In each if, when the condition of containing one of the chosen words is met, a function is called that does what we want.

This is where we tell it to activate a pin that will light an LED or make it blink. Or run some code, or shut down the computer.

All this is a basic idea. From here you can develop the project and improve it as you want. Each person can find a different use for it.

Things we can do with this setup

These are ideas that come to mind to take advantage of this setup. Once the skeleton is in place, we can use it to voice-activate anything we can think of: trigger a relay that starts a motor, or launch a script that runs a program, sends an email, or whatever.
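For instance, the same dispatch idea extends to launching programs. A hypothetical mapping of trigger words to command lines (the commands are made-up examples, replace them with your own):

```python
import subprocess

# Hypothetical examples: trigger word -> command line to launch.
COMANDOS = {
    "navegador": ["firefox"],
    "musica":    ["vlc", "playlist.m3u"],
}

def lanzar(palabra):
    cmd = COMANDOS.get(palabra.lower())
    if cmd is None:
        return None
    # Popen launches the program without blocking the voice loop.
    return subprocess.Popen(cmd)
```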

What is Whisper

Whisper is a voice recognition model that works with a large number of languages and can also translate into English. It is what we know as a speech-to-text tool, but Open Source, released by the OpenAI team, the creators of DALL·E.
