Automated AI Voices

Overview

Intro

Retrieval-based Voice Conversion (RVC) is a type of AI model that can convert one voice into another or give a text-to-speech system a new voice. This powerful technology is open source, which means that anyone can create their own unique voices on their own computer. The main concerns are ease of use, practicality, and model quality. Can we find a way to automate training and produce high-quality results with no programming? With new open source tools, yes.

RVC training

Tools Used

(All open source)

Tool	Description
NeMo	Diarization (determining who is speaking, and when, in an audio file).
WhisperX	Transcription and voice activity detection (VAD). VAD is used to determine cut-off points for audio segments.
FFmpeg	Manipulating audio files.
spaCy	Comparing typed lines against transcribed lines.
Ultimate Vocal Remover	Removes background music from audio files.
RVC Training and Inference	A fork of the original RVC repository, modified by Jarod Mica, that makes RVC training and inference simple.
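
To make the roles of VAD and FFmpeg in the table more concrete, here is a minimal sketch of how speech intervals could be turned into cut commands. It is not the repository's actual code: the interval values, the `max_gap` threshold, the file names, and the helper names `merge_intervals` and `ffmpeg_cut_command` are assumptions made for illustration. Only the FFmpeg options used (`-y`, `-i`, `-ss`, `-to`) are real.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def merge_intervals(intervals: List[Interval], max_gap: float = 0.5) -> List[Interval]:
    """Merge speech intervals separated by at most max_gap seconds of silence."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            # Short pause: extend the previous segment instead of cutting here.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            # Long pause: this is a cut-off point, so start a new segment.
            merged.append((start, end))
    return merged


def ffmpeg_cut_command(src: str, dst: str, start: float, end: float) -> List[str]:
    """Build an FFmpeg command that extracts [start, end] of src into dst."""
    return ["ffmpeg", "-y", "-i", src, "-ss", f"{start:.2f}", "-to", f"{end:.2f}", dst]


# Hypothetical VAD output for one file (seconds).
vad = [(0.0, 1.2), (1.4, 3.0), (5.0, 6.5)]
segments = merge_intervals(vad)
commands = [
    ffmpeg_cut_command("episode.wav", f"segment_{i:03d}.wav", s, e)
    for i, (s, e) in enumerate(segments)
]
for command in commands:
    print(" ".join(command))
```

The first two intervals are only 0.2 seconds apart, so they end up in one segment, while the two-second silence before the third interval produces a cut.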

Automating the Data Prep

Running Automated Speaker Extraction

What’s the best way to automate the data preparation needed for training RVC? With AI, of course! I built a GitHub repository that uses diarization to extract a single speaker's audio. After installing a few prerequisite programs, all that is needed on our end is a file containing audio and a list of lines spoken by our desired speaker, entered in the Excel file included in the repository.
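
The idea of matching typed lines against transcribed lines can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation: it swaps in the standard library's `difflib.SequenceMatcher` for spaCy's similarity scoring, and the labels, example sentences, `threshold`, and function names are all hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_speaker_label(
    typed_lines: List[str],
    transcript: List[Tuple[str, str]],  # (diarization_label, transcribed_text)
    threshold: float = 0.8,
) -> Optional[str]:
    """Return the diarization label whose lines best match the typed lines."""
    if not transcript:
        return None
    votes: Counter = Counter()
    for typed in typed_lines:
        # Find the transcribed line closest to what the user typed.
        label, text = max(transcript, key=lambda row: similarity(typed, row[1]))
        if similarity(typed, text) >= threshold:
            votes[label] += 1
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical diarized transcript and user-typed lines.
transcript = [
    ("speaker_0", "Good morning, how are you today?"),
    ("speaker_1", "We should probably call the bank first."),
    ("speaker_0", "I left the keys on the table."),
]
typed = ["Good morning how are you today", "I left the keys on the table"]
label = find_speaker_label(typed, transcript)
print(label)
```

Voting across several typed lines makes the match more robust to a single transcription error than relying on one line alone.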

Prerequisite programs:

Install CUDA for Windows
Install WSL for Windows
Install CUDA for WSL
Install Miniconda (other Python managers work as well)

Let’s set up the repository. Abbreviated usage instructions can be found in the GitHub repository’s README.

Open a command-line terminal. A shortcut on Windows is to navigate to the desired folder in File Explorer and type cmd in the address bar, as shown below.

Command shortcut

Paste the following commands into the terminal.

git clone https://github.com/ProtoPompAI/Automated-RVC-Data-Preprocessing.git
cd Automated-RVC-Data-Preprocessing
conda create -n Automated-RVC-Data python=3.10 -y && conda activate Automated-RVC-Data
pip install -r requirements.txt

Whenever you want to use the program, be sure to run the command conda activate Automated-RVC-Data first. This activates the Python environment that contains all of the packages needed to run the program.

Create a directory containing files with some sort of audio. One way to get example audio is to use yt-dlp on long YouTube videos, such as public domain TV shows. One such example is the first season of The Beverly Hillbillies.
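
As a rough sketch of the download step, the snippet below builds a yt-dlp command that keeps only the audio track. The URL and the `input_audio` folder are placeholders, and the helper names are hypothetical. The options `-x`, `--audio-format`, and `-o` are standard yt-dlp options, and audio extraction needs FFmpeg to be installed.

```python
import os
import subprocess
from typing import List


def yt_dlp_audio_command(url: str, out_dir: str) -> List[str]:
    """Build a yt-dlp command that saves only the audio track as WAV."""
    # Output template: one file per video, named after the video title.
    template = os.path.join(out_dir, "%(title)s.%(ext)s")
    return ["yt-dlp", "-x", "--audio-format", "wav", "-o", template, url]


def download_audio(url: str, out_dir: str) -> None:
    """Run yt-dlp; requires yt-dlp and FFmpeg to be installed and on PATH."""
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(yt_dlp_audio_command(url, out_dir), check=True)


# Placeholder URL: replace it with a video or playlist you are allowed to download.
command = yt_dlp_audio_command("https://example.com/watch?v=VIDEO_ID", "input_audio")
print(" ".join(command))
```

Calling `download_audio` with a real URL would fill the folder that is later passed to the preprocessing script as the input audio directory.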

The next step is to fill out the Excel file included in the cloned repository, as shown in the image below. More information on the columns can be found in the Excel comments, which are indicated by a red triangle in the top-right corner of a cell.

Specification File

After the Excel file is completed, run the command-line script as shown below.

python preprocess_data.py INPUT_AUDIO_DIRECTORY OUTPUT_LOCATION --speaker_label SPEAKER_LABEL --specification_file SPEC_FILE -k

After this script completes, there will be a folder containing the extracted audio for the speaker label given in the Excel file. This audio is ready to be put through Ultimate Vocal Remover 5.
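
Before moving on, it can be useful to check how much audio was actually extracted. The sketch below sums the duration of the clips in a folder using only the standard library. It assumes the extracted clips are uncompressed PCM `.wav` files sitting directly in the output folder, which may not match the script's real output layout. The function name is hypothetical.

```python
import wave
from pathlib import Path


def total_wav_seconds(folder: str) -> float:
    """Sum the duration of every .wav file directly inside folder."""
    total = 0.0
    for path in sorted(Path(folder).glob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            # Duration = number of frames / frames per second.
            total += wav.getnframes() / wav.getframerate()
    return total
```

For example, `total_wav_seconds("OUTPUT_LOCATION") / 60` would give the number of minutes of extracted speech, where `OUTPUT_LOCATION` is the folder passed to the preprocessing script.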

Removing Background Music

The next step is to download and install Ultimate Vocal Remover GUI from GitHub.

Ultimate Vocal Remover GUI

The defaults can be kept, apart from selecting the Vocals Only checkbox for convenience. All that needs to be done is to select the input (the output of the automated speaker extraction), select an output folder, and press Enter. After a few minutes, the processed results will appear in the output folder.

RVC Training

After removing the background music, the next step is to download RVC WebUI. After this program is downloaded, unzip the archive and run go-web.bat. A web browser should pop up. Navigate over to the training tab.

RVC WebUI

For an initial run, the standard defaults can be kept. I do recommend turning up the save frequency to 25 and the number of epochs trained to 200 to ensure that the model has enough time to train well.

RVC WebUI Default Changes
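
To see what these two settings imply, here is a small sketch of the arithmetic. It assumes the WebUI writes one checkpoint every "save frequency" epochs, which is the usual meaning of such a setting but is not confirmed here. The function name is hypothetical.

```python
from typing import List


def checkpoint_epochs(total_epochs: int, save_every: int) -> List[int]:
    """Epochs at which a checkpoint would be written, assuming one per save interval."""
    return list(range(save_every, total_epochs + 1, save_every))


saved = checkpoint_epochs(total_epochs=200, save_every=25)
print(saved)       # [25, 50, 75, 100, 125, 150, 175, 200]
print(len(saved))  # 8
```

Under that assumption, a 200-epoch run leaves eight intermediate models to compare, which helps if the final epoch turns out to be overtrained.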

Inference

After training, inference can be done on the first tab of the WebUI. There are also parameters that can be adjusted. Try tuning these to get the best output. And just like that, the AI voice model is complete.