Introduction
AI voice-cloning has taken social media by storm. It has opened a world of creative possibilities. You have probably seen memes or AI voice-overs of famous personalities on social media. Have you ever wondered how it's done? Sure, many platforms provide APIs, like Eleven Labs, but can we do it for free, using open-source software? The short answer is YES. The open-source ecosystem has TTS models and lip-syncing tools to achieve voice synthesis. So, in this article, we will explore open-source tools and models for voice-cloning and lip-syncing.
Learning Objectives
- Explore open-source tools for AI voice-cloning and lip-syncing.
- Use FFmpeg and Whisper to transcribe videos.
- Use Coqui-AI's xTTS model to clone a voice.
- Use Wav2Lip for lip-syncing videos.
- Explore real-world use cases of this technology.
This article was published as a part of the Data Science Blogathon.
Open-Source Stack
As you already know, we will use OpenAI's Whisper, FFmpeg, Coqui-ai's xTTS model, and Wav2Lip as our tech stack. But before delving into the code, let's briefly discuss these tools. And thanks, too, to the authors of these projects.
Whisper: Whisper is OpenAI's ASR (Automatic Speech Recognition) model. It is an encoder-decoder transformer model trained on over 680k hours of diverse audio data and corresponding transcripts, which makes it very capable at multi-lingual transcription.
The encoder receives the log-mel spectrogram of 30-second chunks of audio. Each encoder block uses self-attention to capture different parts of the audio signal. The decoder receives the hidden-state information from the encoder along with learned positional encodings, and uses self-attention and cross-attention to predict the next token. At the end of the process, it outputs a sequence of tokens representing the recognized text. For more on Whisper, refer to the official repository.
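The whisper package exposes this same pipeline step by step; here is a minimal sketch of the chunk-level flow described above, using a placeholder audio file name ("sample.wav"):

import whisper

model = whisper.load_model("base")

# Load the audio and pad/trim it to the 30-second window the encoder expects
audio = whisper.load_audio("sample.wav")  # placeholder file name
audio = whisper.pad_or_trim(audio)

# Compute the log-mel spectrogram that is fed to the encoder
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Detect the spoken language, then decode tokens into text
_, probs = model.detect_language(mel)
result = whisper.decode(model, mel, whisper.DecodingOptions())
print(max(probs, key=probs.get), result.text)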
Coqui TTS: TTS is an open-source library from Coqui-ai. It hosts multiple text-to-speech models: end-to-end models like Bark, Tortoise, and xTTS; spectrogram models like Glow-TTS, FastSpeech, etc.; and vocoders like HiFi-GAN, MelGAN, etc. Moreover, it provides a unified API for inferencing, fine-tuning, and training text-to-speech models. In this project, we will use xTTS, an end-to-end multi-lingual voice-cloning model. It supports 16 languages, including English, Japanese, Hindi, Mandarin, etc. For more information about TTS, refer to the official TTS repository.
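As a quick taste of that unified API, here is a minimal sketch that lists the hosted models and synthesizes a line with a small single-speaker English model (the model name is one entry from the TTS model zoo and may change between releases):

from TTS.api import TTS

# List the models served by the library (end-to-end, spectrogram, and vocoder models)
print(TTS().list_models())

# Plain single-speaker synthesis with a small English model
tts = TTS("tts_models/en/ljspeech/glow-tts")
tts.tts_to_file(text="Hello from Coqui TTS!", file_path="hello.wav")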
Wav2Lip: Wav2Lip is a Python repository for the paper "A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild." It uses a lip-sync discriminator to recognize face and lip movements, which works out great for dubbing voices. For more information, refer to the official repository. We will use this forked repository of Wav2Lip.
Workflow
Now that we’re accustomed to the instruments and fashions we’ll use, let’s perceive the workflow. This can be a easy workflow. So, here’s what we’ll do.
- Upload a video to the Colab runtime and resize it to 720p for better lip-syncing.
- Use FFmpeg to extract 24-bit audio from the video and use Whisper to transcribe the audio file.
- Use Google Translate or an LLM to translate the transcribed script into another language.
- Load the multi-lingual xTTS model with the TTS library and pass it the script and a reference audio clip for voice synthesis.
- Clone the Wav2Lip repository and download the model checkpoints. Run the inference.py file to sync the original video with the synthesized audio.
Now, let's delve into the code.
Step 1: Install Dependencies
This project requires significant RAM and GPU consumption, so it is prudent to use a Colab runtime. The free-tier Colab provides about 12 GB of RAM and a T4 GPU with 15 GB of VRAM, which should be enough for this project. So, head over to Colab and connect to a GPU runtime.
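Before installing anything, it is worth confirming that the runtime actually has a GPU attached; a quick check (torch comes pre-installed on Colab):

import torch

# Confirm a CUDA device is visible to PyTorch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))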
Now, install TTS and Whisper.
!pip install TTS
!pip install git+https://github.com/openai/whisper.git
Step 2: Upload Videos to Colab
Now, we will upload a video and resize it to 720p. Wav2Lip tends to perform better when videos are in 720p format. This can be done using FFmpeg.
#@title Upload Video
from google.colab import files
import os
import subprocess

uploaded = None
resize_to_720p = False

def upload_video():
    global uploaded
    global video_path  # Declare video_path as global so we can modify it
    uploaded = files.upload()
    for filename in uploaded.keys():
        print(f'Uploaded {filename}')
        if resize_to_720p:
            filename = resize_video(filename)  # Get the name of the resized video
        video_path = filename  # Update video_path with either the original or resized filename
        return filename

def resize_video(filename):
    output_filename = f"resized_{filename}"
    cmd = f"ffmpeg -i {filename} -vf 'scale=-1:720' {output_filename}"
    subprocess.run(cmd, shell=True)
    print(f'Resized video saved as {output_filename}')
    return output_filename

# Create a form button that calls upload_video when clicked and a checkbox for resizing
import ipywidgets as widgets
from IPython.display import display

button = widgets.Button(description="Upload Video")
checkbox = widgets.Checkbox(value=False, description='Resize to 720p (better results)')
output = widgets.Output()

def on_button_clicked(b):
    with output:
        global video_path
        global resize_to_720p
        resize_to_720p = checkbox.value
        video_path = upload_video()

button.on_click(on_button_clicked)
display(checkbox, button, output)
This will render a form button for uploading videos from a local device and a checkbox for enabling 720p resizing. You can also upload a video manually to the current Colab session and resize it using a subprocess.
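For example, if you dragged a file into the Colab file browser instead of using the widget, the same resize can be done directly; a minimal sketch ("my_video.mp4" below is a placeholder name):

import subprocess

# Placeholder name of a manually uploaded video; replace it with your file
video_path = "my_video.mp4"

# Resize to 720p with FFmpeg and point video_path at the resized file
resized_path = f"resized_{video_path}"
subprocess.run(f"ffmpeg -i '{video_path}' -vf 'scale=-1:720' '{resized_path}'", shell=True)
video_path = resized_path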
Step 3: Audio Extraction and Whisper Transcription
Now that we have our video, the next thing we will do is extract the audio using FFmpeg and transcribe it with Whisper.
# @title Audio extraction (24 bit) and whisper conversion
import subprocess

# Ensure the video_path variable exists and is not None
if 'video_path' in globals() and video_path is not None:
    ffmpeg_command = f"ffmpeg -i '{video_path}' -acodec pcm_s24le -ar 48000 -q:a 0 -map a -y 'output_audio.wav'"
    subprocess.run(ffmpeg_command, shell=True)
else:
    print("No video uploaded. Please upload a video first.")

import whisper

model = whisper.load_model("base")
result = model.transcribe("output_audio.wav")
whisper_text = result["text"]
whisper_language = result["language"]

print("Whisper text:", whisper_text)
This will extract audio from the video in 24-bit format and use the Whisper base model to transcribe it. For better transcription, use the Whisper small or medium models.
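For instance, switching to the medium model only requires changing the model name (expect a larger download, more VRAM use, and slower inference):

import whisper

# Swap in a larger Whisper model for higher transcription quality
model = whisper.load_model("medium")
result = model.transcribe("output_audio.wav")
whisper_text = result["text"]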
Step 4: Voice Synthesis
Now, to the voice-cloning part. As mentioned before, we will use Coqui-ai's xTTS model. It is one of the best open-source models out there for voice synthesis. Coqui-ai also provides many TTS models for different purposes; do check them out. For our use case, which is voice-cloning, we will use the xTTS v2 model.
Load the xTTS model. It is a large model with a size of 1.87 GB, so this may take a while.
# @title Voice synthesis
from TTS.api import TTS
import torch
from IPython.display import Audio, display  # Import the Audio and display helpers

device = "cuda" if torch.cuda.is_available() else "cpu"

# Initialize TTS with the multilingual xTTS v2 model
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
xTTS currently supports 16 languages. Here are the ISO codes of the languages the xTTS model supports.
print(tts.languages)
['en','es','fr','de','it','pt','pl','tr','ru','nl','cs','ar','zh-cn','hu','ko','ja','hi']
Note: Languages like English and French do not have a character limit, while Hindi has a character limit of 250. A few other languages may have limits as well.
For this project, we will use Hindi; you can experiment with other languages as well.
So, the first thing we need now is to translate the transcribed text into Hindi. This can be done either with a Google Translate package or with an LLM. From my observations, GPT-3.5-Turbo performs much better than Google Translate. We can use the OpenAI API to get our translation (a sketch of the Google Translate route follows after the OpenAI example).
import openai

client = openai.OpenAI(api_key="api_key")

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": f"translate the texts to Hindi {whisper_text}"}
    ]
)

translated_text = completion.choices[0].message.content
print(translated_text)
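If you would rather skip the OpenAI API, the Google Translate route mentioned above can be handled with a translation package; a minimal sketch, assuming the deep-translator package is installed (pip install deep-translator):

from deep_translator import GoogleTranslator

# Translate the Whisper transcript into Hindi without an API key
# (very long transcripts may need to be split into smaller requests first)
translated_text = GoogleTranslator(source="auto", target="hi").translate(whisper_text)
print(translated_text)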
As we know, Hindi has a character limit, so we need to pre-process the text before passing it to the TTS model. We need to split the text into chunks of fewer than 250 characters.
# Split on the Hindi full stop (।) and re-group into chunks under 250 characters
text_chunks = translated_text.split(sep="।")
final_chunks = [""]

for chunk in text_chunks:
    if not final_chunks[-1] or len(final_chunks[-1]) + len(chunk) < 250:
        chunk += "।"
        final_chunks[-1] += chunk.strip()
    else:
        final_chunks.append((chunk + "।").strip())

final_chunks
This is a very simple splitter. You could write a different one or use LangChain's recursive text splitter (a sketch of that alternative follows below). Each chunk is then passed to the TTS model, and the resulting audio files are merged with FFmpeg.
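Here is a minimal sketch of the LangChain alternative, assuming the langchain package is installed (newer releases move the class to langchain_text_splitters); the chunk size and separators mirror the manual splitter above:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Recursively split on the Hindi full stop, then whitespace, keeping chunks under 250 characters
splitter = RecursiveCharacterTextSplitter(
    chunk_size=250,
    chunk_overlap=0,
    separators=["।", " ", ""],
)
final_chunks = splitter.split_text(translated_text)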
def audio_synthesis(text, file_name):
    # Clone the voice from the reference audio and synthesize the Hindi text
    tts.tts_to_file(
        text,
        speaker_wav='output_audio.wav',
        file_path=file_name,
        language="hi"
    )
    return file_name

file_names = []
for i in range(len(final_chunks)):
    file_name = audio_synthesis(final_chunks[i], f"output_synth_audio_{i}.wav")
    file_names.append(file_name)
As all the files have the same codec, we can easily merge them with FFmpeg. To do this, create a txt file (my_files.txt) and add the file paths.
# this is a comment
file 'output_synth_audio_0.wav'
file 'output_synth_audio_1.wav'
file 'output_synth_audio_2.wav'
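Instead of writing this list by hand, you can generate it from the file_names list built in the synthesis loop:

# Write the FFmpeg concat list from the synthesized chunk file names
with open("my_files.txt", "w") as f:
    for name in file_names:
        f.write(f"file '{name}'\n")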
Now, run the code below to merge the files.
import subprocess
cmd = "ffmpeg -f concat -safe 0 -i my_files.txt -c copy final_output_synth_audio_hi.wav"
subprocess.run(cmd, shell=True)
This will output the final concatenated audio file. You can also play the audio in Colab.
from IPython.display import Audio, display
display(Audio(filename="final_output_synth_audio_hi.wav", autoplay=False))
Step 5: Lip-Syncing
Now, to the lip-syncing part. To lip-sync our synthetic audio with the original video, we will use the Wav2Lip repository. To use Wav2Lip, we need to download the model checkpoints. But before that, if you are on a T4 GPU runtime, delete the xTTS and Whisper models from the current Colab session or restart the session.
import torch

# Free GPU memory by deleting the TTS and Whisper models
try:
    del tts
except NameError:
    print("Voice model already deleted")

try:
    del model
except NameError:
    print("Whisper model already deleted")

torch.cuda.empty_cache()
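Optionally, confirm that the memory was actually released:

# Report GPU memory still allocated by PyTorch (in GB)
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")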
Now, clone the Wav2Lip repository and download the checkpoints.
# @title Dependencies
%cd /content/
!git clone https://github.com/justinjohn0306/Wav2Lip
!cd Wav2Lip && pip install -r requirements_colab.txt

%cd /content/Wav2Lip
!wget 'https://github.com/justinjohn0306/Wav2Lip/releases/download/models/wav2lip.pth' -O 'checkpoints/wav2lip.pth'
!wget 'https://github.com/justinjohn0306/Wav2Lip/releases/download/models/wav2lip_gan.pth' -O 'checkpoints/wav2lip_gan.pth'
!wget 'https://github.com/justinjohn0306/Wav2Lip/releases/download/models/mobilenet.pth' -O 'checkpoints/mobilenet.pth'
!pip install batch-face
Wav2Lip has two models for lip-syncing: wav2lip and wav2lip_gan. According to the authors, the GAN model requires less effort in face detection but produces slightly inferior results, while the non-GAN model can produce better results with more manual padding and rescaling of the detection box. You can try both and see which one does better.
Run the inference with the model checkpoint path, video, and audio files.
%cd /content/Wav2Lip

# This is the detection box padding; adjust it in case of poor results.
# Usually, the bottom one is the biggest issue.
pad_top = 0
pad_bottom = 15
pad_left = 0
pad_right = 0
rescaleFactor = 1

video_path_fix = f"'../{video_path}'"

!python inference.py --checkpoint_path 'checkpoints/wav2lip_gan.pth' --face $video_path_fix --audio "/content/final_output_synth_audio_hi.wav" --pads $pad_top $pad_bottom $pad_left $pad_right --resize_factor $rescaleFactor --nosmooth --outfile '/content/output_video.mp4'
This will output a lip-synced video. If the result does not look good, adjust the parameters and retry.
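For example, to compare against the non-GAN checkpoint mentioned earlier, rerun the same command with the other weights and, say, a little more bottom padding (the values here are just starting points):

# Same inference call, but with the non-GAN checkpoint and adjusted padding
!python inference.py --checkpoint_path 'checkpoints/wav2lip.pth' --face $video_path_fix --audio "/content/final_output_synth_audio_hi.wav" --pads 0 20 0 0 --resize_factor 1 --nosmooth --outfile '/content/output_video_nongan.mp4'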
So, here is the repository for the notebook and a few samples.
GitHub Repository: sunilkumardash9/voice-clone-and-lip-sync
Real-World Use Cases
Video voice-cloning and lip-syncing technology have plenty of use cases across industries. Here are a few cases where it can be helpful.
Entertainment: The entertainment industry may be the most affected industry of all, and we are already witnessing the change. Voices of celebrities of present and bygone eras can be synthesized and re-used. This also poses ethical challenges: synthesized voices should be used responsibly and within the bounds of the law.
Marketing: Personalized ad campaigns with familiar and relatable voices can greatly enhance brand appeal.
Communication: Language has always been a barrier to all kinds of activities, and cross-language communication is still a challenge. Real-time end-to-end translation that preserves one's accent and voice would revolutionize the way we communicate, and this may become a reality within a few years.
Content Creation: Content creators will no longer depend on translators to reach a bigger audience. With efficient voice cloning and lip-syncing, cross-language content creation will be easier, and podcast and audiobook narration can be enhanced with voice synthesis.
Conclusion
Voice synthesis is one of the most sought-after use cases of generative AI, and it has the potential to revolutionize the way we communicate. Ever since the dawn of civilization, the language barrier between communities has been a hurdle to forging deeper relationships, culturally and commercially. With AI voice synthesis, this gap can be bridged. So, in this article, we explored the open-source way of voice-cloning and lip-syncing.
Key Takeaways
- TTS, a Python library by Coqui-ai, serves and maintains popular text-to-speech models.
- xTTS is a multi-lingual voice-cloning model capable of cloning a voice into 16 different languages.
- Whisper is an ASR model from OpenAI for efficient transcription and English translation.
- Wav2Lip is an open-source tool for lip-syncing videos.
- Voice cloning is one of the most happening frontiers of generative AI, with a significant potential impact on industries from entertainment to marketing.
Frequently Asked Questions
Q. Is AI voice cloning legal?
A. Cloning a voice can be illegal, as it may infringe on copyright. Getting permission from the person before cloning is the right way to go about it.
Q. Can voice cloning be done for free?
A. Most AI voice-cloning API services charge fees. However, some open-source models can provide fairly decent voice synthesis capability.
Q. Which model is best for voice cloning?
A. This depends on the particular use case. The xTTS model is a good choice for multi-lingual voice synthesis, but for more languages, Meta's Fairseq models may be preferable.
Q. Can we clone a celebrity's voice?
A. Yes, it is possible to clone the voice of a celebrity. However, be mindful that any misuse can land you in legal trouble.
Q. What are the use cases of voice cloning?
A. Voice cloning can be useful for a wide range of use cases, such as content creation, narration in games and movies, ad campaigns, etc.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author's discretion.