
How to Lip-Sync a Video Using Open-Source Tools


Introduction

AI voice cloning has taken social media by storm and opened up a world of creative possibilities. You have probably seen memes or AI voice-overs of famous personalities on social media. Have you wondered how it is done? Sure, many platforms such as Eleven Labs provide APIs, but can we do it for free, using open-source software? The short answer is YES. The open-source ecosystem has TTS models and lip-syncing tools that make voice synthesis possible. So, in this article, we will explore open-source tools and models for voice cloning and lip-syncing.

AI voice cloning and lip syncing using open-source tools

Learning Objectives

  • Explore open-source tools for AI voice cloning and lip-syncing.
  • Use FFmpeg and Whisper to transcribe videos.
  • Use Coqui-ai's xTTS model to clone voices.
  • Use Wav2Lip to lip-sync videos.
  • Explore real-world use cases of this technology.

This article was published as a part of the Data Science Blogathon.

Open-Source Stack

As you already know, we will use OpenAI's Whisper, FFmpeg, Coqui-ai's xTTS model, and Wav2Lip as our tech stack. But before delving into the code, let's briefly discuss these tools. And thanks, too, to the authors of these projects.

Whisper: Whisper is OpenAI's ASR (Automatic Speech Recognition) model. It is an encoder-decoder transformer model trained on over 650k hours of diverse audio data and the corresponding transcripts, which makes it very capable at multi-lingual transcription.

The encoder receives the log-Mel spectrogram of 30-second chunks of audio. Each encoder block uses self-attention to capture different parts of the audio signal. The decoder then receives the hidden-state information from the encoder along with learned positional encodings, and uses self-attention and cross-attention to predict the next token. At the end of the process, it outputs a sequence of tokens representing the recognized text. For more on Whisper, refer to the official repository.
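
To make that pipeline concrete, here is a minimal sketch of Whisper's lower-level API (closely following the example in the official repository): it pads or trims the audio to a 30-second chunk, computes the log-Mel spectrogram, detects the language, and decodes the text. The file name audio.mp3 is just a placeholder.

import whisper

model = whisper.load_model("base")

# Load the audio and pad/trim it to a 30-second chunk
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram and move it to the model's device
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# Decode the chunk into text
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)
print(result.text)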

Coqui TTS: TTS is an open-source library from Coqui-ai that hosts multiple text-to-speech models: end-to-end models like Bark, Tortoise, and xTTS; spectrogram models like Glow-TTS and FastSpeech; and vocoders like HiFi-GAN and MelGAN. Moreover, it provides a unified API for inferencing, fine-tuning, and training text-to-speech models. In this project, we will use xTTS, an end-to-end multi-lingual voice-cloning model. It supports 16 languages, including English, Japanese, Hindi, and Mandarin. For more information about TTS, refer to the official TTS repository.
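
As a quick taste of that unified API, the sketch below lists the models bundled with the library and synthesizes a sentence with a single-speaker English model. The model name here is an assumption; check the output of list_models() for what your TTS version actually ships.

from TTS.api import TTS

# List the models the library knows about (exact output format may differ between TTS versions)
print(TTS().list_models())

# Synthesize a sentence with a single-speaker English model (model name assumed)
tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Open-source text to speech is fun.", file_path="sample.wav")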

Wav2Lip: Wav2Lip is a Python repository for the paper "A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild." It uses a lip-sync discriminator to recognize face and lip movements, which works out great for dubbing voices. For more information, refer to the official repository. We will use this forked repository of Wav2Lip.

Workflow

Now that we are familiar with the tools and models we will use, let's understand the workflow. It is a simple workflow. Here is what we will do:

  • Upload a video to the Colab runtime and resize it to 720p for better lip-syncing.
  • Use FFmpeg to extract 24-bit audio from the video and use Whisper to transcribe the audio file.
  • Use Google Translate or an LLM to translate the transcribed script into another language.
  • Load the multi-lingual xTTS model with the TTS library and pass it the script and a reference audio file for voice synthesis.
  • Clone the Wav2Lip repository and download the model checkpoints. Run the inference.py script to sync the original video with the synthesized audio.
"

Now, let's delve into the code.

Step 1: Install Dependencies

This project requires significant RAM and GPU consumption, so it is prudent to use a Colab runtime. The free-tier Colab provides 12 GB of CPU RAM and a T4 GPU with 15 GB of memory, which should be enough for this project. So, head over to Colab and connect to a GPU runtime.

Now, install TTS and Whisper.

!pip install TTS
!pip install git+https://github.com/openai/whisper.git

Step 2: Upload Videos to Colab

Now, we will upload a video and resize it to 720p, since Wav2Lip tends to perform better on 720p videos. The resizing can be done with FFmpeg.

#@title Upload Video

from google.colab import files
import os
import subprocess

uploaded = None
resize_to_720p = False

def upload_video():
    global uploaded
    global video_path  # Declare video_path as global so we can modify it
    uploaded = files.upload()
    for filename in uploaded.keys():
        print(f'Uploaded {filename}')
        if resize_to_720p:
            filename = resize_video(filename)  # Get the name of the resized video
        video_path = filename  # Update video_path with the original or resized filename
        return filename


def resize_video(filename):
    output_filename = f"resized_{filename}"
    cmd = f"ffmpeg -i '{filename}' -vf 'scale=-1:720' '{output_filename}'"
    subprocess.run(cmd, shell=True)
    print(f'Resized video saved as {output_filename}')
    return output_filename

# Create a form button that calls upload_video when clicked and a checkbox for resizing
import ipywidgets as widgets
from IPython.display import display

button = widgets.Button(description="Upload Video")
checkbox = widgets.Checkbox(value=False, description='Resize to 720p (better results)')
output = widgets.Output()

def on_button_clicked(b):
    with output:
        global video_path
        global resize_to_720p
        resize_to_720p = checkbox.value
        video_path = upload_video()

button.on_click(on_button_clicked)
display(checkbox, button, output)

This will render a form button for uploading a video from your local device and a checkbox to enable 720p resizing. You can also upload a video to the current Colab session manually and resize it with a subprocess call, as in the sketch below.
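
Here is a minimal sketch of that manual route, assuming you have dragged a file named input_video.mp4 (a placeholder name) into the Colab file pane:

import subprocess

# Placeholder name for a video dragged into the Colab file pane
video_path = "input_video.mp4"

# Resize to 720p with FFmpeg, keeping the aspect ratio (-1 lets FFmpeg pick the width)
subprocess.run(
    f"ffmpeg -i '{video_path}' -vf 'scale=-1:720' -y 'resized_input_video.mp4'",
    shell=True,
)
video_path = "resized_input_video.mp4"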

Step 3: Audio Extraction and Whisper Transcription

Now that we have our video, the next thing we will do is extract the audio with FFmpeg and transcribe it with Whisper.

# @title Audio extraction (24 bit) and Whisper transcription
import subprocess

# Ensure the video_path variable exists and is not None
if 'video_path' in globals() and video_path is not None:
    ffmpeg_command = (
        f"ffmpeg -i '{video_path}' -acodec pcm_s24le -ar 48000 -q:a 0 "
        f"-map a -y 'output_audio.wav'"
    )
    subprocess.run(ffmpeg_command, shell=True)
else:
    print("No video uploaded. Please upload a video first.")

import whisper

model = whisper.load_model("base")
result = model.transcribe("output_audio.wav")

whisper_text = result["text"]
whisper_language = result["language"]

print("Whisper text:", whisper_text)

This will extract the audio from the video in 24-bit format and transcribe it with the Whisper base model. For better transcription, use the Whisper small or medium models.
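
Swapping in a larger checkpoint is a one-line change; a quick sketch, trading speed and memory for accuracy:

# Larger checkpoints ("small", "medium") are slower but usually more accurate
model = whisper.load_model("medium")
result = model.transcribe("output_audio.wav")
whisper_text = result["text"]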

Step 4: Voice Synthesis

Now, on to the voice-cloning part. As mentioned before, we will use Coqui-ai's xTTS model, one of the best open-source models out there for voice synthesis. Coqui-ai also provides many TTS models for different purposes; do check them out. For our use case, which is voice cloning, we will use the xTTS v2 model.

Load the xTTS model. This is a large model with a size of 1.87 GB, so loading it will take a while.

# @title Voice synthesis
from TTS.api import TTS
import torch
from IPython.display import Audio, display  # Import the Audio and display helpers

device = "cuda" if torch.cuda.is_available() else "cpu"
# Initialize TTS with the multilingual xTTS v2 model
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

Here are the ISO codes of the languages the xTTS model currently supports.

print(tts.languages)


['en','es','fr','de','it','pt','pl','tr','ru','nl','cs','ar','zh-cn','hu','ko','ja','hi']

Note: Languages like English and French do not have a character limit, while Hindi has a character limit of 250. A few other languages may have limits as well.

For this project, we will use Hindi; you can experiment with other languages as well.

So, the first thing we need now is to translate the transcribed text into Hindi. This can be done either with the Google Translate package or with an LLM. From my observations, GPT-3.5-Turbo performs much better than Google Translate. We can use the OpenAI API to get our translation.

import openai

client = openai.OpenAI(api_key="api_key")
completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": f"Translate the text to Hindi: {whisper_text}"}
    ]
)
translated_text = completion.choices[0].message.content
print(translated_text)
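
If you prefer the free route, here is a sketch using the googletrans package (my assumption, not part of the original stack; pin a release that works for you, e.g. googletrans==4.0.0rc1):

# !pip install googletrans==4.0.0rc1
from googletrans import Translator

translator = Translator()
# Translate the Whisper transcript into Hindi ("hi")
translated_text = translator.translate(whisper_text, dest="hi").text
print(translated_text)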

As we know, Hindi has a character limit, so we need to pre-process the text before passing it to the TTS model: split it into chunks of fewer than 250 characters.

# Split on the Hindi full stop (danda) and regroup into chunks under 250 characters
text_chunks = translated_text.split(sep="।")
final_chunks = [""]
for chunk in text_chunks:
    chunk = chunk.strip()
    if not chunk:
        continue
    if not final_chunks[-1] or len(final_chunks[-1]) + len(chunk) < 250:
        final_chunks[-1] += chunk + "।"
    else:
        final_chunks.append(chunk + "।")
final_chunks

This is a very simple splitter; you could write a different one or use LangChain's recursive text splitter, as sketched below. Next, we will pass each chunk to the TTS model, and the resulting audio files will be merged with FFmpeg.
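
A possible LangChain-based alternative (a sketch only; the separators and chunk size are assumptions tuned to the 250-character limit):

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Prefer splitting on the danda, then newlines, then spaces
splitter = RecursiveCharacterTextSplitter(
    chunk_size=250,
    chunk_overlap=0,
    separators=["।", "\n", " "],
)
final_chunks = splitter.split_text(translated_text)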

def audio_synthesis(text, file_name):
    # Clone the voice in the reference audio and synthesize the Hindi text
    tts.tts_to_file(
        text,
        speaker_wav='output_audio.wav',
        file_path=file_name,
        language="hi"
    )
    return file_name

file_names = []
for i in range(len(final_chunks)):
    file_name = audio_synthesis(final_chunks[i], f"output_synth_audio_{i}.wav")
    file_names.append(file_name)

As all the files have the same codec, we can easily merge them with FFmpeg. To do this, create a text file (my_files.txt) and add the file paths in FFmpeg's concat format.

# this is a comment
file 'output_synth_audio_0.wav'
file 'output_synth_audio_1.wav'
file 'output_synth_audio_2.wav'
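
Instead of writing the list by hand, you can generate my_files.txt from the file_names list collected above; a short sketch:

# Write the FFmpeg concat list from the synthesized chunk files
with open("my_files.txt", "w") as f:
    for name in file_names:
        f.write(f"file '{name}'\n")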

Now, run the code below to merge the files.

import subprocess

cmd = "ffmpeg -f concat -safe 0 -i my_files.txt -c copy final_output_synth_audio_hi.wav"
subprocess.run(cmd, shell=True)

This will output the final concatenated audio file. You can also play the audio in Colab.

from IPython.display import Audio, display
display(Audio(filename="final_output_synth_audio_hi.wav", autoplay=False))

Step 5: Lip-Syncing

Now, on to the lip-syncing part. To lip-sync our synthetic audio with the original video, we will use the Wav2Lip repository. Before we can run it, we need to download the model checkpoints. But first, if you are on a T4 GPU runtime, delete the xTTS and Whisper models from the current Colab session, or restart the session, to free up memory.

import torch

try:
    del tts
except NameError:
    print("Voice model already deleted")

try:
    del model
except NameError:
    print("Whisper model already deleted")

torch.cuda.empty_cache()

Now, clone the Wav2Lip repository and download the checkpoints.

# @title Dependencies
%cd /content/

!git clone https://github.com/justinjohn0306/Wav2Lip
!cd Wav2Lip && pip install -r requirements_colab.txt

%cd /content/Wav2Lip

!wget 'https://github.com/justinjohn0306/Wav2Lip/releases/download/models/wav2lip.pth' -O 'checkpoints/wav2lip.pth'

!wget 'https://github.com/justinjohn0306/Wav2Lip/releases/download/models/wav2lip_gan.pth' -O 'checkpoints/wav2lip_gan.pth'

!wget 'https://github.com/justinjohn0306/Wav2Lip/releases/download/models/mobilenet.pth' -O 'checkpoints/mobilenet.pth'

!pip install batch-face

Wav2Lip ships two checkpoints for lip-syncing: wav2lip and wav2lip_gan. According to the authors, the GAN model requires less effort in face detection but produces slightly inferior results, while the non-GAN model can produce better results with more manual padding and rescaling of the detection box. Try both and see which one works better for your video.

Run the inference script with the model checkpoint path and the video and audio files.

%cd /content/Wav2Lip

# This is the detection box padding; adjust it in case of poor results.
# Usually, the bottom value is the biggest issue.
pad_top = 0
pad_bottom = 15
pad_left = 0
pad_right = 0
rescaleFactor = 1

video_path_fix = f"'../{video_path}'"

!python inference.py --checkpoint_path 'checkpoints/wav2lip_gan.pth' \
  --face $video_path_fix --audio "/content/final_output_synth_audio_hi.wav" \
  --pads $pad_top $pad_bottom $pad_left $pad_right --resize_factor $rescaleFactor --nosmooth \
  --outfile '/content/output_video.mp4'

This will output a lip-synced video. If the result doesn't look good, adjust the parameters and retry.
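
For instance, you might retry with the non-GAN checkpoint and more bottom padding (the values here are illustrative, not a recommendation):

!python inference.py --checkpoint_path 'checkpoints/wav2lip.pth' \
  --face $video_path_fix --audio "/content/final_output_synth_audio_hi.wav" \
  --pads 0 25 0 0 --resize_factor 1 --nosmooth \
  --outfile '/content/output_video_retry.mp4'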

So, here is the repository with the notebook and a few samples.

GitHub Repository: sunilkumardash9/voice-clone-and-lip-sync

Real-World Use Cases

Video voice-cloning and lip-syncing technology have plenty of use cases across industries. Here are a few where it can be helpful.

Entertainment: The entertainment industry will be the most affected of all, and we are already witnessing the change. Voices of celebrities from current and bygone eras can be synthesized and reused. This also poses ethical challenges; synthesized voices should be used responsibly and within the bounds of the law.

Marketing: Personalized ad campaigns with familiar and relatable voices can greatly improve brand appeal.

Communication: Language has always been a barrier to all kinds of activities, and cross-language communication is still a challenge. Real-time end-to-end translation that preserves one's accent and voice would revolutionize the way we communicate, and it may become a reality in a few years.

Content Creation: Content creators will no longer depend on translators to reach a bigger audience. With efficient voice cloning and lip-syncing, cross-language content creation will be easier, and podcast and audiobook narration can be enhanced with voice synthesis.

Conclusion

Voice synthesis is one of the most sought-after use cases of generative AI, and it has the potential to revolutionize the way we communicate. Ever since the advent of civilization, the language barrier between communities has been a hurdle to forging deeper relationships, both culturally and commercially. With AI voice synthesis, this gap can be bridged. In this article, we explored an open-source approach to voice cloning and lip-syncing.

Key Takeaways

  • TTS, a Python library by Coqui-ai, serves and maintains popular text-to-speech models.
  • xTTS is a multi-lingual voice-cloning model capable of cloning a voice into 16 different languages.
  • Whisper is an ASR model from OpenAI for efficient transcription and English translation.
  • Wav2Lip is an open-source tool for lip-syncing videos.
  • Voice cloning is one of the most active frontiers of generative AI, with a significant potential impact on industries from entertainment to marketing.

Frequently Asked Questions

Q1. Is AI voice cloning legal?

A. Cloning a voice can be illegal, as it may infringe on copyright. Getting permission from the person before cloning their voice is the right way to go about it.

Q2. Is AI voice cloning free?

A. Most AI voice-cloning API services charge fees. However, some open-source models can deliver fairly decent voice-synthesis capability.

Q3. What is the best voice-cloning model?

A. This depends on the particular use case. The xTTS model is a good choice for multi-lingual voice synthesis, but for more languages, Meta's Fairseq models may be preferable.

Q4. Can AI clone celebrity voices?

A. Yes, it is possible to clone the voice of a celebrity. However, be mindful that any potential misuse can land you in legal trouble.

Q5. What is the use of voice cloning?

A. Voice cloning can be helpful for a wide range of use cases, such as content creation, narration in games and movies, ad campaigns, and so on.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author's discretion.


