In recent years, speech synthesis has undergone a profound transformation thanks to the emergence of large-scale generative models. This evolution has led to significant strides in zero-shot speech synthesis systems, including text-to-speech (TTS), voice conversion (VC), and editing. These systems aim to generate speech by incorporating unseen speaker characteristics from a reference audio segment during… →
The interdisciplinary domain of vision-language representation seeks innovative methods to develop systems to understand the nuanced interactions between text and images. This area is pivotal as it enables machines to process and interpret the vast amount of digitally available visual and textual content. Despite significant advances, the challenge persists primarily due to the noisy data… →