I am Yusheng Dai, and I recently joined Monash University in Australia to work with Prof. Jianfei Cai. Before that, I completed my Master's degree at the University of Science and Technology of China (USTC), working with Prof. Jun Du and Prof. Chin-hui Lee. I obtained my Bachelor's degree in Cyber Engineering from Sichuan University in June 2022.
My research focuses on audio-visual generation and understanding. My recent work includes multi-condition universal audio generation, Visual-Text to Speech Audio and Music (VT2SAM), which accounts for both semantic and temporal alignment. I am also interested in extending standard diffusion-based mel-spectrogram generation to better approximate real-world audio, for example over longer latent spaces (e.g., infinite-duration audio or panoramic content) or at higher resolutions (up to 44.1 kHz audio). Earlier, I focused on audio-visual speech recognition with talking-face videos in noisy, multi-speaker scenarios.