Yusheng Dai's Homepage

Hello 👋!

I am Yusheng Dai, a PhD candidate at Monash University in Australia, working with Prof. Jianfei Cai (IEEE Fellow) and Prof. Qiuhong Ke. Before that, I completed my Master’s program at the University of Science and Technology of China (USTC), working with Prof. Jun Du and Prof. Chin-hui Lee (IEEE Fellow).

My research focuses on Audio-Visual Foundation World Models, working toward an interactive metaverse driven by real-time video and sound generation. Representative works include:

Omni2Sound: A unified video-text-to-audio foundation model that achieves state-of-the-art results on V2A, T2A, and VT2A with a simple DiT architecture.
HolisticAudio: The first scalable end-to-end cinematic dubbing model that operates on holistic video without preprocessing pipelines, jointly modeling speech, sound effects, and music across multi-speaker, off-screen, and combined generation scenarios.
ControlAudio: A controllable multi-event audio foundation model that produces millisecond-level temporally aligned audio from natural language descriptions.
SaFa: Seamless long-form audio and panorama generation via latent swap joint diffusion, up to 20x faster than training-based methods.

Note: I am looking for collaborators to do great work in audio and speech, whether generation or understanding, frontend or backend. I bring strong research insights and sharp storytelling. If your work has the potential for high impact, email me. I work fast: I have taken collaborators from zero context to a finished paper in one week, multiple times.

News

[Apr. 2026]: Our work on unified video-text-to-audio generation Omni2Sound has been accepted by CVPR 2026 as a Highlight (top 15%).
[Apr. 2026]: Our work on controllable audio generation ControlAudio has been accepted by ACL 2026 as an Oral presentation (top 15%).
[July 2025]: Our work on a timing audio generation benchmark, AudioAtlas, has been accepted by ACM MM 2025.
[June 2025]: Our work on seamless long-form audio and panorama generation, SaFa, has been accepted by ICCV 2025.
[Nov. 2024]: Obtained the National Scholarship at the University of Science and Technology of China.
[Oct. 2024]: Obtained the Monash International Tuition Scholarship (MITS) and the Monash Graduate Scholarship (MGS).

Selected Publications [Google Scholar]

CVPR

Omni2Sound: Towards Unified Video-Text-to-Audio Generation

Yusheng Dai, Zehua Chen, Yuxuan Jiang, Baolong Gao, Qiuhong Ke, Jianfei Cai, Jun Zhu

IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR), 2026.

Project Page arXiv Highlight (top 15%)

Under Review

HolisticAudio: Scaling End-to-End Video Dubbing to Multi-Speaker Dialogues with Coherent Sound Effects

Yusheng Dai, et al.

Under Review, 2026.

Project Page arXiv

ACL

ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling

Yuxuan Jiang, Zehua Chen, Zeqian Ju, Yusheng Dai, Weibei Dou, Jun Zhu

Annual Meeting of the Association for Computational Linguistics (ACL), 2026.

Project Page arXiv Oral (top 15%)

ICCV

Latent Swap Joint Diffusion for 2D Long-Form Latent Generation

Yusheng Dai, Chenxi Wang, Chang Li, Chen Wang, et al.

International Conference on Computer Vision (ICCV), 2025.

arXiv

CVPR

A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition

Yusheng Dai, Hang Chen, Jun Du, Chin-hui Lee, et al.

IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR), 2024.

arXiv