Open Music AGI · Multimodal Foundation Models · Creative AI
Ruibin Yuan
I am a PhD student in Artificial Intelligence at HKUST, an AI researcher, developer, and musician working on open music AGI. My research focuses on foundation models for music generation and understanding, audio and multimodal foundation models, and the data and evaluation needed to make creative AI genuinely useful.
I co-founded the Multimodal Art Projection Research Community, lead MAP’s multimodal and AI music direction, and have led or contributed to YuE, MERT, MARBLE, ChatMusician, MMMU/CMMMU, COIG, and other open foundation-model releases, datasets, and benchmarks.
6111 Google Scholar citations
29 Google Scholar h-index
41 Google Scholar i10-index
6.2k+ GitHub stars on YuE
Citation metrics are from Google Scholar, captured on May 16, 2026. The page refreshes these numbers automatically.
Research Areas
Music Generation
Full-song generation, symbolic music LLMs, text and melody control, and open alternatives for high-fidelity creative music systems.
YuE · ChatMusician · MuPT · AudioX
Music Understanding
Self-supervised music audio representation, multilingual MIR, cross-modal retrieval, and practical evaluation for music intelligence.
MERT · MARBLE · CLaMP 2/3 · SongFormer
Multimodal LLMs and Benchmarks
Omni-modal foundation models, discrete multimodal modeling, expert-level reasoning benchmarks, Chinese multimodal evaluation, and generalist instruction data.
Qwen-Omni · MMMU · CMMMU · AnyGPT · OmniBench
Open Research Infrastructure
Open datasets, reproducible training pipelines, benchmark design, community releases, and tooling for researchers and builders.
MAP · COIG · RQ-RAG · Open data
Selected Work
YuE / OpenSuno
Open full-song music generation foundation model, designed as an open alternative in the direction of systems such as Suno and Udio.
Qwen3-Omni / Qwen3.5-Omni
Omni-modal foundation model work across text, image, audio, video, and real-time speech interaction, focused on the Qwen3-Omni and Qwen3.5-Omni line.
Kimi-Audio / Spark-TTS
High-impact audio and speech foundation-model reports spanning general audio understanding and efficient LLM-based text-to-speech.
ChatMusician / ComposerX
Symbolic music LLM research, multi-agent composition, large-scale music-language data, and advanced music understanding evaluation.
MERT and MARBLE
Self-supervised music audio representation learning and unified benchmark design for music understanding.
MMMU / SuperGPQA / CMMMU
Expert-level multimodal and graduate-discipline benchmarks for testing reasoning across university-level subjects, with CMMMU as the Chinese-language counterpart (CoRR).
MMAR / OmniBench
Benchmarks for audio-language reasoning and omni-language evaluation across speech, audio, music, vision, and text.
Multimodal Art Projection
Open research community for multimodal art, music intelligence, datasets, checkpoints, and reproducible releases.
Selected Publications and Manuscripts
Flagship Music and Audio Generation
YuE: Scaling Open Foundation Models for Long-Form Music Generation
Ruibin Yuan, Hanfeng Lin, Shuyue Guo, Ge Zhang, Jiahao Pan, et al.
Open foundation model family for long-form lyrics-to-song generation; ICLR 2026 poster.
AudioX: A Unified Framework for Anything-to-Audio Generation
Zeyue Tian, Zhaoyang Liu, Yizhu Jin, Ruibin Yuan, Liumeng Xue, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo.
Unified diffusion framework for multimodal-conditioned audio and music generation; ICLR 2026 poster.
Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing
Zeyue Tian, Binxin Yang, Zhaoyang Liu, Jiexuan Zhang, Ruibin Yuan, Hubery Yin, Qifeng Chen, Chen Li, Jing Lv, Wei Xue, Yike Guo.
Unified framework spanning general sound, music, speech understanding, generation, and editing; SIGGRAPH 2026.
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
Zeyue Tian, Zhaoyang Liu, Ruibin Yuan, Jiahao Pan, Qifeng Liu, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo.
Video-conditioned music generation with long-short-term visual modeling; CVPR 2025.
Editing Music with Melody and Text: Using ControlNet for Diffusion Transformer
Siyuan Hou, Shansong Liu, Ruibin Yuan, Wei Xue, Ying Shan, Mangsuo Zhao, Chao Zhang.
Melody- and text-conditioned music editing with ControlNet-style conditioning for diffusion transformers.
MuPT: A Generative Symbolic Music Pretrained Transformer
Xingwei Qu, Yuelin Bai, Yinghao Ma, Ziya Zhou, Ka Man Lo, Jiaheng Liu, Ruibin Yuan, et al.
Generative pretraining for symbolic music modeling and controllable composition.
ChatMusician: Understanding and Generating Music Intrinsically with LLM
Ruibin Yuan, Hanfeng Lin, Yi Wang, Zeyue Tian, et al.
Music-language modeling that treats symbolic music as a native language for understanding and generation.
High-Impact Multimodal and LLM Work
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, et al., Ruibin Yuan, et al.
Large-scale multimodal benchmark across college-level disciplines and expert reasoning tasks.
Qwen3-Omni Technical Report
Qwen Team, Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, et al., Ruibin Yuan, et al.
Natively end-to-end omni-modal foundation model for text, image, audio, video, and real-time speech interaction.
Qwen3.5-Omni Technical Report
Qwen Team, Ruibin Yuan, et al.
Scaled omni-modal model family with long-context audio-visual understanding, speech interaction, and audio-visual grounding.
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, et al., Ruibin Yuan, et al.
Unified discrete sequence modeling for language, image, audio, and speech modalities.
OmniBench: Towards The Future of Universal Omni-Language Models
Yizhi Li, Ge Zhang, Yinghao Ma, Ruibin Yuan, Kang Zhu, Hangyu Guo, et al.
Tri-modal benchmark for integrated visual, acoustic, and textual reasoning in omni-language models.
RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation
Chi-Min Chan, Chunpu Xu, Ruibin Yuan, Hongyin Luo, Wei Xue, Yike Guo, Jie Fu.
Query refinement for stronger retrieval-augmented generation.
Kimi-Audio Technical Report
Kimi Team, et al., Ruibin Yuan, et al.
General audio foundation model report covering speech, sound, music, and audio interaction.
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
M-A-P Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, et al., Ruibin Yuan, et al.
Large-scale graduate-discipline benchmark for encyclopedic LLM evaluation; NeurIPS 2025 Datasets and Benchmarks.
Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
Xinsheng Wang, Mingqi Jiang, Ziyang Ma, Ziyu Zhang, Songxiang Liu, et al., Ruibin Yuan, et al.
Efficient LLM-based TTS with single-stream decoupled speech tokens and controllable voice synthesis.
MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix
Ziyang Ma, Yinghao Ma, Yanqiao Zhu, Chen Yang, Yi-Wen Chao, et al., Ruibin Yuan, et al.
Audio-language reasoning benchmark spanning speech, audio, music, and mixed-modality questions; NeurIPS 2025.
UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice
Sitong Cheng, Weizhen Bian, Xinsheng Wang, Ruibin Yuan, Jianyi Chen, Shunshun Yin, Yike Guo, Wei Xue.
Single-stage expressive speech-to-speech translation preserving content, speaker identity, emotion, and duration; ICLR 2026 poster.
CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark
Ge Zhang, Xinrun Du, Bei Chen, Yiming Liang, Tongxu Luo, et al., Ruibin Yuan, et al.
Chinese multimodal benchmark extending MMMU-style expert reasoning across disciplines; CoRR 2024.
COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning
Yuelin Bai, Xeron Du, Yiming Liang, Leo Jin, Junting Zhou, et al., Ruibin Yuan, et al.
High-quality Chinese instruction data construction and fine-tuning.
Music Understanding and MIR
MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training
Yizhi Li, Ruibin Yuan, Ge Zhang, Yinghao Ma, et al.
Large-scale self-supervised acoustic music model for transferable music understanding.
MARBLE: Music Audio Representation Benchmark for Universal Evaluation
Ruibin Yuan, Yinghao Ma, Yizhi Li, Ge Zhang, et al.
Unified benchmark for evaluating music audio representations across diverse MIR tasks.
Foundation Models for Music: A Survey
Yinghao Ma, Anders Oland, Anton Ragni, Chris Donahue, Chenghua Lin, et al., Ruibin Yuan, et al.
Comprehensive survey of representation, generation, multimodal learning, agents, datasets, and evaluation for music foundation models.
CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages
Shangda Wu, Zhancheng Guo, Ruibin Yuan, Junyan Jiang, Seungheon Doh, Gus Xia, Juhan Nam, et al.
Universal music information retrieval across unaligned modalities and unseen languages.
CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models
Shangda Wu, Yashan Wang, Ruibin Yuan, Zhancheng Guo, Xu Tan, Ge Zhang, et al.
Multilingual music information retrieval across text, audio, and symbolic music interfaces.
LyricWhiz: Robust Multilingual Zero-shot Lyrics Transcription by Whispering to ChatGPT
Le Zhuo, Ruibin Yuan, Jiahao Pan, Yinghao Ma, et al.
Multilingual zero-shot lyrics transcription using speech recognition and LLM post-processing.
SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision
Chunbo Hao, Ruibin Yuan, Jixun Yao, Qixin Deng, Xinyi Bai, Wei Xue, Lei Xie.
Large-scale heterogeneous supervision for music structure analysis, with SongFormDB and SongFormBench.
See Google Scholar for the complete publication list.
Let’s Build Together
I am always open to collaboration with people who care about open music AI, foundation models, evaluation, and creative tools. The shortest informal version of what I am exploring lives in my GitHub profile README, and the more academic version lives here.
Current ideas I am especially excited about:
- A public Lyrics2Song dataset.
- Better diffusion upsampling for YuE, especially fidelity and resolution.
- Better evaluation metrics for musicality.
- More controllability for full-song generation.
- A music arena for popular AI music systems.
- NMLB (No Music Left Behind): collecting all human music, not only Western music, and building open music understanding and generation models on top of it.
Experience
Qwen
Research Intern, Qwen-Omni Series.
2025.04 - present · Remote
Moonshot.ai
Research Intern.
2024.09 - 2025.04 · Remote
Stardust.ai
Part-time ML Consultant; former engineer.
2020.05 - 2021 · Beijing
NetEase, Inc.
Intern Machine Learning Engineer.
2019.12 - 2020.04 · Shanghai
Education
Hong Kong University of Science and Technology
PhD in Artificial Intelligence.
2023.09 - present · Hong Kong
Carnegie Mellon University
M.S. in Music and Technology, Computer Science Emphasis.
2021.09 - 2023.05 · Pittsburgh
Music
Lead guitarist, band cofounder, guitar club organizer, and longtime believer that intelligence without music is missing something essential.