Open Music AGI · Multimodal Foundation Models · Creative AI

Ruibin Yuan

I am a PhD student in Artificial Intelligence at HKUST, an AI researcher, developer, and musician working on open music AGI. My research focuses on foundation models for music generation and understanding, audio and multimodal foundation models, and the data and evaluation needed to make creative AI genuinely useful.

I co-founded the Multimodal Art Projection Research Community, lead MAP’s multimodal and AI music direction, and have led or contributed to YuE, MERT, MARBLE, ChatMusician, MMMU/CMMMU, COIG, and other open foundation-model releases, datasets, and benchmarks.

6111 Google Scholar citations

29 Google Scholar h-index

41 Google Scholar i10-index

6.2k+ GitHub stars on YuE

Citation metrics are from Google Scholar, captured on May 16, 2026. The page refreshes these numbers automatically.

Research Areas

Music Generation

Full-song generation, symbolic music LLMs, text and melody control, and open alternatives for high-fidelity creative music systems.

YuE · ChatMusician · MuPT · AudioX

Music Understanding

Self-supervised music audio representation, multilingual MIR, cross-modal retrieval, and practical evaluation for music intelligence.

MERT · MARBLE · CLaMP 2/3 · SongFormer

Multimodal LLMs and Benchmarks

Omni-modal foundation models, discrete multimodal modeling, expert-level reasoning benchmarks, Chinese multimodal evaluation, and generalist instruction data.

Qwen-Omni · MMMU · CMMMU · AnyGPT · OmniBench

Open Research Infrastructure

Open datasets, reproducible training pipelines, benchmark design, community releases, and tooling for researchers and builders.

MAP · COIG · RQ-RAG · Open data

Selected Work

Open music model · 6.2k+ stars · GS cites 65

YuE / OpenSuno

Open foundation model for full-song music generation, designed as an open alternative to systems such as Suno and Udio.

Omni-modal foundation models · Qwen-Omni · GS cites 854

Qwen3-Omni / Qwen3.5-Omni

Omni-modal foundation model work across text, image, audio, video, and real-time speech interaction, focused on the Qwen3-Omni and Qwen3.5-Omni line.

Audio foundation models · Technical Reports · Kimi GS cites 195

Kimi-Audio / Spark-TTS

High-impact audio and speech foundation-model reports spanning general audio understanding and efficient LLM-based text-to-speech.

Music LLM · ACL / ISMIR · GS cites 125

ChatMusician / ComposerX

Symbolic music LLM research, multi-agent composition, large-scale music-language data, and advanced music understanding evaluation.

Music understanding · ICLR / NeurIPS · MERT GS cites 329

MERT and MARBLE

Self-supervised music audio representation learning and unified benchmark design for music understanding.

Expert reasoning benchmarks · CVPR / NeurIPS / CoRR · MMMU GS cites 2303

MMMU / SuperGPQA / CMMMU

Expert-level multimodal and graduate-discipline benchmarks for testing reasoning across university-level subjects. CMMMU is the Chinese counterpart to MMMU and is listed as a CoRR paper.

Omni/audio reasoning benchmarks · NeurIPS 2025 · MMAR GS cites 83

MMAR / OmniBench

Benchmarks for audio-language reasoning and omni-language evaluation across speech, audio, music, vision, and text.

Open community · MAP

Multimodal Art Projection

Open research community for multimodal art, music intelligence, datasets, checkpoints, and reproducible releases.

Selected Publications and Manuscripts

Flagship Music and Audio Generation

ICLR 2026 · Open full-song generation · 6.2k+ stars · GS cites 65

YuE: Scaling Open Foundation Models for Long-Form Music Generation

Ruibin Yuan, Hanfeng Lin, Shuyue Guo, Ge Zhang, Jiahao Pan, et al.

Open foundation model family for long-form lyrics-to-song generation; ICLR 2026 poster.

ICLR 2026 · Anything-to-audio · GS cites 55

AudioX: A Unified Framework for Anything-to-Audio Generation

Zeyue Tian, Zhaoyang Liu, Yizhu Jin, Ruibin Yuan, Liumeng Xue, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo.

Unified diffusion framework for multimodal-conditioned audio and music generation; ICLR 2026 poster.

SIGGRAPH 2026 · Audio understanding/generation/editing

Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing

Zeyue Tian, Binxin Yang, Zhaoyang Liu, Jiexuan Zhang, Ruibin Yuan, Hubery Yin, Qifeng Chen, Chen Li, Jing Lv, Wei Xue, Yike Guo.

Unified framework for understanding, generation, and editing across general sound, music, and speech; SIGGRAPH 2026.

CVPR 2025 · Video-to-music generation · GS cites 59

VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling

Zeyue Tian, Zhaoyang Liu, Ruibin Yuan, Jiahao Pan, Qifeng Liu, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo.

Video-conditioned music generation with long-short-term visual modeling; CVPR 2025.

ICASSP 2025 · Music editing · GS cites 21

Editing Music with Melody and Text: Using ControlNet for Diffusion Transformer

Siyuan Hou, Shansong Liu, Ruibin Yuan, Wei Xue, Ying Shan, Mangsuo Zhao, Chao Zhang.

Melody- and text-conditioned music editing with ControlNet-style conditioning for diffusion transformers.

ICLR 2025 · Symbolic pretraining · GS cites 35

MuPT: A Generative Symbolic Music Pretrained Transformer

Xingwei Qu, Yuelin Bai, Yinghao Ma, Ziya Zhou, Ka Man Lo, Jiaheng Liu, Ruibin Yuan, et al.

Generative pretraining for symbolic music modeling and controllable composition.

ACL 2024 · Symbolic music LLM · GS cites 125

ChatMusician: Understanding and Generating Music Intrinsically with LLM

Ruibin Yuan, Hanfeng Lin, Yi Wang, Zeyue Tian, et al.

Music-language modeling that treats symbolic music as a native language for understanding and generation.

High-Impact Multimodal and LLM Work

CVPR 2024 · Expert AGI benchmark · GS cites 2303

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, et al., Ruibin Yuan, et al.

Large-scale multimodal benchmark across college-level disciplines and expert reasoning tasks.

Technical Report 2025 · Omni-modal LLM · GS cites 854

Qwen3-Omni Technical Report

Qwen Team, Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, et al., Ruibin Yuan, et al.

Natively end-to-end omni-modal foundation model for text, image, audio, video, and real-time speech interaction.

Technical Report 2026 · Scaled omni-modal LLM · GS cites 27

Qwen3.5-Omni Technical Report

Qwen Team, Ruibin Yuan, et al.

Scaled omni-modal model family with long-context audio-visual understanding, speech interaction, and audio-visual grounding.

ACL 2024 Main · Unified multimodal LLM · GS cites 298

AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, et al., Ruibin Yuan, et al.

Unified discrete sequence modeling for language, image, audio, and speech modalities.

NeurIPS 2025 · Omni-language evaluation · GS cites 67

OmniBench: Towards The Future of Universal Omni-Language Models

Yizhi Li, Ge Zhang, Yinghao Ma, Ruibin Yuan, Kang Zhu, Hangyu Guo, et al.

Tri-modal benchmark for integrated visual, acoustic, and textual reasoning in omni-language models.

COLM 2024 · Retrieval-augmented generation · GS cites 277

RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation

Chi-Min Chan, Chunpu Xu, Ruibin Yuan, Hongyin Luo, Wei Xue, Yike Guo, Jie Fu.

Query refinement for stronger retrieval-augmented generation.

Technical Report 2025 · General audio foundation model · GS cites 195

Kimi-Audio Technical Report

Kimi Team, et al., Ruibin Yuan, et al.

General audio foundation model report covering speech, sound, music, and audio interaction.

NeurIPS 2025 · Graduate-level LLM evaluation · GS cites 169

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

M-A-P Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, et al., Ruibin Yuan, et al.

Large-scale graduate-discipline benchmark for encyclopedic LLM evaluation; NeurIPS 2025 Datasets and Benchmarks.

Technical Report 2025 · Text-to-speech foundation model · GS cites 161

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

Xinsheng Wang, Mingqi Jiang, Ziyang Ma, Ziyu Zhang, Songxiang Liu, et al., Ruibin Yuan, et al.

Efficient LLM-based TTS with single-stream decoupled speech tokens and controllable voice synthesis.

NeurIPS 2025 · Audio reasoning benchmark · GS cites 83

MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix

Ziyang Ma, Yinghao Ma, Yanqiao Zhu, Chen Yang, Yi-Wen Chao, et al., Ruibin Yuan, et al.

Audio-language reasoning benchmark spanning speech, audio, music, and mixed-modality questions; NeurIPS 2025.

ICLR 2026 · Expressive S2ST

UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice

Sitong Cheng, Weizhen Bian, Xinsheng Wang, Ruibin Yuan, Jianyi Chen, Shunshun Yin, Yike Guo, Wei Xue.

Single-stage expressive speech-to-speech translation preserving content, speaker identity, emotion, and duration; ICLR 2026 poster.

CoRR 2024 · Chinese multimodal benchmark · GS cites 74

CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark

Ge Zhang, Xinrun Du, Bei Chen, Yiming Liang, Tongxu Luo, et al., Ruibin Yuan, et al.

Chinese multimodal benchmark extending MMMU-style expert reasoning across disciplines; listed as CoRR 2024, not CVPR.

NAACL 2025 · Instruction data quality · GS cites 57

COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning

Yuelin Bai, Xeron Du, Yiming Liang, Leo Jin, Junting Zhou, et al., Ruibin Yuan, et al.

High-quality Chinese instruction data construction and fine-tuning.

Music Understanding and MIR

ICLR 2024 · Music representation · GS cites 329

MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training

Yizhi Li, Ruibin Yuan, Ge Zhang, Yinghao Ma, et al.

Large-scale self-supervised acoustic music model for transferable music understanding.

NeurIPS 2023 · Music benchmark · GS cites 59

MARBLE: Music Audio Representation Benchmark for Universal Evaluation

Ruibin Yuan, Yinghao Ma, Yizhi Li, Ge Zhang, et al.

Unified benchmark for evaluating music audio representations across diverse MIR tasks.

Survey 2024 · Music foundation models · GS cites 79

Foundation Models for Music: A Survey

Yinghao Ma, Anders Oland, Anton Ragni, Chris Donahue, Chenghua Lin, et al., Ruibin Yuan, et al.

Comprehensive survey of representation, generation, multimodal learning, agents, datasets, and evaluation for music foundation models.

ACL 2025 · Universal music retrieval · GS cites 40

CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages

Shangda Wu, Zhancheng Guo, Ruibin Yuan, Junyan Jiang, Seungheon Doh, Gus Xia, Juhan Nam, et al.

Universal music information retrieval across unaligned modalities and unseen languages.

NAACL 2025 · Multilingual music retrieval · GS cites 19

CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models

Shangda Wu, Yashan Wang, Ruibin Yuan, Zhancheng Guo, Xu Tan, Ge Zhang, et al.

Multilingual music information retrieval across text, audio, and symbolic music interfaces.

ISMIR 2023 · Lyrics transcription · GS cites 43

LyricWhiz: Robust Multilingual Zero-shot Lyrics Transcription by Whispering to ChatGPT

Le Zhuo, Ruibin Yuan, Jiahao Pan, Yinghao Ma, et al.

Multilingual zero-shot lyrics transcription using speech recognition and LLM post-processing.

Preprint 2025 · Music structure analysis

SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision

Chunbo Hao, Ruibin Yuan, Jixun Yao, Qixin Deng, Xinyi Bai, Wei Xue, Lei Xie.

Large-scale heterogeneous supervision for music structure analysis, with SongFormDB and SongFormBench.

See Google Scholar for the complete publication list.

Let’s Build Together

I am always open to collaboration with people who care about open music AI, foundation models, evaluation, and creative tools. The shortest informal version of what I am exploring lives in my GitHub profile README, and the more academic version lives here.

Current ideas I am especially excited about:

  • A public Lyrics2Song dataset.
  • Better diffusion upsampling for YuE, especially fidelity and resolution.
  • Better evaluation metrics for musicality.
  • More controllability for full-song generation.
  • A music arena for popular AI music systems.
  • NMLB (No Music Left Behind): collecting all human music, not only Western music, and building open music understanding and generation models on top of it.

GitHub Profile README · MAP Discord · Buy Me a Coffee

Experience

Qwen
Research Intern, Qwen-Omni Series.
2025.04 - present · Remote

Moonshot.ai
Research Intern.
2024.09 - 2025.04 · Remote

Stardust.ai
Part-time ML Consultant, former engineer.
2020.05 - 2021 · Beijing

NetEase, Inc.
Machine Learning Engineer Intern.
2019.12 - 2020.04 · Shanghai

Education

Hong Kong University of Science and Technology
PhD in Artificial Intelligence.
2023.09 - present · Hong Kong

Carnegie Mellon University
M.S. in Music and Technology, Computer Science Emphasis.
2021.09 - 2023.05 · Pittsburgh

Music

Lead guitarist, band cofounder, guitar club organizer, and longtime believer that intelligence without music is missing something essential.