SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning

Welcome to the documentation for SRPO, a novel framework designed to enhance the reasoning capabilities of multimodal large language models (MLLMs) through reflection-aware reinforcement learning.

🚀 Project Overview

SRPO introduces a reflection-aware RL pipeline in which MLLMs reflect on, critique, and iteratively revise their own reasoning on complex multimodal tasks.
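
To make the idea concrete, here is a minimal, hypothetical sketch of a single reflection-aware rollout in Python. Nothing here is SRPO's actual API: `policy`, `reward_fn`, and `reflection_bonus` are assumed placeholders, and the reward-shaping term shown is only one plausible way to reward useful self-reflection; see the paper for the actual method.

```python
# Hypothetical sketch of one reflection-aware rollout. `policy` is assumed to
# expose a generate(prompt, image) -> str method, and `reward_fn` to score a
# final answer. Neither is part of the SRPO codebase's real API.
from dataclasses import dataclass


@dataclass
class Rollout:
    answer: str          # initial answer
    reflection: str      # the model's self-critique
    revised_answer: str  # answer after incorporating the critique


def reflection_aware_rollout(policy, question, image, reward_fn,
                             reflection_bonus: float = 0.5):
    # 1. Generate an initial answer to the multimodal query.
    answer = policy.generate(question, image)

    # 2. Ask the model to critique its own reasoning.
    reflection = policy.generate(
        f"Critique the reasoning in this answer:\n{answer}", image)

    # 3. Revise the answer conditioned on the self-critique.
    revised = policy.generate(
        f"{question}\nCritique: {reflection}\nRevised answer:", image)

    # 4. Reward the revision; add a shaping bonus only when the
    #    reflection actually improved the answer.
    base, init = reward_fn(revised), reward_fn(answer)
    reward = base + reflection_bonus * max(0.0, base - init)
    return Rollout(answer, reflection, revised), reward
```

In this sketch, the shaping term pays the policy only for critiques that measurably improve the final answer, capturing the intuition of rewarding useful reflection rather than reflection for its own sake.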

✨ Key Features

- A reflection-aware RL training pipeline for multimodal LLMs
- Self-reflection and critique of the model's own reasoning
- Iterative refinement of answers on complex multimodal tasks

📦 Repository Structure

🏁 Quick Start

See the Quick Start guide for installation, data preparation, and running evaluation/reflection.

📄 Citation

If you use SRPO or this codebase, please cite our paper:

```bibtex
@misc{wan2025srpoenhancingmultimodalllm,
      title={SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning},
      author={Zhongwei Wan and Zhihao Dou and Che Liu and Yu Zhang and Dongfei Cui and Qinjian Zhao and Hui Shen and Jing Xiong and Yi Xin and Yifan Jiang and Yangfan He and Mi Zhang and Shen Yan},
      year={2025},
      eprint={2506.01713},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.01713},
}
```

For more details, explore the linked documentation pages or the codebase.