SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning

Welcome to the documentation for SRPO, a novel framework designed to enhance the reasoning capabilities of multimodal large language models (MLLMs) through reflection-aware reinforcement learning.

🚀 Project Overview

SRPO introduces a reflection-aware RL pipeline in which MLLMs reflect on, critique, and iteratively revise their own reasoning on complex multimodal tasks.
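
To make the idea concrete, here is a minimal, hypothetical sketch of a single reflection-aware rollout in Python. Nothing here is SRPO's actual API: `policy`, `reward_fn`, and `reflection_bonus` are assumed placeholders, and the reward-shaping term shown is only one plausible way to reward useful self-reflection; see the paper for the actual method.

```python
# Hypothetical sketch of one reflection-aware rollout. `policy` is assumed to
# expose a generate(prompt, image) -> str method, and `reward_fn` to score a
# final answer. Neither is part of the SRPO codebase's real API.
from dataclasses import dataclass


@dataclass
class Rollout:
    answer: str          # initial answer
    reflection: str      # the model's self-critique
    revised_answer: str  # answer after incorporating the critique


def reflection_aware_rollout(policy, question, image, reward_fn,
                             reflection_bonus: float = 0.5):
    # 1. Generate an initial answer to the multimodal query.
    answer = policy.generate(question, image)

    # 2. Ask the model to critique its own reasoning.
    reflection = policy.generate(
        f"Critique the reasoning in this answer:\n{answer}", image)

    # 3. Revise the answer conditioned on the self-critique.
    revised = policy.generate(
        f"{question}\nCritique: {reflection}\nRevised answer:", image)

    # 4. Reward the revision; add a shaping bonus only when the
    #    reflection actually improved the answer.
    base, init = reward_fn(revised), reward_fn(answer)
    reward = base + reflection_bonus * max(0.0, base - init)
    return Rollout(answer, reflection, revised), reward
```

In this sketch, the shaping term pays the policy only for critiques that measurably improve the final answer, capturing the intuition of rewarding useful reflection rather than reflection for its own sake.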

✨ Key Features

- A reflection-aware RL training pipeline for multimodal LLMs
- Self-reflection and critique of the model's own reasoning
- Iterative refinement of answers on complex multimodal tasks

📦 Repository Structure

🏁 Quick Start

See the Quick Start guide for installation, data preparation, and running evaluation/reflection.

📄 Citation

If you use SRPO or this codebase, please cite our paper:

```bibtex
@misc{wan2025srpoenhancingmultimodalllm,
      title={SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning},
      author={Zhongwei Wan and Zhihao Dou and Che Liu and Yu Zhang and Dongfei Cui and Qinjian Zhao and Hui Shen and Jing Xiong and Yi Xin and Yifan Jiang and Yangfan He and Mi Zhang and Shen Yan},
      year={2025},
      eprint={2506.01713},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.01713},
}
```

For more details, explore the linked documentation pages or the codebase.