🔥 Quick Start

Self-Reflection SFT Data Curation

```bash
# Clone the repository
git clone https://github.com/SUSTechBruce/SRPO_MLLMs
cd SRPO_MLLMs

# Install dependencies
pip install -r requirements.txt
```

1. Data Preparation

Example (LLaVA-CoT-100k format):

```json
{
  "query": "How many Mexican municipal leaders were killed in the previous year? Answer the question using a single word or phrase.",
  "image": "chartqa/train/png/two_col_100466.png",
  "answer": "21",
  "content": "<SUMMARY> I will examine the image to determine the number of Mexican municipal leaders killed in the previous year by analyzing the data presented in the bar chart. </SUMMARY>\n\n<CAPTION> The image displays a bar chart illustrating the number of Mexican municipal leaders killed each year from 2005 to 2018. Each bar represents the total number of victims for a specific year. </CAPTION>\n\n<REASONING> I will look at the bar corresponding to the year 2017 to find the number of Mexican municipal leaders killed in the previous year. The chart indicates that in 2017, there were 21 victims, as shown by the height of the bar labeled for that year. </REASONING>\n\n<CONCLUSION> 21 </CONCLUSION>"
}
```
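To inspect records in this format, the tagged sections of the `content` field can be split apart with a small helper. The following is a minimal sketch and not part of the SRPO codebase: the function name `split_sections` and the shortened sample record are illustrative assumptions, and only the four tag names are taken from the example above.

```python
import re

# Tag names as they appear in the example record above.
SECTION_TAGS = ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION")


def split_sections(content: str) -> dict[str, str]:
    """Map each tag found in `content` to the stripped text it encloses."""
    sections = {}
    for tag in SECTION_TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", content, flags=re.DOTALL)
        if match:
            sections[tag] = match.group(1).strip()
    return sections


# Shortened, hypothetical record used only to exercise the helper.
record = {
    "answer": "21",
    "content": "<SUMMARY> Read the bar chart. </SUMMARY>\n\n<CONCLUSION> 21 </CONCLUSION>",
}
sections = split_sections(record["content"])
# True when the tagged conclusion agrees with the reference answer.
conclusion_matches = sections.get("CONCLUSION") == record["answer"]
```

Comparing the `CONCLUSION` text with `answer` is one simple sanity check on prepared data; tags that are absent from a record are simply left out of the returned mapping.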

2. Data Construction

Answer Evaluation

```bash
python -m llm_sft.answer_eval \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --model_type remote \
    --platform VLLM \
    --input_path /path/to/your/data.jsonl \
    --image_dir /path/to/your/images
```

Note: This command runs the model specified by --model to answer the queries in your prepared data.
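Before starting a long evaluation run, it can help to check that the input file is well-formed. The sketch below is a hypothetical pre-flight check, not a tool shipped with this repository. It assumes that each line of the JSONL file is one record with the four fields shown in the Data Preparation example, and that the `image` value is a path relative to the directory passed as `--image_dir`; both are assumptions rather than documented behavior.

```python
import json
from pathlib import Path

# Field names taken from the example record in Data Preparation.
REQUIRED_FIELDS = ("query", "image", "answer", "content")


def find_problems(jsonl_path, image_dir):
    """Return (line_number, message) pairs for records that look malformed."""
    problems = []
    with open(jsonl_path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as error:
                problems.append((line_number, f"invalid JSON: {error.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((line_number, "record is not a JSON object"))
                continue
            missing = [name for name in REQUIRED_FIELDS if name not in record]
            if missing:
                problems.append((line_number, "missing fields: " + ", ".join(missing)))
            # Assumption: image paths are relative to image_dir.
            elif not (Path(image_dir) / record["image"]).is_file():
                problems.append((line_number, "image not found: " + record["image"]))
    return problems
```

Calling `find_problems("/path/to/your/data.jsonl", "/path/to/your/images")` returns an empty list when every record passes these checks.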

Reflection Evaluation

```bash
python -m llm_sft.reflection_eval \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --model_type remote \
    --platform VLLM \
    --input_path /path/to/your/data.jsonl \
    --image_dir /path/to/your/images \
    --output_path /path/to/save/reflections.jsonl
```

Image Description Extraction

```bash
python -m llm_sft.image_description \
    --input_path /path/to/your/data.jsonl \
    --source cot100k \
    --output_path /path/to/save/image_descriptions.jsonl
```

3. Output

4. Workflow

You can also run the shell scripts provided in the /scripts directory (such as eval_answer.sh, eval_reflection.sh, eval_extract_description.sh) for one-click batch evaluation and image description extraction.
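If you prefer to drive the three steps from Python, the commands shown in Data Construction can be assembled programmatically. This is a rough sketch only: it does not reproduce the contents of the scripts in /scripts, the function name `build_commands` is hypothetical, and the order of the steps simply follows the order in which they are documented above. All flags are copied from the example commands.

```python
import shlex


def build_commands(data_path, image_dir, reflections_path, descriptions_path,
                   model="Qwen/Qwen2.5-VL-7B-Instruct"):
    """Build argument lists for the three documented llm_sft commands."""
    # Flags shared by answer_eval and reflection_eval in the examples above.
    common = [
        "--model", model,
        "--model_type", "remote",
        "--platform", "VLLM",
        "--input_path", data_path,
        "--image_dir", image_dir,
    ]
    return [
        ["python", "-m", "llm_sft.answer_eval", *common],
        ["python", "-m", "llm_sft.reflection_eval", *common,
         "--output_path", reflections_path],
        ["python", "-m", "llm_sft.image_description",
         "--input_path", data_path,
         "--source", "cot100k",
         "--output_path", descriptions_path],
    ]


for command in build_commands(
    "/path/to/your/data.jsonl",
    "/path/to/your/images",
    "/path/to/save/reflections.jsonl",
    "/path/to/save/image_descriptions.jsonl",
):
    # Print each command; pass the list to subprocess.run(command, check=True)
    # to execute it instead.
    print(shlex.join(command))
```

Keeping each command as a list of arguments avoids shell quoting problems when paths contain spaces.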


5. Reproducibility

You can use the SFT data we provide in our Hugging Face dataset, or prepare your own using the methods described above.

TODO: Self-Reflection Cold Start

TODO: Self-Reflection RL Training

📄 Citation

If you use SRPO or this codebase, please cite our paper:

```bibtex
placeholder
```
