```bash
# Clone the repository
git clone https://github.com/SUSTechBruce/SRPO_MLLMs
cd SRPO_MLLMs

# Install dependencies
pip install -r requirements.txt
```
Place your data file (e.g., `input.jsonl`) in a designated data directory (such as `data/`).

Example (LLaVA-CoT-100k format):
```json
{
  "query": "How many Mexican municipal leaders were killed in the previous year? Answer the question using a single word or phrase.",
  "image": "chartqa/train/png/two_col_100466.png",
  "answer": "21",
  "content": "<SUMMARY> I will examine the image to determine the number of Mexican municipal leaders killed in the previous year by analyzing the data presented in the bar chart. </SUMMARY>\n\n<CAPTION> The image displays a bar chart illustrating the number of Mexican municipal leaders killed each year from 2005 to 2018. Each bar represents the total number of victims for a specific year. </CAPTION>\n\n<REASONING> I will look at the bar corresponding to the year 2017 to find the number of Mexican municipal leaders killed in the previous year. The chart indicates that in 2017, there were 21 victims, as shown by the height of the bar labeled for that year. </REASONING>\n\n<CONCLUSION> 21 </CONCLUSION>"
}
```
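If you are assembling your own dataset in this format, a minimal way to write it as JSON Lines is sketched below. The sample values are copied from the example above, and the output path is an assumption; adjust both to your data.

```python
import json

# Each line of the output file is one JSON object with the fields used
# above; the sample below reuses the example values for illustration.
samples = [
    {
        "query": "How many Mexican municipal leaders were killed in the previous year? Answer the question using a single word or phrase.",
        "image": "chartqa/train/png/two_col_100466.png",
        "answer": "21",
        "content": "<SUMMARY> ... </SUMMARY>\n\n<CAPTION> ... </CAPTION>\n\n<REASONING> ... </REASONING>\n\n<CONCLUSION> 21 </CONCLUSION>",
    },
]

# Assumed output location; match it to your data directory.
with open("data/input.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```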
Each sample must contain the fields `query`, `answer`, and `image`. The `content` field (as in Mulberry-SFT and LLaVA-CoT-100k) is optional and is used for image description extraction.

Place your images in an image directory (such as `images/`), and make sure the `image` field in your input file contains the correct relative path or URL to each image.

Then run answer evaluation:

```bash
python -m llm_sft.answer_eval \
  --model Qwen/Qwen2.5-VL-7B-Instruct \
  --model_type remote \
  --platform VLLM \
  --input_path /path/to/your/data.jsonl \
  --image_dir /path/to/your/images
```
Note: This command runs the LLM to answer the queries in your prepared data.
Next, generate reflections:

```bash
python -m llm_sft.reflection_eval \
  --model Qwen/Qwen2.5-VL-7B-Instruct \
  --model_type remote \
  --platform VLLM \
  --input_path /path/to/your/data.jsonl \
  --image_dir /path/to/your/images \
  --output_path /path/to/save/reflections.jsonl
```
Note:
- This command lets the advanced MLLM generate reflections for each sample.
- If you use `openai` or `azure` as the platform, images are automatically encoded as base64 and sent to the API by default.
- For large images, or to avoid base64 encoding, you can upload your images to a public server or image hosting service, then set the `--image_url` argument to the accessible URL prefix.
- Alternatively, you can implement your own upload logic in `utils/upload_utils.py` and use the `--upload_image` flag to enable custom image uploading (see the sketch after this list).
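As an illustration, custom upload logic might look like the sketch below. The function name, its signature, and the upload endpoint are assumptions, not the interface actually expected by `utils/upload_utils.py`; adapt them to your hosting service and to the codebase.

```python
# utils/upload_utils.py -- illustrative sketch only; the function
# name/signature expected by the codebase may differ.
from pathlib import Path

import requests

# Hypothetical endpoint of your own image hosting service.
UPLOAD_ENDPOINT = "https://your-image-host.example.com/upload"


def upload_image(image_path: str) -> str:
    """Upload a local image and return a publicly accessible URL."""
    path = Path(image_path)
    with path.open("rb") as f:
        resp = requests.post(UPLOAD_ENDPOINT, files={"file": (path.name, f)})
    resp.raise_for_status()
    # Assumes the hosting service replies with JSON like {"url": "..."}.
    return resp.json()["url"]
```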
To extract image descriptions (optional):

```bash
python -m llm_sft.image_description \
  --input_path /path/to/your/data.jsonl \
  --source cot100k \
  --output_path /path/to/save/image_descriptions.jsonl
```
Note:
- Run this only if you want to use unimodal models (e.g., o3-mini) for reflection, or need to extract image descriptions for other purposes.
- You can extract image descriptions from Mulberry-SFT and LLaVA-CoT-100k using our predefined patterns, or from your own dataset with a custom pattern (see the sketch below).
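To illustrate what a custom pattern can look like, the sketch below pulls the description out of the `<CAPTION>` block used in the LLaVA-CoT-100k-style example above. It is not the exact pattern used by `llm_sft.image_description`; it only shows the general idea, and the input path is an assumption.

```python
import json
import re

# Matches the <CAPTION> ... </CAPTION> block in the `content` field
# (format shown in the example above); adjust the pattern for your data.
CAPTION_RE = re.compile(r"<CAPTION>(.*?)</CAPTION>", re.DOTALL)


def extract_description(sample: dict) -> str | None:
    """Return the image description from a sample, or None if absent."""
    match = CAPTION_RE.search(sample.get("content", ""))
    return match.group(1).strip() if match else None


# Assumed input location; point this at your own JSONL file.
with open("data/input.jsonl", encoding="utf-8") as f:
    for line in f:
        print(extract_description(json.loads(line)))
```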
You can also run the shell scripts provided in the `/scripts` directory (such as `eval_answer.sh`, `eval_reflection.sh`, and `eval_extract_description.sh`) for one-click batch evaluation and image description extraction.
You can use the SFT data we provide in our Hugging Face dataset, or prepare your own using the methods described above.
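If you use the released data, it can be loaded with the Hugging Face `datasets` library. The repository id and split name below are placeholders, not the actual dataset name; substitute the identifier from our Hugging Face page.

```python
from datasets import load_dataset

# Placeholder repo id -- replace with the actual dataset name from the
# project's Hugging Face page; the split name is also an assumption.
sft_data = load_dataset("YOUR_ORG/YOUR_SRPO_SFT_DATASET", split="train")
print(sft_data[0])
```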
If you use SRPO or this codebase, please cite our paper:
```bibtex
placeholder
```