The `llm_sft/` directory contains scripts for supervised fine-tuning (SFT) data curation.
All scripts support a rich set of command-line arguments. Here are the most important ones:
Answer Evaluation (`llm_sft.answer_eval`) arguments:

- `--model` (str, default: `DEPLOYMENT`): LLM model name or deployment name.
- `--model_type` (str, default: `"remote"`): LLM backend type, either `remote` or `local`.
- `--platform` (str, default: `"VLLM"`): LLM platform, e.g., `VLLM`, `OpenAI`, `Azure`, etc.
- `--input_path` (str, required): Path to the input dataset (JSONL format; see the record sketch after this list), e.g., `cot100k`.
- `--image_dir` (str, optional): Directory containing images for multimodal input.
- `--image_url` (str, optional, default: `None`): Accessible URL prefix for images (if needed by the backend).
- `--batch_size` (int, default: 4): Batch size for concurrent LLM calls.
- `--concurrent_tasks` (int, default: 4): Number of concurrent batches to process.
- `--checkpoint_interval` (int, default: 5): Number of new results before saving a checkpoint.
- `--incorrect_prefix` (str, default: `data/checkpoint/incorrect_`): Prefix for checkpoint files of incorrect results.
- `--correct_prefix` (str, default: `data/checkpoint/correct_`): Prefix for checkpoint files of correct results.
- `--incorrect_final` (str, default: `data/incorrect_final.jsonl`): Final output file for all incorrect results.
- `--correct_final` (str, default: `data/correct_final.jsonl`): Final output file for all correct results.
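The exact JSONL schema is dataset-dependent; as a hedged illustration, a record might look like the following. The field names `content`, `image`, and `answer` are assumptions for the example, not confirmed by the scripts.

```python
import json

# Hypothetical record -- the field names ("content", "image", "answer")
# are illustrative assumptions; match them to your actual dataset.
record = {
    "content": "Question: What is shown in the image? <think>...</think>",
    "image": "0001.jpg",        # resolved relative to --image_dir
    "answer": "a red bicycle",  # ground truth the evaluation compares against
}

# JSONL means one JSON object per line.
with open("your_data.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```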
Reflection Evaluation (`llm_sft.reflection_eval`) arguments:

- `--model` (str, default: `DEPLOYMENT`): LLM model name or deployment name.
- `--model_type` (str, default: `"remote"`): LLM backend type, either `remote` or `local`.
- `--upload_image` (bool, optional): Set to `True` to upload images to a public image host or storage service (e.g., SM.MS, Imgur, S3, R2). Requires an implementation of `upload_image_and_get_url` in `utils/upload_utils.py` (a sketch follows this list).
- `--image_description_path` (str, optional): Path to the image description file; needed if you do not want to use multimodal input.
- `--platform` (str, default: `"VLLM"`): LLM platform, e.g., `VLLM`, `OpenAI`, `Azure`, etc.
- `--input_path` (str, required): Path to the initial responses.
- `--image_dir` (str, optional): Directory containing images for multimodal input.
- `--image_url` (str, optional, default: `None`): Accessible URL prefix for images if you have already uploaded them to a server.
- `--batch_size` (int, default: 4): Batch size for concurrent LLM calls.
- `--concurrent_tasks` (int, default: 4): Number of concurrent batches to process.
- `--checkpoint_interval` (int, default: 10): Number of new results before saving a checkpoint.
- `--reflection_prefix` (str, default: `data/checkpoint/reflection_`): Prefix for checkpoint files of reflection results.
- `--reflection_final` (str, default: `data/reflection_final.jsonl`): Final output file for all reflection results.
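If you pass `--upload_image`, you must supply `upload_image_and_get_url` yourself. Below is a minimal sketch targeting the SM.MS v2 API; the endpoint, auth header, and response shape are assumptions drawn from SM.MS's public documentation, and the token environment variable is hypothetical. Adapt it for Imgur, S3, or R2 as needed.

```python
import os
import requests

def upload_image_and_get_url(image_path: str) -> str:
    """Upload a local image and return a publicly accessible URL.

    Sketch against the SM.MS v2 API -- the endpoint, auth header, and
    response shape are assumptions; verify against your host's docs.
    """
    token = os.environ["SMMS_API_TOKEN"]  # hypothetical env var
    with open(image_path, "rb") as f:
        resp = requests.post(
            "https://sm.ms/api/v2/upload",
            headers={"Authorization": token},
            files={"smfile": f},
            timeout=60,
        )
    resp.raise_for_status()
    payload = resp.json()
    if not payload.get("success"):
        raise RuntimeError(f"Upload failed: {payload}")
    return payload["data"]["url"]
```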
Image Description Extraction (`llm_sft.image_description`) arguments:

- `--source` (str, choices: `mulberry`, `cot100k`): Data source name (uses that source's default extraction pattern).
- `--input_path` (str, optional): Path to your prepared data (overrides the source default).
- `--pattern` (str, optional): Custom regex pattern to extract descriptions (see the sketch after this list).
- `--output_path` (str, required): Output path for `image_description.jsonl`.
- `--content_key` (str, default: `content`): Key name in the JSON object where the text is stored.
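How `--pattern` is applied internally is up to the script; the following sketch only illustrates the general idea of pulling descriptions out of the `--content_key` field with a custom regex. The `<description>` tag pattern and file names are invented for the example and are not the built-in `mulberry`/`cot100k` patterns.

```python
import json
import re

# Illustrative pattern (NOT the built-in mulberry/cot100k default):
# captures text between <description> tags in each record's content field.
pattern = re.compile(r"<description>(.*?)</description>", re.DOTALL)

with open("data.jsonl", encoding="utf-8") as fin, \
     open("image_description.jsonl", "w", encoding="utf-8") as fout:
    for line in fin:
        record = json.loads(line)
        match = pattern.search(record["content"])  # value of --content_key
        if match:
            out = {"description": match.group(1).strip()}
            fout.write(json.dumps(out, ensure_ascii=False) + "\n")
```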
Usage examples:

Answer Evaluation:

```bash
python -m llm_sft.answer_eval \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --model_type remote \
    --platform VLLM \
    --input_path /path/to/your/data.jsonl \
    --image_dir /path/to/your/images
```
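Once the run finishes, you can sanity-check the correct/incorrect split with a few lines of Python (the paths below are the documented defaults):

```python
import json

# Count records in the final output files (documented default paths).
for path in ("data/correct_final.jsonl", "data/incorrect_final.jsonl"):
    with open(path, encoding="utf-8") as f:
        print(path, sum(1 for line in f if line.strip()))
```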
Reflection Evaluation:
```bash
python -m llm_sft.reflection_eval \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --model_type remote \
    --platform VLLM \
    --input_path /path/to/your/data.jsonl \
    --image_dir /path/to/your/images \
    --output_path /path/to/save/reflections.jsonl
```
Image Description Extraction:
```bash
python -m llm_sft.image_description \
    --input_path /path/to/your/data.jsonl \
    --source cot100k \
    --output_path /path/to/save/image_descriptions.jsonl
```
Run any script with `--help` to see the full list of arguments.
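All three scripts share the `--batch_size`, `--concurrent_tasks`, and `--checkpoint_interval` flags. The sketch below shows the generic batching-plus-checkpoint pattern these flags suggest; it is a conceptual illustration, not the scripts' actual implementation, and `call_llm` is a hypothetical stand-in for the backend call.

```python
import asyncio
import json

async def run_batches(items, call_llm, batch_size=4, concurrent_tasks=4,
                      checkpoint_interval=5, checkpoint_path="checkpoint.jsonl"):
    """Conceptual sketch of the pattern behind --batch_size,
    --concurrent_tasks, and --checkpoint_interval. `call_llm` is a
    hypothetical async function taking a batch and returning results."""
    sem = asyncio.Semaphore(concurrent_tasks)  # cap concurrent batches

    async def process(batch):
        async with sem:
            return await call_llm(batch)  # one LLM call per batch

    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    results, since_ckpt = [], 0
    for fut in asyncio.as_completed([process(b) for b in batches]):
        batch_results = await fut
        results.extend(batch_results)
        since_ckpt += len(batch_results)
        if since_ckpt >= checkpoint_interval:  # save every N new results
            with open(checkpoint_path, "w", encoding="utf-8") as f:
                for r in results:
                    f.write(json.dumps(r, ensure_ascii=False) + "\n")
            since_ckpt = 0
    return results
```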