The `llm_sft/` directory contains scripts for supervised fine-tuning (SFT) data curation.
All scripts support a rich set of command-line arguments. Here are the most important ones:
Answer Evaluation Arguments:

- `--model` (str, default: `DEPLOYMENT`): LLM model name or deployment name.
- `--model_type` (str, default: `remote`): LLM backend type, either `remote` or `local`.
- `--platform` (str, default: `VLLM`): LLM platform, e.g., VLLM, OpenAI, Azure, etc.
- `--input_path` (str, required): Path to the input dataset (JSONL format), e.g., cot100k.
- `--image_dir` (str, optional): Directory containing images for multimodal input.
- `--image_url` (str, optional, default: `None`): Accessible URL prefix for images (if needed by the backend).
- `--batch_size` (int, default: 4): Batch size for concurrent LLM calls.
- `--concurrent_tasks` (int, default: 4): Number of concurrent batches to process.
- `--checkpoint_interval` (int, default: 5): Number of new results before saving a checkpoint (see the sketch after this list).
- `--incorrect_prefix` (str, default: `data/checkpoint/incorrect_`): Prefix for checkpoint files of incorrect results.
- `--correct_prefix` (str, default: `data/checkpoint/correct_`): Prefix for checkpoint files of correct results.
- `--incorrect_final` (str, default: `data/incorrect_final.jsonl`): Final output file for all incorrect results.
- `--correct_final` (str, default: `data/correct_final.jsonl`): Final output file for all correct results.
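To make the batching and checkpointing flags concrete, here is a minimal sketch of the pattern they describe. It is an illustration only: the helper names (`call_llm`, `save_checkpoint`) and the `<prefix><index>.jsonl` checkpoint naming are assumptions, not the script's actual internals.

```python
import asyncio
import json

async def call_llm(record: dict) -> dict:
    """Placeholder: one LLM call per record (hypothetical helper)."""
    ...

def save_checkpoint(results: list, prefix: str, index: int) -> None:
    # Checkpoint naming <prefix><index>.jsonl is an assumption for illustration.
    with open(f"{prefix}{index}.jsonl", "w", encoding="utf-8") as f:
        for r in results:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")

async def run(records, batch_size=4, concurrent_tasks=4,
              checkpoint_interval=5, prefix="data/checkpoint/correct_"):
    sem = asyncio.Semaphore(concurrent_tasks)            # --concurrent_tasks

    async def process_batch(batch):
        async with sem:                                  # cap in-flight batches
            return await asyncio.gather(*(call_llm(r) for r in batch))

    batches = [records[i:i + batch_size]                 # --batch_size
               for i in range(0, len(records), batch_size)]
    results, last_saved = [], 0
    for done in asyncio.as_completed([process_batch(b) for b in batches]):
        results.extend(await done)
        # Flush to disk every --checkpoint_interval new results.
        if len(results) - last_saved >= checkpoint_interval:
            last_saved = len(results)
            save_checkpoint(results, prefix, last_saved)
```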
Reflection Evaluation Arguments:

- `--model` (str, default: `DEPLOYMENT`): LLM model name or deployment name.
- `--model_type` (str, default: `remote`): LLM backend type, either `remote` or `local`.
- `--upload_image` (bool, optional): Set to `True` to upload images to a public image host or storage (e.g., SM.MS, Imgur, S3, R2, etc.). Requires an implementation of `upload_image_and_get_url` in `utils/upload_utils.py` (a sketch follows this list).
- `--image_description_path` (str, optional): Path to the image description file; needed if you do not want to use multimodal input.
- `--platform` (str, default: `VLLM`): LLM platform, e.g., VLLM, OpenAI, Azure, etc.
- `--input_path` (str, required): Path to the initial responses.
- `--image_dir` (str, optional): Directory containing images for multimodal input.
- `--image_url` (str, optional, default: `None`): Accessible URL prefix for images if you have already uploaded them to a server.
- `--batch_size` (int, default: 4): Batch size for concurrent LLM calls.
- `--concurrent_tasks` (int, default: 4): Number of concurrent batches to process.
- `--checkpoint_interval` (int, default: 10): Number of new results before saving a checkpoint.
- `--reflection_prefix` (str, default: `data/checkpoint/reflection_`): Prefix for checkpoint files of reflection results.
- `--reflection_final` (str, default: `data/reflection_final.jsonl`): Final output file for all reflection results.
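If you enable `--upload_image`, you must supply `upload_image_and_get_url` yourself. Below is a minimal sketch assuming SM.MS as the host; the endpoint, response layout, and the `SMMS_API_TOKEN` environment variable are assumptions to adapt for your own image host.

```python
# utils/upload_utils.py -- illustrative sketch only; not the repo's code.
import os
import requests

def upload_image_and_get_url(image_path: str) -> str:
    """Upload a local image and return a publicly accessible URL."""
    # SMMS_API_TOKEN is a hypothetical env var; the SM.MS endpoint and
    # response shape below are assumptions -- adjust for your host.
    token = os.environ["SMMS_API_TOKEN"]
    with open(image_path, "rb") as f:
        resp = requests.post(
            "https://sm.ms/api/v2/upload",
            headers={"Authorization": token},
            files={"smfile": f},
            timeout=30,
        )
    resp.raise_for_status()
    data = resp.json()
    if not data.get("success"):
        raise RuntimeError(f"Upload failed: {data.get('message')}")
    return data["data"]["url"]  # public URL of the uploaded image
```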
Image Description Extraction Arguments:

- `--source` (str, choices: `mulberry`, `cot100k`): Data source name (uses the default pattern).
- `--input_path` (str, optional): Path to your prepared data (overrides the source default).
- `--pattern` (str, optional): Custom regex pattern to extract descriptions (see the sketch after this list).
- `--output_path` (str, required): Output path for `image_description.jsonl`.
- `--content_key` (str, default: `content`): Key name in the JSON object where the text is stored.
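A custom `--pattern` presumably gets applied to each record's `--content_key` field. The following is a hedged sketch of that idea; the `<description>` tag pattern and the output schema are illustrative assumptions, not the script's actual defaults.

```python
# Hedged sketch of custom-pattern extraction over a JSONL file.
import json
import re

def extract_descriptions(input_path, output_path, pattern, content_key="content"):
    regex = re.compile(pattern, re.DOTALL)  # pattern must contain one capture group
    with open(input_path, encoding="utf-8") as fin, \
         open(output_path, "w", encoding="utf-8") as fout:
        for line in fin:
            record = json.loads(line)
            match = regex.search(record.get(content_key, ""))
            if match:
                fout.write(json.dumps({"image_description": match.group(1)},
                                      ensure_ascii=False) + "\n")

# Example: capture text between hypothetical <description> tags.
# extract_descriptions("data.jsonl", "image_description.jsonl",
#                      r"<description>(.*?)</description>")
```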
Answer Evaluation:

```bash
python -m llm_sft.answer_eval \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --model_type remote \
    --platform VLLM \
    --input_path /path/to/your/data.jsonl \
    --image_dir /path/to/your/images
```
Reflection Evaluation:
```bash
python -m llm_sft.reflection_eval \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --model_type remote \
    --platform VLLM \
    --input_path /path/to/your/data.jsonl \
    --image_dir /path/to/your/images \
    --output_path /path/to/save/reflections.jsonl
```
Image Description Extraction:
```bash
python -m llm_sft.image_description \
    --input_path /path/to/your/data.jsonl \
    --source cot100k \
    --output_path /path/to/save/image_descriptions.jsonl
```
Run any script with `--help` to see the full list of supported arguments.