---
license: apache-2.0
---
## Introduction
**InfiMed-SFT-3B** is a versatile, medical-focused Multimodal Large Language Model (MLLM) developed by the InfiXAI team, leveraging the [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) framework.
**InfiMed-RL-3B**, built upon InfiMed-SFT-3B, is further refined using [EasyR1](https://github.com/hiyouga/EasyR1).
These models outperform larger-scale general-purpose models like Qwen2.5-VL-7B and InternVL2.5-8B, as well as specialized medical open-source models such as MedGemma-4B-IT and HuatuoGPT-V-7B.
Both InfiMed-SFT-3B and InfiMed-RL-3B deliver high performance as resource-efficient MLLMs, ensuring accessibility and affordability for a broad audience.
We invite you to explore their capabilities and welcome inquiries or collaboration opportunities.
## Evaluation Results
We evaluated our models with [MedEvalKit](https://github.com/alibaba-damo-academy/MedEvalKit), using Qwen2.5-72B as the judge model.
The results are as follows.
**Model Comparison Table**

| Model | Size | MMMU-H&M | VQA-RAD | SLAKE | PathVQA | PMC-VQA | OmniMedVQA | MedXpertQA | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Proprietary Models** | | | | | | | | | |
| GPT-5 | - | 83.60 | 67.80 | 78.10 | 52.80 | 60.00 | 76.40 | 71.00 | 70.00 |
| GPT-5-mini | - | 80.50 | 66.30 | 76.10 | 52.40 | 57.60 | 70.90 | 60.10 | 66.30 |
| GPT-5-nano | - | 74.10 | 55.40 | 69.30 | 45.40 | 51.30 | 66.50 | 45.10 | 58.20 |
| GPT-4.1 | - | 75.20 | 65.00 | 72.20 | 55.50 | 55.20 | 75.50 | 45.20 | 63.40 |
| Claude Sonnet 4 | - | 74.60 | 67.60 | 70.60 | 54.20 | 54.40 | 65.50 | 43.30 | 61.50 |
| Gemini-2.5-Flash | - | 76.90 | 68.50 | 75.80 | 55.40 | 55.40 | 71.00 | 52.80 | 65.10 |
| **General Open-source Models** | | | | | | | | | |
| Qwen2.5VL-3B | 3B | 51.30 | 56.80 | 63.20 | 37.10 | 50.60 | 64.50 | 20.70 | 49.20 |
| Qwen2.5VL-7B | 7B | 54.00 | 64.96 | 67.62 | 44.60 | 51.25 | 63.47 | 21.70 | 52.51 |
| InternVL2.5-8B | 8B | 53.50 | 59.40 | 69.00 | 42.10 | 51.30 | 81.30 | 21.70 | 54.00 |
| InternVL3-8B | 8B | 59.20 | 65.40 | 72.80 | 48.60 | 53.80 | 79.10 | 22.40 | 57.30 |
| **Medical Open-source Models** | | | | | | | | | |
| MedGemma-4B-IT | 4B | 43.70 | 72.50 | 76.40 | 48.80 | 49.90 | 69.80 | 22.30 | 54.80 |
| LLaVA-Med-7B | 7B | 29.30 | 53.70 | 48.00 | 38.80 | 30.50 | 44.30 | 20.30 | 37.80 |
| HuatuoGPT-V-7B | 7B | 47.30 | 67.00 | 67.80 | 48.00 | 53.30 | 74.20 | 21.60 | 54.20 |
| Lingshu-7B | 7B | 54.00 | 67.90 | 83.10 | 61.90 | 56.30 | 82.90 | 26.70 | 61.80 |
| BioMediX2-8B | 8B | 39.80 | 49.20 | 57.70 | 37.00 | 43.50 | 63.30 | 21.80 | 44.60 |
| **InfiMed-Series Models** | | | | | | | | | |
| InfiMed-SFT-3B | 3B | 54.67 | 58.09 | 82.00 | 60.59 | 53.22 | 67.01 | 23.55 | 57.02 |
| InfiMed-RL-3B | 3B | 55.33 | 60.53 | 82.38 | 61.97 | 58.74 | 71.71 | 23.60 | 59.18 |
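The Avg. column matches the unweighted mean of the seven benchmark scores. A quick sanity check (a minimal sketch using only the InfiMed rows from the table above):

```python
# Scores in table order:
# MMMU-H&M, VQA-RAD, SLAKE, PathVQA, PMC-VQA, OmniMedVQA, MedXpertQA
scores = {
    "InfiMed-SFT-3B": [54.67, 58.09, 82.00, 60.59, 53.22, 67.01, 23.55],
    "InfiMed-RL-3B": [55.33, 60.53, 82.38, 61.97, 58.74, 71.71, 23.60],
}

for name, vals in scores.items():
    avg = sum(vals) / len(vals)
    print(f"{name}: {avg:.2f}")  # 57.02 and 59.18, matching the Avg. column
```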
## Model Download
Download the InfiMed models from the Hugging Face Hub into the `./models` directory.
```bash
# Create a directory for models
mkdir -p ./models
# Download InfiMed-SFT-3B
huggingface-cli download --resume-download InfiX-ai/InfiMed-SFT-3B --local-dir ./models/InfiMed-SFT-3B
# Download InfiMed-RL-3B
huggingface-cli download --resume-download InfiX-ai/InfiMed-RL-3B --local-dir ./models/InfiMed-RL-3B
```
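Alternatively, the download can be scripted with the `huggingface_hub` Python package (a minimal sketch, assuming `pip install -U huggingface_hub`):

```python
from huggingface_hub import snapshot_download

# Download both checkpoints into ./models/<model-name>
for repo_id in ("InfiX-ai/InfiMed-SFT-3B", "InfiX-ai/InfiMed-RL-3B"):
    local_dir = f"./models/{repo_id.split('/')[-1]}"
    snapshot_download(repo_id=repo_id, local_dir=local_dir)
    print(f"Downloaded {repo_id} to {local_dir}")
```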
## Inference
Our models are built on top of the Qwen2.5-VL family, so we include a simple usage example here and refer readers to [the standard inference procedure of Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL) for further details.
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
"InfiX-ai/InfiMed-SFT-3B", torch_dtype="auto", device_map="auto"
)
min_pixels = 256*28*28
max_pixels = 1280*28*28
processor = AutoProcessor.from_pretrained("InfiX-ai/InfiMed-SFT-3B", min_pixels=min_pixels, max_pixels=max_pixels)
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
},
{"type": "text", "text": "Describe this image."},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to(model.device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=4096)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
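If you downloaded the weights into `./models` as shown above, you can point `from_pretrained` at the local directory instead of the Hub ID. A hedged sketch (FlashAttention 2 is optional and only works if the `flash-attn` package is installed and supported by your GPU):

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

local_path = "./models/InfiMed-RL-3B"  # or ./models/InfiMed-SFT-3B

# Loading from a local directory works the same as loading from the Hub.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    local_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires flash-attn; omit to use the default attention
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(local_path)
```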
## Acknowledgements
Our models are built upon numerous outstanding open-source projects, such as [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory), [EasyR1](https://github.com/hiyouga/EasyR1), and [MedEvalKit](https://github.com/alibaba-damo-academy/MedEvalKit).
We are grateful for their contributions. We extend special thanks to the [Qwen](https://github.com/QwenLM/Qwen2.5-VL) team for their great base models.
## Citation Information
If you find this work useful, please consider citing the following paper:
```bibtex
@article{liu2025infimedlowresourcemedicalmllms,
title = {InfiMed: Low-Resource Medical MLLMs with Advancing Understanding and Reasoning},
author = {Liu, Zeyu and Hou, Zhitian and Zhu, Guanghao and Sang, Zhijie and Xie, Congkai and Yang, Hongxia},
journal = {arXiv preprint arXiv:2505.23867},
year = {2025},
url = {https://arxiv.org/abs/2505.23867}
}
```