---
license: apache-2.0
---

## Introduction

**InfiMed-SFT-3B** is a versatile, medical-focused Multimodal Large Language Model (MLLM) developed by the InfiXAI team using the [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) framework. **InfiMed-RL-3B**, built upon InfiMed-SFT-3B, is further refined with [EasyR1](https://github.com/hiyouga/EasyR1). These models outperform larger-scale general-purpose models such as Qwen2.5-VL-7B and InternVL2.5-8B, as well as specialized medical open-source models such as MedGemma-4B-IT and HuatuoGPT-V-7B. Both InfiMed-SFT-3B and InfiMed-RL-3B deliver strong performance as resource-efficient MLLMs, ensuring accessibility and affordability for a broad audience. We invite you to explore their capabilities and welcome inquiries or collaboration opportunities.

## Evaluation Results

We evaluated our models on [MedEvalKit](https://github.com/alibaba-damo-academy/MedEvalKit), using Qwen2.5-72B as the judge model. The results are as follows.
| Model | Size | MMMU-H&M | VQA-RAD | SLAKE | PathVQA | PMC-VQA | OmniMedVQA | MedXpertQA | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | | | | |
| GPT-5 | - | 83.60 | 67.80 | 78.10 | 52.80 | 60.00 | 76.40 | 71.00 | 70.00 |
| GPT-5-mini | - | 80.50 | 66.30 | 76.10 | 52.40 | 57.60 | 70.90 | 60.10 | 66.30 |
| GPT-5-nano | - | 74.10 | 55.40 | 69.30 | 45.40 | 51.30 | 66.50 | 45.10 | 58.20 |
| GPT-4.1 | - | 75.20 | 65.00 | 72.20 | 55.50 | 55.20 | 75.50 | 45.20 | 63.40 |
| Claude Sonnet 4 | - | 74.60 | 67.60 | 70.60 | 54.20 | 54.40 | 65.50 | 43.30 | 61.50 |
| Gemini-2.5-Flash | - | 76.90 | 68.50 | 75.80 | 55.40 | 55.40 | 71.00 | 52.80 | 65.10 |
| **General Open-source Models** | | | | | | | | | |
| Qwen2.5VL-3B | 3B | 51.30 | 56.80 | 63.20 | 37.10 | 50.60 | 64.50 | 20.70 | 49.20 |
| Qwen2.5VL-7B | 7B | 54.00 | 64.96 | 67.62 | 44.60 | 51.25 | 63.47 | 21.70 | 52.51 |
| InternVL2.5-8B | 8B | 53.50 | 59.40 | 69.00 | 42.10 | 51.30 | 81.30 | 21.70 | 54.00 |
| InternVL3-8B | 8B | 59.20 | 65.40 | 72.80 | 48.60 | 53.80 | 79.10 | 22.40 | 57.30 |
| **Medical Open-source Models** | | | | | | | | | |
| MedGemma-4B-IT | 4B | 43.70 | 72.50 | 76.40 | 48.80 | 49.90 | 69.80 | 22.30 | 54.80 |
| LLaVA-Med-7B | 7B | 29.30 | 53.70 | 48.00 | 38.80 | 30.50 | 44.30 | 20.30 | 37.80 |
| HuatuoGPT-V-7B | 7B | 47.30 | 67.00 | 67.80 | 48.00 | 53.30 | 74.20 | 21.60 | 54.20 |
| Lingshu-7B | 7B | 54.00 | 67.90 | 83.10 | 61.90 | 56.30 | 82.90 | 26.70 | 61.80 |
| BioMediX2-8B | 8B | 39.80 | 49.20 | 57.70 | 37.00 | 43.50 | 63.30 | 21.80 | 44.60 |
| **InfiMed-Series Models** | | | | | | | | | |
| InfiMed-SFT-3B | 3B | 54.67 | 58.09 | 82.00 | 60.59 | 53.22 | 67.01 | 23.55 | 57.02 |
| InfiMed-RL-3B | 3B | 55.33 | 60.53 | 82.38 | 61.97 | 58.74 | 71.71 | 23.60 | 59.18 |
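As a reading aid, the Avg. column matches the unweighted arithmetic mean of the seven benchmark columns; this is an inference from the numbers themselves, not from MedEvalKit documentation. A minimal check against the InfiMed-RL-3B row:

```python
# Hedged sanity check: assumes Avg. is the unweighted mean of the seven benchmarks.
scores = [55.33, 60.53, 82.38, 61.97, 58.74, 71.71, 23.60]  # InfiMed-RL-3B row
print(round(sum(scores) / len(scores), 2))  # -> 59.18, matching the Avg. column
```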
## Model Download

Download the InfiMed models from the Hugging Face Hub into the `./models` directory.

```bash
# Create a directory for models
mkdir -p ./models

# Download InfiMed-SFT-3B
huggingface-cli download --resume-download InfiX-ai/InfiMed-SFT-3B --local-dir ./models/InfiMed-SFT-3B

# Download InfiMed-RL-3B
huggingface-cli download --resume-download InfiX-ai/InfiMed-RL-3B --local-dir ./models/InfiMed-RL-3B
```

## Inference

Our models are built on top of the Qwen2.5-VL family, so we include a simple use case here and refer readers to [the standard inference procedure of Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL) for details. A hedged vLLM serving sketch also appears at the end of this card.

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "InfiX-ai/InfiMed-SFT-3B", torch_dtype="auto", device_map="auto"
)

# Bound the number of visual tokens by constraining the allowed pixel range
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "InfiX-ai/InfiMed-SFT-3B", min_pixels=min_pixels, max_pixels=max_pixels
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: generate the output and strip the prompt tokens from each sequence
generated_ids = model.generate(**inputs, max_new_tokens=4096)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

## Acknowledgements

Our models are built upon numerous outstanding open-source projects, such as [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory), [EasyR1](https://github.com/hiyouga/EasyR1), and [MedEvalKit](https://github.com/alibaba-damo-academy/MedEvalKit). We are grateful for their contributions. We extend special thanks to the [Qwen](https://github.com/QwenLM/Qwen2.5-VL) team for their great base models.

## Citation Information

If you find this work useful, please consider citing the following paper:

```bibtex
@article{liu2025infimedlowresourcemedicalmllms,
  title   = {InfiMed: Low-Resource Medical MLLMs with Advancing Understanding and Reasoning},
  author  = {Liu, Zeyu and Hou, Zhitian and Zhu, Guanghao and Sang, Zhijie and Xie, Congkai and Yang, Hongxia},
  journal = {arXiv preprint arXiv:2505.23867},
  year    = {2025},
  url     = {https://arxiv.org/abs/2505.23867}
}
```
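## Serving with vLLM (Sketch)

Since the InfiMed checkpoints share the Qwen2.5-VL architecture, which vLLM supports, serving them behind an OpenAI-compatible endpoint should also work. The following is a minimal sketch under that assumption; the local path, port, and `max_tokens` value are illustrative, and we have not verified this deployment on the InfiMed checkpoints.

```python
# Hedged sketch: assumes vLLM's Qwen2.5-VL support also loads InfiMed
# (same architecture); not verified on these checkpoints.
# First, start an OpenAI-compatible server in a separate shell:
#   vllm serve ./models/InfiMed-RL-3B --port 8000
from openai import OpenAI

# vLLM's server does not check the API key; "EMPTY" is a conventional placeholder
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    # Must match the name the server registers (the path by default,
    # or whatever --served-model-name was set to)
    model="./models/InfiMed-RL-3B",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
                    },
                },
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```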