---
license: apache-2.0
---
## Introduction
**InfiMed-SFT-3B** is a versatile, medical-focused Multimodal Large Language Model (MLLM) developed by the InfiXAI team, leveraging the [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) framework.
**InfiMed-RL-3B**, built upon InfiMed-SFT-3B, is further refined using [EasyR1](https://github.com/hiyouga/EasyR1).
These models outperform larger-scale general-purpose models like Qwen2.5-VL-7B and InternVL2.5-8B, as well as specialized medical open-source models such as MedGemma-4B-IT and HuatuoGPT-V-7B.
Both InfiMed-SFT-3B and InfiMed-RL-3B deliver high performance as resource-efficient MLLMs, ensuring accessibility and affordability for a broad audience.
We invite you to explore their capabilities and welcome inquiries or collaboration opportunities.
## Evaluation Results
We evaluated our models with [MedEvalKit](https://github.com/alibaba-damo-academy/MedEvalKit), using Qwen2.5-72B as the judge model.
The results are as follows.
**Model Comparison Table**

| Model | Size | MMMU-H&M | VQA-RAD | SLAKE | PathVQA | PMC-VQA | OmniMedVQA | MedXpertQA | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Proprietary Models** | | | | | | | | | |
| GPT-5 | - | 83.60 | 67.80 | 78.10 | 52.80 | 60.00 | 76.40 | 71.00 | 70.00 |
| GPT-5-mini | - | 80.50 | 66.30 | 76.10 | 52.40 | 57.60 | 70.90 | 60.10 | 66.30 |
| GPT-5-nano | - | 74.10 | 55.40 | 69.30 | 45.40 | 51.30 | 66.50 | 45.10 | 58.20 |
| GPT-4.1 | - | 75.20 | 65.00 | 72.20 | 55.50 | 55.20 | 75.50 | 45.20 | 63.40 |
| Claude Sonnet 4 | - | 74.60 | 67.60 | 70.60 | 54.20 | 54.40 | 65.50 | 43.30 | 61.50 |
| Gemini-2.5-Flash | - | 76.90 | 68.50 | 75.80 | 55.40 | 55.40 | 71.00 | 52.80 | 65.10 |
| **General Open-source Models** | | | | | | | | | |
| Qwen2.5VL-3B | 3B | 51.30 | 56.80 | 63.20 | 37.10 | 50.60 | 64.50 | 20.70 | 49.20 |
| Qwen2.5VL-7B | 7B | 54.00 | 64.96 | 67.62 | 44.60 | 51.25 | 63.47 | 21.70 | 52.51 |
| InternVL2.5-8B | 8B | 53.50 | 59.40 | 69.00 | 42.10 | 51.30 | 81.30 | 21.70 | 54.00 |
| InternVL3-8B | 8B | 59.20 | 65.40 | 72.80 | 48.60 | 53.80 | 79.10 | 22.40 | 57.30 |
| **Medical Open-source Models** | | | | | | | | | |
| MedGemma-4B-IT | 4B | 43.70 | 72.50 | 76.40 | 48.80 | 49.90 | 69.80 | 22.30 | 54.80 |
| LLaVA-Med-7B | 7B | 29.30 | 53.70 | 48.00 | 38.80 | 30.50 | 44.30 | 20.30 | 37.80 |
| HuatuoGPT-V-7B | 7B | 47.30 | 67.00 | 67.80 | 48.00 | 53.30 | 74.20 | 21.60 | 54.20 |
| Lingshu-7B | 7B | 54.00 | 67.90 | 83.10 | 61.90 | 56.30 | 82.90 | 26.70 | 61.80 |
| BioMediX2-8B | 8B | 39.80 | 49.20 | 57.70 | 37.00 | 43.50 | 63.30 | 21.80 | 44.60 |
| **InfiMed-Series Models** | | | | | | | | | |
| InfiMed-SFT-3B | 3B | 54.67 | 58.09 | 82.00 | 60.59 | 53.22 | 67.01 | 23.55 | 57.02 |
| InfiMed-RL-3B | 3B | 55.33 | 60.53 | 82.38 | 61.97 | 58.74 | 71.71 | 23.60 | 59.18 |
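The Avg. column matches the unweighted mean of the seven benchmark scores. A quick sanity check (a minimal sketch using only the InfiMed rows from the table above):

```python
# Scores in table order:
# MMMU-H&M, VQA-RAD, SLAKE, PathVQA, PMC-VQA, OmniMedVQA, MedXpertQA
scores = {
    "InfiMed-SFT-3B": [54.67, 58.09, 82.00, 60.59, 53.22, 67.01, 23.55],
    "InfiMed-RL-3B": [55.33, 60.53, 82.38, 61.97, 58.74, 71.71, 23.60],
}

for name, vals in scores.items():
    avg = sum(vals) / len(vals)
    print(f"{name}: {avg:.2f}")  # 57.02 and 59.18, matching the Avg. column
```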
## Model Download
Download the InfiMed models from the Hugging Face Hub into the `./models` directory.
```bash
# Create a directory for models
mkdir -p ./models
# Download InfiMed-SFT-3B
huggingface-cli download --resume-download InfiX-ai/InfiMed-SFT-3B --local-dir ./models/InfiMed-SFT-3B
# Download InfiMed-RL-3B
huggingface-cli download --resume-download InfiX-ai/InfiMed-RL-3B --local-dir ./models/InfiMed-RL-3B
```
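Alternatively, the download can be scripted with the `huggingface_hub` Python package (a minimal sketch, assuming `pip install -U huggingface_hub`):

```python
from huggingface_hub import snapshot_download

# Download both checkpoints into ./models/<model-name>
for repo_id in ("InfiX-ai/InfiMed-SFT-3B", "InfiX-ai/InfiMed-RL-3B"):
    local_dir = f"./models/{repo_id.split('/')[-1]}"
    snapshot_download(repo_id=repo_id, local_dir=local_dir)
    print(f"Downloaded {repo_id} to {local_dir}")
```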
## Inference
Our models are built on top of the Qwen2.5-VL family, so we include a simple usage example here and refer readers to [the standard inference procedure of Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL) for further details.
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
"InfiX-ai/InfiMed-SFT-3B", torch_dtype="auto", device_map="auto"
)
min_pixels = 256*28*28
max_pixels = 1280*28*28
processor = AutoProcessor.from_pretrained("InfiX-ai/InfiMed-SFT-3B", min_pixels=min_pixels, max_pixels=max_pixels)
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
},
{"type": "text", "text": "Describe this image."},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to(model.device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=4096)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
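If you downloaded the weights into `./models` as shown above, you can point `from_pretrained` at the local directory instead of the Hub ID. A hedged sketch (FlashAttention 2 is optional and only works if the `flash-attn` package is installed and supported by your GPU):

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

local_path = "./models/InfiMed-RL-3B"  # or ./models/InfiMed-SFT-3B

# Loading from a local directory works the same as loading from the Hub.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    local_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires flash-attn; omit to use the default attention
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(local_path)
```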
## Acknowledgements
Our models are built upon numerous outstanding open-source projects, such as [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory), [EasyR1](https://github.com/hiyouga/EasyR1), and [MedEvalKit](https://github.com/alibaba-damo-academy/MedEvalKit).
We are grateful for their contributions. We extend special thanks to the [Qwen](https://github.com/QwenLM/Qwen2.5-VL) team for their great base models.
## Citation Information
If you find this work useful, please consider citing the following paper:
```bibtex
@article{liu2025infimedlowresourcemedicalmllms,
title = {InfiMed: Low-Resource Medical MLLMs with Advancing Understanding and Reasoning},
author = {Liu, Zeyu and Hou, Zhitian and Zhu, Guanghao and Sang, Zhijie and Xie, Congkai and Yang, Hongxia},
journal = {arXiv preprint arXiv:2505.23867},
year = {2025},
url = {https://arxiv.org/abs/2505.23867}
}
```