Qwen3-VL-Embedding-2B finetuned on Arabic-culture visual document retrieval

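This is a sentence-transformers model fine-tuned from Qwen3-VL-Embedding-2B on the pearl-vdr-ar-train-preprocessed dataset. It maps text and visual inputs (see Supported Modalities below) to a 2048-dimensional dense vector space and can be used for retrieval, for example matching Arabic text queries against document images.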

Model Details

Model Description

Model Type: Sentence Transformer
Maximum Sequence Length: 262144 tokens
Output Dimensionality: 2048 dimensions
Similarity Function: Cosine Similarity
Supported Modalities: Text, Image, Video, Message
Training Dataset:
- pearl-vdr-ar-train-preprocessed
Language: ar
License: apache-2.0

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'transformer_task': 'feature-extraction', 'modality_config': {'text': {'method': 'forward', 'method_output_name': 'last_hidden_state'}, 'image': {'method': 'forward', 'method_output_name': 'last_hidden_state'}, 'video': {'method': 'forward', 'method_output_name': 'last_hidden_state'}, 'message': {'method': 'forward', 'method_output_name': 'last_hidden_state', 'format': 'structured'}}, 'module_output_name': 'token_embeddings', 'processing_kwargs': {'chat_template': {'add_generation_prompt': True}}, 'unpad_inputs': False, 'architecture': 'Qwen3VLModel'})
  (1): Pooling({'embedding_dimension': 2048, 'pooling_mode': 'lasttoken', 'include_prompt': True})
  (2): Normalize({})
)
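
The last two modules above, Pooling with pooling_mode 'lasttoken' followed by Normalize, can be illustrated with a small NumPy sketch. This is not the library's implementation: the helper names are hypothetical, random numbers stand in for the transformer's token embeddings, and right padding is assumed so that the last real token sits at index sum(mask) - 1.

```python
import numpy as np

def last_token_pool(token_embeddings, attention_mask):
    """Pick the hidden state of the last non-padded token of each sequence."""
    # Assumption: right padding, so the last real token is at sum(mask) - 1.
    last_idx = attention_mask.sum(axis=1) - 1
    rows = np.arange(token_embeddings.shape[0])
    return token_embeddings[rows, last_idx]

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 5, 2048))  # (batch, seq_len, hidden), stand-in values
mask = np.array([[1, 1, 1, 0, 0],
                 [1, 1, 1, 1, 1]])

pooled = last_token_pool(tokens, mask)
embeddings = l2_normalize(pooled)

# For unit-length vectors the dot product and the cosine similarity coincide.
a, b = embeddings
dot = float(a @ b)
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(embeddings.shape, round(dot - cosine, 6))
```

Because the embeddings leave the model already normalized, a plain dot product gives the same ranking as the cosine similarity listed under Similarity Function.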

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("Omartificial-Intelligence-Space/Qwen3-VL-Embedding-2B-Arabic-VDR")
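# Run inference: one Arabic text query against three document images
queries = [
    # English: "What is the name of these small white flowers that grow among the rocks?"
    'ما اسم هذه الزهور البيضاء الصغيرة التي تنمو بين الصخور؟',
]
# Each document is an image, given here by URL
documents = [
    'https://i.ibb.co/svZf6D92/image1.jpg',
    'https://i.ibb.co/spFmq82S/image2.jpg',
    'https://i.ibb.co/mF5BDDsB/image3.jpg',
]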
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# (1, 2048) (3, 2048)
# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[ 0.5869, -0.1090,  0.1076]])
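
To turn the similarity scores into a ranked result list, the documents can be sorted by score. The sketch below is only illustrative: it copies the three scores printed above into a NumPy array rather than calling the model, and rank_documents is a hypothetical helper, not part of Sentence Transformers.

```python
import numpy as np

# Scores copied from the example output: one query against three documents.
similarities = np.array([[0.5869, -0.1090, 0.1076]])
documents = [
    "https://i.ibb.co/svZf6D92/image1.jpg",
    "https://i.ibb.co/spFmq82S/image2.jpg",
    "https://i.ibb.co/mF5BDDsB/image3.jpg",
]

def rank_documents(scores, docs, top_k=3):
    """Return (document, score) pairs sorted from most to least similar."""
    order = np.argsort(-scores)[:top_k]
    return [(docs[i], float(scores[i])) for i in order]

ranked = rank_documents(similarities[0], documents)
for url, score in ranked:
    print(f"{score:+.4f}  {url}")
```

With these scores the first image ranks highest, followed by the third and then the second.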

Training Details

Training Dataset

pearl-vdr-ar-train-preprocessed

Dataset: pearl-vdr-ar-train-preprocessed at 494822e
Size: 48,002 training samples
Columns: query, image, and negative_0

Approximate statistics based on the first 1000 samples:

	query	image	negative_0
type	string	image	image
details	min: 31 tokens; mean: 51.45 tokens; max: 90 tokens	min: 53x96 px; mean: 639x540 px; max: 800x798 px	min: 101x100 px; mean: 630x545 px; max: 800x787 px

Samples:

query	image	negative_0
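`ما هي التحديات التي تواجه الحرف التقليدية كما يظهر في الصورة، وما هي الحلول الممكنة لمواجهة هذه التحديات؟` (English: "What challenges do traditional crafts face, as shown in the image, and what solutions are possible to address these challenges?")
`إذا شاركت في ورشة عمل لتعلم كيفية صنع الآلة التي يظهر في الصورة، ما هي الخطوات التي ستحتاج إلى اتباعها لصنعها بشكل صحيح؟` (English: "If you took part in a workshop to learn how to make the instrument shown in the image, what steps would you need to follow to make it correctly?")
`كيف يختلف العزف على الآلة التي يظهر في الصورة عن العزف على الآلات الوترية الأخرى في المنطقة، وما هي الخصائص الفريدة لهذه الآلة؟` (English: "How does playing the instrument shown in the image differ from playing other string instruments in the region, and what are this instrument's unique characteristics?")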

Loss: MatryoshkaLoss with these parameters:

{
    "loss": "CachedMultipleNegativesRankingLoss",
    "matryoshka_dims": [
        2048,
        1536,
        1024,
        512,
        256,
        128,
        64
    ],
    "matryoshka_weights": [
        1,
        1,
        1,
        1,
        1,
        1,
        1
    ],
    "n_dims_per_step": -1
}
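
The idea behind these parameters can be sketched in NumPy: an in-batch ranking loss is computed on the first 2048, 1536, ..., 64 embedding dimensions and the weighted results are summed. This is a simplified illustration, not the Sentence Transformers implementation. It omits the gradient caching that CachedMultipleNegativesRankingLoss adds, the similarity scale of 20.0 is an assumption, the function names are hypothetical, and random vectors stand in for real query and image embeddings.

```python
import numpy as np

MATRYOSHKA_DIMS = [2048, 1536, 1024, 512, 256, 128, 64]
MATRYOSHKA_WEIGHTS = [1, 1, 1, 1, 1, 1, 1]

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def in_batch_ranking_loss(queries, candidates, scale=20.0):
    """Cross-entropy over scaled cosine similarities.

    candidates[i] is the positive for queries[i]; every other row,
    including any appended hard negatives, acts as a negative.
    """
    logits = scale * normalize(queries) @ normalize(candidates).T
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(queries))
    return -log_probs[idx, idx].mean()

def matryoshka_loss(queries, candidates):
    """Weighted sum of the ranking loss over truncated embeddings."""
    total = 0.0
    for dim, weight in zip(MATRYOSHKA_DIMS, MATRYOSHKA_WEIGHTS):
        total += weight * in_batch_ranking_loss(queries[:, :dim], candidates[:, :dim])
    return total

rng = np.random.default_rng(0)
queries = rng.normal(size=(8, 2048))
positives = queries + 0.1 * rng.normal(size=(8, 2048))  # close to their queries
hard_negatives = rng.normal(size=(8, 2048))             # stands in for negative_0

aligned = matryoshka_loss(queries, np.concatenate([positives, hard_negatives]))
shuffled = matryoshka_loss(
    queries, np.concatenate([np.roll(positives, 1, axis=0), hard_negatives])
)
print(aligned < shuffled)
```

Training on every prefix length is what allows embeddings from such a model to be truncated to one of the listed sizes (and re-normalized) at inference time, trading some accuracy for smaller vectors.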

Training Hyperparameters

Non-Default Hyperparameters

per_device_train_batch_size: 64
num_train_epochs: 2
learning_rate: 1e-05
warmup_steps: 0.03
bf16: True
per_device_eval_batch_size: 64
batch_sampler: no_duplicates
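
As a rough worked example, these settings can be combined with the 48,002 training samples to estimate the schedule length. Several things here are assumptions not stated in this card: a single device, that the warmup value of 0.03 is interpreted as a fraction of the total steps, that partial batches are kept, and that rounding is done upward. The no_duplicates batch sampler may also yield a slightly different number of batches.

```python
import math

def training_schedule(num_samples, per_device_batch_size, num_epochs,
                      warmup_fraction, num_devices=1):
    """Estimate optimizer steps, assuming gradient_accumulation_steps = 1."""
    effective_batch = per_device_batch_size * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    # Assumption: a warmup value below 1 is a fraction of the total steps.
    warmup_steps = math.ceil(total_steps * warmup_fraction)
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }

schedule = training_schedule(
    num_samples=48_002,
    per_device_batch_size=64,
    num_epochs=2,
    warmup_fraction=0.03,
)
print(schedule)
```

Under these assumptions the run is about 751 steps per epoch and 1,502 steps in total, with the learning rate warming up over roughly the first 46 steps.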

All Hyperparameters

- `per_device_train_batch_size`: 64
- `num_train_epochs`: 2
- `max_steps`: -1
- `learning_rate`: 1e-05
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: None
- `warmup_steps`: 0.03
- `optim`: adamw_torch_fused
- `optim_args`: None
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `optim_target_modules`: None
- `gradient_accumulation_steps`: 1
- `average_tokens_across_devices`: True
- `max_grad_norm`: 1.0
- `label_smoothing_factor`: 0.0
- `bf16`: True
- `fp16`: False
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `use_liger_kernel`: False
- `liger_kernel_config`: None
- `use_cache`: False
- `neftune_noise_alpha`: None
- `torch_empty_cache_steps`: None
- `auto_find_batch_size`: False
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `include_num_input_tokens_seen`: no
- `log_level`: passive
- `log_level_replica`: warning
- `disable_tqdm`: False
- `project`: huggingface
- `trackio_space_id`: trackio
- `per_device_eval_batch_size`: 64
- `prediction_loss_only`: True
- `eval_on_start`: False
- `eval_do_concat_batches`: True
- `eval_use_gather_object`: False
- `eval_accumulation_steps`: None
- `include_for_metrics`: []
- `batch_eval_metrics`: False
- `save_only_model`: False
- `save_on_each_node`: False
- `enable_jit_checkpoint`: False
- `push_to_hub`: False
- `hub_private_repo`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_always_push`: False
- `hub_revision`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `restore_callback_states_from_checkpoint`: False
- `full_determinism`: False
- `seed`: 42
- `data_seed`: None
- `use_cpu`: False
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `parallelism_config`: None
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `dataloader_prefetch_factor`: None
- `remove_unused_columns`: True
- `label_names`: None
- `train_sampling_strategy`: random
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `ddp_backend`: None
- `ddp_timeout`: 1800
- `fsdp`: []
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `deepspeed`: None
- `debug`: []
- `skip_memory_metrics`: True
- `do_predict`: False
- `resume_from_checkpoint`: None
- `warmup_ratio`: None
- `local_rank`: -1
- `prompts`: None
- `batch_sampler`: no_duplicates
- `multi_dataset_batch_sampler`: proportional
- `router_mapping`: {}
- `learning_rate_mapping`: {}
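
The batch_sampler value no_duplicates matters for in-batch negative losses: if the same query or image appears twice in a batch, one copy would be treated as a negative for the other. The sketch below shows the general idea with a greedy grouping over hashable stand-ins for the samples. It is a hypothetical simplification, not the sampler shipped with Sentence Transformers, and the sample tuples are made up for illustration.

```python
def no_duplicate_batches(samples, batch_size):
    """Greedily group sample indices so no value repeats within a batch."""
    remaining = list(range(len(samples)))
    batches = []
    while remaining:
        batch, seen, leftover = [], set(), []
        for idx in remaining:
            values = set(samples[idx])
            if len(batch) < batch_size and not (values & seen):
                batch.append(idx)
                seen |= values
            else:
                # Either the batch is full or a value would be duplicated.
                leftover.append(idx)
        batches.append(batch)
        remaining = leftover
    return batches

# Made-up (query, image, negative_0) identifiers.
samples = [
    ("q1", "imgA", "imgB"),
    ("q2", "imgA", "imgC"),  # shares imgA with the first sample
    ("q3", "imgD", "imgE"),
    ("q4", "imgB", "imgF"),  # shares imgB with the first sample
]
batches = no_duplicate_batches(samples, batch_size=2)
print(batches)
```

Here the first and third samples form one batch, and the two samples that overlap with the first are deferred to the next batch.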

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

CachedMultipleNegativesRankingLoss

@misc{gao2021scaling,
    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
    year={2021},
    eprint={2101.06983},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

Pearl Dataset

If you use this dataset or the accompanying benchmarks, please cite our paper:

@inproceedings{alwajih-etal-2025-pearl,
    title = "Pearl: A Multimodal Culturally-Aware {A}rabic Instruction Dataset",
    author = "Alwajih, Fakhraddin  and
      Magdy, Samar M.  and
      El Mekki, Abdellah  and
      Nacar, Omer  and
      Nafea, Youssef  and
      Abdelfadil, Safaa Taher  and
      Yahya, Abdulfattah Mohammed  and
      Luqman, Hamzah  and
      Almarwani, Nada  and
      Aloufi, Samah  and
      Qawasmeh, Baraah  and
      Atou, Houdaifa  and
      Sibaee, Serry  and
      Alsayadi, Hamzah A.  and
      Al-Dhabyani, Walid  and
      Al-shaibani, Maged S.  and
      El aatar, Aya  and
      Qandos, Nour  and
      Alhamouri, Rahaf  and
      Ahmad, Samar  and
      AL-Ghrawi, Mohammed Anwar  and
      Yacoub, Aminetou  and
      AbuHweidi, Ruwa  and
      Lemin, Vatimetou Mohamed  and
      Abdel-Salam, Reem  and
      Bashiti, Ahlam  and
      Ammar, Adel  and
      Alansari, Aisha  and
      Ashraf, Ahmed  and
      Alturayeif, Nora  and
      Alcoba Inciarte, Alcides  and
      Elmadany, AbdelRahim A.  and
      Tourad, Mohamedou Cheikh  and
      Berrada, Ismail  and
      Jarrar, Mustafa  and
      Shehata, Shady  and
      Abdul-Mageed, Muhammad",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.1254/",
    pages = "23048--23079",
    ISBN = "979-8-89176-335-7"
}