ModernUzBERT: A State-of-the-Art Semantic Representation Model for Uzbek

ModernUzBERT is a high-performance embedding model specifically architected and optimized for the Uzbek language. Built on the ModernBERT framework, it was pre-trained on an extensive Uzbek corpus and fine-tuned on more than 29,000 domain-specific question-answer pairs to achieve strong accuracy in semantic retrieval and document representation.

📊 Dataset and Corpus Details

The model's linguistic foundation is built upon a massive and diverse Uzbek dataset, ensuring deep semantic understanding.

| Metric | Value / Detail |
|---|---|
| Training samples | 30,000 (CQA pairs) |
| Total word count | 125,261,608 |
| Vocabulary size | 1,348,641 unique words |
| Loss function | MultipleNegativesRankingLoss |
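
The MultipleNegativesRankingLoss objective listed above treats every other answer in a batch as an in-batch negative for a given question, and optimizes cross-entropy over the scaled cosine-similarity matrix. A minimal from-scratch sketch of that computation (toy 2-D vectors stand in for real embeddings; this is illustrative, not the released training code):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mnr_loss(questions, answers, scale=20.0):
    """In-batch-negatives ranking loss: answer i is the positive for
    question i; every answer j != i acts as a negative."""
    losses = []
    for i, q in enumerate(questions):
        scores = [scale * cosine(q, a) for a in answers]
        log_denom = math.log(sum(math.exp(s) for s in scores))
        losses.append(log_denom - scores[i])  # -log softmax of matched pair
    return sum(losses) / len(losses)

# Matched pairs point in nearly the same direction -> loss close to zero.
questions = [[1.0, 0.0], [0.0, 1.0]]
answers = [[0.9, 0.1], [0.1, 0.9]]
print(f"MNRL loss: {mnr_loss(questions, answers):.4f}")  # → MNRL loss: 0.0000
```

In sentence-transformers, the equivalent objective is available as `losses.MultipleNegativesRankingLoss`.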

📈 Evaluation Benchmarks

ModernUzBERT sets a strong new mark for Uzbek semantic retrieval: it outperforms the local UzRoBERTa baseline on every metric and matches or exceeds the multilingual BGE-M3 on most metrics, although BGE-M3 retains a lead at Recall@1.

1. Retrieval Accuracy (Recall & MRR)

| Metric | ModernUzBERT | UzRoBERTa | BGE-M3 |
|---|---|---|---|
| Recall@1 | 0.62 | 0.56 | 0.66 |
| Recall@3 | 0.84 | 0.73 | 0.80 |
| Recall@5 | 0.88 | 0.79 | 0.83 |
| Recall@10 | 0.94 | 0.86 | 0.87 |
| MRR@10 | 0.74 | 0.66 | 0.74 |
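
For reference, Recall@k and MRR@10 can be computed from ranked result lists as follows. The toy rankings below are made up for illustration; they are not the evaluation data behind the table:

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the relevant document appears in the top-k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def mrr_at_k(ranked_ids, relevant_id, k=10):
    """Reciprocal rank of the relevant document within the top-k, else 0.0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

# Toy example: ranked retrieval results for three queries,
# each with one relevant document.
results = [(["d3", "d1", "d7"], "d1"),
           (["d2", "d9", "d4"], "d2"),
           (["d5", "d6", "d8"], "d8")]

r_at_1 = sum(recall_at_k(r, rel, 1) for r, rel in results) / len(results)
mrr = sum(mrr_at_k(r, rel) for r, rel in results) / len(results)
print(f"Recall@1 = {r_at_1:.2f}, MRR@10 = {mrr:.2f}")
# → Recall@1 = 0.33, MRR@10 = 0.61
```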

2. Efficiency Analysis (Inference Speed)

| Model | Avg. Latency (s) | Efficiency Rank |
|---|---|---|
| ModernUzBERT | 0.0051 | 1st (fastest) |
| UzRoBERTa | 0.0059 | 2nd |
| BGE-M3 | 0.0135 | 3rd |
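
A per-sentence latency measurement can be set up along these lines. This is only a sketch: `encode_fn` stands in for `model.encode`, and the exact hardware and benchmarking setup behind the table above is not specified here:

```python
import time

def mean_latency(encode_fn, sentences, warmup=3, runs=20):
    """Average wall-clock seconds per sentence for encode_fn."""
    for _ in range(warmup):          # warm-up runs exclude one-time setup cost
        encode_fn(sentences)
    start = time.perf_counter()
    for _ in range(runs):
        encode_fn(sentences)
    total = time.perf_counter() - start
    return total / (runs * len(sentences))

# Dummy encoder standing in for a SentenceTransformer model.
latency = mean_latency(lambda s: [len(x) for x in s], ["salom", "dunyo"])
print(f"Avg. latency: {latency:.6f} s/sentence")
```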

🛠 Usage

Installation

```bash
pip install -U sentence-transformers
```

```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("Orzumurod/ModernUzBERT")

# Example Uzbek sentences (a legal question and its answer)
sentences = [
    # "What information must be reflected in a legal entity's founding documents?"
    "Yuridik shaxsning taʼsis hujjatlarida qanday maʼlumotlar aks ettirilishi kerak?",
    # "The founder, the legal entity's postal address, and the size of the
    # charter fund must appear in the documents."
    "Taʼsischisi, yuridik shaxsning pochta manzili va ustav fondi miqdori hujjatlarda aks etishi lozim."
]

# Generate embeddings
embeddings = model.encode(sentences)

# Compute similarity between the two embeddings
similarity = model.similarity(embeddings[0], embeddings[1])
print(f"Similarity Score: {similarity.item():.2f}")
```
Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_mean_tokens': True})
)
```
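
In a retrieval setting, the embeddings produced above are ranked by cosine similarity against a query embedding. A model-free sketch of that ranking step (toy 3-d vectors stand in for the model's 768-d outputs, which would come from `model.encode(...)`):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def top_k(query_vec, corpus_vecs, k=2):
    """Return (index, score) pairs for the k most similar corpus vectors."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:k]

# Toy corpus of three "document" vectors and one "query" vector.
corpus = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.2, 0.9]]
query = [1.0, 0.2, 0.0]
results = top_k(query, corpus)
print(results)  # document 0 ranks first, document 1 second
```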
Citation

```bibtex
@misc{modernuzbert2026,
  author = {Orzumurod},
  title = {ModernUzBERT: Advanced Semantic Embeddings for the Uzbek Language},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Orzumurod/ModernUzBERT}}
}
```
Model size: ~0.2B parameters (F32, Safetensors format).