ModernUzBERT: A State-of-the-Art Semantic Representation Model for Uzbek
ModernUzBERT is a high-performance embedding model specifically architected and optimized for the Uzbek language. Leveraging the ModernBERT framework, it has been trained on an extensive Uzbek corpus and fine-tuned on 29,000+ domain-specific pairs to achieve superior accuracy in semantic retrieval and document representation.
📊 Dataset and Corpus Details
The model's linguistic foundation is built upon a massive and diverse Uzbek dataset, ensuring deep semantic understanding.
| Metric | Value / Detail |
|---|---|
| Training Samples | 30,000 (CQA Pairs) |
| Total Word Count | 125,261,608 |
| Vocabulary Size | 1,348,641 unique words |
| Training Loss | MultipleNegativesRankingLoss |
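The table lists MultipleNegativesRankingLoss as the training objective. The model card doesn't spell out the computation, but this loss treats each (question, answer) pair's diagonal entry in a batch similarity matrix as the positive and the other answers in the batch as negatives, then applies a scaled softmax cross-entropy. A minimal pure-Python sketch of that idea (the `scale=20.0` default mirrors the sentence-transformers implementation; `mnr_loss` is an illustrative name, not a library API):

```python
import math

def mnr_loss(sim, scale=20.0):
    """In-batch negatives ranking loss for one batch.

    sim[i][j] is the similarity between anchor i and candidate j;
    the diagonal holds the positive pairs, every other column in the
    same row acts as an in-batch negative.
    """
    losses = []
    for i, row in enumerate(sim):
        logits = [scale * s for s in row]
        m = max(logits)  # subtract max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_sum - logits[i])  # -log softmax of the positive
    return sum(losses) / len(losses)

# Toy 2x2 similarity matrix where each positive outranks its negative,
# so the loss is close to zero
loss = mnr_loss([[0.9, 0.1], [0.2, 0.8]])
```

Swapping the rows' high scores (so negatives outrank positives) makes the loss large, which is exactly the gradient signal that pushes matching pairs together during fine-tuning.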
📈 Evaluation Benchmarks (Scientific Results)
ModernUzBERT establishes a new State-of-the-Art (SOTA) for Uzbek NLP, outperforming the local UzRoBERTa baseline on every metric and matching or exceeding the multilingual BGE-M3 on most retrieval metrics.
1. Retrieval Accuracy (Recall & MRR)
| Metric | ModernUzBERT | UzRoBERTa | BGE-M3 |
|---|---|---|---|
| Recall@1 | 0.62 | 0.56 | 0.66 |
| Recall@3 | 0.84 | 0.73 | 0.80 |
| Recall@5 | 0.88 | 0.79 | 0.83 |
| Recall@10 | 0.94 | 0.86 | 0.87 |
| MRR@10 | 0.74 | 0.66 | 0.74 |
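For readers unfamiliar with these metrics: Recall@k is the fraction of queries whose relevant document appears in the top-k retrieved results, and MRR@k averages the reciprocal rank of the first relevant hit (0 if it falls outside the top k). A self-contained sketch of both computations on toy data (the function names and example IDs are illustrative, not from the evaluation code):

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """1 if the relevant document appears in the top-k results, else 0."""
    return int(relevant_id in ranked_ids[:k])

def mrr_at_k(ranked_ids, relevant_id, k):
    """Reciprocal rank of the relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

# Toy evaluation: one (ranked results, gold document) pair per query
queries = [(["d3", "d1", "d7"], "d1"), (["d2", "d5", "d9"], "d9")]
recall3 = sum(recall_at_k(r, g, 3) for r, g in queries) / len(queries)
mrr3 = sum(mrr_at_k(r, g, 3) for r, g in queries) / len(queries)
# Both gold documents are in the top 3, at ranks 2 and 3 respectively
```

Here `recall3` is 1.0 while `mrr3` is (1/2 + 1/3) / 2, showing why MRR rewards placing the correct answer higher, not merely inside the cutoff.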
2. Efficiency Analysis (Inference Speed)
| Model | Avg. Latency (s) | Efficiency Rank |
|---|---|---|
| ModernUzBERT | 0.0051 | 1st (Fastest) |
| UzRoBERTa | 0.0059 | 2nd |
| BGE-M3 | 0.0135 | 3rd |
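The card doesn't specify the benchmarking protocol behind these latency numbers. A minimal sketch of how average per-call latency might be measured, with warm-up runs excluded and a stand-in for `model.encode` so the snippet runs without downloading a model (`avg_latency` and `dummy_encode` are hypothetical names, not part of any library):

```python
import time

def avg_latency(encode_fn, inputs, warmup=2, runs=10):
    """Average wall-clock latency of encode_fn over `runs` timed calls."""
    for _ in range(warmup):  # warm-up calls are excluded from timing
        encode_fn(inputs)
    start = time.perf_counter()
    for _ in range(runs):
        encode_fn(inputs)
    return (time.perf_counter() - start) / runs

# With a real model this would be: avg_latency(model.encode, sentences)
dummy_encode = lambda xs: [len(x) for x in xs]  # stand-in for model.encode
latency = avg_latency(dummy_encode, ["salom", "dunyo"])
```

Warm-up matters in practice because the first calls often include tokenizer initialization and GPU kernel compilation, which would otherwise inflate the average.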
🛠 Usage
Installation
```bash
pip install -U sentence-transformers
```
```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("Orzumurod/ModernUzBERT")

# Example Uzbek sentences
sentences = [
    # "What information must be reflected in a legal entity's founding documents?"
    "Yuridik shaxsning taʼsis hujjatlarida qanday maʼlumotlar aks ettirilishi kerak?",
    # "The founder, the legal entity's postal address and the size of the charter fund must appear in the documents."
    "Taʼsischisi, yuridik shaxsning pochta manzili va ustav fondi miqdori hujjatlarda aks etishi lozim."
]

# Generate embeddings
embeddings = model.encode(sentences)

# Compute similarity
similarity = model.similarity(embeddings[0], embeddings[1])
print(f"Similarity Score: {similarity.item():.2f}")
```
Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_mean_tokens': True})
)
```
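The similarity score in the usage example comes from `model.similarity`, which for SentenceTransformer models defaults to cosine similarity over the pooled embeddings. A minimal pure-Python equivalent, for illustration only (the real implementation is vectorized):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors:
    dot product divided by the product of the vectors' norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

identical = cosine_similarity([1.0, 0.0], [1.0, 0.0])    # 1.0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])   # 0.0
```

Scores near 1.0 indicate semantically similar sentences, near 0.0 unrelated ones, which is what the question/answer pair above is meant to demonstrate.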
Citation

```bibtex
@misc{modernuzbert2026,
  author = {Orzumurod},
  title = {ModernUzBERT: Advanced Semantic Embeddings for the Uzbek Language},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Orzumurod/ModernUzBERT}}
}
```