BEAST: Byte-wise Embedding Architecture for Semantic Transfer
BEAST (Byte-wise Embedding Architecture for Semantic Transfer) is a lightweight, tokenizer-free static embedding model that operates directly on byte-level input.
It produces 256-dimensional word embeddings distilled from GPT-3 semantic representations, without relying on tokenization, subword vocabularies, or linguistic preprocessing.
The model is designed as a semantic encoder, not a language model, and is optimized for word-level similarity, clustering, and low-latency embedding extraction, particularly in Spanish.
Model description
- Input: raw text encoded as byte sequences (values 0–255)
- Output: single 256-dimensional L2-normalized embedding per word
- Architecture: Transformer encoder over byte embeddings
- Tokenizer: none (byte-level encoding)
- Context: static (context-independent embeddings)
BEAST is neither a generative model nor a contextual encoder. Each input word deterministically maps to a single vector representation.
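As a minimal sketch of what "raw text encoded as byte sequences" can look like, the snippet below turns a word into byte values in the range 0–255. It assumes UTF-8 encoding, which this card does not state explicitly, and `word_to_byte_ids` is a hypothetical helper name, not part of the released model code.

```python
def word_to_byte_ids(word: str) -> list[int]:
    """Encode a word as byte values in the range 0-255 (UTF-8 is an assumption)."""
    return list(word.encode("utf-8"))


ascii_ids = word_to_byte_ids("casa")
accented_ids = word_to_byte_ids("niño")  # "ñ" occupies two bytes in UTF-8

print(ascii_ids)     # [99, 97, 115, 97]
print(accented_ids)  # [110, 105, 195, 177, 111]
```

Because the mapping from characters to bytes is fixed, the same word always yields the same input sequence, which is consistent with the deterministic, context-independent behavior described above.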
Architecture
BEAST operates on fixed-length byte sequences (typically 24 bytes per word):
- Byte-level embedding layer (256 possible symbols)
- Sinusoidal positional encoding
- Stack of Transformer encoder layers (multi-head self-attention)
- Flattening + projection
- Tanh activation
- L2 normalization onto the unit hypersphere
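The first and last stages of this pipeline are simple enough to sketch in plain Python. The example below is illustrative only and is not the released implementation: it pads a word to 24 bytes, builds a standard sinusoidal positional table, and applies tanh followed by L2 normalization. The padding value `PAD_ID = 0`, the UTF-8 encoding, and all function names are assumptions. The Transformer layers and the projection are omitted.

```python
import math

SEQ_LEN = 24   # bytes per word, the typical value given above
D_MODEL = 256  # matches the 256-dimensional embeddings this card reports
PAD_ID = 0     # assumption: the padding value is not given in this card


def pad_bytes(word: str, seq_len: int = SEQ_LEN, pad_id: int = PAD_ID) -> list[int]:
    """Truncate or right-pad the UTF-8 bytes of a word to a fixed length."""
    ids = list(word.encode("utf-8"))[:seq_len]
    return ids + [pad_id] * (seq_len - len(ids))


def sinusoidal_encoding(seq_len: int, d_model: int) -> list[list[float]]:
    """Standard sinusoidal table: sine on even dimensions, cosine on odd ones."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def tanh_l2_head(projected: list[float]) -> list[float]:
    """Apply tanh element-wise, then rescale the vector to unit L2 norm."""
    squashed = [math.tanh(x) for x in projected]
    norm = math.sqrt(sum(x * x for x in squashed))
    return [x / norm for x in squashed] if norm > 0 else squashed


ids = pad_bytes("niño")
positions = sinusoidal_encoding(SEQ_LEN, D_MODEL)
unit_vector = tanh_l2_head([0.5, -1.0, 2.0])  # toy 3-dimensional input
print(len(ids), len(positions), len(positions[0]))
print(sum(x * x for x in unit_vector))  # close to 1.0
```

Placing the L2 normalization last means every output lies on the unit hypersphere, so cosine similarity between two embeddings reduces to a dot product.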
Typical configuration:
- Embedding dimension: 256
- Encoder layers: 16
- Attention heads: 16
- Parameters: ~14.3M (BEAST-Base)
- Model size: ~51 MB
An alternative lightweight variant (BEAST-Mini) with ~7.9M parameters is also explored in the paper.
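As a rough sanity check on the parameter count, the sketch below tallies the weights of a standard Transformer encoder with the configuration listed above. Several details are assumptions that this card does not state: a feed-forward width of 4× the embedding dimension, bias terms on every linear layer, two LayerNorms per encoder layer, and a projection from the flattened 24 × 256 sequence to 256 outputs. The function names are hypothetical.

```python
def encoder_layer_params(d_model: int, d_ff: int) -> int:
    """Parameter count of one standard Transformer encoder layer."""
    attention = 4 * (d_model * d_model + d_model)  # Q, K, V and output projections
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)                # two LayerNorms, scale and shift
    return attention + feed_forward + layer_norms


def beast_param_estimate(d_model=256, n_layers=16, seq_len=24, vocab=256, d_ff=None):
    """Estimate total parameters under the assumptions named in the lead-in."""
    d_ff = 4 * d_model if d_ff is None else d_ff   # assumed feed-forward width
    embedding = vocab * d_model                    # one vector per byte value
    encoder = n_layers * encoder_layer_params(d_model, d_ff)
    projection = (seq_len * d_model) * d_model + d_model  # flatten, then project
    return embedding + encoder + projection


total = beast_param_estimate()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")  # 14,274,816 parameters (~14.3M)
```

The estimate lands close to the reported ~14.3M. That agreement is suggestive, but it does not confirm that the released model uses exactly this layout. The number of attention heads does not affect the count, since the heads split the same projection matrices.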
Training
BEAST is trained using knowledge distillation from GPT-3 embeddings.
- Training data: 1,250,794 isolated Spanish words (lowercased)
- Teacher model: GPT-3 embedding representations (word-level, no context)
- Objective: minimize cosine distance between BEAST embeddings and teacher embeddings
Training setup:
- Optimizer: AdamW
- Initial learning rate: 2e-5
- Weight decay: 1e-6
- Scheduler: Cosine Annealing
- Epochs: 500
- Batch size: 64
- Hardware: single NVIDIA TITAN V GPU
- Training time: ~4 days
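To make the scheduler setting concrete, here is a minimal sketch of the cosine annealing formula using the initial learning rate and epoch count listed above. It assumes a single annealing period spanning all 500 epochs, a minimum learning rate of 0, and one scheduler step per epoch. None of these details are stated in this card, and `cosine_annealing_lr` is a hypothetical helper.

```python
import math


def cosine_annealing_lr(epoch, total_epochs=500, base_lr=2e-5, min_lr=0.0):
    """Learning rate at a given epoch under single-period cosine annealing."""
    cosine = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (base_lr - min_lr) * cosine


# Print the schedule at a few checkpoints: it starts at base_lr,
# passes through half of it at the midpoint, and decays to min_lr at the end.
for epoch in (0, 125, 250, 375, 500):
    print(epoch, f"{cosine_annealing_lr(epoch):.2e}")
```

This is the same closed form that schedulers such as PyTorch's `CosineAnnealingLR` implement, with `T_max` as the period and `eta_min` as the floor.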
BEAST cannot be meaningfully trained without distillation, as it is architecturally designed as a student model.
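The distillation objective can be illustrated with a small pure-Python sketch of a cosine-distance loss averaged over a batch. It assumes the teacher vectors already have the same dimensionality as the student outputs, because this card does not say how the two are reconciled. The function names are hypothetical and the vectors are toy values.

```python
import math


def cosine_distance(student, teacher):
    """Return 1 - cosine similarity between two vectors of equal length."""
    dot = sum(s * t for s, t in zip(student, teacher))
    norm_s = math.sqrt(sum(s * s for s in student))
    norm_t = math.sqrt(sum(t * t for t in teacher))
    return 1.0 - dot / (norm_s * norm_t)


def batch_loss(student_batch, teacher_batch):
    """Mean cosine distance over paired student and teacher vectors."""
    distances = [cosine_distance(s, t) for s, t in zip(student_batch, teacher_batch)]
    return sum(distances) / len(distances)


# Toy 2-dimensional vectors: aligned, orthogonal, and opposite pairs.
student = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
teacher = [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]
print(batch_loss(student, teacher))
```

The loss depends only on direction, not magnitude, so the student is free to emit unit-norm vectors while the teacher embeddings keep whatever scale they have.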
Evaluation
Semantic similarity
BEAST is evaluated on a Spanish translation of SimLex-999, using the Pearson and Spearman correlations between the model's cosine similarities and human similarity judgments.
| Model | Pearson | Spearman | Size |
|---|---|---|---|
| BEAST-Base | 0.383 | 0.394 | 51 MB |
| fastText | 0.347 | 0.337 | 3.3 GB |
| Word2Vec | 0.320 | 0.318 | 2.8 GB |
| GloVe | 0.288 | 0.284 | 2.4 GB |
| BERT | 0.297 | 0.301 | 439 MB |
| XLM-RoBERTa | 0.285 | 0.273 | 1.1 GB |
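On this benchmark, BEAST-Base scores higher than every baseline in the table, both the classic static embeddings (fastText, Word2Vec, GloVe) and the contextual models (BERT, XLM-RoBERTa), while being substantially smaller than each of them.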
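The evaluation protocol above compares model similarity scores with human ratings through two correlation coefficients. The following sketch shows one way to compute both in plain Python, with Spearman implemented as Pearson over average ranks. The scores in the example are invented for illustration and are not taken from the benchmark. All function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example: cosine similarities for four word pairs vs. human ratings.
model_scores = [0.90, 0.70, 0.40, 0.10]
human_scores = [9.0, 6.5, 7.0, 1.0]
print(pearson(model_scores, human_scores), spearman(model_scores, human_scores))
```

Pearson rewards a linear relationship between the two score lists, while Spearman only checks whether the model orders the word pairs the way annotators do.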
Clustering benchmarks
Unsupervised clustering on:
- Non-polysemous dataset (200 words, 8 categories)
- Polysemous dataset (50 words, 5 categories)
Metrics:
- Adjusted Rand Index (ARI)
- Normalized Mutual Information (NMI)
- V-Measure
BEAST shows competitive clustering performance while keeping inference time and memory footprint significantly lower than those of models backed by large embedding tables.
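For readers unfamiliar with these metrics, the sketch below computes the Adjusted Rand Index from two label lists using only the standard library. It is a generic illustration of the metric, not the evaluation script used for these benchmarks, and the labels are toy values. In practice, scikit-learn's `adjusted_rand_score`, `normalized_mutual_info_score`, and `v_measure_score` cover all three metrics.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, predicted_labels):
    """Adjusted Rand Index between two labelings of the same items (needs >= 2 items)."""
    n = len(true_labels)
    cell_counts = Counter(zip(true_labels, predicted_labels))
    true_counts = Counter(true_labels)
    pred_counts = Counter(predicted_labels)

    # Count pairs of items that fall together in each cell, row, and column.
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    maximum = 0.5 * (sum_true + sum_pred)
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (sum_cells - expected) / (maximum - expected)


# Cluster ids are arbitrary, so a relabeled but identical partition scores 1.0.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # -0.5
```

The chance correction is what makes the score comparable across datasets with different numbers of categories, such as the 8-category and 5-category sets above.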
Intended use
Recommended uses:
- Word-level semantic similarity
- Lexical clustering
- Keyword matching
- Embedding-based retrieval
- Low-latency or low-resource NLP pipelines
- As a lexical embedding layer inside larger models
Not recommended:
- Sentence-level semantics
- Context-sensitive tasks
- Text generation
- Classification without additional modeling
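As an illustration of the similarity and retrieval uses listed above, the sketch below ranks a small vocabulary against a query vector. Because BEAST outputs are L2-normalized, cosine similarity reduces to a dot product. The 2-dimensional vectors are invented stand-ins for real 256-dimensional embeddings, and `top_k` is a hypothetical helper, not part of the model's API.

```python
def dot(u, v):
    """Dot product; equals cosine similarity when both vectors have unit norm."""
    return sum(a * b for a, b in zip(u, v))


def top_k(query_vec, vocabulary, k=3):
    """Return the k (word, score) pairs most similar to the query vector."""
    scored = [(word, dot(query_vec, vec)) for word, vec in vocabulary.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Invented unit-norm vectors standing in for precomputed word embeddings.
vocabulary = {
    "perro": [1.0, 0.0],
    "gato": [0.8, 0.6],
    "coche": [0.0, 1.0],
}
query = [0.96, 0.28]
for word, score in top_k(query, vocabulary, k=2):
    print(word, round(score, 3))
```

Since the embeddings are static, the vocabulary vectors can be computed once and cached, which is what makes this pattern attractive for low-latency pipelines.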
Limitations
- Static embeddings (no context awareness)
- Trained exclusively on Spanish
- Distillation inherits biases and inconsistencies from GPT-3
- Limited handling of polysemy
- Not suitable as a standalone classifier
Ethical considerations
BEAST inherits semantic properties from GPT-3 embeddings and may reflect biases present in the teacher model. No explicit bias mitigation or filtering was applied during training.
Repository structure
- Model files are located in the root of this repository
- Research code, experiments, and datasets are stored in research_files/
- Only root-level files are required for inference via from_pretrained
Usage
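from transformers import AutoModel, AutoTokenizer
# The model relies on custom code hosted in the repository, so loading it needs trust_remote_code=True
model = AutoModel.from_pretrained("Alverciito/beast_cased", trust_remote_code=True)
# Passing the flag to the tokenizer as well is an assumption; it is harmless if not needed
tokenizer = AutoTokenizer.from_pretrained("Alverciito/beast_cased", trust_remote_code=True)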
