BEAST: Byte-wise Embedding Architecture for Semantic Transfer

BEAST (Byte-wise Embedding Architecture for Semantic Transfer) is a lightweight, tokenizer-free static embedding model that operates directly on byte-level input.
It produces 256-dimensional word embeddings distilled from GPT-3 semantic representations, without relying on tokenization, subword vocabularies, or linguistic preprocessing.

The model is designed as a semantic encoder, not a language model, and is optimized for word-level similarity, clustering, and low-latency embedding extraction, particularly in Spanish.

Model description

Input: raw text encoded as byte sequences (values 0–255)
Output: single 256-dimensional L2-normalized embedding per word
Architecture: Transformer encoder over byte embeddings
Tokenizer: none (byte-level encoding)
Context: static (context-independent embeddings)

BEAST is neither a generative model nor a contextual encoder. Each input word deterministically maps to a single vector representation.
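
As a rough illustration of the byte-level input described above, the sketch below turns a word into a fixed-length sequence of byte values. The UTF-8 encoding, the zero padding value, the truncation of long inputs, and the helper name `encode_word` are assumptions made for illustration; only the 0–255 value range and the typical length of 24 bytes per word come from this model card.

```python
def encode_word(word: str, max_len: int = 24, pad: int = 0) -> list[int]:
    """Map a word to a fixed-length list of byte values in 0..255.

    Assumptions (not confirmed by the model card): UTF-8 encoding,
    right-padding with `pad`, and truncation of longer inputs.
    """
    raw = word.encode("utf-8")[:max_len]
    return list(raw) + [pad] * (max_len - len(raw))


ids = encode_word("niño")
print(len(ids), ids[:6])  # 24 [110, 105, 195, 177, 111, 0]
```

Note that "ñ" occupies two bytes in UTF-8, so a word's byte length can exceed its character count.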

Architecture

BEAST operates on fixed-length byte sequences (typically 24 bytes per word), which pass through the following stages in order:

Byte-level embedding layer (256 possible symbols)
Sinusoidal positional encoding
Stack of Transformer encoder layers (multi-head self-attention)
Flattening + projection
Tanh activation
L2 normalization onto the unit hypersphere
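
The positional encoding and the last two stages are simple enough to sketch in plain Python. The block below uses the standard sinusoidal formulation from the original Transformer paper (an assumption, since the exact variant is not spelled out here) and shows how Tanh followed by L2 normalization yields a unit-length vector. The function names are hypothetical, and the learned layers (byte embedding, encoder stack, projection) are omitted.

```python
import math


def sinusoidal_encoding(seq_len: int, dim: int) -> list[list[float]]:
    """Standard sinusoidal encoding: sin on even indices, cos on odd indices."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def tanh_l2_normalize(vec: list[float], eps: float = 1e-12) -> list[float]:
    """Apply tanh element-wise, then scale the result to unit L2 norm."""
    squashed = [math.tanh(x) for x in vec]
    norm = math.sqrt(sum(x * x for x in squashed))
    return [x / max(norm, eps) for x in squashed]


pe = sinusoidal_encoding(seq_len=24, dim=256)  # one row per byte position
out = tanh_l2_normalize([0.5, -1.2, 3.0, 0.0])  # toy 4-dimensional projection
print(len(pe), len(pe[0]))
print(round(sum(x * x for x in out), 6))  # squared norm of the output
```

Because every output lies on the unit hypersphere, the cosine similarity between two embeddings reduces to their dot product.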

Typical configuration:

Embedding dimension: 256
Encoder layers: 16
Attention heads: 16
Parameters: ~14.3M (BEAST-Base)
Model size: ~51 MB

An alternative lightweight variant (BEAST-Mini) with ~7.9M parameters is also explored in the paper.

Training

BEAST is trained using knowledge distillation from GPT-3 embeddings.

Training data: 1,250,794 isolated Spanish words (lowercased)
Teacher model: GPT-3 embedding representations (word-level, no context)
Objective: minimize cosine distance between BEAST embeddings and teacher embeddings
Training setup:
- Optimizer: AdamW
- Initial learning rate: 2e-5
- Weight decay: 1e-6
- Scheduler: Cosine Annealing
- Epochs: 500
- Batch size: 64
- Hardware: single NVIDIA TITAN V GPU
- Training time: ~4 days
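
To make the scheduler setting concrete, the sketch below evaluates the usual cosine annealing formula, lr(t) = lr_min + 0.5 * (lr_init - lr_min) * (1 + cos(pi * t / T)), starting from the listed initial learning rate of 2e-5. The minimum learning rate of 0, stepping once per epoch over T = 500 epochs, and the absence of warm restarts are assumptions; none of them is stated in this model card.

```python
import math


def cosine_annealing_lr(step: int, total_steps: int,
                        lr_init: float = 2e-5, lr_min: float = 0.0) -> float:
    """Learning rate at `step` under cosine annealing without restarts."""
    cos_term = 1 + math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_init - lr_min) * cos_term


# Print the (assumed) schedule at a few points across the 500 epochs.
for epoch in (0, 125, 250, 375, 500):
    print(epoch, f"{cosine_annealing_lr(epoch, 500):.2e}")
```

Under these assumptions the rate starts at 2e-5, passes through 1e-5 at the midpoint, and decays to 0 at the final epoch.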

BEAST cannot be meaningfully trained without distillation, as it is architecturally designed as a student model.
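
The distillation objective can be illustrated with a small pure-Python sketch: the cosine distance between a student vector and a teacher vector, averaged over a batch. Treating cosine distance as 1 minus cosine similarity and averaging over the batch are assumptions, as is the premise that the teacher vectors already share the student's dimensionality (how the two are matched is not described here). The function names are illustrative.

```python
import math


def cosine_distance(a: list[float], b: list[float], eps: float = 1e-12) -> float:
    """Return 1 - cos(a, b): 0 for the same direction, 2 for opposite directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / max(norm_a * norm_b, eps)


def distillation_loss(student: list[list[float]], teacher: list[list[float]]) -> float:
    """Mean cosine distance over a batch of (student, teacher) vector pairs."""
    distances = [cosine_distance(s, t) for s, t in zip(student, teacher)]
    return sum(distances) / len(distances)


# Toy 2-dimensional batch: the first pair is aligned, the second is orthogonal.
student_batch = [[1.0, 0.0], [0.0, 1.0]]
teacher_batch = [[2.0, 0.0], [1.0, 0.0]]
loss = distillation_loss(student_batch, teacher_batch)
print(loss)  # 0.5
```

The loss depends only on direction, not magnitude, which is consistent with the student producing L2-normalized outputs.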

Evaluation

Semantic similarity

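BEAST is evaluated on a Spanish translation of SimLex-999. For each word pair, the cosine similarity between the two embeddings is compared with the human similarity judgment, and agreement is reported as Pearson and Spearman correlation.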

Model	Pearson	Spearman	Size
BEAST-Base	0.383	0.394	51 MB
fastText	0.347	0.337	3.3 GB
Word2Vec	0.320	0.318	2.8 GB
GloVe	0.288	0.284	2.4 GB
BERT	0.297	0.301	439 MB
XLM-RoBERTa	0.285	0.273	1.1 GB

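On this benchmark, BEAST-Base achieves the highest Pearson and Spearman correlations of the models listed, ahead of the classic static embeddings (fastText, Word2Vec, GloVe) and the contextual models BERT and XLM-RoBERTa, at a much smaller model size.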
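
For readers who want to reproduce this kind of evaluation, the sketch below computes both correlations in plain Python. It is a generic implementation rather than the evaluation script used for the table; the function names are hypothetical and the numbers in the example are made up for illustration.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks, with tied values receiving the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up values for five word pairs; not taken from the benchmark.
model_cosine = [0.82, 0.10, 0.55, 0.31, 0.47]
human_scores = [9.1, 1.3, 6.0, 2.5, 6.8]
print(round(pearson(model_cosine, human_scores), 3))
print(round(spearman(model_cosine, human_scores), 3))
```

Spearman depends only on the ordering of the scores, so it is insensitive to any monotonic rescaling of the cosine values, whereas Pearson measures linear agreement.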

Clustering benchmarks

Unsupervised clustering is evaluated on two datasets:

Non-polysemous dataset (200 words, 8 categories)
Polysemous dataset (50 words, 5 categories)

Metrics:

Adjusted Rand Index (ARI)
Normalized Mutual Information (NMI)
V-Measure

BEAST shows competitive clustering performance while maintaining significantly lower inference time and memory footprint compared to large embedding tables.
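
As an illustration of the first metric, the sketch below computes the Adjusted Rand Index from two label assignments using the standard pair-counting formula. It is a generic stdlib implementation, not the evaluation code behind the reported results; the function name and the toy labels are assumptions, and at least two items are required.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true: list, labels_pred: list) -> float:
    """Adjusted Rand Index between two flat clusterings of the same items."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:  # degenerate case, e.g. one cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy example: six words in three gold categories, one word misassigned.
gold = ["animal", "animal", "food", "food", "tool", "tool"]
predicted = [0, 0, 1, 2, 2, 2]
ari = adjusted_rand_index(gold, predicted)
print(round(ari, 3))
```

ARI is 1.0 for a perfect match up to label renaming and close to 0 for random assignments, which makes it comparable across datasets with different numbers of categories.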

Intended use

Recommended uses:

Word-level semantic similarity
Lexical clustering
Keyword matching
Embedding-based retrieval
Low-latency or low-resource NLP pipelines
As a lexical embedding layer inside larger models
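
Because BEAST outputs are L2-normalized, similarity search for keyword matching or retrieval reduces to ranking by dot product. The sketch below shows that idea with toy 2-dimensional unit vectors standing in for real 256-dimensional embeddings; the `top_k` helper, the vectors, and the word list are all made up for illustration.

```python
def top_k(query: list[float], index: dict[str, list[float]],
          k: int = 2) -> list[tuple[str, float]]:
    """Rank indexed words by dot product with the query (cosine for unit vectors)."""
    scored = [
        (word, sum(q * v for q, v in zip(query, vec)))
        for word, vec in index.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy unit vectors; real entries would be BEAST embeddings of each word.
index = {
    "perro": [1.0, 0.0],
    "gato": [0.8, 0.6],
    "mesa": [0.0, 1.0],
}
results = top_k([0.96, 0.28], index)
print([word for word, _ in results])  # ['perro', 'gato']
```

Since the embeddings are static, the index can be computed once per vocabulary and reused across queries.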

Not recommended:

Sentence-level semantics
Context-sensitive tasks
Text generation
Classification without additional modeling

Limitations

Static embeddings (no context awareness)
Trained exclusively on Spanish
Distillation inherits biases and inconsistencies from GPT-3
Limited handling of polysemy
Not suitable as a standalone classifier

Ethical considerations

BEAST inherits semantic properties from GPT-3 embeddings and may reflect biases present in the teacher model. No explicit bias mitigation or filtering was applied during training.

Repository structure

Model files are located in the root of this repository
Research code, experiments, and datasets are stored in research_files/
Only root-level files are required for inference via from_pretrained

Usage

from transformers import AutoModel, AutoTokenizer

# Download the model weights and the accompanying input encoder from the Hugging Face Hub.
model = AutoModel.from_pretrained("Alverciito/beast_cased")
tokenizer = AutoTokenizer.from_pretrained("Alverciito/beast_cased")
