BEAST: Byte-wise Embedding Architecture for Semantic Transfer

BEAST (Byte-wise Embedding Architecture for Semantic Transfer) is a lightweight, tokenizer-free static embedding model that operates directly on byte-level input.
It produces 256-dimensional word embeddings distilled from GPT-3 semantic representations, without relying on tokenization, subword vocabularies, or linguistic preprocessing.

The model is designed as a semantic encoder, not a language model, and is optimized for word-level similarity, clustering, and low-latency embedding extraction, particularly in Spanish.


Model description

  • Input: raw text encoded as byte sequences (values 0–255)
  • Output: single 256-dimensional L2-normalized embedding per word
  • Architecture: Transformer encoder over byte embeddings
  • Tokenizer: none (byte-level encoding)
  • Context: static (context-independent embeddings)

BEAST is neither a generative model nor a contextual encoder. Each input word deterministically maps to a single vector representation.
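As an illustration of the input format described above, the following is a minimal sketch of how a word could be turned into a fixed-length byte sequence. The function name `encode_word`, the padding value of 0, and the truncation policy are assumptions made for this example; the model card does not specify how the released model pads or truncates.

```python
def encode_word(word: str, max_len: int = 24, pad_value: int = 0) -> list[int]:
    """Encode a word as a fixed-length list of byte values in the range 0 to 255.

    Assumptions (not confirmed by the model card): UTF-8 encoding,
    right-padding with `pad_value`, and truncation at `max_len` bytes.
    """
    raw = word.encode("utf-8")[:max_len]  # truncation works on bytes, not characters
    return list(raw) + [pad_value] * (max_len - len(raw))


ids = encode_word("niño")
print(len(ids))   # 24
print(ids[:5])    # [110, 105, 195, 177, 111]  ("ñ" takes two bytes in UTF-8)
```

Because the encoding works on bytes rather than characters, accented Spanish letters occupy two positions each, which is why the byte budget per word is larger than the typical character count.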


Architecture

BEAST operates on fixed-length byte sequences (typically 24 bytes per word):

  1. Byte-level embedding layer (256 possible symbols)
  2. Sinusoidal positional encoding
  3. Stack of Transformer encoder layers (multi-head self-attention)
  4. Flattening + projection
  5. Tanh activation
  6. L2 normalization onto the unit hypersphere
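Steps 2, 5, and 6 of the pipeline can be sketched in plain Python. This assumes the standard sinusoidal formulation from the original Transformer paper and a per-position dimension of 256; the model card only says "sinusoidal positional encoding", so the exact constants are an assumption. The helper names `sinusoidal_positions` and `tanh_l2_head` are hypothetical.

```python
import math


def sinusoidal_positions(seq_len: int = 24, dim: int = 256) -> list[list[float]]:
    """Standard sinusoidal table: sin on even indices, cos on odd indices."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        table.append(row[:dim])  # trims the extra value when dim is odd
    return table


def tanh_l2_head(projected: list[float]) -> list[float]:
    """Apply tanh, then rescale the vector to unit L2 norm (steps 5 and 6)."""
    squashed = [math.tanh(x) for x in projected]
    norm = math.sqrt(sum(x * x for x in squashed)) or 1.0  # guard against a zero vector
    return [x / norm for x in squashed]


pe = sinusoidal_positions()
unit = tanh_l2_head([0.5, -1.0, 2.0])  # toy 3-dimensional projection output
print(len(pe), len(pe[0]))
print(math.sqrt(sum(x * x for x in unit)))
```

Since every output vector has unit norm, cosine similarity between two embeddings reduces to a plain dot product.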

Typical configuration:

  • Embedding dimension: 256
  • Encoder layers: 16
  • Attention heads: 16
  • Parameters: ~14.3M (BEAST-Base)
  • Model size: ~51 MB

An alternative lightweight variant (BEAST-Mini) with ~7.9M parameters is also explored in the paper.


Training

BEAST is trained using knowledge distillation from GPT-3 embeddings.

  • Training data:
    1,250,794 isolated Spanish words (lowercased)

  • Teacher model:
    GPT-3 embedding representations (word-level, no context)

  • Objective:
    Minimize cosine distance between BEAST embeddings and teacher embeddings

  • Training setup:

    • Optimizer: AdamW
    • Initial learning rate: 2e-5
    • Weight decay: 1e-6
    • Scheduler: Cosine Annealing
    • Epochs: 500
    • Batch size: 64
    • Hardware: single NVIDIA TITAN V GPU
    • Training time: ~4 days
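For reference, the learning-rate curve implied by a cosine annealing schedule can be computed in closed form. The sketch below assumes the schedule spans all 500 epochs, is stepped once per epoch, and decays to a minimum of 0; none of these details are stated in the training setup, and `cosine_annealing_lr` is a hypothetical helper name.

```python
import math


def cosine_annealing_lr(epoch: int, total_epochs: int = 500,
                        lr_max: float = 2e-5, lr_min: float = 0.0) -> float:
    """Closed-form cosine annealing without restarts.

    lr_min and the per-epoch stepping are assumptions for this sketch.
    """
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)


for epoch in (0, 125, 250, 375, 500):
    print(epoch, cosine_annealing_lr(epoch))
```

The rate starts at the initial value of 2e-5, passes through half of it at the midpoint, and approaches the minimum at the final epoch.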

BEAST cannot be meaningfully trained without distillation, as it is architecturally designed as a student model.
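The distillation objective can be illustrated with a small pure-Python sketch of cosine distance between student and teacher vectors. It assumes both vectors already have the same dimensionality; how the teacher embeddings are matched to 256 dimensions is not described here. The function names are hypothetical.

```python
import math


def cosine_distance(student: list[float], teacher: list[float]) -> float:
    """1 - cosine similarity; 0 means same direction, 2 means opposite."""
    dot = sum(s * t for s, t in zip(student, teacher))
    norm_s = math.sqrt(sum(s * s for s in student))
    norm_t = math.sqrt(sum(t * t for t in teacher))
    return 1.0 - dot / (norm_s * norm_t)


def batch_loss(students: list[list[float]], teachers: list[list[float]]) -> float:
    """Mean cosine distance over a batch (mean reduction is an assumption)."""
    distances = [cosine_distance(s, t) for s, t in zip(students, teachers)]
    return sum(distances) / len(distances)


# Toy 2-dimensional vectors, not real embeddings.
print(batch_loss([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [1.0, 0.0]]))  # 0.5
```

The loss depends only on direction, not magnitude, which fits an encoder whose outputs are normalized onto the unit hypersphere.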


Evaluation

Semantic similarity

BEAST is evaluated on a Spanish translation of SimLex-999. The reported scores are the Pearson and Spearman correlations between the cosine similarity of each word pair's embeddings and the corresponding human similarity judgment.

Model          Pearson   Spearman   Size
BEAST-Base     0.383     0.394      51 MB
fastText       0.347     0.337      3.3 GB
Word2Vec       0.320     0.318      2.8 GB
GloVe          0.288     0.284      2.4 GB
BERT           0.297     0.301      439 MB
XLM-RoBERTa    0.285     0.273      1.1 GB

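On this benchmark, BEAST-Base scores higher on both correlations than every baseline listed above, covering the classic static embeddings (fastText, Word2Vec, GloVe) and the contextual models (BERT, XLM-RoBERTa), while being considerably smaller.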
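The two correlation measures used in this evaluation can be sketched without external libraries. Spearman correlation is computed here as Pearson correlation over average ranks, which handles ties. The scores in the example are made-up toy values, not benchmark data, and the function names are hypothetical.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman correlation as Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values: cosine similarities from a model vs. human ratings for 4 word pairs.
model_sims = [0.9, 0.2, 0.5, 0.7]
human = [9.0, 1.5, 4.0, 8.0]
print(pearson(model_sims, human), spearman(model_sims, human))
```

Spearman only compares orderings, so it is less sensitive than Pearson to the different scales of cosine similarity and human ratings.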


Clustering benchmarks

Unsupervised clustering on:

  • Non-polysemous dataset (200 words, 8 categories)
  • Polysemous dataset (50 words, 5 categories)

Metrics:

  • Adjusted Rand Index (ARI)
  • Normalized Mutual Information (NMI)
  • V-Measure

BEAST shows competitive clustering performance while maintaining significantly lower inference time and memory footprint compared to large embedding tables.
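As a reference for the first metric, the Adjusted Rand Index can be computed from pair counts alone. The sketch below is a generic implementation of the standard formula, not the evaluation code used for these benchmarks, and the labels in the example are toy values.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true: list, labels_pred: list) -> float:
    """Adjusted Rand Index from the contingency counts of two labelings."""
    n = len(labels_true)
    if n < 2:
        raise ValueError("ARI needs at least two samples")
    # Pairs of items that fall together in both, in the reference, and in the prediction.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:  # degenerate case, e.g. one cluster in both labelings
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Cluster ids are arbitrary, so a relabeled but identical partition scores 1.0.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # -0.5
```

A score near 0 indicates agreement no better than chance, and negative values indicate agreement worse than chance.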


Intended use

Recommended uses:

  • Word-level semantic similarity
  • Lexical clustering
  • Keyword matching
  • Embedding-based retrieval
  • Low-latency or low-resource NLP pipelines
  • As a lexical embedding layer inside larger models
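For keyword matching and retrieval, unit-norm embeddings allow ranking by dot product alone. The following is a minimal sketch with made-up 2-dimensional vectors standing in for real 256-dimensional embeddings; `top_k` and the toy vocabulary are hypothetical.

```python
def top_k(query: list[float], vocab: dict[str, list[float]], k: int = 3) -> list[tuple[str, float]]:
    """Rank vocabulary words by dot product with the query vector.

    For L2-normalized vectors the dot product equals cosine similarity.
    """
    scored = [(sum(q * v for q, v in zip(query, vec)), word) for word, vec in vocab.items()]
    scored.sort(reverse=True)
    return [(word, score) for score, word in scored[:k]]


# Toy unit vectors, not actual BEAST outputs.
vocab = {
    "gato": [1.0, 0.0],
    "perro": [0.8, 0.6],
    "coche": [0.0, 1.0],
}
print(top_k([1.0, 0.0], vocab, k=2))  # [('gato', 1.0), ('perro', 0.8)]
```

For large vocabularies, the same ranking is usually delegated to a vector index rather than a linear scan.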

Not recommended:

  • Sentence-level semantics
  • Context-sensitive tasks
  • Text generation
  • Classification without additional modeling

Limitations

  • Static embeddings (no context awareness)
  • Trained exclusively on Spanish
  • Distillation inherits biases and inconsistencies from GPT-3
  • Limited handling of polysemy
  • Not suitable as a standalone classifier

Ethical considerations

BEAST inherits semantic properties from GPT-3 embeddings and may reflect biases present in the teacher model. No explicit bias mitigation or filtering was applied during training.


Repository structure

  • Model files are located in the root of this repository
  • Research code, experiments, and datasets are stored in research_files/
  • Only root-level files are required for inference via from_pretrained

Usage

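from transformers import AutoModel, AutoTokenizer

# Load the model weights and configuration from the Hugging Face Hub.
model = AutoModel.from_pretrained("Alverciito/beast_cased")
# Load the matching input pre-processor (byte-level encoding, no subword vocabulary).
tokenizer = AutoTokenizer.from_pretrained("Alverciito/beast_cased")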