ProtGPT2-Distilled-Tiny

A compact protein language model distilled from ProtGPT2 using complementary-regularizer distillation, a method that combines uncertainty-aware position weighting with calibration-aware label smoothing to achieve an 87% better perplexity ratio than standard knowledge distillation at 20x compression.

Preprint: Distilling Protein Language Models with Complementary Regularizers (Wijaya, 2026), bioRxiv
Code: github.com/ewijaya/protein-lm-distill

Model Summary

| Property | Value |
|---|---|
| Parameters | ~37M |
| Architecture | GPT-2 (4 layers, 4 heads, 512 embedding dim) |
| Compression | 20x (vs. 738M teacher) |
| Perplexity ratio | 5.06 (87% better than baseline KD) |
| Expected calibration error | 0.183 (47% better than baseline) |
| Inference speedup | 5.3x over ProtGPT2 |
| GPU memory | 170 MB (19x reduction from teacher) |
| Throughput | ~111 sequences/min on NVIDIA L40S |

Quick Start

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline

model = GPT2LMHeadModel.from_pretrained("littleworth/protgpt2-distilled-tiny")
tokenizer = GPT2Tokenizer.from_pretrained("littleworth/protgpt2-distilled-tiny")

# device=0 selects the first GPU; drop the argument to run on CPU
generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)

sequences = generator(
    "<|endoftext|>",  # ProtGPT2-style models use <|endoftext|> as the generation prompt
    max_length=256,
    do_sample=True,
    top_k=950,
    repetition_penalty=1.2,
    num_return_sequences=5,
    eos_token_id=0,
    pad_token_id=0,
    truncation=True,
)

for i, seq in enumerate(sequences):
    # Strip special tokens and line breaks, keeping only amino acid letters
    protein = seq["generated_text"].replace("<|endoftext|>", "").replace("\n", "")
    protein = "".join(c for c in protein if c.isalpha())
    print(f">Generated_{i}\n{protein}")
```

How It Works

This model was trained using complementary-regularizer distillation, which augments standard temperature-scaled knowledge distillation (Hinton et al., 2015) with two protein-specific enhancements:

  1. Uncertainty-aware position weighting --- Uses teacher entropy to emphasize biologically variable regions (loops, surface residues) during distillation, directing learning capacity toward positions where the teacher's distributional knowledge is richest.

  2. Calibration-aware label smoothing --- Applies confidence-dependent smoothing to teacher distributions, acting as a noise filter that removes miscalibration artifacts while preserving genuine amino acid substitution preferences.

The key finding: each enhancement applied alone degrades distillation quality (perplexity increases of 95% and 109%, respectively), yet together they yield a 53% perplexity improvement over the baseline, a phenomenon we call complementary regularizers. Smoothing removes the noise that weighting would amplify, while weighting compensates for the signal attenuation that smoothing introduces.
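A minimal NumPy sketch of how the two terms might compose. The entropy-to-weight mapping and the confidence-dependent smoothing schedule below are illustrative assumptions, not the paper's exact formulas; only `T=2.0` and `lam=0.1` come from the Training Details section.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def complementary_kd_loss(student_logits, teacher_logits, T=2.0, lam=0.1):
    """Sketch: smooth the tempered teacher, then weight positions by its entropy.

    student_logits, teacher_logits: (seq_len, vocab_size) arrays.
    """
    V = teacher_logits.shape[-1]
    p_t = softmax(teacher_logits / T)                # tempered teacher distribution
    # Calibration-aware label smoothing: blend with uniform, smoothing more
    # where the teacher is most confident (one plausible reading, not the paper's).
    conf = p_t.max(axis=-1, keepdims=True)           # (seq_len, 1)
    eps = lam * conf
    p_t = (1 - eps) * p_t + eps / V                  # rows still sum to 1
    # Uncertainty-aware position weighting from (smoothed) teacher entropy:
    # variable positions get weight above 1, conserved positions below 1.
    H = -(p_t * np.log(p_t + 1e-12)).sum(axis=-1)    # (seq_len,)
    w = H / H.mean()
    # Position-weighted KL(teacher || student), scaled by T^2 as in standard KD
    log_p_s = np.log(softmax(student_logits / T) + 1e-12)
    kl = (p_t * (np.log(p_t + 1e-12) - log_p_s)).sum(axis=-1)
    return float((w * kl).mean() * T * T)
```

With `lam=0` the function reduces to plain temperature-scaled KD with entropy weighting, which makes the interaction between the two regularizers easy to probe in isolation.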

Performance

Compared to Baseline Knowledge Distillation

| Method | PPL ratio | ECE | KL divergence |
|---|---|---|---|
| Baseline KD | 39.91 | 0.345 | 3.16 |
| This model (complementary regularizers) | 5.06 | 0.183 | 1.34 |
| Improvement | 87% | 47% | 58% |

Model Family Comparison

| Model | Params | Compression | PPL ratio | Speedup | GPU memory |
|---|---|---|---|---|---|
| ProtGPT2 (teacher) | 738M | 1x | 1.00 | 1.0x | 3,211 MB |
| Tiny (this model) | 37M | 20x | 5.06 | 5.3x | 170 MB |
| Small | 78M | 9.4x | 7.05 | 4.1x | 343 MB |
| Medium | 194M | 3.8x | 2.58 | 2.4x | 836 MB |

Biological Validity

Generated sequences produce amino acid distributions closely matching natural proteins (KL divergence from UniProt < 0.015), confirming that compressed models preserve biologically realistic sequence statistics.
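The UniProt comparison above reduces to a KL divergence between two 20-letter frequency distributions. A self-contained sketch of that check (the example sequences and the pseudocount value are illustrative, not from the paper):

```python
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues

def aa_distribution(sequences, pseudocount=1e-6):
    """Empirical amino acid frequencies over a set of sequences.

    A small pseudocount keeps every letter's probability nonzero so the
    KL divergence below is always finite.
    """
    counts = Counter(aa for seq in sequences for aa in seq)
    total = sum(counts[a] for a in AMINO_ACIDS) or 1
    denom = total + len(AMINO_ACIDS) * pseudocount
    return {a: (counts[a] + pseudocount) / denom for a in AMINO_ACIDS}

def kl_divergence(p, q):
    """KL(p || q) in nats over the 20 standard amino acids."""
    return sum(p[a] * math.log(p[a] / q[a]) for a in AMINO_ACIDS)
```

In this setting `p` would be the natural (UniProt) distribution and `q` the distribution over generated sequences; values below 0.015 nats indicate near-natural composition.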

When to Use This Model

  • High-throughput screening: 111 seq/min enables scoring ~10^6 candidates in ~6 GPU-hours on consumer hardware
  • Resource-constrained deployment: 170 MB GPU memory fits on shared lab workstations
  • On-premise inference: Run locally without sending proprietary sequences to cloud APIs
  • Antibody/enzyme engineering: Fast iteration in ML-guided design-build-test cycles
  • Rapid domain adaptation: Fine-tunes in 25 seconds vs 66 minutes for the teacher (162x faster), ideal for rapid prototyping of family-specific models

For applications where perplexity matters more than speed, consider the Medium variant (2.58 PPL ratio, 2.4x speedup).
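For the screening use case, ranking candidates only needs perplexity, which is the exponentiated mean per-token negative log-likelihood. A minimal helper; the commented model call is an assumed usage pattern for Hugging Face causal LMs, not code from this repository:

```python
import math

def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# With the model loaded as in Quick Start, the mean token NLL is the LM loss,
# so per-sequence scoring would look roughly like (assumed usage):
#   out = model(input_ids, labels=input_ids)
#   ppl = math.exp(out.loss.item())
```

Lower perplexity means the model finds the sequence more natural, so candidates can be ranked by ascending perplexity before any expensive downstream analysis.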

Fine-Tuning on Custom Protein Families

This model serves as a superior starting point for domain adaptation compared to the full-size teacher. When fine-tuned on protein families, it achieves lower perplexity than the 738M teacher on conotoxin (PPL 40 vs 54 at N=1,000) and higher HMMER hit rate on lysozyme (84% vs 69%). Fine-tuning completes in 25 seconds versus 66 minutes for the teacher (162x faster).

This advantage stems from the complementary-regularizer distillation method itself, not just model compression: a standard-distilled model with the same 37M architecture performs only at teacher level, while the complementary-regularizer student far exceeds both (15 out of 15 perplexity wins across three protein families).

```python
from transformers import (
    GPT2LMHeadModel, GPT2Tokenizer, Trainer, TrainingArguments,
    DataCollatorForLanguageModeling,
)
from datasets import Dataset

model_name = "littleworth/protgpt2-distilled-tiny"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Prepare your protein sequences as a list of strings
sequences = ["MKTLLILAVL...", "MKFLILLFNL..."]  # your family sequences

dataset = Dataset.from_dict({"text": sequences})
dataset = dataset.map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)
# Hold out a slice for evaluation; eval_strategy="epoch" requires an eval set
split = dataset.train_test_split(test_size=0.1)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./finetuned-model",
        num_train_epochs=20,
        per_device_train_batch_size=8,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_steps=100,
        fp16=True,
        eval_strategy="epoch",
    ),
    train_dataset=split["train"],
    eval_dataset=split["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()
trainer.save_model("./finetuned-model")
```

Recommended fine-tuning hyperparameters for this model:

| Parameter | Value |
|---|---|
| Learning rate | 2e-4 |
| Batch size | 8 |
| Scheduler | Cosine with 100 warmup steps |
| Early stopping | Patience 3 on eval loss |
| Precision | FP16 |
| Gradient checkpointing | Not needed |
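The patience-3 early stopping in the table maps onto transformers' `EarlyStoppingCallback`. A configuration sketch: the checkpointing arguments below are requirements of that callback, not additional recommendations from the authors.

```python
from transformers import EarlyStoppingCallback, TrainingArguments

args = TrainingArguments(
    output_dir="./finetuned-model",
    num_train_epochs=20,
    per_device_train_batch_size=8,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    fp16=True,
    eval_strategy="epoch",
    save_strategy="epoch",           # must match eval_strategy for best-model tracking
    load_best_model_at_end=True,     # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
)
# Then pass callbacks=[EarlyStoppingCallback(early_stopping_patience=3)] to Trainer.
```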

Training Details

| Parameter | Value |
|---|---|
| Teacher model | nferruz/ProtGPT2 (738M) |
| Training data | 10% UniProt subset (Parquet) |
| Temperature (T) | 2.0 |
| Alpha | 0.5 |
| Learning rate | 5e-4 (with 500-step linear warmup) |
| Epochs | 3 |
| Batch size | 32 (effective) |
| Optimizer | AdamW |
| Precision | FP16 |
| Uncertainty weighting | Enabled |
| Calibration smoothing | Enabled (lambda=0.1) |

Citation

```bibtex
@article{Wijaya2026.02.17.706304,
    author = {Wijaya, Edward},
    title = {Distilling Protein Language Models with Complementary Regularizers},
    elocation-id = {2026.02.17.706304},
    year = {2026},
    doi = {10.64898/2026.02.17.706304},
    publisher = {Cold Spring Harbor Laboratory},
    abstract = {Large autoregressive protein language models generate novel sequences de novo, but their size limits throughput and precludes rapid domain adaptation on scarce proprietary data. We distill a 738M-parameter protein language model into compact students using two protein-specific enhancements, uncertainty-aware position weighting and calibration-aware label smoothing, that individually degrade quality yet combine for substantial improvement. We trace this complementary-regularizer effect to information theory: smoothing denoises teacher distributions while weighting amplifies the cleaned signal at biologically variable positions. Students achieve up to 5x inference speedup, preserve natural amino acid distributions, and require as little as 170 MB of GPU memory, enabling deployment on consumer-grade hardware. When fine-tuned on protein families with as few as 50 sequences, students generate more family-matching sequences than the teacher, achieving higher sample efficiency and Pfam hit rates despite their smaller capacity. These results establish distilled protein language models as superior starting points for domain adaptation on scarce data.},
    URL = {https://www.biorxiv.org/content/early/2026/02/25/2026.02.17.706304},
    eprint = {https://www.biorxiv.org/content/early/2026/02/25/2026.02.17.706304.full.pdf},
    journal = {bioRxiv}
}
```

License

Apache 2.0
