# Constitutional Safety Classifier

This model is a LoRA fine-tune of Qwen/Qwen3-1.7B trained with TRL SFT as a next-token safety classifier. Given a constitution and content to classify, it predicts one of two labels:

- `safe`
- `unsafe`
The model is intended for research and evaluation of constitutional safety classification, not as a complete production guardrail by itself.
## Paper-aligned evaluation

I evaluated this model against the protocol style of Anthropic's *Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming* (arXiv:2501.18837).
The exact Anthropic CBRN jailbreak dataset and Claude.ai production traffic are not public, so the evaluation uses public proxies matching the paper's key axes:
- held-out classifier accuracy,
- harmful recall / missed-unsafe rate as an ASR proxy,
- over-refusal / false-positive rate on benign but safety-adjacent prompts.
Full results are in `constitutional_eval_results.json`. The reproducible script is `evaluate_constitutional_classifier.py`.

### Results at threshold 0.5
| Dataset / proxy | N | Unsafe recall / TPR | Missed unsafe / ASR proxy | Over-refusal / FPR | Unsafe F1 | AUROC |
|---|---|---|---|---|---|---|
| Held-out classifier data | 500 | 0.936 | 0.064 | 0.332 | 0.825 | 0.930 |
| ToxicChat toxicity | 500 | 0.960 | 0.040 | 0.160 | 0.906 | 0.973 |
| ToxicChat jailbreak subset | 66 | 0.985 | 0.015 | 0.000 | 0.992 | n/a |
| Aegis 2.0 prompt-only | 500 | 0.964 | 0.036 | 0.460 | 0.795 | 0.923 |
| Aegis 2.0 prompt+response | 500 | 0.980 | 0.020 | 0.424 | 0.815 | 0.937 |
| BeaverTails prompt+response | 500 | 0.952 | 0.048 | 0.324 | 0.837 | 0.934 |
| OR-Bench toxic | 500 | 0.996 | 0.004 | 0.000 | 0.998 | n/a |
| OR-Bench hard benign | 500 | n/a | n/a | 0.950 | n/a | n/a |
| MMLU chemistry benign | 303 | n/a | n/a | 0.0033 | n/a | n/a |
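These columns map onto standard binary-classification metrics with `unsafe` as the positive class. As a rough sketch of how they can be computed (the actual implementation is in `evaluate_constitutional_classifier.py`; `y_true` and `p_unsafe` are hypothetical arrays of gold labels and model scores):

```python
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

def table_metrics(y_true, p_unsafe, threshold=0.5):
    """y_true: 1 = unsafe (positive class); p_unsafe: model P(unsafe) scores."""
    y_pred = [int(p >= threshold) for p in p_unsafe]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "unsafe_recall_tpr": tp / (tp + fn) if tp + fn else None,  # n/a on benign-only sets
        "missed_unsafe_asr_proxy": fn / (tp + fn) if tp + fn else None,
        "over_refusal_fpr": fp / (fp + tn) if fp + tn else None,
        "unsafe_f1": f1_score(y_true, y_pred, zero_division=0),
        # AUROC is undefined when only one class is present (the n/a cells above).
        "auroc": roc_auc_score(y_true, p_unsafe) if 0 < sum(y_true) < len(y_true) else None,
    }
```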
### Interpretation

The model has strong harmful-content recall across the public proxy datasets: unsafe recall ranges from 0.936 on the held-out set to 0.996 on OR-Bench toxic. This suggests the fine-tuning successfully taught the model to recognize many unsafe and jailbreak-style prompts.

The main weakness is over-refusal. At threshold 0.5, the model flags many benign but safety-adjacent prompts as unsafe, most severely on OR-Bench hard benign prompts, where the FPR is 95%. This is far above a paper-style target of roughly ≤5% FPR (or only a small FPR increase) on over-refusal datasets.
The held-out score distribution is still separable: AUROC is 0.930 on the held-out classifier set and 0.973 on ToxicChat. However, deployment would require threshold calibration and likely more benign hard-negative training data.
Held-out threshold sweep:
| Constraint | Threshold | TPR | FPR |
|---|---|---|---|
| FPR ≤ 0.5% | 0.997 | 0.220 | 0.000 |
| FPR ≤ 1% | 0.997 | 0.220 | 0.000 |
| FPR ≤ 5% | 0.981 | 0.728 | 0.032 |
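To recalibrate the operating point (see the Interpretation notes above), pick the lowest threshold whose FPR stays within a budget, which maximizes TPR under that constraint. A minimal sketch with scikit-learn, again with hypothetical `y_true`/`p_unsafe` arrays of held-out labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_curve

def threshold_for_fpr_budget(y_true, p_unsafe, max_fpr=0.05):
    """Return the (threshold, TPR, FPR) point with the highest TPR s.t. FPR <= max_fpr."""
    fpr, tpr, thresholds = roc_curve(y_true, p_unsafe)
    idx = np.where(fpr <= max_fpr)[0][-1]  # fpr is sorted ascending, thresholds descending
    return thresholds[idx], tpr[idx], fpr[idx]
```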
## Reproduce evaluation

```bash
pip install transformers peft accelerate datasets scikit-learn huggingface_hub sentencepiece

python evaluate_constitutional_classifier.py \
  --max-per-dataset 500 \
  --batch-size 8 \
  --max-length 2048 \
  --threshold 0.5 \
  --output constitutional_eval_results.json
```
The evaluator loads the base model, applies this LoRA adapter, formats prompts with `constitution.json`, and scores the next-token probability mass assigned to the `safe`/`unsafe` label tokens.

## Usage
This repository contains a PEFT LoRA adapter. For direct scoring, use the evaluation script above. Minimal generation-style use:
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "Qwen/Qwen3-1.7B"
adapter = "imadreamerboy/constitutional-safety-classifier"

tok = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    base_model, dtype="auto", device_map="auto", trust_remote_code=True
)
model = PeftModel.from_pretrained(model, adapter)  # attach the LoRA adapter
model.eval()
```
For robust classification, prefer next-token scoring of `safe` vs `unsafe`, as implemented in `evaluate_constitutional_classifier.py`, over parsing free-form generations.
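A minimal sketch of that scoring approach, continuing from the snippet above. The exact prompt template comes from `constitution.json`, and treating each label as a single first token is an assumption here, not the eval script's canonical tokenization:

```python
import torch

def p_unsafe(prompt_text: str) -> float:
    """Score P(unsafe) from the next-token distribution after the formatted prompt."""
    inputs = tok(prompt_text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token
    probs = torch.softmax(logits, dim=-1)
    # Assumes each label's first token is distinctive; check the tokenizer in practice.
    safe_id = tok.encode("safe", add_special_tokens=False)[0]
    unsafe_id = tok.encode("unsafe", add_special_tokens=False)[0]
    p_s, p_u = probs[safe_id].item(), probs[unsafe_id].item()
    return p_u / (p_s + p_u)  # renormalize over the two label tokens
```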
## Training procedure

This model was trained with supervised fine-tuning (SFT) via TRL.
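For context, a TRL SFT setup for this kind of label-token classifier might look like the sketch below. This is illustrative only: the prompt template, example rows, and LoRA settings are placeholders, not the actual training configuration.

```python
from datasets import Dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Hypothetical prompt/completion rows: constitution + content, then a label token.
train = Dataset.from_list([
    {"prompt": "<constitution>\n<content to classify>\nLabel:", "completion": " safe"},
    {"prompt": "<constitution>\n<harmful content>\nLabel:", "completion": " unsafe"},
])

trainer = SFTTrainer(
    model="Qwen/Qwen3-1.7B",
    train_dataset=train,
    args=SFTConfig(output_dir="constitutional-safety-classifier"),
    peft_config=LoraConfig(task_type="CAUSAL_LM"),  # a LoRA adapter, as released here
)
trainer.train()
```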
### Framework versions
- TRL: 1.2.0
- Transformers: 5.5.4
- PyTorch: 2.11.0
- Datasets: 4.8.4
- Tokenizers: 0.22.2