# Constitutional Safety Classifier

This model is a LoRA fine-tune of Qwen/Qwen3-1.7B trained with TRL SFT as a next-token safety classifier. Given a constitution and content to classify, it predicts one of two labels:

- `safe`
- `unsafe`
The model is intended for research and evaluation of constitutional safety classification, not as a complete production guardrail by itself.
## Paper-aligned evaluation

I evaluated this model against the protocol style of Anthropic's *Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming* (arXiv:2501.18837).
The exact Anthropic CBRN jailbreak dataset and Claude.ai production traffic are not public, so the evaluation uses public proxies matching the paper's key axes:
- held-out classifier accuracy,
- harmful recall / missed-unsafe rate as an ASR proxy,
- over-refusal / false-positive rate on benign but safety-adjacent prompts.
Full results are in `constitutional_eval_results.json`. The reproducible script is `evaluate_constitutional_classifier.py`.

### Results at threshold 0.5
| Dataset / proxy | N | Unsafe recall / TPR | Missed unsafe / ASR proxy | Over-refusal / FPR | Unsafe F1 | AUROC |
|---|---|---|---|---|---|---|
| Held-out classifier data | 500 | 0.936 | 0.064 | 0.332 | 0.825 | 0.930 |
| ToxicChat toxicity | 500 | 0.960 | 0.040 | 0.160 | 0.906 | 0.973 |
| ToxicChat jailbreak subset | 66 | 0.985 | 0.015 | 0.000 | 0.992 | n/a |
| Aegis 2.0 prompt-only | 500 | 0.964 | 0.036 | 0.460 | 0.795 | 0.923 |
| Aegis 2.0 prompt+response | 500 | 0.980 | 0.020 | 0.424 | 0.815 | 0.937 |
| BeaverTails prompt+response | 500 | 0.952 | 0.048 | 0.324 | 0.837 | 0.934 |
| OR-Bench toxic | 500 | 0.996 | 0.004 | 0.000 | 0.998 | n/a |
| OR-Bench hard benign | 500 | n/a | n/a | 0.950 | n/a | n/a |
| MMLU chemistry benign | 303 | n/a | n/a | 0.0033 | n/a | n/a |
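These columns map onto standard binary-classification metrics with `unsafe` as the positive class. As a rough sketch of how they can be computed (the actual implementation is in `evaluate_constitutional_classifier.py`; `y_true` and `p_unsafe` are hypothetical arrays of gold labels and model scores):

```python
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

def table_metrics(y_true, p_unsafe, threshold=0.5):
    """y_true: 1 = unsafe (positive class); p_unsafe: model P(unsafe) scores."""
    y_pred = [int(p >= threshold) for p in p_unsafe]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "unsafe_recall_tpr": tp / (tp + fn) if tp + fn else None,  # n/a on benign-only sets
        "missed_unsafe_asr_proxy": fn / (tp + fn) if tp + fn else None,
        "over_refusal_fpr": fp / (fp + tn) if fp + tn else None,
        "unsafe_f1": f1_score(y_true, y_pred, zero_division=0),
        # AUROC is undefined when only one class is present (the n/a cells above).
        "auroc": roc_auc_score(y_true, p_unsafe) if 0 < sum(y_true) < len(y_true) else None,
    }
```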
### Interpretation

The model has strong harmful-content recall across the public proxy datasets: unsafe recall ranges from 0.936 on the held-out set to 0.996 on OR-Bench toxic. This suggests the fine-tuning successfully taught the model to recognize many unsafe and jailbreak-style prompts.

The main weakness is over-refusal. At threshold 0.5, the model flags many benign but safety-adjacent prompts as unsafe, most severely on OR-Bench hard benign prompts, where the FPR is 95%. This is far above a paper-style target of roughly ≤5% FPR (or only a small FPR increase) on over-refusal datasets.
The held-out score distribution is still separable: AUROC is 0.930 on the held-out classifier set and 0.973 on ToxicChat. However, deployment would require threshold calibration and likely more benign hard-negative training data.
Held-out threshold sweep:
| Constraint | Threshold | TPR | FPR |
|---|---|---|---|
| FPR ≤ 0.5% | 0.997 | 0.220 | 0.000 |
| FPR ≤ 1% | 0.997 | 0.220 | 0.000 |
| FPR ≤ 5% | 0.981 | 0.728 | 0.032 |
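To recalibrate the operating point (see the Interpretation notes above), pick the lowest threshold whose FPR stays within a budget, which maximizes TPR under that constraint. A minimal sketch with scikit-learn, again with hypothetical `y_true`/`p_unsafe` arrays of held-out labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_curve

def threshold_for_fpr_budget(y_true, p_unsafe, max_fpr=0.05):
    """Return the (threshold, TPR, FPR) point with the highest TPR s.t. FPR <= max_fpr."""
    fpr, tpr, thresholds = roc_curve(y_true, p_unsafe)
    idx = np.where(fpr <= max_fpr)[0][-1]  # fpr is sorted ascending, thresholds descending
    return thresholds[idx], tpr[idx], fpr[idx]
```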
## Reproduce evaluation

```bash
pip install transformers peft accelerate datasets scikit-learn huggingface_hub sentencepiece

python evaluate_constitutional_classifier.py \
  --max-per-dataset 500 \
  --batch-size 8 \
  --max-length 2048 \
  --threshold 0.5 \
  --output constitutional_eval_results.json
```
The evaluator loads the base model, applies this LoRA adapter, formats prompts with `constitution.json`, and scores the next-token probability mass assigned to the `safe`/`unsafe` label tokens.

## Usage
This repository contains a PEFT LoRA adapter. For direct scoring, use the evaluation script above. Minimal generation-style use:
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "Qwen/Qwen3-1.7B"
adapter = "imadreamerboy/constitutional-safety-classifier"

tok = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    base_model, dtype="auto", device_map="auto", trust_remote_code=True
)
model = PeftModel.from_pretrained(model, adapter)  # attach the LoRA adapter
model.eval()
```
For robust classification, prefer next-token scoring of `safe` vs `unsafe`, as implemented in `evaluate_constitutional_classifier.py`, over parsing free-form generations.
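A minimal sketch of that scoring approach, continuing from the snippet above. The exact prompt template comes from `constitution.json`, and treating each label as a single first token is an assumption here, not the eval script's canonical tokenization:

```python
import torch

def p_unsafe(prompt_text: str) -> float:
    """Score P(unsafe) from the next-token distribution after the formatted prompt."""
    inputs = tok(prompt_text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token
    probs = torch.softmax(logits, dim=-1)
    # Assumes each label's first token is distinctive; check the tokenizer in practice.
    safe_id = tok.encode("safe", add_special_tokens=False)[0]
    unsafe_id = tok.encode("unsafe", add_special_tokens=False)[0]
    p_s, p_u = probs[safe_id].item(), probs[unsafe_id].item()
    return p_u / (p_s + p_u)  # renormalize over the two label tokens
```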
## Training procedure

This model was trained with supervised fine-tuning (SFT) via TRL.
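For context, a TRL SFT setup for this kind of label-token classifier might look like the sketch below. This is illustrative only: the prompt template, example rows, and LoRA settings are placeholders, not the actual training configuration.

```python
from datasets import Dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Hypothetical prompt/completion rows: constitution + content, then a label token.
train = Dataset.from_list([
    {"prompt": "<constitution>\n<content to classify>\nLabel:", "completion": " safe"},
    {"prompt": "<constitution>\n<harmful content>\nLabel:", "completion": " unsafe"},
])

trainer = SFTTrainer(
    model="Qwen/Qwen3-1.7B",
    train_dataset=train,
    args=SFTConfig(output_dir="constitutional-safety-classifier"),
    peft_config=LoraConfig(task_type="CAUSAL_LM"),  # a LoRA adapter, as released here
)
trainer.train()
```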
### Framework versions
- TRL: 1.2.0
- Transformers: 5.5.4
- PyTorch: 2.11.0
- Datasets: 4.8.4
- Tokenizers: 0.22.2