Constitutional Safety Classifier

This model is a LoRA fine-tune of Qwen/Qwen3-1.7B trained with TRL SFT as a next-token safety classifier. Given a constitution and content to classify, it predicts one of two labels:

  • safe
  • unsafe

The model is intended for research and evaluation of constitutional safety classification, not as a complete production guardrail by itself.

Paper-aligned evaluation

I evaluated this model against the protocol style of Anthropic's Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming (arXiv:2501.18837).

The exact Anthropic CBRN jailbreak dataset and Claude.ai production traffic are not public, so the evaluation uses public proxies matching the paper's key axes:

  1. held-out classifier accuracy,
  2. harmful recall / missed-unsafe rate as an ASR proxy,
  3. over-refusal / false-positive rate on benign but safety-adjacent prompts.
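Treating unsafe as the positive class, these three proxies reduce to standard confusion-matrix quantities. A minimal sketch (function and variable names are illustrative, not taken from the evaluation script):

```python
def proxy_metrics(y_true, y_pred):
    """Paper-style proxy metrics from binary labels (1 = unsafe, 0 = safe).

    Returns unsafe recall (TPR), missed-unsafe rate (ASR proxy = 1 - TPR),
    and over-refusal rate (FPR on benign items).
    """
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    tpr = tp / (tp + fn)  # unsafe recall
    fpr = fp / (fp + tn)  # benign prompts flagged unsafe
    return {"unsafe_recall": tpr, "missed_unsafe": 1.0 - tpr, "over_refusal_fpr": fpr}
```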

Full results are in constitutional_eval_results.json. The reproducible script is evaluate_constitutional_classifier.py.

Results at threshold 0.5

| Dataset / proxy | N | Unsafe recall (TPR) | Missed unsafe (ASR proxy) | Over-refusal (FPR) | Unsafe F1 | AUROC |
|---|---|---|---|---|---|---|
| Held-out classifier data | 500 | 0.936 | 0.064 | 0.332 | 0.825 | 0.930 |
| ToxicChat toxicity | 500 | 0.960 | 0.040 | 0.160 | 0.906 | 0.973 |
| ToxicChat jailbreak subset | 66 | 0.985 | 0.015 | 0.000 | 0.992 | n/a |
| Aegis 2.0 prompt-only | 500 | 0.964 | 0.036 | 0.460 | 0.795 | 0.923 |
| Aegis 2.0 prompt+response | 500 | 0.980 | 0.020 | 0.424 | 0.815 | 0.937 |
| BeaverTails prompt+response | 500 | 0.952 | 0.048 | 0.324 | 0.837 | 0.934 |
| OR-Bench toxic | 500 | 0.996 | 0.004 | 0.000 | 0.998 | n/a |
| OR-Bench hard benign | 500 | n/a | n/a | 0.950 | n/a | n/a |
| MMLU chemistry benign | 303 | n/a | n/a | 0.0033 | n/a | n/a |

Interpretation

The model has strong harmful-content recall across the public proxy datasets: unsafe recall ranges from 0.936 on the held-out set to 0.996 on OR-Bench toxic. This suggests the fine-tuning successfully taught the model to recognize a wide range of unsafe and jailbreak-like prompts.

The main weakness is over-refusal. At threshold 0.5, the model flags many benign but safety-adjacent prompts as unsafe, most severely on OR-Bench hard benign prompts, where the FPR is 95%. This is far above the paper-style target of roughly ≤5% FPR, even allowing for elevated FPR on deliberately adversarial over-refusal datasets.

The held-out score distribution is still separable: AUROC is 0.930 on the held-out classifier set and 0.973 on ToxicChat. However, deployment would require threshold calibration and likely more benign hard-negative training data.

Held-out threshold sweep:

| Constraint | Threshold | TPR | FPR |
|---|---|---|---|
| FPR ≤ 0.5% | 0.997 | 0.220 | 0.000 |
| FPR ≤ 1% | 0.997 | 0.220 | 0.000 |
| FPR ≤ 5% | 0.981 | 0.728 | 0.032 |
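A sweep like this can be reproduced from raw unsafe-probabilities by scanning candidate thresholds from high to low and keeping the lowest threshold whose FPR still satisfies the cap, which maximizes TPR under the constraint. A minimal sketch (names are illustrative; the evaluation script may implement this differently):

```python
def best_threshold_under_fpr(scores, labels, max_fpr):
    """Highest-TPR threshold whose FPR <= max_fpr.

    scores: P(unsafe) per example; labels: 1 = unsafe, 0 = safe.
    Returns (threshold, tpr, fpr), or None if no threshold qualifies.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    best = None
    for t in sorted(set(scores), reverse=True):  # lowering t raises both TPR and FPR
        fpr = sum(s >= t for s in neg) / len(neg)
        if fpr > max_fpr:
            break  # FPR only grows as t keeps dropping; stop here
        tpr = sum(s >= t for s in pos) / len(pos)
        best = (t, tpr, fpr)
    return best
```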

Reproduce evaluation

pip install transformers peft accelerate datasets scikit-learn huggingface_hub sentencepiece

python evaluate_constitutional_classifier.py \
  --max-per-dataset 500 \
  --batch-size 8 \
  --max-length 2048 \
  --threshold 0.5 \
  --output constitutional_eval_results.json

The evaluator loads the base model, applies this LoRA adapter, formats prompts with constitution.json, and scores the next-token probability mass assigned to safe/unsafe label tokens.

Usage

This repository contains a PEFT LoRA adapter. For direct scoring, use the evaluation script above. Minimal generation-style use:

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "Qwen/Qwen3-1.7B"
adapter = "imadreamerboy/constitutional-safety-classifier"

tok = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(base_model, dtype="auto", device_map="auto", trust_remote_code=True)
model = PeftModel.from_pretrained(model, adapter)
model.eval()

For robust classification, prefer next-token scoring of safe vs unsafe as implemented in evaluate_constitutional_classifier.py, rather than free-form generation parsing.
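A sketch of that scoring approach, given the model and tok loaded above. The label token strings and the renormalization over exactly two tokens are assumptions here; match the exact prompt template and label tokens used in evaluate_constitutional_classifier.py:

```python
import math

def p_unsafe_from_logits(logit_safe, logit_unsafe):
    """Renormalize the two label-token logits into P(unsafe)."""
    m = max(logit_safe, logit_unsafe)
    e_s = math.exp(logit_safe - m)
    e_u = math.exp(logit_unsafe - m)
    return e_u / (e_s + e_u)

def score(model, tok, prompt, safe_token=" safe", unsafe_token=" unsafe"):
    """Next-token P(unsafe). The label token strings are an assumption;
    use whatever single tokens the training format actually emits."""
    import torch
    safe_id = tok.encode(safe_token, add_special_tokens=False)[0]
    unsafe_id = tok.encode(unsafe_token, add_special_tokens=False)[0]
    with torch.no_grad():
        inputs = tok(prompt, return_tensors="pt").to(model.device)
        logits = model(**inputs).logits[0, -1]  # next-token logits
    return p_unsafe_from_logits(logits[safe_id].item(), logits[unsafe_id].item())
```

Scoring only the two label logits keeps the decision invariant to probability mass spread over unrelated tokens, which is why it is more robust than parsing free-form generations.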

Training procedure

This model was trained with supervised fine-tuning (SFT) using TRL, applying a LoRA adapter to the base model rather than updating the full weights.

Framework versions

  • TRL: 1.2.0
  • Transformers: 5.5.4
  • PyTorch: 2.11.0
  • Datasets: 4.8.4
  • Tokenizers: 0.22.2