arxiv:2602.13576

Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges

Published on Feb 14 · Submitted by Ruomeng Ding on Feb 23
Abstract

LLM-based judges using natural-language rubrics for evaluation can exhibit systematic preference drift from minor rubric modifications, which can be exploited to manipulate alignment pipelines and degrade model performance.

AI-generated summary

Evaluation and alignment pipelines for large language models increasingly rely on LLM-based judges, whose behavior is guided by natural-language rubrics and validated on benchmarks. We identify a previously under-recognized vulnerability in this workflow, which we term Rubric-Induced Preference Drift (RIPD). Even when rubric edits pass benchmark validation, they can still produce systematic and directional shifts in a judge's preferences on target domains. Because rubrics serve as a high-level decision interface, such drift can emerge from seemingly natural, criterion-preserving edits and remain difficult to detect through aggregate benchmark metrics or limited spot-checking. We further show that this vulnerability can be exploited through rubric-based preference attacks, in which benchmark-compliant rubric edits steer judgments away from a fixed human or trusted reference on target domains, systematically inducing RIPD and reducing target-domain accuracy by up to 9.5% (helpfulness) and 27.9% (harmlessness). When these judgments are used to generate preference labels for downstream post-training, the induced bias propagates through alignment pipelines and becomes internalized in trained policies, leading to persistent and systematic drift in model behavior. Overall, our findings highlight evaluation rubrics as a sensitive and manipulable control interface, revealing a system-level alignment risk that extends beyond evaluator reliability alone. The code is available at: https://github.com/ZDCSlab/Rubrics-as-an-Attack-Surface. Warning: Certain sections may contain potentially harmful content that may not be appropriate for all readers.
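To make the propagation path concrete: when a judge's pairwise verdicts are converted into preference data, any rubric-induced bias lands directly in the training pairs. The following is a minimal sketch of that conversion step, not the paper's implementation; `build_dpo_pairs` and `biased_judge` are hypothetical names, and the stub judge simply stands in for an LLM judge operating under a manipulated rubric.

```python
def build_dpo_pairs(prompts, candidates, judge):
    """Turn pairwise judge verdicts into (chosen, rejected) DPO examples.

    If the judge is biased by a manipulated rubric, that bias is baked
    directly into every training pair it emits.
    """
    data = []
    for prompt, (a, b) in zip(prompts, candidates):
        verdict = judge(prompt, a, b)  # 'A' or 'B'
        chosen, rejected = (a, b) if verdict == "A" else (b, a)
        data.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return data


# Stub standing in for an LLM judge under an edited rubric:
# it always prefers the second response, regardless of content.
biased_judge = lambda prompt, a, b: "B"

pairs = build_dpo_pairs(
    ["q1", "q2"], [("safe", "risky"), ("long", "short")], biased_judge
)
print(pairs[0])  # {'prompt': 'q1', 'chosen': 'risky', 'rejected': 'safe'}
```

Every downstream optimizer that consumes these pairs then treats the biased verdicts as ground truth, which is exactly how the drift becomes internalized in the trained policy.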

Community

Paper submitter

πŸ”₯ Evaluation rubrics can silently alter alignment.
We identify a new failure mode in LLM-as-a-judge pipelines:
Rubric-Induced Preference Drift (RIPD)

Under RIPD:
πŸ‘‰ Benchmark scores remain stable
πŸ‘‰ Judge preferences shift on unseen domains
πŸ‘‰ Misalignment propagates downstream
No model updates.
No data manipulation.
Only rubric refinement.

Empirically, benchmark-compliant rubric edits reduce target accuracy by up to
πŸ“‰ 9.5% (helpfulness)
πŸ“‰ 27.9% (harmlessness)
while benchmark validation remains intact.

We further show that when such judges generate preference labels for downstream post-training (e.g., DPO / RLAIF), the induced bias is internalized in trained policies, leading to persistent behavior drift.
These results expose a structural gap between benchmark validation and cross-domain preference stability: rubric design is not merely a specification artifact, but a control variable in alignment dynamics.
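The core measurement behind these claims can be sketched in a few lines: score the same pairwise comparisons under the original and edited rubric, and compare each set of verdicts against trusted reference labels. This is a toy illustration, not the paper's code; `judge` is a deterministic stub (a real pipeline would call an LLM with the rubric), and the "prefer concise" edit is a hypothetical example of a seemingly benign refinement.

```python
def judge(rubric, prompt, resp_a, resp_b):
    """Stub standing in for an LLM judge call.

    A rubric edit telling the judge to 'prefer concise' answers flips
    its verdict whenever response B is shorter, mimicking the kind of
    directional drift a benchmark-compliant edit can induce.
    """
    if "prefer concise" in rubric and len(resp_b) < len(resp_a):
        return "B"
    return "A"


def agreement(rubric, pairs, reference):
    """Fraction of pairwise verdicts matching trusted reference labels."""
    hits = sum(
        judge(rubric, p, a, b) == ref
        for (p, a, b), ref in zip(pairs, reference)
    )
    return hits / len(pairs)


# Toy target-domain set: reference labels all say response A is better;
# half the B responses happen to be shorter than their A counterpart.
pairs = [(f"q{i}", "a" * 10, "b" * (5 if i % 2 == 0 else 20)) for i in range(10)]
reference = ["A"] * len(pairs)

drift = agreement("helpfulness rubric", pairs, reference) - agreement(
    "helpfulness rubric; prefer concise answers", pairs, reference
)
print(f"target-domain agreement drop: {drift:+.0%}")  # prints "+50%" drop
```

The point of the sketch is that the drift is invisible to any benchmark where response lengths do not vary this way: the edited rubric would score identically there, while agreement on the target domain quietly falls.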

πŸ“„ arXiv: https://arxiv.org/pdf/2602.13576
πŸ’» Code: https://github.com/ZDCSlab/Rubrics-as-an-Attack-Surface



Models citing this paper: 8
Datasets citing this paper: 1
Spaces citing this paper: 0
Collections including this paper: 1