arxiv:2602.13576

Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges

Published on Feb 14 · Submitted by Ruomeng Ding on Feb 23
Abstract

LLM-based judges using natural-language rubrics for evaluation can exhibit systematic preference drift from minor rubric modifications, which can be exploited to manipulate alignment pipelines and degrade model performance.

AI-generated summary

Evaluation and alignment pipelines for large language models increasingly rely on LLM-based judges, whose behavior is guided by natural-language rubrics and validated on benchmarks. We identify a previously under-recognized vulnerability in this workflow, which we term Rubric-Induced Preference Drift (RIPD). Even when rubric edits pass benchmark validation, they can still produce systematic and directional shifts in a judge's preferences on target domains. Because rubrics serve as a high-level decision interface, such drift can emerge from seemingly natural, criterion-preserving edits and remain difficult to detect through aggregate benchmark metrics or limited spot-checking. We further show that this vulnerability can be exploited through rubric-based preference attacks, in which benchmark-compliant rubric edits steer judgments away from a fixed human or trusted reference on target domains, systematically inducing RIPD and reducing target-domain accuracy by up to 9.5% (helpfulness) and 27.9% (harmlessness). When these judgments are used to generate preference labels for downstream post-training, the induced bias propagates through alignment pipelines and becomes internalized in trained policies, leading to persistent and systematic drift in model behavior. Overall, our findings highlight evaluation rubrics as a sensitive and manipulable control interface, revealing a system-level alignment risk that extends beyond evaluator reliability alone. The code is available at: https://github.com/ZDCSlab/Rubrics-as-an-Attack-Surface. Warning: Certain sections may contain potentially harmful content that may not be appropriate for all readers.
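To make the propagation path concrete: when a judge's pairwise verdicts are converted into preference data, any rubric-induced bias lands directly in the training pairs. The following is a minimal sketch of that conversion step, not the paper's implementation; `build_dpo_pairs` and `biased_judge` are hypothetical names, and the stub judge simply stands in for an LLM judge operating under a manipulated rubric.

```python
def build_dpo_pairs(prompts, candidates, judge):
    """Turn pairwise judge verdicts into (chosen, rejected) DPO examples.

    If the judge is biased by a manipulated rubric, that bias is baked
    directly into every training pair it emits.
    """
    data = []
    for prompt, (a, b) in zip(prompts, candidates):
        verdict = judge(prompt, a, b)  # 'A' or 'B'
        chosen, rejected = (a, b) if verdict == "A" else (b, a)
        data.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return data


# Stub standing in for an LLM judge under an edited rubric:
# it always prefers the second response, regardless of content.
biased_judge = lambda prompt, a, b: "B"

pairs = build_dpo_pairs(
    ["q1", "q2"], [("safe", "risky"), ("long", "short")], biased_judge
)
print(pairs[0])  # {'prompt': 'q1', 'chosen': 'risky', 'rejected': 'safe'}
```

Every downstream optimizer that consumes these pairs then treats the biased verdicts as ground truth, which is exactly how the drift becomes internalized in the trained policy.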

Community

Paper submitter

πŸ”₯ Evaluation rubrics can silently alter alignment.
We identify a new failure mode in LLM-as-a-judge pipelines:
Rubric-Induced Preference Drift (RIPD)

Under RIPD:
πŸ‘‰ Benchmark scores remain stable
πŸ‘‰ Judge preferences shift on unseen domains
πŸ‘‰ Misalignment propagates downstream
No model updates.
No data manipulation.
Only rubric refinement.

Empirically, benchmark-compliant rubric edits reduce target accuracy by up to
πŸ“‰ 9.5% (helpfulness)
πŸ“‰ 27.9% (harmlessness)
while benchmark validation remains intact.

We further show that when such judges generate preference labels for downstream post-training (e.g., DPO / RLAIF), the induced bias is internalized in trained policies, leading to persistent behavior drift.
These results expose a structural gap between benchmark validation and cross-domain preference stability: rubric design is not merely a specification artifact, but a control variable in alignment dynamics.
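The core measurement behind these claims can be sketched in a few lines: score the same pairwise comparisons under the original and edited rubric, and compare each set of verdicts against trusted reference labels. This is a toy illustration, not the paper's code; `judge` is a deterministic stub (a real pipeline would call an LLM with the rubric), and the "prefer concise" edit is a hypothetical example of a seemingly benign refinement.

```python
def judge(rubric, prompt, resp_a, resp_b):
    """Stub standing in for an LLM judge call.

    A rubric edit telling the judge to 'prefer concise' answers flips
    its verdict whenever response B is shorter, mimicking the kind of
    directional drift a benchmark-compliant edit can induce.
    """
    if "prefer concise" in rubric and len(resp_b) < len(resp_a):
        return "B"
    return "A"


def agreement(rubric, pairs, reference):
    """Fraction of pairwise verdicts matching trusted reference labels."""
    hits = sum(
        judge(rubric, p, a, b) == ref
        for (p, a, b), ref in zip(pairs, reference)
    )
    return hits / len(pairs)


# Toy target-domain set: reference labels all say response A is better;
# half the B responses happen to be shorter than their A counterpart.
pairs = [(f"q{i}", "a" * 10, "b" * (5 if i % 2 == 0 else 20)) for i in range(10)]
reference = ["A"] * len(pairs)

drift = agreement("helpfulness rubric", pairs, reference) - agreement(
    "helpfulness rubric; prefer concise answers", pairs, reference
)
print(f"target-domain agreement drop: {drift:+.0%}")  # prints "+50%" drop
```

The point of the sketch is that the drift is invisible to any benchmark where response lengths do not vary this way: the edited rubric would score identically there, while agreement on the target domain quietly falls.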

πŸ“„ arXiv: https://arxiv.org/pdf/2602.13576
πŸ’» Code: https://github.com/ZDCSlab/Rubrics-as-an-Attack-Surface



Models citing this paper: 8
Datasets citing this paper: 1
Spaces citing this paper: 0
Collections including this paper: 1