TWGuard — LLM guardrail for the Taiwanese linguistic context
TWGuard is an inference-safety research project that proposes an LLM safety guardrail suited to the Taiwanese linguistic context, addressing the safety and compliance requirements of deploying AI technology in practice.
Built from high-quality real-world corpora with human annotation, TWGuard is a guardrail model that reflects the Taiwanese linguistic context and detects and intercepts LLM prompts carrying harmful intent, such as crime, illegal professional advice, pornography, and child sexual exploitation.
In our evaluations, the model significantly outperforms the best (SOTA) guardrail models from AI labs including Nvidia, Google, Meta, OpenAI, and Alibaba.
Our research emphasizes faithfully reflecting the Taiwanese linguistic context so that the results carry over to real deployments:
- Context-aware design: Most prior work focuses on general Chinese-language support, yet real usage involves code-mixing, such as the Zhuyin (bopomofo) shorthand, internet slang, and local languages commonly mixed together in Taiwan. Our tests show that TWGuard achieves better results than Chinese-supporting models at a lower data cost.
- Real-world validity: TWGuard evaluates well on ToxicChat, a corpus of real user conversations, whereas models trained on simulated synthetic corpora often suffer performance degradation in such settings.
- Low false positive rate (FPR): TWGuard keeps its FPR well controlled. Because false positives are often far more costly to handle than false negatives, this metric is critical for technology selection in production.
This repo was created to follow academic conventions, improve the paper's reproducibility, and facilitate follow-up research. For full details, technical specifics, and limitations, please refer to the text below and the paper itself.
- Paper: TWGuard: A Case Study of LLM Safety Guardrails for Localized Linguistic Contexts
- Developers: Chunghwa Telecom Laboratories (中華電信研究院). For any questions, please contact the paper authors via email.
Usage
This repository provides a Parameter-Efficient Fine-Tuned (PEFT/LoRA) adapter for harmful-intent detection in the Taiwanese linguistic context.
Warning: the examples below may contain explicit and harmful content.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

model_id = "meta-llama/Llama-Guard-3-8B"
adapter_id = "chunghwa-telecom/twguard"
device = "cuda"
dtype = torch.float16

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype, device_map=device)
model = PeftModel.from_pretrained(model, adapter_id)

def moderate(chat):
    inputs = tokenizer.apply_chat_template(chat, return_tensors="pt", return_dict=True).to(device)
    prompt_len = inputs["input_ids"].shape[-1]
    output = model.generate(**inputs, max_new_tokens=100, pad_token_id=0)
    # Decode only the newly generated tokens, i.e. the model's verdict.
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)

# Context: Taiwan became the first place in Asia to legalize same-sex marriage in 2019.
result = moderate([
    {"role": "user", "content": "可以結婚了 還要幾年同性戀才會滅絕?"},
])
print(f"Result: {result}")  # unsafe
```
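Llama Guard-style models return their verdict as generated text: `safe`, or `unsafe` followed by the violated category codes on the next line. For downstream use, a small parser can turn that string into a structured result. This is an illustrative sketch, not part of the released code; it only assumes the `safe`/`unsafe` first-line convention, and the exact category codes depend on the model's taxonomy.

```python
def parse_verdict(result: str):
    """Parse a guardrail completion like 'safe' or 'unsafe\nS10'.

    Returns (is_unsafe, categories): a bool and a list of category codes.
    """
    lines = [line.strip() for line in result.strip().splitlines() if line.strip()]
    if not lines:
        return False, []  # treat an empty completion as safe by default
    is_unsafe = lines[0].lower() == "unsafe"
    # Category codes (e.g. "S1,S10") follow on the next line when unsafe.
    categories = lines[1].split(",") if is_unsafe and len(lines) > 1 else []
    return is_unsafe, [c.strip() for c in categories]
```

For example, `parse_verdict(moderate(chat))` yields `(True, ["S10"])` for a completion of `"unsafe\nS10"` and `(False, [])` for `"safe"`.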
Evaluations
Taiwan linguistic context
We evaluate guardrail performance in the Taiwan linguistic context using a held-out evaluation split of 665 samples drawn from outside the training split of the data pool. The positive sample ratio is set to 20%. All models are evaluated in a zero-shot setting, meaning each model operates without any in-context examples.
| Model | Precision | Recall | F1 | FPR |
|---|---|---|---|---|
| ShieldGemma | 1.0000 | 0.7519 | 0.8584 | 0.0000 |
| LlamaGuard 3 | 0.8427 | 0.5639 | 0.6757 | 0.0263 |
| GPT OSS Safeguard | 0.7357 | 0.7744 | 0.7546 | 0.0695 |
| NemoGuard | 0.9625 | 0.5789 | 0.7230 | 0.0056 |
| Qwen3 Guard (loose) | 0.9397 | 0.8195 | 0.8755 | 0.0132 |
| Qwen3 Guard (strict) | 0.8627 | 0.9925 | 0.9231 | 0.0395 |
| TWGuard (ours) | 0.9921 | 0.9398 | 0.9653 | 0.0019 |
FPR = False Positive Rate (lower is better). All other metrics: higher is better.
Qwen3 Guard results are reported under two settings for its controversial-content flag, which controls whether borderline content is flagged as unsafe.
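The four metrics above follow their standard confusion-matrix definitions. As an illustrative sanity check, the TWGuard row can be reproduced from raw counts inferred from the reported split (665 samples at 20% positive, i.e. 133 unsafe and 532 safe prompts); these counts are reconstructed for illustration, not taken from the paper:

```python
def guardrail_metrics(tp: int, fp: int, fn: int, tn: int):
    """Compute precision, recall, F1, and false positive rate from a confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # a.k.a. true positive rate
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)     # fraction of safe prompts wrongly flagged
    return precision, recall, f1, fpr

# Inferred TWGuard counts: 125 of 133 unsafe prompts caught,
# 1 of 532 safe prompts wrongly flagged.
p, r, f1, fpr = guardrail_metrics(tp=125, fp=1, fn=8, tn=531)
print(round(p, 4), round(r, 4), round(f1, 4), round(fpr, 4))
# → 0.9921 0.9398 0.9653 0.0019, matching the TWGuard row
```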
ToxicChat
To assess whether Taiwan-specific fine-tuning degrades the base model's general moderation capability, we evaluate TWGuard on ToxicChat, an in-the-wild safety benchmark collected from real user interactions with ChatGPT and Vicuna.
Our fine-tuned model outperforms the foundation model on this out-of-distribution benchmark, suggesting that domain-specific fine-tuning on Taiwanese online discourse does not harm — and may slightly improve — general content moderation performance.
We hypothesize that both our Taiwan-specific corpus and ToxicChat share characteristics common to real-world unsafe content, such as informal wording, ambiguous intent, and jailbreak-like expressions. This structural similarity across languages may explain the retained — and slightly improved — performance.
| Model | F1 Score |
|---|---|
| Llama Guard 3 8B (foundation) | 0.538 |
| TWGuard (ours) | 0.645 |
Baseline score for the foundation model is sourced from the Qwen3 Guard technical report. Higher is better.
Notice
License: Users should check the LICENSE file in this repo; use is also governed by the Meta Llama 3 Community License Agreement (Built with Meta Llama 3). This repository is intended for research reproducibility purposes only and is provided without warranty of any kind.
Contact: For inquiries, please reach out to the authors via the email addresses listed in the accompanying paper.
Citation
If you use TWGuard in your research, please cite:
```bibtex
@misc{chu2026twguardcasestudyllm,
  title={TWGuard: A Case Study of LLM Safety Guardrails for Localized Linguistic Contexts},
  author={Hua-Rong Chu and Kuan-Chun Wang and Yao-Te Huang},
  year={2026},
  eprint={2604.16542},
  archivePrefix={arXiv},
  primaryClass={cs.CR},
  url={https://arxiv.org/abs/2604.16542},
}
```