TWGuard — LLM guardrail for the Taiwanese linguistic context


TWGuard is an inference-safety research project that proposes an LLM safety guardrail tailored to the Taiwanese linguistic context, addressing the safety and compliance requirements of deploying AI systems in production.

Using high-quality real-world corpora with human annotation, we present a guardrail model that reflects the Taiwanese linguistic context and can detect and block LLM prompts carrying harmful intent, such as crime, illegal professional advice, sexual content, and child sexual exploitation.

In our evaluations, the model significantly outperforms state-of-the-art (SOTA) guardrail models from AI labs including Nvidia, Google, Meta, OpenAI, and Alibaba.

Our research focuses on faithfully reflecting the Taiwanese linguistic context, so that the results transfer to real-world deployments:

  1. Context-driven design: most prior work targets generic Chinese support, yet real usage involves code-mixing, such as the Zhuyin (bopomofo) shorthand, internet slang, and mixed use of local languages common in Taiwan. Our tests show that TWGuard achieves better results than Chinese-supporting models at a lower data cost.
  2. Real-world robustness: TWGuard performs well on the ToxicChat in-the-wild corpus, whereas models trained on simulated or synthetic data often degrade on such benchmarks.
  3. Low false positive rate (FPR): TWGuard keeps its FPR well controlled. Because false positives are often far more costly to handle than false negatives, this metric is critical when selecting a guardrail for real-world deployment.

This repository was created to follow academic conventions, improve the reproducibility of the paper, and facilitate follow-up research. For details, technical specifics, and limitations, please refer to the sections below and the paper itself.

Usage

This repository provides a Parameter-Efficient Fine-Tuned (PEFT/LoRA) adapter for harmful-intent detection in the Taiwanese linguistic context.

Warning: may contain explicit and harmful examples.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

model_id = "meta-llama/Llama-Guard-3-8B"
adapter_id = "chunghwa-telecom/twguard"
device = "cuda"
dtype = torch.float16

# Load the Llama Guard 3 base model, then attach the TWGuard LoRA adapter.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=dtype, device_map=device)
model = PeftModel.from_pretrained(model, adapter_id)

def moderate(chat):
    """Run the guardrail on a chat and return the generated verdict text."""
    inputs = tokenizer.apply_chat_template(chat, return_tensors="pt", return_dict=True).to(device)
    input_ids = inputs["input_ids"]
    output = model.generate(**inputs, max_new_tokens=100, pad_token_id=0)
    # Decode only the newly generated tokens (the verdict), not the prompt.
    prompt_len = input_ids.shape[-1]
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)

# Context: Taiwan became the first place in Asia to legalize same-sex marriage in 2019.
# Prompt translation: "They can get married now. How many more years until gay people go extinct?"
result = moderate([
    {"role": "user", "content": "可以結婚了 還要幾年同性戀才會滅絕?"},
])

print(f"Result: {result}") # unsafe
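The raw verdict string can be turned into a structured result. A minimal sketch, assuming the Llama Guard-style output format ("safe", or "unsafe" followed by hazard codes such as "S10" on the next line); the `parse_verdict` helper below is ours and is not part of this repo:

```python
def parse_verdict(raw: str) -> dict:
    """Split a Llama Guard-style response into a verdict and category codes.

    Expected shapes are "safe", or "unsafe" followed by a comma-separated
    list of hazard codes (e.g. "S10") on the next line.
    """
    lines = [line.strip() for line in raw.strip().splitlines() if line.strip()]
    verdict = lines[0].lower() if lines else ""
    categories = lines[1].split(",") if verdict == "unsafe" and len(lines) > 1 else []
    return {"unsafe": verdict == "unsafe", "categories": categories}

print(parse_verdict("unsafe\nS10"))  # {'unsafe': True, 'categories': ['S10']}
```

This keeps the moderation decision machine-readable, e.g. for routing flagged prompts to human review.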

Evaluations

Taiwan linguistic context

We evaluate guardrail performance in the Taiwan linguistic context using a held-out evaluation split of 665 samples drawn from outside the training split of the data pool. The positive (unsafe) sample ratio is set to 20%. All models are evaluated in a zero-shot setting, i.e., without any in-context examples.

Model                 Precision  Recall  F1      FPR
ShieldGemma           1.0000     0.7519  0.8584  0.0000
LlamaGuard 3          0.8427     0.5639  0.6757  0.0263
GPT OSS Safeguard     0.7357     0.7744  0.7546  0.0695
NemoGuard             0.9625     0.5789  0.7230  0.0056
Qwen3 Guard (loose)   0.9397     0.8195  0.8755  0.0132
Qwen3 Guard (strict)  0.8627     0.9925  0.9231  0.0395
TWGuard (ours)        0.9921     0.9398  0.9653  0.0019

FPR = False Positive Rate (lower is better). All other metrics: higher is better.
Qwen3 Guard results are reported under two settings for its controversial flag, which controls whether borderline content is flagged as unsafe.
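The reported metrics follow the standard confusion-matrix definitions. A minimal sketch; the example counts below are derived by us to be consistent with TWGuard's row (665 samples, 20% positive) and are not taken from the paper:

```python
def guardrail_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard confusion-matrix metrics for a binary guardrail.

    Positives = prompts flagged unsafe. FPR counts safe prompts that were
    wrongly blocked, often the costliest error in production moderation.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}

# Derived counts: 125 unsafe prompts caught, 1 safe prompt wrongly flagged,
# 8 unsafe prompts missed, 531 safe prompts correctly passed.
m = guardrail_metrics(tp=125, fp=1, fn=8, tn=531)
print({k: round(v, 4) for k, v in m.items()})
# {'precision': 0.9921, 'recall': 0.9398, 'f1': 0.9653, 'fpr': 0.0019}
```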

ToxicChat

To assess whether Taiwan-specific fine-tuning degrades the base model's general moderation capability, we evaluate TWGuard on ToxicChat, an in-the-wild safety benchmark collected from real user interactions with ChatGPT and Vicuna.

Our fine-tuned model outperforms the foundation model on this out-of-distribution benchmark, suggesting that domain-specific fine-tuning on Taiwanese online discourse does not harm — and may slightly improve — general content moderation performance.

We hypothesize that both our Taiwan-specific corpus and ToxicChat share characteristics common to real-world unsafe content, such as informal wording, ambiguous intent, and jailbreak-like expressions. This structural similarity across languages may explain the retained — and slightly improved — performance.

Model                          F1 Score
Llama Guard 3 8B (foundation)  0.538
TWGuard (ours)                 0.645

The baseline score for the foundation model is taken from the Qwen3 Guard technical report. Higher is better.

Notice

License: Users should check the LICENSE file in this repo; the model must be used in accordance with the Meta Llama 3 Community License Agreement (Built on Meta Llama 3). This repository is intended for research reproducibility purposes only and is provided without warranty of any kind.

Contact: For inquiries, please reach out to the authors via the email addresses listed in the accompanying paper.

Citation

If you use TWGuard in your research, please cite:

@misc{chu2026twguardcasestudyllm,
      title={TWGuard: A Case Study of LLM Safety Guardrails for Localized Linguistic Contexts}, 
      author={Hua-Rong Chu and Kuan-Chun Wang and Yao-Te Huang},
      year={2026},
      eprint={2604.16542},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2604.16542}, 
}