# Training Details

This document provides detailed information about the training methodology used to develop the CAT-Translate models. Further details will be available in a technical report.

## Table of Contents

- [Training Data](#training-data)
- [Supervised Fine-Tuning](#supervised-fine-tuning)
- [Reinforcement Learning](#reinforcement-learning)
- [LoRA Configuration](#lora-configuration)

## Training Data

We synthesized parallel corpora from monolingual data using large language models. For generating translations, we used:

- **DeepSeek-V3**: Used for initial prototyping only; not used for the rest of development.
- **gpt-oss-20b**: Generated most of the data, providing sufficiently high quality for many instances.
- **gpt-oss-120b**: Used for domains where gpt-oss-20b was not satisfactory (e.g., scientific abstracts).

### Data Filtering

The synthesized data were filtered according to several criteria. We removed instances with:

- Text written mostly in languages other than Japanese and English
- A ratio of Japanese to English text lengths that was too large or too small
- Duplicated content, detected with MinHash ([Duplodocus](https://github.com/allenai/duplodocus))
- Low quality according to BLEU score and/or COMET score ([comet-qe](https://huggingface.co/Unbabel/wmt22-comet-da))
- Low quality identified by manual inspection
- Low quality caught by hand-written, rule-based filters derived from the manual inspection

## Supervised Fine-Tuning

We applied a two-stage fine-tuning approach.

### First Stage: Focus on Diversity

The first stage focused on the diversity of prompts. The dataset consisted of:

- Mostly web-crawled data with relatively low-quality translations
- A smaller portion from targeted domains, including:
  - Scientific abstracts (arXiv and PubMed)
  - Patents (USPTO)
- Mostly sentence-long instances, with some paragraph-long instances

**Key Finding**: We found that model performance mostly saturated with this corpus at around 100k instances.
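As a concrete illustration of the length-ratio criterion in the Data Filtering subsection above, such a check might be sketched as follows. The function name and thresholds are hypothetical placeholders, not the values actually used for CAT-Translate:

```python
def passes_length_ratio(ja_text: str, en_text: str,
                        low: float = 0.3, high: float = 3.0) -> bool:
    """Keep a sentence pair only if the Japanese/English character-length
    ratio falls within a plausible band.

    The thresholds here are illustrative placeholders; a real filter would
    tune them per domain and language pair.
    """
    if not ja_text or not en_text:
        return False
    ratio = len(ja_text) / len(en_text)
    return low <= ratio <= high
```

Pairs failing this check are discarded before the BLEU/COMET-based quality filtering.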
This led us to prepare a more challenging, higher-quality dataset for the second stage.

### Second Stage: Focus on Quality

The second stage focused on the quality of the generated translations. Key characteristics:

- A large portion of the data instances were generated by **gpt-oss-120b**
- Focus areas:
  - Scientific abstracts (arXiv and PubMed)
  - Patents (USPTO)
  - Underspecified or misspecified text (e.g., input containing typos)
- Most instances were paragraph-long to multiple paragraphs long
- Some data from the first-stage corpus was kept to maintain diversity

## Reinforcement Learning

We used the same corpus as in the second stage of SFT. The model was trained with **Multi-Objective GRPO** (Ichihara et al., 2025).

### Primary Reward Model: MetricX-24

We chose [MetricX-24](https://huggingface.co/google/metricx-24-hybrid-xl-v2p6-bfloat16) as our primary reward model for the following reasons:

- It is open source
- It is faster than LLM-as-a-Judge models
- It shows high agreement with human judgments

We also considered using gpt-oss-120b as a judge, which is highly accurate. However, it requires significantly more computational resources than were available under our constraints.

### MetricX Limitations

Like all reward models, MetricX has several misspecifications that generation models may exploit:

1. **Language-agnostic**: Being multilingual, it assigns scores regardless of the output language, even when the task requires generating Japanese text.
2. **Format-agnostic**: Syntactic characters such as newlines (`\n`) and markdown syntax (e.g., `*`, `#`) are ignored.
3. **Tolerant of hallucination**: MetricX is relatively tolerant of hallucination as long as the output text contains the information in the input text. This is not ideal for training language-model-based machine translation systems.

### Auxiliary Reward Functions

To remedy these problems, we implemented auxiliary reward functions:

#### 1. BLEU Score (Weight: 0.1)

Used to compute lexical overlap with the reference text.
Expected to be effective for:

- Avoiding overoptimization against the other reward model
- Rewarding accurate translation of technical terms

#### 2. Format Consistency

Outputs whose format differs too much from the input are penalized. This addresses the issue where models often generate markdown-formatted text even when the input is plain text.

#### 3. Length Penalty

Outputs that are too long or too short are penalized. This suppressed many of the hallucinations generated by the models.

### Reward Normalization Strategy

- **MetricX and BLEU** (weights 1.0 and 0.1): Applied with normalization to compute advantages.
  - **Rationale**: Translation quality is difficult to learn, and training pedagogically with relative advantage makes sense.
- **Format consistency and length penalty**: Applied as absolute values without normalization (as in Dr. GRPO).
  - **Rationale**: These objectives are easy to learn on their own and exist to keep the model from getting out of control. The penalties should be large enough to prevent violations regardless of any gains in translation quality, and the model should be able to learn them. Thus, we penalize with a large absolute value rather than a relative one.

## LoRA Configuration

We used LoRA (Low-Rank Adaptation) to reduce computational resource requirements.

| Model Size | LoRA Usage |
|------------|------------|
| 0.5B | Not used |
| 1B | For GRPO |
| 3B | For the second stage of SFT and GRPO |
| 7B | For all processes |

---

For the main project overview, see [README.md](README.md).
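
## Appendix: Reward Combination Sketch

As an illustration of the reward normalization strategy described in the Reinforcement Learning section, the combination of normalized quality rewards with absolute penalties might be sketched as follows. This is a simplified sketch with hypothetical function and argument names, not the actual training code:

```python
import statistics

def combined_advantages(metricx, bleu, format_pen, length_pen):
    """Combine rewards for one group of rollouts (all names hypothetical).

    metricx, bleu: per-rollout quality rewards.
    format_pen, length_pen: non-positive absolute penalties per rollout.
    """
    # Quality rewards are combined with weights 1.0 and 0.1, then
    # normalized within the group to obtain relative advantages.
    quality = [1.0 * m + 0.1 * b for m, b in zip(metricx, bleu)]
    mean = statistics.mean(quality)
    std = statistics.pstdev(quality) or 1.0  # guard against zero variance
    advantages = [(q - mean) / std for q in quality]

    # Format-consistency and length penalties are added as absolute values,
    # without normalization (as in Dr. GRPO), so a violation always hurts
    # regardless of relative translation quality.
    return [a + f + l for a, f, l in zip(advantages, format_pen, length_pen)]
```

With this split, a rollout that violates the format or length constraints receives a large negative term that cannot be offset by ranking well on translation quality within its group.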