# legal-reference-extraction-base-de

A fine-tuned EuroBERT-210m encoder for German legal citation extraction: detecting references to laws (e.g. § 823 BGB, Art. 14 GG) and court decisions (e.g. 1 BvR 123/89) in German legal text via BIO token classification. This model is the default transformer backbone for the `refex` library.
## Task

Token classification over 5 BIO labels:
| id | label | meaning |
|---|---|---|
| 0 | O | Outside any citation |
| 1 | B-LAW_REF | Beginning of a law-citation span |
| 2 | I-LAW_REF | Inside a law-citation span |
| 3 | B-CASE_REF | Beginning of a case-citation span |
| 4 | I-CASE_REF | Inside a case-citation span |
Output spans can be consumed directly or routed through refex's `TransformerExtractor`, which assembles them into typed `LawCitation` / `CaseCitation` objects with `span`, `book`, `number`, `court`, `file_number`, `date` fields.
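For consumers working with the raw label sequence, the BIO scheme above decodes into spans in a few lines. A minimal sketch (`decode_bio` is illustrative, not part of refex; indices refer to whatever tokenisation produced the labels):

```python
def decode_bio(labels):
    """Decode a BIO label sequence into (start, end, type) spans.

    `labels` is a list like ["O", "B-LAW_REF", "I-LAW_REF", ...];
    spans are half-open token-index ranges.
    """
    spans = []
    start = current = None
    for i, label in enumerate(labels):
        if label.startswith("B-"):
            if start is not None:  # close the previous span
                spans.append((start, i, current))
            start, current = i, label[2:]
        elif label.startswith("I-") and current == label[2:]:
            continue  # span continues
        else:  # "O" or an inconsistent I- tag closes any open span
            if start is not None:
                spans.append((start, i, current))
            start = current = None
    if start is not None:  # span running to the end of the sequence
        spans.append((start, len(labels), current))
    return spans
```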
## Evaluation
Evaluated on an unreleased benchmark of 1,009 German court decisions (a fixed, held-out test split). Numbers are reported with the benchmark's span-level F1 metric: *exact* requires character-perfect (start, end) agreement with the gold annotation; *overlap* requires any character-level intersection.
| Engine | span F1 (exact) | span F1 (overlap) | Law F1 (overlap) | Case F1 (overlap) | Throughput (docs/s) | Median ms/doc |
|---|---|---|---|---|---|---|
| regex baseline (CPU) | 0.737 | 0.860 | 0.872 | 0.828 | 455.9 | 1.1 |
| regex + CRF (CPU) | 0.740 | 0.878 | 0.891 | 0.846 | 106.4 | 6.4 |
| this model (MPS) | 0.533 | 0.909 | 0.932 | 0.855 | 1.5 | 467.4 |
| regex + this model (MPS) | 0.743 | 0.889 | 0.905 | 0.849 | 1.5 | 467.3 |
### How to read the two span-F1 columns
- Span exact is character-perfect. A pure transformer works at whitespace-word granularity, so its span boundaries rarely match an annotator's character-level trimming (trailing punctuation, enclosing parens, etc.), which is why exact F1 is lower than overlap.
- Span overlap is the right metric for "did we locate a citation in the right place", which is what matters when a downstream step re-parses the span into structured fields.
Headline: on overlap F1 this model beats the regex baseline by +4.9 pp and the regex + CRF ensemble by +3.1 pp, driven primarily by recall on law citations (Law overlap F1 +6.0 pp over regex). Ensembling regex + this model gives the best exact F1 (0.743) — the right choice when you need precise character boundaries as well as recall.
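The two matching criteria can be stated precisely as predicates on half-open (start, end) character offsets. A minimal illustration (the helper names are ours, not the benchmark's code):

```python
def exact_match(pred, gold):
    """Character-perfect agreement: both offsets must be identical."""
    return pred == gold

def overlap_match(pred, gold):
    """Any character-level intersection between the two half-open ranges."""
    return pred[0] < gold[1] and gold[0] < pred[1]
```

A prediction that includes one trailing punctuation character, e.g. (0, 17) against gold (0, 16), fails exact matching but still counts under overlap, which is why the transformer's exact column trails its overlap column.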
### Training-time validation metrics
During training, seqeval span-level F1 on a held-out validation split reached 0.8743 (precision 0.839, recall 0.913) after three epochs, and losses converged smoothly (eval_loss 0.0344 → 0.0232). Note that seqeval scores spans on the transformer's own tokenisation; the benchmark numbers above re-align predictions to character spans for the engine comparison.
## Inference speed
All numbers above are Apple Silicon MPS, single document, batch
size 1. On CUDA with batching, expect 100–500× higher
throughput (base-sized EuroBERT comfortably hits 1 000+ docs/s
on a single modern GPU). For CPU-only deployments where sub-10 ms
latency per document matters, stick with refex's regex or
regex + CRF engines — this model is recall-first, not
latency-first.
## Usage

### Via the refex library (recommended)

```shell
pip install legal-reference-extraction[transformers]
```

```python
from refex.engines.transformer import TransformerExtractor

extractor = TransformerExtractor(
    model="openlegaldata/legal-reference-extraction-base-de",
    device="mps",  # or "cuda" / "cpu"
)

citations, relations = extractor.extract(
    "Gemäß § 823 Abs. 1 BGB haftet der Schädiger. "
    "Vgl. auch BVerfG, Urt. v. 12.03.1990 – 1 BvR 123/89."
)

for c in citations:
    print(c.type, c.span.text, "->", getattr(c, "book", None) or getattr(c, "court", None))
```
### Via transformers directly

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained(
    "openlegaldata/legal-reference-extraction-base-de",
    trust_remote_code=True,
)
model = AutoModelForTokenClassification.from_pretrained(
    "openlegaldata/legal-reference-extraction-base-de",
    trust_remote_code=True,
)

ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

text = "Nach § 242 BGB sowie Art. 2 GG ist dies zu beachten."
for span in ner(text):
    print(span)
```
`trust_remote_code=True` is required because EuroBERT ships custom modeling code (`modeling_eurobert.py`, `configuration_eurobert.py`).
## Intended use

- German legal document processing: extracting law (§ / Art.) and case-number references as structured spans for downstream linking, indexing, or redaction.
- Drop-in backbone for the `refex` library's `TransformerExtractor`.
- Research into German legal NLP.
## Out of scope
- Languages other than German. The base model is multilingual but fine-tuning targets German legal prose specifically; performance on other languages is not evaluated.
- Non-legal domains.
- Production where low-latency CPU inference matters (see the speed table above) — use the regex engine.
- Any commercial use (see license).
## Limitations
- Span boundaries are emitted at the transformer's token granularity, not the character level. Exact-match F1 is correspondingly lower than overlap F1; if you need character-perfect boundaries, post-process with the regex engine (see the ensemble row above).
- Confidence is a single label-argmax per token; no calibration has been performed.
- Model inherits any biases from the EuroBERT-210m pre-training corpus.
- Rare law codes, obscure court formats, and historically unusual citation formats will be under-represented relative to the high-frequency patterns in modern German court decisions.
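The boundary limitation above can often be mitigated with simple character-level trimming before exact-match comparison. A hypothetical sketch (not refex's ensemble logic; the trim character set is an assumption about typical annotation conventions):

```python
# Characters annotators typically exclude from citation span edges
# (whitespace, trailing punctuation, enclosing brackets) -- an assumption,
# not the benchmark's actual gold-annotation rule.
TRIM = " \t\n.,;:()[]"

def trim_span(text, start, end):
    """Shrink the half-open range [start, end) so it excludes
    leading/trailing characters from TRIM."""
    while start < end and text[start] in TRIM:
        start += 1
    while end > start and text[end - 1] in TRIM:
        end -= 1
    return start, end
```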
## License
CC BY-NC 4.0 — Creative Commons Attribution-NonCommercial 4.0 International.
You may redistribute and adapt this model for non-commercial use provided you give appropriate attribution. For commercial licensing inquiries, contact Open Legal Data.
The underlying EuroBERT-210m base model is distributed under Apache-2.0; this fine-tuned checkpoint is a derivative work and the CC BY-NC 4.0 terms here apply to the fine-tuned weights and model card.
## Citation

```bibtex
@inproceedings{10.1145/3383583.3398616,
  author = {Ostendorff, Malte and Blume, Till and Ostendorff, Saskia},
  title = {Towards an Open Platform for Legal Information},
  year = {2020},
  isbn = {9781450375856},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3383583.3398616},
  doi = {10.1145/3383583.3398616},
  booktitle = {Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020},
  pages = {385--388},
  numpages = {4},
  keywords = {open data, open source, legal information system, legal data},
  location = {Virtual Event, China},
  series = {JCDL '20}
}
```
## Contact
Issues and feedback: https://github.com/openlegaldata/legal-reference-extraction/issues