
tanaos-text-anonymizer-v1: A small but performant Text Anonymization model

This model was created by Tanaos with the Artifex Python library.

This is a Named Entity Recognition model based on tanaos/tanaos-NER-v1 and fine-tuned on a synthetic dataset to recognize Personally Identifiable Information (PII) entities in text. Once identified, the entities can be redacted to ensure privacy and confidentiality before the text is shared or processed.

While the base NER model was trained to recognize 14 named entity categories, this Text Anonymization model was fine-tuned to focus on the following 5 key PII entity categories, which are commonly found in text data and are critical for anonymization:

  • PERSON: Individual people, fictional characters
  • LOCATION: Geographical areas
  • DATE: Absolute or relative dates, including years, months, and/or days
  • ADDRESS: Full addresses
  • PHONE_NUMBER: Telephone numbers

How to Use

This model can be used in one of two ways:

Via the Artifex library (pip install artifex)

Using this model through our Artifex Python library, Personally Identifiable Information (PII) isn't just detected: it is automatically redacted from the text and replaced with a placeholder.

from artifex import Artifex

ta = Artifex().text_anonymization

print(ta("John Doe lives at 123 Main St, New York. His phone number is (555) 123-4567."))
# >>> ["[MASKED] lives at [MASKED]. His phone number is [MASKED]."]

Via the Transformers library

Using this model through the transformers library, Personally Identifiable Information (PII) is only identified, not automatically redacted; you will have to implement your own redaction logic (a sketch follows the snippet below).

from transformers import pipeline

ta = pipeline(
    task="token-classification",
    model="tanaos/tanaos-text-anonymizer-v1",
    aggregation_strategy="first"
)

print(ta("John Doe lives at 123 Main St, New York. His phone number is (555) 123-4567."))
# >>> [{'entity_group': 'PERSON', 'score': 0.90219176, 'word': 'John Doe', 'start': 0, 'end': 8}, {'entity_group': 'ADDRESS', 'score': 0.9522348, 'word': ' 123 Main St,', 'start': 18, 'end': 30}, {'entity_group': 'LOCATION', 'score': 0.97109795, 'word': ' New York.', 'start': 31, 'end': 40}, {'entity_group': 'PHONE_NUMBER', 'score': 0.9054972, 'word': ' (555) 123-4567.', 'start': 61, 'end': 76}]
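
Since the pipeline only returns entity spans, a minimal redaction sketch could look like the following. It reuses the ta pipeline from the snippet above; the redact helper and the [MASKED] placeholder are illustrative choices, not part of the model's API, and rely only on the start/end character offsets in the pipeline output.

def redact(text, entities, placeholder="[MASKED]"):
    # Replace each detected span with the placeholder, working from the last
    # span to the first so that earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + placeholder + text[ent["end"]:]
    return text

sentence = "John Doe lives at 123 Main St, New York. His phone number is (555) 123-4567."
print(redact(sentence, ta(sentence)))
# >>> with the spans shown above: [MASKED] lives at [MASKED] [MASKED] His phone number is [MASKED]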

Fine-tune without training data on different languages or domains

Do you want to tailor the model to a specific language or domain? Use the Artifex library to fine-tune it by generating synthetic training data on the fly.

from artifex import Artifex

ta = Artifex().text_anonymization

model_output_path = "./output_model/"

ta.train(
    domain="documentos medicos en Español",
    output_path=model_output_path
)

ta.load(model_output_path)
print(ta("El paciente John Doe visitó Nueva York el 12 de marzo de 2023 a las 10:30 a. m."))

# >>> ["El paciente [MASKED] visitó [MASKED] el [MASKED] a las [MASKED]."]

Model Description

  • Base model: FacebookAI/roberta-base
  • Task: Token classification (Named Entity Recognition for Text Anonymization)
  • Languages: English (see the fine-tuning section to adapt to other languages)
  • Fine-tuning data: A synthetic, custom dataset of around 10,000 passages, each containing multiple named entities across the 5 Personally Identifiable Information (PII) categories listed above.
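
For lower-level access than the pipeline API shown above, the checkpoint can also be loaded with the standard transformers token-classification classes. A minimal sketch; the label set is read from the model's own config rather than assumed here:

from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("tanaos/tanaos-text-anonymizer-v1")
model = AutoModelForTokenClassification.from_pretrained("tanaos/tanaos-text-anonymizer-v1")

# Inspect the labels the fine-tuned classification head predicts
print(model.config.id2label)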

Training Details

This model was trained using the Artifex Python library

pip install artifex

by providing the following instructions and generating 10,000 synthetic training samples:

from artifex import Artifex

ta = Artifex().text_anonymization

ta.train(
    domain="general",
    num_samples=10000
)

Intended Uses

This model is intended to:

  • Anonymize text data by redacting personally identifiable information (PII) such as names, addresses, phone numbers, dates, and locations.
  • Ensure privacy and confidentiality in text data for compliance with data protection regulations.
  • Be used before sharing or processing text data to protect sensitive information (see the sketch after this list).
  • Support GDPR compliance when handling personal data.
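
As an example of the pre-sharing use case, a plausible batch sketch with the Artifex interface (the records list is hypothetical; as in the examples above, ta(text) returns a list whose first element is the redacted text):

from artifex import Artifex

ta = Artifex().text_anonymization

# Hypothetical free-text records to anonymize before they are shared
records = [
    "Contact Jane Roe at (555) 987-6543.",
    "Meeting with John Doe on 12 March 2023 at 123 Main St, New York.",
]

anonymized = [ta(text)[0] for text in records]
print(anonymized)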

Not intended for:

  • Scenarios involving highly specialized or domain-specific text without further fine-tuning.