Tokenizers documentation
Normalizers
ByteLevel
ByteLevel Normalizer
Converts all bytes in the input to their Unicode representation using the GPT-2 byte-to-unicode mapping. Every byte value (0–255) is mapped to a unique visible character so that any arbitrary binary input can be tokenized without needing a special unknown token.
This normalizer is used together with the ByteLevel pre-tokenizer and ByteLevel decoder.
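The byte-to-character table can be sketched in plain Python. This follows the widely circulated GPT-2 `bytes_to_unicode` recipe and is an illustration only, not the library's own code; `byte_level` is a hypothetical helper name.

```python
def bytes_to_unicode():
    """Map each byte value (0-255) to a unique, visible Unicode character."""
    # Bytes that already correspond to printable characters keep their code point.
    bs = list(range(0x21, 0x7F)) + list(range(0xA1, 0xAD)) + list(range(0xAE, 0x100))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Remaining bytes (controls, space, ...) are shifted above U+00FF.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


BYTE_TO_CHAR = bytes_to_unicode()


def byte_level(text):
    """Hypothetical helper: encode to UTF-8, then map every byte to its character."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


print(byte_level("Hi there"))   # HiĠthere  (the space byte 0x20 becomes U+0120)
print(byte_level("\u00e9"))     # Ã©        (two UTF-8 bytes, 0xC3 and 0xA9)
```

Because the table covers all 256 byte values, every input has a representation, which is why no unknown token is needed.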
Lowercase
Lowercase Normalizer
Converts all text to lowercase using Unicode-aware lowercasing. This is equivalent
to calling str.lower on the input.
NFC
NFC Unicode Normalizer
Applies Unicode NFC (Canonical Decomposition, followed by Canonical Composition) normalization. Characters are first decomposed, then recomposed using the canonical composition rules, which yields the canonical composed form.
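As a small illustration of the same Unicode form, Python's standard `unicodedata` module can be used to compose a decomposed sequence. This shows the normalization form itself and does not call the library.

```python
import unicodedata

decomposed = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 2 1
print(composed == "\u00e9")            # True
```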
NFD
NFD Unicode Normalizer
Applies Unicode NFD (Canonical Decomposition) normalization. Decomposes characters into
their canonical components. For example, an accented character such as é (U+00E9) is
decomposed into e (U+0065) + combining acute accent (U+0301).
This is often used as a first step before stripping accents with StripAccents.
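The decomposition can be inspected with Python's standard `unicodedata` module. This is a sketch of the Unicode form only, independent of the library.

```python
import unicodedata

text = "caf\u00e9"  # precomposed é (U+00E9)
decomposed = unicodedata.normalize("NFD", text)

print([f"U+{ord(ch):04X}" for ch in decomposed])
# ['U+0063', 'U+0061', 'U+0066', 'U+0065', 'U+0301']

# NFC reverses the decomposition for this input.
print(unicodedata.normalize("NFC", decomposed) == text)  # True
```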
NFKC
NFKC Unicode Normalizer
Applies Unicode NFKC (Compatibility Decomposition, followed by Canonical Composition)
normalization. Like NFC, but also maps compatibility characters to their compatibility
equivalents (for example, full-width letters to their ASCII counterparts). NFKC is the
form Python applies to identifiers, and it is used by many NLP pipelines.
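The difference between NFC and NFKC shows up on compatibility characters. The sketch below uses Python's standard `unicodedata` module rather than the library, and the sample characters are chosen for illustration.

```python
import unicodedata

samples = {
    "ligature fi": "\ufb01",       # U+FB01 LATIN SMALL LIGATURE FI
    "circled one": "\u2460",       # U+2460 CIRCLED DIGIT ONE
    "full-width A": "\uff21",      # U+FF21 FULLWIDTH LATIN CAPITAL LETTER A
    "e + combining acute": "e\u0301",
}

results = {}
for name, text in samples.items():
    nfc = unicodedata.normalize("NFC", text)
    nfkc = unicodedata.normalize("NFKC", text)
    results[name] = (nfc, nfkc)
    print(f"{name}: NFC changed={nfc != text}, NFKC={nfkc!r}")
```

NFC leaves the first three samples untouched, while NFKC rewrites them to `fi`, `1`, and `A`. Both forms compose the last sample into a single é.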
NFKD
NFKD Unicode Normalizer
Applies Unicode NFKD (Compatibility Decomposition) normalization. Like NFD but also
decomposes compatibility characters. For example, the ligature ﬁ (U+FB01) is
decomposed into f + i.
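A short comparison of NFD and NFKD on the same input, using Python's standard `unicodedata` module as a stand-in for the library, makes the extra compatibility step visible.

```python
import unicodedata

text = "\ufb01anc\u00e9"  # ligature ﬁ (U+FB01) + "anc" + precomposed é (U+00E9)

nfd = unicodedata.normalize("NFD", text)
nfkd = unicodedata.normalize("NFKD", text)

# NFD splits only the accent; the ligature is kept as one character.
print(len(nfd), nfd.startswith("\ufb01"))   # 6 True
# NFKD also splits the ligature into 'f' and 'i'.
print(len(nfkd), nfkd.startswith("fi"))     # 7 True
```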
Nmt
Nmt normalizer
Normalizer used in the Google NMT pipeline. It handles various text cleaning tasks including removing control characters, normalizing whitespace, and replacing certain Unicode characters. This is equivalent to the normalization done in the original SentencePiece NMT preprocessing.
Normalizer
Base class for all normalizers
This class is not supposed to be instantiated directly. Instead, any implementation of a Normalizer will return an instance of this class when instantiated.
normalize
( normalized )
Parameters
- normalized (NormalizedString) — The normalized string on which to apply this Normalizer
Normalize a NormalizedString in-place.
This method modifies a NormalizedString in-place, which keeps the alignment
information up to date. If you just want to see the result
of the normalization on a raw string, you can use
normalize_str().
normalize_str
Normalize the given string.
This method provides a way to visualize the effect of a
Normalizer, but it does not keep track of the alignment
information. If you need to get or convert offsets, you can use
normalize().
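To see why alignment information matters, the toy sketch below returns, next to the normalized text, the index of the original character that produced each output character. It is a plain-Python illustration of the idea and not how NormalizedString is implemented; `nfd_with_alignment` is a hypothetical helper.

```python
import unicodedata


def nfd_with_alignment(text):
    """Return (normalized, alignment).

    alignment[i] is the index in `text` of the character that produced
    normalized[i]. Normalizing one character at a time is a simplification:
    it ignores reordering of combining marks across characters.
    """
    out, alignment = [], []
    for index, ch in enumerate(text):
        for piece in unicodedata.normalize("NFD", ch):
            out.append(piece)
            alignment.append(index)
    return "".join(out), alignment


normalized, alignment = nfd_with_alignment("caf\u00e9!")
print(len(normalized))  # 6
print(alignment)        # [0, 1, 2, 3, 3, 4]
```

With the alignment list, an offset in the normalized text can be mapped back to the raw input. A function that returns only the string has lost that mapping.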
Precompiled
Precompiled normalizer
A normalizer that uses a precompiled character map built from a SentencePiece model.
This normalizer is automatically extracted from SentencePiece .model files and
should not be constructed manually — it is used internally for full compatibility
with SentencePiece-based tokenizers.
Replace
Replace normalizer
Replaces occurrences of a pattern in the input string with the given content.
The pattern can be either a plain string or a regular expression wrapped in
Regex.
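The distinction between a plain-string pattern and a regular expression can be sketched with Python's standard `re` module. The `replace` helper below is hypothetical and only mirrors the behaviour described above; the library uses its own Regex wrapper.

```python
import re


def replace(text, pattern, content):
    """Hypothetical helper: `pattern` is a plain str or a compiled regex."""
    if isinstance(pattern, re.Pattern):
        # A function replacement inserts `content` literally, with no
        # backslash or group-reference processing.
        return pattern.sub(lambda _match: content, text)
    # A plain string is matched literally, so '.' means a dot, not "any character".
    return text.replace(pattern, content)


print(replace("a.b.c", ".", "-"))                       # a-b-c
print(replace("a   b \t c", re.compile(r"\s+"), " "))   # a b c
```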
Sequence
Allows chaining multiple Normalizers together as a Sequence. The normalizers run one after another, in the order given.
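The chaining behaviour amounts to function composition. The sketch below models each normalizer as a plain `str -> str` callable; `sequence` and `nfkc` are hypothetical helpers, not library calls.

```python
import unicodedata


def sequence(*steps):
    """Hypothetical helper: build a normalizer that applies `steps` in order."""
    def run(text):
        for step in steps:
            text = step(text)
        return text
    return run


def nfkc(text):
    return unicodedata.normalize("NFKC", text)


normalizer = sequence(nfkc, str.lower, str.strip)

# Full-width H (U+FF28) and the ligature ﬁ (U+FB01) are folded by NFKC first,
# then the text is lowercased and trimmed.
print(normalizer("  \uff28ello \ufb01sh "))  # hello fish
```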
Strip
Strip normalizer
Removes leading and/or trailing whitespace from the input string.
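A plain-Python sketch of the "leading and/or trailing" choice is shown below. The `left` and `right` flags are assumptions made for this illustration, and Python's definition of whitespace may differ in detail from the library's.

```python
def strip(text, left=True, right=True):
    """Hypothetical helper: trim whitespace on the selected side(s)."""
    if left and right:
        return text.strip()
    if left:
        return text.lstrip()
    if right:
        return text.rstrip()
    return text


print(repr(strip("  hi  ")))               # 'hi'
print(repr(strip("  hi  ", right=False)))  # 'hi  '
print(repr(strip("  hi  ", left=False)))   # '  hi'
```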
StripAccents
StripAccents normalizer
Strips all accent marks (combining diacritical characters) from the input. This normalizer should typically be used after applying NFD or NFKD decomposition, which separates base characters from their combining accents.
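The dependence on a prior decomposition can be demonstrated with Python's standard `unicodedata` module. In this sketch, `strip_accents` is a hypothetical helper that drops characters in the Unicode Mark categories; the library's exact definition of an accent mark may differ.

```python
import unicodedata


def strip_accents(text):
    """Hypothetical helper: drop characters whose category is Mn, Mc or Me."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("M"))


precomposed = "caf\u00e9"                               # é as one code point
decomposed = unicodedata.normalize("NFD", precomposed)  # e + U+0301

# Without decomposition there is no separate combining mark to remove.
print(strip_accents(precomposed) == precomposed)  # True
# After NFD the accent is a separate character and is stripped.
print(strip_accents(decomposed))                  # cafe
```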
BertNormalizer
class tokenizers.normalizers.BertNormalizer
( clean_text = True, handle_chinese_chars = True, strip_accents = None, lowercase = True )
Parameters
- clean_text (bool, optional, defaults to True) — Whether to clean the text by removing any control characters and replacing all whitespace characters with a regular space.
- handle_chinese_chars (bool, optional, defaults to True) — Whether to handle Chinese characters by putting spaces around them.
- strip_accents (bool, optional) — Whether to strip all accents. If this option is not specified (i.e. it is None), it is determined by the value of lowercase (as in the original Bert).
- lowercase (bool, optional, defaults to True) — Whether to lowercase.
BertNormalizer
Takes care of normalizing raw text before giving it to a Bert model. This includes cleaning the text, handling accents and Chinese characters, and lowercasing.
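How the four options interact, in particular how `strip_accents` falls back to the value of `lowercase`, can be sketched in plain Python. `bert_normalize` is a hypothetical, simplified stand-in: the CJK check covers only the main ideograph block, and the cleaning rules are an approximation of the behaviour described above.

```python
import unicodedata


def _is_cjk(ch):
    # Assumption: only the main CJK Unified Ideographs block is checked here.
    return 0x4E00 <= ord(ch) <= 0x9FFF


def bert_normalize(text, clean_text=True, handle_chinese_chars=True,
                   strip_accents=None, lowercase=True):
    """Hypothetical sketch of the steps described for BertNormalizer."""
    if strip_accents is None:
        # Unspecified: follow the lowercase setting.
        strip_accents = lowercase
    if clean_text:
        cleaned = []
        for ch in text:
            if ch in "\t\n\r" or unicodedata.category(ch) == "Zs":
                cleaned.append(" ")           # every whitespace becomes a plain space
            elif ch == "\ufffd" or unicodedata.category(ch).startswith("C"):
                continue                      # drop control characters
            else:
                cleaned.append(ch)
        text = "".join(cleaned)
    if handle_chinese_chars:
        text = "".join(f" {ch} " if _is_cjk(ch) else ch for ch in text)
    if strip_accents:
        text = "".join(ch for ch in unicodedata.normalize("NFD", text)
                       if unicodedata.category(ch) != "Mn")
    if lowercase:
        text = text.lower()
    return text


print(bert_normalize("H\u00e9llo\tWorld"))                           # hello world
print(bert_normalize("H\u00e9llo", lowercase=False) == "H\u00e9llo")  # True
print(bert_normalize("a\u4e2db"))                                    # a 中 b
```

With `lowercase=False` and `strip_accents` left unspecified, accents are kept, because the accent setting follows the lowercase setting.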