padding_side discrepancy

#84

by Muennighoff - opened Aug 15, 2022

BigScience Workshop org Aug 15, 2022

PreTrainedTokenizerFast(name_or_path='bigscience/tokenizer', vocab_size=250680, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'})

PreTrainedTokenizerFast(name_or_path='bigscience/bloom', vocab_size=250680, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'})

I think the two should be the same, no? cc @ybelkada

ybelkada

BigScience Workshop org Aug 15, 2022

Yeah, you are right: the padding_side of bigscience/tokenizer should be set to left. I opened a PR at: https://huggingface.co/bigscience/tokenizer/discussions/3
Feel free to merge it ;)

TimeRobber changed discussion status to closed Jan 27, 2023

PaulLerner

Jan 18, 2024

padding_side='left'? If it's a left-to-right LM, the padding should be on the right. Here's the current behavior:

In [72]: tokenizer(["foo","foo bar baz"],return_tensors="pt", padding="longest")['input_ids']
Out[72]: 
tensor([[    3,     3, 27988],
        [27988,  2879, 20300]])
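To make the two options concrete, here is a minimal pure-Python sketch of what padding to the longest sequence does on each side. It is not the tokenizers library's implementation; `pad_batch` is a hypothetical helper, and the token ids and pad id 3 are simply copied from the output above.

```python
from typing import List, Sequence, Tuple


def pad_batch(
    sequences: Sequence[Sequence[int]], pad_id: int, side: str = "left"
) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad every sequence to the longest one; return (input_ids, attention_mask).

    Hypothetical helper for illustration only, not a transformers API.
    """
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        pads, zeros = [pad_id] * n_pad, [0] * n_pad
        ones = [1] * len(seq)  # 1 marks a real token, 0 marks padding
        if side == "left":
            input_ids.append(pads + list(seq))
            attention_mask.append(zeros + ones)
        else:
            input_ids.append(list(seq) + pads)
            attention_mask.append(ones + zeros)
    return input_ids, attention_mask


# Ids taken from the output above: "foo" -> [27988], "foo bar baz" -> [27988, 2879, 20300]
batch = [[27988], [27988, 2879, 20300]]
left_ids, left_mask = pad_batch(batch, pad_id=3, side="left")
right_ids, right_mask = pad_batch(batch, pad_id=3, side="right")
print(left_ids)   # [[3, 3, 27988], [27988, 2879, 20300]]
print(right_ids)  # [[27988, 3, 3], [27988, 2879, 20300]]
```

With left padding, the last column of every row is a real token; with right padding, the shorter row ends in pad tokens. That difference is the usual argument for left padding when a batch is passed to generation, since new tokens are appended after the last position, though whether it is the right default here is exactly what is being asked.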

PaulLerner

Jan 23, 2024

If left padding is correct, the same setting is also needed in https://huggingface.co/bigscience/bloomz-7b1-mt/blob/main/tokenizer_config.json
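For anyone who wants to check a local copy of a tokenizer_config.json, a rough sketch follows. The JSON strings are made-up examples, not the contents of the file linked above, and `padding_side_from_config` is a hypothetical helper. The fallback of "right" when the key is absent is an assumption based on the padding_side='right' shown for bigscience/tokenizer at the top of this thread.

```python
import json


def padding_side_from_config(config_text: str, default: str = "right") -> str:
    """Return the padding_side declared in a tokenizer config, or `default` if unset.

    Hypothetical helper; the default value is an assumption, see the note above.
    """
    config = json.loads(config_text)
    return config.get("padding_side", default)


# Made-up configs for illustration only.
with_key = '{"pad_token": "<pad>", "padding_side": "left"}'
without_key = '{"pad_token": "<pad>"}'

print(padding_side_from_config(with_key))     # left
print(padding_side_from_config(without_key))  # right
```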
