Bloom Tokenization

#269

by Niazi - opened Nov 3, 2023


I am working with a low-resource language that makes up only a small portion of the ROOTS corpus, on which BLOOM was trained.

When I checked how this language is tokenized, it failed: I went through the language's entire vocabulary, and the BLOOM tokenizer was not able to tokenize it.
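One way to put a number on this is the average token count per word. The sketch below is only an illustration under stated assumptions: `byte_fallback_tokenize` is a hypothetical stand-in that emits one token per UTF-8 byte, roughly the worst case for a byte-level BPE tokenizer that has learned no merges for a script, and the sample words are placeholders, not my actual data. Any callable that maps a string to a list of tokens can be passed instead, for example the `tokenize` method of a tokenizer loaded with `AutoTokenizer.from_pretrained("bigscience/bloom")`.

```python
from typing import Callable, Iterable, List


def byte_fallback_tokenize(word: str) -> List[str]:
    """Hypothetical stand-in tokenizer: one token per UTF-8 byte.

    This is NOT the BLOOM tokenizer. It only mimics the worst case of a
    byte-level tokenizer that has no learned merges for a given script.
    """
    return [f"<0x{b:02X}>" for b in word.encode("utf-8")]


def fertility(tokenize: Callable[[str], List[str]], words: Iterable[str]) -> float:
    """Return the average number of tokens produced per word."""
    words = list(words)
    if not words:
        raise ValueError("need at least one word")
    total_tokens = sum(len(tokenize(w)) for w in words)
    return total_tokens / len(words)


if __name__ == "__main__":
    # Placeholder word lists; replace with words from the target language.
    latin_words = ["hello", "world"]
    other_words = ["\u0633\u0644\u0627\u0645"]  # four 2-byte characters
    print("latin:", fertility(byte_fallback_tokenize, latin_words))
    print("other:", fertility(byte_fallback_tokenize, other_words))
```

A value far above what the same tokenizer gives for a well-covered language would suggest that words are being split into many small pieces rather than rejected outright.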

Is it possible to inject this vocabulary into BLOOM's vocabulary using the SentencePiece tokenization method, and then tune the model via a prompt?
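To make the idea concrete, here is a toy sketch of what I mean by injecting vocabulary. It uses plain Python lists instead of a real model, and the function name `extend_vocab` and the mean-initialization choice are my own assumptions, not anything taken from BLOOM. In Hugging Face `transformers`, I believe the corresponding calls would be `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`.

```python
from typing import Dict, List


def extend_vocab(
    vocab: Dict[str, int],
    embeddings: List[List[float]],
    new_tokens: List[str],
) -> List[str]:
    """Append unseen tokens to `vocab` and grow `embeddings` to match.

    Each new row is initialized to the mean of the original rows (an
    assumed heuristic). Both arguments are modified in place; the list of
    tokens that were actually added is returned.
    """
    if not embeddings or len(vocab) != len(embeddings):
        raise ValueError("vocab and embeddings must be non-empty and aligned")
    dim = len(embeddings[0])
    n_rows = len(embeddings)
    # Compute the mean once, from the original rows only.
    mean_row = [sum(row[i] for row in embeddings) / n_rows for i in range(dim)]
    added = []
    for tok in new_tokens:
        if tok in vocab:
            continue  # already known, nothing to add
        vocab[tok] = len(vocab)  # next free id
        embeddings.append(list(mean_row))
        added.append(tok)
    return added


if __name__ == "__main__":
    vocab = {"a": 0, "b": 1}
    embeddings = [[0.0, 2.0], [2.0, 4.0]]
    print(extend_vocab(vocab, embeddings, ["b", "c"]))
    print(vocab, embeddings)
```

My understanding is that the new rows carry no information until they are trained, so a method that keeps the embedding matrix frozen would probably not be enough on its own.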

Any suggestions?
