The tokenizer for a given model is deterministic: it depends only on the vocab file and whether the model is cased. Building the tokenizer takes about 100x as long as loading a preprocessed one (about 4 s vs. 40 ms for bert_base_uncased).
Save the tokenizer as part of the download process. If a model has a vocab but not a tokenizer, save a tokenizer once and then use it going forward (for backward compatibility with things that are already downloaded).
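A minimal sketch of the build-once-then-reuse idea, in Python for illustration only. Everything named here is an assumption and not this project's actual API: `build_tokenizer` is a hypothetical stand-in for the slow construction step, `load_tokenizer` is a hypothetical wrapper, and the cache file name and the use of `pickle` are arbitrary choices.

```python
import pickle
from pathlib import Path


def build_tokenizer(vocab_path, cased):
    """Hypothetical stand-in for the slow tokenizer construction step."""
    with open(vocab_path, encoding="utf-8") as f:
        tokens = [line.rstrip("\n") for line in f]
    # A real tokenizer would do far more work here; this only maps tokens to ids.
    return {"vocab": {tok: i for i, tok in enumerate(tokens)}, "cased": cased}


def load_tokenizer(vocab_path, cased):
    """Load the saved tokenizer if one exists; otherwise build it once and save it."""
    vocab_path = Path(vocab_path)
    suffix = "cased" if cased else "uncased"
    # Assumed naming scheme: the cache sits next to the vocab file.
    cache_path = vocab_path.with_name(f"{vocab_path.name}.{suffix}.tokenizer.pkl")

    if cache_path.exists():
        # Fast path: a tokenizer was saved at download time or on a previous load.
        with open(cache_path, "rb") as f:
            return pickle.load(f)

    # Backward-compatible path: the vocab was downloaded before tokenizers were
    # saved, so build the tokenizer now and save it for later loads.
    tokenizer = build_tokenizer(vocab_path, cased)
    with open(cache_path, "wb") as f:
        pickle.dump(tokenizer, f)
    return tokenizer
```

The download step could call the same function right after fetching the vocab, so that new downloads always have a saved tokenizer and only older downloads pay the one-time build cost.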
Note: I tested preprocessing the config JSON vs. saving it as-is. Preprocessing saves only microseconds, so it probably isn't worth the effort. It wouldn't hurt, though, so I may apply the same fix to the config when I do the tokenizer.