Tokenization is a fundamental step in the preprocessing stage of NLP projects, and its purpose is to convert unstructured symbolic texts into numeric matrices, which are suitable for machine learning systems.
In the tokenization process, a tokenizer is used to split natural language text
into a sequence of semantic units called tokens, which are then converted into
ids by looking up the tokens in a vocabulary file. An example of tokenizing
input text Jack is walking a dog.
is shown below:
It is noticeable that different tokenizers can have different ways to split text, and have different vocabulary files.
from flagai.data.tokenizer import Tokenizer
model_name = "GLM-large-en"
tokenizer = Tokenizer.from_pretrained(model_name) # Load tokenizer
At this step, the vocab files from Modelhub will be automatically downloaded to the path specified in cache_dir
parameter. It is set to ./checkpoints/{model_name}
directory in default.
The tokenizer can be used to encode text to a list of token IDs, as well as decoding the token IDs to the original text.
text = "Jack is walking a dog." # Input text
encoded_ids = tokenizer.EncodeAsIds(text) # Convert text string to a list of token ids
# Now encoded_ids = [2990, 2003, 3788, 1037, 3899, 1012]
recoverd_text = tokenizer.DecodeIds(encoded_ids) # Recover text string
# recovered_text should be the same as text
Different tokenizers has different vocabulary and different ways to split text. To suit your project, sometimes it is significant to create a new tokenizer, and how to implement that is given below:
let's take T5 tokenizer as an example
from transformers import T5Tokenizer
from ..tokenizer import Tokenizer
class T5BPETokenizer(Tokenizer):
def __init__(self, model_type_or_path="t5-base", cache_dir=None):
self.text_tokenizer = T5Tokenizer.from_pretrained(model_type_or_path,
cache_dir=cache_dir)
self.text_tokenizer.max_len = int(1e12)