Fix the broken links (huggingface#628)
complexbrains authored Nov 10, 2022
1 parent 6e4314b commit 0734a34
Showing 1 changed file with 3 additions and 3 deletions.
how-to-train.md (6 changes: 3 additions & 3 deletions)
@@ -116,7 +116,7 @@

What is great is that our tokenizer is optimized for Esperanto. Compared to a generic tokenizer trained for English, more native words are represented by a single, unsplit token. Diacritics, i.e. the accented characters used in Esperanto – `ĉ`, `ĝ`, `ĥ`, `ĵ`, `ŝ`, and `ŭ` – are encoded natively. We also represent sequences more efficiently: on this corpus, the average length of encoded sequences is ~30% smaller than when using the pretrained GPT-2 tokenizer.
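
As a quick sanity check of that ~30% figure, a sketch along these lines could compare the two tokenizers on a sample sentence (the file paths and the sentence are illustrative assumptions based on the EsperBERTo setup, not part of the original post):

```python
from tokenizers.implementations import ByteLevelBPETokenizer
from transformers import GPT2TokenizerFast

# Our Esperanto tokenizer (paths assume the EsperBERTo example layout)
esperanto_tokenizer = ByteLevelBPETokenizer(
    "./models/EsperBERTo-small/vocab.json",
    "./models/EsperBERTo-small/merges.txt",
)
# A generic tokenizer trained on English, for comparison
gpt2_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

sentence = "La ĉevalo ĝojas, ĉar la suno brilas."  # illustrative sentence
print(len(esperanto_tokenizer.encode(sentence).ids))  # fewer tokens: native words stay whole
print(len(gpt2_tokenizer.encode(sentence)))           # more tokens: accented characters get byte-split
```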

Here’s how you can use it in `tokenizers`, including handling the RoBERTa special tokens – of course, you’ll also be able to use it directly from `transformers`.

```python
from tokenizers.implementations import ByteLevelBPETokenizer
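
# The diff truncates the rest of this block; the continuation below is a
# sketch based on the EsperBERTo example (file paths are assumptions).
from tokenizers.processors import BertProcessing

tokenizer = ByteLevelBPETokenizer(
    "./models/EsperBERTo-small/vocab.json",
    "./models/EsperBERTo-small/merges.txt",
)
# Wrap each encoding in RoBERTa's special tokens: <s> ... </s>
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

print(tokenizer.encode("Mi estas Julien."))
```
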
@@ -144,7 +144,7 @@ print(

**Update:** The associated Colab notebook uses our new [`Trainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) directly, instead of through a script. Feel free to pick the approach you like best.

- We will now train our language model using the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py) script from `transformers` (newly renamed from `run_lm_finetuning.py`, as it now supports training from scratch more seamlessly). Just remember to leave `--model_name_or_path` unset (`None`) to train from scratch rather than from an existing model or checkpoint.
+ We will now train our language model using the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/main/examples/legacy/run_language_modeling.py) script from `transformers` (newly renamed from `run_lm_finetuning.py`, as it now supports training from scratch more seamlessly). Just remember to leave `--model_name_or_path` unset (`None`) to train from scratch rather than from an existing model or checkpoint.

> We’ll train a RoBERTa-like model, which is BERT-like with a couple of changes (check the [documentation](https://huggingface.co/transformers/model_doc/roberta.html) for more details).
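
For the `Trainer` route mentioned in the update above, a minimal from-scratch training sketch might look like the following; the model size, file paths, and training arguments are illustrative assumptions, not the script's defaults:

```python
from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

# A RoBERTa-like config, sized down for a small experiment (illustrative values)
config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)
tokenizer = RobertaTokenizerFast.from_pretrained(
    "./models/EsperBERTo-small", model_max_length=512
)

# No pretrained weights passed: the model is initialized from the config alone,
# which is the Trainer equivalent of leaving --model_name_or_path unset
model = RobertaForMaskedLM(config=config)

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./oscar.eo.txt",  # assumed path to the Esperanto corpus
    block_size=128,
)
# Masked-LM collator: randomly masks 15% of tokens, as in BERT/RoBERTa
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="./models/EsperBERTo-small",
    num_train_epochs=1,
    per_device_train_batch_size=64,
    save_steps=10_000,
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()
```
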
@@ -292,7 +292,7 @@ With more complex prompts, you can probe whether your language model captured more semantic knowledge

We can now fine-tune our new Esperanto language model on a downstream task of **part-of-speech tagging**.

- As mentioned before, Esperanto is a highly regular language where word endings typically condition the grammatical part of speech. Using a dataset of annotated Esperanto POS tags in the CoNLL-2003 format (see example below), we can use the [`run_ner.py`](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner.py) script from `transformers`.
+ As mentioned before, Esperanto is a highly regular language where word endings typically condition the grammatical part of speech. Using a dataset of annotated Esperanto POS tags in the CoNLL-2003 format (see example below), we can use the [`run_ner.py`](https://github.com/huggingface/transformers/blob/main/examples/pytorch/token-classification/run_ner.py) script from `transformers`.

> POS tagging is a token classification task, just like NER, so we can use the exact same script.
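
Once fine-tuned, the resulting model can be queried through the same token-classification pipeline used for NER. A minimal sketch, assuming the fine-tuned checkpoint was saved to a local directory (the path is hypothetical):

```python
from transformers import pipeline

MODEL_PATH = "./models/EsperBERTo-small-pos"  # assumed output dir of run_ner.py

pos_tagger = pipeline(
    "token-classification",
    model=MODEL_PATH,
    tokenizer=MODEL_PATH,
)
# Each token comes back with its predicted POS tag and a confidence score
print(pos_tagger("Mi estas viro kej estas tago varma."))
```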
