Skip to content

Latest commit

 

History

History
83 lines (65 loc) · 3.63 KB

PRETRAINING.md

File metadata and controls

83 lines (65 loc) · 3.63 KB

Pretraining PIXEL

Here we provide instructions for pretraining PIXEL. You can use these to reproduce our results or train your own PIXEL model on a different dataset.

Our rendered pretraining datasets are available on HuggingFace:

The datasets were created, i.e. prerendered from the respective original text datasets, as described below.

Prerendering the data

Show Instructions  

It's not necessary to prerender the data but it does make things a little faster. We simultaneously rendered and uploaded the data to the huggingface hub. This also works if you have no disk space locally :).

We provide two prerendering scripts, one for the bookcorpus, which we streamed directly from the huggingface hub, and one for Wikipedia, which was available locally as a txt file in which one line corresponds to one paragraph and articles are separated by triple newlines. The scripts can also be modified to work with the Wikipedia dataset on the huggingface hub, although the 2018 dump we used will not be accessible that way. Note: we used the PyGameTextRenderer for prerendering and the scripts are currently not compatible with the PangoCairoTextRenderer, although this would only require a small change in the code.

Executing these scripts requires installing the modified datasets library from our git submodule, in which we added support to push the data to the huggingface hub in chunks.

Rendering Wikipedia:

export DATASET_FILE="en.20180201.txt"
export RENDERER_PATH="configs/renderers/noto_renderer"

python scripts/data/prerendering/prerender_wikipedia.py \
  --renderer_name_or_path=${RENDERER_PATH} \
  --data_path=${DATASET_FILE} \
  --chunk_size=100000 \
  --repo_id="<your_target_huggingface_hub_repo_id>" \
  --split="train" \
  --auth_token="<your_auth_token_with_write_access"

Rendering Bookcorpus:

export RENDERER_PATH="configs/renderers/noto_renderer"

python scripts/data/prerendering/prerender_bookcorpus.py \
  --renderer_name_or_path=${RENDERER_PATH} \
  --chunk_size=100000 \
  --repo_id="<your_target_huggingface_hub_repo_id>" \
  --split="train" \
  --auth_token="<your_auth_token_with_write_access"

Training

We provide a wrapper for pretraining that uses the submitit to flexibly switch between single and multi-node training setups. You can execute it as follows to use the config json files in the configs folder to train PIXEL from scratch with the exact same configuration as described in our paper.

# Training on 2 nodes with 4x 40GB A100 GPUs each took around 8 days

python submitit_pretrain.py \
  --job_dir=../experiments \
  --prototype_config_name=scratch_noto_span0.25-dropout \
  --training_config_name=fp16_apex_bs32 \
  --nodes 2 \
  --ngpus 4 \
  --partition gpu

You can also directly pass the training arguments to the submitit script:

# Here, we pass the model arguments directly
python submitit_pretrain.py \
  --job_dir=../experiments \
  --text_renderer_name_or_path=configs/renderers/noto_renderer \
  --mask_ratio=0.25 \
  --span_mask \
  --masking_max_span_length=6 \
  --masking_cumulative_span_weights=0.2,0.4,0.6,0.8,0.9,1.0 \
  --dropout_prob=0.1 \
  --training_config_name=fp16_apex_bs32 \
  --nodes 2 \
  --ngpus 4 \
  --partition gpu