TensorFlow 2.1 implementation of pretraining and finetuning scripts for BERT.
The original paper: [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805).
Pretraining consists of two phases with mixed-batch training:
- Phase 1: 14064 steps at a total batch size of 32K and a maximum sequence length of 128, across 8 p3dn.24xlarge nodes.
- Phase 2: 6248 steps at a total batch size of 8K and a maximum sequence length of 512, across 8 p3dn.24xlarge nodes.
- Finally, we finetune on SQuAD v1.1 for 3649 steps at a total batch size of 48 on a single p3dn.24xlarge node.

The SQuAD F1 score combines the precision and recall of each word in the predicted answer and ranges from 0 to 100 (a minimal sketch of the metric follows the table below). When training on fewer nodes, we apply gradient accumulation to keep the same global batch size per step.
| Model | p3dn nodes | Phase 1 | Phase 2 | Finetuning | Total training time | SQuAD v1.1 F1 | SQuAD v2.0 F1 |
|---|---|---|---|---|---|---|---|
| BERT-base | 1 | 32 hr | 15 hr | 15 min | 47 hr | same | same |
| BERT-base | 2 | 17 hr | 8 hr | 15 min | 25 hr | same | same |
| BERT-base | 4 | 10 hr | 4 hr | 15 min | 14 hr | same | same |
| BERT-base | 8 | 5 hr 33 min | 2 hr 53 min | 15 min | 8 hr 41 min | 87.68 | 76.14 |
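The global batch sizes above follow from the per-GPU batch, GPU count, and accumulation steps: Phase 1 uses 128 examples per GPU × 8 GPUs per node × 8 nodes × 4 accumulation steps = 32K, and Phase 2 uses 32 × 8 × 8 × 4 = 8K. As for the F1 metric, here is a minimal sketch of token-level SQuAD F1; the official evaluation script additionally normalizes answers by lowercasing and stripping articles and punctuation before matching:

```python
from collections import Counter

def squad_f1(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between a predicted and a gold answer span (simplified)."""
    pred_tokens = prediction.split()
    gold_tokens = ground_truth.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 100.0 * 2 * precision * recall / (precision + recall)

print(squad_f1("in the park", "the park"))  # 80.0
```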
All commands should be run from the `models/nlp` directory.
- Create an FSx volume.
- Download the datasets onto FSx. The simplest way to start is with English Wikipedia.
- Create an Amazon Elastic Container Registry (ECR) repository, then build a Docker image from `models/nlp/Dockerfile` and push it to ECR.
```bash
export ACCOUNT_ID=
export REPO=
export IMAGE=${ACCOUNT_ID}.dkr.ecr.us-east-1.amazonaws.com/${REPO}:py37_tf211
docker build -t ${IMAGE} .
$(aws ecr get-login --no-include-email)
docker push ${IMAGE}
```
- Define environment variables to point to the FSx volume. Where a variable takes multiple values (for example, security group IDs), use a comma-separated string.
```bash
export SAGEMAKER_ROLE=arn:aws:iam::${ACCOUNT_ID}:role/service-role/AmazonSageMaker-ExecutionRole-20200101T123
export SAGEMAKER_IMAGE_NAME=${IMAGE}
export SAGEMAKER_FSX_ID=fs-123
export SAGEMAKER_SUBNET_IDS=subnet-123
export SAGEMAKER_SECURITY_GROUP_IDS=sg-123,sg-456
```
- Define BERT-specific run names, step counts, and data/output paths.
```bash
export PHASE1_RUN_NAME=bertphase1
export PHASE2_RUN_NAME=bertphase2
export PHASE1_STEPS=14064
export PHASE2_STEPS=6248
# The training data should be in TFRecord files inside /fsx/${TRAIN_DIR}
# (and validation data inside /fsx/${VAL_DIR})
export TRAIN_DIR=bert_data/train
export VAL_DIR=bert_data/val
export LOG_DIR=logs/bert
export CHECKPOINT_DIR=checkpoints/bert
```
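Before launching a multi-node job, it can be worth sanity-checking the TFRecords on FSx. A minimal sketch, assuming the data lives under `/fsx/bert_data/train` (the glob pattern and feature names depend on how you preprocessed the data):

```python
import glob
import tensorflow as tf

# Hypothetical check: open one shard and print the feature keys of the first
# serialized tf.train.Example it contains.
files = sorted(glob.glob("/fsx/bert_data/train/*"))
print(f"{len(files)} files found")

for raw_record in tf.data.TFRecordDataset(files[:1]).take(1):
    example = tf.train.Example.FromString(raw_record.numpy())
    print(sorted(example.features.feature.keys()))
```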
- Launch the SageMaker Phase 1 pretraining job.
```bash
python -m albert.launch_sagemaker \
    --source_dir=. \
    --entry_point=albert/run_pretraining.py \
    --sm_job_name=bert-pretrain-phase1 \
    --instance_type=ml.p3dn.24xlarge \
    --instance_count=8 \
    --train_dir=${TRAIN_DIR} \
    --val_dir=${VAL_DIR} \
    --log_dir=${LOG_DIR} \
    --checkpoint_dir=${CHECKPOINT_DIR} \
    --load_from=scratch \
    --model_type=bert \
    --model_size=base \
    --per_gpu_batch_size=128 \
    --max_seq_length=128 \
    --max_predictions_per_seq=20 \
    --optimizer=lamb \
    --learning_rate=0.005 \
    --end_learning_rate=0.0003 \
    --hidden_dropout_prob=0.1 \
    --attention_probs_dropout_prob=0.1 \
    --gradient_accumulation_steps=4 \
    --learning_rate_decay_power=0.5 \
    --warmup_steps=2812 \
    --total_steps=${PHASE1_STEPS} \
    --log_frequency=100 \
    --squad_frequency=0 \
    --run_name=${PHASE1_RUN_NAME} \
    --name=mybertphase1
```
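The learning-rate flags above describe a linear warmup followed by a polynomial decay (power 0.5) down to `end_learning_rate`. A minimal sketch of such a schedule in Keras, using the Phase 1 values; the repo's actual implementation may differ in its details:

```python
import tensorflow as tf

class WarmupPolyDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Linear warmup to peak_lr, then polynomial decay to end_lr."""

    def __init__(self, peak_lr, end_lr, warmup_steps, total_steps, power):
        super().__init__()
        self.peak_lr = peak_lr
        self.warmup_steps = warmup_steps
        self.decay = tf.keras.optimizers.schedules.PolynomialDecay(
            initial_learning_rate=peak_lr,
            decay_steps=total_steps - warmup_steps,
            end_learning_rate=end_lr,
            power=power,
        )

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        warmup_lr = self.peak_lr * step / float(self.warmup_steps)
        return tf.where(step < self.warmup_steps, warmup_lr,
                        self.decay(step - self.warmup_steps))

# Phase 1 values from the command above; the schedule would then be passed to the
# LAMB optimizer (e.g. tfa.optimizers.LAMB from TensorFlow Addons, if the repo
# does not ship its own implementation).
schedule = WarmupPolyDecay(peak_lr=0.005, end_lr=0.0003,
                           warmup_steps=2812, total_steps=14064, power=0.5)
```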
- Launch the SageMaker Phase 2 pretraining job, loading the Phase 1 checkpoint.
```bash
python -m albert.launch_sagemaker \
    --source_dir=. \
    --entry_point=albert/run_pretraining.py \
    --sm_job_name=bert-pretrain-phase2 \
    --instance_type=ml.p3dn.24xlarge \
    --instance_count=8 \
    --train_dir=${TRAIN_DIR} \
    --val_dir=${VAL_DIR} \
    --log_dir=${LOG_DIR} \
    --checkpoint_dir=${CHECKPOINT_DIR} \
    --load_from=checkpoint \
    --load_optimizer_state=false \
    --model_type=bert \
    --model_size=base \
    --per_gpu_batch_size=32 \
    --max_seq_length=512 \
    --max_predictions_per_seq=80 \
    --optimizer=lamb \
    --learning_rate=0.004 \
    --end_learning_rate=0.0003 \
    --hidden_dropout_prob=0.1 \
    --attention_probs_dropout_prob=0.1 \
    --gradient_accumulation_steps=4 \
    --learning_rate_decay_power=0.5 \
    --warmup_steps=625 \
    --total_steps=${PHASE2_STEPS} \
    --log_frequency=100 \
    --squad_frequency=0 \
    --run_name=${PHASE2_RUN_NAME} \
    --checkpoint_path=${CHECKPOINT_DIR}/${PHASE1_RUN_NAME}-step${PHASE1_STEPS} \
    --name=mybertphase2
```
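`--gradient_accumulation_steps=4` means each optimizer step sums gradients over four micro-batches, which is how smaller clusters reach the same global batch size. A conceptual sketch of the pattern (the real training loop also averages gradients across GPUs and nodes, which is omitted here):

```python
import tensorflow as tf

def train_with_accumulation(model, optimizer, dataset, loss_fn, accum_steps=4):
    """Accumulate gradients over accum_steps micro-batches, then apply them once."""
    accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]
    for i, (features, labels) in enumerate(dataset):
        with tf.GradientTape() as tape:
            # Scale the loss so the accumulated gradient matches one large batch.
            loss = loss_fn(labels, model(features, training=True)) / accum_steps
        grads = tape.gradient(loss, model.trainable_variables)
        accum_grads = [a + g for a, g in zip(accum_grads, grads)]
        if (i + 1) % accum_steps == 0:
            optimizer.apply_gradients(zip(accum_grads, model.trainable_variables))
            accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]
```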
- Launch a SageMaker finetuning job on SQuAD.

For SQuAD v1.1:
```bash
python -m albert.launch_sagemaker \
    --source_dir=. \
    --entry_point=albert/run_squad.py \
    --sm_job_name=bert-squadv1 \
    --instance_type=ml.p3dn.24xlarge \
    --instance_count=1 \
    --load_from=checkpoint \
    --checkpoint_path=${CHECKPOINT_DIR}/${PHASE2_RUN_NAME}-step${PHASE2_STEPS} \
    --model_type=bert \
    --per_gpu_batch_size=6 \
    --model_size=base \
    --squad_version=squadv1 \
    --learning_rate=5e-5 \
    --warmup_steps=365 \
    --total_steps=3649 \
    --validation_frequency=10000 \
    --evaluate_frequency=10000 \
    --skip_xla=true
```
For SQuAD v2.0:
```bash
python -m albert.launch_sagemaker \
    --source_dir=. \
    --entry_point=albert/run_squad.py \
    --sm_job_name=bert-squadv2 \
    --instance_type=ml.p3dn.24xlarge \
    --instance_count=1 \
    --load_from=checkpoint \
    --checkpoint_path=${CHECKPOINT_DIR}/${PHASE2_RUN_NAME}-step${PHASE2_STEPS} \
    --model_type=bert \
    --per_gpu_batch_size=6 \
    --model_size=base \
    --squad_version=squadv2 \
    --learning_rate=10.0e-5 \
    --warmup_steps=814 \
    --total_steps=8144 \
    --validation_frequency=10000 \
    --evaluate_frequency=100000 \
    --skip_xla=true
```
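For intuition on where these step counts come from, a rough sketch relating `total_steps`, the global batch size of 48, and approximate SQuAD training-set sizes (the exact example counts depend on preprocessing):

```python
# Hypothetical back-of-the-envelope: steps x batch size / training examples ~ epochs.
configs = {
    # squad_version: (total_steps, global_batch, approx_train_examples)
    "squadv1": (3649, 48, 88_000),
    "squadv2": (8144, 48, 130_000),
}
for name, (steps, batch, examples) in configs.items():
    print(f"{name}: ~{steps * batch / examples:.1f} epochs")
# squadv1: ~2.0 epochs
# squadv2: ~3.0 epochs
```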
- Enter the Docker container to debug and edit code.
```bash
docker run -it --privileged -v=/fsx:/fsx --gpus=all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --rm ${IMAGE} /bin/bash
```
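Inside the container, a quick way to confirm that TensorFlow sees all eight GPUs of a p3dn.24xlarge (a minimal check, not part of the repo's scripts):

```python
import tensorflow as tf

print(tf.__version__)                                       # expect a 2.1.x build
print(tf.config.experimental.list_physical_devices("GPU"))  # expect 8 GPUs on p3dn.24xlarge
```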