Training setup #13
Comments
Hello, we trained our model using enwik8 and OpenWebText. For other datasets, we use a version pre-trained on OpenWebText. Could you specify which dataset you're interested in replicating, or do you want to replicate all of them?
Hi, sorry for the lack of clarity. More specifically, I am looking to fine-tune the OpenWebText pre-trained model (primarily the 216M parameter version) on both the WikiText-2 and WikiText-103 datasets, as well as on enwik8 as defined in the train.py script. I've assumed that the provided code contains the setup used for enwik8, but I would also like to replicate the WikiText-2 and WikiText-103 fine-tuning runs. I have access to 4x V100 GPUs, so I should be able to replicate your runs as accurately as possible. Additionally, I have a functioning environment that can run train.py and would like to maximize performance, so I would prefer to avoid using the Docker image if possible.
Hi, thanks for the clarification! I've updated the README.md with more pre-training details; please refer to the updated README.
Thanks for the details on pre-training on a large corpus; that will be useful as I go along as well. For what I'm working on now, I was actually looking for the hyperparameters you used to fine-tune the 216M model on WikiText-2 and WikiText-103, specifically the batch size, the learning rates, and the number of epochs, assuming training on 4x V100 GPUs as stated in the paper.
Hi, sorry for the misunderstanding. I've uploaded the pre-tokenized wikitext103 dataset and updated the README with detailed information on how to fine-tune this model.
Thanks for updating this! I've attempted fine-tuning on the wikitext-2 dataset, but I'm seeing a very slow runtime. My config is as follows:
If you have any suggestions as to why this would lead to such a slow runtime (6.5 hrs per epoch), they would be most welcome! For further investigation, I've tried running training with your default setup (as specified in train.py in this repo) on the enwik8 dataset. The paper reports training the 216M model in 48 hrs. Although I am not sure how many epochs training was run for, the train.py file seems to indicate it runs for 1000 'mini-epochs'. The paper quotes runtimes in the range of 48 hrs, yet it is currently taking me 9 hrs per 'mini-epoch' when splitting training over 4x V100 GPUs, which would suggest training takes 9000 hrs rather than 48 hrs. The setup we ran could not fit a batch size of 12; the highest batch size that fit is 3. The working setup runs on 4x V100 GPUs using Hugging Face Accelerate to parallelize the work across the 4 GPUs. Do you have any advice as to what might cause this discrepancy between my setup and the one listed in the paper?
Hi, it seems the key issue impacting your runtime is the number of mini-epochs used in training. The mini-epoch count should be calculated from the total number of training tokens, which is the product of the mini-epoch count, the number of iterations per mini-epoch, and the context length. This count directly determines the training duration. For the WikiText-2 dataset, the total token count is considerably smaller than the defaults set in the training configuration, so using a high mini-epoch count (as in the default setup) together with a high iteration count will significantly prolong training. I recommend recalibrating the mini-epoch count to match the actual size of your dataset. This adjustment should bring your training duration closer to the expected timelines. Hope this helps in optimizing your training process!
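As a rough illustration of the calculation described above, here is a small Python sketch; the context length, iteration count, target number of passes, and WikiText-2 token count are assumptions for the example, not values taken from train.py:

```python
# Illustrative sketch: pick a mini-epoch count that matches the dataset size.
# All concrete numbers below are assumptions, not values from train.py.

ctx_length = 1024              # tokens of context per sample (assumed)
batch_size = 3                 # per-GPU batch size reported above
n_gpus = 4                     # 4x V100
iters_per_mini_epoch = 10_000  # iterations in one "mini-epoch" (assumed)

tokens_per_mini_epoch = iters_per_mini_epoch * ctx_length * batch_size * n_gpus

dataset_tokens = 2_000_000     # WikiText-2 train split is on the order of ~2M tokens
target_passes = 4              # how many passes over the data you want (assumed)

n_mini_epochs = max(1, round(target_passes * dataset_tokens / tokens_per_mini_epoch))
print(f"tokens per mini-epoch: {tokens_per_mini_epoch:,}")
print(f"suggested mini-epoch count: {n_mini_epochs}")
```

With numbers of this order, a few mini-epochs already cover WikiText-2 several times, whereas the enwik8-style default of 1000 mini-epochs would process vastly more tokens than the dataset contains.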
Hi, thanks for the response. What are the values you used for n_epochs, ctx_length, lr_init, lr_final, epoch_length_fixed, and batch_size to replicate the results in your paper for WikiText-2 and WikiText-103 while running on 4x V100 GPUs?
Hi, I suggest enabling DeepSpeed for your training process. It significantly boosts performance, especially on multi-GPU setups. In my experience with a V100 GPU, DeepSpeed offered a 3x-4x acceleration. It also allows for larger batch sizes, which is beneficial. After activating DeepSpeed, do monitor your VRAM usage. Here's an adjusted configuration to consider:
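For reference, here is a minimal sketch of what enabling DeepSpeed through Hugging Face Accelerate's Python API can look like; the tiny model, ZeRO stage, precision, and batch size are illustrative assumptions and not the configuration the commenter refers to:

```python
# Minimal, self-contained sketch of training with Accelerate + DeepSpeed.
# The stand-in model, ZeRO stage, precision, and batch size are assumptions,
# not the configuration referenced in the comment above.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

ds_plugin = DeepSpeedPlugin(zero_stage=2, gradient_accumulation_steps=1)
accelerator = Accelerator(deepspeed_plugin=ds_plugin, mixed_precision="fp16")

model = nn.Linear(512, 512)                      # stand-in for the real network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
data = TensorDataset(torch.randn(64, 512), torch.randn(64, 512))
loader = DataLoader(data, batch_size=8)

# Accelerate wraps everything in the DeepSpeed engine.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)                   # replaces loss.backward() under DeepSpeed
    optimizer.step()
```

Run under `accelerate launch`, ZeRO stage 2 shards optimizer state and gradients across the 4 GPUs, which is what frees memory for larger batch sizes.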
Hi, I'm attempting to replicate the training runs on all of the different datasets. Could you provide some insight into the configuration you used to train on all three of the datasets mentioned in the paper?
Thanks in advance!