
Training setup #13

Open
diederik-vink opened this issue Dec 30, 2023 · 9 comments

Comments

@diederik-vink

Hi, I'm attempting to replicate the training runs on all the different datasets. Could you provide some insight into the configuration you used to train on each of the three datasets mentioned in the paper?

Thanks in advance!

@ridgerchu
Owner

ridgerchu commented Dec 31, 2023

Hello, we trained our model using enwik8 and OpenWebText. For other datasets, we use a version pre-trained on OpenWebText. Could you specify which dataset you're interested in replicating, or do you want to replicate all of them?

@diederik-vink
Author

Hi, sorry for the lack of clarity. More specifically, I am looking to fine-tune the OpenWebText pre-trained model (primarily the 216M parameter version) on both the WikiText-2 and WikiText-103 datasets, as well as on enwik8 as defined in the train.py script. I've assumed that the provided code contains the setup used for enwik8, but I was curious to replicate the WikiText-2 and WikiText-103 fine-tuning runs as well.

I have access to 4x V100 GPUs, so I should be able to replicate your runs fairly closely. Additionally, I have a working environment that can run train.py, and since I'd like to maximize performance I'm hoping to avoid using the Docker image if possible.

@ridgerchu
Owner

Hi, thanks for the clarification! I've updated the README.md with more pre-training details; you can refer to the updated README!

@DiederikVink

Thanks for the details on pre-training on a large corpus; that'll be useful as I go along as well. For what I'm working on now, I was actually looking for the hyperparameters you used to fine-tune the 216M model on WikiText-2 and WikiText-103, specifically the batch size, the learning rates, and the number of epochs, assuming training on 4x V100 GPUs as stated in the paper.

@ridgerchu
Owner

Hi, sorry for the misunderstanding. I've also uploaded the pre-tokenized wikitext103 dataset and updated the README with detailed information on fine-tuning this model.

@diederik-vink
Author

diederik-vink commented Jan 9, 2024

Thanks for updating this! I've attempted fine-tuning on the wikitext-2 dataset, but I'm seeing a very slow runtime. My config is as follows:

ctx_len = 1024        # ===> increase T_MAX in model.py if your ctx_len > 1024  
n_layer = 18  
n_embd = 768   
model_type = 'RWKV'  
batch_size = 3  
lr_init = 3e-6  
lr_final = 3e-6  
n_epoch = 10  
epoch_length_fixed = 10000  

If you have any suggestions as to why this config would lead to such a slow runtime (6.5 hrs per epoch), they would be most welcome!

For further investigation, I've tried running training with your default setup (as specified in train.py in this repo) on the enwik8 dataset. The paper reports a training time of around 48 hrs for the 216M model. Although I'm not sure how many epochs that corresponds to, train.py seems to indicate 1000 'mini-epochs', and it is currently taking me 9 hrs per 'mini-epoch' when splitting training across 4x V100 GPUs, which would imply roughly 9000 hrs of training rather than 48 hrs. My setup also could not fit a batch size of 12; the highest batch size that fits is 3. The working setup runs on 4x V100 GPUs and uses Hugging Face Accelerate to parallelize the work across the 4 GPUs, roughly as shown below.
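
For completeness, the launch I'm using looks roughly like this (this is my own Accelerate invocation, not anything shipped with this repo, so treat it only as an illustration of the setup):

accelerate launch --multi_gpu --num_processes 4 train.py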

Do you have any advice as to what might be causing this discrepancy between my setup and the one described in the paper?

@ridgerchu
Owner

Hi,

It seems the key issue impacting your runtime is the number of mini-epochs used in training. The mini-epoch count should be derived from the total number of training tokens, which is the product of the mini-epoch count, the number of iterations per mini-epoch, and the context length (multiplied by the batch size, since each iteration processes batch_size sequences). This count directly determines the training duration.

For the Wikitext-2 dataset, the total token count is considerably smaller than the defaults in the training configuration assume. Hence, if you use a high mini-epoch count (like the default setup) together with a high iteration count, it will significantly prolong the training time. I recommend recalibrating the mini-epoch count to match the actual size of your dataset; that adjustment should bring your training duration closer to the expected timeline.
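
To make the arithmetic concrete, here is a rough sketch of that calculation. The ctx_len, batch_size, and epoch_length_fixed values come from the config you posted (I'm assuming epoch_length_fixed is the number of iterations per mini-epoch); the ~2M-token size I quote for the WikiText-2 train split is a rough figure of my own, not something taken from the repo:

# Back-of-the-envelope check of tokens processed per mini-epoch.
ctx_len = 1024
batch_size = 3
epoch_length_fixed = 10_000                    # iterations per mini-epoch (assumed)

tokens_per_mini_epoch = epoch_length_fixed * batch_size * ctx_len
print(f"{tokens_per_mini_epoch:,}")            # 30,720,000 tokens per mini-epoch

# WikiText-2's train split is on the order of 2M tokens (rough figure),
# so a single mini-epoch already covers the dataset about 15 times over.
wikitext2_train_tokens = 2_000_000
print(tokens_per_mini_epoch / wikitext2_train_tokens)   # ~15 passes per mini-epoch

With n_epoch = 10 on top of that, you are effectively making ~150 passes over WikiText-2, which is why the run takes so long.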

Hope this helps in optimizing your training process!

@diederik-vink
Author

Hi,

Thanks for the response. What values did you use for n_epoch, ctx_len, lr_init, lr_final, epoch_length_fixed, and batch_size to replicate the results in your paper for WikiText-2 and WikiText-103 when running on 4x V100 GPUs?

@ridgerchu
Owner

Hi,

I suggest enabling DeepSpeed for your training process. It significantly boosts performance, especially on multi-GPU setups. In my experience with a V100 GPU, DeepSpeed offered a 3x-4x acceleration. It also allows for larger batch sizes, which is beneficial. After activating DeepSpeed, do monitor your VRAM usage. Here's an adjusted configuration to consider:

ctx_len = 1024        # Increase T_MAX in model.py if your ctx_len exceeds 1024  
n_layer = 18  
n_embd = 768  
model_type = 'RWKV'  
batch_size = 3        # Adjust based on VRAM capacity 
lr_init = 3e-6  
lr_final = 3e-6  
n_epoch = 1  
epoch_length_fixed = 10000 
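
Since you're already launching with Hugging Face Accelerate, one low-friction way to try DeepSpeed is through Accelerate's own DeepSpeed integration. A sketch of such a launch is below; the exact flags are from my recollection of the Accelerate CLI, and whether train.py runs unchanged under DeepSpeed will depend on your environment, so treat this as a starting point rather than a verified recipe:

accelerate launch --use_deepspeed --zero_stage 2 --mixed_precision fp16 --num_processes 4 train.py

After it runs, increase batch_size gradually until VRAM is nearly full.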
