The best place to learn about examples in TRL is our docs page!
```bash
pip install trl

# optional: wandb
pip install wandb
```
Note: if you don't want to log with `wandb`, remove `log_with="wandb"` in the scripts/notebooks. You can also replace it with your favourite experiment tracker that's supported by `accelerate`.
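As a rough illustration, the tracker is usually selected through the `log_with` argument of the trainer configuration. The snippet below is a minimal sketch assuming a `PPOConfig`-style configuration object with a `log_with` field; check the script or notebook you are running for the exact argument it uses.

```python
from trl import PPOConfig

# Minimal sketch: log_with="wandb" enables wandb logging.
# Remove the argument to disable logging, or set it to another
# tracker name supported by accelerate (e.g. "tensorboard").
config = PPOConfig(log_with="wandb")
```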
For all the examples, you'll need to generate an Accelerate config with:

```bash
accelerate config # will prompt you to define the training configuration
```
Then, it is encouraged to launch jobs with `accelerate launch`!
The examples are currently split over the following categories:
1. **Sentiment**: Fine-tune a model with a sentiment classification model.
2. **StackOverflow**: Perform the full RLHF process (fine-tuning, reward model training, and RLHF) on StackOverflow data.
3. **Summarization**: Recreate OpenAI's Learning to Summarize paper.
4. **Toxicity**: Fine-tune a model to reduce the toxicity of its generations.
5. **Best-of-n sampling**: Comparative demonstration of best-of-n sampling as a simpler (but relatively expensive) alternative to RLHF (a minimal sketch follows below).
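To give a flavour of the best-of-n idea, the sketch below samples several completions per prompt and keeps the one a reward model scores highest, instead of updating the policy with RL. The model names (`gpt2` as the generator, `lvwerra/distilbert-imdb` as the reward model) and the `best_of_n` helper are illustrative assumptions borrowed from the sentiment setup, not the exact code used in the example.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Assumptions: any causal LM as the generator, any classifier as the reward model.
generator_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(generator_name)
model = AutoModelForCausalLM.from_pretrained(generator_name)
reward_pipe = pipeline("sentiment-analysis", model="lvwerra/distilbert-imdb")


def best_of_n(prompt: str, n: int = 4, max_new_tokens: int = 32) -> str:
    """Sample n completions and return the one with the highest reward."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,
        max_new_tokens=max_new_tokens,
        num_return_sequences=n,
        pad_token_id=tokenizer.eos_token_id,
    )
    candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    # Reward = probability the classifier assigns to the "POSITIVE" class.
    results = reward_pipe(candidates, top_k=None)
    scores = [next(d["score"] for d in res if d["label"] == "POSITIVE") for res in results]
    best_idx = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_idx]


print(best_of_n("This movie was"))
```

This trades extra inference-time compute for not having to run RL training at all, which is why it is described as simpler but relatively expensive.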