Skip to content

Latest commit

 

History

History
 
 

llm_finetuning

☮️ Fine-tuning open source LLMs using MLOps pipelines with PEFT

Welcome to your newly generated "ZenML LLM PEFT Finetuning project" project! This is a great way to get hands-on with ZenML using production-like template. The project contains a collection of ZenML steps, pipelines and other artifacts and useful resources that can serve as a solid starting point for finetuning open-source LLMs using ZenML.

Using these pipelines, we can run the data-preparation and model finetuning with a single command while using YAML files for configuration and letting ZenML take care of tracking our metadata and containerizing our pipelines.

🌎 Inspiration and Credit

This project heavily relies on the PEFT project by the amazing people at Hugging Face and the microsoft/phi-2 model from the amazing people at microsoft.

🏃 How to run

In this project, we provide a predefined configuration file to finetune models on the gem/viggo dataset. Before we're able to run any pipeline, we need to set up our environment as follows:

# Set up a Python virtual environment, if you haven't already
python3 -m venv .venv
source .venv/bin/activate

# Install requirements
pip install -r requirements.txt

👷 Combined feature engineering and finetuning pipeline

Warning

All steps of this pipeline have a clean_gpu_memory(force=True) at the beginning. This is used to ensure that the memory is properly cleared after previous steps.

This functionality might affect other GPU processes running on the same environment, so if you don't want to clean the GPU memory between the steps, you can delete those utility calls from all steps.

The easiest way to get started with just a single command is to run the finetuning pipeline with the orchestrator_finetune.yaml configuration file, which will do data preparation, model finetuning, evaluation with Rouge and promotion:

python run.py --config orchestrator_finetune.yaml

When running the pipeline like this, the trained model will be stored in the ZenML artifact store.

⚡ Accelerate your finetuning

Do you want to benefit from multi-GPU-training with Distributed Data Parallelism (DDP)? Then you can use other configuration files prepared for this purpose. For example, orchestrator_finetune.yaml can run a finetuning of the microsoft/phi-2 powered by Hugging Face Accelerate on all GPUs available in the environment. To do so, just call:

python run.py --config orchestrator_finetune.yaml --accelerate

Under the hood, the finetuning step will spin up the accelerated job using the step code, which will run on all available GPUs.

☁️ Running with a step operator in the stack

To finetune an LLM on remote infrastructure, you can either use a remote orchestrator or a remote step operator. Follow these steps to set up a complete remote stack:

  • Register the orchestrator (or step operator) and make sure to configure it in a way so that the finetuning step has access to a GPU with at least 24GB of VRAM. Check out our docs for more details.
    • To access GPUs with this amount of VRAM, you might need to increase your GPU quota (AWS, GCP, Azure).
    • The GPU instance that your finetuning will be running on will have CUDA drivers of a specific version installed. If that CUDA version is not compatible with the one provided by the default Docker image of the finetuning pipeline, you will need to modify it in the configuration file. See here for a list of available PyTorch images.
  • Register a remote artifact store and container registry.
  • Register a stack with all these components
    zenml stack register llm-finetuning-stack -o <ORCHESTRATOR_NAME> \
        -a <ARTIFACT_STORE_NAME> \
        -c <CONTAINER_REGISTRY_NAME> \
        [-s <STEP_OPERATOR_NAME>]

🗂️ Bring Your Own Data

To fine-tune an LLM using your own datasets, consider adjusting the prepare_data step to match your needs:

  • This step loads, tokenizes, and stores the dataset from an external source to the artifact store defined in the ZenML Stack.
  • The dataset can be loaded from Hugging Face by adjusting the dataset_name parameter in the configuration file. By default, the step code expects the dataset to have at least three splits: train, validation, and test. If your dataset uses different split naming, you'll need to make the necessary adjustments.
  • If you want to retrieve the dataset from other sources, you'll need to create the relevant code and prepare the splits in a Hugging Face dataset format for further processing.
  • Tokenization occurs in the utility function generate_and_tokenize_prompt. It has a default way of formatting the inputs before passing them into the model. If this default logic doesn't fit your use case, you'll also need to adjust this function.
  • The return value is the path to the stored datasets (by default, train, val, and test_raw splits). Note: The test set is not tokenized here and will be tokenized later during evaluation.

📜 Project Structure

The project loosely follows the recommended ZenML project structure:

.
├── configs                                       # pipeline configuration files
│   ├── orchestrator_finetune.yaml                # default local or remote orchestrator configuration
│   └── remote_finetune.yaml                      # default step operator configuration
├── materializers
│   └── directory_materializer.py                 # custom materializer to push whole directories to the artifact store and back
├── pipelines                                     # `zenml.pipeline` implementations
│   └── train.py                                  # Finetuning and evaluation pipeline
├── steps                                         # logically grouped `zenml.steps` implementations
│   ├── evaluate_model.py                         # evaluate base and finetuned models using Rouge metrics
│   ├── finetune.py                               # finetune the base model
│   ├── log_metadata.py                           # helper step to ensure that model metadata is always logged
│   ├── prepare_datasets.py                       # load and tokenize dataset
│   └── promote.py                                # promote good models to target environment
├── utils                                         # utility functions
│   ├── callbacks.py                              # custom callbacks
│   ├── loaders.py                                # loaders for models and data
│   ├── logging.py                                # logging helpers
│   └── tokenizer.py                              # load and tokenize
├── .dockerignore
├── README.md                                     # this file
├── requirements.txt                              # extra Python dependencies 
└── run.py                                        # CLI tool to run pipelines on ZenML Stack