
Commit

Codebase chores: typos, trailing spaces, markdown linting, Ruff and Black. (Lightning-AI#777)

Andrei-Aksionov authored Nov 24, 2023
1 parent aa6ee9e commit 2647dbd
Showing 12 changed files with 27 additions and 44 deletions.
2 changes: 1 addition & 1 deletion pretrain/openwebtext_trainer.py
@@ -2,7 +2,7 @@
import sys
import time
from pathlib import Path
from typing import Any, Optional, Dict
from typing import Any, Dict, Optional

import lightning as L
import numpy as np
5 changes: 2 additions & 3 deletions pretrain/tinyllama.py
@@ -22,7 +22,7 @@
wd = Path(__file__).parent.parent.resolve()
sys.path.append(str(wd))

from lit_gpt.model import GPT, Block, Config, LLaMAMLP, CausalSelfAttention
from lit_gpt.model import GPT, Block, CausalSelfAttention, Config, LLaMAMLP
from lit_gpt.packed_dataset import CombinedDataset
from lit_gpt.utils import chunked_cross_entropy, num_parameters

@@ -154,7 +154,6 @@ def train(fabric, state, train_dataloader, val_dataloader, resume):
curr_iter = 0

for train_data in train_dataloader:

if state["iter_num"] >= max_iters:
break

@@ -336,7 +335,7 @@ def choose_logger(logger_name: str, name: str, resume: Union[bool, Path], *args,
return TensorBoardLogger(root_dir="logs", name=name, *args, **kwargs)
if logger_name == "wandb":
return WandbLogger(project="tinyllama", name=name, resume=(resume is not False), *args, **kwargs)
raise ValueError(f"`logger={logger} is not a valid option.")
raise ValueError(f"`logger={logger_name}` is not a valid option.")


if __name__ == "__main__":
1 change: 0 additions & 1 deletion tests/test_lm_eval_harness.py
@@ -6,7 +6,6 @@

import datasets
import pytest
from conftest import RunIf
from lightning import Fabric


2 changes: 1 addition & 1 deletion tests/test_tokenizer.py
@@ -25,7 +25,7 @@ def test_tokenizer_against_hf(config):
cache_dir = Path("/tmp/tokenizer_test_cache")

# create a checkpoint directory that points to the HF files
checkpoint_dir = cache_dir / "ligpt" / config.hf_config["org"] / config.hf_config["name"]
checkpoint_dir = cache_dir / "litgpt" / config.hf_config["org"] / config.hf_config["name"]
if not checkpoint_dir.exists():
file_to_cache = {}
for file in ("tokenizer.json", "generation_config.json", "tokenizer.model", "tokenizer_config.json"):
3 changes: 1 addition & 2 deletions tutorials/convert_lit_models.md
@@ -13,8 +13,7 @@ python scripts/convert_lit_checkpoint.py \

These paths are just placeholders; you will need to customize them based on which finetuning or pretraining script you ran and its configuration.


Please note that if you want to convert a model that has been fine-tuned using an adapter like LoRA, these weights should be [merged](../scripts/merge_lora.py) to the checkpoint prior to converting.
Please note that if you want to convert a model that has been fine-tuned using an adapter like LoRA, these weights should be [merged](../scripts/merge_lora.py) to the checkpoint prior to converting.

```sh
python scripts/merge_lora.py \
2 changes: 0 additions & 2 deletions tutorials/download_phi15.md
@@ -11,10 +11,8 @@ The model was trained the same data sources (7B tokens) as its [phi-1](https://a

In addition, to create phi-1.5, the authors included additional textbook-quality synthetic text (roughly 20B tokens) in natural language, which was created using the [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644) approach.


The model weights are released under a [*Microsoft Research license*](https://huggingface.co/microsoft/phi-1_5/blob/main/README.md#license).


In order to use the phi-1.5 model checkpoint, which requires about 3 GB of disk space, download the weights and convert the checkpoint to the lit-gpt format:

```bash
9 changes: 1 addition & 8 deletions tutorials/neurips_challenge_quickstart.md
@@ -1,18 +1,13 @@
# NeurIPS 2023 LLM Efficiency Challenge Quickstart Guide



The [NeurIPS 2023 Efficiency Challenge](https://llm-efficiency-challenge.github.io/) is a competition focused on training **1 LLM for 24 hours on 1 GPU** – the team with the best LLM gets to present their results at NeurIPS 2023.

This quick start guide is a short starter guide illustrating the main steps to get started with Lit-GPT, which was selected as the competition's official starter kit.



 

## Competition Facts


 

**Permitted GPUs:**
@@ -43,7 +38,7 @@ These don't include models that have been finetuned or otherwise aligned, as per

**Permitted datasets**

Any open-source dataset is allowed. Originally, [per competition rules](https://llm-efficiency-challenge.github.io/challenge), datasets that utilize "generated content" from other LLMs were not permitted. However, the rules were recently softened to also allow LLM-generated datasets if those datasets are made available and if it is not against the usage restrictions and guidelines of the LLM. If you plan to use a specific dataset that is not explicitely listed on the [challenge website](https://llm-efficiency-challenge.github.io/challenge) or want to use LLM-generated data, it is recommended to reach out to the organizers and confirm that this is in line with the competition rules.
Any open-source dataset is allowed. Originally, [per competition rules](https://llm-efficiency-challenge.github.io/challenge), datasets that utilize "generated content" from other LLMs were not permitted. However, the rules were recently softened to also allow LLM-generated datasets if those datasets are made available and if it is not against the usage restrictions and guidelines of the LLM. If you plan to use a specific dataset that is not explicitly listed on the [challenge website](https://llm-efficiency-challenge.github.io/challenge) or want to use LLM-generated data, it is recommended to reach out to the organizers and confirm that this is in line with the competition rules.

Examples of permitted datasets are the following:

@@ -171,7 +166,6 @@ python eval/lm_eval_harness.py \

To evaluate a LoRA-finetuned model, you need to first merge the LoRA weights with the base model to create a new checkpoint file:


```bash
python scripts/merge_lora.py \
--checkpoint_dir "checkpoints/stabilityai/stablelm-base-alpha-3b/" \
@@ -205,7 +199,6 @@ python eval/lm_eval_harness.py \

You will be required to submit a Docker image for the submission itself. Fortunately, the organizers have a GitHub repository with the exact steps [here](https://github.com/llm-efficiency-challenge/neurips_llm_efficiency_challenge) and a toy-submission setup guide to test your model locally before submission.


 

## Additional Information & Resources
8 changes: 5 additions & 3 deletions tutorials/oom.md
@@ -23,13 +23,15 @@ Experiment with different micro batch sizes to find a balance between memory con
### Reduce the model's context length

The context length (`block_size` in the code) plays a significant role in running models with attention.
* The pretraining scripts are configured to use the full context length of the model to train.

* The pretraining scripts are configured to use the full context length of the model to train.
* The finetuning scripts are configured to use the longest sample length of the training data to avoid allocating unnecessary memory (`max_seq_length` in the code).
If that's longer than the model's context length, an error is raised. If you try to run a batch that is longer than this, an error is raised.
If that's longer than the model's context length, an error is raised. If you try to run a batch that is longer than this, an error is raised.

However, your hardware may not support such large context lengths. Here's what you can do:

* For the pretraining scripts, you can simply reduce the `Config(block_size=...)` value.
* For the finetuning scripts, you can trim the length of the samples in your dataset.
* For the finetuning scripts, you can trim the length of the samples in your dataset.
Most of the `scripts/prepare_*.py` scripts expose a `--max_seq_length=...` argument. This might also be useful in cases where
sample lengths are highly unbalanced, as the presence of a single very long sample would incur a larger memory usage for all other
shorter samples. For example, the median length of the samples in Alpaca is 110 tokens. Truncating the Alpaca dataset to 256 max tokens reduces the memory requirements of a Falcon 7B model from 23.52 GB to 15.73 GB. For more information about the dataset truncation, please see the *Truncating datasets* section in the [prepare_datasets.md](prepare_datasets.md) tutorial.
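
As a concrete illustration of the truncation described above, here is a minimal sketch (the Falcon checkpoint path is an assumption; substitute the model you are actually finetuning):

```bash
# Sketch: cap Alpaca samples at 256 tokens before finetuning to lower peak memory.
# --max_seq_length is the flag described above; the checkpoint path is assumed.
python scripts/prepare_alpaca.py \
    --checkpoint_dir checkpoints/tiiuae/falcon-7b \
    --max_seq_length 256
```
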
19 changes: 7 additions & 12 deletions tutorials/prepare_dataset.md
@@ -2,10 +2,9 @@

Below is a table of all datasets that are currently supported in Lit-GPT:


| Name | Task | Size | Reference Repo | Paper / Blog | Data License |
|--------------|-------------|---------------------|-----------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Alpaca | Finetuning | 51,759 samples | [URL](https://github.com/tatsu-lab/stanford_alpaca) | [URL](https://crfm.stanford.edu/2023/03/13/alpaca.html) | Attribution-NonCommercial 4.0 International, [ URL](https://crfm.stanford.edu/2023/03/13/alpaca.html) |
| Alpaca | Finetuning | 51,759 samples | [URL](https://github.com/tatsu-lab/stanford_alpaca) | [URL](https://crfm.stanford.edu/2023/03/13/alpaca.html) | Attribution-NonCommercial 4.0 International, [URL](https://crfm.stanford.edu/2023/03/13/alpaca.html) |
| Alpaca Libre | Finetuning | 55,370 samples | [URL](https://github.com/mobarski/alpaca-libre) | - | CC0/MIT, [URL](https://github.com/mobarski/alpaca-libre) |
| Dolly | Finetuning | 15,011 samples | [URL](https://github.com/databrickslabs/dolly/tree/master/data) | [URL](https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm) | CC-BY-SA, [URL](https://github.com/databrickslabs/dolly#model-overview) |
| LongForm | Finetuning | 23,652 samples | [URL](https://github.com/akoksal/LongForm) | [URL](https://arxiv.org/abs/2304.08460) | No information provided and subset-dependent, [URL](https://github.com/akoksal/LongForm) |
@@ -32,7 +31,6 @@ The steps here only need to be done once before preparing the finetuning dataset

 


The Alpaca dataset consists of 52,000 instructions and demonstrations produced by OpenAI's text-davinci-003 engine. This data is used in instruction-tuning, helping improve the performance of language models to follow instructions.

In its development, the creators leveraged the data generation methodology from the [Self-Instruct framework](https://github.com/yizhongw/self-instruct).
@@ -91,8 +89,6 @@ python scripts/prepare_alpaca.py \
--max_seq_length 256
```



 

### Dolly
@@ -188,7 +184,6 @@ python scripts/prepare_dolly.py \
--max_seq_length 512
```


 

### Finetuning After Data Preparation
@@ -217,7 +212,7 @@ Please read the [tutorials/finetune_*.md](../tutorials) documents for more infor

The models in Lit-GPT expect datasets for instruction finetuning in the following format:

```
```text
[
{
"instruction": "Write a limerick about a
@@ -237,6 +232,7 @@ The models in Lit-GPT expect datasets for instruction finetuning in the followin
},
]
```

(Note that depending on the task, the `"input"` text can be an empty string, as shown above.)
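
For illustration, a complete record with an empty `"input"` field could be written as follows (a sketch as a shell heredoc; the file name and example texts are made up):

```bash
# Sketch: write a tiny instruction dataset in the format described above.
cat > my_instructions.json <<'EOF'
[
    {
        "instruction": "Name the capital of France.",
        "input": "",
        "output": "The capital of France is Paris."
    }
]
EOF
```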

Custom datasets can be prepared by either creating a new `scripts/prepare_dataset.py` script or reading the dataset
@@ -259,6 +255,7 @@ Before you finetune, prepare the dataset using the `prepare_csv.py` script:
```bash
python scripts/prepare_csv.py --csv_path path/to/the/file.csv
```

You can also customize the dataset generation by using these additional parameters

- `destination_path`: The folder where the binary data will be saved. By default, it is saved inside `data/csv`
@@ -284,11 +281,12 @@ python scripts/prepare_csv.py --csv_path test_data.csv \
--mask_inputs false \
--ignore_index -1
```

Replace `test_data.csv` with your CSV path and the other additional parameters accordingly. Executing the command above will save `train.pt` and `test.pt` on your disk at the `destination_path`. Now you can use the prepared data to [finetune your model](https://github.com/Lightning-AI/lit-gpt/blob/main/tutorials/finetune_lora.md#running-the-finetuning).

 

### Preparing Custom Datasets Using a Dataset Prepration Script
### Preparing Custom Datasets Using a Dataset Preparation Script

If you don't have a CSV file following the format described in the previous section, the easiest way to prepare a new dataset is to copy and modify one of the existing dataset preparation scripts:

@@ -299,16 +297,13 @@ These scripts may look intimidating at first glance since they include code for

In [`scripts/prepare_lima.py`](https://github.com/Lightning-AI/lit-gpt/blob/main/scripts/prepare_lima.py), the [line 26](https://github.com/Lightning-AI/lit-gpt/blob/98fad263a62e5e57821de817bdd5e316abfb34d4/scripts/prepare_lima.py#L26) references the HF repo ID, and the lines [50-53](https://github.com/Lightning-AI/lit-gpt/blob/98fad263a62e5e57821de817bdd5e316abfb34d4/scripts/prepare_lima.py#L50-L53) save the dataset as `train_data`. Here, `train_data` is a list that contains the instruction examples in the format mentioned above.


In [`scripts/prepare_alpaca.py`](https://github.com/Lightning-AI/lit-gpt/blob/main/scripts/prepare_alpaca.py), you only need to modify [lines 24-25](https://github.com/Lightning-AI/lit-gpt/blob/98fad263a62e5e57821de817bdd5e316abfb34d4/scripts/prepare_alpaca.py#L24-L25) for the file name and URL, assuming the JSON file you are working with has the same format as the [Alpaca JSON file](https://raw.githubusercontent.com/tloen/alpaca-lora/main/alpaca_data_cleaned_archive.json).
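
Putting the two paragraphs above together, a minimal sketch of the workflow could look like this (the new script name and destination path are placeholders; the constants to edit are the ones referenced by the line numbers above):

```bash
# Sketch: start a new preparation script from the Alpaca template.
cp scripts/prepare_alpaca.py scripts/prepare_mydataset.py

# Edit the file name and URL near the top of the copy so they point at your
# own JSON file (see the referenced lines above), then run the script.
python scripts/prepare_mydataset.py --destination_path data/mydataset
```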



 

## Preparing Pretraining Datasets

In addition to the finetuning dataset described above, Lit-GPT also supports several datasets for pretraining. The pretraining datasets are described in more detail in the following separate tutorial documents:

- [Pretrain Llama 2 on OpenWebText](./pretrain_openwebtext.md)
- [Pretrain Llama 2 on RedPajama](./pretrain_redpajama.md)
- [Pretrain Llama 2 on RedPajama](./pretrain_redpajama.md)
8 changes: 2 additions & 6 deletions tutorials/pretrain_openwebtext.md
@@ -6,10 +6,8 @@ This tutorial will walk you through setting up the OpenWebText dataset and launc

[OpenWebText](https://github.com/jcpeterson/openwebtext) is an open-source reproduction of OpenAI's unreleased WebText training dataset, which was originally used to train GPT-2. The version that is used here consists of 8M documents and is loaded via the `load_dataset("openwebtext", ...)` function from the [datasets](https://github.com/huggingface/datasets) Python package. [Please refer to the website hosting the dataset](https://huggingface.co/datasets/Skylion007/openwebtext) for the licensing information.


## Prepare OpenWebText for training


In order to start pretraining lit-gpt on it, you need to read, tokenize, and write the data in binary format.

To prepare the dataset with the Llama 2 tokenizer, run
@@ -24,7 +22,6 @@ python scripts/prepare_openwebtext.py \

The script will take about 15 min to run.


## Pretraining

Running the pretraining script with its default settings requires at least 4 GPUs with 40GB+ each. (However, alternatively, you can train a smaller Pythia-70m on 1 GPU, more information about that further below).
@@ -47,8 +44,8 @@ model_name = "Llama-2-7b-hf"

at the top of this script.

The currently supported model names are contained in the [config.py](https://github.com/Lightning-AI/lit-gpt/lit_gpt/config.py) file.
You can
The currently supported model names are contained in the [config.py](https://github.com/Lightning-AI/lit-gpt/lit_gpt/config.py) file.
You can

1) either search this file for lines containing "name =",
2) or run `python scripts/download.py` without additional command line arguments,
@@ -77,7 +74,6 @@ call a logging client library like `wandb` directly.

To train a smaller Pythia 70M model on a single GPU, you can modify the `pretrain/openwebtext.py` file to use the following settings:


```python
model_name = "pythia-70m"
```
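
With that change in place, a single-GPU run can then be launched directly (a sketch that assumes the script's remaining defaults are left untouched):

```bash
# Sketch: launch the modified script, restricting it to one GPU.
CUDA_VISIBLE_DEVICES=0 python pretrain/openwebtext.py
```
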
8 changes: 5 additions & 3 deletions tutorials/pretrain_tinyllama.md
@@ -22,7 +22,6 @@ Here is a quick fact sheet:

(this table was sourced from the author's [README](https://github.com/jzhang38/TinyLlama/))


## Download datasets

You can download the data using git lfs:
@@ -36,11 +35,12 @@ git lfs install
git clone https://huggingface.co/datasets/cerebras/slimpajama-627b data/slimpajama-raw
git clone https://huggingface.co/datasets/bigcode/starcoderdata data/starcoderdata-raw
```

Around 1.2 TB of disk space is required to store both datasets.
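
Before cloning, it is worth confirming that the target volume actually has that much free space (generic shell commands, not specific to Lit-GPT):

```bash
df -h .            # free space on the current volume
du -sh data/*-raw  # size of the cloned datasets afterwards
```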

## Prepare the datasets for training

In order to start pretraining lit-gpt on it, you need to read, tokenize, and write the data in binary chunks. This will leverage our `lightning.data` optimization pipeline and streaming dataset that comes with Lightning.
In order to start pretraining lit-gpt on it, you need to read, tokenize, and write the data in binary chunks. This will leverage our `lightning.data` optimization pipeline and streaming dataset that comes with Lightning.

First, install additional dependencies for preprocessing:

@@ -61,6 +61,7 @@ Then, run the preprocessing script for each dataset and split.
You will require **1.1 TB** of disk space for Starcoder and **2.5** TB of space for the SlimPajama dataset.

**Starcoder:**

```bash
python scripts/prepare_starcoder.py \
--input_dir data/starcoderdata-raw \
@@ -69,6 +70,7 @@ python scripts/prepare_starcoder.py \
```

**SlimPajama:**

```bash
python scripts/prepare_slimpajama.py \
--input_dir data/slimpajama-raw/validation \
@@ -89,10 +91,10 @@ python scripts/prepare_slimpajama.py \
If you want to run on a small slice of the datasets first, pass the flag `--fast_dev_run=true` to the commands above.
In the above we are assuming that you will be using the same tokenizer as used in LLaMA/TinyLlama, but any trained [SentencePiece](https://github.com/google/sentencepiece) tokenizer with a 32000 vocabulary size will do here.
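
For example, a quick smoke test on a small slice of SlimPajama could look like the following sketch (the remaining arguments of the full command are collapsed in the diff above and need to be filled in accordingly):

```bash
# Sketch: exercise the preprocessing pipeline on a small slice of the data.
python scripts/prepare_slimpajama.py \
    --input_dir data/slimpajama-raw/validation \
    --fast_dev_run=true
```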


## Pretraining

Currently, the pretraining with `torch.compile` requires PyTorch 2.2 "nightly". We recommend CUDA 12.1:

```bash
pip install -U --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121
```
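
After installing, you can confirm that the nightly build and the CUDA 12.1 wheels were picked up:

```bash
python -c "import torch; print(torch.__version__, torch.version.cuda)"
```
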
4 changes: 2 additions & 2 deletions tutorials/resource-tables.md
@@ -1,13 +1,13 @@
# Resource Tables

- Last updated: 10/20/2023
- Last updated: 10/20/2023
- Lit-GPT version: commit 8641822
- Hardware: NVIDIA A100-SXM4-40GB
- OS: Ubuntu 22.04.3 LTS (x86_64)
- Nvidia driver version: 525.125.06
- Relevant libraries
- CMake 3.26.4
- Libc glibc-2.35
- Libc glibc-2.35
- PyTorch 2.1.0+cu121
- Lightning 2.1.0.rc0
- Bitsandbytes 0.41.1
