
Commit

Codebase chores: typos, trailing spaces, markdown linting, Ruff and Black. (Lightning-AI#777)

Andrei-Aksionov authored Nov 24, 2023
1 parent aa6ee9e commit 2647dbd
Showing 12 changed files with 27 additions and 44 deletions.
2 changes: 1 addition & 1 deletion pretrain/openwebtext_trainer.py
@@ -2,7 +2,7 @@
import sys
import time
from pathlib import Path
from typing import Any, Optional, Dict
from typing import Any, Dict, Optional

import lightning as L
import numpy as np
5 changes: 2 additions & 3 deletions pretrain/tinyllama.py
@@ -22,7 +22,7 @@
wd = Path(__file__).parent.parent.resolve()
sys.path.append(str(wd))

from lit_gpt.model import GPT, Block, Config, LLaMAMLP, CausalSelfAttention
from lit_gpt.model import GPT, Block, CausalSelfAttention, Config, LLaMAMLP
from lit_gpt.packed_dataset import CombinedDataset
from lit_gpt.utils import chunked_cross_entropy, num_parameters

@@ -154,7 +154,6 @@ def train(fabric, state, train_dataloader, val_dataloader, resume):
curr_iter = 0

for train_data in train_dataloader:

if state["iter_num"] >= max_iters:
break

@@ -336,7 +335,7 @@ def choose_logger(logger_name: str, name: str, resume: Union[bool, Path], *args,
return TensorBoardLogger(root_dir="logs", name=name, *args, **kwargs)
if logger_name == "wandb":
return WandbLogger(project="tinyllama", name=name, resume=(resume is not False), *args, **kwargs)
raise ValueError(f"`logger={logger} is not a valid option.")
raise ValueError(f"`logger={logger_name}` is not a valid option.")


if __name__ == "__main__":
1 change: 0 additions & 1 deletion tests/test_lm_eval_harness.py
@@ -6,7 +6,6 @@

import datasets
import pytest
from conftest import RunIf
from lightning import Fabric


2 changes: 1 addition & 1 deletion tests/test_tokenizer.py
@@ -25,7 +25,7 @@ def test_tokenizer_against_hf(config):
cache_dir = Path("/tmp/tokenizer_test_cache")

# create a checkpoint directory that points to the HF files
checkpoint_dir = cache_dir / "ligpt" / config.hf_config["org"] / config.hf_config["name"]
checkpoint_dir = cache_dir / "litgpt" / config.hf_config["org"] / config.hf_config["name"]
if not checkpoint_dir.exists():
file_to_cache = {}
for file in ("tokenizer.json", "generation_config.json", "tokenizer.model", "tokenizer_config.json"):
3 changes: 1 addition & 2 deletions tutorials/convert_lit_models.md
@@ -13,8 +13,7 @@ python scripts/convert_lit_checkpoint.py \

These paths are just placeholders; you will need to customize them based on which finetuning or pretraining script you ran and its configuration.


Please note that if you want to convert a model that has been fine-tuned using an adapter like LoRA, these weights should be [merged](../scripts/merge_lora.py) to the checkpoint prior to converting.
Please note that if you want to convert a model that has been fine-tuned using an adapter like LoRA, these weights should be [merged](../scripts/merge_lora.py) to the checkpoint prior to converting.

```sh
python scripts/merge_lora.py \
2 changes: 0 additions & 2 deletions tutorials/download_phi15.md
@@ -11,10 +11,8 @@ The model was trained the same data sources (7B tokens) as its [phi-1](https://a

In addition, to create phi-1.5, the authors included additional textbook-quality synthetic text (roughly 20B tokens) in natural language, which was created using the [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644) approach.


The model weights are released under a [*Microsoft Research license*](https://huggingface.co/microsoft/phi-1_5/blob/main/README.md#license).


In order to use the phi-1.5 model checkpoint, which requires about 3 GB of disk space, download the weights and convert the checkpoint to the lit-gpt format:

```bash
9 changes: 1 addition & 8 deletions tutorials/neurips_challenge_quickstart.md
@@ -1,18 +1,13 @@
# NeurIPS 2023 LLM Efficiency Challenge Quickstart Guide



The [NeurIPS 2023 Efficiency Challenge](https://llm-efficiency-challenge.github.io/) is a competition focused on training **1 LLM for 24 hours on 1 GPU** – the team with the best LLM gets to present their results at NeurIPS 2023.

This quick start guide is a short starter guide illustrating the main steps to get started with Lit-GPT, which was selected as the competition's official starter kit.



 

## Competition Facts


 

**Permitted GPUs:**
@@ -43,7 +38,7 @@ These don't include models that have been finetuned or otherwise aligned, as per

**Permitted datasets**

Any open-source dataset is allowed. Originally, [per competition rules](https://llm-efficiency-challenge.github.io/challenge), datasets that utilize "generated content" from other LLMs were not permitted. However, the rules were recently softened to also allow LLM-generated datasets if those datasets are made available and if it is not against the usage restrictions and guidelines of the LLM. If you plan to use a specific dataset that is not explicitely listed on the [challenge website](https://llm-efficiency-challenge.github.io/challenge) or want to use LLM-generated data, it is recommended to reach out to the organizers and confirm that this is in line with the competition rules.
Any open-source dataset is allowed. Originally, [per competition rules](https://llm-efficiency-challenge.github.io/challenge), datasets that utilize "generated content" from other LLMs were not permitted. However, the rules were recently softened to also allow LLM-generated datasets if those datasets are made available and if it is not against the usage restrictions and guidelines of the LLM. If you plan to use a specific dataset that is not explicitly listed on the [challenge website](https://llm-efficiency-challenge.github.io/challenge) or want to use LLM-generated data, it is recommended to reach out to the organizers and confirm that this is in line with the competition rules.

Examples of permitted datasets are the following:

@@ -171,7 +166,6 @@ python eval/lm_eval_harness.py \

To evaluate a LoRA-finetuned model, you need to first merge the LoRA weights with the base model to create a new checkpoint file:


```bash
python scripts/merge_lora.py \
--checkpoint_dir "checkpoints/stabilityai/stablelm-base-alpha-3b/" \
@@ -205,7 +199,6 @@ python eval/lm_eval_harness.py \

You will be required to submit a Docker image for the submission itself. Fortunately, the organizers have a GitHub repository with the exact steps [here](https://github.com/llm-efficiency-challenge/neurips_llm_efficiency_challenge) and a toy-submission setup guide to test your model locally before submission.


 

## Additional Information & Resources
8 changes: 5 additions & 3 deletions tutorials/oom.md
@@ -23,13 +23,15 @@ Experiment with different micro batch sizes to find a balance between memory con
### Reduce the model's context length

The context length (`block_size` in the code) plays a significant role in running models with attention.
* The pretraining scripts are configured to use the full context length of the model to train.

* The pretraining scripts are configured to use the full context length of the model to train.
* The finetuning scripts are configured to use the longest sample length of the training data to avoid allocating unnecessary memory (`max_seq_length` in the code).
If that's longer than the model's context length, an error is raised. If you try to run a batch that is longer than this, an error is raised.
If that's longer than the model's context length, an error is raised. If you try to run a batch that is longer than this, an error is raised.

However, your hardware may not support such large context lengths. Here's what you can do:

* For the pretraining scripts, you can simply reduce the `Config(block_size=...)` value.
* For the finetuning scripts, you can trim the length of the samples in your dataset.
* For the finetuning scripts, you can trim the length of the samples in your dataset.
Most of the `scripts/prepare_*.py` scripts expose a `--max_seq_length=...` argument. This might also be useful in cases where
sample lengths are highly unbalanced, as the presence of a single very long sample would incur a larger memory usage for all other
shorter samples. For example, the median length of the samples in Alpaca is 110 tokens. Truncating the Alpaca dataset to 256 max tokens reduces the memory requirements of a Falcon 7B model from 23.52 GB to 15.73 GB. For more information about the dataset truncation, please see the *Truncating datasets* section in the [prepare_datasets.md](prepare_datasets.md) tutorial.
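
As a concrete illustration of the truncation described above, here is a minimal sketch (the Falcon checkpoint path is an assumption; substitute the model you are actually finetuning):

```bash
# Sketch: cap Alpaca samples at 256 tokens before finetuning to lower peak memory.
# --max_seq_length is the flag described above; the checkpoint path is assumed.
python scripts/prepare_alpaca.py \
    --checkpoint_dir checkpoints/tiiuae/falcon-7b \
    --max_seq_length 256
```
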
19 changes: 7 additions & 12 deletions tutorials/prepare_dataset.md
@@ -2,10 +2,9 @@

Below is a table of all datasets that are currently supported in Lit-GPT:


| Name | Task | Size | Reference Repo | Paper / Blog | Data License |
|--------------|-------------|---------------------|-----------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Alpaca | Finetuning | 51,759 samples | [URL](https://github.com/tatsu-lab/stanford_alpaca) | [URL](https://crfm.stanford.edu/2023/03/13/alpaca.html) | Attribution-NonCommercial 4.0 International, [ URL](https://crfm.stanford.edu/2023/03/13/alpaca.html) |
| Alpaca | Finetuning | 51,759 samples | [URL](https://github.com/tatsu-lab/stanford_alpaca) | [URL](https://crfm.stanford.edu/2023/03/13/alpaca.html) | Attribution-NonCommercial 4.0 International, [URL](https://crfm.stanford.edu/2023/03/13/alpaca.html) |
| Alpaca Libre | Finetuning | 55,370 samples | [URL](https://github.com/mobarski/alpaca-libre) | - | CC0/MIT, [URL](https://github.com/mobarski/alpaca-libre) |
| Dolly | Finetuning | 15,011 samples | [URL](https://github.com/databrickslabs/dolly/tree/master/data) | [URL](https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm) | CC-BY-SA, [URL](https://github.com/databrickslabs/dolly#model-overview) |
| LongForm | Finetuning | 23,652 samples | [URL](https://github.com/akoksal/LongForm) | [URL](https://arxiv.org/abs/2304.08460) | No information provided and subset-dependent, [URL](https://github.com/akoksal/LongForm) |
@@ -32,7 +31,6 @@ The steps here only need to be done once before preparing the finetuning dataset

 


The Alpaca dataset consists of 52,000 instructions and demonstrations produced by OpenAI's text-davinci-003 engine. This data is used in instruction-tuning, helping improve the performance of language models to follow instructions.

In its development, the creators leveraged the data generation methodology from the [Self-Instruct framework](https://github.com/yizhongw/self-instruct).
@@ -91,8 +89,6 @@ python scripts/prepare_alpaca.py \
--max_seq_length 256
```



 

### Dolly
@@ -188,7 +184,6 @@ python scripts/prepare_dolly.py \
--max_seq_length 512
```


 

### Finetuning After Data Preparation
@@ -217,7 +212,7 @@ Please read the [tutorials/finetune_*.md](../tutorials) documents for more infor

The models in Lit-GPT expect datasets for instruction finetuning in the following format:

```
```text
[
{
"instruction": "Write a limerick about a
@@ -237,6 +232,7 @@ The models in Lit-GPT expect datasets for instruction finetuning in the followin
},
]
```

(Note that depending on the task, the `"input"` text can be an empty string, as shown above.)
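
For illustration, a complete record with an empty `"input"` field could be written as follows (a sketch as a shell heredoc; the file name and example texts are made up):

```bash
# Sketch: write a tiny instruction dataset in the format described above.
cat > my_instructions.json <<'EOF'
[
    {
        "instruction": "Name the capital of France.",
        "input": "",
        "output": "The capital of France is Paris."
    }
]
EOF
```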

Custom datasets can be prepared by either creating a new `scripts/prepare_dataset.py` script or reading the dataset
@@ -259,6 +255,7 @@ Before you finetune, prepare the dataset using the `prepare_csv.py` script:
```bash
python scripts/prepare_csv.py --csv_path path/to/the/file.csv
```

You can also customize the dataset generation by using these additional parameters

- `destination_path`: The folder where the binary data will be saved. By default, it is saved inside `data/csv`
@@ -284,11 +281,12 @@ python scripts/prepare_csv.py --csv_path test_data.csv \
--mask_inputs false \
--ignore_index -1
```

Replace `test_data.csv` with your CSV path and the other additional parameters accordingly. Executing the command above will save `train.pt` and `test.pt` on your disk at the `destination_path`. Now you can use the prepared data to [finetune your model](https://github.com/Lightning-AI/lit-gpt/blob/main/tutorials/finetune_lora.md#running-the-finetuning).

 

### Preparing Custom Datasets Using a Dataset Prepration Script
### Preparing Custom Datasets Using a Dataset Preparation Script

If you don't have a CSV file following the format described in the previous section, the easiest way to prepare a new dataset is to copy and modify one of the existing dataset preparation scripts:

@@ -299,16 +297,13 @@ These scripts may look intimidating at first glance since they include code for

In [`scripts/prepare_lima.py`](https://github.com/Lightning-AI/lit-gpt/blob/main/scripts/prepare_lima.py), the [line 26](https://github.com/Lightning-AI/lit-gpt/blob/98fad263a62e5e57821de817bdd5e316abfb34d4/scripts/prepare_lima.py#L26) references the HF repo ID, and the lines [50-53](https://github.com/Lightning-AI/lit-gpt/blob/98fad263a62e5e57821de817bdd5e316abfb34d4/scripts/prepare_lima.py#L50-L53) save the dataset as `train_data`. Here, `train_data` is a list that contains the instruction examples in the format mentioned above.


In [`scripts/prepare_alpaca.py`](https://github.com/Lightning-AI/lit-gpt/blob/main/scripts/prepare_alpaca.py), you only need to modify [lines 24-25](https://github.com/Lightning-AI/lit-gpt/blob/98fad263a62e5e57821de817bdd5e316abfb34d4/scripts/prepare_alpaca.py#L24-L25) for the file name and URL, assuming the JSON file you are working with has the same format as the [Alpaca JSON file](https://raw.githubusercontent.com/tloen/alpaca-lora/main/alpaca_data_cleaned_archive.json).
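
Putting the two paragraphs above together, a minimal sketch of the workflow could look like this (the new script name and destination path are placeholders; the constants to edit are the ones referenced by the line numbers above):

```bash
# Sketch: start a new preparation script from the Alpaca template.
cp scripts/prepare_alpaca.py scripts/prepare_mydataset.py

# Edit the file name and URL near the top of the copy so they point at your
# own JSON file (see the referenced lines above), then run the script.
python scripts/prepare_mydataset.py --destination_path data/mydataset
```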



 

## Preparing Pretraining Datasets

In addition to the finetuning dataset described above, Lit-GPT also supports several datasets for pretraining. The pretraining datasets are described in more detail in the following separate tutorial documents:

- [Pretrain Llama 2 on OpenWebText](./pretrain_openwebtext.md)
- [Pretrain Llama 2 on RedPajama](./pretrain_redpajama.md)
- [Pretrain Llama 2 on RedPajama](./pretrain_redpajama.md)
8 changes: 2 additions & 6 deletions tutorials/pretrain_openwebtext.md
@@ -6,10 +6,8 @@ This tutorial will walk you through setting up the OpenWebText dataset and launc

[OpenWebText](https://github.com/jcpeterson/openwebtext) is an open-source reproduction of OpenAI's unreleased WebText training dataset, which was originally used to train GPT-2. The version that is used here consists of 8M documents and is loaded via the `load_dataset("openwebtext", ...)` function from the [datasets](https://github.com/huggingface/datasets) Python package. [Please refer to the website hosting the dataset](https://huggingface.co/datasets/Skylion007/openwebtext) for the licensing information.


## Prepare OpenWebText for training


In order to start pretraining lit-gpt on it, you need to read, tokenize, and write the data in binary format.

To prepare the dataset with the Llama 2 tokenizer, run
@@ -24,7 +22,6 @@ python scripts/prepare_openwebtext.py \

The script will take about 15 min to run.


## Pretraining

Running the pretraining script with its default settings requires at least 4 GPUs with 40GB+ each. (However, alternatively, you can train a smaller Pythia-70m on 1 GPU, more information about that further below).
@@ -47,8 +44,8 @@ model_name = "Llama-2-7b-hf"

at the top of this script.

The currently supported model names are contained in the [config.py](https://github.com/Lightning-AI/lit-gpt/lit_gpt/config.py) file.
You can
The currently supported model names are contained in the [config.py](https://github.com/Lightning-AI/lit-gpt/lit_gpt/config.py) file.
You can

1) either search this file for lines containing "name =",
2) or run `python scripts/download.py` without additional command line arguments,
@@ -77,7 +74,6 @@ call a logging client library like `wandb` directly.

To train a smaller Pythia 70M model on a single GPU, you can modify the `pretrain/openwebtext.py` file to use the following settings:


```python
model_name = "pythia-70m"
```
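
With that change in place, a single-GPU run can then be launched directly (a sketch that assumes the script's remaining defaults are left untouched):

```bash
# Sketch: launch the modified script, restricting it to one GPU.
CUDA_VISIBLE_DEVICES=0 python pretrain/openwebtext.py
```
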
8 changes: 5 additions & 3 deletions tutorials/pretrain_tinyllama.md
@@ -22,7 +22,6 @@ Here is a quick fact sheet:

(this table was sourced from the author's [README](https://github.com/jzhang38/TinyLlama/))


## Download datasets

You can download the data using git lfs:
@@ -36,11 +35,12 @@ git lfs install
git clone https://huggingface.co/datasets/cerebras/slimpajama-627b data/slimpajama-raw
git clone https://huggingface.co/datasets/bigcode/starcoderdata data/starcoderdata-raw
```

Around 1.2 TB of disk space is required to store both datasets.
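
Before cloning, it is worth confirming that the target volume actually has that much free space (generic shell commands, not specific to Lit-GPT):

```bash
df -h .            # free space on the current volume
du -sh data/*-raw  # size of the cloned datasets afterwards
```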

## Prepare the datasets for training

In order to start pretraining lit-gpt on it, you need to read, tokenize, and write the data in binary chunks. This will leverage our `lightning.data` optimization pipeline and streaming dataset that comes with Lightning.
In order to start pretraining lit-gpt on it, you need to read, tokenize, and write the data in binary chunks. This will leverage our `lightning.data` optimization pipeline and streaming dataset that comes with Lightning.

First, install additional dependencies for preprocessing:

@@ -61,6 +61,7 @@ Then, run the preprocessing script for each dataset and split.
You will require **1.1 TB** of disk space for Starcoder and **2.5** TB of space for the SlimPajama dataset.

**Starcoder:**

```bash
python scripts/prepare_starcoder.py \
--input_dir data/starcoderdata-raw \
@@ -69,6 +70,7 @@ python scripts/prepare_starcoder.py \
```

**SlimPajama:**

```bash
python scripts/prepare_slimpajama.py \
--input_dir data/slimpajama-raw/validation \
@@ -89,10 +91,10 @@ python scripts/prepare_slimpajama.py \
If you want to run on a small slice of the datasets first, pass the flag `--fast_dev_run=true` to the commands above.
In the above we are assuming that you will be using the same tokenizer as used in LLaMA/TinyLlama, but any trained [SentencePiece](https://github.com/google/sentencepiece) tokenizer with a 32000 vocabulary size will do here.
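
For example, a quick smoke test on a small slice of SlimPajama could look like the following sketch (the remaining arguments of the full command are collapsed in the diff above and need to be filled in accordingly):

```bash
# Sketch: exercise the preprocessing pipeline on a small slice of the data.
python scripts/prepare_slimpajama.py \
    --input_dir data/slimpajama-raw/validation \
    --fast_dev_run=true
```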


## Pretraining

Currently, the pretraining with `torch.compile` requires PyTorch 2.2 "nightly". We recommend CUDA 12.1:

```bash
pip install -U --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121
```
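
After installing, you can confirm that the nightly build and the CUDA 12.1 wheels were picked up:

```bash
python -c "import torch; print(torch.__version__, torch.version.cuda)"
```
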
4 changes: 2 additions & 2 deletions tutorials/resource-tables.md
@@ -1,13 +1,13 @@
# Resource Tables

- Last updated: 10/20/2023
- Last updated: 10/20/2023
- Lit-GPT version: commit 8641822
- Hardware: NVIDIA A100-SXM4-40GB
- OS: Ubuntu 22.04.3 LTS (x86_64)
- Nvidia driver version: 525.125.06
- Relevant libraries
- CMake 3.26.4
- Libc glibc-2.35
- Libc glibc-2.35
- PyTorch 2.1.0+cu121
- Lightning 2.1.0.rc0
- Bitsandbytes 0.41.1
