This benchmark is part of the JUPITER Benchmark Suite. See the repository of the suite for some general remarks.
This repository contains the Megatron-LM NLP/LLM benchmark. DESCRIPTION.md
contains details for compilation, execution, and evaluation.
The required source code (Megatron-LM, Apex) is included in the `./src/` subdirectory as submodules of the upstream repositories: github.com/NVIDIA/Megatron-LM for Megatron-LM and github.com/NVIDIA/apex for Apex. Sample data files are also included.
- `benchmark`
  - `aux`
    - `tokenizers`
    - `get_shrink_data_and_tokenizers.sh`: script used for getting data and tokenizers
    - `job_preprocess_data.sbatch`: script used for preprocessing data
    - sample 10 MB OSCAR dataset obtained using `get_shrink_data_and_tokenizers.sh`
  - `env`
    - `activate.bash`: script for activating the Python virtual environment
    - `setup_venv.sh`: script to set up the Python virtual environment
  - `slurm`
    - sbatch scripts for the 13B and 175B models, to be used when running without JUBE
  - `jube`
    - accompanying files for a JUBE run and the JUBE YAML file
    - `aux`
- `src`
  - `data`: contains the preprocessed data (`*.idx` and `*.bin` files)
  - `compile_build.sh`: script to build the software dependencies
  - `variables.bash`: file that sets important paths
  - `prebuild_kernels.py`: script to prebuild fused kernels
The following workflow can be used if data and tokenizers are not already present with this repository (the commands are also summarized in the sketch after this list):

- Step 1: Set the `NLP_BENCH_ROOT` variable with `export NLP_BENCH_ROOT=<rootdir path of this benchmark>` in your bash shell
- Step 2: `cd benchmark/aux/`
- Step 3: Run `bash get_shrink_data_and_tokenizers.sh` to get the tokenizers and shrink the raw data `oscar-1GB.jsonl.xz` to `oscar-10MB.jsonl.xz`
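For orientation, the three steps above correspond roughly to the following shell session. `<rootdir path of this benchmark>` is a placeholder for the location of your checkout, and changing directories via `NLP_BENCH_ROOT` is an assumption of this sketch rather than a requirement of the scripts.

```bash
# Sketch of the data/tokenizer retrieval steps; adjust the placeholder path.
export NLP_BENCH_ROOT=<rootdir path of this benchmark>   # Step 1
cd "$NLP_BENCH_ROOT"/benchmark/aux/                       # Step 2
bash get_shrink_data_and_tokenizers.sh                    # Step 3: get tokenizers, shrink oscar-1GB.jsonl.xz to oscar-10MB.jsonl.xz
```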
If your `src/data` folder does not contain preprocessed data (`*.idx` and `*.bin` files), execute `sbatch job_preprocess_data.sbatch` from the `benchmark/aux` directory after Step 5 of the "Workflow With Preprocessed Data And Tokenizers Available" below.
The `job_preprocess_data.sbatch` script in `benchmark/aux/` preprocesses `oscar-10MB.jsonl.xz` and places the result in `src/data/`. The script can be modified to preprocess any data of choice.
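A minimal sketch of the preprocessing step, assuming `NLP_BENCH_ROOT` is already exported and the environment from Steps 1-5 of the workflow below has been set up:

```bash
# Submit the preprocessing job; it converts oscar-10MB.jsonl.xz into the
# *.idx and *.bin files expected in src/data/.
cd "$NLP_BENCH_ROOT"/benchmark/aux/
sbatch job_preprocess_data.sbatch
```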
If preprocessed data and tokenizers are already available, the benchmark can be run as follows (see also the sketch after this list):

- Step 1: `cd` into the folder of this benchmark
- Step 2: Set the `NLP_BENCH_ROOT` variable with `export NLP_BENCH_ROOT=<rootdir path of this benchmark>` in your bash shell
- Step 3: Set `TORCH_CUDA_ARCH_LIST` according to the GPU's compute capability in `benchmark/env/activate.bash`
- Step 4: Run `bash benchmark/env/setup_venv.sh`
- Step 5: Run `bash src/compile_build.sh`
- Step 6: Run `sbatch benchmark/slurm/jobscript_13B.sbatch` or `sbatch benchmark/slurm/jobscript_175B.sbatch`
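Put together, the workflow looks roughly like the following sketch. The `TORCH_CUDA_ARCH_LIST` value shown is only an example for compute capability 8.0 (e.g. A100); set it to match your GPUs.

```bash
# Sketch of the full workflow; the path placeholder and architecture value are examples.
export NLP_BENCH_ROOT=<rootdir path of this benchmark>   # Step 2
cd "$NLP_BENCH_ROOT"                                      # Step 1

# Step 3: in benchmark/env/activate.bash, set e.g.
#   export TORCH_CUDA_ARCH_LIST="8.0"

bash benchmark/env/setup_venv.sh   # Step 4: set up the Python virtual environment
bash src/compile_build.sh          # Step 5: build the software dependencies

sbatch benchmark/slurm/jobscript_13B.sbatch      # Step 6: 13B model
# or: sbatch benchmark/slurm/jobscript_175B.sbatch  # 175B model
```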
The output `*.out` file contains result logs of the following form; these are the lines relevant for evaluation:
[default3]: iteration 10/ 292968 | consumed samples: 10240 | elapsed time per iteration (s): 35.8651 | learning rate: 4.734E-06 | global batch size: 1024 | lm loss: 1.332803E+01 | loss scale: 4096.0 | grad norm: 42.627 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 28.551 | TFLOPs: 199.03 |
[default3]: iteration 20/ 292968 | consumed samples: 20480 | elapsed time per iteration (s): 34.9991 | learning rate: 9.467E-06 | global batch size: 1024 | lm loss: 1.010884E+01 | loss scale: 4096.0 | grad norm: 13.038 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 29.258 | TFLOPs: 203.96 |
[default3]: iteration 30/ 292968 | consumed samples: 30720 | elapsed time per iteration (s): 34.8709 | learning rate: 1.420E-05 | global batch size: 1024 | lm loss: 9.072961E+00 | loss scale: 4096.0 | grad norm: 26.640 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 29.365 | TFLOPs: 204.71 |
[default3]: iteration 40/ 292968 | consumed samples: 40960 | elapsed time per iteration (s): 35.3346 | learning rate: 1.893E-05 | global batch size: 1024 | lm loss: 8.486469E+00 | loss scale: 4096.0 | grad norm: 3.441 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 28.980 | TFLOPs: 202.02 |
[default3]: iteration 50/ 292968 | consumed samples: 51200 | elapsed time per iteration (s): 35.3357 | learning rate: 2.367E-05 | global batch size: 1024 | lm loss: 8.
The metric `tokens_per_sec` should be calculated as `(1.0 / elapsed_time_per_iteration) * global_batch_size * sequence_length`, with `elapsed_time_per_iteration` and `global_batch_size` taken from the `*.out` file.
For submission, the throughput `tokens_per_sec` is converted into the time a hypothetical training run would require. This conversion assumes a training of 20 million tokens and uses the formula

`time_to_report_in_seconds = tokens / tokens_per_second`

Example: using the 13B model result `Tokens/sec: 59463.14`, we obtain a duration of 20,000,000 / 59463.14 = 336.34 seconds.

Hint: `sequence_length` can be found in the jobscript.
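As a minimal sketch of the evaluation arithmetic, the following script computes both metrics from values read off the log. The input numbers are taken from the example output above, and `sequence_length=2048` is an assumption; use the value from your jobscript.

```bash
#!/usr/bin/env bash
# Evaluation arithmetic sketch; replace the input values with those of your run.
elapsed_time_per_iteration=34.9991   # seconds, from the *.out log
global_batch_size=1024               # from the *.out log
sequence_length=2048                 # assumption; take the value from the jobscript
training_tokens=20000000             # fixed 20 million tokens for the submission metric

tokens_per_sec=$(awk -v t="$elapsed_time_per_iteration" -v b="$global_batch_size" -v s="$sequence_length" \
  'BEGIN { printf "%.2f", (1.0 / t) * b * s }')
time_to_report_in_seconds=$(awk -v n="$training_tokens" -v tps="$tokens_per_sec" \
  'BEGIN { printf "%.2f", n / tps }')

echo "tokens_per_sec            = $tokens_per_sec"
echo "time_to_report_in_seconds = $time_to_report_in_seconds"
```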
The benchmark can also be run with JUBE (see the sketch after this list):

- Step 1: `cd` into the folder of this benchmark
- Step 2: Set `TORCH_CUDA_ARCH_LIST` according to the GPU's compute capability in `benchmark/env/activate.bash`
- Step 3: Execute either `jube run benchmark/jube/nlp_benchmark.yaml --tag 175` for the 175B model or `jube run benchmark/jube/nlp_benchmark.yaml --tag 13` for the 13B model
- Step 4: Wait for the benchmark to run, then execute `jube continue nlp_benchmark_run -i last` until no steps in the "wait" state remain
- Step 5: After the benchmark finishes, run `jube result -a nlp_benchmark_run -i last` to print the benchmark results
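The JUBE steps condense to the following command sequence, shown here for the 13B model; substitute `--tag 175` for the 175B model.

```bash
jube run benchmark/jube/nlp_benchmark.yaml --tag 13   # submit the 13B run
jube continue nlp_benchmark_run -i last               # repeat until no steps are in the "wait" state
jube result -a nlp_benchmark_run -i last              # print the benchmark results
```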
Example result from JUBE:
| system | version | queue | JobID | Job_Time | Model_Size (Billion Param) | Nodes | Batch_Size | Pipeline_Parallel | Tensor_Parallel | Iterations | Avg_TFLOPs/GPU | Tokens/sec | time_to_report_in_seconds |
|---------------|---------|---------|----------|------------|----------------------------|-------|------------|-------------------|-----------------|------------|----------------|------------|---------------------------|
| juwelsbooster | 2024.01 | booster | 10011638 | "00:30:00" | 13 | 8 | 1024 | 4 | 2 | 20 | 206.885 | 60777.68 | 329.07 |