Commit
Fix missing docs and out of sync EngineArgs (#4219)
Co-authored-by: Harry Mellor <[email protected]>
hmellor and Harry Mellor authored Apr 20, 2024
1 parent 138485a commit 682789d
Showing 3 changed files with 98 additions and 197 deletions.
2 changes: 2 additions & 0 deletions docs/source/conf.py
@@ -11,12 +11,14 @@
# documentation root, use os.path.abspath to make it absolute, like shown here.

import logging
import os
import sys
from typing import List

from sphinx.ext import autodoc

logger = logging.getLogger(__name__)
sys.path.append(os.path.abspath("../.."))

# -- Project information -----------------------------------------------------

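The two-line conf.py change above exists so that Sphinx can import `vllm.engine.arg_utils` when rendering the `.. argparse::` directives added below. As a minimal sketch, assuming the directive is provided by the sphinx-argparse package (extension module `sphinxarg.ext`; the commit itself does not show the extensions list), the relevant pieces of conf.py would look roughly like this:

```python
# Sketch of the conf.py pieces needed for the new directive (assumption:
# the ``.. argparse::`` directive comes from the sphinx-argparse package,
# whose Sphinx extension module is "sphinxarg.ext").
import os
import sys

# Put the repository root on sys.path so Sphinx can import
# vllm.engine.arg_utils when building the engine-args pages.
sys.path.append(os.path.abspath("../.."))

extensions = [
    "sphinxarg.ext",  # assumed provider of the argparse directive
]
```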
134 changes: 9 additions & 125 deletions docs/source/models/engine_args.rst
@@ -5,133 +5,17 @@ Engine Arguments

Below, you can find an explanation of every engine argument for vLLM:

.. option:: --model <model_name_or_path>

    Name or path of the huggingface model to use.

.. option:: --tokenizer <tokenizer_name_or_path>

    Name or path of the huggingface tokenizer to use.

.. option:: --revision <revision>

    The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.

.. option:: --tokenizer-revision <revision>

    The specific tokenizer version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.

.. option:: --tokenizer-mode {auto,slow}

    The tokenizer mode.

    * "auto" will use the fast tokenizer if available.
    * "slow" will always use the slow tokenizer.

.. option:: --trust-remote-code

    Trust remote code from huggingface.

.. option:: --download-dir <directory>

    Directory to download and load the weights; defaults to the default huggingface cache directory.

.. option:: --load-format {auto,pt,safetensors,npcache,dummy,tensorizer}

    The format of the model weights to load.

    * "auto" will try to load the weights in the safetensors format and fall back to the pytorch bin format if safetensors format is not available.
    * "pt" will load the weights in the pytorch bin format.
    * "safetensors" will load the weights in the safetensors format.
    * "npcache" will load the weights in pytorch format and store a numpy cache to speed up the loading.
    * "dummy" will initialize the weights with random values, mainly for profiling.
    * "tensorizer" will load serialized weights using `CoreWeave's Tensorizer model deserializer <https://github.com/coreweave/tensorizer>`_. See `examples/tensorize_vllm_model.py <https://github.com/vllm-project/vllm/blob/main/examples/tensorize_vllm_model.py>`_ for how to serialize a vLLM model and for more information.

.. option:: --dtype {auto,half,float16,bfloat16,float,float32}

    Data type for model weights and activations.

    * "auto" will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models.
    * "half" for FP16. Recommended for AWQ quantization.
    * "float16" is the same as "half".
    * "bfloat16" for a balance between precision and range.
    * "float" is shorthand for FP32 precision.
    * "float32" for FP32 precision.

.. option:: --max-model-len <length>

    Model context length. If unspecified, will be automatically derived from the model config.

.. option:: --worker-use-ray

    Use Ray for distributed serving; this is set automatically when using more than 1 GPU.

.. option:: --pipeline-parallel-size (-pp) <size>

    Number of pipeline stages.

.. option:: --tensor-parallel-size (-tp) <size>

    Number of tensor parallel replicas.

.. option:: --max-parallel-loading-workers <workers>

    Load the model sequentially in multiple batches, to avoid RAM OOM when using tensor parallelism with large models.

.. option:: --block-size {8,16,32}

    Token block size for contiguous chunks of tokens.

.. option:: --enable-prefix-caching

    Enables automatic prefix caching.

.. option:: --seed <seed>

    Random seed for operations.

.. option:: --swap-space <size>

    CPU swap space size (GiB) per GPU.

.. option:: --gpu-memory-utilization <fraction>

    The fraction of GPU memory to be used for the model executor, which can range from 0 to 1.
    For example, a value of 0.5 would imply 50% GPU memory utilization.
    If unspecified, will use the default value of 0.9.

.. option:: --max-num-batched-tokens <tokens>

    Maximum number of batched tokens per iteration.

.. option:: --max-num-seqs <sequences>

    Maximum number of sequences per iteration.

.. option:: --max-paddings <paddings>

    Maximum number of paddings in a batch.

.. option:: --disable-log-stats

    Disable logging statistics.

.. option:: --quantization (-q) {awq,squeezellm,None}

    Method used to quantize the weights.
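For orientation, each flag documented above corresponds to a field of `EngineArgs`, which the offline `LLM` entry point also accepts as keyword arguments. A minimal, illustrative sketch (model name and values are placeholders, not taken from this commit):

```python
# Illustrative only: the CLI flags above map onto EngineArgs fields, which
# the offline LLM entry point accepts as keyword arguments.
from vllm import LLM

llm = LLM(
    model="facebook/opt-125m",    # --model (placeholder model name)
    dtype="auto",                 # --dtype
    max_model_len=2048,           # --max-model-len
    gpu_memory_utilization=0.9,   # --gpu-memory-utilization
    seed=0,                       # --seed
)
outputs = llm.generate(["Hello, my name is"])
```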
.. argparse::
    :module: vllm.engine.arg_utils
    :func: _engine_args_parser
    :prog: -m vllm.entrypoints.openai.api_server

Async Engine Arguments
----------------------

Below are the additional arguments related to the asynchronous engine:

.. option:: --engine-use-ray

    Use Ray to start the LLM engine in a process separate from the server process.

.. option:: --disable-log-requests

    Disable logging requests.

.. option:: --max-log-len

    Max number of prompt characters or prompt ID numbers printed in the log. Defaults to unlimited.

.. argparse::
    :module: vllm.engine.arg_utils
    :func: _async_engine_args_parser
    :prog: -m vllm.entrypoints.openai.api_server
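The directives above import `_engine_args_parser` and `_async_engine_args_parser` from `vllm.engine.arg_utils`. As a rough sketch of the idea (not the actual vLLM source), such parser factories can be built from the `EngineArgs.add_cli_args` / `AsyncEngineArgs.add_cli_args` helpers, which attach every engine flag to a plain argparse parser:

```python
# Rough sketch of parser factories like the ones the directives import;
# the factory bodies are assumptions, written in terms of the
# EngineArgs.add_cli_args / AsyncEngineArgs.add_cli_args helpers.
import argparse

from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs


def engine_args_parser() -> argparse.ArgumentParser:
    # Attach every synchronous engine flag (--model, --dtype, ...) to a
    # fresh parser and return it for sphinx-argparse to document.
    return EngineArgs.add_cli_args(argparse.ArgumentParser())


def async_engine_args_parser() -> argparse.ArgumentParser:
    # Same idea, plus the async-only flags such as --engine-use-ray
    # and --disable-log-requests.
    return AsyncEngineArgs.add_cli_args(argparse.ArgumentParser())
```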