doc update sprint 4 #993

Merged 10 commits on Feb 7, 2025
4 changes: 2 additions & 2 deletions .github/workflows/_build_doc.yaml
@@ -28,9 +28,9 @@ jobs:
run:
shell: bash
steps:
- name: Install Git LFS
- name: Install Package
run: |
yum install -y git-lfs
yum install -y git-lfs epel-release pandoc
- name: Check-out the repository
uses: actions/checkout@v3
with:
8 changes: 4 additions & 4 deletions .github/workflows/ci_build_doc.yaml
@@ -7,10 +7,10 @@
name: CI

on:
# pull_request:
# paths-ignore:
# - '**.md'
# - 'ci/**'
pull_request:
paths-ignore:
- '**.md'
- 'ci/**'

jobs:
build_doc:
15 changes: 7 additions & 8 deletions .github/workflows/release.yaml
@@ -69,17 +69,16 @@ jobs:
release_type: ${{ inputs.release_type }}
version_override: ${{ needs.process_version.outputs.version_override }}

# build_doc:
# name: Build documentation
# needs: [process_version]
# uses: ./.github/workflows/_build_doc.yaml
# with:
# version_override: ${{ needs.process_version.outputs.version_override }}
build_doc:
name: Build documentation
needs: [process_version]
uses: ./.github/workflows/_build_doc.yaml
with:
version_override: ${{ needs.process_version.outputs.version_override }}

publish:
name: Publish
needs: [build_wheels]
# needs: [build_wheels, build_doc]
needs: [build_wheels, build_doc]
uses: ./.github/workflows/_publish.yaml
with:
release_type: ${{ inputs.release_type }}
2 changes: 1 addition & 1 deletion README.md
@@ -1,5 +1,5 @@
<p align="center">
<img src="doc/source/_static/img/logo.png" width="150"><br />
<img src="doc/source/_static/img/logo.svg" width="150"><br />
</p>

# fairseq2: FAIR Sequence Modeling Toolkit 2
4 changes: 3 additions & 1 deletion doc/requirements.txt
@@ -4,4 +4,6 @@ sphinx-favicon~=1.0.1
sphinx-design~=0.5.0
myst-parser~=4.0.0
sphinxcontrib-mermaid~=1.0.0
furo==2024.8.6
furo==2024.8.6
nbsphinx~=0.9.6
ipython~=8.31.0
3 changes: 0 additions & 3 deletions doc/source/_static/img/logo.png

This file was deleted.

459 changes: 459 additions & 0 deletions doc/source/_static/img/logo.svg
45 changes: 22 additions & 23 deletions doc/source/basics/assets.rst
@@ -46,7 +46,7 @@ How to Customize Your Assets

---

name: gsm8k_sft@awscluster
name: gsm8k_sft@user
data: "/data/gsm8k_data/sft"


@@ -58,49 +58,48 @@ How to Customize Your Assets

.. code-block:: yaml

name: llama3_2_1b@awscluster
name: llama3_2_1b@user
checkpoint: "/models/Llama-3.2-1B/original/consolidated.00.pth"


Advanced Topics
---------------

Model Store
Asset Store
~~~~~~~~~~~

A store is a place where all the model cards are stored. In fairseq2, a store is accessed via
:py:class:`fairseq2.assets.AssetStore`. Multiple stores are allowed. By default, fairseq2 will look up the following stores:
:py:class:`fairseq2.assets.AssetStore`. By default, fairseq2 will look up the following paths to
find asset cards:

* System asset store: Cards that are shared by all users. By default, the system store is `/etc/fairseq2/assets`,
but this can be changed via the environment variable `FAIRSEQ2_ASSET_DIR`
* System: Cards that are shared by all users. By default, the system store is `/etc/fairseq2/assets`,
but this can be changed via the environment variable `FAIRSEQ2_ASSET_DIR`.

* User asset store: Cards that are only available to the user. By default, the user store is
`~/.config/fairseq2/assets`, but this can be changed via the environment variable `FAIRSEQ2_USER_ASSET_DIR`
* User: Cards whose names carry the ``@user`` suffix (`e.g.` ``llama3_2_1b@user``) are available only to the current user (see the sketch below).
By default, the user store is ``~/.config/fairseq2/assets``, but this can be changed via the environment variable `FAIRSEQ2_USER_ASSET_DIR`.
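A minimal sketch of creating such a user card (the file name below is a placeholder, the card contents repeat the example above, and the default user-store path is assumed not to be overridden):

.. code-block:: python

    from pathlib import Path

    # Hypothetical example: drop a user-scoped asset card into the default user
    # store (~/.config/fairseq2/assets); adjust the path if FAIRSEQ2_USER_ASSET_DIR
    # points elsewhere.
    user_store = Path.home() / ".config" / "fairseq2" / "assets"
    user_store.mkdir(parents=True, exist_ok=True)

    card = (
        "name: llama3_2_1b@user\n"
        'checkpoint: "/models/Llama-3.2-1B/original/consolidated.00.pth"\n'
    )

    (user_store / "llama3_2_1b.yaml").write_text(card)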

To register a new store, implement a :py:class:`fairseq2.assets.AssetMetadataProvider` and add them to
:py:class:`fairseq2.assets.asset_store`. Here is an example to register a new directory as a model store:
Here is an example of how to register a new directory with the asset store:

.. code-block:: python

from pathlib import Path
from fairseq2.assets import FileAssetMetadataProvider, asset_store
from fairseq2.assets import FileAssetMetadataLoader, StandardAssetStore

my_dir = Path("/path/to/model_store")
asset_store.metadata_providers.append(FileAssetMetadataProvider(my_dir))
def register_my_models(asset_store: StandardAssetStore) -> None:
    my_dir = Path("/path/to/model_store")
    loader = FileAssetMetadataLoader(my_dir)
    asset_provider = loader.load()
    asset_store.metadata_providers.append(asset_provider)
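As a usage note, the function above is then called with the store instance you want to extend. The name ``default_asset_store`` below is an assumption about the globally registered store and may differ between fairseq2 versions:

.. code-block:: python

    from fairseq2.assets import default_asset_store  # assumed global store instance

    # Register the extra directory with the (assumed) default asset store.
    register_my_models(default_asset_store)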


Model Card
Asset Card
~~~~~~~~~~

A model card is a .YAML file that contains information about a model and instructs a
:py:class:`fairseq2.models.utils.generic_loaders.ModelLoader` on how to load the model into the memory. Each model card
must have 2 mandatory attributes: `name` and `checkpoint`. `name` will be used to identify the model card, and it must
be unique `across` all
fairseq2 provides example cards for different LLMs in
:py:mod:`fairseq2.assets.cards`.

In fairseq2, a model card is accessed via :py:class:`fairseq2.assets.AssetCard`. Alternatively, one can call
`fairseq2.assets.AssetMetadataProvider.get_metadata(name: str)` to get the meta data of a given model card name.
An asset card is a YAML file that contains information about an asset such as
a model, dataset, or tokenizer. Each asset card must have a mandatory attribute
`name`. `name` is used to identify the relevant asset, and it must be
unique across all registered assets. fairseq2 provides example cards for different assets in
:py:mod:`fairseq2.assets.cards`.
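As a rough sketch of reading a card programmatically (the accessors ``retrieve_card`` and ``field(...).as_(...)`` are assumptions and may differ between fairseq2 versions):

.. code-block:: python

    from fairseq2.assets import default_asset_store  # assumed global store instance

    # Look up a card by name; the @user suffix selects the user-scoped card.
    card = default_asset_store.retrieve_card("llama3_2_1b@user")

    # Read a field declared in the card's YAML (the field name comes from the card itself).
    checkpoint_path = card.field("checkpoint").as_(str)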

See Also
--------
62 changes: 40 additions & 22 deletions doc/source/basics/ckpt.rst
@@ -9,7 +9,7 @@ It provides a robust way to:
- Save model checkpoints during training
- Load checkpoints to resume training
- Manage multiple checkpoints with policies like keeping N-best or last N checkpoints
- Handle distributed training scenarios including FSDP (Fully Sharded Data Parallel)
- Handle distributed training scenarios including FSDP (Fully Sharded Data Parallel) and TP (Tensor Parallel)

Architecture Overview
---------------------
@@ -37,22 +37,29 @@ The :class:`fairseq2.checkpoint.manager.CheckpointManager` provides a transactio
# Initialize checkpoint manager
ckpt_manager = FileCheckpointManager(
    checkpoint_dir=Path("checkpoints"),
    gang=root_gang # For distributed training coordination
    gangs=root_gang, # For distributed training coordination
    file_system=file_system, # File system abstraction
    tensor_loader=tensor_loader, # For loading tensors
    tensor_dumper=tensor_dumper, # For saving tensors
)

# Begin checkpoint operation
ckpt_manager.begin_checkpoint(step_nr=1000)

# Save model and optimizer state
ckpt_manager.save_state({
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "step_nr": 1000,
    "epoch": 5
})
ckpt_manager.save_state(
    {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step_nr": 1000,
        "epoch": 5
    },
    model_key="model", # Key for model state in the state dict
    replicated_keys={"epoch"} # Keys that are same across all processes
)

# Save validation score if needed
ckpt_manager.save_score(valid_score)
ckpt_manager.save_score(valid_score, lower_better=True) # Optional, lower is better

# Commit the checkpoint
ckpt_manager.commit_checkpoint()
@@ -87,18 +94,17 @@ Keep Last N Checkpoints
.. code-block:: python

# Keep only the last 5 checkpoints
ckpt_manager.keep_last_n_checkpoints(n=5)
ckpt_manager.keep_last_n_checkpoints(n=5, preserve_model=False)

Keep Best N Checkpoints
^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

# Keep the 3 checkpoints with best validation scores
ckpt_manager.keep_best_n_checkpoints(
    n=3,
    lower_better=True # True if lower scores are better
)
ckpt_manager.keep_best_n_checkpoints(n=3, preserve_model=False)

The `preserve_model` parameter allows keeping model weights while deleting other checkpoint data.

Distributed Training Support
----------------------------
@@ -125,19 +131,25 @@ A checkpoint directory contains:

checkpoint_dir/
├── model.yaml # Model metadata
├── cc/ # Carbon copy directory for files to copy to each checkpoint
└── step_1000/ # Checkpoint at step 1000
    └── model.pt # Model training state
    ├── model.pt # Model training state
    ├── rank_0.pt # Process-specific state for rank 0
    ├── rank_1.pt # Process-specific state for rank 1
    └── score.txt # Optional validation score

For sharded checkpoints (FSDP), each rank has its own files:
For tensor parallel training, model files are suffixed with the TP rank:

.. code-block:: text

checkpoint_dir/
├── model.yaml # Model metadata
├── model.yaml
└── step_1000/
    ├── model.pt # Consolidated model
    ├── rank_0.pt # Model rank 0 state
    └── rank_1.pt # Model rank 1 state
    ├── model.0.pt # Model shard for TP rank 0
    ├── model.1.pt # Model shard for TP rank 1
    ├── replicated.0.pt # Replicated state for TP rank 0
    ├── replicated.1.pt # Replicated state for TP rank 1
    └── score.txt

Error Handling
--------------
@@ -146,7 +158,9 @@ The checkpoint system provides specific exceptions for error cases:

- ``CheckpointError``: Base class for checkpoint-related errors
- ``CheckpointNotFoundError``: Raised when attempting to load non-existent checkpoint
- ``InvalidOperationError``: Raised for invalid checkpoint operations
- ``CheckpointSaveError``: Raised when saving a checkpoint fails
- ``CheckpointLoadError``: Raised when loading a checkpoint fails
- ``CheckpointDeleteError``: Raised when deleting a checkpoint fails

Example error handling:

@@ -156,7 +170,7 @@
    ckpt_manager.load_checkpoint(step_nr=1000)
except CheckpointNotFoundError:
    print("Checkpoint not found")
except CheckpointError as e:
except CheckpointLoadError as e:
    print(f"Error loading checkpoint: {e}")

Best Practices
@@ -171,3 +185,7 @@
4. Handle checkpoint errors gracefully in production code

5. For distributed training, ensure proper gang coordination

6. Use the carbon copy directory (cc/) for files that should be present in every checkpoint

7. Consider using ``preserve_model=True`` when cleaning up checkpoints to keep model weights while reducing storage (see the sketch below)
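To put these practices together, here is a rough sketch of a periodic checkpointing step. It reuses only the calls shown earlier on this page; ``model``, ``optimizer``, and ``valid_score`` are placeholders, and the exact signatures and import paths may differ between fairseq2 versions:

.. code-block:: python

    from fairseq2.checkpoint import CheckpointSaveError  # import path assumed

    def checkpoint_step(ckpt_manager, model, optimizer, step_nr, epoch, valid_score):
        # Save the current training state as one transactional checkpoint.
        try:
            ckpt_manager.begin_checkpoint(step_nr=step_nr)
            ckpt_manager.save_state(
                {
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "step_nr": step_nr,
                    "epoch": epoch,
                },
                model_key="model",
                replicated_keys={"epoch"},
            )
            ckpt_manager.save_score(valid_score, lower_better=True)
            ckpt_manager.commit_checkpoint()

            # Clean up, keeping only the last 5 checkpoints but preserving model weights.
            ckpt_manager.keep_last_n_checkpoints(n=5, preserve_model=True)
        except CheckpointSaveError as e:
            print(f"Error saving checkpoint: {e}")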
52 changes: 44 additions & 8 deletions doc/source/basics/cli.rst
@@ -22,7 +22,7 @@ Here are some basic examples of using the CLI:
# Get help about a specific command (e.g. recipe lm::instruction_finetune)
fairseq2 lm instruction_finetune -h

# List available presets for a recipe (e.g. recipe lm::instruction_finetune)
# List available configuration presets for a recipe (e.g. recipe lm::instruction_finetune)
fairseq2 lm instruction_finetune --list-presets

# Dump the default configuration for a recipe (e.g. recipe lm::instruction_finetune)
@@ -60,17 +60,27 @@ Use ``--config`` to override specific values:
.. code-block:: bash

# Override single value
fairseq2 lm instruction_finetune <OUTPUT_DIR> --config max_num_tokens=512
fairseq2 lm instruction_finetune <OUTPUT_DIR> --config dataset.max_num_tokens=512

# Override nested values
fairseq2 lm instruction_finetune <OUTPUT_DIR> --config optimizer_config.lr=4e-5
fairseq2 lm instruction_finetune <OUTPUT_DIR> --config optimizer.config.lr=4e-5

# Override multiple values
fairseq2 lm instruction_finetune <OUTPUT_DIR> --config max_num_tokens=512 max_seq_len=512
fairseq2 lm instruction_finetune <OUTPUT_DIR> --config dataset.max_num_tokens=512 dataset.max_seq_len=512

# Override a tuple
fairseq2 lm instruction_finetune <OUTPUT_DIR> --config profile="[500,10]"

or add and delete configuration keys:

.. code-block:: bash

# Delete a configuration key
fairseq2 lm instruction_finetune <OUTPUT_DIR> --config del:common.metric_recorders.tensorboard

# Add a configuration key
fairseq2 lm instruction_finetune <OUTPUT_DIR> --config add:common.metric_recorders.tensorboard="{enabled: true}"

.. note::

Unlike ``--config-file``, only one ``--config`` argument can be used.
@@ -112,17 +122,43 @@ fairseq2 provides commands to manage and inspect assets:
# List all available assets
fairseq2 assets list

# Show details of a specific asset
fairseq2 assets show llama3_1_8b_instruct

# List assets filtered by type
fairseq2 assets list --type model
fairseq2 assets list --type dataset
fairseq2 assets list --type tokenizer

# Show details of a specific asset
fairseq2 assets show llama3_1_8b_instruct

LLaMA Utilities
---------------

fairseq2 provides utilities for working with LLaMA models:

.. code-block:: bash

# Convert fairseq2 LLaMA checkpoints to reference format
fairseq2 llama convert_checkpoint <MODEL_NAME> <INPUT_DIR> <OUTPUT_DIR>

# Write LLaMA configurations in Hugging Face format
fairseq2 llama write_hf_config <MODEL_NAME> <OUTPUT_DIR>

Available Recipe Groups
-----------------------

fairseq2 includes several recipe groups for different tasks:

- ``asr``: ASR (Automatic Speech Recognition) recipes
- ``lm``: Language model recipes (instruction fine-tuning, preference optimization, etc.)
- ``mt``: Machine translation recipes
- ``wav2vec2``: wav2vec 2.0 pretraining recipes
- ``wav2vec2_asr``: wav2vec 2.0 ASR recipes

For more details about the recipe configurations, please refer to :ref:`basics-recipe`.

See More
--------

For more technical details about implementing custom CLIs and extensions, see:

- :doc:`/reference/api/fairseq2.recipes/cli`
- :doc:`/reference/api/fairseq2.cli/index`