doc update sprint 4 #993

Merged 10 commits on Feb 7, 2025
4 changes: 2 additions & 2 deletions .github/workflows/_build_doc.yaml
@@ -28,9 +28,9 @@ jobs:
run:
shell: bash
steps:
- name: Install Git LFS
- name: Install Package
run: |
yum install -y git-lfs
yum install -y git-lfs epel-release pandoc
- name: Check-out the repository
uses: actions/checkout@v3
with:
8 changes: 4 additions & 4 deletions .github/workflows/ci_build_doc.yaml
@@ -7,10 +7,10 @@
name: CI

on:
# pull_request:
# paths-ignore:
# - '**.md'
# - 'ci/**'
pull_request:
paths-ignore:
- '**.md'
- 'ci/**'

jobs:
build_doc:
15 changes: 7 additions & 8 deletions .github/workflows/release.yaml
@@ -69,17 +69,16 @@ jobs:
release_type: ${{ inputs.release_type }}
version_override: ${{ needs.process_version.outputs.version_override }}

# build_doc:
# name: Build documentation
# needs: [process_version]
# uses: ./.github/workflows/_build_doc.yaml
# with:
# version_override: ${{ needs.process_version.outputs.version_override }}
build_doc:
name: Build documentation
needs: [process_version]
uses: ./.github/workflows/_build_doc.yaml
with:
version_override: ${{ needs.process_version.outputs.version_override }}

publish:
name: Publish
needs: [build_wheels]
# needs: [build_wheels, build_doc]
needs: [build_wheels, build_doc]
uses: ./.github/workflows/_publish.yaml
with:
release_type: ${{ inputs.release_type }}
2 changes: 1 addition & 1 deletion README.md
@@ -1,5 +1,5 @@
<p align="center">
<img src="doc/source/_static/img/logo.png" width="150"><br />
<img src="doc/source/_static/img/logo.svg" width="150"><br />
</p>

# fairseq2: FAIR Sequence Modeling Toolkit 2
4 changes: 3 additions & 1 deletion doc/requirements.txt
@@ -4,4 +4,6 @@ sphinx-favicon~=1.0.1
sphinx-design~=0.5.0
myst-parser~=4.0.0
sphinxcontrib-mermaid~=1.0.0
furo==2024.8.6
furo==2024.8.6
nbsphinx~=0.9.6
ipython~=8.31.0
3 changes: 0 additions & 3 deletions doc/source/_static/img/logo.png

This file was deleted.

459 changes: 459 additions & 0 deletions doc/source/_static/img/logo.svg
45 changes: 22 additions & 23 deletions doc/source/basics/assets.rst
@@ -46,7 +46,7 @@ How to Customize Your Assets

---

name: gsm8k_sft@awscluster
name: gsm8k_sft@user
data: "/data/gsm8k_data/sft"


@@ -58,49 +58,48 @@ How to Customize Your Assets

.. code-block:: yaml

name: llama3_2_1b@awscluster
name: llama3_2_1b@user
checkpoint: "/models/Llama-3.2-1B/original/consolidated.00.pth"


Advanced Topics
---------------

Model Store
Asset Store
~~~~~~~~~~~

A store is a place where all the model cards are stored. In fairseq2, a store is accessed via
:py:class:`fairseq2.assets.AssetStore`. Multiple stores are allowed. By default, fairseq2 will look up the following stores:
:py:class:`fairseq2.assets.AssetStore`. By default, fairseq2 will look up the following paths to
find asset cards:

* System asset store: Cards that are shared by all users. By default, the system store is `/etc/fairseq2/assets`,
but this can be changed via the environment variable `FAIRSEQ2_ASSET_DIR`
* System: Cards that are shared by all users. By default, the system store is `/etc/fairseq2/assets`,
but this can be changed via the environment variable `FAIRSEQ2_ASSET_DIR`.

* User asset store: Cards that are only available to the user. By default, the user store is
`~/.config/fairseq2/assets`, but this can be changed via the environment variable `FAIRSEQ2_USER_ASSET_DIR`
* User: Cards whose names carry the ``@user`` suffix (`e.g.` ``llama3_2_1b@user``) are available only to the current user (see the sketch below).
By default, the user store is ``~/.config/fairseq2/assets``, but this can be changed via the environment variable `FAIRSEQ2_USER_ASSET_DIR`.
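A minimal sketch of creating such a user card (the file name below is a placeholder, the card contents repeat the example above, and the default user-store path is assumed not to be overridden):

.. code-block:: python

    from pathlib import Path

    # Hypothetical example: drop a user-scoped asset card into the default user
    # store (~/.config/fairseq2/assets); adjust the path if FAIRSEQ2_USER_ASSET_DIR
    # points elsewhere.
    user_store = Path.home() / ".config" / "fairseq2" / "assets"
    user_store.mkdir(parents=True, exist_ok=True)

    card = (
        "name: llama3_2_1b@user\n"
        'checkpoint: "/models/Llama-3.2-1B/original/consolidated.00.pth"\n'
    )

    (user_store / "llama3_2_1b.yaml").write_text(card)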

To register a new store, implement a :py:class:`fairseq2.assets.AssetMetadataProvider` and add them to
:py:class:`fairseq2.assets.asset_store`. Here is an example to register a new directory as a model store:
Here is an example of how to register a new directory with the asset store:

.. code-block:: python

from pathlib import Path
from fairseq2.assets import FileAssetMetadataProvider, asset_store
from fairseq2.assets import FileAssetMetadataLoader, StandardAssetStore

my_dir = Path("/path/to/model_store")
asset_store.metadata_providers.append(FileAssetMetadataProvider(my_dir))
def register_my_models(asset_store: StandardAssetStore) -> None:
    my_dir = Path("/path/to/model_store")
    loader = FileAssetMetadataLoader(my_dir)
    asset_provider = loader.load()
    asset_store.metadata_providers.append(asset_provider)
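As a usage note, the function above is then called with the store instance you want to extend. The name ``default_asset_store`` below is an assumption about the globally registered store and may differ between fairseq2 versions:

.. code-block:: python

    from fairseq2.assets import default_asset_store  # assumed global store instance

    # Register the extra directory with the (assumed) default asset store.
    register_my_models(default_asset_store)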


Model Card
Asset Card
~~~~~~~~~~

A model card is a .YAML file that contains information about a model and instructs a
:py:class:`fairseq2.models.utils.generic_loaders.ModelLoader` on how to load the model into the memory. Each model card
must have 2 mandatory attributes: `name` and `checkpoint`. `name` will be used to identify the model card, and it must
be unique `across` all
fairseq2 provides example cards for different LLMs in
:py:mod:`fairseq2.assets.cards`.

In fairseq2, a model card is accessed via :py:class:`fairseq2.assets.AssetCard`. Alternatively, one can call
`fairseq2.assets.AssetMetadataProvider.get_metadata(name: str)` to get the meta data of a given model card name.
An asset card is a YAML file that contains information about an asset such as
a model, dataset, or tokenizer. Each asset card must have a mandatory attribute
`name`. `name` is used to identify the relevant asset, and it must be
unique across all registered assets. fairseq2 provides example cards for different assets in
:py:mod:`fairseq2.assets.cards`.
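As a rough sketch of reading a card programmatically (the accessors ``retrieve_card`` and ``field(...).as_(...)`` are assumptions and may differ between fairseq2 versions):

.. code-block:: python

    from fairseq2.assets import default_asset_store  # assumed global store instance

    # Look up a card by name; the @user suffix selects the user-scoped card.
    card = default_asset_store.retrieve_card("llama3_2_1b@user")

    # Read a field declared in the card's YAML (the field name comes from the card itself).
    checkpoint_path = card.field("checkpoint").as_(str)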

See Also
--------
62 changes: 40 additions & 22 deletions doc/source/basics/ckpt.rst
@@ -9,7 +9,7 @@ It provides a robust way to:
- Save model checkpoints during training
- Load checkpoints to resume training
- Manage multiple checkpoints with policies like keeping N-best or last N checkpoints
- Handle distributed training scenarios including FSDP (Fully Sharded Data Parallel)
- Handle distributed training scenarios including FSDP (Fully Sharded Data Parallel) and TP (Tensor Parallel)

Architecture Overview
---------------------
@@ -37,22 +37,29 @@ The :class:`fairseq2.checkpoint.manager.CheckpointManager` provides a transactio
# Initialize checkpoint manager
ckpt_manager = FileCheckpointManager(
    checkpoint_dir=Path("checkpoints"),
    gang=root_gang # For distributed training coordination
    gangs=root_gang, # For distributed training coordination
    file_system=file_system, # File system abstraction
    tensor_loader=tensor_loader, # For loading tensors
    tensor_dumper=tensor_dumper, # For saving tensors
)

# Begin checkpoint operation
ckpt_manager.begin_checkpoint(step_nr=1000)

# Save model and optimizer state
ckpt_manager.save_state({
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "step_nr": 1000,
    "epoch": 5
})
ckpt_manager.save_state(
    {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step_nr": 1000,
        "epoch": 5
    },
    model_key="model", # Key for model state in the state dict
    replicated_keys={"epoch"} # Keys that are same across all processes
)

# Save validation score if needed
ckpt_manager.save_score(valid_score)
ckpt_manager.save_score(valid_score, lower_better=True) # Optional, lower is better

# Commit the checkpoint
ckpt_manager.commit_checkpoint()
@@ -87,18 +94,17 @@ Keep Last N Checkpoints
.. code-block:: python

# Keep only the last 5 checkpoints
ckpt_manager.keep_last_n_checkpoints(n=5)
ckpt_manager.keep_last_n_checkpoints(n=5, preserve_model=False)

Keep Best N Checkpoints
^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

# Keep the 3 checkpoints with best validation scores
ckpt_manager.keep_best_n_checkpoints(
    n=3,
    lower_better=True # True if lower scores are better
)
ckpt_manager.keep_best_n_checkpoints(n=3, preserve_model=False)

The `preserve_model` parameter allows keeping model weights while deleting other checkpoint data.

Distributed Training Support
----------------------------
@@ -125,19 +131,25 @@ A checkpoint directory contains:

checkpoint_dir/
├── model.yaml # Model metadata
├── cc/ # Carbon copy directory for files to copy to each checkpoint
└── step_1000/ # Checkpoint at step 1000
    └── model.pt # Model training state
    ├── model.pt # Model training state
    ├── rank_0.pt # Process-specific state for rank 0
    ├── rank_1.pt # Process-specific state for rank 1
    └── score.txt # Optional validation score

For sharded checkpoints (FSDP), each rank has its own files:
For tensor parallel training, model files are suffixed with the TP rank:

.. code-block:: text

checkpoint_dir/
├── model.yaml # Model metadata
├── model.yaml
└── step_1000/
    ├── model.pt # Consolidated model
    ├── rank_0.pt # Model rank 0 state
    └── rank_1.pt # Model rank 1 state
    ├── model.0.pt # Model shard for TP rank 0
    ├── model.1.pt # Model shard for TP rank 1
    ├── replicated.0.pt # Replicated state for TP rank 0
    ├── replicated.1.pt # Replicated state for TP rank 1
    └── score.txt

Error Handling
--------------
@@ -146,7 +158,9 @@ The checkpoint system provides specific exceptions for error cases:

- ``CheckpointError``: Base class for checkpoint-related errors
- ``CheckpointNotFoundError``: Raised when attempting to load non-existent checkpoint
- ``InvalidOperationError``: Raised for invalid checkpoint operations
- ``CheckpointSaveError``: Raised when saving a checkpoint fails
- ``CheckpointLoadError``: Raised when loading a checkpoint fails
- ``CheckpointDeleteError``: Raised when deleting a checkpoint fails

Example error handling:

@@ -156,7 +170,7 @@
    ckpt_manager.load_checkpoint(step_nr=1000)
except CheckpointNotFoundError:
    print("Checkpoint not found")
except CheckpointError as e:
except CheckpointLoadError as e:
    print(f"Error loading checkpoint: {e}")

Best Practices
@@ -171,3 +185,7 @@
4. Handle checkpoint errors gracefully in production code

5. For distributed training, ensure proper gang coordination

6. Use the carbon copy directory (cc/) for files that should be present in every checkpoint

7. Consider using ``preserve_model=True`` when cleaning up checkpoints to keep model weights while reducing storage (see the sketch below)
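To put these practices together, here is a rough sketch of a periodic checkpointing step. It reuses only the calls shown earlier on this page; ``model``, ``optimizer``, and ``valid_score`` are placeholders, and the exact signatures and import paths may differ between fairseq2 versions:

.. code-block:: python

    from fairseq2.checkpoint import CheckpointSaveError  # import path assumed

    def checkpoint_step(ckpt_manager, model, optimizer, step_nr, epoch, valid_score):
        # Save the current training state as one transactional checkpoint.
        try:
            ckpt_manager.begin_checkpoint(step_nr=step_nr)
            ckpt_manager.save_state(
                {
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "step_nr": step_nr,
                    "epoch": epoch,
                },
                model_key="model",
                replicated_keys={"epoch"},
            )
            ckpt_manager.save_score(valid_score, lower_better=True)
            ckpt_manager.commit_checkpoint()

            # Clean up, keeping only the last 5 checkpoints but preserving model weights.
            ckpt_manager.keep_last_n_checkpoints(n=5, preserve_model=True)
        except CheckpointSaveError as e:
            print(f"Error saving checkpoint: {e}")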
52 changes: 44 additions & 8 deletions doc/source/basics/cli.rst
@@ -22,7 +22,7 @@ Here are some basic examples of using the CLI:
# Get help about a specific command (e.g. recipe lm::instruction_finetune)
fairseq2 lm instruction_finetune -h

# List available presets for a recipe (e.g. recipe lm::instruction_finetune)
# List available configuration presets for a recipe (e.g. recipe lm::instruction_finetune)
fairseq2 lm instruction_finetune --list-presets

# Dump the default configuration for a recipe (e.g. recipe lm::instruction_finetune)
@@ -60,17 +60,27 @@ Use ``--config`` to override specific values:
.. code-block:: bash

# Override single value
fairseq2 lm instruction_finetune <OUTPUT_DIR> --config max_num_tokens=512
fairseq2 lm instruction_finetune <OUTPUT_DIR> --config dataset.max_num_tokens=512

# Override nested values
fairseq2 lm instruction_finetune <OUTPUT_DIR> --config optimizer_config.lr=4e-5
fairseq2 lm instruction_finetune <OUTPUT_DIR> --config optimizer.config.lr=4e-5

# Override multiple values
fairseq2 lm instruction_finetune <OUTPUT_DIR> --config max_num_tokens=512 max_seq_len=512
fairseq2 lm instruction_finetune <OUTPUT_DIR> --config dataset.max_num_tokens=512 dataset.max_seq_len=512

# Override a tuple
fairseq2 lm instruction_finetune <OUTPUT_DIR> --config profile="[500,10]"

or add and delete configuration keys:

.. code-block:: bash

# Delete a configuration key
fairseq2 lm instruction_finetune <OUTPUT_DIR> --config del:common.metric_recorders.tensorboard

# Add a configuration key
fairseq2 lm instruction_finetune <OUTPUT_DIR> --config add:common.metric_recorders.tensorboard="{enabled: true}"

.. note::

Unlike ``--config-file``, only one ``--config`` argument can be used.
@@ -112,17 +122,43 @@ fairseq2 provides commands to manage and inspect assets:
# List all available assets
fairseq2 assets list

# Show details of a specific asset
fairseq2 assets show llama3_1_8b_instruct

# List assets filtered by type
fairseq2 assets list --type model
fairseq2 assets list --type dataset
fairseq2 assets list --type tokenizer

# Show details of a specific asset
fairseq2 assets show llama3_1_8b_instruct

LLaMA Utilities
---------------

fairseq2 provides utilities for working with LLaMA models:

.. code-block:: bash

# Convert fairseq2 LLaMA checkpoints to reference format
fairseq2 llama convert_checkpoint <MODEL_NAME> <INPUT_DIR> <OUTPUT_DIR>

# Write LLaMA configurations in Hugging Face format
fairseq2 llama write_hf_config <MODEL_NAME> <OUTPUT_DIR>

Available Recipe Groups
-----------------------

fairseq2 includes several recipe groups for different tasks:

- ``asr``: ASR (Automatic Speech Recognition) recipes
- ``lm``: Language model recipes (instruction fine-tuning, preference optimization, etc.)
- ``mt``: Machine translation recipes
- ``wav2vec2``: wav2vec 2.0 pretraining recipes
- ``wav2vec2_asr``: wav2vec 2.0 ASR recipes

For more details about the recipe configurations, please refer to :ref:`basics-recipe`.

See More
--------

For more technical details about implementing custom CLIs and extensions, see:

- :doc:`/reference/api/fairseq2.recipes/cli`
- :doc:`/reference/api/fairseq2.cli/index`