Insights: NVIDIA/NeMo
Overview
3 Releases published by 1 person
- v2.2.0rc0 NVIDIA Neural Modules 2.2.0rc0, published Feb 2, 2025
- v2.2.0rc1 NVIDIA Neural Modules 2.2.0rc1, published Feb 4, 2025
- v2.2.0rc2 NVIDIA Neural Modules 2.2.0rc2, published Feb 17, 2025
254 Pull requests merged by 56 people
-
chore: Cherry pick deepseek
#12324 merged
Feb 22, 2025 -
Cherry pick
fix masked loss calculation (12255)
into r2.2.0
#12286 merged
Feb 22, 2025 -
DeepSeek
#11971 merged
Feb 22, 2025 -
Cherry pick
Energon ckpt multimodal (12245)
into r2.2.0
#12307 merged
Feb 22, 2025 -
cherry pick 12209
#12240 merged
Feb 22, 2025 -
Fix BertEmbeddingDataset
#12272 merged
Feb 22, 2025 -
Cherry pick
Add modelopt to requirements_nlp.txt (12261)
into r2.2.0
#12278 merged
Feb 22, 2025 -
Cherry pick
Add eval requirement to setup.py (12152)
into r2.2.0
#12277 merged
Feb 22, 2025 -
build: Bump PyT to 25.01
#11973 merged
Feb 22, 2025 -
Cherry pick
Fix the local path in Sortformer diarizer training tutorial (12135)
into r2.2.0
#12316 merged
Feb 22, 2025 -
automodel notebooks fix
#12238 merged
Feb 22, 2025 -
Cherry pick
build: Exclude tensorstore 0.1.72 (12317)
into r2.2.0
#12318 merged
Feb 22, 2025 -
Fixes and refactor for custom pretraining loop
#12319 merged
Feb 22, 2025 -
Misc resiliency features
#12302 merged
Feb 21, 2025 -
Add checkpointing support to custom pretraining loop
#12291 merged
Feb 21, 2025 -
build: Exclude tensorstore 0.1.72
#12317 merged
Feb 21, 2025 -
Fix the local path in Sortformer diarizer training tutorial
#12135 merged
Feb 21, 2025 -
[nemo1] Fix Mamba/Bert loading from checkpoint after TE extra states were introduced
#12275 merged
Feb 21, 2025 -
add ctc segmentation
#12312 merged
Feb 21, 2025 -
ci: Fix test workflow
#12311 merged
Feb 21, 2025 -
Update for pytorch 25.01 container
#12310 merged
Feb 21, 2025 -
Test model loading for nemo export
#12262 merged
Feb 21, 2025 -
Add test for evaluation
#12276 merged
Feb 21, 2025 -
build: Editable nemo install (#12304)
#12308 merged
Feb 21, 2025 -
Energon ckpt multimodal
#12245 merged
Feb 21, 2025 -
build: Editable nemo install
#12304 merged
Feb 21, 2025 -
Cherry pick
Set L2_Speech_Batch_Size_OOMptimizer_Canary to be optional (12299)
into r2.2.0
#12300 merged
Feb 21, 2025 -
Set L2_Speech_Batch_Size_OOMptimizer_Canary to be optional
#12299 merged
Feb 21, 2025 -
ci: Flaky tests release
#12293 merged
Feb 20, 2025 -
Add sampling args in TRTLLM generate
#11612 merged
Feb 20, 2025 -
build: Bump mcore ref
#12287 merged
Feb 20, 2025 -
fix masked loss calculation
#12255 merged
Feb 20, 2025 -
fix speechlm import ckpt on slurm
#12244 merged
Feb 20, 2025 -
remove nemo1 tests
#12280 merged
Feb 20, 2025 -
remove nemo1 unit tests
#12281 merged
Feb 20, 2025 -
Fixing error when loading T5 checkpoint created with TE<1.13
#12264 merged
Feb 20, 2025 -
Build bitsandbytes
#12279 merged
Feb 20, 2025 -
chore(🤖): Bump `NVIDIA/Megatron-LM` to `05ac33c...` (2025-02-20)
#12274 merged
Feb 20, 2025 -
Add modelopt to requirements_nlp.txt
#12261 merged
Feb 20, 2025 -
Add eval requirement to setup.py
#12152 merged
Feb 20, 2025 -
Remove modelopt state when empty in NeMo 1.0 distillation
#12266 merged
Feb 20, 2025 -
Training loop
#12268 merged
Feb 20, 2025 -
Add optimizer fix
#12253 merged
Feb 19, 2025 -
Remove some old nemo1 contents in doc
#12156 merged
Feb 19, 2025 -
Llama Embedding Model Improvement
#12236 merged
Feb 19, 2025 -
Add evaluate utilities to custom pretraining loop
#12250 merged
Feb 19, 2025 -
Fix distillation state-dict loading bug
#12270 merged
Feb 19, 2025 -
add default kwargs for trtllm model runner
#12248 merged
Feb 19, 2025 -
ci: Fix pypi link of dry-run
#12267 merged
Feb 19, 2025 -
Fix 2D bucketing test on Python 3.12
#12265 merged
Feb 19, 2025 -
add default param dtype in mistral configs
#12186 merged
Feb 19, 2025 -
ci: Bump release workflows
#12259 merged
Feb 19, 2025 -
chore(🤖): Bump `NVIDIA/Megatron-LM` to `61b2c4f...` (2025-02-19)
#12251 merged
Feb 19, 2025 -
Ckpt fixes pytorch update
#12228 merged
Feb 19, 2025 -
build: force-reinstall
#12214 merged
Feb 19, 2025 -
Restructure llm perf scripts to support vlm/diffusion/flux collections
#12252 merged
Feb 19, 2025 -
fix[export]: reshard model correctly handles extra_state when it's a tensor
#12132 merged
Feb 19, 2025 -
ci: Generate coverage for e2e tests
#12120 merged
Feb 18, 2025 -
Add setup function to setup model, optimizer and dataloaders for training
#12247 merged
Feb 18, 2025 -
Update scheduler
#12243 merged
Feb 18, 2025 -
Asr fixes 2.2
#12227 merged
Feb 18, 2025 -
Fix Llama Embedding Tutorial
#12149 merged
Feb 18, 2025 -
Add ConfigContainer plus data and tokenizer modules to nemo/tron
#12241 merged
Feb 18, 2025 -
Apply missing lr_mult and wd_mult to the lr and weight_decay of megatron param groups.
#12123 merged
Feb 18, 2025 -
ci: Disable flaky transcription tests
#12237 merged
Feb 18, 2025 -
add missing __init__
#12239 merged
Feb 18, 2025 -
[Draft] Llama Embedding Model Fix
#12235 merged
Feb 18, 2025 -
skip-linting label optionally disables lint checks
#12179 merged
Feb 18, 2025 -
ci: Simplify install-check
#12231 merged
Feb 18, 2025 -
Cherry pick
nemo-automodel checkpoint-io refactor (12070)
into r2.2.0
#12234 merged
Feb 18, 2025 -
Cherry pick
Fix loading extra states from torch tensor (12185)
into r2.2.0
#12226 merged
Feb 18, 2025 -
Add automodel multinode tut and fix sft peft bug
#12209 merged
Feb 18, 2025 -
nemo-automodel checkpoint-io refactor
#12070 merged
Feb 18, 2025 -
Cherry pick
build: Pin down transformers (12229)
into r2.2.0
#12230 merged
Feb 17, 2025 -
build: Pin down transformers
#12229 merged
Feb 17, 2025 -
Cherry pick
disable moe logging to avoid deepseek hang (12168)
into r2.2.0
#12192 merged
Feb 17, 2025 -
Fix loading extra states from torch tensor
#12185 merged
Feb 17, 2025 -
Cherry pick
Fix multi-GPU in-framework deployment (12090)
into r2.2.0
#12172 merged
Feb 17, 2025 -
ci: Remove pull_request trigger
#12224 merged
Feb 17, 2025 -
ci: Update release workflows
#12223 merged
Feb 17, 2025 -
ci: Add install-test
#12215 merged
Feb 17, 2025 -
ci: Use release-ref
#12219 merged
Feb 17, 2025 -
ci: Code-freeze dry-run
#12217 merged
Feb 17, 2025 -
Cherry pick
Update TTS code to remove calls to deprecated functions (12153)
into r2.2.0
#12201 merged
Feb 17, 2025 -
Cherry pick
Add function calling SFT NeMo2.0 tutorial (11868)
into r2.2.0
#12180 merged
Feb 17, 2025 -
Revert pytest -s
#12202 merged
Feb 15, 2025 -
Fix nemo-run stdin exception
#12197 merged
Feb 15, 2025 -
Update TTS code to remove calls to deprecated functions
#12153 merged
Feb 15, 2025 -
Set weights_only=False in torch.load in EMA callback and AdapterMixin
#12198 merged
Feb 15, 2025 -
Update forward for eval/inference pass in FlowMatching models
#12056 merged
Feb 15, 2025 -
Use pip --no-deps --force-reinstall when building the test container
#12175 merged
Feb 14, 2025 -
Fixes for bumping pyt to 25.01
#12165 merged
Feb 14, 2025 -
Cherry pick
build: Force re-install VCS dependencies (12155)
into r2.2.0
#12191 merged
Feb 14, 2025 -
use getattr for .optim and allow variable num of args
#12184 merged
Feb 14, 2025 -
Add perf model configs gb200
#12140 merged
Feb 14, 2025 -
disable moe logging to avoid deepseek hang
#12168 merged
Feb 14, 2025 -
Fix nsys callback tests
#12177 merged
Feb 14, 2025 -
Add usage instructions for Cosmos TensorRT
#11650 merged
Feb 14, 2025 -
Fix num nodes to match parallel mappings
#12170 merged
Feb 14, 2025 -
Update Llama 3.1 model IDs
#12089 merged
Feb 14, 2025 -
Distillation NeMo run entrypoint and recipe
#12143 merged
Feb 14, 2025 -
Add function calling SFT NeMo2.0 tutorial
#11868 merged
Feb 14, 2025 -
Migrate SpeechLM to NeMo 2.0
#10808 merged
Feb 13, 2025 -
Fix multi-GPU in-framework deployment
#12090 merged
Feb 13, 2025 -
remove nemo.collection.nlp imports from nemo2
#11904 merged
Feb 13, 2025 -
Fix state transform
#12147 merged
Feb 13, 2025 -
[Audio] Tiny fix for time generation in flow matching
#12091 merged
Feb 13, 2025 -
AutoModel Notebooks
#12013 merged
Feb 13, 2025 -
Revert "build: Force re-install VCS dependencies"
#12163 merged
Feb 13, 2025 -
fix: add default kwargs for trtllm model runner
#12131 merged
Feb 13, 2025 -
chore: Update notebooks
#12161 merged
Feb 12, 2025 -
chore: Version bump
#12160 merged
Feb 12, 2025 -
ci: Bump code-freeze workflow
#12159 merged
Feb 12, 2025 -
build: Force re-install VCS dependencies
#12155 merged
Feb 12, 2025 -
Minor Bug Fixes - LLaMa Embedding
#12146 merged
Feb 12, 2025 -
update export io call
#12144 merged
Feb 12, 2025 -
Skip initialization in hf export
#12136 merged
Feb 12, 2025 -
interface for asymmetric pipeline schedule
#12039 merged
Feb 12, 2025 -
chore(🤖): Bump `NVIDIA/Megatron-LM` to `55cdfc1...` (2025-02-12)
#12148 merged
Feb 12, 2025 -
AudioToAudioModel: fix model->dataloader sample_rate parameter injection
#12092 merged
Feb 12, 2025 -
Add error message when downloading failed.
#12139 merged
Feb 12, 2025 -
fix: export weight name mapping if model is nemo model
#11497 merged
Feb 12, 2025 -
changed asr models outputs to be consistent
#11818 merged
Feb 12, 2025 -
tests: Run FSDP2 on dual-gpu
#12145 merged
Feb 11, 2025 -
Prevent downloading dataset every time in ci test
#12095 merged
Feb 11, 2025 -
Update vLLM to 0.7.2
#12078 merged
Feb 11, 2025 -
chore(🤖): Bump `NVIDIA/Megatron-LM` to `26ad9b3...` (2025-02-11)
#12130 merged
Feb 11, 2025 -
fix the issue during batched inference of Sortformer diarizer
#12047 merged
Feb 11, 2025 -
Rename neva datamodule
#12121 merged
Feb 11, 2025 -
Bug fix with generation of expert_tensor_parallel_rank
#12125 merged
Feb 11, 2025 -
Add Automodel support for Deepseek v3 model
#12099 merged
Feb 11, 2025 -
Add performance-optimized example for llama2 70b LoRA
#12055 merged
Feb 11, 2025 -
ci: Disable checks
#12129 merged
Feb 11, 2025 -
[MoE] Add type annotation for mixtral configs
#12126 merged
Feb 10, 2025 -
Malay/bw scripts
#11961 merged
Feb 10, 2025 -
DAPT playbooks - with NeMo 2.0
#12067 merged
Feb 10, 2025 -
Propogate dp last changes from mcore
#12012 merged
Feb 10, 2025 -
Add neva pretrain script
#12033 merged
Feb 10, 2025 -
ci: Bump bot
#12117 merged
Feb 10, 2025 -
ci: Bump Mcore inplace
#12115 merged
Feb 10, 2025 -
Update mcore commit (02.06.25)
#12114 merged
Feb 10, 2025 -
refactor peft module matching; introduce exclude_modules
#12066 merged
Feb 9, 2025 -
build: Optimize
#12112 merged
Feb 9, 2025 -
Ensure nemo.collections.vlm does not strictly require transformer engine
#12108 merged
Feb 9, 2025 -
ci: Fix flaky test
#12113 merged
Feb 9, 2025 -
build: Better caching
#12109 merged
Feb 9, 2025 -
etp docs
#12111 merged
Feb 9, 2025 -
ci: Update bump workflow
#12106 merged
Feb 8, 2025 -
ci: Modular unit tests
#12104 merged
Feb 8, 2025 -
ci: Update bump workflow
#12105 merged
Feb 8, 2025 -
build: Improve installer
#12016 merged
Feb 8, 2025 -
ci: codecov
#12030 merged
Feb 8, 2025 -
chore(beep boop 🤖): Bump MCORE_TAG=bcee052... (2025-02-08)
#12100 merged
Feb 8, 2025 -
Fix SBERT with sequence_len_offset
#12057 merged
Feb 8, 2025 -
Refactor VLM modules / Add InternVit submodule support
#11851 merged
Feb 7, 2025 -
Add Llama2 7B recipe
#11649 merged
Feb 7, 2025 -
fix nmt dataclass issue
#12081 merged
Feb 7, 2025 -
chore(beep boop 🤖): Bump MCORE_TAG=6219d96... (2025-02-07)
#12088 merged
Feb 7, 2025 -
minor fix in model's summary identation during logging
#12084 merged
Feb 7, 2025 -
set TOKENIZERS_PARALLELISM=True
#12083 merged
Feb 7, 2025 -
Fix hf_dataset bug
#12072 merged
Feb 7, 2025 -
Fix Linting
#12079 merged
Feb 7, 2025 -
Add Llama Embedding Tutorial
#12042 merged
Feb 6, 2025 -
throw MegatronOptimizerModule warning only with mcore models
#12085 merged
Feb 6, 2025 -
Debug Apex distributed optimizer to handle Transformer Engine 2.0
#12004 merged
Feb 6, 2025 -
chore(beep boop 🤖): Bump MCORE_TAG=0ae1d14... (2025-02-06)
#12073 merged
Feb 6, 2025 -
Update optimization features readme from nemo1 to nemo2
#12071 merged
Feb 6, 2025 -
attn_implementation eager fallback
#12060 merged
Feb 6, 2025 -
Adding nemo CI
#12052 merged
Feb 6, 2025 -
Conformer-based spectrogram estimator
#12002 merged
Feb 6, 2025 -
add cp_comm_type param to Mistral config
#12049 merged
Feb 6, 2025 -
Pipeline-parallel support for Knowledge Distillation (NeMo 2)
#11766 merged
Feb 6, 2025 -
Sortformer Diarizer 4spk v1 model PR Part 4: Sortformer Documents and Notebook Tutorials
#11707 merged
Feb 5, 2025 -
Recipe changes for performance
#11763 merged
Feb 5, 2025 -
ci: Lint Python files only
#12064 merged
Feb 5, 2025 -
chore(beep boop 🤖): Bump MCORE_TAG=ca46c53... (2025-02-04)
#12053 merged
Feb 5, 2025 -
chore: Add warning for rebase
#12061 merged
Feb 5, 2025 -
Add padding in mllama vision encoder to align with HF
#11808 merged
Feb 5, 2025 -
avoid missmatch error when loading older TE checkpoints
#12028 merged
Feb 4, 2025 -
ci: Allow skipping docs
#12048 merged
Feb 4, 2025 -
Adding TFLOPs callback for Multimodal models and NeVA calculator
#11969 merged
Feb 4, 2025 -
Clip Model in Nemo2
#11980 merged
Feb 4, 2025 -
fix llama-3.1 hf model_id
#11774 merged
Feb 4, 2025 -
nemo-automodel: fsdp2 support for peft
#12008 merged
Feb 4, 2025 -
ci: Update workflow
#12044 merged
Feb 4, 2025 -
ci: Update weekly brain
#12043 merged
Feb 4, 2025 -
chore(beep boop 🤖): Bump MCORE_TAG=284ed81... (2025-02-04)
#12038 merged
Feb 4, 2025 -
Version bump to 2.2.0rc2.dev0
#12040 merged
Feb 4, 2025 -
[MoE] fix run err in mixtral22B recipe and update its perf config
#12036 merged
Feb 4, 2025 -
ci: Retry on timeout
#11974 merged
Feb 4, 2025 -
ci: Always run linting
#12035 merged
Feb 3, 2025 -
Replace reference of requirements_infer.txt with requirements_deploy.txt
#12029 merged
Feb 3, 2025 -
ci: Run linting per domain
#12027 merged
Feb 3, 2025 -
chore(beep boop 🤖): Bump MCORE_TAG=dbe8fa0... (2025-02-01)
#12014 merged
Feb 3, 2025 -
chore(beep boop 🤖): Bump MCORE_TAG=d5069b8... (2025-01-29)
#11981 merged
Feb 3, 2025 -
Weekly bump
#11896 merged
Feb 3, 2025 -
Set zarr range to >=2.18.2 and <3.0.0
#12005 merged
Feb 3, 2025 -
[Audio] Fix extra step in Euler sampler for flow matching inference
#11989 merged
Feb 3, 2025 -
ci: Run unit tests on main
#11986 merged
Feb 3, 2025 -
Version bump to 2.2.0rc1
#12023 merged
Feb 2, 2025 -
ci: Release workflow
#12022 merged
Feb 2, 2025 -
ci: Build wheel workflow
#12021 merged
Feb 2, 2025 -
minor fix and simplify
#12007 merged
Feb 2, 2025 -
Adding speechlm AutoModel test
#11990 merged
Feb 2, 2025 -
chore(ci): Disable VMs cron job on forks
#12020 merged
Feb 2, 2025 -
Add the NeMo2 memory profiling plugin
#12009 merged
Feb 2, 2025 -
Llama3.2 1B Embedding Model Support
#11909 merged
Jan 31, 2025 -
Mask vocab padding token ids from CE loss
#11999 merged
Jan 31, 2025 -
Checkpoint saving for automodels via ModelCheckpoint
#11998 merged
Jan 31, 2025 -
Remove deprecated tests/infer_data_path.py
#11997 merged
Jan 31, 2025 -
Introduce evaluation API
#11895 merged
Jan 31, 2025 -
Adding serialization to all Auto* objects in HuggingFace transformers
#11645 merged
Jan 31, 2025 -
Speechllm develop gen duplex
#11993 merged
Jan 30, 2025 -
nemo automodel sft squad data prep fix
#11994 merged
Jan 30, 2025 -
remove --disable-ckpt from tests
#11996 merged
Jan 30, 2025 -
remove renormalize_blend_weights flag
#11975 merged
Jan 30, 2025 -
add exception when loading ckpt saved by TE < 1.13
#11988 merged
Jan 30, 2025 -
callbacks and bf16 grad
#11985 merged
Jan 30, 2025 -
Use override_vocab_size for trtllm export to support qwen
#11982 merged
Jan 29, 2025 -
[checkpoint][docs] Fix typos in dist checkpointing docs
#11983 merged
Jan 29, 2025 -
Update transcribe_utils.py
#11984 merged
Jan 29, 2025 -
improve error and debug messages in model connector
#11979 merged
Jan 29, 2025 -
add use_fast option
#11976 merged
Jan 29, 2025 -
Add batching support for evaluation
#11934 merged
Jan 29, 2025 -
ci: Add coverage reports
#11912 merged
Jan 28, 2025 -
chore(beep boop 🤖): Bump MCORE_TAG=0e85db5... (2025-01-28)
#11967 merged
Jan 28, 2025 -
Add options to add mp_policy and parallel_fn for NeMo automodel fsdp2
#11956 merged
Jan 27, 2025 -
Update torch load for load from disk
#11963 merged
Jan 27, 2025 -
Run Flake8 for nemo.export module
#11728 merged
Jan 27, 2025 -
chore(beep boop 🤖): Bump MCORE_TAG=f960d4d... (2025-01-26)
#11958 merged
Jan 27, 2025 -
[MoE] add expert tensor parallelism support for NeMo2.0 MoE
#11880 merged
Jan 27, 2025 -
llm performance scripts
#11736 merged
Jan 25, 2025 -
ci: Use single runner machines for unit tests
#11937 merged
Jan 25, 2025 -
PTQ & TRT-LLM updates related to upcoming PyTorch 25.01 bump
#11941 merged
Jan 25, 2025 -
enable loading older TE checkpoints
#11930 merged
Jan 24, 2025 -
chore(beep boop 🤖): Bump MCORE_TAG=2167226... (2025-01-24)
#11947 merged
Jan 24, 2025 -
chore(beep boop 🤖): Bump MCORE_TAG=0d59157... (2025-01-23)
#11932 merged
Jan 24, 2025 -
Add sharding for speechlm and vlm
#11876 merged
Jan 23, 2025 -
TPS-free 2D bucket estimation and filtering
#11738 merged
Jan 23, 2025 -
Cherry pick
build: Add sox to SDE (11882)
into r2.1.1
#11936 merged
Jan 23, 2025 -
build: Fix triton
#11940 merged
Jan 23, 2025 -
build: Pin triton
#11938 merged
Jan 23, 2025 -
Autodetect dtype on exporting to TensorRT-LLM
#11907 merged
Jan 23, 2025 -
build: Add sox to SDE
#11882 merged
Jan 23, 2025 -
Add more fine-grained performance metrics
#11619 merged
Jan 23, 2025 -
Enable NeMo importer and loading dist CKPT for training
#11927 merged
Jan 23, 2025 -
Create test_phi3.py
#11843 merged
Jan 22, 2025 -
ci: Adjust input argument
#11921 merged
Jan 22, 2025 -
chore(beep boop 🤖): Bump MCORE_TAG=5c12382... (2025-01-22)
#11925 merged
Jan 22, 2025 -
Revert #11890 and add a test that would have caught the error
#11914 merged
Jan 22, 2025 -
chore(beep boop 🤖): Bump MCORE_TAG=5c12382... (2025-01-22)
#11922 merged
Jan 22, 2025 -
Add New Transformer Backbone for TTS Models
#11911 merged
Jan 22, 2025
71 Pull requests opened by 41 people
-
nemo-ux: deprecate app state
#11935 opened
Jan 23, 2025 -
add initial hf automodel docs
#11942 opened
Jan 23, 2025 -
Add NVTX ranges to categorize execution
#11945 opened
Jan 23, 2025 -
[checkpoint] Log timings for checkpoint IO save and load
#11972 opened
Jan 28, 2025 -
Add docs on env vars
#11991 opened
Jan 29, 2025 -
Avoid init_ddp for inference
#12011 opened
Jan 31, 2025 -
Draft: Enable a minimal docs linkcheck build
#12015 opened
Feb 1, 2025 -
Log checkpoint saves at start and finish
#12018 opened
Feb 2, 2025 -
replace n-1 -> n
#12025 opened
Feb 3, 2025 -
Script for estimating data weights with optional temperature
#12032 opened
Feb 3, 2025 -
Add flux recipe for ci
#12037 opened
Feb 4, 2025 -
NeMo export: Remove unnecessary expert key mapping
#12041 opened
Feb 4, 2025 -
Fix/update audio to text dataset
#12045 opened
Feb 4, 2025 -
Fix per-rank log file creation
#12058 opened
Feb 5, 2025 -
Fix bugs in `AudioToMelSpectrogramPreprocessor.input_example`
#12063 opened
Feb 5, 2025 -
numactl cmd
#12069 opened
Feb 5, 2025 -
Configure FSDP to keep module params
#12074 opened
Feb 6, 2025 -
Make TETransformerLayerAutocast Support Cuda Graph
#12075 opened
Feb 6, 2025 -
Fix dataclass field in some asr examples.
#12076 opened
Feb 6, 2025 -
Add T5TTSv2 and Updates NeMo Audio Codecs
#12082 opened
Feb 6, 2025 -
[WIP] add auto model pretrain example
#12087 opened
Feb 6, 2025 -
Save and Restore ModelOpt state in NeMo 2.0
#12094 opened
Feb 7, 2025 -
Avoid rewrapping modules with DDP/FSDP if already wrapped
#12096 opened
Feb 7, 2025 -
Training Performance Optimization for flux_controlnet
#12097 opened
Feb 7, 2025 -
Add FastAPI v1/completions/ endpoint
#12101 opened
Feb 8, 2025 -
fix max_utts hard reqs
#12119 opened
Feb 10, 2025 -
Support Nvidia-DLFramework-Inspect
#12122 opened
Feb 10, 2025 -
Enable ucc backend for pp [NeMo1/NeMo2]
#12128 opened
Feb 10, 2025 -
Fix has_global_batch_sampler to handle datamodules without data_sampler attribute
#12137 opened
Feb 11, 2025 -
updated nemotron h100 cfgs
#12138 opened
Feb 11, 2025 -
Avoid rewrapping modules with DDP and Float16Module on repeated trainer.fit calls
#12141 opened
Feb 11, 2025 -
Add err msg if download failed
#12142 opened
Feb 11, 2025 -
Neva ETP EPP support
#12154 opened
Feb 12, 2025 -
Parakeet RNNT with target lang ID
#12173 opened
Feb 13, 2025 -
Remove getattr_proxy to avoid problematic edge cases
#12176 opened
Feb 13, 2025 -
Abhi/llava next sp
#12182 opened
Feb 14, 2025 -
Add DeepSeek-R1 Distillation NeMo 2.0 tutorial
#12187 opened
Feb 14, 2025 -
Fix: 'IterableDatasetWrapper' has no len() when using Lhotse datasets
#12190 opened
Feb 14, 2025 -
fix for te.linear
#12196 opened
Feb 14, 2025 -
exp manager updates
#12211 opened
Feb 17, 2025 -
Bump zarr version
#12216 opened
Feb 17, 2025 -
Support customization of a few parameters in scripts/vlm/llava_next_pretrain
#12218 opened
Feb 17, 2025 -
Moving async-queue to AppState
#12221 opened
Feb 17, 2025 -
Version bump to `2.2.0rc3.dev0`
#12222 opened
Feb 17, 2025 -
Add Trapezoidal / WSD LR scheduler
#12225 opened
Feb 17, 2025 -
Update L2_NeMo_2_NeMo_Mcore_Mixtral_bitexact to reenable failure on mismatch
#12233 opened
Feb 18, 2025 -
ONNX exporter
#12242 opened
Feb 18, 2025 -
Fixed normalization of feature vector and weight vector
#12246 opened
Feb 18, 2025 -
feat: Allow reshaping of HF checkpoint when converting from .nemo
#12249 opened
Feb 18, 2025 -
Add energon neva pretrain script and fix checkpoint saving
#12256 opened
Feb 19, 2025 -
Evo2 merge 20250214
#12263 opened
Feb 19, 2025 -
Fix model validate broadcast error
#12269 opened
Feb 19, 2025 -
NeVA performance recipe and script
#12271 opened
Feb 19, 2025 -
Fix for te v2.0
#12273 opened
Feb 19, 2025 -
Add trust_remote_code to load_context
#12282 opened
Feb 20, 2025 -
Perf script fix
#12285 opened
Feb 20, 2025 -
fix: typos in documentation files
#12288 opened
Feb 20, 2025 -
Call default factory in dataclasses when saving yaml via nemo.lightning.io
#12289 opened
Feb 20, 2025 -
Update README.md
#12294 opened
Feb 20, 2025 -
Adding FLOP calculator for FLUX
#12295 opened
Feb 20, 2025 -
Respect `pad_seq_length_to_mult` for chat datasets
#12297 opened
Feb 20, 2025 -
Add nemo-run recipe for evaluation
#12301 opened
Feb 21, 2025 -
fix loss reporting
#12303 opened
Feb 21, 2025 -
chore(🤖): Bump `NVIDIA/Megatron-LM` to `c91756d...` (2025-02-21)
#12305 opened
Feb 21, 2025 -
fixing max_utts
#12309 opened
Feb 21, 2025 -
Bug fixes
#12315 opened
Feb 21, 2025 -
build: Bump mcore
#12320 opened
Feb 21, 2025 -
chore(🤖): Bump `NVIDIA/Megatron-LM` to `7980711...` (2025-02-22)
#12321 opened
Feb 22, 2025 -
Entrypoint
#12322 opened
Feb 22, 2025 -
build: Bump PyT to 25.01 (#11973)
#12323 opened
Feb 22, 2025 -
use /tmp for HF_HOME
#12325 opened
Feb 22, 2025
29 Issues closed by 7 people
-
Optical Flow classifier
#11847 closed
Feb 21, 2025 -
Unserializable Error with using Energon Dataloader for NeVA (LLaVA) pretraining / fine-tuning and NeMo 2.0
#11931 closed
Feb 20, 2025 -
How to use nemo docker container as base image
#11824 closed
Feb 20, 2025 -
Failing convert_llama_hf_to_nemo.py
#11840 closed
Feb 20, 2025 -
can't load saved fp8 checkpoint when resume training (MOE model)
#11828 closed
Feb 19, 2025 -
How to set lhotse mixed noise parameters in yaml
#11812 closed
Feb 17, 2025 -
convert_llama_hf_to_nemo.py use llama31
#11717 closed
Feb 15, 2025 -
Llama 405B NeMo version
#11776 closed
Feb 15, 2025 -
`cfg` must have `tokenizer` config to create a tokenizer !
#12019 closed
Feb 13, 2025 -
llm.import_ckpt cannot run directly
#11756 closed
Feb 11, 2025 -
[QST] Found no performance gain training Mixtral-8x7B with FP8 on H800
#11959 closed
Feb 10, 2025 -
What kind of manifest does NEST require?
#11752 closed
Feb 10, 2025 -
Issue contributing
#11440 closed
Feb 9, 2025 -
NeMo won't install, no module named torch
#11601 closed
Feb 9, 2025 -
NeMo/examples/slu/speech_intent_slot 0 files were filtered totalling 0.00 hours
#11734 closed
Feb 9, 2025 -
I came here for llama3 meets upcycling
#11644 closed
Feb 7, 2025 -
tutorial notebook fails installing mamba-ssm dependency
#11691 closed
Feb 6, 2025 -
NeMo intermittent start-up failure with OMPI temp directory error on k8s
#11724 closed
Feb 6, 2025 -
how to inference a .nemo file which is converted from a HuggingFace format?
#11478 closed
Jan 31, 2025 -
IndexError: index 0 is out of bounds for dimension 1 with size 0
#11700 closed
Jan 30, 2025 -
Which version of transformer engine should I use, when I try to open ub_tp_comm_overlap?
#11683 closed
Jan 27, 2025 -
SymbolicValueError: STFT does not currently support complex types
#11684 closed
Jan 27, 2025 -
docs.nvidia.com link to tutorial notebooks via `stable` tag, which fail to install dependencies
#11690 closed
Jan 27, 2025 -
NeMo Git tag for patching a nvcr.io/nvidia/nemo:24.07 based Docker container
#11943 closed
Jan 24, 2025 -
How to disable `torch_dist` ckpt format ?
#11625 closed
Jan 24, 2025
41 Issues opened by 34 people
-
Nvidia NEMO 2.0 Serialization Issue: I am facing the same serialization issue with fiddle
#12296 opened
Feb 20, 2025 -
Issues around Resumed Runs
#12290 opened
Feb 20, 2025 -
Off By One Error When Checkpointing and Old Checkpoints Getting Deleted During Run
#12284 opened
Feb 20, 2025 -
Concurrency Issues with MSDD Diarization
#12254 opened
Feb 19, 2025 -
Exported Llama Models Trained Using NeMo Generate The Same Token Repeatedly
#12212 opened
Feb 17, 2025 -
loss divergence when CP>1 and MBS>1
#12210 opened
Feb 17, 2025 -
Pre-Training Neva under pipeline parallel set to 2.
#12205 opened
Feb 16, 2025 -
Checkpointing randomly fails
#12203 opened
Feb 15, 2025 -
Support configuration of num_workers and max_samples_per_sequence in llava_next_pretrain
#12195 opened
Feb 14, 2025 -
Bloated pre-requirements
#12188 opened
Feb 14, 2025 -
HiFiGAN Finetune "Cannot re-initialize CUDA in forked subprocess."
#12178 opened
Feb 13, 2025 -
Update TE version for support of `pad_between_seqs=True`
#12174 opened
Feb 13, 2025 -
I am trying to train the FastConformer 120M model from scratch, but it is not converging?
#12167 opened
Feb 13, 2025 -
NeMo is not friendly to HF compatibility.
#12166 opened
Feb 13, 2025 -
[HELP] Run into the NaN grad problem while going through the exmaple of official document with fp16
#12134 opened
Feb 11, 2025 -
Implement multi-token prediction option for models
#12133 opened
Feb 11, 2025 -
Fail to convert trained checkpoint to HF format
#12124 opened
Feb 10, 2025 -
[QST] How to set MoE-specific TP size in recipe?
#12103 opened
Feb 8, 2025 -
Loss Fails to Converge in Nemo2-sft.ipynb with Precision 16
#12102 opened
Feb 8, 2025 -
ASR Lhotse dataloader : TypeError: object of type 'IterableDatasetWrapper' has no len()
#12093 opened
Feb 7, 2025 -
AttributeError: 'HFDatasetDataModule' object has no attribute 'tokenizer'
#12080 opened
Feb 6, 2025 -
Need some help/clarity on installing
#12068 opened
Feb 5, 2025 -
extra_loggers is not used to log metrics or hyperparameters
#12046 opened
Feb 4, 2025 -
llava-like dataset implementation "LazySupervisedDataset" likely fails to handle large dataset
#12034 opened
Feb 3, 2025 -
ASR: Is there a coefficient for transcripted words/phrases?
#12026 opened
Feb 3, 2025 -
How do I enable dllogger in NeMo 2.0?
#12010 opened
Jan 31, 2025 -
ASR: How to convert .ckpt to nemo correctly?
#12003 opened
Jan 31, 2025 -
num_sanity_val_steps too large issue
#11978 opened
Jan 28, 2025 -
Add option for prefetch factor of data loader to config
#11977 opened
Jan 28, 2025 -
Megatron BERT Embedding conversion inconsistency
#11970 opened
Jan 28, 2025 -
Pickling error when trying to save checkpoints with custom checkpointIO
#11955 opened
Jan 24, 2025 -
Add Git Tag associated with NeMo Docker containers
#11954 opened
Jan 24, 2025 -
Gemma 2 NeMo 2.0 to HF conversion bug
#11951 opened
Jan 24, 2025 -
Hybrid Sharding support with FSDP
#11946 opened
Jan 23, 2025 -
MegatronGPTModel trains much worse when reducing micro_batch_size
#11939 opened
Jan 23, 2025 -
Have a nemo training container without additional framework elements
#11933 opened
Jan 23, 2025 -
Installation instruction for conda/pip does not work
#11929 opened
Jan 22, 2025 -
Tenacity/s3fs not in requirements
#11926 opened
Jan 22, 2025
36 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
Aligner/nemotron5
#11264 commented on
Feb 22, 2025 • 8 new comments -
Add Minitron pruning example for NeMo 2.0
#11848 commented on
Feb 20, 2025 • 7 new comments -
Allow configuration of PP communication backend to UCC in nemo2
#11755 commented on
Feb 18, 2025 • 6 new comments -
replaced classification model with EncDecSpeakerLabelModel
#11887 commented on
Feb 22, 2025 • 5 new comments -
Fast N-Gram LM on GPU + greedy decoding (RNN-T, TDT, CTC)
#10989 commented on
Feb 15, 2025 • 5 new comments -
Improving NeMo export
#11920 commented on
Feb 21, 2025 • 1 new comment -
Add "_skipme" option to Lhotse Dataloading
#11793 commented on
Feb 21, 2025 • 1 new comment -
Add a checkpoint averaging script for the new .distcp checkpoint format
#10462 commented on
Jan 23, 2025 • 0 new comments -
fix(huggingface-hub): allow offline mode
#11901 commented on
Feb 21, 2025 • 0 new comments -
fix: correct signal for Windows
#11898 commented on
Feb 15, 2025 • 0 new comments -
Lack of GPU memory
#11915 commented on
Jan 23, 2025 • 0 new comments -
Add nemo1 to nemo2 conversion for neva
#11860 commented on
Feb 18, 2025 • 0 new comments -
Not able to run LLaVA-Next pretraining with NeMo 2.0 using container version nemo:24.12
#11741 commented on
Jan 23, 2025 • 0 new comments -
FastPitch_Adapter_Finetuning doesn't works
#11666 commented on
Jan 30, 2025 • 0 new comments -
when i use container to do sft for any model, it has context not found error
#11825 commented on
Feb 7, 2025 • 0 new comments -
How to load local model using import ckpt function
#11867 commented on
Feb 7, 2025 • 0 new comments -
Add safetensor option when saving and restoring models
#11549 commented on
Feb 15, 2025 • 0 new comments -
Fixes ASR numpy > 2.x compatibility issues while replicating existing behavior
#11447 commented on
Feb 14, 2025 • 0 new comments -
NeMo-UX: MegatronAutoModel
#11341 commented on
Feb 19, 2025 • 0 new comments -
FilterbankFeatures may return NaNs on CUDA device - torch autocast problem
#11541 commented on
Feb 13, 2025 • 0 new comments -
Fix: Data from AIStore
#11241 commented on
Feb 22, 2025 • 0 new comments -
Add MCore FSDP2 support
#11216 commented on
Feb 13, 2025 • 0 new comments -
Add scripts for importing a ckpt and running a forward step on it for nemo.collections.llm
#11108 commented on
Feb 19, 2025 • 0 new comments -
[NeMo-UX] Add option to drop optimizer states
#11089 commented on
Feb 20, 2025 • 0 new comments -
ASR: Is there any checked and stable way for pretrain?
#11813 commented on
Feb 15, 2025 • 0 new comments -
Self_hosted not honor the parameters
#11924 commented on
Feb 22, 2025 • 0 new comments -
Add CI Tests for Canary/AEDMultitask "lang_field"
#10103 commented on
Feb 21, 2025 • 0 new comments -
Broken offline mode of NeMo
#11899 commented on
Feb 21, 2025 • 0 new comments -
XLarge Fastconformer Long FT does not converge with default parameters
#11894 commented on
Feb 20, 2025 • 0 new comments -
max_steps and time calculation are not working as expected.
#11900 commented on
Feb 20, 2025 • 0 new comments -
Support Pipeline Parallel in Knowledge Distillation
#11531 commented on
Feb 17, 2025 • 0 new comments -
Possible bug in ASRDecoderTimeStamps - math.ceil on fractional tokens_per_chunk leads to timestamps displacements on long files
#11604 commented on
Feb 16, 2025 • 0 new comments -
Canary ouputs English for Arabic Speech
#11826 commented on
Feb 16, 2025 • 0 new comments -
Cosmos support
#11844 commented on
Feb 16, 2025 • 0 new comments -
`prepare_energon_dataset.py` is supposed to save encoded latents but reconstructed videos are saved instead.
#11853 commented on
Feb 16, 2025 • 0 new comments