Insights: deepspeedai/DeepSpeed
Overview
8 Pull requests merged by 5 people
- Update references to new X/Twitter handle (#7110, merged Mar 5, 2025)
- Fix fused_qkv print model ValueError (#7109, merged Mar 4, 2025)
- Avoid graph break due to unsupported frozenset (#7105, merged Mar 4, 2025)
- Only run pre-commit on the changes (#7106, merged Mar 4, 2025)
- Avoid graph breaks in torch.compile caused by inner classes in the backward hooks (#7062, merged Mar 4, 2025)
- Avoid graph breaks by disabling sourceless calls in instrument_w_nvtx (#7081, merged Mar 3, 2025)
- Use new dlpack api; formatting fixes (#7101, merged Mar 3, 2025)
- Remove workflows for very old torch versions (#7090, merged Feb 28, 2025)
5 Pull requests opened by 5 people
- Update Gaudi2 nightly/CI to latest 1.20.0 build (#7093, opened Feb 28, 2025)
- Variable batch size and LR scheduler (#7104, opened Mar 3, 2025)
- [Draft] Add support for seq split in Domino (#7111, opened Mar 4, 2025)
- Fix keep_module_on_host (#7112, opened Mar 6, 2025)
- [XPU] Support XCCL on the DeepSpeed side (#7113, opened Mar 6, 2025)
3 Issues closed by 3 people
- nv-sd CI test failure (#7098, closed Mar 3, 2025)
- [BUG] Training a custom model with NPU ZeRO-3 raises "Function SumBackward0 returned an invalid gradient at index 0" (#7078, closed Mar 3, 2025)
- [BUG] Is MoQ mode deprecated in DeepSpeed? Running with an MoQ config, but no quantization appears in the log (#7091, closed Feb 28, 2025)
8 Issues opened by 8 people
- DeepSpeed on Power with CPU Accelerator and AutoTP (#7108, opened Mar 4, 2025)
- [REQUEST] An option for SUM gradient allreduce instead of MEAN (#7107, opened Mar 4, 2025)
- Support DualPipe training (#7100, opened Mar 2, 2025)
- About delayed parameter update in ZeRO-Offload (#7099, opened Mar 2, 2025)
- [REQUEST] Proposal for enhancing ChatGPT's response quality during training (#7097, opened Mar 1, 2025)
- LLaMA Factory, quantized model, and DeepSpeed compatibility (#7096, opened Mar 1, 2025)
- nv-nightly CI test failure (#7095, opened Mar 1, 2025)
- [BUG] Is MoQ mode deprecated in DeepSpeed? Running with an MoQ config, but no quantization appears in the log (#7092, opened Feb 28, 2025)
22 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
- Training multiple models (#7018, commented on Mar 6, 2025 • 10 new comments)
- Enabled high-performance Automatic Tensor Parallelism (auto TP) for MoE models on multiple GPUs/HPUs (#6964, commented on Mar 5, 2025 • 2 new comments)
- Unpin once transformers latest is fixed (#7088, commented on Mar 3, 2025 • 0 new comments)
- Update Domino for Llama3 (#7084, commented on Mar 5, 2025 • 0 new comments)
- Conditionally quote env vars (#7071, commented on Mar 5, 2025 • 0 new comments)
- Fix: pipeline model with MoE causes error when sending grads (#7055, commented on Mar 5, 2025 • 0 new comments)
- Enable ZeRO set/get APIs for NVMe offload (#7046, commented on Mar 5, 2025 • 0 new comments)
- Enable Python 3.11 and 3.12 tests (#7007, commented on Mar 5, 2025 • 0 new comments)
- Enable torch.autocast with ZeRO (#6993, commented on Mar 6, 2025 • 0 new comments)
- Improve overflow handling in ZeRO (#6976, commented on Mar 4, 2025 • 0 new comments)
- Enabled configurable auto Tensor Parallelism (TP) for the inference of diverse models (#6553, commented on Feb 28, 2025 • 0 new comments)
- Support autoTP with weight-only quantization in the DS inference path (#4750, commented on Mar 5, 2025 • 0 new comments)
- Getting requirements to build wheel: finished with status 'error' (#7043, commented on Mar 5, 2025 • 0 new comments)
- AssertionError: no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3 (#6793, commented on Mar 4, 2025 • 0 new comments)
- DeepSpeed Inference not working on Llama when input has padding and kernel injection is used (#3960, commented on Mar 4, 2025 • 0 new comments)
- [REQUEST] Runnable combination of RTX 5090 GPU + Linux driver version + PyTorch version + DeepSpeed version for LLM finetuning? (#7042, commented on Mar 3, 2025 • 0 new comments)
- [BUG] DeepSpeed does not update the model when using "Qwen/Qwen2.5-3B" but is fine with "Qwen/Qwen2.5-1.5B" (#7077, commented on Mar 3, 2025 • 0 new comments)
- Dynamic/variable batch size support (#1051, commented on Mar 3, 2025 • 0 new comments)
- [BUG] DS ZeRO stage 1 or 2 communication uses reduce-scatter instead of all-reduce (#7059, commented on Mar 1, 2025 • 0 new comments)
- Suspected memory leak during ZeRO-3 training; OOM eventually after several checkpoints (#3582, commented on Mar 1, 2025 • 0 new comments)
- Ascend 910B: AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam' (#7061, commented on Feb 28, 2025 • 0 new comments)
- [BUG] DeepSpeed ZeRO-2 training hangs and times out after a fixed number of steps (#7044, commented on Feb 28, 2025 • 0 new comments)