Deep learning frameworks form the foundation on which millions of models (LLMs, multimodal and autoregressive models) are compiled and built. Many of these frameworks require sophisticated optimization to make models train and infer faster on constrained hardware. The intrinsic kernels that are part of these frameworks (such as PyTorch) leverage adaptive features to push performance benchmarks in supercomputing and federated deep learning. This session is a glimpse of kernel-level, intermediate framework-level, and high-level model optimization techniques that help people run large models such as GPTs on constrained environments and clusters.
Most of the session revolves around different model optimization strategies and how the PyTorch framework can make training and fine-tuning efficient. This involves features such as ATen graph capture, lowering, and composite graph compilation (by Inductor), followed by device-specific IR that the device compiler can optimize further for model performance.
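As a hedged illustration of this pipeline (not the session's exact material), the sketch below runs a toy model through torch.compile with the default Inductor backend; the model and tensor shapes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

# Toy model; sizes are arbitrary for illustration.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

# torch.compile captures an ATen/FX graph, lowers it, and hands it to the
# Inductor backend, which emits device-specific code (e.g. Triton kernels on GPU).
compiled_model = torch.compile(model, backend="inductor")

x = torch.randn(32, 128)
out = compiled_model(x)  # the first call triggers graph capture and compilation
print(out.shape)
```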
To extend different parallelisms over a dedicated set of hardware combinations (CPU-GPU, GPU-GPU, multi-XPU, multi-TPU, MPS), the distributed backend of PyTorch comes into the picture. It enables scaling sharded models, data, and parameters up and out, efficiently distributing gradients, checkpoints, and activations across devices.
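A minimal sketch of initializing the distributed backend and averaging gradients with an all-reduce is given below; the backend choice and environment variables are assumptions for illustration, and a real run would typically be launched with torchrun.

```python
import os
import torch
import torch.distributed as dist

def setup():
    # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT for us;
    # "nccl" assumes GPUs are available, use "gloo" for CPU-only runs.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

def average_gradients(model):
    # Sum gradients across ranks, then divide by the world size.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
```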
In data-parallel training, the dataset is split into several shards, and each shard is allocated to a device. This is equivalent to parallelizing the training process along the batch dimension.
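A hedged sketch of data parallelism with DistributedDataParallel and a DistributedSampler follows; the dataset and model are placeholders, and the process group is assumed to be initialized as in the previous sketch.

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Placeholder dataset and model for illustration.
dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
model = nn.Linear(128, 10).cuda()

# DistributedSampler gives each rank a disjoint shard of the data,
# i.e. parallelism along the batch dimension.
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

# DDP keeps a model replica per rank and all-reduces gradients during backward().
ddp_model = DDP(model)
```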
Model parallelism involves sharding model blocks (rather than separate tensor lists) uniformly across devices.
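A minimal, hand-rolled sketch of this idea is shown below, placing two blocks of a toy model on two GPUs and moving activations between them; the device ids and layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class TwoDeviceModel(nn.Module):
    """Toy model-parallel module: block1 lives on cuda:0, block2 on cuda:1."""
    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(128, 256), nn.ReLU()).to("cuda:0")
        self.block2 = nn.Linear(256, 10).to("cuda:1")

    def forward(self, x):
        x = self.block1(x.to("cuda:0"))
        # Activations are copied across devices between the blocks.
        return self.block2(x.to("cuda:1"))

model = TwoDeviceModel()
out = model(torch.randn(32, 128))
```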
Pipeline parallelism splits the model layer by layer into several chunks, and each chunk is given to a device. The caveat is that a single optimizer.step() forces forward passes (through increasing pipeline stages) and backward passes (through decreasing pipeline stages) to run in an interleaved manner.
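The sketch below illustrates the idea with a naive GPipe-style loop: the model is split into two stages on two GPUs and a batch is fed through as micro-batches. This is a conceptual sketch (stage sizes, micro-batch count, and devices are assumptions), not PyTorch's pipelining API.

```python
import torch
import torch.nn as nn

# Two pipeline stages on two devices; sizes are illustrative.
stage0 = nn.Sequential(nn.Linear(128, 256), nn.ReLU()).to("cuda:0")
stage1 = nn.Linear(256, 10).to("cuda:1")

loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(list(stage0.parameters()) + list(stage1.parameters()), lr=0.1)

x = torch.randn(64, 128)
y = torch.randint(0, 10, (64,))

# Naive pipelining: split the batch into micro-batches; a real pipeline schedule
# would run different micro-batches on different stages concurrently, while a
# single optimizer.step() accumulates gradients from all of them.
opt.zero_grad()
for xb, yb in zip(x.chunk(4), y.chunk(4)):
    act = stage0(xb.to("cuda:0"))
    out = stage1(act.to("cuda:1"))
    loss_fn(out, yb.to("cuda:1")).backward()
opt.step()
```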
ZeRO leverages the aggregate computation and memory resources of data parallelism to reduce the memory and compute requirements of each device (GPU) used for model training. ZeRO reduces the memory consumption of each GPU by partitioning the various model training states (weights, gradients, and optimizer states) across the available devices (GPUs and CPUs) in the distributed training hardware.
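PyTorch's FullyShardedDataParallel offers a ZeRO-style sharding of these states; the hedged sketch below wraps a toy model with FSDP. The model, sizes, and process-group setup are assumptions, and a real run would be launched with torchrun after initializing the process group.

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes the process group was already initialized (e.g. via torchrun + init_process_group).
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024)).cuda()

# FSDP shards parameters, gradients, and optimizer state across ranks,
# gathering full parameters only around each unit's forward/backward pass.
sharded_model = FSDP(model)
optimizer = torch.optim.AdamW(sharded_model.parameters(), lr=1e-4)
```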
Triton is a deep learning compiler and language created specifically to abstract away low-level IR and optimize kernels that would otherwise be difficult to optimize in CUDA.
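As a hedged example of what such a kernel looks like, the canonical element-wise add in Triton is sketched below; the block size and tensor sizes are arbitrary choices.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against out-of-bounds accesses
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
```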