Sparse Mixture of Experts Layer Implementation using PyTorch

Sparse MoE Block

This is an implementation of a Mixture of Experts (MoE) block in PyTorch, focusing on efficient, block-sparse routing of tokens to experts.

Overview

This repository contains an efficient implementation of a Mixture of Experts (MoE) block in PyTorch. The implementation formulates MoE operations as block-sparse operations so that imbalanced assignments of tokens to experts are handled efficiently. Unlike standard MoE implementations, which either drop tokens at the cost of reduced quality or pad expert inputs at the cost of wasted computation, this implementation avoids both inefficiencies, making it efficient in both memory and compute.
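
To make the routing idea concrete, the minimal sketch below shows how tokens can be dispatched to their top-k experts and gathered back by index, so that an unevenly loaded expert neither drops tokens nor requires padding. This is an illustrative sketch, not the repository's actual code; the names experts, router, tokens, and the shapes used are assumptions for the example.

import torch
import torch.nn as nn

hidden_dim, num_experts, top_k = 16, 4, 2
tokens = torch.randn(32, hidden_dim)            # flattened tokens: (batch * seq_len, hidden_dim) -- assumed shape
experts = nn.ModuleList(nn.Linear(hidden_dim, hidden_dim) for _ in range(num_experts))
router = nn.Linear(hidden_dim, num_experts)

# Pick the top-k experts per token and renormalise their routing weights
weights, selected = torch.topk(router(tokens).softmax(dim=-1), top_k, dim=-1)
weights = weights / weights.sum(dim=-1, keepdim=True)

output = torch.zeros_like(tokens)
for e, expert in enumerate(experts):
    token_idx, slot = torch.where(selected == e)   # which tokens (and which top-k slot) chose expert e
    if token_idx.numel() == 0:
        continue                                   # an unused expert costs nothing
    expert_out = expert(tokens[token_idx]) * weights[token_idx, slot].unsqueeze(-1)
    output.index_add_(0, token_idx, expert_out)    # scatter weighted outputs back by index, no padding needed

Because each expert only sees the tokens routed to it, the work per expert scales with its actual load rather than with a fixed padded capacity.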

Efficiency Features

  • Block-Sparse Operations: The implementation performs sparse routing of tokens to experts, ensuring that only selected experts are computed for each token. This significantly reduces the computational burden compared to dense MoE implementations.

  • Expert Assignment Without Padding: Tokens are assigned to the top-k experts without the need for additional padding, avoiding unnecessary memory usage and improving training efficiency.

  • Jitter Noise for Regularization: Jitter noise is introduced during training to add small perturbations to the input, helping prevent overfitting and improving the generalization capability of the model.
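
For reference, jitter noise of this kind is typically applied as a small multiplicative perturbation of the input during training only. The snippet below is a hedged sketch of that idea; the apply_jitter helper is hypothetical, and the exact formula used in this repository may differ.

import torch

def apply_jitter(hidden_states: torch.Tensor, jitter_noise: float, training: bool) -> torch.Tensor:
    # During training, scale each element by a factor drawn uniformly from
    # [1 - jitter_noise, 1 + jitter_noise]; at evaluation time the input is left untouched.
    if training and jitter_noise > 0:
        noise = torch.empty_like(hidden_states).uniform_(1.0 - jitter_noise, 1.0 + jitter_noise)
        hidden_states = hidden_states * noise
    return hidden_states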

Installation

To use the SparseMoeBlock, clone this repository and import the class into your project:

git clone https://github.com/Kowsher/SparseMoE-Layer.git
cd SparseMoE-Layer

Usage

Here's an example of how to use the SparseMoeBlock in your PyTorch model:

import torch
from moe import SparseMoeBlock

# Example model using SparseMoeBlock
class ExampleModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Route each token to the top 2 of 4 experts, with jitter noise applied during training
        self.moe_block = SparseMoeBlock(num_experts=4, top_k=2, router_jitter_noise=0.1)

    def forward(self, x):
        return self.moe_block(x)

# Initialize the model and run a forward pass
model = ExampleModel()
input_data = torch.randn(10, 256)  # example input: 10 vectors of dimension 256
output = model(input_data)
print(output)

Credits

This implementation is inspired by the concepts and code used in Mixtral-style architectures, adapted into a standalone, efficient block-sparse MoE implementation in PyTorch.
