This repository contains an efficient implementation of a Mixture of Experts (MoE) block in PyTorch. The implementation formulates MoE as block-sparse operations so that imbalanced assignments of tokens to experts are handled efficiently. Unlike standard MoE implementations, which either drop tokens (reducing model quality) or pad expert buffers (wasting computation), this implementation avoids both inefficiencies, making it efficient in both memory and compute.
- Block-Sparse Operations: Tokens are routed sparsely to experts, so only the selected experts compute on each token. This significantly reduces the computational burden compared to dense MoE implementations (see the routing sketch after this list).
- Expert Assignment Without Padding: Tokens are assigned to their top-k experts without additional padding, avoiding unnecessary memory usage and improving training efficiency.
- Jitter Noise for Regularization: During training, jitter noise adds small perturbations to the input, helping prevent overfitting and improving the generalization capability of the model (a short sketch also follows this list).
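For intuition, here is a minimal sketch of the routing idea, not the code in this repository: router probabilities are computed per token, the top-k experts are selected, and the tokens assigned to each expert are gathered, processed, and scattered back weighted by their routing probabilities, with no capacity padding and no dropped tokens. All names below (`experts`, `hidden_dim`, the linear experts themselves) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch only; names and shapes are assumptions, not this repo's API.
num_experts, top_k, hidden_dim = 4, 2, 256
tokens = torch.randn(10, hidden_dim)  # 10 tokens with hidden size 256
experts = torch.nn.ModuleList(
    [torch.nn.Linear(hidden_dim, hidden_dim) for _ in range(num_experts)]
)

router_logits = torch.randn(10, num_experts)       # stand-in for a learned router
probs = F.softmax(router_logits, dim=-1)
topk_probs, topk_idx = probs.topk(top_k, dim=-1)   # (10, top_k)
topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

output = torch.zeros_like(tokens)
# expert_mask[e, k, t] == 1 if token t picked expert e as its k-th choice
expert_mask = F.one_hot(topk_idx, num_experts).permute(2, 1, 0)
for e in range(num_experts):
    k_idx, tok_idx = torch.where(expert_mask[e])   # tokens routed to expert e
    if tok_idx.numel() == 0:
        continue                                   # idle expert: skip, no padding needed
    expert_out = experts[e](tokens[tok_idx])
    # Scatter the expert output back, weighted by the router probability
    output.index_add_(0, tok_idx, expert_out * topk_probs[tok_idx, k_idx, None])
```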
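Jitter noise can be sketched in a few lines as well; a common form (used by Switch-Transformer-style routers) multiplies the input by uniform noise in `[1 - eps, 1 + eps]` during training. Whether this repository uses exactly this form is an assumption, and `apply_jitter` below is a hypothetical helper, not part of the library.

```python
import torch

def apply_jitter(x: torch.Tensor, eps: float = 0.1, training: bool = True) -> torch.Tensor:
    # Hypothetical helper: scale each element by a random factor in [1 - eps, 1 + eps].
    # Applied only during training; at inference the input passes through unchanged.
    if training and eps > 0:
        x = x * torch.empty_like(x).uniform_(1.0 - eps, 1.0 + eps)
    return x

x = torch.randn(10, 256)
x_jittered = apply_jitter(x, eps=0.1)
```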
To use the `SparseMoeBlock`, clone this repository and import the class into your project:

```bash
git clone https://github.com/Kowsher/SparseMoE-Layer.git
cd SparseMoE-Layer
```
Here's an example of how to use the `SparseMoeBlock` in your PyTorch model:
```python
import torch
from moe import SparseMoeBlock

# Example model using SparseMoeBlock
class ExampleModel(torch.nn.Module):
    def __init__(self):
        super(ExampleModel, self).__init__()
        self.moe_block = SparseMoeBlock(num_experts=4, top_k=2, router_jitter_noise=0.1)

    def forward(self, x):
        return self.moe_block(x)

# Initialize model
model = ExampleModel()
input_data = torch.randn(10, 256)  # Example input data
output = model(input_data)
print(output)
```
This implementation is inspired by the concepts and code used in efficient Mixtral-style architectures, adapting them into an efficient block-sparse MoE implementation in PyTorch.