Stars
nnScaler: Compiling DNN models for Parallel Training
Curated collection of papers in machine learning systems
A library to analyze PyTorch traces.
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores via the WMMA API and MMA PTX instructions.
This project covers convolution operator optimization on GPUs, including GEMM-based (implicit GEMM) convolution.
This repository contains datasets and baselines for benchmarking Chinese text recognition.
Build and Train a GPT-2 from scratch using PyTorch
A GPU accelerated error-bounded lossy compression for scientific data.
Code for the paper "Language Models are Unsupervised Multitask Learners"
A PyTorch implementation of a Sparsely-Gated Mixture of Experts, for massively increasing the parameter count of language models
🔥 🔥 🔥 Self-hosted Docker image acceleration service based on the official Docker Registry. One-click deployment of image acceleration/management services for Docker, K8s, Quay, Ghcr, Mcr, Nvcr, and more. Supports serverless deployment to Render/Koyeb.
Tile primitives for speedy kernels
Awesome LLM compression research papers and tools.
Create user-notifications on macOS with swiftDialog
CUDA Matrix Multiplication Optimization
[TMLR 2024] Efficient Large Language Models: A Survey
How to optimize some algorithms in CUDA.
Google Drive Public File Downloader when Curl/Wget Fails
A library for efficient similarity search and clustering of dense vectors.
Sample codes for my CUDA programming book
This is a series of GPU optimization topics, introducing in detail how to optimize CUDA kernels. It covers several basic kernel optimizations, including elementwise, reduce, s…
A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology
Source code examples from the Parallel Forall Blog