TomMarkQE


A bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.

Python 2,507 239 Updated Mar 5, 2025

Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training

C++ 1,764 236 Updated Mar 5, 2025

nnScaler: Compiling DNN models for Parallel Training

Python 98 13 Updated Feb 14, 2025

Curated collection of papers in machine learning systems

254 14 Updated Feb 28, 2025

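Course notes for the convex optimization course taught by Ling Qing (凌青) at USTC (University of Science and Technology of China)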

TeX 30 3 Updated Jan 19, 2021

A library to analyze PyTorch traces.

Python 341 52 Updated Feb 24, 2025

Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.

Cuda 360 73 Updated Sep 8, 2024

This project is about convolution operator optimization on GPU, including GEMM-based (implicit GEMM) convolution.

C++ 26 1 Updated Dec 27, 2024

This repository contains datasets and baselines for benchmarking Chinese text recognition.

Python 456 52 Updated Dec 2, 2022

Build and Train a GPT-2 from scratch using PyTorch

Jupyter Notebook 14 9 Updated Jul 2, 2024

GPU-accelerated error-bounded lossy compression for scientific data.

C++ 72 28 Updated Feb 23, 2025

Code for the paper "Language Models are Unsupervised Multitask Learners"

Python 23,144 5,609 Updated Aug 14, 2024

LLM training in simple, raw C/CUDA

Cuda 25,939 2,970 Updated Oct 2, 2024

A PyTorch implementation of Sparsely-Gated Mixture of Experts, for massively increasing the parameter count of language models

Python 699 54 Updated Sep 13, 2023

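🔥 🔥 🔥 Self-hosted Docker image acceleration service. Based on the official Docker Registry, it deploys image acceleration and management services for Docker, K8s, Quay, Ghcr, Mcr, Nvcr, and other registries with one click. Supports serverless deployment to Render or Koyeb.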

HTML 2,767 418 Updated Mar 6, 2025

Tile primitives for speedy kernels

Cuda 2,115 123 Updated Mar 6, 2025

Awesome LLM compression research papers and tools.

1,399 89 Updated Mar 5, 2025

Create user-notifications on macOS with swiftDialog

Swift 625 62 Updated Feb 26, 2025

CUDA Matrix Multiplication Optimization

Cuda 168 16 Updated Jul 19, 2024

[TMLR 2024] Efficient Large Language Models: A Survey

1,112 95 Updated Feb 27, 2025

How to optimize some algorithms in CUDA.

Cuda 1,951 173 Updated Mar 5, 2025

CUDA programs for our training session.

C 35 45 Updated Apr 10, 2012
Cuda 13 3 Updated Dec 1, 2023

Google Drive Public File Downloader when Curl/Wget Fails

Python 4,499 361 Updated Aug 12, 2024

A library for efficient similarity search and clustering of dense vectors.

C++ 33,463 3,776 Updated Mar 7, 2025

Parallel selection on GPUs

C++ 15 4 Updated Mar 23, 2021

Sample codes for my CUDA programming book

Cuda 1,656 338 Updated Feb 15, 2025

This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several basic kernel optimizations, including: elementwise, reduce, s…

Cuda 936 149 Updated Jul 29, 2023