
nnScaler: Compiling DNN models for Parallel Training

Python 87 13 Updated Jan 10, 2025

Curated collection of papers in machine learning systems

220 14 Updated Dec 22, 2024

Course notes for Professor Ling Qing's convex optimization course at USTC

TeX 30 3 Updated Jan 19, 2021

A library to analyze PyTorch traces.

Python 324 46 Updated Dec 3, 2024

Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruction.

Cuda 334 69 Updated Sep 8, 2024

This project is about convolution operator optimization on GPU, including GEMM-based (implicit GEMM) convolution.

C++ 25 1 Updated Dec 27, 2024

This repository contains datasets and baselines for benchmarking Chinese text recognition.

Python 447 52 Updated Dec 2, 2022

Build and Train a GPT-2 from scratch using PyTorch

Jupyter Notebook 14 8 Updated Jul 2, 2024

A GPU-accelerated, error-bounded lossy compressor for scientific data.

C++ 69 28 Updated Jan 25, 2025

Code for the paper "Language Models are Unsupervised Multitask Learners"

Python 22,844 5,561 Updated Aug 14, 2024

LLM training in simple, raw C/CUDA

Cuda 25,139 2,873 Updated Oct 2, 2024

A PyTorch implementation of Sparsely-Gated Mixture of Experts, for massively increasing the parameter count of language models

Python 671 51 Updated Sep 13, 2023

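🔥 🔥 🔥 Self-hosted Docker image acceleration service, built on the official Docker Registry. One-click deployment of image acceleration and management services for Docker, K8s, Quay, Ghcr, Mcr, Nvcr, and more. Supports serverless deployment to Render/Koyeb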

HTML 2,504 377 Updated Jan 22, 2025

Tile primitives for speedy kernels

Cuda 1,960 100 Updated Jan 26, 2025

Awesome LLM compression research papers and tools.

1,329 88 Updated Jan 25, 2025

Create user-notifications on macOS with swiftDialog

Swift 606 59 Updated Dec 17, 2024

CUDA Matrix Multiplication Optimization

Cuda 155 14 Updated Jul 19, 2024

[TMLR 2024] Efficient Large Language Models: A Survey

1,078 89 Updated Jan 14, 2025

How to optimize some algorithms in CUDA.

Cuda 1,845 153 Updated Jan 24, 2025

CUDA programs for our training session.

C 35 45 Updated Apr 10, 2012
Cuda 13 3 Updated Dec 1, 2023

Google Drive Public File Downloader when Curl/Wget Fails

Python 4,440 358 Updated Aug 12, 2024

A library for efficient similarity search and clustering of dense vectors.

C++ 32,551 3,718 Updated Jan 24, 2025

Parallel selection on GPUs

C++ 15 4 Updated Mar 23, 2021

Sample codes for my CUDA programming book

Cuda 1,626 333 Updated Jul 27, 2023

This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several basic kernel optimizations, including: elementwise, reduce, s…

Cuda 892 141 Updated Jul 29, 2023

A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology

C++ 931 147 Updated Dec 16, 2024

Source code examples from the Parallel Forall Blog

HTML 1,257 635 Updated Jul 23, 2024