View chhzh123's full-sized avatar

Highlights

  • Pro

Organizations

@cornell-zhang

Python 141 14 Updated Feb 12, 2025

KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA problems

Python 199 12 Updated Feb 24, 2025

A throughput-oriented high-performance serving framework for LLMs

Cuda 741 29 Updated Sep 21, 2024

verl: Volcano Engine Reinforcement Learning for LLMs

Python 3,791 338 Updated Feb 25, 2025

Control Logic Synthesis: Drawing the Rest of the OWL

Racket 9 1 Updated Jun 17, 2024

Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels

C++ 330 23 Updated Feb 25, 2025

A simple, performant and scalable Jax LLM!

Python 1,630 322 Updated Feb 25, 2025
Python 10 12 Updated Feb 24, 2025

A PyTorch native library for large model training

Python 3,349 285 Updated Feb 25, 2025

Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.

Python 90 9 Updated Feb 21, 2025
Python 27 10 Updated Feb 5, 2025

An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & RingAttention & RFT)

Python 5,032 499 Updated Feb 22, 2025
Python 274 279 Updated Feb 13, 2025

Supplemental materials for The ASPLOS 2025 / EuroSys 2025 Contest on Intra-Operator Parallelism for Distributed Deep Learning

C++ 23 1 Updated Dec 12, 2024

HLS implementation of the tracklet pattern reco modules of the Hybrid tracking chain

VHDL 16 24 Updated Feb 25, 2025

GitHub mirror of the triton-lang/triton repo.

MLIR 25 7 Updated Feb 25, 2025

Efficient Triton Kernels for LLM Training

Python 4,488 272 Updated Feb 25, 2025

A playbook for effectively prompting post-trained LLMs

833 34 Updated Jan 21, 2025

A bibliography and survey of the papers surrounding o1

TeX 1,167 50 Updated Nov 16, 2024

GPU Performance Advisor

Python 64 8 Updated Jul 25, 2022
C++ 473 72 Updated Dec 10, 2024

Differentiable Combinatorial Scheduling at Scale (ICML'24). Mingju Liu, Yingjie Li, Jiaqi Yin, Zhiru Zhang, Cunxi Yu.

Python 19 1 Updated Oct 31, 2024

depyf is a tool to help you understand and adapt to PyTorch compiler torch.compile.

Python 589 17 Updated Dec 7, 2024

Low-bit LLM inference on CPU with lookup table

C++ 686 53 Updated Jan 9, 2025

Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.

Cuda 977 61 Updated Feb 15, 2025

Sum example for Xilinx VCK5000

C++ 8 3 Updated Jul 29, 2022