- Cornell University
- Ithaca, NY
- https://chhzh123.github.io/
- https://orcid.org/0000-0002-6617-0075
Stars
KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA problems
A throughput-oriented high-performance serving framework for LLMs
verl: Volcano Engine Reinforcement Learning for LLMs
Control Logic Synthesis: Drawing the Rest of the OWL
Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
A simple, performant and scalable JAX LLM!
A PyTorch native library for large model training
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
An easy-to-use, scalable, and high-performance RLHF framework (70B+ PPO full tuning & iterative DPO & LoRA & RingAttention & RFT)
Supplemental materials for The ASPLOS 2025 / EuroSys 2025 Contest on Intra-Operator Parallelism for Distributed Deep Learning
HLS implementation of the tracklet pattern recognition modules of the Hybrid tracking chain
GitHub mirror of the triton-lang/triton repo.
Efficient Triton Kernels for LLM Training
A playbook for effectively prompting post-trained LLMs
A bibliography and survey of the papers surrounding o1
Differentiable Combinatorial Scheduling at Scale (ICML'24). Mingju Liu, Yingjie Li, Jiaqi Yin, Zhiru Zhang, Cunxi Yu.
depyf is a tool to help you understand and adapt to the PyTorch compiler, torch.compile.
Quantized attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.