- Cornell University
- Ithaca, NY
- https://chhzh123.github.io/
- https://orcid.org/0000-0002-6617-0075
Stars
KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA problems
A throughput-oriented high-performance serving framework for LLMs
verl: Volcano Engine Reinforcement Learning for LLMs
Control Logic Synthesis: Drawing the Rest of the OWL
Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
A simple, performant and scalable JAX LLM!
A PyTorch native library for large model training
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
An easy-to-use, scalable, and high-performance RLHF framework (70B+ PPO full tuning & iterative DPO & LoRA & RingAttention & RFT)
Supplemental materials for The ASPLOS 2025 / EuroSys 2025 Contest on Intra-Operator Parallelism for Distributed Deep Learning
HLS implementation of the tracklet pattern recognition modules of the Hybrid tracking chain
GitHub mirror of the triton-lang/triton repo.
Efficient Triton Kernels for LLM Training
A playbook for effectively prompting post-trained LLMs
A bibliography and survey of the papers surrounding o1
Differentiable Combinatorial Scheduling at Scale (ICML'24). Mingju Liu, Yingjie Li, Jiaqi Yin, Zhiru Zhang, Cunxi Yu.
depyf is a tool to help you understand and adapt to the PyTorch compiler, torch.compile.
Quantized attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.