LLaMA inference using hand-written WebGPU code.
TokenHawk uses WebGPU to perform LLaMA inference. All code is written by hand, and there are two files:
- th.cpp - Contains GPU shaders to support running LLMs.
- th-llama.cpp - GPU implementation of LLaMA.
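To give a feel for what the hand-written WebGPU side looks like, here is a minimal sketch of compiling and dispatching a WGSL compute kernel with Dawn's C++ bindings. It is illustrative only: the function names, shader source, and bind-group setup are placeholders, not the actual TokenHawk kernels.

```cpp
// Illustrative sketch only: the general shape of compiling and dispatching a
// WGSL compute shader through Dawn's C++ WebGPU bindings. Names are placeholders.
#include <cstdint>
#include <webgpu/webgpu_cpp.h>

wgpu::ComputePipeline CreatePipeline(const wgpu::Device& device,
                                     const char* wgslSource,
                                     const char* entryPoint) {
    // Wrap the WGSL text in a chained shader module descriptor.
    wgpu::ShaderModuleWGSLDescriptor wgslDesc{};
    wgslDesc.code = wgslSource;

    wgpu::ShaderModuleDescriptor smDesc{};
    smDesc.nextInChain = &wgslDesc;
    wgpu::ShaderModule module = device.CreateShaderModule(&smDesc);

    wgpu::ComputePipelineDescriptor desc{};
    desc.compute.module = module;
    desc.compute.entryPoint = entryPoint;
    return device.CreateComputePipeline(&desc);
}

void Dispatch(const wgpu::Device& device, const wgpu::ComputePipeline& pipeline,
              const wgpu::BindGroup& bindGroup, uint32_t numWorkgroups) {
    // Record a compute pass and submit it to the device queue.
    wgpu::CommandEncoder encoder = device.CreateCommandEncoder();
    wgpu::ComputePassEncoder pass = encoder.BeginComputePass();
    pass.SetPipeline(pipeline);
    pass.SetBindGroup(0, bindGroup);
    pass.DispatchWorkgroups(numWorkgroups, 1, 1);
    pass.End();
    wgpu::CommandBuffer commands = encoder.Finish();
    device.GetQueue().Submit(1, &commands);
}
```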
The command line version of TokenHawk is native C++ code. It statically links Google's Dawn WebGPU library, which makes profiling and debugging simpler.
The Web UI version uses Emscripten to cross-compile these two files into WASM.
llama.cpp is currently used to load models and perform tokenization.
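As a rough sketch of how that hand-off might look, the snippet below uses the llama.cpp C API as it existed around this time (llama_init_from_file, llama_tokenize) to turn a prompt into tokens. The function and its error handling are illustrative assumptions, not TokenHawk's actual loader code.

```cpp
// Illustrative sketch only: loading a model and tokenizing a prompt with the
// llama.cpp C API of this era. The GPU side of TokenHawk is out of scope here.
#include <string>
#include <vector>
#include "llama.h"

std::vector<llama_token> TokenizePrompt(const char* modelPath, const std::string& prompt) {
    llama_context_params params = llama_context_default_params();
    llama_context* ctx = llama_init_from_file(modelPath, params);

    // Worst case: one token per byte of prompt, plus a BOS token.
    std::vector<llama_token> tokens(prompt.size() + 1);
    int n = llama_tokenize(ctx, prompt.c_str(), tokens.data(),
                           (int)tokens.size(), /*add_bos=*/true);
    tokens.resize(n > 0 ? n : 0);

    llama_free(ctx);
    return tokens;
}
```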
As of May 13, 2023, only 7B LLaMA models are supported. Wider model support should evolve quickly.
See the CLI directory for build and usage instructions.
Use the command line version when performance-tuning the WebGPU code. Example usage:
$ ./th -m models/llama-7B/ggml-model-f16.bin "<prompt goes here>"
See the Web directory for build and usage instructions.
For simple and quick access, use the Web UI. You can try it out online here, or host it locally:
python web/serve.py
TokenHawk is fast. On a 4090 running 7B-f16, TokenHawk clocks in at 30 tk/s, while CUDA reaches 50 tk/s, and there is still room for improvement. We'll focus on the following performance improvements in the coming weeks:
- Profile and optimize matrix multiplication.
- Optimize single token generation.
- Add a two-stage parallel reduction step (see the sketch after this list).
- Optimize warp and wavefront sizes for Nvidia and AMD.
- Per-GPU hyper-parameter optimization.
- Investigate the feasibility of GPU-only operation, with no round-trips to the CPU.
- Investigate native f16 support. f16 is currently emulated in shaders.
- Store intermediate GPU buffers in f16, specifically the context and working buffers.
- Add 4-bit quantization.
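To illustrate the two-stage reduction item above, here is a minimal WGSL sketch of the first stage, held in a C++ string the way a shader can be embedded in native code. The buffer names, bindings, and workgroup size are placeholders rather than TokenHawk's actual kernels; the second stage would simply dispatch the same kernel again over the partials buffer until a single value remains.

```cpp
// Illustrative sketch only: stage one of a two-stage parallel reduction.
// Each 256-invocation workgroup reduces its block of the input to one
// partial sum; a follow-up dispatch reduces the partials buffer.
static const char* kReduceWGSL = R"(
@group(0) @binding(0) var<storage, read> input : array<f32>;
@group(0) @binding(1) var<storage, read_write> partials : array<f32>;

var<workgroup> scratch : array<f32, 256>;

@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) gid : vec3<u32>,
        @builtin(local_invocation_id) lid : vec3<u32>,
        @builtin(workgroup_id) wid : vec3<u32>) {
    // Load one element per invocation; out-of-range lanes contribute 0.
    var v : f32 = 0.0;
    if (gid.x < arrayLength(&input)) {
        v = input[gid.x];
    }
    scratch[lid.x] = v;
    workgroupBarrier();

    // Tree reduction in workgroup shared memory.
    var stride : u32 = 128u;
    loop {
        if (stride == 0u) { break; }
        if (lid.x < stride) {
            scratch[lid.x] = scratch[lid.x] + scratch[lid.x + stride];
        }
        workgroupBarrier();
        stride = stride / 2u;
    }

    // One partial sum per workgroup; stage two reduces the partials.
    if (lid.x == 0u) {
        partials[wid.x] = scratch[0u];
    }
}
)";
```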
More data to come.
While TokenHawk focuses on hand-written, hand-tuned GPU code, here are compiler projects that aim to automatically generate GPU code for models.
Thanks to llama.cpp for GGML, tokenization, and its file format, and to Google's Dawn for the WebGPU implementation.