Commit 2e0672e: Update TODO

turboderp committed Jun 18, 2023
1 parent 3160ae2, commit 2e0672e

1 changed file: doc/TODO.md (4 additions & 2 deletions)

@@ -40,20 +40,22 @@
 - [x] ~~Build attention mask in CUDA rather than PyTorch~~
 - [x] ~~Disable attention mask when it isn't needed~~ (not possible with SDP)
 - [x] Figure out why inference appears to be CPU-bound (kernel launch overhead)
-- [ ] Reduce no. kernel launches to minimum (tail launch, fusion etc.)
+- [x] Reduce no. kernel launches to minimum (tail launch, fusion etc.)
 - [x] Measure PyTorch module overhead (negligible in eval mode)
 - [x] Examine if scaled_dot_product_attention is actually the best attention method for single tokens (it's not)
 - [ ] Implement attention in CUDA
 - [x] Rewrite at least the quantized matmul kernel. Should be a bunch of special cases to consider
 - [x] Experiment with concurrent streams where possible (fused MLP and QKV proj.)
+- [ ] Faster low-rank matmul to speed up LoRAs
 
 ## Generation
 
 - [x] Memory-efficient beam search implementation
 - [ ] Optimized beam search
 - [ ] Multi-token censoring/de-censoring
 - [ ] Multi-token repetition penalties
-- [ ] (Multi) LoRA support
+- [x] (Multi) LoRA support
+- [ ] Allow stackable LoRAs
 - [x] Guided generation (chat with multiple bots at once, etc.)
 - [ ] Multiple chat modes with prompt templates (instruct, etc.)
 - [ ] Batched generation
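The first change in this hunk marks kernel-launch reduction as done. For reference, a minimal sketch of one generic way to amortize per-kernel launch overhead in PyTorch, CUDA graph capture and replay; this is a stand-in illustration using a plain Linear layer, not ExLlama's actual tail-launch/fusion kernels:

```python
import torch

# Illustrative only: capture a fixed forward pass once, then replay it with a
# single launch. The commit's actual fix uses tail launches and kernel fusion
# inside ExLlama's custom CUDA kernels; the model below is a stand-in.
model = torch.nn.Linear(4096, 4096, device="cuda").eval()
static_x = torch.randn(1, 4096, device="cuda")

# Warm up on a side stream before capture, as the PyTorch CUDA-graphs docs advise.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_x)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.no_grad(), torch.cuda.graph(g):
    static_y = model(static_x)  # every kernel launch is recorded once here

static_x.copy_(torch.randn(1, 4096, device="cuda"))  # new input, same buffer
g.replay()  # replays the whole captured sequence; result lands in static_y
```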
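The remaining changes mark "(Multi) LoRA support" as done and add two follow-up items, including a faster low-rank matmul. A rough sketch of the arithmetic that item refers to, with hypothetical shapes and names rather than ExLlama's real code:

```python
import torch

# Hypothetical shapes and names for illustration; ExLlama's LoRA code differs.
hidden, rank = 4096, 16
W = torch.randn(hidden, hidden, device="cuda", dtype=torch.float16)  # frozen base weight
A = torch.randn(rank, hidden, device="cuda", dtype=torch.float16)    # LoRA down-projection
B = torch.randn(hidden, rank, device="cuda", dtype=torch.float16)    # LoRA up-projection
scale = 2.0  # conventionally alpha / rank

x = torch.randn(1, hidden, device="cuda", dtype=torch.float16)

# Base projection plus the low-rank update. Computing (x @ A.T) first keeps the
# intermediate at width `rank` instead of materializing a hidden-by-hidden
# delta, so the adapter costs two skinny matmuls per token; making those
# skinny matmuls fast is what the "faster low-rank matmul" item is about.
y = x @ W.T + scale * ((x @ A.T) @ B.T)
```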
