This is a fork of ExLlamaV2. Its purpose is to run ExLlama entirely on the GPU to gain a bit more inference speed, and to add a transformers-like API so it can be used with [guidance](https://github.com/guidance-ai/guidance) and other third-party libraries.
Original repo: [https://github.com/turboderp/exllamav2](https://github.com/turboderp/exllamav2)
TBD