Minimal LLM Rust API streaming endpoint
A minimalist service to interact with an LLM in streaming mode.
It is designed to run a quantized version of llama2, mistral, or phi-2 on a CPU (CUDA builds are also available).
It is a very simple REST streaming API, using (see the sketch after this list):
- Rust
- Warp
- Candle
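For illustration, here is a minimal sketch of what such a streaming route could look like with Warp. The `/token_stream` path and the port come from the usage examples below; the struct names, the handler, and the stubbed token source are assumptions, not the actual implementation:

```rust
use std::convert::Infallible;

use warp::Filter;

#[derive(serde::Deserialize)]
struct TokenQuery {
    query: String,
}

#[tokio::main]
async fn main() {
    // POST /token_stream with a JSON body: {"query": "..."}
    let token_stream = warp::post()
        .and(warp::path("token_stream"))
        .and(warp::path::end())
        .and(warp::body::json())
        .map(|req: TokenQuery| {
            // Placeholder token source: a real handler would run the Candle
            // model on `req.query` and forward generated tokens as they appear.
            let tokens = vec![
                "This ".to_string(),
                "is ".to_string(),
                format!("a stub answer to: {}", req.query),
            ];
            let chunks =
                futures_util::stream::iter(tokens.into_iter().map(Ok::<_, Infallible>));
            // Stream the chunks back to the client as they are produced.
            warp::reply::Response::new(warp::hyper::Body::wrap_stream(chunks))
        });

    warp::serve(token_stream).run(([127, 0, 0, 1], 3030)).await;
}
```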
The model is selected via a Cargo feature: phi-2 (the default), mistral, or llama.
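A hedged sketch of how such a compile-time selection might be wired up; the stub modules below are hypothetical, only the feature names come from the Makefile targets in this README:

```rust
// Pick the model backend at compile time via Cargo features.
// The modules here are stubs; the real service would wrap Candle models.

#[cfg(feature = "mistral")]
mod backend {
    pub const NAME: &str = "mistral";
}

#[cfg(feature = "llama")]
mod backend {
    pub const NAME: &str = "llama";
}

// phi-2 is the default when no other model feature is selected.
#[cfg(not(any(feature = "mistral", feature = "llama")))]
mod backend {
    pub const NAME: &str = "phi-2";
}

fn main() {
    println!("compiled with the {} backend", backend::NAME);
}
```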
A Makefile facilitates clean, update, build, and run.
Prior to execution, please run:
make clean
and then
make update
To build the service:
With phi-2 (the default), type:
- to build for CPU
make build
- to build using CUDA
make build_cuda
Or with mistral, type:
- to build for CPU
make FEATURE=mistral build
- to build using CUDA
make FEATURE=mistral build_cuda
Or with llama, type:
- to build for CPU
make FEATURE=llama build
- to build using CUDA
make FEATURE=llama build_cuda
Then, to run it:
make run
Once launched, to use the API, you can:
- From a Linux terminal, use curl:
- curl -X POST -H "Content-Type: application/json" --no-buffer 'http://127.0.0.1:3030/token_stream' -d '{"query":"Where is Paris located?"}'
- From a browser, a very simple UI is available at:
- http://127.0.0.1:3030/
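If you prefer consuming the stream from Rust rather than curl, a minimal client sketch using the reqwest crate (with its "json" and "stream" features) could look like the following; only the endpoint and the JSON shape come from the curl example above, the rest is an assumption:

```rust
use std::io::Write;

use futures_util::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::Client::new();
    // Same request as the curl example: POST a JSON body with a "query" field.
    let response = client
        .post("http://127.0.0.1:3030/token_stream")
        .json(&serde_json::json!({ "query": "Where is Paris located?" }))
        .send()
        .await?;

    // Print each streamed chunk as soon as it arrives.
    let mut chunks = response.bytes_stream();
    while let Some(chunk) = chunks.next().await {
        print!("{}", String::from_utf8_lossy(&chunk?));
        std::io::stdout().flush()?;
    }
    println!();
    Ok(())
}
```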
Provided these models are compatible with phi-2, mistral, or llama, you can specify your own Hugging Face repo
and quantized file, as well as a custom tokenizer repo (the model file and the tokenizer are usually hosted in different repos).
You can type:
make FEATURE=mistral build
make run MODEL_REPO="Your quantized model repo" MODEL_FILE="Your quantized gguf file" TOKENIZER_REPO="Your tokenizer repo"
This is useful should you want to run a fine-tuned version of phi-2, mistral, or llama.
For example, here is a phi-2 model fine-tuned using the guidelines described in this tutorial:
https://youtu.be/J0RbOtLrJhQ?si=2lcEAzxX-ToeMPWR
make run MODEL_REPO="fcn94/phi-2-finetuned-med-text" MODEL_FILE="model-v2-q4k.gguf" TOKENIZER_REPO="fcn94/phi-2-finetuned-med-text"
You can test the following prompt with the standard phi-2 model and with this fine-tuned model:
curl -X POST -H "Content-Type: application/json" --no-buffer 'http://127.0.0.1:3030/token_stream' -d '{"query":"I have a headache with low fever. What should I do ?"}'
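MODEL_REPO, MODEL_FILE, and TOKENIZER_REPO presumably end up being resolved against the Hugging Face Hub. As a hedged illustration only (not this repo's actual code), fetching such files with the hf-hub crate could look like the sketch below; the tokenizer.json filename is an assumption:

```rust
use hf_hub::api::sync::Api;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Values passed by the Makefile as MODEL_REPO, MODEL_FILE and TOKENIZER_REPO.
    let model_repo = std::env::var("MODEL_REPO")?;
    let model_file = std::env::var("MODEL_FILE")?;
    let tokenizer_repo = std::env::var("TOKENIZER_REPO")?;

    let api = Api::new()?;
    // Download the quantized gguf weights (or reuse the local Hub cache).
    let weights_path = api.model(model_repo).get(&model_file)?;
    // The tokenizer usually lives in a separate repo; tokenizer.json is assumed here.
    let tokenizer_path = api.model(tokenizer_repo).get("tokenizer.json")?;

    println!("weights:   {}", weights_path.display());
    println!("tokenizer: {}", tokenizer_path.display());
    Ok(())
}
```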
For this repo, the phi-2 and mistral features use gguf files generated by candle's 'tensor-tools'.
The majority of open-source gguf files on Hugging Face follow the llama formalism.
If you are using such a file, here is a suggested modus operandi.
You can type:
make FEATURE=llama build
and
make run MODEL_REPO="Your quantized model repo" MODEL_FILE="Your quantized gguf file" TOKENIZER_REPO="Your tokenizer repo"
For example, using a popular repo:
make run MODEL_REPO="TheBloke/Mistral-7B-Instruct-v0.2-GGUF" MODEL_FILE="mistral-7b-instruct-v0.2.Q4_K_M.gguf" TOKENIZER_REPO="mistralai/Mistral-7B-Instruct-v0.2"
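For reference, loading such a llama-formalism gguf file with candle usually follows the pattern below. This is a sketch based on candle's quantized examples, not this repo's code, and the exact `from_gguf` signature differs between candle versions:

```rust
use candle_core::quantized::gguf_file;
use candle_core::Device;
use candle_transformers::models::quantized_llama::ModelWeights;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder path: point this at the gguf file downloaded from the Hub.
    let model_path = "mistral-7b-instruct-v0.2.Q4_K_M.gguf";
    let device = Device::Cpu;

    let mut file = std::fs::File::open(model_path)?;
    // Parse the gguf metadata, then materialize the quantized weights.
    let content = gguf_file::Content::read(&mut file)?;
    // Note: older candle releases take only (content, &mut file), without a device.
    let model = ModelWeights::from_gguf(content, &mut file, &device)?;

    // `model` is now ready to be driven by a token-generation loop.
    let _ = model;
    Ok(())
}
```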
Four context prompts are defined in ./config/prompt_config.toml (a sketch of reading this file follows the commands below).
You can type:
for the default (general):
make run
or
for classifier:
make run CONTEXT_TYPE=classifier
or
for sql:
make run CONTEXT_TYPE=sql
or
for math:
make run CONTEXT_TYPE=math
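As an illustration of how these contexts might be consumed, here is a hedged sketch that reads the file with serde and the toml crate; the flat layout and field names are assumptions derived from the CONTEXT_TYPE values above, not the repo's actual schema:

```rust
use serde::Deserialize;

// One context prompt per supported CONTEXT_TYPE value.
// Field names are assumptions based on the Makefile commands above.
#[derive(Debug, Deserialize)]
struct PromptConfig {
    general: String,
    classifier: String,
    sql: String,
    math: String,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let raw = std::fs::read_to_string("./config/prompt_config.toml")?;
    let config: PromptConfig = toml::from_str(&raw)?;

    // Pick the context prompt selected via CONTEXT_TYPE (default: general).
    let context_type = std::env::var("CONTEXT_TYPE").unwrap_or_else(|_| "general".into());
    let prompt = match context_type.as_str() {
        "classifier" => &config.classifier,
        "sql" => &config.sql,
        "math" => &config.math,
        _ => &config.general,
    };
    println!("context prompt: {prompt}");
    Ok(())
}
```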
- This is heavily inspired by one of the examples from the candle repository: https://github.com/huggingface/candle/tree/main/candle-examples/examples/mistral