Hugging Face model serving with FastAPI
Install from source:
git clone https://github.com/haoxian-lab/hf-serve.git
cd hf-serve
pip install .
Or run the prebuilt Docker image:
docker pull sharockys/hf-serve
docker run -p 8000:8000 sharockys/hf-serve
Send a test request:
curl -X POST "http://localhost:8000/" -H "accept: application/json" -H "Content-Type: application/json" -d "{\"data\":\"I love you\"}"
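The same request can be sent from Python. This is a minimal sketch using only the standard library, assuming the server from the Docker example is listening on http://localhost:8000/ and accepts a JSON body with a `data` field, exactly as the curl call does. The helper names `build_request` and `predict` are hypothetical, and the shape of the response depends on the configured task.

```python
import json
import urllib.request

# Assumption: the service runs locally on port 8000, as in the docker run example.
DEFAULT_URL = "http://localhost:8000/"


def build_request(text: str, url: str = DEFAULT_URL) -> urllib.request.Request:
    """Build the same POST request as the curl example (hypothetical helper)."""
    body = json.dumps({"data": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def predict(text: str, url: str = DEFAULT_URL, timeout: float = 30.0):
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(build_request(text, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the container running, `predict("I love you")` returns whatever JSON the service produces for the configured task; `build_request` alone is enough to inspect the request without a server.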
Configuration (environment variables):
HF_SERVE_MODEL_NAME: model name in the Hugging Face model hub
HF_SERVE_TASK: task name, one of text-classification, feature-extraction
HF_SERVE_USE_GPU: whether to use a GPU, default False
HF_SERVE_DEVICE: device name, default cpu. On macOS use mps, for NVIDIA GPUs use cuda. Chosen automatically if not specified.
HF_SERVE_MODEL_CACHE_DIR: model cache directory, default /tmp/hf-serve
HF_SERVE_CLEAR_MODEL_CACHE_ON_SHUTDOWN: whether to clear the model cache on shutdown, default False
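The variables above can be read into a typed settings object. The sketch below is an illustration under stated assumptions, not the project's actual settings code: `ServeSettings`, `load_settings`, `parse_bool` and `resolve_device` are hypothetical names, HF_SERVE_MODEL_NAME and HF_SERVE_TASK are treated as required because no default is listed, and the device rule (explicit HF_SERVE_DEVICE first, then cuda or mps only when HF_SERVE_USE_GPU is true, otherwise cpu) is one possible reading of "default cpu" combined with "chosen automatically".

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Tasks listed in the README.
SUPPORTED_TASKS = ("text-classification", "feature-extraction")


def parse_bool(value: str) -> bool:
    """Treat "1", "true", "yes" and "on" (any case) as True, everything else as False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ServeSettings:
    model_name: str
    task: str
    use_gpu: bool
    device: Optional[str]  # None means "choose automatically"
    model_cache_dir: str
    clear_model_cache_on_shutdown: bool


def load_settings(env: Mapping[str, str] = os.environ) -> ServeSettings:
    """Read HF_SERVE_* variables, applying the defaults documented above."""
    task = env["HF_SERVE_TASK"]  # assumed required: no default is documented
    if task not in SUPPORTED_TASKS:
        raise ValueError(f"unsupported task {task!r}, expected one of {SUPPORTED_TASKS}")
    return ServeSettings(
        model_name=env["HF_SERVE_MODEL_NAME"],  # assumed required as well
        task=task,
        use_gpu=parse_bool(env.get("HF_SERVE_USE_GPU", "false")),
        device=env.get("HF_SERVE_DEVICE") or None,
        model_cache_dir=env.get("HF_SERVE_MODEL_CACHE_DIR", "/tmp/hf-serve"),
        clear_model_cache_on_shutdown=parse_bool(
            env.get("HF_SERVE_CLEAR_MODEL_CACHE_ON_SHUTDOWN", "false")
        ),
    )


def resolve_device(settings: ServeSettings, cuda_available: bool, mps_available: bool) -> str:
    """Hypothetical rule: explicit device first, then a GPU if allowed, else cpu."""
    if settings.device:
        return settings.device
    if settings.use_gpu:
        if cuda_available:
            return "cuda"
        if mps_available:
            return "mps"
    return "cpu"
```

In a real service the two availability flags could come from PyTorch's `torch.cuda.is_available()` and `torch.backends.mps.is_available()`; they are passed in as plain booleans here so the sketch runs without PyTorch installed.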
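Benchmark with Locust:
locust -f benchmark/locustfile.py
Then open the Locust web interface (by default at http://localhost:8089) and follow the instructions there to start benchmarking.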