Releases: airockchip/rknn-llm
release-v1.1.4
- Add support for converting HuggingFace GPTQ-int4 models (requires groupsize to be 32, 64, or 128, and desc_act set to false).
- Add support for TeleChat/TeleChat2/MiniCPM-S models.
- Support exporting the LLM component of Qwen2VL.
- Resolve issues with LoRA inference.
- Fix an import error related to IPython.
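
The GPTQ-int4 constraints listed above can be checked before attempting a conversion. The following is a minimal sketch, not part of rkllm-toolkit: `check_gptq_config` is a hypothetical helper, and it assumes the quantization settings are available as a dict with the keys `bits`, `group_size`, and `desc_act`, as commonly found in the `quantization_config` of a HuggingFace GPTQ checkpoint.

```python
from typing import List

# Group sizes named in the release note above.
SUPPORTED_GROUP_SIZES = {32, 64, 128}


def check_gptq_config(quant_cfg: dict) -> List[str]:
    """Return the reasons a GPTQ config falls outside the stated constraints.

    Hypothetical helper for illustration; an empty list means no problem found.
    """
    problems = []
    bits = quant_cfg.get("bits")
    if bits != 4:
        problems.append(f"bits must be 4, got {bits!r}")
    group_size = quant_cfg.get("group_size")
    if group_size not in SUPPORTED_GROUP_SIZES:
        problems.append(f"group_size must be 32, 64, or 128, got {group_size!r}")
    # Assumption: a missing desc_act key is treated as false.
    if quant_cfg.get("desc_act", False):
        problems.append("desc_act must be false")
    return problems


if __name__ == "__main__":
    ok_cfg = {"bits": 4, "group_size": 128, "desc_act": False}
    print(check_gptq_config(ok_cfg))  # prints []
    for reason in check_gptq_config({"bits": 4, "group_size": 256, "desc_act": True}):
        print("unsupported:", reason)
```

Returning a list of reasons rather than a boolean makes it easier to report every violated constraint at once.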
release-v1.1.2
- Fix inference error in chatglm3 model
- Fix inference issue with embedding input
- Support exporting the LLM component of MiniCPMV.
release-v1.1.1
- Fixed the inference error in the minicpm3 model.
- Fixed the runtime error in rkllm_server_demo.
- Added the rkllm-toolkit installation package for Python 3.10.
- Supported gguf model conversion when tie_word_embeddings is set to true.
release-v1.1.0
- Added support for grouped quantization (w4a16 group sizes of 32/64/128, w8a8 group sizes of 128/256/512).
- Added gdq algorithm to improve 4-bit quantization accuracy.
- Added hybrid quantization algorithm, supporting a combination of grouped and non-grouped quantization based on specified ratios.
- Added support for Llama3, Gemma2, and Minicpm3 models.
- Added support for gguf model conversion (currently supports q4_0 and fp16 only).
- Added support for LoRA models.
- Added storage and loading of prompt cache
- Added PC-side emulation accuracy testing and inference interface support for rkllm-toolkit.
- Fixed catastrophic forgetting issue when the token count exceeds max_context.
- Optimized prefill speed.
- Optimized generation speed.
- Optimized model initialization time
- Added support for four input interfaces: prompt, embedding, token, and multimodal.
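
To illustrate what grouped quantization means in general, the sketch below splits a weight vector into fixed-size groups and stores one scale per group. It is a generic symmetric scheme chosen for illustration and is not the algorithm used by rkllm-toolkit; the function names `quantize_groupwise` and `dequantize_groupwise` are hypothetical.

```python
from typing import List, Sequence, Tuple


def quantize_groupwise(
    weights: Sequence[float], group_size: int, bits: int = 4
) -> Tuple[List[int], List[float]]:
    """Symmetric per-group quantization: one float scale per group of weights."""
    if group_size <= 0 or len(weights) % group_size != 0:
        raise ValueError("len(weights) must be a positive multiple of group_size")
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit signed codes
    qmin = -qmax - 1            # -8 for 4-bit signed codes
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Avoid division by zero for an all-zero group.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(qmin, min(qmax, round(w / scale))) for w in group)
    return codes, scales


def dequantize_groupwise(
    codes: Sequence[int], scales: Sequence[float], group_size: int
) -> List[float]:
    """Reconstruct approximate weights from integer codes and per-group scales."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


if __name__ == "__main__":
    weights = [0.01 * i for i in range(-32, 32)]  # 64 example values
    codes, scales = quantize_groupwise(weights, group_size=32)
    restored = dequantize_groupwise(codes, scales, group_size=32)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"groups: {len(scales)}, worst absolute error: {worst:.4f}")
```

Smaller groups let each scale track a narrower range of values, which tends to reduce quantization error at the cost of storing more scales.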
release-v1.0.1
- Reduce memory usage during model conversion
- Reduce memory usage during inference
- Increase prefill speed
- Reduce initialization time
- Improve quantization accuracy
- Add support for Gemma, ChatGLM3, MiniCPM, InternLM2, and Phi-3
- Add Server invocation
- Add inference interruption interface
- Add logprob and token_id to the return value