Request for Qwen 7B support #147
RK3576, AFAIK, doesn't support w8a8. What do you expect?
@yuguolong - all of my models are for RK3588 only. You will need to run a conversion yourself with the rk3576 platform and w4a16 quantization. You can use my interactive pipeline to walk through it, as long as the original safetensors model is on Huggingface.
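For anyone who wants to attempt that conversion, here is a minimal sketch using rkllm-toolkit. The API shape follows Rockchip's published examples, but exact parameter names can shift between toolkit releases, so treat it as a template rather than a verified recipe:

```python
# Minimal conversion sketch with rkllm-toolkit (API as of ~1.1.x; parameter
# names may differ between releases -- check against your installed version).
from rkllm.api import RKLLM

MODEL_PATH = "Qwen/Qwen2.5-7B-Instruct"  # local path or Hugging Face repo id

llm = RKLLM()

# Load the original safetensors model from Hugging Face.
ret = llm.load_huggingface(model=MODEL_PATH)
assert ret == 0, "model load failed"

# Build for RK3576 with w4a16 quantization, since w8a8 is reported not to
# work for the 7B model on that chip.
ret = llm.build(
    do_quantization=True,
    optimization_level=1,
    quantized_dtype="w4a16",
    target_platform="rk3576",
)
assert ret == 0, "build failed"

# Export the converted model.
ret = llm.export_rkllm("./Qwen2.5-7B-Instruct-rk3576-w4a16.rkllm")
assert ret == 0, "export failed"
```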
@imkebe It does support it; I have run Qwen2.5-3B w8a8 on RK3576.
@c0zaut Models are converted for the RK3576 chip. I tested Qwen2.5-3B, Qwen2.5-7B, and other models. Qwen2.5-3B-w8a8 runs; only Qwen2.5-7B-w8a8 does not.
I am using rkllm-runtime version 1.1.4 and rknpu driver version 0.9.8 on RK3588, and I am not able to run either Qwen2.5-3B w8a8 or Qwen2.5-7B w8a8.
@onestepbackk
@onestepbackk - I got similar behavior when trying to use run_async instead of just run. What is your callback function?
Yes, the model is loading. The same thing happens with the run function.
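For context, the callback asked about above is the function the RKLLM runtime streams tokens through; if it never surfaces them, the output looks blank even when inference ran. Below is a rough ctypes sketch of such a callback. The struct layout and state constants are assumptions modeled on rkllm.h from the 1.1.x runtime, so check them against the header that ships with your librkllmrt:

```python
import ctypes

# Rough sketch of an RKLLM result callback via ctypes. Field names and
# state values are assumptions based on rkllm.h (1.1.x) and may differ
# between runtime versions.

class RKLLMResult(ctypes.Structure):
    _fields_ = [
        ("text", ctypes.c_char_p),   # partial token text for this call
        ("token_id", ctypes.c_int),  # id of the emitted token
    ]

# LLMCallState values (assumed; verify against your rkllm.h)
RKLLM_RUN_NORMAL = 0
RKLLM_RUN_FINISH = 1
RKLLM_RUN_ERROR = 2

CALLBACK_TYPE = ctypes.CFUNCTYPE(
    None, ctypes.POINTER(RKLLMResult), ctypes.c_void_p, ctypes.c_int
)

def _callback_impl(result, userdata, state):
    # The runtime streams tokens through here; if they are silently
    # dropped, llm_demo-style output shows an empty "robot:" line.
    if state == RKLLM_RUN_ERROR:
        print("[rkllm] inference error")
    elif state == RKLLM_RUN_FINISH:
        print()  # end of answer
    elif result and result.contents.text:
        print(result.contents.text.decode("utf-8", errors="replace"),
              end="", flush=True)

# Keep a module-level reference so the function object is not garbage
# collected while the native runtime still holds the pointer.
callback = CALLBACK_TYPE(_callback_impl)
```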
Can you try to load it here? https://github.com/c0zaut/RKLLM-Gradio
Video tutorial: https://youtu.be/sTHNZZP0S3E
Models for RK3588: https://huggingface.co/c01zaut/Qwen2.5-7B-Instruct-RK3588-1.1.4
^ there are a bunch of different versions, group sizes, hybrid ratios, etc. that you can try out.
@c0zaut
@onestepbackk - I need to push my most recent changes to GitHub, since this uses the older version of the toolkit, but: https://github.com/c0zaut/ez-er-rkllm-toolkit

Just change out the whl file and update the Dockerfile and it should work just fine! That's what I have been doing, but I haven't been able to get to the device with this repo in a little while. I'll try to push something this weekend.

The non-interactive container allows you to set a bunch of different settings. I wouldn't recommend setting hybrid_ratio to 1.0, though, since that is the same thing as a group-size quant: for a w8a8 model, it is just 100% w8a8_g128 (or whatever is compatible).

TL;DR for group-size quantization: w8a8 is faster but less accurate; w8a8_g* is slower but more accurate. If you optimize, I would recommend using a small sample of your own dataset.
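That trade-off comes down to the quantized_dtype choice at build time. A hedged sketch, using the same assumed rkllm-toolkit API as in the conversion sketch above (the dataset parameter name in particular is an assumption; check your toolkit's docs):

```python
# Two ways to quantize the same loaded model -- pick one.

# Option 1: plain w8a8 -- faster inference, less accurate.
llm.build(
    do_quantization=True,
    optimization_level=1,
    quantized_dtype="w8a8",
    target_platform="rk3588",
)

# Option 2: group-size w8a8_g128 -- each group of 128 weights gets its own
# scale, so it is slower but more accurate. Passing a small sample of your
# own data lets the quantizer calibrate on prompts you actually use.
# ("dataset" is an assumed parameter name for the calibration sample.)
llm.build(
    do_quantization=True,
    optimization_level=1,
    quantized_dtype="w8a8_g128",
    target_platform="rk3588",
    dataset="./my_prompts.json",
)
```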
Hi, I have been experimenting with the Qwen2.5 1.5B model and it is working without an issue.
But with the Qwen2.5 7B model, I can convert the Hugging Face model to an rkllm model, but when I try to use it, it doesn't do anything.
It seems to load the model into memory properly, but when I input a prompt, the NPU load stays at 0% and the answer comes back blank.
I am using rkllm-runtime version 1.1.4, rknpu driver version 0.9.8, platform: RK3588 (Orange Pi 5 Pro).
Here is a sample output using llm_demo:
user: Hi how are you?
robot:
user: answer me
robot:
user:
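Side note: the "NPU load stays at 0%" observation can be confirmed directly from the rknpu debugfs nodes, which also report the driver version (reading them requires root). A small sketch:

```python
# Read the RKNPU driver version and live NPU load from debugfs. These are
# the standard nodes exposed by the rknpu kernel driver on Rockchip boards.
def read_node(path: str) -> str:
    with open(path) as f:
        return f.read().strip()

print(read_node("/sys/kernel/debug/rknpu/version"))  # e.g. "RKNPU driver: v0.9.8"
print(read_node("/sys/kernel/debug/rknpu/load"))     # should rise above 0% during inference
```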