Request for Qwen 7B support #147

Open
onestepbackk opened this issue Dec 24, 2024 · 12 comments

onestepbackk commented Dec 24, 2024

Hi, I have been experimenting with the Qwen2.5 1.5B model and it works without issue.
But with the Qwen2.5 7B model, I can convert the Hugging Face model to an RKLLM model; when I try to use it, though, it does nothing.
It seems to load the model into memory properly, but when I enter a prompt the NPU load stays at 0% and the answer comes back blank.

I am using rkllm-runtime version: 1.1.4, rknpu driver version: 0.9.8, platform: RK3588 (Orange Pi 5 Pro)

Here is a sample output using llm_demo:
user: Hi how are you?
robot:
user: answer me
robot:
user:

@yuguolong

I also had problems loading the model.

Platform:

rkllm-runtime version: 1.1.4, rknpu driver version: 0.9.7, platform: RK3576

Model:

Qwen2.5-7B, format: w8a8

Error message:

[screenshot of the error message attached]

The model runs in w4a16 format, but cannot be loaded in w8a8.


imkebe commented Dec 25, 2024

RK3576 AFAIK doesn't support w8a8. What do you expect?


c0zaut commented Dec 25, 2024

@yuguolong - all of my models are for RK3588 only. You will need to run a conversion yourself with the rk3576 target platform and w4a16 quantization. You can use my interactive pipeline to walk through it, as long as the original safetensors model is on Hugging Face.

@yuguolong

@imkebe It does support it; I have run Qwen2.5-3B w8a8 on RK3576.

@yuguolong

@c0zaut The models are converted for the RK3576 chip. I tested Qwen2.5-3B, Qwen2.5-7B, and other models. Qwen2.5-3B-w8a8 runs; only Qwen2.5-7B-w8a8 does not.

@onestepbackk

I am using rkllm-runtime version 1.1.4 and rknpu driver version 0.9.8 with RK3588, and I am not able to run either Qwen2.5-3B w8a8 or Qwen2.5-7B w8a8.

@yuguolong

@onestepbackk
Is the model loading OK?


c0zaut commented Dec 27, 2024

@onestepbackk - I got similar behavior when trying to use run_async instead of just run. What is your callback function?
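For reference, a minimal sketch of the kind of streaming callback I mean (Python/ctypes; the RKLLMResult layout, the LLMCallState values, and the wrapper details here are assumptions modeled on the rkllm.h header and the official llm_demo, so check your own binding before copying):

```python
import ctypes

# Assumed minimal result layout: the real RKLLMResult in rkllm.h has more
# fields, but only `text` (assumed to be the first field) is read here.
class RKLLMResult(ctypes.Structure):
    _fields_ = [("text", ctypes.c_char_p)]

# Assumed LLMCallState values; verify them against your rkllm.h.
RKLLM_RUN_NORMAL = 0
RKLLM_RUN_FINISH = 2
RKLLM_RUN_ERROR = 3

def callback(result, userdata, state):
    # Print each streamed chunk as it arrives, end the line when the run
    # finishes, and surface errors instead of silently returning.
    if state == RKLLM_RUN_NORMAL and result and result.contents.text:
        print(result.contents.text.decode("utf-8", errors="ignore"),
              end="", flush=True)
    elif state == RKLLM_RUN_FINISH:
        print()
    elif state == RKLLM_RUN_ERROR:
        print("\nrun error")

# The runtime expects a C function pointer of the form
# (RKLLMResult*, void*, LLMCallState); wrap the Python function accordingly.
CALLBACK_TYPE = ctypes.CFUNCTYPE(None, ctypes.POINTER(RKLLMResult),
                                 ctypes.c_void_p, ctypes.c_int)
c_callback = CALLBACK_TYPE(callback)
```

If the callback drops the text or only handles the finish state, the output can look like the blank robot: responses above even when the model itself loaded fine.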


onestepbackk commented Dec 27, 2024

@yuguolong

@onestepbackk Is the model loading OK?

Yes, the model is loading.

@c0zaut

@onestepbackk - I got similar behavior when trying to use run_async instead of just run. What is your callback function?

The same thing happens with the run function.


c0zaut commented Dec 28, 2024

@onestepbackk

Can you try to load it here? https://github.com/c0zaut/RKLLM-Gradio

Video tutorial: https://youtu.be/sTHNZZP0S3E

Models for RK3588: https://huggingface.co/c01zaut/Qwen2.5-7B-Instruct-RK3588-1.1.4

^ There are a bunch of different versions, group sizes, hybrid ratios, etc. that you can try out.


onestepbackk commented Jan 8, 2025

@c0zaut
Awesome project!
I tried your pre-exported model and it works nicely, even with my code.
So it seems the issue is related to how I export the model.
Can you tell me how you exported it?
I exported to rkllm with:
llm.build(do_quantization=True, optimization_level=1, quantized_dtype='w8a8', quantized_algorithm='normal', target_platform='rk3588', num_npu_core=3, extra_qparams=None)
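For context, my full export script is roughly the following (rkllm-toolkit 1.1.4; the model path and output filename are placeholders, and the load/build/export flow follows the example scripts that ship with the toolkit):

```python
from rkllm.api import RKLLM

modelpath = "Qwen/Qwen2.5-7B-Instruct"  # placeholder: local dir or HF repo id

llm = RKLLM()

# Load the original Hugging Face (safetensors) model.
ret = llm.load_huggingface(model=modelpath)
if ret != 0:
    raise SystemExit(f"load_huggingface failed: {ret}")

# Quantize and build for RK3588 with 3 NPU cores, w8a8, as in the call above.
ret = llm.build(do_quantization=True, optimization_level=1,
                quantized_dtype='w8a8', quantized_algorithm='normal',
                target_platform='rk3588', num_npu_core=3, extra_qparams=None)
if ret != 0:
    raise SystemExit(f"build failed: {ret}")

# Write the .rkllm file that rkllm-runtime / llm_demo loads.
ret = llm.export_rkllm("./qwen2.5-7b-w8a8-rk3588.rkllm")
if ret != 0:
    raise SystemExit(f"export_rkllm failed: {ret}")
```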


c0zaut commented Jan 11, 2025

@onestepbackk - I need to push my most recent changes to GitHub, since this uses the older version of the toolkit, but: https://github.com/c0zaut/ez-er-rkllm-toolkit

Just change out the whl file and update the Dockerfile and it should work just fine! That's what I have been doing, but I haven't been able to get to the device with this repo in a little while. I'll try to push something this weekend.

The non-interactive container lets you adjust a bunch of different settings. I wouldn't recommend setting hybrid_ratio to 1.0, though, since that is the same as doing a plain group-size quant; for a w8a8 model, it is just 100% w8a8_g128 (or whatever group size is compatible).

TL;DR for group-size quantization: w8a8 is faster but less accurate; w8a8_g* is slower but more accurate. If you optimize, I would recommend using a small sample of your own dataset.
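Roughly, the only change on the toolkit side is the quantized_dtype, plus an optional calibration dataset. A sketch, assuming the rkllm-toolkit 1.1.x build() signature and that your version accepts a dataset argument; the paths are placeholders:

```python
from rkllm.api import RKLLM

llm = RKLLM()
llm.load_huggingface(model="Qwen/Qwen2.5-7B-Instruct")  # placeholder path/repo id

# Group-size variant: slower at inference than plain w8a8, but more accurate.
# The calibration dataset is assumed to be a small JSON sample of your own
# data; drop the argument if your toolkit version does not accept it.
llm.build(do_quantization=True, optimization_level=1,
          quantized_dtype='w8a8_g128',             # group size 128
          quantized_algorithm='normal',
          target_platform='rk3588', num_npu_core=3,
          extra_qparams=None,
          dataset='./calibration_sample.json')     # placeholder path

llm.export_rkllm("./qwen2.5-7b-w8a8_g128-rk3588.rkllm")
```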
