Model can't run inference for Llama3.2-1B when using -d fp16 to convert the pte #9534

Open
WeiMa01 opened this issue Mar 24, 2025 · 4 comments
Labels: module: xnnpack (Issues related to xnnpack delegation and the code under backends/xnnpack/), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments


WeiMa01 commented Mar 24, 2025

When we run the Llama3.2-1B fp16 .pte, which was converted from the Llama3.2-1B w/BF16 checkpoint using -d fp16, it hits an error:
error log:
I 00:00:00.013206 executorch:main.cpp:69] Resetting threadpool with num threads = 6
I 00:00:00.027952 executorch:runner.cpp:67] Creating LLaMa runner: model_path=llama3_2_fp16_org.pte, tokenizer_path=../tokenizer.model
E 00:00:00.728030 executorch:XNNCompiler.cpp:635] Failed to create multiply node 266 with code: xnn_status_invalid_parameter
E 00:00:00.728090 executorch:XNNPACKBackend.cpp:106] XNNCompiler::compileModel failed: 0x1
E 00:00:00.728099 executorch:method.cpp:110] Init failed for backend XnnpackBackend: 0x1
E 00:00:00.771031 executorch:XNNCompiler.cpp:635] Failed to create multiply node 266 with code: xnn_status_invalid_parameter
E 00:00:00.771096 executorch:XNNPACKBackend.cpp:106] XNNCompiler::compileModel failed: 0x1
E 00:00:00.771104 executorch:method.cpp:110] Init failed for backend XnnpackBackend: 0x1

convert command:
python -m examples.models.llama.export_llama --model "llama3_2" --checkpoint "/model_convert/Llama-3.2-1B/original/consolidated_00.pth" --params "/Llama-3.2-1B/original/params.json" --use_sdpa_with_kv_cache -X --xnnpack-extended-ops --output_name "llama3_2_fp16_direct_convert_runtime.pte" -kv -d fp16 --max_seq_length 256
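For reference, a quick sanity check (a sketch; the checkpoint path is the one from the command above, and weights_only=True assumes a plain tensor state dict) to confirm the source checkpoint really is stored in bf16, as described above, before the fp16 conversion:

import torch

# Load the original Llama-3.2-1B checkpoint and list the weight dtypes.
# The report above says the source weights are bf16, so the expected
# output is {torch.bfloat16}.
state_dict = torch.load(
    "/model_convert/Llama-3.2-1B/original/consolidated_00.pth",
    map_location="cpu",
    weights_only=True,
)
print({v.dtype for v in state_dict.values() if isinstance(v, torch.Tensor)})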

cc @digantdesai @mcr229 @cbilgin

JacobSzwejbka (Contributor)

cc @mcr229

JacobSzwejbka added the module: xnnpack and triaged labels Mar 24, 2025
mcr229 (Contributor) commented Mar 24, 2025

@WeiMa01 I think there is an issue in the llama dtype conversion (source type bf16 --> target type fp16), where one of the edges isn't converted, causing a multiply node with one bf16 input and one fp16 input, which is likely what XNNPACK is complaining about here.

On the XNNPACK side we are trying to add better error checking here: #9023

But I think there is something in the model transformation code that isn't consistently changing all the dtypes to fp16. I remember @kimishpatel and @jackzhxng talking about this?
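A minimal sketch of how one might check this, assuming access to the exported program before it is lowered to XNNPACK (the helper below is hypothetical, not part of export_llama): walk the graph and flag any value whose metadata is still bf16, since any such edge would feed a multiply with mismatched input dtypes.

import torch

def find_bf16_values(exported_program):
    # Flag graph values whose FakeTensor metadata is still bf16 after the
    # -d fp16 conversion; any hit would explain the failing multiply node.
    offenders = []
    for node in exported_program.graph_module.graph.nodes:
        val = node.meta.get("val", None)
        if isinstance(val, torch.Tensor) and val.dtype == torch.bfloat16:
            offenders.append((node.name, node.target))
    return offenders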

jackzhxng (Contributor)

Oh nope, the entire model should be getting converted to fp16 here, especially since there is no quantization

keyprocedure (Contributor) commented Apr 2, 2025

Hi @WeiMa01, I was able to reproduce the issue and it seemed related to the XNNPACK partitioner delegating a mixed dtype op. With a fix for #9023 in place, the model runs successfully on my end.

Just checking in to see if it’s now working on your side too.
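For illustration, a minimal sketch of the kind of guard being discussed (a hypothetical helper, not the actual ExecuTorch partitioner code or the #9023 patch): a binary op whose tensor inputs disagree on dtype, e.g. one bf16 and one fp16, would be skipped during partitioning instead of producing an invalid XNNPACK multiply at runtime.

import torch

def has_consistent_input_dtypes(node) -> bool:
    # Collect the dtypes of all tensor-valued inputs to an FX node; more than
    # one distinct dtype means the op should not be delegated to XNNPACK.
    dtypes = set()
    for arg in node.all_input_nodes:
        val = arg.meta.get("val", None)
        if isinstance(val, torch.Tensor):
            dtypes.add(val.dtype)
    return len(dtypes) <= 1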
