This example demonstrates how to use the WasmEdge WASI-NN MLX plugin to perform an inference task with an LLM model.
| Family | Models |
| --- | --- |
| LLaMA 2 | llama_2_7b_chat_hf |
| LLaMA 3 | llama_3_8b |
| TinyLLaMA | tiny_llama_1.1B_chat_v1.0 |
The MLX backend relies on MLX, but MLX is downloaded automatically when you build WasmEdge, so you do not need to install it yourself. If you want to use a custom MLX build, install it yourself and set the `CMAKE_PREFIX_PATH` variable when configuring CMake, as shown below.
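For example, a sketch of the configure step pointing CMake at a custom MLX installation (the `/opt/mlx` path is hypothetical):

```bash
# Use a custom MLX build instead of the auto-downloaded one.
cmake -GNinja -Bbuild -DCMAKE_BUILD_TYPE=Release \
  -DWASMEDGE_PLUGIN_WASI_NN_BACKEND="mlx" \
  -DCMAKE_PREFIX_PATH=/opt/mlx
```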
Build and install WasmEdge from source:
```bash
cd <path/to/your/wasmedge/source/folder>
cmake -GNinja -Bbuild -DCMAKE_BUILD_TYPE=Release -DWASMEDGE_PLUGIN_WASI_NN_BACKEND="mlx"
cmake --build build
# For the WASI-NN plugin, you should install this project.
cmake --install build
```
After installation, you will have the `wasmedge` runtime executable under `/usr/local/bin` and the WASI-NN plug-in with the MLX backend at `/usr/local/lib/wasmedge/libwasmedgePluginWasiNN.so`.
In this example, we will use `tiny_llama_1.1B_chat_v1.0`, which you can change to `llama_2_7b_chat_hf` or `llama_3_8b`.
```bash
# Download the model weights
wget https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0/resolve/main/model.safetensors
# Download the tokenizer
wget https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0/resolve/main/tokenizer.json
```
Run the following command to build the WASM module; the output file will be at `target/wasm32-wasi/release/`:

```bash
cargo build --target wasm32-wasi --release
```
Execute the WASM module with the `wasmedge` CLI, using `--nn-preload` to load the model (the preload string format is `NAME:ENCODING:TARGET:PATH`):
```bash
wasmedge --dir .:. \
  --nn-preload default:mlx:AUTO:model.safetensors \
  ./target/wasm32-wasi/release/wasmedge-mlx.wasm default
```
If your model has multiple weight files, list all of them in `--nn-preload`, separated by colons. For example:
```bash
wasmedge --dir .:. \
  --nn-preload default:mlx:AUTO:llama2-7b/model-00001-of-00002.safetensors:llama2-7b/model-00002-of-00002.safetensors \
  ./target/wasm32-wasi/release/wasmedge-mlx.wasm default
```
You can set the following metadata options for the MLX plugin:

- `model_type` (required): the LLM model type.
- `tokenizer` (required): the path to `tokenizer.json`.
- `max_token` (optional): the maximum number of tokens to generate; defaults to 1024.
- `enable_debug_log` (optional): whether to print debug logs; defaults to false.

The following three parameters must be set together (see the sketch after this list):

- `is_quantized` (optional): whether the weights are already quantized. If `is_quantized` is false, the MLX backend will quantize the weights itself.
- `group_size` (optional): the group size to use for quantization.
- `q_bits` (optional): the number of bits to quantize to.
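For instance, a hedged config sketch for weights that were quantized offline (the `group_size` and `q_bits` values are illustrative, not recommendations):

```rust
let config = serde_json::to_string(&serde_json::json!({
    "model_type": "llama_2_7b_chat_hf",
    "tokenizer": "tokenizer.json",
    "max_token": 1024,
    // These three must be set together (values are illustrative).
    "is_quantized": true,
    "group_size": 64,
    "q_bits": 4
}))
.expect("Failed to serialize config");
```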
These metadata are passed to the plugin as a JSON string through `config()` when building the graph; here the graph is built from the model preloaded as `default`:

```rust
let graph = GraphBuilder::new(GraphEncoding::Mlx, ExecutionTarget::AUTO)
    .config(serde_json::to_string(&json!({"model_type": "tiny_llama_1.1B_chat_v1.0", "tokenizer": tokenizer_path, "max_token": 100})).expect("Failed to serialize config"))
    .build_from_cache("default")
    .expect("Failed to build graph");
```
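After the graph is built, inference follows the usual WASI-NN flow: create an execution context, set the prompt as input, compute, and read the output. A minimal sketch assuming the `wasmedge-wasi-nn` crate; the prompt text and output buffer size are illustrative:

```rust
use wasmedge_wasi_nn::TensorType;

// Create an execution context from the graph built above.
let mut context = graph
    .init_execution_context()
    .expect("Failed to init context");

// Feed the prompt as a UTF-8 byte tensor (prompt text is illustrative).
let prompt = "Once upon a time,";
context
    .set_input(0, TensorType::U8, &[1], prompt.as_bytes())
    .expect("Failed to set input");

// Run generation.
context.compute().expect("Failed to compute");

// Read the generated text back (the 4096-byte buffer is an assumption).
let mut output_buffer = vec![0u8; 4096];
let size = context
    .get_output(0, &mut output_buffer)
    .expect("Failed to get output");
println!("{}", String::from_utf8_lossy(&output_buffer[..size]));
```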