LLM applications running in real time on Apple Silicon thanks to the Apple MLX framework.
There's also a YouTube video.
git clone https://github.com/riccardomusmeci/mlx-llm
cd mlx-llm
pip install .
Go check models for a summary of available models.
To create a model with weights:
from mlx_llm.model import create_model
# loading weights from HuggingFace
model = create_model("TinyLlama-1.1B-Chat-v0.6")
# loading weights from local file
model = create_model("TinyLlama-1.1B-Chat-v0.6", weights="path/to/weights.npz")
To list all available models:
from mlx_llm.model import list_models
print(list_models())
You can run benchmarks with mlx-llm to compare mlx versions, models, and devices:
from mlx_llm.bench import Benchmark
benchmark = Benchmark(
    apple_silicon="m1_pro_32GB",
    model_name="TinyLlama-1.1B-Chat-v0.6",
    prompt="What is the meaning of life?",
    max_tokens=100,
    temperature=0.1,
    verbose=False
)
benchmark.start()
# just the output dir, the file name will be benchmark.csv
benchmark.save("results") # if benchmark.csv is already there, it will append the new results
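If you want to inspect or compare the saved results programmatically, something like this should work (a minimal sketch, not part of mlx-llm; it assumes pandas is installed and that the file was saved under results/benchmark.csv):

# load the appended benchmark results for inspection (illustrative only)
import pandas as pd

df = pd.read_csv("results/benchmark.csv")
print(df.head())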
Warning
Download the model weights first before running the benchmark (just use create_model and then run the benchmark).
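For example (a minimal sketch; the model name just mirrors the benchmark snippet above):

# create the model once so the weights get downloaded and cached,
# then run the Benchmark snippet shown above
from mlx_llm.model import create_model

_ = create_model("TinyLlama-1.1B-Chat-v0.6")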
Go to benchmark.csv to check my experiments.
If you want to run benchmarks for all available LLMs:
cd scripts
./run_benchmarks.sh
Warning
The test will take a while, since it downloads all the models that are not already present. Also, once the test for a model is done, the whole 🤗 Hub cache is deleted.
Note
Run the benchmarks on your Apple Silicon device and then open a PR with the results. I will be happy to add them to the benchmark.csv file.
Models in mlx-llm can extract embeddings from a given text.
import mlx.core as mx
from mlx_llm.model import create_model
from transformers import AutoTokenizer
model = create_model("e5-mistral-7b-instruct")
tokenizer = AutoTokenizer.from_pretrained('intfloat/e5-mistral-7b-instruct')
text = ["I like to play basketball", "I like to play tennis"]
tokens = tokenizer(text)
x = mx.array(tokens["input_ids"])
embeds = model.embed(x)
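As a quick sanity check (a sketch only, not part of mlx-llm), you can compare the two embeddings with cosine similarity using plain mlx.core operations:

# cosine similarity between the two sentence embeddings (illustrative only)
norms = mx.sqrt(mx.sum(embeds * embeds, axis=-1, keepdims=True))
normed = embeds / norms
similarity = mx.sum(normed[0] * normed[1])
print(similarity.item())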
For a better example, go check the 🤗 e5-mistral-7b-instruct page.
With mlx-llm you can run a variety of applications, such as:
- Chat with an LLM
- Retrieval Augmented Generation (RAG) running locally (see the retrieval sketch right below)
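A minimal sketch of the retrieval step of a local RAG pipeline, reusing the embedding model from the previous section (the documents, the query, the embed_one helper, and the prompt format are all illustrative assumptions, not mlx-llm APIs):

import mlx.core as mx
from mlx_llm.model import create_model
from transformers import AutoTokenizer

embedder = create_model("e5-mistral-7b-instruct")
tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-mistral-7b-instruct")

documents = [
    "Dunder Mifflin is a paper company based in Scranton.",
    "Beets are a root vegetable that Dwight grows on Schrute Farms.",
]
query = "What does Dunder Mifflin sell?"

def embed_one(text):
    # embed a single text and L2-normalize it (hypothetical helper)
    ids = tokenizer(text)["input_ids"]
    e = embedder.embed(mx.array([ids]))[0]
    return e / mx.sqrt(mx.sum(e * e))

doc_embeds = mx.stack([embed_one(d) for d in documents])
query_embed = embed_one(query)

# cosine similarity between the query and each document
scores = mx.sum(doc_embeds * query_embed, axis=-1)
best = documents[int(mx.argmax(scores).item())]

# the retrieved passage can then be prepended to the question before
# sending it to a chat model (see the chat example below)
prompt = f"Context: {best}\n\nQuestion: {query}"
print(prompt)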
Below is an example of how to chat with an LLM; for more details, go check the examples documentation.
mlx-llm comes with tools to easily run your LLM chat on Apple Silicon.
You can chat with an LLM by specifying a personality and some examples of user-model interaction (this is mandatory to have a good chat experience):
from mlx_llm.playground.chat import ChatLLM
personality = "You're a salesman and beet farmer known as Dwight K Schrute from the TV show The Office. Dwight replies just as he would in the show. You always reply as Dwight would reply. If you don't know the answer to a question, please don't share false information."
# examples must be structured as below
examples = [
    {
        "user": "What is your name?",
        "model": "Dwight K Schrute",
    },
    {
        "user": "What is your job?",
        "model": "Assistant Regional Manager. Sorry, Assistant to the Regional Manager."
    }
]
chat_llm = ChatLLM.build(
    model_name="LLaMA-2-7B-chat",
    tokenizer="mlx-community/Llama-2-7b-chat-mlx",  # HF tokenizer or a local path to a tokenizer
    personality=personality,
    examples=examples,
)
chat_llm.run(max_tokens=500, temp=0.1)
[ ] Chat and RAG with Streamlit???
[ ] Test with quantized models
[ ] LoRA and QLoRA
If you have any questions, please email [email protected]