Benchmark GGUF models with one line of code. The fastest benchmarking tool for quantized GGUF models, featuring multiprocessing support and 8 evaluation tasks.
Currently supports text GGUF models on Windows, Linux, and macOS.
- Install the Nexa SDK Python package
- Install the Nexa Eval package:
pip install 'nexaai[eval]'
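Putting both steps together, a typical CPU-only setup might look like the sketch below. The plain `pip install nexaai` command is an assumption here: the exact install command for the SDK depends on your platform and backend, so check the Nexa SDK installation instructions for GPU-accelerated builds.
# Install the Nexa SDK (default CPU build), then the evaluation extras
pip install nexaai
pip install 'nexaai[eval]'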
Choose a GGUF model from Nexa Model Hub to benchmark. You can also upload your own GGUF models.
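The identifier you pass to nexa eval is the same text you would put after nexa run (see the model_path argument below), so you can sanity-check it first; in the Nexa CLI this typically also downloads the model locally on first use.
# Optional: run the model once to confirm the identifier
nexa run Llama3.2-1B-Instruct:q4_K_M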
# Evaluate Llama3.2-1B Q4_K_M quantization with "ifeval" task
nexa eval Llama3.2-1B-Instruct:q4_K_M --tasks ifeval
# Use multiprocessing; you can specify the number of workers to optimize performance.
nexa eval Llama3.2-1B-Instruct:q4_K_M --tasks ifeval --num_workers 4
usage: nexa eval model_path [-h] [--tasks TASKS] [--limit LIMIT]
positional arguments:
model_path Path or identifier for the model in Nexa Model Hub. Text after 'nexa run'.
options:
-h, --help show this help message and exit
--tasks TASKS Tasks to evaluate, comma-separated
--limit LIMIT Limit the number of examples per task. If <1, limit is a percentage of the total number of examples.
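For a quick smoke test, you can evaluate on only a fraction of each task. The example below assumes the fractional behavior of --limit described above, with 0.1 meaning 10% of the ifeval examples:
# Quick smoke test: evaluate on 10% of the ifeval examples
nexa eval Llama3.2-1B-Instruct:q4_K_M --tasks ifeval --limit 0.1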
Supported evaluation tasks, by category:

General Tasks
- ifeval: Instruction following evaluation

Math Tasks
- math: Mathematical reasoning
- mgsm_direct: Grade school math problems

Reasoning Tasks
- gpqa: General purpose question answering

Coding Tasks
- openai_humaneval: Code generation and completion

Safety Tasks
- do-not-answer: Adversarial question handling
- truthfulqa: Model truthfulness evaluation
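Any of these task names can be passed to --tasks, individually or as a comma-separated list, so a single run can cover several categories, for example:
# Benchmark math, reasoning, and coding ability in one run
nexa eval Llama3.2-1B-Instruct:q4_K_M --tasks math,gpqa,openai_humaneval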
GGUF (GGML Universal Format) models are optimized for on-device AI deployment:
- Reduced memory footprint through quantization
- Cross-platform compatibility via llama.cpp
- No external dependencies
- Supported by popular projects: llama.cpp, whisper.cpp, stable-diffusion.cpp, and more
Quantization affects three key factors:
- File size
- Model quality
- Inference speed
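As a rough worked example of the file-size factor (ignoring file metadata and the mix of tensor precisions inside a real GGUF file): a 1B-parameter model stored at FP16 uses 2 bytes per weight, roughly 1 billion × 2 bytes ≈ 2 GB, while a 4-bit quantization such as Q4_K_M averages roughly 4.5–5 bits per weight once block scales are included, shrinking the same model to roughly 0.6 GB at the cost of some loss in output quality.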
Benchmarking helps you:
- Verify accuracy retention after quantization
- Select the optimal model for your specific use case
- Make informed decisions about quantization levels
Adapted from the Language Model Evaluation Harness.