Benchmark GGUF models with one line of code. The fastest benchmarking tool for quantized GGUF models, featuring multiprocessing support and 8 evaluation tasks.
Currently supports text GGUF models on Windows, Linux, and macOS.
- Install the Nexa SDK Python package
- Install the Nexa Eval package:
pip install 'nexaai[eval]'
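Putting both steps together, a typical CPU-only setup might look like the sketch below. The plain `pip install nexaai` command is an assumption here: the exact install command for the SDK depends on your platform and backend, so check the Nexa SDK installation instructions for GPU-accelerated builds.
# Install the Nexa SDK (default CPU build), then the evaluation extras
pip install nexaai
pip install 'nexaai[eval]'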
Choose a GGUF model from Nexa Model Hub to benchmark. You can also upload your own GGUF models.
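The identifier you pass to nexa eval is the same text you would put after nexa run (see the model_path argument below), so you can sanity-check it first; in the Nexa CLI this typically also downloads the model locally on first use.
# Optional: run the model once to confirm the identifier
nexa run Llama3.2-1B-Instruct:q4_K_M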
# Evaluate Llama3.2-1B Q4_K_M quantization with "ifeval" task
nexa eval Llama3.2-1B-Instruct:q4_K_M --tasks ifeval
# Use multiprocessing; you can specify the number of workers to optimize performance.
nexa eval Llama3.2-1B-Instruct:q4_K_M --tasks ifeval --num_workers 4
usage: nexa eval model_path [-h] [--tasks TASKS] [--limit LIMIT]
positional arguments:
model_path Path or identifier for the model in Nexa Model Hub. Text after 'nexa run'.
options:
-h, --help show this help message and exit
--tasks TASKS Tasks to evaluate, comma-separated
--limit LIMIT Limit the number of examples per task. If <1, limit is a percentage of the total number of examples.
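For a quick smoke test, you can evaluate on only a fraction of each task. The example below assumes the fractional behavior of --limit described above, with 0.1 meaning 10% of the ifeval examples:
# Quick smoke test: evaluate on 10% of the ifeval examples
nexa eval Llama3.2-1B-Instruct:q4_K_M --tasks ifeval --limit 0.1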
Supported evaluation tasks, by category:

General Tasks
- ifeval: Instruction following evaluation

Math Tasks
- math: Mathematical reasoning
- mgsm_direct: Grade school math problems

Reasoning Tasks
- gpqa: General purpose question answering

Coding Tasks
- openai_humaneval: Code generation and completion

Safety Tasks
- do-not-answer: Adversarial question handling
- truthfulqa: Model truthfulness evaluation
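Any of these task names can be passed to --tasks, individually or as a comma-separated list, so a single run can cover several categories, for example:
# Benchmark math, reasoning, and coding ability in one run
nexa eval Llama3.2-1B-Instruct:q4_K_M --tasks math,gpqa,openai_humaneval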
GGUF (GGML Universal Format) models are optimized for on-device AI deployment:
- Reduced memory footprint through quantization
- Cross-platform compatibility via llama.cpp
- No external dependencies
- Supported by popular projects: llama.cpp, whisper.cpp, stable-diffusion.cpp, and more
Quantization affects three key factors:
- File size
- Model quality
- Inference speed
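As a rough worked example of the file-size factor (ignoring file metadata and the mix of tensor precisions inside a real GGUF file): a 1B-parameter model stored at FP16 uses 2 bytes per weight, roughly 1 billion × 2 bytes ≈ 2 GB, while a 4-bit quantization such as Q4_K_M averages roughly 4.5–5 bits per weight once block scales are included, shrinking the same model to roughly 0.6 GB at the cost of some loss in output quality.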
Benchmarking helps you:
- Verify accuracy retention after quantization
- Select the optimal model for your specific use case
- Make informed decisions about quantization levels
Adapted from the Language Model Evaluation Harness.