OLMo-Eval

OLMo-Eval is a repository for evaluating open language models.

Note of Deprecation

NOTE: This repository has been superseded by the OLMES repository (Open Language Model Evaluation System), available at https://github.com/allenai/olmes.

Overview

The olmo_eval framework is a way to run evaluation pipelines for language models on NLP tasks. The codebase is extensible and contains task_sets and example configurations, which run a series of tango steps for computing the model outputs and metrics.

Using this pipeline, you can evaluate m models on t task_sets, where each task_set consists of one or more individual tasks. Grouping tasks into task_sets lets you compute aggregate metrics across multiple tasks. The optional Google Sheets integration can be used for reporting.
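The m-models-by-t-task_sets layout described above can be pictured with a small, self-contained sketch. Everything in it is a placeholder for illustration: the model names, task_set names, task names, and the `fake_score` function are hypothetical and are not part of olmo_eval, and the mean is only one possible aggregate.

```python
from itertools import product
from statistics import mean

# Placeholder names, not real olmo_eval identifiers.
models = ["model_a", "model_b"]              # m = 2
task_sets = {                                # t = 2
    "set_x": ["task_1", "task_2"],
    "set_y": ["task_3"],
}

def fake_score(model: str, task: str) -> float:
    """Stand-in for a real evaluation; returns a deterministic dummy number."""
    return (len(model) + len(task)) / 100

def evaluate(models, task_sets):
    """Score every (model, task_set) pair and aggregate over the tasks in each set."""
    results = {}
    for model, (set_name, tasks) in product(models, task_sets.items()):
        per_task = {task: fake_score(model, task) for task in tasks}
        results[(model, set_name)] = {
            "per_task": per_task,
            "aggregate": mean(list(per_task.values())),
        }
    return results

results = evaluate(models, task_sets)
for (model, set_name), entry in results.items():
    print(model, set_name, round(entry["aggregate"], 3))
```

With two models and two task_sets the sketch produces four (model, task_set) entries, each holding per-task scores plus one aggregate.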

The pipeline is built using ai2-tango and ai2-catwalk.

Installation

After cloning the repository, run:

conda create -n eval-pipeline python=3.10
conda activate eval-pipeline
cd OLMo-Eval
pip install -e .

Quickstart

The current task_sets can be found at configs/task_sets. In this example, we run gen_tasks on EleutherAI/pythia-1b. The example config is configs/example_config.jsonnet.

The configuration can be run as follows:

tango --settings tango.yml run configs/example_config.jsonnet --workspace my-eval-workspace

This executes all the steps defined in the config and saves their results in a local tango workspace called my-eval-workspace. If you add a new task_set or model to your config and run the same command again, the pipeline reuses the previous outputs and computes only the new ones.
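The reuse behaviour can be illustrated with a toy cache keyed by (model, task_set). This is only a conceptual sketch of the effect described above, not how tango implements its workspace; the `run_step` and `run_config` helpers and the second model name are hypothetical.

```python
# Toy stand-in for a workspace: maps (model, task_set) to stored outputs.
cache = {}
computed = []  # records which pairs were actually computed, in order

def run_step(model: str, task_set: str) -> str:
    """Compute outputs for one pair unless they are already stored."""
    key = (model, task_set)
    if key not in cache:
        computed.append(key)
        cache[key] = f"outputs for {model} on {task_set}"
    return cache[key]

def run_config(models, task_sets):
    """Run every (model, task_set) pair listed in a config."""
    return {(m, t): run_step(m, t) for m in models for t in task_sets}

# First run: one model, one task_set.
run_config(["EleutherAI/pythia-1b"], ["gen_tasks"])

# Second run with an extra (hypothetical) model: only the new pair is computed.
run_config(["EleutherAI/pythia-1b", "my-org/new-model"], ["gen_tasks"])

print(computed)
```

After both runs, `computed` lists each pair once, which mirrors the idea that rerunning the same command does not recompute existing outputs.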

New models and datasets can be added by modifying the example configuration.

Load pipeline output

from tango import Workspace
workspace = Workspace.from_url("local://my-eval-workspace")
result = workspace.step_result("combine-all-outputs")

Load individual task results with per-instance outputs

result = workspace.step_result("outputs_pythia-1bstep140000_gen_tasks_drop")
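This README does not describe the structure of the objects returned by step_result, so a generic helper for peeking at a nested Python object may be useful when exploring them. The sketch below makes no assumptions about olmo_eval's output format; `summarize` is a hypothetical helper, and the `example` value is an invented stand-in rather than real pipeline output.

```python
def summarize(obj, max_depth=2, max_items=3, _depth=0):
    """Return a list of indented lines describing a nested object."""
    pad = "  " * _depth
    if isinstance(obj, dict):
        lines = [f"{pad}dict with {len(obj)} key(s)"]
        if _depth < max_depth:
            # Show only the first few keys to keep the summary short.
            for key in list(obj)[:max_items]:
                lines.append(f"{pad}  {key!r}:")
                lines.extend(summarize(obj[key], max_depth, max_items, _depth + 2))
        return lines
    if isinstance(obj, (list, tuple)):
        lines = [f"{pad}{type(obj).__name__} of length {len(obj)}"]
        if obj and _depth < max_depth:
            # Describe only the first element as a representative.
            lines.extend(summarize(obj[0], max_depth, max_items, _depth + 1))
        return lines
    return [f"{pad}{type(obj).__name__}: {obj!r}"[:80]]

# Invented stand-in; replace with the `result` loaded from the workspace.
example = {"metrics": {"acc": 0.5}, "instances": [{"id": 1}, {"id": 2}]}
print("\n".join(summarize(example)))
```

Passing the loaded `result` in place of `example` prints a compact outline of whatever structure the step actually returned.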

Evaluating common models on standard benchmarks

The eval_table config evaluates falcon-7b, mpt-7b, llama2-7b, and llama2-13b on standard_benchmarks and MMLU. Run it as follows:

tango --settings tango.yml run configs/eval_table.jsonnet --workspace my-eval-workspace

PALOMA

This repository was also used to run evaluations for the PALOMA paper.

Details on running the evaluation on PALOMA can be found here.
