Evaluations

This directory contains end-to-end pipelines for AI-enhanced evaluation. We will introduce the evaluation pipeline and the data format in this document.

Generate Answers

ChatGPT (gpt-3.5-turbo)

Make sure you have set up the OpenAI API key in your environment. Then run:

python qa_baseline_gpt35.py --question table/question.jsonl --output table/answer/answer_gpt35.jsonl

Bard

Unfortunately, Bard has not released a public API yet. You may have to enter the answers manually, or use a third-party project that interfaces with Bard.

Vicuna and others

To generate answers with Vicuna or other models, specify the path to the model checkpoint. Then run:

python model_qa.py --model-name /model/path --question-file table/question.jsonl --answer-file table/answer/answer.jsonl

Evaluate Answers Automatically

Generate Reviews with GPT-4

PS: If you do not currently have access to the GPT-4 API but do have access to the GPT-4 chatbot, you can evaluate the answers manually according to the instructions in the Data Format section. table/review/*.jsonl contains some example reviews.

TODO: add instructions

Visualize Results

You can generate the data for the webpage by running:

python eval/generate_webpage_data_from_table.py

Then you can serve a static website from the webpage directory to see the results.

Data Format

If you want a deeper understanding of our evaluation pipeline, or want to contribute to the evaluation process, you need to learn the data format we use for evaluation.

Our evaluation data are encoded with JSON Lines.
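Every file below can be handled with a few lines of standard-library Python. As a minimal sketch (the helper names `read_jsonl` and `write_jsonl` are ours, not part of the repo):

```python
import json

def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, one per non-empty line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def write_jsonl(records, path):
    """Write each record as one JSON object per line."""
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
```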

Random ID Generation

We use the shortuuid Python library for generating short random UUIDs.

import shortuuid
shortuuid.uuid()  # returns a short random UUID as a str

Models

model.jsonl contains the model information we used for generating answers.

Each row contains a record of a model with the following fields:

  • model_id (str): A unique ID for a model. Models with different IDs are supposed to have different performance. This ID is generated as {model_name}:{model_version}.
  • model_name (str): The name of a model. This is not unique, because a model could be trained and updated continuously; it is still considered the same model, just with different versions.
  • model_version (str): The version of a model.
  • model_metadata (Any): Any metadata of a model (descriptions etc). This is optional.

For example:

{
  "model_id": "vicuna-13b:v1",
  "model_name": "vicuna-13b",
  "model_version": "v1",
  "model_metadata": "learning rate 1e-5, 3 epochs, 13b"
}
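The model_id convention above is simple enough to capture in a one-line helper (a hypothetical function of ours, shown only to pin down the format):

```python
def make_model_id(model_name: str, model_version: str) -> str:
    """Build a model_id as {model_name}:{model_version}, per the schema above."""
    return f"{model_name}:{model_version}"
```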

Prompts

We store prompts in prompt.jsonl. Each row contains a record of a prompt with the following fields:

  • prompt_id (int): A unique integer ID for a prompt. Prompts with different IDs are supposed to have different purposes.
  • system_prompt (str): The system prompt given to a model. This is the prompt that the model sees first.
  • prompt_template (str): The prompt body. This is the user prompt that the model sees after the system prompt. It is a Python f-string template, so that we can fill in the inputs later.
  • defaults (dict): A dictionary of default values for the prompt template. It can be empty.
  • description (str): A description of the functionality of the prompt.

For example:

{
  "prompt_id": 1,
  "system_prompt": "You are a helpful assistant.",
  "prompt_template": "[Question]\n{question}\n\n[Assistant 1]\n{answer_1}\n\n[End of Assistant 1]\n\n[Assistant 2]\n{answer_2}\n\n[End of Assistant 2]\n\n[System]\n{prompt}\n\n",
  "defaults": {"prompt": "Which assistant is more helpful?"},
  "description": "Compare two assistants' answers to a question."
}
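To make the template mechanics concrete, here is a minimal sketch of how a record like the one above could be filled in, assuming str.format-style substitution where explicit inputs override the record's defaults (the helper `fill_prompt` is ours, not a function in the repo):

```python
def fill_prompt(prompt_template: str, defaults: dict, **inputs) -> str:
    """Fill the template's placeholders; explicit inputs override defaults."""
    values = {**defaults, **inputs}
    return prompt_template.format(**values)

filled = fill_prompt(
    "[Question]\n{question}\n\n[System]\n{prompt}\n",
    defaults={"prompt": "Which assistant is more helpful?"},
    question="What are five tips for better sleep?",
)
```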

Reviewers

reviewer.jsonl contains the reviewer information we used for reviewing answers generated by different models. Each row contains a record of a reviewer with the following fields:

  • reviewer_id (str): A unique ID for a reviewer. Reviewers with different IDs are supposed to have different reviewing performance.
  • prompt_id (int): The ID of the prompt given to the reviewer (e.g., an AI assistant). Different prompts could result in different reviewing performance.
  • metadata (dict): Metadata of a reviewer about its configurations.
  • description (str): A description of the reviewer.

For example:

{
  "reviewer_id": "gpt-4-0328-default",
  "prompt_id": 1,
  "metadata": {"temperature": 0.2, "max_tokens": 8192},
  "description": "GPT-4 for generic questions."
}

Questions

question.jsonl contains the questions we used for evaluation. Each row contains a record of a question with the following fields:

  • question_id (int): A unique integer for a question. Questions with different IDs are supposed to be different.
  • text (str): The question text.
  • category (str): The category of the question. Questions with the same category are supposed to be similar or originate from the same source.

Answers

answer/xxx.jsonl contains answers generated by different models. Each row contains a record of an answer with the following fields:

  • answer_id (str): A unique UUID for an answer. Answers with different IDs are supposed to be different.
  • question_id (int): The ID of the question the answer is generated for.
  • model_id (str): The ID of the model the answer is generated by.
  • text (str): The answer text.
  • metadata (dict): Any metadata of the answer.

Example:

{
  "answer_id": "[short uuid]",
  "question_id": 1,
  "model_id": "vicuna-13b:v1",
  "text": "Here are five tips...",
  "metadata": {}
}
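Assembling such a record is straightforward. A minimal sketch (the helper `make_answer` is ours; the repo uses shortuuid for IDs, and `uuid4().hex` stands in here only to keep the example standard-library-only):

```python
import uuid

def make_answer(question_id: int, model_id: str, text: str) -> dict:
    """Build an answer record matching the schema above.
    NOTE: uuid4().hex is a stdlib stand-in for the repo's shortuuid IDs."""
    return {
        "answer_id": uuid.uuid4().hex,
        "question_id": question_id,
        "model_id": model_id,
        "text": text,
        "metadata": {},
    }
```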

Reviews

review/xxx.jsonl contains reviews given by reviewers, comparing performance between a pair of models. Each row contains a record of a review with the following fields:

  • review_id (str): A unique UUID for a review. Reviews with different IDs are supposed to be different.
  • question_id (int): The ID of the question the review is given for.
  • answer1_id (str): The ID of the first answer.
  • answer2_id (str): The ID of the second answer.
  • text (str): The review text.
  • score (list): A list of scores given by the reviewer. The first score is for the first answer, and the second score is for the second answer.
  • reviewer_id (str): The ID of the reviewer.
  • metadata (dict): Any metadata of the review.

Example:

{
  "review_id": "[short uuid]",
  "question_id": 1,
  "answer1_id": "[answer1_id]",
  "answer2_id": "[answer2_id]",
  "text": "Assistant 2 is better...",
  "score": [9.0, 7.5],
  "reviewer_id": "gpt-4-0328-default",
  "metadata": {}
}
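Given a file of such reviews, the score pairs can be aggregated into a simple win/tie count for the two models being compared. A minimal sketch (the helper `tally` is ours, not part of the repo's scripts):

```python
def tally(reviews):
    """Count wins for answer 1, wins for answer 2, and ties from score pairs."""
    wins1 = wins2 = ties = 0
    for review in reviews:
        score1, score2 = review["score"]
        if score1 > score2:
            wins1 += 1
        elif score2 > score1:
            wins2 += 1
        else:
            ties += 1
    return wins1, wins2, ties
```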