[ACUTE-Eval] Fast Acute OSS Part 2 - Everything Else (facebookresearch#2573)

* fast acute OSS

* autoformat

* readme changes

* remove todos

* typing

* incorporate matchups-per-pair arg

* Update README.md

* readme update
klshuster authored Apr 29, 2020
1 parent a7d9100 commit e3c2afa
Showing 7 changed files with 1,017 additions and 7 deletions.
109 changes: 103 additions & 6 deletions parlai/mturk/tasks/acute_eval/README.md
@@ -118,12 +118,6 @@ The title, description, and keywords of the task as shown on MTurk default to va

A comprehensive list of settings specific to ACUTE-Eval can be found in `add_args()` in `run.py`. ParlAI MTurk arguments can be found in `~/ParlAI/parlai/core/params.py` under `add_mturk_args()`. For the arguments most likely to be useful for running ACUTE-Eval, see `example_script.py`:


## Creating the pairings file

Coming soon.


** **

# ACUTE-Eval Analysis
@@ -154,3 +148,106 @@ Where `</path/to/pairs/file>` is your pairings file from the ACUTE Eval run. Run
1. **all.html** - List of all conversations, indicating which was chosen as the winner by a turker.
2. **reason.html** - List of all conversations where reasons are provided by the turkers for why they chose a winner.

# Fast-ACUTE

We provide an all-in-one script that makes running ACUTE-Eval as smooth as possible.

The script combines three major steps of ACUTE-Eval into one simple command:

1. Generation (or compilation) of chat logs for given models;
2. Execution of ACUTE-Eval;
3. Analysis of ACUTE-Eval results.

## Setup Steps

### 1. Determine What You Will Be Evaluating; Populate Config.

This is an important step: do you have conversation logs between a model and a human? Would you like to evaluate model self-chat? Do you want to evaluate dataset logs?

Each of these options involves _slightly_ different preparation. However, each involves specifying a config.

In the `configs.py` file in this directory, you will find a `CONFIG` dictionary that maps a _unique_ identifier to appropriate configuration arguments; these arguments differ depending on what you will be evaluating.

*NOTE*: the `CONFIG` is _append only_, and all configs must have a *unique* identifier.

We enumerate a few of these options below.

#### Model self-chat

If you would like to evaluate a model chatting to itself, you simply specify the appropriate model parameters in the config. The parameters are any that you would need to specify on the command line, and include things like the model file, fixed candidates file, etc. You can see examples in the `example_model_1` and `example_model_2` configs.
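
For instance, a minimal sketch of a self-chat entry (the `my_generator` identifier is illustrative; the options mirror `example_model_1` in `configs.py`, and any options you would pass on the command line are fair game):

    'my_generator': {
        'model_file': 'zoo:tutorial_transformer_generator/model',
        'model': 'transformer/generator',
        'batchsize': 1,
        'skip_generation': False,
        'interactive_mode': False,
        'inference': 'beam',
        'beam_size': 3,
    },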

#### JSONL Logs

If you have logs in the appropriate JSONL format, as would be generated by the self-chat script, then all you need to specify is the `log_path`. You can see an example in the `example_model_log` config.

The appropriate JSONL format is one that can be read by ParlAI's [Conversations](https://github.com/facebookresearch/ParlAI/blob/master/parlai/utils/conversations.py) class. Note that the identifier in the config should match **EXACTLY** the `id` of the model in the conversations.
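
To make the format concrete, here is a sketch of producing one line of such a file from Python. The authoritative schema is whatever the Conversations class reads; the `dialog` layout of paired acts below is an assumption, and `example_model_log` stands in for your config identifier:

    import json

    # one conversation per line; each parley is a list of acts with 'id' and 'text'
    conversation = {
        'dialog': [
            [
                {'id': 'human_evaluator', 'text': 'hi! how are you?', 'episode_done': False},
                {'id': 'example_model_log', 'text': "i'm well, thanks for asking!", 'episode_done': False},
            ]
        ]
    }
    with open('chat_log.jsonl', 'a') as f:
        f.write(json.dumps(conversation) + '\n')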

#### Dataset

If you'd like to evaluate examples from a dataset available in ParlAI directly, simply specify the `task` in the config. You can see an example in the `example_dataset` config.
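
Such entries are as small as configs get; the `example_dataset` entry in `configs.py` (shown in full below) is simply:

    'example_dataset': {'task': 'convai2', 'prepended_context': True},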

### 1b. (Optional) Determine the Self-Chat Task You Will Use

If you will be evaluating models via self-chat, you will need to determine the self-chat task you will use to help generate the self-chats. This requires little work on your part other than identifying a task that is set up for self-chat, i.e., a task that has the appropriate worlds for conducting self-chat with the models. This is not strictly necessary, but you may want to introduce context, e.g., as in `convai2` or `blended_skill_talk`.

### 2. Run `fast_eval.py`

Now that you've set everything up, all you need to do is run one of the following commands.

If you want to compare a set of models in round-robin fashion, you would run:

python parlai/mturk/tasks/acute_eval/fast_eval.py --ids <comma-separated list of config identifiers>

If you want multiple model comparisons, but do not want to compare ALL models with each other, you would run:

python parlai/mturk/tasks/acute_eval/fast_eval.py --id-pairs <comma-separated, colon-delimited list of config identifiers>

The ids specified for each of these flags correspond to entries in the `CONFIG`.

If you are running self-chat, you can optionally specify a seed task to use for self-chat with `-t <self_chat_task>`.

A few examples are as follows:

python parlai/mturk/tasks/acute_eval/fast_eval.py --ids example_model_1,example_model_2,example_model_log,example_dataset -t blended_skill_talk

python parlai/mturk/tasks/acute_eval/fast_eval.py --id-pairs example_model_1:example_model_2,example_model_1:example_model_log,example_dataset:example_model_2 -t blended_skill_talk

When you are ready to run a **LIVE** ACUTE-Eval, please specify `--live-acute true`.

#### Onboarding

The default onboarding dialogue pair is in `example/onboarding.json`. We recommend using a different onboarding example, as the one provided is quite easy.

To use a custom onboarding path, specify `--onboarding-path` when running `fast_eval.py`. The onboarding file should be a JSONL file where each line is a JSON dict consisting of a pair of dialogues to evaluate, with `is_onboarding` set to True; a sketch follows below.
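
As a rough sketch, building one such onboarding line from Python might look like the following (only `is_onboarding` is documented here; the remaining key names and the two placeholder dialogues are assumptions, so mirror `example/onboarding.json` for the exact schema):

    import json

    # each dialogue is a list of turns with speaker ids and text (placeholder content)
    dialogue_a = [{'id': 'model_a', 'text': 'hello there!'}]
    dialogue_b = [{'id': 'model_b', 'text': 'hi, how are you today?'}]

    onboarding_pair = {
        'is_onboarding': True,  # documented above: marks this pair as onboarding
        'speakers_to_eval': ['model_a', 'model_b'],  # assumed key name
        'correct_answer': 'model_b',  # assumed key name
        'dialogue_dialogues': [dialogue_a, dialogue_b],  # assumed key name
    }
    with open('my_onboarding.jsonl', 'w') as f:
        f.write(json.dumps(onboarding_pair) + '\n')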

## Script Execution

The script operates in three phases:

### Phase 1: Compile Chat Logs

The script will first compile the chat logs for each identifier specified on the command line.

For `model`s, the script will run self-chat (if a self-chat log does not already exist); for `log`s, the script will simply load the log from disk; and for `task`s, the script will convert the task into the appropriate format.

Self-chats are saved to `PARLAI_PATH/data/acute_evals/self_chats/`.
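
In pseudocode form, the dispatch is roughly the following (a simplified sketch, not the actual `fast_eval.py` internals; the helper names are illustrative):

    def compile_chat_log(identifier: str, config: dict) -> str:
        """Return the path to a Conversations-format chat log for this identifier."""
        entry = config[identifier]
        if 'log_path' in entry:
            # log: simply use the conversations already on disk
            return entry['log_path']
        if 'task' in entry:
            # task: convert the dataset into the appropriate format
            return dump_task_to_acute_format(entry['task'])
        # model: run self-chat, unless a cached self-chat log already exists
        return run_self_chat_if_missing(entry)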

### Phase 2: ACUTE-Eval

The script will then prepare the conversation-pairs file (saved to `PARLAI_PATH/data/pairings_files/`, with a name unique to the chat files used to create it) and run ACUTE-Eval with the appropriate arguments.

Upon subsequent runs with the same configuration of `--ids` or `--id-pairs`, you will have the option to re-use a pairings file or to regenerate it.

### Phase 3: Analysis

After ACUTE-Eval finishes, the script will analyze the results and save the relevant outputs to `PARLAI_PATH/data/acute_evals/acute_results/<date>/<pairings_file>/`.

Four results will be generated:

1. A CSV file of significance results, showing the win rates of model pairs with p-values.
2. A CSV file of grid results, where the model comparisons and win rates are laid out in a grid (as seen in the ACUTE-Eval paper).
3. An HTML file of visualized conversations, restricted to those where workers provided a reason for their choice.
4. An HTML file of ALL visualized conversations.

**NOTE**: the `analysis.py` file can be run on its own as long as you specify the ACUTE-Eval `run_id`, whether it was a sandbox run, and whether it was a q-function eval.
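
As an illustration only, a standalone invocation would look something like the following (the flag names here are assumptions; check the argument parser in `analysis.py` for the exact arguments):

    python parlai/mturk/tasks/acute_eval/analysis.py --run-id <run_id> --is-sandbox true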

2 changes: 1 addition & 1 deletion parlai/mturk/tasks/acute_eval/analysis.py
@@ -541,7 +541,7 @@ def _path(filename):
        with open(_path('grid.csv'), 'w') as f:
            f.write(self.get_win_fractions().to_csv(index=True))
        _print_progress(
-           f"To visualize grid result, try cat {_path('grid.csv')} | column -t -s, | less -S"
+           f"To visualize grid result, try cat {_path('grid.csv')} | sed 's/,/ ,/g' | column -t -s, | less -S"
        )

        # Render conversations if valid pairings filepath provided
46 changes: 46 additions & 0 deletions parlai/mturk/tasks/acute_eval/configs.py
@@ -0,0 +1,46 @@
#!/usr/bin/env python3

# Copyright (c) Facebook, Inc. and its affiliates.
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

"""
Model Configuration file for Fast ACUTE Eval.
CONFIG: Dict[str, Dict]
- maps ids to their appropriate options
- for models, please only include options that you would specify on the command line
"""
import os
from typing import Dict

ROOT_DIR = '/checkpoint/parlai/acute_evals/'
CONFIG: Dict[str, Dict] = {
    'example_model_1': {
        'model_file': 'zoo:tutorial_transformer_generator/model',
        'model': 'transformer/generator',
        # general args
        'batchsize': 1,
        'skip_generation': False,
        'interactive_mode': False,
        'beam_size': 3,
        'beam_min_length': 3,
        'inference': 'beam',
        'beam_block_ngram': 3,
        'beam_context_block_ngram': 3,
    },
    'example_model_2': {
        'model_file': 'zoo:tutorial_transformer_generator/model',
        'model': 'transformer/generator',
        # general args
        'batchsize': 1,
        'skip_generation': False,
        'interactive_mode': False,
        'inference': 'nucleus',
        'topp': 0.9,
    },
    'example_model_log': {
        'log_path': f"{os.path.dirname(os.path.realpath(__file__))}/example/chat_log.jsonl"
    },
    'example_dataset': {'task': 'convai2', 'prepended_context': True},
}
140 changes: 140 additions & 0 deletions parlai/mturk/tasks/acute_eval/dump_task_to_acute_format.py
@@ -0,0 +1,140 @@
#!/usr/bin/env python3

# Copyright (c) Facebook, Inc. and its affiliates.
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
"""
Convert a ParlAI teacher to acute-eval format.
Examples
--------
.. code-block:: shell
py parlai/mturk/tasks/acute_eval/dump_task_to_acute_format.py -t convai2
"""

from parlai.core.params import ParlaiParser
from parlai.agents.repeat_label.repeat_label import RepeatLabelAgent
from parlai.core.worlds import create_task
from parlai.utils.conversations import Conversations
from parlai.utils.misc import TimeLogger
import random
import tempfile


def setup_args():
    """
    Set up conversion args.
    """
    parser = ParlaiParser()
    parser.add_argument(
        '-n',
        '--num-episodes',
        default=-1,
        type=int,
        help='Total number of episodes to convert, -1 to convert all examples',
    )
    parser.add_argument(
        '-of',
        '--outfile',
        default=None,
        type=str,
        help='Output file where to save, by default will be created in /tmp',
    )
    parser.add_argument(
        '-s1id', '--speaker-0-id', type=str, help='Speaker id of agent who speaks first'
    )
    parser.add_argument(
        '-s2id',
        '--speaker-1-id',
        type=str,
        help='Speaker id of agent who speaks second',
    )
    parser.add_argument(
        '--prepended-context',
        type='bool',
        default=False,
        help='specify if the context is prepended to the first act',
    )
    parser.add_argument('-ltim', '--log-every-n-secs', type=float, default=10)
    parser.set_defaults(datatype='train:ordered')

    return parser


def dump_data(opt):
    """
    Dump task data to ACUTE-Eval format.
    """
    # create repeat label agent and assign it to the specified task
    agent = RepeatLabelAgent(opt)
    world = create_task(opt, agent)
    task = opt.get('task')
    speaker_0_id = opt.get('speaker_0_id') or f'{task}_as_human'
    speaker_1_id = opt.get('speaker_1_id') or f'{task}_as_model'
    if opt['outfile'] is None:
        outfile = tempfile.mkstemp(
            prefix='{}_{}_'.format(opt['task'], opt['datatype']), suffix='.txt'
        )[1]
    else:
        outfile = opt['outfile']

    num_episodes = (
        world.num_episodes()
        if opt['num_episodes'] == -1
        else min(opt['num_episodes'], world.num_episodes())
    )
    log_timer = TimeLogger()

    print(f'[ starting to convert, saving output to {outfile} ]')
    dialogues = []
    for _ in range(num_episodes):
        episode = []
        episode_done = False
        while not episode_done:
            world.parley()
            acts = world.get_acts()
            text = acts[0].get('text')
            split_text = text.split('\n')
            label = random.choice(
                acts[0].get('labels', acts[0].pop('eval_labels', None))
            )
            if not episode and opt.get('prepended_context'):
                # first turn: split the prepended context into its own turn
                context = split_text[:-1]
                text = split_text[-1]
                context_turn = [
                    {'text': context, 'episode_done': False, 'id': 'context'}
                    for _ in range(2)
                ]
                episode.append(context_turn)
            turn = [
                {'text': text, 'episode_done': False, 'id': speaker_0_id},
                {'text': label, 'episode_done': False, 'id': speaker_1_id},
            ]
            episode.append(turn)
            if acts[0].get('episode_done', False):
                episode[-1][-1]['episode_done'] = True
                episode_done = True
        dialogues.append(episode)

        if log_timer.time() > opt['log_every_n_secs']:
            text, _log = log_timer.log(world.total_parleys, world.num_examples())
            print(text)

        if world.epoch_done():
            break

    Conversations.save_conversations(dialogues, outfile, opt)


def main():
    random.seed(42)
    # Get command line arguments
    parser = setup_args()
    opt = parser.parse_args()
    dump_data(opt)


if __name__ == '__main__':
    main()
