[ACUTE-Eval] Fast Acute OSS Part 2 - Everything Else (facebookresearch#2573)

* fast acute OSS

* autoformat

* readme changes

* remove todos

* typing

* incorporate matchups-per-pair arg

* Update README.md

* readme update
klshuster authored Apr 29, 2020
1 parent a7d9100 commit e3c2afa
Showing 7 changed files with 1,017 additions and 7 deletions.
109 changes: 103 additions & 6 deletions parlai/mturk/tasks/acute_eval/README.md
@@ -118,12 +118,6 @@ The title, description, and keywords of the task as shown on MTurk default to va

A comprehensive list of settings specific to ACUTE-Eval can be found in `add_args()` in `run.py`. ParlAI MTurk arguments can be found in `~/ParlAI/parlai/core/params.py` under `add_mturk_args()`. For the arguments most likely to be useful for running ACUTE-Eval, see `example_script.py`:


## Creating the pairings file

Coming soon.


** **

# ACUTE-Eval Analysis
@@ -154,3 +148,106 @@ Where `</path/to/pairs/file>` is your pairings file from the ACUTE Eval run. Run
1. **all.html** - List of all conversations, indicating which was chosen as the winner by a turker.
2. **reason.html** - List of all conversations where reasons are provided by the turkers for why they chose a winner.

# Fast-ACUTE

We provide an all-in-one script that makes running ACUTE-Eval as smooth as possible.

The script combines three major steps of ACUTE-Eval into one simple command:

1. Generation (or compilation) of chat logs for given models;
2. Execution of ACUTE-Eval;
3. Analysis of ACUTE-Eval results.

## Setup Steps

### 1. Determine What You Will Be Evaluating; Populate Config.

This is an important step: do you have conversation logs between a model and a human? Would you like to evaluate model self-chat? Do you want to evaluate dataset logs?

Each of these options involves _slightly_ different preparation. However, each involves specifying a config.

In the `configs.py` file in this directory, you will find a `CONFIG` dictionary that maps a _unique_ identifier to appropriate configuration arguments; these arguments differ depending on what you will be evaluating.

*NOTE*: the `CONFIG` is _append only_, and all configs must have a *unique* identifier.

We enumerate a few of these options below.

#### Model self-chat

If you would like to evaluate a model chatting to itself, you simply specify the appropriate model parameters in the config. The parameters are any that you would need to specify on the command line, and include things like the model file, fixed candidates file, etc. You can see examples in the `example_model_1` and `example_model_2` configs.
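
For instance, a minimal sketch of a self-chat entry (the `my_generator` identifier is illustrative; the options mirror `example_model_1` in `configs.py`, and any options you would pass on the command line are fair game):

    'my_generator': {
        'model_file': 'zoo:tutorial_transformer_generator/model',
        'model': 'transformer/generator',
        'batchsize': 1,
        'skip_generation': False,
        'interactive_mode': False,
        'inference': 'beam',
        'beam_size': 3,
    },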

#### JSONL Logs

If you have logs in the appropriate JSONL format, as would be generated by the self-chat script, then all you need to specify is the `log_path`. You can see an example in the `example_model_log` config.

The appropriate JSONL format is one that can be read by ParlAI's [Conversations](https://github.com/facebookresearch/ParlAI/blob/master/parlai/utils/conversations.py) class. Note that the identifier in the config should match **EXACTLY** the `id` of the model in the conversations.
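
To make the format concrete, here is a sketch of producing one line of such a file from Python. The authoritative schema is whatever the Conversations class reads; the `dialog` layout of paired acts below is an assumption, and `example_model_log` stands in for your config identifier:

    import json

    # one conversation per line; each parley is a list of acts with 'id' and 'text'
    conversation = {
        'dialog': [
            [
                {'id': 'human_evaluator', 'text': 'hi! how are you?', 'episode_done': False},
                {'id': 'example_model_log', 'text': "i'm well, thanks for asking!", 'episode_done': False},
            ]
        ]
    }
    with open('chat_log.jsonl', 'a') as f:
        f.write(json.dumps(conversation) + '\n')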

#### Dataset

If you'd like to evaluate examples from a dataset available in ParlAI directly, simply specify the `task` in the config. You can see an example in the `example_dataset` config.
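
Such entries are as small as configs get; the `example_dataset` entry in `configs.py` (shown in full below) is simply:

    'example_dataset': {'task': 'convai2', 'prepended_context': True},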

### 1b. (Optional) Determine the Self-Chat Task You Will Use

If you will be evaluating models via self-chat, you will need to determine the self-chat task you will use to help generate the self-chats. This requires little work on your part other than identifying a task that is set up for self-chat, i.e., a task that has the appropriate worlds for conducting self-chat with the models. This is not strictly necessary, but you may want to introduce context, e.g., as in `convai2` or `blended_skill_talk`.

### 2. Run `fast_eval.py`

Now that you've set everything up, all you need to do is run one of the following commands.

If you want to compare a set of models in round-robin fashion, you would run:

python parlai/mturk/tasks/acute_eval/fast_eval.py --ids <comma-separated list of config identifiers>

If you want multiple model comparisons, but do not want to compare ALL models with each other, you would run:

python parlai/mturk/tasks/acute_eval/fast_eval.py --id-pairs <comma-separated, colon-delimited list of config identifiers>

The ids specified for each of these flags correspond to entries in the `CONFIG`.

If you are running self-chat, you can optionally specify a seed task to use for self-chat with `-t <self_chat_task>`.

A few examples are as follows:

python parlai/mturk/tasks/acute_eval/fast_eval.py --ids example_model_1,example_model_2,example_model_log,example_dataset -t blended_skill_talk

python parlai/mturk/tasks/acute_eval/fast_eval.py --id-pairs example_model_1:example_model_2,example_model_1:example_model_log,example_dataset:example_model_2 -t blended_skill_talk

When you are ready to run a **LIVE** ACUTE-Eval, please specify `--live-acute true`.

#### Onboarding

The default onboarding dialogue pair is in `example/onboarding.json`. We recommend using a different onboarding example, as the one provided is quite easy.

To use a custom onboarding path, specify `--onboarding-path` when running `fast_eval.py`. The onboarding file should be a JSONL file where each line is a JSON dict consisting of a pair of dialogues to evaluate, with `is_onboarding` set to True; a sketch follows below.
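
As a rough sketch, building one such onboarding line from Python might look like the following (only `is_onboarding` is documented here; the remaining key names and the two placeholder dialogues are assumptions, so mirror `example/onboarding.json` for the exact schema):

    import json

    # each dialogue is a list of turns with speaker ids and text (placeholder content)
    dialogue_a = [{'id': 'model_a', 'text': 'hello there!'}]
    dialogue_b = [{'id': 'model_b', 'text': 'hi, how are you today?'}]

    onboarding_pair = {
        'is_onboarding': True,  # documented above: marks this pair as onboarding
        'speakers_to_eval': ['model_a', 'model_b'],  # assumed key name
        'correct_answer': 'model_b',  # assumed key name
        'dialogue_dialogues': [dialogue_a, dialogue_b],  # assumed key name
    }
    with open('my_onboarding.jsonl', 'w') as f:
        f.write(json.dumps(onboarding_pair) + '\n')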

## Script Execution

The script operates in three phases:

### Phase 1: Compile Chat Logs

The script will first compile the chat logs for each identifier specified on the command line.

For `model`s, the script will run self-chat (if a self-chat log does not already exist); for `log`s, the script will simply load the log from disk; and for `task`s, the script will convert the task into the appropriate format.

Self-chats are saved to `PARLAI_PATH/data/acute_evals/self_chats/`.
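
In pseudocode form, the dispatch is roughly the following (a simplified sketch, not the actual `fast_eval.py` internals; the helper names are illustrative):

    def compile_chat_log(identifier: str, config: dict) -> str:
        """Return the path to a Conversations-format chat log for this identifier."""
        entry = config[identifier]
        if 'log_path' in entry:
            # log: simply use the conversations already on disk
            return entry['log_path']
        if 'task' in entry:
            # task: convert the dataset into the appropriate format
            return dump_task_to_acute_format(entry['task'])
        # model: run self-chat, unless a cached self-chat log already exists
        return run_self_chat_if_missing(entry)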

### Phase 2: ACUTE-Eval

The script will then prepare the conversation-pairs file (saved to `PARLAI_PATH/data/pairings_files/`, with a name unique to the chat files used to create it) and run ACUTE-Eval with the appropriate arguments.

Upon subsequent runs with the same configuration of `--ids` or `--id-pairs`, you will have the option to re-use a pairings file or to regenerate it.

### Phase 3: Analysis

After ACUTE-Eval finishes, the script will analyze the results and save the relevant outputs to `PARLAI_PATH/data/acute_evals/acute_results/<date>/<pairings_file>/`.

Four results will be generated:

1. A CSV file of significance results, showing the win rates of model pairs with p-values.
2. A CSV file of grid results, where the model comparisons and win rates are laid out in a grid (as seen in the ACUTE-Eval paper).
3. An HTML file of visualized conversations, restricted to those where workers provided a reason for their choice.
4. An HTML file of ALL visualized conversations.

**NOTE**: the `analysis.py` file can be run on its own as long as you specify the ACUTE-Eval `run_id`, whether it was a sandbox run, and whether it was a q-function eval.
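
As an illustration only, a standalone invocation would look something like the following (the flag names here are assumptions; check the argument parser in `analysis.py` for the exact arguments):

    python parlai/mturk/tasks/acute_eval/analysis.py --run-id <run_id> --is-sandbox true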

2 changes: 1 addition & 1 deletion parlai/mturk/tasks/acute_eval/analysis.py
@@ -541,7 +541,7 @@ def _path(filename):
        with open(_path('grid.csv'), 'w') as f:
            f.write(self.get_win_fractions().to_csv(index=True))
        _print_progress(
-           f"To visualize grid result, try cat {_path('grid.csv')} | column -t -s, | less -S"
+           f"To visualize grid result, try cat {_path('grid.csv')} | sed 's/,/ ,/g' | column -t -s, | less -S"
        )

        # Render conversations if valid pairings filepath provided
46 changes: 46 additions & 0 deletions parlai/mturk/tasks/acute_eval/configs.py
@@ -0,0 +1,46 @@
#!/usr/bin/env python3

# Copyright (c) Facebook, Inc. and its affiliates.
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

"""
Model Configuration file for Fast ACUTE Eval.
CONFIG: Dict[str, Dict]
- maps ids to their appropriate options
- for models, please only include options that you would specify on the command line
"""
import os
from typing import Dict

ROOT_DIR = '/checkpoint/parlai/acute_evals/'
CONFIG: Dict[str, Dict] = {
    'example_model_1': {
        'model_file': 'zoo:tutorial_transformer_generator/model',
        'model': 'transformer/generator',
        # general args
        'batchsize': 1,
        'skip_generation': False,
        'interactive_mode': False,
        'beam_size': 3,
        'beam_min_length': 3,
        'inference': 'beam',
        'beam_block_ngram': 3,
        'beam_context_block_ngram': 3,
    },
    'example_model_2': {
        'model_file': 'zoo:tutorial_transformer_generator/model',
        'model': 'transformer/generator',
        # general args
        'batchsize': 1,
        'skip_generation': False,
        'interactive_mode': False,
        'inference': 'nucleus',
        'topp': 0.9,
    },
    'example_model_log': {
        'log_path': f"{os.path.dirname(os.path.realpath(__file__))}/example/chat_log.jsonl"
    },
    'example_dataset': {'task': 'convai2', 'prepended_context': True},
}
140 changes: 140 additions & 0 deletions parlai/mturk/tasks/acute_eval/dump_task_to_acute_format.py
@@ -0,0 +1,140 @@
#!/usr/bin/env python3

# Copyright (c) Facebook, Inc. and its affiliates.
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
"""
Convert a ParlAI teacher to acute-eval format.
Examples
--------
.. code-block:: shell
py parlai/mturk/tasks/acute_eval/dump_task_to_acute_format.py -t convai2
"""

from parlai.core.params import ParlaiParser
from parlai.agents.repeat_label.repeat_label import RepeatLabelAgent
from parlai.core.worlds import create_task
from parlai.utils.conversations import Conversations
from parlai.utils.misc import TimeLogger
import random
import tempfile


def setup_args():
    """
    Set up conversion args.
    """
    parser = ParlaiParser()
    parser.add_argument(
        '-n',
        '--num-episodes',
        default=-1,
        type=int,
        help='Total number of episodes to convert, -1 to convert all examples',
    )
    parser.add_argument(
        '-of',
        '--outfile',
        default=None,
        type=str,
        help='Output file where to save, by default will be created in /tmp',
    )
    parser.add_argument(
        '-s1id', '--speaker-0-id', type=str, help='Speaker id of agent who speaks first'
    )
    parser.add_argument(
        '-s2id',
        '--speaker-1-id',
        type=str,
        help='Speaker id of agent who speaks second',
    )
    parser.add_argument(
        '--prepended-context',
        type='bool',
        default=False,
        help='specify if the context is prepended to the first act',
    )
    parser.add_argument('-ltim', '--log-every-n-secs', type=float, default=10)
    parser.set_defaults(datatype='train:ordered')

    return parser


def dump_data(opt):
    """
    Dump task data to ACUTE-Eval format.
    """
    # create repeat label agent and assign it to the specified task
    agent = RepeatLabelAgent(opt)
    world = create_task(opt, agent)
    task = opt.get('task')
    speaker_0_id = opt.get('speaker_0_id') or f'{task}_as_human'
    speaker_1_id = opt.get('speaker_1_id') or f'{task}_as_model'
    if opt['outfile'] is None:
        outfile = tempfile.mkstemp(
            prefix='{}_{}_'.format(opt['task'], opt['datatype']), suffix='.txt'
        )[1]
    else:
        outfile = opt['outfile']

    num_episodes = (
        world.num_episodes()
        if opt['num_episodes'] == -1
        else min(opt['num_episodes'], world.num_episodes())
    )
    log_timer = TimeLogger()

    print(f'[ starting to convert, saving output to {outfile} ]')
    dialogues = []
    for _ in range(num_episodes):
        episode = []
        episode_done = False
        while not episode_done:
            world.parley()
            acts = world.get_acts()
            text = acts[0].get('text')
            split_text = text.split('\n')
            label = random.choice(
                acts[0].get('labels', acts[0].pop('eval_labels', None))
            )
            if not episode and opt.get('prepended_context'):
                # first turn: split the prepended context into its own turn
                context = split_text[:-1]
                text = split_text[-1]
                context_turn = [
                    {'text': context, 'episode_done': False, 'id': 'context'}
                    for _ in range(2)
                ]
                episode.append(context_turn)
            turn = [
                {'text': text, 'episode_done': False, 'id': speaker_0_id},
                {'text': label, 'episode_done': False, 'id': speaker_1_id},
            ]
            episode.append(turn)
            if acts[0].get('episode_done', False):
                episode[-1][-1]['episode_done'] = True
                episode_done = True
        dialogues.append(episode)

        if log_timer.time() > opt['log_every_n_secs']:
            text, _log = log_timer.log(world.total_parleys, world.num_examples())
            print(text)

        if world.epoch_done():
            break

    Conversations.save_conversations(dialogues, outfile, opt)


def main():
    random.seed(42)
    # Get command line arguments
    parser = setup_args()
    opt = parser.parse_args()
    dump_data(opt)


if __name__ == '__main__':
    main()
