We have prepared some example queries for running the auto evaluation. You can run them by following the steps below.

- Complete the `evaluator_config.json` (referring to the schema in `evaluator_config_template.json`) under the `auto_eval` folder and the `taskweaver_config.json` under the `taskweaver` folder.
- `cd` to the `auto_eval` folder.
- Run the command below to start the auto evaluation for a single case:

  ```bash
  python taskweaver_eval.py -m single -f cases/init_say_hello.yaml
  ```

- Run the command below to start the auto evaluation for multiple cases:

  ```bash
  python taskweaver_eval.py -m batch -f ./cases
  ```
The script accepts the following parameters:

- `-m`/`--mode`: specifies the evaluation mode, which can be either `single` or `batch`.
- `-f`/`--file`: specifies the path to the test case file or to a directory containing test case files.
- `-r`/`--result`: specifies the path to the result file for batch evaluation mode. This parameter is only valid in batch mode. The default value is `sample_case_results.csv`.
- `-t`/`--threshold`: specifies the interrupt threshold for multi-round chat evaluation. When the evaluation score of a round falls below this threshold, the evaluation is interrupted. The default value is `None`, which means that no interrupt threshold is used.
- `-flush`/`--flush`: specifies whether to flush the result file. This parameter is only valid in batch mode. The default value is `False`, which means that previously evaluated cases will not be evaluated again. If you want to re-evaluate the cases, set this parameter to `True`.
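As an illustration, the options can be combined in a single batch run. The result file name and threshold value below are placeholders, and the exact way a boolean is passed to `-flush` depends on how the script parses its arguments, so treat this as a sketch:

```bash
python taskweaver_eval.py -m batch -f ./cases -r my_results.csv -t 0.6 -flush True
```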
A test case is a YAML file that contains the following fields:

- `config_var` (optional): set the config values for TaskWeaver if needed.
- `app_dir`: the path to the project directory for TaskWeaver.
- `eval_query` (a list, so multiple queries are supported):
  - `user_query`: the user query to be evaluated.
  - `scoring_points`:
    - `score_point`: describes a criterion for the agent's response.
    - `weight`: the value that determines how important that criterion is.
    - `eval_code` (optional): evaluation code that will be run to determine whether the criterion is met. In this case, the scoring point will not be evaluated using the LLM.
    - ...
  - `user_query`: the next user query to be evaluated (multi-round cases simply list more entries).
  - `scoring_points`:
    - ...
- `post_index`: the index of the post in the `post_list` of the response `round` that should be evaluated. If it is set to `null`, the entire `round` will be evaluated.
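Putting these fields together, a minimal case file might look like the sketch below. All values are illustrative placeholders; see the shipped files under `cases/` (for example, `cases/init_say_hello.yaml`) for authoritative examples:

```yaml
# Hypothetical test case; the values are placeholders.
app_dir: ../project/              # path to the TaskWeaver project directory
config_var:                       # optional config overrides for this case
  session.roles: ["planner", "code_interpreter"]
eval_query:
  - user_query: hello
    scoring_points:
      - score_point: "The agent greets the user"
        weight: 1
post_index: null                  # evaluate the entire response round
```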
Note: for the `eval_code` field, you can use the variable `agent_response` in your evaluation code snippet. It can be either a `Round` or a `Post` JSON object, as determined by the `post_index` field.
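For example, assuming `post_index` selects a single post and that the resulting `Post` object exposes a `message` field (an assumption; inspect an actual `Round`/`Post` object to confirm the field names), a scoring point could be checked with a boolean Python expression like this:

```yaml
- score_point: "The agent's reply mentions 'hello'"
  weight: 1
  # Hypothetical: assumes agent_response is a Post JSON object with a
  # "message" key and that eval_code is evaluated as a boolean expression.
  eval_code: "'hello' in agent_response['message'].lower()"
```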