We provide detailed evaluation methods for MLVU, covering both multiple-choice tasks and generation tasks.
First, if you want to benchmark your model on MLVU, you can start from our template test code:
```bash
python multiple_choice_evaluation/choice_bench.py
```
Load your model into this template to run inference and evaluate the multiple-choice performance online.
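Schematically, the online evaluation prompts the model with each question plus its lettered options, extracts the predicted letter, and checks it against the ground truth. Below is a minimal sketch, assuming annotation fields named `question`, `candidates`, `video`, and `answer` (the real names live in `choice_bench.py`) and a placeholder `model_infer` hook for your model:

```python
import json
import re

def model_infer(video_path: str, prompt: str) -> str:
    """Placeholder: replace with your model's inference call."""
    raise NotImplementedError

def evaluate_choices(anno_path: str) -> float:
    with open(anno_path) as f:
        samples = json.load(f)
    correct = 0
    for s in samples:
        # Format candidates as lettered options, e.g. "(A) ... (B) ...".
        options = "\n".join(
            f"({chr(ord('A') + i)}) {c}" for i, c in enumerate(s["candidates"])
        )
        prompt = f"{s['question']}\n{options}\nAnswer with the option's letter."
        reply = model_infer(s["video"], prompt).upper()
        match = re.search(r"[A-D]", reply)
        if match:
            idx = ord(match.group()) - ord("A")
            if idx < len(s["candidates"]) and s["candidates"][idx] == s["answer"]:
                correct += 1
    return correct / len(samples)
```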
- Step 1: Get the inference results for Sub-Scene Captioning and Video Summarization.
```bash
python generation_evaluation/open_bench.py
```
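Schematically, step 1 just runs your model over the generation annotations and dumps the predictions to the JSON files consumed in step 2. A minimal sketch, reusing the placeholder `model_infer` hook above; the annotation filenames and record fields here are illustrative, and the real layout is defined in `open_bench.py`:

```python
import json

def run_generation(anno_path: str, out_path: str) -> None:
    with open(anno_path) as f:
        samples = json.load(f)
    # One prediction record per video; field names are assumptions.
    preds = [
        {"video_name": s["video"], "Q": s["question"],
         "pred": model_infer(s["video"], s["question"])}
        for s in samples
    ]
    with open(out_path, "w") as f:
        json.dump(preds, f, indent=2)

# Produces the files passed as --pred_path in step 2:
# run_generation("ssc_anno.json", "/your_path/subplot_all.json")
# run_generation("summary_anno.json", "/your_path/summary_all.json")
```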
- Step 2: Run the evaluation for the generation tasks. For Sub-Scene Captioning, set `--pred_path` to your step-1 output and `--output_dir` to a results folder, then run:
```bash
python evaluate_ssc.py --pred_path /your_path/subplot_all.json --output_dir /eval_subplot --output_json /eval_subplot.json
python calculate.py --path /eval_subplot
```
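For intuition, the `calculate.py` step amounts to averaging the per-sample scores that the evaluation step wrote into `--output_dir`. A rough sketch, assuming one JSON result file per sample with a numeric `score` field (check the actual output files for the real key names):

```python
import glob
import json
import os

def average_score(result_dir: str) -> float:
    """Average the per-sample scores written by the evaluation step."""
    scores = []
    for path in glob.glob(os.path.join(result_dir, "*.json")):
        with open(path) as f:
            scores.append(float(json.load(f)["score"]))
    return sum(scores) / len(scores)

# print(average_score("/eval_subplot"))
```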
For Video Summarization, likewise set `--pred_path` to your step-1 output and choose an `--output_dir`, then run:
```bash
python evaluate_summary.py --pred_path /your_path/summary_all.json --output_dir /eval_summary --output_json /eval_summary.json
```
Then aggregate the scores, pointing `--path` at the same `output_dir`:
```bash
python calculate_sum.py --path /eval_summary
```
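If you prefer to chain the two summarization commands from Python, the following is equivalent to running them by hand (same paths as above):

```python
import subprocess

# Score each summary prediction, then aggregate the results.
subprocess.run([
    "python", "evaluate_summary.py",
    "--pred_path", "/your_path/summary_all.json",
    "--output_dir", "/eval_summary",
    "--output_json", "/eval_summary.json",
], check=True)
subprocess.run(["python", "calculate_sum.py", "--path", "/eval_summary"], check=True)
```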
Take VideoChat2 as an example:
- Step 1: Download the original model and weights from VideoChat2.
- Step 2: Put choice_bench.py and open_bench.py into the same folder as demo.py.
- Step 3: Modify the MLVU path in choice_bench.py and open_bench.py (see the sketch below).
- Step 4: Run the inference and online evaluation for the multiple-choice tasks.
- Step 5: Run the inference and evaluation for the generation tasks.
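Concretely, steps 2-3 mean wiring VideoChat2's demo.py inference code into the `model_infer` hook that the benchmark scripts call. A hedged sketch; `load_videochat2` and `videochat2_answer` are hypothetical stand-ins for the model-building and chat code you would copy from demo.py:

```python
MLVU_ROOT = "/path/to/MLVU"  # step 3: set this path in choice_bench.py and open_bench.py

def load_videochat2():
    """Step 1: build the model and load the downloaded VideoChat2 weights (hypothetical)."""
    raise NotImplementedError("copy the model-construction code from demo.py")

def videochat2_answer(model, video_path: str, prompt: str) -> str:
    """One round of video QA, mirroring demo.py's chat loop (hypothetical)."""
    raise NotImplementedError("copy the inference code from demo.py")

_model = None  # built lazily so importing this sketch has no side effects

def model_infer(video_path: str, prompt: str) -> str:
    global _model
    if _model is None:
        _model = load_videochat2()
    return videochat2_answer(_model, video_path, prompt)
```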