Merge pull request mlcommons#700 from georgelyuan/master
compliance test updates
guschmue authored Aug 26, 2020
2 parents 198b530 + d1b0cc3 commit 4a75c56
Showing 3 changed files with 38 additions and 8 deletions.
17 changes: 13 additions & 4 deletions compliance/nvidia/TEST01/README.md
@@ -21,14 +21,21 @@ The subset of sample results chosen to be written to the accuracy JSON is de

There is an audit.config file for each individual benchmark, located in the benchmark subdirectories in this test directory. The `accuracy_log_sampling_target` value for each benchmark is chosen taking into consideration the performance sample count and size of the inference result. If performance with sampling enabled cannot meet the pass threshold set in verify_performance.py, `accuracy_log_sampling_target` may be reduced to check that performance approaches the submission score.
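For illustration only, an audit.config entry for this knob follows LoadGen's plain `key = value` override format, along these lines (the value shown is made up; the per-benchmark files in this directory are the authoritative source):

    # hypothetical sketch of a TEST01 audit.config entry -- use the file provided
    # in the benchmark subdirectory, which also sets the other required overrides
    *.*.accuracy_log_sampling_target = 4096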


## Log size
3d-unet is unique in that its inference result output per-sample is drastically larger than that of other benchmarks. For all other benchmarks, the accuracy JSON results can be checked using python JSON libraries, which can be enabled by providing `--fastmode` to the run_verification.py script. For 3d-unet, using fastmode will result in verify_performance.py running out of memory, so the alternative way of using UNIX-based commandline utilities must be used by not supplying the `--fastmode` switch.
In v0.7, the newly added workloads can generate significantly more output data than the workloads used in v0.5. By default, the accuracy script checks the accuracy JSON files using Python JSON libraries. If these checks run out of memory, a fallback mode of operation that relies on UNIX-based commandline utilities can be enabled with the `--unixmode` switch.

## Prerequisites
This script works best with Python 3.3 or later. For 3d-unet, the accuracy verification script requires the `wc`, `sed`, `awk`, `head`, `tail`, `grep`, and `md5sum` UNIX commandline utilities.
This script works best with Python 3.3 or later. For `--unixmode`, the accuracy verification script also requires the `wc`, `sed`, `awk`, `head`, `tail`, `grep`, and `md5sum` UNIX commandline utilities.
This script also assumes that the submission runs have already been completed and that the results comply with the submission directory structure described in [https://github.com/mlperf/policies/blob/master/submission_rules.adoc#562-inference](https://github.com/mlperf/policies/blob/master/submission_rules.adoc#562-inference)
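As a quick sanity check (a sketch only, assuming a POSIX shell), the utilities needed by `--unixmode` can be verified up front:

    # report any of the required command-line utilities missing from PATH
    for cmd in wc sed awk head tail grep md5sum; do
        command -v "$cmd" > /dev/null || echo "missing: $cmd"
    done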
## Non-determinism
Note that under MLPerf inference rules, certain forms of non-determinism are acceptable, which can cause inference results to differ across runs. It is foreseeable that the results obtained during the accuracy run can differ from those obtained during the performance run, which will cause the accuracy checking script to report failure. Test failure will automatically result in an objection, but the objection can be overturned by comparing the quality of the results generated in performance mode to that obtained in accuracy mode. This can be done by using the accuracy measurement scripts provided as part of the repo to ensure that the accuracy score meets the target. An example is provided for GNMT in the gnmt folder.
Under MLPerf inference rules, certain forms of non-determinism are acceptable, which can cause inference results to differ across runs. It is foreseeable that the results obtained during the accuracy run can differ from those obtained during the performance run, which will cause the accuracy checking script to report failure. Test failure will automatically result in an objection, but the objection can be overruled by providing proof of the quality of inference results.
`create_accuracy_baseline.sh` is provided for this purpose. By running:

bash ./create_accuracy_baseline.sh <path to mlperf_log_accuracy.json from the accuracy run> <path to mlperf_log_accuracy.json from the compliance test run>

this script creates a baseline accuracy log called `mlperf_log_accuracy_baseline.json`, containing only the subset of results from the accuracy run's `mlperf_log_accuracy.json` that corresponds to the QSL indices present in the compliance test run's `mlperf_log_accuracy.json`. This provides an apples-to-apples accuracy-log comparison between the accuracy run and the compliance run.
The submitter can then run the reference accuracy script on `mlperf_log_accuracy_baseline.json` and the compliance test run's `mlperf_log_accuracy.json` and report the F1/mAP/DICE/WER/Top1%/AUC score.
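As a sketch (the paths below are hypothetical, and the placeholder stands in for whichever reference accuracy script applies to the benchmark under test):

    # build the baseline from the two accuracy logs
    bash ./create_accuracy_baseline.sh \
        results/bert/Offline/accuracy/mlperf_log_accuracy.json \
        compliance/bert/Offline/TEST01/mlperf_log_accuracy.json
    # score the baseline and the compliance log with the benchmark's reference
    # accuracy script (script name and arguments are benchmark-specific), then
    # compare the two reported scores
    python3 <reference accuracy script> mlperf_log_accuracy_baseline.json
    python3 <reference accuracy script> compliance/bert/Offline/TEST01/mlperf_log_accuracy.json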

## Instructions

@@ -37,7 +44,9 @@ Run test with the provided audit.config in the corresponding benchmark subdirect

### Part II
Run the verification script:
`python3 run_verification.py -r RESULTS_DIR -c COMPLIANCE_DIR -o OUTPUT_DIR [--dtype {byte,float32,int32,int64}] [--fastmode]`

python3 run_verification.py -r RESULTS_DIR -c COMPLIANCE_DIR -o OUTPUT_DIR [--dtype {byte,float32,int32,int64}] [--unixmode]



- RESULTS_DIR: Specifies the path to the corresponding results
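A worked example with hypothetical directory names (only the flags come from the usage line above):

    python3 run_verification.py \
        -r inference_results_v0.7/closed/ACME/results/T4x8/resnet/Offline \
        -c compliance_run_output/resnet/Offline \
        -o inference_results_v0.7/closed/ACME/compliance/T4x8/resnet/Offline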
19 changes: 19 additions & 0 deletions compliance/nvidia/TEST01/create_accuracy_baseline.sh
@@ -0,0 +1,19 @@
#!/bin/bash
# Usage:
# 1) bash ./create_accuracy_baseline.sh <accuracy_accuracy_log_file> <perf_accuracy_log_file>
# 2) python inference/v0.5/translation/gnmt/tensorflow/process_accuracy.py <perf_accuracy_log_file>
# 3) python inference/v0.5/translation/gnmt/tensorflow/process_accuracy.py on generated baseline
# 4) Compare BLEU scores

accuracy_log=$1
perf_log=$2
patterns="unique_patterns.txt"

# name the output after the accuracy log, e.g. mlperf_log_accuracy.json -> mlperf_log_accuracy_baseline.json
accuracy_baseline=$(basename -- "$accuracy_log")
accuracy_baseline="${accuracy_baseline%.*}"_baseline.json

# collect the unique qsl_idx patterns that appear in the compliance run's accuracy log
cut -d ':' -f 2,3 ${perf_log} | cut -d ',' -f 2- | sort | uniq | grep qsl > ${patterns}

# keep only the matching entries from the accuracy run's log and re-wrap them as a JSON array
echo '[' > ${accuracy_baseline}
grep -f ${patterns} ${accuracy_log} >> ${accuracy_baseline}
sed -i '$ s/,$/]/g' ${accuracy_baseline}
rm ${patterns}

echo "Created a baseline accuracy file: ${accuracy_baseline}"
10 changes: 6 additions & 4 deletions compliance/nvidia/TEST04-A/run_verification.py
@@ -40,17 +40,17 @@ def main():
parser.add_argument(
"--test4A_dir", "-a",
help="Specifies the path to the directory containing the logs from the TEST04-A audit config run.",
default=""
required=True
)
parser.add_argument(
"--test4B_dir", "-b",
help="Specifies the path to the directory containing the logs from the TEST04-B audit config test run.",
default=""
required=True
)
parser.add_argument(
"--output_dir", "-o",
help="Specifies the path to the output directory where compliance logs will be uploaded to, i.e. inference_results_v0.7/closed/NVIDIA/compliance/T4x8/resnet/Offline.",
default=""
required=True
)
parser.add_argument(
"--dtype", default="byte", choices=["byte", "float32", "int32", "int64"], help="data type of the label (only needed in fastmode")
@@ -67,7 +67,9 @@ def main():
dtype = args.dtype

# run verify performance
verify_performance_command = "python3 verify_test4_performance.py -u " + test4A_dir + "/mlperf_log_summary.txt" + " -s " + test4B_dir + "/mlperf_log_summary.txt | tee verify_performance.txt"
verify_performance_binary = os.path.join(os.path.dirname(__file__),"verify_test4_performance.py")

verify_performance_command = "python3 " + verify_performance_binary + " -u " + test4A_dir + "/mlperf_log_summary.txt" + " -s " + test4B_dir + "/mlperf_log_summary.txt | tee verify_performance.txt"
try:
os.system(verify_performance_command)
except:
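A sketch of how the updated script is invoked (log directory paths are hypothetical; the flags and the output-directory shape come from the argument definitions above):

    python3 run_verification.py \
        -a logs/resnet/Offline/TEST04-A \
        -b logs/resnet/Offline/TEST04-B \
        -o inference_results_v0.7/closed/NVIDIA/compliance/T4x8/resnet/Offline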
