Add Model Tuner front-end tool. (pytorch#3816)
Summary:

**Summary**
Add a front-end tool for tuning (calibrating) the quantization parameters of a model. The
parameters are tuned using the model accuracy as the optimization metric.

**Motivation**
When using very aggressive quantization schemes (for example "SymmetricWithPower2Scale"), the
accuracy difference between the floating-point model and the quantized model can be quite large
(up to tens of percent). One such example is an internally designed model which has:
- 99 % accuracy for the floating-point model
- 81 % accuracy for the (initial) quantized model using the SymmetricWithPower2Scale schema
- 98 % accuracy for the (final) quantized model after tuning

Attached here is the console output of this tool after running it for the model:
- [ModelTuner_Log.txt](https://github.com/pytorch/glow/files/3886201/ModelTuner_Log.txt)

**Algorithm**
The quantization parameters are initially chosen such that no saturation occurs (the quantized
range includes the min/max of the profile). For some tensors whose histogram exhibits outlier
values it might be better to use quantization parameters which saturate the outliers, with the
benefit of a smaller quantization step for the bulk of the histogram. The tuning algorithm varies
the **scale** quantization parameter: for each node it sequentially tries new values around the
original one and picks the one which provides the best accuracy (it tries original_scale,
original_scale/2 and original_scale/4).

One might say there is a philosophical problem with this approach: the algorithm over-fits the
quantization parameters to a given dataset. From a practical point of view, however, it is better
to over-fit by a couple of percent than to under-fit by tens of percent with the quantization
mechanism.

**Refactorings**
- Refactored the Loader class a little in order to expose more information through its API to the
tools which use the Loader.
- Refactored Base.cpp to provide a function for validating the quantization parameters.
- Refactored ProtobufLoader.cpp to add a function to retrieve the unique input placeholder for a
model with a single input placeholder.

**Documentation**
doc/ModelTuner.md

**Test Plan**
None

Pull Request resolved: pytorch#3816

Differential Revision: D19166714

Pulled By: jfix71

fbshipit-source-id: cf1caf51abd4ac8b5ba90a96e937487131389ddf
1 parent de133ee · commit 41d3e30 · Showing 14 changed files with 878 additions and 114 deletions.
doc/ModelTuner.md
@@ -0,0 +1,115 @@
## ModelTuner

This front-end tool is used for tuning (calibrating) the quantization parameters of a model.
During the quantization flow, the model is first profiled by gathering the dynamic range (min/max)
for each tensor in the graph. Next, the quantization parameters are chosen in such a way that, for
the given profile, no saturation occurs. Although this makes sense at first glance, there is
actually a tradeoff when choosing the quantization parameters for a given tensor: it might be
beneficial overall to choose quantization parameters which provide a smaller quantization step
(e.g. a smaller **scale** parameter), which means a better representation of most of the tensor
values (the bulk of the histogram) at the expense of actually saturating the extreme values
(outliers).

This tool tunes the quantization parameters using the following simple algorithm (sketched in the
code example after this list):
- For each node in the graph, try different quantization parameters in the vicinity of the values
chosen initially (right after profiling). For example, this is done by successively dividing the
**scale** parameter by 2 for a maximum of 3 iterations.
- Among the tested quantization parameters, keep the ones which provide the best accuracy with
respect to a given dataset.
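
The following is a minimal, illustrative sketch of this greedy per-node search; it is not the
tool's actual implementation, and `NodeQuantParams`, `tuneScales` and the `getAccuracy` callback
are hypothetical stand-ins for the real Glow APIs:

```cpp
#include <functional>
#include <vector>

// Hypothetical handle for the scale parameter of one quantized node.
struct NodeQuantParams {
  float scale;
};

// Greedy per-node search sketch: for each node try the original scale,
// scale/2 and scale/4 and keep the value giving the best accuracy on the
// tuning dataset. `getAccuracy` stands in for running the quantized model
// on the tuning dataset and computing its accuracy.
float tuneScales(std::vector<NodeQuantParams> &nodes,
                 const std::function<float()> &getAccuracy,
                 unsigned maxIterPerNode = 3, float accDropSkip = 0.05f) {
  float bestAccuracy = getAccuracy();
  for (auto &node : nodes) {
    float bestScale = node.scale;
    float tryScale = node.scale;
    for (unsigned iter = 0; iter < maxIterPerNode; ++iter) {
      node.scale = tryScale;
      float accuracy = getAccuracy();
      if (accuracy > bestAccuracy) {
        bestAccuracy = accuracy;
        bestScale = tryScale;
      } else if (bestAccuracy - accuracy > accDropSkip) {
        // Accuracy dropped too much: stop tuning this node early.
        break;
      }
      tryScale /= 2; // Next candidate scale.
    }
    node.scale = bestScale; // Keep the best value found for this node.
  }
  return bestAccuracy;
}
```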

### Command line options

The specific command line options for running this tool are presented below. Apart from these,
the tool also uses generic options shared with the other front-end tools (see the
image-classifier documentation):
- options for specifying the model, the quantization options (schema, precision), the backend,
and the image preprocessing options (layout, channel order, normalization).

```
model-tuner -model=<model-path> <image-options> <quantization-options> -dataset-path=<dataset-folder>
-dataset-file=<dataset-file> -load-profile=<input-profile> -dump-tuned-profile=<tuned-profile>
```

where:
- *dataset-path* - the folder where the dataset files are located. The assumption is that all the
dataset files are located in the same directory.
- *dataset-file* - the path to the dataset description file, which contains on each line a data
path and an integer label separated by a space (" ") or a comma (","). The integer labels are
0-based (0, 1, ...). An example might look like this (a small parsing sketch is given after this
list):
      image0.png 0
      image1.png 13
      .............
  Another example might look like this:
      image0.png,0,
      image1.png,13,
      ..............
- *load-profile* - the path of the input profile which is loaded and tuned.
- *dump-tuned-profile* - the path where the tuned profile is written.
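
As an illustration of this format only (the tool has its own loading code), a minimal,
hypothetical parser for one line of the dataset description file could look like this:

```cpp
#include <sstream>
#include <string>
#include <utility>

// Parse one line of the dataset description file, e.g. "image0.png 13" or
// "image0.png,13,". Both separators are handled by mapping commas to spaces.
std::pair<std::string, unsigned> parseDatasetLine(std::string line) {
  for (char &c : line) {
    if (c == ',') {
      c = ' ';
    }
  }
  std::istringstream iss(line);
  std::string path;
  unsigned label = 0;
  iss >> path >> label;
  return {path, label};
}
```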

More information can be obtained by typing the following command:
```
model-tuner -help
```

### Extra command line options

There are a couple of extra command line parameters which can be used to tweak the algorithm
behavior (an example invocation is given after this list):
- *target-accuracy* - The tuning procedure is stopped when the accuracy has reached or surpassed
the given value. A float value between 0.0 and 1.0 is expected. If not specified, the tuning will
run until completion.
- *max-iter-per-node* - The maximum number of tuning iterations per node (default is 3).
- *acc-drop-skip* - The accuracy drop for which the tuning of any node is skipped. The default
value is 0.05 (5%).
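
For illustration only, a hypothetical invocation combining these options with the ones described
above might look like the following (all file names are placeholders, and the generic image and
quantization options are omitted):

```
model-tuner -model=model.onnx -dataset-path=images/ -dataset-file=dataset.txt \
  -load-profile=profile.yaml -dump-tuned-profile=profile_tuned.yaml \
  -target-accuracy=0.95 -max-iter-per-node=3 -acc-drop-skip=0.05
```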

### Command line output

When running this tool, the console output might look like this:

```
Computing initial accuracy ...
Initial accuracy: 81.0180 %
Number of nodes: 277
Target accuracy: 100.0000 %
[1/277] Tuning node "broadcast_B_tile0_save__1:0"
  [1/3] Testing scale = 0.00195
    Accuracy = 81.0180 %
  Tunning stopped for this node (no effect)
  Best accuracy : 81.0180 %
  Iteration time: 34 seconds
  Remaining time: 2 hours 36 minutes
[2/277] Tuning node "W52__1:0"
  [1/3] Testing scale = 0.06250
    Accuracy = 81.4422 %
  [2/3] Testing scale = 0.03125
    Accuracy = 79.0032 %
  [3/3] Testing scale = 0.01562
    Accuracy = 67.1262 %
  Best accuracy : 81.4422 %
  Iteration time: 68 seconds
  Remaining time: 5 hours 11 minutes
..................................
..................................
[277/277] Tuning node "W42__1:0"
  [1/3] Testing scale = 0.01562
    Accuracy = 90.2439 %
  Tunning stopped for this node
  Best accuracy : 97.9852 %
  Iteration time: 66 seconds
  Remaining time: 0 hours 0 minutes
Final accuracy: 97.9852 %
Total time: 5 hours 6 minutes
```

Notes:
- The quantization tuning procedure is lengthy: the time required to run it is of the same order
of magnitude as training. For example, the model tuned in the above example is a medium-size
model (e.g. similar to a MobileNet with a scale factor of 0.5). For this reason the tool also
prints an estimated remaining time for the tuning (the estimation gets better after calibrating
more nodes; a possible sketch of such an estimate is shown after these notes).
- If the estimated tuning time is too long, one can use a smaller tuning dataset.
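
The tool's exact estimator is not described here; a plausible sketch (an assumption, not
necessarily what the tool implements) is to average the per-node iteration time observed so far
and multiply it by the number of nodes left:

```cpp
#include <cstddef>

// Hypothetical remaining-time estimate: average iteration time so far,
// multiplied by the number of nodes still to be tuned.
double estimateRemainingSeconds(double elapsedSeconds, std::size_t nodesDone,
                                std::size_t nodesTotal) {
  if (nodesDone == 0) {
    return 0.0; // Nothing measured yet.
  }
  double avgSecondsPerNode = elapsedSeconds / static_cast<double>(nodesDone);
  return avgSecondsPerNode * static_cast<double>(nodesTotal - nodesDone);
}
```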