Node timing and profile details format in chrome://tracing.
Working example in ./Examples/Image/Classification/MLP/Python/SimpleMNIST.py

Note that node timing is added to the profiler details when the profiler is enabled, e.g.:

    import cntk as C
    C.debugging.debug.set_node_timing(True)
    C.debugging.start_profiler()
    C.debugging.enable_profiler()
    # <trainer|evaluator|function> executions
    <trainer|evaluator|function>.print_node_timing()
    C.debugging.stop_profiler()
KeDengMS committed Feb 2, 2018
1 parent 3a19d2f commit 3660b7a
Showing 23 changed files with 274 additions and 48 deletions.
51 changes: 14 additions & 37 deletions Documentation/current_iteration.md
@@ -1,39 +1,16 @@
# CNTK v2.4 Release Notes
# CNTK Current Iteration

## Highlights of this Release
- Move to CUDA9, cuDNN 7 and Visual Studio 2017.
- Removed Python 3.4 support.
- Support Volta GPU and FP16.
- Better ONNX support.
- CPU perf improvement.
- More OPs.
## Change profiler details output format to chrome://tracing
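chrome://tracing consumes the Trace Event Format: a JSON array of event objects with `name`, `ph` (phase), `ts`/`dur` in microseconds, and `pid`/`tid`. A minimal sketch of emitting "complete" events in that shape (the event names and helper below are illustrative, not CNTK's actual writer):

```python
import json

def trace_event(name, start_us, dur_us, pid=1, tid=1):
    # One "complete" event ("ph": "X") in the Trace Event Format
    # that chrome://tracing can load directly.
    return {"name": name, "ph": "X", "ts": start_us, "dur": dur_us,
            "pid": pid, "tid": tid}

# Hypothetical per-node timing events, one per forward/backward pass:
events = [trace_event("times_Forward", 0, 120),
          trace_event("times_Backward", 130, 250)]
trace_json = json.dumps(events)  # chrome://tracing accepts a bare array
```

Saving such an array to a file and opening it via "Load" in chrome://tracing renders one bar per event on a per-thread timeline.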

## OPs
- ``top_k`` operation: in the forward pass it computes the top (largest) k values and corresponding indices along the specified axis. In the backward pass the gradient is scattered to the top k elements (an element not in the top k gets a zero gradient).
- ``gather`` operation now supports an axis argument
- ``squeeze`` and ``expand_dims`` operations for easily removing and adding singleton axes
- ``zeros_like`` and ``ones_like`` operations. In many situations you can just rely on CNTK correctly broadcasting a simple 0 or 1 but sometimes you need the actual tensor.
- ``depth_to_space``: Rearranges elements in the input tensor from the depth dimension into spatial blocks. Typical use of this operation is for implementing sub-pixel convolution for some image super-resolution models.
- ``space_to_depth``: Rearranges elements in the input tensor from the spatial dimensions to the depth dimension. It is largely the inverse of DepthToSpace.
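The forward/backward semantics described for ``top_k`` can be sketched in NumPy (an illustration of the behavior, not CNTK's implementation; the function names are mine):

```python
import numpy as np

def top_k_forward(x, k, axis=-1):
    # Largest k values (sorted descending) and their indices along `axis`.
    idx = np.argsort(-x, axis=axis)
    idx = np.take(idx, np.arange(k), axis=axis)
    return np.take_along_axis(x, idx, axis=axis), idx

def top_k_backward(grad_out, idx, input_shape, axis=-1):
    # Scatter the incoming gradient to the top-k positions;
    # every element not in the top k gets a zero gradient.
    grad_in = np.zeros(input_shape, dtype=grad_out.dtype)
    np.put_along_axis(grad_in, idx, grad_out, axis=axis)
    return grad_in
```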

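A NumPy sketch of the two rearrangements (assuming NCHW layout and one particular block/channel ordering; CNTK's ops may order the sub-pixel blocks differently):

```python
import numpy as np

def depth_to_space(x, b):
    # (N, C*b*b, H, W) -> (N, C, H*b, W*b): each group of b*b channels
    # becomes one b-by-b spatial block (sub-pixel convolution layout).
    n, c, h, w = x.shape
    x = x.reshape(n, b, b, c // (b * b), h, w)
    x = x.transpose(0, 3, 4, 1, 5, 2)
    return x.reshape(n, c // (b * b), h * b, w * b)

def space_to_depth(x, b):
    # Exact inverse: (N, C, H*b, W*b) -> (N, C*b*b, H, W).
    n, c, h, w = x.shape
    x = x.reshape(n, c, h // b, b, w // b, b)
    x = x.transpose(0, 3, 5, 1, 2, 4)
    return x.reshape(n, c * b * b, h // b, w // b)
```

With this ordering the two functions round-trip: `space_to_depth(depth_to_space(x, b), b)` recovers `x`.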
## ONNX
There have been several improvements to ONNX support in CNTK.

### Updates
- Updated ONNX ``Reshape`` op to handle ``InferredDimension``.
- Added ``producer_name`` and ``producer_version`` fields to ONNX models.
- Handled the case when neither the ``auto_pad`` nor the ``pads`` attribute is specified in the ONNX ``Conv`` op.

### Bug fixes
- Fixed bug in ONNX ``Pooling`` op serialization
- Bug fix to create ONNX ``InputVariable`` with only one batch axis.
- Bug fixes and updates to implementation of ONNX ``Transpose`` op to match updated spec.
- Bug fixes and updates to implementation of ONNX ``Conv``, ``ConvTranspose``, and ``Pooling`` ops to match updated spec.

## Operators
### Group convolution
- Fixed bug in group convolution. Output of CNTK ``Convolution`` op will change for groups > 1. More optimized implementation of group convolution is expected in the next release.
- Better error reporting for group convolution in ``Convolution`` layer.

### Halide Binary Convolution
- The CNTK build can now use optional [Halide](http://halide-lang.org/) libraries to build the ``Cntk.BinaryConvolution.so/dll`` library, which can be used with the ``netopt`` module. The library contains optimized binary convolution operators that perform better than the Python-based binarized convolution operators. To enable Halide in the build, please download a [Halide release](https://github.com/halide/Halide/releases) and set the ``HALIDE_PATH`` environment variable before starting a build. On Linux, you can use ``./configure --with-halide[=directory]`` to enable it. For more information on how to use this feature, please refer to [How_to_use_network_optimization](https://github.com/Microsoft/CNTK/blob/master/Manual/Manual_How_to_use_network_optimizations.ipynb).
## Enable per-node timing. Working example [here](../Examples/Image/Classification/MLP/Python/SimpleMNIST.py)
- Per-node timing creates items in the profiler details when the profiler is enabled.
- Usage in Python:
```python
import cntk as C
C.debugging.debug.set_node_timing(True)
C.debugging.start_profiler() # optional
C.debugging.enable_profiler() # optional
#<trainer|evaluator|function> executions
<trainer|evaluator|function>.print_node_timing()
C.debugging.stop_profiler()
```
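Behind `print_node_timing()` each node keeps a small per-phase accumulator (a call count plus a running duration, averaged when printed) around its forward and backward passes, as added in this commit's `ComputationNode::BeginTiming`/`EndTiming`. A standalone Python sketch of that bookkeeping (illustrative only, not CNTK API):

```python
import time

class PhaseTiming:
    """Per-phase accumulator mirroring the C++ node timing:
    a call count plus a total duration, averaged at print time."""
    def __init__(self):
        self.count = 0
        self.duration = 0.0
        self._begin = None

    def begin(self):
        # Counterpart of BeginTiming(): stamp the start, bump the count.
        self.count += 1
        self._begin = time.perf_counter()

    def end(self):
        # Counterpart of EndTiming(): accumulate the elapsed time.
        self.duration += time.perf_counter() - self._begin

    def average(self):
        return self.duration / self.count if self.count else 0.0

# One accumulator per phase, like m_timing[TimingPhase_Forward/Backward]:
forward, backward = PhaseTiming(), PhaseTiming()
forward.begin()
time.sleep(0.001)  # stand-in for a node's ForwardProp work
forward.end()
print("forward avg %.6fs (fwd# %d)" % (forward.average(), forward.count))
```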
9 changes: 8 additions & 1 deletion Examples/Image/Classification/MLP/Python/SimpleMNIST.py
@@ -96,7 +96,7 @@ def simple_mnist(tensorboard_logdir=None):
max_samples = num_samples_per_sweep * num_sweeps_to_train_with,
progress_frequency=num_samples_per_sweep
).train()

# Load test data
path = os.path.normpath(os.path.join(data_dir, "Test-28x28_cntk_text.txt"))
check_path(path)
@@ -109,6 +109,10 @@ def simple_mnist(tensorboard_logdir=None):
}

# Test data for trained model
C.debugging.start_profiler()
C.debugging.enable_profiler()
C.debugging.set_node_timing(True)

test_minibatch_size = 1024
num_samples = 10000
num_minibatches_to_test = num_samples / test_minibatch_size
@@ -118,6 +122,9 @@ def simple_mnist(tensorboard_logdir=None):
eval_error = trainer.test_minibatch(mb)
test_result = test_result + eval_error

C.debugging.stop_profiler()
trainer.print_node_timing()

# Average of evaluation errors of all test minibatches
return test_result / num_minibatches_to_test

12 changes: 12 additions & 0 deletions Source/CNTKv2LibraryDll/API/CNTKLibrary.h
@@ -3213,6 +3215,8 @@ namespace CNTK
const std::unordered_map<Variable, ValuePtr>& rootGradientValues,
std::unordered_map<Variable, ValuePtr>& backPropagatedGradientValuesForInputs);

CNTK_API virtual void PrintNodeTiming() {}

protected:
///
/// Computes and stores the values of specified variables in the 'outputs' map, using provided 'inputs' values for each input of the Function.
@@ -5378,6 +5380,11 @@ namespace CNTK
return m_progressWriters;
}

///
/// Prints per-node average timing per-minibatch
///
CNTK_API virtual void PrintNodeTiming();

CNTK_API virtual ~Evaluator() {}

private:
@@ -5538,6 +5545,11 @@ namespace CNTK
///
CNTK_API void SummarizeTrainingProgress();

///
/// Prints per-node average timing per-minibatch
///
CNTK_API virtual void PrintNodeTiming();

private:
template <typename T1, typename ...CtorArgTypes>
friend std::shared_ptr<T1> MakeSharedObject(CtorArgTypes&& ...ctorArgs);
3 changes: 3 additions & 0 deletions Source/CNTKv2LibraryDll/API/CNTKLibraryInternals.h
@@ -316,6 +316,9 @@ namespace CNTK
CNTK_API void DisableProfiler();
CNTK_API void StopProfiler();

CNTK_API void EnableNodeTiming();
CNTK_API void DisableNodeTimeing();

CNTK_API bool AreEquivalent(const ::CNTK::FunctionPtr& f1, const ::CNTK::FunctionPtr& f2);
CNTK_API bool AreEquivalent(const ::CNTK::Variable& v1, const ::CNTK::Variable& v2, bool allowParameterAndConstantsEquivalence = false);

10 changes: 10 additions & 0 deletions Source/CNTKv2LibraryDll/Common.cpp
@@ -210,6 +210,16 @@ namespace CNTK
#endif
}

void EnableNodeTiming()
{
Microsoft::MSR::CNTK::Globals::SetNodeTiming(true);
}

void DisableNodeTimeing()
{
Microsoft::MSR::CNTK::Globals::SetNodeTiming(false);
}

bool AreEquivalent(const Variable& var1, const Variable& var2, bool allowParameterAndConstantsEquivalence)
{
bool areDynamicAxesCompatible = (var1.DynamicAxes().size() == var2.DynamicAxes().size());
8 changes: 8 additions & 0 deletions Source/CNTKv2LibraryDll/CompositeFunction.h
@@ -126,6 +126,14 @@ namespace CNTK
return CompositeFunctionOpName;
}

void PrintNodeTiming() override
{
if (m_computationNetwork)
{
m_computationNetwork->PrintNodeTiming();
}
}

template <typename FunctionType>
static void PreorderTraverseVariables(const FunctionPtr& rootFunction, const FunctionType& functor, bool pythonOperandOrder = false)
{
8 changes: 8 additions & 0 deletions Source/CNTKv2LibraryDll/Evaluator.cpp
@@ -196,4 +196,12 @@ namespace CNTK
if (m_aggregatedTestEvalCriterionValue)
m_aggregatedTestEvalCriterionValue->Reset();
}

void Evaluator::PrintNodeTiming()
{
if (m_combinedEvalFunction)
{
m_combinedEvalFunction->PrintNodeTiming();
}
}
}
9 changes: 9 additions & 0 deletions Source/CNTKv2LibraryDll/Trainer.cpp
@@ -337,6 +337,15 @@ namespace CNTK
m_progressWriters.insert(progressWriters.begin(), progressWriters.end());
}

void Trainer::PrintNodeTiming()
{
if (m_combinedTrainingFunction)
{
m_combinedTrainingFunction->PrintNodeTiming();
}
}


void Trainer::ExecuteForwardBackward(const std::unordered_map<Variable, ValuePtr>& arguments, std::unordered_map<Variable, ValuePtr>& outputsToFetch, const DeviceDescriptor& computeDevice, std::unordered_map<Variable, ValuePtr>& parameterGradients)
{
#ifndef CNTK_UWP
1 change: 1 addition & 0 deletions Source/Common/Globals.cpp
@@ -15,4 +15,5 @@ namespace Microsoft { namespace MSR { namespace CNTK {

std::atomic<bool> Globals::m_enableShareNodeValueMatrices(true);
std::atomic<bool> Globals::m_optimizeGradientAccumulation(true);
std::atomic<bool> Globals::m_enableNodeTiming(false);
}}}
3 changes: 3 additions & 0 deletions Source/Common/Include/Globals.h
@@ -28,11 +28,14 @@ namespace Microsoft { namespace MSR { namespace CNTK {
static void SetShareNodeValueMatrices(bool enable) { m_enableShareNodeValueMatrices = enable; }
static bool ShouldEnableShareNodeValueMatrices() { return m_enableShareNodeValueMatrices; }

static void SetNodeTiming(bool enable) { m_enableNodeTiming = enable; }
static bool ShouldEnableNodeTiming() { return m_enableNodeTiming; }
private:
static std::atomic<bool> m_forceDeterministicAlgorithms;
// The global flag to enable matrices values in forward and backward prop
static std::atomic<bool> m_enableShareNodeValueMatrices;
static std::atomic<bool> m_forceConstantRandomSeed;
static std::atomic<bool> m_optimizeGradientAccumulation;
static std::atomic<bool> m_enableNodeTiming;
};
}}}
12 changes: 12 additions & 0 deletions Source/ComputationNetworkLib/ComputationNetwork.cpp
@@ -86,6 +86,18 @@ void ComputationNetwork::ClearNetwork()
m_pMBLayoutOfNetwork->Init(1, 0);
}

void ComputationNetwork::PrintNodeTiming()
{
for (auto& iter : m_nameToNodeMap)
{
auto nodePtr = iter.second->As<ComputationNode<float>>();
if (nodePtr)
{
nodePtr->PrintForwardBackwardTime();
}
}
}

// -----------------------------------------------------------------------
// serialization
// -----------------------------------------------------------------------
2 changes: 2 additions & 0 deletions Source/ComputationNetworkLib/ComputationNetwork.h
@@ -87,6 +87,8 @@ class ComputationNetwork :

DEVICEID_TYPE GetDeviceId() const { return m_deviceId; }

void PrintNodeTiming();

protected:
void ConstructFromRoots(DEVICEID_TYPE deviceId, std::deque<ComputationNodeBasePtr>&& roots, const map<ComputationNodeBasePtr, ComputationNodeBasePtr>& replacements);
void ProcessSpecialNodes(const ScriptableObjects::IConfigRecord& config, std::deque<ComputationNodeBasePtr>& roots);
8 changes: 8 additions & 0 deletions Source/ComputationNetworkLib/ComputationNetworkEvaluation.cpp
@@ -145,7 +145,9 @@ ComputationNetwork::PARTraversalFlowControlNode::PARTraversalFlowControlNode(con
if (node->IsOutOfDateWrtInputs())
{
node->BeginForwardProp();
node->BeginTiming(false /*backward*/);
node->ForwardProp(fr.WithLayout(node->GetMBLayout()));
node->EndTiming(false /*backward*/);
node->EndForwardProp();

node->BumpEvalTimeStamp();
@@ -183,7 +185,9 @@ ComputationNetwork::PARTraversalFlowControlNode::PARTraversalFlowControlNode(con
auto& node = *pnode;

node->BeginBackprop();
node->BeginTiming(true /*backward*/);
node->Backprop(fr.WithLayout(node->GetMBLayout()), true /*childrenInThisLoop*/, true /*childrenInOuterLoop*/);
node->EndTiming(true /*backward*/);
node->EndBackprop();

// Extreme Tracing, part 2/4
@@ -280,7 +284,9 @@ static bool DumpNode(ComputationNodeBasePtr nodep, bool dumpGradient)
{
for (auto& node : m_nestedNodes)
{
node->BeginTiming(false /*backward*/);
node->ForwardProp(t);
node->EndTiming(false /*backward*/);
node->BumpEvalTimeStamp();
}
}
@@ -320,7 +326,9 @@ static bool DumpNode(ComputationNodeBasePtr nodep, bool dumpGradient)
for (auto nodeIter2 = recurrentNodes.rbegin(); nodeIter2 != recurrentNodes.rend(); ++nodeIter2)
{
auto& node2 = *nodeIter2;
node2->BeginTiming(true /*backward*/);
node2->Backprop(t, true /*childrenInThisLoop*/, false /*childrenInOuterLoop*/);
node2->EndTiming(true /*backward*/);
// The above flags tell Backprop() to skip back-propagation from inside a node into
// a node that is outside the loop, which is done later in EndBackprop() in PAR mode.
}
2 changes: 1 addition & 1 deletion Source/ComputationNetworkLib/ComputationNetworkLib.vcxproj
@@ -52,7 +52,7 @@
</PropertyGroup>
<ItemDefinitionGroup>
<ClCompile>
<AdditionalIncludeDirectories>$(SolutionDir)Source\SequenceTrainingLib;$(BOOST_INCLUDE_PATH);$(SolutionDir)Source\CNTKv2LibraryDll\API;$(SolutionDir)Source\CNTKv2LibraryDll;$(SolutionDir)Source\Math;$(SolutionDir)Source\Common\Include;$(SolutionDir)Source\CNTK\BrainScript;$(SolutionDir)Source\ActionsLib;$(MSMPI_INC);$(NvmlInclude)</AdditionalIncludeDirectories>
<AdditionalIncludeDirectories>$(SolutionDir)Source\SequenceTrainingLib;$(BOOST_INCLUDE_PATH);$(SolutionDir)Source\CNTKv2LibraryDll\API;$(SolutionDir)Source\CNTKv2LibraryDll;$(SolutionDir)Source\Math;$(SolutionDir)Source\Common\Include;$(SolutionDir)Source\CNTK\BrainScript;$(SolutionDir)Source\ActionsLib;$(MSMPI_INC);$(NvmlInclude);$(SolutionDir)Source\PerformanceProfilerDll;</AdditionalIncludeDirectories>
<PrecompiledHeader>
</PrecompiledHeader>
<PreprocessorDefinitions>WIN32;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
69 changes: 69 additions & 0 deletions Source/ComputationNetworkLib/ComputationNode.cpp
@@ -10,6 +10,14 @@
#include "ComputationNetworkBuilder.h" // TODO: We should only pull in NewComputationNodeFromConfig(). Nodes should not know about network at large.
#include "TensorShape.h"

#ifndef CNTK_UWP
#include "PerformanceProfiler.h"
#ifdef _WIN32
#define PERFORMANCE_PROFILER_LIB_NAME "Cntk.PerformanceProfiler-"##CNTK_COMPONENT_VERSION##".lib"
#pragma comment(lib, PERFORMANCE_PROFILER_LIB_NAME)
#endif
#endif

#ifndef let
#define let const auto
#endif
@@ -769,6 +777,67 @@ template <class ElemType>
}
}

template <class ElemType>
/*virtual*/ void ComputationNode<ElemType>::BeginTiming(bool backward)
{
if (!Globals::ShouldEnableNodeTiming()) return;

int phase = (backward ? (int)TimingPhase_Backward : (int)TimingPhase_Forward);
auto& timing = m_timing[phase];
timing.beginTime = std::chrono::system_clock::now();
timing.count++;
#ifndef CNTK_UWP
timing.profilerId = ProfilerTimeBegin();
#endif
}

template <class ElemType>
/*virtual*/ void ComputationNode<ElemType>::EndTiming(bool backward)
{
if (!Globals::ShouldEnableNodeTiming()) return;

int phase = (backward ? (int)TimingPhase_Backward : (int)TimingPhase_Forward);
auto& timing = m_timing[phase];
timing.duration += (std::chrono::system_clock::now() - timing.beginTime);

#ifndef CNTK_UWP
// the order must match enum
static const char* postfixes[TimingPhase_Total] =
{
"Forward",
"Backward",
};

if (timing.profilerName.length() != m_nodeName.length() + strlen(postfixes[phase]))
{
static char name[256];
sprintf_s(name, _countof(name), "%S%s", m_nodeName.c_str(), postfixes[phase]);
timing.profilerName = name;
}
ProfilerTimeEnd(timing.profilerId, timing.profilerName.c_str());
#endif
}

template<class ElemType>
void ComputationNode<ElemType>::PrintForwardBackwardTime()
{
if (GetInputs().size() == 0) return;

auto& forwardCount = m_timing[TimingPhase_Forward].count;
auto forwardDuration = m_timing[TimingPhase_Forward].duration.count();
auto& backwardCount = m_timing[TimingPhase_Backward].count;
auto backwardDuration = m_timing[TimingPhase_Backward].duration.count();
fprintf(stderr, "%-30S forward avg %07fs, backward avg %07fs (fwd# %d|bwd# %d)\n",
m_nodeName.c_str(),
forwardCount == 0 ? 0 : forwardDuration / forwardCount,
backwardCount == 0 ? 0 : backwardDuration / backwardCount,
forwardCount,
backwardCount);

for (auto& timing : m_timing)
timing.Reset();
}

template <class ElemType>
/*virtual*/ void ComputationNode<ElemType>::DumpNodeInfo(const bool /*printValues*/, const bool printMetadata, File& fstream) const
{