Node timing and profile details format in chrome://tracing.
Working example in ./Examples/Image/Classification/MLP/Python/SimpleMNIST.py

Note that node timing is added to the profiler details when the profiler is enabled, e.g.:

    import cntk as C
    C.debugging.debug.set_node_timing(True)
    C.debugging.start_profiler()
    C.debugging.enable_profiler()
    # <trainer|evaluator|function> executions
    <trainer|evaluator|function>.print_node_timing()
    C.debugging.stop_profiler()
KeDengMS committed Feb 2, 2018
1 parent 3a19d2f commit 3660b7a
Showing 23 changed files with 274 additions and 48 deletions.
51 changes: 14 additions & 37 deletions Documentation/current_iteration.md
@@ -1,39 +1,16 @@
# CNTK v2.4 Release Notes
# CNTK Current Iteration

## Highlights of this Release
- Move to CUDA9, cuDNN 7 and Visual Studio 2017.
- Removed Python 3.4 support.
- Support Volta GPU and FP16.
- Better ONNX support.
- CPU perf improvement.
- More OPs.
## Change profiler details output format to chrome://tracing
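chrome://tracing consumes the Trace Event Format: a JSON array of event objects with `name`, `ph` (phase), `ts`/`dur` in microseconds, and `pid`/`tid`. A minimal sketch of emitting "complete" events in that shape (the event names and helper below are illustrative, not CNTK's actual writer):

```python
import json

def trace_event(name, start_us, dur_us, pid=1, tid=1):
    # One "complete" event ("ph": "X") in the Trace Event Format
    # that chrome://tracing can load directly.
    return {"name": name, "ph": "X", "ts": start_us, "dur": dur_us,
            "pid": pid, "tid": tid}

# Hypothetical per-node timing events, one per forward/backward pass:
events = [trace_event("times_Forward", 0, 120),
          trace_event("times_Backward", 130, 250)]
trace_json = json.dumps(events)  # chrome://tracing accepts a bare array
```

Saving such an array to a file and opening it via "Load" in chrome://tracing renders one bar per event on a per-thread timeline.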

## OPs
- ``top_k`` operation: in the forward pass it computes the top (largest) k values and corresponding indices along the specified axis. In the backward pass the gradient is scattered to the top k elements (an element not in the top k gets a zero gradient).
- ``gather`` operation now supports an axis argument
- ``squeeze`` and ``expand_dims`` operations for easily removing and adding singleton axes
- ``zeros_like`` and ``ones_like`` operations. In many situations you can just rely on CNTK correctly broadcasting a simple 0 or 1 but sometimes you need the actual tensor.
- ``depth_to_space``: Rearranges elements in the input tensor from the depth dimension into spatial blocks. Typical use of this operation is for implementing sub-pixel convolution for some image super-resolution models.
- ``space_to_depth``: Rearranges elements in the input tensor from the spatial dimensions to the depth dimension. It is largely the inverse of DepthToSpace.
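The forward/backward semantics described for ``top_k`` can be sketched in NumPy (an illustration of the behavior, not CNTK's implementation; the function names are mine):

```python
import numpy as np

def top_k_forward(x, k, axis=-1):
    # Largest k values (sorted descending) and their indices along `axis`.
    idx = np.argsort(-x, axis=axis)
    idx = np.take(idx, np.arange(k), axis=axis)
    return np.take_along_axis(x, idx, axis=axis), idx

def top_k_backward(grad_out, idx, input_shape, axis=-1):
    # Scatter the incoming gradient to the top-k positions;
    # every element not in the top k gets a zero gradient.
    grad_in = np.zeros(input_shape, dtype=grad_out.dtype)
    np.put_along_axis(grad_in, idx, grad_out, axis=axis)
    return grad_in
```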

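A NumPy sketch of the two rearrangements (assuming NCHW layout and one particular block/channel ordering; CNTK's ops may order the sub-pixel blocks differently):

```python
import numpy as np

def depth_to_space(x, b):
    # (N, C*b*b, H, W) -> (N, C, H*b, W*b): each group of b*b channels
    # becomes one b-by-b spatial block (sub-pixel convolution layout).
    n, c, h, w = x.shape
    x = x.reshape(n, b, b, c // (b * b), h, w)
    x = x.transpose(0, 3, 4, 1, 5, 2)
    return x.reshape(n, c // (b * b), h * b, w * b)

def space_to_depth(x, b):
    # Exact inverse: (N, C, H*b, W*b) -> (N, C*b*b, H, W).
    n, c, h, w = x.shape
    x = x.reshape(n, c, h // b, b, w // b, b)
    x = x.transpose(0, 3, 5, 1, 2, 4)
    return x.reshape(n, c * b * b, h // b, w // b)
```

With this ordering the two functions round-trip: `space_to_depth(depth_to_space(x, b), b)` recovers `x`.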
## ONNX
There have been several improvements to ONNX support in CNTK.

### Updates
- Updated ONNX ``Reshape`` op to handle ``InferredDimension``.
- Added ``producer_name`` and ``producer_version`` fields to ONNX models.
- Handled the case when neither the ``auto_pad`` nor the ``pads`` attribute is specified in the ONNX ``Conv`` op.

### Bug fixes
- Fixed bug in ONNX ``Pooling`` op serialization
- Bug fix to create ONNX ``InputVariable`` with only one batch axis.
- Bug fixes and updates to implementation of ONNX ``Transpose`` op to match updated spec.
- Bug fixes and updates to implementation of ONNX ``Conv``, ``ConvTranspose``, and ``Pooling`` ops to match updated spec.

## Operators
### Group convolution
- Fixed bug in group convolution. Output of CNTK ``Convolution`` op will change for groups > 1. More optimized implementation of group convolution is expected in the next release.
- Better error reporting for group convolution in ``Convolution`` layer.

### Halide Binary Convolution
- The CNTK build can now use optional [Halide](http://halide-lang.org/) libraries to build the ``Cntk.BinaryConvolution.so/dll`` library, which can be used with the ``netopt`` module. The library contains optimized binary convolution operators that perform better than the Python-based binarized convolution operators. To enable Halide in the build, please download a [Halide release](https://github.com/halide/Halide/releases) and set the ``HALIDE_PATH`` environment variable before starting a build. On Linux, you can use ``./configure --with-halide[=directory]`` to enable it. For more information on how to use this feature, please refer to [How_to_use_network_optimization](https://github.com/Microsoft/CNTK/blob/master/Manual/Manual_How_to_use_network_optimizations.ipynb).
## Enable per-node timing. Working example [here](../Examples/Image/Classification/MLP/Python/SimpleMNIST.py)
- Per-node timing creates items in the profiler details when the profiler is enabled.
- Usage in Python:
```python
import cntk as C
C.debugging.debug.set_node_timing(True)
C.debugging.start_profiler() # optional
C.debugging.enable_profiler() # optional
#<trainer|evaluator|function> executions
<trainer|evaluator|function>.print_node_timing()
C.debugging.stop_profiler()
```
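Behind `print_node_timing()` each node keeps a small per-phase accumulator (a call count plus a running duration, averaged when printed) around its forward and backward passes, as added in this commit's `ComputationNode::BeginTiming`/`EndTiming`. A standalone Python sketch of that bookkeeping (illustrative only, not CNTK API):

```python
import time

class PhaseTiming:
    """Per-phase accumulator mirroring the C++ node timing:
    a call count plus a total duration, averaged at print time."""
    def __init__(self):
        self.count = 0
        self.duration = 0.0
        self._begin = None

    def begin(self):
        # Counterpart of BeginTiming(): stamp the start, bump the count.
        self.count += 1
        self._begin = time.perf_counter()

    def end(self):
        # Counterpart of EndTiming(): accumulate the elapsed time.
        self.duration += time.perf_counter() - self._begin

    def average(self):
        return self.duration / self.count if self.count else 0.0

# One accumulator per phase, like m_timing[TimingPhase_Forward/Backward]:
forward, backward = PhaseTiming(), PhaseTiming()
forward.begin()
time.sleep(0.001)  # stand-in for a node's ForwardProp work
forward.end()
print("forward avg %.6fs (fwd# %d)" % (forward.average(), forward.count))
```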
9 changes: 8 additions & 1 deletion Examples/Image/Classification/MLP/Python/SimpleMNIST.py
@@ -96,7 +96,7 @@ def simple_mnist(tensorboard_logdir=None):
max_samples = num_samples_per_sweep * num_sweeps_to_train_with,
progress_frequency=num_samples_per_sweep
).train()

# Load test data
path = os.path.normpath(os.path.join(data_dir, "Test-28x28_cntk_text.txt"))
check_path(path)
@@ -109,6 +109,10 @@ def simple_mnist(tensorboard_logdir=None):
}

# Test data for trained model
C.debugging.start_profiler()
C.debugging.enable_profiler()
C.debugging.set_node_timing(True)

test_minibatch_size = 1024
num_samples = 10000
num_minibatches_to_test = num_samples / test_minibatch_size
@@ -118,6 +122,9 @@ def simple_mnist(tensorboard_logdir=None):
eval_error = trainer.test_minibatch(mb)
test_result = test_result + eval_error

C.debugging.stop_profiler()
trainer.print_node_timing()

# Average of evaluation errors of all test minibatches
return test_result / num_minibatches_to_test

12 changes: 12 additions & 0 deletions Source/CNTKv2LibraryDll/API/CNTKLibrary.h
@@ -3213,6 +3215,8 @@ namespace CNTK
const std::unordered_map<Variable, ValuePtr>& rootGradientValues,
std::unordered_map<Variable, ValuePtr>& backPropagatedGradientValuesForInputs);

CNTK_API virtual void PrintNodeTiming() {}

protected:
///
/// Computes and stores the values of specified variables in the 'outputs' map, using provided 'inputs' values for each input of the Function.
@@ -5378,6 +5380,11 @@ namespace CNTK
return m_progressWriters;
}

///
/// Prints per-node average timing per-minibatch
///
CNTK_API virtual void PrintNodeTiming();

CNTK_API virtual ~Evaluator() {}

private:
@@ -5538,6 +5545,11 @@ namespace CNTK
///
CNTK_API void SummarizeTrainingProgress();

///
/// Prints per-node average timing per-minibatch
///
CNTK_API virtual void PrintNodeTiming();

private:
template <typename T1, typename ...CtorArgTypes>
friend std::shared_ptr<T1> MakeSharedObject(CtorArgTypes&& ...ctorArgs);
3 changes: 3 additions & 0 deletions Source/CNTKv2LibraryDll/API/CNTKLibraryInternals.h
@@ -316,6 +316,9 @@ namespace CNTK
CNTK_API void DisableProfiler();
CNTK_API void StopProfiler();

CNTK_API void EnableNodeTiming();
CNTK_API void DisableNodeTimeing();

CNTK_API bool AreEquivalent(const ::CNTK::FunctionPtr& f1, const ::CNTK::FunctionPtr& f2);
CNTK_API bool AreEquivalent(const ::CNTK::Variable& v1, const ::CNTK::Variable& v2, bool allowParameterAndConstantsEquivalence = false);

10 changes: 10 additions & 0 deletions Source/CNTKv2LibraryDll/Common.cpp
@@ -210,6 +210,16 @@ namespace CNTK
#endif
}

void EnableNodeTiming()
{
Microsoft::MSR::CNTK::Globals::SetNodeTiming(true);
}

void DisableNodeTimeing()
{
Microsoft::MSR::CNTK::Globals::SetNodeTiming(false);
}

bool AreEquivalent(const Variable& var1, const Variable& var2, bool allowParameterAndConstantsEquivalence)
{
bool areDynamicAxesCompatible = (var1.DynamicAxes().size() == var2.DynamicAxes().size());
8 changes: 8 additions & 0 deletions Source/CNTKv2LibraryDll/CompositeFunction.h
@@ -126,6 +126,14 @@ namespace CNTK
return CompositeFunctionOpName;
}

void PrintNodeTiming() override
{
if (m_computationNetwork)
{
m_computationNetwork->PrintNodeTiming();
}
}

template <typename FunctionType>
static void PreorderTraverseVariables(const FunctionPtr& rootFunction, const FunctionType& functor, bool pythonOperandOrder = false)
{
8 changes: 8 additions & 0 deletions Source/CNTKv2LibraryDll/Evaluator.cpp
@@ -196,4 +196,12 @@ namespace CNTK
if (m_aggregatedTestEvalCriterionValue)
m_aggregatedTestEvalCriterionValue->Reset();
}

void Evaluator::PrintNodeTiming()
{
if (m_combinedEvalFunction)
{
m_combinedEvalFunction->PrintNodeTiming();
}
}
}
9 changes: 9 additions & 0 deletions Source/CNTKv2LibraryDll/Trainer.cpp
@@ -337,6 +337,15 @@ namespace CNTK
m_progressWriters.insert(progressWriters.begin(), progressWriters.end());
}

void Trainer::PrintNodeTiming()
{
if (m_combinedTrainingFunction)
{
m_combinedTrainingFunction->PrintNodeTiming();
}
}


void Trainer::ExecuteForwardBackward(const std::unordered_map<Variable, ValuePtr>& arguments, std::unordered_map<Variable, ValuePtr>& outputsToFetch, const DeviceDescriptor& computeDevice, std::unordered_map<Variable, ValuePtr>& parameterGradients)
{
#ifndef CNTK_UWP
1 change: 1 addition & 0 deletions Source/Common/Globals.cpp
@@ -15,4 +15,5 @@ namespace Microsoft { namespace MSR { namespace CNTK {

std::atomic<bool> Globals::m_enableShareNodeValueMatrices(true);
std::atomic<bool> Globals::m_optimizeGradientAccumulation(true);
std::atomic<bool> Globals::m_enableNodeTiming(false);
}}}
3 changes: 3 additions & 0 deletions Source/Common/Include/Globals.h
@@ -28,11 +28,14 @@ namespace Microsoft { namespace MSR { namespace CNTK {
static void SetShareNodeValueMatrices(bool enable) { m_enableShareNodeValueMatrices = enable; }
static bool ShouldEnableShareNodeValueMatrices() { return m_enableShareNodeValueMatrices; }

static void SetNodeTiming(bool enable) { m_enableNodeTiming = enable; }
static bool ShouldEnableNodeTiming() { return m_enableNodeTiming; }
private:
static std::atomic<bool> m_forceDeterministicAlgorithms;
// The global flag to enable matrices values in forward and backward prop
static std::atomic<bool> m_enableShareNodeValueMatrices;
static std::atomic<bool> m_forceConstantRandomSeed;
static std::atomic<bool> m_optimizeGradientAccumulation;
static std::atomic<bool> m_enableNodeTiming;
};
}}}
12 changes: 12 additions & 0 deletions Source/ComputationNetworkLib/ComputationNetwork.cpp
@@ -86,6 +86,18 @@ void ComputationNetwork::ClearNetwork()
m_pMBLayoutOfNetwork->Init(1, 0);
}

void ComputationNetwork::PrintNodeTiming()
{
for (auto& iter : m_nameToNodeMap)
{
auto nodePtr = iter.second->As<ComputationNode<float>>();
if (nodePtr)
{
nodePtr->PrintForwardBackwardTime();
}
}
}

// -----------------------------------------------------------------------
// serialization
// -----------------------------------------------------------------------
2 changes: 2 additions & 0 deletions Source/ComputationNetworkLib/ComputationNetwork.h
@@ -87,6 +87,8 @@ class ComputationNetwork :

DEVICEID_TYPE GetDeviceId() const { return m_deviceId; }

void PrintNodeTiming();

protected:
void ConstructFromRoots(DEVICEID_TYPE deviceId, std::deque<ComputationNodeBasePtr>&& roots, const map<ComputationNodeBasePtr, ComputationNodeBasePtr>& replacements);
void ProcessSpecialNodes(const ScriptableObjects::IConfigRecord& config, std::deque<ComputationNodeBasePtr>& roots);
8 changes: 8 additions & 0 deletions Source/ComputationNetworkLib/ComputationNetworkEvaluation.cpp
@@ -145,7 +145,9 @@ ComputationNetwork::PARTraversalFlowControlNode::PARTraversalFlowControlNode(con
if (node->IsOutOfDateWrtInputs())
{
node->BeginForwardProp();
node->BeginTiming(false /*backward*/);
node->ForwardProp(fr.WithLayout(node->GetMBLayout()));
node->EndTiming(false /*backward*/);
node->EndForwardProp();

node->BumpEvalTimeStamp();
@@ -183,7 +185,9 @@ ComputationNetwork::PARTraversalFlowControlNode::PARTraversalFlowControlNode(con
auto& node = *pnode;

node->BeginBackprop();
node->BeginTiming(true /*backward*/);
node->Backprop(fr.WithLayout(node->GetMBLayout()), true /*childrenInThisLoop*/, true /*childrenInOuterLoop*/);
node->EndTiming(true /*backward*/);
node->EndBackprop();

// Extreme Tracing, part 2/4
@@ -280,7 +284,9 @@ static bool DumpNode(ComputationNodeBasePtr nodep, bool dumpGradient)
{
for (auto& node : m_nestedNodes)
{
node->BeginTiming(false /*backward*/);
node->ForwardProp(t);
node->EndTiming(false /*backward*/);
node->BumpEvalTimeStamp();
}
}
@@ -320,7 +326,9 @@ static bool DumpNode(ComputationNodeBasePtr nodep, bool dumpGradient)
for (auto nodeIter2 = recurrentNodes.rbegin(); nodeIter2 != recurrentNodes.rend(); ++nodeIter2)
{
auto& node2 = *nodeIter2;
node2->BeginTiming(true /*backward*/);
node2->Backprop(t, true /*childrenInThisLoop*/, false /*childrenInOuterLoop*/);
node2->EndTiming(true /*backward*/);
// The above flags tell Backprop() to skip back-propagation from inside a node into
// a node that is outside the loop, which is done later in EndBackprop() in PAR mode.
}
2 changes: 1 addition & 1 deletion Source/ComputationNetworkLib/ComputationNetworkLib.vcxproj
@@ -52,7 +52,7 @@
</PropertyGroup>
<ItemDefinitionGroup>
<ClCompile>
<AdditionalIncludeDirectories>$(SolutionDir)Source\SequenceTrainingLib;$(BOOST_INCLUDE_PATH);$(SolutionDir)Source\CNTKv2LibraryDll\API;$(SolutionDir)Source\CNTKv2LibraryDll;$(SolutionDir)Source\Math;$(SolutionDir)Source\Common\Include;$(SolutionDir)Source\CNTK\BrainScript;$(SolutionDir)Source\ActionsLib;$(MSMPI_INC);$(NvmlInclude)</AdditionalIncludeDirectories>
<AdditionalIncludeDirectories>$(SolutionDir)Source\SequenceTrainingLib;$(BOOST_INCLUDE_PATH);$(SolutionDir)Source\CNTKv2LibraryDll\API;$(SolutionDir)Source\CNTKv2LibraryDll;$(SolutionDir)Source\Math;$(SolutionDir)Source\Common\Include;$(SolutionDir)Source\CNTK\BrainScript;$(SolutionDir)Source\ActionsLib;$(MSMPI_INC);$(NvmlInclude);$(SolutionDir)Source\PerformanceProfilerDll;</AdditionalIncludeDirectories>
<PrecompiledHeader>
</PrecompiledHeader>
<PreprocessorDefinitions>WIN32;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
69 changes: 69 additions & 0 deletions Source/ComputationNetworkLib/ComputationNode.cpp
@@ -10,6 +10,14 @@
#include "ComputationNetworkBuilder.h" // TODO: We should only pull in NewComputationNodeFromConfig(). Nodes should not know about network at large.
#include "TensorShape.h"

#ifndef CNTK_UWP
#include "PerformanceProfiler.h"
#ifdef _WIN32
#define PERFORMANCE_PROFILER_LIB_NAME "Cntk.PerformanceProfiler-"##CNTK_COMPONENT_VERSION##".lib"
#pragma comment(lib, PERFORMANCE_PROFILER_LIB_NAME)
#endif
#endif

#ifndef let
#define let const auto
#endif
@@ -769,6 +777,67 @@ template <class ElemType>
}
}

template <class ElemType>
/*virtual*/ void ComputationNode<ElemType>::BeginTiming(bool backward)
{
if (!Globals::ShouldEnableNodeTiming()) return;

int phase = (backward ? (int)TimingPhase_Backward : (int)TimingPhase_Forward);
auto& timing = m_timing[phase];
timing.beginTime = std::chrono::system_clock::now();
timing.count++;
#ifndef CNTK_UWP
timing.profilerId = ProfilerTimeBegin();
#endif
}

template <class ElemType>
/*virtual*/ void ComputationNode<ElemType>::EndTiming(bool backward)
{
if (!Globals::ShouldEnableNodeTiming()) return;

int phase = (backward ? (int)TimingPhase_Backward : (int)TimingPhase_Forward);
auto& timing = m_timing[phase];
timing.duration += (std::chrono::system_clock::now() - timing.beginTime);

#ifndef CNTK_UWP
// the order must match enum
static const char* postfixes[TimingPhase_Total] =
{
"Forward",
"Backward",
};

if (timing.profilerName.length() != m_nodeName.length() + strlen(postfixes[phase]))
{
static char name[256];
sprintf_s(name, _countof(name), "%S%s", m_nodeName.c_str(), postfixes[phase]);
timing.profilerName = name;
}
ProfilerTimeEnd(timing.profilerId, timing.profilerName.c_str());
#endif
}

template<class ElemType>
void ComputationNode<ElemType>::PrintForwardBackwardTime()
{
if (GetInputs().size() == 0) return;

auto& forwardCount = m_timing[TimingPhase_Forward].count;
auto forwardDuration = m_timing[TimingPhase_Forward].duration.count();
auto& backwardCount = m_timing[TimingPhase_Backward].count;
auto backwardDuration = m_timing[TimingPhase_Backward].duration.count();
fprintf(stderr, "%-30S forward avg %07fs, backward avg %07fs (fwd# %d|bwd# %d)\n",
m_nodeName.c_str(),
forwardCount == 0 ? 0 : forwardDuration / forwardCount,
backwardCount == 0 ? 0 : backwardDuration / backwardCount,
forwardCount,
backwardCount);

for (auto& timing : m_timing)
timing.Reset();
}

template <class ElemType>
/*virtual*/ void ComputationNode<ElemType>::DumpNodeInfo(const bool /*printValues*/, const bool printMetadata, File& fstream) const
{