This repository implements a deformable convolution operation for TensorFlow. It borrows most of its CUDA code from the original implementation.
TensorFlow (with GPU support configured)
CUDA 8.0
g++ 4.9.2
Note: This has only been tested on a platform with these versions of g++ and CUDA installed. Other versions will probably work, but you may need to modify the compile scripts.
- Set up `TF_INC` and `CUDA_HOME`. `TF_INC` can be set with `TF_INC=$(python -c 'import tensorflow as tf; print(tf.sysconfig.get_include())')`. Make sure `CUDA_HOME` is the path where CUDA is installed, for example the default `/usr/local/cuda`.
- Build the op. If you have the TensorFlow source installed, copy all the cpp files in `./lib`, together with `BUILD`, to `$(Tensorflow_source_dir)/tensorflow/core/user_ops`, then run `bazel build --config=opt --config=cuda //tensorflow/core/user_ops:deform_conv.so` in `$(Tensorflow_source_dir)`. If not, run `./lib/nvcc_complie.sh` and `./lib/g++_complie.sh` in sequence to build `deform_conv.so`.
- Add `import lib.deform_conv_op as deform_conv_op` to your Python script (make sure `PYTHONPATH` is set correctly).
A simple WGAN script trained on MNIST, used to validate backpropagation.
Since the offsets mostly stay between -1 and 1, there is no need to visualize them. Considering how simple the discriminator's task is, I'm not surprised. I might bring in scaled MNIST and pretrain the regular conv part, or change the initializer of the offset conv to random normal, so that the deformation matters.
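To make concrete what an offset within (-1, 1) does, here is a minimal pure-Python sketch of bilinear sampling at a fractionally shifted location, which is how deformable convolution reads its input. It is an illustration rather than this repo's CUDA kernel: the function name `bilinear_sample` is hypothetical, and treating out-of-range taps as zero is an assumption.

```python
import math


def bilinear_sample(image, y, x):
    """Sample a 2-D list `image` at fractional (y, x).

    Taps that fall outside the image contribute 0 (an assumed padding rule).
    """
    height, width = len(image), len(image[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            # Each of the four neighbours is weighted by its closeness to (y, x).
            weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
            if 0 <= yy < height and 0 <= xx < width:
                value += weight * image[yy][xx]
    return value


image = [[0.0, 1.0, 2.0],
         [3.0, 4.0, 5.0],
         [6.0, 7.0, 8.0]]

# Regular convolution tap at the centre pixel (zero offset).
no_offset = bilinear_sample(image, 1, 1)
# Same tap with a learned offset of (+0.5, -0.25): a blend of four neighbours.
shifted = bilinear_sample(image, 1 + 0.5, 1 - 0.25)
print(no_offset, shifted)
```

With offsets this small, every tap stays within one pixel of its regular grid position, so the sampled values differ only slightly from those of an ordinary convolution.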
- Basic test against the original implementation.
- Make sure the gradients work. (A weird bug used to occur: the data gradient was correct except the first time it was calculated. It now works normally in my tests, but if you find any bug, just open an issue.)
- Simple benchmark.
- Some demos and visualization.
- Backward pass takes too much time.
- Other ops.
The benchmark script was borrowed from here. The forward time is fine: for 100x3x224x224 data, it runs in about 0.077s. The backward time is worse than desired: it takes 0.558s to run a batch of the same data. Note that I compute the backward pass for all three inputs (data, offset, kernels) together, rather than splitting input_backwards and kernel_backwards into two ops as many TensorFlow conv ops do, so this might be one of the reasons. In addition, because I sometimes find it hard to manipulate `tensorflow::Tensor`, I wrote a simple CUDA kernel that does nothing but add one tensor to another, to accumulate gradients along the batch dimension in the kernel gradient implementation. I don't know whether this affects performance.
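For anyone who wants to reproduce numbers of this kind, a timing loop along the following lines is one option. It is a minimal sketch and not the borrowed benchmark script: `time_callable` is a hypothetical helper, and it assumes `fn` blocks until the result is available (for example a function that wraps a `session.run` call on the forward or backward op).

```python
import time


def time_callable(fn, warmup=2, runs=10):
    """Return the mean wall-clock seconds per call of `fn`.

    The warm-up calls are excluded so one-time setup cost does not skew the mean.
    """
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs


if __name__ == "__main__":
    # Stand-in workload; replace with a call that runs the op under test.
    mean_seconds = time_callable(lambda: sum(range(100000)))
    print("mean time per run: %.6fs" % mean_seconds)
```

Timing the forward and backward ops separately with the same helper makes it easier to see how much of the total the combined backward op accounts for.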