Test failed with th -lcunn -e "nn.testcuda()" #117

Open · LijieTu opened this issue Jul 23, 2015 · 7 comments

LijieTu commented Jul 23, 2015

When I run test.sh, th -lcunn -e "nn.testcuda()" is unstable and occasionally fails. The error message is different every time; some examples:

SpatialSubSampling_backward
error on state (backward)
LT(<) violation val=1.2421855926514, condition=0.01
/root/torch/install/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
/root/torch/install/share/lua/5.1/cunn/test.lua:1391: in function 'v'

LogSoftMax_forward_batch
error on state (forward)
LT(<) violation val=0.0010080337524414, condition=0.001
/root/torch/install/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
/root/torch/install/share/lua/5.1/cunn/test.lua:2364: in function 'v'

I think a similar issue was raised in #50 and may have been resolved.

I have updated to the latest torch and packages.

I am using an Ubuntu 14.04 Docker image with CUDA 7.0.

Thanks.


xonobo commented Jul 24, 2015

I have similar test failures. Just updated my torch distro today.

Here is the output of nn.testcuda() and cutorch.test():


th> nn.testcuda()
seed: 1437743792
Running 94 tests
==> Done Completed 183 asserts in 94 tests with 2 errors
SpatialSubSampling_forward_batch
Function call failed
...ayci/torch-distro/install/share/lua/5.1/torch/Tensor.lua:243: cuda runtime error (2) : out of memory at /home/bozkalayci/torch-distro/extra/cutorch/lib/THC/THCStorage.cu:30
stack traceback:
[C]: in function 'resize'
...ayci/torch-distro/install/share/lua/5.1/torch/Tensor.lua:243: in function 'cuda'
...kalayci/torch-distro/install/share/lua/5.1/cunn/test.lua:1329: in function 'v'
...kalayci/torch-distro/install/share/lua/5.1/cunn/test.lua:3496: in function <...kalayci/torch-distro/install/share/lua/5.1/cunn/test.lua:3494>
[C]: in function 'xpcall'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:115: in function 'pcall'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:182: in function '_run'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:157: in function 'run'
...kalayci/torch-distro/install/share/lua/5.1/cunn/test.lua:3516: in function 'testcuda'
[string "_RESULT={nn.testcuda()}"]:1: in main chunk
[C]: in function 'xpcall'
...alayci/torch-distro/install/share/lua/5.1/trepl/init.lua:630: in function 'repl'
...rch-distro/install/lib/luarocks/rocks/trepl/scm-1/bin/th:185: in main chunk
[C]: at 0x00406670

Threshold_transposed
Function call failed
...ayci/torch-distro/install/share/lua/5.1/nn/Threshold.lua:20: cuda runtime error (2) : out of memory at /home/bozkalayci/torch-distro/extra/cunn/Threshold.cu:51
stack traceback:
[C]: in function 'Threshold_updateOutput'
...ayci/torch-distro/install/share/lua/5.1/nn/Threshold.lua:20: in function 'forward'
...kalayci/torch-distro/install/share/lua/5.1/cunn/test.lua:58: in function 'pointwise_transposed'
...kalayci/torch-distro/install/share/lua/5.1/cunn/test.lua:605: in function 'v'
...kalayci/torch-distro/install/share/lua/5.1/cunn/test.lua:3496: in function <...kalayci/torch-distro/install/share/lua/5.1/cunn/test.lua:3494>
[C]: in function 'xpcall'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:115: in function 'pcall'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:182: in function '_run'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:157: in function 'run'
...kalayci/torch-distro/install/share/lua/5.1/cunn/test.lua:3516: in function 'testcuda'
[string "_RESULT={nn.testcuda()}"]:1: in main chunk
[C]: in function 'xpcall'
...alayci/torch-distro/install/share/lua/5.1/trepl/init.lua:630: in function 'repl'
...rch-distro/install/lib/luarocks/rocks/trepl/scm-1/bin/th:185: in main chunk
[C]: at 0x00406670

                                                                  [47.9153s]    

th> cutorch.test()
seed: 1437743979
Running 117 tests
==> Done Completed 845 asserts in 117 tests with 4 errors
multi_gpu_copy_noncontig
Function call failed
...ayci/torch-distro/install/share/lua/5.1/cutorch/init.lua:21: cuda runtime error (2) : out of memory at /home/bozkalayci/torch-distro/extra/cutorch/lib/THC/THCStorage.cu:30
stack traceback:
[C]: in function 'error'
...ayci/torch-distro/install/share/lua/5.1/cutorch/init.lua:21: in function 'withDevice'
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:1661: in function 'v'
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:2297: in function <...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:2295>
[C]: in function 'xpcall'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:115: in function 'pcall'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:182: in function '_run'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:157: in function 'run'
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:2317: in function 'test'
[string "_RESULT={cutorch.test()}"]:1: in main chunk
[C]: in function 'xpcall'
...alayci/torch-distro/install/share/lua/5.1/trepl/init.lua:630: in function 'repl'
...rch-distro/install/lib/luarocks/rocks/trepl/scm-1/bin/th:185: in main chunk
[C]: at 0x00406670

pow2
Function call failed
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:129: cuda runtime error (2) : out of memory at /home/bozkalayci/torch-distro/extra/cutorch/lib/THC/THCTensorMath2.cu:43
stack traceback:
[C]: at 0x7fc2b41a9f30
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:129: in function 'compareFloatAndCudaTensorArgs'
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:899: in function 'v'
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:2297: in function <...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:2295>
[C]: in function 'xpcall'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:115: in function 'pcall'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:182: in function '_run'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:157: in function 'run'
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:2317: in function 'test'
[string "_RESULT={cutorch.test()}"]:1: in main chunk
[C]: in function 'xpcall'
...alayci/torch-distro/install/share/lua/5.1/trepl/init.lua:630: in function 'repl'
...rch-distro/install/lib/luarocks/rocks/trepl/scm-1/bin/th:185: in main chunk
[C]: at 0x00406670

largeNoncontiguous
Function call failed
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:483: cuda runtime error (2) : out of memory at /home/bozkalayci/torch-distro/extra/cutorch/lib/THC/THCStorage.cu:30
stack traceback:
[C]: in function 'new'
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:483: in function 'fn'
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:72: in function 'compareFloatAndCuda'
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:485: in function 'v'
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:2297: in function <...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:2295>
[C]: in function 'xpcall'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:115: in function 'pcall'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:182: in function '_run'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:157: in function 'run'
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:2317: in function 'test'
[string "_RESULT={cutorch.test()}"]:1: in main chunk
[C]: in function 'xpcall'
...alayci/torch-distro/install/share/lua/5.1/trepl/init.lua:630: in function 'repl'
...rch-distro/install/lib/luarocks/rocks/trepl/scm-1/bin/th:185: in main chunk
[C]: at 0x00406670

abs1
Function call failed
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:129: cuda runtime error (2) : out of memory at /home/bozkalayci/torch-distro/extra/cutorch/lib/THC/THCTensorMathPointwise.cu:58
stack traceback:
[C]: at 0x7fc2b41a92a0
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:129: in function 'compareFloatAndCudaTensorArgs'
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:844: in function 'v'
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:2297: in function <...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:2295>
[C]: in function 'xpcall'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:115: in function 'pcall'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:182: in function '_run'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:157: in function 'run'
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:2317: in function 'test'
[string "_RESULT={cutorch.test()}"]:1: in main chunk
[C]: in function 'xpcall'
...alayci/torch-distro/install/share/lua/5.1/trepl/init.lua:630: in function 'repl'
...rch-distro/install/lib/luarocks/rocks/trepl/scm-1/bin/th:185: in main chunk
[C]: at 0x00406670

                                                                  [10.2387s]    

@jayavanth

Did you try with a different GPU? I switched to a better GPU and it worked. Make sure you have >8 GB of global memory.
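
For reference, free and total GPU memory can be checked directly from the th REPL. This is only a sketch; it assumes cutorch is installed and that cutorch.getMemoryUsage and cutorch.getDeviceProperties are available (as they were in cutorch of this era):

-- Sketch: print free and total memory for every visible GPU.
require 'cutorch'

for dev = 1, cutorch.getDeviceCount() do
  local freeBytes, totalBytes = cutorch.getMemoryUsage(dev)
  local props = cutorch.getDeviceProperties(dev)
  print(string.format('GPU %d (%s): %.2f GB free of %.2f GB',
                      dev, props.name, freeBytes / 2^30, totalBytes / 2^30))
end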


LinusU commented Sep 16, 2015

I'm having the same problem. Would it be possible to tweak the test parameters so that they use a bit less memory? It would be nice if this worked on the Amazon GPU instances...


soumith commented Sep 16, 2015

You can safely ignore the out-of-memory errors.


LinusU commented Sep 16, 2015

I wish I could tell Ansible that 😕

I'm automatically provisioning servers in the cloud, and I would like to run the smoke test to verify that everything worked. The current workaround is to comment out the entire test...

I'm also seeing the errors reported by the original poster just before the out-of-memory failures occur.

SpatialSubSampling_backward
error on state (backward)
LT(<) violation val=1.2421855926514, condition=0.01
/root/torch/install/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
/root/torch/install/share/lua/5.1/cunn/test.lua:1391: in function 'v'

LogSoftMax_forward_batch
error on state (forward)
LT(<) violation val=0.0010080337524414, condition=0.001
/root/torch/install/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
/root/torch/install/share/lua/5.1/cunn/test.lua:2364: in function 'v'
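
One possible alternative to commenting out the whole suite for provisioning checks is a much smaller sanity test. The following is only a sketch under the assumption that cunn is installed; the file name smoke_test.lua, the sizes, and the 1e-5 tolerance are arbitrary choices, not part of cunn:

-- smoke_test.lua: hypothetical minimal GPU sanity check (not part of cunn).
-- Runs a small nn.Linear forward pass on CPU and GPU and compares the results.
require 'cunn'

torch.manualSeed(1234)
cutorch.manualSeed(1234)

local input  = torch.randn(16, 10)
local linear = nn.Linear(10, 5)

local cpuOut = linear:forward(input):clone()
local gpuOut = linear:clone():cuda():forward(input:cuda()):float()

local maxDiff = (cpuOut:float() - gpuOut):abs():max()
assert(maxDiff < 1e-5, string.format('CPU/GPU mismatch: %g', maxDiff))
print(string.format('GPU smoke test passed (max difference %g)', maxDiff))

If the assert fires, th reports an error, which a provisioning tool such as Ansible should be able to treat as a failure more easily than the full suite's output.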


soumith commented Oct 4, 2015

We have to increase LogSoftMax's threshold slightly, and SpatialSubSampling actually looks like it has a corner-case bug. With regard to the out-of-memory errors, the tests could be modified to keep tensor sizes within the memory limits reported by cutorch.getDeviceProperties. Any PRs for any of these are appreciated; if not, I'll fix them at my own pace.
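
To illustrate the last suggestion, a test could bound its allocations by the free memory cutorch reports. This is only a sketch; sizeForDevice is a hypothetical helper, not existing cunn test code, and it assumes cutorch.getMemoryUsage is available:

require 'cutorch'

-- Hypothetical helper: cap the number of float elements a test allocates to a
-- fraction of the free memory on the current device (4 bytes per float element).
local function sizeForDevice(wantedElements, budgetFraction)
  budgetFraction = budgetFraction or 0.25
  local freeBytes = cutorch.getMemoryUsage(cutorch.getDevice())
  local maxElements = math.floor(freeBytes * budgetFraction / 4)
  return math.min(wantedElements, maxElements)
end

-- Usage inside a test: instead of a fixed, possibly too large size,
-- local n = sizeForDevice(60e6)            -- want ~60M floats if they fit
-- local t = torch.CudaTensor(n):uniform()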

@howardlinus

I have a similar problem when running ./test.sh; has this been resolved? (I use Ubuntu 14.04 and CUDA 7.0 on AWS.)

LogSoftMax_forward_batch
error on state (forward)
LT(<) violation val=0.0011835098266602, condition=0.001
/home/ubuntu/torch/install/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
/home/ubuntu/torch/install/share/lua/5.1/cunn/test.lua:315: in function 'v'
