Test failed with th -lcunn -e "nn.testcuda()" #117

Open · LijieTu opened this issue Jul 23, 2015 · 7 comments

LijieTu commented Jul 23, 2015

When I run test.sh, th -lcunn -e "nn.testcuda()" is unstable and occasionally fails. The error message is different every time; some examples:

SpatialSubSampling_backward
error on state (backward)
LT(<) violation val=1.2421855926514, condition=0.01
/root/torch/install/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
/root/torch/install/share/lua/5.1/cunn/test.lua:1391: in function 'v'

LogSoftMax_forward_batch
error on state (forward)
LT(<) violation val=0.0010080337524414, condition=0.001
/root/torch/install/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
/root/torch/install/share/lua/5.1/cunn/test.lua:2364: in function 'v'

I think a similar issue was raised in #50 and may have been resolved.

I have updated to the latest torch and packages.

I am using an Ubuntu 14.04 Docker image with CUDA 7.0.

Thanks.


xonobo commented Jul 24, 2015

I have similar test failures. Just updated my torch distro today.

Here is the output of nn.testcuda() and cutorch.test():


th> nn.testcuda()
seed: 1437743792
Running 94 tests
==> Done Completed 183 asserts in 94 tests with 2 errors
SpatialSubSampling_forward_batch
Function call failed
...ayci/torch-distro/install/share/lua/5.1/torch/Tensor.lua:243: cuda runtime error (2) : out of memory at /home/bozkalayci/torch-distro/extra/cutorch/lib/THC/THCStorage.cu:30
stack traceback:
[C]: in function 'resize'
...ayci/torch-distro/install/share/lua/5.1/torch/Tensor.lua:243: in function 'cuda'
...kalayci/torch-distro/install/share/lua/5.1/cunn/test.lua:1329: in function 'v'
...kalayci/torch-distro/install/share/lua/5.1/cunn/test.lua:3496: in function <...kalayci/torch-distro/install/share/lua/5.1/cunn/test.lua:3494>
[C]: in function 'xpcall'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:115: in function 'pcall'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:182: in function '_run'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:157: in function 'run'
...kalayci/torch-distro/install/share/lua/5.1/cunn/test.lua:3516: in function 'testcuda'
[string "_RESULT={nn.testcuda()}"]:1: in main chunk
[C]: in function 'xpcall'
...alayci/torch-distro/install/share/lua/5.1/trepl/init.lua:630: in function 'repl'
...rch-distro/install/lib/luarocks/rocks/trepl/scm-1/bin/th:185: in main chunk
[C]: at 0x00406670

Threshold_transposed
Function call failed
...ayci/torch-distro/install/share/lua/5.1/nn/Threshold.lua:20: cuda runtime error (2) : out of memory at /home/bozkalayci/torch-distro/extra/cunn/Threshold.cu:51
stack traceback:
[C]: in function 'Threshold_updateOutput'
...ayci/torch-distro/install/share/lua/5.1/nn/Threshold.lua:20: in function 'forward'
...kalayci/torch-distro/install/share/lua/5.1/cunn/test.lua:58: in function 'pointwise_transposed'
...kalayci/torch-distro/install/share/lua/5.1/cunn/test.lua:605: in function 'v'
...kalayci/torch-distro/install/share/lua/5.1/cunn/test.lua:3496: in function <...kalayci/torch-distro/install/share/lua/5.1/cunn/test.lua:3494>
[C]: in function 'xpcall'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:115: in function 'pcall'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:182: in function '_run'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:157: in function 'run'
...kalayci/torch-distro/install/share/lua/5.1/cunn/test.lua:3516: in function 'testcuda'
[string "_RESULT={nn.testcuda()}"]:1: in main chunk
[C]: in function 'xpcall'
...alayci/torch-distro/install/share/lua/5.1/trepl/init.lua:630: in function 'repl'
...rch-distro/install/lib/luarocks/rocks/trepl/scm-1/bin/th:185: in main chunk
[C]: at 0x00406670

                                                                  [47.9153s]    

th> cutorch.test()
seed: 1437743979
Running 117 tests
==> Done Completed 845 asserts in 117 tests with 4 errors
multi_gpu_copy_noncontig
Function call failed
...ayci/torch-distro/install/share/lua/5.1/cutorch/init.lua:21: cuda runtime error (2) : out of memory at /home/bozkalayci/torch-distro/extra/cutorch/lib/THC/THCStorage.cu:30
stack traceback:
[C]: in function 'error'
...ayci/torch-distro/install/share/lua/5.1/cutorch/init.lua:21: in function 'withDevice'
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:1661: in function 'v'
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:2297: in function <...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:2295>
[C]: in function 'xpcall'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:115: in function 'pcall'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:182: in function '_run'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:157: in function 'run'
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:2317: in function 'test'
[string "_RESULT={cutorch.test()}"]:1: in main chunk
[C]: in function 'xpcall'
...alayci/torch-distro/install/share/lua/5.1/trepl/init.lua:630: in function 'repl'
...rch-distro/install/lib/luarocks/rocks/trepl/scm-1/bin/th:185: in main chunk
[C]: at 0x00406670

pow2
Function call failed
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:129: cuda runtime error (2) : out of memory at /home/bozkalayci/torch-distro/extra/cutorch/lib/THC/THCTensorMath2.cu:43
stack traceback:
[C]: at 0x7fc2b41a9f30
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:129: in function 'compareFloatAndCudaTensorArgs'
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:899: in function 'v'
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:2297: in function <...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:2295>
[C]: in function 'xpcall'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:115: in function 'pcall'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:182: in function '_run'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:157: in function 'run'
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:2317: in function 'test'
[string "_RESULT={cutorch.test()}"]:1: in main chunk
[C]: in function 'xpcall'
...alayci/torch-distro/install/share/lua/5.1/trepl/init.lua:630: in function 'repl'
...rch-distro/install/lib/luarocks/rocks/trepl/scm-1/bin/th:185: in main chunk
[C]: at 0x00406670

largeNoncontiguous
Function call failed
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:483: cuda runtime error (2) : out of memory at /home/bozkalayci/torch-distro/extra/cutorch/lib/THC/THCStorage.cu:30
stack traceback:
[C]: in function 'new'
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:483: in function 'fn'
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:72: in function 'compareFloatAndCuda'
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:485: in function 'v'
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:2297: in function <...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:2295>
[C]: in function 'xpcall'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:115: in function 'pcall'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:182: in function '_run'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:157: in function 'run'
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:2317: in function 'test'
[string "_RESULT={cutorch.test()}"]:1: in main chunk
[C]: in function 'xpcall'
...alayci/torch-distro/install/share/lua/5.1/trepl/init.lua:630: in function 'repl'
...rch-distro/install/lib/luarocks/rocks/trepl/scm-1/bin/th:185: in main chunk
[C]: at 0x00406670

abs1
Function call failed
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:129: cuda runtime error (2) : out of memory at /home/bozkalayci/torch-distro/extra/cutorch/lib/THC/THCTensorMathPointwise.cu:58
stack traceback:
[C]: at 0x7fc2b41a92a0
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:129: in function 'compareFloatAndCudaTensorArgs'
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:844: in function 'v'
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:2297: in function <...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:2295>
[C]: in function 'xpcall'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:115: in function 'pcall'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:182: in function '_run'
...ayci/torch-distro/install/share/lua/5.1/torch/Tester.lua:157: in function 'run'
...ayci/torch-distro/install/share/lua/5.1/cutorch/test.lua:2317: in function 'test'
[string "_RESULT={cutorch.test()}"]:1: in main chunk
[C]: in function 'xpcall'
...alayci/torch-distro/install/share/lua/5.1/trepl/init.lua:630: in function 'repl'
...rch-distro/install/lib/luarocks/rocks/trepl/scm-1/bin/th:185: in main chunk
[C]: at 0x00406670

                                                                  [10.2387s]    

@jayavanth

Did you try with a different GPU? I switched to a better GPU and it worked. Make sure you have >8 GB of global memory.
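
For reference, free and total GPU memory can be checked directly from the th REPL. This is only a sketch; it assumes cutorch is installed and that cutorch.getMemoryUsage and cutorch.getDeviceProperties are available (as they were in cutorch of this era):

-- Sketch: print free and total memory for every visible GPU.
require 'cutorch'

for dev = 1, cutorch.getDeviceCount() do
  local freeBytes, totalBytes = cutorch.getMemoryUsage(dev)
  local props = cutorch.getDeviceProperties(dev)
  print(string.format('GPU %d (%s): %.2f GB free of %.2f GB',
                      dev, props.name, freeBytes / 2^30, totalBytes / 2^30))
end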


LinusU commented Sep 16, 2015

I'm having the same problem. Would it be possible to tweak the test parameters so that they use a bit less memory? It would be nice if this worked on the Amazon GPU instances...


soumith commented Sep 16, 2015

You can safely ignore the out-of-memory errors.


LinusU commented Sep 16, 2015

I wish I could tell Ansible that 😕

I'm automatically provisioning servers in the cloud, and I would like to run the smoke test to verify that everything worked. The current workaround is to comment out the entire test...

I'm also seeing the errors reported by the original poster just before the out-of-memory failures occur.

SpatialSubSampling_backward
error on state (backward)
LT(<) violation val=1.2421855926514, condition=0.01
/root/torch/install/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
/root/torch/install/share/lua/5.1/cunn/test.lua:1391: in function 'v'

LogSoftMax_forward_batch
error on state (forward)
LT(<) violation val=0.0010080337524414, condition=0.001
/root/torch/install/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
/root/torch/install/share/lua/5.1/cunn/test.lua:2364: in function 'v'
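
One possible alternative to commenting out the whole suite for provisioning checks is a much smaller sanity test. The following is only a sketch under the assumption that cunn is installed; the file name smoke_test.lua, the sizes, and the 1e-5 tolerance are arbitrary choices, not part of cunn:

-- smoke_test.lua: hypothetical minimal GPU sanity check (not part of cunn).
-- Runs a small nn.Linear forward pass on CPU and GPU and compares the results.
require 'cunn'

torch.manualSeed(1234)
cutorch.manualSeed(1234)

local input  = torch.randn(16, 10)
local linear = nn.Linear(10, 5)

local cpuOut = linear:forward(input):clone()
local gpuOut = linear:clone():cuda():forward(input:cuda()):float()

local maxDiff = (cpuOut:float() - gpuOut):abs():max()
assert(maxDiff < 1e-5, string.format('CPU/GPU mismatch: %g', maxDiff))
print(string.format('GPU smoke test passed (max difference %g)', maxDiff))

If the assert fires, th reports an error, which a provisioning tool such as Ansible should be able to treat as a failure more easily than the full suite's output.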


soumith commented Oct 4, 2015

We have to increase LogSoftMax's threshold slightly, and SpatialSubSampling actually looks like it has a corner-case bug. With regard to the out-of-memory errors, the tests could be modified to keep tensor sizes within the memory limits reported by cutorch.getDeviceProperties. Any PRs for any of these are appreciated; if not, I'll fix them at my own pace.
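
To illustrate the last suggestion, a test could bound its allocations by the free memory cutorch reports. This is only a sketch; sizeForDevice is a hypothetical helper, not existing cunn test code, and it assumes cutorch.getMemoryUsage is available:

require 'cutorch'

-- Hypothetical helper: cap the number of float elements a test allocates to a
-- fraction of the free memory on the current device (4 bytes per float element).
local function sizeForDevice(wantedElements, budgetFraction)
  budgetFraction = budgetFraction or 0.25
  local freeBytes = cutorch.getMemoryUsage(cutorch.getDevice())
  local maxElements = math.floor(freeBytes * budgetFraction / 4)
  return math.min(wantedElements, maxElements)
end

-- Usage inside a test: instead of a fixed, possibly too large size,
-- local n = sizeForDevice(60e6)            -- want ~60M floats if they fit
-- local t = torch.CudaTensor(n):uniform()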

@howardlinus

I have a similar problem when running ./test.sh; has this been resolved? (I use Ubuntu 14.04 and CUDA 7.0 on AWS.)

LogSoftMax_forward_batch
error on state (forward)
LT(<) violation val=0.0011835098266602, condition=0.001
/home/ubuntu/torch/install/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
/home/ubuntu/torch/install/share/lua/5.1/cunn/test.lua:315: in function 'v'
