Source code for a CUDA implementation of matrix multiplication.
--- General Information for device 0 ---
Name: GeForce GTX 1080 Ti
Compute capability: 6.1
Clock rate: 1683000
Device copy overlap: Enabled
Kernel execution timeout: Disabled
--- Memory Information for device 0 ---
Total global mem: 11720130560
Total constant mem: 65536
Max mem pitch: 2147483647
Texture Alignment: 512
--- MP Information for device 0 ---
Multiprocessor count: 28
Shared mem per mp: 49152
Registers per mp: 65536
Threads in warp: 32
Max threads per block: 1024
Max thread dimensions: (1024, 1024, 64)
Max grid dimensions: (2147483647, 65535, 65535)
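A listing like the one above can be produced with the CUDA runtime's device-query calls. The following is a minimal sketch, not necessarily the program that generated this output: the file name devinfo.cu is hypothetical, and the mapping of labels to cudaDeviceProp fields is an assumption (for example, the "per mp" labels are assumed to correspond to the per-block fields sharedMemPerBlock and regsPerBlock). Clock rate, copy overlap, and kernel timeout are left out because the corresponding fields are not available in every toolkit version.

```cuda
// devinfo.cu (hypothetical file name): print a subset of device properties.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, i);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                         cudaGetErrorString(err));
            return 1;
        }

        std::printf("--- General Information for device %d ---\n", i);
        std::printf("Name: %s\n", prop.name);
        std::printf("Compute capability: %d.%d\n", prop.major, prop.minor);

        std::printf("--- Memory Information for device %d ---\n", i);
        std::printf("Total global mem: %zu\n", prop.totalGlobalMem);
        std::printf("Total constant mem: %zu\n", prop.totalConstMem);
        std::printf("Max mem pitch: %zu\n", prop.memPitch);
        std::printf("Texture Alignment: %zu\n", prop.textureAlignment);

        std::printf("--- MP Information for device %d ---\n", i);
        std::printf("Multiprocessor count: %d\n", prop.multiProcessorCount);
        // Assumption: the "per mp" labels map to these per-block limits.
        std::printf("Shared mem per mp: %zu\n", prop.sharedMemPerBlock);
        std::printf("Registers per mp: %d\n", prop.regsPerBlock);
        std::printf("Threads in warp: %d\n", prop.warpSize);
        std::printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
        std::printf("Max thread dimensions: (%d, %d, %d)\n",
                    prop.maxThreadsDim[0], prop.maxThreadsDim[1],
                    prop.maxThreadsDim[2]);
        std::printf("Max grid dimensions: (%d, %d, %d)\n",
                    prop.maxGridSize[0], prop.maxGridSize[1],
                    prop.maxGridSize[2]);
    }
    return 0;
}
```

Because this sketch has its own main, it should be built separately from the matrix multiplication sources, for example with nvcc devinfo.cu -o devinfo --std=c++11.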
Compile the source files as follows:
nvcc *.cu --std=c++11