A collection of learning resources, featuring a straightforward Python implementation and a comprehensive CUDA implementation of a neural network (PointNet).
- Basic CUDA code for learners to get started.
- A simple Python implementation of PointNet.
- CUDA implementations of tensors, layers, and the PointNet model, including training and inference code.
This section includes three main components:
- Matmul: The base version of matrix multiplication along with several optimized variations (a minimal tiled sketch follows this list).
- Reduce: Different versions for summation operations on matrices.
- Transpose: Basic and optimized versions of matrix transpose.
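For a taste of what the optimized variants build on, here is a minimal shared-memory tiled matmul sketch (an illustration written for this README, not one of the repository's kernels; it assumes square row-major matrices with `N` divisible by the tile width):

```cuda
#define TILE 16

// C = A * B for row-major N x N matrices, with N % TILE == 0.
// Each block computes one TILE x TILE tile of C, staging tiles of A and B
// in shared memory; each global element is then read TILE times fewer.
__global__ void matmulTiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Cooperative load: each thread fetches one element of each tile.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}

// Launch: dim3 block(TILE, TILE), grid(N / TILE, N / TILE);
// matmulTiled<<<grid, block>>>(dA, dB, dC, N);
```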
Optimizations focus on the following aspects:
- Shared Memory: Tiling techniques that cut redundant and scattered global-memory reads.
- Register Memory: Increasing the ratio of computation to memory access to raise achieved FLOPS.
- Bank Conflict: Strategies to avoid shared-memory bank conflicts (see the padded-transpose sketch after this list).
- Memory Access Acceleration: Using vectorized `float4` loads to improve effective bandwidth.
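To make the bank-conflict point concrete, here is a minimal padded-tile transpose sketch (again an illustration, assuming `N` divisible by the tile width):

```cuda
#define TDIM 32

// Transposes a row-major N x N matrix, N % TDIM == 0. The +1 padding
// column makes consecutive rows of the tile fall into different
// shared-memory banks, so the column-wise reads below are conflict-free.
__global__ void transposeTiled(const float *in, float *out, int N) {
    __shared__ float tile[TDIM][TDIM + 1];

    int x = blockIdx.x * TDIM + threadIdx.x;
    int y = blockIdx.y * TDIM + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * N + x];   // coalesced read
    __syncthreads();

    // Swap block coordinates so the global write is coalesced too.
    x = blockIdx.y * TDIM + threadIdx.x;
    y = blockIdx.x * TDIM + threadIdx.y;
    out[y * N + x] = tile[threadIdx.x][threadIdx.y];  // no bank conflicts
}

// Launch: dim3 block(TDIM, TDIM), grid(N / TDIM, N / TDIM);
```

Without the padding, threads in a warp would read a 32-element column of `tile` that maps entirely to one bank, serializing the access.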
To run the starter code:
```bash
cd CUDA-NN/base/Matmul
nvcc v0.cu -o v0
./v0
```
- CUDA C++ Programming Guide
- General-Purpose Graphics Processor Architecture
- Optimizing a CUDA Matmul Kernel for cuBLAS-like Performance
- NVIDIA_SGEMM_PRACTICE (GitHub)
- CUDA Matrix Multiplication: Implementation, Optimization, and Performance Analysis
- Ultimate Guide to CUDA Matrix Multiplication Optimization
- CUDA Matrix Transpose Optimization
- How to Optimize Algorithms in CUDA (GitHub)
- CUDA Practice: Matrix Multiplication
This section contains the Python implementation and single-GPU training code for PointNet.
To run:
```bash
cd CUDA-NN/python
python train.py
```
As a major assignment for the UCAS GPU course, we were tasked with implementing PointNet inference and training in CUDA and meeting performance benchmarks for both accuracy and speed. The project is an excellent opportunity to practice CUDA.
The tensors, layers, and sub-modules of PointNet are implemented in `CUDA-NN/src`. Configure the data path in `CUDA-NN/src/test.cu` and the model parameters path in `CUDA-NN/src/CMakeLists.txt`.
To run the project:
```bash
cd CUDA-NN/CUDA-NN
mkdir build
cd build
cmake ..
make run    # or: make test
```
- `make run`: Reads test data from `CUDA-NN/data/beat` and saves the output to `CUDA-NN/data/cuout`.
- `make test`: Runs the official test program. Please ensure `hdf5` is installed and download the `h5` files for the MNIST dataset. This will read the test dataset and output both the running time and accuracy.
To avoid reinventing the wheel, this repository offers a basic framework for model inference, along with numerous suggestions and reusable tools. It achieves an inference time of around 5 seconds, and plenty of room for optimization remains; I hope it gives future students more resources to explore deeper optimization techniques.
There are two primary tools available for profiling the program to analyze performance:
- NVIDIA Nsight Systems (`nsys`)
- NVIDIA Nsight Compute (`ncu`)
Be mindful of how to use these tools. Optimizing inference encompasses various aspects, so do not overlook any critical area:
- Data Transmission: `nsys` profiles the time consumed by APIs such as `cudaFree`, `cudaMalloc`, and `cudaMemcpy`, which can significantly impact inference time.
- Kernel Performance: `ncu` provides detailed insights into each running kernel, including execution time, memory reads/writes, resource usage, and optimization suggestions.
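For example, assuming the compiled binary is named `train` (the report file names below are arbitrary):

```bash
nsys profile -o timeline ./train   # timeline of CUDA API calls, memcpys, and kernels
ncu -o kernels ./train             # per-kernel metrics; may require elevated privileges
```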
Versions on the course server (Ubuntu 18.04):
- nvcc: 12.1
- gcc: 7.5
The profiler only supports submitting a single `.cu` file, so it's necessary to merge the entire project into one. To run the merge script:
```bash
bash merge.sh test/main   # Merge test.cu or main.cu
```
Then use the official `nvcc` to compile the merged `.cu` file, as written in `compile.sh`:
```bash
nvcc merged.cu -o train -Xcompiler "-O3 -std=c++14" \
  -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 \
  -gencode arch=compute_62,code=sm_62 -gencode arch=compute_70,code=sm_70 \
  -lhdf5_cpp -lhdf5
./train
```
Note: Compilation may fail depending on the gcc version or the GPU architecture. Please verify compatibility if errors occur.
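If compilation fails on the `-gencode` flags, the target GPU's compute capability can be checked with a small query program using the standard CUDA runtime API:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of GPU 0
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    return 0;
}
```

Match the reported number against the `arch=compute_XY,code=sm_XY` pairs above.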
To validate the correctness of layers or models, it’s essential to compare them with the official implementation in Torch at the input/output level. To do this:
- Ensure that the absolute or relative paths used in both CUDA and Python programs are correct. It is recommended to use absolute paths instead of relative ones.
- Verify that the information in `config.yaml` matches.
- Ensure that `make run` executes successfully in the `CUDA-NN/build` directory.
Then run:
```bash
python beat.py
```
If all test points pass, you will see output like this:
```
Test (Python) module POINTNET
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 21.80it/s]
--------------------------------------------------
Test (CUDA) module POINTNET
--------------------------------------------------
Outputs matched
--------------------------------------------------
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 2445.66it/s]
[AC] All 10 tests passed successfully!
```