tsyhahaha/CUDA-NN


CUDA-NN

A collection of learning resources, featuring a straightforward implementation in Python and a comprehensive implementation of a neural network in CUDA.

  • Basic CUDA code for learners to get started.
  • A simple Python implementation of PointNet.
  • CUDA implementations of tensors, layers, and the PointNet model, including training and inference code.

Basic CUDA Code

This section includes three main components:

  • Matmul: The base version of matrix multiplication along with several optimized variations.
  • Reduce: Different versions for summation operations on matrices.
  • Transpose: Basic and optimized versions of matrix transpose.
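As a rough sketch of what a base version looks like (illustrative only, not the repository's actual v0.cu), a naive matrix-multiplication kernel assigns one output element per thread:

```cuda
// Naive matmul sketch: C = A * B, all N x N, row-major.
// One thread computes one element of C; every operand is
// re-read from global memory, which is what the optimized
// variants improve on.
__global__ void matmulNaive(const float *A, const float *B,
                            float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}
```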

Optimizations focus on the following aspects:

  • Shared Memory: Tiling techniques that reduce redundant, uncoalesced global-memory reads.
  • Register Memory: Raising the ratio of computation to memory access to improve achieved FLOPS.
  • Bank Conflict: Strategies to avoid bank conflicts in shared memory.
  • Memory Access Acceleration: Vectorized float4 loads to improve memory throughput.
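To make the shared-memory and bank-conflict points concrete, here is a hedged sketch of a tiled transpose (the repository's kernels may be structured differently). A tile is read coalesced into shared memory, then written back coalesced with swapped indices; the `+1` padding shifts each shared-memory row into a different bank, so the column-wise reads in the second phase are conflict-free:

```cuda
#define TILE 32

// Shared-memory transpose sketch for an n x n matrix.
__global__ void transposeTiled(const float *in, float *out, int n) {
    __shared__ float tile[TILE][TILE + 1];  // +1 pad avoids bank conflicts
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();
    // Swap block indices: this thread now writes the transposed tile,
    // reading shared memory column-wise (conflict-free thanks to the pad).
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}
```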

To run the starter code:

cd CUDA-NN/base/Matmul
nvcc v0.cu -o v0
./v0

Helpful References:

Simple Implementation of PointNet in Python

This section contains the code for the model implementation and training of PointNet on a single GPU.

To run:

cd CUDA-NN/python
python train.py

Implementation of PointNet in CUDA

For a major assignment in the UCAS GPU course, we were tasked with implementing PointNet inference and training in CUDA while meeting performance benchmarks for both accuracy and speed. The project is an excellent opportunity to practice CUDA.

The tensors, layers, and sub-modules of PointNet are implemented in CUDA-NN/src. Configure the data path in CUDA-NN/src/test.cu and the model parameters path in CUDA-NN/src/CMakeLists.txt.

To run the project:

cd CUDA-NN/CUDA-NN
mkdir build
cd build
cmake ..
make run    # or: make test
  • make run: Reads test data from CUDA-NN/data/beat and saves the output to CUDA-NN/data/cuout.
  • make test: Runs the official test program. Ensure that HDF5 is installed and that the .h5 files for the MNIST dataset have been downloaded. This target reads the test dataset and reports both running time and accuracy.

Helpful Reference:

Additional Resources for the UCAS GPU Course

To avoid reinventing the wheel, this repository offers a basic framework for model inference, along with numerous suggestions and reusable tools. The current implementation reaches an inference time of around 5 seconds; due to limited time, many optimizations remain unexplored. I hope this repository gives future students more resources to explore deeper optimization techniques.

1. Profiling the Program

There are two primary tools available for profiling the program to analyze performance:

  • NVIDIA Nsight Systems (nsys)
  • NVIDIA Nsight Compute (ncu)

Take the time to learn these tools properly. Optimizing inference touches several areas, so do not overlook any of them:

  • Data Transmission: nsys profiles the time consumed by many APIs like cudaFree, cudaMalloc, and cudaMemcpy, which can significantly impact inference time.
  • Kernel Performance: ncu provides detailed insights into each running kernel, including execution time, memory reads/writes, resource usage, and optimization suggestions.
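As a sketch of typical invocations (assuming nsys and ncu are on the PATH, and using ./train as the binary name from compile.sh):

```shell
# Timeline view: API-call and memcpy/malloc time, kernel launches
nsys profile -o report ./train
nsys stats report.nsys-rep

# Per-kernel metrics: execution time, memory throughput, occupancy
ncu --set full -o kernels ./train
```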

Helpful Reference:

2. Merging the Project

Versions on the course server (Ubuntu 18.04):

  • nvcc: 12.1
  • gcc: 7.5

The profiler only supports submitting a single .cu file, so it’s necessary to merge the entire project into a single file. To run the merge script:

bash merge.sh test/main # Merge test.cu or main.cu

Then use the official nvcc to compile the merged.cu file as written in compile.sh:

nvcc merged.cu -o train -Xcompiler "-O3 -std=c++14" -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_62,code=sm_62 -gencode arch=compute_70,code=sm_70 -lhdf5_cpp -lhdf5

./train

Note: The compilation process may fail due to the version of gcc or the architecture of the GPU. Please verify compatibility if errors occur.

3. Verifying Program Accuracy

To validate the correctness of layers or models, it is essential to compare them against the reference PyTorch implementation at the input/output level. To do this:

  • Ensure that the absolute or relative paths used in both CUDA and Python programs are correct. It is recommended to use absolute paths instead of relative ones.
  • Verify that the information in config.yaml matches.
  • Ensure that make run executes successfully in the CUDA-NN/build directory.
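Under the hood, such a comparison typically reduces to a floating-point tolerance check rather than exact equality, since CUDA and PyTorch may accumulate sums in different orders. A minimal sketch (outputs_match is a hypothetical helper, not beat.py's actual code):

```python
import numpy as np

def outputs_match(torch_out, cuda_out, rtol=1e-4, atol=1e-5):
    """Compare a PyTorch output against the CUDA program's output
    within floating-point tolerance (exact equality is too strict
    when sums are accumulated in a different order)."""
    torch_out = np.asarray(torch_out, dtype=np.float32)
    cuda_out = np.asarray(cuda_out, dtype=np.float32)
    if torch_out.shape != cuda_out.shape:
        return False
    return bool(np.allclose(torch_out, cuda_out, rtol=rtol, atol=atol))

# A perturbation within tolerance still counts as a match.
a = np.array([1.0, 2.0, 3.0], dtype=np.float32)
print(outputs_match(a, a + 1e-6))  # True
```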

Then run:

python beat.py

If all test points pass, you will see output like this:

Test (Python) module POINTNET
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 21.80it/s]
--------------------------------------------------
Test (CUDA) module POINTNET
--------------------------------------------------
Outputs matched
--------------------------------------------------
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 2445.66it/s]
[AC] All 10 tests passed successfully!
