CUDA - Compute Unified Device Architecture

CUDA® is a parallel computing platform and programming model developed by NVIDIA for general-purpose computing on graphics processing units (GPUs).

cuBLAS

The NVIDIA cuBLAS library is a fast GPU-accelerated implementation of the standard basic linear algebra subroutines (BLAS). Using cuBLAS APIs, you can speed up your applications by deploying compute-intensive operations to a single GPU or scale up and distribute work across multi-GPU configurations efficiently.
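As a rough illustration of the kind of routine BLAS standardizes, here is a minimal host-only sketch of the general matrix multiply (GEMM) operation, `C = alpha * A * B + beta * C`, without transposes. It uses column-major storage, which is the convention cuBLAS follows. The name `gemm_reference` is hypothetical and is not part of the cuBLAS API; a function like this could serve as a CPU check for results produced on the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference (not a cuBLAS call): C = alpha * A * B + beta * C.
// All matrices are column-major, as in BLAS: element (i, j) of a matrix
// with `rows` rows is stored at index i + j * rows.
void gemm_reference(int m, int n, int k, float alpha,
                    const std::vector<float>& A,  // m x k
                    const std::vector<float>& B,  // k x n
                    float beta,
                    std::vector<float>& C) {      // m x n, updated in place
    assert(A.size() == static_cast<std::size_t>(m) * static_cast<std::size_t>(k));
    assert(B.size() == static_cast<std::size_t>(k) * static_cast<std::size_t>(n));
    assert(C.size() == static_cast<std::size_t>(m) * static_cast<std::size_t>(n));

    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            // Dot product of row i of A with column j of B.
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                acc += A[i + p * m] * B[p + j * k];
            }
            C[i + j * m] = alpha * acc + beta * C[i + j * m];
        }
    }
}
```

The triple loop is only meant to pin down the arithmetic and the memory layout; it makes no attempt at the tiling or vectorization a tuned library would use.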

cuDNN

cuDNN is a GPU-accelerated library of primitives for deep neural networks. It provides highly tuned implementations of standard routines such as forward and backward convolution, pooling, normalization, and activation layers.

cuSOLVER

The NVIDIA cuSOLVER library provides a collection of dense and sparse direct solvers which deliver significant acceleration for Computer Vision, CFD, Computational Chemistry, and Linear Optimization applications.

Thrust

Thrust is a library of parallel algorithms and data structures. It provides a flexible, high-level interface for GPU programming that greatly enhances developer productivity. Using Thrust, C++ developers can write just a few lines of code to perform GPU-accelerated sort, scan, transform, and reduction operations, which on large inputs can run far faster than on multi-core CPUs.
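To make the four operations named above concrete, here is a host-only sketch that uses their C++ standard library counterparts. Thrust's interface is modeled on the standard library, so the GPU version would look similar, with `thrust::sort`, `thrust::inclusive_scan`, `thrust::transform`, and `thrust::reduce` applied to a `thrust::device_vector`. The struct `PrimitiveResults` and the function `run_primitives` are hypothetical names used only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical container for the results of the four operations.
struct PrimitiveResults {
    std::vector<int> sorted;       // input in ascending order
    std::vector<int> prefix_sums;  // inclusive scan of the input
    std::vector<int> squares;      // elementwise transform x -> x * x
    int total = 0;                 // sum reduction of the input
};

// Host-only illustration; each call notes the Thrust algorithm it mirrors.
PrimitiveResults run_primitives(const std::vector<int>& input) {
    PrimitiveResults r;

    r.sorted = input;
    std::sort(r.sorted.begin(), r.sorted.end());            // cf. thrust::sort

    r.prefix_sums.resize(input.size());
    std::partial_sum(input.begin(), input.end(),
                     r.prefix_sums.begin());                // cf. thrust::inclusive_scan

    r.squares.resize(input.size());
    std::transform(input.begin(), input.end(), r.squares.begin(),
                   [](int x) { return x * x; });            // cf. thrust::transform

    r.total = std::accumulate(input.begin(), input.end(), 0);  // cf. thrust::reduce

    // The last element of an inclusive scan equals the full reduction.
    assert(r.prefix_sums.empty() || r.prefix_sums.back() == r.total);
    return r;
}
```

For an input of `{3, 1, 2}`, the scan yields `{3, 4, 6}` and the reduction yields `6`, which is the relationship the final assertion checks.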

CUB

CUB provides state-of-the-art, reusable software components for every layer of the CUDA programming model: parallel primitives (warp-wide, block-wide, and device-wide) and utilities.

NCCL

NCCL (pronounced "Nickel") is a stand-alone library of standard collective communication routines for GPUs, implementing all-reduce, all-gather, reduce, broadcast, and reduce-scatter. It has been optimized to achieve high bandwidth on platforms using PCIe, NVLink, and NVSwitch, as well as networking using InfiniBand Verbs or TCP/IP sockets. NCCL supports an arbitrary number of GPUs installed in a single node or across multiple nodes, and can be used in either single- or multi-process (e.g., MPI) applications.
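The five collectives listed above differ only in which data each rank ends up holding. The sketch below simulates them on the host, with one `std::vector` standing in for each rank's buffer and summation as the reduction. It illustrates the semantics only and involves no communication; the type aliases and function names are hypothetical and are not the NCCL API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Buffer = std::vector<int>;
using Ranks = std::vector<Buffer>;  // Ranks[r] is the buffer held by rank r

// reduce: elementwise sum over all ranks; only the root would receive it.
Buffer reduce_sum(const Ranks& send) {
    assert(!send.empty());
    Buffer out(send[0].size(), 0);
    for (const Buffer& b : send) {
        assert(b.size() == out.size());
        for (std::size_t i = 0; i < out.size(); ++i) out[i] += b[i];
    }
    return out;
}

// all-reduce: every rank receives the full reduced buffer.
Ranks all_reduce_sum(const Ranks& send) {
    return Ranks(send.size(), reduce_sum(send));
}

// broadcast: every rank receives a copy of the root's buffer.
Ranks broadcast(const Ranks& send, std::size_t root) {
    assert(root < send.size());
    return Ranks(send.size(), send[root]);
}

// all-gather: every rank receives all buffers concatenated in rank order.
Ranks all_gather(const Ranks& send) {
    Buffer cat;
    for (const Buffer& b : send) cat.insert(cat.end(), b.begin(), b.end());
    return Ranks(send.size(), cat);
}

// reduce-scatter: reduce, then rank r keeps only the r-th equal-sized chunk.
Ranks reduce_scatter_sum(const Ranks& send) {
    const Buffer total = reduce_sum(send);
    const std::size_t n = send.size();
    assert(total.size() % n == 0);
    const std::ptrdiff_t chunk = static_cast<std::ptrdiff_t>(total.size() / n);
    Ranks out(n);
    for (std::size_t r = 0; r < n; ++r) {
        const std::ptrdiff_t begin = static_cast<std::ptrdiff_t>(r) * chunk;
        out[r].assign(total.begin() + begin, total.begin() + begin + chunk);
    }
    return out;
}
```

With two ranks holding `{1, 2}` and `{10, 20}`, all-reduce leaves `{11, 22}` on both ranks, while reduce-scatter leaves `{11}` on rank 0 and `{22}` on rank 1. An all-reduce is therefore equivalent to a reduce-scatter followed by an all-gather.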
