GitHub - sgrauerg6/cudaBeliefProp: CUDA belief propagation as presented in paper "GPU implementation of belief propagation using CUDA for cloud tracking and reconstruction" published at the 2008 IAPR Workshop on Pattern Recognition in Remote Sensing (PRRS 2008). Code has been updated to work on current NVIDIA GPUs and with additional optimizations. Cite paper if using this code.

sgrauerg6 / cudaBeliefProp Public

Notifications You must be signed in to change notification settings
Fork 2
Star 4

CUDA belief propagation as presented in paper "GPU implementation of belief propagation using CUDA for cloud tracking and reconstruction" published at the 2008 IAPR Workshop on Pattern Recognition in Remote Sensing (PRRS 2008). Code has been updated to work on current NVIDIA GPUs and with additional optimizations. Cite paper if using this code.

GPL-2.0 license

4 stars 2 forks Branches Tags Activity

Star

Notifications

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 1,419 Commits
BeliefProp		BeliefProp
Documentation		Documentation
src		src
.gitattributes		.gitattributes
.gitignore		.gitignore
COPYING		COPYING
Doxyfile		Doxyfile
README		README
common.mk		common.mk

Repository files navigation

/*
Copyright (C) 2024 Scott Grauer-Gray

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307 USA
*/

GPU and Optimized CPU Implementations of Belief Propagation for Stereo

This README describes an implementation of the CUDA belief propagation algorithm
originally described in "GPU implementation of belief propagation using CUDA
for cloud tracking and reconstruction" which was published at the 2008 IAPR
Workshop on Pattern Recognition in Remote Sensing (PRRS 2008) and can be 
found at the following address: https://sgrauerg6.github.io/cudaBeliefProp.pdf.
Please cite this work if using this code for GPU acceleration as part of a
research paper or other work.

Will note that there have been changes from the initial implementation as
presented in the paper, both for optimization, to work on the latest NVIDIA GPUs
with the latest CUDA toolkits, and to add a parallel CPU implementation.
Also, the motion estimation portion of the code was removed to narrow the focus
of this work.  The additional optimizations as well as the parallel CPU
implementation are described in
https://sgrauerg6.github.io/OptimizingGlobalStereoMatchingOnNVIDIAGPUsAndCPUs.pdf.

Updated paper with discussion of selected optimizations and results at
https://sgrauerg6.github.io/OptimizedBpGpuVsCpu_Jan2025.pdf.

The original code is available at
https://sgrauerg6.github.io/beliefPropCodePost.zip.

Doxygen documentation for current code:
Documentation/html/index.html and
https://sgrauerg6.github.io/Documentation/cudaBeliefProp/html/index.html

This code is distributed using the GNU General Public License, so any derivative
work which is distributed must also contain this license.

See "RUNTIME RESULTS" section at end of README for runtime
info across various processors including the NVIDIA A100 and H100 GPUs,
AMD Genoa and Genoa-X CPUs, Intel Emerald Rapids and Sapphire Rapids CPUs,
and Amazon Graviton, Microsoft Cobalt, and NVIDIA Grace ARM server CPUs.

Please email comments and bug reports to [email protected]


INSTRUCTIONS - EVALUATE OPTIMIZED IMPLEMENTATION ON TARGET ARCHITECTURE VS
OTHER ARCHITECTURES:

This code runs and evaluates belief propagation for stereo vision on the CPU
and GPU using CUDA on Linux-based systems.

The parameters are defined in BpResultsEvaluation/BpEvaluationStereoSets.h and
are documented in that file and the paper.  
In order to compile/re-compile the program, navigate to the directory
beliefprop/LinuxMakeAndRunBPOptimizedCPU (for optimized CPU
implementation on x86 processors), the directory 
beliefprop/LinuxMakeAndRunBPOptimizedCPU_ARM (for optimized CPU implementation
on ARM processors), or LinuxMakeAndRunBPCUDA (for CUDA implementation) and run
"make clean" and "make".

A number of stereo sets including the Tsukuba stereo are included in the
"BeliefProp/BpStereoSets" folder.  In order to run and evaluate the
implementation across six of these stereo sets using default parameters,
perform the following steps:

1a.  (CUDA ONLY) If needed, set the PATH and LD_LIBRARY_PATH to the necessary
paths needed to run CUDA programs (usually PATH is appended with $CUDA_DIR/bin
and LD_LIBRARY_PATH=$CUDA_DIR/lib64).

-If using a CUDA device with Compute Capability less than 8.0, need to switch
half-precision processing from default bfloat16 datatype to half datatype by
removing flag to use bfloat16 datatype in
"BFLOAT_DATA_TYPE_SETTING = -DUSE_BFLOAT16_FOR_HALF_PRECISION" line in 
beliefprop/LinuxMakeAndRunBPCUDA/Makefile. Note that this does make it so that
the results in the test with conesFullSizeCropped at half precision will not
be "good" due to half-precision value(s) becoming infinity or NaN when
processing that stereo set.

1b.  (CPU ONLY) Make any necessary or desired adjustments to run on target CPU:

-On x86 CPU, AVX-512 vectorization is used by default.

-If AVX-512 is not supported on target CPU but AVX-256 is supported, set
vectorization to AVX-256 by changing line in
beliefprop/LinuxMakeAndRunBPOptimizedCPU/Makefile.h from
"CPU_VECTORIZATION_DEFINE = -DAVX_512_VECTORIZATION" to
"CPU_VECTORIZATION_DEFINE = -DAVX_256_VECTORIZATION".

-If AVX-512 FP16 vectorization is supported on target CPU (currently only Intel
Sapphire Rapids and Emerald Rapids CPUs support it), can enable AVX512 FP16
vectorization by setting
"CPU_VECTORIZATION_DEFINE = -DAVX_512_VECTORIZATION" to
"CPU_VECTORIZATION_DEFINE = -DAVX_512_F16_VECTORIZATION" and
FLOAT16_VECTORIZATION_DEFINE = -DFLOAT16_VECTORIZATION in
beliefprop/LinuxMakeAndRunBPOptimizedCPU/Makefile.h (both must be set). Also
may need to use a relatively recent gcc version for AVX512 FP16 vectorization
to compile (in testing worked on gcc with Ubuntu 24.04 but not Ubuntu 22.04).
This setting does make it so that the results in the test with
conesFullSizeCropped at half precision will not be "good" due to half-precision
value(s) becoming infinity or NaN when processing that stereo set (same as
"half" datatype in CUDA as described in 1a).

-If on a multi-CPU system, may want to adjust environment variables OMP_PLACES
and OMP_PROC_BIND to optimize OpenMP thread configuration. In some tests on
multi-CPU systems have gotten faster results with setting OMP_PROC_BIND=true
and OMP_PLACES="sockets".

2.  By default, belief propagation is run on 6 stereo sets using the float,
double, and half datatypes with and without using templated disparity counts
for each stereo set, and is run up to 15 times in each configuration for
accurate benchmarking with median runtimes used in evaluation. As a result,
it takes awhile to run the full evaluation. For quicker results, can limit the
number of stereo sets used, datatypes used in evaluation, optimization options,
templated disparity count options, alternate implementation run settings, and
number of runs for each configuration by enabling defines in the common.mk file
that are included in the Makefile for each implementation.

3.  If running optimized CPU implementation on x86 CPU, navigate to the folder 
    beliefprop/LinuxMakeAndRunBPOptimizedCPU.
    If running optimized CPU implementation on ARM CPU, navigate to the folder
    beliefprop/LinuxMakeAndRunBPOptimizedCPU_ARM.
    If running CUDA implemenetation, navigate to folder
    beliefprop/LinuxMakeAndRunBPCUDA.

4.  Execute the commands "make clean" and "make" on the command line.

5.  Run "./driverCPUBp {RUN_NAME}" or "./driverCudaBp {RUN_NAME}",
calling generated executable "driver{CPU/CUDA}Bp" with input parameter
{RUN_NAME} to run the implementation on multiple stereo sets and in multiple
configurations (run name cannot have any spaces). Implementation is run
multiple times on each run configuration to more accurately measure the
runtime. The single-thread CPU implementation code from
http://cs.brown.edu/people/pfelzens/bp/index.html is also run to compare the
output disparity maps/runtimes from the original single-thread implementation
with the optimized implementations. If no {RUN_NAME} parameter is given, run
name is set to "CurrentRun". Recommended that {RUN_NAME} includes CPU/GPU
architecture so can easily match run results with corresponding architecture.
-Example {RUN_NAME}: "AMDRome48Cores" which corresponds to runs on a 48-core
CPU with the AMD Rome architecture and is used as baseline when computing
runtime speedup.

6.  The locations of the output disparity maps are given in the console output.
Detailed input and results for each run including input stereo set and
parameters, detailed timings along with overall speedup data, output disparity
map evaluation as compared to ground truth for each run are written to files in
subfolders of beliefprop/impResults folder with the file names and paths for
each result file given in the console output. A comparison of runtimes and
speedups across runs with results in beliefprop/impResults is written to
beliefprop/impResults/EvaluationAcrossRuns.csv, with the runs ordered from
fastest to slowest; each run is labeled using {RUN_NAME} from (5).

INSTRUCTIONS - SINGLE RUN OF OPTIMIZED BELIEF PROPAGATION ON GPU OR CPU:
-Input stereo set reference and test images must be grayscale images of PGM
file type
-Stereo sets using PGM file type available in BeliefProp/BpStereoSets

1. Follow instructions 1, 3, and 4 in above "EVALUATE OPTIMIZED IMPLEMENTATION
ON TARGET ARCHITECTURE VS OTHER ARCHITECTURES" instructions to compile program.

2. Run "./driverCPUBpCustom" or "./driverCudaBpCustom" with the following
arguments in order:
-File path of reference image of stereo set (must be PGM type)
-File path of test image of stereo set (must be PGM type)
-Number of possible disparity values
-File path of disparity image that is generated and saved during program run

Example run call: ./driverCPUBpCustom ../BpStereoSets/barn1/refImage.pgm ../BpStereoSets/barn1/testImage.pgm 32 ../BpStereoSets/barn1/dispMapFromSingleRun.pgm

Sample output from example run:
BP processing runtime (optimized w/ OpenMP + SIMD on CPU): 0.0518162
Output disparity map saved to ../BpStereoSets/barn1/dispMapFromSingleRun.pgm

## SAST Tools

[PVS-Studio](https://pvs-studio.com/en/pvs-studio/?utm_source=website&utm_medium=github&utm_campaign=open_source) - static analyzer for C, C++, C#, and Java code.

RUNTIME RESULTS:

Relative speedup using optimized implementation across GPUs/CPUs using
float type w/ prior run on AMD Rome (48 cores) used as baseline:
-Column marked as "Median Optimized Runtime (including transfer time)" in 
beliefprop/impResults/RunResults/{RUN_NAME}_RunResults.csv used for runtime
comparison to get relative speedup for each run
-Displayed speedup is average relative speedup compared to baseline across a
number of runs with different input stereo sets and/or run configurations
NVIDIA H100 (GH200): 3.34x
NVIDIA H100 (SXM5): 2.99x
AMD Genoa-X (176 Cores across 2 CPUs): 2.31x
NVIDIA H100 (PCIe): 2.27x
AMD Genoa-X (88 Cores*): 2.12x
AMD Genoa (192 Cores across 2 CPUs): 2.07x
AMD Genoa (96 Cores*): 1.98x
Amazon Graviton4 (96 ARM cores): 1.97x
Microsoft Cobalt (96 ARM cores): 1.91x
NVIDIA A100: 1.89x
Intel Emerald Rapids (48 cores): 1.84x
NVIDIA RTX 3090 Ti: 1.65x
AMD Milan-X (120 Cores across 2 CPUs): 1.60x
Intel Sapphire Rapids (48 cores): 1.53x
NVIDIA Grace CPU in GH200 (64 ARM cores): 1.44x
AMD Milan-X (60 cores*): 1.37x
Amazon Graviton3 (64 ARM cores): 1.20x
Amazon Graviton3E (64 ARM cores): 1.20x
NVIDIA V100: 1.15x
AMD Rome (48 Cores): 1.00x
Intel Ice Lake (32 cores): 1.00x
AMD Milan (48 cores): 0.91x
Amazon Graviton2 (64 ARM cores): 0.90x
NVIDIA P100: 0.78x
Intel Cascade Lake (24 cores): 0.66x
Intel Tiger Lake laptop (8 Cores): 0.33x
AMD 3600 (6 Cores): 0.27x
Spreadsheet with detailed results:
https://docs.google.com/spreadsheets/d/1j_JWKx9K2puHnWB0qRaxP3TyEkMDaoNpiLhUHleev-w/edit?usp=sharing

*Run on system with 2 socketed CPUs with total of 2x listed core count across
CPUs and with SMT disabled...benchmarked runs are with max number of threads
limited to thread count on single CPU w/ threads pinned to socket so should be
using single CPU in the run.
Commands to pin OMP threads to socket:
export OMP_PLACES="sockets"
export OMP_PROC_BIND=true

Single-thread implementation speedups (relative single-core performance): 
Intel Emerald Rapids (48 cores): 1.48x
Intel Sapphire Rapids (48 cores): 1.41x
NVIDIA Grace CPU in GH200 (64 ARM cores): 1.36x
Intel Tiger Lake laptop (8 Cores): 1.31x
AMD Genoa-X (88 Cores): 1.24x
Microsoft Cobalt (96 ARM cores): 1.13x
AMD Genoa (96 Cores): 1.12x
Amazon Graviton4 (96 ARM cores): 1.10x
Intel Ice Lake (32 cores): 1.07x
AMD 3600 (6 Cores): 1.00x
AMD Milan (48 cores): 0.97x
AMD Milan-X (60 cores): 0.95x
Intel Cascade Lake (24 cores): 0.92x
Amazon Graviton3 (64 ARM cores): 0.86x
Amazon Graviton3E (64 ARM cores): 0.86x
AMD Rome (48 Cores): 0.78x
Amazon Graviton2 (64 ARM cores): 0.63x

Optimizations used or available as option
-Indexing into data cost/message arrays designed to maximize memory coalescence
-Optimizing CUDA Thread Block dimensions / CPU thread count
-Templated disparity count (requires disparity known at compile time)
-Processing using half-precision floating point

CUDA Implementation specifically
-Device memory is allocated as single block with arrays for
data costs/messages at each level generated using offsets into the 
allocated block of device memory to minimize the number of cudaMalloc()
and cudaFree() calls

Optimized CPU implementation specifically
-OpenMP and SIMD instructions (NEON, AVX-256, and AVX-512
supported) used to maximize parallel processing on the CPU

Optimizations described in the following writeups
https://sgrauerg6.github.io/OptimizingGlobalStereoMatching.pdf
https://sgrauerg6.github.io/OptimizedBpGpuVsCpu_Jan2025.pdf

Previously looked at an optimization of using shared and/or register memory in place
of local memory during processing and that work is discussed in the following paper:
https://sgrauerg6.github.io/LCPC_2010_optBeliefProp.pdf