A concise practical guide on code performance analysis.
Get a quick snapshot of your app's cache performance:
sudo perf stat -e task-clock,cycles,instructions,cache-references,cache-misses ./myapp
Drill into a specific event using perf record, e.g. to sample cache misses:
sudo perf record -e cache-misses ./myapp
Print a summary report:
sudo perf report --stdio
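To drill from the report down into the hottest functions' source lines (assuming your binary was built with debug symbols), you can also run:
sudo perf annotate --stdio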
Flamegraphs (created by Brendan Gregg) are very useful for quickly identifying code hotspots.
- Launch your linux docker container with the SYS_ADMIN capability:
docker run --privileged --cap-add SYS_ADMIN -t [your_image_id] /bin/bash
- From this point forward we are inside our launched container. Relax the kernel settings so perf can sample and resolve kernel symbols:
sudo sysctl -w kernel.kptr_restrict=0
sudo sysctl -w kernel.perf_event_paranoid=1
- clone Brendan Gregg's Flamegraph repo:
git clone https://github.com/brendangregg/Flamegraph.git
- clone and build your code (with symbols)
- run your code / unit tests under perf. As I use gtest, I typically list my tests and choose one:
[my_gtest_app] --gtest_list_tests
- next, let's run our gtest of interest under perf:
perf record -a -g ./my_gtest_app --gtest_filter=MyTestSuite.MyTest
(see the perf record documentation for sampling options)
- run perf script to generate trace output from the perf data:
perf script > out.perf
- next, we'll normalise this data and fold all the call-graphs:
../../Flamegraph/stackcollapse-perf.pl out.perf > out.folded
- now let's generate the flamegraph:
../../Flamegraph/flamegraph.pl out.folded > out.svg
- you should now have an out.svg file in the current directory
- let's copy the out.svg from our container to our host. In a separate shell (i.e. outside our container), obtain your running container's name:
docker ps
then:
docker cp your_container_name:/your/out.svg/location/out.svg .
- open it in an svg viewer (a web browser for instance)
NB. To install perf on Ubuntu: sudo apt-get install linux-tools-common linux-tools-$(uname -r)
The above, summarised as a script:
#!/bin/bash
# Usage: pass the pid of the process to profile as the first argument.
# NB. perf record -p samples the target process until you press Ctrl+C.
process=$1
if [ ! -d Flamegraph ]; then
git clone https://github.com/brendangregg/Flamegraph.git
fi
perf record -g -p "$process"
perf script > out.perf
Flamegraph/stackcollapse-perf.pl out.perf > out.folded
Flamegraph/flamegraph.pl out.folded > flamegraph.svg
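Assuming you saved the above as make_flamegraph.sh (an illustrative name), you can then generate a flamegraph for an already running app by passing its pid, e.g.:
./make_flamegraph.sh $(pidof myapp)
Press Ctrl+C once you've captured enough samples to stop perf record; the remaining steps then produce flamegraph.svg.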
- Intel VTune Profiler - good for getting an overview of your app's CPU utilisation across its threads (red = spinning, green = good utilisation).
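VTune also has a command line driver if you prefer to collect on one machine and inspect the results later. A minimal sketch (assuming the vtune CLI is on your PATH; option names can vary between versions, so check vtune -help):
vtune -collect threading -result-dir r_threading ./myapp
vtune -report summary -result-dir r_threading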
See Brendan Gregg's Memory Flamegraphs
Install bpf utilities:
sudo apt-get install bpfcc-tools linux-headers-$(uname -r)
Next, launch your app and grab its pid, then:
sudo memleak-bpfcc -p 123 -a
replacing 123 with your app's process id (see also the memleak-bpfcc docs).
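A minimal sketch of the above, assuming your app (myapp is a placeholder) can be launched from the shell:
./myapp &
sudo memleak-bpfcc -p $! -a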
Links
- BCC docs
- Linux kernel tracepoints docs
- Linux kernel tracepoint headers - for details on tracepoint parameters
- uprobes - user space call capture example
See the perf mem reference and redhat perf mem guide. Start your app, then:
sudo perf mem record -a sleep 30
perf mem report
valgrind --tool=massif --xtree-memory=full ./your_gtest_app --gtest_filter=your_test_suite.your_test
- once this completes you can view the output (massif.out.xxx) in the massif-visualizer and also view the xtmemory.kcg.xxx file in kcachegrind
valgrind --tool=memcheck --xml=yes --xml-file=./output_file.xml --leak-check=full ./your_gtest_app --gtest_filter=your_test_suite.your_test
- for a quick summary let's grab the python ValgrindCI tool:
python -m pip install ValgrindCI --user
- for a summary:
valgrind-ci ./output_file.xml --summary
or, to use it as part of CI and abort on errors:
valgrind-ci ./output_file.xml --abort-on-errors
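Putting these together as a CI-style sketch (test binary and filter names are placeholders, as above):
#!/bin/bash
set -e
valgrind --tool=memcheck --xml=yes --xml-file=./output_file.xml --leak-check=full ./your_gtest_app --gtest_filter=your_test_suite.your_test
valgrind-ci ./output_file.xml --abort-on-errors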
cd [your_cmake_build_dir]
- list your available tests:
ctest -N
- pick a test and run it under valgrind's memcheck tool:
ctest -T memcheck -R my_test_name
- list the generated memory checker reports:
ls -lat [your_cmake_build_dir]/Testing/Temporary/MemoryChecker*
- Fix the leaks - start with the "definitely lost" entries, using that phrase as a search term in your MemoryChecker*.log files (a quick grep for this is shown below)
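A quick way to list the reports that contain definite leaks (build directory placeholder as above):
grep -l "definitely lost" [your_cmake_build_dir]/Testing/Temporary/MemoryChecker*.log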
- Summary of linux command line utilities for monitoring GPU utilisation (a few common examples are sketched below)
- Renderdoc - great for profiling OpenGL / Vulkan apps across a range of platforms (including Android)
TODO - add worked example
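For a quick look at GPU utilisation from the terminal, the usual vendor tools are (assuming the relevant package is installed for your GPU):
nvidia-smi dmon      # NVIDIA: utilisation / memory / power, one line per interval
sudo intel_gpu_top   # Intel: per-engine busyness (intel-gpu-tools package)
radeontop            # AMD: utilisation overview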
I've created a small python utility that generates a time-series chart for each of your google benchmarks (run it over your accumulated benchmark history data). It also attempts to estimate which build introduced a slowdown, using a sliding window.
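For example, after accumulating a few benchmark_*.json results in the current directory (my reading of the flags, mirroring the build script further down; check the repo's README for the authoritative options):
git clone https://github.com/bensanmorris/benchmark_monitor.git
python3 -m pip install -r benchmark_monitor/requirements.txt
python3 benchmark_monitor/benchmark_monitor.py -d . -w 6 -a 0.01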
- Use the gold linker: pass the following to your cmake -G ... line to use the faster gold linker on linux:
-DCMAKE_EXE_LINKER_FLAGS=-fuse-ld=gold
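For example (the Ninja generator and source directory here are just illustrative):
cmake -G Ninja -DCMAKE_EXE_LINKER_FLAGS=-fuse-ld=gold ..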
How to profile OpenCV apps:
- set OPENCV_TRACE=1 environment variable (and optionally OPENCV_TRACE_LOCATION to the path to write the OpenCV trace logs to)
- run your app
- generate a top 10 most costly OpenCV functions report as follows:
[opencv_repo_location]/modules/ts/misc/trace_profiler.py [your_opencv_trace_dir]/OpenCVTrace.txt 10
NB. this report includes run time cost as well as the number of threads that called each function (the "thr" column) - useful when trying to evaluate cpu usage. A consolidated sketch of these steps follows this list.
- you can influence the number of internal threads used by OpenCV via the cv::setNumThreads() function
- OpenCV function call profiling
- OpenCV Graph API - an effort to optimise CV pipelines
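Putting the tracing steps together (paths in square brackets are placeholders, as above):
export OPENCV_TRACE=1
export OPENCV_TRACE_LOCATION=[your_opencv_trace_dir]
./your_opencv_app
[opencv_repo_location]/modules/ts/misc/trace_profiler.py [your_opencv_trace_dir]/OpenCVTrace.txt 10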
- just use the prebuilt TensorFlow C libs and cppflow (if they match your target's cpu)
- Tensorflowlite visualize.py script - generates an HTML page listing your nodes / types and whether they are quantized or not
- Netron - a GUI based NN visualizer (supports both TF and TFLite format models)
- Tensorflowlite performance best practices
- Tensorflowlite GPU backend (gpu accelerated tflite)
- Tensorflowlite Neural Net API delegate
- Tensorflowlite Model quantization (smaller models, potentially better CPU data alignment)
NB. setting ANDROID_ARM_NEON=ON will globally enable NEON in CMake-based projects, but if you are using NDK >= 21 then NEON is enabled by default. For example:
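A sketch of passing this to an Android CMake build (assuming ANDROID_NDK points at your NDK install; the ABI is illustrative):
cmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=armeabi-v7a -DANDROID_ARM_NEON=ON ..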
You may already be using benchmarks as an early warning system for performance regressions in your game engine / software, but in case you aren't, this may be useful. In my side project I programmatically create a benchmark (using google benchmark) for each 3d game scene in a set of scenes to put my engine's performance through its paces. Then, as part of a CI build, I run those benchmarks and feed the results to my google benchmark charting project (mentioned above), which generates a time-series chart for each benchmark. The code (which you will need to adapt to your own engine / scenes) is as follows:
First, the CMakeLists.txt file that pulls in google benchmark (+ my engine - adapt to your own libs):
set(BENCHMARK_ENABLE_TESTING FALSE)
set(BENCHMARK_ENABLE_INSTALL FALSE)
include(FetchContent)
FetchContent_Declare(googlebenchmark
GIT_REPOSITORY https://github.com/google/benchmark
GIT_TAG "v1.5.4"
)
FetchContent_MakeAvailable(googlebenchmark)
include_directories(${FIREFLY_INCLUDE_DIRS}
${OPENGL_INCLUDE_DIRS}
${SDL2_INCLUDE_DIR})
add_executable(firefly_benchmarks benchmarks.cpp)
target_link_libraries(firefly_benchmarks
benchmark::benchmark
${OPENGL_LIBRARIES}
${SDL2_LIBRARY}
${FIREFLY_LIBRARIES})
add_custom_command(TARGET firefly_benchmarks
POST_BUILD COMMAND ${CMAKE_COMMAND} -E copy_directory ${CMAKE_SOURCE_DIR}/benchmarks $<TARGET_FILE_DIR:firefly_benchmarks>/benchmarks)
Next, here's benchmarks.cpp (my c++ google benchmark code that programmatically creates a benchmark for each scene defined in the scenes variable; again, adapt to your own engine):
#include <benchmark/benchmark.h>
#include <unordered_map>
#include <vector>
#include <set>
#include <string>
#include <firefly.h>
using namespace firefly;
#include <SDL2/SDL.h>
#include <SDL2/SDL_opengl.h>
static const int SCREEN_WIDTH = 640;
static const int SCREEN_HEIGHT = 480;
SDL_Window* displayWindow = 0;
SDL_GLContext displayContext;
auto BenchmarkScene = [](benchmark::State& st, std::string sceneFilePath)
{
size_t frames = 0;
auto scene = SDK::GetInstance().Load(sceneFilePath);
for(auto _ : st)
{
SDL_Event e;
while (SDL_PollEvent(&e))
{
switch(e.type)
{
case SDL_WINDOWEVENT:
{
switch(e.window.event)
{
case SDL_WINDOWEVENT_RESIZED:
{
firefly::SDK::GetInstance().OnSize(e.window.data1, e.window.data2);
break;
}
}
}
}
}
SDK::GetInstance().Update(*scene);
SDL_GL_SwapWindow(displayWindow);
frames++;
}
scene->Release();
firefly::SDK::GetInstance().Reset();
};
int main(int argc, char** argv)
{
std::vector<std::string> scenes = {
"BallOnPlatform.msf",
"TiledGrass.msf"
};
SDL_Init(SDL_INIT_VIDEO);
SDL_GL_SetAttribute(SDL_GL_CONTEXT_MAJOR_VERSION, FIREFLY_GL_MAJOR_VERSION);
SDL_GL_SetAttribute(SDL_GL_CONTEXT_MINOR_VERSION, FIREFLY_GL_MINOR_VERSION);
if(FIREFLY_GL_MAJOR_VERSION >= 3 && FIREFLY_GL_MINOR_VERSION >= 3)
SDL_GL_SetAttribute(SDL_GL_CONTEXT_PROFILE_MASK, SDL_GL_CONTEXT_PROFILE_CORE);
displayWindow = SDL_CreateWindow("",
SDL_WINDOWPOS_UNDEFINED,
SDL_WINDOWPOS_UNDEFINED,
SCREEN_WIDTH,
SCREEN_HEIGHT,
SDL_WINDOW_RESIZABLE | SDL_WINDOW_OPENGL);
displayContext = SDL_GL_CreateContext(displayWindow);
SDL_GL_SetSwapInterval(0);
firefly::SDKSettings sdkSettings;
sdkSettings.renderer = firefly::RendererManager::OPENGL;
sdkSettings.device = 0;
sdkSettings.loadProc = (firefly::IRenderer::GLADloadproc)SDL_GL_GetProcAddress;
sdkSettings.window = displayWindow;
firefly::SDK::GetInstance().Initialise(sdkSettings);
firefly::SDK::GetInstance().OnSize(SCREEN_WIDTH, SCREEN_HEIGHT);
for(auto& scene : scenes)
{
#if FIREFLY_PLATFORM == PLATFORM_LINUX
auto benchmark = benchmark::RegisterBenchmark(scene.c_str(),
BenchmarkScene,
firefly::GetFireflyDir() + "../bin/benchmarks/" + scene);
benchmark->Iterations(200);
benchmark->Unit(benchmark::kMillisecond);
#endif
}
benchmark::Initialize(&argc, argv);
benchmark::RunSpecifiedBenchmarks();
firefly::SDK::GetInstance().Destroy();
return 0;
}
This effectively creates a firefly_benchmarks executable (firefly being my internal side project's name). I then invoke this (on linux) as part of my build as follows:
#!/bin/bash
if [ ! -d benchmarking ]; then
mkdir benchmarking
fi
cd benchmarking
if [ ! -d benchmark_monitor ]; then
git clone https://github.com/bensanmorris/benchmark_monitor.git
cd benchmark_monitor
python3 -m venv env
source env/bin/activate
pip3 install -r requirements.txt
cd ..
fi
counter=1
while [ $counter -le 30 ]
do
echo $counter
./../bin/firefly_benchmarks --benchmark_out=benchmark_$counter.json && python3 ./benchmark_monitor/benchmark_monitor.py -d . -w 6 -a 0.01
((counter++))
done
rm -rf benchmark_monitor
rm -f benchmark_*.zip
timestamp=$(date +%s)
zip -r benchmark_$timestamp.zip .
cp ./benchmark_$timestamp.zip ../../benchmarks
It's not very sophisticated, but the end result is a sequence of charts for each benchmarked scene that I then compare against the previous release. This has caught a few performance regressions before release, so I hope it's of use to you.
- What's a Creel - great channel on intrinsics and assembler