A concise practical guide on code performance analysis.
Get a quick snapshot of your app's cache performance:
sudo perf stat -e task-clock,cycles,instructions,cache-references,cache-misses ./myapp
Drill into a specific event using perf record, e.g. to sample cache misses:
sudo perf record -e cache-misses ./myapp
Print a summary report:
sudo perf report --stdio
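To drill from the report down into the hottest functions' source lines (assuming your binary was built with debug symbols), you can also run:
sudo perf annotate --stdio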
Flamegraphs (created by Brendan Gregg) are very useful for quickly identifying code hotspots.
- Launch your linux docker container with the SYS_ADMIN capability:
docker run --privileged --cap-add SYS_ADMIN -t [your_image_id] /bin/bash
- From this point forward we are inside our launched container. Relax the kernel settings so perf can sample and resolve kernel symbols:
sudo sysctl -w kernel.kptr_restrict=0
sudo sysctl -w kernel.perf_event_paranoid=1
- clone Brendan Gregg's Flamegraph repo:
git clone https://github.com/brendangregg/Flamegraph.git
- clone and build your code (with symbols)
- run your code / unit tests under perf. As I use gtest, I typically list my tests and choose one:
[my_gtest_app] --gtest_list_tests
- next, let's run our gtest of interest under perf:
perf record -a -g ./my_gtest_app --gtest_filter=MyTestSuite.MyTest
(see the perf record documentation for sampling options)
- run perf script to generate trace output from the perf data:
perf script > out.perf
- next, we'll normalise this data and fold all the call-graphs:
../../Flamegraph/stackcollapse-perf.pl out.perf > out.folded
- now let's generate the flamegraph:
../../Flamegraph/flamegraph.pl out.folded > out.svg
- you should now have an out.svg file in the current directory
- let's copy the out.svg from our container to our host. In a separate shell (i.e. outside our container), obtain your running container's name:
docker ps
then:
docker cp your_container_name:/your/out.svg/location/out.svg .
- open it in an svg viewer (a web browser for instance)
NB. To install perf on Ubuntu: sudo apt-get install linux-tools-common linux-tools-$(uname -r)
The above, summarised as a script:
#!/bin/bash
# Usage: pass the pid of the process to profile as the first argument.
# NB. perf record -p samples the target process until you press Ctrl+C.
process=$1
if [ ! -d Flamegraph ]; then
git clone https://github.com/brendangregg/Flamegraph.git
fi
perf record -g -p "$process"
perf script > out.perf
Flamegraph/stackcollapse-perf.pl out.perf > out.folded
Flamegraph/flamegraph.pl out.folded > flamegraph.svg
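Assuming you saved the above as make_flamegraph.sh (an illustrative name), you can then generate a flamegraph for an already running app by passing its pid, e.g.:
./make_flamegraph.sh $(pidof myapp)
Press Ctrl+C once you've captured enough samples to stop perf record; the remaining steps then produce flamegraph.svg.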
- Intel VTune Profiler - good for getting an overview of your app's CPU utilisation across its threads (red = spinning, green = good utilisation).
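VTune also has a command line driver if you prefer to collect on one machine and inspect the results later. A minimal sketch (assuming the vtune CLI is on your PATH; option names can vary between versions, so check vtune -help):
vtune -collect threading -result-dir r_threading ./myapp
vtune -report summary -result-dir r_threading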
See Brendan Gregg's Memory Flamegraphs
Install bpf utilities:
sudo apt-get install bpfcc-tools linux-headers-$(uname -r)
Next, launch your app and grab its pid, then:
sudo memleak-bpfcc -p 123 -a
replacing 123 with your app's process id (see also the memleak-bpfcc docs).
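A minimal sketch of the above, assuming your app (myapp is a placeholder) can be launched from the shell:
./myapp &
sudo memleak-bpfcc -p $! -a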
Links
- BCC docs
- Linux kernel tracepoints docs
- Linux kernel tracepoint headers - for details on tracepoint parameters
- uprobes - user space call capture example
See the perf mem reference and redhat perf mem guide. Start your app, then:
sudo perf mem record -a sleep 30
perf mem report
valgrind --tool=massif --xtree-memory=full ./your_gtest_app --gtest_filter=your_test_suite.your_test
- once this completes you can view the output (massif.out.xxx) in the massif-visualizer and also view the xtmemory.kcg.xxx file in kcachegrind
valgrind --tool=memcheck --xml=yes --xml-file=./output_file.xml --leak-check=full ./your_gtest_app --gtest_filter=your_test_suite.your_test
- for a quick summary let's grab the python ValgrindCI tool:
python -m pip install ValgrindCI --user
- for a summary:
valgrind-ci ./output_file.xml --summary
or, to use it as part of CI and abort on errors:
valgrind-ci ./output_file.xml --abort-on-errors
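Putting these together as a CI-style sketch (test binary and filter names are placeholders, as above):
#!/bin/bash
set -e
valgrind --tool=memcheck --xml=yes --xml-file=./output_file.xml --leak-check=full ./your_gtest_app --gtest_filter=your_test_suite.your_test
valgrind-ci ./output_file.xml --abort-on-errors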
cd [your_cmake_build_dir]
- list your available tests:
ctest -N
- pick a test and run it under valgrind's memcheck tool:
ctest -T memcheck -R my_test_name
- list the generated memory checker reports:
ls -lat [your_cmake_build_dir]/Testing/Temporary/MemoryChecker*
- Fix the leaks - start with the "definitely lost" entries, using that phrase as a search term in your MemoryChecker*.log files (a quick grep for this is shown below)
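A quick way to list the reports that contain definite leaks (build directory placeholder as above):
grep -l "definitely lost" [your_cmake_build_dir]/Testing/Temporary/MemoryChecker*.log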
- Summary of linux command line utilities for monitoring GPU utilisation (a few common examples are sketched below)
- Renderdoc - great for profiling OpenGL / Vulkan apps across a range of platforms (including Android)
TODO - add worked example
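For a quick look at GPU utilisation from the terminal, the usual vendor tools are (assuming the relevant package is installed for your GPU):
nvidia-smi dmon      # NVIDIA: utilisation / memory / power, one line per interval
sudo intel_gpu_top   # Intel: per-engine busyness (intel-gpu-tools package)
radeontop            # AMD: utilisation overview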
I've created a small python utility that generates a time-series chart for each of your google benchmarks (run it over your accumulated benchmark history data). It also attempts to estimate which build introduced a slowdown, using a sliding window.
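For example, after accumulating a few benchmark_*.json results in the current directory (my reading of the flags, mirroring the build script further down; check the repo's README for the authoritative options):
git clone https://github.com/bensanmorris/benchmark_monitor.git
python3 -m pip install -r benchmark_monitor/requirements.txt
python3 benchmark_monitor/benchmark_monitor.py -d . -w 6 -a 0.01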
- Use the gold linker: pass the following to your cmake -G ... line to use the faster gold linker on linux:
-DCMAKE_EXE_LINKER_FLAGS=-fuse-ld=gold
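For example (the Ninja generator and source directory here are just illustrative):
cmake -G Ninja -DCMAKE_EXE_LINKER_FLAGS=-fuse-ld=gold ..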
How to profile OpenCV apps:
- set OPENCV_TRACE=1 environment variable (and optionally OPENCV_TRACE_LOCATION to the path to write the OpenCV trace logs to)
- run your app
- generate a top 10 most costly OpenCV functions report as follows:
[opencv_repo_location]/modules/ts/misc/trace_profiler.py [your_opencv_trace_dir]/OpenCVTrace.txt 10
NB. this report includes run time cost as well as the number of threads that called each function (the "thr" column) - useful when trying to evaluate cpu usage. A consolidated sketch of these steps follows this list.
- you can influence the number of internal threads used by OpenCV via the cv::setNumThreads() function
- OpenCV function call profiling
- OpenCV Graph API - an effort to optimise CV pipelines
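Putting the tracing steps together (paths in square brackets are placeholders, as above):
export OPENCV_TRACE=1
export OPENCV_TRACE_LOCATION=[your_opencv_trace_dir]
./your_opencv_app
[opencv_repo_location]/modules/ts/misc/trace_profiler.py [your_opencv_trace_dir]/OpenCVTrace.txt 10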
- just use the prebuilt TensorFlow C libs and cppflow (if they match your target's cpu)
- Tensorflowlite visualize.py script - generates an HTML page listing your nodes / types and whether they are quantized or not
- Netron - a GUI based NN visualizer (supports both TF and TFLite format models)
- Tensorflowlite performance best practices
- Tensorflowlite GPU backend (gpu accelerated tflite)
- Tensorflowlite Neural Net API delegate
- Tensorflowlite Model quantization (smaller models, potentially better CPU data alignment)
NB. setting ANDROID_ARM_NEON=ON will globally enable NEON in CMake-based projects, but if you are using NDK >= 21 then NEON is enabled by default. For example:
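A sketch of passing this to an Android CMake build (assuming ANDROID_NDK points at your NDK install; the ABI is illustrative):
cmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=armeabi-v7a -DANDROID_ARM_NEON=ON ..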
You may already be using benchmarks as an early warning system for performance regressions in your game engine / software, but in case you aren't, this may be useful. In my side project I programmatically create a benchmark (using google benchmark) for each 3d game scene in a set of scenes to put my engine's performance through its paces. Then, as part of a CI build, I run those benchmarks and feed the results to my google benchmark charting project (mentioned above), which generates a time-series chart for each benchmark. The code (which you will need to adapt to your own engine / scenes) is as follows:
First, the CMakeLists.txt file that pulls in google benchmark (+ my engine - adapt to your own libs):
set(BENCHMARK_ENABLE_TESTING FALSE)
set(BENCHMARK_ENABLE_INSTALL FALSE)
include(FetchContent)
FetchContent_Declare(googlebenchmark
GIT_REPOSITORY https://github.com/google/benchmark
GIT_TAG "v1.5.4"
)
FetchContent_MakeAvailable(googlebenchmark)
include_directories(${FIREFLY_INCLUDE_DIRS}
${OPENGL_INCLUDE_DIRS}
${SDL2_INCLUDE_DIR})
add_executable(firefly_benchmarks benchmarks.cpp)
target_link_libraries(firefly_benchmarks
benchmark::benchmark
${OPENGL_LIBRARIES}
${SDL2_LIBRARY}
${FIREFLY_LIBRARIES})
add_custom_command(TARGET firefly_benchmarks
POST_BUILD COMMAND ${CMAKE_COMMAND} -E copy_directory ${CMAKE_SOURCE_DIR}/benchmarks $<TARGET_FILE_DIR:firefly_benchmarks>/benchmarks)
Next, here's benchmarks.cpp (my c++ google benchmark code that programmatically creates a benchmark for each scene defined in the scenes variable; again, adapt to your own engine):
#include <benchmark/benchmark.h>
#include <unordered_map>
#include <vector>
#include <set>
#include <string>
#include <firefly.h>
using namespace firefly;
#include <SDL2/SDL.h>
#include <SDL2/SDL_opengl.h>
static const int SCREEN_WIDTH = 640;
static const int SCREEN_HEIGHT = 480;
SDL_Window* displayWindow = 0;
SDL_GLContext displayContext;
auto BenchmarkScene = [](benchmark::State& st, std::string sceneFilePath)
{
size_t frames = 0;
auto scene = SDK::GetInstance().Load(sceneFilePath);
for(auto _ : st)
{
SDL_Event e;
while (SDL_PollEvent(&e))
{
switch(e.type)
{
case SDL_WINDOWEVENT:
{
switch(e.window.event)
{
case SDL_WINDOWEVENT_RESIZED:
{
firefly::SDK::GetInstance().OnSize(e.window.data1, e.window.data2);
break;
}
}
}
}
}
SDK::GetInstance().Update(*scene);
SDL_GL_SwapWindow(displayWindow);
frames++;
}
scene->Release();
firefly::SDK::GetInstance().Reset();
};
int main(int argc, char** argv)
{
std::vector<std::string> scenes = {
"BallOnPlatform.msf",
"TiledGrass.msf"
};
SDL_Init(SDL_INIT_VIDEO);
SDL_GL_SetAttribute(SDL_GL_CONTEXT_MAJOR_VERSION, FIREFLY_GL_MAJOR_VERSION);
SDL_GL_SetAttribute(SDL_GL_CONTEXT_MINOR_VERSION, FIREFLY_GL_MINOR_VERSION);
if(FIREFLY_GL_MAJOR_VERSION >= 3 && FIREFLY_GL_MINOR_VERSION >= 3)
SDL_GL_SetAttribute(SDL_GL_CONTEXT_PROFILE_MASK, SDL_GL_CONTEXT_PROFILE_CORE);
displayWindow = SDL_CreateWindow("",
SDL_WINDOWPOS_UNDEFINED,
SDL_WINDOWPOS_UNDEFINED,
SCREEN_WIDTH,
SCREEN_HEIGHT,
SDL_WINDOW_RESIZABLE | SDL_WINDOW_OPENGL);
displayContext = SDL_GL_CreateContext(displayWindow);
SDL_GL_SetSwapInterval(0);
firefly::SDKSettings sdkSettings;
sdkSettings.renderer = firefly::RendererManager::OPENGL;
sdkSettings.device = 0;
sdkSettings.loadProc = (firefly::IRenderer::GLADloadproc)SDL_GL_GetProcAddress;
sdkSettings.window = displayWindow;
firefly::SDK::GetInstance().Initialise(sdkSettings);
firefly::SDK::GetInstance().OnSize(SCREEN_WIDTH, SCREEN_HEIGHT);
for(auto& scene : scenes)
{
#if FIREFLY_PLATFORM == PLATFORM_LINUX
auto benchmark = benchmark::RegisterBenchmark(scene.c_str(),
BenchmarkScene,
firefly::GetFireflyDir() + "../bin/benchmarks/" + scene);
benchmark->Iterations(200);
benchmark->Unit(benchmark::kMillisecond);
#endif
}
benchmark::Initialize(&argc, argv);
benchmark::RunSpecifiedBenchmarks();
firefly::SDK::GetInstance().Destroy();
return 0;
}
This effectively creates a firefly_benchmarks executable (firefly being my internal side project's name). I then invoke this (on linux) as part of my build as follows:
#!/bin/bash
if [ ! -d benchmarking ]; then
mkdir benchmarking
fi
cd benchmarking
if [ ! -d benchmark_monitor ]; then
git clone https://github.com/bensanmorris/benchmark_monitor.git
cd benchmark_monitor
python3 -m venv env
source env/bin/activate
pip3 install -r requirements.txt
cd ..
fi
counter=1
while [ $counter -le 30 ]
do
echo $counter
./../bin/firefly_benchmarks --benchmark_out=benchmark_$counter.json && python3 ./benchmark_monitor/benchmark_monitor.py -d . -w 6 -a 0.01
((counter++))
done
rm -rf benchmark_monitor
rm -f benchmark_*.zip
timestamp=$(date +%s)
zip -r benchmark_$timestamp.zip .
cp ./benchmark_$timestamp.zip ../../benchmarks
It's not very sophisticated, but the end result is a sequence of charts for each benchmarked scene that I then compare against the previous release. This has caught a few performance regressions before release, so I hope it's of use to you.
- What's a Creel - great channel on intrinsics and assembler