bench
Folders and files
Name | Name | Last commit date | ||
---|---|---|---|---|
parent directory.. | ||||
Time measurement ------------------ For usable benchmarking we need a precise, reliable and repeatable time measurement. It seems to be easy? We are on computer right? Unfortunately I found it almost impossible! Hopefully I am totaly wrong. Several "kind of time" exist: real (wall clock), user CPU time, system CPU time. What I believe we have to use is task user + system CPU time. There are varions commands/functions which can be used to measure time, e.g.: 1) time command (real, user, sys time): time COMMAND 2) getrusage function (user, sys time): #include <sys/time.h> #include <sys/resource.h> struct rusage usage; getrusage( RUSAGE_SELF, &start); // measured code getrusage( RUSAGE_SELF, &end); double user_elapsed = end.ru_utime.tv_sec + end.ru_utime.tv_usec/1000000. - start.ru_utime.tv_sec - start.ru_utime.tv_usec/1000000.; double sys_elapsed = end.ru_stime.tv_sec + end.rs_utime.tv_usec/1000000. - start.ru_stime.tv_sec -start.ru_stime.tv_usec/1000000.; 3) times function (user, sys time): #include <sys/times.h> struct tms start, end; times(&start); // measured code times(&end); double user_elapsed = ((double)(end.tms_utime - start.tms_utime))/sysconf(_SC_CLK_TCK); double sys_elapsed = ((double)(end.tms_stime - start.tms_stime))/sysconf(_SC_CLK_TCK); 4) clock function (user + sys time ?): #include <time.h> clock_t start = clock(); // measured code clock_t end = clock(); double elapsed = ((double) (end - start)) / CLOCKS_PER_SEC; 5) clock_gettime functions (user + sys time ?): #include <time.h> struct timespec start, end; clock_gettime( CLOCK_PROCESS_CPUTIME_ID, &start ); // measured code clock_gettime( CLOCK_PROCESS_CPUTIME_ID, &end ); double elapsed = end.tv_sec + end.tv_nsec / 1000000000. - start.tv_sec - start.tv_nsec / 1000000000.; 6) QTime class (QElapsedTimer Qt >= 4.7) (real time): QTime time; time.start(); // measured code double elapsed = time.elapsed() / 1000.; I tried all of them and all are giving the same mostly useless values. If the same piece of code is measured more times, the results differ. It seems that all those functions read user time from the same place, kernel task_struct.utime. The problem is, if I understand correctly, how the task_struct.utime is updated. Whenever timer interrupt comes (every 1/HZ) to scheduler, it calls update_process_times->account_process_tick->irqtime_account_process_tick->account_user_time and increases task_struct.utime by cputime_one_jiffy (1/HZ). See kernel/sched.c and kernel/timer.c. It means, that utime is not increased by pure time when the the task code is running, but by fixed interval which includes some task switching overhead??? In a simple test, I could observe 30% increase of utime if another application was running at the same time. Unfortunately I don't see anything better than user time + sys time, running test with highest priority and avoiding other use of computer when tests are running, e.g.: sudo chrt -f 99 COMMAND To be sure that the measured values are correct, it is necessary to run more cycles and check some standard deviation or so. There is also high level benchmark support available in QTestLib, it is possible to use QBENCHMARK macro + QGSTEST_MAIN to create easily test executable. Such test may be run with various options, some notes on modes/options: -tickcounter - reads rdtsc register (on Linux), thus it counts real time, result is not constant -callgrind - reruns the command with callgrind, number of 'instr. loads' is constant for constant number of iterations, number of instructions per iterarion decreases with number of iterarions (for small numbers of iterations) for a simple function the number of instractions of the second iteration may be 40% of the first one - cache, prediction??? Callgrind is really very slow. I am not sure what 'instr. loads' exactly means and if it can be somehow converted to time, but I don't believe so. AFAIK each instruction need a different number of cycles and it may be different even for the same instruction because of CPU cache, then there are instruction pipelines etc. Build options ------------- CMAKE_BUILD_TYPE should be RelWithDebInfo so that it compiles with optimisations but also adds debug information so that it can be profiled with callgrind and visualized with kcachegrind.