Performance counters are not correctly measured in AMD ZEN series #14

Closed
joonsung-kim opened this issue Nov 4, 2020 · 4 comments

@joonsung-kim

Hi.

I have tried to measure the performance counters related to the decoder (i.e., uops dispatched from the legacy x86 decoder <DeDisUopsFromDecoder.DecoderDispatched> or from the micro-op cache <DeDisUopsFromDecoder.OpCacheDispatched>).
I tested with a simple code snippet consisting of 8 multi-byte nops (4 bytes each) without unrolling. I expected this snippet to result in a series of micro-op cache hits; however, the results show that all uops are dispatched from the legacy x86 decoder, not from the micro-op cache.

command

sudo ./kernel-nanoBench.sh -basic_mode -unroll_count 1 -loop_count 100000 -cpu 1 -asm "nop ax; nop ax; nop ax; nop ax; nop ax; nop ax; nop ax; nop ax" -config configs/cfg_Zen_all.txt | grep -i "dedisuops"

results (I slightly modified the source code to also dump the absolute counter values, shown in parentheses)

DeDisUopsFromDecoder.DecoderDispatched: 10.00 (1000019)
DeDisUopsFromDecoder.OpCacheDispatched: 0.00 (0)
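
As a quick consistency check on the two numbers per line, here is a minimal sketch that relates the absolute counts in parentheses to the reported values. It assumes (this is not confirmed in the thread) that the reported value is simply the absolute count divided by loop_count times unroll_count; the variable names are made up for illustration.

```python
# Values copied from the command line and the output above.
loop_count = 100_000
unroll_count = 1
decoder_uops_abs = 1_000_019  # DeDisUopsFromDecoder.DecoderDispatched (absolute)
opcache_uops_abs = 0          # DeDisUopsFromDecoder.OpCacheDispatched (absolute)

# Assumption: the reported value is the absolute count normalized by the
# number of loop iterations times the number of unrolled copies.
iterations = loop_count * unroll_count
decoder_per_iter = decoder_uops_abs / iterations
opcache_per_iter = opcache_uops_abs / iterations

print(f"DecoderDispatched per iteration: {decoder_per_iter:.2f}")
print(f"OpCacheDispatched per iteration: {opcache_per_iter:.2f}")
```

Under that assumption the numbers agree with the reported 10.00 and 0.00, i.e. about 10 uops per loop iteration, all of them counted as coming from the legacy decoder.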

I cannot understand why every instruction is decoded by the legacy x86 decoder.

I also checked with a simple test program consisting of the same code pattern (see below).
test.s build command: <nasm -f elf64 test.s -o test.o; ld test.o -o test>

global _start

_start:
    mov rdi, 100000
    call test_uop_cache_hit
    mov rax, 60
    mov rdi, 0
    syscall

test_uop_cache_hit:
    nop ax
    nop ax
    nop ax
    nop ax
    nop ax
    nop ax
    nop ax
    nop ax

    dec rdi
    jnz test_uop_cache_hit
    ret

Then, I checked the performance counters with the perf tool.

$ perf stat -e cycles,instructions,r01AA,r02AA,r03AA ./test

 Performance counter stats for './test':

            298349      cycles                                                      
           1037949      instructions              #    3.48  insn per cycle                                            
             86233      r01AA                                                       
            999280      r02AA                                                       
           1085721      r03AA                                                       

       0.000433346 seconds time elapsed

The results show that most uops are dispatched from the micro-op cache (r01AA => uops dispatched from the legacy x86 decoder // r02AA => uops dispatched from the micro-op cache // r03AA => all uops).
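
To put numbers on that, here is a small sketch that computes the share of each uop source from the perf counts above. It only does arithmetic on the values already shown; the variable names are illustrative, and the instruction count per iteration (8 nops + dec + jnz) is read off the test.s listing.

```python
# Counts copied from the perf stat output above.
legacy_decoder_uops = 86_233  # r01AA
op_cache_uops = 999_280       # r02AA
all_uops = 1_085_721          # r03AA
instructions = 1_037_949

# Share of dispatched uops per source.
op_cache_share = op_cache_uops / all_uops
legacy_share = legacy_decoder_uops / all_uops

# The three events are counted separately, so the two parts need not
# add up exactly to the combined event.
gap = all_uops - (legacy_decoder_uops + op_cache_uops)

# Instructions executed inside the loop of test.s: 8 nops + dec + jnz.
loop_instructions = 100_000 * (8 + 2)
outside_loop = instructions - loop_instructions

print(f"op cache share:       {op_cache_share:.1%}")
print(f"legacy decoder share: {legacy_share:.1%}")
print(f"r03AA - (r01AA + r02AA): {gap}")
print(f"instructions outside the loop: {outside_loop}")
```

Roughly 92% of the uops come from the micro-op cache and about 8% from the legacy decoder. Since the measured instruction count is somewhat higher than the 1,000,000 instructions of the loop itself, part of the legacy-decoder uops may belong to code outside the loop, but the perf output alone does not show that.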

Why do nanoBench and perf show different results?

Sincerely.
Joonsung Kim.

@andreas-abel
Owner

Note that the perf tool runs the benchmark in user space. If you use the user-space version of nanoBench (i.e., use nanoBench.sh instead of kernel-nanoBench.sh), the results are very similar to perf.

I do not know why the uops don't come from the uop cache when running the benchmark in kernel space. However, I don't think that the measurements are incorrect.

@joonsung-kim
Author

@andreas-abel

Thanks. With user-mode nanoBench, it works as I expected :). However, I still can't figure out why kernel-mode nanoBench produces these unexplained results. (Personally, I prefer kernel-mode nanoBench because it minimizes extra overhead.)

Is there any plan to fix this issue in kernel-mode nanoBench?

@andreas-abel
Owner

I don't think there is anything to be fixed in nanoBench, as I don't think there is anything wrong. If you don't like how the CPU behaves in kernel mode, you would need to contact AMD ;)

@joonsung-kim
Author

Yes, I also think there is nothing wrong with kernel-mode nanoBench. It would be better to contact AMD. Thanks for your reply :)
