Performance counters are not correctly measured in AMD ZEN series #14
Comments
Note that the perf tool runs the benchmark in user space. If you use the user-space version of nanoBench (i.e., use […]
I do not know why the uops don't come from the uop cache when running the benchmark in kernel space. However, I don't think that the measurements are incorrect.
Thanks. With user-mode nanoBench, it works as I expected :). However, I still can't figure out why kernel-mode nanoBench gives results that I can't explain. (Personally, I prefer kernel-mode nanoBench because it minimizes extra overhead.) Is there any plan to fix this issue in kernel-mode nanoBench?
I don't think there is anything to be fixed in nanoBench, as I don't think there is anything wrong. If you don't like how the CPU behaves in kernel mode, you would need to contact AMD ;)
Yes, I also think there is nothing wrong with kernel-mode nanoBench. It would be better to contact AMD. Thanks for your reply :)
Hi.
I have tried to measure the performance counters related to the decoder (i.e., uops dispatched from the legacy x86 decoder <DeDisUopsFromDecoder.DecoderDispatched> or from the micro-op cache <DeDisUopsFromDecoder.OpCacheDispatched>).
I tested a simple code snippet consisting of 8 multi-byte nops (each multi-byte nop is 4 bytes) without unrolling. I expected this snippet to result in a series of micro-op cache hits; however, the results show that all uops are dispatched from the legacy x86 decoder, not the micro-op cache.
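For concreteness, here is a minimal Python sketch of what such a snippet looks like as machine code. It assumes the commonly recommended 4-byte NOP encoding `0F 1F 40 00`; the exact encoding used in this test is not stated here, so treat the bytes as an illustration only.

```python
# Assumption: each 4-byte NOP uses the commonly recommended encoding
# 0F 1F 40 00 (nop dword [rax+0x0]). The encoding actually used in the
# test is not stated in this issue.
NOP4 = bytes([0x0F, 0x1F, 0x40, 0x00])
NUM_NOPS = 8  # from the description: 8 multi-byte nops, no unrolling

# The benchmarked code is simply the NOP repeated back to back.
snippet = NOP4 * NUM_NOPS

print(len(snippet))   # total code size in bytes -> 32
print(snippet.hex())  # hex dump of the whole snippet
```

So the benchmarked code is 8 instructions in 32 bytes.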
command
results (I slightly modified the source code to dump absolute measured counters)
I cannot understand why every instruction is decoded by the legacy x86 decoder.
I also checked with a simple test program consisting of the same code pattern (see below).
test.s build command: <nasm -f elf64 test.s -o test.o; ld test.o -o test>
Then, I checked the performance counters with the perf tool.
The results show that most uops are dispatched from the micro-op cache (r01AA => dispatched from the legacy x86 decoder // r02AA => dispatched from the micro-op cache // r03AA => all uops).
Why do nanoBench and perf show different results?
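To make the raw event codes easier to read: perf's `rUUEE` notation on x86 puts the event select in the low byte and the unit mask in the next byte. The Python sketch below decodes the three codes and computes the micro-op cache share from two counter values. The function names and the sample counts are hypothetical, and the unit-mask meanings are taken from this issue, not from AMD's documentation.

```python
def decode_raw_event(raw):
    """Split a perf raw event such as 'r01AA' into (event_select, unit_mask)."""
    if not raw.lower().startswith("r"):
        raise ValueError("expected a raw event of the form rUUEE")
    value = int(raw[1:], 16)
    return value & 0xFF, (value >> 8) & 0xFF

# Unit-mask bits of event 0xAA as described in this issue (assumption:
# bit 0 = legacy decoder, bit 1 = micro-op cache).
UMASK_BITS = {0x01: "legacy x86 decoder", 0x02: "micro-op cache"}

def describe(raw):
    """Return a human-readable description of a raw event code."""
    event, umask = decode_raw_event(raw)
    sources = [name for bit, name in UMASK_BITS.items() if umask & bit]
    return "event 0x%02X, umask 0x%02X: %s" % (event, umask, " + ".join(sources))

def op_cache_share(decoder_uops, op_cache_uops):
    """Fraction of dispatched uops that came from the micro-op cache."""
    total = decoder_uops + op_cache_uops
    return op_cache_uops / total if total else 0.0

for code in ("r01AA", "r02AA", "r03AA"):
    print(describe(code))

# Hypothetical counts, NOT measured values from this issue.
print(op_cache_share(100, 900))
```

With real measurements, a share close to 1.0 would correspond to the perf result described here, and a share close to 0.0 to the kernel-mode nanoBench result.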
Sincerely.
Joonsung Kim.