Self measuring instruction performance tests #839

mikaelsky · 2024-03-05T03:04:37Z

mikaelsky
Mar 5, 2024
Collaborator

I've added the first of a series of self-measuring performance tests that use mcycle and some inline assembly. This will help update the instruction cycle counts in the datasheet :)

https://github.com/mikaelsky/neorv32/tree/Performance_tests

Right now the MRET test isn't functional. It traps, probably because U mode is enabled.

The main.c file has a boat load of parameters to adjust what is tested and how.

@stnolting one challenge right now, and I'm not sure why it's happening, is that the cycle counts seem off.
If you look at the README, I pasted in the performance measurements for the basic arithmetic instructions. When I run the test internally on Xcelium I get the expected cycle count of 2 for everything except sub, which is 3 for some reason. But running with GHDL I get 4, so ~2x.
I'll dig some more on my end. The only core config difference I see is that our core config has C enabled, but that shouldn't matter for I instructions. Beyond that, I had to do some NOP tuning for all the branch and jump instructions.

A further detail: branches, for example, come in at (3, 7, 8) cycles for [no branch, forward, backward] internally, but at (4, 11, 11) cycles in the default sim setup. I added a forward and a backward test because the RISC-V spec recommends that software assume backward branches are predicted taken and forward branches not taken, so a backward branch can be faster than a forward one.

I will add my M and Zfinx self measuring tests as well.

mikaelsky · 2024-03-05T04:56:47Z

mikaelsky
Mar 5, 2024
Collaborator Author

Note: added M and Zfinx test suites as well.

0 replies

mikaelsky · 2024-03-05T15:20:40Z

mikaelsky
Mar 5, 2024
Collaborator Author

Just did a comparison run on our internal setup. Our config is a bit different, e.g. we only have machine mode available.
It's worth noting that I'm only taking the neorv32_cpu (and the DM/DMT) in my setup.
The bus fabric is a complete rewrite in Verilog, quite different from yours, but also a lot more conventional :) The bus switch got ported to Verilog, feature-enhanced and performance-tuned (we should be 1 cycle faster for consecutive accesses as we don't fly back to IDLE).

GCC is version 12.2.0 and I run with -O0 to ensure I don't lose any assembly instructions.

Note: virtual_uart is a test-bench thing that allows us to print without having an actual UART in the design. Hence the "virtual".

VIRTUAL_UART: perform: for (i=0;i<1,i++) {256 instructions}
VIRTUAL_UART: add tot. 570 cyc
VIRTUAL_UART: total 570 cyc
VIRTUAL_UART: add rd,rs1,rs2 inst. 2 cyc
VIRTUAL_UART: addi tot. 828 cyc
VIRTUAL_UART: total 1398 cyc
VIRTUAL_UART: addi rd,rs1,imm inst. 3 cyc
VIRTUAL_UART: sub tot. 828 cyc
VIRTUAL_UART: total 2226 cyc
VIRTUAL_UART: sub rd,rs1,rs2 inst. 3 cyc
VIRTUAL_UART: lui tot. 570 cyc
VIRTUAL_UART: total 2796 cyc
VIRTUAL_UART: lui rd,imm inst. 2 cyc
VIRTUAL_UART: auipc tot. 569 cyc
VIRTUAL_UART: total 3365 cyc
VIRTUAL_UART: auipc rd,imm inst. 2 cyc
VIRTUAL_UART: instructions tested: 5
VIRTUAL_UART: total 3365 cycles
VIRTUAL_UART: avg. inst. execute cyles 2.628

This is a screen capture from SimVision of the core running the add loop. It's 2 cycles per add.

Assembly with -O0

  #if rv32I_arith == 1
    instToTest += 5;
     19c:       fe842783                lw      a5,-24(s0)
     1a0:       0795                    add     a5,a5,5
     1a2:       fef42423                sw      a5,-24(s0)
    startTime = NEORV32_MTIME->TIME_LO;
     1a6:       e8c00793                li      a5,-372
     1aa:       43dc                    lw      a5,4(a5)
     1ac:       fef42023                sw      a5,-32(s0)
    for (i = 0; i < instLoop; i++) {
     1b0:       fe042623                sw      zero,-20(s0)
     1b4:       a431                    j       3c0 <main+0x28e>
      #elif instCalls == 64
        cpy_64(addInst);
      #elif instCalls == 128
        cpy_128(addInst);
      #elif instCalls == 256
        cpy_256(addInst);
     1b6:       952e                    add     a0,a0,a1
     1b8:       952e                    add     a0,a0,a1
     1ba:       952e                    add     a0,a0,a1
     1bc:       952e                    add     a0,a0,a1
     1be:       952e                    add     a0,a0,a1
     1c0:       952e                    add     a0,a0,a1
     1c2:       952e                    add     a0,a0,a1
     1c4:       952e                    add     a0,a0,a1

1 reply

mikaelsky Mar 5, 2024
Collaborator Author

As a reference, these are the results for the arith test suite from the git repo.

<<< I performance test >>>
perform: for (i=0;i<1,i++) {256 instructions}
add tot. 1058 cyc
total 1058 cyc
add rd,rs1,rs2 inst. 4 cyc
addi tot. 1058 cyc
total 2116 cyc
addi rd,rs1,imm inst. 4 cyc
sub tot. 1058 cyc
total 3174 cyc
sub rd,rs1,rs2 inst. 4 cyc
lui tot. 1058 cyc
total 4232 cyc
lui rd,imm inst. 4 cyc
auipc tot. 1058 cyc
total 5290 cyc
auipc rd,imm inst. 4 cyc
instructions tested: 5
total 5290 cycles
avg. inst. execute cyles 4.132

mikaelsky · 2024-03-05T15:43:36Z

mikaelsky
Mar 5, 2024
Collaborator Author

Additional data. This is the performance with C (compressed) turned off:

VIRTUAL_UART: perform: for (i=0;i<1,i++) {256 instructions}
VIRTUAL_UART: add tot. 568 cyc
VIRTUAL_UART: total 568 cyc
VIRTUAL_UART: add rd,rs1,rs2 inst. 2 cyc
VIRTUAL_UART: addi tot. 568 cyc
VIRTUAL_UART: total 1136 cyc
VIRTUAL_UART: addi rd,rs1,imm inst. 2 cyc
VIRTUAL_UART: sub tot. 568 cyc
VIRTUAL_UART: total 1704 cyc
VIRTUAL_UART: sub rd,rs1,rs2 inst. 2 cyc
VIRTUAL_UART: lui tot. 568 cyc
VIRTUAL_UART: total 2272 cyc
VIRTUAL_UART: lui rd,imm inst. 2 cyc
VIRTUAL_UART: auipc tot. 568 cyc
VIRTUAL_UART: total 2840 cyc
VIRTUAL_UART: auipc rd,imm inst. 2 cyc
VIRTUAL_UART: instructions tested: 5
VIRTUAL_UART: total 2840 cycles
VIRTUAL_UART: avg. inst. execute cyles 2.218

1 reply

mikaelsky Mar 5, 2024
Collaborator Author

What is interesting here is that the addi and sub instructions take 3 cycles when compressed instructions are turned on.

stnolting · 2024-03-05T21:15:57Z

stnolting
Mar 5, 2024
Maintainer

That's impressive - you really put a lot of work into this! 👍

However, this really surprised me:

> add rd,rs1,rs2 inst. 4 cyc

All simple ALU operations should complete within 2 clock cycles. What kind of memory system did you use for these evaluations?

Furthermore, I would suggest stopping the cycle counter if the actual benchmarking instructions are not executed. Otherwise, the loop overhead instructions are also taken into account.

neorv32_cpu_csr_write(CSR_MCOUNTINHIBIT, -1); // halt all counters
...
  startTime = neorv32_cpu_csr_read(CSR_MCYCLE);
  for (i = 0; i < instLoop; i++) {
    neorv32_cpu_csr_write(CSR_MCOUNTINHIBIT, 0); // enable all counters
    #if instCalls == 16
    ...
    #endif
    neorv32_cpu_csr_write(CSR_MCOUNTINHIBIT, -1); // halt all counters
  }
  stopTime = neorv32_cpu_csr_read(CSR_MCYCLE);

Somewhere in the code I saw that you were having trouble with NOP instructions when compiling with compressed instructions. This example shows how to prevent the assembler from emitting compressed NOPs (c.nop) when the C ISA extension is enabled.

asm volatile (".option push  \n"
              ".option norvc \n"
              "nop           \n"
              ".option pop   \n");

> what is interesting here is that addi and sub instructions take 3 cycles when compressed instructions are turned on

The big "problem" with the C extension is that it allows uncompressed 32-bit instructions to be split between two consecutive 32-bit memory words. In the worst case, an additional instruction fetch is required to fetch the "missing half" of the instruction word. In inline-assembly code you can manually force a specific alignment with the .balign operation to circumvent this.

> The bus-switch got ported to verilog, feature enhanced and performance tuned (we should be 1 cycle faster for consecutive accesses as we don't fly back to IDLE).

Oh, no IDLE status? How do you manage that? Don't you then have a direct combinatorial feedback from ACK/ERR to STB?

4 replies

mikaelsky Mar 5, 2024
Collaborator Author

Soo many questions :) Let me do some tweaking and I can push in a pull request for this feature. I also plan on adding a C and B test suite set.

> What kind of memory system did you use for these evaluations?

I used the default provided with the repo for the software examples. Not sure what was set up in that area.

> Furthermore, I would suggest stopping the cycle counter if the actual benchmarking instructions are not executed.

It does have a tiny impact, but for the default software example area GCC is run with -Os. As the default for-loop length is 1, it all gets optimized out anyway. At least when I looked at the assembly it was gone.
Internally we are running with -O0 for easier debug. Here the for loop is indeed accounted for: it costs ~40 cycles per loop iteration, and with an inner loop size of 256 instructions it ends up adding 40/256 per instruction, which is <1 cycle and gets rounded off by the integer divide.
For small inner loop sizes it matters 100%, but for now I've preferred to just script my way around it and run more, smaller code images.
Not that I haven't thought about your suggestion. I intentionally kept it simple first, but I see where you are coming from :) The overhead would then only be the two CSR reads vs the instruction under test.
> This example shows how to prevent the assembler from emitting compressed NOPs (c.nop) when the C ISA extension is enabled.

I will add that to the code base. Makes the code a lot more portable!

> You can manually force a specific alignment with the .balign operation to circumvent this.

I will look into this. It explains some of the odd behavior I noticed where, depending on the code settings, I got 2 or 3 cycles for sub and auipc. This means that for the C test suite I will have to create an aligned and a misaligned test, basically. Good input!!!

> Oh, no IDLE status? How do you manage that? Don't you then have a direct combinatorial feedback from ACK/ERR to STB?

The usual way, cheating ;) Nah, it's a bit simpler than that.
When we see a port ack/err we look at the other port(s) for a pending STB and go straight to that port without passing through idle. So for instructions that do a lot of load/store we ping-pong quickly between the two. In our case the bus mux has 4 ports, so yeah, a bit more fun :) This is instead of stacking two 2-port muxes as seems to be the default... wondering if that would add an additional cycle of delay?
We also allow STB to go directly out, as our rewritten peripheral interface always responds with ack/err delayed by 1 cycle. This means we don't have a combinatorial path between STB and ACK/ERR anywhere in the bus fabric, so as soon as we are in idle STB is allowed to flow through.

mikaelsky Mar 6, 2024
Collaborator Author

So I made a "benchmark" tb where I disable basically everything outside the core. Doing that we get this result which is more in line with the expectations:

<<< I performance test >>>
perform: for (i=0;i<1,i++) {256 instructions}
add tot. 515 cyc
total 515 cyc
add rd,rs1,rs2 inst. 2 cyc
addi tot. 515 cyc
total 1030 cyc
addi rd,rs1,imm inst. 2 cyc
sub tot. 515 cyc
total 1545 cyc
sub rd,rs1,rs2 inst. 2 cyc
lui tot. 515 cyc
total 2060 cyc
lui rd,imm inst. 2 cyc
auipc tot. 515 cyc
total 2575 cyc
auipc rd,imm inst. 2 cyc
instructions tested: 5
total 2575 cycles
avg. inst. execute cyles 2.11

mikaelsky Mar 6, 2024
Collaborator Author

And this is why we benchmark :) If I turn on the instruction cache we get the result below.
So turning on the instruction cache costs 2 extra cycles per instruction (in this case that halves performance; for e.g. Zfinx the relative impact is smaller).

<<< I performance test >>>
perform: for (i=0;i<1,i++) {256 instructions}
add tot. 1058 cyc
total 1058 cyc
add rd,rs1,rs2 inst. 4 cyc
addi tot. 1058 cyc
total 2116 cyc
addi rd,rs1,imm inst. 4 cyc
sub tot. 1058 cyc
total 3174 cyc
sub rd,rs1,rs2 inst. 4 cyc
lui tot. 1058 cyc
total 4232 cyc
lui rd,imm inst. 4 cyc
auipc tot. 1058 cyc
total 5290 cyc
auipc rd,imm inst. 4 cyc
instructions tested: 5
total 5290 cycles
avg. inst. execute cyles 4.132

stnolting Mar 6, 2024
Maintainer

> Let me do some tweaking and I can push in a pull request for this feature.

❤️

> I used the default provided with the repo for the software examples. Not sure what was setup in that area.

The "simple" default testbench has the caches enabled, which does not bring any performance boost for this specific setup. The caches are enabled just for verifying the processor.

> Internally we are running with -O0 for easier debug. Here the for loop is indeed accounted for, it is ~40 cycles per loop impact, with an inner loop size of 256 instructions it ends up impacting by 40/256... which is <1 cycle and gets rounded off by the integer divide.

👍

> Not that I haven't thought about your suggestion. I intentionally kept it simple first, but I see where you are coming from :)

😉

> I will look into this. It explains some of the odd behavior I noticed where depending on the code settings I got 2 or 3 cycles for sub and auipc. This means that for the C test suite I will have to create an aligned and a misaligned test, basically. Good input!!!

I think there is also some GCC option to force a strict alignment of instructions - at least for branch labels. But I'm not sure if that would help in this specific case.

> In our case the bus mux has 4 ports

4 host ports? Do I read "multi-core" between the lines? 😅

> This is instead of stacking two 2 port muxes as seems to be the default... wondering if that would add an additional cycle delay?

The "flow" of requests throughout the default bus mux is entirely combinatorial. Anyway, I wonder if your approach could be mapped to this default mux without impacting the critical path?! 🤔

> So I made a "benchmark" tb where I disable basically everything outside the core. Doing that we get this result which is more in line with the expectations:

Yeah, this looks way better!

> And this is why we benchmark :) If I turn on the instruction cache we get this:
> So turning on the instruction cache cuts performance by 2 cycles (in this case half, but e.g. for Zfinx is less)

The caches in their current setup are really a problem. If memory accesses only take 2 cycles (e.g. internal IMEM and DMEM) then using the caches does not make any sense as they increase overall memory latency. We are therefore discussing whether we should remove the internal caches and replace them with a separate cache that only buffers access via the external memory interface: #793
