quark is stuck in an undefined state at reset in simulator with 4-state (01xz) support #61
Comments
Hi @jeras, thank you very much for the feedback, I'll double-check and test.
Internal state register On FPGA this would not affect power-on reset, since all registers are initialized to
The So by splitting the always statement into one with reset for A related issue is described in this document, section "3.1 Synchronous reset flip-flops with non reset follower flip-flops":
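As a rough illustration of the split mentioned above, here is a minimal sketch. It is not the Quark source: the module and signal names (`split_reset_sketch`, `state`, `data_q`) and the active-high reset are assumptions made for the example. When a register with synchronous reset and a register without reset share one `always` block, the register without reset has to hold its value while reset is asserted, so reset ends up acting as an enable on it. Separate blocks avoid that.

```systemverilog
// Hypothetical sketch, not taken from femtorv32_quark.v.
module split_reset_sketch (
  input  logic       clk,
  input  logic       rst,         // synchronous, active high (assumption)
  input  logic [1:0] next_state,
  input  logic [7:0] data_d,
  output logic [1:0] state,
  output logic [7:0] data_q
);
  // Control state: needs a defined value after reset.
  always_ff @(posedge clk) begin
    if (rst) state <= 2'd0;
    else     state <= next_state;
  end

  // Data register: no reset. In its own block, reset does not gate it.
  always_ff @(posedge clk) begin
    data_q <= data_d;
  end
endmodule
```

The cost of leaving `data_q` without reset is that it stays `x` in a 4-state simulator until it is first written, so nothing that decides control flow should depend on it before then.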
Thank you for the observations and hints! Are you going to produce an ASIC with the Quark in it? The Quark is designed around using BRAMs as its register set, which in an ASIC would be implemented with actual registers and muxes that give a hardwired zero for x0. Also, absolutely go for a barrel shifter in an ASIC! In terms of size, the Quark-with-barrel-shifter still fits in the HX1K FPGA, and it is a real gain. The official Quark does not have it because we needed every possible LUT for adding peripherals, but beyond the constraints of the tiny HX1K FPGA, a much better performance/area ratio can be achieved.
By the way, if I were about to design an ASIC, I would shoot for some very cool analog chip with special optical features :-) Digital design can be done with a vanilla FPGA next to it.
First I must say I noticed some initialization code at the end of the RTL under a BENCH I do not plan to make an ASIC with Before you continue reading, please note that I did not go through the episodic design notes yet. I will try to get through them before asking too many further questions, and will open a separate issue if I have further questions or suggestions.
@Mecrisp, you mentioned quark was designed to use BRAM on FPGA (if I understood the sentence correctly). But the RTL is written like a flip-flop implementation with one write port and 2 asynchronous read ports (data multiplexers). A short name for this port configuration is 2R1W. Yesterday I compiled it with Vivado for the Arty board and apparently a BRAM block was used for the register file. I am perplexed how BRAM was inferred from the given RTL. The Artix BRAM has true dual port support: 2 independent read/write ports, in short 2RW. So one port would have to be shared between a read and a write, which could not be done simultaneously. Since this sharing is not explicit in the RTL, I am not sure how the synthesis tool was able to infer a BRAM. In my experience inference is a very delicate process, often resulting in suboptimal implementations.
I only tried to synthesize for Lattice HX1K using your makefiles (with yosys), and there it is even less clear to me if and how a BRAM was used. The HX1K ports are 1W1R and the largest data width configuration is 16, so at least 4 blocks would be needed to implement a 1W2R 32-bit register file.
If (when) I start a CPU design for HX1K, I would just store the register file in the main memory (probably at the end); this would result in the smallest logic consumption. I think a similar approach might be used in 8-bit AVR, since load/store instructions can access the GPRs.
When aluShamt has a non-zero value after coming fresh from a short reset pulse, shifting is active, but as the processor starts in the WAIT_ALU_OR_MEM state, it will perform up to 31 dummy shift cycles and then resume operation. I know this is not the most elegant solution, but we were aggressively optimising for minimum LUT usage, and therefore we added reset initialisations only when absolutely necessary. The BENCH section will cover your needs. Most of the paths we explored during design are described in #1
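A minimal sketch of why this behaves differently on hardware and in a 4-state simulator. It is not the Quark source: `shamt_countdown_sketch`, `start`, `amount` and `busy` are hypothetical names, and only the idea of a shift counter without reset comes from the description above.

```systemverilog
// Hypothetical sketch of a serial-shift countdown without reset.
module shamt_countdown_sketch (
  input  logic       clk,
  input  logic       start,   // load a new shift amount
  input  logic [4:0] amount,
  output logic       busy
);
  // No reset: some arbitrary 0/1 pattern on hardware, all x in a
  // 4-state simulator.
  logic [4:0] shamt;

  assign busy = |shamt;

  always_ff @(posedge clk) begin
    if (start)     shamt <= amount;
    else if (busy) shamt <= shamt - 5'd1;
  end
endmodule
```

On hardware `shamt` starts at some value between 0 and 31 and counts down to zero, which gives the "up to 31 dummy shift cycles". In a 4-state simulator `|shamt` is `x`, the `else if (busy)` branch is not taken, and `shamt` keeps its `x` until something asserts `start`. A 2-state simulator never shows this.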
Cool! There is no asynchronous read of the register file; reads are actually clocked as well. See this snippet of the relevant lines copied from Quark source:
Correct. We are specifying a 2R1W implementation on an FPGA that only offers 1R1W block RAMs in hardware. In order to achieve 32-bit wide access, actually 4 blocks of memory are used on HX1K for the register file. Like you, we were totally surprised by the marvellous RAM inference capabilities Yosys has to offer. Keeping the register file in memory is possible; the SERV core works that way, but it decreases performance when compared to a separate register file. The Quark as-is needs 3 cycles for regular and 5 cycles for load/store instructions (up to 34 cycles for shifts); Quark-Bicycle achieves 2 cycles for regular instructions and 4 cycles for load/store with a barrel shifter and additional address bus multiplexers in the execute state. Cycle counts are given for running from RAM with no memory busy.
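For readers following along, this is a minimal sketch of a register file with one write port and two clocked read ports, the shape described above. It is not copied from the Quark source, and all names (`regfile_2r1w_sketch`, `we`, `rd_id`, `rs1_id`, ...) are hypothetical. Whether a given tool maps it to block RAM depends on the tool and the target.

```systemverilog
// Hypothetical 2R1W register file with synchronous reads.
module regfile_2r1w_sketch (
  input  logic        clk,
  input  logic        we,
  input  logic [4:0]  rd_id,
  input  logic [31:0] rd_data,
  input  logic [4:0]  rs1_id,
  input  logic [4:0]  rs2_id,
  output logic [31:0] rs1,
  output logic [31:0] rs2
);
  logic [31:0] regs [0:31];

  always_ff @(posedge clk) begin
    // Writes to x0 are dropped, so x0 keeps whatever the memory
    // started with (assumed zero; in a 4-state simulator it is x
    // unless the array is initialized).
    if (we && rd_id != 5'd0) regs[rd_id] <= rd_data;

    // Clocked reads: data appears one cycle after the index.
    rs1 <= regs[rs1_id];
    rs2 <= regs[rs2_id];
  end
endmodule
```

One way a tool can fit this onto 1R1W block RAMs is to keep one copy of the memory per read port and write both copies together. With 16-bit wide blocks, each 32-bit copy takes 2 blocks, so 2 copies give the 4 blocks mentioned above.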
Thanks for all the precious details. I still need to read through the SERV documentation. I would like to understand how much is serialized, since it is probably not everything (instruction decoder?). There is probably still some middle ground between SERV and quark which is worth exploring. I think it should be possible to get a CPI around 3~5 while storing instruction/data and registers in the same memory, which is close to the current quark. But memory utilization would be much better, so some memory could be available to peripherals, or there would just be more memory for CPU instruction/data.
I had a look at the Quark-Bicycle source code. I will definitely borrow the idea of sharing a single shifter for left/right. I apologize in advance for the following rant; my social skills are not good enough for me to be educational but not condescending. Code where every bit is coded individually rubs me the wrong way (it took me 15 min just to find an appropriate phrase):

    function logic [XLEN-1:0] reverse (logic [XLEN-1:0] value);
      // loop bound follows XLEN instead of a hard-coded 32
      for (int i=0; i<XLEN; i++) reverse[i] = value[XLEN-i-1];
    endfunction: reverse

This function can then be used for both reversal operations. I have a special disdain for generated CRC code, when the same can be achieved with a totally generic XOR and a shift on a vector inside a loop.
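To make the "single shifter for left/right" idea concrete, here is a minimal sketch built around a generic `reverse` function. It is not the Quark-Bicycle source: the module name and the `left`/`arith` control inputs are assumptions made for the example. A left shift is done by reversing the operand, shifting right, and reversing the result.

```systemverilog
// Hypothetical sketch: one right shifter shared by SLL, SRL and SRA.
module shared_shifter_sketch #(
  parameter int XLEN = 32
) (
  input  logic [XLEN-1:0]         value,
  input  logic [$clog2(XLEN)-1:0] shamt,
  input  logic                    left,   // 1: shift left, 0: shift right
  input  logic                    arith,  // 1: arithmetic (right shifts only)
  output logic [XLEN-1:0]         result
);
  // Bit reversal written once, for any XLEN.
  function automatic logic [XLEN-1:0] reverse (input logic [XLEN-1:0] v);
    for (int i = 0; i < XLEN; i++) reverse[i] = v[XLEN-i-1];
  endfunction

  logic [XLEN-1:0]    shin;
  logic               fill;
  logic signed [XLEN:0] ext;

  always_comb begin
    shin   = left ? reverse(value) : value;
    // Sign bit is shifted in only for an arithmetic right shift.
    fill   = arith & ~left & value[XLEN-1];
    // One extra top bit carries the fill value into the shift.
    ext    = $signed({fill, shin}) >>> shamt;
    result = left ? reverse(ext[XLEN-1:0]) : ext[XLEN-1:0];
  end
endmodule
```

The two `reverse` calls are only wiring, so the area cost over a plain right shifter is two rows of 2:1 multiplexers.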
How will you handle memory busy signals when you aim for similar CPI but with registers in main memory?
Yesterday I wrote a draft spec for a single memory CPU: I find all standard low resource buses inadequate.
You could try a combinatorial piece of logic for short-circuiting the wait for busy/ready signaling depending on the address range. Bruno inserted a short-circuit for sw/sh/sb working this way; if you have asynchronous read of the register file, you might do something similar for fetch. Quark is currently limited to a minimum CPI of 2 because the register file needs one cycle after the opcode arrives.
I already took advantage of all possible combinational bypasses in the draft design. I made this design with ASIC in mind. On an ASIC you can usually choose what kind of memories you would like to use, and single port memories are about 30% smaller than dual port. So on ASIC a processor without a GPR register file would not make much sense with dual port memories.
It would probably be possible to create some kind of pipeline for quark with at least some instructions executing in a single clock cycle, but I am not sure it would be worth the effort. Do all processors in the FemtoRV32 family use a synchronous GPR register file?
My original R5P processor design has a CPI of 1 for all instructions. This requires an asynchronous GPR register file. So my focus was on FPGA devices with 1RW1R distributed memories where the 1R port can be asynchronous. Some of these FPGAs are Artix, ECP5, Cyclone 5, GOWIN, ... I only started thinking about a CPU without a register file after talking to you and thinking it could be very small in an ASIC.
As I understand it, cost was the main reason for focusing on Lattice HX1K. Did you consider low cost solutions from GOWIN and other low cost vendors? I personally prefer the major vendors due to mature tools, but with improvements in open source tools low cost FPGAs might become more attractive.
Great! I'll keep an eye on your ongoing development efforts.
At this point in time, yes.
We are strong FOSS advocates here, and Project Apicula for Gowin synthesis wasn't available back then. It all started with Lattice HX1K.
I do FPGA firmware development for aerospace as my main job, using commercial tools, and I still strongly prefer Yosys.
I tried to run a simulation of femtorv32_quark.v using the Vivado simulator, because my SoC gets through synthesis fine but gets minimized to nothing during implementation, and I do not know what is going on.
So I noticed the memory bus address is x after reset is removed. This would not be a problem with Verilator, but in a 4-state simulator x is likely to be propagated through the design. In my case the CPU does not change state with the clock.
I went looking at what is on the address bus at reset:
https://github.com/BrunoLevy/learn-fpga/blob/master/FemtoRV/RTL/PROCESSOR/femtorv32_quark.v#L217-L218
https://github.com/BrunoLevy/learn-fpga/blob/master/FemtoRV/RTL/PROCESSOR/femtorv32_quark.v#L335
And the state is initialized to WAIT_ALU_OR_MEM, while state FETCH_INSTR sounds like the right choice to have PC on the address bus at reset.
So I made this change, without checking if there are any other non-obvious expectations. Verilator simulations got the same results as before, and in Vivado simulation the CPU actually started fetching instructions.
Could you please review this change with your insight and maybe fix similar issues in the other implementations?
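The idea behind the change can be sketched as follows. This is not the actual patch or the Quark source: only the state names FETCH_INSTR and WAIT_ALU_OR_MEM come from the text above, while the module name, the other two states, the placeholder state cycle, `RESET_ADDR`, `resetn` and `loadstore_addr` are assumptions made for the example.

```systemverilog
// Hypothetical sketch: which address reaches the bus right after reset.
module reset_state_sketch #(
  parameter logic [31:0] RESET_ADDR = 32'h0000_0000
) (
  input  logic        clk,
  input  logic        resetn,          // synchronous, active low (assumption)
  input  logic [31:0] loadstore_addr,  // derived from registers without reset
  output logic [31:0] mem_addr
);
  typedef enum logic [1:0] {
    FETCH_INSTR, WAIT_INSTR, EXECUTE, WAIT_ALU_OR_MEM
  } state_t;

  state_t      state;
  logic [31:0] PC;

  always_ff @(posedge clk) begin
    if (!resetn) begin
      // Starting in FETCH_INSTR puts the reset PC on the bus; starting
      // in WAIT_ALU_OR_MEM would select loadstore_addr instead.
      state <= FETCH_INSTR;
      PC    <= RESET_ADDR;
    end else begin
      // Placeholder cycle standing in for the real transitions.
      unique case (state)
        FETCH_INSTR:     state <= WAIT_INSTR;
        WAIT_INSTR:      state <= EXECUTE;
        EXECUTE:         begin state <= WAIT_ALU_OR_MEM; PC <= PC + 32'd4; end
        WAIT_ALU_OR_MEM: state <= FETCH_INSTR;
      endcase
    end
  end

  // PC during fetch, load/store address otherwise.
  assign mem_addr = (state == FETCH_INSTR || state == WAIT_INSTR)
                  ? PC : loadstore_addr;
endmodule
```

In this sketch `loadstore_addr` stands for anything computed from registers that have no reset. In a 4-state simulator it is `x` until those registers are written, so the first bus address is only well defined if the state machine starts in a state that selects `PC`.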
The moral of the story is that Verilator 2-state simulation can sometimes hide some issues, especially if don't-care values are often used in the RTL. My RTL is full of them, so I have to simulate with the --x-assign argument set to both 0 and 1.