[rtl] Add Single Cycle Multiplier targeting FPGA
* Integrate an option to implement the multiplier using three parallel
  17-bit multipliers, so that MUL instructions complete in 1 cycle and
  MULH instructions in 2 cycles.

* Add the parameter SingleCycleMultiply to select single-cycle
  multiplication.

The single cycle multiplication capability is intended for FPGA
targets. Using three parallel multiplication units improves performance
of multiplication operations at the cost of DSP primitives. For ASIC
targets, the area consumed by the multiplication structure will grow
approximately 3-4x.
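
The arithmetic behind the three parallel multiplication units can be illustrated with a small behavioral model. The Python sketch below is not the RTL: it assumes unsigned 32-bit operands split into 16-bit halves and ignores signedness, and the function names are hypothetical. It shows that three partial products are enough for the low word (MUL), while the high word (MULH) needs a fourth partial product, which is consistent with MULH taking one extra cycle.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def split16(x):
    """Return the (low, high) 16-bit halves of a 32-bit unsigned value."""
    return x & MASK16, (x >> 16) & MASK16


def mul_low(a, b):
    """Low 32 bits of a*b from three partial products (one step)."""
    al, ah = split16(a)
    bl, bh = split16(b)
    p_ll = al * bl  # multiplier unit 1
    p_lh = al * bh  # multiplier unit 2
    p_hl = ah * bl  # multiplier unit 3
    # ah*bh is weighted by 2**32, so it cannot affect the low word.
    return (p_ll + ((p_lh + p_hl) << 16)) & MASK32


def mul_high_unsigned(a, b):
    """High 32 bits of unsigned a*b; needs the fourth product ah*bh."""
    al, ah = split16(a)
    bl, bh = split16(b)
    first_step = al * bl + ((al * bh + ah * bl) << 16)
    # Second step: add the carry out of the low word to ah*bh.
    return ((first_step >> 32) + ah * bh) & MASK32
```

For any unsigned 32-bit `a` and `b`, `mul_low(a, b)` should equal `(a * b) & 0xFFFFFFFF` and `mul_high_unsigned(a, b)` should equal `(a * b) >> 32`.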

The functionality is selected within the module using the parameter
`SingleCycleMultiply`. From the top level, it can be chosen by setting
the parameter `MultiplierImplementation` to "single-cycle".
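
The parameter selection can be summarized as a lookup table. The Python sketch below mirrors the generate block in rtl/ibex_ex_block.sv and the cycle counts in the documentation changes of this commit. The helper name, the dictionary layout, and the module name `ibex_multdiv_slow` (inferred from the source file name) are assumptions made for illustration.

```python
# Hypothetical summary table, not part of Ibex.
# value -> (module, SingleCycleMultiply, MUL cycles, MULH cycles)
VARIANTS = {
    # Slow MUL latency depends on op_b, so it is left as None here.
    "slow": ("ibex_multdiv_slow", None, None, 33),
    "fast": ("ibex_multdiv_fast", 0, 3, 4),
    "single-cycle": ("ibex_multdiv_fast", 1, 1, 2),
}


def select_multiplier(multiplier_implementation):
    """Return the module and settings implied by MultiplierImplementation."""
    try:
        module, single_cycle, mul, mulh = VARIANTS[multiplier_implementation]
    except KeyError:
        raise ValueError(
            f"unknown MultiplierImplementation: {multiplier_implementation!r}"
        ) from None
    return {
        "module": module,
        "single_cycle_multiply": single_cycle,
        "mul_cycles": mul,
        "mulh_cycles": mulh,
    }
```

Both "fast" and "single-cycle" map to the same module and differ only in the value passed to `SingleCycleMultiply`.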

Signed-off-by: ganoam <[email protected]>
ganoam authored and vogelpi committed Feb 11, 2020
1 parent ba2240f commit 48c4b6a
Showing 6 changed files with 357 additions and 181 deletions.
40 changes: 27 additions & 13 deletions doc/instruction_decode_execute.rst
@@ -71,19 +71,33 @@ Multiplier/Divider Block (MULT/DIV)
 Source Files: :file:`rtl/ibex_multdiv_slow.sv` :file:`rtl/ibex_multdiv_fast.sv`

 The Multiplier/Divider (MULT/DIV) is a state machine driven block to perform multiplication and division.
-The fast and slow versions differ in multiplier only, both implement the same form of long division algorithm.
-The ALU block is used by the long division algorithm in both the fast and slow blocks.
-
-Fast Multiplier
-- Completes multiply in 3-4 cycles using a MAC (multiply accumulate) which is capable of a 17-bit x 17-bit multiplication with a 34-bit accumulator.
-- A MUL instruction takes 3 cycles, MULH takes 4.
-- This MAC is internal to the mult/div block (no external ALU use).
-- Beware it is simply implemented with the ``*`` and ``+`` operators so results heavily depend upon the synthesis tool used.
-- In some cases it may be desirable to replace this with a specific implementation (such as a hard macro in an FPGA or an explicit gate level implementation).
-
-Slow Multiplier
-- Completes multiply in clog2(``op_b``) + 1 cycles (for MUL) or 33 cycles (for MULH) using a Baugh-Wooley multiplier.
-- The ALU block is used to compute additions.
+The fast and slow versions differ in multiplier only. All versions implement the same form of long division algorithm. The ALU block is used by the long division algorithm in all versions.
+
+Multiplier
+The multiplier can be implemented in three variants controlled via the parameter ``MultiplierImplementation``.
+
+Single-Cycle Multiplier
+This implementation is chosen by setting the ``MultiplierImplementation`` parameter to "single-cycle". The single-cycle multiplier makes use of three parallel multiplier units, designed to be mapped to hardware multiplier primitives on FPGAs. It is therefore the **first choice for FPGA synthesis**.
+
+- Using three parallel 17-bit x 17-bit multiplication units and a 34-bit accumulator, it completes a MUL instruction in 1 cycle. MULH is completed in 2 cycles.
+- This MAC is internal to the mult/div block (no external ALU use).
+- Beware it is simply implemented with the ``*`` and ``+`` operators so results heavily depend upon the synthesis tool used.
+- ASIC synthesis has not yet been tested but is expected to consume 3-4x the area of the fast multiplier for ASIC.
+
+Fast Multi-Cycle Multiplier
+This implementation is chosen by setting the ``MultiplierImplementation`` parameter to "fast". The fast multi-cycle multiplier provides a reasonable trade-off between area and performance. It is the **first choice for ASIC synthesis**.
+
+- Completes multiply in 3-4 cycles using a MAC (multiply accumulate) which is capable of a 17-bit x 17-bit multiplication with a 34-bit accumulator.
+- A MUL instruction takes 3 cycles, MULH takes 4.
+- This MAC is internal to the mult/div block (no external ALU use).
+- Beware it is simply implemented with the ``*`` and ``+`` operators so results heavily depend upon the synthesis tool used.
+- In some cases it may be desirable to replace this with a specific implementation such as an explicit gate level implementation.
+
+Slow Multi-Cycle Multiplier
+To select the slow multi-cycle multiplier, set the ``MultiplierImplementation`` parameter to "slow".
+
+- Completes multiply in clog2(``op_b``) + 1 cycles (for MUL) or 33 cycles (for MULH) using a Baugh-Wooley multiplier.
+- The ALU block is used to compute additions.

 Divider
 Both the fast and slow blocks use the same long division algorithm, it takes 37 cycles to compute (though only requires 2 cycles when there is a divide by 0) and proceeds as follows:
5 changes: 4 additions & 1 deletion doc/integration.rst
@@ -90,7 +90,10 @@ Parameters
 | ``BranchTargetALU`` | bit | 0 | *EXPERIMENTAL* - Enables branch target ALU removing a stall |
 | | | | cycle from taken branches |
 +------------------------------+-------------+------------+-----------------------------------------------------------------+
-| ``MultiplierImplementation`` | string | "fast" | Multiplicator type, "slow", or "fast" |
+| ``MultiplierImplementation`` | string | "fast" | Multiplicator type: |
+| | | | "slow": multi-cycle slow, |
+| | | | "fast": multi-cycle fast, |
+| | | | "single-cycle": single-cycle |
 +------------------------------+-------------+------------+-----------------------------------------------------------------+
 | ``DbgTriggerEn`` | bit | 0 | Enable debug trigger support (one trigger only) |
 +------------------------------+-------------+------------+-----------------------------------------------------------------+
9 changes: 6 additions & 3 deletions doc/pipeline_details.rst
@@ -47,9 +47,12 @@ Read the description for more information.
 | | | takes to receive a response the longer loads and stores |
 | | | will stall. |
 +-----------------------+--------------------------------------+-------------------------------------------------------------+
-| Multiplication | 2/3 (Fast Multiplier) | Fast: 2 for MUL, 3 for MULH. |
-| | | Slow: clog2(``op_b``) for MUL, 32 for MULH. |
-| | clog2(``op_b``)/32 (Slow Multiplier) | See details in :ref:`mult-div` |
+| Multiplication | 0/1 (Single-Cycle Multiplier) | 0 for MUL, 1 for MULH. |
+| | | |
+| | 2/3 (Fast Multi-Cycle Multiplier) | 2 for MUL, 3 for MULH. |
+| | | |
+| | clog2(``op_b``)/32 (Slow Multi-Cycle | clog2(``op_b``) for MUL, 32 for MULH. |
+| | Multiplier) | See details in :ref:`mult-div`. |
 +-----------------------+--------------------------------------+-------------------------------------------------------------+
 | Division | 1 or 37 | 1 stall cycle if divide by 0, otherwise full long division. |
 | | | See details in :ref:`mult-div` |
10 changes: 7 additions & 3 deletions lint/verilator_waiver.vlt
@@ -29,15 +29,19 @@ lint_off -msg UNUSED -file "*/rtl/ibex_alu.sv" -lines 104

 // Bits of signal are not used: alu_adder_ext_i[0]
 // Bottom bit is round, not needed
-lint_off -msg UNUSED -file "*/rtl/ibex_multdiv_fast.sv" -lines 26
+lint_off -msg UNUSED -file "*/rtl/ibex_multdiv_fast.sv" -lines 28

 // Bits of signal are not used: mac_res_ext[34]
 // cleaner to write all bits even if not all are used
-lint_off -msg UNUSED -file "*/rtl/ibex_multdiv_fast.sv" -lines 51
+lint_off -msg UNUSED -file "*/rtl/ibex_multdiv_fast.sv" -lines 43

 // Bits of signal are not used: res_adder_h[32]
 // cleaner to write all bits even if not all are used
-lint_off -msg UNUSED -file "*/rtl/ibex_multdiv_fast.sv" -lines 71
+lint_off -msg UNUSED -file "*/rtl/ibex_multdiv_fast.sv" -lines 65
+
+// Bits of signal are not used: mult1_res[33:32]
+// cleaner to write all bits even if not all are used
+lint_off -msg UNUSED -file "*/rtl/ibex_multdiv_fast.sv" -lines 115

 // Signal is not used: test_en_i
 // testability signal
30 changes: 26 additions & 4 deletions rtl/ibex_ex_block.sv
@@ -9,9 +9,9 @@
  * Execution block: Hosts ALU and MUL/DIV unit
  */
 module ibex_ex_block #(
-  parameter bit RV32M = 1,
-  parameter bit BranchTargetALU = 0,
-  parameter MultiplierImplementation = "fast"
+  parameter bit RV32M = 1,
+  parameter bit BranchTargetALU = 0,
+  parameter MultiplierImplementation = "fast"
 ) (
   input logic clk_i,
   input logic rst_ni,
@@ -131,7 +131,29 @@ module ibex_ex_block #(
       .multdiv_result_o ( multdiv_result )
     );
   end else if (MultiplierImplementation == "fast") begin : gen_multdiv_fast
-    ibex_multdiv_fast multdiv_i (
+    ibex_multdiv_fast #(
+      .SingleCycleMultiply(0)
+    ) multdiv_i (
+      .clk_i ( clk_i ),
+      .rst_ni ( rst_ni ),
+      .mult_en_i ( mult_en_i ),
+      .div_en_i ( div_en_i ),
+      .operator_i ( multdiv_operator_i ),
+      .signed_mode_i ( multdiv_signed_mode_i ),
+      .op_a_i ( multdiv_operand_a_i ),
+      .op_b_i ( multdiv_operand_b_i ),
+      .alu_operand_a_o ( multdiv_alu_operand_a ),
+      .alu_operand_b_o ( multdiv_alu_operand_b ),
+      .alu_adder_ext_i ( alu_adder_result_ext ),
+      .alu_adder_i ( alu_adder_result_ex_o ),
+      .equal_to_zero ( alu_is_equal_result ),
+      .valid_o ( multdiv_valid ),
+      .multdiv_result_o ( multdiv_result )
+    );
+  end else if (MultiplierImplementation == "single-cycle") begin: gen_multdiv_single_cycle
+    ibex_multdiv_fast #(
+      .SingleCycleMultiply(1)
+    ) multdiv_i (
       .clk_i ( clk_i ),
       .rst_ni ( rst_ni ),
       .mult_en_i ( mult_en_i ),