[rtl] Add Single Cycle Multiplier targeting FPGA

* Integrate option to implement a multiplier using 3 parallel 17-bit
  multipliers in order to compute MUL instructions in 1 cycle and MULH
  instructions in 2 cycles.
* Add parameter SingleCycleMultiply to select single cycle multiplication.

The single cycle multiplication capability is intended for FPGA targets. Using three parallel multiplication units improves the performance of multiplication operations at the cost of DSP primitives. For ASIC targets, the area consumed by the multiplication structure will grow approximately 3-4x.

The functionality is selected within the module using the parameter `SingleCycleMultiply`. From the top level it can be chosen by setting the parameter `MultiplierImplementation` to "single-cycle".

Signed-off-by: ganoam <[email protected]>
mundaym · Feb 11, 2020 · 48c4b6a · 48c4b6a
1 parent ba2240f
commit 48c4b6a

Showing 6 changed files with 357 additions and 181 deletions.
diff --git a/doc/instruction_decode_execute.rst b/doc/instruction_decode_execute.rst
@@ -71,19 +71,33 @@ Multiplier/Divider Block (MULT/DIV)
 Source Files: :file:`rtl/ibex_multdiv_slow.sv` :file:`rtl/ibex_multdiv_fast.sv`
 
 The Multiplier/Divider (MULT/DIV) is a state machine driven block to perform multiplication and division.
-The fast and slow versions differ in multiplier only, both implement the same form of long division algorithm.
-The ALU block is used by the long division algorithm in both the fast and slow blocks.
-
-Fast Multiplier
-  - Completes multiply in 3-4 cycles using a MAC (multiply accumulate) which is capable of a 17-bit x 17-bit multiplication with a 34-bit accumulator.
-  - A MUL instruction takes 3 cycles, MULH takes 4.
-  - This MAC is internal to the mult/div block (no external ALU use).
-  - Beware it is simply implemented with the ``*`` and ``+`` operators so results heavily depend upon the synthesis tool used.
-  - In some cases it may be desirable to replace this with a specific implementation (such as a hard macro in an FPGA or an explicit gate level implementation).
-
-Slow Multiplier
-  - Completes multiply in clog2(``op_b``) + 1 cycles (for MUL) or 33 cycles (for MULH) using a Baugh-Wooley multiplier.
-  - The ALU block is used to compute additions.
+The fast and slow versions differ in multiplier only. All versions implement the same form of long division algorithm. The ALU block is used by the long division algorithm in all versions.
+
+Multiplier
+  The multiplier can be implemented in three variants controlled via the parameter ``MultiplierImplementation``.
+
+  Single-Cycle Multiplier
+    This implementation is chosen by setting the ``MultiplierImplementation`` parameter to "single-cycle". The single-cycle multiplier makes use of three parallel multiplier units, designed to be mapped to hardware multiplier primitives on FPGAs. It is therefore the **first choice for FPGA synthesis**.
+
+    - Using three parallel 17-bit x 17-bit multiplication units and a 34-bit accumulator, it completes a MUL instruction in 1 cycle. MULH is completed in 2 cycles.
+    - This MAC is internal to the mult/div block (no external ALU use).
+    - Beware it is simply implemented with the ``*`` and ``+`` operators so results heavily depend upon the synthesis tool used.
+    - ASIC synthesis has not yet been tested, but this variant is expected to consume 3-4x the area of the fast multiplier.
+
+  Fast Multi-Cycle Multiplier
+    This implementation is chosen by setting the ``MultiplierImplementation`` parameter to "fast". The fast multi-cycle multiplier provides a reasonable trade-off between area and performance. It is the **first choice for ASIC synthesis**.
+
+    - Completes multiply in 3-4 cycles using a MAC (multiply accumulate) which is capable of a 17-bit x 17-bit multiplication with a 34-bit accumulator.
+    - A MUL instruction takes 3 cycles, MULH takes 4.
+    - This MAC is internal to the mult/div block (no external ALU use).
+    - Beware it is simply implemented with the ``*`` and ``+`` operators so results heavily depend upon the synthesis tool used.
+    - In some cases it may be desirable to replace this with a specific implementation such as an explicit gate level implementation.
+
+  Slow Multi-Cycle Multiplier
+    To select the slow multi-cycle multiplier, set the ``MultiplierImplementation`` parameter to "slow".
+
+    - Completes multiply in clog2(``op_b``) + 1 cycles (for MUL) or 33 cycles (for MULH) using a Baugh-Wooley multiplier.
+    - The ALU block is used to compute additions.
 
 Divider
   Both the fast and slow blocks use the same long division algorithm. It takes 37 cycles to compute (or only 2 cycles when dividing by 0) and proceeds as follows:
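
The cycle counts documented for the single-cycle variant can be motivated with a small arithmetic model. The Python sketch below is a behavioural illustration only, not the RTL: the function names and the split into 16-bit halves are assumptions made for this example, and the extra 17th bit that the hardware multipliers carry for sign handling is left out, so only unsigned operands are modelled. It shows that the low 32 bits of a product need three partial products, while the high 32 bits also need a fourth.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def split16(x):
    """Return the (low, high) 16-bit halves of a 32-bit unsigned value."""
    return x & MASK16, (x >> 16) & MASK16


def mul_low32(a, b):
    """Low 32 bits of a*b, built from three 16x16 partial products (MUL)."""
    al, ah = split16(a)
    bl, bh = split16(b)
    p_ll = al * bl
    p_lh = al * bh
    p_hl = ah * bl
    # ah*bh only contributes to bits 32 and above, so it is not needed here.
    return (p_ll + ((p_lh + p_hl) << 16)) & MASK32


def mul_high32_unsigned(a, b):
    """High 32 bits of the unsigned product; this needs the fourth partial product."""
    al, ah = split16(a)
    bl, bh = split16(b)
    full = al * bl + ((al * bh + ah * bl) << 16) + ((ah * bh) << 32)
    return (full >> 32) & MASK32
```

With three multipliers available, the three partial products of `mul_low32` can be formed in parallel, which is consistent with MUL completing in 1 cycle. The additional `ah * bh` term is one plausible reason MULH takes a second cycle.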

diff --git a/doc/integration.rst b/doc/integration.rst
@@ -90,7 +90,10 @@ Parameters
 | ``BranchTargetALU``          | bit         | 0          | *EXPERIMENTAL* - Enables branch target ALU removing a stall     |
 |                              |             |            | cycle from taken branches                                       |
 +------------------------------+-------------+------------+-----------------------------------------------------------------+
-| ``MultiplierImplementation`` | string      | "fast"     | Multiplicator type, "slow", or "fast"                           |
+| ``MultiplierImplementation`` | string      | "fast"     | Multiplier type:                                                |
+|                              |             |            | "slow": multi-cycle slow,                                       |
+|                              |             |            | "fast": multi-cycle fast,                                       |
+|                              |             |            | "single-cycle": single-cycle                                    |
 +------------------------------+-------------+------------+-----------------------------------------------------------------+
 | ``DbgTriggerEn``             | bit         | 0          | Enable debug trigger support (one trigger only)                 |
 +------------------------------+-------------+------------+-----------------------------------------------------------------+

diff --git a/doc/pipeline_details.rst b/doc/pipeline_details.rst
@@ -47,9 +47,12 @@ Read the description for more information.
 |                       |                                      | takes to receive a response the longer loads and stores     |
 |                       |                                      | will stall.                                                 |
 +-----------------------+--------------------------------------+-------------------------------------------------------------+
-| Multiplication        | 2/3 (Fast Multiplier)                | Fast: 2 for MUL, 3 for MULH.                                |
-|                       |                                      | Slow: clog2(``op_b``) for MUL, 32 for MULH.                 |
-|                       | clog2(``op_b``)/32 (Slow Multiplier) | See details in :ref:`mult-div`                              |
+| Multiplication        | 0/1 (Single-Cycle Multiplier)        | 0 for MUL, 1 for MULH.                                      |
+|                       |                                      |                                                             |
+|                       | 2/3 (Fast Multi-Cycle Multiplier)    | 2 for MUL, 3 for MULH.                                      |
+|                       |                                      |                                                             |
+|                       | clog2(``op_b``)/32 (Slow Multi-Cycle | clog2(``op_b``) for MUL, 32 for MULH.                       |
+|                       | Multiplier)                          | See details in :ref:`mult-div`.                             |
 +-----------------------+--------------------------------------+-------------------------------------------------------------+
 | Division              | 1 or 37                              | 1 stall cycle if divide by 0, otherwise full long division. |
 |                       |                                      | See details in :ref:`mult-div`                              |
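
The multiplication stall counts in the updated table can be collected into a small lookup. This Python sketch is illustrative only: the helper name is hypothetical, the string keys reuse the `MultiplierImplementation` values, and reading clog2 as ceil(log2(op_b)) with a result of 0 for op_b of 0 or 1 is an assumption that this change does not spell out.

```python
def mult_stall_cycles(impl, op, op_b=0):
    """Stall cycles for MUL or MULH per the table (hypothetical helper)."""
    fixed = {
        ("single-cycle", "MUL"): 0,
        ("single-cycle", "MULH"): 1,
        ("fast", "MUL"): 2,
        ("fast", "MULH"): 3,
        ("slow", "MULH"): 32,
    }
    if (impl, op) in fixed:
        return fixed[(impl, op)]
    if (impl, op) == ("slow", "MUL"):
        # Assumed clog2: ceil(log2(op_b)), taken as 0 when op_b <= 1.
        return (op_b - 1).bit_length() if op_b > 1 else 0
    raise ValueError(f"unknown combination: {impl!r}, {op!r}")
```

For the multiplier rows, these values are one less than the cycle counts quoted in doc/instruction_decode_execute.rst, for example 2 stall cycles versus 3 cycles for a fast MUL.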

diff --git a/lint/verilator_waiver.vlt b/lint/verilator_waiver.vlt
@@ -29,15 +29,19 @@ lint_off -msg UNUSED -file "*/rtl/ibex_alu.sv" -lines 104
 
 // Bits of signal are not used: alu_adder_ext_i[0]
 // Bottom bit is round, not needed
-lint_off -msg UNUSED -file "*/rtl/ibex_multdiv_fast.sv" -lines 26
+lint_off -msg UNUSED -file "*/rtl/ibex_multdiv_fast.sv" -lines 28
 
 // Bits of signal are not used: mac_res_ext[34]
 // cleaner to write all bits even if not all are used
-lint_off -msg UNUSED -file "*/rtl/ibex_multdiv_fast.sv" -lines 51
+lint_off -msg UNUSED -file "*/rtl/ibex_multdiv_fast.sv" -lines 43
 
 // Bits of signal are not used: res_adder_h[32]
 // cleaner to write all bits even if not all are used
-lint_off -msg UNUSED -file "*/rtl/ibex_multdiv_fast.sv" -lines 71
+lint_off -msg UNUSED -file "*/rtl/ibex_multdiv_fast.sv" -lines 65
+
+// Bits of signal are not used: mult1_res[33:32]
+// cleaner to write all bits even if not all are used
+lint_off -msg UNUSED -file "*/rtl/ibex_multdiv_fast.sv" -lines 115
 
 // Signal is not used: test_en_i
 // testability signal

diff --git a/rtl/ibex_ex_block.sv b/rtl/ibex_ex_block.sv
@@ -9,9 +9,9 @@
  * Execution block: Hosts ALU and MUL/DIV unit
  */
 module ibex_ex_block #(
-    parameter bit    RV32M                    = 1,
-    parameter bit    BranchTargetALU          = 0,
-    parameter        MultiplierImplementation = "fast"
+    parameter bit RV32M                    = 1,
+    parameter bit BranchTargetALU          = 0,
+    parameter     MultiplierImplementation = "fast"
 ) (
     input  logic                  clk_i,
     input  logic                  rst_ni,
@@ -131,7 +131,29 @@ module ibex_ex_block #(
         .multdiv_result_o   ( multdiv_result        )
     );
   end else if (MultiplierImplementation == "fast") begin : gen_multdiv_fast
-    ibex_multdiv_fast multdiv_i (
+    ibex_multdiv_fast #(
+        .SingleCycleMultiply(0)
+    ) multdiv_i (
+        .clk_i              ( clk_i                 ),
+        .rst_ni             ( rst_ni                ),
+        .mult_en_i          ( mult_en_i             ),
+        .div_en_i           ( div_en_i              ),
+        .operator_i         ( multdiv_operator_i    ),
+        .signed_mode_i      ( multdiv_signed_mode_i ),
+        .op_a_i             ( multdiv_operand_a_i   ),
+        .op_b_i             ( multdiv_operand_b_i   ),
+        .alu_operand_a_o    ( multdiv_alu_operand_a ),
+        .alu_operand_b_o    ( multdiv_alu_operand_b ),
+        .alu_adder_ext_i    ( alu_adder_result_ext  ),
+        .alu_adder_i        ( alu_adder_result_ex_o ),
+        .equal_to_zero      ( alu_is_equal_result   ),
+        .valid_o            ( multdiv_valid         ),
+        .multdiv_result_o   ( multdiv_result        )
+    );
+  end else if (MultiplierImplementation == "single-cycle") begin : gen_multdiv_single_cycle
+    ibex_multdiv_fast #(
+        .SingleCycleMultiply(1)
+    ) multdiv_i (
         .clk_i              ( clk_i                 ),
         .rst_ni             ( rst_ni                ),
         .mult_en_i          ( mult_en_i             ),