4-bit serial multiplier (add & shift algorithm)
Dedicated processor (datapath + control unit)
1. Specifications | Planning | Developing | Test functional | Test gate-level | Prototype | Report |
Let us propose another version of the Mult_4bit. This time it will be a sequential system organised as a dedicated processor. The symbol is proposed in Fig. 1. The device will sample the multiplicand A and the multiplier B operands after detecting the start (ST) click. The operation will run for several CLK cycles and, when the result R is ready, an end-of-operation flag (EO) pulse will be asserted, so that the machine goes back to idle, ready to perform another operation.
The idea behind such serial multipliers is to perform large calculations using limited resources. In this case, we will multiply using just a single Adder_8bit and running the classic add-and-shift algorithm. Another full tutorial shows how to create an Adder_4bit using only one Adder_1bit, and yet another one shows a typical add-and-accumulate algorithm. Similar approaches can be taken, for example, when performing 64-bit arithmetic operations in 8-bit microprocessor hardware.
![]() |
Fig. 1. Symbol. |
Fig. 2 shows an example multiplication: 13 x 12 = 156. The operation takes several CLK periods. Input data can change at any time; only the values sampled after the ST pulse will be processed. In the same way, if the user clicks the ST button again while the machine is still operating, the click will be ignored. To make it simple to display R on LEDs or 7-segment displays, the result will be kept available until the next operation.
![]() |
Fig. 2. Example timing diagram. |
The duration of the start pulse must be taken into account: if ST is still high when the operation finishes, the processor will not start a new operation; it will wait for the signal to go low before presenting the results and completing the operation cycle.
Alternatively, as a better solution for handling user clicks, we can use a synchroniser to generate ST pulses lasting one CLK period (TCLK) for activating the processor. This extra circuit (hardware) will shorten the state diagram (software) and make it more robust and reliable. This is also the way to introduce the CLK_Generator component (L8.2) into the design of the dedicated processor, to obtain all the CLK signals required to run the machine, for instance 200 Hz for the button interface and 10 MHz for the processor.
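The behaviour expected from such a synchroniser can be sketched with a small Python reference model. This is only an illustration under our own assumptions: the function name `single_pulse` and the sample list are hypothetical, the button is assumed to be already sampled at each CLK edge, and debouncing and metastability are ignored.

```python
def single_pulse(samples):
    """Return a one-CLK-wide pulse for every 0 -> 1 transition of the sampled button."""
    previous = 0  # button value stored at the previous CLK edge
    pulses = []
    for sample in samples:
        # The pulse is high only in the cycle where the button has just gone high
        pulses.append(1 if (sample == 1 and previous == 0) else 0)
        previous = sample
    return pulses


# A long click followed by a short one (hypothetical samples, one per CLK edge)
button = [0, 0, 1, 1, 1, 1, 0, 0, 1, 0]
print(single_pulse(button))  # [0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
```

However long the click is, the processor only sees a pulse of one CLK period, so the state diagram does not need extra states to wait for the button release.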
Let us synthesise the project for the MAX10 FPGA building a prototype on the DE10-Lite board.
Additional circuits may be included in a top design to be able to represent results in BCD lighting 7-segment digits.
In the same way, additional adaptation circuits may be added to the basic design to operate with integer numbers.
For the theory behind the add-and-shift algorithm, see Wikipedia, which also covers the many other options available for implementing multiplication in computers or hardware processors.
Specifications | 2. Planning | Developing | Test functional | Test gate-level | Prototype | Report |
Devise a top schematic based on a dedicated processor architecture, as in Fig. 3, which suits almost any application.
![]() |
Fig. 3. Dedicated processor concept. |
The design starts by imagining the sequence of data, arithmetic and logic operations involved in the add-and-shift multiplication algorithm. Fig. 4a shows an example. The Adder_8bit is used only when the multiplier test bit B0 is '1'.
![]() |
Fig. 4a. Example operation using the Adder_8bit. Data and shift registers are required to save partial sums and carries. |
By solving several example operations, we can infer what the algorithm flowchart (or pseudo-code) may look like, as shown in Fig. 4b.
![]() |
Fig. 4b. A kind of flowchart (or pseudo-code) that follows the shift and add multiplying algorithm. It can be implemented using both hardware circuits and software programming. In this chapter 2 we will try to solve it by means of a synchronous dedicated processor. |
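As a software counterpart of the flowchart, the following Python function is a minimal behavioural sketch of one common add-and-shift variant: the multiplicand is shifted left, the multiplier is shifted right, and the sum is accumulated only when the test bit B0 is '1'. The function and variable names are our own assumptions, and the register arrangement of Fig. 4a may differ.

```python
def mult_add_shift(a, b, width=4):
    """Multiply two unsigned operands of `width` bits using add and shift."""
    assert 0 <= a < (1 << width) and 0 <= b < (1 << width)
    multiplicand = a  # shifted left once per iteration
    multiplier = b    # shifted right once per iteration; bit 0 is the test bit B0
    result = 0        # accumulated partial sum (up to 2 * width bits)
    for _ in range(width):
        if multiplier & 1:          # B0 = '1': the adder is used
            result += multiplicand
        multiplicand <<= 1          # shift for the next bit weight
        multiplier >>= 1            # bring the next multiplier bit to B0
    return result


print(mult_add_shift(13, 12))  # 156, the example of Fig. 2
```

A model like this is handy for generating expected results when writing the VHDL testbench later on.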
The datapath will be in charge of executing the algorithm. We can imagine how many components will be included and how to connect them so that operands can be shifted and conveyed to the single Adder_8bit available while saving partial and final data. We propose to use standard components, such as the Shift_reg_4bit and the Data_reg_4bit, for all memory registers. A versatile Counter_mod16 chip may be used for controlling the shift-and-add loop. The synthesiser will optimise the number of resources used.
![]() |
Fig. 5. Proposed datapath. |
How many D_FF are used in this datapath?
The control unit FSM will generate all the datapath control signals and activate them in the corresponding states. Fig. 6 shows an example of state diagram. The processor will be idle until ST is sampled high.
![]() |
Fig. 6. Control unit state diagram. |
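To get a feeling for how a control unit sequences the datapath one CLK cycle at a time, here is a small Python sketch. The state names (IDLE, LOAD, TEST, ADD, SHIFT, DONE) and the register names are hypothetical and are not taken from Fig. 6, so the number of states and the cycle count of the real design may differ.

```python
def run_processor(a, b, width=4):
    """Step a hypothetical add-and-shift FSM; return (result, CLK cycles used)."""
    state = "IDLE"
    reg_a = reg_b = result = count = 0
    cycles = 0
    while True:
        cycles += 1                      # one loop iteration models one CLK edge
        if state == "IDLE":
            state = "LOAD"               # assumes ST is sampled high on this edge
        elif state == "LOAD":
            reg_a, reg_b, result, count = a, b, 0, 0
            state = "TEST"
        elif state == "TEST":
            state = "ADD" if reg_b & 1 else "SHIFT"   # test bit B0
        elif state == "ADD":
            result += reg_a
            state = "SHIFT"
        elif state == "SHIFT":
            reg_a <<= 1
            reg_b >>= 1
            count += 1
            state = "DONE" if count == width else "TEST"
        else:                            # DONE: result is ready, EO would be asserted
            return result, cycles


result, cycles = run_processor(13, 12)
print(result)  # 156
```

In this sketch the ADD state is skipped when B0 is '0', so the operating time depends on how many bits of the multiplier are '1'.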
In this way, the top dedicated architecture adapted to solve this project will be as represented in Fig. 7 adding the CLK_Generator circuit to obtain the SYS_CLK with the frequency given as a design option.
![]() |
Fig. 7. Top dedicated processor system for engineering the serial Mult_4bit. |
The control unit will be an FSM as proposed in L6.1. Adapt its general architecture to this problem, explaining where each signal is connected.
![]() |
Fig. 8. FSM for this project. The three processes are translated in a single plan C1 circuit into VHDL using behavioural descriptions. |
Deduce how many D_FF are required when encoding the control unit FSM states using the following options and draw the state register memory:
Option #1: radix-2 (sequential). 7 states means that r = 3, and one code is never used.
Option #2: Gray. 7 states means that r = 3, and one code is never used.
Option #3: Johnson. 7 states means that r = 4 (a Johnson counter with r flip-flops provides 2r codes), and one code of the sequence is never used.
Option #4: One-hot. 7 states means that r = 7.
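The number of D_FF for each encoding can be checked with a few lines of Python. This is a sketch based on the usual definitions: radix-2 and Gray codes need the smallest r with 2^r >= N states, a Johnson sequence with r flip-flops provides 2r codes, and one-hot uses one flip-flop per state. The function name and dictionary keys are our own.

```python
def flip_flops_needed(num_states):
    """Return the number of D_FF required by each state-encoding style."""
    binary_bits = (num_states - 1).bit_length()  # smallest r with 2**r >= num_states
    return {
        "radix-2": binary_bits,
        "gray": binary_bits,
        "johnson": (num_states + 1) // 2,        # smallest r with 2*r >= num_states
        "one-hot": num_states,
    }


print(flip_flops_needed(7))
# {'radix-2': 3, 'gray': 3, 'johnson': 4, 'one-hot': 7}
```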
Draw the FSM CC2 truth table and its equivalent behavioural interpretation (plan B) using a flowchart ready for VHDL translation.
![]() |
Fig. 8. CC2 truth table and flowchart. |
Draw the FSM CC1 truth table and flowchart ready for VHDL translation.
![]() |
Fig. 9. CC1 truth table and flowchart. |
How many VHDL files will be required in this project?
Project location:
C:\CSD\P8\Mult_4bit\(files)
Specifications | Planning | 3. Developing | Test functional | Test gate-level | Prototype | Report |
Write the VHDL files to translate your plans. Find in our products portfolio the required components. Components not available will be designed from similar ones and reported as an annex.
File list and example projects: "Mult_4bit.zip".
Start a Quartus Prime synthesis project for one of the following programmable target chips:
Option #1: Cyclone IV EP4CE115F29C7
Option #2: MAX II EPM2210F324C3
Option #3: MAX 10 10M50DAF484C7
Check the resources used, especially the number of D_FF.
![]() |
Fig. 10. Resources used. |
Inspect and discuss your RTL view schematic.
![]() |
Fig. 11. RTL. |
Technology view and resources located in the Chip Planner.
![]() |
Fig. 12. Technology |
You can also print the FSM state diagram interpreted by Quartus Prime, and check the correctness of state encoding and state transitions.
![]() |
Fig. 13. State diagram generated by Quartus Prime. |
Specifications | Planning | Developing | 4. Test functional | Test gate-level | Prototype | Report |
Testbench fixture.
![]() |
Fig. 14. Testbench fixture. |
In your functional testbench you can include internal control signals and also state signals to better visualise how the circuit works and be able to debug any errors. Add as many operand test vectors as necessary.
![]() |
Fig. 15. Example of functional testing capture. The device is running the operation 11 x 10 = 110 in 13 CLK cycles. |
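Since the operands are only 4 bits wide, it is feasible to test every combination. The following Python sketch builds the complete list of test vectors with their expected results, which can then be copied into the testbench stimulus. The function name and the tuple layout (A, B, expected R) are our own assumptions.

```python
def build_test_vectors(width=4):
    """Return (A, B, expected R) for every pair of unsigned operands of `width` bits."""
    limit = 1 << width
    return [(a, b, a * b) for a in range(limit) for b in range(limit)]


vectors = build_test_vectors()
print(len(vectors))           # 256
print(vectors[11 * 16 + 10])  # (11, 10, 110), the operation of Fig. 15
```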
Specifications | Planning | Developing | Test functional | 5. Test gate-level | Prototype | Report |
We can try to measure propagation delays at a given CLK transition. For this ModelSim measurement we can replace the MAX10 FPGA with a Cyclone IV or a MAX II device.
![]() |
Fig. 16. Visualising propagation delays at a given CLK rising edge. |
Using the Quartus Prime Timing Analyser tool, we will be able to measure the theoretical maximum speed of operation in millions of operations per second (Mops), which in synchronous circuits is directly related to the maximum CLK frequency at which the machine can operate.
![]() |
Fig. 17. Timing analyser results to deduce the maximum theoretical operational speed = ___ Mops. |
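The relation between the maximum CLK frequency and the operational speed is a simple division, sketched below in Python. The 100 MHz value is a made-up placeholder, not a measured result; replace it with the figure reported by the timing analyser. The 13 CLK cycles per operation are taken from the example of Fig. 15 and assumed to be representative.

```python
def max_mops(f_max_hz, clk_cycles_per_operation):
    """Return the theoretical speed in millions of operations per second."""
    return f_max_hz / clk_cycles_per_operation / 1e6


f_max_hz = 100e6            # hypothetical placeholder, not a measured value
cycles_per_operation = 13   # from the example capture in Fig. 15
print(round(max_mops(f_max_hz, cycles_per_operation), 2))  # 7.69
```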
Specifications | Planning | Developing | Test functional | Test gate-level | 6. Prototype | Report |
Demonstrate how the project works for real as a laboratory prototype for the MAX10 FPGA populating the DE10-Lite board.
This is a legacy prototype adaptation that was solved for the Lattice Semiconductor LC4128V CPLD target chip.
Specifications | Planning | Developing | Test functional | Test gate-level | Prototype | 7. Report |
Follow this rubric for writing reports.
Other questions: How can we multiply 8 x 8 bit or 16 x 16 bit radix-2 numbers using a simpler Mult_4bit hardware multiplier core?
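One possible starting point for this question is to split each operand into 4-bit halves and combine four partial products, each one obtained from the Mult_4bit core. The Python sketch below illustrates the arithmetic only. The function `mult_4bit` is a stand-in for the hardware core, and all names are hypothetical.

```python
def mult_4bit(a, b):
    """Stand-in for the 4-bit hardware core: 4-bit operands, 8-bit result."""
    assert 0 <= a < 16 and 0 <= b < 16
    return a * b


def mult_8bit(a, b):
    """Multiply two 8-bit unsigned numbers using only 4 x 4 bit products."""
    assert 0 <= a < 256 and 0 <= b < 256
    a_hi, a_lo = a >> 4, a & 0xF
    b_hi, b_lo = b >> 4, b & 0xF
    # A x B = (A_hi x B_hi) x 256 + (A_hi x B_lo + A_lo x B_hi) x 16 + A_lo x B_lo
    return ((mult_4bit(a_hi, b_hi) << 8)
            + ((mult_4bit(a_hi, b_lo) + mult_4bit(a_lo, b_hi)) << 4)
            + mult_4bit(a_lo, b_lo))


print(mult_8bit(200, 150))  # 30000
```

The same decomposition can be applied again to reach 16 x 16 bit operands, at the cost of more core operations and wider additions.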