UPC EETAC Bachelor's Degree in Telecommunications Systems and in Network Engineering EEL

 

4-bit serial multiplier (add & shift algorithm)

P8


Dedicated processor (datapath + control unit)

1. Specifications Planning Developing Test functional Test gate-level Prototype Report

Let us propose another version of the Mult_4bit. This time it will be a sequential system organised as a dedicated processor. The symbol is proposed in Fig. 1. The device will sample the multiplicand A and the multiplier B operands after detecting the start (ST) click. The operation will run for several CLK cycles, and when the result R is ready an end-of-operation flag (EO) pulse will be asserted, so that the machine goes back to idle, ready to perform another operation.

The idea behind such serial multipliers is to perform large calculations using limited resources. In this case, we will multiply using just one Adder_8bit and running the classic add-and-shift algorithm. Another full tutorial shows how to create an Adder_4bit using only one Adder_1bit, and yet another one shows a typical add-and-accumulate algorithm. Similar approaches can be taken, for example, when performing 64-bit arithmetic operations on 8-bit microprocessor hardware.

Fig. 1. Symbol for the Mult_4bit.

Fig. 2 shows an example multiplication: 13 x 12 = 156. The operation takes several CLK periods. Input data can change at any time; only the values sampled after the ST pulse will be processed. In the same way, if the user clicks the ST button again while the machine is still operating, the click will be ignored. To make it simple to display R on LEDs or 7-segment displays, the result will be kept available until the next operation.

Fig. 2. Example timing diagram.

The duration of the start pulse must be taken into account: if ST is held high for too long, the processor must not start a new operation. Instead, it will wait for the signal to go low before presenting results and completing the operation cycle.

Alternatively, as a better solution for handling user clicks, we can use a synchroniser to generate ST pulses lasting one CLK period (TCLK) to activate the processor. This extra circuit (hardware) will shorten the state diagram (software) and make it more robust and reliable. This is also the way to introduce the CLK_Generator component (L8.2) into the design of the dedicated processor, to obtain all the CLK signals required to run the machine, for instance 200 Hz for the button interface and 10 MHz for the processor.
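The effect of such a synchroniser can be pictured with a small behavioural model. This is only a sketch in Python, not the VHDL component used in the course: the function name one_clk_pulses and the list-of-samples interface are assumptions made for illustration. It takes the button level sampled at each CLK edge and returns a pulse that is high for exactly one CLK period per click, however long the button is held.

```python
def one_clk_pulses(samples):
    """Return one-CLK-wide pulses, one per rising edge of the sampled input.

    samples: list of 0/1 button levels, one per CLK rising edge (hypothetical
    interface chosen for this sketch).
    """
    previous = 0          # value stored in the edge-detector flip-flop
    pulses = []
    for level in samples:
        # The pulse is high only on the cycle where the input goes 0 -> 1.
        pulses.append(1 if (level == 1 and previous == 0) else 0)
        previous = level
    return pulses


if __name__ == "__main__":
    # A long click followed by a short one.
    print(one_clk_pulses([0, 1, 1, 1, 0, 0, 1, 0]))
```

With a pulse of this kind, the control unit no longer needs extra states to wait for the button to be released.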

Let us synthesise the project for the MAX10 FPGA building a prototype on the DE10-Lite board.

Additional circuits may be included in a top design to display the result in BCD on 7-segment digits.
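As an illustration of what such a conversion circuit has to compute, the sketch below models the well-known shift-and-add-3 ("double dabble") method in Python. The choice of this method, the function name double_dabble and its parameters are assumptions; the project does not prescribe how the binary-to-BCD conversion is done. Since the largest product is 15 x 15 = 225, three BCD digits are enough.

```python
def double_dabble(value, n_bits=8, n_digits=3):
    """Convert an unsigned binary value into BCD digits (hundreds, tens, units)."""
    if not 0 <= value < (1 << n_bits):
        raise ValueError("value does not fit in n_bits")
    digits = [0] * n_digits                 # digits[0] is the most significant
    for i in range(n_bits - 1, -1, -1):     # feed the bits in, MSB first
        # Any BCD digit of 5 or more is corrected by adding 3 before shifting.
        digits = [d + 3 if d >= 5 else d for d in digits]
        carry = (value >> i) & 1            # next input bit enters the units digit
        for k in range(n_digits - 1, -1, -1):
            shifted = (digits[k] << 1) | carry
            carry = shifted >> 4            # overflow of this digit feeds the next one
            digits[k] = shifted & 0xF
    return tuple(digits)


if __name__ == "__main__":
    print(double_dabble(156))               # digits of the product 13 x 12
```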

In the same way, additional adaptation circuits may be added to the basic design to operate with integer numbers.
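One way to picture such an adaptation, assuming that "integer numbers" here means signed two's complement operands, is to multiply the magnitudes with the unsigned core and fix the sign afterwards. The Python sketch below is only a model of that idea: the names unsigned_core and signed_multiply are hypothetical, and the actual adaptation circuits may be organised differently.

```python
def unsigned_core(a, b):
    """Stand-in for the unsigned Mult_4bit core: operands in 0..15."""
    if not (0 <= a <= 15 and 0 <= b <= 15):
        raise ValueError("core operands must be 4-bit unsigned")
    return a * b


def signed_multiply(a, b):
    """Multiply two 4-bit two's complement integers (-8..7) with the unsigned core."""
    if not (-8 <= a <= 7 and -8 <= b <= 7):
        raise ValueError("operands must be in the range -8..7")
    negative = (a < 0) != (b < 0)               # XOR of the operand signs
    magnitude = unsigned_core(abs(a), abs(b))   # |-8| = 8 still fits in 4 bits
    return -magnitude if negative else magnitude


if __name__ == "__main__":
    print(signed_multiply(-5, 6))
```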

The theory behind the add-and-shift algorithm, together with the many other options available for implementing multiplication in computers or hardware processors, can be found on Wikipedia.

 


2. Planning

Devise a top schematic by organising a dedicated processor architecture, as in Fig. 3, that suits almost any application.

Fig. 3. Dedicated processor concept.

The design starts by imagining the sequence of data, arithmetic and logic operations involved in the add-and-shift algorithm for multiplication. Fig. 4 shows an example. The Adder_8bit is used only when the multiplier test bit B0 is '1'.

Fig. 4a. Example operation using the Adder_8bit. Data and shift registers are required to save partial sums and carries. 

By solving several example operations, we can infer what the algorithm looks like as a flowchart or pseudo-code, as shown in Fig. 4b.

Fig. 4b. A kind of flowchart (or pseudo-code) that follows the add-and-shift multiplication algorithm. It can be implemented using either hardware circuits or software programming. In this Chapter 2 we will try to solve it by means of a synchronous dedicated processor.
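As a software counterpart of the flowchart, the following Python sketch runs the add-and-shift loop on unsigned operands. It is a minimal model under assumptions: the function name shift_add_multiply and the choice of shifting the multiplicand left (rather than shifting the partial product right) are ours, and the register arrangement drawn in Fig. 4 may differ.

```python
def shift_add_multiply(a, b, n_bits=4):
    """Multiply two unsigned n_bits-wide integers using add and shift."""
    limit = 1 << n_bits
    if not (0 <= a < limit and 0 <= b < limit):
        raise ValueError("operands must fit in n_bits")
    multiplicand = a      # shifted left each iteration, so it needs 2*n_bits
    multiplier = b        # shifted right each iteration
    result = 0            # accumulates the partial sums
    for _ in range(n_bits):
        if multiplier & 1:            # test bit B0 of the multiplier
            result += multiplicand    # the adder is used only in this case
        multiplicand <<= 1
        multiplier >>= 1
    return result


if __name__ == "__main__":
    print(shift_add_multiply(13, 12))   # the example of Fig. 2
```

The loop always runs n_bits times, which is why a small counter is enough to control it in hardware.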

The datapath will be in charge of executing the algorithm. We can imagine how many components it will include and how to connect them so that operands can be shifted and conveyed to the single Adder_8bit available while saving partial and final data. We propose to use standard components, such as the Shift_reg_4bit and the Data_reg_4bit, for all memory registers. A versatile Counter_mod16 chip may be used to control the add-and-shift loop. The synthesiser will optimise the number of resources used.

Fig. 5. Proposed datapath.

How many D_FF are used in this application's datapath?


The control unit FSM will generate all the datapath control signals and activate them in the corresponding states. Fig. 6 shows an example state diagram. The processor will be idle until ST is sampled high.

Fig. 6. Control unit state diagram.
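To get a feeling for how a control unit and a datapath cooperate, the Python sketch below steps a small state machine one CLK cycle at a time. Everything in it is hypothetical: the state names (IDLE, LOAD, TEST, ADD, SHIFT, DONE), the register names and the resulting cycle count are assumptions for illustration and are not taken from Fig. 6, which may use different states and transitions.

```python
def simulate(a, b, n_bits=4):
    """Run one multiplication; return (product, number of CLK cycles used)."""
    state = "IDLE"
    reg_a = reg_b = result = count = 0
    cycles = 0
    while True:
        cycles += 1                       # one loop iteration = one CLK cycle
        if state == "IDLE":
            state = "LOAD"                # ST is assumed high on the first cycle
        elif state == "LOAD":
            reg_a, reg_b = a, b           # sample the operands
            result, count = 0, 0          # clear accumulator and loop counter
            state = "TEST"
        elif state == "TEST":
            state = "ADD" if reg_b & 1 else "SHIFT"   # check bit B0
        elif state == "ADD":
            result += reg_a               # the single adder is used here
            state = "SHIFT"
        elif state == "SHIFT":
            reg_a <<= 1
            reg_b >>= 1
            count += 1
            state = "DONE" if count == n_bits else "TEST"
        elif state == "DONE":
            return result, cycles         # EO would be asserted in this state


if __name__ == "__main__":
    product, cycles = simulate(13, 12)
    print(product, cycles)
```

In this sketch the ADD state is skipped when B0 is '0', so the number of cycles depends on the multiplier value; a real design may instead use a fixed number of cycles per bit.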

In this way, the top dedicated architecture adapted to solve this project will be as represented in Fig. 7, adding the CLK_Generator circuit to obtain SYS_CLK with the frequency given as a design option.

Fig. 7. Top dedicated processor system for engineering the serial Mult_4bit.

The control unit will be an FSM as proposed in L6.1. Adapt its general architecture to this problem, explaining where each signal is connected.

Fig. 8. FSM for this project. The three processes are translated in a single plan C1 circuit into VHDL using behavioural descriptions.

Deduce how many D_FF are required when encoding the control unit FSM states using the following options and draw the state register memory:

Option #1: radix-2 (sequential). 7 states means that r = 3, and one code is never used.

Option #2: Gray. 7 states means that r = 3, and one code is never used.

Option #3: Johnson. 7 states means that r = 4, and one code is never used.

Option #4: One-hot. 7 states means that r = 7.
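The number r of D_FF for each encoding can be checked with a few lines of Python. This is only a helper sketch (the function name flip_flops_needed is ours). It relies on these standard properties: r bits of radix-2 or Gray code provide 2^r codes, a Johnson counter of r flip-flops provides 2r codes, and one-hot uses one flip-flop per state.

```python
def flip_flops_needed(n_states):
    """Return the number of D_FF required to encode n_states in each style."""
    if n_states < 2:
        raise ValueError("an FSM needs at least two states")
    binary_bits = (n_states - 1).bit_length()   # smallest r with 2**r >= n_states
    return {
        "radix-2": binary_bits,
        "gray": binary_bits,
        "johnson": (n_states + 1) // 2,         # smallest r with 2*r >= n_states
        "one-hot": n_states,
    }


if __name__ == "__main__":
    for style, r in flip_flops_needed(7).items():
        print(style, r)
```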

Draw the FSM CC2 truth table and its equivalent behavioural interpretation (plan B) using a flowchart ready for VHDL translation.

Fig. 8. CC2 truth table and flowchart.

Draw the FSM CC1 truth table and flowchart ready for VHDL translation.

Fig. 9. CC1 truth table and flowchart.

How many VHDL files will be required in this project?

Project location:

 C:\CSD\P8\Mult_4bit\(files)

 


3. Developing

Write the VHDL files to translate your plans. Find the required components in our products portfolio. Components that are not available will be designed from similar ones and reported as an annex.

File list and example projects: "Mult_4bit.zip".

Start a Quartus Prime synthesis project for one of the following programmable target chips:

Option #1: Cyclone IV EP4CE115F29C7

Option #2: MAX II EPM2210F324C3

Option #3: MAX 10 10M50DAF484C7

Check the resources used, especially the number of D_FF.

Fig. 10. Resources used.

Inspect and discuss your RTL view schematic.

Fig. 11. RTL.

Inspect the technology view and the resources located in the Chip Planner.

Fig. 12. Technology view.

You can also print the FSM state diagram interpreted by Quartus Prime, and check the correctness of state encoding and state transitions.

Fig. 13. State diagram generated by Quartus Prime.

 


4. Test functional

Testbench fixture.

Fig. 14. Testbench fixture.

In your functional testbench you can include internal control signals and also state signals to better visualise how the circuit works and be able to debug any errors. Add as many operand test vectors as necessary.

Fig. 15. Example of functional testing capture. The device is running the operation 11 x 10 = 110 in 13 CLK cycles.

 


5. Test gate-level

We can try to measure propagation delays at a given CLK transition. For this ModelSim measurement we can replace the MAX10 FPGA with a Cyclone IV or a MAX II device.

Fig. 16. Visualising propagation delays at a given CLK rising edge.

Using the Quartus Prime Timing Analyser tool, we can measure the theoretical maximum speed of operation in millions of operations per second (Mops), which in synchronous circuits is directly determined by the maximum CLK frequency at which the machine can operate.

Fig. 17. Timing analyser results to deduce the maximum theoretical operational speed = ___ Mops.
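As a back-of-the-envelope sketch of the calculation, assuming that every multiplication takes the same number of CLK cycles, the speed is the maximum CLK frequency divided by the cycles per operation. In the Python lines below, the function name and the 100 MHz figure are placeholders, not measured results; the 13 cycles are the ones reported for the run in Fig. 15. Use the f_max value given by your own timing analysis.

```python
def max_ops_per_second(f_max_hz, clk_cycles_per_operation):
    """Theoretical throughput of a processor that needs a fixed number of cycles."""
    if f_max_hz <= 0 or clk_cycles_per_operation <= 0:
        raise ValueError("both arguments must be positive")
    return f_max_hz / clk_cycles_per_operation


if __name__ == "__main__":
    f_max_hz = 100e6        # placeholder value, NOT a measured f_max
    cycles = 13             # CLK cycles per operation, as in the Fig. 15 run
    mops = max_ops_per_second(f_max_hz, cycles) / 1e6
    print(f"{mops:.2f} Mops")
```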

 


6. Prototype

Demonstrate how the project works for real as a laboratory prototype for the MAX10 FPGA populating the DE10-Lite board. 

 

This is a legacy prototype adaptation that was solved for the Lattice Semiconductor LC4128V CPLD target chip.

 


7. Report

Follow this rubric for writing reports.

Other questions: How can we multiply 8 x 8-bit or 16 x 16-bit radix-2 numbers using a simpler Mult_4bit core hardware multiplier?
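One possible line of reasoning, given here only as a hint, is to split each 8-bit operand into two 4-bit halves and combine four 4 x 4 partial products with shifts and additions. The Python sketch below models that idea. The function names are hypothetical, mult_4bit is just a stand-in for the hardware core, and other decompositions are equally valid.

```python
def mult_4bit(a, b):
    """Stand-in for the Mult_4bit core: two 4-bit operands, 8-bit result."""
    if not (0 <= a <= 15 and 0 <= b <= 15):
        raise ValueError("core operands must be 4-bit unsigned")
    return a * b


def mult_8bit(a, b):
    """Multiply two 8-bit unsigned numbers using only the 4-bit core."""
    if not (0 <= a <= 255 and 0 <= b <= 255):
        raise ValueError("operands must be 8-bit unsigned")
    a_hi, a_lo = a >> 4, a & 0xF          # A = 16*a_hi + a_lo
    b_hi, b_lo = b >> 4, b & 0xF          # B = 16*b_hi + b_lo
    # A*B = 256*a_hi*b_hi + 16*(a_hi*b_lo + a_lo*b_hi) + a_lo*b_lo
    return ((mult_4bit(a_hi, b_hi) << 8)
            + ((mult_4bit(a_hi, b_lo) + mult_4bit(a_lo, b_hi)) << 4)
            + mult_4bit(a_lo, b_lo))


if __name__ == "__main__":
    print(mult_8bit(200, 123))
```

The same splitting can be applied again to build a 16 x 16-bit multiplier from 8 x 8-bit ones, at the cost of more CLK cycles or more hardware.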