(a) You should already have the GUI open from the previous exercise. If you don't, open the project matrix_mult_prj in the directory C:\Zynq_Book\HLS\tut3A and save it to the \tut3B directory using File > Save As, selecting \tut3B as the location.
(b) Expand the tabs for Source and Test Bench in the Explorer tab of the Synthesis view. As before, this shows that the source and test files have been successfully added to the project. Double-clicking on any of these files will open it in the editor, allowing the code to be inspected and altered as required.
matrix_mult.cpp contains code that performs the multiplication of two matrices through use of iterative loops that run through the rows and columns of the matrices to calculate the product.
matrix_mult.h contains definitions and the function prototype for the matrix multiplication.
matrix_mult_test.cpp is the test bench file, which calculates the product of two given matrices using both the HLS hardware solution and software, comparing the two to ensure successful operation.
(c) Click the Run C Simulation button in the toolbar to run a C simulation of the solution. Leave the options as default (no boxes checked, no input arguments) and click OK. Upon completion of the simulation, the “Test passed!” message will be displayed in the console at the bottom of the screen, as in Figure 3.6.
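The roles of the three files can be sketched in a single listing. This is an illustrative reconstruction, not the tutorial's actual source: the constant `DIM`, the element type `mat_t`, the test data and the helper `run_test_bench` are all assumptions. Only the loop labels Row, Col and Product and the accumulate statement come from the text of this exercise.

```cpp
// Hypothetical stand-ins for the definitions kept in matrix_mult.h.
const int DIM = 5;   // 5x5 matrices
typedef int mat_t;   // assumed element type

// Hardware function: the loop labels are the names referred to in the steps.
void matrix_mult(mat_t a[DIM][DIM], mat_t b[DIM][DIM], mat_t prod[DIM][DIM]) {
  Row: for (int i = 0; i < DIM; ++i) {
    Col: for (int j = 0; j < DIM; ++j) {
      prod[i][j] = 0;  // clear the result element before accumulating
      Product: for (int k = 0; k < DIM; ++k) {
        prod[i][j] += a[i][k] * b[k][j];
      }
    }
  }
}

// Test-bench style check: returns 0 when the function under test and a
// plain software reference agree on every element, 1 otherwise.
int run_test_bench() {
  mat_t a[DIM][DIM], b[DIM][DIM], hw[DIM][DIM], sw[DIM][DIM];

  // Arbitrary test data (an assumption, not the tutorial's inputs).
  for (int i = 0; i < DIM; ++i) {
    for (int j = 0; j < DIM; ++j) {
      a[i][j] = i + j;
      b[i][j] = i * j + 1;
    }
  }

  matrix_mult(a, b, hw);  // function under test

  // Software reference computed separately in the test bench.
  for (int i = 0; i < DIM; ++i) {
    for (int j = 0; j < DIM; ++j) {
      sw[i][j] = 0;
      for (int k = 0; k < DIM; ++k) {
        sw[i][j] += a[i][k] * b[k][j];
      }
    }
  }

  // Compare the two results element by element.
  int mismatches = 0;
  for (int i = 0; i < DIM; ++i) {
    for (int j = 0; j < DIM; ++j) {
      if (hw[i][j] != sw[i][j]) ++mismatches;
    }
  }
  return mismatches == 0 ? 0 : 1;
}
```

In a real test bench the comparison would sit in `main`, whose return value tells C simulation and cosimulation whether the test passed (zero) or failed (non-zero).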
(d) The next step is to synthesise the C++ code using HLS. Click the C Synthesis button in the toolbar. Vivado HLS will begin the process of converting the C++ code into an RTL model with associated VHDL/Verilog/SystemC code. The console details the steps performed in achieving this.
Upon completion, a Synthesis Report will open automatically. This details various aspects of the synthesised design, such as information concerning timing and latency and FPGA resource utilisation estimates.
The synthesised design has an interval of 687 clock cycles. Each input array contains 25 elements (as the design uses 5x5 matrices), so this suggests roughly 27 clock cycles per input read (687 / 25 ≈ 27.5).
(e) We can now run a C/RTL cosimulation to ensure that the synthesised RTL behaves exactly the same as the C++ code under test.
Click the Run C/RTL Cosimulation button. For the RTL selection, ensure VHDL is selected and click OK. Cosimulation will now begin, with the RTL system being generated using VHDL. This process may take a short while to complete, but progress can be viewed in the console.
Upon completion, the Cosimulation Report will be opened as in Figure 3.8.
Note the “Pass” message of Figure 3.8 indicating that the RTL behaves the same as the C++ source code.
(f) Create a new solution for the design by either clicking the New Solution button in the toolbar or the menu option Project > New Solution. Click Finish to accept the defaults for solution2.
(g) Double click on matrix_mult.cpp in the Source section of the Explorer tab to ensure the code is visible in the workspace. We will now insert a directive which will pipeline the nested loops of the matrix multiplication code. This will perform loop flattening, removing the need for loop transitions.
Open the Directive tab to the right of the workspace. Click on Product and you will observe the associated portion of code highlighted in the editor, in this instance the multiplication of array elements to produce the product elements of the resulting matrix. Right click on Product and select Insert Directive. This will open the Directives Editor. Use the type drop down menu to select the option PIPELINE. Click OK to accept the default options. The directives tab should now resemble Figure 3.9.
(h) Click the C Synthesis button to synthesise the RTL design. The console yields some information about the process of flattening the Row loop. It also explains that the default initiation interval (II) target of 1 could not be met for the Product loop. This is due to a loop-carried dependency.
From the synthesis report shown in Figure 3.10, it is observed that the top-level loop, Row_Col, has not been pipelined, as loop Col was not flattened. It is also observed that an II of 2 was achieved despite the target of 1.
(i) Open the Analysis perspective by clicking on the Analysis perspective button. This will also open the Performance view, showing how the various operations within the code are scheduled across clock cycles.
(j) Expand the loops Row_Col and Product by clicking on them to obtain the view shown in Figure 3.11.
Note that the highlighted write operation occurs in state C3, node_33(write). Right-clicking on this cell and selecting Goto Source will highlight the associated line of code in the source file. This is a write operation, implemented as a write to a port in the RTL, which occurs before any operations in the loop Product can be executed. This prevents the flattening of loop Product into Row_Col. Furthermore, the inability to meet the target of II = 1 can be explained by considering consecutive iterations of the loop. Consulting the console reveals the following message:
@W [SCHED‐68] Unable to enforce a carried dependency constraint (II = 1, distance = 1) between ‘store’ operation (matrix_mult.cpp:16) of variable ‘tmp_8’ on array ‘prod’ and ‘load’ operation (‘prod_load’, matrix_mult.cpp:16) on array ‘prod’.
There exists a dependency between iterations of the operation at line 18 of the source code, which is the operation within the Product loop.
prod[i][j] += a[i][k] * b[k][j];
Due to the presence of the += operator, this line of code contains a read from array prod (the aforementioned load operation) and a write to array prod (a store operation). With an II of 1, each succeeding Product loop iteration would start one clock cycle after the previous one. This is visualised in Figure 3.12, where the highlighted overlap is observed when II is set to 1. Arrays are mapped to BRAM by default, and this overlap would require the write of one iteration and the read of the next to be performed on prod in the same clock cycle, which is not possible. Setting the II to 2 allows the write operation to be completed before the read operation of the next loop iteration begins.
(k) Return to the Synthesis perspective by clicking on the Synthesis perspective button. We will now create a new solution which pipelines the Col loop, unrolling the Product loop to eliminate the inter-iteration dependency, but at the cost of additional operators and hence increased hardware cost.
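The timing argument above can be checked with a small model. This is only a sketch under stated assumptions: the cycle offsets passed in are hypothetical rather than values read from the schedule viewer, the function names are invented for illustration, and the model assumes that the load of a dependent iteration must fall on a later clock cycle than the store it depends on.

```cpp
#include <cassert>

// Iteration n starts at cycle n * ii. Within an iteration the load happens
// at offset load_cycle and the store at offset store_cycle (hypothetical
// values). Returns true when the load of iteration n + distance falls on a
// later cycle than the store of iteration n.
bool dependency_respected(int ii, int load_cycle, int store_cycle, int distance) {
  return distance * ii + load_cycle > store_cycle;
}

// Smallest II that satisfies the condition above.
int min_ii_for_dependency(int load_cycle, int store_cycle, int distance) {
  assert(distance >= 1 && store_cycle >= load_cycle);
  int gap = store_cycle - load_cycle + 1;   // cycles the dependent load must wait
  return (gap + distance - 1) / distance;   // ceiling division
}
```

With a load at offset 0, a store at offset 1 and a dependency distance of 1 (the distance reported in the SCHED-68 message), `dependency_respected(1, 0, 1, 1)` is false while `min_ii_for_dependency(0, 1, 1)` gives 2, matching the II that the tool achieved.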
(l) Create a new solution for the design by either clicking the New Solution button in the toolbar or the menu option Project > New Solution. From the drop-down menus, ensure solution1 is selected, as in Figure 3.13, as this contains no existing directives or constraints.
Click Finish to create the solution.
(m) Ensure the source code matrix_mult.cpp is visible in the editor. In the Directives tab, right-click on loop Col and select Insert Directive. From the drop-down menu, select directive type PIPELINE and click OK to select the directive with the defaults (II = 1).
(n) Click the C Synthesis button to synthesise the RTL design. The Console will show that, while Product was unrolled and loop Row was flattened, the II target of 1 could not be met for loop Row_Col, this time due to resource limitations.
@W [SCHED‐69] Unable to schedule ‘load’ operation (‘b_load_4’, matrix_mult.cpp:16) on array ‘b’ due to limited memory ports.
(o) Open the Analysis perspective by clicking on the Analysis perspective button. This will open the Performance view. Switch to the Resource view by clicking the tab at the bottom of the screen.
(p) Expand the Memory Ports to view resource sharing on the memory within the system.
Figure 3.14 shows the operations per resource on each clock cycle. In actual fact, the 2 cycle read operation on b beginning in C3 overlaps with those in C4 so only a single cycle is visible. There are instances of both a and b being subjected to 3 read operations at once, which you will remember is not possible for dual-port BRAM. It is therefore necessary to partition these arrays into smaller sections, allowing modification of the array without altering the source code.
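The memory port limit can be expressed as a simple bound. The sketch below is an assumption-laden illustration, not something taken from the tool's report: `port_limited_ii` is a hypothetical helper, and it considers memory ports only, so the II actually achieved may be higher for other reasons.

```cpp
#include <cassert>

// Lower bound on the initiation interval imposed by memory ports alone:
// each pipelined iteration needs reads_per_iteration accesses to one array,
// and the memory can serve at most `ports` accesses per clock cycle.
int port_limited_ii(int reads_per_iteration, int ports) {
  assert(reads_per_iteration >= 0 && ports >= 1);
  int ii = (reads_per_iteration + ports - 1) / ports;  // ceiling division
  return ii < 1 ? 1 : ii;                              // II is never below 1
}
```

For the 5x5 case, unrolling Product means each Col iteration reads five elements of a and five of b. With a dual-port BRAM, `port_limited_ii(5, 2)` gives 3, so II = 1 is out of reach. If all five elements could be fetched in a single access, `port_limited_ii(1, 2)` gives 1.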
(q) Return to the Synthesis perspective by clicking on the Synthesis perspective button. Create a new solution for the design by either clicking the New Solution button in the toolbar or the menu option Project > New Solution. Click Finish to accept the defaults for solution4.
For this solution, we will reshape the input arrays using directives. The Product loop iterates over loop index k, therefore arrays a and b should be partitioned along their k dimension. Inspecting line 16 of matrix_mult.cpp, it is observed that for a[i][k] this is dimension 2 and for b[k][j] it is dimension 1.
(r) Ensure the source code matrix_mult.cpp is visible in the editor, and open the Directives tab. Right-click on variable a and select Insert Directive. Ensure the directive is configured as in Figure 3.15, with ARRAY_RESHAPE selected as directive type and dimension specified as 2.
(s) Repeat for array b, this time ensuring dimension is set to 1.
(t) Click the C Synthesis button to synthesise the RTL design. The synthesis report will open, showing that the target II of 1 has now been met.
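The directives of this solution can also be written as pragmas in the source. The listing below is a sketch of that form, not the tutorial's file: `DIM`, `mat_t` and the check function are assumptions, and the `complete` option on ARRAY_RESHAPE is assumed because Figure 3.15 is not reproduced here. A standard C++ compiler ignores the HLS pragmas, so the function can still be compiled and checked in software.

```cpp
// Hypothetical stand-ins for the definitions kept in matrix_mult.h.
const int DIM = 5;   // 5x5 matrices
typedef int mat_t;   // assumed element type

void matrix_mult(mat_t a[DIM][DIM], mat_t b[DIM][DIM], mat_t prod[DIM][DIM]) {
// Reshape along the k dimension: dimension 2 of a[i][k], dimension 1 of b[k][j].
// The "complete" option is an assumption.
#pragma HLS ARRAY_RESHAPE variable=a complete dim=2
#pragma HLS ARRAY_RESHAPE variable=b complete dim=1
  Row: for (int i = 0; i < DIM; ++i) {
    Col: for (int j = 0; j < DIM; ++j) {
// Pipelining Col causes the Product loop nested inside it to be unrolled.
#pragma HLS PIPELINE II=1
      prod[i][j] = 0;
      Product: for (int k = 0; k < DIM; ++k) {
        prod[i][j] += a[i][k] * b[k][j];
      }
    }
  }
}

// Software check (hypothetical helper): multiplying by 2 * identity must
// return every element of a doubled.
bool check_with_scaled_identity() {
  mat_t a[DIM][DIM], b[DIM][DIM], prod[DIM][DIM];
  for (int i = 0; i < DIM; ++i) {
    for (int j = 0; j < DIM; ++j) {
      a[i][j] = i * DIM + j;
      b[i][j] = (i == j) ? 2 : 0;
    }
  }
  matrix_mult(a, b, prod);
  for (int i = 0; i < DIM; ++i) {
    for (int j = 0; j < DIM; ++j) {
      if (prod[i][j] != 2 * a[i][j]) return false;
    }
  }
  return true;
}
```

Directives and pragmas do not change the function's result, only how it is scheduled, which is why the same test bench can be reused across all solutions.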
The top-level of the design takes 35 clock cycles for completion, with the Row_Col loop outputting a sample after an iteration latency of 10. A sample is then read in every cycle (due to an II of 1), and after 25 counts all samples have been read in. The 35 clock cycle life of this design is therefore justified by the 25 counts plus the latency of 10, as 25 + 10 = 35.
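The arithmetic above can be written out as a short calculation. The sketch follows the text's own accounting (one initiation per flattened iteration plus the iteration latency). The helper names are hypothetical, and applying the same expression to other values of II is an assumption, since the tool's report may count entry and exit cycles differently.

```cpp
#include <cassert>

// Number of iterations of the flattened Row_Col loop.
int flattened_trip_count(int rows, int cols) {
  assert(rows >= 1 && cols >= 1);
  return rows * cols;
}

// Cycle count following the text's accounting: trip_count initiations,
// spaced ii cycles apart, plus the latency of one iteration.
int pipelined_cycles(int iteration_latency, int ii, int trip_count) {
  assert(iteration_latency >= 0 && ii >= 1 && trip_count >= 1);
  return iteration_latency + ii * trip_count;
}
```

For this design, `flattened_trip_count(5, 5)` is 25 and `pipelined_cycles(10, 1, 25)` is 35, the figure quoted from the synthesis report.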
The function then proceeds to calculate the next set of data.
(u) The final optimisation in this exercise is, for comparison, to pipeline the function itself rather than the loops within it. Create a new solution for the design by either clicking the New Solution button in the toolbar or the menu option Project > New Solution. Click Finish to accept the defaults for solution5.
(v) Ensure the source code matrix_mult.cpp is visible in the editor, and open the Directives tab. First, remove the previously inserted pipeline directive on loop Col. Right-click on the directive and select Remove Directive.
(w) Right-click on the top level function matrix_mult and select Insert Directive. Select PIPELINE as the directive type and click OK.
(x) Click the C Synthesis button to synthesise the RTL design.
(y) Vivado HLS provides a tool for comparing synthesis reports. Click the Compare Reports button or the menu option Project > Compare Reports.
Ensure solution4 and solution5 are added as in Figure 3.17. Click OK.
(z) Figure 3.18 shows the comparison of the synthesis reports for solution4 (with loop pipelining) and solution5 (with top-level function pipelining). It is observed that pipelining the top-level function results in a design which reaches completion in fewer clock cycles, requiring only 13 clock cycles to begin a new transaction, rather than the 36 required when pipelining the loop.
However, this comes at the cost of increased hardware utilisation due to unrolling of all loops within the design. A tradeoff is therefore necessary between system performance and the hardware utilisation of the design, and it is possible that a partially unrolled design may meet the performance requirements at a reduced hardware cost.
(aa) This completes the exercise. Close the Vivado HLS GUI.
We will now briefly explore the concept of interface synthesis in Vivado HLS, using the matrix multiplier function of the previous two exercises.