When your optimization goals are different from those provided by the default optimizations performed by HLS, you can specify optimization directives to control the RTL implementation. HLS provides a number of optimizations, and it is difficult to review all of the optimizations here; however, it is worth reviewing a few of the key optimizations to provide a sense of what is possible. The minmax_frame example shown in Fig. 10.8 can be used to highlight the key HLS optimizations.
Arrays are a collection of elements accessed through an index and are synthesized into a block RAM, which is a collection of elements accessed through an address. If an array is on the top-level interface, it is assumed to be outside the design and a block RAM interface is created. Conversely, if the array is inside the C function, it is implemented as a block RAM inside the design.
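The distinction between the two kinds of array can be sketched in plain C. The function name scale_frame, the array names, and the arithmetic are hypothetical and chosen only for illustration; how a given HLS tool finally implements a small array may also depend on its settings.

```c
#define N 8

/* Hypothetical example, not taken from the text.
 * `in` and `out` are top-level arguments, so per the description above they
 * would be treated as memories outside the design (block RAM interfaces).
 * `tmp` is declared inside the function, so it would be a memory inside it. */
void scale_frame(const int in[N], int out[N])
{
    int tmp[N]; /* local array: storage inside the design */

    for (int i = 0; i < N; i++) {
        tmp[i] = in[i] * 2;  /* read through the interface for `in` */
    }
    for (int i = 0; i < N; i++) {
        out[i] = tmp[i] + 1; /* write through the interface for `out` */
    }
}
```

The C source itself says nothing about memories; the mapping follows only from where each array is declared.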
Arrays may be partitioned and mapped. Partitioning an array splits it into multiple smaller block RAMs (or block RAM interfaces). Since a block RAM has at most two ports, arrays are typically partitioned to improve data access, allowing more data samples to be accessed in a single clock cycle. Mapping arrays together implements multiple arrays in the C code in the same block RAM, saving resources but often reducing the number of simultaneous data accesses and limiting data throughput.
Fig. 10.8 Key optimization objects
In both cases, the array optimizations allow the C code to remain unchanged. Optimization directives are used to instruct HLS to implement the desired RTL structure without any need to change the source code.
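As a rough sketch, a partitioning directive might look like the following. The function body is a guess at what Fig. 10.8 contains (only the names minmax_frame, DataIn, and N = 8 come from the text), and the pragma spelling assumes Vivado HLS-style syntax; other tools or versions may differ. The same directive can usually be kept in a separate directives file instead, which leaves the source untouched.

```c
#define N 8

/* Assumed reconstruction of the minmax_frame example; the real code in
 * Fig. 10.8 may differ. The parameter names min and max are assumptions. */
void minmax_frame(const int DataIn[N], int *min, int *max)
{
/* Assumed Vivado HLS-style directive: split DataIn into two banks,
 * element i going to bank i % 2, so that even and odd samples sit in
 * different memories. A standard C compiler ignores this pragma. */
#pragma HLS ARRAY_PARTITION variable=DataIn cyclic factor=2 dim=1
    int lo = DataIn[0];
    int hi = DataIn[0];

    for (int i = 1; i < N; i++) {
        if (DataIn[i] < lo) lo = DataIn[i];
        if (DataIn[i] > hi) hi = DataIn[i];
    }
    *min = lo;
    *max = hi;
}
```

The functional C statements are identical with and without the directive; only the memory structure requested from the tool changes.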
Loops may be left rolled or they may be unrolled. In a rolled loop, HLS synthesizes one copy of the loop body and then executes it multiple times. Using the minmax_frame example from Fig. 10.8, the logic to perform the reads and comparisons is created and then an FSM will ensure the logic is executed eight times (since N = 8 in this example). This ensures the minimum amount of logic is used, but it can take many clock cycles to complete all operations specified by the loop.
Loops may be partially or fully unrolled. Using the minmax_frame example from Fig. 10.8, if the loop is partially unrolled by a factor of, say, 2, this would create two copies of the logic in the loop body, and the design will execute this logic four times (8/2 = 4). This creates more logic than a rolled loop, but now allows more reads and writes to be performed in parallel, increasing throughput (or in other words, reducing the initiation interval, II).
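To make the factor-of-2 case concrete, the sketch below shows a loop carrying an unroll directive next to a hand-written version of what that unrolling amounts to. Both function names and the loop body are hypothetical stand-ins for the code in Fig. 10.8, and the pragma assumes Vivado HLS-style syntax.

```c
#define N 8

/* Rolled source with an assumed Vivado HLS-style directive requesting
 * partial unrolling by 2. A standard C compiler ignores the pragma. */
void minmax_unroll2(const int DataIn[N], int *min, int *max)
{
    int lo = DataIn[0], hi = DataIn[0];
    for (int i = 0; i < N; i++) {
#pragma HLS UNROLL factor=2
        if (DataIn[i] < lo) lo = DataIn[i];
        if (DataIn[i] > hi) hi = DataIn[i];
    }
    *min = lo;
    *max = hi;
}

/* Hand-written equivalent: two copies of the body, executed 8/2 = 4 times. */
void minmax_manual2(const int DataIn[N], int *min, int *max)
{
    int lo = DataIn[0], hi = DataIn[0];
    for (int i = 0; i < N; i += 2) {
        /* copy 0 of the loop body */
        if (DataIn[i] < lo) lo = DataIn[i];
        if (DataIn[i] > hi) hi = DataIn[i];
        /* copy 1 of the loop body */
        if (DataIn[i + 1] < lo) lo = DataIn[i + 1];
        if (DataIn[i + 1] > hi) hi = DataIn[i + 1];
    }
    *min = lo;
    *max = hi;
}
```

With the directive, the source stays in the first form and the tool performs the transformation; the second form is shown only to make the duplicated logic visible.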
At this point, you can perhaps start to see the interaction between the options for interface synthesis and design synthesis:
• Completely unrolling the loop in the minmax_frame example creates eight copies of the hardware and allows all reads and writes to occur as soon as possible: potentially, all in the same clock cycle if the frequency is slow enough (or the target technology is fast enough).
• However, if the DataIn interface is implemented as a block RAM interface, only a maximum of two reads can be performed in each clock cycle. Most of the hardware is wasted since it must sit and wait for the data to become available at the input port.
• To take advantage of all the hardware created by a fully unrolled loop, the solution here is to also partition the DataIn input port into eight separate ports (or four separate dual-port block RAM interfaces).
Similarly, only partitioning the input port does not guarantee greater throughput: the loop also has to be unrolled to create enough hardware to consume the data.
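The combination described above, a fully unrolled loop together with a fully partitioned input, might be expressed as in the following sketch. The function name minmax_parallel and the loop body are assumptions standing in for the code of Fig. 10.8, and both pragmas assume Vivado HLS-style syntax.

```c
#define N 8

/* Hypothetical variant of the minmax_frame example. */
void minmax_parallel(const int DataIn[N], int *min, int *max)
{
/* Assumed directive: one separate port per element of DataIn, so the
 * two-port limit of a single block RAM no longer restricts the reads. */
#pragma HLS ARRAY_PARTITION variable=DataIn complete dim=1
    int lo = DataIn[0], hi = DataIn[0];

    for (int i = 1; i < N; i++) {
/* Assumed directive: no factor given, requesting a full unroll so there
 * is one copy of the compare logic per iteration. */
#pragma HLS UNROLL
        if (DataIn[i] < lo) lo = DataIn[i];
        if (DataIn[i] > hi) hi = DataIn[i];
    }
    *min = lo;
    *max = hi;
}
```

Removing either directive illustrates the two failure modes in the text: without the partition the copies wait on the two-port memory, and without the unroll the extra ports have nothing to feed.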
Functions and loops represent scopes within a C design function and may have optimization directives applied to the objects within them. A scope in C is any region enclosed by the braces { and }. Optimization directives may be applied to functions and loops to control the resources used to implement the functionality. For example, if the C code contains 12 multiplications, HLS will by default create as many hardware multipliers as necessary to achieve the required performance. In most cases, this will be 12 multipliers.
Optimization directives may be used to limit the actual number of multipliers in the RTL design. For example, if an optimization directive is used to limit the number of multipliers to 1, this will force HLS to allocate only one multiplier in the RTL design and hence share the same hardware multiplier for all 12 multiplications in the C code. This will result in a smaller design, but sharing the resource (the multiplier will have a 12:1 mux in front of it) will mean the design requires more clock cycles to complete as only one multiplication may be performed in each clock cycle.
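A resource limit of this kind might be written as in the sketch below. The function dot12 is hypothetical, and the ALLOCATION pragma follows Vivado HLS-style option ordering, which is an assumption; other tool versions are known to order the options differently, so check the documentation for the tool in use.

```c
#define TAPS 12

/* Hypothetical example containing 12 multiplications. */
int dot12(const int a[TAPS], const int b[TAPS])
{
/* Assumed Vivado HLS-style directive: allow at most one multiplier
 * instance in this scope, so all 12 multiplications share it. */
#pragma HLS ALLOCATION instances=mul limit=1 operation
    int acc = 0;

    for (int i = 0; i < TAPS; i++) {
/* Assumed directive: fully unroll so the 12 multiplications exist as
 * separate operations that could otherwise each get their own multiplier. */
#pragma HLS UNROLL
        acc += a[i] * b[i];
    }
    return acc;
}
```

The result of the function is the same either way; the directive trades clock cycles for area, as described above.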
Functions and loops may also be pipelined to improve the design performance. Figure 10.9 shows another example of the performance metrics discussed earlier (in Fig. 10.7). In this example the design is pipelined. States S1 through S5 represent the number of clock cycles required to implement one execution of a function or one iteration of a loop. The design completes the read operation in state S1 and starts the operations in state S2. While the operations in state S2 are being performed, the next execution of the function or iteration of the loop can be started, so its S1 operations are performed in parallel with the current S2 operations.
As Fig. 10.9 demonstrates, when pipelining is used, there is no change to the latency, which is still 5 as in the previous example (Fig. 10.7). However, the II is now 1: the design now processes a new data input every clock cycle, for a 5× increase in throughput. Thus, pipelining resulted in this improved performance with only a minimal increase in resources (typically a few extra LUTs and flip-flops in the FSM).
Fig. 10.9 Performance improvement with pipelining
Pipelining is one of the most used and most important optimizations performed by HLS.
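The effect of the II on total run time can be checked with a small calculation, shown here next to a loop carrying a pipeline directive. The helper total_cycles and the function minmax_pipelined are hypothetical, the pragma assumes Vivado HLS-style syntax, and the formula is the usual idealized one (first result after the latency, then one more every II cycles) that ignores stalls.

```c
#define N 8

/* Idealized cycle count for n inputs: the first one takes `latency`
 * cycles and each further one starts `ii` cycles after the previous. */
unsigned total_cycles(unsigned n, unsigned latency, unsigned ii)
{
    return n == 0 ? 0 : latency + (n - 1) * ii;
}

/* Hypothetical loop with an assumed directive requesting that a new
 * iteration start every clock cycle (II = 1). */
void minmax_pipelined(const int DataIn[N], int *min, int *max)
{
    int lo = DataIn[0], hi = DataIn[0];
    for (int i = 1; i < N; i++) {
#pragma HLS PIPELINE II=1
        if (DataIn[i] < lo) lo = DataIn[i];
        if (DataIn[i] > hi) hi = DataIn[i];
    }
    *min = lo;
    *max = hi;
}
```

For eight inputs and a latency of 5, total_cycles gives 40 cycles with II = 5 and 12 cycles with II = 1. As the number of inputs grows, the ratio approaches the 5× throughput gain quoted above.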