

Optimizing Your RTL


When your optimization goals are different from those provided by the default optimizations performed by HLS, you can specify optimization directives to control the RTL implementation. HLS provides a number of optimizations, and it is difficult to review all of the optimizations here; however, it is worth reviewing a few of the key optimizations to provide a sense of what is possible. The minmax_frame example shown in Fig. 10.8 can be used to highlight the key HLS optimizations.

Increasing Data Accesses 

Arrays are a collection of elements accessed through an index and are synthesized into a block RAM, which is a collection of elements accessed through an address. If an array is on the top-level interface, it is assumed to be outside the design and a block RAM interface is created. Conversely, if the array is inside the C function, it is implemented as a block RAM inside the design.

Arrays may be partitioned and mapped. Partitioning an array splits it into multiple smaller block RAMs (or block RAM interfaces). Since a block RAM has a maximum of only two ports, arrays are typically partitioned to improve data access, allowing more data samples to be accessed in a single clock cycle. Mapping arrays together implements multiple arrays from the C code in the same block RAM, saving resources but often reducing data accesses and limiting data throughput.

Fig. 10.8 Key optimization objects

In both cases, the array optimizations allow the C code to remain unchanged: optimization directives instruct HLS to implement the ideal RTL structure without any need to modify the source code.
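As a sketch of what such a directive looks like in practice (the pragma follows Vivado HLS syntax; the function and array names below are illustrative, not taken from the book's listing, and a standard C compiler simply ignores the pragma):

```c
#define N 8

/* Illustrative kernel, not the book's source. The directive asks HLS
 * to split DataBuf into two smaller block RAMs (elements 0-3 and 4-7),
 * so one element from each half can be read in the same clock cycle. */
int sum_pairs(const int In[N])
{
    int DataBuf[N];
#pragma HLS ARRAY_PARTITION variable=DataBuf block factor=2 dim=1
    int acc = 0;
    for (int i = 0; i < N; i++)
        DataBuf[i] = In[i];
    for (int i = 0; i < N / 2; i++)
        acc += DataBuf[i] + DataBuf[i + N / 2];  /* one read from each half */
    return acc;
}
```

Because the directive lives in a pragma rather than in the algorithm, the same C source can be re-targeted to a different RTL structure simply by changing the directive.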

Loops may be left rolled or they may be unrolled. In a rolled loop, HLS synthesizes one copy of the loop body and then executes it multiple times. Using the minmax_frame example from Fig. 10.8, the logic to perform the reads and comparisons is created and then an FSM will ensure the logic is executed eight times (since N = 8 in this example). This ensures the minimum amount of logic is used, but it can take many clock cycles to complete all operations specified by the loop.

Loops may be partially or fully unrolled. Using the minmax_frame example from Fig. 10.8, if the loop is partially unrolled by a factor of, say, 2, this would create two copies of the logic in the loop body, and the design will execute this logic four times (8/2 = 4). This creates more logic than a rolled loop but now allows more reads and writes to be performed in parallel, increasing throughput (in other words, reducing the II).
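A minimal reconstruction of what that partially unrolled loop might look like (the pragma follows Vivado HLS syntax; the body is an assumed version of minmax_frame, not the book's exact listing, and the pragma is ignored by an ordinary C compiler):

```c
#define N 8

/* Illustrative minmax_frame: with factor=2, HLS creates two copies
 * of the loop body, so the loop completes in 8/2 = 4 executions. */
void minmax_frame(const int DataIn[N], int *MinOut, int *MaxOut)
{
    int mn = DataIn[0], mx = DataIn[0];
    for (int i = 1; i < N; i++) {
#pragma HLS UNROLL factor=2
        if (DataIn[i] < mn) mn = DataIn[i];
        if (DataIn[i] > mx) mx = DataIn[i];
    }
    *MinOut = mn;
    *MaxOut = mx;
}
```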

At this point, you can perhaps start to see the interaction between the options for interface synthesis and design synthesis: 

• Completely unrolling the loop in the minmax_frame example creates eight copies of the hardware and allows all reads and writes to occur as soon as possible: potentially, all in the same clock cycle if the frequency is slow enough (or the target technology is fast enough).

• However, if the DataIn interface is implemented as a block RAM interface, only a maximum of two reads can be performed in each clock cycle. Most of the hardware is wasted since it must sit and wait for the data to become available at the input port.

• To take advantage of all the hardware created by a fully unrolled loop, the solution here is to also partition the DataIn input port into eight separate ports (or four separate dual-port block RAM interfaces). 

Similarly, only partitioning the input port does not guarantee greater throughput: the loop also has to be unrolled to create enough hardware to consume the data. 
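The pairing the bullets describe can be sketched as follows (Vivado HLS pragma syntax; the function body is an assumed reconstruction, not the book's listing): the full unroll only pays off because DataIn is also completely partitioned, so all eight reads can occur in the same clock cycle.

```c
#define N 8

/* Illustrative minmax_frame: complete partitioning of the interface
 * array plus a full unroll lets all comparisons run in parallel. */
void minmax_frame(const int DataIn[N], int *MinOut, int *MaxOut)
{
#pragma HLS ARRAY_PARTITION variable=DataIn complete dim=1
    int mn = DataIn[0], mx = DataIn[0];
    for (int i = 1; i < N; i++) {
#pragma HLS UNROLL
        if (DataIn[i] < mn) mn = DataIn[i];
        if (DataIn[i] > mx) mx = DataIn[i];
    }
    *MinOut = mn;
    *MaxOut = mx;
}
```

Applying only one of the two directives leaves either the hardware or the data access as the bottleneck, which is exactly the interaction described above.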

Controlling Resources 

Functions and loops represent scopes within a C design and may have optimization directives applied to the objects within them. A scope in C is any region enclosed by the braces { and }. Optimization directives may be applied to functions and loops to control the resources used to implement the functionality. For example, if the C code contains 12 multiplications, HLS will by default create as many hardware multipliers as necessary to achieve the required performance; typically, this will be 12 multipliers.

Optimization directives may be used to limit the actual number of multipliers in the RTL design. For example, if an optimization directive is used to limit the number of multipliers to 1, this will force HLS to allocate only one multiplier in the RTL design and hence share the same hardware multiplier for all 12 multiplications in the C code. This will result in a smaller design, but sharing the resource (the multiplier will have a 12:1 mux in front of it) will mean the design requires more clock cycles to complete as only one multiplication may be performed in each clock cycle. 
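A sketch of such a resource constraint (the pragma follows Vivado HLS syntax; the four-element dot product is an illustrative stand-in for the 12-multiplication example, and the pragma has no effect when compiled with an ordinary C compiler):

```c
/* Illustrative function: the ALLOCATION directive caps the design at
 * one hardware multiplier, so HLS schedules all four multiplications
 * onto the same resource over successive clock cycles. */
int dot4(const int a[4], const int b[4])
{
#pragma HLS ALLOCATION instances=mul limit=1 operation
    int acc = 0;
    for (int i = 0; i < 4; i++)
        acc += a[i] * b[i];
    return acc;
}
```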

Pipelining for Performance 

Functions and loops may also be pipelined to improve the design performance. Figure 10.9 shows another example of the performance metrics discussed earlier (in Fig. 10.7). In this example the design is pipelined. States S1 through S5 represent the number of clock cycles required to implement one execution of a function or one iteration of a loop. The design completes the read operation in state S1 and starts the operations in state S2. While the operations in state S2 are being performed, the next iteration of the function or loop can be started, so the S1 operations of the next iteration execute in parallel with the S2 operations of the current one.

As Fig. 10.9 demonstrates, when pipelining is used, there is no change to the latency, which is still 5 as in the previous example (Fig. 10.7). However, the II is now 1: the design processes a new data input every clock cycle, for a 5× increase in throughput. Thus, pipelining delivered this improved performance with only a minimal increase in resources (typically a few extra LUTs and flip-flops in the FSM).

Fig. 10.9 Performance improvement with pipelining

Pipelining is one of the most used and most important optimizations performed by HLS.
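A minimal sketch of applying the directive (Vivado HLS pragma syntax; the loop below is illustrative rather than from the book, and the pragma is ignored in software compilation):

```c
#define N 8

/* Illustrative loop: PIPELINE with II=1 asks HLS to start a new loop
 * iteration every clock cycle, overlapping the read, multiply, and
 * write stages of successive iterations. */
void scale_frame(const int In[N], int Out[N], int gain)
{
    for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
        Out[i] = In[i] * gain;
    }
}
```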
