Timing Closure Techniques

Critical Path Analysis

Timing reports can be generated at any stage of synthesis or implementation. Generate a timing report after each of synthesis, placement, and routing, and analyze the paths to make sure that the design is converging. Catching and fixing issues early in the flow saves several iterations of the subsequent stages. For example, fixing issues at synthesis saves time in the place and route stage.

A timing failure can have several different causes. Depending on the analysis of the timing paths, fixes may be required at the synthesis stage or at the placement and routing stage. It is therefore important to study the characteristics of the top failing paths to determine the causes and the fixes. Below are some of the important characteristics of timing paths that can be examined, together with remedies for each.

Logic vs. Wire Delay  

Critical path delay can be broken down into logic delay and wire delay. The percentage of logic and wire delay in the critical path helps determine where to reduce delay. A low logic delay component usually means that wire delay dominates, in which case floor planning the design can help timing closure. A high logic delay component usually means that there are too many logic levels in the path.
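The split described above is simple arithmetic on two numbers from a timing report. The following is a minimal sketch, assuming you have already read the logic (cell) delay and the wire (net) delay of a path from a report; the function names and the 50% threshold are illustrative assumptions, not Vivado defaults.

```python
def delay_breakdown(logic_ns, net_ns):
    """Return (logic %, net %) of the total data path delay."""
    total = logic_ns + net_ns
    if total <= 0:
        raise ValueError("total data path delay must be positive")
    return 100.0 * logic_ns / total, 100.0 * net_ns / total


def suggest_focus(logic_ns, net_ns, logic_heavy_pct=50.0):
    """Classify a path as logic-dominated or wire-dominated.

    logic_heavy_pct is an illustrative threshold chosen for this sketch.
    """
    logic_pct, _ = delay_breakdown(logic_ns, net_ns)
    if logic_pct >= logic_heavy_pct:
        return "reduce logic levels"
    return "review placement / floor planning"


if __name__ == "__main__":
    # Hypothetical path: 1.2 ns of cell delay and 3.6 ns of net delay.
    logic_pct, net_pct = delay_breakdown(1.2, 3.6)
    print(f"logic {logic_pct:.0f}%, net {net_pct:.0f}%: {suggest_focus(1.2, 3.6)}")
```

Running the same classification over the top several failing paths, rather than only the worst one, gives a better picture of whether the problem lies in the RTL or in the placement.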

Reducing Logic Levels

For paths dominated by logic delay, examining the levels of logic in the top failing paths can reveal issues in the RTL or in how the logic was inferred.

The synthesis step in Vivado infers structures so as to balance area and speed. Different RTL coding styles guide the tool to infer structures that are sometimes area optimal and sometimes performance optimal. By observing the logic levels in the critical path, you can decide whether to change the RTL coding style or to guide the tool to infer for performance rather than area. To reduce the levels of logic, return to the RTL and check for the following general issues. In addition, refer to Chap. 9 for controlling synthesis behavior.


• Use the FSM_ENCODING attribute in your RTL to infer a one-hot FSM, which is usually better for speed.

• Use CASE statements instead of nested IF-ELSE statements. Although a CASE statement takes more area, it infers muxes efficiently, which leads to better delays.

• Add pipeline registers to the critical path.

Any change to the RTL requires resynthesizing the design. Several iterations may be needed to reach the optimal depth of logic.
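To see why the last two suggestions reduce logic depth, the sketch below uses a deliberately simplified model that counts 2:1 mux levels. It assumes a nested IF-ELSE chain maps to a priority chain and a CASE statement maps to a balanced tree; real synthesis packs logic into LUTs, so the absolute numbers will differ and only the trend is meaningful. All function names are hypothetical.

```python
def priority_chain_depth(n_branches):
    """2:1 mux levels for an n-way if/else-if chain (one mux per extra branch)."""
    return max(n_branches - 1, 0)


def balanced_mux_depth(n_branches):
    """2:1 mux levels for an n-way selection built as a balanced tree.

    (n - 1).bit_length() equals ceil(log2(n)) for n >= 1.
    """
    return (n_branches - 1).bit_length() if n_branches > 1 else 0


def levels_per_stage(total_levels, pipeline_registers):
    """Worst-case logic levels per stage after splitting a path evenly."""
    stages = pipeline_registers + 1
    return (total_levels + stages - 1) // stages


if __name__ == "__main__":
    print("8-way if/else chain :", priority_chain_depth(8), "levels")
    print("8-way case statement:", balanced_mux_depth(8), "levels")
    print("12 levels, 2 pipeline registers:", levels_per_stage(12, 2), "levels per stage")
```

Under this model an 8-way selection drops from 7 levels to 3 when written as a CASE statement, and a 12-level path drops to 4 levels per stage after two pipeline registers are added.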

Clock Skew

Clock skew is the difference between the delays the clock takes from a common source to the capture flop (or other sequential element) and to the launch flop. Examining the magnitude of clock skew can reveal issues in the clocking structure. A design with high clock skew in its critical paths usually needs its clocking structure revisited. Using MMCMs to multiply or divide clocks is recommended over using LUTs.

UltraScale and newer devices have a very flexible clock architecture and offer a large number of clocks to the user. To make it easier to reduce clock skew and to generate H-tree clocking structures, these devices offer CLOCK_ROOT, the center tap point from which clock distribution happens. Vivado chooses the CLOCK_ROOT for a set of clock loads such that the clock skew across those loads is minimal. However, in some cases where there are legal paths that cross clock domains, clock skew might be higher. In these cases you can choose the CLOCK_ROOT manually to reduce the clock skew. UG912 from Xilinx explains the mechanism for modifying the CLOCK_ROOT location.
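The effect of the clock root location on skew can be illustrated with a small calculation. This is a sketch under stated assumptions: skew is taken as capture clock delay minus launch clock delay, and the candidate root names and delays are invented for the example. It does not reproduce how Vivado selects a root.

```python
def clock_skew(capture_delay_ns, launch_delay_ns):
    """Skew as capture clock delay minus launch clock delay (sign is an assumption)."""
    return capture_delay_ns - launch_delay_ns


def best_clock_root(candidates):
    """Pick the candidate root with the smallest absolute skew.

    candidates maps a root name to (launch_delay_ns, capture_delay_ns).
    """
    return min(
        candidates,
        key=lambda root: abs(clock_skew(candidates[root][1], candidates[root][0])),
    )


if __name__ == "__main__":
    # Hypothetical clock regions and clock arrival delays for one critical path.
    candidates = {
        "X1Y2": (2.10, 2.65),
        "X2Y2": (2.30, 2.42),
        "X3Y2": (2.70, 2.25),
    }
    for root, (launch, capture) in candidates.items():
        print(f"{root}: skew {clock_skew(capture, launch):+.2f} ns")
    print("lowest |skew|:", best_clock_root(candidates))
```

In a real design the comparison would cover all critical paths between the two clock domains, not a single launch and capture pair.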

Reducing High-Fanout Signal

High-fanout signals typically pose a challenge to the place and route tools: by their nature they have many connections, so their loads end up spread out in placement and the delay on the net is relatively high. If the top several critical paths all involve a high-fanout signal, the fanout can be reduced at the RTL level, combined with options to the synthesis tool. Some options are:

Duplicate the driver and tell the synthesis tool not to remove the duplicate logic (attribute DONT_TOUCH).

For signals other than control signals such as reset, set, and clock enable, using max_fanout in synthesis directs the tool to replicate the driver.

Another option is to use phys_opt_design (post-placement). This command  performs timing-based logic replication of high-fanout drivers and critical-path  cells. Drivers are replicated, then loads are distributed among the replicated drivers,  and the replicated drivers are automatically placed. This optional command can be  run after placement and before routing.
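The idea of replicating a driver and distributing its loads can be sketched as follows. This is a toy model, not the algorithm used by synthesis or phys_opt_design: it assumes each load has a known (x, y) location, ignores timing, and simply groups neighbouring loads so that no replica drives more than a chosen fanout. The function name and the data are hypothetical.

```python
def replicate_driver(loads, max_fanout):
    """Split loads into groups of at most max_fanout, one group per replica.

    loads is a list of (name, x, y). Sorting by position keeps loads that are
    close together on the same replica.
    """
    if max_fanout < 1:
        raise ValueError("max_fanout must be at least 1")
    ordered = sorted(loads, key=lambda load: (load[1], load[2]))
    return [ordered[i:i + max_fanout] for i in range(0, len(ordered), max_fanout)]


if __name__ == "__main__":
    # Ten hypothetical loads laid out on a small grid.
    loads = [(f"ff_{i}", i % 5, i // 5) for i in range(10)]
    for idx, group in enumerate(replicate_driver(loads, max_fanout=4)):
        print(f"replica {idx}: {[name for name, _, _ in group]}")
```

With ten loads and a fanout limit of four, the sketch produces three replicas driving four, four, and two loads.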

Control Sets and Control Set Optimization

In the Xilinx FPGA architecture (for 7 series and UltraScale), each slice has eight flip-flops (FFs). These eight FFs share control signals (clock, set/reset, and clock enable), so FFs placed in the same slice must have the same control set. The placer honors this constraint by placing FFs with the same control set together. Xilinx FPGAs can accommodate several thousand control sets; however, the higher the number of control sets, the harder it is for the placer to pack flops into slices without wasting flops. The report_control_sets command can be used to assess the number of unique control sets in the design. With the verbose option, the command gives details on the distribution of the fanouts of the control signals.

Vivado synthesis has an option, control_set_opt_threshold, that sets the threshold for synchronous control set optimization in order to lower the number of control sets. The value specifies how large the fanout of a control signal must be before synthesis uses it as a control set. For example, if control_set_opt_threshold is set to 5, a synchronous reset that fans out to only 5 registers is moved into the logic on the D input rather than using the reset pin of the register. The default threshold value is currently set to 4.

Another way to reduce control sets is to use resets judiciously. Be selective in the use of resets by observing the following points:

• Have resets only where they have impact on functionality. 

• Use synchronous resets rather than asynchronous resets.
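The cost of many small control sets can be estimated with a short calculation. The sketch below is an illustration only: it treats a control set as a (clock, enable, reset) tuple, uses the eight-flops-per-slice figure quoted above, and flags control sets whose fanout is at or below a threshold as candidates for being folded into D-input logic. The signal names and the packing model are assumptions, and this is not how report_control_sets or synthesis computes its results.

```python
from collections import Counter


def control_set_fanout(flops):
    """Count flip-flops per control set; flops is a list of (clock, enable, reset)."""
    return Counter(flops)


def low_fanout_sets(flops, threshold):
    """Control sets driving at most `threshold` flops (candidates for remapping)."""
    counts = control_set_fanout(flops)
    return sorted(cs for cs, n in counts.items() if n <= threshold)


def min_slices(flops, ffs_per_slice=8):
    """Lower bound on slices when each slice holds flops of one control set only."""
    counts = control_set_fanout(flops)
    return sum(-(-n // ffs_per_slice) for n in counts.values())


if __name__ == "__main__":
    # Hypothetical design: one large control set and two small ones.
    flops = (
        [("clk", "en_a", "rst")] * 20
        + [("clk", "en_b", "rst")] * 3
        + [("clk", "", "rst_dbg")] * 5
    )
    print("control sets:", len(control_set_fanout(flops)))
    print("at or below threshold 4:", low_fanout_sets(flops, 4))
    print("slices needed:", min_slices(flops), "vs", -(-len(flops) // 8), "if all shared one set")
```

In this example 28 flops need at least five slices instead of four, because the two small control sets cannot share a slice with the large one.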

Floor Planning

Examining the critical path in the Vivado GUI shows the placement of the logic in the path. The placer, while trying to optimize several constraints at once, sometimes yields a suboptimal placement. Examining the top several critical paths in the GUI gives an idea of whether the placer did a suboptimal job in placing the critical-path objects. If so, floor planning can be done to guide the placer. A hierarchical floor plan can reduce the route delay in the critical logic. A good starting point when floor planning for the first time is to floor plan only the logic that the implementation tools consider timing critical. Generally, start with the lower-level hierarchies that the place and route stage finds to be timing critical. It is often useful to look at the placement of block RAMs and DSP blocks, as these are not distributed throughout the FPGA. Floor planning them gives not only better performance but also more predictable results in future iterations of the same project. When the design meets timing, it is also possible to reuse the placement.

For SSI devices, floor planning poses additional requirements, which are explained in Chap. 13.

Physical Optimization

Physical optimization optimizes the paths that fail to meet timing. The optimizations involve replication, retiming, hold fixing, and placement improvement. Physical optimization is usually run after placement, when the timing picture is reasonably accurate. These optimizations are invoked by explicitly running the optional phys_opt_design command. This command performs the following physical optimizations.

High-Fanout Optimization: High-fanout nets with negative slack within a percentage of the worst negative slack (WNS) are considered for replication. The drivers are replicated and the replicated drivers are placed near clusters of loads.

Placement-Based Optimization: Cells on the critical path are re-placed to reduce wire delays.

Rewire: LUT connections are swapped to reduce the number of logic levels for critical signals. LUT equations are modified to maintain design functionality.

Critical-Cell Optimization: Cells in failing paths are replicated. If the loads on a specific cell are placed far apart, the cell may be replicated with new drivers placed closer to load clusters. High fanout is not a requirement for this optimization to occur, but the path must fail timing with slack within a percentage of the WNS.

DSP Register Optimization: Registers are moved out of the DSP cell into the logic array or from logic to DSP cells if it improves the delay on the critical path.

Block RAM Register Optimization: Registers are moved out of the block RAM cell into the logic array or from logic to block RAM cells if it improves the delay on the critical path.

Retiming: Registers are moved across combinational logic to provide better timing.

Forced Net Replication: Net drivers are replicated regardless of timing slack. Replication is based on load placements and requires manual analysis to determine whether the replication is sufficient. If further replication is required, nets can be replicated repeatedly by successive commands. Although timing is ignored, the net must be in a timing-constrained path to trigger the replication.

The optimizations above are run during post-placement physical optimization. Vivado also allows physical optimization to be run at the post-route stage, but only a subset of the optimizations is run there, because the runtime of post-route physical optimization is higher.
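Several of the optimizations above consider only nets or paths whose negative slack lies within a percentage of the WNS. One plausible reading of that rule is sketched below; the exact interpretation and the percentage are assumptions made for illustration, and the net names and slacks are invented.

```python
def replication_candidates(net_slacks, pct_of_wns):
    """Return failing nets whose slack is within pct_of_wns percent of the WNS.

    net_slacks maps a net name to its slack in ns. With WNS = -1.0 ns and
    pct_of_wns = 10, nets with slack between -1.0 and -0.9 ns qualify.
    """
    wns = min(net_slacks.values())
    if wns >= 0:
        return []  # nothing fails timing, so nothing qualifies
    cutoff = wns * (1.0 - pct_of_wns / 100.0)
    return sorted(net for net, slack in net_slacks.items() if slack < 0 and slack <= cutoff)


if __name__ == "__main__":
    slacks = {"net_a": -1.0, "net_b": -0.95, "net_c": -0.5, "net_d": 0.2}
    print(replication_candidates(slacks, pct_of_wns=10))
```

Here net_c fails timing but is left alone, because its slack is far from the worst case and replicating it would not improve the WNS.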

Strategy and Directives

Directives are powerful features that are available with every implementation step (synthesis, optimize design, placement, physical optimization, and routing). A directive steers the behavior of an implementation step's algorithms toward an alternate goal. It changes the implementation step by using:

• Different flows

• Different algorithms 

• Different objectives

Directives allow each implementation step to explore more of the design space than in the default mode. Directives have different objectives, such as reducing area, reducing runtime, improving performance, and improving power.

Directives are enabled by running any synthesis or implementation step with the option -directive. The names of directives are usually chosen to indicate their objective and how they differ from the default behavior. Every implementation step has the directive Explore, which lets the implementation step work in a high-effort mode to meet the timing objective at the expense of runtime. For designs with very tight requirements, it is recommended to use the Explore directive for most of the implementation steps (especially placement and physical optimization). Directives related to placement usually give the biggest improvement in performance. Refer to UG904 from Xilinx for the list of directives and the objective of each.

Strategies define the flow of Vivado: they customize the different implementation steps and how each of these steps is configured. As each synthesis and implementation step has a variety of options and directives, a strategy configures a combination of these switches suited to a particular goal. You can also define your own custom strategy. Strategies are categorized into the following:

• Performance 

• Area 

• Power 

• Flow 

• Congestion

Each of the above categories contains several strategies, which can be used to extract the last mile of performance from the tools. In the context of timing closure, the performance and congestion categories are applicable. One approach is to run all the available performance strategies and pick the best result.
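Picking the best result from several strategy runs amounts to comparing their timing summaries. The sketch below assumes you have collected the WNS and the total negative slack (TNS) of each run yourself; the ranking rule (WNS first, then TNS) is one reasonable choice rather than a Vivado feature, and the strategy names and numbers are placeholders.

```python
def best_run(results):
    """Return the run with the best WNS, using TNS to break ties.

    results maps a run name to (wns_ns, tns_ns). Values closer to zero or
    positive are better, so tuple comparison with max() gives the ranking.
    """
    return max(results, key=lambda name: results[name])


if __name__ == "__main__":
    # Placeholder run names and timing numbers.
    results = {
        "Performance_Explore": (-0.12, -3.4),
        "Performance_ExtraTimingOpt": (-0.05, -0.9),
        "Congestion_SpreadLogic_high": (-0.05, -1.7),
    }
    winner = best_run(results)
    print(f"best run: {winner} with WNS/TNS {results[winner]}")
```

Two runs tie on WNS here, so the one with the smaller total negative slack is preferred, since it has fewer or milder failing endpoints left to fix.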

Congestion and Congestion Alleviation

The FPGA routing architecture has different kinds of routing resources to serve the different scenarios seen in the placement of a design. Congestion happens when, in some region, demand for certain kinds (or all kinds) of routing resources exceeds their availability. The extent of the congested regions determines whether the congestion is local or global. To alleviate congestion, the placement and routing algorithms introduce white space and detours. These changes can worsen routing delays, which in turn affects the timing of the design.

There are certain steps you can take to reduce the effect of congestion on timing. Congested regions can be identified by running congestion reporting with report_design_analysis. Designs with heavy utilization of block RAMs, MuxF7s, MuxF8s, and distributed RAMs tend to be congested. Care should be taken to reduce the utilization of any block with high connectivity, because such blocks increase the number of signals entering the region where they are placed. If many high-connectivity blocks are placed in a small region, you can spread them over a larger region by defining a pblock. The pblock can be enlarged until it has enough routing resources to complete the routing of all nets, thereby alleviating congestion.

Report Design Analysis

report_design_analysis is a command that summarizes several important details of the critical paths. Commonly occurring issues in critical paths are summarized in tabular format. By looking at the characteristics of several critical paths, the underlying issues can be deduced. The command has three modes of operation:

• Timing 

• Congestion 

• Complexity

Timing mode is used to find the characteristics of critical paths. For each path, many important characteristics are printed. For example, it is easy to determine whether the top critical paths contain block RAMs and whether those block RAMs are registered, or whether the top several critical paths contain LUTs that were combined in the synthesis stage (this can be turned off by using the -lc off option). UG906 from Xilinx describes the other information that can be obtained from this report.

Congestion mode gives the post-placement and post-routing congestion windows, and complexity mode computes the Rent's exponent of the netlist or of specified modules. Congestion combined with complexity can determine whether the netlist itself is inherently congested or whether the congestion is placement induced. Using congestion mode, you can find the congested window and determine which modules are placed in that region. You can then run complexity mode on these modules to compute their Rent's exponent. As a rule of thumb, a Rent's exponent over 0.7 can be considered an issue in the netlist.
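For background on the complexity figure, Rent's rule relates the number of external terminals T of a block to the number of logic elements G it contains as T = t * G^p, where p is the Rent's exponent. The sketch below estimates p as the slope of a least-squares line in log-log space from a few (G, T) samples. It illustrates the rule itself and is not a description of how report_design_analysis computes its value; the sample data is synthetic.

```python
import math


def rent_exponent(samples):
    """Estimate the Rent's exponent from (element_count, terminal_count) samples.

    Fits log(T) = log(t) + p * log(G) by least squares and returns the slope p.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    xs = [math.log(g) for g, _ in samples]
    ys = [math.log(t) for _, t in samples]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("element counts must differ between samples")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Synthetic partitions generated with t = 4 and p = 0.6.
    samples = [(g, 4.0 * g ** 0.6) for g in (16, 64, 256, 1024)]
    p = rent_exponent(samples)
    verdict = "above" if p > 0.7 else "at or below"
    print(f"estimated exponent {p:.2f}, {verdict} the 0.7 rule of thumb")
```

A higher exponent means the number of connections leaving a block grows faster with block size, which is why a heavily interconnected module tends to congest wherever it is placed.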

Timing Closure and Hold Violation

The previous section covered several techniques for timing closure that focus mainly on setup violations. Hold violations are another kind of timing failure that you need to be aware of. Hold violations are severe, because reducing the clock frequency does not help close them. The Vivado tool is hold aware and tries to mitigate the violations by detouring and adding extra delay to the paths that fail hold. However, you should be aware of these requirements and not depend solely on the tool to fix the issues. Buffers with the DONT_TOUCH attribute can be added to a hold-failing path so that the synthesis tool does not optimize them away. In addition, post-route physical optimization and a few router directives can help reduce hold violations. Figure 14.3 provides a top-level flow chart for achieving timing closure on your design.
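The reason a slower clock does not fix a hold violation is visible in the slack equations. The sketch below uses a simplified single-clock model that ignores clock uncertainty and assumes skew is capture clock delay minus launch clock delay; all numbers are invented for illustration.

```python
def setup_slack(period_ns, skew_ns, max_data_delay_ns, setup_ns):
    """Setup slack: data must arrive before the next capture edge."""
    return period_ns + skew_ns - max_data_delay_ns - setup_ns


def hold_slack(skew_ns, min_data_delay_ns, hold_ns):
    """Hold slack: data must not change too soon after the same capture edge.

    The clock period does not appear in this expression.
    """
    return min_data_delay_ns - skew_ns - hold_ns


if __name__ == "__main__":
    skew, setup, hold = 0.30, 0.10, 0.05
    for period in (4.0, 5.0):
        print(f"period {period} ns: setup slack {setup_slack(period, skew, 4.5, setup):+.2f} ns, "
              f"hold slack {hold_slack(skew, 0.20, hold):+.2f} ns")
    # Adding 0.20 ns of delay (for example a buffer or a routing detour) fixes hold.
    print(f"with extra delay: hold slack {hold_slack(skew, 0.20 + 0.20, hold):+.2f} ns")
```

Lengthening the period from 4 ns to 5 ns turns the setup slack positive but leaves the hold slack unchanged. Only added data path delay, or reduced skew, repairs the hold violation.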


