Critical Path Analysis
Timing reports can be generated at any stage of synthesis and implementation. Generate a timing report after each of synthesis, placement, and routing, and analyze the paths to make sure that the design is converging. Catching and fixing issues earlier in the flow saves several iterations of the subsequent stages. For example, fixing an issue at synthesis saves time in the place and route stage.
A timing failure can have several different causes. Depending on the analysis of the timing paths, fixes may be required at the synthesis stage or at the placement and routing stage. It is therefore important to study the characteristics of the top failing paths to determine the causes and the fixes. Below are some of the important characteristics of timing paths that can be examined, along with remedies that mitigate them.
Logic vs. Wire Delay
Critical path delay can be broken down into logic delay and wire delay. The percentage of each in the critical path helps determine where to reduce delay. A low logic delay component usually means that wire delay dominates, in which case floor planning the design can help timing closure. A high logic delay component means that there are too many logic levels in the path.
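The breakdown described above is simple arithmetic, sketched below in Python. This is only an illustration: the function names, the sample delays, and the 50% cut-off are assumptions made for the example, not values taken from Vivado or from a real timing report.

```python
def delay_breakdown(logic_delay_ns, wire_delay_ns):
    """Return the logic and wire shares of a path's data delay, in percent."""
    total = logic_delay_ns + wire_delay_ns
    if total <= 0:
        raise ValueError("total path delay must be positive")
    logic_pct = 100.0 * logic_delay_ns / total
    return logic_pct, 100.0 - logic_pct


def suggest_focus(logic_pct, logic_heavy_threshold=50.0):
    """Hypothetical rule of thumb; the 50% cut-off is an assumption."""
    if logic_pct >= logic_heavy_threshold:
        return "reduce logic levels"
    return "review placement / floor planning"


# Hypothetical path: 1.2 ns of logic delay and 2.8 ns of wire delay.
logic_pct, wire_pct = delay_breakdown(logic_delay_ns=1.2, wire_delay_ns=2.8)
print(f"logic {logic_pct:.0f}%, wire {wire_pct:.0f}% -> {suggest_focus(logic_pct)}")
```

In practice the two delay components would be read from the timing report of each top failing path, and the same comparison repeated across several paths before deciding on a remedy.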
Reducing Logic Levels
For paths dominated by logic delay, examining the levels of logic in the top failing paths can reveal issues in the RTL or in how the logic was inferred.
The synthesis step in Vivado infers structures so as to balance area and speed. Different RTL coding styles guide the tool to infer structures that are either area optimal or performance optimal. By observing the logic levels in the critical path, you can identify whether to change the RTL coding style or to guide the tool to infer for performance rather than area. To reduce the levels of logic, return to the RTL and check for the following general issues. In addition, refer to Chap. 9 for controlling synthesis behavior.
• Use the FSM_ENCODING attribute in your RTL to infer a one-hot FSM, which is usually better for speed.
• Use CASE statements instead of nested IF-ELSE statements; although CASE takes more area, it infers muxes efficiently, which leads to lower delays.
• Add pipeline registers to the critical path.
Any change to the RTL requires resynthesizing the design. Several iterations may be needed to reach the optimal depth of logic.
Clock Skew
Clock skew is the difference between the delay of the clock from a common source to the capture flop (or sequential element) and its delay to the launch flop (or sequential element). Examining the magnitude of clock skew can reveal issues in the clocking structure. High clock skew in the critical paths usually means that the clocking structure needs to be revisited. Using MMCMs to multiply or divide clocks is recommended over using LUTs. UltraScale and newer devices have a very flexible clock architecture and offer many clocks to the user. To help reduce clock skew and to generate H-tree clocking structures, the device offers CLOCK_ROOT, the center tap point from which clock distribution happens. Vivado chooses CLOCK_ROOT for a set of clock loads such that the clock skew across those loads is minimal. However, where paths are legal cross-clock-domain paths, clock skew might be higher. In these cases you can choose CLOCK_ROOT manually to reduce the clock skew. UG912 from Xilinx explains the mechanism for modifying the CLOCK_ROOT location.
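As a minimal sketch of the definition above, skew can be computed as the capture clock delay minus the launch clock delay. The function name and the sample delays are hypothetical, and the calculation ignores any pessimism or uncertainty adjustments a real timing report would apply.

```python
def clock_skew(capture_clock_delay_ns, launch_clock_delay_ns):
    """Skew = clock delay to the capture element minus clock delay to the launch element."""
    return capture_clock_delay_ns - launch_clock_delay_ns


# Hypothetical clock path delays measured from the common source.
skew = clock_skew(capture_clock_delay_ns=2.10, launch_clock_delay_ns=2.65)

# With this sign convention, a negative value means the capture clock arrives
# earlier than the launch clock, which leaves less time for the data path (setup).
print(f"clock skew = {skew:+.2f} ns")
```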
Reducing High-Fanout Signal
High-fanout signals typically pose a challenge to the place and route tools: by their nature they have many connections, so their loads are spread out in placement and the delay on the net is relatively high. If the top several critical paths all involve the same high-fanout signal, the fanout can be reduced through RTL changes coupled with synthesis options. Some options are:
• Duplicate the driver and tell the synthesis tool not to remove the duplicate logic (attribute DONT_TOUCH).
• For signals other than control signals such as reset, set, and clock enable, use max_fanout in synthesis to direct the tool to replicate the driver.
• Use phys_opt_design (post-placement). This command performs timing-based logic replication of high-fanout drivers and critical-path cells. Drivers are replicated, loads are distributed among the replicated drivers, and the replicated drivers are placed automatically. This optional command can be run after placement and before routing.
Control Sets and Control Set Optimization
In Xilinx FPGA architecture (for 7 series and UltraScale), each slice has eight flip-flops (FFs). These eight FFs share control signals, so FFs that are placed in the same slice must have the same control set. The placer honors this constraint by placing FFs with the same control set together. Xilinx FPGAs can accommodate several thousand control sets; however, the higher the number of control sets, the harder it is for the placer to pack flops into slices without wasting flops. The report_control_sets command can be used to assess the number of unique control sets in the design. With the verbose option, the command gives details on the distribution of the fanouts of the control signals. Vivado synthesis has an option, control_set_opt_threshold, that sets the threshold for synchronous control set optimization in order to lower the number of control sets. The value specifies how large the fanout of a control signal must be before synthesis uses it as a dedicated control signal. For example, if control_set_opt_threshold is set to 5, a synchronous reset that fans out to only 5 registers is moved into the logic on the D input rather than using the reset line of the register. The default threshold value is currently set to 4. Another way to reduce control sets is to use resets judiciously. Be selective in the use of resets by observing the following points:
• Have resets only where they have impact on functionality.
• Use synchronous resets rather than asynchronous resets.
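The threshold rule described above can be sketched as a simple partition by fanout. This is an illustrative Python model rather than Vivado's implementation: the function name, the signal names, and the fanout numbers are hypothetical. The comparison follows the example in the text, where a fanout equal to the threshold is still folded into the D input logic.

```python
def split_control_sets(fanout_by_signal, threshold):
    """Partition synchronous control signals by fanout.

    Signals with fanout <= threshold are assumed to be folded into the
    D-input logic; the rest keep using the register's dedicated control pin.
    """
    kept, folded = {}, {}
    for name, fanout in fanout_by_signal.items():
        target = folded if fanout <= threshold else kept
        target[name] = fanout
    return kept, folded


# Hypothetical fanouts, for example as read from a control set report.
fanouts = {"rst_small": 5, "rst_core": 240, "ce_fifo": 3, "ce_pipe": 64}
kept, folded = split_control_sets(fanouts, threshold=5)

print("dedicated control pins:", sorted(kept))
print("folded into D logic:   ", sorted(folded))
```

Raising the threshold in this model moves more low-fanout signals into the D logic, which reduces the number of control sets at the cost of extra LUT logic in front of the registers.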
Floor Planning
Examining the critical path in the Vivado GUI shows the placement of the logic in the path. Sometimes the placer, while trying to optimize several constraints at once, yields a suboptimal placement. Examining the top several critical paths in the GUI gives an idea of whether the placer did a suboptimal job in placing critical-path objects. If so, floor planning can be done to guide the placer. A hierarchical floor plan can reduce the route delay in the critical logic. A good starting point when floor planning for the first time is to floor plan only the logic that the implementation tools consider timing critical. Generally, start with the lower-level hierarchies that the place and route stage finds to be timing critical. It is often useful to look at the placement of block RAMs and DSP blocks, as these are not distributed throughout the FPGA. Floor planning them gives not only better performance but also predictable results in future iterations of the same project. When the design meets timing, it is also possible to reuse the placement. For SSI devices, floor planning poses additional requirements, which are explained in Chap. 13.
Physical Optimization
Physical optimization performs optimization on the paths that fail to meet timing. Optimizations involve replication, retiming, hold fixing, and placement improvement. Physical optimization is usually run after placement, when the timing picture is reasonably accurate. These optimizations are invoked by explicitly running the optional phys_opt_design command. This command performs the following physical optimizations.
High-Fanout Optimization: High-fanout nets with negative slack within a percentage of the WNS are considered for replication. The drivers are replicated, and the replicated drivers are placed near clusters of loads.
Placement-Based Optimization: Cells on the critical path are re-placed to reduce wire delays.
Rewire: LUT connections are swapped to reduce the number of logic levels for critical signals. LUT equations are modified to maintain design functionality.
Critical-Cell Optimization: Cells in failing paths are replicated. If the loads on a specific cell are placed far apart, the cell may be replicated with new drivers placed closer to load clusters. High fanout is not a requirement for this optimization to occur, but the path must fail timing with slack within a percentage of the worst negative slack.
DSP Register Optimization: Registers are moved out of the DSP cell into the logic array, or from logic into DSP cells, if doing so improves the delay on the critical path.
Block RAM Register Optimization: Registers are moved out of the block RAM cell into the logic array, or from logic into block RAM cells, if doing so improves the delay on the critical path.
Retiming: Registers are moved across combinational logic to provide better timing.
Forced Net Replication: Net drivers are replicated regardless of timing slack. Replication is based on load placements and requires manual analysis to determine whether the replication is sufficient. If further replication is required, nets can be replicated repeatedly by successive commands. Although timing is ignored, the net must be in a timing-constrained path to trigger the replication.
The above optimizations are run only during the post-placement physical optimization step; however, Vivado also allows physical optimization to be run at the post-route stage. Only a subset of the optimizations is run post-route, because the runtime of physical optimization after routing is higher.
Strategy and Directives
Directives are powerful features that are available with every implementation step (synthesis, optimize design, placement, physical optimization, and routing). A directive steers the algorithms of an implementation step toward an alternate goal. It changes the implementation step by using:
• Different flows
• Different algorithms
• Different objectives
Directives allow each implementation step to explore more of the design space than in the default mode. Directives have different objectives, such as reducing area, reducing runtime, improving performance, and improving power.
Directives are enabled by running any synthesis or implementation step with the option -directive. The name of a directive usually indicates its objective and how it differs from the default behavior. Every implementation step has the directive Explore. Explore makes the implementation step work in a high-effort mode to meet the timing objective at the expense of runtime. For designs with very tight requirements, the Explore directive is recommended for most of the implementation steps (especially placement and physical optimization). Directives related to placement usually give the biggest improvement in performance. Refer to UG904 from Xilinx for the list of directives and the objective of each.
Strategies define the flow of Vivado and customize the different implementation steps and how each of these steps is configured. As each synthesis and implementation step has a variety of options and directives, strategies configure the best possible combination of these switches. You can also define your own custom strategy. Strategies are categorized into the following:
• Performance
• Area
• Power
• Flow
• Congestion
Each of the above categories contains several strategies that can be used to extract last-mile performance from the tools. In the context of timing closure, the performance and congestion categories are the applicable ones. One approach is to run all the available performance strategies and pick the best result.
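Picking the best result from several strategy runs can be sketched as a comparison of timing summaries. The Python below is only an illustration: the run names and the WNS/TNS numbers are hypothetical placeholders, and ranking by WNS first and TNS second is an assumption rather than a rule from the tools.

```python
def pick_best_run(results):
    """Return the run whose (WNS, TNS) pair is highest.

    `results` maps a run name to a (wns_ns, tns_ns) tuple. Tuples compare
    element by element, so WNS decides first and TNS breaks ties.
    """
    if not results:
        raise ValueError("no runs to compare")
    return max(results, key=lambda name: results[name])


# Hypothetical post-route summaries from three strategy runs.
runs = {
    "run_a": (-0.210, -35.4),
    "run_b": (-0.045, -1.2),
    "run_c": (-0.045, -3.8),
}
print("best run:", pick_best_run(runs))
```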
Congestion and Congestion Alleviation
FPGA routing architecture has different kinds of routing resources to serve the different scenarios that arise in placement of the design. Congestion happens when a region has more demand for certain kinds, or all kinds, of routing resources than are available. The extent of the congested regions determines whether the congestion is local or global. To alleviate congestion, the placement and routing algorithms introduce white space and detours. These changes can worsen routing delays, which impacts the timing of the design. There are certain steps you can take to reduce the effect of congestion on timing. Congested regions can be identified by running congestion reporting with the report_design_analysis command. Designs with heavy utilization of block RAMs, MuxF7s, MuxF8s, and distributed RAMs tend to have congestion. Care should be taken to reduce the utilization of any block with high connectivity, because such blocks increase the number of signals entering the region where they are placed. If many high-connectivity blocks are placed in a small region, you can enlarge the region by defining a pblock. The pblock can be made large enough to contain the routing resources needed to route all nets, thereby alleviating congestion.
Report Design Analysis
report_design_analysis is a command that summarizes several important details of the critical paths. Commonly occurring issues in critical paths are summarized in a tabular format. By looking at the characteristics of several critical paths, issues can be deduced. The command has three modes of operation:
• Timing
• Congestion
• Complexity
Timing mode is used to find the characteristics of critical paths. For each path, many important characteristics are printed. For example, it is easy to determine whether the top critical paths contain block RAMs and whether those are registered. It is equally easy to see whether the top several critical paths contain LUTs that were combined in the synthesis stage (this can be turned off by using the -lc off option). UG906, published by Xilinx, describes the other meaningful information that can be obtained from this report.
Congestion mode gives the post-placement and post-routing congestion windows, and complexity mode computes the Rent's exponent of the netlist or of the specified modules. Congestion combined with complexity can determine whether the netlist itself is inherently congested or the congestion is placement induced. Using congestion mode, you can find the congested window and determine which modules are placed in that region. You can then run complexity mode on these modules to compute their Rent's exponent. As a rule of thumb, a Rent's exponent above 0.7 can be considered an issue in the netlist.
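For readers unfamiliar with the metric, the textbook form of Rent's rule relates the number of external terminals T of a block to the number of cells G inside it as T = t * G^p, where p is the Rent's exponent. The sketch below fits p by least squares on a log-log scale. It is not a description of how Vivado computes the value, which the text does not specify; the function names and sample data are hypothetical, and only the 0.7 threshold comes from the rule of thumb above.

```python
import math


def fit_rent_exponent(samples):
    """Estimate the Rent's exponent p from (cell_count, terminal_count) pairs.

    Fits log(T) = log(t) + p * log(G) by ordinary least squares and
    returns the slope p.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    xs = [math.log(cells) for cells, _ in samples]
    ys = [math.log(terminals) for _, terminals in samples]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("cell counts must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def is_netlist_suspect(rent_exponent, threshold=0.7):
    """Apply the rule of thumb: an exponent above 0.7 suggests a netlist issue."""
    return rent_exponent > threshold


# Hypothetical partitions of one module: (cells inside, external terminals).
samples = [(16, 30), (64, 85), (256, 240), (1024, 690)]
p = fit_rent_exponent(samples)
print(f"estimated Rent's exponent: {p:.2f}, suspect: {is_netlist_suspect(p)}")
```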
Timing Closure and Hold Violation
The previous section covered several timing closure techniques that focus mainly on setup violations. Hold violations are another kind of timing failure that you need to be aware of. Hold violations are severe because reducing the clock frequency does not help close them. Vivado is hold aware and tries to mitigate these violations by detouring routes to add extra delay to the paths failing hold. However, you should be aware of hold requirements and not depend solely on the tool to fix them. Buffers can be added to a hold-failing path with the DONT_TOUCH attribute so that the synthesis tool does not optimize them away. In addition, post-route physical optimization and a few router directives can help reduce hold violations. Figure 14.3 provides a top-level flow chart for achieving timing closure on your design.
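The reason a slower clock does not cure a hold violation can be seen in a simplified slack model. The Python sketch below assumes a single-cycle path between two flops on the same clock and ignores clock uncertainty and pessimism adjustments. The function names and all delay values are hypothetical.

```python
def setup_slack(period_ns, skew_ns, clk_to_q_ns, max_data_delay_ns, setup_ns):
    """Setup check: data must arrive before the next capture edge."""
    return (period_ns + skew_ns) - (clk_to_q_ns + max_data_delay_ns + setup_ns)


def hold_slack(skew_ns, clk_to_q_ns, min_data_delay_ns, hold_ns):
    """Hold check: data must not change too soon after the same capture edge.

    The clock period does not appear in this expression.
    """
    return (clk_to_q_ns + min_data_delay_ns) - (skew_ns + hold_ns)


# Hypothetical short path with positive skew (capture clock arrives later).
path = dict(skew_ns=0.30, clk_to_q_ns=0.10, hold_ns=0.05)

for period_ns in (2.0, 4.0, 8.0):
    s = setup_slack(period_ns, path["skew_ns"], path["clk_to_q_ns"], 0.10, 0.06)
    h = hold_slack(path["skew_ns"], path["clk_to_q_ns"], 0.10, path["hold_ns"])
    print(f"period {period_ns} ns: setup slack {s:+.2f} ns, hold slack {h:+.2f} ns")

# Adding delay to the data path (a detour or buffers) is what fixes hold.
fixed = hold_slack(path["skew_ns"], path["clk_to_q_ns"], 0.10 + 0.20, path["hold_ns"])
print(f"hold slack after adding 0.20 ns of data delay: {fixed:+.2f} ns")
```

In this model the setup slack grows with the period while the hold slack stays the same, which is why hold fixes rely on added data-path delay or on reducing skew.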