Getting the Most of Device Primitives

An FPGA is made up of a fixed number of structures of several different varieties. Understanding the target architecture and the impact of different mappings lets you obtain a very high quality of results (QoR) by tweaking the inference and the resource mix according to your specific design priorities.

The same functionality may be realized using different combinations of primitives. If your design makes excessive use of a specific primitive, you might want to implement some of the functionality on another type of primitive where possible, even if that other primitive type would usually be considered suboptimal for that functionality.

This section covers some of the dedicated primitives of Xilinx FPGAs, with examples of the decision-making process, to show how to obtain optimal results through Vivado synthesis.

The examples given below are with respect to Xilinx 7 series, UltraScale, and UltraScale+ devices. They are intended to give a conceptual understanding that can be adapted to other architectures, depending on the structures available in those architectures.

SRLs 

Xilinx FPGAs contain a primitive called LUTM (LUT memory), which can be configured as a sequential element such as a shift register (SRL32) or as a distributed RAM. This section covers some examples to illustrate the decision-making process around SRLs.

Take a simple example of a 1-bit-wide delay chain of depth 64. This can be implemented in 64 flip-flops, which would need at least four slices. Alternatively, it can be implemented in two LUTMs, each configured as an SRL32, plus an additional flop for better clock_to_out, all of which can go into a single slice.
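The arithmetic behind this comparison can be sketched in a few lines of Python. The figure of 16 flip-flops per slice is an assumption chosen to reproduce the "four slices" quoted above (other device families hold a different number of flops per slice), and the function names are illustrative only. The extra output flop is not modelled.

```python
import math

FLOPS_PER_SLICE = 16  # assumption: chosen to match the "four slices" in the text
SRL_DEPTH = 32        # one LUTM configured as an SRL32 holds 32 stages


def slices_for_flop_chain(depth, width=1):
    """Minimum slices needed if every delay stage is a flip-flop."""
    return math.ceil(depth * width / FLOPS_PER_SLICE)


def lutms_for_srl_chain(depth, width=1):
    """LUTMs needed if each bit of the delay line is built from SRL32s."""
    return width * math.ceil(depth / SRL_DEPTH)


if __name__ == "__main__":
    print(slices_for_flop_chain(64))  # flip-flop implementation
    print(lutms_for_srl_chain(64))    # SRL32 implementation
```

Under these assumptions the 64-deep chain costs four slices of flops or two LUTMs. A depth-3 chain still consumes a whole LUTM, which is the reason many shallow delay lines can drive SRL utilization up.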

On the other hand, consider a design with many delay lines of small depth (say 3). If these are mapped to SRLs, they could cause congestion due to high SRL utilization. Based on the design statistics, you should control the SRL threshold to get a better trade-off. The Vivado synthesis tool provides directives and switches to change the threshold for SRL inference.

Structures around SRL also play a role. Consider the following sets of structures: 

• Combo logic followed/preceded by SRL 

• Block RAM/distributed RAM followed/preceded by SRL 

• DSP followed/preceded by SRL 

For better clock_to_out, synthesis pulls the last stage of an SRL out into a flip-flop. You can control this behavior using synthesis attributes. You might also consider pulling the first stage of an SRL out into a flop, which would provide higher flexibility for placement. This can be controlled using the synthesis attribute srl_style. For example, srl_style = reg_srl_reg forces the tool to build SRLs with registers on both sides.

Memories 

Designs typically use memories for storing data, buffering, etc. At a fundamental level, a memory is a bank of flops with decoding logic at the input and MUX logic at the output.

FPGAs provide dedicated primitives for implementing memories. These are of two types. The first is distributed memory, which is implemented using LUTMs; the second is block RAM, which consists of hard blocks of size 18K/36K bits. For very small memories, the obvious choice is a register-based implementation, since the number of flops and the amount of glue logic will be small.

When choosing between distributed and block RAM, the first thing to consider is whether the read is synchronous. An asynchronous read from the memory will be inferred as a distributed RAM. A synchronous read, meaning that either the output data or the read address is registered, is a requirement for a block RAM to be inferred.

Since distributed RAM is implemented using a LUT, a six-input LUT can be configured to implement a 64 × 1 single-port memory. A block RAM can support 18K/36K bits. Choosing the crossover point between distributed RAM and block RAM is important. Synthesis tools use thresholds and timing constraints to infer these memories automatically.

For highly utilized designs that are dominated by one of the primitives (distributed RAMs or block RAMs), you should guide the tool using attributes or switches toward a different implementation, to get a balanced utilization of resources. This in turn gives the place and route tools better opportunities for placement. There is no deterministic optimal ratio of distributed RAMs to block RAMs. The right mix depends on various factors.

Based on a few case studies that we have encountered, we mention below some good practices for particular scenarios. Your design may need its own decision.

Distributed RAM Usage 

For a highly utilized design with tighter timing constraints, make sure that distributed RAM accounts for a relatively low percentage of the overall slice usage. The reason is that if there are too many distributed RAMs, a lot of fabric routing would converge at each slice/CLB, which would result in congestion.

Look at configurations with a small depth and a wide data bus. Synthesis tools might look at a combined view of the aspect ratio to decide between inferring distributed or block RAMs. In cases where the depth is small, distributed RAMs are a better choice.

For example, take depth × width = 32 × 256. This would result in four block RAMs if used in simple dual port (SDP) mode. As distributed RAM, it would take 256 LUTs. In this example it is better to go with the 256 LUTs: the block RAM bits actually used are 8192, against a total capacity of 147,456 (four block RAMs).
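The numbers in this example can be reproduced with a rough Python sketch. The 72-bit SDP port width of a 36K block RAM and the one-LUT-per-32 × 1-column cost are assumptions used to match the counts quoted above, and the helper names are hypothetical. This is not how the synthesis tool decides.

```python
import math

BRAM36_BITS = 36 * 1024  # 36,864 bits per 36K block RAM
BRAM36_SDP_WIDTH = 72    # assumption: widest SDP data port of one 36K block RAM
BRAM36_SDP_DEPTH = 512   # assumption: depth available at that width
LUTRAM_DEPTH = 32        # assumption: one LUT per 32 x 1 column, as counted in the text


def bram_cost(depth, width):
    """Return (block RAM count, fraction of the block RAM bits actually used)."""
    if depth > BRAM36_SDP_DEPTH:
        raise ValueError("this sketch only handles depths that fit in one block")
    count = math.ceil(width / BRAM36_SDP_WIDTH)
    return count, (depth * width) / (count * BRAM36_BITS)


def lutram_cost(depth, width):
    """Return the number of LUTs needed as distributed RAM."""
    return width * math.ceil(depth / LUTRAM_DEPTH)


if __name__ == "__main__":
    count, used = bram_cost(32, 256)
    print(count, f"{used:.1%}", lutram_cost(32, 256))
```

For the 32 × 256 case the sketch reports four block RAMs with only about 5.6% of their bits in use, against 256 LUTs, which is why the distributed implementation is preferred here.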

Block RAM Pipelining 

For higher frequencies, always use the block RAM's pipeline registers; otherwise the clock_to_out of the block RAM will limit the achievable performance. In the following situations, the synthesis tool might not pull the register into the block RAM, even if pipeline registers are present:

• Feedback path on the register 

• Fanout from the first stage of the pipeline

If the block RAM output has multiple fanouts, use an additional register outside the block RAM for higher performance. The place and route tools have higher flexibility in placing this register, based on the placement of its fanout loads.

DSPs 

DSP blocks come with a number of features. A few to mention are the pre-adder, the multiplier, and the post-adder/accumulator, with a pipeline register at each output. This section uses examples based on the DSP48E2 from Xilinx UltraScale devices.

The DSP48E2 supports a signed multiplier of size 27 × 18, a 48-bit post-adder, and an input pre-adder connected to the 27-bit multiplier port.

Extra DSPs Inferred 

Note that a multiplier of size 27 × 18 will be mapped into a single DSP block only if the inputs are signed. So, when extra DSPs are inferred, the first thing to check is whether the inputs are unsigned.

An adder followed by a multiplier, when used at full width, will not be packed into a single DSP block. A 27-bit addition produces a 28-bit result, and this 28-bit value would then be used for the multiplication. The operand size has thus grown beyond 27, the width of the multiplier. You need to consider the multiplier input size and calculate the maximum possible width at the input of the DSP primitive. For a signed multiplier of 27 × 18, taking the carry into consideration, the maximum possible adder at the input is 26 bits. If it is unsigned, it would be less by one more bit.
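The width rule above can be written down as a small Python check. This is only a sketch of the bit-growth arithmetic described in the text; the function names are illustrative, and whether synthesis actually packs a given structure into one DSP block also depends on factors not modelled here.

```python
MULT_PORT_WIDTH = 27  # multiplier port fed by the pre-adder (from the text)


def max_preadder_operand_width(signed=True, port_width=MULT_PORT_WIDTH):
    """Widest adder operands whose sum still fits the multiplier port.

    An N-bit addition produces an (N+1)-bit result, so one bit is reserved
    for the carry. Unsigned data costs one more bit, as stated in the text.
    """
    width = port_width - 1
    if not signed:
        width -= 1
    return width


def adder_fits_single_dsp(operand_width, signed=True):
    """True if an adder of this operand width can feed the multiplier directly."""
    return operand_width <= max_preadder_operand_width(signed)


if __name__ == "__main__":
    print(max_preadder_operand_width(signed=True))
    print(max_preadder_operand_width(signed=False))
    print(adder_fits_single_dsp(27))
```

The sketch gives 26 bits for signed operands and 25 bits for unsigned operands, and it rejects the full-width 27-bit adder from the example.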

Consider a situation where the multiplier output is tapped or padded with 0s before driving an adder. The connection from the multiplier output to the adder is hardwired, so if there is any truncation or padding, the logic cannot be implemented within a single DSP block.

To summarize the above, the DSP is a powerful block. To use its capabilities to the fullest extent, make sure you understand the hardwired connections and the widths that the primitive supports internally.

You can make use of DSP’s pipeline registers for achieving high performance. Make sure to use all the pipelines if you have a tighter timing requirement. 

MUXFs 

These are 2:1 MUXes that multiplex LUT outputs and can be used for implementing wider functions. For example, two LUT6s are muxed by a MUXF7, which provides the capability to implement a seven-input function. A similar analogy applies to MUXF8 and MUXF9. Note, however, that the inputs of a MUXF8 are MUXF7s.
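The way the MUXF levels stack can be illustrated with a short Python sketch. It assumes, by the same analogy, that a MUXF9 takes two MUXF8s as inputs; the function names are hypothetical, and the counts apply to an arbitrary function of the given number of inputs.

```python
def lut6_count(num_inputs):
    """LUT6s needed for an arbitrary function of 6 to 9 inputs."""
    if not 6 <= num_inputs <= 9:
        raise ValueError("this sketch covers functions of 6 to 9 inputs only")
    return 2 ** (num_inputs - 6)


def muxf_counts(num_inputs):
    """Number of MUXF7/MUXF8/MUXF9 cells in the resulting tree."""
    if not 6 <= num_inputs <= 9:
        raise ValueError("this sketch covers functions of 6 to 9 inputs only")
    levels = num_inputs - 6
    # Each level halves the number of signals, so the last level has one MUX.
    return {f"MUXF{7 + i}": 2 ** (levels - 1 - i) for i in range(levels)}


if __name__ == "__main__":
    for n in (7, 8, 9):
        print(n, lut6_count(n), muxf_counts(n))
```

An eight-input function, for instance, comes out as four LUT6s, two MUXF7s, and one MUXF8, consistent with the MUXF8 taking MUXF7s as its inputs.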

In the context of a complete design, there is always a trade-off between using a MUXF and using a LUT3, for example, to implement a two-input MUX. Put simply, if the MUX is driving a register, then it is advantageous to use a MUXF because there is a direct route from the MUXF to the register. If it is driving some combo logic, a LUT3 can be combined with another function, which would result in a reduction of one logic level. Synthesis tools can be directed by switches or attributes to control this behavior.

Carry Chains 

For implementing arithmetic operations such as adders, subtractors, or comparators, dedicated carry chains (or carry look-ahead) provide faster routes.

When using carry chains, make sure to exploit the capability of the architecture. Avoid using an adder that feeds into combo logic, which then feeds into another adder, as shown in Fig. 9.4. In this case, although the adders are implemented using carry chains, the combo logic forces an exit from CARRY to LUT and an entry from LUT to CARRY, which contribute a large percentage of the delay. The logic can be slightly restructured as adder, adder, combo or as combo, adder, adder (as shown in Fig. 9.5) to minimize the delay.

Another best practice is to use a register at the output of the adder so that both can be packed into the same slice.

Fig. 9.4 Adder, logic, adder

Fig. 9.5 Logic, adder, adder
