Specialized Hardware Blocks

Several types of specialized hardware blocks are available in most current FPGAs, but not all of them are available in all devices, and their number varies from one device to another. In fact, the type and number of specialized hardware resources included in a given device determine its target application domain. Some of the most common specialized hardware resources (clock management blocks, memory blocks, and transceivers) are described in the following sections. As stated in Section 2.1, because of their special significance, embedded soft and hard processors, as well as DSP and analog blocks, are analyzed separately in Chapters 3 through 5, respectively.

Clock Management Blocks

The generation, control, and quality of clock signals are among the most important problems to be faced in the design of complex digital systems, particularly in the case of multirate systems or those requiring very fast data transfer rates, where synchronization among the different parts of the system is a critical issue.

Regarding clock management, FPGAs are divided into regions designed to minimize clock propagation delays within them. A set of dedicated clock input pins is assigned to each region, together with resources to manage and distribute clock signals (Actel 2010; Achronix 2015; Altera 2015c; Microsemi 2015d; Xilinx 2015b), as shown in Figure 2.12. The number of clock regions varies depending on the device size. Global clock lines also exist, as well as other clock lines that connect adjacent clock regions. In some FPGAs, it is possible to power down a clock, “disconnecting” global or regional clock signals to reduce power consumption. When a clock line is powered down, all logic associated with it is also switched off, further reducing power consumption.

FIGURE 2.12 (a) Global, (b) regional, and (c) peripheral clock networks in Altera’s Stratix V devices.

In order to reduce the problems associated with clock signals as well as the number of external oscillators needed, FPGAs include clock management blocks (CMBs),* based on either PLLs or delay-locked loops (DLLs). These CMBs are mainly used for frequency synthesis (frequency multiplication or division), skew reduction, and duty cycle/phase control of clock signals. 

Each CMB is associated with one or several dedicated clock inputs, and in most devices, it can also take as input an internal global clock signal or the output of another CMB (chain connection). Chain connections allow dynamic range to be increased for frequency synthesis (both for frequency multiplication and division). 

For optimized performance, CMBs are physically placed close to IOBs and are connected to them with dedicated resources. Therefore, in matrix architectures, such as Microsemi’s IGLOO2 (Figure 2.13a), CMBs are placed in (or close to) the periphery, where IOBs are located. In column-based architectures, such as Xilinx’ Series 7 (Figure 2.13b), CMBs are placed in specific columns, regularly distributed all over the device, but always next to IOB columns. It should be noted that, in most FPGAs, CMBs are also used to generate the clock signals used by SerDes blocks and transceivers.

Although the functionality of CMBs is similar regardless of the vendor/family of devices, their hardware structures are quite diverse. As mentioned before, some of them are based on DLLs (digital solution), but most current devices use PLLs (analog solution). Basic PLLs work with integer factors for frequency synthesis (integer PLLs), but in the most advanced devices, fractional PLLs (capable of working with noninteger factors) are also available.

* Again, different vendors use different names. Xilinx: clock management tiles or digital clock managers; Altera: PLLs or fractional PLLs; Microsemi: clock conditioning circuitry; Achronix: global clock generator.

FIGURE 2.13 Location of CMBs in (a) matrix and (b) column-based architectures.

The structure of an integer PLL is depicted in Figure 2.14. Its main purpose is to achieve perfect synchronization (frequency and phase matching) between its output signal and a reference input signal (Barrett 1999). Its operation is as follows: The phase detector generates a voltage proportional to the phase difference between the feedback and reference signals. The low-pass filter averages the output of the phase detector and applies the resulting signal to a voltage-controlled oscillator, whose resonant frequency (the output frequency) varies accordingly. In this way, the output frequency is dynamically adjusted until the phase detector indicates that the feedback and reference signals are in phase. At this point, the PLL is said to have reached the phase-lock condition.

FIGURE 2.14 Block diagram of an integer PLL.

In case the feedback loop is a direct connection of the output signal to the input of the phase detector (i.e., there is neither a delay block nor a feedback counter), in steady state, the frequency and phase of the output signal follow those of the reference input signal. 

If the delay block is included in the feedback loop, in order for the phase-lock condition to be achieved, the phase of the output signal must lead that of the reference signal by an amount equal to the delay in the feedback loop (Gentile 2008).

Similarly, if the counter is included in the feedback loop, the frequency of the output signal will be M times the frequency of the reference signal. In this way, the PLL acts as a frequency multiplier by an integer factor M. By simply varying M, the output frequency can be adjusted to any integer multiple of the reference frequency (within the operating limits of the circuit). If the reference frequency is obtained by dividing the frequency of an input signal by an integer scaling factor R (prescale counter in Figure 2.14), the output frequency will also be divided by R; that is, the effective multiplying factor would be M/R.
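As a quick numerical check of the M/R relation, the following Python sketch computes the output frequency of an integer PLL. The function name and the example values (50 MHz input, M = 12, R = 5) are illustrative assumptions, not taken from any device datasheet, and the operating limits of a real circuit are ignored.

```python
from fractions import Fraction

def integer_pll_output(f_in_hz, m, r=1):
    """Output frequency of an integer PLL: f_out = f_in * M / R.

    m: feedback counter value (multiplying factor M)
    r: prescale counter value (dividing factor R)
    Hypothetical helper; real devices also restrict the VCO range.
    """
    if m < 1 or r < 1:
        raise ValueError("M and R must be positive integers")
    # Fraction keeps the result exact even when M/R is not an integer.
    return Fraction(f_in_hz) * m / r

# Illustrative values only: 50 MHz input, M = 12, R = 5.
f_out = integer_pll_output(50_000_000, m=12, r=5)
print(float(f_out) / 1e6, "MHz")
```

With R = 1 the function reduces to plain multiplication by M, which corresponds to the case without the prescale counter.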

Therefore, the relatively simple structure in Figure 2.14 allows CMBs to synthesize multiple frequencies from an input clock signal, control the phase of the output signal, and eliminate skew by synchronizing the output signal with the input reference signal. As an example of an actual circuit (Actel 2010), the one in Figure 2.15 provides five programmable dividing counters (C1–C5) that can generate up to three signals with different frequencies. There are two delay lines in the feedback loop (one fixed and one programmable) that can be used to advance the output clock relative to the input clock. Another five lines are available to delay output signals.

FIGURE 2.15 Structure of Microsemi’s integer PLL.

In spite of their simplicity and usefulness, PLLs based on integer divisions have two main drawbacks:

• When multiplying frequency by M, phase noise (jitter in the time domain) in the output signal increases by 20 · log10(M) dB. This effect may be mitigated by using a higher reference frequency (which would imply the use of a lower value of M to obtain the same output frequency), but this is not always possible because the reference frequency defines the frequency resolution of the PLL, and for some applications, it is a design specification (Barrett 1999; Texas Instruments 2008).

• The cutoff frequency of the low-pass filter must be sufficiently lower than the reference frequency. For lower cutoff frequencies, the acquisition (or lock) time of the PLL increases. This is the time needed for the PLL to reach steady state (i.e., to synchronize) after power on, reset, or the reconfiguration of its operating parameters (Barrett 1999).
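The first drawback can be illustrated numerically. The sketch below, in Python, evaluates the 20 · log10(M) term for one output frequency obtained from several reference frequencies. The function name and the chosen frequencies are assumptions made only for illustration.

```python
import math

def phase_noise_penalty_db(m):
    """Phase-noise increase, in dB, when multiplying frequency by M."""
    return 20 * math.log10(m)

# Same output frequency from different (assumed) reference frequencies:
# a higher reference needs a lower M and therefore pays a smaller penalty.
f_out = 400e6
for f_ref in (1e6, 10e6, 25e6):
    m = round(f_out / f_ref)
    penalty = phase_noise_penalty_db(m)
    print(f"f_ref = {f_ref / 1e6:5.1f} MHz  M = {m:3d}  penalty = {penalty:5.1f} dB")
```

The same loop also shows the trade-off mentioned in the text: the higher reference frequency gives a coarser frequency resolution.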

Fractional PLLs behave better than integer ones in terms of phase noise and acquisition time. Their (fractional) frequency resolution is a fraction 1/F of the reference frequency. This means that the reference frequency can be F times the frequency resolution, resulting in lower phase noise and shorter acquisition time.

Fractional PLLs are based on the use of a divider by M + K/F in the feedback loop, where K is the fractional multiply factor. As explained earlier, this would be the frequency multiplying factor of the PLL unless a prescale frequency divider by R is applied to the input signal. There are two hardware approaches to obtain a fractional PLL, which are used in different FPGA devices. The simplest one uses an accumulator to dynamically modify the frequency division in the feedback loop, in such a way that in K out of F cycles of the reference signal, the dividing factor is M + 1, and in F − K cycles, the frequency is divided by M, resulting in an average dividing factor equal to [(M + 1) · K + M · (F − K)]/F = M + K/F.
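The accumulator-based approach can be sketched in a few lines of Python. This is a behavioral model under simplifying assumptions (one accumulator update per reference cycle, 0 ≤ K < F). The function name and the values M = 10, K = 3, F = 8 are hypothetical.

```python
def fractional_divider_sequence(m, k, f, cycles):
    """Return the dividing factor used in each reference cycle.

    An accumulator adds K every cycle, modulo F. Each time it overflows,
    the feedback divider uses M + 1 for that cycle; otherwise it uses M.
    """
    if not 0 <= k < f:
        raise ValueError("K must satisfy 0 <= K < F")
    acc = 0
    sequence = []
    for _ in range(cycles):
        acc += k
        if acc >= f:          # overflow: divide by M + 1 in this cycle
            acc -= f
            sequence.append(m + 1)
        else:                 # no overflow: divide by M
            sequence.append(m)
    return sequence

# Illustrative values only: over F cycles, M + 1 is used K times.
seq = fractional_divider_sequence(m=10, k=3, f=8, cycles=8)
average = sum(seq) / len(seq)   # equals M + K/F over a whole number of F-cycle periods
print(seq, average)
```

Because the divider alternates between two integer values, the instantaneous division factor is never exactly M + K/F; only the average is.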

This solution adds spurious signals (instantaneous phase errors in the time domain) to the output frequency. Although they can be mitigated by using analog methods, a better solution is achieved by using a different hardware structure for the fractional PLL. In this second approach, based on a delta-sigma modulator, digital techniques are used to more efficiently reduce phase noise and spurious signals (Barrett 1999; Texas Instruments 2008).

As an example, the CMB shown in Figure 2.16 (which combines integer and fractional PLLs) uses a delta-sigma modulator associated with the feedback frequency divider (Altera 2015b). It also includes several output dividers (C0–Cn) to generate output clock signals of different frequencies, as well as an input clock switch circuit to select the reference signal. Reference signals may be the same (clock redundancy) or have different frequencies (for dual-clock-domain applications). The input and output signals of this CMB can be connected to global or regional clock lines, to external clock pins, or to other CMBs.

FIGURE 2.16 Integer/fractional PLL from Altera Arria 10 family.

In the CMBs of any current FPGA, the feedback signal can be obtained from different sources and be routed through different paths. The way of doing it depends on the target functionality, as described in the following (it must be noted that not all devices provide all these possibilities): 

• Minimize the length of the feedback path to reduce output jitter.* 

• Compensate skew in the clock network used to generate the output of the CMB, which can be generated using either internal or external feedback. In the first case, feedback comes from a global or regional clock line, compensating internal device delays; whereas in the second case, feedback comes from a device pin, compensating delays at the board level.

• Generate zero-delay buffer clocks (Gentile 2008). When the signal generated by the CMB is connected to an external clock pin, it may be important to compensate the propagation delays introduced by this pin and the external connections, in order to ensure that the clock signal reaching the external device is synchronized with the CMB’s reference signal. 

• Ensure the phase in the data and clock inputs of the memory elements in IOBs is the same as the phase of the same signals when they reach the device pins; that is, the pin-to-register-input delays of these signals are the same.

• Ensure this equality of delays from input pins also for the clock and data input signals of SerDes blocks. 

In spite of their similar functionalities, there are many differences among CMBs from different FPGA families in terms of input and output frequency ranges, frequency/phase synchronization ranges, access to interconnection resources, types of signals they can generate (e.g., single-ended, differential), the number of outputs, possible values of frequency multiplying and dividing factors, fixed/variable/programmable delay, and so on. There are obviously also differences in the control signals, but at least two of them are present in all devices: reset, to initialize the CMB, and locked, whose activation validates the output signal (i.e., indicates frequency and/or phase synchronization has been achieved). The combination of both signals allows the correct behavior of the CMB to be checked and recovered if needed. If synchronism is lost, the locked signal will be deactivated. As a response, a reset can be launched for synchronism to be recovered. This process can be automatically executed in some FPGAs.

Given all the previously mentioned issues, it may seem difficult for the user to deal with the many different configuration parameters and operating modes of CMBs, but this is not actually the case. Software design tools usually offer IP blocks whose user interfaces require just a few values to be entered, from which configuration parameters are automatically computed.

Memory Blocks

Most digital systems require resources to store significant amounts of data. Memories are the main elements in charge of this task. Since memory access times are usually much longer than propagation delays in logic circuits, memories (in particular external ones) are the bottleneck of many systems in terms of performance. Because of this, FPGA vendors have always paid special attention to optimizing logic resources so that they can support, in the most efficient possible way, the implementation of internal memories. 

Since combinational logic, LUTs, and flip-flops are available in LBs, internal memories can be built by combining the resources of several LBs, resulting in the so-called distributed memory. However, for distributed memory to be more efficient, LBs may be provided with resources in addition to those intended to support the implementation of general-purpose logic functions, such as additional data inputs and outputs, enable signals, and clock signals. Because this makes LBs more complex and, in addition, it makes no sense to use all LBs in an FPGA to build distributed memories, usually only around 25%–50% (depending on the family of devices) of the LBs in a device are provided with extra resources to facilitate the implementation of different types of memories: RAM, ROM, FIFO, shift registers, or delay lines (Xilinx 2014b; Altera 2015c). The structures of a “general-purpose” LB and another one suitable for distributed memory implementation can be compared in Figure 2.17.

As FPGA architectures evolved to support the implementation of more and more complex digital systems, memory needs increased. As a consequence, vendors decided to include in their devices dedicated memory blocks, which in addition use specific interconnection lines to optimize access time. They are particularly suitable for implementing “deep” memories (with a large number of positions), whereas distributed memory is more suitable for “wide” memories (with many bits per position) with few positions, shift registers, or delay lines.

In current FPGAs, both distributed memory and dedicated memory blocks support similar configurations and operating modes. Dedicated memory is structured in basic building blocks of fixed capacity, which can be combined to obtain deeper (series connection) or wider (parallel connection) memories. The possible combinations depend on the target type of memory and on the operating mode. The capacity of the blocks varies widely, even among devices of the same family (Altera 2012; Xilinx 2014c; Achronix 2015; Microsemi 2015c).

FIGURE 2.17 LBs from Xilinx’ Series 7 devices: (a) general purpose and (b) oriented to distributed memory implementation.

FIGURE 2.18 Sample Altera’s Cyclone III memory modes: (a) simple dual-port block RAM, (b) FIFO, and (c) true dual-port block RAM.

The most common configurations (some of which can be seen in the sample case in Figure 2.18) are:

• Single-port RAM, where only one single read or write operation can be performed at a time (each clock cycle) 

• Simple dual-port RAM, where one read and one write operation can be performed simultaneously 

• True dual-port RAM, where it is possible to perform two write operations, two read operations, or one read and one write operation simultaneously (and at different frequencies if required)

• ROM, where a read operation can be performed in each clock cycle 

• Shift register 

• FIFO, either synchronous (using one clock for both read and write operations) or asynchronous (using two independent clocks for read and write operations). They can generate status flags (“full,” “empty,” “almost full,” “almost empty”; the last two are configurable). 
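To make the FIFO status flags concrete, the following Python class is a small behavioral model of a synchronous FIFO. It is only a sketch: the class name, the threshold parameters, and the convention that a rejected write returns False are assumptions, not the interface of any vendor's FIFO block.

```python
from collections import deque

class SyncFifo:
    """Behavioral model of a synchronous FIFO with status flags."""

    def __init__(self, depth, almost_full_at, almost_empty_at):
        self.depth = depth
        self.almost_full_at = almost_full_at      # configurable threshold
        self.almost_empty_at = almost_empty_at    # configurable threshold
        self._words = deque()

    @property
    def count(self):
        return len(self._words)

    @property
    def empty(self):
        return self.count == 0

    @property
    def full(self):
        return self.count == self.depth

    @property
    def almost_full(self):
        return self.count >= self.almost_full_at

    @property
    def almost_empty(self):
        return self.count <= self.almost_empty_at

    def write(self, word):
        """Store a word; the write is ignored (returns False) when full."""
        if self.full:
            return False
        self._words.append(word)
        return True

    def read(self):
        """Return the oldest word, or None when the FIFO is empty."""
        if self.empty:
            return None
        return self._words.popleft()

# Usage with assumed parameters: depth 4, thresholds at 3 and 1 words.
fifo = SyncFifo(depth=4, almost_full_at=3, almost_empty_at=1)
for word in (0xA, 0xB, 0xC):
    fifo.write(word)
print(fifo.almost_full, fifo.full)
```

An asynchronous FIFO would additionally need the read and write pointers to cross clock domains safely, which this single-clock model does not attempt to represent.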

In dual-port memories, usually word width can be independently configured for each port. In some cases, input and output word widths can also be independently configured for the same port, which eases the efficient implementation of content-addressable memories. Configurations cannot be arbitrary, but have to be chosen from a predefined set.

Several clock modes can be used in FPGA memories (some of which are mentioned earlier), but not all modes are supported in all configurations: 

• Single clock: All memory resources are synchronized with the same clock signal. 

• Read/write: Two different clocks are used for read and write opera- tions, respectively. 

• Input/output: Uses separate clocks for each input and output port. 

• Independent clocks: Used in dual-port memories to synchronize each port with a different clock signal.

Some memory blocks support error detection or correction using parity bits or dedicated error correction blocks (Xilinx 2014c; Altera 2015b), as shown in Figure 2.19. These are complementary functionalities that can be configured from the software design tools. Regarding parity, depending on data width, one or more parity bits may be added to the original binary combination. In some FPGAs, parity functions are not implemented in dedicated hardware, but have to be built from distributed logic. In Xilinx’ Series 7 devices, parity is one of the possibilities offered by the error correction code (ECC) encoder. The circuit in Figure 2.19 cannot be used with distributed memory. It can exclusively be associated with dedicated memory blocks, in particular with simple dual-port and FIFO configurations. It allows single-bit errors to be detected and corrected or double-bit errors to be detected. Output signals are available to flag the occurrence of an error and indicate whether or not it could be corrected.

FIGURE 2.19 Error correction resources in Xilinx’ Series 7.
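The "correct single-bit errors, detect double-bit errors" behavior can be illustrated with a textbook extended Hamming(8,4) code. This Python sketch is a stand-in chosen for brevity, not the encoding used by the ECC block in Figure 2.19 (which works on wider words). The function names and the status strings are assumptions.

```python
def secded_encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]          # covers Hamming positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]          # covers Hamming positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]          # covers Hamming positions 4, 5, 6, 7
    word = [p1, p2, d[0], p4, d[1], d[2], d[3]]   # positions 1..7
    overall = 0
    for bit in word:
        overall ^= bit               # extra parity bit over positions 1..7
    return word + [overall]

def secded_decode(code):
    """Return (nibble, status): status is 'ok', 'corrected', or 'uncorrectable'."""
    word = list(code[:7])
    syndrome = 0
    for position, bit in enumerate(word, start=1):
        if bit:
            syndrome ^= position     # nonzero syndrome points at the flipped bit
    parity = 0
    for bit in code:
        parity ^= bit                # 0 for a valid codeword
    if syndrome == 0 and parity == 0:
        status = "ok"
    elif parity == 1:                # odd number of flips: assume a single error
        if syndrome:
            word[syndrome - 1] ^= 1  # syndrome 0 means the extra parity bit flipped
        status = "corrected"
    else:                            # even number of flips with nonzero syndrome
        return None, "uncorrectable"
    nibble = word[2] | (word[4] << 1) | (word[5] << 2) | (word[6] << 3)
    return nibble, status

codeword = secded_encode(0b1011)
codeword[5] ^= 1                     # inject a single-bit error
print(secded_decode(codeword))
```

The two status outcomes other than "ok" play the role of the output flags mentioned in the text: one reports that an error occurred, the other whether it could be corrected.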

Dedicated memory blocks based on SRAM cells can be found in all current FPGAs. In some devices, flash memories with read/write capabilities are also available (Microsemi 2014). Their main advantage is that they are nonvolatile, and their main drawback is that they require more control signals than SRAM-based ones, which makes their control from the FPGA fabric more complex. ECC blocks are also available for this kind of memory.

The addition of memories to FPGA designs is facilitated by software design tools, which automatically partition the memory blocks defined by the designer and assign them to the memory blocks available in the target device, according to the operation modes specified and the design constraints regarding area and speed. Memory contents can also be initialized with the help of the design tools, which allow the contents of text files (where the values to be initially stored in the memories are described with a predefined syntax) to be included in the configuration bitstream.* 

Hard Memory Controllers 

In many FPGA applications, a huge amount of data has to be handled, but there is not enough embedded memory available for that. In such cases, external memory has to be used, and the corresponding memory controller needs to be implemented in the FPGA. Since a wide variety of memories exists, the required interfaces are also very diverse, from simple parallel or serial interfaces (such as Serial Peripheral Interface [SPI], Inter-Integrated Circuit [I²C], and Universal Serial Bus [USB]) to much more complex ones (e.g., DDR).

To address this issue, FPGA vendors offer different soft† IP core-based solutions. However, these do not provide good-enough performance when dealing with very large memories (up to the GB range) or very fast operation requirements (hundreds of MHz or even GHz). This is the reason why FPGA vendors are including hard memory controllers in their most current devices. For instance, Arria V and 10 families from Altera include dedicated hardware for access control to external DDR/DDR2/DDR3/DDR4 SDRAM memories (Figure 2.20). Spartan-6 and Virtex-6 families from Xilinx also include DDR3 hard memory controllers, enhanced in Series 7 families of devices and extended in the UltraScale family to support DDR4 memories.

* FPGA configuration issues are analyzed in detail in Chapter 6.

† The functionality of soft cores is implemented using resources of the FPGA fabric.

FIGURE 2.20 Arria 10 hard memory controller.

Two types of hard DDR/DDR2/DDR3 memory controllers are available in Microsemi SmartFusion2 devices, one of them accessible from the FPGA fabric and the other from an embedded ARM Cortex-M3 core* (so the latter cannot be considered an FPGA resource, but rather a resource of the core). MachXO2, LatticeXP2, and LatticeECP2/M families from Lattice include circuitry allowing DDR/DDR2 memory interfaces to be implemented, whereas LatticeECP3, ECP5, and ECP5-5G families also support DDR3 memory interfaces.

Compared with soft IP core-based solutions, hard controllers achieve lower latencies and higher access frequencies. They support different data widths, reordering of commands and data for out-of-order execution, definition of priorities for reduced latency, streaming read or write operations for massive data transfer, burst modes, operation modes for continuous access to random sequences of memory addresses, multiport interfaces, low power consumption modes, user-controlled partial refresh cycles for reduced consumption, and error-correcting algorithms.

* As stated in Section 2.1, embedded soft and hard processors are separately analyzed in Chapter 3.

Let us consider the sample controller in Figure 2.20 (Altera 2016), consisting of three main building blocks (all of them physically located in the I/O banks of the devices):

• The physical layer interface (UniPHY) directly interacts with the I/O pins and is in charge of ensuring an adequate timing between the controller and the external memory. One of the main problems of external memory interfaces is the skew among data lines due to PCB routing. This problem is particularly significant for wide, high-speed buses. UniPHY mitigates this problem by means of configurable delay chains, which allow the delay associated with each I/O pin to be independently adjusted so as to align all data in the bus.

• The memory controller is in charge of maximizing bandwidth, through efficient control of the commands for external memory. It uses two main strategies for that, namely, reordering commands to take advantage of idle/dead cycles and reordering data and commands to group read or write commands so that they are executed together, minimizing bus turnaround time.

• The multiport front end (MPFE) manages the access of multiple processes (read or write transactions) implemented in the FPGA fabric to the same hard external memory interface. In Arria 10 devices, it is a soft IP core.
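The per-pin delay adjustment described for the physical layer can be sketched as follows. This Python example is a deliberate simplification: it assumes the arrival time of each data line is already known (real interfaces find the settings through calibration), and the function name, the 5 ps tap size, and the 31-tap limit are hypothetical values, not UniPHY parameters.

```python
def deskew_settings(arrival_ps, step_ps, max_taps):
    """Delay-chain taps for each data line so that all lines align with the latest one.

    arrival_ps: assumed arrival time of each data line, in picoseconds
    step_ps:    delay added by one tap of the delay chain (hypothetical)
    max_taps:   number of taps available per pin (hypothetical)
    """
    latest = max(arrival_ps)
    taps = []
    for arrival in arrival_ps:
        needed = round((latest - arrival) / step_ps)  # delay the early lines
        taps.append(min(needed, max_taps))            # saturate at the chain length
    return taps

# Four data lines with different PCB routing delays (illustrative numbers).
arrivals = [120, 95, 140, 110]
taps = deskew_settings(arrivals, step_ps=5, max_taps=31)
aligned = [a + n * 5 for a, n in zip(arrivals, taps)]
print(taps, aligned)
```

Only early lines can be delayed, so every line is aligned to the slowest one. When the required delay exceeds the chain length, residual skew remains.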

Transceivers 

A key factor for the success of FPGAs in the digital design market is their ability to connect to external devices, modules, and services in the same PCB, through backplane, or at long distance. In order to be able to support applications demanding high data transfer rates, the most recent FPGA families include full-duplex transceivers, compatible with the most advanced industrial serial communication protocols (Cortina Systems and Cisco Systems 2008; PCI-SIG 2014). Data transfer rates up to 56 Gbps can be achieved in some devices, and the number of transceivers per device can be in excess of 100 (e.g., up to 144 in Altera’s Stratix 10 GX family and up to 128 in Xilinx’s Virtex UltraScale+ FPGAs). Some of the supported protocols are as follows:

• Gigabit Ethernet

• PCI Express (PCIe)

• 10GBASE-R

• 10GBASE-KR

• Interlaken

• Open Base Station Architecture Initiative (OBSAI)

• Common Public Radio Interface (CPRI)

• 10 Gb Attachment Unit Interface (XAUI)

• 10 Gb Small Form-factor Pluggable Plus (SFP+)

• Optical Transport Network OTU3

• DisplayPort

Transceivers are complex circuits, whose architectures vary among solutions from different FPGA vendors (as can be seen in Figure 2.21), in particular regarding generation and management of clock signals (Altera 2014, 2015d; Xilinx 2014d, 2015c; Achronix 2015; Jiao 2015; Microsemi 2015b). In any case, they can basically be divided into two parts, namely, transmitter and receiver, each one in turn consisting of two main blocks (depicted in Figure 2.22 for the case of Altera’s Stratix V devices): physical medium attachment (PMA) and physical coding sublayer (PCS).

FIGURE 2.21 Transceivers from (a) Xilinx’ Series 7 and (b) Altera’s Stratix V families.

FIGURE 2.22 Altera’s Stratix V transceiver: PMA (right) and PCS (left).

Data flows are as follows: In the receiver, serial input data enter the PMA block, whose output is applied to the PCS block, and finally information reaches the FPGA fabric. In the transmitter, output data follow a similar path, but in the opposite direction, from the FPGA fabric to the output of the PMA. 

Given the high complexity of these blocks and taking into account that the detailed analysis of communication protocols is totally out of the scope of this book, only the main functional characteristics shared by most FPGA transceivers are described in the following text. 

The receiver’s PMA consists at least of an input buffer, a clock data recovery (CDR) unit, and a deserializer:

• The input buffer allows the voltage levels and the terminating resistors to be configured in order for the input differential terminals to be adapted to the requirements of the different protocols. It supports different equalization modes (such as continuous time linear equalization or decision feedback equalization) aimed at increasing the high-frequency gain of the input signal to compensate transmission channel losses.

• The CDR unit extracts (recovers) the clock signal and the data bits from incoming bitstreams. 

• The deserializer samples the serial input data using the recovered clock signal and converts them into words, whose width (8, 10, 16, 20, 32, 40, 64, or 80 bits) depends on the protocol being used. 

On the transmitter side, the PMA is in charge of serializing output data and sending them through a transmission buffer. This buffer includes circuits to improve signal integrity in the high-speed serial data being transmitted. Features include pre- and post-emphasis circuits to compensate losses, internal terminating circuits, or programmable output differential voltage, among others.

PCSs (both in the transmitter and in the receiver) can be considered as digital processing interfaces between the corresponding PMA and the FPGA fabric. Their main tasks are as follows: 

• Encode (decode) data to be transmitted (being received) to support a variety of standard or proprietary coding solutions (8B/10B, 64B/66B, 64B/67B).

• Align serial input data to symbol boundaries (receiver). 

• Generate (transmitter) or detect (receiver) the standard patterns (pseudo-random bit sequences [PRBS]) used to check signal integrity in high-speed serial links.
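As an illustration of the last task, the following Python sketch generates a PRBS pattern with a 7-bit linear feedback shift register and counts bit errors against it. The text does not say which PRBS patterns a given transceiver supports. The PRBS7 polynomial x^7 + x^6 + 1 is used here as an assumed example, and the checker assumes the receiver already knows the seed, which a real pattern checker would have to recover from the incoming data.

```python
def prbs7(num_bits, seed=0x7F):
    """Generate num_bits of a PRBS7 sequence (polynomial x^7 + x^6 + 1)."""
    if seed & 0x7F == 0:
        raise ValueError("seed must be nonzero (the all-zero state never changes)")
    state = seed & 0x7F
    bits = []
    for _ in range(num_bits):
        new_bit = ((state >> 6) ^ (state >> 5)) & 1   # XOR of stages 7 and 6
        bits.append(new_bit)
        state = ((state << 1) | new_bit) & 0x7F       # shift and insert feedback
    return bits

def count_bit_errors(received, seed=0x7F):
    """Compare received bits with a locally generated reference pattern."""
    reference = prbs7(len(received), seed)
    return sum(r != e for r, e in zip(received, reference))

# Transmit 200 bits, corrupt two of them, and count the errors at the receiver.
tx = prbs7(200)
rx = list(tx)
rx[17] ^= 1
rx[101] ^= 1
print(count_bit_errors(rx))
```

The sequence repeats every 127 bits, the maximum length for a 7-bit register.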

In addition, since transceivers use several clock domains, PCSs usually include deskew circuits (such as the ones described in Section 2.4.1) to align the phase of the different clock signals, as well as circuits to compensate small frequency variations between the external transmitter and the local receiver. 

Depending on the operating mode or the used protocol, the PCS block may not be used. Actually, not all FPGA transceivers include this block. Some devices, in contrast, include transceivers with different types of PCS blocks, supporting different serial data transfer rates. Finally, to ensure integrity of the transmitted data, transceivers must be calibrated before they start to operate. Transceivers in some devices (e.g., Altera’s Stratix 10) include circuits that automatically perform the calibration process at power on. 

As in the cases of clock management and memory blocks, although transceiver configuration is in principle a complex task, software design tools provide resources to automatically obtain wrappers that allow transceivers to be configured from either predefined models of industrial standards or user-defined custom protocols.

PCIe Blocks

Among the many existing serial communication protocols, PCIe deserves special attention because of its role as a high-speed solution for point-to-point processor communication. For this reason, FPGA vendors have been progressively including resources to support the implementation of PCIe buses, from the initial IP-based solutions to the currently available dedicated hardware blocks (Curd 2012).

From its initial definition (PCI-SIG 2015) to date, three PCIe specifications have been released (a fourth one is pending publication), whose characteristics are listed in Table 2.1.

Many FPGAs (e.g., Microsemi’s SmartFusion2 and IGLOO2, Xilinx’s from Series 5 on, Altera’s from Arria II on) include dedicated hardware blocks to support Gen 1 and Gen 2 specifications, and the most advanced ones (e.g., Xilinx’ Virtex-7 XT and HT, Altera’s Stratix 10) also support Gen 3. The combination of these blocks with transceivers and, in some cases, memory blocks allows the functions of the PCIe physical, data link, and transaction layers to be implemented (Figure 2.23), providing full endpoint and root-port functionality in ×1/×2/×4/×8/×16 lane configurations. The application layer is implemented in distributed logic. Communication with the transaction layer is achieved using interfaces usually based on AMBA buses.* A separate transceiver is needed for each lane, so the number of supported lanes depends on the availability of transceivers with PCIe capabilities.

FIGURE 2.23 Block diagram of a typical PCIe implementation.

TABLE 2.1 PCIe Base Specifications

PCI Spec Revision   Link Speed (GT/s)   Max Bandwidth(a) (Gb/s)   Encoding Scheme   Overhead (%)
Gen 1               2.5                 2.0                       8B/10B            20
Gen 2               5.0                 4.0                       8B/10B            20
Gen 3               8.0                 7.88                      128B/130B         1.5
Gen 4(b)            16.0                15.75                     128B/130B         1.5

(a) Theoretical value. The actual one is lower because of packet overhead, among other factors.

(b) Publication pending.
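The bandwidth and overhead columns of Table 2.1 follow directly from the link speed and the encoding scheme, as the short Python sketch below shows. The dictionary and function names are illustrative assumptions. The values are per lane and, like those in the table, ignore packet overhead.

```python
# (payload bits, transmitted bits) for each line-encoding scheme
ENCODINGS = {"8B/10B": (8, 10), "128B/130B": (128, 130)}

def lane_bandwidth_gbps(link_speed_gtps, encoding):
    """Theoretical per-lane bandwidth: link speed scaled by coding efficiency."""
    payload, total = ENCODINGS[encoding]
    return link_speed_gtps * payload / total

def overhead_percent(encoding):
    """Share of transmitted bits that carry no payload."""
    payload, total = ENCODINGS[encoding]
    return 100 * (total - payload) / total

generations = [("Gen 1", 2.5, "8B/10B"), ("Gen 2", 5.0, "8B/10B"),
               ("Gen 3", 8.0, "128B/130B"), ("Gen 4", 16.0, "128B/130B")]
for name, speed, encoding in generations:
    bandwidth = lane_bandwidth_gbps(speed, encoding)
    print(f"{name}: {bandwidth:.2f} Gb/s, overhead {overhead_percent(encoding):.1f}%")
```

For a multilane link, the per-lane figure is simply multiplied by the number of lanes (×2, ×4, ×8, or ×16).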

In addition to basic specifications, some PCIe dedicated hardware blocks also support advanced functionalities, such as multiple-function, single-root I/O virtualization (SR-IOV), advanced error reporting (AER), and end-to-end CRC (ECRC). 

The multiple-function feature allows several PCIe configuration header spaces to share the same PCIe link. From a software perspective, the situation is equivalent to having several PCIe devices, which simplifies driver development (the driver can be the same for all functions) and portability.

The SR-IOV interface is an extension to the PCIe specification. When a single CPU (single root) runs different OSs (multiple guests) accessing an I/O device, SR-IOV can be used to assign a virtual configuration space to each OS, providing it with a direct link to the I/O device. In this way, data transfer rates can be very close to those achieved in a nonvirtualized solution. 

AER and ECRC are optional functions oriented to systems with high reliability requirements. They improve the detection, flagging, and correction of errors associated with PCIe links.

* AMBA is a dominating de facto on-chip interconnect specification standard in industry for IP-based design (ARM-proprietary), which was first introduced in 1999 to ease the efficient interconnection of multiple processors and peripherals with different performances (low and high bandwidth). It is currently one of the most popular on-chip busing solutions for SoCs, and as such is analyzed in detail in Chapter 3.

One of the major challenges for the implementation of PCIe is that, according to the Base Specification, links must be operational in less than 100 ms after power on. Current FPGAs apply different configuration techniques to address this issue. One of them is partial reconfiguration (discussed in detail in Chapter 8): The FPGA is initially configured with a bitstream just containing the PCIe circuitry, and once it is operational, the rest of the FPGA functions required are configured on the fly using this link.
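The reason a smaller initial bitstream helps can be seen with a rough timing estimate. The Python sketch below compares the time needed to load a full bitstream with that of a PCIe-only bitstream against the 100 ms budget. All bitstream sizes, the bus width, and the configuration clock are hypothetical values chosen for illustration, not data from any real device, and the model is idealized: it assumes one bus word per configuration clock cycle and ignores power-up, link training, and any other startup time.

```python
# Budget quoted in the text: links must be operational in less than 100 ms.
PCIE_LINK_BUDGET_MS = 100.0


def configuration_time_ms(bitstream_bits: int, bus_width_bits: int,
                          clock_hz: float) -> float:
    """Idealized load time: one bus word per configuration clock cycle."""
    if bitstream_bits < 0 or bus_width_bits <= 0 or clock_hz <= 0:
        raise ValueError("sizes and rates must be positive")
    cycles = -(-bitstream_bits // bus_width_bits)  # ceiling division
    return cycles * 1000.0 / clock_hz


def meets_pcie_budget(bitstream_bits: int, bus_width_bits: int,
                      clock_hz: float) -> bool:
    """True if the idealized load time alone fits within the budget."""
    load_time = configuration_time_ms(bitstream_bits, bus_width_bits, clock_hz)
    return load_time < PCIE_LINK_BUDGET_MS


# Hypothetical figures, for illustration only.
FULL_BITSTREAM_BITS = 100_000_000      # whole device
PCIE_ONLY_BITSTREAM_BITS = 20_000_000  # initial bitstream with PCIe circuitry
BUS_WIDTH_BITS = 8                     # parallel configuration interface
CONFIG_CLOCK_HZ = 50e6                 # configuration clock

if __name__ == "__main__":
    for label, size in (("full", FULL_BITSTREAM_BITS),
                        ("PCIe only", PCIE_ONLY_BITSTREAM_BITS)):
        t = configuration_time_ms(size, BUS_WIDTH_BITS, CONFIG_CLOCK_HZ)
        ok = meets_pcie_budget(size, BUS_WIDTH_BITS, CONFIG_CLOCK_HZ)
        print(f"{label}: {t:.0f} ms, within budget: {ok}")
```

With these assumed numbers, the full bitstream would need about 250 ms, whereas the PCIe-only bitstream would load in about 50 ms, leaving the rest of the device to be configured later over the link.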

Serial Communication Interfaces

Although serial communication interfaces (such as I²C, SPI, and USB) are required in many FPGA applications, few devices include specialized hardware blocks with this kind of functionality. Instead, it is usually implemented either using resources of the FPGA fabric or as part of an embedded hard or soft processor. At the time this book is being finalized, and to the best of the authors’ knowledge, only Lattice’s and QuickLogic’s devices include such hardware blocks. Lattice’s MachXO2, MachXO3, iCE40LM, and iCE40 Ultra families as well as QuickLogic’s ArcticLink II VX2 family include SPI and I²C interfaces. USB and SD/SDIO/MMC/CE-ATA* interfaces are available in ArcticLink devices. Implementing such serial interfaces in hardware allows area, performance, and power consumption to be optimized.

The embedded function block (EFB) interface of the MachXO3 family (Lattice 2016) is shown in Figure 2.24a. It consists of a set of specialized hardware blocks, including one SPI and two I²C interfaces. These three blocks are connected to the FPGA fabric through a Wishbone interface (analyzed in Section 3.5.4). The two I²C interfaces can be configured as master (thus controlling other devices in the bus) or slave (thus acting as a resource available for a bus master). Among other features, they support 7- and 10-bit addressing, multimaster arbitration, interrupt request, and up to 400 kHz data transfer speed. The SPI block can also be configured as master or slave. It supports full-duplex data transfer, double-buffered data register, interrupt request, serial clock with programmable polarity and phase, and LSB- or MSB-first data transfer.
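The SPI features listed above (full-duplex transfer and LSB- or MSB-first ordering) can be illustrated with a small behavioral model. The Python sketch below is not a description of the MachXO3 hardware or its registers. It models master and slave as two 8-bit shift registers that exchange one bit per clock, and all function names are illustrative assumptions. The `spi_mode` helper uses the widely used mode numbering for clock polarity (CPOL) and phase (CPHA), which is a common convention rather than something stated in the text.

```python
def spi_mode(cpol: int, cpha: int) -> int:
    """Common SPI mode number (0 to 3) from clock polarity and phase."""
    if cpol not in (0, 1) or cpha not in (0, 1):
        raise ValueError("cpol and cpha must be 0 or 1")
    return (cpol << 1) | cpha


def spi_exchange(master_byte: int, slave_byte: int, msb_first: bool = True):
    """Simulate one 8-bit full-duplex transfer.

    Returns (byte received by master, byte received by slave,
    list of bits seen on MOSI in transmission order).
    """
    for value in (master_byte, slave_byte):
        if not 0 <= value <= 0xFF:
            raise ValueError("bytes must be in the range 0..255")
    m, s = master_byte, slave_byte
    mosi_bits = []
    for _ in range(8):
        if msb_first:
            mosi, miso = (m >> 7) & 1, (s >> 7) & 1
            # Each side shifts left and takes the other side's bit at bit 0.
            m = ((m << 1) & 0xFF) | miso
            s = ((s << 1) & 0xFF) | mosi
        else:
            mosi, miso = m & 1, s & 1
            # Each side shifts right and takes the other side's bit at bit 7.
            m = (m >> 1) | (miso << 7)
            s = (s >> 1) | (mosi << 7)
        mosi_bits.append(mosi)
    return m, s, mosi_bits


if __name__ == "__main__":
    print(spi_mode(cpol=1, cpha=1))
    print(spi_exchange(0xA4, 0x3C, msb_first=True))
    print(spi_exchange(0xA4, 0x3C, msb_first=False))
```

In this model, after eight clock cycles the two registers have swapped contents regardless of bit order; only the order of the bits on the wire changes.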

The iCE40 Ultra family (Lattice 2015), whose block diagram is shown in Figure 2.24b, includes up to two I²C and two SPI interfaces, similar to those in the MachXO3 family. The distinct characteristic of iCE40 Ultra devices is that they can be categorized as “specific-purpose FPGAs,” that is, configurable devices equipped with specific resources targeting specific applications rather than wide applicability (what most FPGAs are intended for). In this case, they are sensor managers targeting mobile platforms, such as smartphones, tablets, and handheld devices. With this purpose, in addition to the serial communication interfaces allowing them to connect to mobile sensors and application processors, they include other specialized hardware blocks, such as on-chip oscillators or DSP functional blocks.

* Secure Digital (SD), Secure Digital Input Output (SDIO), MultiMediaCard (MMC), and Consumer Electronic-ATA (CE-ATA) are memory card protocol definitions and standards used for solid-state storage. 

FIGURE 2.24 (a) MachXO3 EFB interface and (b) block diagram of iCE40 Ultra devices.

Similarly, QuickLogic’s ArcticLink and ArcticLink II VX2 families are also oriented to mobile devices, so they include not only serial communication interfaces but also other very specific and complex blocks (available only in these devices and analyzed in Section 3.4.1). It is important to note that these FPGAs are nonvolatile devices based on QuickLogic’s proprietary ViaLink antifuse technology, and therefore one-time programmable (OTP), in contrast with the vast majority of FPGAs currently in the market, which are reconfigurable.

The block diagram of an ArcticLink II VX2 device (QuickLogic 2013) is shown in Figure 2.25a. It includes two serial interfaces: one SPI and one I²C. The I²C interface is mainly used as configuration bus for other embedded hardware blocks, although it can also be used as general-purpose interface. The SPI interface can only act as master, and it is intended for controlling external elements such as sensors or displays. It supports up to three slaves and can operate in the frequency range from 1.5 to 27.2 MHz. These interfaces are not physically located in IOBs, but instead, they are connected by the user by means of resources of the FPGA fabric (see Figure 2.25a). This allows the number of external peripherals that can be connected to the interfaces to be extended by implementing a suitable multiplexing logic in the FPGA fabric.

FIGURE 2.25 Block diagram of (a) ArcticLink II VX2 and (b) ArcticLink devices.

Other resources included in ArcticLink devices (because they are widely used in handheld devices) are Hi-Speed USB 2.0 On-the-Go (OTG) and SD/SDIO/MMC/CE-ATA host controllers (Figure 2.25b) (QuickLogic 2010).

The Hi-Speed USB 2.0 OTG controller is a dual-role device supporting host and device functions. Its main features are as follows: 

• Supports high- (480 Mbps), full- (12 Mbps), and low-speed (1.5 Mbps) transfers 

• Integrated physical layer with dedicated internal PLL 

• Supports both point-to-point and multipoint (root hub) applications 

• Double-buffering scheme for improved throughput and data transfer capabilities

• Supports OTG Host Negotiation Protocol and Session Request Protocol 

• Configurable power management features 

• Integrated 5.2 kB FIFO 

• Sixteen endpoints: one fixed bidirectional control endpoint, one software programmable IN or OUT endpoint, seven IN endpoints, and seven OUT endpoints

The SD/SDIO/MMC/CE-ATA controller is compliant with the SD Host Controller Standard Specification, Version 2.0. It supports clock rates up to 52 MHz; 1, 4, or 8 bit data modes; block size up to 512 bytes; and dynamic buffer management to increase data throughput. 
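As a rough feel for what these figures imply, the Python sketch below derives the peak raw data-bus rate and the number of clock cycles needed to move one block for each data mode. It assumes one bit per data line per clock cycle and ignores command, response, CRC, and inter-block overhead, so real throughput is lower. The function names are illustrative assumptions and do not correspond to the controller's programming interface.

```python
# Limits quoted in the text for the SD/SDIO/MMC/CE-ATA controller.
MAX_CLOCK_MHZ = 52.0
DATA_MODES_BITS = (1, 4, 8)
MAX_BLOCK_BYTES = 512


def peak_bus_rate_mbps(clock_mhz: float, data_width_bits: int) -> float:
    """Peak raw rate on the data lines, assuming one bit per line per clock."""
    if data_width_bits not in DATA_MODES_BITS:
        raise ValueError("data width must be 1, 4, or 8 bits")
    if not 0 < clock_mhz <= MAX_CLOCK_MHZ:
        raise ValueError("clock rate out of range")
    return clock_mhz * data_width_bits


def block_transfer_cycles(block_bytes: int, data_width_bits: int) -> int:
    """Clock cycles needed to shift out the payload of one block."""
    if data_width_bits not in DATA_MODES_BITS:
        raise ValueError("data width must be 1, 4, or 8 bits")
    if not 0 < block_bytes <= MAX_BLOCK_BYTES:
        raise ValueError("block size out of range")
    return -(-block_bytes * 8 // data_width_bits)  # ceiling division


if __name__ == "__main__":
    for width in DATA_MODES_BITS:
        rate = peak_bus_rate_mbps(MAX_CLOCK_MHZ, width)
        cycles = block_transfer_cycles(MAX_BLOCK_BYTES, width)
        print(f"{width}-bit mode: {rate:.0f} Mb/s peak, "
              f"{cycles} cycles per 512-byte block")
```

Under these assumptions, the 8-bit mode at 52 MHz peaks at 416 Mb/s on the data lines, and a 512-byte block takes 512 clock cycles.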

References

Achronix. 2008. Introduction to Achronix FPGAs. White paper WP001-1.6.

Achronix. 2015. Speedster22i HD1000 FPGA data sheet DS005-1.0.

Actel (currently Microsemi). 2010. ProASIC3 FPGA Fabric User’s Guide.

Altera. 2012. Cyclone III Device Handbook.

Altera. 2014. Stratix V Device Handbook. Vol. 2: Transceivers.

Altera. 2015a. MAX 10 FPGA device architecture.

Altera. 2015b. Arria 10 Core Fabric and General Purpose I/Os Handbook.

Altera. 2015c. Stratix V Device Handbook. Vol. 1: Device Interfaces and Integration.

Altera. 2015d. Arria 10 Transceiver PHY User Guide UG-01143.

Altera. 2016. External Memory Interface Handbook Volume 1: Altera Memory Solution Overview, Design Flow, and General Information.

Barrett, C. 1999. Fractional/integer-N PLL basics. Texas Instruments technical brief SWRA029. Texas Instruments, Dallas, TX.

Cortina Systems and Cisco Systems. 2008. Interlaken protocol definition. Revision 1.2.

Curd, D. 2012. PCI express for the 7 series FPGAs. Xilinx white paper WP384 (v1.1).

Gentile, K. 2008. Introduction to zero-delay clock timing techniques. Analog Devices application note AN-0983. Analog Devices, Norwood, MA.

Hutton, M. 2015. Understanding how the new HyperFlex architecture enables next generation high-performance systems. Altera white paper WP-01231-1.0.

Jiao, B. 2015. Leveraging UltraScale FPGA transceivers for high-speed serial I/O connectivity. Xilinx white paper WP458 (v1.1).

Kuon, I., Tessier, R., and Rose, J. 2007. FPGA architecture: Survey and challenges. Foundations and Trends in Electronic Design Automation, 2:135–253.

Lattice. 2015. iCE40 Ultra family datasheet DS1048 (v1.8).

Lattice. 2016. MachXO3 family datasheet DS1047 (v1.6).

Microsemi. 2014. Fusion family of mixed signal FPGAs datasheet. Revision 6.

Microsemi. 2015a. IGLOO2 FPGA and SmartFusion2 SoC FPGA: Datasheet DS0451.

Microsemi. 2015b. SmartFusion2 SoC and IGLOO2 FPGA fabric: User guide UG0445.

Microsemi. 2015c. ProASIC3E flash family FPGAs: Datasheet DS0098.

Microsemi. 2015d. SmartFusion2 and IGLOO2 clocking resources: User guide UG0449.

PCI-SIG. 2015. PCI Express® base specification revision 3.1a. Available at: https://pcisig.com/specifications/pciexpress. Accessed November 20, 2016.

QuickLogic. 2010. ArcticLink solution platform datasheet (rev. M).

QuickLogic. 2013. ArcticLink II VX2 solution platform datasheet (rev. 1.0).

Rodriguez-Andina, J.J., Moure, M.J., and Valdes, M.D. 2007. Features, design tools, and application domains of FPGAs. IEEE Transactions on Industrial Electronics, 54:1810–1823.

Rodriguez-Andina, J.J., Valdes, M.D., and Moure, M.J. 2015. Advanced features and industrial applications of FPGAs—A review. IEEE Transactions on Industrial Informatics, 11:853–864.

Saban, K. 2012. Xilinx Stacked Silicon Interconnect Technology delivers breakthrough FPGA capacity, bandwidth, and power efficiency. Xilinx white paper WP380 (v1.2).

Texas Instruments. 2008. Fractional N frequency synthesis. Application note AN-1879.

Xilinx. 2004. Celebrating 20 years of innovation. Xcell Journal, 48:14–16.

Xilinx. 2006. Virtex-5 platform FPGA family technical backgrounder.

Xilinx. 2010. Spartan-6 FPGA Configurable Logic Block: User Guide UG384 (v1.1).

Xilinx. 2014a. Spartan-6 FPGA SelectIO Resources: User Guide UG381 (v1.6).

Xilinx. 2014b. 7 Series FPGAs Configurable Logic Block: User Guide UG474 (v1.7).

Xilinx. 2014c. 7 Series FPGAs Memory Resources: User Guide UG473 (v1.11).

Xilinx. 2014d. 7 Series FPGAs GTP Transceivers: User Guide UG482 (v1.8).

Xilinx. 2015a. 7 Series FPGAs SelectIO Resources: User Guide UG471 (v1.5).

Xilinx. 2015b. 7 Series FPGAs Clocking Resources: User Guide UG472 (v1.11.2).

Xilinx. 2015c. 7 Series FPGAs GTX/GTH Transceivers: User Guide UG476 (v1.11).

