This section deals with structured ways of connecting internal modules within an FPGA. Three types of connections are considered: point-to-point (P2P), bus based, and network-on-chip (NoC). In this order, they are sorted from less to more complex, and also from the least to the most scalable. Lee et al. (2007) provide a quantitative analysis of these three alternatives, covering when and how to use them, with a theoretical basis as well as a use case example of a multimedia system.
For a relatively small number of blocks and connections among them, point-to-point solutions are possibly the best choice. There is no connection sharing, so every pair of modules is always ready to communicate. This not only gives high communication throughput but also ensures predictable behavior from the system. Even though this is the simplest solution for internal communications, there are some problems associated with it. One of them is the lack of standardization: there are fewer communication standards than for the other solutions presented later. On the other hand, P2P links can run the simplest transaction protocols, as simple as “one piece of data after another,” which is the basis for streaming, and they can run either synchronous or asynchronous transactions very efficiently. In many cases, especially for asynchronous transactions, FIFO memories must be placed between the two communicating sides. Such a FIFO can be built with a dual-clock, dual-port scheme to provide true asynchronous access between two regions with different clocks. Sufficient FIFO dimensioning lets the modules work independently of each other.
An important computing scheme that can be implemented with such P2P communications is the dataflow model of computation. In this model, all computing elements in the design, also called actors, follow an autonomous trigger rule such that each actor starts computing whenever all incoming data from one or more connections between actors are available. Results are produced when the computation finishes and are sent through output P2P connections. This scheme is self-organized since there is no need for central control. Connections do not require addresses, and in some cases, data “tagging” is possible when one actor receives different types of data elements from the same connection.
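The firing rule described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the names `Actor`, `connect`, `can_fire`, `fire`, and `run` are hypothetical, each connection is modeled as an unbounded queue, and the `run` loop merely stands in for hardware actors that would operate concurrently.

```python
from collections import deque


class Actor:
    """Dataflow actor: fires only when every input connection holds a token."""

    def __init__(self, name, func, n_inputs):
        self.name = name
        self.func = func
        # One queue per incoming P2P connection (unbounded here, an assumption).
        self.inputs = [deque() for _ in range(n_inputs)]
        self.outputs = []  # list of (destination actor, destination port)

    def connect(self, dest, port):
        """Attach an output P2P connection to input `port` of `dest`."""
        self.outputs.append((dest, port))

    def can_fire(self):
        # Autonomous trigger rule: all incoming data must be available.
        return all(len(q) > 0 for q in self.inputs)

    def fire(self):
        args = [q.popleft() for q in self.inputs]
        result = self.func(*args)
        for dest, port in self.outputs:
            dest.inputs[port].append(result)
        return result


def run(actors):
    """Fire any ready actor until none is ready; returns the firing log."""
    log = []
    fired = True
    while fired:
        fired = False
        for actor in actors:
            if actor.can_fire():
                log.append((actor.name, actor.fire()))
                fired = True
    return log


# Two actors in a chain: (1 + 2) is computed first, then scaled by 10.
add = Actor("add", lambda x, y: x + y, 2)
scale = Actor("scale", lambda v: v * 10, 1)
add.connect(scale, 0)
add.inputs[0].append(1)
add.inputs[1].append(2)
print(run([scale, add]))
```

Note that the order in which actors are listed does not matter: `scale` is visited first but cannot fire until `add` has produced its result, which is the self-organizing behavior the model relies on.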
There are promising solutions based on dataflow models of computation since the absence of centralized control, together with their predictability, offers good characteristics for some applications. However, the main disadvantage of P2P-based solutions is the increase in the communication area as the number of modules grows, so they do not scale well. For designs that require higher and more varied connectivity, bus solutions, discussed next, or NoC solutions, described in Section 7.3.3, are better choices.
Due to the increasing use of P2P connections with streamed data in FPGA-based SoCs, FPGA manufacturers are making standardization efforts. Avalon Stream, in the case of Altera, and AXI Stream, in the case of Xilinx, are available as connectivity solutions between cores in SoPC designs. See Sections 3.5.1 and 3.5.2 for details on streaming protocols from AXI and Avalon, respectively.
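A rough count of links makes the scaling argument concrete. The sketch below assumes the worst case for P2P, where every pair of modules needs its own dedicated link, and compares it with a rectangular mesh NoC in which only adjacent routers are linked. The function names are hypothetical, and link count is only a proxy for communication area.

```python
def p2p_links(n_modules):
    """Dedicated links needed when every pair of modules is connected."""
    return n_modules * (n_modules - 1) // 2


def mesh_links(rows, cols):
    """Router-to-router links in a rows x cols mesh (neighbors only)."""
    return rows * (cols - 1) + cols * (rows - 1)


for n, (r, c) in [(4, (2, 2)), (16, (4, 4)), (64, (8, 8))]:
    print(n, "modules:", p2p_links(n), "P2P links vs", mesh_links(r, c), "mesh links")
```

Under these assumptions the P2P count grows quadratically with the number of modules, while the mesh count grows roughly linearly. Real designs are rarely fully connected, so the figures are an upper bound rather than a prediction.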
In Section 3.5, specific interconnect buses for embedded soft or hard processors were described in order to complement the characteristics of the processors or multiprocessors themselves, embedded or embeddable, in many FPGA devices. That section described in some detail different alternatives such as AMBA, AXI, Avalon, CoreConnect, and Wishbone. This section offers a different and more general view, presenting buses together with other design alternatives so that designers can refine their decision criteria for choosing the right solution for their specific needs, and understand in greater detail how these solutions operate and when they are required.
A bus is, in general, a bundle of wires that interconnects, in a standardized way, at least two (but normally more) elements using a shared communication infrastructure. If there were just two elements connected, a point-to-point connection would normally be preferred.
Resource sharing has the obvious advantage of reducing resource usage with respect to a dense, complex point-to-point interconnect. Its disadvantages are that it reduces overall throughput and that bus width and performance cannot be customized to interconnect elements with different speed requirements (we will see later that this can be alleviated by bus bridging).
A bus interconnection involves two things: hardware aspects and a logic protocol. Hardware aspects determine the physical interconnect layer, electrical aspects, and topology of the interconnection. The logic protocol deals with timing issues, the types of transactions supported, and the arbitration policy, including priority schemes for multimaster buses. A bus may contain one or more master elements and one or more slave elements. Master elements start transactions, issue the access type (read or write), make the transaction request, and provide the address. Slave elements respond to incoming requests from a master and receive data (in the case of a write transaction) or provide data (in the case of a read) at the address specified by the active master. If more than one master produces a request, an arbitration scheme must be provided in order to decide which master gets the right to make its transaction.
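The master and slave roles can be sketched with a minimal behavioral model. The class names `Slave` and `Bus`, the address map, and the zero-wait-state behavior are assumptions made for illustration; they do not correspond to any particular bus standard. The master supplies the access type and the address, and the slave that owns the address receives or provides the data.

```python
class Slave:
    """Bus slave that owns a contiguous address range."""

    def __init__(self, base, size):
        self.base = base
        self.size = size
        self.mem = [0] * size

    def owns(self, addr):
        return self.base <= addr < self.base + self.size

    def write(self, addr, data):
        self.mem[addr - self.base] = data

    def read(self, addr):
        return self.mem[addr - self.base]


class Bus:
    """Shared bus: decodes the address and forwards the access to one slave."""

    def __init__(self, slaves):
        self.slaves = slaves

    def transact(self, kind, addr, data=None):
        """Called by the active master; `kind` is 'read' or 'write'."""
        for slave in self.slaves:
            if slave.owns(addr):
                if kind == "write":
                    slave.write(addr, data)
                    return None
                return slave.read(addr)
        raise ValueError(f"no slave responds at address {addr:#x}")


# Two slaves mapped at 0x00-0x0F and 0x10-0x1F (hypothetical address map).
bus = Bus([Slave(0x00, 16), Slave(0x10, 16)])
bus.transact("write", 0x12, 99)
print(bus.transact("read", 0x12))
```

Arbitration is deliberately left out here: the model assumes that a single master has already been granted the bus.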
At the physical level, the hardware elements that access the shared medium are defined. Two basic schemes are possible, either using tristate buffers (see Figure 7.2a) or by means of multiplexed access schemes (see Figure 7.2b, which shows a partially connected multiple access scheme).
FIGURE 7.2 Physical interfaces for a bus connection: (a) tristate buffers; (b) multiplexed.
Tristate schemes are not advised because of the danger of producing contention on the bus (two elements driving the bus at the same time), which might either damage the circuit or, for transient contentions, produce high current peaks. Multiplexed accesses, on the other hand, do not have such problems, but they scale worse, requiring modifications in the structure when elements are added or removed. Modern FPGAs, in many cases, do not include internal tristate logic (only in the chip I/Os), so multiplexed access is mandatory.
The physical specification of the bus also includes the bus topology. The simplest one is the single shared bus. Such buses are inspired by rack-based systems, and only one transaction is possible at a time. They may be multimaster and allow complex transactions (described later), but their main drawback is the loss of performance as the connection gets longer and more elements are attached. As a rule of thumb, a bus with more than 10 different elements will probably show some saturation and force the designer to use other solutions. Figure 7.3a shows an example of a single shared bus.
Other possible solutions at the physical level include crossbar switch and ring-based topologies, shown in Figure 7.3b and c, respectively. They overcome the loss of performance as elements are added, but the first one significantly increases resource utilization by providing multiple paths between masters and slaves, and the second one, while using fewer resources, adds high latencies.
When the number of interconnected elements is high, or they can be sorted into different speed requirements and degrees of utilization, bridged and hierarchical structures offer better results. Figure 7.4 shows a bridge-based topology formed by two buses.
Bridges, such as the one in Figure 7.4, have the advantage of providing parallel access in every bus segment, increasing overall bandwidth. Bridging may also be used to group components that operate at different speeds into several buses, such that the design of every bus is tailored to the speed required. For instance, complex access schemes on wide buses may be used for high-speed components, while lightweight buses can be used for low-speed peripherals, which only access registers and do not provide DMA features for complex transactions. Figure 7.5 shows a three-level hierarchical bus scheme, linked by two bridges, with the highest access rates at the upper level and the lowest rates at the bottom. According to the figure, bridges work as slaves on the upper side and as masters on the bottom side.
Apart from the physical level, buses also standardize the logic protocol to perform the required transactions. Timing and arbitration are the main elements to be defined. Regardless of what timing is to be used in the bus (the relationships between signals that ensure correct operation), all buses fall into two different categories: synchronous or asynchronous. In synchronous buses, all timings are referred to a master clock signal, which is required to reach all elements in the bus. This is, at the same time, their main drawback since synchronization mismatches and skew problems may appear in long high-speed buses. They are, however, much simpler and provide faster access times than asynchronous buses. On the other hand, asynchronous buses do not have a clock, and control is effected by events on specific control signals. This ensures greater compatibility with a wider set of peripherals and modules in general, providing better timing adaptation. As a drawback, the control is more complex since it needs to set handshaking protocols between masters and slaves.
FIGURE 7.3 Bus topologies: (a) single-shared bus; (b) crossbar switch; (c) ring-based bus.
FIGURE 7.4 Bridged bus topology.
FIGURE 7.5 Hierarchical three-level bus example.
Transaction protocols also define the sequence of functional operations that need to be followed in the transaction, no matter whether it is a simple or a complex (burst) one. This is of particular importance when dealing with multimaster buses. In essence, if one or more master modules want to start a transaction, they issue an “arbitration request” (AR in Figure 7.6). The arbiter resolves the contention in the ARB cycle, deciding which master will be given access next, according to the priority or arbitration policy. The master that is granted access makes a request (RQ in the figure), and the slave may be busy during some cycles until data are ready, which is notified by asserting an acknowledgment signal, which ends the transaction.
FIGURE 7.6 Single-bus transaction protocol.
These stages can be overlapped between different transactions in order to maximize bus utilization. In this case, pipelined bus structures, which are more complex but more efficient, are required. Figure 7.7 shows an example of multiple accesses in a pipelined bus structure. As can be observed, bus utilization is increased significantly, although the arbitration and grant process becomes more complex.
The number of cycles of a granted transaction may be fixed or variable. Arbitration may be centralized or distributed, and different arbitration policies may be set. These policies may be random, with static priorities, or based on periodic priority assignment, such as round robin. While round robin is suitable for distributed arbitration, which provides better scalability, it has the disadvantage of producing potentially large latencies, which makes it unsuitable for critical systems, where static priorities are preferred.
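The gain from overlapping stages can be estimated with simple arithmetic. The sketch below assumes an idealized bus in which every transaction goes through the same number of one-cycle stages (for example AR, ARB, RQ, and acknowledgment, with no wait states) and a new transaction can enter the pipeline every cycle. Both function names are hypothetical, and real buses with busy slaves or variable-length transfers will do worse than this ideal.

```python
def sequential_cycles(n_transactions, n_stages):
    """Cycles needed when each transaction must finish before the next starts."""
    return n_transactions * n_stages


def pipelined_cycles(n_transactions, n_stages):
    """Cycles needed when a new transaction enters the pipeline every cycle."""
    if n_transactions == 0:
        return 0
    return n_stages + (n_transactions - 1)


# Three overlapped accesses, four one-cycle stages each (an assumed breakdown).
print(sequential_cycles(3, 4), "cycles without pipelining")
print(pipelined_cycles(3, 4), "cycles with pipelining")
```

With these assumed numbers, three accesses take 12 cycles back to back and 6 cycles when overlapped, and the advantage grows with the number of consecutive transactions.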
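Two of the policies mentioned above, static priorities and round robin, can be contrasted with a short sketch. The names `fixed_priority` and `RoundRobin` are hypothetical, requests are modeled as a list of booleans indexed by master number, and the static scheme assumes that a lower index means a higher priority.

```python
def fixed_priority(requests):
    """Static priorities: grant the lowest-numbered requesting master."""
    for master, requesting in enumerate(requests):
        if requesting:
            return master
    return None


class RoundRobin:
    """Periodic priority assignment: the last granted master gets lowest priority."""

    def __init__(self, n_masters):
        self.n = n_masters
        self.last = n_masters - 1  # so that master 0 is checked first

    def grant(self, requests):
        for offset in range(1, self.n + 1):
            master = (self.last + offset) % self.n
            if requests[master]:
                self.last = master
                return master
        return None


# All three masters request continuously.
requests = [True, True, True]
rr = RoundRobin(3)
print([fixed_priority(requests) for _ in range(4)])
print([rr.grant(requests) for _ in range(4)])
```

Under sustained load the static scheme always serves master 0, which gives it a short, bounded latency at the cost of starving the others, while round robin shares the bus evenly but makes every master wait for up to all the other masters' turns.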
FIGURE 7.7 Single pipelined bus transaction protocol, with three overlapped accesses.
The huge overhead produced in P2P communication for a sufficiently large number of interconnected devices and the performance degradation of bus-based structures forced a new paradigm for large systems with multiple interconnection needs: NoC. The main advantage of NoCs with respect to their predecessors is that there is no speed degradation with size since all connections are local and short, while parallelism is still achieved because several nodes can communicate at the same time over multiple paths. Clearly, not all nodes will be able to communicate at the same time due to resource sharing, but under some circumstances communication policies may be used to ensure, in a predictable manner, sufficient bandwidth for all possible communications between nodes within the NoC.
An NoC consists of a series of links, connected by routers, that interconnect multiple cores. Links are normally pairs of bidirectional channels connecting a pair of routers. Cores access the NoC through the routers, so, whenever an NoC topology drawing is observed, it must be kept in mind that every router has an extra input/output pair of channels for its core. For instance, Figure 7.8 depicts the two most commonly used NoC topologies: a mesh and a torus, which, in essence, is a mesh connected with the same adjacency scheme as in a Karnaugh map. In the figure, routers in the middle of the mesh have five links (or ports): four for the connections visible in the figure, plus one for the core attached to the router. All routers in the torus topology have five link connections.
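The difference between the two topologies comes down to how edges are handled, which a few lines of code can show. This is a sketch with hypothetical function names; routers are identified by (x, y) coordinates, and the torus version assumes at least three routers per dimension so that the wrapped neighbors are distinct.

```python
def mesh_neighbors(x, y, width, height):
    """Adjacent routers in a mesh: edge and corner routers have fewer."""
    candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    return [(i, j) for i, j in candidates if 0 <= i < width and 0 <= j < height]


def torus_neighbors(x, y, width, height):
    """Adjacent routers in a torus: coordinates wrap around at the edges."""
    return [((x - 1) % width, y), ((x + 1) % width, y),
            (x, (y - 1) % height), (x, (y + 1) % height)]


def port_count(neighbors):
    """Router ports: one per neighboring router plus one for the local core."""
    return len(neighbors) + 1


# Corner and middle routers of a 4 x 4 network.
print(port_count(mesh_neighbors(0, 0, 4, 4)), port_count(mesh_neighbors(1, 1, 4, 4)))
print(port_count(torus_neighbors(0, 0, 4, 4)), port_count(torus_neighbors(1, 1, 4, 4)))
```

The wrap-around in `torus_neighbors` is the same adjacency rule as in a Karnaugh map, where the first and last columns (and rows) are also neighbors.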
Apart from these topologies, there are other topologies such as tree, fat tree, ring, octagon, and spider. However, the design and features of an NoC are not solely dependent on the topology. There are many other factors, related to the way packets are formed, how link handshaking is achieved, and how packets are routed (including how they are switched and buffered), as well as to flow control and arbitration policies, which are even more important than the topology itself. For each of these factors, many different techniques have been proposed and verified, to the point that a detailed description of all possible individual techniques, as well as a description of the collection of existing commercial and academic NoCs, is beyond the scope of this book. However, some of these techniques and the problems they try to solve are described later on.
FIGURE 7.8 (a) Mesh NoC topology and (b) torus topology (a mesh variation).
FIGURE 7.9 Access to/from logic to node router/switch.
Packets are injected into the NoC or withdrawn from it by means of the network interface (NI) of every core, as shown in Figure 7.9. The network interface of every core is in charge of producing, or retrieving, data according to the rules defined for the NoC. In the most general sense, data exchanged between two cores at a given time, as a result of a computation done in the transmitting node, are called messages. Messages are split into packets, which, in turn, are split into one or more flits, each composed of several phits. A flit is the minimum amount of data that may be exchanged between two network elements (two routers or an NI and its router). A phit is the number of bits that can be exchanged at a time, and it depends on the characteristics of the link. In contrast to conventional networks, links may be composed of a set of parallel wires transmitting more than one bit at a time.
These characteristics of the NoC are architecture dependent, except for the router internal organization, which mostly determines the main NoC communication mechanisms from a protocol perspective, not from an architectural perspective. These mechanisms are basically as follows:
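The message, packet, flit, and phit hierarchy can be sketched as nested splitting. The sizes used below are arbitrary assumptions, the function names are hypothetical, and real NoC packets normally carry header information (such as the destination) that is omitted here for brevity.

```python
def chunks(data, size):
    """Split a bytes object into consecutive pieces of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def packetize(message, packet_bytes, flit_bytes, phit_bytes):
    """Return the message as a list of packets, each a list of flits,
    each a list of phits (the unit a link transfers at a time)."""
    return [
        [chunks(flit, phit_bytes) for flit in chunks(packet, flit_bytes)]
        for packet in chunks(message, packet_bytes)
    ]


def reassemble(packets):
    """Inverse operation performed by the receiving network interface."""
    return b"".join(phit for packet in packets for flit in packet for phit in flit)


# A 16-byte message, 8-byte packets, 4-byte flits, 2-byte phits (assumed sizes).
message = bytes(range(16))
packets = packetize(message, packet_bytes=8, flit_bytes=4, phit_bytes=2)
print(len(packets), "packets,", len(packets[0]), "flits per packet,",
      len(packets[0][0]), "phits per flit")
```

In this model the phit size plays the role of the link width: a wider link (more parallel wires) means fewer phits per flit.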
• Flow control: It determines the node-to-node control rules and is in charge of allocating channels and buffers inside the routers to store packets.
• Routing and switching: Routing refers to the determination of the path between source and destination in the NoC, while switching decides how and when to connect an input and an output port within a router.
• Buffering and arbitration: It is related to the policies that decide which message is to be stored inside the router, either in input buffers or in output buffers, and wait for a future chance to go through an output port. Arbitration deals with the way the routers select which message has the right to go to an output port.
There are three basic types of flow control. The simplest is a handshake between two neighboring elements so that, whenever there is room in the receiver, it accepts the transmission of a packet. A more sophisticated technique, based on the earlier one, is credit-based flow control, which relies on counting the number of packets that are sent, up to a maximum number, and reduces this count as soon as packets leave the receiving router. The third method, and possibly one of the most used nowadays, is the use of virtual channels (VCs). With VCs, every physical channel is shared by several logical channels, and either equal-time multiplexing or priority-based multiplexing is used as an arbitration policy to resolve which packet leaves every router next.
Routing protocols determine the path of messages along the NoC. Livelocks (messages that keep circulating without ever reaching their destination) and deadlocks (a cyclic dependency that keeps messages permanently blocked) must be avoided, while still providing some adaptability and fault tolerance, that is, the ability to react to traffic conditions and permanent faults. In mesh-based or similar NoCs, routing policies are simple since traversing the network from one point to any other involves moving in any direction that brings the X–Y coordinates of the routers through which messages pass closer to the final X–Y destination. Dynamic routing techniques, such as “west-first, if possible,” are preferred over static ones, such as “first all X, then all Y,” because they offer the required adaptability.
As opposed to conventional networks, circuit switching is preferred over packet switching because it ensures predictability. The combination of arbitration policies and flow control techniques to implement VCs with priority buffering and preemption buffering is, though complex, a way to ensure predictability, and so, it is preferred in many cases over simpler techniques. The problem with this solution is router complexity, and so, it is more suited to NoCs for integrated circuits than to FPGA-based ones, where router implementation on the fabric may consume huge amounts of resources.
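Credit-based flow control, as described above, can be modeled with a counter on the sending side. The class name `CreditLink` and its methods are hypothetical, and the model collapses the sender's counter and the receiver's buffer into one object, which ignores the delay a real link would add when credits are returned.

```python
from collections import deque


class CreditLink:
    """Sender-side counter of packets held by the receiving router."""

    def __init__(self, max_outstanding):
        self.max_outstanding = max_outstanding  # receiver buffer slots
        self.outstanding = 0                    # packets sent and not yet drained
        self.buffer = deque()                   # receiver's input buffer

    def send(self, packet):
        """Send only if the count is below the maximum; return success."""
        if self.outstanding >= self.max_outstanding:
            return False  # sender must stall: the receiver has no room
        self.outstanding += 1
        self.buffer.append(packet)
        return True

    def drain(self):
        """A packet leaves the receiving router, so the count is reduced."""
        if not self.buffer:
            return None
        self.outstanding -= 1
        return self.buffer.popleft()


link = CreditLink(max_outstanding=2)
print(link.send("a"), link.send("b"), link.send("c"))  # third send is refused
print(link.drain())                                    # frees one slot
print(link.send("c"))                                  # now accepted
```

Unlike a plain handshake, the sender never needs to ask the receiver whether there is room: it can decide locally from its own count.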
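The static “first all X, then all Y” policy is the easiest one to write down and serves as a baseline. The function name `xy_route` is hypothetical, routers are identified by (x, y) coordinates in a mesh, and the sketch returns the full list of routers visited. It does not model the adaptive policies that the text prefers.

```python
def xy_route(src, dst):
    """Static dimension-ordered routing: move along X first, then along Y."""
    x, y = src
    path = [src]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


print(xy_route((0, 0), (2, 1)))
print(xy_route((3, 2), (1, 2)))
```

Every hop brings the packet one step closer to its destination, so the route always terminates. The price is that the path is fixed for a given source and destination, so it cannot steer around congestion or faulty links.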
Nevertheless, the complexity of NoC design does not lie only in selecting the most appropriate combination of techniques for a given problem, taking into account size, power, QoS, performance (throughput and latency), or predictability. In addition, NoCs must be customized in order to obtain an optimal implementation for every given application. In general, if an application involves several tasks, NoCs can be adapted in shape and in resource utilization (link width adaptation and/or buffer size determination) to best fit the aforementioned requirements. Figure 7.10 shows an example of an adapted network, with four tasks that involve two or three cores each, with different link sizing and connectivity.
FIGURE 7.10 Customized NoC with adapted structure and connectivity.
Unfortunately, tools from FPGA vendors do not offer NoC design in their integrated environments, and among third-party tools, few are currently tailored specifically to FPGA design. There are some academic approaches that exploit FPGA reconfigurability in order to custom design, and even adapt in real time, the structure and behavior of NoCs, but they are far from being a mature technology. The future might perhaps bring some NoCs designed and embedded into reconfigurable fabrics, smoothing the path for the use of NoC techniques in a wider range of applications. We encourage readers whose designs have reached the point of requiring advanced communication techniques, such as NoCs, to monitor the state of development of these techniques since this will probably be one of the fields discussed in this book that risks becoming outdated soon. Benini and De Micheli (2006) provide an excellent reference for all practical and advanced features of NoCs.